POC: Katib integration with tf-operator #267
/hold
/assign @YujiOshima
/retest
/retest
/test all
/retest
@@ -13,6 +13,14 @@ rules:
- update
Please split this change into another PR, since it is related to #256 and not only to TF_JOB.
Done. The other change in this file is still required, otherwise watching tfjobs will throw a permission error.
Thanks!
if instance.Spec.WorkerSpec != nil {
	wretain = instance.Spec.WorkerSpec.Retain
case DefaultJobWorker:
	if err := r.deleteWorkerResources(instance, &batchv1.Job{}, ns, w.WorkerID); err != nil {
Why do you remove the retain logic? It allows retaining workers after completion for debugging.
The logic is moved to deleteWorkerResources().
I missed it, thanks.
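For readers following this thread, here is a minimal sketch of what moving the retain check into deleteWorkerResources() could look like. It is not the PR's actual code: the struct definitions are reduced stand-ins for the StudyJob types, and the deleteFunc hook and shouldRetain helper are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// Reduced stand-ins for the StudyJob types referenced in the diff.
type WorkerSpec struct {
	Retain bool
}

type StudyJobSpec struct {
	WorkerSpec *WorkerSpec
}

type StudyJob struct {
	Spec StudyJobSpec
}

// deleteFunc is a hypothetical hook that performs the actual deletion;
// a real controller would call the Kubernetes client here.
type deleteFunc func(ns, workerID string) error

// shouldRetain (hypothetical helper) reports whether completed workers
// must be kept around for debugging.
func shouldRetain(instance *StudyJob) bool {
	return instance.Spec.WorkerSpec != nil && instance.Spec.WorkerSpec.Retain
}

// deleteWorkerResources skips deletion when Retain is set, so callers do
// not need a separate retain check for each worker kind.
func deleteWorkerResources(instance *StudyJob, ns, workerID string, del deleteFunc) error {
	if shouldRetain(instance) {
		return nil
	}
	return del(ns, workerID)
}

func main() {
	del := func(ns, workerID string) error {
		fmt.Printf("deleting worker %s in namespace %s\n", workerID, ns)
		return nil
	}
	retained := &StudyJob{Spec: StudyJobSpec{WorkerSpec: &WorkerSpec{Retain: true}}}
	plain := &StudyJob{}

	_ = deleteWorkerResources(retained, "kubeflow", "worker-a", del) // Retain set: nothing is deleted
	_ = deleteWorkerResources(plain, "kubeflow", "worker-b", del)    // no WorkerSpec: worker is deleted
}
```

Keeping the check inside the delete helper means the batch Job and TFJob branches share one code path for the retain decision.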
pkg/controller/studyjob/const.go
Outdated
)

const (
	WorkerState_Active = 0
Why do you redefine state? Why not use api.State_COMPLETED etc.?
This is a generic way to represent the state of a worker, which can be either a batch job or TF job. In lines 415-454 we compute the current worker status and pass the handling to updateWorker(). This is to avoid duplicating the long code section for each job type.
I think you don't need to define a new state here; you can use the Katib API State type.
Line 165 in fb27298:
type State int32
Then WorkerStatus will look like this:
type WorkerStatus struct {
	CompletionTime *metav1.Time
	WorkerState    katibapi.State
}
WDYT?
Sure, fixed.
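As an illustration of the approach agreed on in this thread, the sketch below maps two different job kinds onto one WorkerStatus so that a single handler can process both. It is an assumption-laden sketch, not the merged code: State is a local stand-in for the Katib API type, the three constants are hypothetical names for this example only, *time.Time replaces *metav1.Time to keep the block self-contained, and batchJobView, tfJobView, and describe are invented for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// State is a local stand-in for the Katib API State type (an int32 in the source).
type State int32

// Hypothetical constants for this sketch only; real code would use the
// Katib API values such as api.State_COMPLETED.
const (
	StateRunning State = iota
	StateCompleted
	StateError
)

// WorkerStatus follows the struct proposed in the review, with *time.Time
// standing in for *metav1.Time.
type WorkerStatus struct {
	CompletionTime *time.Time
	WorkerState    State
}

// batchJobView and tfJobView are hypothetical, reduced views of each job
// kind's status; the real status objects carry more fields.
type batchJobView struct {
	Succeeded, Failed int32
	CompletionTime    *time.Time
}

type tfJobView struct {
	Succeeded, Failed bool
	CompletionTime    *time.Time
}

// statusFromBatchJob translates a batch Job view into the shared status.
func statusFromBatchJob(j batchJobView) WorkerStatus {
	switch {
	case j.Succeeded > 0:
		return WorkerStatus{CompletionTime: j.CompletionTime, WorkerState: StateCompleted}
	case j.Failed > 0:
		return WorkerStatus{WorkerState: StateError}
	default:
		return WorkerStatus{WorkerState: StateRunning}
	}
}

// statusFromTFJob translates a TFJob view into the shared status.
func statusFromTFJob(t tfJobView) WorkerStatus {
	switch {
	case t.Succeeded:
		return WorkerStatus{CompletionTime: t.CompletionTime, WorkerState: StateCompleted}
	case t.Failed:
		return WorkerStatus{WorkerState: StateError}
	default:
		return WorkerStatus{WorkerState: StateRunning}
	}
}

// describe stands in for the shared handler (updateWorker() in the PR):
// it only looks at WorkerStatus, never at the job kind.
func describe(ws WorkerStatus) string {
	switch ws.WorkerState {
	case StateCompleted:
		return "completed"
	case StateError:
		return "error"
	default:
		return "running"
	}
}

func main() {
	now := time.Now()
	fmt.Println(describe(statusFromBatchJob(batchJobView{Succeeded: 1, CompletionTime: &now})))
	fmt.Println(describe(statusFromTFJob(tfJobView{Failed: true})))
}
```

Only the two small translation functions depend on the job kind; everything downstream works on WorkerStatus, which is what avoids duplicating the long handling section for each job type.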
LGTM
@richardsliu Thank you for replying to my comments.
/retest
@richardsliu Thank you! To try the TFJob example, we need to install the TFJob operator. /lgtm
@YujiOshima Yes, I will add the documentation change. I was testing this on kubeflow/kubeflow, which already installs all of the components.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: richardsliu
This is the first iteration for #39.
The provided example is intentionally simple. Its code is borrowed from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py. It uses only one TF worker and exposes learning_rate and batch_size as tunable hyperparameters.