-
Notifications
You must be signed in to change notification settings - Fork 741
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a bunch of problems in TfJob CRD that crept in while tests were broken #308
Conversation
…orker. * This was added in kubeflow#221 and accidentally removed in the refactor in kubeflow#234.
periodically; otherwise we don't periodically check the status of the job and update it when its done. * The controller should only forget about a job after its been cleaned up.
…he TrainingJob in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jlewi Hi, thanks for this PR, some quick comments (review not finish yet).
} | ||
// Check that each replica has a TensorFlow container. | ||
chiefExists := false | ||
|
||
// Check that each replica has a TensorFlow container. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dup comments?
pkg/controller/controller.go
Outdated
@@ -156,6 +156,9 @@ func (c *Controller) processNextWorkItem() bool { | |||
if err == nil { | |||
if forget { | |||
c.WorkQueue.Forget(key) | |||
} else { | |||
// Requeue the key so that we will reconcile it again even if no events occur. | |||
c.WorkQueue.AddAfter(key.(string), time.Second * 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This maybe full fill the work queue after a while, how about use event-driven
:
- Listen on
Pods
which created by TFJob controller.- This needs a little change to use
Pod
instead ofJob
. - And set
OwnerReferences
of eachPods
toTFJob controller
.
- This needs a little change to use
- On Pod created/updated/deleted, get
TFJob
by parseOwnerReferences
, set theTFJob.Status.TFClusterStatus
map as mentioned here. - Listen on
TFJobs
changed (we update the status in previous). - Set the whole status of this
TFJob
due to theTFClusterStatus
map (like what we do in Reconcile).- And update the
TFJob.Status.Condition
.
- And update the
- Terminated/Delete the
TFJob
is every Pod is completed.
@gaocegege @mqliang FYI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened up #314. Overall, I think moving in the direction of being more event driven is a good idea. Although, I think we should probably still call reconcile periodically as a catch all. Its not clear to me why the work queue would fill up.
The original design used one go func for each TrainingJob. In the current design, there would be one item in the queue for each TrainingJob and we can increase the number of workers handling queue items.
Either way, making the controller more event driven is a pretty big change. I think just reenqueing the items is a simpler change and should get head fixed sooner and thus unblock other work.
pkg/controller/controller.go
Outdated
@@ -166,9 +169,11 @@ func (c *Controller) processNextWorkItem() bool { | |||
return true | |||
} | |||
|
|||
// syncJob will sync the job with the given key if it has had its expectations fulfilled, meaning | |||
// it did not expect to see any more of its pods created or deleted. This function is not meant to be invoked | |||
// syncJob will sync the job with the given. This function is not meant to be invoked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/syncJob/syncTFJob
…handle update events and informer should send resync events.
Since there are no more comments and this fixes head I'm going to merge this. |
that successful jobs failed
This change is