
Fix a bunch of problems in TfJob CRD that crept in while tests were broken #308

Merged: 10 commits from jlewi:fix_gpu_timeout into tensorflow:master, Jan 16, 2018

Conversation

jlewi (Contributor) commented Jan 14, 2018

  • In syncTfJob, when checking whether a work queue item corresponds to a TrainingJob already in the map, we need to check the UID. Otherwise we will not properly handle the case where a training job is deleted and a new job is then recreated with the same name (see the sketch after this list).
  • We need to make sure that the Replicas field in TrainingJob is always properly set.
    • We were only initializing replicas in setup, which is problematic when the TfJob controller is restarted: on restart, setup won't be invoked because the job is already past that phase, so the replicas won't be reinitialized.
  • test_runner needs to ignore case when checking whether the job succeeded; otherwise we conclude that successful jobs failed.
  • The controller should only forget about a job after the job has been cleaned up, not when it is marked as succeeded or failed.
  • Add back code to support termination policies that use the worker, rather than the master, as the chief.
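
A rough illustration of the UID check in the first bullet (a minimal sketch, not the controller's actual types; Controller, TrainingJob, and getTrainingJob here are simplified stand-ins):

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/types"
)

// TrainingJob is a simplified stand-in for the controller's in-memory job object.
type TrainingJob struct {
    Name string
    UID  types.UID
}

// Controller caches TrainingJobs by namespace/name key, mirroring the work queue keys.
type Controller struct {
    jobs map[string]*TrainingJob
}

// getTrainingJob reuses a cached TrainingJob only when the UID matches. A job that
// is deleted and then recreated with the same name gets a new UID, so the stale
// cache entry is replaced instead of being reused.
func (c *Controller) getTrainingJob(key string, uid types.UID) *TrainingJob {
    if existing, ok := c.jobs[key]; ok && existing.UID == uid {
        return existing
    }
    job := &TrainingJob{Name: key, UID: uid}
    c.jobs[key] = job
    return job
}

func main() {
    c := &Controller{jobs: map[string]*TrainingJob{}}
    first := c.getTrainingJob("default/mnist", "uid-1")
    // Same name but a different UID: the cached entry must not be reused.
    second := c.getTrainingJob("default/mnist", "uid-2")
    fmt.Println(first == second) // prints false
}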

This change is Reviewable

…orker.

  * This was added in kubeflow#221 and accidentally removed in the refactor in kubeflow#234.
coveralls commented Jan 14, 2018

Coverage Status

Coverage increased (+0.2%) to 31.837% when pulling 29358ab on jlewi:fix_gpu_timeout into 98a34a1 on tensorflow:master.

jlewi added 2 commits January 14, 2018 16:50

* …periodically; otherwise we don't periodically check the status of the job and update it when it's done.
* The controller should only forget about a job after it's been cleaned up.
ScorpioCPH (Member) left a comment


@jlewi Hi, thanks for this PR; some quick comments (review not finished yet).

}
// Check that each replica has a TensorFlow container.
chiefExists := false

// Check that each replica has a TensorFlow container.

dup comments?

@@ -156,6 +156,9 @@ func (c *Controller) processNextWorkItem() bool {
if err == nil {
if forget {
c.WorkQueue.Forget(key)
} else {
// Requeue the key so that we will reconcile it again even if no events occur.
c.WorkQueue.AddAfter(key.(string), time.Second * 10)

This may fill up the work queue after a while; how about using an event-driven approach instead (a rough sketch follows this comment):

  • Listen for Pods created by the TFJob controller.
    • This needs a small change to use Pods instead of Jobs.
    • And set the OwnerReferences of each Pod to the TFJob controller.
  • On Pod created/updated/deleted, get the TFJob by parsing the OwnerReferences, and set the TFJob.Status.TFClusterStatus map as mentioned here.
  • Listen for TFJob changes (we updated the status in the previous step).
  • Set the overall status of the TFJob based on the TFClusterStatus map (like what we do in Reconcile).
    • And update the TFJob.Status.Condition.
  • Terminate/delete the TFJob once every Pod is completed.

@gaocegege @mqliang FYI.
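
A rough sketch of the OwnerReferences part of this suggestion (illustrative only, not the project's actual implementation; the kubeflow.org/v1alpha1 group/version and the helper names are assumptions):

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
)

// newWorkerPod builds a Pod whose OwnerReferences point back at the owning TFJob,
// so that a Pod event handler can recover the TFJob instead of going through Jobs.
func newWorkerPod(tfJobName string, tfJobUID types.UID) *corev1.Pod {
    isController := true
    return &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            Name: tfJobName + "-worker-0",
            OwnerReferences: []metav1.OwnerReference{{
                APIVersion: "kubeflow.org/v1alpha1", // assumed TFJob group/version
                Kind:       "TFJob",
                Name:       tfJobName,
                UID:        tfJobUID,
                Controller: &isController,
            }},
        },
    }
}

// ownerTFJobName returns the name of the controlling TFJob for a Pod, if any.
func ownerTFJobName(pod *corev1.Pod) (string, bool) {
    ref := metav1.GetControllerOf(pod)
    if ref == nil || ref.Kind != "TFJob" {
        return "", false
    }
    return ref.Name, true
}

func main() {
    pod := newWorkerPod("mnist", "uid-1")
    if name, ok := ownerTFJobName(pod); ok {
        fmt.Println("pod is owned by TFJob", name)
    }
}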

jlewi (Contributor, Author) commented:

I opened up #314. Overall, I think moving in the direction of being more event driven is a good idea, although I think we should probably still call Reconcile periodically as a catch-all. It's not clear to me why the work queue would fill up.

The original design used one go func for each TrainingJob. In the current design, there is one item in the queue for each TrainingJob, and we can increase the number of workers handling queue items (see the sketch below).

Either way, making the controller more event driven is a pretty big change. I think just re-enqueueing the items is a simpler change that should get head fixed sooner and thus unblock other work.
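
A minimal sketch of the re-enqueue-plus-workers pattern described above, assuming the (pre-generics) client-go workqueue API; processNextWorkItem here is a toy stand-in for the real controller method:

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/util/workqueue"
)

type Controller struct {
    WorkQueue workqueue.RateLimitingInterface
}

// processNextWorkItem drains one key from the queue and re-enqueues it so the
// job is reconciled again periodically even if no events occur.
func (c *Controller) processNextWorkItem() bool {
    key, quit := c.WorkQueue.Get()
    if quit {
        return false
    }
    defer c.WorkQueue.Done(key)

    fmt.Println("reconciling", key)
    c.WorkQueue.AddAfter(key.(string), 10*time.Second)
    return true
}

func (c *Controller) runWorker() {
    for c.processNextWorkItem() {
    }
}

// Run starts threadiness workers draining the same queue, so throughput scales
// by adding workers rather than running one goroutine per TrainingJob.
func (c *Controller) Run(threadiness int, stopCh <-chan struct{}) {
    defer c.WorkQueue.ShutDown()
    for i := 0; i < threadiness; i++ {
        go wait.Until(c.runWorker, time.Second, stopCh)
    }
    <-stopCh
}

func main() {
    c := &Controller{WorkQueue: workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())}
    c.WorkQueue.Add("default/mnist")
    stopCh := make(chan struct{})
    go c.Run(2, stopCh)
    time.Sleep(3 * time.Second)
    close(stopCh)
}

The 10-second requeue mirrors the AddAfter call in the diff above; it trades a little extra queue traffic for a guaranteed periodic reconcile.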

@@ -166,9 +169,11 @@ func (c *Controller) processNextWorkItem() bool {
return true
}

// syncJob will sync the job with the given key if it has had its expectations fulfilled, meaning
// it did not expect to see any more of its pods created or deleted. This function is not meant to be invoked
// syncJob will sync the job with the given. This function is not meant to be invoked

nit: s/syncJob/syncTFJob

jlewi changed the title from "Fix regression; Add back code to support termination policies other than master" to "Fix a bunch of problems in TfJob CRD that crept in while tests were broken" on Jan 15, 2018
coveralls commented Jan 15, 2018

Coverage Status

Coverage decreased (-0.2%) to 31.439% when pulling 66cc51a on jlewi:fix_gpu_timeout into 2109be9 on tensorflow:master.

coveralls commented Jan 15, 2018

Coverage Status

Coverage decreased (-0.1%) to 31.541% when pulling e63da86 on jlewi:fix_gpu_timeout into 2109be9 on tensorflow:master.

jlewi (Contributor, Author) commented Jan 16, 2018

Since there are no more comments and this fixes head, I'm going to merge this.
