
Fix a bunch of problems in TfJob CRD that crept in while tests were broken #308

Merged: 10 commits from jlewi:fix_gpu_timeout into tensorflow:master, Jan 16, 2018

Conversation

jlewi (Contributor) commented Jan 14, 2018

  • In syncTfJob, when checking whether a work queue item corresponds to a TrainingJob already in the map, we need to check the UID. Otherwise we will not properly handle the case where a training job is deleted and a new job is then recreated with the same name (see the sketch after this list).
  • We need to make sure that the Replicas field in TrainingJob is always properly set.
    • We were only initializing replicas in setup, which is problematic when the TfJob controller is restarted: on restart, setup won't be invoked because the job is already past that phase, so the replicas won't be reinitialized.
  • test_runner needs to ignore case when checking whether the job succeeded; otherwise we conclude that successful jobs failed.
  • The controller should only forget about a job after the job has been cleaned up, not when it is marked as succeeded or failed.
  • Add back code to support termination policies that use the worker, rather than the master, as the chief.
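
A rough illustration of the UID check in the first bullet (a minimal sketch, not the controller's actual types; Controller, TrainingJob, and getTrainingJob here are simplified stand-ins):

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/types"
)

// TrainingJob is a simplified stand-in for the controller's in-memory job object.
type TrainingJob struct {
    Name string
    UID  types.UID
}

// Controller caches TrainingJobs by namespace/name key, mirroring the work queue keys.
type Controller struct {
    jobs map[string]*TrainingJob
}

// getTrainingJob reuses a cached TrainingJob only when the UID matches. A job that
// is deleted and then recreated with the same name gets a new UID, so the stale
// cache entry is replaced instead of being reused.
func (c *Controller) getTrainingJob(key string, uid types.UID) *TrainingJob {
    if existing, ok := c.jobs[key]; ok && existing.UID == uid {
        return existing
    }
    job := &TrainingJob{Name: key, UID: uid}
    c.jobs[key] = job
    return job
}

func main() {
    c := &Controller{jobs: map[string]*TrainingJob{}}
    first := c.getTrainingJob("default/mnist", "uid-1")
    // Same name but a different UID: the cached entry must not be reused.
    second := c.getTrainingJob("default/mnist", "uid-2")
    fmt.Println(first == second) // prints false
}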

This change is Reviewable

…orker.

  * This was added in kubeflow#221 and accidentally removed in the refactor in kubeflow#234.
coveralls commented Jan 14, 2018

Coverage Status

Coverage increased (+0.2%) to 31.837% when pulling 29358ab on jlewi:fix_gpu_timeout into 98a34a1 on tensorflow:master.

jlewi added 2 commits January 14, 2018 16:50

* …periodically; otherwise we don't periodically check the status of the job and update it when it's done.
* The controller should only forget about a job after it's been cleaned up.
ScorpioCPH (Member) left a comment


@jlewi Hi, thanks for this PR; some quick comments (review not finished yet).

}
// Check that each replica has a TensorFlow container.
chiefExists := false

// Check that each replica has a TensorFlow container.

dup comments?

@@ -156,6 +156,9 @@ func (c *Controller) processNextWorkItem() bool {
if err == nil {
if forget {
c.WorkQueue.Forget(key)
} else {
// Requeue the key so that we will reconcile it again even if no events occur.
c.WorkQueue.AddAfter(key.(string), time.Second * 10)

This may fill up the work queue after a while; how about using an event-driven approach instead (a rough sketch follows this comment):

  • Listen for Pods created by the TFJob controller.
    • This needs a small change to use Pods instead of Jobs.
    • And set the OwnerReferences of each Pod to the TFJob controller.
  • On Pod created/updated/deleted, get the TFJob by parsing the OwnerReferences, and set the TFJob.Status.TFClusterStatus map as mentioned here.
  • Listen for TFJob changes (we updated the status in the previous step).
  • Set the overall status of the TFJob based on the TFClusterStatus map (like what we do in Reconcile).
    • And update the TFJob.Status.Condition.
  • Terminate/delete the TFJob once every Pod is completed.

@gaocegege @mqliang FYI.
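
A rough sketch of the OwnerReferences part of this suggestion (illustrative only, not the project's actual implementation; the kubeflow.org/v1alpha1 group/version and the helper names are assumptions):

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
)

// newWorkerPod builds a Pod whose OwnerReferences point back at the owning TFJob,
// so that a Pod event handler can recover the TFJob instead of going through Jobs.
func newWorkerPod(tfJobName string, tfJobUID types.UID) *corev1.Pod {
    isController := true
    return &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            Name: tfJobName + "-worker-0",
            OwnerReferences: []metav1.OwnerReference{{
                APIVersion: "kubeflow.org/v1alpha1", // assumed TFJob group/version
                Kind:       "TFJob",
                Name:       tfJobName,
                UID:        tfJobUID,
                Controller: &isController,
            }},
        },
    }
}

// ownerTFJobName returns the name of the controlling TFJob for a Pod, if any.
func ownerTFJobName(pod *corev1.Pod) (string, bool) {
    ref := metav1.GetControllerOf(pod)
    if ref == nil || ref.Kind != "TFJob" {
        return "", false
    }
    return ref.Name, true
}

func main() {
    pod := newWorkerPod("mnist", "uid-1")
    if name, ok := ownerTFJobName(pod); ok {
        fmt.Println("pod is owned by TFJob", name)
    }
}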

jlewi (Contributor, Author) commented:

I opened up #314. Overall, I think moving in the direction of being more event driven is a good idea, although I think we should probably still call Reconcile periodically as a catch-all. It's not clear to me why the work queue would fill up.

The original design used one go func for each TrainingJob. In the current design, there is one item in the queue for each TrainingJob, and we can increase the number of workers handling queue items (see the sketch below).

Either way, making the controller more event driven is a pretty big change. I think just re-enqueueing the items is a simpler change that should get head fixed sooner and thus unblock other work.
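
A minimal sketch of the re-enqueue-plus-workers pattern described above, assuming the (pre-generics) client-go workqueue API; processNextWorkItem here is a toy stand-in for the real controller method:

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/util/workqueue"
)

type Controller struct {
    WorkQueue workqueue.RateLimitingInterface
}

// processNextWorkItem drains one key from the queue and re-enqueues it so the
// job is reconciled again periodically even if no events occur.
func (c *Controller) processNextWorkItem() bool {
    key, quit := c.WorkQueue.Get()
    if quit {
        return false
    }
    defer c.WorkQueue.Done(key)

    fmt.Println("reconciling", key)
    c.WorkQueue.AddAfter(key.(string), 10*time.Second)
    return true
}

func (c *Controller) runWorker() {
    for c.processNextWorkItem() {
    }
}

// Run starts threadiness workers draining the same queue, so throughput scales
// by adding workers rather than running one goroutine per TrainingJob.
func (c *Controller) Run(threadiness int, stopCh <-chan struct{}) {
    defer c.WorkQueue.ShutDown()
    for i := 0; i < threadiness; i++ {
        go wait.Until(c.runWorker, time.Second, stopCh)
    }
    <-stopCh
}

func main() {
    c := &Controller{WorkQueue: workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())}
    c.WorkQueue.Add("default/mnist")
    stopCh := make(chan struct{})
    go c.Run(2, stopCh)
    time.Sleep(3 * time.Second)
    close(stopCh)
}

The 10-second requeue mirrors the AddAfter call in the diff above; it trades a little extra queue traffic for a guaranteed periodic reconcile.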

@@ -166,9 +169,11 @@ func (c *Controller) processNextWorkItem() bool {
return true
}

// syncJob will sync the job with the given key if it has had its expectations fulfilled, meaning
// it did not expect to see any more of its pods created or deleted. This function is not meant to be invoked
// syncJob will sync the job with the given. This function is not meant to be invoked

nit: s/syncJob/syncTFJob

jlewi changed the title from "Fix regression; Add back code to support termination policies other than master" to "Fix a bunch of problems in TfJob CRD that crept in while tests were broken" on Jan 15, 2018
coveralls commented Jan 15, 2018

Coverage Status

Coverage decreased (-0.2%) to 31.439% when pulling 66cc51a on jlewi:fix_gpu_timeout into 2109be9 on tensorflow:master.

coveralls commented Jan 15, 2018

Coverage Status

Coverage decreased (-0.1%) to 31.541% when pulling e63da86 on jlewi:fix_gpu_timeout into 2109be9 on tensorflow:master.

jlewi (Contributor, Author) commented Jan 16, 2018

Since there are no more comments and this fixes head, I'm going to merge this.
