Make the TfJob controller more event driven #314
Should we try to get this done by KubeCon?
I think so, and we should refactor the tests in the trainer package to the controller level. But I am not sure if we can finish it before KubeCon.
jlewi pushed a commit that referenced this issue on Mar 5, 2018
jimexist pushed a commit to jimexist/tf-operator that referenced this issue on Mar 7, 2018
This PR is a part of kubeflow#325: rename jobName() to genName(); create Pod instead of Job. TODOs (in another PR): use controller.PodControlInterface and CreatePodsWithControllerRef to create Pods; listen to Pod CRUD and update TFJob status as described in kubeflow#314.
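For readers unfamiliar with that TODO, here is a hedged sketch of creating a Pod through controller.PodControlInterface with an owner reference back to the job. The function name, the owner/ownerObj split, and the gvk argument are assumptions made for illustration, and the CreatePodsWithControllerRef signature shown matches the Kubernetes controller package of roughly that era; it is not the actual tf-operator code.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/kubernetes/pkg/controller"
)

// createOwnedPod creates a Pod via PodControlInterface so that the Pod
// carries an OwnerReference pointing at the owning TFJob. The gvk parameter
// identifies the owner's kind (illustrative; the real wiring may differ).
func createOwnedPod(podControl controller.PodControlInterface, owner metav1.Object, ownerObj runtime.Object,
	gvk schema.GroupVersionKind, template *corev1.PodTemplateSpec) error {
	ref := metav1.NewControllerRef(owner, gvk)
	return podControl.CreatePodsWithControllerRef(owner.GetNamespace(), template, ownerObj, ref)
}
```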
Closed by #492
Right now the controller relies on TrainingJob.reconcile being called frequently to check the state of the job and take any needed action.
In #308 it was suggested that we adopt a more event-driven design.
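For context, this is roughly what the current polling style looks like; a minimal sketch that assumes a placeholder TrainingJob interface and Reconcile method name rather than the operator's real types:

```go
package sketch

import (
	"log"
	"time"
)

// TrainingJob stands in for the operator's job type; the method name and
// signature are placeholders, not the actual tf-operator API.
type TrainingJob interface {
	Reconcile() error
}

// runPeriodicReconcile reconciles every known job on a fixed interval,
// whether or not anything about it has changed.
func runPeriodicReconcile(listJobs func() []TrainingJob, interval time.Duration, stopCh <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			for _, job := range listJobs() {
				if err := job.Reconcile(); err != nil {
					log.Printf("reconcile failed: %v", err)
				}
			}
		case <-stopCh:
			return
		}
	}
}
```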
Here's the comment from @ScorpioCPH:

- Watch the Pods which are created by the TFJob controller.
- Create Pod instead of Job.
- Set the OwnerReferences of each Pod to the TFJob controller.
- Listen to Pod CRUD events, get the TFJob by parsing OwnerReferences, and set the TFJob.Status.TFClusterStatus map as mentioned here.
- Enqueue the TFJobs that changed (we update the status in the previous step).
- Sync the TFJob according to the TFClusterStatus map (like what we do in Reconcile).
- Update TFJob.Status.Condition.
- The exit condition of the TFJob is that every Pod is completed.

I think it's more complicated than that since we create other resources (e.g. services, config maps, etc.).
It's also not clear to me why the queue would get filled up, since the number of items in the queue would be the same as the number of jobs in the cluster.
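For readers less familiar with the pattern being proposed, here is a minimal sketch of the event-driven wiring using client-go shared informers and a workqueue. The function name, the TFJob kind check, and the resync interval are illustrative assumptions rather than the operator's actual code, and as noted above the real controller would need the same treatment for services, config maps, and other owned resources.

```go
package sketch

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// watchPodsForTFJobs wires Pod informer events back to the owning TFJob:
// on any Pod add/update/delete, the controlling TFJob's namespace/name key
// is enqueued so the sync handler can recompute that job's status.
func watchPodsForTFJobs(client kubernetes.Interface, queue workqueue.RateLimitingInterface) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	enqueueOwner := func(obj interface{}) {
		pod, ok := obj.(*corev1.Pod)
		if !ok {
			// e.g. a cache.DeletedFinalStateUnknown tombstone; skipped in this sketch.
			return
		}
		// Parse OwnerReferences to find the controlling TFJob, as described above.
		if ref := metav1.GetControllerOf(pod); ref != nil && ref.Kind == "TFJob" {
			queue.Add(fmt.Sprintf("%s/%s", pod.Namespace, ref.Name))
		}
	}

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueueOwner,
		UpdateFunc: func(_, newObj interface{}) { enqueueOwner(newObj) },
		DeleteFunc: enqueueOwner,
	})
	return podInformer
}
```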