Make the TfJob controller more event driven #314
Should we try to get this done by KubeCon?
I think so, and we should refactor the tests in the trainer package to the controller level. But I am not sure if we can finish it before KubeCon.
jlewi pushed a commit that referenced this issue on Mar 5, 2018
jimexist pushed a commit to jimexist/tf-operator that referenced this issue on Mar 7, 2018
This PR is a part of kubeflow#325: rename jobName() to genName(); create Pod instead of Job. TODOs (in another PR): use controller.PodControlInterface and CreatePodsWithControllerRef to create Pods; listen to Pod CRUD and update TFJob status as described in kubeflow#314.
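For readers unfamiliar with that TODO, here is a hedged sketch of creating a Pod through controller.PodControlInterface with an owner reference back to the job. The function name, the owner/ownerObj split, and the gvk argument are assumptions made for illustration, and the CreatePodsWithControllerRef signature shown matches the Kubernetes controller package of roughly that era; it is not the actual tf-operator code.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/kubernetes/pkg/controller"
)

// createOwnedPod creates a Pod via PodControlInterface so that the Pod
// carries an OwnerReference pointing at the owning TFJob. The gvk parameter
// identifies the owner's kind (illustrative; the real wiring may differ).
func createOwnedPod(podControl controller.PodControlInterface, owner metav1.Object, ownerObj runtime.Object,
	gvk schema.GroupVersionKind, template *corev1.PodTemplateSpec) error {
	ref := metav1.NewControllerRef(owner, gvk)
	return podControl.CreatePodsWithControllerRef(owner.GetNamespace(), template, ownerObj, ref)
}
```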
Closed by #492
Right now the controller relies on TrainingJob.reconcile being called frequently to check the state of the job and take any needed action.
In #308 it was suggested that we adopt a more event-driven design.
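For context, this is roughly what the current polling style looks like; a minimal sketch that assumes a placeholder TrainingJob interface and Reconcile method name rather than the operator's real types:

```go
package sketch

import (
	"log"
	"time"
)

// TrainingJob stands in for the operator's job type; the method name and
// signature are placeholders, not the actual tf-operator API.
type TrainingJob interface {
	Reconcile() error
}

// runPeriodicReconcile reconciles every known job on a fixed interval,
// whether or not anything about it has changed.
func runPeriodicReconcile(listJobs func() []TrainingJob, interval time.Duration, stopCh <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			for _, job := range listJobs() {
				if err := job.Reconcile(); err != nil {
					log.Printf("reconcile failed: %v", err)
				}
			}
		case <-stopCh:
			return
		}
	}
}
```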
Here's the comment from @ScorpioCPH:

- Watch the Pods which are created by the TFJob controller.
- Create Pod instead of Job.
- Set the OwnerReferences of each Pod to the TFJob controller.
- Listen to Pod CRUD events, get the TFJob by parsing OwnerReferences, and set the TFJob.Status.TFClusterStatus map as mentioned here.
- Enqueue the TFJobs that changed (we update the status in the previous step).
- Sync the TFJob according to the TFClusterStatus map (like what we do in Reconcile).
- Update TFJob.Status.Condition.
- The exit condition of the TFJob is that every Pod is completed.

I think it's more complicated than that since we create other resources (e.g. services, config maps, etc.).
It's also not clear to me why the queue would get filled up, since the number of items in the queue would be the same as the number of jobs in the cluster.
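For readers less familiar with the pattern being proposed, here is a minimal sketch of the event-driven wiring using client-go shared informers and a workqueue. The function name, the TFJob kind check, and the resync interval are illustrative assumptions rather than the operator's actual code, and as noted above the real controller would need the same treatment for services, config maps, and other owned resources.

```go
package sketch

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// watchPodsForTFJobs wires Pod informer events back to the owning TFJob:
// on any Pod add/update/delete, the controlling TFJob's namespace/name key
// is enqueued so the sync handler can recompute that job's status.
func watchPodsForTFJobs(client kubernetes.Interface, queue workqueue.RateLimitingInterface) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	enqueueOwner := func(obj interface{}) {
		pod, ok := obj.(*corev1.Pod)
		if !ok {
			// e.g. a cache.DeletedFinalStateUnknown tombstone; skipped in this sketch.
			return
		}
		// Parse OwnerReferences to find the controlling TFJob, as described above.
		if ref := metav1.GetControllerOf(pod); ref != nil && ref.Kind == "TFJob" {
			queue.Add(fmt.Sprintf("%s/%s", pod.Namespace, ref.Name))
		}
	}

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueueOwner,
		UpdateFunc: func(_, newObj interface{}) { enqueueOwner(newObj) },
		DeleteFunc: enqueueOwner,
	})
	return podInformer
}
```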