cleanup jobs after finished #725
Conversation
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
CLAs look good, thanks!
Travis tests have failed. Hey @ccding,
1st Build: goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
/cc @jlewi @codeflitting
		return
	}
	for _, pod := range pods {
		if err :=
Why should we delete all pods? I think it is better to follow the CleanPodPolicy.
The CleanPodPolicy also deletes all pods; see https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/controller_tfjob.go#L107
/ok-to-test
@@ -48,6 +48,10 @@ type TFJobSpec struct {
	// Default to Running.
	CleanPodPolicy *CleanPodPolicy `json:"cleanPodPolicy,omitempty"`

	// CleanupTimeoutAfterFinished defines the timeout before cleaning finished jobs.
	// Default to 0: doesn't clean up.
	CleanupTimeoutAfterFinished *CleanupTimeoutAfterFinished `json:"cleanupTimeoutAfterFinished,omitempty"`
I'm wondering whether we should have a single CleanPolicy which contains:
{
    CleanPolicyType (CleanRunningPod, CleanAllPods, CleanJob, None)
    CleanupDelay
}
@gaocegege @jlewi, what's your opinion?
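A minimal sketch of what this combined policy could look like as a Go type; the names and the metav1.Duration delay field are illustrative assumptions, not a final API:

	// CleanPolicyType enumerates what should be cleaned up once a TFJob finishes.
	type CleanPolicyType string

	const (
		CleanRunningPod CleanPolicyType = "CleanRunningPod" // delete only pods that are still running
		CleanAllPods    CleanPolicyType = "CleanAllPods"    // delete all pods of the job
		CleanJob        CleanPolicyType = "CleanJob"        // delete the TFJob object itself
		CleanNone       CleanPolicyType = "None"            // keep everything
	)

	// CleanPolicy bundles the cleanup target with an optional delay, replacing the
	// separate CleanPodPolicy and CleanupTimeoutAfterFinished fields.
	type CleanPolicy struct {
		Type CleanPolicyType `json:"type,omitempty"`
		// CleanupDelay is how long to wait after the job finishes before cleaning up.
		// A zero duration means clean up immediately.
		CleanupDelay *metav1.Duration `json:"cleanupDelay,omitempty"` // metav1 = k8s.io/apimachinery/pkg/apis/meta/v1
	}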
Let's reach agreement here; then I will fix the other comments. I personally like @jian-he's proposal.
Will this deprecate the existing CleanPodPolicy?
It works for me. I think we could use AllReplicas, RunningReplicas, None, and Job.
/cc @jlewi
}

// deleteTFJob deletes all the pods and services in a tfJob.
func (tc *TFJobController) deleteTFJob(tfJob *tfv1alpha2.TFJob) {
I think this function only removes the pods/services associated with the tfJob; what we want is to delete the entire tfJob, including the tfJob object itself. If we just delete the tfJob, the pods/services will be removed automatically (through garbage collection of the owned objects).
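A rough sketch of deleting the TFJob object itself and letting Kubernetes garbage-collect the dependent pods/services; the tfJobClientSet field and the KubeflowV1alpha2() accessor are assumed names for the generated clientset, not necessarily the exact API in this repository:

	// deleteTFJobObject deletes the TFJob resource itself. Because the controller sets
	// owner references on the pods and services it creates, foreground deletion lets the
	// Kubernetes garbage collector remove those dependents automatically.
	func (tc *TFJobController) deleteTFJobObject(tfJob *tfv1alpha2.TFJob) error {
		propagation := metav1.DeletePropagationForeground
		// Clientset accessor names below are assumptions; adjust to the generated client.
		return tc.tfJobClientSet.KubeflowV1alpha2().
			TFJobs(tfJob.Namespace).
			Delete(tfJob.Name, &metav1.DeleteOptions{PropagationPolicy: &propagation})
	}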
@@ -97,6 +98,15 @@ func (tc *TFJobController) deletePodsAndServices(tfJob *tfv1alpha2.TFJob, pods [

	// Delete nothing when the cleanPodPolicy is None.
	if *tfJob.Spec.CleanPodPolicy == tfv1alpha2.CleanPodPolicyNone {
		// If CleanupTimeoutAfterFinished is specified, delete after that timeout.
This mixes the logic of cleaning up the job with cleaning up the pods, which may confuse users. How about separating the if conditions?
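A rough sketch of keeping the two concerns in separate branches; the helper names and the error-returning signatures are placeholders, not the exact functions in this PR:

	// reconcileCleanup keeps pod-level and job-level cleanup independent so each policy
	// stays readable on its own.
	func (tc *TFJobController) reconcileCleanup(tfJob *tfv1alpha2.TFJob, pods []*v1.Pod) error {
		// 1) Pod cleanup, governed only by CleanPodPolicy.
		if tfJob.Spec.CleanPodPolicy != nil && *tfJob.Spec.CleanPodPolicy != tfv1alpha2.CleanPodPolicyNone {
			if err := tc.deletePodsAndServices(tfJob, pods); err != nil {
				return err
			}
		}
		// 2) Job cleanup, governed only by CleanupTimeoutAfterFinished.
		if tfJob.Spec.CleanupTimeoutAfterFinished != nil {
			if err := tc.cleanupJobAfterTimeout(tfJob); err != nil { // hypothetical helper
				return err
			}
		}
		return nil
	}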
for {
	select {
	case ci := <-tc.cleanupInfoCh:
		go func() {
How about checking in reconcileTFJobs whether currentTime - Job.Status.CompletionTime > timeout and, if true, deleting the job? Then we don't need to start a goroutine for each timeout.
That would be better. If we use a goroutine, we cannot recover the pending cleanups after a controller crash, whereas a check in the reconcile loop survives a restart.
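A minimal sketch of that reconcile-time check; the status field comes from the discussion above, while the timeout's Duration accessor and the deletion helper are assumptions:

	// cleanupIfExpired deletes a finished TFJob once its configured cleanup timeout has
	// elapsed. It runs on every reconcile pass, so no per-job goroutine or timer is
	// needed and pending cleanups survive a controller restart.
	func (tc *TFJobController) cleanupIfExpired(tfJob *tfv1alpha2.TFJob, now time.Time) error {
		if tfJob.Spec.CleanupTimeoutAfterFinished == nil || tfJob.Status.CompletionTime == nil {
			return nil // no timeout configured, or the job has not finished yet
		}
		timeout := tfJob.Spec.CleanupTimeoutAfterFinished.Duration // assumed to wrap a metav1.Duration
		if now.Sub(tfJob.Status.CompletionTime.Time) < timeout {
			return nil // not expired yet; the next reconcile pass will check again
		}
		return tc.deleteTFJobObject(tfJob) // hypothetical helper that deletes the TFJob itself
	}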
Force-pushed from b436567 to e767ec7.
Travis tests have failed. Hey @ccding,
2nd Build: gometalinter --config=linter_config.json --vendor ./...
3rd Build: gometalinter --config=linter_config.json --vendor ./...
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
Can you explain why we want this feature? I don't see a rationale given in #718. Why would we implement garbage collection this way? E.g., if we want to garbage-collect old resources, why wouldn't we just have an agent running in the cluster that prunes objects of a certain age that are done? This way we could have a single agent handle multiple resource types, e.g. Argo workflows.
I left some questions in the issue as well.
Force-pushed from 04def1e to 53f1090.
I think we could add the delay field first and keep CleanPodPolicy. The problem this PR wants to solve is that etcd becomes heavily loaded when tfjobs accumulate, so we need to, for example, delete tfjobs that completed several days ago. Implementing a separate cleanup operator could also solve it, but the code would be similar to this PR. I think we could add the delay first (defaulting to 0). As for how to support the use case above, I think we could take a deeper dive into it. My opinion is to implement it in this operator because it will cover more use cases; we should let users decide whether to use their own cleanup operator or the built-in support in the tfjob operator.
Travis tests have failed. Hey @ccding,
2nd Build: goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
3rd Build: goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
@gaocegege If this use case is garbage collection to prevent overloading etcd, then that's a strong reason for doing it as a separate agent that can be used with multiple resources. We have the exact same garbage collection problems with all batch resources, e.g.
Why implement this in the TFJob controller if we know we are just going to have to reimplement it for other operators? Furthermore, it doesn't seem like garbage collection policy should be set per job by the user of the job; it seems like a policy that should be set by admins. Requiring users to set it won't work very well: all it takes is one user not setting the policy on their jobs and submitting a hyperparameter tuning job that creates a whole bunch of jobs that will overwhelm the cluster.
@ccding Thanks for your work!
We'll have a default value for all jobs set by the admin, and give users the option to override it. There are scenarios where a large number of preliminary test jobs don't need to stay around that long and can be removed right away.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: gaocegege
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
I think so. It works for me.
I'm a little lost. Is this PR actually cleaning up finished jobs? It looks like the code is just adding a delay before enforcing the pod deletion policy. Doesn't this just exacerbate the issue of overloading the master by letting more resources persist longer? See the comment in #718. There is a proposal in the K8s community to add a TTL field to jobs; I think we should follow a similar design.
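For reference, the upstream Kubernetes proposal mentioned here expresses cleanup as a ttlSecondsAfterFinished field on the Job spec. A TFJob analogue might look roughly like the sketch below; the field placement and the stand-in types are assumptions modeled on that proposal, not this PR's API:

	// Sketch of a TTL-style field sitting alongside the existing CleanPodPolicy, modeled
	// on the Kubernetes ttlSecondsAfterFinished proposal for batch Jobs.
	type CleanPodPolicy string // stands in for the existing type in the real API

	type TFJobSpecSketch struct {
		// CleanPodPolicy controls what happens to pods when the job finishes (existing field).
		CleanPodPolicy *CleanPodPolicy `json:"cleanPodPolicy,omitempty"`

		// TTLSecondsAfterFinished limits the lifetime of a TFJob that has finished
		// (succeeded or failed). Once the TTL expires, the controller deletes the TFJob
		// itself, and its pods/services follow via owner-reference garbage collection.
		// nil means no automatic cleanup; 0 means delete immediately after completion.
		TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
	}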
This reverts commit 0b9d422.
This implements #718.