Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC: Katib integration with tf-operator #267

Merged
merged 11 commits into from
Dec 5, 2018

Conversation

richardsliu
Copy link
Contributor

@richardsliu richardsliu commented Nov 29, 2018

This is the first iteration for #39.

The example provided is trivially simple. The code is borrowed from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py. It only uses 1 TF worker and exposes learning_rate and batch_size as tunable hyperparameters.

This change is Reviewable

@richardsliu richardsliu changed the title WIP TF operator WIP Katib integration with tf-operator Dec 1, 2018
@richardsliu richardsliu changed the title WIP Katib integration with tf-operator POC: Katib integration with tf-operator Dec 1, 2018
@richardsliu
Copy link
Contributor Author

/hold

@richardsliu
Copy link
Contributor Author

/assign @YujiOshima
/assign @gaocegege
/assign @johnugeorge

@richardsliu
Copy link
Contributor Author

/retest

@richardsliu
Copy link
Contributor Author

/retest

@richardsliu
Copy link
Contributor Author

/test all

@richardsliu
Copy link
Contributor Author

/retest

@@ -13,6 +13,14 @@ rules:
- update
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please split this change to another PR since this change is related to #256 not only for TF_JOB.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The other change in this file is still required, otherwise watching tfjobs will throw a permission error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

if instance.Spec.WorkerSpec != nil {
wretain = instance.Spec.WorkerSpec.Retain
case DefaultJobWorker:
if err := r.deleteWorkerResources(instance, &batchv1.Job{}, ns, w.WorkerID); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you remove retain logic? It allows retaining workers after completion for debugging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is moved to deleteWorkerResources().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed it, thanks.

)

const (
WorkerState_Active = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you redefine state? Why not use api.State_COMPLETED etc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a generic way to represent the state of a worker, which can be either a batch job or TF job. In lines 415-454 we compute the current worker status and pass the handling to updateWorker(). This is to avoid duplicating the long code section for each job type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you don't need to define new state here, you can use Katib api State type.

type State int32

Then WorkerStatus will be like this.

type WorkerStatus struct {
 CompletionTime *metav1.Time
 WorkerState katibapi. State
}

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, fixed.

@johnugeorge
Copy link
Member

johnugeorge commented Dec 4, 2018

LGTM

@YujiOshima
Copy link
Contributor

@richardsliu Thank you for a reply to my comments.
Let me try to test this in my env.

@richardsliu
Copy link
Contributor Author

/retest

@YujiOshima
Copy link
Contributor

@richardsliu Thank you! When we try TFJob example, we need to install TFJob-operator.
We should add doc for installing it by kf command.
Could you add it later?

/lgtm

@richardsliu
Copy link
Contributor Author

@YujiOshima Yes, I will add the documentation change. I was testing this on kubeflow/kubeflow which already installs all of the components.

@richardsliu
Copy link
Contributor Author

/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardsliu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 3516dda into kubeflow:master Dec 5, 2018
@richardsliu richardsliu deleted the tfjob branch January 17, 2019 23:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants