POC: Katib integration with tf-operator #267

richardsliu · 2018-11-29T02:15:01Z

This is the first iteration for #39.

The provided example is deliberately minimal. The code is borrowed from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py. It uses only one TF worker and exposes learning_rate and batch_size as tunable hyperparameters.

richardsliu · 2018-12-01T01:08:47Z

/hold

richardsliu · 2018-12-01T01:09:19Z

/assign @YujiOshima
/assign @gaocegege
/assign @johnugeorge

richardsliu · 2018-12-01T02:11:01Z

/retest

richardsliu · 2018-12-01T03:06:17Z

/retest

richardsliu · 2018-12-01T03:06:51Z

/test all

richardsliu · 2018-12-03T23:23:25Z

/retest

YujiOshima · 2018-12-04T01:10:25Z

manifests/studyjobcontroller/rbac.yaml

@@ -13,6 +13,14 @@ rules:
  - update


Please split this change into a separate PR, since it relates to #256 and is not specific to TF_JOB.

Done. The other change in this file is still required, otherwise watching tfjobs will throw a permission error.

YujiOshima · 2018-12-04T01:13:43Z

pkg/controller/studyjob/studyjob_controller.go

-					if instance.Spec.WorkerSpec != nil {
-						wretain = instance.Spec.WorkerSpec.Retain
+				case DefaultJobWorker:
+					if err := r.deleteWorkerResources(instance, &batchv1.Job{}, ns, w.WorkerID); err != nil {


Why did you remove the retain logic? It allows workers to be retained after completion for debugging.

The logic is moved to deleteWorkerResources().

I missed it, thanks.
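For readers following this thread, here is a minimal sketch of how a retain check can live inside a single delete helper instead of being repeated at each call site. The types below are simplified, hypothetical stand-ins (the real call in this PR is `r.deleteWorkerResources(instance, &batchv1.Job{}, ns, w.WorkerID)`), and the `deleter` callback is an assumption that replaces the Kubernetes client.

```go
package main

import "fmt"

// WorkerSpec is a hypothetical, trimmed-down stand-in for the StudyJob worker spec.
type WorkerSpec struct {
	Retain bool // keep worker resources after completion for debugging
}

// StudyJob is a hypothetical stand-in for the StudyJob custom resource.
type StudyJob struct {
	WorkerSpec *WorkerSpec
}

// deleter abstracts the API call that removes a worker's resources (assumed).
type deleter func(namespace, workerID string) error

// deleteWorkerResources removes a worker's resources unless the StudyJob asks
// to retain them. It reports whether a deletion was attempted.
func deleteWorkerResources(instance *StudyJob, del deleter, namespace, workerID string) (bool, error) {
	if instance.WorkerSpec != nil && instance.WorkerSpec.Retain {
		return false, nil // retained: leave resources in place
	}
	if err := del(namespace, workerID); err != nil {
		return true, fmt.Errorf("deleting worker %s/%s: %w", namespace, workerID, err)
	}
	return true, nil
}

func main() {
	var deleted []string
	del := func(ns, id string) error {
		deleted = append(deleted, ns+"/"+id)
		return nil
	}

	retained := &StudyJob{WorkerSpec: &WorkerSpec{Retain: true}}
	plain := &StudyJob{}

	deleteWorkerResources(retained, del, "kubeflow", "w1")
	deleteWorkerResources(plain, del, "kubeflow", "w2")
	fmt.Println(deleted)
}
```

Keeping the check inside the helper means every job type that calls it gets the same retain behavior.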

YujiOshima · 2018-12-04T01:19:30Z

pkg/controller/studyjob/const.go

+)
+
+const (
+	WorkerState_Active = 0


Why do you redefine the state? Why not use api.State_COMPLETED, etc.?

This is a generic way to represent the state of a worker, which can be either a batch job or TF job. In lines 415-454 we compute the current worker status and pass the handling to updateWorker(). This is to avoid duplicating the long code section for each job type.

I think you don't need to define a new state here; you can use the Katib API State type.

katib/pkg/api/api.pb.go

Line 165 in fb27298

type State int32

Then WorkerStatus will be like this.

type WorkerStatus struct {
	CompletionTime *metav1.Time
	WorkerState    katibapi.State
}

WDYT?

Sure, fixed.

johnugeorge · 2018-12-04T17:24:13Z

LGTM

YujiOshima · 2018-12-05T01:13:42Z

@richardsliu Thank you for replying to my comments.
Let me try testing this in my environment.

richardsliu · 2018-12-05T07:07:47Z

/retest

YujiOshima · 2018-12-05T07:56:39Z

@richardsliu Thank you! To try the TFJob example, we need to install TFJob-operator.
We should add documentation for installing it with the kf command.
Could you add it later?

/lgtm

richardsliu · 2018-12-05T08:01:48Z

@YujiOshima Yes, I will add the documentation change. I was testing this on kubeflow/kubeflow which already installs all of the components.

richardsliu · 2018-12-05T08:31:26Z

/approve

k8s-ci-robot · 2018-12-05T08:31:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardsliu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [richardsliu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

TF operator part 1

8171c62

k8s-ci-robot added do-not-merge/work-in-progress size/L labels Nov 29, 2018

k8s-ci-robot requested review from jose5918 and libbyandhelen November 29, 2018 02:15

richardsliu added 4 commits November 28, 2018 18:42

Add consts

8850757

Fix

ee8ac90

Update worker; fix schemes

55e4921

Change example

e268314

richardsliu changed the title ~~WIP TF operator~~ WIP Katib integration with tf-operator Dec 1, 2018

Add rbac rules

4cd7487

richardsliu changed the title ~~WIP Katib integration with tf-operator~~ POC: Katib integration with tf-operator Dec 1, 2018

k8s-ci-robot removed the do-not-merge/work-in-progress label Dec 1, 2018

k8s-ci-robot added the do-not-merge/hold label Dec 1, 2018

k8s-ci-robot assigned gaocegege, johnugeorge and YujiOshima Dec 1, 2018

Add crd

26c8d4f

k8s-ci-robot added size/XL and removed size/L labels Dec 1, 2018

richardsliu added 2 commits December 3, 2018 13:38

Add sleep for debugging

7d18ec1

Log cluster name

fb27298

YujiOshima reviewed Dec 4, 2018

Remove unrelated change

e295a62

use katibapi.State

1ef9d2b

k8s-ci-robot added size/L and removed size/XL labels Dec 5, 2018

k8s-ci-robot added the lgtm label Dec 5, 2018

richardsliu removed the do-not-merge/hold label Dec 5, 2018

k8s-ci-robot added the approved label Dec 5, 2018

k8s-ci-robot merged commit 3516dda into kubeflow:master Dec 5, 2018

richardsliu deleted the tfjob branch January 17, 2019 23:20
