[v1alpha2]Unable to create pod #641

yph152 · 2018-06-12T13:10:00Z

apiVersion: "kubeflow.org/v1alpha2"
kind: "TFJob"
metadata:
  name: "dist-mnist-for-e2e-test"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: dist-mnist-ps
              image: tensorflow:v1.2.1
              command:
              - /bin/bash
              - -c
              - sleep 96000
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: dist-mnist-worker
              image:  tensorflow:v1.2.1
              command:
              - /bin/bash
              - -c
              - sleep 180;test

gaocegege · 2018-06-12T13:15:43Z

Could you please give me more information? I can't tell what happened from the log.

yph152 · 2018-06-12T13:20:18Z

I used the configuration file above to create a TFJob, but no pods are created.

gaocegege · 2018-06-12T13:22:42Z

Could you please show me the output of kubectl describe tfjob?

yph152 · 2018-06-12T13:31:30Z

Name:         dist-mnist-for-e2e-test
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1alpha2
Kind:         TFJob
Metadata:
  Cluster Name:                   
  Creation Timestamp:             2018-06-12T12:35:59Z
  Deletion Grace Period Seconds:  <nil>
  Deletion Timestamp:             <nil>
  Initializers:                   <nil>
  Resource Version:               9460225
  Self Link:                      /apis/kubeflow.org/v1alpha2/namespaces/default/tfjobs/dist-mnist-for-e2e-test
  UID:                            28c11d5c-6e3d-11e8-9745-52540014a78b
Spec:
  Tf Replica Specs:
    PS:
      Replicas:        2
      Restart Policy:  Never
      Template:
        Spec:
          Containers:
            Command:
              /bin/bash
              -c
              sleep 96000
            Image:  tensorflow:v1.2.1
            Name:   dist-mnist-ps
    Worker:
      Replicas:        3
      Restart Policy:  OnFailure
      Template:
        Spec:
          Containers:
            Command:
              /bin/bash
              -c
              sleep 180;test
            Image:  tensorflow:v1.2.1
            Name:   dist-mnist-worker
Events:             <none>

gaocegege · 2018-06-12T13:50:45Z

Are you using the latest master? I tried and it works.

Name:         dist-mnist-for-e2e-test
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1alpha2
Kind:         TFJob
Metadata:
  Cluster Name:        
  Creation Timestamp:  2018-06-12T13:50:03Z
  Generation:          1
  Resource Version:    2788
  Self Link:           /apis/kubeflow.org/v1alpha2/namespaces/default/tfjobs/dist-mnist-for-e2e-test
  UID:                 81a4d640-6e47-11e8-95f5-484d7e9d305b
Spec:
  Tf Replica Specs:
    PS:
      Replicas:        2
      Restart Policy:  Never
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Command:
              /bin/bash
              -c
              sleep 96000
            Image:  busybox:1
            Name:   tensorflow
            Ports:
              Container Port:  2222
              Name:            tfjob-port
            Resources:
    Worker:
      Replicas:        4
      Restart Policy:  Never
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Command:
              /bin/bash
              -c
              sleep 180;test
            Image:  busybox:1
            Name:   tensorflow
            Ports:
              Container Port:  2222
              Name:            tfjob-port
            Resources:
Status:
  Conditions:  <nil>
  Tf Replica Statuses:
    PS:
    Worker:
Events:
  Type    Reason                   Age   From         Message
  ----    ------                   ----  ----         -------
  Normal  SuccessfulCreatePod      6s    tf-operator  Created pod: dist-mnist-for-e2e-test-ps-0
  Normal  SuccessfulCreatePod      6s    tf-operator  Created pod: dist-mnist-for-e2e-test-ps-1
  Normal  SuccessfulCreateService  6s    tf-operator  Created service: default-dist-mnist-for-e2e-test-ps-0
  Normal  SuccessfulCreateService  5s    tf-operator  Created service: default-dist-mnist-for-e2e-test-ps-1
  Normal  SuccessfulCreatePod      5s    tf-operator  Created pod: dist-mnist-for-e2e-test-worker-0
  Normal  SuccessfulCreatePod      5s    tf-operator  Created pod: dist-mnist-for-e2e-test-worker-1
  Normal  SuccessfulCreatePod      5s    tf-operator  Created pod: dist-mnist-for-e2e-test-worker-2
  Normal  SuccessfulCreatePod      4s    tf-operator  Created pod: dist-mnist-for-e2e-test-worker-3
  Normal  SuccessfulCreateService  4s    tf-operator  Created service: default-dist-mnist-for-e2e-test-worker-0
  Normal  SuccessfulCreateService  3s    tf-operator  Created service: default-dist-mnist-for-e2e-test-worker-1
  Normal  SuccessfulCreateService  2s    tf-operator  Created service: default-dist-mnist-for-e2e-test-worker-2
  Normal  SuccessfulCreateService  2s    tf-operator  Created service: default-dist-mnist-for-e2e-test-worker-3

yph152 · 2018-06-12T14:24:30Z

Thanks, I will test it.

gaocegege · 2018-06-13T02:39:49Z

Could we close the issue?

jiaxuanzhou · 2018-06-13T03:03:16Z

@gaocegege There are no templates for v2 (v1alpha2) TFJobs. Is there any plan to provide a complete solution for v1alpha2?

gaocegege · 2018-06-13T03:06:12Z

@jiaxuanzhou We have a template in https://github.com/kubeflow/tf-operator/tree/master/test/e2e/dist-mnist

jiaxuanzhou · 2018-06-13T03:26:32Z

@gaocegege It seems I can't create pods with v1alpha2 either when using the template below. I think controller.v2 does not recognize the spec section, but no error is returned:

Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1alpha2
Kind:         TFJob
Metadata:
  Cluster Name:        
  Creation Timestamp:  2018-06-13T02:40:55Z
  Resource Version:    2642416
  Self Link:           /apis/kubeflow.org/v1alpha2/namespaces/default/tfjobs/fengzhu-example-job3
  UID:                 327b6cd6-6eb3-11e8-816e-44a84235ba25
Spec:
  Replica Specs:
    Replicas:  1
    Template:
      Spec:
        Containers:
          Command:
            sh
            -c
            sleep 20000
          Image:  xxx
          Limits:
            Nvidia . Com / Gpu:  1
          Name:                  tensorflow
        Restart Policy:          OnFailure
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: xxx
              name: tensorflow
              limits:
                nvidia.com/gpu: 1
              command:
               - sh
               - -c
               - 'sleep 20000'
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: xxx
              name: tensorflow
              command:
               - sh
               - -c
               - 'sleep 10'
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: PS
      template:
        spec:
          containers:
            - image: xxx
              name: tensorflow
              command:
               - sh
               - -c
               - 'sleep 10'
          restartPolicy: OnFailure

gaocegege · 2018-06-13T03:32:19Z

@jiaxuanzhou v1alpha1 and v1alpha2 have different specs. You may try:

apiVersion: "kubeflow.org/v1alpha2"
kind: "TFJob"
metadata:
  name: "dist-mnist-for-e2e-test"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0
              args: ["train_steps", "50000"]
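
Comparing the two manifest styles in this thread, the old style is a replicaSpecs list tagged with tfReplicaType, while the v1alpha2 style is a tfReplicaSpecs map keyed by replica type, with restartPolicy next to replicas. The sketch below shows that reshaping on plain dictionaries. The helper name to_tf_replica_specs and the TYPE_NAMES table are hypothetical, only the type names visible in this thread are mapped, and this is an illustration rather than an official migration tool.

```python
import copy

# Replica type names as they appear in this thread: the old-style manifests
# use "PS" and "WORKER", the working v1alpha2 manifests use "PS" and "Worker".
# "MASTER" is left out on purpose because the thread does not show its
# v1alpha2 counterpart; pass your own mapping if you know it.
TYPE_NAMES = {"PS": "PS", "WORKER": "Worker"}


def to_tf_replica_specs(replica_specs, type_names=TYPE_NAMES):
    """Convert a replicaSpecs list into a tfReplicaSpecs map (hypothetical helper)."""
    converted = {}
    for entry in replica_specs:
        old_type = entry["tfReplicaType"]
        if old_type not in type_names:
            raise ValueError(f"no v1alpha2 name known for replica type {old_type!r}")
        # Work on a copy so the caller's manifest is left untouched.
        template = copy.deepcopy(entry.get("template", {}))
        new_entry = {"template": template}
        if "replicas" in entry:
            new_entry["replicas"] = entry["replicas"]
        # The v1alpha2 examples put restartPolicy next to replicas rather than
        # inside template.spec, so move it up one level when present.
        restart_policy = template.get("spec", {}).pop("restartPolicy", None)
        if restart_policy is not None:
            new_entry["restartPolicy"] = restart_policy
        converted[type_names[old_type]] = new_entry
    return converted


old_style = [
    {
        "replicas": 2,
        "tfReplicaType": "PS",
        "template": {
            "spec": {
                "containers": [{"name": "tensorflow", "image": "xxx"}],
                "restartPolicy": "OnFailure",
            }
        },
    }
]
print(to_tf_replica_specs(old_style))
```

Unknown replica types raise an error instead of being guessed, so a manifest that still contains MASTER has to be handled explicitly by the caller.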

jiaxuanzhou · 2018-06-13T03:40:26Z

@gaocegege Yes, it works with the template above. My point is that when an invalid TFJob object is created, the tf-operator controller should reject it and print error information so that users know what happened.
BTW, how do I use the services related to tf-operator? Are there any templates?

gaocegege · 2018-06-13T03:52:06Z

Yeah, I agree with you. If you are using the latest master, I think the operator will emit an event on the TFJob to tell users that the spec is invalid. Ref ea770be

And could you please explain what you mean by the services related to tf-operator?

jiaxuanzhou · 2018-06-13T03:55:18Z

@gaocegege One scenario, for example: a PS replica wants to communicate with a Worker replica within the same TFJob. Services for the PS and workers might handle this.

gaocegege · 2018-06-13T04:01:17Z

Oh, I understand. The operator generates a cluster spec and headless services for the TFJob, then sets the cluster spec as the env var TF_CONFIG. So you do not need to worry about creating services for the PS and workers.
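
For illustration, a training script inside a replica could read that environment variable roughly as follows. This is a minimal sketch assuming the usual TF_CONFIG layout (a "cluster" map of job names to host:port lists, plus a "task" entry with "type" and "index"). The helper name read_tf_config and the sample host names are hypothetical; the addresses the operator actually generates may differ.

```python
import json
import os


def read_tf_config(environ=os.environ):
    """Return (cluster, task_type, task_index) parsed from TF_CONFIG."""
    raw = environ.get("TF_CONFIG")
    if not raw:
        raise RuntimeError("TF_CONFIG is not set; is this running inside a TFJob replica?")
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return cluster, task.get("type"), task.get("index")


# Made-up value for local experimentation; the host names are placeholders.
sample_env = {
    "TF_CONFIG": json.dumps(
        {
            "cluster": {
                "ps": ["example-ps-0:2222"],
                "worker": ["example-worker-0:2222", "example-worker-1:2222"],
            },
            "task": {"type": "worker", "index": 1},
        }
    )
}

cluster, task_type, task_index = read_tf_config(sample_env)
print(task_type, task_index, cluster["ps"])
```

Passing the environment as a parameter keeps the function easy to try outside a cluster; inside a pod it can be called with no arguments.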

jiaxuanzhou · 2018-06-13T04:17:00Z

Great, thanks, that's what I want.

jiaxuanzhou · 2018-06-15T09:00:21Z

@gaocegege I tested again with the template below. controller.v2 does not send out an event or log the error. This is a bug; I will submit a PR soon.

apiVersion: "kubeflow.org/v1alpha2"
kind: "TFJob"
metadata:
  name: "chuhe-example-job3"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: registry.v2.wx.service.mogujie.org/public/tinytf_baseline:20180508
              name: tensorflow
              limits:
                nvidia.com/gpu: 1
              command:
               - sh
               - -c
               - 'sleep 20000'
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: registry.v2.wx.service.mogujie.org/public/tinytf_baseline:20180508
              name: tensorflow
              command:
               - sh
               - -c
               - 'sleep 20'
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: PS
      template:
        spec:
          containers:
            - image: registry.v2.wx.service.mogujie.org/public/tinytf_baseline:20180508
              name: tensorflow
              command:
               - sh
               - -c
               - 'sleep 20'
          restartPolicy: OnFailure

Here is my test log:

INFO[0041] test for jiaxuanzhou: the obj of v1 is %v &{map[spec:map[replicaSpecs:[map[tfReplicaType:MASTER replicas:1 template:map[spec:map[containers:[map[image:registry.v2.wx.service.mogujie.org/public/tinytf_baseline:20180508 limits:map[nvidia.com/gpu:1] name:tensorflow command:[sh -c sleep 20000]]] restartPolicy:OnFailure]]] map[replicas:1 template:map[spec:map[containers:[map[command:[sh -c sleep 20] image:registry.v2.wx.service.mogujie.org/public/tinytf_baseline:20180508 name:tensorflow]] restartPolicy:OnFailure]] tfReplicaType:WORKER] map[replicas:2 template:map[spec:map[containers:[map[command:[sh -c sleep 20] image:registry.v2.wx.service.mogujie.org/public/tinytf_baseline:20180508 name:tensorflow]] restartPolicy:OnFailure]] tfReplicaType:PS]]] apiVersion:kubeflow.org/v1alpha2 kind:TFJob metadata:map[namespace:default resourceVersion:2996839 selfLink:/apis/kubeflow.org/v1alpha2/namespaces/default/tfjobs/chuhe-example-job3 uid:00467461-707a-11e8-816e-44a84235ba25 clusterName: creationTimestamp:2018-06-15T08:56:32Z name:chuhe-example-job3]]}  filename="controller.v2/controller_tfjob.go:17"
INFO[0041] test for jiaxuanzhou: %s {"kind":"TFJob","apiVersion":"kubeflow.org/v1alpha2","metadata":{"name":"chuhe-example-job3","namespace":"default","selfLink":"/apis/kubeflow.org/v1alpha2/namespaces/default/tfjobs/chuhe-example-job3","uid":"00467461-707a-11e8-816e-44a84235ba25","resourceVersion":"2996839","creationTimestamp":"2018-06-15T08:56:32Z"},"spec":{"tfReplicaSpecs":null},"status":{"conditions":null,"tfReplicaStatuses":null}}  filename="controller.v2/controller_tfjob.go:20"

The func tfJobFromUnstructured does not recognize the spec section and returns a TFJob like the one below:

{
  "kind": "TFJob",
  "apiVersion": "kubeflow.org/v1alpha2",
  "metadata": {
    "name": "chuhe-example-job3",
    "namespace": "default",
    "selfLink": "/apis/kubeflow.org/v1alpha2/namespaces/default/tfjobs/chuhe-example-job3",
    "uid": "00467461-707a-11e8-816e-44a84235ba25",
    "resourceVersion": "2996839",
    "creationTimestamp": "2018-06-15T08:56:32Z"
  },
  "spec": {
    "tfReplicaSpecs": null
  },
  "status": {
    "conditions": null,
    "tfReplicaStatuses": null
  }
}
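
The symptom above is a decoded object whose spec.tfReplicaSpecs is null because the submitted manifest used an unrecognized field name. A guard of the following shape would surface that instead of silently doing nothing. This is a Python sketch of the idea only, not the operator's Go code; the names InvalidTFJobSpec and require_replica_specs are hypothetical.

```python
import json


class InvalidTFJobSpec(ValueError):
    """Raised when a decoded TFJob carries no replica specs (hypothetical type)."""


def require_replica_specs(tfjob):
    """Return spec.tfReplicaSpecs, or raise if it is missing, null, or empty."""
    spec = tfjob.get("spec") or {}
    replica_specs = spec.get("tfReplicaSpecs")
    if not replica_specs:
        name = (tfjob.get("metadata") or {}).get("name", "<unknown>")
        raise InvalidTFJobSpec(
            f"TFJob {name!r}: spec.tfReplicaSpecs is empty after decoding; "
            "check that the manifest uses the v1alpha2 field names"
        )
    return replica_specs


# Same shape as the decoded object shown above, trimmed to the relevant fields.
decoded = json.loads(
    '{"kind": "TFJob", "apiVersion": "kubeflow.org/v1alpha2",'
    ' "metadata": {"name": "chuhe-example-job3"},'
    ' "spec": {"tfReplicaSpecs": null}}'
)

try:
    require_replica_specs(decoded)
except InvalidTFJobSpec as err:
    print(err)
```

The same check can be run on a manifest before submitting it, which catches the mistake earlier than waiting for pods that never appear.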

gaocegege added the api/v1alpha2 label Jun 12, 2018

gaocegege added the community/question label Jun 13, 2018

jiaxuanzhou mentioned this issue Jun 15, 2018

return err if the spec area is nil after unmashal for tfjob v1alpha2 #678

Merged

k8s-ci-robot closed this as completed in #678 Jun 16, 2018
