[v1alpha2]Unable to create pod #641

Closed
yph152 opened this issue Jun 12, 2018 · 17 comments · Fixed by #678

Comments

@yph152
Contributor

yph152 commented Jun 12, 2018

@gaocegege

apiVersion: "kubeflow.org/v1alpha2"
kind: "TFJob"
metadata:
  name: "dist-mnist-for-e2e-test"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: dist-mnist-ps
              image: tensorflow:v1.2.1
              command:
              - /bin/bash
              - -c
              - sleep 96000
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: dist-mnist-worker
              image:  tensorflow:v1.2.1
              command:
              - /bin/bash
              - -c
              - sleep 180;test

image

@gaocegege
Member

Could you please give me more information? I cannot tell what happened from the log.

@yph152
Contributor Author

yph152 commented Jun 12, 2018

I used the configuration file above to create the TFJob, but no new pods are created.

@gaocegege
Member

Could you please show me the output of kubectl describe tfjob?

@yph152
Contributor Author

yph152 commented Jun 12, 2018

Name:         dist-mnist-for-e2e-test
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1alpha2
Kind:         TFJob
Metadata:
  Cluster Name:                   
  Creation Timestamp:             2018-06-12T12:35:59Z
  Deletion Grace Period Seconds:  <nil>
  Deletion Timestamp:             <nil>
  Initializers:                   <nil>
  Resource Version:               9460225
  Self Link:                      /apis/kubeflow.org/v1alpha2/namespaces/default/tfjobs/dist-mnist-for-e2e-test
  UID:                            28c11d5c-6e3d-11e8-9745-52540014a78b
Spec:
  Tf Replica Specs:
    PS:
      Replicas:        2
      Restart Policy:  Never
      Template:
        Spec:
          Containers:
            Command:
              /bin/bash
              -c
              sleep 96000
            Image:  tensorflow:v1.2.1
            Name:   dist-mnist-ps
    Worker:
      Replicas:        3
      Restart Policy:  OnFailure
      Template:
        Spec:
          Containers:
            Command:
              /bin/bash
              -c
              sleep 180;test
            Image:  tensorflow:v1.2.1
            Name:   dist-mnist-worker
Events:             <none>

@gaocegege
Member

Are you using the latest master? I tried and it works.

Name:         dist-mnist-for-e2e-test
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1alpha2
Kind:         TFJob
Metadata:
  Cluster Name:        
  Creation Timestamp:  2018-06-12T13:50:03Z
  Generation:          1
  Resource Version:    2788
  Self Link:           /apis/kubeflow.org/v1alpha2/namespaces/default/tfjobs/dist-mnist-for-e2e-test
  UID:                 81a4d640-6e47-11e8-95f5-484d7e9d305b
Spec:
  Tf Replica Specs:
    PS:
      Replicas:        2
      Restart Policy:  Never
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Command:
              /bin/bash
              -c
              sleep 96000
            Image:  busybox:1
            Name:   tensorflow
            Ports:
              Container Port:  2222
              Name:            tfjob-port
            Resources:
    Worker:
      Replicas:        4
      Restart Policy:  Never
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Command:
              /bin/bash
              -c
              sleep 180;test
            Image:  busybox:1
            Name:   tensorflow
            Ports:
              Container Port:  2222
              Name:            tfjob-port
            Resources:
Status:
  Conditions:  <nil>
  Tf Replica Statuses:
    PS:
    Worker:
Events:
  Type    Reason                   Age   From         Message
  ----    ------                   ----  ----         -------
  Normal  SuccessfulCreatePod      6s    tf-operator  Created pod: dist-mnist-for-e2e-test-ps-0
  Normal  SuccessfulCreatePod      6s    tf-operator  Created pod: dist-mnist-for-e2e-test-ps-1
  Normal  SuccessfulCreateService  6s    tf-operator  Created service: default-dist-mnist-for-e2e-test-ps-0
  Normal  SuccessfulCreateService  5s    tf-operator  Created service: default-dist-mnist-for-e2e-test-ps-1
  Normal  SuccessfulCreatePod      5s    tf-operator  Created pod: dist-mnist-for-e2e-test-worker-0
  Normal  SuccessfulCreatePod      5s    tf-operator  Created pod: dist-mnist-for-e2e-test-worker-1
  Normal  SuccessfulCreatePod      5s    tf-operator  Created pod: dist-mnist-for-e2e-test-worker-2
  Normal  SuccessfulCreatePod      4s    tf-operator  Created pod: dist-mnist-for-e2e-test-worker-3
  Normal  SuccessfulCreateService  4s    tf-operator  Created service: default-dist-mnist-for-e2e-test-worker-0
  Normal  SuccessfulCreateService  3s    tf-operator  Created service: default-dist-mnist-for-e2e-test-worker-1
  Normal  SuccessfulCreateService  2s    tf-operator  Created service: default-dist-mnist-for-e2e-test-worker-2
  Normal  SuccessfulCreateService  2s    tf-operator  Created service: default-dist-mnist-for-e2e-test-worker-3

@yph152
Contributor Author

yph152 commented Jun 12, 2018

Thanks, I will test it.

@gaocegege
Member

Could we close the issue?

@jiaxuanzhou
Contributor

@gaocegege there are no templates for v2 TFJobs. Is there any plan to provide a complete solution for v1alpha2?

@gaocegege
Member

@jiaxuanzhou
Contributor

jiaxuanzhou commented Jun 13, 2018

@gaocegege it seems I cannot create pods with v1alpha2 either when using the template below. I think controller.v2 does not recognize the spec section, but no error is returned:

Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1alpha2
Kind:         TFJob
Metadata:
  Cluster Name:        
  Creation Timestamp:  2018-06-13T02:40:55Z
  Resource Version:    2642416
  Self Link:           /apis/kubeflow.org/v1alpha2/namespaces/default/tfjobs/fengzhu-example-job3
  UID:                 327b6cd6-6eb3-11e8-816e-44a84235ba25
Spec:
  Replica Specs:
    Replicas:  1
    Template:
      Spec:
        Containers:
          Command:
            sh
            -c
            sleep 20000
          Image:  xxx
          Limits:
            Nvidia . Com / Gpu:  1
          Name:                  tensorflow
        Restart Policy:          OnFailure
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: xxx
              name: tensorflow
              limits:
                nvidia.com/gpu: 1
              command:
               - sh
               - -c
               - 'sleep 20000'
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: xxx
              name: tensorflow
              command:
               - sh
               - -c
               - 'sleep 10'
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: PS
      template:
        spec:
          containers:
            - image: xxx
              name: tensorflow
              command:
               - sh
               - -c
               - 'sleep 10'
          restartPolicy: OnFailure

@gaocegege
Member

@jiaxuanzhou v1alpha1 and v1alpha2 have different specs. You may try:

apiVersion: "kubeflow.org/v1alpha2"
kind: "TFJob"
metadata:
  name: "dist-mnist-for-e2e-test"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0
              args: ["train_steps", "50000"]
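
To make the structural difference concrete, here is a minimal sketch (not part of tf-operator) that reshapes a v1alpha1-style spec, a list under replicaSpecs, into the map under tfReplicaSpecs used in the template above. The helper name to_v1alpha2_spec, the default of one replica, and the type mapping are assumptions for illustration. Only PS and WORKER are mapped because those are the only types whose v1alpha2 spelling appears in this thread.

```python
import copy

# Assumed mapping from v1alpha1 tfReplicaType values to v1alpha2 map keys.
# MASTER is deliberately left out: its v1alpha2 equivalent is not shown in this thread.
REPLICA_TYPE_KEYS = {"PS": "PS", "WORKER": "Worker"}


def to_v1alpha2_spec(old_spec):
    """Reshape a v1alpha1-style spec dict into the v1alpha2 layout (hypothetical helper)."""
    new_replicas = {}
    for replica in old_spec.get("replicaSpecs", []):
        old_type = replica["tfReplicaType"]
        if old_type not in REPLICA_TYPE_KEYS:
            raise ValueError(f"no known v1alpha2 key for replica type {old_type!r}")
        # Copy the template so the caller's input is not modified.
        template = copy.deepcopy(replica["template"])
        entry = {"replicas": replica.get("replicas", 1), "template": template}
        # The v1alpha2 examples in this thread put restartPolicy next to
        # replicas instead of inside the pod spec, so move it up when present.
        policy = template.get("spec", {}).pop("restartPolicy", None)
        if policy is not None:
            entry["restartPolicy"] = policy
        new_replicas[REPLICA_TYPE_KEYS[old_type]] = entry
    return {"tfReplicaSpecs": new_replicas}
```

The result can be placed under spec in a manifest whose apiVersion is kubeflow.org/v1alpha2. Treat it as a starting point and compare it against the working template before applying it.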

@jiaxuanzhou
Contributor

@gaocegege yes, it works with the template above, but my point is that when an invalid TFJob object is created, the tf-operator controller should reject it and print error information so that users know what happened.
BTW, how do I use the services related to tf-operator? Are there any templates?

@gaocegege
Member

Yeah, I agree with you. If you are using the latest master, I think the operator will submit an event to the TFJob to tell users that the spec is invalid. Ref ea770be

Also, could you please explain what you mean by the service related to tf-operator?

@jiaxuanzhou
Contributor

@gaocegege one scenario, for example: a PS job wants to communicate with a Worker job within the same TFJob. Services for the PS and Worker may work for this.

@gaocegege
Member

Oh, I understand. The operator generates the cluster spec and headless services for the TFJob, then sets the cluster spec as the env var TF_CONFIG. So you do not need to care about how to create services for the PS and workers.
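
As a rough illustration of the consuming side, the sketch below reads TF_CONFIG inside a training container and picks out this replica's own address. It assumes the usual TensorFlow layout of TF_CONFIG, a JSON object with cluster and task keys. The helper name read_tf_config and the host names in the sample are made up for the example; only port 2222 is taken from the output earlier in this thread.

```python
import json
import os


def read_tf_config(environ=None):
    """Return (cluster, task_type, task_index) parsed from the TF_CONFIG variable."""
    environ = os.environ if environ is None else environ
    raw = environ.get("TF_CONFIG")
    if not raw:
        raise RuntimeError("TF_CONFIG is not set; is this running inside a TFJob pod?")
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return cluster, task.get("type"), int(task.get("index", 0))


# Illustrative value only: the host names are invented, not what the operator generates.
SAMPLE_TF_CONFIG = json.dumps({
    "cluster": {
        "ps": ["example-ps-0:2222", "example-ps-1:2222"],
        "worker": ["example-worker-0:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

cluster, task_type, task_index = read_tf_config({"TF_CONFIG": SAMPLE_TF_CONFIG})
my_address = cluster[task_type][task_index]  # address of this replica
```

Inside a real pod, calling read_tf_config() with no argument reads the value that the operator injected.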

@jiaxuanzhou
Contributor

Great, thanks, that's what I want.

@jiaxuanzhou
Contributor

jiaxuanzhou commented Jun 15, 2018

@gaocegege I have tested again with the template below. controller.v2 does not send out an event or log the error. This is a bug; I will submit a PR soon.

apiVersion: "kubeflow.org/v1alpha2"
kind: "TFJob"
metadata:
  name: "chuhe-example-job3"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: registry.v2.wx.service.mogujie.org/public/tinytf_baseline:20180508
              name: tensorflow
              limits:
                nvidia.com/gpu: 1
              command:
               - sh
               - -c
               - 'sleep 20000'
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: registry.v2.wx.service.mogujie.org/public/tinytf_baseline:20180508
              name: tensorflow
              command:
               - sh
               - -c
               - 'sleep 20'
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: PS
      template:
        spec:
          containers:
            - image: registry.v2.wx.service.mogujie.org/public/tinytf_baseline:20180508
              name: tensorflow
              command:
               - sh
               - -c
               - 'sleep 20'
          restartPolicy: OnFailure

Here is my test log:

INFO[0041] test for jiaxuanzhou: the obj of v1 is %v &{map[spec:map[replicaSpecs:[map[tfReplicaType:MASTER replicas:1 template:map[spec:map[containers:[map[image:registry.v2.wx.service.mogujie.org/public/tinytf_baseline:20180508 limits:map[nvidia.com/gpu:1] name:tensorflow command:[sh -c sleep 20000]]] restartPolicy:OnFailure]]] map[replicas:1 template:map[spec:map[containers:[map[command:[sh -c sleep 20] image:registry.v2.wx.service.mogujie.org/public/tinytf_baseline:20180508 name:tensorflow]] restartPolicy:OnFailure]] tfReplicaType:WORKER] map[replicas:2 template:map[spec:map[containers:[map[command:[sh -c sleep 20] image:registry.v2.wx.service.mogujie.org/public/tinytf_baseline:20180508 name:tensorflow]] restartPolicy:OnFailure]] tfReplicaType:PS]]] apiVersion:kubeflow.org/v1alpha2 kind:TFJob metadata:map[namespace:default resourceVersion:2996839 selfLink:/apis/kubeflow.org/v1alpha2/namespaces/default/tfjobs/chuhe-example-job3 uid:00467461-707a-11e8-816e-44a84235ba25 clusterName: creationTimestamp:2018-06-15T08:56:32Z name:chuhe-example-job3]]}  filename="controller.v2/controller_tfjob.go:17"
INFO[0041] test for jiaxuanzhou: %s {"kind":"TFJob","apiVersion":"kubeflow.org/v1alpha2","metadata":{"name":"chuhe-example-job3","namespace":"default","selfLink":"/apis/kubeflow.org/v1alpha2/namespaces/default/tfjobs/chuhe-example-job3","uid":"00467461-707a-11e8-816e-44a84235ba25","resourceVersion":"2996839","creationTimestamp":"2018-06-15T08:56:32Z"},"spec":{"tfReplicaSpecs":null},"status":{"conditions":null,"tfReplicaStatuses":null}}  filename="controller.v2/controller_tfjob.go:20"

The function tfJobFromUnstructured does not recognize the spec section and returns a TFJob like the one below:

{
  "kind": "TFJob",
  "apiVersion": "kubeflow.org/v1alpha2",
  "metadata": {
    "name": "chuhe-example-job3",
    "namespace": "default",
    "selfLink": "/apis/kubeflow.org/v1alpha2/namespaces/default/tfjobs/chuhe-example-job3",
    "uid": "00467461-707a-11e8-816e-44a84235ba25",
    "resourceVersion": "2996839",
    "creationTimestamp": "2018-06-15T08:56:32Z"
  },
  "spec": {
    "tfReplicaSpecs": null
  },
  "status": {
    "conditions": null,
    "tfReplicaStatuses": null
  }
}
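
Until the controller reports this case, a client-side pre-check can catch it before submission. The sketch below is not the operator's validation code; it is a hypothetical helper, check_tfjob_spec, that inspects a manifest already loaded into a dict and flags the situation seen above, where spec.tfReplicaSpecs ends up empty because the manifest used the v1alpha1 replicaSpecs field.

```python
def check_tfjob_spec(manifest):
    """Return a list of human-readable problems found in a TFJob manifest dict."""
    spec = manifest.get("spec")
    if not isinstance(spec, dict):
        return ["spec is missing or is not a mapping"]
    problems = []
    replica_specs = spec.get("tfReplicaSpecs")
    # An absent, null, or empty tfReplicaSpecs is what the decoded object above shows.
    if not isinstance(replica_specs, dict) or not replica_specs:
        problems.append("spec.tfReplicaSpecs is missing or empty, so no pods would be created")
    # replicaSpecs is the field used by the v1alpha1-style templates in this thread.
    if "replicaSpecs" in spec:
        problems.append("spec.replicaSpecs is the v1alpha1 field; v1alpha2 expects spec.tfReplicaSpecs")
    return problems
```

An empty list means the shape looks plausible for v1alpha2. It does not validate the pod templates themselves.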
