[v1alpha2]Unable to create pod #641
Could you please give me more information? I cannot tell what happened from the log.
I used the above configuration file to create a TFJob, but no new pod is created.
Could you please show me
Are you using the latest master? I tried and it works.
Thanks, I will test it.
Could we close the issue?
@gaocegege There are no templates for v2 TFJobs. Is there any plan to provide a complete solution for v1alpha2?
@jiaxuanzhou We have a template in https://github.com/kubeflow/tf-operator/tree/master/test/e2e/dist-mnist
@gaocegege It seems I cannot create a pod with v1alpha2 either when using the template below. I think controller.v2 does not recognize the spec section, but no error is returned:
@jiaxuanzhou v1alpha1 and v1alpha2 have different specs. You may try:

    apiVersion: "kubeflow.org/v1alpha2"
    kind: "TFJob"
    metadata:
      name: "dist-mnist-for-e2e-test"
    spec:
      tfReplicaSpecs:
        PS:
          replicas: 2
          restartPolicy: Never
          template:
            spec:
              containers:
                - name: tensorflow
                  image: kubeflow/tf-dist-mnist-test:1.0
        Worker:
          replicas: 4
          restartPolicy: Never
          template:
            spec:
              containers:
                - name: tensorflow
                  image: kubeflow/tf-dist-mnist-test:1.0
                  args: ["train_steps", "50000"]
@gaocegege Yes, it works with the template above. My point is that when an invalid TFJob object is created, the tf-operator controller should reject it and print error information so that users know what happened.
Yeah, I agree with you. If you are using the latest master, I think the operator will submit an event to the TFJob to tell users that the spec is invalid. Ref ea770be. Also, could you please explain what the service related to tf-operator is?
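To illustrate the kind of check being asked for here, below is a minimal Python sketch that validates a TFJob manifest loaded as a dict and returns readable error messages. It is not the operator's actual validation code (tf-operator is written in Go). The function name `validate_tfjob` and the specific rules are assumptions made for illustration, including the rule that one container must be named `tensorflow`, which is inferred only from the template in this thread.

```python
def validate_tfjob(job):
    """Return a list of problems found in a v1alpha2-style TFJob dict.

    An empty list means the spec passed these illustrative checks.
    The rules are assumptions for this sketch, not the operator's real ones.
    """
    spec = job.get("spec") or {}
    replica_specs = spec.get("tfReplicaSpecs")
    if not isinstance(replica_specs, dict) or not replica_specs:
        # A v1alpha1-style manifest (for example one using "replicaSpecs")
        # would land here instead of being silently ignored.
        return ["spec.tfReplicaSpecs is missing or empty"]

    errors = []
    for rtype, rspec in replica_specs.items():
        template = (rspec or {}).get("template") or {}
        containers = (template.get("spec") or {}).get("containers") or []
        if not containers:
            errors.append(f"{rtype}: no containers defined")
            continue
        for container in containers:
            if not container.get("image"):
                errors.append(f"{rtype}: container without an image")
        # Assumed rule: one container is expected to be named "tensorflow".
        if not any(c.get("name") == "tensorflow" for c in containers):
            errors.append(f"{rtype}: no container named 'tensorflow'")
    return errors


good_job = {
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 4,
                "template": {
                    "spec": {
                        "containers": [
                            {"name": "tensorflow",
                             "image": "kubeflow/tf-dist-mnist-test:1.0"}
                        ]
                    }
                },
            }
        }
    }
}
# A manifest whose spec uses a key the v1alpha2 schema does not know.
bad_job = {"spec": {"replicaSpecs": []}}
```

A controller following this idea would surface the returned messages as an event or log entry on the TFJob rather than dropping the object without feedback.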
@gaocegege One scenario, for example: a PS job wants to communicate with a Worker job within one TFJob; services for the PS and the Worker may serve this purpose.
Oh, I understand. The operator will generate a cluster spec and headless services for the TFJob, then set the cluster spec as an env var.
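As a rough picture of what "cluster spec as an env var" can look like, here is a small Python sketch that builds a `TF_CONFIG`-style JSON value for one replica. The `cluster`/`task` layout follows TensorFlow's `TF_CONFIG` convention. The helper name `build_tf_config`, the `<job>-<type>-<index>` host naming, and port 2222 are assumptions for illustration and may not match what the operator actually generates.

```python
import json


def build_tf_config(job_name, replicas, task_type, task_index, port=2222):
    """Build a TF_CONFIG-style JSON string for a single replica.

    replicas maps a replica type (e.g. "PS", "Worker") to its count.
    Host names and the port are hypothetical; each host stands for one
    headless service fronting one replica pod.
    """
    cluster = {
        rtype.lower(): [
            f"{job_name}-{rtype.lower()}-{i}:{port}" for i in range(count)
        ]
        for rtype, count in replicas.items()
    }
    task = {"type": task_type.lower(), "index": task_index}
    return json.dumps({"cluster": cluster, "task": task}, sort_keys=True)


# Value that would be set in the environment of the first Worker replica.
tf_config = build_tf_config(
    "dist-mnist-for-e2e-test", {"PS": 2, "Worker": 4}, "Worker", 0
)
parsed = json.loads(tf_config)
```

Each replica gets the same `cluster` section and its own `task` entry, which is how a PS and a Worker inside one TFJob can find each other.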
Great, thanks, that's what I want.
@gaocegege I tested again with the template below. controller.v2 does not send out an event or log the error. This is a bug; I will submit a PR soon.
Here is my test log:
The func: