Troubleshooting Guide: no matches for tensorflow.org/, Kind=TfJob #174

Closed
jlewi opened this issue Nov 25, 2017 · 5 comments

Comments

@jlewi
Contributor

jlewi commented Nov 25, 2017

We should start a troubleshooting guide. Several issues (#149, #173) report jobs failing with:

error: unable to recognize "tf_job.yaml": no matches for tensorflow.org/, Kind=TfJob

This error indicates the CRD wasn't created. We should provide instructions for troubleshooting this issue.

@DjangoPeng
Member

DjangoPeng commented Nov 26, 2017

Yep, I'm in the process of building TFOperator and running the TFJob example on bare metal. I'd like to take on the work of writing a step-by-step guide.

@jlewi
Contributor Author

jlewi commented Nov 26, 2017

Great, thank you.

@sanchitarora

I'm looking for a troubleshooting guide to help me resolve the same problem:
error: unable to recognize "config.yaml": no matches for tensorflow.org/, Kind=TfJob
I have tried the pointers given in the previous issues without much luck. The CRD, deployment, and pod all appear to have been created successfully:

> kubectl get crd
NAME                    AGE
tfjobs.tensorflow.org   45m

> kubectl get deployment tf-job-operator
NAME              DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
tf-job-operator   1         1         1            1           15m

> kubectl get pod tf-job-operator-3267587941-2p5cq
NAME                               READY     STATUS    RESTARTS   AGE
tf-job-operator-3267587941-2p5cq   1/1       Running   0          20m

The pod logs don't show the controller config file problem described in the earlier issues; the config file appears to have loaded successfully. There is a separate issue, kubernetes/client-go#255, but I'm not sure whether it is the reason I am seeing problems.

> kubectl logs tf-job-operator-3267587941-2p5cq
I0124 22:31:39.745299       1 server.go:60] tf_operator Version: 0.3.0+git
I0124 22:31:39.746114       1 server.go:61] Git SHA: 11b2fad-dirty-e3b0c44
I0124 22:31:39.746127       1 server.go:62] Go Version: go1.8.2
I0124 22:31:39.746130       1 server.go:63] Go OS/Arch: linux/amd64
I0124 22:31:39.750898       1 server.go:137] Loading controller config from /etc/config/controller-config-file.yaml.
I0124 22:31:39.751207       1 server.go:147] ControllerConfig: {
  "Accelerators": {
    "alpha.kubernetes.io/nvidia-gpu": {
      "Volumes": [
        {
          "Name": "lib",
          "HostPath": "/usr/lib/nvidia-384",
          "MountPath": "/usr/local/nvidia/lib64"
        },
        {
          "Name": "bin",
          "HostPath": "/usr/lib/nvidia-384/bin",
          "MountPath": "/usr/local/nvidia/bin"
        },
        {
          "Name": "libcuda",
          "HostPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.1",
          "MountPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.1"
        }
      ],
      "EnvVars": null
    }
  },
  "GrpcServerFilePath": "/opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py"
}
I0124 22:31:39.751355       1 controller.go:98] Setting up event handlers
I0124 22:31:40.329659       1 leaderelection.go:174] attempting to acquire leader lease...
E0124 22:31:55.603903       1 event.go:260] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"tf-operator", GenerateName:"", Namespace:"default", SelfLink:"/api/v1/namespaces/default/endpoints/tf-operator", UID:"1fc4ee50-0152-11e8-aba5-000d3af8f626", ResourceVersion:"942410", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{sec:63652428086, nsec:0, loc:(*time.Location)(0x1852480)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"tf-job-operator-3267587941-2p5cq\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2018-01-24T22:31:55Z\",\"renewTime\":\"2018-01-24T22:31:55Z\",\"leaderTransitions\":1}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'no kind is registered for the type v1.Endpoints'. Will not report event: 'Normal' 'LeaderElection' 'tf-job-operator-3267587941-2p5cq became leader'
I0124 22:31:55.604401       1 leaderelection.go:184] successfully acquired lease default/tf-operator
I0124 22:31:55.604468       1 controller.go:140] Starting TFJob controller
I0124 22:31:55.604478       1 controller.go:143] Waiting for informer caches to sync
I0124 22:31:55.704651       1 controller.go:148] Starting %v workers1
I0124 22:31:55.704693       1 controller.go:154] Started workers

Any pointers would be appreciated.

@jlewi
Contributor Author

jlewi commented Jan 25, 2018

What's the output of the following command?

kubectl get crd -o yaml

You might also want to try TFJob as opposed to TfJob; the capitalization just changed in #332.

@sanchitarora

@jlewi Thanks for the response! Changing the capitalization did the trick.
For anyone else who lands here: the CRD YAML output shows that the kind is registered as TFJob, so a manifest that uses the older name (TfJob) is not matched.

> kubectl get crd -o yaml
apiVersion: v1
items:
- apiVersion: apiextensions.k8s.io/v1beta1
  kind: CustomResourceDefinition
  metadata:
    creationTimestamp: 2018-01-24T22:01:26Z
    name: tfjobs.tensorflow.org
    namespace: ""
    resourceVersion: "938670"
    selfLink: /apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/tfjobs.tensorflow.org
    uid: 1f6db90d-0152-11e8-aba5-000d3af8f626
  spec:
    group: tensorflow.org
    names:
      kind: TFJob
      listKind: TFJobList
      plural: tfjobs
      singular: tfjob
    scope: Namespaced
    version: v1alpha1
  status:
    acceptedNames:
      kind: TFJob
      listKind: TFJobList
      plural: tfjobs
      singular: tfjob
    conditions:
    - lastTransitionTime: null
      message: no conflicts found
      reason: NoConflicts
      status: "True"
      type: NamesAccepted
    - lastTransitionTime: 2018-01-24T22:01:26Z
      message: the initial names have been accepted
      reason: InitialNamesAccepted
      status: "True"
      type: Established
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
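
For anyone who wants to automate the comparison, here is a rough sketch (not part of the project) that reads the kind from a manifest and compares it, case-sensitively, with the kind the CRD registers. It assumes the CRD is named tfjobs.tensorflow.org as in the output above; the function names manifest_kind, kinds_match, and check_manifest are made up for illustration, and only the first top-level kind: line of the manifest is inspected.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, and the CRD name
# tfjobs.tensorflow.org is taken from the `kubectl get crd` output above.

# Print the value of the first top-level "kind:" line in a manifest file.
manifest_kind() {
  sed -n 's/^kind:[[:space:]]*//p' "$1" | head -n 1
}

# Succeed only if both kinds are identical; the comparison is case-sensitive,
# so TfJob and TFJob are treated as different kinds.
kinds_match() {
  [ "$1" = "$2" ]
}

# Compare a manifest against the live CRD (needs kubectl and cluster access).
check_manifest() {
  want=$(kubectl get crd tfjobs.tensorflow.org -o jsonpath='{.spec.names.kind}') || return 2
  have=$(manifest_kind "$1")
  if kinds_match "$have" "$want"; then
    echo "ok: manifest kind '$have' matches the CRD"
  else
    echo "mismatch: manifest has '$have', CRD registers '$want'"
    return 1
  fi
}
```

After sourcing the file, running check_manifest tf_job.yaml should report a mismatch whenever the manifest still uses the old capitalization.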
