Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix helm templates so that we don't require a configmap. #176

Merged
merged 2 commits into from
Dec 4, 2017

Conversation

jlewi
Copy link
Contributor

@jlewi jlewi commented Nov 25, 2017

  • If no --cloud isn't specified then we shouldn't specify
    controller_config_file

  • Add options to allow the user to specify the name of a config map
    if they want to set the configmap themselves.

  • The options are set so that specifying a custom configmap overrides
    the value selected by setting cloud so that the two options can be used
    together.

Fixes #175

* If no --cloud isn't specified then we shouldn't specify
  controller_config_file

* Add options to allow the user to specify the name of a config map
  if they want to set the configmap themselves.

* The options are set so that specifying a custom configmap overrides
  the value selected by setting cloud so that the two options can be used
  together.
@jlewi
Copy link
Contributor Author

jlewi commented Nov 25, 2017

Testing: no cloud

helm install ./tf-job-operator-chart --dry-run --debug
[debug] Created tunnel using local port: '46607'

[debug] SERVER: "localhost:46607"

[debug] Original chart version: ""
[debug] CHART PATH: /home/jlewi/git_training/tf-job-operator-chart

NAME:   flabby-robin
REVISION: 1
RELEASED: Sat Nov 25 12:03:02 2017
CHART: tf-job-operator-chart-0.2.0
USER-SUPPLIED VALUES:
{}

COMPUTED VALUES:
cloud: null
config:
  configmap: null
  file: /etc/config/controller_config_file.yaml
image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
rbac:
  apiVersion: v1beta1
  install: false
test_image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff

HOOKS:
---
# flabby-robin-tfjob-test-d8mn2l
apiVersion: v1
kind: Pod
metadata:
  # We give the pod a random name so that we can run helm test multiple times and not
  # have issues because the pod already exists.
  #
  # TODO(jlewi): I think the randAlphaNum is evaluated when you do deploy.
  # So running helm test tf-job multiple times causes problems because it will
  # end up using the same pod.
  name: "flabby-robin-tfjob-test-d8mn2l"
  annotations:
    # See https://github.com/kubernetes/helm/blob/master/docs/chart_tests.md
    "helm.sh/hook": test-success
spec:
  containers:
    - name: basic-test
      image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
      command:
        - /opt/mlkube/test/e2e
        - -alsologtostderr
        - --image=gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
        - -v=1
  
  restartPolicy: Never
MANIFEST:

---
# Source: tf-job-operator-chart/templates/deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tf-job-operator
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: tf-job-operator
    spec:
      containers:
      - name: tf-job-operator
        image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
        command:
          - /opt/mlkube/tf_operator
          - -alsologtostderr
          - -v=1
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
  • No config volume is mounted and controller_config_file isn't set

Testing: set cloud

 helm install ./tf-job-operator-chart --dry-run --debug --set cloud=gke
[debug] Created tunnel using local port: '40176'

[debug] SERVER: "localhost:40176"

[debug] Original chart version: ""
[debug] CHART PATH: /home/jlewi/git_training/tf-job-operator-chart

NAME:   hopping-blackbird
REVISION: 1
RELEASED: Sat Nov 25 12:03:59 2017
CHART: tf-job-operator-chart-0.2.0
USER-SUPPLIED VALUES:
cloud: gke

COMPUTED VALUES:
cloud: gke
config:
  configmap: null
  file: /etc/config/controller_config_file.yaml
image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
rbac:
  apiVersion: v1beta1
  install: false
test_image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff

HOOKS:
---
# hopping-blackbird-tfjob-test-zpxrw1
apiVersion: v1
kind: Pod
metadata:
  # We give the pod a random name so that we can run helm test multiple times and not
  # have issues because the pod already exists.
  #
  # TODO(jlewi): I think the randAlphaNum is evaluated when you do deploy.
  # So running helm test tf-job multiple times causes problems because it will
  # end up using the same pod.
  name: "hopping-blackbird-tfjob-test-zpxrw1"
  annotations:
    # See https://github.com/kubernetes/helm/blob/master/docs/chart_tests.md
    "helm.sh/hook": test-success
spec:
  containers:
    - name: basic-test
      image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
      command:
        - /opt/mlkube/test/e2e
        - -alsologtostderr
        - --image=gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
        - -v=1
  
  restartPolicy: Never
MANIFEST:

---
# Source: tf-job-operator-chart/templates/config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tf-job-operator-config
data:
  controller_config_file.yaml: |
    grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py
    accelerators:
      alpha.kubernetes.io/nvidia-gpu:
        volumes:
          - name: nvidia-libraries
            mountPath: /usr/local/nvidia/lib64 # This path is special; it is expected to be present in `/etc/ld.so.conf` inside the container image.
            hostPath: /home/kubernetes/bin/nvidia/lib
          - name: nvidia-debug-tools # optional
            mountPath: /usr/local/bin/nvidia
            hostPath: /home/kubernetes/bin/nvidia/bin
---
# Source: tf-job-operator-chart/templates/deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tf-job-operator
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: tf-job-operator
    spec:
      containers:
      - name: tf-job-operator
        image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
        command:
          - /opt/mlkube/tf_operator
          - --controller_config_file=/etc/config/controller_config_file.yaml
          - -alsologtostderr
          - -v=1
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name       
        volumeMounts:
          - name: config-volume
            mountPath: /etc/config
      volumes:
        - name: config-volume
          configMap:
            name: tf-job-operator-config

Testing: override the config map

 helm install ./tf-job-operator-chart --dry-run --debug --set config.configmap=custom_map,config.file=/tmp/file
[debug] Created tunnel using local port: '38265'

[debug] SERVER: "localhost:38265"

[debug] Original chart version: ""
[debug] CHART PATH: /home/jlewi/git_training/tf-job-operator-chart

NAME:   left-cricket
REVISION: 1
RELEASED: Sat Nov 25 12:05:19 2017
CHART: tf-job-operator-chart-0.2.0
USER-SUPPLIED VALUES:
config:
  configmap: custom_map
  file: /tmp/file

COMPUTED VALUES:
cloud: null
config:
  configmap: custom_map
  file: /tmp/file
image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
rbac:
  apiVersion: v1beta1
  install: false
test_image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff

HOOKS:
---
# left-cricket-tfjob-test-p8d2mx
apiVersion: v1
kind: Pod
metadata:
  # We give the pod a random name so that we can run helm test multiple times and not
  # have issues because the pod already exists.
  #
  # TODO(jlewi): I think the randAlphaNum is evaluated when you do deploy.
  # So running helm test tf-job multiple times causes problems because it will
  # end up using the same pod.
  name: "left-cricket-tfjob-test-p8d2mx"
  annotations:
    # See https://github.com/kubernetes/helm/blob/master/docs/chart_tests.md
    "helm.sh/hook": test-success
spec:
  containers:
    - name: basic-test
      image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
      command:
        - /opt/mlkube/test/e2e
        - -alsologtostderr
        - --image=gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
        - -v=1
  
  restartPolicy: Never
MANIFEST:

---
# Source: tf-job-operator-chart/templates/deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tf-job-operator
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: tf-job-operator
    spec:
      containers:
      - name: tf-job-operator
        image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
        command:
          - /opt/mlkube/tf_operator
          - --controller_config_file=/tmp/file
          - -alsologtostderr
          - -v=1
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name       
        volumeMounts:
          - name: config-volume
            mountPath: /etc/config
      volumes:
        - name: config-volume
          configMap:
            name: custom_map

@jlewi
Copy link
Contributor Author

jlewi commented Nov 25, 2017

@DjangoPeng @jianzi123 does this fix your issues?

@jianzi123
Copy link

@jlewi I have fixed the issues, but I don't know how it to do, such as how to use my own code.

@DjangoPeng
Copy link
Member

I have set a configmap according to my bare metal environment. I'd like to check whether the PR fix my issue on next Monday.

@jlewi
Copy link
Contributor Author

jlewi commented Nov 28, 2017

@jianzi123 How did you fix the issue?

@jianzi123
Copy link

@jlewi I used kubectl create configmap --from-file to create the configmap and that put configmap into other configmap...

@jlewi
Copy link
Contributor Author

jlewi commented Nov 30, 2017

@jianzi123 @DjangoPeng if I don't see any comments tomorrow, I'll go ahead and merge this.

@DjangoPeng
Copy link
Member

@jlewi Go head please. I think this PR is helpful and the implementation is not bad.

@gaocegege
Copy link
Member

Hi, @jlewi

I have a question about the implementation. What happened if no config volume is mounted and controller_config_file isn't set.

AFAIK there is a necessary config grpcServerFilePath in the config file for PS 🤔

@jlewi
Copy link
Contributor Author

jlewi commented Nov 30, 2017

@gaocegege You raise a good point about grpcServerFilePath. I think the expectation should be that if no config file is explicitly set, the TfJob operator choose sensible defaults. A separate issue related to (#179). Is what the TfJob operator should do if grpcServerFilePath is the empty string. Some options would be

  1. Require a non empty value; so TfJob operator should error out on startup if the path doesn't point to a valid file
  2. TfJob's that require grpcServerFilePath will fail with an error indicating the TfJob operator isn't configured correctly.

@jlewi
Copy link
Contributor Author

jlewi commented Nov 30, 2017

Opened issue #188 to figure out what to do about empty grpcServerFilePath.

@gaocegege
Copy link
Member

SGTM :)

@jlewi jlewi merged commit 03c5b4b into kubeflow:master Dec 4, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants