Fix helm templates so that we don't require a configmap. #176

jlewi · 2017-11-25T20:02:37Z

If no --cloud isn't specified then we shouldn't specify
controller_config_file
Add options to allow the user to specify the name of a config map
if they want to set the configmap themselves.
The options are set so that specifying a custom configmap overrides
the value selected by setting cloud so that the two options can be used
together.

Fixes #175

* If no --cloud isn't specified then we shouldn't specify controller_config_file * Add options to allow the user to specify the name of a config map if they want to set the configmap themselves. * The options are set so that specifying a custom configmap overrides the value selected by setting cloud so that the two options can be used together.

jlewi · 2017-11-25T20:06:07Z

Testing: no cloud

helm install ./tf-job-operator-chart --dry-run --debug
[debug] Created tunnel using local port: '46607'

[debug] SERVER: "localhost:46607"

[debug] Original chart version: ""
[debug] CHART PATH: /home/jlewi/git_training/tf-job-operator-chart

NAME:   flabby-robin
REVISION: 1
RELEASED: Sat Nov 25 12:03:02 2017
CHART: tf-job-operator-chart-0.2.0
USER-SUPPLIED VALUES:
{}

COMPUTED VALUES:
cloud: null
config:
  configmap: null
  file: /etc/config/controller_config_file.yaml
image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
rbac:
  apiVersion: v1beta1
  install: false
test_image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff

HOOKS:
---
# flabby-robin-tfjob-test-d8mn2l
apiVersion: v1
kind: Pod
metadata:
  # We give the pod a random name so that we can run helm test multiple times and not
  # have issues because the pod already exists.
  #
  # TODO(jlewi): I think the randAlphaNum is evaluated when you do deploy.
  # So running helm test tf-job multiple times causes problems because it will
  # end up using the same pod.
  name: "flabby-robin-tfjob-test-d8mn2l"
  annotations:
    # See https://github.com/kubernetes/helm/blob/master/docs/chart_tests.md
    "helm.sh/hook": test-success
spec:
  containers:
    - name: basic-test
      image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
      command:
        - /opt/mlkube/test/e2e
        - -alsologtostderr
        - --image=gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
        - -v=1
  
  restartPolicy: Never
MANIFEST:

---
# Source: tf-job-operator-chart/templates/deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tf-job-operator
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: tf-job-operator
    spec:
      containers:
      - name: tf-job-operator
        image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
        command:
          - /opt/mlkube/tf_operator
          - -alsologtostderr
          - -v=1
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name

No config volume is mounted and controller_config_file isn't set

Testing: set cloud

 helm install ./tf-job-operator-chart --dry-run --debug --set cloud=gke
[debug] Created tunnel using local port: '40176'

[debug] SERVER: "localhost:40176"

[debug] Original chart version: ""
[debug] CHART PATH: /home/jlewi/git_training/tf-job-operator-chart

NAME:   hopping-blackbird
REVISION: 1
RELEASED: Sat Nov 25 12:03:59 2017
CHART: tf-job-operator-chart-0.2.0
USER-SUPPLIED VALUES:
cloud: gke

COMPUTED VALUES:
cloud: gke
config:
  configmap: null
  file: /etc/config/controller_config_file.yaml
image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
rbac:
  apiVersion: v1beta1
  install: false
test_image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff

HOOKS:
---
# hopping-blackbird-tfjob-test-zpxrw1
apiVersion: v1
kind: Pod
metadata:
  # We give the pod a random name so that we can run helm test multiple times and not
  # have issues because the pod already exists.
  #
  # TODO(jlewi): I think the randAlphaNum is evaluated when you do deploy.
  # So running helm test tf-job multiple times causes problems because it will
  # end up using the same pod.
  name: "hopping-blackbird-tfjob-test-zpxrw1"
  annotations:
    # See https://github.com/kubernetes/helm/blob/master/docs/chart_tests.md
    "helm.sh/hook": test-success
spec:
  containers:
    - name: basic-test
      image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
      command:
        - /opt/mlkube/test/e2e
        - -alsologtostderr
        - --image=gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
        - -v=1
  
  restartPolicy: Never
MANIFEST:

---
# Source: tf-job-operator-chart/templates/config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tf-job-operator-config
data:
  controller_config_file.yaml: |
    grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py
    accelerators:
      alpha.kubernetes.io/nvidia-gpu:
        volumes:
          - name: nvidia-libraries
            mountPath: /usr/local/nvidia/lib64 # This path is special; it is expected to be present in `/etc/ld.so.conf` inside the container image.
            hostPath: /home/kubernetes/bin/nvidia/lib
          - name: nvidia-debug-tools # optional
            mountPath: /usr/local/bin/nvidia
            hostPath: /home/kubernetes/bin/nvidia/bin
---
# Source: tf-job-operator-chart/templates/deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tf-job-operator
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: tf-job-operator
    spec:
      containers:
      - name: tf-job-operator
        image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
        command:
          - /opt/mlkube/tf_operator
          - --controller_config_file=/etc/config/controller_config_file.yaml
          - -alsologtostderr
          - -v=1
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name       
        volumeMounts:
          - name: config-volume
            mountPath: /etc/config
      volumes:
        - name: config-volume
          configMap:
            name: tf-job-operator-config

Testing: override the config map

 helm install ./tf-job-operator-chart --dry-run --debug --set config.configmap=custom_map,config.file=/tmp/file
[debug] Created tunnel using local port: '38265'

[debug] SERVER: "localhost:38265"

[debug] Original chart version: ""
[debug] CHART PATH: /home/jlewi/git_training/tf-job-operator-chart

NAME:   left-cricket
REVISION: 1
RELEASED: Sat Nov 25 12:05:19 2017
CHART: tf-job-operator-chart-0.2.0
USER-SUPPLIED VALUES:
config:
  configmap: custom_map
  file: /tmp/file

COMPUTED VALUES:
cloud: null
config:
  configmap: custom_map
  file: /tmp/file
image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
rbac:
  apiVersion: v1beta1
  install: false
test_image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff

HOOKS:
---
# left-cricket-tfjob-test-p8d2mx
apiVersion: v1
kind: Pod
metadata:
  # We give the pod a random name so that we can run helm test multiple times and not
  # have issues because the pod already exists.
  #
  # TODO(jlewi): I think the randAlphaNum is evaluated when you do deploy.
  # So running helm test tf-job multiple times causes problems because it will
  # end up using the same pod.
  name: "left-cricket-tfjob-test-p8d2mx"
  annotations:
    # See https://github.com/kubernetes/helm/blob/master/docs/chart_tests.md
    "helm.sh/hook": test-success
spec:
  containers:
    - name: basic-test
      image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
      command:
        - /opt/mlkube/test/e2e
        - -alsologtostderr
        - --image=gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
        - -v=1
  
  restartPolicy: Never
MANIFEST:

---
# Source: tf-job-operator-chart/templates/deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tf-job-operator
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: tf-job-operator
    spec:
      containers:
      - name: tf-job-operator
        image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
        command:
          - /opt/mlkube/tf_operator
          - --controller_config_file=/tmp/file
          - -alsologtostderr
          - -v=1
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name       
        volumeMounts:
          - name: config-volume
            mountPath: /etc/config
      volumes:
        - name: config-volume
          configMap:
            name: custom_map

jlewi · 2017-11-25T20:07:01Z

@DjangoPeng @jianzi123 does this fix your issues?

jianzi123 · 2017-11-26T02:18:44Z

@jlewi I have fixed the issues, but I don't know how it to do, such as how to use my own code.

DjangoPeng · 2017-11-26T05:09:13Z

I have set a configmap according to my bare metal environment. I'd like to check whether the PR fix my issue on next Monday.

jlewi · 2017-11-28T02:58:32Z

@jianzi123 How did you fix the issue?

jianzi123 · 2017-11-28T06:49:50Z

@jlewi I used kubectl create configmap --from-file to create the configmap and that put configmap into other configmap...

jlewi · 2017-11-30T07:52:25Z

@jianzi123 @DjangoPeng if I don't see any comments tomorrow, I'll go ahead and merge this.

DjangoPeng · 2017-11-30T07:56:25Z

@jlewi Go head please. I think this PR is helpful and the implementation is not bad.

gaocegege · 2017-11-30T08:15:31Z

Hi, @jlewi

I have a question about the implementation. What happened if no config volume is mounted and controller_config_file isn't set.

AFAIK there is a necessary config grpcServerFilePath in the config file for PS 🤔

jlewi · 2017-11-30T12:37:26Z

@gaocegege You raise a good point about grpcServerFilePath. I think the expectation should be that if no config file is explicitly set, the TfJob operator choose sensible defaults. A separate issue related to (#179). Is what the TfJob operator should do if grpcServerFilePath is the empty string. Some options would be

Require a non empty value; so TfJob operator should error out on startup if the path doesn't point to a valid file
TfJob's that require grpcServerFilePath will fail with an error indicating the TfJob operator isn't configured correctly.

jlewi · 2017-11-30T12:51:36Z

Opened issue #188 to figure out what to do about empty grpcServerFilePath.

gaocegege · 2017-11-30T13:39:39Z

SGTM :)

jlewi added 2 commits November 25, 2017 11:57

Update the README.md

647d567

jlewi mentioned this pull request Nov 28, 2017

Helm should only set --controller_config_file conditionally #175

Closed

jlewi mentioned this pull request Nov 30, 2017

Handle the case where grpcServerFilePath is the empty string #188

Closed

jlewi merged commit 03c5b4b into kubeflow:master Dec 4, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix helm templates so that we don't require a configmap. #176

Fix helm templates so that we don't require a configmap. #176

jlewi commented Nov 25, 2017

jlewi commented Nov 25, 2017

jlewi commented Nov 25, 2017

jianzi123 commented Nov 26, 2017

DjangoPeng commented Nov 26, 2017

jlewi commented Nov 28, 2017

jianzi123 commented Nov 28, 2017

jlewi commented Nov 30, 2017

DjangoPeng commented Nov 30, 2017

gaocegege commented Nov 30, 2017

jlewi commented Nov 30, 2017

jlewi commented Nov 30, 2017

gaocegege commented Nov 30, 2017

Fix helm templates so that we don't require a configmap. #176

Fix helm templates so that we don't require a configmap. #176

Conversation

jlewi commented Nov 25, 2017

jlewi commented Nov 25, 2017

jlewi commented Nov 25, 2017

jianzi123 commented Nov 26, 2017

DjangoPeng commented Nov 26, 2017

jlewi commented Nov 28, 2017

jianzi123 commented Nov 28, 2017

jlewi commented Nov 30, 2017

DjangoPeng commented Nov 30, 2017

gaocegege commented Nov 30, 2017

jlewi commented Nov 30, 2017

jlewi commented Nov 30, 2017

gaocegege commented Nov 30, 2017