v1 and v2 E2E tests appear to be stomping on each other #748

Closed
jlewi opened this issue Jul 23, 2018 · 1 comment

@jlewi

jlewi commented Jul 23, 2018

Here's a post submit test.
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/kubeflow_tf-operator/745/kubeflow-tf-operator-presubmit/872

We run two separate workflows, one for v1 and one for v2, but we only see test results for:

simple-tfjob-v1alpha1
gpu-tfjob-v1alpha1

The logs for the v1alpha2 test appear to indicate a problem running the job.

@jlewi

jlewi commented Jul 23, 2018

So here's the problem.

We give each workflow its own directory named after the workflow in the shared NFS directory.

But the artifacts for all the workflows end up being copied to the same GCS bucket for gubernator because they are part of the same prow job.

gs://kubernetes-jenkins/logs/kubeflow_tf-operator/kubeflow-tf-operator-postsubmit/205

As a result, junit files with the same name from different workflows end up clobbering each other.
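To make the collision concrete, here is a minimal sketch. The workflow names, the `junit_e2e.xml` file name, and the `artifacts` subdirectory are assumptions for illustration only; the point is that the per-workflow directory is dropped when files land under the single GCS prefix for the prow job.

```python
import posixpath


def gcs_targets(workflow_files, gcs_prefix):
    """Map each GCS object name to the workflows that write to it.

    workflow_files: dict of workflow name -> list of artifact file names
    (hypothetical names; each workflow has its own NFS directory).
    """
    targets = {}
    for workflow, names in sorted(workflow_files.items()):
        for name in names:
            # The workflow directory is not part of the destination path,
            # so identical file names map to the same object.
            target = posixpath.join(gcs_prefix, "artifacts", name)
            targets.setdefault(target, []).append(workflow)
    return targets


def clobbered(targets):
    """Return only the objects written by more than one workflow."""
    return {t: w for t, w in targets.items() if len(w) > 1}


workflow_files = {
    "workflow-v1alpha1": ["junit_e2e.xml"],  # hypothetical file name
    "workflow-v1alpha2": ["junit_e2e.xml"],
}
prefix = ("gs://kubernetes-jenkins/logs/kubeflow_tf-operator/"
          "kubeflow-tf-operator-postsubmit/205")
print(clobbered(gcs_targets(workflow_files, prefix)))
```

Whichever workflow copies last wins, which would explain why only one version's results show up in gubernator.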

jlewi added a commit to jlewi/testing that referenced this issue Jul 23, 2018
* See kubeflow/trainer#748
* A test can run multiple instances of a workflow but with different parameters.
* In this case we need to make sure the junit files and other artifacts
  copied to GCS for gubernator have unique names.

* One way to make this easier is to have copy-artifacts automatically
  append a unique suffix to each file before copying it to GCS.
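A rough sketch of the suffix idea, not the actual copy-artifacts code: `unique_name` and `plan_copy` are hypothetical helpers, and the choice of suffix (for example the workflow name) is an assumption.

```python
import os


def unique_name(filename, suffix):
    """Insert a suffix before the file extension, e.g. a.xml -> a-<suffix>.xml."""
    root, ext = os.path.splitext(filename)
    return "{0}-{1}{2}".format(root, suffix, ext)


def plan_copy(local_files, gcs_dir, suffix):
    """Return (source, destination) pairs with suffixed destination names.

    This only computes the names; the actual copy to GCS is left out.
    """
    gcs_dir = gcs_dir.rstrip("/")
    return [
        (path, gcs_dir + "/" + unique_name(os.path.basename(path), suffix))
        for path in local_files
    ]
```

With a different suffix per workflow, two files that share a name locally no longer share a destination in GCS.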
jlewi added a commit to jlewi/k8s that referenced this issue Jul 24, 2018
* It turns out that although we were running the v1alpha2 tests, failures
  were not being properly reported in Prow because the junit xml files
  had the same names for the v2 pipeline as the v1 pipeline and the v2
  results were being clobbered by v1.

* Ensure the artifacts for each run of the E2E test have a unique name
  based on the TFJob version so that the E2E tests for the different
  TFJob versions won't clobber each other.

* Log the exception in wait for condition.

* Need to pass --tfjob_version to the tests so it uses the proper client.

* run_gpu and run_test stage need to use a v1alpha2 version of the test
  workflow.

* Update the tf_smoke program to accept chief as a valid worker type so that
  it works with v1alpha2.

* In v1alpha2 we need to terminate all workers. It looks like there was a
  regression in v1alpha2 (kubeflow#751) and we require all workers to
  terminate as opposed to just worker 0.

* Delete a bunch of environments for the test app that shouldn't have been
  committed.

Fix kubeflow#748
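
For illustration, a sketch of how the test could take the TFJob version and fold it into the junit file name. Only the `--tfjob_version` flag comes from the commit message; `--junit_dir`, the defaults, and the `junit_<test>-<version>.xml` naming pattern are assumptions, not the actual test code.

```python
import argparse
import posixpath


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="E2E test options (sketch)")
    # Flag name taken from the commit message; default and choices are assumed.
    parser.add_argument("--tfjob_version", default="v1alpha1",
                        choices=["v1alpha1", "v1alpha2"])
    # Hypothetical flag for where junit files are written.
    parser.add_argument("--junit_dir", default="artifacts")
    return parser.parse_args(argv)


def junit_path(args, test_name):
    """Build a junit file path that is unique per TFJob version."""
    filename = "junit_{0}-{1}.xml".format(test_name, args.tfjob_version)
    return posixpath.join(args.junit_dir, filename)


args = parse_args(["--tfjob_version", "v1alpha2"])
print(junit_path(args, "simple-tfjob"))
```

The same parsed version would also be what selects the matching client, so the file name and the API version under test stay in sync.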
k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue Jul 24, 2018
…x. (#183)

* Copy artifacts should make the file names unique by appending a suffix.

* See kubeflow/trainer#748
* A test can run multiple instances of a workflow but with different parameters.
* In this case we need to make sure the junit files and other artifacts
  copied to GCS for gubernator have unique names.

* One way to make this easier is to have copy-artifacts automatically
  append a unique suffix to each file before copying it to GCS.

* Fix lint.
k8s-ci-robot pushed a commit that referenced this issue Jul 25, 2018
…749)

* Prevent multiple versions of an E2E test from clobbering each other.

* It turns out that although we were running the v1alpha2 tests, failures
  were not being properly reported in Prow because the junit xml files
  had the same names for the v2 pipeline as the v1 pipeline and the v2
  results were being clobbered by v1.

* Ensure the artifacts for each run of the E2E test have a unique name
  based on the TFJob version so that the E2E tests for the different
  TFJob versions won't clobber each other.

* Log the exception in wait for condition.

* Need to pass --tfjob_version to the tests so it uses the proper client.

* run_gpu and run_test stage need to use a v1alpha2 version of the test
  workflow.

* Update the tf_smoke program to accept chief as a valid worker type so that
  it works with v1alpha2.

* In v1alpha2 we need to terminate all workers. It looks like there was a
  regression in v1alpha2 (#751) and we require all workers to terminate
  as opposed to just worker 0.

* Delete a bunch of environments for the test app that shouldn't have been
  committed.

Fix #748

* Use kubeflow/testing@HEAD rather than the hack of pinning PR
  kubeflow/testing#183 which I was using to test.