v1 and v2 E2E tests appear to be stomping on each other #748

jlewi · 2018-07-23T22:05:21Z

Here's a post submit test.
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/kubeflow_tf-operator/745/kubeflow-tf-operator-presubmit/872

We run 2 separate workflows for v1 and v2, but we only see test results for:

simple-tfjob-v1alpha1
gpu-tfjob-v1alpha1

The logs for the v1alpha2 test appear to indicate a problem running the job.


jlewi · 2018-07-23T22:11:09Z

So here's the problem.

We give each workflow its own directory named after the workflow in the shared NFS directory.

But the artifacts for all the workflows end up being copied to the same GCS bucket for gubernator because they are part of the same prow job.

gs://kubernetes-jenkins/logs/kubeflow_tf-operator/kubeflow-tf-operator-postsubmit/205

As a result the junit files end up clobbering each other.
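To make the collision concrete, here is a minimal sketch. The NFS paths, the file name `junit_e2e.xml`, and the helper `gcs_destination` are hypothetical and not taken from the actual copy-artifacts code. The sketch only assumes that each file is copied under the shared prefix keyed by its base name.

```python
import posixpath

# Hypothetical layout: each workflow writes into its own NFS directory.
NFS_FILES = [
    "/mnt/test-data/workflow-v1alpha1/artifacts/junit_e2e.xml",
    "/mnt/test-data/workflow-v1alpha2/artifacts/junit_e2e.xml",
]

# Every workflow in the same prow job shares this GCS location
# (the layout beneath it is simplified here).
GCS_PREFIX = (
    "gs://kubernetes-jenkins/logs/kubeflow_tf-operator/"
    "kubeflow-tf-operator-postsubmit/205"
)


def gcs_destination(local_path, prefix=GCS_PREFIX):
    """Map a local file to a GCS object keyed only by base name (assumption)."""
    return posixpath.join(prefix, posixpath.basename(local_path))


uploaded = {}
for path in NFS_FILES:
    # A later copy to the same object name overwrites the earlier one.
    uploaded[gcs_destination(path)] = path

# Two source files go in, but only one object survives.
print(len(NFS_FILES), len(uploaded))
```

Under these assumptions the separate NFS directories do not help, because the workflow name is dropped on the way to GCS.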

…x. (#183)

* Copy artifacts should make the file names unique by appending a suffix.
* See kubeflow/trainer#748
* A test can run multiple instances of a workflow but with different parameters.
* In this case we need to make sure the junit files and other artifacts copied to GCS for gubernator have unique names.
* One way to make this easier is to have copy-artifacts automatically append a unique suffix to each file before copying it to GCS.
* Fix lint.
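The suffix idea described in that change could look roughly like the following. This is an illustrative sketch only: `unique_name` is a hypothetical helper, not the function that kubeflow/testing actually uses, and the 8-character random suffix is an arbitrary choice.

```python
import os
import uuid


def unique_name(filename, suffix=None):
    """Return filename with a suffix inserted before the extension.

    If no suffix is given, a short random one is generated (assumption).
    """
    if suffix is None:
        suffix = uuid.uuid4().hex[:8]
    root, ext = os.path.splitext(filename)
    return "{0}-{1}{2}".format(root, suffix, ext)


print(unique_name("junit_e2e.xml", suffix="v1alpha2"))  # junit_e2e-v1alpha2.xml
```

Inserting the suffix before the extension keeps the `junit` prefix and the `.xml` ending, which a results viewer that matches junit files by name would likely depend on.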

…749)

* Prevent multiple versions of an E2E test from clobbering each other.
* It turns out that although we were running the v1alpha2 tests, failures were not being properly reported in Prow because the junit xml files had the same names for the v2 pipeline as the v1 pipeline and the v2 results were being clobbered by v1.
* Ensure the artifacts for each run of the E2E test have a unique name based on the TFJob version so that the E2E tests for the different TFJob versions won't clobber each other.
* Log the exception in wait for condition.
* Need to pass --tfjob_version to the tests so they use the proper client.
* The run_gpu and run_test stages need to use a v1alpha2 version of the test workflow.
* Update the tf_smoke program to accept chief as a valid worker type so that it works with v1alpha2.
* In v1alpha2 we need to terminate all workers. It looks like there was a regression in v1alpha2 (#751), and we require all workers to terminate as opposed to just worker 0.
* Delete a bunch of environments for the test app that shouldn't have been committed.
* Fix #748
* Use kubeflow/testing@HEAD rather than the hack of pinning PR kubeflow/testing#183, which I was using to test.
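As a rough illustration of naming junit output after the TFJob version, a test entry point might take the version as a flag and fold it into the file name. Only the `--tfjob_version` flag comes from the change description above. The `--test_name` flag, the default values, and `junit_filename` are hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical E2E test entry point.")
    # Flag named in the change description; the choices listed are an assumption.
    parser.add_argument("--tfjob_version", default="v1alpha1",
                        choices=["v1alpha1", "v1alpha2"])
    # Hypothetical flag, used here only to build the file name.
    parser.add_argument("--test_name", default="simple-tfjob")
    return parser


def junit_filename(test_name, tfjob_version):
    """Build a junit file name that differs per TFJob version."""
    return "junit_{0}-{1}.xml".format(test_name, tfjob_version)


args = build_parser().parse_args(["--tfjob_version", "v1alpha2"])
print(junit_filename(args.test_name, args.tfjob_version))
# junit_simple-tfjob-v1alpha2.xml
```

With the version in the name, the v1alpha1 and v1alpha2 runs write distinct files even when they land in the same GCS location.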

jlewi added area/testing priority/p1 area/engprod area/0.3.0 labels Jul 23, 2018

k8s-ci-robot closed this as completed in #749 Jul 25, 2018
