
Extend Test Infrastructure to run multiple E2E tests in parallel #120

Closed · jlewi opened this issue Nov 3, 2017 · 9 comments

jlewi commented Nov 3, 2017

We need to extend the E2E test infrastructure to run multiple tests in parallel. The diagram below illustrates the expected flow.

[diagram: test_infrastructure]

The sequence of events is:

  • An event (e.g. a GitHub pull request) triggers a prow job
  • The prow job builds the artifacts (Docker images, helm packages, etc...)
  • Create one or more K8s clusters
    * We potentially need multiple clusters to test different features (e.g. GPUs) or different versions of K8s
  • Deploy the operator
  • Run a bunch of tests
    * Each test will exercise different TfJob features (e.g. non-distributed vs. distributed)
  • Cleanup

So the sequence of events is well represented by a DAG, as illustrated in the diagram. I'd like to find a convenient way to express and execute the DAG, preferably without rolling our own workflow system.
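Independent of which tool we pick, here is a rough sketch (in Python, since that's the direction we're leaning for CI/CT) of the dependency structure the workflow needs to express; the step, cluster, and test names are hypothetical:

```python
# Hypothetical fan-out: build once, then per-cluster setup/deploy, per-test runs,
# and a teardown step that waits on all tests for that cluster.
CLUSTERS = ["gke-stable", "gke-gpu"]           # e.g. different K8s versions / GPU support
TESTS = ["simple_tfjob", "distributed_tfjob"]  # each exercises different TfJob features

dag = {"build_artifacts": []}  # step -> list of upstream steps
for cluster in CLUSTERS:
    dag["setup_" + cluster] = ["build_artifacts"]
    dag["deploy_operator_" + cluster] = ["setup_" + cluster]
    for test in TESTS:
        dag["run_{0}_{1}".format(test, cluster)] = ["deploy_operator_" + cluster]
    dag["teardown_" + cluster] = ["run_{0}_{1}".format(t, cluster) for t in TESTS]
```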

Option 1: One Prow Job Per Test
We could create one prow job for each test and rely on prow to run the tests in parallel.

Disadvantages

  • Integrating new jobs into prow is tedious; lots of config files need to be updated
  • Can't reuse resources (build artifacts, clusters, etc...) across tests
  • Seems like a prow anti-pattern
    • Most prow jobs run a suite of tests

Option 2: Use Ginkgo

Advantages

  • Fairly lightweight
  • Used by K8s for testing

Ginkgo is a Go framework. I'm increasingly leaning towards using Python, not Go, for our CI/CT. Since the TensorFlow test programs will be written in Python, I think Python is a good choice for CT.

Option 3: Airflow

We can express the DAG illustrated above as an Airflow pipeline. The prow job can then just trigger the pipeline and upload artifacts to GCS as needed.

Advantages

  • Don't need to learn a new tool for CT
    • We can treat our CT pipeline just like an ML pipeline (preprocessing, training, evaluation, etc...)
    • I'm assuming that Airflow will become one of the standard workflow tools in ML
  • Python
  • Integration with K8s is coming
    * This should make it easy to deploy/manage Airflow on K8s
    * Should make it easy to express the steps in the DAG as K8s Jobs and orchestrate them using Airflow

Disadvantages

  • Heavyweight
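To make option 3 a bit more concrete, here is a minimal sketch of what the pipeline above could look like as an Airflow DAG; the task names and the helper scripts they call are hypothetical, and the real pipeline would fan out over multiple clusters/tests:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("tf_k8s_e2e", start_date=datetime(2017, 11, 1), schedule_interval=None)

build = BashOperator(task_id="build_artifacts",
                     bash_command="python -m py.release build",   # hypothetical helper
                     dag=dag)
setup = BashOperator(task_id="setup_cluster",
                     bash_command="python -m py.deploy setup",    # hypothetical helper
                     dag=dag)
test = BashOperator(task_id="run_helm_test",
                    bash_command="python -m py.deploy test",      # hypothetical helper
                    dag=dag)
teardown = BashOperator(task_id="teardown_cluster",
                        bash_command="python -m py.deploy teardown",
                        trigger_rule="all_done",  # tear down even if tests fail
                        dag=dag)

build >> setup >> test >> teardown
```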

bhack commented Nov 3, 2017

For option 3 do you know the status of https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=71013666?


bhack commented Nov 3, 2017

I know that there will be a meetup quite soon.


jlewi commented Nov 3, 2017

@foxish could provide more information about the status of Airflow on K8s.


foxish commented Nov 3, 2017

We'll present a demo at kubernetes sig-apps soon as well. The branch everyone is currently committing to is https://github.com/bloomberg/airflow/tree/airflow-kubernetes-executor. Some discussion remains around credential management and storage, but mostly the executor is in pretty good shape. I think the expectation is to have the first basic PR upstreamed in the next month or so.

cc/ @dimberman


bhack commented Nov 3, 2017

Nice, I was following that branch some time ago.


bhack commented Nov 4, 2017

I think option 3 is more bleeding-edge but probably the most future-proof.


jlewi commented Nov 8, 2017

CNCF has a project to do cross-cloud integration for its projects. This slide seems to indicate it's based on GitLab.


jlewi commented Nov 8, 2017

I took a look at GitLab CI/CD and I think I prefer the K8s + Airflow approach.

  • It looks like GitLab has its own workflow system based on a simple declarative syntax (YAML)
  • GitLab has its own abstractions for runners, processes, and containers.

It's not obvious to me that the cognitive load of using a different set of tools for CI/CD (compared to data science/ML pipelines) is worth the slicker UI and better out-of-the-box support for some common CI/CD tasks.

I like the approach prow has taken of building a set of microservices and would rather continue in that direction rather than adopt a monolithic tool. Towards that end, I'd rather evolve a set of reusable scripts for common CI/CD tasks that can easily be reused with Airflow + K8s.

@jimexist

We had some internal efforts trying out option 3, except that we went with Luigi, which was more mature at the time than Airflow (currently incubating). I would second the claim that Airflow is more future-proof.

jlewi added a commit that referenced this issue Nov 15, 2017
* deploy provides commands to provision resources (K8s clusters) needed
  for E2E tests.

* release.py makes some changes needed to build artifacts as part of an
  E2E pipeline.

* util.py adds a workaround for an issue with the Kubernetes Python client
  library that prevents credentials from working correctly when using a
  service account.

* test_util provides routines for creating junit XML files; this will be
  used to create results for gubernator.

* These binaries will be used in #120
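For context on the test_util piece, here is a minimal sketch of the kind of junit XML Gubernator consumes; the testsuite/testcase/failure layout is the standard junit schema, while the function name and result format are hypothetical:

```python
from xml.etree import ElementTree

def write_junit_xml(path, suite_name, results):
    """results: list of (test_name, time_seconds, failure_message_or_None)."""
    failures = sum(1 for _, _, msg in results if msg)
    suite = ElementTree.Element(
        "testsuite", name=suite_name, tests=str(len(results)),
        failures=str(failures), time=str(sum(t for _, t, _ in results)))
    for name, seconds, failure in results:
        case = ElementTree.SubElement(suite, "testcase", name=name, time=str(seconds))
        if failure:
            ElementTree.SubElement(case, "failure").text = failure
    ElementTree.ElementTree(suite).write(path)

# Example: one passing and one failing test case.
write_junit_xml("junit_e2e.xml", "tf-k8s-e2e",
                [("simple_tfjob", 42.0, None),
                 ("distributed_tfjob", 73.5, "pods never reached Running")])
```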
jlewi added a commit that referenced this issue Nov 17, 2017
Create an Airflow pipeline to run E2E tests in parallel.

* The high level goal is outlined in #120.

* This commit creates an Airflow pipeline that is equivalent to our existing test; it has steps to build the artifacts, set up a GKE cluster, run the helm test, and tear down the cluster.

* We create a suitable Dockerfile and K8s deployment for running Airflow on K8s.

* The README.md provides instructions.
jlewi added a commit that referenced this issue Nov 21, 2017
Our PROW jobs should trigger our Airflow pipelines to run our tests.

This aims to address #120

This PR changes the code invoked by PROW to trigger Airflow.

We replace bootstrap.py with py/airflow.py; we use this Python script to trigger an Airflow pipeline.

We update the steps in the Airflow E2E pipeline to create junit XML files in GCS so Gubernator can compute the results of the test.

This PR introduces a regression into our PROW jobs in that the current pipeline doesn't run lint checks on our Python code; we will fix that in a subsequent PR.

Our Prow job uses the same base container as our Airflow deployment; the only difference is the entrypoint.

We get rid of main.go because our Airflow pipeline takes care of setting up the CRD and cluster.
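A minimal sketch of the kind of trigger logic a script like py/airflow.py needs; the real mechanism may differ, the DAG id, conf payload, and GCS bucket here are hypothetical, and this assumes the Airflow CLI is available to the Prow container:

```python
import json
import os
import subprocess
import uuid

def trigger_e2e_pipeline(dag_id="tf_k8s_e2e"):
    # PULL_NUMBER / PULL_PULL_SHA are standard environment variables set by Prow.
    conf = {
        "PULL_NUMBER": os.environ.get("PULL_NUMBER", ""),
        "PULL_PULL_SHA": os.environ.get("PULL_PULL_SHA", ""),
        "ARTIFACTS_PATH": "gs://some-test-bucket/logs/" + uuid.uuid4().hex,  # hypothetical bucket
    }
    run_id = "prow_" + uuid.uuid4().hex
    subprocess.check_call([
        "airflow", "trigger_dag", dag_id,
        "--run_id", run_id,
        "--conf", json.dumps(conf),
    ])
    return run_id
```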
jlewi closed this as completed Jan 25, 2018