
Extend Test Infrastructure to run multiple E2E tests in parallel #120

Closed · jlewi opened this issue Nov 3, 2017 · 9 comments

jlewi commented Nov 3, 2017

We need to extend the E2E test infrastructure to run multiple tests in parallel. The diagram below illustrates the expected flow.

[diagram: test_infrastructure]

The sequence of events is:

  • An event (e.g. a GitHub pull request) triggers a prow job
  • The prow job builds the artifacts (Docker images, helm packages, etc...)
  • Create one or more K8s clusters
    * We potentially need multiple clusters to test different features (e.g. GPUs) or different versions of K8s
  • Deploy the operator
  • Run a bunch of tests
    * Each test will exercise different TfJob features (e.g. non-distributed vs. distributed)
  • Cleanup

So the sequence of events is well represented by a DAG, as illustrated in the diagram. I'd like to find a convenient way to express and execute the DAG, preferably without rolling our own workflow system.
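Independent of which tool we pick, here is a rough sketch (in Python, since that's the direction we're leaning for CI/CT) of the dependency structure the workflow needs to express; the step, cluster, and test names are hypothetical:

```python
# Hypothetical fan-out: build once, then per-cluster setup/deploy, per-test runs,
# and a teardown step that waits on all tests for that cluster.
CLUSTERS = ["gke-stable", "gke-gpu"]           # e.g. different K8s versions / GPU support
TESTS = ["simple_tfjob", "distributed_tfjob"]  # each exercises different TfJob features

dag = {"build_artifacts": []}  # step -> list of upstream steps
for cluster in CLUSTERS:
    dag["setup_" + cluster] = ["build_artifacts"]
    dag["deploy_operator_" + cluster] = ["setup_" + cluster]
    for test in TESTS:
        dag["run_{0}_{1}".format(test, cluster)] = ["deploy_operator_" + cluster]
    dag["teardown_" + cluster] = ["run_{0}_{1}".format(t, cluster) for t in TESTS]
```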

Option 1: One Prow Job Per Test
We could create one prow job for each test and rely on prow to run the tests in parallel.

Disadvantages

  • Integrating new jobs into prow is tedious; lots of config files need to be updated
  • Can't reuse resources (build artifacts, clusters, etc...) across tests
  • Seems like a prow anti-pattern
    • Most prow jobs run a suite of tests

Option 2: Use Ginkgo

Advantages

  • Fairly lightweight
  • Used by K8s for testing

Ginkgo is a Go framework. I'm increasingly leaning towards using Python, not Go, for our CI/CT. Since the TensorFlow test programs will be written in Python, I think Python is a good choice for CT.

Option 3: Airflow

We can express the DAG illustrated above as an Airflow pipeline. The prow job can then just trigger the pipeline and upload artifacts to GCS as needed.

Advantages

  • Don't need to learn a new tool for CT
    • We can treat our CT pipeline just like an ML pipeline (preprocessing, training, evaluation, etc...)
    • I'm assuming that Airflow will become one of the standard workflow tools in ML
  • Python
  • Integration with K8s is coming
    * This should make it easy to deploy/manage Airflow on K8s
    * Should make it easy to express the steps in the DAG as K8s Jobs and orchestrate them using Airflow

Disadvantages

  • Heavyweight
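To make option 3 a bit more concrete, here is a minimal sketch of what the pipeline above could look like as an Airflow DAG; the task names and the helper scripts they call are hypothetical, and the real pipeline would fan out over multiple clusters/tests:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("tf_k8s_e2e", start_date=datetime(2017, 11, 1), schedule_interval=None)

build = BashOperator(task_id="build_artifacts",
                     bash_command="python -m py.release build",   # hypothetical helper
                     dag=dag)
setup = BashOperator(task_id="setup_cluster",
                     bash_command="python -m py.deploy setup",    # hypothetical helper
                     dag=dag)
test = BashOperator(task_id="run_helm_test",
                    bash_command="python -m py.deploy test",      # hypothetical helper
                    dag=dag)
teardown = BashOperator(task_id="teardown_cluster",
                        bash_command="python -m py.deploy teardown",
                        trigger_rule="all_done",  # tear down even if tests fail
                        dag=dag)

build >> setup >> test >> teardown
```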

bhack commented Nov 3, 2017

For option 3 do you know the status of https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=71013666?


bhack commented Nov 3, 2017

I know that there will be a meetup quite soon.


jlewi commented Nov 3, 2017

@foxish could provide more information about the status of Airflow on K8s.


foxish commented Nov 3, 2017

We'll present a demo at kubernetes sig-apps soon as well. The branch everyone is currently committing to is https://github.com/bloomberg/airflow/tree/airflow-kubernetes-executor. Some discussion remains around credential management and storage, but mostly the executor is in pretty good shape. I think the expectation is to have the first basic PR upstreamed in the next month or so.

cc/ @dimberman


bhack commented Nov 3, 2017

Nice, I was following that branch some time ago.


bhack commented Nov 4, 2017

I think option 3 is more bleeding-edge but probably the most future-proof.


jlewi commented Nov 8, 2017

CNCF has a project to do cross-cloud integration for its projects. This slide seems to indicate it's based on GitLab.


jlewi commented Nov 8, 2017

I took a look at GitLab CI/CD and I think I prefer the K8s + Airflow approach.

  • It looks like GitLab has its own workflow system based on a simple declarative syntax (YAML)
  • GitLab has its own abstractions for runners, processes, and containers.

It's not obvious to me that the cognitive load of using a different set of tools for CI/CD (compared to data science/ML pipelines) is worth the slicker UI and better out-of-the-box support for some common CI/CD tasks.

I like the approach prow has taken of building a set of microservices and would rather continue in that direction rather than adopt a monolithic tool. Towards that end, I'd rather evolve a set of reusable scripts for common CI/CD tasks that can easily be reused with Airflow + K8s.

@jimexist

We had some internal efforts trying out option 3, except that we went with Luigi, which was more mature at the time than Airflow (currently incubating). I would second the claim that Airflow is more future-proof.

jlewi added a commit that referenced this issue Nov 15, 2017
* deploy provides commands to provision resources (K8s clusters) needed
  for E2E tests.

* release.py makes some changes needed to build artifacts as part of an
  E2E pipeline.

* util.py adds a workaround for an issue with the Kubernetes Python client
  library that prevents credentials from working correctly when using a
  service account.

* test_util provides routines for creating junit XML files; this will be
  used to create results for gubernator.

* These binaries will be used in #120
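For context on the test_util piece, here is a minimal sketch of the kind of junit XML Gubernator consumes; the testsuite/testcase/failure layout is the standard junit schema, while the function name and result format are hypothetical:

```python
from xml.etree import ElementTree

def write_junit_xml(path, suite_name, results):
    """results: list of (test_name, time_seconds, failure_message_or_None)."""
    failures = sum(1 for _, _, msg in results if msg)
    suite = ElementTree.Element(
        "testsuite", name=suite_name, tests=str(len(results)),
        failures=str(failures), time=str(sum(t for _, t, _ in results)))
    for name, seconds, failure in results:
        case = ElementTree.SubElement(suite, "testcase", name=name, time=str(seconds))
        if failure:
            ElementTree.SubElement(case, "failure").text = failure
    ElementTree.ElementTree(suite).write(path)

# Example: one passing and one failing test case.
write_junit_xml("junit_e2e.xml", "tf-k8s-e2e",
                [("simple_tfjob", 42.0, None),
                 ("distributed_tfjob", 73.5, "pods never reached Running")])
```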
jlewi added a commit that referenced this issue Nov 17, 2017
Create an Airflow pipeline to run E2E tests in parallel.

* The high level goal is outlined in #120.

* This commit creates an Airflow pipeline that is equivalent to our existing test; it has steps to build the artifacts, set up a GKE cluster, run the helm test, and tear down the cluster.

* We create a suitable Dockerfile and K8s deployment for running Airflow on K8s.

* The README.md provides instructions.
jlewi added a commit that referenced this issue Nov 21, 2017
Our PROW jobs should trigger our Airflow pipelines to run our tests.

This aims to address #120

This PR changes the code invoked by PROW to trigger Airflow.

We replace bootstrap.py with py/airflow.py; we use this Python script to trigger an Airflow pipeline.

We update the steps in the Airflow E2E pipeline to create junit XML files in GCS so Gubernator can compute the results of the test.

This PR introduces a regression into our PROW jobs in that the current pipeline doesn't run lint checks on our Python code; we will fix that in a subsequent PR.

Our Prow job uses the same base container as our Airflow deployment; the only difference is the entrypoint.

We get rid of main.go because our Airflow pipeline takes care of setting up the CRD and cluster.
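A minimal sketch of the kind of trigger logic a script like py/airflow.py needs; the real mechanism may differ, the DAG id, conf payload, and GCS bucket here are hypothetical, and this assumes the Airflow CLI is available to the Prow container:

```python
import json
import os
import subprocess
import uuid

def trigger_e2e_pipeline(dag_id="tf_k8s_e2e"):
    # PULL_NUMBER / PULL_PULL_SHA are standard environment variables set by Prow.
    conf = {
        "PULL_NUMBER": os.environ.get("PULL_NUMBER", ""),
        "PULL_PULL_SHA": os.environ.get("PULL_PULL_SHA", ""),
        "ARTIFACTS_PATH": "gs://some-test-bucket/logs/" + uuid.uuid4().hex,  # hypothetical bucket
    }
    run_id = "prow_" + uuid.uuid4().hex
    subprocess.check_call([
        "airflow", "trigger_dag", dag_id,
        "--run_id", run_id,
        "--conf", json.dumps(conf),
    ])
    return run_id
```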
jlewi closed this as completed Jan 25, 2018