Extend Test Infrastructure to run multiple E2E tests in parallel #120
Comments
For option 3, do you know the status of https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=71013666?
I know that there will be a meetup quite soon.
@foxish could provide more information about the status of Airflow on K8s.
We'll present a demo at Kubernetes sig-apps soon as well. The branch everyone is currently committing to is https://github.com/bloomberg/airflow/tree/airflow-kubernetes-executor. Some discussion remains around credential management and storage, but mostly the executor is in pretty good shape. I think the expectation is for the basic first PR to be upstreamed in the next month or so. cc/ @dimberman
Nice, I was following that branch some time ago.
I think option 3 is more cutting-edge but probably the most future-proof.
I took a look at GitLab CI/CD and I think I prefer the K8s + Airflow approach.
It's not obvious to me that the cognitive load of using a different set of tools for CI/CD (compared to data science/ML pipelines) is worth the slicker UI and better out-of-the-box support for some common CI/CD tasks. I like the approach Prow has taken of building a set of microservices and would rather continue in that direction than adopt a monolithic tool. To that end, I'd rather evolve a set of reusable scripts for common CI/CD tasks that can easily be reused with Airflow + K8s.
We had some internal efforts trying out option 3, except that we went with Luigi, which was more mature at the time than Airflow (then still incubating). I would echo the claim that Airflow is more future-proof.
* deploy provides commands to provision resources (K8s clusters) needed for E2E tests.
* release.py makes some changes needed to build artifacts as part of an E2E pipeline.
* util.py adds a workaround for an issue with the Kubernetes Python client library that prevents credentials from working correctly when using a service account.
* test_util provides routines for creating junit XML files; these will be used to create results for Gubernator.
* These binaries will be used in #120.
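For illustration only, here is a minimal sketch (not the actual test_util code) of emitting a junit XML file in the shape Gubernator can parse; the function name and fields are assumptions:

```python
# Hypothetical sketch of writing a minimal junit XML result file.
import xml.etree.ElementTree as ET

def create_junit_xml(path, suite_name, failures):
    """Write a junit XML file; failures is a list of (test_name, message)."""
    suite = ET.Element(
        "testsuite", name=suite_name,
        tests=str(len(failures) or 1), failures=str(len(failures)))
    if not failures:
        # Emit a single passing test case so the suite is non-empty.
        ET.SubElement(suite, "testcase", name=suite_name, time="0")
    for name, message in failures:
        case = ET.SubElement(suite, "testcase", name=name, time="0")
        ET.SubElement(case, "failure", message=message)
    ET.ElementTree(suite).write(path, xml_declaration=True, encoding="utf-8")

create_junit_xml("junit_e2e.xml", "tf-operator-e2e", [])
```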
Create an Airflow pipeline to run E2E tests in parallel.
* The high-level goal is outlined in #120.
* This commit creates an Airflow pipeline that is equivalent to our existing test; it has steps to build the artifacts, set up a GKE cluster, run the helm test, and tear down the cluster.
* We create a suitable Dockerfile and K8s deployment for running Airflow on K8s.
* The README.md provides instructions.
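For context, a rough sketch of an Airflow 1.x-style DAG with the same shape as the pipeline described above. The dag_id, task ids, and the exact commands (the deploy.py / release.py subcommands) are assumptions for illustration, not the committed code:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Manually triggered DAG (no schedule); the Prow job kicks it off.
dag = DAG("tf_k8s_e2e", start_date=datetime(2017, 1, 1), schedule_interval=None)

build = BashOperator(task_id="build_artifacts",
                     bash_command="python release.py build", dag=dag)
setup = BashOperator(task_id="setup_gke_cluster",
                     bash_command="python deploy.py setup", dag=dag)
helm_test = BashOperator(task_id="run_helm_test",
                         bash_command="python deploy.py test", dag=dag)
teardown = BashOperator(task_id="teardown_cluster",
                        bash_command="python deploy.py teardown", dag=dag)

# Linear pipeline for the single existing test; additional tests would fan
# out between setup and teardown.
build >> setup >> helm_test >> teardown
```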
Our Prow jobs should trigger our Airflow pipelines to run our tests. This aims to address #120. This PR changes the code invoked by Prow to trigger Airflow:
* We replace bootstrap.py with py/airflow.py; we use this Python script to trigger an Airflow pipeline.
* We update the steps in the Airflow E2E pipeline to create junit XML files in GCS so Gubernator can compute the results of the test.
* This PR introduces a regression into our Prow jobs in that the current pipeline doesn't run lint checks on our Python code; we will fix that in a subsequent PR.
* Our Prow job uses the same base container as our Airflow deployment; the only difference is the entrypoint.
* We get rid of main.go because our Airflow pipeline takes care of setting up the CRD and cluster.
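A hedged sketch of what a trigger script like py/airflow.py might do, assuming the Airflow CLI is available in the Prow container and points at the deployed pipeline's metadata DB; the dag_id and environment variable names are illustrative, not the actual PR contents:

```python
import json
import os
import subprocess

def trigger_e2e_pipeline(dag_id="tf_k8s_e2e"):
    # Pass the Prow job metadata to the pipeline via --conf so downstream
    # tasks can name artifacts and junit files consistently.
    conf = json.dumps({
        "PULL_NUMBER": os.environ.get("PULL_NUMBER", ""),
        "PULL_PULL_SHA": os.environ.get("PULL_PULL_SHA", ""),
        "BUILD_NUMBER": os.environ.get("BUILD_NUMBER", ""),
    })
    subprocess.check_call(["airflow", "trigger_dag", dag_id, "--conf", conf])

if __name__ == "__main__":
    trigger_e2e_pipeline()
```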
We need to extend the E2E test infrastructure to run multiple tests in parallel. The diagram below illustrates the expected flow.
The sequence of events is: build the artifacts, set up one or more K8s clusters, run the tests in parallel, and tear down the clusters.
* We potentially need multiple clusters to test different features (e.g. GPUs) or different versions of K8s.
* Each test will exercise different TfJob features (e.g. non-distributed vs. distributed).
So the sequence of events is well represented by the DAG illustrated in the diagram. I'd like to find a convenient way to express and execute the DAG, preferably without rolling our own workflow system.
Option 1 One Prow Job Per Test
We could create one Prow job for each test and rely on Prow to run the tests in parallel.
Disadvantages
Option 2 Use Ginkgo
Advantages
Disadvantages
* Ginkgo is a Go framework. I'm increasingly leaning towards using Python, not Go, for our CI/CT. Since the TensorFlow test programs will be written in Python, I think Python is a good choice for CT.
Option 3 Airflow
We can express the DAG illustrated above as an Airflow pipeline. The Prow job can then just trigger the pipeline and upload artifacts to GCS as needed.
Advantages
* This should make it easy to deploy/manage Airflow on K8s
* Should make it easy to express the steps in the DAG as K8s Jobs and orchestrate them using Airflow (see the sketch after this list)
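As a minimal sketch of that last point, assuming the official kubernetes Python client: an Airflow task callable could submit one test as a batch/v1 Job. The image, namespace, and names below are placeholders, not part of the proposal:

```python
from kubernetes import client, config

def run_test_as_job(name="tfjob-e2e-simple",
                    image="gcr.io/my-project/e2e-test:latest"):
    # Use load_incluster_config() instead when Airflow itself runs on the cluster.
    config.load_kube_config()
    container = client.V1Container(name=name, image=image)
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"job": name}),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"))
    job = client.V1Job(
        api_version="batch/v1", kind="Job",
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(template=template))
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```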
Disadvantages