diff --git a/README.md b/README.md index 9e7e1358449..f93b4db31fe 100644 --- a/README.md +++ b/README.md @@ -7,8 +7,7 @@ [![Coverage Status](https://coveralls.io/repos/github/kubeflow/katib/badge.svg?branch=master)](https://coveralls.io/github/kubeflow/katib?branch=master) [![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/katib)](https://goreportcard.com/report/github.com/kubeflow/katib) -Katib is a Kubernetes Native System for [Hyperparameter Tuning][1] and [Neural Architecture Search][2]. -The system is inspired by [Google vizier][3] and supports multiple ML/DL frameworks (e.g. TensorFlow, Apache MXNet, and PyTorch). +Katib is a Kubernetes-based system for [Hyperparameter Tuning][1] and [Neural Architecture Search][2]. Katib supports a number of ML frameworks, including TensorFlow, Apache MXNet, PyTorch, XGBoost, and others. Table of Contents ================= @@ -30,7 +29,7 @@ Table of Contents * [Running examples](#running-examples) * [Cleanups](#cleanups) * [Quick Start](#quick-start) - * [Who are using katib?](#who-are-using-katib) + * [Who are using Katib?](#who-are-using-katib) * [CONTRIBUTING](#contributing) Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc) @@ -43,7 +42,7 @@ on the Kubeflow website. ## Name -Katib stands for `secretary` in Arabic. As `Vizier` stands for a high official or a prime minister in Arabic, this project Katib is named in the honor of Vizier. +Katib stands for `secretary` in Arabic. ## Concepts in Katib @@ -86,11 +85,17 @@ Thus, Katib supports multiple frameworks with the help of different job kinds. 
Currently Katib supports the following exploration algorithms: -* random search -* grid search -* [hyperband](https://arxiv.org/pdf/1603.06560.pdf) -* [bayesian optimization](https://arxiv.org/pdf/1012.2599.pdf) -* [NAS based on reinforcement learning](https://github.com/kubeflow/katib/tree/master/pkg/suggestion/v1alpha3/NAS_Reinforcement_Learning) +#### Hyperparameter Tuning + +* [Random Search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Random_search) +* [Tree of Parzen Estimators (TPE)](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf) +* [Grid Search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search) +* [Hyperband](https://arxiv.org/pdf/1603.06560.pdf) +* [Bayesian Optimization](https://arxiv.org/pdf/1012.2599.pdf) + +#### Neural Architecture Search + +* [Reinforcement Learning](https://github.com/kubeflow/katib/tree/master/pkg/suggestion/v1alpha3/NAS_Reinforcement_Learning) ## Components in Katib @@ -98,11 +103,11 @@ Currently Katib supports the following exploration algorithms: Katib consists of several components as shown below. Each component is running on k8s as a deployment. Each component communicates with others via GRPC and the API is defined at `pkg/apis/manager/v1alpha3/api.proto`. -- katib: main components. - - katib-db-manager: GRPC API server of katib which is the DB Interface. - - katib-mysql: Data storage backend of katib using mysql. - - katib-ui: User interface of katib. - - katib-controller: Controller for katib CRDs in Kubernetes. +- Katib main components: + - katib-db-manager: GRPC API server of Katib which is the DB Interface. + - katib-mysql: Data storage backend of Katib using mysql. + - katib-ui: User interface of Katib. + - katib-controller: Controller for Katib CRDs in Kubernetes. ## Web UI @@ -124,7 +129,9 @@ install Kubeflow. 
See the documentation: * [Kubeflow installation guide](https://www.kubeflow.org/docs/started/getting-started/) * [Kubeflow hyperparameter tuning -guides](https://www.kubeflow.org/docs/components/hyperparameter-tuning/). +guides](https://www.kubeflow.org/docs/components/hyperparameter-tuning/). + +If you install Katib with other Kubeflow components, you can't submit Katib jobs in the Kubeflow namespace. Alternatively, if you want to install Katib manually, follow these steps: @@ -181,12 +188,13 @@ metadata: type: local app: katib spec: + storageClassName: katib capacity: storage: 10Gi accessModes: - ReadWriteOnce hostPath: - path: /data/katib + path: /tmp/katib ``` Create this pv after deploying Katib package @@ -337,7 +345,7 @@ Delete installed components using `kubectl delete -f` on the respective folders. Please see [Quick Start Guide](./docs/quick-start.md) -## Who are using katib? +## Who are using Katib? Please see [adopters.md](./docs/community/adopters.md) diff --git a/ROADMAP.md b/ROADMAP.md index bb54e9ec51c..bd318809ec7 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -1,25 +1,21 @@ -# Katib 2019 Roadmap +# Katib 2020 Roadmap -This document provides a high level view of where Katib will grow in 2019. These objectives are based on Katib's Critical User Journey (CUJ), -which can be found [here](https://bit.ly/2QNKMwt). +This document provides a high level view of where Katib will grow in 2020. The original Katib design document can be found [here](https://docs.google.com/document/d/1ZEKhou4z1utFTOgjzhSsnvysJFNEJmygllgDCBnYvm8/edit#heading=h.7fzqir88ovr). # Katib 1.0 Readiness -* Stabilize APIs for StudyJobs - * Beta by end of Q2, 1.0 by end of Q4 - * Formalize naming conventions (we use different names like katib vs vizier in different places) - * Refactor studyjob field names [#351](https://github.com/kubeflow/katib/issues/351) - * Rename fields so their names are more meaningful (e.g. 
requestCount vs requestNumber) [#161](https://github.com/kubeflow/katib/issues/161) -* Fully integrate katib with existing E2E examples: +* Stabilize APIs for Experiments + * Reconsider the design of Trial Template [#906](https://github.com/kubeflow/katib/issues/906) + * Early Stopping [#692](https://github.com/kubeflow/katib/issues/692) + * Resuming Experiment [#1061](https://github.com/kubeflow/katib/issues/1061), [#1062](https://github.com/kubeflow/katib/issues/1062) +* Fully integrate Katib with existing E2E examples: * Xgboost * Mnist * GitHub issue summarization * Publish API documentation, best practices, tutorials * [Issues list](https://github.com/kubeflow/katib/issues) -* [Issues for 0.5.0 release](https://github.com/kubeflow/katib/labels/area%2F0.5.0) - # Enhance HP Tuning Experience @@ -32,36 +28,14 @@ Integration with KF distributed training components * PyTorch * Allow Katib to support other operator types generically [#341](https://github.com/kubeflow/katib/issues/341) -## 2. Configuring a Study -* Streamlining the StudyJob schema - providing simpler ways to write worker specs and metric collector specs. -* Expose more information in StudyJob status fields - * List all job conditions with details [#344](https://github.com/kubeflow/katib/issues/344) - * Returning study metadata such as number of trials and best hyperparameter values so far [#356](https://github.com/kubeflow/katib/issues/356) -* Integration with Jupyter notebooks and Fairing [#355](https://github.com/kubeflow/katib/issues/355) - * Allow users to start with an existing model from a notebook and do HP tuning with minimal code changes -* Allowing a StudyJob to be resumed with additional trials [#346](https://github.com/kubeflow/katib/issues/346) -* Generating StudyJob configurations and launching StudyJobs through UI +## 2. 
Configuring an Experiment * Supporting additional suggestion algorithms [#15](https://github.com/kubeflow/katib/issues/15) -* Support for StudyJob deployment in a different namespace [#343](https://github.com/kubeflow/katib/issues/343) - ## 3. Tracking Model Performance -* Enhance metrics collection - * May need to revisit the design - use a push model instead of pull model? * UI enhancements: allowing data scientists to visualize results easier * Support for persistent model and metadata storage * Ideally users should be able to export and reuse trained models from a common storage - -# Other Features - -Designs are pending for the following new features: -* Multi-Tenancy Support -* [NAS](https://docs.google.com/document/d/1qGWy-C5XSQmh82XYoMcJ_JWLHwmyvdMRjCkFMfkO0vE/edit) -* Batch scheduling -* [Integration with Pipelines](https://github.com/kubeflow/katib/issues/331) -* Early stopping feature - # Test and Release Infrastructure * Improve e2e test coverage diff --git a/docs/community/adopters.md b/docs/community/adopters.md index 9e8cd3092e4..0c61b0ddb54 100644 --- a/docs/community/adopters.md +++ b/docs/community/adopters.md @@ -1,6 +1,6 @@ # Adopters of Kubeflow Katib -Below are the adopters of project Katib. If you are using katib +Below are the adopters of project Katib. If you are using Katib please add yourself into the following list by a pull request. 
| Organization | Contact | Description of Use | diff --git a/docs/developer-guide.md b/docs/developer-guide.md index 55a62160117..98dec05c5d2 100644 --- a/docs/developer-guide.md +++ b/docs/developer-guide.md @@ -6,7 +6,7 @@ Table of Contents * [Requirements](#requirements) * [Build from source code](#build-from-source-code) * [Workflow design](#workflow-design) - * [Implement a new algorithm and use it in katib](#implement-a-new-algorithm-and-use-it-in-katib) + * [Implement a new algorithm and use it in Katib](#implement-a-new-algorithm-and-use-it-in-katib) * [Create a new Trial kind](#create-a-new-trial-kind) * [Algorithm settings documentation](#algorithm-settings-documentation) * [Design proposals](#design-proposals) @@ -39,13 +39,13 @@ Check source code as follows: make build ``` -You can deploy katib v1alpha3 manifests into a k8s cluster as follows: +You can deploy Katib v1alpha3 manifests into a k8s cluster as follows: ```bash make deploy ``` -You can undeploy katib v1alpha3 manifests from a k8s cluster as follows: +You can undeploy Katib v1alpha3 manifests from a k8s cluster as follows: ```bash make undeploy @@ -55,7 +55,7 @@ make undeploy Please see [workflow-design.md](./workflow-design.md) -## Implement a new algorithm and use it in katib +## Implement a new algorithm and use it in Katib Please see [new-algorithm-service.md](./new-algorithm-service.md) diff --git a/docs/new-algorithm-service.md b/docs/new-algorithm-service.md index 75932ce980b..7c47e139ea0 100644 --- a/docs/new-algorithm-service.md +++ b/docs/new-algorithm-service.md @@ -1,14 +1,14 @@ -# Document about how to add a new algorithm in katib +# Document about how to add a new algorithm in Katib -## Implement a new algorithm and use it in katib +## Implement a new algorithm and use it in Katib ### Implement the algorithm -The design of katib follows the [`ask-and-tell` pattern](https://scikit-optimize.github.io/notebooks/ask-and-tell.html): +The design of Katib follows the `ask-and-tell` 
pattern: > They often follow a pattern a bit like this: 1. ask for a new set of parameters 1. walk to the experiment and program in the new parameters 1. observe the outcome of running the experiment 1. walk back to your laptop and tell the optimizer about the outcome 1. go to step 1 -When an experiment is created, one algorithm service will be created. Then katib asks for new sets of parameters via `GetSuggestions` GRPC call. After that, katib creates new trials according to the sets and observe the outcome. When the trials are finished, katib tells the metrics of the finished trials to the algorithm, and ask another new sets. +When an experiment is created, one algorithm service will be created. Then Katib asks for new sets of parameters via the `GetSuggestions` GRPC call. After that, Katib creates new trials according to the sets and observes the outcome. When the trials are finished, Katib reports the metrics of the finished trials to the algorithm and asks for another new set. The new algorithm needs to implement `Suggestion` service defined in [api.proto](../pkg/apis/manager/v1alpha3/api.proto). One sample algorithm looks like: @@ -87,7 +87,7 @@ Create a package under [cmd/suggestion](../cmd/suggestion). Then create the main Here is an example: [cmd/suggestion/hyperopt](../cmd/suggestion/hyperopt). Then build the Docker image. -### Use the algorithm in katib. +### Use the algorithm in Katib. Update the [katib-config](../manifests/v1alpha3/katib-controller/katib-config.yaml), add a new object: @@ -106,9 +106,9 @@ Update the [katib-config](../manifests/v1alpha3/katib-controller/katib-config.ya } ``` -### Contribute the algorithm to katib +### Contribute the algorithm to Katib -If you want to contribute the algorithm to katib, you could add unit test or e2e test for it in CI and submit a PR. +If you want to contribute the algorithm to Katib, you could add a unit test or e2e test for it in CI and submit a PR. 
#### Unit Test @@ -142,9 +142,14 @@ You can setup the GRPC server using `grpc_testing`, then define you own test cas #### E2E Test (Optional) -E2e tests help katib verify that the algorithm works well. To add a e2e test for the new algorithm, you need to: +E2e tests help Katib verify that the algorithm works well. +To add an e2e test for the new algorithm, in [test/scripts/v1alpha3](../test/scripts/v1alpha3) you need to: -Create a new script `run-suggestion-xxx.sh` in [test/scripts/v1alpha3](../test/scripts/v1alpha3). Here is an example [test/scripts/v1alpha3/build-suggestion-hyperopt.sh](../test/scripts/v1alpha3/build-suggestion-hyperopt.sh) (Replace `` with the new algorithm name): +1. Create a new Experiment yaml file in [examples/v1alpha3](../examples/v1alpha3) with the new algorithm. + +2. Create a new script `build-suggestion-xxx.sh` to build the new suggestion. Here is an example [test/scripts/v1alpha3/build-suggestion-hyperopt.sh](../test/scripts/v1alpha3/build-suggestion-hyperopt.sh). + +3. Create a new script `run-suggestion-xxx.sh` to run the new suggestion. Below is an example (Replace `` with the new algorithm name): ```bash #!/bin/bash diff --git a/docs/new-trial-kind.md b/docs/new-trial-kind.md index ac9d2d949ff..62dad6f060c 100644 --- a/docs/new-trial-kind.md +++ b/docs/new-trial-kind.md @@ -1,4 +1,4 @@ -# Document about how to support a new Kubernetes resource in katib trial +# Document about how to support a new Kubernetes resource in Katib trial ## Update the supported list @@ -27,7 +27,7 @@ func GetSupportedJobList() []schema.GroupVersionKind { } ``` -In this function, we define the Kubernetes `GroupVersionKind` that are supported in katib. If you want to add a new kind, please append the `supportedJobList`. +In this function, we define the Kubernetes `GroupVersionKind` that are supported in Katib. If you want to add a new kind, please append it to the `supportedJobList`. 
## Update logic about status update @@ -70,7 +70,7 @@ The function is used to determine which container in the job is the actual main ### Add logic about how to determine the master pod -In katib, we only inject metrics collector sidecar into the master pod (See [metrics-collector.md](./proposals/metrics-collector.md) for more details). Thus we need to update the `JobRoleMap` in [const.go](../pkg/webhook/v1alpha3/pod/const.go). +In Katib, we only inject metrics collector sidecar into the master pod (See [metrics-collector.md](./proposals/metrics-collector.md) for more details). Thus we need to update the `JobRoleMap` in [const.go](../pkg/webhook/v1alpha3/pod/const.go). ```go var JobRoleMap = map[string][]string{ diff --git a/docs/proposals/metrics-collector.md b/docs/proposals/metrics-collector.md index 0f219c34903..7b17ab42444 100644 --- a/docs/proposals/metrics-collector.md +++ b/docs/proposals/metrics-collector.md @@ -13,7 +13,7 @@ ## Links -- [katib/issues#685 (katib metrics collector solution)](https://github.com/kubeflow/katib/issues/685) +- [katib/issues#685 (Katib metrics collector solution)](https://github.com/kubeflow/katib/issues/685) - [katib/pull#697 (API for metricCollector)](https://github.com/kubeflow/katib/pull/697#issuecomment-516264282) - [katib/pull#716 (Add pod level inject webhook)](https://github.com/kubeflow/katib/pull/716) - [katib/pull#729 (Inject pod sidecar for specified namespace)](https://github.com/kubeflow/katib/pull/729) @@ -29,7 +29,7 @@ The cron job pulls the targeted pod logs periodically and then persist the logs However, the pulled-based design has [some problems](https://github.com/kubeflow/tf-operator/issues/722#issuecomment-405669269), such as, at what frequency should we scrape the metrics and so on. To enhance the extensibility and support EarlyStopping, we propose a new design of the metrics collector. 
-In the new design, katib use mutating webhook to inject metrics collector container as a sidecar into Job/Tfjob/PytorchJob pod. +In the new design, Katib uses a mutating webhook to inject the metrics collector container as a sidecar into the Job/Tfjob/PytorchJob pod. The sidecar collects metrics of the master and then store them on the persistent layer (e.x. katib-db-manager and metadata server).
@@ -116,7 +116,7 @@ For more detail, see [here](https://github.com/kubeflow/katib/pull/697#issuecomm ### Mutating Webhook To avoid collecting duplicated metrics, as we discuss in [kubeflow/katib#685](https://github.com/kubeflow/katib/issues/685), only one metrics collector sidecar will be injected into the master pod during one Experiment. -In the new design, there are two modes for katib mutating webhook to inject the sidecar: **Pod Level Injecting** and **Job Level Injecting**. +In the new design, there are two modes for Katib mutating webhook to inject the sidecar: **Pod Level Injecting** and **Job Level Injecting**. The webhook decides which mode to be used based on the `katib-metricscollector-injection=enabled` label tagged on the namespace. In the namespace with `katib-metricscollector-injection=enabled` label, the webhook inject the sidecar in the pod level. Otherwise, without this label, injecting in the job level. diff --git a/docs/proposals/suggestion.md b/docs/proposals/suggestion.md index a235cfe359e..d6dc6e22c9f 100644 --- a/docs/proposals/suggestion.md +++ b/docs/proposals/suggestion.md @@ -27,9 +27,9 @@ Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc) ## Background -Katib makes suggestions long-running in v1alpha3. And the suggestions need to communicate with katib DB manager to get experiments and trials from katib db driver. This design hurts high availability. +Katib makes suggestions long-running in v1alpha3. And the suggestions need to communicate with Katib DB manager to get experiments and trials from Katib db driver. This design hurts high availability. -Thus we proposed a new design to implement a CRD for suggestion and remove katib db communication from main workflow. The new design simplifies the implmentation of experiment and trial controller, and makes katib Kubernetes native. +Thus we proposed a new design to implement a CRD for suggestion and remove Katib db communication from main workflow. 
The new design simplifies the implementation of experiment and trial controller, and makes Katib Kubernetes native. This document is to illustrate the details of the new design. @@ -365,7 +365,7 @@ status: ### Random -We can use the implementation in katib or [hyperopt](https://github.com/hyperopt/hyperopt). +We can use the implementation in Katib or [hyperopt](https://github.com/hyperopt/hyperopt). ### Grid diff --git a/docs/quick-start.md b/docs/quick-start.md index a4df815a15a..eb78fc51997 100644 --- a/docs/quick-start.md +++ b/docs/quick-start.md @@ -30,7 +30,7 @@ In this quick start guide, we demonstrate how to use TensorFlow in Katib, which ### Package Training Code -The first thing we need to do is to package the training code to a docker image. We use the [example code](../examples/v1alpha3/mnist-tensorflow/mnist_with_summaries.py), which builds a simple neural network, to train on MNIST. The code trains the network and outputs the TFEvents to `/tmp` by default. +The first thing we need to do is to package the training code to a docker image. We use the [example code](https://github.com/kubeflow/tf-operator/blob/master/examples/v1/mnist_with_summaries/mnist_with_summaries.py), which builds a simple neural network, to train on MNIST. The code trains the network and outputs the TFEvents to `/tmp` by default. You can use our prebuilt image `gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0`. Thus we can skip it. 
@@ -121,7 +121,7 @@ The experiment has two hyperparameters defined in `parameters`: `--learning_ra Or you could just run: ```bash -kubectl apply -f ./examples/v1alpha3/tfjob-example.yaml +kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/tfjob-example.yaml ``` ### Get trial results diff --git a/docs/workflow-design.md b/docs/workflow-design.md index ae42af09542..38233fd2e90 100644 --- a/docs/workflow-design.md +++ b/docs/workflow-design.md @@ -59,10 +59,10 @@ spec: spec: containers: - name: {{.Trial}} - image: docker.io/katib/mxnet-mnist-example + image: docker.io/kubeflowkatib/mxnet-mnist command: - - "python" - - "/mxnet/example/image-classification/train_mnist.py" + - "python3" + - "/opt/mxnet-mnist/mnist.py" - "--batch-size=64" {{- with .HyperParameters}} {{- range .}} @@ -131,10 +131,10 @@ spec: spec: containers: - name: random-example-fm2g6jpj - image: docker.io/katib/mxnet-mnist-example + image: docker.io/kubeflowkatib/mxnet-mnist command: - - "python" - - "/mxnet/example/image-classification/train_mnist.py" + - "python3" + - "/opt/mxnet-mnist/mnist.py" - "--batch-size=64" - "--lr=0.027435456064371484" - "--num-layers=4"