Update Katib docs #1066

Merged
merged 3 commits into from
Feb 25, 2020
42 changes: 25 additions & 17 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,8 +7,7 @@
[![Coverage Status](https://coveralls.io/repos/github/kubeflow/katib/badge.svg?branch=master)](https://coveralls.io/github/kubeflow/katib?branch=master)
[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/katib)](https://goreportcard.com/report/github.com/kubeflow/katib)

Katib is a Kubernetes Native System for [Hyperparameter Tuning][1] and [Neural Architecture Search][2].
The system is inspired by [Google vizier][3] and supports multiple ML/DL frameworks (e.g. TensorFlow, Apache MXNet, and PyTorch).
Katib is a Kubernetes-based system for [Hyperparameter Tuning][1] and [Neural Architecture Search][2]. Katib supports a number of ML frameworks, including TensorFlow, Apache MXNet, PyTorch, XGBoost, and others.

Table of Contents
=================
Expand All @@ -30,7 +29,7 @@ Table of Contents
* [Running examples](#running-examples)
* [Cleanups](#cleanups)
* [Quick Start](#quick-start)
* [Who are using katib?](#who-are-using-katib)
* [Who are using Katib?](#who-are-using-katib)
* [CONTRIBUTING](#contributing)

Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)
Expand All @@ -43,7 +42,7 @@ on the Kubeflow website.

## Name

Katib stands for `secretary` in Arabic. As `Vizier` stands for a high official or a prime minister in Arabic, this project Katib is named in the honor of Vizier.
Katib stands for `secretary` in Arabic.

## Concepts in Katib

Expand Down Expand Up @@ -86,23 +85,29 @@ Thus, Katib supports multiple frameworks with the help of different job kinds.

Currently Katib supports the following exploration algorithms:

* random search
* grid search
* [hyperband](https://arxiv.org/pdf/1603.06560.pdf)
* [bayesian optimization](https://arxiv.org/pdf/1012.2599.pdf)
* [NAS based on reinforcement learning](https://github.com/kubeflow/katib/tree/master/pkg/suggestion/v1alpha3/NAS_Reinforcement_Learning)
#### Hyperparameter Tuning

* [Random Search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Random_search)
* [Tree of Parzen Estimators (TPE)](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf)
* [Grid Search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search)
* [Hyperband](https://arxiv.org/pdf/1603.06560.pdf)
* [Bayesian Optimization](https://arxiv.org/pdf/1012.2599.pdf)

#### Neural Architecture Search

* [Reinforcement Learning](https://github.com/kubeflow/katib/tree/master/pkg/suggestion/v1alpha3/NAS_Reinforcement_Learning)
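
To make the difference between the first two hyperparameter tuning algorithms concrete, here is a minimal Python sketch of grid search and random search over a small search space. It is only an illustration of the two techniques, not Katib's implementation; the function names `grid_search` and `random_search` and the parameter names `lr` and `num_layers` are assumptions chosen for the example.

```python
import itertools
import random


def grid_search(space):
    """Yield every combination of the listed values, in a fixed order."""
    names = sorted(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def random_search(bounds, n_trials, seed=0):
    """Yield n_trials assignments sampled uniformly from (low, high) ranges."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    for _ in range(n_trials):
        yield {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}


if __name__ == "__main__":
    # Hypothetical search space: 2 learning rates x 3 layer counts.
    for params in grid_search({"lr": [0.01, 0.1], "num_layers": [2, 4, 8]}):
        print("grid:", params)
    for params in random_search({"lr": (0.01, 0.1)}, n_trials=3):
        print("random:", params)
```

Grid search enumerates the full Cartesian product, so its cost grows multiplicatively with each added parameter, while random search draws a fixed number of samples regardless of how many parameters there are.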


## Components in Katib

Katib consists of several components as shown below. Each component runs on k8s as a deployment.
The components communicate with each other via gRPC, and the API is defined at `pkg/apis/manager/v1alpha3/api.proto`.

- katib: main components.
- katib-db-manager: GRPC API server of katib which is the DB Interface.
- katib-mysql: Data storage backend of katib using mysql.
- katib-ui: User interface of katib.
- katib-controller: Controller for katib CRDs in Kubernetes.
- Katib main components:
  - katib-db-manager: gRPC API server of Katib, which is the DB interface.
  - katib-mysql: Data storage backend of Katib using MySQL.
  - katib-ui: User interface of Katib.
  - katib-controller: Controller for Katib CRDs in Kubernetes.

## Web UI

Expand All @@ -124,7 +129,9 @@ install Kubeflow. See the documentation:
* [Kubeflow installation
guide](https://www.kubeflow.org/docs/started/getting-started/)
* [Kubeflow hyperparameter tuning
guides](https://www.kubeflow.org/docs/components/hyperparameter-tuning/).
guides](https://www.kubeflow.org/docs/components/hyperparameter-tuning/).

If you install Katib with other Kubeflow components, you can't submit Katib jobs in the Kubeflow namespace.

Alternatively, if you want to install Katib manually, follow these steps:

Expand Down Expand Up @@ -181,12 +188,13 @@ metadata:
type: local
app: katib
spec:
storageClassName: katib
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
hostPath:
path: /data/katib
path: /tmp/katib
```

Create this pv after deploying Katib package
Expand Down Expand Up @@ -337,7 +345,7 @@ Delete installed components using `kubectl delete -f` on the respective folders.

Please see [Quick Start Guide](./docs/quick-start.md)

## Who are using katib?
## Who are using Katib?

Please see [adopters.md](./docs/community/adopters.md)

42 changes: 8 additions & 34 deletions ROADMAP.md
Original file line number Diff line number Diff line change
@@ -1,25 +1,21 @@
# Katib 2019 Roadmap
# Katib 2020 Roadmap

This document provides a high level view of where Katib will grow in 2019. These objectives are based on Katib's Critical User Journey (CUJ),
which can be found [here](https://bit.ly/2QNKMwt).
This document provides a high-level view of where Katib will grow in 2020.

The original Katib design document can be found [here](https://docs.google.com/document/d/1ZEKhou4z1utFTOgjzhSsnvysJFNEJmygllgDCBnYvm8/edit#heading=h.7fzqir88ovr).

# Katib 1.0 Readiness

* Stabilize APIs for StudyJobs
* Beta by end of Q2, 1.0 by end of Q4
* Formalize naming conventions (we use different names like katib vs vizier in different places)
* Refactor studyjob field names [#351](https://github.com/kubeflow/katib/issues/351)
* Rename fields so their names are more meaningful (e.g. requestCount vs requestNumber) [#161](https://github.com/kubeflow/katib/issues/161)
* Fully integrate katib with existing E2E examples:
* Stabilize APIs for Experiments
* Reconsider the design of Trial Template [#906](https://github.com/kubeflow/katib/issues/906)
* Early Stopping [#692](https://github.com/kubeflow/katib/issues/692)
* Resuming Experiment [#1061](https://github.com/kubeflow/katib/issues/1061), [#1062](https://github.com/kubeflow/katib/issues/1062)
* Fully integrate Katib with existing E2E examples:
* Xgboost
* Mnist
* GitHub issue summarization
* Publish API documentation, best practices, tutorials
* [Issues list](https://github.com/kubeflow/katib/issues)
* [Issues for 0.5.0 release](https://github.com/kubeflow/katib/labels/area%2F0.5.0)


# Enhance HP Tuning Experience

Expand All @@ -32,36 +28,14 @@ Integration with KF distributed training components
* PyTorch
* Allow Katib to support other operator types generically [#341](https://github.com/kubeflow/katib/issues/341)

## 2. Configuring a Study
* Streamlining the StudyJob schema - providing simpler ways to write worker specs and metric collector specs.
* Expose more information in StudyJob status fields
* List all job conditions with details [#344](https://github.com/kubeflow/katib/issues/344)
* Returning study metadata such as number of trials and best hyperparameter values so far [#356](https://github.com/kubeflow/katib/issues/356)
* Integration with Jupyter notebooks and Fairing [#355](https://github.com/kubeflow/katib/issues/355)
* Allow users to start with an existing model from a notebook and do HP tuning with minimal code changes
* Allowing a StudyJob to be resumed with additional trials [#346](https://github.com/kubeflow/katib/issues/346)
* Generating StudyJob configurations and launching StudyJobs through UI
## 2. Configuring an Experiment
* Supporting additional suggestion algorithms [#15](https://github.com/kubeflow/katib/issues/15)
* Support for StudyJob deployment in a different namespace [#343](https://github.com/kubeflow/katib/issues/343)


## 3. Tracking Model Performance
* Enhance metrics collection
* May need to revisit the design - use a push model instead of pull model?
* UI enhancements: allowing data scientists to visualize results easier
* Support for persistent model and metadata storage
* Ideally users should be able to export and reuse trained models from a common storage


# Other Features

Designs are pending for the following new features:
* Multi-Tenancy Support
* [NAS](https://docs.google.com/document/d/1qGWy-C5XSQmh82XYoMcJ_JWLHwmyvdMRjCkFMfkO0vE/edit)
* Batch scheduling
* [Integration with Pipelines](https://github.com/kubeflow/katib/issues/331)
* Early stopping feature

# Test and Release Infrastructure

* Improve e2e test coverage
2 changes: 1 addition & 1 deletion docs/community/adopters.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Adopters of Kubeflow Katib

Below are the adopters of project Katib. If you are using katib
Below are the adopters of project Katib. If you are using Katib
please add yourself into the following list by a pull request.

| Organization | Contact | Description of Use |
8 changes: 4 additions & 4 deletions docs/developer-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ Table of Contents
* [Requirements](#requirements)
* [Build from source code](#build-from-source-code)
* [Workflow design](#workflow-design)
* [Implement a new algorithm and use it in katib](#implement-a-new-algorithm-and-use-it-in-katib)
* [Implement a new algorithm and use it in Katib](#implement-a-new-algorithm-and-use-it-in-katib)
* [Create a new Trial kind](#create-a-new-trial-kind)
* [Algorithm settings documentation](#algorithm-settings-documentation)
* [Design proposals](#design-proposals)
Expand Down Expand Up @@ -39,13 +39,13 @@ Check source code as follows:
make build
```

You can deploy katib v1alpha3 manifests into a k8s cluster as follows:
You can deploy Katib v1alpha3 manifests into a k8s cluster as follows:

```bash
make deploy
```

You can undeploy katib v1alpha3 manifests from a k8s cluster as follows:
You can undeploy Katib v1alpha3 manifests from a k8s cluster as follows:

```bash
make undeploy
Expand All @@ -55,7 +55,7 @@ make undeploy

Please see [workflow-design.md](./workflow-design.md)

## Implement a new algorithm and use it in katib
## Implement a new algorithm and use it in Katib

Please see [new-algorithm-service.md](./new-algorithm-service.md)

23 changes: 14 additions & 9 deletions docs/new-algorithm-service.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,14 @@
# Document about how to add a new algorithm in katib
# Document about how to add a new algorithm in Katib

## Implement a new algorithm and use it in katib
## Implement a new algorithm and use it in Katib

### Implement the algorithm

The design of katib follows the [`ask-and-tell` pattern](https://scikit-optimize.github.io/notebooks/ask-and-tell.html):
The design of Katib follows the `ask-and-tell` pattern:

> They often follow a pattern a bit like this: 1. ask for a new set of parameters 1. walk to the experiment and program in the new parameters 1. observe the outcome of running the experiment 1. walk back to your laptop and tell the optimizer about the outcome 1. go to step 1

When an experiment is created, one algorithm service will be created. Then katib asks for new sets of parameters via `GetSuggestions` GRPC call. After that, katib creates new trials according to the sets and observe the outcome. When the trials are finished, katib tells the metrics of the finished trials to the algorithm, and ask another new sets.
When an experiment is created, one algorithm service is created for it. Katib then asks the service for new sets of parameters via the `GetSuggestions` gRPC call, creates new trials from those sets, and observes the outcomes. When the trials finish, Katib reports their metrics to the algorithm and asks for another batch of parameter sets.
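
The ask-and-tell loop described here can be sketched in plain Python. This is a simplified, in-process illustration only: the `RandomSuggestion` class, its `ask` and `tell` methods, and the toy `run_trial` objective are hypothetical stand-ins for the algorithm service, the `GetSuggestions` call, and a real training job, and none of them are part of Katib's API.

```python
import random


class RandomSuggestion:
    """Hypothetical stand-in for an algorithm service (not Katib's API)."""

    def __init__(self, low, high, seed=0):
        self._rng = random.Random(seed)
        self._low, self._high = low, high
        self.observations = []  # (params, metric) pairs reported so far

    def ask(self, count):
        """Return `count` new parameter sets (the 'ask' step)."""
        return [{"lr": self._rng.uniform(self._low, self._high)} for _ in range(count)]

    def tell(self, params, metric):
        """Record the outcome of a finished trial (the 'tell' step)."""
        self.observations.append((params, metric))


def run_trial(params):
    """Toy objective standing in for a training job; lower is better."""
    return (params["lr"] - 0.05) ** 2


def run_experiment(algorithm, rounds, parallel_trials):
    """Alternate between asking for parameters and reporting metrics."""
    for _ in range(rounds):
        for params in algorithm.ask(parallel_trials):
            algorithm.tell(params, run_trial(params))
    # Best observation is the one with the smallest metric.
    return min(algorithm.observations, key=lambda obs: obs[1])


if __name__ == "__main__":
    best_params, best_metric = run_experiment(RandomSuggestion(0.01, 0.1), rounds=3, parallel_trials=2)
    print(best_params, best_metric)
```

A random sampler ignores what it is told, but an algorithm such as Bayesian optimization would use the recorded observations inside `ask` to choose the next parameter sets.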

The new algorithm needs to implement `Suggestion` service defined in [api.proto](../pkg/apis/manager/v1alpha3/api.proto). One sample algorithm looks like:

Expand Down Expand Up @@ -87,7 +87,7 @@ Create a package under [cmd/suggestion](../cmd/suggestion). Then create the main

Here is an example: [cmd/suggestion/hyperopt](../cmd/suggestion/hyperopt). Then build the Docker image.

### Use the algorithm in katib.
### Use the algorithm in Katib.

Update the [katib-config](../manifests/v1alpha3/katib-controller/katib-config.yaml), add a new object:

Expand All @@ -106,9 +106,9 @@ Update the [katib-config](../manifests/v1alpha3/katib-controller/katib-config.ya
}
```

### Contribute the algorithm to katib
### Contribute the algorithm to Katib

If you want to contribute the algorithm to katib, you could add unit test or e2e test for it in CI and submit a PR.
If you want to contribute the algorithm to Katib, you can add a unit test or an e2e test for it in CI and submit a PR.

#### Unit Test

Expand Down Expand Up @@ -142,9 +142,14 @@ You can setup the GRPC server using `grpc_testing`, then define you own test cas

#### E2E Test (Optional)

E2e tests help katib verify that the algorithm works well. To add a e2e test for the new algorithm, you need to:
E2e tests help Katib verify that the algorithm works well.
To add an e2e test for the new algorithm, in [test/scripts/v1alpha3](../test/scripts/v1alpha3) you need to:

Create a new script `run-suggestion-xxx.sh` in [test/scripts/v1alpha3](../test/scripts/v1alpha3). Here is an example [test/scripts/v1alpha3/build-suggestion-hyperopt.sh](../test/scripts/v1alpha3/build-suggestion-hyperopt.sh) (Replace `<name>` with the new algorithm name):
1. Create a new Experiment YAML file in [examples/v1alpha3](../examples/v1alpha3) that uses the new algorithm.

2. Create a new script `build-suggestion-xxx.sh` to build the new suggestion. Here is an example: [test/scripts/v1alpha3/build-suggestion-hyperopt.sh](../test/scripts/v1alpha3/build-suggestion-hyperopt.sh).

3. Create a new script `run-suggestion-xxx.sh` to run the new suggestion. Below is an example (replace `<name>` with the new algorithm name):

```bash
#!/bin/bash
6 changes: 3 additions & 3 deletions docs/new-trial-kind.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Document about how to support a new Kubernetes resource in katib trial
# Document about how to support a new Kubernetes resource in Katib trial

## Update the supported list

Expand Down Expand Up @@ -27,7 +27,7 @@ func GetSupportedJobList() []schema.GroupVersionKind {
}
```

In this function, we define the Kubernetes `GroupVersionKind` that are supported in katib. If you want to add a new kind, please append the `supportedJobList`.
In this function, we define the Kubernetes `GroupVersionKind`s that are supported in Katib. If you want to add a new kind, append it to `supportedJobList`.

## Update logic about status update

Expand Down Expand Up @@ -70,7 +70,7 @@ The function is used to determine which container in the job is the actual main

### Add logic about how to determine the master pod

In katib, we only inject metrics collector sidecar into the master pod (See [metrics-collector.md](./proposals/metrics-collector.md) for more details). Thus we need to update the `JobRoleMap` in [const.go](../pkg/webhook/v1alpha3/pod/const.go).
In Katib, we only inject the metrics collector sidecar into the master pod (see [metrics-collector.md](./proposals/metrics-collector.md) for more details). Thus we need to update the `JobRoleMap` in [const.go](../pkg/webhook/v1alpha3/pod/const.go).

```go
var JobRoleMap = map[string][]string{
6 changes: 3 additions & 3 deletions docs/proposals/metrics-collector.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@

## Links

- [katib/issues#685 (katib metrics collector solution)](https://github.com/kubeflow/katib/issues/685)
- [katib/issues#685 (Katib metrics collector solution)](https://github.com/kubeflow/katib/issues/685)
- [katib/pull#697 (API for metricCollector)](https://github.com/kubeflow/katib/pull/697#issuecomment-516264282)
- [katib/pull#716 (Add pod level inject webhook)](https://github.com/kubeflow/katib/pull/716)
- [katib/pull#729 (Inject pod sidecar for specified namespace)](https://github.com/kubeflow/katib/pull/729)
Expand All @@ -29,7 +29,7 @@ The cron job pulls the targeted pod logs periodically and then persist the logs
However, the pull-based design has [some problems](https://github.com/kubeflow/tf-operator/issues/722#issuecomment-405669269), such as deciding how frequently to scrape the metrics.

To enhance the extensibility and support EarlyStopping, we propose a new design of the metrics collector.
In the new design, katib use mutating webhook to inject metrics collector container as a sidecar into Job/Tfjob/PytorchJob pod.
In the new design, Katib uses a mutating webhook to inject the metrics collector container as a sidecar into Job/TFJob/PyTorchJob pods.
The sidecar collects the metrics of the master and then stores them in the persistent layer (e.g. katib-db-manager and the metadata server).

<center>
Expand Down Expand Up @@ -116,7 +116,7 @@ For more detail, see [here](https://github.com/kubeflow/katib/pull/697#issuecomm
### Mutating Webhook

To avoid collecting duplicated metrics, as we discuss in [kubeflow/katib#685](https://github.com/kubeflow/katib/issues/685), only one metrics collector sidecar will be injected into the master pod during one Experiment.
In the new design, there are two modes for katib mutating webhook to inject the sidecar: **Pod Level Injecting** and **Job Level Injecting**.
In the new design, there are two modes for the Katib mutating webhook to inject the sidecar: **Pod Level Injecting** and **Job Level Injecting**.

The webhook decides which mode to use based on the `katib-metricscollector-injection=enabled` label on the namespace.
In a namespace with the `katib-metricscollector-injection=enabled` label, the webhook injects the sidecar at the pod level. Without this label, it injects the sidecar at the job level.
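
The mode selection described above amounts to a single label check. The Python sketch below is illustrative only and is not the webhook's actual code: the function name `injection_mode` and the return values `"pod"` and `"job"` are assumptions, while the label key and value come from the text above.

```python
INJECTION_LABEL = "katib-metricscollector-injection"


def injection_mode(namespace_labels):
    """Pick the injection mode from a namespace's labels.

    Returns "pod" when the label is present with the value "enabled",
    and "job" in every other case (label missing or set to another value).
    """
    if namespace_labels.get(INJECTION_LABEL) == "enabled":
        return "pod"
    return "job"


if __name__ == "__main__":
    print(injection_mode({INJECTION_LABEL: "enabled"}))
    print(injection_mode({}))
```

Treating any value other than `enabled` as job-level injection is an assumption of this sketch; the text above only specifies the behavior with and without the label.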
6 changes: 3 additions & 3 deletions docs/proposals/suggestion.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,9 +27,9 @@ Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)

## Background

Katib makes suggestions long-running in v1alpha3. And the suggestions need to communicate with katib DB manager to get experiments and trials from katib db driver. This design hurts high availability.
Katib makes suggestions long-running in v1alpha3, and the suggestions need to communicate with the Katib DB manager to get experiments and trials from the Katib DB driver. This design hurts high availability.

Thus we proposed a new design to implement a CRD for suggestion and remove katib db communication from main workflow. The new design simplifies the implmentation of experiment and trial controller, and makes katib Kubernetes native.
Thus we proposed a new design that implements a CRD for suggestions and removes Katib DB communication from the main workflow. The new design simplifies the implementation of the experiment and trial controllers, and makes Katib Kubernetes native.

This document is to illustrate the details of the new design.

Expand Down Expand Up @@ -365,7 +365,7 @@ status:

### Random

We can use the implementation in katib or [hyperopt](https://github.com/hyperopt/hyperopt).
We can use the implementation in Katib or [hyperopt](https://github.com/hyperopt/hyperopt).

### Grid

4 changes: 2 additions & 2 deletions docs/quick-start.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ In this quick start guide, we demonstrate how to use TensorFlow in Katib, which

### Package Training Code

The first thing we need to do is to package the training code to a docker image. We use the [example code](../examples/v1alpha3/mnist-tensorflow/mnist_with_summaries.py), which builds a simple neural network, to train on MNIST. The code trains the network and outputs the TFEvents to `/tmp` by default.
The first thing we need to do is to package the training code into a Docker image. We use the [example code](https://github.com/kubeflow/tf-operator/blob/master/examples/v1/mnist_with_summaries/mnist_with_summaries.py), which builds a simple neural network, to train on MNIST. The code trains the network and outputs the TFEvents to `/tmp` by default.

You can use our prebuilt image `gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0`. Thus we can skip it.

Expand Down Expand Up @@ -121,7 +121,7 @@ The experiment has two hyperparameters defined in `parameters`: `--learning_ra
Or you could just run:

```bash
kubectl apply -f ./examples/v1alpha3/tfjob-example.yaml
kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/tfjob-example.yaml
```

### Get trial results
12 changes: 6 additions & 6 deletions docs/workflow-design.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,10 +59,10 @@ spec:
spec:
containers:
- name: {{.Trial}}
image: docker.io/katib/mxnet-mnist-example
image: docker.io/kubeflowkatib/mxnet-mnist
command:
- "python"
- "/mxnet/example/image-classification/train_mnist.py"
- "python3"
- "/opt/mxnet-mnist/mnist.py"
- "--batch-size=64"
{{- with .HyperParameters}}
{{- range .}}
Expand Down Expand Up @@ -131,10 +131,10 @@ spec:
spec:
containers:
- name: random-example-fm2g6jpj
image: docker.io/katib/mxnet-mnist-example
image: docker.io/kubeflowkatib/mxnet-mnist
command:
- "python"
- "/mxnet/example/image-classification/train_mnist.py"
- "python3"
- "/opt/mxnet-mnist/mnist.py"
- "--batch-size=64"
- "--lr=0.027435456064371484"
- "--num-layers=4"