Kubeflow Training Operator is currently at v1.
- Go (1.23 or later)
- Docker (23 or later)
- Python (3.11 or later)
- kustomize (4.0.5 or later)
- Kind (0.22.0 or later)
- Lima (an alternative to Docker Desktop) (0.21.0 or later)
- Colima (Lima specifically for macOS) (0.6.8 or later)
- pre-commit
Note that for Lima, the link points to its Adopters page, which covers several different container environments.
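As a quick sanity check against the minimum versions listed above, a small helper can compare dotted version strings. This is a minimal sketch and not part of the repository: version_ge is a hypothetical helper, it only handles purely numeric dotted versions (pre-release suffixes such as rc1 are not supported), and the go env GOVERSION call assumes a Go toolchain recent enough to report that variable.

```shell
# version_ge A B: succeeds when dotted numeric version A is >= B.
# Hypothetical helper; missing fields count as 0 (so 1.23 equals 1.23.0).
version_ge() {
  a=$1
  b=$2
  while [ -n "$a" ] || [ -n "$b" ]; do
    x=${a%%.*}
    y=${b%%.*}
    if [ "${x:-0}" -gt "${y:-0}" ]; then return 0; fi
    if [ "${x:-0}" -lt "${y:-0}" ]; then return 1; fi
    # Drop the field just compared; clear the string when none remain.
    case $a in *.*) a=${a#*.} ;; *) a= ;; esac
    case $b in *.*) b=${b#*.} ;; *) b= ;; esac
  done
  return 0
}

# Example: check the installed Go toolchain, if any, against the 1.23 minimum.
if command -v go >/dev/null 2>&1; then
  go_version=$(go env GOVERSION)   # e.g. "go1.23.1"
  if version_ge "${go_version#go}" 1.23; then
    echo "Go ${go_version#go} is new enough"
  else
    echo "Go ${go_version#go} is older than 1.23"
  fi
fi
```

The same function can be reused for the other tools once their version string has been extracted, for example version_ge "$kind_version" 0.22.0.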
Create a symbolic link inside your GOPATH to the location you checked out the code
mkdir -p $(go env GOPATH)/src/github.com/kubeflow
ln -sf ${GIT_TRAINING} $(go env GOPATH)/src/github.com/kubeflow/training-operator
- GIT_TRAINING should be the location where you checked out https://github.com/kubeflow/training-operator
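The linking step can be wrapped so that it fails early when the checkout path is wrong. The following is a sketch under stated assumptions: link_repo is a hypothetical helper, and it uses ln -sfn rather than the ln -sf shown above so that re-running it replaces an existing link instead of creating a new link inside the linked directory.

```shell
# link_repo SRC GOPATH_DIR: link a training-operator checkout into a GOPATH.
# Hypothetical helper; refuses to link when SRC is not an existing directory.
link_repo() {
  src=$1
  gopath=$2
  if [ ! -d "$src" ]; then
    echo "not a directory: $src" >&2
    return 1
  fi
  mkdir -p "$gopath/src/github.com/kubeflow" &&
    ln -sfn "$src" "$gopath/src/github.com/kubeflow/training-operator"
}

# Only runs when GIT_TRAINING is set and a Go toolchain is available.
if [ -n "${GIT_TRAINING:-}" ] && command -v go >/dev/null 2>&1; then
  link_repo "$GIT_TRAINING" "$(go env GOPATH)"
fi
```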
Install dependencies
go mod tidy
Build the library
go install github.com/kubeflow/training-operator/cmd/training-operator.v1
Running the operator locally (as opposed to deploying it on a K8s cluster) is convenient for debugging/development.
First, you need to run a Kubernetes cluster locally. We recommend Kind.
You can create a kind cluster by running
kind create cluster
This adds the new cluster to your Kubernetes config file.
After creating the cluster, you can check the nodes with the command below, which should show the kind-control-plane node.
kubectl get nodes
The output should look something like below:
$ kubectl get nodes
kind-control-plane Ready control-plane 32s v1.27.3
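When scripting the setup, the readiness check can be automated by reading the STATUS column of that output. This is only a sketch: all_nodes_ready is a hypothetical helper, and it assumes the default column layout shown above, with the node name in the first column and the status in the second.

```shell
# all_nodes_ready: reads `kubectl get nodes` output on stdin and succeeds
# only when at least one node is listed and every node's STATUS is "Ready".
# A compound status such as "Ready,SchedulingDisabled" counts as not ready.
all_nodes_ready() {
  awk '
    $1 == "NAME" { next }                      # skip the header row
    { seen = 1; if ($2 != "Ready") bad = 1 }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}

# Demonstration using the sample line from the output above.
# Against a live cluster: kubectl get nodes | all_nodes_ready
if printf '%s\n' 'kind-control-plane Ready control-plane 32s v1.27.3' | all_nodes_ready; then
  echo "all nodes ready"
fi
```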
Note that the PyTorchJob in the example job below uses the kubeflow namespace.
From here we can apply the manifests to the cluster.
kubectl apply --server-side -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
Then we can patch it with the latest operator image.
kubectl patch -n kubeflow deployments training-operator --type json -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "kubeflow/training-operator:latest"}]'
Then we can run the job with the following command.
kubectl apply -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
And we can see the output of the job from the logs, which may take some time to produce but should look something like below.
$ kubectl logs -n kubeflow -l training.kubeflow.org/job-name=pytorch-simple --follow
Defaulted container "pytorch" out of: pytorch, init-pytorch (init)
2024-04-19T19:00:29Z INFO Train Epoch: 1 [4480/60000 (7%)] loss=2.2295
2024-04-19T19:00:32Z INFO Train Epoch: 1 [5120/60000 (9%)] loss=2.1790
2024-04-19T19:00:35Z INFO Train Epoch: 1 [5760/60000 (10%)] loss=2.1150
2024-04-19T19:00:38Z INFO Train Epoch: 1 [6400/60000 (11%)] loss=2.0294
2024-04-19T19:00:41Z INFO Train Epoch: 1 [7040/60000 (12%)] loss=1.9156
2024-04-19T19:00:44Z INFO Train Epoch: 1 [7680/60000 (13%)] loss=1.7949
2024-04-19T19:00:47Z INFO Train Epoch: 1 [8320/60000 (14%)] loss=1.5567
2024-04-19T19:00:50Z INFO Train Epoch: 1 [8960/60000 (15%)] loss=1.3715
2024-04-19T19:00:54Z INFO Train Epoch: 1 [9600/60000 (16%)] loss=1.3385
2024-04-19T19:00:57Z INFO Train Epoch: 1 [10240/60000 (17%)] loss=1.1650
2024-04-19T19:00:29Z INFO Train Epoch: 1 [4480/60000 (7%)] loss=2.2295
2024-04-19T19:00:32Z INFO Train Epoch: 1 [5120/60000 (9%)] loss=2.1790
2024-04-19T19:00:35Z INFO Train Epoch: 1 [5760/60000 (10%)] loss=2.1150
2024-04-19T19:00:38Z INFO Train Epoch: 1 [6400/60000 (11%)] loss=2.0294
2024-04-19T19:00:41Z INFO Train Epoch: 1 [7040/60000 (12%)] loss=1.9156
2024-04-19T19:00:44Z INFO Train Epoch: 1 [7680/60000 (13%)] loss=1.7949
2024-04-19T19:00:47Z INFO Train Epoch: 1 [8320/60000 (14%)] loss=1.5567
2024-04-19T19:00:50Z INFO Train Epoch: 1 [8960/60000 (15%)] loss=1.3715
2024-04-19T19:00:53Z INFO Train Epoch: 1 [9600/60000 (16%)] loss=1.3385
2024-04-19T19:00:57Z INFO Train Epoch: 1 [10240/60000 (17%)] loss=1.1650
Now that you have confirmed you can spin up an operator locally, you can test your local changes to the operator. You do this by building a new operator image and loading it into your kind cluster.
make docker-build IMG=my-username/training-operator:my-pr-01
You can swap my-username/training-operator:my-pr-01 with whatever you would like.
kind load docker-image my-username/training-operator:my-pr-01
cd ./manifests/overlays/standalone
kustomize edit set image my-username/training-operator=my-username/training-operator:my-pr-01
Update the newTag key in ./manifests/overlays/standalone/kustomization.yaml with the new image tag.
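The image name and tag used in the kustomization file can be derived from a single image reference with shell parameter expansion. The sketch below is illustrative only: image_name and image_tag are hypothetical helpers, and they assume the reference ends in :tag and carries no digest.

```shell
# image_name REF: everything before the last ':' (the repository part).
image_name() {
  printf '%s\n' "${1%:*}"
}

# image_tag REF: everything after the last ':' (the tag part).
image_tag() {
  printf '%s\n' "${1##*:}"
}

IMG="my-username/training-operator:my-pr-01"
echo "newName: $(image_name "$IMG")"
echo "newTag: $(image_tag "$IMG")"
```

Because the split happens at the last colon, a registry host with a port (for example localhost:5000/op:v2) is handled as long as a tag is present.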
Deploy the operator with:
kubectl apply -k ./manifests/overlays/standalone
And now we can submit jobs to the operator.
kubectl patch -n kubeflow deployments training-operator --type json -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "my-username/training-operator:my-pr-01"}]'
kubectl apply -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
You should be able to see a pod for your training operator running in your namespace using
kubectl logs -n kubeflow -l training.kubeflow.org/job-name=pytorch-simple
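When iterating on a change, the build, load, and patch commands above are typically repeated for every new tag. A minimal sketch of chaining them is shown here; run and redeploy are hypothetical helpers, and DRY_RUN defaults to 1 so the script only prints the commands until you set DRY_RUN=0.

```shell
# Sketch: chain the build, load, and patch steps for one image tag.
IMG="${IMG:-my-username/training-operator:my-pr-01}"
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# redeploy IMAGE: build the image, load it into kind, and point the
# training-operator deployment at it. Stops at the first failing step.
redeploy() {
  run make docker-build IMG="$1" &&
    run kind load docker-image "$1" &&
    run kubectl patch -n kubeflow deployments training-operator --type json \
      -p "[{\"op\": \"replace\", \"path\": \"/spec/template/spec/containers/0/image\", \"value\": \"$1\"}]"
}

redeploy "$IMG"
```

Run it from the repository root, since make docker-build relies on the project's Makefile. Using a fresh tag per iteration avoids the cluster reusing a previously loaded image with the same name.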
On Ubuntu, the default Go package appears to be gccgo-go, which has problems (see issue). The golang-go package is also really old, so install Go from the golang tarballs instead.
To generate Python SDK for the operator, run:
This command will re-generate the api and model files together with the documentation and model tests.
The following files/folders in sdk/python are auto-generated and should not be modified directly:
The Training Operator client and public APIs are located here:
Make sure to install pre-commit (pip install pre-commit) and run pre-commit install
from the root of the repository at least once before creating git commits.
The pre-commit hooks ensure code quality and
consistency. They are executed in CI. PRs that fail to comply with the hooks
will not be able to pass the corresponding CI gate. The hooks are only executed
against staged files unless you run pre-commit run --all-files, in which case
they'll be executed against every file in the repository.
Specific programmatically generated files listed in the exclude field in
.pre-commit-config.yaml are deliberately excluded from the hooks.