
Improve Rayclusterreplicaset Status #295

Merged: 6 commits from improve-rayclusterreplicaset-status into main on Oct 14, 2024

Conversation

@Yicheng-Lu-llll (Contributor) commented Oct 12, 2024

Pull Request Description

RayCluster recently added condition support. This PR uses the new RayCluster status conditions to improve the RayClusterReplicaSet status. Please see my inline comments below for a detailed explanation of each status change.

Related Issues

Resolves: #171

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

@Yicheng-Lu-llll force-pushed the improve-rayclusterreplicaset-status branch from 252026a to d9ea54a on October 12, 2024 at 18:49
@@ -3,7 +3,7 @@

````diff
 Commands to export manifest from helm package. After you got manifest, copy to this folder.

 ```shell
-helm template kuberay-operator kuberay/kuberay-operator --namespace aibrix-system --version 1.2.1 --include-crds --set env[0].name=ENABLE_PROBES_INJECTION --set env[0].value=\"false\" --set fullnameOverride=kuberay-operator --output-dir ./config/dependency
+helm template kuberay-operator kuberay/kuberay-operator --namespace aibrix-system --version 1.2.1 --include-crds --set env[0].name=ENABLE_PROBES_INJECTION --set env[0].value=\"false\" --set fullnameOverride=kuberay-operator --set featureGates[0].name=RayClusterStatusConditions --set featureGates[0].enabled=true --output-dir ./config/dependency
````

```go
	}

	// Case 4: The RayCluster has been provisioned and we need to check if it is ready.
	return rayclusterutil.IsRayClusterReady(&c)
```
@Yicheng-Lu-llll (Contributor, Author) commented Oct 12, 2024:
isClusterActive decides whether we should count a RayCluster; the count is later compared with the desired number of replicas to decide whether to scale up or down (a sketch of this logic follows the list below).

  • If the RayCluster has been marked as deleted, do not count it.
  • If it has not been marked as deleted, we categorize the RayCluster into one of two states:
    • Init state:
      • The cluster is waiting for all Ray Pods to become ready for the first time. In this case we count the cluster, since a ReplicaSet also counts Pods that are still in the init stage. It is true that an image pull error or resource limits could keep a Pod from ever starting, so we might wait indefinitely, but a ReplicaSet does not handle that case either.
    • Non-init state:
      • We strictly check whether the RayCluster is fully ready, meaning the head Pod is ready and all worker Pods are ready. If any Pod fails, we currently treat the failure as unrecoverable and a new RayCluster needs to be created. We can reconsider this approach later.
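
A minimal sketch of this counting logic, assuming the KubeRay v1 API and the rayclusterutil.IsRayClusterReady helper referenced in the diff; the function shape, package name, and the aibrix-internal import path are illustrative, not the exact code from this PR:

```go
package orchestration // NOTE: placeholder package name

import (
	"k8s.io/apimachinery/pkg/api/meta"

	rayclusterv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
	// NOTE: the import path of the project-internal helper package is assumed here.
	rayclusterutil "github.com/aibrix/aibrix/pkg/controller/util/raycluster"
)

// isClusterActive reports whether a RayCluster should be counted toward the
// RayClusterReplicaSet's active replicas.
func isClusterActive(c rayclusterv1.RayCluster) bool {
	// A cluster marked for deletion is never counted.
	if c.DeletionTimestamp != nil {
		return false
	}

	// Init state: the cluster has not been provisioned yet (its Pods have never
	// all been ready). Count it, mirroring how a ReplicaSet counts Pods that
	// are still initializing.
	if !meta.IsStatusConditionTrue(c.Status.Conditions, string(rayclusterv1.RayClusterProvisioned)) {
		return true
	}

	// Non-init state: the cluster has been provisioned before, so count it only
	// if the head Pod and all worker Pods are ready right now.
	return rayclusterutil.IsRayClusterReady(&c)
}
```

The RayClusterProvisioned condition (available behind the RayClusterStatusConditions feature gate enabled above) is set once all Ray Pods have been ready at least once, which is what makes it a reasonable way to separate the init and non-init states here.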

@@ -27,17 +30,30 @@ func IsRayClusterReady(cluster *rayclusterv1.RayCluster) bool {

```go
}

func IsRayClusterStateReady(status rayclusterv1.RayClusterStatus) bool {
	return status.State == rayclusterv1.Ready
}
```
@Yicheng-Lu-llll (Contributor, Author) commented:
We consider a RayCluster to be in a 'good' status if and only if all Ray Pods are in Ready status.
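
For illustration, a hedged sketch of how such a readiness check might combine the new status conditions with the legacy State field. This is an approximation of the idea, not necessarily the helper added in this PR, and it reuses the imports from the sketch above:

```go
// IsRayClusterReady treats a RayCluster as "good" only when every Ray Pod is
// ready: the head Pod must be ready and the number of ready workers must match
// the desired number. If conditions are unavailable (feature gate disabled),
// fall back to the legacy State field, which KubeRay sets to Ready only once
// all Pods are ready.
func IsRayClusterReady(cluster *rayclusterv1.RayCluster) bool {
	if meta.FindStatusCondition(cluster.Status.Conditions, string(rayclusterv1.HeadPodReady)) != nil {
		return meta.IsStatusConditionTrue(cluster.Status.Conditions, string(rayclusterv1.HeadPodReady)) &&
			cluster.Status.ReadyWorkerReplicas == cluster.Status.DesiredWorkerReplicas
	}
	return cluster.Status.State == rayclusterv1.Ready
}
```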

@@ -107,13 +107,7 @@ func oldPodsRunning(newRS *orchestrationv1alpha1.RayClusterReplicaSet, oldRSs []

```go
		continue
	}
	for _, pod := range podList {
```
@Yicheng-Lu-llll (Contributor, Author) commented:

In a Deployment, oldPodsRunning is used to check whether all Pods from the old ReplicaSets have already released their resources. If so, a new ReplicaSet can be created for the update, following the Recreate strategy.

The code below is copied from https://github.com/kubernetes/kubernetes (note that Pods in the v1.PodFailed and v1.PodSucceeded phases do not occupy any resources and therefore pass the check):

```go
for _, pod := range podList {
    switch pod.Status.Phase {
    case v1.PodFailed, v1.PodSucceeded:
        // Don't count pods in terminal state.
        continue
    case v1.PodUnknown:
        // v1.PodUnknown is a deprecated status.
        // This logic is kept for backward compatibility.
        // This used to happen in situations like when the node is temporarily disconnected from the cluster.
        // If we can't be sure that the pod is not running, we have to count it.
        return true
    default:
        // Pod is not in terminal phase.
        return true
    }
}
```

Similarly, here we check whether the old RayClusterReplicaSets still have RayClusters occupying resources.

Unlike Pods, RayClusters do not have a terminal status to indicate that they have finished and will no longer occupy resources. Therefore, the best approach is to detect whether the RayCluster has been marked for deletion (a non-nil DeletionTimestamp). If so, the KubeRay operator will eventually delete it.

Another tricky issue is that "unhealthy" or "failed" RayClusters are not automatically deleted and continue to occupy resources. However, that is a separate issue. For now, we just ensure that all countable (healthy) RayClusters of the old ReplicaSets have been marked for deletion.

A collaborator replied:

makes sense.
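
A hedged sketch of the RayCluster-side analogue described above, using the DeletionTimestamp check in place of a terminal Pod phase. The helper name, its signature, and the way clusters are grouped per ReplicaSet are illustrative (reusing the types from the diff context and the imports from the earlier sketch), not the exact code from this PR:

```go
// oldClustersOccupyResources reports whether any RayCluster owned by the old
// ReplicaSets is still holding resources. RayClusters have no terminal phase,
// so a cluster counts as released only once it carries a DeletionTimestamp
// (the KubeRay operator will then eventually remove it).
// NOTE: the grouping map and the function name are illustrative, not from the PR.
func oldClustersOccupyResources(oldRSs []*orchestrationv1alpha1.RayClusterReplicaSet, clustersByRS map[string][]rayclusterv1.RayCluster) bool {
	for _, rs := range oldRSs {
		for _, c := range clustersByRS[rs.Name] {
			// Not yet marked for deletion: it may still occupy resources.
			if c.DeletionTimestamp == nil {
				return true
			}
		}
	}
	return false
}
```

As noted above, an unhealthy cluster that is never deleted would keep such a check returning true; handling that case is left as a separate issue.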

Signed-off-by: Yicheng-Lu-llll <[email protected]>
@Yicheng-Lu-llll Yicheng-Lu-llll changed the title [WIP] Improve Rayclusterreplicaset Status Improve Rayclusterreplicaset Status Oct 12, 2024
@Yicheng-Lu-llll Yicheng-Lu-llll marked this pull request as ready for review October 12, 2024 20:30
@Jeffwan (Collaborator) commented Oct 14, 2024:

Excellent work! I will take a look tomorrow

@Jeffwan (Collaborator) left a review:

Overall looks good to me. Nice job.

```diff
-sigs.k8s.io/controller-runtime v0.17.3
 k8s.io/kube-openapi v0.0.0-20240620174524-b456828f718b
 k8s.io/utils v0.0.0-20240502163921-fe8a2dddb1d0
+sigs.k8s.io/controller-runtime v0.17.5
```
A collaborator asked:
Any features we need from v0.17.3 -> v0.17.5?

@Yicheng-Lu-llll (Contributor, Author) replied:

It was updated automatically when I ran go mod tidy. It may be because KubeRay v1.2.1 currently depends on controller-runtime v0.17.5: https://github.com/ray-project/kuberay/blob/v1.2.1/ray-operator/go.mod#L32

A collaborator replied:
Got it


@Jeffwan merged commit d2fd044 into main on Oct 14, 2024.
9 checks passed
@Jeffwan deleted the improve-rayclusterreplicaset-status branch on October 14, 2024 at 03:05.
Jeffwan added two commits referencing this pull request on Oct 15, 2024, both with the message:
#295 introduces the latest KubeRay API, and the dependency bump moves sigs.k8s.io/controller-runtime from v0.17.3 to v0.17.5. Due to that change, make manifest updates the CRD definitions.
Jeffwan added a commit that referenced this pull request Oct 22, 2024
* Update manifests version to v0.1.0-rc.3 (#287)

* [Misc] Add sync images step and scripts in release process (#283)

Add sync images step and scripts in release process

* [batch] E2E works with driver and request proxy (#272)

* e2e driver and test

* comment functions

* check job status in test

* format update

* update copyright

* add examples with instructions and interfaces

* move batch tutorial

---------

Co-authored-by: xin.chen <[email protected]>

* Fix address already in use when AIRuntime start in pod (#289)

add uvicorn startup into file entrypoint

* Read model  name from request body (#290)

* Use model name from request body

* rename dummy to reserved router

* Fix redis bootstrap flaky connection issue (#293)

* skip docs CI if no changes in /docs dir (#294)

* skip docs CI if no changes in /docs dir

* test docs build

* Improve Rayclusterreplicaset Status (#295)

* improve rayclusterreplicaset status
* nit
* fix lint error
* improve isClusterActive logic
* fix lint error
* remove redundant isRayPodCreateOrDeleteFailed check
---------

Signed-off-by: Yicheng-Lu-llll <[email protected]>

* Add request trace for profiling (#291)

* Add request trace for profiling

* add to redis at 10 second interval

* nit

* round to nearest 10s interval

* round timestamp to nearest 10s interval and aggregate data by model

* add go routine to add request trace

* Update the crd definition due to runtime upgrade (#298)

#295 introduces the latest KubeRay API, and the dependency bump moves sigs.k8s.io/controller-runtime from v0.17.3 to v0.17.5. Due to that change, make manifest updates the CRD definitions.

* Push images to Github registry in release pipeline (#301)

* Disable docker build github workflow to cut CI cost

* Push images to Github registry in release pipeline

* Build autoscaler abstractions like fetcher, client and scaler (#300)

* minor clean up on the autoscaler controller

* Extract the algorithm package

algorithm is extracted to distinguish with the scaler.

* Refactor scaler interface

1. Split the Scaler interface and BaseAutoscaler implementation
2. Create APA/KPA scaler separately and adopt the corresponding algorithms

* Introduce the scalingContext in algorithm

* Introduce k8s.io/metrics for resource & custom metrics fetching

* Extract metric fetcher to cover the fetching logic

* Optimize the scaler workflow to adopt fetch and client interface

* Further refactor the code structure

* Support pod autoscaler periodically check (#306)

* Support pod autoscaler periodically check

* Fix the error case

* Add timeout in nc check for redis bootstrap (#309)

* Refactor AutoScaler: metricClient, context, reconcile (#308)

* Refactor AutoScaler: optimize metric client, context, and reconcile processes.

* fix make lint-all

* fix typos

---------

Signed-off-by: Yicheng-Lu-llll <[email protected]>
Co-authored-by: xinchen384 <[email protected]>
Co-authored-by: xin.chen <[email protected]>
Co-authored-by: brosoul <[email protected]>
Co-authored-by: Varun Gupta <[email protected]>
Co-authored-by: Yicheng-Lu-llll <[email protected]>
Co-authored-by: Rong-Kang <[email protected]>
gangmuk pushed a commit that referenced this pull request Jan 25, 2025
* improve rayclusterreplicaset status
* nit
* fix lint error
* improve isClusterActive logic
* fix lint error
* remove redundant isRayPodCreateOrDeleteFailed check
---------

Signed-off-by: Yicheng-Lu-llll <[email protected]>
gangmuk pushed a commit that referenced this pull request Jan 25, 2025
#295 introduces the latest KubeRay API, and the dependency bump moves sigs.k8s.io/controller-runtime from v0.17.3 to v0.17.5. Due to that change, make manifest updates the CRD definitions.
Successfully merging this pull request may close these issues.

Leverage RayCluster Ready time or ready worker replicas to improve the RayClusterReplicaSet Status