[Bug] Fix the way the podautoscaler handles 0 pods. #508

Merged

merged 5 commits into main from issues/481_PodAutoscaler_scaling_down_to_0 on Dec 9, 2024

Conversation

@zhangjyr (Collaborator) commented Dec 9, 2024

Pull Request Description

  1. Previously, if the podautoscaler found no pods, it treated that as an error. Since the podautoscaler can already handle 0 pods in the code, this is unnecessary.
  2. This PR also fixes a minor bug that occurred when KPA.delayWindow is nil.
  3. This PR fixes an error-reporting bug: errors were only reported when metricDesiredReplicas == -1, but the code always returns 0 as metricDesiredReplicas on error. This condition prevented many errors from being visible, so it has been removed.

Note: this PR does not solve all problems related to 0 pods, since those may be related to how the podautoscaler counts pods. See comments in #507.
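As background for item 2, here is a minimal sketch of the kind of nil guard involved; the type and method names are hypothetical stand-ins, not the actual aibrix KPA implementation:

package sketch

import "time"

// delayWindow is a hypothetical stand-in for KPA's delay-window aggregation.
type delayWindow struct{ max float64 }

func (w *delayWindow) Record(_ time.Time, v float64) {
	if v > w.max {
		w.max = v
	}
}

func (w *delayWindow) Max() float64 { return w.max }

type KpaAutoscaler struct{ delayWindow *delayWindow }

// maxInDelayWindow guards against a nil delayWindow before using it, which is
// the shape of the fix described in item 2 (names here are illustrative).
func (k *KpaAutoscaler) maxInDelayWindow(now time.Time, desired int32) int32 {
	if k.delayWindow == nil {
		return desired // no window configured: use the instantaneous value
	}
	k.delayWindow.Record(now, float64(desired))
	return int32(k.delayWindow.Max())
}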

Related Issues

Resolves: #481

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

@zhangjyr zhangjyr requested review from Jeffwan, kr11 and nwangfw December 9, 2024 07:29
@zhangjyr zhangjyr added kind/bug Something isn't working area/heterogeneous labels Dec 9, 2024
@@ -368,7 +368,7 @@ func (r *PodAutoscalerReconciler) reconcileCustomPA(ctx context.Context, pa auto
 	// computeReplicasForMetrics gives
 	// TODO: check why it return the metrics name here?
 	metricDesiredReplicas, metricName, metricTimestamp, err := r.computeReplicasForMetrics(ctx, pa, scale, metricKey)
-	if err != nil && metricDesiredReplicas == -1 {
+	if err != nil {
Collaborator
I cannot remember the logic here. Could you briefly elaborate on the details?

Collaborator Author

I'm seeking background here, too. As computeReplicasForMetrics returns 0 along with any error:

func (r *PodAutoscalerReconciler) computeReplicasForMetrics(ctx context.Context, pa autoscalingv1alpha1.PodAutoscaler, scale *unstructured.Unstructured, metricKey metrics.NamespaceNameMetric) (replicas int32, relatedMetrics string, timestamp time.Time, err error) {
	...
	labelsSelector, err := extractLabelSelector(scale)
	if err != nil {
		return 0, "", currentTimestamp, err
	}

	originalReadyPodsCount, err := scaler.GetReadyPodsCount(ctx, r.Client, pa.Namespace, labelsSelector)
	if err != nil {
		return 0, "", currentTimestamp, fmt.Errorf("error getting ready pods count: %w", err)
	}

	...
	err = r.updateScalerSpec(ctx, pa, metricKey)
	if err != nil {
		...
		return 0, "", currentTimestamp, fmt.Errorf("error update scaler spec: %w", err)
	}

	...
	autoScaler, ok := r.AutoscalerMap[metricKey]
	if !ok {
		return 0, "", currentTimestamp, fmt.Errorf("unsupported scaling strategy: %s", pa.Spec.ScalingStrategy)
	}

	scaleResult := autoScaler.Scale(int(originalReadyPodsCount), metricKey, currentTimestamp)
	if scaleResult.ScaleValid {
		...
		return scaleResult.DesiredPodCount, metricKey.MetricName, currentTimestamp, nil
	}

	return 0, "", currentTimestamp, fmt.Errorf("can not calculate metrics for scale %s", pa.Spec.ScaleTargetRef.Name)
}
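Given those return values, the old guard in reconcileCustomPA was effectively dead code; condensing the diff above into a before/after sketch:

// Before: never fires, because replicas is always 0 (never -1) on error.
if err != nil && metricDesiredReplicas == -1 {
	...
}

// After: errors are actually surfaced.
if err != nil {
	...
}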

@Jeffwan (Collaborator) commented Dec 9, 2024

@nwangfw I think we talked about the scale-to-0 stuff last Thursday. Could you double check this PR?

Jingyuan Zhang added 2 commits December 9, 2024 10:17
Keep readyPodsCount value intact and fix affected code.
@nwangfw (Collaborator) commented Dec 9, 2024

> @nwangfw I think we talked about the scale-to-0 stuff last Thursday. Could you double check this PR?

@Jeffwan Sure. Last week, we discussed the issue that minReplica must be set to 1 or the podautoscaler returns an error (Issue #481). I believe this PR addresses that case even when minReplica is not configured, as well as the 0-ready-pod case.

@zhangjyr (Collaborator, Author) commented Dec 9, 2024

> @Jeffwan Sure. Last week, we discussed the issue that minReplica must be set to 1 or the podautoscaler returns an error (Issue #481). I believe this PR addresses that case even when minReplica is not configured.

Right, this PR makes the 0-pod cases tolerable for the podautoscaler. However, as I mentioned in #507, the current implementation uses the pod count in the namespace, which deviates from the scaleTargetRef definition. So the behavior of #481 seems unpredictable: a100(1) plus a40(0) is OK (because there is 1 pod in the namespace), while a40(0) alone fails.

@nwangfw (Collaborator) commented Dec 9, 2024

> a100(1) plus a40(0) is OK (because there is 1 pod in the namespace), while a40(0) alone fails.

I am not sure that I fully understand this part. Can you elaborate on it more? Thanks!

@zhangjyr (Collaborator, Author) commented Dec 9, 2024

> a100(1) plus a40(0) is OK (because there is 1 pod in the namespace), while a40(0) alone fails.

I tried a configuration that sets the a100 deployment to replica 1 and the a40 deployment to replica 0. Without the fix in this PR, the podautoscaler currently reports no error. The reason is discussed in #507: the podautoscaler found one pod in the namespace, which satisfies it.

However, if both a100 and a40 are set to replica 0, the podautoscaler will trigger the error.
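To illustrate the distinction under discussion, here is a sketch of the two counting strategies using controller-runtime's client; the helper names are hypothetical and this is not the actual aibrix code:

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// countPodsInNamespace mirrors the behavior discussed in #507: every pod in
// the namespace counts, regardless of which deployment owns it, so a single
// a100 pod hides the fact that a40 has zero replicas.
func countPodsInNamespace(ctx context.Context, c client.Client, ns string) (int, error) {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(ns)); err != nil {
		return 0, err
	}
	return len(pods.Items), nil
}

// countPodsForTarget is the selector-scoped alternative: only pods matching
// the scale target's label selector count, so a40(0) is seen as 0 even when
// a100 pods exist in the same namespace.
func countPodsForTarget(ctx context.Context, c client.Client, ns string, selector map[string]string) (int, error) {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(ns), client.MatchingLabels(selector)); err != nil {
		return 0, err
	}
	return len(pods.Items), nil
}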

@nwangfw (Collaborator) commented Dec 9, 2024

> I tried a configuration that sets the a100 deployment to replica 1 and the a40 deployment to replica 0. Without the fix in this PR, the podautoscaler currently reports no error. The reason is discussed in #507: the podautoscaler found one pod in the namespace, which satisfies it.
>
> However, if both a100 and a40 are set to replica 0, the podautoscaler will trigger the error.

Got it. For the a100(1) and a40(0) case, you mean there is no error, but the scaling results are still not predictable due to the issue we found in PR #507, right?

@Jeffwan Jeffwan merged commit 4ebf0ff into main Dec 9, 2024
10 checks passed
@Jeffwan Jeffwan deleted the issues/481_PodAutoscaler_scaling_down_to_0 branch December 9, 2024 23:49
gangmuk pushed a commit that referenced this pull request Jan 25, 2025
* Bug fix

* Make APA comfortable with 0 originalReadyPodsCount for completeness.

* Verify ActivationScale > 1 settings.
Keep readyPodsCount value intact and fix affected code.

* Bug fix

* Document the new logic in comments.

---------

Co-authored-by: Jingyuan Zhang <[email protected]>