[Bug] Fix the way the podautoscaler handles 0 pods. #508

Merged

merged 5 commits into main from issues/481_PodAutoscaler_scaling_down_to_0 on Dec 9, 2024

Conversation

@zhangjyr (Collaborator) commented Dec 9, 2024

Pull Request Description

  1. Previously, if the podautoscaler found no pods, it treated that as an error. Since the podautoscaler can already handle 0 pods in the code, this is unnecessary.
  2. This PR also fixes a minor bug that occurred when KPA.delayWindow is nil.
  3. This PR fixes an error-reporting bug: errors were only reported when metricDesiredReplicas == -1, but the code always returns 0 as metricDesiredReplicas on error. This condition prevented many errors from being visible, so it has been removed.

Note: this PR does not solve all problems related to 0 pods, since those may be related to how the podautoscaler counts pods. See comments in #507.
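As background for item 2, here is a minimal sketch of the kind of nil guard involved; the type and method names are hypothetical stand-ins, not the actual aibrix KPA implementation:

package sketch

import "time"

// delayWindow is a hypothetical stand-in for KPA's delay-window aggregation.
type delayWindow struct{ max float64 }

func (w *delayWindow) Record(_ time.Time, v float64) {
	if v > w.max {
		w.max = v
	}
}

func (w *delayWindow) Max() float64 { return w.max }

type KpaAutoscaler struct{ delayWindow *delayWindow }

// maxInDelayWindow guards against a nil delayWindow before using it, which is
// the shape of the fix described in item 2 (names here are illustrative).
func (k *KpaAutoscaler) maxInDelayWindow(now time.Time, desired int32) int32 {
	if k.delayWindow == nil {
		return desired // no window configured: use the instantaneous value
	}
	k.delayWindow.Record(now, float64(desired))
	return int32(k.delayWindow.Max())
}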

Related Issues

Resolves: #481

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

@zhangjyr zhangjyr requested review from Jeffwan, kr11 and nwangfw December 9, 2024 07:29
@zhangjyr zhangjyr added kind/bug Something isn't working area/heterogeneous labels Dec 9, 2024
@@ -368,7 +368,7 @@ func (r *PodAutoscalerReconciler) reconcileCustomPA(ctx context.Context, pa auto
 	// computeReplicasForMetrics gives
 	// TODO: check why it return the metrics name here?
 	metricDesiredReplicas, metricName, metricTimestamp, err := r.computeReplicasForMetrics(ctx, pa, scale, metricKey)
-	if err != nil && metricDesiredReplicas == -1 {
+	if err != nil {
Collaborator
I cannot remember the logic here. Could you briefly elaborate on the details?

Collaborator Author

I'm seeking background here, too. As computeReplicasForMetrics returns 0 along with any error:

func (r *PodAutoscalerReconciler) computeReplicasForMetrics(ctx context.Context, pa autoscalingv1alpha1.PodAutoscaler, scale *unstructured.Unstructured, metricKey metrics.NamespaceNameMetric) (replicas int32, relatedMetrics string, timestamp time.Time, err error) {
	...
	labelsSelector, err := extractLabelSelector(scale)
	if err != nil {
		return 0, "", currentTimestamp, err
	}

	originalReadyPodsCount, err := scaler.GetReadyPodsCount(ctx, r.Client, pa.Namespace, labelsSelector)
	if err != nil {
		return 0, "", currentTimestamp, fmt.Errorf("error getting ready pods count: %w", err)
	}

	...
	err = r.updateScalerSpec(ctx, pa, metricKey)
	if err != nil {
		...
		return 0, "", currentTimestamp, fmt.Errorf("error update scaler spec: %w", err)
	}

	...
	autoScaler, ok := r.AutoscalerMap[metricKey]
	if !ok {
		return 0, "", currentTimestamp, fmt.Errorf("unsupported scaling strategy: %s", pa.Spec.ScalingStrategy)
	}

	scaleResult := autoScaler.Scale(int(originalReadyPodsCount), metricKey, currentTimestamp)
	if scaleResult.ScaleValid {
		...
		return scaleResult.DesiredPodCount, metricKey.MetricName, currentTimestamp, nil
	}

	return 0, "", currentTimestamp, fmt.Errorf("can not calculate metrics for scale %s", pa.Spec.ScaleTargetRef.Name)
}
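Given those return values, the old guard in reconcileCustomPA was effectively dead code; condensing the diff above into a before/after sketch:

// Before: never fires, because replicas is always 0 (never -1) on error.
if err != nil && metricDesiredReplicas == -1 {
	...
}

// After: errors are actually surfaced.
if err != nil {
	...
}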

@Jeffwan (Collaborator) commented Dec 9, 2024

@nwangfw I think we talked about the scale-to-0 stuff last Thursday. Could you double check this PR?

Jingyuan Zhang added 2 commits December 9, 2024 10:17
Keep readyPodsCount value intact and fix affected code.
@nwangfw (Collaborator) commented Dec 9, 2024

> @nwangfw I think we talked about the scale-to-0 stuff last Thursday. Could you double check this PR?

@Jeffwan Sure. Last week, we discussed the issue that minReplica must be set to 1 or the podautoscaler returns an error (Issue #481). I believe this PR addresses that case even when minReplica is not configured, as well as the 0-ready-pod case.

@zhangjyr (Collaborator, Author) commented Dec 9, 2024

> @Jeffwan Sure. Last week, we discussed the issue that minReplica must be set to 1 or the podautoscaler returns an error (Issue #481). I believe this PR addresses that case even when minReplica is not configured.

Right, this PR makes the 0-pod cases tolerable for the podautoscaler. However, as I mentioned in #507, the current implementation uses the pod count in the namespace, which deviates from the scaleTargetRef definition. So the behavior of #481 seems unpredictable: a100(1) plus a40(0) is OK (because there is 1 pod in the namespace), while a40(0) alone fails.

@nwangfw (Collaborator) commented Dec 9, 2024

> a100(1) plus a40(0) is OK (because there is 1 pod in the namespace), while a40(0) alone fails.

I am not sure that I fully understand this part. Can you elaborate on it more? Thanks!

@zhangjyr (Collaborator, Author) commented Dec 9, 2024

> a100(1) plus a40(0) is OK (because there is 1 pod in the namespace), while a40(0) alone fails.

I tried a configuration that sets the a100 deployment to replica 1 and the a40 deployment to replica 0. Without the fix in this PR, the podautoscaler currently reports no error. The reason is discussed in #507: the podautoscaler found one pod in the namespace, which satisfies it.

However, if both a100 and a40 are set to replica 0, the podautoscaler will trigger the error.
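To illustrate the distinction under discussion, here is a sketch of the two counting strategies using controller-runtime's client; the helper names are hypothetical and this is not the actual aibrix code:

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// countPodsInNamespace mirrors the behavior discussed in #507: every pod in
// the namespace counts, regardless of which deployment owns it, so a single
// a100 pod hides the fact that a40 has zero replicas.
func countPodsInNamespace(ctx context.Context, c client.Client, ns string) (int, error) {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(ns)); err != nil {
		return 0, err
	}
	return len(pods.Items), nil
}

// countPodsForTarget is the selector-scoped alternative: only pods matching
// the scale target's label selector count, so a40(0) is seen as 0 even when
// a100 pods exist in the same namespace.
func countPodsForTarget(ctx context.Context, c client.Client, ns string, selector map[string]string) (int, error) {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(ns), client.MatchingLabels(selector)); err != nil {
		return 0, err
	}
	return len(pods.Items), nil
}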

@nwangfw (Collaborator) commented Dec 9, 2024

> I tried a configuration that sets the a100 deployment to replica 1 and the a40 deployment to replica 0. Without the fix in this PR, the podautoscaler currently reports no error. The reason is discussed in #507: the podautoscaler found one pod in the namespace, which satisfies it.
>
> However, if both a100 and a40 are set to replica 0, the podautoscaler will trigger the error.

Got it. For the a100(1) and a40(0) case, you mean there is no error, but the scaling results are still not predictable due to the issue we found in PR #507, right?

@Jeffwan Jeffwan merged commit 4ebf0ff into main Dec 9, 2024
10 checks passed
@Jeffwan Jeffwan deleted the issues/481_PodAutoscaler_scaling_down_to_0 branch December 9, 2024 23:49
gangmuk pushed a commit that referenced this pull request Jan 25, 2025
* Bug fix

* Make APA comfortable with 0 originalReadyPodsCount for completeness.

* Verify ActivationScale > 1 settings.
Keep readyPodsCount value intact and fix affected code.

* Bug fix

* Document the new logic in comments.

---------

Co-authored-by: Jingyuan Zhang <[email protected]>