Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Scaler: Resolve Issues with Metric Parameter Updates in Multiple KPAs #437

Merged
merged 1 commit into from
Nov 30, 2024

Conversation

kr11
Copy link
Collaborator

@kr11 kr11 commented Nov 27, 2024

Pull Request Description

[Please provide a clear and concise description of your changes here]

Related Issues

Resolves: #410 #436

Description

When multiple KPAAutoScaler instances are created simultaneously, or when a same KPA is updated multiple times, there is a possibility that parameter values, particularly stateful metric parameters (like window length), may not update successfully. This PR addresses issue #406 and may also help resolve issue #436.

Existing Issues

  1. The PodAutoscalerReconciler's AutoscalerMap is structured as {KPA: KPAAutoScaler, APA: APAAutoScaler}, meaning all KPAs share the same KPAAutoScaler object. During periodic calls to the reconcile function, the PA objects always update the same Autoscaler.
  2. All KPA share the same KpaMetricClient, which maintains a dictionary mapping from NamespaceNameMetric to PanicWindow and StableWindow. The setting of panicWindowDuration(length of panic window) and stableWindowDuration does not take effect.

Modification

  1. The PodAutoscalerReconciler's autoscalerMap is changed from {KPA: KPAAutoScaler, APA: APAAutoScaler} to NamespaceNameMetric -> KPAAutoScaler. Each PA has its own independent Context and metricClient.
  2. We no longer create KPAAutoScaler instances in newReconciler., but waits until the reconcile function to create (if it does not exist) or update (if it already exists).
  3. Each KpaMetricClient maintains only one panicWindow and one stableWindow, instead of maintaining metrics for multiple PAs.
  4. Corrected the issue where settings for panicWindowDuration and stableWindowDuration were not taking effect.

TODO:

  1. Currently, updates to existing panicWindowDuration and stableWindowDuration are not allowed, as this involves the problem of how metric data in the old window can be transferred to the new window. We will design this carefully in the future.
  2. How should we support the deletion of PAs?

Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

@kr11 kr11 changed the title [WIP] refactor scaler: init and update [WIP] Refactor Scaler: Resolve Issues with Metric Parameter Updates in Multiple KPAs Nov 27, 2024
@kr11 kr11 changed the title [WIP] Refactor Scaler: Resolve Issues with Metric Parameter Updates in Multiple KPAs Refactor Scaler: Resolve Issues with Metric Parameter Updates in Multiple KPAs Nov 27, 2024
Copy link
Collaborator

@Jeffwan Jeffwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am curious whether we can make scaling context a map instead of entire autoscaler.

@@ -54,42 +54,28 @@ type KPAMetricsClient struct {
// are collected and processed within the sliding window.
granularity time.Duration
// the difference between stable and panic metrics is the time window range
panicWindowDict map[NamespaceNameMetric]*aggregation.TimeWindow
stableWindowDict map[NamespaceNameMetric]*aggregation.TimeWindow
panicWindow *aggregation.TimeWindow
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

client is not singleton now? it only response for one metric?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

client is not singleton now? it only response for one metric?

According to KNative, it employs a global metricCollector that maintains a map of metric windows for each Scaler. This global metricCollector is passed into NewAutoscaler so that all scaler.metricClient refer to the same object metricCollector.
In this PR, we mainly fix the bugs that we cannot assign time window length correctly according to individual scaler context.
Considering we have planned to revamp and streamline metrics management across the autoscaler, model adapter, and other components, I recommend we defer the transition of the metric client from individual to centralized handling to the upcoming PRs.

@@ -112,7 +98,7 @@ func (c *KPAMetricsClient) UpdateMetrics(now time.Time, metricKey NamespaceNameM
defer c.collectionsMutex.Unlock()

// Update metrics into the window for tracking
err := c.UpdateMetricIntoWindow(metricKey, now, sumMetricValue)
err := c.UpdateMetricIntoWindow(now, sumMetricValue)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does it know the metrics now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now metric client is response for single scaler.

@@ -74,36 +74,9 @@ func newReconciler(mgr manager.Manager) (reconcile.Reconciler, error) {
Mapper: mgr.GetRESTMapper(),
resyncInterval: 10 * time.Second, // TODO: this should be override by an environment variable
eventCh: make(chan event.GenericEvent),
AutoscalerMap: make(map[metrics.NamespaceNameMetric]scaler.Scaler),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

em. in this case, does it mean multiple HPA, KPA or APA scalers are flatten?
We used to have 3 singleton autoscaler

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

em. in this case, does it mean multiple HPA, KPA or APA scalers are flatten? We used to have 3 singleton autoscaler

I prefer to flatten them, as we need to manage individual object per scaler rather than one for each PA type. Considering we create multiple scalers with the same PA-Type (e.g. KPA), each one has a different context, metric window, and other stateful attributes, so we cannot merge them into one scaler object

@kr11
Copy link
Collaborator Author

kr11 commented Nov 29, 2024

i am curious whether we can make scaling context a map instead of entire autoscaler.

Besides the context (stateless configuration), KPA and APA also possess stateful attributes such as panicTime, metrics, and DelayWindow.

I believe it's necessary to retain the entire scaler object, not just the context, even though we plan to centralize the metric window into a metricCollector in the future.

Copy link
Collaborator

@Jeffwan Jeffwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@Jeffwan
Copy link
Collaborator

Jeffwan commented Nov 30, 2024

I had a brief discussion with @kr11 offline and get concensus on the flat structure to overcome the existing challenges. This can be merged.

@Jeffwan Jeffwan merged commit 9c85745 into main Nov 30, 2024
10 checks passed
@Jeffwan Jeffwan deleted the kangrong/fix/multi_pa_metric_windows branch November 30, 2024 17:59
gangmuk pushed a commit that referenced this pull request Jan 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Some stateful PA attributes cannot be refreshed in reconcile
2 participants