
Refactor metric source for customized protocol, port and path #511

Merged: 4 commits into main on Dec 10, 2024

Conversation

@kr11 kr11 (Collaborator) commented Dec 9, 2024

Pull Request Description

[Please provide a clear and concise description of your changes here]

Related Issues

Resolves: #494

Marking as WIP since more tests need to be done. We hope reviewers can first evaluate the reasonableness of the refactoring, especially for the gpu-optimizer.

The current version only supports the HTTP protocol and hard-codes the port to 8000.

We aim to accommodate the following metric sources in a unified format:

  1. support a pod metric source ({http[s]}://{pod_ip}:{port}/{path}) as well as a domain link (e.g. gpu-optimizer.aibrix-system.svc.cluster.local:8080)
  2. support both HTTP and HTTPS, and allow the port to be specified rather than hard-coded

To achieve this, we enhance the existing metric source handling:

  1. Refactor MetricSource: a MetricSource is now distinguished by MetricSourceType (domain or pod). We further add ProtocolType to specify http or https.
  2. For now we support exactly one MetricSource. We move PodAutoscalerSpec.TargetMetric and PodAutoscalerSpec.TargetValue into MetricSource.TargetMetric and MetricSource.TargetValue (see the spec sketch below).

The modified podautoscaler_types.go is as follows:

type MetricSourceType string

const (
  // POD need to scan all k8s pods to fetch the data
  POD MetricSourceType = "pod"
  // DOMAIN only need to access specified domain
  DOMAIN MetricSourceType = "domain"
)

type ProtocolType string

const (
  HTTP  ProtocolType = "http"
  HTTPS ProtocolType = "https"
)

// MetricSource defines an endpoint and path from which metrics are collected.
type MetricSource struct {
  // access an endpoint or scan a list of k8s pod
  MetricSourceType MetricSourceType `json:"metricSourceType"`
  // http or https
  ProtocolType ProtocolType `json:"protocolType"`
  // e.g. service1.example.com. meaningless for MetricSourceType.POD
  Endpoint string `json:"endpoint,omitempty"`
  // e.g. /api/metrics/cpu
  Path string `json:"path"`
  // e.g. 8080. meaningless for MetricSourceType.DOMAIN
  Port string `json:"port,omitempty"`
  // TargetMetric identifies the specific metric to monitor (e.g., kv_cache_utilization).
  TargetMetric string `json:"targetMetric"`
  // TargetValue sets the desired threshold for the metric (e.g., 50 for 50% utilization).
  TargetValue string `json:"targetValue"`
}
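
For context, a minimal sketch of how PodAutoscalerSpec could now reference the metric sources is shown below. Only the MetricsSources field comes from this PR; the other field names and types are assumptions inferred from the yaml example that follows, not the exact definitions in podautoscaler_types.go:

// Sketch only: fields other than MetricsSources are illustrative assumptions.
type PodAutoscalerSpec struct {
  // reference to the workload being scaled (the actual reference type may differ)
  ScaleTargetRef corev1.ObjectReference `json:"scaleTargetRef"`
  MinReplicas    *int32                 `json:"minReplicas,omitempty"`
  MaxReplicas    int32                  `json:"maxReplicas"`
  // TargetMetric and TargetValue now live inside each MetricSource;
  // for now exactly one entry is supported.
  MetricsSources  []MetricSource `json:"metricsSources,omitempty"`
  ScalingStrategy string         `json:"scalingStrategy"`
}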

The pa yaml example is as follows:

spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mock-llama2-7b
  minReplicas: 0
  maxReplicas: 10
  metricsSources:
    - metricSourceType: pod
      protocolType: http
      port: 8080
      path: metrics
      targetMetric: "avg_prompt_throughput_toks_per_s"
      targetValue: "1"
  scalingStrategy: "KPA"
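
For comparison, a domain-based source (metricSourceType: domain) could be configured along these lines; the endpoint reuses the gpu-optimizer address from the goals above, while the metric name and target value are illustrative placeholders taken from the field comments in MetricSource:

  metricsSources:
    - metricSourceType: domain
      protocolType: http
      endpoint: gpu-optimizer.aibrix-system.svc.cluster.local:8080
      path: metrics
      targetMetric: "kv_cache_utilization"   # illustrative metric name
      targetValue: "50"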

We now allow users to define all metric attributes (domain, port, protocol, path, metric name) and pass them down to the underlying fetch functions:

func (f *RestMetricsFetcher) FetchPodMetrics(ctx context.Context, pod v1.Pod, source autoscalingv1alpha1.MetricSource) (float64, error) {
  return f.FetchMetric(ctx, source.ProtocolType, fmt.Sprintf("%s:%s", pod.Status.PodIP, source.Port), source.Path, source.TargetMetric)
}

func (f *RestMetricsFetcher) FetchMetric(ctx context.Context, protocol autoscalingv1alpha1.ProtocolType, endpoint, path, metricName string) (float64, error) {
  // Use http to fetch endpoint
  url := fmt.Sprintf("%s://%s/%s", protocol, endpoint, strings.TrimLeft(path, "/"))
  // xxx
}
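
The body of FetchMetric is elided above ("// xxx"). A minimal sketch of what such a fetch might look like follows; this is not the actual implementation, and parseMetricFromBody is a hypothetical helper standing in for whatever Prometheus-format parsing the fetcher really does (assumed imports: context, fmt, io, net/http, strings):

func (f *RestMetricsFetcher) FetchMetric(ctx context.Context, protocol autoscalingv1alpha1.ProtocolType, endpoint, path, metricName string) (float64, error) {
  // Build the URL from the user-supplied protocol, endpoint and path.
  url := fmt.Sprintf("%s://%s/%s", protocol, endpoint, strings.TrimLeft(path, "/"))

  req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
  if err != nil {
    return 0, err
  }
  resp, err := http.DefaultClient.Do(req)
  if err != nil {
    return 0, fmt.Errorf("failed to fetch metrics from %s: %w", url, err)
  }
  defer resp.Body.Close()

  body, err := io.ReadAll(resp.Body)
  if err != nil {
    return 0, err
  }
  // parseMetricFromBody (hypothetical) scans the Prometheus-format response
  // for metricName and returns its value as a float64.
  return parseMetricFromBody(body, metricName)
}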


@zhangjyr zhangjyr (Collaborator) commented Dec 9, 2024

I noticed that with this PR, reading metrics now potentially supports HTTPS. However, the code in the cache still hard-codes the URL as:

func (c *Cache) updatePodMetrics() {
	c.mu.Lock()
	defer c.mu.Unlock()

	for _, pod := range c.Pods {
		...
		// We should use the primary container port. In the future, we can decide whether to use sidecar container's port
		url := fmt.Sprintf("http://%s:%d/metrics", pod.Status.PodIP, podPort)
		allMetrics, err := metrics.ParseMetricsURL(url)
		if err != nil {
			klog.Warningf("Error parsing metric families: %v\n", err)
		}
		...
	}
	...
}

Will this introduce some inconsistency? Should the cache behavior be included in this PR? Or maybe KPA pod metrics should reuse the metric-loading logic in the cache?

@kr11 kr11 force-pushed the kangrong/fix/customized_protocol_and_metrics_port branch from ff8fd91 to 1ce3333 on December 10, 2024 05:45
@kr11 kr11 (Collaborator, Author) commented Dec 10, 2024

Yes, it's a good question. We have multiple metric fetcher implementations scattered across the autoscaler, cache, and model adapter components. @Jeffwan plans to refactor this code to make it more concise.

I have checked the code invocation hierarchy, and I think it's challenging to address this issue within this PR: the cache update process is independent of the metric source defined in pa.yaml, and its port and path are both hard-coded.

I haven't yet thought of a straightforward way to resolve this, since the changes in this PR depend on the metric source, which comes from the user's pa.yaml configuration.

@kr11 kr11 changed the title [WIP] Refactor metric source for customized protocol, port and path Refactor metric source for customized protocol, port and path Dec 10, 2024
@Jeffwan Jeffwan (Collaborator) commented Dec 10, 2024

This is a good finding! We do not need to care about the metric fetcher in cache.go at this moment; it will be refactored later. The primary goals of this PR are to:

  • support standalone autoscaler customization
  • support heterogeneous serving scenarios

@Jeffwan Jeffwan (Collaborator) left a comment

The change looks good to me.

@Jeffwan Jeffwan merged commit a5e9849 into main Dec 10, 2024
10 checks passed
@Jeffwan Jeffwan deleted the kangrong/fix/customized_protocol_and_metrics_port branch December 10, 2024 18:34
gangmuk pushed a commit that referenced this pull request Jan 25, 2025
* refactor pa_types.go and modify metric client, fetcher

* fix test bug

* conduct ./hack/update-codegen.sh

* update config/crd/autoscaling, fix function name, update pa.yaml
Successfully merging this pull request may close these issues.

Support customized protocol and metrics port in autoscaler