
Refactor metric source for customized protocol, port and path #511

Merged: 4 commits into main on Dec 10, 2024

Conversation

@kr11 kr11 (Collaborator) commented Dec 9, 2024

Pull Request Description

[Please provide a clear and concise description of your changes here]

Related Issues

Resolves: #494

Marking as WIP since more tests need to be done. We hope reviewers can first evaluate the reasonableness of the refactoring, especially for the gpu-optimizer.

The current version only supports the HTTP protocol and hard-codes the port to 8000.

We aim to accommodate the following metric sources in a unified format:

  1. support a pod metric source ({http[s]}://{pod_ip}:{port}/{path}) as well as a domain link (e.g. gpu-optimizer.aibrix-system.svc.cluster.local:8080)
  2. support both HTTP and HTTPS, and allow the port to be specified rather than hard-coded

To achieve this, we enhance the existing metric source handling:

  1. Refactor MetricSource: a MetricSource is now distinguished by MetricSourceType (domain or pod). We further add ProtocolType to specify http or https.
  2. For now we support exactly one MetricSource. We move PodAutoscalerSpec.TargetMetric and PodAutoscalerSpec.TargetValue into MetricSource.TargetMetric and MetricSource.TargetValue (see the spec sketch below).

The modified podautoscaler_types.go is as follows:

type MetricSourceType string

const (
  // POD need to scan all k8s pods to fetch the data
  POD MetricSourceType = "pod"
  // DOMAIN only need to access specified domain
  DOMAIN MetricSourceType = "domain"
)

type ProtocolType string

const (
  HTTP  ProtocolType = "http"
  HTTPS ProtocolType = "https"
)

// MetricSource defines an endpoint and path from which metrics are collected.
type MetricSource struct {
  // access an endpoint or scan a list of k8s pod
  MetricSourceType MetricSourceType `json:"metricSourceType"`
  // http or https
  ProtocolType ProtocolType `json:"protocolType"`
  // e.g. service1.example.com. meaningless for MetricSourceType.POD
  Endpoint string `json:"endpoint,omitempty"`
  // e.g. /api/metrics/cpu
  Path string `json:"path"`
  // e.g. 8080. meaningless for MetricSourceType.DOMAIN
  Port string `json:"port,omitempty"`
  // TargetMetric identifies the specific metric to monitor (e.g., kv_cache_utilization).
  TargetMetric string `json:"targetMetric"`
  // TargetValue sets the desired threshold for the metric (e.g., 50 for 50% utilization).
  TargetValue string `json:"targetValue"`
}
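
For context, a minimal sketch of how PodAutoscalerSpec could now reference the metric sources is shown below. Only the MetricsSources field comes from this PR; the other field names and types are assumptions inferred from the yaml example that follows, not the exact definitions in podautoscaler_types.go:

// Sketch only: fields other than MetricsSources are illustrative assumptions.
type PodAutoscalerSpec struct {
  // reference to the workload being scaled (the actual reference type may differ)
  ScaleTargetRef corev1.ObjectReference `json:"scaleTargetRef"`
  MinReplicas    *int32                 `json:"minReplicas,omitempty"`
  MaxReplicas    int32                  `json:"maxReplicas"`
  // TargetMetric and TargetValue now live inside each MetricSource;
  // for now exactly one entry is supported.
  MetricsSources  []MetricSource `json:"metricsSources,omitempty"`
  ScalingStrategy string         `json:"scalingStrategy"`
}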

The pa yaml example is as follows:

spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mock-llama2-7b
  minReplicas: 0
  maxReplicas: 10
  metricsSources:
    - metricSourceType: pod
      protocolType: http
      port: 8080
      path: metrics
      targetMetric: "avg_prompt_throughput_toks_per_s"
      targetValue: "1"
  scalingStrategy: "KPA"
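
For comparison, a domain-based source (metricSourceType: domain) could be configured along these lines; the endpoint reuses the gpu-optimizer address from the goals above, while the metric name and target value are illustrative placeholders taken from the field comments in MetricSource:

  metricsSources:
    - metricSourceType: domain
      protocolType: http
      endpoint: gpu-optimizer.aibrix-system.svc.cluster.local:8080
      path: metrics
      targetMetric: "kv_cache_utilization"   # illustrative metric name
      targetValue: "50"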

We now allow users to define all metric attributes (domain, port, protocol, path, metric name) and pass them down to the underlying fetch functions:

func (f *RestMetricsFetcher) FetchPodMetrics(ctx context.Context, pod v1.Pod, source autoscalingv1alpha1.MetricSource) (float64, error) {
  return f.FetchMetric(ctx, source.ProtocolType, fmt.Sprintf("%s:%s", pod.Status.PodIP, source.Port), source.Path, source.TargetMetric)
}

func (f *RestMetricsFetcher) FetchMetric(ctx context.Context, protocol autoscalingv1alpha1.ProtocolType, endpoint, path, metricName string) (float64, error) {
  // Use http to fetch endpoint
  url := fmt.Sprintf("%s://%s/%s", protocol, endpoint, strings.TrimLeft(path, "/"))
  // xxx
}
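
The body of FetchMetric is elided above ("// xxx"). A minimal sketch of what such a fetch might look like follows; this is not the actual implementation, and parseMetricFromBody is a hypothetical helper standing in for whatever Prometheus-format parsing the fetcher really does (assumed imports: context, fmt, io, net/http, strings):

func (f *RestMetricsFetcher) FetchMetric(ctx context.Context, protocol autoscalingv1alpha1.ProtocolType, endpoint, path, metricName string) (float64, error) {
  // Build the URL from the user-supplied protocol, endpoint and path.
  url := fmt.Sprintf("%s://%s/%s", protocol, endpoint, strings.TrimLeft(path, "/"))

  req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
  if err != nil {
    return 0, err
  }
  resp, err := http.DefaultClient.Do(req)
  if err != nil {
    return 0, fmt.Errorf("failed to fetch metrics from %s: %w", url, err)
  }
  defer resp.Body.Close()

  body, err := io.ReadAll(resp.Body)
  if err != nil {
    return 0, err
  }
  // parseMetricFromBody (hypothetical) scans the Prometheus-format response
  // for metricName and returns its value as a float64.
  return parseMetricFromBody(body, metricName)
}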


@zhangjyr zhangjyr (Collaborator) commented Dec 9, 2024

I noticed that with this PR, reading metrics now potentially supports HTTPS. However, the code in the cache still hard-codes the URL as:

func (c *Cache) updatePodMetrics() {
	c.mu.Lock()
	defer c.mu.Unlock()

	for _, pod := range c.Pods {
		...
		// We should use the primary container port. In the future, we can decide whether to use sidecar container's port
		url := fmt.Sprintf("http://%s:%d/metrics", pod.Status.PodIP, podPort)
		allMetrics, err := metrics.ParseMetricsURL(url)
		if err != nil {
			klog.Warningf("Error parsing metric families: %v\n", err)
		}
		...
	}
	...
}

Will this introduce some inconsistency? Should the cache behavior be included in this PR? Or maybe KPA pod metrics should reuse the metric-loading logic in the cache?

@kr11 kr11 force-pushed the kangrong/fix/customized_protocol_and_metrics_port branch from ff8fd91 to 1ce3333 on December 10, 2024 05:45
@kr11 kr11 (Collaborator, Author) commented Dec 10, 2024

Yes, it's a good question. We have multiple metric fetcher implementations scattered across the autoscaler, cache, and model adapter components. @Jeffwan plans to refactor this code to make it more concise.

I have checked the code invocation hierarchy, and I think it's challenging to address this issue within this PR: the cache update process is independent of the metric source defined in pa.yaml, and its port and path are both hard-coded.

I haven't yet thought of a straightforward way to resolve this, since the changes in this PR depend on the metric source, which comes from the user's pa.yaml configuration.

@kr11 kr11 changed the title [WIP] Refactor metric source for customized protocol, port and path Refactor metric source for customized protocol, port and path Dec 10, 2024
@Jeffwan Jeffwan (Collaborator) commented Dec 10, 2024

This is a good finding! We do not need to care about the metric fetcher in cache.go at this moment; it will be refactored later. The primary goals of this PR are to:

  • support standalone autoscaler customization
  • support heterogeneous serving scenarios

@Jeffwan Jeffwan (Collaborator) left a comment

The change looks good to me.

@Jeffwan Jeffwan merged commit a5e9849 into main Dec 10, 2024
10 checks passed
@Jeffwan Jeffwan deleted the kangrong/fix/customized_protocol_and_metrics_port branch December 10, 2024 18:34
gangmuk pushed a commit that referenced this pull request Jan 25, 2025
* refactor pa_types.go and modify metric client, fetcher

* fix test bug

* conduct ./hack/update-codegen.sh

* update config/crd/autoscaling, fix function name, update pa.yaml
Successfully merging this pull request may close these issues.

Support customized protocol and metrics port in autoscaler