[Feat] Add advanced Kubernetes configuration options for router deployment #151

liaddrori1 · 2025-02-18T16:23:30Z

Description

This PR improves the Kubernetes configuration for router deployment by adding advanced scheduling and monitoring features.

Currently, monitoring requires installing prom-stack with custom configurations. The recommended approach is to provide an option to create a ServiceMonitor , enabling monitoring with any Prometheus Operator in the cluster.

Key Changes

Advanced Scheduling Options
- Added support for nodeSelector, affinity, and tolerations in the router deployment
- These options provide greater control over pod placement and scheduling
Service Configuration Improvements
- Introduced flexible port naming through servicePortName configuration
- Enhanced service template to support better port management
Monitoring Integration with Existing Prometheus Stack
- Added ServiceMonitor configuration that integrates with existing prometheus-operator
- Note: This implementation leverages your existing Prometheus stack installation
- No additional Prometheus installation required
- ServiceMonitor will be automatically discovered by your prometheus-operator

Technical Details

The router deployment template now supports standard Kubernetes scheduling features
ServiceMonitor configuration is conditionally enabled via .Values.routerSpec.serviceMonitor.enabled
ServiceMonitor features:
- Configurable scrape interval (default: 30s)
- Adjustable scrape timeout (default: 10s)
- Customizable metrics path
- Support for additional labels to match your Prometheus setup
Port naming consistency maintained across service and deployment configurations

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE

Make sure the code changes pass the pre-commit checks.
Sign-off your commit by using -s when doing git commit
Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].

Detailed Checklist (Click to Expand)

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

[Bugfix] for bug fixes.
[CI/Build] for build or continuous integration improvements.
[Doc] for documentation fixes and improvements.
[Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
[Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
[Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR need to meet the following code quality standards:

Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
The code need to be well-documented to ensure future contributors can easily understand the code.
Please include sufficient tests to ensure the change is stay correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11
, Shaoting-Feng or ApostaC.

- Enhance values schema with comprehensive node, affinity, and service monitor configuration - Update values.yaml with new configuration options - Modify router deployment template to support nodeSelector, affinity, and tolerations - Add ServiceMonitor template for Prometheus metrics scraping - Update router service to use configurable port name Signed-off-by: Liad Drori <[email protected]>

YuhanLiu11 · 2025-02-18T16:28:04Z

Thanks for your contribution! We will take a look soon.

Shaoting-Feng · 2025-02-18T21:44:54Z

I am getting the following error when enabling the service monitor: Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: resource mapping not found for name: "vllm-router-service-monitor" namespace: "" from "": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1" ensure CRDs are installed first.

Could you provide instructions on installing Prometheus Operator?

liaddrori1 · 2025-02-22T14:13:52Z

You can use any prometheus-operator install you want. For example you can install kube-prometheus-stack

You just need to make sure you have the CRD's installed in your cluster

Apsu · 2025-02-22T21:33:57Z

You can use any prometheus-operator install you want. For example you can install kube-prometheus-stack

You just need to make sure you have the CRD's installed in your cluster

Probably useful to include the example install of prometheus-operator-crds from the same helm repo. And would also help to modify the observability values.yaml to set serviceMonitorSelectorNilUsesHelmValues: false instead of this https://github.com/vllm-project/production-stack/blob/main/observability/values.yaml#L96-L108 -- while the current approach skirts this common footgun in vanilla kube-prometheus-stack installs, it's inferior to managing ServiceMonitors separately as your PR is doing, but that helm value change is the usual companion of the approach (and the CRD installation sequencing of course).

ApostaC · 2025-02-28T00:31:36Z

The code in this PR looks good to me. We have a basic setup script of kube-prometheus-stack here, so I guess using ServiceMonitor will be a better idea than customizing the values.yaml when deploying the kube Prometheus stack.

ApostaC

LGTM!
@Shaoting-Feng @YuhanLiu11 when you have time, can you try enabling the ServiceMonitor with kube-prometheus-stack installed?

Shaoting-Feng self-requested a review February 18, 2025 16:53

YuhanLiu11 self-requested a review February 18, 2025 17:29

ApostaC approved these changes Feb 28, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feat] Add advanced Kubernetes configuration options for router deployment #151

[Feat] Add advanced Kubernetes configuration options for router deployment #151

liaddrori1 commented Feb 18, 2025

YuhanLiu11 commented Feb 18, 2025

Shaoting-Feng commented Feb 18, 2025

liaddrori1 commented Feb 22, 2025

Apsu commented Feb 22, 2025

ApostaC commented Feb 28, 2025

ApostaC left a comment

[Feat] Add advanced Kubernetes configuration options for router deployment #151

Are you sure you want to change the base?

[Feat] Add advanced Kubernetes configuration options for router deployment #151

Conversation

liaddrori1 commented Feb 18, 2025

Description

Key Changes

Technical Details

PR Title and Classification

Code Quality

DCO and Signed-off-by

What to Expect for the Reviews

YuhanLiu11 commented Feb 18, 2025

Shaoting-Feng commented Feb 18, 2025

liaddrori1 commented Feb 22, 2025

Apsu commented Feb 22, 2025

ApostaC commented Feb 28, 2025

ApostaC left a comment

Choose a reason for hiding this comment