Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feat] Add advanced Kubernetes configuration options for router deployment #151

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

liaddrori1
Copy link
Contributor

Description

This PR improves the Kubernetes configuration for router deployment by adding advanced scheduling and monitoring features.

Currently, monitoring requires installing prom-stack with custom configurations. The recommended approach is to provide an option to create a ServiceMonitor , enabling monitoring with any Prometheus Operator in the cluster.

Key Changes

  1. Advanced Scheduling Options

    • Added support for nodeSelector, affinity, and tolerations in the router deployment
    • These options provide greater control over pod placement and scheduling
  2. Service Configuration Improvements

    • Introduced flexible port naming through servicePortName configuration
    • Enhanced service template to support better port management
  3. Monitoring Integration with Existing Prometheus Stack

    • Added ServiceMonitor configuration that integrates with existing prometheus-operator
    • Note: This implementation leverages your existing Prometheus stack installation
    • No additional Prometheus installation required
    • ServiceMonitor will be automatically discovered by your prometheus-operator

Technical Details

  • The router deployment template now supports standard Kubernetes scheduling features
  • ServiceMonitor configuration is conditionally enabled via .Values.routerSpec.serviceMonitor.enabled
  • ServiceMonitor features:
    • Configurable scrape interval (default: 30s)
    • Adjustable scrape timeout (default: 10s)
    • Customizable metrics path
    • Support for additional labels to match your Prometheus setup
  • Port naming consistency maintained across service and deployment configurations

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


  • Make sure the code changes pass the pre-commit checks.
  • Sign-off your commit by using -s when doing git commit
  • Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].
Detailed Checklist (Click to Expand)

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
  • [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR need to meet the following code quality standards:

  • Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
  • The code need to be well-documented to ensure future contributors can easily understand the code.
  • Please include sufficient tests to ensure the change is stay correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11
, Shaoting-Feng or ApostaC.

- Enhance values schema with comprehensive node, affinity, and service monitor configuration
- Update values.yaml with new configuration options
- Modify router deployment template to support nodeSelector, affinity, and tolerations
- Add ServiceMonitor template for Prometheus metrics scraping
- Update router service to use configurable port name

Signed-off-by: Liad Drori <[email protected]>
@YuhanLiu11
Copy link
Collaborator

Thanks for your contribution! We will take a look soon.

@Shaoting-Feng Shaoting-Feng self-requested a review February 18, 2025 16:53
@YuhanLiu11 YuhanLiu11 self-requested a review February 18, 2025 17:29
@Shaoting-Feng
Copy link
Collaborator

I am getting the following error when enabling the service monitor: Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: resource mapping not found for name: "vllm-router-service-monitor" namespace: "" from "": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1" ensure CRDs are installed first.

Could you provide instructions on installing Prometheus Operator?

@liaddrori1
Copy link
Contributor Author

You can use any prometheus-operator install you want. For example you can install kube-prometheus-stack

You just need to make sure you have the CRD's installed in your cluster

@Apsu
Copy link

Apsu commented Feb 22, 2025

You can use any prometheus-operator install you want. For example you can install kube-prometheus-stack

You just need to make sure you have the CRD's installed in your cluster

Probably useful to include the example install of prometheus-operator-crds from the same helm repo. And would also help to modify the observability values.yaml to set serviceMonitorSelectorNilUsesHelmValues: false instead of this https://github.com/vllm-project/production-stack/blob/main/observability/values.yaml#L96-L108 -- while the current approach skirts this common footgun in vanilla kube-prometheus-stack installs, it's inferior to managing ServiceMonitors separately as your PR is doing, but that helm value change is the usual companion of the approach (and the CRD installation sequencing of course).

@ApostaC
Copy link
Collaborator

ApostaC commented Feb 28, 2025

The code in this PR looks good to me. We have a basic setup script of kube-prometheus-stack here, so I guess using ServiceMonitor will be a better idea than customizing the values.yaml when deploying the kube Prometheus stack.

Copy link
Collaborator

@ApostaC ApostaC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
@Shaoting-Feng @YuhanLiu11 when you have time, can you try enabling the ServiceMonitor with kube-prometheus-stack installed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants