[Feat] Integrate deployment configurations and fix autoscaler/gpu optimizer connectivity #500

Merged


zhangjyr
Collaborator

@zhangjyr zhangjyr commented Dec 6, 2024

Pull Request Description

This PR fixes the connectivity problem between the podautoscaler and the GPU optimizer by:

  1. Updating the k8s role definition to use a ClusterRole, so the GPU optimizer now monitors all deployments with a model label in all namespaces.
  2. Including the changes from "[WIP] Add GPU Optimizer deployment and update configurations" (#480). Deployment configurations are integrated into config/default.

Note: I kept the deployment.yaml under the GPU optimizer (not deleted) for debugging purposes.
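
For readers unfamiliar with the namespace issue, here is a minimal sketch of the kind of cluster-scoped RBAC objects the first item describes. It is not taken from this PR's diff: the role name, service account name, and namespace are hypothetical placeholders. A namespaced Role can only grant access within its own namespace, while a ClusterRole bound through a ClusterRoleBinding grants it across all namespaces. The manifests are built as Python dicts and printed as JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json

# Hypothetical names for illustration only; the PR text does not list them.
ROLE_NAME = "gpu-optimizer-deployment-reader"
SERVICE_ACCOUNT = "gpu-optimizer-sa"
SERVICE_ACCOUNT_NAMESPACE = "aibrix-system"


def cluster_role(name):
    """Cluster-scoped read access to Deployments in every namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            {
                "apiGroups": ["apps"],
                "resources": ["deployments"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }


def cluster_role_binding(role_name, sa_name, sa_namespace):
    """Bind the ClusterRole to the service account the optimizer runs as."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": f"{role_name}-binding"},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": role_name,
        },
        "subjects": [
            {
                "kind": "ServiceAccount",
                "name": sa_name,
                "namespace": sa_namespace,
            }
        ],
    }


if __name__ == "__main__":
    manifests = {
        "apiVersion": "v1",
        "kind": "List",
        "items": [
            cluster_role(ROLE_NAME),
            cluster_role_binding(
                ROLE_NAME, SERVICE_ACCOUNT, SERVICE_ACCOUNT_NAMESPACE
            ),
        ],
    }
    print(json.dumps(manifests, indent=2))
```

Filtering by the model label would then happen on the client side of the list/watch call (for example with a label selector); the exact label key is not stated here, so it is left out of the sketch.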

Related Issues

Resolves: #484 #480 #459

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

nwangfw and others added 4 commits December 4, 2024 13:13
…_failed_to_fetch_metrics_from_MetricSource

# Conflicts:
#	development/simulator/deployment-a100.yaml
#	development/simulator/deployment-a40.yaml
@zhangjyr zhangjyr added this to the v0.2.0 milestone Dec 6, 2024
@zhangjyr zhangjyr requested review from Jeffwan and nwangfw December 6, 2024 21:24
@zhangjyr
Collaborator Author

zhangjyr commented Dec 6, 2024

I moved some comments from #480 here:
If the server is down, what's the autoscaler behavior? Have you tested such behaviors?
We haven't tested this. Ideally, if the GPU optimizer were down, the podautoscaler would be unable to read metrics and would keep the replicas intact. After the GPU optimizer resumes, it should not output valid metrics until a new solution is reached with sufficient load traces. Most of the changes involved in this approach will be easy to implement. However, we'll need to prolong the load trace timeout in Redis (e.g., up to 300 seconds, which aligns with the current GPU optimizer window) so the GPU optimizer can restore the solution quickly enough.
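
To make the intended (and, as noted, untested) behavior concrete, here is a rough sketch under stated assumptions. None of these names come from the aibrix code base: `desired_replicas`, `TraceWindow`, and `MetricUnavailable` are hypothetical, the in-memory window stands in for the Redis-backed load traces, and `max()` is only a placeholder for the real optimization. It shows the two ideas: the autoscaler holds the current replica count when no valid metric is available, and the optimizer reports nothing until enough unexpired traces exist.

```python
import time
from typing import Callable, List, Optional, Tuple

TRACE_TTL_SECONDS = 300  # assumed to match the GPU optimizer window


class MetricUnavailable(Exception):
    """Raised when the metric source cannot be reached."""


class TraceWindow:
    """In-memory stand-in for load traces stored in Redis with a timeout."""

    def __init__(
        self,
        ttl: float = TRACE_TTL_SECONDS,
        min_traces: int = 3,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl
        self._min_traces = min_traces
        self._clock = clock
        self._traces: List[Tuple[float, int]] = []

    def add(self, load: int) -> None:
        self._traces.append((self._clock(), load))

    def solution(self) -> Optional[int]:
        """Return a replica target, or None while traces are insufficient."""
        now = self._clock()
        live = [load for ts, load in self._traces if now - ts <= self._ttl]
        if len(live) < self._min_traces:
            return None  # no valid metric until a new solution is reached
        return max(live)  # placeholder for the real optimizer output


def desired_replicas(
    current: int, fetch_target: Callable[[], Optional[int]]
) -> int:
    """Keep replicas intact when the metric is unreachable or not yet valid."""
    try:
        target = fetch_target()
    except MetricUnavailable:
        return current
    return current if target is None else target
```

With a longer trace timeout, traces written before an optimizer restart are still inside the window when it comes back, so `solution()` can return a value sooner instead of waiting for a full window of fresh traces.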

@nwangfw
Collaborator

nwangfw commented Dec 6, 2024

Right now, we are modifying code under python/aibrix/aibrix/gpu_optimizer. Can we delete the python/aibrix/aibrix/gpuoptimizer folder?

@Jeffwan
Collaborator

Jeffwan commented Dec 6, 2024

@nwangfw we should not have such a folder? The upstream folder has been renamed.

@nwangfw
Collaborator

nwangfw commented Dec 6, 2024

@nwangfw we should not have such a folder? The upstream folder has been renamed.

Never mind. It seems it has already been deleted. Please ignore this, and thanks.

Collaborator

@Jeffwan Jeffwan left a comment


/lgtm

@Jeffwan Jeffwan merged commit dd2aa26 into main Dec 7, 2024
10 checks passed
@Jeffwan Jeffwan deleted the issues/484_Controller_failed_to_fetch_metrics_from_MetricSource branch December 7, 2024 01:38
gangmuk pushed a commit that referenced this pull request Jan 25, 2025
…imizer connectivity (#500)

* Add GPU Optimizer deployment and update configurations

* Fix k8s accessibility regard namespaces. GPU optimizer now monitor all namespaces with model label.

* Lint fix

* Deployment clean-up

* Update README.md

---------

Co-authored-by: Ning Wang <[email protected]>
Co-authored-by: Jingyuan Zhang <[email protected]>
Development

Successfully merging this pull request may close these issues.

Controller failed to fetch metrics from MetricSource
3 participants