[Feat] Integrate deployment configurations and fix autoscaler/gpu optimizer connectivity #500

zhangjyr · 2024-12-06T21:24:33Z

Pull Request Description

This PR fixes the connectivity problem between the podautoscaler and the GPU optimizer by:

Updating the k8s role definition to use a ClusterRole, so the GPU optimizer now monitors all deployments with the model label in all namespaces.
Including the changes from [WIP] Add GPU Optimizer deployment and update configurations #480. Deployment configurations are integrated into config/default.

Note: I kept the deployment.yaml under GPU optimizer (not deleted) for debugging purposes.
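For readers unfamiliar with the Role vs. ClusterRole distinction, here is a minimal sketch of what cluster-wide read access to deployments looks like, written as plain Python dicts that mirror Kubernetes RBAC manifests. It is not the manifest from this PR: the role name, service account, namespace, and label key below are all hypothetical placeholders chosen for illustration.

```python
import json

# All four names below are hypothetical; the PR text does not state them.
ROLE_NAME = "gpu-optimizer-deployment-reader"
SERVICE_ACCOUNT = "gpu-optimizer"
SERVICE_ACCOUNT_NAMESPACE = "aibrix-system"
MODEL_LABEL_KEY = "model.aibrix.ai/name"


def build_cluster_role(name: str) -> dict:
    """ClusterRole manifest: read-only access to Deployments in every namespace.

    A ClusterRole is not namespaced, so metadata carries no namespace field.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            {
                "apiGroups": ["apps"],
                "resources": ["deployments"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }


def build_cluster_role_binding(name: str, role_name: str,
                               service_account: str, namespace: str) -> dict:
    """ClusterRoleBinding manifest granting the ClusterRole to one service account."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": name},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": role_name,
        },
        "subjects": [
            {
                "kind": "ServiceAccount",
                "name": service_account,
                "namespace": namespace,
            }
        ],
    }


def with_model_label(deployments: list, label_key: str = MODEL_LABEL_KEY) -> list:
    """Keep only the deployment objects that carry the (assumed) model label."""
    return [
        d for d in deployments
        if label_key in (d.get("metadata", {}).get("labels") or {})
    ]


if __name__ == "__main__":
    role = build_cluster_role(ROLE_NAME)
    binding = build_cluster_role_binding(
        ROLE_NAME + "-binding", ROLE_NAME, SERVICE_ACCOUNT, SERVICE_ACCOUNT_NAMESPACE
    )
    print(json.dumps([role, binding], indent=2))
```

A namespaced Role with a RoleBinding would only cover deployments in its own namespace, which is consistent with the connectivity problem described above.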

Related Issues

Resolves: #484 #480 #459



zhangjyr · 2024-12-06T21:26:02Z

I moved some comments from #480 here:
Question: If the server is down, what's the autoscaler behavior? Have you tested such behaviors?
Answer: We haven't tested this. Ideally, if the GPU optimizer goes down, the podautoscaler cannot read metrics and should keep the replica count unchanged. After the GPU optimizer resumes, it should not output valid metrics until a new solution is reached with sufficient load traces. Most of the changes needed for this approach are easy to implement. However, we'll need to prolong the load trace timeout in Redis (e.g., up to 300 seconds, which matches the current GPU optimizer window) so the GPU optimizer can restore the solution quickly enough.
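The intended (untested) behavior described in this comment can be sketched roughly as follows. This is an illustration only, not aibrix code: `OptimizerState`, `read_metric`, `desired_replicas`, and `live_traces` are hypothetical names, and an in-memory list stands in for the Redis load traces. The 300-second value is the one mentioned in the comment.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Timeout suggested in the discussion; matches the GPU optimizer window.
LOAD_TRACE_TTL_SECONDS = 300


@dataclass
class OptimizerState:
    """Hypothetical view of what the GPU optimizer currently exposes."""
    solution_ready: bool = False
    recommended_replicas: Optional[int] = None


def read_metric(state: Optional[OptimizerState]) -> Optional[int]:
    """Return the recommendation, or None if no valid metric is available.

    state is None models an unreachable optimizer; solution_ready=False models
    an optimizer that has resumed but has not yet reached a new solution.
    """
    if state is None or not state.solution_ready:
        return None
    return state.recommended_replicas


def desired_replicas(current: int, state: Optional[OptimizerState]) -> int:
    """Keep the current replica count whenever no valid metric can be read."""
    metric = read_metric(state)
    return current if metric is None else metric


def live_traces(traces: List[Tuple[float, float]], now: float,
                ttl: float = LOAD_TRACE_TTL_SECONDS) -> List[Tuple[float, float]]:
    """Drop (timestamp, load) traces older than ttl seconds, as an expiry would."""
    return [(ts, load) for ts, load in traces if now - ts <= ttl]


if __name__ == "__main__":
    print(desired_replicas(3, None))                                   # optimizer down
    print(desired_replicas(3, OptimizerState()))                       # resumed, no solution
    print(desired_replicas(3, OptimizerState(True, 5)))                # solution available
    print(live_traces([(0, 1.0), (100, 2.0), (400, 3.0)], now=400))    # oldest trace expired
```

A longer trace lifetime means more of the window's traces survive a restart, which is why the comment ties the timeout to the optimizer window.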

python/aibrix/aibrix/gpu_optimizer/deployment.yaml

python/aibrix/aibrix/gpu_optimizer/Makefile

config/overlays/vke/default/kustomization.yaml

nwangfw · 2024-12-06T22:30:07Z

Right now, we are modifying code under python/aibrix/aibrix/gpu_optimizer. Can we delete the python/aibrix/aibrix/gpuoptimizer folder?

Jeffwan · 2024-12-06T22:34:05Z

@nwangfw We shouldn't have such a folder, right? The upstream folder has been renamed.

nwangfw · 2024-12-06T22:43:41Z

> @nwangfw we should not have such folder? upstream folder has been renamed

Never mind. It seems that it has already been deleted. Please ignore my comment, and thanks.

Jeffwan

/lgtm

…imizer connectivity (#500)

* Add GPU Optimizer deployment and update configurations
* Fix k8s accessibility regard namespaces. GPU optimizer now monitor all namespaces with model label.
* Lint fix
* Deployment clean-up
* Update README.md

Co-authored-by: Ning Wang <[email protected]>
Co-authored-by: Jingyuan Zhang <[email protected]>

nwangfw and others added 4 commits December 4, 2024 13:13

Add GPU Optimizer deployment and update configurations

90cd690

Fix k8s accessibility regard namespaces. GPU optimizer now monitor all namespaces with model label.

d2be10a

Merge branch 'gpu-optimizer-orchestration' into issues/484_Controller_failed_to_fetch_metrics_from_MetricSource
# Conflicts:
# development/simulator/deployment-a100.yaml
# development/simulator/deployment-a40.yaml

e544c12

Lint fix

0e64ad6

zhangjyr added the area/heterogeneous label Dec 6, 2024

zhangjyr added this to the v0.2.0 milestone Dec 6, 2024

zhangjyr requested review from Jeffwan and nwangfw December 6, 2024 21:24

Jeffwan reviewed Dec 6, 2024


Jingyuan Zhang added 2 commits December 6, 2024 15:30

Deployment clean-up

1ba7527

Update README.md

8e39e61

Jeffwan approved these changes Dec 7, 2024


Jeffwan merged commit dd2aa26 into main Dec 7, 2024
10 checks passed

Jeffwan deleted the issues/484_Controller_failed_to_fetch_metrics_from_MetricSource branch December 7, 2024 01:38
