Add vllm graceful termination configuration #568
Conversation
benchmarks/autoscaling/7b.yaml (Outdated)

@@ -22,6 +22,7 @@ spec:
    labels:
      model.aibrix.ai/name: deepseek-coder-7b-instruct
  spec:
    terminationGracePeriodSeconds: 300
Even considering we may have requests in the queue, this is probably a little long. Our longest query probably takes ~30s. Did you see long requests there? I did see that your preStop hook exits if there are no running or waiting requests, but in some extreme cases vLLM may experience issues that delay the terminating process all the way to 300s.
In my tests, the terminating process is delayed up to 300s only if there are hundreds of pending requests. In other cases, the termination process is not that long.
I think 30s should be good enough for a single long query. However, what if there are multiple pending requests? Should we triple it, say 90s?
Agree. 60 or 90s makes more sense.
I committed a suggestion to change it to 60 for now.
Thanks!
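For reference, the settings discussed in this thread can be sketched as a single pod spec fragment. This is a sketch only: the container name and script path are assumptions for illustration, not taken from the merged diff.

```yaml
# Sketch of the discussed configuration (container name and script path are
# illustrative assumptions, not values from this PR).
spec:
  terminationGracePeriodSeconds: 60   # upper bound; the preStop hook usually exits sooner
  containers:
    - name: vllm-openai               # assumed container name
      lifecycle:
        preStop:
          exec:
            # Drain script polls /metrics and exits once no requests remain
            command: ["/bin/sh", "-c", "/scripts/wait_for_drain.sh"]
```

Kubernetes runs the preStop hook first and only sends SIGTERM after it returns, so the grace period just needs to cover the drain time plus a small shutdown margin.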
RUNNING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
WAITING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')
if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
  echo "Terminating: No active or waiting requests, safe to terminate" >> /proc/1/fd/1
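The grep/awk pipeline above can be exercised without a live server by running it against a captured metrics snapshot. A self-contained sketch (the sample metrics text below is illustrative, not real vLLM output):

```shell
#!/bin/sh
# Extract a gauge value from Prometheus-format metrics text.
# Mirrors the grep | grep -v '#' | awk pipeline in the preStop hook above.
metric_value() {
  # $1 = metrics text, $2 = metric name
  printf '%s\n' "$1" | grep "$2" | grep -v '#' | awk '{print $2}'
}

# Illustrative snapshot of what /metrics might return (not captured output):
METRICS='# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running 0.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting 0.0'

RUNNING=$(metric_value "$METRICS" 'vllm:num_requests_running')
WAITING=$(metric_value "$METRICS" 'vllm:num_requests_waiting')

if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
  echo "safe to terminate"
fi
```

The `grep -v '#'` step is what drops the `# HELP` and `# TYPE` comment lines, leaving only the sample line whose second field is the gauge value.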
Hm, here you forward the logs to the main container's output; this way works. Technically, we could also check for FailedPreStopHook events. I am OK with this approach.
BTW, in this way we handle termination manually. Did you check whether vLLM itself handles termination? For example, if we send SIGTERM, does it exit immediately or wait for requests to finish?
> BTW, in this way we handle termination manually. Did you check whether vLLM itself handles termination? For example, if we send SIGTERM, does it exit immediately or wait for requests to finish?

I tested, and vLLM exits immediately on SIGTERM.
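The "exits immediately" behavior can be illustrated without a GPU box by sending SIGTERM to a stand-in long-running process (`sleep` here stands in for the vLLM server; this is a generic demonstration of SIGTERM semantics, not a test of vLLM itself):

```shell
#!/bin/sh
# Stand-in long-running process to illustrate SIGTERM semantics.
sleep 60 &
PID=$!

kill -TERM "$PID"   # Kubernetes sends SIGTERM only after the preStop hook completes
wait "$PID"
STATUS=$?

# 143 = 128 + 15 (SIGTERM): the process was terminated by the signal,
# i.e. it did not get a chance to finish its work gracefully.
echo "exit status: $STATUS"
```

This is exactly why the preStop hook matters here: since vLLM does not drain requests on SIGTERM, the drain has to happen before the signal is delivered.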
/cc this is a change @brosoul should be aware of.
Signed-off-by: Jiaxin Shan <[email protected]>

* vllm graceful termination configured

* Update benchmarks/autoscaling/7b.yaml

Signed-off-by: Jiaxin Shan <[email protected]>

---------

Signed-off-by: Jiaxin Shan <[email protected]>
Co-authored-by: Jiaxin Shan <[email protected]>
Pull Request Description
This PR adds graceful termination after a workload pod is deleted by the podautoscaler. In the current configuration, the pod is given 5 minutes to finish running and pending requests before it is actually deleted.
Related Issues
Resolves: #553
Important: Before submitting, please complete the description above and review the checklist below.
Contribution Guidelines (Expand for Details)
We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:
Pull Request Title Format
Your PR title should start with one of these prefixes to indicate the nature of the change:
[Bug]: Corrections to existing functionality
[CI]: Changes to build process or CI pipeline
[Docs]: Updates or additions to documentation
[API]: Modifications to aibrix's API or interface
[CLI]: Changes or additions to the Command Line Interface
[Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.
Submission Checklist
By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.