Piggybacking more information in response header #795
Comments
@Jeffwan @varungup90 WDYT?
Only per-request information should be returned in response headers. The information listed in the issue is already captured in the metrics, which are reflected in the dashboard and can also be queried by the client.
Yeah, that's true. But if we want to map the state to the particular request at the moment it was scheduled, a snapshot is needed. I wonder what the downside would be. Any thoughts?
Request and response headers must be lightweight. You can dump the state in logs on a per-request basis.
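One way to read this suggestion is to emit one structured log line per request, keyed by a request ID, so the snapshot can be joined to the request later without growing the headers. The sketch below is only illustrative: the logger name, the `request_id` key, and the field names are assumptions, not the gateway's actual logging format.

```python
import json
import logging

logger = logging.getLogger("gateway.snapshot")  # hypothetical logger name


def format_snapshot_record(request_id: str, target_pod_ip: str, state: dict) -> str:
    """Build a single JSON log line tying a state snapshot to one request."""
    record = {
        "request_id": request_id,        # assumed correlation key
        "target_pod_ip": target_pod_ip,  # pod the request was routed to
        "state": state,                  # snapshot taken at scheduling time
    }
    # sort_keys keeps the output stable, which helps grepping and diffing.
    return json.dumps(record, sort_keys=True)


def log_snapshot(request_id: str, target_pod_ip: str, state: dict) -> None:
    """Write the snapshot to the log; nothing is added to the response headers."""
    logger.info(format_snapshot_record(request_id, target_pod_ip, state))
```

For post-analysis, the log lines can then be filtered by `request_id` and matched against the client-side request records.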
@varungup90 @Jeffwan
🚀 Feature Description and Motivation
I suggest piggybacking more information on the response headers. For example, the gateway currently returns the target-pod-ip in a response header. I suggest including more information in the same manner.
This would be useful because it provides a per-request snapshot of the state at the time the request is scheduled. Request-granularity information will be very useful for post-analysis and more.
Candidate state information includes queue size, GPU memory utilization, KV cache hit ratio, RPS for each GPU, TPS for each GPU, etc. The exact list should be discussed. The requirement is that none of them should introduce overhead on the request critical path.
The downside of including more information in the response headers is added overhead in the gateway and a larger response. Neither is significant.
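As a rough illustration of the proposal and its size cost, the sketch below flattens a pod-state snapshot into response headers and estimates the bytes they add. The `PodSnapshot` fields, the `x-pod-` prefix, and both function names are hypothetical and chosen only for this example. They are not the gateway's actual API.

```python
from dataclasses import dataclass, asdict
from typing import Dict


@dataclass(frozen=True)
class PodSnapshot:
    """Hypothetical per-pod state captured when the request is scheduled."""
    queue_size: int
    gpu_memory_utilization: float  # fraction in [0, 1]
    kv_cache_hit_ratio: float      # fraction in [0, 1]
    rps: float                     # requests per second
    tps: float                     # tokens per second


def snapshot_to_headers(snapshot: PodSnapshot, prefix: str = "x-pod-") -> Dict[str, str]:
    """Flatten a snapshot into string-valued headers (names are illustrative)."""
    headers = {}
    for field, value in asdict(snapshot).items():
        name = prefix + field.replace("_", "-")
        # Format floats with three decimals so header values stay short.
        headers[name] = f"{value:.3f}" if isinstance(value, float) else str(value)
    return headers


def header_bytes(headers: Dict[str, str]) -> int:
    """Approximate HTTP/1.1 size: name, colon, space, value, CRLF per header."""
    return sum(len(name) + len(value) + 4 for name, value in headers.items())
```

Calling `header_bytes(snapshot_to_headers(snapshot))` gives a quick estimate of how much each additional field costs per response, which may help when deciding the exact list.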
Use Case
post-analysis
Proposed Solution
piggybacking more information on the response header