Piggybacking more information in response header #795

Open
gangmuk opened this issue Mar 5, 2025 · 5 comments
Comments

@gangmuk
Collaborator

gangmuk commented Mar 5, 2025

🚀 Feature Description and Motivation

I suggest piggybacking more information on the response headers. For example, the gateway currently returns target-pod-ip in the response header. I suggest including more information on the response in the same way.
This would be useful because it gives us a snapshot of the state at per-request granularity, taken when the request is scheduled. Request-granularity information will be very useful for post-analysis and more.

Candidate state information would be queue size, GPU memory utilization, KV cache hit ratio, RPS for each GPU, TPS for each GPU, etc. The exact list should be discussed. The requirement is that none of them should introduce overhead on the request critical path.

The downside of including more information in the response header would be extra overhead in the gateway and a larger response. Neither is significant.
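
To make the idea concrete, here is a minimal sketch of how a snapshot could be turned into response headers. It is only an illustration, not the gateway's actual code: the `PodSnapshot` type, the `x-snapshot-` prefix, and the field names are all hypothetical. It assumes the snapshot is read from values the gateway already has cached, so nothing is fetched on the request critical path.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PodSnapshot:
    """Hypothetical per-pod state, assumed to be cached by the gateway already."""
    queue_size: int
    gpu_memory_utilization: float
    kv_cache_hit_ratio: float


# Hypothetical prefix so snapshot headers are easy to filter on the client side.
HEADER_PREFIX = "x-snapshot-"


def snapshot_headers(target_pod_ip: str, snapshot: PodSnapshot) -> dict[str, str]:
    """Build response headers: the existing target-pod-ip plus one header per field."""
    headers = {"target-pod-ip": target_pod_ip}
    for name, value in asdict(snapshot).items():
        # Header values are strings; field names become dash-separated header names.
        headers[HEADER_PREFIX + name.replace("_", "-")] = str(value)
    return headers


headers = snapshot_headers("10.0.0.7", PodSnapshot(3, 0.82, 0.5))
```

Each extra field costs one short header line, which is the response-size overhead mentioned above.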

Use Case

post-analysis

Proposed Solution

piggybacking more information on the response header

@gangmuk
Collaborator Author

gangmuk commented Mar 5, 2025

@Jeffwan @varungup90 WDYT?

@varungup90
Collaborator

Only per-request information should be returned in response headers. The information listed in the issue is already captured in the metrics, which are shown in the dashboard and can also be queried by clients.

@gangmuk
Collaborator Author

gangmuk commented Mar 5, 2025

Yeah, that's true. But if we want to map the state to the particular request at the time it was scheduled, a snapshot is needed. I wonder what the downside of it would be. Any thoughts?

@varungup90
Collaborator

Request and response headers must be lightweight. You can dump the state to the logs on a per-request basis.
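
A rough sketch of what per-request logging could look like, assuming one JSON line per scheduled request. The logger name, the function names, and the field names are hypothetical and are not taken from the gateway code.

```python
import json
import logging

# Hypothetical logger name; a real gateway would use its own logging setup.
logger = logging.getLogger("gateway.request_state")


def request_state_record(request_id: str, target_pod_ip: str, state: dict) -> str:
    """Serialize the state seen at scheduling time as one JSON line."""
    record = {"request_id": request_id, "target_pod_ip": target_pod_ip, **state}
    return json.dumps(record, sort_keys=True)


def log_request_state(request_id: str, target_pod_ip: str, state: dict) -> str:
    """Emit the record to the log and return it, keeping headers untouched."""
    line = request_state_record(request_id, target_pod_ip, state)
    logger.info(line)
    return line


line = log_request_state("req-1", "10.0.0.7", {"queue_size": 3})
```

Post-analysis can then join these lines with client-side results on `request_id`, at the cost of needing access to the gateway logs.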

@gangmuk
Collaborator Author

gangmuk commented Mar 6, 2025

@varungup90 @Jeffwan
Not sure if you've heard of it, but Envoy had a similar discussion in the past. They proposed ORCA (Open Request Cost Aggregation), a proposal for an open standard for request cost aggregation. I think it was officially integrated into Envoy. We don't need to follow the exact format, but maybe we can think about a so-called AIBrix ORCA for AI-specific metrics and support it in AIBrix.

orca issue in envoy
orca design doc
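
As a loose illustration of an ORCA-style report, the sketch below packs named metrics into a single header value that the client decodes. This is not the ORCA wire format: the header name `x-aibrix-load-report`, the JSON encoding, and the metric names are all assumptions made for the example.

```python
import json

# Hypothetical header name, chosen for this example only.
LOAD_REPORT_HEADER = "x-aibrix-load-report"


def encode_load_report(named_metrics: dict[str, float]) -> dict[str, str]:
    """Gateway side: pack all metrics into one compact header value."""
    value = json.dumps(named_metrics, separators=(",", ":"), sort_keys=True)
    return {LOAD_REPORT_HEADER: value}


def decode_load_report(headers: dict[str, str]) -> dict[str, float]:
    """Client side: recover the metrics, or an empty report if the header is absent."""
    raw = headers.get(LOAD_REPORT_HEADER)
    return json.loads(raw) if raw is not None else {}


report = {"kv_cache_hit_ratio": 0.5, "queue_size": 3}
response_headers = encode_load_report(report)
decoded = decode_load_report(response_headers)
```

Keeping everything in one header means new metrics can be added without defining a new header for each of them.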
