
[feature request] Real-time streaming inference load generation by perf_analyzer #8059

vadimkantorov opened this issue Mar 8, 2025

Triton and perf_analyzer already have some support for streaming-mode inference.

But if I understand correctly, these still assume that the input to the model is provided in one shot. In some scenarios (e.g. real-time text/speech translation or real-time speech recognition), the input should also be supplied continuously in streamed/chunked mode, ideally reusing a single connection in duplex mode, as in the sketch below.
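For concreteness, here is a minimal sketch of the input side of this pattern, using the existing tritonclient gRPC streaming API (`start_stream` / `async_stream_infer`) against a stateful sequence model. The endpoint, the model name `speech_model`, the input name `AUDIO_CHUNK`, and the chunk shape are all made-up placeholders; the point is that perf_analyzer would need to generate load shaped like this:

```python
import queue

import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def on_response(result, error):
    # Responses arrive asynchronously over the same duplex gRPC stream.
    responses.put(error if error is not None else result)

client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=on_response)

# Hypothetical workload: 10 chunks of 100 ms of 16 kHz audio, fed one by one
# into a single inference sequence instead of one monolithic request.
chunks = [np.zeros((1, 1600), dtype=np.float32) for _ in range(10)]
for i, chunk in enumerate(chunks):
    inp = grpcclient.InferInput("AUDIO_CHUNK", list(chunk.shape), "FP32")
    inp.set_data_from_numpy(chunk)
    client.async_stream_infer(
        model_name="speech_model",  # hypothetical sequence model
        inputs=[inp],
        sequence_id=1,
        sequence_start=(i == 0),
        sequence_end=(i == len(chunks) - 1),
    )

client.stop_stream()  # flushes the stream; all pending callbacks have fired
while not responses.empty():
    print(responses.get())
```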

I propose adding this scenario to perf_analyzer as well.

Thanks!


I also wonder whether Triton ships any perf_analyzer-like robust Python scripts for correct, low-overhead, low-GC, concurrent load generation and metrics computation, so that the load-generation part can be customized. A rough skeleton of what I mean follows.
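To illustrate, a minimal sketch of such a generator (not an existing Triton utility): closed-loop concurrent workers, latencies recorded with `time.perf_counter_ns` to keep per-request overhead small, percentiles computed offline after the run. `infer_once` is a placeholder for the actual (streaming) inference call:

```python
import asyncio
import time

async def infer_once():
    # Placeholder: a real generator would issue a (streaming) inference
    # request here, e.g. via tritonclient.grpc.aio.
    await asyncio.sleep(0.01)

async def worker(num_requests, latencies_ns):
    # Closed-loop worker: issue the next request only after the previous
    # one completes, recording end-to-end latency in nanoseconds.
    for _ in range(num_requests):
        t0 = time.perf_counter_ns()
        await infer_once()
        latencies_ns.append(time.perf_counter_ns() - t0)

async def main(concurrency=8, requests_per_worker=100):
    latencies_ns = []
    await asyncio.gather(*(worker(requests_per_worker, latencies_ns)
                           for _ in range(concurrency)))
    # Compute percentiles offline, after the measurement phase, so metric
    # computation does not perturb the load generation itself.
    latencies_ns.sort()
    for q in (0.50, 0.90, 0.99):
        idx = min(int(q * len(latencies_ns)), len(latencies_ns) - 1)
        print(f"p{int(q * 100)}: {latencies_ns[idx] / 1e6:.2f} ms")

asyncio.run(main())
```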
