
[feature request] Real-time streaming inference load generation by perf_analyzer #8059

vadimkantorov opened this issue Mar 8, 2025

Triton and perf_analyzer already have some support for streaming-mode inference.

But if I understand correctly, these still assume that the input to the model is provided in one shot. In some scenarios (e.g. real-time text/speech translation or real-time speech recognition), the input should also be supplied continuously in streamed/chunked mode, ideally reusing a single connection in duplex mode, as in the sketch below.
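For concreteness, here is a minimal sketch of the input side of this pattern, using the existing tritonclient gRPC streaming API (`start_stream` / `async_stream_infer`) against a stateful sequence model. The endpoint, the model name `speech_model`, the input name `AUDIO_CHUNK`, and the chunk shape are all made-up placeholders; the point is that perf_analyzer would need to generate load shaped like this:

```python
import queue

import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def on_response(result, error):
    # Responses arrive asynchronously over the same duplex gRPC stream.
    responses.put(error if error is not None else result)

client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=on_response)

# Hypothetical workload: 10 chunks of 100 ms of 16 kHz audio, fed one by one
# into a single inference sequence instead of one monolithic request.
chunks = [np.zeros((1, 1600), dtype=np.float32) for _ in range(10)]
for i, chunk in enumerate(chunks):
    inp = grpcclient.InferInput("AUDIO_CHUNK", list(chunk.shape), "FP32")
    inp.set_data_from_numpy(chunk)
    client.async_stream_infer(
        model_name="speech_model",  # hypothetical sequence model
        inputs=[inp],
        sequence_id=1,
        sequence_start=(i == 0),
        sequence_end=(i == len(chunks) - 1),
    )

client.stop_stream()  # flushes the stream; all pending callbacks have fired
while not responses.empty():
    print(responses.get())
```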

I propose adding this scenario to perf_analyzer as well.

Thanks!


I also wonder whether Triton ships any perf_analyzer-like robust Python scripts for correct, low-overhead, low-GC, concurrent load generation and metrics computation, so that the load-generation part can be customized. A rough skeleton of what I mean follows.
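To illustrate, a minimal sketch of such a generator (not an existing Triton utility): closed-loop concurrent workers, latencies recorded with `time.perf_counter_ns` to keep per-request overhead small, percentiles computed offline after the run. `infer_once` is a placeholder for the actual (streaming) inference call:

```python
import asyncio
import time

async def infer_once():
    # Placeholder: a real generator would issue a (streaming) inference
    # request here, e.g. via tritonclient.grpc.aio.
    await asyncio.sleep(0.01)

async def worker(num_requests, latencies_ns):
    # Closed-loop worker: issue the next request only after the previous
    # one completes, recording end-to-end latency in nanoseconds.
    for _ in range(num_requests):
        t0 = time.perf_counter_ns()
        await infer_once()
        latencies_ns.append(time.perf_counter_ns() - t0)

async def main(concurrency=8, requests_per_worker=100):
    latencies_ns = []
    await asyncio.gather(*(worker(requests_per_worker, latencies_ns)
                           for _ in range(concurrency)))
    # Compute percentiles offline, after the measurement phase, so metric
    # computation does not perturb the load generation itself.
    latencies_ns.sort()
    for q in (0.50, 0.90, 0.99):
        idx = min(int(q * len(latencies_ns)), len(latencies_ns) - 1)
        print(f"p{int(q * 100)}: {latencies_ns[idx] / 1e6:.2f} ms")

asyncio.run(main())
```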
