Triton and `perf_analyzer` have some support for streaming-mode inference. But if I understand correctly, these still assume that the input to the model is provided in one shot. In some scenarios (e.g. real-time text/speech translation, or real-time speech recognition), the input should also be supplied continuously in streamed/chunked mode (ideally reusing a single connection in duplex mode).
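For illustration, this is roughly what I mean by chunked input over one duplex connection, sketched with tritonclient's gRPC streaming API (the model name, tensor name, and chunking are made up):

```python
# Sketch: feed input continuously in chunks over a single gRPC stream,
# instead of sending the whole utterance in one shot.
import queue

import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def on_response(result, error):
    # Called once per response arriving on the stream.
    responses.put(error if error is not None else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=on_response)

audio = np.random.rand(16000).astype(np.float32)  # stand-in for a live feed
for i, chunk in enumerate(np.array_split(audio, 10)):
    inp = grpcclient.InferInput("AUDIO_CHUNK", list(chunk.shape), "FP32")
    inp.set_data_from_numpy(chunk)
    client.async_stream_infer(
        model_name="streaming_asr",  # hypothetical decoupled model
        inputs=[inp],
        request_id=str(i),
    )

client.stop_stream()  # waits for in-flight requests to drain
```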
I propose adding this scenario to `perf_analyzer` as well.

Thanks!
I also wonder whether Triton ships any `perf_analyzer`-like robust Python scripts for correct, low-overhead, zero-GC, concurrent load generation and metrics computation, so that the load-generation part can be customized?