[V1] Supports scheduling asynchronousization on V1 version #11133
Hello @robertgshaw2-neuralmagic @njhill, based on the previous discussion in #10634, I implemented a version of asynchronous scheduling. I measured it locally with bs=256: the time dropped from the original 5-6 ms to 150-170 µs. GPU utilization stayed at 100% throughout the run.
Correctness has been verified, and I will continue to improve the PR.
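For readers new to the idea, here is a minimal sketch of what asynchronous scheduling means in general: the CPU prepares the batch for step N+1 while step N is still executing, so the scheduling cost is hidden behind execution. This is not the code in this PR. `schedule`, `execute`, `run_sync`, and `run_async` are hypothetical names, the sleeps stand in for real work, and the sketch assumes scheduling a step does not need the previous step's output, which a real engine has to handle.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def schedule(step):
    """Hypothetical CPU-side scheduling work for one step."""
    time.sleep(0.005)  # stand-in for building the next batch
    return f"batch-{step}"


def execute(batch):
    """Hypothetical stand-in for the model forward pass."""
    time.sleep(0.02)  # stand-in for GPU execution time
    return f"output({batch})"


def run_sync(num_steps):
    """Baseline: schedule, then execute, strictly one after the other."""
    outputs = []
    for step in range(num_steps):
        batch = schedule(step)
        outputs.append(execute(batch))
    return outputs


def run_async(num_steps):
    """Overlap: schedule step N+1 in a worker thread while step N executes."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        next_batch = pool.submit(schedule, 0)
        for step in range(num_steps):
            batch = next_batch.result()  # wait for the prepared batch
            if step + 1 < num_steps:
                # Start scheduling the next step before executing this one.
                next_batch = pool.submit(schedule, step + 1)
            outputs.append(execute(batch))
    return outputs
```

Both functions return the same outputs in the same order; only the overlap differs, so with these placeholder costs `run_async` should spend less wall-clock time per step than `run_sync`.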
TODO:
After enabling CUDA graphs, the outputs no longer match numerically when a new stream is used to process the input. Can we discuss this? Is there anything about CUDA graphs that I should be aware of here?
Beyond the above, any other suggestions are welcome.