[Feature]: Initial Idea and Design for Asynchronous Scheduling #10634
Comments
Async scheduling is something that we may implement in V1 eventually. However, it is very complex, so we plan to exhaust all other optimizations and stabilize the feature set of V1 before implementing async scheduling. The sketch you described above is what we have in mind.
@robertgshaw2-neuralmagic Yes, I agree with your point of view; ultimate performance is what we are after. My worry is that once we introduce asynchronous scheduling, the framework will undergo major changes and introduce new problems, so we may need to re-analyze and re-optimize. Or is this not something to worry about?
Thanks @lixiaolx ... like @robertgshaw2-neuralmagic said, we don't plan to implement async scheduling initially, largely because it will be complicated to make it work with other optimizations where an unknown number of tokens might be generated per step for each sequence. However, as you observed from the profile, it's quite slow currently, and this is partly because the input/output processing threads run concurrently with the main loop and may contend for the GIL. The plan is to change this so that input/output processing (serialization/deserialization) happens only during the model forward pass.
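To make that last point concrete, here is a minimal sketch (not vLLM's actual code) of confining CPU-side (de)serialization to the window where the main thread is blocked in the GPU forward pass. The names `model`, `deserialize_inputs`, and `serialize_outputs` are hypothetical placeholders:

```python
# Sketch: run (de)serialization on a worker thread while the main thread
# is blocked inside the forward pass. CUDA kernels release the GIL, so
# the worker thread can make real progress instead of contending for it.
from concurrent.futures import ThreadPoolExecutor

io_pool = ThreadPoolExecutor(max_workers=1)

def step(model, batch, raw_next_inputs, deserialize_inputs, serialize_outputs):
    # Start deserializing the *next* step's inputs on the worker thread.
    next_inputs_future = io_pool.submit(deserialize_inputs, raw_next_inputs)
    # The forward pass blocks this thread but releases the GIL in CUDA
    # kernels, so the worker thread runs in parallel with it.
    outputs = model.forward(batch)
    # Hand the outputs to the worker for serialization as well; this can
    # overlap with the following step's forward pass.
    outputs_future = io_pool.submit(serialize_outputs, outputs)
    return outputs_future, next_inputs_future.result()
```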
🚀 The feature, motivation and pitch
After incorporating this PR, I tested Llama2-7b with bs=256 on the ShareGPT dataset. The gap between two consecutive decode steps is about 5-6 ms (token gap), which still accounts for a large proportion of step time. I discussed this with the authors @robertgshaw2-neuralmagic and @njhill; the problem is caused by multi-threaded contention for the GIL.

My initial assumption was that fixing the GIL contention alone could recover about 2-3 ms.
Building on the current implementation, I then sketched an asynchronous design; my initial estimate is that it could reduce the gap to roughly 200-300 us, essentially eliminating the token gap. The solution is roughly as shown in the figure below:

This is based on the current main implementation. Could anyone please help check whether it is feasible?

Main implementation:
1. Use last_schedule to record the previous scheduling result, and use it to wait for the real token_ids produced by the previous GPU step for output.
2. Each time a request arrives, push its input_data directly into the queue and return placeholder (fake) data so that the resources needed for the next scheduling step can be updated immediately.
3. Further split the GPU's prepare-data stage so that the GPU updates input_ids in a new thread or process, performs the forward computation, and caches the resulting token_ids for the next input_ids update.
4. With this design, the only per-step time overhead on the GPU is updating input_ids and recording cache_input_ids, and both can be implemented entirely on the GPU (see the sketch below).
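Below is a hypothetical sketch of this loop; names such as `Schedule`, `placeholder_slots`, and `cached_token_ids` are illustrative and are not vLLM's actual API:

```python
# Sketch: the scheduler books resources against placeholder token ids, and
# the real ids (kept on the GPU) are patched into input_ids just before
# the forward pass, so no step waits on a host round-trip.
from dataclasses import dataclass
import torch

PLACEHOLDER = -1  # fake token id handed back to the scheduler

@dataclass
class Schedule:
    input_ids: torch.Tensor          # token ids for this step's batch
    placeholder_slots: torch.Tensor  # positions currently holding PLACEHOLDER

class AsyncModelRunner:
    def __init__(self, model):
        self.model = model
        self.cached_token_ids = None  # last step's sampled ids, kept on GPU

    def execute(self, schedule: Schedule) -> torch.Tensor:
        if self.cached_token_ids is not None:
            # Overwrite the placeholder slots with the real token ids from
            # the previous step: a device-side copy, no CPU synchronization.
            schedule.input_ids[schedule.placeholder_slots] = self.cached_token_ids
        logits = self.model(schedule.input_ids)
        # Greedy sampling for illustration; the ids stay on the GPU so the
        # next step can patch them in without a host round-trip.
        self.cached_token_ids = torch.argmax(logits[:, -1, :], dim=-1)
        return self.cached_token_ids
```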

The final effect is as follows:
Alternatives
No response
Additional context
No response