
[Feature]: Initial Idea and Design for Asynchronous Scheduling #10634

Open · 1 task done
lixiaolx opened this issue Nov 25, 2024 · 4 comments
Labels
feature request New feature or request

Comments

@lixiaolx commented Nov 25, 2024

🚀 The feature, motivation and pitch

After incorporating this PR and running Llama2-7b with bs=256 on the ShareGPT dataset, I found that the gap between two consecutive decode steps is about 5-6 ms (token gap), which still accounts for a large proportion of the step time. After discussing with the authors @robertgshaw2-neuralmagic @njhill, we believe this gap is caused by multi-threaded GIL contention.
[image: profiling timeline showing the decode-to-decode gap]
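For reference, one minimal way to get this kind of per-step number (only an illustrative sketch assuming PyTorch with CUDA, not the exact setup used for the screenshot above; `run_forward` is a hypothetical stand-in for one decode step's forward call) is to compare the wall-clock time of a step against its CUDA-event time:

```python
import time
import torch

def profile_step(run_forward):
    """Rough per-step timing: wall-clock time of one step vs. the GPU time of
    its forward pass. The difference approximates host-side overhead
    (scheduling, (de)serialization, GIL contention)."""
    beg = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    t0 = time.perf_counter()
    beg.record()
    run_forward()                    # the decode step's forward pass
    end.record()
    torch.cuda.synchronize()         # wait so both measurements are final
    wall_ms = (time.perf_counter() - t0) * 1e3

    gpu_ms = beg.elapsed_time(end)   # GPU time between the two events, in ms
    return wall_ms, gpu_ms
```

In practice one would also timestamp consecutive engine steps to capture the work done between forward passes, but the idea is the same.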

At the same time, I estimated that if this problem were addressed, roughly 2-3 ms of that gap could be recovered.

Building on the current implementation, I experimented with ideas for making scheduling asynchronous, and my initial estimate is that the gap could be reduced to about 200-300 us, essentially eliminating the token gap. The solution is roughly as shown in the figure below, based on the current main implementation. Could anyone help check whether this is feasible?
[image: proposed asynchronous scheduling design]

Main implementation:

  1. Use last_schedule to record the previous scheduling result, and also wait on it for the real token_ids from the previous GPU step for output.

  2. Each time a request comes in, put its input_data directly into the queue and return placeholder ("fake") data, which is used to update the resources needed for the next scheduling step.

  3. Further split the GPU's prepare-data stage so that, in a new thread or process, the GPU updates the input_ids, performs the forward computation, and caches the resulting token_ids for the next input_ids update.

With this idea, the only extra per-step overhead is updating input_ids and recording cache_input_ids each time, and both can be implemented on the GPU.
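A minimal sketch of this loop, only to illustrate the idea (the names here such as `AsyncScheduleLoop`, `scheduler.schedule`, and `model_runner.forward` are placeholders, not vLLM's actual API):

```python
import queue
import threading

class AsyncScheduleLoop:
    """Illustrative only: schedule with placeholder tokens, run the GPU in a
    separate thread, and keep the real token_ids cached for the next step."""

    def __init__(self, scheduler, model_runner):
        self.scheduler = scheduler          # produces the batch for each step
        self.model_runner = model_runner    # runs the GPU forward pass
        self.input_queue = queue.Queue()    # scheduled batches waiting for the GPU
        self.last_schedule = None           # previous scheduling result (step 1)
        self.cached_token_ids = None        # real token_ids for the next input_ids update (step 3)
        self._worker = threading.Thread(target=self._gpu_loop, daemon=True)
        self._worker.start()

    def step(self):
        # Step 2: schedule immediately with placeholder ("fake") token data and
        # push the input_data into the queue; resource accounting for the next
        # schedule uses this placeholder result instead of waiting for the GPU.
        batch = self.scheduler.schedule(use_placeholder_tokens=True)
        self.input_queue.put(batch)
        self.last_schedule = batch

    def _gpu_loop(self):
        while True:
            batch = self.input_queue.get()
            # Step 3: on the GPU, overwrite the placeholder input_ids with the
            # real token_ids cached from the previous forward pass, run forward,
            # and cache the new token_ids for the next step.
            new_token_ids = self.model_runner.forward(
                batch, prev_token_ids=self.cached_token_ids)
            self.cached_token_ids = new_token_ids
```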
The expected final effect is as follows:
[image: expected timeline after asynchronous scheduling]

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@lixiaolx lixiaolx added the feature request New feature or request label Nov 25, 2024
@robertgshaw2-redhat (Collaborator) commented Nov 25, 2024

Async scheduling is something that we may implement in V1 eventually. However, it is very complex, so we plan to exhaust all other optimizations and stabilize the feature set of V1 before implementing async scheduling.

The sketch you described above is what we have in mind.

@lixiaolx (Author)

> so we plan to exhaust all other optimizations and stabilize the feature set of V1 before implementing Async scheduling

@robertgshaw2-neuralmagic Yes, I agree with your point of view; ultimate performance is what we are after. My worry is that once we introduce asynchronous scheduling, the framework will undergo major changes and new problems may appear, so we might need to re-analyze and re-optimize. Or is this not something worth worrying about?

@njhill (Member) commented Nov 27, 2024

Thanks @lixiaolx ... like @robertgshaw2-neuralmagic said, we don't plan to implement async scheduling initially, largely because it will be complicated to make it work with other optimizations where an unknown number of tokens might be generated per step for each sequence.

However, as you observed from the profile, it's quite slow currently, and this is partly because the input/output processing threads run concurrently and may contend for the GIL. The plan is to change this so that input/output processing (serialization/deserialization) happens only during the model forward pass.
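As a rough illustration of that plan (not the actual vLLM code; `serialize_outputs`, `forward_fn`, and `engine_step` below are hypothetical stand-ins), the (de)serialization work could be confined to the window in which the forward pass is running on the GPU and the GIL is largely released:

```python
from concurrent.futures import ThreadPoolExecutor
import pickle

io_pool = ThreadPoolExecutor(max_workers=1)

def serialize_outputs(outputs):
    # stand-in for the real output serialization sent back to the API process
    return pickle.dumps(outputs)

def engine_step(forward_fn, batch, prev_outputs):
    # Start serializing the previous step's outputs while the forward pass runs;
    # CUDA kernels release the GIL, so this I/O thread can make progress without
    # competing with the scheduling work done between steps.
    send_future = io_pool.submit(serialize_outputs, prev_outputs)
    outputs = forward_fn(batch)        # GPU-bound model forward pass
    payload = send_future.result()     # serialization finishes within the forward window
    return outputs, payload
```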

@new-TonyWang commented Dec 28, 2024

Hi, could you please explain how you generated this timeline? Thank you!
