[Feature]: Initial Idea and Design for Asynchronous Scheduling #10634
Comments
Async scheduling is something that we may implement in V1 eventually. However, it is very complex, so we plan to exhaust all other optimizations and stabilize the feature set of V1 before implementing async scheduling. The sketch you described above is what we have in mind.
@robertgshaw2-neuralmagic Yes, I agree with your point of view; ultimate performance is what we are after. My worry is that once we introduce asynchronous scheduling, the framework will undergo major changes and introduce new problems, so we may need to re-analyze and re-optimize. Or is this not something to worry about?
Thanks @lixiaolx ... like @robertgshaw2-neuralmagic said, we don't plan to implement async scheduling initially, largely because it will be complicated to make it work with other optimizations where an unknown number of tokens might be generated per step for each sequence. However, as you observed from the profile, it's quite slow currently, and this is partly because the input/output processing threads run concurrently with the main loop and may contend for the GIL. The plan is to change this so that input/output processing (serialization/deserialization) happens only during the model forward pass.
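To make that last point concrete, here is a minimal sketch (not vLLM's actual code) of confining CPU-side (de)serialization to the window where the main thread is blocked in the GPU forward pass. The names `model`, `deserialize_inputs`, and `serialize_outputs` are hypothetical placeholders:

```python
# Sketch: run (de)serialization on a worker thread while the main thread
# is blocked inside the forward pass. CUDA kernels release the GIL, so
# the worker thread can make real progress instead of contending for it.
from concurrent.futures import ThreadPoolExecutor

io_pool = ThreadPoolExecutor(max_workers=1)

def step(model, batch, raw_next_inputs, deserialize_inputs, serialize_outputs):
    # Start deserializing the *next* step's inputs on the worker thread.
    next_inputs_future = io_pool.submit(deserialize_inputs, raw_next_inputs)
    # The forward pass blocks this thread but releases the GIL in CUDA
    # kernels, so the worker thread runs in parallel with it.
    outputs = model.forward(batch)
    # Hand the outputs to the worker for serialization as well; this can
    # overlap with the following step's forward pass.
    outputs_future = io_pool.submit(serialize_outputs, outputs)
    return outputs_future, next_inputs_future.result()
```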
🚀 The feature, motivation and pitch
After incorporating this PR, I tested Llama2-7b with bs=256 on the ShareGPT dataset. The gap between two consecutive decode steps is about 5-6 ms (token gap), which still accounts for a large proportion of step time. I discussed this with the authors @robertgshaw2-neuralmagic and @njhill; the problem is caused by multi-threaded contention for the GIL.

My initial assumption was that fixing the GIL contention alone could recover about 2-3 ms.
Building on the current implementation, I then sketched an asynchronous design; my initial estimate is that it could reduce the gap to roughly 200-300 us, essentially eliminating the token gap. The solution is roughly as shown in the figure below:

This is based on the current main implementation. Could anyone please help check whether it is feasible?

Main implementation:
1. Use last_schedule to record the previous scheduling result, and use it to wait for the real token_ids produced by the previous GPU step for output.
2. Each time a request arrives, push its input_data directly into the queue and return placeholder (fake) data so that the resources needed for the next scheduling step can be updated immediately.
3. Further split the GPU's prepare-data stage so that the GPU updates input_ids in a new thread or process, performs the forward computation, and caches the resulting token_ids for the next input_ids update.
4. With this design, the only per-step time overhead on the GPU is updating input_ids and recording cache_input_ids, and both can be implemented entirely on the GPU (see the sketch below).
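Below is a hypothetical sketch of this loop; names such as `Schedule`, `placeholder_slots`, and `cached_token_ids` are illustrative and are not vLLM's actual API:

```python
# Sketch: the scheduler books resources against placeholder token ids, and
# the real ids (kept on the GPU) are patched into input_ids just before
# the forward pass, so no step waits on a host round-trip.
from dataclasses import dataclass
import torch

PLACEHOLDER = -1  # fake token id handed back to the scheduler

@dataclass
class Schedule:
    input_ids: torch.Tensor          # token ids for this step's batch
    placeholder_slots: torch.Tensor  # positions currently holding PLACEHOLDER

class AsyncModelRunner:
    def __init__(self, model):
        self.model = model
        self.cached_token_ids = None  # last step's sampled ids, kept on GPU

    def execute(self, schedule: Schedule) -> torch.Tensor:
        if self.cached_token_ids is not None:
            # Overwrite the placeholder slots with the real token ids from
            # the previous step: a device-side copy, no CPU synchronization.
            schedule.input_ids[schedule.placeholder_slots] = self.cached_token_ids
        logits = self.model(schedule.input_ids)
        # Greedy sampling for illustration; the ids stay on the GPU so the
        # next step can patch them in without a host round-trip.
        self.cached_token_ids = torch.argmax(logits[:, -1, :], dim=-1)
        return self.cached_token_ids
```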

The final effect is as follows:
Alternatives
No response
Additional context
No response