[Hardware][Ascend] forward_oot for FusedMoE #68
Conversation
Force-pushed from 42d855d to 49a14d2
vllm_ascend/ops/fused_moe.py (Outdated)
num_tokens, _ = hidden_states.shape
E, N, _ = w1.shape

batch_size_decode = 1
Why hardcode batch_size_decode as 1? Do we only support the bs=1 scenario?
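For reference, a minimal CPU-only sketch (not from the PR) of how the decode row indices could be derived without hardcoding the batch size, under the assumption that each decode request contributes exactly one token per step; the variable names mirror the snippet above.

```python
import torch

# Hedged sketch: derive the decode batch size from the input instead of
# hardcoding it to 1, assuming one token per request during decode.
hidden_states = torch.randn(4, 16)   # placeholder: (num_tokens, hidden_dim)
top_k = 2

num_tokens, _ = hidden_states.shape
batch_size_decode = num_tokens       # assumption: one token per request
row_idx_decode_len = batch_size_decode * top_k
row_idx_decode = torch.arange(
    0, row_idx_decode_len,
    dtype=torch.int32).view(top_k, -1).permute(1, 0).contiguous()
# On Ascend, .npu() would move this tensor to the device (requires torch_npu).
print(row_idx_decode.shape)          # torch.Size([4, 2])
```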
vllm_ascend/ops/fused_moe.py (Outdated)
row_idx_decode_len = batch_size_decode * top_k
row_idx_decode = torch.arange(
    0, row_idx_decode_len,
    dtype=torch.int32).view(top_k, -1).permute(1, 0).int().contiguous().npu()
Seems the same?
Suggested change:
-    dtype=torch.int32).view(top_k, -1).permute(1, 0).int().contiguous().npu()
+    torch.arange(0, row_idx_decode_len, dtype=torch.int32, device="npu").view(-1, top_k)
Need to confirm whether batch_size_decode can always be hardcoded to 1 before adopting this.
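To make that caveat concrete, a small CPU-only check (assumed example values, not from the PR) showing that the original construction and the suggested simplification agree only when batch_size_decode is 1:

```python
import torch

top_k = 4
for batch_size_decode in (1, 3):
    row_idx_decode_len = batch_size_decode * top_k
    # Original construction: arange -> (top_k, bs) -> transpose.
    original = torch.arange(0, row_idx_decode_len,
                            dtype=torch.int32).view(top_k, -1).permute(1, 0)
    # Suggested simplification: arange -> (bs, top_k) directly.
    suggested = torch.arange(0, row_idx_decode_len,
                             dtype=torch.int32).view(-1, top_k)
    print(batch_size_decode, torch.equal(original, suggested))
# Prints: 1 True / 3 False -- the simplification only holds for batch size 1.
```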
vllm_ascend/ops/fused_moe.py (Outdated)
row_idx_prefill_len = batch_size_prefill * num_tokens * top_k
row_idx = torch.arange(
    0, row_idx_prefill_len, dtype=torch.int32,
    device=topk_weights.device).view(top_k, -1).permute(1, 0).int().contiguous()
ditto
vllm_ascend/ops/fused_moe.py (Outdated)
expert_tokens = torch_npu.npu_moe_compute_expert_tokens(expanded_expert_idx, E)
expert_tokens = expert_tokens.to(torch.int64)

w1 = torch_npu.npu_transpose(w1, (0, 2, 1), require_contiguous=False)
Better to use torch.transpose? Since require_contiguous=False, npu_transpose() should only change the tensor's metadata, which should be the same as torch.transpose().
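As a CPU-only illustration of that point (torch_npu is not assumed here): torch.transpose returns a view that only changes shape and stride metadata, leaving the underlying storage untouched.

```python
import torch

# Sketch: torch.transpose returns a view that only permutes strides/metadata;
# no data is copied until .contiguous() is called.
w1 = torch.randn(8, 32, 16)              # placeholder: (E, N, hidden_dim)
w1_t = torch.transpose(w1, 1, 2)         # same effect as permuting dims (0, 2, 1)

print(w1_t.shape)                        # torch.Size([8, 16, 32])
print(w1_t.is_contiguous())              # False -- only metadata changed
print(w1_t.data_ptr() == w1.data_ptr())  # True  -- shares the same storage
```

If npu_transpose with require_contiguous=False behaves the same way, the two calls should be interchangeable at this point in the kernel.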
Force-pushed from 0efd905 to 5ad81ee
Add NPU implementation for FusedMoE

Signed-off-by: YHT <[email protected]>
Co-authored-by: YHT <[email protected]>
Signed-off-by: angazenn <[email protected]>
What this PR does / why we need it?
To adapt the DeepSeek model in the vLLM framework to Ascend hardware, this PR develops a fused MoE (Mixture of Experts) module.
Does this PR introduce any user-facing change?
I've written the NPU version of the group_topk function and the fused_expert function. Next, I'll further implement the forward_oot method of the UnquantizedFusedMoEMethod class.
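For illustration, a heavily simplified, CPU-only sketch of what a group_topk-style routing helper does. This omits the expert-group restriction that DeepSeek's grouped top-k applies, and the name and signature here are assumptions for illustration rather than the PR's actual API.

```python
import torch


def group_topk_sketch(router_logits: torch.Tensor, top_k: int,
                      renormalize: bool = True):
    """Simplified routing sketch: softmax over experts, then plain top-k.

    The real grouped top-k first restricts candidates to the best expert
    groups; that step is omitted here for brevity.
    """
    scores = torch.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(scores, k=top_k, dim=-1)
    if renormalize:
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids


# Example: route 5 tokens across 8 experts, keeping 2 experts per token.
router_logits = torch.randn(5, 8)
weights, ids = group_topk_sketch(router_logits, top_k=2)
print(weights.shape, ids.shape)  # torch.Size([5, 2]) torch.Size([5, 2])
```

The fused_expert path would then dispatch tokens to the selected experts and combine their outputs with these weights, which is what forward_oot will wire together.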
How was this patch tested?