[Hardware][Ascend] forward_oot for FusedMoE #68
Conversation
Force-pushed from 42d855d to 49a14d2
vllm_ascend/ops/fused_moe.py (Outdated)
num_tokens, _ = hidden_states.shape
E, N, _ = w1.shape

batch_size_decode = 1
Why hardcode batch_size_decode as 1? Do we only support the bs=1 scenario?
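For reference, a minimal CPU-only sketch (not from the PR) of how the decode row indices could be derived without hardcoding the batch size, under the assumption that each decode request contributes exactly one token per step; the variable names mirror the snippet above.

```python
import torch

# Hedged sketch: derive the decode batch size from the input instead of
# hardcoding it to 1, assuming one token per request during decode.
hidden_states = torch.randn(4, 16)   # placeholder: (num_tokens, hidden_dim)
top_k = 2

num_tokens, _ = hidden_states.shape
batch_size_decode = num_tokens       # assumption: one token per request
row_idx_decode_len = batch_size_decode * top_k
row_idx_decode = torch.arange(
    0, row_idx_decode_len,
    dtype=torch.int32).view(top_k, -1).permute(1, 0).contiguous()
# On Ascend, .npu() would move this tensor to the device (requires torch_npu).
print(row_idx_decode.shape)          # torch.Size([4, 2])
```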
vllm_ascend/ops/fused_moe.py (Outdated)
row_idx_decode_len = batch_size_decode * top_k
row_idx_decode = torch.arange(
    0, row_idx_decode_len,
    dtype=torch.int32).view(top_k, -1).permute(1, 0).int().contiguous().npu()
Seems the same?
Suggested change:
-    dtype=torch.int32).view(top_k, -1).permute(1, 0).int().contiguous().npu()
+    torch.arange(0, row_idx_decode_len, dtype=torch.int32, device="npu").view(-1, top_k)
Need to confirm whether batch_size_decode can always be hardcoded to 1 before adopting this.
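To make that caveat concrete, a small CPU-only check (assumed example values, not from the PR) showing that the original construction and the suggested simplification agree only when batch_size_decode is 1:

```python
import torch

top_k = 4
for batch_size_decode in (1, 3):
    row_idx_decode_len = batch_size_decode * top_k
    # Original construction: arange -> (top_k, bs) -> transpose.
    original = torch.arange(0, row_idx_decode_len,
                            dtype=torch.int32).view(top_k, -1).permute(1, 0)
    # Suggested simplification: arange -> (bs, top_k) directly.
    suggested = torch.arange(0, row_idx_decode_len,
                             dtype=torch.int32).view(-1, top_k)
    print(batch_size_decode, torch.equal(original, suggested))
# Prints: 1 True / 3 False -- the simplification only holds for batch size 1.
```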
vllm_ascend/ops/fused_moe.py (Outdated)
row_idx_prefill_len = batch_size_prefill * num_tokens * top_k
row_idx = torch.arange(
    0, row_idx_prefill_len, dtype=torch.int32,
    device=topk_weights.device).view(top_k, -1).permute(1, 0).int().contiguous()
ditto
vllm_ascend/ops/fused_moe.py (Outdated)
expert_tokens = torch_npu.npu_moe_compute_expert_tokens(expanded_expert_idx, E)
expert_tokens = expert_tokens.to(torch.int64)

w1 = torch_npu.npu_transpose(w1, (0, 2, 1), require_contiguous=False)
Better to use torch.transpose? Since require_contiguous=False, npu_transpose() should only change the tensor's metadata, which should be the same as torch.transpose().
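As a CPU-only illustration of that point (torch_npu is not assumed here): torch.transpose returns a view that only changes shape and stride metadata, leaving the underlying storage untouched.

```python
import torch

# Sketch: torch.transpose returns a view that only permutes strides/metadata;
# no data is copied until .contiguous() is called.
w1 = torch.randn(8, 32, 16)              # placeholder: (E, N, hidden_dim)
w1_t = torch.transpose(w1, 1, 2)         # same effect as permuting dims (0, 2, 1)

print(w1_t.shape)                        # torch.Size([8, 16, 32])
print(w1_t.is_contiguous())              # False -- only metadata changed
print(w1_t.data_ptr() == w1.data_ptr())  # True  -- shares the same storage
```

If npu_transpose with require_contiguous=False behaves the same way, the two calls should be interchangeable at this point in the kernel.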
Force-pushed from 0efd905 to 5ad81ee
Add NPU implementation for FusedMoE

Signed-off-by: YHT <[email protected]>
Co-authored-by: YHT <[email protected]>
Signed-off-by: angazenn <[email protected]>
What this PR does / why we need it?
To adapt the DeepSeek model in the vLLM framework to Ascend hardware, this PR develops a fused MoE (Mixture of Experts) module.
Does this PR introduce any user-facing change?
I've written the NPU version of the group_topk function and the fused_expert function. Next, I'll further implement the forward_oot method of the UnquantizedFusedMoEMethod class.
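For illustration, a heavily simplified, CPU-only sketch of what a group_topk-style routing helper does. This omits the expert-group restriction that DeepSeek's grouped top-k applies, and the name and signature here are assumptions for illustration rather than the PR's actual API.

```python
import torch


def group_topk_sketch(router_logits: torch.Tensor, top_k: int,
                      renormalize: bool = True):
    """Simplified routing sketch: softmax over experts, then plain top-k.

    The real grouped top-k first restricts candidates to the best expert
    groups; that step is omitted here for brevity.
    """
    scores = torch.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(scores, k=top_k, dim=-1)
    if renormalize:
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids


# Example: route 5 tokens across 8 experts, keeping 2 experts per token.
router_logits = torch.randn(5, 8)
weights, ids = group_topk_sketch(router_logits, top_k=2)
print(weights.shape, ids.shape)  # torch.Size([5, 2]) torch.Size([5, 2])
```

The fused_expert path would then dispatch tokens to the selected experts and combine their outputs with these weights, which is what forward_oot will wire together.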
How was this patch tested?