
[Hardware][Ascend]forward_oot for FusedMoE #68

Merged
merged 1 commit into vllm-project:v0.7.1-dev on Feb 18, 2025

Conversation

SidaoY
Contributor

@SidaoY SidaoY commented Feb 17, 2025

What this PR does / why we need it?

To run the DeepSeek model from the vLLM framework on Ascend hardware, this PR develops a Fused MoE (Mixture of Experts) module.

Does this PR introduce any user-facing change?

I've written NPU versions of the group_topk and fused_expert functions. Next, I'll implement the forward_oot method of the UnquantizedFusedMoEMethod class.
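
A rough, self-contained sketch of the overall shape this takes: the group_topk and fused_experts bodies below are plain-torch stand-ins with hypothetical signatures, not the NPU kernels themselves.

import torch


def group_topk(router_logits, top_k, renormalize=True):
    # Stand-in for the NPU group_topk kernel: softmax the router logits and
    # keep the top-k experts per token.
    scores = torch.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(scores, top_k, dim=-1)
    if renormalize:
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids


def fused_experts(hidden_states, w1, w2, topk_weights, topk_ids):
    # Stand-in for the fused expert kernel: run each selected expert's MLP
    # and combine the outputs with the routing weights.
    out = torch.zeros_like(hidden_states)
    num_tokens, top_k = topk_ids.shape
    for t in range(num_tokens):
        for k in range(top_k):
            e = topk_ids[t, k]
            h = torch.relu(hidden_states[t] @ w1[e].t()) @ w2[e].t()
            out[t] += topk_weights[t, k] * h
    return out


def forward_oot(hidden_states, router_logits, w1, w2, top_k=2):
    # Hypothetical overall shape of the out-of-tree forward: route, then fuse.
    topk_weights, topk_ids = group_topk(router_logits, top_k)
    return fused_experts(hidden_states, w1, w2, topk_weights, topk_ids)


# Toy usage: 4 tokens, hidden size 16, 8 experts, top-2 routing.
x = torch.randn(4, 16)
logits = torch.randn(4, 8)
w1 = torch.randn(8, 32, 16)   # (E, N, hidden)
w2 = torch.randn(8, 16, 32)   # (E, hidden, N)
print(forward_oot(x, logits, w1, w2, top_k=2).shape)  # torch.Size([4, 16])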

How was this patch tested?

@wangxiyuan
Collaborator

  1. Add -s to git commit.
  2. Rebase onto the newest code.
  3. Update the commit message.

@SidaoY SidaoY force-pushed the v0.7.1-dev branch 6 times, most recently from 42d855d to 49a14d2, on February 18, 2025 at 04:11
@SidaoY SidaoY changed the title from "[Hardware][Ascend]Add MLAAttention backend and forward_oot for FusedMoE" to "[Hardware][Ascend]forward_oot for FusedMoE" on Feb 18, 2025
num_tokens, _ = hidden_states.shape  # number of input tokens
E, N, _ = w1.shape                   # number of experts E, per-expert rows N of w1

batch_size_decode = 1
Collaborator

Why hard-code batch_size_decode as 1? Do we only support the bs1 scenario?

row_idx_decode_len = batch_size_decode * top_k
row_idx_decode = torch.arange(
    0, row_idx_decode_len,
    dtype=torch.int32).view(top_k, -1).permute(1, 0).int().contiguous().npu()
Collaborator

Seems the same?

Suggested change
dtype=torch.int32).view(top_k, -1).permute(1, 0).int().contiguous().npu()
torch.arange(0, row_idx_decode_len, dtype=torch.int32, device="npu").view(-1, top_k)

Collaborator

We need to confirm that batch_size_decode can always be hard-coded to 1 before adopting this suggestion.
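
For reference, a quick plain-torch check (CPU only, hypothetical helper names) suggests the two constructions coincide exactly when batch_size_decode is 1 and diverge for larger batches:

import torch

def row_idx_original(batch, top_k):
    # The PR's construction: arange, reshape to (top_k, batch), then transpose.
    n = batch * top_k
    return (torch.arange(0, n, dtype=torch.int32)
            .view(top_k, -1).permute(1, 0).contiguous())

def row_idx_suggested(batch, top_k):
    # The suggested one-liner: a plain row-major reshape.
    n = batch * top_k
    return torch.arange(0, n, dtype=torch.int32).view(-1, top_k)

print(torch.equal(row_idx_original(1, 4), row_idx_suggested(1, 4)))  # True
print(torch.equal(row_idx_original(2, 4), row_idx_suggested(2, 4)))  # False: layouts differ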

row_idx_prefill_len = batch_size_prefill * num_tokens * top_k
row_idx = torch.arange(
    0, row_idx_prefill_len, dtype=torch.int32,
    device=topk_weights.device).view(top_k, -1).permute(1, 0).int().contiguous()
Collaborator

ditto

# Per-expert token counts for the MoE computation (NPU op), cast to int64.
expert_tokens = torch_npu.npu_moe_compute_expert_tokens(expanded_expert_idx, E)
expert_tokens = expert_tokens.to(torch.int64)

# Swap the last two dims of w1; with require_contiguous=False this should only
# rewrite the tensor's metadata.
w1 = torch_npu.npu_transpose(w1, (0, 2, 1), require_contiguous=False)
Collaborator

Better to use torch.transpose? With require_contiguous=False, npu_transpose() should only change the tensor's metadata, which is the same thing torch.transpose() does.
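
For context, a small plain-torch check (CPU, illustrative shapes only) showing that torch.transpose returns a view that shares storage and only permutes strides, i.e. no data copy:

import torch

w1 = torch.randn(8, 32, 16)              # placeholder (E, N, hidden) shapes
w1_t = torch.transpose(w1, 1, 2)         # a view with permuted strides

print(w1_t.shape)                        # torch.Size([8, 16, 32])
print(w1_t.data_ptr() == w1.data_ptr())  # True: shares the same storage
print(w1_t.is_contiguous())              # False: only metadata changed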

@SidaoY SidaoY force-pushed the v0.7.1-dev branch 4 times, most recently from 0efd905 to 5ad81ee, on February 18, 2025 at 11:56
Signed-off-by: YHT <[email protected]>
@wangxiyuan wangxiyuan merged commit 718c763 into vllm-project:v0.7.1-dev Feb 18, 2025
3 checks passed
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Feb 21, 2025
Add npu implement for FusedMoE

Signed-off-by: YHT <[email protected]>
Co-authored-by: YHT <[email protected]>
Signed-off-by: angazenn <[email protected]>