Add deepep internode implementations. #71435

Open
wants to merge 33 commits into
base: develop


Xreki
Contributor

@Xreki Xreki commented Mar 5, 2025

PR Category

Communication Library

PR Types

New features

Description

pcard-67164
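Based on #71358, this PR integrates the deepep internode (multi-node) communication implementation. The internode implementation depends on the third-party library nvshmem. This PR uses cmake to download and build nvshmem automatically and to package and install its dynamic library. The behavior is controlled by the cmake option WITH_NVSHMEM, which defaults to OFF. Because nvshmem in turn depends on the gdrcopy library, users whose gdrcopy is not installed in a system directory must specify its install path through GDRCOPY_HOME.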

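Points to discuss and optimize:

  1. Unify the conditions under which the deepep code is compiled. Currently, when building Paddle, the deepep code is compiled if the target architectures include 90. Should including architecture 90 also download and build the nvshmem library and the deepep internode code by default?
  2. Because of the gdrcopy dependency, building the deepep code still requires the user to build gdrcopy manually first and specify its install path. Going forward, should gdrcopy be packaged into the image, or also be downloaded and built automatically through cmake?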

@Xreki Xreki requested review from sneaxiy and ForFishes as code owners March 5, 2025 12:34

paddle-bot bot commented Mar 5, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@Xreki Xreki force-pushed the new_add_deepep_nvshmem branch from 774a5f2 to bc74744 Compare March 6, 2025 04:18
@@ -173,7 +178,7 @@ struct LowLatencyLayout {
// - 2 symmetric odd/even signaling buffers

// Message sizes
- EP_HOST_ASSERT(num_scales * sizeof(float) <= hidden);
+ EP_HOST_ASSERT(num_scales * static_cast<int>(sizeof(float)) <= hidden);
Why isn't this static_cast<int64_t>?

Contributor Author

num_scales and hidden are both of type int, so I cast to the same data type.

Contributor Author

Changed to static_cast<int64_t>.
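
For readers following this thread, here is a minimal sketch of why the cast matters in the assertion. `sizeof` yields an unsigned `size_t`, so the uncast comparison converts the signed side to unsigned. The helper names `scales_fit` and `scales_fit_unsigned` are hypothetical and do not appear in the PR; they only mirror the shape of the `EP_HOST_ASSERT` condition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the PR): the check with sizeof cast to a
// signed 64-bit type, so the product and the comparison stay signed.
inline bool scales_fit(int num_scales, int hidden) {
  assert(num_scales >= 0);
  return num_scales * static_cast<std::int64_t>(sizeof(float)) <= hidden;
}

// Hypothetical helper: the uncast form. The explicit cast of `hidden` below
// spells out the implicit int -> size_t conversion the compiler would apply
// (and typically warn about under -Wsign-compare).
inline bool scales_fit_unsigned(int num_scales, int hidden) {
  return num_scales * sizeof(float) <= static_cast<std::size_t>(hidden);
}
```

With a negative `hidden` (an invalid value used here only for illustration), the unsigned form wraps `hidden` to a very large number and the check passes, while the signed 64-bit form rejects it.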

num_channels * num_ranks * sizeof(int) + // Channel start offset
num_channels * num_ranks * sizeof(int) + // Channel end offset
num_channels * num_ranks * sizeof(int) * 2 + // Queue head and tail
num_ranks * num_ranks * static_cast<int>(sizeof(int)) + // prefix matrix
Why isn't this static_cast<int64_t>?

Same for the ones below.
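
As an illustration of the reviewer's point for the size terms, the sketch below computes the byte size of a `num_ranks` x `num_ranks` int matrix in 64-bit arithmetic. The function `prefix_matrix_bytes` is a hypothetical name, not code from the PR, and the rank counts at which 32-bit overflow would occur are far larger than realistic deployments; the example only shows where the widening has to happen.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the PR): byte size of the prefix matrix.
// The first operand is widened before any multiplication. Because `*`
// associates left to right, writing
//   num_ranks * num_ranks * static_cast<std::int64_t>(sizeof(int))
// would still multiply the first two operands as 32-bit ints.
inline std::int64_t prefix_matrix_bytes(int num_ranks) {
  assert(num_ranks >= 0);
  return static_cast<std::int64_t>(num_ranks) * num_ranks *
         static_cast<std::int64_t>(sizeof(int));
}
```

Casting only `sizeof(int)` to `int` keeps the whole product in 32 bits, which is one reason to prefer a 64-bit cast placed on the leftmost operand.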

@Xreki Xreki force-pushed the new_add_deepep_nvshmem branch from b7fe54e to 6531039 Compare March 9, 2025 03:04