Add 'index' parameter for ProcessMesh.get_mesh_with_dim #62125

Merged

@From00 From00 commented Feb 27, 2024

PR types

Others

PR changes

Others

Description

Pcard-76459
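Upgrades the auto-parallel ProcessMesh.get_mesh_with_dim API by adding an index parameter, which directly returns the mesh at the given index along the specified dimension. The default is None, which returns all meshes along that dimension and is equivalent to index=[:].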

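[Background]
Randomness control in auto parallel builds random seeds from each mesh's global auto-increment ID. When a mesh for a specific dimension is needed and that dimension is not the first one, the previous usage first had to construct an intermediate mesh with that dimension moved to the first position, and then use [] to index the mesh at the desired position. If the dimension is already the first one, no actual transpose is needed and no extra mesh is created. As a result, the global auto-increment IDs differ between the two cases (by 1), and so do the generated random seeds.
From the user's point of view, two logically equivalent mesh operations on the same model produce different loss values. This is counterintuitive and confusing, and it hurts the user experience.
The problem was found on the auto-parallel Llama model. PR PaddlePaddle/PaddleNLP#8011 changed the mesh topology order of the Llama model from [dp, pp] to [pp, dp]. Changing the communication topology order only affects the mapping between the logical process_mesh and the physical devices, yet the shift in the global auto-increment mesh ID changed the model's results.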
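To make the ID offset concrete, here is a minimal toy model of the mechanism described above. It is a sketch, not Paddle's implementation: ToyMesh, move_dim_first, sub_mesh_id, and the module-level counter are all hypothetical names, and the only behavior taken from this description is that every newly constructed mesh consumes one global ID and that no new mesh is built when the requested dimension is already first.

```python
import itertools

import numpy as np

# Hypothetical stand-in for the framework's global auto-increment mesh ID.
_mesh_ids = itertools.count()


class ToyMesh:
    """Toy mesh: every construction consumes one global ID."""

    def __init__(self, process_ids, dim_names):
        self.process_ids = np.asarray(process_ids)
        self.dim_names = list(dim_names)
        self.unique_id = next(_mesh_ids)

    def move_dim_first(self, name):
        """Mimic the old usage: bring `name` to axis 0."""
        axis = self.dim_names.index(name)
        if axis == 0:
            # Already first: no transpose, so no extra mesh and no extra ID.
            return self
        names = [name] + [n for n in self.dim_names if n != name]
        return ToyMesh(np.moveaxis(self.process_ids, axis, 0), names)

    def __getitem__(self, idx):
        # Indexing the first axis builds the sub-mesh (one more ID).
        return ToyMesh(self.process_ids[idx], self.dim_names[1:])


def sub_mesh_id(dim_names, shape, name, idx):
    """Reset the counter, build a mesh, and return the sub-mesh's ID."""
    global _mesh_ids
    _mesh_ids = itertools.count()
    n = int(np.prod(shape))
    mesh = ToyMesh(np.arange(n).reshape(shape), dim_names)
    return mesh.move_dim_first(name)[idx].unique_id


id_dp_first = sub_mesh_id(["dp", "pp"], (2, 2), "pp", 0)
id_pp_first = sub_mesh_id(["pp", "dp"], (2, 2), "pp", 0)
print(id_dp_first, id_pp_first)
```

In this toy model the [dp, pp] layout pays for one extra intermediate mesh, so the two IDs differ by exactly 1. Any seed derived from the ID would differ as well.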

Related code: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/llama/modeling_auto.py#L78

Before the change:

mesh = mesh.get_mesh_with_dim("pp")[pp_idx] 

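get_mesh_with_dim first transposes pp to the first dimension ([dp, pp, mp] -> [pp, dp, mp]), producing the intermediate mesh [pp, dp, mp], and then indexes that intermediate mesh with pp_idx to obtain the mesh actually needed: [dp, mp]. When the adjusted topology order already has pp first, this intermediate mesh already exists, so the auto-increment ID is not affected.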

After the change:

mesh = mesh.get_mesh_with_dim("pp", pp_idx) 

get_mesh_with_dim directly constructs the mesh actually needed, [dp, mp], without producing an intermediate result.
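The two usages select the same processes. The following sketch illustrates this on a plain NumPy array of process IDs rather than a real ProcessMesh; the sizes dp=2, pp=3, mp=4 and the value of pp_idx are assumptions chosen only for illustration.

```python
import numpy as np

# Assumed toy topology: dp=2, pp=3, mp=4 (sizes are illustrative only).
dim_names = ["dp", "pp", "mp"]
process_ids = np.arange(2 * 3 * 4).reshape(2, 3, 4)
pp_axis = dim_names.index("pp")
pp_idx = 1

# Old usage: move pp to the front ([dp, pp, mp] -> [pp, dp, mp]),
# which materializes an intermediate array, then index it.
intermediate = np.moveaxis(process_ids, pp_axis, 0)
old_style = intermediate[pp_idx]

# New usage: pick the pp_idx slice along the pp axis directly,
# with no intermediate [pp, dp, mp] array.
new_style = np.take(process_ids, pp_idx, axis=pp_axis)

print(intermediate.shape)  # (3, 2, 4)
print(old_style.shape)     # (2, 4)
print(np.array_equal(old_style, new_style))  # True
```

Both paths yield the same [dp, mp] sub-mesh. The only difference is whether an intermediate object is created along the way, which is what shifts the auto-increment ID.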

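[One more thing]
This change in usage is only a temporary workaround and does not solve the problem at its root. To avoid similar problems entirely, the random seeds generated by auto parallel should be insensitive to "logical changes" of the mesh. However, because the auto-parallel process_mesh is designed to be flexible, users may change many logical concepts arbitrarily (e.g., transposing the mesh, reordering process_ids, renaming dims). Fixing this would require redesigning and reimplementing the entire random seed generation algorithm, which is a substantial amount of work.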

paddle-bot bot commented Feb 27, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@zhiqiu zhiqiu left a comment

LGTM

@sunzhongkai588 sunzhongkai588 left a comment

LGTM

@jeff41404 jeff41404 left a comment

LGTM for API

@From00 From00 merged commit 08d2b79 into PaddlePaddle:develop Feb 29, 2024
30 checks passed
4 participants