Add 'index' parameter for ProcessMesh.get_mesh_with_dim #62125
Merged
PR types
Others
PR changes
Others
Description
Pcard-76459
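Upgrades the auto-parallel ProcessMesh.get_mesh_with_dim API with a new index parameter, which directly returns the mesh at position index along the specified dimension. It defaults to None, which returns all meshes along that dimension and is equivalent to index=[:].
[Background]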
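Randomness control in auto parallel builds random seeds from each mesh's global auto-increment id. When a caller needs the mesh for a specific dimension and that dimension is not the first one, the old API required first constructing an intermediate mesh with the target dimension transposed to the front, and then indexing the desired sub-mesh with []. If the requested dimension is already the first one, no actual transpose is needed and no extra mesh is created. As a result, the global auto-increment ids differ between the two cases (by 1), and so do the generated random seeds. From the user's point of view, two logically equivalent mesh operations on the same model produce different loss values, which is counterintuitive, confusing, and hurts the user experience.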
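The id drift can be reproduced with a toy counter. This is a minimal sketch, not Paddle code: MeshIdCounter, seed_from_id, and old_lookup are hypothetical names, and seed_from_id assumes, purely for illustration, that the seed is a base value plus the mesh id. Only id consumption is modeled, not the mesh contents.

```python
from itertools import count


class MeshIdCounter:
    """Hypothetical model of the global auto-increment mesh id."""

    def __init__(self):
        self._ids = count()

    def new_mesh(self):
        # Every mesh construction consumes the next id.
        return next(self._ids)


def seed_from_id(mesh_id, base_seed=1234):
    # Hypothetical seed rule: any function of the id shifts when the id shifts.
    return base_seed + mesh_id


def old_lookup(counter, dim_is_first):
    """Old pattern mesh.get_mesh_with_dim(dim)[idx], modeled by id consumption only."""
    if not dim_is_first:
        counter.new_mesh()  # intermediate mesh with the target dim transposed to the front
    return counter.new_mesh()  # the indexed sub-mesh the caller actually uses


seed_dim_first = seed_from_id(old_lookup(MeshIdCounter(), dim_is_first=True))
seed_dim_not_first = seed_from_id(old_lookup(MeshIdCounter(), dim_is_first=False))
```

Both lookups stand for logically identical sub-meshes, yet seed_dim_not_first ends up exactly one larger than seed_dim_first because of the extra intermediate mesh.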
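This problem was found in the auto-parallel Llama model. PR PaddlePaddle/PaddleNLP#8011 changed the Llama mesh topology order from [dp, pp] to [pp, dp]. Changing the communication topology order should only affect the mapping between the logical process_mesh and the physical devices, but the resulting shift in the global mesh auto-increment id changed the model's results. Relevant code: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/llama/modeling_auto.py#L78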
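The Llama reordering can be modeled in plain Python to compare the old two-step call with the new index parameter. This is a sketch under stated assumptions, not Paddle's implementation: ToyMesh, get_mesh_with_dim_old, and build_counting are hypothetical, meshes are row-major lists of process ids, and a class-level counter stands in for the global auto-increment id. For simplicity, pp_first is the exact transpose of dp_first, so both layouts yield the same pp-stage-1 sub-mesh.

```python
from itertools import product
from math import prod


class ToyMesh:
    """Hypothetical stand-in for ProcessMesh: a row-major grid of process ids.

    Each construction takes the next value of a class-level counter, mimicking
    the global auto-increment id used to derive random seeds.
    """

    next_uid = 0

    def __init__(self, shape, process_ids, dim_names):
        self.shape = list(shape)
        self.process_ids = list(process_ids)
        self.dim_names = list(dim_names)
        self.uid = ToyMesh.next_uid
        ToyMesh.next_uid += 1

    def _dim_to_front(self, dim_name):
        # Compute (shape, ids, names) with dim_name moved to axis 0, without building a mesh.
        axis = self.dim_names.index(dim_name)
        order = [axis] + [i for i in range(len(self.shape)) if i != axis]
        strides = [prod(self.shape[i + 1:]) for i in range(len(self.shape))]
        shape = [self.shape[i] for i in order]
        ids = [
            self.process_ids[sum(c * strides[o] for c, o in zip(coord, order))]
            for coord in product(*(range(n) for n in shape))
        ]
        return shape, ids, [self.dim_names[i] for i in order]

    def __getitem__(self, index):
        # Indexing axis 0 always builds a new sub-mesh (one more id).
        block = prod(self.shape[1:])
        ids = self.process_ids[index * block:(index + 1) * block]
        return ToyMesh(self.shape[1:], ids, self.dim_names[1:])

    def get_mesh_with_dim_old(self, dim_name):
        # Old behavior: reuse self if dim_name is already first, else materialize a transposed mesh.
        if self.dim_names[0] == dim_name:
            return self
        return ToyMesh(*self._dim_to_front(dim_name))

    def get_mesh_with_dim(self, dim_name, index=None):
        # New behavior: build exactly one mesh, the one the caller asked for.
        shape, ids, names = self._dim_to_front(dim_name)
        if index is None:  # whole mesh with dim_name first, i.e. index=[:]
            return ToyMesh(shape, ids, names)
        block = prod(shape[1:])
        return ToyMesh(shape[1:], ids[index * block:(index + 1) * block], names[1:])


def build_counting(fn):
    """Return fn()'s mesh and how many meshes (ids) were created while building it."""
    before = ToyMesh.next_uid
    mesh = fn()
    return mesh, ToyMesh.next_uid - before


# The same 8 ranks laid out as [dp, pp, mp] and as [pp, dp, mp].
dp_first = ToyMesh([2, 2, 2], range(8), ["dp", "pp", "mp"])
pp_first = ToyMesh([2, 2, 2], [0, 1, 4, 5, 2, 3, 6, 7], ["pp", "dp", "mp"])

old_a, n_old_a = build_counting(lambda: dp_first.get_mesh_with_dim_old("pp")[1])
old_b, n_old_b = build_counting(lambda: pp_first.get_mesh_with_dim_old("pp")[1])
new_a, n_new_a = build_counting(lambda: dp_first.get_mesh_with_dim("pp", index=1))
new_b, n_new_b = build_counting(lambda: pp_first.get_mesh_with_dim("pp", index=1))
```

All four calls return ranks 2, 3, 6, 7 with dims [dp, mp]. With the old two-step call, the [dp, pp, mp] layout creates two meshes and the [pp, dp, mp] layout one, so every mesh created afterwards is shifted by one id; with index=1, both layouts create exactly one.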
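Before this change:
get_mesh_with_dim first transposes pp to the first dimension ([dp, pp, mp] -> [pp, dp, mp]), producing an intermediate mesh [pp, dp, mp]; the caller then indexes that intermediate mesh with pp_idx to get the mesh actually needed, [dp, mp]. After the topology was reordered so that pp comes first, no transpose is needed: the original mesh already has the intermediate layout, so no extra auto-increment id is consumed.
After this change:
get_mesh_with_dim constructs the required mesh [dp, mp] directly, without producing any intermediate result.
[One more thing]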
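Changing the calling pattern this way is only a temporary workaround and does not fix the root cause. To avoid such problems entirely, the random seeds generated by auto parallel should be insensitive to "logical changes" of the mesh. However, process_mesh in auto parallel is deliberately flexible, and users may freely change many logical properties (e.g., transposing the mesh, reordering process_ids, renaming dims). Achieving this would therefore require redesigning and reimplementing the entire random seed generation algorithm, which is a substantial amount of work.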
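One possible direction can be sketched, purely as an illustration and not as the planned redesign: derive the seed from the set of ranks a mesh covers instead of from its creation order. layout_insensitive_seed is a hypothetical helper. It also hints at why this is not a drop-in fix: two distinct meshes over the same ranks that are meant to draw different random streams would receive the same seed, which a real design would have to handle.

```python
import hashlib


def layout_insensitive_seed(process_ids, base_seed=1234):
    """Hypothetical seed that ignores mesh shape, rank order, dim names, and creation order.

    Only the set of ranks the mesh covers feeds the hash, so transposing the mesh,
    reordering process_ids, or renaming dims leaves the seed unchanged.
    """
    key = ",".join(str(rank) for rank in sorted(set(process_ids)))
    digest = hashlib.sha256(f"{base_seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "little")


# The pp-stage-1 sub-mesh from either topology order, listed in different orders.
seed_a = layout_insensitive_seed([2, 3, 6, 7])
seed_b = layout_insensitive_seed([6, 2, 7, 3])
seed_other_stage = layout_insensitive_seed([0, 1, 4, 5])
```

seed_a and seed_b match regardless of rank order, while a sub-mesh covering different ranks gets a different seed.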