
Remove delay_scale_loss and release_grads for llama-2 13B's benchmark. #8623

Merged
merged 1 commit into PaddlePaddle:develop on Jun 19, 2024

Conversation

Xreki (Contributor) commented Jun 19, 2024

PR types

Others

PR changes

Others

Description

Model        Training strategy            Branch                     Training throughput   Max memory reserved (from logs)
Llama-2 13B  pp4sharding8-vpp5-mbs1-acc4  develop                    1991.236              48.738
Llama-2 13B  pp4sharding8-vpp5-mbs1-acc4  release_grads removed      2037.899 (+2.34%)     53.602
Llama-2 13B  pp4sharding8-vpp5-mbs1-acc4  delay_scale_loss removed   2051.128 (+0.65%)     53.602

(Throughput gains are relative to the row above.)

Why the Llama-2 13B performance improves:

  • The release_grads strategy reduces peak memory usage, but it frees the memory held by the gradients at the end of every training step and re-allocates and re-initializes it in the next step, which introduces some overhead. The Llama-2 13B model does not come close to saturating GPU memory, so this option can be removed.
  • The delay_scale_loss strategy was intended to improve convergence. However, the competing framework used for comparison does not use it, and the strategy introduces a device synchronization that degrades the effectiveness of the sharding allgather overlap, as the trainer excerpt below shows:
    if self.args.gradient_accumulation_steps > 1 and self._enable_delay_scale_loss():
        paddle.device.synchronize()
        for p in model._layers.parameters():
            with paddle.no_grad():
                if hasattr(p, "main_grad") and p.main_grad is not None:
                    assert p.grad is None
                    p.main_grad.scale_(1.0 / self.args.gradient_accumulation_steps)
                elif p.grad is not None:
                    p.grad.scale_(1.0 / self.args.gradient_accumulation_steps)
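
For contrast, here is a minimal sketch of the non-delayed path that applies once delay_scale_loss is removed: each micro-batch loss is divided by the number of accumulation steps before backward(), so no device-wide synchronize() or end-of-window gradient rescaling is needed. The model, optimizer, loss_fn, micro_batches, and acc_steps names are hypothetical placeholders for illustration, not PaddleNLP trainer code.

    import paddle

    def accumulate_without_delay_scale_loss(model, optimizer, loss_fn, micro_batches, acc_steps=4):
        # Scale every micro-batch loss up front instead of rescaling the
        # accumulated gradients after the whole window, so the extra
        # paddle.device.synchronize() in the excerpt above is unnecessary.
        for step, (x, y) in enumerate(micro_batches):
            loss = loss_fn(model(x), y) / acc_steps
            loss.backward()  # gradients accumulate across micro-batches
            if (step + 1) % acc_steps == 0:
                optimizer.step()
                optimizer.clear_grad()

Scaling the loss only touches one scalar per micro-batch, so it avoids the synchronization that interferes with the sharding allgather overlap.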

paddle-bot bot commented Jun 19, 2024

Thanks for your contribution!

codecov bot commented Jun 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 54.18%. Comparing base (cd2a70e) to head (d98e9e7).
Report is 241 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #8623   +/-   ##
========================================
  Coverage    54.18%   54.18%           
========================================
  Files          625      625           
  Lines        98947    98947           
========================================
  Hits         53618    53618           
  Misses       45329    45329           


@ZHUI ZHUI merged commit 970b868 into PaddlePaddle:develop Jun 19, 2024
9 of 11 checks passed
@Xreki Xreki deleted the opt_llama2_benchmark branch June 19, 2024 06:50