Decoupled Momentum Optimization #771

Open
peter-sk wants to merge 2 commits into main
Conversation

peter-sk

Cleaned-up version of https://github.com/bloc97/DeMo for integrating efficient distributed training à la Decoupled Momentum Optimization (https://arxiv.org/abs/2411.19870).
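
For orientation, a minimal sketch of how this optimizer would be wired up from the training config. It assumes the constructor arguments visible in the review diff below plus the config fields this PR adds; the import path and the learning-rate field name are assumptions, not code from the PR.

# Sketch only, not the PR's code.
from olmo.optim import DeMo  # assumed import path

optim = DeMo(
    model.parameters(),
    lr=cfg.optimizer.learning_rate,                      # field name assumed
    compression_decay=cfg.optimizer.compression_decay,   # decay of the local momentum/residual buffer
    compression_topk=cfg.optimizer.compression_topk,     # top-k components transmitted per chunk
    compression_chunk=cfg.optimizer.compression_chunk,   # chunk size for the per-chunk transform
    weight_decay=cfg.optimizer.weight_decay,
    process_group=None,  # group over which compressed updates are exchanged; None = default group
)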

@dirkgr (Member) left a comment

This looks interesting, but I don't actually know what this optimizer is. Can you give some background?

Edit: There is a description at the top. Am blind.

compression_topk=cfg.optimizer.compression_topk,
compression_chunk=cfg.optimizer.compression_chunk,
weight_decay=cfg.optimizer.weight_decay,
process_group=None, # TODO: fix for hybrid sharding

Member
This seems important? Hybrid is necessary for big models.
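
One possible direction for this TODO, as a sketch only (not from the PR): it assumes 8 GPUs per node, PyTorch's DeviceMesh API, and that with hybrid sharding DeMo's compressed exchange should run over the cross-node replica group rather than the default world group.

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

gpus_per_node = 8  # assumption for the sketch
num_nodes = dist.get_world_size() // gpus_per_node

# 2D mesh matching hybrid sharding: the inner dim shards within a node, the outer
# dim replicates across nodes. In practice this should be the same mesh the FSDP
# wrapper is built with.
mesh = init_device_mesh(
    "cuda", (num_nodes, gpus_per_node), mesh_dim_names=("replicate", "shard")
)

# Intra-node reduce-scatter stays as-is; DeMo's exchange would stand in for the
# inter-node all-reduce, so pass the cross-node group. Other constructor
# arguments as in the diff above.
optim = DeMo(model.parameters(), process_group=mesh.get_group("replicate"))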

Comment on lines +537 to +548
### DeMo parameters
compression_decay: float = 0.999

compression_topk: int = 32
"""
Number of top-k values to transmit per chunk; if dynamic top-k is enabled, this is the initial value.
"""

compression_chunk: int = 64
"""
Size of the gradient chunks. Note that 2D gradients are chunked in 2D, so the top-k sparsity is squared compared to 1D.
"""

Member
Prefix these with demo_?
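
A back-of-the-envelope reading of those defaults, assuming top-k is taken over each flattened chunk as in the reference DeMo implementation (the numbers are just the defaults above, not measurements):

chunk, topk = 64, 32

kept_1d = topk / chunk       # 32 / 64   = 0.5       -> about   2x fewer values sent
kept_2d = topk / chunk ** 2  # 32 / 4096 = 0.0078125 -> about 128x fewer values sent

print(f"fraction kept, 1D: {kept_1d:.4f}  2D: {kept_2d:.4f}")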

Comment on lines +854 to +855
disable_grad_sync: bool = False

Member
I see this setting twice, once here, and once in DDPGradSyncMode?

@@ -647,6 +649,177 @@ def get_post_step_metrics(
return metrics


class DeMo(torch.optim.SGD, Optimizer):

Member
It seems like the organization would make more sense if this class, and demo_utils.py, were in their own file together, and then we use __all__ to make this optimizer appear the same as the others.
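
A sketch of what that reorganization could look like; the file layout and the other exported names are illustrative only, not taken from the repo.

# Hypothetical olmo/optim/__init__.py: DeMo (together with what is now
# demo_utils.py) lives in its own module and is re-exported like the others.
from .demo import DeMo

__all__ = [
    "Optimizer",
    "AdamW",   # existing exports, listed only for illustration
    "LionW",
    "DeMo",
]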

@dirkgr (Member) commented on Feb 4, 2025

Oh, I see. You put a reference in the description 🙈.

Paper says you pushed this to 1B parameters / 100B tokens. Can you go further? Experience says things like this stop working when you go really big.

@dirkgr self-assigned this on Feb 4, 2025