Hello,
I have two nodes, each with two RTX 4090s and a Mellanox dual-port 25 GbE NIC used as a RoCE interconnect. I'd like to know whether it is possible to run llm-compressor in "distributed mode", leveraging Accelerate's ability to handle multi-node training. I may be misunderstanding the functionality, but if not, I would be grateful for an example of how to leverage the GPUs across multiple nodes.
Thank you for a great tool!
Hi @nicklausbrown, you likely will not need to train on multiple nodes for the compression algorithms we are providing here. We run calibration rather than full training: this usually involves caching the activations from a single batch of data and performing the compression based on those results. Are you looking to do post-training afterward? That would certainly benefit from multi-node, but it is not needed for most of the pipelines we currently support.
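For reference, here is a minimal single-node calibration sketch in the spirit of the llm-compressor examples (the `oneshot` entry point and `GPTQModifier` are assumptions about the current API; exact import paths, argument names, and the placeholder model/dataset may differ by version). With `device_map="auto"`, Accelerate shards the model across the GPUs visible on one node during calibration, so no multi-node launch is required:

```python
# Minimal single-node calibration sketch -- import paths and argument names
# follow the llm-compressor examples and may vary across versions.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model

# device_map="auto" lets Accelerate spread layers across the GPUs on this
# node (e.g. both 4090s); no cross-node communication is involved.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# One-shot calibration: cache activations from a small calibration set and
# apply GPTQ-style weight quantization based on them -- no training loop.
oneshot(
    model=model,
    tokenizer=tokenizer,
    dataset="open_platypus",  # example calibration dataset name
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    max_seq_length=512,
    num_calibration_samples=256,
    output_dir="./tinyllama-w4a16",
)
```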