
Cap nproc_per_node based on the CPU resources of the node for PyTorch TrainJob #2407

Open
astefanutti opened this issue Jan 31, 2025 · 2 comments

Comments

@astefanutti
Contributor

PyTorch relies on the number of CPUs of the physical host to determine the "local world size" when nproc_per_node is set to auto and the node is a CPU-only device.

In that configuration, which is used by the preset torch-distributed training runtime, the number of processes equals the number of CPUs on the host, which leads to the following problems:

  • Out-of-memory issues for worker Pods scheduled on nodes with a large number of CPUs
  • Deadlocks when the CPU limit set for the container is lower than the actual number of CPUs on the host

To mitigate these issues, when the PyTorch ML policy defines numProcPerNode: auto, nproc_per_node should default to the container's CPU limit when one is set, or fall back to 1 otherwise.
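The capping logic described above can be sketched as follows. This is an illustrative snippet, not the actual Kubeflow Trainer code; the function name and the millicore-based input are assumptions for the example.

```go
package main

import (
	"fmt"
	"math"
)

// nprocPerNode caps the PyTorch process count at the container CPU limit,
// expressed in millicores, and falls back to 1 when no limit is set.
// Name and signature are hypothetical.
func nprocPerNode(cpuLimitMilli int64) int {
	if cpuLimitMilli <= 0 {
		// No CPU limit set: avoid defaulting to one process per host CPU.
		return 1
	}
	// Floor fractional CPU limits (e.g. 500m) down, but never below 1.
	n := int(math.Floor(float64(cpuLimitMilli) / 1000.0))
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	fmt.Println(nprocPerNode(4000)) // CPU limit "4" -> 4 processes
	fmt.Println(nprocPerNode(500))  // CPU limit "500m" -> floored to 1
	fmt.Println(nprocPerNode(0))    // no limit -> default to 1
}
```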

This has been discussed in detail in #2387 (comment).

@andreyvelich
Member

As we discussed, we should move the numProcPerNode assignment from the client SDK to the ML plugins: #2470 (comment).

If the user doesn't explicitly set the .trainer.numProcPerNode value in the TrainJob, we should automatically calculate this value based on the container resources and device type (cpu, gpu, tpu).

/good-first-issue
/area controller
/kind enhancement
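The device-type dispatch suggested above could look roughly like this sketch. The resource keys `nvidia.com/gpu` and `google.com/tpu` are standard Kubernetes extended-resource names, but the function itself and its map-based input are hypothetical, not the plugin's actual API.

```go
package main

import "fmt"

// resolveNumProcPerNode derives a default numProcPerNode from container
// resources when the user leaves .trainer.numProcPerNode unset.
// Quantities are millicores for "cpu" and whole device counts otherwise.
func resolveNumProcPerNode(resources map[string]int64) int {
	// Accelerator requests take precedence: one process per device.
	if n, ok := resources["nvidia.com/gpu"]; ok && n > 0 {
		return int(n)
	}
	if n, ok := resources["google.com/tpu"]; ok && n > 0 {
		return int(n)
	}
	// CPU-only: cap at the CPU limit, defaulting to 1.
	if milli, ok := resources["cpu"]; ok && milli >= 1000 {
		return int(milli / 1000)
	}
	return 1
}

func main() {
	fmt.Println(resolveNumProcPerNode(map[string]int64{"nvidia.com/gpu": 2})) // 2
	fmt.Println(resolveNumProcPerNode(map[string]int64{"cpu": 8000}))         // 8
	fmt.Println(resolveNumProcPerNode(map[string]int64{}))                    // 1
}
```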


@andreyvelich:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
