This page provides reference information for configuring and troubleshooting Instant Clusters.
Environment variables
The following environment variables are automatically set on all nodes in an Instant Cluster:

| Environment variable | Description |
|---|---|
| `PRIMARY_ADDR` / `MASTER_ADDR` | The address of the primary node. |
| `PRIMARY_PORT` / `MASTER_PORT` | The port of the primary node. All ports are available. |
| `NODE_ADDR` | The static IP of this node within the cluster network. |
| `NODE_RANK` | The cluster rank (i.e., global rank) assigned to this node. `NODE_RANK` is 0 for the primary node. |
| `NUM_NODES` | The number of nodes in the cluster. |
| `NUM_TRAINERS` | The number of GPUs per node. |
| `HOST_NODE_ADDR` | A convenience variable, defined as `PRIMARY_ADDR:PRIMARY_PORT`. |
| `WORLD_SIZE` | The total number of GPUs in the cluster (`NUM_NODES * NUM_TRAINERS`). |
Each node has a static IP (`NODE_ADDR`) on the overlay network. When a cluster is deployed, the system designates one node as the primary node by setting the `PRIMARY_ADDR` and `PRIMARY_PORT` environment variables. This simplifies working with multiprocessing libraries that require a primary node.
The following variables are equivalent:

- `MASTER_ADDR` and `PRIMARY_ADDR`
- `MASTER_PORT` and `PRIMARY_PORT`
The `MASTER_*` variables are provided for compatibility with tools that expect these legacy names.
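Taken together, these variables are enough for a launch script to work out its place in the cluster. A minimal shell sketch (the fallback defaults are only so the snippet runs outside a cluster; on a node the platform sets every variable):

```shell
#!/bin/bash
# Minimal sketch: read the cluster variables a node receives.
# The fallbacks (:=1, :=0) are illustrative, not part of the platform.
: "${NUM_NODES:=1}" "${NUM_TRAINERS:=1}" "${NODE_RANK:=0}"

# WORLD_SIZE is provided on cluster nodes; derive it otherwise.
WORLD_SIZE="${WORLD_SIZE:-$(( NUM_NODES * NUM_TRAINERS ))}"

echo "node ${NODE_RANK} of ${NUM_NODES}, world size ${WORLD_SIZE}"
```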
Network interfaces
Instant Clusters use dedicated high-bandwidth network interfaces for inter-node communication, separate from the management interface used for external traffic.

| Interface | Purpose |
|---|---|
| `ens1`–`ens8` | High-bandwidth interfaces for inter-node communication. Each interface provides a private network connection. |
| `eth0` | Management interface for external traffic (internet connectivity). |
The `PRIMARY_ADDR` environment variable corresponds to `ens1`, which enables launching and bootstrapping distributed processes.
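To see these interfaces on a node, you can list them from `/sys` (a generic Linux check, not Runpod-specific tooling):

```shell
# List network interface names; on a cluster node this should show
# eth0 and ens1 through ens8 alongside the loopback device (lo).
ls /sys/class/net
```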
NCCL configuration
NCCL (NVIDIA Collective Communications Library) handles GPU-to-GPU communication in distributed training. You must configure NCCL to use the internal network interfaces.

Required configuration
Set the `NCCL_SOCKET_IFNAME` environment variable to use the internal network:
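For example, in the shell (or job script) that launches training on each node:

```shell
# Pin NCCL's socket traffic to the high-bandwidth internal interface.
export NCCL_SOCKET_IFNAME=ens1
```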
By default, NCCL uses all available interfaces. Explicitly setting `NCCL_SOCKET_IFNAME` to `ens1` ensures NCCL uses the high-bandwidth internal network.
Debugging NCCL

To troubleshoot multi-node communication issues, enable NCCL debug logging by setting `NCCL_DEBUG=INFO` in the environment before launching your job.

Troubleshooting
Connection timeouts during distributed training
**Symptom:** Training jobs fail with connection timeout errors between nodes.

**Cause:** NCCL is attempting to communicate over the external network interface (`eth0`) instead of the internal interfaces (`ens1`–`ens8`).
**Solution:** Set the `NCCL_SOCKET_IFNAME` environment variable:
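For example, before launching the job on every node:

```shell
# Direct NCCL away from eth0 and onto the internal network.
export NCCL_SOCKET_IFNAME=ens1
```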
Nodes cannot find the primary node
**Symptom:** Worker nodes fail to connect to the primary node during initialization.

**Cause:** The `PRIMARY_ADDR` or `MASTER_ADDR` environment variable is not being used correctly in your distributed training script.
**Solution:** Verify that your script uses the `PRIMARY_ADDR` environment variable for the rendezvous address. For PyTorch distributed training:
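One common way to wire the cluster variables through is a `torchrun` launch; a sketch is below, where `train.py` is a placeholder for your own training script and this flag set is one typical arrangement rather than the only one:

```shell
# Run on every node. NODE_RANK differs per node, while all nodes agree
# on the rendezvous point (PRIMARY_ADDR/PRIMARY_PORT).
torchrun \
  --nnodes="$NUM_NODES" \
  --nproc_per_node="$NUM_TRAINERS" \
  --node_rank="$NODE_RANK" \
  --master_addr="$PRIMARY_ADDR" \
  --master_port="$PRIMARY_PORT" \
  train.py
```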
Inconsistent training performance
**Symptom:** Training speed varies significantly between runs or degrades over time.

**Cause:** Network congestion or suboptimal NCCL configuration.

**Solution:**

- Ensure all nodes are in the same data center (Runpod handles this automatically).
- Enable NCCL debugging to identify bottlenecks: `export NCCL_DEBUG=INFO`
- Verify your batch sizes are appropriate for the cluster size to maintain efficient GPU utilization.