Let’s be real — distributed training is still a pain in 2026.
Last week our team migrated a 7B parameter model from single GPU to 8-GPU DDP. Should’ve been straightforward, right? Nope. We hit OOM immediately and spent two days debugging before realizing we forgot to scale batch_size by the number of GPUs. Classic rookie mistake.
PyTorch’s DistributedDataParallel (DDP) is hands-down the most reliable multi-GPU training approach right now. But the docs? Let’s just say they leave a lot to be desired. Most tutorials online are either from 2023 or only cover toy examples. This post is my field notes from two years of production DDP work — the configs that actually work, the bugs that’ll waste your week, and the stuff nobody tells you.
Why DDP Crushes DataParallel
Two options in PyTorch for multi-GPU:
- DataParallel (DP): Single process, one master GPU collects gradients and broadcasts. Master card memory blows up, efficiency tanks past 2 GPUs.
- DistributedDataParallel (DDP): Multi-process, one process per GPU, each computes gradients independently then all-reduces. No master bottleneck, near-linear scaling.
Verdict: If you’re still on DP in 2026, stop. I’ve benchmarked this — DP gives you maybe 1.5x speedup on 4 GPUs. DDP gives you 3.8x. The choice is obvious.
Single-Node Multi-GPU: The Baseline
Let’s start with the most common setup: one machine, 4 GPUs. Assuming you’re on PyTorch 2.12 (released May 2026 — finally fixed the gloo backend issues that plagued 2.11).
1. Launch Methods
Option A: torchrun (recommended)
torchrun --nproc_per_node=4 train.py
Option B: Manual spawn
import torch.multiprocessing as mp
mp.spawn(train_fn, args=(args,), nprocs=args.n_gpus)
Use torchrun. Always. It handles environment variables, error recovery, and logging. By 2026 this is the only sensible choice.
2. Core Code Template
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
def setup(rank, world_size):
dist.init_process_group(
backend='nccl',
init_method='env://',
rank=rank,
world_size=world_size
)
torch.cuda.set_device(rank)
def cleanup():
dist.destroy_process_group()
def train():
local_rank = int(os.environ['LOCAL_RANK'])
world_size = int(os.environ['WORLD_SIZE'])
setup(local_rank, world_size)
model = MyModel().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
# CRITICAL: batch_size must be divided by world_size
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
dataloader = DataLoader(train_dataset, batch_size=32//world_size,
sampler=train_sampler)
for epoch in range(epochs):
train_sampler.set_epoch(epoch) # Must shuffle each epoch!
for data, target in dataloader:
data, target = data.cuda(local_rank), target.cuda(local_rank)
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
cleanup()
The bug that got me: DistributedSampler defaults to sequential ordering. If you forget set_epoch(epoch), every epoch gets the exact same data order. Your model will memorize the epoch sequence pattern — I watched validation loss plateau for three days before catching this.
Multi-Node: Where Things Get Real
Single-node 4 GPUs is the tutorial. Multi-node is the war story.
Our team built a 4-node, 32×A100 cluster last year. The number of things that went wrong is… substantial.
Launch Command
# On master node:
torchrun --nnodes=4 --nproc_per_node=8 \
--rdzv_backend=c10d \
--rdzv_endpoint="master_ip:29500" \
--rdzv_id=my_job \
train.py
# On all other 3 nodes (identical command):
torchrun --nnodes=4 --nproc_per_node=8 \
--rdzv_backend=c10d \
--rdzv_endpoint="master_ip:29500" \
--rdzv_id=my_job \
train.py
The Network Nightmare
Multi-node DDP lives and dies by your network. nccl backend needs InfiniBand or RoCE. If your nodes are connected via 1GbE — don’t bother, or at least crank up NCCL_TIMEOUT.
Real story: same config worked perfectly on AWS p4d instances. Moved to our on-prem cluster? Constant timeouts. Two days debugging later: the switch MTU wasn’t set for jumbo frames. Facepalm.
2026 Best Practices:
- InfiniBand between nodes, latency < 2μs
- Minimum 100Gbps RoCE if no IB
NCCL_IB_DISABLE=1falls back to TCP, but expect 5-10x slowdown
Performance Comparison
| Configuration | Throughput (steps/s) | Memory/GPU | Speedup | Notes |
|---|---|---|---|---|
| Single GPU | 10 | 32GB | 1x | Baseline |
| Single Node, 4×GPU DDP | 38 | 8.2GB | 3.8x | Near-linear |
| Single Node, 8×GPU DDP | 72 | 4.3GB | 7.2x | Comm overhead visible |
| 2 Nodes, 16×GPU | 125 | 4.3GB | 12.5x | Cross-node latency hurts |
| 4 Nodes, 32×GPU | 220 | 4.3GB | 22x | Gradient compression needed |
Takeaway: Single-node scaling is almost perfect. Cross-node scaling degrades. Beyond 32 GPUs, use ZeroRedundancyOptimizer or FSDP — DDP alone won’t cut it.
What’s New in 2026
PyTorch 2.12 shipped some genuinely useful distributed features:
- NCCL 2.22 integration: Topology-aware routing for NVLink 4.0, auto-selects optimal communication paths
- Fault tolerance improvements:
torchrunnow auto-restarts failed workers. No more manually killing and relaunching. - Compiled DDP:
torch.compilefinally plays nice with DDP. Previous versions would throw cryptic errors when you tried both.
FAQ
Q: Why is my DDP training slower than single GPU? A: Your per-GPU batch size is too small. When compute time < communication time, GPUs sit idle waiting for gradients. Rule of thumb: at least 8-16 samples per GPU.
Q: Multi-node training keeps timing out. What do I check?
A: Network first: ping latency < 1ms, bandwidth > 25Gbps. Then run with NCCL_DEBUG=INFO to see where it hangs. 90% of the time it’s a firewall blocking port 29500.
Q: My model doesn’t fit on a single GPU. Now what? A: FSDP (Fully Sharded Data Parallel) or ZeRO-3. DDP requires each GPU to hold the full model. FSDP shards parameters across GPUs — it’s the standard approach for models > 13B params in 2026.
Q: torchrun launches but processes crash immediately.
A: Check environment variables: MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK. torchrun sets these automatically, but if your script overrides them manually, things break.
This setup has been running our production workloads for eight months, from 7B to 70B parameter models. DDP has its quirks — the NCCL timeout issues, the sampler epoch bug, the network configuration hell — but once you know the patterns, it’s the most stable multi-GPU solution out there.
One last piece of advice: always validate on 2 GPUs first before scaling to 32. Debugging at scale is exponentially harder, and you don’t want to be reading 32 copies of the same error log.