Enabling Faster Pre-training for DeepSeek-V3 on B200 with TorchTitan
PyTorch and Nebius collaborated to enable training of DeepSeek-V3 Mixture-of-Experts models (16B and 671B) on a 256-GPU NVIDIA B200 cluster using TorchTitan, achieving up to 41% faster pre-training.
Why it matters
Faster pre-training of large language models shortens research iteration cycles and lowers the compute cost of developing frontier-scale models.
Key Points
- Enabled training of large DeepSeek-V3 Mixture-of-Experts models (16B and 671B)
- Achieved up to 41% faster pre-training on a 256-GPU NVIDIA B200 cluster
- Leveraged PyTorch and TorchTitan for distributed training
Details
The article describes a joint effort between PyTorch and Nebius to speed up pre-training of DeepSeek-V3 Mixture-of-Experts models on a large-scale GPU cluster. The team evaluated two techniques on a 256-GPU NVIDIA B200 system using TorchTitan: MXFP8, a block-scaled FP8 (microscaling) format natively supported by Blackwell GPUs that reduces the precision of matrix-multiply inputs while preserving dynamic range via per-block shared scales, and DeepEP, an efficient expert-parallel communication library for the all-to-all token dispatch and combine in Mixture-of-Experts layers. Combining these techniques yielded up to 41% faster pre-training for the 16B and 671B parameter DeepSeek-V3 models.
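To make the MXFP8 idea concrete, the sketch below simulates microscaling-style block quantization in plain Python: each block of 32 values shares one power-of-two scale chosen so the largest element fits in the FP8 E4M3 range (max magnitude 448). This is a simplified illustration of the format's scaling scheme, not the TorchTitan/torchao implementation; real MXFP8 additionally rounds each scaled element to an actual E4M3 code.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3
BLOCK = 32            # MX block size: 32 elements share one scale factor

def mx_quantize_block(xs):
    """Return (shared_scale, scaled_values) for one block of floats.

    The shared scale is a power of two (as in the MX E8M0 scale format),
    picked so max(|x|) / scale fits within the FP8 E4M3 range.
    Simplified: element-wise rounding to E4M3 codes is omitted.
    """
    amax = max(abs(x) for x in xs)
    if amax == 0.0:
        return 1.0, [0.0] * len(xs)
    scale = 2.0 ** math.ceil(math.log2(amax / FP8_E4M3_MAX))
    return scale, [x / scale for x in xs]

def mx_dequantize(scale, qs):
    """Reconstruct approximate original values from the block encoding."""
    return [q * scale for q in qs]

# Example: a block whose values span a wide dynamic range
block = [0.001 * (i + 1) for i in range(BLOCK)]
scale, q = mx_quantize_block(block)
recon = mx_dequantize(scale, q)
```

The per-block shared scale is what lets FP8's narrow dynamic range cover tensors with widely varying magnitudes, which is why the format holds up for training-scale matrix multiplications on Blackwell hardware.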