MXFP8 Training for MoEs: 1.3x Speedup for Llama4 Scout on GB200 Cluster
Researchers achieved a 30.2% training speedup for the Llama4 Scout model, relative to a bfloat16 baseline, using MXFP8 mixed-precision training primitives from TorchAO on a GB200 cluster.
Why it matters
Improving the training efficiency of large language models is crucial for advancing AI capabilities and reducing the computational cost and environmental impact of model development.
Key Points
- Demonstrated a +30.2% (1.3x) training speedup for Llama4 Scout using MXFP8 mixed-precision training
- Achieved loss convergence equivalent to bfloat16 while realizing ~81% of the theoretical maximum MXFP8 speedup
- Leveraged TorchAO and TorchTitan for efficient mixed-precision training on a GB200 cluster
Details
The researchers used MXFP8 (microscaling FP8) training primitives from TorchAO to train the Llama4 Scout model 30.2% faster than a bfloat16 baseline on a GB200 cluster, roughly 81% of the theoretical maximum speedup MXFP8 can provide. In the MXFP8 format, defined by the OCP Microscaling specification, a tensor is divided into blocks of 32 elements, each stored as FP8 values with a shared power-of-two scale; this preserves dynamic range while halving the bit width of bfloat16. TorchAO supplied the low-precision layers and kernels, while TorchTitan orchestrated the mixed-precision training at scale on the GB200 cluster. The results show that MXFP8 can significantly accelerate the training of large language models while matching the convergence of higher-precision training.
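To make the format concrete, below is a small, self-contained PyTorch emulation of MXFP8 block quantization as defined by the OCP Microscaling spec: each 32-element block shares one power-of-two scale and stores its elements in FP8 e4m3. The helper names (`mxfp8_quantize`, `mxfp8_dequantize`) are hypothetical, and production kernels fuse these steps into the GEMM rather than materializing tensors like this.

```python
# Illustrative emulation of MXFP8 block quantization (OCP Microscaling).
# Each block of 32 elements shares one power-of-two scale; the scaled
# elements are stored in FP8 e4m3. Helper names here are hypothetical.
import torch

BLOCK_SIZE = 32
E4M3_MAX = 448.0      # largest finite value in torch.float8_e4m3fn
E4M3_MAX_EXP = 8.0    # floor(log2(448)), the e4m3 max exponent


def mxfp8_quantize(x: torch.Tensor):
    """Quantize x into (fp8 elements, per-block power-of-two scales)."""
    blocks = x.reshape(-1, BLOCK_SIZE).float()
    # Shared scale per block: a power of two chosen so the block's
    # absolute max fits within the FP8 e4m3 range. In real MXFP8 the
    # scale is stored as a single E8M0 exponent byte per block.
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=2.0**-126)
    scale = torch.exp2(torch.floor(torch.log2(amax)) - E4M3_MAX_EXP)
    elems = (blocks / scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return elems, scale


def mxfp8_dequantize(elems, scale, shape):
    """Reconstruct a bfloat16 tensor from fp8 elements and block scales."""
    return (elems.float() * scale).reshape(shape).to(torch.bfloat16)


x = torch.randn(4, 128, dtype=torch.bfloat16)
elems, scale = mxfp8_quantize(x)
x_hat = mxfp8_dequantize(elems, scale, x.shape)
print((x.float() - x_hat.float()).abs().max())  # small quantization error
```

Because each scale is a pure power of two, rescaling is exact (no rounding in the scale itself), which is part of why MXFP8 can match bfloat16 convergence despite the narrower element format.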
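For experimentation, TorchAO exposes prototype MX training configs. The sketch below shows roughly how one might convert a model's linear layers to MXFP8; `MXLinearConfig` and the `torchao.prototype.mx_formats` module path reflect the prototype API at the time of writing and may change between releases, and the toy model and hyperparameters are purely illustrative, not the Llama4 Scout setup.

```python
# Minimal sketch: enabling MXFP8 training with TorchAO's prototype MX API.
# Module path and config names may differ across TorchAO versions.
import torch
import torch.nn as nn
from torchao.quantization import quantize_
from torchao.prototype.mx_formats import MXLinearConfig

# Toy stand-in for a transformer MLP; in the reported runs the model was
# Llama4 Scout, built and parallelized by TorchTitan.
model = nn.Sequential(
    nn.Linear(4096, 14336, bias=False),
    nn.SiLU(),
    nn.Linear(14336, 4096, bias=False),
).cuda().to(torch.bfloat16)

# Swap eligible nn.Linear modules for MXFP8 linears: inputs, weights, and
# grads are cast to FP8 (e4m3) in 32-element blocks with shared
# power-of-two scales, while master weights stay in high precision.
config = MXLinearConfig(
    elem_dtype=torch.float8_e4m3fn,  # FP8 element type per the MX spec
    block_size=32,                   # scaling-block size per the MX spec
)
quantize_(model, config)

# Training proceeds as usual; matmuls route through MXFP8 kernels.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()
```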