Generating State-of-the-Art GEMMs with TorchInductor's CuteDSL Backend

TorchInductor now supports a new CuteDSL backend for matrix multiplication optimization, in addition to Triton, CUTLASS, and cuBLAS. This post discusses the technical motivations and benefits of the CuteDSL integration.

💡

Why it matters

The addition of the CuteDSL backend to TorchInductor provides PyTorch users with a new high-performance option for matrix multiplication, which is a fundamental operation in many AI and ML workloads.

Key Points

  • TorchInductor supports multiple autotuning backends for matrix multiplications
  • CuteDSL is a new backend, added alongside Triton, CUTLASS, and cuBLAS
  • CuteDSL delivers state-of-the-art performance for general matrix multiplications (GEMMs)

Details

TorchInductor is the compiler backend used by torch.compile in PyTorch; among other things, it generates optimized matrix multiplication kernels and autotunes across candidate implementations. It currently supports three GEMM backends: Triton, CUTLASS (C++), and cuBLAS. This post announces the integration of a fourth backend, CuteDSL, a Python-based domain-specific language (DSL) for generating high-performance GEMM kernels. The CuteDSL backend leverages advanced compiler optimizations and hardware-specific intrinsics, and outperforms the existing backends in many cases. The post discusses the technical details and motivations behind the integration, highlighting how CuteDSL's code generation capabilities can benefit PyTorch users in both matrix multiplication efficiency and ease of use.
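As a rough illustration of how backend selection fits into a TorchInductor workflow, the sketch below enables max-autotune compilation and restricts the GEMM backends the autotuner may consider. This is a minimal sketch, not taken from the post: the `max_autotune_gemm_backends` option exists in `torch._inductor.config`, but the exact backend name for CuteDSL (written here as `"CUTEDSL"`) is an assumption — check the documentation for your PyTorch version.

```python
import torch
import torch._inductor.config as inductor_config

# Comma-separated list of GEMM backends the autotuner may try.
# "ATEN" covers the cuBLAS path; "CUTEDSL" is an assumed name for
# the new CuteDSL backend and may differ in your PyTorch release.
inductor_config.max_autotune_gemm_backends = "TRITON,CUTLASS,CUTEDSL,ATEN"

@torch.compile(mode="max-autotune")
def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # On first call, TorchInductor benchmarks kernels from each
    # enabled backend and caches the fastest one for this shape.
    return a @ b
```

In `max-autotune` mode, TorchInductor benchmarks candidate kernels from each enabled backend at compile time and picks the fastest for the given shapes, so adding CuteDSL to the pool can improve GEMM performance without any change to model code.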
