Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP

2026-06-11

This blog post profiles PyTorch's nn.Linear and a simple MLP to show which GPU kernels actually execute and how torch.compile reduces CPU dispatch overhead.

• nn.Linear uses the addmm kernel internally, combining matrix multiplication and bias addition into a single GPU operation
• aten::t (transpose) is a view operation: it modifies only tensor metadata (shape and strides) and copies no data
• torch.compile eliminates the CPU overhead of dispatching transpose views by hardcoding the precomputed strides directly into the kernel call
• GPU kernels use different binary implementations depending on input layouts, distinguishable by kernel-name suffixes such as _tn_ for a transposed layout
• Epilogue optimization folds small operations such as bias addition into the matrix multiplication kernel's writeback phase to minimize memory traffic
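
The first two points can be checked with a few lines of PyTorch. The snippet below is a minimal sketch that runs on CPU, with tensor sizes chosen purely for illustration. It confirms that transposing the weight yields a view over the same storage with swapped strides, and that a single torch.addmm call reproduces the output of nn.Linear. It does not show which kernel the dispatcher selects on a GPU; that still has to be read from a profiler trace.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
batch, in_features, out_features = 4, 8, 3
x = torch.randn(batch, in_features)
linear = torch.nn.Linear(in_features, out_features)

with torch.no_grad():
    # nn.Linear stores its weight as (out_features, in_features).
    w = linear.weight
    w_t = w.t()  # transpose view: new metadata, same underlying data

    # Both tensors point at the same memory; only the strides differ.
    same_storage = w_t.data_ptr() == w.data_ptr()
    strides = (w.stride(), w_t.stride())

    # bias + x @ W^T written as a single addmm call.
    y_addmm = torch.addmm(linear.bias, x, w_t)
    y_linear = linear(x)
    matches = torch.allclose(y_addmm, y_linear, atol=1e-6)

print("same storage:", same_storage)
print("strides (weight, weight.t()):", strides)
print("addmm matches nn.Linear:", matches)
```

Because w_t is non-contiguous, a matrix multiplication backend has to account for the transposed layout, which is the kind of distinction the kernel-name suffixes in a profiler trace reflect.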