Endigest AI Core Summary
This post explains MoE architecture and how transformers v5 added first-class MoE support.
• MoEs replace dense FFN layers with expert sub-networks; a router selects a few experts per token, so model capacity scales with total parameters while inference cost scales only with the active parameters.
• gpt-oss-20b has 21B total but only ~3.6B active parameters per token, running at ~115 tok/s on an M3 Ultra Mac.
• transformers v5 introduces WeightConverter, which packs per-expert checkpoint tensors into contiguous tensors for efficient grouped GEMM kernels.
• Async lazy materialization and single-pass routing cut Qwen1.5-110B load time from 66s (v4) to 20s (v5); tensor-parallel (TP) mode reduces it further to 10s.
• Quantization is now integrated into the loading pipeline and is applied once experts are in the packed layout.
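The top-k routing described in the first bullet can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the transformers implementation: `route_top_k`, the router weight matrix, and all dimensions are hypothetical.

```python
import numpy as np

def route_top_k(hidden, router_w, k=2):
    """Illustrative top-k MoE routing (not the transformers API).
    hidden: (tokens, d_model); router_w: (d_model, n_experts)."""
    logits = hidden @ router_w                         # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]         # k best experts per token
    sel = np.take_along_axis(logits, topk, axis=-1)    # their logits
    # softmax over only the selected logits, so the k weights sum to 1
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return topk, w

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 8))       # 4 tokens, d_model = 8
router_w = rng.standard_normal((8, 16))    # 16 experts, 2 active per token
experts, weights = route_top_k(hidden, router_w, k=2)
print(experts.shape, weights.shape)  # → (4, 2) (4, 2)
```

Only the two selected experts run per token, which is why inference cost tracks active rather than total parameters.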
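The packing idea behind WeightConverter can also be sketched: per-expert checkpoint tensors become one contiguous tensor that a grouped GEMM kernel can index directly. Shapes and the key naming scheme below are assumptions for illustration, not the actual checkpoint format.

```python
import numpy as np

n_experts, d_in, d_out = 4, 8, 16
rng = np.random.default_rng(0)

# checkpoint-style layout: one tensor per expert (key names are hypothetical)
per_expert = {f"experts.{i}.up_proj.weight": rng.standard_normal((d_in, d_out))
              for i in range(n_experts)}

# packed layout: a single contiguous (n_experts, d_in, d_out) tensor
packed = np.stack([per_expert[f"experts.{i}.up_proj.weight"]
                   for i in range(n_experts)])
print(packed.shape)  # → (4, 8, 16)
# a grouped GEMM can now apply expert e via packed[e] without per-expert dispatch
```

Keeping all experts in one contiguous buffer is also what lets quantization run once over the packed layout, as the last bullet notes.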
This summary was automatically generated by AI based on the original article and may not be fully accurate.