Endigest AI Core Summary
This post explains MoE architecture and how transformers v5 added first-class MoE support.
• MoEs replace dense FFN layers with expert sub-networks; a router selects a few experts per token, so model capacity scales with total parameters while inference cost scales only with the active parameters.
• gpt-oss-20b has 21B total but only ~3.6B active parameters per token, running at ~115 tok/s on an M3 Ultra Mac.
• transformers v5 introduces WeightConverter, which packs per-expert checkpoint tensors into contiguous tensors for efficient grouped GEMM kernels.
• Async lazy materialization and single-pass routing cut Qwen1.5-110B load time from 66s (v4) to 20s (v5); tensor-parallel (TP) mode reduces it further to 10s.
• Quantization is now integrated into the loading pipeline and is applied once experts are in the packed layout.
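The top-k routing described in the first bullet can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the transformers implementation: `route_top_k`, the router weight matrix, and all dimensions are hypothetical.

```python
import numpy as np

def route_top_k(hidden, router_w, k=2):
    """Illustrative top-k MoE routing (not the transformers API).
    hidden: (tokens, d_model); router_w: (d_model, n_experts)."""
    logits = hidden @ router_w                         # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]         # k best experts per token
    sel = np.take_along_axis(logits, topk, axis=-1)    # their logits
    # softmax over only the selected logits, so the k weights sum to 1
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return topk, w

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 8))       # 4 tokens, d_model = 8
router_w = rng.standard_normal((8, 16))    # 16 experts, 2 active per token
experts, weights = route_top_k(hidden, router_w, k=2)
print(experts.shape, weights.shape)  # → (4, 2) (4, 2)
```

Only the two selected experts run per token, which is why inference cost tracks active rather than total parameters.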
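The packing idea behind WeightConverter can also be sketched: per-expert checkpoint tensors become one contiguous tensor that a grouped GEMM kernel can index directly. Shapes and the key naming scheme below are assumptions for illustration, not the actual checkpoint format.

```python
import numpy as np

n_experts, d_in, d_out = 4, 8, 16
rng = np.random.default_rng(0)

# checkpoint-style layout: one tensor per expert (key names are hypothetical)
per_expert = {f"experts.{i}.up_proj.weight": rng.standard_normal((d_in, d_out))
              for i in range(n_experts)}

# packed layout: a single contiguous (n_experts, d_in, d_out) tensor
packed = np.stack([per_expert[f"experts.{i}.up_proj.weight"]
                   for i in range(n_experts)])
print(packed.shape)  # → (4, 8, 16)
# a grouped GEMM can now apply expert e via packed[e] without per-expert dispatch
```

Keeping all experts in one contiguous buffer is also what lets quantization run once over the packed layout, as the last bullet notes.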
This summary was automatically generated by AI based on the original article and may not be fully accurate.