Receive daily AI-curated summaries of engineering articles from top tech companies worldwide.
Endigest AI Core Summary
This post introduces an agent skill that enables coding agents (Claude and Codex) to write production-ready CUDA kernels for HuggingFace's diffusers and transformers libraries.
• The skill (~550 tokens of structured guidance) packages GPU architecture-specific knowledge for H100, A100, and T4, covering memory access patterns, vectorization strategies, and PyTorch bindings
• Agents install the skill via `kernels skills add cuda-kernels --claude` and generate complete kernel projects including CUDA source, PyTorch C++ bindings, build config, and benchmark scripts
• For LTX-Video (diffusers), RMSNorm kernels achieved an average 1.88x isolated speedup and 1.43x end-to-end speedup on H100 when combined with torch.compile
• For Qwen3-8B (transformers), RMSNorm kernels achieved an average 1.94x isolated speedup, scaling from 1.58x at 128 tokens to 2.47x at 8192 tokens
• Generated kernels can be published to the HuggingFace Kernel Hub for community reuse without requiring local compilation
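For context, RMSNorm normalizes each hidden vector by its root-mean-square and applies a learned per-channel scale. A pure-Python reference of the operation the generated CUDA kernels accelerate might look like the sketch below (the function name and `eps` default are illustrative, not taken from the article):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm over one hidden-state vector.

    x:      list of floats (the vector to normalize)
    weight: list of floats, same length as x (learned per-channel scale)
    eps:    small constant for numerical stability (illustrative default)
    """
    # Mean of squared elements, then reciprocal root-mean-square.
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Scale each element by the inverse RMS and its channel weight.
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

A fused CUDA kernel performs this reduction and scaling in a single pass over GPU memory instead of several separate elementwise operations, which is where speedups like those reported above typically come from.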
This summary was automatically generated by AI based on the original article and may not be fully accurate.