Receive daily AI-curated summaries of engineering articles from top tech companies worldwide.
Endigest AI Core Summary
This post introduces an agent skill that enables coding agents (Claude and Codex) to write production-ready CUDA kernels for HuggingFace's diffusers and transformers libraries.
• The skill (~550 tokens of structured guidance) packages GPU architecture-specific knowledge for H100, A100, and T4, covering memory access patterns, vectorization strategies, and PyTorch bindings
• Agents install the skill via `kernels skills add cuda-kernels --claude` and generate complete kernel projects including CUDA source, PyTorch C++ bindings, build config, and benchmark scripts
• For LTX-Video (diffusers), RMSNorm kernels achieved an average 1.88x isolated speedup and 1.43x end-to-end speedup on H100 when combined with torch.compile
• For Qwen3-8B (transformers), RMSNorm kernels achieved an average 1.94x isolated speedup, scaling from 1.58x at 128 tokens to 2.47x at 8192 tokens
• Generated kernels can be published to the HuggingFace Kernel Hub for community reuse without requiring local compilation
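For context, RMSNorm normalizes each hidden vector by its root-mean-square and applies a learned per-channel scale. A pure-Python reference of the operation the generated CUDA kernels accelerate might look like the sketch below (the function name and `eps` default are illustrative, not taken from the article):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm over one hidden-state vector.

    x:      list of floats (the vector to normalize)
    weight: list of floats, same length as x (learned per-channel scale)
    eps:    small constant for numerical stability (illustrative default)
    """
    # Mean of squared elements, then reciprocal root-mean-square.
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Scale each element by the inverse RMS and its channel weight.
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

A fused CUDA kernel performs this reduction and scaling in a single pass over GPU memory instead of several separate elementwise operations, which is where speedups like those reported above typically come from.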
This summary was automatically generated by AI based on the original article and may not be fully accurate.