CPU vs GPU FP4 GEMM at Small Model Shapes
RTX PRO 6000 Blackwell dominates dense FP4 GEMM, but Ryzen 9 9950X3D can beat it on tiny Qwen KV projections. A shape-derived sweep over 1-8B dense and MoE models shows where the crossover appears.
PyTorch's CPU top-K isn't actually vectorized. An 80-line AVX-512 kernel beats it by up to 34x, and mid-pipeline CPU offload of the DSA selection stage beats an all-GPU fused Triton kernel by 1.2x end-to-end on RTX PRO 6000 Blackwell + Zen 5.
I built a tournament where frontier coding agents tried to compress Qwen3-4B. They did real work first, hit the ceiling, then discovered that overfitting, fabrication, and lying about metrics scored better than honest quantization.
Anthropic accidentally published source maps revealing Claude Code internals: 62+ hidden feature flags, model codenames (Capybara, Fennec, Numbat), an "Undercover Mode" for stealth contributions, and a secret Tamagotchi pet system.
A cleaned-up CUDA to ROCm/HIP portability reference with the rows I actually reached for most while writing AMD kernels.
What if you could generate images in one forward pass instead of fifty? Same UNet, same parameters, 57x faster. Trained on 8x H100s, benchmarked on a 3090.
The battery simulation community is stuck on CPU. Naive GPU ports are actually slower. Hand-tuned CUDA kernels achieve 89% of RTX 3090 peak bandwidth.
A custom CUDA megakernel for Qwen3-0.6B that fuses RMSNorm, QKV projection, RoPE, attention, and MLP into a single kernel launch - achieving 527 tok/s decode on RTX 3090.
Custom CUDA kernels that eliminate computational bottlenecks in spherical harmonics and tensor product operations - the core primitives of equivariant GNNs like MACE, NequIP, and Allegro.
Compressing Qwen3-30B-A3B from 6,144 to 1,698 experts while retaining 91.5% HumanEval performance - fitting a frontier-class MoE model into 18GB of VRAM.
Reproducing "Attention Is Not What You Need" (arXiv 2512.19428) reveals a 22.6% performance gap vs the claimed 10-15%. Includes custom CUDA kernels with 2x inference speedup.
AI is consuming energy at a rate that Earth's grids can barely sustain. I spent several days with Gemini modeling a 100-megawatt orbital compute cluster, designing a rig that lives in vacuum.
A deep dive into MiniMax M2.1, the 230B parameter sparse MoE model that activates only 10B parameters per token while achieving SOTA performance at 10% of Claude Sonnet's cost.
A comprehensive technical analysis of GLM-4.7, the 358B parameter Mixture-of-Experts model pushing the boundaries of coding, reasoning, and agentic AI capabilities.
A down-to-earth answer drawing on my experience with CUDA: where it has and hasn't brought me success, where the ecosystem is going, and how to play strategically around that.