Blog

Driftin: Single-Step Image Generation at 306 FPS

What if you could generate images in one forward pass instead of fifty? Same UNet, same parameters, 57x faster. Trained on 8x H100s, benchmarked on a 3090.

Dendrite: 133x Faster Battery Simulation with Hand-Tuned CUDA

The battery simulation community is stuck on CPU. Naive GPU ports are actually slower. Hand-tuned CUDA kernels achieve 89% of RTX 3090 peak bandwidth.

MegaQwen: 3.8x Faster LLM Inference by Fusing an Entire Transformer Block

A custom CUDA megakernel for Qwen3-0.6B that fuses RMSNorm, QKV projection, RoPE, attention, and MLP into a single kernel launch - achieving 527 tok/s decode on RTX 3090.

Batmobile: 10-20x Faster CUDA Kernels for Equivariant Graph Neural Networks

Custom CUDA kernels that eliminate computational bottlenecks in spherical harmonics and tensor product operations - the core primitives of equivariant GNNs like MACE, NequIP, and Allegro.

KernelBench v3: Rebuilding a GPU Kernel Benchmark from First Principles

How discovering the original KernelBench was exploitable led to building a focused, cost-effective benchmark for evaluating LLM kernel engineering on modern architectures.

Running a 30B Parameter Model on a Single RTX 3090: REAP Compression Experiments

Compressing Qwen3-30B-A3B from 6,144 to 1,698 experts while retaining 91.5% HumanEval performance - fitting a frontier-class MoE model into 18GB of VRAM.

Grassmann Flows: An Independent Reproduction Study

Reproducing "Attention Is Not What You Need" (arXiv 2512.19428) reveals a 22.6% performance gap vs the claimed 10-15%. Includes custom CUDA kernels with 2x inference speedup.

The Space-Native Data Center: A First Principles Design for 2030

AI is consuming energy at a rate that Earth's grids can barely sustain. I spent several days modeling a 100 Megawatt Orbital Compute Cluster with Gemini to design a rig that lives in the vacuum.

MiniMax M2.1: The Open-Source Coding Powerhouse That Rivals Closed-Source Giants

A deep dive into MiniMax M2.1, the 230B parameter sparse MoE model that activates only 10B parameters per token while achieving SOTA performance at 10% of Claude Sonnet's cost.

GLM-4.7: Z.ai's Frontier Agentic Reasoning Model

A comprehensive technical analysis of GLM-4.7, the 358B parameter Mixture-of-Experts model pushing the boundaries of coding, reasoning, and agentic AI capabilities.

Should I Learn CUDA?

A down-to-earth answer considering my experience with CUDA, what it has and hasn't brought me success in, where the ecosystem is going, and how to play strategically around that.