The Physics of LLM Inference
Build your own LLM serving engine from scratch. Covers hardware-level optimization, memory management, custom CUDA and Triton kernel development, and throughput maximization. 113 pages + full code repository.
$5+
Implement reinforcement learning techniques for building reasoning capabilities into language models on a single GPU. Covers policy gradients, GRPO, think tokens, and memory-efficient training. 60% code, 40% prose.
$5
Learn GPT-2 pre-training on a single GPU. Covers tokenization, embeddings, attention mechanisms, transformer architecture, training loops, and optimization. Based on nanoGPT: minimal code, maximum understanding.
$5+
Your GPU is sitting 50% idle during inference. This book shows you exactly why and how to fix it with a single fused CUDA kernel. Covers kernel fusion, memory bandwidth bottlenecks, and building an end-to-end megakernel.
$5+
Stop translating NumPy tutorials into JAX by trial and error. This book gives you every core concept side-by-side with runnable code and real GPU output. Covers JAX fundamentals, transformations, and GPU acceleration.
$5+