CPU vs GPU FP4 GEMM at Small Model Shapes

A Ryzen 9 9950X3D can beat an RTX PRO 6000 Blackwell on tiny KV projections. The GPU then wins by more than two orders of magnitude once the projection becomes a real MLP.

I started this as a straightforward question: what is the state-of-the-art FP4 matrix multiply path on this workstation? The obvious dense GEMM answer is not subtle. An RTX PRO 6000 Blackwell running the available SM120 FP4 kernels can push over a petaflop per second on large batched shapes. A Zen 5 desktop CPU cannot compete with that.

But that is the wrong question for decode. In small 1-8B language models, many per-token operations are skinny matrix-vector or tiny matrix-matrix projections: attention Q/K/V/O, dense MLP up/down, and, in MoEs, per-expert MLPs that may only see a few routed tokens. If the batch is small enough, kernel launch, packing, and GPU fixed costs can dominate the arithmetic.

So I ran two sweeps. First, a controlled same-shape CPU-vs-GPU sweep across representative GEMM/GEMV shapes. Second, a model-derived sweep from Hugging Face configs for Qwen2.5 1.5B/3B/7B, Mistral 7B, OLMoE-1B-7B, and Qwen1.5-MoE-A2.7B. The code and setup guide are in github.com/Infatoshi/cpu-gemm-gemv.

Important caveat: the CPU path here is true MXFP4 through ik_llama.cpp. The broadly runnable SM120 GPU path available on this machine is NVFP4/block-scaled FP4 through FlashInfer CUTLASS and b12x. The sweep is shape-matched, not format-identical.

Hardware: Ryzen 9 9950X3D, RTX PRO 6000 Blackwell Workstation Edition, CUDA 13.2 driver stack. GPU timings use CUDA events, warmups, repeated samples, and L2 flushes. CPU timings use repeated ggml/ik_llama perf runs.

The large-shape baseline

On dense GEMM, the GPU does exactly what you expect. At M=4096, N=4096, K=5376, b12x NVFP4 measured 1219 TFLOP/s. The best CPU MXFP4 path measured 2.12 TFLOP/s on the same shape. That is a 576x gap.
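To put those throughput numbers in wall-clock terms, here is a small sketch that converts TFLOP/s back into time per multiply. It assumes the usual dense-GEMM convention of 2*M*N*K floating-point operations (one multiply and one add per inner-product term); the post does not state how the harness counts FLOPs, so treat that convention and the helper names as assumptions.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m x k) @ (k x n) product, counting multiply and add separately.

    Assumption: the conventional 2*M*N*K count, not confirmed by the benchmark code.
    """
    return 2 * m * n * k


def seconds_at(tflops: float, m: int, n: int, k: int) -> float:
    """Time for one GEMM of the given shape at a sustained throughput in TFLOP/s."""
    return gemm_flops(m, n, k) / (tflops * 1e12)


M, N, K = 4096, 4096, 5376
GPU_TFLOPS = 1219.0  # b12x NVFP4, as quoted above
CPU_TFLOPS = 2.12    # best CPU MXFP4 path, as quoted above

gpu_s = seconds_at(GPU_TFLOPS, M, N, K)
cpu_s = seconds_at(CPU_TFLOPS, M, N, K)
ratio = cpu_s / gpu_s

print(f"GPU: {gpu_s * 1e6:.0f} us per GEMM")
print(f"CPU: {cpu_s * 1e3:.1f} ms per GEMM")
print(f"GPU/CPU speedup from rounded inputs: {ratio:.0f}x")
```

With the rounded throughputs quoted above, the ratio lands within about one unit of the reported 576x; the small difference is what you would expect from rounding the inputs.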

CPU vs GPU FP4 GEMM throughput — Shape-matched GEMM sweep. The dense path is not where a desktop CPU catches Blackwell.

The GEMV-like shapes are less extreme but still usually favor the GPU. The GPU wins by 5.8x to 25.9x in this hand-picked set. That already suggests a boundary: for projection sizes in the thousands and a few tokens, the CPU is not absurdly far away.

CPU vs GPU FP4 GEMV-like throughput — The GEMV-like cases are the interesting ones. The GPU still wins, but not by the dense-GEMM margin.

GPU over CPU speedup for the first sweep — GPU/CPU speedup over the controlled benchmark shapes.

Deriving real model shapes

I pulled public config.json files and derived the projection shapes. For dense models this means: attention Q, KV, fused QKV, O, MLP up/gate, and MLP down. For MoE models, I also derived average per-expert token counts from tokens * top_k / num_experts, rounded up, plus shared-expert projections where the config has them.

Model	Hidden	MLP / Expert	MoE
Qwen2.5-1.5B	1536	8960	dense
Qwen2.5-3B	2048	11008	dense
Qwen2.5-7B	3584	18944	dense
Mistral-7B-v0.3	4096	14336	dense
OLMoE-1B-7B	2048	1024 expert	64 experts, top-8
Qwen1.5-MoE-A2.7B	2048	1408 expert, 5632 shared	60 experts, top-4
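As a rough illustration of the dense-model derivation, the sketch below turns a Hugging Face style config dict into (M, N, K) shapes, with M the token count, N the output features, and K the input features. The field names (hidden_size, intermediate_size, num_attention_heads, num_key_value_heads) are the ones these model families use in config.json, but the head counts in the example dict are filled in by hand and should be checked against the actual file. Apart from attn_kv_each, the op names are hypothetical labels, not necessarily the ones the sweep script uses.

```python
def dense_projection_shapes(cfg: dict, tokens: int) -> dict:
    """Return {op_name: (M, N, K)} for the main projections of a dense decoder layer."""
    hidden = cfg["hidden_size"]
    heads = cfg["num_attention_heads"]
    # Grouped-query attention: fewer KV heads than query heads.
    kv_heads = cfg.get("num_key_value_heads", heads)
    head_dim = cfg.get("head_dim", hidden // heads)
    q_out = heads * head_dim
    kv_out = kv_heads * head_dim
    inter = cfg["intermediate_size"]
    return {
        "attn_q": (tokens, q_out, hidden),
        "attn_kv_each": (tokens, kv_out, hidden),  # K and V separately
        "attn_qkv_fused": (tokens, q_out + 2 * kv_out, hidden),
        "attn_o": (tokens, hidden, q_out),
        "mlp_up_gate_each": (tokens, inter, hidden),  # up and gate separately
        "mlp_down": (tokens, hidden, inter),
    }


# Hand-entered example; hidden and MLP sizes match the table above,
# head counts are an assumption to verify against the real config.json.
qwen25_1p5b = {
    "hidden_size": 1536,
    "intermediate_size": 8960,
    "num_attention_heads": 12,
    "num_key_value_heads": 2,
}

shapes = dense_projection_shapes(qwen25_1p5b, tokens=1)
for op, (m, n, k) in shapes.items():
    print(f"{op:18s} M={m} N={n} K={k}")
```

With two KV heads of dimension 128, each KV projection has only 256 output features, which is the skinny shape that shows up in the crossover results.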

Where the crossover appears

The model-derived sweep is the interesting one. At one token, the median GPU/CPU speedup is only 5.4x. The minimum is 0.6x, meaning the CPU wins on that shape. At two tokens, the minimum is still only 0.7x. By eight tokens the GPU wins every shape in the sweep, but the median is still just 10.9x. By 128 tokens, the median jumps to 46.6x and the best MLP case reaches 259x.

source tokens | median GPU/CPU | min  | max
─────────────────────────────────────────────
1             | 5.4x           | 0.6x | 24.7x
2             | 5.9x           | 0.7x | 26.4x
4             | 7.2x           | 0.8x | 33.5x
8             | 10.9x          | 1.2x | 54.4x
128           | 46.6x          | 5.5x | 259.3x

GPU over CPU speedup scatter plot over model-derived shapes — Each point is a projection derived from an actual 1-8B dense or MoE config. Small projections flatten the GPU advantage.

The CPU actually beats the GPU on Qwen2.5's tiny KV projections:

model          op            tokens  M  N    K     GPU TF/s  CPU TF/s
────────────────────────────────────────────────────────────────────
Qwen2.5-1.5B   attn_kv_each  1       1  256  1536  0.074     0.122
Qwen2.5-1.5B   attn_kv_each  2       2  256  1536  0.159     0.231
Qwen2.5-3B     attn_kv_each  1       1  256  2048  0.088     0.127
Qwen2.5-3B     attn_kv_each  2       2  256  2048  0.168     0.236
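The sub-1x minima in the summary table can be reproduced from these four rows. The snippet below divides GPU throughput by CPU throughput for each row and takes the minimum per token count; the row values are copied from the table, and the variable names are illustrative only.

```python
# (model, tokens, GPU TF/s, CPU TF/s), copied from the table above
rows = [
    ("Qwen2.5-1.5B", 1, 0.074, 0.122),
    ("Qwen2.5-1.5B", 2, 0.159, 0.231),
    ("Qwen2.5-3B", 1, 0.088, 0.127),
    ("Qwen2.5-3B", 2, 0.168, 0.236),
]

# GPU/CPU speedup per row; values below 1.0 mean the CPU was faster.
speedups = [(model, tokens, gpu / cpu) for model, tokens, gpu, cpu in rows]

for model, tokens, s in speedups:
    print(f"{model:13s} tokens={tokens}  GPU/CPU = {s:.2f}x")

# Smallest speedup seen at each token count among these rows.
min_by_tokens = {}
for _, tokens, s in speedups:
    min_by_tokens[tokens] = min(s, min_by_tokens.get(tokens, float("inf")))

print({t: round(s, 1) for t, s in min_by_tokens.items()})
```

All four ratios come out below 1, and the per-token minima round to the 0.6x and 0.7x reported for one and two tokens.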

That does not mean “run the transformer on the CPU.” It means that for small decode, a projection-by-projection runtime has a lot of overhead hiding under the FLOP/s headline. A CPU kernel with no device launch and no packing pipeline can be competitive on the tiniest shapes.

Dense models: attention vs MLP

Dense attention projections are mixed. Q and O are square-ish and become GPU-friendly with enough tokens. KV projections in grouped-query models can be very skinny. MLPs, on the other hand, quickly become classic large GEMMs.

Dense model attention best throughput at source tokens 128 — Dense attention at T=128: the GPU is firmly ahead, but the smaller KV shapes explain the weak T=1/T=2 behavior.

Dense model MLP best throughput at source tokens 128 — Dense MLP projections become real GEMMs. Qwen2.5-7B MLP up/gate reaches 524.8 TFLOP/s on the GPU, 259x over the CPU path.

MoE makes the small-shape problem worse

MoE expert routing shrinks the per-expert matrix shape. OLMoE has 64 experts and top-8 routing, so with one source token the average expert sees ceil(1 * 8 / 64) = 1 token. Qwen1.5-MoE has 60 experts and top-4 routing, so it stays around one token per expert until the source batch grows substantially. This is the regime where GPU utilization is easiest to lose.
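A quick way to see how slowly the per-expert batch grows is to evaluate the rounded-up average for the token counts used in the sweep. This is a sketch of the formula as stated in the text, assuming uniform routing across experts; the function name is made up for illustration, and real routers are usually less balanced than this.

```python
def expert_tokens(tokens: int, top_k: int, num_experts: int) -> int:
    """Average tokens per expert, rounded up: ceil(tokens * top_k / num_experts).

    Uses integer arithmetic to avoid floating-point rounding in the ceiling.
    """
    return -(-(tokens * top_k) // num_experts)


# (top_k, num_experts) taken from the model table in this post
moe_models = {
    "OLMoE-1B-7B": (8, 64),
    "Qwen1.5-MoE-A2.7B": (4, 60),
}

per_expert = {
    name: {t: expert_tokens(t, top_k, n_exp) for t in (1, 2, 4, 8, 128)}
    for name, (top_k, n_exp) in moe_models.items()
}

for name, counts in per_expert.items():
    print(name, counts)
```

Under this averaging, both models sit at one token per expert for every source batch from 1 through 8, so the expert MLP stays a GEMV-like shape across the whole low-batch decode range.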

MoE model attention best throughput at source tokens 128 — MoE attention projections are ordinary dense projections, but the model dimensions are smaller than many 7B dense baselines.

MoE model MLP best throughput at source tokens 128 — MoE expert MLPs are small. Shared experts behave more like dense MLPs.

What I would take away

Dense MLPs belong on the GPU. Even at modest token counts, they turn into enough work to amortize GPU fixed costs.
Tiny KV and per-expert MoE projections are different. At one or two routed tokens, the CPU can be within the same order of magnitude and occasionally faster in this benchmark setup.
The useful boundary is shape-specific. A runtime that blindly treats every projection as a GEMM misses the small-shape regime.
Format parity still matters. I want a packaged SM120 MXFP4 GEMV path before treating this as a final architectural result.

The practical next experiment is not “CPU vs GPU” in the abstract. It is a scheduler: keep large MLPs and batched QKV on the GPU, but consider CPU execution or CPU-assisted packing for the smallest KV/expert projections during low-batch decode. That is where the result points.
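To make the scheduler idea concrete, here is a deliberately naive sketch of shape-based dispatch. Everything in it is hypothetical: the DispatchPolicy class, the choose_device function, and especially the thresholds, which are placeholders loosely shaped by the rows where the CPU won (one or two tokens, N=256, K up to 2048) and not measured crossover points. It also ignores the cost of moving activations between host and device, which a real scheduler would have to account for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DispatchPolicy:
    # Placeholder thresholds, NOT measured crossovers.
    max_cpu_tokens: int = 2                   # only consider CPU at very low batch
    max_cpu_weight_elems: int = 256 * 2048    # only for very small weight matrices


def choose_device(m: int, n: int, k: int,
                  policy: DispatchPolicy = DispatchPolicy()) -> str:
    """Pick 'cpu' or 'gpu' for an (M, N, K) projection using simple shape thresholds."""
    small_batch = m <= policy.max_cpu_tokens
    small_weight = n * k <= policy.max_cpu_weight_elems
    return "cpu" if (small_batch and small_weight) else "gpu"


# Example shapes taken from the tables in this post.
examples = {
    "Qwen2.5-1.5B attn_kv_each, T=1": (1, 256, 1536),
    "Qwen2.5-3B attn_kv_each, T=2": (2, 256, 2048),
    "Qwen2.5-7B MLP up/gate, T=1": (1, 18944, 3584),
    "Qwen2.5-7B MLP up/gate, T=128": (128, 18944, 3584),
}

for name, (m, n, k) in examples.items():
    print(f"{name:32s} -> {choose_device(m, n, k)}")
```

A more honest version would replace the fixed thresholds with a lookup built from the measured per-shape results, so the policy follows the data instead of a guess.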

Reproducibility

The standalone repo includes the sweep scripts, setup notes, result schema, and instructions for building the CPU benchmark harness:

git clone https://github.com/Infatoshi/cpu-gemm-gemv
cd cpu-gemm-gemv
uv run --no-project --python /home/infatoshi/cpu/.venv/bin/python \
  python scripts/run_model_shape_sweep.py --out-dir results/model_shape_sweep

Raw result JSON is also mirrored with this post: controlled sweep and model-shape sweep.