← Back to blog

Running a 30B Parameter Model on a Single RTX 3090: REAP Compression Experiments

Compressing Qwen3-30B-A3B from 6,144 to 1,698 experts while retaining 91.5% HumanEval performance - fitting a frontier-class MoE model into 18GB of VRAM.

The Problem: Frontier Models Are Too Big

The latest generation of Mixture-of-Experts (MoE) models offer incredible performance per active parameter, but their total parameter counts make local deployment impractical. Qwen3-30B-A3B, for example, has 30 billion total parameters spread across 6,144 experts (128 per layer across 48 layers), but only activates 3 billion parameters per token.

The irony: you need ~35GB of VRAM to load a model that only uses 3B parameters at inference time. Most of those experts sit idle, waiting for their rare activation.

This is the fundamental inefficiency of MoE models - you pay the memory cost for all experts, but only a fraction activate for any given token. The router selects the top-k experts (typically 2-8) while the remaining 120+ experts per layer consume VRAM without contributing to the current forward pass.

The question I set out to answer: How many of those idle experts can we remove before the model breaks?

REAP: Router-weighted Expert Activation Pruning

REAP (Router-weighted Expert Activation Pruning) is a saliency-based pruning method from the paper "Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models" (arXiv:2404.02397). Unlike simple frequency-based pruning (removing rarely-used experts), REAP computes a composite saliency score that considers multiple factors:

The Saliency Score Components

  • Expert Activation Frequency - How often does the router select this expert across the calibration dataset? Frequently activated experts are likely important for the model's general capabilities.
  • Router Confidence - When selected, how strongly does the router prefer this expert? An expert that's selected with high confidence is more critical than one selected with marginal preference.
  • Hidden State Contribution - What is the cosine similarity between the expert's output and the final hidden state? This measures how much the expert's computation actually influences the layer's output.

The REAP score combines these factors to identify experts that are not just rarely used, but also contribute little when they are used. This catches edge cases that pure frequency-based pruning misses - like an expert that's rarely activated but critically important for specific reasoning patterns.

The Pruning Process

For each layer, REAP:

  1. Runs a calibration pass over a representative dataset (I used evol-codealpaca-v1)
  2. Accumulates saliency scores for each expert across all tokens
  3. Ranks experts by their composite saliency
  4. Removes the bottom N experts (determined by the target compression ratio)
  5. Adjusts the router weights to maintain proper probability distributions

The key insight is that calibration only needs to happen once. After computing saliency scores, you can experiment with different compression ratios without re-running the expensive calibration pass.

Per-Layer Pruning Schedules

Not all layers are created equal. Research on transformer interpretability suggests:

  • Early layers handle more general, input-processing features (tokenization artifacts, basic syntax)
  • Middle layers build intermediate representations and handle complex reasoning
  • Later layers specialize in output formatting and task-specific behaviors

This motivated experimenting with non-uniform compression schedules:

Schedule Types

  • Uniform: Remove the same percentage of experts from every layer. Simple but suboptimal.
  • Linear Decrease: Prune more heavily in early layers (e.g., 40% at layer 0), decreasing to lighter pruning in later layers (e.g., 10% at layer 47). The intuition: early layers have more redundancy.
  • Linear Increase: The opposite - preserve early layers and prune later layers more heavily. Less effective in my experiments.

The linear decrease schedule with spread 0.15-0.25 consistently outperformed uniform compression at the same overall ratio in my experiments.

Experimental Setup

Model Configuration

Base Model: Qwen3-30B-A3B

  • Architecture: Mixture-of-Experts with 48 MoE layers
  • Experts per layer: 128 routed experts
  • Total experts: 6,144 (128 x 48)
  • Active experts per token: 8 (selected by top-k routing)
  • Active parameters: ~3B per token
  • Total parameters: ~30B
  • Default VRAM requirement: ~35GB in BF16

Calibration Configuration

  • Dataset: theblackcat102/evol-codealpaca-v1 (1,024 samples)
  • Distance Measure: Cosine similarity for hidden state contribution scoring

Evaluation Benchmarks

EvalPlus (primary benchmark):

  • HumanEval: 164 hand-written Python programming problems
  • HumanEval+: Extended test cases for HumanEval
  • MBPP: 974 crowd-sourced Python programming problems
  • MBPP+: Extended test cases for MBPP

Results

Benchmark Performance at Different Compression Levels

REAP compression benchmark results showing HumanEval, HumanEval+, MBPP, and MBPP+ scores across compression levels
CompressionExpertsHumanEvalHumanEval+MBPPMBPP+
0% (Baseline)6,144TBDTBDTBDTBD
25% REAP4,60891.5%87.8%85.7%73.5%
50% REAP3,07289.0%84.8%80.4%68.5%

Note: Baseline evaluation encountered issues with Qwen3's thinking mode generating excessively long responses that caused the evaluation harness to hang.

Analyzing the Results

25% Compression (4,608 experts remaining):

  • HumanEval drops to 91.5% - still highly competitive
  • MBPP+ shows the largest relative drop
  • This level would reduce VRAM from ~35GB to ~27GB

50% Compression (3,072 experts remaining):

  • HumanEval remains strong at 89.0%
  • MBPP drops more noticeably to 80.4%
  • VRAM reduction to approximately ~20GB

The Degradation Pattern: HumanEval (more algorithmic) degrades slower than MBPP (more diverse). Extended test suites show larger drops - pruning reduces robustness to edge cases.

Fitting on RTX 3090: The 78% Compression Push

Memory reduction chart showing VRAM requirements from 35GB baseline to 17.78GB at 78% compression, with RTX 3090 24GB reference line

To fit Qwen3-30B-A3B on a single RTX 3090 (24GB VRAM), I pushed compression to 78%:

Schedule: linear_decrease
Spread: 0.15
Target ratio: 0.78

Memory Results

  • Original experts: 6,144
  • Experts removed: 4,446
  • Experts remaining: 1,698 (27.6% of original)
  • Measured GPU memory: 17.78 GB
  • RTX 3090 compatible: Yes, with ~6GB headroom for KV cache

Per-Layer Expert Distribution

  • Layer 0: ~93% compression (only ~9 experts remaining)
  • Layer 23 (middle): ~78% compression (~28 experts remaining)
  • Layer 47 (final): ~63% compression (~47 experts remaining)

The Reality: Stability Issues

This is where I need to be honest about the limitations. While the model loads and runs at 78% compression, I encountered significant stability issues:

  • Thinking Mode Loops: Qwen3's thinking mode became unstable. The model would enter infinite generation loops, producing endless reasoning without concluding.
  • Benchmark Failures: The baseline evaluation got stuck on the last MBPP sample (377/378) for over 10 minutes before I had to kill the process.
  • Quality Degradation: At 78% compression, qualitative testing showed noticeably degraded responses. The model still generates coherent text, but reasoning quality suffers.
  • No Formal Benchmarks at 78%: Due to stability issues, I was unable to complete EvalPlus benchmarks at this compression level.

Mitigations That Help

  • Disable thinking mode: Appending /no_think to prompts prevents infinite loops
  • Temperature tuning: Lower temperatures (0.3-0.5) produce more stable outputs
  • Max token limits: Strict max_new_tokens limits prevent runaway generation

Technical Implementation

Core Pruning Function

def prune_model_in_memory(model, observer_data, ratios_per_layer, prune_method="reap"):
    """Prune experts in-place based on REAP saliency scores."""
    model_attrs = MODEL_ATTRS[model.__class__.__name__]
    total_removed = 0
    
    for layer_idx in range(len(ratios_per_layer)):
        if layer_idx not in observer_data:
            continue
            
        layer_data = observer_data[layer_idx]
        num_experts = layer_data["expert_frequency"].shape[0]
        n_to_prune = int(num_experts * ratios_per_layer[layer_idx])
        
        if n_to_prune == 0:
            continue
        
        # Get saliency scores
        saliency = layer_data.get(prune_method) or layer_data.get("expert_frequency")
        
        # Identify experts to prune (lowest saliency)
        _, experts_to_prune = torch.topk(saliency, n_to_prune, largest=False)
        retained_indices = [i for i in range(num_experts) if i not in experts_to_prune]
        
        # Prune expert ModuleList
        moe = get_moe(model, layer_idx)
        all_experts = getattr(moe, model_attrs["experts"])
        retained_experts = [all_experts[i] for i in retained_indices]
        setattr(moe, model_attrs["experts"], nn.ModuleList(retained_experts))
        
        # Adjust router weights
        router = getattr(moe, model_attrs["router"])
        router.weight.data = router.weight.data[retained_indices, :]
        if getattr(router, "bias", None) is not None:
            router.bias.data = router.bias.data[retained_indices]
        router.out_features = len(retained_indices)
        
        total_removed += n_to_prune
    
    return total_removed

Handling Qwen3's Thinking Mode

# Method 1: Suffix approach
messages = [{"role": "user", "content": user_input + " /no_think"}]

# Method 2: Chat template configuration
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Disable thinking mode
)

Lessons Learned

What Worked

  • MoE models are highly compressible - The sparse activation pattern means most experts are redundant for any given input distribution.
  • Per-layer schedules outperform uniform - The linear decrease schedule consistently won.
  • Saliency beats frequency - Pure activation frequency misses important edge cases.
  • 25% compression is nearly free - HumanEval only drops ~2% absolute at 25% compression.

What Didn't Work

  • 78% compression has stability issues - While it fits on RTX 3090, quality degradation and thinking mode instability make it impractical.
  • Baseline benchmarks failed - Qwen3's thinking mode caused evaluation harness hangs.
  • No quality recovery at extreme compression - The capacity is simply gone, unlike quantization.

Practical Recommendations

For RTX 3090 Deployment (24GB)

  • Target 50% compression for balance of quality and memory
  • Use linear_decrease schedule with spread 0.15-0.20
  • Disable thinking mode to prevent stability issues
  • Set strict max_new_tokens limits
  • Expected VRAM: ~20GB with room for KV cache

For Quality-Critical Applications

Resources

December 2025

Thanks to the REAP paper authors for the saliency-based pruning methodology, and to the Qwen team for releasing a capable MoE model with permissive licensing.