
Grassmann Flows for Sequence Modeling: An Independent Reproduction Study

An independent reproduction of "Attention Is Not What You Need" (arXiv 2512.19428) reveals a 22.6% performance gap - significantly larger than the paper's claimed 10-15%.

Abstract

I present an independent reproduction study of "Attention Is Not What You Need" (arXiv 2512.19428), which proposes replacing transformer self-attention with Grassmann manifold-based geometric operations using Plucker coordinates. The original paper claims Grassmann flow layers achieve perplexity "within 10-15% of size-matched Transformers" on Wikitext-2. My reproduction, using the exact architecture specified in the paper, reveals a 22.6% performance gap - significantly larger than claimed.

Introduction

The transformer architecture has dominated sequence modeling since 2017, with self-attention providing a powerful mechanism for capturing long-range dependencies. However, the quadratic complexity of attention with respect to sequence length has motivated an extensive search for alternatives. Recent years have seen the emergence of state space models (Mamba), linear recurrent units (RWKV), and various forms of linear attention.

Into this landscape comes a provocative proposal: replacing attention entirely with operations on Grassmann manifolds. The paper "Attention Is Not What You Need" argues that the geometric structure of Grassmann manifolds - specifically through Plucker coordinate embeddings - can capture the pairwise token interactions that attention provides, while maintaining linear complexity in sequence length.

I set out to reproduce these results exactly. What I found complicates the narrative.

Background: Grassmann Manifolds and Plucker Coordinates

What is a Grassmann Manifold?

The Grassmann manifold Gr(k, n) is the space of all k-dimensional linear subspaces of an n-dimensional vector space. Unlike Euclidean space, the Grassmann manifold has non-trivial curvature - it is a smooth, compact manifold where "points" are themselves subspaces rather than vectors.

Plucker Coordinates

Given two vectors u, v in R^r that span a 2-dimensional subspace, the Plucker coordinates are:

p_ij = u_i * v_j - u_j * v_i    for all i < j

This produces r(r-1)/2 coordinates - for r=32 (the paper's value), that's 496 Plucker coordinates per token pair. The key property: Plucker coordinates are antisymmetric (p_ij = -p_ji) and encode the "wedge product" structure - the signed area relationships between vector components.
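To make the definition concrete, here is a minimal PyTorch sketch (my own illustration, not code from the paper or its repository) that computes the Plucker coordinates of a vector pair and checks the antisymmetry property:

import torch

def plucker_coordinates(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Plucker coordinates p_ij = u_i * v_j - u_j * v_i for i < j.
    # u, v: shape (..., r) -> output: shape (..., r*(r-1)//2)
    outer = u.unsqueeze(-1) * v.unsqueeze(-2) - u.unsqueeze(-2) * v.unsqueeze(-1)
    r = u.shape[-1]
    i, j = torch.triu_indices(r, r, offset=1)   # strict upper triangle: i < j
    return outer[..., i, j]

u, v = torch.randn(32), torch.randn(32)
p = plucker_coordinates(u, v)
print(p.shape)                                        # torch.Size([496]) for r = 32
print(torch.allclose(p, -plucker_coordinates(v, u)))  # antisymmetry: True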

Intuition: Why Might This Work?

Where attention asks "how much should token t attend to token s?" via a dot-product similarity, Plucker coordinates ask "what is the geometric relationship between the subspaces defined by these token representations?" The antisymmetric structure means forward and backward relationships are explicitly different - causality is baked into the geometry.

The Original Paper's Claims

The paper makes several specific claims:

  • Performance: Perplexity "within 10-15% of size-matched Transformers" on Wikitext-2
  • Model Size: 13-18M parameter models
  • Reduced dimension: r = 32
  • Window sizes: {1, 2, 4, 8, 12, 16} for 6-layer models
  • Blend gating: alpha * h + (1-alpha) * g
  • Gate input: concatenation of [h; g] (sketched in code after this list)
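The blend gating in the last two claims can be sketched as follows. This is my own minimal reading of those two bullet points, not the paper's code: the module and attribute names are hypothetical, and the choice of a sigmoid with per-dimension gating is an assumption.

import torch
import torch.nn as nn

class BlendGate(nn.Module):
    # Sketch of blend gating: output = alpha * h + (1 - alpha) * g,
    # with alpha computed from the concatenation [h; g].
    def __init__(self, model_dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * model_dim, model_dim)

    def forward(self, h: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # h and g: the two streams being blended (e.g. hidden state and
        # Grassmann mixing output), both of shape (..., model_dim).
        alpha = torch.sigmoid(self.gate(torch.cat([h, g], dim=-1)))  # assumed sigmoid gate
        return alpha * h + (1.0 - alpha) * g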

Methodology: Exact Reproduction

Model Configurations

GrassmannGPTv4 (Paper Architecture):

Parameter           Value
-----------------   --------------------
Total Parameters    17,695,168 (17.70M)
model_dim           256
num_layers          6
reduced_dim (r)     32
window_sizes        [1, 2, 4, 8, 12, 16]
ff_dim              1024 (4x model_dim)

Size-Matched Transformer Baseline:

Parameter           Value
-----------------   --------------------
Total Parameters    17,670,400 (17.67M)
model_dim           256
num_layers          6
num_heads           8

Both models are within 0.14% of each other in parameter count - a fair comparison.
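The parameter totals are the standard PyTorch counts; a snippet of the kind used to obtain them (not necessarily the repository's exact code):

total_params = sum(p.numel() for p in model.parameters())   # model is a torch.nn.Module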

Training Configuration

  • Dataset: Wikitext-2 (wikitext-2-raw-v1)
  • Tokenizer: GPT2Tokenizer (vocab size: 50,257)
  • Optimizer: AdamW (lr=3e-4, weight_decay=0.01) - see the training-loop sketch after this list
  • Scheduler: CosineAnnealingLR
  • Epochs: 20
  • Batch size: 32
  • Hardware: NVIDIA H100 SXM5 80GB
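In code, this configuration amounts to roughly the following loop. It is a sketch, not the contents of train_wikitext2.py: `model` and `train_loader` are placeholders for the actual models and the Wikitext-2 dataloader, and details such as stepping the cosine schedule once per epoch are assumptions.

import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model, train_loader, epochs: int = 20, device: str = "cuda"):
    model.to(device)
    optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

    for epoch in range(epochs):
        model.train()
        for input_ids, targets in train_loader:   # batches of 32 GPT-2 token sequences
            input_ids, targets = input_ids.to(device), targets.to(device)
            logits = model(input_ids)             # (batch, seq_len, vocab_size)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                          # assumed: one scheduler step per epoch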

Results

Training Curves

[Figure: training curves (loss and perplexity per epoch) for the Grassmann and Transformer models.]

The training curves reveal distinct convergence behavior. The transformer achieves its best validation perplexity at epoch 13 (190.03), while the Grassmann model peaks at epoch 15 (236.05). Both begin overfitting thereafter, but at very different performance levels.

Final Comparison

Model         Parameters   Best Val PPL   Test PPL
-----------   ----------   ------------   --------
Grassmann     17.70M       236.05         242.94
Transformer   17.67M       190.03         198.17

[Figure: bar chart of Grassmann vs Transformer perplexity, showing the observed 22.6% gap against the paper's claimed 10-15%.]

Gap Analysis

Test PPL Ratio: 242.94 / 198.17 = 1.226
Observed Gap: 22.6%
Paper Claim: 10-15%
Discrepancy: 7.6 to 12.6 percentage points above the claimed range

The paper's claim is not reproduced. The observed 22.6% gap significantly exceeds the claimed 10-15%.

CUDA Kernel Optimization

As part of this reproduction, I implemented custom CUDA kernels for the Plucker coordinate computation - the computational bottleneck of Grassmann flows.

Performance Results (H100 80GB)

[Figure: bar chart of CUDA kernel speedups - 4.6x for the mixing layer, 2.0x for full-model inference.]

Metric                  PyTorch       CUDA          Speedup
--------------------    -----------   -----------   -------
Mixing layer forward    1.59 ms       0.35 ms       4.6x
Full model inference    9.16 ms       4.53 ms       2.0x
Inference throughput    0.45M tok/s   0.90M tok/s   2.0x
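Latencies like those above are typically measured with CUDA events rather than Python wall-clock timers. The harness below is a generic sketch of that measurement approach, not the repository's actual benchmarking script:

import torch

def benchmark_ms(fn, *args, warmup: int = 10, iters: int = 100) -> float:
    # Average GPU latency of fn(*args) in milliseconds, measured with CUDA events.
    for _ in range(warmup):                      # warm up kernels and caches
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()                     # wait for all queued kernels to finish
    return start.elapsed_time(end) / iters       # elapsed_time returns milliseconds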

Theoretical Discussion

Why Might Plucker Coordinates Struggle?

1. Fixed Geometric Operations: Attention computes input-dependent weights. Plucker coordinates perform a fixed geometric operation (the antisymmetric wedge product). The learning happens in projections, but the core mixing operation is predetermined.

2. The r(r-1)/2 Bottleneck: With r=32, the Plucker embedding has 496 dimensions, projected to model_dim=256 - substantial compression. Compare to attention where the full model_dim participates in key-query matching.

3. Window Averaging vs. Learned Aggregation: The paper averages Plucker features across window sizes. This treats delta=1 and delta=16 equally. Attention learns position-dependent and content-dependent weights. (Points 2 and 3 are sketched in code after this list.)

4. Antisymmetry May Not Match Language: Plucker coordinates satisfy p_ij = -p_ji. But linguistic relationships are not generally antisymmetric in this simple way.
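To make points 2 and 3 concrete, here is an illustrative mixing layer. It is not the paper's implementation: it assumes each token is paired with the token delta positions earlier for every window size, that the resulting Plucker features are averaged uniformly, and that a single linear map compresses the 496 Plucker dimensions down to model_dim=256.

import torch
import torch.nn as nn

class WindowedPluckerMixing(nn.Module):
    # Illustrative only: reduce to r dims, pair each token with the token
    # delta steps back for each window size, take Plucker features, average
    # them uniformly, and project 496 -> model_dim.
    def __init__(self, model_dim: int = 256, r: int = 32,
                 windows=(1, 2, 4, 8, 12, 16)):
        super().__init__()
        self.windows = windows
        self.reduce = nn.Linear(model_dim, r)                  # 256 -> 32
        self.expand = nn.Linear(r * (r - 1) // 2, model_dim)   # 496 -> 256 (the bottleneck)
        i, j = torch.triu_indices(r, r, offset=1)
        self.register_buffer("i", i)
        self.register_buffer("j", j)

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (batch, seq, model_dim)
        u = self.reduce(x)
        feats = []
        for delta in self.windows:
            # torch.roll wraps around at the sequence start; a real causal
            # implementation would mask or pad instead.
            v = torch.roll(u, shifts=delta, dims=1)
            outer = u.unsqueeze(-1) * v.unsqueeze(-2) - u.unsqueeze(-2) * v.unsqueeze(-1)
            feats.append(outer[..., self.i, self.j])            # (batch, seq, 496)
        # Uniform average: delta = 1 and delta = 16 get the same weight.
        return self.expand(torch.stack(feats, dim=0).mean(dim=0))

Note that every learned parameter in this sketch sits in the two linear projections; the pairwise mixing itself is the fixed wedge-product operation discussed in point 1.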

Comparison to Successful Alternatives

The architectures that successfully challenge attention share common properties:

  • Mamba/SSMs: Input-dependent state transitions (selectivity)
  • RWKV: Learned time-decay mechanisms
  • Linear Attention: Kernel approximations that preserve attention's structure

The successful alternatives preserve some form of input-dependent, learned aggregation. Grassmann flows take a different path: fixed geometric operations with learned projections.

Conclusions

  • Paper claims not reproduced: My 22.6% gap significantly exceeds the claimed 10-15%
  • Consistent underperformance: Across multiple configurations, Grassmann flows underperform transformers
  • Theoretical concerns: Fixed geometric operations may not provide the flexible, content-dependent mixing that language requires
  • CUDA optimization: 2x inference speedup achieved with custom kernels

Independent reproduction is essential. When a paper claims results that challenge established baselines, the community benefits from verification. In this case, the verification reveals a more nuanced picture: Grassmann flows are a creative idea that, in their current form, do not deliver on the stated performance claims.

Reproducibility

All code and results are available in the repository linked in the citation below; the headline comparison can be rerun with:

python train_wikitext2.py --model both --epochs 20 --model-dim 256 --num-layers 6

Hardware used: NVIDIA H100 SXM5 80GB (Voltage Park Cloud)

Citation

@article{arledge2025grassmann,
  title={Grassmann Flows for Sequence Modeling: An Independent Reproduction Study},
  author={Arledge, Elliot},
  year={2025},
  month={December},
  url={https://github.com/Infatoshi/grassmann-flows}
}

December 2025