This is a blog version of my slides on the topic of linear attention. The slides were presented during our team's weekly group meeting/study group. The storyline is heavily inspired by [1].

Motivation

Attention and recurrent modules can be seen as token-mixing modules. They blend signals across the sequence domain, in contrast to FFNs or MLPs, which operate within the scope of a single token.

However, softmax attention and recurrent modules come with their own strengths and weaknesses.

|             | Recurrent  | Softmax Attention | Linear Attention |
|-------------|------------|-------------------|------------------|
| Training    | Sequential | Parallelizable    | Parallelizable?  |
| Inference   | Linear     | Quadratic         | Linear?          |
| Performance | OK?        | Strong            | ?                |

Linear attention tries to combine the best of both worlds [2].

Softmax Attention

  • Inference:
    • Each step is $o_t = \sum_{s \le t} \frac{\exp(q_t^\top k_s)}{\sum_{s' \le t} \exp(q_t^\top k_{s'})}\, v_s$, i.e. $O(td)$ at step $t$, with a KV cache that grows with $t$.
    • Overall complexity is $O(L^2 d)$ when the sequence length is $L$.
  • Training: can be parallelized with the matrix form $O = \operatorname{softmax}(QK^\top + M)\,V$. Causality is realized by the causal mask $M$ ($-\infty$ above the diagonal).
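To make the two views concrete, here is a minimal single-head PyTorch sketch (function names and shapes are my own): the parallel form is the training view, the step function is the KV-cache inference view.

```python
import torch
import torch.nn.functional as F

def softmax_attn_parallel(q, k, v):
    """Training view: one masked matmul pair, O(L^2 d)."""
    L, d = q.shape
    scores = (q @ k.T) / d**0.5                       # (L, L)
    mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # causality
    return F.softmax(scores, dim=-1) @ v              # (L, d)

def softmax_attn_step(q_t, K_cache, V_cache):
    """Inference view: step t attends to the whole KV cache, O(t d)."""
    scores = (K_cache @ q_t) / K_cache.shape[-1]**0.5  # (t,)
    return F.softmax(scores, dim=-1) @ V_cache         # (d,)

L, d = 8, 16
q, k, v = (torch.randn(L, d) for _ in range(3))
o_par = softmax_attn_parallel(q, k, v)
o_seq = torch.stack([softmax_attn_step(q[t], k[:t+1], v[:t+1]) for t in range(L)])
assert torch.allclose(o_par, o_seq, atol=1e-5)
```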

Linear Attention: Variants and Perspectives

Vanilla Linear Attention

  • The bottleneck is the $\exp(\cdot)$ inside the softmax: can we remove it?
  • Replace $\exp(q_t^\top k_s)$ with the kernel trick [3]: $\exp(q_t^\top k_s) \approx \phi(q_t)^\top \phi(k_s)$, where $\phi(\cdot)$ is a feature map.
  • Recent works [4][5] discovered that a linear kernel ($\phi(x) = x$) and no normalizer work well in practice.
  • During inference: $S_t = S_{t-1} + v_t k_t^\top$, $o_t = S_t q_t$.
    • The state $S_t$ is now $O(d^2)$ space (constant in $t$).
    • Each token update is $O(d^2)$ time.
  • We have solved the quadratic complexity issue!
  • What about training?
    • Ideally we’d like $O = Q(K^\top V)$ in $O(Ld^2)$; ❌ due to the causal mask, $O = \big((QK^\top) \odot M\big)V$ cannot be reassociated.
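A minimal sketch of the two equivalent views of (unnormalized) vanilla linear attention, assuming a single head with $d_k = d_v = d$ (names are mine): the recurrent form is the $O(d^2)$-per-token inference view, the masked parallel form is what we would like to avoid materializing during training.

```python
import torch

def linear_attn_recurrent(q, k, v):
    """Recurrent/inference view: constant-size state S (d_v x d_k), O(d^2) per token."""
    L, d = q.shape
    S = torch.zeros(d, d)                 # memory: running sum of v_s k_s^T
    outs = []
    for t in range(L):
        S = S + torch.outer(v[t], k[t])   # S_t = S_{t-1} + v_t k_t^T
        outs.append(S @ q[t])             # o_t = S_t q_t
    return torch.stack(outs)

def linear_attn_parallel(q, k, v):
    """Parallel/training view: ((Q K^T) ⊙ M) V, still O(L^2 d) because of the mask."""
    L, _ = q.shape
    mask = torch.tril(torch.ones(L, L))
    return ((q @ k.T) * mask) @ v

L, d = 8, 16
q, k, v = (torch.randn(L, d) for _ in range(3))
assert torch.allclose(linear_attn_recurrent(q, k, v),
                      linear_attn_parallel(q, k, v), atol=1e-4)
```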

Chunkwise Parallel Form

The chunkwise parallel form aims to strike a balance between the parallel and recurrent forms: subquadratic, partially parallel training.

  1. The input is split into non-overlapping chunks of length $C$. The number of chunks is $L/C$.

  2. For each chunk $i$, calculate the chunk-level state

     $S_{[i]} = S_{[i-1]} + \sum_{t \in \text{chunk } i} v_t k_t^\top = S_{[i-1]} + V_{[i]}^\top K_{[i]}$

     The per-chunk sums $V_{[i]}^\top K_{[i]}$ can be calculated in $O(Cd^2)$ each, in parallel across chunks.

  3. Similarly, calculate the output of chunk $i$ as an inter-chunk part (read from $S_{[i-1]}$) plus a masked intra-chunk part:

     $O_{[i]} = Q_{[i]} S_{[i-1]}^\top + \big((Q_{[i]} K_{[i]}^\top) \odot M\big) V_{[i]}$

Training complexity is thus $O(LCd + Ld^2)$, which is subquadratic in $L$ when $C \ll L$.
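A minimal sketch of the chunkwise form under the same conventions as above (chunk length and tensor names are mine); it checks against the fully recurrent form.

```python
import torch

def linear_attn_chunkwise(q, k, v, C=4):
    """Chunkwise form: inter-chunk part reads the chunk-level state S,
    intra-chunk part is a small C x C masked matmul. O(L C d + L d^2) total."""
    L, d = q.shape
    S = torch.zeros(d, d)                      # running state (d_v x d_k)
    mask = torch.tril(torch.ones(C, C))
    outs = []
    for i in range(0, L, C):
        Qi, Ki, Vi = q[i:i+C], k[i:i+C], v[i:i+C]
        inter = Qi @ S.T                                         # contribution of previous chunks
        intra = ((Qi @ Ki.T) * mask[:len(Qi), :len(Qi)]) @ Vi    # within-chunk, causal
        outs.append(inter + intra)
        S = S + Vi.T @ Ki                      # S_[i] = S_[i-1] + V_[i]^T K_[i]
    return torch.cat(outs)

L, d = 16, 8
q, k, v = (torch.randn(L, d) for _ in range(3))
# should match the fully recurrent form
S, ref = torch.zeros(d, d), []
for t in range(L):
    S = S + torch.outer(v[t], k[t]); ref.append(S @ q[t])
assert torch.allclose(linear_attn_chunkwise(q, k, v), torch.stack(ref), atol=1e-4)
```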

Problems of Vanilla Linear Attention [2]

  • Performance
    • Linear attention is basically a cumulative sum of outer products $v_t k_t^\top$, with no mechanism to forget.
    • The state/memory size is constant.
    • When the sequence length $L$ is long enough ($L > d$), the rank of $K$ is less than $L$, so keys cannot be mutually orthogonal, leading to key collisions.
  • Training Efficiency
    • The chunkwise parallel form is required for (partially) parallel training.

RetNet [2]

  • Motivation:
    • Vanilla linear attention performs worse than softmax attention.
    • Forget gates have been shown to be crucial in RNNs, but RNN training is sequential.
  • Methodology (see the sketch below):
    • Introduced a data-independent, scalar forgetting factor $\gamma \in (0, 1)$: $S_t = \gamma S_{t-1} + v_t k_t^\top$.
    • Derived both the recurrent and the chunkwise parallel form.
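A minimal sketch of the retention recurrence with a data-independent scalar decay (RetNet's rotation/xPos component, multi-scale decay across heads, and normalization are all omitted; names are mine). The equivalent parallel form simply decays attention weights by $\gamma^{t-s}$ under the causal mask.

```python
import torch

def retention_recurrent(q, k, v, gamma=0.9):
    """RetNet-style recurrence: S_t = gamma * S_{t-1} + v_t k_t^T, o_t = S_t q_t."""
    L, d = q.shape
    S = torch.zeros(d, d)
    outs = []
    for t in range(L):
        S = gamma * S + torch.outer(v[t], k[t])
        outs.append(S @ q[t])
    return torch.stack(outs)

def retention_parallel(q, k, v, gamma=0.9):
    """Parallel form: weights decay as gamma^(t-s) under the causal mask."""
    L, _ = q.shape
    idx = torch.arange(L, dtype=torch.float32)
    D = (gamma ** (idx[:, None] - idx[None, :])) * (idx[:, None] >= idx[None, :])
    return ((q @ k.T) * D) @ v

L, d = 8, 16
q, k, v = (torch.randn(L, d) for _ in range(3))
assert torch.allclose(retention_recurrent(q, k, v), retention_parallel(q, k, v), atol=1e-4)
```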

The Mamba Series [6][7]

  • Started from the State Space Model (SSM): $h_t = \bar{A} h_{t-1} + \bar{B} x_t$, $y_t = C h_t$.
  • Mamba: makes the SSM parameters ($\bar{B}_t$, $C_t$, and the step size $\Delta_t$) data-dependent ("selective").
    • However, training cannot be parallelized with matmul-friendly forms (Mamba relies on a custom scan kernel).
  • Mamba-2: simplified the state transition to a scalar decay $a_t$ to improve training efficiency (see the sketch below).
    • Different from RetNet, $a_t$ is data-dependent.
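In the same $S_t$ notation as the rest of this post, the Mamba-2-style simplification can be sketched as a recurrence with a data-dependent scalar decay per step. This leans on the SSD/linear-attention duality; the actual Mamba-2 parameterization, discretization, and multi-head details are omitted, and the names here are mine.

```python
import torch

def mamba2_style_recurrence(q, k, v, a):
    """Scalar, data-dependent decay a_t in (0, 1) per step:
    S_t = a_t * S_{t-1} + v_t k_t^T, o_t = S_t q_t."""
    L, d = q.shape
    S = torch.zeros(d, d)
    outs = []
    for t in range(L):
        S = a[t] * S + torch.outer(v[t], k[t])
        outs.append(S @ q[t])
    return torch.stack(outs)

L, d = 8, 16
q, k, v = (torch.randn(L, d) for _ in range(3))
a = torch.sigmoid(torch.randn(L))   # data-dependent in a real model; random here for shape only
print(mamba2_style_recurrence(q, k, v, a).shape)   # torch.Size([8, 16])
```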

GLA [8]

  • Motivation: effectively and efficiently parametrize the forgetting gate.
    • RetNet: data-independent scalar.
    • Mamba: data-dependent, but training is not (matmul-)parallelizable and is hardware-inefficient.
    • Mamba-2: parallelizable training, but the gate is still a scalar.
  • Methodology (see the sketch below):
    • The forgetting gate is now column-wise instead of global: a data-dependent vector $\alpha_t \in (0,1)^{d_k}$ with $S_t = S_{t-1}\operatorname{diag}(\alpha_t) + v_t k_t^\top$.
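A minimal sketch of the GLA-style recurrence with a column-wise (per key dimension), data-dependent gate; the paper's normalization, heads, and hardware-efficient chunkwise form are omitted, and names are mine.

```python
import torch

def gla_recurrent(q, k, v, alpha):
    """GLA-style recurrence: S_t = S_{t-1} diag(alpha_t) + v_t k_t^T, o_t = S_t q_t."""
    L, d = q.shape
    S = torch.zeros(d, d)                  # (d_v, d_k)
    outs = []
    for t in range(L):
        S = S * alpha[t].unsqueeze(0) + torch.outer(v[t], k[t])  # scale column j by alpha_t[j]
        outs.append(S @ q[t])
    return torch.stack(outs)

L, d = 8, 16
q, k, v = (torch.randn(L, d) for _ in range(3))
alpha = torch.sigmoid(torch.randn(L, d))   # would come from a projection of x_t in a real model
print(gla_recurrent(q, k, v, alpha).shape)  # torch.Size([8, 16])
```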

DeltaNet [9]

Motivation: forgetting factor design from a fast weight programming perspective.

  • Original recurrent module: $S_t = S_{t-1} + v_t k_t^\top$, $o_t = S_t q_t$.
  • Formulate it as a deep-learning problem:
    • Model: $\hat{v} = S k$, where $S$ is the model weights.
    • Training: at step $t$, "train" the weights on the new sample $(k_t, v_t)$.
    • Inference: query the memory with $q_t$, i.e. $o_t = S_t q_t$.
    • Weight update rule: $S_t = S_{t-1} + v_t k_t^\top$. What does this mean?
      • Learning rate is a constant 1.
      • It is one gradient step on $\mathcal{L}_t(S) = -\langle S k_t, v_t \rangle$, i.e. maximize the inner product between the retrieved value $S k_t$ and the target $v_t$.

💡 Any better ways to do this?

  • Model: $\hat{v} = S k$.
  • Loss function: $\mathcal{L}_t(S) = \frac{1}{2}\lVert S k_t - v_t \rVert^2$.
  • Gradient: $\nabla_S \mathcal{L}_t(S_{t-1}) = (S_{t-1} k_t - v_t)\, k_t^\top$.
  • Update rule: $S_t = S_{t-1} - \beta_t (S_{t-1} k_t - v_t) k_t^\top = S_{t-1}\big(I - \beta_t k_t k_t^\top\big) + \beta_t v_t k_t^\top$.

And there we have DeltaNet.

Note that the convolutions in the image are depthwise separable convolutions.
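Putting the derivation above into code, here is a minimal sketch of the delta-rule recurrence (single head; unit-normalized keys are an assumption of mine, and the paper's chunkwise parallelization is omitted).

```python
import torch

def deltanet_recurrent(q, k, v, beta):
    """Delta rule: one SGD step per token on L(S) = 1/2 ||S k_t - v_t||^2
    with learning rate beta_t, i.e. S_t = S_{t-1} - beta_t (S_{t-1} k_t - v_t) k_t^T."""
    L, d = q.shape
    S = torch.zeros(d, d)
    outs = []
    for t in range(L):
        k_t = k[t] / k[t].norm()                          # assumption: unit-norm keys
        pred = S @ k_t                                    # old value retrieved by the key
        S = S - beta[t] * torch.outer(pred - v[t], k_t)   # delta-rule update
        outs.append(S @ q[t])
    return torch.stack(outs)

L, d = 8, 16
q, k, v = (torch.randn(L, d) for _ in range(3))
beta = torch.sigmoid(torch.randn(L))   # beta_t in (0, 1), data-dependent in the real model
print(deltanet_recurrent(q, k, v, beta).shape)   # torch.Size([8, 16])
```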

Alternative Interpretation: Key-Value Retrieval

The Delta Rule:

  1. Retrieves the old value $v_t^{\text{old}} = S_{t-1} k_t$ using the current key $k_t$.
  2. Obtains a new value $v_t^{\text{new}} = (1 - \beta_t)\, v_t^{\text{old}} + \beta_t v_t$ by interpolating between the old value and the current value $v_t$, and writes it back in place of the old one: $S_t = S_{t-1} - v_t^{\text{old}} k_t^\top + v_t^{\text{new}} k_t^\top$.

Gated DeltaNet [10]

  • DeltaNet: precise modification of a single key-value pair, but no way to quickly erase a lot of stale context.
  • Adds the forgetting gate from Mamba-2 to allow rapid context cleanup: $S_t = \alpha_t S_{t-1}\big(I - \beta_t k_t k_t^\top\big) + \beta_t v_t k_t^\top$.
  • Given the update rule, the corresponding loss function can be derived as the delta-rule regression loss $\frac{1}{2}\lVert S k_t - v_t \rVert^2$ plus an L2 weight-decay term on $S$ whose strength is controlled by the gate $\alpha_t$: first decay the memory, then apply the delta-rule write (see the sketch below).
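A minimal sketch of the gated delta rule, written as decay-then-write to match the reading above (single head; names and the unit-norm key assumption are mine).

```python
import torch

def gated_deltanet_recurrent(q, k, v, alpha, beta):
    """Gated delta rule: decay the memory by alpha_t, then apply the delta-rule write:
    S_t = alpha_t S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T."""
    L, d = q.shape
    S = torch.zeros(d, d)
    outs = []
    for t in range(L):
        k_t = k[t] / k[t].norm()
        S = alpha[t] * S                                    # forget (weight decay)
        S = S - beta[t] * torch.outer(S @ k_t - v[t], k_t)  # delta-rule write
        outs.append(S @ q[t])
    return torch.stack(outs)

L, d = 8, 16
q, k, v = (torch.randn(L, d) for _ in range(3))
alpha, beta = torch.sigmoid(torch.randn(L)), torch.sigmoid(torch.randn(L))
print(gated_deltanet_recurrent(q, k, v, alpha, beta).shape)   # torch.Size([8, 16])
```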

JetBlock & Jet-Nemotron [11]

Based on Gated DeltaNet (a hedged sketch follows the list):

  • Removes the convolution on query and key.
  • Adds a dynamic convolution on value:
    • its kernels are produced by a low-rank kernel generator with a SiLU activation.
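Since I have not reproduced the exact JetBlock design, the following is only a hedged sketch of the idea: kernels for a causal depthwise convolution over the value stream are generated on the fly by a low-rank projection with SiLU. The module name, the pooling choice, and the shapes are my assumptions, not the paper's specification.

```python
import torch
import torch.nn.functional as F

class DynamicValueConv(torch.nn.Module):
    """Sketch: dynamic depthwise causal conv on values, kernels from a low-rank SiLU generator."""
    def __init__(self, d_model, kernel_size=4, rank=16):
        super().__init__()
        self.kernel_size = kernel_size
        self.down = torch.nn.Linear(d_model, rank)                 # low-rank kernel generator
        self.up = torch.nn.Linear(rank, d_model * kernel_size)

    def forward(self, x, v):
        # x, v: (batch, L, d_model)
        B, L, D = v.shape
        feat = x.mean(dim=1)                                       # assumption: pooled summary of x
        kernels = self.up(F.silu(self.down(feat)))                 # (B, D * kernel_size)
        kernels = kernels.view(B * D, 1, self.kernel_size)
        v_pad = F.pad(v.transpose(1, 2), (self.kernel_size - 1, 0))  # causal left-padding
        out = F.conv1d(v_pad.reshape(1, B * D, -1), kernels, groups=B * D)
        return out.view(B, D, L).transpose(1, 2)

block = DynamicValueConv(d_model=32)
x = torch.randn(2, 10, 32)
print(block(x, x).shape)   # torch.Size([2, 10, 32])
```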

Linear Attentions in the Wild

| Model              | Attention                         | Hybrid Ratio | Size          | Date    |
|--------------------|-----------------------------------|--------------|---------------|---------|
| MiniMax-M1         | Lightning Attention (GLA)         | 7:1          | 456B (A46B)   | 2025.06 |
| Qwen3-Next-80B-A3B | GDN (& Gated Attention)           | 3:1          | 80B (A3B)     | 2025.09 |
| IBM Granite-4.0    | Mamba-2                           | 9:1          | 3B~32B (A3B)  | 2025.10 |
| Kimi Linear        | Kimi Delta Attention (GDN + GLA)  | 3:1          | 48B (A3B)     | 2025.10 |
| DeepSeek-V3.2-Exp  | DeepSeek Sparse Attention         | –            | 671B (A37B)   | 2025.09 |

Notes on DeepSeek Sparse Attention (DSA) [12]

  • DSA is not linear attention.
  • It is more like full softmax attention with an additional top-k operator that selects only the token pairs worth attending to.
  • The importance of a token pair is determined by a lightning indexer.
  • Given that the indexer head dimension is small, the cost of the lightning indexer is small compared to full attention. Meanwhile, the complexity of the attention part is reduced from $O(L^2)$ to $O(Lk)$, where $k$ is `index_topk`.
  • Therefore, DSA can still be seen as subquadratic (a toy sketch follows).
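A toy sketch of the top-k idea, not DeepSeek's implementation: a stand-in `index_scores` plays the role of the lightning indexer, and softmax attention is restricted to each query's top-k causal positions.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, index_scores, topk):
    """Full softmax attention restricted to each query's top-k causal positions,
    so the attention part costs O(L k d) instead of O(L^2 d)."""
    L, d = q.shape
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
    scores = index_scores.masked_fill(~causal, float("-inf"))
    keep = torch.topk(scores, k=min(topk, L), dim=-1).indices          # (L, topk)
    outs = []
    for t in range(L):
        idx = keep[t][keep[t] <= t]                                    # selected causal positions
        att = F.softmax((k[idx] @ q[t]) / d**0.5, dim=-1)
        outs.append(att @ v[idx])
    return torch.stack(outs)

L, d, topk = 16, 8, 4
q, k, v = (torch.randn(L, d) for _ in range(3))
index_scores = torch.randn(L, L)     # placeholder for the lightning indexer's output
print(topk_sparse_attention(q, k, v, index_scores, topk).shape)   # torch.Size([16, 8])
```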

MiniMax-M1 vs. M2 [13]

  • MiniMax-M1 utilizes GLA, while MiniMax-M2 reverts to full softmax attention.
  • Full Attention still holds practical advantages across various scenarios (code/math, agents, multimodal tasks, long chain-of-thought, RL, low-precision compute, speculative decoding, etc.).

Issues

  • Evaluation Difficulty
    • Existing benchmarks are easily maxed out.
    • High observation cost: differences only show up after scaling up and under more complex evaluations.
  • Infrastructure Gaps for Efficient Attention
    • Training: efficient attentions are often memory-bound without deep I/O optimizations.
    • Inference:
      • Theoretical compute/memory savings only materialize for long enough sequences.
      • Challenges: low-precision state storage, efficient prefix caching, improvements to speculative decoding.

Conclusion & Takeaways

  • Linear attention is about striking the right balance between performance and efficiency.
    • Performance: gating, memory update.
    • Training Efficiency: stick to chunkwise parallel form.
    • Inference Efficiency: constant-size state/memory.
  • Adoption is slowly increasing, but still far from mainstream.
    • Infrastructure is far behind softmax attention.

Footnotes

  1. Su, Jianlin (苏剑林). (Jun. 20, 2025). "A Brief History of Linear Attention: From Imitation and Innovation to Feeding Back" (《线性注意力简史:从模仿、创新到反哺》) [Blog post]. Retrieved from https://kexue.fm/archives/11033

  2. Sun, Yutao, et al. “Retentive network: A successor to transformer for large language models.” arXiv preprint arXiv:2307.08621 (2023).

  3. Katharopoulos, Angelos, et al. “Transformers are RNNs: Fast autoregressive transformers with linear attention.” International conference on machine learning. PMLR, 2020.

  4. Schlag, Imanol, Kazuki Irie, and Jürgen Schmidhuber. “Linear transformers are secretly fast weight programmers.” International conference on machine learning. PMLR, 2021.

  5. Mao, Huanru Henry. “Fine-tuning pre-trained transformers into decaying fast weights.” arXiv preprint arXiv:2210.04243 (2022).

  6. Gu, Albert, and Tri Dao. “Mamba: Linear-time sequence modeling with selective state spaces.” First conference on language modeling. 2024.

  7. Dao, Tri, and Albert Gu. “Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality.” arXiv preprint arXiv:2405.21060 (2024).

  8. Yang, Songlin, et al. “Gated linear attention transformers with hardware-efficient training.” arXiv preprint arXiv:2312.06635 (2023).

  9. Yang, Songlin, et al. “Parallelizing linear transformers with the delta rule over sequence length.” Advances in neural information processing systems 37 (2024): 115491-115522.

  10. Yang, Songlin, Jan Kautz, and Ali Hatamizadeh. “Gated delta networks: Improving mamba2 with delta rule.” arXiv preprint arXiv:2412.06464 (2024).

  11. Gu, Yuxian, et al. “Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search.” arXiv preprint arXiv:2508.15884 (2025).

  12. DeepSeekAI. “DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention.” GitHub (2025).

  13. See reddit post.