Summary

  • Introduced a constant forgetting factor $\gamma$.
  • Derived parallel and recurrent representations.
  • Derived chunk-wise block-parallel form for efficient training.

Motivation

  • Linear attention
    • Replaces $\mathrm{softmax}(QK^\top)$ with a kernel feature map $\phi(Q)\,\phi(K)^\top$.
    • Worse performance than transformers.
  • Recurrent models
    • Sacrifice training parallelism.
  • State space models
    • Training parallelism & performance.

Methodology

Goal: support both recurrence and parallelism.

Starting from the linear recurrence

$$s_n = A s_{n-1} + K_n^\top v_n, \qquad o_n = Q_n s_n = \sum_{m=1}^{n} Q_n A^{n-m} K_m^\top v_m,$$

the transition matrix $A$ is diagonalized as $A = \Lambda (\gamma e^{i\theta}) \Lambda^{-1}$, so that $A^{n-m} = \Lambda (\gamma e^{i\theta})^{n-m} \Lambda^{-1}$.

$\Lambda$ can be absorbed into $W_Q$ and $W_K$. Therefore, the formula can be simplified as

$$o_n = \sum_{m=1}^{n} Q_n (\gamma e^{i\theta})^{n-m} K_m^\dagger v_m = \sum_{m=1}^{n} \gamma^{n-m} \left(Q_n e^{i n \theta}\right) \left(K_m e^{i m \theta}\right)^\dagger v_m,$$

i.e., treating $\gamma$ as a scalar, the parallel form $\mathrm{Retention}(X) = (Q K^\top \odot D)\,V$ with $D_{nm} = \gamma^{n-m}$ for $n \ge m$ and $0$ otherwise.
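As a concrete illustration (not the paper's reference code), here is a minimal single-head numpy sketch of the parallel form, assuming $\gamma$ is a real scalar and ignoring the $e^{i\theta}$ rotation and any normalization; the function name `retention_parallel` is illustrative.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel form: (Q K^T * D) V with causal decay mask D[n, m] = gamma^(n-m)."""
    n = Q.shape[0]
    idx = np.arange(n)
    # D[n, m] = gamma^(n-m) for n >= m, 0 above the diagonal (causal)
    D = np.tril(gamma ** (idx[:, None] - idx[None, :]))
    return (Q @ K.T * D) @ V

# toy usage
rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.normal(size=(3, n, d))
out = retention_parallel(Q, K, V, gamma=0.9)
print(out.shape)  # (8, 4)
```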

Or, in recurrent form,

$$S_n = \gamma S_{n-1} + K_n^\top V_n, \qquad \mathrm{Retention}(X_n) = Q_n S_n.$$
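Likewise, a minimal sketch of the recurrent form and of a chunk-wise block-parallel variant (parallel within each chunk, a single decayed state carried across chunks), under the same simplifying assumptions as the previous sketch; `retention_recurrent` and `retention_chunkwise` are illustrative names, and the two agree numerically with the parallel form.

```python
import numpy as np

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n,  o_n = Q_n S_n."""
    n, d = Q.shape
    S = np.zeros((d, d))
    out = np.zeros_like(V)
    for t in range(n):
        S = gamma * S + np.outer(K[t], V[t])  # decay old state, add current key/value
        out[t] = Q[t] @ S                     # read out with the current query
    return out

def retention_chunkwise(Q, K, V, gamma, chunk):
    """Chunk-wise form: parallel inside each chunk, recurrent state across chunks."""
    n, d = Q.shape
    assert n % chunk == 0, "sketch assumes the length is a multiple of the chunk size"
    S = np.zeros((d, d))                              # cross-chunk state
    out = np.zeros_like(V)
    j = np.arange(chunk)
    D = np.tril(gamma ** (j[:, None] - j[None, :]))   # intra-chunk decay mask
    for start in range(0, n, chunk):
        q, k, v = Q[start:start+chunk], K[start:start+chunk], V[start:start+chunk]
        inner = (q @ k.T * D) @ v                     # within-chunk term (parallel)
        cross = (gamma ** (j + 1))[:, None] * (q @ S) # contribution of earlier chunks
        out[start:start+chunk] = inner + cross
        # state update: decay by gamma^chunk and absorb this chunk's keys/values
        S = gamma**chunk * S + (k * (gamma ** (chunk - 1 - j))[:, None]).T @ v
    return out

# the two forms agree on a toy input
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))
r = retention_recurrent(Q, K, V, gamma=0.9)
c = retention_chunkwise(Q, K, V, gamma=0.9, chunk=4)
print(np.allclose(r, c))  # True
```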