Summary

  • Flash Linear Attention (FLA): a hardware- and I/O-aware linear attention kernel.
    • Its chunk-wise block-parallel form interpolates between the fully parallel and fully recurrent forms.
  • Gated Linear Attention (GLA): a data-dependent gating mechanism for linear attention that still admits a hardware-efficient chunk-wise form.

Motivation

  • Current implementations of linear attention lack I/O-awareness and are thus slower than highly optimized implementations of softmax attention such as FlashAttention.
  • Forget gates have been shown to be crucial in gated RNNs (e.g., the LSTM).
    • RetNet: uses a fixed, data-independent decay factor.
    • Mamba's selective recurrence does not admit a matmul-based chunk-wise parallel form; its scan-based training cannot exploit tensor cores and is less hardware-efficient.
    • Mamba2: uses a more restricted gating mechanism, in which the decay $\gamma_t$ is a single data-dependent scalar per time step.

Methodology

Chunk-Wise Block-Parallel

  • Softmax attention can be fully parallelized during training, but only at the cost of quadratic complexity in sequence length, whereas the recurrent form of linear attention is linear in sequence length but sequential.
  • The chunk-wise block-parallel form splits the sequence into chunks of length $C$: interactions within a chunk are computed in parallel with matmuls (quadratic only in $C$), while interactions across chunks flow through the recurrent state, interpolating between the two forms (a plain-PyTorch reference sketch follows).
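
A minimal, non-optimized PyTorch reference for the chunk-wise form of plain (ungated, feature-map-free) linear attention, just to make the interpolation concrete. The function and variable names are my own; the actual FLA kernels fuse these steps in Triton and are I/O-aware, and the gated (GLA) variant additionally applies the decay inside and across chunks.

```python
import torch

def chunkwise_linear_attention(q, k, v, chunk_size=64):
    """Reference (non-I/O-aware) chunk-wise causal linear attention.

    q, k, v: (batch, seq_len, d). Sequence length must be divisible by
    chunk_size. No feature map, normalizer, or gating is applied here.
    """
    B, L, d = q.shape
    assert L % chunk_size == 0
    o = torch.empty_like(v)
    S = q.new_zeros(B, d, d)                       # running inter-chunk state: sum of k^T v
    for start in range(0, L, chunk_size):
        qc = q[:, start:start + chunk_size]        # (B, C, d)
        kc = k[:, start:start + chunk_size]
        vc = v[:, start:start + chunk_size]
        inter = qc @ S                             # contribution of all previous chunks
        attn = (qc @ kc.transpose(-1, -2)).tril()  # causal intra-chunk scores, (B, C, C)
        intra = attn @ vc                          # quadratic only in chunk_size
        o[:, start:start + chunk_size] = inter + intra
        S = S + kc.transpose(-1, -2) @ vc          # fold this chunk into the state
    return o
```

With `chunk_size = L` this recovers the fully parallel quadratic form (the state is never used), and with `chunk_size = 1` it degenerates to the recurrent form, which is exactly the interpolation described above.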

GLA

Recurrent Form

A general data-dependent forgetting gate can be formulated as

$$S_t = G_t \odot S_{t-1} + k_t^\top v_t, \qquad o_t = q_t S_t,$$

where $G_t \in (0,1)^{d_k \times d_v}$ is a gating matrix computed from the current input $x_t$.

  • A naive mapping $x_t \mapsto G_t$ would require producing a matrix of size $d_k \times d_v$ at every step, which would be parameter-inefficient (see the rough comparison below).
  • Outer-product-based low-rank parameterization: $G_t = \alpha_t^\top \beta_t$, with $\alpha_t \in (0,1)^{d_k}$ and $\beta_t \in (0,1)^{d_v}$.
  • Mamba2 uses $G_t = \gamma_t \mathbf{1}\mathbf{1}^\top$, where $\gamma_t \in (0,1)$ is a data-dependent scalar, i.e. the same decay for every entry of $S_t$.
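
For intuition, a back-of-the-envelope comparison of the parameters needed just to produce the gate, assuming $d = d_k = d_v = 1024$ and a low rank of 16 (illustrative assumptions, not numbers from the paper):

$$
\begin{aligned}
\text{naive } x_t \mapsto G_t :&\quad d \cdot d_k d_v = 1024^3 \approx 1.1 \times 10^9 \\
\text{outer product } G_t = \alpha_t^\top \beta_t :&\quad d\,d_k + d\,d_v = 2 \cdot 1024^2 \approx 2.1 \times 10^6 \\
\text{Mamba2 } G_t = \gamma_t \mathbf{1}\mathbf{1}^\top :&\quad d \cdot 1 = 1024 \\
\text{GLA } G_t = \alpha_t^\top \mathbf{1} \text{ (rank 16)} :&\quad d \cdot 16 + 16 \cdot d_k \approx 3.3 \times 10^4
\end{aligned}
$$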

GLA adopts a middle ground: it fixes $\beta_t = \mathbf{1}$, so that $G_t = \alpha_t^\top \mathbf{1}$ and $G_t \odot S_{t-1} = \mathrm{Diag}(\alpha_t)\, S_{t-1}$, giving

$$S_t = \mathrm{Diag}(\alpha_t)\, S_{t-1} + k_t^\top v_t, \qquad \alpha_t = \sigma\!\left(x_t W_{\alpha^-} W_{\alpha^+}\right) \in (0,1)^{d_k},$$

where $\alpha_t$ is produced by a low-rank down-/up-projection (the paper additionally raises the sigmoid output to a power $1/\tau$ to keep the gate close to 1). See the diagram below and the PyTorch sketch after it:

flowchart LR

subgraph Recurrent
    q_proj("Q")
    k_proj("K")
    v_proj("V")
    alpha_down[/"`$$W_{\alpha^-}$$`"\]
    alpha_up[\"`$$W_{\alpha^+}$$`"/]
    sigmoid(("`$$\sigma$$`"))
    matmul(("`$$\times$$`"))
    add(("`$$+$$`"))
    matmul2(("`$$\times$$`"))
    diag("`$$\text{Diag}(\cdot)$$`")
    
    outer_product(("`$$\times$$`"))
    v_proj --> outer_product
    k_proj --> outer_product
    outer_product --> add
    alpha_down --> alpha_up
    alpha_up --> sigmoid --"`$$\alpha_t$$`"--> diag --"`$$G_t$$`"--> matmul --> add
    add --> matmul2
    q_proj --> matmul2
end

state(("`$$S_{t - 1}$$`"))
input(("`$$x_t$$`"))
output(("`$$o_t$$`"))
state_next(("`$$S_t$$`"))

input --> q_proj
input --> k_proj
input --> v_proj 
input --> alpha_down
state --> matmul
matmul2 --> output
add --> state_next
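
As a concrete companion to the diagram, here is a minimal, non-optimized PyTorch sketch of a single GLA recurrence step. The class name, the low-rank width of 16, and the omission of the sigmoid temperature and of any output gating/normalization are simplifications and assumptions on my part; the actual implementation trains with the chunk-wise form via fused Triton kernels rather than this step-by-step recurrence.

```python
import torch
import torch.nn as nn

class GLARecurrentCell(nn.Module):
    """Sketch of one GLA recurrence step, matching the diagram above:
        alpha_t = sigmoid(x_t W_{alpha^-} W_{alpha^+})
        S_t     = Diag(alpha_t) S_{t-1} + k_t^T v_t
        o_t     = q_t S_t
    """
    def __init__(self, d_model, d_key, d_value, low_rank=16):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_key, bias=False)
        self.k_proj = nn.Linear(d_model, d_key, bias=False)
        self.v_proj = nn.Linear(d_model, d_value, bias=False)
        self.alpha_down = nn.Linear(d_model, low_rank, bias=False)  # W_{alpha^-}
        self.alpha_up = nn.Linear(low_rank, d_key, bias=True)       # W_{alpha^+}

    def forward(self, x_t, S_prev):
        # x_t: (B, d_model), S_prev: (B, d_key, d_value)
        q_t = self.q_proj(x_t)
        k_t = self.k_proj(x_t)
        v_t = self.v_proj(x_t)
        # data-dependent decay via the low-rank down-/up-projection
        alpha_t = torch.sigmoid(self.alpha_up(self.alpha_down(x_t)))      # (B, d_key)
        # Diag(alpha_t) S_{t-1} + outer product k_t^T v_t
        S_t = alpha_t.unsqueeze(-1) * S_prev \
              + k_t.unsqueeze(-1) * v_t.unsqueeze(-2)                     # (B, d_key, d_value)
        o_t = torch.einsum('bk,bkv->bv', q_t, S_t)                        # q_t S_t
        return o_t, S_t
```

For example, `o_t, S_t = cell(x_t, S_prev)` rolls the state forward by one token; unrolling this over time reproduces the recurrent form used at inference, while training uses the chunk-wise form sketched earlier.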