Parallelizing Linear Transformers with the Delta Rule over Sequence Length (arxiv link).

Motivation

  • Linear attention's state update is purely additive: each step only adds a new key-value association, and nothing already stored can ever be erased.
  • Gated variants add a forgetting (decay) factor, but a decay shrinks the whole memory uniformly and still cannot fully forget a specific piece of previous context.
  • Instead, the model should learn to selectively remove the less important memories, which is what the delta rule provides.
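
For concreteness (standard linear-attention notation, with state $S_t$, key $k_t$, value $v_t$, and decay $\gamma_t$; the delta-rule update itself is given in the next section):

$$
S_t = S_{t-1} + v_t k_t^{\top} \qquad \text{(vanilla linear attention: purely additive)}
$$

$$
S_t = \gamma_t S_{t-1} + v_t k_t^{\top}, \quad \gamma_t \in (0, 1) \qquad \text{(gated variant: the whole state decays uniformly)}
$$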

The Delta Rule

DeltaNet replaces the purely additive update with the classical delta rule, using a data-dependent writing strength $\beta_t \in (0, 1)$:

$$
S_t = S_{t-1} - \beta_t \left( S_{t-1} k_t - v_t \right) k_t^{\top}
$$

This can be regarded as optimizing an online regression loss using a single step of SGD with learning rate $\beta_t$:

$$
\mathcal{L}_t(S) = \tfrac{1}{2} \left\lVert S k_t - v_t \right\rVert^2,
\qquad
S_t = S_{t-1} - \beta_t \, \nabla \mathcal{L}_t(S_{t-1})
$$

An alternative interpretation of DeltaNet comes from the perspective of key-value retrieval: at each step, the model

  1. Retrieves the old value associated with the current key $k_t$: $v_t^{\text{old}} = S_{t-1} k_t$.
  2. Obtains a new value by interpolating between the old value and the current value $v_t$, i.e. $v_t^{\text{new}} = (1 - \beta_t) \, v_t^{\text{old}} + \beta_t v_t$, which then replaces $v_t^{\text{old}}$ in the memory: $S_t = S_{t-1} - v_t^{\text{old}} k_t^{\top} + v_t^{\text{new}} k_t^{\top}$.
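
As a sanity check, here is a minimal sequential sketch of this recurrence in PyTorch (function and variable names are my own; this is the naive $O(T)$ loop, not the paper's chunkwise-parallel algorithm):

```python
import torch

def delta_rule_recurrent(q, k, v, beta):
    """Naive sequential delta-rule recurrence for a single head.

    q, k: (T, d_k) L2-normalized queries and keys
    v:    (T, d_v) values
    beta: (T,)     per-token writing strengths in (0, 1)
    Returns outputs of shape (T, d_v).
    """
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = v.new_zeros(d_v, d_k)                            # memory state S_t
    outs = []
    for t in range(T):
        v_old = S @ k[t]                                 # v_old = S_{t-1} k_t
        v_new = (1 - beta[t]) * v_old + beta[t] * v[t]   # interpolate old and new value
        S = S + torch.outer(v_new - v_old, k[t])         # overwrite the association for k_t
        outs.append(S @ q[t])                            # read out with the query: o_t = S_t q_t
    return torch.stack(outs)
```

The loop is inherently sequential; the paper's contribution (hence the title) is reformulating the same recurrence so it can be parallelized over the sequence length during training.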

DeltaNet Transformer

```mermaid
---
config:
  look: handDrawn
---
flowchart BT
in(("in"))
qk_proj("Linear")
qk_conv("Conv<sup>*</sup>")
qk_silu("SiLU")
qk_l2norm("L2 Norm")
v_proj("Linear")
v_conv("Conv<sup>*</sup>")
v_silu("SiLU")
beta_proj("Linear")
beta_sigmoid("Sigmoid")
delta_rule("Delta Rule")
rmsnorm("RMSNorm")
o_proj("Linear")
out(("out"))


in --> qk_proj --> qk_conv --> qk_silu --> qk_l2norm -- "q, k" --> delta_rule
in --> v_proj --> v_conv --> v_silu -- "v" --> delta_rule
in --> beta_proj --> beta_sigmoid -- "beta" --> delta_rule
delta_rule --> rmsnorm --> o_proj
o_proj --> out
```

  • The convolution layers (Conv*) are depthwise-separable convolutions, which generalize the shift SSM.
  • One hybrid variant interleaves DeltaNet layers with sliding-window attention (SWA) layers.
  • Another hybrid variant keeps only two global-attention layers: the 2nd and the ($L/2$)-th layer (with $L$ layers in total).
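
The diagram maps fairly directly onto code. Below is a single-head sketch of the token-mixing block (module names, dimensions, and kernel size are my assumptions; the diagram's shared q/k chain is split into two projections here, and `delta_rule_recurrent` is the naive loop from above, not the paper's chunkwise-parallel kernel):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeltaNetBlock(nn.Module):
    """Single-head sketch of the DeltaNet token-mixing layer in the diagram above."""

    def __init__(self, d_model: int, d_key: int, d_value: int, conv_size: int = 4):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_key, bias=False)
        self.k_proj = nn.Linear(d_model, d_key, bias=False)
        self.v_proj = nn.Linear(d_model, d_value, bias=False)
        self.beta_proj = nn.Linear(d_model, 1, bias=False)
        # Depthwise (per-channel) causal short convolutions: the Conv* boxes.
        self.q_conv = nn.Conv1d(d_key, d_key, conv_size, groups=d_key, padding=conv_size - 1)
        self.k_conv = nn.Conv1d(d_key, d_key, conv_size, groups=d_key, padding=conv_size - 1)
        self.v_conv = nn.Conv1d(d_value, d_value, conv_size, groups=d_value, padding=conv_size - 1)
        self.norm = nn.RMSNorm(d_value)   # requires a recent PyTorch; otherwise hand-roll RMSNorm
        self.o_proj = nn.Linear(d_value, d_model, bias=False)

    def _causal_conv(self, conv, x):
        # x: (T, d) -> depthwise conv over time, truncated to keep causality -> (T, d)
        T = x.shape[0]
        return conv(x.T.unsqueeze(0))[0, :, :T].T

    def forward(self, x):                                        # x: (T, d_model)
        q = F.silu(self._causal_conv(self.q_conv, self.q_proj(x)))
        k = F.silu(self._causal_conv(self.k_conv, self.k_proj(x)))
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)    # L2 Norm boxes
        v = F.silu(self._causal_conv(self.v_conv, self.v_proj(x)))
        beta = torch.sigmoid(self.beta_proj(x)).squeeze(-1)      # (T,) writing strengths
        o = delta_rule_recurrent(q, k, v, beta)                  # Delta Rule box (naive loop above)
        return self.o_proj(self.norm(o))                         # RMSNorm -> output projection
```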