Summary

  • Activation quantization!
  • Use rotation to remove outliers from the hidden state without changing the output.
  • Computational invariance.
  • Applied to the hidden states (residual stream), the activations of the feed-forward components, the attention mechanism, and the KV cache.
  • Retains 99% of the zero-shot accuracy of Llama-2-70B when quantized to 4 bits.
  • Lossless 6- and 8-bit quantization without any calibration data, using round-to-nearest (RTN).

Methodology

Background

Hadamard Transform

  • Walsh-Hadamard matrix: $H_{2^n} = H_2 \otimes H_{2^{n-1}}$, where $H_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$.

    • The computational complexity of applying $H_d$ to a $d$-dimensional vector ($d = 2^n$) is $O(d \log d)$ via the fast Walsh-Hadamard transform.
  • Randomized Hadamard matrix: $\tilde{H} = HD$, where $D$ is a diagonal matrix with random $\pm 1$ entries.

    • $\tilde{H}$ is also orthogonal (a small sketch follows below).
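
A minimal NumPy sketch (not the paper's code) of the fast Walsh-Hadamard transform and its randomized variant; the function names fwht and randomized_hadamard are illustrative.

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard transform along the last axis, O(d log d) per vector."""
    d = x.shape[-1]
    assert d & (d - 1) == 0, "dimension must be a power of two"
    y = x.astype(np.float64).reshape(-1, d).copy()
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = y[:, i:i + h].copy()
            y[:, i:i + h] += y[:, i + h:i + 2 * h]          # butterfly: a + b
            y[:, i + h:i + 2 * h] = a - y[:, i + h:i + 2 * h]  # butterfly: a - b
        h *= 2
    return (y / np.sqrt(d)).reshape(x.shape)  # 1/sqrt(d) makes H orthogonal

def randomized_hadamard(x: np.ndarray, rng=np.random.default_rng(0)) -> np.ndarray:
    """Apply the randomized Hadamard H @ diag(s), s in {+1, -1}; still orthogonal."""
    s = rng.choice([-1.0, 1.0], size=x.shape[-1])
    return fwht(x * s)
```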

Incoherence Processing

  • Introduced by QuIP.
  • A weight matrix $W \in \mathbb{R}^{m \times n}$ is $\mu$-incoherent if $\max_{ij} |W_{ij}| \le \mu \, \|W\|_F / \sqrt{mn}$.
    • $\max_{ij} |W_{ij}|$ is the element-wise max of the matrix.
      • Is this different from $\|W\|_\infty$?
  • A matrix with a large incoherence coefficient $\mu$ is hard to quantize: its largest element is an outlier relative to the magnitude of the average element (see the sketch below).
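
A small sketch, assuming the QuIP-style definition above, that measures $\mu$ for a matrix with an injected outlier before and after a rotation. A dense random orthogonal matrix stands in for the Hadamard rotation here; the helper name incoherence is illustrative.

```python
import numpy as np

def incoherence(W: np.ndarray) -> float:
    """mu = max|W_ij| * sqrt(m*n) / ||W||_F (Frobenius norm)."""
    m, n = W.shape
    return np.abs(W).max() * np.sqrt(m * n) / np.linalg.norm(W)

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
W[3, 7] = 50.0                                # inject an outlier -> large mu

# Random orthogonal rotation (a Hadamard matrix plays the same role, more cheaply).
Q, _ = np.linalg.qr(rng.normal(size=(256, 256)))

print("mu before:", incoherence(W))           # dominated by the outlier
print("mu after :", incoherence(W @ Q))       # outlier energy is spread out
```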

Computational Invariance

  • With no re-scaling inside the RMSNorm block, the norm commutes with any orthogonal matrix $Q$: $\mathrm{RMSNorm}(xQ) = \mathrm{RMSNorm}(x)Q$, since $\|xQ\| = \|x\|$. This lets $Q$ be folded into the adjacent weight matrices without changing the output.
    • In practice, the RMSNorm re-scaling $\mathrm{diag}(\alpha)$ is absorbed into the adjacent weight matrices first (a numerical check follows below).
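
A minimal numerical check of this invariance, assuming a scale-free RMSNorm (the rmsnorm helper below is illustrative, not the paper's code).

```python
import numpy as np

def rmsnorm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Scale-free RMSNorm: divide each row by its root-mean-square."""
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=(8, d))                    # 8 tokens, hidden size d
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal rotation

# Rotating before or after the scale-free RMSNorm gives the same result,
# because ||xQ|| = ||x|| for orthogonal Q.
print(np.allclose(rmsnorm(x @ Q), rmsnorm(x) @ Q))  # True
```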

Method

Feed Forward Network (FFN)

  • The RMSNorm scale $\mathrm{diag}(\alpha)$ is absorbed into $W_{\text{gate}}$ and $W_{\text{up}}$, which are then rotated: $W \leftarrow Q^\top \mathrm{diag}(\alpha) W$.
  • An additional online Hadamard transform is inserted before the down projection to smooth out the outliers, after which online symmetric per-token quantization is applied. The second half of that Hadamard transform ($H^\top$) is merged into $W_{\text{down}}$ (down_proj); see the sketch after this list.
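
A hedged NumPy sketch of the fused FFN forward pass under the assumptions above. The weight names (W_gate, W_up, W_down), the quantize_per_token helper, and the fake-quantization style are illustrative, not the paper's implementation.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def quantize_per_token(x, bits=4):
    """Online symmetric per-token (per-row) fake quantization."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def hadamard(d):
    """Dense d x d orthonormal Walsh-Hadamard matrix (d a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

d, d_ff = 64, 256
rng = np.random.default_rng(0)
alpha = rng.normal(size=d)                    # RMSNorm scale
W_gate, W_up = rng.normal(size=(2, d, d_ff))
W_down = rng.normal(size=(d_ff, d))
Q = hadamard(d)                               # residual rotation (orthogonal)
H = hadamard(d_ff)                            # online Hadamard before down_proj

# Offline fusion: absorb diag(alpha) and Q into the input weights,
# and merge the second half of the online Hadamard into W_down.
W_gate_f = Q.T @ np.diag(alpha) @ W_gate
W_up_f   = Q.T @ np.diag(alpha) @ W_up
W_down_f = (H.T @ W_down) @ Q                 # output written back into the rotated residual

def ffn(x_rot):
    """x_rot is the hidden state already living in the rotated basis (x @ Q)."""
    h = rmsnorm(x_rot)                        # scale-free RMSNorm
    a = quantize_per_token(h) @ W_gate_f      # low-bit activations into gate/up
    b = quantize_per_token(h) @ W_up_f
    g = a * (1 / (1 + np.exp(-a))) * b        # SiLU(gate) * up, as in Llama
    g = quantize_per_token(g @ H)             # online Hadamard, then per-token quant
    return g @ W_down_f                       # output stays in the rotated basis
```

Without the quantization calls, this path reproduces the original FFN output (rotated by Q), since Q and H are orthogonal and cancel against the factors fused into the weights.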

Attention