Summary
- Activation quantization, not just weight-only quantization.
- Use rotation to remove outliers from the hidden state without changing the output.
- Computational invariance.
- Applied to hidden states (residual), activations of feed forward components, attention mechanism, and KV cache.
- 99% of accuracy retained on 4-bit quantized Llama-2-70B.
- Lossless 6- and 8-bit quantization without any calibration data, using round-to-nearest (RTN).
Methodology
Background
Hadamard Transform
- A Hadamard matrix $H$ is an orthogonal matrix whose entries are all $\pm 1/\sqrt{d}$, so $HH^\top = I$.
- Walsh-Hadamard matrix: $H_{2^n} = H_2 \otimes H_{2^{n-1}}$, with $H_2 = \frac{1}{\sqrt{2}}\begin{pmatrix}1 & 1 \\ 1 & -1\end{pmatrix}$.
- The computation complexity of $Hx$ is $O(d \log_2 d)$ via the fast Walsh-Hadamard transform.
- Randomized Hadamard matrix: $\tilde{H} = H\,\mathrm{diag}(s)$, where $s$ is a vector of random signs in $\{+1, -1\}$ (see the sketch below).
- $\tilde{H}$ is also orthogonal.
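A minimal NumPy sketch (my own, not from the paper) of the recursive Walsh-Hadamard construction and the randomized variant $\tilde{H} = H\,\mathrm{diag}(s)$; both orthogonality checks should print True:

```python
import numpy as np

def walsh_hadamard(d: int) -> np.ndarray:
    """Build the d x d Walsh-Hadamard matrix (d must be a power of 2),
    normalized so that H @ H.T = I."""
    assert d & (d - 1) == 0, "d must be a power of 2"
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])   # H_{2n} = H_2 (Kronecker) H_n
    return H / np.sqrt(d)

def randomized_hadamard(d: int, rng=np.random.default_rng(0)) -> np.ndarray:
    """Randomized Hadamard matrix H @ diag(s) with random signs s; still orthogonal."""
    s = rng.choice([-1.0, 1.0], size=d)
    return walsh_hadamard(d) * s          # column-wise sign flip == H @ diag(s)

d = 8
H = walsh_hadamard(d)
Ht = randomized_hadamard(d)
print(np.allclose(H @ H.T, np.eye(d)))    # True: orthogonal
print(np.allclose(Ht @ Ht.T, np.eye(d)))  # True: still orthogonal
```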
Incoherence Processing
- Introduced by QuIP.
- A weight matrix $W \in \mathbb{R}^{m \times n}$ is $\mu$-incoherent if $\max(W) \le \mu\,\|W\|_F / \sqrt{mn}$.
- $\max(W)$ is the element-wise max of the matrix; $\|W\|_F$ is its Frobenius norm.
- Is this different from $\|W\|_{\max}$? No: $\|W\|_{\max}$ is also defined as the element-wise max of the matrix.
- A matrix with high incoherence (large $\mu$) is hard to quantize: the largest element is an outlier relative to the magnitude of the average element (toy check below).
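As a sanity check (a toy sketch, not the paper's measurement), the incoherence $\mu$ of a matrix with a planted outlier column drops sharply after a randomized Hadamard rotation:

```python
import numpy as np
from scipy.linalg import hadamard

def incoherence(W: np.ndarray) -> float:
    """Smallest mu with max|W_ij| <= mu * ||W||_F / sqrt(m*n)."""
    m, n = W.shape
    return np.abs(W).max() * np.sqrt(m * n) / np.linalg.norm(W)

rng = np.random.default_rng(0)
d = 256
W = rng.normal(size=(d, d))
W[:, 7] *= 50.0  # plant an outlier column, mimicking an outlier feature channel

# randomized Hadamard rotation: H @ diag(random signs), orthogonal
Q = hadamard(d) / np.sqrt(d) * rng.choice([-1.0, 1.0], size=d)

print(f"mu before rotation: {incoherence(W):.2f}")
print(f"mu after rotation:  {incoherence(W @ Q):.2f}")  # outlier mass is spread out
```

The Frobenius norm is unchanged by the orthogonal rotation, so the drop in $\mu$ comes entirely from the largest element shrinking toward the average magnitude.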
Computational Invariance
- Requires that no re-scaling happens inside the RMSNorm block: for an orthogonal $Q$, $\mathrm{RMSNorm}(xQ) = \mathrm{RMSNorm}(x)\,Q$ because $\|xQ\| = \|x\|$ (numerical check below).
- In practice, the RMSNorm re-scaling $\mathrm{diag}(\alpha)$ is absorbed into the adjacent weight matrices first.
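A small numerical check of the invariance (assuming a row-vector `x @ W` layout and a scale-free RMSNorm; the weight names `W_prev`/`W_next` are hypothetical):

```python
import numpy as np
from scipy.linalg import hadamard

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm without a learned scale: only divides each row by its RMS."""
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d = 128
x = rng.normal(size=(4, d))                                  # a few hidden-state rows
W_prev, W_next = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# orthogonal rotation (randomized Hadamard)
Q = hadamard(d) / np.sqrt(d) * rng.choice([-1.0, 1.0], size=d)

# original path: previous block's output, RMSNorm, next block's input projection
y_ref = rms_norm(x @ W_prev) @ W_next

# rotated path: fold Q / Q.T into the adjacent weights; hidden state lives in the rotated basis
y_rot = rms_norm(x @ (W_prev @ Q)) @ (Q.T @ W_next)

print(np.allclose(y_ref, y_rot))  # True: the rotation is computationally invariant
```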
Method
Feed Forward Network (FFN)

- The RMSNorm scale $\mathrm{diag}(\alpha)$ is absorbed into $W_{\text{up}}$ (up_proj) and $W_{\text{gate}}$ (gate_proj).
- An additional online Hadamard transform is inserted before the down projection to smooth out the outliers, and online symmetric per-token quantization is applied to the result. The second half of that Hadamard transform is merged into down_proj offline (see the sketch below).
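A toy NumPy sketch of the two fusions above (assuming an `x @ W` layout, hypothetical shapes, and simulated fake-quantization rather than real INT4 kernels):

```python
import numpy as np
from scipy.linalg import hadamard

def quantize_per_token(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Online symmetric per-token (per-row) quantization, simulated as fake-quant."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

def silu(x):  # SiLU, as used in Llama-style SwiGLU FFNs
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, d_ff = 128, 256
x = rng.normal(size=(4, d))        # RMSNorm output (scale not yet applied)
alpha = rng.normal(size=d)         # RMSNorm scale parameters
W_gate, W_up = rng.normal(size=(d, d_ff)), rng.normal(size=(d, d_ff))
W_down = rng.normal(size=(d_ff, d))

# (1) absorb the RMSNorm scale diag(alpha) into the FFN input projections
W_gate_f, W_up_f = np.diag(alpha) @ W_gate, np.diag(alpha) @ W_up
h_ref = silu((x * alpha) @ W_gate) * ((x * alpha) @ W_up)
h_fused = silu(x @ W_gate_f) * (x @ W_up_f)
print(np.allclose(h_ref, h_fused))           # True: the absorption is exact

# (2) online Hadamard before down_proj; its second half H.T is fused into W_down offline
H = hadamard(d_ff) / np.sqrt(d_ff)
W_down_f = H.T @ W_down
y = quantize_per_token(h_fused @ H) @ W_down_f   # quantize the smoothed activations
y_ref = h_ref @ W_down
print(np.abs(y - y_ref).max())               # nonzero only because of 4-bit fake quantization
```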
Attention
