Summary

  • Activation quantization!
  • Use rotation to remove outliers from the hidden state without changing the output.
  • Computational invariance.
  • Applied to the hidden states (residual stream), the activations of the feed-forward components, the attention mechanism, and the KV cache.
  • Retains 99% of the zero-shot accuracy of Llama-2-70B when quantized to 4 bits.
  • Lossless 6- and 8-bit quantization without any calibration data, using round-to-nearest (RTN).

Methodology

Background

Hadamard Transform

  • Walsh-Hadamard matrix: $H_{2^n} = H_2 \otimes H_{2^{n-1}}$, where $H_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$.

    • The computational complexity of applying $H_d$ to a $d$-dimensional vector ($d = 2^n$) is $O(d \log d)$ via the fast Walsh-Hadamard transform.
  • Randomized Hadamard matrix: $\tilde{H} = HD$, where $D$ is a diagonal matrix with random $\pm 1$ entries.

    • $\tilde{H}$ is also orthogonal (a small sketch follows below).
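
A minimal NumPy sketch (not the paper's code) of the fast Walsh-Hadamard transform and its randomized variant; the function names fwht and randomized_hadamard are illustrative.

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard transform along the last axis, O(d log d) per vector."""
    d = x.shape[-1]
    assert d & (d - 1) == 0, "dimension must be a power of two"
    y = x.astype(np.float64).reshape(-1, d).copy()
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = y[:, i:i + h].copy()
            y[:, i:i + h] += y[:, i + h:i + 2 * h]          # butterfly: a + b
            y[:, i + h:i + 2 * h] = a - y[:, i + h:i + 2 * h]  # butterfly: a - b
        h *= 2
    return (y / np.sqrt(d)).reshape(x.shape)  # 1/sqrt(d) makes H orthogonal

def randomized_hadamard(x: np.ndarray, rng=np.random.default_rng(0)) -> np.ndarray:
    """Apply the randomized Hadamard H @ diag(s), s in {+1, -1}; still orthogonal."""
    s = rng.choice([-1.0, 1.0], size=x.shape[-1])
    return fwht(x * s)
```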

Incoherence Processing

  • Introduced by QuIP.
  • A weight matrix $W \in \mathbb{R}^{m \times n}$ is $\mu$-incoherent if $\max_{ij} |W_{ij}| \le \mu \, \|W\|_F / \sqrt{mn}$.
    • $\max_{ij} |W_{ij}|$ is the element-wise max of the matrix.
      • Is this different from $\|W\|_\infty$?
  • A matrix with a large incoherence coefficient $\mu$ is hard to quantize: its largest element is an outlier relative to the magnitude of the average element (see the sketch below).
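
A small sketch, assuming the QuIP-style definition above, that measures $\mu$ for a matrix with an injected outlier before and after a rotation. A dense random orthogonal matrix stands in for the Hadamard rotation here; the helper name incoherence is illustrative.

```python
import numpy as np

def incoherence(W: np.ndarray) -> float:
    """mu = max|W_ij| * sqrt(m*n) / ||W||_F (Frobenius norm)."""
    m, n = W.shape
    return np.abs(W).max() * np.sqrt(m * n) / np.linalg.norm(W)

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
W[3, 7] = 50.0                                # inject an outlier -> large mu

# Random orthogonal rotation (a Hadamard matrix plays the same role, more cheaply).
Q, _ = np.linalg.qr(rng.normal(size=(256, 256)))

print("mu before:", incoherence(W))           # dominated by the outlier
print("mu after :", incoherence(W @ Q))       # outlier energy is spread out
```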

Computational Invariance

  • With no re-scaling inside the RMSNorm block, the norm commutes with any orthogonal matrix $Q$: $\mathrm{RMSNorm}(xQ) = \mathrm{RMSNorm}(x)Q$, since $\|xQ\| = \|x\|$. This lets $Q$ be folded into the adjacent weight matrices without changing the output.
    • In practice, the RMSNorm re-scaling $\mathrm{diag}(\alpha)$ is absorbed into the adjacent weight matrices first (a numerical check follows below).
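
A minimal numerical check of this invariance, assuming a scale-free RMSNorm (the rmsnorm helper below is illustrative, not the paper's code).

```python
import numpy as np

def rmsnorm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Scale-free RMSNorm: divide each row by its root-mean-square."""
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=(8, d))                    # 8 tokens, hidden size d
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal rotation

# Rotating before or after the scale-free RMSNorm gives the same result,
# because ||xQ|| = ||x|| for orthogonal Q.
print(np.allclose(rmsnorm(x @ Q), rmsnorm(x) @ Q))  # True
```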

Method

Feed Forward Network (FFN)

  • The RMSNorm scale $\mathrm{diag}(\alpha)$ is absorbed into $W_{\text{gate}}$ and $W_{\text{up}}$, which are then rotated: $W \leftarrow Q^\top \mathrm{diag}(\alpha) W$.
  • An additional online Hadamard transform is inserted before the down projection to smooth out the outliers, after which online symmetric per-token quantization is applied. The second half of that Hadamard transform ($H^\top$) is merged into $W_{\text{down}}$ (down_proj); see the sketch after this list.
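
A hedged NumPy sketch of the fused FFN forward pass under the assumptions above. The weight names (W_gate, W_up, W_down), the quantize_per_token helper, and the fake-quantization style are illustrative, not the paper's implementation.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def quantize_per_token(x, bits=4):
    """Online symmetric per-token (per-row) fake quantization."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def hadamard(d):
    """Dense d x d orthonormal Walsh-Hadamard matrix (d a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

d, d_ff = 64, 256
rng = np.random.default_rng(0)
alpha = rng.normal(size=d)                    # RMSNorm scale
W_gate, W_up = rng.normal(size=(2, d, d_ff))
W_down = rng.normal(size=(d_ff, d))
Q = hadamard(d)                               # residual rotation (orthogonal)
H = hadamard(d_ff)                            # online Hadamard before down_proj

# Offline fusion: absorb diag(alpha) and Q into the input weights,
# and merge the second half of the online Hadamard into W_down.
W_gate_f = Q.T @ np.diag(alpha) @ W_gate
W_up_f   = Q.T @ np.diag(alpha) @ W_up
W_down_f = (H.T @ W_down) @ Q                 # output written back into the rotated residual

def ffn(x_rot):
    """x_rot is the hidden state already living in the rotated basis (x @ Q)."""
    h = rmsnorm(x_rot)                        # scale-free RMSNorm
    a = quantize_per_token(h) @ W_gate_f      # low-bit activations into gate/up
    b = quantize_per_token(h) @ W_up_f
    g = a * (1 / (1 + np.exp(-a))) * b        # SiLU(gate) * up, as in Llama
    g = quantize_per_token(g @ H)             # online Hadamard, then per-token quant
    return g @ W_down_f                       # output stays in the rotated basis
```

Without the quantization calls, this path reproduces the original FFN output (rotated by Q), since Q and H are orthogonal and cancel against the factors fused into the weights.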

Attention