Summary

  • Activation quantization!
  • Use rotation to remove outliers from the hidden state without changing the output.
  • Computational invariance.
  • Applied to hidden states (the residual stream), feed-forward activations, the attention mechanism, and the KV cache.
  • Retains 99% of the accuracy on 4-bit quantized Llama-2-70B.
  • Lossless 6- and 8-bit quantization without any calibration data, using round-to-nearest (RTN).

Methodology

Background

Hadamard Transform

  • Walsh-Hadamard matrix: $H_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$, $H_{2^k} = H_2 \otimes H_{2^{k-1}}$.
    • The computational complexity of applying it to a vector ($Hx$) is $O(d \log d)$ via the fast Walsh-Hadamard transform.
  • Randomized Hadamard matrix: $\tilde{H} = H\,\mathrm{diag}(s)$ with random signs $s \in \{-1, +1\}^d$.
    • $\tilde{H}$ is also orthogonal (see the sketch below).
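
A minimal NumPy sketch of both constructions (helper names like `walsh_hadamard` and `fast_hadamard_transform` are mine, not from the paper): it builds $H_{2^k}$ recursively, forms a randomized Hadamard matrix from random signs, and checks that the $O(d \log d)$ butterfly recursion matches the dense product $Hx$.

```python
import numpy as np

def walsh_hadamard(k: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix of size 2^k x 2^k: H_{2^k} = H_2 (x) H_{2^(k-1)}."""
    H = np.array([[1.0]])
    H2 = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
    for _ in range(k):
        H = np.kron(H2, H)
    return H

def fast_hadamard_transform(x: np.ndarray) -> np.ndarray:
    """Compute H x in O(d log d) via the butterfly recursion (d = 2^k)."""
    x = np.array(x, dtype=np.float64)
    d = x.shape[-1]
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(d)

k, d = 3, 8
H = walsh_hadamard(k)
assert np.allclose(H @ H.T, np.eye(d))                 # H is orthogonal

s = np.random.choice([-1.0, 1.0], size=d)              # random signs
H_rand = H @ np.diag(s)                                # randomized Hadamard matrix
assert np.allclose(H_rand @ H_rand.T, np.eye(d))       # still orthogonal

x = np.random.randn(d)
assert np.allclose(fast_hadamard_transform(x), H @ x)  # O(d log d) matches dense H @ x
```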

Incoherence Processing

  • Introduced by QuIP.
  • A weight matrix $W \in \mathbb{R}^{m \times n}$ is $\mu$-incoherent if $\max_{ij} |W_{ij}| \le \mu\,\|W\|_F / \sqrt{mn}$.
    • $\max_{ij} |W_{ij}|$ is the element-wise max of the matrix.
      • Is this different from $\|W\|_\infty$?
  • A matrix with a large incoherence $\mu$ is hard to quantize: its largest element is an outlier relative to the magnitude of the average element (a toy example follows below).
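
A toy illustration of the definition above (function names are illustrative): a single injected outlier makes $\mu$ large, and multiplying by a randomized Hadamard matrix leaves the Frobenius norm unchanged while spreading the outlier out, so $\mu$ drops.

```python
import numpy as np
from scipy.linalg import hadamard

def incoherence(W: np.ndarray) -> float:
    """QuIP-style mu(W) = max_ij |W_ij| * sqrt(m*n) / ||W||_F."""
    m, n = W.shape
    return np.abs(W).max() * np.sqrt(m * n) / np.linalg.norm(W)

rng = np.random.default_rng(0)
m = n = 256
W = rng.standard_normal((m, n))
W[3, 7] = 50.0                                    # inject a single outlier entry

H = hadamard(n) / np.sqrt(n)                      # orthonormal Walsh-Hadamard matrix
Q = H @ np.diag(rng.choice([-1.0, 1.0], size=n))  # randomized Hadamard rotation

# The Frobenius norm is invariant under the rotation, but the outlier gets
# spread across its row, so mu drops sharply.
print("mu before:", incoherence(W))
print("mu after :", incoherence(W @ Q))
```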

Computational Invariance

  • Relies on no re-scaling happening inside the RMSNorm block, so that RMSNorm commutes with an orthogonal $Q$: $\mathrm{RMSNorm}(xQ) = \mathrm{RMSNorm}(x)\,Q$.
    • In practice, the RMSNorm scale $\mathrm{diag}(\alpha)$ is absorbed into the adjacent weight matrices first (see the sketch below).
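
A small numerical check of both points, assuming row-vector activations and a scale-free RMSNorm; the helper names are mine, not the paper's.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Scale-free RMSNorm over the last dimension."""
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal((4, d))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))     # any orthogonal matrix

# 1) Computational invariance: RMSNorm(x @ Q) == RMSNorm(x) @ Q
assert np.allclose(rms_norm(x @ Q), rms_norm(x) @ Q, atol=1e-6)

# 2) Absorbing the RMSNorm scale into the next linear layer:
alpha = rng.standard_normal(d)             # learned RMSNorm scale
W = rng.standard_normal((d, 4 * d))        # following linear layer
lhs = (rms_norm(x) * alpha) @ W            # original computation
rhs = rms_norm(x) @ (np.diag(alpha) @ W)   # scale folded into the weight
assert np.allclose(lhs, rhs, atol=1e-6)
```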

Method

Inter-Block Activation

  • Goal: apply incoherence processing to the activations between decoder blocks, i.e., make them easily quantizable by rotating them: $X \to XQ$.
  • Merge the RMSNorm scale into all of the linear layers that read from the residual ($W \leftarrow \mathrm{diag}(\alpha)\,W$).
  • Choose a global randomized Hadamard transform $Q$.
    • All activations between decoder blocks are rotated, therefore $Q^{\top}$ is merged into the weights that read the residual to rotate it back, and $Q$ into the weights that write to it (see the sketch below).
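
A toy sketch of the fusion on a simplified block (one RMSNorm plus a ReLU MLP), assuming row-vector activations; the names `W_in`/`W_out` and the use of a QR-derived orthogonal matrix instead of a randomized Hadamard are simplifications, not the paper's exact setup.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def block(x, W_in, W_out):
    h = np.maximum(rms_norm(x) @ W_in, 0.0)   # toy "MLP": linear, ReLU
    return x + h @ W_out                      # write back to the residual

rng = np.random.default_rng(0)
d, d_ff = 64, 256
X = rng.standard_normal((4, d))                    # residual stream
W_in = rng.standard_normal((d, d_ff)) * 0.1        # reads from the residual
W_out = rng.standard_normal((d_ff, d)) * 0.1       # writes to the residual
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # global rotation

# Rotated model: the residual lives as X @ Q, weights absorb Q^T / Q offline.
Y_ref = block(X, W_in, W_out)
Y_rot = block(X @ Q, Q.T @ W_in, W_out @ Q)
assert np.allclose(Y_rot, Y_ref @ Q, atol=1e-5)    # same output, just rotated
```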

Input Activation to Down Projection

  • Goal: apply incoherence processing on the input activations to down_proj.
  • Insert an online Hadamard transformation $H$ before down_proj.
  • Merge the other half of the Hadamard transformation into the down_proj weight: $W_{\text{down}} \leftarrow H^{\top} W_{\text{down}}$.
  • Apply quantization to the (now rotated) input of down_proj; a sketch follows below.
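
A sketch of the down_proj pattern with a crude per-tensor 4-bit round-to-nearest quantizer; the quantizer, shapes, and the injected outlier are illustrative, not the paper's exact scheme.

```python
import numpy as np
from scipy.linalg import hadamard

def quantize_rtn(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Crude per-tensor absmax round-to-nearest quantizer (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
d_ff, d = 512, 128
X = rng.standard_normal((4, d_ff))
X[:, 5] += 40.0                                   # an outlier feature channel
W_down = rng.standard_normal((d_ff, d)) * 0.05

H = hadamard(d_ff) / np.sqrt(d_ff)                # online Hadamard on the input
W_down_fused = H.T @ W_down                       # the other half, merged offline

Y_ref = X @ W_down
assert np.allclose((X @ H) @ W_down_fused, Y_ref, atol=1e-6)   # exact in full precision

# Quantizing the rotated input loses far less than quantizing the raw input,
# because the outlier channel has been spread across all dimensions.
err_raw = np.abs(quantize_rtn(X) @ W_down - Y_ref).mean()
err_rot = np.abs(quantize_rtn(X @ H) @ W_down_fused - Y_ref).mean()
print("quantization error raw:", err_raw, " rotated:", err_rot)
```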

Input Activation to Out Projection

  • Goal: apply incoherence processing on the input activations to o_proj.
  • An additional online Hadamard (Hadamard heads) is required to head-wise rotate the output of multi-head attention; a sketch follows below.
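
A sketch of a head-wise rotation, assuming (as in the down_proj case) that the inverse half is merged offline into the o_proj weight; the block-diagonal $I \otimes H_{d_{\text{head}}}$ form and all shapes here are illustrative.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
n_heads, d_head, d_model = 8, 64, 512
Z = rng.standard_normal((4, n_heads * d_head))          # concatenated head outputs
W_o = rng.standard_normal((n_heads * d_head, d_model)) * 0.05

H_head = hadamard(d_head) / np.sqrt(d_head)
H_heads = np.kron(np.eye(n_heads), H_head)              # rotate each head's dims only

W_o_fused = H_heads.T @ W_o                             # inverse half merged offline
Y_ref = Z @ W_o
Y_rot = (Z @ H_heads) @ W_o_fused                       # online head-wise rotation
assert np.allclose(Y_rot, Y_ref, atol=1e-6)             # o_proj output is unchanged
```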

KV-Cache

  • Due to the rotary positional embedding (RoPE), the Hadamard transformation cannot be merged into the query/key weights.
  • Apply an online Hadamard to queries and keys after RoPE; it cancels out in the attention scores since $(qH)(kH)^{\top} = qk^{\top}$.
  • Since K and V are both rotated, quantize them and store them in the KV cache in rotated form (see the sketch below).
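
A sketch of the cancellation for a single head; RoPE itself is omitted, and `q`, `k` stand for the post-RoPE tensors.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
seq, d_head = 16, 64
q = rng.standard_normal((seq, d_head))    # post-RoPE queries for one head
k = rng.standard_normal((seq, d_head))    # post-RoPE keys for one head
H = hadamard(d_head) / np.sqrt(d_head)    # head-wise online Hadamard

scores_ref = q @ k.T
scores_rot = (q @ H) @ (k @ H).T          # H @ H.T = I, so the scores match
assert np.allclose(scores_rot, scores_ref, atol=1e-6)

# (k @ H) is what gets quantized and written to the KV cache, together with
# the (already rotated) values.
```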