Summary

  • Identify the optimal affine transformation for each linear layer.
  • Block-wise training strategy.
  • Kronecker-product decomposition to keep the transformations cheap.

Methodology

  • Use a Kronecker product instead of a full-size transformation matrix to reduce computation (see the first sketch after this list).
  • Learnable per-channel scaling merged into the weight.
  • Learnable clipping thresholds (both are sketched in the second code block below).
  • Decoder-block-level training with an MSE loss against the full-precision block's output (third sketch below).
  • Self-attention:
    • The input transformations for the query/key/value projections are decomposed.
    • The head-wise transformations are kept in their original shape, since per-head quantization already makes these small, head-dim-sized transformations cheap.
    • The value and output-projection transformations are fused (final sketch below).
  • MLP: the input transformations for the up/gate projections and for the down projection are both decomposed.
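
A minimal PyTorch sketch of the Kronecker trick referenced above: instead of multiplying by a full d × d transformation matrix P, store two small factors with P = P1 ⊗ P2 and apply them as two reshaped matmuls. The function name, factor sizes, and shapes are illustrative assumptions, not the paper's code.

```python
import torch

def kronecker_transform(x: torch.Tensor, P1: torch.Tensor, P2: torch.Tensor) -> torch.Tensor:
    """Compute x @ (P1 ⊗ P2) for x of shape (..., d1 * d2) without
    materializing the d x d Kronecker product.

    Uses the identity x @ (P1 ⊗ P2) = vec(P1^T @ X @ P2), where X is x
    reshaped row-major to (d1, d2). Per-token cost drops from O(d^2)
    to O(d * (d1 + d2)).
    """
    d1, d2 = P1.shape[0], P2.shape[0]
    x = x.reshape(*x.shape[:-1], d1, d2)   # (..., d1, d2)
    x = P1.T @ x @ P2                      # two small matmuls instead of one huge one
    return x.reshape(*x.shape[:-2], d1 * d2)

# Example: a 4096-dim hidden size decomposed as 64 x 64 (an assumed split).
x = torch.randn(2, 8, 4096)
P1, P2 = torch.randn(64, 64), torch.randn(64, 64)
y = kronecker_transform(x, P1, P2)
```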
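A hedged sketch of the two learnable ingredients above: a per-channel scale that cancels between activation and weight (so it can be folded into the stored weight once training ends) and learnable clipping factors applied before symmetric fake-quantization. The parameterization, with `alpha` as a fraction of the absolute max, is an assumption for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Module):
    def __init__(self, weight: torch.Tensor, n_bits: int = 4):
        super().__init__()
        self.weight = nn.Parameter(weight.clone())               # (out, in)
        self.scale = nn.Parameter(torch.ones(weight.shape[1]))   # learnable per-channel scaling
        self.alpha_a = nn.Parameter(torch.tensor(1.0))           # activation clipping factor
        self.alpha_w = nn.Parameter(torch.tensor(1.0))           # weight clipping factor
        self.qmax = 2 ** (n_bits - 1) - 1

    def fake_quant(self, t: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        clip = alpha * t.abs().max().detach()    # learnable clipping threshold
        step = clip / self.qmax
        q = torch.clamp(torch.round(t / step), -self.qmax - 1, self.qmax)
        q = (q - t / step).detach() + t / step   # straight-through estimator for round()
        return q * step

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The scale divides the activation and multiplies the weight columns,
        # so the float product is unchanged; after training it is merged into
        # the stored weight and costs nothing extra at inference.
        x_q = self.fake_quant(x / self.scale, self.alpha_a)
        w_q = self.fake_quant(self.weight * self.scale, self.alpha_w)
        return x_q @ w_q.T
```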
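A minimal sketch of the block-level calibration loop: for each decoder block, only the newly introduced parameters are trained to minimize the MSE between the quantized block's output and the frozen full-precision block's output on calibration data. The parameter names (`P1`, `P2`, `scale`, `alpha`), optimizer, and hyperparameters are hypothetical.

```python
import torch

def train_block(fp_block, q_block, calib_inputs, steps: int = 1000, lr: float = 5e-3):
    fp_block.eval()
    # Optimize only the learnable transforms/scales/clipping thresholds;
    # the original pretrained weights stay frozen.
    trainable = [p for n, p in q_block.named_parameters()
                 if any(k in n for k in ("P1", "P2", "scale", "alpha"))]
    opt = torch.optim.AdamW(trainable, lr=lr)
    for step in range(steps):
        x = calib_inputs[step % len(calib_inputs)]
        with torch.no_grad():
            target = fp_block(x)              # full-precision reference output
        out = q_block(x)                      # quantized block with learnable transforms
        loss = torch.nn.functional.mse_loss(out, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q_block
```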
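Finally, a small numerical check of the fusion noted above: if each value head is right-multiplied by an invertible per-head matrix, the attention output inherits the same block-diagonal transform (attention weights multiply from the left), so its inverse can be folded into the output projection once, offline, at no runtime cost. Shapes and names here are assumptions.

```python
import torch

torch.manual_seed(0)
n_heads, head_dim, d_model, seq = 8, 64, 512, 10
dtype = torch.float64

P_v = torch.linalg.qr(torch.randn(head_dim, head_dim, dtype=dtype))[0]  # invertible per-head transform
A = torch.softmax(torch.randn(n_heads, seq, seq, dtype=dtype), dim=-1)  # attention weights
V = torch.randn(n_heads, seq, head_dim, dtype=dtype)                    # per-head value states
W_o = torch.randn(n_heads * head_dim, d_model, dtype=dtype)             # output projection

# Transformed path: flatten each value head with P_v, then undo it
# inside a pre-fused W_o (block-diagonal inverse merged offline).
out_t = (A @ (V @ P_v)).transpose(0, 1).reshape(seq, -1)
W_o_fused = torch.kron(torch.eye(n_heads, dtype=dtype), P_v.inverse()) @ W_o

baseline = (A @ V).transpose(0, 1).reshape(seq, -1) @ W_o
assert torch.allclose(out_t @ W_o_fused, baseline)
```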