Summary

  • Identify the optimal affine transformation for each linear layer.
  • Block-wise training strategy.
  • Kronecker-product decomposition to keep the transformations cheap.

Methodology

  • Use a Kronecker product instead of a full-size transformation matrix to reduce computation (see the first sketch after this list).
  • Learnable per-channel scaling merged into the weight.
  • Learnable clipping thresholds (both are sketched in the second code block below).
  • Decoder-block-level training with an MSE loss against the full-precision block's output (third sketch below).
  • Self-attention:
    • The input transformations for the query/key/value projections are decomposed.
    • The head-wise transformations are kept in their original shape, since per-head quantization already makes these small, head-dim-sized transformations cheap.
    • The value and output-projection transformations are fused (final sketch below).
  • MLP: the input transformations for the up/gate projections and for the down projection are both decomposed.
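
A minimal PyTorch sketch of the Kronecker trick referenced above: instead of multiplying by a full d × d transformation matrix P, store two small factors with P = P1 ⊗ P2 and apply them as two reshaped matmuls. The function name, factor sizes, and shapes are illustrative assumptions, not the paper's code.

```python
import torch

def kronecker_transform(x: torch.Tensor, P1: torch.Tensor, P2: torch.Tensor) -> torch.Tensor:
    """Compute x @ (P1 ⊗ P2) for x of shape (..., d1 * d2) without
    materializing the d x d Kronecker product.

    Uses the identity x @ (P1 ⊗ P2) = vec(P1^T @ X @ P2), where X is x
    reshaped row-major to (d1, d2). Per-token cost drops from O(d^2)
    to O(d * (d1 + d2)).
    """
    d1, d2 = P1.shape[0], P2.shape[0]
    x = x.reshape(*x.shape[:-1], d1, d2)   # (..., d1, d2)
    x = P1.T @ x @ P2                      # two small matmuls instead of one huge one
    return x.reshape(*x.shape[:-2], d1 * d2)

# Example: a 4096-dim hidden size decomposed as 64 x 64 (an assumed split).
x = torch.randn(2, 8, 4096)
P1, P2 = torch.randn(64, 64), torch.randn(64, 64)
y = kronecker_transform(x, P1, P2)
```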
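A hedged sketch of the two learnable ingredients above: a per-channel scale that cancels between activation and weight (so it can be folded into the stored weight once training ends) and learnable clipping factors applied before symmetric fake-quantization. The parameterization, with `alpha` as a fraction of the absolute max, is an assumption for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Module):
    def __init__(self, weight: torch.Tensor, n_bits: int = 4):
        super().__init__()
        self.weight = nn.Parameter(weight.clone())               # (out, in)
        self.scale = nn.Parameter(torch.ones(weight.shape[1]))   # learnable per-channel scaling
        self.alpha_a = nn.Parameter(torch.tensor(1.0))           # activation clipping factor
        self.alpha_w = nn.Parameter(torch.tensor(1.0))           # weight clipping factor
        self.qmax = 2 ** (n_bits - 1) - 1

    def fake_quant(self, t: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        clip = alpha * t.abs().max().detach()    # learnable clipping threshold
        step = clip / self.qmax
        q = torch.clamp(torch.round(t / step), -self.qmax - 1, self.qmax)
        q = (q - t / step).detach() + t / step   # straight-through estimator for round()
        return q * step

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The scale divides the activation and multiplies the weight columns,
        # so the float product is unchanged; after training it is merged into
        # the stored weight and costs nothing extra at inference.
        x_q = self.fake_quant(x / self.scale, self.alpha_a)
        w_q = self.fake_quant(self.weight * self.scale, self.alpha_w)
        return x_q @ w_q.T
```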
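A minimal sketch of the block-level calibration loop: for each decoder block, only the newly introduced parameters are trained to minimize the MSE between the quantized block's output and the frozen full-precision block's output on calibration data. The parameter names (`P1`, `P2`, `scale`, `alpha`), optimizer, and hyperparameters are hypothetical.

```python
import torch

def train_block(fp_block, q_block, calib_inputs, steps: int = 1000, lr: float = 5e-3):
    fp_block.eval()
    # Optimize only the learnable transforms/scales/clipping thresholds;
    # the original pretrained weights stay frozen.
    trainable = [p for n, p in q_block.named_parameters()
                 if any(k in n for k in ("P1", "P2", "scale", "alpha"))]
    opt = torch.optim.AdamW(trainable, lr=lr)
    for step in range(steps):
        x = calib_inputs[step % len(calib_inputs)]
        with torch.no_grad():
            target = fp_block(x)              # full-precision reference output
        out = q_block(x)                      # quantized block with learnable transforms
        loss = torch.nn.functional.mse_loss(out, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q_block
```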
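Finally, a small numerical check of the fusion noted above: if each value head is right-multiplied by an invertible per-head matrix, the attention output inherits the same block-diagonal transform (attention weights multiply from the left), so its inverse can be folded into the output projection once, offline, at no runtime cost. Shapes and names here are assumptions.

```python
import torch

torch.manual_seed(0)
n_heads, head_dim, d_model, seq = 8, 64, 512, 10
dtype = torch.float64

P_v = torch.linalg.qr(torch.randn(head_dim, head_dim, dtype=dtype))[0]  # invertible per-head transform
A = torch.softmax(torch.randn(n_heads, seq, seq, dtype=dtype), dim=-1)  # attention weights
V = torch.randn(n_heads, seq, head_dim, dtype=dtype)                    # per-head value states
W_o = torch.randn(n_heads * head_dim, d_model, dtype=dtype)             # output projection

# Transformed path: flatten each value head with P_v, then undo it
# inside a pre-fused W_o (block-diagonal inverse merged offline).
out_t = (A @ (V @ P_v)).transpose(0, 1).reshape(seq, -1)
W_o_fused = torch.kron(torch.eye(n_heads, dtype=dtype), P_v.inverse()) @ W_o

baseline = (A @ V).transpose(0, 1).reshape(seq, -1) @ W_o
assert torch.allclose(out_t @ W_o_fused, baseline)
```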