Summary

  • Introduced a constant forgetting factor $\gamma$.
  • Derived parallel and recurrent representations.
  • Derived chunk-wise block-parallel form for efficient training.

Motivation

  • Linear attention
    • Replaces $\mathrm{softmax}(QK^\top)$ with a kernel feature map $\phi(Q)\,\phi(K)^\top$.
    • Worse performance than transformers.
  • Recurrent models
    • Sacrifice training parallelism.
  • State space models
    • Training parallelism & performance.

Methodology

Goal: support both recurrence and parallelism.

Starting from the linear recurrence

$$s_n = A s_{n-1} + K_n^\top v_n, \qquad o_n = Q_n s_n = \sum_{m=1}^{n} Q_n A^{n-m} K_m^\top v_m,$$

the transition matrix $A$ is diagonalized as $A = \Lambda (\gamma e^{i\theta}) \Lambda^{-1}$, so that $A^{n-m} = \Lambda (\gamma e^{i\theta})^{n-m} \Lambda^{-1}$.

$\Lambda$ can be absorbed into $W_Q$ and $W_K$. Therefore, the formula can be simplified as

$$o_n = \sum_{m=1}^{n} Q_n (\gamma e^{i\theta})^{n-m} K_m^\dagger v_m = \sum_{m=1}^{n} \gamma^{n-m} \left(Q_n e^{i n \theta}\right) \left(K_m e^{i m \theta}\right)^\dagger v_m,$$

i.e., treating $\gamma$ as a scalar, the parallel form $\mathrm{Retention}(X) = (Q K^\top \odot D)\,V$ with $D_{nm} = \gamma^{n-m}$ for $n \ge m$ and $0$ otherwise.
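As a concrete illustration (not the paper's reference code), here is a minimal single-head numpy sketch of the parallel form, assuming $\gamma$ is a real scalar and ignoring the $e^{i\theta}$ rotation and any normalization; the function name `retention_parallel` is illustrative.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel form: (Q K^T * D) V with causal decay mask D[n, m] = gamma^(n-m)."""
    n = Q.shape[0]
    idx = np.arange(n)
    # D[n, m] = gamma^(n-m) for n >= m, 0 above the diagonal (causal)
    D = np.tril(gamma ** (idx[:, None] - idx[None, :]))
    return (Q @ K.T * D) @ V

# toy usage
rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.normal(size=(3, n, d))
out = retention_parallel(Q, K, V, gamma=0.9)
print(out.shape)  # (8, 4)
```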

Or, in recurrent form,

$$S_n = \gamma S_{n-1} + K_n^\top V_n, \qquad \mathrm{Retention}(X_n) = Q_n S_n.$$
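Likewise, a minimal sketch of the recurrent form and of a chunk-wise block-parallel variant (parallel within each chunk, a single decayed state carried across chunks), under the same simplifying assumptions as the previous sketch; `retention_recurrent` and `retention_chunkwise` are illustrative names, and the two agree numerically with the parallel form.

```python
import numpy as np

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n,  o_n = Q_n S_n."""
    n, d = Q.shape
    S = np.zeros((d, d))
    out = np.zeros_like(V)
    for t in range(n):
        S = gamma * S + np.outer(K[t], V[t])  # decay old state, add current key/value
        out[t] = Q[t] @ S                     # read out with the current query
    return out

def retention_chunkwise(Q, K, V, gamma, chunk):
    """Chunk-wise form: parallel inside each chunk, recurrent state across chunks."""
    n, d = Q.shape
    assert n % chunk == 0, "sketch assumes the length is a multiple of the chunk size"
    S = np.zeros((d, d))                              # cross-chunk state
    out = np.zeros_like(V)
    j = np.arange(chunk)
    D = np.tril(gamma ** (j[:, None] - j[None, :]))   # intra-chunk decay mask
    for start in range(0, n, chunk):
        q, k, v = Q[start:start+chunk], K[start:start+chunk], V[start:start+chunk]
        inner = (q @ k.T * D) @ v                     # within-chunk term (parallel)
        cross = (gamma ** (j + 1))[:, None] * (q @ S) # contribution of earlier chunks
        out[start:start+chunk] = inner + cross
        # state update: decay by gamma^chunk and absorb this chunk's keys/values
        S = gamma**chunk * S + (k * (gamma ** (chunk - 1 - j))[:, None]).T @ v
    return out

# the two forms agree on a toy input
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))
r = retention_recurrent(Q, K, V, gamma=0.9)
c = retention_chunkwise(Q, K, V, gamma=0.9, chunk=4)
print(np.allclose(r, c))  # True
```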