Rate-distortion optimization for transformer inference
Published on arXiv, 2026
Transformers achieve superior performance on many tasks but impose heavy compute and memory requirements during inference. Inference can be made more efficient by partitioning the model across multiple devices, which in turn requires compressing the intermediate representations exchanged between them. We introduce a principled rate-distortion framework for lossy compression that learns compact encodings which explicitly trade bitrate for task accuracy. Experiments on language benchmarks show that the simplest of the proposed codecs achieves substantial rate savings, outperforming more complex methods. We characterize and analyze the rate-distortion behaviour of transformers, offering a unified lens for understanding performance in representation coding. The formulation extends information-theoretic concepts to derive bounds on the achievable rate of learnable codecs. Across different architectures and tasks, we empirically demonstrate that the achieved rates follow these bounds, adding to the explainability of the framework.
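At its core, such a framework minimizes the standard rate-distortion Lagrangian: the task loss (distortion) plus a λ-weighted estimate of the bitrate of the compressed representations. Below is a minimal sketch of one way this can look for intermediate activations, assuming additive-uniform-noise quantization and a per-channel factorized Gaussian entropy model; the class and function names are illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ActivationCodec(nn.Module):
    """Toy codec for intermediate transformer activations (illustrative only).

    Quantization is simulated with additive uniform noise, and the rate is
    estimated by a per-channel factorized Gaussian entropy model, as is
    common in learned compression. Not the paper's actual implementation.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Learnable per-channel scale of the entropy model.
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, z: torch.Tensor):
        # Simulate rounding to the integer grid during training.
        z_hat = z + torch.empty_like(z).uniform_(-0.5, 0.5)
        # Probability mass of each quantization bin under the entropy model.
        prior = torch.distributions.Normal(0.0, self.scale.abs() + 1e-6)
        p = prior.cdf(z_hat + 0.5) - prior.cdf(z_hat - 0.5)
        # Rate: average bits per element needed to entropy-code z_hat.
        rate_bits = -torch.log2(p.clamp_min(1e-9)).mean()
        return z_hat, rate_bits

def rd_loss(task_loss: torch.Tensor, rate_bits: torch.Tensor, lmbda: float = 0.05):
    # Rate-distortion Lagrangian: distortion (task loss) + lambda * rate.
    return task_loss + lmbda * rate_bits
```

Training minimizes `rd_loss` end-to-end; sweeping `lmbda` traces out an operational rate-distortion curve for a given split point.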
Code available at: https://github.com/adeandrade/research
Recommended citation: Anderson de Andrade, Alon Harell, & Ivan Bajić. (2026). "Rate-distortion optimization for transformer inference." arXiv:2601.22002.
Download Paper | Download BibTeX
