
Description

The provided sources explore advanced techniques for optimizing large language model (LLM) inference by addressing the memory bottleneck of the Key-Value (KV) cache. KVQuant introduces a low-precision quantization framework that combines per-channel key quantization, non-uniform datatypes, and sparse outlier handling to compress cached activations to sub-4-bit precision with minimal accuracy loss. The KIVI algorithm takes a tuning-free approach that exploits the differing distributions of the key and value caches, quantizing keys per channel and values per token to reach 2-bit precision and increase throughput. Shifting from quantization to attention-level sparsity, DuoAttention distinguishes retrieval heads, which require the full context, from streaming heads, which it reduces to constant memory usage by attending only to recent tokens and initial attention sinks. Together, these methods enable LLMs to process million-token context lengths on standard hardware by drastically shrinking the memory footprint of stored activations.
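The key-vs.-value distinction mentioned above rests on an empirical observation reported in these papers: key activations concentrate their outliers in a handful of channels, so grouping quantization parameters per channel isolates those outliers, while per-token grouping lets one outlier channel inflate the quantization range of an entire row. The sketch below illustrates that effect with plain NumPy; asym_quantize and the synthetic data are illustrative constructions, not code from any of the papers.

```python
import numpy as np

def asym_quantize(x, bits, axis):
    """Asymmetric uniform quantization with one (scale, zero-point) pair per
    slice along `axis` -- a simplified stand-in for the grouping schemes
    discussed in KVQuant/KIVI. Returns the dequantized tensor so the
    round-trip error can be measured."""
    levels = 2**bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / levels
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant slices
    q = np.round((x - lo) / scale)
    return q * scale + lo

rng = np.random.default_rng(0)
# Synthetic key cache of shape (tokens, channels): a few channels carry
# large-magnitude outliers, mimicking the distribution the papers report.
keys = rng.normal(size=(128, 64))
keys[:, :4] *= 20.0

per_channel = asym_quantize(keys, bits=2, axis=0)  # one scale per channel
per_token = asym_quantize(keys, bits=2, axis=1)    # one scale per token

err_channel = np.abs(per_channel - keys).mean()
err_token = np.abs(per_token - keys).mean()
# Per-channel grouping confines the outlier channels' large quantization
# range to those channels; per-token grouping spreads it across every row.
assert err_channel < err_token
```

Running the comparison with the axes swapped on a value-like tensor (outliers scattered rather than channel-aligned) reverses the preference, which is why KIVI applies the two groupings asymmetrically.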
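The constant-memory behavior of DuoAttention's streaming heads comes from a simple cache policy: retain the KV pairs of the first few "attention sink" tokens plus a sliding window of the most recent tokens, evicting everything in between. A minimal sketch of that policy, assuming illustrative names and defaults (StreamingHeadCache, n_sinks, window are not from the paper's code):

```python
from collections import deque

class StreamingHeadCache:
    """Bounded KV cache for a streaming attention head: keeps a few initial
    'attention sink' tokens plus a sliding window of recent tokens, so memory
    stays constant regardless of sequence length."""

    def __init__(self, n_sinks=4, window=1024):
        self.n_sinks = n_sinks
        self.sinks = []                     # KV pairs of the first tokens
        self.recent = deque(maxlen=window)  # sliding window of recent KV pairs

    def append(self, kv):
        if len(self.sinks) < self.n_sinks:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)  # deque evicts the oldest entry itself

    def visible(self):
        """KV pairs this head attends over at the current step."""
        return self.sinks + list(self.recent)

cache = StreamingHeadCache(n_sinks=2, window=3)
for t in range(10):
    cache.append(t)  # stand-in for a (key, value) tensor pair
# Memory stays bounded: sink tokens 0-1 plus the three most recent tokens.
assert cache.visible() == [0, 1, 7, 8, 9]
```

Retrieval heads keep the full cache; only heads classified as streaming adopt this eviction rule, which is how the method cuts memory without losing long-range retrieval ability.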

Sources:

1) 2024

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

University of California, Berkeley, ICSI, LBNL

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami

https://arxiv.org/pdf/2401.18079

2) 2024

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Rice University, Texas A&M University, Stevens Institute of Technology, Carnegie Mellon University

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen (Henry) Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu

https://arxiv.org/pdf/2402.02750

3) 2024

QAQ: Quality Adaptive Quantization for LLM KV Cache

Nanjing University

Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang

https://arxiv.org/pdf/2403.04643

4) May 8, 2024

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization

Rice University, Stevens Institute of Technology, ThirdAI Corp.

Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava

https://arxiv.org/pdf/2405.03917

5) 2024

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

MIT, Tsinghua University, SJTU, University of Edinburgh, NVIDIA

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han

https://arxiv.org/pdf/2410.10819

6) 2025

MILLION: Mastering Long-Context LLM Inference via Outlier-Immunized KV Product Quantization

Shanghai Jiao Tong University, Shanghai Qi Zhi Institute, Huawei Technologies Co., Ltd, China University of Petroleum-Beijing

Zongwu Wang, Peng Xu, Fangxin Liu, Yiwei Hu, Qingxiao Sun, Gezi Li, Cheng Li, Xuan Wang, Li Jiang, Haibing Guan

https://arxiv.org/pdf/2504.03661