The sources below explore techniques for optimizing large language model (LLM) inference by attacking the memory bottleneck of the Key-Value (KV) cache. KVQuant introduces a low-precision quantization framework that combines per-channel key quantization, non-uniform datatypes, and sparse outlier handling to compress cached activations to sub-4-bit precision with minimal accuracy loss. KIVI proposes a tuning-free asymmetric 2-bit scheme that exploits the differing distributions of the key cache (quantized per channel) and the value cache (quantized per token) to increase throughput. Shifting from quantization to attention-level sparsity, DuoAttention distinguishes retrieval heads, which require the full context, from streaming heads, which it restricts to a constant-size cache of recent tokens and attention sinks. The remaining sources extend this line of work: QAQ adapts quantization quality across the KV cache, Coupled Quantization reaches roughly one bit per channel by quantizing channels jointly, and MILLION applies outlier-immunized product quantization to long-context inference. Together, these methods let LLMs serve million-token contexts on commodity hardware by drastically shrinking the memory footprint of the KV cache.
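As a rough illustration of why the grouping axis matters for KV quantization, here is a minimal numpy sketch (not any paper's actual kernel) of asymmetric uniform quantization applied per channel versus per token. The setup assumes a toy key cache with one large-magnitude channel, the kind of outlier structure KVQuant and KIVI both report in key caches:

```python
import numpy as np

def quantize(x, bits, axis):
    # Asymmetric uniform quantization with one (scale, zero-point)
    # pair per group along `axis`.
    qmax = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round((x - lo) / scale), 0, qmax)
    return q.astype(np.uint8), scale, lo

def dequantize(q, scale, lo):
    return q * scale + lo

# Toy key cache of shape (tokens, channels) with one outlier channel.
rng = np.random.default_rng(0)
K = rng.normal(size=(8, 16))
K[:, 3] *= 50.0  # outlier channel, typical of key caches

# Per-channel grouping (reduce over tokens): the outlier channel's
# large range is confined to its own quantization group.
q, s, z = quantize(K, bits=2, axis=0)
err_channel = np.abs(dequantize(q, s, z) - K).mean()

# Per-token grouping (reduce over channels): the outlier inflates
# every token's range, so all channels are quantized coarsely.
q, s, z = quantize(K, bits=2, axis=1)
err_token = np.abs(dequantize(q, s, z) - K).mean()

print(err_channel, err_token)
```

On this toy cache the per-channel error comes out far lower, which is the distributional argument both papers make for quantizing keys along the channel dimension.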
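The streaming-head policy from DuoAttention can also be sketched in a few lines. This is a hypothetical simplification, assuming a cache indexed by token position, that keeps only the first few attention-sink tokens plus a recent-token window and evicts everything in between:

```python
def evict_streaming(cache, num_sinks=4, window=8):
    # Streaming-head cache policy sketch: retain the first `num_sinks`
    # entries (attention sinks) plus the last `window` entries (recent
    # tokens); drop the middle so memory stays constant as the
    # sequence grows. `num_sinks` and `window` are illustrative values.
    if len(cache) <= num_sinks + window:
        return list(cache)
    return list(cache[:num_sinks]) + list(cache[-window:])

# Usage: a 100-token cache shrinks to 12 retained entries.
tokens = list(range(100))
kept = evict_streaming(tokens)
print(len(kept), kept[:4], kept[-3:])
```

Retrieval heads, by contrast, keep the full cache; the papers' contribution is identifying which heads can safely use the constant-memory policy above.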
Sources:
1) 2024
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
University of California, Berkeley, ICSI, LBNL
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
https://arxiv.org/pdf/2401.18079
2) 2024
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Rice University, Texas A&M University, Stevens Institute of Technology, Carnegie Mellon University
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen (Henry) Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
https://arxiv.org/pdf/2402.02750
3) 2024
QAQ: Quality Adaptive Quantization for LLM KV Cache
Nanjing University
Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang
https://arxiv.org/pdf/2403.04643
4) 2024
KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization
Rice University, Stevens Institute of Technology, ThirdAI Corp.
Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava
https://arxiv.org/pdf/2405.03917
5) 2024
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
MIT, Tsinghua University, SJTU, University of Edinburgh, NVIDIA
Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han
https://arxiv.org/pdf/2410.10819
6) 2025
MILLION: Mastering Long-Context LLM Inference Via Outlier-Immunized KV Product Quantization
Shanghai Jiao Tong University, Shanghai Qi Zhi Institute, Huawei Technologies Co., Ltd, China University of Petroleum-Beijing
Zongwu Wang, Peng Xu, Fangxin Liu, Yiwei Hu, Qingxiao Sun, Gezi Li, Cheng Li, Xuan Wang, Li Jiang, Haibing Guan
https://arxiv.org/pdf/2504.03661