This August 2025 paper introduces Oaken, an acceleration solution for serving Large Language Models (LLMs) that addresses the memory bandwidth and capacity bottlenecks inherent in batched LLM inference. Oaken achieves this through a co-designed algorithm and hardware architecture built around an online-offline hybrid Key-Value (KV) cache quantization technique. The technique reduces the memory footprint and access volume of the KV cache by separating values into "inliers" and "outliers" using thresholds determined through offline profiling, then applying group-shift quantization to the inliers. Oaken further integrates custom quantization/dequantization engines and memory management units into LLM accelerators so that the algorithmic savings translate into end-to-end gains, delivering higher throughput with minimal accuracy loss compared to existing methods.
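
The sketch below illustrates the general shape of such an online-offline hybrid scheme: an offline pass profiles calibration KV activations to pick outlier thresholds, and an online pass splits each KV group at those thresholds, quantizing inliers with a group-wise shift and scale while keeping the small outlier set in higher precision. This is a simplified illustration, not the paper's exact algorithm; the bit width, percentile-based thresholds, group handling, and all function names are assumptions made for this example.

```python
# Minimal sketch of an online-offline hybrid KV-cache quantization scheme.
# Assumed (not from the paper): 4-bit inliers, percentile thresholds,
# FP16 outliers, and all function/parameter names.
import numpy as np

def profile_thresholds(calibration_kv, outlier_pct=1.0):
    """Offline: choose lower/upper thresholds from calibration KV activations
    so roughly `outlier_pct` percent of values fall outside them."""
    lo = np.percentile(calibration_kv, outlier_pct / 2)
    hi = np.percentile(calibration_kv, 100 - outlier_pct / 2)
    return lo, hi

def quantize_group(kv_group, lo, hi, bits=4):
    """Online: split one KV-cache group into inliers/outliers, then apply
    group-shift quantization to the inliers (shift by the group minimum,
    scale to the integer range). Outliers are stored separately in FP16."""
    flat = kv_group.ravel()
    outlier_mask = (flat < lo) | (flat > hi)
    inliers = np.clip(flat, lo, hi)

    shift = inliers.min()                                   # group-wise shift
    scale = max((inliers.max() - shift) / (2**bits - 1), 1e-8)
    q = np.round((inliers - shift) / scale).astype(np.uint8)

    return {
        "q_inliers": q, "shift": shift, "scale": scale,
        "outlier_idx": np.flatnonzero(outlier_mask),
        "outlier_vals": flat[outlier_mask].astype(np.float16),
    }

def dequantize_group(packed, shape):
    """Reconstruct the group: dequantize inliers, then scatter outliers back."""
    out = packed["q_inliers"].astype(np.float32) * packed["scale"] + packed["shift"]
    out[packed["outlier_idx"]] = packed["outlier_vals"].astype(np.float32)
    return out.reshape(shape)
```

In a full system, the online split-and-quantize step is what dedicated quantization/dequantization engines would perform in hardware, since it sits on the critical path of every KV-cache write and read during decoding.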
Source:
https://arxiv.org/html/2503.18599v2