This April 2025 paper introduces ShadowKV, a high-throughput inference system for long-context Large Language Models (LLMs) that supports larger batch sizes without sacrificing accuracy. It does so by restructuring the Key-Value (KV) cache: a low-rank representation of the pre-Rotary Position Embedding (RoPE) key cache is kept on the GPU, while the value cache is offloaded to the CPU. To minimize decoding latency, ShadowKV uses an accurate KV selection strategy that reconstructs only a minimal set of sparse KV pairs on the fly. Across various LLMs and long-context benchmarks, ShadowKV supports up to 6x larger batch sizes and boosts throughput by up to 3.04x on an A100 GPU, even surpassing the performance attainable with an infinite batch size under the assumption of infinite GPU memory.
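A minimal sketch of the core idea, assuming a single attention head and PyTorch: the pre-RoPE key cache is factorized with a truncated SVD and only the factors stay on the GPU, the value cache lives on the CPU, and each decode step selects a small set of positions, reconstructs their keys (applying RoPE afterwards), and fetches only the matching values from the CPU. The per-token top-k selection and the `apply_rope`, `compress_pre_rope_keys`, and `decode_step` helpers are illustrative simplifications, not the paper's chunk-level landmark selection or fused CUDA kernels.

```python
import torch


def apply_rope(x: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
    """Standard interleaved rotary embedding for keys of shape [n, head_dim]."""
    d = x.shape[-1]
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2, device=x.device).float() / d))
    angles = positions.float()[:, None] * inv_freq[None, :]      # [n, d/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2].float(), x[..., 1::2].float()
    out = torch.empty_like(x, dtype=torch.float32)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out.to(x.dtype)


def compress_pre_rope_keys(k_pre_rope: torch.Tensor, rank: int):
    """Truncated SVD of the pre-RoPE key cache.

    k_pre_rope: [seq_len, head_dim] keys *before* rotary embedding.
    Returns A: [seq_len, rank] and B: [rank, head_dim]; only these factors
    are kept on the GPU, and A @ B approximates the full key cache.
    """
    u, s, vh = torch.linalg.svd(k_pre_rope.float(), full_matrices=False)
    a = (u[:, :rank] * s[:rank]).to(k_pre_rope.dtype)
    b = vh[:rank].to(k_pre_rope.dtype)
    return a, b


def decode_step(query: torch.Tensor, a: torch.Tensor, b: torch.Tensor,
                v_cpu: torch.Tensor, top_k: int = 64) -> torch.Tensor:
    """One decode step with on-the-fly sparse KV reconstruction (single head).

    query: [head_dim] post-RoPE query for the current token.
    v_cpu: [seq_len, head_dim] value cache held in (ideally pinned) CPU memory.
    """
    seq_len, head_dim = a.shape[0], b.shape[1]
    positions = torch.arange(seq_len, device=a.device)

    # Score all positions against approximate keys to pick the important ones.
    # (ShadowKV selects at chunk granularity via landmarks; per-token top-k is
    # a simplification for illustration.)
    k_approx = apply_rope(a @ b, positions)                       # [seq, head_dim]
    idx = torch.topk(k_approx @ query, k=min(top_k, seq_len)).indices

    # Reconstruct only the selected keys and fetch only the selected values.
    k_sel = apply_rope(a[idx] @ b, idx)                           # [k, head_dim]
    v_sel = v_cpu[idx.cpu()].to(query.device, non_blocking=True)  # [k, head_dim]

    attn = torch.softmax((k_sel @ query) / head_dim ** 0.5, dim=0)
    return attn @ v_sel                                           # [head_dim]
```

The factorization is applied before RoPE because, as the paper observes, the pre-RoPE key cache is markedly more low-rank than the post-RoPE one; applying RoPE only to the handful of reconstructed keys keeps the approximation cheap while leaving the attention computation itself unchanged.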
Source:
https://arxiv.org/pdf/2410.21465