This January 2025 paper introduces HybridServe, an LLM inference system that improves throughput and cost-effectiveness by optimizing memory usage and host-GPU communication. It targets the core weakness of host memory offloading: storing model parameters and the KV cache in cheaper, slower host memory cuts costs, but the limited transfer bandwidth between host and GPU leaves the GPU underutilized while it waits for data. HybridServe's answer is an activation checkpointing technique paired with a KV-Activation hybrid caching scheme: instead of offloading the full KV cache, it stores intermediate activations, from which the KV cache can be recomputed on the GPU while model parameters stream in, hiding the recomputation under the transfer. The system dynamically balances communication overhead against recomputation time to maximize throughput, and the authors report significant improvements over state-of-the-art offloading systems such as FlexGen.
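To make the mechanism concrete, here is a minimal sketch of the hybrid caching idea in plain PyTorch. All names (`recompute_kv`, `layer_step`, `should_recompute`) are illustrative assumptions, not the paper's API; the goal is only to show (a) rebuilding K/V on the GPU from a checkpointed layer input and (b) overlapping that recomputation with the host-to-GPU parameter copy on a separate CUDA stream, which is where the bandwidth savings come from.

```python
import torch

def recompute_kv(activation, w_k, w_v):
    """Rebuild K and V on the GPU from a checkpointed layer input X.

    For standard multi-head attention, X is one tensor while K and V are
    two tensors of the same shape, so caching X instead of (K, V) roughly
    halves host<->GPU cache traffic (illustrative; ratios differ under
    grouped-query attention). The matmuls trade cheap GPU compute for
    scarce PCIe bandwidth.
    """
    return activation @ w_k, activation @ w_v

def layer_step(x_host, w_k, w_v, next_weights_host, copy_stream):
    """One decode step for a layer whose successor's weights live on host.

    x_host and next_weights_host must be pinned CPU tensors for the
    non_blocking copies to actually run asynchronously.
    """
    # Stream the next layer's weights over PCIe on a side stream so the
    # copy overlaps with the KV recomputation issued below.
    with torch.cuda.stream(copy_stream):
        next_weights = next_weights_host.to("cuda", non_blocking=True)

    # Meanwhile, on the default stream: fetch the (smaller) checkpointed
    # activation and recompute K/V from it.
    x = x_host.to("cuda", non_blocking=True)
    k, v = recompute_kv(x, w_k, w_v)

    # Join the streams before the next layer consumes its weights.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return k, v, next_weights

def should_recompute(kv_bytes, act_bytes, pcie_bw, recompute_secs):
    """Crude stand-in for the paper's dynamic balancing: take the hybrid
    path only when transferring the activation plus recomputing beats
    shipping the full KV cache. (Recomputation that hides entirely under
    the weight transfer is effectively free, tilting the balance further
    toward recomputing.)"""
    return act_bytes / pcie_bw + recompute_secs < kv_bytes / pcie_bw
```

In the real system this trade-off is decided dynamically per workload; the point of the sketch is simply that the GPU stays busy rebuilding KV entries during the window it would otherwise spend idle on parameter transfers.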
Source:
https://arxiv.org/pdf/2501.01792