This 2023 paper introduces FlexGen, a high-throughput generation engine designed to overcome the substantial compute and memory demands of large language model (LLM) inference on limited hardware, specifically a single commodity GPU. FlexGen aggregates memory and computation across the GPU, CPU, and disk, pairing a block-based schedule that reuses loaded weights across batches with a linear-programming-based policy search that decides where tensors are stored and how they are accessed. FlexGen also compresses both the model weights and the attention key-value (KV) cache to 4 bits, which significantly reduces the memory footprint with negligible accuracy loss. The evaluation shows FlexGen achieving substantially higher throughput than existing offloading systems, even running a model as large as OPT-175B on a single 16GB GPU (an NVIDIA T4) at a generation throughput of about 1 token/s with an effective batch size of 144.
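To make the policy search concrete, here is a minimal, self-contained sketch of the kind of placement optimization involved. It brute-forces weight-placement fractions across GPU, CPU, and disk under a toy transfer-cost model; all sizes, bandwidths, and the cost model itself are illustrative assumptions of mine. The actual FlexGen policy space is richer (GPU batch size, block size, and placements for weights, activations, and the KV cache), and the paper optimizes its cost model with linear programming rather than enumeration.

```python
from itertools import product

# Hypothetical per-layer sizes and link bandwidths (illustrative, not from the paper).
WEIGHT_BYTES = 2 * 400e6                       # ~400M fp16 parameters per layer
GPU_MEM = 12e9                                 # bytes available for weights on the GPU
CPU_MEM = 100e9                                # bytes available for weights in CPU RAM
BW = {"cpu_to_gpu": 12e9, "disk_to_cpu": 2e9}  # bytes/second per link

def layer_latency(wg: float, wc: float, wd: float) -> float:
    """Time to materialize one layer's weights on the GPU, given the
    fractions (wg, wc, wd) resident on GPU, CPU, and disk."""
    t_cpu = wc * WEIGHT_BYTES / BW["cpu_to_gpu"]
    # Disk-resident weights take two hops: disk -> CPU -> GPU.
    t_disk = wd * WEIGHT_BYTES / BW["disk_to_cpu"] + wd * WEIGHT_BYTES / BW["cpu_to_gpu"]
    return t_cpu + t_disk  # the GPU-resident fraction needs no transfer

def search_policy(n_layers: int = 96, step: float = 0.1):
    """Enumerate placement fractions, keep the feasible one with lowest cost."""
    fracs = [round(i * step, 2) for i in range(int(1 / step) + 1)]
    best = None
    for wg, wc in product(fracs, fracs):
        wd = round(1.0 - wg - wc, 2)
        if wd < 0:
            continue  # fractions must sum to 1
        if wg * WEIGHT_BYTES * n_layers > GPU_MEM:
            continue  # violates GPU memory budget
        if wc * WEIGHT_BYTES * n_layers > CPU_MEM:
            continue  # violates CPU memory budget
        t = n_layers * layer_latency(wg, wc, wd)
        if best is None or t < best[0]:
            best = (t, wg, wc, wd)
    return best  # (total seconds, gpu frac, cpu frac, disk frac)

print(search_policy())
```

The shape of the result matches the paper's intuition: keep as much as fits on the fastest tier, spill the remainder down the hierarchy, and let the cost model arbitrate the split.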
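The compression side is group-wise asymmetric quantization: each group of contiguous values (the paper uses a group size of 64) is rescaled by its own min and max before rounding to 4 bits. The NumPy round-trip below sketches that idea; the function names and the storage format (one uint8 per 4-bit value, rather than packing two values per byte) are simplifications of mine, not FlexGen's actual kernels.

```python
import numpy as np

def quantize_4bit_groupwise(x: np.ndarray, group_size: int = 64):
    """Group-wise 4-bit min-max quantization (simplified sketch).
    Assumes x.size is divisible by group_size."""
    flat = x.reshape(-1, group_size)
    mins = flat.min(axis=1, keepdims=True)
    maxs = flat.max(axis=1, keepdims=True)
    scales = (maxs - mins) / 15.0                 # 4 bits -> 16 levels (0..15)
    scales = np.where(scales == 0, 1.0, scales)   # guard constant groups
    q = np.clip(np.round((flat - mins) / scales), 0, 15).astype(np.uint8)
    return q, mins, scales

def dequantize_4bit_groupwise(q, mins, scales, shape):
    """Map 4-bit codes back to floats using the per-group min and scale."""
    return (q.astype(np.float32) * scales + mins).reshape(shape)

# Round-trip check on a random weight tile.
w = np.random.randn(128, 64).astype(np.float32)
q, mins, scales = quantize_4bit_groupwise(w)
w_hat = dequantize_4bit_groupwise(q, mins, scales, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```

Relative to fp16, 4-bit codes plus per-group metadata cut the footprint roughly 4x, which is what lets both the weights and the KV cache fit within tighter GPU/CPU/disk budgets.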
Source: https://arxiv.org/pdf/2303.06865