Description

This August 2025 paper introduces Pimba, a Processing-in-Memory (PIM) accelerator designed to improve the efficiency of Large Language Model (LLM) serving for both traditional transformer-based models and emerging post-transformer architectures. The authors show that memory bandwidth is the critical bottleneck for both classes of models: attention operations in transformers and state updates in post-transformers. Pimba addresses this by combining PIM technology with LLM quantization. A State-update Processing Unit (SPU) is shared between memory banks to maximize hardware resource sharing and area efficiency, and each SPU's State-update Processing Engine (SPE) performs MX (Microscaling)-based quantized arithmetic, which the authors identify as the Pareto-optimal balance between model accuracy and area overhead. Evaluations show that Pimba delivers substantially higher token generation throughput and lower latency and energy consumption than existing GPU and GPU+PIM systems, providing a unified, scalable solution for diverse LLM serving workloads.
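To make the bottleneck concrete, here is a minimal NumPy sketch of the kind of per-token state update found in linear-attention and SSM-style post-transformers. The exact update rule varies by model; the generic gated outer-product form and the state dimension below are assumptions for illustration, not the paper's specific kernel.

```python
import numpy as np

d = 1024  # hypothetical state dimension; real models use model-specific shapes

def gated_state_update(S, k, v, a):
    """One per-token state update of the generic form S <- a * S + k v^T,
    a pattern shared by linear-attention and SSM-style post-transformers.

    The full d x d state is read and rewritten while performing only a few
    arithmetic operations per element, so the step is bounded by memory
    bandwidth rather than compute -- the bottleneck Pimba offloads to PIM.
    """
    return a * S + np.outer(k, v)

S = np.zeros((d, d))
for _ in range(4):  # a few decode steps
    k, v = np.random.randn(d), np.random.randn(d)
    S = gated_state_update(S, k, v, a=0.99)
```

Because every generated token touches the entire state, adding compute (a faster GPU) does not help; moving the update next to the memory banks does.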
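The MX (Microscaling) formats referenced above pair one shared power-of-two scale per small block of values with narrow per-element encodings. The sketch below quantizes a block to signed integers under a shared scale; it is a software illustration of the numeric idea only, not the paper's hardware SPE pipeline, and the block size and element width are assumptions.

```python
import numpy as np

BLOCK = 32      # assumed MX block size
ELEM_BITS = 8   # assumed narrow element width (signed integer here)

def mx_quantize(block):
    """Quantize one block MX-style: a single shared power-of-two scale,
    plus one narrow signed integer per element."""
    qmax = 2 ** (ELEM_BITS - 1) - 1
    max_abs = float(np.max(np.abs(block)))
    if max_abs == 0.0:
        return np.zeros(block.shape, dtype=np.int32), 1.0
    # Smallest power-of-two scale under which every element fits in range.
    scale = 2.0 ** np.ceil(np.log2(max_abs / qmax))
    q = np.clip(np.round(block / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def mx_dequantize(q, scale):
    return q.astype(np.float64) * scale

x = np.random.randn(BLOCK)
q, s = mx_quantize(x)
print("shared scale:", s, "max abs error:", np.abs(x - mx_dequantize(q, s)).max())
```

Sharing one scale across a block keeps per-element storage and arithmetic narrow, which is what lets the SPE trade a small accuracy cost for a large reduction in area and bandwidth.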

Source:

https://arxiv.org/pdf/2507.10178