Description

This May 2025 paper introduces a resource-aware algorithm designed to optimize Large Language Model (LLM) inference for low latency on edge computing devices. The core innovation lies in its fine-grained partitioning of the Transformer architecture at the level of individual attention heads, rather than the coarser layer-level divisions used by prior work. This granularity allows individual attention heads and their associated Key/Value (K/V) caches to be dynamically reassigned and migrated across heterogeneous edge devices. By managing the growing memory footprint of K/V caches and exploiting the parallel execution of attention heads, the proposed method reduces inference latency and memory usage compared to existing static or layer-based partitioning strategies.
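To make the head-level idea concrete, here is a minimal sketch, not the paper's actual algorithm, of how attention heads and their K/V caches might be placed on heterogeneous devices and migrated as caches grow during decoding. The device names, memory figures, greedy placement policy, and rebalancing threshold are all illustrative assumptions.

```python
# Illustrative sketch only: head-level partitioning of multi-head attention
# across heterogeneous edge devices, with K/V-cache-aware migration.
# Device names, capacities, and the greedy policy are assumptions, not the
# paper's method.
from dataclasses import dataclass, field

@dataclass
class Device:
    name: str
    mem_capacity_mb: float          # memory budget for head weights + K/V caches
    mem_used_mb: float = 0.0
    heads: list = field(default_factory=list)

    def free_mb(self) -> float:
        return self.mem_capacity_mb - self.mem_used_mb

@dataclass
class AttentionHead:
    head_id: int
    weight_mb: float                # static weight footprint of this head
    kv_cache_mb: float = 0.0        # grows as the generated sequence lengthens

    def footprint_mb(self) -> float:
        return self.weight_mb + self.kv_cache_mb

def place(head, device):
    device.heads.append(head)
    device.mem_used_mb += head.footprint_mb()

def evict(head, device):
    device.heads.remove(head)
    device.mem_used_mb -= head.footprint_mb()

def assign_heads(heads, devices):
    """Greedy initial placement: each head goes to the device with the most free memory."""
    for head in heads:
        place(head, max(devices, key=lambda d: d.free_mb()))

def rebalance(devices, headroom_mb=64.0):
    """Migrate heads (with their K/V caches) off devices whose free memory drops
    below a headroom threshold, mimicking dynamic reassignment as caches expand."""
    for src in devices:
        while src.free_mb() < headroom_mb and src.heads:
            head = max(src.heads, key=lambda h: h.footprint_mb())
            dst = max((d for d in devices if d is not src), key=lambda d: d.free_mb())
            if dst.free_mb() <= src.free_mb():
                break  # no better placement exists for this head
            evict(head, src)
            place(head, dst)

if __name__ == "__main__":
    devices = [Device("edge-a", 512.0), Device("edge-b", 256.0), Device("edge-c", 128.0)]
    heads = [AttentionHead(i, weight_mb=24.0) for i in range(16)]
    assign_heads(heads, devices)

    # Simulate K/V caches growing during autoregressive decoding, then rebalance.
    for _ in range(100):
        for d in devices:
            for h in d.heads:
                h.kv_cache_mb += 0.5
                d.mem_used_mb += 0.5
        rebalance(devices)

    for d in devices:
        print(d.name, f"heads={len(d.heads)}", f"used={d.mem_used_mb:.1f}MB")
```

The sketch only captures the placement bookkeeping; in a real system, migrating a head also means transferring its K/V cache over the network, so a practical policy would weigh that transfer cost against the memory relief it buys.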

Source:

https://arxiv.org/pdf/2505.02533