Description

This January 2025 research addresses a critical source of training instability in Large Language Models (LLMs): sudden, massive "gradient spikes" that can be thousands of times larger than typical gradients. The authors introduce Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer that counteracts these spikes through periodic momentum resets and spike-aware gradient clipping, which scales large gradients down rather than zeroing them out. Experiments show that SPAM consistently outperforms existing optimizers such as Adam and Adafactor across a range of LLM sizes in both pre-training and fine-tuning. SPAM also comes in a memory-efficient variant that leverages sparse momentum, delivering better performance under memory constraints than other state-of-the-art memory-efficient optimizers. The study quantifies the detrimental impact of gradient spikes and presents an effective optimization strategy for improving LLM training stability and resource efficiency.
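As a rough illustration of the two mechanisms described above, the Python sketch below layers a simplified spike-aware clipping rule and a periodic momentum reset on top of PyTorch's torch.optim.Adam. It is not the authors' reference implementation: the exact threshold rule (rescaling entries whose squared gradient exceeds theta times Adam's second-moment estimate), the default values of theta and reset_interval, and the class and function names are all illustrative assumptions.

```python
# Minimal, illustrative sketch only -- not the paper's reference code.
# It combines (a) spike-aware clipping that rescales, rather than zeroes,
# outlier gradient entries and (b) a periodic reset of Adam's moment
# buffers. The threshold rule and the defaults for theta / reset_interval
# are assumptions made for this sketch.
import torch


def spike_aware_clip(grad, second_moment, theta=50.0):
    """Rescale entries whose squared gradient exceeds theta * v.

    Spiky entries are pulled back to sign(g) * sqrt(theta * v) instead of
    being zeroed, so their direction is preserved.
    """
    limit = torch.sqrt(theta * second_moment)
    spikes = (grad.abs() > limit) & (second_moment > 0)
    return torch.where(spikes, torch.sign(grad) * limit, grad)


class SpikeAwareAdam:
    """Thin wrapper: Adam + spike-aware clipping + periodic momentum reset."""

    def __init__(self, params, lr=1e-3, theta=50.0, reset_interval=500):
        self.params = list(params)
        self.opt = torch.optim.Adam(self.params, lr=lr)
        self.theta = theta
        self.reset_interval = reset_interval
        self.step_count = 0

    @torch.no_grad()
    def step(self):
        self.step_count += 1
        # Clip spiky gradient entries using Adam's running second moment.
        for p in self.params:
            if p.grad is None:
                continue
            state = self.opt.state.get(p, {})
            v = state.get("exp_avg_sq")
            if v is not None:
                p.grad.copy_(spike_aware_clip(p.grad, v, self.theta))
        self.opt.step()
        # Periodically reset the moment buffers so a spike that slipped
        # through does not linger in the momentum statistics.
        if self.step_count % self.reset_interval == 0:
            for p in self.params:
                state = self.opt.state.get(p, {})
                if "exp_avg" in state:
                    state["exp_avg"].zero_()
                    state["exp_avg_sq"].zero_()

    def zero_grad(self):
        self.opt.zero_grad()
```

Used like an ordinary optimizer (construct with model.parameters(), then alternate loss.backward(), step(), and zero_grad()), the sketch keeps clipped gradients aligned with their original direction, which is the key difference from schemes that simply drop or zero out outlier coordinates.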

Source:

https://arxiv.org/pdf/2501.06842