We review recent papers on advances in, and critical evaluations of, Sparse Autoencoders (SAEs): tools for decoding the internal "monosemantic" features of large language models.
Research from **ICLR 2025** and other venues introduces **TopK SAEs** and **Multi-Layer SAEs**, showing that these architectures offer better reconstruction and scalability than traditional ReLU-based SAEs. **RouteSAE** further improves efficiency by using a **dynamic routing mechanism** to extract integrated features from multiple layers of a model's residual stream. However, critical analyses caution that many identified "reasoning" features may in fact be **linguistic correlates** or syntactic templates rather than genuine cognitive traces: using **falsification frameworks** and **causal token injection**, researchers warn against over-interpreting feature activations without rigorous validation. Together, these papers provide a technical foundation for **mechanistic interpretability**, balancing architectural advances with a skeptical look at current evaluation metrics.
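To make the TopK mechanism concrete, here is a minimal sketch of a TopK SAE forward pass. It assumes a PyTorch implementation; the class name, dimensions (`d_model`, `d_dict`, `k`), and training details are illustrative assumptions rather than the exact setup of any paper listed below.

```python
# Minimal TopK sparse autoencoder sketch (illustrative; not from any specific paper below).
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # Encode, then keep only the k largest pre-activations per token;
        # all other latents are zeroed, so the L0 of each latent vector is exactly k.
        pre = self.encoder(x)
        topk = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        recon = self.decoder(latents)
        return recon, latents


# Usage sketch: reconstruct a batch of residual-stream activations (synthetic here)
# and measure the MSE that would typically serve as the training objective.
sae = TopKSAE(d_model=768, d_dict=16384, k=32)
hidden = torch.randn(4, 128, 768)  # (batch, sequence, d_model)
recon, latents = sae(hidden)
mse = torch.mean((recon - hidden) ** 2)
print(mse.item(), latents.count_nonzero(dim=-1).float().mean().item())
```

Unlike a ReLU SAE, which controls sparsity indirectly through an L1 penalty, the TopK activation fixes the number of active latents directly, which is one reason these architectures are reported to scale and reconstruct more cleanly.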
Sources:
1) Tim Lawson (2025). Residual Stream Analysis with Multi-Layer SAEs. https://arxiv.org/abs/2409.04185
2) Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher Manning, Christopher Potts (2025). AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders. https://openreview.net/forum?id=XAjfjizaKs
3) Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda (2025). SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability. www.neuronpedia.org/sae-bench
4) Ikhyun Cho, Julia Hockenmaier (2025). Toward Efficient Sparse Autoencoder-Guided Steering for Improved In-Context Learning in Large Language Models. University of Illinois at Urbana-Champaign. https://aclanthology.org/2025.emnlp-main.1474.pdf
5) Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Guojun Ma, Xiang Wang, Xiangnan He (2025). Route Sparse Autoencoder to Interpret Large Language Models. University of Science and Technology of China; Douyin Co., Ltd. https://aclanthology.org/2025.emnlp-main.346.pdf
6) Aashiq Muhamed, Mona Diab, Virginia Smith (2025). Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models. Carnegie Mellon University. https://aclanthology.org/2025.findings-naacl.87.pdf
7) George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi (2026). Falsifying Sparse Autoencoder Reasoning Features in Language Models. UC Berkeley; UCSF. https://arxiv.org/pdf/2601.05679
8) Anonymous authors (under review). Sparse But Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders. https://openreview.net/pdf/035a5937c6a536c67b5999aa43e53dd3800ba3a4.pdf
9) George Ma, Samuel Pfrommer, Somayeh Sojoudi (2025). Revising and Falsifying Sparse Autoencoder Feature Explanations. University of California, Berkeley. https://openreview.net/pdf?id=OJAW2mHVND
10) Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu (2025). Scaling and Evaluating Sparse Autoencoders. OpenAI. https://proceedings.iclr.cc/paper_files/paper/2025/file/42ef3308c230942d223c411adf182c88-Paper-Conference.pdf