Description

We review recent papers on advances in, and critical evaluations of, Sparse Autoencoders (SAEs): tools for decoding the internal "monosemantic" features of large language models.

Work presented at **ICLR 2025** and in other recent venues introduces **TopK SAEs** and **Multi-Layer SAEs**, showing that these architectures offer better reconstruction quality and scalability than traditional ReLU-based SAEs. **RouteSAE** further improves efficiency by using a **dynamic routing mechanism** to extract integrated features from across multiple layers of a model's residual stream. However, critical analyses caution that many identified "reasoning" features may actually be **linguistic correlates** or syntactic templates rather than genuine cognitive traces. Using **falsification frameworks** and **causal token injection**, researchers warn against over-interpreting feature activations without rigorous validation. Together, these papers provide a technical foundation for **mechanistic interpretability**, balancing new architectural advances with a skeptical look at current evaluation metrics.
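To make the TopK idea concrete, here is a minimal PyTorch sketch of a TopK SAE forward pass: instead of a ReLU activation plus an L1 sparsity penalty, only the k largest encoder pre-activations are kept, which fixes the L0 of the latent code directly. The class name, dimensions, and the ReLU applied to the kept values are illustrative assumptions, not code taken from any of the papers listed below.

```python
# Minimal sketch of a TopK sparse autoencoder (SAE) forward pass.
# All names and dimensions are hypothetical; this is not the reference
# implementation from any of the cited papers.
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.k = k  # number of latent features kept active per token
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # Encode the residual-stream activation into a wide latent space.
        pre_acts = self.encoder(x)
        # Keep only the k largest pre-activations and zero out the rest,
        # enforcing an exact L0 of k rather than relying on an L1 penalty.
        topk = torch.topk(pre_acts, self.k, dim=-1)
        latents = torch.zeros_like(pre_acts)
        latents.scatter_(-1, topk.indices, torch.relu(topk.values))
        # Reconstruct the original activation from the sparse code.
        recon = self.decoder(latents)
        return recon, latents


# Usage: reconstruct a batch of hidden states from a hypothetical model with
# d_model = 768, a 16x over-complete dictionary, and k = 32 active features.
sae = TopKSAE(d_model=768, d_dict=768 * 16, k=32)
x = torch.randn(4, 768)              # stand-in for residual-stream activations
recon, latents = sae(x)
loss = torch.mean((recon - x) ** 2)  # reconstruction objective
```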

Sources:

1) Residual Stream Analysis with Multi-Layer SAEs (2025). Tim Lawson. https://arxiv.org/abs/2409.04185

2) AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders (2025). Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher Manning, Christopher Potts. https://openreview.net/forum?id=XAjfjizaKs

3) SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability (2025). Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda. www.neuronpedia.org/sae-bench

4) Toward Efficient Sparse Autoencoder-Guided Steering for Improved In-Context Learning in Large Language Models (2025). Ikhyun Cho, Julia Hockenmaier (University of Illinois at Urbana-Champaign). https://aclanthology.org/2025.emnlp-main.1474.pdf

5) Route Sparse Autoencoder to Interpret Large Language Models (2025). Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Guojun Ma, Xiang Wang, Xiangnan He (University of Science and Technology of China; Douyin Co., Ltd.). https://aclanthology.org/2025.emnlp-main.346.pdf

6) Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models (2025). Aashiq Muhamed, Mona Diab, Virginia Smith (Carnegie Mellon University). https://aclanthology.org/2025.findings-naacl.87.pdf

7) Falsifying Sparse Autoencoder Reasoning Features in Language Models (February 10, 2026). George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi (UC Berkeley, UCSF). https://arxiv.org/pdf/2601.05679

8) Sparse But Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders (under review). Anonymous authors. https://openreview.net/pdf/035a5937c6a536c67b5999aa43e53dd3800ba3a4.pdf

9) Revising and Falsifying Sparse Autoencoder Feature Explanations (2025). George Ma, Samuel Pfrommer, Somayeh Sojoudi (University of California, Berkeley). https://openreview.net/pdf?id=OJAW2mHVND

10) Scaling and Evaluating Sparse Autoencoders (2025). Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu (OpenAI). https://proceedings.iclr.cc/paper_files/paper/2025/file/42ef3308c230942d223c411adf182c88-Paper-Conference.pdf