Description

The paper introduces "InterPLM," a systematic framework for interpreting protein language models (PLMs) using sparse autoencoders (SAEs). The method extracts thousands of interpretable features from PLMs such as ESM-2, revealing biological concepts like binding sites and functional domains that are represented in superposition across the model's neurons. The research demonstrates that SAE features align significantly more strongly with known biological annotations than individual neurons do, and that larger PLMs capture a broader range of concepts. Furthermore, the framework leverages large language models for automated feature description and validation, showing that feature activations can identify missing database annotations and enable targeted steering of sequence generation.
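As a rough illustration of the core technique (a sketch, not the paper's exact implementation: the layer choice, dimensions, expansion factor, and sparsity weight below are all assumptions), a sparse autoencoder can be trained on per-residue PLM hidden states along these lines:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 sparsity penalty on its hidden features."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative; the L1 term in the
        # loss pushes most of them to zero, yielding sparse features.
        feats = torch.relu(self.encoder(x))
        return self.decoder(feats), feats

# Hypothetical setup: ESM-2 (650M) has hidden size 1280; the 8x expansion
# and learning rate here are illustrative choices, not the paper's values.
sae = SparseAutoencoder(d_model=1280, d_hidden=1280 * 8)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

def train_step(acts: torch.Tensor, l1_weight: float = 1e-3) -> float:
    """acts: (batch, d_model) per-residue hidden states from a chosen PLM layer."""
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_weight * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Interpretability then comes from inspecting which residues and proteins maximally activate each learned feature and comparing those activation patterns against known biological annotations, in the spirit of the alignment analysis the paper describes.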
