This paper introduces Activation Reward Models (Activation RMs), a novel method for aligning large language models (LLMs) and multimodal models with human preferences using minimal data. Unlike traditional reward models that require extensive fine-tuning, the approach uses activation steering to shape a model's internal representations from only a few examples. By identifying and steering specific attention heads, the system produces accurate reward signals and adapts rapidly to new tasks without any parameter updates. To evaluate the method, the authors present PreferenceHack, a benchmark designed to test whether reward models are susceptible to common biases such as length and formatting. Results indicate that Activation RMs effectively mitigate reward hacking and achieve performance comparable to leading closed-source models. The research concludes that this framework offers a sample-efficient and interpretable alternative for ensuring AI systems adhere to complex human intent.
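As a rough illustration of the idea (not the authors' implementation), the sketch below assumes that activations from a chosen attention head can be extracted for a handful of preference pairs; it estimates a "preference direction" from their means and scores new responses by projecting their activations onto that direction. The head dimension, number of pairs, and the simulated activations are all hypothetical stand-ins for real model internals.

```python
# Minimal sketch of an activation-based reward signal, under the assumptions above.
import torch

torch.manual_seed(0)
head_dim = 64   # dimensionality of one attention head's output (assumed)
n_pairs = 8     # "just a few" preference examples

# In practice these would come from forward hooks on a specific attention head,
# averaged over response tokens; here they are simulated for illustration.
preferred_acts = torch.randn(n_pairs, head_dim) + 0.5
dispreferred_acts = torch.randn(n_pairs, head_dim) - 0.5

# Preference direction: difference of the class means, normalized to unit length.
direction = preferred_acts.mean(0) - dispreferred_acts.mean(0)
direction = direction / direction.norm()

def activation_reward(head_activation: torch.Tensor) -> float:
    """Reward = projection of a response's head activation onto the preference direction."""
    return float(head_activation @ direction)

# Score two candidate responses (again with simulated activations).
candidate_a = torch.randn(head_dim) + 0.5   # resembles the preferred examples
candidate_b = torch.randn(head_dim) - 0.5   # resembles the dispreferred examples
print(activation_reward(candidate_a), activation_reward(candidate_b))
```

Because the direction is computed from a few examples rather than learned by gradient updates, adapting to a new preference only requires recomputing the means, which is what makes this style of reward signal sample-efficient and free of fine-tuning.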