Description

This research paper identifies a "selection crisis" in personalized AI alignment: the standard metrics used to select models fail to predict how those models actually behave at deployment. While researchers typically measure success with reward model (RM) accuracy, the authors show that this metric correlates poorly with a model's ability to generate preferred content through reward-guided decoding. To close this gap, they introduce policy accuracy and a new benchmark, Pref-LaMP, enabling the first direct evaluation of model outputs against ground-truth user completions.

Their findings reveal a complete decoupling between a model's ranking ability and its generation quality: many high-scoring reward models fail to produce aligned responses. Notably, simple in-context learning (ICL) consistently outperforms complex personalized reward methods for models with 3 billion or more parameters. The authors therefore urge the field to move beyond proxy metrics toward end-to-end behavioral evaluations, so that personalized AI truly reflects individual user preferences.
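To make the distinction between the two metrics concrete, here is a minimal sketch under stated assumptions: a toy stub reward model, best-of-n reranking standing in for reward-guided decoding, and token-overlap F1 standing in for comparison against a ground-truth completion. The names (rm_score, token_f1, generate) and the toy data are illustrative, not from the paper or the Pref-LaMP benchmark.

```python
# Sketch: RM accuracy (ranking) vs. policy accuracy (generation quality).
# All functions and data here are toy stand-ins, not the paper's method.

def rm_score(prompt: str, response: str) -> float:
    """Stub reward model: crude word-overlap heuristic, for illustration only."""
    words = response.split()
    return sum(words.count(w) for w in prompt.split()) / (len(words) + 1)

def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1, a crude stand-in for scoring a generation against
    a ground-truth user completion."""
    p, r = pred.lower().split(), ref.lower().split()
    overlap = len(set(p) & set(r))
    if not overlap:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

# Toy evaluation set: a preference pair (for RM accuracy) plus a
# ground-truth completion (for policy accuracy).
evalset = [
    {"prompt": "suggest a weekend plan",
     "chosen": "a quiet weekend plan with reading and hiking",
     "rejected": "a very loud all-night party",
     "gold": "a quiet weekend plan with reading and a short hike",
     "candidates": ["a very loud all-night party",
                    "a quiet weekend plan with reading and hiking"]},
]

# RM accuracy: how often the RM ranks the preferred response above the rejected one.
rm_acc = sum(
    rm_score(ex["prompt"], ex["chosen"]) > rm_score(ex["prompt"], ex["rejected"])
    for ex in evalset
) / len(evalset)

# Policy accuracy: score what the policy actually generates. Best-of-n
# reranking by reward stands in here for reward-guided decoding.
def generate(prompt: str, candidates: list[str]) -> str:
    return max(candidates, key=lambda c: rm_score(prompt, c))

policy_acc = sum(
    token_f1(generate(ex["prompt"], ex["candidates"]), ex["gold"])
    for ex in evalset
) / len(evalset)

print(f"RM accuracy: {rm_acc:.2f}  |  policy accuracy (F1 vs gold): {policy_acc:.2f}")
```

The point of the sketch is structural: RM accuracy never inspects a generated output, only pairwise rankings, so the two numbers are computed over entirely different objects. That is what makes it possible, as the paper reports, for a model to score well on one while failing the other.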