GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-Time Alignment

1University of Maryland, College Park    2JPMorgan AI Research

ICLR 2025

(Next-token generation guided by different RMs) Using a trajectory-level RM to select the next token (top) requires the costly process of generating full responses for each candidate. In contrast, GenARM (bottom) efficiently samples the next token by combining scores from the base LLM and our proposed Autoregressive Reward Model, which is trained to predict next-token rewards directly.

Abstract

Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods finetune LLMs using human preference datasets but incur significant training costs and require repeated training to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining. However, existing test-time approaches rely on trajectory-level RMs, which are designed to evaluate complete responses, making them unsuitable for autoregressive text generation that requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model, a novel reward parametrization designed to predict next-token rewards for efficient and effective autoregressive generation. Theoretically, we demonstrate that this parametrization can provably guide frozen LLMs toward any distribution achievable by traditional RMs within the KL-regularized reinforcement learning framework. Experimental results show that GenARM significantly outperforms prior test-time alignment baselines and matches the performance of training-time methods. Additionally, GenARM enables efficient weak-to-strong guidance, aligning larger LLMs with smaller RMs without the high costs of training larger models. Furthermore, GenARM supports multi-objective alignment, allowing real-time trade-offs between preference dimensions and catering to diverse user preferences without retraining.

TL;DR: GenARM uses an autoregressive reward model to efficiently guide a base LLM for test-time alignment, outperforming prior methods and enabling weak-to-strong guidance and multi-objective alignment.

Why Do We Need an Autoregressive Reward Model?

❌ Traditional reward models = Slow 🚶

🔹They score entire responses post-generation 📜

🔹LLMs must generate fully before evaluation ⏳

✅ GenARM = Fast 🏎️

🔹 Predicts next-token rewards on the fly ⚡

🔹 Guides LLMs token by token—drastically improving efficiency! 💡

What’s an Autoregressive Reward Model?

Unlike conventional trajectory-level reward models, GenARM parametrizes rewards at the token level:

🔹 Rewards decompose naturally as a sum of next-token log probabilities 🔄 (see the formula below)

🔹 Each token selection is guided dynamically 🎯

(Autoregressive RM) Parametrization of the Autoregressive Reward Model.
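Written out (a minimal sketch in our own notation, consistent with the figure above; π_r denotes the autoregressive reward model), the trajectory-level reward of a response y to a prompt x decomposes into per-token log probabilities, so even a partial response has a well-defined next-token reward:

$$
r(x, y) \;=\; \sum_{t=1}^{|y|} \log \pi_r\!\left(y_t \mid x,\, y_{<t}\right)
$$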

How Does GenARM Work?

Training Phase:

✅ Learns next-token rewards from trajectory-level preference data 📊

✅ Ensures that preferred responses accumulate higher total rewards 💯 (see the training-loss sketch after this list)
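For intuition, here is a minimal PyTorch-style sketch of a training objective consistent with the two bullets above: a Bradley-Terry-style preference loss over summed next-token log probabilities. The function names, tensor shapes, and toy data are illustrative assumptions, not the released training code.

```python
import torch
import torch.nn.functional as F

def sequence_reward(token_logprobs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Trajectory-level reward = sum of next-token log-probs under the autoregressive RM.

    token_logprobs: (batch, seq_len) values of log pi_r(y_t | x, y_<t) for response tokens.
    mask:           (batch, seq_len) 1 for response tokens, 0 for padding.
    """
    return (token_logprobs * mask).sum(dim=-1)

def preference_loss(logp_chosen, mask_chosen, logp_rejected, mask_rejected):
    """Bradley-Terry-style loss: preferred responses should accumulate higher total reward."""
    r_chosen = sequence_reward(logp_chosen, mask_chosen)
    r_rejected = sequence_reward(logp_rejected, mask_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: a batch of 2 preference pairs with responses of up to 8 tokens.
logp_c, logp_r = torch.randn(2, 8), torch.randn(2, 8)
mask = torch.ones(2, 8)
loss = preference_loss(logp_c, mask, logp_r, mask)
```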

🚀 Inference Phase:

✅ Combines LLM logits + next-token rewards to dynamically guide generation 🔄 (see the decoding sketch below)

💡 No model retraining. Just plug, play, and align!

Train and test phases of GenARM.
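One decoding step looks roughly like the sketch below. It assumes the combination rule takes the form of the frozen base LLM's log probabilities plus the autoregressive RM's log probabilities scaled by 1/β; the function names, β value, and toy tensors are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def guided_next_token_logits(base_logits: torch.Tensor,
                             arm_logits: torch.Tensor,
                             beta: float = 1.0) -> torch.Tensor:
    """Combine the frozen base LLM with the autoregressive RM for one decoding step.

    With rewards parametrized as log pi_r(y_t | x, y_<t), the guided distribution is
    (up to per-step normalization) pi_base(y_t) * pi_r(y_t)^(1/beta), i.e. a weighted
    sum of the two models' log-probabilities.
    """
    logp_base = F.log_softmax(base_logits, dim=-1)
    logp_arm = F.log_softmax(arm_logits, dim=-1)
    return logp_base + logp_arm / beta

# Toy usage with random logits standing in for the two models' forward passes.
vocab_size = 32000
base_logits = torch.randn(1, vocab_size)   # frozen base LLM
arm_logits = torch.randn(1, vocab_size)    # autoregressive reward model
next_token = torch.distributions.Categorical(
    logits=guided_next_token_logits(base_logits, arm_logits, beta=1.0)
).sample()
```

Because both forward passes happen once per token, the cost is comparable to running two LLMs side by side, rather than generating and scoring full candidate responses with a trajectory-level RM.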

How Well Does It Perform?



🔥 Fastest test-time alignment method—significantly outperforms baselines!

🔥 Achieves 90% of fine-tuned performance—without retraining!

🔥 Weak-to-strong guidance: Uses a 7B RM to align a 70B LLM, saving HUGE compute costs!

💡 More power, less compute! 🏆




BibTeX

@inproceedings{xu2025genarm,
  title={GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-Time Alignment},
  author={Xu, Yuancheng and Sehwag, Udari Madhushani and Koppel, Alec and Zhu, Sicheng and An, Bang and Huang, Furong and Ganesh, Sumitra},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}