GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-Time Alignment

1University of Maryland, College Park    2JPMorgan AI Research

ICLR 2025

(Next-token generation guided by different RMs) Using a trajectory-level RM to select the next token (top) requires the costly process of generating full responses for each candidate. In contrast, GenARM (bottom) efficiently samples the next token by combining scores from the base LLM and our proposed Autoregressive Reward Model, which is trained to predict next-token rewards directly.

Abstract

Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods finetune LLMs using human preference datasets but incur significant training costs and require repeated training to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining. However, existing test-time approaches rely on trajectory-level RMs, which are designed to evaluate complete responses, making them unsuitable for autoregressive text generation that requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model, a novel reward parametrization designed to predict next-token rewards for efficient and effective autoregressive generation. Theoretically, we demonstrate that this parametrization can provably guide frozen LLMs toward any distribution achievable by traditional RMs within the KL-regularized reinforcement learning framework. Experimental results show that GenARM significantly outperforms prior test-time alignment baselines and matches the performance of training-time methods. Additionally, GenARM enables efficient weak-to-strong guidance, aligning larger LLMs with smaller RMs without the high costs of training larger models. Furthermore, GenARM supports multi-objective alignment, allowing real-time trade-offs between preference dimensions and catering to diverse user preferences without retraining.

TL;DR: GenARM uses an autoregressive reward model to efficiently guide a base LLM for test-time alignment, outperforming prior methods and enabling weak-to-strong guidance and multi-objective alignment.

Why Do We Need an Autoregressive Reward Model?

❌ Traditional reward models = Slow 🚶

🔹They score entire responses post-generation 📜

🔹LLMs must generate fully before evaluation ⏳

✅ GenARM = Fast 🏎️

🔹 Predicts next-token rewards on the fly ⚡

🔹 Guides LLMs token by token—drastically improving efficiency! 💡

What’s an Autoregressive Reward Model?

Unlike conventional trajectory-level reward models, GenARM parametrizes rewards at the token level:

🔹 Rewards decompose naturally as a sum of next-token log probabilities 🔄 (see the formula below)

🔹 Each token selection is guided dynamically 🎯

(Autoregressive RM) Parametrization of the Autoregressive Reward Model.
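Written out (a minimal sketch in our own notation, consistent with the figure above; π_r denotes the autoregressive reward model), the trajectory-level reward of a response y to a prompt x decomposes into per-token log probabilities, so even a partial response has a well-defined next-token reward:

$$
r(x, y) \;=\; \sum_{t=1}^{|y|} \log \pi_r\!\left(y_t \mid x,\, y_{<t}\right)
$$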

How Does GenARM Work?

Training Phase:

✅ Learns next-token rewards from trajectory-level preference data 📊

✅ Ensures that preferred responses accumulate higher total rewards 💯 (see the training-loss sketch after this list)
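For intuition, here is a minimal PyTorch-style sketch of a training objective consistent with the two bullets above: a Bradley-Terry-style preference loss over summed next-token log probabilities. The function names, tensor shapes, and toy data are illustrative assumptions, not the released training code.

```python
import torch
import torch.nn.functional as F

def sequence_reward(token_logprobs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Trajectory-level reward = sum of next-token log-probs under the autoregressive RM.

    token_logprobs: (batch, seq_len) values of log pi_r(y_t | x, y_<t) for response tokens.
    mask:           (batch, seq_len) 1 for response tokens, 0 for padding.
    """
    return (token_logprobs * mask).sum(dim=-1)

def preference_loss(logp_chosen, mask_chosen, logp_rejected, mask_rejected):
    """Bradley-Terry-style loss: preferred responses should accumulate higher total reward."""
    r_chosen = sequence_reward(logp_chosen, mask_chosen)
    r_rejected = sequence_reward(logp_rejected, mask_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: a batch of 2 preference pairs with responses of up to 8 tokens.
logp_c, logp_r = torch.randn(2, 8), torch.randn(2, 8)
mask = torch.ones(2, 8)
loss = preference_loss(logp_c, mask, logp_r, mask)
```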

🚀 Inference Phase:

✅ Combines LLM logits + next-token rewards to dynamically guide generation 🔄 (see the decoding sketch below)

💡 No model retraining. Just plug, play, and align!

Train and test phases of GenARM.
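One decoding step looks roughly like the sketch below. It assumes the combination rule takes the form of the frozen base LLM's log probabilities plus the autoregressive RM's log probabilities scaled by 1/β; the function names, β value, and toy tensors are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def guided_next_token_logits(base_logits: torch.Tensor,
                             arm_logits: torch.Tensor,
                             beta: float = 1.0) -> torch.Tensor:
    """Combine the frozen base LLM with the autoregressive RM for one decoding step.

    With rewards parametrized as log pi_r(y_t | x, y_<t), the guided distribution is
    (up to per-step normalization) pi_base(y_t) * pi_r(y_t)^(1/beta), i.e. a weighted
    sum of the two models' log-probabilities.
    """
    logp_base = F.log_softmax(base_logits, dim=-1)
    logp_arm = F.log_softmax(arm_logits, dim=-1)
    return logp_base + logp_arm / beta

# Toy usage with random logits standing in for the two models' forward passes.
vocab_size = 32000
base_logits = torch.randn(1, vocab_size)   # frozen base LLM
arm_logits = torch.randn(1, vocab_size)    # autoregressive reward model
next_token = torch.distributions.Categorical(
    logits=guided_next_token_logits(base_logits, arm_logits, beta=1.0)
).sample()
```

Because both forward passes happen once per token, the cost is comparable to running two LLMs side by side, rather than generating and scoring full candidate responses with a trajectory-level RM.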

How Well Does It Perform?



🔥 Fastest test-time alignment method—significantly outperforms baselines!

🔥 Achieves 90% of fine-tuned performance—without retraining!

🔥 Weak-to-strong guidance: Uses a 7B RM to align a 70B LLM, saving HUGE compute costs!

💡 More power, less compute! 🏆




BibTeX

@inproceedings{xu2025genarm,
  title={GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-Time Alignment},
  author={Xu, Yuancheng and Sehwag, Udari Madhushani and Koppel, Alec and Zhu, Sicheng and An, Bang and Huang, Furong and Ganesh, Sumitra},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}