TY - GEN
T1 - Generative Reward Modeling via Synthetic Criteria Preference Learning
AU - Liang, Xiaobo
AU - Zhang, Haoke
AU - Li, Juntao
AU - Chen, Kehai
AU - Zhu, Qiaoming
AU - Zhang, Min
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
N2 - Generative Reward Models (GenRMs) leverage synthesized Chains of Thought (CoT) to reduce the need for massive labeled data, but this approach introduces risks of overoptimization due to the inability to guarantee the correctness of the CoTs. Identifying and optimizing unexpected behaviors within these synthesized CoT remains a challenge, as it heavily depends on precise annotations of intermediate behavior, similar to process supervision. In this work, we introduce a criteria-based preference tree for reward modeling, where each path in the tree represents a reasoning trajectory based on synthesized criteria. Crucially, each reasoning trajectory can be independently optimized through RL algorithm. These fine-grained process reward signals are derived from the inference-time computations and predefined rules, eliminating the need for human supervision. In experiments, SyncPL1 showed significant improvements over baselines on multiple human preference benchmarks. We further demonstrate that synthesized data can be learned using a long CoT format, analogous to an o1-like model, further enhancing performance while keeping stability and efficiency during training.
AB - Generative Reward Models (GenRMs) leverage synthesized Chains of Thought (CoT) to reduce the need for massive labeled data, but this approach introduces risks of overoptimization due to the inability to guarantee the correctness of the CoTs. Identifying and optimizing unexpected behaviors within these synthesized CoT remains a challenge, as it heavily depends on precise annotations of intermediate behavior, similar to process supervision. In this work, we introduce a criteria-based preference tree for reward modeling, where each path in the tree represents a reasoning trajectory based on synthesized criteria. Crucially, each reasoning trajectory can be independently optimized through RL algorithm. These fine-grained process reward signals are derived from the inference-time computations and predefined rules, eliminating the need for human supervision. In experiments, SyncPL1 showed significant improvements over baselines on multiple human preference benchmarks. We further demonstrate that synthesized data can be learned using a long CoT format, analogous to an o1-like model, further enhancing performance while keeping stability and efficiency during training.
UR - https://www.scopus.com/pages/publications/105021064159
U2 - 10.18653/v1/2025.acl-long.1297
DO - 10.18653/v1/2025.acl-long.1297
M3 - 会议稿件
AN - SCOPUS:105021064159
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 26755
EP - 26769
BT - Long Papers
A2 - Che, Wanxiang
A2 - Nabende, Joyce
A2 - Shutova, Ekaterina
A2 - Pilehvar, Mohammad Taher
PB - Association for Computational Linguistics (ACL)
T2 - 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Y2 - 27 July 2025 through 1 August 2025
ER -