TY - GEN
T1 - MPO: Multilingual Safety Alignment via Reward Gap Optimization
T2 - 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
AU - Zhao, Weixiang
AU - Hu, Yulin
AU - Deng, Yang
AU - Wu, Tongtong
AU - Zhang, Wenxuan
AU - Guo, Jiahe
AU - Zhang, An
AU - Zhao, Yanyan
AU - Qin, Bing
AU - Chua, Tat Seng
AU - Liu, Ting
N1 - Publisher Copyright: © 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
N2 - Large language models (LLMs) have become increasingly central to AI applications worldwide, necessitating robust multilingual safety alignment to ensure secure deployment across diverse linguistic contexts. Existing preference learning methods for safety alignment, such as RLHF and DPO, are primarily monolingual and struggle with noisy multilingual data. To address these limitations, we introduce Multilingual reward gaP Optimization (MPO), a novel approach that leverages the well-aligned safety capabilities of the dominant language (e.g., English) to improve safety alignment across multiple languages. MPO directly minimizes the reward gap difference between the dominant language and target languages, effectively transferring safety capabilities while preserving the original strengths of the dominant language. Extensive experiments on three LLMs, LLaMA-3.1, Gemma-2 and Qwen2.5, validate MPO's efficacy in multilingual safety alignment without degrading general multilingual utility. Our code is available at: https://github.com/circle-hit/MPO. WARNING: This paper may contain content that is offensive and harmful.
UR - https://www.scopus.com/pages/publications/105021025476
U2 - 10.18653/v1/2025.acl-long.1149
DO - 10.18653/v1/2025.acl-long.1149
M3 - Conference contribution
AN - SCOPUS:105021025476
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 23564
EP - 23587
BT - Long Papers
A2 - Che, Wanxiang
A2 - Nabende, Joyce
A2 - Shutova, Ekaterina
A2 - Pilehvar, Mohammad Taher
PB - Association for Computational Linguistics (ACL)
Y2 - 27 July 2025 through 1 August 2025
ER -