TY - GEN
T1 - Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs
AU - Zhao, Weixiang
AU - Hu, Yulin
AU - Deng, Yang
AU - Guo, Jiahe
AU - Sui, Xingyu
AU - Han, Xinyang
AU - Zhang, An
AU - Zhao, Yanyan
AU - Qin, Bing
AU - Chua, Tat Seng
AU - Liu, Ting
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
N2 - Role-playing enables large language models (LLMs) to engage users in immersive and personalized interactions, but it also introduces significant safety risks. Existing role-play fine-tuning techniques improve role adaptability but may degrade safety performance, particularly for villainous characters. In this work, we conduct the first comprehensive assessment of role-play fine-tuning risks by training 95 role-specific LLMs using RoleBench. Our experiments reveal that role-play fine-tuning leads to a noticeable decline in safety performance, with safety risks varying based on character traits. To tackle this challenge, we propose Safety-Aware Role-Play Fine-Tuning (SaRFT), a novel method designed to balance role-playing capabilities and safety. Extensive experiments on LLaMA-3-8B-Instruct, Gemma-2-9B-it, and Qwen2.5-7B-Instruct demonstrate that SaRFT consistently outperforms state-of-the-art baselines under both LoRA and full-parameter fine-tuning settings. Our findings highlight the necessity of role-adaptive safety measures and provide insights into mitigating role-specific safety risks in role-playing LLMs. Our code is available at: https://github.com/yulinlp/SaRFT. WARNING: This paper may contain content that is harmful.
AB - Role-playing enables large language models (LLMs) to engage users in immersive and personalized interactions, but it also introduces significant safety risks. Existing role-play fine-tuning techniques improve role adaptability but may degrade safety performance, particularly for villainous characters. In this work, we conduct the first comprehensive assessment of role-play fine-tuning risks by training 95 role-specific LLMs using RoleBench. Our experiments reveal that role-play fine-tuning leads to a noticeable decline in safety performance, with safety risks varying based on character traits. To tackle this challenge, we propose Safety-Aware Role-Play Fine-Tuning (SaRFT), a novel method designed to balance role-playing capabilities and safety. Extensive experiments on LLaMA-3-8B-Instruct, Gemma-2-9B-it, and Qwen2.5-7B-Instruct demonstrate that SaRFT consistently outperforms state-of-the-art baselines under both LoRA and full-parameter fine-tuning settings. Our findings highlight the necessity of role-adaptive safety measures and provide insights into mitigating role-specific safety risks in role-playing LLMs. Our code is available at: https://github.com/yulinlp/SaRFT. WARNING: This paper may contain content that is harmful.
UR - https://www.scopus.com/pages/publications/105021031249
U2 - 10.18653/v1/2025.acl-long.544
DO - 10.18653/v1/2025.acl-long.544
M3 - 会议稿件
AN - SCOPUS:105021031249
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 11112
EP - 11137
BT - Long Papers
A2 - Che, Wanxiang
A2 - Nabende, Joyce
A2 - Shutova, Ekaterina
A2 - Pilehvar, Mohammad Taher
PB - Association for Computational Linguistics (ACL)
T2 - 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Y2 - 27 July 2025 through 1 August 2025
ER -