
Safety Alignment via Constrained Knowledge Unlearning

  • Zesheng Shi
  • Yucheng Zhou
  • Jing Li*
  • Yuxin Jin
  • Yu Li
  • Daojing He
  • Fangming Liu
  • Saleh Alharbi
  • Jun Yu
  • Min Zhang
  • *Corresponding author for this work
  • Harbin Institute of Technology Shenzhen
  • University of Macau
  • Nankai University
  • Zhejiang University
  • Peng Cheng Laboratory
  • Shaqra University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Despite significant progress in safety alignment, large language models (LLMs) remain susceptible to jailbreak attacks. Existing defense mechanisms have not fully deleted harmful knowledge in LLMs, which allows such attacks to bypass safeguards and produce harmful outputs. To address this challenge, we propose a novel safety alignment strategy, Constrained Knowledge Unlearning (CKU), which focuses on two primary objectives: knowledge localization and retention, and unlearning harmful knowledge. CKU works by scoring neurons in specific multilayer perceptron (MLP) layers to identify a subset U of neurons associated with useful knowledge. During the unlearning process, CKU prunes the gradients of neurons in U to preserve valuable knowledge while effectively mitigating harmful content. Experimental results demonstrate that CKU significantly enhances model safety without compromising overall performance, offering a superior balance between safety and utility compared to existing methods. Additionally, our analysis of neuron knowledge sensitivity across various MLP layers provides valuable insights into the mechanics of safety alignment and model knowledge editing.
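
The mechanism described in the abstract (score neurons, select a protected subset U, prune the gradients of U during unlearning) can be illustrated with a toy sketch. This is not the authors' implementation. The scoring rule (summed |weight × gradient| per neuron), the use of a gradient-ascent step on a forget loss, and every function and variable name below are assumptions made for illustration, since the abstract does not specify them.

```python
# Toy sketch of gradient masking for constrained unlearning (pure Python).
# ASSUMPTIONS: the scoring rule, the gradient-ascent update, and all names
# here are hypothetical; the abstract does not define them.

def score_neurons(weights, retain_grads):
    """Score each neuron (row) by summed |w * g| on retained-knowledge data."""
    return [
        sum(abs(w * g) for w, g in zip(w_row, g_row))
        for w_row, g_row in zip(weights, retain_grads)
    ]

def select_protected(scores, k):
    """Return the index set U of the k highest-scoring neurons."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def masked_unlearning_step(weights, forget_grads, protected, lr):
    """Take one ascent step on the forget loss, skipping neurons in U."""
    updated = []
    for i, (w_row, g_row) in enumerate(zip(weights, forget_grads)):
        if i in protected:
            updated.append(list(w_row))  # gradient pruned: row left unchanged
        else:
            updated.append([w + lr * g for w, g in zip(w_row, g_row)])
    return updated

# One MLP weight matrix with three neurons (rows) and made-up gradients.
weights = [[0.5, -1.0], [0.1, 0.2], [2.0, 0.3]]
retain_grads = [[1.0, 1.0], [0.1, 0.1], [0.5, 0.5]]
forget_grads = [[0.2, 0.2], [0.4, -0.4], [1.0, 1.0]]

scores = score_neurons(weights, retain_grads)
protected = select_protected(scores, k=1)
new_weights = masked_unlearning_step(weights, forget_grads, protected, lr=0.1)
```

In this toy setup the first neuron has the highest score, so it forms U and its weights stay fixed while the other two rows are updated. In a real model the same idea would typically be applied by zeroing the relevant gradient entries before the optimizer step.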

Original language: English
Title of host publication: Long Papers
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Publisher: Association for Computational Linguistics (ACL)
Pages: 25515-25529
Number of pages: 15
ISBN (Electronic): 9798891762510
State: Published - 2025
Externally published: Yes
Event: 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 - Vienna, Austria
Duration: 27 Jul 2025 to 1 Aug 2025

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
Volume: 1
ISSN (Print): 0736-587X

Conference

Conference: 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Country/Territory: Austria
City: Vienna
Period: 27/07/25 to 1/08/25
