
The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions

  • Wenbo Pan*
  • Zhichao Liu
  • Qiguang Chen
  • Xiangyang Zhou
  • Haining Yu
  • Xiaohua Jia
  • *Corresponding author for this work
  • City University of Hong Kong
  • Harbin Institute of Technology
  • Harbin Institute of Technology
  • Microsoft USA

Research output: Contribution to journal, Conference article, peer-review

Abstract

Large Language Models’ safety-aligned behaviors, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multi-dimensional directions. Namely, we study the vector space of representation shifts during safety fine-tuning on Llama 3 8B for refusing jailbreaks. By studying orthogonal directions in the space, we first find that a dominant direction governs the model’s refusal behavior, while multiple smaller directions represent distinct and interpretable features like hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model’s refusal representation. Finally, we demonstrate that removing certain trigger tokens in harmful queries can mitigate these directions to bypass the learned safety capability, providing new insights on understanding safety alignment vulnerability from a multi-dimensional perspective. Code and artifacts are available at https://github.com/BMPixel/safety-residual-space.
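
As a rough illustration of what "orthogonal directions in the space of representation shifts" can mean, the sketch below decomposes the difference between fine-tuned and base activations with a plain SVD and then projects one direction out. It is an assumption-laden toy, not the authors' pipeline (their code is in the repository linked above): the function names `residual_directions` and `ablate`, the synthetic data, and the choice of an uncentered SVD are all hypothetical and chosen only for this example.

```python
import numpy as np

def residual_directions(base_acts, tuned_acts, k=2):
    """Top-k orthonormal directions of the activation shift, with their strengths.

    base_acts, tuned_acts: arrays of shape (n_samples, d_model) holding
    activations for the same inputs before and after fine-tuning.
    """
    shifts = tuned_acts - base_acts
    # Assumption: no mean-centering, so a consistent shift counts as a direction.
    _, strengths, vt = np.linalg.svd(shifts, full_matrices=False)
    return vt[:k], strengths[:k]

def ablate(acts, direction):
    """Remove the component of each activation along one direction."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)

# Synthetic stand-in for model activations (hypothetical data).
rng = np.random.default_rng(0)
n, d = 200, 16
base = rng.normal(size=(n, d))
dominant = np.zeros(d)
dominant[0] = 1.0      # plays the role of the dominant direction
secondary = np.zeros(d)
secondary[1] = 1.0     # plays the role of a smaller, secondary direction
shift = (5.0 + 0.5 * rng.normal(size=(n, 1))) * dominant \
        + rng.normal(size=(n, 1)) * secondary
tuned = base + shift

directions, strengths = residual_directions(base, tuned, k=2)
ablated = ablate(tuned, directions[0])
```

Here the rows of `directions` are mutually orthogonal by construction of the SVD, and `strengths` orders them from the dominant direction to smaller ones. On real models the activations would come from a chosen layer of the base and safety fine-tuned checkpoints rather than from random data.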

Original language: English
Pages (from-to): 47697-47716
Number of pages: 20
Journal: Proceedings of Machine Learning Research
Volume: 267
State: Published - 2025
Externally published: Yes
Event: 42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada
Duration: 13 Jul 2025 to 19 Jul 2025
