An Improved Stable Fine-Tuning framework for offline-to-online reinforcement learning

  • Kaidong Zhao
  • Yanjie Li*
  • Ke Lin

*Corresponding author for this work

Harbin Institute of Technology Shenzhen

Research output: Contribution to journal › Article › peer-review

Abstract

Offline reinforcement learning avoids the high cost of environment interaction by training policies on historical data. However, training on suboptimal datasets can limit the quality of the learned policy. Offline-to-online (O2O) fine-tuning is a common approach to improving performance, but because Q-value estimates are inaccurate, directly fine-tuning a policy learned from an offline dataset leads to performance degradation. To address these challenges, we propose an Improved Stable Fine-Tuning (SFT) framework. The SFT framework retains the evaluation network from the offline phase for action selection until the new evaluation network stabilizes, which keeps performance stable throughout fine-tuning. To further enhance the effectiveness of SFT, we develop a comprehensive policy evaluation method that provides more robust assessments of policy performance. Additionally, the SFT algorithm incorporates a periodic parameter reset mechanism, leveraging the fact that temporary declines in policy performance do not compromise overall performance. This mechanism enables the learning policy to escape local optima and explore superior solutions. Experimental results demonstrate that the SFT framework outperforms existing methods in both performance stability and overall effectiveness during fine-tuning.
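
For illustration only, the switching idea described in the abstract (use the offline evaluation network to select actions until the new one stabilizes, and fall back to it again after a parameter reset) might be sketched as below. This is not the authors' implementation. The class name `CriticSwitch`, the stability test (spread of recent losses within a window), and the parameters `window` and `tol` are all assumptions made for this sketch; the abstract does not specify how stability is measured or how resets are scheduled.

```python
from collections import deque


class CriticSwitch:
    """Hypothetical helper: decides which critic ranks candidate actions."""

    def __init__(self, window=5, tol=0.05):
        self.window = window      # assumed: number of recent losses inspected
        self.tol = tol            # assumed: max spread that counts as "stable"
        self.losses = deque(maxlen=window)
        self.use_online = False   # start by trusting the offline critic

    def record_loss(self, td_loss):
        """Record one training loss of the online critic and re-check stability."""
        self.losses.append(td_loss)
        if len(self.losses) == self.window:
            spread = max(self.losses) - min(self.losses)
            if spread <= self.tol:
                self.use_online = True

    def reset(self):
        """Call after a parameter reset: fall back to the offline critic."""
        self.losses.clear()
        self.use_online = False

    def select_action(self, state, candidates, offline_q, online_q):
        """Return the candidate action with the highest value under the active critic."""
        q = online_q if self.use_online else offline_q
        return max(candidates, key=lambda a: q(state, a))
```

In this sketch, `offline_q` and `online_q` are any callables mapping `(state, action)` to a scalar value. Keeping the offline critic in charge of action selection is what would let a temporary drop in the online networks, for example right after a reset, pass without affecting the behavior actually executed.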

Original language: English
Article number: 110578
Journal: Computers and Electrical Engineering
Volume: 127
State: Published - Oct 2025
Externally published: Yes

Keywords

  • Fine-tuning
  • Off-Policy Evaluation
  • Reinforcement learning
