Data Efficient Deep Reinforcement Learning With Action-Ranked Temporal Difference Learning

  • Harbin Institute of Technology Shenzhen

Research output: Contribution to journal › Article › peer-review

Abstract

In value-based deep reinforcement learning (RL), value function approximation errors lead to suboptimal policies. Temporal difference (TD) learning is one of the most important methods for approximating the state-action (Q) value function. In TD learning, it is critical to estimate the Q values of greedy actions accurately, because a more accurate target Q value improves the accuracy of the Q value estimates. To this end, we propose an action-ranked TD learning method that enhances the performance of deep RL by weighting each TD error according to the rank of its corresponding state-action pair's value among all the Q values in a state. The proposed method provides more accurate target values for TD learning, making the estimation of the Q value more accurate. We apply the proposed method to a representative value-based deep RL algorithm, and the results show that it outperforms baselines on 31 out of 40 Atari games. Furthermore, we extend the proposed method to multi-agent deep RL. To adaptively determine the hyperparameter in action-ranked TD learning, we propose meta action-ranked TD learning. A series of experiments quantitatively verifies that our methods outperform baselines on Atari games, StarCraft-II, and Grid World environments.
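The core idea of weighting each TD error by the rank of its action's Q value can be illustrated with a minimal NumPy sketch. The abstract does not give the weighting function, so the geometric form `beta ** rank` is purely an assumption made here for illustration, as are the names `action_ranks`, `rank_weighted_td_loss`, and `beta`; this is not the authors' implementation.

```python
import numpy as np


def action_ranks(q_values):
    """Rank of every action within its state: 0 is the greedy (highest-Q) action."""
    order = np.argsort(-q_values, axis=1)            # actions sorted by descending Q
    ranks = np.empty_like(order)
    rows = np.arange(q_values.shape[0])[:, None]
    ranks[rows, order] = np.arange(q_values.shape[1])[None, :]
    return ranks


def rank_weighted_td_loss(q_values, actions, rewards, next_q_values, dones,
                          gamma=0.99, beta=0.5):
    """Squared TD error weighted by the rank of the action that was taken.

    ASSUMPTION: weight = beta ** rank, so greedy actions get weight 1 and
    lower-ranked actions get geometrically smaller weights. The paper's
    actual weighting scheme is not stated in the abstract.
    """
    batch = np.arange(len(actions))
    # Standard one-step Q-learning target.
    targets = rewards + gamma * (1.0 - dones) * next_q_values.max(axis=1)
    td_errors = targets - q_values[batch, actions]
    ranks = action_ranks(q_values)[batch, actions]
    weights = beta ** ranks
    loss = np.mean(weights * td_errors ** 2)
    return loss, weights


# Two states, three actions each; the first transition took the greedy action.
q = np.array([[1.0, 3.0, 2.0],
              [0.5, 0.1, 0.9]])
loss, weights = rank_weighted_td_loss(
    q_values=q,
    actions=np.array([1, 1]),
    rewards=np.array([1.0, 0.0]),
    next_q_values=np.zeros((2, 3)),
    dones=np.array([1.0, 1.0]),
)
```

In this sketch the first transition (greedy action, rank 0) keeps its full TD error, while the second (rank 2) is down-weighted. Here `beta` merely stands in for the kind of hyperparameter that the abstract says the meta variant determines adaptively.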

Original language: English
Pages (from-to): 2949-2961
Number of pages: 13
Journal: IEEE Transactions on Emerging Topics in Computational Intelligence
Volume: 8
Issue number: 4
State: Published - 2024
Externally published: Yes

Keywords

  • Reinforcement learning
  • action-rank
  • data efficient
  • meta learning
  • temporal difference
