On-line policy gradient estimation with multi-step sampling

  • Yan Jie Li*
  • Fang Cao
  • Xi Ren Cao
  • *Corresponding author for this work
  • Hong Kong University of Science and Technology
  • Harbin Institute of Technology Shenzhen
  • Beijing Jiaotong University

Research output: Contribution to journal, Article, peer-review

Abstract

In this note, we discuss the problem of sample-path-based (on-line) performance gradient estimation for Markov systems. Existing on-line performance gradient estimation algorithms generally require a standard importance sampling assumption. When this assumption does not hold, these algorithms may produce poor estimates of the gradients. We show that the assumption can be relaxed and propose algorithms with multi-step sampling for performance gradient estimation; these algorithms do not require the standard assumption. Simulation examples are given to illustrate the accuracy of the estimates.

Original language: English
Pages (from-to): 3-17
Number of pages: 15
Journal: Discrete Event Dynamic Systems: Theory and Applications
Volume: 20
Issue number: 1
State: Published - Mar 2010
Externally published: Yes

Keywords

  • Markov reward processes
  • On-line estimation
  • Performance potentials
  • Policy gradient
