Reinforcement Learning in Continuous Time and Space

1 January 2000

journal article
Published by MIT Press in Neural Computation

Vol. 12 (1), 219-245
https://doi.org/10.1162/089976600300015961

Abstract

This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods are formulated: a continuous actor-critic method and a value-gradient-based greedy policy. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. Advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework. The performance of the proposed algorithms is first tested in a nonlinear control task of swinging a pendulum up with limited torque. It is shown in the simulations that (1) the task is accomplished by the continuous actor-critic method in a number of trials several times fewer than by the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than that based on Euler approximation. The algorithms are then tested in a higher-dimensional task: cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model.
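As a rough illustration of the continuous-time TD error mentioned in the abstract, the sketch below assumes the discounted value function V(t) = ∫ e^{-(s-t)/τ} r(s) ds, for which the TD error takes the form δ(t) = r(t) - V(t)/τ + dV/dt, and approximates the time derivative by a backward Euler difference. The function name, argument names, and numerical values are hypothetical and chosen for illustration only; this is not the article's implementation.

```python
def continuous_td_error(r, v_now, v_prev, dt, tau):
    """Backward-Euler sketch of the continuous-time TD error.

    Assumed form: delta(t) = r(t) - V(t)/tau + dV/dt,
    with dV/dt approximated by (V(t) - V(t - dt)) / dt.
    Rearranged: delta = r + ((1 - dt/tau) * V(t) - V(t - dt)) / dt.

    r      : reward observed at time t
    v_now  : value estimate V(t)
    v_prev : value estimate V(t - dt)
    dt     : integration time step (hypothetical parameter name)
    tau    : time constant of the exponential discount
    """
    return r + ((1.0 - dt / tau) * v_now - v_prev) / dt


# Sanity check under the stated assumptions: with a constant reward r,
# the exact value is V = tau * r, so the TD error should vanish.
tau = 2.0
reward = 1.0
exact_value = tau * reward
delta = continuous_td_error(reward, exact_value, exact_value, dt=0.01, tau=tau)
```

With a consistent value estimate the error is zero up to floating-point rounding; a nonzero error is the signal that a value function update would aim to reduce.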

Cited by 654 articles