Policy gradient methods

From Scholarpedia

This article is undergoing 2 initial reviews; It may contain inaccuracies and unapproved changes made by anonymous reviewers.

Peer review status

You are not logged in.

Reviewer A: Awaits author's response (last modifications by reviewer - 93 days ago)
Reviewer B: Awaits author's response (last modifications by reviewer - 82 days ago)
This article is not accepted to Scholarpedia yet. It may contain inaccuracies and unapproved changes made by anonymous reviewers.
Dr. Jan Peters prefers wiki-style peer-review process: reviewers are encouraged to modify the article directly.

Author: Dr. Jan Peters, Max-Planck Institute, Germany & University of Southern California, USC

Policy gradient methods are a type of reinforcement learning techniques that rely upon optimizing parametrized policies with respect to the expected return (long-term cumulative reward) by gradient descent. They do not suffer from of many problems that have been marring traditional reinforcement learning approaches such as the lack of guarantees of a value function, the intractability problem resulting from uncertain state information and the complexity arising from continuous states & actions.


Contents

Introduction

Reinforcement learning is probably the most general framework in which reward-related learning problems of animals, humans or machine can be phrased. However, most of the methods proposed in the reinforcement learning community are not yet applicable in many problems such as robotics, motor control, etc. This inapplicability may result from problems with uncertain state information. Thus, those systems need to be modeled as partially observable markov decision problems which often results in excessive computational demands. Most traditional reinforcement learning methods have no convergence guarantees and there exist even divergence examples. Continuous states and actions in high dimensional spaces cannot be treated by most off-the-self reinforcement learning approaches.

Policy gradient methods differ significantly as they do not suffer from these problems in the same way. For example, uncertainty in the state might degrade the performance of the policy (if no additional state estimator is being used) but the optimization techniques for the policy do not need to be changed. Continuous states and actions can be dealt with in the exact same way as discrete ones while the learning performance is often additionally increased. Convergence at least to a local optimum is guaranteed.


The advantages of policy gradient methods for real world applications are numerous. Among the most important ones are that the policy representations can be chosen so that it is meaningful for the task and can incorporate previous domain knowledge, that often fewer parameters are needed in the learning process than in value-function based approaches and that there is a variety of different algorithms for policy gradient estimation in the literature which have a rather strong theoretical underpinning. Additionally, policy gradient methods can be used either model-free or model-based as they are a generic formulation. As a result, most real-world robotics applications of reinforcement learning employ some kind of policy gradient approach, see Peters & Schaal, 2008 for some examples.

Assumptions and Notation

We assume that we can model the control system in a discrete-time manner and we will denote the current time step by k. In order to take possible stochasticity of the plant into account, we denote it using a probability distribution \mathbf{x}_{k+1}\sim p\left(  \mathbf{x} _{k+1}\left\vert \mathbf{x}_{k},\mathbf{u}_{k}\right.  \right) as model where \mathbf{u}_{k}\in\mathbb{R}^{M} denotes the current action, and \mathbf{x}_{k}, \mathbf{x}_{k+1}\in\mathbb{R}^{N} denote the current and next state, respectively. We furthermore assume that actions are generated by a policy \mathbf{u}_{k} \sim\pi_{\mathbf{\theta}}\left(  \mathbf{u}_{k}\left\vert \mathbf{x} _{k}\right.  \right) which is modeled as a probability distribution in order to incorporate exploratory actions; for some special problems, the optimal solution to a control problem is actually a stochastic controller (Sutton, McAllester, Singh, and Mansour, 2000). The policy is assumed to be parameterized by some policy parameters \mathbf{\theta} \in\mathbb{R}^{K}.The sequence of states and actions forms a trajectory (also called history or roll-out) denoted by \mathbf{\tau}=[\mathbf{x}_{0:H},\mathbf{u}_{0:H}] where H denotes the horizon which can be infinite. At each instant of time, the learning system receives a reward denoted by r\left(  \mathbf{x} _{k},\mathbf{u}_{k}\right) \in\mathbb{R}.

The general goal of policy optimization in reinforcement learning is to optimize the policy parameters \mathbf{\theta}\in\mathbb{R}^{K} so that the expected return

J\left(  \mathbf{\theta}\right)  =E\left\{  \sum\nolimits_{k=0}^{H}a_{k} r_{k}\right\}

is optimized where a_{k} denote time-step dependent weighting factors, often set to a_{k}=\gamma^{k} for discounted reinforcement learning (where \gamma is in [0,1]) or a_{k}=1/H for the average reward case. For real-world applications, we require that any change to the policy parameterization has to be smooth as drastic changes can be hazardous for the actor as well as useful initializations of the policy based on domain knowledge would otherwise vanish after a single update step. For these reasons, policy gradient methods which follow the steepest descent on the expected return are the method of choice. These methods update the policy parameterization according to the gradient update rule

\mathbf{\theta}_{h+1}=\mathbf{\theta}_{h}+\alpha_{h}\left.  \mathbf{\nabla }_{\mathbf{\theta}}J\right\vert _{\mathbf{\theta}=\mathbf{\theta}_{h}},

where \alpha_{h}\in\mathbb{R}^{+} denotes a learning rate and h\in\{0,1,2,\ldots\} the current update number.

If the gradient estimate is unbiased and learning rates fulfill \sum\textstyle_{h=0}^{\infty}\alpha_{h}>0 and \sum\textstyle_{h=0}^{\infty}\alpha_{h}^{2}=\textrm{const}, the learning process is guaranteed to converge at least to a local minimum.

The main problem in policy gradient methods is to obtain a good estimator of the policy gradient \left. \mathbf{\nabla}_{\mathbf{\theta}}J\right\vert _{\mathbf{\theta}=\mathbf{\theta}_{h}}. In robotics and control, people have traditionally used deterministic model-based methods for obtaining the gradient (Jacobson & Mayne, 1970; Dyer & McReynolds, 1970; Hasdorff, 1976). However, in order to become autonomous we cannot expect to be able to model every detail of the system. Therefore, we need to estimate the policy gradient simply from data generated during the execution of a task, i.e., without the need for a model. In this section, we will study different approaches and discuss their properties.

Approaches to Policy Gradient Estimation

The literature on policy gradient methods has yielded a variety of estimation methods over the last years. The most prominent approaches, which have been applied to robotics are finite-difference and likelihood ratio methods, better known as REINFORCE in reinforcement learning.

Finite-difference Methods


Finite-difference methods are among the oldest policy gradient approaches; they originated from the stochastic simulation community and are quite straightforward to understand. The policy parameterization is varied by small increments \mathbf{\Delta\theta}_{i} and for each policy parameter variation \mathbf{\theta}_{h}+\mathbf{\Delta\theta}_{i} roll-outs (or trajectories) are performed which generate estimates \Delta\hat{J}_{j}\approx J(\mathbf{\theta }_{h}+\mathbf{\Delta\theta}_{i})-J_{\text{ref}} of the expected return. There are different ways of choosing the reference value J_{\text{ref}}, e.g. forward-difference estimators with J_{\text{ref}}=J(\mathbf{\theta}_{h}) and central-difference estimators with J_{\text{ref}}=J(\mathbf{\theta}_{h}-\mathbf{\Delta\theta}_{i}). The policy gradient estimate \mathbf{g}_{\text{FD}}\approx\left. \mathbf{\nabla}_{\mathbf{\theta}}J\right\vert_{\mathbf{\theta}  =\mathbf{\theta}_{h}} can be estimated by regression yielding

\mathbf{g}_{\text{FD}}=\left(  \mathbf{\Delta\Theta}^{T}\mathbf{\Delta\Theta }\right)  ^{-1}\mathbf{\Delta\Theta}^{T}\mathbf{\Delta\hat{J}},

where \mathbf{\Delta\Theta} =[\mathbf{\Delta\theta}_{1},\ldots,\mathbf{\Delta\theta}_{I}]^{T} and \mathbf{\Delta\hat{J}}=[\Delta\hat{J}_{1},\ldots ,\Delta\hat{J}_{I}]^{T} denote the I samples. This approach can be highly efficient in simulation optimization of deterministic systems (Spall, 2003) or when a common history of random numbers (Glynn, 1987) is being used (the later is known as PEGASUS in reinforcement learning (Ng & Jordan, 2000)), and can get close to a convergence rate of O\left(  I^{-1/2}\right) (Glynn, 1987). However, when used on a real system, the uncertainities degrade the performance resulting in convergence rates ranging between O\left(  I^{-1/4}\right) to O\left(  I^{-2/5} \right) depending on the chosen reference value (Glynn, 1987). An implementation of this algorithm is shown below.

input: policy parameterization \mathbf{\theta}_{h}
repeat
    generate policy variation \mathbf{\Delta\theta}_{i}
    estimate \hat{J}_{j}\approx J(\mathbf{\theta}_{h}+\mathbf{\Delta\theta}_{i})=\left\langle \sum\nolimits_{k=0}^{H} a_{k}r_{k}\right\rangle from roll-out
    estimate \hat{J}_{\text{ref}}, e.g., \hat{J}_{\text{ref}}=J(\mathbf{\theta}_{h}-\mathbf{\Delta\theta}_{i}) from roll-out
    compute \Delta\hat{J}_{j}\approx J(\mathbf{\theta}_{h}+\mathbf{\Delta\theta}_{i})-J_{\text{ref}}
    compute gradient \mathbf{g}_{\text{FD}}=\left( \mathbf{\Delta\Theta}^{T}\mathbf{\Delta\Theta}\right)  ^{-1}\mathbf{\Delta \Theta}^{T}\mathbf{\Delta\hat{J}}
until  gradient estimate \mathbf{g}_{\text{FD}} converged
return gradient estimate \mathbf{g}_{\text{FD}}

Due to the simplicity of this approach, such methods have been successfully applied to numerous applications. However, the straightforward application to robotics is not without peril as the generation of the \mathbf{\Delta\theta}_{j} requires proper knowledge of the system, as badly chosen \mathbf{\Delta\theta}_{j} can destabilize the policy so that the system becomes instable and the gradient estimation process is prone to fail. If the parameters differ highly in scale, significant difficulties could be the consequences.

Likelihood Ratio Methods and REINFORCE

Likelihood ratio methods are driven by a different important insight. Assume that trajectories \mathbf{\tau} are generated from a system by roll-outs, i.e., \mathbf{\tau}\sim p_{\mathbf{\theta}}\left(  \mathbf{\tau}\right) =p\left(  \left. \mathbf{\tau}\right\vert \mathbf{\theta}\right) with rewards r(\mathbf{\tau})=\sum\textstyle_{k=0}^{H}a_{k}r_{k}. In this case, the policy gradient can be estimated using the likelihood ratio (see e.g. Glynn, 1987; Aleksandrov, Sysoyev, and Shemeneva, 1968) better known as REINFORCE (Williams, 1992) trick, i.e., by

\mathbf{\nabla}_{\mathbf{\theta}}J\left(  \mathbf{\theta}\right) = \int_{\mathbb{T}}\mathbf{\nabla}_{\mathbf{\theta}}p_{\mathbf{\theta}}\left( \mathbf{\tau}\right)  r(\mathbf{\tau})d\mathbf{\tau} = E\left\{  \mathbf{\nabla }_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left(  \mathbf{\tau}\right) r(\mathbf{\tau})\right\},
as
\int_{\mathbb{T}}\mathbf{\nabla}_{\mathbf{\theta}}p_{\mathbf{\theta} }\left(  \mathbf{\tau}\right)  r(\mathbf{\tau})d\mathbf{\tau}=\int _{\mathbb{T}}p_{\mathbf{\theta}}\left(  \mathbf{\tau}\right)  \mathbf{\nabla }_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left(  \mathbf{\tau}\right) r(\mathbf{\tau})d\mathbf{\tau}.

Importantly, the derivative \mathbf{\nabla }_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left(  \mathbf{\tau}\right) can be computed without knowledge of the generating distribution p_{\mathbf{\theta}}\left(  \mathbf{\tau}\right) as

p_{\mathbf{\theta}}\left(  \mathbf{\tau}\right)  =p(\mathbf{x}_{0})\prod\nolimits_{k=0} ^{H}p\left(  \mathbf{x}_{k+1}\left\vert \mathbf{x}_{k},\mathbf{u}_{k}\right. \right)  \pi_{\mathbf{\theta}}\left(  \mathbf{u}_{k}\left\vert \mathbf{x} _{k}\right.  \right)

implies that

\mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left(  \mathbf{\tau }\right)  =\sum\nolimits_{k=0}^{H}\mathbf{\nabla}_{\mathbf{\theta}}\log \pi_{\mathbf{\theta}}\left(  \mathbf{u}_{k}\left\vert \mathbf{x}_{k}\right. \right)

as only the policy depends on \mathbf{\theta}.

Thus, the derivatives through the control system do not have to be computed. This result makes an important difference: in stochastic system optimization, finite difference estimators are often preferred as the derivative through the system is required but not known. In policy search, we always know the derivative of the policy with respect to its parameters and therefore we can make use of the theoretical advantages of likelihood ratio gradient estimators. As

\int_{\mathbb{T}}\mathbf{\nabla}_{\mathbf{\theta} }p_{\mathbf{\theta}}\left(  \mathbf{\tau}\right)  d\mathbf{\tau}=0,

a constant baseline can be inserted resulting into the gradient estimator

\mathbf{\nabla}_{\mathbf{\theta}}J\left(  \mathbf{\theta}\right)  =E\left\{ \mathbf{\nabla}_{\mathbf{\theta}}\log p_{\mathbf{\theta}}\left(  \mathbf{\tau }\right)  \left(  r(\mathbf{\tau})-b\right)  \right\}  ,

where the baseline b\in\mathbb{R} can be chosen arbitrarily (Williams, 1992) but usually with the goal to minimize the variance of the gradient estimator. See Peters & Schaal, 2008 for an overview of how to choose the baseline optimally. Therefore, the general path likelihood ratio estimator or episodic REINFORCE gradient estimator is given by

\mathbf{g}_{\text{RF}}=\left\langle \left(  \sum\nolimits_{k=0}^{H} \mathbf{\nabla}_{\mathbf{\theta}}\log\pi_{\mathbf{\theta}}\left( \mathbf{u}_{k}\left\vert \mathbf{x}_{k}\right.  \right)  \right)  \left( \sum\nolimits_{l=0}^{H}a_{l}r_{l}-b\right)  \right\rangle ,

where \left\langle \cdot\right\rangle denotes the average over trajectories. This type of method is guaranteed to converge to the true gradient at the fastest theoretically possible pace of O\left( I^{-1/2}\right) where I denotes the number of roll-outs (Glynn, 1987) even if the data is generated from a highly stochastic system. An implementation of this algorithm will be shown below together with the estimator for the optimal baseline.


input: policy parameterization \mathbf{\theta}_{h}
repeat
    perform a trial and obtain \mathbf{x}_{0:H},\mathbf{u}_{0:H},r_{0:H}
    for each  gradient element g_{h}
         estimate optimal baseline
              b^{h}=\frac{\left\langle \left(  \sum\nolimits_{k=0}^{H} \mathbf{\nabla}_{\theta_{h}}\log\pi_{\mathbf{\theta}}\left(  \mathbf{u}_{k}\left\vert \mathbf{x}_{k}\right.  \right)  \right)  ^{2}\sum\nolimits_{l=0}^{H}a_{l}r_{l}\right\rangle }{\left\langle \left( \sum\nolimits_{k=0}^{H}\mathbf{\nabla}_{\theta_{h}}\log\pi_{\mathbf{\theta} }\left(  \mathbf{u}_{k}\left\vert \mathbf{x}_{k}\right.  \right)  \right) ^{2}\right\rangle }
         estimate the gradient element
              g_{h}=\left\langle \left(  \sum\nolimits_{k=0}^{H}\mathbf{\nabla }_{\theta_{h}}\log\pi_{\mathbf{\theta}}\left(  \mathbf{u}_{k}\left\vert \mathbf{x}_{k}\right.  \right)  \right)  \left(  \sum\nolimits_{l=0}^{H} a_{l}r_{l}-b^{h}\right)  \right\rangle
    end for 
until  gradient estimate \mathbf{g}_{\text{RF}}=[g_{1},\ldots,g_{h}] converged
return gradient estimate \mathbf{g}_{\text{RF}}=[g_{1},\ldots,g_{h}]


Besides the theoretically faster convergence rate, likelihood ratio gradient methods have a variety of advantages in comparison to finite difference methods when applied to robotics. As the generation of policy parameter variations is no longer needed, the complicated control of these variables can no longer endanger the gradient estimation process. Furthermore, in practice, already a single roll-out can suffice for an unbiased gradient estimate viable for a good policy update step, thus reducing the amount of roll-outs needed. Finally, this approach has yielded the most real-world robotics results (Peters & Schaal, 2008).

Natural Policy Gradients

One of the main reasons for using policy gradient methods is that we intend to do just a small change \mathbf{\Delta\theta} to the policy \pi_{\mathbf{\theta}} while improving the policy. However, the meaning of small is ambiguous. When using the Euclidian metric of \sqrt{\mathbf{\Delta \theta}^{T}\mathbf{\Delta\theta}}, then the gradient is different for every parameterization \mathbf{\theta} of the policy \pi_{\mathbf{\theta}} even if these parameterization are related to each other by a linear transformation (Kakade, 2002). This problem poses the question of how we can measure the closeness between the current policy and the updated policy based upon the distribution of the paths generated by each of these. In statistics, a variety of distance measures for the closeness of two distributions (e.g., p_{\mathbf{\theta}}\left( \mathbf{\tau}\right) and p_{\mathbf{\theta}+\mathbf{\Delta\theta}}\left( \mathbf{\tau}\right)) have been suggested, e.g., the Kullback-Leibler divergence d_{\text{KL}}\left( p_{\mathbf{\theta}},p_{\mathbf{\theta}+\mathbf{\Delta\theta}}\right), the Hellinger distance d_{\text{HD}} and others (Su & Gibbs, 2002). Many of these distances (e.g., the previously mentioned ones) can be approximated by the same second order Taylor expansion, i.e., by

d_{\text{KL}}\left(  p_{\mathbf{\theta}},p_{\mathbf{\theta}+\mathbf{\Delta \theta}}\right)  \approx\mathbf{\Delta\theta}^{T}\mathbf{F}_{\mathbf{\theta} }\text{ }\mathbf{\Delta\theta},

where

\mathbf{F}_{\mathbf{\theta}}=\int_{\mathbb{T}}p_{\mathbf{\theta} }\left(  \mathbf{\tau}\right)  \nabla\log p_{\mathbf{\theta}}\left( \mathbf{\tau}\right)  \nabla\log p_{\mathbf{\theta}}\left(  \mathbf{\tau }\right)  ^{T}d\mathbf{\tau}=\left\langle \nabla\log p_{\mathbf{\theta} }\left(  \mathbf{\tau}\right)  \nabla\log p_{\mathbf{\theta}}\left( \mathbf{\tau}\right)  ^{T}\right\rangle

is known as the Fisher-information matrix. Let us now assume that we restrict the change of our policy to the length of our learning rate or step-size \alpha_{h}, i.e., we have a restricted step-size gradient descent approach well-known in the optimization literature (Fletcher & Fletcher, 2000), and that our parameter change is given by

\mathbf{\Delta\theta}=\operatorname{argmax}_{\mathbf{\Delta\tilde{\theta}} }\frac{\alpha_{h}\mathbf{\Delta\tilde{\theta}}^{T}\mathbf{\nabla }_{\mathbf{\theta}}J}{\mathbf{\Delta\tilde{\theta}}^{T}\mathbf{F} _{\mathbf{\theta}}\text{ }\mathbf{\Delta\tilde{\theta}}}=\alpha_{h} \mathbf{F}_{\mathbf{\theta}}^{-1}\mathbf{\nabla}_{\mathbf{\theta}}J,

where \mathbf{\nabla}_{\mathbf{\theta}}J denotes the `vanilla' policy gradient from Section Finite-difference Methods. This update step can be interpreted as follows: determine the maximal improvement \mathbf{\Delta\tilde{\theta}} of the policy for a constant fixed change of the policy \mathbf{\Delta\tilde{\theta}}^{T}\mathbf{F}_{\mathbf{\theta}}\mathbf{\Delta\tilde{\theta}}.


Figure 1: When plotting the expected return landscape for a simple problem such as an 1d linear quadratic regulation, the differences between ‘vanilla’ and natural policy gradients becomes apparent.
Enlarge
Figure 1: When plotting the expected return landscape for a simple problem such as an 1d linear quadratic regulation, the differences between ‘vanilla’ and natural policy gradients becomes apparent.

This type of approach is known as Natural Policy Gradients and has its separate origin in supervised learning (Amari, 1998). It was first suggested in the context of reinforcement learning by Kakade (2002) and has been explored in greater depth in (Bagnell & Schneider, 2003; Peters, Vijayakumar & Schaal, 2003, 2005; Peters & Schaal, 2008). The strongest theoretical advantage of this approach is that its performance no longer depends on the parameterization of the policy and is therefore safe to be used for arbitrary policies. In practice, the learning process often converges significantly faster for most practical cases.

One of the fastest general algorithms for estimating natural policy gradients which does not need complex parameterized baselines is the episodic natural actor critic. This algorithm, originally derived in (Peters, Vijayakumar & Schaal, 2003), can be considered the `natural' version of REINFORCE with a baseline optimal for this gradient estimator. However, for steepest descent with respect to a metric, the baseline also needs to minimize the variance with respect to the same metric. In this case, we can minimize the whole covariance matrix of the natural gradient estimate \mathbf{\Delta\hat{\theta}} given by

\mathbf{\Sigma} =\operatorname{Cov}\left\{  \mathbf{\Delta\hat{\theta} }\right\}  _{\mathbf{F}_{\mathbf{\theta}}}=E\left\{  \left(  \mathbf{\Delta\hat{\theta}-F}_{\mathbf{\theta}} ^{-1}\mathbf{g}_{\text{RF}}\left(  b\right)  \right)  ^{T}\mathbf{F} _{\mathbf{\theta}}\left(  \mathbf{\Delta\hat{\theta}-F}_{\mathbf{\theta}} ^{-1}\mathbf{g}_{\text{RF}}\left(  b\right)  \right)  \right\}  ,
with
\mathbf{g}_{\text{RF}}\left(  b\right)  =\left\langle \nabla\log p_{\mathbf{\theta}}\left(  \mathbf{\tau}\right)  \left(  r\left( \mathbf{\tau}\right)  -b\right)  \right\rangle
being the REINFORCE gradient

with baseline b. As outlined in (Peters, Vijayakumar & Schaal, 2003), it can be shown that the minimum-variance unbiased natural gradient estimator can be determined as shown below.

input: policy parameterization \mathbf{\theta}_{h}
repeat
    perform M trials and obtain \mathbf{x}_{0:H},\mathbf{u}_{0:H},r_{0:H} for each trial
    Obtain the sufficient statistics
         Policy derivatives \mathbf{\psi}_{k}=\mathbf{\nabla}_{\theta}\log\pi_{\mathbf{\theta}}\left(  \mathbf{u}_{k}\left\vert \mathbf{x}_{k}\right.  \right)
         Fisher matrix \mathbf{F}_{\mathbf{\theta}}=\left\langle \left( \sum\nolimits_{k=0}^{H}\mathbf{\psi}_{k}\right)\left(  \sum\nolimits_{l=0}^{H}\mathbf{\psi}_{l}\right)  ^{T}\right\rangle
         Vanilla gradient \mathbf{g}=\left\langle\left(\sum\nolimits_{k=0}^{H}\mathbf{\psi}_{k}\right)  \left(\sum\nolimits_{l=0}^{H}a_{l}r_{l}\right)  \right\rangle
         Eligbility \mathbf{\phi}=\left\langle \left(\sum\nolimits_{k=0}^{H}\mathbf{\psi}_{k}\right)  \right\rangle
         Average reward \bar{r}=\left\langle\sum\nolimits_{l=0}^{H}a_{l}r_{l}\right\rangle
    Obtain natural gradient by computing
         Baseline b=\mathbf{Q}\left(  \bar {r}-\mathbf{\phi}^{T}\mathbf{F}_{\mathbf{\theta}}^{-1}\mathbf{g}\right)
              with \mathbf{Q}=M^{-1}\left(  1+\mathbf{\phi}^{T}\left( M\mathbf{F}_{\mathbf{\theta}}-\mathbf{\phi\phi}^{T}\right)^{-1}\mathbf{\phi}\right)
         Natural gradient \mathbf{g}_{\text{NG}}=\mathbf{F}_{\mathbf{\theta}}^{-1}\left(  \mathbf{g}-\mathbf{\phi}b\right)
until  gradient estimate \mathbf{g}_{\text{NG}}=[g_{1},\ldots,g_{h}] converged
return gradient estimate \mathbf{g}_{\text{NG}}=[g_{1},\ldots,g_{h}]


Conclusion

We have presented an quick overview on policy gradient methods. While many details needed to be omitted and may be found in (Peters & Schaal, 2008), this entry roughly represents the state of the art in policy gradient methods. All three major ways of estimating first order gradients, i.e., finite-difference gradients, vanilla policy gradients and natural policy gradients are discussed in this paper and practical algorithms are given.

References

  • V. Aleksandrov, V. Sysoyev, and V. Shemeneva, Stochastic optimization, Engineering Cybernetics, vol.~5, pp. 11-16, 1968.
  • S. Amari, Natural gradient works efficiently in learning, Neural Computation, vol.~10, 1998.
  • J. Bagnell and J. Schneider, Covariant policy search, International Joint article on Artificial Intelligence, 2003.
  • J. Baxter and P. Bartlett, Direct gradient-based reinforcement learning, Journal of Artificial Intelligence Research, 1999.
  • A. Berny, Statistical machine learning and combinatorial optimization, In L. Kallel, B. Naudts, and A. Rogers, editors, Theoretical Aspects of Evolutionary Computing, Lecture Notes in Natural Computing, 2000.
  • P. Dyer and S. R. McReynolds, The Computation and Theory of Optimal Control. New York: Academic Press, 1970.
  • R. Fletcher and R. Fletcher, Practical Methods of Optimization. John Wiley \& Sons, 2000.
  • P. Glynn, Likelihood ratio gradient estimation for stochastic systems, Communications of the ACM, vol.~33, no.~10, pp. 75-84, October 1990.
  • P. Glynn, Likelihood ratio gradient estimation: an overview, in Proceedings of the 1987 Winter Simulation Conference, Atlanta, GA, 1987, pp. 366--375.
  • E. Greensmith, P. Bartlett, and J. Baxter, Variance reduction techniques for gradient estimates in reinforcement learning, Advances in Neural Information Processing Systems, 2001.
  • V. Gullapalli, Associative reinforcement learning of real-value functions, SMC, 1991.
  • L. Hasdorff, Gradient optimization and nonlinear control. John Wiley & Sons, 1976.
  • D.~H. Jacobson and D.~Q. Mayne, Differential Dynamic Programming. New York: American Elsevier Publishing Company, Inc., 1970.
  • G. Lawrence, N. Cowan, and S. Russell, Efficient gradient estimation for motor control learning, in Proc. UAI-03, Acapulco, Mexico, 2003.
  • A.Y. Ng and M. Jordan, PEGASUS: A policy search method for large MPDs and POMDPs, in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000.
  • J. Peters, S. Vijayakumar, and S. Schaal, Reinforcement learning for humanoid robotics, in IEEE/RSJ International Conference on Humanoid Robotics, 2003.
  • J. Peters, S. Vijayakumar, and S. Schaal, Natural actor-critic, in Proceedings of the European Machine Learning Conference (ECML)}, 2005.
  • J. C. Spall, Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Hoboken, NJ: Wiley, 2003.
  • F. Su and A. Gibbs, On choosing and bounding probability metrics, International Statistical Review, vol. 70, no. 3, pp. 419-435, 2002.
  • R. Sutton, D. McAllester, S. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, Advances in Neural Information Processing Systems, 2000.
  • R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, vol. 8, no. 23, 1992.

See also

Reinforcement learning

Invited by: Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia
Action editor: Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia
For authors