Policy gradient methods
From Scholarpedia
| This article is undergoing 2 initial reviews; It may contain inaccuracies and unapproved changes made by anonymous reviewers. | ||||||||||||||||||||
Author: Dr. Jan Peters, Max-Planck Institute, Germany & University of Southern California, USC
Policy gradient methods are a type of reinforcement learning techniques that rely upon optimizing parametrized policies with respect to the expected return (long-term cumulative reward) by gradient descent. They do not suffer from of many problems that have been marring traditional reinforcement learning approaches such as the lack of guarantees of a value function, the intractability problem resulting from uncertain state information and the complexity arising from continuous states & actions.
Contents |
Introduction
Reinforcement learning is probably the most general framework in which reward-related learning problems of animals, humans or machine can be phrased. However, most of the methods proposed in the reinforcement learning community are not yet applicable in many problems such as robotics, motor control, etc. This inapplicability may result from problems with uncertain state information. Thus, those systems need to be modeled as partially observable markov decision problems which often results in excessive computational demands. Most traditional reinforcement learning methods have no convergence guarantees and there exist even divergence examples. Continuous states and actions in high dimensional spaces cannot be treated by most off-the-self reinforcement learning approaches.
Policy gradient methods differ significantly as they do not suffer from these problems in the same way. For example, uncertainty in the state might degrade the performance of the policy (if no additional state estimator is being used) but the optimization techniques for the policy do not need to be changed. Continuous states and actions can be dealt with in the exact same way as discrete ones while the learning performance is often additionally increased. Convergence at least to a local optimum is guaranteed.
The advantages of policy gradient methods for real world applications
are numerous. Among the most important ones are that the policy
representations can be chosen so that it is meaningful for the task
and can incorporate previous domain knowledge, that often fewer
parameters are needed in the learning process than in value-function
based approaches and that there is a variety of different algorithms
for policy gradient estimation in the literature which have a rather
strong theoretical underpinning. Additionally, policy gradient methods
can be used either model-free or model-based as they are a generic
formulation. As a result, most real-world robotics applications of
reinforcement learning employ some kind of policy gradient
approach, see Peters & Schaal, 2008 for some examples.
Assumptions and Notation
We assume that we can model the control system in a discrete-time
manner and we will denote the current time step by
. In
order to take possible stochasticity of the plant into account, we
denote it using a probability distribution
as model where
denotes the current
action, and
,
denote the current and
next state, respectively. We furthermore assume that actions are
generated by a policy
which is modeled as a probability
distribution in order to incorporate exploratory actions; for some
special problems, the optimal solution to a control problem is
actually a stochastic controller (Sutton, McAllester, Singh, and
Mansour, 2000). The policy is assumed to be parameterized by some
policy parameters
.The
sequence of states and actions forms a trajectory (also called history
or roll-out) denoted by
where
denotes the horizon which can be infinite. At each
instant of time, the learning system receives a reward denoted by
.
The general goal of policy optimization in reinforcement learning is
to optimize the policy parameters
so that the expected
return

is optimized where
denote time-step dependent
weighting factors, often set to
for
discounted reinforcement learning (where
is in
) or
for the average reward
case. For real-world applications, we require that any change to the
policy parameterization has to be smooth as drastic changes can be
hazardous for the actor as well as useful initializations of the
policy based on domain knowledge would otherwise vanish after a single
update step. For these reasons, policy gradient methods which follow
the steepest descent on the expected return are the method of choice.
These methods update the policy parameterization according to the
gradient update rule

where
denotes a learning rate
and
the current update number.
If the gradient estimate is unbiased and learning rates fulfill
and
,
the learning process is guaranteed to converge at least to a local
minimum.
The main problem in policy gradient methods is to obtain a good
estimator of the policy gradient
. In robotics and
control, people have traditionally used deterministic model-based
methods for obtaining the gradient (Jacobson & Mayne, 1970; Dyer &
McReynolds, 1970; Hasdorff, 1976). However, in order to become
autonomous we cannot expect to be able to model every detail of the
system. Therefore, we need to estimate the policy gradient simply from
data generated during the execution of a task, i.e., without the need
for a model. In this section, we will study different approaches and
discuss their properties.
Approaches to Policy Gradient Estimation
The literature on policy gradient methods has yielded a variety of estimation methods over the last years. The most prominent approaches, which have been applied to robotics are finite-difference and likelihood ratio methods, better known as REINFORCE in reinforcement learning.
Finite-difference Methods
Finite-difference methods are among the oldest policy gradient
approaches; they originated from the stochastic simulation community
and are quite straightforward to understand. The policy
parameterization is varied by small increments
and for each policy parameter
variation
roll-outs (or trajectories) are performed which generate estimates
of the expected
return. There are different ways of choosing the reference value
, e.g. forward-difference estimators with
and
central-difference estimators with
.
The policy gradient estimate
can be estimated by regression yielding

where
and
denote the
samples. This
approach can be highly efficient in simulation optimization of
deterministic systems (Spall, 2003) or when a common history of random
numbers (Glynn, 1987) is being used (the later is known as PEGASUS in
reinforcement learning (Ng & Jordan, 2000)), and can get close to a
convergence rate of
(Glynn,
1987). However, when used on a real system, the uncertainities degrade
the performance resulting in convergence rates ranging between
to
depending on the chosen reference value (Glynn,
1987). An implementation of this algorithm is shown below.
input: policy parameterizationrepeat generate policy variation
estimate
from roll-out estimate
, e.g.,
from roll-out compute
compute gradient
until gradient estimate
converged return gradient estimate
![]()
Due to the simplicity of this approach, such methods have been
successfully applied to numerous applications. However, the
straightforward application to robotics is not without peril as the
generation of the
requires
proper knowledge of the system, as badly chosen
can destabilize the policy so
that the system becomes instable and the gradient estimation process
is prone to fail. If the parameters differ highly in scale,
significant difficulties could be the consequences.
Likelihood Ratio Methods and REINFORCE
Likelihood ratio methods are driven by a different important insight.
Assume that trajectories
are generated from
a system by roll-outs, i.e.,
with rewards
. In
this case, the policy gradient can be estimated using the likelihood
ratio (see e.g. Glynn, 1987; Aleksandrov, Sysoyev, and Shemeneva,
1968) better known as REINFORCE (Williams, 1992) trick, i.e., by


Importantly, the derivative
can
be computed without knowledge of the generating distribution
as
implies that

as only the policy depends on
.
Thus, the derivatives through the control system do not have to be computed. This result makes an important difference: in stochastic system optimization, finite difference estimators are often preferred as the derivative through the system is required but not known. In policy search, we always know the derivative of the policy with respect to its parameters and therefore we can make use of the theoretical advantages of likelihood ratio gradient estimators. As
,a constant baseline can be inserted resulting into the gradient estimator

where the baseline
can be chosen
arbitrarily (Williams, 1992) but usually with the goal to minimize
the variance of the gradient estimator. See Peters & Schaal, 2008
for an overview of how to choose the baseline optimally. Therefore,
the general path likelihood ratio estimator or episodic REINFORCE
gradient estimator is given by

where
denotes the average over trajectories.
This type of method is guaranteed to converge to the
true gradient at the fastest theoretically possible pace of
where
denotes the number of roll-outs (Glynn, 1987)
even if the data is generated from a highly stochastic system. An
implementation of this algorithm will be shown below together with the estimator for
the optimal baseline.
input: policy parameterizationrepeat perform a trial and obtain
for each gradient element
estimate optimal baseline
estimate the gradient element
end for until gradient estimate
converged return gradient estimate
![]()
Besides the theoretically faster convergence rate, likelihood ratio
gradient methods have a variety of advantages in comparison to finite difference methods when applied to robotics. As the generation of
policy parameter variations is no longer needed, the complicated
control of these variables can no longer endanger the gradient
estimation process. Furthermore, in practice, already a single
roll-out can suffice for an unbiased gradient estimate viable for a
good policy update step, thus reducing the amount of roll-outs needed.
Finally, this approach has yielded the most real-world robotics
results (Peters & Schaal, 2008).
Natural Policy Gradients
One of the main reasons for using policy gradient methods is that we
intend to do just a small change
to
the policy
while improving the
policy. However, the meaning of small is ambiguous. When using the
Euclidian metric of
, then the gradient is
different for every parameterization
of
the policy
even if these
parameterization are related to each other by a linear transformation
(Kakade, 2002). This problem poses the question of how we can measure
the closeness between the current policy and the updated policy based
upon the distribution of the paths generated by each of these. In
statistics, a variety of distance measures for the closeness of two
distributions (e.g.,
and
) have been suggested, e.g., the
Kullback-Leibler divergence
,
the Hellinger distance
and others (Su &
Gibbs, 2002). Many of these distances (e.g., the previously mentioned
ones) can be approximated by the same second order Taylor expansion,
i.e., by

where

is known as the Fisher-information matrix. Let us now assume that we
restrict the change of our policy to the length of our learning rate
or step-size
, i.e., we have a restricted
step-size gradient descent approach well-known in the optimization
literature (Fletcher & Fletcher, 2000), and that our parameter change
is given by

where
denotes the
`vanilla' policy gradient from Section
Finite-difference Methods. This update step can be
interpreted as follows: determine the maximal improvement
of the policy for a
constant fixed change of the policy
.
This type of approach is known as Natural Policy Gradients and has its separate origin in supervised learning (Amari, 1998). It was first suggested in the context of reinforcement learning by Kakade (2002) and has been explored in greater depth in (Bagnell & Schneider, 2003; Peters, Vijayakumar & Schaal, 2003, 2005; Peters & Schaal, 2008). The strongest theoretical advantage of this approach is that its performance no longer depends on the parameterization of the policy and is therefore safe to be used for arbitrary policies. In practice, the learning process often converges significantly faster for most practical cases.
One of the fastest general algorithms for estimating natural policy gradients
which does not need complex parameterized baselines is the episodic natural
actor critic. This algorithm, originally derived in (Peters, Vijayakumar & Schaal, 2003),
can be considered the `natural' version of REINFORCE with a baseline optimal for
this gradient estimator. However, for steepest descent with respect to a metric,
the baseline also needs to minimize the variance with respect
to the same metric. In this case, we can minimize the whole covariance matrix of
the natural gradient estimate
given by


with baseline
As outlined in (Peters, Vijayakumar & Schaal, 2003),
it can be shown that the
minimum-variance unbiased natural gradient estimator can be determined as
shown below.
input: policy parameterizationrepeat perform
trials and obtain
for each trial Obtain the sufficient statistics Policy derivatives
Fisher matrix
Vanilla gradient
Eligbility
Average reward
Obtain natural gradient by computing Baseline
with
Natural gradient
until gradient estimate
converged return gradient estimate
![]()
Conclusion
We have presented an quick overview on policy gradient methods. While many details needed to be omitted and may be found in (Peters & Schaal, 2008), this entry roughly represents the state of the art in policy gradient methods. All three major ways of estimating first order gradients, i.e., finite-difference gradients, vanilla policy gradients and natural policy gradients are discussed in this paper and practical algorithms are given.
References
- V. Aleksandrov, V. Sysoyev, and V. Shemeneva, Stochastic optimization, Engineering Cybernetics, vol.~5, pp. 11-16, 1968.
- S. Amari, Natural gradient works efficiently in learning, Neural Computation, vol.~10, 1998.
- J. Bagnell and J. Schneider, Covariant policy search, International Joint article on Artificial Intelligence, 2003.
- J. Baxter and P. Bartlett, Direct gradient-based reinforcement learning, Journal of Artificial Intelligence Research, 1999.
- A. Berny, Statistical machine learning and combinatorial optimization, In L. Kallel, B. Naudts, and A. Rogers, editors, Theoretical Aspects of Evolutionary Computing, Lecture Notes in Natural Computing, 2000.
- P. Dyer and S. R. McReynolds, The Computation and Theory of Optimal Control. New York: Academic Press, 1970.
- R. Fletcher and R. Fletcher, Practical Methods of Optimization. John Wiley \& Sons, 2000.
- P. Glynn, Likelihood ratio gradient estimation for stochastic systems, Communications of the ACM, vol.~33, no.~10, pp. 75-84, October 1990.
- P. Glynn, Likelihood ratio gradient estimation: an overview, in Proceedings of the 1987 Winter Simulation Conference, Atlanta, GA, 1987, pp. 366--375.
- E. Greensmith, P. Bartlett, and J. Baxter, Variance reduction techniques for gradient estimates in reinforcement learning, Advances in Neural Information Processing Systems, 2001.
- V. Gullapalli, Associative reinforcement learning of real-value functions, SMC, 1991.
- L. Hasdorff, Gradient optimization and nonlinear control. John Wiley & Sons, 1976.
- D.~H. Jacobson and D.~Q. Mayne, Differential Dynamic Programming. New York: American Elsevier Publishing Company, Inc., 1970.
- G. Lawrence, N. Cowan, and S. Russell, Efficient gradient estimation for motor control learning, in Proc. UAI-03, Acapulco, Mexico, 2003.
- A.Y. Ng and M. Jordan, PEGASUS: A policy search method for large MPDs and POMDPs, in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000.
- J. Peters, S. Vijayakumar, and S. Schaal, Reinforcement learning for humanoid robotics, in IEEE/RSJ International Conference on Humanoid Robotics, 2003.
- J. Peters, S. Vijayakumar, and S. Schaal, Natural actor-critic, in Proceedings of the European Machine Learning Conference (ECML)}, 2005.
- J. C. Spall, Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Hoboken, NJ: Wiley, 2003.
- F. Su and A. Gibbs, On choosing and bounding probability metrics, International Statistical Review, vol. 70, no. 3, pp. 419-435, 2002.
- R. Sutton, D. McAllester, S. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, Advances in Neural Information Processing Systems, 2000.
- R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, vol. 8, no. 23, 1992.
repeat
generate policy variation
from roll-out
estimate
, e.g.,
from roll-out
compute
until gradient estimate
converged
return gradient estimate
for each gradient element
estimate optimal baseline
estimate the gradient element
end for
until gradient estimate
converged
return gradient estimate
trials and obtain
Fisher matrix
Vanilla gradient
Eligbility
Average reward
Obtain natural gradient by computing
Baseline
with
Natural gradient
until gradient estimate
converged
return gradient estimate 