Talk:Policy gradient methods

From Scholarpedia

This page is not peer reviewed.

The reviewer has general comments and has decided to put them all in one section

First, I would like to state that the review is very clearly written: concise, yet with enough detail when the text is considered from a mathematical point of view. There are, however, deficiencies in the presentation from the point of view of a reader more generally interested in applications of the method. The following suggestions for improvement therefore essentially amount to asking the author to provide much better practical insight into the method.

The introduction (last paragraph) would be a good place to cite several applications of policy gradient methods that the author considers either most illustrative or most successful (including some in addition to his own work!).

It would be helpful to get a very clear statement from the author not only on the advantages but also on the shortcomings of policy gradient methods. Having worked with these methods extensively, he should have observations about which problems arise when trying to apply this approach. (Never sell your own method as “the salvation of all souls”...) For example, I would think that policy gradient methods are quite demanding to implement, mainly because one needs considerable knowledge about the system one wants to control in order to define a reasonable policy. But I would not insist on that observation; possibly the author himself has much more targeted remarks.

I also miss some practical details and advice. For example, it is stated that policy gradient methods guarantee convergence at least to a local optimum. But one would want a comment on how relevant that is to RL tasks, e.g. in robotics: how often do the functions optimized by policy gradient methods actually have a single maximum, and when could convergence to a merely local optimum be a problem rather than a positive feature? What, then, are the consequences of applying this method to tasks that do not have this structure? Comments like that would be really valuable for someone who considers applying this method. Similarly, for the likelihood ratio method one has to calculate the average over trajectories. Is there practical advice on how many trajectories one has to use to obtain a good gradient estimate? And so on. In summary, remarks along these “practical and critical” lines would be EXCEEDINGLY helpful for the readers.
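To make the question about the number of trajectories concrete, here is a minimal sketch of the kind of check I have in mind. It is not taken from the article: the two-armed bandit task, the softmax policy, the unit-variance reward noise, and the names `gradient_estimate` and `estimator_spread` are all assumptions chosen purely for illustration. The sketch repeats the likelihood ratio estimate many times for several batch sizes and reports how much the estimates scatter.

```python
import numpy as np

def gradient_estimate(theta, n_trajectories, rng):
    """Likelihood ratio gradient estimate for a softmax policy on a toy 2-armed bandit."""
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    # Each "trajectory" is a single action here (an assumption to keep the toy small).
    actions = rng.choice(2, size=n_trajectories, p=pi)
    # Hypothetical task: arm 1 pays 1 on average, arm 0 pays 0, plus unit Gaussian noise.
    rewards = np.where(actions == 1, 1.0, 0.0) + rng.normal(size=n_trajectories)
    # For a softmax policy, grad log pi(a) = one_hot(a) - pi.
    scores = np.eye(2)[actions] - pi
    return (scores * rewards[:, None]).mean(axis=0)

def estimator_spread(n_trajectories, n_repeats=200, seed=0):
    """Standard deviation of the estimate across independent repetitions."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    estimates = np.array(
        [gradient_estimate(theta, n_trajectories, rng) for _ in range(n_repeats)]
    )
    return estimates.std(axis=0).mean()

spreads = {n: estimator_spread(n) for n in (10, 100, 1000)}
for n, s in spreads.items():
    print(f"{n:5d} trajectories: spread of gradient estimate = {s:.4f}")
```

For a plain Monte Carlo average one would expect the scatter to shrink roughly with the square root of the number of trajectories; whether that is acceptable in a realistic task is exactly the kind of remark I would like to see from the author.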

For the likelihood ratio methods it would be good to have a link to an example in which one defines the task and the policy, derives the log-policy gradient expression, and provides a numerical solution with emphasis on the details: the step size, the number of averaged trajectories, etc. A good example with the intermediate steps clearly explained would help people to apply the method and thereby “spread the news” (I guess the author is interested in this...).
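As an indication of the level of detail I am asking for, here is a minimal sketch of such an example. It is my own toy construction, not the author's: the task (a two-armed bandit with Gaussian reward noise), the softmax policy, the batch-mean baseline used for variance reduction, and all parameter values (`step_size`, `n_trajectories`, `n_updates`) are assumptions made only for illustration.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def grad_log_policy(theta, action):
    # Log-policy gradient of a softmax policy: one_hot(action) - pi.
    g = -softmax(theta)
    g[action] += 1.0
    return g

def reinforce_bandit(mean_rewards, step_size=0.1, n_trajectories=20,
                     n_updates=300, seed=0):
    """Likelihood ratio (REINFORCE-style) ascent on a toy bandit task."""
    rng = np.random.default_rng(seed)
    mean_rewards = np.asarray(mean_rewards, dtype=float)
    theta = np.zeros(len(mean_rewards))  # policy parameters, one per arm
    for _ in range(n_updates):
        pi = softmax(theta)
        # Sample a batch of one-step "trajectories" from the current policy.
        actions = rng.choice(len(theta), size=n_trajectories, p=pi)
        rewards = mean_rewards[actions] + rng.normal(0.0, 1.0, size=n_trajectories)
        # Batch-mean baseline (an assumption of this sketch, for variance reduction).
        baseline = rewards.mean()
        grad = np.zeros_like(theta)
        for a, r in zip(actions, rewards):
            grad += grad_log_policy(theta, a) * (r - baseline)
        grad /= n_trajectories  # average over the sampled trajectories
        theta += step_size * grad  # gradient ascent step
    return theta

theta = reinforce_bandit([0.0, 1.0])
final_policy = softmax(theta)
print("final action probabilities:", final_policy)
```

Even in a toy like this, the step size and the number of averaged trajectories visibly affect how quickly and how smoothly the policy concentrates on the better arm, which is why I would like the article to comment on these choices.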

Not being a true mathematician, I found one place difficult to understand: the second equation in the section "Likelihood Ratio Methods and REINFORCE", where the log is first introduced. In my view it would be good to add a link or an explanation of where the equation comes from, because it is quite central to the method and the average reader might not be able to follow.
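For what it is worth, the kind of explanation I have in mind is the standard derivative-of-a-logarithm identity. The notation below is my own assumption and may differ from the article's: \(p_\theta(\tau)\) is the probability of a trajectory \(\tau\) under the policy with parameters \(\theta\), \(r(\tau)\) is its return, and \(J(\theta)\) is the expected return.

```latex
\[
\nabla_\theta \log p_\theta(\tau) = \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}
\quad\Longrightarrow\quad
\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau),
\]
\[
\nabla_\theta J(\theta)
= \nabla_\theta \int p_\theta(\tau)\, r(\tau)\, d\tau
= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau
= \mathbb{E}_{\tau \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\, r(\tau) \right].
\]
```

One or two lines of this sort, assuming that gradient and integral may be interchanged, would already let the average reader see why the log appears and why the gradient can be estimated by averaging over sampled trajectories.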

Also, some of my colleagues found the name "vanilla policy gradient" quite annoying. They said that this is just the gradient, not the "vanilla gradient".

In summary, this is a very nice introduction to policy gradient methods for someone who knows mathematics relatively well and is not much concerned with implementing the method. But to give adequate help with implementing this technique, more practical details are required. Furthermore, a more balanced (critical!) view of the advantages as well as the disadvantages of this method should be taken.
