Talk:Policy gradient methods
<review>The reviewer has general comments and has decided to put them all in one section.
First, I would like to state that the article is very clearly written, concise yet sufficiently detailed, when considered from a mathematical point of view. However, there are deficiencies in the presentation from the point of view of a reader more generally interested in applications of the method. Thus the following remarks for improvement essentially amount to asking the author to provide much better practical insight into the method.
The introduction (last paragraph) would be a good place to provide several citations of applications of policy gradient methods which the author considers either most illustrative or most successful (and something in addition to his own work!!). </review>
Author's response: I have added a paragraph that outlines applications by several groups.
<review> It would be helpful to get a very clear statement from the author not only on the advantages, but also on the shortcomings of policy gradient methods. As he has worked with these methods a great deal, he should have observations about what problems arise when trying to apply this approach. (Never sell your own method as "the salvation of all souls"…..) E.g., I would think that policy gradient methods are quite demanding to implement, mainly because one has to have considerable knowledge about the system one wants to control in order to make reasonable policy definitions. But I would not insist on that observation; possibly the author himself has more targeted remarks. </review>
Author's response: I apologize, you are absolutely right. I have added advantages and disadvantages sections, both for policy gradients in general in the introduction and later to contrast the methods with each other.
<review> I also miss some practical details or advice. E.g., it is stated that policy gradient methods guarantee convergence at least to a local optimum. But one would want a comment on how relevant that is to RL tasks, e.g. in robotics: how often do functions optimized by policy gradient actually have a single-maximum structure, and could local convergence be a problem rather than a positive feature? Hence, what are the consequences of applying this method to tasks that do not have this structure? Comments like these would be really valuable for someone who considers applying this method. </review>
Author's response: I have incorporated this point into the introduction, both in an applications paragraph at the end of the introduction and in the disadvantages of policy gradients.
<review> And so on and so on. Thus, in summary, remarks along these "practical and critical" lines would be EXCEEDINGLY helpful for the readers. </review>
Author's response: I have tried hard to incorporate this throughout the article.
<review> Not being a true mathematician, I found one place difficult to understand: the second equation in the section "Likelihood Ratio Methods and REINFORCE", where the log is first introduced. To my understanding, it would be good to add a link or an explanation of where this equation comes from, because it is quite central to the method and the average reader might not be able to follow. </review>
Author's response: I have simplified this part. It should be much more readable now!
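For readers who still wonder where the log comes from, the standard likelihood-ratio identity behind it can be sketched as follows (a generic sketch in common notation, not necessarily the article's exact symbols; here $\tau$ denotes a trajectory, $p_\theta(\tau)$ its probability under policy parameters $\theta$, and $R(\tau)$ its return):

$$
\nabla_\theta J(\theta)
= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau
= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
= \mathbb{E}_{\tau \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right],
$$

where the middle step uses the identity $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$. The log appears purely so that the gradient of the trajectory distribution can be written as an expectation under that same distribution, which is what makes the gradient estimable from sampled trajectories.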
<review> Also some of my colleagues have found the name "vanilla policy gradient" quite annoying. They were saying that this is just the gradient, not the "vanilla gradient". </review>
Author's response: I have tried to reduce the usage of 'vanilla' in the article, and I now point out that 'vanilla' was never meant to be derogatory; rather, the 'vanilla' flavor of the policy parametrization affects the regular policy gradient but not the natural policy gradient.
<review> Summarizing, this is a very nice introduction to policy gradient methods for someone who knows mathematics relatively well and does not worry much about implementation of the method. But for adequate help with implementing this technique, more practical details are required. Furthermore, a more balanced (critical!) view of the advantages as well as the disadvantages of this method should be taken. </review>
Author's response: Thanks for the critical but insightful review. I hope I have fulfilled your requests adequately.