Talk:Reinforcement learning

From Scholarpedia

    Interesting article, better suited to the topic "Neural RL" than "RL". Detailed comments follow.


    General Response to the Referee

    We have tried to address all of the comments, and we are also grateful for the editorial work the referee has done!

    Concerning the Introduction:

    We have added a more general paragraph on model-based methods as compared to RL, where we are basically adopting the description from the Kaelbling et al (1996) review. Clearly, given the limited space, we could not extend this very much.

    Concerning the point on the possible bias too much towards "the neuronal perspective":

    In general it seems that the field of RL has, especially in recent years, rather strongly developed this "neuronal branch" (through the physiological work of Schultz and others, through the models of Suri et al. and others, etc.). Textbooks on computational neuroscience indeed essentially only discuss this perspective (see e.g. Dayan and Abbott). Hence, we believe that the machine-learning perspective and the neuronal perspective should carry roughly the same weight in such an article. We have tried to achieve this by putting them side by side, to show more clearly the relations between the two fields. Hence we do not think that the neuronal perspective is overweighted.

    Concerning Exploration/Exploitation

    The referee's comments on exploration/exploitation have so far not been addressed because we could not make sense of the referee's statement about "some recent lovely theory (E-cubed et al.)". Sorry, but we did not find this. Could the referee point this out more clearly? Then we will be able to discuss this issue appropriately in another revision round.


    • "RL-agent" -> "RL agent". There is no need to hyphenate this phrase.
    • "normally by trial-and-error": That description is not accurate and should be changed. To many readers, trial and error implies guessing.
    • "encodes the success of an action": Perhaps "the success of an action's outcome". Some authors have explicitly argued that rewards need to be tied to states only, not actions.
    • "to learn actions" -> "to learn to select actions"?


    • "machine learning perspective" -> "machine-learning perspective".
    • "cases of reinforcement learning (RL) cover": Very awkward phrasing. Maybe "best studied case is when the learner's environment can be modeled as a Markov decision process"?
    • "by starting actions": Weird phrase. Perhaps "selecting actions"?
    • "Actions will follow a policy, which can also change": Very awkward. "Actions are selected according to a policy..."?
    • "The goal of RL is to maximize the expected cumulative reward (the return) by subsequently visiting a subset of states in the MDP.": Very confusing description. Maybe "The goal of an RL algorithm is to select actions that maximize the expected cumulative reward (the return) of the agent". Subsets have nothing to do with it.
    • "This, for example, can be done by assuming a given (unchangeable) policy": I would delete this sentence and the next one.


    • "Early on we note" -> "Early on, we note".
    • "neuronal network type formalism" -> "neural network formalism"?
    • "In general RL methods are mainly employed to address two related problems: the Prediction- and the Control Problem." -> "In general, RL methods are employed to address two related problems: the Prediction Problem and the Control Problem."

    NOTE: I would suggest discussing three high-level approaches to RL---policy search, value function learning, and model-based learning---in the intro. Later material assumes that this information is known.

    • "RL is used to learn" -> "RL methods are used to learn"?
    • "the value function for the policy followed": 'Value function' has not been defined yet.
    • "when performing actions starting at this state" -> "following a policy from this state"?
    • "By means of RL" -> "By interacting with the environment"?
    • "that particular set of policies which maximizes" -> "a policy that maximizes".
    • "This way we have at the end obtained an optimal policy which allows for action planning and optimal control": Delete? Seems confusing and redundant with previous sentence.
    • "it is clear that the prediction problem is part of the control problem, too." -> "solving the control problem would seem to require a solution to the prediction problem as well"?


    • "embedding of reinforcement learning": I don't know what this phrase means.
    • "At the same time RL-concepts" -> "At the same time, RL concepts".
    • "and theory of" -> "and the theory of"?
    • "e.g. Robot Soccer": A reference to a paper should be given, not just the competition. In particular, I don't think much RL research has been done in the robocup setting.
    • "dispatching, (Crites" -> "dispatching (Crites"?
    • "simulating classical-" -> "simulating classical".


    • "account on the history" -> "account of the history"?
    • "Here we will" -> "Here, we will".
    • "several problem-fields" -> "several academic disciplines"? In general, I think the term "problem-field" should be replaced. I suspect it is a direct translation of a non-English term.
    • "Belmann" -> "Bellman".
    • "and this has become" -> ", which is"?
    • "Trial and error learning" -> "Trial-and-error learning".
    • "it took much longer until the first, still more qualitative, mathematical models have been developed" -> "it took much longer for the first, still more qualitative, mathematical models to be developed".
    • "classical conditioning, where": Awkward, perhaps break up the sentence.
    • "given in Balkenius and Moren (1998)" -> "given by Balkenius and Moren (1998)".
    • "algorithmical" -> "computational"?
    • "In addition to these two problem-fields" -> "Arising from the interdisciplinary study of these two fields, "?
    • "Temporal Difference Learning": Sentence runs on and on---consider breaking after the citation.
    • "creating a situation where temporal differences need to be evaluated": Confusing.
    • "Goal of this is" -> "The goal of this computation is"?
    • "later it could be transfered to the field of optimal control by the work of Watkins, who had invented Q-learning (1989), which is also a temporal difference algorithm." -> "it was used to solve optimal control problems. Of particular note is the work of Watkins (1989) on Q-learning, a temporal-difference-based control algorithm."
    • "Hence, it is by the set of TD-methods that the prediction and the control problem could be spanned together." : Delete?
    • ", which began" -> "that began".
    • "and he also introduced". Sentence is running on. Consider breaking right before this phrase.
    • "on the other hand": Delete.
    • "Any evaluation in this case be performed" -> "Any evaluation, in this case, must be performed"?
    • "RL appears to be related to non-evaluative feedback, because the agent receives feedback directly from the environment and not from a teacher. However, here the subtle, but sometimes very troubling problem hides, namely who actually defines the environment. Why this is a problem will only be discussed later" -> "Because animals don't receive evaluative feedback, RL would appear to be an example of unsupervised learning. However, this formulation hides the subtle, sometimes very troubling, problem of who actually defines the environment. The reason this issue is a problem will be discussed later".

    NOTE: In my opinion, the article needs a great deal of work on editing. For the remainder of my review, I will refrain from making editorial comments and focus on issues of content.


    • "we need to distinguish between... machine learning ... neuronal": Before reading this article, I was not aware of the "neuronal" branch of RL. Given that the authors of the article are the authors of the main citation on this branch, I am concerned that they have put too much emphasis on this part of the field. The two appear to be given completely equal weight in the discussion, which seems inappropriate (or, at least requires greater justification).
    • "this algorithm emulates the backward-TD(1) procedure": I'm not sure I see why it is important to discuss this algorithm in detail. If it emulates an existing algorithm, how do you distinguish it from an alternative implementation?


    • "SARSA": Should provide a citation for this name. (The idea predates the name.)
    • "Formulations for SARSA(lambda)": Provide a citation.
    • "the neuronal perspective (see there!)": I'm not sure what that comment means, but it seems inappropriate.


    • I hate to be a "spin doctor", but this title makes it sound like RL is broken. More accurately, the list consists of "challenges" or "extensions" to the basic algorithms that have been dealt with to varying degrees. However, I think the list itself is excellent.

    • "RL will only work in stationary environments": That's just not true. Many algorithms work just fine if dynamics change slowly.
    • "Several possible strategies exist": Could also mention evolutionary methods that evolve their own reward function over a series of generations.
    • "Possible model mismatches do not occur here.": However, you should make it clear the shortcomings of correlative feedback---can they be misled by spurious correlations? How would a designer specify the task without rewards?
    • "Exploration-Exploitation Dilemma": In my view, exploration is the key issue that makes RL its own field. So, personally, I would prefer to see a great deal more discussion on this topic. There is also some recent lovely theory (E-cubed et al.) and some promising algorithms, so I would not want the reader to leave with the impression that the problem is unsolvable!

    Reply by the author: Sorry. Couldn't find that ref (E-cubed et al.).


    There is a lot of interesting material in the article. I do not feel comfortable seeing it published under the title "Reinforcement Learning", however, as it seems to take an idiosyncratic view of the field. Perhaps the article could be called "Neuronal Reinforcement Learning" and the background material on RL could be dropped.

    Comments to the author response

    The exploration/exploitation work I had in mind is exemplified by the work of Kearns and Singh. On Singh's webpage, the title is given as "Near-Optimal Reinforcement Learning in Polynomial Time". There has also been a lot of follow-on work by authors such as Brafman and Tennenholtz, Langford, Kakade, and others.

    I apologize for not being more explicit in my earlier note.

    As for the issue of equal weight for "neuronal" and "machine learning" views, I do not think the two should be combined. As best I can tell, there are two distinct communities calling themselves "reinforcement learning" and combining the material into one article does not seem appropriate to me. I remain convinced that the title of the article should be changed.

    I appreciate that it is difficult to address comments from new reviewers, and so I apologise in advance. However, I had some difficulty with a couple of aspects of the target article, and think that these issues need to be addressed before it is ready for primetime.

    The key points are:

    1) Along with the earlier reviewer, I am uncomfortable about the treatment of those aspects of RL that are part of machine learning. Although this is an encyclopaedia of computational neuroscience, the article is describing issues with the AI aspects of RL, and this is rather too incomplete to be a fair representation. Lots of issues (eg things like options) aren't given their full measure. I would also recommend that the name of the article be changed to something like "RL in the brain".

    • Reply: This is a general decision which will involve the Editor. We (and also the editor) think that it would be a good idea to keep this article as it is and to have another one about the machine-learning perspective.

    2) Although I quite understand the desire of the authors to place their own work on ISO and ICO in the context of RL in the field, I am afraid that this makes the article appear biased. I think that in a volume such as this, there should be a separate article for ISO/ICO, which could, of course, be referred to from this article, while this article should present the standard treatment of neural RL, which is in fact what really mirrors the theoretical treatment and what accounts for the data on dopamine activity.

    • Reply: The problem is that people associate RL with Q-learning/SARSA learning. However, correlation-based learning is also able to implement reinforcement learning as long as it is closed loop. The only limitation is that the behaviour is not as flexible as in SARSA/Q-learning. RL itself comes from a behavioural background where animals have been observed and some form of learning has then been inferred. So, we could also add Paul Verschure's correlation-based learning schemes, for example.

    3) The historical picture does not seem to be entirely accurate, and is certainly not documented with references in a way that is suitable for an article "of record". For instance, the substantial input from mathematical psychology in the Bush and Mosteller days to the work in engineering on stochastic learning automata is not celebrated here; and, although Klopf was Rich Sutton's undergraduate advisor at Stanford, crediting him with the bulk of the emphasis of TD towards animal learning seems too strong. What exactly is the evidence for this?

    • Reply: Perhaps the reviewer could point us to the right literature.

    4) A few more minor comments: it would be good to distinguish instrumental and classical conditioning more strongly; Redgrave has recanted his 1999 view about dopamine and salience in favor of his new and more complex theory; the mathematical description of the "Neuronal perspective" is strange -- why v(t) rather than V(t); why should it be r(t) rather than R(t); what's v'? There should be more discussion of the representation of time; how exactly do the "alternative more simpler mechanisms ... [that] reflect the actual neuronal structure and measured signals" account for the substantial data on the activity of dopamine neurons - Redgrave and Gurney's account does not come with a learning rule that delivers the Schultzian dopamine activity.

    • Reply: OK. We'll address these points.

    <review>Another review</review>

    Any review of RL in computational neuroscience must attempt to describe, not only the theoretical and mathematical formalism grounded in machine learning, but also the way this formalism might map onto a biological, neuronal substrate. Both these research strands are extensive, and so this task is an heroic one. However the authors have indeed grappled with it, as evidenced by their 'mind map' diagram in Fig 1, and their use of dual text columns. In general therefore, I commend the article and its approach.

    However, I think the article could do with substantial clarification and a tightening of notation and text. Details follow.

    General clarifications

    Opening line: 'learning by interacting with an environment'. True, but so is all learning (including supervised learning). I think the key point with RL is surely that the interaction with the environment is by `trial and error'.

    opening para. '...receives is a numerical reward'. The use of the term 'reward' is currently highly contentious and goes to the heart of several problems in the field. In biological circles, 'reward' comes with heavy conceptual baggage surrounding ideas like 'appetitive', 'hedonic affect', etc. I think it's worth noting that the term 'reward' in RL is entirely neutral in this respect and that any link with biological reward is an additional interpretive overlay.

    The subtitle 'The classical view (machine learning perspective)', and its contrast with the neuronal perspective, points to the need for some methodological framework. I would suggest that Marr's computational framework of Computation, Algorithm, and Implementational *levels of analysis* might be a good starting point, but augmented with a mechanistic level between algorithm and implementation. This 4-level scheme was given in Marr's original (MIT technical memo) formulation, and recently developed independently by Gurney et al (TINs 2004). Thus, the 'machine learning perspective' sits at the algorithmic level of description. The 'neuronal perspective' is then viewed as the mechanistic level of description that implements the algorithms in abstract neuronal networks (e.g. Fig 2B). It then remains to map these abstract networks onto a real biological substrate (e.g. by invoking STDP and dopamine etc.).

    "adaptive Actor-Critics: an adaptive policy iteration algorithm, which approximates the model of the value function by TD,". Surely the main point about actor critic is that it uses the same TD-error for both value function update and policy update?
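The reviewer's point above can be made concrete with a minimal tabular actor-critic sketch. This is an illustrative toy, not the article's algorithm; all names, sizes, and parameter values here are assumptions chosen for the example. The key feature is that one TD error drives both updates:

```python
import numpy as np

# Minimal tabular actor-critic sketch: the same TD error (delta)
# drives both the critic (value) update and the actor (policy) update.
# State/action counts and learning rates are purely illustrative.
n_states, n_actions = 5, 2
V = np.zeros(n_states)                   # critic: state-value estimates
prefs = np.zeros((n_states, n_actions))  # actor: action preferences
alpha_v, alpha_p, gamma = 0.1, 0.1, 0.9

def actor_critic_step(s, a, r, s_next):
    """Perform one actor-critic update; return the shared TD error."""
    delta = r + gamma * V[s_next] - V[s]   # TD error
    V[s] += alpha_v * delta                # critic update
    prefs[s, a] += alpha_p * delta         # actor update uses the SAME delta
    return delta

# Example transition: state 0, action 1, reward 1.0, next state 2
delta = actor_critic_step(s=0, a=1, r=1.0, s_next=2)
```

After this single step, both the value of state 0 and the preference for action 1 in state 0 have moved in the direction of the same TD error, which is the coupling the reviewer highlights.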

    "Note, the neuronal perspective of RL is in general indeed meant to address biological questions. Its goals are usually not related to those of other artificial neural network (ANN) approaches". In the light of the point about mechanisms and algorithms above, isn't it possible that an ANN could be a mechanistic bridge between an RL algorithm and a biological implementation in a real neuronal network?

    "Embedding and typical applications". The term 'embedding' needs to be explained and, if the authors adopt the approach advocated above, perhaps it would correspond to a two stage mapping: algorithm -> neural mechanism -> biological neuronal network (?)

    In Fig 3, the analogy with conventional control in panel A suggests that the set-point be likened to 'context' (panel B). I couldn't see how context was a set-point?

    Notational and mathematical points

    "if we know the reward function R(s,a)". Is R (upper case) the return, or the reward (see use as return in the section 'Prediction')?

    The following refers to the section in single column text under the sub-heading TD-learning

    In the definition of V(s), I anticipated an expectation operator rather than a sum, i.e. V(s) = E_{\pi}[ R_t | s_t = s ]? (This crops up in other places.)

    V(s) -> V(s) + [R_t - V(s)] is presumably a realisation of the update rule at a particular step in a time series of updates. So the following text about the update being zero would surely only apply in the mean, under correct prediction?

    Again - confusion with mean values in R_t = r_{t+1} + \gamma V(s_{t+1})? [i.e. doesn't this apply to means of R and r?]
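For reference, the standard formulation the reviewer appears to have in mind (following the usual Sutton-and-Barto-style conventions; the symbols below are standard, not taken from the article) is:

```latex
% Value function as an expectation over the return, not a plain sum
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ R_t \mid s_t = s \right],
\qquad
R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}.

% Sample-based TD(0) update at a single time step (alpha = learning rate)
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right].

% Under correct prediction, only the EXPECTED update is zero:
\mathbb{E}_{\pi}\!\left[ r_{t+1} + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t) \,\middle|\, s_t = s \right] = 0.
```

This makes the reviewer's two complaints explicit: the definition of V(s) needs an expectation operator, and the "update is zero" claim holds only in the mean, since individual sampled updates fluctuate around zero.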

    Now under the adjacent column...

    I struggled with the outcome of the definitions v <-> r until I went to the authors' Neural Comp article and saw that it should be v <-> R (!).

    Is the neural scheme here (v <-> R) original to the authors? If not, then a citation might help.

    Again, I struggled with the significance of x1 * E as a convolution, until I realised that it is probably intended that E be an FIR filter, so that it makes sense to perform convolution with x1. Is that right? If so, some explanation here would help! Perhaps the minimal requirement is a delay element? In which case the symbol D might be better.
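If E is indeed intended as an FIR filter, then x1 * E would be an ordinary discrete convolution, which can be sketched as follows (the signal and filter-tap values are purely illustrative assumptions, not taken from the article):

```python
import numpy as np

# If E is an FIR filter kernel, then x1 * E is a discrete convolution.
# An impulse input makes the filter's role obvious: the output simply
# reproduces the (delayed) filter taps.
x1 = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # impulse at t = 1
E = np.array([0.5, 0.3, 0.2])             # hypothetical FIR filter taps
u = np.convolve(x1, E)                     # filtered trace fed to the learning rule
```

With an impulse input the output is the kernel shifted to the impulse time, which is why an FIR reading of E (or, minimally, a pure delay element D) makes the x1 * E notation sensible.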

    "delay line which splits x_1 into many inputs with unit delay ". I think this means "delay line which splits x_1 into many inputs, each delayed with respect to each other by a unit delay"?

    No definition is given of x0 when first mentioned in ISO learning.

    It's probably worth noting that you intend the prime symbol to denote differentiation (it might be confused with a next state version of a variable or similar)

    Under SARSA - again - sum should be expectation?

    Textual points

    (examples only - there are others - I suggest a careful re-reading of the text)

    The title 'Definitions' is perhaps better phrased as 'Overview'

    The caption in figure 4 has the wrong reference. It says the STDP data are from "(F, Markram et al 1997)" but the data on the left are from "Bi GQ and Poo MM. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. J Neurosci 18: 10464-10472, 1998.".

    The links could be integrated better. For example, "can be modelled as a Markov decision process Markov Decision Problems (MDP)" could be "The best studied case is when RL can be formulated as a class of Markov Decision Problems (MDPs)."

    "model-based algorithms". First mention with no explanation. An implicit definition is given a few lines down ("If the model (T and R) of the process is not known") - but it's better up front where the term is first used.

    The cross-hatched box in Fig 1 isn't clear.

    Suggest "RL methods are used [in] a wide range of applications"

    First mention of classical conditioning - is there a link to be made here within the encyclopedia?

    "value function need[s] to be evaluated"

    "...prediction problems (Sutton and Barto 1981, Sutton 1988), is was used to solve" should read: "... prediction problems (Sutton and Barto 1981, Sutton 1988), it was also used to solve.."

    "non-evaluative feedback, [where] he associated .."

    "problem of who actually defines the environment." -> "problem of how the environment is defined" (presumably the modeler or nature define the environment)

    "After learning the primary reflex, triggered by x_0 is being avoided." doesn't make sense.

    "As x_0=0 is the convergence condition, this system is controlled by input control" - couldn't see the significance of this.

    "have been recorded in the Substantia Nigra (SR, pars compacta)". I think should read "have been recorded in the Substantia Nigra pars compacta (SNc)"
