Robot learning by demonstration
- Aude Billard*, EPFL Ecole Polytechnique Federale de Lausanne
- Daniel Grollman, Ecole Polytechnique Federale de Lausanne
Robot Learning from Demonstration (RLfD) or Robot Programming by Demonstration (RPbD) (also known as Imitation Learning and Apprenticeship Learning) is a paradigm for enabling robots to autonomously perform new tasks. Rather than requiring users to analytically decompose and manually program a desired behavior, work in RLfD/RPbD takes the view that an appropriate robot controller can be derived from observations of a human's own performance of the task. The aim is for robot capabilities to be more easily extended and adapted to novel situations, even by users without programming ability.
Overview
Consider a household robot capable of performing mobile manipulation tasks. One task that an end-user may desire the robot to perform is to clean the kitchen [Dillmann 2004]. Doing so may involve multiple subtasks, such as piling dirty dishes, opening and closing the dishwasher, dusting, washing pots and pans, sweeping, etc. Further, every time this kitchen-cleaning behavior is run, the robot will need to deal with different circumstances – different items, different initial states, different rooms, etc.
In a traditional programming scenario, a human programmer would have to code a robot controller that is capable of responding to any situation the robot may face. The overall task may need to be broken down into 10s or 100s of smaller steps, and each one tested for robustness prior to the robot's leaving the factory. If and when failures occurred in the field, highly-skilled technicians would need to be dispatched to update the system for the new circumstances. Instead, RLfD/RPbD allows the end-user to 'program' the robot simply by showing it how to perform the task - no coding required. Then, when failures occur, the end-user need only provide more demonstrations, rather than calling for professional help.
Key problems in RLfD/RPbD
Nehaniv & Dautenhahn (2001) phrased the problems faced by RLfD/RPbD as a set of key questions: What to imitate? How to imitate? When to imitate? Whom to imitate? To date, only the first two questions have been substantially addressed in RLfD/RPbD. They are:
- What to imitate: Determining which aspects of the demonstration should be imitated. E.g., if the demonstrator always approaches a location from the north, is it necessary for the robot to do the same? Answering this question strongly influences whether or not a derived robot controller is a successful imitation: a robot that approaches from the south is appropriately trained if direction is unimportant, but needs further education if direction matters. This issue is related to questions of signal versus noise, and is answered by determining the metric by which the resulting behavior is evaluated.
- How to imitate (or the Correspondence Problem): Determining how the robot will perform those parts of the demonstration that should be imitated. E.g., if the demonstrator uses a foot to move an object, is it acceptable for a wheeled robot to bump it, or should it use a gripper instead? Because robots and humans may have different embodiments, this issue is closely related to the correspondence problem. The evaluation metric addresses this issue in conjunction with the task-mapping projectors.
Ways to solve RLfD/RPbD
In the kitchen-cleaning task reviewed above (see Fig. 1), RLfD/RPbD has been and can be applied at several levels:
1) Subtask sequencing: Given a set of known (pre-programmed or previously learned) behaviors, such as pick up cup, move toward dishwasher, open dishwasher, etc., the robot must learn the correct sequence of actions to perform. An example of such sequencing of behaviors for a household task can be found in Figure 1. Other examples include learning to sequence known behaviors for navigation through imitation of a more knowledgeable robot or human [Demiris and Hayes, 2006; Gaussier et al. 98; Nicolescu and Mataric 2003], and learning and sequencing primitive motions for full-body motion in humanoid robots [Billard, 2000; Ito and Tani 2004; Kulic et al. 2008].
2) Individual motions: The individual motions, e.g., picking something up, opening a drawer, or wiping a window, can be learned from watching a human do them. Often, applied forces need to be taken into account as well as the absolute and relative positions of the robot, its end effectors, and the objects in the world. Most RLfD/RPbD work to date has focused on learning the kinematics of motions by recording the position of the end-effector and/or the positions of the robot’s joints; see Figure 2 for a few examples. More recently, a few works have investigated the transmission of force-based signals through human demonstration [Calinon et al. 09; Kormushev et al. 2011; Rozo et al. 2011; Kronander and Billard 2012].
Figure 2: Teaching a robot how to play “ball in cup”. Start with a few correct human demonstrations and then let the robot learn the rest through trial and error (Kober and Peters, 2010); longer video at [1]
Figure 3: Teaching a robot how to flip pancakes. For a longer version of the video, see [2]
Figure 4: Learning from failure (Grollman and Billard, 2011). For a longer version of the video, see [3]
Interfaces for Demonstration
The interface used to provide demonstration plays a key role in the way the information is gathered and transmitted. We can distinguish three major trends:
A) One may directly record human motions. If one is interested solely in the kinematics of the motion, one may use any of the various existing motion tracking systems, see Figure . These return precise measurements of the motion of all joints and have been used in various works for RLfD/RPbD of full-body motion [Kulic et al. 2008; Ude et al. 2004; Kim et al. 2009]. These methods are advantageous in that they allow the human to move freely. They do, however, require solutions to the correspondence problem, i.e. the problem of how to transfer motion from human to robot when the two differ in the kinematics and dynamics of their bodies.
B) At the other end there are immersive teleoperation scenarios, where a human operator is limited to using the robot's own sensors and effectors to perform the task [Coates et al. 2008; Grollman & Jenkins 07]. The advantage is that this entirely solves the correspondence problem, as the system directly records the perception and action from the robot’s standpoint.
C) In the middle, there are techniques such as kinesthetic teaching, where the robot is physically guided through the task by the human, see Figure . Recent advances in skin technology offer the possibility of teaching robots how to exploit tactile contact with objects, see Figure 7.
Figure 5: LfD of full body motion during walking using vision and the humanoid robot DB (Ude et al. 2004)
Figure 6: Teaching a robot how to adapt its stiffness by shaking the robot. The stiffness is decreased in the eigen-direction of the perturbation and inversely proportionally to the eigenvalues. Long version of the movie available at [4] [Kronander and Billard, 2012]
Figure 7: By exploiting the compliance of the iCub robot’s fingers, the teacher can teach the robot how to adapt the posture of the fingers in response to change in its tactile sensing at the finger tips. Long version of the movie available at [5] [Sauser et al. 2011]
RLfD/RPbD and Human-Robot Interaction
Since RLfD/RPbD necessarily deals with both humans and robots, it overlaps heavily with the field of Human Robot Interaction (HRI). In addition to the learning algorithms themselves, many human-centric issues are researched as part of RLfD/RPbD, mostly focused on how to better elicit and utilize the demonstrations; see [Goodrich & Schultz 07; Fong et al. 03; Breazeal & Scassellati 02] for surveys.
History and State-of-the-Art
Robot Learning from Demonstration started in the 1980s. Then, and still to a large extent now, robots had to be explicitly and tediously hand programmed for each different task they had to perform. RLfD/RPbD seeks to minimize, or even eliminate, this difficult step. Further, as robots move out of heavily engineered environments and into the real world, it is unlikely that all possible events they will face can be anticipated and dealt with by a human programmer. Thus, robot learning is required to adapt the robot's behavior to new situations.
Robot learning can be seen as finding a controller that satisfies some constraints (e.g. the task is accomplished). In RLfD/RPbD, demonstrations are used to simplify the problem by reducing the space of controllers that must be examined. If good examples are observed, the search can begin there and find a local optimum. Conversely, bad examples may be used to exclude portions of the search space.
The promises of RLfD/RPbD are thus multiple. On the one hand, one hopes that it will make learning faster, in contrast to tedious trial-and-error learning, particularly in high dimensional spaces (the so-called curse of dimensionality). On the other hand, one expects that the methods, being user-friendly, would allow robots to be utilized to a greater extent in day-to-day interactions with non-specialist humans. Robot Programming by Demonstration has by now become a central topic of robotics that spans general research areas such as human-robot interaction, machine learning, machine vision and motor control.
There is much work in RLfD/RPbD, covering many different techniques. However, all approaches have certain commonalities. In this article, we outline a framework for talking about RLfD/RPbD techniques, and illustrate how several contemporary approaches can be viewed through this lens.
Formalism
For some desired robot behavior, we assume that there exists a suitable policy, latent in a human, denoted \(\Omega^\mathsf{y}\ .\) The goal of RLfD/RPbD, and indeed of other human-robot policy transfer techniques such as explicit programming and teleoperation, is to create an analogous robot policy, \(\Omega^\mathsf{x}\) that will cause the robot to exhibit the desired behavior. Ideally, instantiating this policy should be easy and intuitive.
As no transfer mechanism, no matter how direct, is perfect, it is important to evaluate the robot's policy after it is instantiated. For this, one uses a metric, \(M\) that, conceptually, computes the similarity between the human's latent policy and the robot's. Ideally, this metric is optimal when the two match exactly, when the robot does exactly what the human would want to do in all situations. There are many possible metrics, ranging from subjective evaluations of 'human-ness', to task-specific values such as the number of goals scored, to quantitative measures such as minimum jerk. Additionally, there may be more than one optimum for a given metric. In other words, different robot control policies may lead to behaviors that are equally good. This stems in part from the fact that human and robot may differ in their body geometry and in the way they perceive and act in the world.
To deal with differences between the human and robot, one must determine a common mapping in which a subset of the robot and human perception and actions may be deemed equivalent. This is commonly referred to as the correspondence problem. There is no generic solution to this problem and current efforts are directed to formulating equivalence for task-specific cases. We refer generically to the process by which one forms these equivalences through a pair of operators \(\phi_\mathsf{x},\phi_\mathsf{y}\) that project the state and action spaces of the two agents into a common, perhaps task-specific, space. These operators represent a solution to the correspondence problem and allow for a direct comparison of the policies.
Generally, in RLfD/RPbD, the correspondence problem is taken as solved. Teleoperation interfaces, for example, require the human user to map their own controls onto the robot's body (and possibly utilize the robot's sensor information directly). Kinesthetic teaching is another popular approach, whereby the human physically moves the robot to perform the desired task. Again, the human is directly controlling the robot, and thus performs the equivalence mapping implicitly. In contrast, motion capture systems often explicitly hardcode the congruencies between human and robot forms.
In explicit programming, the improvement of \(\Omega^\mathsf{x}\) (learning) is done by reprogramming the robot or manually changing parameters. Instead, in RLfD/RPbD, learning occurs automatically, via some update operator \(U\ .\) The goal of learning in both cases is to find a (perhaps local) optimum of the metric \(M\ .\) Much of contemporary research focuses on the choice of learning algorithm, that is, \(U\ .\) However, there is also research that explores the way information flows from the human to the robot, which we call the learning process, \(L\ .\)
Learning techniques can be roughly grouped into two classes: One-Shot Batch Learning and Incremental Learning. Batch techniques process all of the data at once and form the robot control policy from scratch. If new data is to be incorporated, all the previous data must be reprocessed as well, so all data must be kept. In contrast, incremental learning updates the current policy in light of new data, after which the data can be discarded. Note that even an initial learning step can be seen as an update from the null policy, or 'tabula rasa.' In addition to not requiring that all data be stored, incremental techniques lend themselves well to interactive teaching, where the robot demonstrates its current understanding of the task at each step of the training. Doing so allows the human teacher to monitor the robot's progress and to provide more directed teaching. This is an as yet little-explored area of research, and the vast majority of work in RLfD/RPbD to date relies on batch learning.
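The contrast between the two classes can be sketched with a deliberately tiny example. Here the "policy" is assumed to be nothing more than an estimated target location (the mean of the demonstrated end points); the function names and data are hypothetical and chosen only for illustration. The batch learner needs the full data set each time, while the incremental learner folds in one demonstration at a time and can then discard it.

```python
from typing import List, Sequence, Tuple


def batch_mean(demos: Sequence[Sequence[float]]) -> List[float]:
    """Batch: recompute the estimate from scratch using every stored demonstration."""
    n = len(demos)
    dim = len(demos[0])
    return [sum(d[i] for d in demos) / n for i in range(dim)]


def incremental_mean(
    estimate: List[float], count: int, new_demo: Sequence[float]
) -> Tuple[List[float], int]:
    """Incremental: fold one new demonstration into the current estimate."""
    count += 1
    updated = [e + (x - e) / count for e, x in zip(estimate, new_demo)]
    return updated, count


# Hypothetical end points of three demonstrations in a 2-D task space.
demo_endpoints = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]

# Batch learner: requires all data to be kept.
batch_estimate = batch_mean(demo_endpoints)

# Incremental learner: starts from the null policy (count 0) and stores no data.
estimate, count = [0.0, 0.0], 0
for demo in demo_endpoints:
    estimate, count = incremental_mean(estimate, count, demo)

print(batch_estimate, estimate)
```

Both learners arrive at the same estimate in this toy case; the difference lies in what must be stored and when the policy becomes available for inspection by the teacher.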
One-shot Batch RLfD/RPbD, where a single demonstration is collected and processed, is depicted in Figure #LfD_RL as the blue line. Generally an analytical approach is used to derive a closed-form solution for optimizing \(M\) from the data. We contrast this with pure reinforcement learning (RL - green line), which does not use human demonstration, but only the metric (reward). Recent research has looked at combining these two approaches, initializing with batch RLfD/RPbD and then refining the resulting \(\Omega^\mathsf{x}\) with RL.
We posit that the process by which a learner agent (robot) learns from observing and interacting with a demonstrator agent (human) can be entirely described by the following variables:
| \(\mathsf{x}\) | | The learner agent (Robot) |
| | \(\mathsf{y}\) | The demonstrator agent (Human) |
| \( \mathsf{D_x} \) | \( \mathsf{D_y} \) | The dimensionalities of their state spaces |
| \(\mathcal{X} \subset \Re^\mathsf{D_x} \) | \(\mathcal{Y} \subset \Re^\mathsf{D_y} \) | Their state spaces |
| \( \mathbf{x} \in \mathcal{X} \) | \( \mathbf{y} \in \mathcal{Y} \) | An individual state (\( \mathsf{D_x} \)- or \(\mathsf{D_y}\)-dimensional column vector) |
| \( \mathbf{X}^n = \{\mathbf{x}^n_t\}_{t=0}^{T_n^\mathsf{x}}\) | \(\mathbf{Y}^n = \{\mathbf{y}^n_t\}_{t=0}^{T_n^\mathsf{y}}\) | A trajectory through states of length \(T_n\) (\(\mathsf{D_?} \times T_n^? \) matrix) |
| \(\mathfrak{x} = \{\mathbf{x}^s\}_{s=1}^S\) | \(\mathfrak{y} = \{\mathbf{y}^s\}_{s=1}^S\) | A set of \(S\) states |
| \(\mathfrak{X} = \{\mathbf{X}^n\}_{n=1}^N\) | \(\mathfrak{Y} = \{\mathbf{Y}^n\}_{n=1}^N\) | A set of \(N\) trajectories |
| \(\mathbf{X} = \Omega^\mathsf{x}(\mathbf{x}_0)\) | \(\mathbf{Y}=\Omega^\mathsf{y}(\mathbf{y}_0)\) | The control policies / behavior controllers |
| \(\mathbf{z} \in \mathcal{Z} \subset \Re^\mathsf{D_z}\) | | The task space |
| \(\mathbf{x} \approx \mathbf{y} \) | | Equivalence relationship between human and robot states |
| \(\mathbf{z}=\phi_\mathsf{x}(\mathbf{x})\) | \(\mathbf{z}=\phi_\mathsf{y}(\mathbf{y})\) | Operators that map to task space (defining \(\mathbf{x} \approx \mathbf{y} \)) |
| \( M(\Omega^\mathsf{x},\Omega^\mathsf{y}) \approx M(\Omega^\mathsf{x}(\mathcal{X}),\Omega^\mathsf{y}(\mathcal{Y}))\) | | A metric to compare the two controllers |
| \( \Omega^\mathsf{x} = U(\Omega^\mathsf{x})\) | | The robot update |
| \( \Omega^\mathsf{x} = L(\mathsf{x},\mathsf{y}) \) | | The learning process |
Behavioral Samples
Direct access to \(\Omega^\mathsf{y}\) is hard to come by; even skilled programmers make mistakes when transferring desired behaviors onto robots. RLfD/RPbD thus instead attempts to infer \(\Omega^\mathsf{x}\) from demonstrations, which are assumed to be informative about the nature of \(\Omega^\mathsf{y}\ .\) We define a demonstration as a trajectory through the human's state space, starting from some initial state and running for some number of steps. For our demonstrator, we consider a \(\mathsf{D_y}\)-dimensional state space \(\mathcal{Y}\ ,\) which is a subset of \(\Re^\mathsf{D_y}\ .\) Individual states are \(\mathsf{D_y}\)-dimensional column vectors, \(\mathbf{y} \in \mathcal{Y}\ .\) Starting from an initial state \(\mathbf{y}^n_0\ ,\) we denote the nth trajectory (demonstration) as \(\mathbf{Y}^n = \Omega^\mathsf{y}(\mathbf{y}^n_0) = \{\mathbf{y}^n_t\}_{t=0}^{\mathsf{T}^\mathsf{y}_n}\ .\) We can also consider a set of \(\mathsf{N_y}\) (not necessarily unique) initial conditions, \(\mathfrak{y}_0 = \{\mathbf{y}^n_0\}_{n=1}^{\mathsf{N_y}}\ ,\) and the resulting set of trajectories: \(\mathfrak{Y} = \Omega^\mathsf{y}(\mathfrak{y}_0) = \{\Omega^\mathsf{y}(\mathbf{y}^n_0)\}_{n=1}^{\mathsf{N_y}}\ .\) Note that demonstrations need not have the same length (it may be that \(\mathsf{T}^\mathsf{y}_i \ne \mathsf{T}^\mathsf{y}_j\) for \(i \ne j\)). Further, even if two of the initial conditions are identical, it may be that the generated trajectories are different, due to noise or other errors.
Likewise, we consider trajectories generated by the robot as indicative of the learned behavior. We similarly define the states of the robot as \(\mathbf{x} \in \mathcal{X} \subset \Re^\mathsf{D_x}\) and consider a set of trials \(\mathfrak{X} = \Omega^\mathsf{x}(\mathfrak{x}_0)\) (from a set of \(\mathsf{N_x}\) initial positions). We can then compare the two behaviors by evaluating the sets of trajectories they produce. That is, we approximate \(M(\Omega^\mathsf{x},\Omega^\mathsf{y})\) with \(M(\mathfrak{X},\mathfrak{Y})\ .\) We implicitly assume that the best approximation of \(M(\Omega^\mathsf{x},\Omega^\mathsf{y})\) would be achieved by running both controllers from every possible initial condition and computing \(M(\Omega^\mathsf{x}(\mathcal{X}),\Omega^\mathsf{y}(\mathcal{Y}))\ .\)
Correspondences
To evaluate the similarity between the human and robot behaviors, we must first deal with the fact that the human and the robot may occupy different state spaces, of perhaps different dimensionalities. Thus, as a first step towards computing \(M\ ,\) we must identify correspondences between the state spaces as illustrated in Figure #correspond. We identify three different ways in which states \(\mathbf{x}\) and \(\mathbf{y}\) can be said to correspond (denoted \(\mathbf{x} \approx \mathbf{y}\)), and give brief examples:
- Perceptual equivalence: Due to differences between human and robot sensory capabilities, the same scene may appear very different to each. See Figure #corrpercep.
- Physical equivalence: Due to differences between human and robot embodiments, they may perform different actions to accomplish the same physical effect. See Figure #corract.
- Task equivalence: For a given task, certain observable or affectable properties may be irrelevant and safely ignored. See Figure #corrtask.
More formally, there is a pair of operators that map each agent's state space into some equivalence space, \(\mathbf{z} \in \mathcal{Z} \subset \Re^\mathsf{D_z}\ .\) We have \(\phi_\mathsf{y}: \mathcal{Y} \rightarrow \mathcal{Z}\) and \(\phi_\mathsf{x}: \mathcal{X} \rightarrow \mathcal{Z}\ ,\) which take into account all three types of equivalence, and where the mappings are likely many-to-one, and therefore not uniquely reversible.
We can think of perceptual equivalence as dealing with the manner in which the agents perceive the world; it makes sure that the information necessary to perform the task is available to both. Physical equivalence deals with the manner in which agents affect and interact with the world, and makes sure that the task is actually performable by both. Task equivalence removes from consideration details that, while perceptible or performable, do not matter for the task.
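As an illustration of what such a pair of operators might look like, the sketch below assumes a hypothetical two-joint planar robot arm and a human whose state is a tracked hand and elbow position. The task space is taken to be the 2-D hand position. The link lengths, state layouts and the helper `corresponds` are all assumptions made for this example and do not come from any particular system.

```python
import math
from typing import Sequence, Tuple

LINK_LENGTHS = (0.3, 0.25)  # hypothetical robot link lengths, in metres


def phi_x(x: Sequence[float]) -> Tuple[float, float]:
    """Robot state (two joint angles, in radians) -> task space (hand position)."""
    q1, q2 = x
    l1, l2 = LINK_LENGTHS
    px = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    py = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return (px, py)


def phi_y(y: Sequence[float]) -> Tuple[float, float]:
    """Human state (hand x, hand y, elbow x, elbow y) -> task space.

    Only the hand position is kept. The elbow coordinates are treated as
    irrelevant to the task, so many human states map to the same z.
    """
    return (y[0], y[1])


def corresponds(x: Sequence[float], y: Sequence[float], tol: float = 1e-6) -> bool:
    """x and y are deemed equivalent when they project to the same task-space point."""
    return math.dist(phi_x(x), phi_y(y)) <= tol


robot_state = (0.0, math.pi / 2)          # arm bent at a right angle
human_state_a = (0.3, 0.25, 0.1, 0.1)     # same hand position, one elbow pose
human_state_b = (0.3, 0.25, 0.9, -0.4)    # same hand position, another elbow pose

print(corresponds(robot_state, human_state_a), corresponds(robot_state, human_state_b))
```

Both human states correspond to the same robot state, which shows the many-to-one character of the mappings: the task-space point cannot be inverted to a unique human (or robot) configuration.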
Metrics
With the projections into task space, we can now directly compare data from both the human and robot controllers. We here describe some basic, simple possibilities for \(M\ .\) These can be further combined to generate more complex equations.
End Point Location
For tasks where only the final state of the robot matters (e.g. reaching a navigation goal), a possible \(M\) is:
\(M(\mathfrak{X},\mathfrak{Y}) = -\frac{1}{\mathsf{N_x}}\sum_{n=1}^{\mathsf{N_x}} (\phi_\mathsf{x}(\mathbf{x}^n_{\mathsf{T}^\mathsf{x}_n}) - \frac{1}{\mathsf{N_y}}\sum_{m=1}^{\mathsf{N_y}}\phi_\mathsf{y}(\mathbf{y}^m_{\mathsf{T}^\mathsf{y}_m}))^2\)
Here the target location is taken to be the average ending location of the demonstrations, and learning is aimed at minimizing the mean squared error of the ending location of the trials.
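A minimal sketch of this metric follows. It assumes that all trajectories have already been projected into task space (so \(\phi_\mathsf{x}\) and \(\phi_\mathsf{y}\) are not repeated here) and that the square of a vector is read as the squared Euclidean norm. The function name and the sample data are hypothetical.

```python
from typing import Sequence

Point = Sequence[float]
Trajectory = Sequence[Point]


def end_point_metric(robot_trials: Sequence[Trajectory], demos: Sequence[Trajectory]) -> float:
    """Negative mean squared distance between trial end points and the mean demo end point."""
    dim = len(demos[0][-1])
    # Target: average final task-space point over all demonstrations.
    target = [sum(d[-1][i] for d in demos) / len(demos) for i in range(dim)]
    total = 0.0
    for trial in robot_trials:
        end = trial[-1]
        total += sum((e - t) ** 2 for e, t in zip(end, target))
    return -total / len(robot_trials)


demos = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (2.0, 0.0), (3.0, 1.0)]]
robot_trials = [[(0.0, 0.0), (2.0, 1.0)], [(5.0, 5.0), (2.0, 3.0)]]

print(end_point_metric(robot_trials, demos))  # -2.0
```

Note that the demonstrations and trials may differ in number and in length, since only the final points are compared.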
Path Matching
If in addition to the endpoint, the actual trajectory is important, we can have:
\(M(\mathfrak{X},\mathfrak{Y}) = -\sum_{n=1}^{\mathsf{N_x}} \sum_{t=0}^{\mathsf{T}^\mathsf{x}_n} (\phi_\mathsf{x}(\mathbf{x}^n_t) - \phi_\mathsf{y}(\mathbf{y}^n_t))^2\)
However, in order to use this equation, we require that the human and robot generate the same number of trajectories (\(\mathsf{N_x} = \mathsf{N_y}\)), starting in equivalent locations (\(\mathfrak{y}_0 \approx \mathfrak{x}_0\)), and that the paired trajectories take the same length of time (\(\mathsf{T}^\mathsf{y}_n = \mathsf{T}^\mathsf{x}_n\)). These first two conditions can be met with experimental design, and the last is often achieved by resampling or time warping.
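The sketch below illustrates one way these conditions might be handled in practice. It assumes paired trajectories already expressed in task space and uses plain linear resampling to bring each demonstration to the length of its paired trial; time warping would be an alternative. The function names are hypothetical.

```python
from typing import List, Sequence

Point = Sequence[float]
Trajectory = Sequence[Point]


def resample(traj: Trajectory, length: int) -> List[List[float]]:
    """Linearly resample a trajectory to `length` points (length must be >= 2)."""
    last = len(traj) - 1
    if last == 0:
        return [list(traj[0]) for _ in range(length)]
    out = []
    for k in range(length):
        s = k * last / (length - 1)   # fractional index into the original
        i = min(int(s), last - 1)     # index of the left neighbour
        w = s - i                     # interpolation weight in [0, 1]
        out.append([(1 - w) * a + w * b for a, b in zip(traj[i], traj[i + 1])])
    return out


def path_matching_metric(robot_trials: Sequence[Trajectory], demos: Sequence[Trajectory]) -> float:
    """Negative summed squared distance between paired, time-aligned trajectories."""
    if len(robot_trials) != len(demos):
        raise ValueError("path matching needs the same number of trials and demonstrations")
    total = 0.0
    for trial, demo in zip(robot_trials, demos):
        aligned_demo = resample(demo, len(trial))
        for p, q in zip(trial, aligned_demo):
            total += sum((a - b) ** 2 for a, b in zip(p, q))
    return -total


demo = [(0.0, 0.0), (2.0, 2.0)]                    # two samples
good_trial = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # three samples on the same line
off_trial = [(0.0, 0.0), (1.0, 2.0), (2.0, 2.0)]   # deviates at the midpoint

print(path_matching_metric([good_trial], [demo]), path_matching_metric([off_trial], [demo]))
```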
Path Similarity
To relax those assumptions, we can introduce features of paths, such as smoothness or minimum jerk. We can then compute
\(M(\mathfrak{X},\mathfrak{Y}) = -\left(\left(\frac{1}{\mathsf{N_x}}\sum_{n=1}^{\mathsf{N_x}} f(\phi_\mathsf{x}(\mathbf{X}^n))\right) - \left(\frac{1}{\mathsf{N_y}}\sum_{n=1}^{\mathsf{N_y}} f(\phi_\mathsf{y}(\mathbf{Y}^n))\right)\right)\)
where \(f\) is some feature of the trajectories defined in the task space.
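As a sketch, the feature \(f\) below is taken to be the sum of squared third finite differences, a simple discrete proxy for jerk. This choice, the function names, and the use of an absolute value around the difference (added here so that the score is largest when the mean features match) are assumptions of this example, not part of the formula above. Trajectories are assumed to be in task space already.

```python
from typing import Callable, Sequence

Point = Sequence[float]
Trajectory = Sequence[Point]


def squared_jerk(traj: Trajectory) -> float:
    """Sum of squared third finite differences over all dimensions."""
    total = 0.0
    for t in range(len(traj) - 3):
        for a, b, c, d in zip(traj[t], traj[t + 1], traj[t + 2], traj[t + 3]):
            total += (d - 3 * c + 3 * b - a) ** 2
    return total


def path_similarity_metric(
    robot_trials: Sequence[Trajectory],
    demos: Sequence[Trajectory],
    feature: Callable[[Trajectory], float] = squared_jerk,
) -> float:
    """Negative absolute difference between mean feature values of trials and demos."""
    mean_robot = sum(feature(t) for t in robot_trials) / len(robot_trials)
    mean_demo = sum(feature(d) for d in demos) / len(demos)
    return -abs(mean_robot - mean_demo)


smooth_demo = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]  # constant velocity
longer_demo = [(0.0,), (2.0,), (4.0,), (6.0,), (8.0,), (10.0,)]
jerky_trial = [(0.0,), (0.0,), (0.0,), (1.0,), (1.0,)]  # abrupt start and stop

print(path_similarity_metric([jerky_trial], [smooth_demo, longer_demo]))
```

Because only summary features are compared, the numbers of trials and demonstrations, their starting points, and their lengths are free to differ.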
Reward
A particular case of the above is to consider the reward accumulated along the paths. However, in this case it is unnecessary to compare to the demonstrator. Instead, the demonstrator can be used to initialize the learning process, or provide the reward signal itself.
\(M(\mathfrak{X},\mathfrak{Y}) = \frac{1}{\mathsf{N_x}}\sum_{n=1}^{\mathsf{N_x}} R(\phi_\mathsf{x}(\mathbf{X}^n))\)
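A short sketch of this metric is given below. The reward function \(R\) is assumed, purely for illustration, to be the negative distance to a fixed goal accumulated over every step of a task-space trajectory; any other trajectory-level reward could be passed in instead.

```python
import math
from typing import Callable, Sequence

Point = Sequence[float]
Trajectory = Sequence[Point]

GOAL = (0.0, 0.0)  # hypothetical goal location in task space


def accumulated_reward(traj: Trajectory) -> float:
    """Hypothetical R: negative distance to the goal, summed over the trajectory."""
    return -sum(math.dist(p, GOAL) for p in traj)


def reward_metric(
    robot_trials: Sequence[Trajectory],
    reward: Callable[[Trajectory], float] = accumulated_reward,
) -> float:
    """Average accumulated reward over the robot's trials; no demonstrations needed."""
    return sum(reward(t) for t in robot_trials) / len(robot_trials)


trials = [[(3.0, 4.0), (0.0, 0.0)], [(0.0, 0.0)]]
print(reward_metric(trials))  # -2.5
```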
Probabilistic
An alternate view is to treat the known trajectories as samples from some underlying probability distribution:
\(M(\mathfrak{X},\mathfrak{Y}) = P(\phi_\mathsf{y}(\mathfrak{Y})|\phi_\mathsf{x}(\mathfrak{X}))\)
Here \(P\) is a density estimator. This approach can be thought of as maximizing the probability that the robot will generate the same trajectories as the human.
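One very simple instance is sketched below. It assumes the density estimator is an axis-aligned Gaussian fitted to all task-space points visited by the robot, treats the points as independent samples (ignoring their order in time), and reports the mean log-density of the demonstrated points rather than the raw probability. These simplifications and all names are assumptions of the example; richer estimators could be substituted.

```python
import math
from statistics import fmean, pvariance
from typing import List, Sequence, Tuple

Point = Sequence[float]
Trajectory = Sequence[Point]


def fit_diag_gaussian(points: Sequence[Point], min_var: float = 1e-6) -> Tuple[List[float], List[float]]:
    """Fit an independent Gaussian per dimension, with a floor on the variance."""
    dim = len(points[0])
    means = [fmean([p[i] for p in points]) for i in range(dim)]
    variances = [
        max(pvariance([p[i] for p in points], mu=means[i]), min_var) for i in range(dim)
    ]
    return means, variances


def log_density(point: Point, means: Sequence[float], variances: Sequence[float]) -> float:
    """Log-density of one point under the axis-aligned Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(point, means, variances)
    )


def probabilistic_metric(robot_trials: Sequence[Trajectory], demos: Sequence[Trajectory]) -> float:
    """Mean log-density of demonstrated points under a model fitted to robot points."""
    robot_points = [p for traj in robot_trials for p in traj]
    demo_points = [p for traj in demos for p in traj]
    means, variances = fit_diag_gaussian(robot_points)
    return fmean([log_density(p, means, variances) for p in demo_points])


demos = [[(0.0, 0.0), (1.0, 1.0)]]
near_trials = [[(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]]
far_trials = [[(10.0, 10.0), (10.5, 10.5), (11.0, 11.0)]]

print(probabilistic_metric(near_trials, demos) > probabilistic_metric(far_trials, demos))
```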
Learning
The actual learning update of the learner agent's controller \(\Omega^\mathsf{x}=U(\Omega^\mathsf{x})\) can be triggered by new demonstrations \(\mathfrak{Y}\ ,\) self-trials \(\mathfrak{X}\ ,\) or even specific corrections. Key to the update is the concept of generalization, or the ability of the robot to behave appropriately in novel situations. Often, there is an element of discovering task equivalence inherent in generalization. The distinction between the task equivalence in \(\{\phi_\mathsf{x}, \phi_\mathsf{y}\}\) and the generalization that occurs during learning is mostly one of scale. The mappings \(\phi_\mathsf{x}\) and \(\phi_\mathsf{y}\) remove whole dimensions (e.g., all readings from temperature or color sensors) before learning takes place, while generalization occurs over observed values. If, during learning, a behavior is generalized over a particular dimension to the point that values on that dimension do not affect the behavior, it is equivalent to removing that dimension from consideration.
We distinguish the learning update from the learning process, which defines how the robot and human controllers are used to generate data, and if and when explicit evaluations of \(M\) are performed. We denote the overall process \(\Omega^\mathsf{x}=L(\mathsf{x},\mathsf{y})\) and now provide some general frameworks which can be used to perform \(L\ .\)
Batch Learning
Batch learning is illustrated in Figure #RLFD-Batch and can be described pseudo-algorithmically as:
1) Collect \(\mathfrak{Y}\) from \(\Omega^\mathsf{y}\)
2) Derive \(\Omega^\mathsf{x}\) from \(\mathfrak{Y}\) based on analysis of \(M\)
In batch learning, all samples from the demonstrator are collected before learning takes place. Usually, this is because the data collection process itself is difficult, tedious, or expensive (perhaps an expert must be paid). Further, the learning techniques used often take a long time to process the data (compared to the time spent collecting it). Thus, it is infeasible to collect some data, process it, and then collect more.
When using batch learning, care must be taken to sample the demonstrator's behavior sufficiently. This generally means starting from a wide variety of initial positions, but can also mean perturbing the human in different ways. The idea is that the demonstrations cover, or span, all the possible situations the robot may encounter during autonomous execution. Usually, human intuition is used to determine which demonstrations are sufficient.
The learning update itself often makes use of the mathematical properties of \(M\ ,\) such as an analytical solution that maximizes it. As such, \(M\) is often not computed during learning itself, but used as an evaluative tool afterwards.
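The two batch steps can be sketched as follows. The example assumes a one-dimensional task space, a stand-in "human" whose latent policy is the hypothetical rule \(z_{t+1} = 0.5 z_t + 1\), and a robot policy of the linear form \(z_{t+1} = a z_t + b\) whose parameters are obtained in closed form by ordinary least squares (which maximizes a negative squared-error path-matching metric over one-step transitions). None of these modelling choices come from a specific published system.

```python
from typing import List, Sequence, Tuple


def collect_demonstrations(initial_states: Sequence[float], steps: int = 10) -> List[List[float]]:
    """Step 1: stand-in for the human, rolling out a hypothetical latent policy."""
    demos = []
    for y0 in initial_states:
        traj = [y0]
        for _ in range(steps):
            traj.append(0.5 * traj[-1] + 1.0)
        demos.append(traj)
    return demos


def fit_linear_policy(demos: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Step 2: closed-form least squares for z_next = a * z + b over all transitions."""
    xs = [z for traj in demos for z in traj[:-1]]
    ys = [z for traj in demos for z in traj[1:]]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def rollout(policy: Tuple[float, float], z0: float, steps: int = 10) -> List[float]:
    """Run the learned robot policy from a (possibly unseen) initial state."""
    a, b = policy
    traj = [z0]
    for _ in range(steps):
        traj.append(a * traj[-1] + b)
    return traj


# Demonstrations start from a spread of initial states to cover the space.
demos = collect_demonstrations([-4.0, 0.0, 6.0])
policy = fit_linear_policy(demos)
print(policy, rollout(policy, z0=10.0)[-1])
```

All data is gathered before the single fitting step, and the metric is never evaluated inside the loop; it would only be used afterwards to judge the resulting rollouts.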
Self-Improvement Learning
Similar to batch learning, self-improvement learning collects \(\mathfrak{Y}\) all in one go, as seen in Figure #RLFD-SI. The difference lies in how \(\Omega^\mathsf{x}\) is estimated:
1) Collect \(\mathfrak{Y}\) from \(\Omega^\mathsf{y}\)
2) Derive \(\Omega^\mathsf{x}\) from \(\mathfrak{Y}\)
3) Generate \(\mathfrak{X}\) from \(\Omega^\mathsf{x}\)
4) Evaluate \(M(\Omega^\mathsf{x},\Omega^\mathsf{y})\)
5) Update \(\Omega^\mathsf{x}\)
6) Repeat from step 3
In particular, once \(\Omega^\mathsf{x}\) is derived, it is used to generate new samples, which then drive the improvement of \(\Omega^\mathsf{x}\) itself. A simple approach to improvement is to estimate the gradient of \(M\) with respect to \(\Omega^\mathsf{x}\ ,\) \(\partial M/\partial \Omega^\mathsf{x}\ ,\) from the evaluation samples, and then change \(\Omega^\mathsf{x}\) accordingly.
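A minimal sketch of this loop is given below. It assumes a one-dimensional task space, a robot policy with a single parameter `theta` (the point the robot drives toward), an end-point metric, and a central finite-difference estimate of the gradient. The initial policy is derived crudely from the first demonstration and then refined from the robot's own trials. All names, constants and data are hypothetical.

```python
from typing import List, Sequence


def rollout(theta: float, z0: float, steps: int = 20) -> List[float]:
    """Generate one robot trial: move halfway toward theta at every step."""
    z = z0
    traj = [z]
    for _ in range(steps):
        z = z + 0.5 * (theta - z)
        traj.append(z)
    return traj


def metric(theta: float, starts: Sequence[float], target: float) -> float:
    """End-point metric: negative mean squared error of the trial end points."""
    return -sum((rollout(theta, z0)[-1] - target) ** 2 for z0 in starts) / len(starts)


def self_improve(
    theta: float,
    starts: Sequence[float],
    target: float,
    learning_rate: float = 0.2,
    eps: float = 1e-4,
    iterations: int = 100,
) -> float:
    """Repeat: generate trials, evaluate M, estimate dM/dtheta, update theta."""
    for _ in range(iterations):
        gradient = (
            metric(theta + eps, starts, target) - metric(theta - eps, starts, target)
        ) / (2 * eps)
        theta += learning_rate * gradient  # gradient ascent on M
    return theta


DEMO_ENDPOINTS = [1.9, 2.1]                           # collected once from the human
TARGET = sum(DEMO_ENDPOINTS) / len(DEMO_ENDPOINTS)    # used only inside the metric
STARTS = [0.0, 1.0, 4.0]                              # initial conditions for trials

theta_init = DEMO_ENDPOINTS[0]                        # crude initial policy from one demo
theta_final = self_improve(theta_init, STARTS, TARGET)
print(theta_init, theta_final)
```

The demonstrations are touched only once; every later change to the policy comes from the robot's own rollouts and the metric.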
Interactive Learning
Similar to self-improvement, interactive learning approaches \(\Omega^\mathsf{x}\) iteratively. The difference is that the demonstrator is included in the iterative framework, providing more data as needed, as seen in Figure #RLFD-Interact:
1) Collect \(\mathfrak{Y}\) from \(\Omega^\mathsf{y}\)
2) Derive \(\Omega^\mathsf{x}\) from \(\mathfrak{Y}\)
3) Generate \(\mathfrak{X}\) from \(\Omega^\mathsf{x}\)
4) Evaluate \(M(\Omega^\mathsf{x},\Omega^\mathsf{y})\)
5) Collect additional \(\mathfrak{Y}\) from \(\Omega^\mathsf{y}\) if needed
6) Update \(\Omega^\mathsf{x}\)
7) Repeat from step 3
In contrast to batch learning, interactive learning often requires that the learning process itself be relatively speedy, and that demonstrations be relatively easy to acquire. The idea is that after observing the robot's behavior in step 3, the demonstrator can provide additional demonstrations targeted at the errors made by the robot. Thus, the reliance on human intuition as to what parts of the space must be explored is lessened.
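The loop can be sketched with a toy problem. The example assumes a hypothetical human policy that maps a one-dimensional start state to one of two end points, and a robot policy that is a simple nearest-neighbour lookup over the demonstrations seen so far. After each round of robot trials, the simulated teacher supplies one extra demonstration at the start state where the robot's error was largest. In a real system the human would judge the errors by watching the robot; here that judgment is simulated by calling the human policy directly.

```python
from typing import List, Sequence, Tuple

Memory = List[Tuple[float, float]]  # stored (start state, demonstrated end point) pairs


def human_policy(y0: float) -> float:
    """Hypothetical latent human policy: pass left or right of an obstacle at 0.3."""
    return -1.0 if y0 < 0.3 else 1.0


def robot_policy(memory: Memory, x0: float) -> float:
    """Nearest-neighbour lookup over the demonstrations collected so far."""
    _, end = min(memory, key=lambda pair: abs(pair[0] - x0))
    return end


def interactive_learning(
    initial_starts: Sequence[float], trial_starts: Sequence[float], max_rounds: int = 10
) -> Memory:
    # Collect initial demonstrations and derive the first robot policy.
    memory = [(s, human_policy(s)) for s in initial_starts]
    for _ in range(max_rounds):
        # Generate robot trials.
        trials = [(s, robot_policy(memory, s)) for s in trial_starts]
        # Evaluate: the teacher compares each trial with the intended outcome.
        errors = [(abs(end - human_policy(s)), s) for s, end in trials]
        worst_error, worst_start = max(errors)
        if worst_error == 0.0:
            break
        # Collect one targeted demonstration and update the policy with it.
        memory.append((worst_start, human_policy(worst_start)))
    return memory


TRIAL_STARTS = [-1.0, -0.5, 0.0, 0.2, 0.4, 0.5, 1.0]
memory = interactive_learning([-1.0, 1.0], TRIAL_STARTS)
print(len(memory), all(robot_policy(memory, s) == human_policy(s) for s in TRIAL_STARTS))
```

Only two extra demonstrations are requested in this run, both placed near the decision boundary that the initial pair of demonstrations failed to reveal.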
Current Work
We provide a brief look at current work in RLfD/RPbD. For each program of research we provide a reference and succinctly describe the choices for \(M, \mathcal{Z}, \phi_{\mathsf{x}}, \phi_{\mathsf{y}}\) and \(L\ .\) We further provide some notes on the model used for \(\Omega^\mathsf{x}\) and the update method \(U\ .\)
For ease of viewing, we divide the field into several broad areas, placing work based on the main focus of the research. This section is by no means complete, and we invite other researchers to submit synopses of their own (or others') work.
Subtask Decomposition
This area of work is concerned with breaking an overall task into smaller pieces. Particular challenges are identifying subtasks in a demonstration (segmenting) and determining whether they should be sequenced or blended to perform the task.
Interactive Learning
Work in this area focuses on techniques whereby the user and robot can work more closely together to improve the robot's policy. Areas of interest include endowing the robot with a sense of confidence in its abilities, so it can ask for help, and allowing the user to address particular subportions of the overall task.
Trajectory Learning
Here we focus on techniques that aim to learn particular trajectories through the agent's state and action space. Research looks at generalizing appropriately to new conditions and dealing with perturbations or variance when behaving.
Reward-based LfD
These techniques use Reinforcement Learning in conjunction with LfD to improve the robot's performance beyond that of a demonstrator, with respect to a known reward function. Generally, demonstration is used as a means to initialize the search for good parameters.
Inverse Reinforcement Learning
Work in this area also combines RL with RLfD/RPbD. Here, however, the reward function is unknown; the user's demonstrations are used to infer it, and it is then optimized.
Learning from Failure
Most RLfD/RPbD work relies on successful demonstrations of the desired task by the human. Instead, work in this area looks at extracting information from a human's failed attempts.
Open problems in RLfD/RPbD
Work in RLfD/RPbD, as any work in robot learning, makes a number of assumptions. These relate to the choice of data representation, of model, of learning method and procedure. Some interesting lines of current research are revisiting some of these assumptions. We name a few below:
- Meta-Learning: Generally, the form of the robot's control policy is fixed, and learning focuses on determining appropriate parameters (even for nonparametric methods). Instead, a system could be provided with multiple possible representations of controllers and select which is most appropriate. Additionally, the data collection process can include varying amounts of interaction with the human. In this respect RLfD/RPbD can be seen as a branch of Human Robot Interaction (HRI).
- Imitation Learning and Reinforcement Learning: Imitation learning is limiting in that it requires the robot to learn only from what has been demonstrated. Reinforcement learning, in contrast, allows the robot to discover new control policies through free exploration of the state-action space. Approaches that combine imitation learning and reinforcement learning aim at exploiting the strengths of both to overcome their respective drawbacks. Demonstrations are used to guide the exploration done in reinforcement learning, hence reducing the time needed to find an adequate control policy, while still allowing the robot to depart from the demonstrated behavior. While most of these works assume a known reward to guide the exploration, Inverse Reinforcement Learning (IRL) offers a framework to automatically determine the reward and the optimal control policy. When using human demonstrations to guide learning, IRL jointly solves the What to imitate and How to imitate problems.
- Learning from Failed Demonstrations: The vast majority of work on RLfD/RPbD assumes that all the demonstrations are good demonstrations. Recent work has also investigated the possibility that demonstrations may instead be failed attempts at performing the task. In this case, RLfD/RPbD focuses on learning both what to imitate and what not to imitate. It offers an interesting alternative to approaches that combine imitation learning and reinforcement learning, in that no reward needs to be explicitly determined.
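The combination of imitation learning and reinforcement learning described in the list above, where demonstrations guide exploration under a known reward, can be sketched as a demonstration-seeded policy search. This is an illustrative toy under stated assumptions rather than any specific published algorithm: the two-dimensional policy parameter, the quadratic reward with a goal at (1, 2), the hill-climbing update and the function names are all hypothetical.

```python
import numpy as np

def reward(theta, target=np.array([1.0, 2.0])):
    """Known task reward (assumed): negative squared distance to a goal."""
    return -float(np.sum((theta - target) ** 2))

def demo_seeded_search(demos, n_iters=300, seed=0):
    """Hill-climbing policy search initialized and scaled by demonstrations."""
    rng = np.random.default_rng(seed)
    demos = np.asarray(demos, dtype=float)
    theta = demos.mean(axis=0)        # imitation: start from the demonstrated behavior
    sigma = demos.std(axis=0) + 1e-3  # explore on the scale of demonstration variability
    best = reward(theta)
    for _ in range(n_iters):
        candidate = theta + sigma * rng.standard_normal(theta.shape)
        r = reward(candidate)
        if r > best:                  # keep only perturbations that improve the reward
            theta, best = candidate, r
    return theta, best

# Imperfect demonstrations: close to the goal, but consistently short of it.
demos = [[0.6, 1.5], [0.8, 1.7], [0.5, 1.4]]
theta0 = np.mean(demos, axis=0)
theta, best = demo_seeded_search(demos)
```

Starting from the mean of the demonstrations keeps the search in a sensible region of parameter space from the first iteration, while the reward-driven perturbations let the final parameters depart from, and improve on, what was demonstrated.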
References
- Asfour, T., Azad, P., Vahrenkamp, N., Regenstein, K., Bierbaum, A., Welke, K., Schröder, J. & Dillmann, R. (2007), Toward humanoid manipulation in human-centred environments, Robotics and Autonomous Systems, Vol. 56, pp. 54-65
- Billard, A. (2000) Learning motor skills by imitation: a biologically inspired robotic model. Cybernetics & Systems, 32, 1-2, 155-193
- Billard, A., Calinon, S., Dillmann, R. and Schaal, S. (2008) Robot Programming by Demonstration (Review). Handbook of Robotics, chapter 59.
- Breazeal, C. and Scassellati, B. (2002), Robots that imitate humans, Trends in Cognitive Science, Vol. 6, Issue 11, pp. 481–487.
- Calinon, S., Evrard, P., Gribovskaya, E., Billard, A. and Kheddar, A. (2009) Learning collaborative manipulation tasks by demonstration using a haptic interface. Proceedings of the International Conference on Advanced Robotics (ICAR), 2009.
- Coates, A., Abbeel, P. and Ng, A. Y., "Learning for control from multiple demonstrations," in Proc. 25th Intl. Conf. on Machine Learning (ICML 2008), A. McCallum and S. Roweis, Eds., ACM International Conference Proceeding Series, Vol. 307, New York, NY: The Association for Computing Machinery, Inc., 2008, pp. 144-151.
- Dillmann, R. (2004), "Teaching and learning of robot tasks via observation of human performance", Robotics and Autonomous Systems Volume 47, Issues 2-3, 30 June 2004, Pages 109-116.
- Fong, T., Nourbakhsh, I. and Dautenhahn, K. (2003), A survey of socially interactive robots, Robotics and Autonomous Systems, Volume 42, Issues 3–4, pp. 143–166.
- Gaussier, P et al., From perception–action loops to imitation processes: A bottom-up approach of learning by imitation, Applied Artificial Intelligence Journal 12 (7–8) (1998).
- Ito, M and Tani, J, On-line imitative interaction with a humanoid robot using a dynamic neural network model of a mirror system. Adaptive Behavior, 12 2 (2004), pp. 93–115.
- Goodrich, M and Schultz, A (2007), Human-robot interaction: a survey, Foundations and Trends in Human-Computer Interaction, Vol 1, issue 3.
- Grollman, D and Jenkins, O.C, Incremental learning of subtasks from unsegmented demonstration, In International Conference on Intelligent Robots and Systems, Taipei, Taiwan, October 2010.
- Kim, S., Kim, C., You, B. and Oh, S (2009) Stable Whole-body Motion Generation for Humanoid robots to Imitate Human Motions. Proc. IEEE/RSJ Intl Conf. on Intelligent Robots and Systems (IROS).
- Kormushev, P, Calinon, S, and D. Caldwell, “Imitation Learning of Positional and Force Skills Demonstrated via Kinesthetic Teaching and Haptic Input,” Advanced Robotics, pp. 1–20, 2011.
- Kronander, K and Billard, A. Online Learning of Varying Stiffness Through Physical Human-Robot Interaction, IEEE-RAS Int. Conf. on Human-Robot Interaction (ICRA), 2012.
- Kruger, V.; Herzog, D.; Baby, S.; Ude, A.; Kragic, D.; Learning actions from observations, Robotics and Automation Magazine, 17:2, 30-43, 2010
- Kulic, D, Takano, W and Nakamura, Y, “Incremental learning, clustering and hierarchy formation of whole body motion patterns using adaptive hidden Markov chains,” Int. J. Robot. Res., vol. 27, no. 7, pp. 761–784, 2008.
- Nehaniv & Dautenhahn, "Like Me? - Measures of Correspondence and Imitation," Cybernetics and Systems, Jan 2011 pp. 11-51
- Nicolescu, M. N and Matarić, M.J, Methods for robot task learning: Demonstrations, generalization and practice, in: Proceedings of the Second International Joint Conference on Autonomous Agents and Multi-Agent Systems, AAMAS’03, 2003.
- L. Rozo, P. Jimenez, and C. Torras, “Robot Learning from Demonstration of Force-based Tasks with Multiple Solution Trajectories,” in 15th International Conference on Advanced Robotics (ICAR), 2011, pp. 124–129.
- Sauser, E., Argall, Brenna Dee, Metta, Giorgio and Billard, A. (2011) Iterative Learning of Grasp Adaptation through Human Corrections. Robotics and Autonomous Systems
- Tani, J., Ito, M. and Sugita, Y., Self-organization of distributed represented multiple behavior schemata in a mirror system: Reviews of robot experiments using RNNPB. Neural Networks, 17 8–9 (2004), pp. 1273–1289
- Ude, A, Atkeson, C.G and Riley, M "Programming full-body movements for humanoid robots by observation", Robotics and Autonomous Systems, vol. 47, pp. 93-108, 2004
Additional Reading
Machine Learning Approaches to RLfD
The vast majority of work on RLfD/RPbD takes a machine learning approach to the problem.
- Surveys of work in this area can be found in:
- B.D. Argall, S. Chernova, M. Veloso, and B. Browning, (2010). A Survey of Robot Learning from Demonstration. Robotics and Autonomous Systems.
- A. Billard, S. Calinon, R. Dillmann and S. Schaal (2008). Robot Programming by Demonstration. Handbook of Robotics: MIT Press.
- S. Schaal, A. Ijspeert and A. Billard (2003). Computational approaches to motor learning by imitation, Philosophical Transactions: Biological Sciences (The Royal Society).
- S. Schaal (1999). Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences.
Biological Inspiration and Neural Modeling
RLfD/RPbD is at its core inspired by the way humans learn from being guided by experts, from infancy through adulthood. A large body of work on RLfD/RPbD therefore draws on concepts from psychology and biology. Some of these works pursue a computational neuroscience approach and use neural modeling. Others pursue a more cognitive science approach and build conceptual models of imitation learning in animals.
- Surveys of work in this area can be found in:
- E. Oztop, M. Kawato (2006). Mirror neurons and imitation: A computationally guided review. Neural Networks.
- K. Dautenhahn and C. Nehaniv (2002). Imitation in Animals and Artifacts, MIT Press.
- A. Billard (2002). Imitation. Handbook of Brain Theory and Neural Networks: MIT Press.
- C. Breazeal and B. Scassellati (2002). "Robots that imitate humans," Trends in Cognitive Science.



