# User:Aude Billard/Proposed/Formalism for Learning from Demonstration

To enable comparisons between techniques for Robot Learning from / Programming by Demonstration (LfD-PbD), we here present a formalism for discussing and representing the various approaches. This formalism is necessarily highly abstract, so that the broadest possible collection of work can be described. Much current work in LfD-PbD focuses on only one or two aspects of the overall LfD-PbD paradigm as set forth here.

## Formalism

LfD-PbD assumes that there exists a desired robot control policy, latent in a human, denoted $$\Omega^\mathsf{y}\ .$$ This policy describes perfectly what the robot should do in every situation. The goal is to create an analogous robot policy, $$\Omega^\mathsf{x}\ ,$$ that will cause the robot to exhibit the desired behavior. Ideally, instantiating this policy should be easy and intuitive.

To compare $$\Omega^\mathsf{y}$$ and $$\Omega^\mathsf{x}\ ,$$ we introduce a metric, $$M\ .$$ Conceptually, $$M$$ computes the similarity between the policies, and is optimal when the two match exactly, that is, when the robot does exactly what the human would want in all situations. Note that there may be multiple $$\Omega^\mathsf{x}$$ that maximize $$M\ ,$$ due in part to differences in robot and human perception and embodiment.

To deal with these differences between the human and robot, we map human and robot perceptions and actions into a common, perhaps task-specific space, with a pair of operators $$\phi_\mathsf{x},\phi_\mathsf{y}$$. These operators represent a solution to the correspondence problem and allow for a direct comparison of the policies. Generally, in LfD-PbD work these operators are taken as given and are encoded into the teaching interface.

Figure 1: An illustration contrasting One-shot Batch LfD with Reinforcement learning. Both utilize the mapping from the robot's sensory and action spaces into the task domain, as well as a metric for evaluating the robot's policy and an update method to improve the controller. LfD also requires a mapping from the human to the task space. RL may require multiple cycles, but in one-shot LfD only one update step is performed

$$\Omega^\mathsf{x}$$ is updated (learnt) from data via an update operator $$U\ ,$$ which aims to find a (perhaps local) optimum of $$M\ .$$ We differentiate here between the algorithmic way in which new data is incorporated into the estimate of $$\Omega^\mathsf{x}\ ,$$ that is, $$U\ ,$$ and the actual method by which information flows from the human to the robot, which we call the learning process, $$L\ .$$ Both are areas of active research.

Approaches to $$U$$ can roughly be grouped in two categories. The first set of batch techniques process all of the data at once and form the robot control policy from scratch. If new data is to be incorporated, all the previous data must be reprocessed as well, so all data must be kept. In contrast, the second set of incremental learning techniques update the current policy in light of new data, after which the data can be discarded. Incremental techniques thus lend themselves well to interactive teaching where the robot demonstrates at each step of the training its current understanding of the task. Doing so allows the human teacher to monitor the robot's progress and to provide more directed teaching.

One-shot batch learning, where a single demonstration is collected and processed, is depicted in Figure 1 as the blue line. Generally an analytical approach is used to derive a closed-form solution for optimizing $$M$$ from the data. We contrast this with pure reinforcement learning (RL - green line), which does not use human demonstration, but only the metric (reward).

We posit that the process by which a learner agent (robot) learns from observing and interacting with a demonstrator agent (human) can be entirely described by the following variables:

| Robot | Human | Description |
|-------|-------|-------------|
| $$\mathsf{x}$$ | $$\mathsf{y}$$ | The learner agent (robot); the demonstrator agent (human) |
| $$\mathsf{D_x}$$ | $$\mathsf{D_y}$$ | The dimensionalities of their state spaces |
| $$\mathcal{X} \subset \Re^\mathsf{D_x}$$ | $$\mathcal{Y} \subset \Re^\mathsf{D_y}$$ | Their state spaces |
| $$\mathbf{x} \in \mathcal{X}$$ | $$\mathbf{y} \in \mathcal{Y}$$ | An individual state (a $$\mathsf{D_x}$$- or $$\mathsf{D_y}$$-dimensional column vector) |
| $$\mathbf{X}^n = \{\mathbf{x}^n_t\}_{t=0}^{T_n^\mathsf{x}}$$ | $$\mathbf{Y}^n = \{\mathbf{y}^n_t\}_{t=0}^{T_n^\mathsf{y}}$$ | A trajectory through states of length $$T_n$$ (a $$\mathsf{D_x} \times T_n^\mathsf{x}$$ or $$\mathsf{D_y} \times T_n^\mathsf{y}$$ matrix) |
| $$\mathfrak{x} = \{\mathbf{x}^s\}_{s=1}^S$$ | $$\mathfrak{y} = \{\mathbf{y}^s\}_{s=1}^S$$ | A set of $$S$$ states |
| $$\mathfrak{X} = \{\mathbf{X}^n\}_{n=1}^N$$ | $$\mathfrak{Y} = \{\mathbf{Y}^n\}_{n=1}^N$$ | A set of $$N$$ trajectories |
| $$\mathbf{X} = \Omega^\mathsf{x}(\mathbf{x}_0)$$ | $$\mathbf{Y}=\Omega^\mathsf{y}(\mathbf{y}_0)$$ | The control policies / behavior controllers |
| $$\mathbf{z} \in \mathcal{Z} \subset \Re^\mathsf{D_z}$$ | | The (shared) task space |
| $$\mathbf{x} \approx \mathbf{y}$$ | | Equivalence relationship between human and robot states |
| $$\mathbf{z}=\phi_\mathsf{x}(\mathbf{x})$$ | $$\mathbf{z}=\phi_\mathsf{y}(\mathbf{y})$$ | Operators that map to task space |
| $$M(\Omega^\mathsf{x},\Omega^\mathsf{y}) \approx M(\Omega^\mathsf{x}(\mathcal{X}),\Omega^\mathsf{y}(\mathcal{Y}))$$ | | A metric to compare the two controllers |
| $$\Omega^\mathsf{x} = U(\Omega^\mathsf{x})$$ | | The robot update |
| $$\Omega^\mathsf{x} = L(\mathsf{x},\mathsf{y})$$ | | The learning process |

### Behavioral Samples

As direct access to $$\Omega^\mathsf{y}$$ is nigh-impossible, LfD-PbD attempts to infer it from presumably informative demonstrations. A demonstration is a trajectory through the human's state space, starting from some initial state and running for some number of steps. Consider the demonstrator's $$\mathsf{D_y}$$-dimensional state space $$\mathcal{Y}\ ,$$ which is a subset of $$\Re^\mathsf{D_y}\ ,$$ and where individual states are $$\mathsf{D_y}$$-dimensional column vectors, $$\mathbf{y} \in \mathcal{Y}\ .$$ Starting from an initial state $$\mathbf{y}^n_0\ ,$$ we denote the nth trajectory (demonstration) as $$\mathbf{Y}^n = \Omega^\mathsf{y}(\mathbf{y}^n_0) = \{\mathbf{y}_t\}_{t=0}^{\mathsf{T}^\mathsf{y}_n}\ .$$ We can also consider a set of (not necessarily unique) $$\mathsf{N_y}$$ initial conditions, $$\mathfrak{y}_0 = \{\mathbf{y}^n_0\}_{n=1}^{\mathsf{N_y}}\ ,$$ and the resulting set of trajectories $$\mathfrak{Y} = \Omega^\mathsf{y}(\mathfrak{y}_0) = \{\Omega^\mathsf{y}(\mathbf{y}^n_0)\}_{n=1}^{\mathsf{N_y}}\ .$$ Note that demonstrations need not have the same length ($$\mathsf{T}^\mathsf{y}_i \ne \mathsf{T}^\mathsf{y}_j, i \ne j$$). Further, even if two of the initial conditions are identical, the generated trajectories may differ, due to noise or other errors.

Likewise, trajectories generated by the robot are considered indicative of the learned behavior. We similarly define the states of the robot as $$\mathbf{x} \in \mathcal{X} \subset \Re^\mathsf{D_x}$$ and consider a set of trials $$\mathfrak{X} = \Omega^\mathsf{x}(\mathfrak{x}_0)$$ (from a set of $$\mathsf{N_x}$$ initial positions). Comparing the two behaviors can then be done by evaluating the sets of trajectories they produce, approximating $$M(\Omega^\mathsf{x},\Omega^\mathsf{y})$$ with $$M(\mathfrak{X},\mathfrak{Y})\ .$$ We implicitly assume that the best approximation of $$M(\Omega^\mathsf{x},\Omega^\mathsf{y})$$ would be achieved by running both controllers from every possible initial condition and computing $$M(\Omega^\mathsf{x}(\mathcal{X}),\Omega^\mathsf{y}(\mathcal{Y}))\ .$$

To aid in the comparison, we map each of the agent's spaces into some equivalence space, $$\mathbf{z} \in \mathcal{Z} \subset \Re^\mathsf{D_z}\ .$$ Let there be $$\phi_\mathsf{y}: \mathcal{Y} \rightarrow \mathcal{Z}$$ and $$\phi_\mathsf{x}: \mathcal{X} \rightarrow \mathcal{Z}\ ,$$ which are likely many-to-one, and therefore not uniquely reversible. Comparisons between trajectories then take place directly in this equivalence space.
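As a concrete, purely illustrative example of such a pair of operators, suppose the robot's state is the two joint angles of a planar 2-link arm while the human's state is a tracked wrist position; both can be projected into a shared planar task space. A minimal Python sketch, assuming unit link lengths (all names and dimensions are hypothetical):

```python
import math

def phi_x(x):
    """Map a robot state (two joint angles of a planar 2-link arm with
    unit link lengths) to the end-effector position in task space Z."""
    q1, q2 = x
    return (math.cos(q1) + math.cos(q1 + q2),
            math.sin(q1) + math.sin(q1 + q2))

def phi_y(y):
    """Map a human state (tracked wrist position) to task space; here
    simply a projection onto the shared plane."""
    return (y[0], y[1])
```

Note that `phi_x` is many-to-one in general (elbow-up and elbow-down configurations map to the same end-effector position), which is exactly why these operators are not uniquely reversible.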

### Metrics

We here describe some basic, simple possibilities for $$M$$ that make use of these mappings. These base metrics can be further combined to generate more complex measures.

#### End Point Location

For tasks where only the final state of the robot matters (e.g. reaching a navigation goal), a possible $$M$$ is

$$M(\mathfrak{X},\mathfrak{Y}) = -\frac{1}{\mathsf{N_x}}\sum_{n=1}^{\mathsf{N_x}} \left(\phi_\mathsf{x}(\mathbf{x}^n_{\mathsf{T}^\mathsf{x}_n}) - \frac{1}{\mathsf{N_y}}\sum_{m=1}^{\mathsf{N_y}}\phi_\mathsf{y}(\mathbf{y}^m_{\mathsf{T}^\mathsf{y}_m})\right)^2$$

where the target location is taken to be the average ending location of the demonstrations, and learning aims to minimize the mean squared error of the ending locations of the trials.
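Under the illustrative assumption that trajectories are stored as lists of task-space points (with $$\phi$$ already applied), this metric can be sketched in a few lines of Python:

```python
def endpoint_metric(X_trials, Y_demos):
    """Negative mean squared error between each trial's final task-space
    point and the average final point of the demonstrations."""
    D = len(Y_demos[0][-1])
    # Target: the average ending location of the demonstrations.
    target = [sum(Y[-1][d] for Y in Y_demos) / len(Y_demos) for d in range(D)]
    err = 0.0
    for X in X_trials:
        err += sum((X[-1][d] - target[d]) ** 2 for d in range(D))
    return -err / len(X_trials)
```

The metric is 0 (its maximum) when every trial ends exactly at the demonstrations' mean endpoint, and grows more negative with the squared distance.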

#### Path Matching

If, in addition to the endpoint, the actual trajectory is important, we can use

$$M(\mathfrak{X},\mathfrak{Y}) = -\sum_{n=1}^{\mathsf{N_x}} \sum_{t=0}^{\mathsf{T}^\mathsf{x}_n} \left(\phi_\mathsf{x}(\mathbf{x}^n_t) - \phi_\mathsf{y}(\mathbf{y}^n_t)\right)^2$$

However, in order to use this equation, we require that the human and robot generate the same number of trajectories ($$\mathsf{N_x} = \mathsf{N_y}$$), starting in equivalent locations ($$\mathfrak{y}_0 \approx \mathfrak{x}_0$$), and that the paired trajectories take the same length of time ($$\mathsf{T}^\mathsf{y}_n = \mathsf{T}^\mathsf{x}_n$$). These first two conditions can be met with experimental design, and the last is often achieved by resampling or time warping.
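A sketch of this metric, with simple linear resampling standing in for the time warping mentioned above (trajectories again assumed to be lists of task-space points, paired by index):

```python
def resample(path, T):
    """Linearly resample a task-space path to T+1 points (t = 0..T), a
    crude stand-in for the resampling/time-warping mentioned above."""
    out = []
    for t in range(T + 1):
        pos = t * (len(path) - 1) / T
        i = min(int(pos), len(path) - 2)
        a = pos - i
        out.append(tuple((1 - a) * p + a * q
                         for p, q in zip(path[i], path[i + 1])))
    return out

def path_matching_metric(X_trials, Y_demos, T=50):
    """Negative summed squared distance between paired, resampled
    trajectories (assumes N_x == N_y and paired initial states)."""
    total = 0.0
    for X, Y in zip(X_trials, Y_demos):
        for px, py in zip(resample(X, T), resample(Y, T)):
            total += sum((a - b) ** 2 for a, b in zip(px, py))
    return -total
```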

#### Path Similarity

To relax those assumptions, we can introduce features of paths, such as smoothness or minimum jerk. We can then compute

$$M(\mathfrak{X},\mathfrak{Y}) = -\left(\frac{1}{\mathsf{N_x}}\sum_{n=1}^{\mathsf{N_x}} f(\phi_\mathsf{x}(\mathbf{X}^n)) - \frac{1}{\mathsf{N_y}}\sum_{n=1}^{\mathsf{N_y}} f(\phi_\mathsf{y}(\mathbf{Y}^n))\right)$$

where $$f$$ is some feature of the trajectories defined in the task space.
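As an illustration, $$f$$ might be a discrete smoothness measure. The sketch below takes the absolute value of the difference of mean features so that the metric is maximal when they match exactly (an added assumption; the text leaves the sign of the difference unconstrained):

```python
def smoothness(path):
    """A simple path feature: mean squared second difference, a
    discrete stand-in for jerk-like smoothness measures."""
    acc = 0.0
    for i in range(1, len(path) - 1):
        acc += sum((path[i - 1][d] - 2 * path[i][d] + path[i + 1][d]) ** 2
                   for d in range(len(path[0])))
    return acc / max(len(path) - 2, 1)

def feature_metric(X_trials, Y_demos, f=smoothness):
    """Negative absolute difference between the mean feature value of
    the trials and that of the demonstrations."""
    fx = sum(f(X) for X in X_trials) / len(X_trials)
    fy = sum(f(Y) for Y in Y_demos) / len(Y_demos)
    return -abs(fx - fy)
```

Because only aggregate features are compared, this form needs neither paired trajectories nor equal lengths.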

#### Reward

A particular case of the above is to consider the accumulated reward $$R$$ along the paths. In this case, however, it is unnecessary to compare to the demonstrator; instead, the demonstrator can be used to initialize the learning process, or to provide the reward signal itself.

$$M(\mathfrak{X},\mathfrak{Y}) = \frac{1}{\mathsf{N_x}}\sum_{n=1}^{\mathsf{N_x}} R(\phi_\mathsf{x}(\mathbf{X}^n))$$

#### Probabilistic

An alternate view is to treat the known trajectories as samples from some underlying probability distribution:

$$M(\mathfrak{X},\mathfrak{Y}) = P(\phi_\mathsf{y}(\mathfrak{Y})|\phi_\mathsf{x}(\mathfrak{X}))$$

where $$P$$ is a density estimator. This approach can be thought of as maximizing the probability that the robot will generate the same trajectories as the human.
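One simple instantiation of $$P$$ is a kernel density estimate fit to the task-space points of the robot's trials and evaluated on the demonstration points. A pure-Python sketch, assuming Gaussian kernels and a hand-picked bandwidth:

```python
import math

def log_kde(points, query, bandwidth=0.5):
    """Log-density of `query` under a Gaussian kernel density estimate
    built from `points` (task-space vectors of equal dimension)."""
    d = len(query)
    norm = (2 * math.pi * bandwidth ** 2) ** (d / 2) * len(points)
    s = sum(math.exp(-sum((q - p) ** 2 for q, p in zip(query, pt))
                     / (2 * bandwidth ** 2)) for pt in points)
    return math.log(s / norm)

def probabilistic_metric(X_trials, Y_demos, bandwidth=0.5):
    """Average log-likelihood of the demonstration points under a KDE
    fit to the points produced by the robot's trials."""
    robot_pts = [p for X in X_trials for p in X]
    demo_pts = [p for Y in Y_demos for p in Y]
    return sum(log_kde(robot_pts, q, bandwidth)
               for q in demo_pts) / len(demo_pts)
```

The metric increases as the robot's trials place density where the demonstrations lie, matching the interpretation above.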

### Learning

Updates of the learner agent's controller, $$\Omega^\mathsf{x}=U(\Omega^\mathsf{x})\ ,$$ can be based on new demonstrations $$\mathfrak{Y}\ ,$$ self-trials $$\mathfrak{X}\ ,$$ or even specific corrections. Key is the concept of generalization: the ability of the robot to behave appropriately in novel situations. Arguably, generalization is one way to discover task equivalence; the distinction between $$\{\phi_\mathsf{x}, \phi_\mathsf{y}\}$$ and the generalization that occurs during learning is mostly one of scale and prior knowledge. The learning process itself defines how the robot and human controllers are used to generate data, and if and when explicit evaluations of $$M$$ are performed. We denote the overall process $$\Omega^\mathsf{x}=L(\mathsf{x},\mathsf{y})$$ and now provide some general frameworks which can be used to perform $$L\ .$$

##### Batch Learning
Figure 2: A batch learning process, where all demonstrations are collected before learning.

Batch learning is illustrated in Figure 2 and can be described pseudo-algorithmically as:

1. Collect $$\mathfrak{Y}$$ from $$\Omega^\mathsf{y}$$
2. Derive $$\Omega^\mathsf{x}$$ from $$\mathfrak{Y}$$ based on analytical analysis of $$M$$

In batch learning, all samples from the demonstrator are collected before learning takes place, generally because the data collection process itself is difficult, tedious, or expensive. Further, the learning techniques employed may take a long time to process the data (compared to the time spent collecting it). It is thus infeasible to collect some data, process it, and then collect more.

When using batch learning, care must be taken to sample the demonstrator's behavior sufficiently. Generally this means starting from a wide variety of initial positions, but can also mean perturbing the human in different ways. Usually, human intuition is used to determine which demonstrations are sufficient to cover, or span, all the possible situations the robot may encounter during autonomous execution.

The learning update itself often makes use of the mathematical properties of $$M\ ,$$ such as an analytical solution to maximize it. As such, it is often not computed during learning itself, but used as an evaluative tool afterwards.
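As a toy instance of such an analytical update, consider a one-dimensional linear policy $$x' = ax + b$$ fit to all demonstration transitions at once; the closed-form least-squares solution optimizes a squared-error $$M$$ directly (the policy form and 1-D state are illustrative assumptions, not part of the formalism):

```python
def batch_fit_linear(demos):
    """Closed-form least-squares fit of a 1-D linear policy x' = a*x + b
    to every transition in a batch of demonstrations (lists of floats)."""
    xs, ys = [], []
    for Y in demos:
        for t in range(len(Y) - 1):
            xs.append(Y[t])        # state at time t
            ys.append(Y[t + 1])    # state at time t+1
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b
```

Adding a new demonstration requires re-running the fit over all stored transitions, which is exactly the batch property described above.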

##### Self-Improvement Learning
Figure 3: Self-Improvement learning, where Batch RLfD has been used to initialize reinforcement learning.

Similar to batch learning, self-improvement learning collects $$\mathfrak{Y}$$ all in one go, as seen in Figure 3. The difference lies in how $$\Omega^\mathsf{x}$$ is estimated:

1. Collect $$\mathfrak{Y}$$ from $$\Omega^\mathsf{y}$$
2. Derive $$\Omega^\mathsf{x}$$ from $$\mathfrak{Y}$$
3. Generate $$\mathfrak{X}$$ from $$\Omega^\mathsf{x}$$
4. Evaluate $$M(\Omega^\mathsf{x},\Omega^\mathsf{y})$$
5. Update $$\Omega^\mathsf{x}$$
6. Repeat from 3

Particularly, once $$\Omega^\mathsf{x}$$ is derived, it is used to generate new samples, which then drive the improvement of $$\Omega^\mathsf{x}$$ itself based on explicit computations of $$M\ .$$ A simple approach to improvement is to estimate the gradient $$\nabla M/\nabla \Omega^\mathsf{x}$$ from the evaluation samples, and then change $$\Omega^\mathsf{x}$$ accordingly.
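A minimal sketch of this gradient step, for a policy reduced to a single scalar parameter and an `evaluate` callable that stands in for rolling out trials and computing $$M$$ (both heavy simplifications of the general case):

```python
def self_improve(theta, evaluate, step=0.1, eps=1e-3, iters=50):
    """Hill-climb a scalar policy parameter theta by estimating the
    gradient of the metric M from paired evaluation rollouts
    (central finite differences), then stepping uphill."""
    for _ in range(iters):
        grad = (evaluate(theta + eps) - evaluate(theta - eps)) / (2 * eps)
        theta += step * grad
    return theta
```

With a noiseless quadratic `evaluate`, the parameter converges to the maximizer of $$M$$; with real rollouts, each call to `evaluate` would itself average over several noisy trials.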

##### Interactive Learning
Figure 4: Interactive learning consists of a closed loop, where new demonstrations are generated in response to robot performance.

Similar to self-improvement, interactive learning approaches $$\Omega^\mathsf{x}$$ iteratively. The difference is that the demonstrator is included in the iterative framework, providing more data as needed, as seen in Figure 4:

1. Collect $$\mathfrak{Y}$$ from $$\Omega^\mathsf{y}$$
2. Derive $$\Omega^\mathsf{x}$$ from $$\mathfrak{Y}$$
3. Generate $$\mathfrak{X}$$ from $$\Omega^\mathsf{x}$$
4. Evaluate $$M(\Omega^\mathsf{x},\Omega^\mathsf{y})$$
5. Collect additional $$\mathfrak{Y}$$ from $$\Omega^\mathsf{y}$$ if needed
6. Update $$\Omega^\mathsf{x}$$
7. Repeat from 3

In contrast to batch learning, interactive learning often requires that the learning process itself be relatively speedy, and that demonstrations be relatively easy to acquire. The idea is that after observing the robot's behavior in step 3, the demonstrator can provide additional demonstrations targeted at the errors made by the robot. The reliance on human intuition as to which parts of the space must be explored is thus lessened.
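The loop above can be sketched as follows, with `demonstrate`, `derive`, `rollout`, and `metric` as placeholders for system-specific components (all names hypothetical):

```python
def interactive_learn(initial_states, demonstrate, derive, rollout, metric,
                      threshold=-0.01, max_rounds=10):
    """Sketch of the interactive loop: derive a policy, roll it out, and
    ask the demonstrator for more data while the metric remains poor."""
    demos = [demonstrate(s0) for s0 in initial_states]
    policy = derive(demos)
    for _ in range(max_rounds):
        trials = [rollout(policy, s0) for s0 in initial_states]
        score = metric(trials, demos)
        if score >= threshold:
            break
        # Targeted redemonstration from the trials' initial states.
        demos += [demonstrate(t[0]) for t in trials]
        policy = derive(demos)
    return policy
```

In a real system the "if needed" of step 5 would be decided per situation rather than by a single global threshold; the sketch only shows where the demonstrator re-enters the loop.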