# Robot learning by demonstration/Current Work

We provide a brief look at current work in RLfD/RPbD. For each program of research we provide a reference and succinctly describe the choices for \(M, \mathcal{Z}, \phi_{\mathsf{x}}, \phi_{\mathsf{y}}\) and \(L\). We further provide some notes on the model used for \(\Omega^\mathsf{x}\) and the update method \(U\). This section is by no means complete, and we invite other researchers to submit synopses of their own (or others') work.

## Subtask Decomposition

- Version Space Algebra - Pardowitz et al. (2007)
- \(M\ :\) Logic steps.
- \(\mathcal{Z}\ :\) Hand, finger and upper torso motion, motion of objects, voice commands
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Cybergloves and stereo cameras mounted on pan-tilt units; voice entry, audio response, gesture recognition and iconic information channels are integrated into the system.
- \(L\ :\) One-shot learning. Each human action is decomposed automatically into a set of predefined primitives.
- \(\Omega^\mathsf{x},U\ :\) A sequence of actions is learned by adapting the probability of transition across these actions.

- Learning Behavior Fusion from Demonstration - Nicolescu et al. (2008)
- \(M\ :\) Task performance and consistency with user style
- \(\mathcal{Z}\ :\) Robot navigational position and orientation
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Joystick teleoperation
- \(L\ :\) One-shot learning
- \(\Omega^\mathsf{x},U\ :\) Demonstrations are split into sequences of subgoals, where each subgoal is achieved by a weighted combination of behavior primitives.

- Hierarchical HMM decomposition and re-composition of a task into motion primitives - Kulic & Nakamura (2010)
- \(M\ :\) Maximum Likelihood
- \(\mathcal{Z}\ :\) Full body joint motion
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) 3D visual decomposition
- \(L\ :\) A long demonstration of motion that encompasses a series of sub-motions is presented to the system. The system automatically builds HMMs in a hierarchical manner so as to determine the minimum number of sub-motions and to learn their sequencing in the demonstration.
- \(\Omega^\mathsf{x},U\ :\) Each HMM is learned using Expectation-Maximization. A new HMM is created based on likelihood estimates.

- Extracting the important features of a task - Billard et al. (2006)
- \(M\ :\) Relative likelihood
- \(\mathcal{Z}\ :\) Joint angles, 3D Cartesian position of the end-effector or of an object
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Teleoperation / Kinesthetic / Handwriting
- \(L\ :\) Batch
- \(\Omega^\mathsf{x},U\ :\) One GMM, fit with Expectation-Maximization, is learned for each representation of the task (path of the end-effector relative to each object, joint angle trajectory). Comparing likelihoods across models makes it possible to determine the optimal representation. A cost function that balances costs across the different representations is updated. The weights in the cost function are inversely proportional to the likelihood. The less likely a representation is, the less it influences reproduction.

- Realtime Overlapping Gaussian Expert Regression - Grollman & Jenkins (2010)
- \(M\ :\) Task performance
- \(\mathcal{Z}\ :\) Joint Angles, Cartesian velocities
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Immersive teleoperation
- \(L\ :\) Incremental, interactive
- \(\Omega^\mathsf{x},U\ :\) The overall policy is a multimap, where one state can lead to multiple actions depending on the active subtask. Statistical techniques automatically determine the number of subtasks and their individual policies.

## Interactive Learning

- Confidence Based Autonomy - Chernova & Veloso (2009)
- \(M\ :\) Task performance
- \(\mathcal{Z}\ :\) Obstacle distances and discrete actions.
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Teleoperation
- \(L\ :\) Interactive, active learning. The system prompts the user for more information as needed, and the user can provide correction for mistakes.
- \(\Omega^\mathsf{x},U\ :\) Classifiers (e.g. GMM, SVM) are learned to associate portions of state space with actions. Each action learns its own confidence threshold to trigger active learning.
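
The confidence-gated loop described for Confidence Based Autonomy can be illustrated with a small sketch. It is not the published implementation: the nearest-neighbour classifier, the `exp(-distance)` confidence score, the fixed per-action thresholds and all names (`ConfidenceGatedPolicy`, `ask_teacher`) are assumptions made for illustration, standing in for the GMM or SVM classifiers and the learned thresholds mentioned above.

```python
import math


class ConfidenceGatedPolicy:
    """Toy policy that acts on its own only when it is confident enough."""

    def __init__(self, default_threshold=0.5):
        self.examples = []        # stored (state, action) demonstrations
        self.thresholds = {}      # one confidence threshold per action (fixed here)
        self.default_threshold = default_threshold

    def add_demonstration(self, state, action):
        self.examples.append((tuple(state), action))
        self.thresholds.setdefault(action, self.default_threshold)

    def predict(self, state):
        """Return (action, confidence) taken from the nearest stored example."""
        if not self.examples:
            return None, 0.0
        nearest, action = min(self.examples,
                              key=lambda ex: math.dist(ex[0], state))
        # Hypothetical confidence score: 1 at a known state, decaying with distance.
        return action, math.exp(-math.dist(nearest, state))

    def step(self, state, ask_teacher):
        """Return (action, asked): asked is True if a demonstration was requested."""
        action, confidence = self.predict(state)
        if action is None or confidence < self.thresholds[action]:
            action = ask_teacher(state)          # active learning: query the teacher
            self.add_demonstration(state, action)
            return action, True
        return action, False


if __name__ == "__main__":
    policy = ConfidenceGatedPolicy()
    teacher = lambda s: "left" if s[0] < 0 else "right"
    for state in ([-1.0, 0.0], [-1.1, 0.0], [3.0, 0.0]):
        print(state, policy.step(state, teacher))
```

In this sketch the teacher is queried only for states far from anything seen before, so the number of requested demonstrations shrinks as the stored examples cover more of the state space.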

- Dogged Learning - Grollman & Jenkins (2008)
- \(M\ :\) Minimizing mean-squared error on actions (equivalent to one-step trajectories).
- \(\mathcal{Z}\ :\) Robot motor pose, color-blob vision, walk direction and speed, predefined motions
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Immersive teleoperation
- \(L\ :\) Interactive learning, where the human observes task performance and may provide additional demonstration as desired.
- \(\Omega^\mathsf{x},U\ :\) The robot controller is viewed as a mapping from robot states to actions (or desired next states). Multiple incremental, sparse regression techniques (LWPR - Locally Weighted Projection Regression, and SOGP - Sparse Online Gaussian Processes) are used to directly estimate the policy.

- Advice Operators / Refinement and Reuse of Skills - Argall et al. (2011), Argall et al. (2010)
- \(M\ :\) Task performance metrics and teacher evaluation
- \(\mathcal{Z}\ :\) Features computed from robot position and orientation, goal position and orientation, robot translational and rotational speeds.
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Teleoperation
- \(L\ :\) Incremental / Interactive: The teacher observes and corrects multiple learner executions, a batch update is performed, and the process repeats.
- \(\Omega^\mathsf{x},U\ :\) The learned controller is a continuous state-action mapping from computed state features to robot actions (wheel speeds), learned by Locally Weighted Learning. Teacher correction generates new data (not via teleoperation), which is added to the demonstration set. Lazy learning techniques mean that the policy is re-derived only locally, as visited during new learner executions.
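
The lazy, local character of this kind of policy can be sketched as follows. This is a minimal illustration that assumes the simplest form of locally weighted learning (a Gaussian-kernel weighted average of stored actions); the class name `LazyLocalPolicy`, the `bandwidth` parameter and the example data are hypothetical and not taken from the cited work.

```python
import math


class LazyLocalPolicy:
    """Stores (features, action) pairs and derives the action only at query time."""

    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth
        self.data = []

    def add(self, features, action):
        # Demonstrations and teacher corrections are stored in the same way.
        self.data.append((tuple(features), tuple(action)))

    def act(self, features):
        if not self.data:
            raise ValueError("no demonstration data available")
        weights = [
            math.exp(-math.dist(f, features) ** 2 / (2.0 * self.bandwidth ** 2))
            for f, _ in self.data
        ]
        total = sum(weights)
        if total == 0.0:
            # Query is far from all data: fall back to the nearest stored action.
            return min(self.data, key=lambda d: math.dist(d[0], features))[1]
        dim = len(self.data[0][1])
        return tuple(
            sum(w * a[i] for w, (_, a) in zip(weights, self.data)) / total
            for i in range(dim)
        )


if __name__ == "__main__":
    policy = LazyLocalPolicy()
    policy.add([0.0], [0.0, 0.0])      # hypothetical wheel speeds
    policy.add([1.0], [1.0, 2.0])
    print(policy.act([0.5]))
    policy.add([0.5], [2.0, 2.0])      # a corrected data point
    print(policy.act([0.5]))
```

Because nothing is precomputed, adding a corrected data point changes the policy only near that point, and only when the region is queried again.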

## Trajectory Learning

- Dynamic Movement Primitives - Schaal et al. (2003)
- \(M\ :\) Locally Weighted Regression
- \(\mathcal{Z}\ :\) 3D position/Joint angles
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Teleoperation / Kinesthetic via exoskeleton
- \(L\ :\) Batch
- \(\Omega^\mathsf{x},U\ :\) Learn a modulatory term added to a Linear Dynamical System stable at an attractor. Precise trajectory fitting of non-linear trajectories.
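
As a rough illustration of a modulatory term added to a stable linear system, the sketch below rolls out a one-dimensional system of the form \(\dot{v} = K(g - x) - Dv + (g - x_0)f(s)\), where the phase \(s\) decays from 1 to 0 and \(f\) is a weighted sum of Gaussian basis functions. The gains, the basis placement and width heuristic, the Euler integration and the function name `dmp_rollout` are assumptions chosen for this example. Fitting the weights from a demonstration (for instance by locally weighted regression) is not shown.

```python
import math


def dmp_rollout(x0, goal, weights, duration=1.0, dt=0.001, k=100.0, alpha_s=4.0):
    """Integrate a 1-D point attractor modulated by a learned forcing term."""
    d = 2.0 * math.sqrt(k)                      # critical damping
    n = len(weights)
    # Basis centres spaced along the decaying phase variable (assumed heuristic).
    centers = [math.exp(-alpha_s * i / max(n - 1, 1)) for i in range(n)]
    widths = [n ** 1.5 / c for c in centers]    # assumed width heuristic
    x, v, s = x0, 0.0, 1.0
    path = [x]
    for _ in range(int(round(duration / dt))):
        psi = [math.exp(-h * (s - c) ** 2) for h, c in zip(widths, centers)]
        total = sum(psi)
        # The forcing term is gated by s, so it vanishes as the phase decays.
        f = s * sum(w * p for w, p in zip(weights, psi)) / total if total > 1e-12 else 0.0
        a = k * (goal - x) - d * v + (goal - x0) * f
        v += a * dt
        x += v * dt
        s += -alpha_s * s * dt
        path.append(x)
    return path


if __name__ == "__main__":
    plain = dmp_rollout(0.0, 1.0, [0.0, 0.0, 0.0], duration=3.0)
    shaped = dmp_rollout(0.0, 1.0, [50.0, -30.0, 20.0], duration=3.0)
    print(plain[-1], shaped[-1])
```

Both rollouts end near the goal: the weights change the shape of the path, while the attractor guarantees where it ends.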

- Spline-Based Techniques - Aleotti & Caselli (2006)
- \(M\ :\) Mean-Square Error
- \(\mathcal{Z}\ :\) 3D position/Joint angles
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Teleoperation / Kinesthetic via exoskeleton
- \(L\ :\) Batch
- \(\Omega^\mathsf{x},U\ :\) Cluster, select, and approximate human-demonstrated trajectories.

- Gaussian Mixture Regression - Calinon et al. (2007), Gribovskaya et al. (2010)
- \(M\ :\) Maximum Likelihood
- \(\mathcal{Z}\ :\) Joint angles, 3D Cartesian position of the end-effector or of an object
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Teleoperation / Kinesthetic / Handwriting
- \(L\ :\) Batch
- \(\Omega^\mathsf{x},U\ :\) GMM fit with Expectation-Maximization. Comparing likelihoods across models makes it possible to determine the optimal representation.
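
Once a GMM has been fit, the regression step amounts to conditioning each Gaussian on the input and blending the results by responsibility. The sketch below shows this for a scalar input and a scalar output. It assumes the mixture parameters are already available (the EM fit is not shown), and the tuple layout and the function name `gmr_mean` are illustrative choices.

```python
import math


def gmr_mean(x, components):
    """Conditional mean E[y | x] under a 2-D Gaussian mixture over (x, y).

    Each component is (prior, mean_x, mean_y, var_x, cov_xy).
    """
    responsibilities = []
    conditional_means = []
    for prior, mean_x, mean_y, var_x, cov_xy in components:
        density = math.exp(-0.5 * (x - mean_x) ** 2 / var_x) / math.sqrt(2.0 * math.pi * var_x)
        responsibilities.append(prior * density)
        # Mean of y given x for a single Gaussian component.
        conditional_means.append(mean_y + cov_xy / var_x * (x - mean_x))
    total = sum(responsibilities)
    return sum(r * m for r, m in zip(responsibilities, conditional_means)) / total


if __name__ == "__main__":
    mixture = [(0.5, -1.0, 0.0, 1.0, 0.0), (0.5, 1.0, 2.0, 1.0, 0.0)]
    for query in (-1.0, 0.0, 1.0):
        print(query, gmr_mean(query, mixture))
```

The same conditioning extends to vector inputs and outputs by replacing the scalar variance and covariance with the corresponding matrix blocks.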

- Stable Estimator of Dynamical Systems (SEDS) - Khansari-Zadeh & Billard (2010)
- \(M\ :\) Global Asymptotic Stability, Likelihood or Mean Square Error
- \(\mathcal{Z}\ :\) 3D position/Orientation/Joint angles
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Teleoperation / Kinesthetic / Handwriting
- \(L\ :\) Batch
- \(\Omega^\mathsf{x},U\ :\) GMM fit with gradient ascent to optimize the objective function (likelihood or mean-square error) under global stability constraints.

## Reward-based LfD

- Gaussian Mixture Model + update through NAC - Guenter et al. (2007)
- \(M\ :\) External reward
- \(\mathcal{Z}\ :\) Cartesian coordinates of the robot's end-effector
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Kinesthetic demonstration
- \(L\ :\) Self-improvement starting from several demonstrations
- \(\Omega^\mathsf{x},U\ :\) The parameters of a Gaussian Mixture Model that was originally learned from good demonstrations are updated through RL. The original demonstrations are used to limit the exploration.

- PoWER - Kober & Peters (2008)
- \(M\ :\) External reward
- \(\mathcal{Z}\ :\) Robot joint angles and velocities, Cartesian coordinates of object
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Kinesthetic demonstration
- \(L\ :\) Self-improvement starting from a single demonstration
- \(\Omega^\mathsf{x},U\ :\) Each joint of the robot has a parameterized motor primitive. The current parameters are weighted by generating rollouts and collecting reward; the parameters are then updated based on these weights to maximize the expected reward of the policy.
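
The rollout-and-reweight idea can be sketched with a simplified reward-weighted update: perturb the parameters, collect a non-negative reward for each rollout, and move the parameters by the reward-weighted average of the best perturbations. This is only in the spirit of the description above and is not the full PoWER algorithm. The function name, the Gaussian exploration noise, the fixed noise scale and the toy reward are all assumptions for this example.

```python
import math
import random


def reward_weighted_update(reward_fn, theta, sigma=0.5, rollouts=20, keep=5,
                           iterations=50, seed=0):
    """Improve parameters by reward-weighted averaging of exploration noise.

    reward_fn must return non-negative rewards.
    """
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(iterations):
        samples = []
        for _ in range(rollouts):
            noise = [rng.gauss(0.0, sigma) for _ in theta]
            candidate = [t + e for t, e in zip(theta, noise)]
            samples.append((reward_fn(candidate), noise))   # one "rollout"
        best = sorted(samples, key=lambda s: s[0], reverse=True)[:keep]
        total = sum(r for r, _ in best)
        if total > 0.0:
            theta = [
                t + sum(r * noise[i] for r, noise in best) / total
                for i, t in enumerate(theta)
            ]
    return theta


if __name__ == "__main__":
    # Hypothetical reward that peaks at a parameter value of 3.
    reward = lambda p: math.exp(-(p[0] - 3.0) ** 2)
    print(reward_weighted_update(reward, [0.0]))
```

In a robot setting the starting parameters would come from the demonstration, and each call to `reward_fn` would correspond to executing one rollout on the system.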

- Motor skill coordination through imitation and reinforcement learning - Kormushev et al. (2010)
- \(M\ :\) Similarity estimation based on the residuals of weighted least-squares regression, combined with a manually defined episodic reward function.
- \(\mathcal{Z}\ :\) Cartesian position and orientation of the end-effector.
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Kinesthetic teaching or motion capture.
- \(L\ :\) Batch or incremental.
- \(\Omega^\mathsf{x},U\ :\) Mixture of proportional-derivative systems described by a set of virtual attractors and their respective impedance parameters (full stiffness matrices). The policy is initialized by imitation and refined by EM-based reinforcement learning, guided by a manually defined reward function and the residuals of weighted least-squares estimation.

## Inverse Reinforcement Learning

- Inverse Reinforcement Learning via Feature Matching - Abbeel & Ng (2004)
- \(M\ :\) Maximizing a reward function linear in state features, accomplished by matching feature counts.
- \(\mathcal{Z}\ :\) Robot position, orientation, velocity and angular velocity.
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Teleoperation
- \(L\ :\) Self-improvement / Interactive: The learned system is incrementally improved to convergence, and additional data are provided if the result is still unsatisfactory.
- \(\Omega^\mathsf{x},U\ :\) A Markov Decision Process (MDP), iteratively solved to equalize expected feature counts in the demonstrated and learned trajectories.
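
The quantity being equalized is the discounted feature count of a trajectory. The sketch below computes it for demonstrated and learned trajectories and returns their difference, which is one natural candidate for the reward weights in a reward that is linear in the features. Solving the MDP for each candidate reward is not shown, and the function names and the discount factor are assumptions for this example.

```python
import math


def feature_expectations(trajectories, gamma=0.9):
    """Average discounted feature counts over a set of trajectories.

    Each trajectory is a list of feature vectors, one per time step.
    """
    dim = len(trajectories[0][0])
    mu = [0.0] * dim
    for trajectory in trajectories:
        for t, phi in enumerate(trajectory):
            discount = gamma ** t
            for i in range(dim):
                mu[i] += discount * phi[i]
    return [m / len(trajectories) for m in mu]


def feature_gap(expert_trajectories, learner_trajectories, gamma=0.9):
    """Return (w, margin): the feature-count difference and its Euclidean norm."""
    mu_expert = feature_expectations(expert_trajectories, gamma)
    mu_learner = feature_expectations(learner_trajectories, gamma)
    w = [e - l for e, l in zip(mu_expert, mu_learner)]
    return w, math.sqrt(sum(x * x for x in w))


if __name__ == "__main__":
    expert = [[[1.0, 0.0], [0.0, 1.0]]]
    learner = [[[0.0, 0.0], [0.0, 0.0]]]
    print(feature_gap(expert, learner, gamma=0.5))
```

When the margin falls below a chosen tolerance, the learned trajectories match the demonstrated ones in expected feature counts, and therefore in value under any reward that is linear in those features.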

## Learning from Failure

- Learning from Failure - Grollman & Billard (2011)
- \(M\ :\) Task accomplishment (binary)
- \(\mathcal{Z}\ :\) Robot joint position and velocity
- \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Kinesthetic manipulation
- \(L\ :\) Self-improvement
- \(\Omega^\mathsf{x},U\ :\) From a set of failed demonstrations, a distribution in velocity-position space is built that reproduces the human's behavior in areas of consistency, and decreases the likelihood of the human's actions where they varied.

## References

Pardowitz, Glaser and Dillmann, "Learning repetitive robot programs from demonstrations using version space algebra." International Conference on Robotics and Applications, 2007. [1]

Nicolescu, Jenkins, Olenderski and Fritzinger, "Learning behavior fusion from demonstration." Interaction Studies 9:2, 2008, pages 319-352. [2]

Kulic & Nakamura, "Incremental Learning of Human Behaviors using Hierarchical Hidden Markov Models." International Conference on Intelligent Robots and Systems 2010, pages 4649-4655. [3]

Billard, Calinon and Guenter, "Discriminative and Adaptive Imitation in Uni-Manual and Bi-Manual Tasks." Robotics and Autonomous Systems, 54:5, 2006, pages 370-384. [4]

Grollman & Jenkins, "Incremental Learning of Subtasks from Unsegmented Demonstration." International Conference on Intelligent Robots and Systems, 2010. [5]

Chernova & Veloso, "Interactive Policy Learning through Confidence-Based Autonomy." Journal of Artificial Intelligence Research, 34, 2009, pages 1-25. [6]

Grollman & Jenkins, "Sparse Incremental Learning for Interactive Robot Control Policy Estimation." International Conference on Robotics and Automation, 2008, pages 3315-3320. [7]

Argall, Browning, and Veloso, "Teacher feedback to scaffold and refine demonstrated motion primitives on a mobile robot." Robotics and Autonomous Systems, 59:3-4, 2011, pages 243-255. [8]

Argall, Sauser, and Billard, "Tactile Guidance for Policy Adaptation." Foundations and Trends in Robotics, 1:2, 2010, pages 79-133. [9]

Schaal, Peters, Nakanishi and Ijspeert, "Control, Planning, Learning, and Imitation with Dynamic Movement Primitives." International Conference on Intelligent Robots and Systems, 2003. [10]

Aleotti & Caselli, "Robust trajectory learning and approximation for robot programming by demonstrations." Robotics and Autonomous Systems, 54:5, 2006, pages 409-413. [11]

Calinon, Guenter, and Billard, "On Learning, Representing, and Generalizing a Task in a Humanoid Robot." Transactions on Systems, Man, and Cybernetics, 37:2, 2007, pages 286-298. [12]

Gribovskaya, Khansari-Zadeh, and Billard, "Learning Nonlinear Multivariate Dynamics of Motion in Robotic Manipulators." International Journal of Robotics Research, 2010. [13]

Khansari-Zadeh & Billard, "Imitation learning of Globally Stable Non-Linear Point-to-Point Robot Motions using Nonlinear Programming." International Conference on Intelligent Robots and Systems, 2010. [14]

Guenter, Hersch, Calinon and Billard, "Reinforcement Learning for Imitating Constrained Reaching Movements." Advanced Robotics, 21:13, 2007, pages 1521-1544. [15]

Kober & Peters, "Policy Search for Motor Primitives in Robotics." Neural Information Processing Systems, 2008. [16]

Kormushev, Calinon and Caldwell, "Robot Motor Skill Coordination with EM-based Reinforcement Learning." International Conference on Intelligent Robots and Systems, 2010. [17]

Abbeel & Ng, "Apprenticeship Learning via Inverse Reinforcement Learning." International Conference on Machine Learning, 2004. [18]

Grollman & Billard, "Donut as I do: Learning from failed demonstrations." International Conference on Robotics and Automation, 2011. [19]