Robot learning by demonstration/CurrentWork


    This section is a brief look at current work in RLfD. For each program of research, we provide a reference and succinctly describe the choices for \(M, \mathcal{Z}, \phi_{\mathsf{x}}, \phi_{\mathsf{y}}\) and \(\mathcal{L}\ .\) We further provide some notes on the model used for \(\Omega^\mathsf{x}\) and the method of inferring it. The list is by no means complete, and we invite other researchers to submit synopses of their own (or others') work.

    • Dogged Learning - [1]
      • \(M\ :\) Minimizing mean-squared error on actions (equivalent to one-step trajectories).
      • \(\mathcal{Z}\ :\) Robot motor pose, color-blob vision, walk direction and speed, predefined motions
      • \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Immersive teleoperation
      • \(\mathcal{L}\ :\) Interactive learning, where the human observes task performance and may provide additional demonstration as desired.
      • \(\Omega^\mathsf{x}\ :\) The robot controller is viewed as a mapping from robot states to actions (or desired next states). Incremental, sparse regression techniques (LWPR - Locally Weighted Projection Regression, and SOGP - Sparse Online Gaussian Processes) are used to estimate a single-valued map directly, while later work applies an infinite mixture of Gaussian process experts (ROGER - Realtime Overlapping Gaussian Expert Regression) to learn multi-valued maps. A minimal sketch of such a direct state-to-action mapping follows this entry.
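
    The sketch below, in the spirit of the Dogged Learning setup, learns a direct state-to-action map incrementally from teleoperated demonstrations. Plain kernel (Nadaraya-Watson) regression is substituted for the LWPR/SOGP/ROGER regressors used in the actual work, and the state and action dimensions are illustrative assumptions only.

      # Minimal sketch: an incrementally trained, direct state-to-action policy.
      # Kernel regression stands in for the sparse regressors (LWPR/SOGP) of the paper.
      import numpy as np

      class IncrementalPolicy:
          def __init__(self, bandwidth=0.5):
              self.X = []            # demonstrated states (e.g. motor pose + vision features)
              self.Y = []            # demonstrated actions (e.g. walk direction and speed)
              self.h = bandwidth     # kernel width

          def add_demonstration(self, state, action):
              # Interactive learning: the teacher may add demonstrations at any time.
              self.X.append(np.asarray(state, dtype=float))
              self.Y.append(np.asarray(action, dtype=float))

          def act(self, state):
              # Kernel-weighted average of the demonstrated actions near the current state.
              X, Y = np.stack(self.X), np.stack(self.Y)
              d2 = np.sum((X - np.asarray(state, dtype=float)) ** 2, axis=1)
              w = np.exp(-0.5 * d2 / self.h ** 2)
              return (w[:, None] * Y).sum(axis=0) / (w.sum() + 1e-12)

      policy = IncrementalPolicy()
      policy.add_demonstration([0.0, 0.1], [1.0, 0.0])   # (state, action) pairs from teleoperation
      policy.add_demonstration([0.5, 0.2], [0.5, 0.5])
      print(policy.act([0.3, 0.15]))                     # action for a novel state

    The teacher observes the resulting behaviour and simply adds further demonstrations wherever performance is unsatisfactory, which is the interactive loop described under \(\mathcal{L}\) above.
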
    • Inverse Reinforcement Learning - [2]
      • \(M\ :\) Maximizing a reward function linear in state features, equivalent to matching feature counts.
      • \(\mathcal{Z}\ :\) Robot position, orientation, velocity and angular velocity.
      • \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Teleoperation
      • \(\mathcal{L}\ :\) Self-improvement / Interactive: The learned system is incrementally improved to convergence, and additional demonstrations are provided if performance is still unsatisfactory.
      • \(\Omega^\mathsf{x}\ :\) The controller is obtained by solving a continuous Markov Decision Process (MDP) with differential dynamic programming (DDP). Weights on the features (squared state error, squared state, squared state velocity, squared integral of error) are adapted incrementally to equalize the expected feature values of the demonstrated and learned trajectories; the note following this entry spells out why matching feature counts matches the reward.
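
    As a brief justification of the feature-matching view above (written in generic notation rather than that of [2]): if the reward is linear in state features, \(R(s) = w^\top f(s)\ ,\) then the expected return of any policy \(\pi\) is \(E[\sum_t \gamma^t R(s_t) \mid \pi] = w^\top \mu(\pi)\ ,\) where \(\mu(\pi) = E[\sum_t \gamma^t f(s_t) \mid \pi]\) denotes the expected discounted feature counts. Two policies with equal feature counts therefore obtain equal expected reward for every weight vector \(w\ ,\) so adapting the weights until the demonstrated and learned trajectories have matching expected feature values suffices to match the demonstrator's performance.
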
    • Advice Operators - [3]
      • \(M\ :\) Task performance metrics and teacher evaluation
      • \(\mathcal{Z}\ :\) Features computed from robot position and orientation, goal position and orientation, robot translational and rotational speeds.
      • \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Teleoperation
      • \(\mathcal{L}\ :\) Incremental / Interactive: The teacher observes and corrects multiple learner executions, a batch update is performed, and the process repeats.
      • \(\Omega^\mathsf{x}\ :\) The controller is a continuous mapping from computed state features to robot actions (wheel speeds), learned by Locally Weighted Learning. Teacher corrections generate new data (without additional teleoperation), which is added to the demonstration set. Because lazy learning is used, the policy is re-derived only locally, in the regions visited during new learner executions. A minimal sketch of such a locally weighted policy with correction-generated data follows this entry.
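
    The sketch below illustrates this entry under simplifying assumptions: locally weighted linear regression maps two illustrative state features to two wheel speeds, and the example correction (scaling the predicted speeds) is a hypothetical operator, not one taken from the cited work.

      # Minimal sketch: a locally weighted policy whose dataset grows through
      # teacher corrections of learner executions (no re-teleoperation).
      import numpy as np

      def lwr_predict(X, Y, query, h=0.3):
          # Locally weighted linear regression: weight each demonstration by its
          # distance to the query and solve the resulting weighted least squares.
          w = np.exp(-0.5 * np.sum((X - query) ** 2, axis=1) / h ** 2)
          Xa = np.hstack([X, np.ones((len(X), 1))])      # affine features
          W = np.diag(w)
          beta, *_ = np.linalg.lstsq(Xa.T @ W @ Xa, Xa.T @ W @ Y, rcond=None)
          return np.append(query, 1.0) @ beta

      # Demonstration set: state features -> wheel speeds (left, right).
      X = np.array([[0.0, 0.2], [0.5, 0.1], [1.0, 0.4]])
      Y = np.array([[0.3, 0.3], [0.4, 0.2], [0.5, 0.5]])

      # The learner executes, the teacher applies a correction to the recorded
      # points (here: a hypothetical "sharpen the turn" scaling), and the
      # corrected points are appended to the demonstration set.
      executed_states = np.array([[0.6, 0.15]])
      executed_actions = np.array([lwr_predict(X, Y, s) for s in executed_states])
      corrected_actions = executed_actions * np.array([0.8, 1.2])
      X = np.vstack([X, executed_states])
      Y = np.vstack([Y, corrected_actions])
      print(lwr_predict(X, Y, np.array([0.6, 0.15])))    # policy re-derived locally
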
    • PoWER - [4]
      • \(M\ :\) External reward
      • \(\mathcal{Z}\ :\) Robot joint angles and velocities, Cartesian coordinates of object
      • \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Kinesthetic demonstration
      • \(\mathcal{L}\ :\) Self-improvement starting from a single demonstration
      • \(\Omega^\mathsf{x}\ :\) Each joint of the robot has a parameterized motor primitive. Rollouts are generated with the current parameters and weighted by the reward they collect; the parameters are then updated from these weighted rollouts so as to maximize the expected reward of the policy. A minimal sketch of this update follows this entry.
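
    The sketch below shows a PoWER-style reward-weighted update in its simplest form. The motor primitives are reduced to a plain parameter vector and the episodic reward is a toy surrogate; both are assumptions for illustration and stand in for executing the primitives on the robot.

      # Minimal sketch: reward-weighted (EM-style) parameter updates, as in PoWER.
      import numpy as np

      rng = np.random.default_rng(0)
      theta = np.zeros(5)                              # parameters initialized from a single demonstration
      target = np.array([0.5, -0.2, 0.1, 0.0, 0.3])    # unknown "good" parameters (toy stand-in)

      def rollout_reward(params):
          # Placeholder episodic reward; on the robot this comes from executing
          # the motor primitives and scoring the resulting motion.
          return np.exp(-np.sum((params - target) ** 2))

      for iteration in range(200):
          eps = rng.normal(scale=0.1, size=(20, theta.size))            # exploration noise
          rewards = np.array([rollout_reward(theta + e) for e in eps])
          # High-reward rollouts pull the parameters toward themselves:
          theta = theta + (rewards[:, None] * eps).sum(axis=0) / (rewards.sum() + 1e-12)

      print(theta)    # ends up near the high-reward region of parameter space
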
    • Stable Estimator of Dynamical Systems (SEDS) - [5]
      • \(M\ :\) Global Asymptotic Stability, Likelihood or Mean Square Error
      • \(\mathcal{Z}\ :\) 3D position / orientation / joint angles
      • \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Teleoperation / Kinesthetic / Handwriting
      • \(\mathcal{L}\ :\) Batch
      • \(\Omega^\mathsf{x}\ :\) A Gaussian mixture model (GMM) is fit by gradient ascent, optimizing the cost function (likelihood or mean-squared error) under global stability constraints. A minimal sketch of the resulting controller form follows this entry.
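
    The sketch below shows only the form of a SEDS-like controller: a mixture of linear dynamics \(\dot{x} = \sum_k h_k(x)\,(A_k x + b_k)\) whose parameters satisfy the stability constraints (each \(A_k\) with negative-definite symmetric part and \(b_k = -A_k x^*\)), so every trajectory converges to the target \(x^*\ .\) The constrained fitting step is omitted and the parameters are hand-picked assumptions.

      # Minimal sketch: a mixture-of-linear-dynamics controller with hand-picked
      # parameters that satisfy global asymptotic stability constraints.
      import numpy as np

      target = np.array([0.0, 0.0])                       # attractor x*
      A = [np.array([[-1.0, 0.3], [-0.3, -1.0]]),         # negative-definite symmetric parts
           np.array([[-0.5, 0.0], [0.0, -2.0]])]
      b = [-Ak @ target for Ak in A]                      # b_k = -A_k x*, so x* is a fixed point
      mu = [np.array([1.0, 1.0]), np.array([-1.0, 0.5])]  # component centres
      sigma = 1.0

      def velocity(x):
          # Mixing weights h_k(x) from isotropic Gaussians (stand-in for the GMM posteriors).
          w = np.array([np.exp(-0.5 * np.sum((x - m) ** 2) / sigma ** 2) for m in mu])
          h = w / w.sum()
          return sum(hk * (Ak @ x + bk) for hk, Ak, bk in zip(h, A, b))

      # Integrating the learned dynamics from an arbitrary start converges to the target.
      x = np.array([2.0, -1.5])
      for _ in range(2000):
          x = x + 0.01 * velocity(x)
      print(x)    # close to the target attractor
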
    • Motor skill coordination through imitation and reinforcement learning - [6][7]
      • \(M\ :\) Similarity estimation based on the residuals of weighted least-squares regression, combined with a manually defined episodic reward function.
      • \(\mathcal{Z}\ :\) Cartesian position and orientation of the end-effector.
      • \(\phi_{\mathsf{x}}, \phi_{\mathsf{y}}\ :\) Kinesthetic teaching or motion capture.
      • \(\mathcal{L}\ :\) Batch or incremental.
      • \(\Omega^\mathsf{x}\ :\) Mixture of proportional-derivative systems described by a set of virtual attractors and their respective impedance parameters (full stiffness matrices). The policy is initialized by imitation and refined by EM-based reinforcement learning, guided by a manually defined reward function and the residuals of weighted least-squares estimation. A minimal sketch of the controller form follows this entry.
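
    The sketch below rolls out a controller of the form just described: a set of proportional-derivative systems, each with its own virtual attractor and stiffness matrix, blended over time. The attractors, gains and Gaussian time-based mixing weights are illustrative assumptions, and the imitation and EM-based refinement stages are omitted.

      # Minimal sketch: blending proportional-derivative systems, each defined by a
      # virtual attractor and a stiffness matrix, with time-based mixing weights.
      import numpy as np

      attractors = [np.array([1.0, 0.0]), np.array([1.0, 1.0])]     # virtual attractors
      stiffness  = [np.diag([40.0, 10.0]), np.diag([10.0, 40.0])]   # full matrices are allowed
      damping    = np.diag([8.0, 8.0])
      centres, width = [0.3, 0.7], 0.15                             # activations in normalized time

      def acceleration(x, xd, t):
          w = np.array([np.exp(-0.5 * ((t - c) / width) ** 2) for c in centres])
          h = w / w.sum()                                           # mixing weights
          acc = -damping @ xd
          for hi, Ki, ai in zip(h, stiffness, attractors):
              acc = acc + hi * (Ki @ (ai - x))                      # each PD pulls toward its attractor
          return acc

      # Roll the controller out from rest; the end-effector is drawn toward each
      # attractor in turn as the corresponding PD system becomes dominant.
      x, xd, dt = np.zeros(2), np.zeros(2), 0.001
      for step in range(1000):
          xd = xd + dt * acceleration(x, xd, step * dt)
          x = x + dt * xd
      print(x)
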