Section: New Results
Life-Long Robot Learning and Development of Motor and Social Skills
Active Learning and Intrinsic Motivation
Active Learning of Inverse Models with Goal Babbling
Participants : Adrien Baranes, Pierre-Yves Oudeyer.
We have continued to elaborate and study our Self-Adaptive Goal Generation - Robust Intelligent Adaptive Curiosity (SAGG-RIAC) architecture, an intrinsically motivated goal exploration mechanism that allows active learning of inverse models in high-dimensional redundant robots. Based on active goal babbling, it allows a robot to efficiently and actively learn distributions of parameterized motor skills/policies that solve a corresponding distribution of parameterized tasks/goals. The architecture makes the robot actively sample novel parameterized tasks in the task space, based on a measure of competence progress, each of which triggers low-level goal-directed learning of the motor policy parameters that solve it. For both learning and generalization, the system leverages regression techniques to infer the motor policy parameters corresponding to a given novel parameterized task, based on the previously learnt correspondences between policy and task parameters.
We have conducted experiments with high-dimensional continuous sensorimotor spaces in three different robotic setups: 1) learning the inverse kinematics of a highly-redundant robotic arm, 2) learning omnidirectional locomotion with motor primitives in a quadruped robot [23], [24], 3) an arm learning to control a fishing rod with a flexible wire. We show that 1) exploration in the task space can be much faster than exploration in the actuator space for learning inverse models in redundant robots; 2) selecting goals that maximize competence progress creates developmental trajectories driving the robot to progressively focus on tasks of increasing complexity, and is statistically significantly more efficient than selecting tasks randomly, as well as more efficient than several standard active motor babbling methods; 3) this architecture allows the robot to actively discover which parts of its task space it can learn to reach and which parts it cannot.
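To make the goal-selection principle concrete, the following minimal Python sketch samples task-space regions in proportion to empirical competence progress. All names and parameters here are illustrative assumptions, not the actual SAGG-RIAC implementation:

```python
import random

class GoalSampler:
    """Toy sketch of competence-progress-based goal selection
    (names are illustrative, not from the SAGG-RIAC implementation)."""
    def __init__(self, regions, window=10, eps=0.2):
        self.window = window
        self.eps = eps                           # share of purely random goals
        self.history = {r: [] for r in regions}  # competence per region

    def record(self, region, competence):
        self.history[region].append(competence)

    def progress(self, region):
        h = self.history[region][-self.window:]
        if len(h) < 2:
            return 1.0                           # optimistic init for untried regions
        mid = len(h) // 2
        # empirical competence progress: recent mean minus older mean
        return abs(sum(h[mid:]) / (len(h) - mid) - sum(h[:mid]) / mid)

    def sample_region(self):
        if random.random() < self.eps:
            return random.choice(list(self.history))
        # greedy on progress (a softmax over progress would also be typical)
        return max(self.history, key=self.progress)
```

With `eps > 0`, a fraction of goals remains purely random, which is what lets the learner notice when a previously flat region of the task space becomes learnable.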
This work was published in the journal Robotics and Autonomous Systems [25].


Learning Exploration Strategies in Model-based Reinforcement Learning
Participants : Manuel Lopes, Todd Hester, Peter Stone, Pierre-Yves Oudeyer.
We studied how different exploration algorithms can be combined and selected at runtime. Typically, the user must hand-tune exploration parameters for each domain and/or algorithm that they are using. We introduced an algorithm called leo for learning to select among different exploration strategies online. This algorithm makes use of bandit-type algorithms to adaptively select exploration strategies based on the rewards received when following them. We show empirically that this method performs well across a set of five domains. In contrast, for a given algorithm, no set of parameters is best across all domains. Our results demonstrate that the leo algorithm successfully learns the best exploration strategies online, increasing the received reward over static parameterizations of exploration and reducing the need for hand-tuning exploration parameters [46].
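The bandit mechanism can be illustrated with a minimal UCB1-style sketch. The strategy names and the choice of UCB1 are our assumptions for illustration, not the published leo algorithm:

```python
import math

class StrategyBandit:
    """Minimal UCB1-style bandit over exploration strategies — an
    illustrative stand-in for the bandit machinery behind leo."""
    def __init__(self, strategies):
        self.counts = {s: 0 for s in strategies}
        self.values = {s: 0.0 for s in strategies}  # running mean reward
        self.t = 0

    def select(self):
        self.t += 1
        for s, n in self.counts.items():
            if n == 0:
                return s                 # try each strategy at least once
        # pick the strategy with the best mean reward plus exploration bonus
        return max(self.counts, key=lambda s: self.values[s] +
                   math.sqrt(2 * math.log(self.t) / self.counts[s]))

    def update(self, strategy, reward):
        self.counts[strategy] += 1
        n = self.counts[strategy]
        self.values[strategy] += (reward - self.values[strategy]) / n
```

After each episode (or step) the received reward is fed back with `update`, so the bandit gradually concentrates on whichever exploration strategy pays off in the current domain.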
Active Inverse Reinforcement Learning through Generalized Binary Search
Participants : Manuel Lopes, Francisco Melo.
We contributed the first aggressive active learning algorithm for non-separable multiclass classification. We generalize an existing active learning algorithm for binary classification [116] to the multiclass setting, and identify mild conditions under which the proposed method provably retains the main properties of the original algorithm, namely consistency and sample complexity. In particular, we show that, in the binary case, our method reduces to the original algorithm of [116]. We then contribute an extension of our method to multilabel settings, identify its main properties and discuss richer querying strategies. We conclude the paper with two illustrative application examples. The first features a standard text-classification problem; the second features a learning-from-demonstration setting. In both cases we demonstrate the advantage of our active sampling approach over random sampling, and discuss its performance in terms of the derived theoretical bounds.
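The contrast with random sampling rests on a generic idea that can be sketched in a few lines: query the unlabeled example the current classifier is least sure about. The maximum-entropy criterion below is a simpler stand-in for the generalized-binary-search criterion of the paper, used here only for illustration:

```python
import numpy as np

def select_query(probs):
    """Pick the index of the unlabeled example with the most uncertain
    prediction (maximum entropy over class probabilities).
    probs: array of shape (n_examples, n_classes), rows sum to 1."""
    p = np.clip(probs, 1e-12, 1.0)          # avoid log(0)
    entropy = -(p * np.log(p)).sum(axis=1)  # per-example predictive entropy
    return int(np.argmax(entropy))
```

In an active learning loop, the selected example is labeled (by an oracle, or by a demonstrator in the learning-from-demonstration setting), the classifier is retrained, and the process repeats.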
Exploration strategies in developmental robotics: a unified probabilistic framework
Participants : Clément Moulin-Frier, Pierre-Yves Oudeyer.
We present a probabilistic framework unifying two important families of exploration mechanisms recently shown to be efficient for learning complex non-linear redundant sensorimotor mappings: 1) goal babbling, and 2) active learning driven by the maximization of empirically measured learning progress. We show how this generic framework allows modeling several recent algorithmic architectures for autonomous exploration. We then propose a particular implementation using Gaussian Mixture Models, which also provides an original empirical measure of competence progress. Finally, we perform computer simulations on two simulated setups: the control of the end effector of a 7-DoF arm and the control of the formants produced by an articulatory synthesizer. We are able to reproduce previous results from [25], with the advantage of a clean and compact probabilistic framework in which to efficiently express, implement and compare various exploration strategies on developmental robotics setups.
This work was published in three international conferences [54], [56], [55].
Autonomous Reuse of Motor Exploration Trajectories
Participants : Fabien Benureau, Pierre-Yves Oudeyer.
We developed an algorithm for transferring exploration strategies between tasks that share a common motor space, in the context of lifelong autonomous learning in robotics. In such a context sampling is costly, and exploration can take a long time before finding interesting, learnable data about a task. Our algorithm shows that sampling can be significantly reduced by reusing past data from other learned tasks, with no need for external knowledge or a specific task structure. The algorithm does not transfer observations, nor does it make assumptions about how the learning is conducted. Instead, only selected motor commands are transferred between tasks, chosen autonomously according to an empirical measure of learning progress. We show that, on a wide variety of variations of a source task, such as changing the object the robot is interacting with or altering the morphology of the robot, this simple and flexible transfer method significantly increases early performance in the new task. We also investigate the limitations of this algorithm in specific situations.
This work has been published at ICDL, in Osaka [40].
Learning and optimization of motor policies
Off-Policy Actor-Critic
Participants : Thomas Degris, Martha White, Richard Sutton.
Actor–critic architectures are an interesting candidate for learning with robots: they can represent complex stochastic policies suitable for robots, they can learn online and incrementally, and their per-time-step complexity scales linearly with the number of learned weights. Moreover, interesting connections with neuroscience have been identified in the existing literature. Until recently, however, practical actor–critic methods have been restricted to the on-policy setting, in which the agent learns only about the policy it is executing.
In an off-policy setting, on the other hand, an agent learns about a policy or policies different from the one it is executing. Off-policy methods have a wider range of applications and learning possibilities. Unlike on-policy methods, off-policy methods are able to, for example, learn about an optimal policy while executing an exploratory policy, learn from demonstration, and learn multiple tasks in parallel from a single sensorimotor interaction with an environment. Because of this generality, off-policy methods are of great interest in many application domains.
We have presented the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. We have derived an incremental algorithm with linear time and space complexity that includes eligibility traces, and empirically showed better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems. This work was reproduced independently by Saminda Abeyruwan from the University of Miami.
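A heavily simplified sketch of the idea, for a linear critic and a Gibbs policy over discrete actions: the behaviour policy's samples are reweighted by the importance ratio rho = pi(a|s)/b(a|s), so both actor and critic learn about the target policy while following another. The published Off-PAC algorithm additionally uses eligibility traces and gradient-TD corrections, which this illustrative sketch omits; all names here are ours:

```python
import numpy as np

def softmax_probs(theta, phi):
    """Gibbs policy over discrete actions; theta has one column per action."""
    prefs = theta.T @ phi
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def offpac_step(theta, v, phi, phi_next, action, reward, gamma,
                behavior_probs, alpha_v=0.05, alpha_u=0.05):
    """One simplified off-policy actor-critic update (sketch only)."""
    pi = softmax_probs(theta, phi)
    rho = pi[action] / behavior_probs[action]          # importance ratio
    delta = reward + gamma * (v @ phi_next) - v @ phi  # TD error
    v_new = v + alpha_v * rho * delta * phi            # critic update
    # log-policy gradient for the Gibbs policy: phi on the taken action,
    # minus the policy-weighted features on all actions
    grad = -np.outer(phi, pi)
    grad[:, action] += phi
    theta_new = theta + alpha_u * rho * delta * grad   # actor update
    return theta_new, v_new
```

Run on a toy one-state problem with a uniform behaviour policy, the target policy drifts toward the rewarding action even though the agent never executes it preferentially.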
Auto-Actor Critic
Participant : Thomas Degris.
As mentioned above, actor–critic architectures are an interesting candidate for robots learning new skills in unknown and changing environments. However, existing actor–critic architectures, like many machine learning algorithms, require manual tuning of different parameters to work in the real world. To be able to systematize and scale up skill learning on a robot, learning algorithms need to be robust to their parameters. The Flowers team has been working on making existing actor–critic algorithms more robust so as to make them suitable for a robotic setting. Results on standard reinforcement learning benchmarks are encouraging. This work will be submitted to an international conference related to reinforcement learning. Interestingly, the methods developed in this work also offer a new formalism for thinking about different existing themes of Flowers research, such as curiosity and maturational constraints.
Deterministic Policy Gradient Algorithms
Thomas Degris and colleagues from UCL and DeepMind have considered deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces [58].
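The appeal of the deterministic gradient is easy to see in a toy case where Q and its action-gradient are known analytically: the policy update is just the chain rule through the deterministic action. This is an illustrative sketch under those assumptions, not the published algorithm:

```python
import numpy as np

def dpg_step(theta, s, dQ_da, alpha=0.1):
    """One deterministic policy gradient step for a linear policy
    mu_theta(s) = theta @ s. The gradient of the objective is
    grad_theta mu(s) * dQ/da evaluated at a = mu_theta(s); for the
    linear policy, grad_theta mu(s) is simply the state features s."""
    return theta + alpha * dQ_da * s
```

In practice dQ/da comes from a learned critic (as in the off-policy actor-critic variant mentioned above); here a known quadratic Q with optimum a* = 2s lets convergence be checked directly.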
Relationship between Black-Box Optimization and Reinforcement Learning
Participant : Freek Stulp.
Policy improvement methods seek to optimize the parameters of a policy with respect to a utility function. There are two main approaches to performing this optimization: reinforcement learning (RL) and black-box optimization (BBO). In recent years, benchmark comparisons between RL and BBO have been made, and there have been several attempts to specify which approach works best for which types of problem classes.
We have made several contributions to this line of research by: 1) defining four algorithmic properties that further clarify the relationship between RL and BBO; 2) showing how the derivation of ever more powerful RL algorithms displays a trend towards BBO; 3) continuing this trend by applying two modifications to the state-of-the-art $PI^2$ algorithm, which yields an algorithm we denote $PI^{\text{BB}}$. We show that $PI^{\text{BB}}$ is a BBO algorithm and, more specifically, that it is a special case of the state-of-the-art CMA-ES algorithm; 4) demonstrating that the simpler $PI^{\text{BB}}$ achieves similar or better performance than $PI^2$ on several evaluation tasks; 5) analyzing why BBO outperforms RL on these tasks. These contributions have been published on HAL [129], and in Paladyn: Journal of Behavioral Robotics [36].
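The black-box character of such an update can be sketched in a few lines: perturb the policy parameters, evaluate the black-box cost, and move to the exponentiated-cost-weighted average of the samples. This is only an illustrative reconstruction; the constants and normalization details are assumptions and differ from the published $PI^{\text{BB}}$:

```python
import numpy as np

def pi_bb_update(theta, cost_fn, sigma=0.5, K=10, h=10.0, rng=None):
    """One PI^BB-style update (illustrative sketch): sample K Gaussian
    perturbations of theta, evaluate the black-box cost of each, and
    return the cost-weighted average of the samples."""
    rng = np.random.default_rng() if rng is None else rng
    samples = theta + sigma * rng.standard_normal((K, theta.size))
    costs = np.array([cost_fn(x) for x in samples])
    # normalize costs to [0, 1], then exponentiate: low cost -> high weight
    span = max(costs.max() - costs.min(), 1e-12)
    weights = np.exp(-h * (costs - costs.min()) / span)
    weights /= weights.sum()
    return weights @ samples
```

Note that the update only ever queries the cost function on whole parameter vectors, never gradients or per-time-step rewards, which is exactly what makes it a BBO method.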
Probabilistic optimal control: a quasi-metric approach
Participants : Steve N'Guyen, Clément Moulin-Frier, Jacques Droulez.
During his previous postdoc at the Laboratoire de Physiologie de la Perception et de l'Action (Collège de France, Paris), Clément Moulin-Frier joined Jacques Droulez and Steve N'Guyen to work on an alternative and original approach to probabilistic optimal control called the quasi-metric approach. A journal paper was published in PLoS ONE in December 2013 [31], in which the authors propose a new approach to decision making under uncertainty.
Social learning and intrinsic motivation
Socially Guided Intrinsic Motivation for Skill Learning
Participants : Sao Mai Nguyen, Pierre-Yves Oudeyer.
We have explored how social interaction can bootstrap motor learning in a robot. We first studied how simple demonstrations by teachers could have a bootstrapping effect on autonomous exploration with intrinsic motivation, by building a learner that uses both imitation learning and the SAGG-RIAC algorithm [25]; this resulted in the SGIM-D (Socially Guided Intrinsic Motivation by Demonstration) algorithm [32], [24], [114], [111]. We then investigated the reasons for this bootstrapping effect [113], showing that demonstrations by teachers can both open up more tasks to be explored and favor the use of actions that generalize more easily. This analysis generalizes to all algorithms using social guidance and goal-oriented exploration. We then proposed to build a strategic learner that can learn multiple tasks with multiple strategies. An overview and theoretical study of multi-task, multi-strategy Strategic Learning is presented in [99]. We also sought to build a learning algorithm for more natural interaction with human users. We first designed the SGIM-IM algorithm so that it can determine by itself when to ask the teacher for help, while exploring autonomously as long as possible so as to use as little of the teacher's time as possible [112]. After tackling the problem of how and when to learn, we also investigated an active learner that can determine whom to ask for help: in the case of two available teachers, SGIM-IM can determine which strategy to adopt between autonomous exploration and learning by demonstration, and which teacher provides the most learning progress for the learner [115], and ask that teacher for help.
While the above results have been obtained in simulation environments, namely a simple deterministic air hockey game (fig. 25) and a stochastic fishing experiment with a real-time physical simulator (fig. 26), we are now building the experimental setup of the fishing experiment in order to carry out the experiments with naive users.
Adaptive task execution for implicit human-robot coordination
Participants : Ievgen Perederieiev, Manuel Lopes, Freek Stulp.
We began a project whose goal is to study how computational models of multi-agent systems can be applied in situations where one agent is a human. We aim at applications where robots collaborate with humans to achieve complex tasks.
A very important capability for efficient collaborative work is mutual agreement on a task and the ability to predict the behavior of others. We address this aspect by studying methods that increase the predictability of the robot's actions. An efficient motor execution is then one that not only optimizes speed and minimizes energy, but also improves the reliability of the team behavior. We are studying policy gradient methods and working on policy improvement algorithms ($PI^2$, CEM and CMA-ES). A feasibility study will consider a simple task between a robot and a person where the goal is to coordinate the way a set of three colored buttons is pressed.
Unsupervised learning of motor primitives
Clustering activities
Participants : Manuel Lopes, Luis Montesano, Javier Almingol.
Learning behaviors from data has applications in surveillance and monitoring systems, virtual agents and robotics, among others. In our approach, we assume that, in a given unlabeled dataset of multiple behaviors, it is possible to find a latent representation in a controller space that allows the different behaviors to be generated. A natural way to group these behaviors is therefore to search for a common control system that generates them accurately.
Clustering behaviors in a latent controller space raises two major challenges. First, it is necessary to select the control space that generates the behaviors. This space is parameterized by a set of features that changes for different behaviors. Usually, each controller minimizes a cost function with respect to several task features; the latent representation is in turn defined by the selected features and their corresponding weights. Second, an unknown number of such controllers is required to generate the different behaviors, and the grouping must be based on the ability of the controllers to generate the demonstrations using a compact set of controllers.
We propose a Dirichlet Process based algorithm to cluster behaviors in a latent controller space which encodes the dynamical system generating the observed trajectories. The controller uses a potential function generated as a linear combination of features. To enforce sparsity and automatically select features for each cluster independently, we impose a conditional Laplace prior over the controller parameters. Based on this model, we derive a sparse Dirichlet Process Mixture Model (DPMM) algorithm that estimates the number of behaviors and a sparse latent controller for each of them, based on a large set of features [38].
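The Dirichlet Process side of the model can be illustrated by its prior over partitions, the Chinese Restaurant Process, which is what lets the number of controllers be inferred rather than fixed in advance. A minimal sketch, for illustration only:

```python
import random

def crp_partition(n, alpha=1.0, rng=None):
    """Draw a random partition of n items from a Chinese Restaurant
    Process, the prior that DPMM-style models place on the (unknown)
    number of clusters — here, of behavior-generating controllers."""
    rng = rng or random.Random(0)
    assignments, counts = [], []
    for _ in range(n):
        # join an existing cluster with probability proportional to its
        # size, or open a new one with probability proportional to alpha
        k = rng.choices(range(len(counts) + 1), weights=counts + [alpha])[0]
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments
```

In the full model this prior is combined with a likelihood measuring how well each candidate controller reproduces the demonstrated trajectories, and the Laplace prior sparsifies each controller's feature weights.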

Learning the Combinatorial Structure of Demonstrated Behaviors with Inverse Reinforcement Control
Participants : Olivier Mangin, Pierre-Yves Oudeyer.
We have elaborated and illustrated a novel approach to learning motor skills from demonstration. This approach combines ideas from inverse reinforcement learning, in which actions are assumed to solve a task, and dictionary learning. In this work we introduced a new algorithm that is able to learn behaviors by assuming that the observed complex motions can be represented in a smaller dictionary of concurrent tasks. We developed an optimization formalism and showed how we can simultaneously learn the dictionary and the mixture coefficients that represent each demonstration. We presented results on a toy problem and showed that our algorithm finds an efficient set of primitive tasks where naive approaches, such as PCA or a dictionary built from random examples, fail to achieve the same tasks. These results, presented in [60], extend those from [103].
Interaction of Maturation and Intrinsic Motivation for Developmental Learning of Motor Skills in Robots
Participants : Adrien Baranes, Pierre-Yves Oudeyer.
We have introduced an algorithmic architecture that adaptively couples models of intrinsic motivation and physiological maturation for autonomous robot learning of new motor skills. Intrinsic motivation, also called curiosity-driven learning, is a mechanism for driving exploration in active learning. Maturation denotes here mechanisms that control the evolution of certain properties of the body during development, such as the number and the spatio-temporal resolution of available sensorimotor channels. We argue that it is useful to introduce and conceptualize complex bidirectional interactions between these two mechanisms, allowing active control of the growth of complexity in motor development in order to guide exploration and learning efficiently. We introduced a model of maturational processes, taking some functional inspiration from the myelination process in humans, and showed how it can be coupled in an original and adaptive manner with the intrinsic motivation architecture SAGG-RIAC (Self-Adaptive Goal Generation - Robust Intelligent Adaptive Curiosity), creating a new system called McSAGG-RIAC. We then conducted experiments to evaluate both qualitative and quantitative properties of these systems when applied to learning to control a high-dimensional robotic arm, as well as to learning omnidirectional locomotion in a quadruped robot equipped with motor synergies. We showed that combining active and maturational learning can yield gains of orders of magnitude in learning speed, as well as better generalization performance. A journal article is currently being written.
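The coupling between maturation and intrinsic motivation can be sketched as a gate: the usable range of each sensorimotor channel starts narrow and is widened only while the learner keeps making competence progress. All names and thresholds below are illustrative assumptions, not the McSAGG-RIAC implementation:

```python
class MaturationalConstraint:
    """Sketch of adaptive maturation: each sensorimotor channel starts
    with a narrow usable range that is progressively released, gated
    here by the learner's measured competence progress."""
    def __init__(self, n_channels, step=0.1):
        self.unlocked = [step] * n_channels  # fraction of each range available
        self.step = step

    def update(self, competence_progress, threshold=0.05):
        # release more of the body's parameter ranges only while the
        # learner still shows measurable competence progress
        if competence_progress > threshold:
            self.unlocked = [min(1.0, u + self.step) for u in self.unlocked]
        return self.unlocked
```

The intrinsic motivation system then samples goals and motor commands only within the currently unlocked fractions of each range, so exploration complexity grows at the pace of learning rather than all at once.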