Keywords
 A3. Data and knowledge
 A3.1. Data
 A3.1.1. Modeling, representation
 A3.1.4. Uncertain data
 A3.1.11. Structured data
 A3.3. Data and knowledge analysis
 A3.3.1. Online analytical processing
 A3.3.2. Data mining
 A3.3.3. Big data analysis
 A3.4. Machine learning and statistics
 A3.4.1. Supervised learning
 A3.4.2. Unsupervised learning
 A3.4.3. Reinforcement learning
 A3.4.4. Optimization and learning
 A3.4.5. Bayesian methods
 A3.4.6. Neural networks
 A3.4.8. Deep learning
 A3.5.2. Recommendation systems
 A5.1. Human-Computer Interaction
 A5.10.7. Learning
 A8.6. Information theory
 A8.11. Game Theory
 A9. Artificial intelligence
 A9.2. Machine learning
 A9.3. Signal analysis
 A9.4. Natural language processing
 A9.7. AI algorithmics
 B2. Health
 B3.1. Sustainable development
 B3.5. Agronomy
 B4.4. Energy delivery
 B4.4.1. Smart grids
 B5.8. Learning and training
 B7.2.1. Smart vehicles
 B9.1.1. E-learning, MOOC
 B9.5. Sciences
 B9.5.6. Data science
1 Team members, visitors, external collaborators
Research Scientists
 Debabrota Basu [Inria, from Nov 2020, Starting Faculty Position]
 Rémy Degenne [Inria, from Nov 2020, Starting Faculty Position]
 Émilie Kaufmann [CNRS, Researcher, HDR]
 Odalric-Ambrym Maillard [Inria, Researcher, HDR]
 Jill-Jênn Vie [Inria, Researcher]
Faculty Member
 Philippe Preux [Team leader, Université de Lille, Professor, HDR]
Post-Doctoral Fellows
 Sein Minn [Inria, from Aug 2020]
 Mohit Mittal [Inria, from Nov 2020]
 Pierre Ménard [Inria, until Oct 2020]
PhD Students
 Dorian Baudry [CNRS]
 Omar Darwiche Domingues [Inria]
 Johan Ferret [Google]
 Yannis Flet-Berliac [Université de Lille]
 Guillaume Gautier [CNRS, until Oct 2020]
 Nathan Grinsztajn [École polytechnique]
 Leonard Hussenot-Desenonges [Google]
 Édouard Leurent [Renault, until Oct 2020]
 Reda Ouhamma [École polytechnique]
 Pierre Perrault [Inria, until Nov 2020]
 Sarah Perrin [Université de Lille]
 Fabien Pesquerel [École Normale Supérieure de Paris, from Nov 2020]
 Clémence Réda [Université de Paris]
 Hassan Saber [Inria, until Aug 2020]
 Patrick Saux [Inria, from Nov 2020]
 Mathieu Seurin [Université de Lille]
 Julien Seznec [Le Livre Scolaire, until Dec 2020]
 Xuedong Shang [Université de Lille]
 Florian Strub [Deepmind, until Jan 2020]
 Jean Tarbouriech [Facebook]
Technical Staff
 Clémence Léguillette [Inria, Engineer, from Oct 2020]
 Vianney Taquet [Inria, Engineer, from Sep 2020]
 Julien Teigny [Inria, Engineer, from Sep 2020]
 Franck Valentini [Inria, Engineer, until Oct 2020]
Administrative Assistant
 Amélie Supervielle [Inria]
External Collaborator
 Romain Gautron [Centre de coopération internationale en recherche agronomique]
2 Overall objectives
Scool is a machine learning (ML) research group. Our research focuses on the problem of sequential decision making under uncertainty (SDMUP). In particular, we consider bandit problems and the reinforcement learning (RL) problem. In a simplified way, RL considers the problem of learning an optimal policy in a Markov Decision Problem (MDP); when the set of states collapses to a single state, this is known as the bandit problem, which focuses on the exploration/exploitation problem.
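To make the exploration/exploitation problem concrete, here is a minimal sketch of a single-state (bandit) problem solved with the classical UCB1 index rule. It is an illustration only and not one of the team's algorithms; the arm means, the horizon and the function name `ucb1` are assumptions chosen for the example.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit and return the pull count of each arm."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialise the estimates
        else:
            # empirical mean (exploitation) plus confidence bonus (exploration)
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Hypothetical instance: three Bernoulli arms, the last one being the best.
counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
print(counts)
```

Most pulls should concentrate on the best arm, while the other arms keep being sampled occasionally: this residual sampling is the price of exploration.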
Bandit and RL problems are interesting to study on their own. Both types of problems share a number of fundamental issues (convergence analysis, sample complexity, representation, safety, etc.), and both have real applications, different though closely related. The two are also intimately connected: while solving an RL problem, one faces an exploration/exploitation problem and has to solve a bandit problem in each state.
In our work, we also consider settings going beyond the Markovian assumption, in particular non-stationary settings, which represent a challenge common to bandits and RL. We also consider online learning, where the goal is to learn a model from a stream of data, such as learning a compressed representation of a stream of data (each data item may be a scalar, a vector, or even a more complex data structure such as a tree or a graph). A distinctive aspect of the SDMUP with regard to the rest of the field of ML is that the learning problem takes place within a closed-loop interaction between a learning agent and its environment. This feedback loop makes our field of research very different from the two other subfields of ML, supervised and unsupervised learning, even when they are defined in an incremental setting. Hence, SDMUP combines ML with control: the learner is not passive; it acts on its environment and learns from the consequences of these interactions, and can therefore act in order to obtain information from the environment.
We wish to keep studying applied questions and developing theory, so as to come up with sound approaches to the practical resolution of SDMUP tasks and to guide their resolution. Non-stationary environments are a particularly interesting setting; we are studying this setting and developing new tools to approach it in a sound way, in order to obtain algorithms that detect environment changes as fast and as reliably as possible, adapt to them, and come with provable guarantees on their performance, measured by the regret for instance. We mostly consider non-parametric statistical models, that is, models in which the number of parameters is not fixed (a parameter may be of any type: a scalar, a vector, a function, etc.), so that the model can adapt along learning and to its changing environment; this also lets the algorithm learn a representation that fits its environment.
3 Research program
Our research mostly deals with bandit problems and reinforcement learning problems. We investigate each thread separately and also in combination, since the management of the exploration/exploitation trade-off is a major issue in reinforcement learning.
On bandit problems, we focus on:
 structured bandits
 bandits for planning (in particular for MCTS)
 non-stationary bandits
Regarding reinforcement learning, we focus on:
 modeling issues, and dealing with the discrepancy between the model and the task to solve
 learning and using the structure of a Markov decision problem, and of the learned policy
 generalization in reinforcement learning
 RL in non-stationary environments
Beyond these objectives, we put a particular emphasis on the study of non-stationary environments. Another area of great concern is the combination of symbolic methods with numerical methods, be it to provide knowledge to the learning algorithm to improve its learning curve, or to better understand what the algorithm has learned and explain its behavior, or to rely on causality rather than on mere correlation.
We also put a particular emphasis on real applications and how to deal with their constraints: lack of a simulator, difficulty of obtaining a realistic model of the problem, small amount of data, dealing with risks, availability of expert knowledge on the task.
4 Application domains
Scool has 3 main topics of application:
 health
 sustainable development
 e-learning
In each of these domains, we put forward the investigation and the application of the idea of sequential decision making under uncertainty. Though supervised and unsupervised learning have already been extensively studied and applied in these fields, sequential decision making remains far less studied; bandits have already been used in many e-commerce applications (e.g. for computational advertising and recommendation systems). However, in applications where human beings may be severely impacted, bandits and reinforcement learning have not yet been much studied; moreover, these applications come with a scarcity of data and the unavailability of a simulator, which rules out relying on heavy computational simulations to come up with safe automatic decision making.
In 2020, in health, we investigate patient follow-up with Prof. F. Pattou's research group (CHU Lille, INSERM, Université de Lille) in project B4H. This effort comes along with investigating how we may use medical data available locally at CHU Lille, and also the national social security data. We also investigate drug repurposing with Prof. A. Delahaye-Duriez (Inserm, Université de Paris) in project Repos. We also study catheter control by way of reinforcement learning with the Inria Lille group Defrost and the company Robocath (Rouen). In 2019-2020, we also studied more traditional machine learning aspects, with the investigation of deep learning technology in radiology, with Prof. A. Cotten (CHU and Université de Lille), in project RAID. Finally, in the context of the sudden Covid-19 pandemic, we volunteered to get involved in a series of works with AP-HP; although the immediate needs were not research questions but rather the application of our skills in data science and machine learning, this activity led to more research-oriented questions later in Fall 2020.
Regarding sustainable development, we have a set of projects and collaborations regarding agriculture and gardening. With Cirad and CGIAR, we investigate how one may recommend agricultural practices to farmers in developing countries. Through an associate team with Bihar Agriculture University (India), we investigate data collection. Inria exploratory action SR4SG concerns recommender systems at the level of individual gardens. In marine biology, we focus on the warm water footprints, called wakes, caused by movements of large marine vessels. These wakes disrupt the natural flora and fauna around the ship routes and ports. We are developing machine learning techniques to detect the wakes using data from acoustic Doppler profilers, and to correlate them with the structure and motion of marine vessels. We also worked on the control of smart grids, with LPCIM at École Polytechnique, and in collaboration with Total.
Regarding e-learning, the collaboration with Le Livre Scolaire has led to the defense of J. Seznec's PhD (CIFRE grant). We have continued our work with Pix by organizing an open lab in January 2020, attracting 20 people from various fields: psychometricians, data scientists, statisticians at the French Ministry of Education. We also got involved in advising a start-up under creation related to e-learning, this creation being supported by the Inria Startup Studio.
Two important aspects are shared by these various application fields. First, we consider that data collection is an active task: we do not passively observe and record data; we design methods and algorithms to search for useful data. This idea is exploited in most of these application-oriented works. Second, many of these projects include a careful management of risks for human beings. We have to make decisions while taking care of their consequences on human beings, on ecosystems, and on life more generally. This goes along with the work we did on the autonomous control of vehicles (with Renault, in collaboration with the Inria Lille team Valse, through a CIFRE grant), where safety is obviously a very important issue too.
5 Social and environmental responsibility
Sustainable development is a major field of research and application of Scool. We investigate what machine learning can bring to sustainable development, identifying locks and studying how to overcome them.
Let us mention here:
 wake detection in marine sciences,
 sustainable agriculture in developing countries,
 sustainable gardening,
 control of smart grids.
More details can be found in section 4.
6 Highlights of the year
 On Oct. 31, 2020, SequeL ended, being replaced by the brand new joint project-team Scool.
 É. Leurent gave an oral presentation at NeurIPS (1% acceptance rate this year). This magnificently concludes his PhD work, based on a collaboration between SequeL/Scool and Valse, under the co-supervision of O.-A. Maillard and D. Efimov.
 Our team composed of J.-J. Vie, Sein Minn (Scool), Mehdi Douch, Yassine Esmili (Axiome, Inria Startup Studio) ranked 5th out of 23 submissions at the NeurIPS Education Challenge 2020. J.-J. Vie (Scool) and Matthieu Doutreligne (Parietal, Inria Saclay) ranked 2nd out of 16 submissions at the NeurIPS Healthcare Hide-and-Seek Challenge 2020.
 The InriaCovid ScikitEDS project in partnership with Inria Parietal was presented to President Emmanuel Macron on Dec. 4, 2020.
7 New software and platforms
7.1 New software
7.1.1 highway-env
 Name: An environment for autonomous driving decision-making
 Keywords: Generic modeling environment, Simulation, Autonomous Cars, Artificial intelligence
 Functional Description: The environment is composed of several variants, each of which corresponds to a driving scene: highway, roundabout, intersection, merge, parking, etc. The road network is described by a graph, and is then populated with simulated vehicles. Vehicle kinematics follow a simple bicycle model, and their behavior is determined by models derived from the road traffic simulation literature. The ego-vehicle has access to a description of the scene through several types of observations, and its behavior is controlled through an action space, either discrete (change of lanes, of cruising speed) or continuous (accelerator pedal, steering wheel angle). The objective function to maximize is also described by the environment and may vary depending on the task to be solved. The interface of the library is inherited from the standard defined by OpenAI Gym, consisting of four main methods: gym.make(id), env.step(action), env.reset(), and env.render().

 URL: https://github.com/eleurent/highway-env
 Author: Edouard Leurent
 Contact: Edouard Leurent
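The interaction loop implied by the Gym-style interface can be sketched without installing any library. The class below is a hypothetical toy stand-in, not part of the actual package: it only mimics the shape of `reset()` and `step(action)` (observation, reward, termination flag, info dictionary), and its two-lane dynamics and rewards are assumptions made up for the example. `gym.make(id)` and `env.render()` are deliberately left out.

```python
class ToyLaneEnv:
    """Hypothetical two-lane toy environment exposing reset() and step()."""

    ACTIONS = {0: "keep_lane", 1: "change_lane"}  # discrete action space

    def __init__(self, length=5):
        self.length = length
        self.reset()

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.position, self.lane = 0, 0
        return (self.position, self.lane)

    def step(self, action):
        """Apply an action; return (observation, reward, done, info)."""
        if action == 1:
            self.lane = 1 - self.lane
        self.position += 1
        # In this toy model, lane 1 is the fast lane and pays a higher reward.
        reward = 1.0 if self.lane == 1 else 0.5
        done = self.position >= self.length
        return (self.position, self.lane), reward, done, {}

env = ToyLaneEnv()
obs = env.reset()
done, total = False, 0.0
while not done:
    action = 1 if obs[1] == 0 else 0  # move to lane 1, then stay there
    obs, reward, done, info = env.step(action)
    total += reward
print(total)  # prints 5.0
```

With the real library, the environment object would be obtained through `gym.make(id)` instead of a hand-written class, and the loop itself would keep the same structure.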
7.1.2 rlberry
 Keywords: Reinforcement learning, Simulation, Artificial intelligence
 Functional Description: rlberry is a reinforcement learning (RL) library in Python for research and education. The library provides implementations of several RL agents for you to use as a starting point or as baselines, provides a set of benchmark environments, very useful to debug and challenge your algorithms, handles all random seeds for you, ensuring reproducibility of your results, and is fully compatible with several commonly used RL libraries like OpenAI gym and Stable Baselines.

 URL: https://github.com/rlberry-py/rlberry
 Contact: Omar Darwiche Domingues
7.1.3 justicia
 Name: Justicia: A Stochastic SAT Approach to Formally Verify Fairness
 Keywords: Fairness, Machine learning, Verification, Fairness Verification, Fair and ethical machine learning, Formal methods
 Functional Description: justicia is a fairness verifier written in Python. The library provides a stochastic SAT encoding of multiple fairness definitions and fair ML algorithms. justicia then further verifies the fairness metric achieved by the corresponding ML algorithm. It is now available as an official Python package and can be installed using pip.
 News of the Year: 2020

 URL: https://www.github.com/meelgroup/justicia
 Contact: Debabrota Basu
 Participant: Bishwamittra Ghosh
 Partner: National University of Singapore
8 New results
We organize our research results in a set of categories. The main categories are: bandit problems, reinforcement learning problems, and applications.
8.1 Bandit problems
Statistical efficiency of Thompson sampling for combinatorial semi-bandits
We investigate the stochastic combinatorial multi-armed bandit with semi-bandit feedback (CMAB). In CMAB, the question of the existence of an efficient policy with an optimal asymptotic regret (up to a factor poly-logarithmic in the action size) is still open for many families of distributions, including mutually independent outcomes, and more generally the multivariate sub-Gaussian family. We propose to answer the above question for these two families by analyzing variants of the Combinatorial Thompson Sampling policy (CTS). For mutually independent outcomes in $[0,1]$, we propose a tight analysis of CTS using Beta priors. We then look at the more general setting of multivariate sub-Gaussian outcomes and propose a tight analysis of CTS using Gaussian priors. This last result gives us an alternative to the Efficient Sampling for Combinatorial Bandit policy (ESCB), which, although optimal, is not computationally efficient.
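As an illustration of the Beta-prior variant, the following sketch runs a combinatorial Thompson sampling loop on a simple instance. Several elements are assumptions made for the example and are not taken from the paper: outcomes are Bernoulli, the action set consists of all subsets of `m` arms (so the optimization oracle is a plain top-`m` selection), and the function name `cts_top_m` is hypothetical.

```python
import random

def cts_top_m(means, m, horizon, seed=0):
    """Combinatorial Thompson sampling with Beta(1, 1) priors, top-m actions."""
    rng = random.Random(seed)
    k = len(means)
    alpha = [1] * k  # 1 + number of observed ones per base arm
    beta = [1] * k   # 1 + number of observed zeros per base arm
    pulls = [0] * k
    for _ in range(horizon):
        # Draw one plausible mean per base arm from its Beta posterior.
        theta = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        # Oracle step: with top-m actions, the best action for the sampled
        # means is the set of the m largest samples.
        action = sorted(range(k), key=lambda i: theta[i], reverse=True)[:m]
        for i in action:
            # Semi-bandit feedback: the outcome of every played arm is seen.
            x = 1 if rng.random() < means[i] else 0
            alpha[i] += x
            beta[i] += 1 - x
            pulls[i] += 1
    return pulls

# Hypothetical instance: five base arms, actions made of two arms.
pulls = cts_top_m([0.1, 0.3, 0.5, 0.7, 0.9], m=2, horizon=2000)
print(pulls)
```

For a richer action set (paths, matchings, and so on), only the oracle line would change; the posterior sampling and the per-arm updates stay the same.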
Sub-sampling for Efficient Non-Parametric Bandit Exploration
In this paper we propose the first multi-armed bandit algorithm based on re-sampling that achieves asymptotically optimal regret simultaneously for different families of arms (namely Bernoulli, Gaussian and Poisson distributions). Unlike Thompson Sampling, which requires specifying a different prior to be optimal in each case, our proposal RB-SDA does not need any distribution-dependent tuning. RB-SDA belongs to the family of Sub-sampling Duelling Algorithms (SDA), which combines the sub-sampling idea first used by the BESA [61] and SSMC [63] algorithms with different sub-sampling schemes. In particular, RB-SDA uses Random Block sampling. We perform an experimental study assessing the flexibility and robustness of this promising novel approach for exploration in bandit models.
Monte-Carlo Graph Search: the Value of Merging Similar States
We consider the problem of planning in a Markov Decision Process (MDP) with a generative model and limited computational budget. Despite the underlying MDP transitions having a graph structure, the popular Monte-Carlo Tree Search algorithms such as UCT rely on a tree structure to represent their value estimates. That is, they do not identify two similar states that are reached via different trajectories and represented in separate branches of the tree. In this work, we propose a graph-based planning algorithm, which takes into account this state similarity. In our analysis, we provide a regret bound that depends on a novel problem-dependent measure of difficulty, which improves on the original tree-based bound in MDPs where the trajectories overlap, and recovers it otherwise. Then, we show that this methodology can be adapted to existing planning algorithms that deal with stochastic systems. Finally, numerical simulations illustrate the benefits of our approach.
Planning in Markov Decision Processes with Gap-Dependent Sample Complexity
We propose MDP-GapE, a new trajectory-based Monte-Carlo Tree Search algorithm for planning in a Markov Decision Process in which transitions have a finite support. We prove an upper bound on the number of calls to the generative model needed for MDP-GapE to identify a near-optimal action with high probability. This problem-dependent sample complexity result is expressed in terms of the sub-optimality gaps of the state-action pairs that are visited during exploration. Our experiments reveal that MDP-GapE is also effective in practice, in contrast with other algorithms with sample complexity guarantees in the fixed-confidence setting, that are mostly theoretical.
Spectral bandits
Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this work, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node of an undirected graph and its expected rating is similar to that of its neighbors. The goal is to recommend items that have high expected ratings. We aim for algorithms whose cumulative regret with respect to the optimal policy does not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose three algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on a content-recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of node evaluations.
Solving Bernoulli Rank-One Bandits with Unimodal Thompson Sampling
Stochastic Rank-One Bandits [68, 69] are a simple framework for regret minimization problems over rank-one matrices of arms. The initially proposed algorithms are proved to have logarithmic regret, but do not match the existing lower bound for this problem. We close this gap by first proving that rank-one bandits are a particular instance of unimodal bandits, and then providing a new analysis of Unimodal Thompson Sampling (UTS), initially proposed by Paladino et al. [70]. We prove an asymptotically optimal regret bound on the frequentist regret of UTS and we support our claims with simulations showing the significant improvement of our method compared to the state-of-the-art.
A Practical Algorithm for Multiplayer Bandits when Arm Means Vary Among Players
We study a multiplayer stochastic multi-armed bandit problem in which players cannot communicate, and if two or more players pull the same arm, a collision occurs and the involved players receive zero reward. We consider the challenging heterogeneous setting, in which different arms may have different means for different players, and propose a new and efficient algorithm that combines the idea of leveraging forced collisions for implicit communication and that of performing matching eliminations. We present a finite-time analysis of our algorithm, giving the first sublinear minimax regret bound for this problem, and prove that if the optimal assignment of players to arms is unique, our algorithm attains the optimal $O(\ln(T))$ regret, solving an open question raised at NeurIPS 2018 by Bistritz and Leshem [62].
Budgeted online influence maximization
We introduce a new budgeted framework for online influence maximization, considering the total cost of an advertising campaign instead of the common cardinality constraint on a chosen influencer set. Our approach better models the real-world setting where the cost of influencers varies and advertisers want to find the best value for their overall social advertising budget. We propose an algorithm assuming an independent cascade diffusion model and edge-level semi-bandit feedback, and provide both theoretical and experimental results. Our analysis is also valid for the cardinality-constraint setting and improves the state-of-the-art regret bound in this case.
Fixed-confidence guarantees for Bayesian best-arm identification
We investigate and provide new insights on the sampling rule called Top-Two Thompson Sampling (TTTS). In particular, we justify its use for fixed-confidence best-arm identification. We further propose a variant of TTTS called Top-Two Transportation Cost (T3C), which removes the computational burden of TTTS. As our main contribution, we provide the first sample complexity analysis of TTTS and T3C when coupled with a very natural Bayesian stopping rule, for bandits with Gaussian rewards, solving one of the open questions raised by Russo [71]. We also provide new posterior convergence results for TTTS under two models that are commonly used in practice: bandits with Gaussian and Bernoulli rewards and conjugate priors.
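The TTTS sampling rule can be sketched as follows for Gaussian arms. This is a simplified illustration under stated assumptions: the rewards have known unit variance, a flat prior is used so that the posterior of each mean is Gaussian around the empirical mean, the loop runs for a fixed budget instead of using the Bayesian stopping rule studied in the paper, and the names `ttts_choose`, `run_ttts` and the `max_resample` safeguard are hypothetical.

```python
import math
import random

def ttts_choose(means_hat, counts, rng, beta=0.5, sigma=1.0, max_resample=10000):
    """One Top-Two Thompson Sampling step for Gaussian arms with known sigma."""
    def sampled_best():
        # Posterior sample of each mean (flat prior assumed), then argmax.
        theta = [rng.gauss(m, sigma / math.sqrt(n))
                 for m, n in zip(means_hat, counts)]
        return max(range(len(theta)), key=theta.__getitem__)

    leader = sampled_best()
    if rng.random() < beta:
        return leader
    # Otherwise re-sample until another arm comes out best: the challenger.
    for _ in range(max_resample):
        challenger = sampled_best()
        if challenger != leader:
            return challenger
    return leader  # safeguard added for this sketch, not part of the rule

def run_ttts(true_means, horizon, seed=0):
    """Run the sampling rule for a fixed budget and return the pull counts."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [1] * k
    sums = [rng.gauss(mu, 1.0) for mu in true_means]  # one initial pull each
    for _ in range(horizon - k):
        means_hat = [s / n for s, n in zip(sums, counts)]
        arm = ttts_choose(means_hat, counts, rng)
        sums[arm] += rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
    return counts

ttts_counts = run_ttts([0.0, 0.3, 0.5], horizon=150)
print(ttts_counts)
```

The re-sampling loop is where the cost of TTTS lies: once the posterior has concentrated, many draws may be needed before a different arm comes out best, which is the burden the text says T3C removes.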
The Influence of Shape Constraints on the Thresholding Bandit Problem
We investigate the stochastic Thresholding Bandit problem (TBP) under several shape constraints. On top of (i) the vanilla, unstructured TBP, we consider (ii) the case where the sequence of arm means $(\mu_k)_k$ is monotonically increasing (MTBP), (iii) the case where $(\mu_k)_k$ is unimodal (UTBP) and (iv) the case where $(\mu_k)_k$ is concave (CTBP). In the TBP, the aim is to output, at the end of the sequential game, the set of arms whose means are above a given threshold. The regret is the highest gap between a misclassified arm and the threshold. In the fixed-budget setting, we provide problem-independent minimax rates for the expected regret in all settings, as well as associated algorithms. We prove that the minimax rates for the regret are (i) $\sqrt{\log(K) K/T}$ for TBP, (ii) $\sqrt{\log(K)/T}$ for MTBP, (iii) $\sqrt{K/T}$ for UTBP and (iv) $\sqrt{\log\log K/T}$ for CTBP, where $K$ is the number of arms and $T$ is the budget. These rates demonstrate that the dependence on $K$ of the minimax regret varies significantly depending on the shape constraint. This highlights the fact that the shape constraints fundamentally modify the nature of the TBP.
Covariance-adapting algorithm for semi-bandits with application to sparse outcomes
We investigate stochastic combinatorial semi-bandits, where the entire joint distribution of outcomes impacts the complexity of the problem instance (unlike in the standard bandits). Typical distributions considered depend on specific parameter values, whose prior knowledge is required in theory but quite difficult to estimate in practice; an example is the commonly assumed sub-Gaussian family. We alleviate this issue by instead considering a new general family of sub-exponential distributions, which contains bounded and Gaussian ones. We prove a new lower bound on the regret on this family, that is parameterized by the unknown covariance matrix, a tighter quantity than the sub-Gaussian matrix. We then construct an algorithm that uses covariance estimates, and provide a tight asymptotic analysis of the regret. Finally, we apply and extend our results to the family of sparse outcomes, which has applications in many recommender systems.
Gamification of pure exploration for linear bandits
We investigate an active pure-exploration setting, that includes best-arm identification, in the context of linear stochastic bandits. While asymptotically optimal algorithms exist for standard multi-armed bandits, the existence of such algorithms for best-arm identification in linear bandits has been elusive despite several attempts to address it. First, we provide a thorough comparison and new insight over different notions of optimality in the linear case, including $G$-optimality, transductive optimality from optimal experimental design and asymptotic optimality. Second, we design the first asymptotically optimal algorithm for fixed-confidence pure exploration in linear bandits. As a consequence, our algorithm naturally bypasses the pitfall caused by a simple but difficult instance, that most prior algorithms had to be engineered to deal with explicitly. Finally, we avoid the need to fully solve an optimal design problem by providing an approach that entails an efficient implementation.
Stochastic bandits with vector losses: Minimizing $\ell^{\infty}$-norm of relative losses
Multi-armed bandits are widely applied in scenarios like recommender systems, for which the goal is to maximize the click rate. However, more factors should be considered, e.g., user stickiness, user growth rate, user experience assessment, etc. In this paper, we model this situation as a problem of $K$-armed bandit with multiple losses. We define the relative loss vector of an arm, where the $i$-th entry compares the arm and the optimal arm with respect to the $i$-th loss. We study two goals: (a) finding the arm with the minimum $\ell^{\infty}$-norm of relative losses with a given confidence level (which refers to fixed-confidence best-arm identification); (b) minimizing the $\ell^{\infty}$-norm of cumulative relative losses (which refers to regret minimization). For goal (a), we derive a problem-dependent sample complexity lower bound and discuss how to achieve matching algorithms. For goal (b), we provide a regret lower bound of $\Omega(T^{2/3})$ and provide a matching algorithm.
Efficient Change-Point Detection for Tackling Piecewise-Stationary Bandits
We introduce GLR-klUCB, a novel algorithm for the piecewise i.i.d. non-stationary bandit problem with bounded rewards. This algorithm combines an efficient bandit algorithm, kl-UCB, with an efficient, parameter-free, change-point detector, the Bernoulli Generalized Likelihood Ratio Test, for which we provide new theoretical guarantees of independent interest. Unlike previous non-stationary bandit algorithms using a change-point detector, GLR-klUCB does not need to be calibrated based on prior knowledge on the arms' means. We prove that this algorithm can attain a $O\left(\sqrt{T A \Upsilon_T \log(T)}\right)$ regret in $T$ rounds on some “easy” instances, where $A$ is the number of arms and $\Upsilon_T$ the number of change-points, without prior knowledge of $\Upsilon_T$. In contrast with recently proposed algorithms that are agnostic to $\Upsilon_T$, we perform a numerical study showing that GLR-klUCB is also very efficient in practice, beyond easy instances.
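A minimal sketch of a Bernoulli generalized likelihood ratio change-point test is given below. For every split point it compares the empirical means before and after the split with the overall mean, using the Bernoulli Kullback-Leibler divergence, and flags a change when the statistic exceeds a threshold. The threshold used here, $\log(3 n^{3/2}/\delta)$, is a simplified choice assumed for this example, not necessarily the one analyzed in the paper, and the function names are hypothetical.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def glr_detect(xs, delta=0.01):
    """Return True if the Bernoulli GLR test flags a change in the sequence."""
    n = len(xs)
    total = sum(xs)
    mean_all = total / n
    threshold = math.log(3 * n ** 1.5 / delta)  # simplified threshold (assumption)
    prefix = 0.0
    for s in range(1, n):
        prefix += xs[s - 1]
        mean_left = prefix / s                   # mean of xs[:s]
        mean_right = (total - prefix) / (n - s)  # mean of xs[s:]
        stat = (s * kl_bernoulli(mean_left, mean_all)
                + (n - s) * kl_bernoulli(mean_right, mean_all))
        if stat > threshold:
            return True
    return False

print(glr_detect([0] * 100 + [1] * 100))  # abrupt change in the middle
print(glr_detect([0, 1] * 100))           # same mean throughout
```

In a bandit algorithm, one such detector would typically be run on the reward stream of each arm, restarting the statistics of the index policy when a change is flagged.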
Adversarial Attacks on Linear Contextual Bandits
Contextual bandit algorithms are applied in a wide range of domains, from advertising to recommender systems, from clinical trials to education. In many of these domains, malicious agents may have incentives to attack the bandit algorithm to induce it to perform a desired behavior. For instance, an unscrupulous ad publisher may try to increase their own revenue at the expense of the advertisers; a seller may want to increase the exposure of their products, or thwart a competitor's advertising campaign. In this paper, we study several attack scenarios and show that a malicious agent can force a linear contextual bandit algorithm to pull any desired arm $T - o(T)$ times over a horizon of $T$ steps, while applying adversarial modifications to either rewards or contexts that only grow logarithmically as $O(\log T)$. We also investigate the case when a malicious agent is interested in affecting the behavior of the bandit algorithm in a single context (e.g., a specific user). We first provide sufficient conditions for the feasibility of the attack and we then propose an efficient algorithm to perform the attack. We validate our theoretical results on experiments performed on both synthetic and real-world datasets.
Forced-exploration free Strategies for Unimodal Bandits
We consider a multi-armed bandit problem specified by a set of Gaussian or Bernoulli distributions endowed with a unimodal structure. Although this problem has been addressed in the literature [64], the state-of-the-art algorithms for such structure rely on a forced-exploration mechanism. We introduce IMED-UB, the first forced-exploration free strategy that exploits the unimodal structure, by adapting to this setting the Indexed Minimum Empirical Divergence (IMED) strategy introduced by Honda and Takemura [66]. This strategy is proven optimal. We then derive KLUCB-UB, a KLUCB version of IMED-UB, which is also proven optimal. Owing to our proof technique, we are further able to provide a concise finite-time analysis of both strategies in a unified way. Numerical experiments show that both IMED-UB and KLUCB-UB perform similarly in practice and outperform the state-of-the-art algorithms.
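For readers unfamiliar with IMED, here is a sketch of the plain, unstructured index rule for Bernoulli arms: each arm gets the index $N_a \, \mathrm{kl}(\hat\mu_a, \hat\mu^\star) + \log N_a$ and the arm with the smallest index is pulled. The structured variants discussed in the text additionally exploit the unimodal structure and are not reproduced here; the instance and the function names are assumptions made for the example.

```python
import math
import random

def kl_bernoulli(p, q, eps=1e-12):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def imed_choose(means_hat, counts):
    """Return the arm minimising the IMED index N * kl(mean, best) + log N."""
    best = max(means_hat)
    index = [n * kl_bernoulli(m, best) + math.log(n)
             for m, n in zip(means_hat, counts)]
    return min(range(len(index)), key=index.__getitem__)

def run_imed(true_means, horizon, seed=0):
    """Run plain IMED on a Bernoulli bandit and return the pull counts."""
    rng = random.Random(seed)
    k = len(true_means)
    # One initial pull per arm so that every index is well defined.
    sums = [1.0 if rng.random() < mu else 0.0 for mu in true_means]
    counts = [1] * k
    for _ in range(horizon - k):
        arm = imed_choose([s / n for s, n in zip(sums, counts)], counts)
        sums[arm] += 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
    return counts

imed_counts = run_imed([0.2, 0.5, 0.8], horizon=3000)
print(imed_counts)
```

No exploration phase is scheduled in advance: an arm that looks bad is pulled again only once the logarithmic term of the empirically best arm has grown past its own index.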
Optimal Strategies for Graph-Structured Bandits
We study a structured variant of the multi-armed bandit problem specified by a set of Bernoulli distributions $\nu = (\nu_{a,b})_{a\in\mathcal{A}, b\in\mathcal{B}}$ with means $(\mu_{a,b})_{a\in\mathcal{A}, b\in\mathcal{B}} \in [0,1]^{\mathcal{A}\times\mathcal{B}}$ and by a given weight matrix $\omega = (\omega_{b,b'})_{b,b'\in\mathcal{B}}$, where $\mathcal{A}$ is a finite set of arms and $\mathcal{B}$ is a finite set of users. The weight matrix $\omega$ is such that for any two users $b, b' \in \mathcal{B}$, $\max_{a\in\mathcal{A}} |\mu_{a,b} - \mu_{a,b'}| \le \omega_{b,b'}$. This formulation is flexible enough to capture various situations, from highly-structured scenarios ($\omega \in \{0,1\}^{\mathcal{B}\times\mathcal{B}}$) to fully unstructured setups ($\omega \equiv 1$). We consider two scenarios depending on whether the learner chooses only the actions to sample rewards from or both users and actions. We first derive problem-dependent lower bounds on the regret for this generic graph-structure that involves a structure-dependent linear programming problem. Second, we adapt to this setting the Indexed Minimum Empirical Divergence (IMED) algorithm introduced by Honda and Takemura (2015), and introduce the IMED-GS$^\star$ algorithm. Interestingly, IMED-GS$^\star$ does not require computing the solution of the linear programming problem more than about $\log(T)$ times after $T$ steps, while being provably asymptotically optimal. Also, unlike existing bandit strategies designed for other popular structures, IMED-GS$^\star$ does not resort to an explicit forced exploration scheme and only makes use of local counts of empirical events. We finally provide numerical illustrations of our results that confirm the performance of IMED-GS$^\star$.
On Multi-Armed Bandit Designs for Dose-Finding Trials
We study the problem of finding the optimal dosage in early stage clinical trials through the multi-armed bandit lens. We advocate the use of the Thompson Sampling principle, a flexible algorithm that can accommodate different types of monotonicity assumptions on the toxicity and efficacy of the doses. For the simplest version of Thompson Sampling, based on a uniform prior distribution for each dose, we provide finite-time upper bounds on the number of sub-optimal dose selections, which is unprecedented for dose-finding algorithms. Through a large simulation study, we then show that variants of Thompson Sampling based on more sophisticated prior distributions outperform state-of-the-art dose identification algorithms in different types of dose-finding studies that occur in phase I or phase I/II trials.
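The Thompson Sampling principle with a uniform prior can be sketched as follows for generic Bernoulli arms: a uniform prior is Beta(1, 1), so after s successes and f failures the posterior is Beta(1 + s, 1 + f); one value is sampled per arm and the arm with the largest sample is selected. This is a simplified illustration that maximises a success probability; it does not implement the dose-finding objective or the monotonicity assumptions discussed above, and all names are hypothetical.

```python
import random


def thompson_select(successes, failures, rng=random):
    """Sample each arm's Beta(1 + s, 1 + f) posterior; pick the largest sample."""
    samples = [rng.betavariate(1 + s, 1 + f)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def run_thompson_sampling(true_probs, horizon, seed=0):
    """Simulate Thompson Sampling on Bernoulli arms; return pulls per arm."""
    rng = random.Random(seed)
    k = len(true_probs)
    successes, failures = [0] * k, [0] * k
    for _ in range(horizon):
        arm = thompson_select(successes, failures, rng)
        if rng.random() < true_probs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]
```

Counting how often each sub-optimal arm is pulled in such a simulation is the empirical counterpart of the number of sub-optimal selections that the finite-time bounds control.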
8.2 Reinforcement learning
Only Relevant Information Matters: Filtering Out Noisy Samples to Boost RL
In reinforcement learning, policy gradient algorithms optimize the policy directly and rely on efficiently sampling an environment. Nevertheless, while most sampling procedures are based on direct policy sampling, self-performance measures could be used to improve such sampling prior to each policy update. Following this line of thought, we introduce SAUNA, a method where non-informative transitions are rejected from the gradient update. The level of information is estimated according to the fraction of variance explained by the value function: a measure of the discrepancy between V and the empirical returns. In this work, we use this metric to select samples that are useful to learn from, and we demonstrate that this selection can significantly improve the performance of policy gradient methods. In this paper: (a) We define SAUNA's metric and introduce its method to filter transitions. (b) We conduct experiments on a set of benchmark continuous control problems. SAUNA significantly improves performance. (c) We investigate how SAUNA reliably selects samples with the most positive impact on learning and study its improvement on both performance and sample efficiency.
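The fraction of variance explained can be illustrated with a short sketch. It assumes the usual definition, 1 minus the ratio of the squared residuals between empirical returns and value predictions to the total squared deviation of the returns from their mean; the exact estimator and the filtering rule used by SAUNA are not reproduced here, and the function name is illustrative.

```python
import numpy as np


def fraction_of_variance_explained(returns, values):
    """1 - sum((R - V)^2) / sum((R - mean(R))^2); NaN if the returns are constant."""
    returns = np.asarray(returns, dtype=float)
    values = np.asarray(values, dtype=float)
    total = np.sum((returns - returns.mean()) ** 2)
    if total == 0.0:
        return float("nan")
    residual = np.sum((returns - values) ** 2)
    return float(1.0 - residual / total)
```

A value close to 1 means the value function predicts the returns well, 0 means it does no better than predicting the mean return, and negative values mean it does worse.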
Inferential Induction: A Novel Framework for Bayesian Reinforcement Learning
Bayesian Reinforcement Learning (BRL) offers a decision-theoretic solution to the reinforcement learning problem. While “model-based” BRL algorithms have focused on maintaining a posterior distribution on models, “model-free” BRL methods try to estimate value function distributions but make strong implicit assumptions or approximations. We describe a novel Bayesian framework, inferential induction, for correctly inferring value function distributions from data, which leads to a new family of BRL algorithms. We design an algorithm, Bayesian Backwards Induction (BBI), with this framework. We experimentally demonstrate that BBI is competitive with the state of the art. However, its advantage relative to existing model-free BRL methods is not as great as we had expected, particularly when the additional computational burden is taken into account.
Tightening Exploration in Upper Confidence Reinforcement Learning
The upper confidence reinforcement learning (UCRL2) algorithm introduced in [67] is a popular method to perform regret minimization in unknown discrete Markov Decision Processes under the average-reward criterion. Despite its nice and generic theoretical regret guarantees, this algorithm and its variants have remained until now mostly theoretical, as numerical experiments in simple environments exhibit long burn-in phases before the learning takes place. In pursuit of practical efficiency, we present UCRL3, following the lines of UCRL2, but with two key modifications. First, it uses state-of-the-art time-uniform concentration inequalities to compute confidence sets on the reward and (component-wise) transition distributions for each state-action pair. Second, to tighten exploration, it uses an adaptive computation of the support of each transition distribution, which in turn enables us to revisit the extended value iteration procedure of UCRL2 to optimize over distributions with reduced support by disregarding low-probability transitions, while still ensuring near-optimism. We demonstrate, through numerical experiments in standard environments, that reducing exploration this way yields a substantial numerical improvement compared to UCRL2 and its variants. On the theoretical side, these key modifications enable us to derive a regret bound for UCRL3 that improves on that of UCRL2 and, for the first time, involves notions of local diameter and local effective support, thanks to variance-aware concentration bounds.
“I'm sorry Dave, I'm afraid I can't do that” Deep Q-Learning From Forbidden Actions
The use of Reinforcement Learning (RL) is still restricted to simulation or to enhancing human-operated systems through recommendations. Real-world environments (e.g. industrial robots or power grids) are generally designed with safety constraints in mind, implemented in the form of valid-action masks or contingency controllers. For example, the range of motion and the angles of the motors of a robot can be limited to physical boundaries. Violating constraints thus results in rejected actions or in entering a safe mode driven by an external controller, making RL agents incapable of learning from their mistakes. In this paper, we propose a simple modification of a state-of-the-art deep RL algorithm (DQN), enabling learning from forbidden actions. To do so, the standard Q-learning update is enhanced with an extra safety loss inspired by structured classification. We empirically show that it reduces the number of hit constraints during the learning phase and accelerates convergence to near-optimal policies compared to using standard DQN. Experiments are done on a Visual Grid World Environment and the TextWorld domain.
A Machine of Few Words: Interactive Speaker Recognition with Reinforcement Learning
Speaker recognition is a well-known and well-studied task in the speech processing domain. It has many applications, either for security or for speaker adaptation of personal devices. In this paper, we present a new paradigm for automatic speaker recognition that we call Interactive Speaker Recognition (ISR). In this paradigm, the recognition system aims to incrementally build a representation of the speakers by requesting personalized utterances to be spoken, in contrast to the standard text-dependent or text-independent schemes. To do so, we cast the speaker recognition task into a sequential decision-making problem that we solve with Reinforcement Learning. Using a standard dataset, we show that our method achieves excellent performance while using small amounts of speech signal. This method could also be applied as an utterance selection mechanism for building speech synthesis systems.
HIGhER: Improving instruction following with Hindsight Generation for Experience Replay
Language creates a compact representation of the world and allows the description of unlimited situations and objectives through compositionality. While these characterizations may foster instructing, conditioning or structuring interactive agent behavior, it remains an open problem to correctly relate language understanding and reinforcement learning in even simple instruction-following scenarios. This joint learning problem is alleviated through expert demonstrations, auxiliary losses, or neural inductive biases. In this paper, we propose an orthogonal approach called Hindsight Generation for Experience Replay (HIGhER) that extends the Hindsight Experience Replay approach to the language-conditioned policy setting. Whenever the agent does not fulfill its instruction, HIGhER learns to output a new directive that matches the agent trajectory, and it relabels the episode with a positive reward. To do so, HIGhER learns to map a state into an instruction by using past successful trajectories, which removes the need to have external expert interventions to relabel episodes as in vanilla HER. We show the efficiency of our approach in the BabyAI environment, and demonstrate how it complements other instruction-following methods.
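The relabeling step described above can be sketched in a few lines. In this illustration, `instruction_generator` is a hypothetical callable standing in for the learned state-to-instruction mapping, a trajectory is simply a list of states, and the choice to keep the failed episode alongside the relabeled one is an assumption borrowed from standard hindsight replay rather than a detail taken from the paper.

```python
def relabel_episode(trajectory, instruction, success, instruction_generator):
    """Return the (instruction, reward) labels to store for one episode.

    trajectory:            list of visited states (last entry is the final state)
    instruction:           the instruction the agent was asked to follow
    success:               whether the agent fulfilled that instruction
    instruction_generator: hypothetical learned mapping from a state to an instruction
    """
    labels = [(instruction, 1.0 if success else 0.0)]
    if not success:
        # Hindsight: describe what the agent actually achieved and reward it.
        achieved = instruction_generator(trajectory[-1])
        labels.append((achieved, 1.0))
    return labels
```

For instance, with a toy generator `lambda state: "go to " + state`, a failed episode ending in state "red door" is stored once with its original instruction and zero reward, and once with the instruction "go to red door" and a positive reward.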
Fictitious Play for Mean Field Games: Continuous Time Analysis and Applications
In this paper, we extend the analysis of the continuous-time Fictitious Play learning algorithm to various finite-state Mean Field Game settings (finite horizon, $\gamma$-discounted), allowing in particular for the introduction of an additional common noise. We first present a theoretical convergence analysis of the continuous-time Fictitious Play process and prove that the induced exploitability decreases at a rate $O\left(\frac{1}{t}\right)$. Such analysis emphasizes the use of exploitability as a relevant metric for evaluating the convergence towards a Nash equilibrium in the context of Mean Field Games. These theoretical contributions are supported by numerical experiments provided in either model-based or model-free settings. We thereby provide, for the first time, converging learning dynamics for Mean Field Games in the presence of common noise.
Regret bounds for kernel-based reinforcement learning
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. Unlike existing approaches with regret guarantees, it does not use any kind of partitioning of the state-action space. For problems with $K$ episodes and horizon $H$, we provide a regret bound of $O\left(H^{3} K^{\max\left(\frac{1}{2}, \frac{2d}{2d+1}\right)}\right)$, where $d$ is the covering dimension of the joint state-action space. We empirically validate Kernel-UCBVI on discrete and continuous MDPs.
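To see how the regret bound scales with the dimension, the exponent of K can be evaluated directly. The sketch below reads the exponent as max(1/2, 2d/(2d+1)) and uses exact fractions; the function name is illustrative and d is assumed to be a non-negative integer.

```python
from fractions import Fraction


def regret_exponent(d):
    """Exponent of K in the bound, read as max(1/2, 2d / (2d + 1))."""
    return max(Fraction(1, 2), Fraction(2 * d, 2 * d + 1))


# Exponent of K for a few covering dimensions.
exponents = {d: regret_exponent(d) for d in (0, 1, 2, 5)}
```

Under this reading the exponent is 1/2 when d = 0, then 2/3 for d = 1 and 4/5 for d = 2; it approaches 1 as d grows, so the guarantee weakens in high-dimensional state-action spaces.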
CopyCAT: Taking Control of Neural Policies with Constant Attacks
We propose a new perspective on adversarial attacks against deep reinforcement learning agents. Our main contribution is CopyCAT, a targeted attack able to consistently lure an agent into following an outsider's policy. It is pre-computed, hence fast at inference time, and could thus be used in a real-time scenario. We show its effectiveness on Atari 2600 games in the novel read-only setting. In this setting, the adversary cannot directly modify the agent's state – its representation of the environment – but can only attack the agent's observation – its perception of the environment. Directly modifying the agent's state would require write access to the agent's inner workings, and we argue that this assumption is too strong in realistic settings.
Self-Attentional Credit Assignment for Transfer in Reinforcement Learning
The ability to transfer knowledge to novel environments and tasks is a sensible desideratum for general learning agents. Despite the apparent promises, transfer in RL is still an open and little-exploited research area. In this paper, we take a brand-new perspective on transfer: we suggest that the ability to assign credit unveils structural invariants in the tasks that can be transferred to make RL more sample efficient. Our main contribution is Secret, a novel approach to transfer learning for RL that uses a backward-view credit assignment mechanism based on a self-attentive architecture. Two aspects are key to its generality: it learns to assign credit as a separate offline supervised process and exclusively modifies the reward function. Consequently, it can be supplemented by transfer methods that do not modify the reward function and it can be plugged on top of any RL algorithm.
8.3 Adaptive control
Robust-Adaptive Control of Linear Systems: beyond Quadratic Costs
We consider the problem of robust and adaptive model predictive control (MPC) of a linear system, with unknown parameters that are learned along the way (adaptive), in a critical setting where failures must be prevented (robust). This problem has been studied from different perspectives by different communities. However, the existing theory deals only with the case of quadratic costs (the LQ problem), which limits applications to stabilisation and tracking tasks only. In order to handle more general (non-convex) costs that naturally arise in many practical problems, we carefully select and bring together several tools from different communities, namely non-asymptotic linear regression, recent results in interval prediction, and tree-based planning. Combining and adapting the theoretical guarantees at each layer is non-trivial, and we provide the first end-to-end suboptimality analysis for this setting. Interestingly, our analysis naturally adapts to handle many models and combines with a data-driven robust model selection strategy, which makes it possible to relax the modelling assumptions. Last, we strive to preserve tractability at every stage of the method, which we illustrate on two challenging simulated environments.
Robust-Adaptive Interval Predictive Control for Linear Uncertain Systems
We consider the problem of stabilization of a linear system, under state and control constraints, and subject to bounded disturbances and unknown parameters in the state matrix. First, using a simple least-squares solution and available noisy measurements, the set of admissible values for parameters is evaluated. Second, for the estimated set of parameter values and the corresponding linear interval model of the system, two interval predictors are recalled and an unconstrained stabilizing control is designed that uses the predicted intervals. Third, to guarantee the robust constraint satisfaction, a model predictive control algorithm is developed, which is based on the solution of an optimization problem posed for the interval predictor. The conditions for recursive feasibility and asymptotic performance are established. Efficiency of the proposed control framework is illustrated by numerical simulations.
8.4 Applications
The challenge of controlling microgrids in the presence of rare events with Deep Reinforcement Learning
The increased penetration of renewable energies and the need to decarbonize the grid come with a lot of challenges. Microgrids, power grids that can operate independently from the main system, are seen as a promising solution. They range from a small building to a neighbourhood or a village. As they co-locate generation, storage and consumption, microgrids are often built with renewable energies. At the same time, because they can be disconnected from the main grid, they can be more resilient and less dependent on central generation. Due to their diversity and distributed nature, advanced metering and control will be necessary to maximize their potential. This paper presents a reinforcement learning algorithm to tackle the energy management of an off-grid microgrid, represented as a Markov Decision Process. The main objective of the proposed algorithm is to minimize the global operating cost. By nature, rare events occur in physical systems. One of the main contributions of this paper is to demonstrate how to train agents in the presence of rare events. We show that merging the combined experience replay method with a novel method called “Memory Counter” unblocks the agent during its learning phase. Compared to baselines, we show that an extended version of Double Deep Q-Network that integrates a priority list of actions into the decision-making process significantly lowers the operating cost. Experiments are conducted using two years of real-world data from École Polytechnique in France.
Geometric Deep Reinforcement Learning for Dynamic DAG Scheduling
In practice, it is quite common to face combinatorial optimization problems which contain uncertainty along with non-determinism and dynamicity. These three properties call for appropriate algorithms; reinforcement learning (RL) deals with them in a very natural way. Today, despite some efforts, most real-life combinatorial optimization problems remain out of the reach of reinforcement learning algorithms. In this paper, we propose a reinforcement learning approach to solve a realistic scheduling problem, and apply it to an algorithm commonly executed in the high-performance computing community, the Cholesky factorization. In contrast to static scheduling, where tasks are assigned to processors in a predetermined ordering before the beginning of the parallel execution, our method is dynamic: task allocations and their execution ordering are decided at runtime, based on the system state and unexpected events, which allows much more flexibility. To do so, our algorithm uses graph neural networks in combination with an actor-critic algorithm (A2C) to build an adaptive representation of the problem on the fly. We show that this approach is competitive with state-of-the-art heuristics used in high-performance computing runtime systems. Moreover, our algorithm does not require an explicit model of the environment, but we demonstrate that extra knowledge can easily be incorporated and improves performance. We also exhibit key properties provided by this RL approach, and study its transfer abilities to other instances.
Interdisciplinary Research in Artificial Intelligence: Challenges and Opportunities
The use of artificial intelligence (AI) in a variety of research fields is speeding up multiple digital revolutions, from shifting paradigms in healthcare, precision medicine and wearable sensing, to public services and education offered to the masses around the world, to future cities made optimally efficient by autonomous driving. When a revolution happens, the consequences are not obvious straight away and, to date, there is no uniformly adapted framework to guide AI research to ensure a sustainable societal transition. To answer this need, here we analyze three key challenges to interdisciplinary AI research, and deliver three broad conclusions: 1) future development of AI should not only impact other scientific domains but should also take inspiration and benefit from other fields of science, 2) AI research must be accompanied by decision explainability, dataset bias transparency as well as development of evaluation methodologies and creation of regulatory agencies to ensure responsibility, and 3) AI education should receive more attention, efforts and innovation from the educational and scientific communities. Our analysis is of interest not only to AI practitioners but also to other researchers and the general public as it offers ways to guide the emerging collaborations and interactions toward the most fruitful outcomes.
International electronic health record-derived COVID-19 clinical course profiles: the 4CE consortium
We leveraged the largely untapped resource of electronic health record data to address critical clinical and epidemiological questions about Coronavirus Disease 2019 (COVID-19). To do this, we formed an international consortium (4CE) of 96 hospitals across 5 countries (https://
Machine learning applications in drug development
Due to the huge amount of biological and medical data available today, along with well-established machine learning algorithms, the design of largely automated drug development pipelines can now be envisioned. These pipelines may guide, or speed up, drug discovery; provide a better understanding of diseases and associated biological phenomena; help plan preclinical wet-lab experiments, and even future clinical trials. This automation of the drug development process might be key to addressing the low productivity rate that pharmaceutical companies currently face. In this survey, we focus in particular on two classes of methods: sequential learning and recommender systems, which are active biomedical fields of research.
8.5 Other
Restarted Bayesian Online Changepoint Detector achieves Optimal Detection Delay
In this paper, we consider the problem of sequential change-point detection where both the change-points and the distributions before and after the change are assumed to be unknown. For this problem of primary importance in statistical and sequential learning theory, we derive a variant of the Bayesian Online Change Point Detector proposed by [65] which is easier to analyze than the original version while keeping its powerful message-passing algorithm. We provide a non-asymptotic analysis of the false-alarm rate and the detection delay that matches the existing lower bound. We further provide the first explicit high-probability control of the detection delay for such an approach. Experiments on synthetic and real-world data show that this proposal outperforms the state-of-the-art change-point detection strategy, namely the Improved Generalized Likelihood Ratio (Improved GLR), while comparing favorably with the original Bayesian Online Change Point Detection strategy.
8.6 Covid crisis
Within the Inria-Covid project, we helped the Scikit-EDS team organize the data of Paris hospitals (AP-HP) in the form of daily dashboards. This was not only a matter of reporting but also of helping practitioners (surgeons, biostatisticians, epidemiologists, and infectious disease specialists) run statistical models and propose new ML methods. This led to a couple of publications. We also led a work package in the EIT Health Covidom Community focused on the prediction of clinical worsening from symptoms reported on the telemonitoring app Covidom. With the help of Vianney Taquet and Clémence Léguillette, who were hired on this project, we proposed new algorithms that are currently being implemented in production.
9 Bilateral contracts and grants with industry
9.1 Bilateral contracts with industry
 2 contracts with Google regarding PhDs of J. Ferret and L. Hussenot (2020–2022), contract headed and PhD supervision by Ph. Preux.
 1 contract with Facebook AI Research regarding PhD of J. Tarbouriech (2019–2021), contract headed and PhD supervision by Ph. Preux.
 1 contract with Renault regarding PhD of É. Leurent (2018–2020), contract headed by Ph. Preux, PhD supervision by OA. Maillard and D. Efimov (Valse, Inria Lille).
 N. Grinsztajn advised startup Deeplife in Paris.
 J.J. Vie has been advising Axiome, from Inria Startup Studio, since hackAtech Lille. With this startup, we competed in the NeurIPS 2020 Education Challenge and ranked 5th out of 23 submissions.
10 Partnerships and cooperations
10.1 International initiatives
10.1.1 Inria associate team not involved in an IIL
Associate team “Data Collection for Smart Crop Management” (DC4SCM) began in 2020. As the partner is in India, the activities have suffered heavily from the Covid-19 pandemic.
Scool also participates in the associate team 6PAC with CWI, headed by B. Guedj.
10.1.2 Inria international partners
Informal international partners
 with CGIAR, regarding agricultural practices recommendation.
 with A. Gopalan, IISc Bangalore, about Markov decision processes with exponential family models.
 with Y. Bergner (New York University, USA), P. Halpin (University of North Carolina at Chapel Hill, USA), about multidimensional item response theory
 with A. Gilra and E. Vasilaki (University of Sheffield, United Kingdom), and with M. Große-Wenstrup (University of Vienna, Austria), we designed the proposal of the Chist-Era project CausalXRL, which has been accepted and begins on April 1, 2021.
 with L. Martinez (Inria Chile) et al., we designed the proposal of the STIC AmSud project named EMISTRAL, which has been accepted and begins in 2021.
 with K. Meel and B. Ghosh (National University of Singapore), about formal fairness verification of machine learning algorithms.
 with I. Trummer (Cornell University, USA), about designing theory and algorithms of automated database optimizers based on reinforcement learning.
 with C. Dimitrakakis (University of Oslo, Norway), about designing fair and risksensitive reinforcement learning algorithms.
 with I.M Hassellöv (Chalmers University of Technology, Sweden), about using machine learning for enabling marine sciences and study of corresponding environmental phenomena.
 with Hisashi Kashima and Koh Takeuchi (Kyoto University, Japan), about active learning for education; this led to the proposal of a new associate team (OPALE), which has been accepted and starts in 2021.
10.2 International research visitors
10.2.1 Visits of international scientists
Prof. Anders Jonsson, University Pompeu Fabra (Spain), spent 1 year in the team on sabbatical, 2019–2020.
10.3 European initiatives
10.3.1 Collaborations in European programs, except FP7 and H2020
 ChistEra Delta, headed by A. Jonsson (University Pompeu Fabra, Spain), local head: É. Kaufmann, 10/2017 – 12/2021.
 EIT Health Covidom Community, with APHP, headed by Patrick Jourdain, local head: J.J. Vie, 4/2020 – 12/2020.
10.4 National initiatives
Scool is involved in 2 ANR projects:
 ANR Bold, headed by V. Perchet (ENS ParisSaclay, ENSAE), local head: É. Kaufmann, 2019–2023.
 ANR JCJC Badass, O.A. Maillard, 2016–2020.
Scool is involved in some Inria projects:
 Challenge HPC – Big Data, headed by B. Raffin, Datamove, Grenoble. In this challenge, we collaborate with:
 B. Raffin, on what HPC can bring and can be used at its best for reinforcement learning.
 O. Beaumont, E. Jeannot, on what RL can bring to HPC, in particular the use of RL for task scheduling.
 Challenge HY_AIAI. In this challenge, we collaborate with L. Gallaraga, CR Inria Rennes, about the combination of statistical and symbolic approaches in machine learning.
 Exploratory action “Sequential Recommendation for Sustainable Gardening” (SR4SG), headed by OA. Maillard.
Other collaborations in France:
 T. Levent, PhD student, LPICM, École Polytechnique, control of smartgrids.
 R. Gautron, PhD student, Cirad, agricultural practices recommendation.
 É. Oyallon, CR CNRS, Sorbonne Université, machine learning on graphs.
 M. Valko, researcher DeepMind.
 K. Naudin, Aida Unit, Cirad Montpellier, agroecology.
 Y. Yordanov, A. Dechartres, A. Dinh, P. Jourdain, APHP, EIT Health Covidom Community
 A. Gramfort, G. Varoquaux, O. Grisel, Inria Saclay Parietal, Inria-Covid project Scikit-EDS with AP-HP
 R. Khonsari, É. Vibert, A. Diallo, É. Vicaut, M. Bernaux, R. Nizard, T. Simon, N. Paris, L.B. Luong, J. Assouad, C. Paugam, APHP, risks of mortality in surgery of COVID19 patients
 P.A. Jachiet, M. Doutreligne (DREES, then HAS), A. Floyrac (DREES, then Health Data Hub), synthetic data generation of the Système national de données de santé (SNDS)
 A. DelahayeDuriez, INSERM, Université de Paris
10.5 Regional initiatives
 F. Pattou (PRU) and his group the Translational Research Laboratory for Diabetes (INSERM UMR 1190), CHU Lille, about patient personalized followup. This collaboration is funded by a set of projects:
 project B4H from ISite Lille,
 project Phenomix from ISite Lille,
 project PersoSurg funded by the CPER.
 A. Cotten CHU Lille and her group, project RAID, funded by the CPER.
 E. Chatelain, Eng. BiLille, medical data analysis.
 P. Schegg (PhD student), Ch. Duriez (DR Inria), J. Desquidt (MCF UdL), EPC Defrost Inria Lille, reinforcement learning for soft robotics in a surgical environment
 N. Mitton, DR Inria Lille, EPI Fun, data collection for smart crop management (associate team DC4SCM).
 D. Efimov, DR Inria Lille, EPC Valse, control theory (cosupervision of a PhD).
 project Biodimètre from Métropole Européenne de Lille (MEL).
11 Dissemination
11.1 Promoting scientific activities
11.1.1 Scientific events: organisation
General chair, scientific chair
 J.J. Vie: General Chair of EDM 2021
Member of the organizing committees
J.J. Vie co-organized the following events:
 WASL 2020, Optimizing Human Learning, Third Workshop eliciting Adaptive Sequences for Learning, fully virtual, co-located with AIED 2020, with Benoît Choffin, Fabrice Popineau (LRI), Hisashi Kashima (Kyoto University, Japan), 6 July 2020, 12 participants
 FATED 2020, Fairness, Accountability, and Transparency in Educational Data, fully virtual, colocated with EDM 2020, with Nigel Bosch (University of Illinois at UrbanaChampaign, USA), Christopher Brooks (University of Michigan, USA), Shayan Doroudi (University of California Irvine, USA), Josh Gardner (University of Washington, USA), Kenneth Holstein (Carnegie Mellon University, USA), Andrew Lan (University of Massachusetts at Amherst, USA), Collin Lynch (North Carolina State University, USA), Beverly Park Woolf (University of Massachusetts at Amherst, USA), Mykola Pechenizkiy (Eindhoven University of Technology, The Netherlands), Steven Ritter (Carnegie Learning, USA), Renzhe Yu (University of California, Irvine, USA), 10 July 2020, 60 participants
 We organized two workshops related to e-learning: one about optimizing human learning at the AI for Education conference (AIED 2020), one about fairness in educational data mining at EDM 2020.
11.1.2 Scientific events: selection
Member of the conference program committees
 Ph. Preux: PC member for AAAI 2020, IJCAI 2020; area chair for ECML (he declined to serve as area chair for ICML 2020)
 J.J. Vie: Senior PC member of EDM 2020
 OA. Maillard: PC member of COLT 2020, ICML 2020, ALT 2020.
 D. Basu: PC member of AAAI 2020.
Reviewer
Scool members review paper submissions to all major conferences in machine learning (e.g. ICML, NeurIPS, COLT, ALT, AISTATS, ICLR, etc.), AI (IJCAI, AAAI), and privacy (PoPETS).
It should also be noted that due to the heavy load of work, we decline many invitations, even for these conferences.
11.1.3 Journal
Member of the editorial boards
 OA. Maillard: part of the Journal of Machine Learning Research (JMLR) editorial board, as of July 2020.
 J.J. Vie: part of the Journal of Educational Data Mining (JEDM) editorial board.
Reviewer  reviewing activities
 OA. Maillard: Reviews for JMLR.
 J.J. Vie: Reviews for JEDM.
 D. Basu: IEEE Access, IEEE Transactions on Dependable and Secure Computing.
11.1.4 Invited talks
Many events were canceled in 2020. Other invitations could not be accepted (e.g. OA. Maillard was invited as a keynote speaker at SMMA 2020 but had to decline).
11.1.5 Scientific expertise
 OA. Maillard:
 member of the CRCN/ISFP jury at Inria Lille.
 was asked to provide reviewing expertise for the Astrid ANR committee.
 is part of Commission Emploi Recherche (CER) in 2020.
 Ph. Preux:
 member of the national Inria DR jury.
 member of the CRCN/ISFP jury at Inria Grenoble.
 is a member of the scientific committee “data science and models” (CSS5) of the IRD.
 is a member of a group thinking about “responsible AI” in REV3, in Lille.
 He also declined many reviewing requests (including for the ANR and the IUF).
11.1.6 Research administration
 Philippe Preux is « Délégué Scientifique Adjoint » of Inria Lille. As such, he is a member of the « Commission d'évaluation » of Inria. As head of Scool (REP), he is a member of the « Comité des équipes-projets » (CEP) of Inria Lille. He is also a member of the « Bureau scientifique du centre » (BSC) of Inria Lille.
11.2 Teaching  Supervision  Juries
11.2.1 Teaching
 D. Basu: Learning to Optimise Online with Full and Partial Information, Reading Session, M2 Data Science, UdL.
 D. Baudry taught about 100 h. in 2020, in data science and NLP, in M1 maths and M2 web analyst at the UdL.
 O. Darwiche taught reinforcement learning at École Centrale de Lille and in the data science master of the Université de Lille. He also served as a TA in reinforcement learning at the African Institute for Mathematical Sciences (AIMS) in Ghana in February.
 É. Kaufmann: Data Mining (21h) M1 Maths/Finance, UdL.
 É. Kaufmann: Sequential Decision Making (24h) M2 Data Science, Centrale Lille
 OA. Maillard: Statistical Reinforcement Learning (42h), MAP/INF641, Master Artificial Intelligence and advanced Visual Computing, École Polytechnique.
 R. Ouhamma taught about 40 hours, algorithmics and programming at UdL, and also proba/stats in M2 at CentraleLille.
 S. Perrin taught about 30 hours, Unix and databases at UdL.
 Ph. Preux: « IA et apprentissage automatique », DU IA & Santé, UdL
 H. Saber: as part of his agrégé de mathématiques duty, he taught in the licence and master of mathematics at UdL.
 M. Seurin is ATER, hence teaches 192 hours during the academic year at UdL. He taught machine learning, data science, reinforcement learning and other topics in licence and master MIASHS.
 F. Valentini: « deep learning pour la radiologie », practical session, DU IA & Santé, UdL.
 J.J. Vie: Introduction to Machine Learning (24h), M2 Mechanical Engineering, Polytech'Lille.
 J.J. Vie: Deep Learning Do It Yourself (45h), M1, ENS Paris.
Due to the Covid-19 crisis, our participation in some summer schools was canceled.
11.2.2 Supervision
Apart from Ph.D. students, we supervised the following students in 2020:
 Ph. Preux supervised A. Zeddoun, A. Moulin (master interns), H. Delavenne, A. Tuynmann, A. Vigneron (L3 interns).
 J.J. Vie supervised Pierre Bourse (L3), Aymeric Floyrac, Salim Nadir (M2).
In Scool, we consider that supervising students is part of the training of a Ph.D. student. Therefore, some of them (such as Y. Flet-Berliac and R. Ouhamma) took part in the supervision of master's students, under the guidance of a permanent researcher.
11.2.3 Juries
 É. Kaufmann was part of the following juries:
 PhD: Erwan LeCarpentier, IRIT Toulouse, July
 PhD: Yann Issartel, Université Paris-Saclay, December
 PhD: Margaux Brégère, Université Paris-Saclay, December
 PhD: Andrea Tirinzoni, Politecnico di Milano, Italy, reviewer
 Ph. Preux was part of the following juries:
 HDR: Nistor Grozavu, Université Paris 13, March, reviewer
 HDR: Sylvain Lamprier, Sorbonne Université, September, reviewer
 HDR: Émilie Kaufmann, UdL, November, president
 PhD: Tanguy Levent, École Polytechnique, December, president
 PhD: Matthieu Jedor, Université Paris-Saclay, December, reviewer
 OA. Maillard was part of the following juries:
 PhD: Robin Vogel, Télécom Paris, October, examiner
 PhD: Margaux Brégère, Université Paris-Saclay, December, reviewer
11.3 Popularization
 OA. Maillard, communication on « Jardinage massivement collaboratif » related to his Action Exploratoire SR4SG during French fête de la science week, October 2020. https://www.youtube.com/watch?v=AJHA1kG2d9A
 É. Leurent, communication at “Inria 13:45” event, October 13.
 Ph. Preux is a member of the Collectif des chercheurs « œuvres et recherches » of UdL, in particular the AI working group. The goal is to investigate links between AI and artists, organize events etc.
12 Scientific production
12.1 Major publications
 1 inproceedings 'Spectral Learning from a Single Trajectory under Finite-State Policies'. International conference on Machine Learning Proceedings of the International conference on Machine Learning Sydney, Australia July 2017
 2 inproceedings 'Multi-Player Bandits Revisited'. Algorithmic Learning Theory Mehryar Mohri and Karthik Sridharan Lanzarote, Spain April 2018
 3 inproceedings 'Only Relevant Information Matters: Filtering Out Noisy Samples to Boost RL'. IJCAI 2020 - International Joint Conference on Artificial Intelligence Yokohama, Japan July 2020
 4 inproceedings 'Optimal Best Arm Identification with Fixed Confidence'. 29th Annual Conference on Learning Theory (COLT) 49 JMLR Workshop and Conference Proceedings New York, United States June 2016
 5 article 'Operator-valued Kernels for Learning from Functional Response Data'. Journal of Machine Learning Research 17 20 2016, 1-54
 6 inproceedings 'Monte-Carlo Tree Search by Best Arm Identification'. NIPS 2017 - 31st Annual Conference on Neural Information Processing Systems Advances in Neural Information Processing Systems Long Beach, United States December 2017, 1-23
 7 article 'Boundary Crossing Probabilities for General Exponential Families'. Mathematical Methods of Statistics 27 2018
 8 inproceedings 'Tightening Exploration in Upper Confidence Reinforcement Learning'. International Conference on Machine Learning Vienna, Austria July 2020
 9 inproceedings 'Improving offline evaluation of contextual bandit algorithms via bootstrapping techniques'. International Conference on Machine Learning 32 Journal of Machine Learning Research, Workshop and Conference Proceedings; Proceedings of The 31st International Conference on Machine Learning Beijing, China June 2014
 10 inproceedings 'Visual Reasoning with Multi-hop Feature Modulation'. ECCV 2018 - 15th European Conference on Computer Vision 11205-11220 Part of the Lecture Notes in Computer Science book series - LNCS 11209 Munich, Germany September 2018, 808-831
12.2 Publications of the year
International journals
 11 article 'International electronic health record-derived COVID-19 clinical course profiles: the 4CE consortium'. npj Digital Medicine 3 1 December 2020, #109
 12 article 'Fast sampling from beta-ensembles'. Statistics and Computing 31 7 January 2021
 13 article 'Spectral bandits'. Journal of Machine Learning Research 2020
 14 article 'Interdisciplinary Research in Artificial Intelligence: Challenges and Opportunities'. Frontiers in Big Data 3 November 2020
 15 article 'The challenge of controlling microgrids in the presence of rare events with Deep Reinforcement Learning'. IET Smart Grid 2020
 16 article 'Machine learning applications in drug development'. Computational and Structural Biotechnology Journal 18 2020, 241-252
International peer-reviewed conferences
 17 inproceedings 'Restarted Bayesian Online Changepoint Detector achieves Optimal Detection Delay'. International Conference on Machine Learning Vienna, Austria July 2020
 18 inproceedings 'What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study'. ICLR 2021 - Ninth International Conference on Learning Representations Vienna / Virtual, Austria May 2021
 19 inproceedings 'Subsampling for Efficient Non-Parametric Bandit Exploration'. NeurIPS 2020 Vancouver, Canada December 2020
 20 inproceedings 'A Practical Algorithm for Multiplayer Bandits when Arm Means Vary Among Players'. AISTATS 2020 - 23rd International Conference on Artificial Intelligence and Statistics Palermo, Italy August 2020
 21 inproceedings 'The Influence of Shape Constraints on the Thresholding Bandit Problem'. COLT 2020 - Thirty Third Conference on Learning Theory 125 Graz / Virtual, Austria 2020, 1228-1275
 22 inproceedings 'HIGhER: Improving instruction following with Hindsight Generation for Experience Replay'. ADPRL 2020 - IEEE SSCI Conference on Adaptive Dynamic Programming and Reinforcement Learning Canberra / Virtual, Australia December 2020
 23 inproceedings 'Primal Wasserstein Imitation Learning'. ICLR 2021 - Ninth International Conference on Learning Representations Vienna / Virtual, Austria June 2020
 24 inproceedings 'Gamification of pure exploration for linear bandits'. International Conference on Machine Learning Vienna / Virtual, Austria 2020
 25 inproceedings 'Self-Attentional Credit Assignment for Transfer in Reinforcement Learning'. IJCAI 2020 - 29th International Joint Conference on Artificial Intelligence Yokohama / Virtual, Japan July 2020
 26 inproceedings 'Self-Imitation Advantage Learning'. AAMAS 2021 - 20th International Conference on Autonomous Agents and Multiagent Systems London / Virtual, United Kingdom May 2021
 27 inproceedings 'Adversarially Guided Actor-Critic'. ICLR 2021 - International Conference on Learning Representations Vienna / Virtual, Austria May 2021
 28 inproceedings 'Learning Value Functions in Deep Policy Gradients using Residual Variance'. ICLR 2021 - International Conference on Learning Representations Vienna / Virtual, Austria May 2021
 29 inproceedings 'Only Relevant Information Matters: Filtering Out Noisy Samples to Boost RL'. IJCAI 2020 - International Joint Conference on Artificial Intelligence Yokohama, Japan July 2020
 30 inproceedings 'Geometric Deep Reinforcement Learning for Dynamic DAG Scheduling'. IEEE SSCI 2020 - Symposium Series on Computational Intelligence SSCI 2020 proceedings Canberra / Virtual, Australia December 2020
 31 inproceedings 'Show me the Way: Intrinsic Motivation from Demonstrations'. AAMAS 2021 - 20th International Conference on Autonomous Agents and Multiagent Systems Virtual, United Kingdom May 2021
 32 inproceedings 'CopyCAT: Taking Control of Neural Policies with Constant Attacks'. AAMAS 2020 - 19th International Conference on Autonomous Agents and Multi-Agent Systems Virtual, New Zealand May 2020
 33 inproceedings 'Planning in Markov Decision Processes with Gap-Dependent Sample Complexity'. Neural Information Processing Systems Vancouver, Canada December 2020
 34 inproceedings 'Inferential Induction: A Novel Framework for Bayesian Reinforcement Learning'. "I Can't Believe It's Not Better!" at NeurIPS Workshops 137 Proceedings of Machine Learning Research Vancouver, Canada December 2020, 43-52
 35 inproceedings 'Adaptive reward-free exploration'. Algorithmic Learning Theory Paris, France 2021
 36 inproceedings 'Robust-Adaptive Control of Linear Systems: beyond Quadratic Costs'. NeurIPS 2020 - 34th Conference on Neural Information Processing Systems Vancouver / Virtual, Canada December 2020
 37 inproceedings 'Robust-Adaptive Interval Predictive Control for Linear Uncertain Systems'. CDC 2020 - 59th IEEE Conference on Decision and Control Jeju Island / Virtual, South Korea December 2020
 38 inproceedings 'Monte-Carlo Graph Search: the Value of Merging Similar States'. ACML 2020 - 12th Asian Conference on Machine Learning 129 Bangkok / Virtual, Thailand 2020, 577-602
 39 inproceedings 'Tightening Exploration in Upper Confidence Reinforcement Learning'. International Conference on Machine Learning Vienna, Austria July 2020
 40 inproceedings 'Statistical efficiency of Thompson sampling for combinatorial semi-bandits'. Neural Information Processing Systems virtual, France 2020
 41 inproceedings 'Budgeted online influence maximization'. International Conference on Machine Learning Vienna, Austria 2020
 42 inproceedings 'Covariance-adapting algorithm for semi-bandits with application to sparse outcomes'. Conference on Learning Theory Graz, Austria 2020
 43 inproceedings '"I'm sorry Dave, I'm afraid I can't do that" Deep Q-Learning From Forbidden Actions'. International Joint Conference on Neural Networks Glasgow, United Kingdom July 2020
 44 inproceedings 'A Machine of Few Words Interactive Speaker Recognition with Reinforcement Learning'. Interspeech 2020 proceedings Conference of the International Speech Communication Association (INTERSPEECH) Shanghai, China October 2020
 45 inproceedings 'Fixed-confidence guarantees for Bayesian best-arm identification'. International Conference on Artificial Intelligence and Statistics Palermo, Italy 2020
 46 inproceedings 'Solving Bernoulli Rank-One Bandits with Unimodal Thompson Sampling'. ALT 2020 - 31st International Conference on Algorithmic Learning Theory 117 San Diego, United States February 2020, 1-28
National peer-reviewed Conferences
Conferences without proceedings
 47 inproceedings 'Evaluating DAS3H on the EdNet Dataset'. AAAI 2021 - The 35th Conference on Artificial Intelligence / Imagining Post-COVID Education with AI Virtual, United States January 2021
Doctoral dissertations and habilitation theses
 48 thesis 'Safe and Efficient Reinforcement Learning for Behavioural Planning in Autonomous Driving'. Université de Lille October 2020
 49 thesis 'Efficient Learning in Stochastic Combinatorial Semi-Bandits'. Université Paris-Saclay November 2020
 50 thesis 'Multimodal and Interactive Models for Visually Grounded Language Learning'. Université de Lille; École doctorale, ED SPI 074 : Sciences pour l'Ingénieur January 2020
Reports & preprints
 51 misc 'On Multi-Armed Bandit Designs for Dose-Finding Trials'. April 2020
 52 misc 'Efficient Change-Point Detection for Tackling Piecewise-Stationary Bandits'. December 2020
 53 misc 'Regret bounds for kernel-based reinforcement learning'. Vienna, Austria April 2020
 54 misc 'SENTINEL: Taming Uncertainty with Ensemble-based Distributional Reinforcement Learning'. February 2021
 55 misc 'Adversarial Attacks on Linear Contextual Bandits'. October 2020
 56 report 'Fast active learning for pure exploration in reinforcement learning'. DeepMind July 2020
 57 misc 'Fictitious Play for Mean Field Games: Continuous Time Analysis and Applications'. September 2020
 58 misc 'Forced-exploration free Strategies for Unimodal Bandits'. June 2020
 59 misc 'Optimal Strategies for Graph-Structured Bandits'. July 2020

 60 misc 'Stochastic bandits with vector losses: Minimizing $\ell^\infty$-norm of relative losses'. October 2020
12.3 Cited publications
 61 inproceedings 'Subsampling for Multi-armed Bandits'. Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2014, Nancy, France, September 15-19, 2014. Proceedings, Part I 2014, 115-131 URL: https://doi.org/10.1007/9783662448489_8
 62 inproceedings 'Distributed Multi-Player Bandits - a Game of Thrones Approach'. Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada 2018, 7222-7232 URL: https://proceedings.neurips.cc/paper/2018/hash/c2964caac096f26db222cb325aa267cbAbstract.html
 63 article 'The multi-armed bandit problem: An efficient nonparametric solution'. The Annals of Statistics 48 1 2020, 346-373 URL: https://doi.org/10.1214/19AOS1809
 64 inproceedings 'Unimodal Bandits: Regret Lower Bounds and Optimal Algorithms'. Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014 2014, 521-529 URL: http://proceedings.mlr.press/v32/combes14.html
 65 article 'On-Line Inference for Multiple Changepoint Problems'. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 69 4 2007, 589-605 URL: http://www.jstor.org/stable/4623285
 66 article 'Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards'. J. Mach. Learn. Res. 16 2015, 3721-3756 URL: http://dl.acm.org/citation.cfm?id=2912115
 67 article 'Near-optimal Regret Bounds for Reinforcement Learning'. J. Mach. Learn. Res. 11 2010, 1563-1600 URL: http://portal.acm.org/citation.cfm?id=1859902
 68 inproceedings 'Bernoulli Rank-1 Bandits for Click Feedback'. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017 2017, 2001-2007 URL: https://doi.org/10.24963/ijcai.2017/278
 69 inproceedings 'Stochastic Rank-1 Bandits'. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA 2017, 392-401 URL: http://proceedings.mlr.press/v54/katariya17a.html
 70 inproceedings 'Unimodal Thompson Sampling for Graph-Structured Arms'. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA 2017, 2457-2463 URL: http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14325
 71 inproceedings 'Simple Bayesian Algorithms for Best Arm Identification'. Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016 2016, 1417-1418 URL: http://proceedings.mlr.press/v49/russo16.html