Section: Research Program
Decision-making Under Uncertainty
The phrase “decision under uncertainty” refers to the problem of making decisions when we have full knowledge neither of the situation nor of the consequences of our decisions, and when the consequences of a decision are non-deterministic.
We introduce two specific subdomains: Markov decision processes, which model sequential decision problems, and bandit problems.
Reinforcement Learning
Sequential decision processes lie at the heart of the SequeL project; a detailed presentation of this problem may be found in Puterman's book [41].
A Markov Decision Process (MDP) is defined as the tuple $(\mathcal{X},\mathcal{A},P,r)$ where $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the probabilistic transition kernel, and $r:\mathcal{X}\times\mathcal{A}\times\mathcal{X}\to\mathbb{R}$ is the reward function. For the sake of simplicity, we assume in this introduction that the state and action spaces are finite. If the current state (at time $t$) is $x\in\mathcal{X}$ and the chosen action is $a\in\mathcal{A}$, then the Markov assumption means that the transition probability to a new state $x'\in\mathcal{X}$ (at time $t+1$) only depends on $(x,a)$. We write $p(x'|x,a)$ for the corresponding transition probability. During a transition $(x,a)\to x'$, a reward $r(x,a,x')$ is incurred.
In the MDP $(\mathcal{X},\mathcal{A},P,r)$, each initial state $x_{0}$ and action sequence $a_{0},a_{1},\dots$ gives rise to a sequence of states $x_{1},x_{2},\dots$ satisfying $\mathbb{P}(x_{t+1}=x'|x_{t}=x,a_{t}=a)=p(x'|x,a)$, and to a sequence of rewards $r_{0},r_{1},\dots$ defined by $r_{t}=r(x_{t},a_{t},x_{t+1})$. (Note that for simplicity we consider the case of a deterministic reward function, but in many applications the reward $r_{t}$ is itself a random variable.)
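As an illustration of how a state and reward sequence arises from an initial state and an action sequence, here is a minimal sketch in Python. The two-state, two-action MDP, the dictionary layout `P[x][a]`, and the function names `reward`, `step` and `rollout` are all hypothetical choices made for this example only.

```python
import random

# Hypothetical 2-state, 2-action MDP (illustration only).
# P[x][a] lists (x', p(x'|x,a)) pairs for the transition kernel.
P = {
    0: {0: [(0, 1.0)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}

def reward(x, a, x_next):
    """Assumed deterministic reward r(x, a, x'): 1 when landing in state 1."""
    return 1.0 if x_next == 1 else 0.0

def step(x, a, rng):
    """Sample x' from p(.|x, a) and return (x', r(x, a, x'))."""
    next_states, probs = zip(*P[x][a])
    x_next = rng.choices(next_states, weights=probs, k=1)[0]
    return x_next, reward(x, a, x_next)

def rollout(x0, actions, rng):
    """Return the visited states and the rewards r_t = r(x_t, a_t, x_{t+1})."""
    states, rewards = [x0], []
    for a in actions:
        x_next, r = step(states[-1], a, rng)
        states.append(x_next)
        rewards.append(r)
    return states, rewards

rng = random.Random(0)
states, rewards = rollout(0, [1, 1, 0, 1], rng)
```

The next state depends only on the current pair `(x, a)`, which is exactly the Markov assumption; the earlier history is never consulted in `step`.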
The history of the process up to time $t$ is defined to be $H_{t}=(x_{0},a_{0},\dots,x_{t-1},a_{t-1},x_{t})$. A policy $\pi$ is a sequence of functions $\pi_{0},\pi_{1},\dots$, where $\pi_{t}$ maps the space of possible histories at time $t$ to the space of probability distributions over the space of actions $\mathcal{A}$. To follow a policy means that, at each time step $t$, given the process history $x_{0},a_{0},\dots,x_{t}$, the probability of selecting an action $a$ is equal to $\pi_{t}(x_{0},a_{0},\dots,x_{t})(a)$. A policy is called stationary (or Markovian) if $\pi_{t}$ depends only on the last visited state. In other words, a policy $\pi=(\pi_{0},\pi_{1},\dots)$ is called stationary if $\pi_{t}(x_{0},a_{0},\dots,x_{t})=\pi_{0}(x_{t})$ holds for all $t\ge 0$. A policy is called deterministic if the probability distribution prescribed by the policy for any history is concentrated on a single action. Otherwise it is called a stochastic policy.
We move from an MD process to an MD problem by formulating the goal of the agent, that is, what the sought policy $\pi$ has to optimize. It is very often formulated as maximizing (or minimizing), in expectation, some functional of the sequence of future rewards. For example, a usual functional is the infinite-time horizon sum of discounted rewards. For a given (stationary) policy $\pi$, we define the value function $V^{\pi}(x)$ of that policy $\pi$ at a state $x\in\mathcal{X}$ as the expected sum of discounted future rewards given that we start from the initial state $x$ and follow the policy $\pi$:
$V^{\pi}(x)=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\,\middle|\,x_{0}=x,\pi\right],$  (1) 
where $\mathbb{E}$ is the expectation operator and $\gamma\in(0,1)$ is the discount factor. This value function $V^{\pi}$ gives an evaluation of the performance of a given policy $\pi$. Other functionals of the sequence of future rewards may be considered, such as the undiscounted reward (see the stochastic shortest path problems [37]) and average reward settings. Note also that, here, we considered the problem of maximizing a reward functional, but a formulation in terms of minimizing some cost or risk functional would be equivalent.
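For a finite MDP and a deterministic stationary policy, the value function of equation (1) can be computed numerically. The following is a minimal sketch, assuming a hypothetical two-state MDP and using a simple fixed-point iteration on $V(x)=\sum_{x'}p(x'|x,\pi(x))[r(x,\pi(x),x')+\gamma V(x')]$; the names `P`, `reward`, `evaluate_policy` and the tolerance `tol` are illustrative choices, not part of the text above.

```python
GAMMA = 0.9  # discount factor gamma in (0, 1)

# Hypothetical 2-state, 2-action MDP (illustration only).
# P[x][a] lists (x', p(x'|x,a)) pairs for the transition kernel.
P = {
    0: {0: [(0, 1.0)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}

def reward(x, a, x_next):
    """Assumed deterministic reward r(x, a, x'): 1 when landing in state 1."""
    return 1.0 if x_next == 1 else 0.0

def evaluate_policy(policy, tol=1e-10):
    """Iterate the fixed-point update for V^pi until the sup-norm change is below tol."""
    V = {x: 0.0 for x in P}
    while True:
        V_new = {
            x: sum(p * (reward(x, policy[x], x2) + GAMMA * V[x2])
                   for x2, p in P[x][policy[x]])
            for x in P
        }
        if max(abs(V_new[x] - V[x]) for x in P) < tol:
            return V_new
        V = V_new

policy = {0: 1, 1: 0}  # deterministic stationary policy: state -> action
V_pi = evaluate_policy(policy)
```

Because $\gamma<1$, each update shrinks the error by a factor of at least $\gamma$, so the loop terminates; for this particular example the fixed point can also be checked by hand from the two linear equations it satisfies.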
In order to maximize a given functional in a sequential framework, one usually applies Dynamic Programming (DP) [35], which introduces the optimal value function $V^{*}(x)$, defined as the optimal expected sum of rewards when the agent starts from a state $x$. We have $V^{*}(x)=\sup_{\pi}V^{\pi}(x)$. Now, let us give two definitions about policies:

We say that a policy $\pi$ is optimal, if it attains the optimal values $V^{*}(x)$ for any state $x\in\mathcal{X}$, i.e., if $V^{\pi}(x)=V^{*}(x)$ for all $x\in\mathcal{X}$. Under mild conditions, deterministic stationary optimal policies exist [36]. Such an optimal policy is written $\pi^{*}$.

We say that a (deterministic stationary) policy $\pi$ is greedy with respect to (w.r.t.) some function $V$ (defined on $\mathcal{X}$) if, for all $x\in\mathcal{X}$,
$\pi(x)\in\arg\max_{a\in\mathcal{A}}\sum_{x'\in\mathcal{X}}p(x'|x,a)\left[r(x,a,x')+\gamma V(x')\right],$ where $\arg\max_{a\in\mathcal{A}}f(a)$ is the set of $a\in\mathcal{A}$ that maximize $f(a)$. For any function $V$, such a greedy policy always exists because $\mathcal{A}$ is finite.
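The definition of a greedy policy translates directly into code for a finite MDP. The sketch below assumes a hypothetical two-state MDP; `q_value` and `greedy_policy` are illustrative names, and ties in the arg max are broken in favour of the first action, which is one admissible choice among the maximizers.

```python
GAMMA = 0.9  # discount factor gamma in (0, 1)

# Hypothetical 2-state, 2-action MDP (illustration only).
# P[x][a] lists (x', p(x'|x,a)) pairs for the transition kernel.
P = {
    0: {0: [(0, 1.0)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}

def reward(x, a, x_next):
    """Assumed deterministic reward r(x, a, x'): 1 when landing in state 1."""
    return 1.0 if x_next == 1 else 0.0

def q_value(x, a, V):
    """Sum over x' of p(x'|x,a) * [r(x,a,x') + gamma * V(x')]."""
    return sum(p * (reward(x, a, x2) + GAMMA * V[x2]) for x2, p in P[x][a])

def greedy_policy(V):
    """Pick, in every state, an action maximizing q_value (first maximizer on ties)."""
    return {x: max(P[x], key=lambda a: q_value(x, a, V)) for x in P}

pi_zero = greedy_policy({0: 0.0, 1: 0.0})    # greedy w.r.t. V = 0
pi_skew = greedy_policy({0: 100.0, 1: 0.0})  # greedy w.r.t. a V favouring state 0
```

The two calls show that the greedy policy depends on the function $V$ it is derived from: with $V=0$ only the immediate reward matters, whereas a $V$ that strongly favours state 0 makes returning to state 0 the greedy choice.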
The goal of Reinforcement Learning (RL), as well as that of dynamic programming, is to design an optimal policy (or a good approximation of it).
The well-known Dynamic Programming equation (also called the Bellman equation) provides a relation between the optimal value function at a state $x$ and the optimal value function at the successor states $x'$ when choosing an optimal action: for all $x\in\mathcal{X}$,
$V^{*}(x)=\max_{a\in\mathcal{A}}\sum_{x'\in\mathcal{X}}p(x'|x,a)\left[r(x,a,x')+\gamma V^{*}(x')\right].$  (2) 
The benefit of introducing this concept of optimal value function lies in the property that, from the optimal value function $V^{*}$, it is easy to derive an optimal behavior by choosing the actions according to a policy greedy w.r.t. $V^{*}$. Indeed, a policy greedy w.r.t. the optimal value function is an optimal policy:
$\pi^{*}(x)\in\arg\max_{a\in\mathcal{A}}\sum_{x'\in\mathcal{X}}p(x'|x,a)\left[r(x,a,x')+\gamma V^{*}(x')\right].$  (3) 
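Equations (2) and (3) suggest a simple computational scheme for small finite MDPs: iterate the right-hand side of (2) until it stabilizes, then read off a greedy policy as in (3). The sketch below uses value iteration, a standard dynamic programming method that the text does not name explicitly; the two-state MDP and all identifiers are hypothetical.

```python
GAMMA = 0.9  # discount factor gamma in (0, 1)

# Hypothetical 2-state, 2-action MDP (illustration only).
# P[x][a] lists (x', p(x'|x,a)) pairs for the transition kernel.
P = {
    0: {0: [(0, 1.0)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}

def reward(x, a, x_next):
    """Assumed deterministic reward r(x, a, x'): 1 when landing in state 1."""
    return 1.0 if x_next == 1 else 0.0

def q_value(x, a, V):
    """Sum over x' of p(x'|x,a) * [r(x,a,x') + gamma * V(x')]."""
    return sum(p * (reward(x, a, x2) + GAMMA * V[x2]) for x2, p in P[x][a])

def value_iteration(tol=1e-10):
    """Repeat V(x) <- max_a q_value(x, a, V) until the sup-norm change is below tol."""
    V = {x: 0.0 for x in P}
    while True:
        V_new = {x: max(q_value(x, a, V) for a in P[x]) for x in P}
        if max(abs(V_new[x] - V[x]) for x in P) < tol:
            return V_new
        V = V_new

def greedy_policy(V):
    """Equation (3): choose an action attaining the max in every state."""
    return {x: max(P[x], key=lambda a: q_value(x, a, V)) for x in P}

V_star = value_iteration()
pi_star = greedy_policy(V_star)
```

The returned `V_star` is only an approximation of $V^{*}$ up to the chosen tolerance; for this small example the gap between the two actions is large enough that the greedy policy derived from it is unaffected.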
In short, most of the reinforcement learning methods developed so far are built on one (or both) of the two following approaches [47]:

Bellman's dynamic programming approach, based on the introduction of the value function. It consists in learning a “good” approximation of the optimal value function, and then using it to derive a greedy policy w.r.t. this approximation. The hope (well justified in several cases) is that the performance $V^{\pi}$ of the policy $\pi$ greedy w.r.t. an approximation $V$ of $V^{*}$ will be close to optimality. This approximation issue of the optimal value function is one of the major challenges inherent to the reinforcement learning problem. Approximate dynamic programming addresses the problem of estimating performance bounds (e.g. the loss in performance $\|V^{*}-V^{\pi}\|$ resulting from using a policy $\pi$ greedy w.r.t. some approximation $V$ instead of an optimal policy) in terms of the approximation error $\|V^{*}-V\|$ of the optimal value function $V^{*}$ by $V$. Approximation theory and Statistical Learning theory provide us with bounds in terms of the number of sample data used to represent the functions, and the capacity and approximation power of the considered function spaces.

Pontryagin's maximum principle approach, based on sensitivity analysis of the performance measure w.r.t. some control parameters. This approach, also called direct policy search in the Reinforcement Learning community, aims at directly finding a good feedback control law in a parameterized policy space without trying to approximate the value function. The method consists in estimating the so-called policy gradient, i.e. the sensitivity of the performance measure (the value function) w.r.t. some parameters of the current policy. The idea is that an optimal control problem is replaced by a parametric optimization problem in the space of parameterized policies. As such, a policy gradient estimate can be used in a stochastic gradient method to search for a locally optimal parametric policy.
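To make the second approach more concrete, the following sketch estimates a policy gradient from sampled episodes and uses it in a stochastic gradient ascent. It is only one possible instance: it assumes a softmax parameterization of the policy, a likelihood-ratio (REINFORCE-style) estimator without baseline, a truncated horizon, and a hypothetical two-state MDP. The step size, horizon and number of episodes are arbitrary illustrative values.

```python
import math
import random

GAMMA = 0.9  # discount factor gamma in (0, 1)

# Hypothetical 2-state, 2-action MDP (illustration only).
# P[x][a] lists (x', p(x'|x,a)) pairs for the transition kernel.
P = {
    0: {0: [(0, 1.0)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}

def reward(x, a, x_next):
    """Assumed deterministic reward r(x, a, x'): 1 when landing in state 1."""
    return 1.0 if x_next == 1 else 0.0

def action_probs(theta, x):
    """Softmax policy pi_theta(.|x) over the two actions (assumed parameterization)."""
    m = max(theta[x])
    exps = [math.exp(v - m) for v in theta[x]]
    z = sum(exps)
    return [e / z for e in exps]

def sample_episode(theta, x0, horizon, rng):
    """Follow pi_theta for `horizon` steps; return a list of (x_t, a_t, r_t)."""
    x, episode = x0, []
    for _ in range(horizon):
        a = rng.choices([0, 1], weights=action_probs(theta, x), k=1)[0]
        next_states, probs = zip(*P[x][a])
        x_next = rng.choices(next_states, weights=probs, k=1)[0]
        episode.append((x, a, reward(x, a, x_next)))
        x = x_next
    return episode

def reinforce(episodes=3000, horizon=30, step_size=0.01, seed=0):
    """Stochastic gradient ascent on the truncated discounted return from state 0."""
    rng = random.Random(seed)
    theta = {0: [0.0, 0.0], 1: [0.0, 0.0]}
    for _ in range(episodes):
        episode = sample_episode(theta, 0, horizon, rng)
        grad = {0: [0.0, 0.0], 1: [0.0, 0.0]}
        ret = 0.0  # discounted return from step t onwards
        for t in reversed(range(horizon)):
            x, a, r = episode[t]
            ret = r + GAMMA * ret
            probs = action_probs(theta, x)
            for b in (0, 1):
                # d/d theta[x][b] of log pi(a|x) equals 1[a == b] - pi(b|x)
                score = (1.0 if b == a else 0.0) - probs[b]
                grad[x][b] += (GAMMA ** t) * ret * score
        for x in theta:
            for b in (0, 1):
                theta[x][b] += step_size * grad[x][b]
    return theta

theta = reinforce()
```

No value function is approximated here: the parameters `theta` are adjusted directly from sampled returns. As with any stochastic gradient method, the result depends on the step size and the random seed, and only convergence to a local optimum can be hoped for in general.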
Finally, many extensions of Markov decision processes exist, among which Partially Observable MDPs (POMDPs) cover the case where the current observation does not contain all the information required to determine the best action with certainty.
Multi-arm Bandit Theory
Bandit problems illustrate the fundamental difficulty of decision making in the face of uncertainty: a decision maker must choose between taking what seems to be the best choice (“exploit”) and testing some alternative (“explore”), hoping to discover a choice that beats the current best choice.
The classical example of a bandit problem is deciding what treatment to give each patient in a clinical trial when the effectiveness of the treatments is initially unknown and the patients arrive sequentially. These bandit problems became popular with the seminal paper [42], after which they have found applications in diverse fields, such as control, economics, statistics, or learning theory.
Formally, a $K$-armed bandit problem ($K\ge 2$) is specified by $K$ real-valued distributions. In each time step a decision maker can select one of the distributions to obtain a sample from it. The samples obtained are considered as rewards. The distributions are initially unknown to the decision maker, whose goal is to maximize the sum of the rewards received, or equivalently, to minimize the regret, which is defined as the loss compared to the total payoff that can be achieved given full knowledge of the problem, i.e., when the arm giving the highest expected reward is pulled all the time.
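The notion of regret can be made concrete with a small computation. The sketch below evaluates the expected regret of a given sequence of pulls, assuming the expected reward of each arm is known to the analyst (it is of course unknown to the decision maker). The function name `pseudo_regret` and the numerical means are hypothetical.

```python
def pseudo_regret(means, pulls):
    """Expected loss of the pull sequence relative to always pulling the best arm."""
    best = max(means)
    return sum(best - means[k] for k in pulls)

means = [0.2, 0.5, 0.4]                  # hypothetical expected rewards, K = 3
always_best = [1] * 9                    # full knowledge: pull arm 1 every time
round_robin = [t % 3 for t in range(9)]  # no learning: cycle through the arms

regret_best = pseudo_regret(means, always_best)
regret_cycle = pseudo_regret(means, round_robin)
```

A strategy that never adapts, such as the round-robin sequence, accumulates regret linearly in the number of pulls, whereas pulling the best arm every time incurs none.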
The name “bandit” comes from imagining a gambler playing with $K$ slot machines. The gambler can pull the arm of any of the machines, which produces a random payoff as a result: when arm $k$ is pulled, the random payoff is drawn from the distribution associated to $k$. Since the payoff distributions are initially unknown, the gambler must use exploratory actions to learn the utility of the individual arms. However, exploration has to be carefully controlled since excessive exploration may lead to unnecessary losses. Hence, to play well, the gambler must carefully balance exploration and exploitation. Auer et al. [34] introduced the algorithm UCB (Upper Confidence Bounds) that follows what is now called the “optimism in the face of uncertainty principle”. Their algorithm works by computing upper confidence bounds for all the arms and then choosing the arm with the highest such bound. They proved that the expected regret of their algorithm increases at most at a logarithmic rate with the number of trials, and that the algorithm achieves the smallest possible regret up to some sub-logarithmic factor (for the considered family of distributions).
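The upper-confidence-bound idea can be sketched in a few lines. The code below follows the UCB1 variant, whose index is the empirical mean plus $\sqrt{2\ln t/n_k}$, where $n_k$ is the number of pulls of arm $k$ so far; that this is the exact variant meant above is an assumption. The Bernoulli arms, their means, the horizon and the function name `ucb1` are hypothetical choices for the illustration.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms with hidden `means`; return pull counts and reward sums."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k    # n_k: number of times each arm was pulled
    sums = [0.0] * k    # total reward collected from each arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialisation: pull every arm once
        else:
            # optimism: pick the arm with the highest upper confidence bound
            arm = max(
                range(k),
                key=lambda j: sums[j] / counts[j]
                + math.sqrt(2.0 * math.log(t) / counts[j]),
            )
        payoff = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += payoff
    return counts, sums

counts, sums = ucb1([0.2, 0.8], horizon=2000)
```

The confidence term shrinks as an arm is pulled more often, so rarely tried arms keep being explored occasionally while the empirically best arm receives most of the pulls.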