
Section: Research Program

Decision-making Under Uncertainty

The phrase “decision under uncertainty” refers to the problem of making decisions when we have full knowledge neither of the situation nor of the consequences of our decisions, or when those consequences are non-deterministic.

We introduce two specific sub-domains: Markov decision processes, which model sequential decision problems, and bandit problems.

Reinforcement Learning

Sequential decision processes are at the heart of the SequeL project; a detailed presentation of this problem may be found in Puterman's book [46].

A Markov Decision Process (MDP) is defined as the tuple $(\mathcal{X}, \mathcal{A}, P, r)$, where $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the probabilistic transition kernel, and $r : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to \mathbb{R}$ is the reward function. For the sake of simplicity, we assume in this introduction that the state and action spaces are finite. If the current state (at time $t$) is $x \in \mathcal{X}$ and the chosen action is $a \in \mathcal{A}$, then the Markov assumption means that the transition probability to a new state $x' \in \mathcal{X}$ (at time $t+1$) depends only on $(x, a)$. We write $p(x' \mid x, a)$ for the corresponding transition probability. During a transition $(x, a) \to x'$, a reward $r(x, a, x')$ is incurred.

In the MDP $(\mathcal{X}, \mathcal{A}, P, r)$, each initial state $x_0$ and action sequence $a_0, a_1, \dots$ gives rise to a sequence of states $x_1, x_2, \dots$ satisfying $\mathbb{P}(x_{t+1} = x' \mid x_t = x, a_t = a) = p(x' \mid x, a)$, and to rewards $r_1, r_2, \dots$ defined by $r_t = r(x_t, a_t, x_{t+1})$. (Note that, for simplicity, we consider the case of a deterministic reward function; in many applications, the reward $r_t$ is itself a random variable.)
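As a minimal sketch of how a fixed action sequence generates states and rewards, the following samples one trajectory from a small finite MDP. The two-state transition table `P`, the `reward` function, and the helper name `rollout` are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical 2-state, 2-action MDP (illustration only).
# P[x][a] lists the probabilities p(x' | x, a) over next states x' = 0, 1.
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.5, 0.5], 1: [0.0, 1.0]},
}

def reward(x, a, x_next):
    """Deterministic reward r(x, a, x'): 1 when the process lands in state 1."""
    return 1.0 if x_next == 1 else 0.0

def rollout(x0, actions, rng):
    """Sample the states and rewards produced by a fixed action sequence."""
    states, rewards = [x0], []
    x = x0
    for a in actions:
        # Markov assumption: the next state depends only on (x, a).
        x_next = rng.choices([0, 1], weights=P[x][a])[0]
        rewards.append(reward(x, a, x_next))
        states.append(x_next)
        x = x_next
    return states, rewards

states, rewards = rollout(0, [1, 1, 0, 1], random.Random(0))
```

Each reward is computed from the triple (state, action, next state), so the `rewards` list is one element shorter than `states`.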

The history of the process up to time $t$ is defined to be $H_t = (x_0, a_0, \dots, x_{t-1}, a_{t-1}, x_t)$. A policy $\pi$ is a sequence of functions $\pi_0, \pi_1, \dots$, where $\pi_t$ maps the space of possible histories at time $t$ to the space of probability distributions over the action space $\mathcal{A}$. Following a policy means that, at each time step $t$, given the process history $x_0, a_0, \dots, x_t$, action $a$ is selected with probability $\pi_t(x_0, a_0, \dots, x_t)(a)$. A policy is called stationary (or Markovian) if $\pi_t$ depends only on the last visited state. In other words, a policy $\pi = (\pi_0, \pi_1, \dots)$ is called stationary if $\pi_t(x_0, a_0, \dots, x_t) = \pi_0(x_t)$ holds for all $t \ge 0$. A policy is called deterministic if the probability distribution prescribed by the policy for any history is concentrated on a single action. Otherwise it is called a stochastic policy.
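The distinction between stationary, deterministic, and stochastic policies can be made concrete with a small sketch. Histories are represented here as tuples (x_0, a_0, ..., x_t); the table `PI0` and all function names are assumptions made for this example.

```python
import random

# Hypothetical stationary stochastic policy on states {0, 1} and actions {0, 1}:
# PI0[x][a] is the probability of choosing action a in state x.
PI0 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.0, 1: 1.0}}

def stationary_policy(history):
    """pi_t(x_0, a_0, ..., x_t) = pi_0(x_t): only the last visited state is used."""
    return PI0[history[-1]]

def is_concentrated(dist):
    """True if the distribution puts all its mass on a single action."""
    return max(dist.values()) == 1.0

def is_deterministic(policy_table):
    """A stationary policy is deterministic if every state's distribution is concentrated."""
    return all(is_concentrated(d) for d in policy_table.values())

def sample_action(policy, history, rng):
    """Draw an action a with probability pi_t(history)(a)."""
    dist = policy(history)
    actions = list(dist)
    return rng.choices(actions, weights=[dist[a] for a in actions])[0]

# Two different histories (x_0, a_0, x_1, a_1, x_2) that end in the same state.
h1 = (0, 1, 1, 0, 0)
h2 = (1, 1, 1, 1, 0)
same_distribution = stationary_policy(h1) == stationary_policy(h2)
```

In this example `PI0` is stochastic, because state 0 randomizes between both actions, even though its prescription in state 1 is concentrated on a single action.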

We move from a Markov decision process to a Markov decision problem by formulating the goal of the agent, that is, what the sought policy $\pi$ has to optimize. This goal is very often formulated as maximizing (or minimizing), in expectation, some functional of the sequence of future rewards. For example, a usual functional is the infinite-time horizon sum of discounted rewards. For a given (stationary) policy $\pi$, we define the value function $V^\pi(x)$ of that policy at a state $x \in \mathcal{X}$ as the expected sum of discounted future rewards given that we start from the initial state $x$ and follow the policy $\pi$:

$$V^\pi(x) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t \, r(x_t, a_t, x_{t+1}) \,\Big|\, x_0 = x;\ \pi\Big],$$

where $\mathbb{E}$ is the expectation operator and $\gamma \in (0,1)$ is the discount factor. This value function $V^\pi$ gives an evaluation of the performance of a given policy $\pi$. Other functionals of the sequence of future rewards may be considered, such as the undiscounted reward (see the stochastic shortest path problems [45]) and average reward settings. Note also that, here, we considered the problem of maximizing a reward functional, but a formulation in terms of minimizing some cost or risk functional would be equivalent.
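For a finite MDP, the value function of a stationary policy can be approximated by repeatedly applying the one-step expectation until the values stop changing. The sketch below does this for a hypothetical two-state MDP; the tables `P` and `always_1`, the discount `GAMMA = 0.9`, the tolerance, and the name `evaluate_policy` are all assumptions made for illustration, and this iterative scheme is only one of several ways to evaluate a policy.

```python
GAMMA = 0.9
STATES, ACTIONS = [0, 1], [0, 1]

# Hypothetical MDP: P[x][a][y] = p(y | x, a); reward 1 when landing in state 1.
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.5, 0.5], 1: [0.0, 1.0]},
}

def reward(x, a, x_next):
    return 1.0 if x_next == 1 else 0.0

def evaluate_policy(pi, tol=1e-12):
    """Approximate V^pi for a stationary policy, where pi[x][a] is the
    probability of picking action a in state x.

    Iterates V(x) <- sum_a pi(a|x) sum_y p(y|x,a) [r(x,a,y) + gamma V(y)]
    until the largest update falls below tol.
    """
    V = {x: 0.0 for x in STATES}
    while True:
        V_new = {
            x: sum(
                pi[x][a] * P[x][a][y] * (reward(x, a, y) + GAMMA * V[y])
                for a in ACTIONS
                for y in STATES
            )
            for x in STATES
        }
        if max(abs(V_new[x] - V[x]) for x in STATES) < tol:
            return V_new
        V = V_new

# Deterministic stationary policy that always plays action 1.
always_1 = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 1.0}}
V = evaluate_policy(always_1)
```

Under this policy state 1 is absorbing and pays 1 at every step, so its value is 1/(1 - 0.9) = 10; the value of state 0 then follows from solving V(0) = 0.8 (1 + 9) + 0.2 (0.9 V(0)), which gives 8/0.82.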

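In order to maximize a given functional in a sequential framework, one usually applies Dynamic Programming (DP) [43], which introduces the optimal value function $V^*(x)$, defined as the optimal expected sum of rewards when the agent starts from a state $x$. We have $V^*(x) = \sup_\pi V^\pi(x)$. Now, let us give two definitions about policies:

We say that a policy $\pi$ is optimal if $V^\pi(x) = V^*(x)$ for all $x \in \mathcal{X}$.

We say that a (deterministic stationary) policy $\pi$ is greedy with respect to (w.r.t.) some function $V$ defined on $\mathcal{X}$ if, for all $x \in \mathcal{X}$, $\pi(x) \in \arg\max_{a \in \mathcal{A}} \sum_{x' \in \mathcal{X}} p(x' \mid x, a) \, [r(x, a, x') + \gamma V(x')]$.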

The goal of Reinforcement Learning (RL), as well as that of dynamic programming, is to design an optimal policy (or a good approximation of it).

The well-known Dynamic Programming equation (also called the Bellman equation) relates the optimal value function at a state $x$ to the optimal value function at the successor states $x'$ when choosing an optimal action: for all $x \in \mathcal{X}$,

$$V^*(x) = \max_{a \in \mathcal{A}} \sum_{x' \in \mathcal{X}} p(x' \mid x, a) \, [r(x, a, x') + \gamma V^*(x')].$$
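One standard way to exploit the Bellman equation in a finite MDP is value iteration, which repeatedly replaces V(x) by the maximum over actions of the expected reward plus discounted next-state value. The text does not name this algorithm; it is used here as an illustrative choice, and the two-state MDP, `GAMMA`, tolerance, and function names are assumptions.

```python
GAMMA = 0.9
STATES, ACTIONS = [0, 1], [0, 1]

# Hypothetical MDP: P[x][a][y] = p(y | x, a); reward 1 when landing in state 1.
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.5, 0.5], 1: [0.0, 1.0]},
}

def reward(x, a, x_next):
    return 1.0 if x_next == 1 else 0.0

def bellman_backup(V, x):
    """max over a of sum_y p(y|x,a) [r(x,a,y) + gamma V(y)]."""
    return max(
        sum(P[x][a][y] * (reward(x, a, y) + GAMMA * V[y]) for y in STATES)
        for a in ACTIONS
    )

def value_iteration(tol=1e-12):
    """Iterate the Bellman backup from V = 0 until updates fall below tol."""
    V = {x: 0.0 for x in STATES}
    while True:
        V_new = {x: bellman_backup(V, x) for x in STATES}
        if max(abs(V_new[x] - V[x]) for x in STATES) < tol:
            return V_new
        V = V_new

V_star = value_iteration()
```

Because the discount factor is strictly below 1, each backup shrinks the distance to the fixed point by a factor of at least `GAMMA`, which is why the loop terminates.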

The benefit of introducing the optimal value function lies in the fact that, from $V^*$, it is easy to derive an optimal behavior by choosing actions according to a policy greedy w.r.t. $V^*$. Indeed, a policy greedy w.r.t. the optimal value function is an optimal policy:

$$\pi^*(x) \in \arg\max_{a \in \mathcal{A}} \sum_{x' \in \mathcal{X}} p(x' \mid x, a) \, [r(x, a, x') + \gamma V^*(x')].$$
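Extracting a greedy policy from a value function amounts to a one-step lookahead in each state. The sketch below assumes a hypothetical two-state MDP whose optimal values were worked out by hand (state 1 can collect reward 1 forever, so its value is 10; state 0 then has value 8/0.82); the helper names are illustrative.

```python
GAMMA = 0.9
STATES, ACTIONS = [0, 1], [0, 1]

# Hypothetical MDP: P[x][a][y] = p(y | x, a); reward 1 when landing in state 1.
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.5, 0.5], 1: [0.0, 1.0]},
}

def reward(x, a, x_next):
    return 1.0 if x_next == 1 else 0.0

def lookahead(V, x, a):
    """Expected reward plus discounted value after playing a in x."""
    return sum(P[x][a][y] * (reward(x, a, y) + GAMMA * V[y]) for y in STATES)

def greedy_policy(V):
    """Deterministic stationary policy that maximizes the one-step lookahead."""
    return {x: max(ACTIONS, key=lambda a: lookahead(V, x, a)) for x in STATES}

# Optimal values of this toy MDP, derived by hand for the example.
V_STAR = {0: 8 / 0.82, 1: 10.0}
pi_star = greedy_policy(V_STAR)
```

Here the greedy policy plays action 1 in both states, and the lookahead value of the chosen action reproduces `V_STAR`, as the Bellman equation requires.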

In short, most reinforcement learning methods developed so far are built on one (or both) of the following two approaches [49]:

Finally, many extensions of Markov decision processes exist. Among them, Partially Observable MDPs (POMDPs) cover the case where the current state does not contain all the information required to determine the best action with certainty.

Multi-arm Bandit Theory

Bandit problems illustrate the fundamental difficulty of decision making in the face of uncertainty: a decision maker must choose between taking what seems to be the best choice (“exploit”) and testing some alternative (“explore”), hoping to discover a choice that beats the current best one.

The classical example of a bandit problem is deciding what treatment to give each patient in a clinical trial when the effectiveness of the treatments is initially unknown and the patients arrive sequentially. These bandit problems became popular with the seminal paper [47], after which they found applications in diverse fields, such as control, economics, statistics, and learning theory.

Formally, a $K$-armed bandit problem ($K \ge 2$) is specified by $K$ real-valued distributions. At each time step, a decision maker selects one of the distributions and obtains a sample from it. The samples obtained are considered as rewards. The distributions are initially unknown to the decision maker, whose goal is to maximize the sum of the rewards received or, equivalently, to minimize the regret, defined as the loss compared to the total payoff that can be achieved given full knowledge of the problem, i.e., when the arm giving the highest expected reward is pulled all the time.
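To make the notion of regret concrete, the sketch below computes the expected regret of a given sequence of pulls when the mean reward of each arm is known. The arm means, the pull sequence, and the function name are hypothetical; note that this is the regret in expectation, whereas the realized regret would also depend on the random samples.

```python
def expected_regret(means, pulls):
    """Expected loss of a pull sequence relative to always pulling the best arm.

    means[k] is the expected reward of arm k; pulls is the list of arms chosen.
    """
    best = max(means)
    return sum(best - means[k] for k in pulls)

# Hypothetical 3-armed problem and a sequence of four pulls.
means = [0.2, 0.5, 0.9]
pulls = [0, 2, 2, 1]
regret = expected_regret(means, pulls)
```

Only the pulls of arms 0 and 1 contribute, with gaps 0.7 and 0.4 respectively; pulling the best arm (index 2) adds nothing to the regret.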

The name “bandit” comes from imagining a gambler playing with $K$ slot machines. The gambler can pull the arm of any of the machines, which produces a random payoff as a result: when arm $k$ is pulled, the random payoff is drawn from the distribution associated with $k$. Since the payoff distributions are initially unknown, the gambler must use exploratory actions to learn the utility of the individual arms. However, exploration has to be carefully controlled, since excessive exploration may lead to unnecessary losses. Hence, to play well, the gambler must carefully balance exploration and exploitation. Auer et al. [42] introduced the algorithm UCB (Upper Confidence Bounds), which follows what is now called the “optimism in the face of uncertainty” principle. Their algorithm works by computing upper confidence bounds for all the arms and then choosing the arm with the highest such bound. They proved that the expected regret of their algorithm increases at most at a logarithmic rate with the number of trials, and that the algorithm achieves the smallest possible regret up to some sub-logarithmic factor (for the considered family of distributions).
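The upper-confidence-bound idea can be sketched as follows for Bernoulli arms. This uses one common form of the index (often called UCB1), namely the empirical mean plus sqrt(2 ln t / n_k); the exact bound analyzed in [42] is not reproduced in this section, so the constant, the Bernoulli reward model, the arm means, and the function name are assumptions.

```python
import math
import random

def ucb1(means, horizon, rng):
    """Run a UCB1-style strategy on Bernoulli arms; return the pull counts."""
    K = len(means)
    counts = [0] * K   # number of pulls of each arm
    sums = [0.0] * K   # total payoff collected from each arm
    for t in range(1, horizon + 1):
        if t <= K:
            k = t - 1  # initialization: pull every arm once
        else:
            # Optimism: pick the arm with the highest upper confidence bound.
            k = max(
                range(K),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        payoff = 1.0 if rng.random() < means[k] else 0.0
        counts[k] += 1
        sums[k] += payoff
    return counts

# Hypothetical problem: the third arm has the highest expected payoff.
counts = ucb1([0.2, 0.5, 0.8], horizon=2000, rng=random.Random(0))
```

The bonus term shrinks as an arm is pulled more often, so rarely tried arms keep being explored occasionally while most pulls concentrate on the arm with the best empirical record.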