## Section: Scientific Foundations

### Multi-armed bandit problems

This is a stochastic problem in which a large number of arms, possibly indexed by a continuous set such as [0, 1], is available. Each arm is associated with a fixed but unknown distribution. At each round, the player chooses an arm, a payoff is drawn at random from the distribution associated with that arm, and the only feedback the player receives is the value of this payoff. The key quantity in this problem is the mean-payoff function f, which gives for each arm x the expected payoff f(x) of the associated distribution. The goal is to minimize the regret, i.e., to ensure that the difference between the cumulative payoff of the best arm and the cumulative payoff obtained by the player is small.
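One common way to formalize the regret is sketched below. The notation is an assumption introduced here for illustration and is not fixed by the text: $\mathcal{X}$ denotes the set of arms, $X_t$ the arm chosen at round $t$, $Y_t$ the payoff received, and $n$ the number of rounds.

```latex
% Expected regret after n rounds (notation assumed for illustration)
R_n \;=\; n \, \sup_{x \in \mathcal{X}} f(x) \;-\; \mathbb{E}\!\left[ \sum_{t=1}^{n} Y_t \right],
\qquad \text{where } \mathbb{E}\left[ Y_t \mid X_t \right] = f(X_t).
```

Under this reading, a "small" regret is usually understood as one that grows sublinearly in $n$, so that the average payoff of the player approaches that of the best arm.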

Typical results in the literature take the following form: if the regularity of the mean-payoff function f is known (or if a bound on it is known), then the regret is small. More precisely, these results take the following weaker form: when the algorithm is tuned with some parameters, the regret is small against a certain class of stochastic environments.
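To make the role of tuning concrete, here is a minimal Python sketch of one classical approach, not necessarily the one the text has in mind: discretize [0, 1] into bins and run the UCB1 index policy on the bin midpoints. All names (`bins_for_lipschitz`, `discretized_ucb`) are hypothetical. The bin count comes from a heuristic balance between discretization error and exploration cost, assuming f is Lipschitz with a known constant. Payoffs are assumed to be Bernoulli with mean f(x).

```python
import math
import random


def bins_for_lipschitz(lipschitz_const, horizon):
    """Heuristic bin count, assuming f is Lipschitz with a known constant.

    Balances a discretization error of order horizon * L / K against an
    exploration cost of order sqrt(K * horizon * log(horizon)).
    """
    k = (lipschitz_const ** 2 * horizon / math.log(horizon)) ** (1.0 / 3.0)
    return max(1, math.ceil(k))


def discretized_ucb(mean_payoff, horizon, n_bins, seed=0):
    """Run UCB1 on the midpoints of n_bins equal bins of [0, 1].

    Payoffs are Bernoulli with mean mean_payoff(x) (an assumption of this
    sketch). Returns the cumulative gap to the best mean payoff, where the
    best value is approximated on a fine grid.
    """
    rng = random.Random(seed)
    arms = [(k + 0.5) / n_bins for k in range(n_bins)]
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    # Approximate sup f on a fine grid, including the arms themselves.
    grid = [i / 1000 for i in range(1001)] + arms
    best = max(mean_payoff(x) for x in grid)

    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= n_bins:
            k = t - 1  # play each arm once before using the index
        else:
            k = max(
                range(n_bins),
                key=lambda j: sums[j] / counts[j]
                + math.sqrt(2.0 * math.log(t) / counts[j]),
            )
        x = arms[k]
        payoff = 1.0 if rng.random() < mean_payoff(x) else 0.0
        counts[k] += 1
        sums[k] += payoff
        regret += best - mean_payoff(x)
    return regret


if __name__ == "__main__":
    f = lambda x: 0.9 - 0.8 * abs(x - 0.3)  # Lipschitz with constant 0.8
    n = 2000
    k = bins_for_lipschitz(0.8, n)
    print(k, discretized_ucb(f, n, k))
```

The bin count depends on the Lipschitz constant, so the guarantee of such a procedure only holds for environments whose regularity matches the tuning. This is exactly the limitation described above.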

The open question is to design an adaptive procedure that, given one unknown environment (with unknown regularity), ensures that the regret is asymptotically small; it would be even better to control the regret uniformly (in a distribution-free sense, up to the regularity parameters).