You can access my full list of publications on the HAL depository or my google-scholar profile.

## 2019

**Budgeted Reinforcement Learning in Continuous State Space**, Nicolas Carrara, Edouard Leurent, Romain Laroche, Tanguy Urvoy, Odalric-Ambrym Maillard, Olivier Pietquin. Advances in Neural Information Processing Systems 32 (NIPS 2019) Link, HaL.

*«*A Budgeted Markov Decision Process (BMDP) is an extension of a MarkovDecision Process to critical applications requiring safety constraints. It relies on anotion of risk implemented in the shape of a cost signal constrained to lie belowan – adjustable – threshold. So far, BMDPs could only be solved in the case offinite state spaces with known dynamics. This work extends the state-of-the-artto continuous spaces environments and unknown dynamics. We show that thesolution to a BMDP is a fixed point of a novel Budgeted Bellman Optimalityoperator. This observation allows us to introduce natural extensions of DeepReinforcement Learning algorithms to address large-scale BMDPs. We validateour approach on two simulated applications: spoken dialogue and autonomousdriving.»

**Regret Bounds for Learning State Representations in Reinforcement Learning**, Ronald Ortner, Matteo Pirotta, Alessandro Lazaric, Ronan Fruit, Odalric-Ambrym Maillard. Advances in Neural Information Processing Systems 32 (NIPS 2019) Link, HaL.

*«*We consider the problem of online reinforcement learning when several staterepresentations (mapping histories to a discrete state space) are available to thelearning agent. At least one of these representations is assumed to induce a Markovdecision process (MDP), and the performance of the agent is measured in terms ofcumulative regret against the optimal policy giving the highest average reward inthis MDP representation. We propose an algorithm (UCB-MS) with \tildeO(\sqrt(T)) regret in any communicating MDP. The regret bound shows thatUCB-MSautomaticallyadapts to the Markov model and improves over the currently known best bound oforder \tilde O(T^{2/3}).»

**Learning Multiple Markov Chains via Adaptive Allocation**, Mohammad Sadegh Talebi, Odalric-Ambrym Maillard, Advances in Neural Information Processing Systems 32 (NIPS 2019) Link, Arxiv, HaL.

*«*We study the problem of learning the transition matrices of a set of Markov chains from a single stream of observations on each chain. We assume that the Markov chains are ergodic but otherwise unknown. The learner can sample Markov chains sequentially to observe their states. The goal of the learner is to sequentially select various chains to learn transition matrices uniformly well with respect to some loss function. We introduce a notion of loss that naturally extends the squared loss for learning distributions to the case of Markov chains, and further characterize the notion of being \emph{uniformly good} in all problem instances. We present a novel learning algorithm that efficiently balances \emph{exploration} and \emph{exploitation} intrinsic to this problem, without any prior knowledge of the chains. We provide finite-sample PAC-type guarantees on the performance of the algorithm. Further, we show that our algorithm asymptotically attains an optimal loss.»

**Practical Open-Loop Optimistic Planning, **Edouard Leurent, Odalric-Ambrym Maillard, European Conference on Machine Learning (ECML), 2019. HaL, Conference website.

*«We consider the problem of online planning in a MarkovDecision Process when given only access to a generative model, restrictedto open-loop policies – i.e. sequences of actions – and under budgetconstraint. In this setting, the Open-Loop Optimistic Planning(OLOP) algorithm enjoys good theoretical guarantees but is overly conservativein practice, as we show in numerical experiments. We propose a modifiedversion of the algorithm with tighter upper-confidence bounds, KL-OLOP,that leads to better practical performances while retaining the samplecomplexity bound. Finally, we propose an efficient implementation that significantly improves the time complexity of both algorithms.*»

**[Best student paper award] Model-Based Reinforcement Learning Exploiting State-Action Equivalence, **Mahsa Asadi, Mohammad Sadegh Talebi, Hippolyte Bourel, Odalric-Ambrym Maillard, Asian Conference on Machine Learning (ACML), 2019. HaL, ArXiv.

*«Leveraging an equivalence property in the state-space of a Markov Decision Process (MDP) has been investigated in several studies. This paper studies equivalence structure in the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known. We present a notion of similarity between transition probabilities of various state-action pairs of an MDP, which naturally defines an equivalence structure in the state-action space. We present equivalence-aware confidence sets for the case where the learner knows the underlying structure in advance. These sets are provably smaller than their corresponding equivalence-oblivious counterparts. In the more challenging case of an unknown equivalence structure, we present an algorithm called ApproxEquivalence that seeks to find an (approximate) equivalence structure, and define confidence sets using the approximate equivalence. To illustrate the efficacy of the presented confidence sets, we present C-UCRL, as a natural modification of UCRL2 for RL in undiscounted MDPs. In the case of a known equivalence structure, we show that C-UCRL improves over UCRL2 in terms of regret by a factor of \sqrt{SA/C}, in any communicating MDP with S states, A actions, and C classes, which corresponds to a massive improvement when C≪SA. To the best of our knowledge, this is the first work providing regret bounds for RL when an equivalence structure in the MDP is efficiently exploited. In the case of an unknown equivalence structure, we show through numerical experiments that C-UCRL combined with ApproxEquivalence outperforms UCRL2 in ergodic MDPs.*»

**Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds***, Odalric-Ambrym Maillard, Algorithmic Learning Theory (ALT), 2019. HaL.
*

*«We consider change-point detection in a fully sequential setup, when observations are received one by one and one must raise an alarm as early as possible after any change. We assume that both the change points and the distributions before and after the change are unknown. We consider the class of piecewise-constant mean processes with sub-Gaussian noise, and we target a detection strategy that is uniformly good on this class (this constrains the false alarm rate and detection delay). We introduce a novel tuning of the GLR test that takes here a simple form involving scan statistics, based on a novel sharp concentration inequality using an extension of the Laplace method for scan-statistics that holds doubly-uniformly in time. This also considerably simplifies the implementation of the test and analysis. We provide (perhaps surprisingly) the first fully non-asymptotic analysis of the detection delay of this test that matches the known existing asymptotic orders, with fully explicit numerical constants. Then, we extend this analysis to allow some changes that are not-detectable by any uniformly-good strategy (the number of observations before and after the change are too small for it to be detected by any such algorithm), and provide the first robust, finite-time analysis of the detection delay.*»

[Erratum: We noticed that the proof of Th.4 is actually incomplete. We are in the process of completing the proof. While the theorem seems to valid as is, it seems with the current proof technique, one should add a loglog correction term to the reported bound.]

**Habilitation thesis: Mathematics of Statistical Sequential Decision Making**, *Odalric-Ambrym Maillard, HaL.*

## 2018

**Approximate Robust Control of Uncertain Dynamical Systems, ***
NeurIPS 2018 – 32nd Conference on Neural Information Processing Systems*, ,MLITS workshop, Dec 2018, Montréal, Canad. HaL, ArXiv.

**Streaming kernel regression with provably adaptive mean, variance, and regularization,** Audrey Durand, Odalric-Ambrym Maillard, Joelle Pineau**, **Journal of Machine Learning Research (JMLR), 2018. HaL, ArXiv.

*«We consider the problem of streaming kernel regression, when the observations arrive sequentially and the goal is to recover the underlying mean function, assumed to belong to an RKHS. The variance of the noise is not assumed to be known. In this context, we tackle the problem of tuning the regularization parameter adaptively at each time step, while maintaining tight confidence bounds estimates on the value of the mean function at each point. To this end, we first generalize existing results for finite-dimensional linear regression with fixed regularization and known variance to the kernel setup with a regularization parameter allowed to be a measurable function of past observations. Then, using appropriate self-normalized inequalities we build upper and lower bound estimates for the variance, leading to Bersntein-like concentration bounds. The later is used in order to define the adaptive regularization. The bounds resulting from our technique are valid uniformly over all observation points and all time steps, and are compared against the literature with numerical experiments. Finally, the potential of these tools is illustrated by an application to kernelized bandits, where we revisit the Kernel UCB and Kernel Thompson Sampling procedures, and show the benefits of the novel adaptive kernel tuning strategy.*»

**Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs**, M. S. Talebi, O.-A. Maillard, Algorithmic Learning Theory (ALT), 2018. HaL

*«The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered, when the learner interacts with the system in a single stream of observations, starting from an initial state without any reset. We revisit the minimax lower bound for that problem by making appear the local variance of the bias function in place of the diameter of the MDP. Furthermore, we provide a novel analysis of the KLUCRL algorithm establishing a high-probability regret bound scaling as \tilde O(\sqrt{S\sum_{s,a}V^\star_{s,a}T) for this algorithm for ergodic MDPs, where S denotes the number of states and where V^\star_{s,a} is the variance of the bias function with respect to the next-state distribution following action a in state s. The resulting bound improves upon the best previously known regret bound \tildeO(DS\sqrt{AT}) for that algorithm, where A and D respectively denote the maximum number of actions (per state) and the diameter of MDP. We finally compare the leading terms of the two bounds in some benchmark MDPs indicating that the derived bound can provide an order of magnitude improvement in some cases. Our analysis leverages novel variations of the transportation lemma combined with Kullback-Leibler concentration inequalities, that we believe to be of independent interest.*»

## 2017

**Efficient tracking of a growing number of experts**, J. Mourtada, O.-A. Maillard, Algorithmic Learning Theory (ALT), 2017. HaL.

*«We consider a variation on the problem of prediction with expert advice, where new forecasters that were unknown until then may appear at each round. As often in prediction with expert advice, designing an algorithm that achieves near-optimal regret guarantees is straightforward, using aggregation of experts. However, when the comparison class is sufficiently rich, for instance when the best expert and the set of experts itself change over time, such strategies require to maintain a prohibitive number of weights (typically exponential with the time horizon). By contrast, designing strategies that both achieve a near-optimal regret and maintain a reasonable number of weights is highly non-trivial. We consider three increasingly challenging objectives (simple regret, shifting regret and sparse shifting regret) that extend existing notions defined for a fixed expert ensemble; in each case, we design strategies that achieve tight regret bounds, adaptive to the parameters of the comparison class, while being computationally inexpensive. Moreover, our algorithms are anytime, agnostic to the number of incoming experts and completely parameter-free. Such remarkable results are made possible thanks to two simple but highly effective recipes: first the “abstention trick” that comes from the specialist framework and enables to handle the least challenging notions of regret, but is limited when addressing more sophisticated objectives. Second, the “muting trick” that we introduce to give more flexibility. We show how to combine these two tricks in order to handle the most challenging class of comparison strategies.»*

**Boundary Crossing Probabilities for General Exponential Families**, O.-A. Maillard, Algorithmic Learning Theory (ALT), 2017. Publisher website. Extended version in Mathematical Methods of Statistics.

*«We consider parametric exponential families of dimension K on the real line. We study a variant of \textit{boundary crossing probabilities} coming from the multi-armed bandit literature, in the case when the real-valued distributions form an exponential family of dimension K. Formally, our result is a concentration inequality that bounds the probability that B^ψ(θ^n,θ⋆)≥f(t/n)/n, where θ⋆ is the parameter of an unknown target distribution, θ^n is the empirical parameter estimate built from n observations, ψ is the log-partition function of the exponential family and Bψ is the corresponding Bregman divergence. From the perspective of stochastic multi-armed bandits, we pay special attention to the case when the boundary function f is logarithmic, as it is enables to analyze the regret of the state-of-the-art KL-UCB and KL-UCB+ strategies, whose analysis was left open in such generality. Indeed, previous results only hold for the case when K=1, while we provide results for arbitrary finite dimension K, thus considerably extending the existing results. Perhaps surprisingly, we highlight that the proof techniques to achieve these strong results already existed three decades ago in the work of T.L. Lai, and were apparently forgotten in the bandit community. We provide a modern rewriting of these beautiful techniques that we believe are useful beyond the application to stochastic multi-armed bandits.»*

**Spectral Learning from a Single Trajectory under Finite-State Policies**, B. Balle, O.-A. Maillard in International conference on Machine Learning, 2017.

*«**We present spectral methods of moments for **learning sequential models from a single trajectory, **in stark contrast with the classical literature **that assumes the availability of multiple i.i.d. **trajectories. Our approach leverages an efficient* *SVD-based learning algorithm for weighted automata **and provides the first rigorous analysis **for learning many important models using dependent* *data. We state and analyze the algorithm under **three increasingly difficult scenarios: probabilistic **automata, stochastic weighted automata, **and reactive predictive state representations controlled* *by a finite-state policy. Our proofs include **novel tools for studying mixing properties **of stochastic weighted automata.** »*

## 2016

**The non-stationary stochastic multi-armed bandit problem**, R. Allesiardo, R. Féraud, O.-A. Maillard, in *International Journal of Data Science and Analytics*, 2016. Publisher link.

*«We consider a variant of the stochastic multi-armed bandit with K arms where the rewards are not assumed to be identically distributed, but are generated by a non-stationary stochastic process. We first study the unique best arm setting when there exists one unique best arm. Second, we study the general switching best arm setting when a best arm switches at some unknown steps. For both settings, we target problem-dependent bounds, instead of the more conservative problem free bounds. We consider two classical problems: 1) Identify a best arm with high probability (best arm identification), for which the performance measure by the sample complexity (number of samples before finding a near optimal arm). To this end, we naturally extend the definition of sample complexity so that it makes sense in the switching best arm setting, which may be of independent interest. 2) Achieve the smallest cumulative regret (regret minimization) where the regret is measured with respect to the strategy pulling an arm with the best instantaneous mean at each step. »*

**Random Shuffling and Resets for the Non-stationary Stochastic Bandit Problem**, R. Allesiardo, R. Féraud, O.-A. Maillard, 2016 Arxiv. HaL.

*«We consider a non-stationary formulation of the stochastic multi-armed bandit where the rewards are no longer assumed to be identically distributed. For the best-arm identification task, we introduce a version of Successive Elimination based on random shuffling of the K arms. We prove that under a novel and mild assumption on the mean gap \Delta, this simple but powerful modification achieves the same guarantees in term of sample complexity and cumulative regret than its original version, but in a much wider class of problems, as it is no more constrained to stationary distributions. We also show that the original Successive Elimination fails to have controlled regret in this more general scenario, thus showing the benefit of shuffling. We then remove our mild assumption and adapt the algorithm to the best-arm identification task with switching arms. We adapt the definition of the sample complexity for that case and prove that, against an optimal policy with N−1 switches of the optimal arm, this new algorithm achieves an expected sample complexity of O(\Delta^{−2}sqrt{NKδ^{−1}log(K/δ)}), where δ is the probability of failure of the algorithm, and an expected cumulative regret of O(\Delta^{−1}\sqrt{NTKlog(TK)}) after T time steps. »*

**Low-rank Bandits with Latent Mixtures, **A. Gopalan, O.-A. Maillard, M. Zaki, submitted to *JMLR*, Sep 2016. Arxiv.

*«We study the task of maximizing rewards from recommending items (actions) to users sequentially interacting with a recommender system. Users are modeled as latent mixtures of C many representative user classes, where each class specifies a mean reward profile across actions. Both the user features (mixture distribution over classes) and the item features (mean reward vector per class) are unknown a priori. The user identity is the only contextual information available to the learner while interacting. This induces a low-rank structure on the matrix of expected rewards r_a,b from recommending item a to user b. The problem reduces to the well-known linear bandit when either user or item-side features are perfectly known. In the setting where each user, with its stochastically sampled taste profile, interacts only for a small number of sessions, we develop a bandit algorithm for the two-sided uncertainty. It combines the Robust Tensor Power Method of Anandkumar et al. (2014b) with the OFUL linear bandit algorithm of Abbasi-Yadkori et al. (2011). We provide the first rigorous regret analysis of this combination, showing that its regret after T user interactions is O(C√BT), with B the number of users. An ingredient towards this result is a novel robustness property of OFUL, of independent interest.»*

**Self-normalization techniques for streaming confident regression,** O.-A. Maillard, submitted to *Bernoulli, 2016. Publisher website. HaL**.*

*«We consider, in a generic streaming regression setting, the problem of building a confidence interval (and distribution) on the next observation based on past observed data. The observations given to the learner are of the form (x, y) with y = f (x) + ξ, where x can have arbitrary dependency on the past observations, f is unknown and the noise ξ is sub-Gaussian conditionally on the past observations. Further, the observations are assumed to come from some external filtering process making the number of observations itself a random stopping time. In this challenging scenario that captures a large class of processes with non-anticipative dependencies, we study the ordinary, ridge, and kernel least-squares estimates and provide confidence intervals based on self-normalized vector-valued martingale techniques, applied to the estimation of the mean and of the variance. We then discuss how these adaptive confidence intervals can be used in order to detect a possible model mismatch as well as to estimate the future (self-information, quadratic, or transportation) loss of the learner at a next step.»*

**Pliable rejection sampling**, A. Erraqabi, M. Valko, A. Carpentier and O.-A. Maillard in International Conference on Machine Learning (ICML) 2016. *HaL*

*«Rejection sampling is a technique for sampling from difficult distributions. However, its use is limited due to a high rejection rate. Common adaptive rejection sampling methods either work only for very specific distributions or without performance guarantees. In this paper, we present pliable rejection sampling (PRS), a new approach to rejection sampling, where we learn the sampling proposal using a kernel estimator. Since our method builds on rejection sampling, the samples obtained are with high probability i.i.d. and distributed according to f. Moreover, PRS comes with a guarantee on the number of accepted samples.»*

**Active Roll-outs in MDP with Irreversible Dynamics***, *O.-A. Maillard, T.A Mann, S. Mannor and R. ortner. Preprint, 2016.

**Optimistic-path based UCRL for average-reward continuous state-action MDPs**, O.-A. Maillard and R. Ortner. Preprint, 2016.

## 2015

**A note on replacing uniform subsampling by random projections in MCMC for linear regression of tall datasets**, R. Bardenet and O.-A. Maillard in *28th **Neural Information Processing Systems (NIPS) *workshop, 2015. *HaL*

*«New Markov chain Monte Carlo (MCMC) methods have been proposed to tackle inference with tall datasets, i.e., when the number n of data items is intractably large. A large class of these new MCMC methods is based on randomly subsampling the dataset at each MCMC iteration. We investigate whether random projections can replace this random subsampling for linear regression of big streaming data. In the latter setting, random projections have indeed become standard for non-Bayesian treatments. We isolate two issues for MCMC to apply to streaming regression: 1) a resampling issue; MCMC should access the same random projections across iterations to avoid keeping the whole dataset in memory and 2) a budget issue; making individual MCMC acceptance decisions should require o(n) random projections. While the resampling issue can be satisfyingly tackled, current techniques in random projections and MCMC for tall data do not solve the budget issue, and may well end up showing it is not possible.»*

## 2014

**[Full oral] “How hard is my MDP?” Distribution-norm to the rescue, **O.-A. Maillard, T.A. Mann and S. Mannor *in Proceedings of the 27th conference on advances in Neural Information Processing Systems (NIPS), 2014. Publisher website HaL
«In Reinforcement Learning (RL), state-of-the-art algorithms require a large number of samples per state-action pair to estimate the transition kernel p. In many problems, a good approximation of p is not needed. For instance, if from one state-action pair (s,a), one can only transit to states with the same value, learning p(⋅|s,a) accurately is irrelevant (only its support matters). This paper aims at capturing such behavior by defining a novel hardness measure for Markov Decision Processes (MDPs) we call the distribution-norm. The distribution-norm w.r.t. a measure ν is defined on zero ν-mean functions f by the standard variation of f with respect to ν. We first provide a concentration inequality for the dual of the distribution-norm. This allows us to replace the generic but loose ||⋅||1 concentration inequalities used in most previous analysis of RL algorithms, to benefit from this new hardness measure. We then show that several common RL benchmarks have low hardness when measured using the new norm. The distribution-norm captures finer properties than the number of states or the diameter and can be used to assess the difficulty of MDPs.»*

**Selecting Near-Optimal Approximate State Representations in Reinforcement Learning, **R.Ortner, O.-A. Maillard and D. Ryabko, *in Proceedings of the International Conference on Algorithmic Learning Theory (ALT), 2014. Publisher website HaL*

«*We consider a reinforcement learning setting introduced in (Maillard et al., NIPS 2011) where the learner does not have explicit access to the states of the underlying Markov decision process (MDP). Instead, she has access to several models that map histories of past interactions to states. Here we improve over known regret bounds in this setting, and more importantly generalize to the case where the models given to the learner do not contain a true model resulting in an MDP representation but only approximations of it. We also give improved error bounds for state aggregation.»*

**Sub-sampling for multi-armed bandits, **A. Baransi, O.-A. Maillard, S. Mannor, *in Europeean conference on Machine Learning (ECML), 2014. Publisher website HaL*

«*The stochastic multi-armed bandit problem is a popular model of the exploration/exploitation trade-off in sequential decision problems. We introduce a novel algorithm that is based on sub-sampling. Despite its simplicity, we show that the algorithm demonstrates excellent empirical performances against state-of-the-art algorithms, including Thompson sampling and KL-UCB. The algorithm is very flexible, it does need to know a set of reward distributions in advance nor the range of the rewards. It is not restricted to Bernoulli distributions and is also invariant under rescaling of the rewards. We provide a detailed experimental study comparing the algorithm to the state of the art, the main intuition that explains the striking results, and conclude with a finite-time regret analysis for this algorithm in the simplified two-arm bandit setting.»*

**Concentration inequalities for sampling without replacement***, R. Bardenet and O.-A. Maillard, in Bernoulli, 2014. Publisher website HaL*

*«Concentration inequalities quantify the deviation of a random variable from a fixed value. In spite of numerous applications, such as opinion surveys or ecological counting procedures, few concentration results are known for the setting of sampling without replacement from a finite population. Until now, the best general concentration inequality has been a Hoeffding inequality due to Serfling (1974). In this paper, we first improve on the fundamental result of Serfling (1974), and further extend it to obtain a Bernstein concentration bound for sampling without replacement. We then derive an empirical version of our bound that does not require the variance to be known to the user.»*

## 2013

**Latent bandits**, O.-A. Maillard and S. Mannor, *in Proceedings of the International Conference on Machine Learning (ICML), 2013. Publisher website *HaL

*«We consider a multi-armed bandit problem where the reward distributions are indexed by two sets –one for arms, one for type– and can be partitioned into a small number of clusters according to the type. First,we consider the setting where all reward distributions are known and all types have the same underlying cluster, the type’s identity is, however, unknown. Second, we study the case where types may come from different classes, which is significantly more challenging. Finally, we tackle the case where the reward distributions are completely unknown. In each setting, we introduce specific algorithms and derive non-trivial regret performance. Numerical experiments show that, in the most challenging agnostic case, the proposed algorithm achieves excellent performance in several difficult scenarios.»*

**Robust risk-averse stochastic multi-armed bandits**, O.-A. Maillard, *in Proceedings of the International Conference on Algorithmic Learning Theory (ALT), volume 8139 of Lecture Notes in Computer Science, pages 218–233. Springer Berlin Heidelberg, 2013. Publisher website HaL*

*«**We study a variant of the standard stochastic multi-armed bandit problem when one is not interested in the arm with the best mean, but instead in the arm maximizing some coherent risk measure criterion. Further, we are studying the deviations of the regret instead of the less informative expected regret. We provide an algorithm, called RA-UCB to solve this problem, together with a high probability bound on its regret*.*»*

**Competing with an infinite set of models in reinforcement learning**, P. Nguyen, O.-A. Maillard, D. Ryabko, and R. Ortner, *in Proceedings of the International Conference on Artificial Intelligence and Statistics (AI&STATS), volume 31 of JMLR W&CP , pages 463–471, Arizona, USA, 2013. Publisher website HaL*

*«We consider a reinforcement learning setting where the learner also has to deal with the problem of finding a suitable state-representation function from a given set of models. This has to be done while interacting with the environment in an online fashion (no resets), and the goal is to have small regret with respect to any Markov model in the set. For this setting, recently the BLB algorithm has been proposed, which achieves regret of order T^{2/3}, provided that the given set of models is finite. Our first contribution is to extend this result to a countably infinite set of models. Moreover, the BLB regret bound suffers from an additive term that can be exponential in the diameter of the MDP involved, since the diameter has to be guessed. The algorithm we propose avoids guessing the diameter, thus improving the regret bound.»*

**Optimal regret bounds for selecting the state representation in reinforcement learning**, O.-A. Maillard, P. Nguyen, R. Ortner, and D. Ryabko, *in Proceedings of the International conference on Machine Learning (ICML), volume 28 of JMLR W&CP, pages 543–551, Atlanta, USA, 2013. Publisher website HaL*

*«We consider an agent interacting with an environment in a single stream of actions, observations, and rewards, with no reset. This process is not assumed to be a Markov Decision Process (MDP). Rather, the agent has several representations (mapping histories of past interactions to a discrete state space) of the environment with unknown dynamics, only some of which result in an MDP. The goal is to minimize the average regret criterion against an agent who knows an MDP representation giving the highest optimal reward, and acts optimally in it. Recent regret bounds for this setting are of order O(T^{2/3}) with an additive term constant yet exponential in some characteristics of the optimal MDP. We propose an algorithm whose regret after T time steps is O(\sqrt{T}), with all constants reasonably small. This is optimal in T since O(\sqrt{T}) is the optimal regret in the setting of learning in a (single discrete) MDP.»*

**Kullback–leibler upper confidence bounds for optimal sequential allocation**, O. Cappé, A. Garivier, O.-A. Maillard, R. Munos and G. Stoltz,* in The Annals of Statistics, 41(3):1516–1541, 2013. Publisher website HaL*

*«We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins (1979), based on upper confidence bounds of the arm payoffs computed using the Kullback-Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: The kl-UCB algorithm is designed for one-parameter exponential families and the empirical KL-UCB algorithm for bounded and finitely supported distributions. Our main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins (1985) and Burnetas et Katehakis (1996), respectively. We also investigate the behavior of these algorithms when used with general bounded rewards, showing in particular that they provide significant improvements over the state-of-the-art.»*

## 2012

**Linear Regression with Random Projections**, O.-A. Maillard and R. Munos, in Journal of Machine Learning Research (JMLR), 13:2735–2772, 2012. *Publisher website HaL*

*«We investigate a method for regression that makes use of a randomly generated subspace G_P (of finite dimension P) of a given large (possibly infinite) dimensional function space F, for example, L_{2}([0,1]^d). G_P is defined as the span of P random features that are linear combinations of a basis functions of F weighted by random Gaussian i.i.d.~coefficients. We show practical motivation for the use of this approach, detail the link that this random projections method share with RKHS and Gaussian objects theory and prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in G_P rather than in F, and derive excess risk bounds for a specific regression algorithm (least squares regression in G_P). This paper stresses the motivation to study such methods, thus the analysis developed is kept simple for explanations purpose and leaves room for future developments.»*

**Hierarchical optimistic region selection driven by curiosity**, O.-A. Maillard, in P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Proceedings of the conference on advances in Neural Information Processing Systems 25 (NIPS), pages 1457–1465, 2012. *Publisher website HaL*

*«This paper aims to take a step forwards making the term ”intrinsic motivation” from reinforcement learning theoretically well founded, focusing on curiosity-driven learning. To that end, we consider the setting where, a fixed partition \P of a continuous space \X being given, and a process \nu defined on \X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample \nu in that cell, in order to minimize a loss function that is inspired from previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) is provided together with a finite-time regret analysis. »*

**Online allocation and homogeneous partitioning for piecewise constant mean-approximation**, A. Carpentier and O.-A. Maillard, in P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Proceedings of the conference on advances in Neural Information Processing Systems 25 (NIPS), pages 1970–1978, 2012. *Publisher website HaL*

*«In the setting of active learning for the multi-armed bandit, where the goal of a learner is to estimate with equal precision the mean of a finite number of arms, recent results show that it is possible to derive strategies based on finite-time confidence bounds that are competitive with the best possible strategy. We here consider an extension of this problem to the case when the arms are the cells of a finite partition P of a continuous sampling space X \subset \Real^d. Our goal is now to build a piecewise constant approximation of a noisy function (where each piece is one region of P and P is fixed beforehand) in order to maintain the local quadratic error of approximation on each cell equally low. Although this extension is not trivial, we show that a simple algorithm based on upper confidence bounds can be proved to be adaptive to the function itself in a near-optimal way, when |P| is chosen to be of minimax-optimal order on the class of \alpha-Hölder functions.»*

## 2011

**Finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences**, O.-A. Maillard, R. Munos, and G. Stoltz, in Proceedings of the 24th annual Conference On Learning Theory (COLT), 2011.* Publisher website HaL
«We consider a Kullback-Leibler-based algorithm for the stochastic multi-armed bandit problem in the case of distributions with finite supports (not necessarily known beforehand), whose asymptotic regret matches the lower bound of \cite{Burnetas96}. Our contribution is to provide a finite-time analysis of this algorithm; we get bounds whose main terms are smaller than the ones of previously known algorithms with finite-time analyses (like UCB-type algorithms).»*

**Selecting the state-representation in reinforcement learning** O.-A. Maillard, D. Ryabko, and R. Munos, in Proceedings of the 24th conference on advances in Neural Information Processing Systems (NIPS), pages 2627–2635, 2011. *Publisher website HaL*

*«The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T^{2/3} where T is the horizon time.»*

**Sparse recovery with Brownian sensing**, A. Carpentier, O.-A. Maillard and R. Munos, in Proceedings of the 24th conference on advances in Neural Information Processing Systems (NIPS), 2011. *Publisher website HaL*

*«We consider the problem of recovering the parameter α ∈ R^K of a sparse function f (i.e. the number of non-zero entries of α is small compared to the number K of features) given noisy evaluations of f at a set of well-chosen sampling points. We introduce an additional randomization process, called Brownian sensing, based on the computation of stochastic integrals, which produces a Gaussian sensing matrix, for which good recovery properties are proven, independently on the number of sampling points N , even when the features are arbitrarily non-orthogonal. Under the assumption that f is Hölder continuous with exponent at least 1/2, we provide an estimate â of the parameter such that ||α − â||_2 = O( ||η||_2/ sqrt(N)), where η is the observation noise. The method uses a set of sampling points uniformly distributed along a one-dimensional curve selected according to the features. We report numerical experiments illustrating our method.»*

**Adaptive bandits: Towards the best history-dependent strategy**, O.-A. Maillard and R. Munos, in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AI&STATS), volume 15 of JMLR W&CP, 2011. *Publisher website HaL*

*«We consider multi-armed bandit games with possibly adaptive opponents. We introduce models Theta of constraints based on equivalence classes on the common history (information shared by the player and the opponent) which define two learning scenarios: (1) The opponent is constrained, i.e. she provides rewards that are stochastic functions of equivalence classes defined by some model theta* in Theta. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in Theta. This allows to model opponents (case 1) or strategies (case 2) which handles finite memory, periodicity, standard stochastic bandits and other situations. When Theta={theta}, i.e. only one model is considered, we derive *tractable* algorithms achieving a *tight* regret (at time T) bounded by Õ(\sqrt{TAC}), where C is the number of classes of theta. Now, when many models are available, all known algorithms achieving a nice regret O(\sqrt{T}) are unfortunately *not tractable* and scale poorly with the number of models |Theta| . Our contribution here is to provide *tractable* algorithms with regret bounded by T^{2/3}C^{1/3}log(|Theta|)^{1/2}.**»*

## 2010

**Online learning in adversarial lipschitz environments**, O.-A. Maillard and R. Munos, in Proceedings of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases: Part II, (ECML-PKDD), pages 305–320, Berlin, Heidelberg, 2010. Springer-Verlag. *Publisher website HaL*

*«We consider the problem of online learning in an adversarial environment when the reward functions chosen by the adversary are assumed to be Lipschitz. This setting extends previous works on linear and convex online learning. We provide a class of algorithms with cumulative regret upper bounded by O(\sqrt{dT ln(\lambda)}) where d is the dimension of the search space, T the time horizon, and \lambda the Lipschitz constant. Efficient numerical implementations using particle methods are discussed. Applications include online supervised learning problems for both full and partial (bandit) information settings, for a large class of non-linear regressors/classifiers, such as neural networks.»*

**Finite sample analysis of bellman residual minimization**, O.-A. Maillard, R. Munos, A. Lazaric, and M. Ghavamzadeh, in Proceedings of the Asian Conference on Machine Learning (ACML), 2010. *Publisher website HaL*

*«We consider the Bellman residual minimization approach for solving discounted Markov decision problems, where we assume that a generative model of the dynamics and rewards is available. At each policy iteration step, an approximation of the value function for the current policy is obtained by minimizing an empirical Bellman residual dened on a set of n states drawn i.i.d. from a distribution \mu, the immediate rewards, and the next states sampled from the model. Our main result is a generalization bound for the Bellman residual in linear approximation spaces. In particular, we prove that the empirical Bellman residual approaches the true (quadratic) Bellman residual in \mu-norm with a rate of order O(1/\sqrt{n}). This result implies that minimizing the empirical residual is indeed a sound approach for the minimization of the true Bellman residual which guarantees a good approximation of the value function for each policy. Finally, we derive performance bounds for the resulting approximate policy iteration algorithm in terms of the number of samples n and a measure of how well the function space is able to approximate the sequence of value functions.»*

**Scrambled objects for least-squares regression**, O.-A. Maillard and R. Munos, in J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Proceedings of the 23rd conference on advances in Neural Information Processing Systems (NIPS), NIPS ’10, pages 1549–1557, 2010. *Publisher website HaL*

*«We consider least-squares regression using a randomly generated subspace G_P \subset F of finite dimension P, where F is a function space of infinite dimension, e.g. L2([0, 1]^d). G_P is defined as the span of P random features that are linear combinations of the basis functions of F weighted by random Gaussian i.i.d. coefficients. In particular, we consider multi-resolution random combinations at all scales of a given mother function, such as a hat function or a wavelet. In this latter case, the resulting Gaussian objects are called scrambled wavelets and we show that they enable to approximate functions in Sobolev spaces H_s([0, 1]^d). As a result, given N data, the least-squares estimate \hat g built from P scrambled wavelets has excess risk ||f^* − \hat g||2 P = O(||f^*||2_{H_s([0,1]^d)}(logN)/P + P(logN)/N) for target functions f^* \in H_s([0, 1]^d) of smoothness order s > d/2. An interesting aspect of the resulting bounds is that they do not depend on the distribution P from which the data are generated, which is important in a statistical regression setting considered here. Randomization enables to adapt to any possible distribution. We conclude by describing an efficient numerical implementation using lazy expansions with numerical complexity \tilde O (2dN^3/2 logN + N^2), where d is the dimension of the input space.»*

**LSTD with random projections**, M. Ghavamzadeh, A. Lazaric, O.-A. Maillard, and R. Munos, In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Proceedings of 23rd conference on advances in Neural Information Processing Systems (NIPS) (NIPS), pages 721–729, 2010. *Publisher website HaL*

*«We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a highdimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm.»*

## 2009

**Compressed least-squares regression**, O.-A. Maillard and R. Munos, in Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Proceedings of the 22nd conference on advances in Neural Information Processing Systems (NIPS), pages 1213–1221, 2009. *Publisher website HaL*

*«We consider the problem of learning, from K data, a regression function in a linear space of high dimension N using projections onto a random subspace of lower dimension M. From any algorithm minimizing the (possibly penalized) empirical risk, we provide bounds on the excess risk of the estimate computed in the projected subspace (compressed domain) in terms of the excess risk of the estimate built in the high-dimensional space (initial domain). We show that solving the problem in the compressed domain instead of the initial domain reduces the estimation error at the price of an increased (but controlled) approximation error. We apply the analysis to Least-Squares (LS) regression and discuss the excess risk and numerical complexity of the resulting “Compressed Least Squares Regression” (CLSR) in terms of N, K, and M. When we choose M=O(\sqrt{K}), we show that CLSR has estimation error of order O(\log K / \sqrt{K}) and provides an interesting alternative to usual penalization techniques where one first selects a random subspace of lower dimension and then performs an empirical risk minimizer in this subspace.»*

**Complexity versus Agreement for Many Views**, O.-A. Maillard and N. Vayatis in Proceedings of the International Conference on Algorithmic Learning Theory (ALT), 2009. *Publisher website HaL Cite*

*«The paper considers the problem of semi-supervised multi-view classification, where each view corresponds to a Reproducing Kernel Hilbert Space. An algorithm based on co-regularization methods with extra penalty terms reflecting smoothness and general agreement properties is proposed. We first provide explicit tight control on the Rademacher (L1 ) complexity of the corresponding class of learners for arbitrary many views, then give the asymptotic behavior of the bounds when the co-regularization term increases, making explicit the relation between consistency of the views and reduction of the search space. Third we provide an illustration through simulations on toy examples. With many views, a parameter selection procedure is based on the stability approach with clustering and localization arguments. The last new result is an explicit bound on the L2 -diameter of the class of functions.»*

**Parallelization of the TD(λ) Learning Algorithm**, O.-A. Maillard, R. Coulom and P. Preux in European Workshop on Reinforcement Learning (EWRL), 2005*.*