# Formalism

Goal: for a function $f : X \to [0,1]$, where $|X| = K \lt \infty$, find $$\htmlClass{fragment}{ \max_{x\in X} f(x) }$$ based on noisy observations $f(x_t) + \eps_{x_t}$, where $\eps_{x_t}$ is a random noise term.

# Formalism

Regret: $$\htmlClass{fragment}{ R(T) = \sum_{t=1}^T f(x^*) - f(x_t) }$$ where $\displaystyle x^* = \argmax_{x\in X} f(x)$.
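As an illustration, cumulative regret can be computed directly from the arm means and the sequence of selections. The 3-arm instance below is hypothetical; the means are made up for the example:

```python
def regret(f, plays):
    """R(T) = sum_{t=1}^T f(x*) - f(x_t) for a sequence of selected arms."""
    f_star = max(f.values())  # f(x*), the best achievable mean
    return sum(f_star - f[x] for x in plays)

# Hypothetical 3-arm instance; each suboptimal play adds that arm's gap.
f = {"a": 0.3, "b": 0.5, "c": 0.9}
print(regret(f, ["a", "c", "b", "c"]))  # gaps 0.6 + 0 + 0.4 + 0
```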

# A Regret Lower Bound

Theorem. For any algorithm there is an $f$ such that $$\htmlClass{fragment}{ \E[R(T)] \geq \Omega(\sqrt{KT}) . }$$

# Error Bars

Result (Hoeffding). Let $y_1,..,y_t$ be an IID sequence of random variables with values in $[0,1]$. Let $\bar{y}_t$ be their sample mean. Then $$\htmlClass{fragment}{ \P(\E(y_1) \gt \bar{y}_t + \delta) \leq e^{-2t\delta^2} . }$$
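A quick numerical check of the bound, assuming fair Bernoulli draws (so $\E(y_1) = 0.5$); the parameter values are illustrative:

```python
import math
import random

random.seed(0)
t, delta, trials = 100, 0.1, 10_000
mu = 0.5  # E[y_1] for fair Bernoulli observations (illustrative choice)

violations = 0
for _ in range(trials):
    ybar = sum(random.random() < mu for _ in range(t)) / t  # sample mean of t draws
    if mu > ybar + delta:  # the event whose probability Hoeffding bounds
        violations += 1

freq = violations / trials
bound = math.exp(-2 * t * delta ** 2)  # Hoeffding's e^{-2 t delta^2}
print(freq, "<=", bound)
```

The empirical frequency comes out far below $e^{-2} \approx 0.135$, as the bound is not tight for Bernoulli draws.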

# Error Bars

Assume the observations $f(x_t) + \eps_{x_t}$ satisfy Hoeffding's inequality. Choosing $\delta = \sigma_t(x)$ and a suitable $\eta$ gives $$\htmlClass{fragment}{ \P\Bigg( |\bar{y}_t(x) - f(x)| \leq \underset{\htmlClass{fragment}{\sigma_t(x)}}{\undergroup{\sqrt{\frac{2\ln T}{n_t(x)}}}} \Bigg) \geq 1 - \eta , }$$ where

• $n_t(x)$ is the number of times $x$ has been selected by time $t$, and
• $\bar{y}_t(x)$ is the empirical mean of the observations $f(x) + \eps_{x}$ collected at $x$ up to time $t$.
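A minimal sketch of the width $\sigma_t(x)$, showing how the error bar shrinks with the pull count (parameter values are illustrative):

```python
import math

def sigma(T, n):
    """Half-width sqrt(2 ln T / n_t(x)) of the confidence interval above."""
    return math.sqrt(2 * math.log(T) / n)

# The error bar shrinks like 1/sqrt(n) as arm x accumulates pulls.
for n in (1, 10, 100):
    print(n, round(sigma(1000, n), 3))
```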

# The Upper Confidence Bound Algorithm

Theorem. Hoeffding–UCB's regret satisfies $$\htmlClass{fragment}{ \E[R(T)] \leq \widetilde{\c{O}}(\sqrt{KT}) }$$ uniformly for all $f$.

# The Upper Confidence Bound Algorithm

Key idea. With sufficiently high probability, we have \begin{aligned} \htmlClass{fragment}{\Delta(x_t)} &\htmlClass{fragment}{= f(x^*) - f(x_t)} \\ &\htmlClass{fragment}{\leq f^+_t(x^*) - f^-_t(x_t)} \\ &\htmlClass{fragment}{\leq f^+_t(x_t) - f^-_t(x_t)} \\ &\htmlClass{fragment}{= 2\sigma_t(x_t)} \htmlClass{fragment}{= {\scriptstyle 2\sqrt{\frac{2\ln T}{n_t(x_t)}}}} \htmlClass{fragment}{\overset{t=T}{=} \widetilde{\c{O}}(n_T(x_t)^{-1/2})} \end{aligned} using $f^-_t(x) \leq f(x) \leq f^+_t(x)$ and $f^+_t(x_t) \geq f^+_t(x)$, $\forall x,t$; the latter holds because UCB selects $x_t = \argmax_{x\in X} f^+_t(x)$.

# The Upper Confidence Bound Algorithm

This gives \begin{aligned} \htmlClass{fragment}{\E[R(T)]} &\htmlClass{fragment}{= \sum_{t=1}^T \underset{\htmlClass{fragment}{\Delta(x_t)}}{\undergroup{f(x^*) - f(x_t)}}} \htmlClass{fragment}{= \sum_{x\in X} \underset{\htmlClass{fragment}{\widetilde{\c{O}}(n_T(x)^{-1/2})}}{\undergroup{\Delta(x)}} n_T(x)} \\ &\htmlClass{fragment}{\leq \sum_{x\in X} \widetilde{\c{O}}(\sqrt{n_T(x)})} \htmlClass{fragment}{\leq \widetilde{\c{O}}\Bigg(\sqrt{K\sum_{x\in X} n_T(x)}\Bigg)} \\ &\htmlClass{fragment}{= \widetilde{\c{O}}(\sqrt{KT}) ,} \end{aligned} where the second inequality is Cauchy–Schwarz and the last equality uses $\sum_{x\in X} n_T(x) = T$.
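The algorithm itself can be sketched as a minimal Hoeffding-UCB loop on Bernoulli arms; the arm means below are assumed for illustration, and this is a sketch rather than the exact variant analysed above:

```python
import math
import random

def ucb(means, T, seed=0):
    """Minimal Hoeffding-UCB sketch: play the arm maximising f^+_t(x)."""
    rng = random.Random(seed)
    K = len(means)
    n = [0] * K    # pull counts n_t(x)
    s = [0.0] * K  # running reward sums per arm
    for t in range(T):
        if t < K:
            x = t  # pull each arm once to initialise
        else:
            # f^+_t(x) = empirical mean + sqrt(2 ln T / n_t(x))
            x = max(range(K),
                    key=lambda a: s[a] / n[a] + math.sqrt(2 * math.log(T) / n[a]))
        s[x] += rng.random() < means[x]  # Bernoulli reward with mean f(x)
        n[x] += 1
    return n

means = [0.3, 0.5, 0.9]  # hypothetical arm means
print(ucb(means, T=2000))  # most pulls should concentrate on the best arm
```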

# Extensions

• $\widetilde{\c{O}}(\sqrt{KT}) \leadsto \c{O}(\sqrt{KT})$
• Adversarial and Contextual Bandits $$\htmlClass{fragment}{ \min_{\eps\in\c{E}} \max_{x\in X} f(x) }$$
• Bayesian methods and Thompson Sampling $$\htmlClass{fragment}{ x_{t+1} = \argmax_{x\in X} \phi_t(x) } \qquad \htmlClass{fragment}{ \phi_t \~ f \given y_1,..,y_t }$$
• Partial Monitoring and Information Directed Sampling
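Thompson sampling on the same kind of Bernoulli instance can be sketched as follows, with Beta(1,1) priors and illustrative, assumed arm means:

```python
import random

def thompson(means, T, seed=0):
    """Thompson sampling sketch for Bernoulli arms with Beta(1, 1) priors."""
    rng = random.Random(seed)
    K = len(means)
    a = [1] * K  # Beta posterior parameters: 1 + successes
    b = [1] * K  # Beta posterior parameters: 1 + failures
    n = [0] * K
    for _ in range(T):
        # x_{t+1} = argmax_x phi_t(x), with phi_t sampled from the posterior
        x = max(range(K), key=lambda i: rng.betavariate(a[i], b[i]))
        r = rng.random() < means[x]  # Bernoulli reward with mean f(x)
        a[x] += r
        b[x] += 1 - r
        n[x] += 1
    return n

print(thompson([0.3, 0.5, 0.9], T=2000))
```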