# Bayesian Optimization

Goal: minimize unknown function $\phi$ in as few evaluations as possible.

1. Build posterior $f \given \v{y}$ using data $(x_1,\phi(x_1)),..,(x_n,\phi(x_n))$.
2. Choose $$\htmlClass{fragment}{ x_{n+1} = \argmin_{x\in\c{X}} \alpha_n(x) }$$ to optimize the acquisition function $\alpha_n$, for instance Thompson sampling $\alpha_n \~ f\given\v{y}$.

# Modeling Dynamical Systems with Uncertainty

$$\htmlData{fragment-index=0,class=fragment}{ x_0 } \qquad \htmlData{fragment-index=1,class=fragment}{ x_1 = x_0 + f(x_0)\Delta t } \qquad \htmlData{fragment-index=2,class=fragment}{ x_2 = x_1 + f(x_1)\Delta t } \qquad \htmlData{fragment-index=3,class=fragment}{ .. }$$

# Sparse Gaussian Processes

## via Pathwise Conditioning

$$\htmlData{class=fragment,fragment-index=2} { (f\given\v{y})(\.) = } \htmlData{class=fragment,fragment-index=0} { f(\.) } \htmlData{class=fragment,fragment-index=1} { + \m{K}_{(\.)\v{x}} (\m{K}_{\v{x}\v{x}} + \m\Sigma)^{-1} (\v{y} - f(\v{x}) - \v\eps) }$$

# Sparse Gaussian Processes

$$(f\given\v{y})(\.) = f(\.) + \sum_{i=1}^N v_i^{(\v{y})} k(x_i,\.)$$

$$(f\given\v{y})(\.) \approx f(\.) + \sum_{j=1}^M v_j^{(\v{u})} k(z_j,\.)$$

# Sparse Gaussian Processes

$$\begin{gathered} (f\given\v{y})(\.) \approx f(\.) + \sum_{j=1}^M v_j^{(\v{u})} k(z_j,\.) \\ \htmlClass{fragment}{ \v{v}^{(\v{u})} = \m{K}_{\v{x}\v{x}}^{-1}(\v{u} - f(\v{x})) } \qquad \htmlClass{fragment}{ \v{u} \~\f{N}(\v{m},\m{S}) } \end{gathered}$$

# Numerical Stability

## Condition number: quantifies difficulty of solving $\m{A}^{-1}\v{b}$

$$\htmlClass{fragment}{ \f{cond}(\m{A}) } \htmlClass{fragment}{ = \lim_{\eps\to0} \sup_{\norm{\v\delta} \leq \eps\norm{\v{b}}} \frac{\norm{\m{A}^{-1}(\v{b} + \v\delta) - \m{A}^{-1}\v{b}}_2}{\eps\norm{\m{A}^{-1}\v{b}}_2} } \htmlClass{fragment}{ = \frac{\lambda_{\max}(\m{A})}{\lambda_{\min}(\m{A})} }$$

# Cholesky Factorization

Result. Let $\m{A}$ be a symmetric positive definite matrix of size $N \x N$, where $N \gt 10$. Assume that $$\f{cond}(\m{A}) \leq \frac{1}{2^{-t} \x 3.9 N^{3/2}}$$ where $t$ is the length of the floating point mantissa, and that $3N2^{-t} \lt 0.1$. Then floating point Cholesky factorization will succeed, producing a matrix $\m{L}$ satisfying $$\htmlClass{fragment}{ \m{L}\m{L}^T = \m{A} + \m{E} } \qquad\qquad \htmlClass{fragment}{ \norm{\m{E}}_2 \leq 2^{-t} \x 1.38 N^{3/2} \norm{\m{A}}_2 . }$$

# Condition Numbers of Kernel Matrices

Are kernel matrices always well-conditioned? No.

The Kac–Murdock–Szegö matrix $$\m{K}_{\v{x}\v{x}} = \scriptsize\begin{pmatrix} 1 & \rho &\rho^2 &\dots & \rho^{n-1} \\ \rho & 1 & \rho & \dots & \rho^{n-2} \\ \vdots & \vdots &\ddots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \dots & 1 \end{pmatrix}$$ satisfies $\frac{1+ 2\rho + 2\rho\eps + \rho^2}{1 - 2\rho - 2\rho\eps + \rho^2} \leq \f{cond}(\m{K}_{\v{x}\v{x}}) \leq \frac{(1+\rho)^2}{(1-\rho)^2}$, where $\eps = \frac{\pi^2}{N+1}$.

# Minimum Separation

## Separation: minimum distance between distinct $z_i$ and $z_j$

Proposition. Assuming spatial decay and stationarity, separation controls $\f{cond}(\m{K}_{\v{z}\v{z}})$ uniformly in $M$.

# The Clustered-data Approximation

$$\htmlData{class=fragment,fragment-index=0}{ \f{cl}(x) = \argmin_{j=1,..,M} \norm{x - z_j} } \qquad \htmlData{class=fragment,fragment-index=2}{ N_{\f{cl}}(x) = |\{x' \in \v{x} : \f{cl}(x) = \f{cl}(x')\}| }$$

# The Clustered-data Approximation

\htmlData{class=fragment,fragment-index=0}{ u_j = \frac{1}{N_{\f{cl}}(z_j)}\sum_{\f{cl}(x_i) = z_j} y_i } \qquad \begin{aligned} \htmlData{class=fragment,fragment-index=1}{ y_i \given f } & \htmlData{class=fragment,fragment-index=1}{ \~\f{N}(f(\f{cl}(x_i)),\sigma^2) } \\ \htmlData{class=fragment,fragment-index=2}{ u_i \given f } & \htmlData{class=fragment,fragment-index=2}{ \~\f{N}\del{f(z_i), \tfrac{\sigma^2}{N_{\f{cl}}(z_i)}} } \end{aligned} Then $(f\given\v{u})(\.) = (f\given\v{y})(\.)$ in distribution, and scales with $M \ll N$

# The Clustered-data Approximation

$$\htmlClass{fragment}{ (f\given\v{y})(\.) = f(\.) + \m{K}_{(\.)\v{z}} (\m{K}_{\v{z}\v{z}} + \m\Lambda)^{-1} (\v{u} - f(\v{z}) - \v\epsilon) } \qquad \htmlClass{fragment}{ \v\epsilon \~\f{N}(\v{0},\m\Lambda) }$$

# Thank you!

### https://avt.im/· @avt_im

A. Terenin, D. R. Burt, A. Artemev, S. Flaxman, M. van der Wilk, C. E. Rasmussen, H. Ge. Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees. arXiv: 2210.07893., 2022.