Pathwise Conditioning

$$\htmlData{class=fragment,fragment-index=2} { (f\given\v{y})(\.) = } \htmlData{class=fragment,fragment-index=0} { f(\.) } \htmlData{class=fragment,fragment-index=1} { + \m{K}_{(\.)\v{x}} } \htmlData{class=fragment,fragment-index=1} { \mathrlap{\m{K}_{\v{x}\v{x}}^{-1}} } \htmlData{class=fragment,fragment-index=3} { \ubr{\phantom{\m{K}_{\v{x}\v{x}}^{-1}}}{\c{O}(N^3)} } \htmlData{class=fragment,fragment-index=1} { (\v{y} - f(\v{x})) }$$

$$(f\given\v{y})(\.) = f(\.) + \sum_{i=1}^N v_i k(x_i,\.) \qquad \htmlData{class=fragment,fragment-index=5} { \v{v} = \m{K}_{\v{x}\v{x}}^{-1}(\v{y} - f(\v{x})) }$$
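As a minimal NumPy sketch of pathwise conditioning (all names here are illustrative, not from any GP library): draw one prior sample jointly at training and test inputs, then apply the update $f(\cdot) + \m{K}_{(\cdot)\v{x}}\m{K}_{\v{x}\v{x}}^{-1}(\v{y} - f(\v{x}))$, assuming a squared-exponential kernel and noiseless data.

```python
import numpy as np

def k(a, b, ell=0.15):
    # Squared-exponential kernel matrix between point sets a and b
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2 * ell**2))

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)                  # training inputs
y = np.sin(2 * np.pi * x)                 # noiseless training targets
xs = np.linspace(0.03, 0.97, 50)          # test inputs

# Draw one prior sample jointly at training and test locations
z = np.concatenate([x, xs])
L = np.linalg.cholesky(k(z, z) + 1e-6 * np.eye(z.size))
f = L @ rng.standard_normal(z.size)
f_x, f_xs = f[:x.size], f[x.size:]

# Pathwise update: (f | y)(.) = f(.) + K_{(.)x} K_{xx}^{-1} (y - f(x))
v = np.linalg.solve(k(x, x) + 1e-6 * np.eye(x.size), y - f_x)
f_post_xs = f_xs + k(xs, x) @ v           # posterior sample at test inputs
f_post_x = f_x + k(x, x) @ v              # posterior sample at training inputs

print(np.max(np.abs(f_post_x - y)))       # small: the sample interpolates the data (up to jitter)
```

Note the only expensive object is $\v{v}$: once it is computed, evaluating the posterior sample anywhere is a single kernel sum.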

Numerical Stability

Condition number: quantifies the difficulty of computing $\m{A}^{-1}\v{b}$, i.e. of solving a linear system

$$\htmlClass{fragment}{ \f{cond}(\m{A}) } \htmlClass{fragment}{ = \lim_{\eps\to0} \sup_{\norm{\v\delta} \leq \eps\norm{\v{b}}} \frac{\norm{\m{A}^{-1}(\v{b} + \v\delta) - \m{A}^{-1}\v{b}}_2}{\eps\norm{\m{A}^{-1}\v{b}}_2} } \htmlClass{fragment}{ = \frac{\lambda_{\max}(\m{A})}{\lambda_{\min}(\m{A})} }$$
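A numerical check of this identity for a symmetric positive-definite $\m{A}$ (the case relevant for kernel matrices): aligning $\v{b}$ with the largest-eigenvalue direction and the perturbation $\v\delta$ with the smallest-eigenvalue direction attains the worst-case amplification $\lambda_{\max}/\lambda_{\min}$. The construction below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag([1e4, 10.0, 5.0, 1.0, 1e-2]) @ Q.T   # SPD, cond(A) = 1e6

lam, V = np.linalg.eigh(A)              # ascending eigenvalues
v_min, v_max = V[:, 0], V[:, -1]        # smallest/largest eigenvalue directions

eps = 1e-8
b = v_max                               # right-hand side aligned with lambda_max
delta = eps * v_min                     # worst-case perturbation, ||delta|| = eps ||b||

amplification = (
    np.linalg.norm(np.linalg.solve(A, b + delta) - np.linalg.solve(A, b))
    / (eps * np.linalg.norm(np.linalg.solve(A, b)))
)
print(amplification, lam[-1] / lam[0])  # both approx. 1e6
```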

Condition Numbers of Kernel Matrices

Are kernel matrices always well-conditioned? No.

One-dimensional time series on a grid: Kac–Murdock–Szegö matrix $$\m{K}_{\v{x}\v{x}} = \scriptsize\begin{pmatrix} 1 & \rho &\rho^2 &\dots & \rho^{N-1} \\ \rho & 1 & \rho & \dots & \rho^{N-2} \\ \vdots & \vdots &\ddots & \ddots & \vdots \\ \rho^{N-1} & \rho^{N-2} & \rho^{N-3} & \dots & 1 \end{pmatrix}$$ for which $\frac{1+ 2\rho - 2\rho\eps + \rho^2}{1 - 2\rho + 2\rho\eps + \rho^2} \leq \f{cond}(\m{K}_{\v{x}\v{x}}) \leq \frac{(1+\rho)^2}{(1-\rho)^2}$, where $\eps = \frac{\pi^2}{N+1}$.
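The upper bound is easy to probe numerically. A small sketch, assuming $\rho = 0.9$: the condition number of the Kac–Murdock–Szegö matrix approaches, but never exceeds, $(1+\rho)^2/(1-\rho)^2$ as $N$ grows.

```python
import numpy as np

def kms(n, rho):
    # Kac–Murdock–Szegö matrix: K_ij = rho^|i-j|
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

rho = 0.9
upper = (1 + rho)**2 / (1 - rho)**2     # = 361 for rho = 0.9
for n in (10, 100, 1000):
    c = np.linalg.cond(kms(n, rho))
    print(n, c)                         # approaches upper from below
```

So the conditioning is controlled uniformly in $N$, but it deteriorates rapidly as $\rho \to 1$, i.e. as neighboring grid points become more correlated.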

Minimum Separation

Separation: minimum distance between distinct inducing points $z_i$ and $z_j$

Proposition. Assuming spatial decay and stationarity, separation controls $\f{cond}(\m{K}_{\v{z}\v{z}})$ uniformly in $M$.
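An illustrative check of this phenomenon, assuming a squared-exponential kernel in one dimension (toy construction, not the proposition's exact setting): at a fixed minimum separation, $\f{cond}(\m{K}_{\v{z}\v{z}})$ stays bounded as the number of points $M$ grows, whereas shrinking the separation on a fixed interval blows it up.

```python
import numpy as np

def se_kernel(z, ell=1.0):
    # Squared-exponential kernel matrix on a one-dimensional point set z
    d = z[:, None] - z[None, :]
    return np.exp(-d**2 / (2 * ell**2))

# Fixed separation 1: condition number stays bounded as M grows
conds_fixed = [np.linalg.cond(se_kernel(np.arange(m) * 1.0)) for m in (10, 20, 40)]

# Shrinking separation on a fixed interval: condition number explodes
conds_shrink = [np.linalg.cond(se_kernel(np.linspace(0, 5, m))) for m in (10, 20, 40)]

print(conds_fixed)    # stays small
print(conds_shrink)   # grows rapidly with M
```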

Have you tried stochastic gradient descent?

Conventional wisdom in deep learning:

• SGD variants are empirically often the best optimization algorithms
• Adam is extremely effective, even on non-convex problems
• Minibatch-based training is a critical part of scalability

Gaussian Process Posteriors via Randomized Optimization Objectives

Split into posterior mean and uncertainty reduction terms $$\begin{aligned} \htmlData{class=fragment,fragment-index=1}{ (f\given\v{y})(\.) } & \htmlData{class=fragment,fragment-index=1}{ = f(\.) + \m{K}_{(\.)\v{x}}(\m{K}_{\v{x}\v{x}} + \m\Sigma)^{-1}(\v{y} - f(\v{x}) - \v\eps) } \\ & \htmlData{class=fragment,fragment-index=2}{ = f(\.) + \sum_{i=1}^N v_i^* k(x_i,\.) - \sum_{i=1}^N \alpha_i^* k(x_i,\.) } \end{aligned}$$ where $$\begin{aligned} \htmlData{class=fragment,fragment-index=4}{ \v{v}^* } & \htmlData{class=fragment,fragment-index=4}{ = \argmin_{\v{v} \in \R^N} \sum_{i=1}^N \frac{(y_i - \m{K}_{x_i\v{x}}\v{v})^2}{\Sigma_{ii}} + \v{v}^T\m{K}_{\v{x}\v{x}}\v{v} } \\ \htmlData{class=fragment,fragment-index=5}{ \v\alpha^* } & \htmlData{class=fragment,fragment-index=5}{ = \argmin_{\v\alpha \in \R^N} \sum_{i=1}^N \frac{(f(x_i) + \eps_i - \m{K}_{x_i\v{x}}\v\alpha)^2}{\Sigma_{ii}} + \v\alpha^T\m{K}_{\v{x}\v{x}}\v\alpha . } \end{aligned}$$

Gaussian Process Posteriors via Randomized Optimization Objectives

$$\begin{aligned} \v{v}^* &= \argmin_{\v{v} \in \R^N} \sum_{i=1}^N \frac{(y_i - \m{K}_{x_i\v{x}}\v{v})^2}{\Sigma_{ii}} + \v{v}^T\m{K}_{\v{x}\v{x}}\v{v} \\ \v\alpha^* &= \argmin_{\v\alpha \in \R^N} \sum_{i=1}^N \frac{(f(x_i) + \eps_i - \m{K}_{x_i\v{x}}\v\alpha)^2}{\Sigma_{ii}} + \v\alpha^T\m{K}_{\v{x}\v{x}}\v\alpha . \end{aligned}$$
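A minimal NumPy sketch of minibatch SGD on the first objective, assuming homoscedastic noise $\m\Sigma = \sigma^2\m{I}$ (all settings here are illustrative). The closed-form optimum is $\v{v}^* = (\m{K}_{\v{x}\v{x}} + \sigma^2\m{I})^{-1}\v{y}$, which the stochastic iterates approach well in the data-dominant directions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma2 = 50, 0.1
x = np.sort(rng.uniform(0, 5, N))
K = np.exp(-(x[:, None] - x[None, :])**2 / 2)   # squared-exponential kernel
y = np.sin(x) + np.sqrt(sigma2) * rng.standard_normal(N)

v = np.zeros(N)
lr, batch = 5e-5, 25
for step in range(20000):
    i = rng.choice(N, size=batch, replace=False)
    # Stochastic gradient of sum_i (y_i - K_i v)^2 / sigma2 + v^T K v,
    # with the data-fit sum estimated from a minibatch
    resid = K[i] @ v - y[i]
    grad = (N / batch) * 2 * K[i].T @ resid / sigma2 + 2 * K @ v
    v -= lr * grad

v_star = np.linalg.solve(K + sigma2 * np.eye(N), y)
rel = np.linalg.norm(K @ (v - v_star)) / np.linalg.norm(K @ v_star)
print(rel)   # small: the K-weighted error concentrates where there is little data
```

Each step touches only a minibatch of rows of $\m{K}_{\v{x}\v{x}}$ for the data-fit term, which is what makes the approach amenable to large $N$.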

The Far-away Region

$$(f\given\v{y})(\.) = f(\.) + \sum_{i=1}^N v_i k(x_i,\.)$$

The Interpolation Region

Idea: maybe SGD (a) converges fast on a subspace, and (b) obtains something arbitrary but benign on the rest of the space?

Let $\m{K}_{\v{x}\v{x}} = \m{U}\m\Lambda\m{U}^T$ be the eigendecomposition of the kernel matrix. Define the spectral basis functions $$u^{(i)}(\.) = \sum_{j=1}^N \frac{U_{ji}}{\sqrt{\lambda_i}} k(x_j, \.) .$$
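Two properties of these functions can be checked directly in coefficient space, using $\langle k(x_j,\.), k(x_l,\.)\rangle_{H_k} = k(x_j, x_l)$: the $u^{(i)}$ are orthonormal in the RKHS, and their total mass on the data, $\sum_n u^{(i)}(x_n)^2$, equals $\lambda_i$. A small sketch (jitter added for numerical safety):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 3, 20))
K = np.exp(-(x[:, None] - x[None, :])**2 / 2) + 1e-8 * np.eye(20)

lam, U = np.linalg.eigh(K)                # ascending eigenvalues
W = U / np.sqrt(lam)                      # column i: coefficients U_ji / sqrt(lambda_i)

# Orthonormality in the RKHS: the Gram matrix W^T K W is the identity
G = W.T @ K @ W
print(np.max(np.abs(G - np.eye(20))))     # approx. 0

# Evaluations at the data: u^(i)(x_n) = (K W)_{ni}, and
# sum_n u^(i)(x_n)^2 = lambda_i, so large-eigenvalue basis functions
# are exactly the ones that concentrate on the data
data_mass = np.sum((K @ W)**2, axis=0)
print(np.max(np.abs(data_mass - lam)))    # approx. 0
```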

The Interpolation Region

Proposition. With probability $1-\delta$, we have $$\norm{\text{proj}_{u^{(i)}} \mu_{f\given\v{y}} - \text{proj}_{u^{(i)}} \mu_{\f{SGD}}}_{H_k} \leq \frac{1}{\sqrt{\lambda_i t}}\del{\frac{\norm{\v{y}}_2}{\eta\sigma^2} + G\sqrt{2\eta\sigma^2 \log\frac{N}{\delta}}} .$$

The Interpolation Region

Where do the top spectral basis functions concentrate?

• Idea: lift Courant–Fischer eigenvector characterization to the RKHS

Proposition. The spectral basis functions can be written $$u^{(i)}(\.) \htmlClass{fragment}{ = \argmax_{ \htmlClass{fragment}{ u \in H_k } } \cbr{ \htmlClass{fragment}{ \sum_{n=1}^N u(x_n)^2 } \htmlClass{fragment}{ : } \begin{aligned} & \htmlClass{fragment}{ \norm{u}_{H_k} = 1 } \\ & \htmlClass{fragment}{ \langle u,u^{(j)}\rangle_{H_k} = 0 } \htmlClass{fragment}{, \forall j \lt i } \end{aligned} }. }$$

Spectral basis functions concentrate on the data as much as possible, while remaining orthogonal to those with larger eigenvalues

SGD's implicit bias for Gaussian processes

Implicit bias: SGD converges quickly near the data, and causes no harm far from the data

Error concentrates in regions (a) without much data, which also (b) aren't located too far from the data:

• Lack of data $\leadsto$ predictions are mostly arbitrary
• Empirically: SGD-sampled functions revert to the prior faster than the exact posterior does

Benign non-convergence $\leadsto$ robustness to instability

Reflections

Numerical analysis conventional wisdom:

• SGD is slow; conjugate gradients work much better
• If CG is slow then your problem is unstable
• Unstable problems are ill-posed, you should reformulate

This work: a very different way of looking at things

• Don't solve the linear system approximately if the solution isn't inherently needed
• Instead of a well-posed problem, a well-posed subproblem might be good enough
• Try SGD! It might work well, including for counterintuitive reasons
• A kernel matrix's eigenvectors carry information about data-density
• To see this, adopt a function-analytic view given by the spectral basis functions

Thank you!

https://avt.im/ · @avt_im

J. A. Lin,* J. Antorán,* S. Padhy,* D. Janz, J. M. Hernández-Lobato, A. Terenin. Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent. arXiv:2306.11589, 2023.

A. Terenin,* D. R. Burt,* A. Artemev, S. Flaxman, M. van der Wilk, C. E. Rasmussen, H. Ge. Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees. arXiv:2210.07893, 2022.

J. T. Wilson,* V. Borovitskiy,* P. Mostowsky,* A. Terenin,* M. P. Deisenroth. Efficiently Sampling Functions from Gaussian Process Posteriors. International Conference on Machine Learning, 2020. Honorable Mention for Outstanding Paper Award.

J. T. Wilson,* V. Borovitskiy,* P. Mostowsky,* A. Terenin,* M. P. Deisenroth. Pathwise Conditioning of Gaussian Processes. Journal of Machine Learning Research, 2021.

*Equal contribution