2021-10-27 v0.0-e7150f2d (alpha)
Philosophy of these notes. Two key ideas determined what has been included so far.
I aim to provide simplified proofs over what appears in the literature, ideally reducing difficult things to something that fits in a single lecture.
I have primarily focused on a classical perspective of achieving a low test error for binary classification with IID data via standard (typically ReLU) feedforward networks.
Organization. Following the second point above, the classical view decomposes the test error into three parts.
Approximation (starts in section 1): given a classification problem, there exists a deep network which achieves low error over the distribution.
Optimization (starts in section 6): given a finite training set for a classification problem, there exist algorithms to find predictors with low training error and low complexity.
Generalization (starts in section 11): the gap between training and testing error is small for low complexity networks.
Formatting.
These notes use pandoc markdown with various extensions. A current html version is always at https://mjt.cs.illinois.edu/dlt/, and a current pdf version is always at https://mjt.cs.illinois.edu/dlt/index.pdf.
Owing to my unfamiliarity with pandoc, there are still various formatting bugs.
Various todo notes are marked throughout the text like this.
Feedback. I’m very eager to hear any and all feedback!
How to cite. Please consider using a format which makes the version clear:
@misc{mjt_dlt,
author = {Matus Telgarsky},
title = {Deep learning theory lecture notes},
howpublished = {\url{https://mjt.cs.illinois.edu/dlt/}},
year = {2021},
note = {Version: 2021-10-27 v0.0-e7150f2d (alpha)},
}
In this section we outline our basic setup, which can be summarized as follows:
Although this means we exclude many settings, as discussed above, much of the work in other settings uses tools from this most standard one.
Basic shallow network. Consider the mapping x \mapsto \sum_{j=1}^m a_j \sigma(w_j^{\scriptscriptstyle\mathsf{T}}x + b_j).
\sigma is the nonlinearity/activation/transfer. Typical choices: ReLU z\mapsto \max\{0,z\}, sigmoid z\mapsto \frac{1}{1+\exp(-z)}.
((a_j,w_j,b_j))_{j=1}^m are trainable parameters; varying them defines the function class. Sometimes in this shallow setting we freeze (a_j)_{j=1}^m, which gives a simple model that is still difficult to analyze (e.g., nonconvex).
We can think of this as a directed graph of width m: we have a hidden layer of m nodes, where the jth computes x\mapsto \sigma(w_j^{\scriptscriptstyle\mathsf{T}}x + b_j).
Define weight matrix W\in\mathbb{R}^{m \times d} and bias vector b\in \mathbb{R}^m via W_{j:} := w_j^{\scriptscriptstyle\mathsf{T}} and the given coordinates b_j. The first layer computes h := \sigma(Wx + b)\in\mathbb{R}^m (\sigma applied coordinate-wise), and the second computes h\mapsto a^{\scriptscriptstyle\mathsf{T}}h.
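As a concrete sketch, the shallow network in matrix form takes a few lines of NumPy; all names here are illustrative, and the check at the end confirms the matrix form agrees with the sum-over-nodes form.

```python
import numpy as np

def shallow_forward(x, W, b, a, sigma=lambda z: np.maximum(0.0, z)):
    """One-hidden-layer network x -> a^T sigma(W x + b); ReLU by default."""
    h = sigma(W @ x + b)   # hidden layer, shape (m,)
    return a @ h           # scalar output

rng = np.random.default_rng(0)
d, m = 3, 5
W = rng.normal(size=(m, d))
b = rng.normal(size=m)
a = rng.normal(size=m)
x = rng.normal(size=d)

# agrees with the sum-over-nodes form  sum_j a_j sigma(w_j^T x + b_j)
node_sum = sum(a[j] * max(0.0, W[j] @ x + b[j]) for j in range(m))
assert np.isclose(shallow_forward(x, W, b, a), node_sum)
```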
Basic deep network. Extending the matrix notation, given parameters w = (W_1, b_1, \ldots, W_L, b_L), f(x;w) := \sigma_L( W_L \sigma_{L-1}( \cdots W_2 \sigma_1(W_1 x + b_1) + b_2 \cdots )+ b_L ). (1)
\sigma_j is now a multivariate mapping; in addition to coordinate-wise ReLU and sigmoid, we can use softmax (z \mapsto \exp(z)/\sum_k \exp(z_k)), max-pooling (a few coordinates of the input replaced with their maximum), attention layers, and many others.
We can replace x \mapsto W x + b with some compact representation while still preserving linearity, e.g., the standard implementation of a convolution layer. Maybe I will add the explicit formalisms somewhere.
Often biases (b_1,\ldots,b_L) are dropped; the handling of these biases can change many elements of the story.
Typically \sigma_L is identity, so we refer to L as the number of affine layers, and L-1 the number of activation or hidden layers.
Width now means the maximum output dimension of each activation. (For technical reasons, we sometimes also need to take the max with the input dimension, or treat the inputs as a fake layer.)
Once again we can describe the computation via an acyclic graph. Classically, the activations were univariate mappings applied coordinate-wise, and single rows of the weight matrix were composed with univariate activations to give a node. Now, however, activations are often multivariate (and in particular cannot be written as identical univariate mappings applied coordinate-wise), and for computational reasons we prefer not to break the matrices into vectors, giving a more general graph with each matrix and activation as its own node.
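A minimal sketch of the deep network f(x;w) from (1) in NumPy, assuming coordinate-wise activations and \sigma_L the identity (function and variable names are illustrative):

```python
import numpy as np

def deep_forward(x, params, sigma=lambda z: np.maximum(0.0, z)):
    """f(x; w) as in eq. (1): alternate affine maps and activations,
    with the final activation sigma_L taken to be the identity."""
    h = x
    for i, (W, b) in enumerate(params):
        h = W @ h + b
        if i < len(params) - 1:   # hidden layers get the activation
            h = sigma(h)
    return h

rng = np.random.default_rng(1)
dims = [4, 6, 6, 1]               # input dim 4, width 6, scalar output
params = [(rng.normal(size=(dims[i + 1], dims[i])), rng.normal(size=dims[i + 1]))
          for i in range(len(dims) - 1)]
y = deep_forward(rng.normal(size=4), params)
assert y.shape == (1,)
```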
Basic supervised learning setup; test error decomposition.
Given pairs ((x_i,y_i))_{i=1}^n (training set), our job is to produce a mapping x\mapsto y which performs well on future examples.
If there is no relationship between past and future data, we can’t hope for much.
The standard classical learning assumption is that both the training set, and future data, are drawn IID from some distribution on (x,y).
This IID assumption is not practical: it is not satisfied by real data. Even so, the analysis and algorithms here have many elements that carry over to more practical settings.
How do we define “performs well on future examples?”
Given one (x,y) and a prediction \hat y = f(x), we suffer a loss \ell(\hat y, y), e.g., logistic \ln(1+\exp(-\hat y y)), or squared (\hat y - y)^2/2.
On a training set, we suffer empirical risk \widehat{\mathcal{R}}(f) = \frac 1 n \sum_i \ell(f(x_i), y_i).
For future (random!) data, we consider (population) risk \mathcal{R}(f) = \mathop{\mathbb{E}}\ell(f(x), y) = \int \ell(f(x),y){\text{d}}\mu(x,y).
“Performs well on future examples” becomes “minimize \mathcal{R}(f).” We can decompose \mathcal{R}(f) into three separate concerns: given a training algorithm’s choice \hat f in some class of functions/predictors \mathcal{F}, as well as some reference solution \bar f \in \mathcal{F}, \begin{aligned} \mathcal{R}(\hat f) &= \mathcal{R}(\hat f) - \widehat{\mathcal{R}}(\hat f) &\text{(generalization)} \\ &\quad + \widehat{\mathcal{R}}(\hat f) - \widehat{\mathcal{R}}(\bar f) &\text{(optimization)} \\ &\quad + \widehat{\mathcal{R}}(\bar f) -\mathcal{R}(\bar f) &\qquad\text{(concentration/generalization)} \\ &\quad + \mathcal{R}(\bar f). &\text{(approximation)} \end{aligned} These notes are organized into separately considering these three terms (treating “generalization” and “concentration/generalization” together).
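The decomposition is a telescoping identity, which a small numerical sketch can confirm; here a large held-out sample stands in for the population risk, and the two predictors are arbitrary placeholders rather than trained models:

```python
import numpy as np

rng = np.random.default_rng(2)

def logistic(yhat, y):
    return np.log1p(np.exp(-yhat * y))

def emp_risk(f, X, Y):
    return np.mean(logistic(f(X), Y))

# training set, and a large held-out sample standing in for the population
Xtr, Ytr = rng.normal(size=100), rng.choice([-1.0, 1.0], size=100)
Xte, Yte = rng.normal(size=100000), rng.choice([-1.0, 1.0], size=100000)

fhat = lambda x: 0.5 * x          # algorithm's choice (placeholder)
fbar = lambda x: np.tanh(x)       # reference predictor (placeholder)

R  = lambda f: emp_risk(f, Xte, Yte)   # proxy for population risk
Rh = lambda f: emp_risk(f, Xtr, Ytr)   # empirical risk

generalization = R(fhat) - Rh(fhat)
optimization   = Rh(fhat) - Rh(fbar)
concentration  = Rh(fbar) - R(fbar)
approximation  = R(fbar)
# the four terms telescope back to R(fhat)
assert np.isclose(R(fhat),
                  generalization + optimization + concentration + approximation)
```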
Here are a few of the shortened and/or extended proofs in these notes.
Approximation.
Optimization.
Generalization.
Due to the above philosophy, many topics are currently omitted. Over time I hope to fill the gaps.
Here are some big omissions, hopefully resolved soon:
Architectures:
Non-feedforward, e.g., recurrent (Siegelmann and Sontag 1994).
Specific feedforward architecture choices like convolutional layers and skip connections.
Continuous depth, for instance various neural ODE frameworks (R. T. Q. Chen et al. 2018; Tzen and Raginsky 2019).
Other learning paradigms:
Data augmentation, self-training, and distribution shift.
Unsupervised learning (e.g., GANs), Adversarial ML, RL.
Further omitted topics, in a bit more detail, are discussed separately for approximation (section 1.1), optimization (section 6.1), and generalization (section 11.1).
Thanks to Ziwei Ji for extensive comments, discussion, and the proof of Theorem 10.3; thanks to Daniel Hsu for extensive comments and discussion; thanks to Francesco Orabona for detailed comments spanning many sections; thanks to Ohad Shamir for extensive comments on many topics; thanks to Karolina Dziugaite and Dan Roy for extensive comments on the generalization material; thanks to Thien Nguyen for extensive and detailed comments and corrections on many sections. Further thanks to Nadav Cohen, Quanquan Gu, Suriya Gunasekar, Frederic Koehler, Justin Li, Akshayaa Magesh, Maxim Raginsky, David Rolnick, Kartik Sreenivasan, Matthieu Terris, and Alex Wozniakowski for various comments and feedback.
As above, we wish to ensure that our predictors \mathcal{F} (e.g., networks of a certain architecture) have some element \bar f\in\mathcal{F} which simultaneously has small \mathcal{R}(\bar f) and small complexity; we can re-interpret our notation and suppose \mathcal{F} already is some constrained class of low-complexity predictors, and aim to make \inf_{f\in\mathcal{F}} \mathcal{R}(f) small.
What is \mathcal{F}? In keeping with the earlier theme, it should be some convenient notion of “low complexity model”; but what is that?
Models reached by gradient descent. Since standard training methods are variants of simple first-order methods, this seems a convenient candidate for \mathcal{F} which is tight with practice. Unfortunately, first, we only understand these models very close to initialization and very late in training, whereas practice seems to lie somewhere in between; second, we can’t simply take this as our definition, as it breaks things in the standard approach to generalization.
Models of low norm, where norm is typically measured layer-wise, and also typically the “origin” is initialization. This is the current most common setup, though it doesn’t seem to be able to capture the behavior of gradient descent that well, except perhaps when very close to initialization.
All models of some fixed architecture, meaning the weights can be arbitrary. This is the classical setup, and we’ll cover it here, but it can often seem loose or insensitive to data, and was a key part of the criticisms against the general learning-theoretic approach (Zhang et al. 2017). The math is still illuminating and still key parts can be used as tools in a more sensitive analysis, e.g., by compressing a model and then applying one of these results.
The standard classical setup (“all models of some fixed architecture”) is often stated with a goal of competing with all continuous functions: \inf_{f\in\mathcal{F}} \mathcal{R}(f) \qquad\text{vs.}\qquad \inf_{g\ \textrm{continuous}} \mathcal{R}(g). E.g., \sup_{g\text{ cont.}} \inf_{f\in\mathcal{F}} \mathcal{R}(f) - \mathcal{R}(g). To simplify further, if \ell is \rho-Lipschitz (and still y=\pm 1), \begin{aligned} \mathcal{R}(f) - \mathcal{R}(g) &= \int \left({ \ell(yf(x)) - \ell(yg(x)) }\right){\text{d}}\mu(x,y) \\ &\leq \int \rho|yf(x) - yg(x)| {\text{d}}\mu(x,y) = \rho \int |f(x) - g(x)| {\text{d}}\mu(x,y), \end{aligned} and in particular we have reduced the approximation question to one about studying \|f-g\| with function space norms.
(Lower bounds.) The uniform norm has certain nice properties for proving upper bounds, but is it meaningful for a lower bound? Functions can be well-separated in uniform norm even if they are mostly the same: they just need one point of large difference. For this reason, L_1 norms, for instance \int_{[0,1]^d} |f(x)-g(x)|{\text{d}}x, are preferred for lower bounds.
Full proofs for Sobolev space approximation (Yarotsky 2016; Schmidt-Hieber 2017). Planning to add in Fall 2021!!!
Approximation of distributions and other settings.
Approximation power of low-norm functions.
We start with two types of standard approximation results, in the “classical” regime where we only care about the number of nodes and not the magnitude of the weights, and also the worst-case goal of competing with an arbitrary continuous function using some function space norm.
Elementary folklore results: univariate approximation with one hidden layer, and multivariate approximation with two hidden layers, just by stacking bricks. The latter uses the L_1 metric, which is disappointing.
Celebrated “universal approximation” result: fitting continuous functions over compact sets in uniform norm with a single hidden layer (Hornik, Stinchcombe, and White 1989).
There are weaknesses in these results (e.g., curse of dimension), and thus they are far from the practical picture. Still, they are very interesting and influential.
We can handle the univariate case by gridding the line and taking steps appropriately.
Proof. Define m := \lceil \frac\rho \epsilon\rceil and b_i := i \epsilon/ \rho for i\in \{0,\ldots,m-1\}, with a_0 = g(0), \qquad a_i = g(b_i) - g(b_{i-1}), and lastly define f(x) := \sum_{i=0}^{m-1} a_i \mathbf{1}[x \geq b_i]. Then for any x\in[0,1], letting k be the largest index so that b_k \leq x, f is constant along [b_k, x], and \begin{aligned} |g(x) - f(x)| &\leq |g(x) - g(b_k)| + |g(b_k) - f(b_k)| + |f(b_k) - f(x)| \\ &\leq \rho |x - b_k| + \left|{g(b_k) - \sum_{i=0}^k a_i }\right| + 0 \\ &\leq \rho(\epsilon/\rho) + \left|{ g(b_k) - g(b_0) - \sum_{i=1}^k (g(b_i) - g(b_{i-1}))}\right| \\ & =\epsilon. \end{aligned}
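A small numerical sketch of this construction (names illustrative): build the grid of thresholds for a 1-Lipschitz g and confirm the uniform error is at most \epsilon.

```python
import numpy as np

def grid_network(g, rho, eps):
    """Thresholds on a grid of width eps/rho; increments a_i = g(b_i) - g(b_{i-1})."""
    m = int(np.ceil(rho / eps))
    b = np.arange(m) * eps / rho               # b_i = i * eps / rho
    a = np.empty(m)
    a[0] = g(b[0])                             # a_0 = g(0)
    a[1:] = g(b[1:]) - g(b[:-1])
    # f(x) = sum_i a_i 1[x >= b_i]
    return lambda x: np.sum(a * (np.asarray(x)[..., None] >= b), axis=-1)

g = np.sin                                     # 1-Lipschitz on [0, 1]
f = grid_network(g, rho=1.0, eps=0.05)
xs = np.linspace(0.0, 1.0, 2001)
assert np.max(np.abs(g(xs) - f(xs))) <= 0.05 + 1e-9
```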
Now let’s handle the multivariate case. We will replicate the univariate approach: we will increment function values when the target function changes. In the univariate case, we could “localize” function modifications, but in the multivariate case by default we will modify an entire halfspace at once. To get around this, we use an additional layer.
The proof uses the following lemma (omitted in class), approximating continuous functions by piecewise constant functions.
Proof. Let partition \mathcal{P}= (R_1,\ldots,R_N) be given, and for each R_i, pick some x_i \in R_i, and set \alpha_i := g(x_i). Since each side length of each R_i is at most \delta, \begin{aligned} \sup_{x\in U} |g(x) - h(x)| &= \sup_{i\in \{1,\ldots,N\}} \sup_{x\in R_i} |g(x) - h(x)| \\ &\leq \sup_{i\in \{1,\ldots,N\}} \sup_{x\in R_i}\left({ |g(x) - g(x_i)| + |g(x_i) - h(x)| }\right) \\ &\leq \sup_{i\in \{1,\ldots,N\}} \sup_{x\in R_i}\left({ \epsilon+ |g(x_i) - \alpha_i| }\right) = \epsilon. \end{aligned}
Proof of Theorem 2.1. For convenience, throughout this proof define a norm \|f\|_1 = \int_{[0,2)^d} |f(x)|{\text{d}}x. Let \mathcal{P} denote a partition of [0,2)^d into rectangles of the form \prod_{j=1}^d [a_j,b_j) with b_j-a_j\leq \delta; the final result follows by restricting consideration to [0,1]^d, but we include an extra region to work with half-open intervals in a lazy way. Let h = \sum_i \alpha_i \mathbf{1}_{R_i} denote the piecewise-constant function provided by Lemma 2.1 with the given partition \mathcal{P}, which satisfies \|g-h\|_1 \leq \epsilon. Our final network f will be of the form f(x) := \sum_i \alpha_i g_i(x), where each g_i will be a ReLU network with two hidden layers and \mathcal{O}(d) nodes; since |\mathcal{P}|\geq 1/\delta^d, then f also uses at least 1/\delta^d nodes as stated. Our goal is to show \|f-g\|_1\leq 2\epsilon; to this end, note by the preceding choices and the triangle inequality that \begin{aligned} \|f-g\|_1 &\leq \|f-h\|_1 + \|h-g\|_1 \\ &= \left\|{ \sum_i \alpha_i (\mathbf{1}_{R_i} - g_i) }\right\|_1 + \epsilon \\ & \leq \sum_i |\alpha_i|\cdot \| \mathbf{1}_{R_i} - g_i \|_1 + \epsilon. \end{aligned} As such, if we can construct each g_i so that \| \mathbf{1}_{R_i} - g_i \|_1 \leq \frac {\epsilon}{\sum_i |\alpha_i|}, then the proof is complete. (If \sum_i |\alpha_i| = 0, we can set f to be the constant 0 network and the proof is again complete.)
Now fix i and let rectangle R_i be given of the form R_i := \times_{j=1}^d [a_j,b_j), and define g_i as follows. Letting \gamma>0 denote a free parameter to be optimized at the end of the proof, for each j\in \{1,\ldots,d\} define \begin{aligned} \hspace{-2em} g_{\gamma,j}(z) &:= \sigma\left({\frac {z - (a_j - \gamma)}{\gamma}}\right) - \sigma\left({\frac {z - a_j}{\gamma}}\right) - \sigma\left({\frac {z - b_j}{\gamma}}\right) + \sigma\left({\frac {z - (b_j+\gamma)}{\gamma}}\right) \\ &\in \begin{cases} \{1\} & z \in [a_j,b_j],\\ \{0\} & z \not \in [a_j-\gamma, b_j + \gamma],\\ [0,1] & \text{otherwise}, \end{cases} \end{aligned} and additionally g_\gamma(x) := \sigma(\sum_j g_{\gamma,j}(x_j) - (d-1)). (Note that the second hidden layer is crucial in this construction; it is not clear how to proceed without it, certainly with only \mathcal{O}(d) nodes. Later proofs can use only a single hidden layer, but they are not constructive, and need \mathcal{O}(d) nodes.) Note that g_\gamma\approx \mathbf{1}_{R_i} as desired; specifically, g_\gamma(x) \in \begin{cases} \{1\} & x\in R_i,\\ \{0\} & x \not\in \times_j [a_j-\gamma, b_j+\gamma],\\ [0,1] & \textrm{otherwise.} \end{cases} It follows that \begin{aligned} \|g_\gamma - \mathbf{1}_{R_i}\|_1 &= \int_{R_i} |g_\gamma - \mathbf{1}_{R_i}| + \int_{\times_j [a_j-\gamma, b_j+\gamma] \setminus R_i} |g_\gamma - \mathbf{1}_{R_i}| + \int_{[0,2)^d \setminus \times_j [a_j-\gamma, b_j+\gamma]} |g_\gamma - \mathbf{1}_{R_i}| \\ &\leq 0 + \prod_{j=1}^d (b_j - a_j + 2\gamma) - \prod_{j=1}^d (b_j - a_j) + 0 \\ &\leq \mathcal{O}(\gamma), \end{aligned} which means we can ensure \| \mathbf{1}_{R_i} - g_\gamma \|_1 \leq \frac {\epsilon}{\sum_i |\alpha_i|} by choosing sufficiently small \gamma, which completes the proof.
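A minimal numerical sketch of the bump construction (helper names are mine, not the text’s): four ReLUs per coordinate give g_{\gamma,j}, and one more ReLU combines them into g_\gamma \approx \mathbf{1}_{R_i}.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def bump_1d(z, a, b, gamma):
    """Four ReLUs: 1 on [a, b], 0 outside [a - gamma, b + gamma], linear between."""
    return (relu((z - (a - gamma)) / gamma) - relu((z - a) / gamma)
            - relu((z - b) / gamma) + relu((z - (b + gamma)) / gamma))

def rect_indicator(x, rect, gamma):
    """g_gamma(x) = relu(sum_j g_{gamma,j}(x_j) - (d - 1)) ~ 1[x in rect]."""
    s = sum(bump_1d(x[j], a, b, gamma) for j, (a, b) in enumerate(rect))
    return relu(s - (len(rect) - 1))

rect = [(0.2, 0.6), (0.3, 0.7)]
assert np.isclose(rect_indicator([0.4, 0.5], rect, gamma=0.01), 1.0)  # inside
assert rect_indicator([0.1, 0.5], rect, gamma=0.01) == 0.0            # outside
```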
The proof of Theorem 2.1 uses two layers to construct g_\gamma such that g_\gamma(x) \approx \mathbf{1}\left[{ x\in \times_i [a_i,b_i] }\right]. If instead we had a way to approximate multiplication we could instead approximate x \mapsto \prod_i \mathbf{1}\left[{ x_i \in [a_i,b_i] }\right] = \mathbf{1}\left[{ x\in \times_i [a_i,b_i] }\right]. Can we do this and then form a linear combination, all with just one hidden layer?
The answer will be yes, and we will use this to resolve the classical universal approximation question with a single hidden layer.
Consider unbounded width networks with one hidden layer: \begin{aligned} \mathcal{F}_{\sigma,d,m} &:= \mathcal{F}_{d,m} := \left\{{ x\mapsto a^{\scriptscriptstyle\mathsf{T}}\sigma(W x + b) : a\in\mathbb{R}^m, W\in\mathbb{R}^{m\times d}, b\in\mathbb{R}^m}\right\}. \\ \mathcal{F}_{\sigma,d} &:= \mathcal{F}_d := \bigcup_{m\geq 0} \mathcal{F}_{\sigma,d,m}. \end{aligned} Note that \mathcal{F}_{\sigma,d,1} denotes networks with a single node, and \mathcal{F}_{\sigma,d} is the linear span (in function space) of single-node networks.
First consider the (unusual) activation \sigma = \cos. Since 2 \cos(y)\cos(z) = \cos(y+z) + \cos(y-z), then \begin{aligned} & 2 \left[{\sum_{i=1}^m a_i \cos(w_i^{\scriptscriptstyle\mathsf{T}}x + b_i) }\right] \cdot \left[{ \sum_{j=1}^n c_j \cos(u_j^{\scriptscriptstyle\mathsf{T}}x + v_j) }\right]= \\ & \sum_{i=1}^m \sum_{j=1}^n a_i c_j \left({ \cos((w_i+u_j)^{\scriptscriptstyle\mathsf{T}}x + (b_i+v_j)) + \cos((w_i - u_j)^{\scriptscriptstyle\mathsf{T}}x + (b_i-v_j)) }\right), \end{aligned} thus f,g\in\mathcal{F}_{\cos,d} \Longrightarrow fg \in \mathcal{F}_{\cos,d} ! In other words, \mathcal{F}_{\cos,d} is closed under multiplication, and since we know we can approximate univariate functions arbitrarily well, this suggests that we can approximate x \mapsto \prod_i \mathbf{1}\left[{ x_i \in [a_i,b_i] }\right] = \mathbf{1}\left[{ x\in \times_i [a_i,b_i] }\right], and use it to achieve our more general approximation goal.
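The expansion above can be checked numerically; a small sketch, with all weights drawn at random:

```python
import numpy as np

# 2 cos(y) cos(z) = cos(y + z) + cos(y - z), so a product of two cos-networks
# expands into a single wider cos-network with weights w_i + u_j and w_i - u_j.
rng = np.random.default_rng(3)
d, m, n = 2, 3, 4
a, W, b = rng.normal(size=m), rng.normal(size=(m, d)), rng.normal(size=m)
c, U, v = rng.normal(size=n), rng.normal(size=(n, d)), rng.normal(size=n)

def cos_net(x, a, W, b):
    return a @ np.cos(W @ x + b)

x = rng.normal(size=d)
product = cos_net(x, a, W, b) * cos_net(x, c, U, v)

expanded = 0.0
for i in range(m):
    for j in range(n):
        expanded += 0.5 * a[i] * c[j] * (np.cos((W[i] + U[j]) @ x + b[i] + v[j])
                                         + np.cos((W[i] - U[j]) @ x + b[i] - v[j]))
assert np.isclose(product, expanded)
```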
We’re in good shape to give the general universal approximation result. The classical Weierstrass theorem establishes that polynomials are universal approximators (Weierstrass 1885), and its generalization, the Stone-Weierstrass theorem, says that any family of functions satisfying some of the same properties as polynomials will also be a universal approximator. Thus we will show \mathcal{F}_{\sigma,d} is a universal approximator via Stone-Weierstrass, a key step being closure under multiplication as above; this proof scheme was first suggested in (Hornik, Stinchcombe, and White 1989), but is now a fairly standard way to prove universal approximation.
First, here is the statement of the Stone-Weierstrass Theorem.
Let functions \mathcal{F} be given as follows.
Each f\in\mathcal{F} is continuous.
For every x, there exists f\in\mathcal{F} with f(x) \neq 0.
For every x\neq x' there exists f\in\mathcal{F} with f(x)\neq f(x') (\mathcal{F} separates points).
\mathcal{F} is closed under multiplication and vector space operations (\mathcal{F} is an algebra).
Then \mathcal{F} is a universal approximator: for every continuous g :\mathbb{R}^d \to \mathbb{R} and \epsilon> 0, there exists f\in\mathcal{F} with \sup_{x\in[0,1]^d} |f(x)-g(x)|\leq \epsilon.
The Weierstrass theorem itself has interesting proofs:
The modern standard one is due to Bernstein; it picks a fine grid and then a convenient set of interpolating polynomials which behave stably off the grid.
Weierstrass’s original proof convolved the target with a Gaussian, which makes it analytic, and also leads to good polynomial approximation.
First, we go back to \cos activations, which was the original choice in (Hornik, Stinchcombe, and White 1989); we can then handle arbitrary activations by univariate approximation of \cos, without increasing the depth (but increasing the width).
Proof. Let’s check the Stone-Weierstrass conditions:
Each f \in \mathcal{F}_{\cos,d} is continuous.
For each x, \cos(0^{\scriptscriptstyle\mathsf{T}}x) = 1 \neq 0.
For each x\neq x', f(z) := \cos((z-x')^{\scriptscriptstyle\mathsf{T}}(x-x') / \|x-x'\|^2)\in\mathcal{F}_d satisfies f(x) = \cos(1) \neq \cos(0) = f(x').
\mathcal{F}_{\cos,d} is closed under products and vector space operations as before.
We can work it out even more easily for \mathcal{F}_{\exp,d}.
Proof. Let’s check the Stone-Weierstrass conditions:
Each f \in \mathcal{F}_{\exp,d} is continuous.
For each x, \exp(0^{\scriptscriptstyle\mathsf{T}}x) = 1 \neq 0.
For each x\neq x', f(z) := \exp((z-x')^{\scriptscriptstyle\mathsf{T}}(x-x') / \|x-x'\|^2)\in\mathcal{F}_d satisfies f(x) = \exp(1) \neq \exp(0) = f(x').
\mathcal{F}_{\exp,d} is closed under vector space operations by construction; for products, \begin{aligned} \left({\sum_{i=1}^n r_i \exp(a_i^{\scriptscriptstyle\mathsf{T}}x) }\right) \left({\sum_{j=1}^m s_j \exp(b_j^{\scriptscriptstyle\mathsf{T}}x) }\right) = \sum_{i=1}^n \sum_{j=1}^m r_i s_j \exp( (a_i+b_j)^{\scriptscriptstyle\mathsf{T}}x). \end{aligned}
Now let’s handle arbitrary activations.
Proof sketch (details in hw1). Given \epsilon>0 and continuous g, use Lemma 2.2 ((Hornik, Stinchcombe, and White 1989)) (or Lemma 2.3) to obtain h\in\mathcal{F}_{\cos,d} (or \mathcal{F}_{\exp,d}) with \sup_{x\in[0,1]^d}|h(x)-g(x)|\leq \epsilon/2. To finish, replace all appearances of \cos with an element of \mathcal{F}_{\sigma,1} so that the total additional error is \epsilon/2.
This section presents two ideas which have recently become very influential again.
Using infinite-width networks. This may seem complicated, but in fact it simplifies many things, and better captures certain phenomena.
Barron’s approximation theorem and norm (Barron 1993). Barron’s original goal was an approximation result which requires few nodes in some favorable cases. Interestingly, his construction can be presented as an infinite-width representation with equality, and furthermore the construction gives approximation guarantees near initialization (e.g., for the NTK, the topic of the next section).
We will finish the section with a more general view of these infinite-width constructions, and a technique to sample finite-width networks from them.
Let’s warm up with some univariate constructions.
Proof. By FTC and g(0) = 0 and x\in[0,1], g(x) = g(0) + \int_0^x g'(b) {\text{d}}b = 0 + \int_0^1 \mathbf{1}[x \geq b] g'(b){\text{d}}b. \hspace{6em}
That’s really it! We’ve written a differentiable function as a shallow infinite-width network, with equality, effortlessly.
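A quick numerical sanity check of this identity, with g(x) = x^2 and the integral over b discretized by the midpoint rule:

```python
import numpy as np

# Check g(x) = int_0^1 1[x >= b] g'(b) db for g(x) = x**2 (so g(0) = 0),
# discretizing the integral over b with the midpoint rule.
g = lambda x: x**2
gprime = lambda b: 2.0 * b

N = 100000
bs = (np.arange(N) + 0.5) / N               # midpoints of a grid on [0, 1]
for x in [0.25, 0.5, 0.9]:
    val = np.sum((x >= bs) * gprime(bs)) / N
    assert np.isclose(val, g(x), atol=1e-4)
```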
This approach uses Fourier transforms; for those less familiar, it might seem daunting, but:
The approach will turn out to be natural.
There is extensive literature on Fourier transforms, so it’s an important connection to make.
The original paper (Barron 1993) is nearly 30 years old now, and still this seems to be one of the best approaches, even with modern considerations like staying near initialization!
Let’s first argue it’s natural. Recall the Fourier transform (e.g., Folland 1999, Chapter 8): \hat f (w) := \int \exp(-2\pi i w^{\scriptscriptstyle\mathsf{T}}x) f(x) {\text{d}}x. We also have Fourier inversion: if f\in L^1 and \hat f\in L^1, f(x) = \int \exp(2\pi i w^{\scriptscriptstyle\mathsf{T}}x) \hat f(w) {\text{d}}w. The inversion formula rewrites f as an infinite-width network! The only catch is that the activations are not only non-standard, they are over the complex plane.
Barron’s approach is to convert these activations into something more normal; here we’ll use threshold nodes, but others are fine as well. If our starting function f is over the reals, then using \Re to denote the real part of a complex number, meaning \Re(a+bi) = a, then f(x) = \Re f(x) = \int \Re \exp(2\pi i w^{\scriptscriptstyle\mathsf{T}}x) \hat f(w) {\text{d}}w. If we expand with e^{i z} = \cos(z) + i \sin(z), we’re left with \cos, which is not compactly supported; to obtain an infinite-width form with threshold gates using a density which is compactly supported, Barron uses two tricks.
Polar decomposition. Let’s split up the Fourier transform \hat f into magnitude and radial parts: write \hat f(w) = |\hat f(w)| \exp( 2\pi i \theta(w)) with |\theta(w)| \leq 1. Since f is real-valued, \begin{aligned} f(x) &= \Re \int \exp(2\pi i w^{\scriptscriptstyle\mathsf{T}}x) \hat f(w) {\text{d}}w \\ &= \int \Re\left({ \exp(2\pi i w^{\scriptscriptstyle\mathsf{T}}x) \exp(2\pi i\theta(w)) | \hat f(w) |}\right) {\text{d}}w \\ &= \int \Re\left({ \exp(2\pi i w^{\scriptscriptstyle\mathsf{T}}x + 2\pi i\theta(w)) }\right) |\hat f(w)| {\text{d}}w \\ &= \int \cos\left({ 2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) }\right) |\hat f(w)| {\text{d}}w. \end{aligned} We’ve now obtained an infinite width network over real-valued activations! But \cos is not compactly supported, nor does it approach a limit as its argument goes to \pm\infty, which is where Barron’s second trick comes in.
Turning cosines into bumps! We’ll do two things to achieve our goal: subtracting f(0), and scaling by \|w\|: \begin{aligned} &f(x) - f(0) \\ &= \int \left[{\cos(2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) ) -\cos(2\pi w^{\scriptscriptstyle\mathsf{T}}0 + 2\pi\theta(w))}\right] |\hat f(w)| {\text{d}}w \\ &= \int \frac{\cos(2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) ) - \cos(2\pi\theta(w))}{\|w\|} \|w\| \cdot |\hat f(w)| {\text{d}}w. \end{aligned} The fraction does not blow up: since \cos is 1-Lipschitz, \begin{aligned} \left|{ \frac{\cos(2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) ) - \cos(2\pi\theta(w))}{\|w\|} }\right| &\leq \frac{\left|{2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) - 2\pi\theta(w)}\right|}{\|w\|} = \frac{2\pi |w^{\scriptscriptstyle\mathsf{T}}x|}{\|w\|} \leq 2\pi\|x\|. \end{aligned} This quantity is therefore well-behaved for bounded \|x\| so long as \|w\| |\hat f(w)| is well-behaved.
Barron combined these ideas with the sampling technique in Lemma 3.1 (Maurey (Pisier 1980)) to obtain estimates on the number of nodes needed to approximate functions whenever \|w\|\cdot|\hat f(w)| is well-behaved. We will follow a simpler approach here: we will give an explicit infinite-width form via only the first trick above and some algebra, and only then invoke sampling. The quantity \|w\|\cdot|\hat f(w)| will appear in the estimate of the “mass” of the infinite-width network, which is used to determine how much to sample, analogous to the quantity \int_0^1 |g'(x)|{\text{d}}x from Proposition 2.1.
Before continuing, let’s discuss \|w\|\cdot|\hat f(w)| a bit more, which can be simplified via \widehat{\nabla f}(w) = 2\pi i w \hat f(w) into a form commonly seen in the literature.
Here is our approach in detail. Continuing with the previous Barron representation and using \|x\|\leq1, \begin{aligned} \cos(2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w)) - \cos(2\pi\theta(w)) &= \int_0^{w^{\scriptscriptstyle\mathsf{T}}x} - 2\pi\sin(2\pi b + 2\pi\theta(w)){\text{d}}b \\ &= - 2\pi\int_0^{\|w\|} \mathbf{1}[w^{\scriptscriptstyle\mathsf{T}}x - b \geq 0]\sin(2\pi b + 2\pi\theta(w)){\text{d}}b \\ & + 2\pi\int_{-\|w\|}^0 \mathbf{1}[-w^{\scriptscriptstyle\mathsf{T}}x + b \geq 0]\sin(2\pi b + 2\pi\theta(w)){\text{d}}b. \end{aligned} Plugging this into the previous form (before dividing by \|w\|), \begin{aligned} \hspace{-2em}f(x) - f(0) &= -2\pi\int\!\! \int_0^{\|w\|} \mathbf{1}[w^{\scriptscriptstyle\mathsf{T}}x - b\geq 0]\left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w \\ &+2\pi\int\!\! \int_{-\|w\|}^0 \mathbf{1}[-w^{\scriptscriptstyle\mathsf{T}}x + b\geq 0]\left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w, \end{aligned} an infinite width network with threshold nodes!
We’ll tidy up with \widehat{ \nabla f}(w) = 2\pi i w \hat f(w), whereby \| \widehat {\nabla f}(w) \| = 2\pi\|w\|\cdot |\hat f(w)| as mentioned before. Lastly, to estimate the “mass” of this infinite width network (the integral of the density part of the integrand), \begin{aligned} & \left|{ 2\pi\int\!\! \int_0^{\|w\|} \left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w }\right| \\ &+ \left|{ 2\pi\int\!\! \int_{-\|w\|}^0 \left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w}\right| \\ &\leq 2\pi\int\!\! \int_{-\|w\|}^{\|w\|} \left|{ \sin(2\pi b + 2\pi\theta(w)) }\right| |\hat f(w)|{\text{d}}b{\text{d}}w \\ &\leq 2\pi\int 2\|w\| \cdot |\hat f(w)|{\text{d}}w \\ &= 2 \int \left\|{ \widehat{\nabla f}(w) }\right\|{\text{d}}w. \end{aligned} Summarizing this derivation gives the following version of Barron’s approach.
When combined with the sampling tools in 3.3, we will recover Barron’s full result that the number of nodes needed to approximate f to accuracy \epsilon>0 is roughly \left({\int \big\| \widehat{\nabla f}(w) \big\|{\text{d}}w}\right)^2 / \epsilon^2.
Ideally, the Barron norm is small, for instance polynomial (rather than exponential) in dimension for interesting examples. Here are a few, mostly taken from (Barron 1993).
Gaussians. Since (e.g., Folland 1999, Prop 8.24) \begin{aligned} &f(x) = (2\pi \sigma^2)^{-d/2} \exp( - \frac{\|x\|^2}{2\sigma^2}) \\ \Longrightarrow\quad& \hat f(w) = \exp( - 2\pi^2 \sigma^2 \|w\|^2 ), \end{aligned} meaning \hat f is an unnormalized Gaussian with variance (4\pi^2\sigma^2)^{-1}. Using normalization Z= (2\pi\sigma^2 )^{-d/2} and Hölder gives \begin{aligned} \int \|w\| |\hat f(w)|{\text{d}}w &= Z \int Z^{-1} \|w\| |\hat f(w)|{\text{d}}w \\ &\leq Z \left({\int Z^{-1} \|w\|^2 |\hat f(w)|{\text{d}}w}\right)^{1/2} \\ & = Z \left({\frac {d}{4\pi^2\sigma^2 } }\right)^{1/2} = \frac{\sqrt d}{\sqrt{2\pi} (2\pi \sigma^2)^{(d+1)/2}}. \end{aligned} Consequently, if 2\pi\sigma^2 \geq 1, then \int \left\|{ \widehat{\nabla f}(w) }\right\| {\text{d}}w = \mathcal{O}(\sqrt{d}). On the other hand, general radial functions have exponential \|\widehat{\nabla f}(w)\| (Comment IX.9, Barron 1993); this is circumvented here since \|x\|\leq 1 and hence the Gaussian is quite flat.
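A one-dimensional numerical check of this Gaussian Fourier pair, approximating the transform by a Riemann sum on a truncated grid:

```python
import numpy as np

# In 1-d: f(x) = (2 pi s^2)^{-1/2} exp(-x^2 / (2 s^2)) has Fourier transform
# fhat(w) = exp(-2 pi^2 s^2 w^2); approximate the integral by a Riemann sum.
s = 1.0
xs = np.linspace(-12.0, 12.0, 200001)
dx = xs[1] - xs[0]
f = (2 * np.pi * s**2) ** -0.5 * np.exp(-xs**2 / (2 * s**2))

for w in [0.0, 0.3, 0.7]:
    fhat = np.sum(np.exp(-2j * np.pi * w * xs) * f) * dx
    assert np.isclose(fhat, np.exp(-2 * np.pi**2 * s**2 * w**2), atol=1e-8)
```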
Further brief example calculations of \int \left\|{ \widehat{\nabla f}(w) }\right\| {\text{d}}w:
A few more from (Barron 1993, sec. IX): radial functions (IX.9), compositions with polynomials (IX.12) and analytic functions (IX.13), functions with \mathcal{O}(d) bounded derivatives (IX.15).
Barron also gives a lower bound for a specific set of functions which is exponential in dimension.
Further comments on Barron’s constructions can be found in (H. Lee et al. 2017).
General continuous functions can fail to satisfy \int \left\|{ \widehat{\nabla f}(w) }\right\| {\text{d}}w <\infty, but we can first convolve them with Gaussians and sample the resulting nearby function; this approach, along with a Barron theorem using ReLUs, can be found in (Ji, Telgarsky, and Xian 2020).
Now we will show how to obtain a finite-width representation from an infinite-width representation. Coarsely, given a representation \int \sigma(w^{\scriptscriptstyle\mathsf{T}}x) g(w) {\text{d}}w, we can form an estimate \frac 1 m \sum_{j=1}^m s_j \tilde\sigma(w_j^{\scriptscriptstyle\mathsf{T}}x), \qquad \text{where } s_j \in \pm 1, \ {} \tilde \sigma(z) = \sigma(z) \int |g(w)|{\text{d}}w, by sampling w_j \sim |g(w)| / \int |g(w)|{\text{d}}w, and letting s_j := \textrm{sgn}(g(w_j)), meaning the sign corresponding to whether w_j fell in a negative or positive region of g. In expectation, this estimate is equal to the original function.
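A minimal sketch of this sampling scheme (for simplicity the density g here is discrete, so the integral is a finite signed sum; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
relu = lambda z: np.maximum(0.0, z)

# A toy "infinite-width" network whose density g is discrete, so the integral
# int sigma(w x) g(w) dw becomes a finite signed sum.
ws = np.array([-2.0, -0.5, 1.0, 3.0])
gs = np.array([0.3, -0.6, 0.9, -0.2])            # signed density g(w)
F = lambda x: np.sum(relu(np.outer(x, ws)) * gs, axis=1)

mass = np.sum(np.abs(gs))                         # int |g(w)| dw
m = 200000
idx = rng.choice(len(ws), size=m, p=np.abs(gs) / mass)   # w_j ~ |g| / mass
signs = np.sign(gs[idx])                          # s_j = sgn(g(w_j))

# the estimate (1/m) sum_j s_j * mass * sigma(w_j x) is unbiased for F
Fhat = lambda x: np.mean(signs * mass * relu(np.outer(x, ws[idx])), axis=1)

xs = np.linspace(-1.0, 1.0, 5)
assert np.max(np.abs(F(xs) - Fhat(xs))) < 0.1
```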
Here we will give a more general construction where the integral is not necessarily over the Lebesgue measure, which is useful when the measure has discrete parts or is supported on low-dimensional sets. This section will follow the same approach as (Barron 1993), namely using Maurey’s sampling method (cf. Lemma 3.1 (Maurey (Pisier 1980))), which gives an L_2 error; it is possible to use these techniques to obtain an L_\infty error via the “co-VC dimension technique” (Gurvits and Koiran 1995), but this is not pursued here.
To build this up, first let us formally define these infinite-width networks and their mass.
To develop sampling bounds, first we give the classical general Maurey sampling technique, which is stated as sampling in Hilbert spaces.
Suppose X = \mathop{\mathbb{E}}V, where r.v. V is supported on a set S. A natural way to “simplify” X is to instead consider \hat X := \frac 1 k\sum_{i=1}^k V_i, where (V_1,\ldots,V_k) are sampled iid. We want to argue \hat X \approx X; since we’re in a Hilbert space, we’ll try to make the Hilbert norm \|X - \hat X\| small.
Proof of Lemma 3.1 (Maurey (Pisier 1980)). Let (V_1,\ldots,V_k) be IID as stated. Then \begin{aligned} &\mathop{\mathbb{E}}_{V_1,\ldots,V_k}\left\|{ X - \frac 1 k \sum_i V_i }\right\|^2 \\ & = \mathop{\mathbb{E}}_{V_1,\ldots,V_k}\left\|{ \frac 1 k \sum_i \left({ V_i - X }\right) }\right\|^2 \\ &= \mathop{\mathbb{E}}_{V_1,\ldots,V_k}\frac 1 {k^2} \left[{ \sum_i \left\|{ V_i - X }\right\|^2 + \sum_{i\neq j} \left\langle V_i - X , V_j - X \right \rangle }\right] \\ &= \mathop{\mathbb{E}}_{V}\frac 1 {k} \left\|{ V - X }\right\|^2 \\ &= \mathop{\mathbb{E}}_{V}\frac 1 {k}\left({ \left\|{ V }\right\|^2 - \left\|{ X }\right\|^2 }\right) \\ &\leq \mathop{\mathbb{E}}_{V}\frac 1 {k} \left\|{ V }\right\|^2 \leq \sup_{U\in S}\frac 1 {k} \left\|{ U }\right\|^2. \end{aligned} To conclude, there must exist (U_1,\ldots,U_k) in S so that \left\|{ X - k^{-1} \sum_i U_i }\right\|^2 \leq \mathop{\mathbb{E}}_{V_1,\ldots,V_k}\left\|{ X - k^{-1} \sum_i V_i }\right\|^2. (“Probabilistic method.”)
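A quick empirical illustration of Maurey’s 1/k rate (a sketch; the set size, dimension, and trial counts are arbitrary):

```python
import numpy as np

# Maurey's bound: X = E[V] with V uniform on a finite set S of vectors;
# the k-sample average satisfies E||X - avg||^2 <= sup_{U in S} ||U||^2 / k.
rng = np.random.default_rng(0)
S = rng.normal(size=(20, 10))             # 20 candidate vectors in R^10
X = S.mean(axis=0)                        # the mean being approximated
sup_sq = (np.linalg.norm(S, axis=1) ** 2).max()
for k in (10, 100, 1000):
    errs = [np.linalg.norm(X - S[rng.integers(0, 20, size=k)].mean(axis=0)) ** 2
            for _ in range(2000)]
    mse = float(np.mean(errs))
    assert mse <= sup_sq / k              # empirical MSE obeys the 1/k bound
```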
Now let’s apply this to infinite-width networks in the generality of Definition 3.2. We have two issues to resolve.
Issue 1: what is the appropriate Hilbert space?
Issue 2: our “distribution” on weights is not a probability!
Example: consider x\in[0,1] and \sin(2\pi x) = \int_0^1 \mathbf{1}[x \geq b] 2\pi \cos(2\pi b){\text{d}}b. There are two issues: \int_0^1 |2\pi\cos(2\pi b)|{\text{d}}b \neq 1, and \cos(2\pi b) takes on negative and positive values.
Answer: we’ll correct this in detail shortly, but here is a sketch; recall also the discussion in Definition 3.2 of splitting a measure into positive and negative parts. First, we introduce a fake parameter s\in\{\pm 1\} and multiply \mathbf{1}[x\geq b] with it, simulating positive and negative weights with only positive weights; now our distribution is on pairs (s,b). Secondly, we’ll normalize everything by \int_0^1 |2\pi\cos(2\pi b)|{\text{d}}b.
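A sketch of this fix-up for the \sin(2\pi x) example: the total mass is \int_0^1 |2\pi\cos(2\pi b)|{\text{d}}b = 4, we sample b by rejection from |g|/4, attach the signs s = \textrm{sgn}(\cos(2\pi b)), and average (the width and evaluation grid are arbitrary choices):

```python
import numpy as np

# Finite-width estimate of sin(2 pi x) = \int_0^1 1[x >= b] 2 pi cos(2 pi b) db.
rng = np.random.default_rng(0)
mass = 4.0                                 # \int_0^1 |2 pi cos(2 pi b)| db

def sample_b(n):
    # rejection sampling from the density |2 pi cos(2 pi b)| / 4 on [0, 1]
    out = np.empty(0)
    while out.size < n:
        b = rng.uniform(0, 1, size=4 * n)
        accept = rng.uniform(0, 2 * np.pi, size=4 * n) \
            <= np.abs(2 * np.pi * np.cos(2 * np.pi * b))
        out = np.concatenate([out, b[accept]])
    return out[:n]

m = 20_000
b = sample_b(m)
s = np.sign(np.cos(2 * np.pi * b))         # sign of the region b fell in
x = np.linspace(0, 1, 101)
est = (s[None, :] * mass * (x[:, None] >= b[None, :])).mean(axis=1)
err = float(np.abs(est - np.sin(2 * np.pi * x)).max())
assert err < 0.2                           # shrinks like 1/sqrt(m)
```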
Let’s write a generalized shallow network as x\mapsto \int g(x;w){\text{d}}\mu(w), where \mu is a nonzero signed measure over some abstract parameter space \mathbb{R}^p. E.g., w = (a,b,v) and g(x;w) = a \sigma(v^{\scriptscriptstyle\mathsf{T}}x + b).
Decompose \mu= \mu_+ - \mu_- into nonnegative measures \mu_{\pm} with disjoint support (this is the Jordan decomposition (Folland 1999), which was mentioned in Definition 3.2).
For nonnegative measures, define total mass \|\mu_{\pm}\|_1 = \mu_{\pm}(\mathbb{R}^p), and otherwise \|\mu\|_1 = \|\mu_+\|_1 + \|\mu_-\|_1.
Define {\widetilde{\mu}} to sample s \in \{\pm1\} with \mathop{\textrm{Pr}}[s = +1] = \frac{\|\mu_+\|_1}{\|\mu\|_1}, and then sample w\sim \frac{\mu_{s}}{\|\mu_{s}\|_1}=:\tilde \mu_s, and output \tilde g(\cdot;w,s) = s \|\mu\|_1 g(\cdot;w).
This sampling procedure has the correct mean: \begin{aligned} \int g(x;w) {\text{d}}\mu(w) &= \int g(x;w) {\text{d}}\mu_+(w) - \int g(x;w) {\text{d}}\mu_-(w) \\ &= \|\mu_+\|_1 \mathop{\mathbb{E}}_{\tilde\mu_+} g(x;w) - \|\mu_-\|_1 \mathop{\mathbb{E}}_{\tilde\mu_-} g(x;w) \\ &= \|\mu\|_1\left[{ \mathop{\textrm{Pr}}_{{\widetilde{\mu}}}[s = +1] \mathop{\mathbb{E}}_{{\widetilde{\mu}}_+} g(x;w) - \mathop{\textrm{Pr}}_{{\widetilde{\mu}}}[s = -1] \mathop{\mathbb{E}}_{{\widetilde{\mu}}_-} g(x;w)}\right] = \mathop{\mathbb{E}}_{{\widetilde{\mu}}} \tilde g(x;w,s). \end{aligned}
Proof. By the mean calculation we did earlier, g = \mathop{\mathbb{E}}_{{\widetilde{\mu}}} \|\mu\|_1 s g_w = \mathop{\mathbb{E}}_{{\widetilde{\mu}}} {\tilde{g}}, so by the regular Maurey applied to {\widetilde{\mu}} and Hilbert space L_2(P) (i.e., writing V := {\tilde{g}} and g = \mathop{\mathbb{E}}V), \begin{aligned} \mathop{\mathbb{E}}_{{\tilde{w}}_1,\ldots,{\tilde{w}}_k} \left\|{ g - \frac 1 k \sum_i {\tilde{g}}(\cdot;{\tilde{w}}_i) }\right\|_{L_2(P)}^2 &\leq \frac {\mathop{\mathbb{E}}\|{\tilde{g}}(\cdot;{\tilde{w}})\|_{L_2(P)}^2}{k} \\ &\leq \frac {\sup_{s \in \{\pm 1\}} \sup_{w\in\mathcal{W}} \left\|{\|\mu\|_1 s g(\cdot;w)}\right\|^2_{L_2(P)}}{k} \\ &= \frac {\|\mu\|_1^2 \sup_{w\in\mathcal{W}} \|g(\cdot;w)\|_{L_2(P)}^2}{k}, \end{aligned} and the existence of the fixed (w_i,s_i) is also from Maurey.
In this section we consider networks close to their random initialization. Briefly, the core idea is to compare a network f :\mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R}, which takes input x\in\mathbb{R}^d and has parameters W\in\mathbb{R}^p, to its first-order Taylor approximation at random initialization W_0: f_0(x;W) := f(x;W_0) + \left\langle \nabla f(x;W_0), W-W_0 \right \rangle. The key property of this simplification is that while it is nonlinear in x, it is affine in W, which will greatly ease analysis. This section is roughly organized as follows:
4.1 gives the basic setup in more detail, including the networks considered and the specific random initialization. The study is almost solely on shallow networks, since the deep case currently only leads to a degraded analysis and is not well understood.
4.2 shows that near initialization, with large width, f\approx f_0.
4.3 provides the “kernel view”: since the previous part shows we are effectively linear over some feature space, it is natural to consider the kernel corresponding to that feature space. This provides many connections to the literature (via “neural tangent kernel” (NTK)), and also is used for a short proof that these functions near initialization are already universal approximators!
As explained shortly, we will almost solely consider the shallow case: f(x;W) := \frac 1 {\sqrt m} \sum_{j=1}^m a_j \sigma(w_j^{\scriptscriptstyle\mathsf{T}}x), \qquad W := \begin{bmatrix} \gets w_1^{\scriptscriptstyle\mathsf{T}}\to \\ \vdots\\ \gets w_m^{\scriptscriptstyle\mathsf{T}}\to \end{bmatrix} \in \mathbb{R}^{m\times d}, (2) where \sigma will either be a smooth activation or the ReLU, and we will treat a\in\mathbb{R}^m as fixed and only allow W\in\mathbb{R}^{m\times d} to vary. There are a number of reasons for this exact formalism; they are summarized below in Remark 4.1.
Now let’s consider the corresponding first-order Taylor approximation f_0 in detail. Consider any univariate activation \sigma which is differentiable except on a set of measure zero (e.g., countably many points), and Gaussian initialization W_0\in\mathbb{R}^{m\times d} as before. Consider the Taylor expansion at initialization: \begin{aligned} f_0(x;W) &= f(x;W_0) + \left\langle \nabla f(x;W_0), W-W_0 \right \rangle \\ &= \frac {1}{\sqrt m}\sum_{j=1}^m a_j \left({ \sigma(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x) + \sigma'(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x) x^{\scriptscriptstyle\mathsf{T}}(w_j - w_{0,j}) }\right) \\ &= \frac {1}{\sqrt m}\sum_{j=1}^m a_j \left({ \left[{ \sigma(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x) - \sigma'(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x) w_{0,j}^{\scriptscriptstyle\mathsf{T}}x}\right] + \sigma'(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x) w_j^{\scriptscriptstyle\mathsf{T}}x }\right). \end{aligned} If \sigma is nonlinear, then this mapping is nonlinear in x, despite being affine in W! Indeed \nabla f(\cdot;W_0) defines a feature mapping: \nabla f(x;W_0) := \begin{bmatrix} \gets& a_1 \sigma'(w_{0,1}^{\scriptscriptstyle\mathsf{T}}x) x^{\scriptscriptstyle\mathsf{T}}&\to\\ &\vdots&\\ \gets& a_m \sigma'(w_{0,m}^{\scriptscriptstyle\mathsf{T}}x) x^{\scriptscriptstyle\mathsf{T}}&\to \end{bmatrix}; the predictor f_0 is an affine function of the parameters, and is also affine in this feature-mapped data.
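To make the affine-in-W property concrete, here is a small numerical sketch with \sigma = \tanh (an arbitrary smooth choice; a, W_0, x, and all sizes are also arbitrary):

```python
import numpy as np

# The Taylor approximation f_0 at W_0 is exactly affine in W, though
# nonlinear in x; here sigma = tanh.
rng = np.random.default_rng(0)
m, d = 200, 5
a = rng.choice([-1.0, 1.0], size=m)
W0 = rng.normal(size=(m, d))
x = rng.normal(size=d)
x /= np.linalg.norm(x)

def f(W):
    return (a @ np.tanh(W @ x)) / np.sqrt(m)

def grad_f(W):  # rows a_j sigma'(w_j^T x) x^T / sqrt(m): the feature map
    return (a * (1 - np.tanh(W @ x) ** 2))[:, None] * x[None, :] / np.sqrt(m)

def f0(W):      # first-order Taylor approximation of f at W_0
    return f(W0) + float(np.sum(grad_f(W0) * (W - W0)))

# along any line W_0 + t D, f_0 is exactly linear in t
D = rng.normal(size=(m, d))
vals = [f0(W0 + t * D) for t in (0.0, 0.5, 1.0)]
assert abs(vals[1] - (vals[0] + vals[2]) / 2) < 1e-9
```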
PyTorch initialization defaults to these standard deviations, but uses uniform distributions rather than Gaussians. Lastly, some papers set up their analysis so that the final layer does most of the work (and the training problem is convex for the last layer); we follow the convention of some authors and train all but the last layer, which rules out this possibility.
Our first step is to show that f-f_0 shrinks as m increases, which has a few immediate consequences.
It gives one benefit of “overparameterization.”
It gives us an effective way to do universal approximation with small \|W-W_0\|: we simply make m as large as needed and get more functions inside our RKHS.
First we handle the case that \sigma is smooth, by which we mean \sigma'' exists and satisfies |\sigma''|\leq \beta everywhere. This is not satisfied for the ReLU, but the proof is so simple that it is a good motivator for other cases.
Proof. By Taylor’s theorem, \left|{ \sigma(r) - \sigma(s) - \sigma'(s)(r-s)}\right| = \left|{ \int_s^r \sigma''(z) (r-z){\text{d}}z}\right| \leq \frac {\beta(r-s)^2}{2}. Therefore \begin{aligned} &\left|{f(x;W) - f(x;V) - \left\langle \nabla f(x;V), W-V \right \rangle}\right| \\ &\leq \frac 1 {\sqrt m} \sum_{j=1}^m |a_j| \cdot \left|{ \sigma(w_j^{\scriptscriptstyle\mathsf{T}}x) - \sigma(v_j^{\scriptscriptstyle\mathsf{T}}x) - \sigma'(v_j^{\scriptscriptstyle\mathsf{T}}x)x^{\scriptscriptstyle\mathsf{T}}(w_j - v_j) }\right| \\ &\leq \frac 1 {\sqrt m} \sum_{j=1}^m \frac {\beta (w_j^{\scriptscriptstyle\mathsf{T}}x -v_j^{\scriptscriptstyle\mathsf{T}}x)^2}{2} \\ &\leq \frac {\beta} {2\sqrt m} \sum_{j=1}^m \|w_j-v_j\|^2 \qquad (\text{using } \|x\|\leq 1) \\ &= \frac {\beta} {2\sqrt m} \left\|{W-V}\right\|^2_{{\textrm{F}}}. \end{aligned}
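A numerical check of this bound, taking \sigma = \tanh (so we may take \beta = 4/(3\sqrt 3) < 0.78) and \|x\| = 1, matching the assumption used in the proof; all sizes are arbitrary:

```python
import numpy as np

# Check |f(x;W) - f(x;V) - <grad f(x;V), W-V>| <= beta ||W-V||_F^2 / (2 sqrt(m))
# with sigma = tanh, |sigma''| <= 4/(3 sqrt 3) < 0.78, and ||x|| = 1.
rng = np.random.default_rng(1)
m, d, beta = 300, 8, 0.78
a = rng.choice([-1.0, 1.0], size=m)
x = rng.normal(size=d)
x /= np.linalg.norm(x)

def f(W):
    return (a @ np.tanh(W @ x)) / np.sqrt(m)

def taylor(W, V):  # f(x;V) + <grad f(x;V), W - V>
    sp = 1 - np.tanh(V @ x) ** 2     # sigma'(v_j^T x)
    return f(V) + (a * sp) @ ((W - V) @ x) / np.sqrt(m)

V = rng.normal(size=(m, d))
for scale in (0.01, 0.1, 1.0):
    W = V + scale * rng.normal(size=(m, d))
    lhs = abs(f(W) - taylor(W, V))
    rhs = beta * np.sum((W - V) ** 2) / (2 * np.sqrt(m))
    assert lhs <= rhs
```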
Now we switch to the ReLU. The proof is much more complicated, but is instructive of the general calculations one must perform frequently with the ReLU.
The proof will use the following concentration inequality.
Proof. For any row j, define an indicator random variable P_j := \mathbf{1}[ |w_j^{\scriptscriptstyle\mathsf{T}}x| \leq \tau\|x\| ]. By rotational invariance, P_j = \mathbf{1}[ |w_{j,1}| \leq \tau ], which by the form of the Gaussian density gives \mathop{\textrm{Pr}}[P_j = 1] = \int_{-\tau}^{+\tau} \frac 1 {\sqrt{2\pi}} e^{-z^2/2}{\text{d}}z \leq \frac {2\tau}{\sqrt{2\pi}} \leq \tau. By Hoeffding’s inequality, with probability at least 1-\delta, \sum_{j=1}^m P_j \leq m \mathop{\textrm{Pr}}[P_1 = 1] + \sqrt{\frac {m}{2} \ln \frac 1 \delta} \leq m \tau + \sqrt{\frac m 2 \ln \frac 1 \delta}.
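This lemma is easy to simulate; a sketch with arbitrary m, d, \tau, \delta:

```python
import numpy as np

# For Gaussian rows, Pr[|w^T x| <= tau ||x||] <= 2 tau / sqrt(2 pi) <= tau, so
# by Hoeffding at most m tau + sqrt((m/2) ln(1/delta)) rows are tau-close to
# the hyperplane, with probability at least 1 - delta.
rng = np.random.default_rng(0)
m, d, tau, delta = 10_000, 20, 0.1, 0.01
x = rng.normal(size=d)
W = rng.normal(size=(m, d))
count = int(np.sum(np.abs(W @ x) <= tau * np.linalg.norm(x)))
assert count <= m * tau + np.sqrt(0.5 * m * np.log(1 / delta))
```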
Proof of Lemma 4.1. Fix x\in\mathbb{R}^d. If \|x\|=0, then for any W\in\mathbb{R}^{m\times d}, f(x;W) = 0 = f_0(x;W), and the proof is complete; henceforth consider the case \|x\|>0.
The proof idea is roughly as follows. The Gaussian initialization on W_0 concentrates around a rather large shell, and this implies |w_{0,j}^{\scriptscriptstyle\mathsf{T}}x| is large with reasonably high probability. If \|W-W_0\|_{\textrm{F}} is not too large, then \|w_{j} - w_{0,j}\| must be small for most coordinates; this means that w_j^{\scriptscriptstyle\mathsf{T}}x and w_{0,j}^{\scriptscriptstyle\mathsf{T}}x must have the same sign for most j.
Proceeding in detail, fix a parameter r>0, which will be optimized shortly. Let W be given with \|W-W_0\|\leq B. Define the sets \begin{aligned} S_1 &:= \left\{{ j \in [m] : |w_{j}^{\scriptscriptstyle\mathsf{T}}x| \leq r \|x\| }\right\}, \\ S_2 &:= \left\{{ j \in [m] : \|w_j-w_{0,j}\| \geq r }\right\}, \\ S &:= S_1 \cup S_2. \end{aligned} By Lemma 4.2, with probability at least 1-\delta, |S_1| \leq r m + \sqrt{m \ln(1/\delta)}. On the other hand, B^2 \geq \|W-W_0\|^2 \geq \sum_{j\in S_2} \|w_j-w_{0,j}\|^2 \geq |S_2| r^2, meaning |S_2| \leq B^2 / r^2. For any j\not \in S, if w_j^{\scriptscriptstyle\mathsf{T}}x > 0, then w_{0,j}^{\scriptscriptstyle\mathsf{T}}x \geq w_j^{\scriptscriptstyle\mathsf{T}}x - \|w_j-w_{0,j}\|\cdot\|x\| > \|x\| \left({ r - r }\right) = 0, meaning \mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] = \mathbf{1}[w_{0,j}^{\scriptscriptstyle\mathsf{T}}x \geq 0]; the case that j\not\in S and w_j^{\scriptscriptstyle\mathsf{T}}x < 0 is analogous. Together, |S| \leq r m + \sqrt{m \ln(1/\delta)} + \frac {B^2}{r^2} \quad \textup{and} \quad j\not\in S \Longrightarrow \mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0 ] = \mathbf{1}[w_{0,j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]. Lastly, we can finally choose r to balance terms in |S|: picking r := B^{2/3} / m^{1/3} gives |S| \leq (Bm)^{2/3} + \sqrt{m\ln(1/\delta)} + (Bm)^{2/3} \leq m^{2/3} \left({ 2 B^{2/3} + \sqrt{\ln(1/\delta)}}\right).
Now that |S| has been bounded, the proof considers the two different statements separately, though their proofs are similar.
As in the above remark, \begin{aligned} |f(x;W) - f_0(x;W)| &= \left|{ \left\langle \nabla f(x;W) - \nabla f(x;W_0), W \right \rangle }\right| \\ &=\frac 1 {\sqrt{m}} \left|{\sum_j a_j \left({\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[w_{0,j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]}\right) w_j^{\scriptscriptstyle\mathsf{T}}x}\right| \\ &\leq \frac 1 {\sqrt{m}} \sum_j \left|{\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[w_{0,j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]}\right| \cdot |w_j^{\scriptscriptstyle\mathsf{T}}x|. \end{aligned} To simplify this, as above \left|{\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[w_{0,j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]}\right| is only nonzero for j\in S. But when it is nonzero, this means \textrm{sgn}(w_j^{\scriptscriptstyle\mathsf{T}}x) \neq \textrm{sgn}(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x), and thus |w_j^{\scriptscriptstyle\mathsf{T}}x| \leq |w_j^{\scriptscriptstyle\mathsf{T}}x - w_{0,j}^{\scriptscriptstyle\mathsf{T}}x|, and together with Cauchy-Schwarz (two applications!), and the above upper bound on |S| gives \begin{aligned} |f(x;W) - f_0(x;W)| &\leq \frac 1 {\sqrt{m}} \sum_j \left|{\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[w_{0,j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]}\right| \cdot |w_j^{\scriptscriptstyle\mathsf{T}}x| \\ &\leq \frac 1 {\sqrt{m}} \sum_{j\in S} \left|{w_j^{\scriptscriptstyle\mathsf{T}}x - w_{0,j}^{\scriptscriptstyle\mathsf{T}}x}\right| \\ &\leq \frac 1 {\sqrt{m}} \sum_{j\in S} \left\|{w_j - w_{0,j}}\right\| \\ &\leq \frac 1 {\sqrt{m}} \sqrt{|S|} \cdot \|W - W_0\|_{{\textrm{F}}} \\ &\leq B \sqrt{\frac{|S|} m} \\ &\leq B \sqrt{\frac {2 B^{2/3} + \sqrt{\ln(1/\delta)}} {m^{1/3}}} \\ &\leq \frac {2 B^{4/3} + B \ln(1/\delta)^{1/4}}{m^{1/6}}. \end{aligned}
Following similar reasoning, \begin{aligned} & \left|{f(x;V) - \left({f(x;W) + \left\langle \nabla_W f(x;W), V-W \right \rangle}\right)}\right| \\ &= \left|{ \left\langle \nabla f(x;V) - \nabla f(x;W), V \right \rangle }\right| \\ &=\frac 1 {\sqrt{m}} \left|{\sum_j a_j \left({\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[v_{j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]}\right) v_j^{\scriptscriptstyle\mathsf{T}}x}\right| \\ &\leq \frac 1 {\sqrt{m}} \sum_j \left|{\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[v_{j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]}\right|\cdot \left|{ w_j^{\scriptscriptstyle\mathsf{T}}x - v_j^{\scriptscriptstyle\mathsf{T}}x}\right| \\ &\leq \frac 1 {\sqrt{m}} \sum_j \left|{\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[v_{j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]}\right|\cdot \|w_j - v_j\|, \end{aligned} where the second-to-last step uses that when the indicators differ, the signs of w_j^{\scriptscriptstyle\mathsf{T}}x and v_j^{\scriptscriptstyle\mathsf{T}}x differ, whereby |v_j^{\scriptscriptstyle\mathsf{T}}x| \leq |w_j^{\scriptscriptstyle\mathsf{T}}x - v_j^{\scriptscriptstyle\mathsf{T}}x|. Now define S_3 analogously to S_2, but for the new matrix V: S_3 := \left\{{ j \in [m] : \|v_j-w_{0,j}\| \geq r }\right\}, and additionally define S_4 := S_1 \cup S_2 \cup S_3. By the earlier choice of r and related calculations, with probability at least 1-\delta, |S_4| \leq r m + \sqrt{m \ln(1/\delta)} + \frac {2B^2}{r^2} \leq m^{2/3} \left({ 3 B^{2/3} + \sqrt{\ln(1/\delta)}}\right). Plugging this back in and continuing as before, \begin{aligned} \left|{ \left\langle \nabla f(x;V) - \nabla f(x;W), V \right \rangle }\right| &\leq \frac 1 {\sqrt{m}} \sum_j \left|{\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[v_{j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]}\right|\cdot \|w_j - v_j\| \\ &\leq \frac 1 {\sqrt{m}} \sum_{j\in S_4} \|w_j - v_j\| \\ &\leq \frac 1 {\sqrt{m}} \sqrt{|S_4|} \|V - W\|_{{\textrm{F}}} \\ &\leq 2B \sqrt{\frac {3B^{2/3} + \sqrt{\ln(1/\delta)}}{m^{1/3}}} \\ &\leq \frac {6B^{4/3} + 2B\ln(1/\delta)^{1/4}}{m^{1/6}}. \end{aligned}
So far, we’ve said that f-f_0 is small when the width is large. Now we will focus on f_0, showing that it is a large class of functions; thus, when the width is large, f obtained with small \|W-W_0\|_{\textrm{F}} can also capture many functions.
To start, let us see how to define a kernel. In the standard kernel setup, the kernel can be written as the inner product between feature mappings for two data points: \begin{aligned} k_m(x,x') &:= \left\langle \nabla f(x;W_0), \nabla f(x';W_0) \right \rangle \\ &= \left\langle \begin{bmatrix} \longleftarrow &a_1 x^{\scriptscriptstyle\mathsf{T}}\sigma'( w_{1,0}^{\scriptscriptstyle\mathsf{T}}x) / \sqrt{m} &\longrightarrow \\ &\vdots& \\ \longleftarrow &a_m x^{\scriptscriptstyle\mathsf{T}}\sigma'( w_{m,0}^{\scriptscriptstyle\mathsf{T}}x) / \sqrt{m}& \longrightarrow \end{bmatrix} , \begin{bmatrix} \longleftarrow &a_1 (x')^{\scriptscriptstyle\mathsf{T}}\sigma'( w_{1,0}^{\scriptscriptstyle\mathsf{T}}x') / \sqrt{m} &\longrightarrow \\ &\vdots& \\ \longleftarrow &a_m (x')^{\scriptscriptstyle\mathsf{T}}\sigma'( w_{m,0}^{\scriptscriptstyle\mathsf{T}}x') / \sqrt{m}& \longrightarrow \end{bmatrix} \right \rangle \\ &= \frac 1 m \sum_{j=1}^m a_j^2 \left\langle x \sigma'( w_{j,0}^{\scriptscriptstyle\mathsf{T}}x) , x' \sigma'( w_{j,0}^{\scriptscriptstyle\mathsf{T}}x') \right \rangle \\ &= x^{\scriptscriptstyle\mathsf{T}}x' \left[{ \frac 1 m \sum_{j=1}^m \sigma'( w_{j,0}^{\scriptscriptstyle\mathsf{T}}x) \sigma'( w_{j,0}^{\scriptscriptstyle\mathsf{T}}x') }\right]. \end{aligned} This gives one justification of the 1/\sqrt{m} factor: now this kernel is an average and not a sum, and we should expect it to have a limit as m\to \infty. To this end, and noting that the rows (w_{0,j}^{\scriptscriptstyle\mathsf{T}})_{j=1}^m are iid, then each term of the summation is iid, so by the SLLN, almost surely k_m(x,x') \xrightarrow{m\to\infty} k(x,x') := x^{\scriptscriptstyle\mathsf{T}}x' \mathop{\mathbb{E}}_w \left[{ \sigma'(w^{\scriptscriptstyle\mathsf{T}}x) \sigma'(w^{\scriptscriptstyle\mathsf{T}}x') }\right]. In homework we will (a) provide a more explicit form as dot product kernel, and (b) bound the difference exactly. add explicit ref.
For now, let us calculate the closed form for the ReLU; let’s do this geometrically. need to include picture proof
Consider the plane spanned by x and x'. Since projections of standard Gaussians are again standard Gaussians, we can consider a Gaussian random vector v\in\mathbb{R}^2 in this plane.
The integrand in the expectation is 1 iff v^{\scriptscriptstyle\mathsf{T}}x \geq 0 and v^{\scriptscriptstyle\mathsf{T}}x'\geq 0. Since \|v\| does not affect these expressions, we can simplify v\in\mathbb{R}^2 further to be sampled uniformly from the surface of the sphere.
Suppose \|x\| = 1 = \|x'\|, and define \theta := \arccos\left({ x^{\scriptscriptstyle\mathsf{T}}x'}\right); then the integrand is 1 if v has positive inner product with both x and x', which has probability \frac {\pi - \theta}{2\pi}.
Together, still using \|x\| = 1 = \|x'\|, \begin{aligned} k(x,x') &= x^{\scriptscriptstyle\mathsf{T}}x' \mathop{\mathbb{E}}_w \mathbf{1}[ w^{\scriptscriptstyle\mathsf{T}}x \geq 0 ]\cdot \mathbf{1}[w^{\scriptscriptstyle\mathsf{T}}x' \geq 0] = x^{\scriptscriptstyle\mathsf{T}}x' \left({ \frac {\pi - \arccos(x^{\scriptscriptstyle\mathsf{T}}x')}{2\pi} }\right). \end{aligned}
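A Monte Carlo sketch comparing the width-m kernel k_m to the (\pi-\theta)/(2\pi) probability from the geometric argument above, for the ReLU (\sigma'(z) = \mathbf{1}[z\geq 0]) with a_j^2 = 1 and unit-norm inputs; the width and angle are arbitrary choices:

```python
import numpy as np

# Empirical NTK k_m(x, x') for the ReLU versus its infinite-width limit:
# with ||x|| = ||x'|| = 1 at angle theta, the indicator product has
# expectation (pi - theta) / (2 pi).
rng = np.random.default_rng(0)
m, d = 200_000, 3
x = np.array([1.0, 0.0, 0.0])
xp = np.array([np.cos(1.0), np.sin(1.0), 0.0])    # theta = 1 radian
W0 = rng.normal(size=(m, d))
k_m = (x @ xp) * np.mean((W0 @ x >= 0) & (W0 @ xp >= 0))
k_inf = (x @ xp) * (np.pi - 1.0) / (2 * np.pi)
assert abs(k_m - k_inf) < 0.01
```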
Now let’s return to the task of assessing how many functions we can represent near initialization. For this part, we will fix one degree of freedom in the data to effectively include a bias term; this is not necessary, but gives a shorter proof by reducing to standard kernel approximation theorems. We will show that this class is a universal approximator. Moreover \|W-V\| will correspond to the RKHS norm, thus by making the width large, we can approximate elements of this large RKHS arbitrarily finely.
Proceeding in detail, first let’s define our domain \mathcal{X}:= \left\{{ x \in \mathbb{R}^d : \|x\| = 1, x_d = 1/\sqrt{2} }\right\}, and our predictors \mathcal{H}:= \left\{{ x\mapsto \sum_{j=1}^m \alpha_j k(x,x_j) \ : \ m\geq 0, \alpha_j \in \mathbb{R}, x_j \in \mathcal{X}}\right\}. This might look fancy, but is the same as the functions we get by starting with x \mapsto \left\langle \nabla f(x;W_0), W-W_0 \right \rangle and allowing the width to go to infinity, and \|W-W_0\| be arbitrarily large; by the results in section 4.2, we can always choose an arbitrarily large width so that f - f_0\approx 0 even when \|W-W_0\| is large, and we will also show that large width approximates infinite width in the homework. As such, it suffices to show that \mathcal{H} is a universal approximator over \mathcal{X}.
Proof. Consider the set U := \{ u\in\mathbb{R}^{d-1} : \|u\|^2\leq 1/2\}, and the kernel function k(u,u') := f(u^{\scriptscriptstyle\mathsf{T}}u'), \qquad f(z) := \frac {(z+1/2)}{2} - \frac {(z+1/2)\arccos(z+1/2)}{2\pi}. We will show that this kernel is a universal approximator over U, which means it is also a universal approximator on its boundary \{u\in\mathbb{R}^{d-1} : \|u\|^2 = 1/2\}, and thus the kernel (x,x')\mapsto \frac {x^{\scriptscriptstyle\mathsf{T}}x'}{2} - \frac {x^{\scriptscriptstyle\mathsf{T}}x' \arccos(x^{\scriptscriptstyle\mathsf{T}}x')}{2\pi} is a universal approximator over \mathcal{X}.
Going back to the original claim, first note that \arccos has the Maclaurin series \arccos(z) = \frac \pi 2 - \sum_{k\geq 0} \frac {(2k)!}{2^{2k}(k!)^2}\left({\frac{z^{2k+1}}{2k+1}}\right), which is convergent for z\in[-1,+1]. From here, it can be checked that f has a Maclaurin series where every term is not only nonzero, but positive (adding the bias ensured this). This suffices to ensure that k is a universal approximator (Corollary 4.57, Steinwart and Christmann 2008).
We have not quite closed the loop, as we have not combined the pieces to show that for any continuous function g, we can select a large width m and W so that g \approx f_0(\cdot;W) \approx f(\cdot;W), but we’ve done most of the work, and a few remaining steps will be in homework. For a direct argument about this using a different approach based on (Barron 1993), see (Ji, Telgarsky, and Xian 2020).
So far we have given no compelling presentation of depth; in particular we have not justified the high depths used in practice.
In this section, we will give constructions of interesting functions computed by deep networks which cannot be approximated by polynomially-sized shallow networks. These are only constructions, and it is unlikely that these network structures are found by gradient descent and other practical methods, so the general question of justifying the high depth and particular architectures used in practice is still open.
There are four subsections to these notes.
First we will construct a simple piecewise-affine function, \Delta:\mathbb{R}\to\mathbb{R}, which will be our building block of more complex behavior. When \Delta is composed with itself, it builds complexity exponentially fast in a variety of natural notions (e.g., exponentially many copies of itself).
Then we will show that \Delta^{L^2} can be easily written as a deep but constant width network, whereas a shallow network needs exponential width even for approximation within a constant.
Then we will use \Delta^L to approximate x^2; this is meaningful because it leads to many other approximations, and may seem more natural than \Delta^L.
Lastly we will use x^2 to approximate polynomials and Taylor expansions (Sobolev spaces).
Consider the \Delta function: \Delta(x) = 2\sigma_{\textrm{r}}(x) - 4\sigma_{\textrm{r}}(x-1/2) + 2\sigma_{\textrm{r}}(x-1) = \begin{cases} 2x&x \in [0,1/2),\\ 2-2x& x \in [1/2,1),\\ 0 & \text{otherwise}. \end{cases} How does \Delta look? And how about \Delta^2 := \Delta\circ \Delta? And \Delta^3? Picture drawn in class; figures forthcoming.
The pattern is that \Delta^L has 2^{L-1} copies of \Delta, uniformly shrunk down. In a sense, complexity has increased exponentially as a function of the number of nodes and layers (both \mathcal{O}(L)). Later, it will matter that we not only have many copies, but that they are identical (giving uniform spacing). For now, here’s one way to characterize this behavior.
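Here is a numerical sketch of this self-similarity, writing \Delta with three ReLUs and checking the identity \Delta^L(x) = \Delta(\left\langle 2^{L-1} x \right\rangle) on a grid (this identity is the content of Proposition 5.1 below):

```python
import numpy as np

# Delta as a width-3 ReLU combination, composed L times; the result equals
# Delta applied to the fractional part of 2^{L-1} x, i.e. 2^{L-1} shrunken copies.
relu = lambda z: np.maximum(z, 0.0)
Delta = lambda x: 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def Delta_pow(x, L):
    for _ in range(L):
        x = Delta(x)
    return x

x = np.linspace(0, 1, 1001)
for L in (1, 2, 3, 5):
    assert np.allclose(Delta_pow(x, L), Delta((2 ** (L - 1) * x) % 1.0))
```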
Let \left\langle x \right\rangle = x - \lfloor x\rfloor denote fractional part.
Proof of Proposition 5.1. The proof proceeds by induction on the number of compositions i.
For the base case i=1, if x\in[0,1) then directly \Delta^1(x) = \Delta(x) = \Delta(\left\langle x \right\rangle) = \Delta(\left\langle 2^0 x \right\rangle), whereas x=1 means \Delta^1(x) = \Delta(0) = \Delta(\left\langle 2^0 x \right\rangle).
For the inductive step, consider \Delta^{i+1}. The proof can proceed by peeling individual \Delta from the left or from the right; the choice here is to peel from the right. Consider two cases.
This section will establish the following separation between constant-width deep networks and subexponential width shallow networks.
Previously, we used L_2 and L_\infty to state good upper bounds on approximation; for bad approximation, we want to argue there is a large region where we fail, not just a few points, and that’s why we use an L_1 norm.
To be able to argue that such a large region exists, we don’t just need the hard function f = \Delta^{L^2+2} to have many regions, we need them to be regularly spaced, and not bunch up. In particular, if we replaced \Delta with the similar function 4x(1-x), then this proof would need to replace \frac 1 {32} with something decreasing with L.
Proof plan for Theorem 5.1 ((Telgarsky 2015, 2016)):
(Shallow networks have low complexity.) First we will upper bound the number of oscillations in ReLU networks. The key part of the story is that oscillations will grow polynomially in width, but exponentially in depth. give explicit lemma ref
(There exists a regular, high complexity deep network.) Then we will show there exists a function, realized by a slightly deeper network, which has many oscillations, which are moreover regularly spaced. The need for regular spacing will be clear at the end of the proof. We have already handled this part of the proof: the hard function is \Delta^{L^2+2}.
Lastly, we will use a region-counting argument to combine the preceding two facts to prove the theorem. This step would be easy for the L_\infty norm, and takes a bit more effort for the L_1 norm.
Proceeding with the proof, first we want to argue that shallow networks have low complexity. Our notion of complexity is simply the number of affine pieces.
Our proof will proceed by induction, using the following combination rules for piecewise affine functions.
N_A(f+g) \leq N_A(f) + N_A(g).
N_A(\sum_i a_i g_i + b) \leq \sum_i N_A(g_i).
N_A(f \circ g) \leq N_A(f) \cdot N_A(g).
N_A\left({ x \mapsto f(\sum_i a_i g_i(x) + b)}\right) \leq N_A(f)\sum_i N_A(g_i).
Proof of Lemma 5.2.
Draw f and g, with vertical bars at the right boundaries of affine pieces. There are \leq N_A(f) + N_A(g) -1 distinct bars, and f+g is affine between each adjacent pair of bars.
N_A(a_i g_i) \leq N_A(g_i) (with equality if a_i\neq 0), thus induction with the preceding item gives N_A(\sum_i a_i g_i) \leq \sum_i N_A(g_i), and N_A doesn’t change with addition of constants.
Let P_A(g) denote the pieces of g, and fix some U\in P_A(g); g is a fixed affine function along U. U is an interval, and consider the pieces of f_{|g(U)}; for each T\in P_A(f_{|g(U)}), f is affine, thus f\circ g is affine (along U \cap g_{|U}^{-1}(T)), and the total number of pieces is \sum_{U \in P_A(g)} N_A(f_{|g(U)}) \leq \sum_{U \in P_A(g)} N_A(f) \leq N_A(g) \cdot N_A(f).
Combine the preceding two.
Proof of Lemma 5.1.
To prove the second from the first, N_A(f) \leq 2^L \prod_{j\leq L} m_j, and by Jensen’s inequality (concavity of \ln), \prod_{j\leq L} m_j = \exp\left({ L \cdot \frac 1 L \sum_{j\leq L} \ln m_j }\right) \leq \exp\left({ L \ln \left({ \frac 1 L \sum_{j\leq L} m_j }\right) }\right) = \left({\frac m L}\right)^L. For the first, proceed by induction on layers. The base case is layer 0, which maps the data with the identity, thus N_A(g) = 1. For the inductive step, given g in layer i+1 which takes (g_1,\ldots,g_{m_i}) from the previous layer as input, \begin{aligned} N_A(g) &= N_A(\sigma( b + \sum_j a_j g_j)) \leq 2 \sum_{j=1}^{m_i} N_A(g_j) \\ & \leq 2 \sum_{j=1}^{m_i} 2^i \prod_{k < i} m_k = 2^{i+1} m_i \cdot \prod_{k < i} m_k. \end{aligned}
This completes part 1 of our proof plan, upper bounding the number of affine pieces polynomially in width and exponentially in depth.
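A numerical sketch of the piece-counting bound: count slope changes of a random univariate ReLU network on a fine grid and compare to 2^L \prod_j m_j (the widths, grid, and tolerance are arbitrary choices; the run-merging below avoids double-counting a breakpoint that falls between grid points):

```python
import numpy as np

# Count the affine pieces of a random 1-input ReLU network with hidden
# widths [6, 6]; the lemma's bound is 2^L * prod_j m_j = 144.
rng = np.random.default_rng(0)
widths = [6, 6]                              # L = 2 hidden layers
Ws, bs, fan_in = [], [], 1
for m_j in widths + [1]:
    Ws.append(rng.normal(size=(fan_in, m_j)))
    bs.append(rng.normal(size=m_j))
    fan_in = m_j

def g(x):
    h = x[:, None]
    for i, (W, b) in enumerate(zip(Ws, bs)):
        h = h @ W + b
        if i < len(widths):                  # ReLU on hidden layers only
            h = np.maximum(h, 0.0)
    return h[:, 0]

x = np.linspace(-10, 10, 20_001)
slopes = np.diff(g(x)) / np.diff(x)
flag = np.abs(np.diff(slopes)) > 1e-3        # grid intervals where slope changes
runs = int(flag[0]) + int(np.sum(flag[1:] & ~flag[:-1]))
pieces = 1 + runs
bound = 2 ** len(widths) * int(np.prod(widths))
assert pieces <= bound
```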
The second part of the proof was to argue that \Delta^L gives a high complexity, regular function: we already provided this in Proposition 5.1, which showed that \Delta^L gives exactly 2^{L-1} copies of \Delta, each shrunken uniformly by a factor of 2^{L-1}.
The third part is a counting argument which ensures the preceding two imply the claimed separation in L_1 distance; details are as follows.
Proof of Theorem 5.1 ((Telgarsky 2015, 2016)).
The proof proceeds by “counting triangles.”
Draw the line x\mapsto 1/2 (as in the figure). The “triangles” are formed by seeing how this line intersects f = \Delta^{L^2+2}. There are 2^{L^2 +1} copies of \Delta, which means 2^{L^2 + 2} - 1 (half-)triangles since we get two (half-)triangles for each \Delta but one is lost on the boundary of [0,1]. Each (half-)triangle has area \frac 1 4 \cdot \frac {1}{2^{L^2+2}} = 2^{-L^2 - 4}.
We will keep track of when g passes above and below this line; when g is above the line, we will count the triangles below it, and when g is below, we will count the triangles above it. Summing the areas of these triangles gives a lower bound on \int_{[0,1]} |f-g|.
Using the earlier lemma, g has N_A(g) \leq (2 \cdot 2^L / L)^L \leq 2^{L^2}.
For each piece, we shouldn’t count the triangles at its right endpoint, or if it crosses the line, and we also need to divide by two since we’re only counting triangles on one side; together \begin{aligned} \int_{[0,1]}|f-g| & \geq \textup{[number of surviving triangles]}\cdot\textup{[area of triangle]} \\ &\geq\frac 1 2 \left[{2^{L^2+2} - 1 - 2\cdot 2^{L^2}}\right]\cdot \left[{2^{-L^2-4}}\right] \\ &= \frac 1 2 \left[{2^{L^2+1} - 1}\right]\cdot \left[{2^{-L^2-4}}\right] \\ &\geq \frac 1 {32}. \end{aligned}
Why x^2?
Why it should be easy: because x^2 = \int_0^\infty 2 \sigma(x-b) {\text{d}}b, so we need only to uniformly place ReLUs.
We’ll use an approximate construction due to (Yarotsky 2016). It will need only {\operatorname{poly\,log}}(1/\epsilon) nodes and depth to be \epsilon-close!
By contrast, our shallow univariate approximation theorems needed 1/\epsilon nodes.
Why we care: with x^2, polarization gives us multiplication: xy = \frac 1 2 \left({ (x+y)^2 - x^2 - y^2 }\right). From that, we get monomials, polynomials, Taylor expansions.
Define S_i := \left({ \frac 0 {2^i}, \frac 1 {2^i}, \ldots, \frac {2^i}{2^i} }\right); let h_i be the linear interpolation of x^2 on S_i.
Thus:
h_i = h_{i+1} on S_i.
For x\in S_{i+1}\setminus S_i, defining \epsilon= 2^{-i-1}, \begin{aligned} h_i(x) - h_{i+1}(x) &= \frac 1 2 \left({h_i(x-\epsilon) + h_i(x+\epsilon)}\right) - h_{i+1}(x) \\ &= \frac 1 2 \left({(x-\epsilon)^2 + (x+\epsilon)^2}\right) - x^2 = \epsilon^2. \end{aligned} Key point: no dependence on x!
Thus, for any x\in S_{i+1}, h_{i+1}(x) = h_{i}(x) - \frac {1}{4^{i+1}} \mathbf{1}[x\in S_{i+1} \setminus S_i]
Since h_{i+1} linearly interpolates, then h_{i+1} - h_i must also linearly interpolate. The linear interpolation of \mathbf{1}[x\in S_{i+1}\setminus S_i] is \Delta^{i+1} ! Thus h_{i+1} = h_i - \frac {\Delta^{i+1}}{4^{i+1}}.
Since h_0(x) = x, then h_i(x) = x - \sum_{j=1}^i \frac {\Delta^j(x)}{4^j}.
Proof.
The interpolation property comes from construction/definition.
Since h_i = x - \sum_{j=1}^i \frac{\Delta^{j}}{4^j}, and since \Delta^j requires 3 nodes and 2 layers for each new power, a worst case construction would need 2i layers and 3 \sum_{j\leq i} j = \mathcal{O}(i^2) nodes. We can instead reuse the individual \Delta elements across the powers, and thus need only 3i nodes, though the network has “skip connections” (in the ResNet sense). Alternatively, we can replace the skip connections with a single extra node per layer which accumulates the output, so that after layer j the accumulator outputs h_j; this suffices since h_{j+1} - h_j = -\Delta^{j+1} / 4^{j+1}.
Fix i, and set \tau := 2^{-i}, meaning \tau is the distance between interpolation points. Since the linear interpolant lies above x^2, the error between x^2 and h_i is thus bounded above by \begin{aligned} & \sup_{x\in [0,1-\tau]} \sup_{z \in [0,\tau]} \frac {\tau - z} \tau \left({x^2}\right) + \frac {z}{\tau}\left({x+\tau}\right)^2 - (x+z)^2 \\ &= \frac 1 \tau \sup_{x\in [0,1-\tau]} \sup_{z \in [0,\tau]} \left({2xz\tau + z\tau^2 - 2xz\tau - \tau z^2}\right) \\ &= \sup_{z \in [0,\tau]} z(\tau - z) = \frac {\tau^2}{4} = 4^{-i-1}, \end{aligned} where the supremum over z is attained at z = \tau/2.
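The whole construction can be simulated directly; the sketch below (a hypothetical `tent` helper standing in for \Delta) evaluates h_i = x - \sum_j \Delta^j/4^j and checks both the interpolation property and the 4^{-i-1} error bound:

```python
def tent(x):
    # Delta(x) = 2x on [0, 1/2] and 2 - 2x on [1/2, 1]; composing gives Delta^j
    return 2.0 * x if x <= 0.5 else 2.0 - 2.0 * x

def h(i, x):
    # h_i(x) = x - sum_{j=1}^i Delta^j(x) / 4^j
    out, t = x, x
    for j in range(1, i + 1):
        t = tent(t)               # t = Delta^j(x)
        out -= t / 4 ** j
    return out

def sup_error(i, grid=10001):
    # sampled sup of |h_i - x^2| over [0, 1]
    return max(abs(h(i, k / (grid - 1)) - (k / (grid - 1)) ** 2)
               for k in range(grid))
```

The sampled sup should sit essentially at 4^{-i-1}, attained at midpoints of the dyadic grid.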
By a bound from last lecture, N_A(f) \leq (2N/L)^L. Using a symbolic package to differentiate, for any interval [a,b], \min_{(c,d)}\int_{[a,b]} (x^2 - (cx+d))^2{\text{d}}x = \frac {(b-a)^5}{180}. Let S index the subintervals of length at least 1/(2N) with N:=N_A(f), and restrict attention to [0,1]. Then \sum_{[a,b]\in S} (b-a) = 1 - \sum_{[a,b]\not\in S} (b-a) \geq 1 - N/(2N) = 1/2. Consequently, \begin{aligned} \int_{[0,1]} (x^2 - f(x))^2 {\text{d}}x &= \sum_{[a,b]\in P_A(f)} \int_{[a,b]\cap[0,1]} (x^2 - f(x))^2 {\text{d}}x \\ &\geq \sum_{[a,b]\in S} \frac {(b - a)^5}{180} \\ &\geq \sum_{[a,b]\in S} \frac {(b - a)}{2880 N^4} \geq \frac {1}{5760 N^4}. \end{aligned}
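The constant 1/180 above can be sanity-checked numerically; this sketch (helper name hypothetical) fits the best affine function to x^2 on [a,b] by discretized least squares and returns the integrated squared error:

```python
def affine_l2_error(a, b, n=20000):
    # midpoint-Riemann least-squares fit of c*x + d to x^2 on [a, b],
    # returning the integrated squared error; should be (b-a)^5 / 180
    dx = (b - a) / n
    xs = [a + (i + 0.5) * dx for i in range(n)]
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sxxx = sum(x ** 3 for x in xs)
    # normal equations: [[sxx, sx], [sx, n]] (c, d)^T = (sxxx, sxx)
    det = sxx * n - sx * sx
    c = (sxxx * n - sx * sxx) / det
    d = (sxx * sxx - sx * sxxx) / det
    return sum((x * x - c * x - d) ** 2 for x in xs) * dx
```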
From squaring we can get many other things (still with \mathcal{O}(\ln(1/\epsilon)) depth and size).
Multiplication (via “polarization”): (x,y)\mapsto xy = \frac 1 2 \left({ (x+y)^2 - x^2 - y^2 }\right).
Multiplication gives polynomials.
\frac 1 x and rational functions (Telgarsky 2017).
Functions with “nice Taylor expansions” (Sobolev spaces) (Yarotsky 2016); though now we’ll need size bigger than \ln \frac 1 {\epsilon}:
First we approximate each function locally with a polynomial.
We multiply each local polynomial by a bump ((Yarotsky 2016) calls the family of bumps a “partition of unity”).
This was also reproved and connected to statistics questions by (Schmidt-Hieber 2017).
Here we will continue and give a version of Yarotsky’s main consequence of the approximation of x^2: approximating functions with many bounded derivatives (by approximating their Taylor expansions); formally, an approximation result against a Sobolev ball in function space.
The proof consists of the following pieces:
Functions in Sobolev space are locally well-approximated by their Taylor expansions; therefore we will expand the approximation of x^2 to give approximation of general monomials in Lemma 5.4.
These Taylor approximations really only work locally. Therefore we need a nice way to switch between different Taylor expansions in different parts of [0,1]^d. This leads to the construction of a partition of unity, one of the other very interesting ideas in (Yarotsky 2016) (in addition to the construction of x^2); this is done below in Lemma 5.5.
First we use squaring to obtain multiplication.
Proof. The proof first handles the case l=2 directly, and uses l-1 copies of \text{prod}_{k,{2}} for the general case.
As such, for (a,b)\in\mathbb{R}^2, define \text{prod}_{k,{2}}(a,b) := \frac 1 2 \left({ 4 h_k((a+b)/2) - h_k(a) - h_k(b) }\right). The size of this network follows from the size of h_k given in Theorem 5.2 (roughly following (Yarotsky 2016)), and \text{prod}_{k,{2}}(a,b) = 0 when either argument is 0 since h_k(0) = 0. For the approximation guarantee, since every argument to each h_k is within [0,1], then Theorem 5.2 (roughly following (Yarotsky 2016)) holds, and using the polarization identity to rewrite a\cdot b gives \begin{aligned} 2 |\text{prod}_{k,{2}}(a,b) - ab| &= 2 |\text{prod}_{k,{2}}(a,b) - \frac 1 2 ((a+b)^2 - a^2 - b^2)| \\ &\leq 4 | h_k((a+b)/2) - ((a+b)/2)^2 | + |h_k(a) - a^2| + |h_k(b) - b^2| \\ &\leq 4\cdot 4^{-k-1} + 4^{-k-1} + 4^{-k-1} \leq 2\cdot{}4^{-k}. \end{aligned}
Now consider the case \text{prod}_{k,{i}} for i > 2: this network is defined via \text{prod}_{k,{i}}(x_1,\ldots,x_i) := \text{prod}_{k,{2}}( \text{prod}_{k,{i-1}}(x_1,\ldots,x_{i-1}), x_i). It is now shown by induction that this network has \mathcal{O}(ki + i^2) nodes and \mathcal{O}(ki) layers, that it evaluates to 0 when any argument is zero, and lastly satisfies the error guarantee \left|{\text{prod}_{k,{i}}(x_{1:i}) - \prod_{j=1}^i x_j }\right| \leq i 4^{-k}. The base case i=2 uses the explicit \text{prod}_{k,{2}} network and guarantees above, thus consider i>2. The network embeds \text{prod}_{k,{i-1}} and another copy of \text{prod}_{k,{2}} as subnetworks, but additionally must pass the input x_i forward, thus requires \mathcal{O}(ki) layers and \mathcal{O}(ki + i^2) nodes, and evaluates to 0 if any argument is 0 by the guarantees on \text{prod}_{k,{2}} and the inductive hypothesis. For the error estimate, \begin{aligned} \left|{\text{prod}_{k,{i}}(x_1,\ldots,x_i) - \prod_{j=1}^i x_j}\right| &\leq \left|{ \text{prod}_{k,{2}}( \text{prod}_{k,{i-1}}(x_1,\ldots,x_{i-1}), x_i) - x_i \text{prod}_{k,{i-1}}(x_1,\ldots,x_{i-1})}\right| \\ &+\left|{ x_i \text{prod}_{k,{i-1}}(x_1,\ldots,x_{i-1}) - x_i \prod_{j=1}^{i-1} x_j}\right| \\ &\leq 4^{-k} + |x_i| \cdot \left|{\text{prod}_{k,{i-1}}(x_1,\ldots,x_{i-1}) - \prod_{j=1}^{i-1} x_j}\right| \\ &\leq 4^{-k} + |x_i| \cdot \left({(i-1) 4^{-k}}\right) \leq i 4^{-k}. \end{aligned}
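The \text{prod}_{k,{2}} and \text{prod}_{k,{i}} constructions can be simulated directly, reusing the tent-map form of h_k from the squaring construction (an assumption carried over from above; all helper names are hypothetical):

```python
def tent(x):
    # the triangle map Delta on [0, 1]
    return 2.0 * x if x <= 0.5 else 2.0 - 2.0 * x

def h(k, x):
    # h_k(x) = x - sum_{j=1}^k Delta^j(x) / 4^j, with |h_k(x) - x^2| <= 4^{-k-1}
    out, t = x, x
    for j in range(1, k + 1):
        t = tent(t)
        out -= t / 4 ** j
    return out

def prod2(k, a, b):
    # prod_{k,2}(a, b) = (4 h_k((a+b)/2) - h_k(a) - h_k(b)) / 2, error <= 4^{-k}
    return 0.5 * (4 * h(k, (a + b) / 2) - h(k, a) - h(k, b))

def prod_n(k, xs):
    # prod_{k,i}: fold prod_{k,2} left to right; error <= i * 4^{-k} on [0, 1]^i
    out = xs[0]
    for x in xs[1:]:
        out = prod2(k, out, x)
    return out
```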
From multiplication we get monomials.
Proof. \text{mono}_{k,r} consists of N parallel networks, one for each monomial. As such, given any \vec{\alpha} of degree q\leq r, to define coordinate \vec\alpha of \text{mono}_{k,r}, first rewrite \vec\alpha as a vector v \in \{1,\ldots,d\}^q, whereby x^{\vec\alpha} := \prod_{i=1}^q x_{v_i}. Define \text{mono}_{k,r}(x)_{\vec\alpha} := \text{prod}_{k,{q}}(x_{v_1},\ldots,x_{v_q}), whereby the error estimate follows from Lemma 5.3, and the size estimate follows by multiplying the size estimate from Lemma 5.3 by N, and noting N \leq d^r.
Next we construct the approximate partition of unity.
Proof. Set N : = (s+1)^d, and let S be any enumeration of the vectors in the grid \{0,1/s,\ldots,s/s\}^d. Define first a univariate bump function h(a) := \sigma(sa + 1) - 2\sigma(sa) + \sigma(sa - 1) = \begin{cases} 1 + sa & a \in [-1/s,0), \\ 1 - sa & a \in [0, 1/s], \\ 0 &\text{o.w.}. \end{cases} For any v \in S, define f_v(x) := \text{prod}_{k,{d}}(h(x_1-v_1), \dots, h(x_d - v_d)). By Lemma 5.3, \sup_{x\in[0,1]^d} |f_v(x) - \prod_{j=1}^d h(x_j - v_j)| \leq d 4^{-k}. Each coordinate of the output of \text{part}_{k,s} corresponds to some v\in S; in particular, define \text{part}_{k,s}(x)_v := f_v(x). As such, by the definition of f_v, and Lemma 5.3, and since |S| \leq (s+1)^d, then \text{part}_{k,s} can be written with kd layers and \mathcal{O}((kd + d^2)s^d) nodes. The local support claim for \text{part}_{k,s}(\cdot)_v follows by construction. For the claim of approximate partition of unity, using U\subseteq S to denote the local set of coordinates corresponding to nonzero coordinates of \text{part}_{k,s} (which has |U|\leq 2^d by the local support claim), \begin{aligned} \Big| \sum_{v\in S} \text{part}_{k,s}(x)_v - 1 \Big| &= \Big| \sum_{v\in U} \Big({ \text{part}_{k,s}(x)_v - \prod_{j=1}^d h(x_j - v_j) }\Big) + \sum_{v\in U} \prod_{j=1}^d h(x_j - v_j) - 1 \Big| \\ &\leq \sum_{v\in U} \Big| \text{part}_{k,s}(x)_v - \prod_{j=1}^d h(x_j - v_j) \Big| + \Big| \sum_{v\in U} \prod_{j=1}^d h(x_j - v_j) - 1 \Big| \\ &\leq 2^d d 4^{-k} + \Big| \sum_{v\in U} \prod_{j=1}^d h(x_j - v_j) - 1 \Big|. 
\end{aligned} It turns out the last term of the sum is 0, which completes the proof: letting u denote the lexicographically smallest element in U (i.e., the “bottom left corner”), so that every v\in U has the form u + w with w \in \{0,1/s\}^d, \begin{aligned} | \sum_{v\in U} \prod_{j=1}^d h(x_j - v_j) - 1 | &= | \sum_{w \in \{0,1/s\}^d} \prod_{j=1}^d h((x - u - w)_j) - 1 | \\ &= | \prod_{j=1}^d \sum_{w_j \in \{0,1/s\}} h((x - u - w)_j) - 1 | \\ &= | \prod_{j=1}^d (h(x_j -u_j) + h(x_j - u_j - 1/s)) - 1 |, \end{aligned} which is 0 because z := x - u \in [0,1/s]^d by construction, and using the case analysis of h gives h(z_j) + h(z_j - 1/s) = (1 - sz_j) + (1 + s(z_j - 1/s)) = 1 as desired.
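The exact partition-of-unity identity (the k\to\infty case, with true products in place of \text{prod}_{k,{d}}) is easy to check in one dimension; helper names below are hypothetical:

```python
def relu(z):
    return max(z, 0.0)

def bump(s, a):
    # h(a) = sigma(s a + 1) - 2 sigma(s a) + sigma(s a - 1):
    # a "tent" of height 1 supported on [-1/s, 1/s]
    return relu(s * a + 1) - 2 * relu(s * a) + relu(s * a - 1)

def partition_sum(s, x):
    # sum of bumps centered at the grid {0, 1/s, ..., 1};
    # an exact partition of unity on [0, 1]
    return sum(bump(s, x - v / s) for v in range(s + 1))
```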
Finally we are in shape to prove Theorem 5.4.
Proof of Theorem 5.4. The ReLU network for f will combine \text{part}_{k,s} from Lemma 5.5 with \text{mono}_{k,r} from Lemma 5.4 via approximate multiplication, meaning \text{prod}_{k,{2}} from Lemma 5.3.
In detail, let the grid S:=\{0,1/s,\ldots,s/s\}^d be given as in the statement of Lemma 5.5. For each v\in S, let p_v:\mathbb{R}^d \to \mathbb{R} denote the Taylor expansion of degree r at v; by a standard form of the Taylor error, for any x\in[0,1]^d with \|x-v\|_{\infty} \leq 1/s, |p_v(x) - g(x)| \leq \frac {Md^r}{r!} \|v-x\|_\infty^r \leq \frac {Md^r}{r!s^r}. Next, let w_v denote the Taylor coefficients forming p_v, and define f_v :\mathbb{R}^d\to\mathbb{R} as x\mapsto w_v^{\scriptscriptstyle\mathsf{T}}\text{mono}_{k,r}(x-v), meaning approximate p_v by taking the linear combination with weights w_v of the approximate monomials in x\mapsto \text{mono}_{k,r}(x-v). By Lemma 5.4, since there are at most d^r terms, the error is at most |f_v(x) - p_v(x)| = |\sum_{\vec\alpha} (w_v)_{\vec\alpha} (\text{mono}_{k,r}(x-v)_{\vec\alpha} - (x-v)^{\vec\alpha})| \leq \sum_{\vec\alpha} |(w_v)_{\vec\alpha}| r4^{-k} \leq M r d^r 4^{-k}. just realized a small issue that negative inputs might occur; can do some shifts or reflections or whatever to fix.
The final network is now obtained by using \text{prod}_{k,{2}} to approximately multiply each approximate Taylor expansion f_v by the corresponding locally-supported approximate partition of unity element \text{part}_{k,s}(x)_v; in particular, define f(x) := \sum_{v\in S} \text{prod}_{k,{2}}(f_v(x), \text{part}_{k,s}(x)_v). Then, using the above properties and the fact that the partition of unity is locally supported, letting U\subseteq S denote the set of at most 2^d active elements, \begin{aligned} \left|{f(x) - g(x)}\right| &\leq \left|{\sum_{v\in S} \text{prod}_{k,{2}}(f_v(x),\text{part}_{k,s}(x)_v) - \sum_{v\in S} f_v(x)\text{part}_{k,s}(x)_v}\right| \\ &+ \left|{\sum_{v\in S} f_v(x)\text{part}_{k,s}(x)_v - \sum_{v\in S} p_v(x)\text{part}_{k,s}(x)_v}\right| \\ &+ \left|{\sum_{v\in S} p_v(x)\text{part}_{k,s}(x)_v - \sum_{v\in S} g(x)\text{part}_{k,s}(x)_v}\right| \\ &+ \left|{\sum_{v\in S} g(x)\text{part}_{k,s}(x)_v - g(x) }\right| \\ &\leq 2 |U| 4^{-k} + Mrd^r 4^{-k} (1 + d 2^d 4^{-k}) + \frac {Md^r}{r! s^r}(1 + d 2^d4^{-k}) + |g(x)| d 2^d4^{-k} \\ &\leq Mrd^r\left({s^{-r} +4d 2^d\cdot 4^{-k} }\right) + 3d 2^d \cdot 4^{-k}. \end{aligned} The input to \text{prod}_{k,{2}} can exceed 1; for a maximally lazy fix, I should just clip its input.
Classically, the purpose of optimization is to approximately minimize (or maximize) an objective function f over a domain S: \min_{w\in S} f(w).
A core tension in the use of optimization in machine learning is that we would like to minimize the population risk \mathcal{R}(w) := \mathop{\mathbb{E}}\ell(Yf(X;w)); however, we only have access to the empirical risk \widehat{\mathcal{R}}(w) := n^{-1} \sum_i \ell(y_if(x_i;w)).
As a result, when choosing a w_t, we not only care that \widehat{\mathcal{R}}(w_t) is small, but also about other good properties which may indicate \mathcal{R}(w_t) is small as well. Foremost amongst these is that w_t has low norm, but there are other possibilities.
Outline.
We will cover primarily first-order methods, namely gradient descent w_{t+1} := w_t - \eta_t \nabla\widehat{\mathcal{R}}(w_t), as well as the gradient flow \frac {{\text{d}}w}{{\text{d}}t} = \dot w(t) = - \nabla\widehat{\mathcal{R}}(w(t)). These dominate machine learning since:
They have low per-iteration complexity (which can be reduced further with stochastic gradients). Classical optimization developed many methods with higher per-iteration cost but fewer iterations; the high accuracy these achieve is not important here, since our true objective is unknown anyway.
It seems they might have additional favorable properties; e.g., we will highlight the preference for low-norm solutions of first-order methods.
First we’ll cover classical smooth and convex opt, including strong convexity and stochastic gradients.
Here our analysis differs from the literature by generally not requiring boundedness or existence of minima. Concretely, many proofs will use an arbitrary reference point z in place of an optimum \bar w (which may not exist); this arbitrary z will be used effectively in the margin maximization lectures.
Then we will cover topics closer to deep learning, including gradient flow in a smooth shallow NTK case, and a few margin maximization cases, with a discussion of nonsmoothness.
…maybe I should always use \widehat{\mathcal{R}} or F for objectives
Mean-field perspective (Chizat and Bach 2018; Mei, Montanari, and Nguyen 2018): as m\to\infty, gradient descent mimics a Wasserstein flow on a distribution over nodes (random features). Many mean-field papers are in the 2-homogeneous case, whereas many NTK papers are in the 1-homogeneous case, which further complicates comparisons.
Landscape analysis. (E.g., all local optima are global.)
Matrix completion: solve (under RIP) \min_{X\in\mathbb{R}^{d\times r}} \sum_{(i,j)\in S} (M_{i,j} - (XX^{\scriptscriptstyle\mathsf{T}})_{i,j})^2. Recently it was shown that all local optima are global, and so gradient descent from random initialization suffices (Ge, Lee, and Ma 2016).
For linear networks optimized with the squared loss, local optima are global, but there are bad saddle points (Kawaguchi 2016).
Width n suffices with general losses and networks (Nguyen and Hein 2017).
[ There is also work on residual networks but I haven’t looked closely. ]
Acceleration. Consider gradient descent with momentum: w_0 arbitrary, and thereafter v_{i+1} := w_i - \eta_i \nabla\widehat{\mathcal{R}}(w_i), \qquad w_{i+1} := v_{i+1} + \gamma_i(v_{i+1} - v_i)
This sometimes seems to help in deep learning (even in stochastic case), but no one knows why (and opinions differ).
If we set \eta_i = 1/\beta and \gamma_i = i / (i+3) (constants matter) and \widehat{\mathcal{R}} is convex, then \widehat{\mathcal{R}}(w_i) - \inf_{w} \widehat{\mathcal{R}}(w) \leq \mathcal{O}(1/t^2) (“Nesterov’s accelerated method”). This rate is tight amongst algorithms outputting iterates in the span of gradients, under some assumptions people treat as standard.
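A small experiment (an assumed two-dimensional quadratic test problem, not from the notes) comparing plain gradient descent against the momentum recursion above, with \eta = 1/\beta and \gamma_i = i/(i+3):

```python
# quadratic R(w) = (beta w_1^2 + lam w_2^2) / 2, condition number beta/lam
beta, lam, t = 100.0, 1.0, 200

def R(w):
    return 0.5 * (beta * w[0] ** 2 + lam * w[1] ** 2)

def grad(w):
    return [beta * w[0], lam * w[1]]

eta = 1.0 / beta

# plain gradient descent
w = [1.0, 1.0]
for _ in range(t):
    g = grad(w)
    w = [w[0] - eta * g[0], w[1] - eta * g[1]]
gd_loss = R(w)

# momentum: v_{i+1} = w_i - eta grad(w_i); w_{i+1} = v_{i+1} + gamma_i (v_{i+1} - v_i)
wi, v = [1.0, 1.0], [1.0, 1.0]
for i in range(t):
    g = grad(wi)
    v_new = [wi[0] - eta * g[0], wi[1] - eta * g[1]]
    gamma = i / (i + 3)
    wi = [v_new[j] + gamma * (v_new[j] - v[j]) for j in range(2)]
    v = v_new
acc_loss = R(v)
```

On this instance the accelerated iterates reach a noticeably smaller objective within the same iteration budget.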
Escaping saddle points. By adding noise to the gradient step, it is possible to exit saddle points (Jin et al. 2017). Some papers use this technique, though it is most useful in settings where all local minima (stationary points that are not saddles) are global minima.
Beyond NTK. A very limited amount of work studies nonlinear cases beyond what is possible with the NTK and/or highlighting ways in which the NTK does not capture the behavior of deep networks in practice, in particular showing sample complexity separations (Allen-Zhu and Li 2019; Daniely and Malach 2020; Ghorbani et al. 2020; Kamath, Montasser, and Srebro 2020; Yehudai and Shamir 2019, 2020).
Benefits of depth for optimization. Most of these works are either for shallow networks, or the analysis allows depth but degrades with increasing depth, in contrast with practical observations. A few works now are trying to show how depth can help optimization; one perspective is that sometimes it can accelerate convergence (Arora, Cohen, and Hazan 2018; Arora, Cohen, et al. 2018a).
Other first-order optimizers, e.g., Adam. There is recent work on these but afaik it doesn’t capture why these work well on many deep learning tasks.
Further analysis of overparameterization. Overparameterization makes many aspects of the optimization problem nicer, in particular in ways not investigated in these notes (Shamir 2018; S. Du and Hu 2019).
Hardness of learning and explicit global solvers. Even in simple cases, network training is NP-hard, but admits various types of approximation schemes (Goel et al. 2020; Diakonikolas et al. 2020).
First we will revisit classical convex optimization ideas. Our presentation differs from the normal one in one key way: we state nearly all results without any assumption that a minimizer exists, and instead use an arbitrary reference point z\in\mathbb{R}^p. We will invoke these bounds later in settings where the minimum may not exist, but the problem structure suggests good choices for z (see e.g., Lemma 10.1).
if i include ReLU ntk I can also use it there.
We say “\widehat{\mathcal{R}} is \beta-smooth” to mean \beta-Lipschitz gradients: \|\nabla\widehat{\mathcal{R}}(w) - \nabla\widehat{\mathcal{R}}(v)\| \leq \beta \|w-v\|. (The math community says “smooth” for C^\infty.) We primarily invoke smoothness via the key inequality \widehat{\mathcal{R}}(v) \leq \widehat{\mathcal{R}}(w) + \left\langle \nabla\widehat{\mathcal{R}}(w), v-w \right \rangle + \frac \beta 2 \|v-w\|^2. In words: \widehat{\mathcal{R}} can be upper bounded by the convex quadratic v \mapsto \frac \beta 2 \|v-w\|^2 + \left\langle \nabla\widehat{\mathcal{R}}(w), v-w \right \rangle + \widehat{\mathcal{R}}(w), which shares tangent and function value with \widehat{\mathcal{R}} at w. (The first definition also implies that we are lower bounded by concave quadratics.)
A key consequence: we can guarantee gradient descent does not increase the objective. Consider the gradient iteration w' = w - \frac 1 \beta \nabla\widehat{\mathcal{R}}(w); then smoothness implies \widehat{\mathcal{R}}(w') \leq \widehat{\mathcal{R}}(w) - \left\langle \nabla\widehat{\mathcal{R}}(w), \nabla\widehat{\mathcal{R}}(w)/\beta \right \rangle + \frac {1}{2\beta}\|\nabla\widehat{\mathcal{R}}(w)\|^2 = \widehat{\mathcal{R}}(w) - \frac 1 {2\beta} \|\nabla\widehat{\mathcal{R}}(w)\|^2, and \|\nabla\widehat{\mathcal{R}}(w)\|^2 \leq 2\beta (\widehat{\mathcal{R}}(w) - \widehat{\mathcal{R}}(w')). With deep networks, we’ll produce similar bounds but in other ways.
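A numerical check of both consequences, on an assumed example objective \ln(1+e^{-w}) (logistic-type, 1/4-smooth since its second derivative is at most 1/4; helper names hypothetical):

```python
import math

beta = 0.25  # smoothness constant of R(w) = ln(1 + exp(-w))

def R(w):
    return math.log1p(math.exp(-w))

def dR(w):
    return -1.0 / (1.0 + math.exp(w))

def gap(w):
    # R(w) - R(w') - |R'(w)|^2 / (2 beta) with w' = w - R'(w)/beta;
    # nonnegative by the smoothness consequence above
    w2 = w - dR(w) / beta
    return R(w) - R(w2) - dR(w) ** 2 / (2 * beta)
```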
As an exercise, let’s prove the earlier smoothness consequence. Considering the curve t\mapsto \widehat{\mathcal{R}}(w + t(v-w)) along [0,1], \begin{aligned} &\left|{ \widehat{\mathcal{R}}(v) - \widehat{\mathcal{R}}(w) - \left\langle \nabla\widehat{\mathcal{R}}(w), v-w \right \rangle }\right| \\ &= \left|{ \int_0^1 \left\langle \nabla\widehat{\mathcal{R}}(w+t(v-w)), v-w \right \rangle{\text{d}}t - \left\langle \nabla\widehat{\mathcal{R}}(w), v-w \right \rangle }\right| \\ &\leq \int_0^1\left|{ \left\langle \nabla\widehat{\mathcal{R}}(w+t(v-w))-\nabla\widehat{\mathcal{R}}(w), v-w \right \rangle }\right| {\text{d}}t \\ &\leq \int_0^1\|\nabla\widehat{\mathcal{R}}(w+t(v-w))-\nabla\widehat{\mathcal{R}}(w)\|\cdot\|v-w\| {\text{d}}t \\ &\leq \int_0^1 t\beta \|v-w\|^2 {\text{d}}t \\ &= \frac {\beta}{2} \|v-w\|^2. \end{aligned}
Since \frac {\sigma_{\min}(X)}{2} \|w'-w\|^2 \leq \frac 1 2 \|Xw'-Xw\|^2 \leq \frac {\sigma_{\max}(X)}{2} \|w'-w\|^2, \widehat{\mathcal{R}} is \sigma_{\max}(X)-smooth (and \sigma_{\min}(X)-strongly-convex, as we’ll discuss).
The smoothness bound holds if we use the seminorm \|v\|_X = \|Xv\|. We’ll (maybe?) discuss smoothness wrt other norms in homework.
I should use \mathcal L not \widehat{\mathcal{R}} since unnormalized.
Consider first the gradient iteration w' := w - \eta \nabla\widehat{\mathcal{R}}(w), where \eta\geq 0 is the step size. When \widehat{\mathcal{R}} is \beta-smooth but not necessarily convex, the smoothness inequality directly gives \begin{aligned} \widehat{\mathcal{R}}(w') &\leq \widehat{\mathcal{R}}(w) + \left\langle \nabla\widehat{\mathcal{R}}(w), w'-w \right \rangle + \frac \beta 2 \|w'-w\|^2 \\ &= \widehat{\mathcal{R}}(w) - \eta \|\nabla\widehat{\mathcal{R}}(w)\|^2 + \frac {\beta\eta^2}{2}\|\nabla\widehat{\mathcal{R}}(w)\|^2 \\ &= \widehat{\mathcal{R}}(w) - \eta \left({1 - \frac {\beta\eta}{2}}\right) \|\nabla\widehat{\mathcal{R}}(w)\|^2. \end{aligned} (3)
If we choose \eta appropriately (\eta\leq 2/\beta) then: either we are near a critical point (\nabla\widehat{\mathcal{R}}(w)\approx 0), or we can decrease \widehat{\mathcal{R}}.
Let’s refine our notation to tell iterates apart:
Let w_0 be given.
Recurse: w_{i+1} := w_{i} - \eta_i \nabla\widehat{\mathcal{R}}(w_{i}).
I changed indexing (2021-09-23), need to update everywhere…
Rearranging our iteration inequality eq. 3 and summing over i<t, \begin{aligned} \sum_{i<t} \eta_{i}\left({1 - \frac {\beta\eta_{i}}{2}}\right)\|\nabla\widehat{\mathcal{R}}(w_i)\|^2 &\leq \sum_{i<t} \left({\widehat{\mathcal{R}}(w_i) - \widehat{\mathcal{R}}(w_{i+1})}\right) \\ &= \widehat{\mathcal{R}}(w_0) - \widehat{\mathcal{R}}(w_{t}). \end{aligned} We can summarize these observations in the following theorem.
If \eta_i \in [0,2/\beta], then \widehat{\mathcal{R}}(w_{i+1}) \leq \widehat{\mathcal{R}}(w_i).
If \eta_i := \eta \in [0,2/\beta] is constant across i, \begin{aligned} \min_{i<t} \|\nabla\widehat{\mathcal{R}}(w_i)\|^2 &\le \frac 1 t \sum_{i<t} \|\nabla\widehat{\mathcal{R}}(w_i)\|^2 \\ &\le \frac {2}{t\eta(2-\eta\beta)}\left({\widehat{\mathcal{R}}(w_0) - \widehat{\mathcal{R}}(w_t)}\right) \\ &\le \frac {2}{t\eta(2-\eta\beta)}\left({\widehat{\mathcal{R}}(w_0) - \inf_w \widehat{\mathcal{R}}(w) }\right). \end{aligned} This final expression is minimized by \eta := \frac 1 \beta, which gives \begin{aligned} \min_{i<t} \|\nabla\widehat{\mathcal{R}}(w_i)\|^2 &\le \frac 1 t \sum_{i<t} \|\nabla\widehat{\mathcal{R}}(w_i)\|^2 \\ &\le \frac {2\beta }{t}\left({\widehat{\mathcal{R}}(w_0) - \widehat{\mathcal{R}}(w_t)}\right) \\ &\le \frac {2\beta }{t}\left({\widehat{\mathcal{R}}(w_0) - \inf_w \widehat{\mathcal{R}}(w) }\right). \end{aligned}
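A numerical check of the final bound on an assumed smooth nonconvex example, \cos(w) (so \beta = 1 and the infimum is -1), with constant step size \eta = 1/\beta:

```python
import math

beta, t = 1.0, 50
w, grad_sqs = 1.0, []          # w_0 = 1
for _ in range(t):
    g = -math.sin(w)           # gradient of cos at w
    grad_sqs.append(g * g)
    w -= g / beta              # constant step size 1/beta

min_grad_sq = min(grad_sqs)
# the theorem: min_i |grad|^2 <= 2 beta (R(w_0) - inf R) / t
bound = 2 * beta * (math.cos(1.0) - (-1.0)) / t
```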
Gradient flow version. Using FTC, chain rule, and definition, \begin{aligned} \widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(w(0)) &= \int_0^t \left\langle \nabla\widehat{\mathcal{R}}(w(s)), \dot w(s) \right \rangle{\text{d}}s \\ &= - \int_0^t\|\nabla\widehat{\mathcal{R}}(w(s))\|^2{\text{d}}s \\ &\leq - t \inf_{s\in[0,t]} \|\nabla\widehat{\mathcal{R}}(w(s))\|^2, \end{aligned} which can be summarized as follows.
GD: \min_{i<t} \|\nabla\widehat{\mathcal{R}}(w_i)\|^2 \le \frac {2\beta}{t}\left({\widehat{\mathcal{R}}(w_0) - \widehat{\mathcal{R}}(w_t)}\right).
\beta is from step size.
The “2” is from the second-order smoothness term (avoided in GF).
If \widehat{\mathcal{R}} is differentiable and convex, then it is bounded below by its first-order approximations: \widehat{\mathcal{R}}(w') \geq \widehat{\mathcal{R}}(w) + \left\langle \nabla\widehat{\mathcal{R}}(w), w'-w \right \rangle\qquad \forall w,w'.
Proof. By convexity and the earlier smoothness inequality \|\nabla\widehat{\mathcal{R}}(w)\|^2 \leq 2\beta (\widehat{\mathcal{R}}(w)-\widehat{\mathcal{R}}(w')), \begin{aligned} \|w'-z\|^2 &= \|w-z\|^2 - \frac 2 \beta \left\langle \nabla\widehat{\mathcal{R}}(w), w-z \right \rangle + \frac 1 {\beta^2} \|\nabla\widehat{\mathcal{R}}(w)\|^2 \\ &\leq \|w-z\|^2 + \frac 2 \beta (\widehat{\mathcal{R}}(z)-\widehat{\mathcal{R}}(w)) + \frac 2 \beta (\widehat{\mathcal{R}}(w) - \widehat{\mathcal{R}}(w')) \\ &= \|w-z\|^2 + \frac 2 \beta (\widehat{\mathcal{R}}(z)-\widehat{\mathcal{R}}(w')). \end{aligned} Rearranging and applying \sum_{i<t}, \frac 2 \beta \sum_{i<t} (\widehat{\mathcal{R}}(w_{i+1}) - \widehat{\mathcal{R}}(z)) \leq \sum_{i<t} \left({ \|w_{i}-z\|^2 - \|w_{i+1}-z\|^2 }\right). The final bound follows by noting \widehat{\mathcal{R}}(w_{i+1})\geq \widehat{\mathcal{R}}(w_t), and since the right hand side telescopes.
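A numerical check of the resulting rate \widehat{\mathcal{R}}(w_t) \leq \widehat{\mathcal{R}}(z) + \frac{\beta}{2t}\|w_0 - z\|^2, on an assumed small least-squares instance (where \beta is the largest eigenvalue of X^{\scriptscriptstyle\mathsf{T}}X, computed in closed form below):

```python
import math

X = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
y = [1.0, 0.0, 1.0]

def R(w):
    # R(w) = ||Xw - y||^2 / 2
    return 0.5 * sum((row[0] * w[0] + row[1] * w[1] - yi) ** 2
                     for row, yi in zip(X, y))

def dR(w):
    # gradient X^T (Xw - y)
    g = [0.0, 0.0]
    for row, yi in zip(X, y):
        e = row[0] * w[0] + row[1] * w[1] - yi
        g[0] += row[0] * e
        g[1] += row[1] * e
    return g

# X^T X = [[2, 1], [1, 5]]: largest eigenvalue (7 + sqrt(13)) / 2
beta = (7 + math.sqrt(13)) / 2
z = [1.0 / 3.0, 1.0 / 3.0]     # reference point: solves X^T X z = X^T y

t, w0 = 100, [0.0, 0.0]
w = w0[:]
for _ in range(t):
    g = dR(w)
    w = [w[0] - g[0] / beta, w[1] - g[1] / beta]

dist0 = (w0[0] - z[0]) ** 2 + (w0[1] - z[1]) ** 2
```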
For GF, we use the same potential, but indeed start from the telescoping sum, which can be viewed as a Riemann sum corresponding to the following application of FTC: \begin{aligned} \frac 1 2 \|w(t) - z\|_2^2 - \frac 1 2 \|w(0) - z\|_2^2 &= \frac 1 2 \int_0^t \frac {{\text{d}}}{{\text{d}}s} \|w(s) - z\|_2^2 {\text{d}}s \\ &= \int_0^t \left\langle \frac {{\text{d}}w}{{\text{d}}s} , w(s) - z \right \rangle {\text{d}}s \\ &= \int_0^t \left\langle \nabla \widehat{\mathcal{R}}(w(s)) , z - w(s) \right \rangle {\text{d}}s \\ &\leq \int_0^t \left({ \widehat{\mathcal{R}}(z) - \widehat{\mathcal{R}}(w(s)) }\right) {\text{d}}s. \end{aligned}
We can use similar objective functions with deep learning, without smoothness (!).
Recall one of our definitions of strong convexity: say that \widehat{\mathcal{R}} is \lambda-strongly-convex (\lambda-sc) when \widehat{\mathcal{R}}(w') \geq \widehat{\mathcal{R}}(w) + \left\langle \nabla\widehat{\mathcal{R}}(w), w'-w \right \rangle + \frac \lambda 2 \|w'-w\|^2; see Remark 7.5 (characterizing convexity) for more forms.
If \widehat{\mathcal{R}} is convex, then \widehat{\mathcal{R}}_\lambda is \lambda-sc:
Another very useful property is that \lambda-sc gives a way to convert gradient norms to suboptimality.
Proof. Let w be given, and define the convex quadratic Q_w(v) := \widehat{\mathcal{R}}(w) + \left\langle \nabla\widehat{\mathcal{R}}(w), v-w \right \rangle + \frac \lambda 2 \|v-w\|^2, which attains its minimum at \bar v := w - \nabla\widehat{\mathcal{R}}(w)/\lambda. By definition \lambda-sc, \begin{aligned} \inf_v \widehat{\mathcal{R}}(v) \geq \inf_v Q_w(v) = Q_w(\bar v) = \widehat{\mathcal{R}}(w) - \frac 1 {2\lambda}\|\nabla\widehat{\mathcal{R}}(w)\|^2. \end{aligned}
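A numerical check of the lemma on an assumed example w^2 + e^{-w}, which is 2-strongly-convex since its second derivative 2 + e^{-w} is at least 2 (helper names hypothetical):

```python
import math

lam = 2.0

def R(w):
    return w * w + math.exp(-w)

def dR(w):
    return 2.0 * w - math.exp(-w)

# grid estimate of inf R; the grid min can only overestimate the true inf,
# which only loosens the check below in a safe direction
inf_R = min(R(i / 10000.0) for i in range(-20000, 20001))

def slack(w):
    # |R'(w)|^2 / (2 lam) - (R(w) - inf R): nonnegative by the lemma
    return dR(w) ** 2 / (2 * lam) - (R(w) - inf_R)
```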
Say our goal is to find w so that \widehat{\mathcal{R}}(w) - \inf_v \widehat{\mathcal{R}}(v) \leq \epsilon. When do we stop gradient descent?
The \lambda-sc case is easy: by the preceding lemma, we know that we can stop when \|\nabla\widehat{\mathcal{R}}(w)\| \leq \sqrt {2\lambda \epsilon}.
Another easy case is when \inf_v \widehat{\mathcal{R}}(v) is known, and we just watch \widehat{\mathcal{R}}(w_i). E.g., in classification tasks, deep networks are expected to reach 0 training error. For settings like deep RL, this once again becomes a problem.
Many software packages use heuristics. Some people just run their methods as long as possible. In convex cases, sometimes we can compute duality gaps.
I should add lemmas giving level set containment, and existence of minimizers.
Proof. Using previously proved lemmas on smoothness and strong convexity, \begin{aligned} \widehat{\mathcal{R}}(w_{i+1}) - \widehat{\mathcal{R}}(\bar w) &\leq \widehat{\mathcal{R}}(w_{i}) - \widehat{\mathcal{R}}(\bar w) - \frac {\|\nabla\widehat{\mathcal{R}}(w_i)\|^2}{2\beta} \\ &\leq \widehat{\mathcal{R}}(w_{i}) - \widehat{\mathcal{R}}(\bar w) - \frac {2\lambda(\widehat{\mathcal{R}}(w_i) - \widehat{\mathcal{R}}(\bar w))}{2\beta} \\ &\leq \left({\widehat{\mathcal{R}}(w_{i}) - \widehat{\mathcal{R}}(\bar w)}\right)\left({1 - \lambda/\beta}\right), \end{aligned} which gives the first bound by induction since \prod_{i<t}(1-\lambda/\beta) \leq \prod_{i<t}\exp\left({-\lambda/\beta}\right) = \exp\left({-t\lambda/\beta}\right).
For the second guarantee, expanding the square as usual, \begin{aligned} \|w' -\bar w\|^2 &=\|w-\bar w\|^2 + \frac 2 \beta \left\langle \nabla\widehat{\mathcal{R}}(w), \bar w - w \right \rangle + \frac 1 {\beta^2}\|\nabla\widehat{\mathcal{R}}(w)\|^2 \\ &\leq\|w-\bar w\|^2 + \frac 2 \beta \left({\widehat{\mathcal{R}}(\bar w) - \widehat{\mathcal{R}}(w) - \frac{\lambda}{2} \|\bar w- w\|_2^2 }\right) \\ &\qquad + \frac 1 {\beta^2}\left({2\beta(\widehat{\mathcal{R}}(w) - \widehat{\mathcal{R}}(w'))}\right) \\ &= (1-\lambda/\beta)\|w-\bar w\|^2 + \frac 2 \beta \left({\widehat{\mathcal{R}}(\bar w) - \widehat{\mathcal{R}}(w) + \widehat{\mathcal{R}}(w) - \widehat{\mathcal{R}}(w')}\right) \\ &\leq (1-\lambda/\beta)\|w-\bar w\|^2, \end{aligned} which gives the argument after a similar induction argument as before.
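A numerical check of the first bound on an assumed separable quadratic (\beta-smooth and \lambda-strongly-convex by construction), with step size 1/\beta:

```python
import math

# R(w) = (beta w_1^2 + lam w_2^2) / 2: beta-smooth, lam-strongly-convex,
# minimized at the origin with minimum value 0
beta, lam, t = 10.0, 1.0, 30
w = [1.0, 1.0]
R0 = 0.5 * (beta * w[0] ** 2 + lam * w[1] ** 2)

for _ in range(t):
    # gradient step with step size 1/beta
    w = [w[0] - (beta * w[0]) / beta, w[1] - (lam * w[1]) / beta]

Rt = 0.5 * (beta * w[0] ** 2 + lam * w[1] ** 2)
bound = R0 * math.exp(-t * lam / beta)   # the claimed linear rate
```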
Next let’s handle the gradient flow.
Proof. By first-order optimality in the form \nabla\widehat{\mathcal{R}}(\bar w) = 0, then \begin{aligned} \frac {{\text{d}}}{{\text{d}}t} \frac 1 2 \|w(t) -\bar w\|^2 &= \left\langle w(t) - \bar w, \dot w(t) \right \rangle \\ &= -\left\langle w(t) - \bar w, \nabla \widehat{\mathcal{R}}(w(t)) - \nabla \widehat{\mathcal{R}}(\bar w) \right \rangle \\ &\leq - \lambda \|w(t) - \bar w\|^2. \end{aligned} By Grönwall’s inequality, this implies \begin{aligned} \|w(t) -\bar w\|^2 &\leq \|w(0)-\bar w\|^2 \exp\left({ -\int_0^t 2\lambda{\text{d}}s }\right) \\ &\leq \|w(0)-\bar w\|^2 \exp(-2\lambda t), \end{aligned} which establishes the guarantee on distances to initialization. For the objective function guarantee, \begin{aligned} \frac {{\text{d}}}{{\text{d}}t} (\widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(\bar w)) & = \left\langle \nabla\widehat{\mathcal{R}}(w(t)), \dot w(t) \right \rangle \\ &= - \left\|{ \nabla\widehat{\mathcal{R}}(w(t)) }\right\|^2 \leq - 2\lambda (\widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(\bar w)). \end{aligned} Grönwall’s inequality implies \widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(\bar w) \leq \left({ \widehat{\mathcal{R}}(w(0)) - \widehat{\mathcal{R}}(\bar w) }\right) \exp(-2t\lambda).
We have strayed a little from our goals by producing laborious proofs that not only separate the objective function and the distances, but also require minimizers. Interestingly, we can resolve this by changing the step size to a large (seemingly worse?) one.
Proof. Homework problem \ddot\smile.
Moreover, another standard rate given in the literature is 1/t under just strong convexity (no smoothness); however, this requires a step size \eta_i := (\lambda(i+1))^{-1}.
Let’s generalize gradient descent, and consider the iteration w_{i+1} := w_i - \eta_i g_i, where each g_i is merely some vector. If g_i := \nabla\widehat{\mathcal{R}}(w_i), then we have gradient descent, but in general we only approximate it. Later in this section, we’ll explain how to make g_i a “stochastic gradient.”
Our first step is to analyze this in our usual way with our favorite potential function, but accumulating a big error term: using convexity of \mathcal{R} and choosing a constant step size \eta_i := \eta \geq 0 for simplicity, \begin{aligned} \|w_{i+1} - z\|^2 &= \|w_i - \eta g_i - z\|^2 \\ &= \|w_i - z\|^2 - 2\eta \left\langle g_i, w_i - z \right \rangle + \eta^2 \|g_i\|^2 \\ &= \|w_i - z\|^2 + 2\eta \left\langle g_i - \nabla\mathcal{R}(w_i)+\nabla\mathcal{R}(w_i), z - w_i \right \rangle + \eta^2 \|g_i\|^2 \\ &\leq \|w_i - z\|^2 + 2\eta (\mathcal{R}(z) - \mathcal{R}(w_i) + \underbrace{\left\langle g_i - \nabla\mathcal{R}(w_i), z-w_i \right \rangle}_{\epsilon_i}) + \eta^2 \|g_i\|^2, \end{aligned} which after rearrangement gives 2\eta \mathcal{R}(w_i) \leq 2\eta \mathcal{R}(z) + \|w_{i} - z\|^2 - \|w_{i+1} - z\|^2 + 2\eta \epsilon_i+ \eta^2 \|g_i\|^2, and applying \frac 1 {2\eta t} \sum_{i<t} to both sides gives \frac 1 t \sum_{i<t} \mathcal{R}(w_i) \leq \mathcal{R}(z) + \frac{\|w_{0} - z\|^2 - \|w_{t} - z\|^2}{2 \eta t} + \frac 1 t \sum_{i<t}\left({ \epsilon_i+ \frac \eta 2 \|g_i\|^2}\right).
The following lemma summarizes this derivation.
Proof. This follows from the earlier derivation after plugging in G, \eta=c/\sqrt{t}, and applying Jensen’s inequality to the left hand side.
Now let us define the standard stochastic gradient oracle: \mathop{\mathbb{E}}[ g_i | w_{\leq i} ] = \nabla\mathcal{R}(w_i), where w_{\leq i} signifies all randomness in (w_1,\ldots,w_i).
Indeed, this setup allows the expectation to be nicely interpreted as an iterated integral over (x_1,y_1), then (x_2,y_2), and so on. The stochastic gradient g_i depends on (x_i,y_i) and w_i, but w_i does not depend on (x_i,y_i), rather on ((x_j,y_j))_{j=1}^{i-1}.
Annoyingly, there are many different settings for stochastic gradient descent; papers refer to them all in the same way, and it requires a closer look to determine the precise setting.
The preceding setup suggests that (x,y) is a fresh sample from the distribution; in this case, we are doing stochastic gradient descent on the population risk directly!
We can also resample from the training set, in which case \mathcal{R} is our usual empirical risk, and now the randomness is under our control (a randomized algorithm, not random data from nature). The “SVRG/SDCA/SAG/etc” papers are in this setting, as are some newer SGD papers. Since people typically do multiple passes over the data, perhaps this setting makes sense.
Now let’s work towards our goal of showing that, with high probability, our stochastic gradient method does nearly as well as a regular gradient method. (We will not show any benefit to stochastic noise, other than computation!)
Our main tool is as follows.
Proof omitted, though we’ll sketch some approaches in a few weeks.
We will use this inequality to handle \sum_{i<t} \epsilon_i. Firstly, we must show the desired expectations are zero. To start, \begin{aligned} \mathop{\mathbb{E}}\left[{ \epsilon_i\ \Big|\ w_{\leq i} }\right] &= \mathop{\mathbb{E}}\left[{ \left\langle g_i - \nabla\mathcal{R}(w_i), z - w_i \right \rangle\ \Big|\ w_{\leq i} }\right] \\ &= \left\langle \mathop{\mathbb{E}}\left[{ g_i - \nabla\mathcal{R}(w_i)\ {} \Big| \ {} w_{\leq i} }\right], z - w_i \right \rangle \\ &= \left\langle 0, z - w_i \right \rangle \\ &= 0. \end{aligned} Next, by Cauchy-Schwarz and the triangle inequality, \mathop{\mathbb{E}}|\epsilon_i| = \mathop{\mathbb{E}}\left|{ \left\langle g_i - \nabla\mathcal{R}(w_i), w_i - z \right \rangle }\right| \leq \mathop{\mathbb{E}}\left({ \|g_i\| + \|\nabla\mathcal{R}(w_i)\| }\right)\|w_i - z\| \leq 2GD. Consequently, by Azuma-Hoeffding, with probability at least 1-\delta, \sum_i \epsilon_i \leq 2GD \sqrt{2t \ln(1/\delta)}.
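The Azuma-Hoeffding step can be sanity-checked by simulating a bounded difference sequence (iid signs, a special case of a martingale difference sequence) and measuring the empirical failure probability; in the sketch below, the constant c stands in for the per-step bound 2GD:

```python
import numpy as np

rng = np.random.default_rng(0)
t, trials, delta = 400, 2000, 0.05
c = 1.0  # per-step bound |eps_i| <= c, playing the role of 2GD
# iid signs are a special case of a bounded martingale difference sequence
sums = rng.choice([-c, c], size=(trials, t)).sum(axis=1)
bound = c * np.sqrt(2 * t * np.log(1 / delta))
# Azuma-Hoeffding: the sum exceeds the bound with probability at most delta
assert np.mean(sums > bound) <= delta
```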
Plugging this into the earlier approximate gradient lemma gives the following. should give explicit cref
Here we will show our first optimization guarantees for (shallow) networks: one based on strong convexity, and one based on smoothness.
under construction.
(include preamble saying this looks like + theorem:sc_smooth?){.mjt} Theorem 7.5
Finally we will prove (rather than assert) that we can stay close to initialization long enough to get a small risk with an analysis that is essentially convex, essentially following the NTK (Taylor approximation).
This proof is a simplification of one by Chizat and Bach (2019). There are enough differences that it’s worth checking the original.
That paper highlights a “scaling phenomenon” as an explanation of the NTK. Essentially, increasing width always decreases initialization variance, and the paper argues this corresponds to “zooming in” on the Taylor expansion in function space, and flattening the dynamics.
This “scaling perspective” pervades much of the NTK literature and I recommend looking at (Chizat and Bach 2019) for further discussion; I do not discuss it much in this course or even in this proof, though I keep Chizat’s \alpha>0 scale parameter.
This proof comes after many earlier NTK analyses, e.g., (Jacot, Gabriel, and Hongler 2018; Simon S. Du et al. 2018; Allen-Zhu, Li, and Liang 2018; Arora, Du, Hu, Li, and Wang 2019). I like the proof by (Chizat and Bach 2019) very much and learned a lot from it; it was the most natural for me to teach. On the other hand, it is quite abstract, and we’ll need homework problems to boil it down further.
Basic notation. For convenience, bake the training set into the predictor: \begin{aligned} f(w) &:= \begin{bmatrix} f(x_1;w)\\ \vdots \\ f(x_n;w) \end{bmatrix} \in \mathbb{R}^n. \end{aligned} We’ll be considering squared loss regression: \begin{aligned} \widehat{\mathcal{R}}(\alpha f(w)) &:= \frac {1}{2} \|\alpha f(w) - y\|^2, \qquad \widehat{\mathcal{R}}_0 := \widehat{\mathcal{R}}(\alpha f(w(0)) ), \end{aligned} where \alpha > 0 is a scale factor we’ll optimize later. maybe I should use \mathcal L not \widehat{\mathcal{R}} since unnormalized.
We’ll consider gradient flow: \begin{aligned} \dot w(t) &:= - \nabla_w \widehat{\mathcal{R}}(\alpha f(w(t))) = - \alpha J_t^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f(w(t))), \\ \textup{where } J_t := J_{w(t)} &:= \begin{bmatrix} \nabla f(x_1;w(t))^{\scriptscriptstyle\mathsf{T}}\\ \vdots \\ \nabla f(x_n;w(t))^{\scriptscriptstyle\mathsf{T}} \end{bmatrix} \in \mathbb{R}^{n\times p}. \end{aligned}
We will also explicitly define and track a flow u(t) over the tangent model; what we care about is w(t), but we will show that indeed u(t) and w(t) stay close in this setting. (Note that u(t) is not needed for the analysis of w(t).) \begin{aligned} f_0(u) &:= f(w(0)) + J_0(u - w(0)). \\ \dot u(t) &:= -\nabla_u \widehat{\mathcal{R}}(\alpha f_0(u(t))) = - \alpha J_0^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f_0(u(t))). \end{aligned} Both gradient flows have the same initial condition: u(0) = w(0),\qquad f_0(u(0)) = f_0(w(0)) = f(w(0)).
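The two gradient flows above can be simulated side by side. The following sketch Euler-discretizes both for a small shallow network, using tanh (so the Jacobian is smooth) and orthonormal training inputs (so the initial Gram matrix J_0J_0^{\scriptscriptstyle\mathsf{T}} is well conditioned); these simplifications, and all constants, are choices of this illustration rather than requirements of the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 3, 3, 1000
X = np.eye(3)                       # orthonormal inputs: NTK Gram is exactly diagonal
y = np.array([1.0, -1.0, 0.5])
s = rng.choice([-1.0, 1.0], size=m) # untrained signs
alpha = 1.0 / np.sqrt(m)            # scale factor, here the familiar NTK scaling

def f(w):        # f(w) in R^n: predictions on all training inputs; w has shape (m, d)
    return np.tanh(X @ w.T) @ s

def jac(w):      # Jacobian J_w in R^{n x (m d)}
    g = (1.0 - np.tanh(X @ w.T) ** 2) * s
    return (g[:, :, None] * X[:, None, :]).reshape(n, m * d)

w0 = rng.normal(size=(m, d))
fw0, J0 = f(w0), jac(w0)
w, u = w0.copy(), w0.copy()
eta = 0.2
for _ in range(800):
    # Euler step of the gradient flow on the network ...
    w -= eta * alpha * (jac(w).T @ (alpha * f(w) - y)).reshape(m, d)
    # ... and on the tangent model f_0(u) = f(w(0)) + J_0 (u - w(0))
    fu = fw0 + J0 @ (u - w0).reshape(-1)
    u -= eta * alpha * (J0.T @ (alpha * fu - y)).reshape(m, d)

risk = lambda p: 0.5 * np.sum((alpha * p - y) ** 2)
fu = fw0 + J0 @ (u - w0).reshape(-1)
# both flows fit the data, and their predictions stay close at this width
assert risk(f(w)) < 1e-2 and risk(fu) < 1e-2
assert np.max(np.abs(alpha * (f(w) - fu))) < 0.5
```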
Remark 8.1 (initialization, width, etc) .
Assumptions. \begin{aligned} \textrm{rank}(J_0) &= n, \\ \sigma_{\min}&:= \sigma_{\min}(J_0) = \sqrt{\lambda_{\min}(J_0J_0^{\scriptscriptstyle\mathsf{T}})} = \sqrt{\lambda_{n}(J_0J_0^{\scriptscriptstyle\mathsf{T}})}> 0, \\ \sigma_{\max}&:= \sigma_{\max}(J_0) > 0, \\ \|J_w - J_v\| &\leq \beta \|w-v\|. \end{aligned}(4)
To get a handle on the various abstract constants and what they mean, consider the shallow case, namely f(x;w) = \sum_j s_j \sigma(w_j^{\scriptscriptstyle\mathsf{T}}x), where s_j\in\{\pm 1\} is not trained, and each w_j is trained.
Smoothness constant. Let X\in\mathbb{R}^{n\times d} be a matrix with the n training inputs as rows, and suppose \sigma is \beta_0-smooth. Then \begin{aligned} \|J_w - J_v\|_2^2 &= \sum_{i,j} \|x_i\|^2 (\sigma'(w_j^{\scriptscriptstyle\mathsf{T}}x_i) - \sigma'(v_j^{\scriptscriptstyle\mathsf{T}}x_i))^2 \\ &\leq \sum_{i,j} \|x_i\|^4 \beta_0^2 \|w_j - v_j \|^2 \\ &= \beta_0^2 \|X\|_{{\textrm{F}}}^4 \|w - v\|^2. \end{aligned} Thus \beta = \beta_0 \|X\|_{{\textrm{F}}}^2 suffices, which we can ballpark as \beta = \Theta(n).
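This Jacobian Lipschitz bound is easy to test numerically. The sketch below uses softplus, whose derivative (the sigmoid) is \beta_0-Lipschitz with \beta_0 = 1/4, in place of the ReLU (which is not smooth); sizes are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 5, 3, 4
X = rng.normal(size=(n, d))

sigmoid = lambda z: 1 / (1 + np.exp(-z))   # sigma' for softplus; it is (1/4)-Lipschitz
beta0 = 0.25

def jac(W):  # J_W in R^{n x (m d)} for f(x; W) = sum_j softplus(w_j . x)
    return (sigmoid(X @ W.T)[:, :, None] * X[:, None, :]).reshape(n, m * d)

beta = beta0 * np.sum(X ** 2)              # beta = beta0 * ||X||_F^2
for _ in range(100):
    W, V = rng.normal(size=(m, d)), rng.normal(size=(m, d))
    lhs = np.linalg.norm(jac(W) - jac(V))  # Frobenius norm, as in the derivation
    assert lhs <= beta * np.linalg.norm(W - V) + 1e-9
```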
Singular values. Now that we have an interpretation of the full rank assumption, ballpark the eigenvalues of J_0J_0^{\scriptscriptstyle\mathsf{T}}. By definition, (J_0J_0^{\scriptscriptstyle\mathsf{T}})_{i,j} = \nabla f(x_i;w(0))^{\scriptscriptstyle\mathsf{T}}\nabla f(x_j;w(0)). Holding i fixed and letting j vary, we can view the corresponding column of (J_0J_0^{\scriptscriptstyle\mathsf{T}}) as another feature representation, and \textrm{rank}(J_0)=n means none of these examples, in this feature representation, are linear combinations of others. This gives a concrete sense under which these eigenvalue assumptions are representation assumptions.
Now suppose each w_j(0) is an iid copy of some random variable v. Then, by definition of J_0, \begin{aligned} \mathop{\mathbb{E}}_{w(0)} (J_0J_0^{\scriptscriptstyle\mathsf{T}})_{i,j} &= \mathop{\mathbb{E}}_{w(0)} \nabla f(x_i;w(0))^{\scriptscriptstyle\mathsf{T}}\nabla f(x_j;w(0)). \\ &= \mathop{\mathbb{E}}_{w(0)} \sum_k s_k^2 \sigma'(w_k(0)^{\scriptscriptstyle\mathsf{T}}x_i) \sigma'(w_k(0)^{\scriptscriptstyle\mathsf{T}}x_j) x_i^{\scriptscriptstyle\mathsf{T}}x_j \\ &= m \mathop{\mathbb{E}}_{v} \sigma'(v^{\scriptscriptstyle\mathsf{T}}x_i) \sigma'(v^{\scriptscriptstyle\mathsf{T}}x_j) x_i^{\scriptscriptstyle\mathsf{T}}x_j. \end{aligned} In other words, it seems reasonable to expect \sigma_{\min} and \sigma_{\max} to scale with \sqrt m.
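The \sqrt m scaling of the extreme singular values can be checked by direct simulation: quadrupling the width should roughly double both \sigma_{\min} and \sigma_{\max}. (The signs s_j do not affect J_0J_0^{\scriptscriptstyle\mathsf{T}} and are omitted below; the fixed unit-norm inputs are an arbitrary choice for this illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
# four fixed unit-norm inputs in R^3
X = np.vstack([np.eye(3), np.ones(3) / np.sqrt(3)])
n, d = X.shape

def extreme_singular_values(m):
    W = rng.normal(size=(m, d))               # iid rows w_j(0)
    act = (X @ W.T > 0).astype(float)         # sigma'(w_j . x_i) for the ReLU
    J = (act[:, :, None] * X[:, None, :]).reshape(n, m * d)
    svals = np.linalg.svd(J, compute_uv=False)
    return svals[-1], svals[0]                # (sigma_min, sigma_max)

lo1, hi1 = extreme_singular_values(4000)
lo2, hi2 = extreme_singular_values(16000)
# quadrupling m should roughly double both extreme singular values
assert 1.6 < lo2 / lo1 < 2.4 and 1.6 < hi2 / hi1 < 2.4
```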
Initial risk \widehat{\mathcal{R}}_0. Let’s consider two different random initializations.
In the first case, we use one of the fancy schemes we mentioned to force f(w(0)) = 0; e.g., we can make sure that s_j is positive and negative an equal number of times, then sample w_j for s_j=+1, and then make w_j for s_j = -1 be the negation. With this choice, \widehat{\mathcal{R}}_0 = \|y\|^2/2 = \Theta(n).
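A minimal sketch of one such symmetrization scheme: pair each unit carrying s_j = +1 with a unit carrying s_j = -1 and the same w_j, so each pair of contributions negates and f(x;w(0)) = 0 identically.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 10, 4                           # m even: half the signs +1, half -1
w_pos = rng.normal(size=(m // 2, d))
W = np.vstack([w_pos, w_pos])          # the s_j = -1 units reuse the same weights ...
s = np.array([1.0] * (m // 2) + [-1.0] * (m // 2))
# ... so each pair s_j sigma(w_j . x) cancels, and f(x; w(0)) = 0 for every x
relu = lambda z: np.maximum(z, 0)
x = rng.normal(size=d)
f0 = s @ relu(W @ x)
assert abs(f0) < 1e-12
```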
On the other hand, if we do a general random initialization of both s_j and w_j, then we can expect enough cancellation that, roughly, f(x_i;w(0)) = \Theta(\sqrt m) (assuming w_j’s variance is a constant not depending on m: otherwise it would defeat the purpose of separating out the scale parameter \alpha). Then \|\alpha f(w(0))\|^2 = \Theta( \alpha^2 m n ) and \widehat{\mathcal{R}}_0 = \Theta( \alpha^2 m n), and thus the lower bound condition on \alpha will need to be checked carefully.
Combining all parameters. Again let’s split into two cases, based on the initialization as discussed immediately above.
The case \widehat{\mathcal{R}}_0 = \Theta(\alpha^2 n m). Using \beta = \Theta(n), the condition on \alpha indeed has \alpha on both sides, and becomes \sigma_{\min}^3 \geq \Omega( \beta \sigma_{\max}\sqrt{ n m }) = \sigma_{\max}\Omega(\sqrt{mn^3}). Since we said the singular values are of order \sqrt m, we get roughly m^{3/2} \geq \sqrt{m^2n^3}, thus m \geq n^3.
Since the lower bound on \alpha turned into a lower bound on m, let’s plug this \widehat{\mathcal{R}}_0 into the rates to see how they simplify: \begin{aligned} \max\left\{{ \widehat{\mathcal{R}}(\alpha f(w(t))), \widehat{\mathcal{R}}(\alpha f_0(u(t))) }\right\} &\leq \widehat{\mathcal{R}}_0 \exp\left({- \frac{t\alpha^2 \sigma_{\min}^2}{2}}\right) \\ &= \mathcal{O}\left({ \alpha^2 nm \exp\left({- \frac{t\alpha^2 \sigma_{\min}^2}{2}}\right) }\right), \\ \max\left\{{ \|w(t) - w(0)\|, \|u(t) - w(0)\| }\right\} &\leq \frac {3 \sqrt{8 \sigma_{\max}^2 \widehat{\mathcal{R}}_0}}{\alpha \sigma_{\min}^2} \\ &= \mathcal{O}\left({ \frac {\sqrt{\sigma_{\max}^2 nm}}{\sigma_{\min}^2} }\right). \end{aligned} In these inequalities, the distance to initialization is not affected by \alpha: this makes sense, as the key work needed by the gradient flow is to clear the initial noise so that y can be fit exactly. Meanwhile, the empirical risk rate does depend on \alpha, and is dominated by the exponential term, suggesting that \alpha should be made arbitrarily large. There is indeed a catch limiting the reasonable choices of \alpha, as will be pointed out shortly.
For now, to pick a value which makes the bounds more familiar, choose \alpha = \hat\alpha := 1 / \sigma_{\max}, whereby additionally simplifying via \sigma_{\min} and \sigma_{\max} being \Theta(\sqrt m) gives \begin{aligned} \max\left\{{ \widehat{\mathcal{R}}(\alpha f(w(t))), \widehat{\mathcal{R}}(\alpha f_0(u(t))) }\right\} &= \mathcal{O}\left({ \sigma_{\max}^{-2} nm \exp\left({- \frac{t\sigma_{\min}^2}{2\sigma_{\max}^2}}\right) }\right) \\ &= \mathcal{O}\left({ n \exp\left({- \frac{t\sigma_{\min}^2}{2\sigma_{\max}^2}}\right) }\right), \\ \max\left\{{ \|w(t) - w(0)\|, \|u(t) - w(0)\| }\right\} &= \mathcal{O}\left({ \frac { \sqrt{ \sigma_{\max}^2 nm}}{\sigma_{\min}^2} }\right) = \mathcal{O}\left({ \sqrt{n } }\right). \end{aligned} Written this way, the empirical risk rate depends on the condition number \sigma_{\max}/\sigma_{\min} of the NTK Gram matrix, which is reminiscent of the purely strongly convex and smooth analyses as in Theorem 7.5.
The case \widehat{\mathcal{R}}_0 = \Theta(n). Using \beta = \Theta(n), the condition on \alpha becomes \alpha = \Omega\left({ \frac{\beta \sqrt{\sigma_{\max}^2 \widehat{\mathcal{R}}_0}}{\sigma_{\min}^3} }\right) = \Omega\left({ \frac{\sigma_{\max}n^{3/2}}{\sigma_{\min}^3} }\right). We have removed the cancellation from the previous case, and are now constrained in our choice of \alpha; we can still set \alpha := 1/\sigma_{\max}, which after using our estimate of \sqrt{m} for \sigma_{\min} and \sigma_{\max} gives a similar requirement m = \Omega(n^3). More generally, we get \alpha = \Omega( n^{3/2} / m), which means for large enough m we can treat \alpha as close to 1/m. Frederic Koehler points out that the first case can still look like \widehat{\mathcal{R}}_0 = \Theta(\alpha^2 m n + n) and even \Theta(n) when \alpha is small; I need to update this story.
Possible values of \alpha. The two preceding cases considered lower bounds on \alpha. In the case \widehat{\mathcal{R}}_0 = \Theta(\alpha^2 n m), it even seemed that we can make \alpha whatever we want; in either case, the time required to make \widehat{\mathcal{R}}(\alpha f(w(t))) small will decrease as \alpha increases, so why not simply make \alpha arbitrarily large?
An issue occurs once we perform time discretization. Below, we will see that the smoothness of the model looks like \alpha^2 \sigma_{\max}^2 near initialization; as such, a time discretization, using tools such as in Theorem 7.3, will require a step size roughly 1 / (\alpha^2 \sigma_{\max}^2), and in particular while we may increase \alpha to force the gradient flow to seemingly converge faster, a smoothness-based time discretization will need the same number of steps.
As such, \alpha = 1 / \sigma_{\max} seems a reasonable way to simplify many terms in this shallow setup, which translates into a familiar 1/\sqrt{m} NTK scaling.
::: Proof of Theorem 8.1.
Proof plan.
First we choose a fortuitous radius B:= \frac{\sigma_{\min}}{2\beta}, and seek to study the properties of weight vectors w which are B-close to initialization: \|w - w(0)\| \leq B. This B is chosen to ensure J_t and J_0 are close, amongst other things. Moreover, we choose a T so that all t\in[0,T] are in this good regime: T := \inf\left\{{ t \geq 0 : \|w(t) - w(0)\| > B }\right\}.
Now consider any t\in[0,T]. i should include explicit lemma pointers for each.
First we show that if J_tJ_t^{\scriptscriptstyle\mathsf{T}} is positive definite, then we rapidly decrease risk, essentially following our old strong convexity proof.
Next, since the gradient of the least squares risk is the residual, then decreasing risk implies decreasing gradient norms, and in particular we can not travel far.
The above steps go through directly for u(t) due to the positive definiteness of J_0J_0^{\scriptscriptstyle\mathsf{T}}; by the choice of B, we can also prove they hold for J_tJ_t^{\scriptscriptstyle\mathsf{T}}.
As a consequence we also immediately get that we never escape this ball: the gradient norms decay sufficiently rapidly. Consequently, T=\infty, and we don’t need conditions on t in the theorem!
The evolution in prediction space is \begin{aligned} \frac {{\text{d}}}{{\text{d}}t} \alpha f(w(t)) &= \alpha J_t \dot w(t) = -\alpha^2 J_t J_t^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f(w(t))), \\ &= -\alpha^2 J_t J_t^{\scriptscriptstyle\mathsf{T}}(\alpha f(w(t)) - y), \\ \frac {{\text{d}}}{{\text{d}}t} \alpha f_0(u(t)) &= \frac {{\text{d}}}{{\text{d}}t} \alpha \left({ f(w(0) + J_0(u(t) - w(0)) }\right) = \alpha J_0 \dot u(t) \\ &= -\alpha^2 J_0 J_0^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f_0(u(t))) \\ &= -\alpha^2 J_0 J_0^{\scriptscriptstyle\mathsf{T}}(\alpha f_0(u(t)) - y). \end{aligned}
The first one is complicated because we don’t know how J_t evolves.
But the second one can be written \frac {{\text{d}}}{{\text{d}}t} \left[{ \alpha f_0(u(t)) }\right] = -\alpha^2 \left({ J_0 J_0^{\scriptscriptstyle\mathsf{T}}}\right) \left[{ \alpha f_0(u(t)) }\right] +\alpha^2 \left({ J_0 J_0^{\scriptscriptstyle\mathsf{T}}}\right)y, which is a concave quadratic in the predictions \alpha f_0(u(t)).
Let’s fantasize a little and suppose J_w J_w^{\scriptscriptstyle\mathsf{T}} is also positive definite. Do we still have a nice convergence theory?
Proof. Mostly just repeating our old strong convexity steps, \begin{aligned} \frac {{\text{d}}} {{\text{d}}t} \frac 1 2 \|z(t) - y\|^2 &= \left\langle -Q(t) (z(t) - y) , z(t) - y \right \rangle \\ &\leq - \lambda_{\min}\left({ Q(t) }\right) \left\langle z(t) - y, z(t) - y \right \rangle \\ &\leq - 2\lambda \|z(t) - y\|^2/2, \end{aligned} and Grönwall’s inequality completes the proof.
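This lemma can be checked numerically by Euler-discretizing \dot z = -Q(z - y) for a fixed positive definite Q; the discrete iterates satisfy the same exponential bound. (The step size and matrix below are arbitrary choices for this illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
Q = A @ A.T + 0.5 * np.eye(n)            # positive definite, lambda_min >= 0.5
lam = np.linalg.eigvalsh(Q)[0]           # smallest eigenvalue
y = rng.normal(size=n)
z = rng.normal(size=n)
z0_gap = 0.5 * np.sum((z - y) ** 2)

eta, steps = 1e-3, 2000                  # Euler discretization of dz/dt = -Q (z - y)
for _ in range(steps):
    z = z - eta * Q @ (z - y)

t = eta * steps
# Gronwall-style bound: (1/2)||z(t) - y||^2 <= exp(-2 lambda t) (1/2)||z(0) - y||^2
assert 0.5 * np.sum((z - y) ** 2) <= np.exp(-2 * lam * t) * z0_gap + 1e-12
```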
We can also prove this setting implies we stay close to initialization.
Proof. \begin{aligned} \|v(t) - v(0)\| &= \left\|{ \int_0^t \dot v(s) {\text{d}}s }\right\| \leq \int_0^t \|\dot v(s)\|{\text{d}}s \\ &= \int_0^t \|S_t^{\scriptscriptstyle\mathsf{T}}\nabla\mathcal{R}(g(v(s)))\|{\text{d}}s \\ &\leq \sqrt{\lambda_1} \int_0^t \| g(v(s)) - y\|{\text{d}}s \\ &\stackrel{(*)}{\leq} \sqrt{\lambda_1} \|g(v(0)) - y \| \int_0^t \exp(- s\lambda) {\text{d}}s \\ &\leq \frac {\sqrt{\lambda_1}}{\lambda} \|g(v(0)) - y\| \\ &\leq \frac {\sqrt{2 \lambda_1 \widehat{\mathcal{R}}(g(v(0)))}}{\lambda}, \end{aligned} where (*) used +Lemma 8.1.
Where does this leave us?
We can apply the previous two lemmas to the tangent model u(t), since for any t\geq 0, \dot u(t) = - \alpha J_0^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f_0(u(t))), \quad \frac {{\text{d}}}{{\text{d}}t} \alpha f_0(u(t)) = -\alpha^2 (J_0 J_0^{\scriptscriptstyle\mathsf{T}}) \nabla\widehat{\mathcal{R}}(\alpha f_0(u(t))). Thus since Q_0 := \alpha^2 J_0J_0^{\scriptscriptstyle\mathsf{T}} satisfies \lambda_i(Q_0) \in \alpha^2 [\sigma_{\min}^2, \sigma_{\max}^2], \begin{aligned} \widehat{\mathcal{R}}(\alpha f_0(u(t))) &\leq \widehat{\mathcal{R}}_0 \exp(- 2 t \alpha^2 \sigma_{\min}^2), \\ \|u(t) - u(0)\| &\leq \frac {\sqrt{2 \sigma_{\max}^2 \widehat{\mathcal{R}}_0}}{\alpha \sigma_{\min}^2}. \end{aligned}
How about w(t)?
Let’s relate (J_wJ_w^{\scriptscriptstyle\mathsf{T}}) to (J_0 J_0^{\scriptscriptstyle\mathsf{T}}).
Proof. For the upper bound, \|J_w\| \leq \|J_0\| + \|J_w - J_0\| \leq \|J_0\| + \beta \|w - w(0)\| \leq \sigma_{\max}+ \beta B = \sigma_{\max}+ \frac {\sigma_{\min}}{2}. For the lower bound, given vector v define A_v := J_0^{\scriptscriptstyle\mathsf{T}}v and B_v := (J_w - J_0)^{\scriptscriptstyle\mathsf{T}}v, whereby \|A_v\| \geq \sigma_{\min}\|v\|, \qquad \|B_v\| \leq \|J_w - J_0\|\cdot\|v\| \leq \beta B \|v\|, and thus \begin{aligned} \sigma_{\min}(J_w)^2 &= \min_{\|v\| = 1} v^{\scriptscriptstyle\mathsf{T}}J_w J_w^{\scriptscriptstyle\mathsf{T}}v \\ &= \min_{\|v\| = 1} \left({ (J_0 + J_w - J_0)^{\scriptscriptstyle\mathsf{T}}v}\right)^{\scriptscriptstyle\mathsf{T}}(J_0 + J_w - J_0)^{\scriptscriptstyle\mathsf{T}}v \\ &= \min_{\|v\| = 1} \|A_v\|^2 + 2 A_v^{\scriptscriptstyle\mathsf{T}}B_v + \|B_v\|^2 \\ &\geq \min_{\|v\| = 1} \|A_v\|^2 - 2 \|A_v\|\cdot\| B_v\| + \|B_v\|^2 \\ &= \min_{\|v\| = 1} \left({ \|A_v\| - \| B_v\| }\right)^2 \geq \min_{\|v\| = 1} \left({ \sigma_{\min}- \beta B }\right)^2 \|v\|^2 = \left({ \frac {\sigma_{\min}}{2} }\right)^2 . \end{aligned}
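Both bounds in this proof are instances of singular value perturbation (Weyl's inequality), with \beta B = \sigma_{\min}/2 playing the role of the perturbation norm; a quick numerical check with random matrices, where E stands in for J_w - J_0:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 20
J0 = rng.normal(size=(n, p))
svals = np.linalg.svd(J0, compute_uv=False)   # descending order
smin, smax = svals[-1], svals[0]
for _ in range(100):
    E = rng.normal(size=(n, p))
    E *= 0.4 * smin / np.linalg.norm(E, 2)    # spectral norm ||E|| < sigma_min / 2
    s = np.linalg.svd(J0 + E, compute_uv=False)
    # the perturbed matrix keeps sigma_min / 2 <= sigma_min(J0 + E) and
    # sigma_max(J0 + E) <= sigma_max + sigma_min / 2, as in the lemma
    assert s[-1] >= smin / 2 - 1e-9 and s[0] <= smax + smin / 2 + 1e-9
```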
Using this, for t\in[0,T], \dot w(t) = - \alpha J_w^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f(w(t))), \quad \frac {{\text{d}}}{{\text{d}}t} \alpha f(w(t)) = -\alpha^2 (J_w J_w^{\scriptscriptstyle\mathsf{T}}) \nabla\widehat{\mathcal{R}}(\alpha f(w(t))). Thus since Q_t := \alpha^2 J_tJ_t^{\scriptscriptstyle\mathsf{T}} satisfies \lambda_i(Q_t) \in \alpha^2 [\sigma_{\min}^2/4, 9\sigma_{\max}^2/4], \begin{aligned} \widehat{\mathcal{R}}(\alpha f(w(t))) &\leq \widehat{\mathcal{R}}_0 \exp(- t \alpha^2 \sigma_{\min}^2/2), \\ \|w(t) - w(0)\| &\leq \frac {3 \sqrt{8 \sigma_{\max}^2 \widehat{\mathcal{R}}_0}}{\alpha \sigma_{\min}^2} =: B'. \end{aligned} It remains to show that T=\infty. Invoke, for the first time, the assumed lower bound on \alpha, namely \alpha \geq \frac {\beta \sqrt{1152 \sigma_{\max}^2 \widehat{\mathcal{R}}_0}}{\sigma_{\min}^3}, which by the above implies B' \leq \frac B 2. Suppose, for the sake of contradiction, that T < \infty; since t\mapsto w(t) is continuous, then t\mapsto \|w(t) - w(0)\| is also continuous and starts from 0, and therefore \|w(T) - w(0)\| = B>0 exactly. But due to the lower bound on \alpha, we also have \|w(T) - w(0)\| \leq \frac B 2 < B, a contradiction.
This completes the proof. :::
Under construction.
(include preamble saying this looks like + theorem:magic_inequality?){.mjt} Theorem 7.3
Smoothness and differentiability do not in general hold for us (ReLU, max-pooling, hinge loss, etc.).
One relaxation of the gradient is the subdifferential set {\partial_{\text{s}}} (whose elements are called subgradients), namely the set of slopes of tangents which lie below the function: {\partial_{\text{s}}}\widehat{\mathcal{R}}(w) := \left\{{ s \in\mathbb{R}^p : \forall w'\centerdot \widehat{\mathcal{R}}(w') \geq \widehat{\mathcal{R}}(w) + s^{\scriptscriptstyle\mathsf{T}}(w'-w)}\right\}.
If \widehat{\mathcal{R}}:\mathbb{R}^d\to\mathbb{R} is convex, then {\partial_{\text{s}}}\widehat{\mathcal{R}} is nonempty everywhere.
If \nabla\widehat{\mathcal{R}} exists and \widehat{\mathcal{R}} is convex, then {\partial_{\text{s}}}\widehat{\mathcal{R}}(w) = \{ \nabla\widehat{\mathcal{R}}(w)\}. does this need some continuity on \widehat{\mathcal{R}}? need to check and provide a reference.
Much of convex analysis and convex opt can use subgradients in place of gradients; cf. (Hiriart-Urruty and Lemaréchal 2001; Nesterov 2003). As an example from these notes, Lemma 7.2 can replace gradients with subgradients.
One fun application is a short proof of Jensen’s inequality.
Proof. Choose any s\in{\partial_{\text{s}}}f(\mathop{\mathbb{E}}X), and note \mathop{\mathbb{E}}f(X) \geq \mathop{\mathbb{E}}\left[{f(\mathop{\mathbb{E}}(X)) + s^{\scriptscriptstyle\mathsf{T}}(X - \mathop{\mathbb{E}}X)}\right] = f(\mathop{\mathbb{E}}X).
Typically, we lack convexity, and the subdifferential set can be empty.
Our main formalism is the Clarke differential (Clarke et al. 1998):
\partial\widehat{\mathcal{R}}(w) := \textrm{conv}\left({\left\{{
s \in \mathbb{R}^p : \exists w_i \to w, \nabla\widehat{\mathcal{R}}(w_i)\to s
}\right\}}\right).
Key properties:
If \widehat{\mathcal{R}} is locally Lipschitz, \partial \widehat{\mathcal{R}} exists everywhere.
If \widehat{\mathcal{R}} is convex, then \partial\widehat{\mathcal{R}}= {\partial_{\text{s}}}\widehat{\mathcal{R}} everywhere. need to check some continuity conditions and add a reference.
\widehat{\mathcal{R}} is continuously differentiable at w iff \partial \widehat{\mathcal{R}}(w) = \{ \nabla\widehat{\mathcal{R}}(w) \}.
We can replace the gradient flow differential equation \dot w(t) = -\nabla\mathcal{R}(w(t)) with a differential inclusion: \dot w(t) \in -\partial \widehat{\mathcal{R}}(w(t)) \qquad \text{for a.e. } t\geq 0. If \widehat{\mathcal{R}} satisfies some technical structural conditions, then the following nice properties hold; these properties are mostly taken from (Lemma 5.2, Theorem 5.8, Davis et al. 2018) (where the structural condition is C^1 Whitney stratifiability), which was slightly generalized in (Ji and Telgarsky 2020) under o-minimal definability; another alternative, followed in (Lyu and Li 2019), is to simply assume that a chain rule holds.
(Chain rule.) For a.e. t\geq 0 and every v\in \partial \widehat{\mathcal{R}}(w(t)), then \frac {{\text{d}}}{{\text{d}}t} \widehat{\mathcal{R}}(w(t)) = - \left\langle v, \dot w(t) \right \rangle. This is the key strong property; since it holds for every element v of the Clarke differential simultaneously, it implies the next property.
(Minimum norm path.) For almost every t\geq0, \dot w(t) = -\mathop{\mathrm{arg\,min}}\{ \|v\| : v\in \partial \widehat{\mathcal{R}}(w(t)) \}. Consequently, \begin{aligned} \widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(w(0)) &= \int_0^t \frac {{\text{d}}}{{\text{d}}s} \widehat{\mathcal{R}}(w(s)) {\text{d}}s \\ & = -\int_0^t \min\{ \|v\|^2 : v\in\partial \widehat{\mathcal{R}}(w(s)) \}{\text{d}}s; \end{aligned} since the right hand side is nonpositive for all t, the flow never increases the objective.
This allows us to reprove our stationary point guarantee from an earlier lecture: since \begin{aligned} \widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(w(0)) = -\int_0^t \min\{ \|v\|^2 : v\in\partial \widehat{\mathcal{R}}(w(s)) \}{\text{d}}s \leq - t \min_{\substack{s\in [0,t]\\v\in\partial\widehat{\mathcal{R}}(w(s))}} \|v\|^2, \end{aligned} then just as before \begin{aligned} \min_{\substack{s\in [0,t]\\v\in\partial\widehat{\mathcal{R}}(w(s))}} \|v\|^2 \leq \frac{\widehat{\mathcal{R}}(w(0)) - \widehat{\mathcal{R}}(w(t))}{t}, \end{aligned} thus for some time s\in[0,t], we have an iterate w(s) which is an approximate stationary point.
These properties are not satisfied by the gradients computed by pytorch/tensorflow/jax/…
(Kakade and Lee 2018) gives some bad examples, e.g., x\mapsto\sigma(\sigma(x)) - \sigma(-x) with \sigma the ReLU, evaluated at 0. (Kakade and Lee 2018) also give a randomized algorithm for finding good subdifferentials.
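The (Kakade and Lee 2018) example is easy to reproduce by hand: the function below equals x everywhere, so its derivative at 0 is 1, yet the formal chain rule reports a different answer under every constant convention for \sigma'(0). This is a plain-Python sketch of what reverse-mode autodiff computes; pytorch uses the convention \sigma'(0) = 0.

```python
def relu(z):
    return max(z, 0.0)

def relu_grad(z, c):
    # autodiff convention: sigma'(0) := c (pytorch uses c = 0)
    return 1.0 if z > 0 else (c if z == 0 else 0.0)

def f(x):
    return relu(relu(x)) - relu(-x)       # equals x for every real x

def autodiff_f(x, c):
    # chain rule applied formally, as reverse-mode autodiff would:
    # d/dx relu(relu(x)) - d/dx relu(-x)
    return relu_grad(relu(x), c) * relu_grad(x, c) + relu_grad(-x, c)

assert all(f(x) == x for x in (-2.0, -0.5, 0.0, 1.0, 3.0))  # so f'(0) = 1
assert autodiff_f(0.0, c=0.0) == 0.0      # convention sigma'(0) = 0 reports 0
assert autodiff_f(0.0, c=1.0) == 2.0      # convention sigma'(0) = 1 reports 2
```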
Does it matter? In the NTK regime, few activations change. In practice, many change, but it’s unclear what their effect is.
Another tool we will use heavily outside convexity is positive homogeneity.
Let’s work out an element of the Clarke differential for a ReLU network x \mapsto W_L \sigma_{L-1}(\cdots W_2 \sigma_1(W_1 x)). As a function of x, this mapping is 1-homogeneous and piecewise affine. As a function of w=(W_L,\ldots,W_1), it is L-homogeneous and piecewise polynomial. The boundary regions form a set of (Lebesgue) measure zero (with respect to either inputs or parameters).
Fixing x and considering w, interior to each piece, the mapping is differentiable. Due to the definition of Clarke differential, it therefore suffices to compute the gradients in all adjacent pieces, and then take their convex hull.
So let’s return to considering some w where f is differentiable. Let A_i be a diagonal matrix with activations of the output after layer i on the diagonal: A_i = \textrm{diag}\left({ \sigma'(W_i\sigma(\dots \sigma(W_1 x) \dots )) }\right), (note we have baked in x,) and so \sigma(r)=r\sigma'(r) implies layer i outputs x \mapsto A_i W_i\sigma(\dots \sigma(W_1 x) \dots ) = A_i W_i A_{i-1} W_{i-1} \cdots A_1 W_1 x, and the network outputs f(x; w) = W_L A_{L-1} W_{L-1} A_{L-2} \cdots A_1 W_1 x, and the gradient with respect to layer i is \frac {{\text{d}}}{{\text{d}}W_i} f(x; w) = (W_L A_{L-1} \cdots W_{i+1} A_i)^{\scriptscriptstyle\mathsf{T}}(A_{i-1} W_{i-1} \cdots W_1 x)^{\scriptscriptstyle\mathsf{T}}. Additionally \begin{aligned} & \left\langle W_i , \frac {{\text{d}}}{{\text{d}}W_i} f(x; w) \right \rangle \\ &= \left\langle W_i , (W_L A_{L-1} \cdots W_{i+1} A_i)^{\scriptscriptstyle\mathsf{T}}(A_{i-1} W_{i-1} \cdots W_1 x)^{\scriptscriptstyle\mathsf{T}} \right \rangle \\ &= \textrm{tr}\left({ W_i^{\scriptscriptstyle\mathsf{T}}(W_L A_{L-1} \cdots W_{i+1} A_i)^{\scriptscriptstyle\mathsf{T}}(A_{i-1} W_{i-1} \cdots W_1 x)^{\scriptscriptstyle\mathsf{T}} }\right) \\ &= \textrm{tr}\left({ (W_L A_{L-1} \cdots W_{i+1} A_i)^{\scriptscriptstyle\mathsf{T}}(W_i A_{i-1} W_{i-1} \cdots W_1 x)^{\scriptscriptstyle\mathsf{T}} }\right) \\ &= \textrm{tr}\left({ (W_i A_{i-1} W_{i-1} \cdots W_1 x)^{\scriptscriptstyle\mathsf{T}} (W_L A_{L-1} \cdots W_{i+1} A_i)^{\scriptscriptstyle\mathsf{T}} }\right) \\ &= \textrm{tr}\left({ W_L A_{L-1} \cdots W_{i+1} A_i W_i A_{i-1} W_{i-1} \cdots W_1 x }\right) \\ &= f(x; w), \end{aligned} and in particular \left\langle W_i , \frac {{\text{d}}}{{\text{d}}W_i} f(x; w) \right \rangle = f(x; w) = \left\langle W_{i+1} , \frac {{\text{d}}}{{\text{d}}W_{i+1}} f(x; w) \right \rangle.
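The identity \left\langle W_i, \frac{{\text{d}}}{{\text{d}}W_i} f(x;w) \right\rangle = f(x;w) can be verified numerically for a small ReLU network (the layer sizes below are arbitrary choices for this illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)
d = 4
W1, W2, W3 = rng.normal(size=(5, d)), rng.normal(size=(6, 5)), rng.normal(size=(1, 6))
x = rng.normal(size=d)

h1 = relu(W1 @ x)                 # layer outputs
h2 = relu(W2 @ h1)
fx = (W3 @ h2)[0]                 # network output W3 A2 W2 A1 W1 x

A1 = np.diag((W1 @ x > 0).astype(float))    # activation matrices
A2 = np.diag((W2 @ h1 > 0).astype(float))
# gradient wrt layer i: (path above layer i)^T (path below layer i)^T
g1 = (W3 @ A2 @ W2 @ A1).T @ x[None, :]          # d f / d W1, shape (5, 4)
g2 = (W3 @ A2).T @ (A1 @ W1 @ x)[None, :]        # d f / d W2, shape (6, 5)
g3 = (A2 @ W2 @ A1 @ W1 @ x)[None, :]            # d f / d W3, shape (1, 6)

# <W_i, d f / d W_i> = f(x; w) for every layer i
for Wi, gi in ((W1, g1), (W2, g2), (W3, g3)):
    assert np.isclose(np.sum(Wi * gi), fx)
```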
This calculation can in fact be made much more general (indeed with a simpler proof!).
Proof. If w = 0, then \left\langle s, w \right \rangle = 0 = Lf(w) for every s\in\partial f(w), so consider the case w\neq 0. Let D denote those w where f is differentiable, and consider the case that w\in D\setminus \{0\}. By the definition of gradient, \begin{aligned} \lim_{\delta\downarrow0}\frac{f(w+\delta w)-f(w)-\left\langle \nabla f(w), \delta w \right \rangle}{\delta\|w\|}=0, \end{aligned} and by using homogeneity in the form f(w+\delta w)=(1+\delta)^Lf(w) (for any \delta > 0), then \begin{aligned} 0 &= \lim_{\delta\downarrow0}\frac{\left({(1+\delta)^L-1}\right)f(w)-\left\langle \nabla f(w), \delta w \right \rangle}{\delta} = - \left\langle \nabla f(w), w \right \rangle + \lim_{\delta\downarrow0} f(w) \left({ L + \mathcal{O}(\delta) }\right), \end{aligned} which implies \left\langle w, \nabla f(w) \right \rangle=Lf(w).
Now consider w \in \mathbb{R}^d \setminus D \setminus \{0\}. For any sequence (w_i)_{i\geq 1} in D with \lim_i w_i=w for which there exists a limit s := \lim_i \nabla f(w_i), then \begin{aligned} \left\langle w, s \right \rangle=\lim_{i\to\infty}\left\langle w_i, \nabla f(w_i) \right \rangle=\lim_{i\to\infty}Lf(w_i)=Lf(w). \end{aligned} Lastly, for any element s\in\partial f(w) written in the form s = \sum_i \alpha_i s_i where \alpha_i\geq 0 satisfy \sum_i \alpha_i = 1 and each s_i is a limit of a sequence of gradients as above, then \left\langle w, s \right \rangle = \left\langle w, \sum_i \alpha_i s_i \right \rangle = \sum_i \alpha_i \left\langle w, s_i \right \rangle = \sum_i \alpha_i L f(w) = Lf(w).
If predictions are positive homogeneous with respect to each layer, then gradient flow preserves norms of layers.
Proof. Defining \ell_k'(s) := y_k \ell'(y_k f(x_k; w(s))), and fixing a layer i, \begin{aligned} \frac 1 2 \|W_i(t)\|^2 - \frac 1 2 \|W_i(0)\|^2 &= \int_0^t \frac {{\text{d}}} {{\text{d}}s} \frac 1 2 \|W_i(s)\|^2 {\text{d}}s \\ &= \int_0^t \left\langle W_i(s) , \dot W_i(s) \right \rangle {\text{d}}s \\ &= \int_0^t \left\langle W_i(s) , -\mathop{\mathbb{E}}_k \ell'_k(s) \frac {{\text{d}}f(x_k; w(s))}{{\text{d}}W_i(s)} \right \rangle{\text{d}}s \\ &= - \int_0^t \mathop{\mathbb{E}}_k \ell'_k(s) \left\langle W_i(s) , \frac {{\text{d}}f(x_k; w(s))}{{\text{d}}W_i(s)} \right \rangle{\text{d}}s \\ &= - \int_0^t \mathop{\mathbb{E}}_k \ell'_k(s) f(x_k; w(s)) {\text{d}}s. \end{aligned} This final expression does not depend on i, which gives the desired equality.
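A numerical sketch of this norm preservation: running Euler-discretized gradient descent with a tiny step size on a two-layer ReLU network with logistic loss, the squared norms of the two layers change by nearly the same amount, with the gap coming from the discretization. All sizes and constants are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)
n, d, m = 10, 3, 8
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
W1, a = rng.normal(size=(m, d)), rng.normal(size=m)   # both layers trained

eta = 1e-3
n1_0, n2_0 = np.sum(W1 ** 2), np.sum(a ** 2)
for _ in range(1000):
    H = relu(X @ W1.T)                         # hidden activations, (n, m)
    p = H @ a                                  # predictions f(x_k; w)
    lp = -y / (1 + np.exp(y * p)) / n          # (1/n) y_k ell'(y_k f(x_k; w))
    ga = H.T @ lp                              # gradient wrt top layer
    gW = ((X @ W1.T > 0) * a * lp[:, None]).T @ X   # gradient wrt bottom layer
    a -= eta * ga
    W1 -= eta * gW

# gradient flow changes every layer's squared norm at the same rate,
# so the two increments agree up to the Euler discretization error
d1 = np.sum(W1 ** 2) - n1_0
d2 = np.sum(a ** 2) - n2_0
assert abs(d1 - d2) < 0.05
```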
This by itself implies \|W_j\| \to \infty for some j; combined with norm preservation, \min_j \|W_j\| \to \infty !
need to update this in light of the new material i’ve included?
Let’s consider: single hidden ReLU layer, only bottom trainable: f(x;w) := \frac {1}{\sqrt m} \sum_j a_j \sigma(\left\langle x, w_j \right \rangle), \qquad a_j \in \{\pm 1\}. Let W_s\in \mathbb{R}^{m\times d} denote parameters at time s, and suppose \|x\|\leq 1. \begin{aligned} \frac {{\text{d}}f(x;W)}{{\text{d}}W} &= \begin{bmatrix} a_1 x \sigma'(w_1^{\scriptscriptstyle\mathsf{T}}x) / \sqrt{m} \\ \vdots \\ a_m x \sigma'(w_m^{\scriptscriptstyle\mathsf{T}}x) / \sqrt{m} \end{bmatrix}, \\ \left\|{ \frac {{\text{d}}f(x;W)}{{\text{d}}W} }\right\|_{{\textrm{F}}}^2 &= \sum_j \left\|{ a_j x \sigma'(w_j^{\scriptscriptstyle\mathsf{T}}x) / \sqrt{m}}\right\|^2_2 \leq \frac 1 m \sum_j \left\|{ x }\right\|^2_2 \leq 1. \end{aligned}
We’ll use the logistic loss, whereby \begin{aligned} \ell(z) &= \ln(1+\exp(-z)), \\ \ell'(z) &= \frac {-\exp(-z)}{1+ \exp(-z)} \in (-1,0), \\ \widehat{\mathcal{R}}(W) &:= \frac 1 n \sum_k \ell(y_k f(x_k;W)). \end{aligned} A key fact (can be verified with derivatives) is |\ell'(z)| = -\ell'(z) \leq \ell(z), % ln(1+e^z) = \int_{-infty}^z dx / (1+e^{-x}) % 1/(1+e^z) = \int_{-infty}^z e^{-x} dx / (1+e^{-x})^2. % suffices to prove second is everywhere smaller. % Let g(x) denote first. % second is g(x) * e^{-x} / (1+e^{-x}). % but e^{-x} / (1+e^{-x}) = 1 / (1+e^x) <= 1, % so first one g(x) is always larger. whereby \begin{aligned} \frac {{\text{d}}\widehat{\mathcal{R}}}{{\text{d}}W} &= \frac 1 n \sum_k \ell'(y_k f(x_k; W)) y_k \nabla_W f(x_k; W), \\ \left\|{ \frac {{\text{d}}\widehat{\mathcal{R}}}{{\text{d}}W} }\right\|_{{\textrm{F}}} &\leq \frac 1 n \sum_k | \ell'(y_k f(x_k; W)) | \cdot \|y_k \nabla_W f(x_k; W)\|_{{\textrm{F}}} \\ &\leq \frac 1 n \sum_k | \ell'(y_k f(x_k; W)) | \leq \min\left\{{ 1, \widehat{\mathcal{R}}(W) }\right\}.	\end{aligned}
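The key fact |\ell'(z)| = -\ell'(z) \leq \ell(z), together with \ell' \in (-1,0), can be checked numerically on a grid:

```python
import numpy as np

z = np.linspace(-20, 20, 4001)
loss = np.log1p(np.exp(-z))             # ell(z) = ln(1 + exp(-z))
dloss = -np.exp(-z) / (1 + np.exp(-z))  # ell'(z)
assert np.all(dloss < 0) and np.all(dloss > -1)   # ell' in (-1, 0)
assert np.all(np.abs(dloss) <= loss)              # |ell'| <= ell pointwise
```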
Now we can state a non-smooth, non-convex analog to +Theorem 7.3
Proof. Using the squared distance potential as usual, \begin{aligned} \|W_{i+1}-Z\|_{\textrm{F}}^2 &= \|W_{i}-Z\|_{\textrm{F}}^2 - 2 \eta \left\langle \nabla\widehat{\mathcal{R}}(W_i), W_i - Z \right \rangle + \eta^2 \|\nabla\widehat{\mathcal{R}}(W_i)\|_{\textrm{F}}^2, \end{aligned} where \displaystyle \|\nabla\widehat{\mathcal{R}}(W_i)\|_{\textrm{F}}^2 \leq \|\nabla\widehat{\mathcal{R}}(W_i)\|_{\textrm{F}}\leq \widehat{\mathcal{R}}(W_i) = \widehat{\mathcal{R}}^{(i)}(W_i), and \begin{aligned} & n \left\langle \nabla\widehat{\mathcal{R}}(W_i), Z-W_i \right \rangle \\ &= \sum_k y_k \ell'(y_kf(x_k;W_i)) \left\langle \nabla_W f(x_k;W_i) , Z-W_i \right \rangle \\ &= \sum_k \ell'(y_kf(x_k;W_i)) \left({y_k \left\langle \nabla_W f(x_k;W_i) , Z \right \rangle - y_k f(x_k;W_i) }\right) \\ &\leq \sum_k \left({\ell(y_k \left\langle \nabla_W f(x_k; W_i) , Z \right \rangle ) - \ell(y_k f(x_k;W_i)) }\right) \\ &= n \left({\widehat{\mathcal{R}}^{(i)}(Z) - \widehat{\mathcal{R}}^{(i)}(W_i)}\right). \end{aligned} Together, \begin{aligned} \|W_{i+1}-Z\|_{\textrm{F}}^2 &\leq \|W_{i}-Z\|_{\textrm{F}}^2 + 2\eta \left({\widehat{\mathcal{R}}^{(i)}(Z) - \widehat{\mathcal{R}}^{(i)}(W_i)}\right) + \eta \widehat{\mathcal{R}}^{(i)}(W_i); \end{aligned} applying \sum_{i<t} to both sides gives the bound.
During 2015-2016, various works pointed out that deep networks generalize well, even though parameter norms are large and there is no explicit regularization (Neyshabur, Tomioka, and Srebro 2014; Zhang et al. 2017). This prompted authors to study the implicit bias of gradient descent, the first such result being an analysis of linear predictors with linearly separable data, showing that gradient descent on the cross-entropy loss is implicitly biased towards a maximum margin direction (Soudry, Hoffer, and Srebro 2017).
This in turn inspired many other works, handling other types of data, networks, and losses (Ji and Telgarsky 2019b, 2018, 2020; Gunasekar et al. 2018a; Lyu and Li 2019; Chizat and Bach 2020; Ji et al. 2020).
Margin maximization of first-order methods applied to exponentially-tailed losses was first proved for coordinate descent (Telgarsky 2013). The basic proof scheme there was pretty straightforward, and based on the similarity of the empirical risk (after the monotone transformation \ln(\cdot)) to \ln \sum \exp, itself similar to \max(\cdot) and thus to margin maximization; we will use this connection as a basis for all proofs in this section (see also (Ji and Telgarsky 2019b; Gunasekar et al. 2018b)).
Throughout this section, fix training data ((x_i,y_i))_{i=1}^n, and define an (unnormalized) margin mapping m_i(w) := y_i f(x_i;w); by this choice, we can also conveniently write an unnormalized risk \mathcal L: \mathcal L(w) := \sum_i \ell(m_i(w)) = \sum_i \ell(y_i f(x_i;w)). We will also always assume f is locally-Lipschitz and L-homogeneous in w, which means each m_i is also locally-Lipschitz and L-homogeneous.
We will also use the exponential loss \ell(z) = \exp(-z). The results go through for similar losses.
We just said “maximum margin” and “separable data.” What do these mean?
Consider a linear predictor, meaning x\mapsto \left\langle w, x \right \rangle for some w\in\mathbb{R}^d. This w “separates the data” if y_i and \textrm{sgn}(\left\langle w, x_i \right \rangle) agree, which we can restate as the condition of strict separability, namely \min_i y_i \left\langle w, x_i \right \rangle > 0. It seems a reasonable inductive bias to additionally ask that these quantities be as far from 0 as possible: \max_{w\in ?} \min_i y_i \left\langle w, x_i \right \rangle. The “?” indicates that we must somehow normalize or constrain, since otherwise, for separable data, this \max becomes a \sup and has value +\infty.
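As a small illustrative simulation (the data below is made up so that the maximum margin unit direction is (1,0)), gradient descent on the exponential loss slowly rotates toward that direction, even when started elsewhere:

```python
import math

# two positive examples chosen so the max-margin unit direction is (1, 0),
# with normalized margin 1
data = [((1.0, 2.0), 1.0), ((1.0, -2.0), 1.0)]

def margins(w):
    return [y * (w[0] * x[0] + w[1] * x[1]) for x, y in data]

def grad_risk(w):
    # gradient of sum_i exp(-m_i(w))
    g = [0.0, 0.0]
    for x, y in data:
        c = -math.exp(-y * (w[0] * x[0] + w[1] * x[1])) * y
        g[0] += c * x[0]
        g[1] += c * x[1]
    return g

w, eta = [0.0, 1.0], 0.1   # deliberately misaligned start
for _ in range(50000):
    g = grad_risk(w)
    w = [w[0] - eta * g[0], w[1] - eta * g[1]]

normalized_margin = min(margins(w)) / math.hypot(w[0], w[1])
assert normalized_margin > 0.8  # drifting toward the max margin 1
```

This symmetric toy example converges quickly in direction; the general rate established later in this section is only 1/\ln(t).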
Consider now the general case of L-homogeneous predictors, where y_i\left\langle w, x_i \right \rangle is replaced by m_i(w).
Proof. Note \max_i \ell(m_i(\hat w)) \leq \sum_i \ell(m_i(\hat w)) = n \widehat{\mathcal{R}}(\hat w) < \ell(0), thus applying the (decreasing) \ell^{-1} to both sides gives \min_i m_i(\hat w) > 0. Therefore 0 \leq \inf_w \widehat{\mathcal{R}}(w) \leq \limsup_{c\to\infty}\widehat{\mathcal{R}}(c{\hat w}) = \frac 1 n \sum_i \limsup_{c\to\infty} \ell(m_i(c{\hat w})) = \frac 1 n \sum_i \limsup_{c\to\infty} \ell(c^L m_i({\hat w})) = 0.
This seems problematic: how can we “find” an “optimum” when solutions are off at infinity? Moreover, we do not even have unique directions, nor a way to tell different ones apart!
We can use margins, now appropriately generalized to the L-homogeneous case, to build towards a better-behaved objective function. First note that since \min_i m_i(w) = \|w\|^L \min_i m_i\left({\frac{w}{\|w\|}}\right), we can compare different directions by normalizing the margin by \|w\|^L. Moreover, again using the exponential loss, \begin{aligned} \frac {\ell^{-1}\left({\mathcal L(w)}\right)}{\|w\|^L} + \frac{\ln(n)}{\|w\|^L} &= \frac {\ell^{-1}\left({\sum_i \ell(m_i(w))/n}\right)}{\|w\|^L} \geq \frac {\min_i m_i(w)}{\|w\|^L} \\ &= \frac {\ell^{-1}\left({\max_i \ell(m_i(w))}\right)}{\|w\|^L} \\ &\geq \frac{\ell^{-1}\left({ \mathcal L(w) }\right)}{\|w\|^L}. \end{aligned}(5)
This motivates the following definition.
The basic properties can be summarized as follows.
Suppose data is \vec m-separable. Then:
{\bar{\gamma}}:= \max_{\|w\| \leq 1} \gamma(w) > 0 is well-defined (the maximum is attained).
For any w\neq 0, \lim_{c\to\infty} {\tilde{\gamma}}(c w) = \gamma(w). In particular, for {\hat w} satisfying \bar\gamma = \gamma({\hat w}), \lim_{c\to\infty} {\tilde{\gamma}}(c{\hat w}) = {\bar{\gamma}}.
Proof. The first part follows by continuity of m_i(w) and compactness of \{w\in\mathbb{R}^p : \|w\|=1\}, and the second from eq. 6 and eq. 5.
Let’s first see how far we can get in the linear case, using one of our earlier convex optimization tools, namely Theorem 7.4.
Later, when we study the L-homogeneous case, we will only be able to show that for every unit of additional norm (to the power L), the unnormalized margin increases by at least the current normalized margin; this implies the normalized margin is nondecreasing, but not that it is maximized.
Proof. By Theorem 7.4 with z = \ln(c)\bar u/\gamma for some c>0, \begin{aligned} \mathcal L(w(t)) &\leq \mathcal L(z) + \frac {1}{2t}\left({\|z\|^2 - \|w(t)-z\|^2}\right) \leq \sum_i \ell(m_i(z)) + \frac {\|z\|^2}{2t} \\ &\leq \sum_i \exp(-\ln(c)) + \frac {\ln(c)^2}{2t\gamma^2} = \frac {n}{c} + \frac {\ln(c)^2}{2t\gamma^2}, \end{aligned} and the first claimed bound follows from the choice c := 2tn\gamma^2. For the lower bound on \|w(t)\|, using the preceding inequality, \ell\left({ \|w(t)\| }\right) \leq \min_i \ell(m_i(w(t))) \leq \frac 1 n \mathcal L(w(t)) \leq \frac {1+\ln(2tn\gamma^2)^2}{2tn\gamma^2}, and the second claimed bound follows by applying \ell^{-1} to both sides.
This nicely shows that we decrease the risk to 0, but not that we maximize margins. For this, we need a more specialized analysis.
Proof. For convenience, define u(t) := \ell^{-1}(\mathcal L(w(t))) and v(t) := \|w(t)\|, whereby \gamma(w(t)) = \frac {u(t)}{v(t)} = \frac{u(0)}{v(t)} + \frac {\int_0^t\dot u(s){\text{d}}s}{v(t)}. Let’s start by lower bounding the second term. Since \ell' = -\ell, \begin{aligned} \dot u(t) &= \left\langle \frac{ -\nabla \mathcal L(w(t))}{\mathcal L(w(t))}, \dot w(t) \right \rangle = \frac {\|\dot w(t)\|^2}{\mathcal L(w(t))}, \\ \|\dot w(s)\| &\geq \left\langle \dot w(s), \bar u \right \rangle = \left\langle -\sum_i x_iy_i \ell'(m_i(w(s))), \bar u \right \rangle \\ &= \sum_i\ell(m_i(w(s))) \left\langle x_iy_i , \bar u \right \rangle \geq \gamma\sum_i\ell(m_i(w(s))) = \gamma \mathcal L(w(s)), \\ v(t) &= \|w(t)-w(0)\| = \left\|{\int_0^t \dot w(s){\text{d}}s}\right\| \leq \int_0^t \left\|{\dot w(s)}\right\|{\text{d}}s, \end{aligned} thus \begin{aligned} \frac {\int_0^t\dot u(s){\text{d}}s}{v(t)} &\geq \frac {\int_0^t \frac {\|\dot w(s)\|^2}{\mathcal L(w(s))} {\text{d}}s}{v(t)} = \frac {\int_0^t \|\dot w(s)\| \frac {\|\dot w(s)\|}{\mathcal L(w(s))} {\text{d}}s}{v(t)} \geq \frac {\gamma \int_0^t \|\dot w(s)\|{\text{d}}s}{v(t)} = \gamma. \end{aligned}
For the first term u(0) / v(t), note \mathcal L(w(0)) = n and thus u(0) = -\ln n, whereas by the lower bound on \|w(t)\| from Lemma 10.1, \frac {u(0)}{v(t)} = \frac {-\ln(n)}{\|w(t)\|} \geq \frac {-\ln(n)}{\ln(t) + \ln(2n\gamma^2) - 2\ln\ln(2tne\gamma^2)}. Combining these inequalities gives the bound.
We are maximizing margins, but at a glacial rate of 1/\ln(t)!
To get some inspiration, notice that we keep running into \ell^{-1}(\mathcal L(w)) in all the analysis. Why don’t we just run gradient flow on this modified objective? In fact, the two gradient flows are the same!
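By the chain rule, \nabla \ln \mathcal L(w) = \nabla\mathcal L(w)/\mathcal L(w), so the two flows trace the same path at different speeds; here is a finite-difference check of this identity on a small synthetic linear instance (all data here is made up for illustration):

```python
import math, random

random.seed(1)
# a small synthetic linear instance: m_i(w) = y_i <x_i, w>, exponential loss
data = [([random.uniform(-1, 1) for _ in range(3)], random.choice([-1, 1]))
        for _ in range(5)]

def risk_and_grad(w):
    # returns L(w) = sum_i exp(-m_i(w)) and its gradient
    L, g = 0.0, [0.0, 0.0, 0.0]
    for x, y in data:
        m = y * sum(wj * xj for wj, xj in zip(w, x))
        e = math.exp(-m)
        L += e
        for j in range(3):
            g[j] -= e * y * x[j]
    return L, g

def lnL(w):
    return math.log(risk_and_grad(w)[0])

w = [0.3, -0.7, 0.5]
L, g = risk_and_grad(w)

# finite-difference gradient of ln L
eps = 1e-6
fd = []
for j in range(3):
    wp, wm = list(w), list(w)
    wp[j] += eps
    wm[j] -= eps
    fd.append((lnL(wp) - lnL(wm)) / (2 * eps))

# grad ln L = (grad L) / L: the gradient flows follow the same direction
assert all(abs(f - gj / L) < 1e-5 for f, gj in zip(fd, g))
```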
Proof. We start as before: set u(t) := \ell^{-1}\mathcal L(\theta(t)) and v(t) := \|\theta(t)\|, whereby {\tilde{\gamma}}(t) = \frac {u(t)}{v(t)} = \frac {u(0)}{v(t)} + \frac {\int_0^t \dot u(s){\text{d}}s}{v(t)} = \frac {-\ln n}{v(t)} + \frac {\int_0^t \dot u(s){\text{d}}s}{v(t)}. Bounding these terms is now much simpler than for the regular gradient flow. Note \begin{aligned} \|\dot \theta(s)\| &\geq \left\langle -\nabla \ln \mathcal L(\theta(s)), \bar u \right \rangle = \sum_i \frac {\ell(m_i(\theta(s)))}{\mathcal L(\theta(s))} \left\langle x_iy_i, \bar u \right \rangle \geq \gamma \sum_i \frac {\ell(m_i(\theta(s)))}{\mathcal L(\theta(s))} = \gamma, \\ \dot u(s) &= \left\langle -\nabla \ln \mathcal L(\theta(s)), \dot \theta(s) \right \rangle = \|\dot \theta(s)\|^2, \end{aligned} thus \begin{aligned} \ell^{-1}\mathcal L(\theta(t)) &= \ell^{-1}\mathcal L(\theta(0)) + \int_0^t \frac {{\text{d}}}{{\text{d}}s} \ell^{-1}\mathcal L(\theta(s)){\text{d}}s \geq -\ln(n) + t\gamma^2, \\ \frac{\int_0^t \dot u(s){\text{d}}s}{v(t)} &= \frac{\int_0^t \|\dot \theta(s)\|^2{\text{d}}s}{v(t)} \geq \frac{\gamma \int_0^t \|\dot \theta(s)\|{\text{d}}s}{v(t)} \geq \frac{\gamma \|\int_0^t \dot \theta(s){\text{d}}s\|}{v(t)} = \gamma. \end{aligned} On the other hand, \|\theta(t)\|\gamma \geq \|\theta(t)\| \gamma(\theta(t)) \geq \ell^{-1}\mathcal L(\theta(t)) \geq t\gamma^2 - \ln(n). Together, {\tilde{\gamma}}(t) = \frac {u(t)}{v(t)} \geq \gamma - \frac {\ln(n)}{t\gamma^2 -\ln n}.
In the nonlinear case, we do not have a general result, and instead only prove that smoothed margins are nondecreasing.
The proof will use the following interesting approximate homogeneity property of \ln \sum \exp.
If -\ln\sum_i\exp were itself homogeneous, this would be an equality; instead, using only the L-homogeneity of m_i, we get a lower bound.
Proof. Let v\in - \partial \ln \sum_i \exp(-m_i(w)) be given, whereby (thanks to assuming a chain rule) there exists v_i \in \partial m_i(w) for each i such that v = \sum_{i=1}^n \frac {\exp(-m_i(w)) v_i} {\sum_{j=1}^n \exp(-m_j(w))}. Since \exp(-m_k(w)) \geq 0 for every k and since -\ln is monotone decreasing, \begin{aligned} \left\langle v, w \right \rangle &= \left\langle \sum_{i=1}^n \frac {\exp(-m_i(w)) v_i} {\sum_{j=1}^n \exp(-m_j(w))}, w \right \rangle \\ &= \sum_{i=1}^n \frac {\exp(-m_i(w))} {\sum_{j=1}^n \exp(-m_j(w))}\left\langle v_i, w \right \rangle \\ &= L \sum_{i=1}^n \frac {\exp(-m_i(w))} {\sum_{j=1}^n \exp(-m_j(w))}m_i(w) \\ &= L \sum_{i=1}^n \frac {\exp(-m_i(w))} {\sum_{j=1}^n \exp(-m_j(w))}\left({ - \ln \exp(-m_i(w))}\right) \\ &\geq L \sum_{i=1}^n \frac {\exp(-m_i(w))} {\sum_{j=1}^n \exp(-m_j(w))}\left({ - \ln \sum_k \exp(-m_k(w))}\right) \\ &= - L \ln \sum_k \exp(-m_k(w)) \end{aligned} as desired.
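The lemma can be checked numerically in the simplest case of linear margins (L = 1), where the subgradient is an ordinary gradient; all data below is synthetic:

```python
import math, random

random.seed(2)
d, n = 4, 6
xs = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
ys = [random.choice([-1, 1]) for _ in range(n)]
w = [random.uniform(-2, 2) for _ in range(d)]

# linear margins m_i(w) = y_i <x_i, w> are 1-homogeneous, so L = 1
ms = [y * sum(wj * xj for wj, xj in zip(w, x)) for x, y in zip(xs, ys)]
Z = sum(math.exp(-m) for m in ms)
ps = [math.exp(-m) / Z for m in ms]

# v = -grad ln sum_i exp(-m_i(w)) = sum_i p_i grad m_i(w)
v = [sum(p * y * x[j] for p, x, y in zip(ps, xs, ys)) for j in range(d)]

lhs = sum(vj * wj for vj, wj in zip(v, w))
rhs = -math.log(Z)  # equals -L ln sum_k exp(-m_k(w)) with L = 1
assert lhs >= rhs - 1e-9
```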
Lemma 10.2 (taken from (Ji and Telgarsky 2020)) leads to a fairly easy proof of Theorem 10.3 (originally from (Lyu and Li 2019), simplification due to (Ji 2020)).
Proof of Theorem 10.3. It will be shown that ({\text{d}}/{\text{d}}t){\tilde{\gamma}}(t) \geq 0 whenever {\tilde{\gamma}}(t) > 0, which completes the proof via the following contradiction. Suppose there exists t > t_0 with {\tilde{\gamma}}(t) < {\tilde{\gamma}}(t_0), and let t denote the earliest such time. Then {\tilde{\gamma}}(t') > 0 for t'\in [t_0,t), whereby {\tilde{\gamma}}(t) = {\tilde{\gamma}}(t_0) + \int_{t_0}^t \frac {{\text{d}}}{{\text{d}}s} {\tilde{\gamma}}(s){\text{d}}s \geq {\tilde{\gamma}}(t_0), a contradiction.
To this end, fix any t with {\tilde{\gamma}}(t)>0, and the goal is to show ({\text{d}}/{\text{d}}t){\tilde{\gamma}}(t) \geq0. Define u(t) := - \ln \sum_i \exp(-m_i(w(t))), \qquad v(t) := \|w(t)\|^L, whereby {\tilde{\gamma}}(t) := {\tilde{\gamma}}(w(t)) := u(t) / v(t), and \frac {{\text{d}}}{{\text{d}}t} {\tilde{\gamma}}(t) = \frac {\dot u(t) v(t) - u(t) \dot v(t)}{v(t)^2}, where v(t) > 0 (since otherwise {\tilde{\gamma}}(t) > 0 is impossible), which means the ratio is well-defined. Making use of Lemma 10.2, \begin{aligned} \dot u(t) &= \| \dot w(t) \|^2 \\ &\geq \|\dot w(t)\| \left\langle \frac{w(t)}{\|w(t)\|}, \dot w(t) \right \rangle \\ &\geq \frac { L u(t) \|\dot w(t)\|}{\|w(t)\|}, \\ \dot v(t) &= \frac {{\text{d}}}{{\text{d}}t} \|w(t)\|^{L} \\ &= L \|w(t)\|^{L - 1} \left\langle \frac {w(t)}{\|w(t)\|}, \dot w(t) \right \rangle \\ &\leq L \|w(t)\|^{L - 1} \|\dot w(t)\|, \end{aligned} whereby \begin{aligned} \dot u(t) v(t) - \dot v(t) u(t) &\geq \frac { L u(t) \|\dot w(t)\|}{\|w(t)\|} v(t) - u(t) L \|w(t)\|^{L - 1} \|\dot w(t)\| = 0, \end{aligned} which completes the proof.
The purpose of this generalization part is to bound the gap between testing and training error for standard (multilayer ReLU) deep networks via the classical uniform convergence tools, and also to present and develop these classical tools (based on Rademacher complexity).
These bounds are very loose, and there is now extensive criticism both of them and of the general approach, as will be discussed shortly (Neyshabur, Tomioka, and Srebro 2014; Zhang et al. 2017; Nagarajan and Kolter 2019; Dziugaite and Roy 2017); this work is ongoing and moving quickly, and there are already many responses to these criticisms (Negrea, Dziugaite, and Roy 2019; L. Zhou, Sutherland, and Srebro 2020; P. L. Bartlett and Long 2020).
Domain adaptation / covariate shift.
Generalization properties of more architectures. One key omission is of convolution layers; for one generalization analysis, see (Long and Sedghi 2019).
Other approaches and perspectives on generalization (possibly changing the basic definitions of “generalization”), for instance:
PAC-Bayes approaches (Dziugaite and Roy 2017). In the present notes, we only focus on uniform convergence bounds, which give high probability bounds between training and test error which hold simultaneously for every element of some class.
By contrast, PAC-Bayes approaches consider a distribution over predictors, and bound the expected gap between testing and training error for these predictors in terms of how close this distribution is to some prior distribution over the predictors.
The looseness of the uniform-convergence bounds presented in these notes leads many authors to instead use them as explanatory tools, e.g., by studying their correlation with observed generalization. A correlation was claimed and presented in (P. Bartlett, Foster, and Telgarsky 2017), however it was on a single dataset and architecture. More extensive investigations have appeared recently (Jiang et al. 2020; Dziugaite et al. 2020), and highlight that while some bounds are correlated with generalization (or rather predictive of generalization) in some settings, there are other situations (e.g., large width) where no bound is correlated with observed generalization gaps.
Compression-based approaches (Arora, Ge, et al. 2018), which bound the generalization of the network after applying some compression procedure, with no guarantees on the original network; that said, it is a promising approach, and there has been some effort to recover guarantees on the original network (Suzuki, Abe, and Nishimura 2019).
Another relevant work, from an explicitly PAC-Bayes perspective, is (W. Zhou et al. 2018). For further connections between PAC-Bayes methodology and compression, see (Blum and Langford 2003), and for more on the concept of compression schemes, see for instance (Moran and Yehudayoff 2015).
Double descent (Belkin et al. 2018; Belkin, Hsu, and Xu 2019; Hastie et al. 2019), and related “interpolating predictors.”
Various omitted bounds in our uniform deviation framework:
(Wei and Ma 2019) give a bound which requires smooth activations; if we convert it to ReLU, it introduces a large factor which does not seem to improve over those presented here. That said, it is an interesting bound and approach. (There are a number of other bounds we don’t discuss since similarly they degrade for ReLU.)
(Golowich, Rakhlin, and Shamir 2018) have an additional bound over the one of theirs we present here: interestingly, it weakens the dependence on \sqrt n to n^{1/4} or n^{1/5}, but in exchange vastly improves the dependence on norms in the numerator.
Concentration of measure studies how certain distribution families and operations on distributions lead to “clumping up” of probability mass. Examples we’ve seen:
Gaussians concentrate around the one-standard-deviation shell; we used this in NTK to say few activations change (so it’s concentrated away from 0, sometimes this is called “anti-concentration”).
Azuma-Hoeffding gave us control on the errors in SGD; note that we averaged together many errors before studying concentration!
We’ll see in this section that concentration of measure allows us to handle the generalization gap of single predictors fixed in advance, but is insufficient to handle the output of training algorithms.
We will be absurdly brief. Some other resources:
Martin Wainwright’s lecture notes (Wainwright 2015), now turned into a book (Wainwright 2019).
My learning theory class, as well as Maxim Raginsky’s.
Our main concentration tool will be the Chernoff bounding method, which works nicely with sub-Gaussian random variables.
Example 12.1.
Gaussian \mathcal{N}(\mu,\sigma^2) is (\mu,\sigma^2)-sub-Gaussian.
If the Z_i are independent and each Z_i is (\mu_i,\sigma_i^2)-sub-Gaussian, then \sum_i Z_i is (\sum_i \mu_i, \|\vec \sigma\|^2)-sub-Gaussian.
If Z\in [a,b] a.s., then Z is (\mathop{\mathbb{E}}Z, (b-a)^2/4)-sub-Gaussian (this is called the Hoeffding lemma I should pick a proof.).
There is also “sub-exponential”; we will not use it but it is fundamental.
Sometimes \mu is dropped from definition; in this case, one can study X-\mathop{\mathbb{E}}X, and we’ll just say “\sigma^2-sub-Gaussian.”
\mathop{\mathbb{E}}\exp(\lambda Z) is the moment generating function of Z; it has many nice properties, though we’ll only use it in a technical way.
Sub-Gaussian random variables will be useful to us due to their vanishing tail probabilities. This indeed is an equivalent way to define sub-Gaussianity (see (Wainwright 2015)), but we’ll just prove one implication. The first step is Markov’s inequality.
Proof. Apply \mathop{\mathbb{E}} to both sides of \epsilon\mathbf{1}[X\geq \epsilon] \leq X.
Proof. Note \mathop{\textrm{Pr}}[X\geq \epsilon] \leq \mathop{\textrm{Pr}}[f(X) \geq f(\epsilon)] and apply Markov.
The Chernoff bounding technique is as follows. We can apply the preceding corollary to the mapping t\mapsto \exp(tX) for all t>0: supposing \mathop{\mathbb{E}}X = 0, \mathop{\textrm{Pr}}[X\geq \epsilon] = \inf_{t\geq 0} \mathop{\textrm{Pr}}[\exp(tX)\geq \exp(t\epsilon)] \leq \inf_{t\geq 0} \frac {\mathop{\mathbb{E}}\exp(tX)}{\exp(t\epsilon)}. Simplifying the RHS via sub-Gaussianity, \begin{aligned} \inf_{t>0}\frac{\mathop{\mathbb{E}}\exp(tX)}{\exp(t\epsilon)} &\leq \inf_{t>0}\frac{\exp(t^2\sigma^2/2)}{\exp(t\epsilon)} = \inf_{t>0}\exp\left({ t^2\sigma^2/2 - t\epsilon}\right) \\ &= \exp\left({ \inf_{t>0} t^2\sigma^2/2 - t\epsilon}\right). \end{aligned} The minimum of this convex quadratic is attained at t := \frac {\epsilon}{\sigma^2}>0, thus \mathop{\textrm{Pr}}[X\geq \epsilon] \leq \inf_{t>0}\frac{\mathop{\mathbb{E}}\exp(tX)}{\exp(t\epsilon)} \leq \exp\left({ \inf_{t>0} t^2\sigma^2/2 - t\epsilon}\right) = \exp\left({ - \frac {\epsilon^2}{2 \sigma^2} }\right). (7)
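As a quick Monte Carlo illustration (not from the notes' source), the sub-Gaussian tail bound (7) is loose for the standard Gaussian (the true tail at \epsilon=2 is roughly 0.023, versus the bound e^{-2} \approx 0.135), but it is a valid upper bound:

```python
import math, random

random.seed(3)
sigma, eps, N = 1.0, 2.0, 200_000

# empirical tail Pr[X >= eps] for X ~ N(0, sigma^2)
hits = sum(1 for _ in range(N) if random.gauss(0.0, sigma) >= eps)
empirical = hits / N

# Chernoff / sub-Gaussian tail bound from eq. 7
chernoff = math.exp(-eps ** 2 / (2 * sigma ** 2))
assert empirical <= chernoff
```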
What if we apply this to an average of sub-Gaussian r.v.’s? (The point is: this starts to look like an empirical risk!)
Proof. S_n:=\sum_iX_i/n is \sigma^2-sub-Gaussian with \sigma^2 = \sum_i\sigma_i^2/n^2; plug this into the sub-Gaussian tail bound in eq. 7.
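As an illustration, for an average of n i.i.d. Rademacher signs (each Z_i \in [-1,1] is 1-sub-Gaussian by the Hoeffding lemma), the resulting tail bound \exp(-n\epsilon^2/2) can be compared against simulation:

```python
import math, random

random.seed(4)
n, eps, trials = 100, 0.3, 20000

# each Z_i is a Rademacher sign in [-1, 1], hence 1-sub-Gaussian;
# the average is (1/n)-sub-Gaussian, with tail bound exp(-n eps^2 / 2)
bound = math.exp(-n * eps ** 2 / 2)

hits = 0
for _ in range(trials):
    avg = sum(random.choice([-1.0, 1.0]) for _ in range(n)) / n
    if avg >= eps:
        hits += 1
assert hits / trials <= bound
```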
(Gaussian sanity check.) Let’s go back to the case n=1. It’s possible to get a tighter tail for the Gaussian directly (see (Wainwright 2015)), but it only changes log factors in the “inversion form” of the bound. Note also the bound is neat for the Gaussian since it says the tail mass and density are of the same order (algebraically this makes sense, as with geometric series).
(“Inversion” form.) This form is how things are commonly presented in machine learning; think of \delta as a “confidence” parameter: owing to the \ln(1/\delta) term, adding more digits to the confidence (e.g., requiring the bound to hold with probability 99.999\%) means only a linear increase in \ln(1/\delta).
There are more sophisticated bounds (e.g., Bernstein, Freedman, McDiarmid) proved in similar ways, often considering a Martingale rather than IID r.v.s.
I should say something about necessary and sufficient, like convex lipschitz bounded vs lipschitz gaussian.
maybe give heavy tail pointer? dunno.
Let’s use what we’ve seen to bound misclassifications!
Proof. Plug Hoeffding Lemma into sub-Gaussian Chernoff bound.
Classifier f_n memorizes training data: f_n(x) := \begin{cases} y_i & x = x_i \in (x_1,\ldots,x_n), \\ 17 &\textup{otherwise.}\end{cases} Consider two situations with \mathop{\textrm{Pr}}[Y=+1|X=x] = 1.
Suppose marginal on X has finite support. Eventually (large n), this support is memorized and \widehat{\mathcal{R}}_{\textrm{z}}(f_n) = 0 = \mathcal{R}_{\textrm{z}}(f_n).
Suppose marginal on X is continuous. With probability 1, \widehat{\mathcal{R}}_{\textrm{z}}(f_n) = 0 but \mathcal{R}_{\textrm{z}}(f_n) = 1 !
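A quick simulation of the second situation (with a uniform marginal on [0,1], and the junk value 17 as above):

```python
import random

random.seed(5)
# continuous marginal on X (uniform on [0,1]); labels are deterministically +1
train = [(random.random(), 1) for _ in range(100)]
memo = {x: y for x, y in train}

def f_n(x):
    # memorize the training set; output the junk value 17 everywhere else
    return memo.get(x, 17)

train_err = sum(1 for x, y in train if f_n(x) != y) / len(train)
test = [(random.random(), 1) for _ in range(1000)]
test_err = sum(1 for x, y in test if f_n(x) != y) / len(test)

assert train_err == 0.0
assert test_err == 1.0  # fresh draws (almost surely) miss every memorized point
```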
What broke Hoeffding’s inequality (and its proof) between these two examples?
This f_n overfits: \widehat{\mathcal{R}}(f_n) is small, but \mathcal{R}(f_n) is large.
Possible fixes.
Two samples: train on S_1, evaluate on S_2. But now we’re using less data, and run into the same issue if we evaluate multiple predictors on S_2.
Restrict access to data within training algorithm: SGD does this, and has a specialized (martingale-based) deviation analysis.
Uniform deviations: define a new r.v. controlling errors of all possible predictors \mathcal{F} the algorithm might output: \left[{ \sup_{f\in\mathcal{F}} \mathcal{R}(f) - \widehat{\mathcal{R}}(f) }\right]. This last one is the approach we’ll follow here. It can be adapted to data and algorithms by adapting \mathcal{F} (we’ll discuss this more shortly).
As before we will apply a brute-force approach to controlling generalization over a function family \mathcal{F}: we will simultaneously control generalization for all elements of the class by working with the random variable \left[{ \sup_{f\in\mathcal{F}} \mathcal{R}(f) - \widehat{\mathcal{R}}(f) }\right]. This is called “uniform deviations” because we prove a deviation bound that holds uniformly over all elements of \mathcal{F}.
On the other hand, we can adapt the approach to the output of the algorithm in various ways, as we will discuss after presenting the main Rademacher bound.
This definition can be applied to arbitrary elements of \mathbb{R}^n, and is useful outside machine learning. We will typically apply it to the behavior of a function class on S = (z_i)_{i=1}^n: \mathcal{F}_{|S} := \left\{{ (f(z_1),\ldots,f(z_n)) : f\in\mathcal{F}}\right\} \subseteq \mathbb{R}^n. With this definition, \textrm{URad}(\mathcal{F}_{|S}) = \mathop{\mathbb{E}}_\epsilon\sup_{u \in \mathcal{F}_{|S}} \left\langle \epsilon, u \right \rangle = \mathop{\mathbb{E}}_\epsilon\sup_{f \in \mathcal{F}} \sum_i \epsilon_i f(z_i).
(Sanity checks.) We’d like \textrm{URad}(V) to measure how “big” or “complicated” V is. Here are a few basic checks:
\textrm{URad}(\{u\}) = \mathop{\mathbb{E}}\left\langle \epsilon, u \right \rangle = 0; this seems desirable, as a class with |V|=1 is simple.
More generally, \textrm{URad}(V + \{u\}) = \textrm{URad}(V).
If V\subseteq V', then \textrm{URad}(V) \leq \textrm{URad}(V').
\textrm{URad}(\{\pm 1\}^n) = \mathop{\mathbb{E}}_\epsilon\sum_i \epsilon_i^2 = n; this also seems desirable, as V is as big/complicated as possible (amongst bounded vectors).
\textrm{URad}(\{(-1,\ldots,-1), (+1,\ldots,+1)\}) = \mathop{\mathbb{E}}_\epsilon| \sum_i \epsilon_i| = \Theta(\sqrt{n}). This also seems reasonable: |V|=2 and it is not completely trivial.
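These sanity checks can be verified by brute-force enumeration of the 2^n sign vectors (a throwaway sketch; the singleton vector below is arbitrary):

```python
import itertools, math

def urad(V):
    # URad(V) = E_eps sup_{u in V} <eps, u>, by enumerating all 2^n sign vectors
    n = len(V[0])
    total = 0.0
    for eps in itertools.product([-1, 1], repeat=n):
        total += max(sum(e * u for e, u in zip(eps, v)) for v in V)
    return total / 2 ** n

n = 4
assert urad([(1.0, 2.0, 3.0, 4.0)]) == 0.0                        # singleton
assert urad(list(itertools.product([-1.0, 1.0], repeat=n))) == n  # full hypercube
two_points = urad([(-1.0,) * n, (1.0,) * n])                      # E |sum_i eps_i|
assert 0 < two_points <= math.sqrt(n)
```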
(\textrm{URad} vs \textrm{Rad}.) I don’t know other texts or even papers which use \textrm{URad}; I only see \textrm{Rad}. I prefer \textrm{URad} for these reasons:
The 1/n is a nuisance while proving Rademacher complexity bounds.
When we connect Rademacher complexity to covering numbers, we need to change the norms to account for this 1/n.
It looks more like a regret quantity.
(Absolute value version.) The original definition of Rademacher complexity (P. L. Bartlett and Mendelson 2002), which still appears in many papers and books, is \textrm{URad}_{|\cdot|}(V) = \mathop{\mathbb{E}}_\epsilon\sup_{u\in V} \left|{ \left\langle \epsilon, u \right \rangle }\right|. Most texts now drop the absolute value. Here are my reasons:
\textrm{URad}_{|\cdot|} violates basic sanity checks: it is possible that \textrm{URad}_{|\cdot|}(\{u\}) \neq 0 and more generally \textrm{URad}_{|\cdot|}(V + \{u\}) \neq \textrm{URad}_{|\cdot|}(V), which violates my basic intuition about a “complexity measure.”
To obtain 1/n rates rather than 1/\sqrt{n}, the notion of local Rademacher complexity was introduced, which necessitated dropping the absolute value essentially due to the preceding sanity checks.
We can use \textrm{URad} to reason about \textrm{URad}_{|\cdot|}, since \textrm{URad}_{|\cdot|}(V) = \textrm{URad}(V\cup -V).
While \textrm{URad}_{|\cdot|} is more convenient for certain operations, e.g., \textrm{URad}_{|\cdot|}(\cup_i V_i) \leq \sum_i \textrm{URad}_{|\cdot|}(V_i), there are reasonable surrogates for \textrm{URad} (as below).
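The identity \textrm{URad}_{|\cdot|}(V) = \textrm{URad}(V\cup -V) above, as well as the failure of the singleton sanity check for \textrm{URad}_{|\cdot|}, can be checked by brute force on made-up vectors:

```python
import itertools

def urad(V):
    n = len(V[0])
    return sum(max(sum(e * u for e, u in zip(eps, v)) for v in V)
               for eps in itertools.product([-1, 1], repeat=n)) / 2 ** n

def urad_abs(V):
    # the original definition, with an absolute value inside the supremum
    n = len(V[0])
    return sum(max(abs(sum(e * u for e, u in zip(eps, v))) for v in V)
               for eps in itertools.product([-1, 1], repeat=n)) / 2 ** n

V = [(1.0, 2.0, -1.0), (0.5, -0.5, 2.0), (3.0, 0.0, 1.0)]
neg = [tuple(-u for u in v) for v in V]
assert abs(urad_abs(V) - urad(V + neg)) < 1e-12

# the singleton sanity check fails for the absolute-value version
assert urad_abs([(1.0, 2.0, -1.0)]) > 0
```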
The following theorem shows indeed that we can use Rademacher complexity to replace the \ln|\mathcal{F}| term from the finite-class bound with something more general (we’ll treat the Rademacher complexity of finite classes shortly).
Let \mathcal{F} be given with f(z) \in [a,b] a.s. \forall f\in\mathcal{F}.
With probability \geq 1-\delta, \begin{aligned} \sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f(Z) - \frac 1 n \sum_i f(z_i) &\leq \mathop{\mathbb{E}}_{(z_i)_{i=1}^n} \left({\sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f(z) - \frac 1 n \sum_i f(z_i)}\right) + (b-a) \sqrt{\frac{\ln(1/\delta)}{2n}}. \end{aligned}
With probability \geq 1-\delta, \mathop{\mathbb{E}}_{(z_i)_{i=1}^n} \textrm{URad}(\mathcal{F}_{|S}) \leq \textrm{URad}(\mathcal{F}_{|S}) + (b-a) \sqrt{\frac{n\ln(1/\delta)}{2}}.
With probability \geq 1-\delta, \sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f(Z) - \frac 1 n \sum_i f(z_i) \leq \frac 2 n \textrm{URad}(\mathcal{F}_{|S}) + 3(b-a) \sqrt{\frac{\ln(2/\delta)}{2 n}}.
The proof of this bound has many interesting points and is spread out over the next few subsections. It has these basic steps:
The expected uniform deviations are upper bounded by the expected Rademacher complexity. This itself is done in two steps:
The expected deviations are upper bounded by expected deviations between two finite samples. This is interesting since we could have reasonably defined generalization in terms of this latter quantity.
These two-sample deviations are upper bounded by expected Rademacher complexity by introducing random signs.
We replace this difference in expectations with high probability bounds via a more powerful concentration inequality which we haven’t discussed, McDiarmid’s inequality.
We’ll use further notation throughout this proof. \begin{aligned} Z &\quad \textrm{r.v.; e.g., $(x,y)$}, \\ \mathcal{F} &\quad \textrm{functions; e.g., } f(Z) = \ell(g(X), Y), \\ \mathop{\mathbb{E}}&\quad\textrm{expectation over } Z, \\ \mathop{\mathbb{E}}_n &\quad \textrm{expectation over } (Z_1,\ldots,Z_n), \\ \mathop{\mathbb{E}}f &= \mathop{\mathbb{E}}f(Z),\\ \widehat{\mathop{\mathbb{E}}}_n f &= \frac 1 n \sum_i f(Z_i). \end{aligned}
In this notation, \mathcal{R}_{\ell}(g) = \mathop{\mathbb{E}}\ell\circ g and \widehat{\mathcal{R}}_{\ell}(g) = \widehat{\mathop{\mathbb{E}}}_n \ell\circ g.
In this first step we’ll introduce another sample (“ghost sample”). Let (Z_1',\ldots,Z_n') be another iid draw from Z; define \mathop{\mathbb{E}}_n' and \widehat{\mathop{\mathbb{E}}}_n' analogously.
Proof. Fix any \epsilon>0, and let f_\epsilon\in\mathcal{F} be an \epsilon-approximate maximizer of the supremum; then \begin{aligned} \mathop{\mathbb{E}}_n\left({\sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f }\right) &\leq \mathop{\mathbb{E}}_n\left({ \mathop{\mathbb{E}}f_\epsilon- \widehat{\mathop{\mathbb{E}}}_n f_\epsilon}\right)+\epsilon \\ &= \mathop{\mathbb{E}}_n\left({ \mathop{\mathbb{E}}_n' \widehat{\mathop{\mathbb{E}}}_n' f_\epsilon- \widehat{\mathop{\mathbb{E}}}_n f_\epsilon}\right)+\epsilon \\ &= \mathop{\mathbb{E}}_n'\mathop{\mathbb{E}}_n\left({ \widehat{\mathop{\mathbb{E}}}_n' f_\epsilon- \widehat{\mathop{\mathbb{E}}}_n f_\epsilon}\right)+\epsilon \\ &\leq \mathop{\mathbb{E}}_n'\mathop{\mathbb{E}}_n\left({\sup_{f\in\mathcal{F}} \widehat{\mathop{\mathbb{E}}}_n' f - \widehat{\mathop{\mathbb{E}}}_n f }\right)+\epsilon. \end{aligned} The result follows since \epsilon>0 was arbitrary.
As above, in this section we are working only in expectation for now; in the subsequent section, we’ll get high probability bounds. But \sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f is a random variable; one can describe it in many other ways too! (E.g., “asymptotic normality.”)
As mentioned before, the preceding lemma says we can instead work with two samples. Working with two samples could have been our starting point (and definition of generalization): by itself it is a meaningful and interpretable quantity!
The second step swaps points between the two samples; a magic trick with random signs boils this down into Rademacher complexity.
Proof. Fix a vector \epsilon\in \{-1,+1\}^n and define a r.v. (U_i,U_i') := (Z_i,Z_i') if \epsilon_i = 1 and (U_i,U_i') := (Z_i',Z_i) if \epsilon_i = -1. Then \begin{aligned} \mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \widehat{\mathop{\mathbb{E}}}_n' f - \widehat{\mathop{\mathbb{E}}}_n f}\right) &= \mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \left({ f(Z_i') - f(Z_i)}\right) }\right) \\ &= \mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({ f(U_i') - f(U_i)}\right) }\right). \end{aligned} Here’s the big trick: since (Z_1,\ldots,Z_n,Z_1',\ldots,Z_n') and (U_1,\ldots,U_n,U_1',\ldots,U_n') have the same distribution, and \epsilon was arbitrary, then (with \mathop{\textrm{Pr}}[\epsilon_i = +1] = 1/2 iid “Rademacher”) \begin{aligned} \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \widehat{\mathop{\mathbb{E}}}_n' f - \widehat{\mathop{\mathbb{E}}}_n f}\right) &= \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({ f(U_i') - f(U_i)}\right) }\right) \\ &= \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({ f(Z_i') - f(Z_i)}\right) }\right). \end{aligned}
Since similarly swapping \epsilon_i and -\epsilon_i doesn’t change \mathop{\mathbb{E}}_\epsilon, \begin{aligned} &\mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \widehat{\mathop{\mathbb{E}}}_n' f - \widehat{\mathop{\mathbb{E}}}_n f}\right) \\ &= \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({ f(Z_i') - f(Z_i)}\right) }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f,f'\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({ f(Z_i') - f'(Z_i)}\right) }\right) \\ &= \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i f(Z_i') }\right) + \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n \left({ \sup_{f'\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({- f'(Z_i)}\right) }\right) \\ &= 2 \mathop{\mathbb{E}}_n \frac 1 n \mathop{\mathbb{E}}_\epsilon\sup_{f\in\mathcal{F}} \sum_i \epsilon_i f(Z_i) = \frac 2 n \mathop{\mathbb{E}}_n \textrm{URad}(\mathcal{F}_{|S}). \end{aligned}
We controlled expected uniform deviations: \mathop{\mathbb{E}}_n \sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f.
High probability bounds will follow via concentration inequalities.
I’m omitting the proof. A standard way is via a Martingale variant of the Chernoff bounding method. The Martingale adds one point at a time, and sees how things grow.
Hoeffding follows by setting F(\vec Z) = \sum_i Z_i/n and verifying bounded differences c_i := (b_i-a_i)/n.
Proof of Theorem 13.1.
The third bullet item follows from the first two by union bounding. To prove the first two, it suffices to apply the earlier two lemmas on expectations and verify the quantities satisfy bounded differences with constant (b-a)/n and (b-a), respectively.
For the first quantity, fix any i and any (z_1,\ldots,z_n,z_i'), and let \widehat{\mathop{\mathbb{E}}}_n' denote the empirical expectation with z_i replaced by z_i'. Then, using |\sup_f A_f - \sup_g B_g| \leq \sup_h |A_h - B_h| for the first inequality, \begin{aligned} \left|{ \sup_{f\in\mathcal{F}} \left({ \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f }\right) - \sup_{g\in\mathcal{F}} \left({ \mathop{\mathbb{E}}g - \widehat{\mathop{\mathbb{E}}}_n' g }\right) }\right| &= \left|{ \sup_{f\in\mathcal{F}} \left({ \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f }\right) - \sup_{g\in\mathcal{F}} \left({ \mathop{\mathbb{E}}g - \widehat{\mathop{\mathbb{E}}}_n g + \frac{g(z_i) - g(z_i')}{n} }\right) }\right| \\ &\leq \sup_{h\in\mathcal{F}} \frac{\left|{ h(z_i) - h(z_i') }\right|}{n} \\ &\leq \frac {b-a}{n}. \end{aligned} Using similar notation, and additionally writing S and S' for the two samples, for the Rademacher complexity, \begin{aligned} \left|{ \textrm{URad}(\mathcal{F}_{|S}) - \textrm{URad}(\mathcal{F}_{|S'})}\right| &= \left|{ \textrm{URad}(\mathcal{F}_{|S}) - \mathop{\mathbb{E}}_\epsilon\sup_{f\in\mathcal{F}} \sum_{j=1}^n \epsilon_j f(z_j')}\right| \\ &= \left|{ \textrm{URad}(\mathcal{F}_{|S}) - \mathop{\mathbb{E}}_\epsilon\sup_{f\in\mathcal{F}} \left({ \sum_{j=1}^n \epsilon_j f(z_j) - \epsilon_i f(z_i) + \epsilon_i f(z_i') }\right)}\right| \\ &\leq \mathop{\mathbb{E}}_\epsilon\sup_{h\in\mathcal{F}} \left|{ \epsilon_i h(z_i) - \epsilon_i h(z_i')}\right| \\ &= \sup_{h\in\mathcal{F}} \left|{ h(z_i) - h(z_i')}\right| \leq (b-a). \end{aligned}
Let’s consider logistic regression with bounded weights: \begin{aligned} \ell(y f(x)) &:= \ln(1+\exp(-y f(x))), \\ |\ell'| &\leq 1, \\ \mathcal{F} &:= \left\{{ w\in\mathbb{R}^d : \|w\| \leq B }\right\}, \\ (\ell\circ \mathcal{F})_{|S} &:= \left\{{ (\ell(y_1 w^{\scriptscriptstyle\mathsf{T}}x_1),\ldots,\ell(y_n w^{\scriptscriptstyle\mathsf{T}}x_n)) : \|w\|\leq B }\right\}, \\ \mathcal{R}_{\ell}(w) &:= \mathop{\mathbb{E}}\ell(Y w^{\scriptscriptstyle\mathsf{T}}X), \\ \widehat{\mathcal{R}}_{\ell}(w) &:= \frac 1 n \sum_i \ell(y_i w^{\scriptscriptstyle\mathsf{T}}x_i). \end{aligned} The goal is to control \mathcal{R}_{\ell}- \widehat{\mathcal{R}}_{\ell} over \mathcal{F} via the earlier theorem; our main effort is in controlling \textrm{URad}((\ell\circ \mathcal{F})_{|S}).
This has two steps:
“Peeling” off \ell.
Rademacher complexity of linear predictors.
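Before either step, the two empirical quantities themselves are cheap to compute; here is a minimal sketch (synthetic data, illustrative names, not from the notes).

```python
import numpy as np

def logistic_loss(z):
    # ell(z) = ln(1 + exp(-z)), computed stably via logaddexp
    return np.logaddexp(0.0, -z)

def empirical_risk(w, X, y):
    # hat R_ell(w) = (1/n) sum_i ell(y_i <w, x_i>)
    return float(np.mean(logistic_loss(y * (X @ w))))

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = np.sign(X @ np.ones(5))  # labels from a planted linear predictor
print(empirical_risk(np.ones(5), X, y), empirical_risk(np.zeros(5), X, y))
```

At w = 0 every margin is zero, so the empirical risk is exactly \ln 2; the planted predictor does strictly better.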
Proof. The idea of the proof is to “de-symmetrize” and get a difference of coordinates to which we can apply the Lipschitz bound with constant L. To start, \begin{aligned} \textrm{URad}(\ell \circ V) &= \mathop{\mathbb{E}}\sup_{u\in V} \sum_i \epsilon_i \ell_i(u_i) \\ &= \frac 1 2 \mathop{\mathbb{E}}_{\epsilon_{2:n}} \sup_{u,w\in V} \left({\ell_1(u_1) - \ell_1(w_1) + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i)) }\right) \\ &\leq \frac 1 2 \mathop{\mathbb{E}}_{\epsilon_{2:n}} \sup_{u,w\in V} \left({L|u_1 - w_1| + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i)) }\right). \end{aligned} To get rid of the absolute value, for any \epsilon, by considering swapping u and w, \begin{aligned} & \sup_{u,w\in V} \left({L|u_1 - w_1| + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i))}\right) \\ & = \max\Bigg\{ \sup_{u,w\in V} \left({L(u_1 - w_1) + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i))}\right), \\ &\qquad \sup_{u,w\in V} \left({L(w_1 - u_1) + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i))}\right) \Bigg\} \\ & = \sup_{u,w\in V} \left({L(u_1 - w_1) + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i))}\right). \end{aligned} As such, \begin{aligned} \textrm{URad}(\ell \circ V) &\leq \frac 1 2 \mathop{\mathbb{E}}_{\epsilon_{2:n}} \sup_{u,w\in V} \left({L|u_1 - w_1| + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i)) }\right) \\ &= \frac 1 2 \mathop{\mathbb{E}}_{\epsilon_{2:n}} \sup_{u,w\in V} \left({L(u_1 - w_1) + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i)) }\right) \\ &= \mathop{\mathbb{E}}_{\epsilon} \sup_{u\in V} \left[{ L\epsilon_1 u_1 + \sum_{i=2}^n \epsilon_i \ell_i(u_i) }\right]. \end{aligned} Repeating this procedure for the other coordinates gives the bound.
Revisiting our overloaded composition notation: \begin{aligned} \left({\ell \circ f}\right) &= \left({ (x,y) \mapsto \ell(-y f(x)) }\right),\\ \ell \circ \mathcal{F}&= \{ \ell \circ f : f\in \mathcal{F}\}. \end{aligned}
Proof. Use the Lipschitz composition lemma with \begin{aligned} |\ell(-y_i f(x_i)) - \ell(-y_i f'(x_i))| &\leq L|-y_i f(x_i) + y_i f'(x_i)| \\ &= L|f(x_i) - f'(x_i)|, \end{aligned} where the last step uses |y_i| = 1.
Now let’s handle step 2: Rademacher complexity of linear predictors (in \ell_2).
Proof. Fix any \epsilon\in\{-1,+1\}^n. Then \sup_{\|w\|\leq B} \sum_i \epsilon_i \left\langle w, x_i \right \rangle = \sup_{\|w\|\leq B} \left\langle w, \sum_i \epsilon_i x_i \right \rangle = B \left\|{ \sum_i \epsilon_i x_i }\right\|. We’ll bound this norm with Jensen’s inequality (the only inequality in the whole proof!): \mathop{\mathbb{E}}\left\|{ \sum_i \epsilon_i x_i }\right\| = \mathop{\mathbb{E}}\sqrt{ \left\|{ \sum_i \epsilon_i x_i }\right\|^2 } \leq \sqrt{ \mathop{\mathbb{E}}\left\|{ \sum_i \epsilon_i x_i }\right\|^2 }. To finish, since \mathop{\mathbb{E}}\epsilon_i \epsilon_j = 0 for i \neq j, \mathop{\mathbb{E}}\left\|{ \sum_i \epsilon_i x_i }\right\|^2 = \mathop{\mathbb{E}}\left({ \sum_i \left\|{ \epsilon_i x_i }\right\|^2 + \sum_{i\neq j} \left\langle \epsilon_i x_i, \epsilon_j x_j \right \rangle}\right) = \mathop{\mathbb{E}}\sum_i \left\|{x_i}\right\|^2 = \|X\|_F^2.
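The two displays above can be checked by Monte Carlo (a sketch of my own; the data is synthetic): the supremum is computed in closed form as B\|\sum_i \epsilon_i x_i\|, and averaging over random signs stays below B\|X\|_F.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, B = 30, 4, 2.0
X = rng.normal(size=(n, d))

# sup_{||w|| <= B} sum_i eps_i <w, x_i> = B || sum_i eps_i x_i ||_2,
# so URad of the linear class can be estimated by sampling sign vectors.
eps = rng.choice([-1.0, 1.0], size=(5000, n))
urad_mc = (B * np.linalg.norm(eps @ X, axis=1)).mean()
bound = B * np.linalg.norm(X)  # B ||X||_F, via Jensen
print(urad_mc, bound)
```

The Monte Carlo value sits a few percent below the bound, which is exactly the Jensen gap.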
Let’s now return to the logistic regression example!
Suppose \|w\|\leq B and \|x_i\|\leq 1, and the loss is the 1-Lipschitz logistic loss \ell_{\log}(z) := \ln(1+\exp(z)). Note \ell(\left\langle w, yx \right \rangle) \geq 0 and \ell(\left\langle w, yx \right \rangle) \leq \ln(2) + |\left\langle w, yx \right \rangle| \leq \ln(2) + B.
Combining the main Rademacher bound with the Lipschitz composition lemma and the Rademacher bound on linear predictors, with probability at least 1-\delta, every w\in\mathbb{R}^d with \|w\|\leq B satisfies
\begin{aligned} \mathcal{R}_{\ell}(w) &\leq \widehat{\mathcal{R}}_{\ell}(w) + \frac {2}{n} \textrm{URad}((\ell\circ \mathcal{F})_{|S}) + 3(\ln(2)+B)\sqrt{\ln(2/\delta)/(2n)} \\ &\leq \widehat{\mathcal{R}}_{\ell}(w) + \frac {2B\|X\|_F}{n} + 3(\ln(2)+B)\sqrt{\ln(2/\delta)/(2n)} \\ &\leq \widehat{\mathcal{R}}_{\ell}(w) + \frac {2B + 3(B+\ln(2))\sqrt{\ln(2/\delta)/2}}{\sqrt{n}}. \end{aligned}
(Average case vs worst case.) Here we replaced \|X\|_F with the looser \sqrt{n}.
This bound scales as the SGD logistic regression bound proved via Azuma, despite following a somewhat different route (Azuma and McDiarmid are both proved with Chernoff bounding method; the former approach involves no symmetrization, whereas the latter holds for more than the output of an algorithm).
It would be nice to have an “average Lipschitz” bound rather than a “worst-case Lipschitz” one; e.g., when working with ReLU networks, the gate can zero out many inputs, so the worst case may be far too pessimistic. But it’s not clear how to do this. Relatedly, regularizing the gradient is sometimes used in practice.
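For concreteness, the final (\|X\|_F \leq \sqrt{n}) form of the generalization gap above is easy to evaluate as a function of (B, n, \delta); this is a sketch using the constants exactly as displayed, with illustrative inputs.

```python
import numpy as np

def logistic_gen_gap(B, n, delta):
    # 2B/sqrt(n) + 3(ln 2 + B) sqrt(ln(2/delta) / (2n)):
    # the gap term from the logistic regression bound above
    return 2 * B / np.sqrt(n) + 3 * (np.log(2) + B) * np.sqrt(np.log(2 / delta) / (2 * n))

for n in [100, 10_000, 1_000_000]:
    print(n, logistic_gen_gap(B=1.0, n=n, delta=0.05))
```

Both terms scale as 1/\sqrt{n}, so quadrupling the sample size halves the gap exactly.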
In the logistic regression example, we peeled off the loss and bounded the Rademacher complexity of the predictors.
If most training labels are predicted not only accurately, but with a large margin, as in section 10, then we can further reduce the generalization bound.
Define \ell_\gamma(z) := \max\{0,\min\{1, 1-z/\gamma\}\}, \mathcal{R}_{\gamma}(f) := \mathcal{R}_{\ell_\gamma}(f) = \mathop{\mathbb{E}}\ell_\gamma(Y f(X)), and recall \mathcal{R}_{\textrm{z}}(f) = \mathop{\textrm{Pr}}[f(X) \neq Y].
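The ramp loss \ell_\gamma can be sketched directly (a minimal illustration of the definition just given):

```python
import numpy as np

def ramp_loss(z, gamma):
    # ell_gamma(z) = max{0, min{1, 1 - z/gamma}}: equals 1 for z <= 0,
    # 0 for z >= gamma, and interpolates linearly in between
    return np.clip(1.0 - z / gamma, 0.0, 1.0)

margins = np.array([-1.0, 0.0, 0.25, 0.5, 1.0])
print(ramp_loss(margins, gamma=0.5))  # values 1, 1, 0.5, 0, 0
```

Note the sandwich \mathbf{1}[z \leq 0] \leq \ell_\gamma(z) \leq \mathbf{1}[z \leq \gamma], which is what drives the margin bound.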
Proof. Since \mathbf{1}[ \textrm{sgn}(f(x)) \neq y] \leq \mathbf{1}[ -f(x)y \geq 0] \leq \ell_\gamma(f(x)y), then \mathcal{R}_{\textrm{z}}(f) \leq \mathcal{R}_{\gamma}(f). The bound between \mathcal{R}_{\gamma} and \widehat{\mathcal{R}}_{\gamma} follows from the fundamental Rademacher bound, and by peeling the \frac 1 \gamma-Lipschitz function \ell_\gamma.
is that using per-example lipschitz? need to restate peeling? also, properly invoke peeling?
As a generalization notion, this was first introduced for 2-layer networks in (P. L. Bartlett 1996), and then carried to many other settings (SVM, boosting, …).
There are many different proof schemes; another one uses sparsification (Schapire et al. 1997).
This approach is again being extensively used for deep networks, since it seems that while weight matrix norms grow indefinitely, the margins grow along with them (P. Bartlett, Foster, and Telgarsky 2017).
In our warm-up example of finite classes, our complexity term was \ln|\mathcal{F}|. Here we will recover that, via Rademacher complexity. Moreover, the bound has a special form which will be useful in the later VC dimension and especially covering sections.
The \|\cdot\|_2 geometry is intrinsic here; I don’t know how to replace it with other norms without introducing looseness. This matters later when we encounter the Dudley entropy integral.
We’ll prove this via a few lemmas.
Proof.
\begin{aligned} \mathop{\mathbb{E}}\max_i X_i &= \inf_{t>0} \mathop{\mathbb{E}}\frac 1 t \ln \max_i \exp(tX_i) \leq \inf_{t>0} \mathop{\mathbb{E}}\frac 1 t \ln \sum_i \exp(tX_i) \\ &\leq \inf_{t>0} \frac 1 t \ln \sum_i \mathop{\mathbb{E}}\exp(tX_i) \leq \inf_{t>0} \frac 1 t \ln \sum_i \exp(t^2c^2/2) \\ &= \inf_{t>0} (\ln(n) / t + c^2t/2), \end{aligned} and plug in the minimizer t = \sqrt{ 2\ln(n)/c^2 }, which yields c\sqrt{2\ln(n)}.
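As a quick simulation (my own, with illustrative constants): iid N(0, c^2) draws are c^2-subgaussian, so the lemma predicts their expected maximum is at most c\sqrt{2\ln n}.

```python
import numpy as np

rng = np.random.default_rng(4)
n, c = 1000, 1.0

# n iid N(0, c^2) variables are c^2-subgaussian, so the lemma gives
# E max_i X_i <= c sqrt(2 ln n); compare against simulation.
maxima = rng.normal(0.0, c, size=(2000, n)).max(axis=1)
print(maxima.mean(), c * np.sqrt(2 * np.log(n)))
```

For Gaussians the true expected maximum is close to (but below) the bound, so this rate is essentially unimprovable in general.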
Proof. We did this in the concentration lecture, but here it is again: \mathop{\mathbb{E}}\exp(t\sum_i X_i) = \prod_i \mathop{\mathbb{E}}\exp(tX_i) \leq \prod_i \exp(t^2c_i^2/2) = \exp(t^2\|\vec c\|_2^2/2).
Proof of +Theorem 13.5 (Massart finite lemma).
Let \vec{\epsilon} be iid Rademacher and fix u\in V. Define X_{u,i} := \epsilon_i u_i and X_u := \sum_i X_{u,i}.
By the Hoeffding lemma, X_{u,i} is (u_i - (-u_i))^2/4 = u_i^2-subgaussian, thus (by the preceding lemma) X_u is \|u\|_2^2-subgaussian. Thus \textrm{URad}(V) = \mathop{\mathbb{E}}_\epsilon\max_{u\in V} \left\langle \epsilon, u \right \rangle = \mathop{\mathbb{E}}_\epsilon\max_{u\in V} X_u \leq \max_{u\in V} \|u\|_2 \sqrt{2\ln |V| }.
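For a small finite class the unnormalized Rademacher complexity can be computed exactly (enumerating all 2^n sign vectors) and compared against the Massart bound; a sketch of my own with synthetic vectors.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(5)
n, k = 8, 5
V = rng.uniform(-1.0, 1.0, size=(k, n))  # finite class: k vectors in R^n

# URad(V) = E_eps max_{u in V} <eps, u>, computed exactly over all 2^n signs
signs = np.array(list(product([-1.0, 1.0], repeat=n)))
urad_exact = np.max(signs @ V.T, axis=1).mean()
massart = np.linalg.norm(V, axis=1).max() * np.sqrt(2 * np.log(k))
print(urad_exact, massart)
```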
not an exhaustive list…
The bounds we will prove shortly are all loose. To some extent, it was argued in (Neyshabur, Tomioka, and Srebro 2014; Zhang et al. 2017) and (Nagarajan and Kolter 2019) that this may be intrinsic to Rademacher complexity, though these arguments can be overturned in various settings (in the former, via a posteriori bounds, e.g., as obtained via union bound; in the latter case, by considering a modified set of good predictors for the same problem); as such, that particular criticism is unclear. An alternative approach was highlighted in (Dziugaite and Roy 2017), however the bounds produced there are averages over some collection of predictors, and not directly comparable to the bounds here. Overall, though, many authors are investigating alternatives to the definition of generalization.
Looking outside the specific setting of neural network generalization, Rademacher complexity has been widely adopted since, to a great extent, it can cleanly re-prove many existing bounds; moreover, elements of Rademacher complexity proofs existed many decades prior to the coining of the term (P. L. Bartlett and Mendelson 2002). However, already in these settings, Rademacher complexity has extensive weaknesses.
For many learning problems, extensive effort was put into fast or optimal learning rates, which often boiled down to replacing a 1/\sqrt{n} dependency with a 1/n. While Local Rademacher Complexity is able to recover some of these bounds, it does not seem to recover all of them, and moreover the proofs are often very complicated.
In many non-parametric learning settings, for example k-nearest-neighbor, the best bounds all use a direct analysis (Chaudhuri and Dasgupta 2014), and attempts to recover these analyses with Rademacher complexity have been unsuccessful.
Closer to the investigation in these lecture notes, there are even cases where a direct Martingale analysis of SGD slightly beats the application of uniform convergence to the output of gradient descent, and similarly to the preceding case, attempts to close this gap have been unsuccessful (Ji and Telgarsky 2019a).
We will give two bounds, obtained by inductively peeling off layers.
One will depend on \|W_i^{\scriptscriptstyle\mathsf{T}}\|_{1,\infty}. This bound has a pretty clean proof, and appeared in (P. L. Bartlett and Mendelson 2002).
The other will depend on \|W_i^{\scriptscriptstyle\mathsf{T}}\|_{{\textrm{F}}}, and is more recent (Golowich, Rakhlin, and Shamir 2018).
also i didn’t mention yet that the other proof techniques reduce to this one?
Notation: \|M\|_{b,c} = \| ( \|M_{:1}\|_b,\ldots,\|M_{:d}\|_b)\|_c means apply the b-norm to each column, then the c-norm to the resulting vector of column norms.
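This notation is compact enough to be easy to misread, so here is a direct sketch (the function name is mine):

```python
import numpy as np

def norm_bc(M, b, c):
    # ||M||_{b,c}: apply the b-norm to each column of M,
    # then the c-norm to the vector of column norms
    col_norms = np.linalg.norm(M, ord=b, axis=0)
    return float(np.linalg.norm(col_norms, ord=c))

W = np.array([[1.0, -2.0], [3.0, 4.0]])
# ||W^T||_{1,infty} = max row 1-norm of W = max(3, 7) = 7
print(norm_bc(W.T, 1, np.inf))
```

Note also that \|M\|_{2,2} is just the Frobenius norm, which is the other norm used in this section.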
Many newer bounds replace \|W_i^{\scriptscriptstyle\mathsf{T}}\| with a distance to initialization. (The NTK is one regime where this helps.) I don’t know how to use distance to initialization in the bounds in this section, but a later bound can handle it.
(\rho B)^L is roughly a Lipschitz constant of the network according to \infty-norm bounded inputs. Ideally we’d have “average Lipschitz” not “worst case,” but we’re still far from that…
The factor 2^L is not good and the next section gives one technique to remove it.
We’ll prove this with an induction “peeling” off layers. This peeling will use the following lemma, which collects many standard Rademacher properties.
Proof of +Lemma 14.1.
Fix any u_0 \in V; then \mathop{\mathbb{E}}_\epsilon\sup_{u\in V}\left\langle \epsilon, u \right \rangle \geq \mathop{\mathbb{E}}_\epsilon\left\langle \epsilon, u_0 \right \rangle = 0.
Can get the inequality with |c|-Lipschitz functions \ell_i(r) := c\cdot r + u_i; for equality, note -\epsilon c and \epsilon c have the same distribution.
This follows since optimization over a polytope is achieved at a corner. In detail, \begin{aligned} \textrm{URad}(\textrm{conv}(V)) &= \mathop{\mathbb{E}}_\epsilon\sup_{\substack{k\geq 1 \\ \alpha \in \Delta_k}} \sup_{u_1,\ldots,u_k\in V} \left\langle \epsilon, \sum_j \alpha_j u_j \right \rangle \\ &= \mathop{\mathbb{E}}_\epsilon\sup_{\substack{k\geq 1 \\ \alpha \in \Delta_k}} \sum_j \alpha_j \sup_{u_j \in V} \left\langle \epsilon, u_j \right \rangle \\ &= \mathop{\mathbb{E}}_\epsilon\left({ \sup_{\substack{k\geq 1 \\ \alpha \in \Delta_k}} \sum_j \alpha_j}\right) \sup_{u \in V} \left\langle \epsilon, u \right \rangle \\ &= \textrm{URad}(V). \end{aligned}
Using the condition, \begin{aligned} \mathop{\mathbb{E}}_\epsilon\sup_{u\in \cup_i V_i} \left\langle \epsilon, u \right \rangle &= \mathop{\mathbb{E}}_\epsilon\sup_i \sup_{u\in V_i} \left\langle \epsilon, u \right \rangle \leq \mathop{\mathbb{E}}_\epsilon\sum_i \sup_{u\in V_i} \left\langle \epsilon, u \right \rangle \\ &= \sum_{i\geq1} \textrm{URad}(V_i). \end{aligned}
Since integrating over \epsilon is the same as integrating over -\epsilon (the two are equivalent distributions), \textrm{URad}(-V) = \mathop{\mathbb{E}}_{\epsilon} \sup_{u\in V} \left\langle \epsilon, -u \right \rangle = \mathop{\mathbb{E}}_{\epsilon} \sup_{u\in V} \left\langle -\epsilon, -u \right \rangle = \textrm{URad}(V).
Proof of +Theorem 14.1.
Let \mathcal{F}_i denote functions computed by nodes in layer i. It’ll be shown by induction that \textrm{URad}((\mathcal{F}_{i})_{|S}) \leq \|X \|_{2,\infty} (2\rho B)^i \sqrt{2 \ln(d)}. Base case (i=0): by the Massart finite lemma, \begin{aligned} \textrm{URad}((\mathcal{F}_{i})_{|S}) &= \textrm{URad}\left({ \left\{{x\mapsto x_j : j \in \{1,\ldots,d\}}\right\}_{|S} }\right) \\ &\leq \left({\max_j \|(x_1)_j, \ldots, (x_n)_j\|_2}\right) \sqrt{2\ln(d)} \\ &= \|X\|_{2,\infty} \sqrt{2\ln d } = \|X\|_{2,\infty} (2\rho B)^0 \sqrt{2\ln d}. \end{aligned}
Inductive step. Since 0 = \sigma(\left\langle 0, F(x) \right \rangle) \in \mathcal{F}_{i+1}, applying both Lipschitz peeling and the preceding multi-part lemma, \begin{aligned} &\textrm{URad}((\mathcal{F}_{i+1})_{|S}) \\ &=\textrm{URad}\left({\left\{{ x\mapsto \sigma_{i+1}(\|W_{i+1}^{\scriptscriptstyle\mathsf{T}}\|_{1,\infty} g(x)) : g\in \textrm{conv}(-\mathcal{F}_i \cup \mathcal{F}_i) }\right\}_{|S} }\right) \\ &\leq \rho B\cdot \textrm{URad}\left({-(\mathcal{F}_i)_{|S} \cup (\mathcal{F}_i)_{|S} }\right) \\ &\leq 2 \rho B\cdot \textrm{URad}\left({(\mathcal{F}_i)_{|S}}\right) \\ &\leq (2 \rho B)^{i+1} \|X\|_{2,\infty} \sqrt{2\ln d}. \end{aligned}
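To make the growth concrete, here is a small sketch (my own, not from the notes) evaluating the bound just proved; all inputs are illustrative.

```python
import numpy as np

def peel_bound(X, rho, B, L, d):
    # ||X||_{2,infty} (2 rho B)^L sqrt(2 ln d), the bound just proved;
    # ||X||_{2,infty} is the largest per-coordinate column norm of the
    # n x d data matrix X (rows are examples)
    x_2inf = np.linalg.norm(X, axis=0).max()
    return x_2inf * (2 * rho * B) ** L * np.sqrt(2 * np.log(d))

X = np.ones((100, 3)) / np.sqrt(3)
for L in [1, 2, 4, 8]:
    print(L, peel_bound(X, rho=1.0, B=1.0, L=L, d=3))  # grows as 2^L
```

Even with \rho B = 1, each extra layer doubles the bound: this is the 2^L factor the next section works to remove.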
There are many related norm-based proofs now changing constants and also (1,\infty); see for instance Neyshabur-Tomioka-Srebro, Bartlett-Foster-Telgarsky (we’ll cover this), Golowich-Rakhlin-Shamir (we’ll cover this), Barron-Klusowski.
The best lower bound is roughly what you get by writing a linear function as a deep network \ddot \frown.
The proof does not “coordinate” the behavior of adjacent layers in any way, and worst-cases what can happen.
The proof technique can also handle other matrix norms (though with some adjustment), bringing it closer to the previous layer peeling proof.
For an earlier version of this bound but including things like 2^L, see Neyshabur-Tomioka-Srebro. I need a proper citation
The main proof trick (to remove 2^L) is to replace \mathop{\mathbb{E}}_\epsilon with \ln \mathop{\mathbb{E}}_\epsilon\exp; the 2^L now appears inside the \ln.
To make this work, we need two calculations, which we’ll wrap up into lemmas.
When we do “Lipschitz peeling,” we now have to deal with \exp inside \mathop{\mathbb{E}}_\epsilon. Magically, things still work, but the proof is nastier, and we’ll not include it.
The base case of the previous layer-peeling proof was handled by the Massart finite lemma; this time we end up with something of the form \mathop{\mathbb{E}}_\epsilon\exp( t\|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|_2).
Comparing to the \infty\to\infty operator norm (aka (1,\infty)) bound, suppose W\in\mathbb{R}^{m\times m} has row/node norms \|W_{j:}\|_2 \approx 1, thus \|W_{j:}\|_1 \approx \sqrt m, and \|W\|_{\textrm{F}}\approx \sqrt m \approx \|W^{\scriptscriptstyle\mathsf{T}}\|_{1,\infty}; in this regime, the present bound improves on the previous one mainly by removing the 2^L factor.
Here is our refined Lipschitz peeling bound, stated without proof.
The peeling proof will end with a term \mathop{\mathbb{E}}\exp\left({ t\|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|}\right), and we’ll optimize the t to get the final bound. Consequently, we are proving \|X^{\scriptscriptstyle\mathsf{T}}\epsilon\| is sub-Gaussian!
Proof. Following the notation of (Wainwright 2015), define Y_k := \mathop{\mathbb{E}}\left[{ \|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|_2 | \epsilon_1,\ldots,\epsilon_k }\right], D_k := Y_k - Y_{k-1}, whereby Y_n - Y_0 = \sum_k D_k. For the base case, as usual \mathop{\mathbb{E}}\|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|_2 \leq \sqrt{ \mathop{\mathbb{E}}\|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|^2 } = \sqrt{ \sum_{j=1}^d\mathop{\mathbb{E}}(X_{:j}^{\scriptscriptstyle\mathsf{T}}\epsilon)^2 } = \sqrt{ \sum_{j=1}^d \| X_{:j}\|^2 } = \|X\|_{{\textrm{F}}}.
Supposing \epsilon and \epsilon' only differ on \epsilon_k, \begin{aligned} \sup_{\epsilon_k} | \|X^{\scriptscriptstyle\mathsf{T}}\epsilon\| - \|X^{\scriptscriptstyle\mathsf{T}}\epsilon'\| |^2 &\leq \sup_{\epsilon_k} \|X^{\scriptscriptstyle\mathsf{T}}(\epsilon- \epsilon') \|^2 = \sup_{\epsilon_k} \sum_{j=1}^d (X_{:j}^{\scriptscriptstyle\mathsf{T}}(\epsilon- \epsilon'))^2 \\ &= \sup_{\epsilon_k} \sum_{j=1}^d (X_{k,j} (\epsilon_k - \epsilon'_k))^2 \leq 4 \|X_{k:}\|^2, \end{aligned} therefore by the (conditional) Hoeffding lemma, D_k is \|X_{k:}\|^2-sub-Gaussian, thus (Theorem 2.3, Wainwright 2015) grants \sum_k D_k is \sigma^2-sub-Gaussian with \sigma^2 = \sum_k \|X_{k:}\|^2 = \|X\|_{\textrm{F}}^2.
Proof of +Theorem 14.2 ((Theorem 1, Golowich, Rakhlin, and Shamir 2018)). For convenience, let X_i denote the output of layer i, meaning X_0 = X \quad\textup{and}\quad X_i := \sigma_i(X_{i-1} W_i^{\scriptscriptstyle\mathsf{T}}). Let t>0 be a free parameter and let w denote all parameters across all layers; the bulk of the proof will show (by induction on layers) that \mathop{\mathbb{E}}\sup_w \exp( t \|\epsilon^{\scriptscriptstyle\mathsf{T}}X_i\| ) \leq \mathop{\mathbb{E}}2^i \exp( t B^i \|\epsilon^{\scriptscriptstyle\mathsf{T}}X_0\| ).
To see how to complete the proof from here, note by the earlier “base case lemma” (setting \mu := \mathop{\mathbb{E}}\|X_0^{\scriptscriptstyle\mathsf{T}}\epsilon\| for convenience) and Jensen’s inequality that \begin{aligned} \textrm{URad}(\mathcal{F}_{|S}) &= \mathop{\mathbb{E}}\sup_w \epsilon^{\scriptscriptstyle\mathsf{T}}X_L = \mathop{\mathbb{E}}\frac 1 t \ln \sup_w \exp\left({ t \epsilon^{\scriptscriptstyle\mathsf{T}}X_L }\right) \\ &\leq \frac 1 t \ln \mathop{\mathbb{E}}\sup_w \exp\left({ t | \epsilon^{\scriptscriptstyle\mathsf{T}}X_L | }\right) \leq \frac 1 t \ln \mathop{\mathbb{E}}2^L \exp\left({ t B^L \| \epsilon^{\scriptscriptstyle\mathsf{T}}X_0 \| }\right) \\ &\leq \frac 1 t \ln \mathop{\mathbb{E}}2^L \exp\left({ t B^L( \| \epsilon^{\scriptscriptstyle\mathsf{T}}X_0 \| - \mu + \mu)}\right) \\ &\leq \frac 1 t \ln\left[{ 2^L \exp\left({ t^2 B^{2L} \|X\|_{\textrm{F}}^2/2 + tB^L \mu}\right) }\right] \\ &\leq \frac {L\ln 2} t + \frac {t B^{2L}\|X\|_{\textrm{F}}^2}{2} + B^L \|X\|_{\textrm{F}}, \end{aligned} whereby the final bound follows with the minimizing choice t := \sqrt{\frac{ 2 L \ln(2) }{ B^{2L}\|X\|_{\textrm{F}}^2 }} \ \Longrightarrow\ {} \textrm{URad}(\mathcal{F}_{|S}) \leq \sqrt{2\ln(2) L B^{2L}\|X\|_{\textrm{F}}^2} + B^L \|X\|_{\textrm{F}}.
The main inequality is now proved via induction.
For convenience, define \sigma := \sigma_i and Y := X_{i-1} and V := W_i and \tilde V has \ell_2-normalized rows. By positive homogeneity and definition, \begin{aligned} \sup_w \|\epsilon^{\scriptscriptstyle\mathsf{T}}X_i\|^2 &= \sup_w \sum_j (\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y V^{\scriptscriptstyle\mathsf{T}})_{:j})^2 \\ &= \sup_w \sum_j (\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y V_{j:}^{\scriptscriptstyle\mathsf{T}}))^2 \\ &= \sup_w \sum_j (\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(\|V_{j:}\| Y \tilde V_{j:}^{\scriptscriptstyle\mathsf{T}}))^2 \\ &= \sup_w \sum_j \|V_{j:}\|^2 (\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y \tilde V_{j:}^{\scriptscriptstyle\mathsf{T}}))^2. \end{aligned}
The maximum over row norms is attained by placing all mass on a single row; thus, letting u denote an arbitrary unit norm (column) vector, and finally applying the peeling lemma, and re-introducing the dropped terms, and closing with the IH, \begin{aligned} \mathop{\mathbb{E}}_\epsilon \exp\left({t \sqrt{ \sup_w \|\epsilon^{\scriptscriptstyle\mathsf{T}}X_i\|^2 } }\right) &= \mathop{\mathbb{E}}_\epsilon \exp \left({t \sqrt{ \sup_{w,u} B^2 (\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u))^2 } }\right) \\ &= \mathop{\mathbb{E}}_\epsilon\sup_{w,u} \exp\left({t B |\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u)| }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon\sup_{w,u} \exp\left({t B \epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u) }\right) + \exp\left({ - t B \epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u) }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon\sup_{w,u} \exp\left({t B \epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u) }\right) + \mathop{\mathbb{E}}_\epsilon\sup_{w,u} \exp\left({ - t B \epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u) }\right) \\ &= \mathop{\mathbb{E}}_\epsilon 2 \sup_{w,u} \exp\left({t B \epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u) }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon 2 \sup_{w,u} \exp\left({t B \epsilon^{\scriptscriptstyle\mathsf{T}}Y u }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon 2 \sup_{w} \exp\left({t B \| \epsilon^{\scriptscriptstyle\mathsf{T}}Y\|_2 }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon 2^i \sup_{w} \exp\left({t B^i \| \epsilon^{\scriptscriptstyle\mathsf{T}}X_0\|_2 }\right). \end{aligned}
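Taking the final statement at face value, the bound is B^L\|X\|_{\textrm{F}}(\sqrt{2\ln(2)L}+1), i.e. it grows like \sqrt{L}\,B^L in depth rather than (2B)^L. A rough rate comparison (my own sketch; the two bounds use different data norms, so only the depth dependence is being compared):

```python
import numpy as np

def grs_bound(x_fro, B, L):
    # sqrt(2 ln(2) L B^(2L) ||X||_F^2) + B^L ||X||_F, rearranged
    return B**L * x_fro * (np.sqrt(2 * np.log(2) * L) + 1.0)

def peel_bound_2L(x_2inf, rho, B, L, d):
    # the earlier (1, infty) peeling bound, which carries the 2^L factor
    return x_2inf * (2 * rho * B) ** L * np.sqrt(2 * np.log(d))

for L in [1, 2, 4, 8]:
    print(L, grs_bound(10.0, 1.0, L), peel_bound_2L(1.0, 1.0, 1.0, L, 3))
```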
Covering numbers are another way to control generalization. Covering numbers and Rademacher complexities are nearly tight with each other in many standard settings, though in these lectures we will only upper bound Rademacher complexity with covering numbers.
Covering numbers are a classical concept. The idea is to discretize or cover the function class with some finite collection of representative elements; in this way, it is tied to the “totally bounded” characterization of compact sets. Their first use in a statistical context is due to (Kolmogorov and Tikhomirov 1959).
i should discuss relating it to uniform convergence via rademacher, and how we have two ways, and neither is really tight, need chaining, and pointer to vershynin maybe.
“Improper” covers drop the requirement V\subseteq U. (We’ll come back to this.)
Most treatments define special norms with normalization 1/n baked in; we’ll use unnormalized Rademacher complexity and covering numbers.
Although the definition can handle directly covering functions \mathcal{F}, we get nice bounds by covering \mathcal{F}_{|S}; conceptually this is also easier, since it is just a vector (or matrix) covering problem with vector (or matrix) norms.
Proof. Let \alpha >0 be arbitrary, and suppose \mathcal{N}(U,\alpha,\|\cdot\|_2)<\infty (otherwise bound holds trivially). Let V denote a minimal cover, and V(a) its closest element to a\in U. Then \begin{aligned} \textrm{URad}(U) &= \mathop{\mathbb{E}}\sup_{a\in U} \left\langle \epsilon, a \right \rangle \\ &= \mathop{\mathbb{E}}\sup_{a\in U} \left\langle \epsilon, a- V(a) + V(a) \right \rangle \\ &= \mathop{\mathbb{E}}\sup_{a\in U} \left({\left\langle \epsilon, V(a) \right \rangle + \left\langle \epsilon, a - V(a) \right \rangle}\right) \\ &\leq \mathop{\mathbb{E}}\sup_{a\in U} \left({\left\langle \epsilon, V(a) \right \rangle + \|\epsilon\|\cdot\|a-V(a)\|}\right) \\ &\leq \textrm{URad}(V) + \alpha \sqrt{n} \\ &\leq \sup_{b\in V}(\|b\|_2) \sqrt{2 \ln |V| } + \alpha \sqrt{n} \\ &\leq \sup_{a\in U}(\|a\|_2) \sqrt{2 \ln |V| } + \alpha \sqrt{n}, \end{aligned} and the bound follows since \alpha>0 was arbitrary.
The same proof handles improper covers with a minor adjustment: for every b\in V, there must exist U(b)\in U with \|b-U(b)\|\leq \alpha (otherwise, b could be moved closer to U), thus \begin{aligned} \sup_{b\in V}\|b\|_2 &\leq \sup_{b\in V} \left({ \|b-U(b)\|_2 + \|U(b)\|_2 }\right) \leq \alpha + \sup_{a\in U} \|a\|_2. \end{aligned}
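Here is a toy instance of the single-scale bound (my own sketch, not from the notes): take U to be a unit-length segment in \mathbb{R}^n, whose covering numbers are trivial to write down, and compare the optimized bound against a Monte Carlo estimate of \textrm{URad}(U).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20
v = rng.normal(size=n)
v /= np.linalg.norm(v)

# U = {c v : c in [0, 1]}: a unit-length segment in R^n.  A proper cover
# at scale alpha uses N = ceil(1/(2 alpha)) evenly spaced midpoints, so
# the bound reads sqrt(2 ln N) + alpha sqrt(n) (with sup_a ||a|| = 1).
def cover_bound(alpha):
    N = int(np.ceil(1.0 / (2.0 * alpha)))
    return np.sqrt(2.0 * np.log(N)) + alpha * np.sqrt(n)

# Monte Carlo URad(U): sup_{c in [0,1]} c <eps, v> = max(0, <eps, v>)
eps = rng.choice([-1.0, 1.0], size=(20000, n))
urad_mc = np.maximum(eps @ v, 0.0).mean()
best = min(cover_bound(a) for a in np.linspace(0.01, 0.5, 50))
print(urad_mc, best)
```

The single-scale bound is far from tight here (the \alpha\sqrt{n} term forces a tradeoff); this looseness is exactly what the Dudley entropy integral of the next section repairs.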
To handle other norms, superficially we need two adjustments: Cauchy-Schwarz can be replaced with Hölder, but it’s unclear how to replace Massart without slop relating different norms.
There is a classical proof that says that covering numbers and Rademacher complexities are roughly the same; the upper bound uses the Dudley entropy integral, and the lower bound uses a “Sudakov lower bound” which we will not include here.
crappy commment, needs to be improved.
The Dudley entropy integral works at multiple scales.
Suppose we have covers (V_1, V_2, \ldots, V_N) at successively halved scales (\alpha_1, \alpha_1/2, \alpha_1/4, \ldots).
Given a\in U, choosing V_i(a) := \mathop{\mathrm{arg\,min}}_{b\in V_i} \|a-b\|, a = (a - V_N(a)) + (V_N(a) - V_{N-1}(a)) + (V_{N-1}(a) - V_{N-2}(a)) + \cdots. We are thus rewriting a as a sequence of increments at different scales.
One way to think of it is as writing a number as its binary expansion x = (0.b_1b_2b_3\dots) = \sum_{i\geq 1} \frac{(b_i.b_{i+1}\ldots) - (0.b_{i+1}\ldots)}{2^i} = \sum_{i\geq 1} \frac{b_i}{2^i}. In the Dudley entropy integral, we are covering these increments b_i, rather than the number x directly.
One can cover increments via covering numbers for the base set, and that is why these basic covering numbers appear in the Dudley entropy integral. But internally, the argument really is about these increments.
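The binary-expansion analogy above can be made literal (a tiny sketch of my own): peel off one dyadic increment per scale and check that the increments reconstruct the number.

```python
# extracting dyadic increments of x in [0, 1), mirroring the multi-scale
# decomposition behind the entropy integral
x = 0.6875
bits, r = [], x
for _ in range(8):
    b = int(r >= 0.5)  # the increment at this scale
    bits.append(b)
    r = 2.0 * r - b
reconstructed = sum(b / 2.0**i for i, b in enumerate(bits, start=1))
print(bits, reconstructed)  # [1, 0, 1, 1, 0, 0, 0, 0] 0.6875
```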
Seems this works with improper covers. I should check carefully and include it in the statement or a remark.
citation for dudley? to dudley lol?
Proof. We’ll do the discrete sum first. The integral follows by relating an integral to its Riemann sum.
Let N\geq 1 be arbitrary.
For i\in \{1,\ldots,N\}, define scales \alpha_i := \sqrt{n} 2^{1-i}.
Define cover V_1 := \{0\}; since U\subseteq [-1,+1]^n, this is a minimal cover at scale \sqrt{n} = \alpha_1.
Let V_i for i\in \{2,\ldots,N\} denote any minimal cover at scale \alpha_i, meaning |V_i| = \mathcal{N}(U,\alpha_i,\|\cdot\|_2).
Since U\ni a = (a - V_N(a)) + \sum_{i=1}^{N-1} \left({ V_{i+1}(a) - V_i(a) }\right) + V_1(a), \begin{aligned} &\textrm{URad}(U) \\ &= \mathop{\mathbb{E}}\sup_{a\in U}\left\langle \epsilon, a \right \rangle \\ &= \mathop{\mathbb{E}}\sup_{a\in U}\left({ \left\langle \epsilon, a-V_N(a) \right \rangle + \sum_{i=1}^{N-1} \left\langle \epsilon, V_{i+1}(a) - V_i(a) \right \rangle + \left\langle \epsilon, V_1(a) \right \rangle }\right) \\ &\leq \mathop{\mathbb{E}}\sup_{a\in U}\left\langle \epsilon, a-V_N(a) \right \rangle \\ &\qquad + \sum_{i=1}^{N-1} \mathop{\mathbb{E}}\sup_{a\in U}\left\langle \epsilon, V_{i+1}(a)-V_i(a) \right \rangle \\ &\qquad + \mathop{\mathbb{E}}\sup_{a\in U} \left\langle \epsilon, V_1(a) \right \rangle . \end{aligned} Let’s now control these terms separately.
The first and last terms are easy: \begin{aligned} \mathop{\mathbb{E}}\sup_{a\in U}\left\langle \epsilon, V_1(a) \right \rangle &= \mathop{\mathbb{E}}\left\langle \epsilon, 0 \right \rangle = 0, \\ \mathop{\mathbb{E}}\sup_{a\in U}\left\langle \epsilon, a-V_N(a) \right \rangle &\leq \mathop{\mathbb{E}}\sup_{a\in U}\|\epsilon\|\|a-V_N(a)\| \leq \sqrt{n} \alpha_N = n 2^{1-N}. \end{aligned} For the middle term, define increment class W_i := \{ V_{i+1}(a) - V_i(a) : a\in U\}, whereby |W_i| \leq |V_{i+1}|\cdot |V_i| \leq |V_{i+1}|^2, and \begin{aligned} &\mathop{\mathbb{E}}\sup_{a \in U} \left\langle \epsilon, V_{i+1}(a) - V_i(a) \right \rangle = \textrm{URad}(W_i) \\ &\leq \left({\sup_{w\in W_i}\|w\|_2}\right) \sqrt{2 \ln |W_i|} \leq \left({\sup_{w\in W_i}\|w\|_2}\right) \sqrt{4 \ln |V_{i+1}|}, \\ \sup_{w\in W_i} \|w\| &\leq \sup_{a\in U} \left({ \|a - V_{i+1}(a)\| + \|a - V_i(a)\| }\right) \leq \alpha_{i+1} + \alpha_i = 3 \alpha_{i+1}. \end{aligned} Combining these bounds, \textrm{URad}(U) \leq n 2^{1-N} + 0 + \sum_{i=1}^{N-1} 6 \sqrt{n} 2^{-i}\sqrt{\ln \mathcal{N}(U,2^{-i}\sqrt{n},\|\cdot\|_2)} . N\geq 1 was arbitrary, so applying \inf_{N\geq 1} gives the first bound.
Since \ln \mathcal{N}(U,\beta,\|\cdot\|_2) is nonincreasing in \beta, the integral upper bounds the Riemann sum: \begin{aligned} \textrm{URad}(U) &\leq n2^{1-N} + 6\sum_{i=1}^{N-1} \alpha_{i+1} \sqrt{\ln \mathcal{N}(U,\alpha_{i+1}, \|\cdot\|)} \\ &= n2^{1-N} + 12\sum_{i=1}^{N-1} \left({ \alpha_{i+1} - \alpha_{i+2} }\right) \sqrt{\ln \mathcal{N}(U,\alpha_{i+1}, \|\cdot\|)} \\ &\leq \sqrt{n} \alpha_N + 12 \int_{\alpha_{N+1}}^{\alpha_2} \sqrt{\ln \mathcal{N}(U,\beta, \|\cdot\|)} {\text{d}}\beta. \end{aligned} To finish, pick \alpha > 0 and N with \alpha_{N+1} \geq \alpha > \alpha_{N+2} = \frac {\alpha_{N+1}}{2} = \frac {\alpha_{N}}{4}, whereby \begin{aligned} \textrm{URad}(U) &\leq \sqrt{n} \alpha_N + 12 \int_{\alpha_{N+1}}^{\alpha_2} \sqrt{\ln \mathcal{N}(U,\beta, \|\cdot\|)} {\text{d}}\beta \\ &\leq 4 \sqrt{n} \alpha + 12 \int_{\alpha}^{\sqrt{n}/2} \sqrt{\ln \mathcal{N}(U,\beta, \|\cdot\|)} {\text{d}}\beta. \end{aligned}
Tightness of Dudley: Sudakov’s lower bound says there exists a universal constant c > 0 with \textrm{URad}(U) \geq \frac {c}{\ln(n)} \sup_{\alpha > 0} \alpha \sqrt{\ln\mathcal{N}(U,\alpha,\|\cdot\|)}, which implies \textrm{URad}(U) = \widetilde\Theta\left({\textup{Dudley entropy integral}}\right). needs references, detail, explanation.
Taking the notion of increments to heart and generalizing the proof gives the concept of chaining. One key question there is tightening the relationship with Rademacher complexity (shrinking constants and log factors in the above bound).
Another term for covering is “metric entropy.”
Recall once again that we drop the normalization 1/n from \textrm{URad} and the choice of norm when covering.
We will give two generalization bounds.
The first will be for arbitrary Lipschitz functions, and will be horrifically loose (exponential in dimension).
The second will be, to my knowledge, the tightest known bound for ReLU networks.
This bound is intended as a point of contrast with our deep network generalization bounds.
Exponential in dimension!
Revisiting the “point of contrast” comment above, our deep network generalization bounds are polynomial and not exponential in dimension; consequently, we really are doing much better than simply treating the networks as arbitrary Lipschitz functions.
Proof.
Suppose B > \epsilon; otherwise we can use the trivial cover \{ x\mapsto 0\}.
Subdivide [-R-\epsilon, +R+\epsilon]^d into \left({\frac{4(R+\epsilon)\rho}{\epsilon}}\right)^d cubes of side length \frac{\epsilon}{2\rho}; call this U.
Subdivide [-B, +B] into intervals of length \epsilon, thus 2B/\epsilon elements; call this V.
Our candidate cover \mathcal{G} is the set of all piecewise constant maps from [-R-\epsilon, +R+\epsilon]^d to [-B, +B] discretized according to U and V, meaning |\mathcal{G}| \leq \left\lceil\frac {2B}{\epsilon}\right\rceil^{\left\lceil \frac{4(R+\epsilon)\rho}{\epsilon}\right\rceil^d}.
To show this is an improper cover, given f\in\mathcal{F}, choose g\in\mathcal{G} by proceeding over each C \in U, and assigning g_{|C}\in V to be the closest element to f(x_C), where x_C is the midpoint of C. Then \begin{aligned} \|f-g\|_{{\textrm{u}}} &= \sup_{C\in U} \sup_{x\in C} |f(x) - g(x)| \\ &\leq \sup_{C\in U} \sup_{x\in C} \left({ |f(x) - f(x_C)| + |f(x_C) - g(x)| }\right) \\ &\leq \sup_{C\in U} \sup_{x\in C} \left({ \rho \|x- x_C\|_\infty + \frac{\epsilon}{2} }\right) \\ &\leq \sup_{C\in U} \sup_{x\in C} \left({ \rho( \epsilon/ (4\rho) ) + \frac{\epsilon}{2} }\right) \leq \epsilon \end{aligned}
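This construction is easy to sanity-check numerically. Here is a minimal hypothetical sketch in d = 1 (all names invented): subdivide the domain into cells of side \epsilon/(2\rho), snap each midpoint value to an \epsilon-grid, and verify the uniform error is at most \epsilon.

```python
import numpy as np

# Hypothetical numeric check of the piecewise-constant cover in d = 1:
# cells of side eps/(2*rho), midpoint values snapped to an eps-grid.
rng = np.random.default_rng(0)
R, B, rho, eps = 1.0, 1.0, 2.0, 0.1

def f(x):  # a rho-Lipschitz, B-bounded test function
    return np.clip(np.sin(rho * x), -B, B)

cell = eps / (2 * rho)
edges = np.arange(-R, R + cell, cell)
mids = edges[:-1] + cell / 2
g_vals = eps * np.round(f(mids) / eps)  # snap midpoint values to the eps-grid

x = rng.uniform(-R, R - 1e-9, size=10000)
idx = np.minimum(((x + R) / cell).astype(int), len(mids) - 1)
err = np.max(np.abs(f(x) - g_vals[idx]))
assert err <= eps  # in fact at most 3*eps/4: eps/4 from Lipschitz, eps/2 from snapping
```

The realized error is at most \rho \cdot \epsilon/(4\rho) + \epsilon/2 = 3\epsilon/4, matching the chain of inequalities above.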
hmm the proof used uniform norm… is it defined?
Proof uses \|\sigma(M) - \sigma(M')\|_{\textrm{F}}\leq \|\sigma\|_{\textrm{Lip}}\cdot\|M-M'\|_{\textrm{F}}; in particular, it allows multi-variate gates like max-pooling! See (P. Bartlett, Foster, and Telgarsky 2017) for \|\sigma_i\|_{\textrm{Lip}} estimates.
This proof can be adjusted to handle “distance to initialization”; see (P. Bartlett, Foster, and Telgarsky 2017) and the notion “reference matrices.”
Let’s compare to our best “layer peeling” proof from before, which had \prod_i \|W_i\|_{\textrm{F}}\lesssim m^{L/2} \prod_i \|W_i\|_2. That proof assumed \rho_i = 1, so the comparison boils down to m^{L/2} \left({ \prod_i \|W_i\|_2 }\right) \quad\textup{vs.}\quad \left[{ \sum_i \left({\frac{\|W_i^{\scriptscriptstyle\mathsf{T}}\|_{2,1}^{2/3}}{\|W_i\|_2^{2/3}}}\right)}\right]^{3/2} \left({ \prod_i \|W_i\|_2 }\right), where L \leq \sum_i \left({\frac{\|W_i^{\scriptscriptstyle\mathsf{T}}\|_{2,1}^{2/3}}{\|W_i\|_2^{2/3}}}\right) \leq Lm^{2/3}. So the bound is better but still leaves a lot to be desired, and is loose in practice.
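The sandwich L \leq \sum_i (\|W_i^{\scriptscriptstyle\mathsf{T}}\|_{2,1}/\|W_i\|_2)^{2/3} \leq L m^{2/3} follows from \|W\|_2 \leq \|W\|_{\textrm{F}}\leq \|W^{\scriptscriptstyle\mathsf{T}}\|_{2,1} \leq m \|W\|_2, and can be checked numerically; a hypothetical sketch with random Gaussian layers:

```python
import numpy as np

# Hypothetical check of L <= sum_i (||W_i^T||_{2,1} / ||W_i||_2)^(2/3) <= L m^(2/3).
rng = np.random.default_rng(6)
L, m = 4, 32
Ws = [rng.normal(size=(m, m)) for _ in range(L)]
terms = []
for W in Ws:
    n21 = np.linalg.norm(W, axis=1).sum()  # ||W^T||_{2,1}: sum of row 2-norms
    spec = np.linalg.norm(W, 2)            # spectral norm
    terms.append((n21 / spec) ** (2 / 3))
total = sum(terms)
assert L <= total <= L * m ** (2 / 3)
```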
It is not clear how to prove exactly this bound with Rademacher peeling, which is a little eerie (independent of whether this bound is good or not).
The proof, as with Rademacher peeling proofs, is an induction on layers, similarly one which does not “coordinate” the behavior of the layers; this is one source of looseness.
The first step of the proof is a covering number for individual layers.
Proof. Let W\in\mathbb{R}^{m\times d} be given with \|W^{\scriptscriptstyle\mathsf{T}}\|_{2,1} \leq r. Define s_{ij} := W_{ij}/|W_{ij}| when W_{ij}\neq 0 and s_{ij} := 0 otherwise, and note \begin{aligned} WX^{\scriptscriptstyle\mathsf{T}} = \sum_{i,j} \mathbf{e}_i\mathbf{e}_i^{\scriptscriptstyle\mathsf{T}}W \mathbf{e}_j\mathbf{e}_j^{\scriptscriptstyle\mathsf{T}}X^{\scriptscriptstyle\mathsf{T}} = \sum_{i,j} \mathbf{e}_i W_{ij} (X\mathbf{e}_j)^{\scriptscriptstyle\mathsf{T}} = \sum_{i,j} \underbrace{\frac{|W_{ij}| \|X\mathbf{e}_j\|_2}{r\|X\|_{\textrm{F}}}}_{=:q_{ij}} \underbrace{\frac{r \|X\|_{\textrm{F}}s_{ij}\mathbf{e}_i (X\mathbf{e}_j)^{\scriptscriptstyle\mathsf{T}}}{\|X\mathbf{e}_j\|}}_{=:U_{ij}}. \end{aligned} Note by Cauchy-Schwarz that \sum_{i,j} q_{ij} \leq \frac {1}{r\|X\|_{\textrm{F}}} \sum_i \sqrt{\sum_j W_{ij}^2}\|X\|_{\textrm{F}} = \frac {\|W^{\scriptscriptstyle\mathsf{T}}\|_{2,1} \|X\|_{\textrm{F}}}{r\|X\|_{\textrm{F}}} \leq 1, potentially with strict inequality; thus q need not be a probability vector, which we will need later. To remedy this, construct a probability vector p from q by adding in, with equal weight, some U_{ij} and its negation, so that the above summation form of WX^{\scriptscriptstyle\mathsf{T}} goes through equally with p as with q.
Now define IID random variables (V_1,\ldots,V_k), where \begin{aligned} \mathop{\textrm{Pr}}[V_l = U_{ij}] &= p_{ij}, \\ \mathop{\mathbb{E}}V_l &= \sum_{i,j} p_{ij} U_{ij} = \sum_{i,j} q_{ij} U_{ij} = WX^{\scriptscriptstyle\mathsf{T}}, \\ \|U_{ij}\| &= \left\|{\frac {s_{ij} \mathbf{e}_i (X\mathbf{e}_j)^{\scriptscriptstyle\mathsf{T}}}{\|X\mathbf{e}_j\|_2} }\right\|_{\textrm{F}}\cdot r \|X\|_{\textrm{F}} = |s_{ij}|\cdot\|\mathbf{e}_i\|_2 \cdot\left\|{\frac{X\mathbf{e}_j}{\|X\mathbf{e}_j\|_2} }\right\|_2 \cdot r \|X\|_{\textrm{F}} = r \|X\|_{\textrm{F}}, \\ \mathop{\mathbb{E}}\|V_l\|^2 &= \sum_{i,j} p_{ij} \|U_{ij}\|^2 \leq \sum_{ij} p_{ij} r^2 \|X\|_{\textrm{F}}^2 = r^2 \|X\|_{\textrm{F}}^2. \end{aligned} By +Lemma 3.1 (Maurey (Pisier 1980)), there exist ({\hat V}_1,\ldots,{\hat V}_k)\in S^k with \left\|{ WX^{\scriptscriptstyle\mathsf{T}}- \frac 1 k \sum_l {\hat V}_l }\right\|^2 \leq \mathop{\mathbb{E}}\left\|{ \mathop{\mathbb{E}}V_1 - \frac 1 k \sum_l V_l }\right\|^2 \leq \frac 1 k \mathop{\mathbb{E}}\|V_1\|^2 \leq \frac {r^2 \|X\|_{\textrm{F}}^2} k. Furthermore, the matrices {\hat V}_l have the form \frac 1 k \sum_l {\hat V}_l = \frac 1 k \sum_l \frac {s_l\mathbf{e}_{i_l} (X\mathbf{e}_{j_l})^{\scriptscriptstyle\mathsf{T}}}{\|X\mathbf{e}_{j_l}\|} = \left[{ \frac 1 k \sum_l \frac {s_l\mathbf{e}_{i_l} \mathbf{e}_{j_l}^{\scriptscriptstyle\mathsf{T}}}{\|X\mathbf{e}_{j_l}\|} }\right] X^{\scriptscriptstyle\mathsf{T}}; by this form, there are at most (2nd)^k choices for ({\hat V}_1,\ldots,{\hat V}_k).
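The Maurey sampling step can be simulated directly. The following hypothetical sketch pads q into a probability vector with a zero matrix (a slight variant of the ± trick in the text, which preserves the expectation), then checks the 1/k decay of the average squared error:

```python
import numpy as np

# Hypothetical simulation of Maurey sampling for W X^T.
rng = np.random.default_rng(1)
m, d, n, k, trials = 5, 4, 20, 500, 50
W = rng.normal(size=(m, d))
X = rng.normal(size=(n, d))

r = np.linalg.norm(W, axis=1).sum()   # ||W^T||_{2,1}
Xf = np.linalg.norm(X)                # ||X||_F
col = np.linalg.norm(X, axis=0)       # ||X e_j||_2 per column j

q = (np.abs(W) * col) / (r * Xf)      # q_ij from the proof; sums to <= 1
U = np.zeros((m * d + 1, m, n))       # rank-one U_ij; last slot: zero-matrix padding
for i in range(m):
    for j in range(d):
        if W[i, j] != 0 and col[j] > 0:
            U[i * d + j] = r * Xf * np.sign(W[i, j]) * np.outer(np.eye(m)[i], X[:, j]) / col[j]
p = np.append(q.ravel(), 1 - q.sum())  # pad into a probability vector (zero outcome)

target = W @ X.T
errs = []
for _ in range(trials):
    idx = rng.choice(len(p), size=k, p=p)
    errs.append(np.linalg.norm(target - U[idx].mean(axis=0)) ** 2)
# Maurey: expected squared error is at most r^2 ||X||_F^2 / k.
assert np.mean(errs) <= 2 * (r * Xf) ** 2 / k
```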
Proof. Let X_i denote the output of layer i of the network, using weights (W_i,\ldots,W_1), meaning X_0 := X \qquad\text{and}\qquad X_i := \sigma_i(X_{i-1} W_i^{\scriptscriptstyle\mathsf{T}}).
The proof recursively constructs cover elements {\hat X}_i and weights {\hat W}_i for each layer with the following basic properties.
Define {\hat X}_0 := X_0, and {\hat X}_i := \Pi_{B_i} \sigma_i({\hat X}_{i-1} {\hat W}_i^{\scriptscriptstyle\mathsf{T}}), where B_i is the Frobenius-norm ball of radius \|X\|_{\textrm{F}}\prod_{j\leq i} \rho_j s_j.
Due to the projection \Pi_{B_i}, \|{\hat X}_i\|_{\textrm{F}}\leq \|X\|_{\textrm{F}}\prod_{j\leq i} \rho_j s_j. Similarly, using \sigma_i(0) = 0 and the Lipschitz property, \|X_i\|_{\textrm{F}}\leq \|X\|_{\textrm{F}}\prod_{j\leq i} \rho_j s_j.
Given {\hat X}_{i-1}, choose {\hat W}_i via +Lemma 16.1 so that \|{\hat X}_{i-1}W_i^{\scriptscriptstyle\mathsf{T}}- {\hat X}_{i-1}{\hat W}_i^{\scriptscriptstyle\mathsf{T}}\|_{\textrm{F}}\leq \epsilon_i, whereby the corresponding covering number \mathcal{N}_i for this layer satisfies \ln\mathcal{N}_i \leq \left\lceil \frac { \|{\hat X}_{i-1}\|_{\textrm{F}}^2 b_i^2}{\epsilon_i^2} \right\rceil \ln(2m^2) \leq \left\lceil \frac { \|X\|_{\textrm{F}}^2 b_i^2 \prod_{j<i} \rho_j^2 s_j^2}{\epsilon_i^2} \right\rceil \ln(2m^2).
Since each cover element {\hat X}_i depends on the full tuple ({\hat W}_i,\ldots,{\hat W}_1), the final cover is the product of the individual covers (and not their union), and the final cover log cardinality is upper bounded by \ln \prod_{i=1}^L \mathcal{N}_i \leq \sum_{i=1}^L \left\lceil \frac { \|X\|_{\textrm{F}}^2 b_i^2\prod_{j<i}\rho_j^2 s_j^2}{\epsilon_i^2} \right\rceil \ln(2m^2).
It remains to prove, by induction, an error guarantee \|X_i - {\hat X}_i\|_{\textrm{F}} \leq \sum_{j=1}^{i} \rho_j \epsilon_j \prod_{k=j+1}^{i} \rho_k s_k.
The base case \|X_0-{\hat X}_0\|_{\textrm{F}}= 0 holds directly. For the inductive step, by the above ingredients and the triangle inequality, \begin{aligned} \| X_i - {\hat X}_i \|_{\textrm{F}} &\leq \rho_i \|X_{i-1}W_i^{\scriptscriptstyle\mathsf{T}}- {\hat X}_{i-1}{\hat W}_i^{\scriptscriptstyle\mathsf{T}}\|_{\textrm{F}} \\ &\leq \rho_i \| X_{i-1}W_{i}^{\scriptscriptstyle\mathsf{T}}- {\hat X}_{i-1}W_i^{\scriptscriptstyle\mathsf{T}}\|_{\textrm{F}} + \rho_i \| {\hat X}_{i-1}W_i^{\scriptscriptstyle\mathsf{T}}- {\hat X}_{i-1}{\hat W}_i^{\scriptscriptstyle\mathsf{T}}\|_{\textrm{F}} \\ &\leq \rho_i s_i \| X_{i-1} - {\hat X}_{i-1}\|_{\textrm{F}} + \rho_i \epsilon_i \\ &\leq \rho_i s_i \left[{ \sum_{j=1}^{i-1} \rho_j \epsilon_j \prod_{k=j+1}^{i-1} \rho_k s_k }\right] + \rho_i \epsilon_i \\ &= \left[{ \sum_{j=1}^{i-1} \rho_j \epsilon_j \prod_{k=j+1}^{i} \rho_k s_k }\right] + \rho_i \epsilon_i \\ &= \sum_{j=1}^{i} \rho_j \epsilon_j \prod_{k=j+1}^{i} \rho_k s_k. \end{aligned}
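The induction can be sanity-checked by perturbing the weights of a small random ReLU network and comparing the realized error to the bound; a hypothetical sketch (with \rho_i = 1, skipping the projection, which can only shrink the error since X_i \in B_i):

```python
import numpy as np

# Hypothetical check of ||X_i - hatX_i||_F <= sum_j eps_j prod_{k>j} s_k
# for a ReLU net (rho_i = 1), with hatW_i = W_i + noise and
# eps_i := ||hatX_{i-1} W_i^T - hatX_{i-1} hatW_i^T||_F.
rng = np.random.default_rng(2)
depth, width, n = 3, 8, 16
X = rng.normal(size=(n, width))
Ws = [rng.normal(size=(width, width)) / np.sqrt(width) for _ in range(depth)]
Whats = [W + 0.01 * rng.normal(size=W.shape) for W in Ws]

relu = lambda M: np.maximum(M, 0)
Xi, Xhat = X, X
eps, s = [], [np.linalg.norm(W, 2) for W in Ws]  # s_i: spectral norms
for W, What in zip(Ws, Whats):
    eps.append(np.linalg.norm(Xhat @ W.T - Xhat @ What.T))
    Xi, Xhat = relu(Xi @ W.T), relu(Xhat @ What.T)

bound = sum(e * np.prod(s[j + 1:]) for j, e in enumerate(eps))
err = np.linalg.norm(Xi - Xhat)
assert err <= bound + 1e-9
```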
Proof of +Theorem 16.2 (P. Bartlett, Foster, and Telgarsky (2017)). By solving a Lagrangian (minimize cover size subject to total error \leq \epsilon), choose \epsilon_i := \frac {\alpha_i \epsilon}{\rho_i \prod_{j>i} \rho_j s_j}, \qquad \alpha_i := \frac 1 \beta \left({\frac{b_i}{s_i}}\right)^{2/3}, \qquad \beta := \sum_{i=1}^L \left({\frac{b_i}{s_i}}\right)^{2/3}. Invoking the induction lemma with these choices, the resulting cover error is \sum_{i=1}^L \epsilon_i \rho_i \prod_{j>i} \rho_j s_j = \epsilon\sum_{i=1}^L \alpha_i = \epsilon, and the main term of the cardinality (ignoring \ln(2m^2)) satisfies \begin{aligned} & \sum_{i=1}^L \frac { \|X\|_{\textrm{F}}^2 b_i^2 \prod_{j<i} \rho_j^2 s_j^2}{\epsilon_i^2} = \frac {\|X\|_{\textrm{F}}^2}{\epsilon^2} \sum_{i=1}^L \frac{b_i^2\prod_{j=1}^L \rho_j^2 s_j^2}{\alpha_i^2 s_i^2} \\ &= \frac {\|X\|_{\textrm{F}}^2\prod_{j=1}^L \rho_j^2 s_j^2}{\epsilon^2} \sum_{i=1}^L \frac{\beta^2 b_i^{2/3}}{s_i^{2/3}} = \frac {\|X\|_{\textrm{F}}^2\prod_{j=1}^L \rho_j^2 s_j^2}{\epsilon^2} \left({ \sum_{i=1}^L \left({\frac{b_i}{s_i}}\right)^{2/3} }\right)^3. \end{aligned} I should include the Lagrangian explicitly and also explicit Dudley.
ok if VC dim is one section, covering numbers should be as well?
remainder is copy/pasted from fall2018, was not taught in fall2019.
should include in preamble various bounds not taught, and a comment that VC dim proofs are interesting and reveal structure not captured above.
VC dimension is an ancient generalization technique; essentially the quantity itself appears in the work of Kolmogorov, was later rediscovered a few times, and is named after Vapnik and Chervonenkis, who used it for generalization.
To prove generalization, we will upper bound Rademacher complexity with VC dimension; classical VC dimension generalization proofs include Rademacher averages.
There is some huge ongoing battle over whether VC dimension is a good measure or not. I think the proofs are interesting and are sensitive to interesting properties of deep networks in ways not captured by many of the bounds we spent time on. Anyway, a discussion for another time…
First, some definitions: the zero-one/classification risk/error is \mathcal{R}_{\textrm{z}}(\textrm{sgn}(f)) = \mathop{\textrm{Pr}}[\textrm{sgn}(f(X)) \neq Y], \ {} \widehat{\mathcal{R}}_{\textrm{z}}(\textrm{sgn}(f)) = \frac 1 n \sum_{i=1}^n \mathbf{1}[\textrm{sgn}(f(x_i))\neq y_i]. The earlier Rademacher bound will now have \textrm{URad}\left({\left\{{ (x,y)\mapsto \mathbf{1}[\textrm{sgn}(f(x))\neq y] : f\in\mathcal{F}}\right\}_{|S}}\right). This is at most 2^n; we’ll reduce it to a combinatorial quantity: \begin{aligned} \textrm{sgn}(U) &:= \left\{{ (\textrm{sgn}(u_1),\ldots,\textrm{sgn}(u_n)) : u\in U }\right\}, \\ \textrm{Sh}(\mathcal{F}_{|S}) &:= \left|{ \textrm{sgn}(\mathcal{F}_{|S}) }\right|, \\ \textrm{Sh}(\mathcal{F}; n) &:= \sup_{S : |S|\leq n} \left|{ \textrm{sgn}(\mathcal{F}_{|S}) }\right|, \\ \textrm{VC}(\mathcal{F}) &:= \sup\{ i \in \mathbb{Z}_{\geq 0} : \textrm{Sh}(\mathcal{F};i) = 2^i \}. \end{aligned}
\textrm{Sh} is “shatter coefficient,” \textrm{VC} is “VC dimension.”
Both quantities are criticized as being too tied to their worst case; bounds here depend on (empirical quantity!) \textrm{URad}(\textrm{sgn}(\mathcal{F}_{|S})), which can be better, but throws out the labels.
Say something like “need \ln \textrm{Sh}(\mathcal{F}_{|S}) = o(n) in order to learn”?
Minimizing \widehat{\mathcal{R}}_{\textrm{z}} is NP-hard even in seemingly trivial cases, but those cases require noise, and neural networks can often get \widehat{\mathcal{R}}_{\textrm{z}}(\textrm{sgn}(f)) = 0.
\textrm{VC}(\mathcal{F})<\infty suffices; many considered this a conceptual breakthrough, namely “learning is possible!”
The quantities (\textrm{VC}, \textrm{Sh}) appeared in prior work (not by V-C). Symmetrization apparently too, though I haven’t dug this up.
First step of proof: pull out the zero-one loss.
Proof. For each i, define \ell_i(z) := \max\left\{{0,\min\left\{{1, \frac{1-y_i(2z-1)}{2} }\right\}}\right\}, which is 1-Lipschitz, and satisfies \ell_i(\textrm{sgn}(f(x_i))) = \mathbf{1}[ \textrm{sgn}(f(x_i)) \neq y_i ]. (Indeed, it is the linear interpolation.) Then \begin{aligned} &\textrm{URad}(\left\{{(x,y)\mapsto \mathbf{1}[ \textrm{sgn}(f(x))\neq y] : f\in\mathcal{F}}\right\}_{|S}) \\ &=\textrm{URad}(\left\{{(\ell_1(\textrm{sgn}(f(x_1))),\ldots, \ell_n(\textrm{sgn}(f(x_n)))):f\in\mathcal{F}}\right\}_{|S}) \\ &=\textrm{URad}(\ell \circ \textrm{sgn}(\mathcal{F})_{|S}) \\ &\leq \textrm{URad}(\textrm{sgn}(\mathcal{F})_{|S}). \end{aligned}
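A quick hypothetical check that each \ell_i really is the 1-Lipschitz linear interpolation of the zero-one loss:

```python
import numpy as np

# Hypothetical check of ell_i(z) = max(0, min(1, (1 - y*(2z-1))/2)):
# it matches 1[z != y] at z in {-1, +1}, and is 1-Lipschitz in z.
def ell(z, y):
    return np.clip((1 - y * (2 * z - 1)) / 2, 0.0, 1.0)

for y in (-1, 1):
    for z in (-1, 1):
        assert ell(z, y) == (0.0 if z == y else 1.0)

zs = np.linspace(-3, 3, 601)
for y in (-1, 1):
    slopes = np.abs(np.diff(ell(zs, y))) / np.diff(zs)
    assert slopes.max() <= 1.0 + 1e-9  # 1-Lipschitz
```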
is that using the fancier per-coordinate vector-wise peeling again?
Plugging this into our Rademacher bound: w/ pr \geq 1-\delta, \forall f\in\mathcal{F}, \mathcal{R}_{\textrm{z}}(\textrm{sgn}(f)) \leq \widehat{\mathcal{R}}_{\textrm{z}}(\textrm{sgn}(f)) + \frac 2 n \textrm{URad}(\textrm{sgn}(\mathcal{F})_{|S}) + 3 \sqrt{\frac{\ln(2/\delta)}{2n}}.
The next step is to apply Massart’s finite lemma, giving \textrm{URad}(\textrm{sgn}(\mathcal{F}_{|S})) \leq \sqrt{ 2n \ln \textrm{Sh}(\mathcal{F}_{|S})}.
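Massart’s lemma in this unnormalized form can be checked by Monte Carlo on a finite class of sign vectors (a hypothetical sketch, noting each v \in \{-1,+1\}^n has \|v\|_2 = \sqrt{n}):

```python
import numpy as np

# Hypothetical Monte-Carlo check of URad(V) = E sup_{v in V} <eps, v> <= sqrt(2 n ln|V|).
rng = np.random.default_rng(3)
n, N, trials = 30, 16, 4000
V = rng.choice([-1.0, 1.0], size=(N, n))         # a finite class of sign vectors
eps = rng.choice([-1.0, 1.0], size=(trials, n))  # Rademacher vectors
urad_est = np.mean(np.max(eps @ V.T, axis=1))
massart = np.sqrt(2 * n * np.log(N))
assert urad_est <= massart
```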
One last lemma remains for the proof.
lol why mention warren. should be explicit and not passive-aggressive.
(Proof. Omitted. Exists in many standard texts.)
okay fine but i should give a reference, and eventually my own clean proof.
By Sauer-Shelah, \textrm{Sh}(\mathcal{F};n) \leq n^{d+1} + 1. Anthony-Bartlett chapter 3 gives an exact equality; it only changes constants of \ln\textrm{Sh}(\mathcal{F};n).
Let’s compare to Rademacher: \begin{aligned} \textrm{URad}(\textrm{sgn}(\mathcal{F}_{|S})) & \leq \sqrt{ 2n d \ln(n+1)}, \\ \textrm{URad}(\{ x\mapsto \left\langle w, x \right \rangle : \|w\|\leq R\}_{|S}) &\leq R \|X_S\|_F, \end{aligned} where \|X_S\|_F^2 = \sum_{x\in S} \|x\|_2^2 \leq n \cdot d\cdot \max_{i,j} x_{i,j}^2. One is scale-sensitive (and suggests regularization schemes), the other is scale-insensitive.
Proof. First let’s do the lower bound \textrm{VC}(\mathcal{F})\geq d+1.
Suffices to show \exists S:=\{x_1,\ldots,x_{d+1}\} with \textrm{Sh}(\mathcal{F}_{|S}) = 2^{d+1}.
Choose S:= \{\mathbf{e}_1,\ldots,\mathbf{e}_d, (0,\dots,0)\}.
Given any P\subseteq S, define (a,b) as a_i := 2\cdot \mathbf{1}[\mathbf{e}_i \in P] - 1, \qquad b := \frac 1 2 - \mathbf{1}[0 \in P]. Then \begin{aligned} \textrm{sgn}(\left\langle a, \mathbf{e}_i \right \rangle - b) &= \textrm{sgn}( 2 \mathbf{1}[\mathbf{e}_i \in P] - 1 - b) = 2 \mathbf{1}[\mathbf{e}_i \in P] - 1, \\ \textrm{sgn}(\left\langle a, 0 \right \rangle - b) &= \textrm{sgn}( 2 \mathbf{1}[0 \in P] - 1/2) = 2 \mathbf{1}[0 \in P] - 1, \end{aligned} meaning this affine classifier labels S according to P, which was an arbitrary subset.
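The construction can be verified exhaustively for small d (hypothetical sketch):

```python
import numpy as np
from itertools import product

# Hypothetical check of the shattering construction: for S = {e_1,...,e_d, 0}
# and every subset P of S, the affine classifier (a, b) from the text labels
# exactly P positive.
d = 4
S = list(np.eye(d)) + [np.zeros(d)]
for labels in product([False, True], repeat=d + 1):  # membership pattern of P
    a = np.array([2.0 * labels[i] - 1 for i in range(d)])
    b = 0.5 - float(labels[d])                       # labels[d]: is 0 in P?
    for x, in_P in zip(S, labels):
        assert (a @ x - b > 0) == in_P
```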
Now let’s do the upper bound \textrm{VC}(\mathcal{F}) < d+2.
Consider any S\subseteq \mathbb{R}^d with |S| = d+2.
By Radon’s Lemma (proved next), there exists a partition of S into nonempty (P,N) with \textrm{conv}(P)\cap \textrm{conv}(N) \neq \emptyset.
Label P as positive and N as negative. Given any affine classifier, it cannot be correct on all of S (and thus \textrm{VC}(\mathcal{F}) < d+2): either it is incorrect on some of P, or else it is correct on all of P, whereby the (convex) positive halfspace contains \textrm{conv}(P) and hence a point of \textrm{conv}(N); but then N cannot lie entirely in the negative halfspace, so some x\in N is labeled positive.
needs ref
Proof. Let S = \{x_1,\ldots,x_{d+2}\} be given, and define \{u_1,\ldots,u_{d+1}\} as u_i := x_i - x_{d+2}, which must be linearly dependent:
There exist scalars (\alpha_1,\ldots,\alpha_{d+1}) and a j with \alpha_j := -1 so that \sum_i \alpha_i u_i = -u_j + \sum_{i\neq j} \alpha_i u_i = 0;
thus x_j - x_{d+2} = \sum_{\substack{i\neq j\\i<d+2}} \alpha_i (x_i-x_{d+2}) and 0 = \sum_{i<d+2} \alpha_i x_i - x_{d+2} \sum_{i < d+2}\alpha_i =: \sum_j \beta_j x_j, where \sum_j \beta_j = 0 and not all \beta_j are zero.
Set P:= \{i : \beta_i > 0\} and N := \{i : \beta_i \leq 0\}; neither set is empty.
Set \beta := \sum_{i\in P} \beta_i = -\sum_{i\in N} \beta_i > 0.
Since 0= \sum_i \beta_i x_i = \sum_{i\in P} \beta_i x_i + \sum_{i \in N} \beta_i x_i, then \frac 0 \beta = \sum_{i\in P} \frac{\beta_i}{\beta} x_i + \sum_{i \in N} \frac {\beta_i}{\beta} x_i and the point z:= \sum_{i\in P} \beta_i x_i / \beta = \sum_{i \in N} \beta_i x_i / (-\beta) satisfies z\in \textrm{conv}(P)\cap \textrm{conv}(N).
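The proof is constructive; here is a hypothetical sketch that computes \beta from a nullspace vector and checks that the two convex hulls share the point z:

```python
import numpy as np

# Hypothetical Radon partition: for d+2 points in R^d, find beta with
# sum(beta) = 0 and sum(beta_i x_i) = 0, split by sign, and verify both
# convex combinations give the same point z.
rng = np.random.default_rng(4)
d = 3
Xpts = rng.normal(size=(d + 2, d))

A = np.vstack([Xpts.T, np.ones(d + 2)])  # rows: coordinates, then affine constraint
_, _, Vt = np.linalg.svd(A)
beta = Vt[-1]                            # nullspace direction of the (d+1) x (d+2) system
P, N = beta > 0, beta <= 0
assert P.any() and N.any()
bsum = beta[P].sum()
zP = (beta[P] @ Xpts[P]) / bsum
zN = (beta[N] @ Xpts[N]) / (-bsum)
assert np.allclose(zP, zN)               # common point of conv(P) and conv(N)
```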
Generalizes Minsky-Papert “xor” construction.
Indeed, the first appearance I know of shattering/VC was in approximation theory, the papers of Warren and Shapiro, and perhaps it is somewhere in Kolmogorov’s old papers.
Consider iterating the previous construction, giving an “LTF network”: a neural network with activation z\mapsto \mathbf{1}[z\geq 0].
We’ll analyze this by studying not just the final outputs, but the behavior of all nodes.
another suggestion of definition in pandoc-numbering
Definition.
Given a sample S of size n and an LTF network with m nodes (in any topologically sorted order), define activation matrix A:=\textrm{Act}(S; W:=(a_1,\ldots,a_m)) where A_{ij} is the output of node j on input i, with fixed network weights W.
Let \textrm{Act}(S;\mathcal{F}) denote the set of activation matrices with architecture fixed and weights W varying.
Since the last column is the labeling, |\textrm{Act}(S;\mathcal{F})| \geq \textrm{Sh}(\mathcal{F}_{|S}).
\textrm{Act} seems a nice complexity measure, but it is hard to estimate given a single run of an algorithm (say, unlike a Lipschitz constant).
We’ll generalize \textrm{Act} to analyze ReLU networks.
Proof.
Topologically sort the nodes, and let (p_1,\ldots,p_m) denote their respective numbers of parameters (thus \sum_i p_i = p).
Proof will iteratively construct sets (U_1,\ldots,U_m) where U_i partitions the weight space of nodes j\leq i so that, within each partition cell, the activation matrix does not vary.
The proof will show, by induction, that |U_i| \leq (n+1)^{\sum_{j\leq i}p_j}. This completes the proof of the first claim, since \textrm{Sh}(\mathcal{F}_{|S}) \leq |\textrm{Act}(S;\mathcal{F})| \leq |U_m|.
For convenience, define U_0 = \{\emptyset\}, |U_0| = 1; the base case is thus |U_0| = 1 = (n+1)^0.
(Inductive step). Let j\geq 0 be given; the proof will now construct U_{j+1} by refining the partition U_j.
Fix any cell C of U_j; as the weights vary within C, the activations of nodes \leq j are fixed, thus the input to node j+1 is fixed for each x\in S.
Therefore, on this augmented set of n inputs (S with columns of activations appended to each example), there are (n+1)^{p_{j+1}} possible outputs via Sauer-Shelah and the VC dimension of affine classifiers with p_{j+1} inputs.
In other words, C can be refined into at most (n+1)^{p_{j+1}} cells; since C was arbitrary, |U_{j+1}| \leq |U_j| (n+1)^{p_{j+1}} \leq (n+1)^{\sum_{l \leq j+1}p_l}. This completes the induction and establishes the shattering number bound.
It remains to bound the VC dimension via this Shatter bound: \begin{aligned} &\textrm{VC}(\mathcal{F})<n \\ \Longleftarrow & \forall i \geq n \centerdot \textrm{Sh}(\mathcal{F};i)< 2^i \\ \Longleftarrow & \forall i \geq n \centerdot (i+1)^p < 2^i \\ \iff & \forall i \geq n \centerdot p\ln (i+1) < i \ln 2 \\ \iff & \forall i \geq n \centerdot p < \frac{i \ln (2)}{\ln(i+1)} \\ \Longleftarrow & p < \frac{n \ln (2)}{\ln(n+1)} \end{aligned} If n = 6p\ln(p), \begin{aligned} \frac{n \ln (2)}{\ln(n+1)} &\geq \frac {n\ln(2)}{\ln(2n)} = \frac {6p\ln(p)\ln(2)}{\ln 12 + \ln p + \ln\ln p} \\ &\geq \frac {6p \ln p \ln 2}{3\ln p} > p. \end{aligned}
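The final arithmetic step can be confirmed numerically over a range of p (hypothetical sketch):

```python
import math

# Hypothetical numeric check: with n = 6 p ln p, the condition
# p < n ln 2 / ln(n+1) holds, giving VC(F) < n.
for p in range(2, 2000):
    n = 6 * p * math.log(p)
    assert p < n * math.log(2) / math.log(n + 1)
```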
Had to handle \forall i\geq n since VC dimension is defined via \sup; one can define funky \mathcal{F} where \textrm{Sh} is not monotonic in n.
Lower bound is \Omega(p \ln m); see Anthony-Bartlett chapter 6 for a proof. This lower bound however is for a specific fixed architecture!
Other VC dimension bounds: ReLU networks have \widetilde{\mathcal{O}}(pL), sigmoid networks have \widetilde{\mathcal{O}}(p^2m^2), and there exists a convex-concave activation which is close to sigmoid but has VC dimension \infty.
Matching lower bounds exist for ReLU, not for sigmoid; but even the “matching” lower bounds are deceptive since they hold for a fixed architecture of a given number of parameters and layers.
Today’s ReLU networks will predict with x\mapsto A_L \sigma_{L-1}\left({A_{L-1} \cdots A_2\sigma_1(A_1x + b_1)+b_2\cdots + b_{L-1}}\right) + b_L, where A_i\in \mathbb{R}^{d_i\times d_{i-1}} and \sigma_i : \mathbb{R}^{d_i}\to\mathbb{R}^{d_i} applies the ReLU z\mapsto\max\{0,z\} coordinate-wise.
Convenient notation: collect data as rows of matrix X\in\mathbb{R}^{n\times d}, and define \begin{aligned} X_0 &:= X^\top & Z_0 &:= \textup{all $1$s matrix},\\ X_i &:= A_i(Z_{i-1} \odot X_{i-1}) + b_i 1_n^\top, & Z_i &:= \mathbf{1}[X_i \geq 0], \end{aligned} where (Z_1,\ldots,Z_L) are the activation matrices.
i should double check i have the tightest version? which is more sensitive to earlier layers? i should comment on that and the precise structure/meaning of the lower bounds?
As with LTF networks, the proof inductively constructs partitions of the weights up through layer i so that the activations are fixed across all weights in each partition cell.
Consider a fixed cell of the partition, whereby the activations are fixed zero-one matrices. As a function of the inputs, the ReLU network is now an affine function; as a function of the weights it is multilinear or rather a polynomial of degree L.
Consider again a fixed cell and some layer i; thus \sigma(X_i) = Z_i \odot X_i is a matrix of polynomials of degree i (in the weights). If we can upper bound the number of possible signs of A_{i+1} ( Z_i \odot X_i ) + b_{i+1} 1_n^\top, then we can refine our partition of weight space and recurse. For that we need a bound on sign patterns of polynomials, given next.
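One can get a feel for these counts empirically: fix a tiny sample and a one-layer ReLU architecture, draw many random weights, and count the distinct activation patterns that actually appear (a hypothetical sketch; the count stays far below the trivial 2^{nm}):

```python
import numpy as np

# Hypothetical experiment: count distinct activation-pattern matrices of a
# one-layer ReLU net (m neurons, no bias) on a fixed sample of n points,
# over many random weight draws.
rng = np.random.default_rng(5)
n, d, m, draws = 6, 2, 3, 20000
X = rng.normal(size=(n, d))
patterns = set()
for _ in range(draws):
    W = rng.normal(size=(m, d))
    Z = (X @ W.T >= 0)
    patterns.add(Z.tobytes())
assert 1 < len(patterns) < 2 ** (n * m)
```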
Proof (of ReLU VC bound).
We’ll inductively construct partitions (U_0,\ldots,U_L) where U_i partitions the parameters of layers j\leq i so that for any C\in U_i, the activations Z_j in layer j\leq i are fixed for all parameter choices within C (thus let Z_j(C) denote these fixed activations).
The proof will proceed by induction, showing |U_i| \leq (12nL)^{pi}.
Base case i=0: then U_0 = \{\emptyset\}, Z_0 is all ones, and |U_0| = 1 \leq (12nL)^{pi}.
(Inductive step).
Fix C\in U_i and (Z_1,\ldots,Z_i) = (Z_1(C),\ldots,Z_i(C)).
Note X_{i+1} = A_{i+1} (Z_i \odot X_i) + b_{i+1} 1_n^\top is polynomial (of degree i+1) in the parameters since (Z_1,\ldots,Z_i) are fixed.
Therefore \begin{aligned} \left|{\{ \mathbf{1}[ X_{i+1} \geq 0 ] : \textup{params} \in C \} }\right| &\leq \textrm{Sh}(\textup{$i+1$ deg poly; $m_{i+1}\cdot n$ functions}) \\ &\leq 2 \left({\frac{2 e n m_{i+1}}{\sum_{j \leq i+1} p_j}}\right)^{\sum_{j \leq i+1}p_j} \leq (12nL)^p. \end{aligned} [ Technical comment: to apply the earlier shatter bound for polynomials, we needed n \cdot m_{i+1} \geq \sum_j p_j; but if (even more simply) p \geq n m_{i+1}, we can only have \leq 2^{nm_{i+1}} \leq 2^{p} activation matrices anyway, so the bound still holds. ]
Therefore carving U_i into pieces according to Z_{i+1}=\mathbf{1}[X_{i+1}\geq 0] being fixed gives |U_{i+1}| \leq |U_i| (12nL)^p \leq (12nL)^{p(i+1)}. This completes the induction and upper bounds the number of possible activation patterns and also the shatter coefficient.
It remains to upper bound the VC dimension via the Shattering bound. As with LTF networks, \begin{aligned} \textrm{VC}(\mathcal{F}) < n & \Longleftarrow \forall i \geq n \centerdot \textrm{Sh}(\mathcal{F};i)< 2^i \\ & \Longleftarrow \forall i \geq n \centerdot (12iL)^{pL} < 2^i \\ & \iff \forall i \geq n \centerdot pL \ln(12iL) < i \ln 2 \\ & \iff \forall i \geq n \centerdot pL < \frac{ i \ln 2 } { \ln(12iL) } \\ & \Longleftarrow pL < \frac{ n \ln 2 } { \ln(12nL) } \\ \end{aligned}
If n = 6 pL \ln(pL) (say with pL \geq 72, so that \ln(72) \leq \ln(pL)), using \ln(pL^2)\leq 2\ln(pL) and \ln\ln(pL)\leq\ln(pL), \begin{aligned} \frac {n\ln 2}{\ln(12nL)} &= \frac {6pL \ln(pL)\ln(2)}{\ln(72pL^2\ln(pL))} = \frac {6pL \ln(pL)\ln(2)}{\ln(72) + \ln(pL^2) + \ln \ln(pL)} \\ &\geq \frac {6pL \ln(pL)\ln(2)}{\ln(pL) + 2\ln(pL) + \ln(pL)} = \frac {3pL \ln 2}{2} > pL. \end{aligned}
If ReLU is replaced with a degree r\geq 2 piecewise polynomial activation, have r^i-degree polynomial in each cell of partition, and shatter coefficient upper bound scales with L^2 not L. The lower bound in this case still has L not L^2; it’s not known where the looseness is.
Lower bounds are based on digit extraction, and for each pair (p,L) require a fixed architecture.