Deep learning theory lecture notes

Matus Telgarsky

2021-02-14 v0.0-1dabbd4b (pre-alpha)

Preface

Philosophy of these notes. Two key ideas determined what has been included so far.

  1. I aim to provide simplified proofs over what appears in the literature, ideally reducing difficult things to something that fits in a single lecture.

  2. I have primarily focused on a classical perspective of achieving a low test error for binary classification with IID data via standard (typically ReLU) feedforward networks.

Organization. Following the second point above, the classical view decomposes the test error into three parts.

  1. Approximation (starts in section 1): given a classification problem, there exists a deep network which achieves low error over the distribution.

  2. Optimization (starts in section 9): given a finite training set for a classification problem, there exist algorithms to find predictors with low training error and low complexity.

  3. Generalization (starts in section 16): the gap between training and testing error is small for low complexity networks.

Remark 0.1 (weaknesses of this “classical” approach) .

Formatting.

Feedback. I’m very eager to hear any and all feedback!

How to cite. Please consider using a format which makes the version clear:


@misc{mjt_dlt,
       author = {Matus Telgarsky},
        title = {Deep learning theory lecture notes},
 howpublished = {\url{https://mjt.cs.illinois.edu/dlt/}},
         year = {2021},
         note = {Version: 2021-02-14 v0.0-1dabbd4b (pre-alpha)},
}

Basic setup: feedforward networks and test error decomposition

Basic shallow network. Consider the mapping x \mapsto \sum_{j=1}^m a_j \sigma(w_j^{\scriptscriptstyle\mathsf{T}}x + b_j).

Basic deep network. Extending the matrix notation, given parameters w = (W_1, b_1, \ldots, W_L, b_L), f(x;w) := \sigma_L( W_L \sigma_{L-1}( \cdots W_2 \sigma_1(W_1 x + b_1) + b_2 \cdots )+ b_L ). (1)
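To make these two formalisms concrete, here is a minimal numpy sketch of the shallow map and of eq. 1; the widths, the Gaussian weights, and the helper names shallow and deep are illustrative choices, not part of the definitions.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def shallow(x, a, W, b):
        # x -> sum_j a_j sigma(w_j^T x + b_j), with the w_j as rows of W.
        return a @ relu(W @ x + b)

    def deep(x, params, sigmas):
        # params = [(W_1, b_1), ..., (W_L, b_L)], sigmas = [sigma_1, ..., sigma_L]; eq. 1.
        z = x
        for (W, b), sigma in zip(params, sigmas):
            z = sigma(W @ z + b)
        return z

    rng = np.random.default_rng(0)
    d, m = 3, 5
    x = rng.normal(size=d)
    print(shallow(x, rng.normal(size=m), rng.normal(size=(m, d)), rng.normal(size=m)))
    params = [(rng.normal(size=(m, d)), rng.normal(size=m)),
              (rng.normal(size=(1, m)), rng.normal(size=1))]
    print(deep(x, params, [relu, lambda z: z]))   # identity as the final sigma_L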

Remark 0.2.
Both the preceding formalisms (shallow and deep networks) have nodes organized as directed acyclic graphs. There are other formalisms, such as recurrent networks, which do not appear in these notes.

Basic supervised learning setup.

How do we define “do well on future data?”

“Do well on future data” becomes “minimize \mathcal{R}(f).” This can be split into separate terms: given a training algorithm’s choice \hat f in some class of functions/predictors \mathcal{F}, as well as some reference solution \bar f \in \mathcal{F}, \begin{aligned} \mathcal{R}(\hat f) &= \mathcal{R}(\hat f) - \widehat{\mathcal{R}}(\hat f) &\text{(generalization)} \\ &\quad + \widehat{\mathcal{R}}(\hat f) - \widehat{\mathcal{R}}(\bar f) &\text{(optimization)} \\ &\quad + \widehat{\mathcal{R}}(\bar f) -\mathcal{R}(\bar f) &\qquad\text{(concentration/generalization)} \\ &\quad + \mathcal{R}(\bar f); &\text{(approximation)} \end{aligned} this decomposition motivates the main approach in these notes.

Remark 0.3 (sensitivity to complexity) .
One departure from the purely classical setting is to have a sensitivity to complexity in each of the above terms.
Remark 0.4.
The two-argument form \ell(\hat y, y) is versatile. We will most often consider binary classification y\in\{\pm 1\}, where we always use the product \hat y y, even for the squared loss: \left[{\hat y - y}\right]^2 = \left[{y (y\hat y - 1) }\right]^2 = (y \hat y - 1)^2. This also means binary classification networks have one output node, not two.

Highlights

Here are a few of the shortened and/or extended proofs in these notes.

  1. Approximation.

  2. Optimization.

  3. Generalization.

Missing topics and references

Due to the above philosophy, many topics are currently omitted. Over time I hope to fill the gaps.

Here are some big omissions, hopefully resolved soon:

Further omitted topics, in a bit more detail, are discussed separately for approximation (section 1.1), optimization (section 9.1), and generalization (section 16.1).

Acknowledgements

Thanks to Ziwei Ji for extensive comments, discussion, and the proof of Theorem 15.3; thanks to Daniel Hsu for extensive comments and discussion; thanks to Francesco Orabona for detailed comments spanning many sections; thanks to Ohad Shamir for extensive comments on many topics; thanks to Karolina Dziugaite and Dan Roy for extensive comments on the generalization material. Further thanks to Nadav Cohen, Quanquan Gu, Suriya Gunasekar, Frederic Koehler, Justin Li, Akshayaa Magesh, Maxim Raginsky, David Rolnick, Matthieu Terris, and Alex Wozniakowski for various comments and feedback.

1 Approximation: preface

As above, we wish to ensure that our class of predictors \mathcal{F} (e.g., networks of a certain architecture) has some element \bar f\in\mathcal{F} which simultaneously has small \mathcal{R}(\bar f) and small complexity; we can re-interpret our notation and suppose \mathcal{F} is already some constrained class of low-complexity predictors, and aim to make \inf_{f\in\mathcal{F}} \mathcal{R}(f) small.

Classical setup. The standard classical setup is to set a natural baseline, for instance compare against all measurable and continuous functions: \inf_{f\in\mathcal{F}} \mathcal{R}(f) \qquad\text{vs.}\qquad \inf_{g\ \textrm{continuous}} \mathcal{R}(g).

Remark 1.1.
(Is this too strenuous?) Most of the classical work uses the uniform norm: \|f -g\|_{{\textrm{u}}} = \sup_{x\in S} |f(x) - g(x)| where S is some compact set, and compares against continuous functions. Unfortunately, already if the target is Lipschitz continuous, this means our function class needs complexity which scales exponentially with dimension (Luxburg and Bousquet 2004): this highlights the need for more refined target functions and approximation measures.

(Lower bounds.) The uniform norm has certain nice properties for proving upper bounds, but is it meaningful for a lower bound? Functions can be well-separated in uniform norm even if they are mostly the same: they just need one point of large difference. For this reason, L_1 norms, for instance \int_{[0,1]^d} |f(x)-g(x)|{\text{d}}x, are preferred for lower bounds.

How to parameterize function classes. Classical perspective: fix architecture and vary parameters: \mathcal{F}:= \left\{{ x \mapsto f(x;w) : w\in\mathbb{R}^p }\right\}, where f here again denotes some architecture parameterized by w\in\mathbb{R}^p, as in eq. 1. my use of “f” has some ambiguity with the preceding material; I should fix notation for architectures, like “F.”

“Modern” perspective: \mathcal{F} should depend on data and algorithm.

This is complicated; so far we just apply norm constraints: \mathcal{F}_{\|\cdot\|}(r) := \left\{{ x \mapsto f(x;w) : w\in\mathbb{R}^p, \|w\|\leq r}\right\}.

Example. f(x;W) := a^{\scriptscriptstyle\mathsf{T}}\sigma(W x), with a\in\mathbb{R}^m fixed. Consider \mathcal{F}_{q}(r) = \left\{{ x \mapsto f(x;W) : \|W\|_{2,q}\leq r}\right\}, \quad \textrm{where } \|W\|_{2,q} = \left({ \sum_{j=1}^m \|W_{j:}\|_2^q }\right)^{1/q} (with the usual convention when q=\infty). Varying q drastically changes \mathcal{F}_{q}(r)! I should introduce scale within the norm, or the function? otherwise it’s not an apples-to-apples comparison?
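As a small illustration of the (2,q) group norm in this example, here is a minimal numpy sketch (the width, dimension, and choices of q are arbitrary, and the helper name norm_2q is not from the notes):

    import numpy as np

    def norm_2q(W, q):
        # ||W||_{2,q}: the q-norm of the vector of row norms ||W_{j:}||_2.
        row_norms = np.linalg.norm(W, axis=1)
        return np.linalg.norm(row_norms, ord=q)

    rng = np.random.default_rng(0)
    W = rng.normal(size=(5, 3))
    for q in (1, 2, np.inf):
        print(q, norm_2q(W, q))   # the norm value changes substantially with q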

Questions:

These approximation notes: we will mostly discuss \mathcal{F}(\infty), the classical setting.

Remark 1.2.
While norms have received much recent attention as a way to measure complexity, this idea is quite classical. For instance, a resurgence of interest in the 1990s led to the proof of many deep network VC dimension bounds; however, it was very quickly highlighted (and proved) in (P. L. Bartlett 1996) that there are situations where the architecture (and connection cardinality) stays fixed (along with the VC dimension), yet the norms (and generalization properties) vary.

1.1 Omitted topics

2 Elementary constructive approximations

2.1 Finite-width univariate approximation

Proposition 2.1.
Suppose g :\mathbb{R}\to\mathbb{R} is \rho-Lipschitz. For any \epsilon>0, there exists a 2-layer network f with \lceil \frac\rho \epsilon\rceil threshold nodes z\mapsto \mathbf{1}[z\geq 0] so that \sup_{x\in[0,1]} |f(x) - g(x)| \leq \epsilon.

Proof. Define m := \lceil \frac\rho \epsilon\rceil and x_i := (i-1) \epsilon/ \rho, and construct f(x) := \sum_i a_i \mathbf{1}[ x \geq b_i] with a_1 = g(0), a_i = g(x_i) - g(x_{i-1}), b_i := x_i. Then for any x\in[0,1], pick the closest x_i\leq x, and note \begin{aligned} &|g(x) - f(x)| = |g(x) - f(x_i)| \leq |g(x) - g(x_i)| + |g(x_i) - f(x_i)| \\ &\leq \rho(\epsilon/\rho) + \left|{ g(x_i) - g(x_1) - \sum_{j=2}^i (g(x_j) - g(x_{j-1})) \mathbf{1}[x_i \geq x_j]}\right| =\epsilon. \end{aligned}
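Here is a minimal numpy sketch of this construction, using the 1-Lipschitz target g = \sin (an arbitrary choice) and checking the uniform error on a grid:

    import numpy as np

    rho, eps = 1.0, 0.05                   # sin is 1-Lipschitz; eps is an arbitrary target accuracy
    m = int(np.ceil(rho / eps))
    g = np.sin
    xs = np.arange(m) * eps / rho          # x_i = (i-1) eps / rho
    a = np.diff(np.concatenate(([0.0], g(xs))))   # a_1 = g(0), a_i = g(x_i) - g(x_{i-1})
    b = xs                                 # b_i = x_i

    def f(x):
        # sum_i a_i 1[x >= b_i]
        return np.sum(a * (x >= b))

    grid = np.linspace(0, 1, 2001)
    err = max(abs(f(x) - g(x)) for x in grid)
    print(err, eps)                        # the uniform error on the grid is at most eps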

Remark 2.1.
This is standard, but we’ve lost something! We are paying for flat regions, which are a specialty of standard networks!

2.2 Infinite width univariate apx; flat regions are okay

Let’s first develop a notion of infinite-width (shallow) network.

Definition 2.1.
An infinite-width shallow network is characterized by a signed measure \nu over weight vectors in \mathbb{R}^p: x\mapsto \int \sigma(w^{\scriptscriptstyle\mathsf{T}}x) {\text{d}}\nu(w). The mass of \nu is the total positive and negative weight mass assigned by \nu: |\nu|(\mathbb{R}^p) = \nu_-(\mathbb{R}^p) + \nu_+(\mathbb{R}^p). (If this seems weird, please see the subsequent remark; the mass will be used in section 5 to produce size estimates for finite-width networks.)
Remark 2.2.
The use of a general measure \nu is flexible and allows for discrete weights, continuous weights, and more. As an example of continuous weights, consider a network computing x\mapsto \int \sigma(w^{\scriptscriptstyle\mathsf{T}}x) g(w) {\text{d}}w; then g combined with the regular Lebesgue integral define the measure on weights, and its total mass is \int |g(w)| {\text{d}}w.
More generally, any signed measure \nu has a unique (up to null sets) decomposition into two nonnegative measures \nu_+ and \nu_- (Jordan decomposition, Folland 1999); these points will be revisited in section 5.
Lastly, note that \mathbb{R}^p was used suggestively so that we can encode biases and other feature mappings.
Proposition 2.2.
Suppose g:\mathbb{R}\to\mathbb{R} is differentiable, and g(0) = 0. If x \in [0,1], then g(x) = \int_0^1 \mathbf{1}[x \geq b] g'(b) {\text{d}}b.

Proof. By FTC and g(0) = 0 and x\in[0,1], g(x) = g(0) + \int_0^x g'(b) {\text{d}}b = 0 + \int_0^1 \mathbf{1}[x \geq b] g'(b){\text{d}}b. \hspace{6em}

Remark 2.3.

2.3 Multivariate approximation via bumps

We will conclude this section with an elementary approximation of continuous functions via 2 hidden layers. Specifically, we will use these two activation layers to construct localized bumps/blocks, and then simply add them together.

Theorem 2.1.
Let a continuous g and an \epsilon>0 be given, and choose \delta >0 so that \|x-x'\|_\infty \leq \delta implies |g(x) - g(x')| \leq \epsilon. Then there exists a 3-layer network f with \Omega(\frac 1 {\delta^d}) ReLU nodes such that \int_{[0,1]^d} |f(x) - g(x)|{\text{d}}x \leq 2 \epsilon.
Remark 2.4.

The proof uses the following lemma (omitted in class), approximating continuous functions by piecewise constant functions.

Lemma 2.1.
Let g,\delta,\epsilon be given as in the theorem. For any partition \mathcal{P} of [0,1]^d into rectangles (products of intervals) \mathcal{P}= (R_1,\ldots,R_N) with all side lengths not exceeding \delta, there exist scalars (\alpha_1,\ldots,\alpha_N) so that \sup_{x\in[0,1]^d}\left|{ g(x) - h(x) }\right|\leq \epsilon, \qquad\textrm{where}\qquad h = \sum_{i=1}^N \alpha_i \mathbf{1}_{R_i}.

Proof. Let partition \mathcal{P}= (R_1,\ldots,R_N) be given, and for each R_i, pick some x_i \in R_i, and set \alpha_i := g(x_i). Since each side length of each R_i is at most \delta, \begin{aligned} \sup_{x\in[0,1]^d} |g(x) - h(x)| &= \sup_{i\in \{1,\ldots,N\}} \sup_{x\in R_i} |g(x) - h(x)| \\ &\leq \sup_{i\in \{1,\ldots,N\}} \sup_{x\in R_i}\left({ |g(x) - g(x_i)| + |g(x_i) - h(x)| }\right) \\ &\leq \sup_{i\in \{1,\ldots,N\}} \sup_{x\in R_i}\left({ \epsilon+ |g(x_i) - \alpha_i| }\right) = \epsilon. \end{aligned}

Proof of Theorem 2.1. We will use the function h = \sum_i \alpha_i \mathbf{1}_{R_i} from the earlier lemma. Specifically, we will use the first two layers to approximate x\mapsto \mathbf{1}_{R_i}(x) for each i using \mathcal{O}(d) nodes, and a final linear layer for the linear combination. Writing \|f-g\|_1 = \int_{[0,1]^d} |f(x)-g(x)|{\text{d}}x for convenience, since \|f-g\|_1 \leq \|f-h\|_1 + \|h-g\|_1 \leq \|f-h\|_1 + \epsilon, and letting g_i denote the approximation to \mathbf{1}_{R_i} (so that f = \sum_i \alpha_i g_i), \|f-h\|_1 = \left\|{ \sum_i \alpha_i (g_i - \mathbf{1}_{R_i}) }\right\|_1 \leq \sum_i |\alpha_i|\cdot \| \mathbf{1}_{R_i} - g_i \|_1, so it suffices to make \| \mathbf{1}_{R_i} - g_i \|_1 \leq \frac {\epsilon}{\sum_i |\alpha_i|}. (If \sum_i |\alpha_i| = 0, we can set f to be the constant 0 network.)

Let’s do what we did in the univariate case, putting nodes where the function value changes. For each R_i := \times_{j=1}^d [a_j,b_j], and any \gamma > 0, define \hspace{-2em} g_{\gamma,j}(z) := \sigma\left({\frac {z - (a_j - \gamma)}{\gamma}}\right) - \sigma\left({\frac {z - a_j}{\gamma}}\right) - \sigma\left({\frac {z - b_j}{\gamma}}\right) + \sigma\left({\frac {z - (b_j+\gamma)}{\gamma}}\right) and g_\gamma(x) := \sigma(\sum_j g_{\gamma,j}(x_j) - (d-1)) (adding the additional ReLU layer is the key step!), whereby g_\gamma(x) = \begin{cases} 1 & x\in R_i,\\ 0 & x \not\in \times_j [a_j-\gamma, b_j+\gamma],\\ [0,1] & \textrm{otherwise.} \end{cases} Since g_\gamma \to \mathbf{1}_{R_i} pointwise, there exists \gamma with \|g_\gamma - \mathbf{1}_{R_i}\|_1 \leq \frac {\epsilon}{\sum_i |\alpha_i|}.
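Here is a small numpy sketch of the bump g_\gamma for a single rectangle R = \times_j [a_j,b_j]; the specific rectangle and \gamma are arbitrary choices for the illustration.

    import numpy as np

    relu = lambda z: np.maximum(z, 0.0)

    def g_gamma(x, a, b, gamma):
        # per-coordinate trapezoids g_{gamma,j}, then one more ReLU to AND them together
        per_coord = (relu((x - (a - gamma)) / gamma) - relu((x - a) / gamma)
                     - relu((x - b) / gamma) + relu((x - (b + gamma)) / gamma))
        return relu(per_coord.sum() - (len(x) - 1))

    a = np.array([0.2, 0.3])               # R = [0.2, 0.6] x [0.3, 0.7]
    b = np.array([0.6, 0.7])
    gamma = 0.01
    print(g_gamma(np.array([0.4, 0.5]), a, b, gamma))   # inside R: 1.0
    print(g_gamma(np.array([0.9, 0.5]), a, b, gamma))   # outside the enlarged rectangle: 0.0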

3 Universal approximation with a single hidden layer

The previous section developed g_\gamma such that g_\gamma(x) \approx \mathbf{1}\left[{ x\in \times_i [a_i,b_i] }\right].

If deep networks could multiply, we could do x \mapsto \prod_i \mathbf{1}\left[{ x_i \in [a_i,b_i] }\right].

Who thinks deep networks can multiply?

The answer will be yes, and we will use this to resolve the classical universal approximation question with a single hidden layer.

Definition 3.1.
A class of functions \mathcal{F} is a universal approximator over a compact set S if for every continuous function g and target accuracy \epsilon>0, there exists f\in\mathcal{F} with \sup_{x\in S} |f(x) - g(x)| \leq \epsilon.
Remark 3.1.
Typically we will take S = [0,1]^d. We can also consider more general universal approximation settings. The compactness is in a sense necessary: as in the homework, consider approximating the \sin function with a finite-size ReLU network over all of \mathbb{R}. Lastly, there are also ways to phrase universal approximation in terms of denseness, but we won’t use that here.

3.1 Multiplication with shallow networks

Consider infinite-width networks with one hidden layer: \begin{aligned} \mathcal{F}_{\sigma,d,m} &:= \mathcal{F}_{d,m} := \left\{{ x\mapsto a^{\scriptscriptstyle\mathsf{T}}\sigma(W x + b) : a\in\mathbb{R}^m, W\in\mathbb{R}^{m\times d}, b\in\mathbb{R}^m}\right\}. \\ \mathcal{F}_{\sigma,d} &:= \mathcal{F}_d := \bigcup_{m\geq 0} \mathcal{F}_{\sigma,d,m}. \end{aligned} Consider \sigma = \cos. Since 2 \cos(y)\cos(z) = \cos(y+z) + \cos(y-z), \begin{aligned} & 2 \left[{\sum_{i=1}^m a_i \cos(w_i^{\scriptscriptstyle\mathsf{T}}x + b_i) }\right] \cdot \left[{ \sum_{j=1}^n c_j \cos(u_j^{\scriptscriptstyle\mathsf{T}}x + v_j) }\right]= \\ & \sum_{i=1}^m \sum_{j=1}^n a_i c_j \left({ \cos((w_i+u_j)^{\scriptscriptstyle\mathsf{T}}x + (b_i+v_j)) + \cos((w_i - u_j)^{\scriptscriptstyle\mathsf{T}}x + (b_i-v_j)) }\right), \end{aligned} thus f,g\in\mathcal{F}_{\cos,d} \Longrightarrow fg \in \mathcal{F}_{\cos,d} !
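A quick numerical check of the product identity above (the random weights and dimensions are arbitrary): multiplying a cos network with m nodes by one with n nodes yields a cos network with 2mn nodes.

    import numpy as np

    rng = np.random.default_rng(0)
    d, m, n = 3, 2, 4
    a, W, b = rng.normal(size=m), rng.normal(size=(m, d)), rng.normal(size=m)
    c, U, v = rng.normal(size=n), rng.normal(size=(n, d)), rng.normal(size=n)

    def net(x, coef, weight, bias):
        return coef @ np.cos(weight @ x + bias)

    x = rng.normal(size=d)
    lhs = net(x, a, W, b) * net(x, c, U, v)
    # expand into 2mn cosine nodes via 2 cos(y) cos(z) = cos(y+z) + cos(y-z)
    rhs = 0.0
    for i in range(m):
        for j in range(n):
            rhs += 0.5 * a[i] * c[j] * (np.cos((W[i] + U[j]) @ x + (b[i] + v[j]))
                                        + np.cos((W[i] - U[j]) @ x + (b[i] - v[j])))
    print(lhs, rhs)   # agree up to floating point error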

Remark 3.2.
Classical perspective: \mathcal{F}_d is the span of functions computed by single-node networks (\mathcal{F}_{d,1} with a_1=1).

Note that this gives us “bumps” via x \mapsto \prod_{i=1}^d \cos(x_i)^r, and we can linearly combine bumps to get continuous functions.

Where does this leave us?

Theorem 3.1 (Stone-Weierstrass; (Folland 1999, Theorem 4.45)) .

Let functions \mathcal{F} be given as follows.

  1. Each f\in\mathcal{F} is continuous.

  2. For every x, there exists f\in\mathcal{F} with f(x) \neq 0.

  3. For every x\neq x' there exists f\in\mathcal{F} with f(x)\neq f(x') (\mathcal{F} separates points).

  4. \mathcal{F} is closed under multiplication and vector space operations (\mathcal{F} is an algebra).

Then for every continuous g :\mathbb{R}^d \to \mathbb{R} and \epsilon> 0, there exists f\in\mathcal{F} with \|f-g\|_{{\textrm{u}}}\leq \epsilon. (\mathcal{F} is universal.)

Remark 3.3.

It is heavyweight, but a good tool to have.

Proofs are not constructive, but seem to require size \Omega( \frac{1}{\epsilon^d} ).


As a technical point, we could also approximately satisfy the properties, and apply the theorem to the closure of \mathcal{F}.

The second and third conditions are necessary: if there exists x so that f(x) = 0 for all f\in\mathcal{F}, then we can’t approximate any g with g(x)\neq 0; if we can’t separate points x\neq x', then we can’t approximate functions with g(x)\neq g(x').

I used to have some remark about different ways to prove Weierstrass? Put that back in?

3.2 Universal approximation via Stone-Weierstrass

Lemma 3.1 ((Hornik, Stinchcombe, and White 1989)) .
\mathcal{F}_{\cos,d} is universal.

Proof. Let’s check the Stone-Weierstrass conditions:

  1. Each f \in \mathcal{F}_{\cos,d} is continuous.

  2. For each x, \cos(0^{\scriptscriptstyle\mathsf{T}}x) = 1 \neq 0.

  3. For each x\neq x', f(z) := \cos((z-x')^{\scriptscriptstyle\mathsf{T}}(x-x') / \|x-x'\|^2)\in\mathcal{F}_d satisfies f(x) = \cos(1) \neq \cos(0) = f(x').

  4. \mathcal{F}_{\cos,d} is closed under products and vector space operations as before.

We can work it out even more easily for \mathcal{F}_{\exp,d}.

Lemma 3.2.
\mathcal{F}_{\exp,d} is universal.

Proof. Let’s check the Stone-Weierstrass conditions:

  1. Each f \in \mathcal{F}_{\exp,d} is continuous.

  2. For each x, \exp(0^{\scriptscriptstyle\mathsf{T}}x) = 1 \neq 0.

  3. For each x\neq x', f(z) := \exp((z-x')^{\scriptscriptstyle\mathsf{T}}(x-x') / \|x-x'\|^2)\in\mathcal{F}_d satisfies f(x) = \exp(1) \neq \exp(0) = f(x').

  4. \mathcal{F}_{\exp,d} is closed under VS ops by construction; for products, \begin{aligned} \left({\sum_{i=1}^n r_i \exp(a_i^{\scriptscriptstyle\mathsf{T}}x) }\right) \left({\sum_{j=1}^m s_j \exp(b_j^{\scriptscriptstyle\mathsf{T}}x) }\right) = \sum_{i=1}^n \sum_{j=1}^m r_i s_j \exp( (a_i+b_j)^{\scriptscriptstyle\mathsf{T}}x). \end{aligned}

Remark 3.4.
Biases? x\mapsto \exp(a^{\scriptscriptstyle\mathsf{T}}x + b) = e^b\cdot \exp(a^{\scriptscriptstyle\mathsf{T}}x), so they are absorbed into the outer coefficients.

3.3 Arbitrary activations

Theorem 3.2 ((Hornik, Stinchcombe, and White 1989)) .
Suppose \sigma:\mathbb{R}\to\mathbb{R} is continuous, and \lim_{z\to-\infty} \sigma(z) = 0,\qquad \lim_{z\to+\infty} \sigma(z) = 1. Then \mathcal{F}_{\sigma,d} is universal.

Proof sketch. Given \epsilon>0 and continuous g, pick h\in\mathcal{F}_{\cos,d} (or \mathcal{F}_{\exp,d}) with \sup_{x\in[0,1]^d}|h(x)-g(x)|\leq \epsilon/2. To finish, replace all appearances of \cos with an element of \mathcal{F}_{\sigma,1}. (Details in hw1.)

Remark 3.5.

ReLU is fine: use z\mapsto \sigma(z) - \sigma(z-1) and split nodes.

\exp didn’t need bias, but ReLU apx of \exp needs bias.

Weakest conditions on \sigma (Leshno et al. 1993): universal apx iff not a polynomial

Never forget: curse of dimension (size \Omega(\frac{1}{\epsilon^d})).

Remark 3.6 (other universal approximation proofs) .

(Cybenko 1989) Assume for contradiction that some function is missed. By duality, 0 =\int \sigma(a^{\scriptscriptstyle\mathsf{T}}x -b) {\text{d}}\mu(x) for some nonzero signed measure \mu and all (a,b). Using Fourier arguments, one can show this implies \mu = 0, a contradiction.

(Leshno et al. 1993) If \sigma is a polynomial, then the resulting networks are polynomials of bounded degree and universality fails; otherwise one can (roughly) get derivatives and polynomials of all orders (we’ll have homework problems on this).

(Barron 1993) Use inverse Fourier representation to construct an infinite-width network; we’ll cover this next. It can beat the worst-case curse of dimension!

(Funahashi 1989) I’m sorry, I haven’t read it. Also uses Fourier.

4 Avoiding 1/\epsilon^d via infinite-width Fourier representations

Next we’ll rewrite the target function as an infinite-width network “for free” via Fourier inversion.

This idea is due to a very influential paper by (Barron 1993).

4.1 Fourier transforms review

Fourier transform (e.g., Folland 1999, Chapter 8): \hat f (w) := \int \exp(-2\pi i w^{\scriptscriptstyle\mathsf{T}}x) f(x) {\text{d}}x.

Fourier inversion: if f\in L^1 and \hat f\in L^1, f(x) := \int \exp(2\pi i w^{\scriptscriptstyle\mathsf{T}}x) \hat f(w) {\text{d}}w, rewriting f as infinite-width network with complex activation!

Real activations: using notation \Re(a+bi) = a, f(x) = \Re f(x) = \int \Re \exp(2\pi i w^{\scriptscriptstyle\mathsf{T}}x) \hat f(w) {\text{d}}w.
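A small numerical illustration of Fourier inversion as an infinite-width network over \cos activations, using the univariate Gaussian f(x) = \exp(-\pi x^2), which under this convention equals its own Fourier transform; the grid and truncation of the w-integral are arbitrary choices.

    import numpy as np

    f = lambda x: np.exp(-np.pi * x**2)    # with this convention, \hat f = f
    ws = np.linspace(-6, 6, 4001)          # truncate and discretize the integral over w
    dw = ws[1] - ws[0]

    def f_via_inversion(x):
        # f(x) = Re int exp(2 pi i w x) \hat f(w) dw = int cos(2 pi w x) \hat f(w) dw
        return np.sum(np.cos(2 * np.pi * ws * x) * f(ws)) * dw

    for x in (0.0, 0.3, 1.0):
        print(f(x), f_via_inversion(x))    # the two columns agree closely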

Issue: If we expand with e^{i z} = \cos(z) + i \sin(z), we’re left with \cos, which is neither compactly supported / localized, nor does it converge asymptotically!

4.2 Barron’s two tricks

  1. Polar decomposition. Let’s write \hat f(w) = |\hat f(w)| \exp( 2\pi i \theta(w)) with |\theta(w)| \leq 1; since f is real-valued, \begin{aligned} f(x) &= \Re \int \exp(2\pi i w^{\scriptscriptstyle\mathsf{T}}x) \hat f(w) {\text{d}}w \\ &= \int \Re\left({ \exp(2\pi i w^{\scriptscriptstyle\mathsf{T}}x) \exp(2\pi i\theta(w)) | \hat f(w) |}\right) {\text{d}}w \\ &= \int \Re\left({ \exp(2\pi i w^{\scriptscriptstyle\mathsf{T}}x + 2\pi i\theta(w)) }\right) |\hat f(w)| {\text{d}}w \\ &= \int \cos\left({ 2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) }\right) |\hat f(w)| {\text{d}}w. \end{aligned} We’ve now obtained an infinite width network over real-valued activations! Unfortunately, \cos is not ideal: it is neither compactly supported, nor does it obtain a limit as its argument goes to \pm\infty.

  2. Turning cosines into bumps! We’ll do two things to achieve our goal: subtracting f(0), and scaling by \|w\|: \begin{aligned} &f(x) - f(0) \\ &= \int \left[{\cos(2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) ) -\cos(2\pi w^{\scriptscriptstyle\mathsf{T}}0 + 2\pi\theta(w))}\right] |\hat f(w)| {\text{d}}w \\ &= \int \frac{\cos(2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) ) - \cos(2\pi\theta(w))}{\|w\|} \|w\| \cdot |\hat f(w)| {\text{d}}w. \end{aligned} The fraction does not blow up: since \cos is 1-Lipschitz, \begin{aligned} &\left|{ \frac{\cos(2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) ) - \cos(2\pi\theta(w))}{\|w\|} }\right| \\ &\leq \frac{\left|{2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) - 2\pi\theta(w)}\right|}{\|w\|} = \frac {2\pi |w^{\scriptscriptstyle\mathsf{T}}x|}{\|w\|} \leq 2\pi\|x\|. \end{aligned} Barron was able to rewrite these functions in terms of thresholds; we’ll follow a different approach.

Barron used these two tricks to develop his finite-width approximations (Barron 1993). Here, we will use just the first trick to write functions with equality as infinite-width networks; later in section 5, we will use the same tool as Barron, Lemma 5.1 (Maurey (Pisier 1980)), to sample and obtain finite-width networks.

Recall our univariate infinite-width approach: using \|x\|\leq1, \begin{aligned} &\cos(2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w)) - \cos(2\pi\theta(w)) \\ &= \int_0^{w^{\scriptscriptstyle\mathsf{T}}x} - 2\pi\sin(2\pi b + 2\pi\theta(w)){\text{d}}b \\ &= - 2\pi\int_0^{\|w\|} \mathbf{1}[w^{\scriptscriptstyle\mathsf{T}}x - b \geq 0]\sin(2\pi b + 2\pi\theta(w)){\text{d}}b. \\ & + 2\pi\int_{-\|w\|}^0 \mathbf{1}[-w^{\scriptscriptstyle\mathsf{T}}x + b \geq 0]\sin(2\pi b + 2\pi\theta(w)){\text{d}}b. \end{aligned} Plugging this into the previous form (before dividing by \|w\|), \begin{aligned} &\hspace{-2em}f(x) - f(0) = -2\pi\int\!\! \int_0^{\|w\|} \mathbf{1}[w^{\scriptscriptstyle\mathsf{T}}x - b\geq 0]\left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w, \\ &+2\pi\int\!\! \int_{-\|w\|}^0 \mathbf{1}[-w^{\scriptscriptstyle\mathsf{T}}x + b\geq 0]\left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w, \end{aligned} an infinite width network with threshold nodes!

We’ll tidy up with \widehat{ \nabla f}(w) = - 2\pi i w \hat f(w), whereby \| \widehat {\nabla f}(w) \| = 2\pi\|w\|\cdot |\hat f(w)|; specifically, note that the integral can be upper bounded with \begin{aligned} & \left|{ 2\pi\int \int_0^{\|w\|}\mathbf{1}[w^{\scriptscriptstyle\mathsf{T}}x - b \geq 0] \left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w }\right| \\ &+ \left|{ 2\pi\int\!\! \int_{-\|w\|}^0 \mathbf{1}[-w^{\scriptscriptstyle\mathsf{T}}x + b\geq 0]\left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w}\right| \\ &\leq 2\pi\int \int_{-\|w\|}^{\|w\|} \left|{ \sin(2\pi b + 2\pi\theta(w)) }\right| |\hat f(w)|{\text{d}}b{\text{d}}w \\ &\leq 2\pi\int 2\|w\| \cdot |\hat f(w)|{\text{d}}w \\ &= 2 \int \left\|{ \widehat{\nabla f}(w) }\right\|{\text{d}}w. \end{aligned} This last quantity upper bounds the mass of the weight distribution (as in Definition 2.1), and will be used in section 5 to produce finite-width bounds.

Theorem 4.1 (based on (Barron 1993)) .
Suppose \int \left\|{ \widehat{\nabla f}(w) }\right\| {\text{d}}w < \infty, f\in L_1, \hat f\in L_1, and write \hat f(w) = |\hat f(w)| \exp(2\pi i \theta(w)). For \|x\|\leq 1, \begin{aligned} &f(x) - f(0) = \int \frac{\cos(2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) ) - \cos(2\pi\theta(w))}{2\pi\|w\|} \left\|{ \widehat{\nabla f}(w) }\right\| {\text{d}}w \\ &= -2\pi\int \int_0^{\|w\|} \mathbf{1}[w^{\scriptscriptstyle\mathsf{T}}x - b\geq 0]\left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w \\ & \quad{}+ 2\pi\int\int_{-\|w\|}^0 \mathbf{1}[-w^{\scriptscriptstyle\mathsf{T}}x + b \geq 0]\left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w. \end{aligned} The corresponding measure on weights has mass at most 2 \int \Big\| \widehat{\nabla f}(w) \Big\|{\text{d}}w.

In words, f(x)-f(0) can be written as two infinite-width networks, both using the Fourier transform \hat f in the weighting measure, and one using standard threshold activations.

Remark 4.1.

Key point: after sampling, network width will scale with \frac 1 \epsilon \int \|\widehat{\nabla f}(w)\|{\text{d}}w, rather than \frac 1 {\epsilon^d} !

Barron did not present it as infinite width networks, but this proof (including Maurey step) is based on his.

now that i have cross-references, i should point to explicit maurey.

Here is a case where \int \left\|{ \widehat{\nabla f}(w) }\right\| {\text{d}}w is small.

5 Sampling from infinite width networks

Let’s first consider an abstract case: sampling in Hilbert spaces.

Suppose X = \mathop{\mathbb{E}}V, where r.v. V is supported on a set S.

A natural way to “simplify” X is to instead consider \hat X := \frac 1 k\sum_{i=1}^k V_i, where (V_1,\ldots,V_k) are sampled iid.

We want to argue \hat X \approx X; since we’re in a Hilbert space, we’ll try to make the Hilbert norm \|X - \hat X\| small.

Lemma 5.1 (Maurey (Pisier 1980)) .
Let X = \mathop{\mathbb{E}}V be given, with V supported on S, and let (V_1,\ldots,V_k) be iid draws from the same distribution. Then \mathop{\mathbb{E}}_{V_1,\ldots,V_k} \left\|{X - \frac 1 k \sum_i V_i}\right\|^2 \leq \frac {\mathop{\mathbb{E}}\|V\|^2}{k} \leq \frac{\sup_{U \in S} \|U\|^2}{k}, and moreover there exist (U_1,\ldots, U_k) in S so that \left\|{ X - \frac 1 k \sum_i U_i}\right\|^2 \leq \mathop{\mathbb{E}}_{V_1,\ldots,V_k} \left\|{X - \frac 1 k \sum_i V_i}\right\|^2.
Remark 5.1.

After proving this, we’ll get a corollary for sampling from networks.

This lemma is widely applicable; e.g., we’ll use it for generalization too.

First used for neural networks by (Barron 1993) and (Jones 1992), attributed to Maurey by (Pisier 1980).

Proof of Lemma 5.1 (Maurey (Pisier 1980)). Let (V_1,\ldots,V_k) be IID as stated. Then \begin{aligned} &\mathop{\mathbb{E}}_{V_1,\ldots,V_k}\left\|{ X - \frac 1 k \sum_i V_i }\right\|^2 = \mathop{\mathbb{E}}_{V_1,\ldots,V_k}\left\|{ \frac 1 k \sum_i \left({ V_i - X }\right) }\right\|^2 \\ &= \mathop{\mathbb{E}}_{V_1,\ldots,V_k}\frac 1 {k^2} \left[{ \sum_i \left\|{ V_i - X }\right\|^2 + \sum_{i\neq j} \left\langle V_i - X , V_j - X \right \rangle }\right] \\ &= \mathop{\mathbb{E}}_{V}\frac 1 {k} \left\|{ V - X }\right\|^2 \\ &= \mathop{\mathbb{E}}_{V}\frac 1 {k}\left({ \left\|{ V }\right\|^2 - \left\|{ X }\right\|^2 }\right) \\ &\leq \mathop{\mathbb{E}}_{V}\frac 1 {k} \left\|{ V }\right\|^2 \leq \sup_{U\in S}\frac 1 {k} \left\|{ U }\right\|^2. \end{aligned} To conclude, there must exist (U_1,\ldots,U_k) in S so that \left\|{ X - k^{-1} \sum_i U_i }\right\|^2 \leq \mathop{\mathbb{E}}_{V_1,\ldots,V_k}\left\|{ X - k^{-1} \sum_i V_i }\right\|^2. (“Probabilistic method.”)
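A minimal Monte Carlo illustration of the lemma in the Hilbert space \mathbb{R}^d (the support set, dimensions, and number of trials are arbitrary choices): the squared error of the k-sample average decays like 1/k.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 20
    S = rng.normal(size=(1000, d))    # finite support set; V is uniform over its rows
    X = S.mean(axis=0)                # X = E V

    for k in (1, 10, 100, 1000):
        trials = []
        for _ in range(200):
            V = S[rng.integers(0, len(S), size=k)]
            trials.append(np.sum((X - V.mean(axis=0)) ** 2))
        bound = np.mean(np.sum(S ** 2, axis=1)) / k    # E ||V||^2 / k
        print(k, np.mean(trials), bound)               # both columns decay like 1/k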

Now let’s apply this to infinite-width networks (recall Definition 2.1). Two issues:

Let’s write a generalized shallow network as x\mapsto \int g(x;w){\text{d}}\mu(w), where \mu is a nonzero signed measure over some abstract parameter space \mathbb{R}^p. E.g., w = (a,b,v) and g(x;w) = a \sigma(v^{\scriptscriptstyle\mathsf{T}}x + b).

To sample, define the probability measure {\widetilde{\mu}} over pairs (w,s)\in\mathbb{R}^p\times\{\pm 1\}: draw s=+1 with probability \|\mu_+\|_1/\|\mu\|_1 (and s=-1 otherwise), then draw w from the normalized measure {\widetilde{\mu}}_+ := \mu_+/\|\mu_+\|_1 (respectively {\widetilde{\mu}}_- := \mu_-/\|\mu_-\|_1), and set {\tilde{g}}(x;w,s) := \|\mu\|_1\, s\, g(x;w). This sampling procedure has the correct mean: \begin{aligned} & \int g(x;w) {\text{d}}\mu(w) = \int g(x;w) {\text{d}}\mu_+(w) - \int g(x;w) {\text{d}}\mu_-(w) \\ &= \|\mu_+\|_1 \mathop{\mathbb{E}}_{\tilde\mu_+} g(x;w) - \|\mu_-\|_1 \mathop{\mathbb{E}}_{\tilde\mu_-} g(x;w) \\ &= \|\mu\|_1\left[{ \mathop{\textrm{Pr}}_{{\widetilde{\mu}}}[s = +1] \mathop{\mathbb{E}}_{{\widetilde{\mu}}_+} g(x;w) - \mathop{\textrm{Pr}}_{{\widetilde{\mu}}}[s = -1] \mathop{\mathbb{E}}_{{\widetilde{\mu}}_-} g(x;w)}\right] = \mathop{\mathbb{E}}_{{\widetilde{\mu}}} \tilde g(x;w,s). \end{aligned}

Lemma 5.2 (Maurey for signed measures) .
Let \mu denote a nonzero signed measure supported on S\subseteq \mathbb{R}^p, and write g(x) := \int g(x;w){\text{d}}\mu(w). Let ({\tilde{w}}_1,\ldots,{\tilde{w}}_k) be IID draws from the corresponding {\widetilde{\mu}}, and let P be a probability measure on x. Then \begin{aligned} \mathop{\mathbb{E}}_{{\tilde{w}}_1,\ldots,{\tilde{w}}_k} \left\|{ g - \frac 1 k \sum_i {\tilde{g}}(\cdot;{\tilde{w}}_i) }\right\|_{L_2(P)}^2 &\leq \frac {\mathop{\mathbb{E}}\|{\tilde{g}}(\cdot;{\tilde{w}})\|_{L_2(P)}^2}{k} \\ &\leq \frac {\|\mu\|_1^2 \sup_{w\in S} \|g(\cdot;w)\|_{L_2(P)}^2}{k}, \end{aligned} and moreover there exist (w_1,\ldots,w_k) in S and s\in\{\pm1\}^k with \left\|{ g - \frac 1 k \sum_i {\tilde{g}}(\cdot;w_i,s_i) }\right\|_{L_2(P)}^2 \leq \mathop{\mathbb{E}}_{{\tilde{w}}_1,\ldots,{\tilde{w}}_k} \left\|{ g - \frac 1 k \sum_i {\tilde{g}}(\cdot;{\tilde{w}}_i) }\right\|_{L_2(P)}^2.

Proof. By the mean calculation we did earlier, g = \mathop{\mathbb{E}}_{{\widetilde{\mu}}} \|\mu\| s g_w = \mathop{\mathbb{E}}_{{\widetilde{\mu}}} {\tilde{g}}, so by the regular Maurey applied to {\widetilde{\mu}} and Hilbert space L_2(P) (i.e., writing V := {\tilde{g}} and g = \mathop{\mathbb{E}}V), \begin{aligned} \mathop{\mathbb{E}}_{{\tilde{w}}_1,\ldots,{\tilde{w}}_k} \left\|{ g - \frac 1 k \sum_i {\tilde{g}}(\cdot;{\tilde{w}}_i) }\right\|_{L_2(P)}^2 &\leq \frac {\mathop{\mathbb{E}}\|{\tilde{g}}(\cdot;{\tilde{w}})\|_{L_2(P)}^2}{k} \\ &\leq \frac {\sup_{s \in \{\pm 1\}} \sup_{w\in\mathcal{W}} \left\|{\|\mu\|_1 s g(\cdot;w)}\right\|^2_{L_2(P)}}{k}. \\ &\leq \frac {\|\mu\|_1^2 \sup_{w\in S} \|g(\cdot;w)\|_{L_2(P)}^2}{k}, \end{aligned} and the existence of the fixed (w_i,s_i) is also from Maurey.

Example 5.1 (various infinite-width sampling bounds) .
  1. Suppose x\in [0,1] and f is differentiable. Using our old univariate calculation, f(x) - f(0) = \int_0^1 \mathbf{1}[x \geq b] f'(b) {\text{d}}b. Let \mu denote f'(b){\text{d}}b; then a sample ((b_i,s_i))_{i=1}^k from {\widetilde{\mu}} satisfies \begin{aligned} &\left\|{ f(\cdot) - f(0) - \frac {\|\mu\|_1} k \sum_i s_i \mathbf{1}[\cdot \geq b_i] }\right\|_{L_2(P)}^2 \\ &\leq \frac {\|\mu\|_1^2 \sup_{b\in[0,1]} \|\mathbf{1}[\cdot \geq b]\|_{L_2(P)}^2}{k} \\ &= \frac 1 k \left({ \int_0^1 |f'(b)|{\text{d}}b }\right)^2. \end{aligned} (See the numerical sketch after this example.)
  2. Now consider the Fourier representation via Barron’s theorem: \begin{aligned} &f(x) - f(0) = -2\pi\int \int_0^{\|w\|} \mathbf{1}[w^{\scriptscriptstyle\mathsf{T}}x - b\geq 0]\left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w \\ & \quad{}+ 2\pi\int\int_{-\|w\|}^0 \mathbf{1}[-w^{\scriptscriptstyle\mathsf{T}}x + b \geq 0]\left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w, \end{aligned} and also our calculation that the corresponding measure \mu on thresholds has \|\mu\|_1 \leq 2\int\|\widehat{\nabla f}(w)\|{\text{d}}w. Then Maurey’s lemma implies that for any probability measure P supported on \|x\|\leq 1, there exist ((w_i,b_i,s_i))_{i=1}^k such that \begin{aligned} &\left\|{ f(\cdot) - f(0) - \frac {\|\mu\|_1} k \sum_i s_i \mathbf{1}[\left\langle w_i, \cdot \right \rangle \geq b_i] }\right\|_{L_2(P)}^2 \\ &\leq \frac {\|\mu\|_1^2 \sup_{w,b} \|\mathbf{1}[\left\langle w, \cdot \right \rangle \geq b]\|_{L_2(P)}^2}{k} \\ &\leq \frac {4 \left({\int \|\widehat{\nabla f}(w)\|{\text{d}}w}\right)^2} k. \end{aligned}
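Here is a sketch of item 1 above: sample thresholds b with probability proportional to |f'(b)|, take signs s = \mathrm{sign}(f'(b)), and average; the particular target f, the grid, and k are arbitrary choices.

    import numpy as np

    f = lambda x: np.sin(3 * x)            # arbitrary smooth univariate target
    fp = lambda b: 3 * np.cos(3 * b)       # f'
    bs = np.linspace(0, 1, 10001)
    p = np.abs(fp(bs)); p /= p.sum()       # sample b with density proportional to |f'(b)|
    mass = np.abs(fp(bs)).mean()           # ||mu||_1 = int_0^1 |f'(b)| db (Riemann estimate)

    rng = np.random.default_rng(0)
    k = 2000
    idx = rng.choice(len(bs), size=k, p=p)
    b_i, s_i = bs[idx], np.sign(fp(bs[idx]))

    xs = np.linspace(0, 1, 500)            # P taken uniform on [0,1]
    fhat = mass * np.mean(s_i[None, :] * (xs[:, None] >= b_i[None, :]), axis=1)
    print(np.mean((fhat - (f(xs) - f(0))) ** 2))   # small, consistent with ||mu||_1^2 / k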

6 Benefits of depth

Plan for a first result:

Theorem 6.1 ((Telgarsky 2015, 2016)) .
For any L\geq 2, f = \Delta^{L^2 + 2} is a ReLU network with 3L^2+6 nodes and 2L^2+4 layers, but any ReLU network g with \leq 2^L nodes and \leq L layers can not approximate it: \int_{[0,1]} \left|{ f(x) - g(x) }\right|{\text{d}}x \geq \frac 1 {32}.
Remark 6.1.

Previously, we used L_2 and L_\infty to state good upper bounds on approximation; for bad approximation, we want to argue there is a large region where we fail, not just a few points, and that’s why we use an L_1 norm.

To be able to argue that such a large region exists, we don’t just need the hard function f = \Delta^{L^2+2} to have many regions, we need them to be regularly spaced, and not bunch up. In particular, if we replaced \Delta with the similar function 4x(1-x), then this proof would need to replace \frac 1 {32} with something decreasing with L.

Proof plan for Theorem 6.1 ((Telgarsky 2015, 2016)):

  1. First we will upper bound the number of oscillations in ReLU networks. The key part of the story is that oscillations will grow polynomially in width, but exponentially in depth. give explicit lemma ref

  2. Then we will show there exists a function, realized by a slightly deeper network, which has many oscillations, which are moreover regularly spaced. The need for regular spacing will be clear at the end of the proof.

  3. Lastly, we will use a region-counting argument to combine the preceding two facts to prove the theorem. This step would be easy for the L_\infty norm, and takes a bit more effort for the L_1 norm.

Remark 6.2 (bibliographic notes) .
Theorem 6.1 ((Telgarsky 2015, 2016)) was the earliest proof showing that a deep network can not be approximated by a reasonably-sized shallow network; however, prior work showed a separation for exact representation of deep sum-product networks as compared with shallow ones (Bengio and Delalleau 2011). A sum-product network has nodes which compute affine transformations or multiplications, and thus a multi-layer sum-product network is a polynomial, and this result, while interesting, does not imply a ReLU separation.
As above, step 1 of the proof upper bounds the total possible number of affine pieces in a univariate network of some depth and width, and step 2 constructs a deep function which roughly meets this bound. Step 1 can be generalized to the multivariate case, with reasoning similar to the VC-dimension bounds in section 22. A version of step 2 appeared in prior work but for the multivariate case, specifically giving a multivariate-input network with exponentially many affine pieces, using a similar construction (Montúfar et al. 2014). A version of step 2 also appeared previously as a step in a proof that recurrent networks are Turing complete, specifically a step used to perform digit extraction (Siegelmann and Sontag 1994, Figure 3).

6.1 Bounding oscillations in ReLU networks

Definition 6.1.
For any univariate function f:\mathbb{R}\to\mathbb{R}, let N_A(f) denote the number of affine pieces of f: the minimum cardinality (or \infty) of a partition of \mathbb{R} so that f is affine when restricted to each piece.
Theorem 6.2.
Let f:\mathbb{R}\to\mathbb{R} be a ReLU network with L layers of widths (m_1,\ldots,m_L) with m = \sum_i m_i. Then N_A(f) \leq 2^L \prod_{j\leq L} m_j \leq \left({ \frac {2m}{L} }\right)^L.
Remark 6.3.
Working with the ReLU really simplifies this reasoning!

Our proof will proceed by induction, using the following combination rules for piecewise affine functions.

Lemma 6.1.

Let functions f,g,(g_1,\ldots,g_k), and scalars (a_1,\ldots,a_k,b) be given.

  1. N_A(f+g) \leq N_A(f) + N_A(g).

  2. N_A(\sum_i a_i g_i + b) \leq \sum_i N_A(g_i).

  3. N_A(f \circ g) \leq N_A(f) \cdot N_A(g).

  4. N_A\left({ x \mapsto f(\sum_i a_i g_i(x) + b)}\right) \leq N_A(f)\sum_i N_A(g_i).

Remark 6.4.
This immediately hints a “power of composition”: we increase the “complexity” multiplicatively rather than additively!
Remark 6.5.
It is natural and important to wonder if this exponential increase is realized in practice. Preliminary work reveals that, at least near initialization, the effective number of pieces is much smaller (Hanin and Rolnick 2019).

Proof of Lemma 6.1.

  1. Draw f and g, with vertical bars at the right boundaries of affine pieces. There are \leq N_A(f) + N_A(g) -1 distinct bars, and f+g is affine between each adjacent pair of bars.

  2. N_A(a_i g_i) \leq N_A(g_i) (equality if a_i\neq 0), thus induction with the preceding gives N_A(\sum_i a_i g_i) \leq \sum_i N_A(g_i), and N_A doesn’t change with addition of constants.

  3. Let P_A(g) denote the pieces of g, and fix some U\in P_A(g); g is a fixed affine function along U. U is an interval, and consider the pieces of f_{|g(U)}; for each T\in P_A(f_{|g(U)}), f is affine, thus f\circ g is affine (along U \cap g_{|U}^{-1}(T)), and the total number of pieces is \sum_{U \in P_A(g)} N_A(f_{|g(U)}) \leq \sum_{U \in P_A(g)} N_A(f) \leq N_A(g) \cdot N_A(f).

  4. Combine the preceding two.

Remark 6.6.
The composition rule is hard to make tight: the image of each piece of g must hit all intervals of f! This is part of the motivation for the triangle function, which essentially meets this bound with every composition: \Delta(x) = 2\sigma_{\textrm{r}}(x) - 2\sigma_{\textrm{r}}(2x-1) = \begin{cases} 2x&x\leq 1/2,\\ 2-2x&x > 1/2. \end{cases} i need to move this so triangle defined in next section…

Proof of Theorem 6.2.

To prove the second inequality from the first, N_A(f) \leq 2^L \prod_{j\leq L} m_j, and by concavity of \ln (Jensen’s inequality), \prod_{j\leq L} m_j = \exp\left({ \sum_{j\leq L} \ln m_j }\right) = \exp\left({ L \cdot \frac 1 L \sum_{j\leq L} \ln m_j }\right) \leq \exp\left({ L \ln\left({ \frac 1 L \sum_{j\leq L} m_j }\right) }\right) = \left({\frac m L}\right)^L. For the first, proceed by induction on layers. Base case: layer 0 maps the data with the identity, thus N_A(g) = 1. For the inductive step, given g in layer i+1 which takes (g_1,\ldots,g_{m_i}) from the previous layer as input, \begin{aligned} N_A(g) &= N_A(\sigma( b + \sum_j a_j g_j)) \leq 2 \sum_{j=1}^{m_i} N_A(g_j) \\ & \leq 2 \sum_{j=1}^{m_i} 2^i \prod_{k < i} m_k = 2^{i+1} m_i \cdot \prod_{k < i} m_k. \end{aligned}

This completes part 1 of our proof plan, upper bounding the number of affine pieces polynomially in width and exponentially in depth.

6.2 Step 2: constructing a deep function with many regular pieces

Next, let’s prove that \Delta^L indeed has many pieces, and regular structure; specifically, L-fold composition gives 2^{L-1} copies. Let \left\langle x \right\rangle = x - \lfloor x\rfloor denote fractional part.

Lemma 6.2.
\Delta^L(x) = \Delta(\left\langle 2^{L-1} x \right\rangle) = \Delta(2^{L-1} x - \lfloor 2^{L-1} x\rfloor ).
Remark 6.7.

\Delta^L creates 2^L (forward and backward) copies of its input, and thus is generally useful to replicate its input.

Parity on the hypercube in dimension d=2^L: \prod_{i=1}^d x_i = \Delta^{L-1}\left({\frac{d+ \sum_i x_i}{2d}}\right).

We’ll use \Delta when constructing (x,y)\mapsto xy.

Digit extraction! (Which appears a lot in deep network lower and upper bounds!) (See also the Turing machine constructions in (Siegelmann and Sontag 1994, Figure 3) and elsewhere.)

Proof. The proof is by induction on L, written as i below.
For base case i=1, directly \Delta^1(x) = \Delta(x) = \Delta(\left\langle x \right\rangle) = \Delta(\left\langle 2^0 x \right\rangle).
For the inductive step, consider \Delta^{i+1}.

If x\in [0,1/2], \Delta^{i+1}(x) = \Delta^i(\Delta(x)) = \Delta^i(2x) = \Delta(\left\langle 2^{i-1} 2x \right\rangle) = \Delta(\left\langle 2^{i}x \right\rangle). If x\in (1/2,1], \begin{aligned} \Delta^{i+1}(x)& = \Delta^{i}(\Delta(x)) = \Delta^i(2 - 2x) \\ &= \Delta^{i-1}(\Delta(2-2x)) = \Delta^{i-1}(\Delta(1 - (2-2x))) = \Delta^i(2x - 1) \\ &= \Delta(\left\langle 2^{i} x - 2^{i-1} \right\rangle) = \Delta(\left\langle 2^{i} x \right\rangle). \end{aligned} (If i=1, use \Delta^{1-1}(x) = x.)

(Can alternatively structure the proof around \Delta^{i+1} = \Delta \circ \Delta^i.)
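A short numerical sketch of this lemma and of the piece counting from the previous subsection (the depth L and grid are arbitrary choices): compose the triangle map, check \Delta^L(x) = \Delta(\left\langle 2^{L-1} x \right\rangle), and count the affine pieces (2^L of them on [0,1]).

    import numpy as np

    relu = lambda z: np.maximum(z, 0.0)
    tri = lambda x: 2 * relu(x) - 2 * relu(2 * x - 1)    # the triangle map Delta on [0,1]

    def tri_L(x, L):
        for _ in range(L):
            x = tri(x)
        return x

    L = 4
    xs = np.linspace(0, 1, 100001)
    lhs = tri_L(xs, L)
    rhs = tri(np.mod(2 ** (L - 1) * xs, 1.0))            # Delta(<2^{L-1} x>)
    print(np.max(np.abs(lhs - rhs)))                     # 0 up to floating point error

    slopes = np.sign(np.diff(lhs))
    pieces = 1 + np.count_nonzero(np.diff(slopes))       # count slope changes
    print(pieces, 2 ** L)                                # 16 affine pieces = 2^L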

6.3 Step 3: depth separation proof via region counting

We’re now ready to prove our main theorem, that \Delta^{L^2+2} can not be approximated by shallow networks, unless they have exponential size.

Proof of Theorem 6.1 ((Telgarsky 2015, 2016)).


The proof proceeds by “counting triangles.”

6.4 Other depth separations

7 Approximating x^2

Why x^2?

Remark 7.1 (bibliographic notes) .
The ability to efficiently approximate x\mapsto x^2, and consequences of this, was observed nearly in parallel by a few authors; in addition to (Yarotsky 2016) as mentioned above (whose approach is roughly followed here), in parallel was the work of (Safran and Shamir 2016), and slightly later the result was also discovered by (Rolnick and Tegmark 2017), all of these with differing perspectives and proofs.

Define S_i := \left({ \frac 0 {2^i}, \frac 1 {2^i}, \ldots, \frac {2^i}{2^i} }\right); let h_i be the linear interpolation of x^2 on S_i.




Thus:

Theorem 7.1 (roughly following (Yarotsky 2016)) .
  1. h_i is the piecewise-affine interpolation of x^2 along [0,1] with interpolation points S_i.
  2. h_i can be written as a ReLU network consisting of 2i layers and 5i nodes (albeit using “skip connections”)
  3. \sup_{x\in[0,1]}|h_i(x) - x^2| \leq 4^{-i-1}.
  4. Any ReLU network f with \leq L layers and \leq N nodes satisfies \int_{[0,1]} (f(x) - x^2)^2 {\text{d}}x \geq \frac{1}{5760 (2N/L)^{4L}}.
Remark 7.2.

Proof.

  1. The interpolation property comes from construction/definition.

  2. Since h_i = x - \sum_{j=1}^i \frac{\Delta^{j}}{4^j} and since \Delta^j requires 3 nodes and 2 layers for each new power, a worst case construction would need 2i layers and 3 \sum_{j\leq i} j = \mathcal{O}(i^2) nodes, but we can reuse individual \Delta elements across the powers, and thus need only 3i, though the network has “skip connections” (in the ResNet sense).

  3. Fix i, and set \tau := 2^{-i}, meaning \tau is the distance between interpolation points. The error between x^2 and h_i is thus bounded above by \begin{aligned} & \sup_{x\in [0,1-\tau]} \sup_{z \in [0,\tau]} \frac {\tau - z} \tau \left({x^2}\right) + \frac {z}{\tau}\left({x+\tau}\right)^2 - (x+z)^2 \\ &= \frac 1 \tau \sup_{x\in [0,1-\tau]} \sup_{z \in [0,\tau]} 2xz\tau + z\tau^2 - 2xz\tau - \tau z^2 \\ &= \frac 1 {\tau} \sup_{x\in [0,1-\tau]} \frac {\tau^3}{4} = \frac {\tau^2}{4} = 4^{-i-1}, \end{aligned} where the supremum over z is attained at z = \tau/2. (A numerical check of items 1–3 appears just after this proof.)

  4. By a bound from last lecture, N_A(f) \leq (2N/L)^L. Using a symbolic package to differentiate, for any interval [a,b], \min_{(c,d)}\int_{[a,b]} (x^2 - (cx+d))^2{\text{d}}x = \frac {(b-a)^5}{180}. Let S index the subintervals of length at least 1/(2N) with N:=N_A(f), and restrict attention to [0,1]. Then \sum_{[a,b]\in S} (b-a) = 1 - \sum_{[a,b]\not\in S} (b-a) \geq 1 - N/(2N) = 1/2. Consequently, \begin{aligned} \int_{[0,1]} (x^2 - f(x))^2 {\text{d}}x &= \sum_{[a,b]\in P_A(f)} \int_{[a,b]\cap[0,1]} (x^2 - f(x))^2 {\text{d}}x \\ &\geq \sum_{[a,b]\in S} \frac {(b - a)^5}{180} \\ &\geq \sum_{[a,b]\in S} \frac {(b - a)}{2880 N^4} \geq \frac {1}{5760 N^4}. \end{aligned}
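Here is a minimal numpy sketch of items 1–3: build h_i = x - \sum_{j\leq i} \Delta^j/4^j and check that the uniform error matches 4^{-i-1}; the grid is an arbitrary choice.

    import numpy as np

    relu = lambda z: np.maximum(z, 0.0)
    tri = lambda x: 2 * relu(x) - 2 * relu(2 * x - 1)

    def h(x, i):
        # h_i = x - sum_{j=1}^i Delta^j / 4^j, reusing Delta^{j-1} to build Delta^j
        out, d = x.copy(), x.copy()
        for j in range(1, i + 1):
            d = tri(d)
            out -= d / 4 ** j
        return out

    xs = np.linspace(0, 1, 200001)
    for i in range(1, 6):
        err = np.max(np.abs(h(xs, i) - xs ** 2))
        print(i, err, 4.0 ** (-i - 1))     # the two error columns coincide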

From squaring we can get many other things (still with \mathcal{O}(\ln(1/\epsilon)) depth and size).

Theorem 7.2 (sketch, from (Yarotsky 2016; Schmidt-Hieber 2017)) .
Suppose f:\mathbb{R}^d\to\mathbb{R} has all coordinates of all partial derivatives of order up to r within [-1,+1] and let \epsilon>0 be given. Then there exists a \widetilde{\mathcal{O}}(\ln(1/\epsilon)) layer and \widetilde{\mathcal{O}}(\epsilon^{-d/r}) width network so that \sup_{x\in[0,1]^d} |f(x) - g(x)|\leq \epsilon. gross and vague, i should clean
Remark 7.3.
There are many papers following up on these; e.g., crawl the citation graph outwards from (Yarotsky 2016).

8 Function space norms, and the Neural Tangent Kernel (NTK)

8.1 Temporary material

Originally I had a lot of material here, but it is somewhat disorganized, messy, and buggy; unfortunately I will not fix it until Summer 2021. For now, here are some pointers to various relevant references.

Remark 8.1 (bibliographic notes) .

(“NTK” term). “NTK” was coined by Jacot, Gabriel, and Hongler (2018), which also argued gradient descent follows the NTK; a kernel connection was observed earlier in (Cho and Saul 2009).

(Parallel work.) The original NTK and its usefulness to analyzing the early stage of gradient descent appeared roughly simultaneously in works not only of (Jacot, Gabriel, and Hongler 2018) as above, but also (Li and Liang 2018; Simon S. Du et al. 2018).

(Various further works.) The NTK has appeared in a vast number of papers (and various papers use linearization and study the early stage of training, whether they refer to it as the NTK or not). Concurrent works giving general convergence to global minima are (Simon S. Du et al. 2018; Allen-Zhu, Li, and Liang 2018; Oymak and Soltanolkotabi 2019; Zou et al. 2018). Many works subsequently aimed to reduce the width dependence (Zou and Gu 2019; Oymak and Soltanolkotabi 2019); in the classification case, a vastly smaller width is possible (Ji and Telgarsky 2019a; Z. Chen et al. 2019). Another subsequent direction (in the regression case) was obtaining test error and not just training error bounds (Cao and Gu 2020b, 2020a; Arora, Du, Hu, Li, and Wang 2019). Lastly, another interesting point is the use of noisy gradient descent in some of these analyses (Allen-Zhu, Li, and Liang 2018; Z. Chen et al. 2020).

(Explicit NTK gradient flow proof.) Later in these notes, in section 13, will appear a full shallow NTK proof with smooth activations, based on the work of (Chizat and Bach 2019).

(Empirical works.) Estimates of empirical infinite width performance are in (Arora, Du, Hu, Li, Salakhutdinov, et al. 2019) and (Novak et al. 2018).

(Minimum norm function class.) The NTK gives a kernel space, and we can consider SVM-style minimum-norm predictors within it. If instead we consider minimizing the infinite-width probability mass as in Definition 2.1, we get a different function space \mathcal{F}_1. A nice comparison between \mathcal{F}_1 and the NTK RKHS, including algorithms to find both in certain cases, is in (Chizat and Bach 2020).

8.2 Building up the NTK from finite width

Recall our three function classes from last time: \begin{aligned} \int f(z) \sigma(z^{\scriptscriptstyle\mathsf{T}}x){\text{d}}\mathcal{S}(z), &\quad\int f < \infty, &\mathcal{F}_1 \\\quad \int f(z) \sigma(z^{\scriptscriptstyle\mathsf{T}}x){\text{d}}\mathcal{S}(z), &\quad\int f^2 < \infty, &\mathcal{F}_2 \\ \frac 1 {\sqrt m} \sum_j w_j^{\scriptscriptstyle\mathsf{T}}x\, \sigma'(z_j^{\scriptscriptstyle\mathsf{T}}x), &\quad z_j\sim \mathcal{S}, \|w_j-z_j\|\leq \tau, &\textup{NTK} \end{aligned}

Today our goal is to explain NTK, and specifically: why f^2, why \tau\downarrow 0 implies we converge to \mathcal{F}_2, why we can still fit things with \tau \downarrow 0, and the name. We will not explain why \tau\downarrow 0 is a relevant assumption: that will come in the optimization lectures, and characterizes the early phase of training.

For now: single (finite!) hidden layer, no biases, train only layer 1: x \mapsto \frac 1 {\sqrt m} \sum_j s_j \sigma(w_j^{\scriptscriptstyle\mathsf{T}}x), with W being the matrix with rows (w_1^{\scriptscriptstyle\mathsf{T}},\ldots,w_m^{\scriptscriptstyle\mathsf{T}}), and s_j\in\{ \pm 1\}.

Remark 8.2 (why \frac {1}{\sqrt m}?) .

A standard initialization takes the coordinates of s and W to be Gaussian with variance 1/m and 1/d, respectively; since here s_j\in\{\pm 1\}, the 1/\sqrt m factor plays the role of that variance. If instead we set the variance according to the output dimension, we’d control spectral norms.

In the infinite width case, we said the weighting should satisfy \int f^2 < \infty; the 1/\sqrt{m} is related to that, we’ll get to it.

Remark 8.3 (other theoretically-motivated initialization schemes) .

At initialization, the function is not 0, but with high probability is constant for every example.

To simplify this, some authors include an extra factor \epsilon>0.

Another choice is to sample a gaussian matrix W \in \mathbb{R}^{m\times d}, and consider networks of width 2m of the form 1^{\scriptscriptstyle\mathsf{T}}\sigma(W_+x) - 1^{\scriptscriptstyle\mathsf{T}}\sigma(W_- x), where W_{\pm} = W at initialization, whereby the function is forced to be 0 at initialization.

Many papers do not make a big deal about this, but this seems to change certain aspects of the theory; if the noise is not cleared at initialization, the network must work to clear the noise.

Let’s go back to developing the NTK, starting from a regular shallow network: x \mapsto \frac 1 {\sqrt m} \sum_j s_j \sigma(w_j^{\scriptscriptstyle\mathsf{T}}x),

Crazy idea: Linearize \sigma, killing “deep network”:

Adjusted crazy idea: now let’s linearize around initialization! \begin{aligned} W &\mapsto \frac 1 {\sqrt m} \sum_j s_j \left[{ \sigma(w_{j,0}^{\scriptscriptstyle\mathsf{T}}x) + (w_j-w_{j,0})^{\scriptscriptstyle\mathsf{T}}x \sigma'\left({w_{j,0}^{\scriptscriptstyle\mathsf{T}}x}\right) }\right] \\ & = \frac 1 {\sqrt m} \sum_j s_j \left[{ \sigma(w_{j,0}^{\scriptscriptstyle\mathsf{T}}x) - w_{j,0}^{\scriptscriptstyle\mathsf{T}}x\sigma'(w_{j,0}^{\scriptscriptstyle\mathsf{T}}x) }\right] + \frac 1 {\sqrt m} x^{\scriptscriptstyle\mathsf{T}}\sum_j s_j w_j \sigma'(w_{j,0}^{\scriptscriptstyle\mathsf{T}}x). \end{aligned}

Remark 8.4 (handling multiple layers) .

Linearize predictions around initial weights: f(x;w) \approx f(x;w_0) + (w - w_0)^{\scriptscriptstyle\mathsf{T}}\nabla_{w} f(x;w_0) where in the earlier single-layer case \nabla_{w} f(x;w_0) = \begin{bmatrix} \longleftarrow &s_1 x^{\scriptscriptstyle\mathsf{T}}\sigma'( w_{1,0}^{\scriptscriptstyle\mathsf{T}}x) / \sqrt{m} &\longrightarrow \\ &\vdots& \\ \longleftarrow &s_m x^{\scriptscriptstyle\mathsf{T}}\sigma'( w_{m,0}^{\scriptscriptstyle\mathsf{T}}x) / \sqrt{m}& \longrightarrow \end{bmatrix} but another way to write it is as w = (W_L, \ldots, W_1) over layers: w^{\scriptscriptstyle\mathsf{T}}\nabla_w f(x;w_0) = \sum_{i=1}^L \left\langle W_i , \nabla_{W_i} f(x;w_0) \right \rangle. Indeed, the kernel now decomposes across layers: \mathop{\mathbb{E}}\left\langle \nabla_w f(x;w) , \nabla_w f(x';w) \right \rangle = \sum_i \mathop{\mathbb{E}}\left\langle \nabla_{W_i} f(x;w) , \nabla_{W_i} f(x';w) \right \rangle, and we can calculate these kernels separately for all layers!

This object is still complicated, but it seems to have a nice interpretation as a linear predictor on top of some complicated feature mapping. Following a standard approach, the kernel view can be built up by studying inner products of these features across examples.

Let’s define a feature mapping \Phi(x) := \begin{bmatrix} \longleftarrow &s_1 x^{\scriptscriptstyle\mathsf{T}}\mathbf{1}[ w_{1,0}^{\scriptscriptstyle\mathsf{T}}x \geq 0] / \sqrt{m} &\longrightarrow \\ &\vdots& \\ \longleftarrow &s_m x^{\scriptscriptstyle\mathsf{T}}\mathbf{1}[ w_{m,0}^{\scriptscriptstyle\mathsf{T}}x \geq 0] / \sqrt{m}& \longrightarrow \end{bmatrix} whereby our predictor is \left\langle W , \Phi(x) \right \rangle after dropping bias.

The corresponding kernel \tilde K is \begin{aligned} \tilde K(x,x') &:= \left\langle \Phi(x), \Phi(x') \right \rangle \\ &= x^{\scriptscriptstyle\mathsf{T}}x' \frac 1 m \sum_{j=1}^m s_j^2 \mathbf{1}[ w_j(0)^{\scriptscriptstyle\mathsf{T}}x \geq 0 ]\cdot \mathbf{1}[w_j(0)^{\scriptscriptstyle\mathsf{T}}x' \geq 0]. \end{aligned} \frac 1 m is now very suggestive! Using s_j Rademacher and w_j Gaussian: K(x,x') := \mathop{\mathbb{E}}_{s,W} \tilde K(x,x') = x^{\scriptscriptstyle\mathsf{T}}x' \mathop{\mathbb{E}}_{W} \mathbf{1}[ w_j(0)^{\scriptscriptstyle\mathsf{T}}x \geq 0 ]\cdot \mathbf{1}[w_j(0)^{\scriptscriptstyle\mathsf{T}}x' \geq 0].

We can find a closed form! need to include picture proof

\begin{aligned} K(x,x') &= x^{\scriptscriptstyle\mathsf{T}}x' \mathop{\mathbb{E}}\mathbf{1}[ w_j(0)^{\scriptscriptstyle\mathsf{T}}x \geq 0 ]\cdot \mathbf{1}[w_j(0)^{\scriptscriptstyle\mathsf{T}}x' \geq 0] \\ & = \left({{x}^{{\scriptscriptstyle\mathsf{T}}} x'}\right) \frac {\pi - \arccos(x^{\scriptscriptstyle\mathsf{T}}x' / (\|x\|\cdot\|x'\|))}{2\pi}. \end{aligned} (Generalized in homework 1; for an initial multi-layer analysis, see the original work (which coined the term “Neural Tangent Kernel”) by Jacot, Gabriel, and Hongler (2018).)
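A quick Monte Carlo check of this closed form for the shallow ReLU setup above (the dimension, width, and random pair x, x' are arbitrary): the empirical kernel converges to the arccos expression as m grows.

    import numpy as np

    rng = np.random.default_rng(0)
    d, m = 5, 200000
    x, xp = rng.normal(size=d), rng.normal(size=d)

    W = rng.normal(size=(m, d))                       # rows play the role of the w_j(0)
    emp = (x @ xp) * np.mean((W @ x >= 0) & (W @ xp >= 0))

    cos_angle = (x @ xp) / (np.linalg.norm(x) * np.linalg.norm(xp))
    closed = (x @ xp) * (np.pi - np.arccos(cos_angle)) / (2 * np.pi)
    print(emp, closed)                                # agree up to Monte Carlo error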

Remark 8.5 (kernel after taking Taylor expansion at 0) .
Original Taylor at 0 gives a linear kernel: \begin{aligned} \Phi_0(x) &:= \begin{bmatrix} \longleftarrow & s_1 x^{\scriptscriptstyle\mathsf{T}}\mathbf{1}[ 0^{\scriptscriptstyle\mathsf{T}}x \geq 0] / \sqrt{m} & \longrightarrow \\ &\vdots& \\ \longleftarrow & s_m x^{\scriptscriptstyle\mathsf{T}}\mathbf{1}[ 0^{\scriptscriptstyle\mathsf{T}}x \geq 0] / \sqrt{m} & \longrightarrow \end{bmatrix}, \\ \tilde K_0(x,x') &:= \left\langle \Phi_0(x), \Phi_0(x') \right \rangle = \frac {x^{\scriptscriptstyle\mathsf{T}}x'}{m} \sum_j s_j^2 = x^{\scriptscriptstyle\mathsf{T}}x'. \end{aligned}

9 Optimization: preface

Classically, the purpose of optimization is to approximately minimize (or maximize) an objective function f over a domain S: \min_{w\in S} f(w).

A core tension in the use of optimization in machine learning is that we would like to minimize the population risk \mathcal{R}(w) := \mathop{\mathbb{E}}\ell(Yf(X;w)); however, we only have access to the empirical risk \widehat{\mathcal{R}}(w) := n^{-1} \sum_i \ell(y_if(x_i;w)).

As a result, when choosing a w_t, we care not only that \widehat{\mathcal{R}}(w_t) is small, but also that w_t has other good properties which may indicate \mathcal{R}(w_t) is small as well. Foremost among these is that w_t has low norm, but there are other possibilities.

Outline.

Remark 9.1.

…maybe I should always use \widehat{\mathcal{R}} or F for objectives

9.1 Omitted topics

10 Smooth objectives in ML

Remark 10.1.
Smoothness is trivially false for standard deep networks: the ReLU is not even differentiable. However, many interesting properties carry over, and many lines of research proceed by trying to make these properties carry over, so at the very least, it’s good to understand.

A key consequence: we can guarantee gradient descent does not increase the objective.

Consider the gradient iteration w' = w - \frac 1 \beta \nabla\widehat{\mathcal{R}}(w); then smoothness implies \widehat{\mathcal{R}}(w') \leq \widehat{\mathcal{R}}(w) - \left\langle \nabla\widehat{\mathcal{R}}(w), \nabla\widehat{\mathcal{R}}(w)/\beta \right \rangle + \frac {1}{2\beta}\|\nabla\widehat{\mathcal{R}}(w)\|^2 = \widehat{\mathcal{R}}(w) - \frac 1 {2\beta} \|\nabla\widehat{\mathcal{R}}(w)\|^2, and \|\nabla\widehat{\mathcal{R}}(w)\|^2 \leq 2\beta (\widehat{\mathcal{R}}(w) - \widehat{\mathcal{R}}(w')).

With deep networks, we’ll produce similar bounds but in other ways.

As an exercise, let’s prove the earlier smoothness consequence. Considering the curve t\mapsto \widehat{\mathcal{R}}(w + t(v-w)) along [0,1], \begin{aligned} &\left|{ \widehat{\mathcal{R}}(v) - \widehat{\mathcal{R}}(w) - \left\langle \nabla\widehat{\mathcal{R}}(w), v-w \right \rangle }\right| \\ &= \left|{ \int_0^1 \left\langle \nabla\widehat{\mathcal{R}}(w+t(v-w)), v-w \right \rangle{\text{d}}t - \left\langle \nabla\widehat{\mathcal{R}}(w), v-w \right \rangle }\right| \\ &\leq \int_0^1\left|{ \left\langle \nabla\widehat{\mathcal{R}}(w+t(v-w))-\nabla\widehat{\mathcal{R}}(w), v-w \right \rangle }\right| {\text{d}}t \\ &\leq \int_0^1\|\nabla\widehat{\mathcal{R}}(w+t(v-w))-\nabla\widehat{\mathcal{R}}(w)\|\cdot\|v-w\| {\text{d}}t \\ &\leq \int_0^1 t\beta \|v-w\|^2 {\text{d}}t \\ &= \frac {\beta}{2} \|v-w\|^2. \end{aligned}

Example 10.1.
Define \widehat{\mathcal{R}}(w) := \frac 1 2 \|Xw-y\|^2, and note \nabla\widehat{\mathcal{R}}(w) = X^{\scriptscriptstyle\mathsf{T}}(Xw-y). For any w,w', \begin{aligned} \widehat{\mathcal{R}}(w') &= \frac 1 2 \|Xw' - Xw + Xw - y\|^2 \\ &= \frac 1 2 \|Xw' - Xw\|^2 + \left\langle Xw' - Xw, Xw - y \right \rangle + \frac 1 2 \|Xw - y\|^2 \\ &= \frac 1 2 \|Xw' - Xw\|^2 + \left\langle w' - w, \nabla\widehat{\mathcal{R}}(w) \right \rangle + \widehat{\mathcal{R}}(w). \end{aligned}

Since \frac {\sigma_{\min}(X)^2}{2} \|w'-w\|^2 \leq \frac 1 2 \|Xw'-Xw\|^2 \leq \frac {\sigma_{\max}(X)^2}{2} \|w'-w\|^2, it follows that \widehat{\mathcal{R}} is \sigma_{\max}(X)^2-smooth (and \sigma_{\min}(X)^2-strongly-convex, as we'll discuss).

The smoothness bound holds if we use the seminorm \|v\|_X = \|Xv\|. We’ll discuss smoothness wrt other norms in homework.
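To make the constants concrete, here is a minimal numerical sketch (not from the notes; the Gaussian X and y below are illustrative assumptions) which computes \beta = \sigma_{\max}(X)^2 for this least squares objective and checks the descent inequality \widehat{\mathcal{R}}(w') \leq \widehat{\mathcal{R}}(w) - \frac{1}{2\beta}\|\nabla\widehat{\mathcal{R}}(w)\|^2 for a single gradient step with step size 1/\beta.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def risk(w):
    return 0.5 * np.linalg.norm(X @ w - y) ** 2

def grad(w):
    return X.T @ (X @ w - y)

beta = np.linalg.norm(X, ord=2) ** 2  # sigma_max(X)^2, the smoothness constant

w = rng.standard_normal(d)
w_next = w - grad(w) / beta  # one gradient step with step size 1/beta

lhs = risk(w_next)
rhs = risk(w) - np.linalg.norm(grad(w)) ** 2 / (2 * beta)
print(lhs <= rhs + 1e-9)  # descent inequality holds (up to roundoff)
```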

I should use \mathcal L not \widehat{\mathcal{R}} since unnormalized.

10.1 Convergence to stationary points

Consider first the gradient iteration w' := w - \eta \nabla\widehat{\mathcal{R}}(w), where \eta\geq 0 is the step size. When \widehat{\mathcal{R}} is \beta-smooth but not necessarily convex, the smoothness inequality directly gives \begin{aligned} \widehat{\mathcal{R}}(w') &\leq \widehat{\mathcal{R}}(w) + \left\langle \nabla\widehat{\mathcal{R}}(w), w'-w \right \rangle + \frac \beta 2 \|w'-w\|^2 \\ &= \widehat{\mathcal{R}}(w) - \eta \|\nabla\widehat{\mathcal{R}}(w)\|^2 + \frac {\beta\eta^2}{2}\|\nabla\widehat{\mathcal{R}}(w)\|^2 \\ &= \widehat{\mathcal{R}}(w) - \eta \left({1 - \frac {\beta\eta}{2}}\right) \|\nabla\widehat{\mathcal{R}}(w)\|^2. \end{aligned}

If we choose \eta appropriately (\eta < 2/\beta), then either we are near a critical point (\nabla\widehat{\mathcal{R}}(w)\approx 0), or we can decrease \widehat{\mathcal{R}}.

Let’s refine our notation to tell iterates apart:

  1. Let w_0 be given.

  2. Recurse: w_i := w_{i-1} - \eta_i \nabla\widehat{\mathcal{R}}(w_{i-1}).

am I consistent with indexing? is \eta_i always with w_{i-1}?

Rearranging our iteration inequality and summing over i<t, \begin{aligned} \sum_{i<t} \eta_{i+1}\left({1 - \frac {\beta\eta_{i+1}}{2}}\right)\|\nabla\widehat{\mathcal{R}}(w_i)\|^2 &\leq \sum_{i<t} \left({\widehat{\mathcal{R}}(w_i) - \widehat{\mathcal{R}}(w_{i+1})}\right) \\ &= \widehat{\mathcal{R}}(w_0) - \widehat{\mathcal{R}}(w_{t}). \end{aligned} We can summarize these observations in the following theorem.

Theorem 10.1.
Let (w_i)_{i\geq 0} be given by gradient descent on \beta-smooth \widehat{\mathcal{R}} with step sizes \eta_i := 1/\beta. Then \min_{i<t} \|\nabla\widehat{\mathcal{R}}(w_i)\|^2 \leq \frac{2\beta}{t}\left({\widehat{\mathcal{R}}(w_0) - \widehat{\mathcal{R}}(w_t)}\right).
Remark 10.2.

Gradient flow version. Using FTC, the chain rule, and the definition of GF, \begin{aligned} \widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(w(0)) &= \int_0^t \left\langle \nabla\widehat{\mathcal{R}}(w(s)), \dot w(s) \right \rangle{\text{d}}s \\ &= - \int_0^t\|\nabla\widehat{\mathcal{R}}(w(s))\|^2{\text{d}}s \\ &\leq - t \inf_{s\in[0,t]} \|\nabla\widehat{\mathcal{R}}(w(s))\|^2, \end{aligned} which can be summarized as follows.

Theorem 10.2.
For the gradient flow, \inf_{s\in[0,t]} \|\nabla\widehat{\mathcal{R}}(w(s))\|^2 \leq \frac 1 t \left({ \widehat{\mathcal{R}}(w(0)) - \widehat{\mathcal{R}}(w(t)) }\right).
Remark 10.3.

GD: \min_{i<t} \|\nabla\widehat{\mathcal{R}}(w_i)\|^2 \le \frac {2\beta}{t}\left({\widehat{\mathcal{R}}(w_0) - \widehat{\mathcal{R}}(w_t)}\right).
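As a sanity check on these stationarity rates (a sketch, not part of the notes; the one-dimensional objective and its smoothness constant below are illustrative choices), the following runs GD with step size 1/\beta on a smooth nonconvex function and compares \min_{i<t}\|\nabla\widehat{\mathcal{R}}(w_i)\|^2 against \frac{2\beta}{t}(\widehat{\mathcal{R}}(w_0)-\widehat{\mathcal{R}}(w_t)).

```python
import numpy as np

def risk(w):
    return 0.5 * w ** 2 + 2.0 * np.sin(w)

def grad(w):
    return w + 2.0 * np.cos(w)

beta = 3.0  # |risk''(w)| = |1 - 2 sin(w)| <= 3, so risk is 3-smooth (and nonconvex)

w0, t = 5.0, 200
w, grad_sq = w0, []
for _ in range(t):
    g = grad(w)
    grad_sq.append(g ** 2)
    w = w - g / beta

bound = 2 * beta * (risk(w0) - risk(w)) / t
print(min(grad_sq) <= bound + 1e-12)
```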

10.2 Convergence rate for smooth & convex

Theorem 10.3.
Suppose \widehat{\mathcal{R}} is \beta-smooth and convex, and (w_i)_{i\geq 0} is given by GD with \eta_i := 1/\beta. Then for any z, \widehat{\mathcal{R}}(w_t) - \widehat{\mathcal{R}}(z) \leq \frac {\beta}{2t}\left({ \|w_0 - z\|^2 - \|w_t-z\|^2 }\right).
Remark 10.4.
We only invoke convexity via the inequality \widehat{\mathcal{R}}(w') \geq \widehat{\mathcal{R}}(w) + \left\langle \nabla\widehat{\mathcal{R}}(w), w'-w \right \rangle, meaning \widehat{\mathcal{R}} lies above all of its tangents. I should give a summary of convexity characterizations; after all, I do it for strong convexity…

The reference point z allows us to use this bound effectively when \widehat{\mathcal{R}} lacks an optimum, or simply when the optimum has very large norm. For an example of such an application of z, see the margin maximization material.

Proof. By convexity and the earlier smoothness inequality \|\nabla\widehat{\mathcal{R}}(w)\|^2 \leq 2\beta (\widehat{\mathcal{R}}(w)-\widehat{\mathcal{R}}(w')), \begin{aligned} \|w'-z\|^2 &= \|w-z\|^2 - \frac 2 \beta \left\langle \nabla\widehat{\mathcal{R}}(w), w-z \right \rangle + \frac 1 {\beta^2} \|\nabla\widehat{\mathcal{R}}(w)\|^2 \\ &\leq \|w-z\|^2 + \frac 2 \beta (\widehat{\mathcal{R}}(z)-\widehat{\mathcal{R}}(w)) + \frac 2 \beta (\widehat{\mathcal{R}}(w) - \widehat{\mathcal{R}}(w')) \\ &= \|w-z\|^2 + \frac 2 \beta (\widehat{\mathcal{R}}(z)-\widehat{\mathcal{R}}(w')). \end{aligned} Rearranging and applying \sum_{i<t}, \frac 2 \beta \sum_{i<t} (\widehat{\mathcal{R}}(w_{i+1}) - \widehat{\mathcal{R}}(z)) \leq \sum_{i<t} \left({ \|w_{i}-z\|^2 - \|w_{i+1}-z\|^2 }\right). The final bound follows by noting \widehat{\mathcal{R}}(w_{i+1})\geq \widehat{\mathcal{R}}(w_t) (the objective values are nonincreasing), and since the right hand side telescopes.
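Here is a hedged numerical illustration of Theorem 10.3 (not from the notes; the synthetic data is an assumption): GD with \eta_i = 1/\beta on least squares, checking \widehat{\mathcal{R}}(w_t) - \widehat{\mathcal{R}}(z) \leq \frac{\beta}{2t}\|w_0 - z\|^2 with z taken to be a least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

risk = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2
grad = lambda w: X.T @ (X @ w - y)
beta = np.linalg.norm(X, ord=2) ** 2  # smoothness constant sigma_max(X)^2

z = np.linalg.lstsq(X, y, rcond=None)[0]  # reference point: a least-squares solution
w0 = np.zeros(d)
w, t = w0.copy(), 500
for _ in range(t):
    w = w - grad(w) / beta

gap = risk(w) - risk(z)
bound = beta * np.linalg.norm(w0 - z) ** 2 / (2 * t)
print(gap <= bound + 1e-9)
```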

For GF, we use the same potential, but start directly from the continuous analogue of the telescoping sum, via the following application of FTC: \begin{aligned} \frac 1 2 \|w(t) - z\|_2^2 - \frac 1 2 \|w(0) - z\|_2^2 &= \frac 1 2 \int_0^t \frac {{\text{d}}}{{\text{d}}s} \|w(s) - z\|_2^2 {\text{d}}s \\ &= \int_0^t \left\langle \frac {{\text{d}}w}{{\text{d}}s} , w(s) - z \right \rangle {\text{d}}s \\ &= -\int_0^t \left\langle \nabla\widehat{\mathcal{R}}(w(s)) , w(s) - z \right \rangle {\text{d}}s \\ &\leq \int_0^t \left({ \widehat{\mathcal{R}}(z) - \widehat{\mathcal{R}}(w(s)) }\right) {\text{d}}s. \end{aligned}

Theorem 10.4.
For any z\in\mathbb{R}^d, GF satisfies \begin{aligned} t \widehat{\mathcal{R}}(w(t)) + \frac 1 2 \|w(t) - z\|_2^2 &\leq \int_0^t \widehat{\mathcal{R}}(w(s)) {\text{d}}s + \frac 1 2 \|w(t) - z\|_2^2 \\ &\leq t \widehat{\mathcal{R}}(z) + \frac 1 2 \|w(0) - z\|_2^2. \end{aligned}
Remark 10.5 (“units” of GD and GF: t vs \frac t \beta) .
Here’s a back-of-the-envelope calculation to see why t becomes t/\beta and why they are really the same, and not a sloppiness of the analysis.
Remark 10.6 (potential functions) .

We can use similar potential functions with deep learning, without smoothness (!).

Remark 10.7 (rates) .
Some rules of thumb (not comprehensive, and there are other ways).

11 Strong convexity

Here is a sort of companion to Lipschitz gradients; a stronger condition than convexity which will grant much faster convergence rates. (Convexity references I recommend: (Hiriart-Urruty and Lemaréchal 2001; Borwein and Lewis 2000).)

Say that \widehat{\mathcal{R}} is \lambda-strongly-convex (\lambda-sc) when \widehat{\mathcal{R}}(w') \geq \widehat{\mathcal{R}}(w) + \left\langle \nabla\widehat{\mathcal{R}}(w), w'-w \right \rangle + \frac \lambda 2 \|w'-w\|^2. Some alternative definitions:

Example 11.1 (least squares) .
Earlier we pointed out \frac 1 2 \|Xw'-y\|^2 =: \widehat{\mathcal{R}}(w') = \widehat{\mathcal{R}}(w) + \left\langle \nabla\widehat{\mathcal{R}}(w), w'-w \right \rangle + \frac 1 2 \|Xw'-Xw\|^2 and \sigma_{\min}(X)^2 \|w'-w\|^2 \leq \|Xw'-Xw\|^2 \leq \sigma_{\max}(X)^2 \|w'-w\|^2. The latter inequality implies the smoothness upper bound we used; now we see the former implies strong convexity. (We can also say that both hold with equality using the special seminorm \|v\|_X = \|Xv\|.) We can also verify these properties by noting \nabla^2 \widehat{\mathcal{R}}= X^{\scriptscriptstyle\mathsf{T}}X.
Example 11.2 (regularization) .
Define regularized risk \widehat{\mathcal{R}}_\lambda(w) := \widehat{\mathcal{R}}(w) + \lambda\|w\|^2/2.

If \widehat{\mathcal{R}} is convex, then \widehat{\mathcal{R}}_\lambda is \lambda-sc:

Another very useful property is that \lambda-sc gives a way to convert gradient norms to suboptimality.

Lemma 11.1.
Suppose \widehat{\mathcal{R}} is \lambda-sc. Then \forall w\centerdot\qquad \widehat{\mathcal{R}}(w) - \inf_v \widehat{\mathcal{R}}(v) \leq \frac 1 {2\lambda}\|\nabla\widehat{\mathcal{R}}(w)\|^2.
Remark 11.1.
Smoothness gave \frac 1 {2\beta} \|\nabla\widehat{\mathcal{R}}(w_i)\|^2 \leq \widehat{\mathcal{R}}(w_i) - \widehat{\mathcal{R}}(w_{i+1}).

Proof. Let w be given, and define the convex quadratic Q_w(v) := \widehat{\mathcal{R}}(w) + \left\langle \nabla\widehat{\mathcal{R}}(w), v-w \right \rangle + \frac \lambda 2 \|v-w\|^2, which attains its minimum at \bar v := w - \nabla\widehat{\mathcal{R}}(w)/\lambda. By the definition of \lambda-sc, \begin{aligned} \inf_v \widehat{\mathcal{R}}(v) \geq \inf_v Q_w(v) = Q_w(\bar v) = \widehat{\mathcal{R}}(w) - \frac 1 {2\lambda}\|\nabla\widehat{\mathcal{R}}(w)\|^2. \end{aligned}
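A quick numerical check of Lemma 11.1 (a sketch; the ridge-regularized least squares objective below is an assumed example, not taken from the notes): at a random point, the suboptimality is bounded by \|\nabla\widehat{\mathcal{R}}(w)\|^2/(2\lambda).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 60, 8
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
lam = 0.5  # regularization weight; the objective below is lam-strongly-convex

risk = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2 + 0.5 * lam * np.linalg.norm(w) ** 2
grad = lambda w: X.T @ (X @ w - y) + lam * w

w_star = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # unique minimizer
w = rng.standard_normal(d)
print(risk(w) - risk(w_star) <= np.linalg.norm(grad(w)) ** 2 / (2 * lam) + 1e-9)
```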

Remark 11.2 (stopping conditions) .

Say our goal is to find w so that \widehat{\mathcal{R}}(w) - \inf_v \widehat{\mathcal{R}}(v) \leq \epsilon. When do we stop gradient descent?

Remark 11.3 (Regularization and boundedness) .

I should add lemmas giving level set containment, and existence of minimizers.

11.1 Rates when strongly convex and smooth

Theorem 11.1.
Suppose \widehat{\mathcal{R}} is \lambda-sc and \beta-smooth, and GD is run with step size 1/\beta. Then a minimum \bar w exists, and \begin{aligned} \widehat{\mathcal{R}}(w_t) - \widehat{\mathcal{R}}(\bar w) &\leq \left({\widehat{\mathcal{R}}(w_0) - \widehat{\mathcal{R}}(\bar w)}\right) \exp(-t \lambda/\beta), \\ \|w_t-\bar w\|^2 &\leq \|w_0-\bar w\|^2\exp(-t\lambda/\beta). \end{aligned}

Proof. Using the previously-proved consequences of smoothness and strong convexity, \begin{aligned} \widehat{\mathcal{R}}(w_{i+1}) - \widehat{\mathcal{R}}(\bar w) &\leq \widehat{\mathcal{R}}(w_{i}) - \widehat{\mathcal{R}}(\bar w) - \frac {\|\nabla\widehat{\mathcal{R}}(w_i)\|^2}{2\beta} \\ &\leq \widehat{\mathcal{R}}(w_{i}) - \widehat{\mathcal{R}}(\bar w) - \frac {2\lambda(\widehat{\mathcal{R}}(w_i) - \widehat{\mathcal{R}}(\bar w))}{2\beta} \\ &\leq \left({\widehat{\mathcal{R}}(w_{i}) - \widehat{\mathcal{R}}(\bar w)}\right)\left({1 - \lambda/\beta}\right), \end{aligned} which gives the first bound by induction since \prod_{i<t}(1-\lambda/\beta) \leq \prod_{i<t}\exp\left({-\lambda/\beta}\right) = \exp\left({-t\lambda/\beta}\right).

For the second guarantee, expanding the square as usual, \begin{aligned} \|w' -\bar w\|^2 &=\|w-\bar w\|^2 + \frac 2 \beta \left\langle \nabla\widehat{\mathcal{R}}(w), \bar w - w \right \rangle + \frac 1 {\beta^2}\|\nabla\widehat{\mathcal{R}}(w)\|^2 \\ &\leq\|w-\bar w\|^2 + \frac 2 \beta \left({\widehat{\mathcal{R}}(\bar w) - \widehat{\mathcal{R}}(w) - \frac{\lambda}{2} \|\bar w- w\|_2^2 }\right) \\ &\qquad + \frac 1 {\beta^2}\left({2\beta(\widehat{\mathcal{R}}(w) - \widehat{\mathcal{R}}(w'))}\right) \\ &= (1-\lambda/\beta)\|w-\bar w\|^2 + \frac 2 \beta \left({\widehat{\mathcal{R}}(\bar w) - \widehat{\mathcal{R}}(w) + \widehat{\mathcal{R}}(w) - \widehat{\mathcal{R}}(w')}\right) \\ &\leq (1-\lambda/\beta)\|w-\bar w\|^2, \end{aligned} which gives the second bound after an induction as before.
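The following sketch (synthetic data, not from the notes) runs GD with step size 1/\beta on a ridge-regularized least squares objective and checks the geometric decrease \widehat{\mathcal{R}}(w_t)-\widehat{\mathcal{R}}(\bar w) \leq (\widehat{\mathcal{R}}(w_0)-\widehat{\mathcal{R}}(\bar w))\exp(-t\lambda/\beta) from Theorem 11.1.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 80, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
lam = 0.1

risk = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2 + 0.5 * lam * np.linalg.norm(w) ** 2
grad = lambda w: X.T @ (X @ w - y) + lam * w
H = X.T @ X + lam * np.eye(d)          # Hessian of the objective
beta = np.linalg.eigvalsh(H).max()     # smoothness constant
lam_sc = np.linalg.eigvalsh(H).min()   # strong convexity constant

w_bar = np.linalg.solve(H, X.T @ y)    # the minimizer
w0 = np.zeros(d)
w, t = w0.copy(), 50
for _ in range(t):
    w = w - grad(w) / beta

lhs = risk(w) - risk(w_bar)
rhs = (risk(w0) - risk(w_bar)) * np.exp(-t * lam_sc / beta)
print(lhs <= rhs + 1e-9)
```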

Remark 11.4.

Next let’s handle the gradient flow.

Theorem 11.2.
If \widehat{\mathcal{R}} is \lambda-sc, a minimum \bar w exists, and the GF w(t) satisfies \begin{aligned} \|w(t) -\bar w\|^2 &\leq \|w(0)-\bar w\|^2 \exp(-2\lambda t), \\ \widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(\bar w) &\leq \left({ \widehat{\mathcal{R}}(w(0)) - \widehat{\mathcal{R}}(\bar w) }\right) \exp(-2t\lambda). \end{aligned}

Proof. By first-order optimality in the form \nabla\widehat{\mathcal{R}}(\bar w) = 0, then \begin{aligned} \frac {{\text{d}}}{{\text{d}}t} \frac 1 2 \|w(t) -\bar w\|^2 &= \left\langle w(t) - \bar w, \dot w(t) \right \rangle \\ &= -\left\langle w(t) - \bar w, \nabla \widehat{\mathcal{R}}(w(t)) - \nabla \widehat{\mathcal{R}}(\bar w) \right \rangle \\ &\leq - \lambda \|w(t) - \bar w\|^2. \end{aligned} By Grönwall’s inequality, this implies \begin{aligned} \|w(t) -\bar w\|^2 &\leq \|w(0)-\bar w\|^2 \exp\left({ -\int_0^t 2\lambda{\text{d}}s }\right) \\ &\leq \|w(0)-\bar w\|^2 \exp(-2\lambda t), \end{aligned} which establishes the guarantee on distances to initialization. For the objective function guarantee, \begin{aligned} \frac {{\text{d}}}{{\text{d}}t} (\widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(\bar w)) & = \left\langle \nabla\widehat{\mathcal{R}}(w(t)), \dot w(t) \right \rangle \\ &= - \left\|{ \nabla\widehat{\mathcal{R}}(w(t)) }\right\|^2 \leq - 2\lambda (\widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(\bar w)). \end{aligned} Grönwall’s inequality implies \widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(\bar w) \leq \left({ \widehat{\mathcal{R}}(w(0)) - \widehat{\mathcal{R}}(\bar w) }\right) \exp(-2t\lambda).

Remark 11.5.
As in all other rates proved for GF and GD, time t is replaced by “arc length units” t/\beta.

We have strayed a little from our goals by producing laborious proofs that not only separate the objective function and the distances, but also require minimizers. Interestingly, we can resolve this by changing the step size to a large (seemingly worse?) one.

Theorem 11.3.
Suppose \widehat{\mathcal{R}} is \beta-smooth and \lambda-sc, and GD is run with the constant step size \frac {2}{\beta + \lambda}. Then, for any z, \widehat{\mathcal{R}}(w_t) - \widehat{\mathcal{R}}(z) + \frac \lambda 2 \|w_t - z\|^2 \leq \left[{ \frac {\beta - \lambda}{\beta + \lambda}}\right]^t \left({ \widehat{\mathcal{R}}(w_0) - \widehat{\mathcal{R}}(z) + \frac \lambda 2 \|w_0 - z\|^2 }\right).

Proof. Homework problem \ddot\smile.

Remark 11.6 (standard rates with strong convexity) .
Compared with standard proofs in the literature (Nesterov 2003, chap. 2), the preceding bound with step size 2/(\beta+\lambda) is possibly loose: it seems possible to have a 2t and not just t in the exponent, albeit after adjusting the other terms (and depending explicitly on minimizers). I need to resolve what’s going on here…

Moreover, another standard rate given in the literature is 1/t under just strong convexity (no smoothness); however, this requires a step size \eta_i := (\lambda(i+1))^{-1}.

12 Stochastic gradients

Let’s generalize gradient descent, and consider the iteration w_{i+1} := w_i - \eta g_i, where each g_i is merely some vector. If g_i := \nabla\widehat{\mathcal{R}}(w_i), then we have gradient descent, but in general we only approximate it. Later in this section, we’ll explain how to make g_i a “stochastic gradient.”

Our first step is to analyze this in our usual way with our favorite potential function, but accumulating a big error term: \begin{aligned} \hspace{0.2em}& \hspace{-0.2em} \|w_{i+1} - z\|^2 \\ &= \|w_i - \eta g_i - z\|^2 \\ &= \|w_i - z\|^2 - 2\eta \left\langle g_i, w_i - z \right \rangle + \eta^2 \|g_i\|^2 \\ &= \|w_i - z\|^2 - 2\eta \left\langle \nabla\mathcal{R}(w_i), w_i - z \right \rangle + 2\eta \left\langle \nabla\mathcal{R}(w_i) - g_i, w_i - z \right \rangle + \eta^2 \|g_i\|^2 \\ &\leq \|w_i - z\|^2 + 2\eta (\mathcal{R}(z) - \mathcal{R}(w_i) + \underbrace{\left\langle g_i - \nabla\mathcal{R}(w_i), z - w_i \right \rangle}_{\epsilon_i}) + \eta^2 \|g_i\|^2, \end{aligned} which after rearrangement gives 2\eta \mathcal{R}(w_i) \leq 2\eta \mathcal{R}(z) + \|w_{i} - z\|^2 - \|w_{i+1} - z\|^2 + 2\eta \epsilon_i+ \eta^2 \|g_i\|^2, and applying \frac 1 {2\eta t} \sum_{i<t} to both sides gives \frac 1 t \sum_{i<t} \mathcal{R}(w_i) \leq \mathcal{R}(z) + \frac{\|w_{0} - z\|^2 - \|w_{t} - z\|^2}{2 \eta t} + \frac 1 t \sum_{i<t}\left({ \epsilon_i+ \frac \eta 2 \|g_i\|^2}\right).

The following lemma summarizes this derivation.

Lemma 12.1.
Suppose \mathcal{R} convex; set G:=\max_i\|g_i\|_2, and \eta := \frac{c}{\sqrt t}. For any z, \mathcal{R}\left({ \frac 1 t \sum_{i<t} w_i }\right) \leq \frac 1 t \sum_{i<t} \mathcal{R}(w_i) \leq \mathcal{R}(z) + \frac{\|w_{0} - z\|^2}{2 c\sqrt{t}} + \frac{cG^2}{2\sqrt t} + \frac 1 t \sum_{i<t}\epsilon_i.

Proof. This follows from the earlier derivation after plugging in G, \eta=c/\sqrt{t}, and applying Jensen’s inequality to the left hand side.

Remark 12.1.

Now let us define the standard stochastic gradient oracle: \mathop{\mathbb{E}}[ g_i | w_{\leq i} ] = \nabla\mathcal{R}(w_i), where w_{\leq i} signifies all randomness in (w_1,\ldots,w_i).

Remark 12.2.

Indeed, this setup allows the expectation to be nicely interpreted as an iterated integral over (x_1,y_1), then (x_2,y_2), and so on. The stochastic gradient g_i depends on (x_i,y_i) and w_i, but w_i does not depend on (x_i,y_i), rather on ((x_j,y_j))_{j=1}^{i-1}.
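Here is a minimal sketch of this oracle (an illustration under assumptions not in the notes: a Gaussian linear model and the squared loss): each step draws a fresh sample and uses the single-sample gradient g_i, which satisfies \mathop{\mathbb{E}}[g_i \mid w_{\leq i}] = \nabla\mathcal{R}(w_i), and the averaged iterate of Lemma 12.1 is returned.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5
w_star = rng.standard_normal(d)

def sample():
    # one fresh draw from the population (Gaussian x, noisy linear label)
    x = rng.standard_normal(d)
    y = x @ w_star + 0.1 * rng.standard_normal()
    return x, y

t = 2000
eta = 1.0 / np.sqrt(t)   # step size c / sqrt(t) with c = 1, as in Lemma 12.1
w = np.zeros(d)
iterates = []
for _ in range(t):
    x, y = sample()
    g = (w @ x - y) * x  # stochastic gradient: E[g | w] = gradient of the population risk
    iterates.append(w.copy())
    w = w - eta * g

w_avg = np.mean(iterates, axis=0)
print(np.linalg.norm(w_avg - w_star))  # close to the population minimizer for large t
```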

Now let’s work towards our goal of showing that, with high probability, our stochastic gradient method does nearly as well as a regular gradient method. (We will not show any benefit to stochastic noise, other than computation!)

Our main tool is as follows.

Theorem 12.1 (Azuma-Hoeffding) .
Suppose (Z_i)_{i=1}^n is a martingale difference sequence (\mathop{\mathbb{E}}(Z_i | Z_{<i}) = 0) and |Z_i|\leq R almost surely. Then with probability at least 1-\delta, \sum_i Z_i \leq R\sqrt{2n\ln(1/\delta)}.

Proof omitted, though we’ll sketch some approaches in a few weeks.

We will use this inequality to handle \sum_{i<t} \epsilon_i. Firstly, we must show the desired conditional expectations are zero. To start, \begin{aligned} \mathop{\mathbb{E}}\left[{ \epsilon_i\ \Big|\ w_{\leq i} }\right] &= \mathop{\mathbb{E}}\left[{ \left\langle g_i - \nabla\mathcal{R}(w_i), z - w_i \right \rangle\ \Big|\ w_{\leq i} }\right] \\ &= \left\langle \mathop{\mathbb{E}}\left[{ g_i - \nabla\mathcal{R}(w_i)\ {} \Big| \ {} w_{\leq i} }\right], z - w_i \right \rangle \\ &= \left\langle 0, z - w_i \right \rangle \\ &= 0. \end{aligned} Next, with G := \max_i\|g_i\| and D \geq \max_i \|w_i - z\| as in the lemma below, Cauchy-Schwarz and the triangle inequality give |\epsilon_i| = \left|{ \left\langle g_i - \nabla\mathcal{R}(w_i), z - w_i \right \rangle }\right| \leq \left({ \|g_i\| + \|\nabla\mathcal{R}(w_i)\| }\right)\|w_i - z\| \leq 2GD, where \|\nabla\mathcal{R}(w_i)\| = \|\mathop{\mathbb{E}}[g_i\,|\,w_{\leq i}]\| \leq G by Jensen's inequality. Consequently, by Azuma-Hoeffding, with probability at least 1-\delta, \sum_{i<t} \epsilon_i \leq 2GD \sqrt{2t \ln(1/\delta)}.

Plugging this into Lemma 12.1 (the earlier approximate gradient lemma) gives the following.

Lemma 12.2.
Suppose \mathcal{R} is convex and each g_i is a stochastic gradient at time i; set G:=\max_i\|g_i\|_2, D\geq \max_i \|w_i - z\|, and \eta := \frac{1}{\sqrt t}. With probability at least 1-\delta, \begin{aligned} \mathcal{R}\left({ \frac 1 t \sum_{i<t} w_i }\right) &\leq \frac 1 t \sum_{i<t} \mathcal{R}(w_i) \\ &\leq \mathcal{R}(z) + \frac{D^2}{2 \sqrt{t}} + \frac{G^2}{2\sqrt t} + \frac {2DG \sqrt{2 \ln(1/\delta)}}{\sqrt{t}}. \end{aligned}
Remark 12.3.

13 NTK-based gradient flow analysis on smooth shallow networks, following (Chizat and Bach 2019)

Finally we will prove (rather than assert) that we can stay close to initialization long enough to get a small risk, via an analysis that is essentially convex, following the NTK (Taylor approximation).

Basic notation. For convenience, bake the training set into the predictor: \begin{aligned} f(w) &:= \begin{bmatrix} f(x_1;w)\\ \vdots \\ f(x_n;w) \end{bmatrix} \in \mathbb{R}^n. \end{aligned} We’ll be considering squared loss regression: \begin{aligned} \widehat{\mathcal{R}}(\alpha f(w)) &:= \frac {1}{2} \|\alpha f(w) - y\|^2, \qquad \widehat{\mathcal{R}}_0 := \widehat{\mathcal{R}}(\alpha f(w(0)) ), \end{aligned} where \alpha > 0 is a scale factor we’ll optimize later. maybe I should use \mathcal L not \widehat{\mathcal{R}} since unnormalized.

We’ll consider gradient flow: \begin{aligned} \dot w(t) &:= - \nabla_w \widehat{\mathcal{R}}(\alpha f(w(t))) = - \alpha J_t^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f(w(t))), \\ \textup{where } J_t := J_{w(t)} &:= \begin{bmatrix} \nabla f(x_1;w(t))^{\scriptscriptstyle\mathsf{T}}\\ \vdots \\ \nabla f(x_n;w(t))^{\scriptscriptstyle\mathsf{T}} \end{bmatrix} \in \mathbb{R}^{n\times p}. \end{aligned}

We will also explicitly define and track a flow u(t) over the tangent model; what we care about is w(t), but we will show that indeed u(t) and w(t) stay close in this setting. (Note that u(t) is not needed for the analysis of w(t).) \begin{aligned} f_0(u) &:= f(w(0)) + J_0(u - w(0)). \\ \dot u(t) &:= -\nabla_u \widehat{\mathcal{R}}(\alpha f_0(u(t))) = - \alpha J_0^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f_0(u(t))). \end{aligned} Both gradient flows have the same initial condition: u(0) = w(0),\qquad f_0(u(0)) = f_0(w(0)) = f(w(0)).

Remark 13.1 (initialization, width, etc) .

Assumptions. \begin{aligned} \textrm{rank}(J_0) &= n, \\ \sigma_{\min}&:= \sigma_{\min}(J_0) = \sqrt{\lambda_{\min}(J_0J_0^{\scriptscriptstyle\mathsf{T}})} = \sqrt{\lambda_{n}(J_0J_0^{\scriptscriptstyle\mathsf{T}})}> 0, \\ \sigma_{\max}&:= \sigma_{\max}(J_0) > 0, \\ \|J_w - J_v\| &\leq \beta \|w-v\|. \end{aligned}(2)

Remark 13.2 (J_0J_0^{\scriptscriptstyle\mathsf{T}} has full rank, a “representation assumption”) .
This is a “representation assumption” in an explicit sense: it implies the tangent model has exact solutions to the least squares problem, regardless of the choice of y, meaning the training error can always be made 0. In detail, consider the least squares problem solved by the tangent space: \min_{u\in \mathbb{R}^p} \frac 1 2 \left\|{ f_0(u) - y }\right\|^2 = \min_{u\in \mathbb{R}^p} \frac 1 2 \left\|{ J_0 u - y_0 }\right\|^2, where we have chosen y_0 := y + J_0 w(0) - f(w(0)) for convenience. The normal equations for this least squares problem are J_0^{\scriptscriptstyle\mathsf{T}}J_0 u = J_0^{\scriptscriptstyle\mathsf{T}}y_0. Let J_0 = \sum_{i=1}^n s_i u_i v_i^{\scriptscriptstyle\mathsf{T}} denote the SVD of J_0, which has n terms by the rank assumption; the corresponding pseudoinverse is J_0^\dagger = \sum_{i=1}^n s_i^{-1} v_i u_i^{\scriptscriptstyle\mathsf{T}}. Multiplying both sides by (J_0^\dagger)^{\scriptscriptstyle\mathsf{T}}, J_0 u = (J_0^\dagger)^{\scriptscriptstyle\mathsf{T}}J_0^{\scriptscriptstyle\mathsf{T}}J_0 u = (J_0^\dagger)^{\scriptscriptstyle\mathsf{T}}J_0^{\scriptscriptstyle\mathsf{T}}y_0 = \left[{ \sum_{i=1}^n u_i u_i^{\scriptscriptstyle\mathsf{T}}}\right] y_0 = y_0, where the last step follows since [ \sum_i u_i u_i^{\scriptscriptstyle\mathsf{T}}] is idempotent and full rank, and therefore the identity matrix. In particular, we can choose \hat u = J_0^\dagger y_0, then J_0 \hat u = [ \sum_i u_i u_i^{\scriptscriptstyle\mathsf{T}}] y_0 = y_0, and in particular \frac 1 2 \left\|{f_0(\hat u) - y}\right\|^2 = \frac 1 2 \left\|{J_0 \hat u - y_0}\right\|^2 = 0. As such, the full rank assumption is explicitly a representation assumption: we are forcing the tangent space least squares problem to always have solutions.
Theorem 13.1 (see also (Theorem 3.2, Chizat and Bach 2019)) .
Assume eq. 2 and \alpha \geq \frac {\beta \sqrt{1152 \sigma_{\max}^2 \widehat{\mathcal{R}}_0}}{\sigma_{\min}^3}. Then \begin{aligned} \max\left\{{ \widehat{\mathcal{R}}(\alpha f(w(t))), \widehat{\mathcal{R}}(\alpha f_0(u(t))) }\right\} &\leq \widehat{\mathcal{R}}_0 \exp(- t \alpha^2 \sigma_{\min}^2/2), \\ \max\left\{{ \|w(t) - w(0)\|, \|u(t) - w(0)\| }\right\} &\leq \frac {3 \sqrt{8 \sigma_{\max}^2 \widehat{\mathcal{R}}_0}}{\alpha \sigma_{\min}^2}. \end{aligned}
Remark 13.3 (shallow case) .

To get a handle on the various abstract constants and what they mean, consider the shallow case, namely f(x;w) = \sum_j s_j \sigma(w_j^{\scriptscriptstyle\mathsf{T}}x), where s_j\in\{\pm 1\} is not trained, and each w_j is trained.

Smoothness constant. Let X\in\mathbb{R}^{n\times d} be a matrix with the n training inputs as rows, and suppose \sigma is \beta_0-smooth. Then \begin{aligned} \|J_w - J_v\|^2 \leq \|J_w - J_v\|_{{\textrm{F}}}^2 &= \sum_{i,j} \|x_i\|^2 (\sigma'(w_j^{\scriptscriptstyle\mathsf{T}}x_i) - \sigma'(v_j^{\scriptscriptstyle\mathsf{T}}x_i))^2 \\ &\leq \sum_{i,j} \|x_i\|^4 \beta_0^2 \|w_j - v_j \|^2 \\ &\leq \beta_0^2 \|X\|_{{\textrm{F}}}^4 \|w - v\|^2. \end{aligned} Thus \beta \leq \beta_0 \|X\|_{{\textrm{F}}}^2 suffices, which we can ballpark as \beta = \Theta(n).

Singular values. Now that we have an interpretation of the full rank assumption, ballpark the eigenvalues of J_0J_0^{\scriptscriptstyle\mathsf{T}}. By definition, (J_0J_0^{\scriptscriptstyle\mathsf{T}})_{i,j} = \nabla f(x_i;w(0))^{\scriptscriptstyle\mathsf{T}}\nabla f(x_j;w(0)). Holding i fixed and letting j vary, we can view the corresponding row of J_0J_0^{\scriptscriptstyle\mathsf{T}} as another feature representation of x_i, and \textrm{rank}(J_0)=n means that no example, in this feature representation, is a linear combination of the others. This gives a concrete sense in which these eigenvalue assumptions are representation assumptions.

Now suppose each w_j(0) is an iid copy of some random variable v. Then, by definition of J_0, \begin{aligned} \mathop{\mathbb{E}}_{w(0)} (J_0J_0^{\scriptscriptstyle\mathsf{T}})_{i,j} &= \mathop{\mathbb{E}}_{w(0)} \nabla f(x_i;w(0))^{\scriptscriptstyle\mathsf{T}}\nabla f(x_j;w(0)). \\ &= \mathop{\mathbb{E}}_{w(0)} \sum_k s_k^2 \sigma'(w_k(0)^{\scriptscriptstyle\mathsf{T}}x_i) \sigma'(w_k(0)^{\scriptscriptstyle\mathsf{T}}x_j) x_i^{\scriptscriptstyle\mathsf{T}}x_j \\ &= m \mathop{\mathbb{E}}_{v} \sigma'(v^{\scriptscriptstyle\mathsf{T}}x_i) \sigma'(v^{\scriptscriptstyle\mathsf{T}}x_j) x_i^{\scriptscriptstyle\mathsf{T}}x_j. \end{aligned} In other words, it seems reasonable to expect \sigma_{\min} and \sigma_{\max} to scale with \sqrt m.
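To see these scalings concretely, here is a small sketch (with illustrative assumptions not in the notes: Gaussian data normalized to the sphere, Gaussian w_j(0), signs s_j\in\{\pm1\}, and the smooth activation \tanh): it forms J_0 for f(x;w)=\sum_j s_j\sigma(w_j^{\scriptscriptstyle\mathsf{T}}x) and prints the extreme eigenvalues of J_0J_0^{\scriptscriptstyle\mathsf{T}} divided by m, which stabilize as m grows.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 20, 10
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # put the examples on the unit sphere

def kernel_eigs(m):
    W = rng.standard_normal((m, d))            # w_j(0), iid rows
    s = rng.choice([-1.0, 1.0], size=m)        # fixed signs s_j
    act_grad = 1.0 - np.tanh(X @ W.T) ** 2     # sigma'(w_j^T x_i), shape (n, m)
    # row i of J_0 is grad_w f(x_i; w(0)); stacking the m blocks gives shape (n, m*d)
    J0 = ((s * act_grad)[:, :, None] * X[:, None, :]).reshape(n, m * d)
    eigs = np.linalg.eigvalsh(J0 @ J0.T)
    return eigs.min() / m, eigs.max() / m

for m in (100, 1000, 10000):
    print(m, kernel_eigs(m))
```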

Initial risk \widehat{\mathcal{R}}_0. Let’s consider two different random initializations.

In the first case, we use one of the fancy schemes we mentioned to force f(w(0)) = 0; e.g., we can make sure that s_j is positive and negative an equal number of times, then sample w_j for s_j=+1, and then make w_j for s_j = -1 be the negation. With this choice, \widehat{\mathcal{R}}_0 = \|y\|^2/2 = \Theta(n).

On the other hand, if we do a general random initialization of both s_j and w_j, then we can expect enough cancellation that, roughly, f(x_i;w(0)) = \Theta(\sqrt m) (assuming the variance of w_j is a constant that does not depend on m; an m-dependent variance would defeat the purpose of separating out the scale parameter \alpha). Then \|\alpha f(w(0))\|^2 = \Theta( \alpha^2 m n ), whereby \widehat{\mathcal{R}}_0 = \Theta( \alpha^2 m n), and thus the lower bound condition on \alpha will need to be checked carefully.

Combining all parameters. Again let’s split into two cases, based on the initialization as discussed immediately above.

Possible values of \alpha. The two preceding cases considered lower bounds on \alpha. In the case \widehat{\mathcal{R}}_0 = \Theta(\alpha^2 n m), it even seemed that we can make \alpha whatever we want; in either case, the time required to make \widehat{\mathcal{R}}(\alpha f(w(t))) small will decrease as \alpha increases, so why not simply make \alpha arbitrarily large?

An issue occurs once we perform time discretization. Below, we will see that the smoothness of the model looks like \alpha^2 \sigma_{\max}^2 near initialization; as such, a time discretization, using tools such as in Theorem 10.3, will require a step size roughly 1 / (\alpha^2 \sigma_{\max}^2), and in particular while we may increase \alpha to force the gradient flow to seemingly converge faster, a smoothness-based time discretization will need the same number of steps.

As such, \alpha = 1 / \sigma_{\max} seems a reasonable way to simplify many terms in this shallow setup, which translates into a familiar 1/\sqrt{m} NTK scaling.

13.1 Proof of Theorem 13.1

Proof plan.

Remark 13.4.
That is to say, in this setting, \alpha large enough (m large enough in the shallow case) ensures that we stay in the NTK regime forever! This is not the general case.

The evolution in prediction space is \begin{aligned} \frac {{\text{d}}}{{\text{d}}t} \alpha f(w(t)) &= \alpha J_t \dot w(t) = -\alpha^2 J_t J_t^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f(w(t))), \\ &= -\alpha^2 J_t J_t^{\scriptscriptstyle\mathsf{T}}(\alpha f(w(t)) - y), \\ \frac {{\text{d}}}{{\text{d}}t} \alpha f_0(u(t)) &= \frac {{\text{d}}}{{\text{d}}t} \alpha \left({ f(w(0) + J_0(u(t) - w(0)) }\right) = \alpha J_0 \dot u(t) \\ &= -\alpha^2 J_0 J_0^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f_0(u(t))) \\ &= -\alpha^2 J_0 J_0^{\scriptscriptstyle\mathsf{T}}(\alpha f_0(u(t)) - y). \end{aligned}

The first one is complicated because we don’t know how J_t evolves.

But the second one can be written \frac {{\text{d}}}{{\text{d}}t} \left[{ \alpha f_0(u(t)) }\right] = -\alpha^2 \left({ J_0 J_0^{\scriptscriptstyle\mathsf{T}}}\right) \left[{ \alpha f_0(u(t)) }\right] +\alpha^2 \left({ J_0 J_0^{\scriptscriptstyle\mathsf{T}}}\right)y, which is a linear ODE in the predictions \alpha f_0(u(t)): it is a preconditioned gradient flow on the convex quadratic p \mapsto \frac 1 2 \|p - y\|^2 in prediction space.

Remark 13.5.
The original NTK paper, (Jacot, Gabriel, and Hongler 2018), had as its story that GF follows a gradient in kernel space. Seeing the evolution of \alpha f_0(u(t)) makes this clear, as it is governed by J_0J_0^{\scriptscriptstyle\mathsf{T}}, the Gram or kernel matrix!

Let's fantasize a little and suppose J_w J_w^{\scriptscriptstyle\mathsf{T}} is also positive definite. Do we still have a nice convergence theory?

Lemma 13.1.
Suppose \dot z(t) = -Q(t) \nabla\widehat{\mathcal{R}}(z(t)) and \lambda := \inf_{t\in[0,\tau]} \lambda_{\min}(Q(t))>0. Then for any t\in[0,\tau], \widehat{\mathcal{R}}(z(t)) \leq \widehat{\mathcal{R}}(z(0)) \exp( - 2 t \lambda ).
Remark 13.6.
A useful consequence is \|z(t) - y\| = \sqrt{2\widehat{\mathcal{R}}(z(t))} \leq \sqrt{2\widehat{\mathcal{R}}(z(0))\exp(-2t\lambda)} = \|z(0)-y\| \exp( - t \lambda ).

Proof. Mostly just repeating our old strong convexity steps, \begin{aligned} \frac {{\text{d}}} {{\text{d}}t} \frac 1 2 \|z(t) - y\|^2 &= \left\langle -Q(t) (z(t) - y) , z(t) - y \right \rangle \\ &\leq - \lambda_{\min}\left({ Q(t) }\right) \left\langle z(t) - y, z(t) - y \right \rangle \\ &\leq - 2\lambda \|z(t) - y\|^2/2, \end{aligned} and Grönwall’s inequality completes the proof.

We can also prove this setting implies we stay close to initialization.

Lemma 13.2.
Suppose \dot v(t) = - S_t^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(g(v(t))) and \frac {{\text{d}}}{{\text{d}}t} g(v(t)) = S_t \dot v(t), where S_tS_t^{\scriptscriptstyle\mathsf{T}}= Q_t and \lambda_i( Q_t ) \in [\lambda, \lambda_1] for t\in[0,\tau]. Then for t\in[0,\tau], \|v(t) - v(0)\| \leq \frac {\sqrt{\lambda_1}}{\lambda} \|g(v(0)) - y\| \leq \frac {\sqrt{2 \lambda_1 \widehat{\mathcal{R}}(g(v(0)))}}{\lambda}.

Proof. \begin{aligned} \|v(t) - v(0)\| &= \left\|{ \int_0^t \dot v(s) {\text{d}}s }\right\| \leq \int_0^t \|\dot v(s)\|{\text{d}}s \\ &= \int_0^t \|S_s^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(g(v(s)))\|{\text{d}}s \\ &\leq \sqrt{\lambda_1} \int_0^t \| g(v(s)) - y\|{\text{d}}s \\ &\stackrel{(*)}{\leq} \sqrt{\lambda_1} \|g(v(0)) - y \| \int_0^t \exp(- s\lambda) {\text{d}}s \\ &\leq \frac {\sqrt{\lambda_1}}{\lambda} \|g(v(0)) - y\| \\ &\leq \frac {\sqrt{2 \lambda_1 \widehat{\mathcal{R}}(g(v(0)))}}{\lambda}, \end{aligned} where (*) used Lemma 13.1.

Where does this leave us?

We can apply the previous two lemmas to the tangent model u(t), since for any t\geq 0, \dot u(t) = - \alpha J_0^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f_0(u(t))), \quad \frac {{\text{d}}}{{\text{d}}t} \alpha f_0(u(t)) = -\alpha^2 (J_0 J_0^{\scriptscriptstyle\mathsf{T}}) \nabla\widehat{\mathcal{R}}(\alpha f_0(u(t))). Thus since Q_0 := \alpha^2 J_0J_0^{\scriptscriptstyle\mathsf{T}} satisfies \lambda_i(Q_0) \in \alpha^2 [\sigma_{\min}^2, \sigma_{\max}^2], \begin{aligned} \widehat{\mathcal{R}}(\alpha f_0(u(t))) &\leq \widehat{\mathcal{R}}_0 \exp(- 2 t \alpha^2 \sigma_{\min}^2), \\ \|u(t) - u(0)\| &\leq \frac {\sqrt{2 \sigma_{\max}^2 \widehat{\mathcal{R}}_0}}{\alpha \sigma_{\min}^2}. \end{aligned}

How about w(t)?

Let’s relate (J_wJ_w^{\scriptscriptstyle\mathsf{T}}) to (J_0 J_0^{\scriptscriptstyle\mathsf{T}}).

Lemma 13.3.
Suppose \|w-w(0)\| \leq B = \frac {\sigma_{\min}}{2\beta}. Then \begin{aligned} \sigma_{\min}(J_w) &\geq \sigma_{\min}- \beta \|w - w(0)\|_2 \geq \frac {\sigma_{\min}}{2}, \\ \sigma_{\max}(J_w) &\leq \frac {3\sigma_{\max}}{2}. \end{aligned}

Proof. For the upper bound, \|J_w\| \leq \|J_0\| + \|J_w - J_0\| \leq \|J_0\| + \beta \|w - w(0)\| \leq \sigma_{\max}+ \beta B = \sigma_{\max}+ \frac {\sigma_{\min}}{2}. For the lower bound, given vector v define A_v := J_0^{\scriptscriptstyle\mathsf{T}}v and B_v := (J_w - J_0)^{\scriptscriptstyle\mathsf{T}}v, whereby \|A_v\| \geq \sigma_{\min}\|v\|, \qquad \|B_v\| \leq \|J_w - J_0\|\cdot\|v\| \leq \beta B \|v\|, and thus \begin{aligned} \sigma_{\min}(J_w)^2 &= \min_{\|v\| = 1} v^{\scriptscriptstyle\mathsf{T}}J_w J_w^{\scriptscriptstyle\mathsf{T}}v \\ &= \min_{\|v\| = 1} \left({ (J_0 + J_w - J_0)^{\scriptscriptstyle\mathsf{T}}v}\right)^{\scriptscriptstyle\mathsf{T}}(J_0 + J_w - J_0)^{\scriptscriptstyle\mathsf{T}}v \\ &= \min_{\|v\| = 1} \|A_v\|^2 + 2 A_v^{\scriptscriptstyle\mathsf{T}}B_v + \|B_v\|^2 \\ &\geq \min_{\|v\| = 1} \|A_v\|^2 - 2 \|A_v\|\cdot\| B_v\| + \|B_v\|^2 \\ &= \min_{\|v\| = 1} \left({ \|A_v\| - \| B_v\| }\right)^2 \geq \min_{\|v\| = 1} \left({ \sigma_{\min}- \beta B }\right)^2 \|v\|^2 = \left({ \frac {\sigma_{\min}}{2} }\right)^2 . \end{aligned}

Using this, for t\in[0,T], \dot w(t) = - \alpha J_w^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f(w(t))), \quad \frac {{\text{d}}}{{\text{d}}t} \alpha f(w(t)) = -\alpha^2 (J_w J_w^{\scriptscriptstyle\mathsf{T}}) \nabla\widehat{\mathcal{R}}(\alpha f(w(t))). Thus since Q_t := \alpha^2 J_tJ_t^{\scriptscriptstyle\mathsf{T}} satisfies \lambda_i(Q_t) \in \alpha^2 [\sigma_{\min}^2/4, 9\sigma_{\max}^2/4], \begin{aligned} \widehat{\mathcal{R}}(\alpha f(w(t))) &\leq \widehat{\mathcal{R}}_0 \exp(- t \alpha^2 \sigma_{\min}^2/2), \\ \|w(t) - w(0)\| &\leq \frac {3 \sqrt{8 \sigma_{\max}^2 \widehat{\mathcal{R}}_0}}{\alpha \sigma_{\min}^2} =: B'. \end{aligned} It remains to show that T=\infty. Invoke, for the first time, the assumed lower bound on \alpha, namely \alpha \geq \frac {\beta \sqrt{1152 \sigma_{\max}^2 \widehat{\mathcal{R}}_0}}{\sigma_{\min}^3}, which by the above implies then B' \leq \frac B 2. Suppose contradictorily that T < \infty; since t\mapsto w(t) is continuous, then t\mapsto \|w(t) - w(0)\| is also continuous and starts from 0, and therefore \|w(T) - w(0)\| = B>0 exactly. But due to the lower bound on \alpha, we also have \|w(T) - w(0)\| \leq \frac B 2 < B, a contradiction.

This completes the proof.

Remark 13.7 (retrospective) .

14 Nonsmoothness, Clarke differentials, and positive homogeneity

Smoothness and differentiability do not in general hold for us (ReLU, max-pooling, hinge loss, etc.).

One relaxation of the gradient is the subdifferential set {\partial_{\text{s}}} (whose elements are called subgradients), namely the set of slopes of tangents which lie below the function: {\partial_{\text{s}}}\widehat{\mathcal{R}}(w) := \left\{{ s \in\mathbb{R}^p : \forall w'\centerdot \widehat{\mathcal{R}}(w') \geq \widehat{\mathcal{R}}(w) + s^{\scriptscriptstyle\mathsf{T}}(w'-w)}\right\}.

One fun application is a short proof of Jensen’s inequality.

Lemma 14.1 (Jensen’s inequality) .
Suppose random variable X is supported on a set S, and f is convex on S. Then \mathop{\mathbb{E}}f(X) \geq f(\mathop{\mathbb{E}}X).

Proof. Choose any s\in{\partial_{\text{s}}}f(\mathop{\mathbb{E}}X), and note \mathop{\mathbb{E}}f(X) \geq \mathop{\mathbb{E}}\left[{f(\mathop{\mathbb{E}}(X)) + s^{\scriptscriptstyle\mathsf{T}}(X - \mathop{\mathbb{E}}X)}\right] = f(\mathop{\mathbb{E}}X).

Typically, we lack convexity, and the subdifferential set is empty.
Our main formalism is the Clarke differential (Clarke et al. 1998): \partial\widehat{\mathcal{R}}(w) := \textrm{conv}\left({\left\{{ s \in \mathbb{R}^p : \exists w_i \to w, \nabla\widehat{\mathcal{R}}(w_i)\to s }\right\}}\right).

Definition 14.1.
f is locally Lipschitz when for every point x, there exists a neighborhood S of x such that f is Lipschitz when restricted to S.

Key properties:

We can replace the gradient flow differential equation \dot w(t) = -\nabla\widehat{\mathcal{R}}(w(t)) with a differential inclusion: \dot w(t) \in -\partial \widehat{\mathcal{R}}(w(t)) \qquad \text{for a.e. } t\geq 0. If \widehat{\mathcal{R}} satisfies some technical structural conditions, then the following nice properties hold; these properties are mostly taken from (Lemma 5.2, Theorem 5.8, Davis et al. 2018) (where the structural condition is C^1 Whitney stratifiability), which was slightly generalized in (Ji and Telgarsky 2020) under o-minimal definability; another alternative, followed in (Lyu and Li 2019), is to simply assume that a chain rule holds.

This allows us to reprove our stationary point guarantee from an earlier lecture: since \begin{aligned} \widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(w(0)) = -\int_0^t \min\{ \|v\|^2 : v\in\partial \widehat{\mathcal{R}}(w(s)) \}{\text{d}}s \leq - t \min_{\substack{s\in [0,t]\\v\in-\partial\widehat{\mathcal{R}}(w(s))}} \|v\|^2, \end{aligned} then just as before \begin{aligned} \min_{\substack{s\in [0,t]\\v\in-\partial\widehat{\mathcal{R}}(w(s))}} \|v\|^2 \leq \frac{\widehat{\mathcal{R}}(w(0)) - \widehat{\mathcal{R}}(w(t))}{t}, \end{aligned} thus for some time s\in[0,t], we have an iterate w(s) which is an approximate stationary point.

Remark 14.1.
Let’s go back to \dot w(t) := \mathop{\mathrm{arg\,min}}\{\|v\| : v\in-\partial \widehat{\mathcal{R}}(w(t))\}, which we said will hold almost everywhere.

This is not satisfied by pytorch/tensorflow/jax/…

(Kakade and Lee 2018) gives some bad examples, e.g., x\mapsto\sigma(\sigma(x)) - \sigma(-x) with \sigma the ReLU, evaluated at 0. (Kakade and Lee 2018) also give a randomized algorithm for finding good subdifferentials.

Does it matter? In the NTK regime, few activations change. In practice, many change, but it’s unclear what their effect is.

14.1 Positive homogeneity

Another tool we will use heavily outside convexity is positive homogeneity.

Definition 14.2.
g is positive homogeneous of degree L when g(\alpha x) = \alpha^L g(x) for \alpha\geq 0. (We will only consider continuous g, so \alpha>0 suffices.)
Example 14.1.
Remark 14.2.
The math community also has a notion of homogeneity without positivity; the monomial example above works with \alpha<0. Homogeneity in math is often tied to polynomials and generalizations thereof.
Example 14.2.

14.2 Positive homogeneity and the Clarke differential

Let's work out an element of the Clarke differential for a ReLU network x \mapsto W_L \sigma_{L-1}(\cdots W_2 \sigma_1(W_1 x)). As a function of x, this mapping is 1-homogeneous and piecewise affine. As a function of w=(W_L,\ldots,W_1), it is L-homogeneous and piecewise polynomial. The boundary regions form a set of (Lebesgue) measure zero (with respect to either inputs or parameters).

Fixing x and considering w, interior to each piece, the mapping is differentiable. Due to the definition of Clarke differential, it therefore suffices to compute the gradients in all adjacent pieces, and then take their convex hull.

Remark 14.3.
Note that we are not forming the differential by choosing an arbitrary differential element for each ReLU: we are doing a more complicated region-based calculation. However, the former is what pytorch does.

So let's return to considering some w where f is differentiable. Let A_i be a diagonal matrix with activations of the output after layer i on the diagonal: A_i = \textrm{diag}\left({ \sigma'(W_i\sigma(\dots \sigma(W_1 x) \dots )) }\right), (note that x is baked in,) and so \sigma(r)=r\sigma'(r) implies layer i outputs x \mapsto A_i W_i\sigma(\dots \sigma(W_1 x) \dots ) = A_i W_i A_{i-1} W_{i-1} \cdots A_1 W_1 x, and the network outputs f(x; w) = W_L A_{L-1} W_{L-1} A_{L-2} \cdots A_1 W_1 x, and the gradient with respect to layer i is \frac {{\text{d}}}{{\text{d}}W_i} f(x; w) = (W_L A_{L-1} \cdots W_{i+1} A_i)^{\scriptscriptstyle\mathsf{T}}(A_{i-1} W_{i-1} \cdots W_1 x)^{\scriptscriptstyle\mathsf{T}}. Additionally \begin{aligned} \left\langle W_i , \frac {{\text{d}}}{{\text{d}}W_i} f(x; w) \right \rangle &= \left\langle W_i , (W_L A_{L-1} \cdots W_{i+1} A_i)^{\scriptscriptstyle\mathsf{T}}(A_{i-1} W_{i-1} \cdots W_1 x)^{\scriptscriptstyle\mathsf{T}} \right \rangle \\ &= \textrm{tr}\left({ W_i^{\scriptscriptstyle\mathsf{T}}(W_L A_{L-1} \cdots W_{i+1} A_i)^{\scriptscriptstyle\mathsf{T}}(A_{i-1} W_{i-1} \cdots W_1 x)^{\scriptscriptstyle\mathsf{T}} }\right) \\ &= \textrm{tr}\left({ (W_L A_{L-1} \cdots W_{i+1} A_i)^{\scriptscriptstyle\mathsf{T}}(W_i A_{i-1} W_{i-1} \cdots W_1 x)^{\scriptscriptstyle\mathsf{T}} }\right) \\ &= \textrm{tr}\left({ (W_i A_{i-1} W_{i-1} \cdots W_1 x)^{\scriptscriptstyle\mathsf{T}} (W_L A_{L-1} \cdots W_{i+1} A_i)^{\scriptscriptstyle\mathsf{T}} }\right) \\ &= \textrm{tr}\left({ W_L A_{L-1} \cdots W_{i+1} A_i W_i A_{i-1} W_{i-1} \cdots W_1 x }\right) \\ &= f(x; w), \end{aligned} and \left\langle W_i , \frac {{\text{d}}}{{\text{d}}W_i} f(x; w) \right \rangle = f(x; w) = \left\langle W_{i+1} , \frac {{\text{d}}}{{\text{d}}W_{i+1}} f(x; w) \right \rangle.

This calculation can in fact be made much more general (indeed with a simpler proof!).

Lemma 14.2.
Suppose f:\mathbb{R}^d\to\mathbb{R} is locally Lipschitz and L-positively homogeneous. For any w\in\mathbb{R}^d and s \in \partial f(w), \left\langle s, w \right \rangle =Lf(w).
Remark 14.4.
This statement appears in various places (Lyu and Li 2019); the version here is somewhat more general, and appears in (Ji and Telgarsky 2020).

Proof. If w = 0, then \left\langle s, w \right \rangle = 0 = Lf(w) for every s\in\partial f(w), so consider the case w\neq 0. Let D denote those w where f is differentiable, and consider the case that w\in D\setminus \{0\}. By the definition of gradient, \begin{aligned} \lim_{\delta\downarrow0}\frac{f(w+\delta w)-f(w)-\left\langle \nabla f(w), \delta w \right \rangle}{\delta\|w\|}=0, \end{aligned} and by using homogeneity in the form f(w+\delta w)=(1+\delta)^Lf(w) (for any \delta > 0), then \begin{aligned} 0 &= \lim_{\delta\downarrow0}\frac{\left({(1+\delta)^L-1}\right)f(w)-\left\langle \nabla f(w), \delta w \right \rangle}{\delta} = - \left\langle \nabla f(w), w \right \rangle + \lim_{\delta\downarrow0} f(w) \left({ L + \mathcal{O}(\delta) }\right), \end{aligned} which implies \left\langle w, \nabla f(w) \right \rangle=Lf(w).

Now consider w \in \mathbb{R}^d \setminus D \setminus \{0\}. For any sequence (w_i)_{i\geq 1} in D with \lim_i w_i=w for which there exists a limit s := \lim_i \nabla f(w_i), then \begin{aligned} \left\langle w, s \right \rangle=\lim_{i\to\infty}\left\langle w_i, \nabla f(w_i) \right \rangle=\lim_{i\to\infty}Lf(w_i)=Lf(w). \end{aligned} Lastly, for any element s\in\partial f(w) written in the form s = \sum_i \alpha_i s_i where \alpha_i\geq 0 satisfy \sum_i \alpha_i = 1 and each s_i is a limit of a sequence of gradients as above, then \left\langle w, s \right \rangle = \left\langle w, \sum_i \alpha_i s_i \right \rangle = \sum_i \alpha_i \left\langle w, s_i \right \rangle = \sum_i \alpha_i L f(w) = Lf(w).
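As a concrete check of Lemma 14.2 (a sketch under illustrative assumptions: a two-layer ReLU network f(x;W,v)=v^{\scriptscriptstyle\mathsf{T}}\sigma(Wx), which is 2-positively-homogeneous in the parameters (W,v), with gradients written by hand at a generic point of differentiability), the identity \left\langle w, \nabla f(w)\right\rangle = Lf(w) can be verified numerically.

```python
import numpy as np

rng = np.random.default_rng(6)
d, m = 4, 7
x = rng.standard_normal(d)
W = rng.standard_normal((m, d))
v = rng.standard_normal(m)

relu = lambda z: np.maximum(z, 0.0)
f = v @ relu(W @ x)                      # f(x; W, v), 2-homogeneous in (W, v)

grad_v = relu(W @ x)                     # df/dv
grad_W = np.outer(v * (W @ x > 0), x)    # df/dW (valid where W x has no zero entries)

inner = np.sum(W * grad_W) + v @ grad_v  # <w, grad f(w)>, summed over both layers
print(np.isclose(inner, 2 * f))          # equals L f(w) with L = 2
```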

14.3 Norm preservation

If predictions are positive homogeneous with respect to each layer, then gradient flow preserves norms of layers.

Lemma 14.3 (Simon S. Du, Hu, and Lee (2018)) .
Suppose for \alpha >0, f(x; (W_L,\ldots, \alpha W_i,\ldots,W_1)) = \alpha f(x; w) (predictions are 1-homogeneous in each layer). Then for every pair of layers (i,j), the gradient flow maintains \frac 1 2 \|W_i(t)\|^2 - \frac 1 2 \|W_i(0)\|^2 = \frac 1 2 \|W_j(t)\|^2 - \frac 1 2 \|W_j(0)\|^2.
Remark 14.5.
We’ll assume a risk of the form \mathop{\mathbb{E}}_k \ell(y_k f(x_k;w)), but it holds more generally. We are also tacitly assuming we can invoke the chain rule, as discussed above.

Proof. Defining \ell_k'(s) := y_k \ell'(y_k f(x_k; w(s))), and fixing a layer i, \begin{aligned} \frac 1 2 \|W_i(t)\|^2 - \frac 1 2 \|W_i(0)\|^2 &= \int_0^t \frac {{\text{d}}} {{\text{d}}s} \frac 1 2 \|W_i(s)\|^2 {\text{d}}s \\ &= \int_0^t \left\langle W_i(s) , \dot W_i(s) \right \rangle {\text{d}}s \\ &= \int_0^t \left\langle W_i(s) , -\mathop{\mathbb{E}}_k \ell'_k(s) \frac {{\text{d}}f(x_k; w(s))}{{\text{d}}W_i(s)} \right \rangle{\text{d}}s \\ &= - \int_0^t \mathop{\mathbb{E}}_k \ell'_k(s) \left\langle W_i(s) , \frac {{\text{d}}f(x_k; w(s))}{{\text{d}}W_i(s)} \right \rangle{\text{d}}s \\ &= - \int_0^t \mathop{\mathbb{E}}_k \ell'_k(s) f(x_k; w(s)) {\text{d}}s. \end{aligned} This final expression does not depend on i, which gives the desired equality.
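A hedged simulation of this norm preservation (not from the notes; gradient flow is approximated by GD with a tiny step size, so the equality holds only approximately): for the same two-layer ReLU model under the logistic loss, the per-layer quantities \frac12\|W_i(t)\|^2-\frac12\|W_i(0)\|^2 track each other.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, m = 30, 5, 16
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
W = 0.5 * rng.standard_normal((m, d))
v = 0.5 * rng.standard_normal(m)
W0_sq, v0_sq = np.sum(W * W), np.sum(v * v)

eta, steps = 1e-3, 20000  # tiny steps to approximate the gradient flow
for _ in range(steps):
    pre = X @ W.T                                   # (n, m) preactivations
    hidden = np.maximum(pre, 0.0)
    f = hidden @ v                                  # (n,) predictions
    c = -y * 0.5 * (1.0 - np.tanh(0.5 * y * f))     # y_k * ell'(y_k f_k), logistic loss
    grad_v = (c[:, None] * hidden).mean(axis=0)
    grad_W = ((c[:, None] * (pre > 0) * v)[:, :, None] * X[:, None, :]).mean(axis=0)
    v = v - eta * grad_v
    W = W - eta * grad_W

# the two printed quantities are approximately equal, per Lemma 14.3
print(0.5 * (np.sum(W * W) - W0_sq), 0.5 * (np.sum(v * v) - v0_sq))
```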

Remark 14.6.
One interesting application is to classification losses like \exp(-z) and \ln(1+\exp(-z)), where \widehat{\mathcal{R}}(w)\to 0 implies \min_k y_k f(x_k; w) \to \infty.

This by itself implies \|W_j\| \to \infty for some j; combined with norm preservation, \min_j \|W_j\| \to \infty !

need to update this in light of the new material i’ve included?

14.4 Smoothness inequality adapted to ReLU

Let's consider a single hidden ReLU layer, with only the bottom layer trainable: f(x;w) := \frac {1}{\sqrt m} \sum_j a_j \sigma(\left\langle x, w_j \right \rangle), \qquad a_j \in \{\pm 1\}. Let W_s\in \mathbb{R}^{m\times d} denote the parameters at time s, and suppose \|x\|\leq 1. \begin{aligned} \frac {{\text{d}}f(x;W)}{{\text{d}}W} &= \begin{bmatrix} a_1 x \sigma'(w_1^{\scriptscriptstyle\mathsf{T}}x) / \sqrt{m} \\ \vdots \\ a_m x \sigma'(w_m^{\scriptscriptstyle\mathsf{T}}x) / \sqrt{m} \end{bmatrix}, \\ \left\|{ \frac {{\text{d}}f(x;W)}{{\text{d}}W} }\right\|_{{\textrm{F}}}^2 &= \sum_j \left\|{ a_j x \sigma'(w_j^{\scriptscriptstyle\mathsf{T}}x) / \sqrt{m}}\right\|^2_2 \leq \frac 1 m \sum_j \left\|{ x }\right\|^2_2 \leq 1. \end{aligned}

We'll use the logistic loss, whereby \begin{aligned} \ell(z) &= \ln(1+\exp(-z)), \\ \ell'(z) &= \frac {-\exp(-z)}{1+ \exp(-z)} \in (-1,0), \\ \widehat{\mathcal{R}}(W) &:= \frac 1 n \sum_k \ell(y_k f(x_k;W)). \end{aligned} A key fact is |\ell'(z)| = -\ell'(z) \leq \ell(z), which follows from \ln(1+r) \geq \frac{r}{1+r} applied with r = \exp(-z), whereby \begin{aligned} \frac {{\text{d}}\widehat{\mathcal{R}}}{{\text{d}}W} &= \frac 1 n \sum_k \ell'(y_k f(x_k; W)) y_k \nabla_W f(x_k; W), \\ \left\|{ \frac {{\text{d}}\widehat{\mathcal{R}}}{{\text{d}}W} }\right\|_{{\textrm{F}}} &\leq \frac 1 n \sum_k | \ell'(y_k f(x_k; W)) | \cdot \|y_k \nabla_W f(x_k; W)\|_{{\textrm{F}}} \\ &\leq \frac 1 n \sum_k | \ell'(y_k f(x_k; W)) | \leq \min\left\{{ 1, \widehat{\mathcal{R}}(W) }\right\}. \end{aligned}

Now we can state a non-smooth, non-convex analog of Theorem 10.3.

Lemma 14.4 ((Lemma 2.6, Ji and Telgarsky 2019a)) .
If \eta\leq 1, for any Z, \|W_t - Z\|_{{\textrm{F}}}^2 + \eta \sum_{i<t} \widehat{\mathcal{R}}^{(i)}(W_i) \leq \|W_0 - Z\|_{{\textrm{F}}}^2 + 2 \eta \sum_{i<t} \widehat{\mathcal{R}}^{(i)}(Z), where \widehat{\mathcal{R}}^{(i)}(W) = \frac 1 n \sum_k \ell(y_k \left\langle W, \nabla f(x_k; W_i) \right \rangle).
Remark 14.7.

Proof. Using the squared distance potential as usual, \begin{aligned} \|W_{i+1}-Z\|_{\textrm{F}}^2 &= \|W_{i}-Z\|_{\textrm{F}}^2 - 2 \eta \left\langle \nabla\widehat{\mathcal{R}}(W_i), W_i - Z \right \rangle + \eta^2 \|\nabla\widehat{\mathcal{R}}(W_i)\|_{\textrm{F}}^2, \end{aligned} where \displaystyle \|\nabla\widehat{\mathcal{R}}(W_i)\|_{\textrm{F}}^2 \leq \|\nabla\widehat{\mathcal{R}}(W_i)\|_{\textrm{F}}\leq \widehat{\mathcal{R}}(W_i) = \widehat{\mathcal{R}}^{(i)}(W_i), and \begin{aligned} & n \left\langle \nabla\widehat{\mathcal{R}}(W_i), Z-W_i \right \rangle \\ &= \sum_k y_k \ell'(y_kf(x_k;W_i)) \left\langle \nabla_W f(x_k;W_i) , Z-W_i \right \rangle \\ &= \sum_k \ell'(y_kf(x_k;W_i)) \left({y_k \left\langle \nabla_W f(x_k;W_i) , Z \right \rangle - y_k f(x_k;W_i) }\right) \\ &\leq \sum_k \left({\ell(y_k \left\langle \nabla_W f(x_k; W_i) , Z \right \rangle ) - \ell(y_k f(x_k;W_i)) }\right) \\ &= n \left({\widehat{\mathcal{R}}^{(i)}(Z) - \widehat{\mathcal{R}}^{(i)}(W_i)}\right). \end{aligned} Together, \begin{aligned} \|W_{i+1}-Z\|_{\textrm{F}}^2 &\leq \|W_{i}-Z\|_{\textrm{F}}^2 + 2\eta \left({\widehat{\mathcal{R}}^{(i)}(Z) - \widehat{\mathcal{R}}^{(i)}(W_i)}\right) + \eta \widehat{\mathcal{R}}^{(i)}(W_i); \end{aligned} applying \sum_{i<t} to both sides gives the bound.

15 Margin maximization and implicit bias

During 2015-2016, various works pointed out that deep networks generalize well, even though parameter norms are large and there is no explicit regularization (Neyshabur, Tomioka, and Srebro 2014; Zhang et al. 2017). This prompted authors to study the implicit bias of gradient descent, the first such result being an analysis of linear predictors with linearly separable data, showing that gradient descent on the cross-entropy loss is implicitly biased towards a maximum margin direction (Soudry, Hoffer, and Srebro 2017).

This in turn inspired many other works, handling other types of data, networks, and losses (Ji and Telgarsky 2019b, 2018, 2020; Gunasekar et al. 2018a; Lyu and Li 2019; Chizat and Bach 2020; Ji et al. 2020).

Margin maximization of first-order methods applied to exponentially-tailed losses was first proved for coordinate descent (Telgarsky 2013). The basic proof scheme there was pretty straightforward, and based on the similarity of the empirical risk (after the monotone transformation \ln(\cdot)) to \ln \sum \exp, itself similar to \max(\cdot) and thus to margin maximization; we will use this connection as a basis for all proofs in this section (see also (Ji and Telgarsky 2019b; Gunasekar et al. 2018b)).

Throughout this section, fix training data ((x_i,y_i))_{i=1}^n, and define an (unnormalized) margin mapping m_i(w) := y_i f(x_i;w); by this choice, we can also conveniently write an unnormalized risk \mathcal L: \mathcal L(w) := \sum_i \ell(m_i(w)) = \sum_i \ell(y_i f(x_i;w)). Throughout this section, we will always assume f is locally-Lipschitz and L-homogeneous in w, which also means each m_i is locally-Lipschitz and L-homogeneous.

We will also use the exponential loss \ell(z) = \exp(-z). The results go through for similar losses.

Remark 15.1 (generalization) .
As hinted before, margin maximization is one way gradient descent prefers a solution which has a hope to generalize well, and not merely achieve low empirical risk. This low generalization error of large-margin predictors will appear explicitly later on in section 18.4.
Remark 15.2 (implicit bias) .
As mentioned above, the proofs here will show implicit margin maximization, which is enough to invoke the generalization theory in section 18.4. However, in certain cases it is valuable to moreover prove convergence rates to the maximum margin direction. In the linear case, it is possible to convert a margin maximization rate to an implicit bias rate, however the rate degrades by a factor \sqrt{\cdot} (Ji and Telgarsky 2019b); analyzing the implicit bias without degradation in the rate is more involved, and not treated here (Soudry, Hoffer, and Srebro 2017).
Remark 15.3 (squared loss) .
While the focus here is on losses with exponential tails and on bias towards the maximum margin direction, there are also many works (not further discussed here) which consider the squared loss (Gunasekar et al. 2017; Arora, Cohen, et al. 2018b, 2019).

15.1 Separability and margin maximization

We just said “maximum margin” and “separable data.” What do these mean?

Consider a linear predictor, meaning x\mapsto \left\langle w, x \right \rangle for some w\in\mathbb{R}^d. This w “separates the data” if y_i and \textrm{sgn}(\left\langle w, x_i \right \rangle) agree, which we can formalize as the condition of strict separability, namely \min_i y_i \left\langle w, x_i \right \rangle > 0. It seems reasonable (a nice inductive bias) to be as far from 0 as possible: \max_{w\in ?} \min_i y_i \left\langle w, x_i \right \rangle. The “?” indicates that we must somehow normalize or constrain, since otherwise, for separable data, this \max becomes a \sup and has value +\infty.

Definition 15.1.
Data is linearly separable when there exists w\in\mathbb{R}^d so that \min_i y_i \left\langle w, x_i \right \rangle > 0. In this situation, the (\ell_2) maximum margin predictor (which is unique!) is given by \bar u:= \mathop{\mathrm{arg\,max}}_{\|w\|=1}\min_i y_i \left\langle w, x_i \right \rangle, and the margin is \gamma := \min_i y_i \left\langle \bar u, x_i \right \rangle.
Remark 15.4.
This concept has a long history. Margins first appeared in the classical perceptron analysis (Novikoff 1962), and maximum margin predictors were a guiding motivation for the SVM. need to add many more refs

Consider now the general case of L-homogeneous predictors, where y_i\left\langle w, x_i \right \rangle is replaced by m_i(w).

Proposition 15.1.
Suppose f(x;w) is L-homogeneous in w, \ell is the exponential loss, and there exists \hat w with \widehat{\mathcal{R}}({\hat w}) < \ell(0)/n. Then \inf_w \widehat{\mathcal{R}}(w) = 0, and the infimum is not attained.

Proof. Note \max_i \ell(m_i(\hat w)) \leq \sum_i \ell(m_i(\hat w)) = n \widehat{\mathcal{R}}(\hat w) < \ell(0), thus applying the (decreasing) inverse \ell^{-1} to both sides gives \min_i m_i(\hat w) > 0. Therefore 0 \leq \inf_w \widehat{\mathcal{R}}(w) \leq \limsup_{c\to\infty}\widehat{\mathcal{R}}(c{\hat w}) = \frac 1 n \sum_i \limsup_{c\to\infty} \ell(m_i(c{\hat w})) = \frac 1 n \sum_i \limsup_{c\to\infty} \ell(c^L m_i({\hat w})) = 0.

This seems to be problematic; how can we “find” an “optimum,” when solutions are off at infinity? Moreover, we do not even have unique directions, nor a way to tell different ones apart!

We can use margins, now appropriately generalized to the L-homogeneous case, to build towards a better-behaved objective function. First note that since \min_i m_i(w) = \|w\|^L \min_i m_i\left({\frac{w}{\|w\|}}\right), we can compare different directions by normalizing the margin by \|w\|^L. Moreover, again using the exponential loss, \begin{aligned} \frac {\ell^{-1}\left({\mathcal L(w)}\right)}{\|w\|^L} + \frac{\ln(n)}{\|w\|^L} &= \frac {\ell^{-1}\left({\sum_i \ell(m_i(w))/n}\right)}{\|w\|^L} \geq \frac {\min_i m_i(w)}{\|w\|^L} \\ &= \frac {\ell^{-1}\left({\max_i \ell(m_i(w))}\right)}{\|w\|^L} \\ &\geq \frac{\ell^{-1}\left({ \mathcal L(w) }\right)}{\|w\|^L}. \end{aligned}(3)

This motivates the following definition.

Definition 15.2.
Say the data is \vec m-separable when there exists w so that \min_i m_i(w) > 0. Define the margin, maximum margin, and smooth margin respectively as \gamma(w) := \min_i m_i(w/\|w\|) = \frac {\min_i m_i(w)}{\|w\|^L}, \qquad {\bar{\gamma}}:= \max_{\|w\|= 1} \gamma(w), \qquad {\tilde{\gamma}}(w) := \frac {\ell^{-1}(\mathcal L(w))}{\|w\|^L}. (4)
Remark 15.5.
The terminology “smoothed margin” is natural for L-homogeneous predictors, but even so it seems to have only appeared recently in (Lyu and Li 2019). In the 1-homogeneous case, the smoothed margin appeared much earlier, indeed throughout the boosting literature (Schapire and Freund 2012).
Remark 15.6 (multiclass margins) .
There is also a natural notion of multiclass margin: \min_i \frac {f(x_i;w)_{y_i} - \max_{j \neq y_i} f(x_i;w)_j}{\|w\|^L}. The natural loss to consider in this setting is the cross-entropy loss.

The basic properties can be summarized as follows.

Proposition 15.2.

Suppose data is \vec m-separable. Then:

Proof. The first part follows by continuity of m_i(w) and compactness of \{w\in\mathbb{R}^p : \|w\|=1\}, and the second from eq. 4 and eq. 3.

Remark 15.7.
For the linear case, margins have a nice geometric interpretation. This is not currently true for the general homogeneous case: there is no known reasonable geometric characterization of large margin predictors even for simple settings.

15.2 Gradient flow maximizes margins of linear predictors

Let’s first see how far we can get in the linear case, using one of our earlier convex optimization tools, namely Theorem 10.4.

Lemma 15.1.
Consider the linear case, with linearly separable data and the exponential loss, \max_i \|x_iy_i\|\leq 1, and w(0) = 0. Then \begin{aligned} \mathcal L(w(t)) &\leq \frac {1 + \ln(2tn\gamma^2)^2}{2t\gamma^2}, \\ \|w(t)\| &\geq \ln(2tn\gamma^2) - \ln\left({1+ \ln (2tn\gamma^2)^2}\right). \end{aligned}
Remark 15.8.
The intuition we will follow for the proof is: for every unit of norm, the (unnormalized) margin increases by at least \gamma. Thus the margin bias affects the entire gradient descent process.

Later, when we study the L-homogeneous case, we are only able to show that for every unit of norm (to the power L), the (unnormalized) margin increases by at least the current margin, which implies the margin is nondecreasing, but not that it is maximized.

Proof. By Theorem 10.4 with z = \ln(c)\bar u/\gamma for some c>0, \begin{aligned} \mathcal L(w(t)) &\leq \mathcal L(z) + \frac {1}{2t}\left({\|z\|^2 - \|w(t)-z\|^2}\right) \leq \sum_i \ell(m_i(z)) + \frac {\|z\|^2}{2t} \\ &\leq \sum_i \exp(-\ln(c)) + \frac {\ln(c)^2}{2t\gamma^2} = \frac {n}{c} + \frac {\ln(c)^2}{2t\gamma^2}, \end{aligned} and the first bound follows from the choice c := 2tn\gamma^2. For the lower bound on \|w(t)\|, since m_i(w(t)) \leq \|x_iy_i\|\cdot\|w(t)\| \leq \|w(t)\| and \ell is decreasing, the preceding inequality gives \ell\left({ \|w(t)\| }\right) \leq \min_i \ell(m_i(w(t))) \leq \frac 1 n \mathcal L(w(t)) \leq \frac {1+\ln(2tn\gamma^2)^2}{2tn\gamma^2}, and the second bound follows by applying \ell^{-1} to both sides.

This nicely shows that we decrease the risk to 0, but not that we maximize margins. For this, we need a more specialized analysis.

Theorem 15.1.
Consider the linear case, with linearly separable data and the exponential loss, \max_i \|x_iy_i\|\leq 1, and w(0) = 0. Then \gamma(w(t)) \geq {\tilde{\gamma}}(w(t)) \geq {\bar{\gamma}}- \frac {\ln n}{\ln t + \ln(2n\gamma^2) - 2\ln\ln(2tne\gamma^2)}. need to check some constants. also that denominator is hideous, maybe require slightly larger t to remove it?

Proof. For convenience, define u(t) := \ell^{-1}(\mathcal L(w(t))) and v(t) := \|w(t)\|, whereby \gamma(w(t)) = \frac {u(t)}{v(t)} = \frac{u(0)}{v(t)} + \frac {\int_0^t\dot u(s){\text{d}}s}{v(t)}. Let’s start by lower bounding the second term. Since \ell' = -\ell, \begin{aligned} \dot u(t) &= \left\langle \frac{ -\nabla \mathcal L(w(t))}{\mathcal L(w(t))}, \dot w(t) \right \rangle = \frac {\|\dot w(t)\|^2}{\mathcal L(w(t))}, \\ \|\dot w(s)\| &\geq \left\langle \dot w(s), \bar u \right \rangle = \left\langle -\sum_i x_iy_i \ell'(m_i(w(s))), \bar u \right \rangle \\ &= \sum_i\ell(m_i(w(s))) \left\langle x_iy_i , \bar u \right \rangle \geq \gamma\sum_i\ell(m_i(w(s))) = \gamma \mathcal L(w(s)), \\ v(t) &= \|w(t)-w(0)\| = \left\|{\int_0^t \dot w(s){\text{d}}s}\right\| \leq \int_0^t \left\|{\dot w(s)}\right\|{\text{d}}s, \end{aligned} thus \begin{aligned} \frac {\int_0^t\dot u(s){\text{d}}s}{v(t)} &\geq \frac {\int_0^t \frac {\|\dot w(s)\|^2}{\mathcal L(w(s))} {\text{d}}s}{v(t)} = \frac {\int_0^t \|\dot w(s)\| \frac {\|\dot w(s)\|}{\mathcal L(w(s))} {\text{d}}s}{v(t)} \geq \frac {\gamma \int_0^t \|\dot w(s)\|{\text{d}}s}{v(t)} = \gamma. \end{aligned}

For the first term u(0) / v(t), note \mathcal L(w(0)) = n and thus u(0) = -\ln n, whereas by the lower bound on \|w(t)\| from Lemma 15.1, \frac {u(0)}{v(t)} = \frac {-\ln(n)}{\|w(t)\|} \geq \frac {-\ln(n)}{\ln(t) + \ln(2n\gamma^2) - 2\ln\ln(2tne\gamma^2)}. Combining these inequalities gives the bound.

We are maximizing margins, but at a glacial rate of 1/\ln(t)!
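
To see this glacial rate concretely, here is a rough numerical sketch: small-step gradient descent on the exponential loss stands in for the gradient flow, and the dataset, step size, and horizon are all made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up linearly separable data with ||x_i y_i|| <= 1.
n, d = 32, 2
X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

# Small-step gradient descent on L(w) = sum_i exp(-m_i(w)), a crude stand-in
# for the gradient flow; watch the normalized margin creep up like 1/ln(t).
w = np.zeros(d)
eta = 0.1
for t in range(1, 200001):
    m = y * (X @ w)
    losses = np.exp(-m)
    grad = -(X * (y * losses)[:, None]).sum(axis=0)
    w -= eta * grad
    if t in (10, 100, 1000, 10000, 100000, 200000):
        m_now = y * (X @ w)
        print(f"t={t:>6d}  ||w||={np.linalg.norm(w):7.3f}  margin={m_now.min()/np.linalg.norm(w):.4f}")
```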

To get some inspiration, notice that we keep running into \ell^{-1}(\mathcal L(w)) in all the analysis. Why don’t we just run gradient flow on this modified objective? In fact, up to a rescaling of time, the two gradient flows are the same!

Remark 15.9 (time rescaling) .
Let w(t) be given by gradient flow on \mathcal L(w(t)), and define a time rescaling h(t) via integration, namely so that \dot h(t) = 1/\mathcal L(w(h(t))) > 0. Then, by the substitution rule for integration, \begin{aligned} w(t) - w(0) &= \int_0^t \dot w(s){\text{d}}s = - \int_0^t \nabla\mathcal L(w(s)){\text{d}}s = -\int_{h^{-1}([0,t])} \nabla\mathcal L(w(h(s))) \dot h(s) {\text{d}}s \\ &= -\int_{h^{-1}([0,t])} \frac{\nabla\mathcal L(w(h(s)))}{\mathcal L(w(h(s)))} {\text{d}}s = -\int_{h^{-1}([0,t])} \nabla\ln \mathcal L(w(h(s))) {\text{d}}s. \end{aligned} As such, the gradient flow on \mathcal L and on \ell^{-1}\circ \mathcal L are the same, modulo a time rescaling. This perspective was first explicitly stated by Chizat and Bach (2020), though analyses using this rescaled time (and alternate flow characterization) existed before Lyu and Li (2019).
Theorem 15.2 (time-rescaled flow) .
Consider linear predictors with linearly separable data, the exponential loss, and \max_i \|x_iy_i\|\leq 1. Suppose \dot \theta(t) := \nabla_\theta \ell^{-1} \mathcal L(\theta(t)). Then \gamma(\theta(t)) \geq {\tilde{\gamma}}(\theta(t)) \geq \gamma - \frac {\ln n}{t\gamma^2-\ln n}.

Proof. We start as before: set u(t) := \ell^{-1}\mathcal L(\theta(t)) and v(t) := \|\theta(t)\|; then {\tilde{\gamma}}(\theta(t)) = \frac {u(t)}{v(t)} = \frac {u(0)}{v(t)} + \frac {\int_0^t \dot u(s){\text{d}}s}{v(t)} = \frac {-\ln n}{v(t)} + \frac {\int_0^t \dot u(s){\text{d}}s}{v(t)}. Bounding these terms is now much simpler than for the regular gradient flow. Since \dot\theta(s) = \nabla\ell^{-1}\mathcal L(\theta(s)) = -\nabla \ln \mathcal L(\theta(s)), \begin{aligned} \|\dot \theta(s)\| &\geq \left\langle \dot\theta(s), \bar u \right \rangle = \sum_i \frac {-\ell'(m_i(\theta(s)))}{\mathcal L(\theta(s))} \left\langle x_iy_i, \bar u \right \rangle \geq \gamma \sum_i \frac {-\ell'(m_i(\theta(s)))}{\mathcal L(\theta(s))} = \gamma, \\ \dot u(s) &= \left\langle \nabla \ell^{-1} \mathcal L(\theta(s)), \dot \theta(s) \right \rangle = \|\dot \theta(s)\|^2, \end{aligned} thus \begin{aligned} \ell^{-1}\mathcal L(\theta(t)) &= \ell^{-1}\mathcal L(\theta(0)) + \int_0^t \frac {{\text{d}}}{{\text{d}}s} \ell^{-1}\mathcal L(\theta(s)){\text{d}}s \geq -\ln(n) + t\gamma^2, \\ \frac{\int_0^t \dot u(s){\text{d}}s}{v(t)} &= \frac{\int_0^t \|\dot \theta(s)\|^2{\text{d}}s}{v(t)} \geq \frac{\gamma \int_0^t \|\dot \theta(s)\|{\text{d}}s}{v(t)} \geq \frac{\gamma \left\|{\int_0^t \dot \theta(s){\text{d}}s}\right\|}{v(t)} = \gamma. \end{aligned} On the other hand, \|\theta(t)\|\gamma \geq \|\theta(t)\| \gamma(\theta(t)) \geq \ell^{-1}\mathcal L(\theta(t)) \geq t\gamma^2 - \ln(n). Together, using \gamma \leq 1, {\tilde{\gamma}}(\theta(t)) = \frac {u(t)}{v(t)} \geq \gamma - \frac {\gamma\ln(n)}{t\gamma^2 -\ln n} \geq \gamma - \frac {\ln(n)}{t\gamma^2 -\ln n}.
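
Here is the analogous sketch for the rescaled flow, discretized by dividing each gradient step by the current loss; again the dataset and constants are made up, and this is only an illustration of the rate in the theorem, not part of its proof.

```python
import numpy as np

rng = np.random.default_rng(2)

# Same kind of made-up separable data as before.
n, d = 32, 2
X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

# Discretized time-rescaled flow: theta' = -grad L(theta) / L(theta),
# i.e., gradient flow on ell^{-1} o L = -ln L for the exponential loss.
theta = np.zeros(d)
eta = 0.1
for t in range(1, 5001):
    m = y * (X @ theta)
    losses = np.exp(-m)
    grad = -(X * (y * losses)[:, None]).sum(axis=0)
    theta -= eta * grad / losses.sum()
    if t in (10, 100, 1000, 5000):
        m_now = y * (X @ theta)
        print(f"t={t:>4d}  margin={m_now.min()/np.linalg.norm(theta):.4f}  "
              f"smoothed={-np.log(np.exp(-m_now).sum())/np.linalg.norm(theta):.4f}")
```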

Remark 15.10.
The preceding two proofs are simplified from (Ji and Telgarsky 2019b), but follow a general scheme from the (coordinate descent!) analysis in (Telgarsky 2013); this scheme was also followed in (Gunasekar et al. 2018b). The proof in (Soudry, Hoffer, and Srebro 2017) is different, and is based on an SVM analogy, since {\tilde{\gamma}}\to \gamma.
Note also that the proofs here do not show that w(t)/\|w(t)\| converges to the (unique) maximum margin linear separator, which is easy to do with worse rates, and harder to do with good rates. However, large margins are sufficient for generalization in the linear case, as in section 18.4.

15.3 Smoothed margins are nondecreasing for homogeneous functions

In the nonlinear case, we do not have a general result, and instead only prove that smoothed margins are nondecreasing.

Theorem 15.3 (originally from (Lyu and Li 2019), simplification due to (Ji 2020)) .
Suppose there exists t_0 with {\tilde{\gamma}}(w(t_0)) > 0. Then t\mapsto {\tilde{\gamma}}(w(t)) is nondecreasing along [t_0,\infty).

Proof. Write {\tilde{\gamma}}_t := {\tilde{\gamma}}(w(t)) = u_t/v_t, where u_t := \ell^{-1}(\mathcal L(w_t)), \qquad v_t := \|w_t\|^L. By the quotient rule, \frac {{\text{d}}}{{\text{d}}t} {\tilde{\gamma}}_t = \frac {\dot u_t v_t - \dot v_t u_t}{v_t^2}. Therefore it suffices to show that v_t > 0 and that the numerator is nonnegative.

First note a technical fact that since -\ell' = \ell > 0, \begin{aligned} \left\langle w, \dot w \right \rangle &= \sum_j - \ell'(m_j(w)) \left\langle w, \nabla m_j(w) \right \rangle = L \sum_j - \ell'(m_j(w)) m_j(w). \\ &= L \sum_j - \ell'(m_j(w)) \ell^{-1}(\ell(m_j(w))) \geq L \sum_j - \ell'(m_j(w)) \ell^{-1}(\mathcal L(w)) \\ &= L \mathcal L(w) \ell^{-1}(\mathcal L(w)). \end{aligned} Moreover, \dot v_t = \frac {{\text{d}}} {{\text{d}}t} \left\langle w_t, w_t \right \rangle^{L/2} = \frac {L}{2} \left\langle w_t, w_t \right \rangle^{L/2-1} 2 \left\langle w_t, \dot w_t \right \rangle = L \|w_t\|^{L-2} \left\langle w_t, \dot w_t \right \rangle. Consequently, \begin{aligned} \dot v_t &= L \|w_t\|^{L-1} \left\langle \frac{w_t}{\|w_t\|}, \dot w_t \right \rangle \leq L \|w_t\|^{L-1} \sup_{\|v\|\leq 1} \left\langle v, \dot w_t \right \rangle = L \|w_t\|^{L-1} \|\dot w_t\|, \\ \dot v_t &\geq L^2 \|w_t\|^{L-2} \mathcal L(w_t) \ell^{-1}(\mathcal L(w_t)) \end{aligned}

For \dot u_t, using \ell(z) = \exp(-z) and the earlier lower bound on \left\langle w, \dot w \right \rangle (this is the key step, differing from the linear proof), \begin{aligned} \dot u_t & = - \frac {\left\langle \nabla\mathcal L(w_t), \dot w_t \right \rangle}{\mathcal L(w_t)} = \frac {\|\dot w_t\|^2}{\mathcal L(w_t)} \geq \frac {\|\dot w_t\|}{\mathcal L(w_t)\|w_t\|} \left\langle \dot w_t, w_t \right \rangle \geq \frac {L\|\dot w_t\| \ell^{-1}\left({\mathcal L(w_t)}\right)}{\|w_t\|}. \end{aligned}

Therefore \begin{aligned} \dot u_t v_t - \dot v_t u_t &\geq \frac {L\|\dot w_t\| \ell^{-1}\left({\mathcal L(w_t)}\right)}{\|w_t\|} \|w_t\|^L - L \|w_t\|^{L-1} \|\dot w_t\| \ell^{-1}(\mathcal L(w_t)) = 0. \end{aligned} It remains to show v_t stays positive. First, v_{t_0} > 0: {\tilde{\gamma}}_{t_0} > 0 entails \|w_{t_0}\| > 0 (and \mathcal L(w_{t_0}) < \ell(0)). As before, \begin{aligned} \dot v_t &\geq L^2 \|w_t\|^{L-2} \mathcal L(w_t) \ell^{-1}(\mathcal L(w_t)) = L^2 \|w_t\|^{2L-2} \mathcal L(w_t) {\tilde{\gamma}}_t. \end{aligned} Let T > t_0 denote the first time with v_T = 0, if one exists. For t\in[t_0,T), v_t > 0, and thus {\tilde{\gamma}}_t\geq {\tilde{\gamma}}_{t_0} > 0 and \dot v_t > 0, so such a time T can not exist; hence v_t > 0 and ({\text{d}}/ {\text{d}}t) {\tilde{\gamma}}_t \geq 0 for all t \geq t_0.
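
Here is a rough numerical illustration of this monotonicity with a made-up two-layer bias-free ReLU network (2-homogeneous in its parameters), the exponential loss, and plain gradient descent standing in for the flow; since this is a discretization of a nonsmooth flow, the smoothed margin is only approximately nondecreasing, and the width, step size, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up two-layer ReLU network without biases: f(x) = v . relu(W x),
# which is 2-homogeneous in the parameters (W, v).
n, d, h = 16, 2, 8
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0])
W = 0.5 * rng.normal(size=(h, d))
v = 0.5 * rng.normal(size=h)

def smoothed_margin(W, v):
    m = y * (np.maximum(X @ W.T, 0.0) @ v)
    return -np.log(np.exp(-m).sum()) / (np.sum(W**2) + np.sum(v**2))  # L = 2

eta = 0.01
prev = None      # previous smoothed margin, tracked once it has become positive
worst = 0.0      # most negative increment observed afterwards (should be ~0)
for t in range(50000):
    H = np.maximum(X @ W.T, 0.0)                  # n x h hidden activations
    ell = np.exp(-y * (H @ v))                    # per-example exponential losses
    gv = -(H * (y * ell)[:, None]).sum(axis=0)    # d/dv of sum_i exp(-m_i)
    gW = -(((H > 0) * (y * ell)[:, None] * v[None, :]).T @ X)
    W -= eta * gW
    v -= eta * gv
    s = smoothed_margin(W, v)
    if prev is not None:
        worst = min(worst, s - prev)
        prev = s
    elif s > 0:
        prev = s
# if the smoothed margin never became positive, increase the step budget or reseed
print(f"final smoothed margin {s:.4f}; most negative increment {worst:.2e}")
```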

Remark 15.11.
As mentioned, Theorem 15.3 was originally presented in (Lyu and Li 2019), though this simplification is due to (Ji 2020), and its elements can be found throughout (Ji and Telgarsky 2020). The version in (Lyu and Li 2019) is significantly different, and makes heavy (and interesting) use of a polar decomposition of homogeneous functions and gradient flow on them.
For the case of an infinite-width 2-homogeneous network, assuming a number of convergence properties of the flow (which look technical, but are not “merely technical,” and indeed difficult to prove), margins are globally maximized (Chizat and Bach 2020).

16 Generalization: preface

The purpose of this generalization part is to bound the gap between testing and training error for standard (multilayer ReLU) deep networks via the classical uniform convergence tools, and also to present and develop these classical tools (based on Rademacher complexity).

These bounds are very loose, and there is now extensive criticism both of the bounds themselves and of the general approach, as will be discussed shortly (Neyshabur, Tomioka, and Srebro 2014; Zhang et al. 2017; Nagarajan and Kolter 2019; Dziugaite and Roy 2017); this work is ongoing and moving quickly, and there are already many responses to these criticisms (Negrea, Dziugaite, and Roy 2019; L. Zhou, Sutherland, and Srebro 2020; P. L. Bartlett and Long 2020).

16.1 Omitted topics

17 Concentration of measure

17.1 sub-Gaussian random variables and Chernoff’s bounding technique

Our main concentration tool will be the Chernoff bounding method, which works nicely with sub-Gaussian random variables.

Definition 17.1.
Random variable Z is sub-Gaussian with mean \mu and variance proxy \sigma^2 when, for every \lambda\in\mathbb{R}, \mathop{\mathbb{E}}e^{\lambda (Z - \mu)} \leq e^{\lambda^2 \sigma^2/2}.

Example 17.1.
A Gaussian random variable with mean \mu and variance \sigma^2 is sub-Gaussian with mean \mu and variance proxy \sigma^2 (the definition holds with equality). By the Hoeffding lemma, a random variable taking values in [a,b] almost surely is sub-Gaussian with variance proxy (b-a)^2/4.

Remark 17.1.

There is also “sub-exponential”; we will not use it but it is fundamental.

Sometimes \mu is dropped from the definition; in this case, one can study X-\mathop{\mathbb{E}}X, and we’ll just say “\sigma^2-sub-Gaussian.”

\mathop{\mathbb{E}}\exp(\lambda Z) is the moment generating function of Z; it has many nice properties, though we’ll only use it in a technical way.

Sub-Gaussian random variables will be useful to us due to their rapidly vanishing tail probabilities. Indeed, this is an equivalent way to define sub-Gaussianity (see (Wainwright 2015)), but we’ll just prove one implication. The first step is Markov’s inequality.

Theorem 17.1 (Markov’s inequality) .
For any nonnegative r.v. X and \epsilon>0, \mathop{\textrm{Pr}}[X\geq \epsilon] \leq \frac {\mathop{\mathbb{E}}X}{\epsilon}.

Proof. Apply \mathop{\mathbb{E}} to both sides of \epsilon\mathbf{1}[X\geq \epsilon] \leq X.

Corollary 17.1.
For any random variable X, and any nonnegative nondecreasing f with f(\epsilon)>0, \mathop{\textrm{Pr}}[X\geq \epsilon] \leq \frac {\mathop{\mathbb{E}}f(X)}{f(\epsilon)}.

Proof. Note \mathop{\textrm{Pr}}[X\geq \epsilon] \leq \mathop{\textrm{Pr}}[f(X) \geq f(\epsilon)] and apply Markov.

The Chernoff bounding technique is as follows. We can apply the preceding corollary to the mapping t\mapsto \exp(tX) for any t>0: supposing \mathop{\mathbb{E}}X = 0, \mathop{\textrm{Pr}}[X\geq \epsilon] = \inf_{t> 0} \mathop{\textrm{Pr}}[\exp(tX)\geq \exp(t\epsilon)] \leq \inf_{t> 0} \frac {\mathop{\mathbb{E}}\exp(tX)}{\exp(t\epsilon)}. Simplifying the RHS via sub-Gaussianity, \begin{aligned} \inf_{t>0}\frac{\mathop{\mathbb{E}}\exp(tX)}{\exp(t\epsilon)} &\leq \inf_{t>0}\frac{\exp(t^2\sigma^2/2)}{\exp(t\epsilon)} = \inf_{t>0}\exp\left({ t^2\sigma^2/2 - t\epsilon}\right) \\ &= \exp\left({ \inf_{t>0} t^2\sigma^2/2 - t\epsilon}\right). \end{aligned} The minimum of this convex quadratic is attained at t := \frac {\epsilon}{\sigma^2}>0, thus \mathop{\textrm{Pr}}[X\geq \epsilon] \leq \inf_{t>0}\frac{\mathop{\mathbb{E}}\exp(tX)}{\exp(t\epsilon)} \leq \exp\left({ \inf_{t>0} t^2\sigma^2/2 - t\epsilon}\right) = \exp\left({ - \frac {\epsilon^2}{2 \sigma^2} }\right). (5)
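
A quick Monte Carlo check of eq. 5 for a standard Gaussian (which is 1-sub-Gaussian); the threshold, sample size, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)

sigma, eps, trials = 1.0, 2.0, 2_000_000
Z = rng.normal(0.0, sigma, size=trials)     # N(0, sigma^2) is sigma^2-sub-Gaussian
empirical = (Z >= eps).mean()
chernoff = np.exp(-eps**2 / (2 * sigma**2))
print(f"Pr[Z >= {eps}] ~ {empirical:.5f}  <=  Chernoff bound {chernoff:.5f}")
```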

What if we apply this to an average of sub-Gaussian r.v.’s? (The point is: this starts to look like an empirical risk!)

Theorem 17.2 (Chernoff bound for subgaussian r.v.’s) .
Suppose (X_1,\ldots,X_n) are independent and respectively \sigma_i^2-sub-Gaussian. Then \mathop{\textrm{Pr}}\left[{ \frac 1 n \sum_i (X_i - \mathop{\mathbb{E}}X_i) \geq \epsilon}\right] \leq \exp\left({-\frac {n^2\epsilon^2}{2\sum_i \sigma_i^2}}\right). In other words (“inversion form”), with probability \geq 1-\delta, \frac 1 n \sum_i \mathop{\mathbb{E}}X_i \leq \frac 1 n \sum_i X_i + \sqrt{\frac{2\sum_i\sigma_i^2}{n^2} \ln\left({\frac 1 \delta}\right) }.

Proof. S_n:=\frac 1 n \sum_i (X_i - \mathop{\mathbb{E}}X_i) is \sigma^2-sub-Gaussian with \sigma^2 = \sum_i\sigma_i^2/n^2; plugging this into the sub-Gaussian tail bound in eq. 5 gives the first claim, and applying the same bound to -S_n and solving \exp(-n^2\epsilon^2/(2\sum_i\sigma_i^2)) = \delta for \epsilon gives the inversion form.

Remark 17.2.

(Gaussian sanity check.) Let’s go back to the case n=1. It’s possible to get a tighter tail for the Gaussian directly (see (Wainwright 2015)), but it only changes log factors in the “inversion form” of the bound. Note also the bound is neat for the Gaussian since it says the tail mass and density are of the same order (algebraically this makes sense, as with geometric series).

(“Inversion” form.) This form is how things are commonly presented in machine learning; think of \delta as a “confidence” parameter: adding more digits to the confidence (e.g., requiring the bound to hold with probability 99.999\%) only increases the term \ln(1/\delta) linearly.

There are more sophisticated bounds (e.g., Bernstein, Freedman, McDiarmid) proved in similar ways, often considering a Martingale rather than IID r.v.s.


17.2 Hoeffding’s inequality and the need for uniform deviations

Let’s use what we’ve seen to bound misclassifications!

Theorem 17.3 (Hoeffding inequality) .
Given independent (X_1,\ldots,X_n) with X_i\in[a_i,b_i] a.s., \mathop{\textrm{Pr}}\left[{ \frac 1 n \sum_i (X_i - \mathop{\mathbb{E}}X_i) \geq \epsilon}\right] \leq \exp\left({ - \frac {2n^2\epsilon^2}{\sum_i (b_i-a_i)^2} }\right).

Proof. By the Hoeffding lemma, each X_i - \mathop{\mathbb{E}}X_i is (b_i-a_i)^2/4-sub-Gaussian; plug this into the sub-Gaussian Chernoff bound (Theorem 17.2).

Example 17.2.
Fix classifier f, sample ((X_i,Y_i))_{i=1}^n, and define Z_i := \mathbf{1}[f(X_i) \neq Y_i]. With probability at least 1-\delta, \mathop{\textrm{Pr}}[ f(X) \neq Y ] - \frac 1 n \sum_{i=1}^n \mathbf{1}[f(X_i) \neq Y_i] = \mathop{\mathbb{E}}Z_1 - \frac 1 n \sum_{i=1}^n Z_i \leq \sqrt{\frac 1 {2n}\ln\left({\frac{1}{\delta}}\right)}. That is, the test error is upper bounded by the training error plus a term which goes \downarrow 0 as n\to\infty!
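
Here is a simulation of this example with a made-up data-independent classifier and distribution; the inequality should fail in at most roughly a \delta fraction of trials.

```python
import numpy as np

rng = np.random.default_rng(5)

# A fixed (data-independent) classifier f(x) = sign(x[0]); noisy linear labels.
def f(X):
    return np.sign(X[:, 0])

def sample(n):
    X = rng.normal(size=(n, 2))
    y = np.sign(X[:, 0] + 0.5 * rng.normal(size=n))
    return X, y

n, delta, trials = 200, 0.05, 2000
# estimate the true error once with a huge sample
Xbig, ybig = sample(1_000_000)
true_err = (f(Xbig) != ybig).mean()

bound = np.sqrt(np.log(1 / delta) / (2 * n))
violations = 0
for _ in range(trials):
    X, y = sample(n)
    train_err = (f(X) != y).mean()
    if true_err - train_err > bound:
        violations += 1
print(f"true error ~ {true_err:.3f}; bound exceeded in {violations}/{trials} trials "
      f"(should be at most ~{delta*trials:.0f})")
```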
Example 17.3.

Classifier f_n memorizes training data: f_n(x) := \begin{cases} y_i & x = x_i \in (x_1,\ldots,x_n), \\ 17 &\textup{otherwise.}\end{cases} Consider two situations with \mathop{\textrm{Pr}}[Y=+1|X=x] = 1.

What broke Hoeffding’s inequality (and its proof) between these two examples?

This f_n overfit: \widehat{\mathcal{R}}(f_n) is small, but \mathcal{R}(f_n) is large.

Possible fixes.

Remark 17.3.
There are measure-theoretic issues with the uniform deviation approach, which we’ll omit here. Specifically, the most natural way to reason about \left[{ \sup_{f\in\mathcal{F}} \mathcal{R}(f) - \widehat{\mathcal{R}}(f) }\right] is via uncountable intersections of events, which are not guaranteed to be within the \sigma-algebra. The easiest fix is to work with countable subfamilies, which will work for the standard ReLU networks we consider.

18 Rademacher complexity

As before we will apply a brute-force approach to controlling generalization over a function family \mathcal{F}: we will simultaneously control generalization for all elements of the class by working with the random variable \left[{ \sup_{f\in\mathcal{F}} \mathcal{R}(f) - \widehat{\mathcal{R}}(f) }\right]. This is called “uniform deviations” because we prove a deviation bound that holds uniformly over all elements of \mathcal{F}.

Remark 18.1.
The idea is that even though our algorithms output predictors which depend on data, we circumvent the independence issue by invoking a uniform bound on all elements of \mathcal{F} before we see the algorithm’s output, and thus generalization is bounded for the algorithm output (and for everything else in the class). This is a brute-force approach because it potentially controls much more than is necessary.

On the other hand, we can adapt the approach to the output of the algorithm in various ways, as we will discuss after presenting the main Rademacher bound.

Example 18.1 (finite classes) .
As an example of what is possible, suppose we have \mathcal{F}= (f_1,\ldots,f_k), meaning a finite function class \mathcal{F} with |\mathcal{F}|=k. If we apply Hoeffding’s inequality to each element of \mathcal{F} and then union bound, we get, with probability at least 1-\delta, for every f\in\mathcal{F}, \mathop{\textrm{Pr}}[f(X) \neq Y] - \mathop{\widehat{\textrm{Pr}}}[f(X)\neq Y] \leq \sqrt{\frac{\ln(k/\delta)}{2n}} \leq \sqrt{\frac{\ln|\mathcal{F}|}{2n}} + \sqrt{\frac{\ln(1/\delta)}{2n}}. Rademacher complexity will give us a way to replace \ln |\mathcal{F}| in the preceding finite class example with something non-trivial in the case |\mathcal{F}|=\infty.
Definition 18.1 (Rademacher complexity) .
Given a set of vectors V\subseteq \mathbb{R}^n, define the (unnormalized) Rademacher complexity as \textrm{URad}(V) := \mathop{\mathbb{E}}\sup_{u \in V} \left\langle \epsilon, u \right \rangle, \qquad \textrm{Rad}(V) := \frac 1 n \textrm{URad}(V), where the expectation is over \epsilon drawn uniformly from the corners of the hypercube \{\pm 1\}^n (each coordinate \epsilon_i is a Rademacher random variable, meaning \mathop{\textrm{Pr}}[\epsilon_i = +1] = \frac 1 2 = \mathop{\textrm{Pr}}[\epsilon_i = -1], and all coordinates are iid).

This definition can be applied to arbitrary elements of \mathbb{R}^n, and is useful outside machine learning. We will typically apply it to the behavior of a function class on S = (z_i)_{i=1}^n: \mathcal{F}_{|S} := \left\{{ (f(z_1),\ldots,f(z_n)) : f\in\mathcal{F}}\right\} \subseteq \mathbb{R}^n. With this definition, \textrm{URad}(\mathcal{F}_{|S}) = \mathop{\mathbb{E}}_\epsilon\sup_{u \in \mathcal{F}_{|S}} \left\langle \epsilon, u \right \rangle = \mathop{\mathbb{E}}_\epsilon\sup_{f \in \mathcal{F}} \sum_i \epsilon_i f(z_i).

Remark 18.2.
(Loss classes.) This looks like fitting random signs, but is not exactly that; often we apply it to the loss class: overloading notation, \textrm{URad}((\ell \circ \mathcal{F})_{|S}) = \textrm{URad}\left({ \left\{{ (\ell(y_1f(x_1)),\ldots,\ell(y_nf(x_n))) : f\in\mathcal{F}}\right\} }\right).

(Sanity checks.) We’d like \textrm{URad}(V) to measure how “big” or “complicated” V is. Here are a few basic checks (verified numerically in the sketch after this remark):

  1. \textrm{URad}(\{u\}) = \mathop{\mathbb{E}}\left\langle \epsilon, u \right \rangle = 0; this seems desirable, as a class with |V|=1 is simple.

  2. More generally, \textrm{URad}(V + \{u\}) = \textrm{URad}(V).

  3. If V\subseteq V', then \textrm{URad}(V) \leq \textrm{URad}(V').

  4. \textrm{URad}(\{\pm 1\}^n) = \mathop{\mathbb{E}}_\epsilon\left\langle \epsilon, \epsilon \right \rangle = n; this also seems desirable, as V is as big/complicated as possible (amongst bounded vectors).

  5. \textrm{URad}(\{(-1,\ldots,-1), (+1,\ldots,+1)\}) = \mathop{\mathbb{E}}_\epsilon| \sum_i \epsilon_i| = \Theta(\sqrt{n}). This also seems reasonable: |V|=2 and it is not completely trivial.

(\textrm{URad} vs \textrm{Rad}.) I don’t know other texts or even papers which use \textrm{URad}, I only see \textrm{Rad}. I prefer \textrm{URad} for these reasons:

  1. The 1/n is a nuisance while proving Rademacher complexity bounds.

  2. When we connect Rademacher complexity to covering numbers, we need to change the norms to account for this 1/n.

  3. It looks more like a regret quantity.

(Absolute value version.) The original definition of Rademacher complexity (P. L. Bartlett and Mendelson 2002), which still appears in many papers and books, is \textrm{URad}_{|\cdot|}(V) = \mathop{\mathbb{E}}_\epsilon\sup_{u\in V} \left|{ \left\langle \epsilon, u \right \rangle }\right|. Most texts now drop the absolute value. Here are my reasons:

  1. \textrm{URad}_{|\cdot|} violates basic sanity checks: it is possible that \textrm{URad}_{|\cdot|}(\{u\}) \neq 0 and more generally \textrm{URad}_{|\cdot|}(V + \{u\}) \neq \textrm{URad}_{|\cdot|}(V), which violates my basic intuition about a “complexity measure.”

  2. To obtain 1/n rates rather than 1/\sqrt{n}, the notion of local Rademacher complexity was introduced, which necessitated dropping the absolute value essentially due to the preceding sanity checks.

  3. We can use \textrm{URad} to reason about \textrm{URad}_{|\cdot|}, since \textrm{URad}_{|\cdot|}(V) = \textrm{URad}(V\cup -V).

  4. While \textrm{URad}_{|\cdot|} is more convenient for certain operations, e.g., \textrm{URad}_{|\cdot|}(\cup_i V_i) \leq \sum_i \textrm{URad}_{|\cdot|}(V_i), there are reasonable surrogates for \textrm{URad} (as below).
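
The sanity checks above, and the identity \textrm{URad}_{|\cdot|}(V) = \textrm{URad}(V\cup -V), can be verified by brute force for small n; the following sketch uses arbitrary small examples.

```python
import numpy as np
from itertools import product

def urad(V):
    """Brute-force unnormalized Rademacher complexity of a finite V subset of R^n."""
    V = np.asarray(V, dtype=float)
    n = V.shape[1]
    return np.mean([np.max(V @ np.array(eps)) for eps in product([-1.0, 1.0], repeat=n)])

n = 6
ones = np.ones(n)
cube = np.array(list(product([-1.0, 1.0], repeat=n)))

print(urad([ones]))                      # singleton: 0
print(urad([ones, -ones]))               # two antipodal points: Theta(sqrt(n))
print(urad(cube), "= n =", n)            # full hypercube: exactly n

# absolute-value variant via URad_{|.|}(V) = URad(V u -V)
V = np.array([[1.0, 2.0, 0.0, -1.0, 0.5, 3.0]])
print(urad(np.vstack([V, -V])))          # equals E|<eps, v>| for the single row
```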

The following theorem shows indeed that we can use Rademacher complexity to replace the \ln|\mathcal{F}| term from the finite-class bound with something more general (we’ll treat the Rademacher complexity of finite classes shortly).

Theorem 18.1.

Let \mathcal{F} be given with f(z) \in [a,b] a.s. \forall f\in\mathcal{F}.

  1. With probability \geq 1-\delta, \begin{aligned} \sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f(Z) - \frac 1 n \sum_i f(z_i) &\leq \mathop{\mathbb{E}}_{(z_i)_{i=1}^n} \left({\sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f(z) - \frac 1 n \sum_i f(z_i)}\right) + (b-a) \sqrt{\frac{\ln(1/\delta)}{2n}}. \end{aligned}

  2. With probability \geq 1-\delta, \mathop{\mathbb{E}}_{(z_i)_{i=1}^n} \textrm{URad}(\mathcal{F}_{|S}) \leq \textrm{URad}(\mathcal{F}_{|S}) + (b-a) \sqrt{\frac{n\ln(1/\delta)}{2}}.

  3. With probability \geq 1-\delta, \sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f(Z) - \frac 1 n \sum_i f(z_i) \leq \frac 2 n \textrm{URad}(\mathcal{F}_{|S}) + 3(b-a) \sqrt{\frac{\ln(2/\delta)}{2 n}}.

Remark 18.3.
To flip which side has an expectation and which side has an average of random variables, replace \mathcal{F} with -\mathcal{F}:= \{-f : f\in\mathcal{F}\}.

The proof of this bound has many interesting points and is spread out over the next few subsections. It has these basic steps:

  1. The expected uniform deviations are upper bounded by the expected Rademacher complexity. This itself is done in two steps:

    1. The expected deviations are upper bounded by expected deviations between two finite samples. This is interesting since we could have reasonably defined generalization in terms of this latter quantity.

    2. These two-sample deviations are upper bounded by expected Rademacher complexity by introducing random signs.

  2. We replace this difference in expectations with high probability bounds via a more powerful concentration inequality which we haven’t discussed, McDiarmid’s inequality.

18.1 Generalization without concentration; symmetrization

We’ll use further notation throughout this proof. \begin{aligned} Z &\quad \textrm{r.v.; e.g., $(x,y)$}, \\ \mathcal{F} &\quad \textrm{functions; e.g., } f(Z) = \ell(g(X), Y), \\ \mathop{\mathbb{E}}&\quad\textrm{expectation over } Z, \\ \mathop{\mathbb{E}}_n &\quad \textrm{expectation over } (Z_1,\ldots,Z_n), \\ \mathop{\mathbb{E}}f &= \mathop{\mathbb{E}}f(Z),\\ \widehat{\mathop{\mathbb{E}}}_n f &= \frac 1 n \sum_i f(Z_i). \end{aligned}

In this notation, \mathcal{R}_{\ell}(g) = \mathop{\mathbb{E}}\ell\circ g and \widehat{\mathcal{R}}_{\ell}(g) = \widehat{\mathop{\mathbb{E}}}\ell\circ g.

18.1.1 Symmetrization with a ghost sample

In this first step we’ll introduce another sample (“ghost sample”). Let (Z_1',\ldots,Z_n') be another iid draw from Z; define \mathop{\mathbb{E}}_n' and \widehat{\mathop{\mathbb{E}}}_n' analogously.

Lemma 18.1.
\mathop{\mathbb{E}}_n\left({ \sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f}\right) \leq \mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \widehat{\mathop{\mathbb{E}}}_n' f - \widehat{\mathop{\mathbb{E}}}_n f}\right).

Proof. Fix any \epsilon>0 and an approximate maximizer f_\epsilon\in\mathcal{F}; then \begin{aligned} \mathop{\mathbb{E}}_n\left({\sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f }\right) &\leq \mathop{\mathbb{E}}_n\left({ \mathop{\mathbb{E}}f_\epsilon- \widehat{\mathop{\mathbb{E}}}_n f_\epsilon}\right)+\epsilon \\ &= \mathop{\mathbb{E}}_n\left({ \mathop{\mathbb{E}}_n' \widehat{\mathop{\mathbb{E}}}_n' f_\epsilon- \widehat{\mathop{\mathbb{E}}}_n f_\epsilon}\right)+\epsilon \\ &= \mathop{\mathbb{E}}_n'\mathop{\mathbb{E}}_n\left({ \widehat{\mathop{\mathbb{E}}}_n' f_\epsilon- \widehat{\mathop{\mathbb{E}}}_n f_\epsilon}\right)+\epsilon \\ &\leq \mathop{\mathbb{E}}_n'\mathop{\mathbb{E}}_n\left({\sup_{f\in\mathcal{F}} \widehat{\mathop{\mathbb{E}}}_n' f - \widehat{\mathop{\mathbb{E}}}_n f }\right)+\epsilon. \end{aligned} The result follows since \epsilon>0 was arbitrary.

Remark 18.4.

As above, in this section we are working only in expectation for now. In the subsequent section, we’ll get high probability bounds. But \sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f is a random variable; one can describe it in many other ways too! (E.g., “asymptotic normality.”)

As mentioned before, the preceding lemma says we can instead work with two samples. Working with two samples could have been our starting point (and definition of generalization): by itself it is a meaningful and interpretable quantity!

18.1.2 Symmetrization with random signs

The second step swaps points between the two samples; a magic trick with random signs boils this down into Rademacher complexity.

Lemma 18.2.
\mathop{\mathbb{E}}_n \mathop{\mathbb{E}}_n' \sup_{f\in\mathcal{F}} \left({ \widehat{\mathop{\mathbb{E}}}_n' f - \widehat{\mathop{\mathbb{E}}}_n f}\right) \leq \frac 2 n \mathop{\mathbb{E}}_n \textrm{URad}(\mathcal{F}_{|S}).

Proof. Fix a vector \epsilon\in \{-1,+1\}^n and define a r.v. (U_i,U_i') := (Z_i,Z_i') if \epsilon_i= 1 and (U_i,U_i') := (Z_i',Z_i) if \epsilon_i= -1. Then \begin{aligned} \mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \widehat{\mathop{\mathbb{E}}}_n' f - \widehat{\mathop{\mathbb{E}}}_n f}\right) &= \mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \left({ f(Z_i') - f(Z_i)}\right) }\right) \\ &= \mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({ f(U_i') - f(U_i)}\right) }\right). \end{aligned} Here’s the big trick: since (Z_1,\ldots,Z_n,Z_1',\ldots,Z_n') and (U_1,\ldots,U_n,U_1',\ldots,U_n') have the same distribution, and \epsilon was arbitrary, then (with \mathop{\textrm{Pr}}[\epsilon_i = +1] = 1/2 iid “Rademacher”) \begin{aligned} \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \widehat{\mathop{\mathbb{E}}}_n' f - \widehat{\mathop{\mathbb{E}}}_n f}\right) &= \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({ f(U_i') - f(U_i)}\right) }\right) \\ &= \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({ f(Z_i') - f(Z_i)}\right) }\right). \end{aligned}

Since similarly swapping \epsilon_i with -\epsilon_i doesn’t change \mathop{\mathbb{E}}_\epsilon, \begin{aligned} &\mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \widehat{\mathop{\mathbb{E}}}_n' f - \widehat{\mathop{\mathbb{E}}}_n f}\right) \\ &= \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({ f(Z_i') - f(Z_i)}\right) }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f,f'\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({ f(Z_i') - f'(Z_i)}\right) }\right) \\ &= \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i f(Z_i') }\right) + \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n \left({ \sup_{f'\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({- f'(Z_i)}\right) }\right) \\ &= 2 \mathop{\mathbb{E}}_n \frac 1 n \mathop{\mathbb{E}}_\epsilon\sup_{f\in\mathcal{F}} \sum_i \epsilon_i \left({ f(Z_i) }\right) = 2 \mathop{\mathbb{E}}_n \frac 1 n \textrm{URad}(\mathcal{F}_{|S}). \end{aligned}

18.2 Generalization with concentration

We controlled expected uniform deviations: \mathop{\mathbb{E}}_n \sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f.

High probability bounds will follow via concentration inequalities.

Theorem 18.2 (McDiarmid) .
Suppose F:\mathbb{R}^n \to \mathbb{R} satisfies “bounded differences”: for every i\in\{1,\ldots,n\} there exists c_i with \sup_{z_1,\ldots,z_n,z_i'} \left|{ F(z_1,\ldots,z_i,\ldots,z_n) - F(z_1,\ldots,z_i',\ldots,z_n) }\right| \leq c_i. With probability \geq 1-\delta over an independent draw of (Z_1,\ldots,Z_n), \mathop{\mathbb{E}}_n F(Z_1,\ldots,Z_n) \leq F(Z_1,\ldots,Z_n) + \sqrt{\frac {\sum_i c_i^2}{2} \ln(1/\delta) }.
Remark 18.5.

I’m omitting the proof. A standard way is via a Martingale variant of the Chernoff bounding method. The Martingale adds one point at a time, and sees how things grow.

Hoeffding follows by setting F(\vec Z) = \sum_i Z_i/n and verifying bounded differences c_i := (b_i-a_i)/n.

Proof of Theorem 18.1.

The third bullet item follows from the first two by union bounding. To prove the first two, it suffices to apply the earlier two lemmas on expectations and verify the quantities satisfy bounded differences with constant (b-a)/n and (b-a), respectively.

For the first quantity, fix any i and (z_1,\ldots,z_n,z_i'), and let \widehat{\mathop{\mathbb{E}}}_n' denote the empirical average with z_i replaced by z_i'; then \begin{aligned} \left|{ \sup_{f\in\mathcal{F}} \left({ \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f }\right) - \sup_{g\in\mathcal{F}} \left({ \mathop{\mathbb{E}}g - \widehat{\mathop{\mathbb{E}}}_n' g }\right) }\right| &= \left|{ \sup_{f\in\mathcal{F}} \left({ \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f }\right) - \sup_{g\in\mathcal{F}} \left({ \mathop{\mathbb{E}}g - \widehat{\mathop{\mathbb{E}}}_n g + \frac{g(z_i) - g(z_i')}{n} }\right) }\right| \\ &\leq \sup_{h\in\mathcal{F}} \frac{\left|{ h(z_i) - h(z_i') }\right|}{n} \\ &\leq \frac {b-a}{n}. \end{aligned} Using similar notation, and additionally writing S and S' for the two samples, for the Rademacher complexity, \begin{aligned} \left|{ \textrm{URad}(\mathcal{F}_{|S}) - \textrm{URad}(\mathcal{F}_{|S'})}\right| &= \left|{ \mathop{\mathbb{E}}_\epsilon\sup_{f\in\mathcal{F}} \sum_{j=1}^n \epsilon_j f(z_j) - \mathop{\mathbb{E}}_\epsilon\sup_{f\in\mathcal{F}} \left({ \sum_{j=1}^n \epsilon_j f(z_j) - \epsilon_i f(z_i) + \epsilon_i f(z_i') }\right) }\right| \\ &\leq \mathop{\mathbb{E}}_\epsilon\sup_{h\in\mathcal{F}} \left|{ \epsilon_i h(z_i) - \epsilon_i h(z_i') }\right| \\ &\leq b-a. \end{aligned}

18.3 Example: basic logistic regression generalization analysis

Let’s consider logistic regression with bounded weights: \begin{aligned} \ell(y f(x)) &:= \ln(1+\exp(-y f(x))), \\ |\ell'| &\leq 1, \\ \mathcal{F} &:= \left\{{ w\in\mathbb{R}^d : \|w\| \leq B }\right\}, \\ (\ell\circ \mathcal{F})_{|S} &:= \left\{{ (\ell(y_1 w^{\scriptscriptstyle\mathsf{T}}x_1),\ldots,\ell(y_n w^{\scriptscriptstyle\mathsf{T}}x_n)) : \|w\|\leq B }\right\}, \\ \mathcal{R}_{\ell}(w) &:= \mathop{\mathbb{E}}\ell(Y w^{\scriptscriptstyle\mathsf{T}}X), \\ \widehat{\mathcal{R}}_{\ell}(w) &:= \frac 1 n \sum_i \ell(y_i w^{\scriptscriptstyle\mathsf{T}}x_i). \end{aligned} The goal is to control \mathcal{R}_{\ell}- \widehat{\mathcal{R}}_{\ell} over \mathcal{F} via the earlier theorem; our main effort is in controlling \textrm{URad}((\ell\circ \mathcal{F})_{S}).

This has two steps:

Lemma 18.3.
Let \ell:\mathbb{R}^n\to\mathbb{R}^n be a vector of univariate L-lipschitz functions. Then \textrm{URad}(\ell\circ V) \leq L \cdot \textrm{URad}(V).

Proof. The idea of the proof is to “de-symmetrize” and get a difference of coordinates to which we can apply the definition of L. To start, \begin{aligned} \textrm{URad}(\ell \circ V) &= \mathop{\mathbb{E}}\sup_{u\in V} \sum_i \epsilon_i \ell_i(u_i) \\ &= \frac 1 2 \mathop{\mathbb{E}}_{\epsilon_{2:n}} \sup_{u,w\in V} \left({\ell_1(u_1) - \ell_1(w_1) + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i)) }\right) \\ &\leq \frac 1 2 \mathop{\mathbb{E}}_{\epsilon_{2:n}} \sup_{u,w\in V} \left({L|u_1 - w_1| + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i)) }\right). \end{aligned} To get rid of the absolute value, for any \epsilon, by considering swapping u and w, \begin{aligned} & \sup_{u,w\in V} \left({L|u_1 - w_1| + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i))}\right) \\ & = \max\Bigg\{ \sup_{u,w\in V} \left({L(u_1 - w_1) + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i))}\right), \\ &\qquad \sup_{u,w} \left({L(w_1 - u_1) + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i))}\right) \Bigg\} \\ \\ & = \sup_{u,w\in V} \left({L(u_1 - w_1) + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i))}\right). \end{aligned} As such, \begin{aligned} \textrm{URad}(\ell \circ V) &\leq \frac 1 2 \mathop{\mathbb{E}}_{\epsilon_{2:n}} \sup_{u,w\in V} \left({L|u_1 - w_1| + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i)) }\right) \\ &= \frac 1 2 \mathop{\mathbb{E}}_{\epsilon_{2:n}} \sup_{u,w\in V} \left({L(u_1 - w_1) + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i)) }\right) \\ &= \mathop{\mathbb{E}}_{\epsilon} \sup_{u\in V} \left[{ L\epsilon_1 u_1 + \sum_{i=2}^n \epsilon_i \ell_i(u_i) }\right]. \end{aligned} Repeating this procedure for the other coordinates gives the bound.

Revisiting our overloaded composition notation: \begin{aligned} \left({\ell \circ f}\right) &= \left({ (x,y) \mapsto \ell(-y f(x)) }\right),\\ \ell \circ \mathcal{F}&= \{ \ell \circ f : f\in \mathcal{F}\}. \end{aligned}

Corollary 18.1.
Suppose \ell is L-lipschitz and \ell\circ \mathcal{F}\in[a,b] a.s.. With probability \geq 1-\delta, every f\in\mathcal{F} satisfies \mathcal{R}_{\ell}(f) \leq \widehat{\mathcal{R}}_{\ell}(f) + \frac {2L}{n}\textrm{URad}(\mathcal{F}_{|S}) + 3(b-a)\sqrt{\frac{\ln(2/\delta)}{2n}}.

Proof. Use the lipschitz composition lemma with \begin{aligned} |\ell(-y_i f(x_i)) - \ell(-y_i f'(x_i))| &\leq L|-y_i f(x_i) + y_i f'(x_i)| \\ &\leq L|f(x_i) - f'(x_i)|. \end{aligned}

Now let’s handle step 2: Rademacher complexity of linear predictors (in \ell_2).

Theorem 18.3.
Collect sample S:= (x_1,\ldots,x_n) into rows of X\in \mathbb{R}^{n\times d}. \textrm{URad}(\{ x\mapsto \left\langle w, x \right \rangle : \|w\|_2\leq B\}_{|S}) \leq B\|X\|_F.

Proof. Fix any \epsilon\in\{-1,+1\}^n. Then \sup_{\|w\|\leq B} \sum_i \epsilon_i \left\langle w, x_i \right \rangle = \sup_{\|w\|\leq B} \left\langle w, \sum_i \epsilon_i x_i \right \rangle = B \left\|{ \sum_i \epsilon_i x_i }\right\|. We’ll bound this norm with Jensen’s inequality (the only inequality in the whole proof!): \mathop{\mathbb{E}}\left\|{ \sum_i \epsilon_i x_i }\right\| = \mathop{\mathbb{E}}\sqrt{ \left\|{ \sum_i \epsilon_i x_i }\right\|^2 } \leq \sqrt{ \mathop{\mathbb{E}}\left\|{ \sum_i \epsilon_i x_i }\right\|^2 }. To finish, \mathop{\mathbb{E}}\left\|{ \sum_i \epsilon_i x_i }\right\|^2 = \mathop{\mathbb{E}}\left({ \sum_i \left\|{ \epsilon_i x_i }\right\|^2 + \sum_{i\neq j} \left\langle \epsilon_i x_i, \epsilon_j x_j \right \rangle}\right) = \sum_i \left\|{x_i}\right\|^2 = \|X\|_F^2.
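
A Monte Carlo check of this bound with made-up data; the expectation over sign vectors is estimated by sampling.

```python
import numpy as np

rng = np.random.default_rng(6)

n, d, B = 12, 3, 2.0
X = rng.normal(size=(n, d))

# URad({x -> <w,x> : ||w|| <= B}|S) = B * E|| sum_i eps_i x_i ||, estimated by sampling.
trials = 200000
eps = rng.choice([-1.0, 1.0], size=(trials, n))
norms = np.linalg.norm(eps @ X, axis=1)
print("B * E||X^T eps||   ~", B * norms.mean())
print("upper bound B*||X||_F =", B * np.linalg.norm(X))
```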

Remark 18.6.
By Khinchine’s inequality, the preceding Rademacher complexity estimate is tight up to constants.

Let’s now return to the logistic regression example!

Example 18.2 (logistic regression) .

Suppose \|w\|\leq B and \|x_i\|\leq 1, and the loss is the 1-Lipschitz logistic loss \ell_{\log}(z) := \ln(1+\exp(z)). Note \ell_{\log}(\left\langle w, yx \right \rangle) \geq 0 and \ell_{\log}(\left\langle w, yx \right \rangle) \leq \ln(2) + \left|{\left\langle w, yx \right \rangle}\right| \leq \ln(2) + B.

Combining the main Rademacher bound with the Lipschitz composition lemma and the Rademacher bound on linear predictors, with probability at least 1-\delta, every w\in\mathbb{R}^d with \|w\|\leq B satisfies

\begin{aligned} \mathcal{R}_{\ell}(w) &\leq \widehat{\mathcal{R}}_{\ell}(w) + \frac {2}{n} \textrm{URad}((\ell\circ \mathcal{F})_{|S}) + 3(\ln(2)+B)\sqrt{\ln(2/\delta)/(2n)} \\ &\leq \widehat{\mathcal{R}}_{\ell}(w) + \frac {2B\|X\|_F}{n} + 3(\ln(2)+B)\sqrt{\ln(2/\delta)/(2n)} \\ &\leq \widehat{\mathcal{R}}_{\ell}(w) + \frac {2B + 3(B+\ln(2))\sqrt{\ln(2/\delta)/2}}{\sqrt{n}}. \end{aligned}
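
Plugging illustrative (made-up) values of n, B, and \delta into the final bound gives a feel for the scale of this slack term.

```python
import numpy as np

# Made-up numbers for the final logistic-regression bound
# R(w) <= hat R(w) + (2B + 3(B + ln 2) sqrt(ln(2/delta)/2)) / sqrt(n).
n, B, delta = 10000, 5.0, 0.01
slack = (2 * B + 3 * (B + np.log(2)) * np.sqrt(np.log(2 / delta) / 2)) / np.sqrt(n)
print(f"generalization slack ~ {slack:.3f}")
```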

Remark 18.7.

(Average case vs worst case.) Here we replaced \|X\|_F with the looser \sqrt{n}.

This bound scales as the SGD logistic regression bound proved via Azuma, despite following a somewhat different route (Azuma and McDiarmid are both proved with Chernoff bounding method; the former approach involves no symmetrization, whereas the latter holds for more than the output of an algorithm).

It would be nice to have an “average Lipschitz” bound rather than “worst-case Lipschitz”; e.g., when working with neural networks and the ReLU, which seems it can kill off many inputs! But it’s not clear how to do this. Relatedly: regularizing the gradient is sometimes used in practice?

18.4 Margin bounds

In the logistic regression example, we peeled off the loss and bounded the Rademacher complexity of the predictors.

If most training labels are predicted not only accurately, but with a large margin, as in section 15, then we can further reduce the generalization bound.

Define \ell_\gamma(z) := \max\{0,\min\{1, 1-z/\gamma\}\}, \mathcal{R}_{\gamma}(f) := \mathcal{R}_{\ell_\gamma}(f) = \mathop{\mathbb{E}}\ell_\gamma(Y f(X)), and recall \mathcal{R}_{\textrm{z}}(f) = \mathop{\textrm{Pr}}[f(X) \neq Y].

Theorem 18.4.
For any margin \gamma>0, with probability \geq 1-\delta, \forall f\in\mathcal{F}, \mathcal{R}_{\textrm{z}}(f) \leq \mathcal{R}_{\gamma}(f) \leq \widehat{\mathcal{R}}_{\gamma}(f) + \frac {2}{n\gamma}\textrm{URad}(\mathcal{F}_{|S}) + 3 \sqrt{\frac{\ln(2/\delta)}{2n}}.

Proof. Since \mathbf{1}[ \textrm{sgn}(f(x)) \neq y] \leq \mathbf{1}[ -f(x)y \geq 0] \leq \ell_\gamma(f(x)y), then \mathcal{R}_{\textrm{z}}(f) \leq \mathcal{R}_{\gamma}(f). The bound between \mathcal{R}_{\gamma} and \widehat{\mathcal{R}}_{\gamma} follows from the fundamental Rademacher bound, and by peeling the \frac 1 \gamma-Lipschitz function \ell_\gamma.


Remark 18.8 (bibliographic notes) .

As a generalization notion, this was first introduced for 2-layer networks in (P. L. Bartlett 1996), and then carried to many other settings (SVM, boosting, …)

There are many different proof schemes; another one uses sparsification (Schapire et al. 1997).

This approach is again being extensively used for deep networks, since it seems that while weight matrix norms grow indefinitely, the margins grow along with them (P. Bartlett, Foster, and Telgarsky 2017).

18.5 Finite class bounds

In our warm-up example of finite classes, our complexity term was \ln|\mathcal{F}|. Here we will recover that, via Rademacher complexity. Moreover, the bound has a special form which will be useful in the later VC dimension and especially covering sections.

Theorem 18.5 (Massart finite lemma) .
\textrm{URad}(V) \leq \sup_{u\in V} \|u\|_2 \sqrt{2\ln|V|}.
Remark 18.9.
\ln|V| is what we expect from a union bound.

The \|\cdot\|_2 geometry is intrinsic here; I don’t know how to replace it with other norms without introducing looseness. This matters later when we encounter the Dudley entropy integral.

We’ll prove this via a few lemmas.

Lemma 18.4.
If (X_1,\ldots,X_n) are c^2-subgaussian, then \mathop{\mathbb{E}}\max_i X_i \leq c\sqrt{2\ln(n)}.

Proof.

\begin{aligned} \mathop{\mathbb{E}}\max_i X_i &= \inf_{t>0} \mathop{\mathbb{E}}\frac 1 t \ln \max_i \exp(tX_i) \leq \inf_{t>0} \mathop{\mathbb{E}}\frac 1 t \ln \sum_i \exp(tX_i) \\ &\leq \inf_{t>0} \frac 1 t \ln \sum_i \mathop{\mathbb{E}}\exp(tX_i) \leq \inf_{t>0} \frac 1 t \ln \sum_i \exp(t^2c^2/2) \\ &= \inf_{t>0} (\ln(n) / t + c^2t/2) \end{aligned} and plug in minimizer t = \sqrt{ 2\ln(n)/c^2 }

Lemma 18.5.
If (X_1,\ldots,X_n) are c_i^2-subgaussian and independent, \sum_i X_i is \|\vec c\|_2^2-subgaussian.

Proof. We did this in the concentration lecture, but here it is again: \mathop{\mathbb{E}}\exp(t\sum_i X_i) = \prod_i \mathop{\mathbb{E}}\exp(tX_i) \leq \prod_i \exp(t^2c_i^2/2) = \exp(t^2\|\vec c\|_2^2/2).

Proof of Theorem 18.5 (Massart finite lemma).

Let \vec{\epsilon} be iid Rademacher and fix u\in V. Define X_{u,i} := \epsilon_i u_i and X_u := \sum_i X_{u,i}.

By the Hoeffding lemma, X_{u,i} is (u_i - (-u_i))^2/4 = u_i^2-sub-Gaussian, thus (by the preceding lemma) X_u is \|u\|_2^2-sub-Gaussian. Thus \textrm{URad}(V) = \mathop{\mathbb{E}}_\epsilon\max_{u\in V} \left\langle \epsilon, u \right \rangle = \mathop{\mathbb{E}}_\epsilon\max_{u\in V} X_u \leq \max_{u\in V} \|u\|_2 \sqrt{2\ln |V| }.
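
A brute-force check of the Massart finite lemma on a small random V; the sizes and seed are arbitrary.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)

n, k = 8, 5
V = rng.normal(size=(k, n))

# exact (brute force) unnormalized Rademacher complexity of the k vectors
total = 0.0
for eps in product([-1.0, 1.0], repeat=n):
    total += max(V @ np.asarray(eps))
urad_exact = total / 2**n

massart = np.linalg.norm(V, axis=1).max() * np.sqrt(2 * np.log(k))
print(f"URad = {urad_exact:.3f} <= Massart bound {massart:.3f}")
```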

18.6 Weaknesses of Rademacher complexity

not an exhaustive list…

19 Two Rademacher complexity proofs for deep networks

We will give two bounds, obtained by inductively peeling off layers.


19.1 First “layer peeling” proof: (1,\infty) norm

Theorem 19.1.
Let \rho-Lipschitz activations \sigma_i satisfy \sigma_i(0)=0, and \mathcal{F}:= \left\{{ x\mapsto \sigma_L(W_L\sigma_{L-1}(\cdots \sigma_1(W_1x)\cdots)) : \|W_i^{\scriptscriptstyle\mathsf{T}}\|_{1,\infty} \leq B }\right\}. Then \displaystyle \textrm{URad}(\mathcal{F}_{|S}) \leq \|X\|_{2,\infty} (2\rho B)^L \sqrt{2 \ln(d)}.
Remark 19.1.

Notation \|M\|_{b,c} = \| ( \|M_{:1}\|_b,\ldots,\|M_{:d}\|_b)\|_c means apply b-norm to columns, then c-norm to resulting vector.

Many newer bounds replace \|W_i^{\scriptscriptstyle\mathsf{T}}\| with a distance to initialization. (The NTK is one regime where this helps.) I don’t know how to use distance to initialization in the bounds in this section, but a later bound can handle it.

(\rho B)^L is roughly a Lipschitz constant of the network according to \infty-norm bounded inputs. Ideally we’d have “average Lipschitz” not “worst case,” but we’re still far from that…

The factor 2^L is not good and the next section gives one technique to remove it.

We’ll prove this with an induction “peeling” off layers. This peeling will use the following lemma, which collects many standard Rademacher properties.

Lemma 19.1.
  1. \textrm{URad}(V) \geq 0.
  2. \textrm{URad}(cV + \{u\}) = |c|\textrm{URad}(V).
  3. \textrm{URad}(\textrm{conv}(V)) = \textrm{URad}(V).
  4. Let (V_i)_{i\geq 0} be given with \sup_{u\in V_i} \left\langle u, \epsilon \right \rangle \geq 0 for every \epsilon\in\{-1,+1\}^n. (E.g., V_i = -V_i, or 0\in V_i.) Then \textrm{URad}(\cup_i V_i) \leq \sum_i \textrm{URad}(V_i).
  5. \textrm{URad}(V) = \textrm{URad}(-V).
Remark 19.2.
  1. Part 3 is a mixed blessing: “Rademacher complexity is insensitive to convex hulls.”
  2. Part 4 holds for \textrm{URad}_{|\cdot|} directly, where \textrm{URad}_{|\cdot|}(V) = \mathop{\mathbb{E}}_\epsilon\sup_{u\in V} |\left\langle \epsilon, u \right \rangle| is the original definition of (unnormalized) Rademacher complexity: define W_i := V_i \cup -V_i, which satisfies the conditions, and note (\cup_i V_i) \cup - (\cup_i V_i) = \cup_i W_i. Since \textrm{URad}_{|\cdot|}(V_i) = \textrm{URad}(W_i), then \textrm{URad}_{|\cdot|}(\cup_i V_i) = \textrm{URad}(\cup_i W_i) \leq \sum_{i\geq 1} \textrm{URad}(W_i) = \sum_{i\geq 1} \textrm{URad}_{|\cdot|}(V_i).
  3. These properties are important, and we’ll do the proof and some implications in homework.

Proof of Lemma 19.1.

  1. Fix any u_0 \in V; then \mathop{\mathbb{E}}_\epsilon\sup_{u\in V}\left\langle \epsilon, u \right \rangle \geq \mathop{\mathbb{E}}_\epsilon\left\langle \epsilon, u_0 \right \rangle = 0.

  2. The inequality \leq follows using the |c|-Lipschitz functions \ell_i(r) := c\cdot r + u_i; for equality, note -\epsilon c and \epsilon c are the same in distribution.

  3. This follows since optimization over a polytope is achieved at a corner. In detail, \begin{aligned} \textrm{URad}(\textrm{conv}(V)) &= \mathop{\mathbb{E}}_\epsilon\sup_{\substack{k\geq 1 \\ \alpha \in \Delta_k}} \sup_{u_1,\ldots,u_k\in V} \left\langle \epsilon, \sum_j \alpha_j u_j \right \rangle \\ &= \mathop{\mathbb{E}}_\epsilon\sup_{\substack{k\geq 1 \\ \alpha \in \Delta_k}} \sum_j \alpha_j \sup_{u_j \in V} \left\langle \epsilon, u_j \right \rangle \\ &= \mathop{\mathbb{E}}_\epsilon\left({ \sup_{\substack{k\geq 1 \\ \alpha \in \Delta_k}} \sum_j \alpha_j}\right) \sup_{u \in V} \left\langle \epsilon, u \right \rangle \\ &= \textrm{URad}(V). \end{aligned}

  4. Using the condition, \begin{aligned} \mathop{\mathbb{E}}_\epsilon\sup_{u\in \cup_i V_i} \left\langle \epsilon, u \right \rangle &= \mathop{\mathbb{E}}_\epsilon\sup_i \sup_{u\in V_i} \left\langle \epsilon, u \right \rangle \leq \mathop{\mathbb{E}}_\epsilon\sum_i \sup_{u\in V_i} \left\langle \epsilon, u \right \rangle \\ &= \sum_{i\geq1} \textrm{URad}(V_i). \end{aligned}

  5. Since integrating over \epsilon is the same as integrating over -\epsilon (the two are equivalent distributions), \textrm{URad}(-V) = \mathop{\mathbb{E}}_{\epsilon} \sup_{u\in V} \left\langle \epsilon, -u \right \rangle = \mathop{\mathbb{E}}_{\epsilon} \sup_{u\in V} \left\langle -\epsilon, -u \right \rangle = \textrm{URad}(V).

Proof of Theorem 19.1.

Let \mathcal{F}_i denote functions computed by nodes in layer i. It’ll be shown by induction that \textrm{URad}((\mathcal{F}_{i})_{|S}) \leq \|X \|_{2,\infty} (2\rho B)^i \sqrt{2 \ln(d)}. Base case (i=0): by the Massart finite lemma, \begin{aligned} \textrm{URad}((\mathcal{F}_{0})_{|S}) &= \textrm{URad}\left({ \left\{{x\mapsto x_j : j \in \{1,\ldots,d\}}\right\}_{|S} }\right) \\ &\leq \left({\max_j \left\|{((x_1)_j, \ldots, (x_n)_j)}\right\|_2}\right) \sqrt{2\ln(d)} \\ &= \|X\|_{2,\infty} \sqrt{2\ln d } = \|X\|_{2,\infty} (2\rho B)^0 \sqrt{2\ln d}. \end{aligned}

Inductive step. Since 0 = \sigma(\left\langle 0, F(x) \right \rangle) \in \mathcal{F}_{i+1}, applying both Lipschitz peeling and the preceding multi-part lemma, \begin{aligned} &\textrm{URad}((\mathcal{F}_{i+1})_{|S}) \\ &=\textrm{URad}\left({\left\{{ x\mapsto \sigma_{i+1}(\|W_{i+1}^{\scriptscriptstyle\mathsf{T}}\|_{1,\infty} g(x)) : g\in \textrm{conv}(-\mathcal{F}_i \cup \mathcal{F}_i) }\right\}_{|S} }\right) \\ &\leq \rho B\cdot \textrm{URad}\left({-(\mathcal{F}_i)_{|S} \cup (\mathcal{F}_i)_{|S} }\right) \\ &\leq 2 \rho B\cdot \textrm{URad}\left({(\mathcal{F}_i)_{|S}}\right) \\ &\leq (2 \rho B)^{i+1} \|X\|_{2,\infty} \sqrt{2\ln d}. \end{aligned}

Remark 19.3.

There are many related norm-based proofs now changing constants and also (1,\infty); see for instance Neyshabur-Tomioka-Srebro, Bartlett-Foster-Telgarsky (we’ll cover this), Golowich-Rakhlin-Shamir (we’ll cover this), Barron-Klusowski.

The best lower bound is roughly what you get by writing a linear function as a deep network \ddot \frown.

The proof does not “coordinate” the behavior of adjacent layers in any way, and worst-cases what can happen.

19.2 Second “layer peeling” proof: Frobenius norm

Theorem 19.2 ((Theorem 1, Golowich, Rakhlin, and Shamir 2018)) .
Let 1-Lipschitz, positive homogeneous activations \sigma_i be given, and \mathcal{F}:= \left\{{ x\mapsto \sigma_L(W_L\sigma_{L-1}(\cdots \sigma_1(W_1x)\cdots)) : \|W_i\|_{\textrm{F}}\leq B }\right\}. Then \textrm{URad}(\mathcal{F}_{|S}) \leq B^L \|X\|_{\textrm{F}}\left({1 + \sqrt{2L\ln(2)}}\right).
Remark 19.4.
The criticisms of the previous layer peeling proof still apply, except we’ve removed 2^L.

The proof technique can also handle other matrix norms (though with some adjustment), bringing it closer to the previous layer peeling proof.

For an earlier version of this bound, but including factors like 2^L, see Neyshabur-Tomioka-Srebro.

The main proof trick (to remove 2^L) is to replace \mathop{\mathbb{E}}_\epsilon with \ln \mathop{\mathbb{E}}_\epsilon\exp; the 2^L now appears inside the \ln.

To make this work, we need two calculations, which we’ll wrap up into lemmas.

Here is our refined Lipschitz peeling bound, stated without proof.

Lemma 19.2 ((Eq. 4.20, Ledoux and Talagrand 1991)) .
Let \ell:\mathbb{R}^n\to\mathbb{R}^n be a vector of univariate \rho-lipschitz functions with \ell_i(0)=0. Then \mathop{\mathbb{E}}_\epsilon\exp \left({ \sup_{u\in V} \sum_i \epsilon_i \ell_i(u_i) }\right) \leq \mathop{\mathbb{E}}_\epsilon\exp \left({ \rho \sup_{u\in V} \sum_i \epsilon_i u_i }\right).
Remark 19.5.
With \exp gone, our proof was pretty clean, but all proofs I know of this are more complicated case analyses. So I will not include a proof \ddot\frown.

The peeling proof will end with a term \mathop{\mathbb{E}}\exp\left({ t\|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|}\right), and we’ll optimize the t to get the final bound. Consequently, we are proving \|X^{\scriptscriptstyle\mathsf{T}}\epsilon\| is sub-Gaussian!

Lemma 19.3.
\mathop{\mathbb{E}}\|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|_2 \leq \|X\|_{\textrm{F}} and \|X^{\scriptscriptstyle\mathsf{T}}\epsilon\| is (\mathop{\mathbb{E}}\|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|, \|X\|_{\textrm{F}}^2)-sub-Gaussian.

Proof. Following the notation of (Wainwright 2015), define Y_k := \mathop{\mathbb{E}}\left[{ \|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|_2 | \epsilon_1,\ldots,\epsilon_k }\right], D_k := Y_k - Y_{k-1}, whereby Y_n - Y_0 = \sum_k D_k. For the base case, as usual \mathop{\mathbb{E}}\|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|_2 \leq \sqrt{ \mathop{\mathbb{E}}\|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|^2 } = \sqrt{ \sum_{j=1}^d\mathop{\mathbb{E}}(X_{:j}^{\scriptscriptstyle\mathsf{T}}\epsilon)^2 } = \sqrt{ \sum_{j=1}^d \| X_{:j}\|^2 } = \|X\|_{{\textrm{F}}}.

Supposing \epsilon and \epsilon' only differ on \epsilon_k, \begin{aligned} \sup_{\epsilon_k} | \|X^{\scriptscriptstyle\mathsf{T}}\epsilon\| - \|X^{\scriptscriptstyle\mathsf{T}}\epsilon'\| |^2 &\leq \sup_{\epsilon_k} \|X^{\scriptscriptstyle\mathsf{T}}(\epsilon- \epsilon') \|^2 = \sup_{\epsilon_k} \sum_{j=1}^d (X_{:j}^{\scriptscriptstyle\mathsf{T}}(\epsilon- \epsilon'))^2 \\ &= \sup_{\epsilon_k} \sum_{j=1}^d (X_{k,j} (\epsilon_k - \epsilon'_k))^2 \leq 4 \|X_{k:}\|^2, \end{aligned} therefore by the (conditional) Hoeffding lemma, D_k is \|X_{k:}\|^2-sub-Gaussian, thus (Theorem 2.3, Wainwright 2015) grants \sum_k D_k is \sigma^2-sub-Gaussian with \sigma^2 = \sum_k \|X_{k:}\|^2 = \|X\|_{\textrm{F}}^2.

Remark 19.6 (pointed out by Ziwei Ji) .
Alternatively, we can use the Lipschitz-convex concentration bound for bounded random variables, and get a variance proxy of roughly \|X\|_2. Plugging this into the full peeling proof, we get an interesting bound B^L\left({\|X\|_{\textrm{F}}+ \|X\|_2\sqrt{128L\ln(2)}}\right), thus dimension and depth don’t appear together.

Proof of Theorem 19.2 (Theorem 1, Golowich, Rakhlin, and Shamir 2018). For convenience, let X_i denote the output of layer i, meaning X_0 = X \quad\textup{and}\quad X_i := \sigma_i(X_{i-1} W_i^{\scriptscriptstyle\mathsf{T}}). Let t>0 be a free parameter and let w denote all parameters across all layers; the bulk of the proof will show (by induction on layers) that \mathop{\mathbb{E}}\sup_w \exp( t \|\epsilon^{\scriptscriptstyle\mathsf{T}}X_i\| ) \leq \mathop{\mathbb{E}}2^i \exp( t B^i \|\epsilon^{\scriptscriptstyle\mathsf{T}}X_0\| ).

To see how to complete the proof from here, note by the earlier “base case lemma” (setting \mu := \mathop{\mathbb{E}}\|X_0^{\scriptscriptstyle\mathsf{T}}\epsilon\| for convenience) and Jensen’s inequality that \begin{aligned} \textrm{URad}(\mathcal{F}_{|S}) &= \mathop{\mathbb{E}}\sup_w \epsilon^{\scriptscriptstyle\mathsf{T}}X_L = \mathop{\mathbb{E}}\frac 1 t \ln \sup_w \exp\left({ t \epsilon^{\scriptscriptstyle\mathsf{T}}X_L }\right) \\ &\leq \frac 1 t \ln \mathop{\mathbb{E}}\sup_w \exp\left({ t | \epsilon^{\scriptscriptstyle\mathsf{T}}X_L | }\right) \leq \frac 1 t \ln \mathop{\mathbb{E}}2^L \exp\left({ t B^L \| \epsilon^{\scriptscriptstyle\mathsf{T}}X_0 \| }\right) \\ &\leq \frac 1 t \ln \mathop{\mathbb{E}}2^L \exp\left({ t B^L( \| \epsilon^{\scriptscriptstyle\mathsf{T}}X_0 \| - \mu + \mu)}\right) \\ &\leq \frac 1 t \ln\left[{ 2^L \exp\left({ t^2 B^{2L} \|X\|_{\textrm{F}}^2/2 + tB^L \mu}\right) }\right] \\ &\leq \frac {L\ln 2} t + \frac {t B^{2L}\|X\|_{\textrm{F}}^2}{2} + B^L \|X\|_{\textrm{F}}, \end{aligned} whereby the final bound follows with the minimizing choice t := \sqrt{\frac{ 2 L \ln(2) }{ B^{2L}\|X\|_{\textrm{F}}^2 }} \ \Longrightarrow\ {} \textrm{URad}(\mathcal{F}_{|S}) \leq \sqrt{2\ln(2) L B^{2L}\|X\|_{\textrm{F}}^2} + B^L \|X\|_{\textrm{F}}.

The main inequality is now proved via induction.

For convenience, define \sigma := \sigma_i and Y := X_{i-1} and V := W_i and \tilde V has \ell_2-normalized rows. By positive homogeneity and definition, \begin{aligned} \sup_w \|\epsilon^{\scriptscriptstyle\mathsf{T}}X_i\|^2 &= \sup_w \sum_j (\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y V^{\scriptscriptstyle\mathsf{T}})_{:j})^2 \\ &= \sup_w \sum_j (\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y V_{j:}^{\scriptscriptstyle\mathsf{T}}))^2 \\ &= \sup_w \sum_j (\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(\|V_{j:}\| Y \tilde V_{j:}^{\scriptscriptstyle\mathsf{T}}))^2 \\ &= \sup_w \sum_j \|V_{j:}\|^2 (\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y \tilde V_{j:}^{\scriptscriptstyle\mathsf{T}}))^2. \end{aligned}

The maximum over row norms is attained by placing all mass on a single row; thus, letting u denote an arbitrary unit norm (column) vector, and finally applying the peeling lemma, and re-introducing the dropped terms, and closing with the IH, \begin{aligned} \mathop{\mathbb{E}}_\epsilon \exp\left({t \sqrt{ \sup_w \|\epsilon^{\scriptscriptstyle\mathsf{T}}X_i\|^2 } }\right) &= \mathop{\mathbb{E}}_\epsilon \exp \left({t \sqrt{ \sup_{w,u} B^2 (\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u))^2 } }\right) \\ &= \mathop{\mathbb{E}}_\epsilon\sup_{w,u} \exp\left({t B |\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u)| }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon\sup_{w,u} \exp\left({t B \epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u) }\right) + \exp\left({ - t B \epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u) }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon\sup_{w,u} \exp\left({t B \epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u) }\right) + \mathop{\mathbb{E}}_\epsilon\sup_{w,u} \exp\left({ - t B \epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u) }\right) \\ &= \mathop{\mathbb{E}}_\epsilon 2 \sup_{w,u} \exp\left({t B \epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u) }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon 2 \sup_{w,u} \exp\left({t B \epsilon^{\scriptscriptstyle\mathsf{T}}Y u }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon 2 \sup_{w} \exp\left({t B \| \epsilon^{\scriptscriptstyle\mathsf{T}}Y\|_2 }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon 2^i \sup_{w} \exp\left({t B^i \| \epsilon^{\scriptscriptstyle\mathsf{T}}X_0\|_2 }\right). \end{aligned}
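
As a rough side-by-side, the following sketch just evaluates the two layer-peeling bounds at made-up values; note the two theorems constrain different norms (\|W_i^{\scriptscriptstyle\mathsf{T}}\|_{1,\infty} versus \|W_i\|_{\textrm{F}}), so using a single B for both is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)

# Made-up sample and constants; rho = 1 (ReLU), depth L, shared norm budget B.
n, d, L, B, rho = 100, 10, 3, 1.5, 1.0
X = rng.normal(size=(n, d))

X_2inf = np.max(np.linalg.norm(X, axis=0))   # ||X||_{2,inf}: 2-norm per column, then max
X_fro = np.linalg.norm(X)                    # ||X||_F

bound_1inf = X_2inf * (2 * rho * B)**L * np.sqrt(2 * np.log(d))   # Theorem 19.1
bound_frob = B**L * X_fro * (1 + np.sqrt(2 * L * np.log(2)))      # Theorem 19.2
print(f"(1,inf) peeling bound : {bound_1inf:.1f}")
print(f"Frobenius peeling bound: {bound_frob:.1f}")
```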

20 Covering numbers

Definition 20.1.
Given a set U, scale \epsilon, norm \|\cdot\|, V\subseteq U is a (proper) cover when \sup_{a\in U} \inf_{b\in V} \|a-b\| \leq \epsilon. Let \mathcal{N}(U,\epsilon,\|\cdot\|) denote the covering number: the minimum cardinality (proper) cover.
Remark 20.1.

“Improper” covers drop the requirement V\subseteq U. (We’ll come back to this.)

Most treatments define special norms with normalization 1/n baked in; we’ll use unnormalized Rademacher complexity and covering numbers.

Although the definition can handle directly covering functions \mathcal{F}, we get nice bounds by covering \mathcal{F}_{|S}, and conceptually it also becomes easier, just a vector (or matrix) covering problem with vector (and matrix) norms.

20.1 Basic Rademacher-covering relationship

Theorem 20.1.
Given U\subseteq \mathbb{R}^n, \textrm{URad}(U) \leq \inf_{\alpha>0}\left({ \alpha\sqrt{n} + \left({\sup_{a\in U}\|a\|_2}\right)\sqrt{2 \ln \mathcal{N}(U,\alpha,\|\cdot\|_2)} }\right).
Remark 20.2.
\|\cdot\|_2 comes from applying Massart. It’s unclear how to handle other norms without some technical slop.

Proof. Let \alpha >0 be arbitrary, and suppose \mathcal{N}(U,\alpha,\|\cdot\|_2)<\infty (otherwise bound holds trivially). Let V denote a minimal cover, and V(a) its closest element to a\in U. Then \begin{aligned} \textrm{URad}(U) &= \mathop{\mathbb{E}}\sup_{a\in U} \left\langle \epsilon, a \right \rangle \\ &= \mathop{\mathbb{E}}\sup_{a\in U} \left\langle \epsilon, a- V(a) + V(a) \right \rangle \\ &= \mathop{\mathbb{E}}\sup_{a\in U} \left({\left\langle \epsilon, V(a) \right \rangle + \left\langle \epsilon, a - V(a) \right \rangle}\right) \\ &\leq \mathop{\mathbb{E}}\sup_{a\in U} \left({\left\langle \epsilon, V(a) \right \rangle + \|\epsilon\|\cdot\|a-V(a)\|}\right) \\ &\leq \textrm{URad}(V) + \alpha \sqrt{n} \\ &\leq \sup_{b\in V}(\|b\|_2) \sqrt{2 \ln |V| } + \alpha \sqrt{n} \\ &\leq \sup_{a\in U}(\|a\|_2) \sqrt{2 \ln |V| } + \alpha \sqrt{n}, \end{aligned} and the bound follows since \alpha>0 was arbitrary.
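As a sanity check of Theorem 20.1 (a sketch, not part of the development), one can estimate \textrm{URad}(U) by Monte Carlo for a small finite U and evaluate the right hand side with a greedy cover; since the greedy construction only upper bounds \mathcal{N}(U,\alpha,\|\cdot\|_2), the computed quantity is a valid, if loose, upper bound.

```python
# Sketch: Monte Carlo estimate of URad(U) versus the bound of Theorem 20.1
# evaluated with a greedy cover (which over-counts the covering number).
import numpy as np

def greedy_cover_size(U, eps):
    uncovered, count = np.ones(len(U), dtype=bool), 0
    while uncovered.any():
        i = int(np.argmax(uncovered))
        count += 1
        uncovered &= np.linalg.norm(U - U[i], axis=1) > eps
    return count

rng = np.random.default_rng(1)
n, N = 20, 200
U = rng.normal(size=(N, n))

# unnormalized Rademacher complexity: E sup_{a in U} <eps, a>
signs = rng.choice([-1.0, 1.0], size=(20000, n))
urad = np.mean(np.max(signs @ U.T, axis=1))

alpha = 3.0
cover_size = greedy_cover_size(U, alpha)       # upper bound on N(U, alpha, l2)
bound = alpha * np.sqrt(n) + np.linalg.norm(U, axis=1).max() \
        * np.sqrt(2.0 * np.log(cover_size))
print(urad, bound)                             # the estimate should sit below the bound
```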

Remark 20.3.

The same proof handles improper covers with minor adjustment: for every b\in V, there must be U(b)\in U with \|b-U(b)\|\leq \alpha (otherwise, b can be moved closer to U), thus \begin{aligned} \sup_{b\in V}\|b\|_2 &\leq \sup_{b\in V}\|b-U(b)\|_2 + \|U(b)\|_2 \leq \alpha + \sup_{a\in U} \|a\|_2. \end{aligned}

To handle other norms, superficially we need two adjustments: Cauchy-Schwarz can be replaced with Hölder, but it’s unclear how to replace Massart without slop relating different norms.

20.2 Second Rademacher-covering relationship: Dudley’s entropy integral

There is a classical result showing that covering numbers and Rademacher complexities are roughly equivalent: the upper bound uses the Dudley entropy integral below, and the lower bound uses a “Sudakov lower bound,” which we will not include here. The proof below also appears to go through for improper covers, though we state everything for proper covers.

Theorem 20.2 (Dudley) .
Let U\subseteq [-1,+1]^n be given with 0\in U. \begin{aligned} \textrm{URad}(U) &\leq \inf_{N \in \mathbb{Z}_{\geq 1}} \left({ n \cdot 2^{-N+1} + 6\sqrt{n} \sum_{i=1}^{N-1} 2^{-i} \sqrt{\ln \mathcal{N}(U, 2^{-i}\sqrt{n}, \|\cdot\|_2)} }\right) \\ &\leq \inf_{\alpha > 0} \left({ 4\alpha \sqrt{n} + 12\int_{\alpha}^{\sqrt{n}/2} \sqrt{\ln \mathcal{N}(U, \beta , \|\cdot\|_2)} {\text{d}}\beta }\right). \end{aligned}

Proof. We’ll do the discrete sum first. The integral follows by relating an integral to its Riemann sum.

For each i\geq 1, let \alpha_i := \sqrt{n}\, 2^{1-i}, and let V_i denote a minimal proper cover of U at scale \alpha_i in \|\cdot\|_2; since \sup_{a\in U}\|a\|_2\leq\sqrt{n}=\alpha_1 and 0\in U, we may take V_1 = \{0\}. Since U\ni a = (a - V_N(a)) + \sum_{i=1}^{N-1} \left({ V_{i+1}(a) - V_i(a) }\right) + V_1(a), \begin{aligned} &\textrm{URad}(U) \\ &= \mathop{\mathbb{E}}\sup_{a\in U}\left\langle \epsilon, a \right \rangle \\ &= \mathop{\mathbb{E}}\sup_{a\in U}\left({ \left\langle \epsilon, a-V_N(a) \right \rangle + \sum_{i=1}^{N-1} \left\langle \epsilon, V_{i+1}(a) - V_i(a) \right \rangle + \left\langle \epsilon, V_1(a) \right \rangle }\right) \\ &\leq \mathop{\mathbb{E}}\sup_{a\in U}\left\langle \epsilon, a-V_N(a) \right \rangle \\ &\qquad + \sum_{i=1}^{N-1} \mathop{\mathbb{E}}\sup_{a\in U}\left\langle \epsilon, V_{i+1}(a)-V_i(a) \right \rangle \\ &\qquad + \mathop{\mathbb{E}}\sup_{a\in U} \left\langle \epsilon, V_1(a) \right \rangle . \end{aligned} Let’s now control these terms separately.

The first and last terms are easy: \begin{aligned} \mathop{\mathbb{E}}\sup_{a\in U}\left\langle \epsilon, V_1(a) \right \rangle &= \mathop{\mathbb{E}}\left\langle \epsilon, 0 \right \rangle = 0, \\ \mathop{\mathbb{E}}\sup_{a\in U}\left\langle \epsilon, a-V_N(a) \right \rangle &\leq \mathop{\mathbb{E}}\sup_{a\in U}\|\epsilon\|\|a-V_N(a)\| \leq \sqrt{n} \alpha_N = n 2^{1-N}. \end{aligned} For the middle term, define the increment class W_i := \{ V_{i+1}(a) - V_i(a) : a\in U\}, whereby |W_i| \leq |V_{i+1}|\cdot |V_i| \leq |V_{i+1}|^2, and \begin{aligned} &\mathop{\mathbb{E}}\sup_{a \in U} \left\langle \epsilon, V_{i+1}(a) - V_i(a) \right \rangle = \textrm{URad}(W_i) \\ &\leq \left({\sup_{w\in W_i}\|w\|_2}\right) \sqrt{2 \ln |W_i|} \leq \left({\sup_{w\in W_i}\|w\|_2}\right) \sqrt{4 \ln |V_{i+1}|}, \\ \sup_{w\in W_i} \|w\| &\leq \sup_{a\in U} \left({ \|V_{i+1}(a) - a\| + \|a - V_i(a)\| }\right) \leq \alpha_{i+1} + \alpha_i = 3 \alpha_{i+1}. \end{aligned} Combining these bounds, \textrm{URad}(U) \leq n 2^{1-N} + 0 + \sum_{i=1}^{N-1} 6 \sqrt{n}\, 2^{-i}\sqrt{\ln \mathcal{N}(U,2^{-i}\sqrt{n},\|\cdot\|_2)} . N\geq 1 was arbitrary, so applying \inf_{N\geq 1} gives the first bound.

Since \ln \mathcal{N}(U,\beta,\|\cdot\|_2) is nonincreasing in \beta, the integral upper bounds the Riemann sum: \begin{aligned} \textrm{URad}(U) &\leq n2^{1-N} + 6\sum_{i=1}^{N-1} \alpha_{i+1} \sqrt{\ln \mathcal{N}(U,\alpha_{i+1}, \|\cdot\|)} \\ &= n2^{1-N} + 12\sum_{i=1}^{N-1} \left({ \alpha_{i+1} - \alpha_{i+2} }\right) \sqrt{\ln \mathcal{N}(U,\alpha_{i+1}, \|\cdot\|)} \\ &\leq \sqrt{n} \alpha_N + 12 \int_{\alpha_{N+1}}^{\alpha_2} \sqrt{\ln \mathcal{N}(U,\beta, \|\cdot\|)} {\text{d}}\beta. \end{aligned} To finish, given \alpha\in(0,\sqrt{n}/2], pick N\geq 1 with \alpha_{N+1} \geq \alpha > \alpha_{N+2} = \frac {\alpha_{N+1}}{2} = \frac {\alpha_{N}}{4}, whereby \begin{aligned} \textrm{URad}(U) &\leq \sqrt{n} \alpha_N + 12 \int_{\alpha_{N+1}}^{\alpha_2} \sqrt{\ln \mathcal{N}(U,\beta, \|\cdot\|)} {\text{d}}\beta \\ &\leq 4 \sqrt{n} \alpha + 12 \int_{\alpha}^{\sqrt{n}/2} \sqrt{\ln \mathcal{N}(U,\beta, \|\cdot\|)} {\text{d}}\beta \end{aligned} (for \alpha > \sqrt{n}/2 the bound holds trivially, since \textrm{URad}(U)\leq n \leq 4\alpha\sqrt{n}).
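The discrete form of the bound is easy to evaluate numerically; the following sketch (an illustration only, with an arbitrary random U) computes the first bound of Theorem 20.2 for a finite U\subseteq[-1,+1]^n, again using greedy covers in place of true covering numbers, and optimizes over N.

```python
# Sketch: evaluate the discrete Dudley bound of Theorem 20.2 for a finite
# U inside [-1,+1]^n, using greedy covers as (upper bounds on) covering numbers.
import numpy as np

def greedy_cover_size(U, eps):
    uncovered, count = np.ones(len(U), dtype=bool), 0
    while uncovered.any():
        i = int(np.argmax(uncovered))
        count += 1
        uncovered &= np.linalg.norm(U - U[i], axis=1) > eps
    return count

def dudley_discrete(U, N):
    # n 2^{1-N} + 6 sqrt(n) sum_{i=1}^{N-1} 2^{-i} sqrt(ln N(U, 2^{-i} sqrt(n)))
    n = U.shape[1]
    total = n * 2.0 ** (1 - N)
    for i in range(1, N):
        Ni = greedy_cover_size(U, 2.0 ** (-i) * np.sqrt(n))
        total += 6.0 * np.sqrt(n) * 2.0 ** (-i) * np.sqrt(np.log(Ni))
    return total

rng = np.random.default_rng(2)
n = 16
U = np.clip(rng.normal(scale=0.5, size=(300, n)), -1.0, 1.0)
U[0] = 0.0                                   # Theorem 20.2 assumes 0 is in U
print(min(dudley_discrete(U, N) for N in range(1, 9)))
```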

Remark 20.4.

Tightness of Dudley: Sudakov’s lower bound says there exists a universal constant c>0 with \textrm{URad}(U) \geq \frac {c}{\ln(n)} \sup_{\alpha > 0} \alpha \sqrt{\ln\mathcal{N}(U,\alpha,\|\cdot\|)}, which implies \textrm{URad}(U) = \widetilde\Theta\left({\textup{Dudley entropy integral}}\right).

Taking the notion of increments to heart and generalizing the proof gives the concept of chaining. One key question there is tightening the relationship with Rademacher complexity (shrinking constants and log factors in the above bound).

Another term for covering is “metric entropy.”

Recall once again that we drop the normalization 1/n both from \textrm{URad} and from the norms used when covering.

21 Two deep network covering number bounds

We will give two covering number bounds; combined with the tools of the previous section, each yields a generalization bound.

21.1 First covering number bound: Lipschitz functions

This bound is intended as a point of contrast with our deep network generalization bounds.

Theorem 21.1.
Let data S = (x_1,\ldots,x_n) be given with R := \max_{i,j} \|x_i-x_j\|_\infty. Let \mathcal{F} denote all \rho-Lipschitz functions from [-R,+R]^d\to [-B,+B] (where Lipschitz is measured wrt \|\cdot\|_\infty), and let \|\cdot\|_{{\textrm{u}}} denote the uniform norm \|f\|_{{\textrm{u}}} := \sup_{x\in[-R,+R]^d}|f(x)|. Then the improper covering number \widetilde{\mathcal{N}} satisfies \ln \widetilde{\mathcal{N}}(\mathcal{F},\epsilon,\|\cdot\|_{{\textrm{u}}}) \leq \max\left\{{0, \left\lceil \frac {4\rho (R+\epsilon)}{\epsilon}\right\rceil^d \ln \left\lceil \frac {2B}{\epsilon}\right\rceil }\right\}.
Remark 21.1.

Exponential in dimension!

Revisiting the “point of contrast” comment above, our deep network generalization bounds are polynomial and not exponential in dimension; consequently, we really are doing much better than simply treating the networks as arbitrary Lipschitz functions.

Proof. Subdivide [-R,+R]^d into a grid U of cubes of side length \epsilon/(2\rho); there are at most \lceil 4\rho(R+\epsilon)/\epsilon\rceil^d of them. Let V\subseteq[-B,+B] consist of the midpoints of \lceil 2B/\epsilon\rceil intervals of length at most \epsilon partitioning [-B,+B], so every value in [-B,+B] is within \epsilon/2 of V, and let \mathcal{G} denote all functions which are constant on each cube C\in U with value in V; consequently \ln|\mathcal{G}|\leq |U|\ln|V|, which is the claimed bound.

To show this is an improper cover, given f\in\mathcal{F}, choose g\in\mathcal{G} by proceeding over each C \in U, and assigning g_{|C}\in V to be the closest element to f(x_C), where x_C is the midpoint of C. Then \begin{aligned} \|f-g\|_{{\textrm{u}}} &= \sup_{C\in U} \sup_{x\in C} |f(x) - g(x)| \\ &\leq \sup_{C\in U} \sup_{x\in C} \left({ |f(x) - f(x_C)| + |f(x_C) - g(x)| }\right) \\ &\leq \sup_{C\in U} \sup_{x\in C} \left({ \rho \|x- x_C\|_\infty + \frac{\epsilon}{2} }\right) \\ &\leq \sup_{C\in U} \sup_{x\in C} \left({ \rho( \epsilon/ (4\rho) ) + \frac{\epsilon}{2} }\right) \leq \epsilon \end{aligned}
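To see the exponential dependence numerically, the following sketch evaluates the bound of Theorem 21.1 for a few dimensions; the constants \rho, R, B, \epsilon are arbitrary choices.

```python
# Sketch: the Lipschitz-class log covering bound of Theorem 21.1 grows
# exponentially with the dimension d (constants here are arbitrary choices).
import math

def lipschitz_log_cover(d, rho=1.0, R=1.0, B=1.0, eps=0.1):
    return max(0, math.ceil(4 * rho * (R + eps) / eps) ** d
                  * math.log(math.ceil(2 * B / eps)))

for d in (1, 2, 4, 8):
    print(d, lipschitz_log_cover(d))
```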


21.2 “Spectrally-normalized” covering number bound

Theorem 21.2 (P. Bartlett, Foster, and Telgarsky (2017)) .
Fix multivariate activations (\sigma_i)_{i=1}^L with \|\sigma_i\|_{\textrm{Lip}}=: \rho_i and \sigma_i(0)=0, and data X\in\mathbb{R}^{n\times d}, and define \begin{aligned} \mathcal{F}_n := \Bigg\{ \sigma_L(W_L\sigma_{L-1}\cdots \sigma_1(W_1 X^{\scriptscriptstyle\mathsf{T}})\cdots )\ :\ {} \|W_i^{\scriptscriptstyle\mathsf{T}}\|_2 \leq s_i, \|W_i^{\scriptscriptstyle\mathsf{T}}\|_{2,1}\leq b_i \Bigg\}, \end{aligned} where all matrix dimensions are at most m. Then \ln\mathcal{N}\left({ \mathcal{F}_n,\epsilon, \|\cdot\|_{\textrm{F}} }\right) \leq \frac {\|X\|_{\textrm{F}}^2\prod_{j=1}^L \rho_j^2 s_j^2}{\epsilon^2} \left({ \sum_{i=1}^L \left({\frac{b_i}{s_i}}\right)^{2/3} }\right)^3 \ln(2m^2).
Remark 21.2.
Applying Dudley’s entropy integral to this covering number bound gives \textrm{URad}(\mathcal{F}_n)=\widetilde{\mathcal{O}}\left({ \|X\|_{\textrm{F}}\left[{ \prod_{j=1}^L \rho_j s_j }\right]\cdot \left[{ \sum_{i=1}^L \left({\frac{b_i}{s_i}}\right)^{2/3} }\right]^{3/2} }\right); we omit the routine calculation. (A small numerical sketch of this quantity appears after this remark.)

Proof uses \|\sigma(M) - \sigma(M')\|_{\textrm{F}}\leq \|\sigma\|_{\textrm{Lip}}\cdot\|M-M'\|_{\textrm{F}}; in particular, it allows multi-variate gates like max-pooling! See (P. Bartlett, Foster, and Telgarsky 2017) for \|\sigma_i\|_{\textrm{Lip}} estimates.

This proof can be adjusted to handle “distance to initialization”; see (P. Bartlett, Foster, and Telgarsky 2017) and the notion “reference matrices.”

Let’s compare to our best “layer peeling” proof from before, which had \prod_i \|W_i\|_{\textrm{F}}\lesssim m^{L/2} \prod_i \|W_i\|_2. That proof assumed \rho_i = 1, so the comparison boils down to m^{L/2} \left({ \prod_i \|W_i\|_2 }\right) \quad\textup{vs.}\quad \left[{ \sum_i \left({\frac{\|W_i^{\scriptscriptstyle\mathsf{T}}\|_{2,1}^{2/3}}{\|W_i\|_2^{2/3}}}\right)}\right]^{3/2} \left({ \prod_i \|W_i\|_2 }\right), where L \leq \sum_i \left({\frac{\|W_i^{\scriptscriptstyle\mathsf{T}}\|_{2,1}^{2/3}}{\|W_i\|_2^{2/3}}}\right) \leq Lm^{2/3}. So the bound is better but still leaves a lot to be desired, and is loose in practice.

It is not clear how to prove exactly this bound with Rademacher peeling, which is a little eerie (independent of whether this bound is good or not).

The proof, as with Rademacher peeling proofs, is an induction on layers, similarly one which does not “coordinate” the behavior of the layers; this is one source of looseness.
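All quantities in the preceding remark are easy to compute from the weights; the following sketch (with arbitrary random weights, \rho_j = 1, and the convention that \|W_i^{\scriptscriptstyle\mathsf{T}}\|_{2,1} is the sum of the \ell_2 norms of the rows of W_i) evaluates the resulting complexity term.

```python
# Sketch: compute the "spectral complexity" appearing in Remark 21.2,
#   ||X||_F * (prod_j rho_j s_j) * (sum_i (b_i/s_i)^{2/3})^{3/2},
# from explicit weight matrices; rho_j = 1 (e.g., ReLU), s_i is the spectral
# norm of W_i, and b_i is taken as the sum of row l2-norms of W_i.
import numpy as np

def spectral_complexity(weights, X, rhos=None):
    rhos = [1.0] * len(weights) if rhos is None else rhos
    s = [np.linalg.norm(W, 2) for W in weights]             # spectral norms
    b = [np.linalg.norm(W, axis=1).sum() for W in weights]  # row-wise (2,1) norms
    prod = np.prod([r * si for r, si in zip(rhos, s)])
    poly = sum((bi / si) ** (2.0 / 3.0) for bi, si in zip(b, s)) ** 1.5
    return np.linalg.norm(X, 'fro') * prod * poly

rng = np.random.default_rng(3)
dims = [10, 32, 32, 1]                                       # arbitrary widths
weights = [rng.normal(size=(dims[i + 1], dims[i])) / np.sqrt(dims[i])
           for i in range(len(dims) - 1)]
X = rng.normal(size=(100, 10))
print(spectral_complexity(weights, X))
```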

Remark 21.3 (practical regularization schemes) .
This bound suggests regularization based primarily on the Lipschitz constant of the network; similar ideas appeared in parallel applied work, both for classification problems (Cisse et al. 2017), and for GANs (Arjovsky, Chintala, and Bottou 2017).
Remark 21.4 (another proof) .
For an alternate proof of a similar fact (albeit requiring univariate gates), see (Neyshabur, Bhojanapalli, and Srebro 2018).

The first step of the proof is a covering number for individual layers.

Lemma 21.1.
Fix a data matrix X\in\mathbb{R}^{n\times d}. Then \ln \mathcal{N}(\{ WX^{\scriptscriptstyle\mathsf{T}}: W\in\mathbb{R}^{m\times d}, \|W^{\scriptscriptstyle\mathsf{T}}\|_{2,1}\leq b\}, \epsilon, \|\cdot\|_{\textrm{F}}) \leq\left\lceil \frac{\|X\|_{{\textrm{F}}}^2 b^2}{\epsilon^2} \right\rceil \ln (2dm).

Proof. Let W\in\mathbb{R}^{m\times d} be given with \|W^{\scriptscriptstyle\mathsf{T}}\|_{2,1} \leq b. Define s_{ij} := W_{ij}/|W_{ij}| (with s_{ij}:=1 when W_{ij}=0), and note \begin{aligned} WX^{\scriptscriptstyle\mathsf{T}} = \sum_{i,j} \mathbf{e}_i\mathbf{e}_i^{\scriptscriptstyle\mathsf{T}}W \mathbf{e}_j\mathbf{e}_j^{\scriptscriptstyle\mathsf{T}}X^{\scriptscriptstyle\mathsf{T}} = \sum_{i,j} \mathbf{e}_i W_{ij} (X\mathbf{e}_j)^{\scriptscriptstyle\mathsf{T}} = \sum_{i,j} \underbrace{\frac{|W_{ij}| \|X\mathbf{e}_j\|_2}{b\|X\|_{\textrm{F}}}}_{=:q_{ij}} \underbrace{\frac{b \|X\|_{\textrm{F}}s_{ij}\mathbf{e}_i (X\mathbf{e}_j)^{\scriptscriptstyle\mathsf{T}}}{\|X\mathbf{e}_j\|}}_{=:U_{ij}}. \end{aligned} Note by Cauchy-Schwarz that \sum_{i,j} q_{ij} \leq \frac {1}{b\|X\|_{\textrm{F}}} \sum_i \sqrt{\sum_j W_{ij}^2}\|X\|_{\textrm{F}} = \frac {\|W^{\scriptscriptstyle\mathsf{T}}\|_{2,1} \|X\|_{\textrm{F}}}{b\|X\|_{\textrm{F}}} \leq 1, potentially with strict inequality, so q need not be a probability vector, though we will want one shortly. To remedy this, construct a probability vector p from q by adding the missing mass, split equally, onto some U_{ij} and its negation, so that the above summation form of WX^{\scriptscriptstyle\mathsf{T}} is unchanged when q is replaced by p.

Now define IID random variables (V_1,\ldots,V_k), each drawn according to p (so that \mathop{\mathbb{E}}V_l = \sum_{i,j} p_{ij} U_{ij} = \sum_{i,j} q_{ij} U_{ij} = WX^{\scriptscriptstyle\mathsf{T}}), and note \begin{aligned} \|U_{ij}\| &= \left\|{\frac {s_{ij} \mathbf{e}_i (X\mathbf{e}_j)^{\scriptscriptstyle\mathsf{T}}}{\|X\mathbf{e}_j\|_2} }\right\|_{\textrm{F}}\cdot b \|X\|_{\textrm{F}} = |s_{ij}|\cdot\|\mathbf{e}_i\|_2 \cdot\left\|{\frac{X\mathbf{e}_j}{\|X\mathbf{e}_j\|_2} }\right\|_2 \cdot b \|X\|_{\textrm{F}} = b \|X\|_{\textrm{F}}, \\ \mathop{\mathbb{E}}\|V_l\|^2 &= \sum_{i,j} p_{ij} \|U_{ij}\|^2 \leq \sum_{ij} p_{ij} b^2 \|X\|_{\textrm{F}}^2 = b^2 \|X\|_{\textrm{F}}^2. \end{aligned} By Lemma 5.1 (Maurey (Pisier 1980)), there exist ({\hat V}_1,\ldots,{\hat V}_k), each a copy of some \pm U_{ij}, with \left\|{ WX^{\scriptscriptstyle\mathsf{T}}- \frac 1 k \sum_l {\hat V}_l }\right\|^2 \leq \mathop{\mathbb{E}}\left\|{ \mathop{\mathbb{E}}V_1 - \frac 1 k \sum_l V_l }\right\|^2 \leq \frac 1 k \mathop{\mathbb{E}}\|V_1\|^2 \leq \frac {b^2 \|X\|_{\textrm{F}}^2} k. Furthermore, the matrices {\hat V}_l have the form \frac 1 k \sum_l {\hat V}_l = \frac {b\|X\|_{\textrm{F}}} k \sum_l \frac {s_l\mathbf{e}_{i_l} (X\mathbf{e}_{j_l})^{\scriptscriptstyle\mathsf{T}}}{\|X\mathbf{e}_{j_l}\|} = \left[{ \frac {b\|X\|_{\textrm{F}}} k \sum_l \frac {s_l\mathbf{e}_{i_l} \mathbf{e}_{j_l}^{\scriptscriptstyle\mathsf{T}}}{\|X\mathbf{e}_{j_l}\|} }\right] X^{\scriptscriptstyle\mathsf{T}}, so there are at most (2md)^k choices for ({\hat V}_1,\ldots,{\hat V}_k). Choosing k := \left\lceil \|X\|_{\textrm{F}}^2 b^2/\epsilon^2 \right\rceil makes the squared error at most \epsilon^2, which gives the stated bound.
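The Maurey sampling step can be checked empirically: the following sketch (an illustration, with arbitrary random W and X) builds the dictionary (U_{ij}) and sampling probabilities from the proof, draws k elements, and compares the Frobenius error of the empirical average against the b\|X\|_{\textrm{F}}/\sqrt{k} rate.

```python
# Sketch: empirical check of the Maurey sampling step in Lemma 21.1.
import numpy as np

rng = np.random.default_rng(4)
m, d, n = 8, 6, 50
W = rng.normal(size=(m, d))
X = rng.normal(size=(n, d))
b = np.linalg.norm(W, axis=1).sum()            # ||W^T||_{2,1}: sum of row l2 norms
XF = np.linalg.norm(X, 'fro')
col_norms = np.linalg.norm(X, axis=0)          # ||X e_j||_2 for each column j

target = W @ X.T                               # the matrix W X^T being approximated
s = np.sign(W)
s[s == 0] = 1.0

def dict_elem(i, j, sign):
    # U_ij = sign * b * ||X||_F * e_i (X e_j)^T / ||X e_j||, an m-by-n matrix
    return sign * b * XF * np.outer(np.eye(m)[i], X[:, j]) / col_norms[j]

items = [(i, j, s[i, j]) for i in range(m) for j in range(d)]
probs = [abs(W[i, j]) * col_norms[j] / (b * XF) for i in range(m) for j in range(d)]
leftover = 1.0 - sum(probs)                    # pad with a +/- pair so E is unchanged
items += [(0, 0, 1.0), (0, 0, -1.0)]
probs += [leftover / 2.0, leftover / 2.0]
probs = np.array(probs)

for k in (10, 100, 1000, 10000):
    idx = rng.choice(len(items), size=k, p=probs)
    approx = np.mean([dict_elem(*items[t]) for t in idx], axis=0)
    print(k, np.linalg.norm(target - approx, 'fro'), b * XF / np.sqrt(k))
```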

Lemma 21.2.
Let \mathcal{F}_n be the same set of image vectors as in Theorem 21.2, and let per-layer tolerances (\epsilon_1,\ldots,\epsilon_L) be given. Then \ln\mathcal{N}\left({ \mathcal{F}_n,\ {} \sum_{j=1}^{L} \rho_j \epsilon_j \prod_{k=j+1}^{L} \rho_k s_k,\ {} \|\cdot\|_{\textrm{F}}}\right) \leq \sum_{i=1}^L \left\lceil \frac { \|X\|_{\textrm{F}}^2 b_i^2 \prod_{j<i} \rho_j^2 s_j^2}{\epsilon_i^2} \right\rceil \ln(2m^2).

Proof. Let X_i denote the output of layer i of the network, using weights (W_i,\ldots,W_1), meaning X_0 := X \qquad\text{and}\qquad X_i := \sigma_i(X_{i-1} W_i^{\scriptscriptstyle\mathsf{T}}).

The proof recursively constructs cover elements {\hat X}_i and weights {\hat W}_i for each layer with the following basic properties: starting from {\hat X}_0 := X_0, the cover element {\hat X}_{i-1} is fed into Lemma 21.1 (at scale \epsilon_i, under the constraint \|W_i^{\scriptscriptstyle\mathsf{T}}\|_{2,1}\leq b_i) to produce {\hat W}_i with \|{\hat X}_{i-1}W_i^{\scriptscriptstyle\mathsf{T}}- {\hat X}_{i-1}{\hat W}_i^{\scriptscriptstyle\mathsf{T}}\|_{\textrm{F}}\leq \epsilon_i, whereafter {\hat X}_i := \sigma_i({\hat X}_{i-1}{\hat W}_i^{\scriptscriptstyle\mathsf{T}}). The total cover cardinality is at most the product of the per-layer cardinalities from Lemma 21.1.

It remains to prove, by induction, an error guarantee \|X_i - {\hat X}_i\|_{\textrm{F}} \leq \sum_{j=1}^{i} \rho_j \epsilon_j \prod_{k=j+1}^{i} \rho_k s_k.

The base case \|X_0-{\hat X}_0\|_{\textrm{F}}= 0 holds directly (at i=0 the sum is empty). For the inductive step, by the above ingredients and the triangle inequality, \begin{aligned} \| X_i - {\hat X}_i \|_{\textrm{F}} &\leq \rho_i \|X_{i-1}W_i^{\scriptscriptstyle\mathsf{T}}- {\hat X}_{i-1}{\hat W}_i^{\scriptscriptstyle\mathsf{T}}\|_{\textrm{F}} \\ &\leq \rho_i \| X_{i-1}W_{i}^{\scriptscriptstyle\mathsf{T}}- {\hat X}_{i-1}W_i^{\scriptscriptstyle\mathsf{T}}\|_{\textrm{F}} + \rho_i \| {\hat X}_{i-1}W_i^{\scriptscriptstyle\mathsf{T}}- {\hat X}_{i-1}{\hat W}_i^{\scriptscriptstyle\mathsf{T}}\|_{\textrm{F}} \\ &\leq \rho_i s_i \| X_{i-1} - {\hat X}_{i-1}\|_{\textrm{F}} + \rho_i \epsilon_i \\ &\leq \rho_i s_i \left[{ \sum_{j=1}^{i-1} \rho_j \epsilon_j \prod_{k=j+1}^{i-1} \rho_k s_k }\right] + \rho_i \epsilon_i \\ &= \left[{ \sum_{j=1}^{i-1} \rho_j \epsilon_j \prod_{k=j+1}^{i} \rho_k s_k }\right] + \rho_i \epsilon_i \\ &= \sum_{j=1}^{i} \rho_j \epsilon_j \prod_{k=j+1}^{i} \rho_k s_k. \end{aligned}

Proof of Theorem 21.2 (P. Bartlett, Foster, and Telgarsky (2017)). By solving a Lagrangian (minimize cover size subject to total error \leq \epsilon), choose \epsilon_i := \frac {\alpha_i \epsilon}{\rho_i \prod_{j>i} \rho_j s_j}, \qquad \alpha_i := \frac 1 \beta \left({\frac{b_i}{s_i}}\right)^{2/3}, \qquad \beta := \sum_{i=1}^L \left({\frac{b_i}{s_i}}\right)^{2/3}. Invoking the induction lemma with these choices, the resulting cover error is \sum_{i=1}^L \epsilon_i \rho_i \prod_{j>i} \rho_j s_j = \epsilon\sum_{i=1}^L \alpha_i = \epsilon, and the main term of the cardinality (ignoring \ln(2m^2)) satisfies \begin{aligned} & \sum_{i=1}^L \frac { \|X\|_{\textrm{F}}^2 b_i^2 \prod_{j<i} \rho_j^2 s_j^2}{\epsilon_i^2} = \frac {\|X\|_{\textrm{F}}^2}{\epsilon^2} \sum_{i=1}^L \frac{b_i^2\prod_{j=1}^L \rho_j^2 s_j^2}{\alpha_i^2 s_i^2} \\ &= \frac {\|X\|_{\textrm{F}}^2\prod_{j=1}^L \rho_j^2 s_j^2}{\epsilon^2} \sum_{i=1}^L \frac{\beta^2 b_i^{2/3}}{s_i^{2/3}} = \frac {\|X\|_{\textrm{F}}^2\prod_{j=1}^L \rho_j^2 s_j^2}{\epsilon^2} \left({ \sum_{i=1}^L \left({\frac{b_i}{s_i}}\right)^{2/3} }\right)^3. \end{aligned}
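Here, for completeness, is a sketch of the omitted Lagrangian computation (only the choice of per-layer tolerances is being derived; the notation is that of the proof above). Writing a_i := \|X\|_{\textrm{F}}^2 b_i^2 \prod_{j<i}\rho_j^2 s_j^2 and d_i := \rho_i \prod_{k>i}\rho_k s_k, we minimize \sum_i a_i/\epsilon_i^2 subject to \sum_i d_i\epsilon_i = \epsilon. The stationarity condition of the Lagrangian \sum_i a_i \epsilon_i^{-2} + \lambda\left({\sum_i d_i\epsilon_i - \epsilon}\right) is \frac{2 a_i}{\epsilon_i^3} = \lambda d_i \qquad\Longrightarrow\qquad \epsilon_i \propto \left({\frac{a_i}{d_i}}\right)^{1/3} = \left({\frac{\|X\|_{\textrm{F}}^2 b_i^2 \prod_{j=1}^L \rho_j^2 s_j^2}{\rho_i^3 s_i^2 \left({\prod_{k>i}\rho_k s_k}\right)^3}}\right)^{1/3} \propto \frac{(b_i/s_i)^{2/3}}{\rho_i\prod_{k>i}\rho_k s_k}; normalizing so that \sum_i d_i\epsilon_i = \epsilon recovers exactly the stated (\epsilon_i,\alpha_i,\beta).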

22 VC dimension

The VC dimension proofs in this section are interesting in their own right, and reveal structure in neural networks not captured by the preceding bounds.

First, some definitions: the zero-one/classification risk/error is \mathcal{R}_{\textrm{z}}(\textrm{sgn}(f)) = \mathop{\textrm{Pr}}[\textrm{sgn}(f(X)) \neq Y], \ {} \widehat{\mathcal{R}}_{\textrm{z}}(\textrm{sgn}(f)) = \frac 1 n \sum_{i=1}^n \mathbf{1}[\textrm{sgn}(f(x_i))\neq y_i]. The earlier Rademacher bound will now involve \textrm{URad}\left({\left\{{ (x,y)\mapsto \mathbf{1}[\textrm{sgn}(f(x))\neq y] : f\in\mathcal{F}}\right\}_{|S}}\right). The restricted set has cardinality at most 2^n; we’ll reduce it to a combinatorial quantity: \begin{aligned} \textrm{sgn}(U) &:= \left\{{ (\textrm{sgn}(u_1),\ldots,\textrm{sgn}(u_n)) : u\in U }\right\}, \\ \textrm{Sh}(\mathcal{F}_{|S}) &:= \left|{ \textrm{sgn}(\mathcal{F}_{|S}) }\right|, \\ \textrm{Sh}(\mathcal{F}; n) &:= \sup_{\substack{S \\ |S|\leq n}} \left|{ \textrm{sgn}(\mathcal{F}_{|S}) }\right|, \\ \textrm{VC}(\mathcal{F}) &:= \sup\{ i \in \mathbb{Z}_{\geq 0} : \textrm{Sh}(\mathcal{F};i) = 2^i \}. \end{aligned}

Remark 22.1.

\textrm{Sh} is “shatter coefficient,” \textrm{VC} is “VC dimension.”

Both quantities are criticized as being too tied to their worst case; bounds here depend on (empirical quantity!) \textrm{URad}(\textrm{sgn}(\mathcal{F}_{|S})), which can be better, but throws out the labels.

Theorem 22.1 (“VC Theorem”) .
With probability at least 1-\delta, every f\in\mathcal{F} satisfies \mathcal{R}_{\textrm{z}}(\textrm{sgn}(f)) \leq \widehat{\mathcal{R}}_{\textrm{z}}(\textrm{sgn}(f)) + \frac 2 n \textrm{URad}(\textrm{sgn}(\mathcal{F}_{|S})) + 3 \sqrt{\frac{\ln(2/\delta)}{2n}}, and \begin{aligned} \textrm{URad}(\textrm{sgn}(\mathcal{F}_{|S})) &\leq \sqrt{ 2n \ln \textrm{Sh}(\mathcal{F}_{|S})},\\ \ln \textrm{Sh}(\mathcal{F}_{|S}) &\leq \ln \textrm{Sh}(\mathcal{F};n) \leq \textrm{VC}(\mathcal{F}) \ln (n+1). \end{aligned}
Remark 22.2.

Roughly speaking, in order to learn (i.e., for the generalization term above to vanish), we need \ln \textrm{Sh}(\mathcal{F};n) = o(n).

Minimizing \widehat{\mathcal{R}}_{\textrm{z}} is NP-hard in many trivial cases, but those require noise and neural networks can often get \widehat{\mathcal{R}}_{\textrm{z}}(\textrm{sgn}(f)) = 0.

\textrm{VC}(\mathcal{F})<\infty suffices; many considered this a conceptual breakthrough, namely “learning is possible!”

The quantities (\textrm{VC}, \textrm{Sh}) appeared in prior work (not by V-C). Symmetrization apparently too, though I haven’t dug this up.

First step of proof: pull out the zero-one loss.

Lemma 22.1.
\textrm{URad}(\{(x,y)\mapsto \mathbf{1}[\textrm{sgn}(f(x))\neq y] : f\in\mathcal{F}\}_{|S}) \leq \textrm{URad}(\textrm{sgn}(\mathcal{F}_{|S})).

Proof. For each i, define \ell_i(z) := \max\left\{{0,\min\left\{{1, \frac{1-y_i(2z-1)}{2} }\right\}}\right\}, which is 1-Lipschitz, and satisfies \ell_i(\textrm{sgn}(f(x_i))) = \mathbf{1}[ \textrm{sgn}(f(x_i)) \neq y_i ]. (Indeed, it is the linear interpolation.) Then \begin{aligned} &\textrm{URad}(\left\{{(x,y)\mapsto \mathbf{1}[ \textrm{sgn}(f(x))\neq y] : f\in\mathcal{F}}\right\}_{|S}) \\ &=\textrm{URad}(\left\{{(\ell_1(\textrm{sgn}(f(x_1))),\ldots, \ell_n(\textrm{sgn}(f(x_n)))):f\in\mathcal{F}}\right\}_{|S}) \\ &=\textrm{URad}(\ell \circ \textrm{sgn}(\mathcal{F})_{|S}) \\ &\leq \textrm{URad}(\textrm{sgn}(\mathcal{F})_{|S}). \end{aligned}

(The last inequality uses the Lipschitz composition rule coordinate-wise, since each coordinate has its own 1-Lipschitz \ell_i.)

Plugging this into our Rademacher bound: with probability at least 1-\delta, every f\in\mathcal{F} satisfies \mathcal{R}_{\textrm{z}}(\textrm{sgn}(f)) \leq \widehat{\mathcal{R}}_{\textrm{z}}(\textrm{sgn}(f)) + \frac 2 n \textrm{URad}(\textrm{sgn}(\mathcal{F})_{|S}) + 3 \sqrt{\frac{\ln(2/\delta)}{2n}}.

The next step is to apply Massart’s finite lemma, giving \textrm{URad}(\textrm{sgn}(\mathcal{F}_{|S})) \leq \sqrt{ 2n \ln \textrm{Sh}(\mathcal{F}_{|S})}.

One last lemma remains for the proof.


Lemma 22.2 (Sauer-Shelah; variants appear in works of Vapnik-Chervonenkis and of Warren) .
Let \mathcal{F} be given, and define V := \textrm{VC}(\mathcal{F}). Then \textrm{Sh}(\mathcal{F};n) \leq \begin{cases} 2^n & \textup{when } n\leq V,\\ \left({\frac{en}{V}}\right)^V & \textup{otherwise}. \end{cases} Moreover, \textrm{Sh}(\mathcal{F};n)\leq n^V + 1.

(Proof. Omitted. Exists in many standard texts.)


22.1 VC dimension of linear predictors

Theorem 22.2.
Define \mathcal{F}:= \left\{{ x \mapsto \textrm{sgn}(\left\langle a, x \right \rangle -b) : a\in\mathbb{R}^d, b\in\mathbb{R}}\right\} (“linear classifiers”/“affine classifier”/ “linear threshold function (LTF)”). Then \textrm{VC}(\mathcal{F}) = d+1.
Remark 22.3.

By Sauer-Shelah, \textrm{Sh}(\mathcal{F};n) \leq n^{d+1} + 1. Anthony-Bartlett chapter 3 gives an exact expression, which only changes the constants in \ln\textrm{Sh}(\mathcal{F};n).

Let’s compare to Rademacher: \begin{aligned} \textrm{URad}(\textrm{sgn}(\mathcal{F}_{|S})) & \leq \sqrt{ 2n (d+1) \ln(n+1)}, \\ \textrm{URad}(\{ x\mapsto \left\langle w, x \right \rangle : \|w\|\leq R\}_{|S}) &\leq R \|X_S\|_F, \end{aligned} where \|X_S\|_F^2 = \sum_{x\in S} \|x\|_2^2 \leq n \cdot d\cdot \max_{i,j} x_{i,j}^2. The first is scale-insensitive, while the second is scale-sensitive (and suggests regularization schemes).
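As a quick empirical illustration (a sketch, not part of the development), one can lower bound the shatter coefficient of affine classifiers on n random points by sampling many (a,b) and counting distinct sign patterns, and compare against the Sauer-Shelah bound n^{d+1}+1; random sampling only finds a subset of the achievable patterns.

```python
# Sketch: lower bound the shatter coefficient of affine classifiers on n
# random points in R^d by sampling many (a, b), and compare to n^{d+1} + 1.
import numpy as np

rng = np.random.default_rng(5)
d, n = 2, 8
X = rng.normal(size=(n, d))

patterns = set()
for _ in range(100000):
    a = rng.normal(size=d)
    b = rng.normal()
    patterns.add(tuple(np.sign(X @ a - b) > 0))

print(len(patterns), n ** (d + 1) + 1)   # observed patterns vs Sauer-Shelah bound
```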

Proof. First let’s do the lower bound \textrm{VC}(\mathcal{F})\geq d+1, by shattering the d+1 points S := \{0, \mathbf{e}_1, \ldots, \mathbf{e}_d\}.

Given any P\subseteq S, define (a,b) as a_i := 2\cdot \mathbf{1}[\mathbf{e}_i \in P] - 1, \qquad b := \frac 1 2 - \mathbf{1}[0 \in P]. Then \begin{aligned} \textrm{sgn}(\left\langle a, \mathbf{e}_i \right \rangle - b) &= \textrm{sgn}( 2 \mathbf{1}[\mathbf{e}_i \in P] - 1 - b) = 2 \mathbf{1}[\mathbf{e}_i \in P] - 1, \\ \textrm{sgn}(\left\langle a, 0 \right \rangle - b) &= \textrm{sgn}( \mathbf{1}[0 \in P] - 1/2) = 2 \mathbf{1}[0 \in P] - 1, \end{aligned} meaning this affine classifier labels S according to P, which was an arbitrary subset.

Now let’s do the upper bound \textrm{VC}(\mathcal{F}) < d+2.


Lemma 22.3 (Radon’s Lemma) .
Given S\subseteq \mathbb{R}^d with |S|=d+2, there exists a partition of S into nonempty (P,N) with \textrm{conv}(P)\cap \textrm{conv}(N) \neq \emptyset.

Proof. Let S = \{x_1,\ldots,x_{d+2}\} be given, and define \{u_1,\ldots,u_{d+1}\} as u_i := x_i - x_{d+2}, which must be linearly dependent: there exist scalars (\beta_1,\ldots,\beta_{d+1}), not all zero, with \sum_{i=1}^{d+1}\beta_i u_i = 0. Defining \beta_{d+2} := -\sum_{i=1}^{d+1}\beta_i, it follows that \sum_{i=1}^{d+2}\beta_i x_i = 0 and \sum_{i=1}^{d+2}\beta_i = 0.

Set P:= \{i : \beta_i > 0\} and N := \{i : \beta_i \leq 0\}; since the \beta_i are not all zero and sum to zero, neither set is empty.

Set \beta := \sum_{i\in P} \beta_i = - \sum_{i\in N} \beta_i > 0.

Since 0= \sum_i \beta_i x_i = \sum_{i\in P} \beta_i x_i + \sum_{i \in N} \beta_i x_i, dividing by \beta gives 0 = \sum_{i\in P} \frac{\beta_i}{\beta} x_i + \sum_{i \in N} \frac {\beta_i}{\beta} x_i, and the point z:= \sum_{i\in P} \beta_i x_i / \beta = \sum_{i \in N} \beta_i x_i / (-\beta) satisfies z\in \textrm{conv}(\{x_i : i\in P\})\cap \textrm{conv}(\{x_i : i\in N\}).

Remark 22.4.

Generalizes Minsky-Papert “xor” construction.

Indeed, the first appearance I know of shattering/VC was in approximation theory, the papers of Warren and Shapiro, and perhaps it is somewhere in Kolmogorov’s old papers.

22.2 VC dimension of threshold networks

Consider iterating the previous construction, giving an “LTF network”: a neural network with activation z\mapsto \mathbf{1}[z\geq 0].

To analyze this, we’ll study not just the network output, but the behavior of all nodes.

Definition.
Given an LTF architecture \mathcal{F} with m nodes (including the output node) and examples S = (x_1,\ldots,x_n), let \textrm{Act}(S;\mathcal{F}) denote the set of achievable activation patterns: matrices in \{0,1\}^{n\times m} whose column j records, across the n examples, the (zero-one) output of node j, for some choice of the network parameters.

Remark 22.5.

Since the last column is the labeling (the output node), |\textrm{Act}(S;\mathcal{F})| \geq \textrm{Sh}(\mathcal{F}_{|S}).

\textrm{Act} seems a nice complexity measure, but it is hard to estimate given a single run of an algorithm (say, unlike a Lipschitz constant).

We’ll generalize \textrm{Act} to analyze ReLU networks.

Theorem 22.3.
For any LTF architecture \mathcal{F} with p parameters, \textrm{Sh}(\mathcal{F};n) \leq |\textrm{Act}(S;\mathcal{F})| \leq (n+1)^p. When p\geq 12, then \textrm{VC}(\mathcal{F})\leq 6p\ln(p).

Proof.

(Inductive step). Let j\geq 1 be given; the proof will now construct U_{j+1} by refining the partition U_j. Within any cell C \in U_j, the activations of all nodes in layers \leq j are fixed, so each node of layer j+1 sees a fixed input vector on each of the n examples, and its pattern of n outputs is the sign pattern of an affine function of that node’s parameters; by the affine shatter bound (Lemma 22.2 with Theorem 22.2), this pattern takes at most (n+1)^{(\textup{that node's parameter count})} values over C. Refining C according to these patterns, simultaneously over all nodes of layer j+1, gives |U_{j+1}| \leq |U_j|\cdot(n+1)^{p_{j+1}} where p_{j+1} is the number of parameters in layer j+1; multiplying over layers gives |\textrm{Act}(S;\mathcal{F})| \leq (n+1)^{\sum_j p_j} = (n+1)^p.

It remains to bound the VC dimension via this Shatter bound: \begin{aligned} &\textrm{VC}(\mathcal{F})<n \\ \Longleftarrow & \forall i \geq n \centerdot \textrm{Sh}(\mathcal{F};i)< 2^i \\ \Longleftarrow & \forall i \geq n \centerdot (i+1)^p < 2^i \\ \iff & \forall i \geq n \centerdot p\ln (i+1) < i \ln 2 \\ \iff & \forall i \geq n \centerdot p < \frac{i \ln (2)}{\ln(i+1)} \\ \Longleftarrow & p < \frac{n \ln (2)}{\ln(n+1)} \end{aligned} If n = 6p\ln(p), \begin{aligned} \frac{n \ln (2)}{\ln(n+1)} &\geq \frac {n\ln(2)}{\ln(2n)} = \frac {6p\ln(p)\ln(2)}{\ln 12 + \ln p + \ln\ln p} \\ &\geq \frac {6p \ln p \ln 2}{3\ln p} > p. \end{aligned}

Remark 22.6.

We had to handle \forall i\geq n since VC dimension is defined via \sup; one can define funky \mathcal{F} where \textrm{Sh} is not monotonic in n.

Lower bound is \Omega(p \ln m); see Anthony-Bartlett chapter 6 for a proof. This lower bound however is for a specific fixed architecture!

Other VC dimension bounds: ReLU networks have \widetilde{\mathcal{O}}(pL), sigmoid networks have \widetilde{\mathcal{O}}(p^2m^2), and there exists a convex-concave activation which is close to sigmoid but has VC dimension \infty.

Matching lower bounds exist for ReLU, not for sigmoid; but even the “matching” lower bounds are deceptive since they hold for a fixed architecture of a given number of parameters and layers.

22.3 VC dimension of ReLU networks

Today’s ReLU networks will predict with x\mapsto A_L \sigma_{L-1}\left({A_{L-1} \cdots A_2\sigma_1(A_1x + b_1)+b_2\cdots + b_{L-1}}\right) + b_L, where A_i\in \mathbb{R}^{d_i\times d_{i-1}} and \sigma_i : \mathbb{R}^{d_i\to d_i} applies the ReLU z\mapsto\max\{0,z\} coordinate-wise.

Convenient notation: collect the data as rows of a matrix X\in\mathbb{R}^{n\times d}, and define \begin{aligned} X_0 &:= X^\top & Z_0 &:= \textup{all $1$s matrix},\\ X_i &:= A_i(Z_{i-1} \odot X_{i-1}) + b_i 1_n^\top, & Z_i &:= \mathbf{1}[X_i \geq 0], \end{aligned} where (Z_1,\ldots,Z_L) are the activation matrices.


Theorem 22.4 (Theorem 6, P. L. Bartlett et al. 2017) .
Let a fixed ReLU architecture \mathcal{F} be given with p= \sum_{i=1}^L p_i parameters, L layers, and m=\sum_{i=1}^L m_i nodes. Let examples (x_1,\ldots,x_n) be given and collected into a matrix X. There exists a partition U_L of the parameter space satisfying: the activation matrices (Z_1,\ldots,Z_L) are constant over each cell C\in U_L, and |U_L| \leq (12nL)^{pL}; consequently \textrm{Sh}(\mathcal{F};n)\leq (12nL)^{pL}, and \textrm{VC}(\mathcal{F}) = \mathcal{O}(pL\ln(pL)).
Remark 22.7 (on the proof) .

As with LTF networks, the proof inductively constructs partitions of the weights up through layer i so that the activations are fixed across all weights in each partition cell.

Consider a fixed cell of the partition, whereby the activations are fixed zero-one matrices. As a function of the inputs, the ReLU network is now an affine function; as a function of the weights it is multilinear or rather a polynomial of degree L.

Consider again a fixed cell and some layer i; thus \sigma(X_i) = Z_i \odot X_i is a matrix of polynomials of degree i (in the weights). If we can upper bound the number of possible signs of A_{i+1} ( Z_i \odot X_i ) + b_{i+1} 1_n^\top, then we can refine our partition of weight space and recurse. For that we need a bound on the number of sign patterns of polynomials, as in the following theorem.

Theorem 22.5 (Warren ’68; see also Anthony-Bartlett Theorem 8.3) .
Let \mathcal{F} denote functions x\mapsto f(x;w) which are degree-r polynomials in w\in\mathbb{R}^p. If n\geq p, then \textrm{Sh}(\mathcal{F};n) \leq 2 \left({\frac{2enr}{p}}\right)^p.
Remark 22.8.
The proof is pretty intricate, and omitted. It relates the VC dimension of \mathcal{F} to the zero sets Z_i := \{ w\in\mathbb{R}^p : f(x_i; w) = 0\}, which it controls with an application of Bezout’s Theorem. The same zero-counting technique is also used to obtain an exact shatter coefficient for affine classifiers.

Proof (of ReLU VC bound).

We’ll inductively construct partitions (U_0,\ldots,U_L) where U_i partitions the parameters of layers j\leq i so that for any C\in U_i, the activations Z_j in layer j\leq i are fixed for all parameter choices within C (thus let Z_j(C) denote these fixed activations).

The proof will proceed by induction, showing |U_i| \leq (12nL)^{pi}.

Base case i=0: then U_0 = \{\emptyset\} is the trivial partition, Z_0 is all ones, and |U_0| = 1 = (12nL)^{p\cdot 0}.

(Inductive step). Within any cell C\in U_i, the activations Z_i are fixed, so each of the n m_{i+1} entries of A_{i+1}(Z_i \odot X_{i}) + b_{i+1} 1_n^\top is a polynomial of degree at most i+1\leq L in the at most p parameters of layers \leq i+1; bounding the number of sign patterns of these polynomials via Theorem 22.5 and refining C accordingly yields a partition U_{i+1}, with all layer-(i+1) activations fixed on each cell, satisfying |U_{i+1}|\leq |U_i|\cdot(12nL)^{p}\leq (12nL)^{p(i+1)}.

It remains to upper bound the VC dimension via the Shattering bound. As with LTF networks, \begin{aligned} \textrm{VC}(\mathcal{F}) < n & \Longleftarrow \forall i \geq n \centerdot \textrm{Sh}(\mathcal{F};i)< 2^i \\ & \Longleftarrow \forall i \geq n \centerdot (12iL)^{pL} < 2^i \\ & \iff \forall i \geq n \centerdot pL \ln(12iL) < i \ln 2 \\ & \iff \forall i \geq n \centerdot pL < \frac{ i \ln 2 } { \ln(12iL) } \\ & \Longleftarrow pL < \frac{ n \ln 2 } { \ln(12nL) } \\ \end{aligned}

If n = 6 pL \ln(pL) and p \geq 27 (so that \ln(72) - 1 \leq \ln p), \begin{aligned} \frac {n\ln 2}{\ln(12nL)} &= \frac {6pL \ln(pL)\ln(2)}{\ln(72pL^2\ln(pL))} = \frac {6pL \ln(pL)\ln(2)}{\ln(72) + \ln(pL^2) + \ln \ln(pL)} \\ &\geq \frac {6pL \ln(pL)\ln(2)}{\ln(72) + \ln(pL^2) + \ln(pL) - 1} \geq \frac {6 pL \ln(pL)\ln(2)}{3\ln(pL)} \\ &= 2pL \ln 2 > pL. \end{aligned}
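To get a feel for the magnitudes, the following sketch evaluates the LTF bound 6p\ln(p) and the ReLU bound 6pL\ln(pL) for a few fully-connected architectures; the widths are arbitrary examples, and p here counts weights and biases.

```python
# Sketch: evaluate the VC upper bounds 6 p ln(p) (LTF networks) and
# 6 p L ln(pL) (ReLU networks) for a few fully-connected architectures
# (widths chosen arbitrarily); p counts weights and biases.
import math

def param_count(widths):
    return sum((din + 1) * dout for din, dout in zip(widths[:-1], widths[1:]))

for widths in ([784, 128, 10], [784, 256, 256, 10], [32, 64, 64, 64, 1]):
    p = param_count(widths)
    L = len(widths) - 1
    print(widths, p, round(6 * p * math.log(p)), round(6 * p * L * math.log(p * L)))
```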

Remark 22.9.

If ReLU is replaced with a degree-r (r\geq 2) piecewise polynomial activation, then each cell of the partition sees a polynomial of degree r^i at layer i, and the shatter coefficient upper bound scales with L^2 rather than L. The lower bound in this case still has L not L^2; it’s not known where the looseness is.

Lower bounds are based on digit extraction, and for each pair (p,L) require a fixed architecture.

References

Allen-Zhu, Zeyuan, and Yuanzhi Li. 2019. “What Can ResNet Learn Efficiently, Going Beyond Kernels?”
Allen-Zhu, Zeyuan, Yuanzhi Li, and Yingyu Liang. 2018. “Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers.” arXiv Preprint arXiv:1811.04918.
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. 2017. “Wasserstein Generative Adversarial Networks.” In ICML.
Arora, Sanjeev, Nadav Cohen, Noah Golowich, and Wei Hu. 2018a. “A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks.”
———. 2018b. “A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks.”
Arora, Sanjeev, Nadav Cohen, and Elad Hazan. 2018. “On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization.” In Proceedings of the 35th International Conference on Machine Learning, edited by Jennifer Dy and Andreas Krause, 80:244–53. Proceedings of Machine Learning Research. Stockholmsmässan, Stockholm Sweden: PMLR. http://proceedings.mlr.press/v80/arora18a.html.
Arora, Sanjeev, Nadav Cohen, Wei Hu, and Yuping Luo. 2019. “Implicit Regularization in Deep Matrix Factorization.” In Advances in Neural Information Processing Systems, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett, 32:7413–24. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2019/file/c0c783b5fc0d7d808f1d14a6e9c8280d-Paper.pdf.
Arora, Sanjeev, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. 2019. “On Exact Computation with an Infinitely Wide Neural Net.” arXiv Preprint arXiv:1904.11955.
Arora, Sanjeev, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. 2019. “Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks.” arXiv Preprint arXiv:1901.08584.
Arora, Sanjeev, Rong Ge, Behnam Neyshabur, and Yi Zhang. 2018. “Stronger Generalization Bounds for Deep Nets via a Compression Approach.”
Barron, Andrew R. 1993. “Universal Approximation Bounds for Superpositions of a Sigmoidal Function.” IEEE Transactions on Information Theory 39 (3): 930–45.
Bartlett, Peter L. 1996. “For Valid Generalization, the Size of the Weights Is More Important Than the Size of the Network.” In NIPS.
Bartlett, Peter L., Nick Harvey, Chris Liaw, and Abbas Mehrabian. 2017. “Nearly-Tight VC-Dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks.”
Bartlett, Peter L., and Philip M. Long. 2020. “Failures of Model-Dependent Generalization Bounds for Least-Norm Interpolation.”
Bartlett, Peter L., and Shahar Mendelson. 2002. “Rademacher and Gaussian Complexities: Risk Bounds and Structural Results.” JMLR 3 (November): 463–82.
Bartlett, Peter, Dylan Foster, and Matus Telgarsky. 2017. “Spectrally-Normalized Margin Bounds for Neural Networks.” NIPS.
Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2018. “Reconciling Modern Machine Learning Practice and the Bias-Variance Trade-Off.”
Belkin, Mikhail, Daniel Hsu, and Ji Xu. 2019. “Two Models of Double Descent for Weak Features.”
Bengio, Yoshua, and Olivier Delalleau. 2011. “Shallow Vs. Deep Sum-Product Networks.” In NIPS.
Blum, Avrim, and John Langford. 2003. “PAC-MDL Bounds.” In Learning Theory and Kernel Machines, 344–57. Springer.
Borwein, Jonathan, and Adrian Lewis. 2000. Convex Analysis and Nonlinear Optimization. Springer Publishing Company, Incorporated.
Bubeck, Sébastien. 2014. “Theory of Convex Optimization for Machine Learning.”
Cao, Yuan, and Quanquan Gu. 2020a. “Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks.”
———. 2020b. “Generalization Error Bounds of Gradient Descent for Learning over-Parameterized Deep ReLU Networks.”
Carmon, Yair, and John C. Duchi. 2018. “Analysis of Krylov Subspace Solutions of Regularized Nonconvex Quadratic Problems.” In NIPS.
Chaudhuri, Kamalika, and Sanjoy Dasgupta. 2014. “Rates of Convergence for Nearest Neighbor Classification.”
Chen, Ricky T. Q., Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. 2018. “Neural Ordinary Differential Equations.”
Chen, Zixiang, Yuan Cao, Quanquan Gu, and Tong Zhang. 2020. “A Generalized Neural Tangent Kernel Analysis for Two-Layer Neural Networks.”
Chen, Zixiang, Yuan Cao, Difan Zou, and Quanquan Gu. 2019. “How Much over-Parameterization Is Sufficient to Learn Deep ReLU Networks?”
Chizat, Lénaïc, and Francis Bach. 2018. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport.” arXiv e-Prints, May, arXiv:1805.09545. http://arxiv.org/abs/1805.09545.
———. 2019. A Note on Lazy Training in Supervised Differentiable Programming.”
———. 2020. “Implicit Bias of Gradient Descent for Wide Two-Layer Neural Networks Trained with the Logistic Loss.” arXiv:2002.04486 [math.OC].
Cho, Youngmin, and Lawrence K. Saul. 2009. “Kernel Methods for Deep Learning.” In NIPS.
Cisse, Moustapha, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. 2017. “Parseval Networks: Improving Robustness to Adversarial Examples.”
Clarke, Francis H., Yuri S. Ledyaev, Ronald J. Stern, and Peter R. Wolenski. 1998. Nonsmooth Analysis and Control Theory. Springer.
Cohen, Nadav, Or Sharir, and Amnon Shashua. 2016. “On the Expressive Power of Deep Learning: A Tensor Analysis.” In 29th Annual Conference on Learning Theory, edited by Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, 49:698–728. Proceedings of Machine Learning Research. Columbia University, New York, New York, USA: PMLR. http://proceedings.mlr.press/v49/cohen16.html.
Cohen, Nadav, and Amnon Shashua. 2016. “Convolutional Rectifier Networks as Generalized Tensor Decompositions.” In Proceedings of the 33rd International Conference on Machine Learning, edited by Maria Florina Balcan and Kilian Q. Weinberger, 48:955–63. Proceedings of Machine Learning Research. New York, New York, USA: PMLR. http://proceedings.mlr.press/v48/cohenb16.html.
Cybenko, George. 1989. Approximation by superpositions of a sigmoidal function.” Mathematics of Control, Signals and Systems 2 (4): 303–14.
Daniely, Amit. 2017. “Depth Separation for Neural Networks.” In COLT.
Daniely, Amit, and Eran Malach. 2020. “Learning Parities with Neural Networks.”
Davis, Damek, Dmitriy Drusvyatskiy, Sham Kakade, and Jason D. Lee. 2018. “Stochastic Subgradient Method Converges on Tame Functions.”
Diakonikolas, Ilias, Surbhi Goel, Sushrut Karmalkar, Adam R. Klivans, and Mahdi Soltanolkotabi. 2020. “Approximation Schemes for ReLU Regression.”
Du, Simon S., Wei Hu, and Jason D. Lee. 2018. “Algorithmic Regularization in Learning Deep Homogeneous Models: Layers Are Automatically Balanced.”
Du, Simon S., Xiyu Zhai, Barnabas Poczos, and Aarti Singh. 2018. “Gradient Descent Provably Optimizes over-Parameterized Neural Networks.”
Du, Simon S, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. 2018. “Gradient Descent Finds Global Minima of Deep Neural Networks.” arXiv Preprint arXiv:1811.03804.
Du, Simon, and Wei Hu. 2019. “Width Provably Matters in Optimization for Deep Linear Neural Networks.”
Dziugaite, Gintare Karolina, Alexandre Drouin, Brady Neal, Nitarshan Rajkumar, Ethan Caballero, Linbo Wang, Ioannis Mitliagkas, and Daniel M. Roy. 2020. “In Search of Robust Measures of Generalization.”
Dziugaite, Gintare Karolina, and Daniel M. Roy. 2017. “Computing Nonvacuous Generalization Bounds for Deep (stochastic) Neural Networks with Many More Parameters Than Training Data.”
Eldan, Ronen, and Ohad Shamir. 2015. “The Power of Depth for Feedforward Neural Networks.”
Folland, Gerald B. 1999. Real Analysis: Modern Techniques and Their Applications. 2nd ed. Wiley Interscience.
Funahashi, K. 1989. “On the Approximate Realization of Continuous Mappings by Neural Networks.” Neural Netw. 2 (3): 183–92.
Ge, Rong, Jason D. Lee, and Tengyu Ma. 2016. “Matrix Completion Has No Spurious Local Minimum.” In NIPS.
Ghorbani, Behrooz, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. 2020. “When Do Neural Networks Outperform Kernel Methods?”
Goel, Surbhi, Adam Klivans, Pasin Manurangsi, and Daniel Reichman. 2020. “Tight Hardness Results for Training Depth-2 ReLU Networks.”
Golowich, Noah, Alexander Rakhlin, and Ohad Shamir. 2018. “Size-Independent Sample Complexity of Neural Networks.” In COLT.
Gunasekar, Suriya, Jason D Lee, Daniel Soudry, and Nati Srebro. 2018a. “Implicit Bias of Gradient Descent on Linear Convolutional Networks.” In Advances in Neural Information Processing Systems, 9461–71.
Gunasekar, Suriya, Jason Lee, Daniel Soudry, and Nathan Srebro. 2018b. “Characterizing Implicit Bias in Terms of Optimization Geometry.” arXiv Preprint arXiv:1802.08246.
Gunasekar, Suriya, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. 2017. “Implicit Regularization in Matrix Factorization.”
Gurvits, Leonid, and Pascal Koiran. 1995. “Approximation and Learning of Convex Superpositions.” In Computational Learning Theory, edited by Paul Vitányi, 222–36. Springer.
Hanin, Boris, and David Rolnick. 2019. “Deep ReLU Networks Have Surprisingly Few Activation Patterns.”
Hastie, Trevor, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. 2019. “Surprises in High-Dimensional Ridgeless Least Squares Interpolation.”
Hiriart-Urruty, Jean-Baptiste, and Claude Lemaréchal. 2001. Fundamentals of Convex Analysis. Springer Publishing Company, Incorporated.
Hornik, K., M. Stinchcombe, and H. White. 1989. “Multilayer Feedforward Networks Are Universal Approximators.” Neural Networks 2 (5): 359–66.
Jacot, Arthur, Franck Gabriel, and Clément Hongler. 2018. “Neural Tangent Kernel: Convergence and Generalization in Neural Networks.” In Advances in Neural Information Processing Systems, 8571–80.
Ji, Ziwei. 2020. “Personal Communication.”
Ji, Ziwei, Miroslav Dudík, Robert E. Schapire, and Matus Telgarsky. 2020. “Gradient Descent Follows the Regularization Path for General Losses.” In COLT.
Ji, Ziwei, and Matus Telgarsky. 2018. “Gradient Descent Aligns the Layers of Deep Linear Networks.” arXiv:1810.02032 [cs.LG].
———. 2019a. “Polylogarithmic Width Suffices for Gradient Descent to Achieve Arbitrarily Small Test Error with Shallow ReLU Networks.”
———. 2019b. “Risk and Parameter Convergence of Logistic Regression.” In COLT.
———. 2020. “Directional Convergence and Alignment in Deep Learning.” arXiv:2006.06657 [cs.LG].
Ji, Ziwei, Matus Telgarsky, and Ruicheng Xian. 2020. “Neural Tangent Kernels, Transportation Mappings, and Universal Approximation.” In ICLR.
Jiang, Yiding, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. 2020. “Fantastic Generalization Measures and Where to Find Them.” In ICLR.
Jin, Chi, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. 2017. “How to Escape Saddle Points Efficiently.” In ICML.
Jones, Lee K. 1992. “A Simple Lemma on Greedy Approximation in Hilbert Space and Convergence Rates for Projection Pursuit Regression and Neural Network Training.” The Annals of Statistics 20 (1): 608–13.
Kakade, Sham, and Jason D. Lee. 2018. “Provably Correct Automatic Subdifferentiation for Qualified Programs.”
Kamath, Pritish, Omar Montasser, and Nathan Srebro. 2020. “Approximate Is Good Enough: Probabilistic Variants of Dimensional and Margin Complexity.”
Kawaguchi, Kenji. 2016. “Deep Learning Without Poor Local Minima.” In NIPS.
Kolmogorov, A. N., and V. M. Tikhomirov. 1959. \epsilon-Entropy and \epsilon-Capacity of Sets in Function Spaces.” Uspekhi Mat. Nauk 14 (86, 2): 3–86.
Ledoux, M., and M. Talagrand. 1991. Probability in Banach Spaces: Isoperimetry and Processes. Springer.
Lee, Holden, Rong Ge, Tengyu Ma, Andrej Risteski, and Sanjeev Arora. 2017. “On the Ability of Neural Nets to Express Distributions.” In COLT.
Lee, Jason D., Max Simchowitz, Michael I. Jordan, and Benjamin Recht. 2016. “Gradient Descent Only Converges to Minimizers.” In COLT.
Leshno, Moshe, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken. 1993. “Multilayer Feedforward Networks with a Nonpolynomial Activation Function Can Approximate Any Function.” Neural Networks 6 (6): 861–67. http://dblp.uni-trier.de/db/journals/nn/nn6.html#LeshnoLPS93.
Li, Yuanzhi, and Yingyu Liang. 2018. “Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data.”
Long, Philip M., and Hanie Sedghi. 2019. “Generalization Bounds for Deep Convolutional Neural Networks.”
Luxburg, Ulrike von, and Olivier Bousquet. 2004. “Distance-Based Classification with Lipschitz Functions.” Journal of Machine Learning Research.
Lyu, Kaifeng, and Jian Li. 2019. “Gradient Descent Maximizes the Margin of Homogeneous Neural Networks.”
Mei, Song, Andrea Montanari, and Phan-Minh Nguyen. 2018. A Mean Field View of the Landscape of Two-Layers Neural Networks.” arXiv e-Prints, April, arXiv:1804.06561. http://arxiv.org/abs/1804.06561.
Montúfar, Guido, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. 2014. “On the Number of Linear Regions of Deep Neural Networks.” In NIPS.
Moran, Shay, and Amir Yehudayoff. 2015. “Sample Compression Schemes for VC Classes.”
Nagarajan, Vaishnavh, and J. Zico Kolter. 2019. “Uniform Convergence May Be Unable to Explain Generalization in Deep Learning.”
Negrea, Jeffrey, Gintare Karolina Dziugaite, and Daniel M. Roy. 2019. “In Defense of Uniform Convergence: Generalization via Derandomization with an Application to Interpolating Predictors.”
Nesterov, Yurii. 2003. Introductory Lectures on Convex Optimization — a Basic Course. Springer.
Nesterov, Yurii, and B. T. Polyak. 2006. “Cubic Regularization of Newton Method and Its Global Performance.” Math. Program. 108 (1): 177–205.
Neyshabur, Behnam, Srinadh Bhojanapalli, and Nathan Srebro. 2018. “A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks.” In ICLR.
Neyshabur, Behnam, Ryota Tomioka, and Nathan Srebro. 2014. “In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning.” arXiv:1412.6614 [cs.LG].
Nguyen, Quynh, and Matthias Hein. 2017. “The Loss Surface of Deep and Wide Neural Networks.”
Novak, Roman, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. 2018. Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes.” arXiv e-Prints. http://arxiv.org/abs/1810.05148.
Novikoff, Albert B. J. 1962. “On Convergence Proofs on Perceptrons.” In Proceedings of the Symposium on the Mathematical Theory of Automata 12: 615–22.
Oymak, Samet, and Mahdi Soltanolkotabi. 2019. “Towards Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks.” arXiv Preprint arXiv:1902.04674.
Pisier, Gilles. 1980. “Remarques Sur Un Résultat Non Publié de B. Maurey.” Séminaire Analyse Fonctionnelle (dit), 1–12.
Rolnick, David, and Max Tegmark. 2017. “The Power of Deeper Networks for Expressing Natural Functions.”
Safran, Itay, and Ohad Shamir. 2016. “Depth-Width Tradeoffs in Approximating Natural Functions with Neural Networks.”
Schapire, Robert E., and Yoav Freund. 2012. Boosting: Foundations and Algorithms. MIT Press.
Schapire, Robert E., Yoav Freund, Peter Bartlett, and Wee Sun Lee. 1997. “Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods.” In ICML, 322–30.
Schmidt-Hieber, Johannes. 2017. “Nonparametric Regression Using Deep Neural Networks with ReLU Activation Function.”
Shallue, Christopher J., Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. 2018. “Measuring the Effects of Data Parallelism on Neural Network Training.”
Shamir, Ohad. 2018. “Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks.” arXiv:1809.08587 [cs.LG].
Shamir, Ohad, and Tong Zhang. 2013. “Stochastic Gradient Descent for Non-Smooth Optimization: Convergence Results and Optimal Averaging Schemes.” In ICML.
Siegelmann, Hava, and Eduardo Sontag. 1994. “Analog Computation via Neural Networks.” Theoretical Computer Science 131 (2): 331–60.
Soudry, Daniel, Elad Hoffer, and Nathan Srebro. 2017. “The Implicit Bias of Gradient Descent on Separable Data.” arXiv Preprint arXiv:1710.10345.
Suzuki, Taiji, Hiroshi Abe, and Tomoaki Nishimura. 2019. “Compression Based Bound for Non-Compressed Network: Unified Generalization Error Analysis of Large Compressible Deep Neural Network.”
Telgarsky, Matus. 2013. “Margins, Shrinkage, and Boosting.” In ICML.
———. 2015. “Representation Benefits of Deep Feedforward Networks.”
———. 2016. “Benefits of Depth in Neural Networks.” In COLT.
———. 2017. “Neural Networks and Rational Functions.” In ICML.
Tzen, Belinda, and Maxim Raginsky. 2019. “Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion Limit.”
Vardi, Gal, and Ohad Shamir. 2020. “Neural Networks with Small Weights and Depth-Separation Barriers.” arXiv:2006.00625 [cs.LG].
Wainwright, Martin J. 2015. “UC Berkeley Statistics 210B, Lecture Notes: Basic tail and concentration bounds.” January 2015. https://www.stat.berkeley.edu/~mjwain/stat210b/.
———. 2019. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. 1st ed. Cambridge University Press.
Wei, Colin, and Tengyu Ma. 2019. “Data-Dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation.”
Weierstrass, Karl. 1885. Über Die Analytische Darstellbarkeit Sogenannter Willkürlicher Functionen Einer Reellen Veränderlichen.” Sitzungsberichte Der Akademie Zu Berlin, 633–39, 789–805.
Yarotsky, Dmitry. 2016. “Error Bounds for Approximations with Deep ReLU Networks.”
Yehudai, Gilad, and Ohad Shamir. 2019. “On the Power and Limitations of Random Features for Understanding Neural Networks.” arXiv:1904.00687 [cs.LG].
———. 2020. “Learning a Single Neuron with Gradient Methods.” arXiv:2001.05205 [cs.LG].
Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. “Understanding Deep Learning Requires Rethinking Generalization.” ICLR.
Zhou, Lijia, D. J. Sutherland, and Nathan Srebro. 2020. “On Uniform Convergence and Low-Norm Interpolation Learning.”
Zhou, Wenda, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. 2018. “Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach.”
Zou, Difan, Yuan Cao, Dongruo Zhou, and Quanquan Gu. 2018. “Stochastic Gradient Descent Optimizes over-Parameterized Deep Relu Networks.”
Zou, Difan, and Quanquan Gu. 2019. “An Improved Analysis of Training over-Parameterized Deep Neural Networks.”