Deep learning theory lecture notes

Matus Telgarsky

2021-10-27 v0.0-e7150f2d (alpha)

Preface

Philosophy of these notes. Two key ideas determined what has been included so far.

  1. I aim to provide simplified proofs over what appears in the literature, ideally reducing difficult things to something that fits in a single lecture.

  2. I have primarily focused on a classical perspective of achieving a low test error for binary classification with IID data via standard (typically ReLU) feedforward networks.

Organization. Following the second point above, the classical view decomposes the test error into three parts.

  1. Approximation (starts in section 1): given a classification problem, there exists a deep network which achieves low error over the distribution.

  2. Optimization (starts in section 6): given a finite training set for a classification problem, there exist algorithms to find predictors with low training error and low complexity.

  3. Generalization (starts in section 11): the gap between training and testing error is small for low complexity networks.

Remark 0.1 (weaknesses of this “classical” approach) .
  1. It appears that all of these negative results consider the consequences of worst-case behavior in one of these three terms on the other two. Here instead we study how they interconnect in a favorable way. A common theme is how they all work together with low complexity models on reasonable data.
  2. Even if the preceding point is overly optimistic at times, this decomposition still gives us a way to organize and categorize much of what is known in the field, and secondly these ideas will always be useful at least as tools in a broader picture.

Formatting.

Feedback. I’m very eager to hear any and all feedback!

How to cite. Please consider using a format which makes the version clear:


@misc{mjt_dlt,
       author = {Matus Telgarsky},
        title = {Deep learning theory lecture notes},
 howpublished = {\url{https://mjt.cs.illinois.edu/dlt/}},
         year = {2021},
         note = {Version: 2021-10-27 v0.0-e7150f2d (alpha)},
}

Basic setup: feedforward networks and test error decomposition

In this section we outline our basic setup, which can be summarized as follows:

  1. We consider standard shallow and deep feedforward networks.
  2. We study mainly binary classification in the supervised learning setup.
  3. As above, we study an error decomposition into three parts.

Although this means we exclude many settings, as discussed above, much of the work in other settings uses tools from this most standard one.

Basic shallow network. Consider the mapping x \mapsto \sum_{j=1}^m a_j \sigma(w_j^{\scriptscriptstyle\mathsf{T}}x + b_j).

Basic deep network. Extending the matrix notation, given parameters w = (W_1, b_1, \ldots, W_L, b_L), f(x;w) := \sigma_L( W_L \sigma_{L-1}( \cdots W_2 \sigma_1(W_1 x + b_1) + b_2 \cdots )+ b_L ). (1)
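
As a concrete reference point, here is a minimal numpy sketch of eq. (1); the layer sizes, the use of the ReLU for every \sigma_i, and the identity output activation are illustrative choices, not anything fixed by the text.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def feedforward(x, params, activations):
        # eq. (1): sigma_L(W_L sigma_{L-1}(... sigma_1(W_1 x + b_1) ...) + b_L)
        h = x
        for (W, b), sigma in zip(params, activations):
            h = sigma(W @ h + b)
        return h

    # illustrative sizes: d = 3 inputs, two hidden layers of width 4, scalar output
    rng = np.random.default_rng(0)
    sizes = [3, 4, 4, 1]
    params = [(rng.normal(size=(nout, nin)), rng.normal(size=nout))
              for nin, nout in zip(sizes[:-1], sizes[1:])]
    activations = [relu, relu, lambda z: z]   # identity sigma_L on the output layer
    x = rng.normal(size=3)
    print(feedforward(x, params, activations))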

Basic supervised learning setup; test error decomposition.

How do we define “performs well on future examples?”

“Performs well on future examples” becomes “minimize \mathcal{R}(f).” We can decompose \mathcal{R}(f) into three separate concerns: given a training algorithm’s choice \hat f in some class of functions/predictors \mathcal{F}, as well as some reference solution \bar f \in \mathcal{F}, \begin{aligned} \mathcal{R}(\hat f) &= \mathcal{R}(\hat f) - \widehat{\mathcal{R}}(\hat f) &\text{(generalization)} \\ &\quad + \widehat{\mathcal{R}}(\hat f) - \widehat{\mathcal{R}}(\bar f) &\text{(optimization)} \\ &\quad + \widehat{\mathcal{R}}(\bar f) -\mathcal{R}(\bar f) &\qquad\text{(concentration/generalization)} \\ &\quad + \mathcal{R}(\bar f). &\text{(approximation)} \end{aligned} These notes are organized into separately considering these three terms (treating “generalization” and “concentration/generalization” together).

Remark 0.2 (sensitivity to complexity) .
As discussed, we aim to circumvent the aforementioned pitfalls by working with notions of low-complexity models which work well with all three parts. There is still very little understanding of the right way to measure complexity; however, here are some informal comments.
Remark 0.3.
The two-argument form \ell(\hat y, y) is versatile. We will most often consider binary classification y\in\{\pm 1\}, where we always use the product \hat y y, even for the squared loss: \left[{\hat y - y}\right]^2 = \left[{y (y\hat y - 1) }\right]^2 = (y \hat y - 1)^2. This also means binary classification networks have output dimension one, not two.

Highlights

Here are a few of the shortened and/or extended proofs in these notes.

  1. Approximation.

  2. Optimization.

  3. Generalization.

Missing topics and references

Due to the above philosophy, many topics are currently omitted. Over time I hope to fill the gaps.

Here are some big omissions, hopefully resolved soon:

Further omitted topics, in a bit more detail, are discussed separately for approximation (section 1.1), optimization (section 6.1), and generalization (section 11.1).

Acknowledgements

Thanks to Ziwei Ji for extensive comments, discussion, and the proof of Theorem 10.3; thanks to Daniel Hsu for extensive comments and discussion; thanks to Francesco Orabona for detailed comments spanning many sections; thanks to Ohad Shamir for extensive comments on many topics; thanks to Karolina Dziugaite and Dan Roy for extensive comments on the generalization material; thanks to Thien Nguyen for extensive and detailed comments and corrections on many sections. Further thanks to Nadav Cohen, Quanquan Gu, Suriya Gunasekar, Frederic Koehler, Justin Li, Akshayaa Magesh, Maxim Raginsky, David Rolnick, Kartik Sreenivasan, Matthieu Terris, and Alex Wozniakowski for various comments and feedback.

1 Approximation: preface

As above, we wish to ensure that our predictors \mathcal{F} (e.g., networks of a certain architecture) have some element \bar f\in\mathcal{F} which simultaneously has small \mathcal{R}(\bar f) and small complexity; we can re-interpret our notation and suppose \mathcal{F} already is some constrained class of low-complexity predictors, and aim to make \inf_{f\in\mathcal{F}} \mathcal{R}(f) small.

What is \mathcal{F}? In keeping with the earlier theme, it should be some convenient notion of “low complexity model”; but what is that?

  1. Models reached by gradient descent. Since standard training methods are variants of simple first-order methods, it seems this might be a convenient candidate for \mathcal{F} which is tight with practice. Unfortunately, firstly we only have an understanding of these models very close to initialization and very late in training, whereas practice seems to lie somewhere between. Secondly, we can’t just make this our definition, as it breaks things in the standard approach to generalization.

  2. Models of low norm, where norm is typically measured layer-wise, and also typically the “origin” is initialization. This is the current most common setup, though it doesn’t seem to be able to capture the behavior of gradient descent that well, except perhaps when very close to initialization.

  3. All models of some fixed architecture, meaning the weights can be arbitrary. This is the classical setup, and we’ll cover it here, but it can often seem loose or insensitive to data, and was a key part of the criticisms against the general learning-theoretic approach (Zhang et al. 2017). The math is still illuminating, and key parts can still be used as tools in a more sensitive analysis, e.g., by compressing a model and then applying one of these results.

The standard classical setup (“all models of some fixed architecture”) is often stated with a goal of competing with all continuous functions: \inf_{f\in\mathcal{F}} \mathcal{R}(f) \qquad\text{vs.}\qquad \inf_{g\ \textrm{continuous}} \mathcal{R}(g). E.g., \sup_{g\text{ cont.}} \inf_{f\in\mathcal{F}} \mathcal{R}(f) - \mathcal{R}(g). To simplify further, if \ell is \rho-Lipschitz (and still y=\pm 1), \begin{aligned} \mathcal{R}(f) - \mathcal{R}(g) &= \int \left({ \ell(yf(x)) - \ell(yg(x)) }\right){\text{d}}\mu(x,y) \\ &\leq \int \rho|yf(x) - yg(x)| {\text{d}}\mu(x,y) = \rho \int |f(x) - g(x)| {\text{d}}\mu(x,y), \end{aligned} and in particular we have reduced the approximation question to one about studying \|f-g\| with function space norms.

Remark 1.1.
(Is this too strenuous?) Most of the classical work uses the uniform norm: \|f -g\|_{{\textrm{u}}} = \sup_{x\in S} |f(x) - g(x)| where S is some compact set, and compares against continuous functions. Unfortunately, already if the target is Lipschitz continuous, this means our function class needs complexity which scales exponentially with dimension (Luxburg and Bousquet 2004): this highlights the need for more refined target functions and approximation measures.

(Lower bounds.) The uniform norm has certain nice properties for proving upper bounds, but is it meaningful for a lower bound? Functions can be well-separated in uniform norm even if they are mostly the same: they just need one point of large difference. For this reason, L_1 norms, for instance \int_{[0,1]^d} |f(x)-g(x)|{\text{d}}x, are preferred for lower bounds.

Remark 1.2.
While norms have received much recent attention as a way to measure complexity, this idea is quite classical. For instance, a resurgence of interest in the 1990s led to the proof of many deep network VC dimension bounds; however, it was quickly highlighted (and proved) in (P. L. Bartlett 1996) that there are situations where the architecture (and connection cardinality) stays fixed (along with the VC dimension), yet the norms (and generalization properties) vary.

1.1 Omitted topics

2 Classical approximations and “universal approximation”

We start with two types of standard approximation results, in the “classical” regime where we only care about the number of nodes and not the magnitude of the weights, and also the worst-case goal of competing with an arbitrary continuous function using some function space norm.

  1. Elementary folklore results: univariate approximation with one hidden layer, and multivariate approximation with two hidden layers, just by stacking bricks. The latter uses the L_1 metric, which is disappointing.

  2. Celebrated “universal approximation” result: fitting continuous functions over compact sets in uniform norm with a single hidden layer (Hornik, Stinchcombe, and White 1989).

There are weaknesses in these results (e.g., curse of dimension), and thus they are far from the practical picture. Still, they are very interesting and influential.

2.1 Elementary folklore constructions

We can handle the univariate case by gridding the line and taking steps appropriately.

Proposition 2.1.
Suppose g :\mathbb{R}\to\mathbb{R} is \rho-Lipschitz. For any \epsilon>0, there exists a 2-layer network f with \lceil \frac\rho \epsilon\rceil threshold nodes z\mapsto \mathbf{1}[z\geq 0] so that \sup_{x\in[0,1]} |f(x) - g(x)| \leq \epsilon.

Proof. Define m := \lceil \frac\rho \epsilon\rceil, and b_i := i \epsilon/ \rho for i\in \{0,\ldots,m-1\}, and a_0 = g(0), \qquad a_i = g(b_i) - g(b_{i-1}), and lastly define f(x) := \sum_{i=0}^{m-1} a_i \mathbf{1}[x \geq b_i]. For any x\in[0,1], let k be the largest index so that b_k \leq x; then f is constant along [b_k, x], and \begin{aligned} |g(x) - f(x)| &\leq |g(x) - g(b_k)| + |g(b_k) - f(b_k)| + |f(b_k) - f(x)| \\ &\leq \rho |x - b_k| + \left|{g(b_k) - \sum_{i=0}^k a_i }\right| + 0 \\ &\leq \rho(\epsilon/\rho) + \left|{ g(b_k) - g(b_0) - \sum_{i=1}^k (g(b_i) - g(b_{i-1}))}\right| \\ & =\epsilon. \end{aligned}
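
Here is a small numerical sketch of the construction in this proof; the target g(x) = \sin(3x) (so \rho = 3) and the accuracy \epsilon are illustrative choices.

    import numpy as np

    def threshold_approx(g, rho, eps):
        # construction from Proposition 2.1: m = ceil(rho/eps) threshold nodes,
        # grid b_i = i*eps/rho, coefficients a_0 = g(0), a_i = g(b_i) - g(b_{i-1})
        m = int(np.ceil(rho / eps))
        b = np.arange(m) * eps / rho
        gb = g(b)
        a = np.concatenate(([gb[0]], np.diff(gb)))
        def f(x):
            x = np.asarray(x, dtype=float)
            return (a[None, :] * (x[:, None] >= b[None, :])).sum(axis=1)
        return f

    g = lambda x: np.sin(3.0 * x)        # 3-Lipschitz on [0, 1]
    eps = 0.05
    f = threshold_approx(g, rho=3.0, eps=eps)
    xs = np.linspace(0.0, 1.0, 2001)
    print(np.max(np.abs(f(xs) - g(xs))), "<=", eps)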

Remark 2.1.
This is standard, but we’ve lost something! We are paying for flat regions, which are a specialty of standard networks! A more careful proof only steps when it needs to and pays in total variation.

Now let’s handle the multivariate case. We will replicate the univariate approach: we will increment function values when the target function changes. In the univariate case, we could “localize” function modifications, but in the multivariate case by default we will modify an entire halfspace at once. To get around this, we use an additional layer.

Remark 2.2.
Theorem 2.1.
Let a continuous g :\mathbb{R}^d \to \mathbb{R} and an \epsilon>0 be given, and choose \delta >0 so that \|x-x'\|_\infty \leq \delta implies |g(x) - g(x')| \leq \epsilon. Then there exists a 3-layer network f with \Omega(\frac 1 {\delta^d}) ReLU nodes such that \int_{[0,1]^d} |f(x) - g(x)|{\text{d}}x \leq 2 \epsilon.
Remark 2.3.

The proof uses the following lemma (omitted in class), approximating continuous functions by piecewise constant functions.

Lemma 2.1.
Let g,\delta,\epsilon be given as in Theorem 2.1. Let any set U\subset \mathbb{R}^d be given, along with a partition \mathcal{P} of U into rectangles (products of intervals) \mathcal{P}= (R_1,\ldots,R_N) with all side lengths not exceeding \delta. Then there exist scalars (\alpha_1,\ldots,\alpha_N) so that \sup_{x\in U}\left|{ g(x) - h(x) }\right| \leq \epsilon, \qquad\textrm{where}\qquad h = \sum_{i=1}^N \alpha_i \mathbf{1}_{R_i}.

Proof. Let partition \mathcal{P}= (R_1,\ldots,R_N) be given, and for each R_i, pick some x_i \in R_i, and set \alpha_i := g(x_i). Since each side length of each R_i is at most \delta, \begin{aligned} \sup_{x\in U} |g(x) - h(x)| &= \sup_{i\in \{1,\ldots,N\}} \sup_{x\in R_i} |g(x) - h(x)| \\ &\leq \sup_{i\in \{1,\ldots,N\}} \sup_{x\in R_i}\left({ |g(x) - g(x_i)| + |g(x_i) - h(x)| }\right) \\ &\leq \sup_{i\in \{1,\ldots,N\}} \sup_{x\in R_i}\left({ \epsilon+ |g(x_i) - \alpha_i| }\right) = \epsilon. \end{aligned}

Proof of Theorem 2.1. For convenience, throughout this proof define a norm \|f\|_1 = \int_{[0,2)^d} |f(x)|{\text{d}}x. Let \mathcal{P} denote a partition of [0,2)^d into rectangles of the form \prod_{j=1}^d [a_j,b_j) with b_j-a_j\leq \delta; the final result follows by restricting consideration to [0,1]^d, but we include an extra region to work with half-open intervals in a lazy way. Let h = \sum_i \alpha_i \mathbf{1}_{R_i} denote the piecewise-constant function provided by Lemma 2.1 with the given partition \mathcal{P}, which satisfies \|g-h\|_1 \leq \epsilon. Our final network f will be of the form f(x) := \sum_i \alpha_i g_i(x), where each g_i will be a ReLU network with two hidden layers and \mathcal{O}(d) nodes; since |\mathcal{P}|\geq 1/\delta^d, then f also uses at least 1/\delta^d nodes as stated. Our goal is to show \|f-g\|_1\leq 2\epsilon; to this end, note by the preceding choices and the triangle inequality that \begin{aligned} \|f-g\|_1 &\leq \|f-h\|_1 + \|h-g\|_1 \\ &= \left\|{ \sum_i \alpha_i (\mathbf{1}_{R_i} - g_i) }\right\|_1 + \epsilon \\ & \leq \sum_i |\alpha_i|\cdot \| \mathbf{1}_{R_i} - g_i \|_1 + \epsilon. \end{aligned} As such, if we can construct each g_i so that \| \mathbf{1}_{R_i} - g_i \|_1 \leq \frac {\epsilon}{\sum_i |\alpha_i|}, then the proof is complete. (If \sum_i |\alpha_i| = 0, we can set f to be the constant 0 network and the proof is again complete.)

Now fix i and let rectangle R_i be given of the form R_i := \times_{j=1}^d [a_j,b_j), and define g_i as follows. Letting \gamma>0 denote a free parameter to be optimized at the end of the proof, for each j\in \{1,\ldots,d\} define \begin{aligned} \hspace{-2em} g_{\gamma,j}(z) &:= \sigma\left({\frac {z - (a_j - \gamma)}{\gamma}}\right) - \sigma\left({\frac {z - a_j}{\gamma}}\right) - \sigma\left({\frac {z - b_j}{\gamma}}\right) + \sigma\left({\frac {z - (b_j+\gamma)}{\gamma}}\right) \\ &\in \begin{cases} \{1\} & z \in [a_j,b_j],\\ \{0\} & z \not \in [a_j-\gamma, b_j + \gamma],\\ [0,1] & \text{otherwise}, \end{cases} \end{aligned} and additionally g_\gamma(x) := \sigma(\sum_j g_{\gamma,j}(x_j) - (d-1)). (Note that a second hidden layer is crucial in this construction; it is not clear how to proceed without it, certainly with only \mathcal{O}(d) nodes. Later proofs can use only a single hidden layer, but they are not constructive, and need \mathcal{O}(d) nodes.) Note that g_\gamma\approx \mathbf{1}_{R_i} as desired, specifically g_\gamma(x) = \begin{cases} 1 & x\in R_i,\\ 0 & x \not\in \times_j [a_j-\gamma, b_j+\gamma],\\ [0,1] & \textrm{otherwise,} \end{cases} from which it follows that \begin{aligned} \|g_\gamma - \mathbf{1}_{R_i}\|_1 &= \int_{R_i} |g_\gamma - \mathbf{1}_{R_i}| + \int_{\times_j [a_j-\gamma, b_j+\gamma] \setminus R_i} |g_\gamma - \mathbf{1}_{R_i}| + \int_{[0,2)^d \setminus \times_j [a_j-\gamma, b_j+\gamma]} |g_\gamma - \mathbf{1}_{R_i}| \\ &\leq 0 + \prod_{j=1}^d (b_j - a_j + 2\gamma) - \prod_{j=1}^d (b_j - a_j) + 0 \\ &\leq \mathcal{O}(\gamma), \end{aligned} which means we can ensure \| \mathbf{1}_{R_i} - g_\gamma \|_1 \leq \frac {\epsilon}{\sum_i |\alpha_i|} by choosing sufficiently small \gamma, which completes the proof.
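
To make the bump g_\gamma concrete, here is a small numpy sketch for a single rectangle; the rectangle, the values of \gamma, and the Monte Carlo estimate of the L_1 error are illustrative choices.

    import numpy as np

    relu = lambda z: np.maximum(z, 0.0)

    def g_gamma_j(z, a, b, gamma):
        # univariate trapezoid: equals 1 on [a, b], 0 outside [a-gamma, b+gamma]
        return (relu((z - (a - gamma)) / gamma) - relu((z - a) / gamma)
                - relu((z - b) / gamma) + relu((z - (b + gamma)) / gamma))

    def g_gamma(x, rect, gamma):
        # x: (n, d); rect: list of (a_j, b_j); second hidden layer ANDs the coordinates
        d = len(rect)
        s = sum(g_gamma_j(x[:, j], a, b, gamma) for j, (a, b) in enumerate(rect))
        return relu(s - (d - 1))

    rng = np.random.default_rng(0)
    rect = [(0.2, 0.5), (0.1, 0.4), (0.6, 0.9)]
    x = rng.uniform(0.0, 1.0, size=(200000, 3))
    indicator = np.all([(x[:, j] >= a) & (x[:, j] < b)
                        for j, (a, b) in enumerate(rect)], axis=0)
    for gamma in [0.1, 0.01, 0.001]:
        print(gamma, np.mean(np.abs(g_gamma(x, rect, gamma) - indicator)))  # L_1 error ~ O(gamma)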

2.2 Universal approximation with a single hidden layer

The proof of Theorem 2.1 uses two layers to construct g_\gamma such that g_\gamma(x) \approx \mathbf{1}\left[{ x\in \times_i [a_i,b_i] }\right]. If instead we had a way to approximate multiplication, we could instead approximate x \mapsto \prod_i \mathbf{1}\left[{ x_i \in [a_i,b_i] }\right] = \mathbf{1}\left[{ x\in \times_i [a_i,b_i] }\right]. Can we do this and then form a linear combination, all with just one hidden layer?

The answer will be yes, and we will use this to resolve the classical universal approximation question with a single hidden layer.

Definition 2.1.
A class of functions \mathcal{F} is a universal approximator over a compact set S if for every continuous function g and target accuracy \epsilon>0, there exists f\in\mathcal{F} with \sup_{x\in S} |f(x) - g(x)| \leq \epsilon.
Remark 2.4.
Typically we will take S = [0,1]^d; we can then reduce arbitrary compact sets to this case by defining a new function which re-scales the input. The compactness is in a sense necessary: as in the homework, consider approximating the \sin function with a finite-size ReLU network over all of \mathbb{R}. Lastly, universal approximation is often stated more succinctly as some class being dense in all continuous functions over compact sets.

Consider unbounded width networks with one hidden layer: \begin{aligned} \mathcal{F}_{\sigma,d,m} &:= \mathcal{F}_{d,m} := \left\{{ x\mapsto a^{\scriptscriptstyle\mathsf{T}}\sigma(W x + b) : a\in\mathbb{R}^m, W\in\mathbb{R}^{m\times d}, b\in\mathbb{R}^m}\right\}. \\ \mathcal{F}_{\sigma,d} &:= \mathcal{F}_d := \bigcup_{m\geq 0} \mathcal{F}_{\sigma,d,m}. \end{aligned} Note that \mathcal{F}_{\sigma,d,1} denotes networks with a single node, and \mathcal{F}_{\sigma,d} is the linear span (in function space) of single-node networks.

First consider the (unusual) activation \sigma = \cos. Since 2 \cos(y)\cos(z) = \cos(y+z) + \cos(y-z), then \begin{aligned} & 2 \left[{\sum_{i=1}^m a_i \cos(w_i^{\scriptscriptstyle\mathsf{T}}x + b_i) }\right] \cdot \left[{ \sum_{j=1}^n c_j \cos(u_j^{\scriptscriptstyle\mathsf{T}}x + v_j) }\right]= \\ & \sum_{i=1}^m \sum_{j=1}^n a_i c_j \left({ \cos((w_i+u_j)^{\scriptscriptstyle\mathsf{T}}x + (b_i+v_j)) + \cos((w_i - u_j)^{\scriptscriptstyle\mathsf{T}}x + (b_i-v_j)) }\right), \end{aligned} thus f,g\in\mathcal{F}_{\cos,d} \Longrightarrow fg \in \mathcal{F}_{\cos,d} ! In other words, \mathcal{F}_{\cos,d} is closed under multiplication, and since we know we can approximate univariate functions arbitrarily well, this suggests that we can approximate x \mapsto \prod_i \mathbf{1}\left[{ x_i \in [a_i,b_i] }\right] = \mathbf{1}\left[{ x\in \times_i [a_i,b_i] }\right], and use it to achieve our more general approximation goal.
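
As a sanity check of this closure under multiplication, the following sketch multiplies two (randomly chosen, purely illustrative) elements of \mathcal{F}_{\cos,d} and verifies numerically that the product-to-sum identity produces another element of \mathcal{F}_{\cos,d}.

    import numpy as np

    def cos_net(x, a, W, b):
        # element of F_{cos,d}: x -> sum_i a_i cos(w_i^T x + b_i)
        return (a * np.cos(x @ W.T + b)).sum(axis=1)

    rng = np.random.default_rng(0)
    d, m, n = 2, 3, 4
    a, W, b = rng.normal(size=m), rng.normal(size=(m, d)), rng.normal(size=m)
    c, U, v = rng.normal(size=n), rng.normal(size=(n, d)), rng.normal(size=n)

    # 2 cos(y) cos(z) = cos(y+z) + cos(y-z) turns the product into a cos-network
    a2 = 0.5 * np.concatenate([(a[:, None] * c[None, :]).ravel(),
                               (a[:, None] * c[None, :]).ravel()])
    W2 = np.concatenate([(W[:, None, :] + U[None, :, :]).reshape(-1, d),
                         (W[:, None, :] - U[None, :, :]).reshape(-1, d)])
    b2 = np.concatenate([(b[:, None] + v[None, :]).ravel(),
                         (b[:, None] - v[None, :]).ravel()])

    x = rng.normal(size=(5, d))
    print(np.allclose(cos_net(x, a, W, b) * cos_net(x, c, U, v), cos_net(x, a2, W2, b2)))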

We’re in good shape to give the general universal approximation result. The classical Weierstrass theorem establishes that polynomials are universal approximators (Weierstrass 1885), and its generalization, the Stone-Weierstrass theorem, says that any family of functions satisfying some of the same properties as polynomials will also be a universal approximator. Thus we will show \mathcal{F}_{\sigma,d} is a universal approximator via Stone-Weierstrass, a key step being closure under multiplication as above; this proof scheme was first suggested in (Hornik, Stinchcombe, and White 1989), but is now a fairly standard way to prove universal approximation.

First, here is the statement of the Stone-Weierstrass Theorem.

Theorem 2.2 (Stone-Weierstrass; (Folland 1999, Theorem 4.45)) .

Let functions \mathcal{F} be given as follows.

  1. Each f\in\mathcal{F} is continuous.

  2. For every x, there exists f\in\mathcal{F} with f(x) \neq 0.

  3. For every x\neq x' there exists f\in\mathcal{F} with f(x)\neq f(x') (\mathcal{F} separates points).

  4. \mathcal{F} is closed under multiplication and vector space operations (\mathcal{F} is an algebra).

Then \mathcal{F} is a universal approximator: for every continuous g :\mathbb{R}^d \to \mathbb{R} and \epsilon> 0, there exists f\in\mathcal{F} with \sup_{x\in[0,1]^d} |f(x)-g(x)|\leq \epsilon.

Remark 2.5.

First, we go back to \cos activations, which was the original choice in (Hornik, Stinchcombe, and White 1989); we can then handle arbitrary activations by univariate approximation of \cos, without increasing the depth (but increasing the width).

Lemma 2.2 ((Hornik, Stinchcombe, and White 1989)) .
\mathcal{F}_{\cos,d} is universal.

Proof. Let’s check the Stone-Weierstrass conditions:

  1. Each f \in \mathcal{F}_{\cos,d} is continuous.

  2. For each x, \cos(0^{\scriptscriptstyle\mathsf{T}}x) = 1 \neq 0.

  3. For each x\neq x', f(z) := \cos((z-x')^{\scriptscriptstyle\mathsf{T}}(x-x') / \|x-x'\|^2)\in\mathcal{F}_d satisfies f(x) = \cos(1) \neq \cos(0) = f(x').

  4. \mathcal{F}_{\cos,d} is closed under products and vector space operations as before.

We can work it out even more easily for \mathcal{F}_{\exp,d}.

Lemma 2.3.
\mathcal{F}_{\exp,d} is universal.

Proof. Let’s check the Stone-Weierstrass conditions:

  1. Each f \in \mathcal{F}_{\exp,d} is continuous.

  2. For each x, \exp(0^{\scriptscriptstyle\mathsf{T}}x) = 1 \neq 0.

  3. For each x\neq x', f(z) := \exp((z-x')^{\scriptscriptstyle\mathsf{T}}(x-x') / \|x-x'\|^2)\in\mathcal{F}_d satisfies f(x) = \exp(1) \neq \exp(0) = f(x').

  4. \mathcal{F}_{\exp,d} is closed under vector space operations by construction; for products, \begin{aligned} \left({\sum_{i=1}^n r_i \exp(a_i^{\scriptscriptstyle\mathsf{T}}x) }\right) \left({\sum_{j=1}^m s_j \exp(b_j^{\scriptscriptstyle\mathsf{T}}x) }\right) = \sum_{i=1}^n \sum_{j=1}^m r_i s_j \exp( (a_i+b_j)^{\scriptscriptstyle\mathsf{T}}x). \end{aligned}

Now let’s handle arbitrary activations.

Theorem 2.3 ((Hornik, Stinchcombe, and White 1989)) .
Suppose \sigma:\mathbb{R}\to\mathbb{R} is sigmoidal: it is continuous, and \lim_{z\to-\infty} \sigma(z) = 0, \qquad \lim_{z\to+\infty} \sigma(z) = 1. Then \mathcal{F}_{\sigma,d} is universal.

Proof sketch (details in hw1). Given \epsilon>0 and continuous g, use Lemma 2.2 ((Hornik, Stinchcombe, and White 1989)) (or Lemma 2.3) to obtain h\in\mathcal{F}_{\cos,d} (or \mathcal{F}_{\exp,d}) with \sup_{x\in[0,1]^d}|h(x)-g(x)|\leq \epsilon/2. To finish, replace all appearances of \cos with an element of \mathcal{F}_{\sigma,1} so that the total additional error is \epsilon/2.

Remark 2.6.
Remark 2.7 (other universal approximation proofs) .

3 Infinite-width Fourier representations and the Barron norm

This section presents two ideas which have recently become very influential again.

  1. Using infinite-width networks. This may seem complicated, but in fact it simplifies many things, and better captures certain phenomena.

  2. Barron’s approximation theorem and norm (Barron 1993). Barron’s original goal was an approximation result which requires few nodes in some favorable cases. Interestingly, his construction can be presented as an infinite-width representation with equality, and furthermore the construction gives approximation guarantees near initialization (e.g., for the NTK, the topic of the next section).

We will finish the section with a more general view of these infinite-width constructions, and a technique to sample finite-width networks from them.

3.1 Infinite-width univariate approximations

Let’s warm up with some univariate constructions.

Proposition 3.1.
Suppose g:\mathbb{R}\to\mathbb{R} is differentiable, and g(0) = 0. If x \in [0,1], then g(x) = \int_0^1 \mathbf{1}[x \geq b] g'(b) {\text{d}}b.

Proof. By FTC and g(0) = 0 and x\in[0,1], g(x) = g(0) + \int_0^x g'(b) {\text{d}}b = 0 + \int_0^1 \mathbf{1}[x \geq b] g'(b){\text{d}}b. \hspace{6em}

That’s really it! We’ve written a differentiable function as a shallow infinite-width network, with equality, effortlessly.
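
A quick numerical illustration of Proposition 3.1 (the choice g(x) = \sin(3x) and the quadrature grid are arbitrary): discretize \int_0^1 \mathbf{1}[x \geq b] g'(b) {\text{d}}b on a fine grid of thresholds and compare to g(x).

    import numpy as np

    g = lambda x: np.sin(3.0 * x)           # g(0) = 0
    gprime = lambda b: 3.0 * np.cos(3.0 * b)

    # midpoint-rule quadrature of the infinite-width representation
    n = 100000
    b = (np.arange(n) + 0.5) / n
    w = gprime(b) / n                        # weight g'(b) db of each threshold node

    for x in [0.1, 0.5, 0.9]:
        approx = np.sum(w * (x >= b))        # approximates int_0^1 1[x >= b] g'(b) db
        print(x, approx, g(x))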

Remark 3.1.
In the last subsection, when we sample from infinite-width networks, the error for this univariate case will scale with \int_0^1 |g'(x)|{\text{d}}x. This quantity is adaptive, e.g., correctly not paying for flat regions, which was discussed after our basic grid-based univariate approximation in Proposition 2.1. As mentioned before, this is a big point of contrast with polynomial approximation.

3.2 Barron’s construction for infinite-width multivariate approximation

This approach uses Fourier transforms; for those less familiar, it might seem daunting, but:

Let’s first argue it’s natural. Recall the Fourier transform (e.g., Folland 1999, Chapter 8): \hat f (w) := \int \exp(-2\pi i w^{\scriptscriptstyle\mathsf{T}}x) f(x) {\text{d}}x. We also have Fourier inversion: if f\in L^1 and \hat f\in L^1, f(x) = \int \exp(2\pi i w^{\scriptscriptstyle\mathsf{T}}x) \hat f(w) {\text{d}}w. The inversion formula rewrites f as an infinite-width network! The only catch is that the activations are not only non-standard, they are over the complex plane.

Remark 3.2.
Unfortunately, there are different conventions for the Fourier transform (in fact, the original work we reference uses a different one (Barron 1993)).

Barron’s approach is to convert these activations into something more normal; here we’ll use threshold nodes, but others are fine as well. If our starting function f is over the reals, then using \Re to denote the real part of a complex number, meaning \Re(a+bi) = a, then f(x) = \Re f(x) = \int \Re \exp(2\pi i w^{\scriptscriptstyle\mathsf{T}}x) \hat f(w) {\text{d}}w. If we expand with e^{i z} = \cos(z) + i \sin(z), we’re left with \cos, which is not compactly supported; to obtain an infinite-width form with threshold gates using a density which is compactly supported, Barron uses two tricks.

  1. Polar decomposition. Let’s split up the Fourier transform \hat f into magnitude and phase parts: write \hat f(w) = |\hat f(w)| \exp( 2\pi i \theta(w)) with |\theta(w)| \leq 1. Since f is real-valued, \begin{aligned} f(x) &= \Re \int \exp(2\pi i w^{\scriptscriptstyle\mathsf{T}}x) \hat f(w) {\text{d}}w \\ &= \int \Re\left({ \exp(2\pi i w^{\scriptscriptstyle\mathsf{T}}x) \exp(2\pi i\theta(w)) | \hat f(w) |}\right) {\text{d}}w \\ &= \int \Re\left({ \exp(2\pi i w^{\scriptscriptstyle\mathsf{T}}x + 2\pi i\theta(w)) }\right) |\hat f(w)| {\text{d}}w \\ &= \int \cos\left({ 2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) }\right) |\hat f(w)| {\text{d}}w. \end{aligned} We’ve now obtained an infinite width network over real-valued activations! \cos is neither compactly supported nor approaches a limit as its argument goes to \pm\infty, which is where Barron’s second trick comes in.

  2. Turning cosines into bumps! We’ll do two things to achieve our goal: subtracting f(0), and scaling by \|w\|: \begin{aligned} &f(x) - f(0) \\ &= \int \left[{\cos(2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) ) -\cos(2\pi w^{\scriptscriptstyle\mathsf{T}}0 + 2\pi\theta(w))}\right] |\hat f(w)| {\text{d}}w \\ &= \int \frac{\cos(2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) ) - \cos(2\pi\theta(w))}{\|w\|} \|w\| \cdot |\hat f(w)| {\text{d}}w. \end{aligned} The fraction does not blow up: since \cos is 1-Lipschitz, \begin{aligned} &\left|{ \frac{\cos(2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) ) - \cos(2\pi\theta(w))}{\|w\|} }\right| \\ &\leq \frac{\left|{2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) - 2\pi\theta(w)}\right|}{\|w\|} %\\ %& \leq \frac {2\pi |w^{\scriptscriptstyle\mathsf{T}}x|}{\|w\|} \leq 2\pi\|x\|. \end{aligned} This quantity is therefore well-behaved for bounded \|x\| so long as \|w\| |\hat f(w)| is well-behaved.

Barron combined these ideas with the sampling technique in Lemma 3.1 (Maurey (Pisier 1980)) to obtain estimates on the number of nodes needed to approximate functions whenever \|w\|\cdot|\hat f(w)| is well-behaved. We will follow a simpler approach here: we will give an explicit infinite-width form via only the first trick above and some algebra, and only then invoke sampling. The quantity \|w\|\cdot|\hat f(w)| will appear in the estimate of the “mass” of the infinite-width network as used to estimate how much to sample, analogous to the quantity \int_0^1 |g'(x)|{\text{d}}x from Proposition 2.1.

Before continuing, let’s discuss \|w\|\cdot|\hat f(w)| a bit more, which can be simplified via \widehat{\nabla f}(w) = 2\pi i w \hat f(w) into a form commonly seen in the literature.

Definition 3.1.
The quantity \int \left\|{ \widehat{\nabla f}(w) }\right\| {\text{d}}w = 2\pi\int \|w\|\cdot |\hat f(w)| {\text{d}}w is the Barron norm of a function f. The corresponding Barron class with norm C is \mathcal{F}_C := \left\{{ f : \mathbb{R}^d \to \mathbb{R}\quad : \quad \hat f\ \text{exists}, \int \left\|{\widehat{\nabla f}(w)}\right\|{\text{d}}w \leq C }\right\}.
Remark 3.3.
Barron’s approximation bounds were on \mathcal{F}_C, and in particular the number of nodes needed scaled with C/\epsilon^2, where \epsilon is the target accuracy. As we will see later, since threshold units are (kind of) the derivatives of ReLUs, then the Barron norm can also be used for complexity estimates of shallow networks near initialization (the NTK regime) (Ji, Telgarsky, and Xian 2020). My friend Daniel Hsu told me that related ideas are in (Bach 2017) as well, though I haven’t read closely and fleshed out this connection yet.

Here is our approach in detail. Continuing with the previous Barron representation and using \|x\|\leq1, \begin{aligned} \cos(2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w)) - \cos(2\pi\theta(w)) &= \int_0^{w^{\scriptscriptstyle\mathsf{T}}x} - 2\pi\sin(2\pi b + 2\pi\theta(w)){\text{d}}b \\ &= - 2\pi\int_0^{\|w\|} \mathbf{1}[w^{\scriptscriptstyle\mathsf{T}}x - b \geq 0]\sin(2\pi b + 2\pi\theta(w)){\text{d}}b \\ & + 2\pi\int_{-\|w\|}^0 \mathbf{1}[-w^{\scriptscriptstyle\mathsf{T}}x + b \geq 0]\sin(2\pi b + 2\pi\theta(w)){\text{d}}b. \end{aligned} Plugging this into the previous form (before dividing by \|w\|), \begin{aligned} \hspace{-2em}f(x) - f(0) &= -2\pi\int\!\! \int_0^{\|w\|} \mathbf{1}[w^{\scriptscriptstyle\mathsf{T}}x - b\geq 0]\left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w \\ &+2\pi\int\!\! \int_{-\|w\|}^0 \mathbf{1}[-w^{\scriptscriptstyle\mathsf{T}}x + b\geq 0]\left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w, \end{aligned} an infinite width network with threshold nodes!
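
As a sanity check of this threshold representation, the following sketch compares, for a single fixed w and \theta (drawn arbitrarily, and with the density |\hat f(w)| dropped), the cosine difference against a quadrature of the two threshold integrals.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 4
    w = rng.normal(size=d)
    x = rng.normal(size=d); x /= np.linalg.norm(x)        # ||x|| <= 1
    theta = rng.uniform(-0.5, 0.5)
    r = np.linalg.norm(w)
    wx = w @ x

    lhs = np.cos(2*np.pi*wx + 2*np.pi*theta) - np.cos(2*np.pi*theta)

    # midpoint-rule quadrature of the two threshold integrals
    n = 200000
    b1 = (np.arange(n) + 0.5) * r / n                      # grid on [0, ||w||]
    b2 = -r + (np.arange(n) + 0.5) * r / n                 # grid on [-||w||, 0]
    t1 = -2*np.pi*np.sum((wx - b1 >= 0) * np.sin(2*np.pi*b1 + 2*np.pi*theta)) * r / n
    t2 = +2*np.pi*np.sum((-wx + b2 >= 0) * np.sin(2*np.pi*b2 + 2*np.pi*theta)) * r / n
    print(lhs, t1 + t2)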

We’ll tidy up with \widehat{ \nabla f}(w) = 2\pi i w \hat f(w), whereby \| \widehat {\nabla f}(w) \| = 2\pi\|w\|\cdot |\hat f(w)| as mentioned before. Lastly, to estimate the “mass” of this infinite width network (the integral of the density part of the integrand), \begin{aligned} & \left|{ 2\pi\int\!\! \int_0^{\|w\|} \left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w }\right| \\ &+ \left|{ 2\pi\int\!\! \int_{-\|w\|}^0 \left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w}\right| \\ &\leq 2\pi\int\!\! \int_{-\|w\|}^{\|w\|} \left|{ \sin(2\pi b + 2\pi\theta(w)) }\right| |\hat f(w)|{\text{d}}b{\text{d}}w \\ &\leq 2\pi\int 2\|w\| \cdot |\hat f(w)|{\text{d}}w \\ &= 2 \int \left\|{ \widehat{\nabla f}(w) }\right\|{\text{d}}w. \end{aligned} Summarizing this derivation gives the following version of Barron’s approach.

Theorem 3.1 (based on (Barron 1993)) .
Suppose \int \left\|{ \widehat{\nabla f}(w) }\right\| {\text{d}}w < \infty, f\in L_1, \hat f\in L_1, and write \hat f(w) = |\hat f(w)| \exp(2\pi i \theta(w)). For \|x\|\leq 1, \begin{aligned} &f(x) - f(0) = \int \frac{\cos(2\pi w^{\scriptscriptstyle\mathsf{T}}x + 2\pi\theta(w) ) - \cos(2\pi\theta(w))}{2\pi\|w\|} \left\|{ \widehat{\nabla f}(w) }\right\| {\text{d}}w \\ &= -2\pi\int\!\! \int_0^{\|w\|} \mathbf{1}[w^{\scriptscriptstyle\mathsf{T}}x - b\geq 0]\left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w \\ & \quad{}+ 2\pi\int\!\!\int_{-\|w\|}^0 \mathbf{1}[-w^{\scriptscriptstyle\mathsf{T}}x + b \geq 0]\left[{ \sin(2\pi b + 2\pi\theta(w)) |\hat f(w)| }\right]{\text{d}}b{\text{d}}w. \end{aligned} The corresponding measure on weights has mass at most 2 \int \Big\| \widehat{\nabla f}(w) \Big\|{\text{d}}w.

When combined with the sampling tools in section 3.3, we will recover Barron’s full result that the number of nodes needed to approximate f to accuracy \epsilon>0 is roughly \int \Big\| \widehat{\nabla f}(w) \Big\|{\text{d}}w / \epsilon^2.

Ideally, the Barron norm is small, for instance polynomial (rather than exponential) in dimension for interesting examples. Here are a few, mostly taken from (Barron 1993).

3.3 Sampling from infinite width networks

Now we will show how to obtain a finite-width representation from an infinite-width representation. Coarsely, given a representation \int \sigma(w^{\scriptscriptstyle\mathsf{T}}x) g(w) {\text{d}}w, we can form an estimate \frac 1 m \sum_{j=1}^m s_j \tilde\sigma(w_j^{\scriptscriptstyle\mathsf{T}}x), \qquad \text{where } s_j \in \pm 1, \ {} \tilde \sigma(z) = \sigma(z) \int |g(w)|{\text{d}}w, by sampling w_j \sim |g(w)| / \int |g(w)|{\text{d}}w, and letting s_j := \textrm{sgn}(g(w_j)), meaning the sign corresponding to whether w_j fell in a negative or positive region of g. In expectation, this estimate is equal to the original function.
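
Here is a small Monte Carlo sketch of this sampling scheme; the univariate signed density g, the ReLU activation, and the grid-based inverse-CDF sampler are all illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    relu = lambda z: np.maximum(z, 0.0)

    # infinite-width network  x -> int sigma(w*x) g(w) dw  (scalar w for simplicity)
    g = lambda w: np.sin(np.pi * w)                       # signed density on [-1, 1]
    n = 200000
    wgrid = -1 + (np.arange(n) + 0.5) * (2.0 / n)         # midpoint grid, dw = 2/n
    dw = 2.0 / n
    mass = np.sum(np.abs(g(wgrid))) * dw                  # int |g(w)| dw

    def target(x):                                        # quadrature of the integral
        return (relu(np.outer(wgrid, x)) * g(wgrid)[:, None]).sum(axis=0) * dw

    # sample w_j ~ |g(w)|/mass via inverse CDF on the grid; s_j = sgn(g(w_j))
    cdf = np.cumsum(np.abs(g(wgrid))); cdf /= cdf[-1]
    m = 2000
    wj = np.interp(rng.uniform(size=m), cdf, wgrid)
    sj = np.sign(g(wj))

    x = np.linspace(-1, 1, 5)
    estimate = (mass / m) * (sj[:, None] * relu(np.outer(wj, x))).sum(axis=0)
    print(np.round(target(x), 3))
    print(np.round(estimate, 3))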

Here we will give a more general construction where the integral is not necessarily over the Lebesgue measure, which is useful when it has discrete parts and low-dimensional sets. This section will follow the same approach as (Barron 1993), namely using Maurey’s sampling method (cf. Lemma 3.1 (Maurey (Pisier 1980))), which gives an L_2 error; it is possible to use these techniques to obtain an L_\infty error via the “co-VC dimension technique” (Gurvits and Koiran 1995), but this is not pursued here.

To build this up, first let us formally define these infinite-width networks and their mass.

Definition 3.2.
An infinite-width shallow network is characterized by a signed measure \nu over weight vectors in \mathbb{R}^p: x\mapsto \int \sigma(w^{\scriptscriptstyle\mathsf{T}}x) {\text{d}}\nu(w). The mass of \nu is the total positive and negative weight mass assigned by \nu: |\nu|(\mathbb{R}^p) = \nu_-(\mathbb{R}^p) + \nu_+(\mathbb{R}^p).
Remark 3.4.
We can connect this to the initial discussion of \int \sigma(w^{\scriptscriptstyle\mathsf{T}}x)g(w){\text{d}}w by defining a signed measure \nu via {\text{d}}\nu = g, and the mass is once again |\nu|(\mathbb{R}^p) = \int |g(w)|{\text{d}}w, and the positive and negative parts \nu_- and \nu_+ are simply the regions where g is respectively negative (or just non-positive) and positive.
In the case of general measures, a decomposition into \nu_- and \nu_+ is guaranteed to exist (Jordan decomposition, Folland 1999), and is unique up to null sets.
The notation here uses \mathbb{R}^p not \mathbb{R}^d since we might bake in biases and other feature mappings.

To develop sampling bounds, first we give the classical general Maurey sampling technique, which is stated as sampling in Hilbert spaces.

Suppose X = \mathop{\mathbb{E}}V, where r.v. V is supported on a set S. A natural way to “simplify” X is to instead consider \hat X := \frac 1 k\sum_{i=1}^k V_i, where (V_1,\ldots,V_k) are sampled iid. We want to argue \hat X \approx X; since we’re in a Hilbert space, we’ll try to make the Hilbert norm \|X - \hat X\| small.

Lemma 3.1 (Maurey (Pisier 1980)) .
Let X = \mathop{\mathbb{E}}V be given, with V supported on S, and let (V_1,\ldots,V_k) be iid draws from the same distribution. Then \mathop{\mathbb{E}}_{V_1,\ldots,V_k} \left\|{X - \frac 1 k \sum_i V_i}\right\|^2 \leq \frac {\mathop{\mathbb{E}}\|V\|^2}{k} \leq \frac{\sup_{U \in S} \|U\|^2}{k}, and moreover there exist (U_1,\ldots, U_k) in S so that \left\|{ X - \frac 1 k \sum_i U_i}\right\|^2 \leq \mathop{\mathbb{E}}_{V_1,\ldots,V_k} \left\|{X - \frac 1 k \sum_i V_i}\right\|^2.
Remark 3.5.
After proving this, we’ll get a corollary for sampling from networks.
This lemma is widely applicable; e.g., we’ll use it for generalization too.
First used for neural networks by (Barron 1993) and (Jones 1992), attributed to Maurey by (Pisier 1980).

Proof of Lemma 3.1 (Maurey (Pisier 1980)). Let (V_1,\ldots,V_k) be IID as stated. Then \begin{aligned} &\mathop{\mathbb{E}}_{V_1,\ldots,V_k}\left\|{ X - \frac 1 k \sum_i V_i }\right\|^2 \\ & = \mathop{\mathbb{E}}_{V_1,\ldots,V_k}\left\|{ \frac 1 k \sum_i \left({ V_i - X }\right) }\right\|^2 \\ &= \mathop{\mathbb{E}}_{V_1,\ldots,V_k}\frac 1 {k^2} \left[{ \sum_i \left\|{ V_i - X }\right\|^2 + \sum_{i\neq j} \left\langle V_i - X , V_j - X \right \rangle }\right] \\ &= \mathop{\mathbb{E}}_{V}\frac 1 {k} \left\|{ V - X }\right\|^2 \\ &= \mathop{\mathbb{E}}_{V}\frac 1 {k}\left({ \left\|{ V }\right\|^2 - \left\|{ X }\right\|^2 }\right) \\ &\leq \mathop{\mathbb{E}}_{V}\frac 1 {k} \left\|{ V }\right\|^2 \leq \sup_{U\in S}\frac 1 {k} \left\|{ U }\right\|^2. \end{aligned} To conclude, there must exist (U_1,\ldots,U_k) in S so that \left\|{ X - k^{-1} \sum_i U_i }\right\|^2 \leq \mathop{\mathbb{E}}_{V_1,\ldots,V_k}\left\|{ X - k^{-1} \sum_i V_i }\right\|^2. (“Probabilistic method.”)
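
A tiny simulation of Lemma 3.1 in \mathbb{R}^d with the Euclidean norm (the finite support set S and all sizes are illustrative): the average squared error stays below \mathop{\mathbb{E}}\|V\|^2/k, and hence below \sup_{U\in S}\|U\|^2/k.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k, trials = 10, 25, 2000

    # V uniform on a finite set S inside the unit ball, X = E[V]
    S = rng.normal(size=(50, d))
    S[:, 0] += 2.0                      # give the mean X a nontrivial direction
    S /= np.maximum(np.linalg.norm(S, axis=1, keepdims=True), 1.0)   # sup ||U|| <= 1
    X = S.mean(axis=0)

    sq_errs = []
    for _ in range(trials):
        idx = rng.integers(0, len(S), size=k)             # k iid draws V_1, ..., V_k
        sq_errs.append(np.sum((X - S[idx].mean(axis=0)) ** 2))
    print(np.mean(sq_errs), "<=", np.mean(np.sum(S**2, axis=1)) / k, "<=", 1.0 / k)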

Now let’s apply this to infinite-width networks in the generality of Definition 3.2. We have two issues to resolve.

Let’s write a generalized shallow network as x\mapsto \int g(x;w){\text{d}}\mu(w), where \mu is a nonzero signed measure over some abstract parameter space \mathbb{R}^p. E.g., w = (a,b,v) and g(x;w) = a \sigma(v^{\scriptscriptstyle\mathsf{T}}x + b).

To sample from this network, let {\widetilde{\mu}} denote the probability measure on pairs (w,s)\in\mathbb{R}^p\times\{\pm 1\} given by drawing w proportionally to |\mu| = \mu_+ + \mu_- and setting s := +1 or s := -1 according to whether w came from the positive or negative part, and define {\tilde{g}}(x;w,s) := \|\mu\|_1\, s\, g(x;w), where \|\mu\|_1 := |\mu|(\mathbb{R}^p). This sampling procedure has the correct mean: \begin{aligned} \int g(x;w) {\text{d}}\mu(w) &= \int g(x;w) {\text{d}}\mu_+(w) - \int g(x;w) {\text{d}}\mu_-(w) \\ &= \|\mu_+\|_1 \mathop{\mathbb{E}}_{\tilde\mu_+} g(x;w) - \|\mu_-\|_1 \mathop{\mathbb{E}}_{\tilde\mu_-} g(x;w) \\ &= \|\mu\|_1\left[{ \mathop{\textrm{Pr}}_{{\widetilde{\mu}}}[s = +1] \mathop{\mathbb{E}}_{{\widetilde{\mu}}_+} g(x;w) - \mathop{\textrm{Pr}}_{{\widetilde{\mu}}}[s = -1] \mathop{\mathbb{E}}_{{\widetilde{\mu}}_-} g(x;w)}\right] = \mathop{\mathbb{E}}_{{\widetilde{\mu}}} \tilde g(x;w,s). \end{aligned}

Lemma 3.2 (Maurey for signed measures) .
Let \mu denote a nonzero signed measure supported on S\subseteq \mathbb{R}^p, and write g(x) := \int g(x;w){\text{d}}\mu(w). Let ({\tilde{w}}_1,\ldots,{\tilde{w}}_k) be IID draws from the corresponding {\widetilde{\mu}}, and let P be a probability measure on inputs x. Then \begin{aligned} \mathop{\mathbb{E}}_{{\tilde{w}}_1,\ldots,{\tilde{w}}_k} \left\|{ g - \frac 1 k \sum_i {\tilde{g}}(\cdot;{\tilde{w}}_i) }\right\|_{L_2(P)}^2 &\leq \frac {\mathop{\mathbb{E}}\|{\tilde{g}}(\cdot;{\tilde{w}})\|_{L_2(P)}^2}{k} \\ &\leq \frac {\|\mu\|_1^2 \sup_{w\in S} \|g(\cdot;w)\|_{L_2(P)}^2}{k}, \end{aligned} and moreover there exist (w_1,\ldots,w_k) in S and s\in\{\pm1\}^k with \left\|{ g - \frac 1 k \sum_i {\tilde{g}}(\cdot;w_i,s_i) }\right\|_{L_2(P)}^2 \leq \mathop{\mathbb{E}}_{{\tilde{w}}_1,\ldots,{\tilde{w}}_k} \left\|{ g - \frac 1 k \sum_i {\tilde{g}}(\cdot;{\tilde{w}}_i) }\right\|_{L_2(P)}^2.

Proof. By the mean calculation we did earlier, g = \mathop{\mathbb{E}}_{{\widetilde{\mu}}} \|\mu\|_1 s g(\cdot;w) = \mathop{\mathbb{E}}_{{\widetilde{\mu}}} {\tilde{g}}, so by the regular Maurey applied to {\widetilde{\mu}} and Hilbert space L_2(P) (i.e., writing V := {\tilde{g}} and g = \mathop{\mathbb{E}}V), \begin{aligned} \mathop{\mathbb{E}}_{{\tilde{w}}_1,\ldots,{\tilde{w}}_k} \left\|{ g - \frac 1 k \sum_i {\tilde{g}}(\cdot;{\tilde{w}}_i) }\right\|_{L_2(P)}^2 &\leq \frac {\mathop{\mathbb{E}}\|{\tilde{g}}(\cdot;{\tilde{w}})\|_{L_2(P)}^2}{k} \\ &\leq \frac {\sup_{s \in \{\pm 1\}} \sup_{w\in S} \left\|{\|\mu\|_1 s g(\cdot;w)}\right\|^2_{L_2(P)}}{k} \\ &\leq \frac {\|\mu\|_1^2 \sup_{w\in S} \|g(\cdot;w)\|_{L_2(P)}^2}{k}, \end{aligned} and the existence of the fixed (w_i,s_i) is also from Maurey.

Example 3.1 (various infinite-width sampling bounds) .
  1. Suppose x\in [0,1] and f is differentiable. Using our old univariate calculation, f(x) - f(0) = \int_0^1 \mathbf{1}[x \geq b] f'(b) {\text{d}}b. Let \mu denote f'(b){\text{d}}b; then a sample ((b_i,s_i))_{i=1}^k from {\widetilde{\mu}} satisfies \begin{aligned} \left\|{ f(\cdot) - f(0) - \frac {\|\mu\|_1} k \sum_i s_i \mathbf{1}[\cdot \geq b_i] }\right\|_{L_2(P)}^2 &\leq \frac {\|\mu\|_1^2 \sup_{b\in[0,1]} \|\mathbf{1}[\cdot \geq b]\|_{L_2(P)}^2}{k} \\ &= \frac 1 k \left({ \int_0^1 |f'(b)|{\text{d}}b }\right)^2. \end{aligned}
  2. Now consider the Fourier representation via Barron’s theorem: \begin{aligned} f(x) - f(0) &= -2\pi\int\!\! \int_0^{\|w\|} \mathbf{1}[w^{\scriptscriptstyle\mathsf{T}}x - b\geq 0]\left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w \\ & \quad{}+ 2\pi\int\!\!\int_{-\|w\|}^0 \mathbf{1}[-w^{\scriptscriptstyle\mathsf{T}}x + b \geq 0]\left[{\sin(2\pi b + 2\pi\theta(w))|\hat f(w)|}\right]{\text{d}}b{\text{d}}w, \end{aligned} and also our calculation that the corresponding measure \mu on thresholds has \|\mu\|_1 \leq 2\int \|\widehat{\nabla f}(w)\|{\text{d}}w. Then Maurey’s lemma implies that there exist ((w_i,b_i,s_i))_{i=1}^k such that, for any probability measure P supported on \|x\|\leq 1, \begin{aligned} \left\|{ f(\cdot) - f(0) - \frac {\|\mu\|_1} k \sum_i s_i \mathbf{1}[\left\langle w_i, \cdot \right \rangle \geq b_i] }\right\|_{L_2(P)}^2 &\leq \frac {\|\mu\|_1^2 \sup_{w,b} \|\mathbf{1}[\left\langle w, \cdot \right \rangle \geq b]\|_{L_2(P)}^2}{k} \\ &\leq \frac {4 \left({\int \|\widehat{\nabla f}(w)\|{\text{d}}w}\right)^2} k. \end{aligned}

4 Approximation near initialization and the Neural Tangent Kernel

In this section we consider networks close to their random initialization. Briefly, the core idea is to compare a network f :\mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R}, which takes input x\in\mathbb{R}^d and has parameters W\in\mathbb{R}^p, to its first-order Taylor approximation at random initialization W_0: f_0(x;W) := f(x;W_0) + \left\langle \nabla f(x;W_0), W-W_0 \right \rangle. The key property of this simplification is that while it is nonlinear in x, it is affine in W, which will greatly ease analysis. This section is roughly organized as follows.

4.1 Basic setup: Taylor expansion of shallow networks

As explained shortly, we will almost solely consider the shallow case: f(x;W) := \frac 1 {\sqrt m} \sum_{j=1}^m a_j \sigma(w_j^{\scriptscriptstyle\mathsf{T}}x), \qquad W := \begin{bmatrix} \gets w_1^{\scriptscriptstyle\mathsf{T}}\to \\ \vdots\\ \gets w_m^{\scriptscriptstyle\mathsf{T}}\to \end{bmatrix} \in \mathbb{R}^{m\times d}, (2) where \sigma will either be a smooth activation or the ReLU, and we will treat a\in\mathbb{R}^m as fixed and only allow W\in\mathbb{R}^{m\times d} to vary. There are a number of reasons for this exact formalism; they are summarized below in Remark 4.1.

Now let’s consider the corresponding first-order Taylor approximation f_0 in detail. Consider any univariate activation \sigma which is differentiable except on a set of measure zero (e.g., countably many points), and Gaussian initialization W_0\in\mathbb{R}^{m\times d} as before. Consider the Taylor expansion at initialization: \begin{aligned} f_0(x;W) &= f(x;W_0) + \left\langle \nabla f(x;W_0), W-W_0 \right \rangle \\ &= \frac {1}{\sqrt m}\sum_{j=1}^m a_j \left({ \sigma(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x) + \sigma'(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x) x^{\scriptscriptstyle\mathsf{T}}(w_j - w_{0,j}) }\right) \\ &= \frac {1}{\sqrt m}\sum_{j=1}^m a_j \left({ \left[{ \sigma(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x) - \sigma'(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x) w_{0,j}^{\scriptscriptstyle\mathsf{T}}x}\right] + \sigma'(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x) w_j^{\scriptscriptstyle\mathsf{T}}x }\right). \end{aligned} If \sigma is nonlinear, then this mapping is nonlinear in x, despite being affine in W! Indeed \nabla f(\cdot;W_0) defines a feature mapping: \nabla f(x;W_0) := \frac 1 {\sqrt m}\begin{bmatrix} \gets& a_1 \sigma'(w_{0,1}^{\scriptscriptstyle\mathsf{T}}x) x^{\scriptscriptstyle\mathsf{T}}&\to\\ &\vdots&\\ \gets& a_m \sigma'(w_{0,m}^{\scriptscriptstyle\mathsf{T}}x) x^{\scriptscriptstyle\mathsf{T}}&\to \end{bmatrix}; the predictor f_0 is an affine function of the parameters, and is also affine in this feature-mapped data.
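
A minimal pytorch sketch of this linearization (the width, input, and perturbation size are illustrative choices): compute \nabla f(x;W_0) with autograd and compare f_0(x;W) against f(x;W) for a nearby W.

    import torch

    torch.manual_seed(0)
    m, d = 512, 5
    a = (torch.randint(0, 2, (m,)) * 2 - 1).float()        # fixed outer layer, entries +-1

    def f(x, W):
        # eq. (2): (1/sqrt(m)) sum_j a_j relu(w_j^T x)
        return (a * torch.relu(W @ x)).sum() / m**0.5

    x = torch.randn(d); x /= x.norm()
    W0 = torch.randn(m, d, requires_grad=True)
    y0 = f(x, W0)
    grad0, = torch.autograd.grad(y0, W0)                    # nabla f(x; W_0), an m x d matrix

    W = W0.detach() + 0.1 * torch.randn(m, d)               # a nearby parameter setting
    f0 = y0.item() + (grad0 * (W - W0.detach())).sum().item()
    print(f(x, W).item(), f0)                               # the two values nearly agree for large m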

Remark 4.1 (rationale for eq. 2) .
The factor \frac 1 {\sqrt m} will make the most sense in section 4.3; it gives a normalization that leads to a kernel. We only vary the inner layer W and keep the outer layer a fixed to have a nontrivial (nonlinear) model which still leads to non-convex training, but is arguably the simplest such. Random initialization is classical and used for many reasons, a classical one being a “symmetry break” which makes nodes distinct and helps with training (Hertz, Krogh, and Palmer 1991). A standard initialization is to have the first layer Gaussian with standard deviation 1/\sqrt{d}, and the second layer Gaussian with standard deviation 1/{\sqrt m}; in the case of the ReLU, positive constants can be pulled through, and equivalently we can use standard Gaussian initialization and place a coefficient 1 /\sqrt{md} out front; here we drop the 1/\sqrt{d} since we want to highlight a behavior that varies with m, whereas 1/\sqrt{d} is a constant. To simplify further, we will make the second layer \pm 1. pytorch initialization defaults to these standard deviations, but defaults to uniform distributions and not Gaussians. Lastly, some papers managed to set up the analysis so that the final layer does most of the work (and the training problem is convex for the last layer); thus we follow the convention of some authors and train all but the last layer, to rule out this possibility.
Remark 4.2.
Many researchers use the term “overparameterization” to refer to a number of phenomena, but rooted at their core in the use of many more parameters than are seemingly necessary. E.g., we know from the earlier sections that some number of nodes suffice to approximate certain types of functions, but in this section we see we might as well take the width m arbitrarily large. Many classical perspectives on the behavior of networks (e.g., their generalization properties) worsen with large width, so “overparameterization” also highlights many of these apparent contradictions.
Remark 4.3 (main bibliography for NTK) .
The paper that coined the term “NTK” is (Jacot, Gabriel, and Hongler 2018), which also argued gradient descent follows the NTK; a kernel connection was observed earlier in (Cho and Saul 2009).
Another very influential work is (Allen-Zhu, Li, and Song 2018), which showed that one can achieve arbitrarily small training error by running gradient descent on a large-width network, which thus stays within the NTK regime (close to initialization).
A few other early optimization references are (Simon S. Du et al. 2018) (Arora, Du, Hu, Li, and Wang 2019) (Allen-Zhu, Li, and Liang 2018). Also nearly parallel with (Jacot, Gabriel, and Hongler 2018) were (Li and Liang 2018; Simon S. Du et al. 2018).
Estimates of empirical infinite width performance are in (Arora, Du, Hu, Li, Salakhutdinov, et al. 2019) and (Novak et al. 2018).
(Various further works.) The NTK has appeared in a vast number of papers (and various papers use linearization and study the early stage of training, whether they refer to it as the NTK or not). Concurrent works giving general convergence to global minima are (Simon S. Du et al. 2018; Allen-Zhu, Li, and Liang 2018; Oymak and Soltanolkotabi 2019; Zou et al. 2018). Many works subsequently aimed to reduce the width dependence (Zou and Gu 2019; Oymak and Soltanolkotabi 2019); in the classification case, a vastly smaller width is possible (Ji and Telgarsky 2019a; Z. Chen et al. 2019). Another subsequent direction (in the regression case) was obtaining test error and not just training error bounds (Cao and Gu 2020b, 2020a; Arora, Du, Hu, Li, and Wang 2019). Lastly, another interesting point is the use of noisy gradient descent in some of these analyses (Allen-Zhu, Li, and Liang 2018; Z. Chen et al. 2020).
Some works use the term F_2 to refer to the kernel space we get after taking a Taylor expansion, and also contrast this with the space F_1 we get by considering all possible neural networks (e.g., those that are a discrete sum of nodes, which cannot be represented exactly with a finite-norm element of the RKHS F_2); this term mostly appears in papers with Francis Bach, see for instance Chizat and Bach (2020).
Remark 4.4 (scaling and temperature) .
Some authors include a multiplicative factor \epsilon>0 on the network output, meaning \frac {\epsilon}{\sqrt m} \sum_{j=1}^m a_j \sigma(w_j^{\scriptscriptstyle\mathsf{T}}x). Considering the effect of introducing \epsilon in f_0 as well, one can interpret this as having two effects:
Some authors fix \epsilon as a function of m and consider a resulting “scaling” behavior, namely that by taking m\to\infty, the Taylor expansion “zooms in,” and this provides one explanation of the behavior of the NTK; this perspective was summarized in (Chizat and Bach 2019).
Remark 4.5 (Practical regimes) .
“NTK regime” or “near initialization” are not well-defined, though generally the proofs in this setup require some combination of \|W - W_0\|_{{\textrm{F}}} = \mathcal{O}(1) (or the stronger form \max_j \|\mathbf{e}_j^{\scriptscriptstyle\mathsf{T}}(W - W_0)\|_2 = \mathcal{O}(1/\sqrt{m})), and/or at most a 1/\sqrt{m} fraction of the activations changing. In practice, these all seem to be violated almost immediately (e.g., after just one or two steps of gradient descent), but still the idea captures many interesting phenomena near initialization and does not degrade with overparameterization as do other approaches.
Remark 4.6 (multi-layer case) .
Let \vec W = (W_L,\ldots,W_1) denote a tuple of the parameters for each layer, whereby the Taylor expansion at initial values \vec W_0 now becomes x \mapsto f(x;\vec W_0) + \left\langle \vec W - \vec W_0, \nabla f(x;\vec W_0) \right \rangle. The inner product with \nabla f(x;\vec W_0) decomposes over layers, giving \left\langle \vec W - \vec W_0, \nabla f(x;\vec W_0) \right \rangle = \sum_{k=1}^L \left\langle W_k - W_{0,k}, \nabla_{W_k} f(x;\vec W_0) \right \rangle. We will revisit this multi-layer form later when discussing kernels.
Remark 4.7 (Taylor expansion around 0) .
There are a few reasons why we do the Taylor expansion around initialization; the main one is that Taylor approximation improves the closer you get to the point you are approximating, another one is that bounds that scale with \|W\|_{\textrm{F}} can be re-centered to now scale with the potentially much smaller quantity \|W-W_0\|_{\textrm{F}}, and lastly we get to invoke Gaussian concentration tools. Note however how things completely break down if we do what might initially seem a reasonable alternative: Taylor expansion around 0. Then we get f(x;0) + \left\langle \nabla f(x;0), W-0 \right \rangle = \frac {1}{\sqrt m}\sum_{j=1}^m a_j \left({ \sigma(0) + \sigma'(0) x^{\scriptscriptstyle\mathsf{T}}w_j }\right). This is once again affine in the parameters, but it is also affine in the inputs! So we don’t have any of the usual power of neural networks.
Remark 4.8 (simplification with the ReLU) .
If we use the ReLU \sigma(z) = \max\{0,z\}, then the property \sigma(z) = z \sigma'(z) (which is fine even at 0!) means \sigma(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x) - \sigma'(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x) w_{0,j}^{\scriptscriptstyle\mathsf{T}}x = 0, and thus f_0 as above simplifies to give \begin{aligned} f_0(x;W) &= \frac {1}{\sqrt m}\sum_{j=1}^m a_j \left({ \left[{ \sigma(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x) - \sigma'(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x) w_{0,j}^{\scriptscriptstyle\mathsf{T}}x}\right] + \sigma'(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x) w_j^{\scriptscriptstyle\mathsf{T}}x }\right) \\ &= \frac {1}{\sqrt m}\sum_{j=1}^m a_j \sigma'(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x) w_j^{\scriptscriptstyle\mathsf{T}}x = \left\langle \nabla f(x; W_0), W \right \rangle. \end{aligned}

4.2 Networks near initialization are almost linear

Our first step is to show that f-f_0 shrinks as m increases, which has a few immediate consequences.

First we handle the case that \sigma is smooth, by which we mean \sigma'' exists and satisfies |\sigma''|\leq \beta everywhere. This is not satisfied for the ReLU, but the proof is so simple that it is a good motivator for other cases.

Proposition 4.1.
If \sigma:\mathbb{R}\to\mathbb{R} is \beta-smooth, and |a_j|\leq 1, and \|x\|_2\leq 1, then for any parameters W,V\in\mathbb{R}^{m\times d}, \left|{f(x;W) - f(x;V) - \left\langle \nabla f(x;V), W-V \right \rangle}\right| \leq \frac {\beta} {2\sqrt m} \left\|{W-V}\right\|^2_{{\textrm{F}}}.

Proof. By Taylor’s theorem, \left|{ \sigma(r) - \sigma(s) - \sigma'(s)(r-s)}\right| = \left|{ \int_s^r \sigma''(z) (r-z){\text{d}}z}\right| \leq \frac {\beta(r-s)^2}{2}. Therefore \begin{aligned} &\left|{f(x;W) - f(x;V) - \left\langle \nabla f(x;V), W-V \right \rangle}\right| \\ &\leq \frac 1 {\sqrt m} \sum_{j=1}^m |a_j| \cdot \left|{ \sigma(w_j^{\scriptscriptstyle\mathsf{T}}x) - \sigma(v_j^{\scriptscriptstyle\mathsf{T}}x) - \sigma'(v_j^{\scriptscriptstyle\mathsf{T}}x)x^{\scriptscriptstyle\mathsf{T}}(w_j - v_j) }\right| \\ &\leq \frac 1 {\sqrt m} \sum_{j=1}^m \frac {\beta (w_j^{\scriptscriptstyle\mathsf{T}}x -v_j^{\scriptscriptstyle\mathsf{T}}x)^2}{2} \\ &\leq \frac {\beta} {2\sqrt m} \sum_{j=1}^m \|w_j-v_j\|^2 \\ &= \frac {\beta} {2\sqrt m} \left\|{W-V}\right\|^2_{{\textrm{F}}}. \end{aligned}
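
A quick numerical check of Proposition 4.1, using \tanh as the smooth activation (its second derivative is bounded in magnitude by roughly 0.77, used below as \beta); the sizes and the random choices of W and V are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    m, d, beta = 200, 5, 0.77                  # 0.77 >= sup |tanh''|

    a = rng.choice([-1.0, 1.0], size=m)
    x = rng.normal(size=d); x /= np.linalg.norm(x)

    def f(W):
        return (a * np.tanh(W @ x)).sum() / np.sqrt(m)

    def grad_f(W):                             # rows: a_j * tanh'(w_j^T x) * x^T / sqrt(m)
        return (a * (1 - np.tanh(W @ x) ** 2))[:, None] * x[None, :] / np.sqrt(m)

    W = rng.normal(size=(m, d))
    V = W + 0.3 * rng.normal(size=(m, d))
    lhs = abs(f(W) - f(V) - (grad_f(V) * (W - V)).sum())
    rhs = beta / (2 * np.sqrt(m)) * np.linalg.norm(W - V, 'fro') ** 2
    print(lhs, "<=", rhs)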

Remark 4.9.
The preceding lemma holds for any W, and doesn’t even need the Gaussian structure of W_0. This is unique to this shallow case, however; producing an analogous inequality with multiple layers of smooth activations will need to use random initialization.

Now we switch to the ReLU. The proof is much more complicated, but is instructive of the general calculations one must perform frequently with the ReLU.

Remark 4.10.
A multi-layer version of the following originally appeared in (Allen-Zhu, Li, and Song 2018); there, the multiple layers only hurt the bound, introducing factors based on depth. Moreover, the proof is much more complicated. Due to this, we only use a straightforward single-layer version, which appeared later in (Ji, Li, and Telgarsky 2021).
Lemma 4.1.
For any radius B\geq 0, for any fixed x\in\mathbb{R}^d with \|x\|\leq 1, with probability at least 1-\delta over the draw of W_0, for any W\in\mathbb{R}^{m\times d} with \|W-W_0\|_{\textrm{F}}\leq B, \left|{f(x;W) - f_0(x;W)}\right| \leq \frac {2B^{4/3} + B \ln(1/\delta)^{1/4}}{m^{1/6}}, and given any additional V\in\mathbb{R}^{m\times d} with \|V-W_0\|_{\textrm{F}}\leq B, \left|{f(x;V) - \left({f(x;W) + \left\langle \nabla_W f(x;W), V-W \right \rangle}\right)}\right| \leq \frac {6B^{4/3} + 2B\ln(1/\delta)^{1/4}}{m^{1/6}}.
Remark 4.11 (incorrect approach) .
Let’s see how badly things go awry if we try to brute-force the proof. By similar reasoning to the earlier ReLU simplification, \begin{aligned} \left|{f(x;W) - f_0(x;W)}\right| &= \left|{\left\langle \nabla f(x;W), W \right \rangle - \left\langle \nabla f(x;W_0), W \right \rangle}\right| \\ &= \left|{ \frac 1 {\sqrt m} \sum_j a_j \left({\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[w_{0,j}^{\scriptscriptstyle\mathsf{T}}x \geq 0]}\right) w_j ^{\scriptscriptstyle\mathsf{T}}x}\right|. \end{aligned} A direct brute-forcing with no sensitivity to random initialization gives \left|{f(x;W) - f_0(x;W)}\right| \leq \frac 1 {\sqrt{m}} \sum_j \|w_j \| \leq \|W\|_{\textrm{F}}. We can try to save a bit by using the randomness of (a_j)_{j=1}^m, but since Lemma 4.1 is claimed to hold for every \|W-W_0\|_{\textrm{F}}\leq B, the argument might be complicated. Our eventual proof will only use randomness of W_0.
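
Before proceeding to the proof, here is a small simulation in the spirit of Lemma 4.1 (the radius B, the data point, and the Frobenius-normalized perturbation are illustrative choices): for a fixed \|W-W_0\|_{\textrm{F}}, the gap |f(x;W) - f_0(x;W)| indeed shrinks as the width m grows.

    import numpy as np

    rng = np.random.default_rng(0)
    d, B = 5, 1.0
    x = rng.normal(size=d); x /= np.linalg.norm(x)

    def gap(m):
        a = rng.choice([-1.0, 1.0], size=m)
        W0 = rng.normal(size=(m, d))
        D = rng.normal(size=(m, d)); D *= B / np.linalg.norm(D)      # ||W - W0||_F = B
        W = W0 + D
        f_val = (a * np.maximum(W @ x, 0.0)).sum() / np.sqrt(m)
        # ReLU simplification: f_0(x; W) = <grad f(x; W_0), W>
        f0_val = (a * (W0 @ x >= 0) * (W @ x)).sum() / np.sqrt(m)
        return abs(f_val - f0_val)

    for m in [100, 1000, 10000, 100000]:
        print(m, np.mean([gap(m) for _ in range(20)]))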

The proof will use the following concentration inequality.

Lemma 4.2.
For any \tau > 0 and x\in \mathbb{R}^d with \|x\|>0, with probability at least 1-\delta, \sum_{j=1}^m \mathbf{1}\left[{|w_j^{\scriptscriptstyle\mathsf{T}}x| \leq \tau \|x\| }\right] \leq m \tau + \sqrt{\frac m 2 \ln\frac 1 \delta}.

Proof. For any row j, define an indicator random variable P_j := \mathbf{1}[ |w_j^{\scriptscriptstyle\mathsf{T}}x| \leq \tau\|x\| ]. By rotational invariance, P_j = \mathbf{1}[ |w_{j,1}| \leq \tau ], which by the form of the Gaussian density gives \mathop{\textrm{Pr}}[P_j = 1] = \int_{-\tau}^{+\tau} \frac 1 {\sqrt{2\pi}} e^{-z^2/2}{\text{d}}z \leq \frac {2\tau}{\sqrt{2\pi}} \leq \tau. By Hoeffding’s inequality, with probability at least 1-\delta, \sum_{j=1}^m P_j \leq m \mathop{\textrm{Pr}}[P_1 = 1] + \sqrt{\frac {m}{2} \ln \frac 1 \delta} \leq m \tau + \sqrt{\frac m 2 \ln \frac 1 \delta}.

Proof of Lemma 4.1. Fix x\in\mathbb{R}^d. If \|x\|=0, then for any W\in\mathbb{R}^{m\times d}, f(x;W) = 0 = f_0(x;W), and the proof is complete; henceforth consider the case \|x\|>0.

The proof idea is roughly as follows. The Gaussian initialization on W_0 concentrates around a rather large shell, and this implies |w_{0,j}^{\scriptscriptstyle\mathsf{T}}x| is large with reasonably high probability. If \|W-W_0\|_{\textrm{F}} is not too large, then \|w_{j} - w_{0,j}\| must be small for most rows j; this means that w_j^{\scriptscriptstyle\mathsf{T}}x and w_{0,j}^{\scriptscriptstyle\mathsf{T}}x must have the same sign for most j.

Proceeding in detail, fix a parameter r>0, which will be optimized shortly. Let W be given with \|W-W_0\|\leq B. Define the sets \begin{aligned} S_1 &:= \left\{{ j \in [m] : |w_{0,j}^{\scriptscriptstyle\mathsf{T}}x| \leq r \|x\| }\right\}, \\ S_2 &:= \left\{{ j \in [m] : \|w_j-w_{0,j}\| \geq r }\right\}, \\ S &:= S_1 \cup S_2, \end{aligned} and note that S_1 depends only on W_0 and x, not on W. By Lemma 4.2 (applied to the Gaussian rows (w_{0,j})_{j=1}^m), with probability at least 1-\delta, |S_1| \leq r m + \sqrt{m \ln(1/\delta)}. On the other hand, B^2 \geq \|W-W_0\|^2 \geq \sum_{j\in S_2} \|w_j-w_{0,j}\|^2 \geq |S_2| r^2, meaning |S_2| \leq B^2 / r^2. For any j\not \in S, if w_{0,j}^{\scriptscriptstyle\mathsf{T}}x > 0, then w_{j}^{\scriptscriptstyle\mathsf{T}}x \geq w_{0,j}^{\scriptscriptstyle\mathsf{T}}x - \|w_j-w_{0,j}\|\cdot\|x\| > \|x\| \left({ r - r }\right) = 0, meaning \mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] = \mathbf{1}[w_{0,j}^{\scriptscriptstyle\mathsf{T}}x \geq 0]; the case that j\not\in S and w_{0,j}^{\scriptscriptstyle\mathsf{T}}x < 0 is analogous. Together, |S| \leq r m + \sqrt{m \ln(1/\delta)} + \frac {B^2}{r^2} \quad \textup{and} \quad j\not\in S \Longrightarrow \mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0 ] = \mathbf{1}[w_{0,j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]. Lastly, choose r to balance terms in |S|: picking r := B^{2/3} / m^{1/3} gives |S| \leq (Bm)^{2/3} + \sqrt{m\ln(1/\delta)} + (Bm)^{2/3} \leq m^{2/3} \left({ 2 B^{2/3} + \sqrt{\ln(1/\delta)}}\right).

Now that |S| has been bounded, the proof considers the two different statements separately, though their proofs are similar.

  1. As in the above remark, \begin{aligned} |f(x;W) - f_0(x;W)| &= \left|{ \left\langle \nabla f(x;W) - \nabla f(x;W_0), W \right \rangle }\right| \\ &=\frac 1 {\sqrt{m}} \left|{\sum_j a_j \left({\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[w_{0,j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]}\right) w_j^{\scriptscriptstyle\mathsf{T}}x}\right| \\ &\leq \frac 1 {\sqrt{m}} \sum_j \left|{\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[w_{0,j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]}\right| \cdot |w_j^{\scriptscriptstyle\mathsf{T}}x|. \end{aligned} To simplify this, as above \left|{\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[w_{0,j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]}\right| is only nonzero for j\in S. But when it is nonzero, this means \textrm{sgn}(w_j^{\scriptscriptstyle\mathsf{T}}x) \neq \textrm{sgn}(w_{0,j}^{\scriptscriptstyle\mathsf{T}}x), and thus |w_j^{\scriptscriptstyle\mathsf{T}}x| \leq |w_j^{\scriptscriptstyle\mathsf{T}}x - w_{0,j}^{\scriptscriptstyle\mathsf{T}}x|, and together with Cauchy-Schwarz (two applications!), and the above upper bound on |S| gives \begin{aligned} |f(x;W) - f_0(x;W)| &\leq \frac 1 {\sqrt{m}} \sum_j \left|{\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[w_{0,j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]}\right| \cdot |w_j^{\scriptscriptstyle\mathsf{T}}x| \\ &\leq \frac 1 {\sqrt{m}} \sum_{j\in S} \left|{w_j^{\scriptscriptstyle\mathsf{T}}x - w_{0,j}^{\scriptscriptstyle\mathsf{T}}x}\right| \\ &\leq \frac 1 {\sqrt{m}} \sum_{j\in S} \left\|{w_j - w_{0,j}}\right\| \\ &\leq \frac 1 {\sqrt{m}} \sqrt{|S|} \cdot \|W - W_0\|_{{\textrm{F}}} \\ &\leq B \sqrt{\frac{|S|} m} \\ &\leq B \sqrt{\frac {2 B^{2/3} + \sqrt{\ln(1/\delta)}} {m^{1/3}}} \\ &\leq \frac {2 B^{4/3} + B \ln(1/\delta)^{1/4}}{m^{1/6}}. \end{aligned}

  2. Following similar reasoning, \begin{aligned} & \left|{f(x;V) - \left({f(x;W) + \left\langle \nabla_W f(x;W), V-W \right \rangle}\right)}\right| \\ &= \left|{ \left\langle \nabla f(x;V) - \nabla f(x;W), V \right \rangle }\right| \\ &=\frac 1 {\sqrt{m}} \left|{\sum_j a_j \left({\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[v_{j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]}\right) v_j^{\scriptscriptstyle\mathsf{T}}x}\right| \\ &\leq \frac 1 {\sqrt{m}} \sum_j \left|{\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[v_{j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]}\right|\cdot \left|{v_j^{\scriptscriptstyle\mathsf{T}}x}\right| \\ &\leq \frac 1 {\sqrt{m}} \sum_j \left|{\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[v_{j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]}\right|\cdot \left|{ w_j^{\scriptscriptstyle\mathsf{T}}x - v_j^{\scriptscriptstyle\mathsf{T}}x}\right| \\ &\leq \frac 1 {\sqrt{m}} \sum_j \left|{\mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x \geq 0] - \mathbf{1}[v_{j}^{\scriptscriptstyle\mathsf{T}}x\geq 0]}\right|\cdot \|w_j - v_j\|, \end{aligned} where the middle inequality uses that a nonzero indicator difference means w_j^{\scriptscriptstyle\mathsf{T}}x and v_j^{\scriptscriptstyle\mathsf{T}}x have opposite signs, whence |v_j^{\scriptscriptstyle\mathsf{T}}x| \leq |w_j^{\scriptscriptstyle\mathsf{T}}x - v_j^{\scriptscriptstyle\mathsf{T}}x|. Now define S_3 analogously to S_2, but for the new matrix V: S_3 := \left\{{ j \in [m] : \|v_j-w_{0,j}\| \geq r }\right\}, and additionally define S_4 := S_1 \cup S_2 \cup S_3; for j\not\in S_4, both \mathbf{1}[w_j^{\scriptscriptstyle\mathsf{T}}x\geq 0] and \mathbf{1}[v_j^{\scriptscriptstyle\mathsf{T}}x\geq 0] agree with \mathbf{1}[w_{0,j}^{\scriptscriptstyle\mathsf{T}}x\geq 0], so the indicator difference vanishes. By the earlier choice of r and related calculations, with probability at least 1-\delta, |S_4| \leq r m + \sqrt{m \ln(1/\delta)} + \frac {2B^2}{r^2} \leq m^{2/3} \left({ 3 B^{2/3} + \sqrt{\ln(1/\delta)}}\right). Plugging this back in and continuing as before, \begin{aligned} \left|{ \left\langle \nabla f(x;V) - \nabla f(x;W), V \right \rangle }\right| &\leq \frac 1 {\sqrt{m}} \sum_{j\in S_4} \|w_j - v_j\| \\ &\leq \frac 1 {\sqrt{m}} \sqrt{|S_4|} \cdot \|V - W\|_{{\textrm{F}}} \\ &\leq 2B \sqrt{\frac {3B^{2/3} + \sqrt{\ln(1/\delta)}}{m^{1/3}}} \\ &\leq \frac {6B^{4/3} + 2B\ln(1/\delta)^{1/4}}{m^{1/6}}, \end{aligned} where the third inequality uses \|V-W\|_{\textrm{F}} \leq \|V-W_0\|_{\textrm{F}} + \|W_0-W\|_{\textrm{F}} \leq 2B.
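The following sketch (not from the notes; the parameter choices are ad hoc) checks both bounds numerically for a single Gaussian initialization, with a_j \in \{\pm 1\}, a unit-norm x, and perturbations of Frobenius norm exactly B.

```python
import numpy as np

# Numerical sketch (not from the notes) of the two ReLU bounds above: the gap between
# f and its linearization at initialization, and the Taylor-style gap between W and V,
# compared against (2 B^{4/3} + B ln(1/delta)^{1/4}) / m^{1/6} and its tripled/doubled analogue.
rng = np.random.default_rng(0)
m, d, B, delta = 20000, 10, 2.0, 0.01
a = rng.choice([-1.0, 1.0], size=m)
x = rng.normal(size=d); x /= np.linalg.norm(x)
W0 = rng.normal(size=(m, d))

def f(W):
    return a @ np.maximum(W @ x, 0.0) / np.sqrt(m)

def grad(W):                                   # (sub)gradient with respect to W
    return (a * (W @ x >= 0))[:, None] * x[None, :] / np.sqrt(m)

def perturb():                                 # a point with ||W - W0||_F = B exactly
    D = rng.normal(size=(m, d))
    return W0 + B * D / np.linalg.norm(D)

W, V = perturb(), perturb()
f0 = lambda M: np.sum(grad(W0) * M)            # linearization at initialization
gap1 = abs(f(W) - f0(W))
gap2 = abs(f(V) - (f(W) + np.sum(grad(W) * (V - W))))
bound1 = (2 * B ** (4 / 3) + B * np.log(1 / delta) ** 0.25) / m ** (1 / 6)
bound2 = (6 * B ** (4 / 3) + 2 * B * np.log(1 / delta) ** 0.25) / m ** (1 / 6)
print(f"{gap1:.4f} <= {bound1:.4f},  {gap2:.4f} <= {bound2:.4f}")
```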

4.3 Properties of the kernel at initialization

So far, we’ve said that f-f_0 is small when the width is large. Now we will focus on f_0, showing that it ranges over a large class of functions; thus, when the width is large, f obtained with small \|W-W_0\|_{\textrm{F}} can also capture many functions.

Remark 4.12 (kernel view) .
This analysis will take the kernel/RKHS view of f_0. The extent to which this perspective appears varies across treatments of the regime near initialization, including papers which never explicitly use any kernel concepts. In the original paper giving the name “NTK” (Jacot, Gabriel, and Hongler 2018), only f_0 (and not f) was considered, indeed in the multi-layer case, and in the infinite-width case, using a Gaussian process with a kernel given as here. We won’t use that infinite-width Gaussian process perspective here.

To start, let us see how to define a kernel. In the standard kernel setup, the kernel can be written as the inner product between feature mappings for two data points: \begin{aligned} k_m(x,x') &:= \left\langle \nabla f(x;W_0), \nabla f(x';W_0) \right \rangle \\ &= \left\langle \begin{bmatrix} \longleftarrow &a_1 x^{\scriptscriptstyle\mathsf{T}}\sigma'( w_{1,0}^{\scriptscriptstyle\mathsf{T}}x) / \sqrt{m} &\longrightarrow \\ &\vdots& \\ \longleftarrow &a_m x^{\scriptscriptstyle\mathsf{T}}\sigma'( w_{m,0}^{\scriptscriptstyle\mathsf{T}}x) / \sqrt{m}& \longrightarrow \end{bmatrix} , \begin{bmatrix} \longleftarrow &a_1 (x')^{\scriptscriptstyle\mathsf{T}}\sigma'( w_{1,0}^{\scriptscriptstyle\mathsf{T}}x') / \sqrt{m} &\longrightarrow \\ &\vdots& \\ \longleftarrow &a_m (x')^{\scriptscriptstyle\mathsf{T}}\sigma'( w_{m,0}^{\scriptscriptstyle\mathsf{T}}x') / \sqrt{m}& \longrightarrow \end{bmatrix} \right \rangle \\ &= \frac 1 m \sum_{j=1}^m a_j^2 \left\langle x \sigma'( w_{j,0}^{\scriptscriptstyle\mathsf{T}}x) , x' \sigma'( w_{j,0}^{\scriptscriptstyle\mathsf{T}}x') \right \rangle \\ &= x^{\scriptscriptstyle\mathsf{T}}x' \left[{ \frac 1 m \sum_{j=1}^m \sigma'( w_{j,0}^{\scriptscriptstyle\mathsf{T}}x) \sigma'( w_{j,0}^{\scriptscriptstyle\mathsf{T}}x') }\right]. \end{aligned} This gives one justification of the 1/\sqrt{m} factor: now this kernel is an average and not a sum, and we should expect it to have a limit as m\to \infty. To this end, and noting that the rows (w_{0,j}^{\scriptscriptstyle\mathsf{T}})_{j=1}^m are iid, then each term of the summation is iid, so by the SLLN, almost surely k_m(x,x') \xrightarrow{m\to\infty} k(x,x') := x^{\scriptscriptstyle\mathsf{T}}x' \mathop{\mathbb{E}}_w \left[{ \sigma'(w^{\scriptscriptstyle\mathsf{T}}x) \sigma'(w^{\scriptscriptstyle\mathsf{T}}x') }\right]. In homework we will (a) provide a more explicit form as dot product kernel, and (b) bound the difference exactly. add explicit ref.

For now, let us calculate the closed form for the ReLU; let’s do this geometrically. need to include picture proof

Together, still using \|x\| = 1 = \|x'\|, \begin{aligned} k(x,x') &= x^{\scriptscriptstyle\mathsf{T}}x' \mathop{\mathbb{E}}_w \mathbf{1}[ w^{\scriptscriptstyle\mathsf{T}}x \geq 0 ]\cdot \mathbf{1}[w^{\scriptscriptstyle\mathsf{T}}x' \geq 0] = x^{\scriptscriptstyle\mathsf{T}}x' \left({ \frac {\pi - \arccos(x^{\scriptscriptstyle\mathsf{T}}x')}{2\pi} }\right). \end{aligned}
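A quick numerical sketch of this calculation (not from the notes): the width-m empirical kernel k_m is compared against the closed form for a pair of unit-norm inputs.

```python
import numpy as np

# Sketch (not from the notes): compare the width-m empirical NTK k_m(x, x') with the
# closed form x^T x' * (pi - arccos(x^T x')) / (2*pi) for unit-norm inputs.
rng = np.random.default_rng(0)
m, d = 200000, 5
x = rng.normal(size=d); x /= np.linalg.norm(x)
xp = rng.normal(size=d); xp /= np.linalg.norm(xp)
W0 = rng.normal(size=(m, d))

k_m = (x @ xp) * np.mean((W0 @ x >= 0) * (W0 @ xp >= 0))
k_inf = (x @ xp) * (np.pi - np.arccos(np.clip(x @ xp, -1, 1))) / (2 * np.pi)
print(f"empirical {k_m:.5f} vs closed form {k_inf:.5f}")
```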

Remark 4.13 (multi-layer kernel) .
Let’s revisit the multi-layer case, and develop the multi-layer kernel. Suppose the width of every layer except the final one is m, specifically W_1\in \mathbb{R}^{m\times d}, and W_L \in \mathbb{R}^{1\times m}, and otherwise W_i \in \mathbb{R}^{m\times m}. Then the kernel also decomposes over layers, giving \begin{aligned} \tilde k_{m}(x,x') &:= \left\langle \nabla f(x;\vec W_0), \nabla f(x';\vec W_0) \right \rangle \\ &= \sum_{i=1}^L \left\langle \nabla_{W_i} f(x;\vec W_0), \nabla_{W_i}f(x';\vec W_0) \right \rangle. \end{aligned} It is not clear how powerful this representation is, nor whether it is fundamentally more powerful than the single-layer version. On the one hand, it decomposes over layers and is thus a sum (and not a composition) of kernels; on the other hand, each layer does work with the forward mapping of previous layers. There is some work on this topic, though it is far from closing the question (Bietti and Bach 2020); meanwhile, the linearization inequalities in section 4.2 seemingly degrade with depth, so the tradeoffs could be intricate, and this also casts serious doubt on how relevant the early phase near initialization is in practice.
Remark 4.14 (kernel of Taylor expansion at 0) .
Let’s also revisit the Taylor expansion at 0, but now with kernels. Before, we noted that the feature expansion is linear, rather than non-linear, in the data: \nabla f(x;0) = \begin{bmatrix} \gets& a_1 \sigma'(0) x^{\scriptscriptstyle\mathsf{T}}/\sqrt{m} &\to\\ &\vdots&\\ \gets& a_m \sigma'(0) x^{\scriptscriptstyle\mathsf{T}}/\sqrt{m} &\to \end{bmatrix} = \frac {\sigma'(0)}{\sqrt m} \begin{bmatrix} \gets& a_1 x^{\scriptscriptstyle\mathsf{T}}&\to\\ &\vdots&\\ \gets& a_m x^{\scriptscriptstyle\mathsf{T}}&\to \end{bmatrix}; as mentioned before, this is in contrast to the Taylor expansion at initialization, which is nonlinear in the data. Moreover, the corresponding kernel is a rescaling of the linear kernel: \begin{aligned} \left\langle \nabla f(x;0), \nabla f(x';0) \right \rangle &= \frac {\sigma'(0)^2}{m} \left\langle \begin{bmatrix} \gets& a_1 x^{\scriptscriptstyle\mathsf{T}}&\to\\ &\vdots&\\ \gets& a_m x^{\scriptscriptstyle\mathsf{T}}&\to \end{bmatrix} , \begin{bmatrix} \gets& a_1 (x')^{\scriptscriptstyle\mathsf{T}}&\to\\ &\vdots&\\ \gets& a_m (x')^{\scriptscriptstyle\mathsf{T}}&\to \end{bmatrix} \right \rangle \\ &= \frac {\sigma'(0)^2}{m} \sum_{j=1}^m a_j^2 x^{\scriptscriptstyle\mathsf{T}}x' = \sigma'(0)^2 x^{\scriptscriptstyle\mathsf{T}}x'. \end{aligned}

Now let’s return to the task of assessing how many functions we can represent near initialization. For this part, we will fix one degree of freedom in the data to effectively include a bias term; this is not necessary, but gives a shorter proof by reducing to standard kernel approximation theorems. We will show that this class is a universal approximator. Moreover, \|W-W_0\| will correspond to the RKHS norm; thus by making the width large, we can approximate elements of this large RKHS arbitrarily finely.

Proceeding in detail, first let’s define our domain \mathcal{X}:= \left\{{ x \in \mathbb{R}^d : \|x\| = 1, x_d = 1/\sqrt{2} }\right\}, and our predictors \mathcal{H}:= \left\{{ x\mapsto \sum_{j=1}^m \alpha_j k(x,x_j) \ : \ m\geq 0, \alpha_j \in \mathbb{R}, x_j \in \mathcal{X}}\right\}. This might look fancy, but is the same as the functions we get by starting with x \mapsto \left\langle \nabla f(x;W_0), W-W_0 \right \rangle and allowing the width to go to infinity and \|W-W_0\| to be arbitrarily large; by the results in section 4.2, we can always choose an arbitrarily large width so that f - f_0\approx 0 even when \|W-W_0\| is large, and we will also show that large width approximates infinite width in the homework. As such, it suffices to show that \mathcal{H} is a universal approximator over \mathcal{X}.

Theorem 4.1.
\mathcal{H} is a universal approximator over \mathcal{X}; that is to say, for every continuous g:\mathbb{R}^d\to \mathbb{R} and every \epsilon > 0, there exists h\in \mathcal{H} with \sup_{x\in\mathcal{X}} |g(x) - h(x)| \leq \epsilon.
Remark 4.15.
The use of a bias is only to conveniently reduce to an existing result about kernels which are universal approximators. This result is stated over full-dimensional sets, and for this case the bias seems necessary. However, if we restrict to \|x\|=1, the bias should not be necessary, though none of these automatic kernel theorems seem to apply (Notes to chapter 4, Steinwart and Christmann 2008).

Proof. Consider the set U := \{ u\in\mathbb{R}^{d-1} : \|u\|^2\leq 1/2\}, and the kernel function k(u,u') := f(u^{\scriptscriptstyle\mathsf{T}}u'), \qquad f(z) := \frac {(z+1/2)}{2} - \frac {(z+1/2)\arccos(z+1/2)}{2\pi}. We will show that this kernel is a universal approximator over U, which means it is also a universal approximator on its boundary \{u\in\mathbb{R}^{d-1} : \|u\|^2 = 1/2\}, and thus the kernel (x,x')\mapsto \frac {x^{\scriptscriptstyle\mathsf{T}}x'\left({\pi - \arccos(x^{\scriptscriptstyle\mathsf{T}}x')}\right)}{2\pi} is a universal approximator over \mathcal{X}.

Going back to the original claim, first note that \arccos has the Maclaurin series \arccos(z) = \frac \pi 2 - \sum_{k\geq 0} \frac {(2k)!}{2^{2k}(k!)^2}\left({\frac{z^{2k+1}}{2k+1}}\right), which is convergent for z\in[-1,+1]. From here, it can be checked that f has a Maclaurin series where every term is not only nonzero, but positive (adding the bias ensured this). This suffices to ensure that k is a universal approximator (Corollary 4.57, Steinwart and Christmann 2008).

We have not quite closed the loop, as we have not combined the pieces to show that for any continuous function g, we can select a large width m and W so that g \approx f_0(\cdot;W) \approx f(\cdot;W), but we’ve done most of the work, and a few remaining steps will be in homework. For a direct argument about this using a different approach based on (Barron 1993), see (Ji, Telgarsky, and Xian 2020).

5 Benefits of depth

So far we have given no compelling case for depth; in particular we have not justified the high depths used in practice.

In this section, we will give constructions of interesting functions by deep networks which cannot be approximated by polynomially-sized shallow networks. These are only constructions, and it is unlikely these network structures are found by gradient descent and other practical methods, so the general question of justifying the high depth and particular architectures used in practice is still open.

This section has four subsections.

  1. First we will construct a simple piecewise-affine function, \Delta:\mathbb{R}\to\mathbb{R}, which will be our building block for more complex behavior. When \Delta is composed with itself, it builds complexity exponentially fast under a variety of natural measures (e.g., it contains exponentially many copies of itself).

  2. Then we will show that \Delta^{L^2} can be easily written as a deep but constant width network, whereas a shallow network needs exponential width even for approximation within a constant.

  3. Then we will use \Delta^L to approximate x^2; this is meaningful because it leads to many other approximations, and may seem more natural than \Delta^L.

  4. Lastly we will use x^2 to approximate polynomials and Taylor expansions (Sobolev spaces).

5.1 The humble \Delta mapping.

Consider the \Delta function: \Delta(x) = 2\sigma_{\textrm{r}}(x) - 4\sigma_{\textrm{r}}(x-1/2) + 2\sigma_{\textrm{r}}(x-1) = \begin{cases} 2x&x \in [0,1/2),\\ 2-2x& x \in [1/2,1),\\ 0 & \text{otherwise}. \end{cases} How does \Delta look? And how about \Delta^2 := \Delta\circ \Delta? And \Delta^3? Picture drawn in class; figures forthcoming.

The pattern is that \Delta^L has 2^{L-1} copies of itself, uniformly shrunk down. In a sense, complexity has increased exponentially as a function of the number of nodes and layers (both \mathcal{O}(L)). Later, it will matter that we not only have many copies, but that they are identical (giving uniform spacing). For now, here’s one way to characterize this behavior.

Let \left\langle x \right\rangle = x - \lfloor x\rfloor denote fractional part.

Proposition 5.1.
Let \left\langle x \right\rangle := x-\lfloor x\rfloor denote the fractional part of x\in\mathbb{R}. Then \Delta^L(x) = \Delta(\left\langle 2^{L-1} x \right\rangle) = \Delta(2^{L-1} x - \lfloor 2^{L-1} x\rfloor ).
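The proposition is easy to check numerically; here is a small sketch (not from the notes) comparing the iterated composition against the fractional-part expression on a grid.

```python
import numpy as np

# Sketch (not from the notes): compare Delta^L(x) with Delta(frac(2^{L-1} x)) on [0, 1].
def relu(z):
    return np.maximum(z, 0.0)

def Delta(x):
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def Delta_iter(x, L):           # L-fold composition Delta^L
    for _ in range(L):
        x = Delta(x)
    return x

L = 6
xs = np.linspace(0.0, 1.0, 10001)
lhs = Delta_iter(xs, L)
rhs = Delta(np.mod(2 ** (L - 1) * xs, 1.0))
print("max discrepancy:", np.max(np.abs(lhs - rhs)))
```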
Remark 5.1 (applications of \Delta) .
Remark 5.2 (bibliography) .
I’m not sure what to cite for the study of the iterated composition \Delta^L and its interesting properties. The perspective here is the one from (Telgarsky 2015, 2016), but probably it exists somewhere earlier. E.g., \Delta^L is similar to iterated applications of the logistic map in dynamical systems, which was studied at least as far back as the 1940s.

Proof of Proposition 5.1. The proof proceeds by induction on the number of compositions i.

For the base case i=1, if x\in[0,1) then directly \Delta^1(x) = \Delta(x) = \Delta(\left\langle x \right\rangle) = \Delta(\left\langle 2^0 x \right\rangle), whereas x=1 means \Delta^1(x) = \Delta(0) = \Delta(\left\langle 2^0 x \right\rangle).

For the inductive step, consider \Delta^{i+1}. The proof can proceed by peeling individual \Delta from the left or from the right; the choice here is to peel from the right. Consider two cases.

Remark 5.3 (how many ReLU?) .
Generally we won’t care about inputs outside [0,1], and can use two ReLUs in place of the three in the definition. But we’re taking a linear combination, so the simplest way to write it is with two ReLUs in one layer, then a separate ReLU layer with the linear combination. For \Delta^L we can be careful and stack and compress further, but that approach is not followed here.

5.2 Separating shallow and deep networks

This section will establish the following separation between constant-width deep networks and subexponential width shallow networks.

Theorem 5.1 ((Telgarsky 2015, 2016)) .
For any L\geq 2, f = \Delta^{L^2 + 2} is a ReLU network with 3L^2+6 nodes and 2L^2+4 layers, but any ReLU network g with \leq 2^L nodes and \leq L layers cannot approximate it: \int_{[0,1]} \left|{ f(x) - g(x) }\right|{\text{d}}x \geq \frac 1 {32}.
Remark 5.4 (why L_1 metric?) .

Previously, we used L_2 and L_\infty to state good upper bounds on approximation; for bad approximation, we want to argue there is a large region where we fail, not just a few points, and that’s why we use an L_1 norm.

To be able to argue that such a large region exists, we don’t just need the hard function f = \Delta^{L^2+2} to have many regions, we need them to be regularly spaced, and not bunch up. In particular, if we replaced \Delta with the similar function 4x(1-x), then this proof would need to replace \frac 1 {32} with something decreasing with L.

Proof plan for Theorem 5.1 ((Telgarsky 2015, 2016)):

  1. (Shallow networks have low complexity.) First we will upper bound the number of oscillations in ReLU networks. The key part of the story is that oscillations will grow polynomially in width, but exponentially in depth. give explicit lemma ref

  2. (There exists a regular, high complexity deep network.) Then we will show there exists a function, realized by a slightly deeper network, which has many oscillations, which are moreover regularly spaced. The need for regular spacing will be clear at the end of the proof. We have already handled this part of the proof: the hard function is \Delta^{L^2+2}.

  3. Lastly, we will use a region-counting argument to combine the preceding two facts to prove the theorem. This step would be easy for the L_\infty norm, and takes a bit more effort for the L_1 norm.

Remark 5.5 (bibliographic notes) .
Theorem 5.1 ((Telgarsky 2015, 2016)) was the earliest proof showing that a deep network cannot be approximated by a reasonably-sized shallow network; however, prior work showed a separation for exact representation of deep sum-product networks as compared with shallow ones (Bengio and Delalleau 2011). A sum-product network has nodes which compute affine transformations or multiplications, and thus a multi-layer sum-product network is a polynomial, and this result, while interesting, does not imply a ReLU separation.
As above, step 1 of the proof upper bounds the total possible number of affine pieces in a univariate network of some depth and width, and step 2 constructs a deep function which roughly meets this bound. Step 1 can be generalized to the multivariate case, with reasoning similar to the VC-dimension bounds in section 17. A version of step 2 appeared in prior work but for the multivariate case, specifically giving a multivariate-input network with exponentially many affine pieces, using a similar construction (Montúfar et al. 2014). A version of step 2 also appeared previously as a step in a proof that recurrent networks are Turing complete, specifically a step used to perform digit extraction (Siegelmann and Sontag 1994, Figure 3).

Proceeding with the proof, first we want to argue that shallow networks have low complexity. Our notion of complexity is simply the number of affine pieces.

Definition 5.1.
For any univariate function f:\mathbb{R}\to\mathbb{R}, let N_A(f) denote the number of affine pieces of f: the minimum cardinality (or \infty) of a partition of \mathbb{R} so that f is affine when restricted to each piece.
Lemma 5.1.
Let f:\mathbb{R}\to\mathbb{R} be a ReLU network with L layers of widths (m_1,\ldots,m_L) with m = \sum_i m_i. Then N_A(f) \leq 2^L \prod_{j\leq L} m_j, and consequently N_A(f) \leq \left({\frac {2m}{L}}\right)^L.
Remark 5.6.
Working with the ReLU really simplifies this reasoning!

Our proof will proceed by induction, using the following combination rules for piecewise affine functions.

Lemma 5.2.
Let functions f,g,(g_1,\ldots,g_k), and scalars (a_1,\ldots,a_k,b) be given.
  1. N_A(f+g) \leq N_A(f) + N_A(g).

  2. N_A(\sum_i a_i g_i + b) \leq \sum_i N_A(g_i).

  3. N_A(f \circ g) \leq N_A(f) \cdot N_A(g).

  4. N_A\left({ x \mapsto f(\sum_i a_i g_i(x) + b)}\right) \leq N_A(f)\sum_i N_A(g_i).

Remark 5.7.
This immediately hints at a “power of composition”: we increase the “complexity” multiplicatively rather than additively!
Remark 5.8.
It is natural and important to wonder if this exponential increase is realized in practice. Preliminary work reveals that, at least near initialization, the effective number of pieces is much smaller (Hanin and Rolnick 2019).

Proof of Lemma 5.2.

  1. Draw f and g, with vertical bars at the right boundaries of affine pieces. There are \leq N_A(f) + N_A(g) -1 distinct bars, and f+g is affine between each adjacent pair of bars.

  2. N_A(a_i g_i) \leq N_A(g_i) (equality if a_i\neq 0), thus induction with the preceding gives N_A(\sum_i a_i g_i) \leq \sum_i N_A(g_i), and N_A doesn’t change with addition of constants.

  3. Let P_A(g) denote the pieces of g, and fix some U\in P_A(g); g is a fixed affine function along U. U is an interval, and consider the pieces of f_{|g(U)}; for each T\in P_A(f_{|g(U)}), f is affine, thus f\circ g is affine (along U \cap g_{|U}^{-1}(T)), and the total number of pieces is \sum_{U \in P_A(g)} N_A(f_{|g(U)}) \leq \sum_{U \in P_A(g)} N_A(f) \leq N_A(g) \cdot N_A(f).

  4. Combine the preceding two.

Remark 5.9.
The composition rule is hard to make tight: the image of each piece of g must hit all intervals of f! This is part of the motivation for the function \Delta, which essentially meets this bound with every composition.

Proof of Lemma 5.1.

To prove the second claim from the first, by Jensen’s inequality (concavity of \ln), \prod_{j\leq L} m_j = \exp\left({ \sum_{j\leq L} \ln m_j }\right) = \exp\left({ L \cdot \frac 1 L \sum_{j\leq L} \ln m_j }\right) \leq \exp\left({ L \ln\left({ \frac 1 L \sum_{j\leq L} m_j }\right) }\right) = \left({\frac m L}\right)^L, and thus N_A(f) \leq 2^L \prod_{j\leq L} m_j \leq \left({\frac {2m}{L}}\right)^L. For the first claim, proceed by induction on layers. Base case: layer 0 maps the data with the identity, thus N_A(g) = 1. For the inductive step, given a node g in layer i+1 which takes (g_1,\ldots,g_{m_i}) from the previous layer as input, \begin{aligned} N_A(g) &= N_A(\sigma( b + \sum_j a_j g_j)) \leq 2 \sum_{j=1}^{m_i} N_A(g_j) \\ & \leq 2 \sum_{j=1}^{m_i} 2^i \prod_{k < i} m_k = 2^{i+1} m_i \cdot \prod_{k < i} m_k. \end{aligned}

This completes part 1 of our proof plan, upper bounding the number of affine pieces polynomially in width and exponentially in depth.

The second part of the proof was to argue that \Delta^L gives a high complexity, regular function: we already provided this in Proposition 5.1, which showed that \Delta^L gives exactly 2^{L-1} copies of \Delta, each shrunken uniformly by a factor of 2^{L-1}.

The third part is a counting argument which ensures the preceding two imply the claimed separation in L_1 distance; details are as follows.

Proof of Theorem 5.1 ((Telgarsky 2015, 2016)).


The proof proceeds by “counting triangles.”

Remark 5.10 (other depth separations) .

5.3 Approximating x^2

Why x^2?

Remark 5.11 (bibliographic notes) .
The ability to efficiently approximate x\mapsto x^2, and consequences of this, was observed nearly in parallel by a few authors; in addition to (Yarotsky 2016) as mentioned above (whose approach is roughly followed here), in parallel was the work of (Safran and Shamir 2016), and slightly later the result was also discovered by (Rolnick and Tegmark 2017), all of these with differing perspectives and proofs.

Define S_i := \left({ \frac 0 {2^i}, \frac 1 {2^i}, \ldots, \frac {2^i}{2^i} }\right); let h_i be the linear interpolation of x^2 on S_i.




Thus:

Theorem 5.2 (roughly following (Yarotsky 2016)) .
  1. h_i is the piecewise-affine interpolation of x^2 along [0,1] with interpolation points S_i.
  2. h_i can be written as a ReLU network consisting of 2i layers and 3i nodes using “skip connections,” or a pure ReLU network with 2i layers and 4i nodes.
  3. \sup_{x\in[0,1]}|h_i(x) - x^2| \leq 4^{-i-1}.
  4. Any ReLU network f with \leq L layers and \leq N nodes satisfies \int_{[0,1]} (f(x) - x^2)^2 {\text{d}}x \geq \frac{1}{5760 (2N/L)^{4L}}.
Remark 5.12.

Proof.

  1. The interpolation property comes from construction/definition.

  2. Since h_i = x - \sum_{j=1}^i \frac{\Delta^{j}}{4^j} and since \Delta^j requires 3 nodes and 2 layers for each new power, a worst case construction would need 2i layers and 3 \sum_{j\leq i} j = \mathcal{O}(i^2) nodes, but we can reuse individual \Delta elements across the powers, and thus need only 3i, though the network has “skip connections” (in the ResNet sense); alternatively we can replace the skip connections with a single extra node per layer which accumulates the output, or rather after layer j outputs h_j, which suffices since h_{j+1} -h_j = \Delta^{j+1} / 4^{j+1}.

  3. Fix i, and set \tau := 2^{-i}, meaning \tau is the distance between interpolation points. The error between x^2 and h_i is thus bounded above by \begin{aligned} & \sup_{x\in [0,1-\tau]} \sup_{z \in [0,\tau]} \left({ \frac {\tau - z} \tau x^2 + \frac {z}{\tau}\left({x+\tau}\right)^2 - (x+z)^2 }\right) \\ &= \sup_{x\in [0,1-\tau]} \sup_{z \in [0,\tau]} \left({ z\tau - z^2 }\right) \\ &= \frac {\tau^2}{4} = 4^{-i-1}, \end{aligned} where the inner supremum is attained at z = \tau/2.

  4. By Lemma 5.1, N_A(f) \leq (2N/L)^L. Using a symbolic package to differentiate, for any interval [a,b], \min_{(c,d)}\int_{[a,b]} (x^2 - (cx+d))^2{\text{d}}x = \frac {(b-a)^5}{180}. Let S index the subintervals of length at least 1/(2N) with N:=N_A(f), and restrict attention to [0,1]. Then \sum_{[a,b]\in S} (b-a) = 1 - \sum_{[a,b]\not\in S} (b-a) \geq 1 - N/(2N) = 1/2. Consequently, \begin{aligned} \int_{[0,1]} (x^2 - f(x))^2 {\text{d}}x &= \sum_{[a,b]\in P_A(f)} \int_{[a,b]\cap[0,1]} (x^2 - f(x))^2 {\text{d}}x \\ &\geq \sum_{[a,b]\in S} \frac {(b - a)^5}{180} \\ &\geq \sum_{[a,b]\in S} \frac {(b - a)}{2880 N^4} \geq \frac {1}{5760 N^4}. \end{aligned}
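As a sanity check of items 1–3 (a sketch, not from the notes), the following builds h_i = x - \sum_{j\leq i}\Delta^j/4^j directly and evaluates the uniform error on a fine grid.

```python
import numpy as np

# Sketch (not from the notes): build h_i = x - sum_{j<=i} Delta^j(x)/4^j and check the
# uniform error bound sup_{[0,1]} |h_i(x) - x^2| <= 4^{-(i+1)}.
def Delta(x):
    return 2 * np.maximum(x, 0) - 4 * np.maximum(x - 0.5, 0) + 2 * np.maximum(x - 1, 0)

xs = np.linspace(0.0, 1.0, 20001)
for i in range(1, 7):
    h = xs.copy()
    comp = xs.copy()
    for j in range(1, i + 1):
        comp = Delta(comp)          # comp = Delta^j(xs)
        h = h - comp / 4 ** j
    err = np.max(np.abs(h - xs ** 2))
    print(f"i={i}: sup error {err:.3e} vs bound {4.0 ** (-(i + 1)):.3e}")
```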

From squaring we can get many other things (still with \mathcal{O}(\ln(1/\epsilon)) depth and size).

Theorem 5.3 (sketch, from (Yarotsky 2016; Schmidt-Hieber 2017)) .
Suppose f:\mathbb{R}^d\to\mathbb{R} has all coordinates of all partial derivatives of order up to r within [-1,+1] and let \epsilon>0 be given. Then there exists a network g with \widetilde{\mathcal{O}}(\ln(1/\epsilon)) layers and \widetilde{\mathcal{O}}(\epsilon^{-d/r}) width so that \sup_{x\in[0,1]^d} |f(x) - g(x)|\leq \epsilon. gross and vague, i should clean
Remark 5.13.
There are many papers following up on these; e.g., crawl the citation graph outwards from (Yarotsky 2016).

5.4 Sobolev balls

Here we will continue and give a version of Yarotsky’s main application of the approximation of x^2: approximating functions with many bounded derivatives (by approximating their Taylor expansions), formally an approximation result against a Sobolev ball in function space.

Remark 5.14 (bibliographic notes) .
This is an active area of work; in addition to the original work by (Yarotsky 2016), it’s also worth highlighting the re-proof by (Schmidt-Hieber 2017), which then gives an interesting regression consequence. There are many other works in many directions, for instance adjusting the function class to lessen the (still bad) dependence on dimension (Montanelli, Yang, and Du 2020). These approaches all work with polynomials, but it’s not clear this accurately reflects approximation power of ReLU networks (Telgarsky 2017).
Theorem 5.4.
Suppose g:\mathbb{R}^d\to\mathbb{R} satisfies g(x) \in [0,1] and all partial derivatives of all orders up to r are at most M. Then there exists a ReLU network f with \mathcal{O}(k(r+d)) layers and \mathcal{O}((kd+d^2+r^2d^r+krd^r)s^d) nodes such that \left|{f(x) - g(x)}\right| \leq Mrd^r\left({s^{-r} +4d 2^d\cdot 4^{-k} }\right) + 3d 2^d \cdot 4^{-k} \qquad \forall x\in[0,1]^d. This isn’t quite right; yarotsky claims a width c(d,r) /\epsilon^{d/r} \ln(1/\epsilon) suffices for error \epsilon; need to check what I missed.
Remark 5.15 (not quite right) .
Matus note from Matus to Matus: Yarotsky gets width c(d,r) \ln(1/\epsilon) / \epsilon^{d/r} and mine is worse, need to track down the discrepancy.

The proof consists of the following pieces:

  1. Functions in Sobolev space are locally well-approximated by their Taylor expansions; therefore we will expand the approximation of x^2 to give approximation of general monomials in Lemma 5.4.

  2. These Taylor approximations really only work locally. Therefore we need a nice way to switch between different Taylor expansions in different parts of [0,1]^d. This leads to the construction of a partition of unity, and is one of the other very interesting ideas in (Yarotsky 2016) (in addition to the construction of x^2); this is done below in Lemma 5.5.

First we use squaring to obtain multiplication.

Lemma 5.3.
For any integers k,l, there exists a ReLU network \text{prod}_{k,{l}}: \mathbb{R}^l \to \mathbb{R} which requires \mathcal{O}(kl) layers and \mathcal{O}(kl+l^2) nodes such that for any x\in[0,1]^l, \left|{ \text{prod}_{k,{l}}(x) - \prod_{j=1}^l x_j }\right| \leq l \cdot 4^{-k}, and \text{prod}_{k,{l}}(x)\in [0,1], and \text{prod}_{k,{l}}(x) = 0 if any x_j is 0.

Proof. The proof first handles the case l=2 directly, and uses l-1 copies of \text{prod}_{k,{2}} for the general case.

As such, for (a,b)\in\mathbb{R}^2, define \text{prod}_{k,{2}}(a,b) := \frac 1 2 \left({ 4 h_k((a+b)/2) - h_k(a) - h_k(b) }\right). The size of this network follows from the size of h_k given in Theorem 5.2 (roughly following (Yarotsky 2016)), and \text{prod}_{k,{2}}(a,b) = 0 when either argument is 0 since h_k(0) = 0. For the approximation guarantee, since every argument to each h_k is within [0,1], then Theorem 5.2 (roughly following (Yarotsky 2016)) holds, and using the polarization identity to rewrite a\cdot b gives \begin{aligned} 2 |\text{prod}_{k,{2}}(a,b) - ab| &= 2 |\text{prod}_{k,{2}}(a,b) - \frac 1 2 ((a+b)^2 - a^2 - b^2)| \\ &\leq 4 | h_k((a+b)/2) - ((a+b)/2)^2 | + |h_k(a) - a^2| + |h_k(b) - b^2| \\ &\leq 4\cdot 4^{-k-1} + 4^{-k-1} + 4^{-k-1} \leq 2\cdot{}4^{-k}. \end{aligned}

Now consider the case \text{prod}_{k,{i}} for i > 2: this network is defined via \text{prod}_{k,{i}}(x_1,\ldots,x_i) := \text{prod}_{k,{2}}( \text{prod}_{k,{i-1}}(x_1,\ldots,x_{i-1}), x_i). It is now shown by induction that this network has \mathcal{O}(ki + i^2) nodes and \mathcal{O}(ki) layers, that it evaluates to 0 when any argument is zero, and lastly satisfies the error guarantee \left|{\text{prod}_{k,{i}}(x_{1:i}) - \prod_{j=1}^i x_j }\right| \leq i 4^{-k}. The base case i=2 uses the explicit \text{prod}_{k,{2}} network and guarantees above, thus consider i>2. The network embeds \text{prod}_{k,{i-1}} and another copy of \text{prod}_{k,{2}} as subnetworks, but additionally must pass the input x_i forward, thus requires \mathcal{O}(ki) layers and \mathcal{O}(ki + i^2) nodes, and evaluates to 0 if any argument is 0 by the guarantees on \text{prod}_{k,{2}} and the inductive hypothesis. For the error estimate, \begin{aligned} \left|{\text{prod}_{k,{i}}(x_1,\ldots,x_i) - \prod_{j=1}^i x_j}\right| &\leq \left|{ \text{prod}_{k,{2}}( \text{prod}_{k,{i-1}}(x_1,\ldots,x_{i-1}), x_i) - x_i \text{prod}_{k,{i-1}}(x_1,\ldots,x_{i-1})}\right| \\ &+\left|{ x_i \text{prod}_{k,{i-1}}(x_1,\ldots,x_{i-1}) - x_i \prod_{j=1}^{i-1} x_j}\right| \\ &\leq 4^{-k} + |x_i| \cdot \left|{\text{prod}_{k,{i-1}}(x_1,\ldots,x_{i-1}) - \prod_{j=1}^{i-1} x_j}\right| \\ &\leq 4^{-k} + |x_i| \cdot \left({(i-1) 4^{-k}}\right) \leq i 4^{-k}. \end{aligned}
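Here is a small numerical sketch of this gadget (not from the notes; the helper names are ad hoc, and no clipping is performed): \text{prod}_{k,2} is assembled from the interpolants h_k of the previous subsection and chained for more than two inputs.

```python
import numpy as np

# Sketch (not from the notes): the approximate multiplication gadget
# prod_{k,2}(a, b) = (4 h_k((a+b)/2) - h_k(a) - h_k(b)) / 2, chained for l > 2 inputs in [0, 1].
def Delta(x):
    return 2 * np.maximum(x, 0) - 4 * np.maximum(x - 0.5, 0) + 2 * np.maximum(x - 1, 0)

def h(x, k):                        # piecewise-affine approximation of x^2 on [0, 1]
    out, comp = np.asarray(x, dtype=float).copy(), np.asarray(x, dtype=float).copy()
    for j in range(1, k + 1):
        comp = Delta(comp)
        out = out - comp / 4 ** j
    return out

def prod2(a, b, k):
    return 0.5 * (4 * h((a + b) / 2, k) - h(a, k) - h(b, k))

def prod(xs, k):                    # prod_{k,l} for a vector of l inputs
    out = xs[0]
    for xi in xs[1:]:
        out = prod2(out, xi, k)
    return out

rng = np.random.default_rng(0)
k, ell = 6, 5
xs = rng.uniform(0, 1, size=ell)
err = float(abs(prod(xs, k) - np.prod(xs)))
print(f"error {err:.3e} vs bound l*4^-k = {ell * 4.0 ** (-k):.3e}")
```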

From multiplication we get monomials.

Lemma 5.4.
Let degree r and input dimension d be given, and let N denote the number of monomials of degree at most r. Then there exists a ReLU network \text{mono}_{k,r}: \mathbb{R}^d \to \mathbb{R}^N with \mathcal{O}(kr) layers and \mathcal{O}(d^r (kr + r^2)) nodes so that for any vector of exponents \vec \alpha corresponding to a monomial of degree at most r, meaning \vec \alpha \geq 0, \sum_i \alpha_i \leq r, and x^{\vec \alpha} := \prod_{i=1}^d x_i^{\alpha_i}, then the output coordinate of \text{mono}_{k,r} corresponding to \vec \alpha, written \text{mono}_{k,r}(x)_{\vec \alpha} for convenience, satisfies \left|{ \text{mono}_{k,r}(x)_{\vec\alpha} - x^{\vec\alpha} }\right| \leq r4^{-k} \qquad \forall x\in[0,1]^d.

Proof. \text{mono}_{k,r} consists of N parallel networks, one for each monomial. As such, given any \vec{\alpha} of degree q\leq r, to define coordinate \vec\alpha of \text{mono}_{k,r}, first rewrite \alpha as a vector v \in \{1,\ldots,d\}^q, whereby x^{\vec\alpha} := \prod_{i=1}^q x_{v_i}. Define \text{mono}_{k,r}(x)_{\vec\alpha} := \text{prod}_{k,{q}}(x_{v_1},\ldots,x_{v_q}), whereby the error estimate follows from Lemma 5.3, and the size estimate follows by multiplying the size estimate from Lemma 5.3 by N, and noting N \leq d^r.

Next we construct the approximate partition of unity.

Lemma 5.5.
For any s \geq 1, let \text{part}_{k,s}: \mathbb{R}^d \to \mathbb{R}^{(s+1)^d} denote an approximate partition of unity implemented by a ReLU network, detailed as follows.
  1. For any vector v \in S := \{0,1/s,\ldots,s/s\}^d, there is a corresponding coordinate \text{part}_{k,s}(\cdot)_v, and this coordinate is only supported locally around v, meaning concretely that \text{part}_{k,s}(x)_v is zero for x \not \in \prod_{j=1}^d [ v_j - 1/s, v_j + 1/s].
  2. For any x\in[0,1]^d, | \sum_{v\in S} \text{part}_{k,s}(x)_v - 1| \leq d 2^d 4^{-k}.
  3. \text{part}_{k,s} can be implemented by a ReLU network with \mathcal{O}(kd) layers and \mathcal{O}((kd+d^2)s^d) nodes.

Proof. Set N := (s+1)^d, and let S be any enumeration of the vectors in the grid \{0,1/s,\ldots,s/s\}^d. Define first a univariate bump function h(a) := \sigma(sa + 1) - 2\sigma(sa) + \sigma(sa - 1) = \begin{cases} 1 + sa & a \in [-1/s,0), \\ 1- sa & a \in [0, 1/s], \\ 0 &\text{o.w.} \end{cases} For any v \in S, define f_v(x) := \text{prod}_{k,{d}}(h(x_1-v_1), \dots, h(x_d - v_d)). By Lemma 5.3, \sup_{x\in[0,1]^d} |f_v(x) - \prod_{j=1}^d h(x_j - v_j)| \leq d 4^{-k}. Each coordinate of the output of \text{part}_{k,s} corresponds to some v\in S; in particular, define \text{part}_{k,s}(x)_v := f_v(x). As such, by the definition of f_v, and Lemma 5.3, and since |S| \leq (s+1)^d, then \text{part}_{k,s} can be written with \mathcal{O}(kd) layers and \mathcal{O}((kd + d^2)s^d) nodes. The local support claim for \text{part}_{k,s}(\cdot)_v follows by construction. For the claim of approximate partition of unity, let U\subseteq S denote those v for which every bump h(x_j - v_j) is nonzero; then |U|\leq 2^d, and both \text{part}_{k,s}(x)_v and \prod_{j=1}^d h(x_j - v_j) vanish for v\not\in U (the former by the last property in Lemma 5.3), whereby \begin{aligned} \left|{ \sum_{v\in S} \text{part}_{k,s}(x)_v - 1 }\right| &= \left|{ \sum_{v\in U} \left({ \text{part}_{k,s}(x)_v - \prod_{j=1}^d h(x_j - v_j) }\right) + \sum_{v\in U} \prod_{j=1}^d h(x_j - v_j) - 1 }\right| \\ &\leq \sum_{v\in U} \left|{ \text{part}_{k,s}(x)_v - \prod_{j=1}^d h(x_j - v_j) }\right| + \left|{ \sum_{v\in U} \prod_{j=1}^d h(x_j - v_j) - 1 }\right| \\ &\leq 2^d d 4^{-k} + \left|{ \sum_{v\in U} \prod_{j=1}^d h(x_j - v_j) - 1 }\right|. \end{aligned} It turns out the last term is 0, which completes the proof: letting u denote the lexicographically smallest element in U (i.e., the “bottom left corner”), \begin{aligned} \left|{ \sum_{v\in U} \prod_{j=1}^d h(x_j - v_j) - 1 }\right| &= \left|{ \sum_{w \in \{0,1/s\}^d} \prod_{j=1}^d h((x - u - w)_j) - 1 }\right| \\ &= \left|{ \prod_{j=1}^d \sum_{w_j \in \{0,1/s\}} h((x - u)_j - w_j) - 1 }\right| \\ &= \left|{ \prod_{j=1}^d \left({ h(x_j -u_j) + h(x_j - u_j - 1/s) }\right) - 1 }\right|, \end{aligned} which is 0 because z := x - u \in [0,1/s]^d by construction, and using the case analysis of h gives h(z_j) + h(z_j - 1/s) = (1 - s z_j) + (1 + s(z_j - 1/s)) = 1 as desired.
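A small sketch (not from the notes) of the exact partition of unity, i.e., before passing the bumps through \text{prod}_{k,d}: it evaluates \prod_j h(x_j - v_j) at every grid point v, and checks that at most 2^d of these are active and that they sum to 1.

```python
import numpy as np

# Sketch (not from the notes): the univariate bump h and the exact (pre-prod_{k,d})
# partition of unity over the grid {0, 1/s, ..., 1}^d.
def bump(a, s):
    return np.maximum(s * a + 1, 0) - 2 * np.maximum(s * a, 0) + np.maximum(s * a - 1, 0)

rng = np.random.default_rng(0)
d, s = 3, 4
grid = np.array(np.meshgrid(*[np.linspace(0, 1, s + 1)] * d)).reshape(d, -1).T  # all (s+1)^d points v
x = rng.uniform(0, 1, size=d)
vals = np.prod(bump(x[None, :] - grid, s), axis=1)      # one product of bumps per grid point v
print("active grid points:", int(np.sum(vals > 0)), "(at most 2^d =", 2 ** d, ")")
print("sum over v:", float(np.sum(vals)), "(should be 1 up to float error)")
```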

Finally we are in shape to prove Theorem 5.4.

Proof of Theorem 5.4. The ReLU network for f will combine \text{part}_{k,s} from Lemma 5.5 with \text{mono}_{k,r} from Lemma 5.4 via approximate multiplication, meaning \text{prod}_{k,{2}} from Lemma 5.3.

In detail, let the grid S:=\{0,1/s,\ldots,s/s\}^d be given as in the statement of Lemma 5.5. For each v\in S, let p_v:\mathbb{R}^d \to \mathbb{R} denote the Taylor expansion of degree r at v; by a standard form of the Taylor error, for any x\in[0,1]^d with \|x-v\|_{\infty} \leq 1/s, |p_v(x) - g(x)| \leq \frac {Md^r}{r!} \|v-x\|_\infty^r \leq \frac {Md^r}{r!s^r}. Next, let w_v denote the Taylor coefficients forming p_v, and define f_v :\mathbb{R}^d\to\mathbb{R} as x\mapsto w_v^{\scriptscriptstyle\mathsf{T}}\text{mono}_{k,r}(x-v), meaning approximate p_v by taking the linear combination with weights w_v of the approximate monomials in x\mapsto \text{mono}_{k,r}(x-v). By Lemma 5.4, since there are at most d^r terms, the error is at most |f_v(x) - p_v(x)| = |\sum_{\vec\alpha} (w_v)_{\vec\alpha} (\text{mono}_{k,r}(x-v)_{\vec\alpha} - (x-v)^{\vec\alpha})| \leq \sum_{\vec\alpha} |(w_v)_{\vec\alpha}| r4^{-k} \leq M r d^r 4^{-k}. just realized a small issue that negative inputs might occur; can do some shifts or reflections or whatever to fix.

The final network is now obtained by using \text{prod}_{k,{2}} to approximately multiply each approximate Taylor expansion f_v by the corresponding locally-supported approximate partition of unity element \text{part}_{k,s}(x)_v; in particular, define f(x) := \sum_{v\in S} \text{prod}_{k,{2}}(f_v(x), \text{part}_{k,s}(x)_v). Then, using the above properties and the fact that the partition of unity is locally supported, letting U\subseteq S denote the set of at most 2^d active elements, \begin{aligned} \left|{f(x) - g(x)}\right| &\leq \left|{\sum_{v\in S} \text{prod}_{k,{2}}(f_v(x),\text{part}_{k,s}(x)_v) - \sum_{v\in S} f_v(x)\text{part}_{k,s}(x)_v}\right| \\ &+ \left|{\sum_{v\in S} f_v(x)\text{part}_{k,s}(x)_v - \sum_{v\in S} p_v(x)\text{part}_{k,s}(x)_v}\right| \\ &+ \left|{\sum_{v\in S} p_v(x)\text{part}_{k,s}(x)_v - \sum_{v\in S} g(x)\text{part}_{k,s}(x)_v}\right| \\ &+ \left|{\sum_{v\in S} g(x)\text{part}_{k,s}(x)_v - g(x) }\right| \\ &\leq 2 |U| 4^{-k} + Mrd^r 4^{-k} (1 + d 2^d 4^{-k}) + \frac {Md^r}{r! s^r}(1 + d 2^d4^{-k}) + |g(x)| d 2^d4^{-k} \\ &\leq Mrd^r\left({s^{-r} +4d 2^d\cdot 4^{-k} }\right) + 3d 2^d \cdot 4^{-k}. \end{aligned} The input to \text{prod}_{k,{2}} can exceed 1. For a maximally lazy fix, I should just clip its input.

6 Optimization: preface

Classically, the purpose of optimization is to approximately minimize (or maximize) an objective function f over a domain S: \min_{w\in S} f(w).

A core tension in the use of optimization in machine learning is that we would like to minimize the population risk \mathcal{R}(w) := \mathop{\mathbb{E}}\ell(Yf(X;w)); however, we only have access to the empirical risk \widehat{\mathcal{R}}(w) := n^{-1} \sum_i \ell(y_if(x_i;w)).

As a result, when choosing a w_t, we not only care that \widehat{\mathcal{R}}(w_t) is small, but also about other good properties which may indicate \mathcal{R}(w_t) is small as well. Foremost amongst these is that w_t has low norm, but there are other possibilities.

Outline.

Remark 6.1.

…maybe I should always use \widehat{\mathcal{R}} or F for objectives

6.1 Omitted topics

7 Semi-classical convex optimization

First we will revisit classical convex optimization ideas. Our presentation differs from the normal one in one key way: we state nearly all results without any assumption of a minimizer, but instead use an arbitrary reference point z\in\mathbb{R}^p. We will invoke these bounds later in settings where the minimum may not exist, but the problem structure suggests good choices for z (see e.g., Lemma 10.1).

if i include ReLU ntk I can also use it there.

7.1 Smooth objectives in ML

We say “\widehat{\mathcal{R}} is \beta-smooth” to mean \beta-Lipschitz gradients: \|\nabla\widehat{\mathcal{R}}(w) - \nabla\widehat{\mathcal{R}}(v)\| \leq \beta \|w-v\|. (The math community says “smooth” for C^\infty.) We primarily invoke smoothness via the key inequality \widehat{\mathcal{R}}(v) \leq \widehat{\mathcal{R}}(w) + \left\langle \nabla\widehat{\mathcal{R}}(w), v-w \right \rangle + \frac \beta 2 \|v-w\|^2. In words: \widehat{\mathcal{R}} can be upper bounded with the convex quadratic v \mapsto \frac \beta 2 \|v-w\|^2 + \left\langle \nabla\widehat{\mathcal{R}}(w), v-w \right \rangle + \widehat{\mathcal{R}}(w), which shares tangent and function value with \widehat{\mathcal{R}} at w. (The first definition also implies that \widehat{\mathcal{R}} is lower bounded by concave quadratics.)

Remark 7.1.
Smoothness is trivially false for standard deep networks: the ReLU is not even differentiable. However, many interesting properties carry over, and many lines of research proceed by trying to make these properties carry over, so at the very least, it’s good to understand.

A key consequence: we can guarantee gradient descent does not increase the objective. Consider the gradient iteration w' = w - \frac 1 \beta \nabla\widehat{\mathcal{R}}(w); then smoothness implies \widehat{\mathcal{R}}(w') \leq \widehat{\mathcal{R}}(w) - \left\langle \nabla\widehat{\mathcal{R}}(w), \nabla\widehat{\mathcal{R}}(w)/\beta \right \rangle + \frac {1}{2\beta}\|\nabla\widehat{\mathcal{R}}(w)\|^2 = \widehat{\mathcal{R}}(w) - \frac 1 {2\beta} \|\nabla\widehat{\mathcal{R}}(w)\|^2, and \|\nabla\widehat{\mathcal{R}}(w)\|^2 \leq 2\beta (\widehat{\mathcal{R}}(w) - \widehat{\mathcal{R}}(w')). With deep networks, we’ll produce similar bounds but in other ways.

As an exercise, let’s prove the earlier smoothness consequence. Considering the curve t\mapsto \widehat{\mathcal{R}}(w + t(v-w)) along [0,1], \begin{aligned} &\left|{ \widehat{\mathcal{R}}(v) - \widehat{\mathcal{R}}(w) - \left\langle \nabla\widehat{\mathcal{R}}(w), v-w \right \rangle }\right| \\ &= \left|{ \int_0^1 \left\langle \nabla\widehat{\mathcal{R}}(w+t(v-w)), v-w \right \rangle{\text{d}}t - \left\langle \nabla\widehat{\mathcal{R}}(w), v-w \right \rangle }\right| \\ &\leq \int_0^1\left|{ \left\langle \nabla\widehat{\mathcal{R}}(w+t(v-w))-\nabla\widehat{\mathcal{R}}(w), v-w \right \rangle }\right| {\text{d}}t \\ &\leq \int_0^1\|\nabla\widehat{\mathcal{R}}(w+t(v-w))-\nabla\widehat{\mathcal{R}}(w)\|\cdot\|v-w\| {\text{d}}t \\ &\leq \int_0^1 t\beta \|v-w\|^2 {\text{d}}t \\ &= \frac {\beta}{2} \|v-w\|^2. \end{aligned}

Example 7.1.
Define \widehat{\mathcal{R}}(w) := \frac 1 2 \|Xw-y\|^2, and note \nabla\widehat{\mathcal{R}}(w) = X^{\scriptscriptstyle\mathsf{T}}(Xw-y). For any w,w', \begin{aligned} \widehat{\mathcal{R}}(w') &= \frac 1 2 \|Xw' - Xw + Xw - y\|^2 \\ &= \frac 1 2 \|Xw' - Xw\|^2 + \left\langle Xw' - Xw, Xw - y \right \rangle + \frac 1 2 \|Xw - y\|^2 \\ &= \frac 1 2 \|Xw' - Xw\|^2 + \left\langle w' - w, \nabla\widehat{\mathcal{R}}(w) \right \rangle + \widehat{\mathcal{R}}(w). \end{aligned}

Since \frac {\sigma_{\min}(X)}{2} \|w'-w\|^2 \leq \frac 1 2 \|Xw'-Xw\|^2 \leq \frac {\sigma_{\max}(X)}{2} \|w'-w\|^2, thus \widehat{\mathcal{R}} is \sigma_{\max}(X)-smooth (and \sigma_{\min}-strongly-convex, as we’ll discuss).
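As a small sketch (not from the notes), the following runs gradient descent with step size 1/\beta on this least squares objective, taking \beta to be the largest eigenvalue of X^{\scriptscriptstyle\mathsf{T}}X, and checks the descent inequality \widehat{\mathcal{R}}(w') \leq \widehat{\mathcal{R}}(w) - \|\nabla\widehat{\mathcal{R}}(w)\|^2/(2\beta) from above at every step.

```python
import numpy as np

# Sketch (not from the notes): gradient descent with step 1/beta on least squares,
# taking beta as the largest eigenvalue of X^T X, checking the descent inequality
# R(w') <= R(w) - ||grad R(w)||^2 / (2*beta).
rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
R = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2
grad = lambda w: X.T @ (X @ w - y)
beta = np.linalg.eigvalsh(X.T @ X).max()

w = rng.normal(size=d)
for t in range(20):
    g = grad(w)
    w_next = w - g / beta
    assert R(w_next) <= R(w) - np.dot(g, g) / (2 * beta) + 1e-9
    w = w_next
print("final objective:", R(w))
```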

The smoothness bound holds if we use the seminorm \|v\|_X = \|Xv\|. We’ll (maybe?) discuss smoothness wrt other norms in homework.

I should use \mathcal L not \widehat{\mathcal{R}} since unnormalized.

7.1.1 Convergence to stationary points

Consider first the gradient iteration w' := w - \eta \nabla\widehat{\mathcal{R}}(w), where \eta\geq 0 is the step size. When f is \beta smooth but not necessarily convex, the smoothness inequality directly gives \begin{aligned} \widehat{\mathcal{R}}(w') &\leq \widehat{\mathcal{R}}(w) + \left\langle \nabla\widehat{\mathcal{R}}(w), w'-w \right \rangle + \frac \beta 2 \|w'-w\|^2 \\ &= \widehat{\mathcal{R}}(w) - \eta \|\nabla\widehat{\mathcal{R}}(w)\|^2 + \frac {\beta\eta^2}{2}\|\nabla\widehat{\mathcal{R}}(w)\|^2 \\ &= \widehat{\mathcal{R}}(w) - \eta \left({1 - \frac {\beta\eta}{2}}\right) \|\nabla\widehat{\mathcal{R}}(w)\|^2. \end{aligned} (3)

If we choose \eta appropriately (\eta\leq 2/\beta) then: either we are near a critical point (\nabla\widehat{\mathcal{R}}(w)\approx 0), or we can decrease \widehat{\mathcal{R}}.

Let’s refine our notation to tell iterates apart:

  1. Let w_0 be given.

  2. Recurse: w_{i+1} := w_{i} - \eta_i \nabla\widehat{\mathcal{R}}(w_{i}).

I changed indexing (2021-09-23), need to update everywhere…

Rearranging our iteration inequality eq. 3 and summing over i<t, \begin{aligned} \sum_{i<t} \eta_{i}\left({1 - \frac {\beta\eta_{i}}{2}}\right)\|\nabla\widehat{\mathcal{R}}(w_i)\|^2 &\leq \sum_{i<t} \left({\widehat{\mathcal{R}}(w_i) - \widehat{\mathcal{R}}(w_{i+1})}\right) \\ &= \widehat{\mathcal{R}}(w_0) - \widehat{\mathcal{R}}(w_{t}). \end{aligned} We can summarize these observations in the following theorem.

Theorem 7.1.
Let (w_i)_{i\geq 0} be given by gradient descent on \beta-smooth \widehat{\mathcal{R}}.
Remark 7.2.

Gradient flow version. Using FTC, chain rule, and definition, \begin{aligned} \widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(w(0)) &= \int_0^t \left\langle \nabla\widehat{\mathcal{R}}(w(s)), \dot w(s) \right \rangle{\text{d}}s \\ &= - \int_0^t\|\nabla\widehat{\mathcal{R}}(w(s))\|^2{\text{d}}s \\ &\leq - t \inf_{s\in[0,t]} \|\nabla\widehat{\mathcal{R}}(w(s))\|^2, \end{aligned} which can be summarized as follows.

Theorem 7.2.
For the gradient flow, \inf_{s\in[0,t]} \|\nabla\widehat{\mathcal{R}}(w(s))\|^2 \leq \frac 1 t \left({ \widehat{\mathcal{R}}(w(0)) - \widehat{\mathcal{R}}(w(t)) }\right).
Remark 7.3.

GD: \min_{i<t} \|\nabla\widehat{\mathcal{R}}(w_i)\|^2 \le \frac {2\beta}{t}\left({\widehat{\mathcal{R}}(w_0) - \widehat{\mathcal{R}}(w_t)}\right).

7.1.2 Convergence rate for smooth & convex

If \widehat{\mathcal{R}} is differentiable and convex, then it is bounded below by its first-order approximations: \widehat{\mathcal{R}}(w') \geq \widehat{\mathcal{R}}(w) + \left\langle \nabla\widehat{\mathcal{R}}(w), w'-w \right \rangle\qquad \forall w,w'.

Theorem 7.3.
Suppose \widehat{\mathcal{R}} is \beta-smooth and convex, and (w_i)_{i\geq 0} is given by GD with \eta_i := 1/\beta. Then for any z, \widehat{\mathcal{R}}(w_t) - \widehat{\mathcal{R}}(z) \leq \frac {\beta}{2t}\left({ \|w_0 - z\|^2 - \|w_t-z\|^2 }\right).
Remark 7.4.
The reference point z allows us to use this bound effectively when \widehat{\mathcal{R}} lacks an optimum, or simply when the optimum is very large. For an example of such an application of z, see the margin maximization material (e.g., Lemma 10.1).

Proof. By convexity and the earlier smoothness inequality \|\nabla\widehat{\mathcal{R}}(w)\|^2 \leq 2\beta (\widehat{\mathcal{R}}(w)-\widehat{\mathcal{R}}(w')), \begin{aligned} \|w'-z\|^2 &= \|w-z\|^2 - \frac 2 \beta \left\langle \nabla\widehat{\mathcal{R}}(w), w-z \right \rangle + \frac 1 {\beta^2} \|\nabla\widehat{\mathcal{R}}(w)\|^2 \\ &\leq \|w-z\|^2 + \frac 2 \beta (\widehat{\mathcal{R}}(z)-\widehat{\mathcal{R}}(w)) + \frac 2 \beta (\widehat{\mathcal{R}}(w) - \widehat{\mathcal{R}}(w')) \\ &= \|w-z\|^2 + \frac 2 \beta (\widehat{\mathcal{R}}(z)-\widehat{\mathcal{R}}(w')). \end{aligned} Rearranging and applying \sum_{i<t}, \frac 2 \beta \sum_{i<t} (\widehat{\mathcal{R}}(w_{i+1}) - \widehat{\mathcal{R}}(z)) \leq \sum_{i<t} \left({ \|w_{i}-z\|^2 - \|w_{i+1}-z\|^2 }\right) The final bound follows by noting \widehat{\mathcal{R}}(w_i)\geq \widehat{\mathcal{R}}(w_t), and since the right hand side telescopes.
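A numerical sketch (not from the notes) of Theorem 7.3 on an overparameterized least squares problem, with the reference point z taken for convenience to be the least-norm solution:

```python
import numpy as np

# Sketch (not from the notes): check R(w_t) - R(z) <= beta/(2t) * (||w_0 - z||^2 - ||w_t - z||^2)
# for least squares, with z the least-norm least-squares solution.
rng = np.random.default_rng(0)
n, d = 40, 60                              # overparameterized, so minimizers are not unique
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
R = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2
grad = lambda w: X.T @ (X @ w - y)
beta = np.linalg.eigvalsh(X.T @ X).max()
z = np.linalg.pinv(X) @ y                  # one convenient reference point

w0 = np.zeros(d)
w = w0.copy()
T = 200
for t in range(T):
    w = w - grad(w) / beta
lhs = R(w) - R(z)
rhs = beta / (2 * T) * (np.linalg.norm(w0 - z) ** 2 - np.linalg.norm(w - z) ** 2)
print(f"{lhs:.4e} <= {rhs:.4e}: {lhs <= rhs + 1e-9}")
```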

Remark 7.5 (characterizing convexity) .
There are many ways to characterize convexity; a few different versions follow. Standard texts with more characterizations and more generality (e.g., using infinite output values to model constraint sets, and using subdifferentials as a meaningful surrogate for gradients for nondifferentiable convex functions) include Hiriart-Urruty and Lemaréchal (2001).

For GF, we use the same potential, but indeed start from the telescoping sum, which can be viewed as a Riemann sum corresponding to the following application of FTC: \begin{aligned} \frac 1 2 \|w(t) - z\|_2^2 - \frac 1 2 \|w(0) - z\|_2^2 &= \frac 1 2 \int_0^t \frac {{\text{d}}}{{\text{d}}s} \|w(s) - z\|_2^2 {\text{d}}s \\ &= \int_0^t \left\langle \frac {{\text{d}}w}{{\text{d}}s} , w(s) - z \right \rangle {\text{d}}s \\ &= \int_0^t \left\langle \nabla \widehat{\mathcal{R}}(w(s)) , z - w(s) \right \rangle {\text{d}}s \\ &\leq \int_0^t \left({ \widehat{\mathcal{R}}(z) - \widehat{\mathcal{R}}(w(s)) }\right) {\text{d}}s. \end{aligned}

Theorem 7.4.
For any z\in\mathbb{R}^d, GF satisfies \begin{aligned} t \widehat{\mathcal{R}}(w(t)) + \frac 1 2 \|w(t) - z\|_2^2 &\leq \int_0^t \widehat{\mathcal{R}}(z){\text{d}}s + \frac 1 2 \|w(0) - z\|_2^2 \\ &= t \widehat{\mathcal{R}}(z) + \frac 1 2 \|w(0) - z\|_2^2. \end{aligned}
Remark 7.6 (“units” of GD and GF: t vs \frac t \beta) .
Here’s a back-of-the-envelope calculation to see why t becomes t/\beta and why they are really the same, and not a sloppiness of the analysis.
Remark 7.7 (potential functions) .

We can use similar potential functions with deep learning, without smoothness (!).

Remark 7.8 (rates) .
Some rules of thumb (not comprehensive, and there are other ways).

7.2 Strong convexity

Recall one of our definitions of strong convexity: say that \widehat{\mathcal{R}} is \lambda-strongly-convex (\lambda-sc) when \widehat{\mathcal{R}}(w') \geq \widehat{\mathcal{R}}(w) + \left\langle \nabla\widehat{\mathcal{R}}(w), w'-w \right \rangle + \frac \lambda 2 \|w'-w\|^2; see Remark 7.5 (characterizing convexity) for more forms.

Example 7.2 (least squares) .
Earlier we pointed out \frac 1 2 \|Xw'-y\|^2 =: \widehat{\mathcal{R}}(w') = \widehat{\mathcal{R}}(w) + \left\langle \nabla\widehat{\mathcal{R}}(w), w'-w \right \rangle + \frac 1 2 \|Xw'-Xw\|^2 and \sigma_{\min}(X) \|w'-w\|^2 \leq \|Xw'-Xw\|^2 \leq \sigma_{\max}(X) \|w'-w\|^2. The latter implies a smoothness upper bound we used; now we know the former implies strong convexity. (We can also say that both hold with equality using the special seminorm \|v\|_X = \|Xv\|.) We can also verify these properties by noting \nabla^2 \widehat{\mathcal{R}}= X^{\scriptscriptstyle\mathsf{T}}X.
Example 7.3 (regularization) .
Define regularized risk \widehat{\mathcal{R}}_\lambda(w) := \widehat{\mathcal{R}}(w) + \lambda\|w\|^2/2.

If \widehat{\mathcal{R}} is convex, then \widehat{\mathcal{R}}_\lambda is \lambda-sc:

Another very useful property is that \lambda-sc gives a way to convert gradient norms to suboptimality.

Lemma 7.1.
Suppose \widehat{\mathcal{R}} is \lambda-sc. Then \forall w\centerdot\qquad \widehat{\mathcal{R}}(w) - \inf_v \widehat{\mathcal{R}}(v) \leq \frac 1 {2\lambda}\|\nabla\widehat{\mathcal{R}}(w)\|^2.
Remark 7.9.
Smoothness gave \frac 1 {2\beta} \|\nabla\widehat{\mathcal{R}}(w_i)\|^2 \leq \widehat{\mathcal{R}}(w_i) - \widehat{\mathcal{R}}(w_{i+1}).

Proof. Let w be given, and define the convex quadratic Q_w(v) := \widehat{\mathcal{R}}(w) + \left\langle \nabla\widehat{\mathcal{R}}(w), v-w \right \rangle + \frac \lambda 2 \|v-w\|^2, which attains its minimum at \bar v := w - \nabla\widehat{\mathcal{R}}(w)/\lambda. By the definition of \lambda-strong-convexity, \begin{aligned} \inf_v \widehat{\mathcal{R}}(v) \geq \inf_v Q_w(v) = Q_w(\bar v) = \widehat{\mathcal{R}}(w) - \frac 1 {2\lambda}\|\nabla\widehat{\mathcal{R}}(w)\|^2. \end{aligned}

Remark 7.10 (stopping conditions) .

Say our goal is to find w so that \widehat{\mathcal{R}}(w) - \inf_v \widehat{\mathcal{R}}(v) \leq \epsilon. When do we stop gradient descent?

Remark 7.11 (Regularization and boundedness) .

I should add lemmas giving level set containment, and existence of minimizers.

7.2.1 Rates when strongly convex and smooth

Theorem 7.5.
Suppose \widehat{\mathcal{R}} is \lambda-sc and \beta-smooth, and GD is run with step size 1/\beta. Then a minimum \bar w exists, and \begin{aligned} \widehat{\mathcal{R}}(w_t) - \widehat{\mathcal{R}}(\bar w) &\leq \left({\widehat{\mathcal{R}}(w_0) - \widehat{\mathcal{R}}(\bar w)}\right) \exp(-t \lambda/\beta), \\ \|w_t-\bar w\|^2 &\leq \|w_0-\bar w\|^2\exp(-t\lambda/\beta). \end{aligned}

Proof. Using previously proved lemmas from smoothness and strong convexity, \begin{aligned} \widehat{\mathcal{R}}(w_{i+1}) - \widehat{\mathcal{R}}(\bar w) &\leq \widehat{\mathcal{R}}(w_{i}) - \widehat{\mathcal{R}}(\bar w) - \frac {\|\nabla\widehat{\mathcal{R}}(w_i)\|^2}{2\beta} \\ &\leq \widehat{\mathcal{R}}(w_{i}) - \widehat{\mathcal{R}}(\bar w) - \frac {2\lambda(\widehat{\mathcal{R}}(w_i) - \widehat{\mathcal{R}}(\bar w))}{2\beta} \\ &\leq \left({\widehat{\mathcal{R}}(w_{i}) - \widehat{\mathcal{R}}(\bar w)}\right)\left({1 - \lambda/\beta}\right), \end{aligned} which gives the first bound by induction since \prod_{i<t}(1-\lambda/\beta) \leq \prod_{i<t}\exp\left({-\lambda/\beta}\right) = \exp\left({-t\lambda/\beta}\right).

For the second guarantee, expanding the square as usual, \begin{aligned} \|w' -\bar w\|^2 &=\|w-\bar w\|^2 + \frac 2 \beta \left\langle \nabla\widehat{\mathcal{R}}(w), \bar w - w \right \rangle + \frac 1 {\beta^2}\|\nabla\widehat{\mathcal{R}}(w)\|^2 \\ &\leq\|w-\bar w\|^2 + \frac 2 \beta \left({\widehat{\mathcal{R}}(\bar w) - \widehat{\mathcal{R}}(w) - \frac{\lambda}{2} \|\bar w- w\|_2^2 }\right) \\ &\qquad + \frac 1 {\beta^2}\left({2\beta(\widehat{\mathcal{R}}(w) - \widehat{\mathcal{R}}(w'))}\right) \\ &= (1-\lambda/\beta)\|w-\bar w\|^2 + \frac 2 \beta \left({\widehat{\mathcal{R}}(\bar w) - \widehat{\mathcal{R}}(w) + \widehat{\mathcal{R}}(w) - \widehat{\mathcal{R}}(w')}\right) \\ &\leq (1-\lambda/\beta)\|w-\bar w\|^2, \end{aligned} which gives the argument after a similar induction argument as before.
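A numerical sketch (not from the notes) of the first bound of Theorem 7.5 on a well-conditioned least squares problem, taking \lambda and \beta to be the extreme eigenvalues of X^{\scriptscriptstyle\mathsf{T}}X:

```python
import numpy as np

# Sketch (not from the notes): check R(w_t) - R(w*) <= (R(w_0) - R(w*)) * exp(-t*lambda/beta)
# on a strongly convex least squares problem.
rng = np.random.default_rng(0)
n, d = 100, 10                               # n >> d so X^T X is positive definite
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
R = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2
grad = lambda w: X.T @ (X @ w - y)
eigs = np.linalg.eigvalsh(X.T @ X)
lam, beta = eigs.min(), eigs.max()
wbar = np.linalg.solve(X.T @ X, X.T @ y)     # the unique minimizer

w = np.zeros(d)
gap0 = R(w) - R(wbar)
for t in range(1, 51):
    w = w - grad(w) / beta
    assert R(w) - R(wbar) <= gap0 * np.exp(-t * lam / beta) + 1e-9
print("suboptimality after 50 steps:", R(w) - R(wbar))
```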

Remark 7.12.

Next let’s handle the gradient flow.

Theorem 7.6.
If \widehat{\mathcal{R}} is \lambda-sc, a minimum \bar w exists, and the GF w(t) satisfies \begin{aligned} \|w(t) -\bar w\|^2 &\leq \|w(0)-\bar w\|^2 \exp(-2\lambda t), \\ \widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(\bar w) &\leq \left({ \widehat{\mathcal{R}}(w(0)) - \widehat{\mathcal{R}}(\bar w) }\right) \exp(-2t\lambda). \end{aligned}

Proof. By first-order optimality in the form \nabla\widehat{\mathcal{R}}(\bar w) = 0, then \begin{aligned} \frac {{\text{d}}}{{\text{d}}t} \frac 1 2 \|w(t) -\bar w\|^2 &= \left\langle w(t) - \bar w, \dot w(t) \right \rangle \\ &= -\left\langle w(t) - \bar w, \nabla \widehat{\mathcal{R}}(w(t)) - \nabla \widehat{\mathcal{R}}(\bar w) \right \rangle \\ &\leq - \lambda \|w(t) - \bar w\|^2. \end{aligned} By Grönwall’s inequality, this implies \begin{aligned} \|w(t) -\bar w\|^2 &\leq \|w(0)-\bar w\|^2 \exp\left({ -\int_0^t 2\lambda{\text{d}}s }\right) \\ &\leq \|w(0)-\bar w\|^2 \exp(-2\lambda t), \end{aligned} which establishes the guarantee on distances to initialization. For the objective function guarantee, \begin{aligned} \frac {{\text{d}}}{{\text{d}}t} (\widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(\bar w)) & = \left\langle \nabla\widehat{\mathcal{R}}(w(t)), \dot w(t) \right \rangle \\ &= - \left\|{ \nabla\widehat{\mathcal{R}}(w(t)) }\right\|^2 \leq - 2\lambda (\widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(\bar w)). \end{aligned} Grönwall’s inequality implies \widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(\bar w) \leq \left({ \widehat{\mathcal{R}}(w(0)) - \widehat{\mathcal{R}}(\bar w) }\right) \exp(-2t\lambda).

Remark 7.13.
As in all other rates proved for GF and GD, time t is replaced by “arc length units” t/\beta.

We have strayed a little from our goals by producing laborious proofs that not only treat the objective function and the distances separately, but also require minimizers. Interestingly, we can resolve this by changing the step size to a larger (seemingly worse?) one.

Theorem 7.7.
Suppose \widehat{\mathcal{R}} is \beta-smooth and \lambda-sc, and GD is run with the constant step size \frac {2}{\beta + \lambda}. Then, for any z, \widehat{\mathcal{R}}(w_t) - \widehat{\mathcal{R}}(z) + \frac \lambda 2 \|w_t - z\|^2 \leq \left[{ \frac {\beta - \lambda}{\beta + \lambda}}\right]^t \left({ \widehat{\mathcal{R}}(w_0) - \widehat{\mathcal{R}}(z) + \frac \lambda 2 \|w_0 - z\|^2 }\right).

Proof. Homework problem \ddot\smile.

Remark 7.14 (standard rates with strong convexity) .
Compared with standard proofs in the literature (Nesterov 2003, chap. 2), the preceding bound with step size 2/(\beta+\lambda) is possibly loose: it seems possible to have a 2t and not just t in the exponent, albeit after adjusting the other terms (and depending explicitly on minimizers). I need to resolve what’s going on here…

Moreover, another standard rate given in the literature is 1/t under just strong convexity (no smoothness); however, this requires a step size \eta_i := (\lambda(i+1))^{-1}.

7.3 Stochastic gradients

Let’s generalize gradient descent, and consider the iteration w_{i+1} := w_i - \eta_i g_i, where each g_i is merely some vector. If g_i := \nabla\widehat{\mathcal{R}}(w_i), then we have gradient descent, but in general we only approximate it. Later in this section, we’ll explain how to make g_i a “stochastic gradient.”

Our first step is to analyze this in our usual way with our favorite potential function, but accumulating a big error term: using convexity of \mathcal{R} and choosing a constant step size \eta_i := \eta \geq 0 for simplicity, \begin{aligned} \|w_{i+1} - z\|^2 &= \|w_i - \eta g_i - z\|^2 \\ &= \|w_i - z\|^2 - 2\eta_i \left\langle g_i, w_i - z \right \rangle + \eta^2 \|g_i\|^2 \\ &= \|w_i - z\|^2 + 2\eta \left\langle g_i - \nabla\mathcal{R}(w_i)+\nabla\mathcal{R}(w_i), z - w_i \right \rangle + \eta^2 \|g_i\|^2 \\ &\leq \|w_i - z\|^2 + 2\eta (\mathcal{R}(z) - \mathcal{R}(w_i) + \underbrace{\left\langle g_i - \nabla\mathcal{R}(w_i), z-w_i \right \rangle}_{\epsilon_i}) + \eta^2 \|g_i\|^2, \end{aligned} which after rearrangement gives 2\eta \mathcal{R}(w_i) \leq 2\eta \mathcal{R}(z) + \|w_{i} - z\|^2 - \|w_{i+1} - z\|^2 + 2\eta \epsilon_i+ \eta^2 \|g_i\|^2, and applying \frac 1 {2\eta t} \sum_{i<t} to both sides gives \frac 1 t \sum_{i<t} \mathcal{R}(w_i) \leq \mathcal{R}(z) + \frac{\|w_{0} - z\|^2 - \|w_{t} - z\|^2}{2 \eta t} + \frac 1 t \sum_{i<t}\left({ \epsilon_i+ \frac \eta 2 \|g_i\|^2}\right).

The following lemma summarizes this derivation.

Lemma 7.2.
Suppose \mathcal{R} convex; set G:=\max_i\|g_i\|_2, and \eta := \frac{c}{\sqrt t}. For any z, \mathcal{R}\left({ \frac 1 t \sum_{i<t} w_i }\right) \leq \frac 1 t \sum_{i<t} \mathcal{R}(w_i) \leq \mathcal{R}(z) + \frac{\|w_{0} - z\|^2}{2 c\sqrt{t}} + \frac{cG^2}{2\sqrt t} + \frac 1 t \sum_{i<t}\epsilon_i.

Proof. This follows from the earlier derivation after plugging in G, \eta=c/\sqrt{t}, and applying Jensen’s inequality to the left hand side.

Remark 7.15.

Now let us define the standard stochastic gradient oracle: \mathop{\mathbb{E}}[ g_i | w_{\leq i} ] = \nabla\mathcal{R}(w_i), where w_{\leq i} signifies all randomness in (w_1,\ldots,w_i).

Remark 7.16.

Indeed, this setup allows the expectation to be nicely interpreted as an iterated integral over (x_1,y_1), then (x_2,y_2), and so on. The stochastic gradient g_i depends on (x_i,y_i) and w_i, but w_i does not depend on (x_i,y_i), rather on ((x_j,y_j))_{j=1}^{i-1}.
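To make the oracle concrete, here is a minimal sketch (numpy; the data, the logistic loss, and the sizes are assumptions of the sketch) where the "distribution" is uniform over a fixed sample, so the full-batch gradient plays the role of \nabla\mathcal{R}(w_i) in the oracle property \mathop{\mathbb{E}}[g_i|w_{\leq i}] = \nabla\mathcal{R}(w_i).

```python
import numpy as np

# Sketch of a stochastic gradient oracle (assumed setup: logistic loss; the
# "population" is uniform over a fixed sample, so the full-batch gradient
# plays the role of the population gradient in the oracle property).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))

def full_grad(w):
    margins = y * (X @ w)
    return -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

def stochastic_grad(w):
    i = rng.integers(n)                          # fresh uniform index each call
    margin = y[i] * (X[i] @ w)
    return -y[i] * X[i] / (1.0 + np.exp(margin))

w = np.zeros(d)
# conditional unbiasedness: an average of many oracle calls approximates full_grad(w)
est = np.mean([stochastic_grad(w) for _ in range(20000)], axis=0)
print("oracle bias estimate:", np.linalg.norm(est - full_grad(w)))

# the resulting method: w_{i+1} = w_i - eta * g_i with eta = c / sqrt(t)
t, c, avg = 2000, 1.0, np.zeros(d)
for _ in range(t):
    avg += w / t                                 # average iterate, as in the lemma
    w = w - (c / np.sqrt(t)) * stochastic_grad(w)
print("risk of averaged iterate:", np.mean(np.log1p(np.exp(-y * (X @ avg)))))
```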

Now let’s work towards our goal of showing that, with high probability, our stochastic gradient method does nearly as well as a regular gradient method. (We will not show any benefit to stochastic noise, other than computation!)

Our main tool is as follows.

Theorem 7.8 (Azuma-Hoeffding) .
Suppose (Z_i)_{i=1}^t is a martingale difference sequence (\mathop{\mathbb{E}}(Z_i | Z_{<i}) = 0) with |Z_i|\leq R almost surely. Then with probability at least 1-\delta, \sum_i Z_i \leq R\sqrt{2t\ln(1/\delta)}.

Proof omitted, though we’ll sketch some approaches in a few weeks.
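As a quick empirical sanity check (a sketch only; the Rademacher increments and the constants are assumptions of the sketch), the following simulates bounded martingale differences and compares the realized sums against R\sqrt{2t\ln(1/\delta)}.

```python
import numpy as np

# Sanity-check sketch of the Azuma-Hoeffding bound with Rademacher increments
# (so |Z_i| <= R): compare the empirical tail of the sum to R*sqrt(2 t ln(1/delta)).
rng = np.random.default_rng(0)
t, R, delta, trials = 1000, 2.0, 0.01, 20000
Z = R * rng.choice([-1.0, 1.0], size=(trials, t))     # martingale differences
bound = R * np.sqrt(2 * t * np.log(1 / delta))
print("empirical failure prob:", np.mean(Z.sum(axis=1) > bound), "<= delta =", delta)
```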

We will use this inequality to handle \sum_{i<t} \epsilon_i. Firstly, we must show the desired expectations are zero. To start, \begin{aligned} \mathop{\mathbb{E}}\left[{ \epsilon_i\ \Big|\ w_{\leq i} }\right] &= \mathop{\mathbb{E}}\left[{ \left\langle g_i - \nabla\mathcal{R}(w_i), z - w_i \right \rangle\ \Big|\ w_{\leq i} }\right] \\ &= \left\langle \mathop{\mathbb{E}}\left[{ g_i - \nabla\mathcal{R}(w_i)\ {} \Big| \ {} w_{\leq i} }\right], z - w_i \right \rangle \\ &= \left\langle 0, z - w_i \right \rangle \\ &= 0. \end{aligned} Next, by Cauchy-Schwarz and the triangle inequality, \mathop{\mathbb{E}}|\epsilon_i| = \mathop{\mathbb{E}}\left|{ \left\langle g_i - \nabla\widehat{\mathcal{R}}(w_i), w_i - z \right \rangle }\right| \leq \mathop{\mathbb{E}}\left({ \|g_i\| + \|\nabla\widehat{\mathcal{R}}(w_i)\| }\right)\|w_i - z\| \leq 2GD. Consequently, by Azuma-Hoeffding, with probability at least 1-\delta, \sum_i \epsilon_i \leq 2GD \sqrt{2t \ln(1/\delta)}.

Plugging this into Lemma 7.2 (with c = 1) gives the following.

Lemma 7.3.
Suppose \mathcal{R} convex; set G:=\max_i\|g_i\|_2, and \eta := \frac{1}{\sqrt t}, D\geq \max_i \|w_i - z\|, and suppose g_i is a stochastic gradient at time i. With probability at least 1-\delta, \begin{aligned} \mathcal{R}\left({ \frac 1 t \sum_{i<t} w_i }\right) &\leq \frac 1 t \sum_{i<t} \mathcal{R}(w_i) \\ &\leq \mathcal{R}(z) + \frac{D^2}{2 \sqrt{t}} + \frac{G^2}{2\sqrt t} + \frac {2DG \sqrt{2 \ln(1/\delta)}}{\sqrt{t}}. \end{aligned}
Remark 7.17.

8 Two NTK-based optimization proofs near initialization

Here we will show our first optimization guarantees for (shallow) networks: one based on strong convexity, and one based on smoothness.

under construction.

8.1 Strong convexity style NTK optimization proof

The analysis in this subsection parallels the strong convexity and smoothness analysis of Theorem 7.5.

Finally we will prove (rather than assert) that we can stay close to initialization long enough to get a small risk with an analysis that is essentially convex, essentially following the NTK (Taylor approximation).

Basic notation. For convenience, bake the training set into the predictor: \begin{aligned} f(w) &:= \begin{bmatrix} f(x_1;w)\\ \vdots \\ f(x_n;w) \end{bmatrix} \in \mathbb{R}^n. \end{aligned} We’ll be considering squared loss regression: \begin{aligned} \widehat{\mathcal{R}}(\alpha f(w)) &:= \frac {1}{2} \|\alpha f(w) - y\|^2, \qquad \widehat{\mathcal{R}}_0 := \widehat{\mathcal{R}}(\alpha f(w(0)) ), \end{aligned} where \alpha > 0 is a scale factor we’ll optimize later. maybe I should use \mathcal L not \widehat{\mathcal{R}} since unnormalized.

We’ll consider gradient flow: \begin{aligned} \dot w(t) &:= - \nabla_w \widehat{\mathcal{R}}(\alpha f(w(t))) = - \alpha J_t^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f(w(t))), \\ \textup{where } J_t := J_{w(t)} &:= \begin{bmatrix} \nabla f(x_1;w(t))^{\scriptscriptstyle\mathsf{T}}\\ \vdots \\ \nabla f(x_n;w(t))^{\scriptscriptstyle\mathsf{T}} \end{bmatrix} \in \mathbb{R}^{n\times p}. \end{aligned}

We will also explicitly define and track a flow u(t) over the tangent model; what we care about is w(t), but we will show that indeed u(t) and w(t) stay close in this setting. (Note that u(t) is not needed for the analysis of w(t).) \begin{aligned} f_0(u) &:= f(w(0)) + J_0(u - w(0)). \\ \dot u(t) &:= -\nabla_u \widehat{\mathcal{R}}(\alpha f_0(u(t))) = - \alpha J_0^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f_0(u(t))). \end{aligned} Both gradient flows have the same initial condition: u(0) = w(0),\qquad f_0(u(0)) = f_0(w(0)) = f(w(0)).
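As a concrete sketch (a hypothetical shallow ReLU setup with arbitrary sizes, numpy only), the following builds the stacked Jacobian J_0 for f(x;w) = \sum_j s_j\sigma(w_j^{\scriptscriptstyle\mathsf{T}}x), forms the tangent model f_0(u) = f(w(0)) + J_0(u - w(0)), checks that both models agree at u = w(0), and reports \sigma_{\min}(J_0), which comes out positive for these random draws.

```python
import numpy as np

# Sketch of the NTK linearization for a shallow ReLU network (hypothetical sizes):
# f(x; w) = sum_j s_j * relu(w_j . x), trained only in the inner weights w.
rng = np.random.default_rng(0)
n, d, m = 6, 4, 300
X = rng.normal(size=(n, d))
s = rng.choice([-1.0, 1.0], size=m)
W0 = rng.normal(size=(m, d))                     # w(0), flattened below

def f(W):                                        # stacked predictions f(w) in R^n
    return np.maximum(X @ W.T, 0.0) @ s

def jacobian(W):                                 # J_w in R^{n x (m*d)}
    act = (X @ W.T > 0).astype(float)            # relu'(w_j . x_i)
    # row i, block j:  s_j * relu'(w_j . x_i) * x_i
    return ((act * s).reshape(n, m, 1) * X.reshape(n, 1, d)).reshape(n, m * d)

J0, w0 = jacobian(W0), W0.reshape(-1)

def f0(u):                                       # tangent (linearized) model
    return f(W0) + J0 @ (u - w0)

print("same initial predictions:", np.allclose(f0(w0), f(W0)))
print("sigma_min(J0):", np.linalg.svd(J0, compute_uv=False).min())
```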

Remark 8.1 (initialization, width, etc) .

Assumptions. \begin{aligned} \textrm{rank}(J_0) &= n, \\ \sigma_{\min}&:= \sigma_{\min}(J_0) = \sqrt{\lambda_{\min}(J_0J_0^{\scriptscriptstyle\mathsf{T}})} = \sqrt{\lambda_{n}(J_0J_0^{\scriptscriptstyle\mathsf{T}})}> 0, \\ \sigma_{\max}&:= \sigma_{\max}(J_0) > 0, \\ \|J_w - J_v\| &\leq \beta \|w-v\|. \end{aligned}(4)

Remark 8.2 (J_0J_0^{\scriptscriptstyle\mathsf{T}} has full rank, a “representation assumption”) .
This is a “representation assumption” in an explicit sense: it implies the tangent model has exact solutions to the least squares problem, regardless of the choice of y, meaning the training error can always be made 0. In detail, consider the least squares problem solved by the tangent space: \min_{u\in \mathbb{R}^p} \frac 1 2 \left\|{ f_0(u) - y }\right\|^2 = \min_{u\in \mathbb{R}^p} \frac 1 2 \left\|{ J_0 u - y_0 }\right\|^2, where we have chosen y_0 := y + J_0 w(0) - f(w(0)) for convenience. The normal equations for this least squares problem are J_0^{\scriptscriptstyle\mathsf{T}}J_0 u = J_0^{\scriptscriptstyle\mathsf{T}}y_0. Let J_0 = \sum_{i=1}^n s_i u_i v_i^{\scriptscriptstyle\mathsf{T}} denote the SVD of J_0, which has n terms by the rank assumption; the corresponding pseudoinverse is J_0^\dagger = \sum_{i=1}^n s_i^{-1} v_i u_i^{\scriptscriptstyle\mathsf{T}}. Multiplying both sides by (J_0^\dagger)^{\scriptscriptstyle\mathsf{T}}, J_0 u = (J_0^\dagger)^{\scriptscriptstyle\mathsf{T}}J_0^{\scriptscriptstyle\mathsf{T}}J_0 u = (J_0^\dagger)^{\scriptscriptstyle\mathsf{T}}J_0^{\scriptscriptstyle\mathsf{T}}y_0 = \left[{ \sum_{i=1}^n u_i u_i^{\scriptscriptstyle\mathsf{T}}}\right] y_0 = y_0, where the last step follows since [ \sum_i u_i u_i^{\scriptscriptstyle\mathsf{T}}] is idempotent and full rank, and therefore the identity matrix. In particular, we can choose \hat u = J_0^\dagger y_0, then J_0 \hat u = [ \sum_i u_i u_i^{\scriptscriptstyle\mathsf{T}}] y_0 = y_0, and in particular \frac 1 2 \left\|{f_0(\hat u) - y}\right\|^2 = \frac 1 2 \left\|{J_0 \hat u - y_0}\right\|^2 = 0. As such, the full rank assumption is explicitly a representation assumption: we are forcing the tangent space least squares problem to always have solutions.
Theorem 8.1 (see also (Theorem 3.2, Chizat and Bach 2019)) .
Assume eq. 4 and \alpha \geq \frac {\beta \sqrt{1152 \sigma_{\max}^2 \widehat{\mathcal{R}}_0}}{\sigma_{\min}^3}. Then \begin{aligned} \max\left\{{ \widehat{\mathcal{R}}(\alpha f(w(t))), \widehat{\mathcal{R}}(\alpha f_0(u(t))) }\right\} &\leq \widehat{\mathcal{R}}_0 \exp(- t \alpha^2 \sigma_{\min}^2/2), \\ \max\left\{{ \|w(t) - w(0)\|, \|u(t) - w(0)\| }\right\} &\leq \frac {3 \sqrt{8 \sigma_{\max}^2 \widehat{\mathcal{R}}_0}}{\alpha \sigma_{\min}^2}. \end{aligned}
Remark 8.3 (shallow case) .

To get a handle on the various abstract constants and what they mean, consider the shallow case, namely f(x;w) = \sum_j s_j \sigma(w_j^{\scriptscriptstyle\mathsf{T}}x), where s_j\in\{\pm 1\} is not trained, and each w_j is trained.

Smoothness constant. Let X\in\mathbb{R}^{n\times d} be a matrix with the n training inputs as rows, and suppose \sigma is \beta_0-smooth. Then \begin{aligned} \|J_w - J_v\|_2^2 &= \sum_{i,j} \|x_i\|^2 (\sigma'(w_j^{\scriptscriptstyle\mathsf{T}}x_i) - \sigma'(v_j^{\scriptscriptstyle\mathsf{T}}x_i))^2 \\ &\leq \sum_{i,j} \|x_i\|^4 \beta_0^2 \|w_j - v_j \|^2 \\ &= \beta_0^2 \|X\|_{{\textrm{F}}}^4 \|w - v\|^2. \end{aligned} Thus \beta = \beta_0 \|X\|_{{\textrm{F}}}^2 suffices, which we can ballpark as \beta = \Theta(n).

Singular values. Now that we have an interpretation of the full rank assumption, ballpark the eigenvalues of J_0J_0^{\scriptscriptstyle\mathsf{T}}. By definition, (J_0J_0^{\scriptscriptstyle\mathsf{T}})_{i,j} = \nabla f(x_i;w(0))^{\scriptscriptstyle\mathsf{T}}\nabla f(x_j;w(0)). Holding i fixed and letting j vary, we can view the corresponding column of (J_0J_0^{\scriptscriptstyle\mathsf{T}}) as another feature representation, and \textrm{rank}(J_0)=n means none of these examples, in this feature representation, are linear combinations of others. This gives a concrete sense under which these eigenvalue assumptions are representation assumptions.

Now suppose each w_j(0) is an iid copy of some random variable v. Then, by definition of J_0, \begin{aligned} \mathop{\mathbb{E}}_{w(0)} (J_0J_0^{\scriptscriptstyle\mathsf{T}})_{i,j} &= \mathop{\mathbb{E}}_{w(0)} \nabla f(x_i;w(0))^{\scriptscriptstyle\mathsf{T}}\nabla f(x_j;w(0)). \\ &= \mathop{\mathbb{E}}_{w(0)} \sum_k s_k^2 \sigma'(w_k(0)^{\scriptscriptstyle\mathsf{T}}x_i) \sigma'(w_k(0)^{\scriptscriptstyle\mathsf{T}}x_j) x_i^{\scriptscriptstyle\mathsf{T}}x_j \\ &= m \mathop{\mathbb{E}}_{v} \sigma'(v^{\scriptscriptstyle\mathsf{T}}x_i) \sigma'(v^{\scriptscriptstyle\mathsf{T}}x_j) x_i^{\scriptscriptstyle\mathsf{T}}x_j. \end{aligned} In other words, it seems reasonable to expect \sigma_{\min} and \sigma_{\max} to scale with \sqrt m.

Initial risk \widehat{\mathcal{R}}_0. Let’s consider two different random initializations.

In the first case, we use one of the fancy schemes we mentioned to force f(w(0)) = 0; e.g., we can make sure that s_j is positive and negative an equal number of times, sample w_j for s_j=+1, and then let the w_j for s_j = -1 be copies of those, so the paired terms cancel. With this choice, \widehat{\mathcal{R}}_0 = \|y\|^2/2 = \Theta(n).

On the other hand, if we do a general random initialization of both s_j and w_j, then we can expect enough cancellation that, roughly, f(x_i;w(0)) = \Theta(\sqrt m) (assuming w_j’s variance is a constant not depending on m; letting it depend on m would defeat the purpose of separating out the scale parameter \alpha). Then \|\alpha f(w(0))\|^2 = \Theta( \alpha^2 m n ) and \widehat{\mathcal{R}}_0 = \Theta( \alpha^2 m n), and thus the lower bound condition on \alpha will need to be checked carefully.

Combining all parameters. Again let’s split into two cases, based on the initialization as discussed immediately above.

Possible values of \alpha. The two preceding cases considered lower bounds on \alpha. In the case \widehat{\mathcal{R}}_0 = \Theta(\alpha^2 n m), it even seemed that we can make \alpha whatever we want; in either case, the time required to make \widehat{\mathcal{R}}(\alpha f(w(t))) small will decrease as \alpha increases, so why not simply make \alpha arbitrarily large?

An issue occurs once we perform time discretization. Below, we will see that the smoothness of the model looks like \alpha^2 \sigma_{\max}^2 near initialization; as such, a time discretization, using tools such as in Theorem 7.3, will require a step size roughly 1 / (\alpha^2 \sigma_{\max}^2), and in particular while we may increase \alpha to force the gradient flow to seemingly converge faster, a smoothness-based time discretization will need the same number of steps.

As such, \alpha = 1 / \sigma_{\max} seems a reasonable way to simplify many terms in this shallow setup, which translates into a familiar 1/\sqrt{m} NTK scaling.

Proof of Theorem 8.1.

Proof plan.

Remark 8.4.
That is to say, in this setting, \alpha large enough (m large enough in the shallow case) ensures that we stay in the NTK regime forever! This is not the general case.

The evolution in prediction space is \begin{aligned} \frac {{\text{d}}}{{\text{d}}t} \alpha f(w(t)) &= \alpha J_t \dot w(t) = -\alpha^2 J_t J_t^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f(w(t))), \\ &= -\alpha^2 J_t J_t^{\scriptscriptstyle\mathsf{T}}(\alpha f(w(t)) - y), \\ \frac {{\text{d}}}{{\text{d}}t} \alpha f_0(u(t)) &= \frac {{\text{d}}}{{\text{d}}t} \alpha \left({ f(w(0)) + J_0(u(t) - w(0)) }\right) = \alpha J_0 \dot u(t) \\ &= -\alpha^2 J_0 J_0^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f_0(u(t))) \\ &= -\alpha^2 J_0 J_0^{\scriptscriptstyle\mathsf{T}}(\alpha f_0(u(t)) - y). \end{aligned}

The first one is complicated because we don’t know how J_t evolves.

But the second one can be written \frac {{\text{d}}}{{\text{d}}t} \left[{ \alpha f_0(u(t)) }\right] = -\alpha^2 \left({ J_0 J_0^{\scriptscriptstyle\mathsf{T}}}\right) \left[{ \alpha f_0(u(t)) }\right] +\alpha^2 \left({ J_0 J_0^{\scriptscriptstyle\mathsf{T}}}\right)y, which is a gradient ascent flow on a concave quadratic in the predictions \alpha f_0(u(t)).

Remark 8.5.
The original NTK paper, (Jacot, Gabriel, and Hongler 2018), had as its story that GF follows a gradient in kernel space. Seeing the evolution of \alpha f_0(u(t)) makes this clear, as it is governed by J_0J_0^{\scriptscriptstyle\mathsf{T}}, the Gram or kernel matrix!

Let’s fantasize a little and suppose J_w J_w^{\scriptscriptstyle\mathsf{T}} also stays positive definite. Do we still have a nice convergence theory?

Lemma 8.1.
Suppose \dot z(t) = -Q(t) \nabla\widehat{\mathcal{R}}(z(t)) and \lambda := \inf_{t\in[0,\tau]} \lambda_{\min} Q(t)>0. Then for any t\in[0,\tau], \widehat{\mathcal{R}}(z(t)) \leq \widehat{\mathcal{R}}(z(0)) \exp( - 2 t \lambda ).
Remark 8.6.
A useful consequence is \|z(t) - y\| = \sqrt{2\widehat{\mathcal{R}}(z(t))} \leq \sqrt{2\widehat{\mathcal{R}}(z(0))\exp(-2t\lambda)} = \|z(0)-y\| \exp( - t \lambda ).

Proof. Mostly just repeating our old strong convexity steps, \begin{aligned} \frac {{\text{d}}} {{\text{d}}t} \frac 1 2 \|z(t) - y\|^2 &= \left\langle -Q(t) (z(t) - y) , z(t) - y \right \rangle \\ &\leq - \lambda_{\min}\left({ Q(t) }\right) \left\langle z(t) - y, z(t) - y \right \rangle \\ &\leq - 2\lambda \|z(t) - y\|^2/2, \end{aligned} and Grönwall’s inequality completes the proof.

We can also prove this setting implies we stay close to initialization.

Lemma 8.2.
Suppose \dot v(t) = - S(t)^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(g(v(t))), where S_tS_t^{\scriptscriptstyle\mathsf{T}}= Q_t, and \lambda_i( Q_t ) \in [\lambda, \lambda_1] for t\in[0,\tau]. Then for t\in[0,\tau], \|v(t) - v(0)\| \leq \frac {\sqrt{\lambda_1}}{\lambda} \|g(v(0)) - y\| \leq \frac {\sqrt{2 \lambda_1 \widehat{\mathcal{R}}(g(v(0)))}}{\lambda}.

Proof. \begin{aligned} \|v(t) - v(0)\| &= \left\|{ \int_0^t \dot v(s) {\text{d}}s }\right\| \leq \int_0^t \|\dot v(s)\|{\text{d}}s \\ &= \int_0^t \|S_s^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(g(v(s)))\|{\text{d}}s \\ &\leq \sqrt{\lambda_1} \int_0^t \| g(v(s)) - y\|{\text{d}}s \\ &\stackrel{(*)}{\leq} \sqrt{\lambda_1} \|g(v(0)) - y \| \int_0^t \exp(- s\lambda) {\text{d}}s \\ &\leq \frac {\sqrt{\lambda_1}}{\lambda} \|g(v(0)) - y\| \\ &\leq \frac {\sqrt{2 \lambda_1 \widehat{\mathcal{R}}(g(v(0)))}}{\lambda}, \end{aligned} where (*) used Lemma 8.1 (via Remark 8.6).

Where does this leave us?

We can apply the previous two lemmas to the tangent model u(t), since for any t\geq 0, \dot u(t) = - \alpha J_0^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f_0(u(t))), \quad \frac {{\text{d}}}{{\text{d}}t} \alpha f_0(u(t)) = -\alpha^2 (J_0 J_0^{\scriptscriptstyle\mathsf{T}}) \nabla\widehat{\mathcal{R}}(\alpha f_0(u(t))). Thus since Q_0 := \alpha^2 J_0J_0^{\scriptscriptstyle\mathsf{T}} satisfies \lambda_i(Q_0) \in \alpha^2 [\sigma_{\min}^2, \sigma_{\max}^2], \begin{aligned} \widehat{\mathcal{R}}(\alpha f_0(u(t))) &\leq \widehat{\mathcal{R}}_0 \exp(- 2 t \alpha^2 \sigma_{\min}^2), \\ \|u(t) - u(0)\| &\leq \frac {\sqrt{2 \sigma_{\max}^2 \widehat{\mathcal{R}}_0}}{\alpha \sigma_{\min}^2}. \end{aligned}
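The decay for the tangent model is also easy to see numerically; the following sketch (hypothetical sizes, \alpha = 1, and initial predictions 0 are assumptions of the sketch) Euler-discretizes the prediction-space dynamics and compares the final squared error against \widehat{\mathcal{R}}_0 \exp(-2t\alpha^2\sigma_{\min}^2).

```python
import numpy as np

# Sketch: Euler discretization of the tangent-model prediction dynamics
#   d/dt p(t) = -alpha^2 J0 J0^T (p(t) - y),
# checking the exp(-2 t alpha^2 sigma_min^2) decay of the squared-error risk.
rng = np.random.default_rng(0)
n, p_dim, alpha = 5, 50, 1.0
J0 = rng.normal(size=(n, p_dim))
y = rng.normal(size=n)
K = alpha ** 2 * J0 @ J0.T                       # Gram / kernel matrix
sig_min2 = np.linalg.eigvalsh(K).min()

pred = np.zeros(n)                               # alpha * f0(u(0)), taken to be 0
R0 = 0.5 * np.sum((pred - y) ** 2)
dt, T = 1e-3, 2.0
for _ in range(int(T / dt)):
    pred = pred - dt * K @ (pred - y)
print(0.5 * np.sum((pred - y) ** 2) <= R0 * np.exp(-2 * T * sig_min2) + 1e-12)
```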

How about w(t)?

Let’s relate (J_wJ_w^{\scriptscriptstyle\mathsf{T}}) to (J_0 J_0^{\scriptscriptstyle\mathsf{T}}).

Lemma 8.3.
Suppose \|w-w(0)\| \leq B = \frac {\sigma_{\min}}{2\beta}. Then \begin{aligned} \sigma_{\min}(J_w) &\geq \sigma_{\min}- \beta \|w - w(0)\|_2 \geq \frac {\sigma_{\min}}{2}, \\ \sigma_{\max}(J_w) &\leq \frac {3\sigma_{\max}}{2}. \end{aligned}

Proof. For the upper bound, \|J_w\| \leq \|J_0\| + \|J_w - J_0\| \leq \|J_0\| + \beta \|w - w(0)\| \leq \sigma_{\max}+ \beta B = \sigma_{\max}+ \frac {\sigma_{\min}}{2}. For the lower bound, given vector v define A_v := J_0^{\scriptscriptstyle\mathsf{T}}v and B_v := (J_w - J_0)^{\scriptscriptstyle\mathsf{T}}v, whereby \|A_v\| \geq \sigma_{\min}\|v\|, \qquad \|B_v\| \leq \|J_w - J_0\|\cdot\|v\| \leq \beta B \|v\|, and thus \begin{aligned} \sigma_{\min}(J_w)^2 &= \min_{\|v\| = 1} v^{\scriptscriptstyle\mathsf{T}}J_w J_w^{\scriptscriptstyle\mathsf{T}}v \\ &= \min_{\|v\| = 1} \left({ (J_0 + J_w - J_0)^{\scriptscriptstyle\mathsf{T}}v}\right)^{\scriptscriptstyle\mathsf{T}}(J_0 + J_w - J_0)^{\scriptscriptstyle\mathsf{T}}v \\ &= \min_{\|v\| = 1} \|A_v\|^2 + 2 A_v^{\scriptscriptstyle\mathsf{T}}B_v + \|B_v\|^2 \\ &\geq \min_{\|v\| = 1} \|A_v\|^2 - 2 \|A_v\|\cdot\| B_v\| + \|B_v\|^2 \\ &= \min_{\|v\| = 1} \left({ \|A_v\| - \| B_v\| }\right)^2 \geq \min_{\|v\| = 1} \left({ \sigma_{\min}- \beta B }\right)^2 \|v\|^2 = \left({ \frac {\sigma_{\min}}{2} }\right)^2 . \end{aligned}

Using this, let T := \sup\left\{{ t \geq 0 : \max_{s\in[0,t]} \|w(s) - w(0)\| \leq B }\right\}; then for t\in[0,T], \dot w(t) = - \alpha J_w^{\scriptscriptstyle\mathsf{T}}\nabla\widehat{\mathcal{R}}(\alpha f(w(t))), \quad \frac {{\text{d}}}{{\text{d}}t} \alpha f(w(t)) = -\alpha^2 (J_w J_w^{\scriptscriptstyle\mathsf{T}}) \nabla\widehat{\mathcal{R}}(\alpha f(w(t))). Thus since Q_t := \alpha^2 J_tJ_t^{\scriptscriptstyle\mathsf{T}} satisfies \lambda_i(Q_t) \in \alpha^2 [\sigma_{\min}^2/4, 9\sigma_{\max}^2/4], \begin{aligned} \widehat{\mathcal{R}}(\alpha f(w(t))) &\leq \widehat{\mathcal{R}}_0 \exp(- t \alpha^2 \sigma_{\min}^2/2), \\ \|w(t) - w(0)\| &\leq \frac {3 \sqrt{8 \sigma_{\max}^2 \widehat{\mathcal{R}}_0}}{\alpha \sigma_{\min}^2} =: B'. \end{aligned} It remains to show that T=\infty. Invoke, for the first time, the assumed lower bound on \alpha, namely \alpha \geq \frac {\beta \sqrt{1152 \sigma_{\max}^2 \widehat{\mathcal{R}}_0}}{\sigma_{\min}^3}, which by the above implies B' \leq \frac B 2. Suppose contradictorily that T < \infty; since t\mapsto w(t) is continuous, then t\mapsto \|w(t) - w(0)\| is also continuous and starts from 0, and therefore \|w(T) - w(0)\| = B>0 exactly. But due to the lower bound on \alpha, we also have \|w(T) - w(0)\| \leq \frac B 2 < B, a contradiction.

This completes the proof.

Remark 8.7 (retrospective) .

8.2 Smoothness-based proof

Under construction.

The analysis in this subsection parallels the smoothness-based analysis of Theorem 7.3.

9 Nonsmoothness, Clarke differentials, and positive homogeneity

Smoothness and differentiability do not in general hold for us (ReLU, max-pooling, hinge loss, etc.).

One relaxation of the gradient is the subdifferential set {\partial_{\text{s}}} (whose elements are called subgradients), namely the set of tangents which lie below the predictor: {\partial_{\text{s}}}\widehat{\mathcal{R}}(w) := \left\{{ s \in\mathbb{R}^p : \forall w'\centerdot \widehat{\mathcal{R}}(w') \geq \widehat{\mathcal{R}}(w) + s^{\scriptscriptstyle\mathsf{T}}(w'-w)}\right\}.

One fun application is a short proof of Jensen’s inequality.

Lemma 9.1 (Jensen’s inequality) .
Suppose random variable X is supported on a set S, and f is convex on S. Then \mathop{\mathbb{E}}f(X) \geq f(\mathop{\mathbb{E}}X).

Proof. Choose any s\in{\partial_{\text{s}}}f(\mathop{\mathbb{E}}X), and note \mathop{\mathbb{E}}f(X) \geq \mathop{\mathbb{E}}\left[{f(\mathop{\mathbb{E}}(X)) + s^{\scriptscriptstyle\mathsf{T}}(X - \mathop{\mathbb{E}}X)}\right] = f(\mathop{\mathbb{E}}X).

Typically, we lack convexity, and the subdifferential set is empty.
Our main formalism is the Clarke differential (Clarke et al. 1998): \partial\widehat{\mathcal{R}}(w) := \textrm{conv}\left({\left\{{ s \in \mathbb{R}^p : \exists w_i \to w, \nabla\widehat{\mathcal{R}}(w_i)\to s }\right\}}\right).

Definition 9.1.
f is locally Lipschitz when for every point x, there exists a neighborhood S\supseteq\{x\} such that f is Lipschitz when restricted to S.

Key properties:

We can replace the gradient flow differential equation \dot w(t) = -\nabla\mathcal{R}(w(t)) with a differential inclusion: \dot w(t) \in -\partial \widehat{\mathcal{R}}(w(t)) \qquad \text{for a.e. } t\geq 0. If R satisfies some technical structural conditions, then the following nice properties hold; these properties are mostly taken from (Lemma 5.2, Theorem 5.8, Davis et al. 2018) (where the structural condition is C^1 Whitney stratifiability), which was slightly generalized in (Ji and Telgarsky 2020) under o-minimal definability; another alternative, followed in (Lyu and Li 2019), is to simply assume that a chain rule holds.

This allows us to reprove our stationary point guarantee from an earlier lecture: since \begin{aligned} \widehat{\mathcal{R}}(w(t)) - \widehat{\mathcal{R}}(w(0)) = -\int_0^t \min\{ \|v\|^2 : v\in\partial \widehat{\mathcal{R}}(w(s)) \}{\text{d}}s \leq - t \min_{\substack{s\in [0,t]\\v\in\partial\widehat{\mathcal{R}}(w(s))}} \|v\|^2, \end{aligned} then just as before \begin{aligned} \min_{\substack{s\in [0,t]\\v\in\partial\widehat{\mathcal{R}}(w(s))}} \|v\|^2 \leq \frac{\widehat{\mathcal{R}}(w(0)) - \widehat{\mathcal{R}}(w(t))}{t}, \end{aligned} thus for some time s\in[0,t], we have an iterate w(s) which is an approximate stationary point.

Remark 9.1.
Let’s go back to \dot w(t) := \mathop{\mathrm{arg\,min}}\{\|v\| : v\in-\partial \widehat{\mathcal{R}}(w(t))\}, which we said will hold almost everywhere.

This is not satisfied by pytorch/tensorflow/jax/…

(Kakade and Lee 2018) gives some bad examples, e.g., x\mapsto\sigma(\sigma(x)) - \sigma(-x) with \sigma the ReLU, evaluated at 0. (Kakade and Lee 2018) also give a randomized algorithm for finding good subdifferentials.

Does it matter? In the NTK regime, few activations change. In practice, many change, but it’s unclear what their effect is.

9.1 Positive homogeneity

Another tool we will use heavily outside convexity is positive homogeneity.

Definition 9.2.
g is positive homogeneous of degree L when g(\alpha x) = \alpha^L g(x) for \alpha\geq 0. (We will only consider continuous g, so \alpha>0 suffices.)
Example 9.1.
Remark 9.2.
The math community also has a notion of homogeneity without positivity; the monomial example above works with \alpha<0. Homogeneity in math is often tied to polynomials and generalizations thereof.
Example 9.2.

9.2 Positive homogeneity and the Clarke differential

Let’s work out an element of the Clarke differential for a ReLU network x \mapsto W_L \sigma_{L-1}(\cdots W_2 \sigma_1(W_1 x)). As a function of x, this mapping is 1-homogeneous and piecewise affine. As a function of w=(W_L,\ldots,W_1), it is L-homogeneous and piecewise polynomial. The boundary regions form a set of (Lebesgue) measure zero (with respect to either the inputs or the parameters).

Fixing x and considering w, interior to each piece, the mapping is differentiable. Due to the definition of Clarke differential, it therefore suffices to compute the gradients in all adjacent pieces, and then take their convex hull.

Remark 9.3.
Note that we are not forming the differential by choosing an arbitrary differential element for each ReLU: we are doing a more complicated region-based calculation. However, the former is what pytorch does.

So let’s return to considering some w where f is differentiable. Let A_i be a diagonal matrix with the activations of the output after layer i on the diagonal: A_i = \textrm{diag}\left({ \sigma'(W_i\sigma(\dots \sigma(W_1 x) \dots )) }\right) (note that x is baked in), and so \sigma(r)=r\sigma'(r) implies layer i outputs x \mapsto A_i W_i\sigma(\dots \sigma(W_1 x) \dots ) = A_i W_i A_{i-1} W_{i-1} \cdots A_1 W_1 x, and the network outputs f(x; w) = W_L A_{L-1} W_{L-1} A_{L-2} \cdots A_1 W_1 x, and the gradient with respect to layer i is \frac {{\text{d}}}{{\text{d}}W_i} f(x; w) = (W_L A_{L-1} \cdots W_{i+1} A_i)^{\scriptscriptstyle\mathsf{T}}(A_{i-1} W_{i-1} \cdots W_1 x)^{\scriptscriptstyle\mathsf{T}}. Additionally \begin{aligned} \left\langle W_i , \frac {{\text{d}}}{{\text{d}}W_i} f(x; w) \right \rangle &= \left\langle W_i , (W_L A_{L-1} \cdots W_{i+1} A_i)^{\scriptscriptstyle\mathsf{T}}(A_{i-1} W_{i-1} \cdots W_1 x)^{\scriptscriptstyle\mathsf{T}} \right \rangle \\ &= \textrm{tr}\left({ W_i^{\scriptscriptstyle\mathsf{T}}(W_L A_{L-1} \cdots W_{i+1} A_i)^{\scriptscriptstyle\mathsf{T}}(A_{i-1} W_{i-1} \cdots W_1 x)^{\scriptscriptstyle\mathsf{T}} }\right) \\ &= \textrm{tr}\left({ (W_L A_{L-1} \cdots W_{i+1} A_i)^{\scriptscriptstyle\mathsf{T}}(W_i A_{i-1} W_{i-1} \cdots W_1 x)^{\scriptscriptstyle\mathsf{T}} }\right) \\ &= \textrm{tr}\left({ (W_i A_{i-1} W_{i-1} \cdots W_1 x)^{\scriptscriptstyle\mathsf{T}} (W_L A_{L-1} \cdots W_{i+1} A_i)^{\scriptscriptstyle\mathsf{T}} }\right) \\ &= \textrm{tr}\left({ W_L A_{L-1} \cdots W_{i+1} A_i W_i A_{i-1} W_{i-1} \cdots W_1 x }\right) \\ &= f(x; w), \end{aligned} and in particular \left\langle W_i , \frac {{\text{d}}}{{\text{d}}W_i} f(x; w) \right \rangle = f(x; w) = \left\langle W_{i+1} , \frac {{\text{d}}}{{\text{d}}W_{i+1}} f(x; w) \right \rangle.
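The layer-wise identity \left\langle W_i, \frac{{\text{d}}}{{\text{d}}W_i} f(x;w)\right\rangle = f(x;w) is easy to check numerically; here is a minimal sketch (a hypothetical small bias-free ReLU network, with the layer gradients computed by a hand-rolled backward pass in numpy).

```python
import numpy as np

# Sketch: for a small bias-free ReLU network, check <W_i, d f / d W_i> = f(x; w)
# at every layer, via a hand-rolled forward/backward pass (hypothetical sizes).
rng = np.random.default_rng(0)
dims = [4, 6, 5, 1]                              # input dim 4, two hidden layers, scalar output
Ws = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
x = rng.normal(size=dims[0])

# forward pass, storing pre-activations z and post-activations h
hs, zs = [x], []
for W in Ws[:-1]:
    zs.append(W @ hs[-1])
    hs.append(np.maximum(zs[-1], 0.0))
f = float(Ws[-1] @ hs[-1])                       # network output

# backward pass, building d f / d W_i for each layer
grads = [np.outer(np.ones(1), hs[-1])]           # d f / d W_L
g = Ws[-1].reshape(-1)                           # d f / d h_{L-1}
for i in range(len(Ws) - 2, -1, -1):
    g = g * (zs[i] > 0)                          # d f / d z_i
    grads.insert(0, np.outer(g, hs[i]))          # d f / d W_i
    g = Ws[i].T @ g                              # d f / d h_{i-1}

for W, G in zip(Ws, grads):
    print(np.isclose(np.sum(W * G), f))          # each inner product equals f(x; w)
```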

This calculation can in fact be made much more general (indeed with a simpler proof!).

Lemma 9.2.
Suppose f:\mathbb{R}^d\to\mathbb{R} is locally Lipschitz and L-positively homogeneous. For any w\in\mathbb{R}^d and s \in \partial f(w), \left\langle s, w \right \rangle =Lf(w).
Remark 9.4.
This statement appears in various places (Lyu and Li 2019); the version here is somewhat more general, and appears in (Ji and Telgarsky 2020).

Proof. If w = 0, then \left\langle s, w \right \rangle = 0 = Lf(w) for every s\in\partial f(w), so consider the case w\neq 0. Let D denote those w where f is differentiable, and consider the case that w\in D\setminus \{0\}. By the definition of gradient, \begin{aligned} \lim_{\delta\downarrow0}\frac{f(w+\delta w)-f(w)-\left\langle \nabla f(w), \delta w \right \rangle}{\delta\|w\|}=0, \end{aligned} and by using homogeneity in the form f(w+\delta w)=(1+\delta)^Lf(w) (for any \delta > 0), then \begin{aligned} 0 &= \lim_{\delta\downarrow0}\frac{\left({(1+\delta)^L-1}\right)f(w)-\left\langle \nabla f(w), \delta w \right \rangle}{\delta} = - \left\langle \nabla f(w), w \right \rangle + \lim_{\delta\downarrow0} f(w) \left({ L + \mathcal{O}(\delta) }\right), \end{aligned} which implies \left\langle w, \nabla f(w) \right \rangle=Lf(w).

Now consider w \in \mathbb{R}^d \setminus D \setminus \{0\}. For any sequence (w_i)_{i\geq 1} in D with \lim_i w_i=w for which there exists a limit s := \lim_i \nabla f(w_i), then \begin{aligned} \left\langle w, s \right \rangle=\lim_{i\to\infty}\left\langle w_i, \nabla f(w_i) \right \rangle=\lim_{i\to\infty}Lf(w_i)=Lf(w). \end{aligned} Lastly, for any element s\in\partial f(w) written in the form s = \sum_i \alpha_i s_i where \alpha_i\geq 0 satisfy \sum_i \alpha_i = 1 and each s_i is a limit of a sequence of gradients as above, then \left\langle w, s \right \rangle = \left\langle w, \sum_i \alpha_i s_i \right \rangle = \sum_i \alpha_i \left\langle w, s_i \right \rangle = \sum_i \alpha_i L f(w) = Lf(w).

9.3 Norm preservation

If predictions are positive homogeneous with respect to each layer, then gradient flow preserves norms of layers.

Lemma 9.3 (Simon S. Du, Hu, and Lee (2018)) .
Suppose for \alpha >0, f(x; (W_L,\ldots, \alpha W_i,\ldots,W_1)) = \alpha f(x; w) (predictions are 1-homogeneous in each layer). Then for every pair of layers (i,j), the gradient flow maintains \frac 1 2 \|W_i(t)\|^2 - \frac 1 2 \|W_i(0)\|^2 = \frac 1 2 \|W_j(t)\|^2 - \frac 1 2 \|W_j(0)\|^2.
Remark 9.5.
We’ll assume a risk of the form \mathop{\mathbb{E}}_k \ell(y_k f(x_k;w)), but it holds more generally. We are also tacitly assuming we can invoke the chain rule, as discussed above.

Proof. Defining \ell_k'(s) := y_k \ell'(y_k f(x_k; w(s))), and fixing a layer i, \begin{aligned} \frac 1 2 \|W_i(t)\|^2 - \frac 1 2 \|W_i(0)\|^2 &= \int_0^t \frac {{\text{d}}} {{\text{d}}t} \frac 1 2 \|W_i(s)\|^2 {\text{d}}s \\ &= \int_0^t \left\langle W_i(s) , \dot W_i(s) \right \rangle {\text{d}}s \\ &= \int_0^t \left\langle W_i(s) , -\mathop{\mathbb{E}}_k \ell'_k(s) \frac {{\text{d}}f(x_k; w)}{{\text{d}}W_i(s)} \right \rangle{\text{d}}s \\ &= - \int_0^t \mathop{\mathbb{E}}_k \ell'_k(s) \left\langle W_i(s) , \frac {{\text{d}}f(x_k; w)}{{\text{d}}W_i(s)} \right \rangle{\text{d}}s \\ &= - \int_0^t \mathop{\mathbb{E}}_k \ell'_k(s) f(x_k; w) {\text{d}}s. \end{aligned} This final expression does not depend on i, which gives the desired equality.
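A small numerical sketch of norm preservation (a hypothetical two-layer ReLU network with the logistic loss; plain gradient descent with a small step size is used as a stand-in for the gradient flow, so the equality holds only approximately):

```python
import numpy as np

# Sketch: with a small step size (approximating the gradient flow), the layer-norm
# changes 0.5*||W_i(t)||^2 - 0.5*||W_i(0)||^2 stay approximately equal across layers.
rng = np.random.default_rng(0)
n, d, m = 20, 3, 10
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
W1 = rng.normal(size=(m, d)) / np.sqrt(d)
W2 = rng.normal(size=(1, m)) / np.sqrt(m)
n1_0, n2_0 = 0.5 * np.sum(W1 ** 2), 0.5 * np.sum(W2 ** 2)

eta = 1e-3
for _ in range(5000):
    Z = X @ W1.T                                 # (n, m) pre-activations
    H = np.maximum(Z, 0.0)
    f = (H @ W2.T).reshape(-1)                   # (n,) predictions
    lp = -y / (1.0 + np.exp(y * f))              # y_k * l'(y_k f_k) for the logistic loss
    G2 = (lp[:, None] * H).mean(axis=0, keepdims=True)   # d risk / d W2
    G1 = ((lp[:, None] * W2) * (Z > 0)).T @ X / n        # d risk / d W1
    W1, W2 = W1 - eta * G1, W2 - eta * G2

print(0.5 * np.sum(W1 ** 2) - n1_0, 0.5 * np.sum(W2 ** 2) - n2_0)  # nearly equal
```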

Remark 9.6.
One interesting application is to classification losses like \exp(-z) and \ln(1+\exp(-z)), where \widehat{\mathcal{R}}(w)\to 0 implies \min_k y_k f(x_k; w) \to \infty.

This by itself implies \|W_j\| \to \infty for some j; combined with norm preservation, \min_j \|W_j\| \to \infty !

need to update this in light of the new material i’ve included?

9.4 Smoothness inequality adapted to ReLU

Let’s consider a single hidden ReLU layer, with only the bottom layer trainable: f(x;w) := \frac {1}{\sqrt m} \sum_j a_j \sigma(\left\langle x, w_j \right \rangle), \qquad a_j \in \{\pm 1\}. Let W_s\in \mathbb{R}^{m\times d} denote the parameters at time s, and suppose \|x\|\leq 1. Then \begin{aligned} \frac {{\text{d}}f(x;W)}{{\text{d}}W} &= \begin{bmatrix} a_1 x \sigma'(w_1^{\scriptscriptstyle\mathsf{T}}x) / \sqrt{m} \\ \vdots \\ a_m x \sigma'(w_m^{\scriptscriptstyle\mathsf{T}}x) / \sqrt{m} \end{bmatrix}, \\ \left\|{ \frac {{\text{d}}f(x;W)}{{\text{d}}W} }\right\|_{{\textrm{F}}}^2 &= \sum_j \left\|{ a_j x \sigma'(w_j^{\scriptscriptstyle\mathsf{T}}x) / \sqrt{m}}\right\|^2_2 \leq \frac 1 m \sum_j \left\|{ x }\right\|^2_2 \leq 1. \end{aligned}

We’ll use the logistic loss, whereby \begin{aligned} \ell(z) &= \ln(1+\exp(-z)), \\ \ell'(z) &= \frac {-\exp(-z)}{1+ \exp(-z)} \in (-1,0), \\ \widehat{\mathcal{R}}(W) &:= \frac 1 n \sum_k \ell(y_k f(x_k;W)). \end{aligned} A key fact (which can be verified with derivatives) is |\ell'(z)| = -\ell'(z) \leq \ell(z), whereby \begin{aligned} \frac {{\text{d}}\widehat{\mathcal{R}}}{{\text{d}}W} &= \frac 1 n \sum_k \ell'(y_k f(x_k; W)) y_k \nabla_W f(x_k; W), \\ \left\|{ \frac {{\text{d}}\widehat{\mathcal{R}}}{{\text{d}}W} }\right\|_{{\textrm{F}}} &\leq \frac 1 n \sum_k | \ell'(y_k f(x_k; W)) | \cdot \|y_k \nabla_W f(x_k; W)\|_{{\textrm{F}}} \\ &\leq \frac 1 n \sum_k | \ell'(y_k f(x_k; W)) | \leq \min\left\{{ 1, \widehat{\mathcal{R}}(W) }\right\}. \end{aligned}
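The key fact |\ell'(z)| \leq \ell(z) used above can also be sanity-checked numerically; a minimal sketch:

```python
import numpy as np

# Sketch: numerically scan the key fact |l'(z)| <= l(z) for the logistic loss.
z = np.linspace(-30.0, 30.0, 100001)
loss = np.log1p(np.exp(-z))                      # l(z) = ln(1 + exp(-z))
dloss = -1.0 / (1.0 + np.exp(z))                 # l'(z), which lies in (-1, 0)
print(np.all(np.abs(dloss) <= loss + 1e-12))
```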

Now we can state a non-smooth, non-convex analog to Theorem 7.3.

Lemma 9.4 ((Lemma 2.6, Ji and Telgarsky 2019a)) .
If \eta\leq 1, for any Z, \|W_t - Z\|_{{\textrm{F}}}^2 + \eta \sum_{i<t} \widehat{\mathcal{R}}^{(i)}(W_i) \leq \|W_0 - Z\|_{{\textrm{F}}}^2 + 2 \eta \sum_{i<t} \widehat{\mathcal{R}}^{(i)}(Z), where \widehat{\mathcal{R}}^{(i)}(W) = \frac 1 n \sum_k \ell(y_k \left\langle W, \nabla f(x_k; W_i) \right \rangle).
Remark 9.7.

Proof. Using the squared distance potential as usual, \begin{aligned} \|W_{i+1}-Z\|_{\textrm{F}}^2 &= \|W_{i}-Z\|_{\textrm{F}}^2 - 2 \eta \left\langle \nabla\widehat{\mathcal{R}}(W_i), W_i - Z \right \rangle + \eta^2 \|\nabla\widehat{\mathcal{R}}(W_i)\|_{\textrm{F}}^2, \end{aligned} where \displaystyle \|\nabla\widehat{\mathcal{R}}(W_i)\|_{\textrm{F}}^2 \leq \|\nabla\widehat{\mathcal{R}}(W_i)\|_{\textrm{F}}\leq \widehat{\mathcal{R}}(W_i) = \widehat{\mathcal{R}}^{(i)}(W_i), and \begin{aligned} & n \left\langle \nabla\widehat{\mathcal{R}}(W_i), Z-W_i \right \rangle \\ &= \sum_k y_k \ell'(y_kf(x_k;W_i)) \left\langle \nabla_W f(x_k;W_i) , Z-W_i \right \rangle \\ &= \sum_k \ell'(y_kf(x_k;W_i)) \left({y_k \left\langle \nabla_W f(x_k;W_i) , Z \right \rangle - y_k f(x_k;W_i) }\right) \\ &\leq \sum_k \left({\ell(y_k \left\langle \nabla_W f(x_k; W_i) , Z \right \rangle ) - \ell(y_k f(x_k;W_i)) }\right) \\ &= n \left({\widehat{\mathcal{R}}^{(i)}(Z) - \widehat{\mathcal{R}}^{(i)}(W_i)}\right). \end{aligned} Together, \begin{aligned} \|W_{i+1}-Z\|_{\textrm{F}}^2 &\leq \|W_{i}-Z\|_{\textrm{F}}^2 + 2\eta \left({\widehat{\mathcal{R}}^{(i)}(Z) - \widehat{\mathcal{R}}^{(i)}(W_i)}\right) + \eta \widehat{\mathcal{R}}_i(W_i); \end{aligned} applying \sum_{i<t} to both sides gives the bound.

10 Margin maximization and implicit bias

During 2015-2016, various works pointed out that deep networks generalize well, even though parameter norms are large and there is no explicit regularization (Neyshabur, Tomioka, and Srebro 2014; Zhang et al. 2017). This prompted authors to study the implicit bias of gradient descent, the first such result being an analysis of linear predictors with linearly separable data, showing that gradient descent on the cross-entropy loss is implicitly biased towards a maximum margin direction (Soudry, Hoffer, and Srebro 2017).

This in turn inspired many other works, handling other types of data, networks, and losses (Ji and Telgarsky 2019b, 2018, 2020; Gunasekar et al. 2018a; Lyu and Li 2019; Chizat and Bach 2020; Ji et al. 2020).

Margin maximization of first-order methods applied to exponentially-tailed losses was first proved for coordinate descent (Telgarsky 2013). The basic proof scheme there was pretty straightforward, and based on the similarity of the empirical risk (after the monotone transformation \ln(\cdot)) to \ln \sum \exp, itself similar to \max(\cdot) and thus to margin maximization; we will use this connection as a basis for all proofs in this section (see also (Ji and Telgarsky 2019b; Gunasekar et al. 2018b)).

Throughout this section, fix training data ((x_i,y_i))_{i=1}^n, and define an (unnormalized) margin mapping m_i(w) := y_i f(x_i;w); by this choice, we can also conveniently write an unnormalized risk \mathcal L: \mathcal L(w) := \sum_i \ell(m_i(w)) = \sum_i \ell(y_i f(x_i;w)). Throughout this section, we will always assume f is locally-Lipschitz and L-homogeneous in w, which also means each m_i is locally-Lipschitz and L-homogeneous.

We will also use the exponential loss \ell(z) = \exp(-z). The results go through for similar losses.

Remark 10.1 (generalization) .
As hinted before, margin maximization is one way gradient descent prefers a solution which has a hope to generalize well, and not merely achieve low empirical risk. This low generalization error of large-margin predictors will appear explicitly later on in section 13.4.
Remark 10.2 (implicit bias) .
As mentioned above, the proofs here will show implicit margin maximization, which is enough to invoke the generalization theory in section 13.4. However, in certain cases it is valuable to moreover prove convergence rates to the maximum margin direction. In the linear case, it is possible to convert a margin maximization rate to an implicit bias rate, however the rate degrades by a factor \sqrt{\cdot} (Ji and Telgarsky 2019b); analyzing the implicit bias without degradation in the rate is more involved, and not treated here (Soudry, Hoffer, and Srebro 2017).
Remark 10.3 (squared loss) .
While the focus here is on losses with exponential tails and on bias towards the maximum margin direction, there are also many works (not further discussed here) which consider the squared loss (Gunasekar et al. 2017; Arora, Cohen, et al. 2018b, 2019).

10.1 Separability and margin maximization

We just said “maximum margin” and “separable data.” What do these mean?

Consider a linear predictor, meaning x\mapsto \left\langle w, x \right \rangle for some w\in\mathbb{R}^d. This w “separates the data” if y_i and \textrm{sgn}(\left\langle w, x_i \right \rangle) agree, which we can relax to the condition of strict separability, namely \min_i y_i \left\langle w, x_i \right \rangle > 0. It seems reasonable, and a nice inductive bias, to ask to be as far from 0 as possible: \max_{w\in ?} \min_i y_i \left\langle w, x_i \right \rangle > 0. The “?” indicates that we must somehow normalize or constrain, since otherwise, for separable data, this \max becomes a \sup and has value +\infty.

Definition 10.1.
Data is linearly separable when there exists w\in\mathbb{R}^d so that \min_i y_i \left\langle w, x_i \right \rangle > 0. In this situation, the (\ell_2) maximum margin predictor (which is unique!) is given by \bar u:= \mathop{\mathrm{arg\,max}}_{\|w\|=1}\min_i y_i \left\langle w, x_i \right \rangle, and the margin is \gamma := \min_i y_i \left\langle \bar u, x_i \right \rangle.
Remark 10.4.
This concept has a long history. Margins first appeared in the classical perceptron analysis (Novikoff 1962), and maximum margin predictors were a guiding motivation for the SVM (need to add many more refs).

Consider now the general case of L-homogeneous predictors, where y_i\left\langle w, x_i \right \rangle is replaced by m_i(w).

Proposition 10.1.
Suppose f(x;w) is L-homogeneous in w, \ell is the exponential loss, and there exists \hat w with \widehat{\mathcal{R}}({\hat w}) < \ell(0)/n. Then \inf_w \widehat{\mathcal{R}}(w) = 0, and the infimum is not attained.

Proof. Note \max_i \ell(m_i(\hat w)) \leq \sum_i \ell(m_i(\hat w)) = n \widehat{\mathcal{R}}(\hat w) < \ell(0), thus applying the (decreasing) \ell^{-1} to both sides gives \min_i m_i(\hat w) > 0. Therefore 0 \leq \inf_w \widehat{\mathcal{R}}(w) \leq \limsup_{c\to\infty}\widehat{\mathcal{R}}(c{\hat w}) = \frac 1 n \sum_i \limsup_{c\to\infty} \ell(m_i(c{\hat w})) = \frac 1 n \sum_i \limsup_{c\to\infty} \ell(c^L m_i({\hat w})) = 0.

This seems to be problematic; how can we “find” an “optimum,” when solutions are off at infinity? Moreover, we do not even have unique directions, nor a way to tell different ones apart!

We can use margins, now appropriately generalized to the L-homogeneous case, to build towards a better-behaved objective function. First note that since \min_i m_i(w) = \|w\|^L \min_i m_i\left({\frac{w}{\|w\|}}\right), we can compare different directions by normalizing the margin by \|w\|^L. Moreover, again using the exponential loss, \begin{aligned} \frac {\ell^{-1}\left({\mathcal L(w)}\right)}{\|w\|^L} + \frac{\ln(n)}{\|w\|^L} &= \frac {\ell^{-1}\left({\sum_i \ell(m_i(w))/n}\right)}{\|w\|^L} \geq \frac {\min_i m_i(w)}{\|w\|^L} \\ &= \frac {\ell^{-1}\left({\max_i \ell(m_i(w))}\right)}{\|w\|^L} \\ &\geq \frac{\ell^{-1}\left({ \mathcal L(w) }\right)}{\|w\|^L}. \end{aligned}(5)

This motivates the following definition.

Definition 10.2.
Say the data is \vec m-separable when there exists w so that \min_i m_i(w) > 0. Define the margin, maximum margin, and smooth margin respectively as \gamma(w) := \min_i m_i(w/\|w\|) = \frac {\min_i m_i(w)}{\|w\|^L}, \qquad {\bar{\gamma}}:= \max_{\|w\|= 1} \gamma(w), \qquad {\tilde{\gamma}}(w) := \frac {\ell^{-1}(\mathcal L(w))}{\|w\|^L}. (6) decide something about w=0\dots.
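For the linear (L = 1) case with the exponential loss, the relationship between the margin and the smoothed margin is easy to check numerically; the following sketch (hypothetical separable data) verifies the sandwich {\tilde{\gamma}}(w) \leq \gamma(w) \leq {\tilde{\gamma}}(w) + \ln(n)/\|w\|^L implied by eq. 5 and eq. 6.

```python
import numpy as np

# Sketch for the linear (L = 1) case with the exponential loss: numerically check
#   smoothed margin <= margin <= smoothed margin + ln(n) / ||w||^L.
rng = np.random.default_rng(0)
n, d = 50, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)                          # linearly separable by construction

def margins(w):
    return y * (X @ w)                           # m_i(w)

def gamma(w):                                    # margin of the normalized predictor
    return margins(w).min() / np.linalg.norm(w)

def smoothed_gamma(w):                           # ell^{-1}(L(w)) / ||w||, ell(z) = exp(-z)
    return -np.log(np.sum(np.exp(-margins(w)))) / np.linalg.norm(w)

w = 5.0 * w_star                                 # any direction with positive margin
g, sg = gamma(w), smoothed_gamma(w)
print(sg <= g <= sg + np.log(n) / np.linalg.norm(w))
```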
Remark 10.5.
The terminology “smoothed margin” is natural for L-homogeneous predictors, but even so it seems to have only appeared recently in (Lyu and Li 2019). In the 1-homogeneous case, the smoothed margin appeared much earlier, indeed throughout the boosting literature (Schapire and Freund 2012).
Remark 10.6 (multiclass margins) .
There is also a natural notion of multiclass margin: \min_i \frac {f(x_i;w)_{y_i} - \max_{j \neq y_i} f(x_i;w)_j}{\|w\|^L}. The natural loss to consider in this setting is the cross-entropy loss.

The basic properties can be summarized as follows.

Proposition 10.2.

Suppose data is \vec m-separable. Then:

Proof. The first part follows by continuity of m_i(w) and compactness of \{w\in\mathbb{R}^p : \|w\|=1\}, and the second from eq. 6 and eq. 5.

Remark 10.7.
For the linear case, margins have a nice geometric interpretation. This is not currently true for the general homogeneous case: there is no known reasonable geometric characterization of large margin predictors even for simple settings.

10.2 Gradient flow maximizes margins of linear predictors

Let’s first see how far we can get in the linear case, using one of our earlier convex optimization tools, namely Theorem 7.4. Throughout this subsection, the flows are initialized at the origin (w(0) = 0, and later \theta(0) = 0), whereby in particular \mathcal L(w(0)) = n.

Lemma 10.1.
Consider the linear case, with linearly separable data and the exponential loss, and \max_i \|x_iy_i\|\leq 1. Then \begin{aligned} \mathcal L(w_t) &\leq \frac {1 + \ln(2tn\gamma^2)^2}{2t\gamma^2}, \\ \|w_t\| &\geq \ln(2tn\gamma^2) - \ln\left({1+ \ln (2tn\gamma^2)^2}\right). \end{aligned}
Remark 10.8.
The intuition we will follow for the proof is: for every unit of norm, the (unnormalized) margin increases by at least \gamma. Thus the margin bias affects the entire gradient descent process.

Later, when we study the L-homogeneous case, we are only able to show that for every unit of norm (to the power L), the (unnormalized) margin increases by at least the current margin, which implies the margin is nondecreasing, but not that it is maximized.

Proof. By Theorem 7.4 with z = \ln(c)\bar u/\gamma for some c>0, \begin{aligned} \mathcal L(w(t)) &\leq \mathcal L(z) + \frac {1}{2t}\left({\|z\|^2 - \|w(t)-z\|^2}\right) \leq \sum_i \ell(m_i(z)) + \frac {\|z\|^2}{2t} \\ &\leq \sum_i \exp(-\ln(c)) + \frac {\ln(c)^2}{2t\gamma^2} = \frac {n}{c} + \frac {\ln(c)^2}{2t\gamma^2}, \end{aligned} and the first inequality follows from the choice c := 2tn\gamma^2. For the lower bound on \|w_t\|, using the preceding inequality, \ell\left({ \|w_t\| }\right) \leq \min_i \ell(m_i(w_t)) \leq \frac 1 n \mathcal L(w_t) \leq \frac {1+\ln(2tn\gamma^2)^2}{2tn\gamma^2}, and the second inequality follows by applying \ell^{-1} to both sides.

This nicely shows that we decrease the risk to 0, but not that we maximize margins. For this, we need a more specialized analysis.

Theorem 10.1.
Consider the linear case, with linearly separable data and the exponential loss, and \max_i \|x_iy_i\|\leq 1. Then \gamma(w_t) \geq {\tilde{\gamma}}(w_t) \geq {\bar{\gamma}}- \frac {\ln n}{\ln t + \ln(2n\gamma^2) - 2\ln\ln(2tne\gamma^2)} need to check some constants. also that denominator is hideous, maybe require slightly larger t to remove it?

Proof. For convenience, define u(t) := \ell^{-1}(\mathcal L(w(t))) and v(t) := \|w(t)\|, whereby \gamma(w(t)) = \frac {u(t)}{v(t)} = \frac{u(0)}{v(t)} + \frac {\int_0^t\dot u(s){\text{d}}s}{v(t)}. Let’s start by lower bounding the second term. Since \ell' = -\ell, \begin{aligned} \dot u(t) &= \left\langle \frac{ -\nabla \mathcal L(w(t))}{\mathcal L(w(t))}, \dot w(t) \right \rangle = \frac {\|\dot w(t)\|^2}{\mathcal L(w(t))}, \\ \|\dot w(s)\| &\geq \left\langle \dot w(s), \bar u \right \rangle = \left\langle -\sum_i x_iy_i \ell'(m_i(w(s))), \bar u \right \rangle \\ &= \sum_i\ell(m_i(w(s))) \left\langle x_iy_i , \bar u \right \rangle \geq \gamma\sum_i\ell(m_i(w(s))) = \gamma \mathcal L(w(s)), \\ v(t) &= \|w(t)-w(0)\| = \left\|{\int_0^t \dot w(s){\text{d}}s}\right\| \leq \int_0^t \left\|{\dot w(s)}\right\|{\text{d}}s, \end{aligned} thus \begin{aligned} \frac {\int_0^t\dot u(s){\text{d}}s}{v(t)} &\geq \frac {\int_0^t \frac {\|\dot w(s)\|^2}{\mathcal L(w(s))} {\text{d}}s}{v(t)} = \frac {\int_0^t \|\dot w(s)\| \frac {\|\dot w(s)\|}{\mathcal L(w(s))} {\text{d}}s}{v(t)} \geq \frac {\gamma \int_0^t \|\dot w(s)\|{\text{d}}s}{v(t)} = \gamma. \end{aligned}

For the first term u(0) / v(t), note \mathcal L(w(0)) = n and thus u(0) = -\ln n, whereas by the lower bound on \|w(t)\| from Lemma 10.1, \frac {u(0)}{v(t)} = \frac {-\ln(n)}{\|w(t)\|} \geq \frac {-\ln(n)}{\ln(t) + \ln(2n\gamma^2) - 2\ln\ln(2tne\gamma^2)}. Combining these inequalities gives the bound.

We are maximizing margins, but at a glacial rate of 1/\ln(t)!

To get some inspiration, notice that we keep running into \ell^{-1}(\mathcal L(w)) in all the analysis. Why don’t we just run gradient flow on this modified objective? In fact, the two gradient flows are the same!

Remark 10.9 (time rescaling) .
Let w(t) be given by gradient flow on \mathcal L(w(t)), and define a time rescaling h(t) via integration, namely so that \dot h(t) = 1/\mathcal L(w(h(t))). Then, by the substitution rule for integration, \begin{aligned} w(t) - w(0) &= \int_0^t \dot w(s){\text{d}}s = - \int_0^t \nabla\mathcal L(w(s)){\text{d}}s = -\int_{h^{-1}([0,t])} \nabla\mathcal L(w(h(s))) \dot h(s) {\text{d}}s \\ &= -\int_{h^{-1}([0,t])} \frac{\nabla\mathcal L(w(h(s)))}{\mathcal L(w(h(s)))} {\text{d}}s = -\int_{h^{-1}([0,t])} \nabla\ln \mathcal L(w(h(s))) {\text{d}}s. \end{aligned} As such, the gradient flow on \mathcal L and on \ell^{-1}\circ \mathcal L are the same, modulo a time rescaling. This perspective was first explicitly stated by Chizat and Bach (2020), though analyses using this rescaled time (and alternate flow characterization) existed before, e.g., (Lyu and Li 2019).
Theorem 10.2 (time-rescaled flow) .
Consider linear predictors with linearly separable data and the exponential loss, and suppose \dot \theta(t) := \nabla_\theta \ell^{-1} \mathcal L(\theta(t)) = -\nabla_\theta \ln \mathcal L(\theta(t)). Then \gamma(\theta(t)) \geq {\tilde{\gamma}}(\theta(t)) \geq \gamma - \frac {\ln n}{t\gamma^2-\ln n}.

Proof. We start as before: set u(t) := \ell^{-1}\mathcal L(\theta(t)) and v(t) := \|\theta(t)\|; then {\tilde{\gamma}}(t) = \frac {u(t)}{v(t)} = \frac {u(0)}{v(t)} + \frac {\int_0^t \dot u(s){\text{d}}s}{v(t)} = \frac {-\ln n}{v(t)} + \frac {\int_0^t \dot u(s){\text{d}}s}{v(t)}. Bounding these terms is now much simpler than for the regular gradient flow. Since \dot\theta(s) = -\nabla\ln\mathcal L(\theta(s)) and \ell' = -\ell, \begin{aligned} \|\dot \theta(s)\| &\geq \left\langle \dot \theta(s), \bar u \right \rangle = -\sum_i \frac {\ell'(m_i(\theta(s)))}{\mathcal L(\theta(s))} \left\langle x_iy_i, \bar u \right \rangle = \sum_i \frac {\ell(m_i(\theta(s)))}{\mathcal L(\theta(s))} \left\langle x_iy_i, \bar u \right \rangle \geq \gamma, \\ \dot u(s) &= -\left\langle \nabla \ln \mathcal L(\theta(s)), \dot \theta(s) \right \rangle = \|\dot \theta(s)\|^2, \end{aligned} thus \begin{aligned} \ell^{-1}\mathcal L(\theta(t)) &= \ell^{-1}\mathcal L(\theta(0)) + \int_0^t \frac {{\text{d}}}{{\text{d}}s} \ell^{-1}\mathcal L(\theta(s)){\text{d}}s \geq -\ln(n) + t\gamma^2, \\ \frac{\int_0^t \dot u(s){\text{d}}s}{v(t)} &= \frac{\int_0^t \|\dot \theta(s)\|^2{\text{d}}s}{v(t)} \geq \frac{\gamma \int_0^t \|\dot \theta(s)\|{\text{d}}s}{v(t)} \geq \frac{\gamma \|\int_0^t \dot \theta(s){\text{d}}s\|}{v(t)} = \gamma. \end{aligned} On the other hand, \|\theta(t)\|\gamma \geq \|\theta(t)\| \gamma(\theta(t)) \geq \ell^{-1}\mathcal L(\theta(t)) \geq t\gamma^2 - \ln(n). Together, {\tilde{\gamma}}(t) = \frac {u(t)}{v(t)} \geq \gamma - \frac {\ln(n)}{t\gamma^2 -\ln n}.
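As a small demonstration (hypothetical separable data; a fixed-step discrete-time stand-in for the time-rescaled flow, so the monotonicity is only approximate), gradient descent on \ell^{-1}(\mathcal L(w)) = -\ln\mathcal L(w) with the exponential loss drives the normalized margin upward.

```python
import numpy as np

# Sketch: a discrete-time stand-in for the time-rescaled flow, i.e. gradient descent
# on ell^{-1}(L(w)) = -ln L(w) with the exponential loss; the normalized margin
# min_i m_i(w) / ||w|| increases over the run.
rng = np.random.default_rng(0)
n, d = 40, 4
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # so max_i ||x_i y_i|| <= 1
y = np.sign(X @ w_star)

def grad_lnL(w):                                 # gradient of ln sum_i exp(-m_i(w))
    m = y * (X @ w)
    p = np.exp(-(m - m.min()))                   # stabilized weights proportional to exp(-m_i)
    p /= p.sum()
    return -(p[:, None] * (y[:, None] * X)).sum(axis=0)

w, eta, track = np.zeros(d), 0.1, []
for t in range(20000):
    w = w - eta * grad_lnL(w)
    if (t + 1) % 5000 == 0:
        track.append((y * (X @ w)).min() / np.linalg.norm(w))
print("normalized margins over time:", track)    # (essentially) increasing
```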

Remark 10.10.
The preceding two proofs are simplified from (Ji and Telgarsky 2019b), but follow a general scheme from the (coordinate descent!) analysis in (Telgarsky 2013); this scheme was also followed in (Gunasekar et al. 2018b). The proof in (Soudry, Hoffer, and Srebro 2017) is different, and is based on an SVM analogy, since {\tilde{\gamma}}\to \gamma.
Note also that the proofs here do not show w(t) converges to (the unique) maximum margin linear separator, which is easy to do with worse rates, and harder to do with good rates. However, large margins is sufficient for generalization in the linear case, as in section 13.4.

10.3 Smoothed margins are nondecreasing for homogeneous functions

In the nonlinear case, we do not have a general result, and instead only prove that smoothed margins are nondecreasing.

Theorem 10.3 (originally from (Lyu and Li 2019), simplification due to (Ji 2020)) .
Consider the Clarke flow \dot w_t \in - \partial \ln \sum_i \exp(-m_i(w_t)) with w_0 = 0, and once again suppose the chain rule holds for almost all t\geq 0. If there exists t_0 with {\tilde{\gamma}}(w(t_0))>0, then t\mapsto {\tilde{\gamma}}(w(t)) is nondecreasing along [t_0,\infty).

The proof will use the following interesting approximate homogeneity property of \ln \sum \exp.

Lemma 10.2 (taken from (Ji and Telgarsky 2020)) .
For every w and every v\in - \partial \ln \sum \exp(-m_i(w)), assuming the chain rule holds, - L \ln \sum_i \exp(-m_i(w)) \leq \left\langle v, w \right \rangle.

If -\ln\sum_i\exp were itself homogeneous, this would be an equality; instead, using only the L-homogeneity of m_i, we get a lower bound.

Proof. Let v\in - \partial \ln \sum_i \exp(-m_i(w)) be given, whereby (thanks to assuming a chain rule) there exist v_i \in \partial m_i(w) for each i such that v = \sum_{i=1}^n \frac {\exp(-m_i(w)) v_i} {\sum_{j=1}^n \exp(-m_j(w))}. Since \exp(-m_k(w)) \geq 0 for every k and -\ln is decreasing, we have -\ln\exp(-m_i(w)) \geq -\ln\sum_k\exp(-m_k(w)), and thus \begin{aligned} \left\langle v, w \right \rangle &= \left\langle \sum_{i=1}^n \frac {\exp(-m_i(w)) v_i} {\sum_{j=1}^n \exp(-m_j(w))}, w \right \rangle \\ &= \sum_{i=1}^n \frac {\exp(-m_i(w))} {\sum_{j=1}^n \exp(-m_j(w))}\left\langle v_i, w \right \rangle \\ &= L \sum_{i=1}^n \frac {\exp(-m_i(w))} {\sum_{j=1}^n \exp(-m_j(w))}m_i(w) \\ &= L \sum_{i=1}^n \frac {\exp(-m_i(w))} {\sum_{j=1}^n \exp(-m_j(w))}\left({ - \ln \exp(-m_i(w))}\right) \\ &\geq L \sum_{i=1}^n \frac {\exp(-m_i(w))} {\sum_{j=1}^n \exp(-m_j(w))}\left({ - \ln \sum_k \exp(-m_k(w))}\right) \\ &= - L \ln \sum_k \exp(-m_k(w)) \end{aligned} as desired.

Lemma 10.2 leads to a fairly easy proof of Theorem 10.3.

Proof of Theorem 10.3. It will be shown that ({\text{d}}/{\text{d}}t){\tilde{\gamma}}(t) \geq0 whenever {\tilde{\gamma}}(t) > 0, which completes the proof via the following contradiction: suppose there is a time t > t_0 with {\tilde{\gamma}}(t) < {\tilde{\gamma}}(t_0), and let t denote the earliest such time. Then {\tilde{\gamma}}(t') > 0 for t'\in [t_0,t), whereby {\tilde{\gamma}}(t) = {\tilde{\gamma}}(t_0) + \int_{t_0}^t \frac {{\text{d}}}{{\text{d}}s} {\tilde{\gamma}}(s){\text{d}}s \geq {\tilde{\gamma}}(t_0), a contradiction.

To this end, fix any t with {\tilde{\gamma}}(t)>0, and the goal is to show ({\text{d}}/{\text{d}}t){\tilde{\gamma}}(t) \geq0. Define u(t) := - \ln \sum_i \exp(-m_i(w(t))), \qquad v(t) := \|w(t)\|^L, whereby {\tilde{\gamma}}(t) := {\tilde{\gamma}}(w(t)) := u(t) / v(t), and \frac {{\text{d}}}{{\text{d}}t} {\tilde{\gamma}}(t) = \frac {\dot u(t) v(t) - u(t) \dot v(t)}{v(t)^2}, where v(t) > 0 since {\tilde{\gamma}}(t) > 0 is impossible otherwise, which means the ratio is well-defined. Making use of Lemma 10.2, \begin{aligned} \dot u(t) &= \| \dot w(t) \|^2 \\ &\geq \|\dot w(t)\| \left\langle \frac{w(t)}{\|w(t)\|}, \dot w(t) \right \rangle \\ &\geq \frac { L u(t) \|\dot w(t)\|}{\|w(t)\|}, \\ \dot v(t) &= \frac {{\text{d}}}{{\text{d}}t} \|w(t)\|^{L} \\ &= L \|w(t)\|^{L - 1} \left\langle \frac {w(t)}{\|w(t)\|}, \dot w(t) \right \rangle \\ &\leq L \|w(t)\|^{L - 1} \|\dot w(t)\|, \end{aligned} whereby \begin{aligned} \dot u(t) v(t) - \dot v(t) u(t) &\geq \frac { L u(t) \|\dot w(t)\|}{\|w(t)\|} v(t) - u(t) L \|w(t)\|^{L - 1} \|\dot w(t)\| = 0, \end{aligned} which completes the proof.

Remark 10.11.
The linear case achieves a better bound by having not only a unique global hard margin solution u, but having the structural property \left\langle u, y_ix_i \right \rangle\geq \gamma, which for instance implies \|\dot w(t)\| \geq \gamma.
Instead, the preceding proof uses the much weaker inequality \|\dot w(t)\| \geq \frac { L u(t)}{\|w(t)\|}.
Remark 10.12.
As mentioned, Theorem 10.3 was originally presented in (Lyu and Li 2019), though this simplification is due to (Ji 2020), and its elements can be found throughout (Ji and Telgarsky 2020). The version in (Lyu and Li 2019) is significantly different, and makes heavy (and interesting) use of a polar decomposition of homogeneous functions and of the gradient flow on them.
For the case of an infinite-width 2-homogeneous network, assuming a number of convergence properties of the flow (which look technical, but are not “merely technical,” and indeed difficult to prove), margins are globally maximized (Chizat and Bach 2020).

11 Generalization: preface

The purpose of this generalization part is to bound the gap between testing and training error for standard (multilayer ReLU) deep networks via the classical uniform convergence tools, and also to present and develop these classical tools (based on Rademacher complexity).

These bounds are very loose, and there is extensive criticism now both of them and of the general approach, as will be discussed shortly (Neyshabur, Tomioka, and Srebro 2014; Zhang et al. 2017; Nagarajan and Kolter 2019; Dziugaite and Roy 2017); this work is ongoing and moving quickly, and there are already many responses to these criticisms (Negrea, Dziugaite, and Roy 2019; L. Zhou, Sutherland, and Srebro 2020; P. L. Bartlett and Long 2020).

11.1 Omitted topics

12 Concentration of measure

12.1 sub-Gaussian random variables and Chernoff’s bounding technique

Our main concentration tool will be the Chernoff bounding method, which works nicely with sub-Gaussian random variables.

Definition 12.1.
Random variable Z is sub-Gaussian with mean \mu and variance proxy \sigma^2 when, for all \lambda\in\mathbb{R}, \mathop{\mathbb{E}}e^{\lambda (Z - \mu)} \leq e^{\lambda^2 \sigma^2/2}.

Example 12.1.

Remark 12.1.

There is also “sub-exponential”; we will not use it but it is fundamental.

Sometimes \mu is dropped from the definition; in this case, one can study X-\mathop{\mathbb{E}}X, and we’ll just say “\sigma^2-sub-Gaussian.”

\mathop{\mathbb{E}}\exp(\lambda Z) is the moment generating function of Z; it has many nice properties, though we’ll only use it in a technical way.

Sub-Gaussian random variables will be useful to us due to their vanishing tail probabilities. This is indeed an equivalent way to define sub-Gaussian (see (Wainwright 2015)), but we’ll just prove one implication. The first step is Markov’s inequality.

Theorem 12.1 (Markov’s inequality) .
For any nonnegative r.v. X and \epsilon>0, \mathop{\textrm{Pr}}[X\geq \epsilon] \leq \frac {\mathop{\mathbb{E}}X}{\epsilon}.

Proof. Apply \mathop{\mathbb{E}} to both sides of \epsilon\mathbf{1}[X\geq \epsilon] \leq X.

Corollary 12.1.
For any r.v. X and any nonnegative, nondecreasing f with f(\epsilon)>0, \mathop{\textrm{Pr}}[X\geq \epsilon] \leq \frac {\mathop{\mathbb{E}}f(X)}{f(\epsilon)}.

Proof. Note \mathop{\textrm{Pr}}[X\geq \epsilon] \leq \mathop{\textrm{Pr}}[f(X) \geq f(\epsilon)] and apply Markov.

The Chernoff bounding technique is as follows. We can apply the preceding corollary to the mapping r\mapsto \exp(tr) for each t>0: supposing X is \sigma^2-sub-Gaussian with \mathop{\mathbb{E}}X = 0, \mathop{\textrm{Pr}}[X\geq \epsilon] = \inf_{t\geq 0} \mathop{\textrm{Pr}}[\exp(tX)\geq \exp(t\epsilon)] \leq \inf_{t\geq 0} \frac {\mathop{\mathbb{E}}\exp(tX)}{\exp(t\epsilon)}. Simplifying the RHS via sub-Gaussianity, \begin{aligned} \inf_{t>0}\frac{\mathop{\mathbb{E}}\exp(tX)}{\exp(t\epsilon)} &\leq \inf_{t>0}\frac{\exp(t^2\sigma^2/2)}{\exp(t\epsilon)} = \inf_{t>0}\exp\left({ t^2\sigma^2/2 - t\epsilon}\right) \\ &= \exp\left({ \inf_{t>0} t^2\sigma^2/2 - t\epsilon}\right). \end{aligned} The minimum of this convex quadratic is attained at t := \frac {\epsilon}{\sigma^2}>0, thus \mathop{\textrm{Pr}}[X\geq \epsilon] \leq \inf_{t>0}\frac{\mathop{\mathbb{E}}\exp(tX)}{\exp(t\epsilon)} \leq \exp\left({ \inf_{t>0} t^2\sigma^2/2 - t\epsilon}\right) = \exp\left({ - \frac {\epsilon^2}{2 \sigma^2} }\right). (7)
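As a quick numerical sanity check of eq. 7 (a small sketch assuming numpy; a standard Gaussian plays the role of a 1-sub-Gaussian variable, and the constants below are arbitrary):

```python
import numpy as np

# Compare the empirical tail of a standard Gaussian (mean 0, sigma^2 = 1)
# against the sub-Gaussian/Chernoff bound exp(-eps^2 / (2 sigma^2)) from eq. 7.
rng = np.random.default_rng(0)
sigma = 1.0
samples = rng.normal(0.0, sigma, size=1_000_000)
for eps in [0.5, 1.0, 2.0, 3.0]:
    empirical = np.mean(samples >= eps)           # Monte Carlo estimate of Pr[X >= eps]
    chernoff = np.exp(-eps**2 / (2 * sigma**2))   # the bound from eq. 7
    print(f"eps={eps:3.1f}  empirical={empirical:.5f}  bound={chernoff:.5f}")
```

The bound is loose for the Gaussian (the tighter Mills-ratio estimate is mentioned in the remark below), but it does upper bound the tail at every scale.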

What if we apply this to an average of sub-Gaussian r.v.’s? (The point is: this starts to look like an empirical risk!)

Theorem 12.2 (Chernoff bound for subgaussian r.v.’s) .
Suppose (X_1,\ldots,X_n) are independent and respectively \sigma_i^2-sub-Gaussian. Then \mathop{\textrm{Pr}}\left[{ \frac 1 n \sum_i (X_i - \mathop{\mathbb{E}}X_i) \geq \epsilon}\right] \leq \exp\left({-\frac {n^2\epsilon^2}{2\sum_i \sigma_i^2}}\right). In other words (“inversion form”), with probability \geq 1-\delta, \frac 1 n \sum_i \mathop{\mathbb{E}}X_i \leq \frac 1 n \sum_i X_i + \sqrt{\frac{2\sum_i\sigma_i^2}{n^2} \ln\left({\frac 1 \delta}\right) }.

Proof. S_n:=\sum_i(X_i - \mathop{\mathbb{E}}X_i)/n is \sigma^2-sub-Gaussian with \sigma^2 = \sum_i\sigma_i^2/n^2; plug this into the sub-Gaussian tail bound in eq. 7. Applying the same argument to -S_n and inverting the tail gives the second form.
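To see the inversion form in action, here is a small Monte Carlo experiment (numpy assumed; Gaussians are used since they are \sigma^2-sub-Gaussian, and the choices of n, \mu, \sigma, \delta are arbitrary): the stated event should fail on at most a \delta fraction of repetitions.

```python
import numpy as np

# Check: mean(E X_i) <= mean(X_i) + sqrt(2 * sum_i sigma_i^2 / n^2 * ln(1/delta))
# should fail on at most a delta fraction of repetitions.
rng = np.random.default_rng(0)
n, mu, sigma, delta, trials = 100, 0.3, 2.0, 0.05, 20_000
slack = np.sqrt(2 * n * sigma**2 / n**2 * np.log(1 / delta))
X = rng.normal(mu, sigma, size=(trials, n))       # each row is one sample (X_1, ..., X_n)
failure_rate = np.mean(mu > X.mean(axis=1) + slack)
print(f"failure rate {failure_rate:.4f} vs delta {delta}")
```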

Remark 12.2.

(Gaussian sanity check.) Let’s go back to the case n=1. It’s possible to get a tighter tail for the Gaussian directly (see (Wainwright 2015)), but it only changes log factors in the “inversion form” of the bound. Note also the bound is neat for the Gaussian since it says the tail mass and density are of the same order (algebraically this makes sense, as with geometric series).

(“Inversion” form.) This form is how things are commonly presented in machine learning; think of \delta as a “confidence” parameter: adding more digits to the confidence (e.g., requiring the bound to hold with probability 99.999\%) costs only a linear increase in the term \ln(1/\delta).

There are more sophisticated bounds (e.g., Bernstein, Freedman, McDiarmid) proved in similar ways, often considering a Martingale rather than IID r.v.s.

I should say something about necessary and sufficient, like convex lipschitz bounded vs lipschitz gaussian.

maybe give heavy tail pointer? dunno.

12.2 Hoeffding’s inequality and the need for uniform deviations

Let’s use what we’ve seen to bound misclassifications!

Theorem 12.3 (Hoeffding inequality) .
Given independent (X_1,\ldots,X_n) with X_i\in[a_i,b_i] a.s., \mathop{\textrm{Pr}}\left[{ \frac 1 n \sum_i (X_i - \mathop{\mathbb{E}}X_i) \geq \epsilon}\right] \leq \exp\left({ - \frac {2n^2\epsilon^2}{\sum_i (b_i-a_i)^2} }\right).

Proof. By the Hoeffding lemma, each X_i is sub-Gaussian with variance proxy (b_i-a_i)^2/4; plug this into the sub-Gaussian Chernoff bound (+Theorem 12.2).

Example 12.2.
Fix classifier f, sample ((X_i,Y_i))_{i=1}^n, and define Z_i := \mathbf{1}[f(X_i) \neq Y_i]. With probability at least 1-\delta, \mathop{\textrm{Pr}}[ f(X) \neq Y ] - \frac 1 n \sum_{i=1}^n \mathbf{1}[f(X_i) \neq Y_i] = \mathop{\mathbb{E}}Z_1 - \frac 1 n \sum_{i=1}^n Z_i \leq \sqrt{\frac 1 {2n}\ln\left({\frac{1}{\delta}}\right)}. That is, the test error is upper bounded by the training error plus a term which goes \downarrow 0 as n\to\infty!
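To see this bound in action, here is a small simulation (numpy; the data distribution and the fixed classifier sign(x) are arbitrary illustrative choices): over many draws of the training set, the gap should exceed the bound on at most a \delta fraction of draws.

```python
import numpy as np

# Hoeffding for a FIXED classifier f (chosen before seeing the sample):
# test error minus training error exceeds sqrt(ln(1/delta)/(2n)) with
# probability at most delta over the draw of the training sample.
rng = np.random.default_rng(0)
n, delta, trials = 200, 0.05, 5_000
bound = np.sqrt(np.log(1 / delta) / (2 * n))

def draw(m):
    X = rng.normal(size=m)
    Y = np.sign(X) * np.where(rng.random(m) < 0.9, 1, -1)  # labels: sign(x), 10% flipped
    return X, Y

def f(x):                       # fixed classifier, independent of the training sample
    return np.sign(x)

Xte, Yte = draw(1_000_000)
test_err = np.mean(f(Xte) != Yte)                 # proxy for Pr[f(X) != Y]
gaps = []
for _ in range(trials):
    Xtr, Ytr = draw(n)
    gaps.append(test_err - np.mean(f(Xtr) != Ytr))
print(f"failure rate {np.mean(np.array(gaps) > bound):.4f} vs delta {delta}")
```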
Example 12.3.

Classifier f_n memorizes training data: f_n(x) := \begin{cases} y_i & x = x_i \in (x_1,\ldots,x_n), \\ 17 &\textup{otherwise.}\end{cases} Consider two situations with \mathop{\textrm{Pr}}[Y=+1|X=x] = 1.

What broke Hoeffding’s inequality (and its proof) between these two examples?

This f_n overfit: \widehat{\mathcal{R}}(f_n) is small, but \mathcal{R}(f_n) is large.

Possible fixes.

Remark 12.3.
There are measure-theoretic issues with the uniform deviation approach, which we’ll omit here. Specifically, the most natural way to reason about \left[{ \sup_{f\in\mathcal{F}} \mathcal{R}(f) - \widehat{\mathcal{R}}(f) }\right] is via uncountable intersections of events, which are not guaranteed to be within the \sigma-algebra. The easiest fix is to work with countable subfamilies, which will work for the standard ReLU networks we consider.

13 Rademacher complexity

As before we will apply a brute-force approach to controlling generalization over a function family \mathcal{F}: we will simultaneously control generalization for all elements of the class by working with the random variable \left[{ \sup_{f\in\mathcal{F}} \mathcal{R}(f) - \widehat{\mathcal{R}}(f) }\right]. This is called “uniform deviations” because we prove a deviation bound that holds uniformly over all elements of \mathcal{F}.

Remark 13.1.
The idea is that even though our algorithms output predictors which depend on data, we circumvent the independence issue by invoking a uniform bound on all elements of \mathcal{F} before we see the algorithm’s output, and thus generalization is bounded for the algorithm output (and for everything else in the class). This is a brute-force approach because it potentially controls much more than is necessary.

On the other hand, we can adapt the approach to the output of the algorithm in various ways, as we will discuss after presenting the main Rademacher bound.

Example 13.1 (finite classes) .
As an example of what is possible, suppose we have \mathcal{F}= (f_1,\ldots,f_k), meaning a finite function class \mathcal{F} with |\mathcal{F}|=k. If we apply Hoeffding’s inequality to each element of \mathcal{F} and then union bound, we get, with probability at least 1-\delta, for every f\in\mathcal{F}, \mathop{\textrm{Pr}}[f(X) \neq Y] - \mathop{\widehat{\textrm{Pr}}}[f(X)\neq Y] \leq \sqrt{\frac{\ln(k/\delta)}{2n}} \leq \sqrt{\frac{\ln|\mathcal{F}|}{2n}} + \sqrt{\frac{\ln(1/\delta)}{2n}}. Rademacher complexity will give us a way to replace \ln |\mathcal{F}| in the preceding finite class example with something non-trivial in the case |\mathcal{F}|=\infty.
Definition 13.1 (Rademacher complexity) .
Given a set of vectors V\subseteq \mathbb{R}^n, define the (unnormalized) Rademacher complexity as \textrm{URad}(V) := \mathop{\mathbb{E}}\sup_{u \in V} \left\langle \epsilon, u \right \rangle, \qquad \textrm{Rad}(V) := \frac 1 n \textrm{URad}(V), where \mathop{\mathbb{E}} is uniform over the corners of the hypercube, meaning \epsilon\in\{\pm 1\}^n (each coordinate \epsilon_i is a Rademacher random variable, meaning \mathop{\textrm{Pr}}[\epsilon_i = +1] = \frac 1 2 = \mathop{\textrm{Pr}}[\epsilon_i = -1], and all coordinates are iid).

This definition can be applied to arbitrary elements of \mathbb{R}^n, and is useful outside machine learning. We will typically apply it to the behavior of a function class on S = (z_i)_{i=1}^n: \mathcal{F}_{|S} := \left\{{ (f(z_1),\ldots,f(z_n)) : f\in\mathcal{F}}\right\} \subseteq \mathbb{R}^n. With this definition, \textrm{URad}(\mathcal{F}_{|S}) = \mathop{\mathbb{E}}_\epsilon\sup_{u \in \mathcal{F}_{|S}} \left\langle \epsilon, u \right \rangle = \mathop{\mathbb{E}}_\epsilon\sup_{f \in \mathcal{F}} \sum_i \epsilon_i f(z_i).
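Since the definition is just an expectation of a maximum over sign vectors, it is easy to estimate by Monte Carlo for a small finite V; the helper urad_mc below is purely for illustration (numpy assumed), and the three calls preview the sanity checks in the remark that follows.

```python
import numpy as np

# Monte Carlo estimate of URad(V) for a finite V (rows of a (|V|, n) array):
# average over random sign vectors epsilon of max_{u in V} <epsilon, u>.
def urad_mc(V, trials=10_000, seed=0):
    rng = np.random.default_rng(seed)
    eps = rng.choice([-1.0, 1.0], size=(trials, V.shape[1]))
    return np.mean(np.max(eps @ V.T, axis=1))

n = 8
corners = np.array(np.meshgrid(*([[-1, 1]] * n))).reshape(n, -1).T  # all of {-1,+1}^n
print(urad_mc(np.ones((1, n))))                       # singleton: ~ 0
print(urad_mc(corners), n)                            # full hypercube: ~ n
print(urad_mc(np.vstack([np.ones(n), -np.ones(n)])))  # two constant vectors: ~ Theta(sqrt(n))
```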

Remark 13.2.
(Loss classes.) This looks like fitting random signs, but is not exactly that; often we apply it to the loss class: overloading notation, \textrm{URad}((\ell \circ \mathcal{F})_{|S}) = \textrm{URad}\left({ \left\{{ (\ell(y_1f(x_1)),\ldots,\ell(y_nf(x_n))) : f\in\mathcal{F}}\right\} }\right).

(Sanity checks.) We’d like \textrm{URad}(V) to measure how “big” or “complicated” V is. Here are a few basic checks:

  1. \textrm{URad}(\{u\}) = \mathop{\mathbb{E}}\left\langle \epsilon, u \right \rangle = 0; this seems desirable, as a class with |V|=1 is simple.

  2. More generally, \textrm{URad}(V + \{u\}) = \textrm{URad}(V).

  3. If V\subseteq V', then \textrm{URad}(V) \leq \textrm{URad}(V').

  4. \textrm{URad}(\{\pm 1\}^n) = \mathop{\mathbb{E}}_\epsilon\|\epsilon\|_2^2 = n; this also seems desirable, as V is as big/complicated as possible (amongst bounded vectors).

  5. \textrm{URad}(\{(-1,\ldots,-1), (+1,\ldots,+1)\}) = \mathop{\mathbb{E}}_\epsilon| \sum_i \epsilon_i| = \Theta(\sqrt{n}). This also seems reasonable: |V|=2 and it is not completely trivial.

(\textrm{URad} vs \textrm{Rad}.) I don’t know of other texts or even papers which use \textrm{URad}; I only see \textrm{Rad}. I prefer \textrm{URad} for these reasons:

  1. The 1/n is a nuisance while proving Rademacher complexity bounds.

  2. When we connect Rademacher complexity to covering numbers, we need to change the norms to account for this 1/n.

  3. It looks more like a regret quantity.

(Absolute value version.) The original definition of Rademacher complexity (P. L. Bartlett and Mendelson 2002), which still appears in many papers and books, is \textrm{URad}_{|\cdot|}(V) = \mathop{\mathbb{E}}_\epsilon\sup_{u\in V} \left|{ \left\langle \epsilon, u \right \rangle }\right|. Most texts now drop the absolute value. Here are my reasons:

  1. \textrm{URad}_{|\cdot|} violates basic sanity checks: it is possible that \textrm{URad}_{|\cdot|}(\{u\}) \neq 0 and more generally \textrm{URad}_{|\cdot|}(V + \{u\}) \neq \textrm{URad}_{|\cdot|}(V), which violates my basic intuition about a “complexity measure.”

  2. To obtain 1/n rates rather than 1/\sqrt{n}, the notion of local Rademacher complexity was introduced, which necessitated dropping the absolute value essentially due to the preceding sanity checks.

  3. We can use \textrm{URad} to reason about \textrm{URad}_{|\cdot|}, since \textrm{URad}_{|\cdot|}(V) = \textrm{URad}(V\cup -V).

  4. While \textrm{URad}_{|\cdot|} is more convenient for certain operations, e.g., \textrm{URad}_{|\cdot|}(\cup_i V_i) \leq \sum_i \textrm{URad}_{|\cdot|}(V_i), there are reasonable surrogates for \textrm{URad} (as below).

The following theorem shows indeed that we can use Rademacher complexity to replace the \ln|\mathcal{F}| term from the finite-class bound with something more general (we’ll treat the Rademacher complexity of finite classes shortly).

Theorem 13.1.

Let \mathcal{F} be given with f(z) \in [a,b] a.s. \forall f\in\mathcal{F}.

  1. With probability \geq 1-\delta, \begin{aligned} \sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f(Z) - \frac 1 n \sum_i f(z_i) &\leq \mathop{\mathbb{E}}_{(z_i)_{i=1}^n} \left({\sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f(z) - \frac 1 n \sum_i f(z_i)}\right) + (b-a) \sqrt{\frac{\ln(1/\delta)}{2n}}. \end{aligned}

  2. With probability \geq 1-\delta, \mathop{\mathbb{E}}_{(z_i)_{i=1}^n} \textrm{URad}(\mathcal{F}_{|S}) \leq \textrm{URad}(\mathcal{F}_{|S}) + (b-a) \sqrt{\frac{n\ln(1/\delta)}{2}}.

  3. With probability \geq 1-\delta, \sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f(Z) - \frac 1 n \sum_i f(z_i) \leq \frac 2 n \textrm{URad}(\mathcal{F}_{|S}) + 3(b-a) \sqrt{\frac{\ln(2/\delta)}{2 n}}.

Remark 13.3.
To flip which side has an expectation and which side has an average of random variables, replace \mathcal{F} with -\mathcal{F}:= \{-f : f\in\mathcal{F}\}.

The proof of this bound has many interesting points and is spread out over the next few subsections. It has these basic steps:

  1. The expected uniform deviations are upper bounded by the expected Rademacher complexity. This itself is done in two steps:

    1. The expected deviations are upper bounded by expected deviations between two finite samples. This is interesting since we could have reasonably defined generalization in terms of this latter quantity.

    2. These two-sample deviations are upper bounded by expected Rademacher complexity by introducing random signs.

  2. We replace this difference in expectations with high probability bounds via a more powerful concentration inequality which we haven’t discussed, McDiarmid’s inequality.

13.1 Generalization without concentration; symmetrization

We’ll use further notation throughout this proof. \begin{aligned} Z &\quad \textrm{r.v.; e.g., $(x,y)$}, \\ \mathcal{F} &\quad \textrm{functions; e.g., } f(Z) = \ell(g(X), Y), \\ \mathop{\mathbb{E}}&\quad\textrm{expectation over } Z, \\ \mathop{\mathbb{E}}_n &\quad \textrm{expectation over } (Z_1,\ldots,Z_n), \\ \mathop{\mathbb{E}}f &= \mathop{\mathbb{E}}f(Z),\\ \widehat{\mathop{\mathbb{E}}}_n f &= \frac 1 n \sum_i f(Z_i). \end{aligned}

In this notation, \mathcal{R}_{\ell}(g) = \mathop{\mathbb{E}}\ell\circ g and \widehat{\mathcal{R}}_{\ell}(g) = \widehat{\mathop{\mathbb{E}}}\ell\circ g.

13.1.1 Symmetrization with a ghost sample

In this first step we’ll introduce another sample (“ghost sample”). Let (Z_1',\ldots,Z_n') be another iid draw from Z; define \mathop{\mathbb{E}}_n' and \widehat{\mathop{\mathbb{E}}}_n' analogously.

Lemma 13.1.
\mathop{\mathbb{E}}_n\left({ \sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f}\right) \leq \mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \widehat{\mathop{\mathbb{E}}}_n' f - \widehat{\mathop{\mathbb{E}}}_n f}\right).

Proof. Fix any \epsilon>0 and apx max f_\epsilon\in\mathcal{F}; then \begin{aligned} \mathop{\mathbb{E}}_n\left({\sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f }\right) &\leq \mathop{\mathbb{E}}_n\left({ \mathop{\mathbb{E}}f_\epsilon- \widehat{\mathop{\mathbb{E}}}_n f_\epsilon}\right)+\epsilon \\ &= \mathop{\mathbb{E}}_n\left({ \mathop{\mathbb{E}}_n' \widehat{\mathop{\mathbb{E}}}_n' f_\epsilon- \widehat{\mathop{\mathbb{E}}}_n f_\epsilon}\right)+\epsilon \\ &= \mathop{\mathbb{E}}_n'\mathop{\mathbb{E}}_n\left({ \widehat{\mathop{\mathbb{E}}}_n' f_\epsilon- \widehat{\mathop{\mathbb{E}}}_n f_\epsilon}\right)+\epsilon \\ &\leq \mathop{\mathbb{E}}_n'\mathop{\mathbb{E}}_n\left({\sup_{f\in\mathcal{F}} \widehat{\mathop{\mathbb{E}}}_n' f - \widehat{\mathop{\mathbb{E}}}_n f }\right)+\epsilon \end{aligned} Result follows since \epsilon>0 was arbitrary.

Remark 13.4.

As above, in this section we are working only in expectation for now. In the subsequent section, we’ll get high probability bounds. But \sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f is a random variable; one can describe it in many other ways too! (E.g., “asymptotic normality.”)

As mentioned before, the preceding lemma says we can instead work with two samples. Working with two samples could have been our starting point (and definition of generalization): by itself it is a meaningful and interpretable quantity!

13.1.2 Symmetrization with random signs

The second step swaps points between the two samples; a magic trick with random signs boils this down into Rademacher complexity.

Lemma 13.2.
\mathop{\mathbb{E}}_n \mathop{\mathbb{E}}_n' \sup_{f\in\mathcal{F}} \left({ \widehat{\mathop{\mathbb{E}}}_n' f - \widehat{\mathop{\mathbb{E}}}_n f}\right) \leq \frac 2 n \mathop{\mathbb{E}}_n \textrm{URad}(\mathcal{F}_{|S}).

Proof. Fix a vector \epsilon\in \{-1,+1\}^n and define r.v.s (U_i,U_i') := (Z_i,Z_i') if \epsilon_i = 1 and (U_i,U_i') := (Z_i',Z_i) if \epsilon_i = -1. Then \begin{aligned} \mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \widehat{\mathop{\mathbb{E}}}_n' f - \widehat{\mathop{\mathbb{E}}}_n f}\right) &= \mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \left({ f(Z_i') - f(Z_i)}\right) }\right) \\ &= \mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({ f(U_i') - f(U_i)}\right) }\right). \end{aligned} Here’s the big trick: since (Z_1,\ldots,Z_n,Z_1',\ldots,Z_n') and (U_1,\ldots,U_n,U_1',\ldots,U_n') have the same distribution, and \epsilon was arbitrary, then (with \mathop{\textrm{Pr}}[\epsilon_i = +1] = 1/2 iid “Rademacher”) \begin{aligned} \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \widehat{\mathop{\mathbb{E}}}_n' f - \widehat{\mathop{\mathbb{E}}}_n f}\right) &= \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({ f(U_i') - f(U_i)}\right) }\right) \\ &= \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({ f(Z_i') - f(Z_i)}\right) }\right). \end{aligned}

Since replacing \epsilon_i with -\epsilon_i similarly doesn’t change \mathop{\mathbb{E}}_\epsilon, \begin{aligned} &\mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \widehat{\mathop{\mathbb{E}}}_n' f - \widehat{\mathop{\mathbb{E}}}_n f}\right) \\ &= \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({ f(Z_i') - f(Z_i)}\right) }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n\mathop{\mathbb{E}}_n' \left({ \sup_{f,f'\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i \left({ f(Z_i') - f'(Z_i)}\right) }\right) \\ &= \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n' \left({ \sup_{f\in\mathcal{F}} \frac 1 n \sum_i \epsilon_i f(Z_i') }\right) + \mathop{\mathbb{E}}_\epsilon\mathop{\mathbb{E}}_n \left({ \sup_{f'\in\mathcal{F}} \frac 1 n \sum_i (-\epsilon_i) f'(Z_i) }\right) \\ &= 2 \mathop{\mathbb{E}}_n \frac 1 n \mathop{\mathbb{E}}_\epsilon\sup_{f\in\mathcal{F}} \sum_i \epsilon_i f(Z_i) = 2 \mathop{\mathbb{E}}_n \frac 1 n \textrm{URad}(\mathcal{F}_{|S}). \end{aligned}

13.2 Generalization with concentration

We controlled expected uniform deviations: \mathop{\mathbb{E}}_n \sup_{f\in\mathcal{F}} \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f.

High probability bounds will follow via concentration inequalities.

Theorem 13.2 (McDiarmid) .
Suppose F:\mathbb{R}^n \to \mathbb{R} satisfies “bounded differences”: \forall i\in\{1,\ldots,n\} \exists c_i, \sup_{z_1,\ldots,z_n,z_i'} \left|{ F(z_1,\ldots,z_i,\ldots,z_n) - F(z_1,\ldots,z_i',\ldots,z_n) }\right| \leq c_i. With pr \geq 1-\delta, \mathop{\mathbb{E}}_n F(Z_1,\ldots,Z_n) \leq F(Z_1,\ldots,Z_n) + \sqrt{\frac {\sum_i c_i^2}{2} \ln(1/\delta) }.
Remark 13.5.

I’m omitting the proof. A standard way is via a Martingale variant of the Chernoff bounding method. The Martingale adds one point at a time, and sees how things grow.

Hoeffding follows by setting F(\vec Z) = \sum_i Z_i/n and verifying bounded differences c_i := (b_i-a_i)/n.

Proof of +Theorem 13.1.

The third bullet item follows from the first two by union bounding. To prove the first two, it suffices to apply the earlier two lemmas on expectations and verify the quantities satisfy bounded differences with constant (b-a)/n and (b-a), respectively.

For the first quantity, fix i, samples (z_1,\ldots,z_n), and a replacement z_i'; write \widehat{\mathop{\mathbb{E}}}_n' for the empirical average with z_i replaced by z_i'. Then \begin{aligned} \left|{ \sup_{f\in\mathcal{F}} \left({ \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f }\right) - \sup_{g\in\mathcal{F}} \left({ \mathop{\mathbb{E}}g - \widehat{\mathop{\mathbb{E}}}_n' g }\right) }\right| &= \left|{ \sup_{f\in\mathcal{F}} \left({ \mathop{\mathbb{E}}f - \widehat{\mathop{\mathbb{E}}}_n f }\right) - \sup_{g\in\mathcal{F}} \left({ \mathop{\mathbb{E}}g - \widehat{\mathop{\mathbb{E}}}_n g + \frac{g(z_i) - g(z_i')}{n} }\right) }\right| \\ &\leq \sup_{h\in\mathcal{F}} \left|{ \left({ \mathop{\mathbb{E}}h - \widehat{\mathop{\mathbb{E}}}_n h }\right) - \left({ \mathop{\mathbb{E}}h - \widehat{\mathop{\mathbb{E}}}_n h + \frac{h(z_i) - h(z_i')}{n} }\right) }\right| \\ &= \sup_{h\in\mathcal{F}} \frac{\left|{ h(z_i) - h(z_i') }\right|}{n} \leq \frac {b-a}{n}. \end{aligned} Using similar notation, and additionally writing S and S' for the two samples, for the Rademacher complexity, \begin{aligned} \left|{ \textrm{URad}(\mathcal{F}_{|S}) - \textrm{URad}(\mathcal{F}_{|S'})}\right| &= \left|{ \textrm{URad}(\mathcal{F}_{|S}) - \mathop{\mathbb{E}}_\epsilon\sup_{f\in\mathcal{F}} \left({ \sum_{j=1}^n \epsilon_j f(z_j) - \epsilon_i f(z_i) + \epsilon_i f(z_i') }\right) }\right| \\ &\leq \mathop{\mathbb{E}}_\epsilon \sup_{h\in\mathcal{F}} \left|{ \epsilon_i h(z_i) - \epsilon_i h(z_i') }\right| \\ &= \mathop{\mathbb{E}}_\epsilon \sup_{h\in\mathcal{F}} \left|{ h(z_i) - h(z_i') }\right| \leq (b-a). \end{aligned}

13.3 Example: basic logistic regression generalization analysis

Let’s consider logistic regression with bounded weights: \begin{aligned} \ell(y f(x)) &:= \ln(1+\exp(-y f(x))), \\ |\ell'| &\leq 1, \\ \mathcal{F} &:= \left\{{ w\in\mathbb{R}^d : \|w\| \leq B }\right\}, \\ (\ell\circ \mathcal{F})_{|S} &:= \left\{{ (\ell(y_1 w^{\scriptscriptstyle\mathsf{T}}x_1),\ldots,\ell(y_n w^{\scriptscriptstyle\mathsf{T}}x_n)) : \|w\|\leq B }\right\}, \\ \mathcal{R}_{\ell}(w) &:= \mathop{\mathbb{E}}\ell(Y w^{\scriptscriptstyle\mathsf{T}}X), \\ \widehat{\mathcal{R}}_{\ell}(w) &:= \frac 1 n \sum_i \ell(y_i w^{\scriptscriptstyle\mathsf{T}}x_i). \end{aligned} The goal is to control \mathcal{R}_{\ell}- \widehat{\mathcal{R}}_{\ell} over \mathcal{F} via the earlier theorem; our main effort is in controlling \textrm{URad}((\ell\circ \mathcal{F})_{S}).

This has two steps:

Lemma 13.3.
Let \ell:\mathbb{R}^n\to\mathbb{R}^n be a vector of univariate L-lipschitz functions. Then \textrm{URad}(\ell\circ V) \leq L \cdot \textrm{URad}(V).

Proof. The idea of the proof is to “de-symmetrize” and get a difference of coordinates to which we can apply the definition of L. To start, \begin{aligned} \textrm{URad}(\ell \circ V) &= \mathop{\mathbb{E}}\sup_{u\in V} \sum_i \epsilon_i \ell_i(u_i) \\ &= \frac 1 2 \mathop{\mathbb{E}}_{\epsilon_{2:n}} \sup_{u,w\in V} \left({\ell_1(u_1) - \ell_1(w_1) + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i)) }\right) \\ &\leq \frac 1 2 \mathop{\mathbb{E}}_{\epsilon_{2:n}} \sup_{u,w\in V} \left({L|u_1 - w_1| + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i)) }\right). \end{aligned} To get rid of the absolute value, for any \epsilon, by considering swapping u and w, \begin{aligned} & \sup_{u,w\in V} \left({L|u_1 - w_1| + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i))}\right) \\ & = \max\Bigg\{ \sup_{u,w\in V} \left({L(u_1 - w_1) + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i))}\right), \\ &\qquad \sup_{u,w} \left({L(w_1 - u_1) + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i))}\right) \Bigg\} \\ \\ & = \sup_{u,w\in V} \left({L(u_1 - w_1) + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i))}\right). \end{aligned} As such, \begin{aligned} \textrm{URad}(\ell \circ V) &\leq \frac 1 2 \mathop{\mathbb{E}}_{\epsilon_{2:n}} \sup_{u,w\in V} \left({L|u_1 - w_1| + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i)) }\right) \\ &= \frac 1 2 \mathop{\mathbb{E}}_{\epsilon_{2:n}} \sup_{u,w\in V} \left({L(u_1 - w_1) + \sum_{i=2}^n \epsilon_i (\ell_i(u_i) + \ell_i(w_i)) }\right) \\ &= \mathop{\mathbb{E}}_{\epsilon} \sup_{u\in V} \left[{ L\epsilon_1 u_1 + \sum_{i=2}^n \epsilon_i \ell_i(u_i) }\right]. \end{aligned} Repeating this procedure for the other coordinates gives the bound.

Revisiting our overloaded composition notation: \begin{aligned} \left({\ell \circ f}\right) &= \left({ (x,y) \mapsto \ell(-y f(x)) }\right),\\ \ell \circ \mathcal{F}&= \{ \ell \circ f : f\in \mathcal{F}\}. \end{aligned}

Corollary 13.1.
Suppose \ell is L-lipschitz and \ell\circ \mathcal{F}\in[a,b] a.s.. With probability \geq 1-\delta, every f\in\mathcal{F} satisfies \mathcal{R}_{\ell}(f) \leq \widehat{\mathcal{R}}_{\ell}(f) + \frac {2L}{n}\textrm{URad}(\mathcal{F}_{|S}) + 3(b-a)\sqrt{\frac{\ln(2/\delta)}{2n}}.

Proof. Use the lipschitz composition lemma with \begin{aligned} |\ell(-y_i f(x_i)) - \ell(-y_i f'(x_i))| &\leq L|-y_i f(x_i) + y_i f'(x_i)| \\ &\leq L|f(x_i) - f'(x_i)|. \end{aligned}

Now let’s handle step 2: Rademacher complexity of linear predictors (in \ell_2).

Theorem 13.3.
Collect sample S:= (x_1,\ldots,x_n) into rows of X\in \mathbb{R}^{n\times d}. \textrm{URad}(\{ x\mapsto \left\langle w, x \right \rangle : \|w\|_2\leq B\}_{|S}) \leq B\|X\|_F.

Proof. Fix any \epsilon\in\{-1,+1\}^n. Then \sup_{\|w\|\leq B} \sum_i \epsilon_i \left\langle w, x_i \right \rangle = \sup_{\|w\|\leq B} \left\langle w, \sum_i \epsilon_i x_i \right \rangle = B \left\|{ \sum_i \epsilon_i x_i }\right\|. We’ll bound this norm with Jensen’s inequality (the only inequality in the whole proof!): \mathop{\mathbb{E}}\left\|{ \sum_i \epsilon_i x_i }\right\| = \mathop{\mathbb{E}}\sqrt{ \left\|{ \sum_i \epsilon_i x_i }\right\|^2 } \leq \sqrt{ \mathop{\mathbb{E}}\left\|{ \sum_i \epsilon_i x_i }\right\|^2 }. To finish, \mathop{\mathbb{E}}\left\|{ \sum_i \epsilon_i x_i }\right\|^2 = \mathop{\mathbb{E}}\left({ \sum_i \left\|{ \epsilon_i x_i }\right\|^2 + \sum_{i\neq j} \left\langle \epsilon_i x_i, \epsilon_j x_j \right \rangle}\right) = \mathop{\mathbb{E}}\sum_i \left\|{x_i}\right\|^2 = \|X\|_F^2.
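Since the proof shows the supremum is attained at w = B\sum_i \epsilon_i x_i / \|\sum_i \epsilon_i x_i\|, the restriction of this class to a sample has \textrm{URad} exactly B\mathop{\mathbb{E}}\|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|, which is easy to estimate and compare against B\|X\|_F (a numpy sketch with arbitrary random data):

```python
import numpy as np

# URad of {x -> <w, x> : ||w||_2 <= B} restricted to a sample equals B * E||X^T eps||,
# which Theorem 13.3 bounds by B * ||X||_F.
rng = np.random.default_rng(0)
n, d, B = 50, 5, 2.0
X = rng.normal(size=(n, d))
eps = rng.choice([-1.0, 1.0], size=(20_000, n))
urad = B * np.mean(np.linalg.norm(eps @ X, axis=1))   # Monte Carlo estimate of B * E ||X^T eps||
print(urad, B * np.linalg.norm(X))                    # estimate <= B * ||X||_F (Frobenius norm)
```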

Remark 13.6.
By Khinchine’s inequality, the preceding Rademacher complexity estimate is tight up to constants.

Let’s now return to the logistic regression example!

Example 13.2 (logistic regression) .

Suppose \|w\|\leq B and \|x_i\|\leq 1, and the loss is the 1-Lipschitz logistic loss \ell_{\log}(z) := \ln(1+\exp(z)). Note \ell(\left\langle w, yx \right \rangle) \geq 0 and \ell(\left\langle w, yx \right \rangle) \leq \ln(2) + \left|{ \left\langle w, yx \right \rangle }\right| \leq \ln(2) + B.

Combining the main Rademacher bound with the Lipschitz composition lemma and the Rademacher bound on linear predictors, with probability at least 1-\delta, every w\in\mathbb{R}^d with \|w\|\leq B satisfies

\begin{aligned} \mathcal{R}_{\ell}(w) &\leq \widehat{\mathcal{R}}_{\ell}(w) + \frac {2}{n} \textrm{URad}((\ell\circ \mathcal{F})_{|S}) + 3(\ln(2)+B)\sqrt{\ln(2/\delta)/(2n)} \\ &\leq \widehat{\mathcal{R}}_{\ell}(w) + \frac {2B\|X\|_F}{n} + 3(\ln(2)+B)\sqrt{\ln(2/\delta)/(2n)} \\ &\leq \widehat{\mathcal{R}}_{\ell}(w) + \frac {2B + 3(B+\ln(2))\sqrt{\ln(2/\delta)/2}}{\sqrt{n}}. \end{aligned}
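Plugging numbers into the last display gives a feel for the rate (a trivial numpy computation; the choices of B, \delta, and sample sizes are arbitrary):

```python
import numpy as np

# The generalization gap bound (2B + 3(B + ln 2) sqrt(ln(2/delta)/2)) / sqrt(n)
# from the display above, evaluated for a few sample sizes.
def logistic_gap_bound(B, n, delta):
    return (2 * B + 3 * (B + np.log(2)) * np.sqrt(np.log(2 / delta) / 2)) / np.sqrt(n)

for n in [10**3, 10**4, 10**5]:
    print(n, logistic_gap_bound(B=1.0, n=n, delta=0.01))
```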

Remark 13.7.

(Average case vs worst case.) Here we replaced \|X\|_F with the looser \sqrt{n}.

This bound scales as the SGD logistic regression bound proved via Azuma, despite following a somewhat different route (Azuma and McDiarmid are both proved with Chernoff bounding method; the former approach involves no symmetrization, whereas the latter holds for more than the output of an algorithm).

It would be nice to have an “average Lipschitz” bound rather than “worst-case Lipschitz”; e.g., when working with neural networks and the ReLU, which seems to kill off many inputs! But it’s not clear how to do this. Relatedly: regularizing the gradient is sometimes used in practice?

13.4 Margin bounds

In the logistic regression example, we peeled off the loss and bounded the Rademacher complexity of the predictors.

If most training labels are predicted not only accurately, but with a large margin, as in section 10, then we can further reduce the generalization bound.

Define \ell_\gamma(z) := \max\{0,\min\{1, 1-z/\gamma\}\}, \mathcal{R}_{\gamma}(f) := \mathcal{R}_{\ell_\gamma}(f) = \mathop{\mathbb{E}}\ell_\gamma(Y f(X)), and recall \mathcal{R}_{\textrm{z}}(f) = \mathop{\textrm{Pr}}[f(X) \neq Y].
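Written out directly (a transcription of the definitions above in numpy; the margins below are arbitrary illustrative numbers):

```python
import numpy as np

# The ramp loss ell_gamma(z) = max{0, min{1, 1 - z/gamma}} and the empirical
# margin risk (1/n) sum_i ell_gamma(y_i f(x_i)).
def ramp(z, gamma):
    return np.clip(1.0 - z / gamma, 0.0, 1.0)

margins = np.array([2.0, 0.5, -0.1, 3.0])     # values of y_i * f(x_i)
print(np.mean(ramp(margins, gamma=1.0)))      # = (0 + 0.5 + 1 + 0) / 4 = 0.375
print(np.mean(margins <= 0))                  # zero-one training error 0.25, which is smaller
```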

Theorem 13.4.
For any margin \gamma>0, with probability \geq 1-\delta, \forall f\in\mathcal{F}, \mathcal{R}_{\textrm{z}}(f) \leq \mathcal{R}_{\gamma}(f) \leq \widehat{\mathcal{R}}_{\gamma}(f) + \frac {2}{n\gamma}\textrm{URad}(\mathcal{F}_{|S}) + 3 \sqrt{\frac{\ln(2/\delta)}{2n}}.

Proof. Since \mathbf{1}[ \textrm{sgn}(f(x)) \neq y] \leq \mathbf{1}[ -f(x)y \geq 0] \leq \ell_\gamma(f(x)y), then \mathcal{R}_{\textrm{z}}(f) \leq \mathcal{R}_{\gamma}(f). The bound between \mathcal{R}_{\gamma} and \widehat{\mathcal{R}}_{\gamma} follows from the fundamental Rademacher bound, and by peeling the \frac 1 \gamma-Lipschitz function \ell_\gamma.

is that using per-example lipschitz? need to restate peeling? also, properly invoke peeling?

Remark 13.8 (bibliographic notes) .

As a generalization notion, this was first introduced for 2-layer networks in (P. L. Bartlett 1996), and then carried to many other settings (SVM, boosting, …)

There are many different proof schemes; another one uses sparsification (Schapire et al. 1997).

This approach is again being extensively used for deep networks, since it seems that while weight matrix norms grow indefinitely, the margins grow along with them (P. Bartlett, Foster, and Telgarsky 2017).

13.5 Finite class bounds

In our warm-up example of finite classes, our complexity term was \ln|\mathcal{F}|. Here we will recover that, via Rademacher complexity. Moreover, the bound has a special form which will be useful in the later VC dimension and especially covering sections.

Theorem 13.5 (Massart finite lemma) .
\textrm{URad}(V) \leq \sup_{u\in V} \|u\|_2 \sqrt{2\ln|V|}.
Remark 13.9.
\ln|V| is what we expect from a union bound.

The \|\cdot\|_2 geometry is intrinsic here; I don’t know how to replace it with other norms without introducing looseness. This matters later when we encounter the Dudley entropy integral.

We’ll prove this via a few lemmas.

Lemma 13.4.
If (X_1,\ldots,X_n) are c^2-subgaussian, then \mathop{\mathbb{E}}\max_i X_i \leq c\sqrt{2\ln(n)}.

Proof.

\begin{aligned} \mathop{\mathbb{E}}\max_i X_i &= \inf_{t>0} \mathop{\mathbb{E}}\frac 1 t \ln \max_i \exp(tX_i) \leq \inf_{t>0} \mathop{\mathbb{E}}\frac 1 t \ln \sum_i \exp(tX_i) \\ &\leq \inf_{t>0} \frac 1 t \ln \sum_i \mathop{\mathbb{E}}\exp(tX_i) \leq \inf_{t>0} \frac 1 t \ln \sum_i \exp(t^2c^2/2) \\ &= \inf_{t>0} (\ln(n) / t + c^2t/2) \end{aligned} and plug in minimizer t = \sqrt{ 2\ln(n)/c^2 }

Lemma 13.5.
If (X_1,\ldots,X_n) are c_i^2-subgaussian and independent, \sum_i X_i is \|\vec c\|_2^2-subgaussian.

Proof. We did this in the concentration lecture, but here it is again: \mathop{\mathbb{E}}\exp(t\sum_i X_i) = \prod_i \mathop{\mathbb{E}}\exp(tX_i) \leq \prod_i \exp(t^2c_i^2/2) = \exp(t^2\|\vec c\|_2^2/2).

Proof of +Theorem 13.5 (Massart finite lemma).

Let \vec{\epsilon} be iid Rademacher and fix u\in V. Define X_{u,i} := \epsilon_i u_i and X_u := \sum_i X_{u,i}.

By the Hoeffding lemma, X_{u,i} is (u_i - (-u_i))^2/4 = u_i^2-sub-Gaussian, thus (by the preceding lemma) X_u is \|u\|_2^2-sub-Gaussian. Thus \textrm{URad}(V) = \mathop{\mathbb{E}}_\epsilon\max_{u\in V} \left\langle \epsilon, u \right \rangle = \mathop{\mathbb{E}}_\epsilon\max_{u\in V} X_u \leq \max_{u\in V} \|u\|_2 \sqrt{2\ln |V| }.
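A quick Monte Carlo check of the Massart bound on a random finite class (numpy; the dimensions below are arbitrary):

```python
import numpy as np

# URad(V) vs the Massart bound max_u ||u||_2 * sqrt(2 ln |V|) for a random V.
rng = np.random.default_rng(0)
n, k = 30, 50                                    # dimension and |V|
V = rng.normal(size=(k, n))
eps = rng.choice([-1.0, 1.0], size=(20_000, n))
urad = np.mean(np.max(eps @ V.T, axis=1))        # Monte Carlo estimate of URad(V)
massart = np.max(np.linalg.norm(V, axis=1)) * np.sqrt(2 * np.log(k))
print(urad, massart)                             # urad <= massart
```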

13.6 Weaknesses of Rademacher complexity

not an exhaustive list…

14 Two Rademacher complexity proofs for deep networks

We will give two bounds, obtained by inductively peeling off layers.

Also, I didn’t mention yet that the other proof techniques reduce to this one?

14.1 First “layer peeling” proof: (1,\infty) norm

Theorem 14.1.
Let \rho-Lipschitz activations \sigma_i satisfy \sigma_i(0)=0, and \mathcal{F}:= \left\{{ x\mapsto \sigma_L(W_L\sigma_{L-1}(\cdots \sigma_1(W_1x)\cdots)) : \|W_i^{\scriptscriptstyle\mathsf{T}}\|_{1,\infty} \leq B }\right\}. Then \displaystyle \textrm{URad}(\mathcal{F}_{|S}) \leq \|X\|_{2,\infty} (2\rho B)^L \sqrt{2 \ln(d)}.
Remark 14.1.

Notation \|M\|_{b,c} = \| ( \|M_{:1}\|_b,\ldots,\|M_{:d}\|_b)\|_c means apply b-norm to columns, then c-norm to resulting vector.

Many newer bounds replace \|W_i^{\scriptscriptstyle\mathsf{T}}\| with a distance to initialization. (The NTK is one regime where this helps.) I don’t know how to use distance to initialization in the bounds in this section, but a later bound can handle it.

(\rho B)^L is roughly a Lipschitz constant of the network according to \infty-norm bounded inputs. Ideally we’d have “average Lipschitz” not “worst case,” but we’re still far from that…

The factor 2^L is not good and the next section gives one technique to remove it.
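To make the mixed-norm notation from this remark concrete, here is a tiny numpy helper (mixed_norm is purely illustrative, not notation used elsewhere in these notes):

```python
import numpy as np

# ||M||_{b,c}: apply the b-norm to each column of M, then the c-norm to the
# resulting vector of column norms.
def mixed_norm(M, b, c):
    col_norms = np.linalg.norm(M, ord=b, axis=0)
    return np.linalg.norm(col_norms, ord=c)

M = np.array([[1.0, -2.0],
              [3.0,  4.0]])
print(mixed_norm(M, 1, np.inf))    # ||M||_{1,inf} = max column l1 norm = 6
print(mixed_norm(M.T, 1, np.inf))  # ||M^T||_{1,inf}, the quantity bounded by B above
print(mixed_norm(M, 2, np.inf))    # ||M||_{2,inf}, as used for the data matrix X
```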

We’ll prove this with an induction “peeling” off layers. This peeling will use the following lemma, which collects many standard Rademacher properties.

Lemma 14.1.
  1. \textrm{URad}(V) \geq 0.
  2. \textrm{URad}(cV + \{u\}) = |c|\textrm{URad}(V).
  3. \textrm{URad}(\textrm{conv}(V)) = \textrm{URad}(V).
  4. Let (V_i)_{i\geq 0} be given with \sup_{u\in V_i} \left\langle u, \epsilon \right \rangle \geq 0 \forall \epsilon\in\{-1,+1\}^n. (E.g., V_i = -V_i, or 0\in V_i.) Then \textrm{URad}(\cup_i V_i) \leq \sum_i \textrm{URad}(V_i).
  5. \textrm{URad}(V) = \textrm{URad}(-V).
Remark 14.2.
  1. Part 3 (convex hulls) is a mixed blessing: “Rademacher is insensitive to convex hulls.”
  2. Part 4 is true for \textrm{URad}_{|\cdot|} directly, where \textrm{URad}_{|\cdot|}(V) = \mathop{\mathbb{E}}_\epsilon\sup_{u\in V} |\left\langle \epsilon, u \right \rangle| is the original definition of (unnormalized) Rademacher complexity: define W_i := V_i \cup -V_i, which satisfies the conditions, and note (\cup_i V_i) \cup - (\cup_i V_i) = \cup_i W_i. Since \textrm{URad}_{|\cdot|}(V_i) = \textrm{URad}(W_i), then \textrm{URad}_{|\cdot|}(\cup_i V_i) = \textrm{URad}(\cup_i W_i) \leq \sum_{i\geq 1} \textrm{URad}(W_i) = \sum_{i\geq 1} \textrm{URad}_{|\cdot|}(V_i). is this where I messed up and clipped an older URad remark?
  3. Part 4 is important and we’ll do the proof and some implications in homework.

Proof of +Lemma 14.1.

  1. Fix any u_0 \in V; then \mathop{\mathbb{E}}_\epsilon\sup_{u\in V}\left\langle \epsilon, u \right \rangle \geq \mathop{\mathbb{E}}_\epsilon\left\langle \epsilon, u_0 \right \rangle = 0.

  2. Can get inequality with |c|-Lipschitz functions \ell_i(r) := c\cdot r + u_i; for equality, note -\epsilon c and \epsilon c are same in distribution.

  3. This follows since optimization over a polytope is achieved at a corner. In detail, \begin{aligned} \textrm{URad}(\textrm{conv}(V)) &= \mathop{\mathbb{E}}_\epsilon\sup_{\substack{k\geq 1 \\ \alpha \in \Delta_k}} \sup_{u_1,\ldots,u_k\in V} \left\langle \epsilon, \sum_j \alpha_j u_j \right \rangle \\ &= \mathop{\mathbb{E}}_\epsilon\sup_{\substack{k\geq 1 \\ \alpha \in \Delta_k}} \sum_j \alpha_j \sup_{u_j \in V} \left\langle \epsilon, u_j \right \rangle \\ &= \mathop{\mathbb{E}}_\epsilon\left({ \sup_{\substack{k\geq 1 \\ \alpha \in \Delta_k}} \sum_j \alpha_j}\right) \sup_{u \in V} \left\langle \epsilon, u \right \rangle \\ &= \textrm{URad}(V). \end{aligned}

  4. Using the condition, \begin{aligned} \mathop{\mathbb{E}}_\epsilon\sup_{u\in \cup_i V_i} \left\langle \epsilon, u \right \rangle &= \mathop{\mathbb{E}}_\epsilon\sup_i \sup_{u\in V_i} \left\langle \epsilon, u \right \rangle \leq \mathop{\mathbb{E}}_\epsilon\sum_i \sup_{u\in V_i} \left\langle \epsilon, u \right \rangle \\ &= \sum_{i\geq1} \textrm{URad}(V_i). \end{aligned}

  5. Since integrating over \epsilon is the same as integrating over -\epsilon (the two are equivalent distributions), \textrm{URad}(-V) = \mathop{\mathbb{E}}_{\epsilon} \sup_{u\in V} \left\langle \epsilon, -u \right \rangle = \mathop{\mathbb{E}}_{\epsilon} \sup_{u\in V} \left\langle -\epsilon, -u \right \rangle = \textrm{URad}(V).

Proof of +Theorem 14.1.

Let \mathcal{F}_i denote functions computed by nodes in layer i. It’ll be shown by induction that \textrm{URad}((\mathcal{F}_{i})_{|S}) \leq \|X \|_{2,\infty} (2\rho B)^i \sqrt{2 \ln(d)}. Base case (i=0): by the Massart finite lemma, \begin{aligned} \textrm{URad}((\mathcal{F}_{i})_{|S}) &= \textrm{URad}\left({ \left\{{x\mapsto x_j : j \in \{1,\ldots,d\}}\right\}_{|S} }\right) \\ &\leq \left({\max_j \|(x_1)_j, \ldots, (x_n)_j\|_2}\right) \sqrt{2\ln(d)} \\ &= \|X\|_{2,\infty} \sqrt{2\ln d } = \|X\|_{2,\infty} (2\rho B)^0 \sqrt{2\ln d}. \end{aligned}

Inductive step. Since 0 = \sigma(\left\langle 0, F(x) \right \rangle) \in \mathcal{F}_{i+1}, applying both Lipschitz peeling and the preceding multi-part lemma, \begin{aligned} &\textrm{URad}((\mathcal{F}_{i+1})_{|S}) \\ &=\textrm{URad}\left({\left\{{ x\mapsto \sigma_{i+1}(\|W_{i+1}^{\scriptscriptstyle\mathsf{T}}\|_{1,\infty} g(x)) : g\in \textrm{conv}(-\mathcal{F}_i \cup \mathcal{F}_i) }\right\}_{|S} }\right) \\ &\leq \rho B\cdot \textrm{URad}\left({-(\mathcal{F}_i)_{|S} \cup (\mathcal{F}_i)_{|S} }\right) \\ &\leq 2 \rho B\cdot \textrm{URad}\left({(\mathcal{F}_i)_{|S}}\right) \\ &\leq (2 \rho B)^{i+1} \|X\|_{2,\infty} \sqrt{2\ln d}. \end{aligned}

Remark 14.3.

There are many related norm-based proofs now changing constants and also (1,\infty); see for instance Neyshabur-Tomioka-Srebro, Bartlett-Foster-Telgarsky (we’ll cover this), Golowich-Rakhlin-Shamir (we’ll cover this), Barron-Klusowski.

The best lower bound is roughly what you get by writing a linear function as a deep network \ddot \frown.

The proof does not “coordinate” the behavior of adjacent layers in any way, and worst-cases what can happen.

14.2 Second “layer peeling” proof: Frobenius norm

Theorem 14.2 ((Theorem 1, Golowich, Rakhlin, and Shamir 2018)) .
Let 1-Lipschitz positive homogeneous activations \sigma_i be given, and \mathcal{F}:= \left\{{ x\mapsto \sigma_L(W_L\sigma_{L-1}(\cdots \sigma_1(W_1x)\cdots)) : \|W_i\|_{\textrm{F}}\leq B }\right\}. Then \textrm{URad}(\mathcal{F}_{|S}) \leq B^L \|X\|_{\textrm{F}}\left({1 + \sqrt{2L\ln(2)}}\right).
Remark 14.4.
The criticisms of the previous layer peeling proof still apply, except we’ve removed 2^L.

The proof technique can also handle other matrix norms (though with some adjustment), bringing it closer to the previous layer peeling proof.

For an earlier version of this bound but including things like 2^L, see Neyshabur-Tomioka-Srebro. I need a proper citation

The main proof trick (to remove 2^L) is to replace \mathop{\mathbb{E}}_\epsilon with \ln \mathop{\mathbb{E}}_\epsilon\exp; the 2^L now appears inside the \ln.

To make this work, we need two calculations, which we’ll wrap up into lemmas.

Here is our refined Lipschitz peeling bound, stated without proof.

Lemma 14.2 ((Eq. 4.20, Ledoux and Talagrand 1991)) .
Let \ell:\mathbb{R}^n\to\mathbb{R}^n be a vector of univariate \rho-lipschitz functions with \ell_i(0)=0. Then \mathop{\mathbb{E}}_\epsilon\exp \left({ \sup_{u\in V} \sum_i \epsilon_i \ell_i(u_i) }\right) \leq \mathop{\mathbb{E}}_\epsilon\exp \left({ \rho \sup_{u\in V} \sum_i \epsilon_i u_i }\right).
Remark 14.5.
With \exp gone, our proof was pretty clean, but all proofs I know of this are more complicated case analyses. So I will not include a proof \ddot\frown.

The peeling proof will end with a term \mathop{\mathbb{E}}\exp\left({ t\|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|}\right), and we’ll optimize the t to get the final bound. Consequently, we are proving \|X^{\scriptscriptstyle\mathsf{T}}\epsilon\| is sub-Gaussian!

Lemma 14.3.
\mathop{\mathbb{E}}\|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|_2 \leq \|X\|_{\textrm{F}} and \|X^{\scriptscriptstyle\mathsf{T}}\epsilon\| is (\mathop{\mathbb{E}}\|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|, \|X\|_{\textrm{F}}^2)-sub-Gaussian.

Proof. Following the notation of (Wainwright 2015), define Y_k := \mathop{\mathbb{E}}\left[{ \|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|_2 | \epsilon_1,\ldots,\epsilon_k }\right], D_k := Y_k - Y_{k-1}, whereby Y_n - Y_0 = \sum_k D_k. For the base case, as usual \mathop{\mathbb{E}}\|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|_2 \leq \sqrt{ \mathop{\mathbb{E}}\|X^{\scriptscriptstyle\mathsf{T}}\epsilon\|^2 } = \sqrt{ \sum_{j=1}^d\mathop{\mathbb{E}}(X_{:j}^{\scriptscriptstyle\mathsf{T}}\epsilon)^2 } = \sqrt{ \sum_{j=1}^d \| X_{:j}\|^2 } = \|X\|_{{\textrm{F}}}.

Supposing \epsilon and \epsilon' only differ on \epsilon_k, \begin{aligned} \sup_{\epsilon_k} | \|X^{\scriptscriptstyle\mathsf{T}}\epsilon\| - \|X^{\scriptscriptstyle\mathsf{T}}\epsilon'\| |^2 &\leq \sup_{\epsilon_k} \|X^{\scriptscriptstyle\mathsf{T}}(\epsilon- \epsilon') \|^2 = \sup_{\epsilon_k} \sum_{j=1}^d (X_{:j}^{\scriptscriptstyle\mathsf{T}}(\epsilon- \epsilon'))^2 \\ &= \sup_{\epsilon_k} \sum_{j=1}^d (X_{k,j} (\epsilon_k - \epsilon'_k))^2 \leq 4 \|X_{k:}\|^2, \end{aligned} therefore by the (conditional) Hoeffding lemma, D_k is \|X_{k:}\|^2-sub-Gaussian, thus (Theorem 2.3, Wainwright 2015) grants \sum_k D_k is \sigma^2-sub-Gaussian with \sigma^2 = \sum_k \|X_{k:}\|^2 = \|X\|_{\textrm{F}}^2.

Remark 14.6 (pointed out by Ziwei Ji) .
Alternatively, we can use the Lipschitz-convex concentration bound for bounded random variables, and get a variance proxy of roughly \|X\|_2. Plugging this into the full peeling proof, we get an interesting bound B^L\left({\|X\|_{\textrm{F}}+ \|X\|_2\sqrt{128L\ln(2)}}\right), thus dimension and depth don’t appear together.

Proof of +Theorem 14.2 ((Theorem 1, Golowich, Rakhlin, and Shamir 2018)). For convenience, let X_i denote the output of layer i, meaning X_0 = X \quad\textup{and}\quad X_i := \sigma_i(X_{i-1} W_i^{\scriptscriptstyle\mathsf{T}}). Let t>0 be a free parameter and let w denote all parameters across all layers; the bulk of the proof will show (by induction on layers) that \mathop{\mathbb{E}}\sup_w \exp( t \|\epsilon^{\scriptscriptstyle\mathsf{T}}X_i\| ) \leq \mathop{\mathbb{E}}2^i \exp( t B^i \|\epsilon^{\scriptscriptstyle\mathsf{T}}X_0\| ).

To see how to complete the proof from here, note by the earlier “base case lemma” (setting \mu := \mathop{\mathbb{E}}\|X_0^{\scriptscriptstyle\mathsf{T}}\epsilon\| for convenience) and Jensen’s inequality that \begin{aligned} \textrm{URad}(\mathcal{F}_{|S}) &= \mathop{\mathbb{E}}\sup_w \epsilon^{\scriptscriptstyle\mathsf{T}}X_L = \mathop{\mathbb{E}}\frac 1 t \ln \sup_w \exp\left({ t \epsilon^{\scriptscriptstyle\mathsf{T}}X_L }\right) \\ &\leq \frac 1 t \ln \mathop{\mathbb{E}}\sup_w \exp\left({ t | \epsilon^{\scriptscriptstyle\mathsf{T}}X_L | }\right) \leq \frac 1 t \ln \mathop{\mathbb{E}}2^L \exp\left({ t B^L \| \epsilon^{\scriptscriptstyle\mathsf{T}}X_0 \| }\right) \\ &\leq \frac 1 t \ln \mathop{\mathbb{E}}2^L \exp\left({ t B^L( \| \epsilon^{\scriptscriptstyle\mathsf{T}}X_0 \| - \mu + \mu)}\right) \\ &\leq \frac 1 t \ln\left[{ 2^L \exp\left({ t^2 B^{2L} \|X\|_{\textrm{F}}^2/2 + tB^L \mu}\right) }\right] \\ &\leq \frac {L\ln 2} t + \frac {t B^{2L}\|X\|_{\textrm{F}}^2}{2} + B^L \|X\|_{\textrm{F}}, \end{aligned} whereby the final bound follows with the minimizing choice t := \sqrt{\frac{ 2 L \ln(2) }{ B^{2L}\|X\|_{\textrm{F}}^2 }} \ \Longrightarrow\ {} \textrm{URad}(\mathcal{F}_{|S}) \leq \sqrt{2\ln(2) L B^{2L}\|X\|_{\textrm{F}}^2} + B^L \|X\|_{\textrm{F}}.

The main inequality is now proved via induction.

For convenience, define \sigma := \sigma_i and Y := X_{i-1} and V := W_i and \tilde V has \ell_2-normalized rows. By positive homogeneity and definition, \begin{aligned} \sup_w \|\epsilon^{\scriptscriptstyle\mathsf{T}}X_i\|^2 &= \sup_w \sum_j (\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y V^{\scriptscriptstyle\mathsf{T}})_{:j})^2 \\ &= \sup_w \sum_j (\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y V_{j:}^{\scriptscriptstyle\mathsf{T}}))^2 \\ &= \sup_w \sum_j (\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(\|V_{j:}\| Y \tilde V_{j:}^{\scriptscriptstyle\mathsf{T}}))^2 \\ &= \sup_w \sum_j \|V_{j:}\|^2 (\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y \tilde V_{j:}^{\scriptscriptstyle\mathsf{T}}))^2. \end{aligned}

The maximum over row norms is attained by placing all mass on a single row; thus, letting u denote an arbitrary unit norm (column) vector, and finally applying the peeling lemma, and re-introducing the dropped terms, and closing with the IH, \begin{aligned} \mathop{\mathbb{E}}_\epsilon \exp\left({t \sqrt{ \sup_w \|\epsilon^{\scriptscriptstyle\mathsf{T}}X_i\|^2 } }\right) &= \mathop{\mathbb{E}}_\epsilon \exp \left({t \sqrt{ \sup_{w,u} B^2 (\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u))^2 } }\right) \\ &= \mathop{\mathbb{E}}_\epsilon\sup_{w,u} \exp\left({t B |\epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u)| }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon\sup_{w,u} \exp\left({t B \epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u) }\right) + \exp\left({ - t B \epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u) }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon\sup_{w,u} \exp\left({t B \epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u) }\right) + \mathop{\mathbb{E}}_\epsilon\sup_{w,u} \exp\left({ - t B \epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u) }\right) \\ &= \mathop{\mathbb{E}}_\epsilon 2 \sup_{w,u} \exp\left({t B \epsilon^{\scriptscriptstyle\mathsf{T}}\sigma(Y u) }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon 2 \sup_{w,u} \exp\left({t B \epsilon^{\scriptscriptstyle\mathsf{T}}Y u }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon 2 \sup_{w} \exp\left({t B \| \epsilon^{\scriptscriptstyle\mathsf{T}}Y\|_2 }\right) \\ &\leq \mathop{\mathbb{E}}_\epsilon 2^i \sup_{w} \exp\left({t B^i \| \epsilon^{\scriptscriptstyle\mathsf{T}}X_0\|_2 }\right). \end{aligned}

15 Covering numbers

Definition 15.1.
Given a set U, scale \epsilon, norm \|\cdot\|, V\subseteq U is a (proper) cover when \sup_{a\in U} \inf_{b\in V} \|a-b\| \leq \epsilon. Let \mathcal{N}(U,\epsilon,\|\cdot\|) denote the covering number: the minimum cardinality (proper) cover.
Remark 15.1.

“Improper” covers drop the requirement V\subseteq U. (We’ll come back to this.)

Most treatments define special norms with normalization 1/n baked in; we’ll use unnormalized Rademacher complexity and covering numbers.

Although the definition can directly handle covering a function class \mathcal{F}, we get nice bounds by covering \mathcal{F}_{|S}, and conceptually it also becomes easier: it is just a vector (or matrix) covering problem with vector (and matrix) norms.
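For a finite U, a greedy construction already produces a (proper) cover and hence an upper bound on the covering number; the sketch below (numpy; greedy_cover and the uniform data are illustrative only) shows the familiar (1/\epsilon)^d growth in two dimensions.

```python
import numpy as np

# Greedy proper cover: repeatedly pick an uncovered point of U as a center
# until every point is within epsilon; |V| upper bounds N(U, epsilon, ||.||_2).
def greedy_cover(U, eps):
    V = []
    uncovered = np.ones(len(U), dtype=bool)
    while uncovered.any():
        c = U[np.argmax(uncovered)]              # an arbitrary uncovered point
        V.append(c)
        uncovered &= np.linalg.norm(U - c, axis=1) > eps
    return np.array(V)

rng = np.random.default_rng(0)
U = rng.uniform(-1, 1, size=(2_000, 2))
for eps in [0.5, 0.25, 0.1]:
    print(eps, len(greedy_cover(U, eps)))        # size grows roughly like (1/eps)^2
```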

15.1 Basic Rademacher-covering relationship

Theorem 15.1.
Given U\subseteq \mathbb{R}^n, \textrm{URad}(U) \leq \inf_{\alpha>0}\left({ \alpha\sqrt{n} + \left({\sup_{a\in U}\|a\|_2}\right)\sqrt{2 \ln \mathcal{N}(U,\alpha,\|\cdot\|_2)} }\right).
Remark 15.2.
\|\cdot\|_2 comes from applying Massart. It’s unclear how to handle other norms without some technical slop.

Proof. Let \alpha >0 be arbitrary, and suppose \mathcal{N}(U,\alpha,\|\cdot\|_2)<\infty (otherwise bound holds trivially). Let V denote a minimal cover, and V(a) its closest element to a\in U. Then \begin{aligned} \textrm{URad}(U) &= \mathop{\mathbb{E}}\sup_{a\in U} \left\langle \epsilon, a \right \rangle \\ &= \mathop{\mathbb{E}}\sup_{a\in U} \left\langle \epsilon, a- V(a) + V(a) \right \rangle \\ &= \mathop{\mathbb{E}}\sup_{a\in U} \left({\left\langle \epsilon, V(a) \right \rangle + \left\langle \epsilon, a - V(a) \right \rangle}\right) \\ &\leq \mathop{\mathbb{E}}\sup_{a\in U} \left({\left\langle \epsilon, V(a) \right \rangle + \|\epsilon\|\cdot\|a-V(a)\|}\right) \\ &\leq \textrm{URad}(V) + \alpha \sqrt{n} \\ &\leq \sup_{b\in V}(\|b\|_2) \sqrt{2 \ln |V| } + \alpha \sqrt{n} \\ &\leq \sup_{a\in U}(\|a\|_2) \sqrt{2 \ln |V| } + \alpha \sqrt{n}, \end{aligned} and the bound follows since \alpha>0 was arbitrary.

Remark 15.3.

The same proof handles improper covers with minor adjustment: for every b\in V, there must be U(b)\in U with \|b-U(b)\|\leq \alpha (otherwise, b can be moved closer to U), thus \begin{aligned} \sup_{b\in V}\|b\|_2 &\leq \sup_{b\in V}\left({ \|b-U(b)\|_2 + \|U(b)\|_2 }\right) \leq \alpha + \sup_{a\in U} \|a\|_2. \end{aligned}

To handle other norms, superficially we need two adjustments: Cauchy-Schwarz can be replaced with Hölder, but it’s unclear how to replace Massart without slop relating different norms.

15.2 Second Rademacher-covering relationship: Dudley’s entropy integral

There is a classical proof that says that covering numbers and Rademacher complexities are roughly the same; the upper bound uses the Dudley entropy integral, and the lower bound uses a “Sudakov lower bound” which we will not include here.

crappy comment, needs to be improved.

Seems this works with improper covers. I should check carefully and include it in the statement or a remark.

citation for dudley? to dudley lol?

Theorem 15.2 (Dudley) .
Let U\subseteq [-1,+1]^n be given with 0\in U. \begin{aligned} \textrm{URad}(U) &\leq \inf_{N \in \mathbb{Z}_{\geq 1}} \left({ n \cdot 2^{-N+1} + 6\sqrt{n} \sum_{i=1}^{N-1} 2^{-i} \sqrt{\ln \mathcal{N}(U, 2^{-i}\sqrt{n}, \|\cdot\|_2)} }\right) \\ &\leq \inf_{\alpha > 0} \left({ 4\alpha \sqrt{n} + 12\int_{\alpha}^{\sqrt{n}/2} \sqrt{\ln \mathcal{N}(U, \beta , \|\cdot\|_2)} {\text{d}}\beta }\right). \end{aligned}

Proof. We’ll do the discrete sum first. The integral follows by relating an integral to its Riemann sum.

For each i\geq 1, let \alpha_i := \sqrt{n}\cdot 2^{1-i}, let V_i be a minimal cover of U at scale \alpha_i, and let V_i(a) denote an element of V_i within distance \alpha_i of a; since U\subseteq[-1,+1]^n and 0\in U, we may take V_1 = \{0\}. Since U\ni a = (a - V_N(a)) + \sum_{i=1}^{N-1} \left({ V_{i+1}(a) - V_i(a) }\right) + V_1(a), \begin{aligned} &\textrm{URad}(U) \\ &= \mathop{\mathbb{E}}\sup_{a\in U}\left\langle \epsilon, a \right \rangle \\ &= \mathop{\mathbb{E}}\sup_{a\in U}\left({ \left\langle \epsilon, a-V_N(a) \right \rangle + \sum_{i=1}^{N-1} \left\langle \epsilon, V_{i+1}(a) - V_i(a) \right \rangle + \left\langle \epsilon, V_1(a) \right \rangle }\right) \\ &\leq \mathop{\mathbb{E}}\sup_{a\in U}\left\langle \epsilon, a-V_N(a) \right \rangle \\ &\qquad + \sum_{i=1}^{N-1} \mathop{\mathbb{E}}\sup_{a\in U}\left\langle \epsilon, V_{i+1}(a)-V_i(a) \right \rangle \\ &\qquad + \mathop{\mathbb{E}}\sup_{a\in U} \left\langle \epsilon, V_1(a) \right \rangle . \end{aligned} Let’s now control these terms separately.

The first and last terms are easy: \begin{aligned} \mathop{\mathbb{E}}\sup_{a\in U}\left\langle \epsilon, V_1(a) \right \rangle &= \mathop{\mathbb{E}}\left\langle \epsilon, 0 \right \rangle = 0, \\ \mathop{\mathbb{E}}\sup_{a\in U}\left\langle \epsilon, a-V_N(a) \right \rangle &\leq \mathop{\mathbb{E}}\sup_{a\in U}\|\epsilon\|\|a-V_N(a)\| \leq \sqrt{n} \alpha_N = n 2^{1-N}. \end{aligned} For the middle term, define the increment class W_i := \{ V_{i+1}(a) - V_i(a) : a\in U\}, whereby |W_i| \leq |V_{i+1}|\cdot |V_i| \leq |V_{i+1}|^2, and \begin{aligned} &\mathop{\mathbb{E}}\sup_{a \in U} \left\langle \epsilon, V_{i+1}(a) - V_i(a) \right \rangle = \textrm{URad}(W_i) \\ &\leq \left({\sup_{w\in W_i}\|w\|_2}\right) \sqrt{2 \ln |W_i|} \leq \left({\sup_{w\in W_i}\|w\|_2}\right) \sqrt{4 \ln |V_{i+1}|}, \\ \sup_{w\in W_i} \|w\| &\leq \sup_{a\in U} \left({ \|V_{i+1}(a) - a\| + \|a - V_i(a)\| }\right) \leq \alpha_{i+1} + \alpha_i = 3 \alpha_{i+1}. \end{aligned} Combining these bounds, \textrm{URad}(U) \leq n 2^{1-N} + 0 + \sum_{i=1}^{N-1} 6 \sqrt{n}\, 2^{-i}\sqrt{\ln \mathcal{N}(U,2^{-i}\sqrt{n},\|\cdot\|_2)} . Since N\geq 1 was arbitrary, applying \inf_{N\geq 1} gives the first bound.

Since \ln \mathcal{N}(U,\beta,\|\cdot\|_2) is nonincreasing in \beta, the integral upper bounds the Riemann sum: \begin{aligned} \textrm{URad}(U) &\leq n2^{1-N} + 6\sum_{i=1}^{N-1} \alpha_{i+1} \sqrt{\ln \mathcal{N}(U,\alpha_{i+1}, \|\cdot\|)} \\ &= n2^{1-N} + 12\sum_{i=1}^{N-1} \left({ \alpha_{i+1} - \alpha_{i+2} }\right) \sqrt{\ln \mathcal{N}(U,\alpha_{i+1}, \|\cdot\|)} \\ &\leq \sqrt{n} \alpha_N + 12 \int_{\alpha_{N+1}}^{\alpha_2} \sqrt{\ln \mathcal{N}(U,\beta, \|\cdot\|)} {\text{d}}\beta. \end{aligned} To finish, pick \alpha > 0 and N with \alpha_{N+1} \geq \alpha > \alpha_{N+2} = \frac {\alpha_{N+1}}{2} = \frac {\alpha_{N}}{4}, whereby \begin{aligned} \textrm{URad}(U) &\leq \sqrt{n} \alpha_N + 12 \int_{\alpha_{N+1}}^{\alpha_2} \sqrt{\ln \mathcal{N}(U,\beta, \|\cdot\|)} {\text{d}}\beta \\ &\leq 4 \sqrt{n} \alpha + 12 \int_{\alpha}^{\sqrt{n}/2} \sqrt{\ln \mathcal{N}(U,\beta, \|\cdot\|)} {\text{d}}\beta. \end{aligned}

Remark 15.4.

Tightness of Dudley: Sudakov’s lower bound says there exists a universal constant c with \textrm{URad}(U) \geq \frac {c}{\ln(n)} \sup_{\alpha > 0} \alpha \sqrt{\ln\mathcal{N}(U,\alpha,\|\cdot\|)}, which implies \textrm{URad}(U) = \widetilde\Theta\left({\textup{Dudley entropy integral}}\right). needs references, detail, explanation.

Taking the notion of increments to heart and generalizing the proof gives the concept of chaining. One key question there is tightening the relationship with Rademacher complexity (shrinking constants and log factors in the above bound).

Another term for covering is “metric entropy.”

Recall once again that we drop the normalization 1/n from \textrm{URad} and the choice of norm when covering.

16 Two deep network covering number bounds

We will give two generalization bounds.

16.1 First covering number bound: Lipschitz functions

This bound is intended as a point of contrast with our deep network generalization bounds.

Theorem 16.1.
Let data S = (x_1,\ldots,x_n) be given with R := \max_{i,j} \|x_i-x_j\|_\infty. Let \mathcal{F} denote all \rho-Lipschitz functions from [-R,+R]^d\to [-B,+B] (where Lipschitz is measured wrt \|\cdot\|_\infty). Then the improper covering number \widetilde{\mathcal{N}} in the uniform norm \|f\|_{{\textrm{u}}} := \sup_{x\in[-R,+R]^d} |f(x)| satisfies \ln \widetilde{\mathcal{N}}(\mathcal{F},\epsilon,\|\cdot\|_{{\textrm{u}}}) \leq \max\left\{{0, \left\lceil \frac {4\rho (R+\epsilon)}{\epsilon}\right\rceil^d \ln \left\lceil \frac {2B}{\epsilon}\right\rceil }\right\}.
Remark 16.1.

Exponential in dimension!

Revisiting the “point of contrast” comment above, our deep network generalization bounds are polynomial and not exponential in dimension; consequently, we really are doing much better than simply treating the networks as arbitrary Lipschitz functions.

Proof.

Partition [-R,+R]^d into a grid U of at most \left\lceil \frac {4\rho (R+\epsilon)}{\epsilon}\right\rceil^d cubes of side length \epsilon/(2\rho), let V\subseteq[-B,+B] be a grid of at most \left\lceil \frac{2B}{\epsilon}\right\rceil values spaced \epsilon apart, and let \mathcal{G} denote the functions which are constant on each cube C\in U with value in V; then \ln|\mathcal{G}| is at most the stated bound. To show this is an improper cover, given f\in\mathcal{F}, choose g\in\mathcal{G} by proceeding over each C \in U, and assigning g_{|C}\in V to be the closest element to f(x_C), where x_C is the midpoint of C. Then \begin{aligned} \|f-g\|_{{\textrm{u}}} &= \sup_{C\in U} \sup_{x\in C} |f(x) - g(x)| \\ &\leq \sup_{C\in U} \sup_{x\in C} \left({ |f(x) - f(x_C)| + |f(x_C) - g(x)| }\right) \\ &\leq \sup_{C\in U} \sup_{x\in C} \left({ \rho \|x- x_C\|_\infty + \frac{\epsilon}{2} }\right) \\ &\leq \sup_{C\in U} \sup_{x\in C} \left({ \rho( \epsilon/ (4\rho) ) + \frac{\epsilon}{2} }\right) \leq \epsilon

hmm the proof used uniform norm… is it defined?

16.2 “Spectrally-normalized” covering number bound

Theorem 16.2 (P. Bartlett, Foster, and Telgarsky (2017)) .
Fix multivariate activations (\sigma_i)_{i=1}^L with \|\sigma_i\|_{\textrm{Lip}}=: \rho_i and \sigma_i(0)=0, and data X\in\mathbb{R}^{n\times d}, and define \begin{aligned} \mathcal{F}_n := \Bigg\{ \sigma_L(W_L\sigma_{L-1}\cdots \sigma_1(W_1 X^{\scriptscriptstyle\mathsf{T}})\cdots )\ :\ {} \|W_i^{\scriptscriptstyle\mathsf{T}}\|_2 \leq s_i, \|W_i^{\scriptscriptstyle\mathsf{T}}\|_{2,1}\leq b_i \Bigg\}, \end{aligned} and all matrix dimensions are at most m. Then \ln\mathcal{N}\left({ \mathcal{F}_n,\epsilon, \|\cdot\|_{\textrm{F}} }\right) \leq \frac {\|X\|_{\textrm{F}}^2\prod_{j=1}^L \rho_j^2 s_j^2}{\epsilon^2} \left({ \sum_{i=1}^L \left({\frac{b_i}{s_i}}\right)^{2/3} }\right)^3 \ln(2m^2).
Remark 16.2.
Applying Dudley, \textrm{URad}(\mathcal{F}_n)=\widetilde{\mathcal{O}}\left({ \|X\|_{\textrm{F}}\left[{ \prod_{j=1}^L \rho_j s_j }\right]\cdot \left[{ \sum_{i=1}^L \left({\frac{b_i}{s_i}}\right)^{2/3} }\right]^{3/2} }\right). that’s annoying and should be included/performed rigorously.

Proof uses \|\sigma(M) - \sigma(M')\|_{\textrm{F}}\leq \|\sigma\|_{\textrm{Lip}}\cdot\|M-M'\|_{\textrm{F}}; in particular, it allows multi-variate gates like max-pooling! See (P. Bartlett, Foster, and Telgarsky 2017) for \|\sigma_i\|_{\textrm{Lip}} estimates.

This proof can be adjusted to handle “distance to initialization”; see (P. Bartlett, Foster, and Telgarsky 2017) and the notion “reference matrices.”

Let’s compare to our best “layer peeling” proof from before, which had \prod_i \|W_i\|_{\textrm{F}}\lesssim m^{L/2} \prod_i \|W_i\|_2. That proof assumed \rho_i = 1, so the comparison boils down to m^{L/2} \left({ \prod_i \|W_i\|_2 }\right) \quad\textup{vs.}\quad \left[{ \sum_i \left({\frac{\|W_i^{\scriptscriptstyle\mathsf{T}}\|_{2,1}^{2/3}}{\|W_i\|_2^{2/3}}}\right)}\right]^{3/2} \left({ \prod_i \|W_i\|_2 }\right), where L \leq \sum_i \left({\frac{\|W_i^{\scriptscriptstyle\mathsf{T}}\|_{2,1}^{2/3}}{\|W_i\|_2^{2/3}}}\right) \leq Lm^{2/3}. So the bound is better but still leaves a lot to be desired, and is loose in practice.

It is not clear how to prove exactly this bound with Rademacher peeling, which is a little eerie (independent of whether this bound is good or not).

The proof, as with Rademacher peeling proofs, is an induction on layers, similarly one which does not “coordinate” the behavior of the layers; this is one source of looseness.
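To get a feel for the magnitudes in the comparison above, the sketch below computes both complexity factors for random Gaussian weights (numpy; the architecture, the 1/\sqrt{\textrm{fan-in}} scaling, and \rho_i = 1 are arbitrary illustrative choices, not a claim about trained networks):

```python
import numpy as np

# Compare the spectrally-normalized complexity factor
#   (sum_i (b_i / s_i)^{2/3})^{3/2} * prod_i s_i,  s_i = ||W_i||_2, b_i = ||W_i^T||_{2,1},
# with the Frobenius layer-peeling factor m^{L/2} * prod_i s_i.
rng = np.random.default_rng(0)
d, m, L = 20, 50, 5
dims = [d] + [m] * L
Ws = [rng.normal(size=(dims[i + 1], dims[i])) / np.sqrt(dims[i]) for i in range(L)]
s = np.array([np.linalg.norm(W, 2) for W in Ws])             # spectral norms
b = np.array([np.linalg.norm(W, axis=1).sum() for W in Ws])  # sums of row l2 norms
spectral_factor = np.sum((b / s) ** (2 / 3)) ** 1.5 * np.prod(s)
frobenius_factor = m ** (L / 2) * np.prod(s)
print(spectral_factor, frobenius_factor)     # the spectrally-normalized factor is much smaller
```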

Remark 16.3 (practical regularization schemes) .
This bound suggests regularization based primarily on the Lipschitz constant of the network; similar ideas appeared in parallel applied work, both for classification problems (Cisse et al. 2017), and for GANs (Arjovsky, Chintala, and Bottou 2017).
Remark 16.4 (another proof) .
For an alternate proof of a similar fact (albeit requiring univariate gates), see (Neyshabur, Bhojanapalli, and Srebro 2018).

The first step of the proof is a covering number for individual layers.

Lemma 16.1.
\ln \mathcal{N}(\{ WX^{\scriptscriptstyle\mathsf{T}}: W\in\mathbb{R}^{m\times d}, \|W^{\scriptscriptstyle\mathsf{T}}\|_{2,1}\leq b\}, \epsilon, \|\cdot\|_{\textrm{F}}) \leq\left\lceil \frac{\|X\|_{{\textrm{F}}}^2 b^2}{\epsilon^2} \right\rceil \ln (2dm).

Proof. Let W\in\mathbb{R}^{m\times d} be given with \|W^{\scriptscriptstyle\mathsf{T}}\|_{2,1} \leq r. Define s_{ij} := W_{ij}/|W_{ij}| (with s_{ij} := 0 when W_{ij} = 0), and note \begin{aligned} WX^{\scriptscriptstyle\mathsf{T}} = \sum_{i,j} \mathbf{e}_i\mathbf{e}_i^{\scriptscriptstyle\mathsf{T}}W \mathbf{e}_j\mathbf{e}_j^{\scriptscriptstyle\mathsf{T}}X^{\scriptscriptstyle\mathsf{T}} = \sum_{i,j} \mathbf{e}_i W_{ij} (X\mathbf{e}_j)^{\scriptscriptstyle\mathsf{T}} = \sum_{i,j} \underbrace{\frac{|W_{ij}| \|X\mathbf{e}_j\|_2}{r\|X\|_{\textrm{F}}}}_{=:q_{ij}} \underbrace{\frac{r \|X\|_{\textrm{F}}s_{ij}\mathbf{e}_i (X\mathbf{e}_j)^{\scriptscriptstyle\mathsf{T}}}{\|X\mathbf{e}_j\|}}_{=:U_{ij}}. \end{aligned} Note by Cauchy-Schwarz that \sum_{i,j} q_{ij} \leq \frac {1}{r\|X\|_{\textrm{F}}} \sum_i \sqrt{\sum_j W_{ij}^2}\|X\|_{\textrm{F}} = \frac {\|W^{\scriptscriptstyle\mathsf{T}}\|_{2,1} \|X\|_{\textrm{F}}}{r\|X\|_{\textrm{F}}} \leq 1, potentially with strict inequality, so q need not be a probability vector, though we will want one momentarily. To remedy this, construct a probability vector p from q by placing the missing mass, in equal parts, on some U_{ij} and its negation, so that the above summation form of WX^{\scriptscriptstyle\mathsf{T}} holds with p in place of q.

Now define IID random variables (V_1,\ldots,V_k), where \begin{aligned} \mathop{\textrm{Pr}}[V_l = U_{ij}] &= p_{ij}, \\ \mathop{\mathbb{E}}V_l &= \sum_{i,j} p_{ij} U_{ij} = \sum_{i,j} q_{ij} U_{ij} = WX^{\scriptscriptstyle\mathsf{T}}, \\ \|U_{ij}\| &= \left\|{\frac {s_{ij} \mathbf{e}_i (X\mathbf{e}_j)^{\scriptscriptstyle\mathsf{T}}}{\|X\mathbf{e}_j\|_2} }\right\|_{\textrm{F}}\cdot r \|X\|_{\textrm{F}} = |s_{ij}|\cdot\|\mathbf{e}_i\|_2 \cdot\left\|{\frac{X\mathbf{e}_j}{\|X\mathbf{e}_j\|_2} }\right\|_2 \cdot r \|X\|_{\textrm{F}} = r \|X\|_{\textrm{F}}, \\ \mathop{\mathbb{E}}\|V_l\|^2 &= \sum_{i,j} p_{ij} \|U_{ij}\|^2 \leq \sum_{ij} p_{ij} r^2 \|X\|_{\textrm{F}}^2 = r^2 \|X\|_{\textrm{F}}^2. \end{aligned} By +Lemma 3.1 (Maurey (Pisier 1980)), there exist ({\hat V}_1,\ldots,{\hat V}_k)\in S^k with \left\|{ WX^{\scriptscriptstyle\mathsf{T}}- \frac 1 k \sum_l {\hat V}_l }\right\|^2 \leq \mathop{\mathbb{E}}\left\|{ \mathop{\mathbb{E}}V_1 - \frac 1 k \sum_l V_l }\right\|^2 \leq \frac 1 k \mathop{\mathbb{E}}\|V_1\|^2 \leq \frac {r^2 \|X\|_{\textrm{F}}^2} k. Furthermore, the matrices {\hat V}_l have the form \frac 1 k \sum_l {\hat V}_l = \frac 1 k \sum_l \frac {r\|X\|_{\textrm{F}} s_l\mathbf{e}_{i_l} (X\mathbf{e}_{j_l})^{\scriptscriptstyle\mathsf{T}}}{\|X\mathbf{e}_{j_l}\|} = \left[{ \frac {r\|X\|_{\textrm{F}}} k \sum_l \frac {s_l\mathbf{e}_{i_l} \mathbf{e}_{j_l}^{\scriptscriptstyle\mathsf{T}}}{\|X\mathbf{e}_{j_l}\|} }\right] X^{\scriptscriptstyle\mathsf{T}}; by this form, there are at most (2md)^k choices for ({\hat V}_1,\ldots,{\hat V}_k). Choosing k := \left\lceil \frac{r^2\|X\|_{\textrm{F}}^2}{\epsilon^2}\right\rceil makes the error at most \epsilon, and the logarithm of the resulting cover cardinality is at most k\ln(2md), which is the statement (taking r = b).
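The Maurey sampling step can be checked empirically. The following Python sketch (an illustration, not from the notes; the sizes are arbitrary) forms the rank-one pieces U_{ij} and weights q_{ij} from the proof, pads q to a probability vector by splitting the leftover mass between one piece and its negation, and compares the empirical average of k samples to WX^{\scriptscriptstyle\mathsf{T}}; the Frobenius error indeed decays like r\|X\|_{\textrm{F}}/\sqrt{k}.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n = 5, 8, 20
W = rng.normal(size=(m, d))
X = rng.normal(size=(n, d))
r = np.linalg.norm(W, axis=1).sum()            # ||W^T||_{2,1}
XF = np.linalg.norm(X)                         # ||X||_F

# Rank-one pieces U_ij and weights q_ij as in the proof.
col_norms = np.linalg.norm(X, axis=0)          # ||X e_j||_2
q = np.abs(W) * col_norms[None, :] / (r * XF)
U = np.array([[r * XF * np.sign(W[i, j]) * np.outer(np.eye(m)[i], X[:, j]) / col_norms[j]
               for j in range(d)] for i in range(m)])        # shape (m, d, m, n)

# Pad q to a probability vector by splitting the leftover mass between +U[0,0] and -U[0,0].
atoms = list(U.reshape(m * d, m, n)) + [U[0, 0], -U[0, 0]]
probs = np.concatenate([q.reshape(-1), [(1 - q.sum()) / 2] * 2])

for k in [10, 100, 1000, 10000]:
    idx = rng.choice(len(atoms), size=k, p=probs)
    avg = np.mean([atoms[t] for t in idx], axis=0)
    err = np.linalg.norm(W @ X.T - avg)
    print(f"k={k:6d}  error={err:8.3f}  r*||X||_F/sqrt(k)={r * XF / np.sqrt(k):8.3f}")
```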

Lemma 16.2.
Let \mathcal{F}_n be the same image vectors as in the theorem, and let per-layer tolerances (\epsilon_1,\ldots,\epsilon_L) be given. Then \ln\mathcal{N}\left({ \mathcal{F}_n,\ {} \sum_{j=1}^{L} \rho_j \epsilon_j \prod_{k=j+1}^{L} \rho_k s_k,\ {} \|\cdot\|_{\textrm{F}}}\right) \leq \sum_{i=1}^L \left\lceil \frac { \|X\|_{\textrm{F}}^2 b_i^2 \prod_{j<i} \rho_j^2 s_j^2}{\epsilon_i^2} \right\rceil \ln(2m^2).

Proof. Let X_i denote the output of layer i of the network, using weights (W_i,\ldots,W_1), meaning X_0 := X \qquad\text{and}\qquad X_i := \sigma_i(X_{i-1} W_i^{\scriptscriptstyle\mathsf{T}}).

The proof recursively constructs cover elements {\hat X}_i and weights {\hat W}_i for each layer with the following basic properties: {\hat X}_0 := X; given {\hat X}_{i-1}, the product {\hat X}_{i-1}{\hat W}_i^{\scriptscriptstyle\mathsf{T}} is an element of the layer-i cover provided by +Lemma 16.1 (applied with data {\hat X}_{i-1} and constraint \|W_i^{\scriptscriptstyle\mathsf{T}}\|_{2,1}\leq b_i), chosen so that \|{\hat X}_{i-1}W_i^{\scriptscriptstyle\mathsf{T}} - {\hat X}_{i-1}{\hat W}_i^{\scriptscriptstyle\mathsf{T}}\|_{\textrm{F}} \leq \epsilon_i; and {\hat X}_i := \sigma_i({\hat X}_{i-1}{\hat W}_i^{\scriptscriptstyle\mathsf{T}}). Multiplying the per-layer cover cardinalities (with the data norm at layer i controlled by \|X\|_{\textrm{F}}\prod_{j<i}\rho_j s_j) gives the stated cardinality bound.

It remains to prove, by induction, an error guarantee \|X_i - {\hat X}_i\|_{\textrm{F}} \leq \sum_{j=1}^{i} \rho_j \epsilon_j \prod_{k=j+1}^{i} \rho_k s_k.

The base case \|X_0-{\hat X}_0\|_{\textrm{F}}= 0 = \epsilon_0 holds directly. For the inductive step, by the above ingredients and the triangle inequality, \begin{aligned} \| X_i - {\hat X}_i \|_{\textrm{F}} &\leq \rho_i \|X_{i-1}W_i^{\scriptscriptstyle\mathsf{T}}- {\hat X}_{i-1}{\hat W}_i^{\scriptscriptstyle\mathsf{T}}\|_{\textrm{F}} \\ &\leq \rho_i \| X_{i-1}W_{i}^{\scriptscriptstyle\mathsf{T}}- {\hat X}_{i-1}W_i^{\scriptscriptstyle\mathsf{T}}\|_{\textrm{F}} + \rho_i \| {\hat X}_{i-1}W_i^{\scriptscriptstyle\mathsf{T}}- {\hat X}_{i-1}{\hat W}_i^{\scriptscriptstyle\mathsf{T}}\|_{\textrm{F}} \\ &\leq \rho_i s_i \| X_{i-1} - {\hat X}_{i-1}\|_{\textrm{F}} + \rho_i \epsilon_i \\ &\leq \rho_i s_i \left[{ \sum_{j=1}^{i-1} \rho_j \epsilon_j \prod_{k=j+1}^{i-1} \rho_k s_k }\right] + \rho_i \epsilon_i \\ &= \left[{ \sum_{j=1}^{i-1} \rho_j \epsilon_j \prod_{k=j+1}^{i} \rho_k s_k }\right] + \rho_i \epsilon_i \\ &= \sum_{j=1}^{i} \rho_j \epsilon_j \prod_{k=j+1}^{i} \rho_k s_k. \end{aligned}

Proof of +Theorem 16.2 (P. Bartlett, Foster, and Telgarsky (2017)). By solving a Lagrangian (minimize cover size subject to total error \leq \epsilon; the computation is sketched after this proof), choose \epsilon_i := \frac {\alpha_i \epsilon}{\rho_i \prod_{j>i} \rho_j s_j}, \qquad \alpha_i := \frac 1 \beta \left({\frac{b_i}{s_i}}\right)^{2/3}, \qquad \beta := \sum_{i=1}^L \left({\frac{b_i}{s_i}}\right)^{2/3}. Invoking the induction lemma with these choices, the resulting cover error is \sum_{i=1}^L \epsilon_i \rho_i \prod_{j>i} \rho_j s_j = \epsilon\sum_{i=1}^L \alpha_i = \epsilon, and the main term of the cardinality (ignoring \ln(2m^2)) satisfies \begin{aligned} & \sum_{i=1}^L \frac { \|X\|_{\textrm{F}}^2 b_i^2 \prod_{j<i} \rho_j^2 s_j^2}{\epsilon_i^2} = \frac {\|X\|_{\textrm{F}}^2}{\epsilon^2} \sum_{i=1}^L \frac{b_i^2\prod_{j=1}^L \rho_j^2 s_j^2}{\alpha_i^2 s_i^2} \\ &= \frac {\|X\|_{\textrm{F}}^2\prod_{j=1}^L \rho_j^2 s_j^2}{\epsilon^2} \sum_{i=1}^L \frac{\beta^2 b_i^{2/3}}{s_i^{2/3}} = \frac {\|X\|_{\textrm{F}}^2\prod_{j=1}^L \rho_j^2 s_j^2}{\epsilon^2} \left({ \sum_{i=1}^L \left({\frac{b_i}{s_i}}\right)^{2/3} }\right)^3. \end{aligned}
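To spell out the Lagrangian computation (a sketch; the ceilings and the \ln(2m^2) factor are ignored): write c_i := \|X\|_{\textrm{F}}^2 b_i^2 \prod_{j<i}\rho_j^2 s_j^2 and a_i := \rho_i \prod_{j>i}\rho_j s_j, so the goal is to minimize \sum_i c_i/\epsilon_i^2 subject to \sum_i a_i\epsilon_i = \epsilon. Setting the gradient of \sum_i c_i/\epsilon_i^2 + \lambda\left({\sum_i a_i\epsilon_i - \epsilon}\right) to zero gives 2c_i/\epsilon_i^3 = \lambda a_i, i.e., \epsilon_i \propto (c_i/a_i)^{1/3}; since \frac{c_i}{a_i} = \frac{\|X\|_{\textrm{F}}^2 \left({\prod_{j=1}^L \rho_j s_j}\right)^2 (b_i/s_i)^2}{\left({\rho_i\prod_{j>i}\rho_j s_j}\right)^3}, this means \epsilon_i \propto \frac{(b_i/s_i)^{2/3}}{\rho_i\prod_{j>i}\rho_j s_j}, and normalizing so that \sum_i a_i\epsilon_i = \epsilon recovers exactly the choice above.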

17 VC dimension

(The remainder of this section is adapted from the fall 2018 offering of the course and was not taught in fall 2019; the VC dimension proofs are included because they reveal structure not captured by the preceding bounds.)

First, some definitions. The zero-one/classification risk/error is \mathcal{R}_{\textrm{z}}(\textrm{sgn}(f)) = \mathop{\textrm{Pr}}[\textrm{sgn}(f(X)) \neq Y], \ {} \widehat{\mathcal{R}}_{\textrm{z}}(\textrm{sgn}(f)) = \frac 1 n \sum_{i=1}^n \mathbf{1}[\textrm{sgn}(f(x_i))\neq y_i]. The earlier Rademacher bound will now have \textrm{URad}\left({\left\{{ (x,y)\mapsto \mathbf{1}[\textrm{sgn}(f(x))\neq y] : f\in\mathcal{F}}\right\}_{|S}}\right). The restricted set here has at most 2^n elements; we’ll reduce the bound to a combinatorial quantity: for V\subseteq\mathbb{R}^n, \begin{aligned} \textrm{sgn}(V) &:= \left\{{ (\textrm{sgn}(v_1),\ldots,\textrm{sgn}(v_n)) : v\in V }\right\}, \\ \textrm{Sh}(\mathcal{F}_{|S}) &:= \left|{ \textrm{sgn}(\mathcal{F}_{|S}) }\right|, \\ \textrm{Sh}(\mathcal{F}; n) &:= \sup_{\substack{S \\ |S|\leq n}} \left|{ \textrm{sgn}(\mathcal{F}_{|S}) }\right|, \\ \textrm{VC}(\mathcal{F}) &:= \sup\{ i \in \mathbb{Z}_{\geq 0} : \textrm{Sh}(\mathcal{F};i) = 2^i \}. \end{aligned}
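As a toy illustration of these definitions (a Python sketch, not part of the notes), consider one-dimensional threshold classifiers x\mapsto\textrm{sgn}(x-b): brute-force enumeration of sign patterns on n distinct points gives \textrm{Sh} = n+1, so only a single point can be shattered and \textrm{VC} = 1.

```python
import numpy as np

def sign_patterns(points):
    """All sign patterns realized by x -> sgn(x - b) on the given 1-D points."""
    points = np.sort(points)
    # Thresholds strictly between / around the points realize every achievable pattern.
    thresholds = np.concatenate([[points[0] - 1],
                                 (points[:-1] + points[1:]) / 2,
                                 [points[-1] + 1]])
    return {tuple(np.where(points - b > 0, 1, -1)) for b in thresholds}

rng = np.random.default_rng(0)
for n in [1, 2, 3, 5, 8]:
    pats = sign_patterns(rng.normal(size=n))
    print(f"n={n}: Sh = {len(pats)} (= n+1), shattered: {len(pats) == 2 ** n}")
# Shattered only at n = 1, i.e. VC = 1 for thresholds on the line.
```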

Remark 17.1.

\textrm{Sh} is “shatter coefficient,” \textrm{VC} is “VC dimension.”

Both quantities are criticized as being too tied to their worst case; bounds here depend on (empirical quantity!) \textrm{URad}(\textrm{sgn}(\mathcal{F}_{|S})), which can be better, but throws out the labels.

Theorem 17.1 (“VC Theorem”) .
With probability at least 1-\delta, every f\in\mathcal{F} satisfies \mathcal{R}_{\textrm{z}}(\textrm{sgn}(f)) \leq \widehat{\mathcal{R}}_{\textrm{z}}(\textrm{sgn}(f)) + \frac 2 n \textrm{URad}(\textrm{sgn}(\mathcal{F}_{|S})) + 3 \sqrt{\frac{\ln(2/\delta)}{2n}}, and \begin{aligned} \textrm{URad}(\textrm{sgn}(\mathcal{F}_{|S})) &\leq \sqrt{ 2n \ln \textrm{Sh}(\mathcal{F}_{|S})},\\ \ln \textrm{Sh}(\mathcal{F}_{|S}) &\leq \ln \textrm{Sh}(\mathcal{F};n) \leq \textrm{VC}(\mathcal{F}) \ln (n+1). \end{aligned}
Remark 17.2.

To make the bound meaningful, one needs \ln \textrm{Sh}(\mathcal{F}_{|S}) = o(n), so that the complexity term vanishes as n grows.

Minimizing \widehat{\mathcal{R}}_{\textrm{z}} is NP-hard in many trivial cases, but those require noise and neural networks can often get \widehat{\mathcal{R}}_{\textrm{z}}(\textrm{sgn}(f)) = 0.

\textrm{VC}(\mathcal{F})<\infty suffices; many considered this a conceptual breakthrough, namely “learning is possible!”

The quantities (\textrm{VC}, \textrm{Sh}) appeared in prior work (not by V-C). Symmetrization apparently too, though I haven’t dug this up.
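To get a feel for the rates in +Theorem 17.1, the following sketch (illustrative Python; the numbers are hypothetical) evaluates the complexity and concentration terms using \ln\textrm{Sh}(\mathcal{F};n) \leq \textrm{VC}(\mathcal{F})\ln(n+1).

```python
import numpy as np

def vc_generalization_gap(n, vc, delta=0.05):
    """Complexity + concentration terms of the VC theorem,
    using URad <= sqrt(2 n ln Sh) and ln Sh <= VC * ln(n + 1)."""
    urad = np.sqrt(2 * n * vc * np.log(n + 1))
    return 2 / n * urad + 3 * np.sqrt(np.log(2 / delta) / (2 * n))

for n in [10**3, 10**4, 10**5, 10**6]:
    print(f"n={n:>8d}  gap <= {vc_generalization_gap(n, vc=1000):.3f}")
# The bound only becomes nontrivial once n greatly exceeds VC * ln(n).
```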

First step of proof: pull out the zero-one loss.

Lemma 17.1.
\textrm{URad}(\{(x,y)\mapsto \mathbf{1}[\textrm{sgn}(f(x))\neq y] : f\in\mathcal{F}\}_{|S}) \leq \textrm{URad}(\textrm{sgn}(\mathcal{F}_{|S})).

Proof. For each i, define \ell_i(z) := \max\left\{{0,\min\left\{{1, \frac{1-y_i(2z-1)}{2} }\right\}}\right\}, which is 1-Lipschitz, and satisfies \ell_i(\textrm{sgn}(f(x_i))) = \mathbf{1}[ \textrm{sgn}(f(x_i)) \neq y_i ]. (Indeed, it is the linear interpolation.) Then \begin{aligned} &\textrm{URad}(\left\{{(x,y)\mapsto \mathbf{1}[ \textrm{sgn}(f(x))\neq y] : f\in\mathcal{F}}\right\}_{|S}) \\ &=\textrm{URad}(\left\{{(\ell_1(\textrm{sgn}(f(x_1))),\ldots, \ell_n(\textrm{sgn}(f(x_n)))):f\in\mathcal{F}}\right\}_{|S}) \\ &=\textrm{URad}(\ell \circ \textrm{sgn}(\mathcal{F})_{|S}) \\ &\leq \textrm{URad}(\textrm{sgn}(\mathcal{F})_{|S}). \end{aligned}

(The last step is the per-coordinate, vector-valued Lipschitz peeling from before, applied with the 1-Lipschitz functions \ell_i.)

Plugging this into our Rademacher bound: w/ pr \geq 1-\delta, \forall f\in\mathcal{F}, \mathcal{R}_{\textrm{z}}(\textrm{sgn}(f)) \leq \widehat{\mathcal{R}}_{\textrm{z}}(\textrm{sgn}(f)) + \frac 2 n \textrm{URad}(\textrm{sgn}(\mathcal{F})_{|S}) + 3 \sqrt{\frac{\ln(2/\delta)}{2n}}.

The next step is to apply Massart’s finite lemma, giving \textrm{URad}(\textrm{sgn}(\mathcal{F}_{|S})) \leq \sqrt{ 2n \ln \textrm{Sh}(\mathcal{F}_{|S})}.

One last lemma remains for the proof.


Lemma 17.2 (Sauer-Shelah; closely related statements appear in work of Vapnik-Chervonenkis and of Warren) .
Let \mathcal{F} be given, and define V := \textrm{VC}(\mathcal{F}). Then \textrm{Sh}(\mathcal{F};n) \leq \begin{cases} 2^n & \textup{when } n\leq V,\\ \left({\frac{en}{V}}\right)^V & \textup{otherwise}. \end{cases} Moreover, \textrm{Sh}(\mathcal{F};n)\leq n^V + 1.

(Proof. Omitted. Exists in many standard texts.)


17.1 VC dimension of linear predictors

Theorem 17.2.
Define \mathcal{F}:= \left\{{ x \mapsto \textrm{sgn}(\left\langle a, x \right \rangle -b) : a\in\mathbb{R}^d, b\in\mathbb{R}}\right\} (“linear classifiers”/“affine classifier”/ “linear threshold function (LTF)”). Then \textrm{VC}(\mathcal{F}) = d+1.
Remark 17.3.

By Sauer-Shelah, \textrm{Sh}(\mathcal{F};n) \leq n^{d+1} + 1. Anthony-Bartlett chapter 3 gives an exact equality; this only changes constants in \ln\textrm{Sh}(\mathcal{F};n).

Let’s compare to Rademacher: \begin{aligned} \textrm{URad}(\textrm{sgn}(\mathcal{F}_{|S})) & \leq \sqrt{ 2n d \ln(n+1)}, \\ \textrm{URad}(\{ x\mapsto \left\langle w, x \right \rangle : \|w\|\leq R\}_{|S}) &\leq R \|X_S\|_F, \end{aligned} where \|X_S\|_F^2 = \sum_{x\in S} \|x\|_2^2 \leq n \cdot d\cdot \max_{i,j} x_{i,j}^2. One is scale-sensitive (and suggests regularization schemes), the other is scale-insensitive.

Proof. First let’s do the lower bound \textrm{VC}(\mathcal{F})\geq d+1.

Take S := \{0, \mathbf{e}_1, \ldots, \mathbf{e}_d\}, a set of d+1 points. Given any P\subseteq S, define (a,b) as a_i := 2\cdot \mathbf{1}[\mathbf{e}_i \in P] - 1, \qquad b := \frac 1 2 - \mathbf{1}[0 \in P]. Then \begin{aligned} \textrm{sgn}(\left\langle a, \mathbf{e}_i \right \rangle - b) &= \textrm{sgn}( 2 \mathbf{1}[\mathbf{e}_i \in P] - 1 - b) = 2 \mathbf{1}[\mathbf{e}_i \in P] - 1, \\ \textrm{sgn}(\left\langle a, 0 \right \rangle - b) &= \textrm{sgn}( 2 \mathbf{1}[0 \in P] - 1/2) = 2 \mathbf{1}[0 \in P] - 1, \end{aligned} meaning this affine classifier labels S according to P, which was an arbitrary subset.
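This construction can be verified directly; here is a small Python check (illustrative, not part of the notes) that, for S = \{0,\mathbf{e}_1,\ldots,\mathbf{e}_d\}, the prescribed (a,b) realizes every labeling P\subseteq S.

```python
import numpy as np
from itertools import product

d = 4
S = [np.zeros(d)] + [np.eye(d)[i] for i in range(d)]   # {0, e_1, ..., e_d}

def classifier(P_mask):
    # P_mask[k] = 1 iff S[k] is in P (S[0] is the origin).
    a = np.array([2 * P_mask[i + 1] - 1 for i in range(d)], dtype=float)
    b = 0.5 - P_mask[0]
    return a, b

ok = True
for P_mask in product([0, 1], repeat=d + 1):
    a, b = classifier(P_mask)
    labels = [np.sign(a @ x - b) for x in S]
    ok &= all(lab == (1 if in_P else -1) for lab, in_P in zip(labels, P_mask))
print("all", 2 ** (d + 1), "labelings of the d+1 points realized:", ok)
```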

Now let’s do the upper bound \textrm{VC}(\mathcal{F}) < d+2. By the following lemma, any S with |S| = d+2 admits a partition (P,N) with intersecting convex hulls; if some affine classifier were nonnegative on P and negative on N, it would be both nonnegative and negative at a point of \textrm{conv}(P)\cap\textrm{conv}(N), a contradiction, so no such S can be shattered.


Lemma 17.3 (Radon’s Lemma) .
Given S\subseteq \mathbb{R}^d with |S|=d+2, there exists a partition of S into nonempty (P,N) with \textrm{conv}(P)\cap \textrm{conv}(N) \neq \emptyset.

Proof. Let S = \{x_1,\ldots,x_{d+2}\} be given, and define \{u_1,\ldots,u_{d+1}\} as u_i := x_i - x_{d+2}, which must be linearly dependent: there exist (\beta_1,\ldots,\beta_{d+1}), not all zero, with \sum_{i\leq d+1}\beta_i u_i = 0. Defining \beta_{d+2} := -\sum_{i\leq d+1}\beta_i, it follows that \sum_{i=1}^{d+2}\beta_i x_i = \sum_{i\leq d+1}\beta_i u_i = 0 and \sum_{i=1}^{d+2}\beta_i = 0.

Set P:= \{i : \beta_i > 0\}, N := \{i : \beta_i \leq 0\}; since the \beta_i are not all zero and sum to zero, some are strictly positive and some strictly negative, so neither set is empty.

Set \beta := \sum_{i\in P} \beta_i = -\sum_{i\in N} \beta_i > 0.

Since 0= \sum_i \beta_i x_i = \sum_{i\in P} \beta_i x_i + \sum_{i \in N} \beta_i x_i, then \frac 0 \beta = \sum_{i\in P} \frac{\beta_i}{\beta} x_i + \sum_{i \in N} \frac {\beta_i}{\beta} x_i and the point z:= \sum_{i\in P} \beta_i x_i / \beta = \sum_{i \in N} \beta_i x_i / (-\beta) satisfies z\in \textrm{conv}(P)\cap \textrm{conv}(N).
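The argument is constructive; here is a Python sketch (illustrative) that computes a Radon partition for d+2 random points, using the equivalent formulation of finding a nonzero \beta with \sum_i \beta_i x_i = 0 and \sum_i \beta_i = 0 via a null vector, and then checks that the two convex combinations coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
X = rng.normal(size=(d + 2, d))             # d + 2 points in R^d (rows)

# A nonzero beta with sum_i beta_i x_i = 0 and sum_i beta_i = 0:
# a null vector of the (d+1) x (d+2) matrix stacking coordinates and a row of ones.
A = np.vstack([X.T, np.ones(d + 2)])
beta = np.linalg.svd(A)[2][-1]              # right singular vector of the zero singular value

P = np.where(beta > 0)[0]
N = np.where(beta <= 0)[0]
s = beta[P].sum()                           # equals -(sum of beta over N)
z_P = (beta[P][:, None] * X[P]).sum(axis=0) / s
z_N = (beta[N][:, None] * X[N]).sum(axis=0) / (-s)
print("P =", P.tolist(), " N =", N.tolist())
print("common point of conv(P) and conv(N):", np.round(z_P, 4))
print("max |z_P - z_N| =", np.abs(z_P - z_N).max())
```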

Remark 17.4.

Generalizes Minsky-Papert “xor” construction.

Indeed, the first appearance I know of shattering/VC was in approximation theory, the papers of Warren and Shapiro, and perhaps it is somewhere in Kolmogorov’s old papers.

17.2 VC dimension of threshold networks

Consider iterating the previous construction, giving an “LTF network”: a neural network with activation z\mapsto \mathbf{1}[z\geq 0].

We’ll analyze this by studying not just the network output, but the behavior of all nodes.


Definition.
Given an LTF architecture \mathcal{F} with m nodes and examples S = (x_1,\ldots,x_n), let \textrm{Act}(S;\mathcal{F}) denote the set of achievable activation tables: for each choice of parameters, form the matrix in \{0,1\}^{n\times m} whose (i,j) entry is the output of node j on example x_i (ordering the nodes so that the last column is the output node); \textrm{Act}(S;\mathcal{F}) is the set of distinct matrices obtained in this way.

Remark 17.5.

Since last column is the labeling, |\textrm{Act}(S;\mathcal{F})| \geq \textrm{Sh}(\mathcal{F}_{|S}).

\textrm{Act} seems a nice complexity measure, but it is hard to estimate given a single run of an algorithm (say, unlike a Lipschitz constant).

We’ll generalize \textrm{Act} to analyze ReLU networks.

Theorem 17.3.
For any LTF architecture \mathcal{F} with p parameters and any S with |S| = n, \textrm{Sh}(\mathcal{F}_{|S}) \leq |\textrm{Act}(S;\mathcal{F})| \leq (n+1)^p, and hence \textrm{Sh}(\mathcal{F};n) \leq (n+1)^p. If p\geq 12, then \textrm{VC}(\mathcal{F})\leq 6p\ln(p).

Proof.

The proof inductively constructs partitions U_j of the parameters of layers \leq j so that, within each cell, every node in layers \leq j has a fixed zero-one output on every example of S. For the base case, each node of layer 1 is an affine classifier of the inputs, so by +Theorem 17.2 and Sauer-Shelah the parameters of layer 1 (say p_1 of them) can be partitioned into U_1 with |U_1| \leq (n+1)^{p_1} cells of this form.

(Inductive step). Let j\geq 1 be given; the proof will now construct U_{j+1} by refining the partition U_j. Fix any C\in U_j: over C, the inputs to each node of layer j+1 are fixed across all of S, so each such node is an affine classifier of n fixed inputs and realizes at most (n+1)^{(\text{its number of parameters})} output patterns; multiplying over the nodes of layer j+1 (with p_{j+1} parameters in total) and refining C accordingly gives |U_{j+1}| \leq |U_j|\,(n+1)^{p_{j+1}}. Hence |\textrm{Act}(S;\mathcal{F})| \leq |U_L| \leq (n+1)^{p}.

It remains to bound the VC dimension via this Shatter bound: \begin{aligned} &\textrm{VC}(\mathcal{F})<n \\ \Longleftarrow & \forall i \geq n \centerdot \textrm{Sh}(\mathcal{F};i)< 2^i \\ \Longleftarrow & \forall i \geq n \centerdot (i+1)^p < 2^i \\ \iff & \forall i \geq n \centerdot p\ln (i+1) < i \ln 2 \\ \iff & \forall i \geq n \centerdot p < \frac{i \ln (2)}{\ln(i+1)} \\ \Longleftarrow & p < \frac{n \ln (2)}{\ln(n+1)} \end{aligned} If n = 6p\ln(p), \begin{aligned} \frac{n \ln (2)}{\ln(n+1)} &\geq \frac {n\ln(2)}{\ln(2n)} = \frac {6p\ln(p)\ln(2)}{\ln 12 + \ln p + \ln\ln p} \\ &\geq \frac {6p \ln p \ln 2}{3\ln p} > p. \end{aligned}
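As a numerical sanity check of this final step (an illustrative Python sketch), at n = \lceil 6p\ln p\rceil the shatter bound (n+1)^p is indeed below 2^n for the stated range p\geq 12:

```python
import numpy as np

for p in [12, 50, 200, 1000]:
    n = int(np.ceil(6 * p * np.log(p)))
    lhs = p * np.log(n + 1)       # ln of the shatter bound (n+1)^p
    rhs = n * np.log(2)           # ln of 2^n
    print(f"p={p:5d}  n={n:7d}  p*ln(n+1)={lhs:9.1f} < n*ln(2)={rhs:9.1f}: {lhs < rhs}")
```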

Remark 17.6.

We had to handle all i\geq n since the VC dimension is defined via a supremum; one can define funky \mathcal{F} where \textrm{Sh} is not monotonic in n.

Lower bound is \Omega(p \ln m); see Anthony-Bartlett chapter 6 for a proof. This lower bound however is for a specific fixed architecture!

Other VC dimension bounds: ReLU networks have \widetilde{\mathcal{O}}(pL), sigmoid networks have \widetilde{\mathcal{O}}(p^2m^2), and there exists a convex-concave activation which is close to sigmoid but has VC dimension \infty.

Matching lower bounds exist for ReLU, not for sigmoid; but even the “matching” lower bounds are deceptive since they hold for a fixed architecture of a given number of parameters and layers.

17.3 VC dimension of ReLU networks

The ReLU networks of this subsection predict with x\mapsto A_L \sigma_{L-1}\left({A_{L-1} \cdots A_2\sigma_1(A_1x + b_1)+b_2\cdots + b_{L-1}}\right) + b_L, where A_i\in \mathbb{R}^{d_i\times d_{i-1}} and \sigma_i : \mathbb{R}^{d_i}\to\mathbb{R}^{d_i} applies the ReLU z\mapsto\max\{0,z\} coordinate-wise.

Convenient notation: collect data as rows of matrix X\in\mathbb{R}^{n\times d}, and define \begin{aligned} X_0 &:= X^\top & Z_0 &:= \textup{all $1$s matrix},\\ X_i &:= A_i(Z_{i-1} \odot X_{i-1}) + b_i 1_n^\top, & Z_i &:= \mathbf{1}[X_i \geq 0], \end{aligned} where (Z_1,\ldots,Z_L) are the activation matrices.
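The role of the activation matrices can be seen empirically. The following Python sketch (an illustration with hypothetical sizes, not part of the notes) computes (Z_1,\ldots,Z_L) for a small ReLU network on fixed data and counts how many distinct activation patterns appear over random parameter draws; bounding the number of such patterns is exactly what the partition in the next theorem accomplishes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, widths = 20, 3, [8, 8, 1]                  # hypothetical sizes
X = rng.normal(size=(n, d))

def activation_pattern(params, X):
    """Concatenated zero-one activation matrices (Z_1, ..., Z_L) as a hashable tuple."""
    Xi = X.T                                     # X_0, shape (d, n)
    Zs = []
    for A, b in params:
        Xi = A @ Xi + b[:, None]                 # preactivations X_i of this layer
        Z = (Xi >= 0).astype(int)                # Z_i
        Zs.append(Z)
        Xi = Z * Xi                              # Z_i odot X_i feeds the next layer
    return tuple(np.concatenate(Zs, axis=0).flatten())

def random_params():
    sizes = [d] + widths
    return [(rng.normal(size=(sizes[i + 1], sizes[i])), rng.normal(size=sizes[i + 1]))
            for i in range(len(widths))]

patterns = {activation_pattern(random_params(), X) for _ in range(5000)}
print("distinct activation patterns over 5000 random parameter draws:", len(patterns))
```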


Theorem 17.4 ((Theorem 6, P. L. Bartlett et al. 2017)) .
Let fixed ReLU architecture \mathcal{F} be given with p= \sum_{i=1}^L p_i parameters, L layers, m=\sum_{i=1}^L m_i nodes. Let examples (x_1,\ldots,x_n) be given and collected into matrix X. There exists a partition U_L of the parameter space satisfying: within each cell C\in U_L, the activation matrices (Z_1,\ldots,Z_L) are fixed across all parameter choices in C, and |U_L| \leq (12nL)^{pL}. Consequently \textrm{Sh}(\mathcal{F};n) \leq (12nL)^{pL}, and (for p sufficiently large) \textrm{VC}(\mathcal{F}) \leq 6pL\ln(pL).
Remark 17.7 (on the proof) .

As with LTF networks, the proof inductively constructs partitions of the weights up through layer i so that the activations are fixed across all weights in each partition cell.

Consider a fixed cell of the partition, whereby the activations are fixed zero-one matrices. As a function of the inputs, the ReLU network is now an affine function; as a function of the weights it is multilinear or rather a polynomial of degree L.

Consider again a fixed cell and some layer i; thus \sigma(X_i) = Z_i \odot X_i is a matrix of polynomials of degree i (in the weights). If we can upper bound the number of possible signs of A_{i+1} ( Z_i \odot X_i ) + b_{i+1} 1_n^\top, then we can refine our partition of weight space and recurse. For that we need a bound on sign patterns of polynomials, as in the next theorem.

Theorem 17.5 (Warren ’68; see also Anthony-Bartlett Theorem 8.3) .
Let \mathcal{F} denote functions x\mapsto f(x;w) which are r-degree polynomials in w\in\mathbb{R}^p. If n\geq p, then \textrm{Sh}(\mathcal{F};n) \leq 2 (\frac{2enr}{p})^p.
Remark 17.8.
Proof is pretty intricate, and omitted. It relates the VC dimension of \mathcal{F} to the zero sets Z_i := \{ w\in\mathbb{R}^p : f(x_i; w) = 0\}, which it controls with an application of Bezout’s Theorem. The zero-counting technique is also used to obtain an exact shatter coefficient for affine classifiers.

Proof (of ReLU VC bound).

We’ll inductively construct partitions (U_0,\ldots,U_L) where U_i partitions the parameters of layers j\leq i so that for any C\in U_i, the activations Z_j in layer j\leq i are fixed for all parameter choices within C (thus let Z_j(C) denote these fixed activations).

The proof will proceed by induction, showing |U_i| \leq (12nL)^{pi}.

Base case i=0: then U_0 = \{\emptyset\}, Z_0 is all ones, and |U_0| = 1 = (12nL)^{p\cdot 0}.

(Inductive step). Fix a cell C\in U_i, so that the activations (Z_1,\ldots,Z_i) are fixed over C, and (as in +Remark 17.7) each entry of the layer-(i+1) preactivation A_{i+1}(Z_i\odot X_i) + b_{i+1}1_n^\top is a polynomial of degree at most i+1 in the parameters. Applying +Theorem 17.5 to these polynomials bounds the number of their sign patterns over C, each of which determines one possible Z_{i+1} and hence one cell of the refinement U_{i+1}; tracking the constants (as in P. L. Bartlett et al. (2017)) gives |U_{i+1}| \leq (12nL)^{p(i+1)}.

It remains to upper bound the VC dimension via the Shattering bound. As with LTF networks, \begin{aligned} \textrm{VC}(\mathcal{F}) < n & \Longleftarrow \forall i \geq n \centerdot \textrm{Sh}(\mathcal{F};i)< 2^i \\ & \Longleftarrow \forall i \geq n \centerdot (12iL)^{pL} < 2^i \\ & \iff \forall i \geq n \centerdot pL \ln(12iL) < i \ln 2 \\ & \iff \forall i \geq n \centerdot pL < \frac{ i \ln 2 } { \ln(12iL) } \\ & \Longleftarrow pL < \frac{ n \ln 2 } { \ln(12nL) } \\ \end{aligned}

If n = 6 pL \ln(pL), \begin{aligned} \frac {n\ln 2}{\ln(12nL)} &= \frac {6pL \ln(pL)\ln(2)}{\ln(72pL^2\ln(pL))} = \frac {6pL \ln(pL)\ln(2)}{\ln(72) + \ln(pL^2) + \ln \ln(pL)} \\ &\geq \frac {6pL \ln(pL)\ln(2)}{\ln(72) + \ln(pL^2) + \ln(pL) - 1} \geq \frac {6pL \ln(pL)\ln(2)}{3\ln(pL)} \\ &= 2pL \ln 2 > pL, \end{aligned} where the last inequality of the second line uses p sufficiently large (e.g., p \geq 27, so that \ln(72) - 1 \leq \ln p).

Remark 17.9.

If the ReLU is replaced with a piecewise polynomial activation of degree r\geq 2, then each cell of the partition has polynomials of degree r^i rather than i, and the shatter coefficient upper bound scales with L^2 rather than L. The lower bound in this case still has L rather than L^2; it’s not known where the looseness is.

Lower bounds are based on digit extraction, and for each pair (p,L) require a fixed architecture.

References

Allen-Zhu, Zeyuan, and Yuanzhi Li. 2019. “What Can ResNet Learn Efficiently, Going Beyond Kernels?”
Allen-Zhu, Zeyuan, Yuanzhi Li, and Yingyu Liang. 2018. “Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers.” arXiv Preprint arXiv:1811.04918.
Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. 2018. “A Convergence Theory for Deep Learning via over-Parameterization.”
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. 2017. “Wasserstein Generative Adversarial Networks.” In ICML.
Arora, Sanjeev, Nadav Cohen, Noah Golowich, and Wei Hu. 2018a. “A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks.”
———. 2018b. “A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks.”
Arora, Sanjeev, Nadav Cohen, and Elad Hazan. 2018. “On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization.” In Proceedings of the 35th International Conference on Machine Learning, edited by Jennifer Dy and Andreas Krause, 80:244–53. Proceedings of Machine Learning Research. Stockholmsmässan, Stockholm Sweden: PMLR. http://proceedings.mlr.press/v80/arora18a.html.
Arora, Sanjeev, Nadav Cohen, Wei Hu, and Yuping Luo. 2019. “Implicit Regularization in Deep Matrix Factorization.” In Advances in Neural Information Processing Systems, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett, 32:7413–24. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2019/file/c0c783b5fc0d7d808f1d14a6e9c8280d-Paper.pdf.
Arora, Sanjeev, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. 2019. “On Exact Computation with an Infinitely Wide Neural Net.” arXiv Preprint arXiv:1904.11955.
Arora, Sanjeev, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. 2019. “Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks.” arXiv Preprint arXiv:1901.08584.
Arora, Sanjeev, Rong Ge, Behnam Neyshabur, and Yi Zhang. 2018. “Stronger Generalization Bounds for Deep Nets via a Compression Approach.”
Bach, Francis. 2017. “Breaking the Curse of Dimensionality with Convex Neural Networks.” Journal of Machine Learning Research 18 (19): 1–53.
Barron, Andrew R. 1993. “Universal Approximation Bounds for Superpositions of a Sigmoidal Function.” IEEE Transactions on Information Theory 39 (3): 930–45.
Bartlett, Peter L. 1996. “For Valid Generalization, the Size of the Weights Is More Important Than the Size of the Network.” In NIPS.
Bartlett, Peter L., Nick Harvey, Chris Liaw, and Abbas Mehrabian. 2017. “Nearly-Tight VC-Dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks.”
Bartlett, Peter L., and Philip M. Long. 2020. “Failures of Model-Dependent Generalization Bounds for Least-Norm Interpolation.”
Bartlett, Peter L., and Shahar Mendelson. 2002. “Rademacher and Gaussian Complexities: Risk Bounds and Structural Results.” JMLR 3 (November): 463–82.
Bartlett, Peter, Dylan Foster, and Matus Telgarsky. 2017. “Spectrally-Normalized Margin Bounds for Neural Networks.” NIPS.
Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2018. “Reconciling Modern Machine Learning Practice and the Bias-Variance Trade-Off.”
Belkin, Mikhail, Daniel Hsu, and Ji Xu. 2019. “Two Models of Double Descent for Weak Features.”
Bengio, Yoshua, and Olivier Delalleau. 2011. “Shallow Vs. Deep Sum-Product Networks.” In NIPS.
Bietti, Alberto, and Francis Bach. 2020. “Deep Equals Shallow for ReLU Networks in Kernel Regimes.”
Blum, Avrim, and John Langford. 2003. “PAC-MDL Bounds.” In Learning Theory and Kernel Machines, 344–57. Springer.
Borwein, Jonathan, and Adrian Lewis. 2000. Convex Analysis and Nonlinear Optimization. Springer Publishing Company, Incorporated.
Bubeck, Sébastien. 2014. “Theory of Convex Optimization for Machine Learning.”
Cao, Yuan, and Quanquan Gu. 2020a. “Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks.”
———. 2020b. “Generalization Error Bounds of Gradient Descent for Learning over-Parameterized Deep ReLU Networks.”
Carmon, Yair, and John C. Duchi. 2018. “Analysis of Krylov Subspace Solutions of Regularized Nonconvex Quadratic Problems.” In NIPS.
Chaudhuri, Kamalika, and Sanjoy Dasgupta. 2014. “Rates of Convergence for Nearest Neighbor Classification.”
Chen, Ricky T. Q., Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. 2018. “Neural Ordinary Differential Equations.”
Chen, Zixiang, Yuan Cao, Quanquan Gu, and Tong Zhang. 2020. “A Generalized Neural Tangent Kernel Analysis for Two-Layer Neural Networks.”
Chen, Zixiang, Yuan Cao, Difan Zou, and Quanquan Gu. 2019. “How Much over-Parameterization Is Sufficient to Learn Deep ReLU Networks?”
Chizat, Lénaïc, and Francis Bach. 2018. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport.” arXiv e-Prints, May, arXiv:1805.09545. http://arxiv.org/abs/1805.09545.
———. 2019. A Note on Lazy Training in Supervised Differentiable Programming.”
———. 2020. “Implicit Bias of Gradient Descent for Wide Two-Layer Neural Networks Trained with the Logistic Loss.” arXiv:2002.04486 [math.OC].
Cho, Youngmin, and Lawrence K. Saul. 2009. “Kernel Methods for Deep Learning.” In NIPS.
Cisse, Moustapha, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. 2017. “Parseval Networks: Improving Robustness to Adversarial Examples.”
Clarke, Francis H., Yuri S. Ledyaev, Ronald J. Stern, and Peter R. Wolenski. 1998. Nonsmooth Analysis and Control Theory. Springer.
Cohen, Nadav, Or Sharir, and Amnon Shashua. 2016. “On the Expressive Power of Deep Learning: A Tensor Analysis.” In 29th Annual Conference on Learning Theory, edited by Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, 49:698–728. Proceedings of Machine Learning Research. Columbia University, New York, New York, USA: PMLR. http://proceedings.mlr.press/v49/cohen16.html.
Cohen, Nadav, and Amnon Shashua. 2016. “Convolutional Rectifier Networks as Generalized Tensor Decompositions.” In Proceedings of the 33rd International Conference on Machine Learning, edited by Maria Florina Balcan and Kilian Q. Weinberger, 48:955–63. Proceedings of Machine Learning Research. New York, New York, USA: PMLR. http://proceedings.mlr.press/v48/cohenb16.html.
Cybenko, George. 1989. Approximation by superpositions of a sigmoidal function.” Mathematics of Control, Signals and Systems 2 (4): 303–14.
Daniely, Amit. 2017. “Depth Separation for Neural Networks.” In COLT.
Daniely, Amit, and Eran Malach. 2020. “Learning Parities with Neural Networks.”
Davis, Damek, Dmitriy Drusvyatskiy, Sham Kakade, and Jason D. Lee. 2018. “Stochastic Subgradient Method Converges on Tame Functions.”
Diakonikolas, Ilias, Surbhi Goel, Sushrut Karmalkar, Adam R. Klivans, and Mahdi Soltanolkotabi. 2020. “Approximation Schemes for ReLU Regression.”
Du, Simon S., Wei Hu, and Jason D. Lee. 2018. “Algorithmic Regularization in Learning Deep Homogeneous Models: Layers Are Automatically Balanced.”
Du, Simon S., Xiyu Zhai, Barnabas Poczos, and Aarti Singh. 2018. “Gradient Descent Provably Optimizes over-Parameterized Neural Networks.”
Du, Simon S, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. 2018. “Gradient Descent Finds Global Minima of Deep Neural Networks.” arXiv Preprint arXiv:1811.03804.
Du, Simon, and Wei Hu. 2019. “Width Provably Matters in Optimization for Deep Linear Neural Networks.”
Dziugaite, Gintare Karolina, Alexandre Drouin, Brady Neal, Nitarshan Rajkumar, Ethan Caballero, Linbo Wang, Ioannis Mitliagkas, and Daniel M. Roy. 2020. “In Search of Robust Measures of Generalization.”
Dziugaite, Gintare Karolina, and Daniel M. Roy. 2017. “Computing Nonvacuous Generalization Bounds for Deep (stochastic) Neural Networks with Many More Parameters Than Training Data.”
Eldan, Ronen, and Ohad Shamir. 2015. “The Power of Depth for Feedforward Neural Networks.”
Folland, Gerald B. 1999. Real Analysis: Modern Techniques and Their Applications. 2nd ed. Wiley Interscience.
Funahashi, K. 1989. “On the Approximate Realization of Continuous Mappings by Neural Networks.” Neural Netw. 2 (3): 183–92.
Ge, Rong, Jason D. Lee, and Tengyu Ma. 2016. “Matrix Completion Has No Spurious Local Minimum.” In NIPS.
Ghorbani, Behrooz, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. 2020. “When Do Neural Networks Outperform Kernel Methods?”
Goel, Surbhi, Adam Klivans, Pasin Manurangsi, and Daniel Reichman. 2020. “Tight Hardness Results for Training Depth-2 ReLU Networks.”
Golowich, Noah, Alexander Rakhlin, and Ohad Shamir. 2018. “Size-Independent Sample Complexity of Neural Networks.” In COLT.
Gunasekar, Suriya, Jason D Lee, Daniel Soudry, and Nati Srebro. 2018a. “Implicit Bias of Gradient Descent on Linear Convolutional Networks.” In Advances in Neural Information Processing Systems, 9461–71.
Gunasekar, Suriya, Jason Lee, Daniel Soudry, and Nathan Srebro. 2018b. “Characterizing Implicit Bias in Terms of Optimization Geometry.” arXiv Preprint arXiv:1802.08246.
Gunasekar, Suriya, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. 2017. “Implicit Regularization in Matrix Factorization.”
Gurvits, Leonid, and Pascal Koiran. 1995. “Approximation and Learning of Convex Superpositions.” In Computational Learning Theory, edited by Paul Vitányi, 222–36. Springer.
Hanin, Boris, and David Rolnick. 2019. “Deep ReLU Networks Have Surprisingly Few Activation Patterns.”
Hastie, Trevor, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. 2019. “Surprises in High-Dimensional Ridgeless Least Squares Interpolation.”
Hertz, John, Anders Krogh, and Richard G. Palmer. 1991. Introduction to the Theory of Neural Computation. USA: Addison-Wesley Longman Publishing Co., Inc.
Hiriart-Urruty, Jean-Baptiste, and Claude Lemaréchal. 2001. Fundamentals of Convex Analysis. Springer Publishing Company, Incorporated.
Hornik, K., M. Stinchcombe, and H. White. 1989. “Multilayer Feedforward Networks Are Universal Approximators.” Neural Networks 2 (5): 359–66.
Jacot, Arthur, Franck Gabriel, and Clément Hongler. 2018. “Neural Tangent Kernel: Convergence and Generalization in Neural Networks.” In Advances in Neural Information Processing Systems, 8571–80.
Ji, Ziwei. 2020. “Personal Communication.”
Ji, Ziwei, Miroslav Dudík, Robert E. Schapire, and Matus Telgarsky. 2020. “Gradient Descent Follows the Regularization Path for General Losses.” In COLT.
Ji, Ziwei, Justin D. Li, and Matus Telgarsky. 2021. “Early-Stopped Neural Networks Are Consistent.”
Ji, Ziwei, and Matus Telgarsky. 2018. “Gradient Descent Aligns the Layers of Deep Linear Networks.” arXiv:1810.02032 [cs.LG].
———. 2019a. “Polylogarithmic Width Suffices for Gradient Descent to Achieve Arbitrarily Small Test Error with Shallow ReLU Networks.”
———. 2019b. “Risk and Parameter Convergence of Logistic Regression.” In COLT.
———. 2020. “Directional Convergence and Alignment in Deep Learning.” arXiv:2006.06657 [cs.LG].
Ji, Ziwei, Matus Telgarsky, and Ruicheng Xian. 2020. “Neural Tangent Kernels, Transportation Mappings, and Universal Approximation.” In ICLR.
Jiang, Yiding, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. 2020. “Fantastic Generalization Measures and Where to Find Them.” In ICLR.
Jin, Chi, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. 2017. “How to Escape Saddle Points Efficiently.” In ICML.
Jones, Lee K. 1992. “A Simple Lemma on Greedy Approximation in Hilbert Space and Convergence Rates for Projection Pursuit Regression and Neural Network Training.” The Annals of Statistics 20 (1): 608–13.
Kakade, Sham, and Jason D. Lee. 2018. “Provably Correct Automatic Subdifferentiation for Qualified Programs.”
Kamath, Pritish, Omar Montasser, and Nathan Srebro. 2020. “Approximate Is Good Enough: Probabilistic Variants of Dimensional and Margin Complexity.”
Kawaguchi, Kenji. 2016. “Deep Learning Without Poor Local Minima.” In NIPS.
Kolmogorov, A. N., and V. M. Tikhomirov. 1959. \epsilon-Entropy and \epsilon-Capacity of Sets in Function Spaces.” Uspekhi Mat. Nauk 14 (86, 2): 3–86.
Ledoux, M., and M. Talagrand. 1991. Probability in Banach Spaces: Isoperimetry and Processes. Springer.
Lee, Holden, Rong Ge, Tengyu Ma, Andrej Risteski, and Sanjeev Arora. 2017. “On the Ability of Neural Nets to Express Distributions.” In COLT.
Lee, Jason D., Max Simchowitz, Michael I. Jordan, and Benjamin Recht. 2016. “Gradient Descent Only Converges to Minimizers.” In COLT.
Leshno, Moshe, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken. 1993. “Multilayer Feedforward Networks with a Nonpolynomial Activation Function Can Approximate Any Function.” Neural Networks 6 (6): 861–67. http://dblp.uni-trier.de/db/journals/nn/nn6.html#LeshnoLPS93.
Li, Yuanzhi, and Yingyu Liang. 2018. “Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data.”
Long, Philip M., and Hanie Sedghi. 2019. “Generalization Bounds for Deep Convolutional Neural Networks.”
Luxburg, Ulrike von, and Olivier Bousquet. 2004. “Distance-Based Classification with Lipschitz Functions.” Journal of Machine Learning Research.
Lyu, Kaifeng, and Jian Li. 2019. “Gradient Descent Maximizes the Margin of Homogeneous Neural Networks.”
Mei, Song, Andrea Montanari, and Phan-Minh Nguyen. 2018. A Mean Field View of the Landscape of Two-Layers Neural Networks.” arXiv e-Prints, April, arXiv:1804.06561. http://arxiv.org/abs/1804.06561.
Montanelli, Hadrien, Haizhao Yang, and Qiang Du. 2020. “Deep ReLU Networks Overcome the Curse of Dimensionality for Bandlimited Functions.”
Montúfar, Guido, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. 2014. “On the Number of Linear Regions of Deep Neural Networks.” In NIPS.
Moran, Shay, and Amir Yehudayoff. 2015. “Sample Compression Schemes for VC Classes.”
Nagarajan, Vaishnavh, and J. Zico Kolter. 2019. “Uniform Convergence May Be Unable to Explain Generalization in Deep Learning.”
Negrea, Jeffrey, Gintare Karolina Dziugaite, and Daniel M. Roy. 2019. “In Defense of Uniform Convergence: Generalization via Derandomization with an Application to Interpolating Predictors.”
Nesterov, Yurii. 2003. Introductory Lectures on Convex Optimization — a Basic Course. Springer.
Nesterov, Yurii, and B. T. Polyak. 2006. “Cubic Regularization of Newton Method and Its Global Performance.” Math. Program. 108 (1): 177–205.
Neyshabur, Behnam, Srinadh Bhojanapalli, and Nathan Srebro. 2018. “A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks.” In ICLR.
Neyshabur, Behnam, Ryota Tomioka, and Nathan Srebro. 2014. “In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning.” arXiv:1412.6614 [cs.LG].
Nguyen, Quynh, and Matthias Hein. 2017. “The Loss Surface of Deep and Wide Neural Networks.”
Novak, Roman, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. 2018. Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes.” arXiv e-Prints. http://arxiv.org/abs/1810.05148.
Novikoff, Albert B. J. 1962. “On Convergence Proofs on Perceptrons.” In Proceedings of the Symposium on the Mathematical Theory of Automata 12: 615–22.
Oymak, Samet, and Mahdi Soltanolkotabi. 2019. “Towards Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks.” arXiv Preprint arXiv:1902.04674.
Pisier, Gilles. 1980. “Remarques Sur Un résultat Non Publié de b. Maurey.” Séminaire Analyse Fonctionnelle (dit), 1–12.
Rolnick, David, and Max Tegmark. 2017. “The Power of Deeper Networks for Expressing Natural Functions.”
Safran, Itay, and Ohad Shamir. 2016. “Depth-Width Tradeoffs in Approximating Natural Functions with Neural Networks.”
Schapire, Robert E., and Yoav Freund. 2012. Boosting: Foundations and Algorithms. MIT Press.
Schapire, Robert E., Yoav Freund, Peter Bartlett, and Wee Sun Lee. 1997. “Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods.” In ICML, 322–30.
Schmidt-Hieber, Johannes. 2017. “Nonparametric Regression Using Deep Neural Networks with ReLU Activation Function.”
Shallue, Christopher J., Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. 2018. “Measuring the Effects of Data Parallelism on Neural Network Training.”
Shamir, Ohad. 2018. “Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks.” arXiv:1809.08587 [cs.LG].
Shamir, Ohad, and Tong Zhang. 2013. “Stochastic Gradient Descent for Non-Smooth Optimization: Convergence Results and Optimal Averaging Schemes.” In ICML.
Siegelmann, Hava, and Eduardo Sontag. 1994. “Analog Computation via Neural Networks.” Theoretical Computer Science 131 (2): 331–60.
Soudry, Daniel, Elad Hoffer, and Nathan Srebro. 2017. “The Implicit Bias of Gradient Descent on Separable Data.” arXiv Preprint arXiv:1710.10345.
Steinwart, Ingo, and Andreas Christmann. 2008. Support Vector Machines. 1st ed. Springer.
Suzuki, Taiji, Hiroshi Abe, and Tomoaki Nishimura. 2019. “Compression Based Bound for Non-Compressed Network: Unified Generalization Error Analysis of Large Compressible Deep Neural Network.”
Telgarsky, Matus. 2013. “Margins, Shrinkage, and Boosting.” In ICML.
———. 2015. “Representation Benefits of Deep Feedforward Networks.”
———. 2016. “Benefits of Depth in Neural Networks.” In COLT.
———. 2017. “Neural Networks and Rational Functions.” In ICML.
Tzen, Belinda, and Maxim Raginsky. 2019. “Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion Limit.”
Vardi, Gal, and Ohad Shamir. 2020. “Neural Networks with Small Weights and Depth-Separation Barriers.” arXiv:2006.00625 [cs.LG].
Wainwright, Martin J. 2015. “UC Berkeley Statistics 210B, Lecture Notes: Basic tail and concentration bounds.” January 2015. https://www.stat.berkeley.edu/~mjwain/stat210b/.
———. 2019. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. 1st ed. Cambridge University Press.
Wei, Colin, and Tengyu Ma. 2019. “Data-Dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation.”
Weierstrass, Karl. 1885. Über Die Analytische Darstellbarkeit Sogenannter Willkürlicher Functionen Einer Reellen Veränderlichen.” Sitzungsberichte Der Akademie Zu Berlin, 633–39, 789–805.
Yarotsky, Dmitry. 2016. “Error Bounds for Approximations with Deep ReLU Networks.”
Yehudai, Gilad, and Ohad Shamir. 2019. “On the Power and Limitations of Random Features for Understanding Neural Networks.” arXiv:1904.00687 [cs.LG].
———. 2020. “Learning a Single Neuron with Gradient Methods.” arXiv:2001.05205 [cs.LG].
Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. “Understanding Deep Learning Requires Rethinking Generalization.” ICLR.
Zhou, Lijia, D. J. Sutherland, and Nathan Srebro. 2020. “On Uniform Convergence and Low-Norm Interpolation Learning.”
Zhou, Wenda, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. 2018. “Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach.”
Zou, Difan, Yuan Cao, Dongruo Zhou, and Quanquan Gu. 2018. “Stochastic Gradient Descent Optimizes over-Parameterized Deep Relu Networks.”
Zou, Difan, and Quanquan Gu. 2019. “An Improved Analysis of Training over-Parameterized Deep Neural Networks.”