Time & Place. TR 3:30–4:45, Siebel 1304.
Course staff.
Instructor Matus Telgarsky. Office hours: Siebel 3212, T 5–6.
TA Ruicheng Xian. Office hours: common area outside Siebel 3102, M 2:30–4:30.
Essential URLs.
You will submit all homework on gradescope (self-enrollment code M4D6X6). Gradescope is also where you can view homework and exam grades, and make regrade requests. No late homework; homework must be typeset.
All discussion and announcements take place on piazza.
Academic integrity.
All homework is submitted individually, and must be in your own words.
For homeworks 1–2, you may discuss only at a high level with up to three classmates; please list their NetIDs on the first page of your homework. Everyone must still submit an individual writeup, and it must be in your own words; indeed, your discussions with classmates should be at a high enough level that anything other than your own words would be impossible.
We prefer you do not dig around for homework solutions; if you do rely upon external resources, cite them, and still write your solutions in your own words. Please see Jeff Erickson’s discussion of academic integrity.
When integrity violations are found, they will be submitted to the department’s evaluation board.
Evaluation.
Your course grade is 5% hw0, 35% hw1, 35% hw2, and 25% paper presentation.
Please write clean, succinct, unambiguous solutions.
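As a worked example of the weighting above, a minimal sketch in Python; the individual scores are invented for illustration.

```python
# Course grade as a weighted average of the four components.
# The scores below are hypothetical.
weights = {"hw0": 0.05, "hw1": 0.35, "hw2": 0.35, "presentation": 0.25}
scores = {"hw0": 100, "hw1": 90, "hw2": 80, "presentation": 95}

grade = sum(weights[k] * scores[k] for k in weights)
print(round(grade, 2))  # → 88.25
```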
Homework:
Absolutely no late homework.
Homework must be typed, though you are free to use whatever typesetting software you wish (latex, markdown, google doc, ms word, …).
Paper presentation (full information below):
The last 4 classes will be individual student presentations. I’ll determine the presentation length and schedule on the day proposals are handed in (November 7).
Presentations are required to follow the template provided below, which includes: relationship to other literature, relationship to the course, main theorem, and proof sketch.
Papers must be from the list below.
Readings and prerequisites.
Basic machine learning, for instance the material in my CS 446 course.
“Understanding Machine Learning”, by Shai Shalev-Shwartz and Shai Ben-David, can be downloaded from that page and is free for personal use. I think this book is a wonderful resource; I find its presentation very clear, direct, and minimal.
Schedule (tentative; check back often!).
Date | Topic | Assignments
8/27 | No lecture! | hw0 handout, hw0 template
8/29 | No lecture! |
9/3 | Introduction; initial constructive approximations |
9/5 | Constructive 3- and nonconstructive 2-layer apx | hw0 due
9/10 | Infinite width 1: univariate, Maurey |
9/12 | Infinite width 2: Barron |
9/17, 9/19, 9/24 | Infinite width 3: Neural Tangent Kernel (NTK) |
9/26, 10/1 | Benefits of depth: basic separations; Sobolev spaces |
10/3, 10/8, 10/10, 10/15 | Optimization basics: smooth optimization, nondifferentiabilities and differential inclusion, stochastic optimization | hw1 handout, hw1 template
10/17, 10/22, 10/24 | Shallow and deep network optimization (various vignettes) | project proposal out
10/29, 10/31, 11/5, 11/7 | Rademacher bounds for deep networks (for basics of concentration, see also my ml theory course and Martin Wainwright’s concentration notes) | hw1 due; hw2 handout, hw2 template; project proposal due
11/12, 11/14 | Covering number bounds for deep networks |
Omitted lectures: VC bounds for deep networks.
11/19 | Presentations: [14] Philip Amortila, [28] John Anthony Pavlik, [8] Gao Tang, [1] Zihao Yang, [26] Linyi Li, [2] Erchi Wang, [6] Vasileios Livanos |
11/21 | Presentations: [3] Ziwei Ji, [17] Lihui Liu, [23] Yibo Zhang, [18] Amnon Attali, [9] Tengyang Xie, [19] Junyeob Lim, [16] Eddie Huang |
12/3 | No lecture! |
12/5 | Presentations: [24] Qinghai Zhou, [31] Efthymios Tzinis, [12] Si Zhang, [13] Gang Qiao, [10] Priyank Agrawal, [27] Shiliang Zuo, [20] Sahand Mozaffari |
12/10 | Presentations: [15] Wentao Yang, [7] Mohit Vyas, [22] Dimitrios Gotsis, [11] Jian Kang, [29] Raphael Long, [21] Christian Howard |
12/17 | (No lecture.) | hw2 due
The project is a paper presentation.
Deliverables (failure to complete any of these three parts results in a project grade of 0).
A project proposal is due November 7 at noon (on gradescope!).
You don’t need to prepare a pdf, but will simply answer some questions on gradescope; look for the assignment titled “project proposal”.
Your proposal will include: (a) a ranked list of at least the first 29 papers below, (b) any presentation date constraints you may have.
I will use the stable marriage algorithm to assign papers, thus it’s in your best interest to give a good ranked list!
I realize that some papers are easier than others; just do your best and try to learn something and teach it to the class.
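For the curious, the matching step can be sketched with the Gale-Shapley (stable marriage) algorithm as below. This is not the actual assignment code: the student names, paper labels, and the papers' "tiebreak" orderings are all invented for illustration.

```python
# Hypothetical sketch of paper assignment via Gale-Shapley: students
# "propose" to papers in their ranked order; each paper tentatively
# holds its best proposer according to a tiebreak ordering
# (earlier in the list = preferred).

def stable_match(student_prefs, paper_prefs):
    free = list(student_prefs)                 # students not yet matched
    next_pick = {s: 0 for s in student_prefs}  # index of next paper to try
    held = {}                                  # paper -> student it holds
    while free:
        s = free.pop()
        p = student_prefs[s][next_pick[s]]
        next_pick[s] += 1
        if p not in held:
            held[p] = s
        else:
            rival = held[p]
            if paper_prefs[p].index(s) < paper_prefs[p].index(rival):
                held[p] = s                    # paper upgrades to s
                free.append(rival)             # rival re-enters the pool
            else:
                free.append(s)                 # s keeps proposing elsewhere
    return {s: p for p, s in held.items()}

students = {"alice": ["A", "B"], "bob": ["A", "B"]}      # ranked paper lists
papers = {"A": ["bob", "alice"], "B": ["alice", "bob"]}  # tiebreak orders
print(stable_match(students, papers))  # → {'bob': 'A', 'alice': 'B'}
```

Both students rank paper A first, but A's tiebreak favors bob, so alice gets her second choice; stability means no student and paper would both prefer each other over their assigned match.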
Project slides are due at noon on the day you present (on gradescope!).
Please see the provided template: here as pdf, here as markdown.
You are required to follow the template, though you are free to use other typesetting software (e.g., latex (directly), powerpoint, etc.), so long as you hand in a PDF.
All presentations will run on my own laptop, which is why slides need to be turned in by noon.
You will present in class. The date and duration of your presentation will be announced on November 7.
Further info.
The project is individual; no groups.
Your paper must come from the following list; in the proposal, provide a ranked list of papers, identified by their number in this list:
“Compression based bound for noncompressed network: unified generalization error analysis of large compressible deep neural network”. Taiji Suzuki. arXiv:1909.11274 [cs.LG].
“Breaking the Curse of Dimensionality with Convex Neural Networks”. Francis Bach. arXiv:1412.8690 [cs.LG].
“Generalization Bounds for Neural Networks via Approximate Description Length”. Amit Daniely, Elad Granot. arXiv:1910.05697 [cs.LG].
“Gradient Descent Maximizes the Margin of Homogeneous Neural Networks”. Kaifeng Lyu, Jian Li. arXiv:1906.05890 [cs.LG].
“Linearized two-layers neural networks in high dimension”. Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari. arXiv:1904.12191 [math.ST].
“Stochastic subgradient method converges on tame functions”. Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, Jason D. Lee. arXiv:1804.07795 [math.OC].
“Beyond Linearization: On Quadratic and Higher-Order Approximation of Wide Neural Networks”. Yu Bai, Jason D. Lee. arXiv:1910.01619 [cs.LG].
“The Complexity of Making the Gradient Small in Stochastic Convex Optimization”. Dylan J. Foster, Ayush Sekhari, Ohad Shamir, Nathan Srebro, Karthik Sridharan, Blake Woodworth. arXiv:1902.04686 [cs.LG].
“Nonparametric regression using deep neural networks with ReLU activation function”. Johannes Schmidt-Hieber. arXiv:1708.06633 [math.ST].
“What Can ResNet Learn Efficiently, Going Beyond Kernels?”. Zeyuan Allen-Zhu, Yuanzhi Li. arXiv:1905.10337 [cs.LG].
“Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel”. Colin Wei, Jason D. Lee, Qiang Liu, Tengyu Ma. arXiv:1810.05369 [stat.ML].
“On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport”. Lenaic Chizat, Francis Bach. arXiv:1805.09545 [math.OC].
“The loss surface of deep and wide neural networks”. Quynh Nguyen, Matthias Hein. arXiv:1704.08045 [cs.LG].
“Recovering a feedforward net from its output”. Charles Fefferman, Scott Markel. NIPS 1993.
“Learning One Convolutional Layer with Overlapping Patches”. Surbhi Goel, Adam Klivans, Raghu Meka. arXiv:1802.02547 [cs.LG].
“‘Convex Until Proven Guilty’: Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions”. Yair Carmon, Oliver Hinder, John C. Duchi, Aaron Sidford. arXiv:1705.02766 [math.OC].
“Reconciling modern machine learning practice and the bias-variance trade-off”. Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal. arXiv:1812.11118 [stat.ML]. “Two models of double descent for weak features”. Mikhail Belkin, Daniel Hsu, Ji Xu. arXiv:1903.07571 [cs.LG].
“Compressed Sensing with Deep Image Prior and Learned Regularization”. Dave Van Veen, Ajil Jalal, Mahdi Soltanolkotabi, Eric Price, Sriram Vishwanath, Alexandros G. Dimakis. arXiv:1806.06438 [stat.ML].
“Towards moderate overparameterization: global convergence guarantees for training shallow neural networks”. Samet Oymak, Mahdi Soltanolkotabi. arXiv:1902.04674 [cs.LG].
“Universal Approximation of Input-Output Maps by Temporal Convolutional Nets”. Joshua Hanson, Maxim Raginsky. arXiv:1906.09211 [cs.LG].
“Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks”. Spencer Frei, Yuan Cao, Quanquan Gu. arXiv:1910.02934 [cs.LG].
“Approximating Continuous Functions by ReLU Nets of Minimal Width”. Boris Hanin, Mark Sellke. arXiv:1710.11278 [stat.ML].
“Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks”. Ohad Shamir. arXiv:1809.08587 [cs.LG].
“Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks”. Peter L. Bartlett, David P. Helmbold, Philip M. Long. arXiv:1802.06093 [cs.LG].
“Deep learning is adaptive to intrinsic dimensionality of model smoothness in anisotropic Besov space”. Taiji Suzuki, Atsushi Nitanda. arXiv:1910.12799 [stat.ML].
“Denoising and Regularization via Exploiting the Structural Bias of Convolutional Generators”. Reinhard Heckel, Mahdi Soltanolkotabi. arXiv:1910.14634 [cs.LG].