University of California, Berkeley
April 7, 2020
A continuing mystery in understanding the empirical success of deep neural networks has been their ability to achieve zero training error and yet generalize well, even when the training data is noisy and there are many more parameters than data points. Following the information-theoretic tradition of seeking understanding, this talk will share our four-part approach to shedding some light on this phenomenon. First, following the tradition of such distilled toy models as the BSC and AWGN channels, the Gaussian source, or scalar linear control systems, we zoom in on the classical linear regression problem in the underdetermined setting with more parameters than training data. Here, the solutions that minimize training error interpolate the data, including the noise. Second, following the tradition of converse bounds, we give a genie-aided bound on how well any interpolating solution can generalize to fresh test data, and show that this bound generically decays to zero (at a known rate) with the number of extra features, thus characterizing an explicit benefit of overparameterization. Third, we discuss what it takes to achieve such harmless interpolation in appropriately overparameterized limits. The key for us turns out to be an appropriate sense of implicit or explicit sparsity. Along the way, we call out a family of key concepts that help us understand what is required to achieve harmless interpolation in general, in particular the ideas of "signal survival" and "contamination." Fourth, we take what we have learned from the regression setting to better understand overparameterized classification, and see how there is a regime in which classification is fundamentally more forgiving than regression, though it can be understood using the same tools.
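The setup described above can be illustrated with a minimal NumPy sketch (an illustration of the general idea, not the speaker's experiments): with more features than noisy training points, the minimum-norm least-squares solution drives training error to zero by interpolating the noise, yet can still generalize to fresh test data when the true signal is sparse.

```python
import numpy as np

# Illustrative sketch: underdetermined linear regression, n samples, d >> n features.
# All parameters (n, d, noise level, sparsity) are chosen for illustration only.
rng = np.random.default_rng(0)
n, d = 20, 200
X = rng.standard_normal((n, d))

# Sparse true signal: only the first coordinate carries signal.
w_true = np.zeros(d)
w_true[0] = 1.0
y = X @ w_true + 0.1 * rng.standard_normal(n)  # noisy training labels

# Minimum-l2-norm interpolator via the Moore-Penrose pseudoinverse.
w_hat = np.linalg.pinv(X) @ y

# Training error is (numerically) zero: the solution interpolates the noise.
train_err = np.mean((X @ w_hat - y) ** 2)
print(f"training error: {train_err:.2e}")

# Generalization to fresh test data drawn from the same feature distribution.
X_test = rng.standard_normal((1000, d))
y_test = X_test @ w_true
test_err = np.mean((X_test @ w_hat - y_test) ** 2)
print(f"test error: {test_err:.3f}")
```

Sweeping `d` upward in such a sketch is one way to probe how the number of extra features affects test error, in the spirit of the bounds described in the abstract.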