Learning algorithms work exceptionally well in practice, but we have yet to find a coherent mathematical foundation explaining when they work and how to improve their performance. The challenge is that most learning algorithms rely on fitting highly nonlinear models via simple nonconvex optimization heuristics, and outside a few special cases, there is no guarantee they will find global optima. Despite this lack of guarantees and worst-case NP-hardness, simple heuristics often succeed, and over the last few years, I have studied why and when they do.
I spend a lot of time thinking about neural networks. I am particularly interested in whether we can provide provable convergence guarantees to standard training algorithms or substantially improve existing methods. Deep networks fall outside the scope of classical optimization theory since they lead to problems that lack conventionally helpful notions of convexity or smoothness. Taking the inherent nonsmooth structure of neural networks seriously is crucial to understand these methods. I study this structure and the associated algorithms by using and developing tools from several disciplines, including nonsmooth/variational analysis, tame geometry, and high-dimensional statistics.
While neural networks are nonsmooth, they are not pathological: they are built from just a few simple components, like polynomials, exponentials, logs, maxes, mins, and absolute values. The best model of such non-pathological functions available in optimization is the so-called tame class, which appears in several of my papers and precludes Cantor function-esque behavior. I have spent much time trying to uncover notions of beneficial “partial” smoothness in tame optimization problems in order to exploit this structure in algorithms.
While tame problems comprise virtually all tasks of interest, they lack enough structure to endow simple iterative methods with global efficiency guarantees. A class with more structure, which I view as a stepping stone between convex functions and general neural networks, is the so-called weakly convex class. These are functions that differ from convex functions by a simple quadratic. This class is deceptively simple yet surprisingly broad. It includes, for example, all C^2 smooth functions (on compact sets) and all compositions of Lipschitz convex functions with smooth mappings: h(c(x)). These losses appear throughout data science, particularly in low-rank matrix recovery problems (e.g., matrix completion and sensing).
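To make the definition concrete, here is a small numerical sketch. It assumes the common convention that f is ρ-weakly convex when f(x) + (ρ/2)x² is convex, and it uses a toy composite function f(x) = |x² − 1| of the form h(c(x)), with h = |·| and c(x) = x² − 1. The function, the value ρ = 2, and the midpoint test are illustrative choices, not taken from the papers.

```python
import random

RHO = 2.0  # assumed weak-convexity constant for this toy example


def f(x):
    """Toy composite loss h(c(x)) with h = |.| and c(x) = x^2 - 1."""
    return abs(x * x - 1.0)


def f_plus_quadratic(x):
    """f plus the quadratic (RHO / 2) * x^2, which should be convex."""
    return f(x) + 0.5 * RHO * x * x


def violates_midpoint_convexity(func, a, b, tol=1e-12):
    """True if func lies above its chord at the midpoint of [a, b]."""
    mid = 0.5 * (a + b)
    return func(mid) > 0.5 * (func(a) + func(b)) + tol


random.seed(0)
pairs = [(random.uniform(-3, 3), random.uniform(-3, 3)) for _ in range(10000)]

# Count how many random pairs witness a failure of convexity.
f_violations = sum(violates_midpoint_convexity(f, a, b) for a, b in pairs)
g_violations = sum(
    violates_midpoint_convexity(f_plus_quadratic, a, b) for a, b in pairs
)
print("violations for f:", f_violations)
print("violations for f + (rho/2) x^2:", g_violations)
```

The function f itself fails the midpoint test (for instance between −1 and 1), while adding the quadratic removes every violation, which is what "differing from a convex function by a simple quadratic" means in this one-dimensional case.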
My group has been working towards understanding the convergence of simple iterative methods, such as the stochastic subgradient method (SGD), on the tame and weakly convex problem classes. We have also been working towards designing methods that outperform SGD.
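As a rough illustration of the kind of method in question, the following is a minimal sketch of a stochastic subgradient iteration on a one-dimensional, phase-retrieval-style loss. The toy problem, the bounded measurement coefficients, the starting point, and the 1/sqrt(k) step-size schedule are all assumptions made for this example; they are not the settings analyzed in the papers.

```python
import math
import random

random.seed(0)

X_TRUE = 2.0  # hypothetical ground-truth signal for this toy problem
# Measurement coefficients a_i, kept in a bounded range for a well-behaved demo.
A = [random.uniform(0.5, 1.5) for _ in range(200)]
B = [(a * X_TRUE) ** 2 for a in A]  # noiseless measurements b_i = (a_i x)^2


def objective(x):
    """f(x) = (1/m) * sum_i |(a_i x)^2 - b_i|: nonsmooth and nonconvex."""
    return sum(abs((a * x) ** 2 - b) for a, b in zip(A, B)) / len(A)


def sign(t):
    return (t > 0) - (t < 0)


def stochastic_subgradient(x, a, b):
    """A subgradient of the single-sample loss |(a x)^2 - b| at x."""
    residual = (a * x) ** 2 - b
    return sign(residual) * 2.0 * a * a * x


def run(x0, num_steps=5000, step0=0.1):
    """Sample one measurement per step and move against its subgradient."""
    x = x0
    for k in range(num_steps):
        i = random.randrange(len(A))
        step = step0 / math.sqrt(k + 1)  # assumed decaying step size
        x -= step * stochastic_subgradient(x, A[i], B[i])
    return x


x_start = 3.0
x_final = run(x_start)
print("final iterate:", x_final)
print("objective at start and end:", objective(x_start), objective(x_final))
```

Each step uses a single random measurement rather than the full sum. In this toy setting the iterate settles near the ground truth, oscillating within a band that shrinks with the step size.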
I will briefly summarize some of the contributions of my group. For those interested, you can find a brief technical introduction to some of my papers in the expository note:
Subgradient methods under weak convexity and tame geometry Damek Davis, Dmitriy Drusvyatskiy SIAG/OPT Views and News (2020)
We developed the first first-order method that (locally) converges nearly linearly (i.e., exponentially fast) on “generic” tame problems. This result shows that we can exponentially(!) surpass the “speed limit” of gradient methods derived by Nemirovski and Yudin in the 80s – if we wait a bit. The result applies to “almost every problem” in practice. We found this super surprising!
A nearly linearly convergent first-order method for nonsmooth functions with quadratic growth Damek Davis, Liwei Jiang Foundations of Computational Mathematics (to appear) | code | Twitter thread
We developed the first algorithm that (locally) converges nearly superlinearly (i.e., double exponentially fast) on “generic” tame equations.
A superlinearly convergent subgradient method for sharp semismooth problems Vasileios Charisopoulos, Damek Davis Mathematics of Operations Research (2023) | code | Twitter thread
We showed that the stochastic subgradient method (e.g., as implemented with backpropagation) converges to first-order critical points on virtually any neural network.
Stochastic subgradient method converges on tame functions Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, Jason D. Lee Foundations of Computational Mathematics (2018) | Talk
We proved the first sample/computational efficiency guarantees for the stochastic subgradient method on the weakly convex class.
Stochastic model-based minimization of weakly convex functions Damek Davis, Dmitriy Drusvyatskiy SIAM Journal on Optimization (2018) | blog
Proximally Guided Stochastic Subgradient Method for Nonsmooth, Nonconvex Problems Damek Davis, Benjamin Grimmer SIAM Journal on Optimization (2018) | code
We developed the concept of an avoidable nonsmooth saddle point: a nonoptimal point that algorithms may approach. The proper formulation of this concept is well known in C^2 smooth optimization but was missing even for C^1 functions. We showed that both first-order and proximal methods do not converge to these points on “generic” tame problems:
Talk: avoiding saddle points in nonsmooth optimization Updated (11/2021) | video
Proximal methods avoid active strict saddles of weakly convex functions Damek Davis, Dmitriy Drusvyatskiy Foundations of Computational Mathematics (2021)
Escaping strict saddle points of the Moreau envelope in nonsmooth optimization Damek Davis, Mateo Díaz, Dmitriy Drusvyatskiy SIAM Journal on Optimization (2022)
Active manifolds, stratifications, and convergence to local minima in nonsmooth optimization Damek Davis, Dmitriy Drusvyatskiy, Liwei Jiang Manuscript (2022)
We characterized the asymptotic distribution of the error sequence in stochastic subgradient methods, proving it is asymptotically normal with “optimal covariance” on “generic” tame problems.
Asymptotic normality and optimality in nonsmooth stochastic approximation Damek Davis, Dmitriy Drusvyatskiy, Liwei Jiang Manuscript (2023)
We achieved the first sample-complexity-optimal and computationally optimal methods for several low-rank matrix recovery problems based on nonsmooth weakly convex formulations. Nonsmoothness was crucial to establishing these rates since prior smooth formulations suffered from “poor conditioning.”
Composite optimization for robust rank one bilinear sensing Vasileios Charisopoulos, Damek Davis, Mateo Díaz, Dmitriy Drusvyatskiy IMA Journal on Information and Inference (2020) | code
The nonsmooth landscape of phase retrieval Damek Davis, Dmitriy Drusvyatskiy, Courtney Paquette IMA Journal on Numerical Analysis (2017) | Talk
Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence Vasileios Charisopoulos, Yudong Chen, Damek Davis, Mateo Díaz, Lijun Ding, Dmitriy Drusvyatskiy Foundations of Computational Mathematics (2019) | code
Besides my work on nonconvex learning algorithms, I have also worked on clustering and convex optimization algorithms.
Clustering is a fundamental statistical problem of dividing a dataset into two or more groups. Our work on this topic focuses on the classical setting wherein both clusters follow a Gaussian distribution with identical covariance but distinct means. When the covariance matrix is known or “nearly spherical,” there are efficient algorithms to perform the clustering and achieve the “Bayes-optimal error rate.” When the covariance is unknown or poorly conditioned, no known algorithms achieve the Bayes-optimal rate.
Our contribution to this topic is a surprising dichotomy for clustering with an unknown covariance matrix: on the one hand, the maximum likelihood estimator uncovers the correct clustering and achieves the Bayes-optimal error; on the other, we give evidence that no known algorithm can compute the maximum likelihood estimator unless one increases the number of samples by an order of magnitude. Thus, we conjecture that there is a statistical-to-computational gap for this classical statistical problem.
Clustering a Mixture of Gaussians with Unknown Covariance Damek Davis, Mateo Díaz, Kaizheng Wang Manuscript (2021)
For my PhD thesis, I focused on a class of convex optimization algorithms called operator-splitting methods. An operator-splitting method is a technique for writing the solution of a “structured” convex optimization problem as the fixed point of a well-behaved nonlinear operator. Algorithmically, one then finds the fixed point of the operator through, e.g., the classical fixed-point iteration. My best-known contributions to the topic include (1) the “Three-Operator-Splitting” method, which has been widely used throughout computational imaging, and (2) my work establishing the convergence rates of several classical splitting methods, such as the Douglas-Rachford splitting method and the Alternating Direction Method of Multipliers (ADMM).
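The fixed-point viewpoint can be sketched in a few lines. The example below is a minimal illustration of a three-operator-type iteration for minimizing f(x) + g(x) + h(x), where f and g are handled through their proximal maps and h through its gradient. The toy problem (nonnegativity constraint, l1 penalty, least-squares term), the step size `gamma`, the relaxation parameter `relax`, and all function names are assumptions chosen for this example.

```python
def soft_threshold(v, t):
    """Proximal map of t * |.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def prox_g(z, gamma):
    """Proximal map of gamma * g, with g(x) = ||x||_1."""
    return [soft_threshold(v, gamma) for v in z]


def prox_f(z, gamma):
    """Proximal map of the indicator of {x >= 0}: a projection (gamma unused)."""
    return [max(0.0, v) for v in z]


def grad_h(x, b):
    """Gradient of h(x) = 0.5 * ||x - b||^2, which is 1-Lipschitz."""
    return [xi - bi for xi, bi in zip(x, b)]


def three_operator_splitting(b, gamma=0.5, relax=1.0, num_iters=200):
    """Iterate z <- z + relax * (x_f - x_g) until (approximately) fixed."""
    z = [0.0] * len(b)
    for _ in range(num_iters):
        x_g = prox_g(z, gamma)
        grad = grad_h(x_g, b)
        # Reflect through x_g and take a gradient step on h before applying prox_f.
        reflected = [
            2.0 * xg - zi - gamma * gi for xg, zi, gi in zip(x_g, z, grad)
        ]
        x_f = prox_f(reflected, gamma)
        z = [zi + relax * (xf - xg) for zi, xf, xg in zip(z, x_f, x_g)]
    return prox_g(z, gamma)


solution = three_operator_splitting([3.0, -1.0])
print(solution)
```

For this toy instance the minimizer can be computed by hand as (2, 0), and the iteration approaches it. The step size is taken below 2/L, where L = 1 is the Lipschitz constant of the gradient of h in this example.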
A Three-Operator Splitting Scheme and its Optimization Applications Damek Davis, Wotao Yin Set-Valued and Variational Analysis (2017)
Convergence rate analysis of several splitting schemes Damek Davis, Wotao Yin Splitting Methods in Communication and Imaging, Science and Engineering (2017)