Closed-Form Last Layer Optimization
Neural networks are typically optimized with variants of stochastic gradient descent. Under a squared loss, however, the optimal linear last-layer weights are known in closed form. We propose to leverage this during optimization by treating the last layer as a function of the backbone parameters and optimizing solely over those parameters. We show that this is equivalent to alternating between gradient descent steps on the backbone and closed-form updates of the last layer. We adapt the method to the stochastic gradient descent setting by trading off the loss on the current batch against the information accumulated from previous batches. We provide theoretical analyses showing that the method converges to an optimal solution in the neural tangent kernel regime and quantifying its gains over standard SGD in a one-step analysis. Finally, we demonstrate the effectiveness of our approach compared with SGD and Adam under a squared loss on several regression tasks, including neural operators and causal inference.
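The alternating scheme described in the abstract can be illustrated with a small full-batch sketch. This is not the authors' implementation: the one-hidden-layer tanh backbone, the ridge term `ridge` (added so the linear solve is well posed), and the names `closed_form_last_layer` and `train` are all assumptions made for illustration. Each iteration solves for the last-layer weights in closed form, then takes one gradient step on the backbone with those weights held fixed.

```python
import numpy as np


def closed_form_last_layer(features, targets, ridge=1e-2):
    """Minimize mean((F w - y)^2) + ridge * ||w||^2 over w in closed form.

    The minimizer solves (F^T F + ridge * n * I) w = F^T y.
    The ridge term is an assumption of this sketch, not taken from the paper.
    """
    n, d = features.shape
    gram = features.T @ features + ridge * n * np.eye(d)
    return np.linalg.solve(gram, features.T @ targets)


def objective(features, targets, w, ridge=1e-2):
    """Regularized squared loss for fixed features and last-layer weights."""
    resid = features @ w - targets
    return float(np.mean(resid ** 2) + ridge * np.sum(w ** 2))


def train(X, y, hidden=16, steps=200, lr=0.01, ridge=1e-2, seed=0):
    """Alternate closed-form last-layer solves with backbone gradient steps.

    Hypothetical toy setup: backbone features are tanh(X @ W).
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / np.sqrt(X.shape[1]), size=(X.shape[1], hidden))
    history = []
    for _ in range(steps):
        F = np.tanh(X @ W)                        # backbone features
        w = closed_form_last_layer(F, y, ridge)   # closed-form update
        history.append(objective(F, y, w, ridge))
        resid = F @ w - y
        # Gradient of the squared loss w.r.t. W with w held fixed.
        dF = (2.0 / len(y)) * np.outer(resid, w)
        dW = X.T @ (dF * (1.0 - F ** 2))          # tanh'(z) = 1 - tanh(z)^2
        W -= lr * dW                              # gradient step on backbone
    F = np.tanh(X @ W)
    w = closed_form_last_layer(F, y, ridge)       # final last-layer solve
    return W, w, history
```

Calling `train(X, y)` on a small regression dataset returns the backbone weights, the final last-layer weights, and the per-step objective values. The stochastic variant mentioned in the abstract, which balances the current batch against information from earlier batches, is not covered by this full-batch sketch.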
Forward citations
Cited by 1 Pith paper
Doubly Robust Proxy Causal Learning with Neural Mean Embeddings
A neural doubly robust proxy causal learning framework using mean embeddings for treatment bridges provides consistent estimators for causal dose-response functions under unobserved confounding for continuous and stru...