Stochastic Momentum Methods for Non-smooth Non-Convex Finite-Sum Coupled Compositional Optimization
By: Xingyu Chen, Bokun Wang, Ming Yang, and more
Potential Business Impact:
Speeds up the training of machine learning models by reducing the number of optimization steps needed to reach a good solution.
Finite-sum Coupled Compositional Optimization (FCCO), characterized by its coupled compositional objective structure, emerges as an important optimization paradigm for addressing a wide range of machine learning problems. In this paper, we focus on a challenging class of non-convex non-smooth FCCO, where the outer functions are non-smooth weakly convex or convex and the inner functions are smooth or weakly convex. Existing state-of-the-art results face two key limitations: (1) a high iteration complexity of $O(1/\epsilon^6)$ under the assumption that the stochastic inner functions are Lipschitz continuous in expectation; (2) reliance on vanilla SGD-type updates, which are not suitable for deep learning applications. Our main contributions are twofold: (i) we propose stochastic momentum methods tailored for non-smooth FCCO that come with provable convergence guarantees; (ii) we establish a new state-of-the-art iteration complexity of $O(1/\epsilon^5)$. Moreover, we apply our algorithms to non-convex optimization problems with multiple smooth or weakly convex functional inequality constraints. By optimizing a smoothed hinge-penalty-based formulation, we achieve a new state-of-the-art complexity of $O(1/\epsilon^5)$ for finding a (nearly) $\epsilon$-level KKT solution. Experiments on three tasks demonstrate the effectiveness of the proposed algorithms.
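For context, the coupled compositional structure referenced above typically takes the following form; the notation below follows the standard FCCO setup and is an assumption of this summary rather than a quotation from the paper:

$$\min_{x \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} f_i\big(g_i(x)\big), \qquad g_i(x) = \mathbb{E}_{\xi_i}\big[g_i(x; \xi_i)\big],$$

where each outer function $f_i$ is non-smooth weakly convex or convex and each inner function $g_i$ is smooth or weakly convex. The constrained setting mentioned in the abstract, minimizing $f_0(x)$ subject to $f_i(x) \le 0$ for $i = 1, \dots, n$, can be cast in this form through a hinge penalty such as

$$\min_{x} \; f_0(x) + \frac{\beta}{n} \sum_{i=1}^{n} \max\{0, f_i(x)\},$$

where the max term plays the role of the non-smooth outer function; the paper optimizes a smoothed variant, and $\beta$ and the exact smoothing are not specified here.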
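Below is a minimal sketch of one momentum-style update for such compositional objectives, included only to make the mechanics concrete. It is not the paper's algorithm: the moving-average tracker `u` for the inner values, the momentum buffer `v`, the scalar-valued inner functions, and all step-size constants are assumptions of this sketch.

```python
import numpy as np

# Illustrative momentum-style update for a compositional objective
#   (1/n) * sum_i f_i(g_i(x)),  with scalar-valued inner functions g_i.
# NOT the paper's exact method: u, v, and all constants are assumptions.

def fcco_momentum_step(x, u, v, batch, g, grad_g, subgrad_f,
                       gamma=0.1, beta=0.1, eta=0.01):
    """One update over a sampled block `batch` of outer indices.

    x               -- current iterate (np.ndarray, shape (d,))
    u               -- dict: moving-average estimates of inner values g_i(x)
    v               -- momentum (sub)gradient estimate (np.ndarray, shape (d,))
    g(i, x)         -- unbiased stochastic estimate of g_i(x), a scalar
    grad_g(i, x)    -- stochastic gradient of g_i at x (np.ndarray, shape (d,))
    subgrad_f(i, s) -- a subgradient of the outer function f_i at scalar s
    """
    grad_est = np.zeros_like(x)
    for i in batch:
        # Moving average of the inner value controls the bias that arises
        # from composing a noisy estimate of g_i with the nonlinear f_i.
        u[i] = (1.0 - gamma) * u.get(i, g(i, x)) + gamma * g(i, x)
        # Chain rule: a (sub)gradient of f_i(g_i(x)) is f_i'(u_i) * grad g_i(x).
        grad_est += subgrad_f(i, u[i]) * grad_g(i, x)
    grad_est /= len(batch)

    # Momentum (moving average) on the gradient estimate, then a descent step.
    v = (1.0 - beta) * v + beta * grad_est
    x = x - eta * v
    return x, u, v
```

The moving average on `u` is the standard device for taming the compositional bias, while the momentum buffer `v` replaces the vanilla SGD-type update that the abstract identifies as a limitation of prior work.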
Similar Papers
Single-loop Algorithms for Stochastic Non-convex Optimization with Weakly-Convex Constraints
Machine Learning (CS)
Makes AI learn better with fewer steps.
Stochastic Difference-of-Convex Optimization with Momentum
Machine Learning (CS)
Makes computer learning work with smaller batches of data.
Compressed Decentralized Momentum Stochastic Gradient Methods for Nonconvex Optimization
Machine Learning (CS)
Makes many computers learn together faster while sending less data.