Description

This January 2019 academic paper addresses the common issue of poor generalization in adaptive gradient methods such as Adam compared to stochastic gradient descent (SGD) with momentum. The authors demonstrate that L2 regularization and weight decay, which coincide for standard SGD, are not equivalent for adaptive optimizers, so the usual L2-based implementation of weight decay is less effective in Adam. They propose a simple modification, decoupled weight decay, which separates the weight decay step from the gradient-based update; applied to Adam, this yields AdamW. Empirical results show that AdamW substantially improves Adam's generalization on image classification tasks and simplifies hyperparameter tuning, since the optimal weight decay factor becomes largely independent of the learning rate. The paper also introduces AdamWR, which adds warm restarts to further improve AdamW's anytime performance, ultimately making Adam competitive with SGD with momentum.
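To make the decoupling concrete, below is a minimal NumPy sketch of a single update step. It contrasts the two approaches the paper compares: an `l2` penalty folded into the gradient (and therefore rescaled by Adam's adaptive denominator) versus a `weight_decay` term applied directly to the weights after the adaptive step, in the spirit of AdamW. The function name, default hyperparameters, and the choice to scale the decay by the learning rate are illustrative assumptions, not taken verbatim from the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, l2=0.0, weight_decay=0.0):
    """One illustrative parameter update.

    - `l2` enters the gradient, so the penalty is rescaled by the
      adaptive denominator along with the rest of the gradient.
    - `weight_decay` shrinks the weights directly, outside the
      adaptive update (decoupled weight decay, AdamW-style).
    """
    # L2 regularization: penalty is added to the gradient and thus
    # flows into the first and second moment estimates.
    g = grad + l2 * theta

    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    # Adaptive gradient step.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)

    # Decoupled weight decay: applied to the weights directly,
    # untouched by the adaptive scaling above.
    theta = theta - lr * weight_decay * theta

    return theta, m, v

# Toy usage on a quadratic loss ||theta||^2 (gradient is 2 * theta).
theta, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, weight_decay=1e-2)
```

In this sketch, setting `l2 > 0` reproduces the conventional L2-regularized Adam update, while setting `weight_decay > 0` with `l2 = 0` reproduces the decoupled behavior; the key difference is that the decay term never passes through the square-root denominator.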

Source:

https://arxiv.org/pdf/1711.05101