Rachel Ward
University of Texas, Austin
"Learning the learning rate in stochastic gradient descent"
Finding a proper learning rate in stochastic optimization is an important problem. Choosing a learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can cause the loss function to fluctuate around the minimum or even to diverge. In practice, the learning rate is often tuned by hand for different problems at hand. Several methods have been proposed recently for automatic adjustment of the learning rate according to gradient data that is received along the way. We review these methods, and propose a simple method, inspired by reparametrization of the loss function in polar coordinates. We prove that the proposed method achieves optimal oracle convergence rates in batch and stochastic settings, but without having to know certain parameters of the loss function in advance.
Host: Qiang Du