— Ch. 1 · The 1847 Suggestion —
Gradient descent.
~2 min read · Ch. 1 of 5
Augustin-Louis Cauchy first suggested the method in 1847. He proposed taking repeated steps opposite to the gradient of a function at its current point, the direction of steepest descent at that point. The idea was simple yet powerful for minimizing differentiable multivariate functions. Jacques Hadamard independently proposed a similar approach in 1907, and Haskell Curry studied its convergence properties for non-linear problems starting in 1944. These early mathematicians laid the groundwork for what would become a cornerstone of modern optimization.
Mathematical Mechanics and Steps
A multi-variable function decreases fastest when moving from a point along the negative gradient vector. The next point is found by subtracting the step size times this gradient from the current position. The sequence of points converges to a local minimum under specific assumptions, such as convexity and Lipschitz continuity of the gradient. If the function is convex, all local minima are also global minima. The step size must be small enough to ensure a monotonic decrease in the function value. Philip Wolfe advocated clever choices of descent direction to improve practical performance, and line search algorithms determine locally optimal step sizes on every iteration.
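To make the update rule and the line search concrete, here is a minimal sketch of gradient descent with Armijo backtracking applied to a simple two-variable quadratic. The objective, starting point, and step parameters are illustrative assumptions chosen for this example, not anything prescribed by the text.

    import numpy as np

    def f(x):
        # Illustrative convex objective: a simple quadratic bowl.
        return 0.5 * x[0] ** 2 + 2.0 * x[1] ** 2

    def grad_f(x):
        # Analytic gradient of the quadratic above.
        return np.array([x[0], 4.0 * x[1]])

    def backtracking_step(x, g, step=1.0, beta=0.5, c=1e-4):
        # Shrink the step until the Armijo sufficient-decrease condition holds.
        while f(x - step * g) > f(x) - c * step * (g @ g):
            step *= beta
        return step

    def gradient_descent(x0, tol=1e-8, max_iter=1000):
        # Repeatedly step opposite the gradient; stop once the gradient is tiny.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(x)
            if np.linalg.norm(g) < tol:
                break
            step = backtracking_step(x, g)
            x = x - step * g  # next point = current point - step * gradient
        return x

    print(gradient_descent([3.0, -2.0]))  # approaches the minimizer [0.0, 0.0]

With a fixed step that is too large the iterates can overshoot and oscillate; the backtracking search avoids tuning that constant by hand, at the cost of a few extra function evaluations per iteration.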
Training Deep Networks Today
Stochastic gradient descent serves as the most basic algorithm for training most deep networks today. It introduces randomness into the weight updates of backpropagation by estimating the gradient from a small batch of data rather than the full dataset, which lets neural networks learn complex patterns without processing every example on every step. Derivatives of the error with respect to the weights guide the network toward lower error values. Modern optimizers like Adam and Yogi build upon these foundational concepts, incorporating momentum and adaptive terms to accelerate convergence across vast parameter spaces. The technique remains fundamental to artificial intelligence research and application.
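As a minimal sketch of the mini-batch idea, the snippet below runs stochastic gradient descent with a classical momentum term on synthetic linear-regression data. The data, learning rate, momentum coefficient, and batch size are all assumed values chosen only for illustration; deep-learning frameworks wrap the same loop behind their optimizer APIs.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic regression data; the "true" weights are purely illustrative.
    X = rng.normal(size=(1000, 3))
    true_w = np.array([2.0, -1.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=1000)

    w = np.zeros(3)   # model weights
    v = np.zeros(3)   # momentum buffer
    lr, momentum, batch_size = 0.05, 0.9, 32

    for epoch in range(20):
        order = rng.permutation(len(X))  # shuffle the data each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the mean squared error on this mini-batch only.
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
            v = momentum * v - lr * grad  # momentum smooths the noisy gradient
            w = w + v

    print(w)  # should land close to true_w

Because each update sees only 32 examples, the gradient estimate is noisy; that noise is the stochastic property described above, and the momentum buffer averages it out while accelerating progress along consistent directions.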