— Ch. 1 · Foundations Of Approximation —
Universal approximation theorem.
~4 min read · Ch. 1 of 5
In 1989, George Cybenko published a result that changed the trajectory of machine learning. He proved that a feedforward neural network with just one hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy. This property became known as universality. Cybenko's proof applied to sigmoidal activation functions; later work showed that the essential condition is that the activation be non-polynomial, which covers both the sigmoid and ReLU. Increasing the number of neurons in the hidden layer makes the network wider, and the theorem guarantees that with enough of them the approximation error can be driven below any chosen tolerance, which is what lets such networks model complex relationships found in real-world data. The proof is an existence result: it guarantees that suitable parameters exist but does not specify how to find them efficiently. Practical training remained a separate challenge, handled by optimization algorithms such as gradient descent with backpropagation.
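To make the widening argument concrete, here is a minimal numerical sketch, not Cybenko's construction: a single hidden layer of sigmoid units, computing a weighted sum of terms sigmoid(w_i * x + b_i), is fit to the target sin(x). The hidden weights are drawn at random and only the output weights are solved by least squares, so the effect of adding units is easy to see. The target function, the random-feature fitting scheme, and all variable names are illustrative assumptions; the sketch only assumes NumPy is available.

```python
import numpy as np

# Illustrative sketch of one-hidden-layer approximation (not Cybenko's proof):
# hidden weights are random, only the output layer is fit by least squares,
# so widening the layer directly shows the effect of adding sigmoid units.
rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)  # sample points on a compact interval
y = np.sin(x)                                       # continuous target to approximate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for width in (5, 50, 500):
    W = rng.normal(scale=3.0, size=(1, width))      # hidden-layer weights
    b = rng.normal(scale=3.0, size=(width,))        # hidden-layer biases
    H = sigmoid(x @ W + b)                          # hidden activations, shape (200, width)
    a, *_ = np.linalg.lstsq(H, y, rcond=None)       # output weights via least squares
    err = np.max(np.abs(H @ a - y))                 # worst-case (sup-norm) error
    print(f"width={width:4d}  max |error| = {err:.4f}")
```

Because the hidden weights are random, this is a crude demonstration rather than a proof; the theorem only asserts that some choice of weights achieves any desired accuracy once the hidden layer is wide enough, and in practice the measured error here typically shrinks as the width grows.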
Historical Proofs And Evolution
The timeline of discovery began with Cybenko's 1989 paper on sigmoidal activation functions. Kurt Hornik, Maxwell Stinchcombe, and Halbert White followed in the same year, extending the result to multilayer feed-forward networks. Hornik further demonstrated in 1991 that the architecture itself, rather than the specific choice of activation function, provided the approximation power. Moshe Leshno and his colleagues showed in 1993 that universality is equivalent to the activation function being non-polynomial. Allan Pinkus refined and surveyed these results in 1999. The focus then shifted toward arbitrary-depth networks, beginning around 2003 with Gustaf Gripenberg's work on networks whose width is bounded at every layer. Dmitry Yarotsky and Zhou Lu advanced the ReLU-based theory significantly in 2017, and Boris Hanin and Mark Sellke expanded these results in 2018. Patrick Kidger and Terry Lyons generalized the deep, narrow setting to general activation functions such as tanh and GELU in 2020. In 2024, Cai constructed a finite set of mappings, called a vocabulary, from which any continuous function can be approximated through composition.