— Ch. 1 · Foundational History And Origins —
Feedforward neural network
In 1805, Adrien-Marie Legendre published the method of least squares, laying the groundwork for what would become linear regression; Carl Friedrich Gauss had independently developed similar techniques around 1795. This early mathematical approach allowed scientists to predict planetary movements by fitting lines through scattered data points. In modern terms, such a model is the simplest feedforward structure, a single weight layer with a linear activation function (see the short sketch below). These pioneers established the basic principle of minimizing the error between predicted values and actual observations.

The binary artificial neuron emerged decades later, when Warren McCulloch and Walter Pitts proposed their logical model of the neuron in 1943. Frank Rosenblatt then introduced the perceptron in 1958, featuring an input layer, a hidden layer with fixed randomized weights, and an output layer with learnable connections. R. D. Joseph noted in 1960 that Farley and Clark of MIT Lincoln Laboratory had actually built a perceptron-like device before Rosenblatt, though the project was eventually abandoned.

Alexey Grigorevich Ivakhnenko and Valentin Lapa published their Group Method of Data Handling algorithm in 1965, which became the first working deep learning method capable of training arbitrarily deep neural networks. They used Kolmogorov-Gabor polynomials as activation functions and pruned unnecessary hidden units using validation sets. By 1971, researchers had used this approach to train an eight-layer neural net.
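To make the opening idea concrete, here is a minimal sketch of least squares viewed as a one-weight-layer feedforward model with a linear activation. The synthetic data, variable names, and the closed-form normal-equation solution are illustrative choices under that framing, not details from the historical sources.

```python
import numpy as np

# Least squares as a minimal feedforward model: one weight layer,
# a linear activation, and squared prediction error as the loss.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(50, 1))             # scattered observations
y = 3.0 * x[:, 0] + 0.5 + rng.normal(0.0, 0.1, 50)   # noisy linear target

X = np.hstack([x, np.ones((50, 1))])                 # append a bias column

# Closed-form normal equations: w = (X^T X)^{-1} X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

pred = X @ w                                         # "forward pass": purely linear
print("weights:", w)                                 # close to [3.0, 0.5]
print("mean squared error:", np.mean((pred - y) ** 2))
```

The closed-form solve stands in for training: it directly minimizes the same squared error that Legendre and Gauss minimized when fitting lines through data.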
Mathematical Activation Functions
The hyperbolic tangent function ranges from negative one to positive one, while the logistic function spans zero to one. These two sigmoids were the historically common activation functions in early neural network calculations. Alternatives such as the rectifier and softplus functions were designed to overcome the numerical problems of sigmoids, chiefly the vanishing gradients caused by saturation, and the rectified linear unit (ReLU) gained prominence in recent deep learning work for exactly this reason. Radial basis functions emerged as specialized alternatives used within radial basis function networks, another class of supervised models.

These mathematical choices directly influence how information flows through the layers of a network. In particular, the derivative of each activation function determines how quickly weights can adjust during gradient-based training. A unit that applies a simple threshold to its weighted input became known as a linear threshold unit. Although a single unit has limited computational power, multiple parallel non-linear units proved capable of approximating any continuous function. This flexibility is what allows networks to distinguish data that is not linearly separable across complex problem spaces.
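As a concrete illustration of these functions and their derivatives, which govern the size of weight updates under gradient-based training, here is a small sketch. The implementations are standard textbook definitions rather than anything specified in the text above.

```python
import numpy as np

# Historically common sigmoids and their output ranges.
def logistic(z):                 # spans (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                     # spans (-1, 1)
    return np.tanh(z)

# Alternatives designed to avoid sigmoid saturation.
def relu(z):                     # rectified linear unit
    return np.maximum(0.0, z)

def softplus(z):                 # smooth approximation of ReLU
    return np.log1p(np.exp(z))

# Derivatives determine how quickly weights can adjust during training.
def logistic_grad(z):
    s = logistic(z)
    return s * (1.0 - s)         # peaks at 0.25, vanishes for large |z|

def relu_grad(z):
    return (z > 0).astype(float) # constant 1 for positive inputs

z = np.array([-4.0, 0.0, 4.0])
print("logistic grad:", logistic_grad(z))  # tiny at the tails: vanishing gradient
print("relu grad:    ", relu_grad(z))      # does not saturate for z > 0
```

Evaluating the gradients at the tails shows why the sigmoids caused trouble in deep stacks: the logistic derivative is nearly zero for large inputs, while the ReLU derivative stays at one for any positive input.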