— Ch. 1 · Cognitive Origins And Early History —
Attention (machine learning).
In the 1950s, researchers studied how humans focus on one voice while ignoring background noise, a phenomenon that became known as the cocktail party effect. Psychologists proposed filter models, such as Broadbent's, to explain how attention selects relevant information from a chaotic environment. By the 1980s, engineers had built sigma-pi units and higher-order neural networks that multiplied signals together, echoing these selective processes. Fast weight controllers emerged in the early 1990s, letting one network dynamically reprogram the connections of another; these systems anticipated the key-value mechanisms found in modern machine learning. In image processing, the bilateral filter (1998) propagated relevance across elements through pairwise affinity weights, and non-local means (2005) extended the approach by applying Gaussian similarity kernels as fixed, attention-like weights.

A major shift occurred in 2014, when seq2seq models augmented recurrent neural networks with a learned attention mechanism. This integration allowed translation systems to handle long sentences far more effectively than earlier encoder-decoder designs.
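The "fixed attention-like weights" of non-local means can be made concrete. The sketch below is a deliberately simplified 1-D version that compares single samples rather than image patches, with an illustrative bandwidth h; real non-local means compares whole neighborhoods. The point is the structure: a similarity matrix, row-normalized, multiplied into the data, which is exactly the shape of an attention computation whose weights are computed rather than learned.

```python
import numpy as np

def nonlocal_means_1d(signal, h=0.5):
    # w[i, j] = exp(-(x_i - x_j)^2 / h^2): a fixed, similarity-based
    # "attention" matrix derived from the data itself, not learned.
    diff = signal[:, None] - signal[None, :]
    w = np.exp(-(diff ** 2) / h ** 2)
    w = w / w.sum(axis=1, keepdims=True)  # normalize each row, like a softmax
    return w @ signal                     # each output is a weighted average

noisy = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
denoised = nonlocal_means_1d(noisy)
```

Samples near 1.0 are averaged almost entirely with each other, and likewise for those near 5.0, because the Gaussian kernel assigns negligible weight across the gap.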
The Transformer Revolution
The 2017 paper "Attention Is All You Need" introduced the Transformer architecture. The model formalized scaled dot-product self-attention and replaced the slower, sequential recurrent networks that preceded it. Self-attention lets each element of an input sequence attend directly to every other element, removing the serial-processing bottleneck inherent in RNNs: because every token can interact with every other token simultaneously, computation parallelizes, and global dependencies within a sentence are captured without attenuation over distance. Relation networks and set Transformers applied the same principles to unordered sets for relational reasoning, and graph attention networks brought the mechanism to graph-structured data in 2018. Efficient Transformers such as Reformer, Linformer, and Performer followed in 2019-2020, approximating attention so it scales to long sequences. Vision transformers achieved competitive image-classification results in 2020. These architectures formed the foundation for models such as BERT, T5, and the generative pre-trained transformers.
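The scaled dot-product self-attention described above can be sketched in a few lines of numpy. This follows the standard formula softmax(QK^T / sqrt(d_k))V from the 2017 paper; the toy dimensions (5 tokens, 8-dim embeddings) and the identity-projection setup (using the raw sequence X as queries, keys, and values, with no learned projection matrices) are simplifications for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise similarities: one row per query
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ V, weights          # outputs are weighted mixtures of values

# Self-attention: queries, keys, and values all come from the same sequence,
# so every token attends to every other token in one parallel matrix product.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))              # 5 tokens, 8-dimensional embeddings
out, w = scaled_dot_product_attention(X, X, X)
```

Note that nothing here is sequential: the token-to-token interactions live in a single (5, 5) weight matrix, which is what makes the global, parallel dependency modeling of the Transformer possible.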