Chris Olah coined the term mechanistic interpretability around 2020 to describe a shift in how researchers approached artificial intelligence. Before then, neural networks were largely treated as opaque black boxes: inputs produced outputs without any clear account of the internal logic connecting them. Olah and his collaborators instead treated these complex mathematical structures like compiled binary programs that could be reverse-engineered to reveal their true functions. This approach moved the field beyond simple input-output explanations and into the messy, high-dimensional reality of how models actually compute. Early work combined feature visualization, dimensionality reduction, and attribution methods with human-computer interaction techniques to analyze models such as the vision network Inception v1. The goal was not just to see what the model saw, but to understand the specific circuits and algorithms encoded in its weights.
Linear Directions In Space
The linear representation hypothesis suggests that high-level concepts are represented as linear directions in the activation space of neural networks. Under this view, the relationship between a country and its capital, for example, corresponds to a specific direction in the model's vector space. Empirical evidence from word embeddings and more recent studies supports this view, although it does not hold universally across all architectures. Researchers found that simple word embeddings exhibit a linear representation of semantics, allowing them to trace how abstract ideas are stored as geometric vectors. This discovery challenged the prevailing assumption that neural networks operated through chaotic, non-linear interactions that were impossible to decipher. By mapping these linear directions, researchers could begin to predict how a model would respond to new inputs based on its internal geometry, as the sketch below illustrates.
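As a minimal sketch of the idea, the toy example below uses invented four-dimensional numpy vectors as stand-ins for real word embeddings (such as word2vec or GloVe, which live in hundreds of dimensions). It shows how a shared "capital-of" direction, computed from one country-capital pair, can be added to another country's vector to recover its capital by nearest-neighbour lookup.

```python
import numpy as np

# Toy 4-dimensional embeddings, invented purely for illustration.
emb = {
    "france": np.array([0.9, 0.1, 0.2, 0.0]),
    "paris":  np.array([0.9, 0.1, 0.8, 0.3]),
    "japan":  np.array([0.1, 0.9, 0.2, 0.0]),
    "tokyo":  np.array([0.1, 0.9, 0.8, 0.3]),
}

# If the "capital-of" relation is a linear direction, the offset between a
# country and its capital should be roughly the same for every pair.
capital_direction = emb["paris"] - emb["france"]

# Predict Japan's capital by adding that shared direction to "japan"
# and finding the nearest stored vector by cosine similarity.
query = emb["japan"] + capital_direction

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

best = max(emb, key=lambda w: cosine(emb[w], query))
print(best)  # -> "tokyo" in this constructed example
```

Real embeddings are noisier than these hand-built vectors, so the analogy only holds approximately, but the same vector arithmetic is what the classic word-embedding studies relied on.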
Causal Methods And Circuits
Mechanistic interpretability employs causal methods to understand how internal model components influence outputs, often borrowing formal tools from causality theory. This approach allows researchers to identify structures, circuits, or algorithms encoded in the weights of machine learning models. Unlike earlier interpretability methods that focused primarily on input-output explanations, it probes the causal mechanisms that drive decision-making. Researchers use these tools to verify the behavior of complex AI systems and to attempt to identify potential risks before they manifest in real-world applications. The process involves isolating specific neurons or groups of neurons and measuring how they contribute to a final output. By manipulating these internal components, researchers can confirm whether a specific circuit is responsible for a particular behavior, effectively demonstrating the existence of distinct algorithms within the network.
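A minimal sketch of such an intervention is shown below, assuming a toy two-layer PyTorch MLP in place of a real model: a forward hook zeroes out one hidden unit (an ablation), and the resulting change in the network's output for the same inputs indicates whether that unit plays a causal role in the behavior under study.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a real model; the unit index chosen for ablation
# is likewise arbitrary and only serves to illustrate the method.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
x = torch.randn(4, 8)

baseline = model(x)  # output with the network left untouched

UNIT = 3  # index of the hidden neuron whose causal role we want to test

def ablate_unit(module, inputs, output):
    # Zero out one hidden activation; any downstream change in the output
    # is then attributable to that unit for these inputs.
    patched = output.clone()
    patched[:, UNIT] = 0.0
    return patched

handle = model[1].register_forward_hook(ablate_unit)  # hook after the ReLU
ablated = model(x)
handle.remove()

# A large difference suggests the unit matters causally for this behavior;
# a near-zero difference suggests it does not.
print((baseline - ablated).abs().max())
```

In practice researchers patch in activations from a different input rather than zeros, and repeat the experiment over many prompts, but the logic is the same: intervene on an internal component and observe whether the output changes in the predicted way.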