Hash function

A hash function is a mathematical tool that can take any piece of data, no matter how large or irregular in size, and produce a value of fixed size. That single capability underpins an enormous range of computing tasks, from storing passwords securely to searching through massive databases in a fraction of a second. But how does turning data into a compact code actually work, and what makes one hash function better than another? The history of the idea stretches back to a memo written in January 1953, and the story of how it became a cornerstone of modern computing is worth following carefully.
When a program needs to store and retrieve records quickly, it uses a hash table. The hash function takes a key, which might be a name or a number or any identifying marker, and translates it into a hash code. That code becomes the address of a slot, or bucket, in the table where the record is placed. The arrangement is sometimes called scatter-storage addressing, which captures the way data is deliberately spread across available positions. Near-constant retrieval time is the payoff, something that neither a sorted list nor a tree structure can reliably guarantee. The challenge that must be solved before any of that speed is possible is the collision, which happens when two different keys produce the same hash code and compete for the same slot.
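The key-to-slot translation can be sketched in a few lines. This is a minimal illustration, not a production table: the names `hash_code`, `bucket_index`, and `lookup`, the multiply-by-31 string hash, and the sample records are all assumptions made for the example, and collisions are deliberately ignored.

```python
def hash_code(key: str) -> int:
    """Toy string hash (an illustrative choice): fold each byte into a 32-bit code."""
    code = 0
    for byte in key.encode("utf-8"):
        code = (code * 31 + byte) % 2**32
    return code


def bucket_index(key: str, num_buckets: int) -> int:
    """Reduce the hash code to the address of a slot in the table."""
    return hash_code(key) % num_buckets


# A table of 8 slots; collisions are deliberately ignored in this sketch.
table = [None] * 8
for name, phone in [("ada", "555-0100"), ("grace", "555-0199")]:
    table[bucket_index(name, len(table))] = (name, phone)


def lookup(name: str):
    """Jump straight to the computed slot instead of scanning every record."""
    entry = table[bucket_index(name, len(table))]
    return entry[1] if entry and entry[0] == name else None
```

A lookup computes one hash and inspects one slot, which is where the near-constant retrieval time comes from.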
Collisions are not a sign of a broken function; a small number are virtually inevitable. The birthday problem from probability theory explains why: even in a table much larger than the number of stored items, two keys will occasionally land on the same slot. What matters is how the table handles the situation. Chained hashing places a linked list at every slot, and any item that collides is simply added to that chain. Open address hashing takes a different route: the table is probed in a defined sequence, using linear probing, quadratic probing, or double hashing, until an empty slot is found. The choice between these approaches shapes how a hash table performs as it fills up and how quickly searches complete when collisions do occur.
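The sketch below illustrates the birthday estimate and the two resolution strategies under simplifying assumptions: keys are non-negative integers, the hash is a plain `key % size`, Python lists stand in for linked lists, and deletion is not supported. The class and function names are hypothetical.

```python
def collision_probability(num_keys: int, num_slots: int) -> float:
    """Chance that at least two of num_keys uniformly hashed keys share a slot."""
    p_no_collision = 1.0
    for i in range(num_keys):
        p_no_collision *= (num_slots - i) / num_slots
    return 1.0 - p_no_collision


class ChainedTable:
    """Chained hashing: every slot holds a chain of (key, value) pairs."""

    def __init__(self, num_buckets: int = 8):
        self.buckets = [[] for _ in range(num_buckets)]

    def put(self, key: int, value) -> None:
        chain = self.buckets[key % len(self.buckets)]
        for i, (existing, _) in enumerate(chain):
            if existing == key:
                chain[i] = (key, value)  # overwrite an existing key
                return
        chain.append((key, value))  # a colliding key just extends the chain

    def get(self, key: int):
        for existing, value in self.buckets[key % len(self.buckets)]:
            if existing == key:
                return value
        raise KeyError(key)


class LinearProbeTable:
    """Open addressing with linear probing: step forward to the next free slot."""

    def __init__(self, num_slots: int = 8):
        self.slots = [None] * num_slots

    def _probe(self, key: int):
        start = key % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def put(self, key: int, value) -> None:
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key: int):
        for i in self._probe(key):
            if self.slots[i] is None:
                break  # an empty slot ends the probe sequence
            if self.slots[i][0] == key:
                return self.slots[i][1]
        raise KeyError(key)
```

With 8 slots, keys 3 and 11 both hash to slot 3. The chained table keeps both in the chain at slot 3, while the probing table places the second key in slot 4.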
A good hash function spreads its outputs as evenly as possible across the available range. The technical name for this ideal is uniformity, and it can be measured formally using the chi-squared test, which compares the actual distribution of items across buckets to the expected uniform distribution. The result is expressed as a ratio of observed to expected values, and a ratio within one confidence interval, such as 0.95 to 1.05, indicates that the function's output is consistent with a uniform distribution. A related property, the strict avalanche criterion, requires that whenever a single input bit is changed, each output bit should flip with a 50% probability. This matters because real key sets are often clustered, with many inputs sharing structural similarities, and such keys drag a hash function toward poor performance unless small input changes reliably produce large output changes.
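Both properties can be checked empirically. The sketch below makes several assumptions that are not in the text above: the ratio formula is one commonly quoted form whose expected value is 1 when keys are assigned to buckets at random, the sample function `fmix32` is the 32-bit finalizer from MurmurHash3, and the helper names are hypothetical.

```python
def fmix32(h: int) -> int:
    """MurmurHash3's 32-bit finalizer, used here only as a sample well-mixed function."""
    h &= 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def uniformity_ratio(hash_fn, keys, num_buckets: int) -> float:
    """Observed bucket occupancy divided by the value expected for random placement."""
    counts = [0] * num_buckets
    for key in keys:
        counts[hash_fn(key) % num_buckets] += 1
    n, m = len(keys), num_buckets
    observed = sum(b * (b + 1) / 2 for b in counts)
    expected = (n / (2 * m)) * (n + 2 * m - 1)
    return observed / expected


def avalanche_rate(hash_fn, samples, bits: int = 32) -> float:
    """Average fraction of output bits that flip when one input bit is flipped."""
    flipped = 0
    trials = 0
    for x in samples:
        base = hash_fn(x)
        for bit in range(bits):
            diff = base ^ hash_fn(x ^ (1 << bit))
            flipped += bin(diff).count("1")
            trials += bits
    return flipped / trials


keys = list(range(10_000))
good_ratio = uniformity_ratio(fmix32, keys, 100)
bad_ratio = uniformity_ratio(lambda k: k % 10, keys, 100)  # uses only 10 of 100 buckets
good_avalanche = avalanche_rate(fmix32, range(1, 200))
bad_avalanche = avalanche_rate(lambda x: x, range(1, 200))  # identity: one bit in, one bit out
```

A ratio near 1 and an avalanche rate near 0.5 are the targets. The identity function flips exactly one output bit in 32 per input bit, so it falls far short of the avalanche criterion.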
Several distinct methods exist for computing a hash value from a key. Division hashing applies a modulo operation using a prime number close to the table size, which gives a good spread across many key sets. Its drawback is speed: division takes multiple processor cycles on most modern architectures, including x86, and can run ten times slower than multiplication. Multiplicative hashing addresses that problem by using one integer multiplication and a right shift, making it among the fastest hash functions to compute. A specialized variant called Fibonacci hashing uses a multiplier derived from the golden ratio (approximately 1.618), typically 2^w divided by that ratio for a w-bit machine word, and distributes consecutive keys uniformly across the table even when the variation between keys is confined to the high or low bits. The mid-squares method offers a third path: square the key and extract the middle digits. For example, the input 123456789 squares to the 17-digit number 15241578750190521; ignoring the leading digit, the middle four digits, 8750, are taken as the hash code.
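The three methods can be compared side by side in a short sketch. The function names are hypothetical, a 32-bit word is assumed for Fibonacci hashing, and the rule of dropping the leading digit in the mid-squares function is an assumption chosen so that the window is centred and the 123456789 example yields 8750.

```python
WORD_BITS = 32
# floor(2**32 / golden ratio); the value is odd, as a multiplicative-hash multiplier should be.
FIBONACCI_MULTIPLIER = 2654435769


def division_hash(key: int, prime_table_size: int) -> int:
    """Division method: the remainder modulo a prime table size."""
    return key % prime_table_size


def fibonacci_hash(key: int, index_bits: int) -> int:
    """Multiply, keep the low 32 bits, then shift to keep the top index_bits bits."""
    product = (key * FIBONACCI_MULTIPLIER) & 0xFFFFFFFF
    return product >> (WORD_BITS - index_bits)


def mid_square_hash(key: int, digits: int = 4) -> int:
    """Mid-squares method: square the key and read the middle decimal digits."""
    squared = str(key * key).zfill(digits)
    # Assumption: drop the leading digit when needed so the window sits exactly in the middle.
    if (len(squared) - digits) % 2 == 1:
        squared = squared[1:]
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])


consecutive_slots = [fibonacci_hash(k, 3) for k in range(1, 9)]
```

For an 8-slot table (`index_bits=3`), the consecutive keys 1 through 8 land in slots 4, 1, 6, 3, 0, 5, 2, 7, so each slot is used exactly once.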
Zobrist hashing, named after Albert Zobrist, was originally built to represent chess positions compactly inside computer game-playing programs. A unique random number was assigned to each type of piece, six piece types each for black and white, on each of the 64 squares of the board, producing a table of 64 multiplied by 12 numbers. A position was encoded by cycling through all pieces on the board, retrieving their corresponding random numbers, and combining them using the XOR operation. Because XOR was chosen, the starting value was set to 0, the identity for that operation. The method was later extended to general integer hashing: each possible byte value in each of the four byte positions of a 32-bit integer is assigned its own random number, and the numbers for the four bytes of a key are combined with XOR. The resulting scheme carries a theoretical property called 3-tuple independence, meaning any three distinct keys are equally likely to map to any three hash values.
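The chess encoding described above can be sketched as follows. The details are assumptions for illustration: squares are numbered 0 to 63, pieces use single-letter codes, the random numbers are 64 bits wide, and a fixed seed keeps the table reproducible. Uniqueness of the random numbers is not enforced, only overwhelmingly likely at this width.

```python
import random

# Six piece types for white (upper case) and six for black (lower case).
PIECES = ["P", "N", "B", "R", "Q", "K", "p", "n", "b", "r", "q", "k"]

# 64 x 12 table of random numbers; the seed is arbitrary and only for repeatability.
_rng = random.Random(2024)
ZOBRIST = [[_rng.getrandbits(64) for _ in PIECES] for _ in range(64)]


def zobrist_hash(position: dict) -> int:
    """Hash a position given as {square index: piece letter}."""
    h = 0  # 0 is the identity for XOR
    for square, piece in position.items():
        h ^= ZOBRIST[square][PIECES.index(piece)]
    return h


start = {0: "R", 4: "K", 60: "k"}
moved = {3: "R", 4: "K", 60: "k"}  # the rook steps from square 0 to square 3

# XOR is its own inverse, so a move can update the hash without a full rescan:
# XOR out the piece on its old square, XOR it in on the new one.
rook = PIECES.index("R")
incremental = zobrist_hash(start) ^ ZOBRIST[0][rook] ^ ZOBRIST[3][rook]
```

The incremental update is a consequence of using XOR rather than something stated in the text above; it gives the same value as rehashing the whole position.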
Cryptographic hash functions serve a different set of demands from those used inside data structures. Password storage relies on the fact that a hash value cannot feasibly be reversed to recover the original password, so the server can store only the hash and never the plaintext. Message authentication codes, known as MACs, combine a secret key with the input data using a hash function, giving the recipient a way to verify that the message has not been tampered with; the variant called HMAC follows this pattern. Integrity checking rests on the same logic: identical hash values for two files indicate, with overwhelming probability, that the files are equal, making hash functions a reliable detector of any modification. Signatures use a similar shortcut, signing the hash of a message rather than the message itself, since the hash is small regardless of how large the original data is.
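Three of these uses map directly onto Python's standard `hashlib` and `hmac` modules. The sketch below is illustrative only: the function names are hypothetical, and the choices of SHA-256, PBKDF2 with a random salt, and 100,000 iterations are assumptions that go beyond what the text above specifies.

```python
import hashlib
import hmac
import os


def hash_password(password: str, salt=None):
    """Return (salt, digest); only these are stored, never the plaintext."""
    salt = os.urandom(16) if salt is None else salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest


def verify_password(password: str, salt: bytes, stored: bytes) -> bool:
    """Re-hash the attempt with the stored salt and compare in constant time."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, stored)


def make_tag(secret_key: bytes, message: bytes) -> bytes:
    """HMAC: combine a secret key with the message through a hash function."""
    return hmac.new(secret_key, message, hashlib.sha256).digest()


def verify_tag(secret_key: bytes, message: bytes, tag: bytes) -> bool:
    """Accept the message only if its recomputed tag matches the received one."""
    return hmac.compare_digest(make_tag(secret_key, message), tag)


def fingerprint(data: bytes) -> str:
    """Integrity check: any change to the data changes the digest."""
    return hashlib.sha256(data).hexdigest()
```

A receiver that holds the same secret key can call `verify_tag` on an incoming message, and two files can be compared by fingerprint without reading them side by side.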
Donald Knuth traced the precise history of the word hash in this technical sense and found that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo dated January 1953. The term itself did not appear in published literature until the late 1960s, surfacing in Herbert Hellerman's Digital Computer System Principles, even though it was already widespread jargon among computing practitioners well before that. The analogy built into the name is deliberate: to hash something in ordinary language is to chop it up or make a mess of it, which captures exactly what a hash function does to its input to produce an output that looks nothing like the original. The PJW hash function, developed by Peter J. Weinberger at Bell Labs in the 1970s and documented in the textbook known as the Dragon Book, carried this tradition forward into the world of compiler symbol tables.

Common questions

What is a hash function and what does it do?

A hash function maps data of arbitrary size to fixed-size values called hash codes or hashes. Those values are used to index a hash table, giving near-constant data retrieval time regardless of the number of stored records.

Who invented the hash function concept?

Hans Peter Luhn of IBM appears to have been the first to use the concept of a hash function, in a memo dated January 1953. The term itself did not appear in published literature until the late 1960s, in Herbert Hellerman's Digital Computer System Principles.

What is a hash collision and how is it resolved?

A collision occurs when two different keys produce the same hash code and compete for the same slot in a hash table. It can be resolved by chained hashing, which stores colliding items in a linked list, or by open address hashing, which probes for the next empty slot using linear probing, quadratic probing, or double hashing.

What is Zobrist hashing and where was it first used?

Zobrist hashing, named after Albert Zobrist, was originally introduced for compactly representing chess positions in computer game-playing programs. It assigns unique random numbers to each piece type on each square of the board and combines them using XOR operations.

How are hash functions used in cybersecurity and cryptography?

Hash functions secure passwords by storing only the hash value rather than the plaintext. They also underpin message authentication codes such as HMACs, support file integrity checking, and allow digital signatures to be applied to a small hash of a message rather than the full message.

What is the strict avalanche criterion in hash function design?

The strict avalanche criterion requires that whenever a single input bit is changed, each output bit should change with a 50% probability. This property ensures that clustered or low-variability key sets still produce evenly distributed hash codes.
