— Ch. 1 · Foundations Of Tokenization —
Lexical analysis
In 1960, the ALGOL programming language eliminated whitespace and comments during its initial compilation phase, a decision that marked a shift in how early compilers handled raw text input. In computer science, lexical tokenization converts a sequence of characters into meaningful tokens. A lexer defines categories such as identifiers, operators, grouping symbols, and data types, and the tokens it produces form the building blocks for later stages like parsing and semantic analysis. The process transforms raw strings into structured units that machines can interpret.

For example, the string "The quick brown fox jumps over the lazy dog" contains 43 characters but yields only nine tokens when split on spaces. Each token carries a meaning assigned by predefined rules. In natural languages, the categories include nouns, verbs, adjectives, and punctuation marks; programming languages use similar logic to identify reserved words, numeric literals, and symbolic operators.

Tokens are often represented as enumerated types, where each category maps to a number: an identifier might map to zero while an addition operator maps to two. This mapping lets parsers work with simple numeric values instead of complex character streams.
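To make this concrete, here is a minimal sketch in Python. The enum values and the helper name split_tokens are illustrative assumptions, not part of any particular lexer: the sketch splits the fox sentence on whitespace and shows one possible numbering of token categories.

```python
from enum import IntEnum

class TokenType(IntEnum):
    # Hypothetical numbering: each category maps to a small integer,
    # so a parser can compare numbers instead of raw character strings.
    IDENTIFIER = 0
    NUMBER = 1
    PLUS = 2

def split_tokens(text: str) -> list[str]:
    """Split a raw string into tokens on whitespace boundaries."""
    return text.split()

sentence = "The quick brown fox jumps over the lazy dog"
tokens = split_tokens(sentence)
print(len(sentence))  # 43 characters
print(len(tokens))    # 9 tokens
```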
Scanner And Evaluator Mechanics
A finite-state machine processes the input one character at a time until it reaches a boundary defined by the set of acceptable characters. This first stage is called scanning, and it produces lexemes from a continuous character stream. Consider the C language, where the prefix 'L' alone cannot tell the scanner whether it is reading an identifier that begins with L or a wide-character string literal such as L"text"; the scanner must examine subsequent characters before committing to a decision. The maximal munch rule ensures the scanner consumes the longest sequence of characters that matches a valid pattern before stopping. Backtracking becomes necessary when rules involve recursive structures that simple state machines cannot handle alone, since finite automata cannot count nested parentheses to arbitrary depth without external help.

Once a lexeme is identified, the second stage evaluates it to produce a value for downstream use. An evaluator converts raw text into processed data such as numeric values or strings with the surrounding quotation marks stripped. Integer literals may pass through unchanged or be converted directly into numbers, depending on compiler design choices. Some evaluators suppress entire lexemes, such as whitespace or comments, since they carry no semantic weight for most compilers. A typical token stream might list IDENTIFIER followed by "net_worth_future", then EQUALS and OPEN_PARENTHESIS, each unit passed forward to syntactic analysis.
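As a rough sketch of both stages together (the token names, the tokenize helper, and the input line net_worth_future = (assets - liabilities); are assumptions chosen for illustration), the scanner can be approximated with greedy regular-expression patterns while a small evaluator converts numeric lexemes and drops whitespace:

```python
import re

# Hypothetical token categories and patterns. Each greedy pattern consumes
# the longest lexeme it can at its position, approximating maximal munch.
TOKEN_SPEC = [
    ("NUMBER",            r"\d+"),
    ("IDENTIFIER",        r"[A-Za-z_]\w*"),
    ("EQUALS",            r"="),
    ("OPEN_PARENTHESIS",  r"\("),
    ("CLOSE_PARENTHESIS", r"\)"),
    ("MINUS",             r"-"),
    ("SEMICOLON",         r";"),
    ("WHITESPACE",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source: str):
    """Scan the source left to right, evaluate each lexeme, and yield tokens.

    Characters that match no pattern are simply skipped in this sketch.
    """
    for match in MASTER.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "WHITESPACE":
            continue                      # the evaluator suppresses whitespace
        value = int(lexeme) if kind == "NUMBER" else lexeme
        yield kind, value

for token in tokenize("net_worth_future = (assets - liabilities);"):
    print(token)
```

Running the sketch prints a stream beginning with ('IDENTIFIER', 'net_worth_future'), ('EQUALS', '='), and ('OPEN_PARENTHESIS', '('), matching the token stream described above.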