— Ch. 1 · Evolution Of Transformer Architecture —
~7 min read · Ch. 1 of 7
In 2017, Google researchers published a paper titled "Attention Is All You Need" at the NeurIPS conference, introducing the transformer architecture as a replacement for older recurrent neural network approaches. Instead of processing a sequence one step at a time, the transformer uses self-attention to process all elements in a sequence simultaneously, which allows efficient parallelization during training and supports much longer contexts than previous models.

The transformer did not appear from nothing. Early statistical models from the 1990s relied on word-alignment techniques for machine translation, and by 2001 smoothed n-gram models trained on 300 million words achieved state-of-the-art perplexity scores. During the 2000s, researchers began compiling massive text datasets from the web to train these earlier systems. In 2016, Google transitioned its translation service to neural machine translation built on LSTM-based encoder-decoder architectures, still predating the transformer.

In 2018, BERT was introduced as an encoder-only transformer and quickly became ubiquitous in academic work. Its usage declined by 2023, after decoder-only models such as GPT improved their task-solving abilities through prompting.

The decoder-only GPT-1 also appeared in 2018, but it was GPT-2, released in 2019, that captured widespread attention; OpenAI initially deemed it too powerful to release publicly, citing fears of malicious use. GPT-3 followed in 2020 and remains available only via API, with no option for local execution. In 2022, the consumer-facing chatbot ChatGPT received extensive media coverage and public attention, and in 2024 OpenAI released the reasoning model o1, which generates long chains of thought before returning an answer.
Training Data And Tokenization Methods
Machine learning algorithms process numbers rather than raw text, so text must first be converted into numeric form. A tokenizer's vocabulary assigns an integer index, arbitrarily but uniquely, to each entry, and an embedding vector is then associated with each index. Algorithms such as byte-pair encoding (BPE) and WordPiece perform this conversion; a toy BPE trainer is sketched at the end of this section. Vocabularies also reserve special tokens that act as control characters: [MASK] marks masked-out tokens in BERT, and [UNK] denotes characters that do not appear in the vocabulary. Other symbols encode formatting, such as the character marking preceding whitespace in the RoBERTa and GPT tokenizers.

The average number of tokens per word depends heavily on the language being processed. A vocabulary built from the token frequencies of mainly English corpora spends few tokens on an average English word, whereas an average word in another language is split into a suboptimally large number of tokens by the same English-optimized tokenizer. The GPT-2 tokenizer can use up to 15 times more tokens per word for some languages, such as Shan from Myanmar, and even widespread languages such as Portuguese and German carry a premium of 50% compared to English.

Datasets are typically cleaned during preprocessing by removing low-quality, duplicated, or toxic material; a small deduplication sketch appears below. Cleaned datasets increase training efficiency and lead to improved downstream performance. As the proportion of LLM-generated content online grows, future cleaning may also filter out such material, and synthetic data might be used where naturally occurring data proves insufficient in quantity or quality.
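To make the byte-pair encoding step concrete, here is a minimal sketch of BPE vocabulary learning: starting from individual characters, it repeatedly merges the most frequent adjacent pair of symbols into a new symbol. The tiny corpus, the "_" end-of-word marker, and the merge count are illustrative choices for the example, not details of any production tokenizer.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(vocab, pair):
    """Apply one merge rule: fuse the pair wherever it appears as whole symbols."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with "_" as an end-of-word marker.
vocab = {"l o w _": 5, "l o w e r _": 2, "n e w e s t _": 6, "w i d e s t _": 3}

merges = []
for _ in range(8):                       # learn 8 merge rules
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    merges.append(best)
    vocab = merge_pair(vocab, best)

print(merges)   # learned rules, e.g. ('e', 's'), ('es', 't'), ('est', '_'), ...
```

A real tokenizer then applies the learned merge rules, in order, to segment new text; frequent words end up as single tokens while rare words fall back to smaller pieces.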
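Once text is mapped to integer indices, each index simply selects a row of a learned embedding table. A minimal NumPy sketch, with made-up sizes and random weights standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50_000, 768        # illustrative sizes, not a real model's
embedding_table = rng.normal(size=(vocab_size, d_model)).astype(np.float32)

token_ids = [464, 3797, 3332]            # hypothetical tokenizer output
vectors = embedding_table[token_ids]     # one embedding row per token id
print(vectors.shape)                     # (3, 768)
```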
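Deduplication, one of the cleaning steps mentioned above, can be illustrated with exact-match hashing. Real pipelines typically use fuzzier techniques such as MinHash over shingles, but the shape of the filter is the same; the normalization rule and sample documents here are invented for the example.

```python
import hashlib

def normalize(text):
    """Cheap normalization so trivially different copies hash identically."""
    return " ".join(text.lower().split())

def dedup_exact(documents):
    """Keep the first copy of each document, keyed by a hash of its normalized text."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["The cat sat.", "the  cat sat.", "A different document."]
print(dedup_exact(docs))   # ['The cat sat.', 'A different document.']
```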