— Ch. 1 · Evolution Of Transformer Architecture —
~7 min read · Ch. 1 of 7
In 2017, Google researchers published a paper titled "Attention Is All You Need" at the NeurIPS conference, introducing the transformer architecture as a replacement for older recurrent neural network approaches. Instead of processing a sequence one step at a time, the transformer uses self-attention to process all elements in a sequence simultaneously, which allows efficient parallelization during training and supports much longer contexts than previous models.

The transformer did not appear from nothing. Early statistical models from the 1990s relied on word-alignment techniques for machine translation, and by 2001 smoothed n-gram models trained on 300 million words achieved state-of-the-art perplexity scores. During the 2000s, researchers began compiling massive text datasets from the web to train these earlier systems. In 2016, Google transitioned its translation service to neural machine translation built on LSTM-based encoder-decoder architectures, still predating the transformer.

In 2018, BERT was introduced as an encoder-only transformer and quickly became ubiquitous in academic work. Its usage declined by 2023, after decoder-only models such as GPT improved their task-solving abilities through prompting.

The decoder-only GPT-1 also appeared in 2018, but it was GPT-2, released in 2019, that captured widespread attention; OpenAI initially deemed it too powerful to release publicly, citing fears of malicious use. GPT-3 followed in 2020 and remains available only via API, with no option for local execution. In 2022, the consumer-facing chatbot ChatGPT received extensive media coverage and public attention, and in 2024 OpenAI released the reasoning model o1, which generates long chains of thought before returning an answer.
Training Data And Tokenization Methods
Machine learning algorithms process numbers rather than raw text, so text must first be converted into numeric form. A tokenizer's vocabulary assigns an integer index, arbitrarily but uniquely, to each entry, and an embedding vector is then associated with each index. Algorithms such as byte-pair encoding (BPE) and WordPiece perform this conversion; a toy BPE trainer is sketched at the end of this section. Vocabularies also reserve special tokens that act as control characters: [MASK] marks masked-out tokens in BERT, and [UNK] denotes characters that do not appear in the vocabulary. Other symbols encode formatting, such as the character marking preceding whitespace in the RoBERTa and GPT tokenizers.

The average number of tokens per word depends heavily on the language being processed. A vocabulary built from the token frequencies of mainly English corpora spends few tokens on an average English word, whereas an average word in another language is split into a suboptimally large number of tokens by the same English-optimized tokenizer. The GPT-2 tokenizer can use up to 15 times more tokens per word for some languages, such as Shan from Myanmar, and even widespread languages such as Portuguese and German carry a premium of 50% compared to English.

Datasets are typically cleaned during preprocessing by removing low-quality, duplicated, or toxic material; a small deduplication sketch appears below. Cleaned datasets increase training efficiency and lead to improved downstream performance. As the proportion of LLM-generated content online grows, future cleaning may also filter out such material, and synthetic data might be used where naturally occurring data proves insufficient in quantity or quality.
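To make the byte-pair encoding step concrete, here is a minimal sketch of BPE vocabulary learning: starting from individual characters, it repeatedly merges the most frequent adjacent pair of symbols into a new symbol. The tiny corpus, the "_" end-of-word marker, and the merge count are illustrative choices for the example, not details of any production tokenizer.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(vocab, pair):
    """Apply one merge rule: fuse the pair wherever it appears as whole symbols."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with "_" as an end-of-word marker.
vocab = {"l o w _": 5, "l o w e r _": 2, "n e w e s t _": 6, "w i d e s t _": 3}

merges = []
for _ in range(8):                       # learn 8 merge rules
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    merges.append(best)
    vocab = merge_pair(vocab, best)

print(merges)   # learned rules, e.g. ('e', 's'), ('es', 't'), ('est', '_'), ...
```

A real tokenizer then applies the learned merge rules, in order, to segment new text; frequent words end up as single tokens while rare words fall back to smaller pieces.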
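Once text is mapped to integer indices, each index simply selects a row of a learned embedding table. A minimal NumPy sketch, with made-up sizes and random weights standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50_000, 768        # illustrative sizes, not a real model's
embedding_table = rng.normal(size=(vocab_size, d_model)).astype(np.float32)

token_ids = [464, 3797, 3332]            # hypothetical tokenizer output
vectors = embedding_table[token_ids]     # one embedding row per token id
print(vectors.shape)                     # (3, 768)
```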
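Deduplication, one of the cleaning steps mentioned above, can be illustrated with exact-match hashing. Real pipelines typically use fuzzier techniques such as MinHash over shingles, but the shape of the filter is the same; the normalization rule and sample documents here are invented for the example.

```python
import hashlib

def normalize(text):
    """Cheap normalization so trivially different copies hash identically."""
    return " ".join(text.lower().split())

def dedup_exact(documents):
    """Keep the first copy of each document, keyed by a hash of its normalized text."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["The cat sat.", "the  cat sat.", "A different document."]
print(dedup_exact(docs))   # ['The cat sat.', 'A different document.']
```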