— Ch. 1 · Contrastive Vector Alignment —
Contrastive Language-Image Pre-training.
In 2021, OpenAI released CLIP, a model that encodes images and text into vectors living in a shared embedding space, where matching image-text pairs sit close together. The system measures similarity with the dot product of the two L2-normalized embedding vectors, i.e. their cosine similarity. A large similarity score means the image matches the text description well; a small score indicates a mismatch. Training pulls matching pairs closer together while pushing non-matching pairs apart, using a symmetric cross-entropy loss over the similarity scores (the multi-class N-pair, or InfoNCE-style, objective). A learned temperature parameter scales the logits, controlling how sharply the model separates correct matches from incorrect ones.
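The loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not OpenAI's implementation: the fixed temperature of 0.07 is an illustrative value, whereas CLIP learns the temperature during training.

```python
import numpy as np

def _log_softmax(x):
    # Numerically stable row-wise log-softmax.
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine-similarity logits.

    img_emb, txt_emb: (N, D) arrays where row i of each is a matching
    pair. `temperature` is fixed here for simplicity; CLIP learns it.
    """
    # L2-normalize so the dot product equals cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = (img @ txt.T) / temperature  # (N, N): all pairwise scores
    labels = np.arange(logits.shape[0])   # diagonal entries = correct pairs

    # Cross-entropy in both directions: image->text and text->image.
    loss_i = -_log_softmax(logits)[labels, labels].mean()
    loss_t = -_log_softmax(logits.T)[labels, labels].mean()
    return (loss_i + loss_t) / 2
```

With perfectly aligned embeddings the diagonal dominates and the loss approaches zero; shuffling the text rows so no pair matches drives it up sharply, which is exactly the gradient signal that pulls matching pairs together.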
Vision Transformer and ResNet Models
The original CLIP release paired its text encoder with either a Vision Transformer (ViT) or a ResNet for visual encoding. Model names like ViT-L/14 indicate the patch size: the image is split into 14x14-pixel patches. Smaller variants such as ViT-B/32 use larger 32-pixel patches, so they process fewer tokens per image. Other checkpoints used modified ResNet convolutional networks instead of transformers. The embedding dimension ranged from 512 to 1024 depending on the model size. OpenAI modified the standard ResNet by replacing its stem with three stacked 3x3 convolutions and adding anti-aliased blur pooling, an average pool with stride 2 applied before downsampling layers. They also replaced the final global average pool with a multi-head attention pooling layer. Google researchers later developed ALIGN using EfficientNet architectures for similar tasks.

WebImageText Dataset Construction
OpenAI trained their initial models on a private dataset named WebImageText (WIT) containing 400 million image-caption pairs. These pairs were scraped directly from the internet and were never publicly released. The total word count of the collection matched the scale of the WebText dataset used to train GPT-2. To gather the pairs, researchers built text queries starting from all words appearing at least 100 times in English Wikipedia. Bigrams with high pointwise mutual information extended these base queries, and the names of popular Wikipedia articles and WordNet synsets added further variety. Later, other organizations published open datasets such as LAION-400M and DataComp-1B for broader community use. Google's ALIGN method instead harvested over one billion image-text pairs, using the alt-text attributes of images found by web crawlers.
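The bigram-selection step above can be illustrated with a small pointwise mutual information (PMI) calculation. The toy corpus, threshold, and function name here are illustrative only, not OpenAI's actual query-building pipeline.

```python
import math
from collections import Counter

def high_pmi_bigrams(tokens, min_pmi=1.0):
    """Score adjacent word pairs by pointwise mutual information:

        PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )

    Pairs that co-occur far more often than chance score highest.
    `min_pmi` is an illustrative cutoff, not a published value."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1

    scored = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi            # joint probability of the bigram
        p_x = unigrams[x] / n_uni      # marginal probabilities
        p_y = unigrams[y] / n_uni
        pmi = math.log(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            scored[(x, y)] = pmi
    return scored
```

On a corpus where "new york" always co-occurs, that bigram scores well above chance and survives the cutoff, while incidental pairs like "big and" fall below it; this is the intuition behind using high-PMI bigrams to extend a base query list.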