Questions about Contrastive Language-Image Pre-training

Short answers, pulled from the story.

When did OpenAI release the RN50 model for contrastive language-image pre-training?

OpenAI released the RN50 model in January 2021 as part of the original CLIP release. The system encodes images and text into single vectors that live in a shared embedding space, where matching image-caption pairs sit close together.
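
As a minimal sketch of that shared space, assuming the open-source `clip` package from the openai/CLIP repository (the image file name and captions here are hypothetical placeholders):

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

# "cat.jpg" is a hypothetical local file standing in for any input image.
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    image_vec = model.encode_image(image)  # one vector per image
    text_vec = model.encode_text(text)     # one vector per caption

# Normalize so a dot product equals cosine similarity; the matching
# caption should score higher than the mismatched one.
image_vec = image_vec / image_vec.norm(dim=-1, keepdim=True)
text_vec = text_vec / text_vec.norm(dim=-1, keepdim=True)
print(image_vec @ text_vec.T)
```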

What architectures does the original CLIP implementation use for visual encoding?

The original CLIP implementation relies on Vision Transformer (ViT) architectures for visual encoding. Some variants use ResNet convolutional neural networks instead of transformers, with embedding dimensions ranging from 512 to 1024 depending on the model size.
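
A short sketch of how those variants differ, again assuming the open-source `clip` package; the loop probes each checkpoint's output dimension with a dummy caption:

```python
import clip
import torch

# The released checkpoints span both ResNet and ViT variants.
print(clip.available_models())

# Probe the joint embedding dimension of two variants.
for name in ["ViT-B/32", "RN50"]:
    model, _ = clip.load(name, device="cpu")
    with torch.no_grad():
        dim = model.encode_text(clip.tokenize(["probe"])).shape[-1]
    print(f"{name}: {dim}-dim embeddings")  # 512 for ViT-B/32, 1024 for RN50
```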

How many image-caption pairs were contained in the WebImageText dataset used by OpenAI?

OpenAI trained their initial models on a private dataset named WebImageText, containing 400 million image-caption pairs scraped directly from the internet. The dataset was never publicly released.

How long did it take to train the largest ResNet model using V100 GPUs?

Training the largest ResNet model required 592 V100 GPUs running for 18 days straight. According to the original OpenAI report, each model was trained for 32 epochs.

How do users retrieve images based on text descriptions without explicit annotations?

Through CLIP's text-to-image retrieval capability, users can retrieve images from text descriptions without needing explicit annotations beforehand: the text query is embedded and compared against the image embeddings, and the closest matches are returned. The same comparison produces zero-shot predictions when an image embedding is scored against prompts like "A photo of a {class}".
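
A hedged sketch of that retrieval loop, assuming the same open-source `clip` package; the image file names and the query are hypothetical stand-ins for an unlabeled collection:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical unlabeled image collection.
paths = ["beach.jpg", "city.jpg", "forest.jpg"]
images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)

# A free-form text query; no per-image annotations are required.
query = clip.tokenize(["a photo of a sandy beach"]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(images)
    text_emb = model.encode_text(query)

# Cosine similarity ranks the whole collection against the query.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)
print("best match:", paths[scores.argmax().item()])
```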