• Human learning is inherently multi-modal as jointly leveraging multiple senses helps us understand and analyze new information better. Unsurprisingly, recent advances in multi-modal learning take inspiration from the effectiveness of this process to create models that can process and link information using various modalities such as image, video, text, audio, body gestures, facial expressions, and physiological signals. (View Highlight)
  • oint vision-language models have shown particularly impressive capabilities in very challenging tasks such as image captioning, text-guided image generation and manipulation, and visual question-answering. (View Highlight)
  • Take, for example, the task of zero-shot image classification. We’ll pass an image and a few prompts like so to obtain the most probable prompt for the input image. drawing The cat and dog image has been taken from here. To predict something like that, the model needs to understand both the input image and the text prompts. The model would have separate or fused encoders for vision and language to achieve this understanding. But these inputs and outputs can take several forms. Below we give some examples: • Image retrieval from natural language text. • Phrase grounding, i.e., performing object detection from an input image and natural language phrase (example: A young person swings a bat). • Visual question answering, i.e., finding answers from an input image and a question in natural language. • Generate a caption for a given image. This can also take the form of conditional text generation, where you’d start with a natural language prompt and an image. • Detection of hate speech from social media content involving both images and text modalities. (View Highlight)
  • A vision-language model typically consists of 3 key elements: an image encoder, a text encoder, and a strategy to fuse information from the two encoders. These key elements are tightly coupled together as the loss functions are designed around both the model architecture and the learning strategy. While vision-language model research is hardly a new research area, the design of such models has changed tremendously over the years. Whereas earlier research adopted hand-crafted image descriptors and pre-trained word vectors or the frequency-based TF-IDF features, the latest research predominantly adopts image and text encoders with transformer architectures to separately or jointly learn image and text features. These models are pre-trained with strategic pre-training objectives that enable various downstream tasks. (View Highlight)
  • We’ll cover the following themes in the pre-training objectives: • Contrastive Learning: Aligning images and texts to a joint feature space in a contrastive manner • PrefixLM: Jointly learning image and text embeddings by using images as a prefix to a language model • Multi-modal Fusing with Cross Attention: Fusing visual information into layers of a language model with a cross-attention mechanism • MLM / ITM: Aligning parts of images with text with masked-language modeling and image-text matching objectives • No Training: Using stand-alone vision and language models via iterative optimization (View Highlight)
    1. Contrastive Learning Contrastive Learning Contrastive pre-training and zero-shot image classification as shown here. Contrastive learning is a commonly used pre-training objective for vision models and has proven to be a highly effective pre-training objective for vision-language models as well. Recent works such as CLIP, CLOOB, ALIGN, and DeCLIP bridge the vision and language modalities by learning a text encoder and an image encoder jointly with a contrastive loss, using large datasets consisting of {image, caption} pairs. Contrastive learning aims to map input images and texts to the same feature space such that the distance between the embeddings of image-text pairs is minimized if they match or maximized if they don’t. For CLIP, the distance is simply the cosine distance between the text and image embeddings, whereas models such as ALIGN and DeCLIP design their own distance metrics to account for noisy datasets. Another work, LiT, introduces a simple method for fine-tuning the text encoder using the CLIP pre-training objective while keeping the image encoder frozen. The authors interpret this idea as a way to teach the text encoder to better read image embeddings from the image encoder. This approach has been shown to be effective and is more sample efficient than CLIP. Other works, such as FLAVA, use a combination of contrastive learning and other pretraining strategies to align vision and language embeddings. (View Highlight)
  • PrefixLM PrefixLM A diagram of the PrefixLM pre-training strategy (image source) Another approach to training vision-language models is using a PrefixLM objective. Models such as SimVLM and VirTex use this pre-training objective and feature a unified multi-modal architecture consisting of a transformer encoder and transformer decoder, similar to that of an autoregressive language model. (View Highlight)
  • Language models with a prefix objective predict the next token given an input text as the prefix. For example, given the sequence “A man is standing at the corner”, we can use “A man is standing at the” as the prefix and train the model with the objective of predicting the next token - “corner” or another plausible continuation of the prefix. Visual transformers (ViT) apply the same concept of the prefix to images by dividing each image into a number of patches and sequentially feeding these patches to the model as inputs. Leveraging this idea, SimVLM features an architecture where the encoder receives a concatenated image patch sequence and prefix text sequence as the prefix input, and the decoder then predicts the continuation of the textual sequence. The diagram above depicts this idea. The SimVLM model is first pre-trained on a text dataset without image patches present in the prefix and then on an aligned image-text dataset. These models are used for image-conditioned text generation/captioning and VQA tasks. (View Highlight)
  • Models that leverage a unified multi-modal architecture to fuse visual information into a language model (LM) for image-guided tasks show impressive capabilities. However, models that solely use the PrefixLM strategy can be limited in terms of application areas as they are mainly designed for image captioning or visual question-answering downstream tasks. For example, given an image of a group of people, we can query the image to write a description of the image (e.g., “A group of people is standing together in front of a building and smiling”) or query it with questions that require visual reasoning: “How many people are wearing red t-shirts?”. On the other hand, models that learn multi-modal representations or adopt hybrid approaches can be adapted for various other downstream tasks, such as object detection and image segmentation. (View Highlight)
  • Frozen PrefixLM Frozen PrefixLM Frozen PrefixLM pre-training strategy (image source) While fusing visual information into a language model is highly effective, being able to use a pre-trained language model (LM) without the need for fine-tuning would be much more efficient. Hence, another pre-training objective in vision-language models is learning image embeddings that are aligned with a frozen language model. (View Highlight)
  • Models such as Frozen and ClipCap use this Frozen PrefixLM pre-training objective. They only update the parameters of the image encoder during training to generate image embeddings that can be used as a prefix to the pre-trained, frozen language model in a similar fashion to the PrefixLM objective discussed above. Both Frozen and ClipCap are trained on aligned image-text (caption) datasets with the objective of generating the next token in the caption, given the image embeddings and the prefix text. Finally, models such as MAPL and Flamingo keep both the pre-trained vision encoder and language model frozen. Flamingo sets a new state-of-the-art in few-shot learning on a wide range of open-ended vision and language tasks by adding Perceiver Resampler modules on top of the pre-trained frozen vision model and inserting new cross-attention layers between existing pre-trained and frozen LM layers to condition the LM on visual data. A nifty advantage of the Frozen PrefixLM pre-training objective is it enables training with limited aligned image-text data, which is particularly useful for domains where aligned multi-modal datasets are not available. (View Highlight)
  • Multi-modal Fusing with Cross Attention Cross Attention Fusing Fusing visual information with a cross-attention mechanism as shown (image source) Another approach to leveraging pre-trained language models for multi-modal tasks is to directly fuse visual information into the layers of a language model decoder using a cross-attention mechanism instead of using images as additional prefixes to the language model. Models such as VisualGPT, VC-GPT, and Flamingo use this pre-training strategy and are trained on image captioning and visual question-answering tasks. The main goal of such models is to balance the mixture of text generation capacity and visual information efficiently, which is highly important in the absence of large multi-modal datasets. Models such as VisualGPT use a visual encoder to embed images and feed the visual embeddings to the cross-attention layers of a pre-trained language decoder module to generate plausible captions. A more recent work, FIBER, inserts cross-attention layers with a gating mechanism into both vision and language backbones, for more efficient multi-modal fusing and enables various other downstream tasks, such as image-text retrieval and open vocabulary object detection. (View Highlight)
  • Masked-Language Modeling / Image-Text Matching Another line of vision-language models uses a combination of Masked-Language Modeling (MLM) and Image-Text Matching (ITM) objectives to align specific parts of images with text and enable various downstream tasks such as visual question answering, visual commonsense reasoning, text-based image retrieval, and text-guided object detection. Models that follow this pre-training setup include VisualBERT, FLAVA, ViLBERT, LXMERT and BridgeTower. MLM / ITM Aligning parts of images with text (image source) Let’s break down what MLM and ITM objectives mean. Given a partially masked caption, the MLM objective is to predict the masked words based on the corresponding image. Note that the MLM objective requires either using a richly annotated multi-modal dataset with bounding boxes or using an object detection model to generate object region proposals for parts of the input text. For the ITM objective, given an image and caption pair, the task is to predict whether the caption matches the image or not. The negative samples are usually randomly sampled from the dataset itself. The MLM and ITM objectives are often combined during the pre-training of multi-modal models. For instance, VisualBERT proposes a BERT-like architecture that uses a pre-trained object detection model, Faster-RCNN, to detect objects. This model uses a combination of the MLM and ITM objectives during pre-training to implicitly align elements of an input text and regions in an associated input image with self-attention. Another work, FLAVA, consists of an image encoder, a text encoder, and a multi-modal encoder to fuse and align the image and text representations for multi-modal reasoning, all of which are based on transformers. In order to achieve this, FLAVA uses a variety of pre-training objectives: MLM, ITM, as well as Masked-Image Modeling (MIM), and contrastive learning. (View Highlight)
  • No Training Finally, various optimization strategies aim to bridge image and text representations using the pre-trained image and text models or adapt pre-trained multi-modal models to new downstream tasks without additional training. For example, MaGiC proposes iterative optimization through a pre-trained autoregressive language model to generate a caption for the input image. To do this, MaGiC computes a CLIP-based “Magic score” using CLIP embeddings of the generated tokens and the input image. ASIF Crafting a similarity search space using pre-trained, frozen unimodal image and text encoders (image source) ASIF proposes a simple method to turn pre-trained uni-modal image and text models into a multi-modal model for image captioning using a relatively small multi-modal dataset without additional training. The key intuition behind ASIF is that captions of similar images are also similar to each other. Hence we can perform a similarity-based search by crafting a relative representation space using a small dataset of ground-truth multi-modal pairs. (View Highlight)
  • Datasets Vision-language models are typically trained on large image and text datasets with different structures based on the pre-training objective. After they are pre-trained, they are further fine-tuned on various downstream tasks using task-specific datasets. This section provides an overview of some popular pre-training and downstream datasets used for training and evaluating vision-language models. (View Highlight)
  • Pre-training datasets Vision-language models are typically pre-trained on large multi-modal datasets harvested from the web in the form of matching image/video and text pairs. The text data in these datasets can be human-generated captions, automatically generated captions, image metadata, or simple object labels. Some examples of such large datasets are PMD and LAION-5B. The PMD dataset combines multiple smaller datasets such as the Flickr30K, COCO, and Conceptual Captions datasets. The COCO detection and image captioning (>330K images) datasets consist of image instances paired with the text labels of the objects each image contains, and natural sentence descriptions, respectively. The Conceptual Captions (> 3.3M images) and Flickr30K (> 31K images) datasets are scraped from the web along with their captions - free-form sentences describing the image. Even image-text datasets consisting solely of human-generated captions, such as Flickr30K, are inherently noisy as users only sometimes write descriptive or reflective captions for their images. To overcome this issue, datasets such as the LAION-5B dataset leverage CLIP or other pre-trained multi-modal models to filter noisy data and create high-quality multi-modal datasets. Furthermore, some vision-language models, such as ALIGN, propose further preprocessing steps and create their own high-quality datasets. Other vision-language datasets, such as the LSVTD and WebVid datasets, consist of video and text modalities, although at a smaller scale. (View Highlight)
  • Models fine-tuned on the question-answering downstream task, such as ViLT and GLIP, most commonly use the VQA (visual question-answering), VQA v2, NLVR2, OKVQA, TextVQA, TextCaps and VizWiz datasets. These datasets typically contain images paired with multiple open-ended questions and answers. Furthermore, datasets such as VizWiz and TextCaps can also be used for image segmentation and object localization downstream tasks. Some other interesting multi-modal downstream datasets are Hateful Memes for multi-modal classification, SNLI-VE for visual entailment prediction, and Winoground for visio-linguistic compositional reasoning. (View Highlight)
  • Note that vision-language models are used for various classical NLP and computer vision tasks such as text or image classification and typically use uni-modal datasets (SST2, ImageNet-1k, for example) for such downstream tasks. In addition, datasets such as COCO and Conceptual Captions are commonly used both in the pre-training of models and also for the caption generation downstream task. (View Highlight)
  • While models such as CLIP, FLAVA, BridgeTower, BLIP, LiT and VisionEncoderDecoder models provide joint image-text embeddings that can be used for downstream tasks such as zero-shot image classification, other models are trained on interesting downstream tasks. In addition, FLAVA is trained with both unimodal and multi-modal pre-training objectives and can be used for both unimodal vision or language tasks and multi-modal tasks. (View Highlight)
  • OWL-ViT enables zero-shot / text-guided and one-shot / image-guided object detection, CLIPSeg and GroupViT enable text and image-guided image segmentation, and VisualBERT, GIT and ViLT enable visual question answering as well as various other tasks. X-CLIP is a multi-modal model trained with video and text modalities and enables zero-shot video classification similar to CLIP’s zero-shot image classification capabilities. (View Highlight)