Highlights

  • Vision Language Models (VLMs) are the talk of the town. In a previous blog post (from April 2024), we talked a lot about VLMs. A major chunk was about LLaVA, the first successful and easily reproducible open-source vision language model, along with tips on how to discover, evaluate, and fine-tune open models. Since then, so much has changed. Models have become smaller yet more powerful. We’ve seen the rise of new architectures and capabilities (reasoning, agency, long video understanding, etc.). In parallel, entirely new paradigms, such as multimodal Retrieval Augmented Generation (RAG) and multimodal agents, have taken shape.
  • Any-to-any models: as the name suggests, these are models that can take in any modality and output any modality (image, text, audio). They do this by aligning the modalities, so that an input from one modality can be translated into another (e.g. the word “dog” would be associated with an image of a dog, or with the utterance of the word).
  • These models have multiple encoders (one for each modality) and fuse the resulting embeddings to create a shared representation space. The decoders (one or several) take the shared latent space as input and decode into the modality of choice. The earliest attempt to build an any-to-any model was Chameleon by Meta, which can take in image and text and output image and text. Meta didn’t release the image generation capability of this model, so Alpha-VLLM released Lumina-mGPT, which builds image generation on top of Chameleon. The latest and most capable any-to-any model, Qwen 2.5 Omni (figure below), is a good example for understanding the architecture of an any-to-any model. (Figure: Qwen2.5-Omni architecture)
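As a rough illustration of this encoders-to-shared-latent-to-decoders pattern (not the architecture of any specific released model), here is a minimal PyTorch sketch; the module names, dimensions, and vocabulary size are invented purely for illustration.

```python
import torch
import torch.nn as nn

class TinyAnyToAny(nn.Module):
    """Toy any-to-any skeleton: one encoder per modality, a shared latent space,
    and one decoder head per output modality. Purely illustrative."""
    def __init__(self, d_model=512):
        super().__init__()
        # One encoder per input modality, all projecting into the same latent space
        self.encoders = nn.ModuleDict({
            "text": nn.Embedding(32000, d_model),   # token ids -> latent
            "image": nn.Linear(768, d_model),       # patch features -> latent
            "audio": nn.Linear(128, d_model),       # mel features -> latent
        })
        # Shared backbone operating on the fused sequence
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # One decoder head per output modality
        self.decoders = nn.ModuleDict({
            "text": nn.Linear(d_model, 32000),   # latent -> vocabulary logits
            "image": nn.Linear(d_model, 768),    # latent -> image tokens/features
        })

    def forward(self, inputs: dict, output_modality: str):
        # Encode each provided modality and concatenate along the sequence axis
        encoded = [self.encoders[m](x) for m, x in inputs.items()]
        fused = torch.cat(encoded, dim=1)        # shared representation space
        hidden = self.backbone(fused)
        return self.decoders[output_modality](hidden)

model = TinyAnyToAny()
text_ids = torch.randint(0, 32000, (1, 16))
image_patches = torch.randn(1, 64, 768)
logits = model({"text": text_ids, "image": image_patches}, output_modality="text")
print(logits.shape)  # torch.Size([1, 80, 32000])
```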
  • Qwen2.5-Omni employs a novel “Thinker-Talker” architecture, where the “Thinker” handles text generation and the “Talker” produces natural speech responses in a streaming manner. MiniCPM-o 2.6, an 8B-parameter multimodal model, is capable of understanding and generating content across vision, speech, and language modalities. Janus-Pro-7B, introduced by DeepSeek AI, is a unified multimodal model that excels at both understanding and generating content across modalities. It features a decoupled visual encoding architecture, separating the processes for understanding and generation. We expect an uptick in the number of such models in the coming years. A well-known intuition holds that multimodal learning is the path to learning deeper, better representations. We have curated some any-to-any models and demos in this collection.
  • Reasoning models are models that can work through complex problems. We saw them first in large language models, and now in vision language models. Until 2025, there was only one open-source multimodal reasoning model, QVQ-72B-preview by Qwen. It was an experimental model developed by the Alibaba Qwen team that came with many disclaimers.
  • This year there’s another player, Kimi-VL-A3B-Thinking by the Moonshot AI team. It consists of MoonViT (SigLIP-so-400M) as the image encoder and a Mixture-of-Experts (MoE) decoder with 16B total parameters and only 2.8B active parameters. The model is a version of the Kimi-VL base vision language model fine-tuned on long chain-of-thought data and further aligned with reinforcement learning. You can try the model here.

    The authors also released an instruction fine-tuned version called Kimi-VL-A3B-Instruct. (Figure: Kimi-VL)

  • The community used to scale intelligence by increasing the number of parameters, and later by using high-quality synthetic data. After a certain point, benchmarks saturated and scaling models yielded diminishing returns. The community then turned to shrinking larger models through various methods, such as distillation. This makes sense because it reduces compute costs, simplifies deployment, and unlocks use cases like local execution, enhancing data privacy.
  • When we say small vision language models, we often refer to models with fewer than 2B parameters that can be run on consumer GPUs. SmolVLM is a good example of a family of smaller vision language models. Instead of shrinking larger models, the SmolVLM team went all the way and tried to fit models into tiny parameter counts like 256M, 500M, and 2.2B. SmolVLM2, for instance, attempted to solve video understanding at these sizes and found 500M to be a good trade-off. At Hugging Face, we have built an iPhone application, HuggingSnap, to demonstrate that these model sizes can achieve video understanding on consumer devices.
  • Another striking model is gemma3-1b-it by Google DeepMind. It’s particularly exciting as it’s one of the smallest multimodal models to have a 32k-token context window, and it supports 140+ languages. The model is part of the Gemma 3 family of models, whose largest member ranked first on Chatbot Arena at the time; that largest model was then distilled into the 1B variant. Lastly, although not the smallest, Qwen2.5-VL-3B-Instruct is worth noting. The model can do various tasks, ranging from localization (object detection and pointing) to document understanding to agentic tasks, with a context length of up to 32k tokens. You can use small models through MLX and Llama.cpp integrations. For MLX, assuming you have it installed, you can get started with SmolVLM-500M-Instruct with a single mlx-vlm command; a transformers-based alternative is sketched below.
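As a stand-in, here is a minimal transformers-based sketch for running SmolVLM-500M-Instruct. It follows the usual chat-template pattern for this model family; the exact processor and generation arguments are assumptions rather than the blog's own code, so check the model card for specifics.

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-500M-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image briefly."},
    ]},
]

# Build the prompt with the model's chat template, then bundle text + image.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```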
  • Mixture-of-Experts (MoE) models offer an alternative to dense architectures by dynamically selecting and activating only the most relevant sub-models, termed “experts”, to process a given input data segment. This selective activation mechanism (handled by a router) has demonstrated the potential to substantially enhance model performance and operational efficiency while utilizing fewer computational resources.
  • MoEs are faster at inference than dense counterparts with a similar parameter count, because only a small slice of the network is activated. They also converge quickly during training. Every good thing comes at a cost: MoEs need more memory, because the entire model has to sit on the GPU even though only a smaller chunk of it is used at a time.
  • In the widely adopted Transformer architecture, MoE layers are most commonly integrated by replacing the standard Feed-Forward Network (FFN) layers within each Transformer block. Dense networks use the entire model for every inference, while similarly sized MoE networks selectively activate only some experts. This leads to better compute utilization and faster inference.
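As a rough illustration of replacing the FFN with a routed expert layer, here is a minimal top-k MoE block in PyTorch; the expert count, hidden sizes, and routing scheme are invented for illustration and do not correspond to any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoEFFN(nn.Module):
    """Toy MoE feed-forward layer: a router picks the top-k experts per token,
    and only those experts are evaluated (illustrative, not production code)."""
    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # produces routing logits
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (batch, seq, d_model)
        logits = self.router(x)                # (batch, seq, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., slot] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoEFFN()
tokens = torch.randn(2, 10, 512)
print(layer(tokens).shape)  # torch.Size([2, 10, 512])
```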
  • Vision language models that have mixture-of-experts decoders seem to show enhanced performance. For instance, Kimi-VL is, as of now, the most advanced open reasoning model to use a mixture-of-experts decoder. Mixture-of-Experts also shows promising results in MoE-LLaVA, with its focus on efficiency and hallucination reduction, and in DeepSeek-VL2, with its broad multimodal capabilities. The latest version of Llama (Llama 4) is an MoE with vision capabilities. MoE decoders are a promising research area, and we expect an increase in models like these.
  • VLMs are even making their mark in the field of robotics! There, they are known as vision-language-action models (VLAs). But don’t be fooled: those are mainly VLMs with a little moustache and hat. VLAs take images and text instructions and directly return text indicating the actions the robot should take. VLAs extend vision language models by adding action and state tokens to interact with and control physical environments. These extra tokens represent the system’s internal state (how it perceives the environment), actions (what it does based on commands), and time-related information (like the order of steps in a task). The tokens are appended to the vision-language input to generate actions or a policy; a toy example of such a token layout is sketched below.
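To make the token layout concrete, here is a purely hypothetical sketch of how state tokens might be appended to a vision-language prompt and how generated action tokens might be decoded back into continuous commands; all token names, bin sizes, and helper functions are invented and do not correspond to any released VLA.

```python
import re

# Hypothetical discretized action vocabulary: each value is binned into one of
# 256 tokens, e.g. "<act_017>". Nothing here matches a real model's vocabulary.
ACTION_BINS = 256

def encode_state(gripper_open: bool, arm_xyz: tuple) -> str:
    """Turn the robot's current state into state tokens appended to the prompt."""
    return f"<state_gripper_{int(gripper_open)}>" + "".join(
        f"<state_pos_{round((v + 1) / 2 * (ACTION_BINS - 1))}>" for v in arm_xyz
    )

def decode_action(generated: str) -> list[float]:
    """Map generated tokens like '<act_128>' back to continuous values in [-1, 1]."""
    bins = [int(b) for b in re.findall(r"<act_(\d+)>", generated)]
    return [2 * b / (ACTION_BINS - 1) - 1 for b in bins]

# The VLM-style input: image placeholder + instruction + current state tokens.
prompt = "<image> Pick up the red block. " + encode_state(True, (0.1, -0.3, 0.5))
# A (pretend) model completion consisting of action tokens for the next step.
completion = "<act_140><act_90><act_200><act_0>"
print(decode_action(completion))
```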
  • Object detection, segmentation, and counting with vision language models: as we’ve seen in earlier sections, VLMs enable generalization over traditional computer vision tasks. Models can now take in images and a variety of prompts, such as open-ended text, and output structured text with localization tokens (for detection, segmentation, and more). Last year, PaliGemma was the first model to attempt solving these tasks. The model takes in an image and text, where the text describes an object of interest along with a task prefix. The text prompt looks like “segment striped cat” or “detect bird on the roof”. For detection, the model outputs the bounding box coordinates as tokens. For segmentation, on the other hand, the model outputs detection tokens and segmentation tokens. These segmentation tokens aren’t the segmented pixel coordinates themselves, but codebook indices that are decoded by a variational autoencoder trained to turn them into valid segmentation masks (as shown in the figure below). (Figure: PaliGemma segmentation)
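For a sense of what the detection prompt looks like in practice, here is a minimal transformers sketch using a PaliGemma mix checkpoint; the exact checkpoint name and the mapping of `<locXXXX>` tokens to pixel coordinates are assumptions, so consult the model card for specifics.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-448"  # assumed mix checkpoint with detection support
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("roof.jpg")
prompt = "detect bird on the roof"  # task prefix + object description

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=64)

# The completion contains location tokens such as "<loc0123><loc0456>... bird",
# which encode normalized bounding-box coordinates on a fixed grid.
completion = generated[0][inputs["input_ids"].shape[1]:]
print(processor.decode(completion, skip_special_tokens=True))
```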
  • Many models for localization tasks have been introduced since PaliGemma. Late last year, an upgraded version, PaliGemma 2, appeared with the same capabilities and better performance. Another model that came later is Molmo by Allen AI, which can point to instances with dots and count object instances. (Figure: Molmo)
  • Qwen2.5-VL can also detect, point to, and count objects, and this includes UI elements as objects too! (Figure: Qwen2.5-VL)
  • Vision language models in production require filtering of inputs and outputs to prevent jailbreaks and harmful outputs and to stay compliant. Harmful content ranges from violent inputs to sexually explicit content. That’s where multimodal safety models come in: they are used before and after vision language models to filter their inputs and outputs. They are just like LLM safety models, but with an additional image input.
  • In early 2025, Google introduced the first open multimodal safety model, ShieldGemma 2. It is built on ShieldGemma, the text-only safety model. This model takes in images and content policies and returns whether an image is safe for a given policy; a policy is the criterion by which an image is deemed inappropriate. ShieldGemma 2 can also be used to filter the outputs of image generation models. Llama Guard 4 by Meta is a dense multimodal and multilingual safety model. It is densely pruned from Llama 4 Scout (a multimodal mixture-of-experts model) with safety fine-tuning. (Figure: Llama Guard 4)
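To illustrate the before-and-after filtering pattern (not any specific safety model's API), here is a hypothetical wrapper; `image_is_safe` and `text_is_safe` stand in for calls to a multimodal safety model such as ShieldGemma 2 or Llama Guard 4 and are not real library functions.

```python
from PIL import Image

def image_is_safe(image: Image.Image, policy: str) -> bool:
    """Placeholder for a multimodal safety-model call (hypothetical)."""
    raise NotImplementedError

def text_is_safe(text: str, policy: str) -> bool:
    """Placeholder for a text safety-model call (hypothetical)."""
    raise NotImplementedError

def guarded_vlm_answer(vlm, image: Image.Image, question: str, policy: str) -> str:
    # Filter the input before it reaches the VLM...
    if not image_is_safe(image, policy):
        return "Input rejected by safety policy."
    answer = vlm(image, question)
    # ...and filter the output before it reaches the user.
    if not text_is_safe(answer, policy):
        return "Response withheld by safety policy."
    return answer
```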
  • Now let’s look at how Retrieval Augmented Generation has evolved in the multimodal space. RAG for complex documents, usually formatted as PDFs, works in three steps:
    1. parsing the document completely into text
    2. passing the plain text and the query to a retriever and a reranker to get the most relevant documents
    3. passing the relevant context and the query to an LLM
    A traditional PDF parser consists of multiple components that preserve the structure and visual elements of the document, such as layout, tables, images, and charts, all rendered into markdown. But this setup can be hard to maintain; a rough sketch of the pipeline follows below. (Figure: traditional parsing pipeline)
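A condensed sketch of that three-step pipeline, with `parse_pdf_to_markdown`, `retrieve`, `rerank`, and `llm_answer` as hypothetical stand-ins for whatever parser, retriever, reranker, and LLM you use (none of them are real library calls):

```python
def rag_over_pdfs(pdf_paths: list[str], query: str, top_k: int = 3) -> str:
    # Step 1: parse every document into plain text/markdown (layout, tables, charts...).
    docs = [parse_pdf_to_markdown(path) for path in pdf_paths]    # hypothetical parser

    # Step 2: retrieve candidates for the query, then rerank to keep the best ones.
    candidates = retrieve(docs, query, top_k=20)                  # hypothetical retriever
    context = rerank(candidates, query, top_k=top_k)              # hypothetical reranker

    # Step 3: let the LLM answer the query grounded in the retrieved context.
    return llm_answer(query=query, context="\n\n".join(context))  # hypothetical LLM call
```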
  • Multimodal retrievers take a stack of PDFs and a query as input and return the most relevant page numbers along with their confidence scores. The scores represent how likely it is that a page contains the answer to the query, i.e. how relevant the page is to the query. This bypasses the brittle parsing step. The most relevant pages are then fed to the vision language model along with the query, and the VLM generates the answer. There are two main multimodal retriever architectures:
    1. Document Screenshot Embedding (DSE, MCDSE)
    2. ColBERT-like models (ColPali, ColQwen2, ColSmolVLM)
    DSE models consist of a text encoder and an image encoder and return a single vector per query and a single vector per passage. The returned scores are a softmax over the dot products of the query and passage embeddings. (Figure: DSE)
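A minimal sketch of that single-vector scoring, assuming the query and page embeddings have already been computed by the two encoders (the tensors here are random placeholders):

```python
import torch
import torch.nn.functional as F

# One embedding per query and one per PDF page (stand-ins for encoder outputs).
query_embedding = F.normalize(torch.randn(1, 768), dim=-1)     # (1, dim)
page_embeddings = F.normalize(torch.randn(50, 768), dim=-1)    # (num_pages, dim)

# Dot products give one relevance score per page; softmax turns them into a
# distribution over pages, from which we pick the most relevant ones.
scores = F.softmax(query_embedding @ page_embeddings.T, dim=-1)  # (1, num_pages)
top_scores, top_pages = scores.topk(3, dim=-1)
print(top_pages.tolist(), top_scores.tolist())
```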
  • ColBERT-like models, like ColPali, are also dual-encoder models, with a twist: ColPali uses a vision language model as the image encoder and a large language model as the text encoder. These models are not inherently encoders, but they output embeddings, which are then scored with a “MaxSim” operation. Unlike DSE, the outputs are multiple vectors, one for each token. In MaxSim, the similarity between each text token embedding and each image patch embedding is calculated, and this approach captures nuances better. For this reason, ColBERT-like models are less cost-efficient but have better performance. Below you can see the indexing latency for ColPali. Since it’s just a single model, it’s also easier to maintain. (Figure: ColPali indexing latency)
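A minimal sketch of the MaxSim (late-interaction) score, again using random placeholder embeddings for one query and one page:

```python
import torch

def maxsim(query_tokens: torch.Tensor, page_patches: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction: for each query token embedding, take its
    maximum similarity over all page-patch embeddings, then sum over query tokens."""
    sim = query_tokens @ page_patches.T          # (num_query_tokens, num_patches)
    return sim.max(dim=1).values.sum()

query_tokens = torch.randn(12, 128)              # multi-vector query (one vector per token)
page_patches = torch.randn(1024, 128)            # multi-vector page (one vector per patch)
print(maxsim(query_tokens, page_patches))        # higher score = more relevant page
```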
  • Vision language models unlock many agentic workflows, from chatting with documents to computer use. Here we cover the latter, since it requires more advanced agentic capabilities. Recently, there have been many vision language model releases that understand and operate over UIs. The latest is UI-TARS-1.5 by ByteDance, which showed great results in browser, computer, and phone use. It can also do gameplay with reasoning and operate in open-world games. Another impactful release of this year is MAGMA-8B, a foundation model for both UI navigation and physical interaction with the real world. Moreover, Qwen2.5-VL (especially its 32B variant, which is further trained on agentic tasks) and the Kimi-VL reasoning model are good at GUI agentic tasks.
  • At the beginning of 2025, we introduced smolagents, a new lightweight agentic library that implements the ReAct framework. Shortly after, we added vision language support to the library. This integration covers two use cases (a minimal sketch of the first follows below):
    • Provide images once, at the beginning of the run. This is useful for document AI with tool use.
    • Retrieve images dynamically. This is useful for cases such as GUI control with VLM agents, where the agent repeatedly takes screenshots.
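A minimal sketch of passing an image at the start of a run, assuming smolagents' `CodeAgent`, `InferenceClientModel`, and the `images=` argument on `run` (class and argument names may differ across smolagents versions, so treat them as assumptions and check the docs):

```python
from PIL import Image
from smolagents import CodeAgent, InferenceClientModel

# A VLM served through an inference endpoint acts as the agent's brain.
model = InferenceClientModel(model_id="Qwen/Qwen2.5-VL-7B-Instruct")
agent = CodeAgent(tools=[], model=model)

# Case 1: provide the image(s) once at the beginning of the run.
invoice = Image.open("invoice.png")
result = agent.run(
    "Extract the total amount and the due date from this invoice.",
    images=[invoice],
)
print(result)
```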
  • Most vision language models these days can handle videos, because videos can be represented as a sequence of frames. However, video understanding is tricky because of the temporal relationships between frames and the large number of frames, so different techniques are used to select a representative set of video frames. (Figure: video language models)
  • Since last year, the community has weighed different approaches and tricks to solve this problem. A good example is the LongVU model by Meta. It downsamples video frames by passing them through DINOv2 and removing the ones that are most similar to each other, and then further refines the selection by picking the frames most relevant to the text query, where both the text and the frames are projected into the same space and their similarity is calculated. Qwen2.5-VL can handle long context and is adapted to dynamic FPS rates, as the model is trained on videos with different frame rates. Through its extended multimodal RoPE, it understands the absolute time positions of frames and can handle different rates while still understanding the speed of events in real life. Another model is Gemma 3, which can accept video frames interleaved with timestamps in the text prompt (e.g. “Frame 00.00: ..”) and is very performant on video understanding tasks. (Figure: multimodal RoPE)
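As a rough, generic illustration of similarity-based frame pruning (not LongVU's exact procedure), here is a sketch that drops frames whose embeddings are nearly identical to the previously kept frame; the threshold and the source of the embeddings (e.g. a DINOv2 forward pass) are assumptions.

```python
import torch
import torch.nn.functional as F

def prune_similar_frames(frame_embeddings: torch.Tensor, threshold: float = 0.95) -> list[int]:
    """Keep a frame only if its cosine similarity to the last kept frame is
    below `threshold`; returns the indices of the kept frames."""
    emb = F.normalize(frame_embeddings, dim=-1)        # (num_frames, dim)
    kept = [0]                                         # always keep the first frame
    for i in range(1, emb.shape[0]):
        if torch.dot(emb[i], emb[kept[-1]]) < threshold:
            kept.append(i)
    return kept

# frame_embeddings would come from an image encoder such as DINOv2 (assumed here).
frame_embeddings = torch.randn(300, 768)
print(prune_similar_frames(frame_embeddings)[:10])
```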
  • New alignment techniques for vision language models: preference optimization is an alternative fine-tuning approach for language models that can also be extended to vision language models. Instead of relying on fixed labels, this method compares and ranks candidate responses based on preferences. The trl library offers support for direct preference optimization (DPO), including for VLMs. Below is an example of how a preference dataset for DPO fine-tuning of a VLM is structured. Each entry consists of an image + question pair and two corresponding answers: one chosen and one rejected. The VLM is fine-tuned to generate responses aligned with the preferred (chosen) answer. (Figure: DPO preference dataset)
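A sketch of what one such preference entry might look like, along with a hedged trainer setup using trl; the column names follow trl's standard preference format, and the exact DPOTrainer/DPOConfig arguments for VLMs should be treated as assumptions (check the trl docs for the current API).

```python
from PIL import Image

# One preference entry: an image + question pair, a preferred ("chosen") answer,
# and a dispreferred ("rejected") answer.
example = {
    "images": [Image.open("chart.png")],
    "prompt": "How many bars in the chart are above 50?",
    "chosen": "Three bars exceed 50: Q2, Q3, and Q4.",
    "rejected": "All of the bars are above 50.",
}

# Hedged trainer setup (assumed argument names; model/processor loading omitted):
# from trl import DPOConfig, DPOTrainer
# trainer = DPOTrainer(
#     model=model,                    # a VLM loaded with transformers
#     args=DPOConfig(output_dir="vlm-dpo"),
#     train_dataset=dataset,          # a datasets.Dataset of entries like `example`
#     processing_class=processor,     # the VLM's processor (text + images)
# )
# trainer.train()
```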