Pelayo Arbués


Introducing Idefics2: A Powerful 8B Vision-Language Model for the Community

Apr 16, 2025 · 3 min read

articles
literature-note

Metadata

Author: huggingface.co
Full Title: Introducing Idefics2: A Powerful 8B Vision-Language Model for the Community
URL: https://huggingface.co/blog/idefics2

Highlights

We are excited to release Idefics2, a general multimodal model that takes as input arbitrary sequences of texts and images, and generates text responses. It can answer questions about images, describe visual content, create stories grounded in multiple images, extract information from documents, and perform basic arithmetic operations. Idefics2 improves upon Idefics1: with 8B parameters, an open license (Apache 2.0), and enhanced OCR (Optical Character Recognition) capabilities, Idefics2 is a strong foundation for the community working on multimodality. Its performance on Visual Question Answering benchmarks is at the top of its size class and competes with much larger models such as LLaVA-NeXT-34B and MM1-30B-chat. Idefics2 is also integrated in 🤗 Transformers from the get-go and is therefore straightforward to fine-tune for many multimodal applications. You can try out the models on the Hub right now!
Idefics2 was trained on a mixture of openly available datasets for the pretraining: interleaved web documents (Wikipedia, OBELICS), image-caption pairs (Public Multimodal Dataset, LAION-COCO), OCR data (PDFA (en), IDL, and Rendered-text), and image-to-code data (WebSight). The interactive visualization allows exploring the OBELICS dataset. Following common practices in the foundation model community, we further train the base model on task-oriented data. However, these data are often in disparate formats and scattered in various places. Gathering them is a barrier for the community. To address that problem, we are releasing the multimodal instruction fine-tuning dataset we’ve been cooking: The Cauldron, an open compilation of 50 manually curated datasets formatted for multi-turn conversations. We instruction fine-tuned Idefics2 on the concatenation of The Cauldron and various text-only instruction fine-tuning datasets.
We significantly enhanced OCR abilities by integrating data that requires the model to transcribe text in an image or a document. We also improved abilities in answering questions on charts, figures, and documents with appropriate training data.
We departed from Idefics1’s architecture (gated cross-attentions) and simplified the integration of visual features into the language backbone. The images are fed to the vision encoder, followed by a learned Perceiver pooling and an MLP modality projection. That pooled sequence is then concatenated with the text embeddings to obtain an (interleaved) sequence of image(s) and text(s).
The model is built on top of two pre-trained models: Mistral-7B-v0.1 and siglip-so400m-patch14-384. Both of them have been released under the Apache-2.0 license. We release Idefics2 weights under an Apache-2.0 license as well.
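
To make the architecture highlight above more concrete, here is a toy sketch of the data flow it describes: vision encoder output, then a learned Perceiver-style pooling, then an MLP projection, then concatenation with text embeddings. It is not the Idefics2 implementation. All dimensions, the single-head attention, the ReLU MLP, the random stand-in weights, and the function names (`perceiver_pool`, `mlp_project`) are assumptions chosen for illustration, and the order of pooling and projection simply follows the order in which the highlight lists them.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols, scale=0.1):
    """Random matrix standing in for learned weights or encoder outputs."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def perceiver_pool(patch_feats, latents):
    """Single-head cross-attention: latent queries attend over patch features.

    The output has one row per latent, so its length is fixed
    regardless of how many patches the image produced.
    """
    d = len(latents[0])
    keys_t = [list(col) for col in zip(*patch_feats)]       # (d, n_patches)
    scores = matmul(latents, keys_t)                        # (n_latents, n_patches)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, patch_feats)                     # (n_latents, d)

def mlp_project(x, w1, w2):
    """Two-layer MLP with ReLU, mapping vision width to text-embedding width."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

# Hypothetical toy sizes; the real model's sizes are much larger.
N_PATCHES, D_VISION, N_LATENTS, D_HIDDEN, D_TEXT = 12, 8, 4, 16, 6

patch_feats = rand_matrix(N_PATCHES, D_VISION)  # stand-in for vision encoder output
latents = rand_matrix(N_LATENTS, D_VISION)      # stand-in for learned latent queries
w1 = rand_matrix(D_VISION, D_HIDDEN)
w2 = rand_matrix(D_HIDDEN, D_TEXT)

pooled = perceiver_pool(patch_feats, latents)   # 12 patches -> 4 pooled vectors
image_tokens = mlp_project(pooled, w1, w2)      # 4 vectors in text-embedding space

# Interleave: text embeddings, then the image's pooled tokens, then more text.
text_before = rand_matrix(3, D_TEXT)
text_after = rand_matrix(2, D_TEXT)
sequence = text_before + image_tokens + text_after

print(len(sequence), len(sequence[0]))  # prints "9 6"
```

The point of the pooling step in this sketch is that an image contributes a fixed number of positions to the sequence (here 4) however many patches the encoder emits, and after projection those positions have the same width as the text embeddings, so the language backbone can consume one flat interleaved sequence.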
