nanoVLM is the simplest way to get started with training your very own Vision Language Model (VLM) using pure PyTorch. It is a lightweight toolkit that lets you launch a VLM training run on a free-tier Colab notebook.
At its heart, nanoVLM is a toolkit that helps you build and train a model that can understand both images and text, and then generate text based on that. The beauty of nanoVLM lies in its simplicity. The entire codebase is intentionally kept minimal and readable, making it perfect for beginners or anyone who wants to peek under the hood of VLMs without getting overwhelmed.
As the name suggests, a Vision Language Model (VLM) is a multi-modal model that processes two modalities: vision and text. These models typically take images and/or text as input and generate text as output.
Generating text (output) conditioned on the understanding of images and text (inputs) is a powerful paradigm. It enables a wide range of applications, from image captioning and object detection to answering questions about visual content (as shown in the table below). Note that nanoVLM focuses only on Visual Question Answering as the training objective.
We model nanoVLM after two well-known and widely used architectures. Our vision backbone (models/vision_transformer.py) is a standard vision transformer, more specifically Google's SigLIP vision encoder. Our language backbone follows the Llama 3 architecture.
The vision and text modalities are aligned using a Modality Projection module. This module takes the image embeddings produced by the vision backbone as input, and transforms them into embeddings compatible with the text embeddings from the embedding layer of the language model. These embeddings are then concatenated and fed into the language decoder. The Modality Projection module consists of a pixel shuffle operation followed by a linear layer.
Pixel shuffle reduces the number of image tokens, which helps reduce computational cost and speeds up training, especially for transformer-based language decoders which are sensitive to input length. The figure below demonstrates the concept.
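In code, an illustrative version of the full modality projector (pixel shuffle to shrink the token count, followed by a linear projection into the language model's embedding space) could look like the sketch below. The dimensions, the shuffle factor, and the class name are assumptions chosen to match a SigLIP-base encoder and a SmolLM2-135M decoder, not the exact nanoVLM implementation.

```python
import torch
import torch.nn as nn

class ModalityProjectorSketch(nn.Module):
    """Illustrative modality projector: pixel shuffle, then a linear layer."""

    def __init__(self, vit_dim=768, lm_dim=576, factor=2):
        super().__init__()
        self.factor = factor
        # After shuffling, each token carries factor**2 patches' worth of features
        self.proj = nn.Linear(vit_dim * factor * factor, lm_dim)

    def pixel_shuffle(self, x):
        # x: (batch, num_tokens, dim), with num_tokens forming a square grid
        b, n, d = x.shape
        side, f = int(n ** 0.5), self.factor
        x = x.view(b, side, side, d)
        # Group each f x f neighbourhood of patches into a single, wider token
        x = x.view(b, side // f, f, side // f, f, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // f) ** 2, d * f * f)
        return x

    def forward(self, x):
        return self.proj(self.pixel_shuffle(x))

tokens = torch.randn(1, 196, 768)               # e.g. 14x14 patches from the vision encoder
print(ModalityProjectorSketch()(tokens).shape)  # torch.Size([1, 49, 576]) -- 4x fewer tokens
```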
All the files are lightweight and well documented. We highly encourage you to check them out individually (models/xxx.py) to get a better understanding of the implementation details.
While training, we use the following pre-trained backbone weights:
• Vision backbone: google/siglip-base-patch16-224
• Language backbone: HuggingFaceTB/SmolLM2-135M
Before anything else, the script loads two configuration classes from models/config.py:
• TrainConfig: Configuration parameters useful for training, like learning rates, checkpoint paths, etc.
• VLMConfig: The configuration parameters used to initialize the VLM, like hidden dimensions, number of attention heads, etc.
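A minimal sketch of how the two configs are picked up at the top of the training script; this assumes they can be instantiated with defaults, and the fields named in the comments are examples rather than the exact attributes:

```python
from models.config import TrainConfig, VLMConfig

train_cfg = TrainConfig()  # e.g. learning rates, checkpoint paths, batch size
vlm_cfg = VLMConfig()      # e.g. hidden dimensions, number of attention heads
```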
At the heart of the data pipeline is the get_dataloaders function (a simplified sketch follows the list). It:
• Loads datasets via Hugging Face’s load_dataset API.
• Combines and shuffles multiple datasets (if provided).
• Applies a train/val split via indexing.
• Wraps them in custom datasets (VQADataset, MMStarDataset) and collators (VQACollator, MMStarCollator).
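The sketch below mirrors those four steps with plain Hugging Face datasets and PyTorch utilities. The dataset name, subset names, split ratio, and the pass-through collator are illustrative stand-ins; nanoVLM wraps the splits in its own VQADataset/VQACollator (and MMStar counterparts) instead:

```python
from datasets import load_dataset, concatenate_datasets
from torch.utils.data import DataLoader

def get_dataloaders_sketch(subsets, batch_size=32, val_ratio=0.01, seed=42):
    # 1. Load each requested subset from the Hub, then combine and shuffle them
    parts = [load_dataset("HuggingFaceM4/the_cauldron", s, split="train") for s in subsets]
    combined = concatenate_datasets(parts).shuffle(seed=seed)

    # 2. Train/val split via indexing
    n_val = int(len(combined) * val_ratio)
    train_ds = combined.select(range(len(combined) - n_val))
    val_ds = combined.select(range(len(combined) - n_val, len(combined)))

    # 3. nanoVLM wraps these in VQADataset/VQACollator; a pass-through collate_fn
    #    keeps this sketch self-contained.
    return (
        DataLoader(train_ds, batch_size=batch_size, shuffle=True, collate_fn=list),
        DataLoader(val_ds, batch_size=batch_size, collate_fn=list),
    )

# Example usage (subset names are those of the Cauldron on the Hub):
# train_loader, val_loader = get_dataloaders_sketch(["vqav2", "ai2d"])
```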
Model Initialization
The model is built via the VisionLanguageModel class. If you're resuming from a checkpoint, it's as easy as:
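A sketch of that initialization path; the import path, the from_pretrained call, and the config field names are assumptions about the interface rather than the verbatim script:

```python
from models.vision_language_model import VisionLanguageModel

if train_cfg.resume_from_vlm_checkpoint:
    # Reload the full model (backbones + modality projector) from a saved checkpoint
    model = VisionLanguageModel.from_pretrained(vlm_cfg.vlm_checkpoint_path)
else:
    # Fresh start: pre-trained backbones, newly initialized modality projector
    model = VisionLanguageModel(vlm_cfg)
```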
Because the modality projector (MP) is freshly initialized while the backbones are pre-trained, the optimizer is split into two parameter groups, each with its own learning rate:
• A higher LR for the MP
• A smaller LR for the encoder/decoder stack
This balance ensures the MP learns quickly while preserving knowledge in the vision and language backbones.
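A minimal sketch of that two-group setup, assuming the modality projector's parameters can be identified by an MP prefix and using illustrative learning rates:

```python
import torch

# Split parameters into the freshly initialized modality projector (MP)
# and everything else (pre-trained vision encoder + language decoder).
mp_params = [p for n, p in model.named_parameters() if n.startswith("MP")]
backbone_params = [p for n, p in model.named_parameters() if not n.startswith("MP")]

optimizer = torch.optim.AdamW([
    {"params": mp_params, "lr": 1e-3},        # higher LR: the MP starts from scratch
    {"params": backbone_params, "lr": 1e-5},  # smaller LR: preserve pre-trained knowledge
])
```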
This part is fairly standard but thoughtfully structured:
• Mixed precision is used with torch.autocast to improve performance.
• A cosine learning rate schedule with linear warmup is implemented via get_lr (sketched below).
• Token throughput (tokens/sec) is logged per batch for performance monitoring.
Every 250 steps (configurable), the model is evaluated on the validation and MMStar test datasets. If accuracy improves, the model is checkpointed.
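To make the schedule concrete, here is one way a cosine decay with linear warmup can be written; the function name echoes get_lr from the list above, but the exact signature and constants in nanoVLM may differ:

```python
import math

def get_lr(step: int, max_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to max_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

# Inside the training loop, each parameter group gets its own scheduled value, e.g.:
# for group, base_lr in zip(optimizer.param_groups, (1e-3, 1e-5)):
#     group["lr"] = get_lr(step, base_lr, warmup_steps=100, total_steps=total_steps)
```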
Using nanoVLM as the toolkit, we have trained a model and published it to the Hub. We used google/siglip-base-patch16-224 and HuggingFaceTB/SmolLM2-135M as backbones. The model was trained for ~6h on a single H100 GPU on ~1.7M samples of the Cauldron.
This model isn't intended to compete with SoTA models, but rather to demystify the components and training process of VLMs.
In this blog post, we walked through what VLMs are, explored the architecture choices that power nanoVLM, and unpacked the training and inference workflows in detail.
By keeping the codebase lightweight and readable, nanoVLM aims to serve as both a learning tool and a foundation you can build upon. Whether you're looking to understand how multi-modal inputs are aligned, or you want to train a VLM on your own dataset, this repository gives you a head start.