• In mid-2023, LLaVA emerged as a groundbreaking multimodal language model, showcasing an advanced approach integrating language and visual data. Unlike traditional models that primarily focus on either text or image processing, LLaVA stands out for its ability to seamlessly blend both domains. This enables the model to understand and interpret the intricate relationship between visual elements and textual descriptions, leading to more nuanced and contextually rich AI interactions.
  • The authors of LLaVA also introduced a visual instruction tuning process, which has proven to be a pioneering approach for multimodal AI. They utilize GPT-4, a language-only model, to generate instruction-following data that pairs language with images. This innovative method involves converting image-text pairs into formats suitable for instruction-following tasks, effectively creating a bridge between visual data and language processing.
  • Essentially, they take existing datasets of image-text pairs and prompt GPT-4, an inherently text-based model, to elaborate on the text-only label data associated with each image. This procedure transforms the original dataset into a more complex and instruction-rich version: GPT-4 generates an array of questions and detailed descriptions based on the initial image captions, effectively deepening the contextual understanding and expanding the instructional content of the data.
  • This expansion is not just an increase in the volume of text but an enhancement in the quality and depth of information. The language model delves into the nuances of each image, asking pertinent questions and providing detailed descriptions that go beyond the surface level. This results in a dataset that is richer and more suited for training an AI model capable of nuanced multimodal understanding and response.
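Conceptually, the data-expansion step can be sketched as a simple prompt builder. The function name and prompt wording below are invented for illustration (the real system prompts are given in the LLaVA paper and repo), and no API call is made here:

```python
# Hypothetical sketch: wrap a text-only image caption in a prompt that
# asks a text-only model (e.g. GPT-4) to invent instruction-following
# Q&A pairs about the image. Prompt text is illustrative only.

def build_expansion_prompt(caption: str, n_questions: int = 3) -> str:
    """Build a prompt asking for questions and detailed answers
    grounded only in the given caption."""
    return (
        "You are given a caption describing an image.\n"
        f"Caption: {caption}\n"
        f"Write {n_questions} questions a person might ask about this "
        "image, each followed by a detailed answer grounded only in "
        "the caption."
    )

prompt = build_expansion_prompt("A dog catching a frisbee in a park")
print(prompt)
```

The returned string would then be sent to the text-only model, and its output parsed into question/answer pairs.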
  • Here’s an example of the data used for training the LLaVA model:
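A record in the conversation format the LLaVA repo expects looks roughly like this; the id, filename, and dialogue below are invented for illustration:

```python
import json

# Sketch of one training record in LLaVA's conversation JSON format.
# The "human" turn carries the image placeholder token plus a question;
# the "gpt" turn carries the target response.
record = {
    "id": "000000001",
    "image": "000000001.jpg",
    "conversations": [
        {"from": "human",
         "value": "<image>\nWhat is unusual about this scene?"},
        {"from": "gpt",
         "value": "A man is ironing clothes on an ironing board "
                  "attached to the roof of a moving taxi."},
    ],
}

print(json.dumps(record, indent=2))
```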
  • Architecturally, LLaVA unites the strengths of pre-trained language models like Vicuna or LLaMA with visual models like CLIP’s visual encoder. This integration involves transforming the visual features extracted from images into a format that aligns with the language model’s embeddings. The model employs a trainable projection matrix for this purpose, resulting in a sequence of visual token embeddings that are compatible with the language model.
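The projection step can be sketched in a few lines of dependency-free Python; the dimensions below are tiny placeholders (real models map e.g. a 1024-dimensional CLIP feature into a 4096-dimensional LLM embedding space):

```python
# Minimal sketch of the projection step: a trainable matrix W maps each
# visual feature vector (dimension d_vision, e.g. from CLIP) into the
# language model's embedding space (dimension d_text).

def project(visual_features, W):
    """Multiply each visual token feature vector by the projection
    matrix W (shape d_text x d_vision)."""
    return [
        [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
        for x in visual_features
    ]

d_vision, d_text = 3, 5
W = [[0.1] * d_vision for _ in range(d_text)]   # d_text x d_vision
tokens = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]     # two visual tokens
embedded = project(tokens, W)
print(len(embedded), len(embedded[0]))  # 2 5
```

Each visual token is now a d_text-wide vector that can be interleaved with ordinary text token embeddings in the language model's input sequence.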
  • LLaVA’s training comprises a two-stage process. The initial stage, referred to as pre-training, utilizes image-text pairs to align the visual features with the language model’s embeddings. This stage keeps the weights of both the visual encoder and language model frozen, focusing on training the projection matrix. The subsequent stage involves fine-tuning the model end-to-end. Here, the visual encoder’s weights are frozen, while updates are made to the projection layer and language model.
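The two-stage freezing schedule can be illustrated with a toy stand-in for framework parameters; the `Parameter` class below mimics the `requires_grad` flag of `torch.nn.Parameter` without depending on PyTorch:

```python
# Toy illustration of which parameter groups train in each stage.

class Parameter:
    def __init__(self, name):
        self.name, self.requires_grad = name, True

vision = [Parameter("clip.layer1")]
projection = [Parameter("proj.W")]
language = [Parameter("llm.layer1")]

def set_trainable(trainable_groups, frozen_groups):
    for group in trainable_groups:
        for p in group:
            p.requires_grad = True
    for group in frozen_groups:
        for p in group:
            p.requires_grad = False

# Stage 1 (pre-training): only the projection matrix learns.
set_trainable([projection], [vision, language])
# Stage 2 (fine-tuning): projection + language model learn; vision frozen.
set_trainable([projection, language], [vision])
print(vision[0].requires_grad, language[0].requires_grad)  # False True
```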
  • For this experiment, we’ll focus on fine-tuning LLaVA on a custom dataset using the official LLaVA repo with the Llama-2 7B backbone language model. We will use the OK-VQA dataset, which contains image-text pairs whose questions require reasoning about an image’s contents rather than simple description. These questions demand deeper cognitive processing, making OK-VQA a suitable choice for testing LLaVA’s advanced capabilities. Before fine-tuning, we must first format the data to align with the specific requirements of the LLaVA repository.
  • The script processes each image and its associated question from the dataset, saves the images locally, and creates a unique identifier for each. The questions and answers are formatted into a single JSON file. In this structure, the ‘human’ key represents the person asking the question, and the ‘gpt’ key represents LLaVA’s response. The JSON format is crucial, as it matches the expected input format for LLaVA, enabling effective training and fine-tuning of the model. Note that we will not follow the same instruction-tuning process as demonstrated in the paper; we will mainly focus on training the model to produce a single ‘complex reasoning’ response given an image and a query.
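A minimal sketch of the kind of conversion the script performs is shown below. The OK-VQA field names and filename scheme are simplified assumptions for illustration (the real dataset stores several annotator answers per question, and the actual script handles image downloading as well):

```python
import json

def okvqa_to_llava(entry, index):
    """Turn one simplified OK-VQA question/answer pair into a
    LLaVA-style conversation record."""
    image_file = f"okvqa_{entry['image_id']:012d}.jpg"
    return {
        "id": str(index),
        "image": image_file,
        "conversations": [
            {"from": "human", "value": "<image>\n" + entry["question"]},
            {"from": "gpt", "value": entry["answer"]},
        ],
    }

sample = {"image_id": 42,
          "question": "What sport can you use this for?",
          "answer": "surfing"}
record = okvqa_to_llava(sample, 0)
print(record["image"])  # okvqa_000000000042.jpg

# All records would then be collected and dumped to one JSON file:
with open("okvqa_train.json", "w") as f:
    json.dump([record], f, indent=2)
```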
  • Now that the dataset is formatted and ready, we move on to the training phase of LLaVA. We will build off of the original LLaVA repo. Notably, the original repository lacked support for intermediate evaluations between epochs, which are helpful for identifying signs of overfitting.
  • Training large language models typically presents a challenging trade-off between computational efficiency and model performance. Traditionally, you’re either faced with utilizing vast computational resources to train large models or accepting diminished performance with smaller ones. However, there is an approach that reconciles these conflicting demands: QLoRA.
  • To grasp the essence of QLoRA (Quantized LoRA), it’s essential to first understand the concept of LoRA. LoRA’s strategy involves keeping the original pre-trained backbone of the model intact while appending additional, more efficiently trainable layers. This approach facilitates rapid adaptation to new tasks without the need for retraining the entire network. By concentrating the learning on a select group of new parameters, LoRA effectively retains the benefits of a substantial pre-trained model but with significantly reduced computational demands. This aspect is particularly beneficial in practical scenarios where resources are constrained or swift adaptation to novel data is paramount. QLoRA introduced a novel data type, the 4-bit NormalFloat, specifically designed for normally distributed weights, which surpasses the performance of other 4-bit data types. This new data type reduces memory requirements even further!
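The LoRA idea can be made concrete with toy matrices: the frozen weight `W` is perturbed by a trainable low-rank product `B @ A`, scaled by `alpha / r`, so only `r * (d_in + d_out)` numbers are trained instead of `d_in * d_out`. Sizes here are deliberately tiny:

```python
# Toy, pure-Python illustration of the LoRA update W_eff = W + (alpha/r) * B @ A.

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d_out, d_in, r, alpha = 4, 4, 2, 4
W = [[1.0 if i == j else 0.0 for j in range(d_in)]  # frozen backbone weight
     for i in range(d_out)]
A = [[0.5] * d_in for _ in range(r)]       # r x d_in, trainable
B = [[0.25] * r for _ in range(d_out)]     # d_out x r, trainable

delta = matmul(B, A)                       # d_out x d_in, rank <= r
scale = alpha / r
W_eff = [[w + scale * d for w, d in zip(w_row, d_row)]
         for w_row, d_row in zip(W, delta)]
print(W_eff[0][0])  # 1.5  (1.0 + 2 * 0.25)
```

QLoRA keeps this same low-rank update but additionally stores the frozen backbone `W` in the 4-bit NormalFloat format, shrinking its memory footprint.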
  • DeepSpeed is an open-source deep learning optimization library designed to enhance the speed, scale, and efficiency of training large-scale deep learning models. Developed by Microsoft, it allows for faster and more efficient training, particularly for very large models, by leveraging various optimization techniques. One of the key components of DeepSpeed is its ZeRO technology. ZeRO is designed to optimize memory usage during training, enabling the training of much larger models than was previously possible on the same hardware. ZeRO is divided into different optimization stages, with ZeRO Stage 2 being one of them. ZeRO Stage 2 reduces memory redundancy by partitioning optimizer state and gradients across the data parallel processes. This means each process stores only a portion of these components, drastically reducing the memory requirements per process. If you experience CUDA out-of-memory errors with this config, consider trying the Stage 3 config, which allows offloading to the CPU; this will slow down training but may solve the memory error.
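A minimal ZeRO Stage 2 config might look like the sketch below. The key names follow DeepSpeed’s JSON config schema, while the specific values are placeholders to adapt to your hardware:

```python
import json

# Sketch of a minimal DeepSpeed ZeRO Stage 2 configuration, written out
# as the JSON file passed to the launcher. Values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": "auto"},
}

with open("zero2.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

The `"auto"` values let the Hugging Face Trainer integration fill in settings from its own arguments; with raw DeepSpeed you would supply concrete numbers instead.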
  • I won’t go into the details of the training script; however, I will cover the run command for using the script, as many of the details can be covered easily here. Generally, instead of pasting a long command like this into the terminal, I prefer to create a bash script (with the .sh extension) and place the command in that file. I’ve found this makes it easier to test different hyper-parameters and avoid syntax errors on the command line.
  • lora_alpha: Following the guidelines of the LLaVA authors, we’ve set lora_alpha to 256. This alpha value is pivotal in preserving numerical stability and the full expressive power of the model. It’s worth noting that this is an adjustment from the typical values around 16.
  • lora_r: The lora_r parameter represents the rank of the decomposition matrices in LoRA. We’ve chosen a value of 128, diverging from the common range of 8 to 64 seen in typical LLM fine-tunes. A higher rank, as in our case, can enhance the model’s representational capability.
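Putting the pieces together, a launch script along these lines could be saved as, say, `finetune_okvqa.sh`. This is a sketch modeled on the example finetune scripts shipped in the LLaVA repo; verify the exact flag names against `scripts/finetune_lora.sh` in your checkout, and treat all paths here as placeholders for this setup:

```shell
#!/bin/bash
# Sketch of a DeepSpeed launch script for LoRA fine-tuning LLaVA.
# Flags are modeled on the repo's example finetune scripts; paths and
# batch/learning-rate values are placeholders to adjust.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --lora_enable True \
    --lora_r 128 \
    --lora_alpha 256 \
    --model_name_or_path ./checkpoints/llama-2-7b-chat \
    --data_path ./data/okvqa_train.json \
    --image_folder ./data/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --output_dir ./checkpoints/llava-okvqa-lora \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --learning_rate 2e-4 \
    --save_steps 500
```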