In this recipe, we’ll demonstrate how to fine-tune a Vision-Language Model (VLM) for object detection grounding using TRL.
Traditionally, object detection involves identifying a predefined set of classes (e.g., “car”, “person”, “dog”) within an image. However, this paradigm shifted with models like Grounding DINO, GLIP, or OWL-ViT, which introduced open-ended object detection, enabling models to detect any class described in natural language.
Grounding goes a step further by adding contextual understanding. Instead of just detecting a “car”, grounded detection can locate the “car on the left” or the “red car behind the tree”. This provides a more nuanced and powerful approach to object detection.
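Before any fine-tuning, it can help to see what open-vocabulary detection looks like in practice. Below is a minimal sketch using the `zero-shot-object-detection` pipeline from `transformers` with an OWL-ViT checkpoint; the checkpoint, image URL, and text queries are illustrative placeholders, not part of this recipe's training setup.

```python
# pip install transformers pillow torch
from transformers import pipeline

# OWL-ViT exposes open-vocabulary detection: the classes are given as
# free-form text at inference time instead of being fixed at training time.
detector = pipeline(
    "zero-shot-object-detection",
    model="google/owlvit-base-patch32",  # illustrative checkpoint choice
)

# Placeholder image and queries. Referring expressions like "car on the
# left" can be passed the same way, though how reliably they are localized
# depends on the model's grounding ability.
results = detector(
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    candidate_labels=["a cat", "a remote control"],
)

# Each result is a dict with a score, the matched label, and a pixel box.
for r in results:
    box = r["box"]
    print(f'{r["label"]}: score={r["score"]:.2f}, '
          f'box=({box["xmin"]}, {box["ymin"]}, {box["xmax"]}, {box["ymax"]})')
```

Queries that depend on spatial context (the "grounding" part) are exactly where zero-shot models tend to fall short, which is the gap the fine-tuning in this recipe aims to close.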