A FiftyOne zoo model integration for Qwen3-VL-Embedding, enabling state-of-the-art multimodal embeddings for video and image datasets.
Qwen3-VL-Embedding maps text, images, and video into a unified representation space, enabling powerful cross-modal retrieval and understanding. Built on the Qwen3-VL foundation model, it achieves state-of-the-art results on multimodal embedding benchmarks including MMEB-V2.
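As a rough sketch of the standard FiftyOne zoo workflow this integration plugs into (the source URL and model name below are placeholders; use the exact values from the installation instructions):

```python
import fiftyone.zoo as foz

# Placeholder source URL and model name -- substitute the values from
# this repo's installation instructions
foz.register_zoo_model_source("https://github.com/<org>/<repo>")
model = foz.load_zoo_model("qwen3-vl-embedding")

# Any image (or video) dataset works; quickstart is FiftyOne's demo dataset
dataset = foz.load_zoo_dataset("quickstart")

# Embed every sample into the shared text/image/video space
dataset.compute_embeddings(model, embeddings_field="qwen3_embedding")
```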
Features
• Multimodal Embeddings: Generate embeddings for videos, images, and text in a shared vector space
• Text-to-Video/Image Search: Find media that matches natural language queries (sketched below)
• Zero-Shot Classification: Classify media using text prompts, without any training (sketched below)
• Batched Inference: Efficient processing with configurable batch sizes
• Flexible Video Sampling: Configurable FPS and frame limits for different video lengths
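The text-to-video/image search above maps onto FiftyOne's brain-powered similarity workflow. A minimal sketch, reusing the `model` and `dataset` from the snippet above and assuming the model embeds text prompts into the same space (which free-form text queries require):

```python
import fiftyone.brain as fob

# Build a similarity index over the dataset using the embedding model
fob.compute_similarity(
    dataset,
    model=model,  # the Qwen3-VL-Embedding model loaded above
    brain_key="qwen3_sim",
)

# Query the index with free-form text; results are the k nearest samples
view = dataset.sort_by_similarity(
    "a person riding a bike",
    brain_key="qwen3_sim",
    k=25,
)
```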
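Zero-shot classification can likewise be sketched via FiftyOne's `apply_model()` workflow. The `classes` and `text_prompt` kwargs here are assumptions borrowed from how FiftyOne's CLIP zoo models expose this; check the repo for this integration's actual parameters:

```python
# `classes` and `text_prompt` are assumed kwargs, modeled on FiftyOne's
# CLIP zoo models -- verify against this integration's documentation
classifier = foz.load_zoo_model(
    "qwen3-vl-embedding",
    classes=["cat", "dog", "bird"],
    text_prompt="A photo of a",
)

# Writes a Classification label to each sample without any training
dataset.apply_model(classifier, label_field="zero_shot_predictions")
```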