A FiftyOne zoo model integration for Qwen3-VL-Embedding, enabling state-of-the-art multimodal embeddings for video and image datasets.
Qwen3-VL-Embedding maps text, images, and video into a unified representation space, enabling powerful cross-modal retrieval and understanding. Built on the Qwen3-VL foundation model, it achieves state-of-the-art results on multimodal embedding benchmarks including MMEB-V2.
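As a rough sketch of the standard FiftyOne zoo workflow this integration plugs into (the source URL and model name below are placeholders; use the exact values from the installation instructions):

```python
import fiftyone.zoo as foz

# Placeholder source URL and model name -- substitute the values from
# this repo's installation instructions
foz.register_zoo_model_source("https://github.com/<org>/<repo>")
model = foz.load_zoo_model("qwen3-vl-embedding")

# Any image (or video) dataset works; quickstart is FiftyOne's demo dataset
dataset = foz.load_zoo_dataset("quickstart")

# Embed every sample into the shared text/image/video space
dataset.compute_embeddings(model, embeddings_field="qwen3_embedding")
```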
Features
• Multimodal Embeddings: Generate embeddings for videos, images, and text in a shared vector space
• Text-to-Video/Image Search: Find media that matches natural language queries (sketched below)
• Zero-Shot Classification: Classify media using text prompts, without any training (sketched below)
• Batched Inference: Efficient processing with configurable batch sizes
• Flexible Video Sampling: Configurable FPS and frame limits for different video lengths
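The text-to-video/image search above maps onto FiftyOne's brain-powered similarity workflow. A minimal sketch, reusing the `model` and `dataset` from the snippet above and assuming the model embeds text prompts into the same space (which free-form text queries require):

```python
import fiftyone.brain as fob

# Build a similarity index over the dataset using the embedding model
fob.compute_similarity(
    dataset,
    model=model,  # the Qwen3-VL-Embedding model loaded above
    brain_key="qwen3_sim",
)

# Query the index with free-form text; results are the k nearest samples
view = dataset.sort_by_similarity(
    "a person riding a bike",
    brain_key="qwen3_sim",
    k=25,
)
```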
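Zero-shot classification can likewise be sketched via FiftyOne's `apply_model()` workflow. The `classes` and `text_prompt` kwargs here are assumptions borrowed from how FiftyOne's CLIP zoo models expose this; check the repo for this integration's actual parameters:

```python
# `classes` and `text_prompt` are assumed kwargs, modeled on FiftyOne's
# CLIP zoo models -- verify against this integration's documentation
classifier = foz.load_zoo_model(
    "qwen3-vl-embedding",
    classes=["cat", "dog", "bird"],
    text_prompt="A photo of a",
)

# Writes a Classification label to each sample without any training
dataset.apply_model(classifier, label_field="zero_shot_predictions")
```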