Multimodal Embedding & Reranker Models with Sentence Transformers

2026-04-09



Sentence Transformers v5.4 adds multimodal embedding and reranker models that handle text, image, audio, and video inputs through a unified API.

• Multimodal embedding models map inputs from different modalities into a shared vector space, which enables cross-modal similarity search and retrieval.
• Multimodal rerankers score the relevance of mixed-modality query-document pairs, with better quality than embedding models alone.
• Installation requires optional dependencies for each modality (image, audio, video) and a GPU with 8 GB or more of VRAM.
• The encode_query() and encode_document() methods automatically apply the appropriate instruction prompts for optimal retrieval performance.
• Models such as Qwen3-VL-Embedding-2B detect the modalities they support and accept images as URLs, file paths, or PIL Image objects.
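
The retrieve-then-rerank pattern implied by the points above can be illustrated with a minimal sketch in plain Python. It does not call Sentence Transformers: the three-dimensional vectors, the file names, and the toy_pair_score function are all made up for illustration. In practice the vectors would presumably come from encode_query() and encode_document(), and the pair scores from a multimodal reranker.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, doc_vecs, top_k):
    """Stage 1: rank every document by cosine similarity to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def rerank(query, candidate_ids, score_pair):
    """Stage 2: rescore only the shortlisted candidates, pair by pair."""
    rescored = [(doc_id, score_pair(query, doc_id)) for doc_id in candidate_ids]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored


# Made-up embeddings standing in for an image, an audio clip, and a text
# document that a multimodal model has mapped into one shared space.
DOC_VECS = {
    "photo_of_cat.jpg": [0.9, 0.1, 0.0],
    "dog_barking.wav": [0.1, 0.9, 0.1],
    "cat_care_guide.txt": [0.7, 0.2, 0.3],
}
QUERY_TEXT = "how do I look after a cat?"
QUERY_VEC = [1.0, 0.0, 0.1]  # made-up embedding of QUERY_TEXT


def toy_pair_score(query, doc_id):
    """Hypothetical stand-in for a reranker's relevance score."""
    fixed_scores = {"photo_of_cat.jpg": 0.4, "cat_care_guide.txt": 0.8}
    return fixed_scores.get(doc_id, 0.0)


candidates = retrieve(QUERY_VEC, DOC_VECS, top_k=2)
final = rerank(QUERY_TEXT, [doc_id for doc_id, _ in candidates], toy_pair_score)

print("shortlist:", [doc_id for doc_id, _ in candidates])
print("reranked: ", [doc_id for doc_id, _ in final])
```

With these toy numbers the embedding stage shortlists the cat photo and the care guide, and the reranking stage moves the care guide to the top. The cheap vector comparison narrows the field, and the more expensive pairwise scorer only runs on the shortlist.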
