Endigest AI Core Summary
This post demonstrates how to train and finetune multimodal embedding and reranker models using Sentence Transformers on custom domain data.
•Multimodal models handle text, images, audio, and video inputs simultaneously, requiring specialized training approaches beyond text-only methods
•Finetuning on domain-specific data significantly improves performance: Visual Document Retrieval example improved NDCG@10 from 0.888 to 0.947 and outperformed larger models
•Key training components include the model, dataset, loss function (CachedMultipleNegativesRankingLoss, MatryoshkaLoss), training arguments, evaluator, and trainer
•Dataset format must match the loss function requirements, supporting multimodal inputs: text strings, PIL images, file paths, URLs, audio and video arrays
•The SentenceTransformerTrainer pipeline automatically handles image preprocessing through the model's processor, with configurable parameters such as max_pixels and the attention implementation
This summary was automatically generated by AI based on the original article and may not be fully accurate.