Building a Fast Multilingual OCR Model with Synthetic Data

2026-04-17

Endigest AI Core Summary

Nemotron OCR v2 is a multilingual OCR model trained on 12.2 million synthetic images, generated by combining the mOSCAR text corpus with a modified SynthDoG renderer. This approach overcomes Nemotron OCR v1's limitations with non-English text.

• Synthetic data generation provides pixel-precise annotations at multiple levels (word, line, paragraph) together with reading-order graphs, avoiding expensive manual annotation.
• The pipeline generates diverse layouts, including multi-column text, tables, vertical text for CJK, and slides, using 165 to 1,258 open-source fonts per language.
• Accuracy improved significantly on non-English languages: normalized edit distance (NED, lower is better) fell from 0.56-0.92 to 0.035-0.069, and inference runs at 34.7 pages/second on an A100 GPU.
• The approach is language-agnostic: adding a new language requires only source text and fonts, with no architecture modifications.
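
To make the NED figures above concrete, here is a minimal sketch of how a normalized edit distance between an OCR prediction and its reference text might be computed. It assumes character-level Levenshtein distance divided by the length of the longer string; the summary does not state which normalization the original evaluation uses, and the function names are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute each cost 1)."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means an exact match.

    Assumption: normalize by the longer of the two strings.
    """
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are treated as identical
    return levenshtein(prediction, reference) / longest
```

Under this definition, a score of 0.035 would mean that roughly 3.5% of the characters need to be edited to turn the model output into the reference text, while a score of 0.56 would mean more than half of them do.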
