Direct Preference Optimization Beyond Chatbots

2026-06-03



This article discusses applying Direct Preference Optimization (DPO) to address text degeneration in OCR models.

• The DharmaOCR team identified persistent text degeneration (repetition loops) that supervised fine-tuning (SFT) alone cannot fully resolve.
• Applying DPO after SFT reduced degeneration by 59.4% on average, with a best-case improvement of 87.6%.
• DPO operates at the completion level and explicitly penalizes failed outputs, whereas SFT optimizes token by token and only ever sees correct targets.
• The key innovation was deliberately keeping the SFT model's degenerate outputs and using them as the rejected examples in DPO training pairs.
• This approach requires no specialized annotation, only a model that can produce both correct outputs and identifiable failure outputs.
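
The points above can be illustrated with a minimal Python sketch. It is not the article's implementation: the n-gram repetition heuristic, its thresholds, the function names (`is_degenerate`, `build_pairs`, `dpo_loss`), and the choice of pairing the first clean sample with each degenerate one are all assumptions made for illustration. The loss is the standard per-pair DPO objective, computed from summed completion log-probabilities under the trained policy and a frozen reference model (here, the SFT checkpoint).

```python
import math
from collections import Counter


def is_degenerate(text, n=4, max_repeats=3):
    """Hypothetical heuristic: flag a completion as a repetition loop
    when any word n-gram occurs more than max_repeats times."""
    words = text.split()
    if len(words) < n:
        return False
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(counts.values()) > max_repeats


def build_pairs(samples):
    """Build DPO preference pairs from SFT-model samples.

    samples: iterable of (prompt, completions), where completions is a list
    of strings sampled from the SFT model for that prompt. Degenerate
    completions are kept as 'rejected'; the first non-degenerate completion
    is used as 'chosen' (an assumption; a real pipeline would also check
    that the chosen text is actually correct).
    """
    pairs = []
    for prompt, completions in samples:
        clean = [c for c in completions if not is_degenerate(c)]
        looping = [c for c in completions if is_degenerate(c)]
        if not clean:
            continue  # no usable positive example for this prompt
        for bad in looping:
            pairs.append({"prompt": prompt, "chosen": clean[0], "rejected": bad})
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a whole completion,
    which is why the objective works at the completion level.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


if __name__ == "__main__":
    samples = [("page_001.png", ["Invoice total: 42.00 EUR due on receipt",
                                 "total total " + "due on " * 12])]
    for pair in build_pairs(samples):
        print(pair["chosen"], "| rejected:", pair["rejected"][:30], "...")
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss equals log 2; it falls as the policy shifts probability toward the chosen completion and away from the rejected, looping one. In practice the log-probabilities would come from model forward passes rather than hand-supplied numbers.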

This summary was automatically generated by AI based on the original article and may not be fully accurate.
