This article discusses applying Direct Preference Optimization (DPO) to address text degeneration in OCR models.
- DharmaOCR identified persistent text degeneration (repetition loops) that supervised fine-tuning (SFT) cannot fully resolve
- DPO applied after SFT reduced degeneration by an average of 59.4%, with a best-case improvement of 87.6%
- DPO operates at the completion level and explicitly penalizes failed outputs, whereas SFT optimizes token by token
- The key innovation was deliberately preserving the SFT model's degenerate outputs as the rejected examples in DPO training pairs
- This requires no specialized annotation, only a model that produces both correct outputs and identifiable failures
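The pairing idea in the points above can be illustrated with a minimal sketch. It is not the article's implementation. The repetition heuristic `is_degenerate`, its thresholds, the helper `build_pairs`, and the use of "non-degenerate" as a stand-in for "correct" are all assumptions made for illustration. `dpo_loss` is the standard per-pair DPO objective, written in plain Python over summed completion log-probabilities, with `beta=0.1` as an arbitrary example value.

```python
import math
from collections import Counter


def is_degenerate(text, n=4, max_repeats=3):
    """Hypothetical heuristic: flag a completion as a repetition loop
    when any word n-gram occurs more than max_repeats times."""
    words = text.split()
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return any(count > max_repeats for count in ngrams.values())


def build_pairs(samples):
    """Build preference pairs from SFT-model completions.

    samples maps a prompt identifier to a list of completions for it.
    A pair is emitted only when the list contains both a non-degenerate
    completion (assumed correct here) and a degenerate one.
    """
    pairs = []
    for prompt, completions in samples.items():
        good = [c for c in completions if not is_degenerate(c)]
        bad = [c for c in completions if is_degenerate(c)]
        if good and bad:
            pairs.append({"prompt": prompt, "chosen": good[0], "rejected": bad[0]})
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(margin)).

    Each argument is the summed log-probability of a whole completion,
    which is why the penalty acts at completion level rather than per token.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    x = -margin
    # Numerically stable softplus(x) = log(1 + exp(x)) = -log(sigmoid(margin))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

The loss falls as the policy raises the chosen completion's likelihood relative to the reference model and lowers the rejected one's, so a repetition loop kept as the `rejected` entry is pushed down as a whole output. The `prompt`/`chosen`/`rejected` keys follow a common naming convention for preference datasets; a real pipeline would likely also verify the chosen completion against ground truth rather than rely on the heuristic alone.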
This summary was automatically generated by AI based on the original article and may not be fully accurate.