Endigest AI Core Summary
Falcon Perception is a 0.6B-parameter early-fusion Transformer for open-vocabulary object grounding and segmentation from natural language prompts.
•Uses a hybrid attention mask, with bidirectional image tokens and causal text/task tokens in a single backbone, eliminating separate vision and fusion stages.
•Implements Chain-of-Perception: sequential prediction of coordinate → size → segmentation with Fourier feature encoding for precise localization.
•Lightweight output heads compute masks via dot products with upsampled image features, avoiding Hungarian matching and separate mask decoders.
•Trained on 54M images with 195M positive expressions and 488M hard negatives through three-stage curriculum with multi-teacher distillation.
•Achieves 68.0 Macro-F1 on the SA-Co benchmark (vs. 62.3 for SAM 3), with the remaining gap primarily in presence calibration (MCC 0.64 vs. 0.82).
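The hybrid attention mask in the first bullet can be sketched as follows. This is a minimal NumPy illustration, not the article's implementation: image tokens attend bidirectionally among themselves, text/task tokens attend causally among themselves and freely to all image tokens. Whether image tokens may attend back to text is an assumption; here they cannot, keeping the visual encoding prompt-independent.

```python
import numpy as np

def hybrid_attention_mask(n_img: int, n_txt: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) for a sequence of
    n_img image tokens followed by n_txt text/task tokens."""
    n = n_img + n_txt
    mask = np.zeros((n, n), dtype=bool)
    # Image block: full bidirectional attention among image tokens.
    mask[:n_img, :n_img] = True
    # Text/task rows: attend to every image token...
    mask[n_img:, :n_img] = True
    # ...and causally to themselves (lower-triangular block).
    mask[n_img:, n_img:] = np.tril(np.ones((n_txt, n_txt), dtype=bool))
    return mask

m = hybrid_attention_mask(4, 3)
```

In a single-backbone setup like this, one mask replaces the separate vision encoder and fusion module: the same attention layers serve both roles, with the block structure deciding who sees whom.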
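The Fourier feature encoding mentioned for Chain-of-Perception can be sketched like this. The power-of-two band spacing and band count are assumptions for illustration, not details from the article; the idea is that mapping a normalized coordinate through sines and cosines at multiple frequencies lets the model resolve fine positional differences.

```python
import numpy as np

def fourier_features(xy: np.ndarray, num_bands: int = 8) -> np.ndarray:
    """Encode normalized coordinates in [0, 1] as sin/cos Fourier features.

    xy: (..., D) array of coordinates; returns (..., D * 2 * num_bands).
    Band frequencies are powers of two times pi (an assumed schedule).
    """
    freqs = 2.0 ** np.arange(num_bands) * np.pi   # (num_bands,)
    angles = xy[..., None] * freqs                # (..., D, num_bands)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*xy.shape[:-1], -1)

enc = fourier_features(np.array([0.25, 0.5]))
```

A sequential coordinate → size → segmentation pipeline would feed encodings like this into each stage, so later predictions condition on precisely localized earlier ones.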
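The lightweight mask head in the third bullet reduces to a dot product between a per-object query embedding and upsampled image features. A hedged sketch, with shapes and the sigmoid output assumed rather than taken from the article:

```python
import numpy as np

def predict_mask(query: np.ndarray, feat_map: np.ndarray) -> np.ndarray:
    """Score each spatial location by the dot product between a query
    embedding (C,) and upsampled image features (H, W, C), then squash
    the logits to per-pixel probabilities with a sigmoid."""
    logits = feat_map @ query             # (H, W)
    return 1.0 / (1.0 + np.exp(-logits))  # per-pixel mask probability

rng = np.random.default_rng(0)
mask = predict_mask(rng.standard_normal(16),
                    rng.standard_normal((8, 8, 16)))
```

Because each query scores pixels independently, there is no set-prediction matching step: this is what lets the design skip Hungarian matching and a separate mask decoder.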
This summary was automatically generated by AI based on the original article and may not be fully accurate.