Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension
By: Juexi Shao, Siyou Li, Yujian Gan and more
Potential Business Impact:
Helps computers understand conversations in pictures.
Dialogue-based Generalized Referring Expression Comprehension (GREC) requires models to ground an expression that may refer to any number of targets in complex visual scenes, while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.
Similar Papers
Reasoning Matters for 3D Visual Grounding
CV and Pattern Recognition
Teaches computers to find objects in 3D scenes.
Error-Driven Scene Editing for 3D Grounding in Large Language Models
CV and Pattern Recognition
Teaches robots to understand 3D spaces better.
Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension
CV and Pattern Recognition
Helps computers find one or many things in pictures.