AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference Understanding

  • Faculty of Computing, Harbin Institute of Technology
  • Harbin Institute of Technology
  • School of Mechatronics Engineering, Harbin Institute of Technology
  • School of Medicine and Health, Harbin Institute of Technology
  • Nanjing University of Information Science & Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Embodied reference understanding is crucial for intelligent agents that must predict referents from human intention expressed through gesture signals and language descriptions. This paper introduces Attention-Dynamic DINO (AD-DINO), a novel framework designed to mitigate misinterpretations of pointing gestures across various interaction contexts. Our approach integrates visual and textual features to simultaneously predict the target object’s bounding box and the attention source in pointing gestures. Leveraging the distance-aware nature of nonverbal communication in visual perspective taking, we extend the virtual touch line mechanism and propose an attention-dynamic touch line that represents referring gestures based on interactive distance. Combining this distance-aware approach with independent prediction of the attention source enhances the alignment between objects and the line representing the gesture. Extensive experiments on the YouRefIt dataset demonstrate that our gesture understanding method significantly improves task performance. Our model achieves 76.3% accuracy at the 0.25 IoU threshold and, notably, surpasses human performance at the 0.75 IoU threshold, marking a first in this domain. Comparative experiments with distance-unaware understanding methods from previous research further validate the superiority of the Attention-Dynamic Touch Line across diverse contexts.
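
The accuracy figures in the abstract are reported at IoU (intersection over union) thresholds. As an illustration only, the sketch below shows one common way such a metric can be computed for axis-aligned boxes. The function names `iou` and `accuracy_at_iou`, the `(x_min, y_min, x_max, y_max)` box layout, and the use of `>=` as the hit criterion are assumptions made for this example, not details taken from the paper or from the YouRefIt evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) boxes yield an IoU of 0 rather than a division error.
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float
) -> float:
    """Fraction of predictions whose IoU with the matching target meets the threshold.

    Assumption: a prediction counts as correct when IoU >= threshold; some
    benchmarks use a strict inequality instead.
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, t in zip(preds, targets) if iou(p, t) >= threshold)
    return hits / len(preds)
```

Under these assumptions, a call such as `accuracy_at_iou(predicted_boxes, ground_truth_boxes, 0.25)` would return the share of predictions counted as correct at the 0.25 threshold; raising the threshold to 0.75 demands much tighter localization.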

Original language: English
Pages (from-to): 10238-10249
Number of pages: 12
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 35
Issue number: 10
State: Published - 2025
Externally published: Yes

Keywords

  • Embodied reference understanding
  • referring expression comprehension
  • visual grounding
