Abstract
The objective of referring expression comprehension (REC) is to find the common feature domain between language expressions and visual objects. Because modeling the relationships between objects in an image is complex, graph-based methods are widely used for the REC task. However, when constructing graphs, existing graph-based REC methods do not fully exploit the visual information associated with the objects in an image. Moreover, when modeling the relationships between objects, these methods consider only the relational words of the expression and the positions of the objects, ignoring the objects themselves. As a result, they are sub-optimal at capturing the underlying relationships between the objects and the expression, which leads to incorrect predictions for complex expressions. To address these issues, we propose a plug-and-adapt module for graph-based REC methods, called the expression-guided selective and filtering module (EGSFM), which constructs an expression-guided filter to adaptively select relevant and important visual features from the feature maps of objects. The selected visual object features and the textual features of the expression are then jointly used for graph construction. Finally, we propose a noun-oriented reasoning strategy for graph reasoning and target object matching, in which the number of reasoning steps is determined by the number of nouns or noun phrases in the expression. Extensive experiments on three challenging public datasets, RefCOCO, RefCOCO+, and RefCOCOg, show that our method outperforms the compared graph-based methods and is robust to complex language expressions. In addition, our method performs favorably against other state-of-the-art transformer-based methods while requiring far fewer computational resources for training.
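To make the pipeline described in the abstract more concrete, here is a minimal NumPy sketch of its three stages: expression-guided feature selection, joint visual-textual graph construction, and noun-oriented reasoning. It is not the authors' EGSFM implementation. The channel gate, spatial attention, edge definition, message-passing update, and all function and weight names (`expression_guided_filter`, `build_graph`, `noun_oriented_reasoning`, `W_gate`, `W_att`, `W_txt`, `W_msg`, `W_q`) are hypothetical, and the weights are random rather than learned. The only rule taken directly from the abstract is that the number of reasoning steps equals the number of nouns or noun phrases in the expression. Noun extraction is assumed to happen upstream.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def expression_guided_filter(feat_maps, expr, W_gate, W_att):
    """Hypothetical filter: select channels and spatial positions of each
    object's feature map, both conditioned on the expression embedding.
    feat_maps: (N, C, H, W) maps of N candidate objects; expr: (D,)."""
    n, c, h, w = feat_maps.shape
    gate = sigmoid(W_gate @ expr)                    # (C,) channel selection weights
    flat = feat_maps.reshape(n, c, h * w)            # (N, C, HW)
    gated = flat * gate[None, :, None]               # damp expression-irrelevant channels
    query = W_att @ expr                             # (C,) expression query in visual space
    att = softmax(np.einsum("nck,c->nk", gated, query), axis=1)  # (N, HW) spatial weights
    return np.einsum("nck,nk->nc", gated, att)       # (N, C) attention-pooled features

def build_graph(obj_feats, expr, W_txt):
    """Hypothetical graph construction: nodes fuse selected visual features
    with projected textual features; edges are row-normalized affinities."""
    txt = np.tanh(W_txt @ expr)                      # (C,) textual feature in visual dims
    nodes = obj_feats * txt[None, :]                 # (N, C) language-conditioned nodes
    scores = nodes @ nodes.T                         # (N, N) pairwise affinities
    np.fill_diagonal(scores, -np.inf)                # no self-edges (requires N >= 2)
    return nodes, softmax(scores, axis=1)

def noun_oriented_reasoning(nodes, adj, num_nouns, W_msg, q):
    """One message-passing step per noun or noun phrase, then match each
    node against the expression query q and return the best object."""
    h = nodes
    for _ in range(max(num_nouns, 1)):
        h = h + np.maximum(adj @ h @ W_msg, 0.0)     # residual ReLU message passing
    probs = softmax(h @ q)                           # (N,) matching distribution
    return probs, int(np.argmax(probs))

# Toy usage with random inputs and random (untrained) weights.
rng = np.random.default_rng(0)
N, C, H, W, D = 5, 16, 7, 7, 32
feat_maps = rng.standard_normal((N, C, H, W))
expr = rng.standard_normal(D)
W_gate, W_att, W_txt, W_q = (rng.standard_normal((C, D)) * 0.1 for _ in range(4))
W_msg = rng.standard_normal((C, C)) * 0.1
nouns = ["man", "umbrella"]  # e.g., from "the man holding a red umbrella"

obj = expression_guided_filter(feat_maps, expr, W_gate, W_att)
nodes, adj = build_graph(obj, expr, W_txt)
probs, target = noun_oriented_reasoning(nodes, adj, len(nouns), W_msg, W_q @ expr)
print(obj.shape, adj.shape, target)
```

Tying the loop count to `len(nouns)` mirrors the idea that an expression mentioning more entities needs more hops of relational reasoning. In a real model all projection matrices would be trained end to end.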
| Original language | English |
|---|---|
| Article number | 111222 |
| Journal | Pattern Recognition |
| Volume | 161 |
| DOIs | |
| State | Published - May 2025 |
| Externally published | Yes |
Keywords
- Expression-guided selective and filtering module
- Language-to-vision mapping
- Noun-oriented reasoning
- Referring expression comprehension
Fingerprint
Dive into the research topics of 'Graph-based referring expression comprehension with expression-guided selective filtering and noun-oriented reasoning'. Together they form a unique fingerprint.