Abstract
Grounded multimodal named entity recognition (GMNER) aims to identify named entities and their visual correspondences in multimodal content. Despite recent advances, existing approaches face two limitations: (1) insufficient bridging of the hierarchical global-local modality gap between images and text, and (2) inadequate modeling of visual content, particularly in distinguishing inter-object relationships (shared information) from distinctive object characteristics (private information). These issues hinder both accurate fine-grained entity recognition and reliable visual grounding, the two core objectives of the GMNER task. To address them, we propose a Vision-to-Text Connector (VTC) framework that projects visual features into the text feature space. Our framework introduces two key components. First, a Shared-Private Connector disentangles visual semantics by combining a graph-based branch that captures shared contextual dependencies with a mixture-of-experts branch that extracts discriminative, object-specific cues. Second, to address the hierarchical modality gap, we design a Synergistic Global–Local Alignment (SGLA) objective that jointly aligns representations of global image-text pairs and local object-entity sets. This objective integrates a relevance-aware coupling strategy, which uses global image-text semantic consistency to adaptively weight the local alignment term, thereby alleviating noise from weak, sparse supervision. Experimental results on two public benchmarks show that our framework achieves F1 scores of 59.41% and 50.0%, respectively, outperforming all baseline methods. Ablation studies further verify the effectiveness of each proposed component.
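The relevance-aware coupling described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual SGLA formulation, so everything below is an assumption made for illustration: the global term is taken to be an InfoNCE-style contrastive loss over image-text pairs, the relevance weight is the global cosine similarity rescaled to [0, 1], and the per-sample local object-entity losses are assumed to be precomputed. The names `cosine`, `info_nce`, and `coupled_alignment_loss` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def info_nce(sim_row, pos_idx, temperature=0.1):
    """Cross-entropy of the positive pair against all candidates in the row."""
    logits = [s / temperature for s in sim_row]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[pos_idx]


def coupled_alignment_loss(img_vecs, txt_vecs, local_losses, temperature=0.1):
    """Global contrastive loss plus relevance-weighted local losses.

    Assumptions (not taken from the paper):
      - img_vecs[i] and txt_vecs[i] are the global features of pair i.
      - local_losses[i] is a precomputed object-entity alignment loss for pair i.
      - The relevance weight is the pair's cosine similarity mapped to [0, 1].
    """
    n = len(img_vecs)
    sims = [[cosine(img_vecs[i], txt_vecs[j]) for j in range(n)] for i in range(n)]

    # Global term: each image should match its own text within the batch.
    global_loss = sum(info_nce(sims[i], i, temperature) for i in range(n)) / n

    # Relevance weights: weakly related image-text pairs contribute less
    # to the local term, which limits the effect of noisy local supervision.
    weights = [(sims[i][i] + 1.0) / 2.0 for i in range(n)]
    local_loss = sum(w * l for w, l in zip(weights, local_losses)) / n

    return global_loss + local_loss, weights
```

In this sketch, a pair whose image and text features point in opposite directions receives a weight near 0, so its local loss is almost ignored, while a well-matched pair keeps a weight near 1. A trained system would more plausibly learn or calibrate this weighting instead of using a fixed rescaling.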
| Original language | English |
|---|---|
| Article number | 266 |
| Journal | International Journal of Machine Learning and Cybernetics |
| Volume | 17 |
| Issue number | 6 |
| DOIs | |
| State | Published - Jun 2026 |
Keywords
- Multimodality
- Named entity recognition
- Social media
- Visual grounding
Title
Shared-private vision-to-text connector for grounded multimodal named entity recognition with synergistic global–local alignment