
Shared-private vision-to-text connector for grounded multimodal named entity recognition with synergistic global–local alignment

  • Zihao Zheng
  • Lei Chen*
  • Chen Zhao
  • Dandan Tu
  • Ming Liu
  • Bing Qin
  • *Corresponding author for this work
  • Harbin Institute of Technology
  • Beijing Normal University
  • Huawei Technologies Co., Ltd.
  • Peng Cheng Laboratory

Research output: Contribution to journal › Article › peer-review

Abstract

Grounded multimodal named entity recognition (GMNER) aims to identify named entities and their visual correspondences in multimodal content. Despite recent advances, existing approaches face two limitations: (1) insufficient bridging of the hierarchical global-local modality gap between images and text, and (2) inadequate modeling of visual content, particularly in distinguishing inter-object relationships (shared information) from distinctive object characteristics (private information). These issues hinder both accurate fine-grained entity recognition and reliable visual grounding, the two core objectives of the GMNER task. To address these limitations, we propose a Vision-to-Text Connector (VTC) framework that projects visual features into the text feature space. Our framework introduces two key components. First, a Shared-Private Connector disentangles visual semantics by combining a graph-based branch, which captures shared contextual dependencies, with a mixture-of-experts branch, which extracts discriminative cues. Second, to address the hierarchical modality gap, we design a Synergistic Global–Local Alignment (SGLA) objective that jointly aligns representations of global image-text pairs and local object-entity sets. This objective integrates a relevance-aware coupling strategy, which uses global image-text semantic consistency to adaptively weight the local alignment term, thereby alleviating noise from weak, sparse supervision. Experimental results on two public benchmarks show that our framework achieves F1 scores of 59.41% and 50.0%, respectively, outperforming all baseline methods. Ablation studies further verify the effectiveness of each proposed component.
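
The abstract does not give the form of the SGLA objective, so the following is only an illustrative sketch of the general idea of relevance-aware coupling: a global image-text contrastive term plus a per-sample local object-entity term, where each local term is scaled by how well that sample's image and text agree. Everything here is an assumption made for illustration and not the authors' formulation: the function names (`global_loss`, `local_loss`, `relevance_weight`, `sgla_loss`), the InfoNCE-style losses, the temperature value, and the choice of mapping cosine similarity to a weight in [0, 1].

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def info_nce(sim_row, target, temperature=0.1):
    """Cross-entropy of the target index under a softmax over one similarity row."""
    logits = [s / temperature for s in sim_row]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def global_loss(image_embs, text_embs, temperature=0.1):
    """Image-to-text contrastive loss over a batch; pair k matches pair k."""
    sims = [[cosine(i, t) for t in text_embs] for i in image_embs]
    return sum(info_nce(sims[k], k, temperature) for k in range(len(sims))) / len(sims)


def local_loss(object_embs, entity_embs, pairs, temperature=0.1):
    """Entity-to-object contrastive loss for one sample.

    pairs: list of (entity_index, object_index) weak labels; may be empty.
    """
    if not pairs:
        return 0.0
    total = 0.0
    for entity_idx, object_idx in pairs:
        row = [cosine(entity_embs[entity_idx], obj) for obj in object_embs]
        total += info_nce(row, object_idx, temperature)
    return total / len(pairs)


def relevance_weight(image_emb, text_emb):
    """Assumed coupling: map global cosine similarity from [-1, 1] to [0, 1]."""
    return 0.5 * (1.0 + cosine(image_emb, text_emb))


def sgla_loss(batch, temperature=0.1):
    """Global term plus relevance-weighted local terms, averaged over the batch."""
    images = [sample["image"] for sample in batch]
    texts = [sample["text"] for sample in batch]
    weighted_local = sum(
        relevance_weight(sample["image"], sample["text"])
        * local_loss(sample["objects"], sample["entities"], sample["pairs"], temperature)
        for sample in batch
    ) / len(batch)
    return global_loss(images, texts, temperature) + weighted_local


if __name__ == "__main__":
    # Toy 2-D embeddings, invented purely to exercise the functions.
    toy_batch = [
        {"image": [1.0, 0.0], "text": [0.9, 0.1],
         "objects": [[1.0, 0.0], [0.0, 1.0]], "entities": [[0.8, 0.2]],
         "pairs": [(0, 0)]},
        {"image": [0.0, 1.0], "text": [0.1, 0.9],
         "objects": [[0.0, 1.0]], "entities": [[0.0, 1.0]],
         "pairs": []},
    ]
    print(sgla_loss(toy_batch))
```

In this sketch, a sample whose image and text are only weakly related contributes less to the local term, which is one plausible way to limit the influence of noisy object-entity labels. A real implementation would operate on batched tensors and would likely detach the weight from the gradient, but those details are not stated in the abstract.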

Original language: English
Article number: 266
Journal: International Journal of Machine Learning and Cybernetics
Volume: 17
Issue number: 6
State: Published - Jun 2026

Keywords

  • Multimodality
  • Named entity recognition
  • Social media
  • Visual grounding
