TY - GEN
T1 - Self-Relevance-Based Multimodal In-Context Learning for Multimodal Named Entity Recognition
AU - Zhang, Zhi
AU - Xu, Bing
AU - Yang, Muyun
AU - Cao, Hailong
AU - Zhu, Conghui
AU - Lu, Wenpeng
AU - Zhao, Tiejun
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Recently, Multimodal Named Entity Recognition (MNER) has attracted significant attention. Although MNER with in-context learning has shown improved performance, modality retrieval bias often diminishes the relevance of the retrieved in-context examples. To address this issue, we propose a self-relevance-based multimodal in-context learning method that mitigates modality retrieval bias by dynamically adjusting the weight of each modality. Specifically, we first measure the self-relevance of the query by calculating the similarity between its textual and visual modalities, which assesses how much the visual information contributes to the textual context. Then, we rank candidates by their similarity under each modality, adjust the image ranking according to the self-relevance to reduce modality retrieval bias, and integrate the rankings to select the k most relevant examples. Finally, we provide the task definition and the retrieved examples as effective guidance to Multimodal Large Language Models to obtain their feedback. Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance on two benchmark datasets.
AB - Recently, Multimodal Named Entity Recognition (MNER) has attracted significant attention. Although MNER with in-context learning has shown improved performance, modality retrieval bias often diminishes the relevance of the retrieved in-context examples. To address this issue, we propose a self-relevance-based multimodal in-context learning method that mitigates modality retrieval bias by dynamically adjusting the weight of each modality. Specifically, we first measure the self-relevance of the query by calculating the similarity between its textual and visual modalities, which assesses how much the visual information contributes to the textual context. Then, we rank candidates by their similarity under each modality, adjust the image ranking according to the self-relevance to reduce modality retrieval bias, and integrate the rankings to select the k most relevant examples. Finally, we provide the task definition and the retrieved examples as effective guidance to Multimodal Large Language Models to obtain their feedback. Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance on two benchmark datasets.
KW - Multimodal information extraction
KW - data mining
KW - in-context learning
KW - multimodal large language model
UR - https://www.scopus.com/pages/publications/105022632519
U2 - 10.1109/ICME59968.2025.11209093
DO - 10.1109/ICME59968.2025.11209093
M3 - Conference contribution
AN - SCOPUS:105022632519
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2025 IEEE International Conference on Multimedia and Expo
PB - IEEE Computer Society
T2 - 2025 IEEE International Conference on Multimedia and Expo, ICME 2025
Y2 - 30 June 2025 through 4 July 2025
ER -
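
For illustration, the self-relevance-based retrieval described in the abstract could look roughly like the sketch below. This is a minimal sketch under our own assumptions (the embedding inputs, the clipping of negative self-relevance, and the additive rank fusion are hypothetical choices), not the authors' implementation:

import numpy as np

def cosine(a, b):
    # Row-wise cosine similarity between two sets of embeddings.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def select_examples(q_text, q_image, cand_texts, cand_images, k=4):
    # q_text, q_image: (d,) query embeddings; cand_*: (n, d) candidate pools.
    # Self-relevance: how strongly the query image relates to the query text.
    s = cosine(q_text[None], q_image[None]).item()
    s = max(s, 0.0)  # assumption: treat negative similarity as zero relevance

    # Rank candidates separately per modality (rank 0 = most similar).
    text_rank = np.argsort(np.argsort(-cosine(q_text[None], cand_texts)[0]))
    img_rank = np.argsort(np.argsort(-cosine(q_image[None], cand_images)[0]))

    # Down-weight the image ranking when the image is weakly self-relevant,
    # then fuse both rankings and keep the k best candidates.
    fused = text_rank + s * img_rank
    return np.argsort(fused)[:k]

# Example: 100 candidates with 512-d embeddings (synthetic stand-ins).
rng = np.random.default_rng(0)
idx = select_examples(rng.normal(size=512), rng.normal(size=512),
                      rng.normal(size=(100, 512)), rng.normal(size=(100, 512)))

Under this fusion, a weakly self-relevant query image contributes little to the combined ranking, so the selected examples are driven mainly by textual similarity, which matches the bias-mitigation behavior the abstract describes.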