SMSMO: Learning to generate multimodal summary for scientific papers

  • Xinyi Zhong
  • Zusheng Tan
  • Shen Gao
  • Jing Li
  • Jiaxing Shen
  • Jingyu Ji
  • Jeff Tang
  • Billy Chiu*

*Corresponding author for this work
  • Lingnan University
  • University of Electronic Science and Technology of China
  • Harbin Institute of Technology Shenzhen
  • Hong Kong Polytechnic University

Research output: Contribution to journal › Article › peer-review

Abstract

Nowadays, publishers like Elsevier increasingly use graphical abstracts (i.e., pictorial paper summaries) alongside textual abstracts to facilitate the reading of scientific papers. In this setting, automatically identifying a representative image and generating a suitable textual summary for each paper can save editors and readers time, helping them read and understand papers. To address this, we introduce datasets for Scientific Multimodal Summarization with Multimodal Output (SMSMO). Unlike other multimodal tasks, which operate on generic, medium-length content (e.g., news), SMSMO must handle the longer multimodal content of papers, with finer-grained multimodality interactions and semantic alignments between images and text. To this end, we propose a cross-modality, multi-task learning summarizer (CMT-Sum). It captures the intra- and inter-modality interactions between images and text through a cross-fusion module, and models the finer-grained image–text semantic alignment by jointly generating the text summary, selecting the key image, and matching text to images. Extensive experiments on two newly introduced datasets for the SMSMO task showcase our model's effectiveness.
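The cross-fusion idea described above — each modality attending over itself (intra-modality) and over the other modality (inter-modality) — can be illustrated with a minimal scaled dot-product attention sketch. This is an assumption-laden toy, not the paper's actual CMT-Sum architecture; the function names, dimensions, and the use of single-head attention are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Scaled dot-product attention. Passing features from two different
    modalities gives inter-modality (cross) attention; passing the same
    matrix twice gives intra-modality self-attention."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

rng = np.random.default_rng(0)
text = rng.random((6, 16))   # toy: 6 sentence embeddings, 16-dim
imgs = rng.random((3, 16))   # toy: 3 image embeddings, 16-dim

text_intra = attend(text, text)   # text self-attention
text2img = attend(text, imgs)     # sentences attend over images
img2text = attend(imgs, text)     # images attend over sentences
```

In a multi-task setup like the one the abstract describes, fused features of this kind would feed three heads (summary generation, key-image selection, image–text matching) trained jointly; the per-task loss weighting is a design choice the abstract does not specify.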

Original language: English
Article number: 112908
Journal: Knowledge-Based Systems
Volume: 310
DOIs
State: Published - 15 Feb 2025
Externally published: Yes

Keywords

  • Cross-modality fusion
  • Multi-task
  • Multimodal scientific summarization
