Abstract
Publishers such as Elsevier increasingly use graphical abstracts (i.e., pictorial paper summaries) alongside textual abstracts to facilitate the reading of scientific papers. In this setting, automatically identifying a representative image and generating a suitable textual summary for each paper can save editors and readers time and help them read and understand papers. To this end, we introduce the task of Scientific Multimodal Summarization with Multimodal Output (SMSMO). Unlike other multimodal summarization tasks, which operate on generic, medium-length content (e.g., news), SMSMO must handle the longer multimodal content of papers, with finer-grained cross-modal interactions and semantic alignments between images and text. We therefore propose a cross-modality, multi-task learning summarizer (CMT-Sum). It captures the intra- and inter-modality interactions between images and text through a cross-fusion module, and models the finer-grained image–text semantic alignment by jointly generating the textual summary, selecting the key image, and matching the text and image. Extensive experiments on two newly introduced datasets for the SMSMO task demonstrate our model's effectiveness.
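To make the described architecture concrete, the following is a minimal PyTorch sketch of how a cross-fusion module with the three joint objectives (summary generation, key-image selection, and image–text matching) could be wired together. All names (`CrossFusionBlock`, `MultiTaskSummarizer`), dimensions, and loss weights here are illustrative assumptions and not the authors' implementation.

```python
# Hedged sketch of a cross-modality, multi-task summarizer in the spirit of
# CMT-Sum. Module names, dimensions, and loss weights are assumptions.
import torch
import torch.nn as nn


class CrossFusionBlock(nn.Module):
    """Intra-modality (self-attention) and inter-modality (cross-attention)
    interactions between text and image features."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.text_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text2img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img2text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(d_model)
        self.norm_i = nn.LayerNorm(d_model)

    def forward(self, text, image):
        # Intra-modality interactions within each modality.
        t, _ = self.text_self(text, text, text)
        i, _ = self.img_self(image, image, image)
        # Inter-modality interactions: each modality attends to the other.
        t_cross, _ = self.text2img(t, i, i)
        i_cross, _ = self.img2text(i, t, t)
        return self.norm_t(t + t_cross), self.norm_i(i + i_cross)


class MultiTaskSummarizer(nn.Module):
    """Joint heads for summary generation, key-image selection, and
    image-text matching on top of the fused representations."""

    def __init__(self, vocab_size: int = 32000, d_model: int = 512):
        super().__init__()
        self.fusion = CrossFusionBlock(d_model)
        layer = nn.TransformerDecoderLayer(d_model, 8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)    # summary generation
        self.img_score = nn.Linear(d_model, 1)           # key-image selection
        self.match_head = nn.Linear(2 * d_model, 2)      # image-text matching

    def forward(self, text_feats, image_feats, summary_embeds):
        t, i = self.fusion(text_feats, image_feats)
        memory = torch.cat([t, i], dim=1)
        # Task 1: generate the textual summary conditioned on the fused memory.
        dec = self.decoder(summary_embeds, memory)
        gen_logits = self.lm_head(dec)
        # Task 2: score each candidate figure as the key image.
        img_logits = self.img_score(i).squeeze(-1)
        # Task 3: predict whether the pooled text and image representations align.
        pooled = torch.cat([t.mean(dim=1), i.mean(dim=1)], dim=-1)
        match_logits = self.match_head(pooled)
        return gen_logits, img_logits, match_logits


# Toy forward pass with random pre-extracted features (shapes are assumptions).
model = MultiTaskSummarizer()
text = torch.randn(2, 128, 512)    # 128 text-token features per paper
images = torch.randn(2, 6, 512)    # 6 candidate figure features per paper
summary = torch.randn(2, 32, 512)  # shifted summary token embeddings
gen, img, match = model(text, images, summary)
# Training would combine the three task losses, e.g.
# loss = l_gen + 0.5 * l_img + 0.5 * l_match (weights are assumptions).
```

The sketch assumes text tokens and candidate figures are already encoded into a shared feature space; in practice the cross-fusion block would be stacked and the decoder conditioned on tokenized inputs, but the three-head multi-task structure above mirrors the joint objectives named in the abstract.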
| Original language | English |
|---|---|
| Article number | 112908 |
| Journal | Knowledge-Based Systems |
| Volume | 310 |
| DOIs | |
| State | Published - 15 Feb 2025 |
| Externally published | Yes |
Keywords
- Cross-modality fusion
- Multi-task
- Multimodal scientific summarization