
OMNICORPUS: A UNIFIED MULTIMODAL CORPUS OF 10 BILLION-LEVEL IMAGES INTERLEAVED WITH TEXT

Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Pei Chu, Yi Wang, Min Dou, Changyao Tian, Xizhou Zhu, Lewei Lu, Yushi Chen, Junjun He, Tong Lu, Yali Wang, Limin Wang, Dahua Lin, Yu Qiao, Botian Shi, Conghui He*, Jifeng Dai*
*Corresponding author for this work
  • Shanghai Artificial Intelligence Laboratory
  • Harbin Institute of Technology
  • Nanjing University
  • Fudan University
  • Chinese University of Hong Kong
  • SenseTime Group Limited
  • Tsinghua University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-level open-source image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) is 15 times larger in scale while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, being easily degradable from an image-text interleaved format to a pure text corpus or to image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this provides a solid data foundation for future multimodal model research. Code and data are released at https://github.com/OpenGVLab/OmniCorpus.
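
The "degradable" property mentioned in the abstract can be illustrated with a minimal sketch. The schema below (a document as an ordered list of items, each holding either a text segment or an image URL) is a hypothetical stand-in and is not taken from the OmniCorpus release; the pairing rule (each image is matched with the next text segment that follows it) is likewise an assumption chosen for simplicity.

```python
from typing import Optional, TypedDict


class Item(TypedDict):
    # Hypothetical schema: exactly one of the two fields is set per item.
    text: Optional[str]
    image: Optional[str]


def to_pure_text(doc: list[Item]) -> str:
    """Drop images and join the remaining text segments in document order."""
    return "\n".join(item["text"] for item in doc if item["text"] is not None)


def to_image_text_pairs(doc: list[Item]) -> list[tuple[str, str]]:
    """Pair each image with the next text segment after it (assumed heuristic)."""
    pairs: list[tuple[str, str]] = []
    pending: list[str] = []  # images still waiting for a following text segment
    for item in doc:
        if item["image"] is not None:
            pending.append(item["image"])
        elif item["text"] is not None:
            pairs.extend((img, item["text"]) for img in pending)
            pending = []
    return pairs


# Illustrative document; the URL is a placeholder, not real data.
doc: list[Item] = [
    {"text": "A red panda rests on a branch.", "image": None},
    {"text": None, "image": "https://example.com/panda.jpg"},
    {"text": "Red pandas are mostly active at dusk.", "image": None},
]

print(to_pure_text(doc))
print(to_image_text_pairs(doc))
```

Images with no following text are simply dropped in this sketch; a real pipeline would likely use a more careful matching rule.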

Original language: English
Title of host publication: 13th International Conference on Learning Representations, ICLR 2025
Publisher: International Conference on Learning Representations, ICLR
Pages: 53080-53122
Number of pages: 43
ISBN (Electronic): 9798331320850
State: Published - 2025
Event: 13th International Conference on Learning Representations, ICLR 2025 - Singapore, Singapore
Duration: 24 Apr 2025 to 28 Apr 2025

Publication series

Name: 13th International Conference on Learning Representations, ICLR 2025

Conference

Conference: 13th International Conference on Learning Representations, ICLR 2025
Country/Territory: Singapore
City: Singapore
Period: 24/04/25 to 28/04/25

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being
