Skip to main navigation Skip to search Skip to main content

CNVid-3.5M: Build, Filter, and Pre-Train the Large-Scale Public Chinese Video-Text Dataset

  • Tian Gan
  • , Qing Wang
  • , Xingning Dong
  • , Xiangyuan Ren
  • , Liqiang Nie
  • , Qingpei Guo*
  • *Corresponding author for this work
  • Shandong University
  • Ant Group
  • Harbin Institute of Technology Shenzhen

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Owing to well-designed large-scale video-text datasets, recent years have witnessed tremendous progress in video-text Pre-training. However, existing large-scale video-text datasets are mostly English-only. Though there are certain methods studying the Chinese video-text Pre-training, they pre-train their models on private datasets whose videos and text are unavailable. This lack of large-scale public datasets and benchmarks in Chinese hampers the research and downstream applications of Chinese video-text Pre-training. Towards this end, we release and benchmark CNVid-3.5M, a large-scale public cross-modal dataset containing over 3.5M Chinese video-text pairs. We summarize our contributions by three verbs, i.e., "Build", "Filter", and "Pretrain": 1) To build a public Chinese video-text dataset, we collect over 4.5M videos from the Chinese websites. 2) To improve the data quality, we propose a novel method to filter out 1M weakly-paired videos, resulting in the CNVid-3.5M dataset. And 3) we benchmark CNVid-3.5M with three mainstream pixel-level Pre-training architectures. At last, we propose the Hard Sample Curriculum Learning strategy to promote the Pre-training performance. To the best of our knowledge, CNVid-3.5M is the largest public video-text dataset in Chinese, and we provide the first pixel-level benchmarks for Chinese video-text Pre-training.

Original languageEnglish
Title of host publicationProceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
PublisherIEEE Computer Society
Pages14815-14824
Number of pages10
ISBN (Electronic)9798350301298
ISBN (Print)9798350301298
DOIs
StatePublished - 2023
Externally publishedYes
Event2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 - Vancouver, Canada
Duration: 18 Jun 202322 Jun 2023

Publication series

NameProceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume2023-June
ISSN (Print)1063-6919

Conference

Conference2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Country/TerritoryCanada
CityVancouver
Period18/06/2322/06/23

Keywords

  • Multi-modal learning

Fingerprint

Dive into the research topics of 'CNVid-3.5M: Build, Filter, and Pre-Train the Large-Scale Public Chinese Video-Text Dataset'. Together they form a unique fingerprint.

Cite this