DuReadervis: A Chinese Dataset for Open-domain Document Visual Question Answering

  • Le Qi
  • Shangwen Lv
  • Hongyu Li
  • Jing Liu
  • Yu Zhang*
  • Qiaoqiao She
  • Hua Wu
  • Haifeng Wang
  • Ting Liu

*Corresponding author for this work

  • Harbin Institute of Technology
  • Baidu Inc

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Open-domain question answering has been used in a wide range of applications, such as web search and enterprise search, which usually takes clean texts extracted from various formats of documents (e.g., web pages, PDFs, or Word documents) as the information source. However, designing different text extraction approaches is time-consuming and not scalable. In order to reduce human cost and improve the scalability of QA systems, we propose and study an Open-domain Document Visual Question Answering (Open-domain DocVQA) task, which requires answering questions based on a collection of document images directly instead of only document texts, utilizing layouts and visual features additionally. To advance this task, we introduce the first Chinese Open-domain DocVQA dataset called DuReadervis, containing about 15K question-answering pairs and 158K document images from the Baidu search engine. There are three main challenges in DuReadervis: (1) long document understanding, (2) noisy texts, and (3) multi-span answer extraction. The extensive experiments demonstrate that the dataset is challenging. Additionally, we propose a simple approach that incorporates the layout and visual features, and the experimental results show the effectiveness of the proposed approach. The dataset and code will be publicly available at https://github.com/baidu/DuReader/tree/master/DuReader-vis.

Original language: English
Title of host publication: ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Findings of ACL 2022
Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Publisher: Association for Computational Linguistics (ACL)
Pages: 1338-1351
Number of pages: 14
ISBN (Electronic): 9781955917254
State: Published - 2022
Event: Findings of the Association for Computational Linguistics: ACL 2022 - Dublin, Ireland
Duration: 22 May 2022 to 27 May 2022

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print): 0736-587X

Conference

Conference: Findings of the Association for Computational Linguistics: ACL 2022
Country/Territory: Ireland
City: Dublin
Period: 22/05/22 to 27/05/22
