LayoutLMv2: Multi-modal pre-training for visually-rich document understanding

  • Yang Xu*
  • Yiheng Xu*
  • Tengchao Lv*
  • Lei Cui
  • Furu Wei
  • Guoxin Wang
  • Yijuan Lu
  • Dinei Florencio
  • Cha Zhang
  • Wanxiang Che
  • Min Zhang
  • Lidong Zhou
  • *Corresponding author for this work
  • Harbin Institute of Technology
  • Microsoft USA
  • Microsoft Azure AI
  • Soochow University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which make it better capture the cross-modality interaction in the pre-training stage. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.8340 → 0.8520), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672). We made our model and code publicly available at https://aka.ms/layoutlmv2.

Original language: English
Title of host publication: ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 2579-2591
Number of pages: 13
ISBN (Electronic): 9781954085527
DOIs
State: Published - 2021
Event: Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021 - Virtual, Online
Duration: 1 Aug 2021 → 6 Aug 2021

Publication series

Name: ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference
Volume: 1

Conference

Conference: Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021
City: Virtual, Online
Period: 1/08/21 → 6/08/21
