TY - GEN
T1 - LayoutLMv2
T2 - Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021
AU - Xu, Yang
AU - Xu, Yiheng
AU - Lv, Tengchao
AU - Cui, Lei
AU - Wei, Furu
AU - Wang, Guoxin
AU - Lu, Yijuan
AU - Florencio, Dinei
AU - Zhang, Cha
AU - Che, Wanxiang
AU - Zhang, Min
AU - Zhou, Lidong
N1 - Publisher Copyright: © 2021 Association for Computational Linguistics
PY - 2021
Y1 - 2021
AB - Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which make it better capture the cross-modality interaction in the pre-training stage. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.8340 → 0.8520), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672). We made our model and code publicly available at https://aka.ms/layoutlmv2.
UR - https://www.scopus.com/pages/publications/85117510197
U2 - 10.18653/v1/2021.acl-long.201
DO - 10.18653/v1/2021.acl-long.201
M3 - Conference contribution
AN - SCOPUS:85117510197
T3 - ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference
SP - 2579
EP - 2591
BT - ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
Y2 - 1 August 2021 through 6 August 2021
ER -