DISTRIBUTED AUDIO-VISUAL PARSING BASED ON MULTIMODAL TRANSFORMER AND DEEP JOINT SOURCE CHANNEL CODING

  • Penghong Wang*
  • Jiahui Li
  • Mengyao Ma
  • Xiaopeng Fan*

*Corresponding author for this work

  • School of Computer Science and Technology, Harbin Institute of Technology
  • Huawei Technologies Co., Ltd.
  • PengCheng Lab

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Audio-visual parsing (AVP) is a recently emerged multimodal perception task that detects and classifies audio-visual events in video. However, most existing AVP networks use only a simple attention mechanism to guide audio-visual multimodal events, and are implemented at a single end. As a result, they cannot effectively capture the relationship between audio-visual events and are not suited to network transmission scenarios. In this paper, we address these problems and propose a distributed audio-visual parsing network (DAVPNet) based on a multimodal transformer and deep joint source channel coding (DJSCC). Multimodal transformers are used to enhance the attention calculation between audio-visual events, and DJSCC is used to apply distributed AVP (DAVP) tasks to network transmission scenarios. Finally, the Look, Listen, and Parse (LLP) dataset is used to evaluate the algorithm, and the experimental results show that DAVPNet has superior parsing performance.

Original language: English
Title of host publication: 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4623-4627
Number of pages: 5
ISBN (Electronic): 9781665405409
DOIs
State: Published - 2022
Externally published: Yes
Event: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022 - Hybrid, Singapore
Duration: 22 May 2022 to 27 May 2022

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2022-May
ISSN (Print): 1520-6149

Conference

Conference: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022
Country/Territory: Singapore
City: Hybrid
Period: 22/05/22 to 27/05/22

Keywords

  • deep joint source channel coding
  • distributed audio-visual parsing network
  • multimodal transformer
