
Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning

  • Wenrui Li
  • Xi-Le Zhao
  • Zhengyu Ma*
  • Xingtao Wang
  • Xiaopeng Fan*
  • Yonghong Tian
  • *Corresponding author for this work
  • Harbin Institute of Technology
  • University of Electronic Science and Technology of China
  • Peng Cheng Laboratory
  • Peking University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Audio-visual zero-shot learning (ZSL) has attracted broad attention, as it can classify video data from classes that are not observed during training. However, most existing methods suffer from background scene bias and capture fewer motion details, because they employ a single-stream network that processes scene and motion information as a unified entity. In this paper, we address this challenge by proposing a novel dual-stream architecture, the Motion-Decoupled Spiking Transformer (MDFT), which explicitly decouples contextual semantic information from highly sparse dynamic motion information. Specifically, the Recurrent Joint Learning Unit (RJLU) extracts contextual semantic information effectively and understands the environment in which actions occur by capturing joint knowledge across modalities. By converting RGB images to events, our approach captures motion information effectively while mitigating the influence of background scene bias, leading to more accurate classification. We exploit the inherent strengths of Spiking Neural Networks (SNNs) to process highly sparse event data efficiently. Additionally, we introduce a Discrepancy Analysis Block (DAB) to model audio motion features. To improve the efficiency with which the SNN extracts dynamic temporal and motion information, we dynamically adjust the threshold of its Leaky Integrate-and-Fire (LIF) neurons based on statistical cues from the global motion and contextual semantic information. Our experiments demonstrate the effectiveness of MDFT, which consistently outperforms state-of-the-art methods on mainstream benchmarks. Moreover, we find that motion information serves as a powerful regularizer for video networks: using it improves HM and ZSL accuracy by 19.1% and 38.4%, respectively.
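Two mechanisms in the abstract lend themselves to a concrete illustration: converting RGB frames to event-like data, and an LIF neuron whose firing threshold is modulated by a global statistic. Below is a minimal PyTorch sketch of both under stated assumptions; the names (`frames_to_events`, `DynamicThresholdLIF`), the frame-difference event simulation, and the tanh-based threshold modulation are illustrative stand-ins, not the paper's actual formulation.

```python
import torch
import torch.nn as nn


def frames_to_events(frames: torch.Tensor, thresh: float = 0.1) -> torch.Tensor:
    # Crude event simulation (hypothetical stand-in for the paper's
    # RGB-to-event conversion): signed inter-frame intensity changes
    # that exceed a contrast threshold. Input shape: (T, ...).
    diff = frames[1:] - frames[:-1]
    return torch.sign(diff) * (diff.abs() > thresh).float()


class DynamicThresholdLIF(nn.Module):
    """Leaky Integrate-and-Fire neuron whose firing threshold is
    modulated by a global statistic of its input. This is only a
    sketch of the dynamic-threshold idea; the modulation rule and
    hyperparameters here are assumptions."""

    def __init__(self, tau: float = 2.0, base_threshold: float = 1.0,
                 alpha: float = 0.5):
        super().__init__()
        self.tau = tau                      # membrane time constant
        self.base_threshold = base_threshold
        self.alpha = alpha                  # modulation strength

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, B, C) input currents over T time steps.
        mem = torch.zeros_like(x[0])        # membrane potential
        spikes = []
        # Hypothetical cue: mean absolute input stands in for the
        # "statistical cues of global motion and context".
        cue = x.abs().mean()
        threshold = self.base_threshold * (1.0 + self.alpha * torch.tanh(cue))
        for t in range(x.shape[0]):
            mem = mem + (x[t] - mem) / self.tau   # leaky integration
            spk = (mem >= threshold).float()      # fire above threshold
            mem = mem * (1.0 - spk)               # hard reset after a spike
            spikes.append(spk)
        # Note: the hard threshold is non-differentiable; training an
        # SNN typically uses a surrogate gradient, omitted here.
        return torch.stack(spikes)          # (T, B, C) binary spike train


# Usage: 8 time steps, batch of 4, 16 channels of event features.
out = DynamicThresholdLIF()(torch.randn(8, 4, 16))
```

A higher cue value raises the threshold, so the neuron fires more selectively when the global motion statistics are strong; this is one plausible reading of "dynamically adjust the threshold", chosen for simplicity.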

Original language: English
Title of host publication: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 3994-4002
Number of pages: 9
ISBN (Electronic): 9798400701085
DOIs
State: Published - 27 Oct 2023
Event: 31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada
Duration: 29 Oct 2023 – 3 Nov 2023

Publication series

Name: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

Conference

Conference: 31st ACM International Conference on Multimedia, MM 2023
Country/Territory: Canada
City: Ottawa
Period: 29/10/23 – 3/11/23

Keywords

  • audio-visual zero-shot learning
  • spiking neural network
