GCN-Based Multi-Modality Fusion Network for Action Recognition

  • Shaocan Liu
  • Xingtao Wang*
  • Ruiqin Xiong
  • Xiaopeng Fan
  • *Corresponding author for this work
  • School of Computer Science and Technology, Harbin Institute of Technology
  • Peking University
  • Peng Cheng Laboratory

Research output: Contribution to journal, Article, peer-review

Abstract

Thanks to its remarkable expressive power for modeling structured data, the Graph Convolutional Network (GCN) has been extensively adopted for skeleton-based action recognition in recent years. However, GCN is designed to operate on irregular skeleton graphs, which makes it difficult to apply directly to other modalities represented on regular grids. Thus, although existing works have demonstrated the necessity of multi-modality fusion, few methods in the literature explore the fusion of skeleton and other modalities within a GCN architecture. In this paper, we present a novel GCN-based framework, termed GCN-based Multi-modality Fusion Network (GMFNet), to efficiently utilize the complementary information in RGB and skeleton data. GMFNet is constructed by connecting a main stream with a GCN-based multi-modality fusion module (GMFM), whose goal is to gradually combine the fine and coarse action-related information extracted from skeletons and RGB videos, respectively. Specifically, a cross-modality data mapping method is designed to transform an RGB video into a skeleton-like (SL) sequence, which is then integrated with the skeleton sequence under a gradual fusion scheme in GMFM. The fusion results are fed into the subsequent main stream to extract more discriminative features and produce the final prediction. In addition, a spatio-temporal joint attention mechanism is introduced for more accurate action recognition. Compared with multi-stream approaches, GMFNet can be trained end to end within a single pipeline, which reduces training complexity. Experimental results show that the proposed GMFNet achieves strong performance on two large-scale datasets, NTU RGB+D 60 and NTU RGB+D 120.
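
To make the idea of fusing a grid modality with a skeleton graph more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a toy 5-joint skeleton, a hypothetical mapping (`sample_skeleton_like`) that averages RGB values in a small window around each projected joint to form a skeleton-like sequence, and plain channel concatenation in place of the gradual fusion scheme the abstract describes. All function names, shapes, and the edge list are illustrative assumptions.

```python
import numpy as np


def normalized_adjacency(edges, num_joints):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected skeleton graph."""
    a = np.eye(num_joints)  # self-loops keep each joint's own features
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def sample_skeleton_like(frames, joints_2d, patch=1):
    """Hypothetical RGB-to-graph mapping (an assumption, not the paper's method).

    frames:    (T, H, W, 3) video
    joints_2d: (T, V, 2) joint positions in pixel coordinates (x, y)
    returns:   (T, V, 3) mean RGB value in a window around each joint
    """
    t_len, h, w, _ = frames.shape
    num_joints = joints_2d.shape[1]
    out = np.zeros((t_len, num_joints, 3))
    for t in range(t_len):
        for k in range(num_joints):
            x, y = np.round(joints_2d[t, k]).astype(int)
            x = int(np.clip(x, 0, w - 1))
            y = int(np.clip(y, 0, h - 1))
            y0, y1 = max(y - patch, 0), min(y + patch + 1, h)
            x0, x1 = max(x - patch, 0), min(x + patch + 1, w)
            out[t, k] = frames[t, y0:y1, x0:x1].mean(axis=(0, 1))
    return out


def graph_conv(x, a_hat, weight):
    """Spatial graph convolution relu(A_hat X W), applied to every frame.

    x: (T, V, C_in), a_hat: (V, V), weight: (C_in, C_out)
    """
    return np.maximum(np.einsum("uv,tvc,cd->tud", a_hat, x, weight), 0.0)


rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (1, 3), (1, 4)]  # toy skeleton, illustrative only
T, V = 4, 5
skeleton = rng.normal(size=(T, V, 3))            # 3D joint coordinates
frames = rng.uniform(size=(T, 32, 32, 3))        # synthetic RGB video
joints_2d = rng.uniform(0, 31, size=(T, V, 2))   # synthetic 2D projections

a_hat = normalized_adjacency(edges, V)
sl_sequence = sample_skeleton_like(frames, joints_2d)
fused = np.concatenate([skeleton, sl_sequence], axis=-1)  # simple concat fusion
weight = rng.normal(size=(fused.shape[-1], 8))
features = graph_conv(fused, a_hat, weight)
print(features.shape)  # (4, 5, 8)
```

Once the RGB data is indexed by joints rather than pixels, both modalities share the same adjacency matrix, so a single graph convolution can process them together. That shared structure is the property a single-pipeline GCN design relies on.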

Original language: English
Pages (from-to): 1242-1253
Number of pages: 12
Journal: IEEE Transactions on Multimedia
Volume: 27
State: Published - 2025
Externally published: Yes

Keywords

  • Multi-modality Fusion
  • action recognition
  • graph convolutional network
  • skeleton
