Abstract
Thanks to the remarkably expressive power for depicting structural data, Graph Convolutional Network (GCN) has been extensively adopted for skeleton-based action recognition in recent years. However, GCN is designed to operate on irregular graphs of skeletons, making it difficult to deal with other modalities represented on regular grids directly. Thus, although existing works have demonstrated the necessity of multi-modality fusion, few methods in the literature explore the fusion of skeleton and other modalities within a GCN architecture. In this paper, we present a novel GCN-based framework, termed GCN-based Multi-modality Fusion Network (GMFNet), to efficiently utilize complementary information in RGB and skeleton data. GMFNet is constructed by connecting a main stream with a GCN-based multi-modality fusion module (GMFM), whose goal is to gradually combine finer and coarse action-related information extracted from skeletons and RGB videos, respectively. Specifically, a cross-modality data mapping method is designed to transform an RGB video into a skeleton-like (SL) sequence, which is then integrated with the skeleton sequence under a gradual fusion scheme in GMFM. The fusion results are fed into the following main stream to extract more discriminative features and produce the final prediction. In addition, a spatio-temporal joint attention mechanism is introduced for more accurate action recognition. Compared to the multi-stream approaches, GMFNet can be implemented within an end-to-end training pipeline and thereby reduces the training complexity. Experimental results show the proposed GMFNet achieves impressive performance on two large-scale data sets of NTU RGB+D 60 and 120.
| Original language | English |
|---|---|
| Pages (from-to) | 1242-1253 |
| Number of pages | 12 |
| Journal | IEEE Transactions on Multimedia |
| Volume | 27 |
| DOIs | |
| State | Published - 2025 |
| Externally published | Yes |
Keywords
- Multi-modality Fusion
- action recognition
- graph convolutional network
- skeleton
Fingerprint
Dive into the research topics of 'GCN-Based Multi-Modality Fusion Network for Action Recognition'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver