
Three-stream CNNs for action recognition

  • Liangliang Wang*
  • Lianzheng Ge
  • Ruifeng Li
  • Yajun Fang
  • *Corresponding author for this work
  • Harbin Institute of Technology
  • Massachusetts Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Existing methods for action recognition based on Convolutional Neural Networks (CNNs) are either spatially or temporally local, whereas actions are 3D signals. In this paper, we propose a global spatial-temporal three-stream CNN architecture for action feature extraction. Specifically, the three-stream CNNs comprise spatial, local temporal, and global temporal streams, generated respectively by deep learning on single frames, optical flow, and global accumulated motion features, the last expressed in a new formulation named Motion Stacked Difference Image (MSDI). Moreover, a novel soft Vector of Locally Aggregated Descriptors (soft-VLAD) is developed to further represent the extracted features; it combines the advantages of Gaussian Mixture Models (GMMs) and VLAD by encoding data according to their overall probability distribution and their differences with respect to the cluster centers. To deal with the shortage of training samples during learning, we introduce a data augmentation scheme that is very efficient because it is based on cropping across videos. We conduct experiments on the UCF101 and HMDB51 datasets, and the results demonstrate the effectiveness of our approach.
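The soft-VLAD idea in the abstract (weighting each descriptor's difference from the cluster centers by its probability under a GMM) can be illustrated with a minimal sketch. This is not the paper's formulation, which the abstract does not give: the function names `gmm_posteriors` and `soft_vlad`, the diagonal-covariance GMM, and the signed-square-root and L2 normalization steps are all assumptions chosen for illustration.

```python
import numpy as np


def gmm_posteriors(X, means, variances, weights):
    """Posterior p(k | x) under a diagonal-covariance GMM (an assumed model).

    X: (N, D) descriptors, means: (K, D), variances: (K, D), weights: (K,)
    """
    diff = X[:, None, :] - means[None, :, :]               # (N, K, D)
    # Log-density of each descriptor under each Gaussian component.
    log_prob = -0.5 * np.sum(
        diff ** 2 / variances[None] + np.log(2.0 * np.pi * variances[None]),
        axis=2,
    )                                                      # (N, K)
    log_joint = log_prob + np.log(weights)[None, :]
    log_joint -= log_joint.max(axis=1, keepdims=True)      # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)


def soft_vlad(X, means, variances, weights):
    """Hypothetical soft-assignment VLAD: posterior-weighted residual sums."""
    post = gmm_posteriors(X, means, variances, weights)    # (N, K)
    diff = X[:, None, :] - means[None, :, :]               # (N, K, D)
    # For each component, sum the residuals weighted by the soft assignment.
    enc = np.einsum("nk,nkd->kd", post, diff).ravel()      # (K * D,)
    # Assumed post-processing: signed square root, then L2 normalization.
    enc = np.sign(enc) * np.sqrt(np.abs(enc))
    norm = np.linalg.norm(enc)
    return enc / norm if norm > 0 else enc
```

In this sketch the GMM parameters are passed in directly; in practice they might come from a mixture fitted to training descriptors, for example scikit-learn's `GaussianMixture` with `covariance_type="diag"`. Replacing the posteriors with a one-hot nearest-center assignment would reduce the encoding to a standard hard-assignment VLAD.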

Original language: English
Pages (from-to): 33-40
Number of pages: 8
Journal: Pattern Recognition Letters
Volume: 92
State: Published - 1 Jun 2017

Keywords

  • Action recognition
  • Soft Vector of Locally Aggregated Descriptors
  • Support Vector Machines
  • Three-stream convolutional neural networks
