TY - GEN
T1 - Sample Less, Learn More
T2 - 31st ACM International Conference on Multimedia, MM 2023
AU - Cheng, Harry
AU - Guo, Yangyang
AU - Nie, Liqiang
AU - Cheng, Zhiyong
AU - Kankanhalli, Mohan
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023/10/27
Y1 - 2023/10/27
N2 - Training an effective video action recognition model poses significant computational challenges, particularly under limited resource budgets. Current methods primarily aim to either reduce model size or utilize pre-trained models, limiting their adaptability to various backbone architectures. This paper investigates the issue of over-sampled frames, a prevalent problem in many approaches yet it has received relatively little attention. Despite the use of fewer frames being a potential solution, this approach often results in a substantial decline in performance. To address this issue, we propose a novel method to restore the intermediate features for two sparsely sampled and adjacent video frames. This feature restoration technique brings a negligible increase in computational requirements compared to resource-intensive image encoders, such as ViT. To evaluate the effectiveness of our method, we conduct extensive experiments on four public datasets, including Kinetics-400, ActivityNet, UCF-101, and HMDB-51. With the integration of our method, the efficiency of three commonly used baselines has been improved by over 50%, with a mere 0.5% reduction in recognition accuracy. In addition, our method also surprisingly helps improve the generalization ability of the models under zero-shot settings.
AB - Training an effective video action recognition model poses significant computational challenges, particularly under limited resource budgets. Current methods primarily aim to either reduce model size or utilize pre-trained models, limiting their adaptability to various backbone architectures. This paper investigates the issue of over-sampled frames, a prevalent problem in many approaches yet it has received relatively little attention. Despite the use of fewer frames being a potential solution, this approach often results in a substantial decline in performance. To address this issue, we propose a novel method to restore the intermediate features for two sparsely sampled and adjacent video frames. This feature restoration technique brings a negligible increase in computational requirements compared to resource-intensive image encoders, such as ViT. To evaluate the effectiveness of our method, we conduct extensive experiments on four public datasets, including Kinetics-400, ActivityNet, UCF-101, and HMDB-51. With the integration of our method, the efficiency of three commonly used baselines has been improved by over 50%, with a mere 0.5% reduction in recognition accuracy. In addition, our method also surprisingly helps improve the generalization ability of the models under zero-shot settings.
KW - action recognition
KW - contrastive learning
KW - cross-modal learning
KW - frame restoration
UR - https://www.scopus.com/pages/publications/85179550748
U2 - 10.1145/3581783.3611696
DO - 10.1145/3581783.3611696
M3 - 会议稿件
AN - SCOPUS:85179550748
T3 - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
SP - 7101
EP - 7110
BT - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 29 October 2023 through 3 November 2023
ER -