TY - GEN
T1 - Mask Again
T2 - 31st ACM International Conference on Multimedia, MM 2023
AU - Li, Xiaojie
AU - He, Shaowei
AU - Wu, Jianlong
AU - Yu, Yue
AU - Nie, Liqiang
AU - Zhang, Min
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023/10/27
Y1 - 2023/10/27
N2 - Masked video modeling has shown remarkable performance in downstream tasks by predicting masked video tokens from visible ones. However, training models from scratch on large-scale unlabeled data remains computationally challenging and time-consuming. Moreover, the commonly used random-based sampling techniques may lead to the selection of redundant or low-information regions, hindering the model from learning discriminative representations within the limited training epochs. To achieve efficient pre-training, we propose MaskAgain, an efficient feature-based knowledge distillation framework for masked video pre-training that facilitates knowledge transfer from a pre-trained teacher model to a student model. In contrast to previous approaches that align all visible token features with the teacher model at output layers, MaskAgain adopts a selective approach by masking visible tokens again at both the hidden and output layers of the transformer block. Attention mechanisms are utilized for informative feature selection. At the hidden level, attention maps generated by the transformer's multi-head attention structure are utilized to select crucial token information at both temporally-global and temporally-local levels. Additionally, at the output level, an activation-based attention map is generated using token features, enabling us to focus on important tokens while preserving feature similarity and the relationship matrix similarity between patches. Extensive experimental results show that MaskAgain achieves comparable or even better performance than existing methods on benchmark datasets with much fewer training epochs and much less memory, which demonstrates that MaskAgain allows for efficient pre-training of accurate video models, reducing computational resources and training time significantly. Code is released at https://github.com/xiaojieli0903/MaskAgain.
AB - Masked video modeling has shown remarkable performance in downstream tasks by predicting masked video tokens from visible ones. However, training models from scratch on large-scale unlabeled data remains computationally challenging and time-consuming. Moreover, the commonly used random-based sampling techniques may lead to the selection of redundant or low-information regions, hindering the model from learning discriminative representations within the limited training epochs. To achieve efficient pre-training, we propose MaskAgain, an efficient feature-based knowledge distillation framework for masked video pre-training that facilitates knowledge transfer from a pre-trained teacher model to a student model. In contrast to previous approaches that align all visible token features with the teacher model at output layers, MaskAgain adopts a selective approach by masking visible tokens again at both the hidden and output layers of the transformer block. Attention mechanisms are utilized for informative feature selection. At the hidden level, attention maps generated by the transformer's multi-head attention structure are utilized to select crucial token information at both temporally-global and temporally-local levels. Additionally, at the output level, an activation-based attention map is generated using token features, enabling us to focus on important tokens while preserving feature similarity and the relationship matrix similarity between patches. Extensive experimental results show that MaskAgain achieves comparable or even better performance than existing methods on benchmark datasets with much fewer training epochs and much less memory, which demonstrates that MaskAgain allows for efficient pre-training of accurate video models, reducing computational resources and training time significantly. Code is released at https://github.com/xiaojieli0903/MaskAgain.
KW - knowledge distillation
KW - masked visual modeling
KW - video representation learning
UR - https://www.scopus.com/pages/publications/85179558017
U2 - 10.1145/3581783.3612129
DO - 10.1145/3581783.3612129
M3 - 会议稿件
AN - SCOPUS:85179558017
T3 - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
SP - 2221
EP - 2232
BT - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 29 October 2023 through 3 November 2023
ER -