TY - GEN
T1 - Improving Sequential DeepFake Detection with Local Information Enhancement
AU - Dong, Longyun
AU - Xu, Yuanrong
AU - Zhong, Jianping
AU - Qi, Zhaobo
AU - Zhang, Weigang
N1 - Publisher Copyright: © 2024 Copyright held by the owner/author(s).
PY - 2024/12/28
Y1 - 2024/12/28
N2 - Existing Deepfake technology involves multi-step forgery to generate images. However, there are few sequential Deepfake detection methods available. To address this challenge, we propose a feature cross-fusion model that combines a Vision Transformer (ViT) and a Convolutional Neural Network (CNN), along with a novel data augmentation technique called Channel Random Erasing (CRE). The model first enhances its robustness by using CRE, which introduces controlled occlusions during training to simulate real-world manipulations. It then captures both the global and local features of images through multi-scale feature fusion. The ViT captures global contextual information via a self-attention mechanism, providing a strong global feature representation, while the CNN extracts local details through convolution operations, effectively capturing edges and texture information. Extensive experiments on the Seq-Deepfake benchmark demonstrate the effectiveness of this model, which achieves better performance than current state-of-the-art methods.
AB - Existing Deepfake technology involves multi-step forgery to generate images. However, there are few sequential Deepfake detection methods available. To address this challenge, we propose a feature cross-fusion model that combines a Vision Transformer (ViT) and a Convolutional Neural Network (CNN), along with a novel data augmentation technique called Channel Random Erasing (CRE). The model first enhances its robustness by using CRE, which introduces controlled occlusions during training to simulate real-world manipulations. It then captures both the global and local features of images through multi-scale feature fusion. The ViT captures global contextual information via a self-attention mechanism, providing a strong global feature representation, while the CNN extracts local details through convolution operations, effectively capturing edges and texture information. Extensive experiments on the Seq-Deepfake benchmark demonstrate the effectiveness of this model, which achieves better performance than current state-of-the-art methods.
KW - Channel Random Erasing
KW - Cross-attention
KW - Deepfake detection
KW - ViT
UR - https://www.scopus.com/pages/publications/85216179362
U2 - 10.1145/3696409.3700276
DO - 10.1145/3696409.3700276
M3 - Conference contribution
AN - SCOPUS:85216179362
T3 - Proceedings of the 6th ACM International Conference on Multimedia in Asia, MMAsia 2024
BT - Proceedings of the 6th ACM International Conference on Multimedia in Asia, MMAsia 2024
PB - Association for Computing Machinery, Inc
T2 - 6th ACM International Conference on Multimedia in Asia, MMAsia 2024
Y2 - 3 December 2024 through 6 December 2024
ER -