TY - GEN
T1 - FRADE
T2 - 32nd ACM International Conference on Multimedia, MM 2024
AU - Nie, Fan
AU - Ni, Jiangqun
AU - Zhang, Jian
AU - Zhang, Bin
AU - Zhang, Weizhe
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/10/28
Y1 - 2024/10/28
N2 - Nowadays, the abuse of AI-generated content (AIGC), especially the facial images known as deepfake, on social networks has raised severe security concerns, which might involve the manipulations of both visual and audio signals. For multimodal deepfake detection, previous methods usually exploit forgery-relevant knowledge to fully finetune Vision transformers (ViTs) and perform cross-modal interaction to expose the audio-visual inconsistencies. However, these approaches may undermine the prior knowledge of pretrained ViTs and ignore the domain gap between different modalities, resulting in unsatisfactory performance. To tackle these challenges, in this paper, we propose a new framework, i.e., Forgery-aware Audio-distilled Multimodal Learning (FRADE), for deepfake detection. In FRADE, the parameters of pretrained ViT are frozen to preserve its prior knowledge, while two well-devised learnable components, i.e., the Adaptive Forgery-aware Injection (AFI) and Audio-distilled Cross-modal Interaction (ACI), are leveraged to adapt forgery relevant knowledge. Specifically, AFI captures high-frequency discriminative features on both audio and visual signals and injects them into ViT via the self-attention layer. Meanwhile, ACI employs a set of latent tokens to distill audio information, which could bridge the domain gap between audio and visual modalities. The ACI is then used to well learn the inherent audio-visual relationships by cross-modal interaction. Extensive experiments demonstrate that the proposed framework could outperform other state-of-the-art multimodal deepfake detection methods under various circumstances.
AB - Nowadays, the abuse of AI-generated content (AIGC), especially the facial images known as deepfake, on social networks has raised severe security concerns, which might involve the manipulations of both visual and audio signals. For multimodal deepfake detection, previous methods usually exploit forgery-relevant knowledge to fully finetune Vision transformers (ViTs) and perform cross-modal interaction to expose the audio-visual inconsistencies. However, these approaches may undermine the prior knowledge of pretrained ViTs and ignore the domain gap between different modalities, resulting in unsatisfactory performance. To tackle these challenges, in this paper, we propose a new framework, i.e., Forgery-aware Audio-distilled Multimodal Learning (FRADE), for deepfake detection. In FRADE, the parameters of pretrained ViT are frozen to preserve its prior knowledge, while two well-devised learnable components, i.e., the Adaptive Forgery-aware Injection (AFI) and Audio-distilled Cross-modal Interaction (ACI), are leveraged to adapt forgery relevant knowledge. Specifically, AFI captures high-frequency discriminative features on both audio and visual signals and injects them into ViT via the self-attention layer. Meanwhile, ACI employs a set of latent tokens to distill audio information, which could bridge the domain gap between audio and visual modalities. The ACI is then used to well learn the inherent audio-visual relationships by cross-modal interaction. Extensive experiments demonstrate that the proposed framework could outperform other state-of-the-art multimodal deepfake detection methods under various circumstances.
KW - deepfake detection
KW - multimodal learning
KW - vision transformer
UR - https://www.scopus.com/pages/publications/85209796465
U2 - 10.1145/3664647.3681672
DO - 10.1145/3664647.3681672
M3 - 会议稿件
AN - SCOPUS:85209796465
T3 - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
SP - 6297
EP - 6306
BT - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 28 October 2024 through 1 November 2024
ER -