Skip to main navigation Skip to search Skip to main content

FRADE: Forgery-aware Audio-distilled Multimodal Learning for Deepfake Detection

  • Fan Nie
  • , Jiangqun Ni
  • , Jian Zhang
  • , Bin Zhang*
  • , Weizhe Zhang
  • *Corresponding author for this work
  • Sun Yat-Sen University
  • Pengcheng Laboratory
  • Harbin Institute of Technology

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Nowadays, the abuse of AI-generated content (AIGC), especially the facial images known as deepfake, on social networks has raised severe security concerns, which might involve the manipulations of both visual and audio signals. For multimodal deepfake detection, previous methods usually exploit forgery-relevant knowledge to fully finetune Vision transformers (ViTs) and perform cross-modal interaction to expose the audio-visual inconsistencies. However, these approaches may undermine the prior knowledge of pretrained ViTs and ignore the domain gap between different modalities, resulting in unsatisfactory performance. To tackle these challenges, in this paper, we propose a new framework, i.e., Forgery-aware Audio-distilled Multimodal Learning (FRADE), for deepfake detection. In FRADE, the parameters of pretrained ViT are frozen to preserve its prior knowledge, while two well-devised learnable components, i.e., the Adaptive Forgery-aware Injection (AFI) and Audio-distilled Cross-modal Interaction (ACI), are leveraged to adapt forgery relevant knowledge. Specifically, AFI captures high-frequency discriminative features on both audio and visual signals and injects them into ViT via the self-attention layer. Meanwhile, ACI employs a set of latent tokens to distill audio information, which could bridge the domain gap between audio and visual modalities. The ACI is then used to well learn the inherent audio-visual relationships by cross-modal interaction. Extensive experiments demonstrate that the proposed framework could outperform other state-of-the-art multimodal deepfake detection methods under various circumstances.

Original languageEnglish
Title of host publicationMM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PublisherAssociation for Computing Machinery, Inc
Pages6297-6306
Number of pages10
ISBN (Electronic)9798400706868
DOIs
StatePublished - 28 Oct 2024
Externally publishedYes
Event32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia
Duration: 28 Oct 20241 Nov 2024

Publication series

NameMM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

Conference

Conference32nd ACM International Conference on Multimedia, MM 2024
Country/TerritoryAustralia
CityMelbourne
Period28/10/241/11/24

Keywords

  • deepfake detection
  • multimodal learning
  • vision transformer

Fingerprint

Dive into the research topics of 'FRADE: Forgery-aware Audio-distilled Multimodal Learning for Deepfake Detection'. Together they form a unique fingerprint.

Cite this