Statistics-aware Audio-visual Deepfake Detector

Authors: Marcella Astrid, Enjie Ghorbel, Djamila Aouada

Published: 2024-07-16 12:15:41+00:00

AI Summary

This paper proposes SADD, a statistics-aware audio-visual deepfake detector that improves upon existing methods by incorporating a statistical feature loss to enhance discrimination, using raw waveforms as the audio input, normalizing the fakeness score as a post-processing step, and employing a shallower network to reduce computational complexity.

Abstract

In this paper, we propose an enhanced audio-visual deepfake detection method. Recent methods in audio-visual deepfake detection mostly assess the synchronization between audio and visual features. Although they have shown promising results, they are based on the maximization/minimization of isolated feature distances without considering feature statistics. Moreover, they rely on cumbersome deep learning architectures and are heavily dependent on empirically fixed hyperparameters. Herein, to overcome these limitations, we propose: (1) a statistical feature loss to enhance the discrimination capability of the model, instead of relying solely on feature distances; (2) using the waveform to describe the audio, as a replacement for frequency-based representations; (3) a post-processing normalization of the fakeness score; (4) the use of a shallower network to reduce the computational complexity. Experiments on the DFDC and FakeAVCeleb datasets demonstrate the relevance of the proposed method.
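
The statistical feature loss is only described at a high level above. The snippet below is a minimal sketch of how such a loss could act on per-segment audio-visual feature distances, assuming a Fisher-style separation criterion between the real and fake distance distributions; the function name and formulation are illustrative assumptions, not the paper's exact loss.

```python
import torch

def statistics_aware_loss(dist_real, dist_fake, eps=1e-6):
    """Illustrative statistics-aware loss (an assumption, not necessarily the paper's formula).

    dist_real: 1-D tensor of audio-visual feature distances for the real samples in a batch.
    dist_fake: 1-D tensor of audio-visual feature distances for the fake samples in a batch.

    Rather than only pushing individual distances below/above a margin, this sketch
    encourages the whole distributions of real and fake distances to separate:
    it maximizes the gap between their means relative to their spread.
    """
    mu_real, mu_fake = dist_real.mean(), dist_fake.mean()
    var_real, var_fake = dist_real.var(unbiased=False), dist_fake.var(unbiased=False)
    # Fisher-style separation criterion: large mean gap, small within-class variance.
    separation = (mu_fake - mu_real) ** 2 / (var_real + var_fake + eps)
    return -separation  # minimizing the loss maximizes the separation


# Toy usage: fake segments drawn with larger audio-visual distances on average.
dist_real = torch.rand(16)
dist_fake = torch.rand(16) + 0.5
loss = statistics_aware_loss(dist_real, dist_fake)
```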


Key findings
Experiments on DFDC showed that SADD achieved a higher AUC than state-of-the-art methods while maintaining a lower computational cost. The statistics-aware loss significantly improved performance, particularly with limited training data. Cross-dataset evaluation on FakeAVCeleb showed improved generalization over the baseline, although the method still lagged behind state-of-the-art models on that dataset.
Approach
SADD enhances a previous audio-visual deepfake detection model, the Modality Dissonance Score (MDS) detector, by adding a statistical feature loss that improves the separation between the real and fake data distributions. It uses raw audio waveforms instead of frequency-based representations and a shallower network architecture for efficiency, and it applies a post-processing normalization to the fakeness score.
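
The post-processing normalization of the fakeness score is not detailed in this summary. The sketch below shows one plausible form, min-max scaling of raw per-video scores against reference statistics (e.g., collected on the training set); the function name and the choice of reference statistics are assumptions for illustration.

```python
import torch

def normalize_fakeness_scores(raw_scores, ref_min, ref_max, eps=1e-8):
    """Map raw fakeness scores (e.g., mean audio-visual distances) to [0, 1].

    ref_min / ref_max: reference minimum and maximum raw scores, for instance collected
    on the training set (an assumption; the paper's normalization scheme may differ).
    """
    scores = (raw_scores - ref_min) / (ref_max - ref_min + eps)
    return scores.clamp(0.0, 1.0)
```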
Datasets
DFDC and FakeAVCeleb datasets
Model(s)
An enhanced version of the Modality Dissonance Score (MDS) model with a shallower architecture and a novel statistical feature loss.
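As a reference point for the MDS-style scoring that SADD builds on, the sketch below computes a per-segment distance between audio and visual embeddings and averages it into a video-level fakeness score. The encoders (ShallowAudioEncoder, ShallowVisualEncoder) and video_fakeness_score are stand-ins operating on raw waveform and frame-clip inputs, not the architecture reported in the paper.

```python
import torch
import torch.nn as nn

class ShallowAudioEncoder(nn.Module):
    """Stand-in shallow 1-D CNN over raw waveform segments (illustrative only)."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=8), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, wav):          # wav: (num_segments, 1, samples)
        return self.fc(self.net(wav).squeeze(-1))

class ShallowVisualEncoder(nn.Module):
    """Stand-in shallow 3-D CNN over short RGB frame clips (illustrative only)."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, clip):         # clip: (num_segments, 3, frames, H, W)
        return self.fc(self.net(clip).flatten(1))

def video_fakeness_score(audio_segments, visual_segments, audio_enc, visual_enc):
    """Average per-segment audio-visual embedding distance as a raw fakeness score."""
    a = audio_enc(audio_segments)    # (num_segments, embed_dim)
    v = visual_enc(visual_segments)  # (num_segments, embed_dim)
    return torch.norm(a - v, dim=1).mean()

# Toy usage: 5 one-second segments from one video (sizes are placeholders).
wav_segments = torch.randn(5, 1, 16000)       # raw waveform segments
clip_segments = torch.randn(5, 3, 8, 64, 64)  # short RGB frame clips
score = video_fakeness_score(wav_segments, clip_segments,
                             ShallowAudioEncoder(), ShallowVisualEncoder())
```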
Author countries
Luxembourg, Tunisia