Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization

Authors: Nicholas Klein, Hemlata Tak, James Fullwood, Krishna Regmi, Leonidas Spinoulas, Ganesh Sivaraman, Tianxiang Chen, Elie Khoury

Published: 2025-08-11 16:14:17+00:00

AI Summary

This paper presents methods for deepfake video classification and localization, submitted to the ACM 1M Deepfakes Detection Challenge. The approach achieved the best performance in the temporal localization task and a top-four ranking in the classification task for the TestA split.

Abstract

The field of visual and audio generation is burgeoning with new state-of-the-art methods. This rapid proliferation of new techniques underscores the need for robust solutions for detecting synthetic content in videos. In particular, when fine-grained alterations via localized manipulations are performed in the visual domain, the audio domain, or both, these subtle modifications pose additional challenges for detection algorithms. This paper presents solutions for the problems of deepfake video classification and localization. The methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task and a top-four ranking in the classification task for the TestA split of the evaluation dataset.


Key findings
On the TestA split, the fused model achieved a top-four ranking in the classification task (AUC of 92.49%) and the best performance in the temporal localization task (score of 67.20%). Among the individual localization models, the ResNet model showed strong boundary prediction, while LipForensics excelled in recall but predicted boundaries less accurately.
Approach
The authors propose an ensemble of specialized networks targeting audio and visual manipulations independently for both classification and localization tasks. For localization, they adapt an ActionFormer-inspired training paradigm with frame-wise classification and boundary regression heads. Model outputs are fused using score-level polynomial logistic regression (classification) and Soft-NMS (localization).
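The localization fusion step can be illustrated with a minimal sketch of Soft-NMS applied to one-dimensional temporal segments. This is not the authors' implementation: the Gaussian decay variant, the function names (temporal_iou, soft_nms_1d), the sigma and min_score values, and the example proposals are all assumptions chosen for illustration. The idea shown is that overlapping proposals pooled from several models are down-weighted according to their overlap with a higher-scoring segment rather than discarded outright.

```python
import math


def temporal_iou(a, b):
    """Intersection-over-union of two (start, end, ...) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms_1d(segments, sigma=0.5, min_score=0.001):
    """Gaussian Soft-NMS over (start, end, score) proposals.

    sigma and min_score are illustrative defaults, not values from the paper.
    Returns the proposals in selection order with their decayed scores.
    """
    pool = [list(s) for s in segments]  # mutable copies so scores can decay
    kept = []
    while pool:
        # Select the highest-scoring remaining proposal.
        best = max(range(len(pool)), key=lambda i: pool[i][2])
        top = pool.pop(best)
        kept.append(tuple(top))
        survivors = []
        for seg in pool:
            # Decay each remaining score by its overlap with the selected one.
            iou = temporal_iou(top, seg)
            seg[2] *= math.exp(-(iou ** 2) / sigma)
            if seg[2] >= min_score:
                survivors.append(seg)
        pool = survivors
    return kept


# Hypothetical proposals pooled from two localization models.
proposals = [(1.0, 2.0, 0.9), (1.1, 2.1, 0.8), (5.0, 6.0, 0.7)]
fused = soft_nms_1d(proposals)
for start, end, score in fused:
    print(f"{start:.1f}-{end:.1f}s  score={score:.3f}")
```

In this toy input, the second proposal overlaps heavily with the first, so its score is reduced and it drops below the non-overlapping third proposal, but it is still retained. Hard NMS would have removed it entirely, which is the usual motivation for the soft variant when boundaries from different models disagree slightly.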
Datasets
AV-Deepfake1M++ dataset
Model(s)
ResNet-152, multi-resolution gMLP with Wav2Vec 2.0, LipForensics (MS-TCN), LSTM
Author countries
USA