Audio-Visual Person-of-Interest DeepFake Detection

Authors: Davide Cozzolino, Alessandro Pianese, Matthias Nießner, Luisa Verdoliva

Published: 2022-04-06 20:51:40+00:00

AI Summary

This paper proposes POI-Forensics, an audio-visual deepfake detector that leverages contrastive learning to learn person-specific audio-visual features. By comparing embeddings of test videos to those of real videos of the person of interest, the system detects inconsistencies indicative of manipulation, achieving state-of-the-art performance, especially on low-quality videos.

Abstract

Face manipulation technology is advancing very rapidly, and new methods are being proposed day by day. The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world. Our key insight is that each person has specific characteristics that a synthetic generator likely cannot reproduce. Accordingly, we extract audio-visual features which characterize the identity of a person, and use them to create a person-of-interest (POI) deepfake detector. We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity. As a result, when the video and/or audio of a person is manipulated, its representation in the embedding space becomes inconsistent with the real identity, allowing reliable detection. Training is carried out exclusively on real talking-face video; thus, the detector does not depend on any specific manipulation method and yields the highest generalization ability. In addition, our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos. Experiments on a wide variety of datasets confirm that our method achieves state-of-the-art performance, especially on low-quality videos. Code is publicly available online at https://github.com/grip-unina/poi-forensics.


Key findings
POI-Forensics outperforms state-of-the-art methods, particularly on low-quality videos. The inclusion of audio significantly improves performance. The method shows robustness to some adversarial attacks.
Approach
The approach uses contrastive learning to generate audio and video embeddings for each person. At test time, it compares the embeddings of a test video segment to a reference set of real videos from the same person. Inconsistencies indicate a deepfake.
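The test-time comparison can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding dimensionality, the max-similarity aggregation, and the decision threshold are all assumptions made for clarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def poi_score(test_embedding, reference_embeddings):
    """Similarity of a test segment to the closest reference segment.

    reference_embeddings come from real videos of the person of
    interest; a low score means the test segment is inconsistent
    with the claimed identity.
    """
    return max(cosine(test_embedding, r) for r in reference_embeddings)

def is_fake(test_embedding, reference_embeddings, threshold=0.5):
    # threshold is a hypothetical value for illustration;
    # the paper calibrates its own decision rule.
    return poi_score(test_embedding, reference_embeddings) < threshold
```

In the actual system the embeddings are produced by the learned audio and video networks and scores are aggregated over multiple video segments; the sketch above only shows the core idea of flagging segments whose embeddings drift away from the real-identity reference set.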
Datasets
pDFDC, DF-TIMIT, FakeAVCelebV2, KoDF, VoxCeleb2, FaceForensics++
Model(s)
ResNet-50 with Group Normalization for both the audio and video networks.
Author countries
Italy, Germany