Improved DeepFake Detection Using Whisper Features

Authors: Piotr Kawa, Marcin Plata, Michał Czuba, Piotr Szymański, Piotr Syga

Published: 2023-06-02 10:34:05+00:00

AI Summary

This paper investigates using the Whisper automatic speech recognition model as a front-end for audio deepfake detection. The authors combine Whisper features with well-established front-ends, train three detection models (LCNN, SpecRNet, and MesoNet), and report that Whisper-based features improve detection for each model. On the DeepFakes In-The-Wild dataset, the Equal Error Rate is 21% lower than recent results.

Abstract

With a recent influx of voice generation methods, the threat introduced by audio DeepFake (DF) is ever-increasing. Several different detection methods have been presented as a countermeasure. Many methods are based on so-called front-ends, which, by transforming the raw audio, emphasize features crucial for assessing the genuineness of the audio sample. Our contribution contains investigating the influence of the state-of-the-art Whisper automatic speech recognition model as a DF detection front-end. We compare various combinations of Whisper and well-established front-ends by training 3 detection models (LCNN, SpecRNet, and MesoNet) on a widely used ASVspoof 2021 DF dataset and later evaluating them on the DF In-The-Wild dataset. We show that using Whisper-based features improves the detection for each model and outperforms recent results on the In-The-Wild dataset by reducing Equal Error Rate by 21%.

Key findings

Using Whisper-based features improved deepfake detection for all three models (LCNN, SpecRNet, and MesoNet). The best configuration reached an Equal Error Rate (EER) of 0.2672 (26.72%) on the DeepFakes In-The-Wild dataset, a 21% reduction relative to recent results on that dataset. Fine-tuning the Whisper model improved performance further.
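
Because the headline metric is the EER, a minimal sketch of how it can be computed from detector scores may help. This is an illustration, not the authors' evaluation code: the function name `equal_error_rate` and the score convention (higher score means "more likely bona fide") are assumptions made for this example.

```python
def equal_error_rate(bona_fide_scores, spoof_scores):
    """Return (eer, threshold) for two lists of detector scores.

    Assumed convention: a higher score means "more likely bona fide".
    The EER is taken at the threshold where the false acceptance rate
    (spoofs accepted) and false rejection rate (bona fide rejected)
    are closest to each other.
    """
    thresholds = sorted(set(bona_fide_scores) | set(spoof_scores))
    best = None  # (gap, eer, threshold)
    for t in thresholds:
        # Bona fide samples scoring below the threshold are rejected.
        frr = sum(s < t for s in bona_fide_scores) / len(bona_fide_scores)
        # Spoofed samples scoring at or above the threshold are accepted.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores, invented purely for illustration.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

A lower EER is better: 0 means the two score distributions are perfectly separated, while 0.5 corresponds to chance for a binary detector.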

Approach

The authors use the encoder of the Whisper ASR model as a feature extractor for audio deepfake detection. They compare it with well-established front-ends (LFCC, MFCC), alone and in combination with Whisper features, when training the LCNN, SpecRNet, and MesoNet models. The best-performing configuration uses fine-tuned Whisper features together with MFCCs.
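
The summary does not say how two front-ends are combined. One plausible scheme, shown here only as a sketch, is to stack the two feature matrices along the feature axis after trimming them to a common number of frames. The function name `concat_front_ends`, the (features x frames) layout, and the toy values are all assumptions for illustration; real Whisper encoder outputs and MFCCs would come from their own libraries.

```python
def concat_front_ends(features_a, features_b):
    """Stack two feature matrices along the feature axis.

    Each matrix is a list of rows, one row per feature dimension and
    one column per time frame (an assumed layout). Both matrices are
    trimmed to the shorter frame count so the result is rectangular.
    """
    if not features_a or not features_b:
        raise ValueError("both front-ends must produce at least one feature row")
    n_frames = min(len(features_a[0]), len(features_b[0]))
    trimmed_a = [row[:n_frames] for row in features_a]
    trimmed_b = [row[:n_frames] for row in features_b]
    return trimmed_a + trimmed_b


# Toy stand-ins: 2 "Whisper-like" feature rows over 3 frames,
# and 1 "MFCC-like" feature row over 4 frames.
whisper_like = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
mfcc_like = [[7.0, 8.0, 9.0, 10.0]]
combined = concat_front_ends(whisper_like, mfcc_like)
print(len(combined), "feature rows x", len(combined[0]), "frames")
```

The combined matrix would then be fed to the detection model in place of a single front-end's output.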

Datasets

ASVspoof 2021 DF (training) and DeepFakes In-The-Wild (evaluation).

Model(s)

LCNN, SpecRNet, and MesoNet (the three detection models trained in the study), plus RawNet3. Whisper is used as a feature extractor (front-end), not as a detection model.

Author countries

Poland
