LENS-DF: Deepfake Detection and Temporal Localization for Long-Form Noisy Speech

Authors: Xuechen Liu, Wanying Ge, Xin Wang, Junichi Yamagishi

Published: 2025-07-22 04:31:13+00:00

Comment: Accepted by IEEE International Joint Conference on Biometrics (IJCB) 2025, Osaka, Japan

AI Summary

This study introduces LENS-DF, a novel recipe for training and evaluating audio deepfake detection and temporal localization under realistic conditions, including longer duration, noisy environments, and multiple speakers. Models trained using data generated with LENS-DF consistently outperform those trained with conventional recipes, demonstrating its effectiveness for robust audio deepfake detection and localization.

Abstract

This study introduces LENS-DF, a novel and comprehensive recipe for training and evaluating audio deepfake detection and temporal localization under complicated and realistic audio conditions. The generation part of the recipe outputs audio from the input dataset with several critical characteristics, such as longer duration, noisy conditions, and multiple speakers, in a controllable fashion. We conduct experiments with the corresponding detection and localization protocol, using models based on a self-supervised learning front-end and a simple back-end. The results indicate that models trained using data generated with LENS-DF consistently outperform those trained via conventional recipes, demonstrating the effectiveness and usefulness of LENS-DF for robust audio deepfake detection and localization. We also conduct ablation studies on the variations introduced, investigating their impact on and relevance to realistic challenges in the field.


Key findings
Models trained on conventional clean, short speech datasets perform poorly on long-form, noisy, multi-speaker deepfake audio. Training with LENS-DF-generated data significantly enhances detection and localization performance in complex, realistic conditions. Despite improvements, robust temporal localization in these challenging scenarios still requires further research and tailored strategies.
Approach
LENS-DF employs a comprehensive data generation pipeline to create realistic deepfake audio datasets from existing ones (e.g., ASVspoof 2019 LA), incorporating controlled noise, multi-speaker scenarios, and concatenation to form longer samples. The detection and localization protocol utilizes pre-trained self-supervised learning (SSL) models (Wav2Vec 2.0 variants) as front-ends, fine-tuned with a simple global average pooling and fully connected layer back-end.
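The generation steps described above (concatenating short utterances into long-form samples, tracking per-segment bona fide/spoof labels for localization, and mixing in noise at a controlled level) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual pipeline: the function names, the SNR-based mixing rule, and the sample-level label encoding are all assumptions for illustration.

```python
import numpy as np

def mix_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale noise to a target SNR (dB) relative to speech, then add it.
    Hypothetical helper; LENS-DF's actual MUSAN-based noise recipe may differ."""
    # Tile/trim the noise clip to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def make_long_sample(utterances, labels, noise, snr_db=10.0):
    """Concatenate short utterances into one long waveform and keep
    sample-aligned labels (1 = bona fide, 0 = spoof) for localization."""
    audio = np.concatenate(utterances)
    frame_labels = np.concatenate(
        [np.full(len(u), lab, dtype=np.int8) for u, lab in zip(utterances, labels)]
    )
    return mix_noise_at_snr(audio, noise, snr_db), frame_labels
```

In this toy version, the label track has one entry per audio sample; a real protocol would more likely store segment boundaries or frame-level labels at the model's feature rate.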
Datasets
ASVspoof 2019 LA, MUSAN, LENS-DF (generated dataset variants: long, SEG-N), LAV-DF (for comparative evaluation).
Model(s)
Self-supervised learning (SSL) models based on Wav2Vec 2.0 architecture (specifically MMS-1B and MMS-300M variants) as front-ends, combined with a Global Average Pooling (GAP) and a Fully Connected (FC) layer back-end. XLS-R-300M with AASIST back-end is used for comparison in ablation studies.
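The simple back-end described above (global average pooling over the SSL front-end's frame-level features, followed by a fully connected layer) can be illustrated with a minimal sketch. The feature dimension, the two-class output, and the random initialization here are assumptions for illustration, not the paper's configuration; a real system would fine-tune these weights jointly with the Wav2Vec 2.0 front-end.

```python
import numpy as np

rng = np.random.default_rng(0)

class GAPLinearBackend:
    """Global average pooling (GAP) over time followed by one fully
    connected (FC) layer, as a stand-in for the simple back-end
    described above. Feature dim and class count are illustrative."""

    def __init__(self, feat_dim: int = 1280, n_classes: int = 2):
        # Hypothetical random init; in practice these are trained.
        self.W = rng.standard_normal((feat_dim, n_classes)) * 0.01
        self.b = np.zeros(n_classes)

    def forward(self, frames: np.ndarray) -> np.ndarray:
        # frames: (T, feat_dim) frame-level features from the SSL front-end
        pooled = frames.mean(axis=0)      # GAP collapses time: (feat_dim,)
        return pooled @ self.W + self.b   # FC layer produces logits: (n_classes,)
```

Because GAP collapses the time axis, this back-end alone yields an utterance-level decision; temporal localization requires scoring shorter windows or frame-level outputs instead.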
Author countries
Japan