LENS-DF: Deepfake Detection and Temporal Localization for Long-Form Noisy Speech

Authors: Xuechen Liu, Wanying Ge, Xin Wang, Junichi Yamagishi

Published: 2025-07-22 04:31:13+00:00

Comment: Accepted by IEEE International Joint Conference on Biometrics (IJCB) 2025, Osaka, Japan

AI Summary

This study introduces LENS-DF, a novel recipe for training and evaluating audio deepfake detection and temporal localization under realistic conditions, including longer duration, noisy environments, and multiple speakers. Models trained using data generated with LENS-DF consistently outperform those trained with conventional recipes, demonstrating its effectiveness for robust audio deepfake detection and localization.

Abstract

This study introduces LENS-DF, a novel and comprehensive recipe for training and evaluating audio deepfake detection and temporal localization under complicated and realistic audio conditions. The generation part of the recipe outputs audio from the input dataset with several critical characteristics, such as longer duration, noisy conditions, and multiple speakers, in a controllable fashion. We conduct experiments with the corresponding detection and localization protocol, using models based on a self-supervised learning front-end and a simple back-end. The results indicate that models trained using data generated with LENS-DF consistently outperform those trained via conventional recipes, demonstrating the effectiveness and usefulness of LENS-DF for robust audio deepfake detection and localization. We also conduct ablation studies on the variations introduced, investigating their impact on and relevance to realistic challenges in the field.


Key findings
Models trained on conventional clean, short speech datasets perform poorly on long-form, noisy, multi-speaker deepfake audio. Training with LENS-DF-generated data significantly enhances detection and localization performance in complex, realistic conditions. Despite improvements, robust temporal localization in these challenging scenarios still requires further research and tailored strategies.
Approach
LENS-DF employs a comprehensive data generation pipeline to create realistic deepfake audio datasets from existing ones (e.g., ASVspoof 2019 LA), incorporating controlled noise, multi-speaker scenarios, and concatenation to form longer samples. The detection and localization protocol utilizes pre-trained self-supervised learning (SSL) models (Wav2Vec 2.0 variants) as front-ends, fine-tuned with a simple global average pooling and fully connected layer back-end.
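The generation steps described above (concatenating short utterances into long-form samples, tracking per-segment bona fide/spoof labels for localization, and mixing in noise at a controlled level) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual pipeline: the function names, the SNR-based mixing rule, and the sample-level label encoding are all assumptions for illustration.

```python
import numpy as np

def mix_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale noise to a target SNR (dB) relative to speech, then add it.
    Hypothetical helper; LENS-DF's actual MUSAN-based noise recipe may differ."""
    # Tile/trim the noise clip to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def make_long_sample(utterances, labels, noise, snr_db=10.0):
    """Concatenate short utterances into one long waveform and keep
    sample-aligned labels (1 = bona fide, 0 = spoof) for localization."""
    audio = np.concatenate(utterances)
    frame_labels = np.concatenate(
        [np.full(len(u), lab, dtype=np.int8) for u, lab in zip(utterances, labels)]
    )
    return mix_noise_at_snr(audio, noise, snr_db), frame_labels
```

In this toy version, the label track has one entry per audio sample; a real protocol would more likely store segment boundaries or frame-level labels at the model's feature rate.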
Datasets
ASVspoof 2019 LA, MUSAN, LENS-DF (generated dataset variants: long, SEG-N), LAV-DF (for comparative evaluation).
Model(s)
Self-supervised learning (SSL) models based on Wav2Vec 2.0 architecture (specifically MMS-1B and MMS-300M variants) as front-ends, combined with a Global Average Pooling (GAP) and a Fully Connected (FC) layer back-end. XLS-R-300M with AASIST back-end is used for comparison in ablation studies.
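The simple back-end described above (global average pooling over the SSL front-end's frame-level features, followed by a fully connected layer) can be illustrated with a minimal sketch. The feature dimension, the two-class output, and the random initialization here are assumptions for illustration, not the paper's configuration; a real system would fine-tune these weights jointly with the Wav2Vec 2.0 front-end.

```python
import numpy as np

rng = np.random.default_rng(0)

class GAPLinearBackend:
    """Global average pooling (GAP) over time followed by one fully
    connected (FC) layer, as a stand-in for the simple back-end
    described above. Feature dim and class count are illustrative."""

    def __init__(self, feat_dim: int = 1280, n_classes: int = 2):
        # Hypothetical random init; in practice these are trained.
        self.W = rng.standard_normal((feat_dim, n_classes)) * 0.01
        self.b = np.zeros(n_classes)

    def forward(self, frames: np.ndarray) -> np.ndarray:
        # frames: (T, feat_dim) frame-level features from the SSL front-end
        pooled = frames.mean(axis=0)      # GAP collapses time: (feat_dim,)
        return pooled @ self.W + self.b   # FC layer produces logits: (n_classes,)
```

Because GAP collapses the time axis, this back-end alone yields an utterance-level decision; temporal localization requires scoring shorter windows or frame-level outputs instead.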
Author countries
Japan