An Efficient Temporary Deepfake Location Approach Based Embeddings for Partially Spoofed Audio Detection

Authors: Yuankun Xie, Haonan Cheng, Yutian Wang, Long Ye

Published: 2023-09-06 14:29:29+00:00

AI Summary

This paper proposes Temporal Deepfake Location (TDL), a fine-grained partially spoofed audio detection method. TDL uses an embedding similarity module to separate real and fake audio frames in an embedding space and a temporal convolution operation to focus on positional information, improving detection accuracy.
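The paper's exact formulation of the embedding similarity module is not reproduced in this summary. As a rough illustration of the idea of separating real and fake frames in a shared embedding space, one could project frame features to unit-norm embeddings and penalize high cosine similarity between mixed real/fake frame pairs. The class name `FrameEmbedder`, the dimensions, and the pairwise loss below are illustrative assumptions, not the authors' design.

```python
# Illustrative sketch only: a frame-level embedder plus a cosine-similarity
# objective that pulls same-label (real/real, fake/fake) frame pairs together
# and pushes real/fake pairs apart. NOT the paper's exact module or loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEmbedder(nn.Module):  # hypothetical name
    def __init__(self, in_dim=1024, emb_dim=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )

    def forward(self, feats):                 # feats: (B, T, in_dim)
        return F.normalize(self.proj(feats), dim=-1)   # unit-norm frame embeddings

def embedding_similarity_loss(emb, labels):
    """emb: (B, T, D) unit-norm embeddings; labels: (B, T) with 1 = real, 0 = fake."""
    sim = emb @ emb.transpose(1, 2)                          # (B, T, T) cosine similarities
    same = (labels.unsqueeze(2) == labels.unsqueeze(1)).float()
    # High similarity encouraged for same-label pairs, low for real/fake pairs.
    return ((1 - sim) * same + sim.clamp(min=0) * (1 - same)).mean()
```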

Abstract

Partially spoofed audio detection is a challenging task, as it requires accurately determining the authenticity of audio at the frame level. To address this issue, we propose a fine-grained partially spoofed audio detection method, namely Temporal Deepfake Location (TDL), which can effectively capture both feature and positional information. Specifically, our approach involves two novel parts: an embedding similarity module and a temporal convolution operation. To enhance the discrimination between real and fake features, the embedding similarity module is designed to generate an embedding space that separates real frames from fake frames. To effectively concentrate on positional information, the temporal convolution operation is proposed to calculate frame-specific similarities among neighboring frames and dynamically select informative neighbors for convolution. Extensive experiments show that our method outperforms baseline models on the ASVspoof2019 Partial Spoof dataset and demonstrates superior performance even in the cross-dataset scenario.


Key findings
TDL outperforms baseline models on the ASVspoof2019 Partial Spoof dataset, achieving a 7.04% EER. It also demonstrates superior performance in cross-dataset experiments on LAV-DF, achieving an EER of 11.23%. The model is also more parameter-efficient than the baselines.
Approach
The approach uses Wav2Vec-XLS-R to extract audio features. An embedding similarity module generates an embedding space separating real and fake frames, while a temporal convolution operation uses frame-specific similarities to locate spoofed segments. A binary cross-entropy loss is used for training.
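As a rough sketch of how these pieces could fit together, the snippet below assumes 1024-dimensional Wav2Vec-XLS-R features, a 128-dimensional embedding space, and a kernel size of 5 (all assumptions), approximates the temporal convolution operation with a similarity-weighted aggregation over neighboring frames, and produces per-frame logits trained with binary cross-entropy. Class names such as `SimilarityWeightedTemporalConv` and `TDLSketch` are hypothetical and only illustrate the general mechanism, not the paper's exact architecture.

```python
# Rough pipeline sketch under stated assumptions; the similarity-weighted
# neighbor aggregation below only approximates the paper's temporal
# convolution operation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityWeightedTemporalConv(nn.Module):  # hypothetical name
    def __init__(self, dim=128, kernel_size=5):
        super().__init__()
        self.kernel_size = kernel_size
        self.mix = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (B, T, D)
        pad = self.kernel_size // 2
        xp = F.pad(x, (0, 0, pad, pad))                # pad along the time axis
        windows = xp.unfold(1, self.kernel_size, 1)    # (B, T, D, K) neighbor windows
        windows = windows.permute(0, 1, 3, 2)          # (B, T, K, D)
        centre = x.unsqueeze(2)                        # (B, T, 1, D)
        # Frame-specific similarity to each neighbor determines its weight.
        attn = F.softmax((windows * centre).sum(-1), dim=-1)   # (B, T, K)
        agg = (attn.unsqueeze(-1) * windows).sum(2)             # (B, T, D)
        return self.mix(agg)

class TDLSketch(nn.Module):
    def __init__(self, feat_dim=1024, emb_dim=128):
        super().__init__()
        self.embed = nn.Linear(feat_dim, emb_dim)
        self.tconv = SimilarityWeightedTemporalConv(emb_dim)
        self.head = nn.Linear(emb_dim, 1)

    def forward(self, feats):                          # feats: (B, T, feat_dim)
        h = self.tconv(torch.relu(self.embed(feats)))
        return self.head(h).squeeze(-1)                # per-frame logits

# Frame-level training objective, e.g.:
# loss = F.binary_cross_entropy_with_logits(model(feats), frame_labels.float())
```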
Datasets
ASVspoof2019 Partial Spoof (19PS) and LAV-DF datasets.
Model(s)
Wav2Vec-XLS-R (front-end), custom-designed embedding similarity module and temporal convolution operation (back-end). Compared against LCNN-BLSTM and SELCNN-BLSTM baselines.
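For the front-end, frame-level features can be extracted with a Wav2Vec-XLS-R checkpoint. The snippet below uses the HuggingFace facebook/wav2vec2-xls-r-300m checkpoint as an assumed stand-in; the paper does not necessarily use this exact variant or toolkit.

```python
# Hedged example: extracting frame-level features with a Wav2Vec-XLS-R
# checkpoint via HuggingFace Transformers. The "300m" checkpoint is an
# assumption; the paper may use a different XLS-R variant.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-300m")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m").eval()

waveform = torch.randn(16000 * 4)                      # 4 s of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    feats = model(**inputs).last_hidden_state          # (1, T_frames, 1024)
```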
Author countries
China