Source Tracing: Detecting Voice Spoofing

Authors: Tinglong Zhu, Xingming Wang, Xiaoyi Qin, Ming Li

Published: 2022-12-16 17:29:15+00:00

AI Summary

This paper proposes a system for classifying different spoofing attributes in audio deepfakes, focusing on identifying the generation methods rather than just detecting the presence of a fake. This approach, using multi-task learning, improves robustness against unseen spoofing methods and achieves a 20% relative improvement over conventional binary spoof detection methods on the ASVspoof 2019 LA dataset.

Abstract

Recent anti-spoofing systems focus on spoofing detection, where the task is only to determine whether the test audio is fake. However, few studies pay attention to identifying the methods used to generate fake speech. Common spoofing attack algorithms in the logical access (LA) scenario, such as voice conversion and speech synthesis, can be divided into several stages: input processing, conversion, waveform generation, etc. In this work, we propose a system for classifying different spoofing attributes, representing characteristics of different modules in the whole pipeline. Classifying attributes of the spoofing attack, rather than determining the whole spoofing pipeline, makes the system more robust when encountering complex combinations of different modules at different stages. In addition, our system can also be used as an auxiliary system for anti-spoofing against unseen spoofing methods. The experiments are conducted on the ASVspoof 2019 LA dataset, and the proposed method achieved a 20% relative improvement over conventional binary spoof detection methods.


Key findings

The multi-task system identified the conversion and waveform-generator attributes with high accuracy (over 80%), while accuracy on the speaker-representation attribute was lower (around 50%). For spoof detection, the proposed method achieved at least a 20% relative improvement over a conventional binary classifier.
Approach

The authors use a multi-task learning approach: a shared frontend extracts embeddings from the audio, and three separate backend classifiers predict the spoofing attributes (conversion method, speaker representation, and waveform generation). The final spoof score is a combination of the bona fide probabilities from the three classifiers.
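The shared-frontend, three-head design can be sketched as follows. This is a minimal illustration, not the paper's implementation: the frontend stands in for a ResNet34/RawNet2 embedding extractor, the class counts per head are made up, the convention that class 0 of each head means "bona fide" is an assumption, and averaging the three bona fide probabilities is one plausible choice of fusion (the paper only says the scores are combined).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-in for the shared frontend (ResNet34 / RawNet2 in the paper):
# maps input features to a shared embedding used by all three heads.
def shared_frontend(features, W):
    return np.tanh(features @ W)

# One attribute head: a linear classifier over that attribute's classes.
# By assumption here, class 0 of every head denotes "bona fide".
def attribute_head(embedding, W):
    return softmax(embedding @ W)

D_IN, D_EMB = 40, 16          # hypothetical feature / embedding sizes
W_front = rng.standard_normal((D_IN, D_EMB))
heads = {                     # class counts per attribute are illustrative
    "conversion":   rng.standard_normal((D_EMB, 5)),
    "speaker_repr": rng.standard_normal((D_EMB, 4)),
    "waveform_gen": rng.standard_normal((D_EMB, 6)),
}

def spoof_score(features):
    """Fuse the per-attribute bona fide probabilities into one score."""
    emb = shared_frontend(features, W_front)
    bona_fide = [attribute_head(emb, W)[0] for W in heads.values()]
    # Averaging is an assumed fusion rule; higher score = more bona fide.
    return float(np.mean(bona_fide))

score = spoof_score(rng.standard_normal(D_IN))
```

Because each head sees only its own stage of the spoofing pipeline, an unseen attack built from a novel combination of known modules can still push at least one head away from the bona fide class, which is the robustness argument made in the abstract.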
Datasets

ASVspoof 2019 LA dataset

Model(s)

ResNet34 and RawNet2

Author countries

China