CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems

Authors: Haibin Wu, Yuan Tseng, Hung-yi Lee

Published: 2024-06-11 13:16:09+00:00

Comment: Accepted to Interspeech 2024, project page: https://codecfake.github.io/

AI Summary

This paper introduces CodecFake, the first dataset specifically designed for detecting deepfake audio generated by contemporary codec-based speech synthesis systems. The authors demonstrate that current state-of-the-art anti-spoofing models trained on traditional datasets are largely ineffective against these new deepfakes. Training on the proposed CodecFake dataset, however, significantly improves these models' detection capabilities.

Abstract

Current state-of-the-art (SOTA) codec-based audio synthesis systems can mimic anyone's voice with just a 3-second sample from that specific unseen speaker. Unfortunately, malicious attackers may exploit these technologies, causing misuse and security issues. Anti-spoofing models have been developed to detect fake speech. However, the open question of whether current SOTA anti-spoofing models can effectively counter deepfake audios from codec-based speech synthesis systems remains unanswered. In this paper, we curate an extensive collection of contemporary SOTA codec models, employing them to re-create synthesized speech. This endeavor leads to the creation of CodecFake, the first codec-based deepfake audio dataset. Additionally, we verify that anti-spoofing models trained on commonly used datasets cannot detect synthesized speech from current codec-based speech generation systems. The proposed CodecFake dataset empowers these models to counter this challenge effectively.


Key findings
Anti-spoofing models trained on traditional datasets such as ASVspoof 2019 exhibit high equal error rates (EERs) when confronted with deepfakes from modern codec-based speech synthesis systems. Training on the CodecFake dataset substantially improves these models' ability to detect such deepfakes. Notably, models trained on CodecFake achieved a 0.4% EER on speech synthesized by VALL-E.
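For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (spoofed speech accepted as genuine) equals the false rejection rate (genuine speech rejected). The following is a minimal sketch of how it can be estimated from detector scores, not the paper's evaluation code. The function name `compute_eer`, the convention that a higher score means "more likely bona fide", and the toy scores are all assumptions made for illustration.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the equal error rate and the threshold where it occurs.

    Assumed convention: a higher score means "more likely bona fide".
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoof trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest and
    # report their average as the EER estimate.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return float(eer), float(thresholds[idx])


# Toy scores, made up for illustration only.
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With these toy scores, one of four bona fide trials and one of four spoof trials fall on the wrong side of the chosen threshold, so the estimate is 25%. Evaluation toolkits typically interpolate between thresholds rather than picking the nearest one, so values on real data may differ slightly from this sketch.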
Approach
The authors curate a collection of 15 state-of-the-art neural audio codec models and use them to re-synthesize speech from the VCTK corpus, creating the CodecFake dataset. The dataset is then used to train and evaluate anti-spoofing models, showing that training on CodecFake enables detection of deepfakes generated by current codec-based speech synthesis systems.
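The re-synthesis step described above can be pictured as follows. This is a hedged sketch, not the authors' pipeline: it assumes that each original recording is kept as a bona fide example and that each codec contributes one re-synthesized (spoof) copy per utterance. The `Example` class, `build_examples`, the label strings, and the toy "codecs" are hypothetical stand-ins; real neural codecs would encode and decode actual waveforms.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Waveform = List[float]


@dataclass
class Example:
    utt_id: str
    source: str      # "original" or the (hypothetical) codec name
    label: str       # "bonafide" or "spoof" (assumed label names)
    audio: Waveform


def build_examples(
    utterances: Dict[str, Waveform],
    codecs: Dict[str, Callable[[Waveform], Waveform]],
) -> List[Example]:
    """Pair every original utterance with one re-synthesized copy per codec."""
    examples: List[Example] = []
    for utt_id, wav in utterances.items():
        # The untouched recording is treated as a bona fide example.
        examples.append(Example(utt_id, "original", "bonafide", wav))
        for name, resynthesize in codecs.items():
            # Encoding and decoding through a codec yields a spoof example.
            examples.append(Example(utt_id, name, "spoof", resynthesize(wav)))
    return examples


# Toy stand-ins: coarse rounding imitates lossy re-synthesis.
toy_codecs = {
    "toy_codec_a": lambda wav: [round(x, 1) for x in wav],
    "toy_codec_b": lambda wav: [round(x, 2) for x in wav],
}
toy_utterances = {"utt_0001": [0.123, -0.481, 0.337]}

examples = build_examples(toy_utterances, toy_codecs)
for ex in examples:
    print(ex.utt_id, ex.source, ex.label)
```

Under these assumptions, the number of examples is the number of utterances times one plus the number of codecs, which is why adding codec variants grows the spoof portion of such a dataset quickly.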
Datasets
CodecFake (created), VCTK, ASVspoof 2019, VALL-E (synthesized VCTK), VALL-E X (demo page samples), SpeechX (demo page samples)
Model(s)
AASIST, AASIST-L
Author countries
Taiwan