CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems

Authors: Haibin Wu, Yuan Tseng, Hung-yi Lee

Published: 2024-06-11 13:16:09+00:00

AI Summary

This paper introduces CodecFake, the first dataset of deepfake audios generated using state-of-the-art codec-based speech synthesis systems. It demonstrates that existing anti-spoofing models fail to detect these deepfakes and shows that training on CodecFake significantly improves detection accuracy.

Abstract

Current state-of-the-art (SOTA) codec-based audio synthesis systems can mimic anyone's voice with just a 3-second sample from that specific unseen speaker. Unfortunately, malicious attackers may exploit these technologies, causing misuse and security issues. Anti-spoofing models have been developed to detect fake speech. However, the open question of whether current SOTA anti-spoofing models can effectively counter deepfake audios from codec-based speech synthesis systems remains unanswered. In this paper, we curate an extensive collection of contemporary SOTA codec models, employing them to re-create synthesized speech. This endeavor leads to the creation of CodecFake, the first codec-based deepfake audio dataset. Additionally, we verify that anti-spoofing models trained on commonly used datasets cannot detect synthesized speech from current codec-based speech generation systems. The proposed CodecFake dataset empowers these models to counter this challenge effectively.

Key findings

Anti-spoofing models trained on ASVspoof performed poorly on CodecFake, exposing a significant gap in their ability to detect codec-based deepfakes. Training on CodecFake substantially improved detection, reaching an equal error rate (EER) of 0.4% on VALL-E-generated speech.
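For readers unfamiliar with the metric, the equal error rate is the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The sketch below is a simple threshold-sweep approximation, not the paper's evaluation code. It assumes that a higher score means "more likely bona fide"; the function name and score lists are illustrative.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: higher score = more likely bona fide. A trial is accepted
    when its score is >= the threshold.
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoof trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy scores, made up for illustration only.
eer, threshold = equal_error_rate([0.9, 0.4], [0.1, 0.6])
print(f"EER = {eer:.1%} at threshold {threshold}")
```

With finite score lists the two error rates rarely cross exactly, so the sketch reports their average at the closest point; evaluation toolkits typically interpolate instead.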

Approach

The authors created CodecFake by using 15 different state-of-the-art codec models to resynthesize speech from the VCTK corpus. They then evaluated the performance of existing and CodecFake-trained anti-spoofing models (AASIST-L) on this new dataset and several other datasets to assess their ability to detect codec-based deepfakes.
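The labeling scheme implied by this construction can be sketched as follows: each original utterance is a bona fide sample, and each codec-resynthesized copy of it is a spoof sample. This is only an illustration of that pairing under stated assumptions; the function name, the dictionary fields, and the codec names are hypothetical and do not reflect the authors' actual file layout or tooling.

```python
def build_manifest(utterance_ids, codec_names):
    """Pair each original utterance with one resynthesized copy per codec.

    Hypothetical structure: originals are labeled "bonafide", and every
    codec-resynthesized version is labeled "spoof".
    """
    entries = []
    for utt in utterance_ids:
        # The untouched source recording.
        entries.append({"utt": utt, "codec": None, "label": "bonafide"})
        for codec in codec_names:
            # One resynthesized version of the same utterance per codec.
            entries.append({"utt": utt, "codec": codec, "label": "spoof"})
    return entries


# Illustrative IDs and placeholder codec names (assumptions, not real data).
manifest = build_manifest(["p225_001", "p225_002"], ["codec_a", "codec_b", "codec_c"])
print(len(manifest), "entries,",
      sum(e["label"] == "spoof" for e in manifest), "spoof")
```

Under this scheme, N utterances and K codecs yield N bona fide and N x K spoof entries, so the classes are imbalanced by a factor of K unless the training setup compensates for it.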

Datasets

VCTK corpus, ASVspoof 2019, CodecFake (created by the authors), VALL-E, VALL-E X demo page, SpeechX demo page

Model(s)

AASIST-L

Author countries

Taiwan
