Room Impulse Responses help attackers to evade Deep Fake Detection

Authors: Hieu-Thi Luong, Duc-Tuan Truong, Kong Aik Lee, Eng Siong Chng

Published: 2024-09-23 05:17:30+00:00

AI Summary

This paper investigates the vulnerability of deepfake audio detection systems to attacks using room impulse responses (RIRs) to add reverberation to fake speech. The authors demonstrate that this simple attack significantly increases the error rate of state-of-the-art systems, and propose a defense mechanism using large-scale synthetic RIR data augmentation during training, substantially improving detection accuracy.

Abstract

The ASVspoof 2021 benchmark, a widely-used evaluation framework for anti-spoofing, consists of two subsets: Logical Access (LA) and Deepfake (DF), featuring samples with varied coding characteristics and compression artifacts. Notably, the current state-of-the-art (SOTA) system boasts impressive performance, achieving an Equal Error Rate (EER) of 0.87% on the LA subset and 2.58% on the DF. However, benchmark accuracy is no guarantee of robustness in real-world scenarios. This paper investigates the effectiveness of utilizing room impulse responses (RIRs) to enhance fake speech and increase their likelihood of evading fake speech detection systems. Our findings reveal that this simple approach significantly improves the evasion rate, doubling the SOTA system's EER. To counter this type of attack, we augmented training data with a large-scale synthetic/simulated RIR dataset. The results demonstrate significant improvement on both reverberated fake speech and original samples, reducing DF task EER to 2.13%.

Key findings

Adding RIRs to fake speech roughly doubled the Equal Error Rate (EER) of the state-of-the-art deepfake detection system. Augmenting training data with a large-scale synthetic RIR dataset reduced the EER on both reverberated fake speech and original samples, bringing the DF task EER down to 2.13%. Synthetic RIRs proved slightly more effective than simulated RIRs for augmentation.
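
Because every finding here is stated as an EER, a small sketch of how that metric can be computed may help. This is an illustrative threshold sweep, not the paper's evaluation script: the function name `compute_eer` and the convention that a higher score means "more likely bona fide" are assumptions.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the Equal Error Rate by sweeping observed scores as thresholds.

    Assumed convention: higher score = more likely bona fide.
    """
    bona = np.asarray(bonafide_scores, dtype=np.float64)
    spoof = np.asarray(spoof_scores, dtype=np.float64)
    thresholds = np.sort(np.concatenate([bona, spoof]))

    # False rejection: bona fide speech scored below the threshold.
    frr = np.array([(bona < t).mean() for t in thresholds])
    # False acceptance: spoofed speech scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER is the operating point where the two error rates meet.
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2.0)
```

An attack that makes fake speech score more like bona fide speech raises the false acceptance rate at every threshold, which is how an evasion attack pushes the EER up.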

Approach

The authors convolved fake speech with room impulse responses (RIRs) to add reverberation and evade detection systems. To counter this, they augmented training data with a large-scale synthetic RIR dataset generated using a previously published method.
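
A minimal sketch of the reverberation step, assuming mono waveforms held as NumPy arrays. The helper names `make_toy_rir` and `add_reverb` are hypothetical, and the exponentially decaying noise RIR is only a stand-in for the measured, simulated, or synthetic RIRs used in the paper; the authors' exact normalization and RIR generation method are not described in this summary.

```python
import numpy as np

def make_toy_rir(sample_rate=16000, rt60=0.3, seed=0):
    """Toy RIR (assumption): noise with an exponential decay of 60 dB over rt60 seconds."""
    rng = np.random.default_rng(seed)
    n = int(sample_rate * rt60)
    t = np.arange(n) / sample_rate
    # exp(-6.908) is about 1e-3, i.e. a 60 dB amplitude drop at t = rt60.
    rir = rng.standard_normal(n) * np.exp(-6.908 * t / rt60)
    rir[0] = 1.0  # direct path
    return rir

def add_reverb(speech, rir):
    """Convolve speech with an RIR, keeping the original length and peak level."""
    speech = np.asarray(speech, dtype=np.float64)
    rir = np.asarray(rir, dtype=np.float64)
    # Unit-energy RIR so the convolution does not change loudness drastically.
    rir = rir / (np.sqrt(np.sum(rir ** 2)) + 1e-12)
    wet = np.convolve(speech, rir, mode="full")[: len(speech)]
    # Rescale to the input's peak so the result stays in the same amplitude range.
    peak = np.max(np.abs(wet))
    if peak > 0:
        wet = wet * (np.max(np.abs(speech)) / peak)
    return wet
```

The same function serves both sides of the study: applied to fake speech at test time it models the attack, and applied to training utterances with randomly drawn RIRs it models the augmentation defense.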

Datasets

ASVspoof 2021 Deepfake (DF) subset for evaluation, ASVspoof 2019 Logical Access (LA) subset for training, MIT RIR dataset, synthetic RIR dataset (generated), simulated RIR dataset (from Ko et al. 2017)

Model(s)

RawNet2, Wav2Vec2 + AASIST, Wav2Vec2 + Conformer

Author countries

Singapore, China
