Room Impulse Responses help attackers to evade Deep Fake Detection

Authors: Hieu-Thi Luong, Duc-Tuan Truong, Kong Aik Lee, Eng Siong Chng

Published: 2024-09-23 05:17:30+00:00

Comment: 7 pages, to be presented at SLT 2024

AI Summary

This paper investigates the vulnerability of state-of-the-art deepfake speech detection systems to attacks that use room impulse responses (RIRs) to add reverberation to fake speech, which significantly increases the evasion rate. To counteract this, the authors propose augmenting training data with large-scale synthetic or simulated RIRs. This augmentation significantly improves detection robustness on both reverberated fake speech and original samples.

Abstract

The ASVspoof 2021 benchmark, a widely used evaluation framework for anti-spoofing, consists of two subsets: Logical Access (LA) and Deepfake (DF), featuring samples with varied coding characteristics and compression artifacts. Notably, the current state-of-the-art (SOTA) system boasts impressive performance, achieving an Equal Error Rate (EER) of 0.87% on the LA subset and 2.58% on the DF subset. However, benchmark accuracy is no guarantee of robustness in real-world scenarios. This paper investigates the effectiveness of utilizing room impulse responses (RIRs) to enhance fake speech and increase its likelihood of evading fake speech detection systems. Our findings reveal that this simple approach significantly improves the evasion rate, doubling the SOTA system's EER. To counter this type of attack, we augmented training data with a large-scale synthetic/simulated RIR dataset. The results demonstrate significant improvement on both reverberated fake speech and original samples, reducing DF task EER to 2.13%.


Key findings
The study found that applying RIRs to fake speech substantially degrades the performance of SOTA detection models, roughly doubling their EER (e.g., System C's EER increased from 2.58% to 5.90% on reverberant C1R1). Augmenting training data with large-scale synthetic RIRs mitigated this vulnerability and achieved new SOTA performance (e.g., BR2 reduced DF EER to 2.13% and C1R2 EER to 2.57%), demonstrating improved robustness and generalization across various conditions.
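The EER figures quoted above are the operating point at which the false acceptance rate (spoofed speech accepted) equals the false rejection rate (bona fide speech rejected). A minimal sketch of how such a number can be computed from detector scores follows. The function name `compute_eer` is hypothetical, it assumes that higher scores mean "more likely bona fide", and it is not the official ASVspoof scoring code, which may interpolate between thresholds differently.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for two sets of detector scores.

    Assumption: a higher score means the detector considers the
    utterance more likely to be bona fide.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Every observed score is a candidate decision threshold.
    thresholds = np.sort(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest and
    # report their average as the EER.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]
```

Under this definition, an attack that pushes spoofed scores toward the bona fide range raises the false acceptance rate at every threshold, which is how reverberation can raise the EER.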
Approach
The authors first demonstrate the adversarial potential of RIRs by applying reverberation to fake speech and evaluating its impact on existing detection systems, showing a substantial increase in Equal Error Rate. To defend against this, they augment the training datasets of these detection systems with a large-scale collection of synthetic and simulated RIRs, aiming to improve model robustness and generalization.
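Both the attack and the defense described above rest on the same operation: convolving a waveform with an RIR. The sketch below illustrates that operation and a simple on-the-fly augmentation step. It is an illustration under stated assumptions, not the authors' pipeline: the function names `apply_rir` and `augment_with_rir`, the energy and peak normalisation, and the application probability `prob` are all hypothetical choices that this summary does not specify.

```python
import numpy as np


def apply_rir(speech, rir):
    """Reverberate a mono waveform by convolving it with an RIR.

    The output is truncated to the input length and rescaled to the
    input's peak level (both are assumptions made for this sketch).
    """
    speech = np.asarray(speech, dtype=np.float64)
    rir = np.asarray(rir, dtype=np.float64)

    # Normalise the RIR to unit energy so the overall gain stays comparable.
    rir = rir / (np.sqrt(np.sum(rir ** 2)) + 1e-12)

    # Linear convolution via FFT: zero-pad both signals to the full length.
    n = len(speech) + len(rir) - 1
    wet = np.fft.irfft(np.fft.rfft(speech, n) * np.fft.rfft(rir, n), n)
    wet = wet[: len(speech)]

    # Match the peak amplitude of the dry signal to avoid clipping.
    peak = np.max(np.abs(wet))
    if peak > 0:
        wet = wet * (np.max(np.abs(speech)) / peak)
    return wet


def augment_with_rir(speech, rir_bank, rng, prob=0.5):
    """With probability `prob`, reverberate `speech` with a random RIR.

    `rir_bank` is a list of 1-D arrays; `rng` is a numpy Generator.
    """
    if rng.random() >= prob:
        return np.asarray(speech, dtype=np.float64)
    rir = rir_bank[int(rng.integers(len(rir_bank)))]
    return apply_rir(speech, rir)
```

In a training loop, `augment_with_rir` would be called on each utterance before feature extraction, with `rir_bank` loaded from whichever RIR collection is in use. For the attack, `apply_rir` would instead be applied to the spoofed utterances at evaluation time.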
Datasets
ASVspoof 2021 DF subset, ASVspoof 2019 LA subset, MIT RIR dataset, synthetic RIR dataset (generated from MIT RIRs), simulated RIR dataset (from Ko et al. [18] via OpenSLR.org/26)
Model(s)
RawNet2, Wav2Vec2.0 + AASIST, Wav2Vec2.0 + Conformer
Author countries
Singapore, Hong Kong SAR, China