Investigating Causal Cues: Strengthening Spoofed Audio Detection with Human-Discernible Linguistic Features

Authors: Zahra Khanjani, Tolulope Ale, Jianwu Wang, Lavon Davis, Christine Mallinson, Vandana P. Janeja

Published: 2024-09-09 19:47:57+00:00

AI Summary

This paper investigates causal relationships between human-discernible Expert Defined Linguistic Features (EDLFs) and spoofed audio detection. Using a hybrid dataset of spoofed audio augmented with sociolinguistic annotations, the authors apply causal discovery and causal inference models to analyze the effect of EDLFs on the spoof label.

Abstract

Several types of spoofed audio, such as mimicry, replay attacks, and deepfakes, have created societal challenges to information integrity. Recently, researchers have worked with sociolinguistics experts to label spoofed audio samples with Expert Defined Linguistic Features (EDLFs) that can be discerned by the human ear: pitch, pause, word-initial and word-final release bursts of consonant stops, audible intake or outtake of breath, and overall audio quality. Several deepfake detection algorithms have been shown to improve when the traditional and common features of audio data are augmented with these EDLFs. In this paper, using a hybrid dataset comprised of multiple types of spoofed audio augmented with sociolinguistic annotations, we investigate causal discovery and inferences between the discernible linguistic features and the label in the audio clips, comparing the findings of the causal models with the expert ground truth validation labeling process. Our findings suggest that the causal models indicate the utility of incorporating linguistic features to help discern spoofed audio, as well as the overall need and opportunity to incorporate human knowledge into models and techniques for strengthening AI models. The causal discovery and inference can be used as a foundation for training humans to discern spoofed audio as well as for automating EDLF labeling to improve the performance of common AI-based spoofed audio detectors.


Key findings
AudioQualityAnomaly is the most significant causal cue for spoofed audio. When excluding AudioQualityAnomaly, PitchAnomaly is the primary direct cause of the spoof label. Causal inference validates the importance of PitchAnomaly and PauseAnomaly, while IntakeOrOuttakeofBreath shows little causal effect.
Approach
The authors employ an ensemble causal discovery model (combining PC and GES algorithms) to identify EDLFs' impact on audio spoofing labels. Causal inference, using Random Forest, Logistic Regression, and XGBoost, estimates the average causal effect of each EDLF on the label, validated by expert ground truth.
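To make the notion of an average causal effect concrete, here is a minimal sketch in Python. It is not the authors' pipeline: instead of Random Forest, Logistic Regression, or XGBoost estimators, it uses plain stratification (backdoor adjustment) over one other binary feature. The rows are invented toy data, the column name "Spoof" and the helper names are hypothetical, and the sketch assumes that PauseAnomaly is the only confounder of PitchAnomaly and the label.

```python
from collections import defaultdict


def average_causal_effect(rows, treatment, outcome, covariates):
    """Estimate the average causal effect of a binary treatment on a binary
    outcome by stratifying on the covariates (backdoor adjustment).

    Assumption: the covariates block every backdoor path between the
    treatment and the outcome.
    """
    # counts[stratum][t] = [number of rows, number of rows with outcome == 1]
    counts = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})
    for row in rows:
        stratum = tuple(row[c] for c in covariates)
        cell = counts[stratum][row[treatment]]
        cell[0] += 1
        cell[1] += row[outcome]

    total = 0
    weighted_diff = 0.0
    for cells in counts.values():
        (n0, y0), (n1, y1) = cells[0], cells[1]
        if n0 == 0 or n1 == 0:
            continue  # skip strata that lack treated or untreated rows
        n = n0 + n1
        # Difference in outcome rates inside the stratum, weighted by its size
        weighted_diff += n * (y1 / n1 - y0 / n0)
        total += n
    if total == 0:
        raise ValueError("no stratum contains both treated and untreated rows")
    return weighted_diff / total


def make_rows(pitch, pause, n, n_spoof):
    """Build n toy rows, the first n_spoof of which are labeled as spoofed."""
    return [
        {"PitchAnomaly": pitch, "PauseAnomaly": pause, "Spoof": int(i < n_spoof)}
        for i in range(n)
    ]


# Invented toy data, for illustration only (not taken from the paper)
toy_rows = (
    make_rows(1, 0, 4, 3)
    + make_rows(0, 0, 4, 1)
    + make_rows(1, 1, 2, 2)
    + make_rows(0, 1, 2, 1)
)

ace = average_causal_effect(toy_rows, "PitchAnomaly", "Spoof", ["PauseAnomaly"])
print(ace)  # 0.5 on this toy data
```

On this toy data, a pitch anomaly raises the spoof rate by 0.5 in both PauseAnomaly strata, so the size-weighted average is also 0.5. A model-based estimator follows the same idea but replaces the per-stratum rates with predictions from a fitted outcome model.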
Datasets
A hybrid dataset that combines subsets of existing datasets (e.g., ASVspoof 2021, FoR) with samples newly generated using MelGAN, Assem-VC, and Google WaveNet, all augmented with sociolinguistic annotations.
Model(s)
PC algorithm, GES algorithm, Random Forest, Logistic Regression, XGBoost.
Author countries
USA