Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion

Authors: Rui Liu, Jinhua Zhang, Guanglai Gao, Haizhou Li

Published: 2023-05-25 02:54:29+00:00

AI Summary

This paper introduces M2S-ADD, a novel audio deepfake detection model that leverages stereo audio information. It converts mono audio to stereo using a pre-trained model and then employs a dual-branch neural network to analyze the left and right channels, improving detection accuracy.

Abstract

Audio Deepfake Detection (ADD) aims to detect fake audio generated by text-to-speech (TTS), voice conversion (VC), replay, etc., and is an emerging topic. Traditionally, the mono signal is taken as input and effort focuses on robust feature extraction and effective classifier design. However, the dual-channel stereo information in the audio signal also carries important cues for deepfake detection, which has not been studied in prior work. In this paper, we propose a novel ADD model, termed M2S-ADD, that attempts to discover audio authenticity cues during the mono-to-stereo conversion process. We first project the mono signal to a stereo signal using a pretrained stereo synthesizer, then employ a dual-branch neural architecture to process the left and right channel signals, respectively. In this way, we effectively reveal the artifacts in fake audio and thus improve ADD performance. Experiments on the ASVspoof 2019 database show that M2S-ADD outperforms all baselines with mono input. We release the source code at https://github.com/AI-S2-Lab/M2S-ADD.


Key findings
M2S-ADD outperforms all baselines using mono input on the ASVspoof 2019 dataset, achieving a lower Equal Error Rate (EER). Ablation studies confirm the importance of the dual-branch architecture. Visualization analysis shows that stereo conversion exposes spectral artifacts in fake audio.
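The Equal Error Rate (EER) used here is the operating point at which the false-acceptance rate (spoofed audio accepted) equals the false-rejection rate (bona fide audio rejected). A minimal sketch of how EER can be computed from detection scores (a generic illustration, not the paper's evaluation code):

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """EER: threshold where false-acceptance rate equals false-rejection rate.
    Higher scores are assumed to mean 'more likely bona fide'."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # FRR: fraction of bona fide scores below the threshold (genuine audio rejected)
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    # FAR: fraction of spoof scores at or above the threshold (fakes accepted)
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    # Pick the threshold where the two rates are closest and average them
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

# Toy example: well-separated score distributions give a low EER
bona = np.array([0.9, 0.8, 0.85, 0.95])
spoof = np.array([0.1, 0.2, 0.15, 0.3])
eer = compute_eer(bona, spoof)  # perfectly separated scores -> 0.0
```

In practice, the ASVspoof challenges compute EER over the full evaluation set this way, sweeping the decision threshold across all scores.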
Approach
M2S-ADD converts mono audio input to stereo using a pre-trained stereo synthesizer. A dual-branch neural network processes the left and right channels separately, and their features are fused for final classification. This approach reveals artifacts in fake audio better than mono-only methods.
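The flow above (mono-to-stereo conversion, per-channel branches, feature fusion, classification) can be sketched end to end. This is a hypothetical NumPy mock-up: the real system uses a pretrained neural stereo synthesizer and deep branches, whereas here `mono_to_stereo` is a trivial stand-in and each branch is a toy linear feature extractor.

```python
import numpy as np

rng = np.random.default_rng(0)

def mono_to_stereo(mono):
    """Stand-in for the pretrained stereo synthesizer (hypothetical):
    two slightly decorrelated copies of the mono input."""
    left = mono
    right = np.roll(mono, 1)  # placeholder inter-channel difference
    return left, right

def branch_features(channel, w):
    """One branch: a toy linear projection plus nonlinearity
    (the paper uses SincNet/residual/GAT layers instead)."""
    return np.tanh(channel @ w)

def m2s_add_score(mono, w_left, w_right, w_out):
    """Dual-branch sketch: process each channel, fuse features, classify."""
    left, right = mono_to_stereo(mono)
    fused = np.concatenate([branch_features(left, w_left),
                            branch_features(right, w_right)])
    logit = fused @ w_out
    return 1.0 / (1.0 + np.exp(-logit))  # probability the clip is bona fide

# Toy dimensions: 160-sample clip, 16-dim features per branch
mono = rng.standard_normal(160)
w_l, w_r = rng.standard_normal((160, 16)), rng.standard_normal((160, 16))
w_o = rng.standard_normal(32)
score = m2s_add_score(mono, w_l, w_r, w_o)
```

The key design choice illustrated is that the two channels pass through separate branches before fusion, so conversion artifacts that differ between left and right channels remain visible to the classifier.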
Datasets
ASVspoof 2019 logical access (LA) database for model training and evaluation; a separate 2-hour dataset of paired mono and binaural audio for pre-training the mono-to-stereo converter.
Model(s)
A dual-branch neural network architecture consisting of SincNet layers, residual layers, graph attention network (GAT) layers, and graph pooling layers. A pre-trained mono-to-stereo converter is also used.
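The GAT layers in each branch aggregate spectro-temporal node features with learned attention weights. A minimal single-head graph attention step in the style of Velickovic et al. (a generic sketch with hypothetical shapes, not the paper's implementation):

```python
import numpy as np

def gat_layer(x, adj, w, a):
    """Single-head graph attention.
    x: (n, d_in) node features; adj: (n, n) 0/1 adjacency (with self-loops);
    w: (d_in, d_out) projection; a: (2*d_out,) attention vector."""
    h = x @ w                                   # project node features
    n = h.shape[0]
    e = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            # attention logit e_ij = LeakyReLU(a^T [h_i || h_j])
            z = a @ np.concatenate([h[i], h[j]])
            e[i, j] = z if z > 0 else 0.2 * z
    e = np.where(adj > 0, e, -np.inf)           # mask non-edges
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # softmax over neighbors
    return alpha @ h                            # attention-weighted aggregation

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))                 # 4 toy nodes, 8-dim features
adj = np.ones((4, 4))                           # fully connected toy graph
out = gat_layer(x, adj, rng.standard_normal((8, 5)), rng.standard_normal(10))
```

Graph pooling then reduces the attended node features to a fixed-size branch representation before fusion.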
Author countries
China, China, China, Singapore