Audios Don't Lie: Multi-Frequency Channel Attention Mechanism for Audio Deepfake Detection

Authors: Yangguang Feng

Published: 2024-12-12 17:15:49+00:00

AI Summary

This research proposes an audio deepfake detection method using a multi-frequency channel attention mechanism (MFCA) and 2D discrete cosine transform (DCT). The method leverages MobileNet V2 for feature extraction and MFCA to weight different frequency channels, improving the detection of fine-grained frequency features in audio signals.

Abstract

With the rapid development of artificial intelligence technology, the application of deepfake technology in the audio field has gradually increased, creating a wide range of security risks. The misuse of deepfake audio has raised serious concerns, especially in the financial and social-security fields. To address this challenge, this study proposes an audio deepfake detection method based on a multi-frequency channel attention mechanism (MFCA) and the 2D discrete cosine transform (DCT). By converting the audio signal into a mel-spectrogram, using MobileNet V2 to extract deep features, and applying the MFCA module to weight the different frequency channels of the signal, the method effectively captures fine-grained frequency-domain features and strengthens the classification of fake audio. Experimental results show that, compared with traditional methods, the proposed model offers significant advantages in accuracy, precision, recall, F1 score, and other metrics. In complex audio scenarios in particular, the method shows stronger robustness and generalization, providing a new approach to audio deepfake detection with practical application value. Future work will explore more advanced audio detection techniques and optimization strategies to further improve the accuracy and generalization of audio deepfake detection.
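The first stage of the pipeline, converting the raw waveform into a mel-spectrogram, can be sketched in plain numpy as below. This is a minimal illustration of the standard transform (framing, windowed power spectrum, triangular mel filterbank, log compression), not the paper's implementation; all parameter values (sample rate, FFT size, hop, number of mel bands) are assumptions for the sketch.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l:
            fb[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
        if r > c:
            fb[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    # Slice the signal into overlapping frames and take each frame's power spectrum
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    # Project onto the mel filters and compress with a log
    return np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
spec = mel_spectrogram(sig)  # shape: (n_frames, n_mels) = (61, 40)
```

In practice a library routine (e.g. librosa's melspectrogram) would be used; the point here is only to show what the 2D time-frequency input to MobileNet V2 looks like.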


Key findings

The proposed MFCMNet model significantly outperformed other models (CNN, VGG16, ResNet50, MobileNet) in accuracy, precision, recall, and F1-score on the Fake or Real dataset. The improvement is particularly notable in complex audio scenarios, demonstrating enhanced robustness and generalization capabilities.

Approach

The approach converts the audio signal into a mel-spectrogram representation. MobileNet V2 extracts deep features, which the MFCA module then weights channel by channel according to their frequency content. A 2D DCT supplies the frequency information used to compute these channel weights, improving feature fusion.
Datasets

Fake or Real dataset (for-norm version)

Model(s)

MobileNet V2, MFCA module (multi-frequency channel attention mechanism)

Author countries

China