Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion


Authors: Ajinkya Kulkarni, Sandipana Dowerah, Tanel Alumae, Mathew Magimai.-Doss

Published: 2025-06-02 12:42:09+00:00

AI Summary

This paper introduces a novel audio source tracing system for identifying the origin of audio deepfakes. It combines deep metric learning with a Conformer network and ensemble score-embedding fusion to improve both in-domain and out-of-domain source tracing accuracy.

Abstract

Audio deepfakes are acquiring an unprecedented level of realism with advanced AI. While current research focuses on discerning real speech from spoofed speech, tracing the source system is equally crucial. This work proposes a novel audio source tracing system combining a deep metric multi-class N-pair loss with the Real Emphasis and Fake Dispersion framework, a Conformer classification network, and ensemble score-embedding fusion. The N-pair loss improves discriminative ability, while Real Emphasis and Fake Dispersion enhance robustness by focusing on differentiating real and fake speech patterns. The Conformer network captures both global and local dependencies in the audio signal, which are crucial for source tracing. The proposed ensemble score-embedding fusion shows an optimal trade-off between in-domain and out-of-domain source tracing scenarios. We evaluate our method using Fréchet Distance and standard metrics, demonstrating superior performance in source tracing over the baseline system.

Key findings

The proposed system outperforms baseline models in both in-domain and out-of-domain scenarios. Ensemble fusion improves in-domain performance, while the Conformer model shows better generalization to unseen deepfakes. Fréchet Distance effectively evaluates the stability of feature representations.
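As a rough illustration of how a Fréchet Distance between two sets of embeddings can be computed, the sketch below fits a Gaussian to each set and compares the two. It assumes diagonal covariances so that it needs only the standard library; the general form uses full covariance matrices and a matrix square root, and the summary does not say which variant the paper uses. The function names are hypothetical, and the value returned is the squared distance, as is conventional for this metric.

```python
import math
from statistics import fmean, pvariance


def diagonal_gaussian_stats(embeddings):
    """Per-dimension mean and population variance of a list of embedding vectors."""
    dims = list(zip(*embeddings))  # transpose: one tuple of values per dimension
    means = [fmean(d) for d in dims]
    variances = [pvariance(d) for d in dims]
    return means, variances


def frechet_distance_diag(emb_a, emb_b):
    """Squared Frechet distance between Gaussians fitted to two embedding sets.

    Assumption: covariances are diagonal, so the trace term
    Tr(S_a + S_b - 2 (S_a S_b)^(1/2)) reduces to a per-dimension sum.
    """
    mu_a, var_a = diagonal_gaussian_stats(emb_a)
    mu_b, var_b = diagonal_gaussian_stats(emb_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum(va + vb - 2.0 * math.sqrt(va * vb) for va, vb in zip(var_a, var_b))
    return mean_term + cov_term


# Toy example: the second set is the first shifted by 3 along the first axis.
set_a = [[0.0, 0.0], [2.0, 2.0]]
set_b = [[3.0, 0.0], [5.0, 2.0]]
distance = frechet_distance_diag(set_a, set_b)
```

A smaller value means the two embedding distributions are closer, so comparing embeddings of the same source system across conditions gives one way to read "stability" of a representation.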

Approach

The approach uses a two-stage training framework (Real Emphasis and Fake Dispersion) with a multi-class N-pair loss for improved discriminative ability. It employs a Conformer network for feature extraction and ensemble fusion of multiple model outputs for robust source tracing.
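The multi-class N-pair loss can be sketched as follows. This is a minimal standard-library version of the commonly used formulation, in which each class contributes one (anchor, positive) pair and the positives of all other classes act as negatives. It is not the authors' implementation: the function names are hypothetical, and the Real Emphasis and Fake Dispersion stages are not modelled here.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def n_pair_loss(anchors, positives):
    """Multi-class N-pair loss over N (anchor, positive) pairs, one pair per class.

    For anchor i with similarities s_ij = anchor_i . positive_j, the per-anchor
    loss is log(1 + sum_{j != i} exp(s_ij - s_ii)), which is the same as
    logsumexp_j(s_ij) - s_ii. The result is averaged over anchors.
    """
    if len(anchors) != len(positives):
        raise ValueError("need exactly one positive per anchor")
    total = 0.0
    for i, anchor in enumerate(anchors):
        sims = [dot(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all similarities for this anchor.
        m = max(sims)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_sum_exp - sims[i]
    return total / len(anchors)


# Toy example with two classes and 2-D embeddings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = n_pair_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])   # positives match anchors
swapped = n_pair_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])   # positives mismatched
```

The loss is lower when each anchor is more similar to its own positive than to the positives of other classes, which is the behaviour that encourages embeddings of different source systems to separate.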

Datasets

MLAAD dataset, ASVspoof 2019 dataset

Model(s)

Wav2Vec2-XLSR, Conformer, MAMBA, HYDRA

Author countries

Switzerland, Estonia
