Explaining Speaker and Spoof Embeddings via Probing

Authors: Xuechen Liu, Junichi Yamagishi, Md Sahidullah, Tomi Kinnunen

Published: 2024-12-24 05:56:49+00:00

AI Summary

This research investigates the explainability of embeddings used in audio spoofing detection systems. By training classifiers to predict speaker attributes from these embeddings, the study reveals which traits are preserved and how this impacts spoofing detection robustness.

Abstract

This study investigates the explainability of embedding representations, specifically those used in modern audio spoofing detection systems based on deep neural networks, known as spoof embeddings. Building on established work in speaker embedding explainability, we examine how well these spoof embeddings capture speaker-related information. We train simple neural classifiers using either speaker or spoof embeddings as input, with speaker-related attributes as target labels. These attributes are categorized into two groups: metadata-based traits (e.g., gender, age) and acoustic traits (e.g., fundamental frequency, speaking rate). Our experiments on the ASVspoof 2019 LA evaluation set demonstrate that spoof embeddings preserve several key traits, including gender, speaking rate, F0, and duration. Further analysis of gender and speaking rate indicates that the spoofing detector partially preserves these traits, potentially to ensure the decision process remains robust against them.


Key findings
Spoof embeddings preserve several key traits, such as gender and speaking rate, but discard most other speaker-related information. Further analysis suggests the spoofing detector partially preserves these traits so that its decisions remain robust against them. The study highlights the potential for integrating automatic speaker verification (ASV) and countermeasure (CM) systems by leveraging this preserved information.
Approach
The authors take a probing-analysis approach: they train simple neural classifiers (MLPs) to predict speaker-related attributes, both metadata-based (e.g., gender, age) and acoustic (e.g., F0, speaking rate), from speaker or spoof embeddings. Probe performance on these prediction tasks indicates which attributes each embedding preserves; a minimal sketch of the setup follows below.
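To make the probing setup concrete, here is a minimal PyTorch sketch of such a probe. The hidden size, the 192-dimensional input (the usual ECAPA-TDNN embedding size), and the training details are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ProbingMLP(nn.Module):
    """A simple probe that predicts one speaker attribute from a frozen embedding."""
    def __init__(self, emb_dim: int, num_classes: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def train_probe(probe, embeddings, labels, epochs=20, lr=1e-3):
    """Train only the probe; the embeddings are precomputed and stay fixed."""
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(embeddings), labels)
        loss.backward()
        opt.step()
    return probe

# Example: probing binary gender from 192-d speaker embeddings.
emb = torch.randn(64, 192)        # stand-in for extracted embeddings
lab = torch.randint(0, 2, (64,))  # stand-in attribute labels
probe = train_probe(ProbingMLP(emb_dim=192, num_classes=2), emb, lab)
```

High probe accuracy on held-out data indicates the attribute is recoverable from the embedding; chance-level accuracy suggests it was discarded.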
Datasets
ASVspoof 2019 LA evaluation set and VCTK corpus
Model(s)
Multi-layer Perceptron (MLP) as the probing classifier; ECAPA-TDNN and AASIST as the backbone models for extracting speaker and spoof embeddings, respectively.
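For context, here is a minimal sketch of extracting speaker embeddings with a pretrained ECAPA-TDNN via SpeechBrain. The specific checkpoint (speechbrain/spkrec-ecapa-voxceleb) and the SpeechBrain 1.x inference API are assumptions; the paper does not state which pretrained weights were used, and AASIST spoof embeddings would be extracted analogously from a trained countermeasure model.

```python
import torch
from speechbrain.inference.speaker import EncoderClassifier  # SpeechBrain >= 1.0

# Load a pretrained ECAPA-TDNN speaker encoder from the hub (assumed checkpoint).
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained/ecapa",
)

# 16 kHz mono waveform batch -> 192-d speaker embeddings.
wav = torch.randn(1, 16000)         # stand-in for a 1-second utterance
emb = encoder.encode_batch(wav)     # shape: (batch, 1, 192)
speaker_embedding = emb.squeeze(1)  # shape: (batch, 192), probe input
```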
Author countries
Japan, India, Finland