Half-Truth: A Partially Fake Audio Detection Dataset


Authors: Jiangyan Yi, Ye Bai, Jianhua Tao, Haoxin Ma, Zhengkun Tian, Chenglong Wang, Tao Wang, Ruibo Fu

Published: 2021-04-08 08:57:13+00:00

AI Summary

This paper introduces the Half-Truth Audio Detection (HAD) dataset, focusing on partially fake audio where only a few words in an utterance are synthetically generated. This addresses a significant gap in existing datasets and provides a more realistic scenario for fake audio detection, enabling both fake utterance detection and localization of manipulated regions.

Abstract

Diverse promising datasets have been designed to advance the development of fake audio detection, such as the ASVspoof databases. However, previous datasets ignore an attack scenario in which the hacker hides some small fake clips in real speech audio. This poses a serious threat, since it is difficult to distinguish a small fake clip from the rest of the speech utterance. Therefore, this paper develops such a dataset for half-truth audio detection (HAD). Partially fake audio in the HAD dataset involves changing only a few words in an utterance. The audio of the words is generated with the latest state-of-the-art speech synthesis technology. Using this dataset, we can not only detect fake utterances but also localize manipulated regions in speech. Some benchmark results are presented on this dataset. The results show that partially fake audio is much more challenging than fully fake audio for fake audio detection. The HAD dataset is publicly available: https://zenodo.org/records/10377492.

Key findings

Benchmark results show that detecting partially fake audio is significantly more challenging than detecting fully fake audio. Localization of manipulated regions in partially fake audio also performs poorly, highlighting the difficulty of this task.

Approach

The HAD dataset is created by manipulating real speech audio. A few words are replaced with synthetic audio generated using state-of-the-art speech synthesis technology. The dataset includes fully real, fully fake, and partially fake audio for benchmarking.
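
The word-replacement idea can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`splice_segment`, `frame_labels`), the label convention (1 = real, 0 = fake), and the frame parameters are all assumptions made for illustration. The sketch replaces a span of a real waveform with a synthetic clip, then derives frame-level labels of the kind a localization model could be trained on.

```python
import numpy as np


def splice_segment(real, fake, start, end):
    """Replace real[start:end] with `fake`.

    Returns the spliced waveform and per-sample labels.
    Label convention (an assumption): 1 = real, 0 = fake.
    """
    if not (0 <= start <= end <= len(real)):
        raise ValueError("start/end must satisfy 0 <= start <= end <= len(real)")
    spliced = np.concatenate([real[:start], fake, real[end:]])
    labels = np.ones(len(spliced), dtype=np.int8)
    # The synthetic clip may be longer or shorter than the span it replaces.
    labels[start:start + len(fake)] = 0
    return spliced, labels


def frame_labels(sample_labels, frame_len, hop):
    """Mark a frame as fake (0) if any sample inside it is fake."""
    n_frames = 1 + max(0, (len(sample_labels) - frame_len) // hop)
    return np.array(
        [sample_labels[i * hop:i * hop + frame_len].min() for i in range(n_frames)],
        dtype=np.int8,
    )


# Hypothetical usage with placeholder signals instead of real recordings.
real = np.zeros(1000, dtype=np.float32)   # stands in for a genuine utterance
fake = np.ones(300, dtype=np.float32)     # stands in for a synthesized word
spliced, labels = splice_segment(real, fake, start=400, end=600)
frames = frame_labels(labels, frame_len=100, hop=100)
```

An utterance-level label follows directly from the frame labels: under this convention, an utterance is partially fake if any frame is 0 and at least one frame is 1.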

Datasets

AISHELL-3 corpus, HAD (Half-truth Audio Detection) dataset (created by the authors)

Model(s)

Gaussian Mixture Model (GMM), Light Convolutional Neural Network (LCNN)

Author countries

China
