PITCH: AI-assisted Tagging of Deepfake Audio Calls using Challenge-Response

Authors: Govind Mittal, Arthur Jakobsson, Kelly O. Marshall, Chinmay Hegde, Nasir Memon

Published: 2024-02-28 06:17:55+00:00

AI Summary

PITCH, a challenge-response system, enhances deepfake audio detection by incorporating audio challenges designed to exploit weaknesses in voice cloning technology. This human-AI collaborative system achieves 84.5% accuracy, significantly improving upon human-only performance (72.6%) by leveraging machine precision while maintaining human decision authority.

Abstract

The rise of AI voice-cloning technology, particularly audio Real-time Deepfakes (RTDFs), has intensified social engineering attacks by enabling real-time voice impersonation that bypasses conventional enrollment-based authentication. This technology represents an existential threat to phone-based authentication systems, with total identity fraud losses reaching $43 billion. Unlike traditional robocalls, these personalized AI-generated voice attacks target high-value accounts and circumvent existing defensive measures, creating an urgent cybersecurity challenge. To address this, we propose PITCH, a robust challenge-response method to detect and tag interactive deepfake audio calls. We developed a comprehensive taxonomy of audio challenges based on the human auditory system, linguistics, and environmental factors, yielding 20 prospective challenges. Testing against leading voice-cloning systems using a novel dataset (18,600 original and 1.6 million deepfake samples from 100 users), PITCH's challenges enhanced machine detection capabilities to an 88.7% AUROC score, enabling us to identify 10 highly-effective challenges. For human evaluation, we filtered a challenging, balanced subset on which human evaluators independently achieved 72.6% accuracy, while machines scored 87.7%. Recognizing that call environments require human control, we developed a novel human-AI collaborative system that tags suspicious calls as Deepfake-likely. Contrary to prior findings, we discovered that integrating human intuition with machine precision offers complementary advantages, giving users maximum control while boosting detection accuracy to 84.5%. This significant improvement underscores PITCH's potential as an AI-assisted pre-screener for verifying calls, offering an adaptable approach to combat real-time voice-cloning attacks while maintaining human decision authority.


Key findings
Integrating audio challenges significantly boosted machine detection AUROC to 88.7% from a baseline of 56%. Human evaluators achieved 72.6% accuracy, while a combined human-AI system reached 84.5% accuracy, a 16.4% relative improvement over human-only performance. The study revealed that human and machine assessments are often complementary rather than correlated.
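For clarity, the 16.4% figure is a relative gain over the human-only baseline rather than percentage points; a quick check using the accuracies reported above:

```python
# Relative improvement of the human-AI system over human-only accuracy.
human_only, human_ai = 72.6, 84.5           # accuracies reported above (%)
absolute_gain = human_ai - human_only       # 11.9 percentage points
relative_gain = absolute_gain / human_only  # ~0.164
print(f"{relative_gain:.1%}")               # -> 16.4%
```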
Approach
PITCH uses a challenge-response mechanism where callers are asked to perform various audio tasks (e.g., whispering, speaking with a cup over their mouth). These challenges expose limitations in deepfake generation, improving both machine and human detection accuracy. A human-AI collaborative system combines the strengths of both for optimal performance.
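As an illustration of how such a flow could be wired together, the sketch below issues a random challenge, pre-screens the caller's response with a machine detector, and leaves the final tag to the human. The challenge prompts, fusion rule, and threshold are assumptions for illustration, not the paper's exact design.

```python
import random

# Illustrative subset of audio challenges; the paper's taxonomy contains 20
# candidates, 10 of which proved highly effective (exact prompts assumed here).
CHALLENGES = [
    "whisper the given sentence",
    "speak the sentence with a cup over your mouth",
    "hum the melody of the sentence",
]

def machine_score(audio_response: bytes) -> float:
    """Stand-in for PITCH's detector stack (compliance, realism, transcription,
    speaker similarity). Should return an estimated probability of deepfake."""
    raise NotImplementedError

def tag_call(audio_response: bytes, human_flags_fake: bool,
             threshold: float = 0.5) -> str:
    """Hypothetical human-AI fusion: the machine pre-screens the call, but the
    human keeps final authority, matching the 'Deepfake-likely' tagging above."""
    if human_flags_fake or machine_score(audio_response) >= threshold:
        return "Deepfake-likely"
    return "No tag"

challenge = random.choice(CHALLENGES)  # challenge issued to the caller in real time
```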
Datasets
A novel open-source dataset with 18,600 original and 1.6 million deepfake audio samples from 100 users, recorded across mobile and desktop environments.
Model(s)
Wav2Vec2 (fine-tuned for challenge compliance detection), NISQA (for realism assessment), WhisperX (for transcription), SpeechBrain (for speaker recognition). FREE-VC, StarGANv2-VC, and PPG-VC voice cloning systems were used to generate deepfakes.
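A minimal sketch of how these off-the-shelf components might be loaded and applied to a challenge response follows. The checkpoint names, wiring, and the use of a generic (not fine-tuned) Wav2Vec2 classification head are assumptions rather than the authors' released pipeline, and NISQA is omitted since it is typically run from its own repository.

```python
import torch
import whisperx
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification
from speechbrain.pretrained import SpeakerRecognition

device = "cuda" if torch.cuda.is_available() else "cpu"

# Challenge-compliance classifier (PITCH fine-tunes Wav2Vec2 for this task; a base
# checkpoint with a fresh 2-way head is used here purely as a placeholder).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
compliance = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=2
).to(device)

# Transcription of the challenge response.
asr = whisperx.load_model("large-v2", device,
                          compute_type="float16" if device == "cuda" else "int8")

# Speaker verification against the claimed identity's enrollment recording.
verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained_ecapa"
)

def analyze(response_wav: str, enrollment_wav: str):
    audio = whisperx.load_audio(response_wav)          # 16 kHz mono waveform
    transcript = asr.transcribe(audio, batch_size=8)   # did the caller say the prompt?
    inputs = extractor(audio, sampling_rate=16000, return_tensors="pt").to(device)
    with torch.no_grad():
        compliance_logits = compliance(**inputs).logits  # challenge performed or not
    score, same_speaker = verifier.verify_files(enrollment_wav, response_wav)
    return transcript, compliance_logits, float(score), bool(same_speaker)
```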
Author countries
USA