A Preliminary Exploration with GPT-4o Voice Mode

Authors: Yu-Xiang Lin, Chih-Kai Yang, Wei-Chih Chen, Chen-An Li, Chien-yu Huang, Xuanjun Chen, Hung-yi Lee

Published: 2025-02-14 06:34:08+00:00

AI Summary

This paper presents a preliminary exploration of GPT-4o's audio processing capabilities, evaluating its performance across various audio, speech, and music tasks. The study reveals GPT-4o's strengths in tasks like intent classification and multilingual speech recognition but also its limitations and safety-related restrictions, notably a refusal to perform tasks like audio deepfake detection.

Abstract

With the rise of multimodal large language models, GPT-4o stands out as a pioneering model, driving us to evaluate its capabilities. This report assesses GPT-4o across various tasks to analyze its audio processing and reasoning abilities. We find that GPT-4o exhibits strong knowledge in audio, speech, and music understanding, performing well in tasks like intent classification, spoken command classification, semantic and grammatical reasoning, multilingual speech recognition, and singing analysis. It also shows greater robustness against hallucinations than other large audio-language models (LALMs). However, it struggles with tasks such as audio duration prediction and instrument classification. Additionally, GPT-4o's safety mechanisms cause it to decline tasks like speaker identification, age classification, MOS prediction, and audio deepfake detection. Notably, the model exhibits a significantly different refusal rate when responding to speaker verification tasks on different datasets. This is likely due to variations in the accompanying instructions or the quality of the input audio, suggesting the sensitivity of its built-in safeguards. Finally, we acknowledge that model performance varies with evaluation protocols. This report serves only as a preliminary exploration of the current state of LALMs.


Key findings
GPT-4o demonstrates strong performance in several audio understanding tasks, surpassing other models in some cases. However, its built-in safety mechanisms cause it to refuse many tasks, including audio deepfake detection. The refusal rate varies significantly depending on the dataset and task instructions.
Approach
The researchers evaluated GPT-4o on a wide range of tasks from several large benchmarks (Dynamic-SUPERB, MMAU, and CMM), spanning the audio, speech, and music domains. They measured both the model's accuracy and its refusal rate on tasks deemed potentially unsafe, as sketched below.
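To make the protocol concrete, here is a minimal sketch of such an evaluation loop, assuming the OpenAI Python SDK's chat-completions interface for the audio-preview model. The refusal-keyword heuristic, helper names, and task format are illustrative assumptions, not the authors' actual harness.

```python
# Minimal evaluation-loop sketch. Model name is taken from the report
# (gpt-4o-audio-preview-2024-10-01); refusal markers and helpers are
# hypothetical, for illustration only.
import base64
from openai import OpenAI

client = OpenAI()

# Hypothetical heuristic for flagging safety refusals in text replies.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "not able to assist")

def ask(instruction: str, audio_path: str) -> str:
    """Send one benchmark instruction plus its audio clip to GPT-4o."""
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = client.chat.completions.create(
        model="gpt-4o-audio-preview-2024-10-01",
        modalities=["text"],  # text-only replies for scoring
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def evaluate(tasks):
    """Tally accuracy and refusal rate over (instruction, audio, answer) triples."""
    correct = refused = 0
    for instruction, audio_path, answer in tasks:
        reply = ask(instruction, audio_path).lower()
        if any(m in reply for m in REFUSAL_MARKERS):
            refused += 1
        elif answer.lower() in reply:  # naive string-match scoring
            correct += 1
    n = len(tasks)
    return correct / n, refused / n
```

Separating the refusal tally from the accuracy tally matters here: as the report notes, counting a refusal as an ordinary wrong answer would conflate the model's safety behavior with its actual task ability.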
Datasets
Dynamic-SUPERB Phase 2, MMAU, CMM
Model(s)
GPT-4o (gpt-4o-audio-preview-2024-10-01), Whisper-LLaMA, Qwen2-Audio-7B-Instruct, GAMA-IT, MU-LLaMA, SALMONN-7B, SALMONN-13B, Qwen-Audio-Chat, LLaMA-3.1-8B-Instruct
Author countries
Taiwan