InnerSelf: Designing Self-Deepfaked Voice for Emotional Well-being

Authors: Guang Dai, Pinhao Wang, Cheng Yao, Fangtian Ying

Published: 2025-03-18 13:45:22+00:00

AI Summary

InnerSelf introduces a voice system that leverages speech synthesis and Large Language Models to create a personalized self-deepfaked voice for emotional well-being. The system lets users engage in supportive and empathic dialogue with a clone of their own voice, aiming to promote self-disclosure and emotional regulation. By guiding positive self-talk, InnerSelf seeks to reshape negative thoughts and improve overall emotional well-being.

Abstract

One's own voice is one of the most frequently heard voices. Studies have found that hearing and talking to oneself have positive psychological effects. However, the design and implementation of self-voice for emotional regulation in HCI have yet to be explored. In this paper, we introduce InnerSelf, a voice system based on speech synthesis technologies and a Large Language Model. It allows users to engage in supportive and empathic dialogue with a deepfake of their own voice. By manipulating positive self-talk, our system aims to promote self-disclosure and emotional regulation, reshaping negative thoughts and improving emotional well-being.
Key findings
The paper proposes InnerSelf, a voice-interactive system that leverages the psychological benefits of self-voice for emotional regulation. It outlines the system's architecture and potential applications: interrupting negative self-talk, facilitating long-term changes in thought patterns, and providing psychological assistance. As a design paper, it highlights the potential of combining a deepfaked self-voice with LLMs to foster positive self-talk and improve mental health, rather than reporting an empirical evaluation.
Approach
The system first recognizes the user's emotional state from speech, using a Transformer-based pipeline that fuses acoustic features from Wav2Vec 2.0 with speech-to-text output (multimodal feature fusion). A GPT-4-driven conversation module then generates empathetic text responses conditioned on the recognized emotion and the dialogue context. Finally, an SV2TTS-based voice-cloning module, with a WaveRNN vocoder, synthesizes these responses in the user's own voice for real-time interaction.
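The three-stage flow above can be sketched structurally as follows. This is a minimal illustration, not the authors' implementation: every function and class name here is a hypothetical stand-in, with the Wav2Vec 2.0 recognizer, the GPT-4 dialogue module, and the SV2TTS/WaveRNN cloner replaced by stubs so the end-to-end data flow is visible without model weights or API keys.

```python
from dataclasses import dataclass

@dataclass
class EmotionResult:
    """Output of the recognition stage (hypothetical structure)."""
    label: str        # e.g. "sad", "neutral"
    transcript: str   # speech-to-text output fused with acoustic features

def recognize_emotion(audio: bytes) -> EmotionResult:
    # Stub for the Wav2Vec 2.0 + speech-to-text fusion stage; a real system
    # would run a Transformer classifier over the audio features.
    return EmotionResult(label="sad", transcript="I keep messing things up.")

def generate_response(emotion: EmotionResult, history: list) -> str:
    # Stub for the GPT-4-driven conversation module; a real implementation
    # would condition the prompt on the recognized emotion and dialogue history.
    return f"[{emotion.label}-aware reply] One mistake does not define you."

def clone_voice(text: str, speaker_embedding: bytes) -> bytes:
    # Stub for the SV2TTS stage: text -> mel-spectrogram -> WaveRNN waveform.
    return text.encode("utf-8")  # placeholder for synthesized audio

def inner_self_turn(audio: bytes, speaker_embedding: bytes, history: list) -> bytes:
    """One conversational turn: recognize, respond, synthesize in self-voice."""
    emotion = recognize_emotion(audio)
    reply = generate_response(emotion, history)
    history.append(reply)
    return clone_voice(reply, speaker_embedding)
```

The point of the sketch is the module boundary: emotion recognition and dialogue generation stay text/label-based, and only the final stage touches the user's voiceprint, which is where the self-deepfake is produced.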
Datasets
UNKNOWN
Model(s)
Wav2Vec 2.0, GPT-4, SV2TTS model, WaveRNN vocoder
Author countries
China