Synthesizing Photorealistic Virtual Humans Through Cross-modal Disentanglement

Authors: Siddarth Ravichandran, Ondřej Texler, Dimitar Dinev, Hyun Jae Kang

Published: 2022-09-03 03:56:49+00:00

AI Summary

This paper presents a real-time framework for synthesizing high-quality virtual human faces with accurate lip synchronization. It introduces a novel network using visemes as an intermediate audio representation and a data augmentation strategy for disentangling audio and visual modalities.

Abstract

Over the last few decades, many aspects of human life have been enhanced with virtual domains, from the advent of digital assistants such as Amazon's Alexa and Apple's Siri to the latest metaverse efforts of the rebranded Meta. These trends underscore the importance of generating photorealistic visual depictions of humans. This has led to the rapid growth of so-called deepfake and talking-head generation methods in recent years. Despite their impressive results and popularity, they usually lack certain qualitative aspects such as texture quality, lip synchronization, or resolution, and practical aspects such as the ability to run in real-time. To allow for virtual human avatars to be used in practical scenarios, we propose an end-to-end framework for synthesizing high-quality virtual human faces capable of speaking with accurate lip motion with a special emphasis on performance. We introduce a novel network utilizing visemes as an intermediate audio representation and a novel data augmentation strategy employing a hierarchical image synthesis approach that allows disentanglement of the different modalities used to control the global head motion. Our method runs in real-time and is able to deliver superior results compared to the current state-of-the-art.


Key findings
The proposed method achieves superior visual quality and lip synchronization compared to state-of-the-art methods, as measured by PSNR, SSIM, and lip synchronization metrics. It also achieves a high inference speed (110 FPS) suitable for real-time applications.
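For reference, PSNR and SSIM are standard full-reference image-quality metrics. The sketch below shows one way to compute per-frame scores with scikit-image; the frame format and the averaging helper are illustrative assumptions, not the authors' evaluation code.

```python
# Minimal sketch: average per-frame PSNR/SSIM between generated and ground-truth frames.
# Assumes aligned, same-resolution uint8 RGB frames; not the authors' evaluation code.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_frames(generated, reference):
    """Return mean PSNR and SSIM over paired lists of (H, W, 3) uint8 frames."""
    psnr_scores, ssim_scores = [], []
    for gen, ref in zip(generated, reference):
        psnr_scores.append(peak_signal_noise_ratio(ref, gen, data_range=255))
        ssim_scores.append(structural_similarity(ref, gen, channel_axis=-1, data_range=255))
    return float(np.mean(psnr_scores)), float(np.mean(ssim_scores))
```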
Approach
The approach uses a two-encoder-two-decoder neural network architecture. A novel data augmentation strategy, employing keypoint mashing and a high-resolution generative oracle network, disentangles lip motion from upper face motion. A hierarchical outpainting approach generates high-resolution synthetic data.
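No reference code accompanies this summary. As a rough illustration only, the PyTorch sketch below shows the general shape of a two-encoder-two-decoder model: a 1D-convolutional encoder for a viseme stream, a 2D-convolutional encoder for an upper-face image, and two decoders over the fused features. All layer sizes, input shapes, and the fusion scheme are assumptions rather than the authors' design.

```python
# Illustrative two-encoder-two-decoder layout (not the authors' implementation).
# One encoder consumes a viseme sequence (1D), the other an upper-face image (2D);
# two decoders reconstruct the mouth region and the full face from fused features.
import torch
import torch.nn as nn

class TwoEncoderTwoDecoder(nn.Module):
    def __init__(self, num_visemes=20, feat=64):
        super().__init__()
        # 1D-conv encoder over a window of viseme activations (assumed shape: B x num_visemes x T)
        self.viseme_enc = nn.Sequential(
            nn.Conv1d(num_visemes, feat, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(feat, feat, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse time: B x feat x 1
        )
        # 2D-conv encoder over an upper-face image (assumed shape: B x 3 x 256 x 256)
        self.face_enc = nn.Sequential(
            nn.Conv2d(3, feat, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Two decoders: one for the mouth crop, one for the full face.
        self.mouth_dec = self._decoder(feat * 2)
        self.face_dec = self._decoder(feat * 2)

    def _decoder(self, in_ch):
        return nn.Sequential(
            nn.ConvTranspose2d(in_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, visemes, face):
        a = self.viseme_enc(visemes)                                  # B x feat x 1
        v = self.face_enc(face)                                       # B x feat x H' x W'
        a = a.unsqueeze(-1).expand(-1, -1, v.shape[2], v.shape[3])    # broadcast over space
        fused = torch.cat([a, v], dim=1)                              # B x 2*feat x H' x W'
        return self.mouth_dec(fused), self.face_dec(fused)
```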
Datasets
UNKNOWN (no named public dataset). The paper describes a single-identity dataset captured from a 20-minute, 6K-resolution video recording.
Model(s)
A two-encoder-two-decoder GAN architecture with 1D and 2D convolutional layers, along with a Pix2Pix-based generative oracle network. Training uses a VGG feature-matching loss and a smooth L1 loss alongside the GAN loss.
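The loss weights and the exact VGG layers are not specified in this summary; the sketch below only illustrates how an adversarial term, a VGG feature-matching (perceptual) term, and a smooth L1 reconstruction term are commonly combined for the generator. The weights and the VGG19 truncation point are placeholder assumptions.

```python
# Illustrative generator objective: GAN loss + VGG feature matching + smooth L1.
# Weights and VGG layer choice are placeholders, not values from the paper.
import torch
import torch.nn as nn
import torchvision.models as models

class VGGFeatures(nn.Module):
    """Frozen VGG19 truncated at an intermediate layer, used for feature matching."""
    def __init__(self, layer_idx=16):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features[:layer_idx]
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg.eval()

    def forward(self, x):
        return self.vgg(x)

def generator_loss(disc_fake_logits, fake, real, vgg_feats,
                   w_adv=1.0, w_vgg=10.0, w_l1=10.0):
    # Non-saturating adversarial loss on the discriminator's logits for fake frames.
    adv = nn.functional.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    # Perceptual / feature-matching term on frozen VGG activations.
    vgg = nn.functional.l1_loss(vgg_feats(fake), vgg_feats(real))
    # Smooth L1 reconstruction term in pixel space.
    rec = nn.functional.smooth_l1_loss(fake, real)
    return w_adv * adv + w_vgg * vgg + w_l1 * rec
```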
Author countries
USA