Leveraging Deepfakes to Close the Domain Gap between Real and Synthetic Images in Facial Capture Pipelines

Authors: Winnie Lin, Yilin Zhu, Demi Guo, Ron Fedkiw

Published: 2022-04-22 15:09:49+00:00

AI Summary

This paper proposes an end-to-end pipeline for building and tracking 3D facial models from unconstrained video data. It leverages deepfake technology to bridge the domain gap between synthetic and real images, enabling robust tracking and avoiding the need for high-end equipment or real-world ground truth data.

Abstract

We propose an end-to-end pipeline for both building and tracking 3D facial models from personalized in-the-wild (cellphone, webcam, YouTube clips, etc.) video data. First, we present a method for automatic data curation and retrieval based on a hierarchical clustering framework typical of collision detection algorithms in traditional computer graphics pipelines. Subsequently, we utilize synthetic turntables and leverage deepfake technology in order to build a synthetic multi-view stereo pipeline for appearance capture that is robust to imperfect synthetic geometry and image misalignment. The resulting model is fit with an animation rig, which is then used to track facial performances. Notably, our novel use of deepfake technology enables us to perform robust tracking of in-the-wild data using differentiable renderers despite a significant synthetic-to-real domain gap. Finally, we outline how we train a motion capture regressor, leveraging the aforementioned techniques to avoid the need for real-world ground truth data and/or a high-end calibrated camera capture setup.
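
A minimal sketch of the hierarchical-clustering curation step described in the abstract, assuming each frame has already been reduced to a fixed-length descriptor (e.g., a pose/expression embedding from a landmark network). The descriptor source, the Ward linkage, and the distance threshold are illustrative assumptions, not the paper's specifics:

```python
# Illustrative sketch (not the authors' code): curate in-the-wild frames by
# hierarchically clustering per-frame face descriptors, in the spirit of the
# bounding-volume hierarchies used by collision detection algorithms.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def curate_frames(descriptors: np.ndarray, max_cluster_dist: float) -> list[int]:
    """Group similar frames and keep one representative per cluster.

    descriptors: (num_frames, d) array, one embedding per video frame.
    Returns indices of the retained representative frames.
    """
    # Agglomerative clustering builds a binary hierarchy over the frames.
    tree = linkage(descriptors, method="ward")
    # Cutting the hierarchy at a distance threshold yields flat clusters;
    # the right threshold is data-dependent.
    labels = fcluster(tree, t=max_cluster_dist, criterion="distance")

    representatives = []
    for label in np.unique(labels):
        members = np.where(labels == label)[0]
        # Keep the frame closest to the cluster centroid as the representative.
        centroid = descriptors[members].mean(axis=0)
        dists = np.linalg.norm(descriptors[members] - centroid, axis=1)
        representatives.append(int(members[np.argmin(dists)]))
    return sorted(representatives)

# Example: 1000 frames with 64-D descriptors, deduplicated to a diverse subset.
frames = np.random.rand(1000, 64)
keep = curate_frames(frames, max_cluster_dist=5.0)
print(f"kept {len(keep)} of {len(frames)} frames")
```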


Key findings

The method builds high-quality 3D facial models and tracks facial motion from limited in-the-wild data. Deepfakes mitigate the effects of imperfect synthetic geometry and real-world image misalignment, and the approach improves on traditional methods, particularly for non-Caucasian female subjects.

Approach

The approach uses a hierarchical clustering framework for data curation, generates synthetic turntable renders, and employs deepfake technology to correlate synthetic and real images. This enables robust texture acquisition as well as motion tracking via differentiable renderers and a motion capture regressor; a sketch of the tracking loop follows below.
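
A minimal, runnable sketch of the deepfake-bridged tracking loop described above. The differentiable renderer and the deepfake domain-transfer network are trivial stand-ins (any differentiable maps would do), so this illustrates only the optimization structure, not the paper's actual renderer or networks:

```python
# Hedged sketch: fit rig parameters to an in-the-wild frame by first mapping
# the real frame into the synthetic domain (the paper's deepfake bridge), so
# that a photometric loss against a differentiable render is meaningful
# despite the synthetic-to-real domain gap.
import torch

def differentiable_render(rig_params: torch.Tensor) -> torch.Tensor:
    # Stand-in for a real differentiable renderer of the rigged face mesh:
    # any differentiable map from rig parameters to an image works here.
    return torch.sigmoid(rig_params.sum() * torch.ones(64, 64, 3))

def to_synthetic_domain(real_frame: torch.Tensor) -> torch.Tensor:
    # Stand-in for the deepfake network that re-renders a real frame as if
    # it were synthetic; in the paper this is a conditional autoencoder.
    return real_frame

real_frame = torch.rand(64, 64, 3)                # one in-the-wild video frame
target = to_synthetic_domain(real_frame)          # bridge the domain gap first
rig_params = torch.zeros(10, requires_grad=True)  # rig/blendshape coefficients

optimizer = torch.optim.Adam([rig_params], lr=0.05)
for step in range(200):
    optimizer.zero_grad()
    rendered = differentiable_render(rig_params)
    loss = torch.nn.functional.l1_loss(rendered, target)
    loss.backward()  # gradients flow through the renderer to the rig
    optimizer.step()
```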

Datasets

In-the-wild video data (cellphone, webcam, YouTube clips) and synthetically generated turntable renders of faces.

Model(s)

Unsupervised conditional autoencoders (deepfake networks), a pretrained landmark detection network, and multi-layer perceptron networks for motion capture regression; a minimal autoencoder sketch follows below. The Wav2Lip model is also used for lip-sync.
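
A hedged sketch of the deepfake-style conditional autoencoder listed above: a shared encoder with one decoder per domain, so decoding a real frame through the synthetic decoder transfers it across the domain gap. The MLP layer sizes and the 'real'/'synthetic' domain labels are illustrative assumptions:

```python
# Illustrative deepfake architecture: a shared encoder forces real and
# synthetic images into a common latent space; per-domain decoders
# reconstruct each domain from that shared latent.
import torch
import torch.nn as nn

class DeepfakeAutoencoder(nn.Module):
    def __init__(self, image_dim: int = 64 * 64 * 3, latent_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(image_dim, 1024), nn.ReLU(),
            nn.Linear(1024, latent_dim),
        )
        self.decoders = nn.ModuleDict({
            domain: nn.Sequential(
                nn.Linear(latent_dim, 1024), nn.ReLU(),
                nn.Linear(1024, image_dim), nn.Sigmoid(),
            )
            for domain in ("real", "synthetic")
        })

    def forward(self, x: torch.Tensor, domain: str) -> torch.Tensor:
        return self.decoders[domain](self.encoder(x))

# Training reconstructs each image with its own domain's decoder; swapping
# decoders at inference translates a real frame into the synthetic domain.
model = DeepfakeAutoencoder()
real_batch = torch.rand(8, 64 * 64 * 3)
synthetic_view = model(real_batch, "synthetic")  # real -> synthetic transfer
```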

Author countries

USA