Generation and Detection of Sign Language Deepfakes - A Linguistic and Visual Analysis

Authors: Shahzeb Naeem, Muhammad Riyyan Khan, Usman Tariq, Abhinav Dhall, Carlos Ivan Colon, Hasan Al-Nashash

Published: 2024-04-01 19:22:43+00:00

AI Summary

This research generates a large dataset of sign language deepfakes using a pose/style transfer model; the generated videos are vetted by a sign language expert for accuracy. The dataset is then used to establish a baseline for deepfake detection in sign language videos, addressing the lack of prior research in this area.

Abstract

This research explores the positive application of deepfake technology for upper body generation, specifically sign language for the Deaf and Hard of Hearing (DHoH) community. Given the complexity of sign language and the scarcity of experts, the generated videos are vetted by a sign language expert for accuracy. We construct a reliable deepfake dataset, evaluating its technical and visual credibility using computer vision and natural language processing models. The dataset, consisting of over 1200 videos featuring both seen and unseen individuals, is also used to detect deepfake videos targeting vulnerable individuals. Expert annotations confirm that the generated videos are comparable to real sign language content. Linguistic analysis, using textual similarity scores and interpreter evaluations, shows that the interpretation of generated videos is at least 90% similar to authentic sign language. Visual analysis demonstrates that convincingly realistic deepfakes can be produced, even for new subjects. Using a pose/style transfer model, we pay close attention to detail, ensuring hand movements are accurate and align with the driving video. We also apply machine learning algorithms to establish a baseline for deepfake detection on this dataset, contributing to the detection of fraudulent sign language videos.


Key findings
Linguistic analysis showed that the interpretation of generated videos is at least 90% similar to authentic sign language. Visual analysis revealed that machine learning models struggled to reliably distinguish between real and fake videos, indicating the realism of the generated deepfakes. A sign language expert also found it difficult to consistently identify the deepfakes.
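
As a rough illustration of how such a textual similarity score can be computed, the sketch below compares an interpreter's transcription of a real clip against that of its deepfake counterpart. The paper does not name the exact metric, so difflib's SequenceMatcher ratio over word tokens is used here purely as a plausible stand-in:

```python
# Hedged sketch: word-level similarity between two interpreter
# transcriptions. The paper reports >= 90% similarity but does not
# specify the metric; SequenceMatcher is an assumption, not theirs.
from difflib import SequenceMatcher

def transcription_similarity(reference: str, hypothesis: str) -> float:
    """Return a 0-1 similarity ratio between two transcriptions."""
    ref_tokens = reference.lower().split()
    hyp_tokens = hypothesis.lower().split()
    return SequenceMatcher(None, ref_tokens, hyp_tokens).ratio()

# Toy example transcriptions (illustrative only):
real_text = "the weather will be sunny tomorrow afternoon"
fake_text = "the weather will be sunny tomorrow in the afternoon"
print(f"similarity: {transcription_similarity(real_text, fake_text):.2%}")
```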
Approach
The authors use a modified First Order Motion Model (FOMM) for image animation, enhanced for hand accuracy and detail. The model extracts keypoints in an unsupervised manner, eliminating the need for pre-labeled pose data. Post-processing applies sharpening for improved clarity.
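
The sharpening step lends itself to a short sketch. The unsharp mask below uses standard OpenCV calls and is one common way to implement it; the FOMM inference itself is elided behind a hypothetical generate_frames wrapper, since the authors' exact modification is not reproduced here:

```python
# Sketch of the sharpening post-processing step, assuming an unsharp
# mask. Only the OpenCV code here is concrete; generate_frames is a
# hypothetical wrapper around the authors' modified FOMM.
import cv2
import numpy as np

def sharpen(frame: np.ndarray, amount: float = 0.5, sigma: float = 3.0) -> np.ndarray:
    """Unsharp mask: frame + amount * (frame - blurred)."""
    blurred = cv2.GaussianBlur(frame, (0, 0), sigma)
    return cv2.addWeighted(frame, 1.0 + amount, blurred, -amount, 0)

# for frame in generate_frames(source_image, driving_video):  # hypothetical
#     writer.write(sharpen(frame))
```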
Datasets
How2Sign dataset; a newly generated dataset of over 1200 sign language deepfake videos featuring both seen and unseen individuals.
Model(s)
Modified First Order Motion Model (FOMM), ConvLSTM, CNN, Random Forest, SVM.
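
Of the listed detectors, the ConvLSTM is the most video-specific. A minimal Keras sketch of such a binary real-vs-fake classifier follows; layer sizes, clip length, and input resolution are assumptions, not the paper's configuration:

```python
# Hedged sketch of a ConvLSTM deepfake-detection baseline.
# All hyperparameters below are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_convlstm_detector(frames=16, height=64, width=64, channels=3):
    """Binary real-vs-fake classifier over short video clips."""
    return models.Sequential([
        layers.Input(shape=(frames, height, width, channels)),
        layers.ConvLSTM2D(32, kernel_size=3, return_sequences=False),
        layers.BatchNormalization(),
        layers.GlobalAveragePooling2D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # P(video is a deepfake)
    ])

model = build_convlstm_detector()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```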
Author countries
UAE, Australia, USA