Towards Measuring Fairness in AI: the Casual Conversations Dataset

Authors: Caner Hazirbas, Joanna Bitton, Brian Dolhansky, Jacqueline Pan, Albert Gordo, Cristian Canton Ferrer

Published: 2021-04-06 22:48:22+00:00

AI Summary

This paper introduces the Casual Conversations dataset, a large-scale video dataset designed to evaluate the fairness and robustness of AI models across four dimensions: age, gender, apparent skin tone, and ambient lighting. Its distinguishing feature is that the age and gender annotations are self-reported by the subjects themselves, enabling a less biased evaluation of model performance.

Abstract

This paper introduces a novel dataset to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of ages, genders, apparent skin tones and ambient lighting conditions. Our dataset is composed of 3,011 subjects and contains over 45,000 videos, with an average of 15 videos per person. The videos were recorded in multiple U.S. states with a diverse set of adults in various age, gender and apparent skin tone groups. A key feature is that each subject agreed to participate and to have their likeness used. Additionally, our age and gender annotations are provided by the subjects themselves. A group of trained annotators labeled the subjects' apparent skin tone using the Fitzpatrick skin type scale. Annotations for videos recorded in low ambient lighting are also provided. As an application measuring the robustness of predictions across these attributes, we provide a comprehensive study of the top five winners of the DeepFake Detection Challenge (DFDC). Experimental evaluation shows that the winning models are less performant on some specific groups of people, such as subjects with darker skin tones, and thus may not generalize to all people. In addition, we evaluate state-of-the-art apparent age and gender classification methods. Our experiments provide a thorough analysis of these models in terms of their fair treatment of people from various backgrounds.
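
The Fitzpatrick scale distinguishes six skin types; fairness analyses of this kind often also report results over coarser buckets. Below is a minimal sketch of one such bucketing, assuming types are encoded as integers 1-6; the two-way split is an illustrative convention, not necessarily the paper's exact grouping.

```python
# All names here are illustrative; the dataset's official schema may differ.
def skin_tone_bucket(fitzpatrick_type: int) -> str:
    """Map a Fitzpatrick skin type (1-6) to a coarse two-way bucket."""
    if not 1 <= fitzpatrick_type <= 6:
        raise ValueError(f"expected a Fitzpatrick type in 1..6, got {fitzpatrick_type}")
    return "lighter (I-III)" if fitzpatrick_type <= 3 else "darker (IV-VI)"
```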


Key findings
Deepfake detection models were significantly biased toward lighter skin tones, performing markedly worse on subjects with darker skin tones. Age and gender classification models exhibited a similar bias, with lower accuracy on darker skin tones. These findings highlight the need for more inclusive datasets and fairness-aware model training.
Approach
The authors created a new dataset with diverse subjects and collected multiple videos per subject under varying lighting conditions. They then evaluated existing deepfake detection models and age/gender classification models on this dataset, analyzing their performance across different demographic groups to identify biases.
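
A minimal sketch of that per-group breakdown, assuming each prediction record carries a model score, a binary real/fake label, and a demographic attribute; all field names are hypothetical, not the paper's actual evaluation code.

```python
from collections import defaultdict
from sklearn.metrics import roc_auc_score

def auc_by_group(records, group_key="skin_tone"):
    """Compute ROC-AUC separately for each demographic group.

    records: iterable of dicts with hypothetical keys "label" (0/1),
    "score" (model output), and a grouping attribute such as "skin_tone".
    """
    by_group = defaultdict(lambda: ([], []))  # group -> (labels, scores)
    for r in records:
        labels, scores = by_group[r[group_key]]
        labels.append(r["label"])
        scores.append(r["score"])
    # Note: roc_auc_score requires both classes to appear in each group.
    return {g: roc_auc_score(y, s) for g, (y, s) in by_group.items()}

# A large gap between groups indicates the kind of bias reported for the
# DFDC winners on darker skin tones, e.g.:
# gaps = auc_by_group(predictions, group_key="skin_tone")
```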
Datasets
Casual Conversations dataset (containing over 45,000 videos of 3,011 subjects), DeepFake Detection Challenge (DFDC) dataset (partially overlapping with Casual Conversations)
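
As a sketch of how such annotations might be consumed, the snippet below assumes the dataset ships a single JSON file keyed by video, with self-reported "age"/"gender", an annotator-labeled "skin_tone" (1-6), and an "is_dark" low-light flag. The file name and every field are assumptions for illustration, not a verified schema.

```python
import json

# Hypothetical annotation file; field names below are assumed.
with open("CasualConversations.json") as f:
    annotations = json.load(f)

# Count videos flagged as recorded in low ambient lighting.
dark = sum(1 for meta in annotations.values() if meta.get("is_dark"))
print(f"{dark} of {len(annotations)} videos annotated as low-light")
```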
Model(s)
Top five winners of the DeepFake Detection Challenge (DFDC), Levi & Hassner age/gender classification model, LMTCNN age/gender classification model, LightFace age/gender classification model, DLIB face detection model
Author countries
USA, Germany, Spain, France, UK