Joint Engagement Classification using Video Augmentation Techniques for Multi-person Human-robot Interaction

Authors: Yubin Kim, Huili Chen, Sharifa Alghowinem, Cynthia Breazeal, Hae Won Park

Published: 2022-12-28 23:52:55+00:00

AI Summary

This paper presents a hybrid framework for identifying parent-child dyad joint engagement using deep learning and video augmentation techniques. It trains RGB frame- and skeleton-based models, applies four video augmentations to improve classification performance, and introduces a behavior-based metric for evaluating model interpretability.

Abstract

Affect understanding capability is essential for social robots to autonomously interact with a group of users in an intuitive and reciprocal way. However, the challenge of multi-person affect understanding comes not only from accurately perceiving each user's affective state (e.g., engagement) but also from recognizing the affect interplay between the members (e.g., joint engagement), which presents as complex but subtle nonverbal exchanges between them. Here we present a novel hybrid framework for identifying a parent-child dyad's joint engagement by combining a deep learning framework with various video augmentation techniques. Using a dataset of parent-child dyads reading storybooks together with a social robot at home, we first train RGB frame- and skeleton-based joint engagement recognition models on datasets augmented with four video augmentation techniques (General Aug, DeepFake, CutOut, and Mixed) to improve joint engagement classification performance. Second, we present experimental results on the use of the trained models in the robot-parent-child interaction context. Third, we introduce a behavior-based metric for evaluating the learned representations of the models to investigate model interpretability when recognizing joint engagement. This work serves as a first step toward fully unlocking the potential of end-to-end video understanding models that are pre-trained on large public datasets and combined with data augmentation and visualization techniques for affect recognition in multi-person human-robot interaction in the wild.


Key findings
General Aug and DeepFake augmentations significantly improved the performance of end-to-end models for joint engagement recognition. Skeleton-based models performed worse than end-to-end models. Grad-CAM visualization showed that models trained with augmentations focused more on the dyads' faces and bodies, indicating better interpretability.
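A minimal sketch of how this kind of face/body attention check could be quantified is shown below, assuming a per-frame Grad-CAM heatmap and face/body bounding boxes from an off-the-shelf detector; the function name and the overlap ratio are illustrative and not necessarily the paper's exact behavior-based metric.

```python
import numpy as np

def attention_overlap_ratio(cam, boxes):
    """Fraction of Grad-CAM activation mass falling inside face/body boxes.

    cam   : 2-D non-negative heatmap for one frame, shape (H, W).
    boxes : iterable of (x1, y1, x2, y2) face/body bounding boxes.
    Returns 1.0 when all attention lies on the dyad, 0.0 when none does.
    """
    cam = np.clip(cam, 0.0, None)
    total = cam.sum()
    if total == 0:
        return 0.0
    mask = np.zeros(cam.shape, dtype=bool)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = True
    return float(cam[mask].sum() / total)

# Toy check: all activation inside a 2x2 "face" region of a 4x4 heatmap.
cam = np.zeros((4, 4))
cam[0:2, 0:2] = 1.0
print(attention_overlap_ratio(cam, [(0, 0, 2, 2)]))  # -> 1.0
```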
Approach
The authors combine deep learning models pre-trained on action recognition datasets with four video augmentation techniques (General Aug, DeepFake, CutOut, and Mixed). The resulting augmented datasets are used to train RGB frame- and skeleton-based joint engagement recognition models. A novel behavior-based metric, built on Grad-CAM, evaluates model interpretability.
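Of the four augmentations, CutOut is the simplest to sketch. The snippet below applies a temporally consistent CutOut to a video clip; the (T, H, W, C) layout and patch size are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def cutout_clip(clip, patch=40, rng=None):
    """Zero out the same square patch in every frame of a (T, H, W, C) clip,
    so the occlusion stays temporally consistent across the video."""
    rng = rng or np.random.default_rng()
    _, h, w, _ = clip.shape
    y = int(rng.integers(0, max(1, h - patch)))
    x = int(rng.integers(0, max(1, w - patch)))
    out = clip.copy()
    out[:, y:y + patch, x:x + patch, :] = 0
    return out

# Toy usage on a random 8-frame RGB clip.
clip = np.random.rand(8, 224, 224, 3).astype(np.float32)
augmented = cutout_clip(clip, patch=40)
```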
Datasets
A dataset of parent-child dyads reading storybooks together with a social robot at home, annotated with the Joint Engagement Rating Inventory (JERI) on a five-point scale that was converted to three classes (low, medium, high).
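The exact binning from the five-point scale to the three classes is not specified here, so the thresholds in the snippet below are an assumed, illustrative mapping.

```python
def jeri_to_class(score):
    """Map a five-point JERI rating (1-5) to a coarse joint-engagement class.
    The cut points are illustrative assumptions, not the paper's exact scheme."""
    if score <= 2:
        return "low"
    if score == 3:
        return "medium"
    return "high"

print([jeri_to_class(s) for s in range(1, 6)])  # ['low', 'low', 'medium', 'high', 'high']
```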
Model(s)
RGB frame-based models (TimeSformer, I3D, SlowFast) and skeleton-based models (CTR-GCN, MS-G3D, ST-GCN++, ST-GCN). The models were pre-trained on large public action recognition datasets and then fine-tuned on the parent-child interaction dataset.
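The paper's backbones are typically loaded from their own action-recognition codebases; as a rough stand-in, the sketch below shows the general fine-tuning recipe using torchvision's Kinetics-pretrained r3d_18, replacing its action-recognition head with a three-class joint-engagement head. The backbone choice, learning rate, and input shape are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

# Stand-in backbone pre-trained on Kinetics-400 action recognition.
model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)

# Swap the 400-way action head for a 3-class joint-engagement head (low/medium/high).
model.fc = nn.Linear(model.fc.in_features, 3)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # illustrative learning rate
criterion = nn.CrossEntropyLoss()

# One toy fine-tuning step on a random (batch, C, T, H, W) clip.
clips = torch.randn(2, 3, 16, 112, 112)
labels = torch.tensor([0, 2])  # low, high
model.train()
optimizer.zero_grad()
loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()
```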
Author countries
USA