Extending Information Bottleneck Attribution to Video Sequences

Authors: Veronika Solopova, Lucas Schmidt, Dorothea Kolossa

Published: 2025-01-28 12:19:44+00:00

AI Summary

VIBA, a novel approach for explainable video classification, adapts Information Bottlenecks for Attribution (IBA) to video sequences. Applied to deepfake detection with the Xception model for spatial features and a VGG11-based optical flow model for motion dynamics, VIBA generates temporally and spatially consistent explanations that align with human annotations.

Abstract

We introduce VIBA, a novel approach for explainable video classification by adapting Information Bottlenecks for Attribution (IBA) to video sequences. While most traditional explainability methods are designed for image models, our IBA framework addresses the need for explainability in temporal models used for video analysis. To demonstrate its effectiveness, we apply VIBA to video deepfake detection, testing it on two architectures: the Xception model for spatial features and a VGG11-based model for capturing motion dynamics through optical flow. Using a custom dataset that reflects recent deepfake generation techniques, we adapt IBA to create relevance and optical flow maps, visually highlighting manipulated regions and motion inconsistencies. Our results show that VIBA generates temporally and spatially consistent explanations, which align closely with human annotations, thus providing interpretability for video classification and particularly for deepfake detection.


Key findings

VIBA's explanations showed high temporal and spatial consistency, and inserting the IBA bottleneck did not significantly impact the models' deepfake detection performance. The explanations aligned moderately with human annotations, highlighting regions such as the lips, mouth, brows, eyes, and forehead as indicative of manipulation.

Approach

VIBA adapts Information Bottlenecks for Attribution (IBA) to video sequences, applying it to two models: Xception for spatial features and a VGG11-based model for optical flow. IBA injects noise into an intermediate feature map and learns, per sample, how much noise each region can absorb before the prediction degrades; the regions that must remain noise-free are the most influential on the prediction, yielding relevance maps for spatial frames and optical flow maps for motion. A minimal sketch follows.
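The snippet below is a minimal PyTorch reconstruction of the per-sample information bottleneck that IBA (Schulz et al., 2020) builds on, not the authors' implementation: the layer choice, the trade-off weight beta, the step count, and the names Bottleneck and relevance_map are illustrative assumptions. A mask lam = sigmoid(alpha) is optimised so the prediction survives on as little information as possible; the per-unit KL term (the "capacity") is read out as the attribution map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Bottleneck(nn.Module):
    """Per-sample IBA bottleneck: z = lam * r + (1 - lam) * eps,
    where eps ~ N(mu, sigma^2) matches the layer's activation statistics."""

    def __init__(self, mu, sigma):
        super().__init__()
        self.mu, self.sigma = mu, sigma   # e.g. shape (C, 1, 1), precomputed
        self.alpha = None                 # mask logits, lam = sigmoid(alpha)
        self.capacity = None              # per-unit KL, read out as relevance

    def forward(self, r):
        if self.alpha is None:            # lazily match the feature-map shape
            self.alpha = nn.Parameter(torch.full_like(r, 5.0))  # lam ~ 1: no noise
        lam = torch.sigmoid(self.alpha)
        eps = self.mu + self.sigma * torch.randn_like(r)
        # KL[N(lam * r_norm, (1 - lam)^2) || N(0, 1)] per unit, in normalised space
        r_norm = (r - self.mu) / (self.sigma + 1e-8)
        var = (1.0 - lam) ** 2
        self.capacity = 0.5 * ((lam * r_norm) ** 2 + var - torch.log(var + 1e-8) - 1.0)
        return lam * r + (1.0 - lam) * eps


def relevance_map(model, layer, x, target, mu, sigma, beta=10.0, steps=10):
    """Optimise the mask so the correct prediction survives on as little
    information as possible; the capacity map is the attribution."""
    bn = Bottleneck(mu, sigma)
    handle = layer.register_forward_hook(lambda m, inp, out: bn(out))
    model(x)                                    # first pass initialises alpha
    opt = torch.optim.Adam([bn.alpha], lr=1.0)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(x), target) + beta * bn.capacity.mean()
        loss.backward()
        opt.step()
    handle.remove()
    return bn.capacity.sum(dim=1).detach()      # channels -> spatial relevance
```

In practice mu and sigma are estimated from the chosen layer's activations on held-out data, and the capacity map is upsampled to frame resolution; running the same procedure on the spatial (Xception) and flow (VGG11) branches would yield the relevance and optical flow maps described above.
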
Datasets

A custom dataset combining videos from FaceForensics++, Celeb-DF, DFDC, DFD, DeeperForensics, FakeAVCeleb, AV-Deepfake1M, KoDF, YouTube, and the TV show "Deep Fake Neighbour Wars".

Model(s)

Xception (for spatial features) and a VGG11-based model (for motion dynamics through optical flow).
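The paper does not detail the flow front end, so the following is a hedged sketch of how two-channel optical flow inputs for the VGG11-based branch could be prepared: OpenCV's Farneback estimator, the 224x224 input size, the two-class head, and the name flow_clip are assumptions for illustration.

```python
import cv2
import numpy as np
import torch.nn as nn
from torchvision.models import vgg11


def flow_clip(video_path, size=(224, 224)):
    """Dense optical flow between consecutive frames, stacked as a
    (T, 2, H, W) array of (dx, dy) displacements for a flow CNN."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise ValueError(f"could not read {video_path}")
    prev = cv2.cvtColor(cv2.resize(prev, size), cv2.COLOR_BGR2GRAY)
    flows = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(cv2.resize(frame, size), cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow.transpose(2, 0, 1))   # (H, W, 2) -> (2, H, W)
        prev = gray
    cap.release()
    return np.stack(flows)                      # needs >= 2 frames


# VGG11 expects 3-channel images; swap the first conv for 2-channel flow.
motion_model = vgg11(num_classes=2)
motion_model.features[0] = nn.Conv2d(2, 64, kernel_size=3, padding=1)
```

Replacing the first convolution is one common way to feed (dx, dy) flow fields into an ImageNet-style CNN; the paper's exact input encoding may differ.
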
Author countries

Germany