Identifying the Context Shift between Test Benchmarks and Production Data

Authors: Matthew Groh

Published: 2022-07-03 14:54:54+00:00

AI Summary

This paper introduces "context shift" to explain the discrepancy between machine learning model performance on benchmark datasets and on real-world production data. It proposes three methods for addressing this shift: leveraging human intuition and expert knowledge, employing dynamic benchmarking, and clarifying model limitations.

Abstract

Machine learning models are often brittle on production data despite achieving high accuracy on benchmark datasets. Benchmark datasets have traditionally served dual purposes: first, benchmarks offer a standard on which machine learning researchers can compare different methods, and second, benchmarks provide a model, albeit imperfect, of the real world. The incompleteness of test benchmarks (and the data upon which models are trained) hinders robustness in machine learning, enables shortcut learning, and leaves models systematically prone to err on out-of-distribution and adversarially perturbed data. The mismatch between a single static benchmark dataset and a production dataset has traditionally been described as a dataset shift. In an effort to clarify how to address the mismatch between test benchmarks and production data, we introduce context shift to describe semantically meaningful changes in the underlying data generation process. Moreover, we identify three methods for addressing context shift that would otherwise lead to model prediction errors: first, we describe how human intuition and expert knowledge can identify semantically meaningful features upon which models systematically fail, second, we detail how dynamic benchmarking - with its focus on capturing the data generation process - can promote generalizability through corroboration, and third, we highlight that clarifying a model's limitations can reduce unexpected errors. Robust machine learning is focused on model performance beyond benchmarks, and as such, we consider three model organism domains - facial expression recognition, deepfake detection, and medical diagnosis - to highlight how implicit assumptions in benchmark tasks lead to errors in practice. By paying close attention to the role of context, researchers can design more comprehensive benchmarks, reduce context shift errors, and increase generalizability.
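
For reference, dataset shift is conventionally decomposed into covariate, label (prior), and concept shift. These are textbook definitions, not contributions of this paper; they are included here only to contrast with context shift, which concerns semantically meaningful changes in the underlying data generation process rather than purely statistical changes in the joint distribution.

```latex
% Standard decomposition of dataset shift between a benchmark distribution p_b
% and a production distribution p_p over inputs x and labels y.
\begin{align*}
\text{Covariate shift:}      \quad & p_b(x) \neq p_p(x),      & p_b(y \mid x) &= p_p(y \mid x) \\
\text{Label (prior) shift:}  \quad & p_b(y) \neq p_p(y),      & p_b(x \mid y) &= p_p(x \mid y) \\
\text{Concept shift:}        \quad & p_b(y \mid x) \neq p_p(y \mid x) &&
\end{align*}
```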


Key findings
The paper highlights the limitations of relying solely on static benchmarks for evaluating machine learning models. It demonstrates how context shift, stemming from differences in data generation processes, leads to significant performance gaps between benchmark and real-world settings across several domains, including facial expression recognition, deepfake detection, and medical diagnosis. The proposed methods offer a pathway toward more robust and generalizable models.
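
As an illustration of how such a gap is typically surfaced, the sketch below trains a model once and compares its accuracy on the held-out benchmark split against a production-proxy sample drawn from a different data generation process. The model choice and data arguments are hypothetical placeholders, not artifacts from the paper.

```python
# Minimal sketch: quantify the benchmark-vs-production accuracy gap for one model.
# The caller supplies data from two different generation processes (e.g. lab-posed
# vs. in-the-wild facial expressions); LogisticRegression is just a stand-in model.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def context_shift_gap(X_train, y_train, X_bench, y_bench, X_prod, y_prod):
    """Train once, then compare benchmark accuracy against accuracy on
    data sampled from a different data generation process."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    bench_acc = accuracy_score(y_bench, model.predict(X_bench))
    prod_acc = accuracy_score(y_prod, model.predict(X_prod))
    return bench_acc, prod_acc, bench_acc - prod_acc
```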
Approach
The paper analyzes the mismatch between benchmark datasets and real-world data using the concept of context shift, focusing on semantically meaningful changes in data generation processes. It proposes using human intuition, dynamic benchmarking, and clear communication of model limitations to improve robustness.
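
The dynamic benchmarking idea can be sketched as a loop in which humans propose examples, the examples that fool the current model are collected into a new, harder benchmark round, and the model is retrained on the accumulated data. The helper names below are illustrative placeholders rather than an implementation from the paper.

```python
# Illustrative sketch of a dynamic benchmarking loop (human-and-model-in-the-loop
# dataset creation). All helpers are hypothetical: `collect_human_examples` stands
# in for annotators probing the model, `train_model` for any training procedure.
def dynamic_benchmark(train_model, collect_human_examples, initial_data, rounds=3):
    data = list(initial_data)                    # (x, y) pairs accumulated across rounds
    benchmark_rounds = []
    model = train_model(data)
    for _ in range(rounds):
        proposed = collect_human_examples(model)             # humans try to fool the model
        fooling = [(x, y) for x, y in proposed
                   if model.predict([x])[0] != y]            # keep only model failures
        benchmark_rounds.append(fooling)                     # new, harder test round
        data.extend(fooling)                                 # fold failures into training data
        model = train_model(data)                            # retrain before the next round
    return model, benchmark_rounds
```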
Datasets
ImageNet, CIFAR-10, DeepFake Detection Competition Dataset (DFDC), Presidential Deepfakes Dataset, Protecting World Leaders against Deepfakes Dataset, Diverse Dermatology Images (DDI), SFEW, MMI, DISFA, FER2013, FERA, CK+, MultiPie, Fitzpatrick 17k
Model(s)
AlexNet, unspecified state-of-the-art deepfake detection models, unspecified state-of-the-art skin disease classification models
Author countries
USA