Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection

Authors: Cong Zeng, Shengkun Tang, Yuanzhou Chen, Zhiqiang Shen, Wenchao Yu, Xujiang Zhao, Haifeng Chen, Wei Cheng, Zhiqiang Xu

Published: 2025-10-07 08:14:45+00:00

Journal Ref: NeurIPS 2025

AI Summary

This paper proposes reframing LLM-generated text detection as an Out-of-Distribution (OOD) detection problem, arguing that human texts are diverse outliers while machine-generated texts are in-distribution. The developed framework employs one-class and score-based learning methods to model the compact distribution of LLM outputs. This OOD-based approach achieves superior robustness and generalization across multilingual, attacked, and unseen-model/domain text settings.

Abstract

The rapid advancement of large language models (LLMs) such as ChatGPT, DeepSeek, and Claude has significantly increased the presence of AI-generated text in digital communication. This trend has heightened the need for reliable detection methods to distinguish between human-authored and machine-generated content. Existing approaches, both zero-shot methods and supervised classifiers, largely conceptualize this task as a binary classification problem, often leading to poor generalization across domains and models. In this paper, we argue that such a binary formulation fundamentally mischaracterizes the detection task by assuming a coherent representation of human-written texts. In reality, human texts do not constitute a unified distribution, and their diversity cannot be effectively captured through limited sampling. This causes previous classifiers to memorize observed OOD characteristics rather than learn the essence of `non-ID' behavior, limiting generalization to unseen human-authored inputs. Based on this observation, we propose reframing the detection task as an out-of-distribution (OOD) detection problem, treating human-written texts as distributional outliers while machine-generated texts are in-distribution (ID) samples. To this end, we develop a detection framework using one-class learning methods, including DeepSVDD and HRN, and score-based learning techniques such as the energy-based method, enabling robust and generalizable performance. Extensive experiments across multiple datasets validate the effectiveness of our OOD-based approach. Specifically, the OOD-based method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on the DeepFake dataset. Moreover, we test our detection framework on multilingual, attacked, and unseen-model and unseen-domain text settings, demonstrating the robustness and generalizability of our framework. Code, pretrained weights, and a demo will be released.


Key findings
The OOD-based methods consistently outperformed baseline binary classification and zero-shot detectors. Specifically, DeepSVDD achieved 98.3% AUROC and AUPR with only 8.9% FPR95 on the DeepFake dataset. The framework demonstrated strong robustness against adversarial attacks on the RAID dataset and superior generalizability in multilingual, unseen-model, and unseen-domain scenarios, validating the effectiveness of the OOD reformulation.
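The reported metrics follow the standard OOD-detection convention: FPR95 is the fraction of OOD samples (here, human texts) mistakenly accepted as in-distribution when the threshold admits 95% of ID samples (machine texts). A minimal sketch of that computation, assuming higher score means more anomalous:

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR on OOD samples at the threshold that accepts 95% of ID samples.

    Convention: higher score = more anomalous (OOD); a sample is
    accepted as ID when its score falls below the threshold.
    """
    # Threshold at the 95th percentile of ID scores keeps 95% of ID samples.
    thresh = np.percentile(id_scores, 95)
    # False positives: OOD samples scoring below the threshold,
    # i.e. wrongly accepted as in-distribution.
    return float(np.mean(ood_scores < thresh))

# Toy scores: well-separated ID and OOD populations give FPR95 near zero.
rng = np.random.default_rng(0)
id_scores = rng.uniform(0.0, 1.0, size=1000)
ood_scores = rng.uniform(5.0, 6.0, size=1000)
print(fpr_at_95_tpr(id_scores, ood_scores))  # 0.0 for these disjoint ranges
```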
Approach
The authors reformulate LLM-generated text detection as an Out-of-Distribution (OOD) task, treating machine-generated text as in-distribution (ID) samples and human-written text as OOD outliers due to its inherent diversity. They develop a detection framework utilizing one-class learning methods (DeepSVDD, HRN) and score-based learning (Energy-based method) to model the distribution of machine-generated texts, often combined with a contrastive loss for representation learning.
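The one-class idea can be illustrated with a toy DeepSVDD-style scorer: fix a hypersphere center from ID (machine-text) embeddings and use squared distance to that center as the anomaly score. This is a minimal NumPy sketch, not the authors' implementation; the random vectors stand in for encoder embeddings (e.g. from SimCSE-RoBERTa), and the cluster parameters are invented for illustration.

```python
import numpy as np

def deep_svdd_scores(embeddings, center):
    """DeepSVDD-style anomaly score: squared distance to the center."""
    return np.sum((embeddings - center) ** 2, axis=1)

rng = np.random.default_rng(0)
# ID (machine-generated) embeddings: a compact cluster.
machine = rng.normal(loc=0.0, scale=0.5, size=(200, 8))
# OOD (human-written) embeddings: diffuse, far from the ID cluster.
human = rng.normal(loc=3.0, scale=2.0, size=(200, 8))

# DeepSVDD typically fixes the center as the mean of ID representations.
center = machine.mean(axis=0)
id_scores = deep_svdd_scores(machine, center)
ood_scores = deep_svdd_scores(human, center)

# Human (OOD) texts receive much higher anomaly scores on this toy data.
print(ood_scores.mean() > id_scores.mean())  # True
```

In the full framework this distance is computed on learned representations, with the encoder trained (here jointly with a contrastive loss) so that machine-generated texts form a tight cluster around the center.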
Datasets
DeepFake, M4 (M4-multilingual), RAID, EvoBench (XSum dataset with GPT-4o outputs).
Model(s)
Text Encoder (e.g., RoBERTa; specifically SimCSE-RoBERTa-base for best results), DeepSVDD, HRN (a holistic approach to one-class learning), Energy-based method.
Author countries
United Arab Emirates, United States