Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection

Authors: Cong Zeng, Shengkun Tang, Yuanzhou Chen, Zhiqiang Shen, Wenchao Yu, Xujiang Zhao, Haifeng Chen, Wei Cheng, Zhiqiang Xu

Published: 2025-10-07 08:14:45+00:00

AI Summary

This paper proposes reframing the detection of Large Language Model (LLM) generated text as an Out-of-Distribution (OOD) detection problem. Binary classifiers generalize poorly because human text is highly diverse and open-ended, so it behaves like a set of distributional outliers rather than a single coherent class. The proposed framework instead treats machine-generated text as the compact in-distribution (ID) data and flags human text as OOD, which enables robust detection that generalizes well.

Abstract

The rapid advancement of large language models (LLMs) such as ChatGPT, DeepSeek, and Claude has significantly increased the presence of AI-generated text in digital communication. This trend has heightened the need for reliable detection methods to distinguish between human-authored and machine-generated content. Existing approaches, both zero-shot methods and supervised classifiers, largely conceptualize this task as a binary classification problem, often leading to poor generalization across domains and models. In this paper, we argue that such a binary formulation fundamentally mischaracterizes the detection task by assuming a coherent representation of human-written texts. In reality, human texts do not constitute a unified distribution, and their diversity cannot be effectively captured through limited sampling. This causes previous classifiers to memorize observed OOD characteristics rather than learn the essence of 'non-ID' behavior, limiting generalization to unseen human-authored inputs. Based on this observation, we propose reframing the detection task as an out-of-distribution (OOD) detection problem, treating human-written texts as distributional outliers and machine-generated texts as in-distribution (ID) samples. To this end, we develop a detection framework using one-class learning methods, including DeepSVDD and HRN, and score-based learning techniques such as the energy-based method, enabling robust and generalizable performance. Extensive experiments across multiple datasets validate the effectiveness of our OOD-based approach. Specifically, the OOD-based method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on the DeepFake dataset. Moreover, we test our detection framework in multilingual, attacked, unseen-model, and unseen-domain text settings, demonstrating the robustness and generalizability of our framework. Code, pretrained weights, and a demo will be released.

Key findings

The OOD-based approach consistently outperformed prior supervised and zero-shot baselines across all metrics and benchmarks. Specifically, the DeepSVDD variant achieved 98.3% AUROC/AUPR and 8.9% FPR95 on the DeepFake dataset. The framework demonstrated superior robustness against adversarial attacks (RAID dataset) and excellent generalizability in multilingual and unseen domain/model scenarios.
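
For readers unfamiliar with FPR95, the metric reported above is the false positive rate measured at the threshold where 95% of true positives are retained. The following is a minimal sketch of that computation, not the paper's evaluation code. It assumes that machine-generated (ID) text is the positive class and that a higher score means "more ID-like"; the function name and the toy scores are hypothetical.

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """False positive rate at the threshold that keeps 95% of ID samples.

    Assumption: higher score = more in-distribution (machine-generated).
    A false positive is an OOD (human) sample scored at or above the threshold.
    """
    ranked = sorted(id_scores, reverse=True)
    # Smallest k with k / n >= 0.95, computed in integers to avoid float rounding.
    k = (95 * len(ranked) + 99) // 100
    threshold = ranked[k - 1]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


# Toy example: 20 ID scores (1..20) and 4 OOD scores.
toy_id_scores = list(range(1, 21))
toy_ood_scores = [0, 1, 2, 3]
toy_fpr95 = fpr_at_95_tpr(toy_id_scores, toy_ood_scores)
```

In the toy example, keeping 19 of the 20 ID samples puts the threshold at 2, so two of the four OOD scores are wrongly accepted. Lower FPR95 is better.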

Approach

The detection framework utilizes one-class learning methods (DeepSVDD, HRN) or score-based techniques (Energy-based method) combined with a contrastive loss. The model is trained solely on LLM-generated text (ID samples) to learn a compact representation boundary, flagging any input text that deviates significantly from this cluster (i.e., human-written text) as OOD.
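
The two scoring ideas named above can be illustrated with a small sketch. This is a simplification under stated assumptions, not the authors' implementation: the real DeepSVDD trains the encoder to pull ID embeddings toward a center, whereas here the embeddings are fixed toy vectors and the center is their mean. The contrastive loss, HRN, and the SimCSE-RoBERTa encoder are omitted. All function names, the radius rule, and the data are hypothetical.

```python
import math


def centroid(embeddings):
    """Mean vector of the ID (machine-generated) embeddings."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(vec[d] for vec in embeddings) / n for d in range(dim)]


def svdd_score(embedding, center):
    """DeepSVDD-style anomaly score: squared distance to the center."""
    return sum((x - c) ** 2 for x, c in zip(embedding, center))


def energy_score(logits, temperature=1.0):
    """Energy score E(x) = -T * log(sum_i exp(logit_i / T)), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    return -temperature * (m + math.log(sum(math.exp(z - m) for z in scaled)))


# Toy ID embeddings standing in for encoder outputs on machine-generated text.
machine_embeddings = [[0.0, 0.0], [2.0, 2.0], [1.0, 1.0], [1.0, 1.0]]
center = centroid(machine_embeddings)

# Assumed decision rule: the largest training distance serves as the radius.
radius = max(svdd_score(e, center) for e in machine_embeddings)


def is_human(embedding):
    """Flag an input as OOD (human-written) if it falls outside the radius."""
    return svdd_score(embedding, center) > radius
```

With these toy values, a point near the cluster such as `[1.5, 1.0]` stays inside the radius, while a distant point such as `[5.0, 5.0]` is flagged as OOD. In practice the threshold would be tuned on held-out data rather than set to the maximum training distance.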

Datasets

DeepFake, M4 (multilingual setting), RAID

Model(s)

DeepSVDD, HRN, Energy-based method, SimCSE-RoBERTa (Text Encoder backbone)

Author countries

UAE, USA
