Honeyfile Camouflage: Hiding Fake Files in Plain Sight

Authors: Roelien C. Timmer, David Liebowitz, Surya Nepal, Salil S. Kanhere

Published: 2024-05-08 02:01:17+00:00

AI Summary

This paper addresses the challenge of camouflaging honeyfiles (fake files used for intrusion detection) within real filesystems by focusing on filename generation. Two metrics are developed to quantify filename camouflage: one based on simple averaging and another using clustering with mixture fitting, both evaluated on a GitHub software repository dataset.

Abstract

Honeyfiles are a particularly useful type of honeypot: fake files deployed to detect and infer information from malicious behaviour. This paper considers the challenge of naming honeyfiles so they are camouflaged when placed amongst real files in a file system. Based on cosine distances in semantic vector spaces, we develop two metrics for filename camouflage: one based on simple averaging and one on clustering with mixture fitting. We evaluate and compare the metrics, showing that both perform well on a publicly available GitHub software repository dataset.


Key findings
Both the simple averaging metric and the clustering-based metric effectively distinguished honeyfile filenames from filenames sampled from other repositories. For large directories, the simple camouflage metric showed a greater separation between the two score distributions, which makes it an attractive choice given its lower computational cost.
Approach
The authors propose two metrics for honeyfile filename camouflage. The first calculates the cosine distance between a honeyfile's name embedding and the average embedding of existing filenames. The second uses von Mises-Fisher mixture modeling to cluster filenames and calculates the distance to the nearest cluster centroid.
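The two distances described above can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes filename embeddings are already available as NumPy vectors (for example from FastText), and every function name here is hypothetical. Because fitting a von Mises-Fisher mixture needs a dedicated routine, spherical k-means is swapped in as a simpler stand-in for obtaining cluster centroids on the unit sphere.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def simple_camouflage_distance(honey_vec, file_vecs):
    """Distance from the honeyfile embedding to the mean of the real filename embeddings."""
    centroid = np.mean(np.asarray(file_vecs, dtype=float), axis=0)
    return cosine_distance(honey_vec, centroid)


def spherical_kmeans(file_vecs, k, iters=20, seed=0):
    """Stand-in for mixture fitting (assumption): k-means with cosine similarity."""
    X = np.asarray(file_vecs, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # project rows onto the unit sphere
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmax(X @ centroids.T, axis=1)  # assign to most similar centroid
        for j in range(k):
            total = X[labels == j].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:  # keep the old centroid if the cluster is empty or degenerate
                centroids[j] = total / norm
    return centroids


def nearest_centroid_distance(honey_vec, centroids):
    """Distance from the honeyfile embedding to the closest cluster centroid."""
    return min(cosine_distance(honey_vec, c) for c in centroids)
```

With toy two-dimensional vectors, `simple_camouflage_distance([1, 1], [[1, 0], [0, 1]])` compares the candidate name against the directory average, while `nearest_centroid_distance` compares it against whichever cluster of existing names it most resembles. How either distance is turned into a final camouflage score is left open here, since the summary does not specify it.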
Datasets
Publicly available GitHub software repository dataset and Google BigQuery GitHub dataset
Model(s)
FastText (for word embeddings)
Author countries
Australia