Casting out Demons: Sanitizing Training Data for Anomaly Sensors

1 May 2008

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

No. 10816011, p. 81-95
https://doi.org/10.1109/sp.2008.11

Abstract

The efficacy of anomaly detection (AD) sensors depends heavily on the quality of the data used to train them. Artificial or contrived training data may not provide a realistic view of the deployment environment. Most realistic data sets are dirty; that is, they contain a number of attacks or anomalous events. The size of these high-quality training data sets makes manual removal or labeling of attack data infeasible. As a result, sensors trained on this data can miss attacks and their variations. We propose extending the training phase of AD sensors (in a manner agnostic to the underlying AD algorithm) to include a sanitization phase. This phase generates multiple models conditioned on small slices of the training data. We use these "micro-models" to produce provisional labels for each training input, and we combine the micro-models in a voting scheme to determine which parts of the training data may represent attacks. Our results suggest that this phase automatically and significantly improves the quality of unlabeled training data by making it as "attack-free" and "regular" as possible in the absence of absolute ground truth. We also show how a collaborative approach that combines models from different networks or domains can further refine the sanitization process to thwart targeted training or mimicry attacks against a single site.
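
The sanitization phase described in the abstract (micro-models built on small slices, then combined by voting) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the underlying detector, the slice size, or the voting rule, so everything concrete here is an assumption made for illustration: the toy exact-match detector, the unweighted vote, and the names `split_into_slices`, `train_micro_model`, `is_anomalous`, `sanitize`, `slice_size`, and `vote_threshold` are all hypothetical.

```python
from typing import Hashable, List, Sequence, Set, Tuple


def split_into_slices(data: Sequence[Hashable], slice_size: int) -> List[Sequence[Hashable]]:
    """Cut the training data into consecutive slices of at most slice_size items."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return [data[i:i + slice_size] for i in range(0, len(data), slice_size)]


def train_micro_model(data_slice: Sequence[Hashable]) -> Set[Hashable]:
    """Toy stand-in for an AD algorithm: remember every item seen in the slice.

    Assumption: a real sensor would learn a statistical or n-gram model instead.
    """
    return set(data_slice)


def is_anomalous(model: Set[Hashable], item: Hashable) -> bool:
    """Provisional label from one micro-model: anomalous if the item was never seen."""
    return item not in model


def sanitize(
    data: Sequence[Hashable], slice_size: int, vote_threshold: float
) -> Tuple[List[Hashable], List[Hashable]]:
    """Split data into (clean, suspect) using an unweighted vote of micro-models.

    An item is treated as suspect when the fraction of micro-models that label
    it anomalous exceeds vote_threshold (an assumed, tunable parameter).
    """
    models = [train_micro_model(s) for s in split_into_slices(data, slice_size)]
    if not models:
        return [], []

    clean: List[Hashable] = []
    suspect: List[Hashable] = []
    for item in data:
        votes = sum(is_anomalous(m, item) for m in models)
        score = votes / len(models)
        if score > vote_threshold:
            suspect.append(item)
        else:
            clean.append(item)
    return clean, suspect


if __name__ == "__main__":
    # Twelve toy "requests"; one unusual item appears in a single slice.
    traffic = ["a", "b", "a", "b", "a", "b", "a", "EVIL", "b", "b", "a", "a"]
    clean, suspect = sanitize(traffic, slice_size=3, vote_threshold=0.5)
    print("suspect:", suspect)
    print("clean size:", len(clean))
```

In this sketch an item that appears in only one slice is outvoted by the micro-models trained on the other slices, which is the intuition behind removing localized attacks from otherwise regular data. Raising `vote_threshold` keeps more data at the risk of retaining attacks; lowering it removes more data at the risk of discarding legitimate but rare inputs.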

