Many outlier detection algorithms have been developed and published over the years. In this post, we benchmark 19 of them.
Methodology
In this section, we will discuss the outlier detection algorithms in our benchmark, the oracle algorithm, data sets, and the metric for evaluation.
Algorithms
We will use PyOD [1], the Python toolkit for detecting outlying objects in multivariate data. PyOD is an open-source Python library that implements many outlier detection algorithms. In our benchmarking study, we focus on the following 19 algorithms. As the list shows, they range from algorithms published last century to ones from 2020, and they cover different types, from linear models to outlier ensembles, from probabilistic methods to neural networks, and so on.
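For reference, here is a minimal sketch of how a few of these detectors can be instantiated with PyOD. Only a subset of the 19 algorithms is shown; the module and class names follow the PyOD documentation, and default hyperparameters are assumed.

```python
# A few of the PyOD detectors used in the study (not the full list of 19).
from pyod.models.mcd import MCD            # Minimum Covariance Determinant
from pyod.models.iforest import IForest    # Isolation Forest
from pyod.models.knn import KNN            # k-Nearest Neighbors
from pyod.models.lof import LOF            # Local Outlier Factor
from pyod.models.pca import PCA            # Principal Component Analysis
from pyod.models.hbos import HBOS          # Histogram-Based Outlier Score

detectors = {
    "MCD": MCD(),
    "IForest": IForest(),
    "KNN": KNN(),
    "LOF": LOF(),
    "PCA": PCA(),
    "HBOS": HBOS(),
    # ... the remaining algorithms from the table are added the same way
}
```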
The Oracle algorithm
In order to evaluate and compare the algorithms above in a fair manner, we need a baseline. We define the Oracle algorithm, which always returns the result of the best of the 19 algorithms for a given data set. That is, the Oracle 'knows' which algorithm is best for each data set, hence the name 'Oracle'. In our study, we compare the performance of the 19 algorithms against the Oracle on the benchmark data sets below. Since the Oracle is always the best, each of the 19 algorithms performs at most as well as the Oracle. The smaller the gap between an algorithm and the Oracle, the better that algorithm is.
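To make the definition concrete, here is a tiny sketch (with purely illustrative, made-up scores) of how the Oracle's score is derived for a single data set:

```python
# Hypothetical illustration: for one data set, the Oracle's score is simply the
# best AUC-ROC achieved by any of the 19 algorithms on that data set.
auc_scores = {"MCD": 0.91, "IForest": 0.88, "LOF": 0.74}   # illustrative numbers only
oracle_score = max(auc_scores.values())                     # the Oracle 'picks' the best
```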
Data sets
We will use the Outlier Detection DataSets (ODDS) [2] to benchmark the 19 algorithms against the Oracle. ODDS has been widely used in many scientific papers. The 31 data sets are listed in the table below, together with basic information: number of rows (i.e., data points), number of columns (i.e., dimensions), number of outliers, and percentage of outliers. This gives an overview of the diversity of the data sets in the study.
For each benchmarking data set above, we have ground-truth labels: an item has label 0 if it is normal and label 1 if it is an outlier. As the %outliers column shows, these data sets are highly skewed, and most of them contain few outliers. This skewness matters because it drives the choice of evaluation metric below.
Evaluation Metric
Given the skewness of the benchmarking data sets and the 0/1 labels, we use AUC-ROC [3] as the metric to measure the performance of the 19 algorithms on the 31 data sets. AUC-ROC is widely used for measuring the performance of ML classifiers, and since we have normal and anomalous data points (labeled 0 and 1, respectively), an outlier detection algorithm behaves like a binary classifier here. Moreover, AUC-ROC handles highly imbalanced data sets well, which suits ODDS. That is why we use AUC-ROC in this study.
We define an experiment run as a pair of one algorithm A and one data set D, so we have a total of 19 * 31 = 589 experiment runs. For each experiment run, we call the fit() and predict() methods of algorithm A and then compute the AUC-ROC score from the output of predict() and the ground-truth labels of D.
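Here is a minimal sketch of a single experiment run, assuming data set D has already been loaded into a feature matrix X and ground-truth labels y (0 = normal, 1 = outlier); roc_auc_score comes from scikit-learn, and Isolation Forest stands in for algorithm A:

```python
from sklearn.metrics import roc_auc_score
from pyod.models.iforest import IForest

def run_experiment(detector, X, y):
    """One experiment run: fit the detector on D, then score it with AUC-ROC."""
    detector.fit(X)                      # unsupervised fit on the raw features
    y_pred = detector.predict(X)         # binary outlier labels (0 = inlier, 1 = outlier)
    return roc_auc_score(y, y_pred)      # AUC-ROC against the ground-truth labels of D

score = run_experiment(IForest(), X, y)  # X, y come from one ODDS data set
```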
For each data set D, the highest of the 19 AUC-ROC scores is the Oracle's score s. Then, for each algorithm A (out of the 19 algorithms), let r be A's AUC-ROC score on D. We define the regret AUC-ROC score of A on D as s - r. By construction, the regret AUC-ROC of an algorithm on a data set is a non-negative number.
Across the 31 data sets, each algorithm A therefore has 31 regret AUC-ROC scores. We rank the 19 algorithms by the median of these 31 regret AUC-ROC scores: the smaller the median regret AUC-ROC, the better the algorithm.
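Putting the two definitions together, here is a sketch of the regret and ranking computation, assuming auc is a pandas DataFrame with one row per data set (31 rows) and one column per algorithm (19 columns), filled in by the experiment runs above:

```python
# auc: 31 x 19 DataFrame of AUC-ROC scores (rows = data sets, columns = algorithms)
oracle = auc.max(axis=1)                       # per-data-set Oracle score s (best of 19)
regret = auc.rsub(oracle, axis=0)              # regret = s - r for each algorithm on each data set
ranking = regret.median(axis=0).sort_values()  # median regret per algorithm, smallest (best) first
print(ranking)
```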
Results
As shown in the table below, the Minimum Covariance Determinant (MCD) algorithm is the best outlier detection algorithm in our benchmarking study, with a median regret AUC-ROC of 0.03. Meanwhile, MO_GAAL, a recent algorithm developed in 2020, is the worst, with a median regret AUC-ROC of 0.39. Isolation Forest, a very popular algorithm due to its robustness, ranks second with a median regret AUC-ROC of 0.06.
References
[1] PyOD, https://pyod.readthedocs.io/en/latest/
[2] ODDS, http://odds.cs.stonybrook.edu/
[3] Understanding AUC-ROC Curve, https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5