Anomaly detection finds extensive use in various applications such as intrusion detection for cyber-security, fraud detection for credit cards, insurance, or health care, fault detection in safety-critical systems, and military surveillance for enemy activities.
Anomaly detection refers to finding patterns in data that do not conform to expected behavior; in other words, anomalies are data patterns that do not conform to a well-defined notion of normal behavior.
These non-conforming patterns are referred to by different terms in different application domains: anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities, or contaminants. Of these, anomalies and outliers are the two most commonly used terms, sometimes interchangeably.
This post will look at the key advantages and disadvantages of popular anomaly detection techniques based on classification, nearest neighbors, clustering, statistics, information theory, and spectral analysis.
1. Classification-based techniques
Classification is used to learn a model (classifier) from a set of labeled data instances (training) and then classify a test instance into one of the classes using the learned model (testing). Classification-based anomaly detection techniques operate in a similar two-phase fashion. First, the training phase learns a classifier using the available labeled training data. Then, the testing phase classifies a test instance as normal or abnormal using the classifier.
Classification-based anomaly detection techniques operate under the general assumption that a classifier that can distinguish between normal and anomalous classes can be learned in the given feature space.
Popular anomaly detection techniques that use different classification algorithms to build classifiers include Neural Networks-Based, Bayesian Networks Based, Support Vector Machines Based, and Rule-Based.
- Classification-based techniques, especially multi-class techniques, can use powerful algorithms that can distinguish between instances belonging to different classes.
- The testing phase of classification-based techniques is fast, since each test instance only needs to be compared against the pre-computed model.
- Multi-class classification-based techniques rely on accurate labels for the various normal classes, which are often not available.
- Classification-based techniques assign a label to each test instance, which can also become a disadvantage when a meaningful anomaly score is desired for the test instances. However, some classification techniques that obtain a probabilistic prediction score from the output of a classifier can be used to address this issue.
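The two-phase train-then-test flow described above can be sketched with a from-scratch Gaussian naive Bayes classifier. The toy data, the class setup (0 = normal, 1 = anomalous), and the naive Bayes choice are all illustrative assumptions, not part of the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy labeled data: normal instances near the origin, anomalies near (5, 5)
X_normal = rng.normal(0.0, 1.0, size=(200, 2))
X_anom = rng.normal(5.0, 1.0, size=(20, 2))
X = np.vstack([X_normal, X_anom])
y = np.array([0] * 200 + [1] * 20)  # 0 = normal, 1 = anomalous

# Training phase: fit per-class Gaussian parameters (naive Bayes)
params = {}
for c in (0, 1):
    Xc = X[y == c]
    params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))

def log_posterior(x, mean, var, prior):
    # Log of prior times the product of per-feature Gaussian likelihoods
    return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def classify(x):
    # Testing phase: assign the class with the highest log posterior
    scores = {c: log_posterior(x, *params[c]) for c in params}
    return max(scores, key=scores.get)

print(classify(np.array([0.2, -0.1])))  # prints 0: near the normal cluster
print(classify(np.array([5.1, 4.8])))   # prints 1: near the anomalous cluster
```

Because the score is a log posterior, this particular sketch also illustrates the mitigation mentioned above: the raw `log_posterior` values can serve as a probabilistic anomaly score instead of a hard label.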
2. Nearest neighbor-based techniques
The nearest neighbor-based techniques for anomaly detection are based on a critical assumption that typical data instances occur in dense neighborhoods while anomalies occur far from their closest neighbors. Thus, nearest neighbor-based anomaly detection techniques can be broadly grouped into two categories: (1) Techniques that use the distance of a data instance to its kth nearest neighbor as the anomaly score, and (2) Techniques that compute the relative density of each data instance to compute its anomaly score.
- A key advantage of nearest-neighbor-based techniques is that they are unsupervised in nature and do not make any assumptions regarding the generative distribution of the data. Instead, they are purely data-driven.
- Semi-supervised techniques perform better than unsupervised techniques in terms of missed anomalies, since the likelihood of an anomaly forming a close neighborhood in the training data set is very low.
- Adapting nearest neighbor-based techniques to a different data type is straightforward and primarily requires defining an appropriate distance measure for the given data.
- For unsupervised techniques, if the data has normal instances that do not have enough close neighbors or have anomalies with enough close neighbors, the technique fails to label them correctly, resulting in missed anomalies.
- For semi-supervised techniques, if the normal instances in test data do not have enough similar normal instances in the training data, the false positive rate for such techniques is high.
- The computational complexity of the testing phase is also a significant challenge since it involves computing the distance of each test instance with all instances belonging to either the test data itself or to the training data to compute the nearest neighbors.
- The performance of a nearest-neighbor-based technique greatly relies on a distance measure, defined between a pair of data instances, that can effectively distinguish between normal and abnormal instances. Unfortunately, defining distance measures between instances can be challenging when the data is complex, e.g., graphs, sequences, etc.
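The first category above, using the distance to the kth nearest neighbor as the anomaly score, can be sketched in a few lines. The toy data (a dense cluster plus one isolated point) and the choice k = 5 are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy data: a dense normal cluster plus one isolated point
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)), [[6.0, 6.0]]])

def knn_score(X, k=5):
    # Pairwise Euclidean distances between all instances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_sorted = np.sort(d, axis=1)  # column 0 is each point's zero self-distance
    return d_sorted[:, k]          # distance to the k-th nearest neighbor

scores = knn_score(X)
print(int(np.argmax(scores)))  # prints 100: the isolated point scores highest
```

The quadratic pairwise-distance matrix here is exactly the testing-phase cost bullet above in miniature: every instance is compared against every other instance.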
3. Clustering-based techniques
Clustering is used to group similar data instances into clusters. Clustering-based techniques rely on the assumption that normal data instances belong to a cluster in the data, while anomalies do not belong to any cluster. These techniques consist of two steps. In the first step, the data is clustered using a clustering algorithm. In the second step, for each data instance, the distance to its closest cluster centroid is calculated as its anomaly score.
- Clustering-based techniques can operate in an unsupervised mode.
- Such techniques can often be adapted to other complex data types by simply plugging in a clustering algorithm that can handle the particular data type.
- The testing phase for clustering-based techniques is fast since the number of clusters against which every test instance needs to be compared is a small constant.
- The performance of clustering-based techniques depends on the clustering algorithm’s effectiveness in capturing the cluster structure of normal instances.
- Many techniques detect anomalies as a by-product of clustering and hence are not optimized for anomaly detection.
- Several clustering algorithms force every instance to be assigned to some cluster. This might result in anomalies getting assigned to a large cluster, thereby being considered normal instances by techniques that operate under the assumption that anomalies do not belong to any cluster.
- Several clustering-based techniques are effective only when the anomalies do not form significant clusters among themselves.
- The computational complexity for clustering the data is often a bottleneck, especially if O(N²d) clustering algorithms are used.
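The two steps above, cluster first, then score by distance to the closest centroid, can be sketched with a minimal k-means implementation. The toy data (two normal clusters plus one far-off point), k = 2, and the simple first-k-points initialization are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy data: two normal clusters plus one far-off anomaly
X = np.vstack([rng.normal(0.0, 0.4, size=(80, 2)),
               rng.normal(4.0, 0.4, size=(80, 2)),
               [[10.0, 10.0]]])

def kmeans(X, k=2, iters=20):
    # Deterministic init from the first k points; then alternate assign/update
    centroids = X[:k].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids

# Step 1: cluster the data. Step 2: anomaly score = distance to closest centroid.
centroids = kmeans(X)
scores = np.linalg.norm(X[:, None] - centroids[None], axis=-1).min(axis=1)
print(int(np.argmax(scores)))  # prints 160: the far-off point
```

Note how the sketch also exhibits the by-product weakness listed above: k-means assigns the anomaly to its nearest cluster regardless, and only the distance score reveals that it is unusual.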
4. Statistical techniques
The underlying principle of any statistical anomaly detection technique is that an anomaly is an observation suspected of being partially or wholly irrelevant because it is not generated by the assumed stochastic model. Statistical anomaly detection techniques are based on the assumption that normal data instances occur in high probability regions of a stochastic model, while anomalies occur in its low probability regions.
- If the underlying data distribution assumptions hold true, statistical techniques provide a statistically justifiable solution for anomaly detection.
- The anomaly score provided by a statistical technique is associated with a confidence interval, which can be used as additional information while making a decision regarding any test instance.
- If the distribution estimation step is robust to anomalies in data, statistical techniques can operate in an unsupervised setting without any need for labeled training data.
- The key disadvantage of statistical techniques is that they rely on the assumption that the data is generated from a particular distribution. Unfortunately, this assumption often does not hold true, especially for high-dimensional real data sets.
- Even when the statistical assumption can be reasonably justified, several hypothesis test statistics can be applied to detect anomalies; choosing the best statistic is often not a straightforward task [Motulsky 1995]. In particular, constructing hypothesis tests for complex distributions required to fit high dimensional data sets is nontrivial.
- Histogram-based techniques are relatively simple to implement, but a key shortcoming of such techniques for multivariate data is that they cannot capture the interactions between different attributes. For example, an anomaly might have attribute values that are individually very frequent but whose combination is very rare; an attribute-wise histogram-based technique would not detect such anomalies.
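A minimal parametric example of the high/low probability region idea is a fitted univariate Gaussian with a 3-sigma rule. The synthetic data, the Gaussian modeling assumption, and the 3-sigma threshold are all illustrative choices, not prescriptions from the post:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed stochastic model: the normal data follow a univariate Gaussian
data = rng.normal(50.0, 5.0, size=1000)
mu, sigma = data.mean(), data.std()

def z_score(x):
    # Distance from the mean in standard deviations; low-probability
    # regions of the fitted Gaussian correspond to large |z|
    return abs(x - mu) / sigma

# The 3-sigma rule flags instances outside the high-probability region
print(z_score(52.0) > 3)  # prints False: a typical value
print(z_score(80.0) > 3)  # prints True: deep in the low-probability tail
```

This is also where the key disadvantage above bites: if the data were not actually Gaussian, say, bimodal, the fitted `mu` and `sigma` would be meaningless and the z-score would mislabel instances.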
5. Information-theoretic techniques
Information-theoretic techniques analyze the information content of a data set using different information-theoretic measures such as Kolmogorov Complexity, entropy, relative entropy, etc. Such techniques are based on the key assumption that anomalies in data induce irregularities in the information content of the data set.
- They can operate in an unsupervised setting.
- They do not make any assumptions about the underlying statistical distribution of the data.
- The performance of such techniques is highly dependent on the choice of the information-theoretic measure. Such measures can often detect anomalies only when there is a significantly large number of anomalies present in the data.
- Information-theoretic techniques applied to sequences and spatial data sets rely on the size of the substructure, which is often nontrivial to obtain.
- It is difficult to associate an anomaly score with a test instance using an information-theoretic technique.
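The core assumption, that anomalies induce irregularities in the information content of the data, can be sketched by scoring each instance by how much its removal reduces the entropy of the data set. The categorical toy data and the leave-one-out entropy score are assumptions for illustration:

```python
import numpy as np
from collections import Counter

def entropy(values):
    # Shannon entropy (in bits) of the empirical value distribution
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Assumed toy categorical data: one rare value injects irregularity
data = ["a"] * 50 + ["b"] * 50 + ["z"]

# Score each instance by the entropy drop its removal would cause
scores = [entropy(data) - entropy(data[:i] + data[i + 1:]) for i in range(len(data))]
print(int(np.argmax(scores)))  # prints 100: the rare value "z"
```

The sketch also hints at the disadvantages above: the leave-one-out scores are not calibrated probabilities, and with only a single rare value the entropy drop is tiny, which is why such measures often need many anomalies to produce a clear signal.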
6. Spectral techniques
Spectral techniques try to approximate the data using a combination of attributes that capture the bulk of variability in the data. Such techniques are based on the assumption that data can be embedded into a lower-dimensional subspace in which normal instances and anomalies appear significantly different.
- Spectral techniques automatically perform dimensionality reduction and hence are suitable for handling high dimensional data sets. Moreover, they can also be used as a pre-processing step followed by applying any existing anomaly detection technique in the transformed space.
- Spectral techniques can be used in an unsupervised setting.
- Spectral techniques are useful only if the normal and abnormal instances are separable in the lower dimensional embedding of the data.
- Spectral techniques typically have high computational complexity.
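A standard instance of this idea is PCA-based anomaly detection: project the data onto its top principal component and score each instance by its reconstruction error. The toy data (normal points near a 1-D line in 2-D, plus one off-subspace point) and the rank-1 choice are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed toy data: normal points near the line y = 2x, plus noise
t = rng.normal(0.0, 1.0, size=200)
X = np.column_stack([t, 2 * t]) + rng.normal(0.0, 0.05, size=(200, 2))
X = np.vstack([X, [[1.0, -2.0]]])  # one point far off the normal subspace

# Project onto the top principal component via SVD of the centered data
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[0][:, None] @ Vt[0][None, :]  # rank-1 reconstruction
errors = np.linalg.norm(Xc - proj, axis=1)   # anomaly score

print(int(np.argmax(errors)))  # prints 200: the off-subspace point
```

The SVD here is also the source of the cost noted above: on large high-dimensional data sets, computing the decomposition dominates the running time.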