Pros and cons of anomaly detection techniques


Our recent post about data anomaly detection explored various supervised and unsupervised techniques commonly used to identify intrusions and unauthorized activities. This post compares these algorithms, summarizing their pros and cons.

Anomaly detection is the process of finding patterns in a dataset whose behavior is not normal or expected. These unexpected behaviors are also termed anomalies or outliers. Anomalies cannot always be categorized as attacks; they may simply be surprising, previously unknown behavior, and may or may not be harmful. Anomaly detection provides significant and critical information in many applications.

Among the supervised methods, the best performance is achieved by non-linear methods such as SVM, multi-layer perceptron, and rule-based methods. Among the unsupervised methods, techniques such as K-means, the self-organizing map (SOM), and one-class SVM achieve better performance than other techniques.

Let’s get started:

1. K-Nearest Neighbor

K-nearest neighbor (k-NN) is a simple and conventional non-parametric technique for classifying samples. It calculates the distances between points in the input space and then assigns the unlabeled point to the class of its k nearest neighbors.


Pros:
  • It is very easy to understand when there are few predictor variables.
  • It is useful for building models that involve non-standard data types, such as text.


Cons:
  • It has large storage requirements.
  • It is sensitive to the choice of the similarity function used to compare instances.
  • It lacks a principled way to choose k, except through cross-validation or similar methods.
  • It is a computationally expensive technique.
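As a rough illustration of the distance-based idea, here is a minimal sketch (plain Python, with hypothetical toy data) that scores each point by the distance to its k-th nearest neighbor, a common k-NN-style anomaly score:

```python
import math

def knn_anomaly_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor.

    Larger scores suggest the point lies far from its neighbors,
    i.e. it is more likely an outlier.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])  # distance to the k-th neighbor
    return scores

# Hypothetical data: a tight cluster around the origin plus one far-away point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_anomaly_scores(data, k=2)
print(scores.index(max(scores)))  # index of the most anomalous point
```

The quadratic pairwise-distance loop also makes the storage and computation costs listed above concrete.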

2. Supervised Neural Network

Multi-layer perceptron (MLP) and radial basis function (RBF) networks are the most commonly used supervised neural networks. Since RBF networks perform classification by measuring distances between the inputs and the centers of the RBF hidden neurons, they are much faster to train than networks relying on time-consuming backpropagation, and they are well suited to problems with large sample sizes.


Pros:
  • A neural network can perform tasks that a linear program cannot.
  • Can effectively address many problems encountered by rule-based approaches.
  • When an element of the neural network fails, it can continue without problems thanks to its parallel nature.
  • Has a high tolerance to imprecise data and uncertain information.
  • Has the ability to infer solutions from data without prior knowledge of the regularities in the data.
  • The neural network learns and does not need to be reprogrammed.
  • It can be applied in a wide range of applications.


Cons:
  • The neural network needs training to operate.
  • Neural network architecture differs from microprocessor architecture; therefore, it needs to be emulated.
  • Requires high processing time for large neural networks.
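To illustrate the first pro above, that a neural network can compute what a linear model cannot, here is a minimal sketch of a 2-2-1 multi-layer perceptron with hand-picked (not trained) weights that computes XOR, a function no single linear unit can represent:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_xor(x1, x2):
    """A tiny 2-2-1 MLP with hand-set weights that computes XOR."""
    # Hidden layer: large weights make the sigmoids behave like sharp gates.
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)   # fires if x1 OR x2
    h2 = sigmoid(20 * x1 + 20 * x2 - 30)   # fires only if x1 AND x2
    # Output: (OR) and not (AND)  ->  XOR
    y = sigmoid(20 * h1 - 20 * h2 - 10)
    return round(y)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, mlp_xor(a, b))
```

In practice the weights are learned by backpropagation rather than set by hand; the hand-set version simply makes the non-linear decision boundary visible.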

3. Decision Tree

A decision tree is a powerful and common tool for classification and prediction. It is a tree with three main components: nodes, arcs, and leaves. Each node is labeled with the feature attribute that is most informative among the attributes not yet considered on the path from the root.


Pros:
  • Simple to understand and interpret.
  • It can handle both numerical and categorical data.
  • Requires little data preparation.
  • Uses a white-box model.
  • It can perform well with large data in a short time.
  • It is possible to validate a model using statistical tests.
  • Robust.


Cons:
  • Decision-tree learners can create over-complex trees that do not generalize the data well.
  • Learning an optimal decision tree is known to be NP-complete under several aspects of optimality, even for simple concepts.
  • Some concepts are hard to learn because decision trees do not express them easily.
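The "most informative attribute" idea can be sketched with a one-level tree (a decision stump) that picks the split minimizing weighted Gini impurity; the data below is hypothetical:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Pick the (feature, threshold) whose split yields the lowest
    weighted Gini impurity -- the 'most informative' test for the root."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [labels[i] for i, r in enumerate(rows) if r[f] <= t]
            right = [labels[i] for i, r in enumerate(rows) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best[1], best[2]

# Toy data: the label depends only on the second feature.
X = [(1, 0), (2, 0), (1, 5), (2, 5)]
y = ["normal", "normal", "anomaly", "anomaly"]
print(best_split(X, y))
```

A full tree learner applies this greedy choice recursively to each child node, which is exactly why the result is only locally, not globally, optimal.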

4. Support Vector Machine (SVM)

In support vector machines (SVM), the input vector is first mapped into a higher-dimensional feature space, where the best separating hyperplane is found. The separating hyperplane, or decision boundary, is determined by the support vectors rather than the entire training sample, and it is extremely robust to outliers.


Pros:
  • Can find the optimal separating hyperplane.
  • Can deal with very high dimensional data.
  • Some kernels have infinite Vapnik-Chervonenkis dimensions, which means that they can learn very elaborate concepts.
  • Usually works very well in practice.


Cons:
  • Requires both positive and negative examples.
  • Requires lots of memory and CPU time.
  • Needs to select a good kernel function.
  • It has some numerical stability problems in solving the constrained quadratic programming (QP) problem.
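The high-dimensional mapping mentioned above usually relies on the kernel trick: a kernel function returns inner products in the implicit feature space without ever constructing it. A minimal sketch with the Gaussian (RBF) kernel, on hypothetical points:

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: an inner product in an implicit,
    infinite-dimensional feature space, computed directly from
    the squared Euclidean distance between the inputs."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
K = [[rbf_kernel(p, q) for q in points] for p in points]

# Every point has similarity 1 with itself; similarity decays with
# distance, so the far-away third point is close to 0 against the others.
for row in K:
    print([round(v, 3) for v in row])
```

An SVM solver works entirely with such a kernel (Gram) matrix, which is also why memory and CPU costs grow quickly with the number of training samples.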

5. Self-organizing map (SOM)

The self-organizing map (SOM) is trained by an unsupervised competitive learning algorithm and is commonly used for dimensionality reduction and data visualization.


Pros:
  • A simple and easy-to-understand algorithm that works well.
  • It is a topological, unsupervised clustering algorithm that works with non-linear data sets.
  • The excellent capability to visualize high-dimensional data onto 1 or 2-dimensional space makes it unique, especially for dimensionality reduction.


Cons:
  • It is a time-consuming algorithm.

6. K-means

The K-means algorithm is a traditional clustering algorithm that divides the data into k clusters such that data within the same cluster are similar, while data in different clusters have low similarity.


Pros:
  • Achieves a good detection effect due to its sensitivity to outliers.
  • Low complexity.
  • Has a high detection rate.


Cons:
  • The necessity of specifying k.
  • Sensitive to noise and outlier data points.
  • Clusters are sensitive to the initial assignment of centroids.
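The alternating assign/update loop of K-means can be sketched as follows (1-D toy data for brevity; the starting centroids are hypothetical and illustrate the initialization sensitivity listed above):

```python
def kmeans(points, centroids, iters=10):
    """Plain 1-D k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        # Keep a centroid in place if its cluster came up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

data = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
centers = kmeans(data, [0.0, 5.0])
print([round(c, 1) for c in centers])
```

For anomaly detection, points with a large distance to their nearest final centroid are the natural outlier candidates.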

7. Fuzzy C-means

Fuzzy C-means is a clustering method that allows one piece of data to belong to two or more clusters. The C-means algorithm is similar to K-Means, except that each point’s membership is determined by a fuzzy function. Based on their fuzzy membership in that cluster, all of the points contribute to the relocation of a cluster centroid.


Pros:
  • Helps achieve a higher detection rate, a lower false-positive rate, and stronger stability.
  • It allows a data point to be in multiple clusters.
  • It is a more natural representation of the behavior of genes.
  • It is useful in applications in which a hard classification of the data is not meaningful or is very difficult to achieve (e.g., pattern recognition).


Cons:
  • Need to define c, the number of clusters.
  • Need to determine membership cutoff value.
  • Clusters are sensitive to the initial assignment of centroids.
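The fuzzy membership function that distinguishes C-means from K-means can be sketched as follows, using the standard membership formula with fuzzifier m (the 1-D centroids below are hypothetical):

```python
def fuzzy_memberships(point, centroids, m=2.0):
    """Fuzzy c-means membership of one point in each cluster:
    u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)), where d_i is the distance
    from the point to centroid i and m > 1 is the fuzzifier."""
    dists = [abs(point - c) for c in centroids]
    if any(d == 0 for d in dists):  # the point sits exactly on a centroid
        return [1.0 if d == 0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [
        1.0 / sum((dists[i] / dists[k]) ** power for k in range(len(centroids)))
        for i in range(len(centroids))
    ]

# A point midway between two centroids belongs equally to both...
print(fuzzy_memberships(5.0, [0.0, 10.0]))   # -> [0.5, 0.5]
# ...while a point nearer one centroid still keeps a small membership
# in the other, unlike the hard assignment of K-means.
print(fuzzy_memberships(2.0, [0.0, 10.0]))
```

The full algorithm alternates between computing these memberships for all points and relocating each centroid to the membership-weighted mean.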

8. Expectation-Maximization Meta Algorithm (EM)

Expectation-Maximization (EM) is a soft clustering algorithm for finding maximum-likelihood estimates of parameters in probabilistic models. It alternates between an expectation (E) step, which computes the expected likelihood using the current model parameters (as if they were known), and a maximization (M) step, which computes the maximum-likelihood estimates of the model parameters.


Pros:
  • It can easily change the model to adapt to a different distribution of the data set.
  • The number of parameters does not increase as the training data grows.
  • It enhances the convergence speed of the clustering process.


Cons:
  • Slow convergence in some cases.
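The E/M alternation described above can be sketched for a 1-D mixture of two equal-weight, fixed-variance Gaussians (a deliberately simplified model on hypothetical data):

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, mu1, mu2, var=1.0, iters=25):
    """EM for a 1-D mixture of two equal-weight Gaussians with fixed
    variance: the E step computes soft responsibilities, the M step
    re-estimates each mean as a responsibility-weighted average."""
    for _ in range(iters):
        # E step: responsibility of component 1 for each point.
        r = []
        for x in data:
            p1 = normal_pdf(x, mu1, var)
            p2 = normal_pdf(x, mu2, var)
            r.append(p1 / (p1 + p2))
        # M step: responsibility-weighted mean updates.
        mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
    return mu1, mu2

data = [0.0, 0.3, -0.2, 5.0, 5.2, 4.8]
mu1, mu2 = em_two_gaussians(data, mu1=-1.0, mu2=1.0)
print(round(mu1, 2), round(mu2, 2))
```

The soft responsibilities are what make EM a soft clustering method: every point contributes a little to every component, in contrast to the hard assignments of K-means.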