Even after 20 years of research and development, anomaly detection still faces difficult, unresolved challenges, including the large number of false alerts raised while detecting unknown attack patterns.
Thankfully, several recent methods have shown promise in effectively indicating the presence of intentionally or unintentionally induced attacks, intrusions, faults, defects, and other events that violate an information system's security policy.
All intrusion detection methods assume that intruder behavior differs noticeably from legitimate, authorized behavior and activity. Machine learning techniques have received considerable attention among anomaly detection researchers and developers. This post will explore the top supervised and unsupervised methods for managing the anomaly detection problem.
1. Supervised Anomaly Detection
Supervised methods, also known as classification methods, require a labeled training set containing both normal and abnormal samples to construct the predictive model. Theoretically, supervised methods provide better detection rates than semi-supervised and unsupervised methods since they have access to more information. In practice, however, they face technical issues such as training sets that do not cover all regions of the input space and inaccurate labels. Supervised Neural Networks, Support Vector Machines (SVM), k-Nearest Neighbors, Bayesian Networks, and Decision Trees are the most common supervised algorithms.
a. k-Nearest Neighbor (k-NN)
k-NN is one of the simplest conventional nonparametric techniques for classifying samples. It calculates the distances between the unlabeled point and the points in the training set, then assigns the unlabeled point to the majority class among its k nearest neighbors.
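As a rough illustration, here is a minimal k-NN sketch with scikit-learn; the feature vectors and labels below are toy values, not real traffic data:

```python
# Toy k-NN example: classify a connection as normal (0) or anomalous (1)
# by majority vote among its 3 nearest labeled neighbors.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],   # normal traffic (toy)
           [0.9, 0.8], [0.85, 0.95]]               # anomalous traffic (toy)
y_train = [0, 0, 0, 1, 1]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

print(clf.predict([[0.12, 0.22]]))  # -> [0]: the point sits in the normal cluster
```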
b. Bayesian Network (BN)
The Bayesian Network (BN) is a model for encoding probabilistic relationships among variables. It provides a sound mathematical foundation for making apparently difficult problems straightforward. It is commonly combined with statistical schemes for intrusion detection because of its several advantages, including encoding interdependencies between variables, predicting events, and incorporating both prior knowledge and data.
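A full Bayesian network is too much for a short example, so here is a minimal sketch using Gaussian naive Bayes, the simplest Bayesian network structure (the class node is the sole parent of conditionally independent feature nodes); the data are toy values:

```python
# Gaussian naive Bayes: a Bayesian network with the class as the only parent.
from sklearn.naive_bayes import GaussianNB

X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.85, 0.95]]  # toy features
y_train = [0, 0, 1, 1]                                        # 0=normal, 1=attack

model = GaussianNB().fit(X_train, y_train)

# predict_proba exposes the posterior P(class | features), which is how a
# Bayesian approach folds prior knowledge and data into its predictions.
print(model.predict_proba([[0.15, 0.18]]))
```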
c. Supervised Neural Network (NN)
NNs learn to predict the behavior of different users in a system. If properly designed and implemented, NNs can solve many issues that rule-based approaches suffer from. Their main advantage is their tolerance for ambiguous data and their ability to derive solutions from data without prior knowledge of the data's regularities. The most commonly used supervised neural networks today are the multilayer perceptron (MLP) and the radial basis function (RBF) network.
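A minimal MLP sketch with scikit-learn, again on toy data; the layer size and iteration count are illustrative, not tuned:

```python
# A small multilayer perceptron learning a two-class boundary from labels alone.
from sklearn.neural_network import MLPClassifier

X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.85, 0.95]]  # toy features
y_train = [0, 0, 1, 1]

# One hidden layer of 16 units; no hand-written rules, the mapping is learned.
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
net.fit(X_train, y_train)
print(net.predict([[0.88, 0.9]]))  # score a new, unseen point
```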
d. Decision Tree (DT)
A decision tree is a versatile and well-known classification and prediction tool. Nodes, arcs, and leaves are its three main components. Each node is assigned the feature attribute that is most informative among the attributes not yet considered on the path from the root. Each leaf is labeled with a category or class, and each arc out of a node is labeled with a value of that node's feature. To classify a data point, the tree is traversed from the root until a leaf node is reached; that leaf provides the data point's classification.
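A minimal decision-tree sketch in scikit-learn; export_text prints the learned splits so the root-to-leaf path is visible (toy data again):

```python
# The tree picks the most informative feature at each node and routes a
# sample from the root down to a class leaf.
from sklearn.tree import DecisionTreeClassifier, export_text

X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.85, 0.95]]
y_train = [0, 0, 1, 1]

tree = DecisionTreeClassifier(criterion="entropy").fit(X_train, y_train)
print(tree.predict([[0.05, 0.1]]))                    # follow root -> leaf
print(export_text(tree, feature_names=["f0", "f1"]))  # the learned splits
```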
e. Support Vector Machine (SVM)
A support vector machine (SVM) first maps the input vector into a higher-dimensional feature space and then finds the optimal separating hyperplane in that space. Furthermore, the decision boundary, i.e., the separating hyperplane, is determined by the support vectors rather than by the entire training set, which makes it extremely robust to outliers. An SVM classifier is designed for binary classification: the support vectors are the training samples closest to the decision boundary that separates the training vectors belonging to the two classes. The SVM also has a user-specified penalty factor that lets users trade off the number of misclassified samples against the width of the decision boundary. In practice, SVMs are often paired with a second artificial intelligence technique, the Kohonen neural network, to detect anomalies.
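A minimal binary SVM sketch in scikit-learn; C is the penalty factor mentioned above, and the RBF kernel choice is illustrative:

```python
# Binary SVM: the decision boundary is fixed by the support vectors alone.
from sklearn.svm import SVC

X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.85, 0.95]]
y_train = [0, 0, 1, 1]

svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print(svm.support_vectors_)        # the samples that determine the boundary
print(svm.predict([[0.8, 0.85]]))  # -> [1]
```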
2. Unsupervised Anomaly Detection
Unsupervised anomaly detection techniques do not need training data. They rest on two fundamental assumptions. First, they presume that most network connections are regular traffic and that only a tiny percentage of the traffic is abnormal. Second, they anticipate that malicious traffic is statistically different from normal traffic. Under these two assumptions, groups of similar instances that appear frequently are assumed to be regular traffic, while infrequent instances that differ considerably from the majority are regarded as malicious. K-Means, Self-organizing maps (SOM), C-means, Adaptive resonance theory (ART), the Expectation-Maximization Meta Algorithm (EM), Unsupervised Niche Clustering (UNC), and the One-Class Support Vector Machine are the most common unsupervised algorithms.
a. Clustering techniques
Clustering techniques work by dividing observed data into groups based on a similarity or distance metric. Clustering-based anomaly detection can be done in at least two ways. In the first approach, the anomaly detection model is trained on unlabeled data that includes both normal and attack traffic; in the second, the model is trained on normal data only, producing a profile of normal activity. The first approach assumes that anomalous or attack data makes up a small percentage of the total. If this assumption holds, anomalies and attacks can be detected from cluster sizes: large clusters correspond to normal data, while the remaining points, the outliers, correspond to attacks.
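A hedged sketch of the first approach, using K-Means (described below) as one possible clustering backend; the synthetic data, cluster count, and 5% size threshold are all illustrative assumptions:

```python
# Cluster unlabeled traffic, then flag points in small clusters as anomalies.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.5, size=(200, 2))   # bulk of the traffic (synthetic)
attacks = rng.normal(5.0, 0.5, size=(5, 2))    # rare, far-away traffic (synthetic)
X = np.vstack([normal, attacks])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
sizes = np.bincount(km.labels_)

# Clusters holding under, say, 5% of the data are treated as anomalous.
small = np.where(sizes < 0.05 * len(X))[0]
anomalies = np.isin(km.labels_, small)
print(f"{anomalies.sum()} points flagged as anomalous")
```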
- Unsupervised Neural Networks: These include self-organizing maps (SOM) and adaptive resonance theory (ART), which embrace a family of neural network models. Thanks to an unsupervised competitive learning algorithm that reduces the data's dimensionality for visualization, they are well suited to intrusion detection tasks in which normal behavior is densely concentrated around one or two centers while anomalous behavior and intrusions spread out in the space outside the typical clusters.
- K-Means: The K-means algorithm is a traditional clustering algorithm. It divides the data into k clusters so that data within the same cluster are similar while data in different clusters have low similarity. The algorithm selects k points at random as the initial cluster centers; each remaining point is then added to the cluster whose center it is closest to, and each cluster center is recalculated. This process repeats until the cluster centers no longer change, leaving the data divided into k clusters.
- Fuzzy C-Means (FCM): Fuzzy C-means is a clustering method that allows one piece of data to belong to two or more clusters. It is used in applications where crisp data classification is not meaningful or is hard to achieve (e.g., pattern recognition). The C-means algorithm is similar to K-means except that each point's membership is defined by a fuzzy function, and all points contribute to the relocation of a cluster centroid in proportion to their fuzzy membership in that cluster.
- Unsupervised Niche Clustering (UNC): UNC is a robust clustering algorithm that combines a niching strategy with an evolutionary algorithm. The evolutionary algorithm uses a robust density fitness function to find clusters, while the niching technique creates and maintains niches (candidate clusters). Because it is based on genetic optimization, UNC is less susceptible to suboptimal solutions than traditional techniques. Its main advantages are its ability to handle noise and to determine the number of clusters automatically.
- Expectation-Maximization Meta Algorithm (EM): EM clustering is another soft clustering method, based on Expectation-Maximization, an algorithm for finding maximum likelihood estimates of parameters in probabilistic models. The EM clustering algorithm alternates between an expectation (E) step, in which the likelihood is estimated using the current model parameters (as if they were known), and a maximization (M) step, in which maximum likelihood estimates of the model parameters are computed. The new parameter estimates feed the next iteration's expectation step.
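As a concrete example of EM-based soft clustering, scikit-learn's GaussianMixture runs exactly this E/M alternation when fitting a Gaussian mixture; the data here are synthetic:

```python
# EM on a two-component Gaussian mixture: the E-step computes soft cluster
# responsibilities, the M-step re-estimates means and covariances.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # runs EM
print(gmm.means_)                # M-step output: component means
print(gmm.predict_proba(X[:3]))  # E-step output: soft memberships
# score_samples gives log-likelihoods; very low values can flag anomalies.
print(gmm.score_samples(X[:3]))
```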
b. One-Class Support Vector Machine (OCSVM)
OCSVM is a specialized variant of the support vector machine used for anomaly detection. It differs from the generic SVM in that its quadratic optimization problem includes an allowance for a small, predefined percentage of outliers, making it appropriate for anomaly detection. These outliers lie between the origin and the optimal separating hyperplane; all remaining data lie on the other side of that hyperplane, forming a single nominal class, hence the term "one-class" SVM. The SVM returns a score representing the distance between the data point under consideration and the optimal hyperplane. Positive output values indicate normal behavior (higher values meaning greater normality), while negative values indicate abnormal behavior (lower values meaning greater abnormality).
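A minimal one-class SVM sketch in scikit-learn; nu bounds the predefined outlier percentage, and decision_function returns the signed distance to the hyperplane described above:

```python
# Fit on (mostly) normal traffic, then score new points by signed distance
# to the separating hyperplane.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X_train = rng.normal(0, 0.5, (200, 2))  # the nominal class only (synthetic)

# nu bounds the fraction of training points allowed to fall outside.
oc = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_new = np.array([[0.1, 0.0], [4.0, 4.0]])
print(oc.predict(X_new))            # +1 = normal, -1 = anomalous
print(oc.decision_function(X_new))  # positive = normal side, negative = outlier
```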
Let's conclude. Among the supervised methods, the best performance is achieved by the non-linear methods, such as SVM, the multilayer perceptron, and rule-based methods. Among the unsupervised techniques, SOM, K-Means, and the one-class SVM perform better than the others; however, they differ in their capability to detect all attack classes efficiently.