A data anomaly is a point or sequence of points that deviates from the normal behavior of the data. Anomaly detection is the process of identifying data instances that deviate significantly from the majority of the data.
Anomaly detection has been an active research area for several decades, with early exploration dating back to the 1960s. Various methods and measures can be used to differentiate normal points from anomalous ones.
Anomaly detection plays an increasingly important role in communities such as data mining, machine learning, computer vision, and statistics, driven by growing demand in applications such as risk management, security, compliance, financial surveillance, health and medical risk, and AI safety.
Anomalies are mainly categorized into three types:
- Global or point anomalies are individual instances that differ greatly from the remaining data points in a data set, e.g., the abnormal health indicators of a patient. This type is considered the easiest to detect, and the majority of anomaly detection methods focus on it.
- Conditional or contextual anomalies are individual instances that are anomalous only in a specific context and would be considered normal otherwise.
- Collective or group anomalies are subsets of data instances that are anomalous as a whole with respect to the other data instances, even though the individual members may not be anomalies. In graph data, for example, the individual nodes of an anomalous subgraph can look as normal as real accounts.
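The difference between the first two types can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the temperature readings and season labels are made up, and the scoring rule is the modified z-score (median and median absolute deviation, with the conventional 0.6745 constant and 3.5 cutoff), chosen here for simplicity rather than taken from any particular method discussed in this post. Scoring all readings together surfaces the global anomaly, while scoring each season separately also surfaces a reading that is abnormal only in its context.

```python
from statistics import median

THRESHOLD = 3.5  # conventional cutoff for the modified z-score (an assumption here)


def robust_scores(values):
    """Modified z-score of each value, based on the median and the MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; this simple score is undefined")
    return [0.6745 * abs(v - med) / mad for v in values]


def flag(values, threshold=THRESHOLD):
    """Return one boolean per value: True if its score exceeds the threshold."""
    return [score > threshold for score in robust_scores(values)]


# Hypothetical temperature readings (deg C), labeled with their context (season).
winter = [1.0, 2.0, 0.0, 3.0, 1.5, 2.5, 0.5, 1.0, 2.0, 24.0]
summer = [25.0, 27.0, 26.0, 24.0, 28.0, 26.5, 23.0, 95.0]
readings = [("winter", t) for t in winter] + [("summer", t) for t in summer]
temps = [t for _, t in readings]

# Global view: score every reading against the whole data set.
global_flags = flag(temps)

# Contextual view: score every reading only against readings from its own season.
contextual_flags = [False] * len(readings)
for season in ("winter", "summer"):
    idx = [i for i, (s, _) in enumerate(readings) if s == season]
    for i, is_anomaly in zip(idx, flag([temps[i] for i in idx])):
        contextual_flags[i] = is_anomaly

global_anomalies = [r for r, f in zip(readings, global_flags) if f]
contextual_only = [
    r for r, g, c in zip(readings, global_flags, contextual_flags) if c and not g
]

print("global anomalies:    ", global_anomalies)
print("contextual anomalies:", contextual_only)
```

In this toy data the 95-degree reading stands out no matter how the data is sliced, whereas the 24-degree winter reading looks ordinary globally and is flagged only once the season is taken into account. Collective anomalies are not covered by this sketch, since they require scoring groups of instances rather than single points.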
Owing to their unique nature, anomalies make detection a problem whose complexities differ from those of most analytical and learning tasks. This post discusses some of the intrinsic complexities of anomaly detection and the unsolved challenges of detecting anomalies in complex data.
Complexities and challenges
- Unknownness: Most anomalies are associated with unknowns, such as instances with unknown abrupt behaviors, data structures, and distributions. They remain unknown until they actually occur.
- Heterogeneous anomaly classes: Anomalies are irregular, so one class of anomalies may show completely different abnormal characteristics from another. Detection therefore often has to work across multiple heterogeneous data sources, e.g., multidimensional data, graph, image, text, and audio data.
- Rarity and class imbalance: Anomalies are typically rare data instances, while normal instances make up most of the data. It is therefore difficult to collect many labeled abnormal instances, and large-scale labeled data is usually unavailable.
- Low anomaly detection recall rate: Because some anomalies are extremely rare and heterogeneous, they are difficult to identify. Many normal instances are wrongly reported as anomalies, while true yet sophisticated anomalies are missed. Reducing false positives while improving recall is one of the most important and most difficult challenges.
- Anomaly detection in high-dimensional and non-independent data: Anomalies often show evident abnormal characteristics in a low-dimensional space yet become hidden and unnoticeable in a high-dimensional space. It is also challenging to detect anomalies among instances that depend on each other through temporal, spatial, graph-based, or other interdependency relationships.
- Data-efficient learning of normality and abnormality: Because collecting large labeled anomaly datasets is difficult and costly, fully supervised anomaly detection is often impractical. Unsupervised methods need no prior knowledge of true anomalies, but they rely heavily on their assumptions about the distribution of anomalies. Two major challenges here are a) how to learn expressive representations of normality and abnormality from a small amount of labeled anomaly data and b) how to learn detection models that generalize to novel anomalies not covered by the given labeled anomaly data.
- Noise-resilient anomaly detection: Many anomaly detection methods assume that the labeled training data is clean, which makes them vulnerable to noisy instances that are mistakenly labeled with the opposite class. The main challenge is that the amount of noise can differ significantly across datasets, and noisy instances may be irregularly distributed in the data space.
- Anomaly explanation and accuracy: Using anomaly detection models directly as black boxes carries major risks in many safety-critical domains. For instance, because rare data instances are the ones reported as anomalies, a model can introduce algorithmic bias against minority groups represented in the data, such as under-represented groups in crime and fraud detection systems. An effective way to mitigate this risk is to introduce anomaly explanation algorithms that provide direct clues about why a specific data instance was identified as an anomaly, so that human experts can examine and correct the bias. In some applications, providing such an explanation can be as important as detection accuracy. However, most anomaly detection work focuses on detection accuracy only. Deriving anomaly explanations from specific detection methods is still a largely unsolved problem, especially for complex models.
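The rarity and recall challenges in the list above can be illustrated with some simple arithmetic. The sketch below uses a made-up dataset of 1,000 instances with 10 anomalies and two hypothetical detectors; none of the numbers come from a real system. It shows why plain accuracy is a poor yardstick under class imbalance: a detector that never flags anything scores higher accuracy than one that finds most anomalies at the cost of some false alarms.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with 1 = anomaly and 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_accuracy(y_true, y_pred):
    """Return (precision, recall, accuracy); ratios with a zero denominator are 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy


# Hypothetical data: 10 anomalies followed by 990 normal instances (1% anomalies).
y_true = [1] * 10 + [0] * 990

# Hypothetical detector A: never reports an anomaly.
pred_trivial = [0] * 1000

# Hypothetical detector B: finds 6 of the 10 anomalies and raises 30 false alarms.
pred_useful = [1] * 6 + [0] * 4 + [1] * 30 + [0] * 960

metrics_trivial = precision_recall_accuracy(y_true, pred_trivial)
metrics_useful = precision_recall_accuracy(y_true, pred_useful)

for name, (p, r, a) in [("trivial", metrics_trivial), ("useful", metrics_useful)]:
    print(f"{name:8s} precision={p:.3f} recall={r:.3f} accuracy={a:.3f}")
```

Under these assumptions the trivial detector reaches 99% accuracy with zero recall, while the second detector reaches 60% recall but only about 17% precision, since 30 of its 36 alerts are false positives. This is one reason precision and recall are usually reported separately when anomalies are rare.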