Hundreds of millions of individuals, homes, and businesses rely on smart sensors and advanced communication technologies today; these play an indispensable part in how people work, learn, innovate, live, and entertain. As the first point of contact with the physical world, sensors capture its raw data.
However, sensors are typically inexpensive electronic components prone to malfunction, failure, malicious attacks, and tampering. All of these conditions can cause sensors to produce unusual, erroneous, and conflicting readings.
Faulty, erroneous, or corrupted sensor readings can compromise the overall performance of IoT systems such as autonomous vehicles and corrupt trained Machine Learning (ML) models.
This necessitates a monitoring process, commonly referred to as sensor outlier detection or sensor anomaly detection, which flags readings with a high probability of error or data corruption, thus ensuring data quality.
Anomaly detection is primarily concerned with identifying data patterns that deviate markedly from expected behavior. Detecting anomalies poses several complex challenges due to the volume and velocity of sensor data.
Mainly, there are three causes of sensor outliers:
- Intrinsic sensor errors: This kind of error, also known as a binary failure, is associated with impaired readings or measurements from a faulty sensor. Sensors often fail suddenly and stop working without any indication of degrading performance. Such a failure feeds either no readings or null readings into the data processing algorithm.
- Sensor events: This refers to sudden changes caused by rare external situations that severely affect the sensors, thereby producing outliers.
- Intermittent sensor errors: This is a sensor failure caused mainly by sporadic events such as theft, malicious attack, or tampering with the sensor.
This post will present the pros and cons of five anomaly detection techniques: statistical, nearest-neighbor, artificial neural network, clustering, and classification-based techniques.
1. Statistical techniques
Statistical techniques use previous sensor measurements to approximate and build a stochastic distribution model of a sensor's correct behavior. Whenever the sensor registers a new measurement, the model checks whether the new data point is statistically compatible with it. If the reading is inconsistent with the model, it is marked as an outlier or an inaccurate measurement.
- Can efficiently identify all sensor faults and outliers if a proper probability distribution model is in place.
- Can detect the sensor faults and outliers using temporal correlation. Any unprecedented change in the data distribution immediately decreases the temporal correlations, thus detecting outliers.
- The parametric statistical approach is not beneficial in real-time settings, where there is often no previous sensor data.
- Non-parametric statistical models are not suitable for data-intensive applications working in real time.
- Often incurs a high computational cost when managing multivariate data.
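The parametric idea above can be sketched in a few lines, assuming normal behaviour is well approximated by a single Gaussian fitted to historical readings; the three-standard-deviation threshold and the sample values are illustrative, not prescriptive:

```python
import statistics

def fit_gaussian(history):
    """Estimate a simple Gaussian model of normal sensor behaviour."""
    return statistics.mean(history), statistics.stdev(history)

def is_outlier(reading, mean, std, k=3.0):
    """Flag a reading lying more than k standard deviations from the mean."""
    return abs(reading - mean) > k * std

# Hypothetical historical temperature readings from a healthy sensor.
history = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
mean, std = fit_gaussian(history)

print(is_outlier(20.0, mean, std))  # False: consistent with the model
print(is_outlier(35.0, mean, std))  # True: statistically incompatible
```

Real deployments would typically re-fit the model over a sliding window so that slow drift in the sensor's environment is not itself flagged as a fault.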
2. Nearest-neighbour techniques
Nearest-neighbor techniques are widely used to analyze sensor data points with respect to their nearest neighbors. The method relies explicitly on proximity, i.e., the distances between sensor measurements, to differentiate between abnormal and correct readings. The Local Outlier Factor (LOF) is a prominent nearest-neighbor algorithm that assigns a fault or outlier score to each sensor reading based on the density of measurements around its k nearest neighbors. Sensor readings with the highest scores are labeled as anomalies.
- Very simple to apply to various data types produced by multiple sensors.
- Can run unsupervised, without labeled training data.
- Computation cost increases dramatically when used on complex multivariate data.
- Scalability of models can be a concern.
- Often, it produces a high false-negative rate for sensor faults and outlier detection.
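As a rough illustration of the proximity idea, the sketch below scores each reading by its mean distance to its k nearest neighbours. This is a simplified distance-based score, not the full LOF density ratio, and the readings are invented for the example:

```python
def knn_outlier_scores(points, k=3):
    """Score each 1-D point by the mean distance to its k nearest neighbours.
    Higher scores indicate readings far from their neighbourhood."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Five readings clustered near 10, plus one suspicious spike.
readings = [10.0, 10.2, 9.9, 10.1, 10.3, 25.0]
scores = knn_outlier_scores(readings)

print(scores.index(max(scores)))  # 5: the 25.0 reading scores highest
```

LOF proper goes one step further and compares each point's local density with that of its neighbours, which makes it robust when clusters of different densities coexist.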
3. Artificial Neural Network techniques
Neural networks and fuzzy logic are more recent approaches to detecting sensor faults and outliers. The neural network technique is a learned model that aids the decision-making process by analyzing the whole sensor data set. In contrast, the fuzzy logic technique allows gradual transition values between binary extremes (right/wrong, yes/no, high/low) to demarcate between correct and abnormal sensor readings.
- As the model’s inherent behavior generalizes data points, it can be used when the sensors produce poor, noisy, and incomplete data.
- There is limited, and sometimes no, need to re-train the model when new sensor data is added.
- Requires fine-tuning and simulation before it can operate in real time.
- When the model is rules-based, the number of rules can grow exponentially as the number of sensor data variables increases.
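A common neural-network variant of this idea is the autoencoder: readings that reconstruct poorly through a learned bottleneck are flagged as anomalies. The sketch below illustrates only the reconstruction-error principle with a fixed linear encoder/decoder; a trained autoencoder would learn the `direction` weights from data, and the two-sensor example is invented:

```python
import math

def reconstruct(x, direction):
    """Project a 2-D reading onto a 1-D bottleneck and decode it back.
    `direction` stands in for learned encoder/decoder weights."""
    code = x[0] * direction[0] + x[1] * direction[1]   # encode
    return (code * direction[0], code * direction[1])  # decode

def reconstruction_error(x, direction):
    """Distance between a reading and its bottleneck reconstruction."""
    r = reconstruct(x, direction)
    return math.hypot(x[0] - r[0], x[1] - r[1])

# Two sensors that normally move together: readings lie near the line y = x.
direction = (1 / math.sqrt(2), 1 / math.sqrt(2))  # unit vector along y = x

print(reconstruction_error((5.0, 5.1), direction))   # small: normal pair
print(reconstruction_error((5.0, -4.8), direction))  # large: anomalous pair
```

The anomalous pair stands out because it violates the correlation the bottleneck encodes, even though each individual value is in a plausible range, which is precisely what makes this family of methods useful on noisy multivariate data.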
4. Cluster techniques
Cluster-based analysis is a subset of the proximity techniques and a popular approach in data mining. It partitions the data into clusters of similar sensor data points: each cluster contains data points similar to one another and different from the data points in other clusters. The initial sensor readings are first used to create the clusters. New measurements that are assigned to small, remote clusters, or that lie very far from the primary cluster's centroid, are marked as abnormal readings.
- The model is easily adapted to an incremental form: once the initial clusters exist, new data points can be inserted and tested for sensor faults and outliers.
- No supervision necessary.
- Well suited for detecting sensor anomalies in temporal data.
- Computationally costly when working on multivariate sensor data.
- Unsuitable for resource-constrained sensors due to the model's high computational cost.
- Cannot cope with sudden changes in the data.
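A minimal single-cluster sketch of the approach: the initial, trusted readings define a centroid and a radius, and a new measurement far outside that cluster is flagged. Real systems would use an algorithm such as k-means with several clusters; the `slack` factor and the readings here are illustrative:

```python
def centroid(points):
    """Centre of the cluster formed by the initial readings."""
    return sum(points) / len(points)

def radius(points, c):
    """Maximum distance from the centroid among the trusted readings."""
    return max(abs(p - c) for p in points)

def is_anomalous(reading, c, r, slack=1.5):
    """Flag a new reading that lands well outside the cluster."""
    return abs(reading - c) > slack * r

# Initial readings used to build the cluster (invented for the example).
initial = [21.0, 20.6, 21.3, 20.9, 21.1]
c = centroid(initial)
r = radius(initial, c)

print(is_anomalous(21.2, c, r))  # False: inside the cluster
print(is_anomalous(30.0, c, r))  # True: far from the centroid
```

The incremental property noted above corresponds to accepted readings being folded back into the cluster, updating `c` and `r` as data arrives.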
5. Classification techniques
Classification techniques are precise methods in data mining and machine learning. They aim to learn a classification model (classifier) from a collection of labeled sensor data points (training points) and then assign unseen data instances to one of the learned classes (normal/outlier). Although the technique requires constant updating to accommodate new sensor data belonging to the normal class, it is well suited to fault and outlier detection, since it operates under the common assumption that a classifier distinguishing the normal and outlier classes can be learned from a given feature space.
- Not dependent on a statistical model or estimated data parameters.
- Can achieve high identification rates for sensor faults and outliers.
- Often used on multidimensional data to detect sensor outliers and faults.
- Computationally complex compared to clustering and statistical techniques.
- The model needs re-training as new data points arrive.
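The supervised idea can be sketched with the simplest possible classifier, a 1-nearest-neighbour rule over labeled training points. The labeled readings are hypothetical, and production systems would more likely use a model such as an SVM or decision tree trained on richer features:

```python
def classify(reading, training):
    """1-nearest-neighbour classifier over labelled (value, label) pairs."""
    nearest = min(training, key=lambda pair: abs(pair[0] - reading))
    return nearest[1]

# Labelled training points: normal operating range vs. known fault readings.
training = [
    (20.0, "normal"), (20.5, "normal"), (19.8, "normal"),
    (0.0, "outlier"),   # stuck-at-zero fault
    (99.9, "outlier"),  # saturated / tampered reading
]

print(classify(20.2, training))  # normal
print(classify(87.0, training))  # outlier
```

The re-training cost noted above shows up here directly: every newly confirmed normal or faulty reading must be appended to `training` for the classifier to keep pace with the sensor's behaviour.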