Data plays a fundamental role in just about all of our lives. A massive amount of digital information, called big data, is rapidly generated every day in complex structured and unstructured datasets from a wide variety of sources, including sensors, IoT devices, machines, organizations, and social media.
By harnessing this torrent of big data, machine learning has driven advances in many domains, including computer vision, natural language processing (NLP), and automatic speech recognition (ASR), delivering robust systems ranging from driverless cars, voice-activated personal assistants, and automated translation to malware filtering and medical diagnosis.
Unlike traditional statistical models such as multiple linear regression, machine learning methods such as random forests (RF) and artificial neural networks (ANN) show superior predictive performance and accuracy when applied to large datasets, uncovering hidden relationships and insights.
Despite these benefits, machine learning comes with unique challenges in terms of overall scaling and reproducibility once you start using it for real prediction or classification purposes. This post will explain some of the machine learning implementation challenges that organizations encounter in their deployments.
1. Processing Performance
The major challenge encountered in computations with big data comes from the sheer scale, or volume, of the data and the resulting computational complexity. As the scale grows, even trivial operations become expensive.
For example, the standard support vector machine (SVM) algorithm has O(m³) training time complexity and O(m²) space complexity, where m is the number of training samples. Therefore, an increase in m has a drastic effect on the time and memory required to train the SVM algorithm, and training may even become computationally infeasible on very large datasets.
Several other ML algorithms exhibit similarly high time complexity. The time required to perform the computations therefore grows rapidly, often polynomially, with the size of the data and may render these algorithms unusable for large datasets.
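A quick back-of-the-envelope sketch makes the O(m³) figure concrete. The baseline timing below is a hypothetical number chosen for illustration, not a benchmark:

```python
# Back-of-the-envelope sketch: how O(m^3) training time scales with the
# number of training samples m, using the SVM complexity quoted above.

def projected_time(base_seconds, base_m, new_m, exponent=3):
    """Project runtime for new_m samples from a measured baseline,
    assuming cost grows as m ** exponent."""
    return base_seconds * (new_m / base_m) ** exponent

# Suppose training on 10,000 samples took 5 seconds (hypothetical figure).
t_100k = projected_time(5.0, 10_000, 100_000)    # 10x the data
t_1m   = projected_time(5.0, 10_000, 1_000_000)  # 100x the data

print(f"100k samples: ~{t_100k:,.0f} s")  # 10**3  = 1,000x slower
print(f"1M samples:   ~{t_1m:,.0f} s")    # 100**3 = 1,000,000x slower
```

A 10x increase in data thus means a 1,000x increase in projected training time, which is why cubic-time algorithms hit a wall long before storage does.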
Moreover, as the data size increases, the performance of the algorithms becomes more dependent on the architecture built to store and move data. With the growth in data size, parallel data structures, data partitioning and placement, and data reuse become more important. One example of a newer abstraction for in-memory computations on large clusters is the resilient distributed dataset (RDD), implemented within the Spark cluster computing framework. Data size, therefore, not only affects performance but also necessitates rethinking the typical architecture used to implement and develop algorithms.
2. Curse of Modularity
Many algorithms rely on the assumption that the data being processed can be stored entirely in memory or on a disk in a single file. Multiple classes of algorithms are designed around strategies and building blocks that depend on the validity of this assumption. However, when data size invalidates this premise, it affects whole families of algorithms. This challenge is termed the curse of modularity.
MapReduce, which is a scalable programming paradigm for processing large datasets by means of parallel execution on a large number of nodes, is one of the approaches proposed as a solution for this curse. Some algorithms in machine learning are inherently parallel and can be adapted to the MapReduce paradigm. On the other hand, others are hard to decompose in a way that can leverage large numbers of computing nodes.
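The MapReduce paradigm itself is simple enough to sketch in a few lines. The single-process word count below is only an illustration of the dataflow; a real deployment (Hadoop, Spark) would distribute the map and reduce tasks across many nodes:

```python
from collections import defaultdict
from itertools import chain

# Minimal single-process sketch of the MapReduce pattern (word count).

def map_phase(document):
    # map: emit (key, value) pairs from one input record
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # shuffle: group emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: aggregate each key's values independently
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big models", "big clusters"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 1, 'models': 1, 'clusters': 1}
```

Because each map call sees only one record and each reduce call sees only one key's values, both phases parallelize trivially. Iterative algorithms break this pattern: they need the output of one pass as the input of the next, which is exactly the disconnect described below.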
The three main algorithm categories that encounter the modularity curse when attempting to use the MapReduce paradigm are iterative graph algorithms, gradient descent, and expectation-maximization algorithms. Their iterative nature, along with their reliance on in-memory data, creates a disconnect with the parallel and distributed nature of MapReduce. This leads to problems adapting these algorithm families to MapReduce or to other distributed computing paradigms. Consequently, while some algorithms, such as k-means, can be adapted to overcome the curse of modularity through parallelization or distributed computing, others remain bound to certain paradigms or are simply incompatible with them.
3. Class Imbalance
The assumption that the data is distributed uniformly across all classes is often broken as datasets grow bigger. This leads to a challenge referred to as class imbalance: the performance of a machine learning algorithm can be negatively affected if datasets contain data from classes with different probabilities. The problem is particularly prominent when there are many samples in some classes and very few in others. Class imbalance is not exclusive to Big Data and has been the subject of research for more than a decade.
The severity of the imbalance problem depends on task complexity, the degree of class imbalance, and the overall size of the training set. Research suggests that, given enough data, there is a good chance each class is represented by a reasonable number of samples; however, to confirm this observation, evaluations on real-world Big Data sets are needed.
On the other hand, the complexity of big data tasks is expected to be high, which could compound the effects of class imbalance. This challenge can be expected to be more common, severe, and complex in the Big Data context, because the extent of imbalance has immense potential to grow with increasing data size. Decision trees, neural networks, and support vector machine algorithms are all susceptible to class imbalance.
Therefore, their unaltered execution in the Big Data context without addressing class imbalance may produce poor results. Consequently, in the Big Data context, due to data size, the probability of class imbalance is high. Also, because of the complex problems embedded in such data, the potential effects of class imbalance on machine learning are severe.
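One common mitigation is to weight each class inversely to its frequency, so that errors on rare classes count more during training. The sketch below uses the same formula that libraries such as scikit-learn apply for `class_weight="balanced"`; the toy label set is made up for illustration:

```python
from collections import Counter

# Sketch: inverse-frequency class weights, a common mitigation for
# class imbalance. Weight for class c = n_samples / (n_classes * count_c).

def balanced_weights(labels):
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# A heavily imbalanced toy label set: 9 negatives, 1 positive.
labels = [0] * 9 + [1]
weights = balanced_weights(labels)
print(weights)  # {0: 0.555..., 1: 5.0} -- rare class weighted ~9x higher
```

Oversampling the minority class (or undersampling the majority) is an alternative with a similar effect; either way, the key point is that imbalance must be addressed explicitly rather than left to the algorithm's defaults.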
4. Curse of Dimensionality
Another issue related to the Big Data volume is the curse of dimensionality, which refers to the difficulties encountered when working in high-dimensional space. Dimensionality describes the number of features (attributes) present in the dataset. For a training set of fixed size, the predictive ability and effectiveness of an algorithm decrease as the dimensionality increases.
Therefore, as the number of features increases, the performance and accuracy of machine learning algorithms degrade. This can be explained by the breakdown of the similarity-based reasoning upon which many machine learning algorithms rely. Unfortunately, the more data available to describe a phenomenon, the greater the potential for high dimensionality, because there are more prospective features. Consequently, as the volume of Big Data increases, so does the likelihood of high dimensionality.
Besides, dimensionality affects processing performance: the time and space complexity of machine learning algorithms is closely related to the dimensionality of the data. Many ML algorithms have a time complexity that is polynomial in the number of dimensions.
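The breakdown of similarity-based reasoning can be demonstrated directly. In the sketch below (random uniform points, parameters chosen for illustration), the relative gap between a query point's nearest and farthest neighbours collapses as the dimension grows, so "nearest" stops being meaningful:

```python
import random

# Sketch of distance concentration, one symptom of the curse of
# dimensionality: in high dimensions, the nearest and farthest
# neighbours of a query point become almost equally distant.

def distance_spread(dim, n_points=200, seed=0):
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(sum((a - b) ** 2 for a, b in zip(query, p)) ** 0.5)
    # relative contrast: (farthest - nearest) / nearest
    return (max(dists) - min(dists)) / min(dists)

print(f"2 dims:    spread = {distance_spread(2):.2f}")
print(f"1000 dims: spread = {distance_spread(1000):.2f}")  # far smaller
```

With only 2 dimensions the nearest neighbour is dramatically closer than the farthest; with 1000 dimensions all points sit at nearly the same distance, which is exactly why k-nearest-neighbour style methods degrade.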
5. Feature Engineering
High dimensionality is closely related to another volume challenge: feature engineering. Feature engineering is the process of creating features, typically using domain knowledge, to make machine learning models perform better. Indeed, one of the most time-consuming pre-processing tasks in machine learning is selecting the most appropriate features. As the dataset grows vertically and horizontally, identifying new, highly relevant features becomes more difficult.
Consequently, as with dimensionality, as the size of the dataset increases, so do the difficulties associated with feature engineering. Feature engineering is related to feature selection: whereas feature engineering creates new features to improve learning outcomes, feature selection (dimensionality reduction) aims to select the most relevant features.
Although feature selection reduces dimensionality and has the potential to reduce training time, it is challenging in high dimensions due to spurious correlations and incidental endogeneity. Overall, both feature selection and feature engineering remain very relevant in the Big Data context, but, at the same time, they become more complex.
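As a minimal illustration of feature selection, the sketch below drops features whose variance falls below a threshold, since near-constant columns carry little signal. This is only the simplest heuristic; real pipelines combine it with stronger criteria such as mutual information or model-based importance scores, and the toy data and threshold here are made up:

```python
# Minimal sketch of variance-threshold feature selection:
# discard columns whose variance is below a chosen threshold.

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def select_features(rows, threshold=0.01):
    """rows: list of equal-length feature vectors.
    Returns the indices of the columns worth keeping."""
    n_features = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_features)]
    return [j for j, col in enumerate(columns) if variance(col) > threshold]

rows = [
    [1.0, 5.0, 0.30],
    [2.0, 5.0, 0.31],
    [3.0, 5.0, 0.29],
]
print(select_features(rows))  # [0] -- columns 1 and 2 are (near-)constant
```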
6. Non-Linearity
Data size poses challenges to the application of standard methodologies used to evaluate dataset characteristics and algorithm performance. Indeed, the validity of many metrics and techniques relies upon a set of assumptions, including the prevalent assumption of linearity. For example, the correlation coefficient is often cited as a good indicator of the strength of the relationship between two variables.
However, the value of the coefficient is only entirely meaningful if a linear relationship exists between these variables. The performance of neural networks and logistic regression is very negatively affected by non-linearity. Although this problem is not exclusive to Big Data, non-linearity can be expected to be more prominent in large datasets. The challenge of non-linearity in Big Data also stems from the difficulties associated with evaluating linearity.
Linearity is often evaluated using graphical techniques such as scatterplots; however, in the case of Big Data, the large number of points often creates a large cloud, making it challenging to observe relationships and assess linearity. Therefore, both the difficulty of determining linearity and nonlinearity pose challenges to the execution of machine learning algorithms in the context of Big Data.
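The linearity caveat about the correlation coefficient is easy to demonstrate. In the sketch below, y is perfectly determined by x through a non-linear relationship, yet Pearson's r over a symmetric range is essentially zero:

```python
# Sketch: the correlation coefficient can completely miss a strong
# non-linear relationship. Here y = x^2 is perfectly determined by x,
# yet Pearson's r over a symmetric range is ~0.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

xs = [x / 10 for x in range(-50, 51)]  # -5.0 .. 5.0
ys = [x ** 2 for x in xs]              # perfect (non-linear) dependence
print(f"r = {pearson_r(xs, ys):.4f}")  # ~0.0
```

A naive reading of r here would conclude the variables are unrelated, which is precisely the kind of error that becomes harder to catch when scatterplots are no longer informative.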
7. Bonferroni's Principle
Bonferroni's principle embodies the idea that if one looks within a large enough amount of data for a specific type of event, the likelihood of finding that event is high even when it occurs purely by chance. More often than not, such occurrences are bogus: they have no underlying cause and are therefore meaningless artifacts of the dataset. This statistical challenge is also often described as spurious correlation.
In statistics, the Bonferroni correction provides a means to avoid such false positives. It suggests that if testing m hypotheses with a desired significance of α, each hypothesis should be tested at a significance level of α/m. However, the incidence of such phenomena increases with data size: as datasets grow larger, finding an event of interest, legitimate or not, becomes increasingly likely.
Given a large enough volume, most correlations tend to be spurious. Therefore, including a means of preventing those false positives is essential to consider in the context of machine learning with Big Data.
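The correction itself is a one-liner; the interesting part is how aggressive it becomes at scale. The p-values below are made up to illustrate the effect:

```python
# Sketch of the Bonferroni correction described above: with m hypotheses
# and a desired family-wise significance level alpha, each individual
# test is held to the stricter threshold alpha / m.

def bonferroni_threshold(alpha, m):
    return alpha / m

def significant(p_values, alpha=0.05):
    """Return the indices of tests that survive the correction."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < cutoff]

# 1000 tests: uncorrected, p < 0.05 would flag ~50 tests by chance
# alone; the corrected per-test cutoff is 0.05 / 1000 = 5e-5.
p_values = [0.00001, 0.003, 0.04] + [0.5] * 997
print(significant(p_values))  # [0] -- only the first test survives
```

Results at p = 0.003 or p = 0.04 that would look "significant" in isolation are rejected once the number of hypotheses is taken into account, which is the behaviour you want when mining large datasets for patterns.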
8. Variance and Bias
Machine learning relies on the idea of generalization; representations can be generalized to allow for analysis and prediction through observations and manipulations of data. An error in generalization may be broken down into two components: variance and bias.
Variance describes a learner's tendency to learn random things irrespective of the true signal, while bias describes its tendency to consistently learn the same wrong thing. Ideally, to get accurate output, both the variance and the bias components of the error should be minimized.
As data volume increases, however, the learner may fit the training set too closely and become unable to generalize adequately to new data. Therefore, caution must be taken when dealing with big data, as bias may be introduced, thereby compromising the ability to generalize.
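The two error components can be measured empirically by resampling training sets and watching how an estimator's predictions move. The toy below (estimating a known mean, with hypothetical estimators chosen for contrast) is a minimal sketch of that decomposition:

```python
import random

# Toy sketch of the bias/variance decomposition: estimate a known true
# mean from noisy samples with three simple estimators, and measure
# each error component over many resampled training sets.

TRUE_MEAN = 2.0

def draw_sample(rng, n=20):
    return [TRUE_MEAN + rng.gauss(0, 1) for _ in range(n)]

def decompose(estimator, trials=2000, seed=42):
    rng = random.Random(seed)
    estimates = [estimator(draw_sample(rng)) for _ in range(trials)]
    avg = sum(estimates) / trials
    bias_sq = (avg - TRUE_MEAN) ** 2              # systematic error
    var = sum((e - avg) ** 2 for e in estimates) / trials  # scatter
    return bias_sq, var

mean_est   = lambda xs: sum(xs) / len(xs)   # unbiased, low variance
single_est = lambda xs: xs[0]               # unbiased, high variance
shrunk_est = lambda xs: 0.5 * mean_est(xs)  # biased, lower variance

for name, est in [("mean", mean_est), ("single", single_est),
                  ("shrunk", shrunk_est)]:
    b, v = decompose(est)
    print(f"{name:>6}: bias^2={b:.3f}  variance={v:.3f}")
```

The sample mean keeps both components small, the single-sample estimator pays in variance, and the shrunken estimator pays in bias; a good learner, like a good estimator, has to keep both in check at once.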