AI in robotics: How machine learning works in collaborative robots

One of the critical problems in collaborative robot (cobot) design lies in providing the machine with an understanding of its world. The robot needs to interact with humans in a shared space and must be able to differentiate the objects it needs to pick up and move from the people who may be working alongside it. It should detect obstacles and dangers and understand their nature so it can react to each situation appropriately.

Although it is possible to build rules-based models to guide an autonomous system, this approach has proven difficult in complex situations and applications such as cobots in factory or warehouse environments or delivery robots.

Machine learning provides an alternative path to achieving a solution. The key application of machine learning in robot design is perception, i.e., to provide the robot with the ability to react appropriately to the input from cameras and sensors that image the 3D landscape around it.

Sensory artificial intelligence provides the robot with the ability to recognize objects in the surrounding environment. Using that understanding, the robot can use pattern matching to learn appropriate behaviors from past experience. And it can learn to handle new situations as they arise through reinforcement learning techniques. This kind of AI is already becoming a routine part of daily life.

Devices like Amazon’s Alexa, Google’s OK Google, and many other web services depend on complex machine-learning algorithms running on servers in the cloud. Machine learning has also been successfully demonstrated in numerous other applications, from drones that can follow paths through a forest to self-driving vehicles that are reliable enough to be allowed to run in trials on city streets.

How does machine learning work in robotics?

In the more than 50 years since its inception, machine learning has given rise to many different approaches. The foundation of all machine-learning technologies is the same: they take in data, train a model on that data, and then use the derived model to make predictions on new data.

Training a model is a learning process where the model is exposed to unfamiliar data at each step and is asked to make predictions. Feedback from these predictions is used to alter the model so that, over time, the model improves.

Often, the model adjustments made for new data worsen the performance compared to prior samples. Therefore, it takes multiple iterations over the training set to achieve consistent performance. Typically, training stops when the model’s predictions reach a point at which the error does not improve – which may be a local or, ideally, a global minimum. As a result, machine learning has strong links to optimization techniques such as linear regression in which a curve is fitted to a set of data points.
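To make the link to curve fitting concrete, the short sketch below fits a straight line to noisy points using gradient descent. The data, learning rate, and iteration count are illustrative choices, not values taken from any particular robot application.

```python
import numpy as np

# Toy data: points scattered around the line y = 2x + 1
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)

# Model parameters (slope and intercept), started at arbitrary values
w, b = 0.0, 0.0
learning_rate = 0.5

for epoch in range(200):
    y_pred = w * x + b                  # forward pass: current predictions
    error = y_pred - y                  # prediction error on the training set
    # Gradients of the mean-squared error with respect to w and b
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Gradient-descent step: move the parameters against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"fitted line: y = {w:.2f}x + {b:.2f}")   # should approach y = 2x + 1
```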

Supervised vs. unsupervised learning

An important distinction to understand here is between supervised and unsupervised learning. In unsupervised learning, the model is provided with unlabeled data and asked to segment the elements into groups. A common algorithm used for this purpose is k-means clustering. The algorithm works iteratively to assign each data point to one of several clusters. It does this by first estimating centroids for each cluster – often by an initial random selection – and then refining them based on the distances between the data points and the centroids until it determines the most likely clustering.
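A minimal k-means implementation illustrates the alternation between assigning points to the nearest centroid and recomputing the centroids. The function name, data, and parameters below are invented for illustration.

```python
import numpy as np

def k_means(points, k, iterations=20, seed=0):
    """Minimal k-means: assign points to the nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Initial centroids: a random selection of the input points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Distance from every point to every centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)       # nearest-centroid assignment
        for i in range(k):                      # move centroids to the cluster means
            if np.any(labels == i):
                centroids[i] = points[labels == i].mean(axis=0)
    return centroids, labels

# Example: two loose groups of 2D points
pts = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centers, assignments = k_means(pts, k=2)
print(centers)
```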

In robotics, k-means and similar unsupervised clustering approaches have been used to support the automated mapping of unknown spaces by groups of robots. However, supervised learning is currently the most common form of machine learning being applied in research and production robots for perception-based tasks.

Deep learning and artificial neural networks

Until recently, one of the most successful techniques for image-recognition tasks was the support vector machine (SVM). This technique is similar to clustering but works with data that has been labeled into two or more classes. The SVM’s job is to determine the parameters that will allow the model to place unlabeled data into the most appropriate class. Although SVMs were used in research for applications such as autonomous vehicles in the late 1990s and early 2000s, their use has mostly given way to deep learning. Deep learning is a modification of the artificial neural network (ANN) technology that was highly publicized in the 1980s and 1990s. That technology drew on theories developed more than half a century earlier, themselves inspired by the biology of the animal brain.

In a traditional ANN design, artificial neurons are arranged in a small number of layers – an input layer, a hidden layer, and an output layer. Each neuron in the hidden layer takes in data from every neuron in the input layer, performs a weighted sum, and applies an activation function, such as the hyperbolic tangent or logistic function, before passing the result to the output layer.
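The weighted sum and activation of a hidden layer can be written in a few lines. The layer sizes and input values below are arbitrary examples, not part of any particular design.

```python
import numpy as np

# Forward pass of one hidden layer: weighted sum followed by a tanh activation.
# Shapes are illustrative: 4 input neurons feeding 3 hidden neurons.
inputs = np.array([0.5, -1.2, 0.3, 0.8])       # values from the input layer
weights = np.random.randn(3, 4) * 0.1          # one weight per input, per hidden neuron
biases = np.zeros(3)

hidden = np.tanh(weights @ inputs + biases)    # weighted sum + activation function
print(hidden)                                  # values passed on to the output layer
```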

The network’s training is typically performed using backpropagation, an approach to optimization and error reduction that works from the output back to the input – giving the technique its name. Backpropagation calculates the gradient of the error. This gradient is used to perform gradient descent to find a set of weight values that are more likely to reduce the error during each epoch of training.
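As a rough sketch of how backpropagation and gradient descent fit together, the example below trains a tiny two-layer network on the XOR problem. The layer sizes, learning rate, and epoch count are illustrative choices; NumPy is assumed to be available.

```python
import numpy as np

# Tiny 2-4-1 network trained on the XOR problem with plain gradient descent.
rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=1.0, size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(scale=1.0, size=(4, 1))
b2 = np.zeros((1, 1))
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(5000):
    # Forward pass: weighted sums and activations, layer by layer
    h = np.tanh(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: propagate the error gradient from the output to the input
    d_out = (out - y) * out * (1.0 - out)     # gradient at the output layer
    d_h = (d_out @ W2.T) * (1.0 - h ** 2)     # gradient at the hidden layer

    # Gradient-descent updates to the weights and biases
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2))   # predictions should approach [0, 1, 1, 0]
```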

This approach to ANNs showed early promise. But the need for intensive computing resources to perform backpropagation and its inability to compete with the SVM meant that the ANN slipped into relative obscurity. That situation began to reverse with the reinvigoration of deep networks – ANNs with more than one hidden layer – a concept first proposed in the 1960s that foundered because optimizing the network weights proved extremely difficult.

A critical development was applying a more efficient approach to training and backpropagation developed by Geoffrey Hinton and Ruslan Salakhutdinov, working at the University of Toronto in the mid-2000s. The development was aided by the massive improvement in compute performance compared to the early 1990s, first with multi-core CPUs and then with GPUs. Increases in model performance came with the application of refinements to the fully connected architecture proposed over the previous two decades. One was to introduce convolutional layers interspersed between fully connected layers.

The feature map applied by each convolutional layer can be regarded as a filter. Convolutions of this kind are frequently used in image processing to blur images or to find sharp edges. They also provide a way of converting data in the spatial domain to a representation based on the frequency domain, in which superimposed waves form the overall image. As a result, convolutions make it possible to convert pixel arrays into collections of features that can be worked on independently by the following layers.
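A hand-written 2D convolution shows how a small kernel acting as an edge detector responds to a brightness boundary. The kernel and test image below are made up for illustration.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small kernel (feature map) across the image, taking a
    weighted sum at every position -- the core operation of a conv layer."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(patch * kernel)
    return out

# A vertical-edge detector: responds strongly where brightness changes left-to-right
edge_kernel = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])

image = np.zeros((8, 8))
image[:, 4:] = 1.0                         # dark half / bright half test image
print(convolve2d(image, edge_kernel))      # large magnitudes appear along the edge
```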

In contrast to the conventional use of convolution in image processing, the feature maps are learned as part of the ANN training process. This makes it possible for the model to adapt to differences in the training set, making it easier to distinguish between examples. For example, feature maps tuned to detect differences in shape will be most appropriate for general image-recognition tasks. Feature maps optimized for color will be favored in situations where the objects to be separated have similar shapes but are differentiated by their surface attributes.

One significant advantage of the convolutional layer is compute efficiency. It is easier to implement in an ANN as it employs far fewer connections per neuron than fully connected layers and maps readily to GPUs and other parallel-processing architectures with single-instruction, multiple-data (SIMD) arithmetic units. Another attribute of convolutional layers is that the design resembles the neurons’ organization in the organic brain’s visual cortex, which is different from the more highly connected regions used for cognition.

Convolutional neural network (CNN)

Multiple convolutional layers are often used in series in deep-learning architectures. Each successive layer filters the image for increasingly abstract content. In a convolutional neural network (CNN), a set of convolution layers is often followed by a pooling layer. These pooling layers combine the outputs from multiple neurons to produce a single output – creating a sub-sampling effect – that can be fed to various inputs in the following layer. This pooling has the effect of concentrating information and steering it to the most appropriate set of neurons that follow. The benefit of pooling layers is that they improve the performance of recognition operations on images where important features may move around within the input. For example, a person’s face may move around in the image field as the robot approaches. Pooling layers help ensure that features activated by the shape and color consistent with those of a face are steered towards neurons that can perform a more detailed analysis. Training on images in which faces are offset and rotated helps build the connections between the most appropriate neurons.

There are different kinds of pooling operations. A max-pooling layer, for example, takes the maximum value from its inputs and passes that on. The highly influential AlexNet entry to the ImageNet LSVRC-2010 contest employed these structures. AlexNet comprised five convolution layers, three fully connected layers, and three max-pooling stages.
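A minimal max-pooling operation can be expressed directly in NumPy; the 4x4 grid of activations below is an invented example.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """2x2 max pooling: keep only the strongest activation in each block,
    halving the width and height of the feature map."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size        # trim to a multiple of the pool size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

activations = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 1, 5, 6],
                        [2, 2, 7, 3]], dtype=float)
print(max_pool(activations))
# [[4. 2.]
#  [2. 7.]]
```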

A further improvement in training performance came with the adoption of stochastic gradient descent (SGD) as the mechanism for calculating gradients during backpropagation. This was primarily a choice made for computational efficiency, as it uses a small subset of the training data to estimate gradients. However, the random-walk effect of SGD also helps move the optimization towards a good global minimum faster and more reliably than previous techniques. Not long after deep-learning architectures were first employed, researchers at IDSIA in Switzerland showed that the machines could outperform humans on recognition tasks. In one experiment, a CNN could correctly identify heavily damaged road signs because it could make use of visual features that humans would generally ignore. However, this ability to use non-obvious features can be a weakness with current approaches based on ANNs.
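Compared with the full-batch gradient descent sketched earlier, the only change in SGD is that each update estimates the gradient from a small random subset of the data. The batch size, learning rate, and toy data below are illustrative assumptions.

```python
import numpy as np

# Mini-batch SGD sketch: gradients are estimated on a random subset of the
# training data at each step rather than on the full set.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 1000)
y = 3.0 * x - 0.5 + rng.normal(scale=0.05, size=x.shape)    # noisy target line

w, b = 0.0, 0.0
lr, batch_size = 0.2, 32

for step in range(2000):
    idx = rng.choice(len(x), size=batch_size, replace=False)   # random mini-batch
    xb, yb = x[idx], y[idx]
    error = (w * xb + b) - yb
    w -= lr * 2.0 * np.mean(error * xb)     # noisy gradient estimate for w
    b -= lr * 2.0 * np.mean(error)          # noisy gradient estimate for b

print(f"w = {w:.2f}, b = {b:.2f}")          # should approach 3.0 and -0.5
```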

Researchers have found in recent years that merely changing a single pixel in an image can cause a network to provide the wrong classification. Analysis of the weights chosen by one CNN indicated that, in trying to classify cats, the network had learned to use unrelated markings in some training images as part of the identification. Networks will also sometimes claim a successful classification for an image that contains only noise.

The architecture of a CNN should be chosen to fit the application. There is no one-size-fits-all architecture. Decisions about the number and order of convolutional, pooling, and fully connected layers significantly impact performance. And the feature-map and kernel sizes for each of the convolutional layers provide trade-offs between performance, memory usage, and compute resources.
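As a rough illustration of how those architectural decisions are expressed in practice, the TensorFlow/Keras sketch below stacks convolution, pooling, and fully connected layers. The input size, layer counts, kernel sizes, and class count are placeholder choices, not a recommended recipe.

```python
import tensorflow as tf

# One plausible arrangement of convolution, pooling, and fully connected layers
# for a small image classifier; all sizes here are illustrative.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(2),                  # sub-sample the feature maps
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),     # fully connected stage
    tf.keras.layers.Dense(10, activation="softmax"),  # e.g. 10 object classes
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
model.summary()   # parameter counts hint at memory and compute cost
```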

The basic CNN’s classical feedforward architecture is far from being the only option, particularly as deep learning moves from classification tasks to control. Feedback is becoming an element of the design in applications such as voice recognition. Recurrent neural networks use feedback loops. Memory networks use elements other than neurons to hold temporary data that can be used to store contextual information. Such memory is likely to be useful in applications that call for a degree of planning, including systems that control robot behavior and motion. Another option is the adversarial architecture, based on two linked networks. The competition between them helps avoid the risk of a single network making fundamental mistakes. As the technology continues to develop, we can expect other novel architectures to emerge.

Supervised learning differs from organic learning in that training and execution occur in different phases: the network does not typically learn as it runs. However, to ensure the system can meet new challenges, it can be essential to perform additional training sessions on recorded data, particularly when the system flags situations that led to errors or poor performance.

For control of the core robot functions, reinforcement learning is often employed. This rewards the robot during training for ‘good’ behavior and penalizes poor decisions. In contrast to simple image-classification tasks, forward planning is a vital component of the process. This calls for the use of discounting techniques to tune rewards for decisions made in a given state. With a discount factor of 0.5, for example, a reward will be worth just one-eighth of its original value after three state changes. This will cause the machine-learning network to pursue near-term rewards. A higher discount factor will push the network to consider longer-term outcomes.
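The effect of the discount factor can be checked with a few lines of arithmetic; the reward sequence below is a made-up example.

```python
# Discounted return: rewards further in the future count for less.
def discounted_return(rewards, gamma):
    """Sum of rewards, each multiplied by gamma raised to its delay in steps."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 0.0, 1.0]            # a single reward arriving after three steps

print(discounted_return(rewards, 0.5))    # 0.125 -> the reward is worth one-eighth
print(discounted_return(rewards, 0.9))    # 0.729 -> longer-term outcomes matter more
```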

Machine learning model training

A key question for designers of robots is where training occurs. The separation of training from the inferencing needed during execution provides an opportunity to offload the most compute-intensive part of the problem to remote servers. Inferencing can take place in real time using less hardware while servers perform training updates in batches overnight. The cloud environment provides access to standard tools such as Caffe and TensorFlow that can be used to design, build, and test different CNN strategies.

With a hardware platform optimized for inferencing, designers can take advantage of some features of the CNN architecture to improve processing efficiency. Typically, the backpropagation calculations used during training demand high-precision floating-point arithmetic to keep errors to a minimum. The processes of normalization and regularization reduce the size of individual weights on each neuronal input. These steps are needed to prevent a small number of nodes from developing strong weights that reduce overall performance. As a result of normalization, some weights will shrink to insignificant levels and may diminish to zero during optimization. In the runtime application, these calculations can be dropped entirely.

For many of the interneural connections with low significance, the weighted-sum calculation can tolerate the increased error introduced by low-precision fixed-point arithmetic. Often, 8-bit fixed-point arithmetic is sufficient. And, for some connections, 4-bit resolution has been found not to increase errors significantly. This favors hardware platforms that offer high flexibility over numeric precision. Many microprocessors with SIMD execution units can handle low-precision arithmetic operations in parallel. Field-programmable gate arrays (FPGAs) provide the ability to fine-tune arithmetic precision. An upcoming generation of coarse-grained reconfigurable arrays (CGRAs) optimized for deep learning will provide an intermediate solution between microprocessors and FPGAs. They will help improve performance and make AI-enabled robots and cobots more feasible.
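A rough sketch of post-training quantization shows the idea: near-zero weights are pruned, and the remaining values are mapped to 8-bit integers with a single scale factor. The threshold and scaling scheme here are illustrative assumptions, not the procedure of any specific toolchain.

```python
import numpy as np

# Sketch of post-training quantization: map trained floating-point weights
# onto 8-bit integers plus one scale factor, and prune near-zero weights.
weights = np.random.randn(4, 4).astype(np.float32) * 0.2   # stand-in trained weights

pruned = np.where(np.abs(weights) < 1e-3, 0.0, weights)    # drop negligible weights

scale = np.abs(pruned).max() / 127.0                       # one scale for the tensor
q_weights = np.round(pruned / scale).astype(np.int8)       # 8-bit integer weights

# At run time, the weighted sum uses the int8 values; the scale is applied once.
x = np.random.randn(4).astype(np.float32)
approx = (q_weights.astype(np.float32) @ x) * scale
exact = pruned @ x
print(np.max(np.abs(approx - exact)))    # quantization error, typically small
```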

Key takeaways

  • Machine learning provides a way of building more advanced sensory and control systems for robots than traditional path- or rule-based control strategies.
  • Use an appropriate machine-learning algorithm. Deep learning is not necessarily the right answer for all situations.
  • Training and inferencing are separate processes. This can be leveraged by offloading the more compute-intensive operations to the cloud.
  • CNNs can be deployed in many forms. The CNN architecture is intimately tied to the type of data it is expected to learn and process.
  • Training data quality is vital. Poor selection of training data can lead to unexpected results.