Impact of machine vision on automated perception and recognition

Bionic eyes

Machine vision merges image capture systems with computer vision algorithms to enable automated inspection and robot guidance. Although inspired by human vision, which extracts conceptual information from two-dimensional images, machine vision is not confined to 2D visible light. Capture systems range from single-beam lasers and 3D Light Detection and Ranging (LiDAR) systems, also known as laser scanners, to 2D/3D sonar sensors and setups of one or more 2D cameras.

Primarily, machine vision relies on 2D image-based capture systems and computer vision algorithms that emulate aspects of human visual perception. Humans perceive their 3D surroundings and navigate within them by reconstructing 3D information from 2D retinal images, positioning themselves relative to objects and combining this information with prior knowledge to detect, identify, and understand those objects and their interactions. Scene reconstruction, object detection, and object recognition are accordingly the key sub-domains of computer vision.

Reconstructing 3D Information

Irrespective of the imaging sensor used, the prevalent methods for reconstructing 3D information are time-of-flight techniques, multi-view geometry, and photometric stereo. Time-of-flight techniques, employed in laser scanners, gauge an object's distance from the travel time of emitted light. This approach achieves millimeter-scale accuracy even over distances spanning kilometers.
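To make the time-of-flight calculation concrete, here is a minimal Python sketch; the constant and function names are illustrative and not tied to any particular sensor API. Note that millimeter accuracy requires picosecond-scale timing, since light covers a 2 mm round trip in roughly 6.7 picoseconds.

    C = 299_792_458.0  # speed of light in m/s

    def tof_distance(round_trip_seconds: float) -> float:
        """Distance implied by a light pulse's measured round-trip time."""
        return C * round_trip_seconds / 2.0

    # A pulse returning after ~6.67 microseconds implies a target ~1 km away.
    print(tof_distance(6.67e-6))  # ~999.8 m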

Multi-view geometry encompasses the 'structure,' 'stereo correspondence,' and 'motion' problems: estimating 3D coordinates through triangulation, determining corresponding points between images, and recovering camera coordinates from multiple views. 3D laser scanners based on triangulation achieve micrometer precision, albeit over a limited range. Techniques such as structure from motion apply these multi-view geometry principles to extract corresponding points and reconstruct an object's shape.
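As an illustration of the triangulation step, the following is a minimal NumPy sketch of linear (DLT) triangulation of a single point from two views; it assumes the 3x4 camera projection matrices are already known from calibration, and the function name and test values are illustrative.

    import numpy as np

    def triangulate(P1, P2, x1, x2):
        """Linear (DLT) triangulation of one 3D point from two views.

        P1, P2: 3x4 camera projection matrices.
        x1, x2: matching pixel coordinates (u, v) in each image.
        """
        A = np.array([
            x1[0] * P1[2] - P1[0],
            x1[1] * P1[2] - P1[1],
            x2[0] * P2[2] - P2[0],
            x2[1] * P2[2] - P2[1],
        ])
        # The homogeneous 3D point is the null vector of A, found via SVD.
        _, _, Vt = np.linalg.svd(A)
        X = Vt[-1]
        return X[:3] / X[3]

    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])             # reference camera
    P2 = np.hstack([np.eye(3), np.array([[1.0], [0], [0]])])  # translated camera
    print(triangulate(P1, P2, (0.0, 0.0), (0.2, 0.0)))        # ~[0. 0. 5.]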

Stereo Vision and Interest Point Detection

Stereo vision hinges on extracting corresponding salient points or features across images, a task termed interest point detection. These features must withstand photometric transformations and remain invariant to geometric changes. Over the past two decades, researchers have proposed numerous approaches. The scale-invariant feature transform (SIFT) extracts features invariant to scale, rotation, and translation, and is robust to illumination and perspective variations. Since its inception (1999-2004), SIFT has found success in diverse applications, including object recognition, robot localization, and mapping.
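As a sketch of interest point detection in practice, the snippet below extracts and matches SIFT features with OpenCV; it assumes the opencv-python package at version 4.4 or later, where SIFT ships in the main module, and the image file names are placeholders. The ratio test used to filter matches is the one proposed in Lowe's original SIFT work.

    import cv2

    img1 = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # For each descriptor, find its two nearest neighbours and keep the
    # match only if the best is clearly better than the runner-up.
    matcher = cv2.BFMatcher()
    pairs = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in pairs if m.distance < 0.75 * n.distance]
    print(f"{len(good)} putative correspondences")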

Object Recognition Challenges

Recognizing and categorizing objects pose tougher challenges than 3D reconstruction, because the number of possible objects is vast and a single object may belong to many categories simultaneously. Some object detection concepts stem from Gestalt psychology, which groups entities by proximity, similarity, symmetry, common fate, and continuity.

Earlier research (1960s to early 1990s) centered on geometric shapes, constructing complex objects from primitive 3D components. In the 1990s, appearance-based models emerged, employing manifold learning to parameterize object appearance with respect to pose and illumination; however, these methods struggle with occlusion, clutter, and deformation. By the mid-to-late 1990s, sliding-window approaches tackled object classification across image sections, the main challenges being the design of effective features and an efficient search over positions and scales. Local feature approaches aimed for invariance to changes in scale, geometry, and illumination. 'Parts-and-shape' models and 'bags of features' gained prominence in the early 2000s: the former represented objects via deformable, multi-scale components, while the latter borrowed recognition techniques from natural language processing.
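To make the sliding-window idea concrete, here is a dependency-light Python sketch; classify_window is a hypothetical stand-in for whatever trained classifier (for example, an SVM over hand-designed features) a real detector would use. The nested loops over scales and positions make the search cost noted above explicit.

    import numpy as np

    def classify_window(patch: np.ndarray) -> float:
        return 0.0  # placeholder score; a real detector learns this

    def sliding_window_detect(image, win=64, stride=16, scales=(1.0, 0.75, 0.5)):
        detections = []
        for s in scales:
            h, w = int(image.shape[0] * s), int(image.shape[1] * s)
            # Nearest-neighbour downscaling keeps the sketch self-contained.
            ys = (np.arange(h) / s).astype(int)
            xs = (np.arange(w) / s).astype(int)
            scaled = image[ys][:, xs]
            for y in range(0, h - win + 1, stride):
                for x in range(0, w - win + 1, stride):
                    score = classify_window(scaled[y:y + win, x:x + win])
                    if score > 0.5:
                        # Map the window back to original-image coordinates.
                        detections.append((x / s, y / s, win / s, score))
        return detections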

Deep Learning Revolution

Machine learning revolutionized object recognition by shifting from purely mathematical modeling to data-driven algorithms. A pivotal moment arrived in 2012 with deep neural networks and large labeled image databases such as ImageNet. Unlike traditional methods that rely on hand-designed feature extraction and matching, deep learning integrates these tasks within the structure of the network itself. Deep neural networks raised image classification accuracy on benchmarks such as ImageNet from 72% (2010) to 96% (2015), surpassing human accuracy and impacting real-world applications. Companies such as Google and Baidu adopted Hinton's deep neural network architecture to enhance their image search capabilities, face detection became prevalent in mobile devices, and Apple introduced pet recognition. These advances have caused transformative shifts across industries.
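To show what the data-driven approach looks like in code today, here is a hedged sketch that classifies an image with a convolutional network pretrained on ImageNet via torchvision; it assumes torchvision 0.13 or later for the weights API, and the image file name is a placeholder.

    import torch
    from torchvision import models
    from PIL import Image

    weights = models.ResNet50_Weights.IMAGENET1K_V2
    model = models.resnet50(weights=weights).eval()

    # Apply the resizing and normalization this model was trained with.
    preprocess = weights.transforms()
    image = preprocess(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)

    with torch.no_grad():
        probs = model(image).softmax(dim=1)
    top = probs.topk(5)
    labels = weights.meta["categories"]
    for p, idx in zip(top.values[0], top.indices[0]):
        print(f"{labels[idx]}: {p.item():.2%}")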