Due to its speed, accuracy, and repeatability, machine vision excels in the quantitative measurement of a structured scene. In contrast, human vision excels at the qualitative interpretation of a complicated, unstructured scene.
A machine vision system on a production line, for example, can inspect hundreds, if not thousands, of parts per minute. With the proper camera resolution and optics, a machine vision system can easily inspect object details too small for the human eye to see.
Because machine vision eliminates physical contact between the test system and the parts under test, it prevents part damage and avoids the time and costs associated with wear and tear of mechanical components.
By reducing human involvement in the manufacturing process, machine vision provides additional safety and operational benefits. Furthermore, it protects human workers from hazardous environments and prevents human contamination of clean rooms. Low cost, high robustness, acceptable accuracy, high reliability, and high mechanical and temperature stability are all attributes of machine vision.
Because of these benefits, machine vision is used across a wide range of industrial and non-industrial applications. With a combination of hardware and software, machine vision provides operational guidance to devices in executing their functions based on the capture and processing of images.
It helps meet the following strategic goals:
- Higher quality
- Increased productivity
- Production flexibility
- Less machine downtime and reduced setup time
- Complete information and tighter process control
- Lower capital equipment costs
- Lower production costs
- Scrap rate reduction
- Inventory control
- Reduced floor space
Machine vision systems depend on digital sensors, protected inside industrial cameras with specialized optics, to acquire images that computer hardware and software then process, analyze, and measure to extract the characteristics needed for decision-making.
Lighting, image sensor, lens, vision processing, and communications are the main components of a machine vision system. The part to be examined is illuminated so that its features stand out and the camera can see them. The lens captures the image and presents it to the sensor as light. The camera's sensor converts this light into a digital image, which is then sent to the processor for analysis.
Vision-processing algorithms examine the image, extract the required information, perform the necessary inspection, and make a decision. Finally, communication is typically accomplished through a discrete I/O signal or through data transmitted over a serial link to a device that logs or uses the information.
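As a toy illustration of the processing-and-decision step described above, the sketch below thresholds a simulated camera frame and returns a pass/fail decision. The threshold and defect limit are arbitrary assumptions for illustration; a real system would use a vision library and a calibrated inspection rule rather than raw NumPy.

```python
import numpy as np

def inspect(frame: np.ndarray, dark_thresh: int = 60, max_defect_px: int = 5) -> bool:
    """Toy vision-processing step: flag a part as defective if too many
    pixels are darker than `dark_thresh` (e.g. scratches on a bright part).
    Both parameters are hypothetical values chosen for this sketch."""
    defect_px = int(np.count_nonzero(frame < dark_thresh))
    return defect_px <= max_defect_px  # True = pass, False = fail

# Simulated 8-bit camera frame of a uniformly lit part ...
frame = np.full((32, 32), 200, dtype=np.uint8)
assert inspect(frame)            # clean part passes

frame[10:14, 10:14] = 10         # ... now with a 16-pixel dark blemish
assert not inspect(frame)        # defective part fails
```

In a deployed system, the boolean result would be emitted as the discrete I/O signal mentioned above, accepting or rejecting the part.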
Most machine vision hardware components, such as lighting modules, sensors, and processors, are commercial off-the-shelf (COTS) products. Machine vision systems can be built from COTS components or purchased as complete systems with all components in one device. Machine vision systems currently fall into three categories: 1D, 2D, and 3D.
Machine vision applications
- Inspection, measurement, gauging, and assembly verification
- Machine vision systems now perform repetitive tasks formerly done manually
- Measurement and gauging/robot guidance / prior operation verification
- Changeovers programmed in advance
- Adding vision to a machine improves its performance and avoids obsolescence
- Manual tasks can now provide computer data feedback
- Inspection, measurement, and gauging
- One vision system vs. many people / Detection of flaws early in the process
- Optical Character Recognition and identification
- Vision system vs. operator
The impact of machine vision
Image capture systems and computer vision algorithms are combined in machine vision to provide automatic inspection and robot guidance. Although machine vision is inspired by the human visual system, it is not limited to visible light or to extracting conceptual information from two-dimensional images. Even so, most machine vision applications rely on 2D image-based capture systems and computer vision algorithms that mimic human visual perception. Humans see the world in three dimensions. Their ability to navigate and complete tasks depends on reconstructing 3D information from 2D images to locate themselves relative to the surrounding objects. This data is then combined with prior knowledge to detect and identify objects in the environment and to understand how they interact. The main sub-domains of computer vision are scene reconstruction, object detection, and recognition.
The most common approaches to reconstructing 3D information, regardless of the imaging sensors used, are time-of-flight techniques, multi-view geometry, and photometric stereo. The first is used in laser scanners to calculate the distance between the light source and the object from the time it takes light to reach the object and return. Because their accuracy is bounded by how precisely time can be measured, time-of-flight approaches are used over ranges up to kilometers with accuracy at the millimeter scale. Multi-view geometry, on the other hand, encompasses 'structure,' 'stereo correspondence,' and 'motion' problems. Recovering the 3D 'structure' entails estimating the 3D coordinates of a point by triangulation from two or more 2D projections of the same 3D point in two or more images. The problem of finding the image point in one view that corresponds to a point in another 2D view is known as 'stereo correspondence.' Finally, the problem of recovering the camera coordinates from a set of corresponding points in two or more image views is referred to as 'motion.' Triangulation-based 3D laser scanners can achieve micrometer accuracy, but their range is limited to a few meters. Multi-view geometry principles are used in several sub-problems, such as structure from motion, to extract corresponding points between 2D views of the same object and reconstruct its shape.
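The two distance-recovery principles above reduce to short formulas: time-of-flight halves the round-trip path of light, and rectified stereo triangulation gives depth Z = fB/d for focal length f (in pixels), baseline B, and disparity d. A minimal sketch with hypothetical numbers:

```python
C = 299_792_458.0  # speed of light, m/s

def tof_distance(round_trip_s: float) -> float:
    """Time-of-flight: light travels out and back, so halve the path."""
    return C * round_trip_s / 2.0

def stereo_depth(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Triangulation for a rectified stereo pair: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px

# Illustrative values only (not from a real sensor):
print(round(tof_distance(10e-9), 3))    # 10 ns round trip -> 1.499 m
print(stereo_depth(800.0, 0.10, 40.0))  # f=800 px, B=10 cm, d=40 px -> 2.0 m
```

The formulas also make the trade-offs in the text concrete: time-of-flight accuracy is set by timing resolution, while stereo depth error grows as disparity shrinks with distance.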
The robust extraction of corresponding salient points (features) across images, known as interest point detection, is a prerequisite for stereo vision. Such features should be invariant to photometric transformations, such as changes in lighting conditions, and covariant with geometric transformations. Researchers have proposed several approaches over more than two decades. The scale-invariant feature transform (SIFT) extracts invariant features that are robust to illumination variations and moderate perspective transformations. Since its introduction between 1999 and 2004, it has been used successfully in several vision applications, including object recognition, robot localization, and mapping.
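SIFT itself is involved, but a simpler classic, the Harris corner detector, illustrates the idea of finding salient, repeatable points. The sketch below computes the Harris response R = det(M) - k·trace(M)² from the structure tensor M summed over a local window; the synthetic image, window size, and k value are illustrative choices, not from the text.

```python
import numpy as np

def harris_response(img: np.ndarray, y: int, x: int,
                    win: int = 2, k: float = 0.05) -> float:
    """Harris corner response at pixel (y, x): R = det(M) - k*trace(M)^2,
    where M is the gradient structure tensor summed over a local window."""
    Iy, Ix = np.gradient(img.astype(float))
    ys, xs = slice(y - win, y + win + 1), slice(x - win, x + win + 1)
    Sxx = float(np.sum(Ix[ys, xs] ** 2))
    Syy = float(np.sum(Iy[ys, xs] ** 2))
    Sxy = float(np.sum(Ix[ys, xs] * Iy[ys, xs]))
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - k * trace ** 2

# Synthetic image: a bright square on a dark background.
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0

corner = harris_response(img, 5, 5)   # corner of the square: large positive R
edge   = harris_response(img, 10, 5)  # middle of an edge: negative R
flat   = harris_response(img, 2, 2)   # flat region: R near zero
```

The sign of R separates corners (both gradient directions strong) from edges (one direction strong) and flat regions, which is exactly the "salient, repeatable point" property interest point detectors need.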
Because thousands of objects can belong to an arbitrary number of categories simultaneously, representing and recognizing object categories has proven much more difficult to generalize and solve than 3D reconstruction. Several ideas about object detection are linked to Gestalt psychology, a theory of mind concerned with visual perception. An important part of the theory is grouping entities together based on proximity, similarity, symmetry, common fate, continuity, and other factors. From the 1960s to the early 1990s, geometric shapes were the focus of object recognition research. It was a bottom-up process that assembled a small number of primitive 3D shapes in various configurations to create complex objects. In the 1990s, researchers experimented with appearance-based models that learn multiple views of an object's appearance, parameterized by pose and illumination. Occlusion, clutter, and deformation are all problems for these techniques. Sliding window approaches were developed in the mid-to-late 1990s to classify, at each image position, whether an object instance is present. The main difficulties were designing features that accurately represented the object's appearance and efficiently searching the many positions and scales. Local feature approaches were also developed to identify features insensitive to image scaling, geometric transformations, and lighting changes. 'Parts-and-shape models,' as well as 'bags of features,' were proposed in the early 2000s. Parts-and-shape models represent complex objects as combinations of deformable parts at multiple scales. Bags-of-features methods, on the other hand, represent visual features as words, linking object recognition and image classification to the expressive power of natural language processing approaches.
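The sliding-window idea above can be sketched with a toy zero-mean correlation score standing in for a learned classifier: slide a window over the image, score every position, and report the best one. The template, image, and scoring rule are all illustrative assumptions.

```python
import numpy as np

def sliding_window_scores(img: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Score every window position by zero-mean correlation with `template`
    (a toy stand-in for the per-window classifier in sliding-window detection)."""
    th, tw = template.shape
    t = template - template.mean()
    H, W = img.shape
    scores = np.empty((H - th + 1, W - tw + 1))
    for y in range(H - th + 1):
        for x in range(W - tw + 1):
            patch = img[y:y + th, x:x + tw]
            scores[y, x] = np.sum((patch - patch.mean()) * t)
    return scores

# Toy "object": a 3x3 cross pattern placed at (row=4, col=6).
template = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=float)
img = np.zeros((10, 12))
img[4:7, 6:9] = template

scores = sliding_window_scores(img, template)
y, x = np.unravel_index(np.argmax(scores), scores.shape)
print(y, x)  # -> 4 6 (the planted location)
```

The exhaustive double loop is exactly the cost problem the text mentions: real detectors had to search many positions and scales efficiently, which motivated cascades and, later, learned feature hierarchies.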
In object recognition, machine learning has aided the transition from solving problems solely through mathematical modeling to learning algorithms based on real-world data and statistical modeling. The emergence of deep neural networks and the availability of large labeled image databases, such as ImageNet, led to a breakthrough in object recognition and classification in 2012. Deep learning has the advantage of encoding both feature extraction and image classification within the structure of a neural network, in contrast to traditional object recognition methods, which rely on feature extraction followed by feature matching. The superior performance of deep neural networks raised image classification accuracy from 72 percent in 2010 to 96 percent in 2015, surpassing human accuracy and significantly impacting real-world applications. Based on Hinton's deep neural network architecture, both Google and Baidu updated their image search capabilities. Face recognition has been added to several mobile devices, and Apple has even created a pet recognition app. These models' object recognition and image classification accuracy exceeds humans', causing waves of technological change across the industry.