Next-gen video codec standard revolutionizes video search & analytics

Consider this: You have about three months of footage from your home security and video surveillance system, and you want to find something (a dog, for example, or a girl in a blue hat) in those video files.

How will you find it? Will you sit and replay hours of video to spot the object you’re looking for? Is it even possible to search through millions of video streams for specific things?

Today, we lack a means of searching tens of thousands of video files stored on computers, servers, security systems, and phones. But Compact Descriptors for Video Analysis (CDVA) opens the door to exciting new capabilities that allow quick and precise searches on video files.

What is CDVA?

CDVA is a recent addition to MPEG-7, the multimedia content description standard that attaches searchable descriptions to audio and video content so users can find it quickly and efficiently. The Moving Picture Experts Group (MPEG) was created in 1988 to standardize how audio, image, and video files are compressed at the capture source, transmitted to a destination, and re-created there for people or organizations to use. MPEG-7 focuses on the objects in the video rather than just the bits, making it easier to identify things such as a dog, a group of people, or a car.

CDVA, approved as part of the MPEG-7 standard in July 2019, enables images and video to be encoded in machine-only or hybrid (machine and human) formats as they are captured, making them searchable with higher speed and precision than is currently possible. Machine-only encoding needs only basic, inexpensive camera sensors and stores information at sizes up to 1,000 times smaller, while hybrid encoding automatically embeds the machine metadata alongside the human-viewable stream, making image and video files searchable.
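To make that size ratio concrete, here is a back-of-envelope Python sketch applied to the three-month surveillance scenario from the introduction. The 2 Mbps bitrate is an assumption for illustration; actual descriptor sizes depend on the CDVA operating point chosen.

```python
# Back-of-envelope storage comparison (assumed figures, not spec values).
SECONDS = 90 * 24 * 3600                   # roughly three months of footage
VIDEO_BITRATE = 2e6                        # assumed 2 Mbps surveillance stream

video_bytes = VIDEO_BITRATE / 8 * SECONDS  # hybrid mode keeps the full video
machine_only_bytes = video_bytes / 1000    # the ~1,000x ratio cited above

print(f"full video  : {video_bytes / 1e9:,.0f} GB")        # ~1,944 GB
print(f"machine-only: {machine_only_bytes / 1e9:,.1f} GB") # ~1.9 GB
```

Under these assumptions, the machine-only descriptors for a whole quarter of continuous footage fit comfortably on a phone, which is what makes on-device search plausible at all.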

Using a camera sensor equipped with an AI processor and Video Coding for Machines (VCM), the first “language of machines,” CDVA extracts features (such as objects, activities, locations, events, and gestures) from images and videos. It produces a feature map that is compatible across devices and processors from different manufacturers. These feature maps are defined by the standard to provide “machine understanding” of images and videos, much as people communicate by using the right words in the right sequences. A user can open a camera, “shoot” an image, and use “search” to recall the most closely matching files. The same capability can search the libraries of service providers offering entertainment and education content.
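As a rough illustration of that “shoot and search” flow, the Python sketch below uses a pretrained CNN embedding as a stand-in for the standardized CDVA feature extractor, and plain cosine similarity as a stand-in for the standardized matcher. None of this is the actual CDVA bitstream or API; it only mirrors the shape of the idea: extract a compact descriptor once, then compare descriptors instead of pixels.

```python
# Minimal "shoot and search" sketch. A pretrained CNN stands in for the
# standardized feature extractor; cosine similarity stands in for matching.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Backbone with the classification head removed, so the output is a
# 512-dimensional global feature vector (an illustrative "feature map").
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def describe(path: str) -> torch.Tensor:
    """Extract and L2-normalize a compact descriptor for one image."""
    with torch.no_grad():
        vec = backbone(preprocess(Image.open(path).convert("RGB")).unsqueeze(0))
    return torch.nn.functional.normalize(vec, dim=1).squeeze(0)

def search(query_path: str, library: dict[str, torch.Tensor], top_k: int = 5):
    """Rank library files by cosine similarity to the query descriptor."""
    q = describe(query_path)
    scores = {name: float(q @ desc) for name, desc in library.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

The point of standardizing the feature map, rather than leaving it to each vendor, is that indexes built by one manufacturer’s devices remain searchable by another’s.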

VCM applications

Gyrfalcon Technologies, Inc. (GTI), one of the founders of the VCM standards committee, has initiated a “Shoot and Search” program to help the world understand the benefits of VCM in everyday situations. It’s fascinating what “Shoot and Search” can do for the future of machine-to-machine application development. One application is searching for landmarks or interesting places while touring: the camera can combine its GPS location with a search of travel archives accessible via the web and application libraries (tourism, historical, and government records) and bring back the desired content, as sketched below.
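A hypothetical sketch of that GPS-plus-descriptor combination: first restrict candidates to entries near the user, then rank the survivors by descriptor similarity. All names here, and the archive entry layout of (name, latitude, longitude, unit-normalized descriptor), are illustrative rather than anything the standard prescribes.

```python
import math
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 6371 * 2 * math.asin(math.sqrt(a))

def geo_search(query_desc, query_lat, query_lon, archive, radius_km=2.0, top_k=3):
    """Filter archive entries by distance, then rank by descriptor similarity."""
    nearby = [(name, desc) for name, lat, lon, desc in archive
              if haversine_km(query_lat, query_lon, lat, lon) <= radius_km]
    scored = [(name, float(np.dot(query_desc, desc))) for name, desc in nearby]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top_k]
```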

VCM can also enable variations of augmented reality. Imagine a user opening their camera and pointing it at an object of interest. The camera applies GPS and the user’s preferences to find information to overlay, and the user can click on it to obtain more detail.

VCM is the building block of this new machine language for encoding and analyzing image streams. It can extract features on edge devices with very basic camera sensors, which in turn enables more rapid search and discovery of matching video across the universe of properly tagged media content. With VCM, previously captured images and videos, whether in personal libraries or in the galleries and archives of global service providers (such as Netflix, YouTube, and Facebook) or governments, can be processed cost-effectively and energy-efficiently. Service providers have an excellent opportunity to make their content more easily discoverable.

Processing can be set up in data centers or on local servers to work through existing libraries, replacing existing files with versions that comply with the new MPEG-7 standard so they are more easily searched by humans and optimized for machine use at the same time; a sketch of such a back-fill job follows.
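This is a minimal sketch of that back-fill job, assuming a caller-supplied describe() extractor like the one sketched earlier. The .cdva.npy sidecar naming is made up for illustration; the standard defines descriptor bitstreams, not an on-disk file layout.

```python
from pathlib import Path
import numpy as np

def backfill(library_root: str, describe) -> int:
    """Walk a media library and write a descriptor sidecar next to each video.

    `describe` maps a video path to a NumPy descriptor array; the sidecar
    naming scheme here is purely illustrative.
    """
    done = 0
    for video in Path(library_root).rglob("*.mp4"):
        sidecar = video.with_suffix(".cdva.npy")
        if sidecar.exists():
            continue  # already indexed on a previous run
        np.save(sidecar, describe(str(video)))
        done += 1
    return done
```

Because the job is idempotent (it skips files that already have sidecars), it can run incrementally as new content arrives rather than re-encoding the whole library each time.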

Machine-to-machine video sharing is another of the fastest-growing segments for the foreseeable future. Because VCM takes less energy to process video, costs less, and uses less of the network, it can drive lower-cost equipment, lower energy bills, improved experiences, and higher productivity through lower latency and improved ROI on both network and storage investments.