Different types of data used in AI development

November 25, 2021

Data plays a big role in the development of artificial intelligence (AI), and one key responsibility of developers involves looking closely at the types of data they are going to use.

There are various types of data used for AI development, and specific criteria are used to assess their quality. Both the type and the characteristics of the data influence how it can be used in AI development and what actions are needed to ensure that the data is fit for the development of a specific AI technology.

Complex algorithmic and AI systems must process vast amounts of data, which varies hugely in form. This post sets out the main types of data available, the states they appear in, and how the specific requirements of AI can drive demand for certain types of data.

Before looking at the types of data used in AI development, it is worth considering the states in which data can be found. In general, data is found in three forms: structured, semistructured, and unstructured.

Structured data consists of organized datasets that fit a pre-defined data model. Such models can take many forms, from simple 2D spreadsheet arrays to more complex relational databases or knowledge graphs.

Semistructured data is a type of data that doesn’t conform to the rigid structure of traditional relational databases but still possesses some organizational properties that make it easier to analyze compared to unstructured data. Unlike fully structured data, which follows a predefined schema, semistructured data contains tags or markers to separate data elements, yet its structure can vary and evolve over time. Formats like JSON, XML, and YAML files are common examples of this, often requiring specialized tools to handle data conversions efficiently. These formats help bridge the gap between structured and unstructured data, making them a crucial element in AI development.
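As a rough illustration of the conversion step mentioned above, the following minimal sketch flattens a few JSON records with varying fields into a fixed set of columns. The records, the field names, and the `flatten` helper are hypothetical and chosen only for this example; real pipelines typically rely on dedicated tooling.

```python
import json

# Hypothetical JSON records: the same kind of entity, but fields vary per record.
raw = """
[
  {"id": 1, "name": "Ada", "contact": {"email": "ada@example.org"}},
  {"id": 2, "name": "Lin", "tags": ["beta", "trial"]},
  {"id": 3, "contact": {"email": "sam@example.org", "phone": "555-0100"}}
]
"""


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted keys; lists are joined into one string."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            flat[name] = ";".join(str(v) for v in value)
        else:
            flat[name] = value
    return flat


records = [flatten(r) for r in json.loads(raw)]

# The union of all keys becomes the column set of the structured table.
columns = sorted({key for r in records for key in r})

# Missing fields are filled with None so every row has the same shape.
rows = [[r.get(col) for col in columns] for r in records]

print(columns)
for row in rows:
    print(row)
```

The tags in the source records carry enough structure to derive columns automatically, but the gaps (filled with `None` here) show why the result is not a fully structured dataset until a schema is agreed.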

Much of what is considered “big data” today is unstructured data, which is not organized according to any pre-existing data model. Unstructured data is unprocessed and is often generated by machine-led systems where the purpose of the data is not to answer a specific question. Examples include social media posts, surveillance camera footage, and satellite imagery. Unstructured data can have its own internal structure, but the relationships between the data elements are often undefined.

Types of data used in AI development

Provided data

Provided data refers to information supplied by individuals who are aware that they are actively providing data about themselves. The provision can be voluntary (for example, social media posts, financial transactions, or personal emails), or individuals can be compelled to give their data (for example, registration forms for governmental organizations, health records, or job applications). Individuals will be aware that their data is intended for specific purposes, with consent often required by data controllers. This type of data is more often found in a structured form, with labeled data elements. Access to this type of data by AI developers has its limitations: access to personal data is generally restricted because of its high degree of identifiability and the associated risks.

Observed data

Observed data consists of information gathered by observing actors or natural or technical phenomena in natural or research environments. This type of data is generated or collected so that sample observations can be used to make general predictions about, or analyses of, a wider population. The degree to which people are aware that their data is being collected may vary.

In certain activities, such as internet browsing or location activation on mobile devices, individuals may be aware that their behaviors are recorded. In other instances, individuals are less aware that their behavior is being observed and recorded digitally. This includes CCTV footage used for facial recognition or readings from sensor devices (movement sensors, light sensors, etc.). When the data relates to human actors, specific legal and ethical issues must be considered, especially in connection with consent. Depending on the context, observed data can be structured or unstructured.

Derived data

Derived data is obtained by processing data that has been published or made available from any of the above sources. The types of processing or transformation include subsetting, changing structure, analyzing, mining, or creating statistical or algorithmic models. This type of data can also increase the ethical risks of using and misusing personal data, because it may be applied beyond the use originally intended.
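The transformations listed above can be sketched in a few lines. The example below is only an illustration under assumed inputs: the sensor readings and their layout are hypothetical, and the steps shown (subsetting, regrouping, and a simple mean) are just one possible combination.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observed readings: (sensor_id, hour, temperature_celsius).
readings = [
    ("s1", 9, 20.0), ("s1", 10, 22.0), ("s1", 11, 24.0),
    ("s2", 9, 18.0), ("s2", 10, 19.0),
]

# Subsetting: keep only readings taken from 10:00 onwards.
subset = [r for r in readings if r[1] >= 10]

# Changing structure: regroup the flat rows by sensor.
by_sensor = defaultdict(list)
for sensor_id, _hour, temperature in subset:
    by_sensor[sensor_id].append(temperature)

# Analyzing: the per-sensor mean is new, derived data.
derived = {sensor_id: mean(temps) for sensor_id, temps in by_sensor.items()}

print(derived)
```

The resulting averages exist in none of the original rows, which is what distinguishes derived data from the provided or observed data it is built on.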

Inferred data

Inferred data is generated by applying statistical or computational procedures to produce data for predictive purposes, such as credit scores, the likelihood of developing diseases, or profiles for targeted advertising. Though closely related to derived data, inferred data is more probabilistic in nature. It is more concerned with post hoc pattern detection and categorization. Since some type of processing or analysis is performed on the original dataset to produce inferred data, individuals may lose control over how their personal data is used.
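To make the probabilistic character of inferred data concrete, here is a minimal sketch that turns two input features into a probability and then into a category. Everything in it is an assumption made for illustration: the feature names, the weights, the bias, and the threshold are invented and do not reflect how any real scoring system works.

```python
import math

# Hypothetical weights, for illustration only; a real model would be
# fitted to historical data rather than written by hand.
WEIGHTS = {"late_payments": -0.8, "years_of_history": 0.3}
BIAS = 0.5


def repayment_probability(features):
    """Logistic function of a weighted sum: returns a value between 0 and 1."""
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def risk_band(probability, threshold=0.5):
    """Categorize the probability; the threshold is an arbitrary assumption."""
    return "lower risk" if probability >= threshold else "higher risk"


applicant = {"late_payments": 3, "years_of_history": 2}
p = repayment_probability(applicant)
band = risk_band(p)

print(round(p, 3), band)
```

Neither the probability nor the band was ever provided by the individual or observed directly. Both are inferences, which is why they can be wrong for a given person even when the model performs well on average.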

Reference data

Reference data is used to structure or categorize other data or datasets, or to provide context for them. Examples include opening and closing prices in financial markets or aggregated census records. Reference data can be either static or dynamic. It is by definition highly structured in form and requires little pre-processing before it can be incorporated into procedures involving data manipulation. Its value to AI development lies in its combination with other data types or in providing cross-domain mappings for homogenous datasets, i.e., facilitating the combination of one or more other datasets.
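The role of reference data in combining datasets can be sketched as a lookup table shared by two otherwise unrelated sources. In the example below, the `sales` and `tickets` datasets and the `combine` helper are hypothetical; the country codes are standard two-letter ISO 3166-1 codes used here as the reference key.

```python
# Reference table: two-letter country code -> country name.
COUNTRY_NAMES = {"DE": "Germany", "FR": "France", "JP": "Japan"}

# Two hypothetical datasets from different domains that share only the code.
sales = [{"country": "DE", "units": 120}, {"country": "JP", "units": 80}]
tickets = [{"country": "DE", "open": 4}, {"country": "FR", "open": 7}]


def combine(reference, *datasets):
    """Merge rows from several datasets, keyed by a code found in `reference`."""
    combined = {}
    for dataset in datasets:
        for row in dataset:
            code = row["country"]
            if code not in reference:
                continue  # skip rows the reference data cannot categorize
            entry = combined.setdefault(code, {"country_name": reference[code]})
            entry.update({k: v for k, v in row.items() if k != "country"})
    return combined


merged = combine(COUNTRY_NAMES, sales, tickets)

for code, entry in merged.items():
    print(code, entry)
```

Because the reference table is already clean and structured, the join itself needs almost no pre-processing; most of the effort in practice goes into making the other datasets use the same codes.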

A sub-category of reference data is metadata, which is information that provides the context for a given dataset. This includes information on provenance, data integrity tests, data formats, file size, etc. Metadata is essential for the discoverability of datasets that can potentially be used in AI development.
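A minimal sketch of what such a metadata record might contain is shown below. The `describe_dataset` helper, the field names, and the sample CSV file are assumptions made for this example, not an established metadata standard. The SHA-256 checksum stands in for a basic integrity test.

```python
import csv
import hashlib
import os
import tempfile


def describe_dataset(path, source):
    """Build a small metadata record for a CSV file (hypothetical layout)."""
    with open(path, "rb") as f:
        payload = f.read()
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    return {
        "source": source,  # provenance, supplied by the caller
        "format": os.path.splitext(path)[1].lstrip("."),
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),  # integrity check
        "columns": rows[0] if rows else [],
        "row_count": max(len(rows) - 1, 0),  # excludes the header row
    }


# Write a tiny sample file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "readings.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(
            [["sensor", "value"], ["s1", "20.5"], ["s2", "18.0"]]
        )
    metadata = describe_dataset(path, source="hypothetical sensor export")

print(metadata)
```

Publishing a record like this alongside a dataset lets others check the format and size before downloading, and verify afterwards that the file they received is the one that was described.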

Synthetic data

Synthetic data is artificially generated. It is not based on direct findings or observations of real-world phenomena but on models and simulations of those phenomena. It is often produced by an AI or algorithmic system, or by other methods such as statistical or data modeling. In principle, such an artificial dataset contains no identifiable information mapped to real individuals, which is why it is often considered a safer approach for sharing sensitive data. How much privacy risk remains depends on how closely the generating model reproduces the records it was built from.
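One simple form of the statistical modeling mentioned above is to fit a distribution to real values and then sample new values from it. The sketch below assumes a hypothetical list of ages and a normal distribution, which is a deliberate simplification; the `synthesize` helper is invented for this example.

```python
import random
from statistics import mean, stdev

# Hypothetical real-world values that should not be shared directly.
real_ages = [34, 45, 29, 52, 41, 38, 47, 33]


def synthesize(values, n, seed=0):
    """Sample n artificial values from a normal distribution fitted to `values`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    mu, sigma = mean(values), stdev(values)
    return [round(rng.gauss(mu, sigma), 1) for _ in range(n)]


synthetic_ages = synthesize(real_ages, n=1000)

print("real mean:", mean(real_ages))
print("synthetic mean:", round(mean(synthetic_ages), 2))
```

The synthetic values follow the overall shape of the original data without copying any single record. A model this simple preserves only the mean and spread; richer generators keep more of the original structure, and the closer they track the source records, the more carefully any remaining privacy risk has to be assessed.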
