Different types of data used in AI development

November 25, 2021

Data plays a big role in the development of artificial intelligence (AI), and one key responsibility of developers involves looking closely at the types of data they are going to use.

There are various types of data used for AI development, and specific criteria are used to assess their quality. Both the type and the characteristics of the data influence how it can be used in AI development and what actions are needed to ensure that it is fit for developing a specific AI technology.

Complex algorithmic and AI systems must process vast amounts of data, which is hugely varied in form. This post will set out the main types of data available and the states they appear in, and how the specific requirements of AI can play a role in the demand for certain types of data.

Before looking at the types of data used in AI development, it is worth considering the states in which data can be found. In general, data is found in three forms: structured, semistructured, and unstructured.

Structured data consists of organized datasets that fit a pre-defined data model. Structured data models can exist in many forms, from simple 2D spreadsheet arrays to more complex relational databases or knowledge graphs.

Semistructured data is a type of data that doesn’t conform to the rigid structure of traditional relational databases but still possesses some organizational properties that make it easier to analyze compared to unstructured data. Unlike fully structured data, which follows a predefined schema, semistructured data contains tags or markers to separate data elements, yet its structure can vary and evolve over time. Formats like JSON, XML, and YAML files are common examples of this, often requiring specialized tools to handle data conversions efficiently. These formats help bridge the gap between structured and unstructured data, making them a crucial element in AI development.
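As a rough illustration of how tagged but irregular records can be brought into a fixed structure, the sketch below flattens a small JSON document into rows with a shared set of columns. The sample records, the `flatten` helper, and the choice to join list values into a single cell are all assumptions made for this example, not a prescribed method.

```python
import json

# Hypothetical sample: two records whose fields differ, as is typical of semistructured data.
raw = '''
[
  {"id": 1, "name": "Ada", "contact": {"email": "ada@example.com"}},
  {"id": 2, "name": "Lin", "tags": ["beta", "trial"]}
]
'''

def flatten(record, parent_key=""):
    """Flatten nested dicts into a single level using dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key))
        elif isinstance(value, list):
            # Lists are joined into one cell; a real pipeline might use a child table instead.
            flat[full_key] = ";".join(str(item) for item in value)
        else:
            flat[full_key] = value
    return flat

records = [flatten(r) for r in json.loads(raw)]

# Build a fixed column set from the union of keys, then fill the gaps with None.
columns = sorted({key for r in records for key in r})
rows = [[r.get(col) for col in columns] for r in records]

print(columns)
for row in rows:
    print(row)
```

Fields missing from a record show up as `None`, which makes the varying structure of the source explicit once the data sits in a table.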

Much of what is considered “big data” today is unstructured data, which is not organized according to any pre-existing data model. Unstructured data is unprocessed and is often generated by machine-led systems where the purpose of the data is not to answer a specific question. Examples include social media posts, surveillance camera footage, and satellite imagery. Unstructured data can have its own internal structure, but the relationships between the data elements are often undefined.

Types of data used in AI development

Provided data

Provided data refers to information supplied by individuals who are aware that they are actively providing data about themselves. Provision can be voluntary (for example, social media posts, financial transactions, or personal emails), or individuals can be compelled to give their data (for example, registration forms for governmental organizations, health records, or job applications). Individuals will be aware that their data is intended for specific purposes, with consent often required by data controllers. This type of data is more often found in a structured form, with labeled data elements. Access to this type of data by AI developers has its limitations: access to personal data is generally restricted because personal data is highly identifiable and its use carries corresponding risks.

Observed data

Observed data consists of information gathered by observing actors or natural or technical phenomena in natural or research environments. This type of data is generated or collected so that sample observations can be used to make general predictions about, or analyses of, a wider population. The degree to which people are aware that their data is being collected may vary.

In certain activities, such as internet browsing or location activation on mobile devices, individuals may be aware that their behaviors are recorded. In other instances, individuals are less aware that their behavior is being observed and recorded digitally. This includes CCTV footage used for facial recognition or readings from sensor devices (movement sensors, light sensors, etc.). When the data relates to human actors, specific legal and ethical issues must be considered, especially in connection with consent. Depending on the context, observed data can be structured or unstructured.

Derived data

Derived data is obtained by processing data that has been published or made available from any of the above sources. The types of processing or transformation include subsetting, changing structure, analyzing, mining, or creating statistical or algorithmic models. This type of data can also increase the ethical risks of using and misusing personal data, particularly when it is applied beyond its original intended use.
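To make the transformations named above more concrete, here is a minimal sketch that subsets, restructures, and summarizes a handful of observations. The sensor readings, field layout, and time cut-off are invented for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observed readings: (sensor_id, hour, temperature in Celsius).
readings = [
    ("s1", 9, 20.0),
    ("s1", 10, 22.0),
    ("s2", 9, 18.0),
    ("s2", 10, 19.0),
    ("s2", 11, 23.0),
]

# Subsetting: keep only readings taken before 11:00.
morning = [r for r in readings if r[1] < 11]

# Changing structure: group temperatures by sensor.
by_sensor = defaultdict(list)
for sensor_id, _hour, temp in morning:
    by_sensor[sensor_id].append(temp)

# Analyzing: the per-sensor mean is a derived value that is not present in the source rows.
derived = {sensor_id: mean(temps) for sensor_id, temps in by_sensor.items()}
print(derived)
```

The resulting averages exist only because of the processing step, which is what separates derived data from the observations it came from.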

Inferred data

Inferred data is generated by applying statistical or computational procedures to produce data for predictive purposes, such as credit scores, the likelihood of developing diseases, or profiles for targeted advertising. Though closely related to derived data, inferred data is more probabilistic in nature and is more concerned with post hoc pattern detection and categorization. Because inferred data is produced by processing or analyzing the original dataset, individuals may lose control over how their personal data is used.
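The probabilistic character of inferred data can be sketched with a toy scoring function. The example below uses a logistic model with hand-picked weights; the feature names, weights, bias, and 0.5 threshold are hypothetical and do not represent how any real credit score is calculated.

```python
import math

# Hypothetical, hand-picked weights; a real model would learn these from data.
WEIGHTS = {"missed_payments": 0.8, "years_as_customer": -0.3}
BIAS = -1.0

def default_probability(features):
    """Return a probability between 0 and 1 from a simple logistic model."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

applicant = {"missed_payments": 3, "years_as_customer": 1}
p = default_probability(applicant)

# The category is an inference about the person, not something they provided.
label = "higher risk" if p >= 0.5 else "lower risk"
print(round(p, 3), label)
```

Neither the probability nor the label was supplied by the individual; both are produced by the procedure, which is why this kind of data is hard for individuals to see or contest.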

Reference data

Reference data is used to structure or categorize other data or datasets, or to provide context for them. Examples include opening and closing prices in financial markets or aggregated census records. Reference data can be either static or dynamic. It is by definition highly structured and requires little pre-processing before it can be incorporated into procedures that manipulate data. Its value to AI development lies in its combination with other data types or in providing cross-domain mappings for homogeneous datasets, i.e., facilitating the combination of two or more datasets.
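One way to picture reference data at work is as a lookup table that lets two otherwise unrelated datasets be combined. In the sketch below, the reference table, both datasets, and the `enrich` helper are hypothetical; only the country codes follow the real ISO 3166-1 alpha-2 convention.

```python
# Hypothetical reference table keyed by ISO 3166-1 alpha-2 country code.
COUNTRY_REFERENCE = {
    "DE": {"name": "Germany", "region": "Europe"},
    "JP": {"name": "Japan", "region": "Asia"},
}

# Two hypothetical datasets from different domains that share only the country code.
sales = [{"country": "DE", "units": 120}, {"country": "JP", "units": 80}]
support_tickets = [{"country": "DE", "tickets": 7}, {"country": "BR", "tickets": 3}]

def enrich(rows):
    """Attach the reference region; unknown codes get None instead of failing."""
    enriched = []
    for row in rows:
        ref = COUNTRY_REFERENCE.get(row["country"])
        enriched.append({**row, "region": ref["region"] if ref else None})
    return enriched

# Combine both datasets per country using the shared, reference-backed key.
combined = {}
for row in enrich(sales) + enrich(support_tickets):
    entry = combined.setdefault(
        row["country"], {"region": row["region"], "units": 0, "tickets": 0}
    )
    entry["units"] += row.get("units", 0)
    entry["tickets"] += row.get("tickets", 0)

print(combined)
```

The `BR` entry ends up with no region, which shows how gaps in the reference data surface directly in whatever is built on top of it.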

A sub-category of reference data is metadata, which is essential information that provides the context for a given dataset. This includes information on provenance, data integrity tests, data formats, file size, etc. Metadata is essential for the discoverability of datasets that can potentially be used in AI development.
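A minimal sketch of what such a metadata record might contain is shown below, using only the Python standard library. The field names, the `describe_dataset` helper, and the free-text provenance value are assumptions for this example and do not follow any particular metadata standard.

```python
import hashlib
import json
import os
import tempfile

def describe_dataset(path, source):
    """Build a small metadata record for a file on disk."""
    with open(path, "rb") as handle:
        content = handle.read()
    return {
        "file_name": os.path.basename(path),
        "format": os.path.splitext(path)[1].lstrip(".") or "unknown",
        "size_bytes": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),  # value for integrity checks
        "provenance": source,  # free-text field; a hypothetical convention
    }

# Demo on a throwaway CSV file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "readings.csv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("sensor,temp\ns1,20.0\n")
    metadata = describe_dataset(demo_path, source="hypothetical sensor export")

print(json.dumps(metadata, indent=2))
```

Recomputing the checksum later and comparing it with the stored value is a simple way to confirm that a dataset has not changed since it was catalogued.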

Synthetic data

Synthetic data is artificially generated. It is based not on direct observations of real-world phenomena but on models and simulations of those phenomena. It is often produced by an AI or algorithmic system, or by other methods such as statistical or data modeling. Because the artificial dataset is designed to contain no identifiable information mapped to real individuals, it is generally considered a lower-risk approach for sharing sensitive data, although privacy risk is reduced rather than eliminated when the generating model was itself built from real records.
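As a toy illustration of the statistical modeling route, the sketch below fits a normal distribution to a few values and samples new ones from it. The input ages, the choice of a normal distribution, and the seed are assumptions for this example; practical synthetic data generators are far more elaborate.

```python
import random
from statistics import mean, stdev

# Hypothetical "real" measurements that should not be shared directly.
real_ages = [23, 35, 41, 29, 52, 38, 47, 31]

# Fit a very simple statistical model: a normal distribution.
mu, sigma = mean(real_ages), stdev(real_ages)

# Sample artificial values from the model; the seed is fixed for reproducibility.
rng = random.Random(42)
synthetic_ages = [max(0, round(rng.gauss(mu, sigma))) for _ in range(1000)]

print(f"real: mean={mu:.1f}, sd={sigma:.1f}")
print(f"synthetic: mean={mean(synthetic_ages):.1f}, sd={stdev(synthetic_ages):.1f}")
```

The synthetic values follow the overall shape of the original data without copying any single record, though a model this simple also discards relationships that a real application might need.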
