Data plays a big role in the development of artificial intelligence (AI), and one key responsibility of developers involves looking closely at the types of data they are going to use.
Various types of data are used in AI development, and specific criteria are used to assess their quality. Both the type and the characteristics of the data influence how it can be used and what actions are needed to ensure that it is fit for developing a specific AI technology.
Complex algorithmic and AI systems must process vast amounts of data, which is hugely varied in form. This post will set out the main types of data available and the states they appear in, and how the specific requirements of AI can play a role in the demand for certain types of data.
Before looking at the types of data used in AI development, it is worth considering the states in which data can be found. In general, data is found in two forms: structured and unstructured.
Structured data consists of datasets organized according to a pre-defined data model, so that every element fits a known structure. Structured data models can exist in many forms, from simple 2D spreadsheet arrays to more complex relational databases or knowledge graphs.
Much of what is considered “big data” today is unstructured data, which is not organized according to any pre-existing data model. Unstructured data is typically unprocessed and is often generated by machine-led systems whose purpose is not to answer a specific question. Examples include social media posts, surveillance camera footage, and satellite imagery. Unstructured data can have its own internal structure, but the relationships between the data elements are often undefined.
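To make the distinction concrete, here is a minimal Python sketch. The records, the sample post, and the helper names (`average_age`, `extract_hashtags`) are invented for illustration and do not come from any real dataset or library. It shows that a structured field can be queried by name straight away, while unstructured text needs a processing step before any "field" exists at all.

```python
# Structured: every record follows the same pre-defined schema (hypothetical rows).
structured_rows = [
    {"user_id": 1, "country": "DE", "age": 34},
    {"user_id": 2, "country": "FR", "age": 27},
]

# Unstructured: free text with no pre-defined fields (hypothetical post).
unstructured_post = "Just landed in Berlin, the weather is great! #travel"


def average_age(rows):
    """A structured field can be read directly by its name."""
    return sum(row["age"] for row in rows) / len(rows)


def extract_hashtags(text):
    """Unstructured text must be parsed before it yields usable elements."""
    return [
        word.lstrip("#").rstrip("!.,?")
        for word in text.split()
        if word.startswith("#")
    ]


print(average_age(structured_rows))
print(extract_hashtags(unstructured_post))
```

The hashtag parser is deliberately naive; real unstructured data usually needs far heavier processing (tokenization, image decoding, and so on) before it can feed a model.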
Types of data used in AI development
Provided data
Provided data refers to information supplied by individuals who are aware that they are actively providing data about themselves. Provision can be voluntary (for example, social media posts, financial transactions, personal emails, etc.), or individuals can be compelled to give their data (for example, registration forms for governmental organizations, health records, job applications, etc.). Individuals will be aware that their data is intended for specific purposes, with consent often required by data controllers. This type of data is more often found in a structured form, with labeled data elements. Access to this type of data by AI developers is limited: personal data is highly identifiable, and the risks that follow from this mean that access is generally restricted.
Observed data
Observed data consists of information gathered by observing actors or natural/technical phenomena in natural or research environments. This type of data is generated or collected so that sample observations can be used to make general predictions about, or analyses of, a wider population. The degree to which people are aware that their data is being collected may vary.
In certain activities, such as internet browsing or location activation on mobile devices, individuals may be aware that their behaviors are recorded. In other instances, individuals are less aware that their behavior is being observed and recorded digitally. This includes CCTV footage used for facial recognition or readings from sensor devices (movement sensors, light sensors, etc.). When the data relates to human actors, specific legal and ethical issues must be considered, especially in connection with consent. Depending on the context, observed data can be structured or unstructured.
Derived data
Derived data is obtained by processing data that has been published or made available from any of the above sources. The types of processing or transformation include subsetting, changing structure, analyzing, mining, or creating statistical or algorithmic models. This type of data can also increase the ethical risks of using and misusing personal data, particularly when it is applied beyond its original intended use.
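The transformations named above can be sketched in a few lines of Python. The sensor readings below are made-up values and the variable names are purely illustrative; the point is only that each step (subsetting, changing structure, analyzing) produces a new dataset derived from the observed one.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observed data: temperature readings from two sensors.
observations = [
    {"sensor": "A", "hour": 9, "temp_c": 18.0},
    {"sensor": "A", "hour": 10, "temp_c": 20.0},
    {"sensor": "B", "hour": 9, "temp_c": 15.0},
    {"sensor": "B", "hour": 10, "temp_c": 17.0},
]

# Subsetting: keep only the readings taken at 9 o'clock.
morning = [o for o in observations if o["hour"] == 9]

# Changing structure: regroup the flat list into one list of readings per sensor.
by_sensor = defaultdict(list)
for o in observations:
    by_sensor[o["sensor"]].append(o["temp_c"])

# Analyzing: compute a summary statistic per sensor.
mean_temp = {sensor: mean(values) for sensor, values in by_sensor.items()}

print(len(morning), dict(by_sensor), mean_temp)
```

None of the three outputs existed in the original observations, which is why derived data can end up being used in ways the original collection never anticipated.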
Inferred data
Inferred data is generated by applying statistical or computational procedures to produce data for predictive purposes, such as credit scores, the likelihood of developing diseases, or profiles used for targeted advertising. Though closely related to derived data, inferred data is more probabilistic in nature and is more concerned with post hoc pattern detection and categorization. Because inferred data is produced by processing or analyzing the original dataset, individuals can lose control over how their personal data is used.
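As a rough illustration of what "probabilistic in nature" means, the sketch below turns two provided attributes into an inferred probability using a logistic function. The function name `default_probability`, the input features, and the weights are all assumptions chosen by hand for this example; they do not reflect how any real credit-scoring system works.

```python
import math


def default_probability(income_k, missed_payments):
    """Map provided attributes to an inferred probability between 0 and 1.

    The weights are hypothetical, hand-picked values for illustration only.
    """
    z = -1.0 - 0.02 * income_k + 0.8 * missed_payments
    return 1.0 / (1.0 + math.exp(-z))


# The inferred value is an estimate about the person, not something they provided.
low_risk = default_probability(income_k=50, missed_payments=0)
high_risk = default_probability(income_k=50, missed_payments=3)
print(round(low_risk, 3), round(high_risk, 3))
```

The output is a new data point about an individual that they never supplied themselves, which is exactly where the loss of control described above comes from.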
Reference data
Reference data is used to structure or categorize other data or datasets, or to provide context for them. Examples include opening and closing prices in financial markets or aggregated census records. Reference data can be either static or dynamic. It is by definition highly structured in form and requires little pre-processing before it can be incorporated into procedures that manipulate data. Its value to AI development lies in its combination with other data types or in providing cross-domain mappings for homogenous datasets, i.e., facilitating the combination of one or more other datasets.
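A small sketch of that combining role, assuming two hypothetical datasets that identify countries differently: one uses two-letter ISO 3166-1 codes, the other uses country names. The lookup table acts as the reference data that lets them be joined. The dataset contents and the `combine` helper are invented for illustration.

```python
# Reference data: a static mapping from ISO 3166-1 alpha-2 codes to country names.
COUNTRY_NAMES = {"DE": "Germany", "FR": "France"}

# Two hypothetical datasets that key countries in different ways.
sales = [{"country": "DE", "units": 120}, {"country": "FR", "units": 80}]
survey = [
    {"country_name": "Germany", "satisfaction": 4.2},
    {"country_name": "France", "satisfaction": 3.9},
]


def combine(sales_rows, survey_rows, reference):
    """Join the two datasets by translating codes through the reference table."""
    survey_by_name = {row["country_name"]: row for row in survey_rows}
    combined = []
    for row in sales_rows:
        name = reference[row["country"]]
        combined.append(
            {
                "country": name,
                "units": row["units"],
                "satisfaction": survey_by_name[name]["satisfaction"],
            }
        )
    return combined


combined = combine(sales, survey, COUNTRY_NAMES)
print(combined)
```

Without the reference table, neither dataset contains enough information to be matched against the other.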
A sub-category of reference data is metadata: information that provides the context for a given dataset. This includes information on provenance, data integrity tests, data formats, file size, etc. Metadata is essential for the discoverability of datasets that can potentially be used in AI development.
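A metadata record can be as simple as a small dictionary stored alongside the dataset. The sketch below is one possible shape, not a standard: the field names, the `describe_dataset` helper, the dataset name, and the source label are all assumptions made for this example. The SHA-256 checksum stands in for a basic data integrity test.

```python
import hashlib
import json


def describe_dataset(name, payload, source, data_format):
    """Build a minimal metadata record for a dataset held in memory as bytes."""
    return {
        "name": name,
        "provenance": source,      # where the data came from
        "format": data_format,     # how the payload is encoded
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),  # integrity check
    }


# Hypothetical dataset contents.
payload = b"sensor,hour,temp_c\nA,9,18.0\nB,9,15.0\n"
record = describe_dataset("temps-sample", payload, "example field sensors", "text/csv")
print(json.dumps(record, indent=2))
```

Anyone who later receives the file can recompute the checksum and compare it with the recorded one to confirm that the data has not been altered.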
Synthetic data
Synthetic data is artificially generated. It is not based on direct observations of real-world phenomena but on models and simulations of those phenomena. It is often produced by an AI or algorithmic system, or by other methods such as statistical or data modeling. A well-constructed synthetic dataset contains no identifiable information mapped to real individuals, which is why it is often considered a safer approach for sharing sensitive data, with far lower privacy risks than the original records.
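The statistical-modeling route can be sketched very simply: fit a model to the real values, then sample new values from the model instead of releasing the originals. The example below assumes a normal distribution is an adequate model, which is a strong simplification; the ages and the `synthesize` helper are made up for illustration.

```python
import random
from statistics import mean, stdev

# Hypothetical "real" values that should not be shared directly.
real_ages = [23, 35, 41, 29, 52, 38, 46, 31]


def synthesize(values, n, seed=0):
    """Sample n synthetic values from a normal model fitted to the real ones."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    mu, sigma = mean(values), stdev(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


synthetic_ages = synthesize(real_ages, 1000)
print(round(mean(real_ages), 2), round(mean(synthetic_ages), 2))
```

The synthetic values follow the overall distribution of the real ones without copying any individual record. A model this crude captures only the mean and spread of a single column, so practical synthetic data generation relies on much richer models.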