    Different types of data used in AI development

    Data plays a big role in the development of artificial intelligence (AI), and one key responsibility of developers involves looking closely at the types of data they are going to use.

    There are various types of data used for AI development, and specific criteria are used to assess their quality. Both the type and the characteristics of the data influence how it can be used in AI development and what actions are needed to ensure that the data is fit for developing a specific AI technology.

    Complex algorithmic and AI systems must process vast amounts of data, which is hugely varied in form. This post will set out the main types of data available and the states they appear in, and how the specific requirements of AI can play a role in the demand for certain types of data.

    Before looking at the types of data used in AI development, it is worth considering the states in which data can be found. In general, data is found in three forms: structured, semistructured, and unstructured.

    Structured data consists of datasets organized according to a pre-defined data model, so every element fits a known structure. Structured data models can exist in many forms, from simple 2D spreadsheet arrays to more complex relational databases or knowledge graphs.

    Semistructured data is a type of data that doesn’t conform to the rigid structure of traditional relational databases but still possesses some organizational properties that make it easier to analyze compared to unstructured data. Unlike fully structured data, which follows a predefined schema, semistructured data contains tags or markers to separate data elements, yet its structure can vary and evolve over time. Formats like JSON, XML, and YAML files are common examples of this, often requiring specialized tools to handle data conversions efficiently. These formats help bridge the gap between structured and unstructured data, making them a crucial element in AI development.
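
    As a rough illustration of how semistructured data can be brought into a structured form, the sketch below flattens a few JSON records whose fields vary from one record to the next into rows with a fixed set of columns. The records, the field names, and the `flatten` helper are all hypothetical and exist only for this example; only the standard-library `json` module is used.

```python
import json

# Hypothetical semistructured records: the fields differ between records.
raw = """
[
  {"id": 1, "name": "Ada", "contact": {"email": "ada@example.com"}},
  {"id": 2, "name": "Lin", "tags": ["beta", "eu"]},
  {"id": 3, "contact": {"email": "sam@example.com", "phone": "555-0100"}}
]
"""


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted keys; lists are joined into one string."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            flat[name] = ";".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


records = [flatten(record) for record in json.loads(raw)]

# Build a fixed schema from the union of all keys, filling gaps with None.
columns = sorted({key for record in records for key in record})
rows = [[record.get(column) for column in columns] for record in records]

print(columns)
for row in rows:
    print(row)
```

    The union-of-keys schema is one simple design choice among several; a real pipeline might instead validate each record against an agreed schema and reject records that do not match.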

    Much of what is considered “big data” today is unstructured data, which is not organized according to any pre-existing data model. Unstructured data is unprocessed and is often generated by machine-led systems where the purpose of the data is not to answer a specific question. Examples include social media posts, surveillance camera footage, and satellite imagery. Unstructured data can have its own internal structure, but the relationships between the data elements are often undefined.

    Types of data used in AI development

    Provided data

    Provided data is information that individuals knowingly and actively give about themselves. The provision can be voluntary (for example, social media posts, financial transactions, or personal emails), or individuals can be compelled to give their data (for example, registration forms for governmental organizations, health records, or job applications). Individuals are usually aware that their data is intended for specific purposes, with consent often required by data controllers. This type of data is more often found in a structured form, with labeled data elements. Access to this type of data by AI developers has its limitations: access to personal data is generally restricted because the data is highly identifiable and carries corresponding risks.

    Observed data

    Observed data consists of information gathered by observing actors or natural or technical phenomena in natural or research environments. This type of data is collected so that sample observations can be used to make general predictions about, or analyses of, a wider population. The degree to which people are aware that their data is being collected varies.

    In certain activities, such as internet browsing or location activation on mobile devices, individuals may be aware that their behaviors are recorded. In other instances, individuals are less aware that their behavior is being observed and recorded digitally. This includes CCTV footage used for facial recognition or readings from sensor devices (movement sensors, light sensors, etc.). When the data relates to human actors, specific legal and ethical issues must be considered, especially in connection with consent. Depending on the context, observed data can be structured or unstructured.

    Derived data

    Derived data is obtained by processing data that has been published or made available from any of the above sources. The types of processing or transformation include subsetting, changing structure, analyzing, mining, or creating statistical or algorithmic models. This type of data can also increase the ethical risks around the use and misuse of personal data, because it may be applied beyond its original intended use.
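
    To make the transformations named above more concrete, here is a minimal sketch that subsets, restructures, and summarizes a small set of observations to produce derived data. The sensor readings and field names are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observed data: one sensor reading per row.
readings = [
    {"sensor": "A", "hour": 9, "temp_c": 20.5},
    {"sensor": "A", "hour": 10, "temp_c": 21.5},
    {"sensor": "B", "hour": 9, "temp_c": 18.0},
    {"sensor": "B", "hour": 10, "temp_c": 19.0},
    {"sensor": "B", "hour": 11, "temp_c": 23.0},
]

# Subsetting: keep only readings taken at or before 10:00.
morning = [reading for reading in readings if reading["hour"] <= 10]

# Changing structure: regroup from one row per reading to one list per sensor.
by_sensor = defaultdict(list)
for reading in morning:
    by_sensor[reading["sensor"]].append(reading["temp_c"])

# Analyzing: the per-sensor mean is the derived data.
derived = {sensor: mean(temps) for sensor, temps in by_sensor.items()}
print(derived)
```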

    Inferred data

    Inferred data is generated by applying statistical or computational procedures to produce data for predictive purposes, such as credit scores, the likelihood of developing diseases, or targeted advertising profiles. Though closely related to derived data, inferred data is more probabilistic in nature and is more concerned with post hoc pattern detection and categorization. Since some type of processing or analysis is performed on the original dataset to produce inferred data, individuals may lose control over how their personal data is used.
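
    The probabilistic character of inferred data can be sketched with a toy scoring function. The example below passes a weighted sum of two features through a logistic function to produce a value between 0 and 1. The feature names, weights, and bias are made up for illustration and do not come from any real credit model.

```python
import math

# Hypothetical weights and bias, for illustration only.
WEIGHTS = {"late_payments": -0.8, "years_of_history": 0.3}
BIAS = 0.5


def inferred_score(features):
    """Return a value in (0, 1) from a logistic function of the weighted features."""
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


reliable = {"late_payments": 0, "years_of_history": 5}
risky = {"late_payments": 3, "years_of_history": 5}

# The score is an estimate attached to a person, not a fact they supplied.
print(round(inferred_score(reliable), 3))
print(round(inferred_score(risky), 3))
```

    The output is a new data point about an individual that the individual never provided, which is why inferred data raises the control concerns described above.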

    Reference data

    Reference data is used to give structure to, categorize, or provide context for other data or datasets. Examples include opening and closing prices in financial markets or aggregated census records. Reference data can be either static or dynamic. It is by definition highly structured and requires little pre-processing before it can be incorporated into procedures that manipulate data. Its value to AI development lies in its combination with other data types or in providing cross-domain mappings for homogenous datasets, i.e., facilitating the combination of one or more other datasets.
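
    A small sketch of reference data in use: a static lookup table that maps country codes to regions is applied to a separate dataset so that its rows can be categorized and aggregated. The lookup table, the sales rows, and the `add_region` helper are all hypothetical.

```python
# Hypothetical reference data: a static mapping from country code to region.
COUNTRY_REGION = {"DE": "Europe", "FR": "Europe", "JP": "Asia"}

# Hypothetical dataset that carries only the country code.
sales = [
    {"country": "DE", "units": 120},
    {"country": "JP", "units": 80},
    {"country": "FR", "units": 50},
]


def add_region(rows):
    """Attach the region from the reference table; unknown codes get 'Unknown'."""
    return [
        {**row, "region": COUNTRY_REGION.get(row["country"], "Unknown")}
        for row in rows
    ]


# Aggregate by the category that the reference data supplied.
units_by_region = {}
for row in add_region(sales):
    units_by_region[row["region"]] = units_by_region.get(row["region"], 0) + row["units"]

print(units_by_region)
```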

    A sub-category of reference data is metadata, which is essential information that provides the context for a given dataset. This includes information on provenance, data integrity tests, data formats, file size, etc. Metadata is essential for the discoverability of datasets that can potentially be used in AI development.
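
    As a minimal sketch of the kind of metadata listed above, the function below records the format, size, a SHA-256 checksum (one possible data integrity check), and a provenance label for a dataset held in memory. The function name, the field names, and the source label are assumptions made for this example, not an established metadata standard.

```python
import hashlib
import json
from datetime import datetime, timezone


def describe_dataset(payload: bytes, fmt: str, source: str) -> dict:
    """Build a small metadata record for a dataset held in memory."""
    return {
        "format": fmt,
        "size_bytes": len(payload),
        # The checksum lets a later user verify the data was not altered.
        "sha256": hashlib.sha256(payload).hexdigest(),
        # Provenance: a free-text label for where the data came from.
        "source": source,
        "described_at": datetime.now(timezone.utc).isoformat(),
    }


payload = b"id,value\n1,10\n2,20\n"  # hypothetical CSV content
metadata = describe_dataset(payload, fmt="text/csv", source="hypothetical-survey-2024")
print(json.dumps(metadata, indent=2))
```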

    Synthetic data

    Synthetic data is artificially generated. It is not drawn directly from observations of real-world phenomena but from models and simulations of those phenomena. It is often produced by an AI or algorithmic system, or by other methods such as statistical or data modeling. A well-constructed synthetic dataset contains no identifiable information mapped to real individuals, so it is often considered a safer approach for sharing sensitive data. Privacy risk is reduced rather than eliminated, since a generator fitted too closely to real records can still reveal information about them.
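
    The statistical-modeling route can be sketched in a few lines: records are sampled from an assumed model instead of being copied from real people. The model parameters below (mean age, spread, and plan shares) are invented for illustration; in practice they would be estimated from real data or taken from a simulation, and the privacy properties of the result would still need to be assessed.

```python
import random
from statistics import mean

# Assumed model parameters, for illustration only.
AGE_MEAN, AGE_STDDEV = 40.0, 10.0
PLANS = ["basic", "standard", "premium"]
PLAN_WEIGHTS = [0.5, 0.3, 0.2]

rng = random.Random(7)  # fixed seed so the sketch is reproducible


def synthetic_record(record_id):
    """Sample one artificial record from the assumed distributions."""
    age = max(18, round(rng.gauss(AGE_MEAN, AGE_STDDEV)))  # floor at 18
    plan = rng.choices(PLANS, weights=PLAN_WEIGHTS, k=1)[0]
    return {"id": record_id, "age": age, "plan": plan}


synthetic = [synthetic_record(i) for i in range(1000)]

# The sample should roughly reproduce the statistics of the model.
print(synthetic[:3])
print(mean(record["age"] for record in synthetic))
```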
