Data is the primary resource for all organizations to gain better insight into their business and market and make operational and tactical decisions faster and better. This makes the quality of data critical for successful and accurate decision-making.
To maintain the accuracy of the business-critical information that impacts strategic decision-making, businesses usually implement a data quality strategy that embeds data quality techniques into business processes and applications.
A data quality strategy typically involves cleaning up bad data, that is, data that is missing, incorrect, or invalid in some way. However, to ensure data is trustworthy, it is important to understand the key dimensions of data quality in order to assess how the data is bad in the first place.
Data quality dimension is a term that has been widely used for several years to describe the quality of data. A data quality dimension (attribute) is used by data professionals to represent a data feature that can be measured or assessed against defined standards to determine data quality.
Organizations select the data quality dimensions or the dimension thresholds based on their business context, requirements, levels of risk, etc. Therefore, each dimension is likely to have a different weighting.
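As a rough illustration of how weightings and thresholds might be combined, the sketch below computes a weighted overall score and lists the dimensions that fall below their thresholds. The scores, weights, thresholds, and function names are all hypothetical; a real organization would derive them from its own business context and risk levels.

```python
def weighted_quality_score(scores, weights):
    """Combine per-dimension scores (0.0 to 1.0) into one weighted average."""
    total_weight = sum(weights[dim] for dim in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(scores[dim] * weights[dim] for dim in scores) / total_weight


def failing_dimensions(scores, thresholds):
    """Return the dimensions whose score is below the agreed threshold."""
    return sorted(dim for dim, score in scores.items()
                  if score < thresholds.get(dim, 0.0))


# Hypothetical values, chosen only for illustration.
scores = {"accuracy": 0.98, "completeness": 0.90, "timeliness": 0.75}
weights = {"accuracy": 3, "completeness": 2, "timeliness": 1}
thresholds = {"accuracy": 0.95, "completeness": 0.95, "timeliness": 0.70}

overall = weighted_quality_score(scores, weights)
failed = failing_dimensions(scores, thresholds)
print(round(overall, 3), failed)
```

Here accuracy counts three times as much as timeliness, so a small accuracy problem moves the overall score more than a larger timeliness problem would.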
There are six core dimensions of data quality: Accuracy, Completeness (Coverage), Conformity (Validity), Consistency, Timeliness, and Uniqueness.
1. Accuracy
Accuracy is a measurement of the veracity of data or the measurement of the precision of data. It is the extent to which data is correct, reliable, and certified. It is the degree to which information correctly describes the “real world” object or event. It can be validated against defined business rules and measured against original documents or authoritative sources.
- Records at the wrong level of precision (e.g., prices that were originally quoted at three decimal places but truncated and stored at two decimal places)
- Records that haven’t been refreshed or updated
- Records that are wrong at a specified time (e.g., a record with an incorrect maturity date)
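Measuring accuracy against an authoritative source can be sketched as a simple match rate. The example below is a minimal sketch that assumes a hypothetical `reference` dictionary stands in for the authoritative source and that records carry an `id` field; the prices are invented to mirror the truncated-precision case above.

```python
from decimal import Decimal


def accuracy_rate(records, reference, field):
    """Share of checkable records whose `field` matches the reference value.

    Records with no counterpart in the reference are skipped, because
    their accuracy cannot be assessed. Returns None if nothing is checkable.
    """
    checkable = [r for r in records if r["id"] in reference]
    if not checkable:
        return None
    correct = sum(1 for r in checkable
                  if r[field] == reference[r["id"]][field])
    return correct / len(checkable)


# Hypothetical authoritative prices, quoted at three decimal places.
reference = {
    "A1": {"price": Decimal("10.125")},
    "B2": {"price": Decimal("99.500")},
}
# Stored records: A1 was truncated to two decimal places.
records = [
    {"id": "A1", "price": Decimal("10.12")},
    {"id": "B2", "price": Decimal("99.500")},
]

rate = accuracy_rate(records, reference, "price")
print(rate)
```

`Decimal` is used instead of `float` so that the comparison reflects the stored digits rather than binary rounding.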
2. Completeness (Coverage)
Completeness is the proportion of stored data against the potential of “100% complete.” It is the measurement of the availability of required data attributes. In other words, completeness measures the existence of necessary data attributes in the population of data records. It is the extent to which data are of sufficient breadth, depth, and scope for the task at hand.
- A missing ticker symbol, CUSIP, or other identifiers
- A benchmark or index that is missing a dividend notice or stock split
- A fixed income instrument record with a null coupon value
- A record with missing attributes
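Because completeness is a proportion, it lends itself to a direct calculation. The following is a minimal sketch, assuming records are plain dictionaries and that `None` or an empty string means "missing"; the field names and identifier values are made up for illustration.

```python
def completeness(records, required_fields):
    """Per-field proportion of records that carry a non-empty value."""
    total = len(records)
    result = {}
    for field in required_fields:
        # A legitimate zero (e.g. a zero coupon) still counts as present.
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        result[field] = present / total if total else 0.0
    return result


# Hypothetical instrument records; identifiers are placeholders.
records = [
    {"ticker": "ABC", "cusip": "123456789", "coupon": 4.5},
    {"ticker": "", "cusip": "987654321", "coupon": None},
    {"ticker": "XYZ", "cusip": "111111111", "coupon": 3.0},
    {"ticker": "DEF", "cusip": "555555555", "coupon": 0.0},
]

report = completeness(records, ["ticker", "cusip", "coupon"])
print(report)
```

Which fields are "required" is itself a business decision, so the list passed in would normally come from the data specification rather than from code.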
3. Conformity (Validity)
Conformity or validity is a measurement of the alignment of content with the required standards. Data is valid if it conforms to the syntax of its definition (format, type, range). Conformity measures how well the data aligns to internal, external, or industry-wide standards.
- Violation of allowable values (e.g., a state code for a country that doesn’t have states)
- Inconsistent date formats
- Invalid ISO currency codes
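Conformity checks usually come down to testing each value against a format, type, or allowed set. The sketch below assumes a hypothetical record layout with `trade_date` and `currency` fields, an expected year-month-day date format, and a deliberately tiny subset of currency codes; a real check would load the full list of allowed codes from a maintained source.

```python
from datetime import datetime

# Illustrative subset only, not the full ISO 4217 list.
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP", "JPY"}


def is_valid_date(value, fmt="%Y-%m-%d"):
    """True if `value` is a string that parses as a real date in `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False


def conformity_errors(record):
    """Return the names of the fields that violate the expected standard."""
    errors = []
    if not is_valid_date(record.get("trade_date")):
        errors.append("trade_date")
    if record.get("currency") not in ALLOWED_CURRENCIES:
        errors.append("currency")
    return errors


print(conformity_errors({"trade_date": "2023-03-04", "currency": "USD"}))
print(conformity_errors({"trade_date": "03/04/2023", "currency": "US$"}))
```

Note that a value can conform and still be inaccurate: a well-formed date may simply be the wrong date.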
4. Consistency
Consistency is the measurement of compliance with required formats, values, or definitions. First, it is the absence of difference when comparing two or more representations of a thing against a definition. Second, consistency ensures that data values, formats, and definitions in one population agree with those in another data population. Third, it is the extent to which data is presented in the same format and is compatible with previous data. Finally, it refers to the absence of violations of the semantic rules defined over the data set.
- Telephone numbers with commas vs. hyphens
- Values that are not logical given the parameters or rules (rationalization of coding schemes)
- Invalid data formats in some records in a feed
- U.S. vs. European date formats
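Checking that one data population agrees with another can be sketched as a field-by-field comparison of the records they share. The two "systems" below, their keys, and their values are hypothetical and exist only to show the comparison.

```python
def find_inconsistencies(population_a, population_b, fields):
    """List (key, field) pairs where two populations disagree.

    Only keys present in both populations are compared; a key missing
    from one side is a completeness issue, not a consistency issue.
    """
    mismatches = []
    for key in sorted(population_a.keys() & population_b.keys()):
        for field in fields:
            if population_a[key].get(field) != population_b[key].get(field):
                mismatches.append((key, field))
    return mismatches


# Hypothetical records held by two different systems.
trading_system = {
    "BOND-1": {"maturity_date": "2030-06-30", "currency": "USD"},
    "BOND-2": {"maturity_date": "2027-01-15", "currency": "EUR"},
}
risk_system = {
    "BOND-1": {"maturity_date": "30/06/2030", "currency": "USD"},
    "BOND-2": {"maturity_date": "2027-01-15", "currency": "EUR"},
    "BOND-3": {"maturity_date": "2031-12-01", "currency": "GBP"},
}

mismatches = find_inconsistencies(
    trading_system, risk_system, ["maturity_date", "currency"]
)
print(mismatches)
```

The comparison is deliberately literal: the two maturity dates for BOND-1 describe the same day, but because they are stored in different formats they are reported as inconsistent.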
5. Timeliness
Timeliness is the degree to which data represents reality from a critical point in time. It is the measurement of the degree to which data represents current conditions and is available for use. It measures how well content reflects current market/business conditions and whether the data is functionally available when needed. Timeliness has two components: age and volatility. Age, or currency, is how old the information is, based on how long ago it was recorded. Volatility is how frequently the information changes.
- A file delivered too late for a business process or operation
- A credit rating change not updated on the day it was issued
- A new prospectus not given an official number from the national numbering agency
- An issuance or corporate action not delivered when it was announced
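The age component of timeliness can be sketched as the time elapsed since a record was captured, compared against a maximum acceptable age. In the example below the timestamps and the age limits are hypothetical; in practice the limit would be set per data type, with more volatile data given a shorter limit.

```python
from datetime import datetime, timedelta, timezone


def data_age(recorded_at, as_of):
    """Age (currency) of a record: time elapsed since it was recorded."""
    return as_of - recorded_at


def is_timely(recorded_at, as_of, max_age):
    """True if the record is no older than the acceptable maximum age."""
    return data_age(recorded_at, as_of) <= max_age


# Hypothetical evaluation time and record timestamps (all UTC).
as_of = datetime(2024, 3, 15, 17, 0, tzinfo=timezone.utc)
rating_change = datetime(2024, 3, 14, 9, 30, tzinfo=timezone.utc)
price_tick = datetime(2024, 3, 15, 16, 59, tzinfo=timezone.utc)

# The rating change is 31.5 hours old, beyond an assumed 24-hour limit.
print(is_timely(rating_change, as_of, timedelta(hours=24)))
# The price tick is one minute old, within an assumed 5-minute limit.
print(is_timely(price_tick, as_of, timedelta(minutes=5)))
```

Passing `as_of` explicitly, rather than reading the system clock inside the function, keeps the check repeatable.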
6. Uniqueness
Uniqueness is a measurement of the degree to which no record or attribute is recorded more than once. Thus, it refers to the singularity of records/attributes. The objective is a single (unique) recording of data. The data item is measured against itself or its counterpart in another data set or database.
- Two instances of the same security with different identifiers or spellings
- A preferred share represented as both an equity and debt object in the same database
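Duplicate detection often starts by normalizing a descriptive field and grouping records that collapse to the same value. The sketch below assumes a very simple normalization rule (lowercase, drop periods and commas, collapse whitespace) and hypothetical security records; real matching rules are usually more elaborate.

```python
from collections import defaultdict


def normalize_name(name):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def find_duplicates(records):
    """Group record ids by normalized name; keep groups with more than one id."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize_name(record["name"])].append(record["id"])
    return {name: ids for name, ids in groups.items() if len(ids) > 1}


# Hypothetical securities: the first two are the same issuer spelled differently.
records = [
    {"id": "SEC1", "name": "Acme Corp."},
    {"id": "SEC2", "name": "ACME  corp"},
    {"id": "SEC3", "name": "Globex Inc."},
]

duplicates = find_duplicates(records)
print(duplicates)
```

A uniqueness score could then be derived as the number of distinct normalized names divided by the number of records.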
Other data quality considerations
- Accessibility – Extent to which information is available or easily and quickly retrievable.
- Duplication – It is a measure of unwanted duplication existing within or across systems for a particular field, record, or data set.
- Data specification – A measure of the existence, completeness, quality, and documentation of data standards, data models, business rules, metadata, and reference data.
- Presentation Quality – It is a measure of how information is presented. Format and appearance support the appropriate use of data.
- Safety – The function can achieve acceptable levels of risk of harm to people, processes, property, or the environment.
- Security – Extent to which access to information is restricted appropriately to maintain its security.
- Believability – Extent to which information is regarded as accurate and credible.
- Understandability – Extent to which data is precise, unambiguous, and easily comprehended.
- Objectivity – Extent to which information is unbiased, unprejudiced, and impartial.
- Relevancy – Extent to which information is applicable and helpful for the task at hand.
- Effectiveness – The function can enable users to achieve specified goals with accuracy and completeness in a specified context of use.
- Interpretability – Extent to which data is in appropriate languages, symbols, and units and the definitions are clear.
- Ease of Manipulation – Extent to which data is easy to manipulate and apply in different formats.
- Ease of Use and maintainability – A measure of the degree to which data can be accessed, used, updated, maintained, and managed.
- Usability – Extent to which information is clear and easily used.
- Data Decay – A measure of the rate of negative change to data.
- Navigation – Extent to which data are easily found and linked to.