by David Wells
I believe that building warehouse databases is relatively easy -- the hard part is obtaining good data and putting it into the warehouse. Data quality and information quality are among the most difficult issues of data warehousing. The difficulties begin with the simple question, "What is data quality?" and become more complex when you ask, "What is the difference between data quality and information quality?" Once quality is understood, the really tough question is, "How do you achieve data and information quality?"
What Is Data Quality?
J. M. Juran’s book (Juran’s Quality Handbook, McGraw-Hill, 1999) focuses on quality as the absence of defects. Data defects are conditions in the data that make it difficult or impossible to obtain needed information, or that result in delivery of incorrect or unreliable information. Data quality is the degree to which data is free of defects that limit its utility as an information resource. Data defects are of two types: integrity defects and correctness defects.
Integrity defects occur when a data structure is incorrect or unreliable. Data integrity means that the data structure has all properties necessary to provide a reliable and trustworthy view of the business. Desirable data integrity qualities include:
1. Identity Integrity. Every occurrence of a real world object, and every row of a warehouse table, is uniquely identifiable.
2. Referential Integrity. Navigation of "dead end" relationships never occurs.
3. Cardinal Integrity. The number of participants in any relationship complies with business rules.
4. Value Set Integrity. No data element contains a meaningless value.
5. Data Dependency Integrity. Dependencies among values and dependencies among relationships comply with business rules.
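The five integrity qualities above can each be expressed as a simple check. The following is a minimal sketch in Python, not a definitive implementation: the tables, column names, status values, and business rules (every order needs at least one line, a shipped order needs a ship date) are all hypothetical and exist only to illustrate one defect of each kind.

```python
from collections import Counter

# Hypothetical sample tables; every name and rule here is an assumption.
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
    {"customer_id": 2, "name": "Globex Corp"},  # duplicate key
]
orders = [
    {"order_id": 10, "customer_id": 1, "status": "shipped", "ship_date": "1999-03-01"},
    {"order_id": 11, "customer_id": 3, "status": "open", "ship_date": None},     # no customer 3
    {"order_id": 12, "customer_id": 1, "status": "??", "ship_date": None},       # meaningless status
    {"order_id": 13, "customer_id": 2, "status": "shipped", "ship_date": None},  # shipped, no date
]
order_lines = [
    {"order_id": 10, "line_no": 1},
    {"order_id": 11, "line_no": 1},
    {"order_id": 12, "line_no": 1},
    # order 13 has no lines
]
VALID_STATUSES = {"open", "shipped", "cancelled"}


def identity_defects(rows, key):
    """Key values that identify more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def referential_defects(children, fk, parents, pk):
    """Child rows whose foreign key leads to no parent row (a dead end)."""
    parent_keys = {row[pk] for row in parents}
    return [row for row in children if row[fk] not in parent_keys]


def cardinality_defects(parents, pk, children, fk, minimum=1):
    """Parent keys with fewer child rows than the assumed rule requires."""
    counts = Counter(row[fk] for row in children)
    return sorted(row[pk] for row in parents if counts[row[pk]] < minimum)


def value_set_defects(rows, column, allowed):
    """Rows whose column holds a value outside the allowed set."""
    return [row for row in rows if row[column] not in allowed]


def dependency_defects(rows):
    """Rows that break the assumed rule: shipped orders carry a ship date."""
    return [r for r in rows if r["status"] == "shipped" and r["ship_date"] is None]
```

Each function returns the offending keys or rows rather than a yes/no answer, so the result can feed an audit report or a correction step.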
Correctness defects occur when the data content is incorrect or unreliable. Data correctness means that the data content has all properties necessary to provide a reliable and trustworthy view of the business. Larry English (Improving Data Warehouse and Business Information Quality, John Wiley & Sons, 1999) describes many of the desirable data correctness qualities, including:
1. Completeness. Needed data is present to provide a full picture of the business.
2. Validity. Data values and combinations of values have business meaning in a specific context and at a particular point in time.
3. Accuracy. Data represents a true and factual view of the real world objects that it describes.
4. Precision. Data is sufficiently detailed and granular to meet business needs.
5. Consistency. Redundant data sources do not produce conflicting facts.
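Some of these correctness qualities can be measured directly from the data. The sketch below, in Python, illustrates two of them under stated assumptions: completeness as the fraction of populated values in a column, and consistency as agreement between two redundant sources. The source names, columns, and sample values are hypothetical. Accuracy is deliberately left out, since it can only be judged against the real world objects themselves, not against the data alone.

```python
def completeness(rows, column):
    """Fraction of rows in which the column is populated (not None or empty)."""
    if not rows:
        return 1.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)


def consistency_conflicts(source_a, source_b, key, column):
    """Keys present in both sources whose values for the column disagree."""
    b_by_key = {row[key]: row[column] for row in source_b}
    return sorted(
        row[key]
        for row in source_a
        if row[key] in b_by_key and b_by_key[row[key]] != row[column]
    )


# Hypothetical redundant sources describing the same customers.
billing = [
    {"customer_id": 1, "postal_code": "98195"},
    {"customer_id": 2, "postal_code": None},
    {"customer_id": 3, "postal_code": "10001"},
    {"customer_id": 4, "postal_code": ""},
]
crm = [
    {"customer_id": 1, "postal_code": "98195"},
    {"customer_id": 3, "postal_code": "10002"},
]

print(completeness(billing, "postal_code"))
print(consistency_conflicts(billing, crm, "customer_id", "postal_code"))
```

A conflict list only shows that the sources disagree; deciding which source holds the true value remains a business question.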
What Is Information Quality?
Information quality is the degree to which data is free of defects that limit its utility as a business intelligence resource. Data warehousing is a process of turning data into information. Information defects occur when that process is defective. They result from using defective data to produce information, from turning good data into bad information, and from using good information to reach wrong conclusions. Information defects are of three types:
• Materials defects occur when using the wrong data, or when using data of poor quality, to produce information. When data of poor quality is used, data defects are propagated to become information defects. When data is misunderstood and used inappropriately, high-quality data becomes low-quality information.
• Presentation defects occur when information is delivered in a form that is unreliable, inconsistent or subject to misunderstanding. Presentation quality means that information is delivered in a useful and understandable form.
• Application defects occur when good information results in wrong conclusions. Application quality means that information is fully understood and appropriately used.
Achieving Quality
Understanding and detecting the various kinds of defects is the first step to data and information quality. Data quality is a procedural issue. Quality improvements are achieved through data cleansing, with attention to defect prevention and removal as data is placed into the warehouse. Deciding when, where and how to audit, filter and correct warehouse data is itself a complex topic.
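As one illustration of auditing, filtering and correcting data on its way into the warehouse, the following Python sketch applies all three to a single field. It is a minimal sketch under assumptions, not a recommended design: the record layout, the "state" column, the normalization rule, and the function name cleanse are hypothetical.

```python
def cleanse(records, valid_states):
    """Correct what can be corrected, reject the rest, and log every action."""
    loaded, rejected, audit_log = [], [], []
    for record in records:
        row = dict(record)  # copy so the source rows stay untouched
        raw = row.get("state")
        # Assumed correction rule: trim whitespace and upper-case the code.
        normalized = raw.strip().upper() if isinstance(raw, str) else raw
        if normalized in valid_states:
            if normalized != raw:
                audit_log.append((row["id"], "corrected", raw, normalized))
                row["state"] = normalized
            loaded.append(row)
        else:
            # Filter: the value cannot be repaired, so the row is held back.
            audit_log.append((row["id"], "rejected", raw, None))
            rejected.append(row)
    return loaded, rejected, audit_log


# Hypothetical incoming records.
incoming = [
    {"id": 1, "state": "WA"},
    {"id": 2, "state": " wa "},
    {"id": 3, "state": "ZZ"},
    {"id": 4, "state": None},
]
loaded, rejected, audit_log = cleanse(incoming, {"WA", "OR", "CA"})
```

Keeping the audit log separate from the loaded rows means the corrections and rejections can be reviewed later, which helps trace recurring defects back to their source.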
Information quality improvements are focused more on people than on procedure. Information quality strategies focus on preventing and removing defects when information is delivered from the warehouse and used to make business decisions. Tools and tactics include metadata, education, training and support.
About the Author: David Wells is an Enterprise Systems Manager at the University of Washington, the founder and Principal Consultant of Infocentric, and a fellow of The Data Warehousing Institute (TDWI). He can be reached at dwells@infocentric.org.