Business Intelligence: Quality Assurance

             by David Wells
 

             I believe that building warehouse databases is relatively easy -- the hard part is obtaining and putting good data
             into the warehouse. Data quality and information quality are among the most difficult issues of data
             warehousing. The difficulties begin with the simple question, "What is data quality?" and become more complex
             when you ask, "What is the difference between data quality and information quality?" Once quality is
             understood, the really tough question is, "How do you achieve data and information quality?"

             What Is Data Quality?

             J. M. Juran’s book (Juran’s Quality Handbook, McGraw-Hill, 1999) focuses on quality as the absence of defects.
             Data defects are conditions in the data that make it difficult or impossible to obtain needed information, or that
             result in delivery of incorrect or unreliable information. Data quality is the degree to which data is free of defects
             that limit its utility as an information resource. Data defects are of two types: Integrity and Correctness.

             Integrity defects occur when a data structure is incorrect or unreliable. Data integrity means that the data
             structure has all properties necessary to provide a reliable and trustworthy view of the business. Desirable data
             integrity qualities include:

             1. Identity Integrity. Every occurrence of a real world object, and every row of a warehouse table, is uniquely
             identifiable.

             2. Referential Integrity. Every relationship leads to an existing row; navigation never reaches a "dead end."

             3. Cardinal Integrity. The number of participants in any relationship complies with business rules.

             4. Value Set Integrity. No data element contains a meaningless value.

             5. Data Dependency Integrity. Dependencies among values and dependencies among relationships comply
             with business rules.
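
             Several of the integrity qualities above can be tested mechanically. The following is a minimal
             sketch in Python, not a definitive implementation. The table contents, column names (customer_id,
             status) and function names are hypothetical, and the business rules (allowed status values, one to
             five orders per customer) are assumptions made only for illustration.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Identity integrity: key values that appear on more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def orphan_references(child_rows, fk, parent_rows, pk):
    """Referential integrity: foreign-key values with no matching parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    return sorted({row[fk] for row in child_rows} - parent_keys)


def values_outside_set(rows, column, allowed):
    """Value set integrity: rows whose value is not in the allowed set."""
    return [row for row in rows if row[column] not in allowed]


def cardinality_violations(child_rows, fk, parent_rows, pk, minimum, maximum):
    """Cardinal integrity: parent keys whose child count is out of range."""
    counts = Counter(row[fk] for row in child_rows)
    return sorted(
        {row[pk] for row in parent_rows
         if not minimum <= counts[row[pk]] <= maximum}
    )


# Hypothetical sample data with one defect of each kind.
customers = [
    {"customer_id": 1, "status": "active"},
    {"customer_id": 2, "status": "closed"},
    {"customer_id": 2, "status": "??"},      # duplicate key, meaningless value
    {"customer_id": 3, "status": "active"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 9},      # "dead end": no customer 9
]

duplicates = duplicate_keys(customers, "customer_id")
orphans = orphan_references(orders, "customer_id", customers, "customer_id")
bad_status = values_outside_set(customers, "status", {"active", "closed"})
bad_counts = cardinality_violations(
    orders, "customer_id", customers, "customer_id", minimum=1, maximum=5
)
```

             Each function returns the offending keys or rows rather than a pass/fail flag, so the results can
             feed an audit report as well as a load filter.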

             Correctness defects occur when the data content is incorrect or unreliable. Data correctness means that the
             data content has all properties necessary to provide a reliable and trustworthy view of the business. Larry
             English (Improving Data Warehouse and Business Information Quality, John Wiley & Sons, 1999) describes
             many of the desirable data correctness qualities including:

             1. Completeness. Needed data is present to provide a full picture of the business.

             2. Validity. Data values and combinations of values have business meaning in a specific context, and at a
             particular point in time.

             3. Accuracy. Data represents a true and factual view of the real world objects that it describes.

             4. Precision. Data is sufficiently detailed and granular to meet business needs.

             5. Consistency. Redundant data sources do not produce conflicting facts.
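
             Completeness, validity and consistency lend themselves to simple measurements; accuracy and
             precision usually require comparison against the real world or against business requirements and are
             not shown. The sketch below is illustrative only. The source names (billing, crm, shipments), the
             column names and the rule that a shipment cannot precede its order are all assumptions.

```python
from datetime import date


def completeness(rows, column):
    """Completeness: fraction of rows where the column is present and not empty."""
    if not rows:
        return 1.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)


def invalid_date_order(rows, earlier, later):
    """Validity: rows where the 'later' date precedes the 'earlier' date."""
    return [row for row in rows if row[later] < row[earlier]]


def conflicting_facts(source_a, source_b, key, column):
    """Consistency: keys present in both sources whose values disagree."""
    b_by_key = {row[key]: row[column] for row in source_b}
    return sorted(
        row[key] for row in source_a
        if row[key] in b_by_key and b_by_key[row[key]] != row[column]
    )


# Hypothetical sample data.
billing = [
    {"customer_id": 1, "postal_code": "98101"},
    {"customer_id": 2, "postal_code": ""},
    {"customer_id": 3, "postal_code": "98105"},
    {"customer_id": 4, "postal_code": None},
]
crm = [
    {"customer_id": 1, "postal_code": "98101"},
    {"customer_id": 3, "postal_code": "98195"},  # disagrees with billing
]
shipments = [
    {"order_id": 10, "ordered": date(1999, 3, 1), "shipped": date(1999, 3, 4)},
    {"order_id": 11, "ordered": date(1999, 3, 5), "shipped": date(1999, 3, 2)},
]

postal_completeness = completeness(billing, "postal_code")
invalid_shipments = invalid_date_order(shipments, "ordered", "shipped")
conflicts = conflicting_facts(billing, crm, "customer_id", "postal_code")
```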

             What Is Information Quality?

             Information quality is the degree to which information is free of defects that limit its utility as a business intelligence
             resource. Data warehousing is a process of turning data into information. Information defects occur when that
             process is defective. They result from using defective data to produce information, from turning good data into
             bad information, and from using good information to reach wrong conclusions. Information defects are of three
             types:

             • Materials defects occur when using the wrong data, or when using data of poor quality to produce
             information. When data of poor quality is used, data defects are propagated to become information defects.
             When data is misunderstood and used inappropriately, high-quality data becomes low quality information.

             • Presentation defects occur when information is delivered in a form that is unreliable, inconsistent or subject to
             misunderstanding. Presentation quality means that information is delivered in a useful and understandable form.

             • Application defects occur when good information results in wrong conclusions. Application quality means that
             information is fully understood and appropriately used.

             Achieving Quality

             Understanding and detecting the various kinds of defects is the first step to data and information quality. Data
             quality is a procedural issue. Quality improvements are achieved through data cleansing, with attention to defect
             prevention and removal as data is placed into the warehouse. Deciding when, where and how to audit, filter
             and correct warehouse data is itself a complex topic.
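
             As one possible shape for such a step, the sketch below corrects each incoming row, filters out
             rows that still fail a rule, and keeps audit counts for the load. It is a simplified illustration
             under stated assumptions, not a recommended design: the function names, the state column and the
             set of accepted state codes are hypothetical.

```python
def cleanse(rows, correct, is_acceptable):
    """Correct each row, then split the results into loaded and rejected rows."""
    loaded, rejected = [], []
    for row in rows:
        fixed = correct(dict(row))  # work on a copy so the source row is untouched
        if is_acceptable(fixed):
            loaded.append(fixed)
        else:
            rejected.append(fixed)
    audit = {"read": len(rows), "loaded": len(loaded), "rejected": len(rejected)}
    return loaded, rejected, audit


def normalize_state(row):
    """Correction rule (assumed): trim whitespace and upper-case the state code."""
    row["state"] = (row.get("state") or "").strip().upper()
    return row


def has_known_state(row):
    """Filter rule (assumed): accept only state codes in a known value set."""
    return row["state"] in {"WA", "OR"}


# Hypothetical incoming rows.
incoming = [
    {"id": 1, "state": " wa"},
    {"id": 2, "state": "Oregon"},  # not a recognized code even after correction
    {"id": 3, "state": "or"},
]

loaded, rejected, audit = cleanse(incoming, normalize_state, has_known_state)
```

             Rejected rows are returned rather than discarded so that they can be reviewed and the defect traced
             back to its source.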

             Information quality improvements are focused more on people than procedure. Information quality strategies
             focus on preventing and removing defects when information is delivered from the warehouse and used to make
             business decisions. Tools and tactics include metadata, education, training and support.

             About the Author: David Wells is an Enterprise Systems Manager at the University of Washington, the
             founder and Principal Consultant of Infocentric, and a fellow of The Data Warehousing Institute
             (TDWI). He can be reached at dwells@infocentric.org.