![Page 1: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/1.jpg)
Data Quality: A Raising Data Warehousing Concern
Presented by: Chowdhury, Mohammad Aminul Hoque
http://aminchowdhury.info
![Page 2: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/2.jpg)
Data Warehousing
![Page 3: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/3.jpg)
Characteristics of Data Warehouse
• A data warehouse supports management decision making
• It is Subject Oriented: it gives information about a particular subject (such as sales or customers) rather than about a company's ongoing operations
• It is Integrated: data is gathered into the data warehouse from a variety of sources and merged into a coherent whole
• It is Time-variant: all data in the warehouse is identified with a particular time period
• It is Non-volatile, meaning stable: data is added, but existing data is not normally changed or removed.
![Page 4: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/4.jpg)
Benefits of a data warehouse
• Maintain data history
• Integrate data from multiple source systems, enabling a central view
• Improve data quality by providing consistent codes and descriptions, or even fixing bad data
• Present the organization's information consistently
• Provide a single common data model for all data sources
• Restructure the data so that it makes sense to the users
• Restructure the data so that it delivers excellent query performance
• Make decision-support queries easier.
![Page 5: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/5.jpg)
Designing a Data Warehouse
• Top-down approach, bottom-up approach, or a combination of both
• From a software engineering point of view: Waterfall and Spiral
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
1. Star schema
2. Snowflake schema
3. Fact constellations
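A star schema can be sketched with Python's built-in `sqlite3` module. This is only an illustrative sketch: the table names (`fact_sales`, `dim_store`, `dim_date`), their columns, and the sample rows are assumptions, not taken from the presentation. The fact table holds the measure (`amount`) and points to the dimension tables through keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
CREATE TABLE dim_date  (date_key  INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- The fact table stores the measure and one key per dimension.
CREATE TABLE fact_sales (
    store_key INTEGER REFERENCES dim_store(store_key),
    date_key  INTEGER REFERENCES dim_date(date_key),
    amount    REAL
);
""")
conn.executemany("INSERT INTO dim_store VALUES (?, ?, ?)",
                 [(1, "Store A", "North"), (2, "Store B", "South")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(20240101, 2024, 1), (20240201, 2024, 2)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 20240101, 100.0), (1, 20240201, 50.0), (2, 20240101, 75.0)])

# A typical analytical query: aggregate the measure by a dimension attribute.
sales_by_region = dict(conn.execute("""
    SELECT s.region, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_key = f.store_key
    GROUP BY s.region
""").fetchall())
```

In a snowflake schema the `region` attribute would typically move into its own table referenced from `dim_store`; in a fact constellation several fact tables would share these dimensions.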
![Page 6: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/6.jpg)
Extract, Transform, Load (ETL)
![Page 7: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/7.jpg)
Extract
• The ETL process starts by extracting the data from the source systems.
• Most data warehousing projects consolidate data from different source systems
• Each separate system may also use a different data organization and/or format
• The goal of the extraction phase is to convert the data into a single format appropriate for transformation processing.
Figure: ETL Architecture Pattern
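As a minimal sketch of the extraction goal, the snippet below reads two hypothetical sources, one CSV and one JSON, and converts both into the same record layout. The source contents, the field names, and the target layout (`customer_id`, `name`) are assumptions made for illustration.

```python
import csv
import io
import json

# Two hypothetical source systems with different formats and field names.
CSV_SOURCE = "customer_id,name\n1,Alice\n2,Bob\n"
JSON_SOURCE = '[{"id": 3, "full_name": "Carol"}]'

def extract_csv(text):
    """Read CSV rows and map them to the common record layout."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"customer_id": int(r["customer_id"]), "name": r["name"]} for r in reader]

def extract_json(text):
    """Read JSON objects and map them to the common record layout."""
    return [{"customer_id": int(r["id"]), "name": r["full_name"]} for r in json.loads(text)]

# After extraction every record has the same shape, whatever its origin.
records = extract_csv(CSV_SOURCE) + extract_json(JSON_SOURCE)
```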
![Page 8: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/8.jpg)
Transform
This stage applies a series of rules to the data extracted from the source to derive the data for loading into the end target:
• Selecting only certain columns to load
• Translating coded values (e.g., 1 for male and 2 for female)
• Encoding free-form values (e.g., mapping "Male" to "M")
• Deriving a new calculated value
• Sorting
• Joining data from multiple sources (e.g., lookup, merge) and de-duplicating the data
• Aggregation (e.g., summarizing multiple rows of data: total sales for each store, for each region, etc.)
![Page 9: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/9.jpg)
Transform
• Generating surrogate-key values
• Transposing or pivoting
• Splitting a column into multiple columns
• Looking up and validating the relevant data from tables or referential files for slowly changing dimensions
• Applying any form of simple or complex data validation.
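A few of the transformations listed for this stage can be sketched in plain Python. The sample rows, the field names, and the convention of mapping unrecognised gender values to "U" are assumptions for illustration only; a real ETL tool would express these rules in its own mapping language.

```python
from collections import defaultdict
from itertools import count

# Translating coded values and encoding free-form values into one code set.
GENDER_CODES = {"1": "M", "2": "F", "male": "M", "female": "F"}

def normalise_gender(raw):
    # "U" (unknown) is an assumed fallback, not something the source prescribes.
    return GENDER_CODES.get(str(raw).strip().lower(), "U")

def deduplicate(rows, key):
    """Keep the first row seen for each value of `key`."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

def total_by(rows, group_field, value_field):
    """Aggregate: sum `value_field` for each distinct `group_field`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_field]] += row[value_field]
    return dict(totals)

def assign_surrogate_keys(rows, start=1):
    """Return copies of the rows with a generated integer key `sk`."""
    keys = count(start)
    return [{**row, "sk": next(keys)} for row in rows]

sales = [
    {"order_id": "A1", "store": "S1", "gender": 1, "amount": 10.0},
    {"order_id": "A1", "store": "S1", "gender": 1, "amount": 10.0},  # duplicate row
    {"order_id": "A2", "store": "S1", "gender": "Female", "amount": 5.0},
    {"order_id": "A3", "store": "S2", "gender": "2", "amount": 7.5},
]
clean = assign_surrogate_keys(deduplicate(sales, "order_id"))
for row in clean:
    row["gender"] = normalise_gender(row["gender"])
store_totals = total_by(clean, "store", "amount")
```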
![Page 10: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/10.jpg)
Load
• This phase loads the data into the data warehouse.
• This process varies widely. Some data warehouses overwrite existing information with cumulative information; others enter the data for a given window (for example, one year) in a historical manner, keeping the earlier entries.
• Because the load phase interacts with a database, the constraints defined in the database schema apply (for example, uniqueness, referential integrity, mandatory fields), which contributes to the overall data quality performance of the ETL process.
• ETL can also be used to transform the data into a format suitable for a new application to use.
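The two loading styles, overwriting versus keeping history, can be sketched with `sqlite3`. The tables `balance_current` and `balance_history`, the `load` helper, and the sample rows are hypothetical. The database constraints (primary key, NOT NULL) reject bad rows during the load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Overwrite style: one row per account, replaced on every load.
CREATE TABLE balance_current (account_id TEXT PRIMARY KEY, balance REAL NOT NULL);
-- Historical style: one row per account and load date, never overwritten.
CREATE TABLE balance_history (
    account_id TEXT NOT NULL,
    load_date  TEXT NOT NULL,
    balance    REAL NOT NULL,
    PRIMARY KEY (account_id, load_date)
);
""")

def load(conn, load_date, rows):
    """Load rows into both tables; return the rows rejected by constraints."""
    rejected = []
    for account_id, balance in rows:
        try:
            with conn:  # commits on success, rolls back both inserts on error
                conn.execute("INSERT OR REPLACE INTO balance_current VALUES (?, ?)",
                             (account_id, balance))
                conn.execute("INSERT INTO balance_history VALUES (?, ?, ?)",
                             (account_id, load_date, balance))
        except sqlite3.IntegrityError:
            rejected.append((account_id, balance))
    return rejected

rejected_day1 = load(conn, "2024-01-01", [("A", 100.0), ("B", None)])
rejected_day2 = load(conn, "2024-01-02", [("A", 120.0)])
current_a = conn.execute(
    "SELECT balance FROM balance_current WHERE account_id = 'A'").fetchone()[0]
history_a = conn.execute(
    "SELECT COUNT(*) FROM balance_history WHERE account_id = 'A'").fetchone()[0]
```

After the two loads, the current table holds only the latest balance for account A, while the history table keeps one row per load date.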
![Page 11: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/11.jpg)
Data Quality
Data quality is an essential characteristic that determines the
reliability of data for making decisions.
High-quality data:
Complete
Accurate
Available
Timely
![Page 12: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/12.jpg)
Classification Of Data Quality Issues
• Data Quality Issues at Data Sources
• Data Quality Issues at Data Profiling Stage
• Data Quality Issues at Data Staging and ETL
• Data Quality Problems at Data Modelling
![Page 13: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/13.jpg)
DATA SOURCE
The sources of dirty data include data entry errors and update errors.
Part of the data comes from text files, part from MS Excel files, and part from other sources.
Some files are the result of manual consolidation of multiple files, which might compromise data quality.
DATA PROFILE
• A process of developing information about data instead of information from data
• Utilizes statistical variables
• Metadata
Cont...
![Page 14: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/14.jpg)
Example of Data Profiling
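As a rough sketch of what profiling produces, the hypothetical helper `profile_column` below derives statistics about a column (row count, missing count, distinct values, most frequent value, minimum and maximum) rather than information from it. The sample values are invented for illustration.

```python
from collections import Counter

def profile_column(values):
    """Return simple statistics describing one column of data."""
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

# A gender column with inconsistent encodings and missing entries.
gender_profile = profile_column(["M", "F", "m", "", None, "F", "Female"])
```

A profile like this already hints at quality problems: four distinct values in a field expected to hold two, and two missing entries out of seven.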
![Page 15: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/15.jpg)
DATA STAGING AND ETL
• A data cleaning process is executed in the data staging area to improve accuracy
• The data staging area is the place where all grooming is done on data after it is culled from the source systems
• It is a prime location for validating the quality of data from the source, and for auditing and tracking down data issues.
Cont..
![Page 16: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/16.jpg)
DATA MODELLING
• The design of the schema greatly influences the quality of the analysis
• Operational applications use a UML class model for conceptual data modelling
• Issues such as slowly changing dimensions, rapidly changing dimensions, and multi-valued dimensions
Cont..
![Page 17: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/17.jpg)
Causes Of Data Quality
CAUSES OF DATA QUALITY PROBLEMS AT DATA SOURCES
• Wrong information entered into the source system
• As time and distance from the source increase, the chances of getting correct data decrease
• Inability to cope with ageing data contributes to data quality problems
• Varying timeliness of data sources
• System fields designed to allow free-form input (fields not having adequate length)
• Missing values in data sources
• Additional columns
• Use of different representation formats in data sources
![Page 18: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/18.jpg)
Causes Of Data Quality
CAUSES OF DATA QUALITY PROBLEMS AT DATA PROFILING
• Unreliable and incomplete metadata of the data source
• User-generated SQL queries written for data profiling leave data quality problems undetected
• Inability to evaluate data structure, data values, and data relationships before data integration propagates poor data quality
• Lack of integration between data profiling and ETL causes data quality problems
• Inappropriate selection of an automated profiling tool causes data quality issues
• Insufficient structural analysis of the data sources in the profiling stage.
![Page 19: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/19.jpg)
Cont..
CAUSES OF DATA QUALITY ISSUES AT DATA STAGING AND ETL PHASE
• Different business rules across the various data sources create data quality problems
• Business rules that lack currency contribute to data quality problems
• Lack of capturing only the changes in source files
• Lack of periodic refreshing of the integrated data storage
• Disabling data integrity constraints in data staging tables causes wrong data and relationships to be extracted
• Purging of data from the data warehouse causes data quality problems
• Inability to restart the ETL process from checkpoints without losing data
• Lack of automatically generated rules for ETL tools to build mappings that detect and fix data defects
• Unhandled null values cause data quality problems
• Lack of an automated unit testing facility causes data quality problems
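One item in this list, capturing only the changes in source files, can be approached by comparing fingerprints of two snapshots. The sketch below is one simple way to do it, assuming each row has a stable key; `row_fingerprint` and `capture_changes` are hypothetical helper names, and the snapshots are invented.

```python
import hashlib
import json

def row_fingerprint(row):
    """Hash a row's content so that changed rows can be detected cheaply."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def capture_changes(previous, current, key):
    """Compare two snapshots and return (inserted rows, updated rows, deleted keys)."""
    old = {r[key]: row_fingerprint(r) for r in previous}
    new = {r[key]: r for r in current}
    inserted = [r for k, r in new.items() if k not in old]
    updated = [r for k, r in new.items() if k in old and old[k] != row_fingerprint(r)]
    deleted = [k for k in old if k not in new]
    return inserted, updated, deleted

previous = [{"id": 1, "city": "Dhaka"}, {"id": 2, "city": "Oslo"}, {"id": 4, "city": "Rome"}]
current = [{"id": 1, "city": "Dhaka"}, {"id": 2, "city": "Bergen"}, {"id": 3, "city": "Lima"}]
inserted, updated, deleted = capture_changes(previous, current, "id")
```

Only the inserted and updated rows then need to pass through the rest of the ETL process; unchanged rows are skipped.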
![Page 20: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/20.jpg)
Cont..
CAUSES OF DATA QUALITY ISSUES AT DATA WAREHOUSE SCHEMA DESIGN
• Incomplete or wrong requirement analysis of the project leads to poor schema design
• Lack of currency in business rules causes poor requirement analysis
• The choice of dimensional modelling schema (star, snowflake, fact constellation) contributes to data quality
• Late identification of slowly changing dimensions contributes to data quality problems
• Late-arriving dimensions cause DQ problems
• Multi-valued dimensions cause DQ problems
• Incomplete/wrong identification of facts/dimensions, bridge tables or relationship tables
• Inability to support database schema refactoring causes data quality problems
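To make the slowly changing dimension issue concrete, here is a sketch of one common way to handle it, often called a Type 2 change: the current row is closed and a new version is appended, so history is preserved. The presentation does not say which approach it has in mind. The row layout (`surrogate_key`, `natural_key`, `attrs`, `valid_from`, `valid_to`) and the function name `apply_scd2` are assumptions.

```python
from datetime import date

def apply_scd2(dimension, natural_key, new_attrs, effective):
    """Append a new version of a dimension row when its attributes change."""
    current = next(
        (r for r in dimension
         if r["natural_key"] == natural_key and r["valid_to"] is None),
        None,
    )
    if current is not None and current["attrs"] == new_attrs:
        return dimension  # nothing changed, keep the current version
    if current is not None:
        current["valid_to"] = effective  # close the old version
    dimension.append({
        "surrogate_key": len(dimension) + 1,  # simplistic key generation for the sketch
        "natural_key": natural_key,
        "attrs": dict(new_attrs),
        "valid_from": effective,
        "valid_to": None,  # None marks the current version
    })
    return dimension

customer_dim = []
apply_scd2(customer_dim, "C42", {"city": "Dhaka"}, date(2023, 1, 1))
apply_scd2(customer_dim, "C42", {"city": "Helsinki"}, date(2024, 6, 1))
apply_scd2(customer_dim, "C42", {"city": "Helsinki"}, date(2024, 7, 1))  # no change
```

If the need for this kind of history is identified late, facts already loaded against an overwritten dimension row can no longer be tied to the attribute values that were valid at the time.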
![Page 21: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/21.jpg)
DQ TOOLS
![Page 22: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/22.jpg)
REAL TIME INFORMATICA TOOL
![Page 23: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/23.jpg)
Impact of Data Quality Issues
![Page 24: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/24.jpg)
Cost of Poor Data Quality
![Page 25: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/25.jpg)
Confidence and Satisfaction-based impacts
Poor data quality results in low confidence in forecasting and in inconsistent operational and management reporting.
This causes delayed or improper decisions.
It affects the satisfaction of customers, employees, or suppliers, which leads to decreased organizational trust.
Example: An international bank could not meet its customer satisfaction goals because agents in its 23 contact centres all followed different operational processes, using up to 18 different apps, many of which contained duplicate data, to serve a single customer.
![Page 26: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/26.jpg)
Impact on Productivity
Workloads: Increased need for reconciliation of reports
Throughput: Increased time for data gathering and preparation, reduced time for direct data analysis, delays in delivering information products, lengthened production and manufacturing cycles
Output quality: Mistrusted reports
Supply chain: Out-of-stock, delivery delays, missed deliveries, duplicate costs for product delivery
![Page 27: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/27.jpg)
Risk and Compliance impacts
Risk and compliance impacts are associated with credit assessment, investment risk, competitive risk, capital investment and/or development, fraud, and leakage, as well as with compliance with government regulations, industry expectations, or self-imposed policies (such as privacy policies).
Example: Healthcare systems deal with sensitive information about patients' health conditions. The privacy of this kind of data should be protected.
![Page 28: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/28.jpg)
Examples of Data Quality Problems
• A retail company found that over 1m records contained a home telephone number of “000000000” and addresses containing flight numbers
• An insurance company found customer records with 99/99/99 in the creation date field of the policy
• A car rental company discovered duplicate agreement numbers in its European data warehouse
• A healthcare company found 9 different values in the gender field
• A food/beverage retail chain found that the same product was both its No. 1 and No. 2 best seller across its business
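Problems like the ones above can often be caught with simple validation rules. The checks below are a hedged sketch: the function names are hypothetical, and the date format (`%d/%m/%y`) and the allowed gender codes are assumptions, since the presentation does not state them.

```python
import re
from collections import Counter
from datetime import datetime

def is_placeholder_phone(phone):
    """Flag empty numbers and numbers made of one repeated digit, e.g. 000000000."""
    digits = re.sub(r"\D", "", phone or "")
    return digits == "" or len(set(digits)) == 1

def is_valid_date(text, fmt="%d/%m/%y"):
    """Return True only if the text parses as a real calendar date."""
    try:
        datetime.strptime(text, fmt)
        return True
    except (ValueError, TypeError):
        return False

def duplicate_keys(keys):
    """Return the key values that occur more than once."""
    return sorted(k for k, n in Counter(keys).items() if n > 1)

def unexpected_values(values, allowed):
    """Return the values that are not in the allowed code set."""
    return sorted(set(values) - set(allowed))

phone_flag = is_placeholder_phone("000000000")
date_flag = is_valid_date("99/99/99")
dupes = duplicate_keys(["R1", "R2", "R1", "R3"])
odd_genders = unexpected_values(["M", "F", "male", "X"], {"M", "F"})
```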
![Page 29: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/29.jpg)
![Page 30: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/30.jpg)
Example cont..
![Page 31: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/31.jpg)
Example cont..
![Page 32: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/32.jpg)
Example cont..
![Page 33: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/33.jpg)
What Influences Data Quality?
Schema Design influences the quality of the analysis
Poor data handling procedures and processes
Failure to adhere to data entry and maintenance procedures
Errors in the migration process from one system to another
External and third-party data that may not fit
![Page 34: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/34.jpg)
Causes of Data Quality Problems
Choice of dimensional modelling schema (star, snowflake, fact constellation)
Multi-valued dimensions
Incomplete/Wrong identification of facts/dimensions, bridge tables or relationship tables
Incomplete/missing values
Corrupted values
Out of range values
Wrong data
Duplicate data
Dissimilar data formats
Incompatible structures
![Page 35: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/35.jpg)
Missing Data
Nonresponse: no information is provided
Improper data collection
Mistakes in data entry
How to deal with it
• Imputation
• Reconstruction
• Denial/Removal
• Interpolation
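Three of these options can be sketched for a numeric series in which `None` marks a missing value. Mean imputation and linear interpolation are used here as simple representatives of their families, an assumption on my part; reconstruction depends on having another source for the value and is not shown.

```python
from statistics import mean

def remove_missing(values):
    """Removal: drop the missing entries."""
    return [v for v in values if v is not None]

def impute_mean(values):
    """Imputation: replace each missing entry with the mean of the known ones.

    Raises statistics.StatisticsError if no value is known.
    """
    fill = mean(remove_missing(values))
    return [fill if v is None else v for v in values]

def interpolate_linear(values):
    """Interpolation: fill gaps on a straight line between known neighbours.

    Gaps before the first or after the last known value are left as None.
    """
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result

readings = [10.0, None, 14.0, None, None, 20.0]
removed = remove_missing(readings)
interpolated = interpolate_linear(readings)
imputed = impute_mean([1.0, None, 3.0])
```

Which option fits depends on the data: removal shrinks the data set, mean imputation flattens variation, and interpolation only makes sense for ordered series.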
![Page 36: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/36.jpg)
Data Corruption
Undetected/Silent
Detected
![Page 37: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/37.jpg)
Out of Range error
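An out-of-range check can be as small as a table of allowed bounds per field. The field names and limits in `RANGES` below are invented for illustration; real limits would come from the business rules.

```python
# Assumed bounds for two hypothetical fields.
RANGES = {"age": (0, 120), "discount_pct": (0.0, 100.0)}

def out_of_range(record, ranges=RANGES):
    """Return the fields whose value lies outside the allowed bounds."""
    problems = {}
    for field, (low, high) in ranges.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            problems[field] = value
    return problems

range_problems = out_of_range({"age": 130, "discount_pct": 15.0})
```

Missing values are deliberately skipped here so that they can be reported by a separate completeness check.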
![Page 38: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/38.jpg)
Techniques of Data Quality Control
Use the specific business rules of the various data sources
Enable data integrity constraints in data staging
Provide internal profiling, or integration with third-party data profiling and cleansing tools
Automatically generate rules for ETL tools to build mappings
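As one illustration of these techniques, the sketch below keeps integrity constraints switched on in hypothetical staging tables, using Python's `sqlite3` module. The table names, columns, and allowed gender codes are assumptions. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default
conn.executescript("""
CREATE TABLE stg_customer (
    customer_id INTEGER PRIMARY KEY,
    gender      TEXT NOT NULL CHECK (gender IN ('M', 'F', 'U'))
);
CREATE TABLE stg_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES stg_customer(customer_id),
    amount      REAL NOT NULL CHECK (amount >= 0)
);
""")

def try_insert(sql, params):
    """Insert one row; return False if a constraint rejects it."""
    try:
        with conn:
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    try_insert("INSERT INTO stg_customer VALUES (?, ?)", (1, "M")),        # valid
    try_insert("INSERT INTO stg_customer VALUES (?, ?)", (2, "unknown")),  # bad code
    try_insert("INSERT INTO stg_order VALUES (?, ?, ?)", (10, 1, 25.0)),   # valid
    try_insert("INSERT INTO stg_order VALUES (?, ?, ?)", (11, 99, 25.0)),  # no such customer
    try_insert("INSERT INTO stg_order VALUES (?, ?, ?)", (12, 1, -5.0)),   # negative amount
]
```

With the constraints enabled, wrong values and broken relationships are stopped in staging instead of being extracted into the warehouse.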
![Page 39: Data Quality: A Raising Data Warehousing Concern](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55c58880bb61ebea168b46c7/html5/thumbnails/39.jpg)
Data warehousing security
Appropriate to summaries and aggregates of data
Exploration data warehouse
Data encryption and enhancing privacy.
For more information visit http://aminchowdhury.info