76726692-etl-testing

Upload: impvibhas

Post on 04-Apr-2018


  • 7/29/2019 76726692-ETL-Testing

    1/12

    www.fullinterview.com

    ETL Testing


    Data warehousing and its Concepts:

    What is Data warehouse?

    A data warehouse is a centrally managed, integrated database containing data from the operational sources in an organization (such as SAP, CRM, and ERP systems). It may also gather manual inputs from users that determine the criteria and parameters for grouping or classifying records.

    The data warehouse database contains structured data for query and analysis and can be accessed by users. The data warehouse can be created or updated at any time with minimal disruption to operational systems. This is ensured by the strategy implemented in the ETL process.

    The source for the data warehouse is a data extract from the operational databases.

    The data is validated, cleansed, transformed, and finally aggregated, at which point it is ready to be loaded into the data warehouse.

    A data warehouse is a dedicated database that contains detailed, stable, non-volatile, and consistent data that can be analyzed over time (it is time-variant). Where only a portion of the detailed data is required, it may be worth considering a data mart.

    A data mart is generated from the data warehouse and contains data focused on a given subject, along with data that is frequently accessed or summarized.


    Data warehouse Architecture:


    Data warehouse Architecture (Contd):


    Advantages of Data warehouse:

    A data warehouse provides a common data model for all data of interest, regardless of the data's source. This makes it easier to report and analyze information than it would be if multiple data models were used to retrieve information such as sales invoices, order receipts, general ledger charges, etc.

    Inconsistencies are identified and resolved prior to loading of data into the data warehouse. This greatly simplifies reporting and analysis.

    Information in the data warehouse is under the control of data warehouse users so that, even if the source system data is purged over time, the information in the warehouse can be stored safely for extended periods of time.

    Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.

    Data warehouses enhance the value of operational business applications, notably customer relationship management (CRM) systems.

    Data warehouses facilitate decision support system applications such as trend reports (e.g., the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.

    Disadvantages of Data Warehouse:

    Data warehouses are not the optimal environment for unstructured data.

    Because data must be extracted, transformed, and loaded into the warehouse, there is an element of latency in data warehouse data.

    Over their life, data warehouses can have high costs. Maintenance costs are high.

    Data warehouses can get outdated relatively quickly. There is a cost of delivering suboptimal information to the organization.

    There is often a fine line between data warehouses and operational systems. Duplicate, expensive functionality may be developed. Or, functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems, and vice versa.


    ETL Concept:

    ETL is the automated and auditable process of acquiring data from source systems. It involves one or more sub-processes: data extraction, data transportation, data transformation, data consolidation, data integration, data loading, and data cleaning.

    E - Extracting data from source operational or archive systems, which are the primary source of data for the data warehouse.
    T - Transforming the data, which may involve cleaning, filtering, validating, and applying business rules.
    L - Loading the data into the data warehouse or any other database or application that houses the data.


    ETL Process:

    ETL Process involves the Extraction, Transformation and Loading Process.

    Extraction:

    The first part of an ETL process involves extracting the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data format. Common data source formats are relational databases and flat files, but they may include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even data fetched from outside sources, for example through web spidering or screen-scraping. Extraction converts the data into a format for transformation processing.

    An intrinsic part of the extraction involves parsing the extracted data to check whether it meets an expected pattern or structure. If not, the data may be rejected entirely or in part.
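
    The parse-and-reject step described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the field names (customer_id, order_date, amount), the date pattern, and the function name parse_extract are all assumptions made for the example.

```python
import csv
import io
import re

# Hypothetical expected layout of one extracted record (not from the source text).
EXPECTED_FIELDS = ["customer_id", "order_date", "amount"]
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def parse_extract(text):
    """Split raw extract text into accepted rows and rejected rows with a reason."""
    accepted, rejected = [], []
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_FIELDS:
        # Structure mismatch: reject the whole extract.
        return [], [("<file>", "unexpected header: %r" % (reader.fieldnames,))]
    for line_no, row in enumerate(reader, start=2):
        if not row["customer_id"] or not row["customer_id"].isdigit():
            rejected.append((line_no, "customer_id is not numeric"))
        elif not DATE_PATTERN.match(row["order_date"] or ""):
            rejected.append((line_no, "order_date is not YYYY-MM-DD"))
        else:
            try:
                float(row["amount"])
            except (TypeError, ValueError):
                rejected.append((line_no, "amount is not a number"))
            else:
                accepted.append(row)
    return accepted, rejected
```

    A wrong header rejects the extract entirely, while a malformed row rejects only that row, which mirrors the "entirely or in part" distinction in the text.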

    Transformation:

    Transformation is the series of tasks that prepares the data for loading into the warehouse. Once the data is secured, you have to worry about its format or structure, because it will not be in the format needed for the target. For example, the grain (level of detail) or the data types might be different. The data cannot be used as it is; rules and functions need to be applied to transform it.
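
    As a small illustration of the two differences mentioned above (data type and grain), the sketch below converts string fields to typed values and rolls order-level rows up to one row per customer per day. The field names and the daily grain are assumptions chosen for the example only.

```python
from collections import defaultdict
from datetime import datetime
from decimal import Decimal


def transform(rows):
    """Convert types and roll order-level rows up to one row per customer per day."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in rows:
        # Data type change: text from the extract becomes int, date, and Decimal.
        customer_id = int(row["customer_id"])
        order_day = datetime.strptime(row["order_ts"], "%Y-%m-%d %H:%M:%S").date()
        # Grain change: individual orders are summed per (customer, day).
        totals[(customer_id, order_day)] += Decimal(row["amount"])
    return [
        {"customer_id": cid, "order_date": day, "total_amount": total}
        for (cid, day), total in sorted(totals.items())
    ]
```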


    One of the purposes of ETL is to consolidate the data in a central repository, or to bring it to one logical or physical place. Data can be consolidated from similar systems, different subject areas, etc.

    ETL must support data integration for data coming from multiple sources and data arriving at different times. This has to be a seamless operation. It avoids overwriting existing data, creating duplicate data, or, even worse, being unable to load the data into the target.

    Loading:

    The loading process is critical to integration and consolidation. It decides how the data is added to the warehouse, or whether it is simply rejected. Operations such as adding, updating, or deleting are executed at this step. What happens to the existing data? Should the old data be deleted because of new information? Should the data be archived? Or should the data be treated as additional to the existing data?

    Data therefore has to be loaded into the data warehouse with utmost care, and only a data auditing process can establish the confidence level. This auditing process normally happens after the data is loaded.
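
    The questions above (overwrite, keep, or reject?) amount to choosing a load mode. The following sketch models the target as an in-memory dictionary and returns a small audit summary that a post-load audit could check. The mode names "upsert" and "insert_only", the key name, and the function name load are illustrative assumptions, not terms from any particular ETL tool.

```python
def load(target, incoming, key="customer_id", mode="upsert"):
    """Apply incoming rows to target (a dict keyed by the business key).

    mode="insert_only": keep existing rows, reject rows whose key already exists.
    mode="upsert":      insert new rows and overwrite existing ones.
    Returns an audit summary of what happened to the incoming rows.
    """
    audit = {"inserted": 0, "updated": 0, "rejected": 0}
    for row in incoming:
        k = row.get(key)
        if k is None:
            audit["rejected"] += 1          # no key: cannot be placed in the target
        elif k not in target:
            target[k] = dict(row)
            audit["inserted"] += 1
        elif mode == "upsert":
            target[k] = dict(row)           # new information replaces the old row
            audit["updated"] += 1
        else:
            audit["rejected"] += 1          # insert_only: existing data wins
    return audit
```

    In a real warehouse the old row might be archived or versioned rather than overwritten; the audit counts are what let the inserted, updated, and rejected totals be reconciled against the number of incoming rows.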

    List of ETL tools:

    Below is the list of ETL Tools available in the market:

    ETL Tool                              Vendor
    Oracle Warehouse Builder (OWB)        Oracle
    Data Integrator & Data Services       SAP Business Objects
    IBM Information Server (DataStage)    IBM
    SAS Data Integration Studio           SAS Institute
    PowerCenter                           Informatica
    Elixir Repertoire                     Elixir
    Data Migrator                         Information Builders
    SQL Server Integration Services       Microsoft
    Talend Open Studio                    Talend
    DataFlow Manager                      Pitney Bowes Business Insight
    Data Integrator                       Pervasive
    Open Text Integration Center          Open Text
    Transformation Manager                ETL Solutions Ltd.
    Data Manager/Decision Stream          IBM (Cognos)
    Clover ETL                            Javlin
    ETL4ALL                               IKAN
    DB2 Warehouse Edition                 IBM
    Pentaho Data Integration              Pentaho
    Adeptia Integration Server            Adeptia


    ETL Testing:

    Following are some common goals for testing an ETL application:

    Data completeness - To ensure that all expected data is loaded.

    Data quality - To ensure that the ETL application correctly rejects, substitutes default values for, corrects, and reports invalid data.

    Data transformation - To ensure that all data is correctly transformed according to business rules and design specifications.

    Performance and scalability - To ensure that the data loads and queries perform within expected time frames and that the technical architecture is scalable.

    Integration testing - To ensure that the ETL process functions well with other upstream and downstream applications.

    User-acceptance testing - To ensure that the solution fulfills the users' current expectations and also anticipates their future expectations.

    Regression testing - To keep the existing functionality intact each time a new release of code is completed.

    Data warehouse testing is basically divided into two categories: back-end testing and front-end testing. Back-end testing, which is the ETL testing, compares the source system data with the end-result data in the loaded area. Front-end testing is where the user checks the data by comparing their MIS with the data displayed by the end-user tools.

    Data Validation:

    Data completeness is one of the basic checks in data validation. It verifies that all expected data loads into the data warehouse. This includes validating all records and fields and ensuring that the full contents of each field are loaded.
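
    A completeness check of this kind can be sketched as a comparison at three levels: row counts, record keys, and field contents. The example assumes a straight copy with no transformation between source and target, and the function and key names are hypothetical.

```python
def completeness_report(source_rows, target_rows, key):
    """Compare source and target by row count, by key, and field by field."""
    source = {row[key]: row for row in source_rows}
    target = {row[key]: row for row in target_rows}
    mismatched = {}
    for k in source.keys() & target.keys():
        # A field counts as mismatched if the target value differs or is absent.
        diffs = [f for f in source[k] if source[k][f] != target[k].get(f)]
        if diffs:
            mismatched[k] = sorted(diffs)
    return {
        "source_count": len(source_rows),
        "target_count": len(target_rows),
        "missing_in_target": sorted(source.keys() - target.keys()),
        "unexpected_in_target": sorted(target.keys() - source.keys()),
        "field_mismatches": mismatched,
    }
```

    Matching counts alone are not sufficient: in the usage below both sides hold three rows, yet one record is missing, one is unexpected, and one field is truncated.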

    Data Transformation:

    Validating that the data is transformed correctly based on business rules can be one of the most complex parts of testing an ETL application with significant transformation logic. Another way of testing is to pick some sample records and compare them manually to validate the data transformation, but this method requires manual testing steps and testers who have a good amount of experience with, and understanding of, the ETL logic.
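
    The sample-record comparison described above can be partly automated by re-implementing the business rule independently of the ETL code and comparing its output with what was loaded. In this sketch the rule ("LAST, First" built from two source fields), the field names, and both function names are hypothetical stand-ins for a real specification.

```python
import random


def expected_full_name(src):
    # Hypothetical business rule: "LAST, First" built from two source fields.
    return "%s, %s" % (src["last_name"].strip().upper(), src["first_name"].strip().title())


def check_sample(source_by_id, target_by_id, sample_size, seed=0):
    """Recompute the rule for a random sample and list ids where the target disagrees."""
    rng = random.Random(seed)  # fixed seed so a failing sample can be reproduced
    ids = sorted(source_by_id)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    failures = []
    for record_id in sample:
        expected = expected_full_name(source_by_id[record_id])
        actual = target_by_id.get(record_id, {}).get("full_name")
        if actual != expected:
            failures.append((record_id, expected, actual))
    return failures
```

    The check is only as good as the tester's reading of the rule, which is why the text stresses experience with the ETL logic.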


    Data Warehouse Testing Life Cycle:

    Like any other piece of software, a DW implementation undergoes the natural cycle of unit testing, system testing, regression testing, integration testing, and acceptance testing.

    Unit testing: Traditionally this has been the task of the developer. This is white-box testing to ensure that the module or component is coded as per the agreed-upon design specifications. The developer should focus on the following:

    a) That all inbound and outbound directory structures are created properly, with appropriate permissions and sufficient disk space. All tables used during the ETL are present with the necessary privileges.

    b) The ETL routines give expected results:
        i. All transformation logic works as designed from source to target
        ii. Boundary conditions are satisfied, e.g. check date fields with leap-year dates
        iii. Surrogate keys have been generated properly
        iv. NULL values have been populated where expected
        v. Rejects have occurred where expected, and a log for rejects is created with sufficient details
        vi. Error recovery methods
        vii. Auditing is done properly

    c) That the data loaded into the target is complete:
        i. All source data that is expected to be loaded into the target actually gets loaded: compare counts between source and target and use data profiling tools
        ii. All fields are loaded with full contents, i.e. no data field is truncated while transforming
        iii. No duplicates are loaded
        iv. Aggregations take place in the target properly
        v. Data integrity constraints are properly taken care of
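
    A few of the unit-level checks listed above (duplicates, truncated fields, leap-year boundary dates) can be expressed as small helper functions. This is a sketch under the assumption that source and target rows are available as lists of dictionaries; all function names are illustrative.

```python
import calendar
from collections import Counter
from datetime import date


def find_duplicate_keys(rows, key):
    """Return key values that occur more than once in the loaded rows."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def find_truncated(source_rows, target_rows, key, field):
    """Return keys whose target text field is shorter than the source value."""
    target = {row[key]: row for row in target_rows}
    return sorted(
        row[key]
        for row in source_rows
        if row[key] in target and len(target[row[key]][field]) < len(row[field])
    )


def leap_day_cases(years):
    """Boundary dates for date-handling tests: 29 Feb in leap years, 28 Feb otherwise."""
    return [date(y, 2, 29 if calendar.isleap(y) else 28) for y in years]
```

    Feeding years such as 1900 and 2000 into leap_day_cases exercises the century rule, which is a common source of date-handling defects.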

    System testing: Generally the QA team owns this responsibility. For them the design document is the bible, and the entire set of test cases is based directly upon it. Here we test the functionality of the application, and the testing is mostly black-box. The major challenge is the preparation of test data. An intelligently designed input dataset can bring out the flaws in the application more quickly. Wherever possible, use production-like data. You may also use data generation tools or customized tools of your own to create test data. We must test all possible combinations of input and specifically check the errors and exceptions. An unbiased approach is required to ensure maximum efficiency. Knowledge of the business process is an added advantage, since we must be able to interpret the results functionally and not just code-wise.


    The QA team must test for:

    i. Data completeness: match source to target counts.
    ii. Data aggregations: match aggregated data against staging tables.
    iii. Granularity of data is as per specifications.
    iv. Error logs and audit tables are generated and populated properly.
    v. Notifications to IT and/or business are generated in the proper format.
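
    The aggregation check in item ii can be sketched by recomputing the totals from the staging rows and comparing them with the totals stored in the warehouse. The function name, the grouping field, and the use of a plain dictionary for the warehouse totals are assumptions made for the example.

```python
from collections import defaultdict
from decimal import Decimal


def aggregate_mismatches(staging_rows, warehouse_totals, group_field, amount_field):
    """Recompute sums per group from staging and report groups that disagree.

    Returns {group: (expected_from_staging, actual_in_warehouse)}; a None on
    either side means the group exists only in the other data set.
    """
    expected = defaultdict(Decimal)
    for row in staging_rows:
        # Decimal(str(...)) avoids binary floating-point rounding in the sums.
        expected[row[group_field]] += Decimal(str(row[amount_field]))
    mismatches = {}
    for group in expected.keys() | warehouse_totals.keys():
        exp = expected.get(group)
        act = warehouse_totals.get(group)
        if exp is None or act is None or exp != Decimal(str(act)):
            mismatches[group] = (exp, act)
    return mismatches
```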

    Regression testing: A DW application is not a one-time solution. It is possibly the best example of an incremental design, where requirements are enhanced and refined quite often based on business needs and feedback. In such a situation it is critical to test that the existing functionality of a DW application is not broken whenever an enhancement is made to it. Generally this is done by running all functional tests for existing code whenever a new piece of code is introduced. However, a better strategy could be to preserve earlier test input data and result sets and run the same inputs again. The new results can then be compared against the older ones to ensure proper functionality.
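
    The strategy of preserving earlier result sets and comparing new results against them can be sketched as below. The baseline is serialized to a JSON string, and where that string is stored (a file, a table) is left to the caller. The function names and the assumption that rows are JSON-serializable dictionaries with a unique key are illustrative.

```python
import json


def dump_baseline(result_rows):
    """Serialize a result set so it can be preserved between releases."""
    return json.dumps(result_rows, sort_keys=True)


def compare_with_baseline(baseline_text, new_rows, key):
    """Report rows added, removed, or changed relative to the preserved baseline."""
    old = {row[key]: row for row in json.loads(baseline_text)}
    new = {row[key]: row for row in new_rows}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

    An empty report for the same test input after a code change is the signal that existing functionality is intact; any entry needs to be explained by an intended enhancement.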

    Integration testing: This is done to ensure that the application works from an end-to-end perspective. Here we must consider the compatibility of the DW application with upstream and downstream flows. We need to ensure data integrity across the flow. Our test strategy should include testing for:

    i. Sequence of jobs to be executed, with job dependencies and scheduling
    ii. Re-startability of jobs in case of failures
    iii. Generation of error logs
    iv. Cleanup scripts for the environment, including the database

    This activity is a combined responsibility, and participation of experts from all related applications is a must in order to avoid misinterpretation of results.
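
    Job sequencing and re-startability, the first two items in the list above, can be exercised with a small harness like the following. It is a sketch, not a scheduler: it relies on graphlib from the Python standard library (available from Python 3.9), and the job names, the run_jobs function, and the printed message standing in for an error log are all assumptions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_jobs(dependencies, actions, completed=None):
    """Run jobs in dependency order, skipping ones already completed.

    dependencies: job name -> set of jobs that must finish first
    actions:      job name -> zero-argument callable
    completed:    set of job names finished in an earlier (failed) run
    Returns (completed_set, failed_job_or_None).
    """
    completed = set(completed or ())
    for job in TopologicalSorter(dependencies).static_order():
        if job in completed:
            continue                      # restart: do not redo finished work
        try:
            actions[job]()
        except Exception as exc:
            print("job %s failed: %s" % (job, exc))   # stand-in for a real error log
            return completed, job         # stop so dependants never run on bad data
        completed.add(job)
    return completed, None
```

    Passing the completed set from a failed run back into run_jobs resumes at the failed job, which is the behaviour a re-startability test would verify.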

    Acceptance testing: This is the most critical part, because here the actual users validate your output datasets. They are the best judges to ensure that the application works as expected by them. However, business users may not have proper ETL knowledge. Hence, the development and test team should be ready to provide answers regarding the ETL process that relate to data population. The test team must have sufficient business knowledge to translate the results in terms of business. Also, the load windows, the refresh period for the DW, and the views created should be signed off by users.

    Performance testing: In addition to the above tests, a DW must necessarily go through another phase called performance testing. Any DW application is designed to be scalable and robust. Therefore, when it goes into the production environment, it should not cause performance problems. Here, we must test the system with a huge volume of data. We must ensure that the load window is met even under such volumes. This phase should involve the DBA team, ETL experts, and others who can review and validate your code for optimization.
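
    Before a full-volume run, a rough planning check is to project the load time from a timed sample run and compare it with the load window. The sketch below assumes throughput scales linearly with row count, which real loads often violate, so it complements rather than replaces the full-volume test described above. The function names and the safety factor are hypothetical.

```python
from datetime import timedelta


def projected_load_time(sample_rows, sample_seconds, production_rows):
    """Linearly extrapolate load time from a timed sample run (a simplifying assumption)."""
    if sample_rows <= 0 or sample_seconds <= 0:
        raise ValueError("sample run must load at least one row in positive time")
    rows_per_second = sample_rows / sample_seconds
    return timedelta(seconds=production_rows / rows_per_second)


def fits_load_window(projected, window, safety_factor=1.5):
    """True if the projected time, padded by a safety factor, fits in the window."""
    return projected * safety_factor <= window
```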

    Summary:

    Testing a DW application should be done with a sense of utmost responsibility. A bug in a DW that is traced at a later stage results in unpredictable losses. The task is even more difficult in the absence of any single end-to-end testing tool. So the strategies for testing should be methodically developed, refined, and streamlined. This is all the more true because the requirements of a DW often change dynamically. Under such circumstances, repeated discussions with the development team and users are of utmost importance to the test team. Another area of concern is test coverage. This has to be reviewed multiple times to ensure completeness of testing. Always remember, a DW tester must go the extra mile to ensure near defect-free solutions.
