Become a Big Data Quality Hero

Concurrent Session T8, 4/8/2014, 12:45 PM: “Become a Big Data Quality Hero”. Presented by: Jason Rauen, LexisNexis. Brought to you by: 340 Corporate Way, Suite 300, Orange Park, FL 32073 ∙ 888-268-8770 ∙ 904-278-0524 ∙ [email protected] ∙ www.sqe.com

Upload: techwellpresentations

Post on 03-Jun-2015

DESCRIPTION

Many believe that regression testing an application with minimal data is sufficient. However, the data testing methodology becomes far more complex with big data applications. Testing can now be done within the data fabrication process as well as in the data delivery process. Today, comprehensive testing is often mandated by regulatory agencies—and more importantly by customers. Finding issues before deployment and saving your company’s reputation—and in some cases preventing litigation—is critical. Jason Rauen presents an overview of the architecture, processes, techniques, and lessons learned by an original big data company. Detecting defects up-front is vital. Learn how to test thousands, millions, and in some cases billions—yes, billions—of records directly, rendering sampling procedures obsolete. See how you can save your organization time and money—and have better data test coverage than ever before.

TRANSCRIPT

Page 1: Become a Big Data Quality Hero

 

 

 


Page 2: Become a Big Data Quality Hero

Jason Rauen LexisNexis  

Jason Rauen is a senior quality test analyst at Georgia-based LexisNexis Risk Solutions. With more than fifteen years of experience, Jason has led the data testing team in big data since its inception. He has presented big data scripting techniques at the HPCC Systems national Data Summit. His background includes work at Microsoft, AT&T, and LexisNexis, and instructing at Intel, Boeing, Executrain, and the Department of the Navy. Jason has moved through various areas of technology, including technical sales, customer support, training, quality control/quality assurance, and management.

Page 3: Become a Big Data Quality Hero

2/4/2014

Interesting Quotes……

“Quality isn’t measured by how many clients you obtain; it’s measured by how many clients you retain.”

“QA isn’t the bottom of the totem pole; it’s the dirt holding it up.”

Become a Big Data Quality Hero
A look inside QA for Big Data

Presented by 01001010 01100001 01110011 01101111 01101110 00100000 01010010 01100001 01110101 01100101 01101110 (Jason Rauen)

Page 4: Become a Big Data Quality Hero

Overview

• Architecture and why you need to know
  – HPCC Systems/Hadoop
  – Know Your Data/Environment
• Why Test Big Data and How it’s Different
  – Issues
  – Benefits
• Strategies and Concepts
  – What to look for
  – Sample Gathering (AUB)
  – Stats
  – Profiling

Architecture and why you need to know

[Diagram: Data Warehouse Architecture. Labels: Source Files; Extract, Transform, Load; Staging (Data Cleansing); Data Warehouse]

Page 5: Become a Big Data Quality Hero

Architecture and why you need to know

Data Fabrication Engines
• HDFS Hadoop and HPCC THOR
• Made of several nodes
• Where the ETL happens
• Where the Keys are made

Data Delivery Engines
• HPCC ROXIE, HBASE, etc…
• Keys moved to and referenced here
• Queries reside

Page 6: Become a Big Data Quality Hero

Architecture and why you need to know

[Diagram labels: HDFS, Hadoop MapReduce, HBASE]

Page 7: Become a Big Data Quality Hero

Architecture and why you need to know

[Diagram labels: HDFS; Map, Shuffle, Reduce]
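The Map, Shuffle, Reduce stages named on this slide can be illustrated in miniature. What follows is a minimal single-process Python sketch, not Hadoop code: the record layout, the sample data, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `count_by_product`) are all hypothetical, chosen only to count records per product code.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit one (key, 1) pair per input record."""
    for record in records:
        yield record["productcode"], 1

def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in grouped.items()}

def count_by_product(records):
    """Run the three stages in order on an in-memory list of records."""
    return reduce_phase(shuffle_phase(map_phase(records)))

# Hypothetical sample records, for illustration only.
SAMPLE_RECORDS = [
    {"productcode": "R2D2C3PO"},
    {"productcode": "X1"},
    {"productcode": "R2D2C3PO"},
]
```

On a real cluster each stage runs in parallel across nodes; the sketch only shows how the work is divided between the stages.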

Architecture and why you need to know

[Diagram labels: HPCC Systems; DISTRIBUTE/PROJECT/TRANSFORM, Rollup]

Page 8: Become a Big Data Quality Hero

Why Test Big Data and How it’s Different

Why Test Big Data:
• Traditional methods not adequate
  – Traditional sampling needs improvement and is scenario based, not enough samples, human error, etc.
• Size of the data is huge, from different sources, and inconsistent
• Tied into current environment
• Government regulatory compliance
• Auditing requirements
• Company-wide initiatives
• The business makes crucial decisions based off of it

Why Test Big Data and How it’s Different

Want to keep your customers?

Page 9: Become a Big Data Quality Hero

Why Test Big Data and How it’s Different

• When?
  o Testing - SDLC
  o Routine Testing
  o Frequency - Yearly/Monthly/Weekly/Daily/Hourly/On Demand
• What? Types
  o Testing New Project – Source to Target (Transform)
  o Standard - Production Validation
  o Emergency releases
• How?
  o Using what you have available
  o Freebies – Profiling tools, etc…

Why Test Big Data and How it’s Different

Issues:
• Lack of control
  – Timing of builds
  – Samples and location of samples
• 3rd Party Apps
  – Lack of licenses, Costs, Training, and existing knowledge
• Extra hardware
• Upgrades

Page 10: Become a Big Data Quality Hero

Why Test Big Data and How it’s Different

Benefits:
• Cost savings
• Better Coverage
  – No Samples
  – Increased Sampling
  – Focused Samples
• Faster (Time is $)
• Quicker to diagnose issues
• Better Data Integrity
• Collaboration with other groups

Strategies and Concepts

• What to look for……
  – Brand New, Incomplete, or Missing Builds (Data Cops)
  – Data progression: Today/Yesterday, FatherKey/GrandfatherKey
  – Count of Deltas in release/deploy
  – Keys updated
  – Missing keys/New keys
  – Field Validations – mandatory fields blank, consistency, etc…
  – Key Layout issues
  – Corruption: unprintable or invalid characters
  – Duplicate records of new and existing records
  – Data Fabrication Engine to Data Delivery Engine deploys/sync
  – Queries with new data
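Several of the checks listed on this slide (mandatory fields left blank, unprintable characters, duplicate records) can be written as simple record-level tests. The following is a minimal Python sketch under stated assumptions, not the presenter's tooling: records are plain dictionaries, and every function name and field name here is hypothetical.

```python
from collections import Counter

def blank_mandatory_fields(record, mandatory):
    """Return the mandatory field names that are missing or blank."""
    blanks = []
    for field in mandatory:
        value = record.get(field)
        if value is None or not str(value).strip():
            blanks.append(field)
    return blanks

def has_unprintable(value):
    """True if the string holds a non-printable character (tabs and newlines count)."""
    return any(not ch.isprintable() for ch in value)

def duplicate_keys(records, key_field):
    """Return key values that occur more than once, in first-seen order."""
    counts = Counter(r.get(key_field) for r in records)
    return [key for key, n in counts.items() if n > 1]

def check_records(records, mandatory, key_field):
    """Collect per-record issues plus duplicate keys into one report."""
    issues = []
    for index, record in enumerate(records):
        for field in blank_mandatory_fields(record, mandatory):
            issues.append((index, field, "blank mandatory field"))
        for field, value in record.items():
            if isinstance(value, str) and has_unprintable(value):
                issues.append((index, field, "unprintable character"))
    return {"issues": issues, "duplicates": duplicate_keys(records, key_field)}
```

Because every record is inspected, the same functions apply unchanged whether the input is a sample or the full data set; at big data scale the equivalent logic would run inside the data fabrication engine rather than in a single Python process.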

Page 11: Become a Big Data Quality Hero

Strategies and Concepts

JOIN
• Sample gathering
• New Key for testing
• Deployment Validation - Data Fabrication
• Deployment Validation - Data Delivery

And get a free cookie…

Strategies and Concepts

AUB for JOIN
A = Left key (New)
B = Right key (Old)

Types of JOINS
• Inner Join
• Left Outer Join
• Right Outer Join
• Full Outer Join
• Minus or Left Only
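The join types on this slide map directly onto set operations over the keys of the new (A) and old (B) data. The sketch below is one way to picture that in Python, reading AUB as the union of A and B; it assumes key values are unique within each input, and the function name `compare_keys` and the record layout are hypothetical.

```python
def compare_keys(new_records, old_records, key_field):
    """Classify keys the way each join type on the slide would select them."""
    new_by_key = {r[key_field]: r for r in new_records}  # A = left key (new)
    old_by_key = {r[key_field]: r for r in old_records}  # B = right key (old)
    a, b = set(new_by_key), set(old_by_key)
    both = a & b
    return {
        "inner": sorted(both),            # keys present in A and B
        "left_outer": sorted(a),          # every key in A
        "right_outer": sorted(b),         # every key in B
        "full_outer": sorted(a | b),      # A union B
        "left_only": sorted(a - b),       # new keys (the "minus" case)
        "right_only": sorted(b - a),      # keys missing from the new build
        # Keys in both builds whose record content differs.
        "changed": sorted(k for k in both if new_by_key[k] != old_by_key[k]),
    }
```

The `left_only`, `right_only`, and `changed` lists correspond to the new keys, missing keys, and updated keys that a release comparison would report.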

Page 12: Become a Big Data Quality Hero

Strategies and Concepts

Statistics: What you try to remember with this swimming behind you.

Strategies and Concepts

Statistics:
• On data sets and keys
  - Gives you a high level look at the release
  - Ranges
  - You’ll start to notice a trend line
• On Releases
  - Done over time you’ll see the trend of new data sets and keys
  - Done over time you’ll see the trend of changed or modified data sets and keys

Page 13: Become a Big Data Quality Hero

Strategies and Concepts

[Chart: RELEASE NUMBERS for releases 1 to 15 (y-axis 0 to 400), with lines for AVERAGE 175.4, CEILING 210.6, and FLOOR 135.1]
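A chart like this suggests a simple automated check: flag any release whose count falls outside the floor and ceiling. The Python sketch below assumes the slide's FLOOR (135.1) and CEILING (210.6) are used as fixed thresholds; the slide does not say how those values were derived, and the function name and the release counts in the usage example are hypothetical.

```python
from statistics import mean

FLOOR = 135.1    # value shown on the slide
CEILING = 210.6  # value shown on the slide

def flag_releases(counts, floor=FLOOR, ceiling=CEILING):
    """Return (release_number, count) pairs that fall outside [floor, ceiling]."""
    return [
        (number, count)
        for number, count in enumerate(counts, start=1)
        if count < floor or count > ceiling
    ]

def release_average(counts):
    """Average count across releases, the trend line's centre."""
    return mean(counts)
```

For example, with hypothetical counts `[170, 180, 120, 350, 175]`, releases 3 and 4 would be flagged for a closer look.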

Strategies and Concepts

Data Profiling:

• Data Profiling Summary Report
• Data Profiling Field Detail Report
• Data Profiling Field Combination Report

http://www.hpccsystems.com/demos/data-profiling-demo
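To give a feel for what a field-level profile contains, here is a minimal Python sketch. It does not reproduce the HPCC Systems report formats; the chosen statistics (fill rate, distinct count, minimum and maximum length), the function names, and the dictionary record layout are all assumptions made for illustration.

```python
def profile_field(records, field):
    """Summarise one field: how often it is filled and what its values look like."""
    values = [r.get(field) for r in records]
    # Treat None and whitespace-only values as not filled.
    filled = [str(v) for v in values if v is not None and str(v).strip()]
    total = len(values)
    return {
        "field": field,
        "total": total,
        "filled": len(filled),
        "fill_rate": len(filled) / total if total else 0.0,
        "distinct": len(set(filled)),
        "min_length": min((len(v) for v in filled), default=0),
        "max_length": max((len(v) for v in filled), default=0),
    }

def profile_summary(records, fields):
    """Build one profile row per field, similar in spirit to a summary report."""
    return [profile_field(records, f) for f in fields]
```

A sudden drop in `fill_rate` or jump in `max_length` between two builds is the kind of signal that would prompt a look at the field-level detail.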

Page 14: Become a Big Data Quality Hero

Strategies and Concepts

Data Profiling Summary Report

Strategies and Concepts

Data Profiling Field Detail Report

Page 15: Become a Big Data Quality Hero

Strategies and Concepts

Data Profiling Field Combination Report

Strategies and Concepts

SQL: SELECT * FROM Products;
Pig: DUMP Products;
ECL: Products;

SQL: SELECT * FROM Products WHERE productcode = 'R2D2C3PO';
Pig: Products = FILTER Products BY productcode == 'R2D2C3PO'; DUMP Products;
ECL: Products(productcode = 'R2D2C3PO');

SQL: SELECT COUNT(*) FROM Products;
Pig: Products = GROUP Products ALL; Products = FOREACH Products GENERATE COUNT(Products); DUMP Products;
ECL: COUNT(Products);
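The SQL statements on this slide can be tried without a database server. The sketch below runs them against an in-memory SQLite database using Python's standard `sqlite3` module; the two-column table layout, the sample rows, and the function name `run_examples` are hypothetical, and the filter value is passed as a bound parameter rather than inlined.

```python
import sqlite3

def run_examples(rows):
    """Load rows into an in-memory Products table and run the slide's SQL queries."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical layout: the slide only names the productcode column.
    conn.execute("CREATE TABLE Products (productcode TEXT, name TEXT)")
    conn.executemany("INSERT INTO Products VALUES (?, ?)", rows)

    everything = conn.execute("SELECT * FROM Products").fetchall()
    filtered = conn.execute(
        "SELECT * FROM Products WHERE productcode = ?", ("R2D2C3PO",)
    ).fetchall()
    (count,) = conn.execute("SELECT COUNT(*) FROM Products").fetchone()

    conn.close()
    return everything, filtered, count
```

Calling `run_examples([("R2D2C3PO", "Widget A"), ("X1", "Widget B")])` returns all rows, the single matching row, and the row count, mirroring the three SQL examples.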

Page 16: Become a Big Data Quality Hero

Strategies and Concepts

SQL: SELECT * FROM Products ORDER BY productcode;
Pig: Products = ORDER Products BY productcode; DUMP Products;
ECL: SORT(Products, productcode);

SQL: SELECT * FROM Products FULL OUTER JOIN OtherProducts ON Products.col1 = OtherProducts.col1;
Pig: Products = JOIN Products BY col1 FULL OUTER, OtherProducts BY col1; DUMP Products;
ECL: JOIN(Products, OtherProducts, LEFT.col1 = RIGHT.col1, FULL OUTER);

Summary

Why Test Big Data and How it’s Different

Architecture and why you need to know

Strategies and Concepts

Page 17: Become a Big Data Quality Hero

Questions?

Contact / Useful links

www.linkedin/in/jasonrauen

• HPCC Systems/ECL Links:
  http://hpccsystems.com
  http://hpccsystems.com/demos
• Hadoop/Pig Latin Links:
  http://pig.apache.org
  http://hadoop.apache.org
• SQL Links:
  http://sql.org/
  http://msdn.microsoft.com/en-US/sqlserver/default.aspx