Become a Big Data Quality Hero
DESCRIPTION
Many believe that regression testing an application with minimal data is sufficient. However, the data testing methodology becomes far more complex with big data applications. Testing can now be done within the data fabrication process as well as in the data delivery process. Today, comprehensive testing is often mandated by regulatory agencies and, more importantly, by customers. Finding issues before deployment is critical: it protects your company’s reputation and, in some cases, prevents litigation. Jason Rauen presents an overview of the architecture, processes, techniques, and lessons learned by an original big data company. Detecting defects up-front is vital. Learn how to test thousands, millions, and in some cases billions (yes, billions) of records directly, rendering sampling procedures obsolete. See how you can save your organization time and money, and have better data test coverage than ever before.
TRANSCRIPT
T8 Concurrent Session
4/8/2014 12:45 PM
“Become a Big Data Quality Hero”
Presented by:
Jason Rauen
LexisNexis
Brought to you by:
340 Corporate Way, Suite 300, Orange Park, FL 32073
888‐268‐8770 ∙ 904‐278‐0524 ∙ [email protected] ∙ www.sqe.com
Jason Rauen LexisNexis
Jason Rauen is a senior quality test analyst at Georgia-based LexisNexis Risk Solutions. With more than fifteen years of experience, Jason has led the data testing team in big data from its inception. He has presented big data scripting techniques at the HPCC Systems national Data Summit. His background includes working at Microsoft, AT&T, and LexisNexis, and instructing at Intel, Boeing, Executrain, and the Department of the Navy. Jason has transitioned through various aspects of technology including technical sales, customer support, training, quality control/quality assurance, and into management.
2/4/2014
Interesting Quotes…
“Quality isn’t measured by how many clients you obtain; it’s measured by how many clients you retain.”
“QA isn’t the bottom of the totem pole; it’s the dirt holding it up.”
Become a Big Data Quality Hero
A look inside QA for Big Data
Presented by 01001010 01100001 01110011 01101111 01101110 00100000 01010010 01100001 01110101 01100101 01101110 (Jason Rauen)
Overview
• Architecture and why you need to know
  – HPCC Systems/Hadoop
  – Know Your Data/Environment
• Why Test Big Data and How it’s Different
  – Issues
  – Benefits
• Strategies and Concepts
  – What to look for
  – Sample Gathering (AUB)
  – Stats
  – Profiling
Architecture and why you need to know
Data Warehouse Architecture
[Diagram: Source Files, Staging (Data Cleansing), and Data Warehouse, connected by EXTRACT, TRANSFORM, and LOAD steps]
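The stages named in the diagram can be sketched as a small pipeline. This is a minimal illustration only: the sample data, the field names (`productcode`, `price`), and the cleansing rules are assumptions made for the example, not details from the talk.

```python
import csv
import io

# Hypothetical source file contents; a real pipeline would read many large files.
SOURCE = "productcode,price\nR2D2C3PO, 19.99 \n,5.00\nBB8,7.50\n"

def extract(text):
    """EXTRACT: read raw rows from a source file into staging."""
    return list(csv.DictReader(io.StringIO(text)))

def cleanse(rows):
    """Staging (data cleansing): trim whitespace and drop rows with no key."""
    cleaned = [{k: v.strip() for k, v in row.items()} for row in rows]
    return [row for row in cleaned if row["productcode"]]

def transform(rows):
    """TRANSFORM: convert fields to the types the warehouse expects."""
    return [{"productcode": r["productcode"], "price": float(r["price"])} for r in rows]

def load(rows, warehouse):
    """LOAD: write transformed rows into the warehouse, keyed by productcode."""
    for row in rows:
        warehouse[row["productcode"]] = row
    return warehouse

staged = extract(SOURCE)
warehouse = load(transform(cleanse(staged)), {})
print(len(staged), "rows extracted,", len(warehouse), "rows loaded")
```

Comparing the row counts between stages is one simple place where a tester can hook in: a row that disappears between extract and load should be explainable by a cleansing rule.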
Architecture and why you need to know
Data Fabrication Engines
• HDFS Hadoop and HPCC THOR
• Made of several nodes
• Where the ETL happens
• Where the Keys are made
Data Delivery Engines
• HPCC ROXIE, HBASE, etc…
• Keys moved to and referenced here
• Queries reside
Architecture and why you need to know
[Diagram: Hadoop components: HDFS, MapReduce, HBASE]
Architecture and why you need to know
[Diagram: HDFS → Map → Shuffle → Reduce]
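The Map, Shuffle, and Reduce stages can be imitated in a few lines of single-process Python. This is only a conceptual sketch: the record layout and the `state` field are assumptions for illustration, and a real Hadoop job spreads each phase across many nodes.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (key, 1) pair for every record."""
    for record in records:
        yield record["state"], 1

def shuffle_phase(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each group to one result per key."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input records.
records = [{"state": "FL"}, {"state": "GA"}, {"state": "FL"}]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)
```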
Architecture and why you need to know
[Diagram: HPCC Systems data flow with DISTRIBUTE/PROJECT/TRANSFORM and Rollup stages]
Why Test Big Data and How it’s Different
Why Test Big Data:
• Traditional methods not adequate – Traditional sampling needs improvement and is scenario based, not enough samples, human error, etc…
• Size of the data is huge, from different sources, and inconsistent
• Tied into current environment
• Government regulatory compliance
• Auditing requirements
• Company wide initiatives
• The business makes crucial decisions based on it
Why Test Big Data and How it’s Different
Want to keep your customers?
Why Test Big Data and How it’s Different
• When?
  o Testing ‐ SDLC
  o Routine Testing
  o Frequency ‐ Yearly/Monthly/Weekly/Daily/Hourly/On Demand
• What? Types of Testing
  o New Project – Source to Target (Transform)
  o Standard ‐ Production Validation
  o Emergency releases
• How?
  o Using what you have available
  o Freebies – Profiling tools, etc…
Why Test Big Data and How it’s Different
Issues:
• Lack of control
  o Timing of builds
  o Samples and location of samples
• 3rd Party Apps
  o Lack of licenses, Costs, Training, and existing knowledge
• Extra hardware
• Upgrades
Why Test Big Data and How it’s Different
Benefits:
• Cost savings
• Better Coverage
  o No Samples
  o Increased Sampling
  o Focused Samples
• Faster (Time is $)
• Quicker to diagnose issues
• Better Data Integrity
• Collaboration with other groups
Strategies and Concepts
• What to look for…
  o Brand New, Incomplete, or Missing Builds (Data Cops)
  o Data progression Today/Yesterday Father Key/Grandfather Key
  o Count of Deltas in release/deploy
  o Keys updated
  o Missing keys/New keys
  o Field Validations – mandatory fields blank, consistency, etc…
  o Key Layout issues
  o Corruption: unprintable or invalid characters
  o Duplicate records of new and existing records
  o Data Fabrication Engine to Data Delivery Engine deploys/sync
  o Queries with new data
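Three of the checks listed above (blank mandatory fields, unprintable characters, and duplicate records) can be run over every record rather than a sample. The sketch below is one possible shape for such a check; the function name `validate`, the record layout, and the sample data are all hypothetical.

```python
from collections import Counter

def validate(records, mandatory, key_field):
    """Return issue descriptions for blank mandatory fields,
    unprintable characters, and duplicate keys."""
    issues = []
    for i, rec in enumerate(records):
        # Field validation: mandatory fields must not be missing or blank.
        for field in mandatory:
            value = rec.get(field)
            if value is None or not str(value).strip():
                issues.append(f"record {i}: mandatory field '{field}' is blank")
        # Corruption: flag string values containing unprintable characters.
        for field, value in rec.items():
            if isinstance(value, str) and not value.isprintable():
                issues.append(f"record {i}: unprintable character in '{field}'")
    # Duplicates: the same key value appearing more than once.
    key_counts = Counter(rec.get(key_field) for rec in records)
    for key, n in key_counts.items():
        if n > 1:
            issues.append(f"key {key!r} appears {n} times")
    return issues

# Hypothetical records with one blank name, one corrupt name, one duplicate id.
records = [
    {"id": "1", "name": "Alice"},
    {"id": "2", "name": ""},
    {"id": "2", "name": "Bob\x00"},
]
issues = validate(records, mandatory=["id", "name"], key_field="id")
for issue in issues:
    print(issue)
```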
Strategies and Concepts
JOIN
• Sample gathering
• New Key for testing
• Deployment Validation ‐ Data Fabrication
• Deployment Validation ‐ Data Delivery
And get a free cookie…
Strategies and Concepts
AUB for JOIN
A = Left key (New)
B = Right key (Old)
Types of JOINS
• Inner Join
• Left Outer Join
• Right Outer Join
• Full Outer Join
• Minus or Left Only
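Comparing a new key (A) against the old key (B) can be illustrated with set operations on the key values. Reading AUB as A ∪ B is an assumption, since the slide does not spell it out, and the function name and sample values below are hypothetical. Real joins operate on whole records; this sketch tracks key values only.

```python
def compare_keys(new_key, old_key):
    """Classify key values the way the corresponding join types would."""
    a, b = set(new_key), set(old_key)
    return {
        "inner": a & b,       # present in both new and old
        "left_only": a - b,   # Minus / Left Only: records new in this build
        "right_only": b - a,  # records that disappeared from the new build
        "full_outer": a | b,  # A ∪ B: everything seen in either build
    }

# Hypothetical key values for the new and old builds.
result = compare_keys(new_key=[1, 2, 3, 4], old_key=[2, 3, 5])
for name, keys in result.items():
    print(name, sorted(keys))
```

The left-only and right-only sets are natural places to gather samples from, since they contain exactly the records that changed between builds.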
Strategies and Concepts
Statistics: What you try to remember with this swimming behind you.
Strategies and Concepts
Statistics:
• On data sets and keys
  ‐ Gives you a high level look at the release
  ‐ Ranges
  ‐ You’ll start to notice a trend line
• On Releases
  ‐ Done over time you’ll see the trend of new data sets and keys
  ‐ Done over time you’ll see the trend of changed or modified data sets and keys
Strategies and Concepts
[Chart: RELEASE NUMBERS for releases 1 to 15 (y-axis 0 to 400), with AVERAGE 175.4, CEILING 210.6, and FLOOR 135.1]
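A floor and ceiling around the average can flag a release whose numbers fall outside the usual range. The slide does not say how its floor and ceiling were derived, so the sketch below assumes a band of two population standard deviations around the mean. That rule, the function names, and the sample counts are all hypothetical.

```python
from statistics import mean, pstdev

def release_band(history, width=2.0):
    """Return (floor, average, ceiling) for past release counts.
    Assumption: the band is the mean plus or minus `width` standard deviations."""
    avg = mean(history)
    spread = width * pstdev(history)
    return avg - spread, avg, avg + spread

def within_band(history, new_count, width=2.0):
    """True when a new release count lies between the floor and the ceiling."""
    floor, _, ceiling = release_band(history, width)
    return floor <= new_count <= ceiling

# Hypothetical counts from earlier releases.
history = [170, 180, 175, 165, 185, 172, 178]
floor, avg, ceiling = release_band(history)
print(f"floor={floor:.1f} average={avg:.1f} ceiling={ceiling:.1f}")
for count in (176, 320):
    status = "ok" if within_band(history, count) else "investigate"
    print(count, status)
```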
Strategies and Concepts
Data Profiling:
• Data Profiling Summary Report
• Data Profiling Field Detail Report
• Data Profiling Field Combination Report
http://www.hpccsystems.com/demos/data-profiling-demo
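A field-level profile can be approximated with a few counts per field. The sketch below does not reproduce the HPCC Systems profiling reports; the metric names (`fill_rate`, `cardinality`, and so on) and the sample records are assumptions chosen to show the idea.

```python
from collections import Counter

def profile_field(records, field):
    """Summarize one field: how often it is filled, how many distinct
    values it has, its length range, and its most common values."""
    values = [rec.get(field) for rec in records]
    filled = [str(v) for v in values if v is not None and str(v).strip()]
    lengths = [len(v) for v in filled]
    return {
        "field": field,
        "fill_rate": len(filled) / len(values) if values else 0.0,
        "cardinality": len(set(filled)),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        "top_values": Counter(filled).most_common(3),
    }

# Hypothetical records; the last one has a blank state.
records = [{"state": "FL"}, {"state": "GA"}, {"state": "FL"}, {"state": ""}]
report = profile_field(records, "state")
print(report)
```

A sudden drop in fill rate or a jump in cardinality between two builds is the kind of signal a profile like this makes visible without inspecting individual records.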
Strategies and Concepts
Data Profiling Summary Report
Strategies and Concepts
Data Profiling Field Detail Report
Strategies and Concepts
Data Profiling Field Combination Report
Strategies and Concepts

SQL: SELECT * FROM Products;
Pig: DUMP Products;
ECL: Products;

SQL: SELECT * FROM Products WHERE productcode = 'R2D2C3PO';
Pig: Products = FILTER Products BY productcode == 'R2D2C3PO'; DUMP Products;
ECL: Products(productcode = 'R2D2C3PO');

SQL: SELECT COUNT(*) FROM Products;
Pig: Products = GROUP Products ALL; Products = FOREACH Products GENERATE COUNT(Products); DUMP Products;
ECL: COUNT(Products);
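The SQL forms of these three queries can be tried against an in-memory SQLite database from Python. The table layout and the rows are hypothetical, and SQLite stands in for whatever SQL engine is actually available.

```python
import sqlite3

# In-memory database with a hypothetical Products table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (productcode TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO Products VALUES (?, ?)",
    [("R2D2C3PO", "droid set"), ("BB8", "ball droid")],
)

# Select every record.
all_rows = conn.execute("SELECT * FROM Products").fetchall()

# Filter on a product code (parameterized rather than quoted inline).
filtered = conn.execute(
    "SELECT * FROM Products WHERE productcode = ?", ("R2D2C3PO",)
).fetchall()

# Count the records.
total = conn.execute("SELECT COUNT(*) FROM Products").fetchone()[0]

print(all_rows)
print(filtered)
print(total)
```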
Strategies and Concepts

SQL: SELECT * FROM Products ORDER BY productcode;
Pig: Products = ORDER Products BY productcode; DUMP Products;
ECL: SORT(Products, productcode);

SQL: SELECT * FROM Products FULL OUTER JOIN OtherProducts ON Products.col1 = OtherProducts.col1;
Pig: Products = JOIN Products BY col1 FULL OUTER, OtherProducts BY col1; DUMP Products;
ECL: JOIN(Products, OtherProducts, LEFT.col1 = RIGHT.col1, FULL OUTER);
Summary
Why Test Big Data and How it’s Different
Architecture and why you need to know
Strategies and Concepts
Questions?
Contact / Useful links
www.linkedin/in/jasonrauen
• HPCC Systems/ECL Links:
  http://hpccsystems.com
  http://hpccsystems.com/demos
• Hadoop/Pig Latin Links:
  http://pig.apache.org
  http://hadoop.apache.org
• SQL Links:
  http://sql.org/
  http://msdn.microsoft.com/en-US/sqlserver/default.aspx