Integrating Big Data Technology into Legacy Systems
Robert Cooley, Ph.D.
CodeFreeze 1/16/2014
AGENDA
Do you have "Big Data"?
Not all big data is useful data
Strengths & Weaknesses of data technologies
Integrating big data technologies into legacy systems
CONFIDENTIAL – COPYRIGHT © 2013 OPTIMINE SOFTWARE
BIG DATA?
Defined obsolescence!
A mere TB or two need not apply
From Wikipedia - “Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big Data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes in a single data set.”
CAN YOU BENEFIT FROM BIG DATA PARADIGMS AND TECHNOLOGY?
The Three Vs*
Volume – size of the data
Velocity – the speed of new incoming data
Variety – the variation of data formats and types
+Concurrency – amount of simultaneous processing needed
*“3D Data Management: Controlling Data Volume, Velocity, and Variety”, Doug Laney, 2/6/2001
EXAMPLE: OPTIMINE SOFTWARE
Optimization and measurement for digital advertising
Data comes in at an advertisement-day level or transaction level
Volume? Not really by today’s standards. The entire datacenter is under 20TB
Velocity? Not really. Data feeds come in once a day
Variety? Yes. Hundreds of different data file formats
Concurrency? Yes. Hundreds of simultaneous processing requests
JUST BECAUSE IT IS BIG DOESN’T MEAN IT’S USEFUL
The danger of the big data mindset is collecting and retaining data without a purpose or plan to utilize it
An advantage of legacy systems is that a history of analysis and already-collected data exists to help determine use cases
Be on the lookout for “Accidental Data” – data collected from various applications by default using whatever the default settings happen to be
EXAMPLES OF ACCIDENTAL DATA
Over 1M hits per day for a Web site
100% of traffic assigned to a single page
99.8% of age fields are populated for head of household
20% of population is listed as age 18
OptiMine example: 28% of conversions from search assigned to search keywords without a click or visit
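One simple way to catch accidental data like the age-18 case is to check how much of a field is taken up by a single value. The following is a minimal sketch, not OptiMine's tooling; the function names, the 15% threshold, and the sample ages are all hypothetical.

```python
from collections import Counter

def dominant_value_share(values):
    """Return (most_common_value, share_of_rows) for a column of values."""
    if not values:
        return None, 0.0
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)

def looks_accidental(values, threshold=0.15):
    """Flag a column when one value exceeds the (assumed) threshold share."""
    _, share = dominant_value_share(values)
    return share > threshold

# Hypothetical sample: 20 of 100 people listed as exactly 18, the rest spread out.
ages = [18] * 20 + list(range(19, 99))
```

A flag from a check like this is only a prompt for an analyst to look at how the field was collected; a dominant value can also be legitimate.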
BIG DATA WITHOUT ANALYSIS IS A BIG WASTE OF RESOURCES
Collecting data without also investing in an appropriately scaled analytics infrastructure results in a “Data Tomb”
This holds even if the Big Data technology streamlines data access
e.g. the NSA collection of CDRs (call detail records), or most organizations where IT builds the data infrastructure independently from the business
Make sure an analyst or data scientist has a chance to evaluate the data collection plan and fields
For OptiMine, the head analyst is also the head of development
Think about possible use cases, but if no one in the organization can come up with one, question the cost of collecting and storing it
CURRENT OPTIMINE ETL & STAGING
ETL
Phase 1 – Parse, Validate, & Simple Transforms
Phase 2 – Assign Clean Key
Issues
Transform (the "T" in ETL) in Phase 2 processing is a bottleneck
Insufficient meta-data makes QA difficult
Only stores latest version of data in database
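The two phases can be sketched roughly as follows. This is an illustration only, not OptiMine's implementation: the CSV input, the field names (`keyword`, `date`, `clicks`), and the sequential clean-key scheme are all assumptions.

```python
import csv
import io

def phase1(raw_text):
    """Phase 1: parse, validate, and apply simple transforms."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            clicks = int(row["clicks"])
        except (KeyError, TypeError, ValueError):
            continue  # validation: drop rows without a numeric click count
        if clicks < 0:
            continue  # validation: negative counts are not plausible
        rows.append({
            "keyword": row["keyword"].strip().lower(),  # simple transform
            "date": row["date"].strip(),
            "clicks": clicks,
        })
    return rows

def phase2(rows, key_table):
    """Phase 2: assign a clean key per keyword, creating keys as needed."""
    for row in rows:
        kw = row["keyword"]
        if kw not in key_table:
            key_table[kw] = len(key_table) + 1  # hypothetical key scheme
        row["clean_key"] = key_table[kw]
    return rows

# Hypothetical feed: two spellings of one keyword and one invalid row.
raw = "keyword,date,clicks\n Shoes ,2014-01-15,12\nshoes,2014-01-16,7\nhats,2014-01-16,n/a\n"
keys = {}
staged = phase2(phase1(raw), keys)
```

Here both valid rows end up with the same clean key, and the row with a non-numeric click count is dropped in Phase 1.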
RDBMS
Strengths
Mature technology
Variety of technologies available
MPP architectures (e.g. Teradata, BitYota)
Very efficient for set operations & relational algebra
Very efficient for updating data while maintaining data integrity
Weaknesses
Not great for procedural operations (e.g. iterators)
Full transaction locking overhead is not always needed
Inserts can be slow due to indexing
Fixed schema (“Schema on write”*)
*Amr Awadallah, founder Cloudera
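Both the set-operation strength and the fixed-schema weakness can be seen in a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for any RDBMS; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ad (ad_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE ad_day (
        ad_id INTEGER NOT NULL REFERENCES ad(ad_id),
        day   TEXT NOT NULL,
        spend REAL NOT NULL
    );
""")
conn.executemany("INSERT INTO ad VALUES (?, ?)", [(1, "banner"), (2, "search")])
conn.executemany(
    "INSERT INTO ad_day VALUES (?, ?, ?)",
    [(1, "2014-01-15", 10.0), (1, "2014-01-16", 5.0), (2, "2014-01-15", 3.0)],
)

# Set-oriented work is one declarative statement: join, then aggregate.
totals = conn.execute("""
    SELECT ad.name, SUM(ad_day.spend)
    FROM ad JOIN ad_day ON ad.ad_id = ad_day.ad_id
    GROUP BY ad.name ORDER BY ad.name
""").fetchall()

# Schema on write: a row that does not fit the declared schema
# (here, a missing spend value) is rejected at insert time.
try:
    conn.execute("INSERT INTO ad_day (ad_id, day) VALUES (1, '2014-01-17')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The rejection protects data integrity, but it also means a feed in an unexpected format cannot be loaded until the schema or the feed is changed.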
ETL TOOLS
Strengths
Built-in library of common transforms
Built-in library of data source connectors
Typically a drag-and-drop workflow
Weaknesses
Expensive, especially for scalable parallel processing
PROCEDURAL PROGRAMMING LANGUAGES
Strengths
Flexibility
Complex data structures
Iterators
Recursion
Weaknesses
More programming time required compared to higher-level tools (e.g. ETL)
DISTRIBUTED FILE SYSTEM (HADOOP)
Strengths
Flexibility
“Schema on Read”*
Full procedural programming power
Parallelism/Redundancy
Low cost
Data load speed
Weaknesses
Flexibility! “Hadoop makes the easy things hard, but the impossible things possible”
Often need to add additional tools (Hive, Pig, etc.)
Evolving technology - ecosystem is still in flux with new tools coming and going
No ability to update, only insert
Data read speed
*Amr Awadallah, founder Cloudera
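“Schema on Read” and the insert-only constraint can be illustrated without a Hadoop cluster. The sketch below uses a plain Python list as a stand-in for an append-only file store; the pipe-delimited layout and field names are hypothetical.

```python
# Stand-in for an append-only file store: raw lines are only ever appended.
raw_store = []

def append(line):
    """Store the raw line untouched; no schema is enforced on write."""
    raw_store.append(line)

def read_with_schema(fields, sep="|"):
    """Apply a schema at read time; lines that do not fit are skipped, not lost."""
    for line in raw_store:
        parts = line.split(sep)
        if len(parts) == len(fields):
            yield dict(zip(fields, parts))

append("shoes|2014-01-15|12")
append("free-form note that matches no schema")
append("shoes|2014-01-15|14")  # a correction arrives as a new line, not an update

records = list(read_with_schema(["keyword", "day", "clicks"]))

# With no updates, the reader decides which version wins: here, the latest.
latest = {}
for r in records:
    latest[(r["keyword"], r["day"])] = r
```

Loading is fast because nothing is checked on write, while every read pays the cost of parsing and version resolution, which matches the load-speed strength and read-speed weakness listed above.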
PICK THE PARADIGM FIRST, TOOL SECOND
OptiMine Technologies
RDBMS – SQL Server 2008
ETL – SSIS (SQL Server Integration Services)
Procedural Language – Java/Groovy
Distributed File System – Hadoop (MapReduce)
Issues
Transform of Phase 2 processing is a bottleneck - MapReduce
Insufficient meta-data makes QA difficult - RDBMS
Only stores latest version of data in database - HDFS
NEW OPTIMINE ETL & STAGING
HDFS stores all versions of the inbound data
MapReduce handles heavy lifting for assigning and updating Meta Data
Staging to Production queries are reduced to simple inner joins
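The flow above can be sketched in plain Python that imitates the map, shuffle, and reduce steps. This does not use the Hadoop API and is not OptiMine's code; the record layout, the `version` field, and the way clean keys are numbered are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(records):
    """Emit (key, value) pairs keyed by normalized keyword."""
    for rec in records:
        yield rec["keyword"].strip().lower(), rec

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Per key: assign a clean key, keep the latest version, record meta-data."""
    staged = []
    for clean_key, key in enumerate(sorted(groups), start=1):
        versions = sorted(groups[key], key=lambda r: r["version"])
        staged.append({
            "clean_key": clean_key,           # hypothetical numbering scheme
            "keyword": key,
            "clicks": versions[-1]["clicks"],  # latest version wins
            "versions_seen": len(versions),    # meta-data to help QA
        })
    return staged

def inner_join(staged, production, on="clean_key"):
    """Simple inner join of staging rows to production rows."""
    index = {row[on]: row for row in production}
    return [{**index[s[on]], **s} for s in staged if s[on] in index]

# All versions of the inbound data are kept; nothing is overwritten.
inbound = [
    {"keyword": "Shoes", "version": 1, "clicks": 12},
    {"keyword": "shoes ", "version": 2, "clicks": 14},
    {"keyword": "hats", "version": 1, "clicks": 3},
]
staged = reduce_phase(shuffle(map_phase(inbound)))
production = [{"clean_key": 2, "campaign": "winter"}]
joined = inner_join(staged, production)
```

Because the clean keys and meta-data are already assigned before the data reaches the database, the staging-to-production step only has to match rows on the key.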
SUMMARY
Your data doesn’t have to be “big” in order to get value out of “big data” technologies
Conversely, don’t fall into the trap of pursuing “all of the data” just because you have the technology to cheaply store and retrieve it
Figure out the right paradigm for the problem first, then select the appropriate technology