INTEGRATING BIG DATA TECHNOLOGY INTO LEGACY SYSTEMS

Robert Cooley, Ph.D. CodeFreeze 1/16/2014

2

AGENDA

Do you have "Big Data"?

Not all big data is useful data

Strengths & weaknesses of data technologies

Integrating big data technologies into legacy systems

CONFIDENTIAL – COPYRIGHT © 2013 OPTIMINE SOFTWARE

3

BIG DATA?

Defined obsolescence!

A mere TB or two need not apply

From Wikipedia - “Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big Data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes in a single data set.”

4

CAN YOU BENEFIT FROM BIG DATA PARADIGMS AND TECHNOLOGY?

The Three Vs*

Volume – size of the data

Velocity – the speed of new incoming data

Variety – the variation of data formats and types

+Concurrency – amount of simultaneous processing needed

*“3D Data Management: Controlling Data Volume, Velocity, and Variety,” Doug Laney, 2/6/2001

5

EXAMPLE: OPTIMINE SOFTWARE

Optimization and measurement for digital advertising

Data comes in at an advertisement-day level or transaction level

Volume? Not really by today’s standards. Entire datacenter is under 20 TB

Velocity? Not really. Data feeds come in once a day

Variety? Yes. Hundreds of different data file formats

Concurrency? Yes. Hundreds of simultaneous processing requests

6

JUST BECAUSE IT IS BIG DOESN’T MEAN IT’S USEFUL

The danger of the big data mindset is collecting and retaining data without a purpose or plan to utilize it

An advantage of legacy systems is there is a history of analysis and data already collected to help determine use cases

Be on the lookout for “Accidental Data” – data collected from various applications by default using whatever the default settings happen to be

7

EXAMPLES OF ACCIDENTAL DATA

Over 1M hits per day for a Web site

100% of traffic assigned to a single page

99.8% of age fields are populated for head of household

20% of population is listed as age 18

OptiMine example

28% of conversions from search assigned to search keywords without a click or visit
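Patterns like the ones above (one page receiving all traffic, one age value covering 20% of the population) can often be caught with a simple profiling pass. The following is a minimal sketch, not OptiMine's actual QA process: the function names, the row layout, and the 15% threshold are all assumptions made for illustration, and a realistic threshold would have to be chosen per field, since low-cardinality fields legitimately have dominant values.

```python
from collections import Counter


def dominant_value_share(values):
    """Return (most common value, its share) over non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return None, 0.0
    value, count = Counter(present).most_common(1)[0]
    return value, count / len(present)


def flag_suspicious_fields(rows, threshold=0.15):
    """Flag fields where a single value exceeds `threshold` of populated rows.

    `threshold` is a hypothetical default chosen for this example only.
    """
    fields = sorted({name for row in rows for name in row})
    flagged = {}
    for name in fields:
        value, share = dominant_value_share([row.get(name) for row in rows])
        if share > threshold:
            flagged[name] = (value, share)
    return flagged


if __name__ == "__main__":
    # Synthetic rows: 20% of ages are 18, and every hit lands on one page.
    rows = [{"age": 18, "page": "/index"} for _ in range(20)]
    rows += [{"age": 19 + i, "page": "/index"} for i in range(80)]
    for field, (value, share) in flag_suspicious_fields(rows).items():
        print(f"{field}: {value!r} covers {share:.0%} of populated rows")
```

A flagged field is only a prompt for an analyst to look closer; the check cannot tell a default setting from a genuine pattern on its own.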

8

BIG DATA WITHOUT ANALYSIS IS A BIG WASTE OF RESOURCES

Collecting data without also investing in an appropriately scaled analytics infrastructure results in a “Data Tomb”

This holds even if the Big Data technology streamlines data access

Examples: the NSA collection of CDRs (call detail records), and most organizations where IT builds the data infrastructure independently from the business

Make sure an analyst or data scientist has a chance to evaluate the data collection plan and fields

For OptiMine, the head analyst is also the head of development

Think about possible use cases, but if no one in the organization can come up with one, question the cost of collecting and storing it

9

CURRENT OPTIMINE ETL & STAGING

ETL

Phase 1 – Parse, Validate, & Simple Transforms

Phase 2 – Assign Clean Key

Issues

The transform (“T”) step in Phase 2 processing is a bottleneck

Insufficient meta-data makes QA difficult

Only stores latest version of data in database

10

RDBMS

Strengths

Mature technology

Variety of technologies available

MPP architectures (e.g. Teradata, BitYota)

Very efficient for set operations & relational algebra

Very efficient for updating data while maintaining data integrity

Weaknesses

Not great for procedural operations (e.g. iterators)

Full transaction locking overhead is not always needed

Inserts can be slow due to indexing

Fixed schema (“Schema on write”*)

*Amr Awadallah, founder Cloudera
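To make the set-operation versus iterator contrast concrete, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for an RDBMS. The `spend` table, its columns, and the sample rows are invented for this example. The aggregate is a single declarative query, while the order-dependent running total is written as a row-at-a-time loop in the host language.

```python
import sqlite3

# Hypothetical sample data: (ad_id, day, amount).
ROWS = [
    ("a1", "2014-01-14", 10.0),
    ("a1", "2014-01-15", 2.5),
    ("a2", "2014-01-14", 4.0),
]


def load(rows):
    """Create an in-memory database holding the sample table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE spend (ad_id TEXT, day TEXT, amount REAL)")
    conn.executemany("INSERT INTO spend VALUES (?, ?, ?)", rows)
    return conn


def total_by_ad(conn):
    """Set-based: one GROUP BY, the kind of work an RDBMS is built for."""
    return dict(conn.execute("SELECT ad_id, SUM(amount) FROM spend GROUP BY ad_id"))


def running_total(conn):
    """Procedural: carry state from row to row in day order."""
    totals = {}
    out = []
    query = "SELECT ad_id, day, amount FROM spend ORDER BY ad_id, day"
    for ad_id, day, amount in conn.execute(query):
        totals[ad_id] = totals.get(ad_id, 0.0) + amount
        out.append((ad_id, day, totals[ad_id]))
    return out


if __name__ == "__main__":
    conn = load(ROWS)
    print(total_by_ad(conn))
    print(running_total(conn))
```

Many engines can express a running total with window functions, so the loop here is only meant to illustrate the iterator style, not to claim the query is impossible in SQL.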

11

ETL TOOLS

Strengths

Built-in library of common transforms

Built-in library of data source connectors

Typically a drag-and-drop workflow

Weaknesses

Expensive, especially for scalable parallel processing

12

PROCEDURAL PROGRAMMING LANGUAGES

Strengths

Flexibility

Complex data structures

Iterators

Recursion

Weaknesses

More programming time required compared to higher level tools (e.g. ETL)

13

DISTRIBUTED FILE SYSTEM (HADOOP)

Strengths

Flexibility

“Schema on Read”*

Full procedural programming power

Parallelism/Redundancy

Low cost

Data load speed

Weaknesses

Flexibility! “Hadoop makes the easy things hard, but the impossible things possible”

Often need to add additional tools (Hive, Pig, etc.)

Evolving technology - ecosystem is still in flux with new tools coming and going

No ability to update, only insert

Data read speed

*Amr Awadallah, founder Cloudera
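“Schema on read” and insert-only storage can be illustrated without any Hadoop code. The sketch below is a toy in-memory stand-in, not the HDFS API: the `AppendOnlyStore` class, the comma-separated record layout, and the field names are all assumptions for this example. Raw lines are stored untouched, and each reader supplies its own schema when it reads.

```python
import csv
import io


class AppendOnlyStore:
    """Toy insert-only store: raw lines are appended and never updated."""

    def __init__(self):
        self._lines = []

    def append(self, raw_line):
        # No parsing or validation on write: loading is cheap.
        self._lines.append(raw_line)

    def read(self, schema):
        """Parse every stored line using `schema`, a list of (name, converter).

        The cost of interpreting the data is paid here, on every read.
        """
        for raw in self._lines:
            if not raw.strip():
                continue  # skip blank lines rather than fail on them
            fields = next(csv.reader(io.StringIO(raw)))
            yield {name: conv(value) for (name, conv), value in zip(schema, fields)}


if __name__ == "__main__":
    store = AppendOnlyStore()
    store.append("2014-01-15,ad-1,12.50")
    store.append("2014-01-16,ad-2,3.00")

    full = [("day", str), ("ad_id", str), ("spend", float)]
    for row in store.read(full):
        print(row)

    # A different consumer can read the same bytes with a narrower schema.
    print(list(store.read([("day", str)])))
```

The same trade-off listed on the slide shows up even in the toy: appends do almost no work, while every read repeats the parsing.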

14

PICK THE PARADIGM FIRST, TOOL SECOND

OptiMine Technologies

RDBMS – SQLServer 2008

ETL – SSIS (SQLServer Integration Services)

Procedural Language – Java/Groovy

Distributed File System – Hadoop (MapReduce)

Issues (with the technology chosen to address each)

Transform step of Phase 2 processing is a bottleneck - MapReduce

Insufficient meta-data makes QA difficult - RDBMS

Only stores latest version of data in database - HDFS

15

NEW OPTIMINE ETL & STAGING

HDFS stores all versions of the inbound data

MapReduce handles heavy lifting for assigning and updating Meta Data

Staging to Production queries are reduced to simple inner joins
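The division of labor on this slide can be sketched in plain Python. This is an illustration under stated assumptions, not OptiMine's pipeline: the record fields (`source`, `ad_id`, `day`, `version`), the `key_lookup` table that maps a source identifier to a clean key, and all function names are hypothetical. A map/shuffle/reduce pass keeps every version and attaches meta-data, so the final staging-to-production step is a plain inner join on the clean key.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit (natural key, record). The key fields are an assumption."""
    for rec in records:
        yield (rec["source"], rec["ad_id"], rec["day"]), rec


def shuffle(pairs):
    """Group mapped records by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, rec in pairs:
        groups[key].append(rec)
    return groups


def reduce_phase(groups, key_lookup):
    """Reduce: keep all versions, mark the latest, attach the clean key."""
    staged = []
    for key, versions in groups.items():
        latest = max(v["version"] for v in versions)
        clean_key = key_lookup.get(key[:2])  # None when no mapping exists
        for v in versions:
            staged.append({**v, "clean_key": clean_key,
                           "is_latest": v["version"] == latest})
    return staged


def inner_join(left, right, on="clean_key"):
    """Inner join: rows without a match on either side are dropped."""
    index = defaultdict(list)
    for row in right:
        index[row[on]].append(row)
    return [{**r, **l} for l in left for r in index.get(l[on], [])]


if __name__ == "__main__":
    records = [
        {"source": "feedA", "ad_id": "a1", "day": "2014-01-15", "version": 1, "spend": 10.0},
        {"source": "feedA", "ad_id": "a1", "day": "2014-01-15", "version": 2, "spend": 12.5},
        {"source": "feedB", "ad_id": "zz", "day": "2014-01-15", "version": 1, "spend": 3.0},
    ]
    key_lookup = {("feedA", "a1"): 101}
    production = [{"clean_key": 101, "campaign": "spring"}]

    staged = reduce_phase(shuffle(map_phase(records)), key_lookup)
    latest = [row for row in staged if row["is_latest"]]
    print(inner_join(latest, production))
```

Because the versioning and key assignment happen before the join, the join itself needs no conditional logic; unmatched staging rows simply fall out and can be inspected separately.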

16

SUMMARY

Your data doesn’t have to be “big” in order to get value out of “big data” technologies

Conversely, don’t fall into the trap of pursuing “all of the data” just because you have the technology to cheaply store and retrieve it

Figure out the right paradigm for the problem first, then select the appropriate technology
