Big Data (WK11)
Jason P. Albert, University of Pennsylvania ([email protected])
3/24/2013


Page 1

PERSPECTIVES

Page 2

What is Big Data?

"High volume, velocity and/or variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation." (Gartner)

1 Terabyte = 1024 Gigabytes

1 Petabyte = 1024 Terabytes

1 Exabyte = 1024 Petabytes

1 Zettabyte = 1024 Exabytes

7.9 ZB = 7.9 × 1,099,511,627,776 GB ≈ 8,686,141,859,430 GB
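For concreteness, the unit arithmetic above can be checked in a few lines of Python:

```python
# Binary-prefixed storage units from the list above (1024-based), in GB.
GB = 1
TB = 1024 * GB
PB = 1024 * TB
EB = 1024 * PB
ZB = 1024 * EB

print(f"1 ZB = {ZB:,} GB")             # 1 ZB = 1,099,511,627,776 GB
print(f"7.9 ZB = {7.9 * ZB:,.0f} GB")  # 7.9 ZB = 8,686,141,859,430 GB
```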


How do we handle Big Data?

"MAD" Information Management is the approach:

Must be Magnetic, attracting all data sources

Must be Agile, easily accommodating new data at a rapid pace

Must provide sophisticated statistical methods for its Deep data repository

Why is MAD a departure from traditional Data Warehouse?

Page 3

What is the Scope of the Solution?

An End to End Solution must be Considered:

Consume: Volume, Velocity, Variety

Store: Gigabytes, Terabytes, Petabytes

Process: Cluster, Classify, Predict

Present: Visualize, Interact, Evaluate


Perspectives on Big Data

Does it handle Big Data? Volume Velocity Variety

Is it considered MAD? Magnetic Agile Deep

Is it an End-to-End Solution? Consume Store Process Present

Page 4

Options to Consider

Two promising options with low market penetration (Gartner)

MapReduce and alternatives

In-memory Computing

MAP REDUCE

Page 5

Hadoop = MapReduce + HDFS

Open-source, batch-oriented, data-intensive, general-purpose framework for creating distributed applications that process big data, i.e. Volume, Velocity, Variety

Hadoop Distributed File System (HDFS)

Data distributed and replicated over multiple systems

Block oriented

MapReduce

Map function processes input records and emits intermediate key/value pairs

Reduce function merges the intermediate values associated with each key

Facilitates parallel processing of multiple terabytes of data on large clusters of commodity platforms

Scale out on commodity hardware: • Fully depreciated • Repurposed • Low cost


MapReduce Workflow

1. Input data is distributed
2. Map tasks each work on a split of the data:

   Map(key, value):
       for each word x in value:
           output.collect(x, 1)

3. Mappers output intermediate data
4. Data is exchanged between nodes
5. Intermediate data with the same key goes to the same reducer:

   Reduce(keyword, listOfValues):
       for each x in listOfValues:
           sum += x
       output.collect(keyword, sum)

6. Reducer output is stored

$ hadoop jar wordcount.jar WordCount /usr/input /usr/output

1. Jack be nimble, Jack be quick, Jack jump over the candlestick.

2. (0, "Jack be nimble,") (15, "Jack be quick,") (28, "Jack jump over the candlestick.")

3. ("Jack", 1), ("be", 1), ("nimble,", 1), ("Jack", 1), ("be", 1), ("quick,", 1), ("Jack", 1), ("jump", 1), ("over", 1), ("the", 1), ("candlestick.", 1)

4. …

5. ("Jack", (1, 1, 1)), ("be", (1, 1)), ("nimble,", (1)), ("quick,", (1)), ("jump", (1)), ("over", (1)), ("the", (1)), ("candlestick.", (1))

6. ("Jack", 3), ("be", 2), ("nimble,", 1), ("quick,", 1), ("jump", 1), ("over", 1), ("the", 1), ("candlestick.", 1)
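The map/shuffle/reduce phases traced above can be simulated locally; this is a Python sketch of the same word count, not the Hadoop API (as on the slides, words keep their punctuation):

```python
from collections import defaultdict

def map_phase(splits):
    """Map step: emit (word, 1) for every word in each (offset, line) record."""
    for offset, line in splits:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle step: group intermediate values by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: sum the list of counts for each word."""
    return {word: sum(values) for word, values in groups.items()}

splits = [(0, "Jack be nimble,"), (15, "Jack be quick,"),
          (28, "Jack jump over the candlestick.")]
counts = reduce_phase(shuffle(map_phase(splits)))
print(counts["Jack"])  # 3
```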

Page 6

Scale-Out: MapReduce + HDFS


Case Study: Recommendations

1) 9 TB of W3C Extended Log File Format data

2) MapReduce program: sessionExtractor

Session            Person     Person
SDF92MGSLOK4M23K   B041Q3EV   N23KFMWE
ASD90K23MOLFWQIE   EM9IU67Y

Example: LinkedIn "People You May Know" application
• Behavior analytics • Risk & fraud analysis • Social network "connectedness"
• Text analysis • Regressions (financial)
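The slides only name the MapReduce program "sessionExtractor"; a minimal sketch of what its map-side logic might look like, assuming each non-directive log line carries a session ID followed by a person ID (the field positions are illustrative, not the actual layout of the 9 TB dataset):

```python
from collections import defaultdict

def extract_sessions(log_lines):
    """Group person IDs by session ID from simplified W3C-style log lines."""
    sessions = defaultdict(set)          # session ID -> set of person IDs
    for line in log_lines:
        if line.startswith("#"):         # W3C extended logs begin with directives
            continue
        fields = line.split()
        session_id, person_id = fields[0], fields[1]   # assumed positions
        sessions[session_id].add(person_id)
    return sessions

logs = [
    "#Fields: session person",
    "SDF92MGSLOK4M23K B041Q3EV",
    "SDF92MGSLOK4M23K N23KFMWE",
    "ASD90K23MOLFWQIE EM9IU67Y",
]
sessions = extract_sessions(logs)
print(sorted(sessions["SDF92MGSLOK4M23K"]))
```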

Page 7

Supplemental Case Study

Product sentiment analysis over time: load 1 month of Twitter feeds and opinion boards onto HDFS

Process using the Word Count example, counting positive and negative words associated with a product over time

This type of analysis is being done with some success:

http://techcrunch.com/2012/05/18/study-twitter-sentiment-mirrored-facebooks-stock-price-today/
http://www.cs.ucr.edu/~vagelis/publications/wsdm2012-microblog-financial.pdf
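The word-count pattern above adapts directly to sentiment over time; a sketch counting positive and negative words per day (the word lists are illustrative stand-ins for a real sentiment lexicon):

```python
from collections import Counter

# Illustrative stand-ins for a real sentiment lexicon.
POSITIVE = {"great", "love", "fast"}
NEGATIVE = {"broken", "slow", "hate"}

def daily_sentiment(posts):
    """posts: iterable of (date, text). Returns {date: Counter} of word hits."""
    scores = {}
    for date, text in posts:
        day = scores.setdefault(date, Counter())
        for word in text.lower().split():
            if word in POSITIVE:
                day["positive"] += 1
            elif word in NEGATIVE:
                day["negative"] += 1
    return scores

posts = [("2013-03-01", "love this product great battery"),
         ("2013-03-01", "screen arrived broken"),
         ("2013-03-02", "fast shipping great value")]
result = daily_sentiment(posts)
print(result["2013-03-02"]["positive"])  # 2
```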


MapReduce is Different

MapReduce handles processing differently: distributed programming; fault tolerant.

MapReduce handles modeling differently: schema-less; oriented toward exploration and discovery.

MapReduce handles data differently: mostly unstructured data objects; a vast number of attributes and data sources; data sources added and/or updated frequently; quality is unknown.

External References:
http://developer.yahoo.com/hadoop/
http://code.google.com/edu/parallel/mapreduce-tutorial.html

Page 8

MapReduce…

…does it handle Big Data?

…is it considered MAD?

Magnetic

Agile: MapReduce requires algorithm development

Deep

…is it an End-to-End Solution?

IN-MEMORY COMPUTING

Page 9

In-Memory Computing

Overview:

All relevant structured data in-memory

Cache-aware memory organization (the current bottleneck sits between CPU and main memory)

Data partitioning for parallel execution

[Diagram: where computation sits in the application stack vs. the database stack, current vs. future methodology]

Current methodology: optimized for disk access on platforms with limited main memory and slow disk I/O.

Future methodology: leveraging current innovations in hardware and software to move computation into the database.


In-Memory Workflow

In-memory computing applies a combination of:

Optimization: for Query Pruning and Data Distribution

Execution: SQL statement plan for computational parallelization

Stores: Column store with partitioning/compression (5-30x ratio)

Persistence: Temporal Tables and MVCC

Example hardware (http://ark.intel.com/): IBM x3850 X5 with QPI scaling or a MAX5 memory tray; 2, 3, or 4 TB RAM; 2-4 CPUs at 10 cores each; > 4 TB across 8 HDDs.
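Dictionary encoding is one of the techniques behind the column-store compression mentioned above; a toy Python sketch of the mechanism (the 5-30x ratio quoted on the slide depends on real data, so this only illustrates the idea):

```python
def dictionary_encode(column):
    """Store each distinct value once; keep only small integer codes per row.
    Returns (dictionary, encoded_column)."""
    dictionary = sorted(set(column))
    codes = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [codes[v] for v in column]

# A column with many repeated values compresses well under this scheme.
column = ["shirt", "shoe", "shirt", "shirt", "hat", "shoe", "shirt"]
dictionary, encoded = dictionary_encode(column)
print(dictionary)  # ['hat', 'shirt', 'shoe']
print(encoded)     # [1, 2, 1, 1, 0, 2, 1]
```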

Page 10

Scale-Out Strategy for In-Memory


Capturing and Presenting

Data provisioning (an IM-DBMS does not currently accommodate transactional workloads):

Trigger replication: new transactions replicate to an in-memory DB, facilitating real-time operational analysis, planning, and simulation.

Extraction: ETL (Extract, Transform, Load) tools, with support for a large variety of external and internal source systems, handle other data sources in near real time but require job scheduling.

e.g. SAP HANA

Page 11

Case Study: Sales Analysis

1) Load 1.1 billion PoS records in < 1 sec
2) Identify top-selling categories
3) Drill down into a category in < 1 sec
4) Plan/actuals as schema & visualize

Link to Video: PoS from HANA using Business Objects Explorer


Examples of Performance Gains

Report on product dimensions (120 million line items):
Standard ERP solution: several minutes on a pre-aggregated dataset; more for drilldown
In-memory: less than 1 second on line-item-level data; a minute's delay for drilldown

Genome analysis:
Optimized data warehouse: sequence alignment 81 minutes + variant calling 65 minutes
In-memory: sequence alignment 15 minutes + variant calling 19.5 minutes (6.5 min estimated)
Approximately 2 hours saved

Page 12

In-Memory Computing…

…does it handle Big Data?

…is it considered MAD?

Magnetic: unstructured data still requires pre-processing

Agile

Deep: unsupervised and supervised methods

…is it an End-to-End Solution?

HDFS + MAP REDUCE + IN-MEMORY

Page 13

Case Study: Recommendations

1) 9 TB of W3C Extended Log File Format data

2) MapReduce program: sessionExtractor

Session            Product    Product
SDF92MGSLOK4M23K   B041Q3EV   N23KFMWE
ASD90K23MOLFWQIE   EM9IU67Y

Hadoop-HANA Connector

18M Records


Scale-Out: MapReduce + HDFS

Recall this slide as the Foundation

Page 14

+ Case Study: Predictive Analysis

1) Add Connection Details to all Data Reader Component

4) K-Means Cluster of Sessions

2) Retrieves records

5) Write back to database for persistence

3) Join 1.1B PoS records to Session Data

4) Explore Outcome

6) Use to provide Recommendations for Future Website Visitors
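The clustering in step 4 can be sketched with a minimal K-Means (k=2, fixed initial centroids). The session feature vectors here are invented for illustration, e.g. (pages viewed, minutes on site) per session; the real analysis runs in-database over the joined PoS/session data:

```python
def kmeans(points, centroids, iterations=10):
    """Plain K-Means: alternate assignment and centroid-update steps."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c))
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical session features: (pages viewed, minutes on site).
sessions = [(2, 1), (3, 2), (2, 2), (30, 40), (28, 45), (31, 42)]
centroids, clusters = kmeans(sessions, centroids=[(0, 0), (50, 50)])
print(sorted(len(c) for c in clusters))  # [3, 3]
```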


Scale-Out Strategy for In-Memory

Recall this slide as the Foundation

Page 15

Better together

…does it handle Big Data?

MapReduce enables Magnetism: it preprocesses unstructured data

In-Memory enables Agility: data provisioning via replication and extraction

Both MapReduce and In-Memory enable Deep analysis: during MapReduce preprocessing, and with unsupervised & supervised methods in-memory

…is it an End-to-End Solution?


SAP HANA + Intel Distribution of Hadoop

Announced February 27, 2013:

http://www.sap.com/corporate-en/news.epx?PressID=20498

Page 16

MAD Improvement Focus

Transformative potential in five domains:
U.S. healthcare
E.U. public-sector administration
Retail
Manufacturing
Personal location data

Most significant constraint: a shortage of talent to take advantage of the insights gained from large datasets:
Deep analytical talent with technical skills in statistics to provide insights
Data-savvy analysts to interpret, challenge, and base decisions on results
Support personnel who develop, implement, and maintain the architecture

Source: Big Data: The Next Frontier for Innovation, Competition, and Productivity, McKinsey Global Institute


QUESTIONS?