Building Analytical Apps on Hadoop

Post on 12-May-2015

1

Building Analytical Applications on Hadoop

Josh Wills | Director of Data Science

November 2012

2

About Me

3

What Are ‘Analytical Applications’?

4

The Humble Dashboard

5

Crossfilter with Flight Information

6

New York Times Electoral Vote Map

7

New York Times Electoral Vote Map (Detail)

8

Analytical Applications vs. Frameworks

9

A Case Study

Developing Analytical Applications

10

2012: The Predicting of the President

11

RealClearPolitics

• Simple Average of Polls

• Transparent

• Simple Interactions

12

FiveThirtyEight

• “Foxy” Model

• Opaque

• Simple Interactions with a richer UI

13

Princeton Election Consortium

• Medians and Polynomials

• Transparent

• Rich Interactions
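The difference between a simple average of polls (slide 11) and a median-based aggregate (slide 13) can be sketched in a few lines. This is only an illustration, not either site's actual method: the margins below are made-up numbers, and `simple_average` and `median_margin` are hypothetical helper names.

```python
from statistics import mean, median

# Hypothetical poll margins for one state (candidate A minus candidate B,
# in percentage points). These values are invented for illustration.
poll_margins = [2.0, 3.0, 1.0, 4.0, -9.0]


def simple_average(margins):
    """Aggregate polls by taking their arithmetic mean."""
    return mean(margins)


def median_margin(margins):
    """Aggregate polls by taking their median, which resists outliers."""
    return median(margins)


avg = simple_average(poll_margins)
med = median_margin(poll_margins)
print(f"simple average: {avg:+.1f}")
print(f"median:         {med:+.1f}")
```

With one outlying poll in the list, the mean is pulled toward the outlier while the median stays with the bulk of the polls.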

14

How Did They Do?

15

A Few of These, Because They’re Fun

16

A Few of These, Because They’re Fun

17

A Few of These, Because They’re Fun

18

Here’s the Rub: One Expert Beat Nate

19

Index Funds, Hedge Funds, and Warren Buffett

20

A Brief Introduction to Hadoop

21

Data Storage in 2001: Databases

• Structured schemas

• Intensive processing done where data is stored

• Somewhat reliable

• Expensive at scale

22

Data Storage in 2001: Filers

• No schemas, stores any kind of file

• No data processing capability

• Reliable

• Expensive at scale

23

And Then, This Happened

24

Data Economics: Return on Byte

25

Big Data Economics

• No individual record is particularly valuable

• Having every record is incredibly valuable

• Web index

• Recommendation systems

• Sensor data

• Market basket analysis

• Online advertising

26

Introduction to Hadoop

27

The Hadoop Distributed File System

• Based on the Google File System

• Data stored in large files

• Large block size: 64MB to 256MB per block

• Blocks are replicated to multiple nodes in the cluster
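A rough sketch of what the block size and replication bullets mean for a single file. The 128 MB block size is an assumed value inside the range on the slide, the replication factor of 3 is HDFS's usual default, and `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
import math

MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (blocks, block replicas, raw bytes stored) for one file.

    The last block only occupies as much space as the data it holds,
    so raw storage is the file size times the replication factor.
    """
    num_blocks = math.ceil(file_size_bytes / block_size_bytes)
    replicas = num_blocks * replication
    raw_bytes = file_size_bytes * replication
    return num_blocks, replicas, raw_bytes


blocks, replicas, raw = hdfs_footprint(1000 * MB)
print(blocks, replicas, raw // MB)
```

Under these assumptions a 1000 MB file is split into 8 blocks, stored as 24 block replicas across the cluster, using about 3000 MB of raw disk.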

28

Simple, Reliable Processing: MapReduce

• Map Stage: Embarrassingly parallel

• Shuffle Stage: Large-scale distributed sort

• Reduce Stage: Process all of the values that have the same key in a single step

• Process the data where it is stored

• Write once and you’re done.
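The three stages can be sketched as a single-process word count. This is an in-memory illustration of the data flow only, not Hadoop's API; the function names `map_stage`, `shuffle_stage`, `reduce_stage`, and `run_job` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_stage(record):
    """Map: handle each record independently, emitting (key, value) pairs."""
    for word in record.lower().split():
        yield word, 1


def shuffle_stage(pairs):
    """Shuffle: sort by key so all values for a key arrive together."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_stage(key, values):
    """Reduce: process every value for one key in a single step."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    mapped = [pair for record in records for pair in map_stage(record)]
    return dict(reduce_stage(key, values) for key, values in shuffle_stage(mapped))


counts = run_job(["the map stage", "the reduce stage"])
print(counts)
```

Because `map_stage` never looks at more than one record, the map calls could run on different machines without coordination; only the shuffle needs to move data between them.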

29

Developing Analytical Applications with Hadoop

30

Novelty is the Enemy of Adoption

31

The Best Way to Get Started: Apache Hive

• Apache Hive: Data Warehouse System on top of Hadoop

• SQL-based query language: SELECT, INSERT, CREATE TABLE

• Includes some MapReduce-specific extensions
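To show the three statement types named above in runnable form, the sketch below uses Python's built-in `sqlite3` module as a local stand-in. SQLite is not Hive, and Hive's dialect differs in details; the `page_views` table and its rows are invented for illustration.

```python
import sqlite3

# SQLite stands in for Hive here purely so the example runs locally.
conn = sqlite3.connect(":memory:")

# CREATE TABLE: declare a schema over the data.
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")

# INSERT: load a few made-up rows.
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 3), ("docs", 5), ("home", 4)],
)

# SELECT with GROUP BY: the kind of aggregation Hive would compile
# into a MapReduce job (group by key, then sum per key).
rows = conn.execute(
    "SELECT page, SUM(views) AS total FROM page_views "
    "GROUP BY page ORDER BY page"
).fetchall()
conn.close()

print(rows)
```

The appeal of this abstraction is that an analyst who already knows SQL can write the query without thinking about map and reduce stages.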

32

Borrowing Abstractions

33

Improving the UX (http://github.com/cloudera/impala)

34

Moving Beyond the Abstractions

35

Making the Abstract Concrete

36

Cloudera’s Data Science Course

37

Analytical Applications I Love

38

The Experiments Dashboard

39

Adverse Drug Events

40

Gene Sequencing and Analytics

41

The Doctor’s Perspective

42

A Couple of Themes

1. Structure the data in the way that makes sense for the problem.

2. Interactive inputs, not just interactive outputs.

3. Simpler interfaces that yield more sophisticated answers.

43

Working Towards The Dream

44

Moving Beyond MapReduce

Developing Analytical Applications

45

The Cambrian Explosion…of Frameworks

46

It’s Frameworks All The Way Down: Spark

• Developed at Berkeley’s AMP Lab

• Defines operations on distributed in-memory collections

• Written in Scala

• Supports reading from and writing to HDFS

47

IFATWD: GraphLab

• Developed at CMU

• Lower-level primitives (but higher than MPI)

• Map/Reduce => Update/Sort

• Flexible, allows for asynchronous computations

• Reads from HDFS

48

Playing with YARN

49

BranchReduce (http://github.com/cloudera/branchreduce)

50
