Building Analytical Apps on Hadoop

Building Analytical Applications on Hadoop

Josh Wills | Director of Data Science

November 2012

About Me

What are ‘Analytical Applications’?

The Humble Dashboard

Crossfilter with Flight Information

New York Times Electoral Vote Map

New York Times Electoral Vote Map (Detail)

Analytical Applications vs. Frameworks

A Case Study

Developing Analytical Applications

2012: The Predicting of the President

RealClearPolitics

• Simple Average of Polls

• Transparent

• Simple Interactions

FiveThirtyEight

• “Foxy” Model

• Opaque

• Simple Interactions with a richer UI

Princeton Election Consortium

• Medians and Polynomials

• Transparent

• Rich Interactions
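
The difference between a simple average of polls and a median-based summary can be sketched in a few lines. This is only an illustration: the margins below are made-up numbers, and the function names are hypothetical, not anything the sites themselves publish.

```python
from statistics import mean, median


def simple_average(margins):
    """Simple average of poll margins (in percentage points)."""
    return mean(margins)


def poll_median(margins):
    """Median of poll margins; less sensitive to a single outlying poll."""
    return median(margins)


# Hypothetical margins for one state; the last poll is an outlier.
margins = [2.0, 3.0, 1.0, 2.0, 12.0]

print("average:", simple_average(margins))
print("median: ", poll_median(margins))
```

With one outlying poll in the list, the average moves noticeably while the median stays close to the bulk of the polls.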

How Did They Do?

A Few of These, Because They’re Fun

Here’s the Rub: One Expert Beat Nate

Index Funds, Hedge Funds, and Warren Buffett

A Brief Introduction to Hadoop

Data Storage in 2001: Databases

• Structured schemas

• Intensive processing done where data is stored

• Somewhat reliable

• Expensive at scale

Data Storage in 2001: Filers

• No schemas, stores any kind of file

• No data processing capability

• Reliable

• Expensive at scale

And Then, This Happened

Data Economics: Return on Byte

Big Data Economics

• No individual record is particularly valuable

• Having every record is incredibly valuable

• Web index

• Recommendation systems

• Sensor data

• Market basket analysis

• Online advertising

Introduction to Hadoop

The Hadoop Distributed File System

• Based on the Google File System

• Data stored in large files

• Large block size: 64MB to 256MB per block

• Blocks are replicated to multiple nodes in the cluster
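
The block and replication arithmetic can be sketched as follows. This is a back-of-the-envelope illustration, not HDFS code: the function names are hypothetical, and the replication factor of 3 is an assumption (the slide says only "multiple nodes").

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=64 * MB):
    """Number of blocks needed to hold a file of the given size."""
    # Ceiling division in integers, so large sizes stay exact.
    return -(-file_size_bytes // block_size_bytes)


def raw_storage_bytes(file_size_bytes, replication=3):
    """Total bytes stored across the cluster once every block is replicated.

    replication=3 is an assumed value for illustration.
    """
    return file_size_bytes * replication


one_gb = 1024 * MB
print(block_count(one_gb, 64 * MB))    # blocks at the small end of the range
print(block_count(one_gb, 256 * MB))   # blocks at the large end of the range
print(raw_storage_bytes(one_gb) // MB, "MB stored in total")
```

Larger blocks mean fewer blocks to track per file, while replication multiplies the raw storage needed.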

Simple, Reliable Processing: MapReduce

• Map Stage

• Embarrassingly parallel

• Shuffle Stage: Large-scale distributed sort

• Reduce Stage

• Process all of the values that have the same key in a single step

• Process the data where it is stored

• Write once and you’re done.
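
The three stages above can be sketched as a single-process word count. This is a toy model of the programming pattern only: it runs in memory on one machine, does not use the Hadoop API, and all function names are hypothetical.

```python
from collections import defaultdict


def map_stage(records, mapper):
    """Apply the mapper to each record independently (embarrassingly parallel)."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_stage(pairs):
    """Group values by key and sort by key, standing in for the distributed sort."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_stage(grouped, reducer):
    """Process all values that share a key in a single step."""
    return [reducer(key, values) for key, values in grouped]


def word_mapper(line):
    """Emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(word, counts):
    """Sum the counts for one word."""
    return (word, sum(counts))


def run_job(records, mapper, reducer):
    """Run map, shuffle, and reduce in sequence."""
    return reduce_stage(shuffle_stage(map_stage(records, mapper)), reducer)


lines = ["big data", "big clusters", "data data"]
result = run_job(lines, word_mapper, count_reducer)
print(result)
```

In a real cluster, the map calls run on the nodes that hold each block, and the shuffle moves data between machines; here both are reduced to plain function calls.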

Developing Analytical Applications with Hadoop

Novelty is the Enemy of Adoption

The Best Way to Get Started: Apache Hive

• Apache Hive

• Data Warehouse System on top of Hadoop

• SQL-based query language

• SELECT, INSERT, CREATE TABLE

• Includes some MapReduce-specific extensions

Borrowing Abstractions

Improving the UX (http://github.com/cloudera/impala)

Moving Beyond the Abstractions

Making the Abstract Concrete

Cloudera’s Data Science Course

Analytical Applications I Love

The Experiments Dashboard

Adverse Drug Events

Gene Sequencing and Analytics

The Doctor’s Perspective

A Couple of Themes

1. Structure the data in the way that makes sense for the problem.

2. Interactive inputs, not just interactive outputs.

3. Simpler interfaces that yield more sophisticated answers.

Working Towards The Dream

Moving Beyond MapReduce

Developing Analytical Applications

The Cambrian Explosion…of Frameworks

It’s Frameworks All The Way Down: Spark

• Developed at Berkeley’s AMP Lab

• Defines operations on distributed in-memory collections

• Written in Scala

• Supports reading from and writing to HDFS

IFATWD: GraphLab

• Developed at CMU

• Lower-level primitives (but higher than MPI)

• Map/Reduce => Update/Sort

• Flexible, allows for asynchronous computations

• Reads from HDFS

Playing with YARN

BranchReduce (http://github.com/cloudera/branchreduce)
