How to Succeed in Hadoop

Syncsort & MapR @ comScore

Michael Brown, CTO | July 9th, 2014

The comScore Story

Analytics for a Digital World™

The Digital World is Complex

comScore’s Mission

Be the Leader in Digital Media Analytics.

Measure all forms of media—content and advertising—at scale, across all platforms, in real-time, globally.

comScore Brings it Together

Tablet, PC/Mac, TV, Smartphone, Gaming

comScore is a leading internet technology company that provides Analytics for a Digital World™

NASDAQ SCOR

Clients 2,400+ Worldwide

Employees 1,200+

Headquarters Reston, Virginia, USA

Global Coverage Measurement from 172 Countries; 44 Markets Reported

Local Presence 32 Locations in 23 Countries

Providing Analytics for 2,400+ Clients Globally

Media Agencies Telecom/Mobile Financial Retail Travel CPG Health Technology

Census: Tags & Data Feeds

Panels: PC, iOS, Android

Survey: Non-behavioral elements

Methods: Aggregation, Dictionaries, Taxonomies

Syndicated Data

Platform

Media Metrix, vCE

Collection, Calibration, Delivery

Models: Weighting, Projection

De-Duplication, Attribution

Turning Big Data into Powerful Insight

Client Analytics Platform

Digital Analytix

Panel Heat Map

Average Records Captured per Day (2005-2009)

[Chart: y-axis from 200,000,000 to 1,800,000,000 records per day]

CENSUS

Unified Digital Measurement™ (UDM) Establishes Platform For Panel + Census Data Integration

Adopted by 90% of Top 100 U.S. Media Properties

Unified Digital Measurement (UDM): Patent-Pending Methodology

Global PERSON Measurement

Global DEVICE Measurement

Beacon Heat Map

Monthly Records Collection

[Chart: beacon records and panel records collected per month; y-axis up to 2,000 billion]

Total records collected in June 2014 = 1,726,563,202,649

Total records collected YTD 2014 = 10,037,131,368,475

DMX @ comScore

DMX use at comScore

Purchased our first 4 licenses in 2000!

We use DMX from Syncsort across hundreds of servers for efficient data processing and aggregation.

We currently run more than 100 unique jobs every day.

With these jobs we process over 150 billion rows of data through DMX!

Connect

Design

Process Accelerate

Compression w/Sorting

Compress log files when processing large volumes of log data

Several advantages to sorting data first:
• Reduces the size of the data
• Improves application performance

Example: 1 hour of one source of our data
• 2,315 GB raw (2.9 billion rows)
• Standard compression of time-ordered data is 509 GB (22% of original)
• Standard compression on a sorted set is 324 GB (14% of original)

When applied to all our sources we save:
• 5.0 TB per day
• 155 TB per month
• 460 TB per quarter
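The effect of sorting before compressing can be reproduced on a small scale. The sketch below is only an illustration on synthetic data: the record layout (`cookie,url`), the `example.com` URLs, and the use of `zlib` as the "standard compression" are all assumptions, not comScore's actual log format or codec.

```python
import random
import zlib


def make_log(num_cookies=2000, events_per_cookie=20, seed=7):
    """Build synthetic 'cookie,url' lines in interleaved (arrival) order."""
    rng = random.Random(seed)
    lines = [
        f"cookie-{c:08d},http://example.com/page/{rng.randrange(50)}"
        for c in range(num_cookies)
        for _ in range(events_per_cookie)
    ]
    rng.shuffle(lines)  # simulate events from many cookies arriving interleaved
    return lines


def compressed_size(lines):
    """Size in bytes of the newline-joined lines after zlib compression."""
    return len(zlib.compress("\n".join(lines).encode("utf-8")))


lines = make_log()
raw_bytes = len("\n".join(lines).encode("utf-8"))
arrival_size = compressed_size(lines)         # compress in arrival order
sorted_size = compressed_size(sorted(lines))  # sort first, then compress

print(f"raw:           {raw_bytes} bytes")
print(f"arrival order: {arrival_size} bytes ({arrival_size / raw_bytes:.0%} of raw)")
print(f"sorted:        {sorted_size} bytes ({sorted_size / raw_bytes:.0%} of raw)")
```

Sorting places records with the same key next to each other, so the compressor finds longer repeated runs. The exact ratios depend on the data and codec and will not match the figures on the slide.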

Hadoop @ comScore

Why Hadoop?

• comScore built our own distributed computing stack in 2002.

• In 2009 we decided it was better to leverage the efforts of the Hadoop community instead of building our own stack.

• We recognized the benefit of switching to Hadoop, which would allow for seamless scaling of our infrastructure to meet the needs of the business.

• Hadoop allows us to add compute, storage and memory linearly and to process data at tremendous scale.

• Partnered with Syncsort on their Hadoop efforts since October 2010

• Evaluated the beta of MapR in the fall of 2011

90 Days of Data

[Chart: 90 days of data by year; y-axis up to 6,000 trillion; x-axis labels 2009, 2010, 2011, 2012, 2013, 2014, 2016; labeled values 4,862 and 5,084]

High Level Data Flow

Census

Custom Code +

Delivery

Our Cluster

Production Hadoop Cluster
• 400+ nodes: mix of Dell R720xd, R710 and R510 servers
• Each R720xd has 24 x 1.2 TB drives, 128 GB RAM and 32 cores
• 13,800+ total CPUs
• 31.6 TB total memory
• 8.2 PB total disk space
• Our distro is MapR M5 2.1.3

Leveraging Partitions from MapR

Validation Funnel & Target Effectiveness

Our growth

As our volume has grown, we have the following stats:
• Over 683 billion events per month
• Daily aggregate: 1.8 billion
• 160 billion aggregate records for 92 days
• 146K campaigns
• Over 50 countries
• We see 15 billion distinct cookies in a month
• We only need to output 26 million rows

Solution to reduce the shuffle

The Problem: Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues

The Idea:
• Partition and sort the data by cookie on a daily basis
• Create a custom InputFormat to merge daily partitions for monthly aggregations
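The merge step of the idea can be sketched in a few lines. This is plain Python rather than a Hadoop InputFormat, and the record shape `(cookie, count)` and the function name `merge_daily_partitions` are hypothetical. It only illustrates why pre-sorted daily partitions allow a streaming aggregation without a large shuffle.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_daily_partitions(daily_partitions):
    """Yield (cookie, total) from per-day record lists already sorted by cookie.

    heapq.merge streams the sorted inputs in key order, so all rows for one
    cookie arrive together and can be summed without holding a month in memory.
    """
    merged = heapq.merge(*daily_partitions, key=itemgetter(0))
    for cookie, rows in groupby(merged, key=itemgetter(0)):
        yield cookie, sum(count for _, count in rows)


# Three toy "daily" partitions, each sorted by cookie.
day1 = [("a", 2), ("c", 1)]
day2 = [("a", 1), ("b", 4)]
day3 = [("b", 1), ("c", 5)]

monthly = list(merge_daily_partitions([day1, day2, day3]))
print(monthly)  # [('a', 3), ('b', 5), ('c', 6)]
```

Because every input is already sorted by the same key, the aggregation for each cookie finishes as soon as the key changes. That is what lets most of the work happen on the map side.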

Custom Input Format with Map Side Aggregation

[Diagram: Map, Combiner, and Reduce stages, three tasks each]

Risks for Partitioning

Data locality
• A custom InputFormat requires reading blocks of the partitioned data over the network
• This was solved using a feature of the MapR file system: we created volumes and set the chunk size to zero, which guarantees that the data written to a volume will stay on one node

Map failures might result in long run times
• The size of the map inputs is no longer set by block size
• This was solved by creating a large number (10K) of volumes to limit the size of data processed by each mapper
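The slides do not say how cookies are assigned to the roughly 10K volumes. One common approach, shown here purely as an assumption, is a stable hash of the cookie ID modulo the number of volumes, so the same cookie lands in the same partition every day. The names `NUM_VOLUMES` and `volume_for_cookie` are hypothetical.

```python
import zlib
from collections import Counter

NUM_VOLUMES = 10_000  # the slide mentions creating about 10K volumes


def volume_for_cookie(cookie_id: str, num_volumes: int = NUM_VOLUMES) -> int:
    """Map a cookie ID to a volume index using a stable hash (assumed scheme).

    zlib.crc32 is deterministic across runs and machines, unlike Python's
    built-in hash() for strings, so daily jobs agree on the assignment.
    """
    return zlib.crc32(cookie_id.encode("utf-8")) % num_volumes


# Spread some synthetic cookie IDs over the volumes and inspect the load.
cookies = [f"cookie-{i}" for i in range(200_000)]
load = Counter(volume_for_cookie(c) for c in cookies)

print(f"volumes used: {len(load)}")
print(f"largest volume holds {max(load.values())} cookies")
```

Keeping each volume small bounds the input of any single mapper, which limits how much work is lost when a map task fails and has to be rerun.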

Partitioning Summary

Benefits:
• A large portion of the aggregation can be completed in the map phase
• Applications can now take advantage of combiners
• Shuffle sizes are minimal

Results:
• Took a job from 35 hours to 3 hours with no hardware changes

DMX-h @ comScore

Reasons for comScore selecting DMX-h

Performance

• DMX-h as the pluggable sort in Hadoop allows us to increase throughput on our existing platform; this reduces capital and ongoing operational expenses

• The increase in throughput allows us to also deliver our data more quickly to our customers. These things make the data more valuable to our clients.

Speed of Development

• The ability to quickly build out applications in the DMX-h GUI allows us to iterate and respond more quickly to the needs of the business.

• The ease of development also allows us to democratize the access to the Hadoop platform by leveraging a point and click GUI.

Performance - DMX Pluggable Sort Testing Results

First Comparison Run on our Dev Cluster

Pig scripts called with the Syncsort plug-in

GroupBy / Distinct Operations
• Counting uniques
• These have large shuffle steps, which leads to more data to sort.
• Observed up to a 20% decrease in job runtime

Filter Operations
• Searching for a specific value
• Observed a 5% – 10% decrease in job runtime
• Dependent on type of filter and size of job output

40GB compressed data, base run is 86 min, test run is 68 min; Savings of 20%

Results from 7 Nodes; 56 cores; 433 GB RAM; 28 TB disk; MapR M5 3.0.2; DMX-h 7.12
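As a quick check of the quoted savings, the relative reduction in runtime follows directly from the two measurements on the slide (86 and 68 minutes). The helper name `runtime_reduction` is just for illustration.

```python
def runtime_reduction(base_minutes: float, test_minutes: float) -> float:
    """Fraction of the base runtime saved by the test run."""
    return (base_minutes - test_minutes) / base_minutes


# Base run 86 min, test run 68 min (figures from the slide).
saving = runtime_reduction(86, 68)
print(f"runtime reduction: {saving:.1%}")
```

The result is 18/86, or about 20.9%, which is consistent with the rounded 20% figure quoted above.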

Speed of Development - POC

We took an existing process that runs in our Hadoop cluster and converted that to DMX-h to validate the new capabilities.

The existing process:

• Written in 75 lines of Pig with 3 Java UDFs

• Developed in about 25 hours

• Processes 3.5 billion input rows per day

• Takes 35 minutes to run on a daily basis

DMX-h Process

Speed of Development - POC

The new process in DMX-h:

• Developed a new job with 13 tasks

• No Java UDF required

• Runs on the same data and in the same environment.

• Developed in 12 hours.

• Runs in 11 minutes! 1/3 of the time of the Pig & Java code.
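The two POC runtimes can also be expressed as throughput. The calculation below uses only the figures quoted on the slides (3.5 billion input rows per day, 35 minutes versus 11 minutes). It assumes both runs processed the full daily input, and the variable names are illustrative.

```python
ROWS_PER_DAY = 3_500_000_000  # input rows per day, from the slide


def rows_per_second(rows: int, minutes: float) -> float:
    """Average processing rate for a job that handles `rows` in `minutes`."""
    return rows / (minutes * 60)


pig_rate = rows_per_second(ROWS_PER_DAY, 35)  # existing Pig + Java UDF process
dmx_rate = rows_per_second(ROWS_PER_DAY, 11)  # new DMX-h process
speedup = dmx_rate / pig_rate

print(f"Pig + Java: {pig_rate:,.0f} rows/s")
print(f"DMX-h:      {dmx_rate:,.0f} rows/s")
print(f"speedup:    {speedup:.2f}x")
```

The speedup works out to 35/11, a little over 3x, which matches the "1/3 of the time" claim.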

Useful Factoids

Visit www.comscoredatamine.com or follow @datagems for the latest gems.

Colorful, bite-sized graphical representations of the best discoveries we unearth.

Thank You!

Michael Brown, CTO, comScore, Inc.

mbrown@comscore.com

Today’s Presenters

Steve Wooledge, VP - Product Marketing

@swooledge

Jorge Lopez, Director - Product Marketing

@zanilli

Mike Brown CTO

comScore


Leveraging MapR and Syncsort

Big Data is Overwhelming Traditional Systems

• Mission-critical reliability
• Transaction guarantees
• Deep security
• Real-time performance
• Backup and recovery

• Interactive SQL
• Rich analytics
• Workload management
• Data governance
• Backup and recovery

Enterprise Data Architecture

TREND 1

ENTERPRISE USERS

OPERATIONAL SYSTEMS

ANALYTICAL SYSTEMS

PRODUCTION REQUIREMENTS

OUTSIDE SOURCES

Hadoop: The Disruptive Technology at the Core of Big Data

TREND

[Chart: job trends from Indeed.com, Jan ‘06 to Jan ‘14]

OPERATIONAL SYSTEMS

ANALYTICAL SYSTEMS

ENTERPRISE USERS

REALITY 1

• Data staging
• Archive

• Data transformation
• Data exploration

• Streaming, interactions

Hadoop Relieves the Pressure from Enterprise Systems

Keys for Production Success

FOUNDATION

1 Reliability and DR
2 Interoperability
3 High performance
4 Supports operations and analytics

Architecture Matters for Success

REALITY 2

Data protection & security

High performance

Multi-tenancy

Operational & Analytical Workloads

Open standards for integration

NEW APPLICATIONS SLAs TRUSTED INFORMATION LOWER TCO

The Power of the Open Source Community

MapR Data Platform

APACHE HADOOP AND OSS ECOSYSTEM

[Diagram categories: Execution Engines; Streaming; NoSQL & Search; ML, Graph; Security; Provisioning & Coordination; Workflow & Data Governance; Data Integration & Access; Data Governance and Operations]

[Diagram components: Cascading, Spark Streaming, Storm*, Savannah*, Mahout, GraphX, MapReduce v1 & v2, Tez*, Accumulo*, Impala, Drill*, Sentry*, Oozie, ZooKeeper, Sqoop, Knox*, Whirr, Falcon*, Flume, HttpFS]

* Certification/support planned for 2014

MapR Distribution for Hadoop


Enterprise-grade
• High availability
• Data protection
• Disaster recovery

Interoperability
• Standard file access
• Standard database access
• Pluggable services
• Broad developer support

Security
• Enterprise security authorization
• Wire-level authentication
• Data governance

Operational
• Ability to support predictive analytics, real-time database operations, and high arrival rate data

Multi-tenancy
• Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators

Performance
• 2X to 7X higher performance
• Consistent, low latency

MapR: Best Solution for Customer Success

Top Ranked Exponential Growth

500+ Customers

Premier Investors

3X bookings Q1 ‘13 – Q1 ‘14

80% of accounts expand 3X

90% software licenses

< 1% lifetime churn

> $1B in incremental revenue generated by 1 customer

MapR and Syncsort Reference Architecture

Sources
RELATIONAL, SAAS, MAINFRAME
DOCUMENTS, EMAILS
LOG FILES, CLICKSTREAMS
BLOGS, TWEETS, LINK DATA

DATA MARTS, DATA WAREHOUSE

MapR Data Platform

Business Intelligence / Visualization

MapR-DB, MapR-FS

Batch (MR, Spark, Hive, Pig, …)

Interactive (Impala, Drill, …)

Streaming (Spark Streaming, Storm…)

MAPR DISTRIBUTION FOR HADOOP

Do You Know Syncsort?

• Syncsort provides fast, secure, enterprise‐grade software spanning “Big Iron to Big Data”

• Fastest sort technology in the market
  • Powering 50% of mainframes’ sort

• A history of innovation
  • 25+ issued & pending patents

• Large global customer base
  • 12,000+ deployments in 80 countries and serving 87 of the Fortune 100

• First‐to‐market, fully integrated approach to Hadoop ETL

• A top 7 contributor to Hadoop, based on the number of lines of code changed in 2013

Our customers are achieving the impossible, every day!

Key Partners

The Hadoop Challenge

PROCESS

Join, Aggregate, Copy

DISTRIBUTE, COLLECT

Most organizations use Hadoop to…

Extract

Transform

Turning Hadoop into a Feature-rich ETL Solution

DMX‐h

Collect
• Broad-based connectivity with automated parallelism
• Best-in-class mainframe data access & translation

Process & Distribute
• No manual coding. GUI for developing & maintaining MR jobs
• No code generation. Engine runs natively on each node
• Develop & test locally in Windows; run natively on Hadoop

Optimize & Secure
• Faster throughput per node
• Full support for Kerberos & LDAP
• Web‐based monitoring console
• Sort‐work compression for storage savings

A Roadmap to Hadoop Success

Agile Data Exploration & Visualization

Next‐gen Analytics

Cheap Storage

Offload Data Warehouse

Enabling the Data‐driven Organization

Solving the Intractable IT Problem

MapR + Syncsort Solutions

Data Warehouse Optimization: Shift ELT Workloads to Hadoop

Click‐stream Analysis: Collect, Process & Analyze More Data from Your Website

Mainframe Offload: Access, Translate & Analyze Mainframe Data with Hadoop

Q & A

Engage with us!

1. Download the MapR Sandbox for Hadoop: www.mapr.com/sandbox

2. Try Syncsort’s Hadoop ETL in the MapR Sandbox: www.syncsort.com/mapr

3. Learn best practices for Hadoop ETL: www.mapr.com/EDH
