How to Succeed in Hadoop
Post on 29-Jun-2015
© comScore, Inc. Proprietary.
Syncsort & MapR @ comScore
Michael Brown, CTO | July 9th, 2014
The comScore Story
Analytics for a Digital World™
The Digital World is Complex
comScore’s Mission
Be the Leader in Digital Media Analytics.
Measure all forms of media—content and advertising—at scale, across all platforms, in real-time, globally.
comScore Brings it Together
Tablet, PC/Mac, TV, Smartphone, Gaming
comScore is a leading internet technology company that provides Analytics for a Digital World™
NASDAQ: SCOR
Clients: 2,400+ worldwide
Employees: 1,200+
Headquarters: Reston, Virginia, USA
Global Coverage: Measurement from 172 countries; 44 markets reported
Local Presence: 32 locations in 23 countries
Providing Analytics For More Than 2,400 Clients Globally
Media, Agencies, Telecom/Mobile, Financial, Retail, Travel, CPG, Health, Technology
Turning Big Data into Powerful Insight
[Diagram stages: Collection, Calibration, Delivery]
• Census: Tags & Data Feeds
• Panels: PC, iOS, Android
• Survey: Non-behavioral elements
• Methods: Aggregation, Dictionaries, Taxonomies
• Models: Weighting, Projection, De-Duplication, Attribution
• Syndicated Data Platform: Media Metrix, vCE
• Client Analytics Platform: Digital Analytix
• Consulting, Analysis
Panel Heat Map
Average Records Captured per Day (2005-2009)
[Chart: x-axis in monthly steps from 9/26/2005 to 3/26/2009; y-axis from 0 to 1,800,000,000 records]
Unified Digital Measurement™ (UDM) Establishes Platform For Panel + Census Data Integration
Adopted by 90% of Top 100 U.S. Media Properties
Unified Digital Measurement (UDM): Patent-Pending Methodology
[Diagram labels: PANEL, CENSUS, Global PERSON Measurement, Global DEVICE Measurement]
Beacon Heat Map
Monthly Records Collection
[Chart: number of records per month for two series, Beacon Records and Panel Records; y-axis from 0 to 2,000 billion]
Total records collected in June 2014 = 1,726,563,202,649
Total records collected YTD 2014 = 10,037,131,368,475
DMX @ comScore
DMX use at comScore
Purchased our first 4 licenses in 2000!
We use DMX from Syncsort across hundreds of servers for efficient data processing and aggregation.
We currently run more than 100 unique jobs every day.
With these jobs we process over 150 billion rows of data through DMX!
Connect, Design, Process, Accelerate
Compression w/Sorting
Compress log files when processing large volumes of log data.
Several advantages to sorting data first:
• Reduces the size of the data
• Improves application performance
Example: 1 hour of one source of our data
• 2,315 GB raw (2.9 billion rows)
• Standard compression of time-ordered data is 509 GB (22% of original)
• Standard compression on a sorted set is 324 GB (14% of original)
When applied to all our sources we save:
• 5.0 TB per day
• 155 TB per month
• 460 TB per quarter
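The effect of sorting before compressing can be reproduced at small scale. The sketch below is only an illustration, not comScore's DMX pipeline: it assumes synthetic "cookie, URL" rows and uses Python's standard zlib in place of whatever "standard compression" was used on the slide. The row layout, row counts and helper names are all hypothetical.

```python
import random
import zlib


def make_log_rows(n_rows=50_000, n_cookies=5_000, n_urls=50, seed=7):
    """Build synthetic 'cookie<TAB>url' rows in random (arrival) order.

    The row layout is an assumption made for this illustration only.
    """
    rng = random.Random(seed)
    cookies = [f"{rng.getrandbits(64):016x}" for _ in range(n_cookies)]
    urls = [f"/site/page{i:03d}" for i in range(n_urls)]
    return [f"{rng.choice(cookies)}\t{rng.choice(urls)}\n" for _ in range(n_rows)]


def compressed_size(rows):
    """Return the zlib-compressed size, in bytes, of the concatenated rows."""
    return len(zlib.compress("".join(rows).encode("ascii"), 6))


rows = make_log_rows()
raw_size = sum(len(r) for r in rows)
arrival_size = compressed_size(rows)         # rows compressed in arrival order
sorted_size = compressed_size(sorted(rows))  # same rows, sorted first

print(f"raw bytes:            {raw_size}")
print(f"compressed (arrival): {arrival_size}")
print(f"compressed (sorted):  {sorted_size}")
```

Sorting places rows that share a cookie next to each other, so the compressor finds long repeated prefixes close together. The exact ratios depend on the data and will not match the figures on the slide.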
Hadoop @ comScore
Why Hadoop?
• comScore built our own distributed computing stack in 2002.
• In 2009 we decided it was better to leverage the efforts of the Hadoop community instead of building our own stack.
• We recognized that switching to Hadoop would allow seamless scaling of our infrastructure to meet the needs of the business.
• Hadoop allows us to add compute, storage and memory linearly and to process data at tremendous scale.
• Partnered with Syncsort on their Hadoop efforts since October 2010
• Evaluated the beta of MapR in the fall of 2011
90 Days of Data
[Chart: x-axis years 2009, 2010, 2011, 2012, 2013, 2014, 2016; y-axis from 0 to 6,000 trillion; data labels 1,148, 1,919, 3,049, 4,862 and 5,084]
High Level Data Flow
[Diagram labels: Panel, Census, Custom Code +, ADW, EDW, Delivery]
Our Cluster
Production Hadoop Cluster
• 400+ nodes: mix of Dell R720xd, R710 and R510 servers
• Each R720xd has 24 x 1.2 TB drives, 128 GB RAM and 32 cores
• 13,800+ total CPUs
• 31.6 TB total memory
• 8.2 PB total disk space
• Our distro is MapR M5 2.1.3
Leveraging Partitions from MapR
Validation Funnel & Target Effectiveness
Our growth
As our volume has grown, we have the following stats:
• Over 683 billion events per month
• Daily aggregate: 1.8 billion
• 160 billion aggregate records for 92 days
• 146K campaigns
• Over 50 countries
• We see 15 billion distinct cookies in a month
• We only need to output 26 million rows
Solution to reduce the shuffle
The Problem:
• Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues
The Idea:
• Partition and sort the data by cookie on a daily basis
• Create a custom InputFormat to merge daily partitions for monthly aggregations
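The merge step in this idea can be sketched outside Hadoop. The talk describes a custom Hadoop InputFormat, whose code is not shown; the Python below is only a minimal model of the same principle, assuming each daily partition is a list of (cookie, event_count) records already sorted by cookie. The record shape and function name are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_daily_partitions(daily_partitions):
    """Yield (cookie, total_events) from per-day inputs sorted by cookie.

    heapq.merge streams the sorted inputs in key order without re-sorting,
    so all records for one cookie arrive together and can be summed at once.
    """
    merged = heapq.merge(*daily_partitions, key=itemgetter(0))
    for cookie, records in groupby(merged, key=itemgetter(0)):
        yield cookie, sum(count for _, count in records)


# Three hypothetical daily partitions for the same cookie range.
day1 = [("cookie_a", 3), ("cookie_c", 1)]
day2 = [("cookie_a", 2), ("cookie_b", 5)]
day3 = [("cookie_b", 1), ("cookie_c", 4)]

monthly = list(merge_daily_partitions([day1, day2, day3]))
print(monthly)  # [('cookie_a', 5), ('cookie_b', 6), ('cookie_c', 5)]
```

Because every input is already sorted by the same key, the monthly aggregate for a cookie is complete as soon as the merge moves on to the next cookie, and no shuffle across partitions is needed.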
Custom Input Format with Map Side Aggregation
[Diagram: three Map tasks reading inputs labeled A, B and C, three Combiners, and three Reduce tasks]
Risks for Partitioning
Data locality
• A custom InputFormat requires reading blocks of the partitioned data over the network.
• This was solved using a feature of the MapR file system. We created volumes and set the chunk size to zero, which guarantees that the data written to a volume will stay on one node.
Map failures might result in long run times
• The size of the map inputs is no longer set by the block size.
• This was solved by creating a large number (10K) of volumes to limit the size of data processed by each mapper.
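One way to picture how cookies could be spread across the 10K volumes is a stable hash of the cookie ID. The slides do not say how comScore assigns cookies to volumes, so the hash choice, the path layout and the function names below are assumptions made purely for illustration.

```python
import zlib

NUM_VOLUMES = 10_000  # the "10K" volume count mentioned on the slide


def volume_for_cookie(cookie_id: str, num_volumes: int = NUM_VOLUMES) -> int:
    """Map a cookie to a volume index with a stable hash (assumed scheme).

    zlib.crc32 is used instead of the built-in hash() because it returns the
    same value in every process, so a cookie lands in the same volume each day.
    """
    return zlib.crc32(cookie_id.encode("utf-8")) % num_volumes


def volume_path(cookie_id: str, day: str) -> str:
    """Build a hypothetical path for one day's sorted file inside a volume."""
    return f"/partitions/vol{volume_for_cookie(cookie_id):05d}/{day}.sorted"


print(volume_path("cookie_a", "2014-06-01"))
print(volume_path("cookie_a", "2014-06-02"))  # same volume, different day
```

With an assignment like this, every daily file for a given cookie sits in the same volume, so a mapper that owns one volume sees that cookie's whole month, and the number of volumes bounds how much data any single mapper has to reprocess after a failure.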
Partitioning Summary
Benefits:
• A large portion of the aggregation can be completed in the map phase
• Applications can now take advantage of combiners
• Shuffle sizes are minimal
Results:
• Took a job from 35 hours to 3 hours with no hardware changes
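Why partitioning by cookie makes combiners usable can be shown with a unique-cookie count. This is a toy model under stated assumptions, not comScore's job: the aggregation is assumed to be "unique cookies per campaign", the events are synthetic, and the partitioner is a hypothetical CRC-based one. When each cookie lives in exactly one partition, per-partition unique counts are disjoint and can simply be added.

```python
import zlib
from collections import defaultdict


def partition_of(cookie: str, num_partitions: int) -> int:
    """Assumed partitioner: stable hash of the cookie ID."""
    return zlib.crc32(cookie.encode("utf-8")) % num_partitions


def uniques_with_shuffle(events):
    """Baseline: every distinct (campaign, cookie) pair goes to the reducers."""
    shuffled = set(events)
    counts = defaultdict(int)
    for campaign, _cookie in shuffled:
        counts[campaign] += 1
    return dict(counts), len(shuffled)


def uniques_map_side(events, num_partitions):
    """Cookie-partitioned input: each map task emits one count per campaign."""
    partitions = defaultdict(list)
    for campaign, cookie in events:
        partitions[partition_of(cookie, num_partitions)].append((campaign, cookie))
    shuffled = []
    for part in partitions.values():
        local = defaultdict(set)
        for campaign, cookie in part:
            local[campaign].add(cookie)
        shuffled.extend((campaign, len(seen)) for campaign, seen in local.items())
    counts = defaultdict(int)
    for campaign, n in shuffled:
        counts[campaign] += n  # safe to add: partitions share no cookies
    return dict(counts), len(shuffled)


# Synthetic events: 1,000 cookies, 3 campaigns, each pair seen twice.
events = [(f"camp{i % 3}", f"cookie{i % 1000}") for i in range(6000)]

base_counts, base_shuffled = uniques_with_shuffle(events)
fast_counts, fast_shuffled = uniques_map_side(events, num_partitions=4)
print(base_counts == fast_counts, base_shuffled, fast_shuffled)
```

Both approaches return the same counts, but the partitioned version shuffles at most one record per campaign per partition instead of one record per distinct campaign and cookie pair.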
DMX-h @ comScore
Reasons for comScore selecting DMX-h
Performance
• DMX-h as the pluggable sort in Hadoop allows us to increase throughput on our existing platform; this reduces capital and ongoing operational expenses.
• The increase in throughput also allows us to deliver our data to our customers more quickly, which makes the data more valuable to them.
Speed of Development
• The ability to quickly build out applications in the DMX-h GUI allows us to iterate and respond more quickly to the needs of the business.
• The ease of development also allows us to democratize the access to the Hadoop platform by leveraging a point and click GUI.
Performance - DMX-h Pluggable Sort Testing Results
First Comparison Run on our Dev Cluster
Pig scripts, called with the Syncsort plug-in
GroupBy / Distinct Operations
• Counting uniques
• These have large shuffle steps, which lead to more data to sort.
• Observed up to a 20% decrease in job runtime
Filter Operations
• Searching for a specific value
• Observed a 5% - 10% decrease in job runtime
• Dependent on type of filter and size of job output
40 GB compressed data: base run is 86 min, test run is 68 min; savings of roughly 20%
Results from 7 Nodes; 56 cores; 433 GB RAM; 28 TB disk; MapR M5 3.0.2; DMX-h 7.12
Speed of Development - POC
We took an existing process that runs in our Hadoop cluster and converted that to DMX-h to validate the new capabilities.
The existing process:
• Written in 75 lines of Pig with 3 Java UDFs
• Developed in about 25 hours
• Processes 3.5 billion input rows per day
• Takes 35 minutes to run on a daily basis
DMX-h Process
Speed of Development - POC
The new process in DMX-h:
• Developed a new job with 13 tasks
• No Java UDF required
• Runs on the same data and in the same environment.
• Developed in 12 hours.
• Runs in 11 minutes! 1/3 of the time of the Pig & Java code.
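The POC figures can be put side by side with a little arithmetic. The snippet below only restates numbers from the two POC slides (about 25 versus 12 development hours, 35 versus 11 minutes of daily runtime); the helper name is hypothetical.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, as a percentage."""
    return 100.0 * (before - after) / before


# Figures quoted on the POC slides.
pig_dev_hours, dmxh_dev_hours = 25, 12
pig_runtime_min, dmxh_runtime_min = 35, 11

dev_cut = percent_reduction(pig_dev_hours, dmxh_dev_hours)
runtime_cut = percent_reduction(pig_runtime_min, dmxh_runtime_min)
runtime_ratio = dmxh_runtime_min / pig_runtime_min

print(f"development time reduced by {dev_cut:.1f}%")
print(f"daily runtime reduced by {runtime_cut:.1f}%")
print(f"new runtime is {runtime_ratio:.2f} of the old one")
```

The runtime ratio comes out at about 0.31, which is where the "1/3 of the time" on the slide comes from.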
Useful Factoids
Visit www.comscoredatamine.com or follow @datagems for the latest gems.
Colorful, bite-sized graphical representations of the best discoveries we unearth.
Thank You!
Michael Brown, CTO, comScore, Inc.
mbrown@comscore.com
© 2014 MapR Technologies
Today’s Presenters
Steve Wooledge, VP - Product Marketing, @swooledge
Jorge Lopez, Director - Product Marketing, @zanilli
Mike Brown, CTO
Leveraging MapR and Syncsort
Big Data is Overwhelming Traditional Systems
Trend 1: Enterprise Data Architecture
[Diagram labels: Enterprise Users, Outside Sources, Operational Systems, Analytical Systems]
Operational systems, production requirements:
• Mission-critical reliability
• Transaction guarantees
• Deep security
• Real-time performance
• Backup and recovery
Analytical systems, production requirements:
• Interactive SQL
• Rich analytics
• Workload management
• Data governance
• Backup and recovery
Trend 2: Hadoop, the Disruptive Technology at the Core of Big Data
[Chart: job trends from Indeed.com, Jan '06 to Jan '14]
Reality 1: Hadoop Relieves the Pressure from Enterprise Systems
[Diagram labels: Operational Systems, Analytical Systems, Enterprise Users]
• Data staging
• Archive
• Data transformation
• Data exploration
• Streaming, interactions
Keys for Production Success
1. Reliability and DR
2. Interoperability
3. High performance
4. Supports operations and analytics
Reality 2: Architecture Matters for Success
Foundation:
• Data protection & security
• High performance
• Multi-tenancy
• Operational & analytical workloads
• Open standards for integration
New applications, SLAs, trusted information, lower TCO
The Power of the Open Source Community
[Diagram: Apache Hadoop and OSS ecosystem on top of the MapR Data Platform, with Management alongside]
Headings shown: Execution Engines; Data Governance and Operations
Categories shown: Batch; Streaming; NoSQL & Search; ML, Graph; SQL; Security; Provisioning & coordination; Workflow & Data Governance; Data Integration & Access
Components shown: YARN, MapReduce v1 & v2, Pig, Cascading, Spark, Spark Streaming, Storm*, HBase, Solr, Accumulo*, Mahout, MLLib, GraphX, Tez*, Hive, Impala, Shark, Drill*, Juju, Savannah*, Sentry*, Knox*, Oozie, Falcon*, ZooKeeper, Whirr, Sqoop, Flume, HttpFS, Hue
* Certification/support planned for 2014
MapR Distribution for Hadoop
[Diagram: the Apache Hadoop and OSS ecosystem diagram from the previous slide, repeated]
Enterprise-grade
• High availability
• Data protection
• Disaster recovery
Interoperability
• Standard file access
• Standard database access
• Pluggable services
• Broad developer support
Security
• Enterprise security authorization
• Wire-level authentication
• Data governance
Operational
• Ability to support predictive analytics, real-time database operations, and high arrival rate data
Multi-tenancy
• Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators
Performance
• 2X to 7X higher performance
• Consistent, low latency
MapR: Best Solution for Customer Success
• Top Ranked
• Exponential Growth
• 500+ Customers
• Premier Investors
• 3X bookings, Q1 '13 to Q1 '14
• 80% of accounts expand 3X
• 90% software licenses
• < 1% lifetime churn
• > $1B in incremental revenue generated by 1 customer
MapR and Syncsort Reference Architecture
Sources:
• Relational, SaaS, mainframe
• Documents, emails
• Log files, clickstreams
• Blogs, tweets, link data
MapR Distribution for Hadoop:
• Batch (MR, Spark, Hive, Pig, …)
• Interactive (Impala, Drill, …)
• Streaming (Spark Streaming, Storm…)
• MapR Data Platform: MapR-DB, MapR-FS
Data marts, data warehouse
Business Intelligence / Visualization
Do You Know Syncsort?
• Syncsort provides fast, secure, enterprise-grade software spanning “Big Iron to Big Data”
• Fastest sort technology in the market
  • Powering 50% of mainframes’ sort
• A history of innovation
  • 25+ issued & pending patents
• Large global customer base
  • 12,000+ deployments in 80 countries and serving 87 of the Fortune 100
• First-to-market, fully integrated approach to Hadoop ETL
• Top 7 contributors to Hadoop, based on number of lines of code changed in 2013
Our customers are achieving the impossible, every day!
Key Partners
The Hadoop Challenge
Most organizations use Hadoop to… Extract, Transform, Load (ETL)
• COLLECT
• PROCESS: Sort, Join, Aggregate, Copy, Merge
• DISTRIBUTE
Turning Hadoop into a Feature-rich ETL Solution
DMX-h ETL
Collect
• Broad based connectivity with automated parallelism
• Best in class mainframe data access & translation
Process & Distribute
• No manual coding. GUI for developing & maintaining MR jobs
• No code generation. Engine runs natively on each node
• Develop & test locally in Windows; run natively on Hadoop
Optimize & Secure
• Faster throughput per node
• Full support for Kerberos & LDAP
• Web-based monitoring console
• Sort-work compression for storage savings
A Roadmap to Hadoop Success
• Agile Data Exploration & Visualization
• Next-gen Analytics
• Cheap Storage
• Offload Data Warehouse
Enabling The Data-driven Organization
Solving The Intractable IT Problem
MapR + Syncsort Solutions
• Data Warehouse Optimization: Shift ELT Workloads to Hadoop
• Click-stream Analysis: Collect, Process & Analyze More Data from Your Website
• Mainframe Offload: Access, Translate & Analyze Mainframe Data with Hadoop
Q & A
Engage with us!
1. Download the MapR Sandbox for Hadoop: www.mapr.com/sandbox
2. Try Syncsort’s Hadoop ETL in the MapR Sandbox: www.syncsort.com/mapr
3. Learn best practices for Hadoop ETL: www.mapr.com/EDH