![Page 1: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/1.jpg)
©AE 2012
Bram Vanschoenwinkel
Senior Data Scientist, AE
@bvschoen @AE_NV
R & Hadoop: The perfect marriage for your analytics?
Evening conference, 19/06/2014
![Page 2: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/2.jpg)
Agenda
1. It’s a (R)evolution
2. Intelligent Decision Support in the Digital Age
3. The R Project for Statistical Computing
4. The World of Hadoop
5. Case: A Customer Intelligence Platform
6. Conclusions
![Page 3: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/3.jpg)
It’s a (R)evolution
[Chart: data volume over time (2000, 2010, 2015); the majority is unstructured data]
![Page 4: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/4.jpg)
Abundance of Data
BEYOND
WEB
CRM
ERP
PURCHASE DETAIL
PRODUCTION
PAYMENT DETAIL
PLANNING
CONTACT INFORMATION
LEADS
OFFERS
SEGMENTATION
PROSPECTS
CLICK STREAM DATA
WEB SHOPS SOCIAL MEDIA
VIDEO
IMAGES
TEXT
ONLINE SERVICES
AUDIO
OPEN DATA
MOBILE DEVICES
INTERNET OF THINGS
RFID
GPS
SENSORS
USER GENERATED CONTENT
SMART DEVICES
SENSORS
REMOTE MONITORING
CLOUD
MEDICAL
WEARABLES
![Page 5: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/5.jpg)
Opportunities
OPERATIONAL EXCELLENCE
INNOVATIVE BUSINESS MODELS
INSIGHTS, STRATEGY AND POLICY
![Page 6: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/6.jpg)
Challenges
SHORT LIFESPAN OF THE DATA
FAST MOVING DATA
FAST DATA PROCESSING
HIGH VARIETY OF DATA
![Page 7: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/7.jpg)
Intelligent Decision Support in the Digital Age
WHAT WE SEE
ABUNDANCE OF HETEROGENEOUS DATA
THE WAY WE INTERACT WITH THE WORLD HAS CHANGED
OPPORTUNITIES
OPERATIONAL EXCELLENCE
BETTER DECISION SUPPORT
CHALLENGES
ANALYSIS GAP
VOLUME, VARIETY, VELOCITY
INNOVATING BUSINESS MODELS
COMPETENCES
![Page 8: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/8.jpg)
Decision Support in the Digital Age
Facing the Challenges and realizing the Opportunities
Business Analytics
Big Data
![Page 9: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/9.jpg)
Elements of a Holistic Information Management Framework
- Data Sources
- Internal & External
- From Data to Information
- Improving data quality
- Integrality of data
- From Information to Knowledge
Intelligent Decision Support:
- Reporting
- Business Analytics
- From Knowledge to Intelligence
Data, Information, Knowledge, Intelligence, Wisdom/Insight
![Page 10: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/10.jpg)
Decision Support in the Digital Age
“Business Analytics is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.”
![Page 11: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/11.jpg)
Business Analytics vs Business Intelligence
![Page 12: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/12.jpg)
New Insights
8 stops
132 stops
10 stops
53 stops
64 stops
14 stops
4 stops
11 stops
![Page 13: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/13.jpg)
Innovating Business Models
Front-end Application(s)
Security
Analytics (on Hadoop)
Web Click Streaming
Social Media
Connectivity
External Application Integration
Operational Data Processing on Hadoop
![Page 14: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/14.jpg)
From Analytics…
Statistics Algorithms
Biology
Psychology
Databases
![Page 15: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/15.jpg)
…to Business Analytics
![Page 16: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/16.jpg)
Analytics Approach
Analytics
Incremental and iterative
Think big, act small
Proof-of-Concept
Open source tools
Architecture & Deployment
(Non-)functional requirements
Information Architecture
Technology
Embedded into operations
Two Phase Approach
Analytics
Architecture Deployment
![Page 17: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/17.jpg)
Analytics Churn Prediction Example
Invoicing CRM Call Center Application
John Doe – 43 years – Antwerp – Man – 7 calls – 3 weeks – 30% down invoicing
Jane Dan – 32 years – Brussels – Woman – 2 calls – 12 weeks – 10% up invoicing
…
Operations
CHURN SCORES
REGION
PRODUCT
CHURN SCORES
MANAGEMENT DASHBOARD
OPERATIONS
DATA DUMP
Analytics Engine
Data Warehouse
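As a rough illustration of the flow on this slide (operational records in, churn scores out), the sketch below turns the two example customer records into structured data and assigns each a score. The `Customer` fields, the reading of "7 calls – 3 weeks" as seven calls over three weeks, and the `churn_score` function with its hand-picked weights are all assumptions made for illustration; an actual analytics engine would fit a model on historical data.

```python
import math
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    age: int
    city: str
    calls: int                # call-center contacts
    weeks: int                # assumed: period (in weeks) over which the calls were made
    invoicing_change: float   # relative change in invoicing, -0.30 means 30% down


def churn_score(c: Customer) -> float:
    """Toy logistic score between 0 and 1. The weights are hypothetical, not fitted."""
    calls_per_week = c.calls / c.weeks
    z = -1.0 + 1.5 * calls_per_week - 4.0 * c.invoicing_change
    return 1.0 / (1.0 + math.exp(-z))


# The two example records from the slide.
customers = [
    Customer("John Doe", 43, "Antwerp", calls=7, weeks=3, invoicing_change=-0.30),
    Customer("Jane Dan", 32, "Brussels", calls=2, weeks=12, invoicing_change=0.10),
]

scores = {c.name: churn_score(c) for c in customers}

# Highest churn risk first, as an operations list or dashboard might show it.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: churn score {score:.2f}")
```

With these made-up weights, frequent calls combined with falling invoicing push the score up, which is the kind of signal the slide suggests feeding back to operations and to the management dashboard.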
![Page 18: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/18.jpg)
Big Data
“Big data is high-volume, high-velocity, high-complexity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (Gartner)
![Page 19: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/19.jpg)
Four V’s and a C
Volume alone does not make big data big; it is all about the V’s: high Volume, high Variety, high Velocity and high Value!
In addition the data is very complex in nature, often unstructured: Text documents, emails, images and videos, etc.
Click stream data, social media feed data, etc.
![Page 20: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/20.jpg)
Innovative Forms of Information Processing
Traditional methods don’t suffice anymore.
New forms of information processing have emerged.
DISTRIBUTED DATA STORAGE
COMPUTATION
NoSQL DATA STORES
![Page 21: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/21.jpg)
Innovative Forms of Information Processing
![Page 22: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/22.jpg)
The R Project for Statistical Computing
R is a dialect of the S language
S was developed by John Chambers and others at Bell Labs
S was initiated in 1976
Now owned by TIBCO and sold under the name S-PLUS
INTERACTIVE, NOT PROGRAMMING
GRADUALLY MOVING INTO
PROGRAMMING WHEN SYSTEM ASPECTS BECOME IMPORTANT
![Page 23: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/23.jpg)
Advantages of R
Most widely used data analysis software: created and used by 2M+ data scientists, statisticians and analysts.
Most powerful statistical programming language: flexible, extensible and comprehensive for productivity, with 4,800+ packages.
Create beautiful and unique data visualizations: as seen in the New York Times, Twitter and Flowing Data.
Thriving open-source community: leading edge of analytics research.
Fills the talent gap: new graduates prefer R.
![Page 24: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/24.jpg)
Drawbacks of R
Steep learning curve. Counterpoint: vibrant community to help you.
Objects must be stored in physical memory, with little thought given to memory management. Counterpoint: recent advancements to deal with this.
Functionality is based on consumer demand and user contributions. Counterpoint: if a package is useful to many people, it will quickly evolve into a robust product.
Documentation is sometimes patchy and terse, and impenetrable to the non-statistician. Counterpoint: vibrant community to help you.
![Page 25: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/25.jpg)
Exploding growth and Demand for R
R is the highest paid IT skill – Dice.com, Jan 2014
R is the most-used data science language after SQL – O’Reilly, Jan 2014
R is used by 70% of data miners – Rexer, Sep 2013
R is #15 of all programming languages – RedMonk, Jan 2014
R growing faster than any other data science language – KDnuggets, Aug 2013
More than 2 million users worldwide
![Page 26: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/26.jpg)
Great Adoption of R by Many Companies
Commercial vendors offering general support and developing specific R based products, e.g.: Oracle, Revolution Analytics.
Companies using R for advanced statistics and analytics, e.g.: Thomas Cook, Google, Twitter.
Also in the AE customer base we see different companies looking into R as an alternative or complement to the traditional tools.
![Page 27: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/27.jpg)
Example Packages
twitteR: Provides an interface to the Twitter web API.
tm: Provides text mining functionality such as word stemming, stopword removal, etc.
wordcloud: Provides methods for producing word clouds in different forms, shapes and colors.
![Page 28: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/28.jpg)
Apache Hadoop
Open-source software framework.
Storage and large-scale processing of data on clusters of commodity hardware.
Apache top-level project built and used by a global community.
Two core components:
1. Hadoop Distributed File System (HDFS)
2. MapReduce
![Page 29: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/29.jpg)
Apache Hadoop
MapReduce/HDFS based on Google's MapReduce and Google File System.
Other components are:
Hadoop Common – libraries and utilities needed by other Hadoop modules
Hadoop YARN – a resource-management platform
The entire Apache Hadoop “platform” is now commonly considered to consist of a number of related projects as well: Pig, Hive, HBase, …
Created by Doug Cutting and Mike Cafarella in 2005, originally to support distribution for the Apache Nutch search engine project.
All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework.
![Page 30: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/30.jpg)
The World of Hadoop
![Page 31: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/31.jpg)
Key Properties of Apache Hadoop
Transforms commodity hardware into a service that:
• Stores petabytes of data reliably.
• Allows huge distributed computations.
Key Properties:
• Designed for batch processing.
• Write-once-read-many access model for files.
• Extremely powerful.
Scalability:
• Scales linearly with cores and disks.
• Machines can be added and removed from the cluster.
• Write code once, same program runs on 1, 1000, 4000 machines.
Reliable and fault-tolerant:
• Failed tasks/data transfers are automatically retried.
• Data replication, redundancy.
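The "failed tasks are automatically retried" property can be pictured with a small single-process sketch. This is not Hadoop code: `run_with_retries`, `make_flaky_task` and the limit of four attempts are hypothetical names and values chosen for illustration, and the node failure is simulated with an exception.

```python
def run_with_retries(task, max_attempts=4):
    """Run task(); if it raises, try again, up to max_attempts attempts in total.

    Returns a tuple (result, number_of_attempts_used).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except RuntimeError as err:
            last_error = err   # remember the failure and try again
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def make_flaky_task(failures_before_success, result):
    """Build a task that fails a fixed number of times before it succeeds."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("simulated node failure")
        return result

    return task


# A task that fails twice and then succeeds is transparently re-run.
value, attempts = run_with_retries(make_flaky_task(2, "partial result"))
print(value, "after", attempts, "attempts")
```

The caller only sees the final result; the failures are absorbed by the retry loop, which is the behaviour the framework provides so that application code does not have to handle individual machine failures.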
![Page 32: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/32.jpg)
A Typical Hadoop Cluster
[Diagram: a Client and a Master Node (Name Node, Job Tracker) connected to Slave Nodes spread over Rack 1, Rack 2 and Rack 3; each Slave Node runs a Data Node and a Task Tracker executing Map and Reduce tasks. Labelled flows: data assignment to nodes, data read, data write, metadata for block info, job assignment, task assignment.]
1. Client
2. Master Node: Name Node, Job Tracker
3. Slave Nodes: Data Nodes, Task Trackers (Map / Reduce)
![Page 33: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/33.jpg)
Hadoop Distributed File System (HDFS)
1. Client consults Name Node
2. Client writes block to Data Node
3. Data Node replicates block
4. Cycle repeats for next blocks
[Diagram: a Client writes a FILE to Data Nodes 1 to 9, spread over Rack 1, Rack 2 and Rack 3. Labelled flows: data assignment to nodes, data read, data write, metadata for block info. The Name Node metadata lists data nodes per rack: Rack 1: Data Node 1, Data Node 2, …; Rack 2: Data Node 3, …]
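The four write steps on this slide can be mimicked in a few lines of plain Python. This is a toy model, not the HDFS API: the `NameNode` class, the rack layout, the tiny block size and the placement rule (one replica per rack, round-robin) are assumptions made to keep the sketch short, and the real HDFS placement policy differs.

```python
BLOCK_SIZE = 4      # bytes per block; tiny on purpose, real HDFS blocks are far larger
REPLICATION = 3     # number of copies kept of every block

# Assumed layout: three racks with three data nodes each.
RACKS = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8", "dn9"],
}


class NameNode:
    """Keeps metadata only: which data nodes hold which block."""

    def __init__(self, racks):
        self.racks = racks
        self.block_map = {}   # (filename, block index) -> list of data node ids
        self._offset = 0

    def allocate(self, filename, index):
        """Pick REPLICATION target nodes, each on a different rack (simplified rule)."""
        rack_names = sorted(self.racks)
        targets = []
        for r in range(REPLICATION):
            rack = rack_names[(self._offset + r) % len(rack_names)]
            nodes = self.racks[rack]
            targets.append(nodes[(self._offset // len(rack_names)) % len(nodes)])
        self._offset += 1
        self.block_map[(filename, index)] = targets
        return targets


def write_file(name_node, data_nodes, filename, data):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for index, block in enumerate(blocks):
        targets = name_node.allocate(filename, index)   # 1. client consults Name Node
        first, *rest = targets
        data_nodes[first][(filename, index)] = block    # 2. client writes block to a Data Node
        for replica in rest:                            # 3. that Data Node replicates the block
            data_nodes[replica][(filename, index)] = data_nodes[first][(filename, index)]
        # 4. the loop repeats for the next block


def read_file(name_node, data_nodes, filename):
    """Reassemble a file by reading the first replica of every block, in order."""
    indices = sorted(i for (f, i) in name_node.block_map if f == filename)
    return b"".join(
        data_nodes[name_node.block_map[(filename, i)][0]][(filename, i)] for i in indices
    )


name_node = NameNode(RACKS)
data_nodes = {dn: {} for nodes in RACKS.values() for dn in nodes}
write_file(name_node, data_nodes, "file.txt", b"hello hadoop")
print(name_node.block_map)
print(read_file(name_node, data_nodes, "file.txt"))
```

Note that the name node never touches the file contents; it only records where the blocks live, which matches the "metadata for block info" label in the diagram.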
![Page 34: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/34.jpg)
MapReduce
Stages: Input → Splitting → Map → Shuffle/Sort → Reduce → Output
Map (one group of pairs per input split):
the, 1 | quick, 1 | brown, 1 | fox, 1
the, 1 | fox, 1 | ate, 1 | the, 1 | mouse, 1
how, 1 | now, 1 | brown, 1 | cow, 1
Shuffle/Sort (pairs grouped by key):
the, 1 | the, 1 | the, 1
fox, 1 | fox, 1
quick, 1
brown, 1 | brown, 1
ate, 1
mouse, 1
how, 1
now, 1
cow, 1
Reduce (values summed per key):
the, 3
fox, 2
quick, 1
brown, 2
ate, 1
mouse, 1
how, 1
now, 1
cow, 1
Output:
the, 3 | fox, 2 | quick, 1 | brown, 2 | ate, 1 | mouse, 1 | how, 1 | now, 1 | cow, 1
The Map function processes one line at a time, splits it into tokens separated by whitespace and emits a key-value pair <word, 1> for each token.
The Reduce function just sums up the values, which are the occurrence counts for each key (i.e. the words in this example).
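The word count walk-through can be reproduced with a single-process Python sketch of the three phases. No Hadoop is involved; the function names are illustrative, and the three input lines are inferred from the map output shown on the slide.

```python
from collections import defaultdict


def map_line(line):
    """Map: split one line on whitespace and emit a (word, 1) pair per token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle/sort: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: sum the occurrence counts collected for one key."""
    return key, sum(values)


# Input lines, inferred from the map output on the slide.
lines = [
    "the quick brown fox",
    "the fox ate the mouse",
    "how now brown cow",
]

mapped = [pair for line in lines for pair in map_line(line)]
grouped = shuffle(mapped)
counts = dict(reduce_counts(key, values) for key, values in grouped.items())

for word, count in counts.items():
    print(f"{word}, {count}")
```

On a cluster the map calls run in parallel on different input splits and the shuffle moves data between machines, but the logic of each phase is the same as in this sketch.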
![Page 35: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/35.jpg)
Hadoop Distributions
Fully equipped, scalable and flexible cloud solutions.
Also different on premise solutions are being offered.
Choice depends on specific requirements: Data Privacy, Scalability, Security, Data Mastership, Configuration, Flexibility, Price-Performance Ratio, Automation, …
How to get started? Free to download!
Business model is based on training, consulting, support and additional “tooling” (Enterprise Editions).
Many free trial cloud versions available to play around with.
Many tutorials, trainings, blogs, user groups etc.
![Page 36: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/36.jpg)
RHadoop
A collection of four R packages that allow users to manage and analyze data with Hadoop:
rmr: Hadoop MapReduce functionality in R
rhdfs: file management of the HDFS from within R
rhbase: database management for the HBase distributed database
Recently a new package, plyrmr, was released, providing a familiar interface while hiding many of the MapReduce details (like Hive, Pig and Mahout).
R and all RHadoop packages should be installed on all nodes in the Hadoop cluster.
Combining the advantages of R with the power of Hadoop.
![Page 37: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/37.jpg)
MapReduce Wordcount Example in R
Map function.
Reduce function.
Reading the input from HDFS: from.dfs().
Writing the results back to HDFS: to.dfs().
![Page 38: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/38.jpg)
Case: A Customer Intelligence Platform
* Non Disclosure Agreement: Contact AE via www.ae.be/contact for more information
![Page 39: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/39.jpg)
Conclusions
The Digital Age brings many opportunities but also challenges.
Big Data and Analytics can face the challenges and realize the opportunities.
It is within anyone’s grasp; do it incrementally and iteratively.
R and Hadoop:
Open source software, active user groups and support.
A great way to start exploring!
Combined power gives you the advantage of 1 + 1 = 3.
Sometimes alternatives are better.
![Page 40: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/40.jpg)
Conclusions
You don’t always need Big Data to do Analytics; it depends on the requirements.
Hadoop cloud solutions are scalable, flexible and cost-efficient, but sometimes limited in functionality (or not standardized).
Many differences between Hadoop distributions, which are constantly evolving (and getting better).
Need for good Data Scientists in a mixed team of competences to make the right choices.
![Page 41: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/41.jpg)
What’s next?
Ask yourself the following questions:
What opportunities do I see for myself?
What strategic and competitive advantages can I realize?
Is Analytics the right solution for me? Do I need Big Data?
What about my Data Warehouse environment?
And what about the quality of my operational data?
Do I have the right infrastructure in place?
Do I have the right competences in house?
Now you should know what’s in it for you, but also the challenges you will most probably be facing.
![Page 42: SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554f5ae8b4c905524c8b54af/html5/thumbnails/42.jpg)
What’s next?
You have a case you would like to discuss…?
You have any questions…?
Please feel free to contact me: Bram Vanschoenwinkel
+32 478 741738
@bvschoen
be.linkedin.com/in/bramvanschoenwinkel/