Hadoop Overview
TRANSCRIPT
Overview
- Why Hadoop?
  - Data Growth
  - What is Big Data?
  - Hadoop usage
- What is Hadoop?
  - Components
  - NoSQL
  - Cluster
  - Vendors
- How to Hadoop?
  - Tool Comparison
  - Typical Implementation
  - Data Analysis with Pig & Hive
  - Opportunities
- Examples
  - MapReduce deep dive
  - Wordcount
  - Search index
  - Recommendation Engine
Data Growth
- OLTP: databases for operations
  - Throw away historical data
  - Relational: Oracle, DB2
- OLAP: data warehouses for analytics
  - Cheaper centralized storage -> data warehouses (ETL tools)
  - Relational / MPP appliances
  - < a few hundred TB
- Big Data
  - Data explosion (social media, etc.); petabyte scale
  - Network speeds haven't increased -> need data locality
  - Distributed processing on commodity hardware (Hadoop)
  - Non-relational
What is Big Data?
- Volume: petabyte scale
- Variety: structured, semi-structured, unstructured
- Velocity: social, sensor, throughput
- Veracity: unclean, imprecise, unclear
Where is Hadoop Used?

Industry        Use Cases
Technology      Search, "People you may know", movie recommendations
Banks           Fraud detection, regulatory, risk management
Media / Retail  Marketing analytics, customer service, product recommendations
Manufacturing   Preventive maintenance
What is Hadoop?
Open-source distributed computing framework for storage and processing.

HDFS - Distributed Storage
- Economical: commodity hardware
- Scalable: rebalances data onto new nodes
- Fault tolerant: detects faults & auto-recovers
- Reliable: maintains multiple copies of data
- High throughput: because data is distributed

MapReduce - Distributed Processing
- Data locality: process where the data resides
- Fault tolerant: auto-recovers from job failures
- Scalable: add nodes to increase parallelism
- Economical: commodity hardware
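The map -> shuffle/sort -> reduce flow above can be simulated in a few lines of Python; the names (`map_reduce`, the sensor example data) are illustrative, not part of any Hadoop API, and the real framework runs the same two functions in parallel across many nodes:

```python
# Minimal single-process simulation of the MapReduce flow:
# map each record to (key, value) pairs, shuffle/sort by key,
# then reduce each key's values to one result.
from itertools import groupby

def map_reduce(records, map_fn, reduce_fn):
    mapped = [kv for rec in records for kv in map_fn(rec)]
    mapped.sort(key=lambda kv: kv[0])  # the shuffle & sort step
    return {k: reduce_fn(k, [v for _, v in group])
            for k, group in groupby(mapped, key=lambda kv: kv[0])}

# Toy example: max reading per sensor from raw "sensor,value" lines.
lines = ["s1,20", "s2,31", "s1,25"]
result = map_reduce(
    lines,
    map_fn=lambda line: [(line.split(",")[0], int(line.split(",")[1]))],
    reduce_fn=lambda k, vals: max(vals),
)
```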
NoSQL DBs - HBase
- Modeled after Google's BigTable
- Random, real-time read/write access to Big Data
- Billions of rows x millions of columns
- Commodity hardware
- Open source, distributed, versioned, column-oriented
- Integrates with MapReduce; has Java/REST APIs
- Automatic sharding
- Unlike RDBMS:
  - De-normalized
  - No secondary indexes
  - No transactions

Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
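HBase's data model can be pictured as nested sorted maps: row key -> column family -> column qualifier -> timestamped versions. A toy in-memory sketch (the `ToyTable` class and its method names are hypothetical, for illustration only; this is not the HBase API):

```python
# Toy model of HBase's data model: row -> family -> qualifier -> {ts: value}.
class ToyTable:
    def __init__(self):
        self.rows = {}

    def put(self, row, family, qualifier, value, ts):
        # Writes never overwrite: each timestamp is a new version of the cell.
        cell = (self.rows.setdefault(row, {})
                         .setdefault(family, {})
                         .setdefault(qualifier, {}))
        cell[ts] = value

    def get(self, row, family, qualifier):
        # Latest version wins, mirroring HBase's default single-version read.
        versions = self.rows.get(row, {}).get(family, {}).get(qualifier, {})
        return versions[max(versions)] if versions else None
```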
How Does Hadoop Work?

Cluster layout (diagram): the Master Node runs the Job Tracker and the Name Node; each Slave Node runs a Task Tracker (processing) and a Data Node (storage).
Vendors

Hadoop distributors: Apache Hadoop, Cloudera, HortonWorks, MapR, Amazon EMR
ETL/BI connectors: Pentaho, Informatica, Talend, Clover, Microstrategy, Tableau, SAS, AbInitio
Comparison

             Traditional ETL/BI                        Hadoop
Cost         Expensive license; expensive hardware     Open source; cheap commodity hardware
Volume       < 100 TB; central storage                 Petabyte scale; distributed storage
Speed        Quick response on small data;             Even the smallest job takes ~15 seconds;
             not as fast on large data                 super fast on large data
Throughput   Thousands of reads/writes per minute      Millions of reads/writes per minute
Hadoop Implementation
- Sources: RDBMS, data feeds, files
- Ingest into HDFS: Flume, Sqoop, put/get, ETL tools
- Process on Hadoop: MapReduce, Pig, Hive, Mahout
- Output (analytics & visualization): reports, machine learning, SAS, R
Data Analysis: Pig & Hive

Both are abstractions on top of MapReduce: they generate MapReduce jobs in the backend and are useful for analysts who are not programmers.

Pig
- Data flow language
- No schema
- Better with less-structured data
- Example:
  data = LOAD 'file' USING PigStorage('\t') AS (id, name);
- Operators: FILTER, FOREACH, GROUP, ORDER, STORE

Hive
- SQL-like language
- Schema, tables, and joins are stored in a metastore
- Example:
  CREATE TABLE customer (id INT, name STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
  SELECT * FROM customer WHERE id < 100 LIMIT 10;
Word count - Java
- Copy input files to HDFS
  - hadoop fs -put file1.txt input
- Create driver
  - Set configuration variables, mapper and reducer class names
- Create mapper
  - Read input and emit key/value pairs
- Create reducer (optional)
  - Aggregate all values for a particular key
- Execute
  - hadoop jar WordCount.jar WordCount input output
- Analyze output
  - hadoop fs -cat output/* | head
Word count - Streaming
- Hadoop is written in Java. I don't know Java. What do I do?
  - Hadoop Streaming (Python, Ruby, R, etc.)
- Copy input files to HDFS
  - hadoop fs -put file1.txt input
- Create mapper
  - Read the input stream (stdin) and emit (print) key/value pairs
- Create reducer (optional)
  - Aggregate all values for a particular key
- Execute
  - hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming*.jar \
      -mapper mapper.py -file mapper.py -reducer reducer.py -file reducer.py \
      -input input -output output
- Analyze output
  - hadoop fs -cat output/* | head
Hadoop for R
- RHadoop packages
  - rmr
  - rhdfs
  - rhbase
- Uses Hadoop Streaming
- The example below determines how many countries have a greater GDP than Apple

Sys.setenv(HADOOP_HOME="/home/istvan/hadoop")
Sys.setenv(HADOOP_CMD="/home/istvan/hadoop/bin/hadoop")
library(rmr2)
library(rhdfs)

setwd("/home/istvan/rhadoop/blogs/")
gdp <- read.csv("GDP_converted.csv")
head(gdp)

hdfs.init()
gdp.values <- to.dfs(gdp)
# AAPL revenue in 2012 in millions USD
aaplRevenue = 156508

gdp.map.fn <- function(k, v) {
  key <- ifelse(v[4] < aaplRevenue, "less", "greater")
  keyval(key, 1)
}

count.reduce.fn <- function(k, v) {
  keyval(k, length(v))
}

count <- mapreduce(input=gdp.values, map=gdp.map.fn, reduce=count.reduce.fn)
from.dfs(count)

Source: http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/
Search index example
- Crawl web
  - Crawl and save websites to a local directory
- Ingest files to HDFS
- Map
  - Split the words & associate each word with the file name it came from
- Reduce
  - Build an index of words -> files & count of occurrences
- Search
  - Pass the word to the index to get the files it shows up in; display the file listing in descending order of the word's occurrence count per file
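The map/reduce/search steps above can be sketched as a single-process inverted index; the function names (`map_index`, `reduce_index`, `search`) are mine, for illustration:

```python
# Inverted-index sketch: map emits (word, filename) pairs,
# reduce aggregates them into word -> {filename: occurrence count},
# search ranks matching files by occurrence count.
from collections import defaultdict

def map_index(filename, text):
    # Map step: one (word, filename) pair per word in the document.
    for word in text.split():
        yield word.lower(), filename

def reduce_index(pairs):
    # Reduce step: count occurrences of each word per file.
    index = defaultdict(lambda: defaultdict(int))
    for word, filename in pairs:
        index[word][filename] += 1
    return index

def search(index, word):
    # Files containing the word, most occurrences first.
    hits = index.get(word.lower(), {})
    return sorted(hits, key=hits.get, reverse=True)
```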
Recommender example
- Use web server logs with user ratings info for items
- Create Hive tables to build structure on top of this log data
- Generate a Mahout-specific CSV input file (user, item, rating)
- Run Mahout to build item recommendations for users:
  mahout recommenditembased \
    --input /user/hive/warehouse/mahout_input \
    --output recommendations \
    -s SIMILARITY_PEARSON_CORRELATION -n 20
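The Mahout step can be illustrated with a toy item-based recommender. The names (`item_similarity`, `recommend`) are mine, and for brevity it scores with cosine similarity over co-ratings, whereas the command above asks Mahout for Pearson correlation (which additionally mean-centers each vector):

```python
# Item-based collaborative filtering sketch: score each unseen item by
# the user's ratings weighted by item-item similarity, return the top n.
from math import sqrt

def item_similarity(ratings, i, j):
    # Cosine similarity over users who rated both items.
    # ratings: {user: {item: rating}}
    common = [u for u in ratings if i in ratings[u] and j in ratings[u]]
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[u][j] for u in common)
    ni = sqrt(sum(ratings[u][i] ** 2 for u in common))
    nj = sqrt(sum(ratings[u][j] ** 2 for u in common))
    return dot / (ni * nj)

def recommend(ratings, user, n=20):
    items = {i for r in ratings.values() for i in r}
    seen = ratings[user]
    scores = {cand: sum(rating * item_similarity(ratings, cand, item)
                        for item, rating in seen.items())
              for cand in items - set(seen)}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```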
Recap
- Why Hadoop?
  - Data Growth
  - What is Big Data?
  - Hadoop usage
- What is Hadoop?
  - Components
  - NoSQL
  - Cluster
  - Vendors
- How to Hadoop?
  - Tool Comparison
  - Typical Implementation
  - Data Analysis with Pig & Hive
  - Opportunities
- Demo
  - MapReduce deep dive
  - Wordcount
  - Search index
  - Recommendation Engine
Q & A
Contact Siva Pandeti:
Email: [email protected]
LinkedIn: www.linkedin.com/in/SivaPandeti
Twitter: @SivaPandeti
Blog: http://pandeti.com/blog