An introduction to Big Data processing using Hadoop
TRANSCRIPT
An introduction to
Big Data processing
using Hadoop
A.Sedighi
hexican.com
No single standard definition…
“Big Data” is data whose scale, diversity,
and complexity require new architectures,
techniques, algorithms, and analytics to
manage it and extract value and hidden
knowledge from it…
Big Data, Definition
Information is powerful…
but it is how we use it that will
define us
Data Explosion
relational
text
audio
video
images
Big Data Era
- creates over 30 billion pieces of content per day
- stores 30 petabytes of data
- produces over 90 million tweets per day
Log Files
- Log files contain data.
- Each banking transaction should be logged at
different levels.
How many log files does a banking solution
generate per day?
Big Data: 3 V's
volume, velocity, variety
Some make it 3 V's
What is driving Big Data Industry?
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, from many sources
- Very large datasets
- More real-time

versus:

- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
Big Data Challenges
Big Data Challenges
Sorting 10 TB:
- 1 node takes 2.5 days (O(N log N))
- 100 nodes take 35 minutes (O(log N))
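A quick back-of-envelope check of the slide's numbers, assuming ideal (linear) speedup across the cluster:

```python
# Sanity check of the sorting claim above, using the slide's own
# figures and assuming the work divides evenly across all nodes.

one_node_seconds = 2.5 * 24 * 3600   # 2.5 days on a single node
nodes = 100

# Ideal linear speedup: total time shrinks by the node count.
cluster_seconds = one_node_seconds / nodes
print(cluster_seconds / 60)          # ~36 minutes, close to the slide's 35
```

The small gap between 36 and 35 minutes is within rounding; in practice, shuffle and coordination overhead keeps real clusters below ideal speedup.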
Big Data Challenges
Problem: “fat” servers imply high cost.
Solution: use cheap commodity nodes instead.
Problem: a large number of cheap nodes implies
frequent failures.
Solution: leverage automatic fault tolerance.
Big Data Challenges
We need a new data-parallel programming
model for clusters of commodity machines.
What Technology Do We Have
For Big Data ?
Map Reduce
MapReduce
Published in 2004 by Google; popularized by the Apache Hadoop project.
Used by Yahoo!, Facebook, Twitter, Amazon, LinkedIn, and many other enterprises.
Word Count Example
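The slide's diagram isn't reproduced in this transcript, so as a rough sketch: the word-count flow can be mimicked in plain Python (the input lines below are made up; a real Hadoop job would implement the Mapper and Reducer in Java or use Hadoop Streaming):

```python
# A minimal word-count sketch mimicking the MapReduce phases:
# map -> shuffle/sort -> reduce. Illustrative only, not the Hadoop API.
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Sum all the counts emitted for the same word.
    return (word, sum(counts))

lines = ["big data big clusters", "data everywhere"]  # made-up input

# Map: apply the mapper to every input record.
pairs = [pair for line in lines for pair in map_phase(line)]

# Shuffle/sort: group the pairs by key, as Hadoop does between phases.
pairs.sort(key=itemgetter(0))
result = dict(
    reduce_phase(word, (count for _, count in group))
    for word, group in groupby(pairs, key=itemgetter(0))
)
print(result)  # {'big': 2, 'clusters': 1, 'data': 2, 'everywhere': 1}
```

The shuffle/sort step is what Hadoop provides for free: all values for the same key end up at the same reducer, which is why word count parallelizes so naturally.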
MapReduce philosophy
-hide complexity
-make it scalable
-make it cheap
MapReduce popularized by
Apache Hadoop project
Hadoop Overview
Open source implementation of Google
MapReduce and the Google File System (GFS).
First released in 2008 by Yahoo!
Widely adopted by Facebook, Twitter, Amazon, etc.
Everything Started By Searching
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
Hadoop Sub Projects - 1
Hadoop Sub Projects - 2
Hadoop Distributed File System (HDFS) - 1
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
-“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Hadoop Distributed File System (HDFS) - 2
-HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. The time to read the whole dataset is more important than the latency in reading the first record.
Hadoop Distributed File System (HDFS) - 3
- HDFS is designed to carry on working, without a noticeable interruption to the user, in the face of node failure.
Where HDFS doesn't work well
● Low-latency data access
● Lots of small files
● Multiple writers, arbitrary file modifications
MapReduce and HDFS
HDFS Concepts - Blocks
64 MB, 128 MB, or 256 MB block size.
If the seek time is around 10 ms, and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time, we need to make the block size around 100 MB.
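The arithmetic behind that block-size rule of thumb can be checked directly (numbers taken from the slide itself):

```python
# Deriving the block size from the slide's rule of thumb:
# keep the seek at 1% of the transfer time, i.e.
#   seek_time = 0.01 * (block_size / transfer_rate)

seek_time_s = 0.010      # ~10 ms average disk seek (slide's figure)
transfer_rate = 100e6    # 100 MB/s sustained transfer (slide's figure)
seek_fraction = 0.01     # seek should be 1% of transfer time

block_size = seek_time_s * transfer_rate / seek_fraction
print(block_size / 1e6)  # 100.0 -> about 100 MB per block
```

This is why HDFS blocks are orders of magnitude larger than ordinary filesystem blocks: the larger the block, the less the fixed seek cost matters relative to streaming the data.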
Anatomy of a File Read
Anatomy of a File Write
Replica Replacement
Machine Learning - 1
Mahout's goal is to build scalable machine learning libraries. Core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
Machine Learning - 2
Mahout can be used as a recommender engine on top of Hadoop clusters.
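As an illustration only (this is not Mahout's API), a toy item co-occurrence recommender in plain Python shows the collaborative-filtering idea that Mahout scales out over Hadoop; all the item names below are made up:

```python
# Toy item-based collaborative filtering: recommend items that often
# appear in the same "basket" as a given item. Not Mahout's API.
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories (one set of items per user).
baskets = [
    {"hadoop-book", "hdfs-guide", "mapreduce-paper"},
    {"hadoop-book", "hdfs-guide"},
    {"hadoop-book", "mahout-docs"},
]

# Count how often each pair of items co-occurs in a basket.
cooccurrence = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        cooccurrence[(a, b)] += 1
        cooccurrence[(b, a)] += 1

def recommend(item, n=2):
    # Rank the items most often seen together with `item`.
    scored = [(other, c) for (i, other), c in cooccurrence.items() if i == item]
    return [other for other, _ in sorted(scored, key=lambda x: -x[1])[:n]]

print(recommend("hadoop-book"))
```

On a Hadoop cluster this co-occurrence counting becomes a MapReduce job: mappers emit item pairs, reducers sum the counts, which is what lets the approach scale to very large interaction logs.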
Using Hadoop for
● ads and recommendations
● online travel
● processing mobile data
● energy savings and discovery
● infrastructure management
● image processing
● fraud detection
● IT security
● health care