An introduction to Big Data processing using Hadoop
TRANSCRIPT
An introduction to
Big Data processing
using Hadoop
A.Sedighi
hexican.com
No single standard definition…
“Big Data” is data whose scale, diversity,
and complexity require new architectures,
techniques, algorithms, and analytics to
manage it and extract value and hidden
knowledge from it…
Big Data, Definition
Information is powerful…
but it is how we use it that will
define us
Data Explosion
relational
text
audio
video
images
Big Data Era
- creates over 30 billion pieces of content per day
- stores 30 petabytes of data
- produces over 90 million tweets per day
Log Files
- Log files contain data.
- Each banking transaction should be logged at
different levels.
How many log files does a banking solution
generate per day?
Big Data: 3 V's
volume, velocity, variety
Some make it 3 V's
What is driving Big Data Industry?
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, from many sources
- Very large datasets
- More real-time

versus:

- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
Big Data Challenges
Big Data Challenges
Sorting 10 TB:
- 1 node takes 2.5 days (O(N log N))
- 100 nodes take 35 minutes (O(log N))
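A quick back-of-envelope check of the slide's numbers, assuming ideal (linear) speedup across the cluster:

```python
# Sanity check of the sorting claim above, using the slide's own
# figures and assuming the work divides evenly across all nodes.

one_node_seconds = 2.5 * 24 * 3600   # 2.5 days on a single node
nodes = 100

# Ideal linear speedup: total time shrinks by the node count.
cluster_seconds = one_node_seconds / nodes
print(cluster_seconds / 60)          # ~36 minutes, close to the slide's 35
```

The small gap between 36 and 35 minutes is within rounding; in practice, shuffle and coordination overhead keeps real clusters below ideal speedup.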
Big Data Challenges
Problem: “fat” servers imply high cost.
Solution: use cheap commodity nodes instead.
Problem: a large number of cheap nodes implies
frequent failures.
Solution: leverage automatic fault tolerance.
Big Data Challenges
We need a new data-parallel programming
model for clusters of commodity machines.
What Technology Do We Have
For Big Data ?
Map Reduce
MapReduce
Published in 2004 by Google; popularized by the Apache Hadoop project.
Used by Yahoo!, Facebook, Twitter, Amazon, LinkedIn, and many other enterprises.
Word Count Example
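The slide's diagram isn't reproduced in this transcript, so as a rough sketch: the word-count flow can be mimicked in plain Python (the input lines below are made up; a real Hadoop job would implement the Mapper and Reducer in Java or use Hadoop Streaming):

```python
# A minimal word-count sketch mimicking the MapReduce phases:
# map -> shuffle/sort -> reduce. Illustrative only, not the Hadoop API.
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Sum all the counts emitted for the same word.
    return (word, sum(counts))

lines = ["big data big clusters", "data everywhere"]  # made-up input

# Map: apply the mapper to every input record.
pairs = [pair for line in lines for pair in map_phase(line)]

# Shuffle/sort: group the pairs by key, as Hadoop does between phases.
pairs.sort(key=itemgetter(0))
result = dict(
    reduce_phase(word, (count for _, count in group))
    for word, group in groupby(pairs, key=itemgetter(0))
)
print(result)  # {'big': 2, 'clusters': 1, 'data': 2, 'everywhere': 1}
```

The shuffle/sort step is what Hadoop provides for free: all values for the same key end up at the same reducer, which is why word count parallelizes so naturally.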
MapReduce philosophy
-hide complexity
-make it scalable
-make it cheap
MapReduce popularized by
Apache Hadoop project
Hadoop Overview
Open source implementation of Google
MapReduce and the Google File System (GFS).
First released in 2008 by Yahoo!
Widely adopted by Facebook, Twitter, Amazon, etc.
Everything Started By Searching
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
Hadoop Sub Projects - 1
Hadoop Sub Projects - 2
Hadoop Distributed File System (HDFS) - 1
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
-“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Hadoop Distributed File System (HDFS) - 2
-HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. The time to read the whole dataset is more important than the latency in reading the first record.
Hadoop Distributed File System (HDFS) - 3
- HDFS is designed to carry on working, without a noticeable interruption to the user, in the face of node failure.
Where HDFS doesn't work well
● Low-latency data access
● Lots of small files
● Multiple writers, arbitrary file modifications
MapReduce and HDFS
HDFS Concepts - Blocks
64 MB, 128 MB, or 256 MB block size.
If the seek time is around 10 ms, and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time, we need to make the block size around 100 MB.
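The arithmetic behind that block-size rule of thumb can be checked directly (numbers taken from the slide itself):

```python
# Deriving the block size from the slide's rule of thumb:
# keep the seek at 1% of the transfer time, i.e.
#   seek_time = 0.01 * (block_size / transfer_rate)

seek_time_s = 0.010      # ~10 ms average disk seek (slide's figure)
transfer_rate = 100e6    # 100 MB/s sustained transfer (slide's figure)
seek_fraction = 0.01     # seek should be 1% of transfer time

block_size = seek_time_s * transfer_rate / seek_fraction
print(block_size / 1e6)  # 100.0 -> about 100 MB per block
```

This is why HDFS blocks are orders of magnitude larger than ordinary filesystem blocks: the larger the block, the less the fixed seek cost matters relative to streaming the data.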
Anatomy of a File Read
Anatomy of a File Write
Replica Replacement
Machine Learning - 1
Mahout's goal is to build scalable machine learning libraries. Core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
Machine Learning - 2
Mahout can be used as a recommender engine on top of Hadoop clusters.
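As an illustration only (this is not Mahout's API), a toy item co-occurrence recommender in plain Python shows the collaborative-filtering idea that Mahout scales out over Hadoop; all the item names below are made up:

```python
# Toy item-based collaborative filtering: recommend items that often
# appear in the same "basket" as a given item. Not Mahout's API.
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories (one set of items per user).
baskets = [
    {"hadoop-book", "hdfs-guide", "mapreduce-paper"},
    {"hadoop-book", "hdfs-guide"},
    {"hadoop-book", "mahout-docs"},
]

# Count how often each pair of items co-occurs in a basket.
cooccurrence = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        cooccurrence[(a, b)] += 1
        cooccurrence[(b, a)] += 1

def recommend(item, n=2):
    # Rank the items most often seen together with `item`.
    scored = [(other, c) for (i, other), c in cooccurrence.items() if i == item]
    return [other for other, _ in sorted(scored, key=lambda x: -x[1])[:n]]

print(recommend("hadoop-book"))
```

On a Hadoop cluster this co-occurrence counting becomes a MapReduce job: mappers emit item pairs, reducers sum the counts, which is what lets the approach scale to very large interaction logs.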
Using Hadoop for
● ads and recommendations
● online travel
● processing mobile data
● energy savings and discovery
● infrastructure management
● image processing
● fraud detection
● IT security
● health care