BIG DATA & HDFS

Anuradha Bhatia, January 9, 2016


OUTLINE


Big Data

Characteristics of Big Data

Traditional v/s Streaming Data

Hadoop

Hadoop Architecture

BIG DATA


Big data is a collection of both structured and unstructured data that is too large, too fast-moving and too varied to be managed by traditional database management tools or traditional data processing applications.

Examples:

Data managed by e-commerce websites for search requests, consumer/customer recommendations, current trends and merchandising

Data managed by social media for providing a social network platform

Data managed by real-time auctions/bidding in online environments


IMPLEMENTATION

Stock market
• Impact of weather on securities prices
• 5 million messages per second; a trade in 150 microseconds

Natural systems
• Wildfire management
• Water management

Fraud prevention
• Detecting multi-party fraud
• Real-time fraud prevention

Radio astronomy
• Detection of transient events

Health & life sciences
• Neonatal ICU monitoring
• Epidemic early warning systems
• Remote healthcare monitoring

Transportation
• Intelligent traffic management
• Global air traffic management

Law enforcement
• Real-time multimodal surveillance

Manufacturing
• Process control for microchip fabrication


BIG DATA USES

BIG DATA CONSISTS OF …


WHICH PLATFORM DO YOU CHOOSE?


[Diagram: platforms mapped against the data spectrum from Structured through Semi-Structured to Unstructured: General Purpose RDBMS, Analytic Database, and Hadoop.]


CHARACTERISTICS OF BIG DATA

Volume
• Systems/users generating terabytes, petabytes and zettabytes of data

Velocity
• System-generated streams of data
• Multiple sources feeding data into one system

Variety
• Structured data
• Unstructured data: blogs, images, audio, etc.

These characteristics create challenges in storage, processing and presentation.


VALUE CHAIN OF BIG DATA

Data Generation: sources of data, e.g., users, enterprises, systems, etc.

Data Collection: companies, tools and sites aggregating data, e.g., IMS

Data Analysis: research and analytics firms, e.g., Mu Sigma

Application of Insights: management consulting firms, MNCs


[Diagram: queries are submitted against static data and return results.]

TRADITIONAL COMPUTING

Historical fact finding with data-at-rest

Batch paradigm, pull model

Query-driven: submits queries to static data

Relies on Databases, Data Warehouses


STREAM COMPUTING

Real-time analysis of data-in-motion.

Streaming data: a stream of structured or unstructured data-in-motion.

Stream computing: analytic operations on streaming data in real time.

STREAM COMPUTING EVENTS


Streams span a large spectrum of events and data: text and transactional data, news broadcasts, digital audio, video and image data, RFID, financial data, network packet traces, instant messages, satellite data, phone conversations, web searches, ATM transactions, pervasive sensor data, click streams.

At the structured-data end of the spectrum:
Well-defined events
Simple analytics
High usefulness density
High speed (millions of events per second)
Very low latency

At the unstructured-data end of the spectrum:
Unknown data/signals
Events that need to be detected
Complex analytics
Low usefulness density
High volume (TB/sec)
Low latency


HADOOP

An ecosystem of open source projects, hosted by the Apache Foundation.

Google developed and shared the underlying concepts.

A distributed file system that scales out on commodity servers with direct attached storage and automatic failover.

HADOOP SYSTEM


Source: Hortonworks

HADOOP DISTRIBUTED FILE SYSTEM - HDFS


Hadoop Distributed File System (HDFS)

HDFS is the primary implementation of the Hadoop file system, represented by the Java abstract class org.apache.hadoop.fs.FileSystem.

HDFS is designed to work efficiently in conjunction with MapReduce.

Definition

A distributed file system that provides a big data storage solution through high-throughput access to application data.

When data can potentially outgrow the storage capacity of a single machine, partitioning it across a number of separate machines becomes necessary for storage and processing. This is achieved using a distributed file system.

Potential Challenges:

Ensuring data integrity

Data retention in case of node failure

Integration across multiple nodes and systems

HADOOP DISTRIBUTED FILE SYSTEM - HDFS


Hadoop Distributed File System (HDFS)

HDFS is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

Very Large Files:
Very large means files that are hundreds of MB, GB, TB or even PB in size.

Streaming Data Access:
HDFS implements a write-once, read-many-times pattern. Data is copied from the source for analysis over time. Each analysis involves a large portion of the dataset, so the time to read the whole dataset matters more than the latency of reading the first record.

Commodity Hardware:
Hadoop runs on clusters of commodity (commonly available) hardware. HDFS is designed to carry on working without noticeable interruption to the user in the case of node failure.

HADOOP DISTRIBUTED FILE SYSTEM - HDFS


Where HDFS doesn't work well:

HDFS is not designed for the following scenarios:

Low-Latency Data Access:
HDFS is optimised for delivering a high throughput of data, and this may come at the expense of latency.

Lots of Small Files:
File system metadata is stored in memory, so the limit on the number of files in a file system is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes; one million files, each occupying one block, therefore need roughly 300 MB of namenode memory.

Multiple Updates in the File:
Files in HDFS may be written to by a single writer, with writes always made at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file.

HADOOP DISTRIBUTED FILE SYSTEM - CONCEPT


Blocks

NameNode

DataNodes

HDFS Federation

HDFS High Availability

HADOOP DISTRIBUTED FILE SYSTEM - BLOCKS


Files in HDFS are broken into blocks of 64 MB (the default) and stored as independent units.

A file in HDFS that is smaller than a single block does not occupy a full block's worth of storage.

HDFS blocks are large compared to disk blocks to minimize the cost of seeks.

Map tasks in MapReduce operate on one block at a time.

Using the block, rather than the file, as the unit of abstraction simplifies the storage subsystem, which manages the metadata information.

Blocks fit well with replication for providing fault tolerance and availability.

HDFS's fsck command understands blocks. For example, to list the blocks that make up each file in the system:

% hadoop fsck / -files -blocks
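The same block information can also be obtained programmatically. As an illustrative sketch (not from the original slides), using the Java FileSystem API against a hypothetical file /foodir/sample.txt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical path, for illustration only
    Path file = new Path("/foodir/sample.txt");
    FileStatus status = fs.getFileStatus(file);
    // One BlockLocation per block, with the datanodes holding its replicas
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println(b.getOffset() + " " + b.getLength() + " "
          + String.join(",", b.getHosts()));
    }
  }
}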

HDFS – NAMENODES & DATANODES


An HDFS cluster consists of:

NameNode

DataNodes


[Diagram: the HDFS read path. The client asks the Namenode for block locations through DistributedFileSystem, then reads the blocks from the Datanodes via FSDataInputStream.]

HDFS – NAMENODES & DATANODES


HDFS ARCHITECTURE

WHAT DOES IT DO?


Hadoop implements Google's MapReduce, using HDFS.

HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster.

MapReduce can then process the data where it is located.

Hadoop's target is to run on clusters on the order of 10,000 nodes.

DATA CHARACTERISTICS


Batch processing rather than interactive user access.

Large data sets and files: gigabytes to terabytes in size.

High aggregate data bandwidth.

Scales to hundreds of nodes in a cluster.

Tens of millions of files in a single instance.

Write-once-read-many: a file, once created, written and closed, need not be changed; this assumption simplifies coherency.

A MapReduce application or a web-crawler application fits perfectly with this model.

DATA BLOCKS


HDFS supports write-once-read-many semantics, with reads at streaming speeds.

A typical block size is 64 MB (or even 128 MB).

A file is chopped into 64 MB chunks and stored.

FILESYSTEM NAMESPACE


Hierarchical file system with directories and files.

Create, remove, move, rename, etc.

The Namenode maintains the file system namespace.

Any metadata change to the file system is recorded by the Namenode.

An application can specify the number of replicas of a file it needs: the replication factor of the file. This information is stored in the Namenode.
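As a brief illustration (my sketch, with a hypothetical path), the replication factor of an existing file can be changed through the Java FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical file; ask the Namenode to keep 5 replicas of each of its blocks
    boolean ok = fs.setReplication(new Path("/foodir/important.dat"), (short) 5);
    System.out.println("replication change accepted: " + ok);
  }
}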

FS SHELL, ADMIN AND BROWSER INTERFACE


HDFS organizes its data in files and directories.

It provides a command line interface, called the FS shell, that lets the user interact with data in HDFS.

The syntax of the commands is similar to bash and csh.

Example: to create the directory /foodir

% bin/hadoop dfs -mkdir /foodir

There is also a DFSAdmin interface available.

A browser interface is also available to view the namespace.

Steps for HDFS


1. Blocks
2. Replication
3. Staging

DATA REPLICATION


HDFS is designed to store very large files across machines in a large cluster.

Each file is a sequence of blocks.

All blocks in the file except the last are the same size.

Blocks are replicated for fault tolerance.

Block size and replica count are configurable per file.

The Namenode receives a Heartbeat and a BlockReport from each Datanode in the cluster.

A BlockReport lists all the blocks on a Datanode.

REPLICA PLACEMENT


The placement of replicas is critical to HDFS reliability and performance.

Optimizing replica placement distinguishes HDFS from most other distributed file systems.

Rack-aware replica placement:

With the default replication factor of three, replicas are placed: one on a node in the local rack, one on a different node in the local rack, and one on a node in a different rack.

One third of the replicas sit on one node, two thirds of the replicas sit on one rack, and the remaining third are distributed evenly across the remaining racks.

REPLICA SELECTION


For a READ operation, HDFS selects the replica that minimizes bandwidth consumption and latency.

If there is a replica on the reader's own node, that replica is preferred.

An HDFS cluster may span multiple data centers: a replica in the local data center is preferred over a remote one.

STAGING


A client request to create a file does not reach the Namenode immediately.

The HDFS client caches the file data in a temporary local file. When the accumulated data reaches one HDFS block size, the client contacts the Namenode.

The Namenode inserts the file name into its hierarchy and allocates a data block for it.

The Namenode responds to the client with the identity of the Datanode and the destination Datanodes for the replicas of the block.

The client then flushes the block from its local memory to the designated Datanode.

STAGING


When the file is closed, the client sends a message to the Namenode.

The Namenode then commits the file creation operation into its persistent store.

If the Namenode dies before the file is closed, the file is lost.

This client-side caching is required to avoid network congestion.

SAFEMODE STARTUP


On startup the Namenode enters Safemode.

Replication of data blocks does not occur in Safemode.

Each Datanode checks in with a Heartbeat and a BlockReport.

The Namenode verifies that each block has an acceptable number of replicas.

After a configurable percentage of safely replicated blocks has checked in with the Namenode, the Namenode exits Safemode.

It then makes a list of the blocks that still need to be replicated, and proceeds to replicate these blocks to other Datanodes.

NAMENODE


The Namenode keeps an image of the entire file system namespace and the file Blockmap in memory.

8 GB of local RAM is sufficient to support these data structures even for a huge number of files and directories.

When the Namenode starts up, it reads the FsImage and EditLog from its local file system, applies the EditLog transactions to the FsImage, and then stores a copy of the updated FsImage back on the file system as a checkpoint.

Checkpointing is done periodically, so that the system can recover to the last checkpointed state in case of a crash.

FILESYSTEM METADATA


The HDFS namespace is stored by the Namenode.

The Namenode uses a transaction log, called the EditLog, to record every change that occurs to the file system metadata, for example:

o Creating a new file

o Changing the replication factor of a file

The EditLog is stored in the Namenode's local file system.

The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage, also kept in the Namenode's local file system.

DATANODE


A Datanode stores HDFS data in files in its local file system.

The Datanode has no knowledge of the HDFS file system; it stores each block of HDFS data in a separate local file.

The Datanode does not create all files in the same directory: it uses heuristics to determine the optimal number of files per directory and creates subdirectories appropriately. (A research issue?)

When the file system starts up, the Datanode generates a list of all the HDFS blocks it holds and sends this report to the Namenode: the Blockreport.

NAMENODES & DATANODES


Master/slave architecture.

An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients.

There are a number of Datanodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on.

HDFS exposes a file system namespace and allows user data to be stored in files.

A file is split into one or more blocks, and the set of blocks is stored on Datanodes.

Datanodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the Namenode.

NAMENODES & DATANODES


The NameNode manages file namespace operations like opening, creating and renaming files, and maintains:

The file name to block list + location mapping

File metadata

Authorization and authentication

It collects block reports from DataNodes on block locations, and re-replicates missing blocks.

It keeps ALL of the namespace in memory, plus checkpoints and a journal.

NAMENODES & DATANODES


The DataNode handles block storage on multiple volumes, and data integrity.

Clients access the blocks directly from DataNodes for reads and writes.

DataNodes periodically send block reports to the NameNode.

Blocks are created, deleted and replicated upon instruction from the NameNode.


CLIENT

FAULT TOLERANCE


Failure is the norm rather than the exception.

An HDFS instance may consist of thousands of low-end machines, each storing part of the file system's data.

With a huge number of components, each having a non-trivial probability of failure, some component is always non-functional.

Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

DATANODE FAILURE & HEARTBEAT


A network partition can cause a subset of Datanodes to lose connectivity with the Namenode.

The Namenode detects this condition by the absence of Heartbeat messages.

The Namenode marks Datanodes without a Heartbeat as dead and stops sending I/O requests to them.

Any data registered to a failed Datanode is no longer available to HDFS.

The death of a Datanode may also cause the replication factor of some blocks to fall below their specified value.

RE-REPLICATION


The necessity for re-replication may arise due to:

o A Datanode may become unavailable,

o A replica may become corrupted,

o A hard disk on a Datanode may fail, or

o The replication factor on the block may be increased.

HDFS – FAULT TOLERANCE


The input data (on HDFS) is stored on the local disks of the machines in the cluster. HDFS divides each file into 64 MB blocks, and stores several copies of each block (typically 3 copies) on different machines.

Worker Failure: The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.

Master Failure: It is easy to make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state. However, in most cases, the user simply restarts the job.

CLUSTER - REBALANCING


The HDFS architecture is compatible with data rebalancing schemes.

A scheme might move data from one Datanode to another if the free space on a Datanode falls below a certain threshold.

In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster.

DATA INTEGRITY


Consider a situation where a block of data fetched from a Datanode arrives corrupted. The corruption may occur because of faults in a storage device, network faults, or buggy software.

An HDFS client computes a checksum of every block of its files and stores the checksums in hidden files in the HDFS namespace.

When a client retrieves the contents of a file, it verifies that the data matches the corresponding checksums.

If they do not match, the client can retrieve the block from a replica.
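HDFS uses CRC-based checksums for this. As a rough illustration of the idea only (not HDFS's actual code path), the following Java sketch computes a checksum when a block is written and verifies it on read:

import java.util.zip.CRC32;

public class ChecksumDemo {
  static long checksum(byte[] block) {
    CRC32 crc = new CRC32();
    crc.update(block);   // accumulate the checksum over the whole block
    return crc.getValue();
  }

  public static void main(String[] args) {
    byte[] block = "some block contents".getBytes();
    long stored = checksum(block);        // computed when the block is written
    // Later, on read, recompute and compare; a mismatch means corruption
    boolean corrupted = checksum(block) != stored;
    System.out.println("corrupted: " + corrupted);
  }
}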

METADATA DISK FAILURE


The FsImage and EditLog are the central data structures of HDFS.

Corruption of these files can cause an HDFS instance to become non-functional.

For this reason, a Namenode can be configured to maintain multiple copies of the FsImage and EditLog, which are then updated synchronously.

Metadata is not data-intensive, so this is affordable.

The Namenode remains a single point of failure: automatic failover is NOT supported!


BACKUP

SAFEMODE


On startup the NameNode enters SafeMode.

Replication of data blocks does not occur in SafeMode.

Each DataNode checks in with a HeartBeat and a BlockReport.

The NameNode verifies that each block has an acceptable number of replicas.

After a configurable percentage of safely replicated blocks has checked in with the NameNode, the NameNode exits SafeMode.

It then makes a list of the blocks that still need to be replicated, and proceeds to replicate these blocks to other DataNodes.

APPLICATION PROGRAMMING INTERFACE


HDFS provides a Java API for applications to use.

Python access is also used in many applications.

A C language wrapper for the Java API is also available.

An HTTP browser can be used to browse the files of an HDFS instance.
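A minimal sketch of the Java API in action, reading a file from HDFS and copying it to standard output (the path is hypothetical, and the Hadoop client libraries and configuration are assumed to be on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml etc.
    FileSystem fs = FileSystem.get(conf);      // the org.apache.hadoop.fs.FileSystem abstraction
    try (FSDataInputStream in = fs.open(new Path("/foodir/sample.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);  // stream the file to stdout
    }
  }
}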

SPACE RECLAMATION


When a file is deleted by a client, HDFS renames the file into the /trash directory, where it remains for a configurable amount of time.

A client can request an undelete within this allowed time.

After the specified time the file is deleted and the space is reclaimed.

When the replication factor of a file is reduced, the Namenode selects the excess replicas that can be deleted.

The next heartbeat reply transfers this information to the Datanode, which then clears the blocks and frees the space.

HADOOP ECOSYSTEM


HDFS is the file system.

MapReduce (MR) is the job framework that runs on the file system; an MR job lets the user ask questions of the files held in HDFS.

Pig and Hive are two projects built to replace hand-coding MapReduce: the Pig and Hive interpreters turn scripts and SQL queries into MR jobs.

Impala and Hive remove the MapReduce-only dependency for querying data on HDFS:

Impala is optimized for low-latency queries (near real time).

Hive is optimized for batch processing jobs.

Sqoop: moves data from a relational DB into the Hadoop ecosystem.

Flume: moves data generated by external systems into HDFS; apt for high-volume logging.

Hue: graphical frontend to the cluster.

Oozie: workflow management tool.

Mahout: machine learning library.


HADOOP ECOSYSTEM

[Diagram: the Hadoop ecosystem stack. HDFS at the bottom; MR, Impala and HBase above it; Pig and Hive on top of MR; Sqoop and Flume feeding data in; Hue, Oozie and Mahout alongside; packaged together as Cloudera's CDH.]


STORAGE OF FILE IN HDFS


When a 150 MB file is fed into the Hadoop ecosystem, it is broken into multiple parts to achieve parallelism.

The file is split into chunks, where the default chunk (block) size is 64 MB.

The Datanode is the daemon which takes care of everything happening at an individual node.

The Namenode is the one which keeps track of what goes where and, when required, how to gather the pieces back together.

Now think hard: what could be the possible challenges?
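To make the split concrete: with the default 64 MB blocks, a 150 MB file becomes 64 + 64 + 22 MB, i.e. three blocks, and the last block occupies only 22 MB of storage. A small sketch of the arithmetic:

public class BlockCount {
  public static void main(String[] args) {
    long fileMb = 150;    // the 150 MB file from the slide
    long blockMb = 64;    // default HDFS block size here
    long fullBlocks = fileMb / blockMb;      // 2 full 64 MB blocks
    long lastBlockMb = fileMb % blockMb;     // 22 MB in the final block
    long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);
    System.out.println(totalBlocks + " blocks, last block " + lastBlockMb + " MB");
  }
}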

STORAGE OF FILE IN HDFS


HDFS – APPLICATION


HDFS – APPLICATION

[Screenshots: moving a file into Hadoop; the moved file in the Hadoop ecosystem.]

Application of HDFS: moving a data file for analysis.


MAPREDUCE STRUCTURE

NoSQL


What is NoSQL

CAP Theorem

What is lost

Types of NoSQL

Data Model

Frameworks

Demo

Wrap-up

SCALING UP


Issues arise with scaling up when the dataset is just too big.

RDBMS were not designed to be distributed.

People began to look at multi-node database solutions, known as 'scaling out' or 'horizontal scaling'.

Different approaches include:

o Master-slave

o Sharding

RDBMS - MASTER/SLAVE


Master-Slave

o All writes are written to the master. All reads are performed against the replicated slave databases.

o Critical reads may be incorrect, as writes may not have been propagated down yet.

o Large data sets can pose problems, as the master needs to duplicate data to the slaves.

RDBMS - SHARDING


Partition or Sharding

o Scales well for both reads and writes

o Not transparent, application needs to be partition-aware

o Can no longer have relationships/joins across partitions

o Loss of referential integrity across shards

SCALING RDBMS


Multi-Master replication

INSERT only, not UPDATES/DELETES

No JOINs, thereby reducing query time

o This involves de-normalizing data

In-memory databases

NoSQL


Stands for Not Only SQL

Class of non-relational data storage systems

Usually do not require a fixed table schema nor do they use the

concept of joins

All NoSQL offerings relax one or more of the ACID properties

WHY NoSQL ??


For data storage, an RDBMS cannot be the be-all/end-all.

Just as there are different programming languages, you need to have other data storage tools in the toolbox.

A NoSQL solution is more acceptable to a client now than even a year ago.

BIG TABLE


Three major papers were the seeds of the NoSQL movement:

o BigTable (Google)

o Dynamo (Amazon)

  Gossip protocol (discovery and error detection)

  Distributed key-value data store

  Eventual consistency

o CAP Theorem

CAP THEOREM


Three properties of a shared-data system: Consistency, Availability and Partition tolerance.

You can have at most two of these three properties for any shared-data system.

To scale out, you have to partition. That leaves either consistency or availability to choose from.

o In almost all cases, you would choose availability over consistency.

CHARACTERISTICS OF NoSQL


NoSQL solutions fall into two major areas:

o Key/Value or ‘the big hash table’.

Amazon S3 (Dynamo)

Voldemort

Scalaris

o Schema-less, which comes in multiple flavors: column-based, document-based or graph-based.

Cassandra (column-based)

CouchDB (document-based)

Neo4J (graph-based)

HBase (column-based)

KEY VALUE


Pros

o Very fast

o Very scalable

o Simple model

o Able to distribute horizontally

Cons

o Many data structures (objects) can't be easily modeled as key-value pairs

SCHEMA-LESS


Pros

o Schema-less data model is richer than key/value pairs

o Eventual consistency

o Many are distributed

o Still provide excellent performance and scalability

Cons

o Typically no ACID transactions or joins

SQL TO NoSQL

Moving from SQL to NoSQL typically means giving up:

Joins

Group by

Order by

ACID transactions

SQL as a sometimes frustrating but still powerful query language

Easy integration with other applications that support SQL

SEARCHING


Relational:

o SELECT `column` FROM `database`.`table` WHERE `id` = key;

o SELECT product_name FROM rockets WHERE id = 123;

Cassandra (standard):

o keyspace.getSlice(key, "column_family", "column")

o keyspace.getSlice(123, new ColumnParent("rockets"), getSlicePredicate());

NoSQL API


Basic API access:

o get(key) -- extract the value given a key

o put(key, value) -- create or update the value given its key

o delete(key) -- remove the key and its associated value

o execute(key, operation, parameters) -- invoke an operation on the value (given its key), where the value is a special data structure (e.g. List, Set, Map, etc.)
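As a toy illustration of this API shape (my own sketch, not any particular product's interface), an in-memory key-value store in Java:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TinyKvStore<K, V> {
  private final Map<K, V> data = new ConcurrentHashMap<>();

  public V get(K key) { return data.get(key); }              // extract the value for a key
  public void put(K key, V value) { data.put(key, value); }  // create or update
  public void delete(K key) { data.remove(key); }            // remove key and value

  public static void main(String[] args) {
    TinyKvStore<Integer, String> store = new TinyKvStore<>();
    store.put(123, "rocket");
    System.out.println(store.get(123));  // -> rocket
    store.delete(123);
    System.out.println(store.get(123));  // -> null
  }
}

A real distributed store would, of course, partition and replicate the data across nodes rather than hold it in one process.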

DATA MODEL


Within a Cassandra data set:

o Column: the smallest data element, a tuple with a name and a value.

Asking for :hadoop, '1' might return:

{'name' => 'Hadoop Model',
 'toon' => 'Ready Set Zoom',
 'inventoryQty' => '5',
 'productUrl' => 'hadoop\1.gif'}

DATA MODEL


o ColumnFamily: a single structure used to group both Columns and SuperColumns. Called a ColumnFamily (think table), it has two types, Standard and Super. Column families must be defined at startup.

o Key: the permanent name of the record.

o Keyspace: the outermost level of organization; usually the name of the application, for example 'Acme' (think database name).

HASHING


[Diagram: keys and nodes (A, B, C, D, H, M, R, S, V) placed around a consistent-hashing ring.]

HASHING


Partitioning uses consistent hashing:

o Keys hash to a point on a fixed circular space.

o The ring is partitioned into a set of ordered slots, and servers and keys are hashed over these slots.

Nodes take positions on the circle. Suppose A, B, and D exist:

o B is responsible for the AB range.

o D is responsible for the BD range.

o A is responsible for the DA range.

When C joins:

o B and D split their ranges.

o C takes over the BC range from D.
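A minimal Java sketch of such a ring (illustrative only; real systems add virtual nodes and stronger hash functions than hashCode()):

import java.util.SortedMap;
import java.util.TreeMap;

public class HashRing {
  private final TreeMap<Integer, String> ring = new TreeMap<>();

  private int hash(String s) { return s.hashCode() & 0x7fffffff; }  // non-negative position

  public void addNode(String node) { ring.put(hash(node), node); }

  public String nodeFor(String key) {
    // Walk clockwise: the first node at or after the key's position owns it
    SortedMap<Integer, String> tail = ring.tailMap(hash(key));
    return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
  }

  public static void main(String[] args) {
    HashRing r = new HashRing();
    r.addNode("A"); r.addNode("B"); r.addNode("D");
    System.out.println(r.nodeFor("some-key"));
    r.addNode("C");  // only keys in the BC range move from D to C
    System.out.println(r.nodeFor("some-key"));
  }
}

The key property is visible in the code: when C joins, only the keys whose hash falls in the BC range change owner; every other key still maps to the same node.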

DATA TYPE


Columns are always sorted by their name. Sorting supports:

o BytesType

o UTF8Type

o LexicalUUIDType

o TimeUUIDType

o AsciiType

o LongType

Each of these options treats the Columns' names as a different data type.

CASE STUDY


Facebook Search

MySQL > 50 GB Data

o Writes Average : ~300 ms

o Reads Average : ~350 ms

Rewritten with NoSQL > 50 GB Data

o Writes Average : 0.12 ms

o Reads Average : 15 ms

IMPLEMENTATION OF NoSQL


Log analysis

Social networking feeds (many firms hooked in through Facebook or Twitter)

External feeds from partners (EAI)

Data that is not easily analyzed in an RDBMS, such as time-based data

Large data feeds that need to be massaged before entry into an RDBMS

SHARED DATA ARCHITECTURE


[Diagram: nodes A, B, C, D all accessing shared data volumes 1, 2, 3, 4.]

SHARED NOTHING ARCHITECTURE


[Diagram: nodes A, B, C, D, each with its own data; nothing shared.]

LIST OF NoSQL DATABASES


Wide Column Store

o Hadoop / HBase
o Cloudera
o Cassandra
o Hypertable
o Accumulo
o Amazon SimpleDB
o Cloudata
o MonetDB

LIST OF NoSQL DATABASES


Document Store

o OrientDB
o MongoDB
o Couchbase Server
o CouchDB
o RavenDB
o MarkLogic Server
o JSON ODM

LIST OF NoSQL DATABASES


Key Value Store

o DynamoDB
o Azure
o Riak
o Redis
o Aerospike
o LevelDB
o RocksDB

LIST OF NoSQL DATABASES


Graph Databases

o Neo4J
o ArangoDB
o InfiniteGraph
o Sparksee
o Titan
o InfoGrid
o GraphBase

MapReduce


Map Operations

Reduce Operations

Submitting a MapReduce Job

Shuffle

Data types

MapReduce


MapReduce is a programming model for processing and generating large data sets.

Using a functional model with user-specified Map and Reduce operations allows large computations to be parallelized.

map(k1, v1) → list(k2, v2)

Map OPERATION


The common array operation:

var a = [1, 2, 3];
for (i = 0; i < a.length; i++)
    a[i] = a[i] * 2;

The output is:

var a = [2, 4, 6];

Map OPERATION


When fn is passed as a function argument:

function map(fn, a)
{
    for (i = 0; i < a.length; i++)
        a[i] = fn(a[i]);
}

The map function is invoked as:

map(function(x) { return x * 2; }, a);

Reduce FUNCTION


The reduce operation accepts an intermediate key and the set of values for that key, and merges these values together to form a smaller set of values.

reduce(k2, list(v2)) → list(v2)

EXECUTION OVERVIEW


The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits.

Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function.

The number of partitions (R) and the partitioning function are specified by the user.
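A minimal sketch of such a partitioning function, assuming the common default of hashing the key modulo R, as described in the MapReduce paper:

public class Partitioner {
  // hash(key) mod R: the default partitioning function from the MapReduce paper
  static int partition(String key, int R) {
    return (key.hashCode() & Integer.MAX_VALUE) % R;  // mask keeps the hash non-negative
  }

  public static void main(String[] args) {
    System.out.println(partition("hadoop", 4));  // which of 4 reduce tasks gets key "hadoop"
  }
}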

Map FUNCTION FOR WORDCOUNT


Key: document name
Value: document contents

map(String key, String value):
    for each word w in value:
        EmitIntermediate(w, "1");

Reduce FUNCTION FOR WORDCOUNT


Key: a word
Values: a list of counts

reduce(String key, Iterator values):
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
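For comparison, a runnable Hadoop (Java) version of the same word count; this is essentially the stock WordCount example that ships with Hadoop, with the standard driver boilerplate included:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);  // EmitIntermediate(w, "1")
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();  // result += ParseInt(v)
      context.write(key, new IntWritable(sum));         // Emit(AsString(result))
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}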

MapReduce AT HIGH LEVEL


[Diagram: a MapReduce job submitted by the client machine goes to the Job Tracker on the Master Node; each Slave Node runs a Task Tracker that launches Task Instances.]

ANATOMY OF MapReduce


[Diagram: input data is split across Nodes 1 to 3; each node runs a MAP task producing interim data, which is partitioned and shuffled to Reduce tasks, whose output is written to output nodes.]

SUMMARY


A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner.

The framework sorts the outputs of the maps, shuffles the sorted output based on its key, and then feeds it to the reduce tasks.

The input and the output of the job are stored in a file system.

The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

QUESTIONS

