OUTLINE
January 9, 2016Anuradha Bhatia, Big Data & HDFS, NoSQL
Big Data
Characteristics of Big Data
Traditional v/s Streaming Data
Hadoop
Hadoop Architecture
BIG DATA
Big data is a collection of both structured and
unstructured data that is too large, fast and distinct to
be managed by traditional database management tools
or traditional data processing applications.
E.g.:
Data managed by e-commerce websites for search requests, consumer/customer recommendations, current trends and merchandising
Data managed by social media for providing a social network platform
Data managed by real-time auctions/bidding in online environments
IMPLEMENTATION
Stock market
• Impact of weather on securities prices
• 5 million messages per second, trade in 150 microseconds
Natural systems
• Wildfire management
• Water management
Fraud prevention
• Detecting multi-party fraud
• Real-time fraud prevention
Radio astronomy
• Detection of transient events
Health & life sciences
• Neonatal ICU monitoring
• Epidemic early warning systems
• Remote healthcare monitoring
Transportation
• Intelligent traffic management
• Global air traffic management
Law enforcement
• Real-time multimodal surveillance
Manufacturing
• Process control for microchip fabrication
WHICH PLATFORM DO YOU CHOOSE?
[Figure: platform choice across structured, semi-structured and unstructured data — general-purpose RDBMS, analytic database, Hadoop]
CHARACTERISTICS OF BIG DATA
• Volume: systems/users generating terabytes, petabytes and zettabytes of data
• Velocity: system-generated streams of data; multiple sources feeding data into one system
• Variety: structured data; unstructured data (blogs, images, audio, etc.)
These characteristics drive challenges in storage, processing and presentation.
VALUE CHAIN OF BIG DATA
Data Generation — source of data, e.g., users, enterprises, systems
Data Collection — companies, tools and sites aggregating data, e.g., IMS
Data Analysis — research and analytics firms, e.g., Mu Sigma
Application of Insights — management consulting firms, MNCs
[Figure: queries submitted against static data, results returned]
TRADITIONAL COMPUTING
Historical fact finding with data-at-rest
Batch paradigm, pull model
Query-driven: submits queries to static data
Relies on Databases, Data Warehouses
STREAM COMPUTING
Real time analysis of data-in-motion
Streaming data
A stream of structured or unstructured data-in-motion
Stream Computing
Analytic operations on streaming data in real time
STREAM COMPUTING EVENTS
A large spectrum of events/data: text and transactional data, news broadcasts, digital audio, video and image data, RFID, financial data, network packet traces, instant messages, satellite data, phone conversations, web searches, ATM transactions, pervasive sensor data, click streams, unknown data/signals
Structured data
• High usefulness density
• Simple analytics
• Well-defined events
• High speed (millions of events per second)
• Very low latency
Unstructured data
• Low usefulness density
• Complex analytics
• Events need to be detected
• High volume (TB/sec)
• Low latency
HADOOP
Ecosystem of open source projects
Hosted by the Apache Foundation
Concepts developed and shared by Google
Distributed file system that scales out on commodity servers with direct-attached storage and automatic failover
HADOOP DISTRIBUTED FILE SYSTEM - HDFS
Hadoop Distributed File System (HDFS)
HDFS is an implementation of the Hadoop file system abstraction: the Java abstract class org.apache.hadoop.fs.FileSystem represents a file system in Hadoop, and HDFS is one concrete implementation of it.
HDFS is designed to work efficiently in conjunction with MapReduce.
Definition
A distributed file system that provides a big data storage solution through high-throughput access to application data.
When data can potentially outgrow the storage capacity of a single machine, partitioning it across a number of separate machines becomes necessary for storage and processing. This is achieved using a distributed file system.
Potential Challenges:
Ensuring data integrity
Data retention in case of node failure
Integration across multiple nodes and systems
HADOOP DISTRIBUTED FILE SYSTEM - HDFS
Hadoop Distributed File System (HDFS)
HDFS is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
Very Large Files:
Very large means files that are hundreds of MB, GB, TB or even PB in size.
Streaming Data Access:
HDFS implements a write-once, read-many-times pattern. Data is copied from a source and analyzed over time. Each analysis involves a large portion of the dataset, so the time to read the whole dataset matters more than the latency of reading the first record.
Commodity Hardware:
Hadoop runs on clusters of commonly available commodity hardware. HDFS is designed to carry on working without noticeable interruption to the user in the case of node failure.
HADOOP DISTRIBUTED FILE SYSTEM - HDFS
Where HDFS doesn’t work well:
HDFS is not designed for the following scenarios
Low-Latency Data Access:
HDFS is optimised for delivering a high throughput of data, and this may come at the expense of latency.
Lots of Small Files:
File system metadata is stored in memory, so the limit on the number of files in a file system is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes.
Multiple Updates in the File:
Files in HDFS may be written to by a single writer, at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file.
HADOOP DISTRIBUTED FILE SYSTEM - CONCEPT
Blocks
NameNode
DataNodes
HDFS Federation
HDFS High Availability
HADOOP DISTRIBUTED FILE SYSTEM - BLOCKS
Files in HDFS are broken into blocks of 64 MB (the default) and stored as independent units.
A file in HDFS that is smaller than a single block does not occupy a full block's storage.
HDFS blocks are large compared to disk blocks to minimize the cost of seeks.
Map tasks in MapReduce operate on one block at a time.
Using the block, rather than the file, as the unit of abstraction simplifies the storage subsystem, which manages the block metadata.
Blocks fit well with replication for providing fault tolerance and availability
HDFS’s fsck command understands blocks
For example, the command to list the blocks that make up each file in the system:
% hadoop fsck / -files -blocks
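The block arithmetic above can be sketched in Python — a toy illustration, not Hadoop code, using the 64 MB default this deck describes:

```python
# Sketch (not Hadoop code): how a file of a given size maps to HDFS blocks.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size described here

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file would occupy."""
    full, last = divmod(file_size, block_size)
    sizes = [block_size] * full
    if last:
        sizes.append(last)  # the final block stores only the remaining bytes
    return sizes

# A 150 MB file becomes two 64 MB blocks plus one 22 MB block;
# the 22 MB block does not occupy a full 64 MB of storage.
sizes = split_into_blocks(150 * 1024 * 1024)
```

This mirrors the deck's later 150 MB example: the file is chopped into chunks, and only the last chunk is smaller than the block size.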
HDFS – NAMENODES & DATANODES
An HDFS cluster consists of
a single NameNode
a number of DataNodes
HDFS – NAMENODES & DATANODES
[Figure: HDFS read path — the client gets block locations from the Namenode and reads from the Datanodes through an FSDataInputStream]
WHAT DOES IT DO?
Hadoop implements Google’s MapReduce, using HDFS
HDFS creates multiple replicas of data blocks for reliability, placing
them on compute nodes around the cluster.
MapReduce can then process the data where it is located.
Hadoop's target is to run on clusters of the order of 10,000 nodes.
DATA CHARACTERISTICS
Batch processing rather than interactive user access.
Large data sets and files: gigabytes to terabytes in size
High aggregate data bandwidth
Scale to hundreds of nodes in a cluster
Tens of millions of files in a single instance
Write-once-read-many: a file once created, written and closed
need not be changed – this assumption simplifies coherency
A map-reduce application or web-crawler application fits
perfectly with this model.
DATA BLOCKS
HDFS supports write-once-read-many semantics, with reads at streaming speeds.
A typical block size is 64MB (or even 128 MB).
A file is chopped into 64MB chunks and stored.
FILESYSTEM NAMESPACE
Hierarchical file system with directories and files
Create, remove, move, rename etc.
The Namenode maintains the file system namespace.
Any metadata change to the file system is recorded by the Namenode.
An application can specify the number of replicas of the file
needed: replication factor of the file. This information is stored
in the Namenode.
FS SHELL, ADMIN AND BROWSER INTERFACE
HDFS organizes its data in files and directories.
It provides a command line interface called the FS shell that lets
the user interact with data in the HDFS.
The syntax of the commands is similar to bash and csh.
Example: to create a directory /foodir:
bin/hadoop dfs -mkdir /foodir
There is also a DFSAdmin interface available.
A browser interface is also available to view the namespace.
DATA REPLICATION
HDFS is designed to store very large files across machines in a
large cluster.
Each file is a sequence of blocks.
All blocks in the file except the last are of the same size.
Blocks are replicated for fault tolerance.
Block size and replicas are configurable per file.
The Namenode receives a Heartbeat and a BlockReport from
each DataNode in the cluster.
A BlockReport contains a list of all the blocks on a Datanode.
REPLICA PLACEMENT
The placement of the replicas is critical to HDFS reliability and
performance.
Optimizing replica placement distinguishes HDFS from other
distributed file systems.
Rack-aware replica placement:
Replicas are placed: one on a node in a local rack, one on a
different node in the local rack and one on a node in a different
rack.
With this policy, one third of the replicas are on one node, two thirds are on one rack, and the remaining third is distributed evenly across the remaining racks.
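The rack-aware rule on this slide can be sketched in Python. The cluster map and node names are hypothetical; real HDFS consults its configured network topology:

```python
import random

# Sketch of the slide's rack-aware placement rule for replication factor 3:
# two replicas on distinct nodes in the writer's (local) rack, one replica
# on a node in a different rack. The cluster map here is hypothetical.
def place_replicas(cluster, local_rack):
    """cluster: dict mapping rack -> list of node names.
    Returns three (rack, node) placements."""
    first, second = random.sample(cluster[local_rack], 2)  # two local nodes
    remote_rack = random.choice([r for r in cluster if r != local_rack])
    third = random.choice(cluster[remote_rack])            # one remote node
    return [(local_rack, first), (local_rack, second), (remote_rack, third)]

cluster = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5"]}
replicas = place_replicas(cluster, "rack1")
```

The design trade-off: local-rack replicas keep write bandwidth low, while the off-rack replica survives a whole-rack failure.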
REPLICA SELECTION
Replica selection for a READ operation: HDFS tries to minimize bandwidth consumption and latency.
If there is a replica on the reader's node, that replica is preferred.
An HDFS cluster may span multiple data centers: a replica in the local data center is preferred over a remote one.
STAGING
A client request to create a file does not reach the Namenode immediately.
The HDFS client caches the file data into a temporary local file. When the data reaches one HDFS block size, the client contacts the Namenode.
The Namenode inserts the filename into its hierarchy and allocates a data block for it.
The Namenode responds to the client with the identity of the Datanode and the destination replicas (Datanodes) for the block.
The client then flushes the block of data from its local temporary file to the specified Datanode.
STAGING
The client then sends a message that the file is closed.
The Namenode commits the file creation operation into its persistent store.
If the Namenode dies before the file is closed, the file is lost.
This client-side caching is required to avoid network congestion.
SAFEMODE STARTUP
On startup, the Namenode enters Safemode.
Replication of data blocks does not occur in Safemode.
Each DataNode checks in with a Heartbeat and a BlockReport.
The Namenode verifies that each block has an acceptable number of replicas.
After a configurable percentage of safely replicated blocks check in with the Namenode, the Namenode exits Safemode.
It then makes a list of blocks that still need to be replicated, and proceeds to replicate these blocks to other Datanodes.
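The Safemode exit condition can be sketched as a simple check; the threshold value here is a stand-in for Hadoop's configurable percentage:

```python
# Sketch of the Safemode exit rule: the Namenode leaves Safemode once a
# configurable fraction of blocks have at least the minimum replica count.
# The threshold and minimum below are illustrative, not Hadoop defaults.
def can_exit_safemode(block_replica_counts, min_replicas=1, threshold=0.999):
    """block_replica_counts: reported replica count per block."""
    safe = sum(1 for c in block_replica_counts if c >= min_replicas)
    return safe / len(block_replica_counts) >= threshold

# 999 of 1000 blocks safely replicated: 99.9% meets the 99.9% threshold.
ok = can_exit_safemode([1] * 999 + [0], min_replicas=1, threshold=0.999)
```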
NAMENODE
Keeps image of entire file system namespace and file Blockmap
in memory.
8 GB of local RAM is sufficient to support these data structures, which represent a huge number of files and directories.
When the Namenode starts up, it reads the FsImage and EditLog from its local file system, applies the EditLog transactions to the FsImage, and then stores the updated FsImage back to the filesystem as a checkpoint.
Periodic checkpointing is done so that the system can recover to the last checkpointed state after a crash.
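The startup sequence — load FsImage, replay the EditLog, write a new checkpoint — can be sketched with a dict standing in for the in-memory namespace (the operation names are hypothetical, not Hadoop's record format):

```python
# Sketch of Namenode startup checkpointing: load FsImage, replay EditLog
# transactions over it, and keep the merged result as the new checkpoint.
# The namespace maps path -> replication factor; op names are hypothetical.
def replay(fsimage, editlog):
    namespace = dict(fsimage)
    for op, path, value in editlog:
        if op == "create":
            namespace[path] = value
        elif op == "set_replication":
            namespace[path] = value
        elif op == "delete":
            namespace.pop(path, None)
    return namespace  # written back to disk as the new FsImage

fsimage = {"/a": 3}
editlog = [("create", "/b", 3), ("set_replication", "/a", 2), ("delete", "/b", None)]
checkpoint = replay(fsimage, editlog)
```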
FILESYSTEM METADATA
The HDFS namespace is stored by the Namenode.
The Namenode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata.
o For example, creating a new file
o Changing the replication factor of a file
o The EditLog is stored in the Namenode's local filesystem
The entire filesystem namespace, including the mapping of blocks to files and filesystem properties, is stored in a file called FsImage, also kept in the Namenode's local filesystem.
DATANODE
A Datanode stores data in files in its local file system.
A Datanode has no knowledge of the HDFS filesystem.
It stores each block of HDFS data in a separate file.
A Datanode does not create all files in the same directory.
It uses heuristics to determine the optimal number of files per directory and creates subdirectories appropriately.
When the filesystem starts up, the Datanode generates a list of all HDFS blocks it holds and sends this report to the Namenode: the Blockreport.
NAMENODES & DATANODES
Master/slave architecture
An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients.
There are a number of DataNodes, usually one per node in the cluster.
The DataNodes manage storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
A file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the Namenode.
NAMENODES & DATANODES
The NameNode manages file namespace operations like opening, creating, renaming, etc.
• Maps each file name to its list of blocks and their locations
• Holds file metadata
• Handles authorization and authentication
• Collects block reports from DataNodes on block locations
• Replicates missing blocks
• Keeps the entire namespace in memory, plus checkpoints and a journal
NAMENODES & DATANODES
The DataNode handles block storage on multiple volumes and data integrity.
• Clients access blocks directly from DataNodes for reads and writes
• DataNodes periodically send block reports to the NameNode
• Block creation, deletion and replication happen upon instruction from the NameNode
FAULT TOLERANCE
Failure is the norm rather than the exception.
An HDFS instance may consist of thousands of low-end machines, each storing part of the file system's data.
Since there are a huge number of components, and each component has a non-trivial probability of failure, some component is always non-functional.
Detection of faults and quick, automatic recovery from
them is a core architectural goal of HDFS.
DATANODE FAILURE & HEARTBEAT
A network partition can cause a subset of Datanodes to lose
connectivity with the Namenode.
The Namenode detects this condition by the absence of Heartbeat messages.
The Namenode marks Datanodes without a recent Heartbeat as dead and does not send any IO requests to them.
Any data registered to a failed Datanode is no longer available to HDFS.
The death of a Datanode may also cause the replication factor of some blocks to fall below their specified value.
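Heartbeat-based failure detection can be sketched as follows; the timeout value is illustrative, not Hadoop's actual default:

```python
# Sketch of Heartbeat-based failure detection: a Datanode that has not
# checked in within the timeout is marked dead and excluded from IO.
HEARTBEAT_TIMEOUT = 30  # seconds; illustrative, not Hadoop's default

def live_datanodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """last_heartbeat: dict node -> timestamp of its last Heartbeat."""
    return {n for n, t in last_heartbeat.items() if now - t <= timeout}

beats = {"dn1": 100, "dn2": 95, "dn3": 60}
alive = live_datanodes(beats, now=110)  # dn3 last seen 50 s ago: marked dead
```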
RE-REPLICATION
The necessity for re-replication may arise due to:
o A Datanode may become unavailable,
o A replica may become corrupted,
o A hard disk on a Datanode may fail, or
o The replication factor on the block may be increased.
HDFS – FAULT TOLERANCE
The input data (on HDFS) is stored on the local disks of the machines in the cluster.
HDFS divides each file into 64 MB blocks, and stores several copies of each block
(typically 3 copies) on different machines.
Worker Failure: The master pings every worker periodically. If no response is received
from a worker in a certain amount of time, the master marks the worker as failed. Any
map tasks completed by the worker are reset back to their initial idle state, and therefore
become eligible for scheduling on other workers. Similarly, any map task or reduce task
in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.
Master Failure: It is easy to make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state. However, in most cases, the user simply restarts the job.
CLUSTER - REBALANCING
HDFS architecture is compatible with data rebalancing schemes.
A scheme might move data from one Datanode to another if the free
space on a Datanode falls below a certain threshold.
In the event of a sudden high demand for a particular file, a scheme
might dynamically create additional replicas and rebalance other
data in the cluster.
DATA INTEGRITY
Consider a situation where a block of data fetched from a Datanode arrives corrupted.
This corruption may occur because of faults in a storage device, network faults, or buggy software.
An HDFS client computes a checksum of every block of its files and stores the checksums in hidden files in the HDFS namespace.
When a client retrieves the contents of a file, it verifies that the corresponding checksums match.
If they do not match, the client can retrieve the block from a replica.
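The checksum round trip can be sketched in Python. HDFS actually uses CRC checksums computed per chunk, so md5 here is only a stand-in:

```python
import hashlib

# Sketch of client-side block verification: store a checksum at write time,
# recompute on read, and fall back to another replica on mismatch.
# (HDFS uses per-chunk CRC checksums; md5 here is only illustrative.)
def checksum(block: bytes) -> str:
    return hashlib.md5(block).hexdigest()

def read_verified(replicas, expected):
    """replicas: candidate byte strings for the same block."""
    for block in replicas:
        if checksum(block) == expected:
            return block  # first uncorrupted replica wins
    raise IOError("all replicas corrupted")

good = b"block-data"
expected = checksum(good)  # stored alongside the file at write time
data = read_verified([b"corrupted!", good], expected)
```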
METADATA DISK FAILURE
FsImage and EditLog are central data structures of HDFS.
A corruption of these files can cause an HDFS instance to be non-functional.
For this reason, a Namenode can be configured to maintain multiple
copies of the FsImage and EditLog.
Multiple copies of the FsImage and EditLog files are updated
synchronously.
Meta-data is not data-intensive.
The Namenode can be a single point of failure: automatic failover is NOT supported!
APPLICATION PROGRAMMING INTERFACE
HDFS provides a Java API for applications to use.
Python access is also used in many applications.
A C language wrapper for the Java API is also available.
An HTTP browser can be used to browse the files of an HDFS instance.
SPACE RECLAMATION
When a file is deleted by a client, HDFS renames it to a file in the /trash directory, where it stays for a configurable amount of time.
A client can request an undelete during this time.
After the specified time, the file is deleted and the space is reclaimed.
When the replication factor is reduced, the Namenode selects excess replicas that can be deleted.
The next Heartbeat transfers this information to the Datanode, which clears the blocks for reuse.
HADOOP ECOSYSTEM
HDFS is the file system.
MR (MapReduce) is the job that runs on the file system.
An MR job lets the user ask questions of HDFS files.
Pig and Hive are two projects built to replace hand-coding MapReduce; the Pig and Hive interpreters turn scripts and SQL-like queries into MR jobs.
To query HDFS without depending on map and reduce alone: Impala and Hive.
Impala: optimized for low-latency queries (near real time).
Hive: optimized for batch processing jobs.
Sqoop: moves data from a relational DB into the Hadoop ecosystem.
Flume: moves data generated by external systems into HDFS; apt for high-volume logging.
Hue: graphical frontend to the cluster.
Oozie: workflow management tool.
Mahout: machine learning library.
HADOOP ECOSYSTEM
[Figure: Hadoop ecosystem stack — HDFS at the base; MR, Impala and HBase above it; Pig and Hive on top of MR; Sqoop and Flume for ingest; Hue, Oozie and Mahout as tools; packaged together in Cloudera's CDH]
When a 150 MB file is fed to the Hadoop ecosystem, it is broken into multiple parts to achieve parallelism.
It is broken into chunks, where the default chunk size is 64 MB.
The Datanode is the daemon that takes care of everything happening at an individual node.
The Namenode is the one that keeps track of what goes where and, when required, how to gather the groups back together.
Now think hard: what could the possible challenges be?
STORAGE OF FILE IN HDFS
HDFS – APPLICATION
Application of HDFS: moving a data file into the Hadoop ecosystem for analysis.
[Figure: a file being moved into Hadoop, and the file after the move into the Hadoop ecosystem]
NoSQL
What is NoSQL
CAP Theorem
What is lost
Types of NoSQL
Data Model
Frameworks
Demo
Wrap-up
SCALING UP
Issues with scaling up when the dataset is just too big
RDBMS were not designed to be distributed
Began to look at multi-node database solutions
Known as ‘scaling out’ or ‘horizontal scaling’
Different approaches include:
o Master-slave
o Sharding
RDBMS - MASTER/SLAVE
Master-Slave
o All writes are written to the master. All reads are performed against the replicated slave databases.
o Critical reads may be incorrect, as writes may not have been propagated down.
o Large data sets can pose problems, as the master needs to duplicate data to the slaves.
RDBMS - SHARDING
Partition or Sharding
o Scales well for both reads and writes
o Not transparent, application needs to be partition-aware
o Can no longer have relationships/joins across partitions
o Loss of referential integrity across shards
SCALING RDBMS
Multi-Master replication
INSERT only, not UPDATES/DELETES
No JOINs, thereby reducing query time
o This involves de-normalizing data
In-memory databases
NoSQL
Stands for Not Only SQL
Class of non-relational data storage systems
Usually do not require a fixed table schema nor do they use the
concept of joins
All NoSQL offerings relax one or more of the ACID properties
WHY NoSQL ??
For data storage, an RDBMS cannot be the be-all/end-all
Just as there are different programming languages, need to have
other data storage tools in the toolbox
A NoSQL solution is more acceptable to a client now than even a
year ago
BIG TABLE
Three major papers were the seeds of the NoSQL movement
o BigTable (Google)
o Dynamo (Amazon)
Gossip protocol (discovery and error detection)
Distributed key-value data store
Eventual consistency
o CAP Theorem
CAP THEOREM
Three properties of a system: Consistency, Availability and Partition tolerance.
You can have at most two of these three properties in any shared-data system.
To scale out, you have to partition. That leaves either consistency or availability to choose from.
o In almost all cases, you would choose availability over consistency.
CHARACTERISTICS OF NoSQL
NoSQL solutions fall into two major areas:
o Key/Value or ‘the big hash table’.
Amazon S3 (Dynamo)
Voldemort
Scalaris
o Schema-less, which comes in multiple flavors: column-based, document-based or graph-based.
Cassandra (column-based)
CouchDB (document-based)
Neo4J (graph-based)
HBase (column-based)
KEY VALUE
Pros
o Very fast
o Very scalable
o Simple model
o Able to distribute horizontally
Cons
o Many data structures (objects) can't be easily modeled as key-value pairs
SCHEMA-LESS
Pros
o Schema-less data model is richer than key/value pairs
o Eventual consistency
o Many are distributed
o Still provide excellent performance and scalability
Cons
o Typically no ACID transactions or joins
SQL TO NoSQL
Joins
Group by
Order by
ACID transactions
SQL as a sometimes frustrating but still powerful query language
Easy integration with other applications that support SQL
SEARCHING
Relational
o SELECT `column` FROM `database`,`table` WHERE `id` = key;
o SELECT product_name FROM rockets WHERE id = 123;
Cassandra (standard)
o keyspace.getSlice(key, "column_family", "column")
o keyspace.getSlice(123, new ColumnParent("rockets"), getSlicePredicate());
NoSQL API
Basic API access:
o get(key) -- Extract the value given a key
o put(key, value) -- Create or update the value given its key
o delete(key) -- Remove the key and its associated value
o execute(key, operation, parameters) -- Invoke an operation on the value (given its key), which is a special data structure (e.g. List, Set, Map, etc.)
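A minimal in-memory sketch of this basic API; real stores add persistence, replication and consistency controls:

```python
# Toy in-memory sketch of the basic NoSQL key/value API listed above.
class KVStore:
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)          # None if absent

    def put(self, key, value):
        self._data[key] = value             # create or update

    def delete(self, key):
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        # invoke an operation on a structured value, e.g. append to a list
        return operation(self._data[key], *parameters)

store = KVStore()
store.put("user:1", {"name": "Ada"})
store.put("tags", ["big-data"])
store.execute("tags", list.append, "nosql")
```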
DATA MODEL
Within Cassandra data set
o Column: smallest data element, a tuple with a name and a value
Fetching the row with key 'hadoop' might return:
{'name' => 'Hadoop Model',
 'toon' => 'Ready Set Zoom',
 'inventoryQty' => '5',
 'productUrl' => 'hadoop\1.gif'}
DATA MODEL
o ColumnFamily: There’s a single structure used to group both the
Columns and SuperColumns. Called a ColumnFamily (think table), it
has two types, Standard & Super.
Column families must be defined at startup
o Key: the permanent name of the record
o Keyspace: the outer-most level of organization. This is usually the
name of the application. For example, ‘Acme' (think database name).
HASHING
Partition using consistent hashing
o Keys hash to a point on a fixed circular space
o The ring is partitioned into a set of ordered slots, and servers and keys are hashed over these slots
Nodes take positions on the circle.
A, B, and D exists.
o B responsible for AB range.
o D responsible for BD range.
o A responsible for DA range.
C joins.
o B, D split ranges.
o C gets BC from D.
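The ring behaviour above can be sketched in Python; md5 stands in for whatever hash function the store actually uses:

```python
import bisect
import hashlib

# Sketch of consistent hashing: nodes occupy points on a circle; a key is
# served by the first node clockwise from the key's hash. When a node
# joins, only the keys in the range it takes over move to it.
def h(s):
    """Hash a string to a point on a 2**32 circle (md5 as a stand-in)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)

    def lookup(self, key):
        hashes = [p for p, _ in self.points]
        i = bisect.bisect_right(hashes, h(key)) % len(self.points)
        return self.points[i][1]  # first node clockwise from the key

ring = Ring(["A", "B", "D"])           # A, B and D exist
owner_before = ring.lookup("some-key")
ring_after = Ring(["A", "B", "C", "D"])  # C joins; only one slice of keys moves
```

The key property: for any key, its owner after C joins is either its old owner or C; no other data moves.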
DATA TYPE
Columns are always sorted by their name. Sorting supports:
o BytesType
o UTF8Type
o LexicalUUIDType
o TimeUUIDType
o AsciiType
o LongType
Each of these options treats the Columns' name as a different
data type
CASE STUDY
Facebook Search
MySQL > 50 GB Data
o Writes Average : ~300 ms
o Reads Average : ~350 ms
Rewritten with NoSQL > 50 GB Data
o Writes Average : 0.12 ms
o Reads Average : 15 ms
IMPLEMENTATION OF NoSQL
Log Analysis
Social Networking Feeds (many firms hooked in through
Facebook or Twitter)
External feeds from partners (EAI)
Data that is not easily analyzed in an RDBMS, such as time-based data
Large data feeds that need to be massaged before entry into an
RDBMS
SHARED DATA ARCHITECTURE
[Figure: shared-data architecture — processors A, B, C, D all accessing one shared data store]
SHARED NOTHING ARCHITECTURE
[Figure: shared-nothing architecture — nodes A, B, C, D, each with its own data]
LIST OF NoSQL DATABASES
Wide Column Store
o Hadoop / HBase
o Cloudera
o Cassandra
o Hypertable
o Accumulo
o Amazon SimpleDB
o Cloudata
o MonetDB
LIST OF NoSQL DATABASES
Document Store
o OrientDB
o MongoDB
o Couchbase Server
o CouchDB
o RavenDB
o MarkLogic Server
o JSON ODM
LIST OF NoSQL DATABASES
Key Value Store
o DynamoDB
o Azure
o Riak
o Redis
o Aerospike
o LevelDB
o RocksDB
LIST OF NoSQL DATABASES
Graph Databases
o Neo4j
o ArangoDB
o InfiniteGraph
o Sparksee
o Titan
o InfoGrid
o GraphBase
MapReduce
Map Operations
Reduce Operations
Submitting a MapReduce Job
Shuffle
Data types
Map Reduce
MapReduce is a programming model for processing and generating large data sets.
Use of a functional model with user-specified Map and Reduce operations allows large computations to be parallelized.
map(k1, v1) → list(k2, v2)
Map OPERATION
The common array operation:
var a = [1, 2, 3];
for (i = 0; i < a.length; i++)
    a[i] = a[i] * 2;
The output is:
var a = [2, 4, 6];
Map OPERATION
When fn is passed as a function argument:
function map(fn, a)
{
for( i = 0; i < a.length; i++)
a[i] = fn(a[i]);
}
The map function is invoked as
map(function(x) {return x * 2;} , a);
Reduce FUNCTION
Merges together the intermediate values associated with the same intermediate key, accepted from the user via an iterator.
Merges the values to form a smaller set of values.
reduce(k2, list(v2)) → list(v2)
EXECUTION OVERVIEW
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits.
Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function.
The number of partitions (R) and the partitioning function are specified by the user.
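The partitioning step can be sketched as hash(key) mod R, which guarantees all values for one key reach the same reduce partition (crc32 stands in for the framework's hash function):

```python
import zlib

# Sketch of the default partitioning step: intermediate keys are spread
# over R reduce partitions by hashing, so every pair with the same key
# lands in the same partition. crc32 stands in for the framework's hash.
def partition(key: str, R: int) -> int:
    return zlib.crc32(key.encode()) % R

R = 4
pairs = [("hadoop", 1), ("hdfs", 1), ("hadoop", 1)]
buckets = {}
for k, v in pairs:
    buckets.setdefault(partition(k, R), []).append((k, v))
```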
Map FUNCTION FOR WORDCOUNT
Key : Document Name
Value : Document Contents
map(String key, String value):
    for each word w in value:
        EmitIntermediate(w, "1");
Reduce FUNCTION FOR WORDCOUNT
Key : A Word
Values : A list of counts
reduce(String key, Iterator values):
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
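The two pseudocode functions above can be run end to end in Python, with a dict playing the role of the shuffle:

```python
from collections import defaultdict

# The WordCount pseudocode, end to end: map emits (word, 1) pairs,
# the shuffle groups values by word, and reduce sums each group.
def map_fn(key, value):
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    return sum(values)

docs = {"d1": "big data big hadoop", "d2": "hadoop hdfs"}

groups = defaultdict(list)               # shuffle: group values by key
for name, text in docs.items():
    for word, one in map_fn(name, text):
        groups[word].append(one)

counts = {w: reduce_fn(w, vs) for w, vs in groups.items()}
```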
MapReduce AT HIGH LEVEL
[Figure: a MapReduce job submitted by the client machine reaches the Job Tracker on the master node; Task Trackers on the slave nodes run the task instances]
ANATOMY OF MapReduce
[Figure: input data split across nodes 1–3; each node runs a Map task producing interim data; partitioning routes the interim data to Reduce tasks, whose output is stored on output nodes]
SUMMARY
A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, shuffles the sorted output based on its key, and then feeds it to the reduce tasks.
The input and the output of the job are stored in a file system.
The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.