Seminar on Hadoop Framework
TRANSCRIPT
Abstract
The total amount of digital data in the world has exploded in recent years.
In 2006, the world's digital data was estimated at 0.18 zettabytes, and it was forecast to grow tenfold to 1.8 zettabytes by 2011.
1 zettabyte = 10^21 bytes.
The problem is that while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from a drive) have not kept up.
A typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so all the data on a full drive could be read in around 300 seconds, about five minutes.
In 2010, 1 TB drives are the standard hard disk size, but the transfer speed is only around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
Parallelisation
An obvious solution to this problem is parallelisation. The input data is usually large, and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time.
Reading 1 TB from a single hard drive takes hours, but if the data is spread across 100 drives that are read in parallel, the whole terabyte can be read in under two minutes (100 drives at 100 MB/s each give a combined 10 GB/s, or about 100 seconds).
Key issues
The key issues involved in this solution are:
Hardware failure: with many machines, the chance that one of them fails is high.
Combining the data: the pieces read and analysed on different machines must be combined into a final result.
Solutions
Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion.
It solves the problem of hardware failure through replication: each piece of data is stored as multiple copies on different machines, so the failure of one machine does not lose the data.
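As a small illustration (not from the seminar itself), the replication factor can be controlled per file through Hadoop's FileSystem API; the path and factor below are made-up values for the sketch.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        // Connect to the file system named in the (default) configuration.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Ask for three copies of this (hypothetical) file, so losing
        // one disk or node does not lose the data.
        fs.setReplication(new Path("/data/input.txt"), (short) 3);
        fs.close();
    }
}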
The second problem is solved by a simple programming model, MapReduce: a map phase processes the input pieces in parallel, and a reduce phase combines the partial results.
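To give a feel for the model, here is the classic word-count job, sketched against the standard Hadoop MapReduce API (input and output paths are taken from the command line; the class names are illustrative): the map step emits a (word, 1) pair for every word it sees, and the reduce step sums the counts for each word.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer also works as a combiner, summing counts locally
        // on each machine before data is shuffled across the network.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Notice how little of this code deals with distribution: the framework splits the input, runs the mapper on many machines in parallel, and routes all pairs with the same word to one reducer.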
Introduction
Hadoop
Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. It is designed to process large volumes of information efficiently by connecting many commodity computers together to work in parallel.
Features of Hadoop
The features of Hadoop that stand out are its simplified programming model and its efficient, automatic distribution of data and work across machines.