BIG DATA ANALYTICS A RESEARCH REPORT
Mrs. Sowmya Koneru Associate Professor, NRI Institute of Technology, Agiripalli, Vijayawada, Andhra Pradesh
Abstract— Big data is the term for any collection of datasets so large and complex that it becomes difficult to process using traditional data processing applications. The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from datasets that are diverse, complex, and of a massive scale. A big data environment is used to acquire, organize and analyze many types of data. Data that is very large in volume, highly diverse in variety, or moving with great velocity is called big data. Analyzing big data is a challenging task, as it involves large distributed file systems which should be fault tolerant, flexible and scalable. The technologies used by big data applications to handle massive data include Hadoop, MapReduce, Apache Hive, NoSQL and HPCC. First, we present the definition of big data and discuss big data challenges. Next, we present a systematic framework to decompose big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics. These four modules form a big data value chain. Following that, we present a detailed survey of the materials and methods used in the research and industry communities. In addition, we present the prevalent Hadoop framework for addressing big data. Finally, we outline big data system architecture and present key challenges and research directions for big data systems.

Keywords— Big Data, Hadoop, MapReduce, Apache Hive, NoSQL, HPCC
I. INTRODUCTION

Big data is one of the biggest buzz phrases in the IT domain. New personal communication technologies drive the big data trend, and the Internet population grows day by day yet never reaches 100%. The need for big data arises from large companies such as Facebook, Yahoo, Google and YouTube, which must analyze enormous amounts of data in both unstructured and structured form. Google alone holds a vast amount of information, so there is a need for big data analytics, the processing of complex and massive datasets. This data differs from structured data in terms of five parameters: volume, variety, velocity, value and veracity (the 5 V's). These five V's are also the challenges of big data management:

1. Volume: Data of all types (KB, MB, TB, PB, ZB, YB) grows day by day and results in very large files. Excessive volume of data is the main storage issue, which is addressed by reducing storage cost. Data volumes are expected to grow 50 times by 2020.

2. Variety: Data sources are extremely heterogeneous. Files come in various formats and of any type; they may be structured or unstructured, such as text, audio, video, log files and more. The varieties are endless, and the data enters the network without having been quantified or qualified in any way.

3. Velocity: Data arrives at high speed. Sometimes one minute is too late, so big data is time sensitive. For some organizations, data velocity is the main challenge: social media messages and credit card transactions are completed in milliseconds, and the data they generate must be put into databases.

4. Value: Value is the most important V in big data. It is the main motivation for big data, because it matters to businesses and IT infrastructure systems to store large amounts of value in databases.

5. Veracity: Veracity refers to the spread of values typical of a large dataset. When dealing with high volume, velocity and variety of data, not all of the data will be 100% correct; there will be dirty data. Big data and analytics technologies must work with these types of data.

Huge volumes of data (both structured and unstructured) must be managed, administered and governed by organizations. Unstructured data is data that is not held in a database; it may be text, verbal data or data in another form. Textual unstructured data includes PowerPoint presentations, email messages, Word documents and instant messages. Data in other formats can be .jpg and .png images and audio files.
Fig.2 illustrates a general big data network
model with MapReduce. A distinct application
in the cloud has put demanding requirements for
acquisition, transportation and analytics of
structured and unstructured data. The challenges
include analysis, capture, curation, search,
sharing, storage, transfer, visualization, and
privacy violations. The trend to larger data sets
is due to the additional information derivable
from analysis of a single large set of related
data, as compared to separate smaller sets with
the same total amount of data, allowing
correlations to be found to "spot business trends,
prevent diseases, and combat crime and so on".
Scientists regularly encounter limitations due to
large data sets in many areas, including
meteorology, genomics, connectomics, complex
physics simulations, and biological and
environmental research. The limitations also
affect Internet search, finance and business
informatics. Data sets grow in size in part
because they are increasingly being gathered by
ubiquitous information-sensing mobile devices,
aerial sensory technologies (remote sensing),
software logs, cameras, microphones, radio-
frequency identification (RFID) readers, and
wireless sensor networks. The world's
technological per-capita capacity to store
information has roughly doubled every 40
months since the 1980s; as of 2012, about 2.5 exabytes (2.5×10^18 bytes) of data were created every day.
The challenge for large enterprises is
determining who should own big data initiatives
that straddle the entire organization.
The history of big data can be roughly split into the following stages.

Megabyte to Gigabyte: In the 1970s and 1980s, historical business data introduced the earliest "big data" challenge in moving from megabyte to gigabyte sizes. The urgent need at that time was to house that data and run relational queries for business analyses and reporting. Research efforts gave birth to the "database machine", which featured integrated hardware and software
to solve problems. The underlying philosophy
was that such integration would provide better
performance at lower cost. After a period of
time, it became clear that hardware-specialized
database machines could not keep pace with the
progress of general-purpose computers. Thus,
the descendant database systems are software systems that impose few constraints on hardware
and can run on general-purpose computers.
Gigabyte to Terabyte: In the late 1980s, the
popularization of digital technology caused data
volumes to expand to several gigabytes or even
a terabyte, which is beyond the storage and/or
processing capabilities of a single large
computer system. Data parallelization was
proposed to extend storage capabilities and to
improve performance by distributing data and
related tasks, such as building indexes and
evaluating queries, into disparate hardware.
Based on this idea, several types of parallel databases were built, including shared-memory databases, shared-disk databases, and shared-nothing databases, each induced by the underlying hardware architecture.

Terabyte to Petabyte: During the late 1990s, when the database community was admiring its "finished" work on the parallel database, the rapid development of Web 1.0 led the whole world into the Internet era, along with massive semi-structured or unstructured web pages holding terabytes or petabytes (PBs) of data. The resulting need for search companies was to index and query the mushrooming content of the web. Unfortunately, although parallel databases handle structured data well, they provide little support for unstructured data. Additionally, system capabilities were limited to no more than several terabytes.
Petabyte to Exabyte: Under current development trends, data stored and analyzed by big companies will undoubtedly reach PB to exabyte magnitude soon. However, current technology still handles only terabyte-to-PB scale data; no revolutionary technology has yet been developed to cope with larger datasets.
II. LITERATURE SURVEY
1. Hadoop MapReduce is a large-scale, open source software framework dedicated to scalable, distributed, data-intensive computing. The framework breaks up large data into smaller parallelizable chunks and handles scheduling: it maps each piece to an intermediate value, reduces the intermediate values to a solution, and offers user-specified partition and combiner options. It is fault tolerant, reliable, and supports thousands of nodes and petabytes of data. If an algorithm can be rewritten as MapReduce jobs and the problem can be broken into small pieces solvable in parallel, then Hadoop MapReduce is a suitable distributed problem-solving approach for large datasets; it is tried and tested in production and has many implementation options. One can also design a data-aware cache framework that requires minimal change to the original MapReduce programming model in order to provision incremental processing for big data applications using the MapReduce model [4].
2. The author [2] stated the importance of some of the technologies that handle big data, such as Hadoop, HDFS and MapReduce. The author described the various schedulers used in Hadoop and the technical aspects of Hadoop, and also focused on the importance of YARN, which overcomes the limitations of MapReduce.
3. The author [3] surveyed various technologies for handling big data and their architectures, discussed the challenges of big data (volume, variety, velocity, value, veracity), and listed the advantages and disadvantages of these technologies. The paper discussed an architecture using Hadoop HDFS distributed data storage, real-time NoSQL databases, and MapReduce distributed data processing over a cluster of commodity servers. Its main goal was to survey the various big data handling techniques that handle massive amounts of data from different sources and improve the overall performance of systems.
4. The author [1] continues with the big data definition and enhances the definition given in [3] that includes the 5V big data properties (volume, variety, velocity, value, veracity), and suggests other dimensions for big data analysis and taxonomy, in particular comparing and contrasting big data technologies in e-Science, industry, business, social media and healthcare. With a long tradition of working with constantly increasing volumes of data, modern e-Science can offer industry its scientific analysis methods, while industry can bring advanced and fast-developing big data technologies and tools to science and the wider public [1].
5. The author [6] stated that the need to process enormous quantities of data has never been greater. Not only are terabyte- and petabyte-scale datasets rapidly becoming commonplace, but there is consensus that great value lies buried in them, waiting to be unlocked by the right computational tools. In the commercial sphere, business intelligence, driven by the ability to gather data from a dizzying array of sources and by big data analysis tools such as MapReduce over Hadoop and HDFS, promises to help organizations better understand their customers and the marketplace, hopefully leading to better business decisions and competitive advantages [3].
6. The author [5] stated that there is a need to maximize returns on BI investments and to overcome difficulties. The problems and new trends mentioned in that article, and the solutions found by combining advanced tools, techniques and methods, would help readers in BI projects and implementations. BI vendors are struggling and making a continuous effort to deliver technical capabilities and to provide complete out-of-the-box solutions with sets of tools and techniques. In 2014, due to rapid change in BI maturity, BI teams faced a tough time building infrastructure with less-skilled resources. Consolidation and convergence are ongoing, and the market is coming up with a wide range of new technologies; still, the ground is immature and in a state of rapid evolution.
7. The author [8] gave an important emerging framework model design for big data analytics and a 3-tier architecture model for big data in data mining. The proposed 3-tier architecture model is more scalable in working with different environments and also helps overcome the main issues in big data analytics of storing, analyzing and visualizing. The framework model is given for Hadoop HDFS distributed data storage, real-time NoSQL databases, and MapReduce distributed data processing over a cluster of commodity servers.
8. A big data framework needs to consider complex relationships between samples, models and data sources, along with their evolving changes over time and other possible factors. To support big data mining, high-performance computing platforms are required. With big data technologies [3] we will hopefully be able to provide the most relevant and most accurate social sensing feedback to better understand our society in real time [7].
9. There are many scheduling techniques available to improve job performance, but every technique has some limitation, so no single technique can overcome all the parameters that affect whole-system performance, such as data locality, fairness, load balance, the straggler problem and deadline constraints. Each technique has advantages over the others, so if we combine or interchange some techniques, the result will be even better than any individual scheduling technique [10].
10. The author [9] describes the concept of big data along with the 3 Vs of big data: volume, velocity and variety. The paper also focuses on big data processing problems. These technical challenges must be addressed for efficient and fast processing of big data. The challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, error handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation. These technical challenges are common across a large variety of application domains and are therefore not cost-effective to address in the context of one domain alone. The paper describes Hadoop, an open source software framework used for processing big data.

11. The author [12] proposed a system based on the implementation of online aggregation of MapReduce in Hadoop for efficient big data processing. Traditional MapReduce implementations materialize the intermediate results of mappers and do not allow pipelining between the map and the reduce phases. This approach has the advantage of simple recovery in the case of failures; however, reducers cannot start executing tasks before all mappers have finished. Because MapReduce Online is a modified version of Hadoop MapReduce, it supports online aggregation and stream processing, while also improving utilization and reducing response time.

12. The author [11] stated that, learning from the application studies, they explore the design space for supporting data-intensive and compute-intensive applications on large data-center-scale computer systems. Traditional data processing and storage approaches are facing many challenges in meeting the continuously increasing computing demands of big data. This work focused on MapReduce, one of the key enabling approaches for meeting big data demands by means of highly parallel processing on a large number of commodity nodes.
III. TECHNOLOGIES AND METHODS

Big data is a new concept for handling massive data; therefore, the architectural description of this technology is very new. Different technologies use almost the same approach, i.e., to distribute the data among various local agents and reduce the load on the main server so that traffic can be avoided. There are endless articles, books and periodicals that describe big data from a technology perspective, so we will instead focus our efforts here on setting out some basic principles and the minimum technology foundation needed to relate big data to the broader information management (IM) domain.
A. Hadoop

Hadoop is a framework that can run applications on systems with thousands of nodes and terabytes of data. It distributes files among the nodes and allows the system to continue working in case of a node failure. This approach reduces the risk of catastrophic system failure. An application is broken into smaller parts (fragments or blocks). Apache Hadoop consists of the Hadoop kernel, the Hadoop Distributed File System (HDFS), MapReduce, and related projects such as ZooKeeper, HBase and Apache Hive. The Hadoop Distributed File System consists of three components: the NameNode, the Secondary NameNode and the DataNodes. The multilevel secure (MLS) environment problems of Hadoop can be addressed by using the Security-Enhanced Linux (SELinux) protocol, in which multiple sources of Hadoop applications run at different levels; this protocol is an extension of the Hadoop distributed file system. Hadoop is commonly used for distributed batch index building; it is desirable to optimize the indexing capability in near real time. Hadoop provides components for storage and analysis for large-scale processing. Nowadays Hadoop is used by hundreds of companies. The advantages of Hadoop are distributed storage and computational capabilities: it is extremely scalable, optimized for high throughput, uses large block sizes, and is tolerant of software and hardware failure.
Components of Hadoop:

HBase: An open source, distributed, non-relational database system implemented in Java. It runs above the HDFS layer and can serve well-structured input and output for MapReduce jobs.

Oozie: A web application that runs in a Java servlet. Oozie uses a database to store information about workflows, which are collections of actions, and manages Hadoop jobs in an orderly way.

Sqoop: A command-line interface application that provides a platform for converting data between relational databases and Hadoop, in either direction.

Avro: A system that provides data serialization and data exchange services, used primarily within Apache Hadoop. These services can be used together or independently, according to the data records.

Chukwa: A framework for data collection and analysis, used to process and analyze massive amounts of logs. It is built on top of HDFS and the MapReduce framework.

Pig: A high-level platform built on top of the MapReduce framework and used with Hadoop. It is a high-level data processing system in which data records are analyzed using a high-level language.

ZooKeeper: A centralized service that provides distributed synchronization and group services, along with maintenance of configuration information and records.

Hive: A data warehouse application that provides an SQL-like interface and a relational model. The Hive infrastructure is built on top of Hadoop and helps provide summarization and analysis for queries.

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Doug Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project. Hadoop is open source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers. Hadoop is:
Reliable: The software is fault tolerant; it expects and handles hardware and software failures.

Scalable: Designed for massive scale in processors, memory, and locally attached storage.

Distributed: Handles replication and offers a massively parallel programming model, MapReduce.
Hadoop is an open source implementation of a large-scale batch processing system. It uses the MapReduce framework introduced by Google, leveraging the concepts of map and reduce functions well known in functional programming. Although the Hadoop framework is written in Java, it allows developers to deploy custom programs written in Java or any other language to process data in a parallel fashion across hundreds or thousands of commodity servers. It is optimized for contiguous read requests (streaming reads), where processing involves scanning all the data. Depending on the complexity of the process and the volume of data, response time can vary from minutes to hours. While Hadoop can process data fast, its key advantage is its massive scalability. Hadoop is currently being used for indexing web searches, email spam detection, recommendation engines, prediction in financial services, genome manipulation in life sciences, and analysis of unstructured data such as logs, text, and clickstreams. While many of these applications could in fact be implemented in a relational database management system (RDBMS), the core of the Hadoop framework is functionally different from an RDBMS. Hadoop is particularly useful when:

- Complex information processing is needed and unstructured data must be turned into structured data.
- Queries cannot be reasonably expressed using SQL, or algorithms are heavily recursive.
- Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing.
- Machine learning is involved and data sets are too large to fit into database RAM or disks, or require too many cores (tens of TB up to PBs).
- The value of the data does not justify the expense of constant real-time availability, for example archives or special-interest information, which can be moved to Hadoop and remain available at lower cost.
- Results are not needed in real time and fault tolerance is critical.
- Significant custom coding would otherwise be required to handle job scheduling.
Hadoop was inspired by Google's MapReduce, a
software framework in which an application is
broken down into numerous small parts. Any of
these parts (also called fragments or blocks) can
be run on any node in the cluster. Doug Cutting,
Hadoop's creator, named the framework after his
child's stuffed toy elephant. The current Apache
Hadoop ecosystem consists of the Hadoop
kernel, MapReduce, the Hadoop distributed file
system (HDFS) and a number of related projects
such as Apache Hive, HBase and Zookeeper.
The Hadoop framework is used by major players
including Google, Yahoo and IBM, largely for
applications involving search engines and
advertising. The preferred operating systems are Windows and Linux, but Hadoop can also work with BSD and OS X.

HDFS: The Hadoop Distributed File System (HDFS) is the file system component of the Hadoop framework. HDFS is designed and optimized to store data over a large amount of low-cost hardware in a distributed fashion.

Name Node: The NameNode is a type of master node that holds metadata about all the DataNodes: their addresses (used to communicate with them), free space, the data they store, which DataNodes are active and which are passive, the task tracker, the job tracker, and other configuration such as data replication.
The NameNode records all of the metadata, attributes, and locations of files and data blocks in the DataNodes. The attributes it records include file permissions, file modification and access times, and the namespace, which is a hierarchy of files and directories. The NameNode maps the namespace tree to file blocks in the DataNodes. When a client node wants to
read a file in HDFS, it first contacts the NameNode to receive the locations of the data blocks associated with that file. The NameNode stores information about the overall system because it is the master of HDFS, with the DataNodes being the slaves. It stores the image and journal logs of the system, and must always store the most up-to-date image and journal. Basically, the NameNode always knows where the data blocks and replicas are for each file, and it also knows where the free blocks are in the system, so it keeps track of where future files can be written.
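As a minimal illustration of this read path, the Java sketch below uses the standard Hadoop FileSystem client API to open a file in HDFS and stream it to the console. The NameNode address (hdfs://namenode:9000) and the file path are placeholder assumptions for this example, not values taken from the paper.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The client first asks the NameNode for the block locations of the file,
    // then streams the blocks directly from the DataNodes that hold them.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
    try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}

The block-location lookup and DataNode streaming happen inside the client library; application code only sees an ordinary input stream.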
Data Node: The DataNode is a type of slave node in Hadoop used to store the data; a task tracker in the DataNode tracks the jobs running on that node and the jobs arriving from the NameNode. The DataNodes store the blocks and block replicas of the file system. During startup, each DataNode connects to and performs a handshake with the NameNode. The DataNode checks for the correct namespace ID, and if it is not found the DataNode automatically shuts down. New DataNodes can join the cluster by simply registering with the NameNode and receiving the namespace ID. Each DataNode maintains a block report for the blocks on its node and sends it to the NameNode every hour, so that the NameNode always has an up-to-date view of where block replicas are located in the cluster. During normal operation of HDFS, each DataNode also sends periodic heartbeats to the NameNode so that the NameNode knows which DataNodes are operating correctly and are available.

The base Apache Hadoop framework is composed of the following modules:
- Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
- Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
- Hadoop MapReduce – a programming model for large-scale data processing.
All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common and should be automatically handled in software by the framework. Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System (GFS) papers. "Hadoop" often refers not just to the base Hadoop package but to the Hadoop ecosystem (Fig. 6), which includes all of the additional software packages that can be installed on top of or alongside Hadoop, such as Apache Hive, Apache Pig and Apache Spark.
B. MapReduce: MapReduce was introduced by Google in order to process and store large datasets on commodity hardware. MapReduce is a model for processing large-scale data records in clusters. The MapReduce programming model is based on two functions, map() and reduce(). Users can specify their own processing logic through well-defined map() and reduce() functions. In the map phase, the master node takes the input, divides it into smaller sub-modules, and distributes them to slave nodes. A slave node may further divide the sub-modules, leading to a hierarchical tree structure. Each slave node processes its base problem and passes the result back to the master node. The MapReduce system groups all intermediate pairs by their intermediate keys and passes them to the reduce() function, which, acting as the master node, collects the results from all the sub-problems and combines them to form the final output.
Map(in_key, in_value) -> list(out_key, intermediate_value)
Reduce(out_key, list(intermediate_value)) -> list(out_value)

The parameters of the map() and reduce() functions are as follows:

map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(v2)
A MapReduce framework is based on a master-slave architecture, where one master node manages a number of slave nodes. MapReduce works by first dividing the input data set into even-sized data blocks for equal load distribution. Each data block is then assigned to one slave node, processed by a map task, and a result is generated. The slave node notifies the master node when it is idle, and the scheduler then assigns new tasks to it. The scheduler takes data locality and resources into consideration when it disseminates data blocks.
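For concreteness, the listing below is the classic word-count job expressed with the Apache Hadoop MapReduce Java API: the map() function emits (word, 1) pairs and the reduce() function sums the counts for each word. Input and output paths are taken from the command line; this is a minimal sketch rather than a tuned production job, and class names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for every input line, emit (word, 1) for each token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum all counts emitted for the same word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework handles splitting the input, scheduling map and reduce tasks near the data, and re-running failed tasks; the user supplies only the two functions and the job configuration.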
Figure 7 shows the MapReduce architecture and how it works. The scheduler always tries to allocate a local data block to a slave node. If that attempt fails, the scheduler assigns a rack-local or random data block to the slave node instead of a local one. When the map() functions complete their tasks, the runtime system gathers all intermediate pairs and launches a set of reduce tasks to produce the final output.
Large-scale data processing is a difficult task; managing hundreds or thousands of processors and handling parallelization and distributed environments make it even more difficult. MapReduce provides a solution to these issues: it supports distributed and parallel I/O scheduling, it is fault tolerant and scalable, and it has built-in processes for status reporting and monitoring of heterogeneous, large datasets such as those found in big data. It is a way of approaching and solving a given problem. Using the MapReduce framework, the efficiency and the time needed to retrieve data are quite manageable. To address the volume aspect, new techniques have been proposed to enable parallel processing using the MapReduce framework, such as the data-aware caching (Dache) framework, which makes only slight changes to the original MapReduce programming model to enhance processing for big data applications. The advantages of MapReduce are that a large variety of problems are easily expressible as MapReduce computations and that a cluster of machines can handle thousands of nodes with fault tolerance. The disadvantages of MapReduce are weak support for real-time processing, the fact that it is not always easy to implement, the cost of shuffling data, and its batch-oriented nature.
MapReduce Components:
1. Name Node: manages HDFS metadata; does not deal with files directly.
2. Data Node: stores blocks of HDFS; the default replication level for each block is 3.
3. Job Tracker: schedules, allocates and monitors job execution on the slaves (Task Trackers).
4. Task Tracker: runs MapReduce operations.
MapReduce Framework: MapReduce is a software framework for distributed processing of large data sets on computer clusters, first developed by Google. MapReduce is intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster. A typical Hadoop cluster integrates a MapReduce layer and an HDFS layer. In the MapReduce layer, the job tracker assigns tasks to the task trackers; the master node's job tracker also assigns tasks to the slave nodes' task trackers (Fig. 8).

The master node contains a job tracker node and a task tracker node (MapReduce layer), and a name node and a data node (HDFS layer). Each slave node contains a task tracker node (MapReduce layer) and a data node (HDFS layer). In short, the MapReduce layer has job and task tracker nodes, and the HDFS layer has name and data nodes.
C. Hive: Hive has been described as a distributed agent platform, a decentralized system for building applications by networking local system resources. Apache Hive is the data warehousing component of the cloud-based Hadoop ecosystem; it offers a query language called HiveQL that translates SQL-like queries into MapReduce jobs automatically. Related applications of Apache Hive include SQL, Oracle and IBM DB2. Its architecture is divided into a MapReduce-oriented execution layer, a metadata store for data storage, and an execution part that receives queries from users or applications.

The advantage of Hive is that it is more secure and its implementations are good and well tuned. The disadvantage of Hive is that it suits only ad hoc queries and its performance is lower compared to Pig.
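As a brief sketch of how HiveQL is commonly submitted, the Java fragment below uses the standard JDBC interface against a HiveServer2 endpoint (the hive-jdbc driver must be on the classpath). The host name, port, user, table and column names are illustrative assumptions rather than details from this paper.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 commonly listens on port 10000; adjust host and credentials as needed.
    String url = "jdbc:hive2://hive-server:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {
      // A HiveQL query that Hive compiles into one or more MapReduce jobs.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}

From the application's point of view the query looks like ordinary SQL; the translation to distributed MapReduce execution is handled entirely by Hive.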
D. NoSQL: A NoSQL database is an approach to data management and data design that is useful for very large sets of distributed data. These databases are generally part of the real-time event processing deployed on inbound channels, but they can also be seen as an enabling technology for analytical capabilities such as relative search applications. Such uses are only made feasible by the elastic nature of the NoSQL model, where the dimensionality of a query evolves from the data in scope and domain rather than being fixed by the developer in advance. NoSQL is useful when an enterprise needs to access huge amounts of unstructured data. There are more than one hundred NoSQL approaches that specialize in the management of different multimodal data types (from structured to unstructured) and aim to solve very specific challenges. Data scientists, researchers and business analysts in particular pay attention to this agile approach, which leads to earlier insights into data sets that might remain concealed or constrained under a more formal development process. The most popular NoSQL database is Apache Cassandra. The advantages of NoSQL are that it is open source, horizontally scalable, easy to use, able to store complex data types, and very fast for adding new data and for simple operations and queries. The disadvantages of NoSQL are immaturity, no indexing support, no ACID guarantees, complex consistency models, and absence of standardization.
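To illustrate the kind of simple operations a NoSQL store such as Apache Cassandra is fast at, the sketch below uses the DataStax Java driver to create a table, insert a row and read it back. The keyspace, table and column names are invented for this example, and the driver is assumed to connect to a single node on localhost with its default settings.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

public class CassandraExample {
  public static void main(String[] args) {
    // With no explicit contact point, the driver connects to 127.0.0.1:9042.
    try (CqlSession session = CqlSession.builder().build()) {
      session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
          + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
      session.execute("CREATE TABLE IF NOT EXISTS demo.events "
          + "(device_id text, ts timestamp, reading double, "
          + "PRIMARY KEY (device_id, ts))");
      session.execute("INSERT INTO demo.events (device_id, ts, reading) "
          + "VALUES ('sensor-1', toTimestamp(now()), 21.5)");
      // Query by partition key; the schema, not the query planner, fixes access paths.
      for (Row row : session.execute(
          "SELECT ts, reading FROM demo.events WHERE device_id = 'sensor-1'")) {
        System.out.println(row.getInstant("ts") + " " + row.getDouble("reading"));
      }
    }
  }
}

Note how the table is modeled around the query (partitioned by device), which is typical of the NoSQL approach described above: the data model, rather than the developer's ad hoc query, determines what can be retrieved efficiently.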
E. HPCC: HPCC is an open source computing platform that provides services for handling massive big data workflows. The HPCC data model is defined by the user according to requirements. The HPCC system was proposed and then designed to manage the most complex and data-intensive analytical problems. It is a single platform with a single architecture and a single programming language used for data simulation. HPCC was designed to analyze gigantic amounts of data for the purpose of solving complex big data problems. HPCC is based on the Enterprise Control Language (ECL), which has a declarative, non-procedural nature. The main components of HPCC are:

HPCC Data Refinery: mostly uses a parallel ETL engine.
HPCC Data Delivery: massively based on a structured query engine.

The Enterprise Control Language distributes the workload evenly between the nodes.
IV. EXPERIMENTS AND ANALYSIS

Big-Data System Architecture: In this section, we focus on the value chain for big data analytics. Specifically, we describe a big data value chain that consists of four stages (generation, acquisition, storage, and processing). Next, we present a big data technology map that associates the leading technologies in this domain with specific phases in the big data value chain and a timestamp.

Big-Data System: A Value-Chain View. A big-data system is complex, providing functions to deal with different phases in the digital data life cycle, ranging from its birth to its destruction. At the same time, the system usually involves multiple distinct phases for different applications. In this case, we adopt a systems-engineering approach, well accepted in industry, to decompose a typical big-data system into four consecutive phases: data generation, data acquisition, data storage, and data analytics, as illustrated on the horizontal axis of Fig. 3. Notice that data visualization is an assistive method for data analysis; in general, one visualizes the data to find rough patterns first, and then employs specific data mining methods. The details of each phase are explained as follows.
Data generation concerns how data are
generated. In this case, the term ``big data'' is
designated to mean large, diverse, and complex
datasets that are generated from various
longitudinal and/or distributed data sources,
including sensors, video, click streams, and
other available digital sources. Normally, these
datasets are associated with different levels of
domain-specific values. In this paper, we focus
on datasets from three prominent domains,
business, Internet, and scientific research, for
which values are relatively easy to
understand. However, there are overwhelming technical challenges in collecting, processing,
and analyzing these datasets that demand new
solutions to embrace the latest advances in the
information and communications technology
(ICT) domain.
Data acquisition refers to the process of obtaining information and is subdivided into data collection, data transmission, and data pre-processing. First, because data may come from a diverse set of sources (e.g., websites that host formatted text, images and/or videos), data collection refers to dedicated data collection technology that acquires raw data from a specific data production environment. Second, after collecting raw data, we need a high-speed transmission mechanism to transmit the data into the proper storage system for various types of analytical applications.

Fig. 11 Big data technology map. It pivots on two axes, i.e., the data value chain and a timeline. The data value chain divides the data lifecycle into four stages: data generation, data acquisition, data storage, and data analytics. In each stage, exemplary technologies over the past 10 years are highlighted.

Finally, collected datasets might contain much meaningless data, which unnecessarily increases the amount of storage space required and affects the subsequent data analysis. For instance, redundancy is common in most datasets collected from sensors deployed to monitor the environment, and we can use data compression technology to address this issue. Thus, we must perform data pre-processing operations for efficient storage and mining.
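As a small, self-contained sketch of such pre-processing, under the assumption that exact consecutive duplicate sensor readings carry no extra value, the Java code below drops those duplicates and gzip-compresses the remaining lines before they are handed to the storage layer. The file names and the de-duplication rule are illustrative only.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class SensorPreprocess {
  public static void main(String[] args) throws Exception {
    String previous = null;
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
             new FileInputStream("sensor_raw.log"), StandardCharsets.UTF_8));
         BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
             new GZIPOutputStream(new FileOutputStream("sensor_clean.log.gz")),
             StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        // Drop empty lines and exact consecutive duplicates (common for idle sensors).
        if (line.isEmpty() || line.equals(previous)) {
          continue;
        }
        out.write(line);
        out.newLine();
        previous = line;
      }
    }
  }
}

Even this trivial filter-and-compress step can shrink highly redundant sensor logs substantially before they reach the storage and analytics stages.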
Data storage concerns persistently storing and managing large-scale datasets. A data storage system can be divided into two parts: hardware infrastructure and data management. Hardware infrastructure consists of a pool of shared ICT resources organized in an elastic way for various tasks in response to their instantaneous demand. The hardware infrastructure should be able to scale up and out, and to be dynamically reconfigured to address different types of application environments. Data management software is deployed on top of the hardware infrastructure to maintain large-scale datasets. Additionally, to analyze or interact with the stored data, storage systems must provide several interface functions, fast querying, and other programming models.
Data analysis leverages analytical methods or tools to inspect, transform, and model data to extract value. Many application fields leverage opportunities presented by abundant data and domain-specific analytical methods to derive the intended impact. Although various fields pose different application requirements and data characteristics, a few of these fields may leverage similar underlying technologies. Emerging analytics research can be classified into six critical technical areas: structured data analytics, text analytics, multimedia analytics, web analytics, network analytics, and mobile analytics.
Big-Data System: A Layered View:
Alternatively, the big data system can be
decomposed into a layered structure, as illustrated in Fig.
12.
The layered structure is divisible into three
layers, i.e., the infrastructure layer, the
computing layer, and the application layer, from
bottom to top. This layered view only provides a
conceptual hierarchy to underscore the
complexity of a big data system. The function of
each layer is as follows.
The infrastructure layer consists of a pool of ICT resources, which can be organized by cloud computing infrastructure and enabled by virtualization technology. These resources are exposed to upper-layer systems in a fine-grained manner with a specific service-level agreement (SLA). Within this model, resources must be allocated to meet the big data demand while achieving resource efficiency by maximizing system utilization, energy awareness, operational simplification, and so on.
The computing layer encapsulates various data tools into a middleware layer that runs over raw ICT resources. In the context of big data, typical tools include data integration, data management, and the programming model. Data integration means acquiring data from disparate sources and integrating the dataset into a unified form with the necessary data pre-processing operations. Data management refers to mechanisms and tools that provide persistent data storage and highly efficient management, such as distributed file systems and SQL or NoSQL data stores. The programming model implements abstract application logic and facilitates data analysis applications. MapReduce, Dryad, Pregel, and Dremel exemplify such programming models.
The application layer exploits the interface provided by the programming models to implement various data analysis functions, including querying, statistical analyses, clustering, and classification; it then combines basic analytical methods to develop various field-related applications. McKinsey presented potential big data application domains: health care, public sector administration, retail, global manufacturing, and personal location data.
Big-Data System Challenges: Designing and
deploying a big data analytics system is not a
trivial or straightforward task. As one of its
definitions suggests, big data is beyond the
capability of current hardware and software
platforms. The new hardware and software
platforms in turn demand new infrastructure and
models to address the wide range of challenges
of big data. Recent works have discussed
potential obstacles to the growth of big data
applications. In this paper, we strive to classify
these challenges into three categories: data
collection and management, data analytics, and
system issues. Data collection and management
addresses massive amounts of heterogeneous
and complex data. The following challenges of
big data must be met:
Data Representation: Many datasets are heterogeneous in type, structure, semantics, organization, granularity, and accessibility. A competent data representation should be designed to reflect the structure, hierarchy, and diversity of the data, and an integration technique should be designed to enable efficient operations across different datasets.
Redundancy Reduction and Data Compression: Typically, there is a large amount of redundant data in raw datasets. Redundancy reduction and data compression without sacrificing potential value are efficient ways to lessen overall system overhead.
Data Life-Cycle Management: Pervasive sensing and computing are generating data at an unprecedented rate and scale that far exceed the much smaller advances in storage system technologies. One of the urgent challenges is that current storage systems cannot host the massive data. In general, the value concealed in big data depends on data freshness; therefore, we should set up a data importance principle associated with the analysis value to decide which parts of the data should be archived and which parts should be discarded.
Data Privacy and Security: With the proliferation of online services and mobile phones, privacy and security concerns regarding accessing and analyzing personal information are growing. It is critical to understand what support for privacy must be provided at the platform level to eliminate privacy leakage and to facilitate various analyses. There will be a significant impact resulting from advances in big data analytics, including interpretation, modeling, prediction, and simulation. Unfortunately, massive amounts of data, heterogeneous data structures, and diverse applications present tremendous challenges, such as the following.
Approximate Analytics: As data sets grow and the real-time requirement becomes stricter, analysis of the entire dataset is becoming more difficult. One way to potentially solve this problem is to provide approximate results, for example by means of an approximate query. The notion of approximation has two dimensions: the accuracy of the result and the groups omitted from the output.
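A minimal sketch of the idea, assuming a simple uniform random sample is acceptable: the Java code below estimates the average of a large array from a sample and reports a rough 95% confidence half-width (1.96 standard errors), trading accuracy for speed. The data and sample size are artificial.

import java.util.Random;

public class ApproxQuery {
  // Approximate average of 'data' using sampling with replacement,
  // returned together with a rough 95% confidence half-width.
  static double[] approximateAverage(double[] data, int sampleSize, long seed) {
    Random rnd = new Random(seed);
    double sum = 0, sumSq = 0;
    for (int i = 0; i < sampleSize; i++) {
      double x = data[rnd.nextInt(data.length)];
      sum += x;
      sumSq += x * x;
    }
    double mean = sum / sampleSize;
    double variance = sumSq / sampleSize - mean * mean;
    double halfWidth = 1.96 * Math.sqrt(variance / sampleSize);
    return new double[] {mean, halfWidth};
  }

  public static void main(String[] args) {
    double[] data = new double[1_000_000];
    for (int i = 0; i < data.length; i++) {
      data[i] = i;                       // stand-in for a large measurement column
    }
    double[] est = approximateAverage(data, 10_000, 42L);
    System.out.printf("approx mean = %.1f +/- %.1f%n", est[0], est[1]);
  }
}

Scanning 10,000 sampled values instead of a million gives an answer with a quantified error bound, which is exactly the accuracy dimension of approximation mentioned above; groups absent from the sample illustrate the second dimension.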
Connecting Social Media: Social media possesses unique properties, such as vastness, statistical redundancy and the availability of user feedback. Various extraction techniques have been successfully used to identify references from social media to specific product names, locations, or people on websites. By connecting other data sources with social media, applications can achieve high levels of precision and distinct points of view.
Deep Analytics: One of the drivers of excitement around big data is the expectation of gaining novel insights. Sophisticated analytical technologies, such as machine learning, are necessary to unlock such insights. However, effectively leveraging these analysis toolkits requires an understanding of probability and statistics. The potential pillars of privacy and security mechanisms are mandatory access control and secure communication, multi-granularity access control, privacy-aware data mining and analysis, and secure storage and management. Finally, large-scale parallel systems generally confront several common issues; however, the emergence of big data has amplified the following challenges in particular.
Energy Management: The energy consumption of large-scale computing systems has attracted greater concern from economic and environmental perspectives. Data transmission, storage, and processing will inevitably consume progressively more energy as data volume and analytics demand increase. Therefore, system-level power control and management mechanisms must be considered in a big data system, while continuing to provide extensibility and accessibility.
Scalability: A big data analytics system must be able to support very large datasets created now and in the future. All the components in big data systems must be capable of scaling to address the ever-growing size of complex datasets.
Collaboration: Big data analytics is an interdisciplinary research field that requires specialists from multiple professional fields collaborating to mine hidden values. A comprehensive big data cyberinfrastructure is necessary to allow broad communities of scientists and engineers to access the diverse data, apply their respective expertise, and cooperate to accomplish the goals of analysis.
V. CONCLUSIONS
In this paper we have surveyed various technologies for handling big data and their architectures. We have also discussed the challenges of big data (volume, variety, velocity, value, veracity) and the advantages and disadvantages of these technologies. The paper discussed an architecture using Hadoop HDFS distributed data storage, real-time NoSQL databases, and MapReduce distributed data processing over a cluster of commodity servers. The main goal of our paper was to survey the various big data handling techniques that handle massive amounts of data from different sources and improve the overall performance of systems.
REFERENCES
[1] Yuri Demchenko, "The Big Data Architecture Framework (BDAF)", Outcome of the Brainstorming Session at the University of Amsterdam, 17 July 2013.
[2] Amogh Pramod Kulkarni, Mahesh Khandewal, "Survey on Hadoop and Introduction to YARN", International Journal of Emerging Technology and Advanced Engineering (ISSN 2250-2459, ISO 9001:2008 Certified Journal), Volume 4, Issue 5, May 2014.
[3] S. Sagiroglu, D. Sinanc, "Big Data: A Review", 2013, pp. 20-24.
[4] Vibhavari Chavan, Rajesh N. Phursule, "Survey Paper on Big Data", International Journal of Computer Science and Information Technologies, Vol. 5 (6), 2014.
[5] Margaret Rouse, "Unstructured Data", April 2010.
[6] Kyuseok Shim, "MapReduce Algorithms for Big Data Analysis", DNIS 2013, LNCS 7813, pp. 44-48, 2013.
[7] X. L. Dong, D. Srivastava, "Big Data Integration", IEEE International Conference on Data Engineering (ICDE), 29 (2013), pp. 1245-1248.
[8] F. Tekiner, J. A. Keane, "Big Data Framework", 2013 IEEE International Conference on Systems, Man and Cybernetics (SMC), 13-16 Oct. 2013, pp. 1494-1499.
[9] Mrigank Mridul, Akashdeep Khajuria, Snehasish Dutta, Kumar N, "Analysis of Big Data using Apache Hadoop and Map Reduce", Volume 4, Issue 5, May 2014.
[10] Suman Arora, Madhu Goel, "Survey Paper on Scheduling in Hadoop", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 5, May 2014.
[11] Aditya B. Patel, Manashvi Birla, Ushma Nair, "Addressing Big Data Problem Using Hadoop and Map Reduce", in Proc. 2012 Nirma University International Conference on Engineering.
[12] Jimmy Lin, "MapReduce Is Good Enough?", IEEE Computer, 32 (2013).