BIG DATA ANALYTICS A RESEARCH REPORT
Mrs. Sowmya Koneru Associate Professor, NRI Institute of Technology, Agiripalli, Vijayawada, Andhra Pradesh
Abstract— Big data is the term for any collection of datasets so large and complex that it becomes difficult to process using traditional data processing applications. The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from datasets that are diverse, complex, and of a massive scale. A big data environment is used to acquire, organize and analyze many types of data. Data that is very large in volume, highly diverse in variety, or moving with great velocity is called big data. Analyzing big data is a challenging task, as it involves large distributed file systems which should be fault tolerant, flexible and scalable. The technologies used by big data applications to handle massive data include Hadoop, MapReduce, Apache Hive, NoSQL and HPCC. First, we present the definition of big data and discuss big data challenges. Next, we present a systematic framework to decompose big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics. These four modules form a big data value chain. Following that, we present a detailed survey of the materials and methods used in the research and industry communities. In addition, we present the prevalent Hadoop framework for addressing big data. Finally, we outline big data system architecture and present key challenges and research directions for big data systems.

Keywords— Big Data, Hadoop, MapReduce, Apache Hive, NoSQL, HPCC
I. INTRODUCTION

Big data is one of the biggest buzz phrases in the IT domain. New personal communication technologies drive the big data trend, and the Internet population grows day by day yet never reaches 100%. The need for big data arises from large companies such as Facebook, Yahoo, Google and YouTube, which must analyze enormous amounts of data in both unstructured and structured form. Google alone holds a vast amount of information, so there is a need for big data analytics, the processing of complex and massive datasets. This data differs from structured data in terms of five parameters: volume, variety, velocity, value and veracity (the 5 V's). These five V's are also the challenges of big data management:

1. Volume: Data of all types (KB, MB, TB, PB, ZB, YB) grows day by day and results in very large files. Excessive volume of data is the main storage issue, which is addressed by reducing storage cost. Data volumes are expected to grow 50 times by 2020.

2. Variety: Data sources are extremely heterogeneous. Files come in various formats and of any type; they may be structured or unstructured, such as text, audio, video, log files and more. The varieties are endless, and the data enters the network without having been quantified or qualified in any way.

3. Velocity: Data arrives at high speed. Sometimes one minute is too late, so big data is time sensitive. For some organizations, data velocity is the main challenge: social media messages and credit card transactions are completed in milliseconds, and the data they generate must be put into databases.

4. Value: Value is the most important V in big data. It is the main motivation for big data, because it matters to businesses and IT infrastructure systems to store large amounts of value in databases.

5. Veracity: Veracity refers to the spread of values typical of a large dataset. When dealing with high volume, velocity and variety of data, not all of the data will be 100% correct; there will be dirty data. Big data and analytics technologies must work with these types of data.

Huge volumes of data (both structured and unstructured) must be managed, administered and governed by organizations. Unstructured data is data that is not held in a database; it may be text, verbal data or data in another form. Textual unstructured data includes PowerPoint presentations, email messages, Word documents and instant messages. Data in other formats can be .jpg and .png images and audio files.
Fig.2 illustrates a general big data network
model with MapReduce. A distinct application
in the cloud has put demanding requirements for
acquisition, transportation and analytics of
structured and unstructured data. The challenges
include analysis, capture, curation, search,
sharing, storage, transfer, visualization, and
privacy violations. The trend to larger data sets
is due to the additional information derivable
from analysis of a single large set of related
data, as compared to separate smaller sets with
the same total amount of data, allowing
correlations to be found to "spot business trends,
prevent diseases, and combat crime and so on".
Scientists regularly encounter limitations due to
large data sets in many areas, including
meteorology, genomics, connectomics, complex
physics simulations, and biological and
environmental research. The limitations also
affect Internet search, finance and business
informatics. Data sets grow in size in part
because they are increasingly being gathered by
ubiquitous information-sensing mobile devices,
aerial sensory technologies (remote sensing),
software logs, cameras, microphones, radio-
frequency identification (RFID) readers, and
wireless sensor networks. The world's
technological per-capita capacity to store
information has roughly doubled every 40
months since the 1980s; as of 2012, about 2.5 exabytes (2.5×10^18 bytes) of data were created every day.
The challenge for large enterprises is
determining who should own big data initiatives
that straddle the entire organization.
The history of big data can be roughly split into the following stages.

Megabyte to Gigabyte: In the 1970s and 1980s, historical business data introduced the earliest "big data" challenge in moving from megabyte to gigabyte sizes. The urgent need at that time was to house that data and run relational queries for business analyses and reporting. Research efforts gave birth to the "database machine", which featured integrated hardware and software
to solve problems. The underlying philosophy
was that such integration would provide better
performance at lower cost. After a period of
time, it became clear that hardware-specialized
database machines could not keep pace with the
progress of general-purpose computers. Thus,
the descendant database systems are software systems that impose few constraints on hardware
and can run on general-purpose computers.
Gigabyte to Terabyte: In the late 1980s, the
popularization of digital technology caused data
volumes to expand to several gigabytes or even
a terabyte, which is beyond the storage and/or
processing capabilities of a single large
computer system. Data parallelization was
proposed to extend storage capabilities and to
improve performance by distributing data and
related tasks, such as building indexes and
evaluating queries, into disparate hardware.
Based on this idea, several types of parallel databases were built, including shared-memory databases, shared-disk databases, and shared-nothing databases, each induced by the underlying hardware architecture.

Terabyte to Petabyte: During the late 1990s, when the database community was admiring its "finished" work on the parallel database, the rapid development of Web 1.0 led the whole world into the Internet era, along with massive semi-structured or unstructured web pages holding terabytes or petabytes (PBs) of data. The resulting need for search companies was to index and query the mushrooming content of the web. Unfortunately, although parallel databases handle structured data well, they provide little support for unstructured data. Additionally, system capabilities were limited to no more than several terabytes.
Petabyte to Exabyte: Under current development trends, data stored and analyzed by big companies will undoubtedly reach PB to exabyte magnitude soon. However, current technology still handles only terabyte-to-PB scale data; no revolutionary technology has yet been developed to cope with larger datasets.
II. LITERATURE SURVEY
1. Hadoop MapReduce is a large-scale, open source software framework dedicated to scalable, distributed, data-intensive computing. The framework breaks up large data into smaller parallelizable chunks and handles scheduling: it maps each piece to an intermediate value, reduces the intermediate values to a solution, and offers user-specified partition and combiner options. It is fault tolerant, reliable, and supports thousands of nodes and petabytes of data. If an algorithm can be rewritten as MapReduce jobs and the problem can be broken into small pieces solvable in parallel, then Hadoop MapReduce is a suitable distributed problem-solving approach for large datasets; it is tried and tested in production and has many implementation options. One can also design a data-aware cache framework that requires minimal change to the original MapReduce programming model in order to provision incremental processing for big data applications using the MapReduce model [4].
2. The author [2] stated the importance of some of the technologies that handle big data, such as Hadoop, HDFS and MapReduce. The author described the various schedulers used in Hadoop and the technical aspects of Hadoop, and also focused on the importance of YARN, which overcomes the limitations of MapReduce.
3. The author [3] surveyed various technologies for handling big data and their architectures, discussed the challenges of big data (volume, variety, velocity, value, veracity), and listed the advantages and disadvantages of these technologies. The paper discussed an architecture using Hadoop HDFS distributed data storage, real-time NoSQL databases, and MapReduce distributed data processing over a cluster of commodity servers. Its main goal was to survey the various big data handling techniques that handle massive amounts of data from different sources and improve the overall performance of systems.
4. The author [1] continues with the big data definition and enhances the definition given in [3] that includes the 5V big data properties (volume, variety, velocity, value, veracity), and suggests other dimensions for big data analysis and taxonomy, in particular comparing and contrasting big data technologies in e-Science, industry, business, social media and healthcare. With a long tradition of working with constantly increasing volumes of data, modern e-Science can offer industry its scientific analysis methods, while industry can bring advanced and fast-developing big data technologies and tools to science and the wider public [1].
5. The author [6] stated that the need to process enormous quantities of data has never been greater. Not only are terabyte- and petabyte-scale datasets rapidly becoming commonplace, but there is consensus that great value lies buried in them, waiting to be unlocked by the right computational tools. In the commercial sphere, business intelligence, driven by the ability to gather data from a dizzying array of sources and by big data analysis tools such as MapReduce over Hadoop and HDFS, promises to help organizations better understand their customers and the marketplace, hopefully leading to better business decisions and competitive advantages [3].
6. The author [5] stated that there is a need to maximize returns on BI investments and to overcome difficulties. The problems and new trends mentioned in that article, and the solutions found by combining advanced tools, techniques and methods, would help readers in BI projects and implementations. BI vendors are struggling and making a continuous effort to deliver technical capabilities and to provide complete out-of-the-box solutions with sets of tools and techniques. In 2014, due to rapid change in BI maturity, BI teams faced a tough time building infrastructure with less-skilled resources. Consolidation and convergence are ongoing, and the market is coming up with a wide range of new technologies; still, the ground is immature and in a state of rapid evolution.
7. The author [8] gave an important emerging framework model design for big data analytics and a 3-tier architecture model for big data in data mining. The proposed 3-tier architecture model is more scalable in working with different environments and also helps overcome the main issues in big data analytics of storing, analyzing and visualizing. The framework model is given for Hadoop HDFS distributed data storage, real-time NoSQL databases, and MapReduce distributed data processing over a cluster of commodity servers.
8. A big data framework needs to consider complex relationships between samples, models and data sources, along with their evolving changes over time and other possible factors. To support big data mining, high-performance computing platforms are required. With big data technologies [3] we will hopefully be able to provide the most relevant and most accurate social sensing feedback to better understand our society in real time [7].
9. There are many scheduling techniques available to improve job performance, but every technique has some limitation, so no single technique can overcome all the parameters that affect whole-system performance, such as data locality, fairness, load balance, the straggler problem and deadline constraints. Each technique has advantages over the others, so if we combine or interchange some techniques, the result will be even better than any individual scheduling technique [10].
10. The author [9] describes the concept of big data along with the 3 Vs of big data: volume, velocity and variety. The paper also focuses on big data processing problems. These technical challenges must be addressed for efficient and fast processing of big data. The challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, error handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation. These technical challenges are common across a large variety of application domains and are therefore not cost-effective to address in the context of one domain alone. The paper describes Hadoop, an open source software framework used for processing big data.

11. The author [12] proposed a system based on the implementation of online aggregation of MapReduce in Hadoop for efficient big data processing. Traditional MapReduce implementations materialize the intermediate results of mappers and do not allow pipelining between the map and the reduce phases. This approach has the advantage of simple recovery in the case of failures; however, reducers cannot start executing tasks before all mappers have finished. Because MapReduce Online is a modified version of Hadoop MapReduce, it supports online aggregation and stream processing, while also improving utilization and reducing response time.

12. The author [11] stated that, learning from the application studies, they explore the design space for supporting data-intensive and compute-intensive applications on large data-center-scale computer systems. Traditional data processing and storage approaches are facing many challenges in meeting the continuously increasing computing demands of big data. This work focused on MapReduce, one of the key enabling approaches for meeting big data demands by means of highly parallel processing on a large number of commodity nodes.
III. TECHNOLOGIES AND METHODS

Big data is a new concept for handling massive data; therefore, the architectural description of this technology is very new. Different technologies use almost the same approach, i.e., to distribute the data among various local agents and reduce the load on the main server so that traffic can be avoided. There are endless articles, books and periodicals that describe big data from a technology perspective, so we will instead focus our efforts here on setting out some basic principles and the minimum technology foundation needed to relate big data to the broader information management (IM) domain.
A. Hadoop

Hadoop is a framework that can run applications on systems with thousands of nodes and terabytes of data. It distributes files among the nodes and allows the system to continue working in case of a node failure. This approach reduces the risk of catastrophic system failure. An application is broken into smaller parts (fragments or blocks). Apache Hadoop consists of the Hadoop kernel, the Hadoop Distributed File System (HDFS), MapReduce, and related projects such as ZooKeeper, HBase and Apache Hive. The Hadoop Distributed File System consists of three components: the NameNode, the Secondary NameNode and the DataNodes. The multilevel secure (MLS) environment problems of Hadoop can be addressed by using the Security-Enhanced Linux (SELinux) protocol, in which multiple sources of Hadoop applications run at different levels; this protocol is an extension of the Hadoop distributed file system. Hadoop is commonly used for distributed batch index building; it is desirable to optimize the indexing capability in near real time. Hadoop provides components for storage and analysis for large-scale processing. Nowadays Hadoop is used by hundreds of companies. The advantages of Hadoop are distributed storage and computational capabilities: it is extremely scalable, optimized for high throughput, uses large block sizes, and is tolerant of software and hardware failure.
Components of Hadoop:

HBase: An open source, distributed, non-relational database system implemented in Java. It runs above the HDFS layer and can serve well-structured input and output for MapReduce jobs.

Oozie: A web application that runs in a Java servlet. Oozie uses a database to store information about workflows, which are collections of actions, and manages Hadoop jobs in an orderly way.

Sqoop: A command-line interface application that provides a platform for converting data between relational databases and Hadoop, in either direction.

Avro: A system that provides data serialization and data exchange services, used primarily within Apache Hadoop. These services can be used together or independently, according to the data records.

Chukwa: A framework for data collection and analysis, used to process and analyze massive amounts of logs. It is built on top of HDFS and the MapReduce framework.

Pig: A high-level platform built on top of the MapReduce framework and used with Hadoop. It is a high-level data processing system in which data records are analyzed using a high-level language.

ZooKeeper: A centralized service that provides distributed synchronization and group services, along with maintenance of configuration information and records.

Hive: A data warehouse application that provides an SQL-like interface and a relational model. The Hive infrastructure is built on top of Hadoop and helps provide summarization and analysis for queries.

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Doug Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project. Hadoop is open source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers. Hadoop is:
Reliable: The software is fault tolerant; it expects and handles hardware and software failures.

Scalable: Designed for massive scale in processors, memory, and locally attached storage.

Distributed: Handles replication and offers a massively parallel programming model, MapReduce.
Hadoop is an open source implementation of a large-scale batch processing system. It uses the MapReduce framework introduced by Google, leveraging the concepts of map and reduce functions well known in functional programming. Although the Hadoop framework is written in Java, it allows developers to deploy custom programs written in Java or any other language to process data in a parallel fashion across hundreds or thousands of commodity servers. It is optimized for contiguous read requests (streaming reads), where processing involves scanning all the data. Depending on the complexity of the process and the volume of data, response time can vary from minutes to hours. While Hadoop can process data fast, its key advantage is its massive scalability. Hadoop is currently being used for indexing web searches, email spam detection, recommendation engines, prediction in financial services, genome manipulation in life sciences, and analysis of unstructured data such as logs, text, and clickstreams. While many of these applications could in fact be implemented in a relational database management system (RDBMS), the core of the Hadoop framework is functionally different from an RDBMS. Hadoop is particularly useful when:

- Complex information processing is needed and unstructured data must be turned into structured data.
- Queries cannot be reasonably expressed using SQL, or algorithms are heavily recursive.
- Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing.
- Machine learning is involved and data sets are too large to fit into database RAM or disks, or require too many cores (tens of TB up to PBs).
- The value of the data does not justify the expense of constant real-time availability, for example archives or special-interest information, which can be moved to Hadoop and remain available at lower cost.
- Results are not needed in real time and fault tolerance is critical.
- Significant custom coding would otherwise be required to handle job scheduling.
Hadoop was inspired by Google's MapReduce, a
software framework in which an application is
broken down into numerous small parts. Any of
these parts (also called fragments or blocks) can
be run on any node in the cluster. Doug Cutting,
Hadoop's creator, named the framework after his
child's stuffed toy elephant. The current Apache
Hadoop ecosystem consists of the Hadoop
kernel, MapReduce, the Hadoop distributed file
system (HDFS) and a number of related projects
such as Apache Hive, HBase and Zookeeper.
The Hadoop framework is used by major players
including Google, Yahoo and IBM, largely for
applications involving search engines and
advertising. The preferred operating systems are Windows and Linux, but Hadoop can also work with BSD and OS X.

HDFS: The Hadoop Distributed File System (HDFS) is the file system component of the Hadoop framework. HDFS is designed and optimized to store data over a large amount of low-cost hardware in a distributed fashion.

Name Node: The NameNode is a type of master node that holds metadata about all the DataNodes: their addresses (used to communicate with them), free space, the data they store, which DataNodes are active and which are passive, the task tracker, the job tracker, and other configuration such as data replication.
The NameNode records all of the metadata, attributes, and locations of files and data blocks in the DataNodes. The attributes it records include file permissions, file modification and access times, and the namespace, which is a hierarchy of files and directories. The NameNode maps the namespace tree to file blocks in the DataNodes. When a client node wants to
read a file in HDFS, it first contacts the NameNode to receive the locations of the data blocks associated with that file. The NameNode stores information about the overall system because it is the master of HDFS, with the DataNodes being the slaves. It stores the image and journal logs of the system, and must always store the most up-to-date image and journal. Basically, the NameNode always knows where the data blocks and replicas are for each file, and it also knows where the free blocks are in the system, so it keeps track of where future files can be written.
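As a minimal illustration of this read path, the Java sketch below uses the standard Hadoop FileSystem client API to open a file in HDFS and stream it to the console. The NameNode address (hdfs://namenode:9000) and the file path are placeholder assumptions for this example, not values taken from the paper.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The client first asks the NameNode for the block locations of the file,
    // then streams the blocks directly from the DataNodes that hold them.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
    try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}

The block-location lookup and DataNode streaming happen inside the client library; application code only sees an ordinary input stream.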
Data Node: The DataNode is a type of slave node in Hadoop used to store the data; a task tracker in the DataNode tracks the jobs running on that node and the jobs arriving from the NameNode. The DataNodes store the blocks and block replicas of the file system. During startup, each DataNode connects to and performs a handshake with the NameNode. The DataNode checks for the correct namespace ID, and if it is not found the DataNode automatically shuts down. New DataNodes can join the cluster by simply registering with the NameNode and receiving the namespace ID. Each DataNode maintains a block report for the blocks on its node and sends it to the NameNode every hour, so that the NameNode always has an up-to-date view of where block replicas are located in the cluster. During normal operation of HDFS, each DataNode also sends periodic heartbeats to the NameNode so that the NameNode knows which DataNodes are operating correctly and are available.

The base Apache Hadoop framework is composed of the following modules:
- Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
- Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
- Hadoop MapReduce – a programming model for large-scale data processing.
All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common and should be automatically handled in software by the framework. Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System (GFS) papers. "Hadoop" often refers not just to the base Hadoop package but to the Hadoop ecosystem (Fig. 6), which includes all of the additional software packages that can be installed on top of or alongside Hadoop, such as Apache Hive, Apache Pig and Apache Spark.
B. MapReduce: MapReduce was introduced by Google in order to process and store large datasets on commodity hardware. MapReduce is a model for processing large-scale data records in clusters. The MapReduce programming model is based on two functions, map() and reduce(). Users can specify their own processing logic through well-defined map() and reduce() functions. In the map phase, the master node takes the input, divides it into smaller sub-modules, and distributes them to slave nodes. A slave node may further divide the sub-modules, leading to a hierarchical tree structure. Each slave node processes its base problem and passes the result back to the master node. The MapReduce system groups all intermediate pairs by their intermediate keys and passes them to the reduce() function, which, acting as the master node, collects the results from all the sub-problems and combines them to form the final output.
Map(in_key, in_value) -> list(out_key, intermediate_value)
Reduce(out_key, list(intermediate_value)) -> list(out_value)

The parameters of the map() and reduce() functions are as follows:

map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(v2)
A MapReduce framework is based on a master-slave architecture, where one master node manages a number of slave nodes. MapReduce works by first dividing the input data set into even-sized data blocks for equal load distribution. Each data block is then assigned to one slave node, processed by a map task, and a result is generated. The slave node notifies the master node when it is idle, and the scheduler then assigns new tasks to it. The scheduler takes data locality and resources into consideration when it disseminates data blocks.
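For concreteness, the listing below is the classic word-count job expressed with the Apache Hadoop MapReduce Java API: the map() function emits (word, 1) pairs and the reduce() function sums the counts for each word. Input and output paths are taken from the command line; this is a minimal sketch rather than a tuned production job, and class names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for every input line, emit (word, 1) for each token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum all counts emitted for the same word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework handles splitting the input, scheduling map and reduce tasks near the data, and re-running failed tasks; the user supplies only the two functions and the job configuration.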
Figure 7 shows the MapReduce architecture and how it works. The scheduler always tries to allocate a local data block to a slave node. If that attempt fails, the scheduler assigns a rack-local or random data block to the slave node instead of a local one. When the map() functions complete their tasks, the runtime system gathers all intermediate pairs and launches a set of reduce tasks to produce the final output.
Large-scale data processing is a difficult task; managing hundreds or thousands of processors and handling parallelization and distributed environments make it even more difficult. MapReduce provides a solution to these issues: it supports distributed and parallel I/O scheduling, it is fault tolerant and scalable, and it has built-in processes for status reporting and monitoring of heterogeneous, large datasets such as those found in big data. It is a way of approaching and solving a given problem. Using the MapReduce framework, the efficiency and the time needed to retrieve data are quite manageable. To address the volume aspect, new techniques have been proposed to enable parallel processing using the MapReduce framework, such as the data-aware caching (Dache) framework, which makes only slight changes to the original MapReduce programming model to enhance processing for big data applications. The advantages of MapReduce are that a large variety of problems are easily expressible as MapReduce computations and that a cluster of machines can handle thousands of nodes with fault tolerance. The disadvantages of MapReduce are weak support for real-time processing, the fact that it is not always easy to implement, the cost of shuffling data, and its batch-oriented nature.
MapReduce Components:
1. Name Node: manages HDFS metadata; does not deal with files directly.
2. Data Node: stores blocks of HDFS; the default replication level for each block is 3.
3. Job Tracker: schedules, allocates and monitors job execution on the slaves (Task Trackers).
4. Task Tracker: runs MapReduce operations.
MapReduce Framework: MapReduce is a software framework for distributed processing of large data sets on computer clusters, first developed by Google. MapReduce is intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster. A typical Hadoop cluster integrates a MapReduce layer and an HDFS layer. In the MapReduce layer, the job tracker assigns tasks to the task trackers; the master node's job tracker also assigns tasks to the slave nodes' task trackers (Fig. 8).

The master node contains a job tracker node and a task tracker node (MapReduce layer), and a name node and a data node (HDFS layer). Each slave node contains a task tracker node (MapReduce layer) and a data node (HDFS layer). In short, the MapReduce layer has job and task tracker nodes, and the HDFS layer has name and data nodes.
C. Hive: Hive has been described as a distributed agent platform, a decentralized system for building applications by networking local system resources. Apache Hive is the data warehousing component of the cloud-based Hadoop ecosystem; it offers a query language called HiveQL that translates SQL-like queries into MapReduce jobs automatically. Related applications of Apache Hive include SQL, Oracle and IBM DB2. Its architecture is divided into a MapReduce-oriented execution layer, a metadata store for data storage, and an execution part that receives queries from users or applications.

The advantage of Hive is that it is more secure and its implementations are good and well tuned. The disadvantage of Hive is that it suits only ad hoc queries and its performance is lower compared to Pig.
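As a brief sketch of how HiveQL is commonly submitted, the Java fragment below uses the standard JDBC interface against a HiveServer2 endpoint (the hive-jdbc driver must be on the classpath). The host name, port, user, table and column names are illustrative assumptions rather than details from this paper.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 commonly listens on port 10000; adjust host and credentials as needed.
    String url = "jdbc:hive2://hive-server:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {
      // A HiveQL query that Hive compiles into one or more MapReduce jobs.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}

From the application's point of view the query looks like ordinary SQL; the translation to distributed MapReduce execution is handled entirely by Hive.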
D. NoSQL: A NoSQL database is an approach to data management and data design that is useful for very large sets of distributed data. These databases are generally part of the real-time event processing deployed on inbound channels, but they can also be seen as an enabling technology for analytical capabilities such as relative search applications. Such uses are only made feasible by the elastic nature of the NoSQL model, where the dimensionality of a query evolves from the data in scope and domain rather than being fixed by the developer in advance. NoSQL is useful when an enterprise needs to access huge amounts of unstructured data. There are more than one hundred NoSQL approaches that specialize in the management of different multimodal data types (from structured to unstructured) and aim to solve very specific challenges. Data scientists, researchers and business analysts in particular pay attention to this agile approach, which leads to earlier insights into data sets that might remain concealed or constrained under a more formal development process. The most popular NoSQL database is Apache Cassandra. The advantages of NoSQL are that it is open source, horizontally scalable, easy to use, able to store complex data types, and very fast for adding new data and for simple operations and queries. The disadvantages of NoSQL are immaturity, no indexing support, no ACID guarantees, complex consistency models, and absence of standardization.
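To illustrate the kind of simple operations a NoSQL store such as Apache Cassandra is fast at, the sketch below uses the DataStax Java driver to create a table, insert a row and read it back. The keyspace, table and column names are invented for this example, and the driver is assumed to connect to a single node on localhost with its default settings.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

public class CassandraExample {
  public static void main(String[] args) {
    // With no explicit contact point, the driver connects to 127.0.0.1:9042.
    try (CqlSession session = CqlSession.builder().build()) {
      session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
          + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
      session.execute("CREATE TABLE IF NOT EXISTS demo.events "
          + "(device_id text, ts timestamp, reading double, "
          + "PRIMARY KEY (device_id, ts))");
      session.execute("INSERT INTO demo.events (device_id, ts, reading) "
          + "VALUES ('sensor-1', toTimestamp(now()), 21.5)");
      // Query by partition key; the schema, not the query planner, fixes access paths.
      for (Row row : session.execute(
          "SELECT ts, reading FROM demo.events WHERE device_id = 'sensor-1'")) {
        System.out.println(row.getInstant("ts") + " " + row.getDouble("reading"));
      }
    }
  }
}

Note how the table is modeled around the query (partitioned by device), which is typical of the NoSQL approach described above: the data model, rather than the developer's ad hoc query, determines what can be retrieved efficiently.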
E. HPCC: HPCC is an open source computing platform that provides services for handling massive big data workflows. The HPCC data model is defined by the user according to requirements. The HPCC system was proposed and then designed to manage the most complex and data-intensive analytical problems. It is a single platform with a single architecture and a single programming language used for data simulation. HPCC was designed to analyze gigantic amounts of data for the purpose of solving complex big data problems. HPCC is based on the Enterprise Control Language (ECL), which has a declarative, non-procedural nature. The main components of HPCC are:

HPCC Data Refinery: mostly uses a parallel ETL engine.
HPCC Data Delivery: massively based on a structured query engine.

The Enterprise Control Language distributes the workload evenly between the nodes.
IV. EXPERIMENTS AND ANALYSIS

Big-Data System Architecture: In this section, we focus on the value chain for big data analytics. Specifically, we describe a big data value chain that consists of four stages (generation, acquisition, storage, and processing). Next, we present a big data technology map that associates the leading technologies in this domain with specific phases in the big data value chain and a timestamp.

Big-Data System: A Value-Chain View. A big-data system is complex, providing functions to deal with different phases in the digital data life cycle, ranging from its birth to its destruction. At the same time, the system usually involves multiple distinct phases for different applications. In this case, we adopt a systems-engineering approach, well accepted in industry, to decompose a typical big-data system into four consecutive phases: data generation, data acquisition, data storage, and data analytics, as illustrated on the horizontal axis of Fig. 3. Notice that data visualization is an assistive method for data analysis; in general, one visualizes the data to find rough patterns first, and then employs specific data mining methods. The details of each phase are explained as follows.
Data generation concerns how data are
generated. In this case, the term ``big data'' is
designated to mean large, diverse, and complex
datasets that are generated from various
longitudinal and/or distributed data sources,
including sensors, video, click streams, and
other available digital sources. Normally, these
datasets are associated with different levels of
domain-specific values. In this paper, we focus
on datasets from three prominent domains,
business, Internet, and scientific research, for
which values are relatively easy to
understand. However, there are overwhelming technical challenges in collecting, processing,
and analyzing these datasets that demand new
solutions to embrace the latest advances in the
information and communications technology
(ICT) domain.
Data acquisition refers to the process of obtaining information and is subdivided into data collection, data transmission, and data pre-processing. First, because data may come from a diverse set of sources (e.g., websites that host formatted text, images and/or videos), data collection refers to dedicated data collection technology that acquires raw data from a specific data production environment. Second, after collecting raw data, we need a high-speed transmission mechanism to transmit the data into the proper storage system for various types of analytical applications.

Fig. 11 Big data technology map. It pivots on two axes, i.e., the data value chain and a timeline. The data value chain divides the data lifecycle into four stages: data generation, data acquisition, data storage, and data analytics. In each stage, exemplary technologies over the past 10 years are highlighted.

Finally, collected datasets might contain much meaningless data, which unnecessarily increases the amount of storage space required and affects the subsequent data analysis. For instance, redundancy is common in most datasets collected from sensors deployed to monitor the environment, and we can use data compression technology to address this issue. Thus, we must perform data pre-processing operations for efficient storage and mining.
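As a small, self-contained sketch of such pre-processing, under the assumption that exact consecutive duplicate sensor readings carry no extra value, the Java code below drops those duplicates and gzip-compresses the remaining lines before they are handed to the storage layer. The file names and the de-duplication rule are illustrative only.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class SensorPreprocess {
  public static void main(String[] args) throws Exception {
    String previous = null;
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
             new FileInputStream("sensor_raw.log"), StandardCharsets.UTF_8));
         BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
             new GZIPOutputStream(new FileOutputStream("sensor_clean.log.gz")),
             StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        // Drop empty lines and exact consecutive duplicates (common for idle sensors).
        if (line.isEmpty() || line.equals(previous)) {
          continue;
        }
        out.write(line);
        out.newLine();
        previous = line;
      }
    }
  }
}

Even this trivial filter-and-compress step can shrink highly redundant sensor logs substantially before they reach the storage and analytics stages.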
Data storage concerns persistently storing and managing large-scale datasets. A data storage system can be divided into two parts: hardware infrastructure and data management. Hardware infrastructure consists of a pool of shared ICT resources organized in an elastic way for various tasks in response to their instantaneous demand. The hardware infrastructure should be able to scale up and out, and to be dynamically reconfigured to address different types of application environments. Data management software is deployed on top of the hardware infrastructure to maintain large-scale datasets. Additionally, to analyze or interact with the stored data, storage systems must provide several interface functions, fast querying, and other programming models.
Data analysis leverages analytical methods or tools to inspect, transform, and model data to extract value. Many application fields leverage opportunities presented by abundant data and domain-specific analytical methods to derive the intended impact. Although various fields pose different application requirements and data characteristics, a few of these fields may leverage similar underlying technologies. Emerging analytics research can be classified into six critical technical areas: structured data analytics, text analytics, multimedia analytics, web analytics, network analytics, and mobile analytics.
Big-Data System: A Layered View:
Alternatively, the big data system can be
decomposed into a layered structure, as illustrated in Fig.
12.
The layered structure is divisible into three
layers, i.e., the infrastructure layer, the
computing layer, and the application layer, from
bottom to top. This layered view only provides a
conceptual hierarchy to underscore the
complexity of a big data system. The function of
each layer is as follows.
The infrastructure layer consists of a pool of ICT resources, which can be organized by cloud computing infrastructure and enabled by virtualization technology. These resources are exposed to upper-layer systems in a fine-grained manner with a specific service-level agreement (SLA). Within this model, resources must be allocated to meet the big data demand while achieving resource efficiency by maximizing system utilization, energy awareness, operational simplification, and so on.
The computing layer encapsulates various data tools into a middleware layer that runs over raw ICT resources. In the context of big data, typical tools include data integration, data management, and the programming model. Data integration means acquiring data from disparate sources and integrating the dataset into a unified form with the necessary data pre-processing operations. Data management refers to mechanisms and tools that provide persistent data storage and highly efficient management, such as distributed file systems and SQL or NoSQL data stores. The programming model implements abstract application logic and facilitates data analysis applications. MapReduce, Dryad, Pregel, and Dremel exemplify such programming models.
The application layer exploits the interface provided by the programming models to implement various data analysis functions, including querying, statistical analyses, clustering, and classification; it then combines basic analytical methods to develop various field-related applications. McKinsey presented potential big data application domains: health care, public sector administration, retail, global manufacturing, and personal location data.
Big-Data System Challenges: Designing and
deploying a big data analytics system is not a
trivial or straightforward task. As one of its
definitions suggests, big data is beyond the
capability of current hardware and software
platforms. The new hardware and software
platforms in turn demand new infrastructure and
models to address the wide range of challenges
of big data. Recent works have discussed
potential obstacles to the growth of big data
applications. In this paper, we strive to classify
these challenges into three categories: data
collection and management, data analytics, and
system issues. Data collection and management
addresses massive amounts of heterogeneous
and complex data. The following challenges of
big data must be met:
Data Representation: Many datasets are heterogeneous in type, structure, semantics, organization, granularity, and accessibility. A competent data representation should be designed to reflect the structure, hierarchy, and diversity of the data, and an integration technique should be designed to enable efficient operations across different datasets.
Redundancy Reduction and Data Compression: Typically, there is a large amount of redundant data in raw datasets. Redundancy reduction and data compression without sacrificing potential value are efficient ways to lessen overall system overhead.
Data Life-Cycle Management: Pervasive sensing and computing are generating data at an unprecedented rate and scale that far exceed the much smaller advances in storage system technologies. One of the urgent challenges is that current storage systems cannot host the massive data. In general, the value concealed in big data depends on data freshness; therefore, we should set up a data importance principle associated with the analysis value to decide which parts of the data should be archived and which parts should be discarded.
Data Privacy and Security: With the proliferation of online services and mobile phones, privacy and security concerns regarding accessing and analyzing personal information are growing. It is critical to understand what support for privacy must be provided at the platform level to eliminate privacy leakage and to facilitate various analyses. There will be a significant impact resulting from advances in big data analytics, including interpretation, modeling, prediction, and simulation. Unfortunately, massive amounts of data, heterogeneous data structures, and diverse applications present tremendous challenges, such as the following.
Approximate Analytics: As data sets grow and the real-time requirement becomes stricter, analysis of the entire dataset is becoming more difficult. One way to potentially solve this problem is to provide approximate results, for example by means of an approximate query. The notion of approximation has two dimensions: the accuracy of the result and the groups omitted from the output.
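A minimal sketch of the idea, assuming a simple uniform random sample is acceptable: the Java code below estimates the average of a large array from a sample and reports a rough 95% confidence half-width (1.96 standard errors), trading accuracy for speed. The data and sample size are artificial.

import java.util.Random;

public class ApproxQuery {
  // Approximate average of 'data' using sampling with replacement,
  // returned together with a rough 95% confidence half-width.
  static double[] approximateAverage(double[] data, int sampleSize, long seed) {
    Random rnd = new Random(seed);
    double sum = 0, sumSq = 0;
    for (int i = 0; i < sampleSize; i++) {
      double x = data[rnd.nextInt(data.length)];
      sum += x;
      sumSq += x * x;
    }
    double mean = sum / sampleSize;
    double variance = sumSq / sampleSize - mean * mean;
    double halfWidth = 1.96 * Math.sqrt(variance / sampleSize);
    return new double[] {mean, halfWidth};
  }

  public static void main(String[] args) {
    double[] data = new double[1_000_000];
    for (int i = 0; i < data.length; i++) {
      data[i] = i;                       // stand-in for a large measurement column
    }
    double[] est = approximateAverage(data, 10_000, 42L);
    System.out.printf("approx mean = %.1f +/- %.1f%n", est[0], est[1]);
  }
}

Scanning 10,000 sampled values instead of a million gives an answer with a quantified error bound, which is exactly the accuracy dimension of approximation mentioned above; groups absent from the sample illustrate the second dimension.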
Connecting Social Media: Social media possesses unique properties, such as vastness, statistical redundancy and the availability of user feedback. Various extraction techniques have been successfully used to identify references from social media to specific product names, locations, or people on websites. By connecting other data sources with social media, applications can achieve high levels of precision and distinct points of view.
Deep Analytics: One of the drivers of excitement around big data is the expectation of gaining novel insights. Sophisticated analytical technologies, such as machine learning, are necessary to unlock such insights. However, effectively leveraging these analysis toolkits requires an understanding of probability and statistics. The potential pillars of privacy and security mechanisms are mandatory access control and secure communication, multi-granularity access control, privacy-aware data mining and analysis, and secure storage and management. Finally, large-scale parallel systems generally confront several common issues; however, the emergence of big data has amplified the following challenges in particular.
Energy Management: The energy consumption of large-scale computing systems has attracted greater concern from economic and environmental perspectives. Data transmission, storage, and processing will inevitably consume progressively more energy as data volume and analytics demand increase. Therefore, system-level power control and management mechanisms must be considered in a big data system, while continuing to provide extensibility and accessibility.
Scalability: A big data analytics system must be able to support very large datasets created now and in the future. All the components in big data systems must be capable of scaling to address the ever-growing size of complex datasets.
Collaboration: Big data analytics is an interdisciplinary research field that requires specialists from multiple professional fields collaborating to mine hidden values. A comprehensive big data cyberinfrastructure is necessary to allow broad communities of scientists and engineers to access the diverse data, apply their respective expertise, and cooperate to accomplish the goals of analysis.
V. CONCLUSIONS
In this paper we have surveyed various technologies for handling big data and their architectures. We have also discussed the challenges of big data (volume, variety, velocity, value, veracity) and the advantages and disadvantages of these technologies. The paper discussed an architecture using Hadoop HDFS distributed data storage, real-time NoSQL databases, and MapReduce distributed data processing over a cluster of commodity servers. The main goal of our paper was to survey the various big data handling techniques that handle massive amounts of data from different sources and improve the overall performance of systems.
REFERENCES
[1] Yuri Demchenko, "The Big Data Architecture Framework (BDAF)", Outcome of the Brainstorming Session at the University of Amsterdam, 17 July 2013.
[2] Amogh Pramod Kulkarni, Mahesh Khandewal, "Survey on Hadoop and Introduction to YARN", International Journal of Emerging Technology and Advanced Engineering (ISSN 2250-2459, ISO 9001:2008 Certified Journal), Volume 4, Issue 5, May 2014.
[3] S. Sagiroglu, D. Sinanc, "Big Data: A Review", 2013, pp. 20-24.
[4] Vibhavari Chavan, Rajesh N. Phursule, "Survey Paper on Big Data", International Journal of Computer Science and Information Technologies, Vol. 5 (6), 2014.
[5] Margaret Rouse, "Unstructured Data", April 2010.
[6] Kyuseok Shim, "MapReduce Algorithms for Big Data Analysis", DNIS 2013, LNCS 7813, pp. 44-48, 2013.
[7] X. L. Dong, D. Srivastava, "Big Data Integration", IEEE International Conference on Data Engineering (ICDE), 29 (2013), pp. 1245-1248.
[8] F. Tekiner, J. A. Keane, "Big Data Framework", 2013 IEEE International Conference on Systems, Man and Cybernetics (SMC), 13-16 Oct. 2013, pp. 1494-1499.
[9] Mrigank Mridul, Akashdeep Khajuria, Snehasish Dutta, Kumar N, "Analysis of Big Data using Apache Hadoop and Map Reduce", Volume 4, Issue 5, May 2014.
[10] Suman Arora, Madhu Goel, "Survey Paper on Scheduling in Hadoop", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 5, May 2014.
[11] Aditya B. Patel, Manashvi Birla, Ushma Nair, "Addressing Big Data Problem Using Hadoop and Map Reduce", in Proc. 2012 Nirma University International Conference on Engineering.
[12] Jimmy Lin, "MapReduce Is Good Enough?", IEEE Computer, 32 (2013).