
BIG DATA PROJECT

ACKNOWLEDGEMENT

My sincere thanks to Prof. Sandeep Kelkar for his guidance and insights during key stages of the project, and for extending his support during the entire project lifecycle. This may sound unusual, but I would also like to thank Google for its search product, and ebooks such as IBM's big data titles, which were tremendously effective in giving access to books and corners of the internet across a vast sea of information.

Place : Mumbai

Date: September 30, 2013 Prasad Bhoja Amin


INDEX

Sr. No. Table of Contents Page No.

1 Introduction 1-33

2 Executive Summary 34

3 Design of Survey 35-39

4 Data Collection Summary 40

5 Data Analysis 41-48

6 Interfaces / Key Findings 49-50

7 Conclusions 51

8 Suggestions 52-56

9 Bibliography 57

10 SPSS Tool, Analysis Diagrams & Pie Charts (Output) 58-96


Content Page No.

1.1 What is big data? 2
1.2 The Importance of Big Data and What You Can Accomplish 3
1.3 Big Data has three characteristics 3
1.4 Why is big data important? 4
1.5 Big data steps, vendors and technology landscape 4
1.6 Operational Definitions
1.6.1 Data Scientist 4
1.6.2 Massively Parallel Processing 4
1.6.3 In-memory analytics 4
1.6.4 Redundant Array of Independent Disks (RAID) 4
1.6.5 What business problems are being targeted? 5
1.6.6 Structured, Semi-Structured & Unstructured Data 8
2.0 Big Data Infrastructure 9
2.1.1 Why RAID fails at scale 9
2.1.2 Scale-up vs. scale-out NAS 10
2.1.3 EMC Isilon 11
2.2 Apache Hadoop 13
2.3 Data Appliances 15
2.3.1 HP Vertica 15
2.3.2 Teradata Aster 17
3.0 Domain-Wise Challenges in the Big Data Era
3.1 Log Management 17
3.2 Data Integrity & Reliability in the Big Data Era 18
3.3 Backup Management in the Big Data Era 20
3.4 Database Management in the Big Data Era 22
4.0 Big Data Use Cases:
4.1 Potential Use Cases 24
4.2 Big Data Actual Use Cases 26
4.3 What Big Data is used for at IBM 32


INTRODUCTION

The internet has grown tremendously in the last decade, from 304 million users in March 2000 to 2,280 million users in March 2012, according to Internet World Stats. Worldwide information is more than doubling every two years, with 1.8 zettabytes (1.8 trillion gigabytes) projected to be created and replicated in 2011 according to a study conducted by the research firm IDC.

"Big Data" is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large it is difficult to process with traditional database and software techniques. An example of big data might be petabytes (1,024 terabytes), exabytes (1,024 petabytes) or zettabytes of data consisting of billions to trillions of records about millions of people, all from different sources (e.g. blogs, social media, email, sensors, RFID readers, photographs, videos, microphones, mobile data and so on). The data is typically loosely structured, often incomplete and inaccessible. When dealing with larger datasets, organizations face difficulties in being able to create, manipulate and manage big data. Scientists regularly encounter this problem in meteorology, genomics, connectomics, complex physics simulations, biological and environmental research, internet search, finance and business informatics. Big data is particularly a problem in business analytics because standard tools and procedures are not designed to search and analyze massive datasets. While the term may seem to reference the volume of data, that isn't always the case. The term big data, especially when used by vendors, may refer to the technology (the tools and processes) that an organization requires to handle the large amounts of data and storage facilities.


Over a distributed storage system, Hadoop, used to process unstructured and semi-structured big data, applies the map paradigm to locate all relevant data and then selects only the data directly answering the query. NoSQL products such as MongoDB and Terrastore process structured big data. NoSQL data is characterized by being basically available, soft state (changeable), and eventually consistent. MongoDB and Terrastore are both NoSQL-related products used for document-oriented applications. The advent of the age of big data poses opportunities and challenges for businesses. Previously unavailable forms of data can now be saved, retrieved and processed. However, changes to hardware, software and data processing techniques are necessary to employ this new paradigm.

1.1 What is big data?

Big data is a popular term used to describe the exponential growth and availability of

data, both structured and unstructured. And big data may be as important to business –

and society – as the Internet has become. Why? More data may lead to more accurate

analysis. More accurate analysis may lead to more confident decision making. And

better decisions can mean greater operational efficiencies, cost reductions and reduced

risk.

Data growth curve: terabytes -> petabytes -> exabytes -> zettabytes -> yottabytes -> brontobytes -> geopbytes. It is getting more interesting.

Analytical infrastructure curve: databases -> data marts -> operational data stores (ODS) -> enterprise data warehouses -> data appliances -> in-memory appliances -> NoSQL databases -> Hadoop clusters.


1.2 The Importance of Big Data and What You Can Accomplish

The real issue is not that you are acquiring large amounts of data. It's what you do

with the data that counts. The hopeful vision is that organizations will be able to take

data from any source, harness relevant data and analyze it to find answers that enable

1) cost reductions, 2) time reductions, 3) new product development and optimized

offerings, and 4) smarter business decision making. All of these become possible, for instance, by combining big data with high-powered analytics.

1.3 Big Data has three characteristics

1. Velocity (batch processing to video streaming)
2. Volume (terabytes to zettabytes)
3. Variety (structured to unstructured)

Figure 1.1 Characteristics of Big Data (Source: IBM, Hadoop)

1.4 Why is big data important?

For example, IBM BigInsights includes log analysis for its performance optimizer. Logs are a volume problem, and they can arrive in semi-structured or raw format. When there is an upgrade in an organization, say upgrading the operating system or the database, or a migration, the log format sometimes changes as well.

Another example of why big data is important: any public or private sector bank accumulates big data when customers visit the bank or make daily debit card and credit card transactions. Every transaction that goes through generates log entries. For online transactions, more than 5 terabytes of log files can be produced each day, day after day. We cannot simply delete these log files because they remain useful to the organization.

1.5 Big data steps, vendors and technology landscape

1. Data Acquisition

Data is collected from the data sources and distributed across multiple nodes - often a grid - each of which processes a subset of the data in parallel. Here we have technology providers like IBM and HP, data providers like Reuters and Salesforce, and social network websites like Facebook, Google+ and LinkedIn.

2. Marshalling

In this domain we have very large data warehousing and BI appliances, with actors like Actian, EMC (Greenplum), HP (Vertica), IBM (Netezza), etc.

3. Analytics

In this phase we have the predictive technologies (such as data mining) and vendors such as Adobe, EMC, GoodData, Hadoop MapReduce, etc.


4. Action

This includes all the data acquisition providers plus the ERP, CRM and BPM actors, including Adobe, Eloqua, EMC, etc. In both the analytics and action phases, BI tools vendors include GoodData, Google, HP (Autonomy), IBM (Cognos suite), etc.

5. Data Governance

Data governance is, in effect, an efficient master data management solution. As defined, data governance applies to each of the preceding stages of big data delivery. By establishing processes and guiding principles, it sanctions behaviours around data. In short, data governance means that the application of big data is useful and relevant. It is an insurance policy that the right questions are being asked, so we won't be squandering the immense power of new big data technologies that make processing, storage and delivery speed more cost-effective and nimble than ever.

1.6 Operational Definitions

1.6.1 Data Scientist

A data scientist represents an evolution from the business or data analyst role. Data scientists - also known as data analysts - are professionals with a core statistics or mathematics background coupled with good knowledge of analytics and data software tools. A McKinsey study on big data states that India will need nearly 100,000 data scientists in the next few years. "Data scientist" is a fairly new role, defined by Hilary Mason as someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning to cull information from data. These data scientists take a blend of the hacker's arts, statistics and machine learning, and apply their expertise in mathematics and in understanding the domain of the data - where the data originated - to process the data into useful information. This requires the ability to make creative decisions about the data and the information created, and to maintain a perspective that goes beyond ordinary scientific boundaries.

1.6.2 Massively Parallel Processing (MPP)

MPP is the coordinated processing of a program by multiple processors that work on different parts of the program, with each processor using its own operating system and memory. An MPP system is considered better than a symmetric multiprocessing (SMP) system for applications that allow a number of databases to be searched in parallel. These include decision support systems and data warehouse applications.
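To make the shared-nothing idea concrete, here is a minimal sketch in Python (not taken from any of the products discussed in this report): a simple query is run against several data partitions in parallel, each worker process scanning only its own partition, and the partial results are combined at the end. The partition contents and the count_matching helper are invented for illustration.

```python
from multiprocessing import Pool

# Hypothetical partitions of a customer table, one per processing node.
PARTITIONS = [
    [{"city": "Mumbai", "churned": True}, {"city": "Pune", "churned": False}],
    [{"city": "Mumbai", "churned": False}, {"city": "Delhi", "churned": True}],
    [{"city": "Mumbai", "churned": True}, {"city": "Mumbai", "churned": False}],
]

def count_matching(partition):
    """Each worker scans only its own partition (shared-nothing style)."""
    return sum(1 for row in partition if row["city"] == "Mumbai")

if __name__ == "__main__":
    with Pool(processes=len(PARTITIONS)) as pool:
        partial_counts = pool.map(count_matching, PARTITIONS)  # run in parallel
    print("Customers in Mumbai:", sum(partial_counts))  # combine partial results
```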

1.6.3 In-memory analytics

The key difference between conventional BI tools and in-memory products is that the former query data on disk while the latter query data in random access memory (RAM). When a user runs a query against a typical data warehouse, the query normally goes to a database that reads the information from multiple tables stored on a server's shared disk. With a server-based in-memory database, all information is initially loaded into memory; users then query and interact with the data loaded into the machine's memory.
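A rough way to see the disk-versus-memory distinction is with SQLite, which can hold an entire database in RAM. This is only an illustrative sketch, not one of the BI products discussed above; the sales table and its rows are made up.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM instead of on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("West", 120.0), ("East", 80.5), ("West", 42.25)],
)

# Queries now read from memory rather than issuing disk I/O per table scan.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
):
    print(region, total)
```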

1.6.4 Redundant Array of Independent Disks (RAID)

RAID is short for redundant array of independent (or inexpensive) disks. It is a category of disk drive setup that employs two or more drives in combination for fault tolerance and performance. RAID disk drives are used frequently on servers but aren't generally necessary for personal computers. RAID allows you to store the same data redundantly (in multiple places) in a balanced way to improve overall storage performance.

1.6.5 What business problems are being targeted?

1. Modeling true risk
2. Customer churn analysis
3. Flexible supply chains
4. Loyalty pricing
5. Recommendation engines
6. Ad targeting
7. Precision targeting
8. POS transaction analysis
9. Threat analysis
10. Trade surveillance
11. Search quality fine-tuning, and
12. Mashups such as location + ad targeting.

Does an in-memory analytics platform replace or augment traditional in-database approaches?

The answer is that it is quite complementary. In-database approaches put a large focus on the data preparation and scoring portions of the analytic process. The value of in-database processing is the ability to handle terabytes or petabytes of data effectively. Much of the processing may not be highly sophisticated, but it is critical. The new in-memory architectures use a massively parallel platform to enable multiple terabytes of system memory to be utilized (conceptually) as one big pool of memory. This means that samples can be much larger, or even eliminated, and the number of variables tested can be expanded immensely.

In-memory approaches fit best in situations where there is a need for:

1. High volume and speed: it is necessary to run many, many models quickly.
2. High width and depth: it is desired to test hundreds or thousands of metrics across tens of millions of customers.
3. High complexity: it is critical to run processing-intensive algorithms on all this data and to allow for many iterations to occur.

Common styles of in-memory analytics include:

1. In-memory OLAP: a classic MOLAP (multidimensional online analytical processing) cube loaded entirely into memory.
2. In-memory ROLAP: relational OLAP metadata loaded entirely into memory.
3. In-memory inverted index: an index with data loaded into memory.
4. In-memory associative index: an array/index with every entity/attribute correlated to every entity/attribute.
5. In-memory spreadsheet: a spreadsheet-like array loaded entirely into memory.

1.6.6 Structured, Semi-Structured and Unstructured Data

Structured data is the type that fits neatly into a standard relational database management system (RDBMS) and lends itself to that type of processing.

Semi-structured data is data which has some level of commonality but does not fit the structured data type.

Unstructured data is the type that varies in its content and can change from entry to entry.

Structured data       Semi-structured data    Unstructured data
Customer records      Web logs                Pictures
Point-of-sale data    Social media            Video editing data
Inventory             E-commerce              Productivity (office docs)
Financial records                             Geological data
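As a small illustration of the three categories in the table above, the sketch below (invented example records, not data from this project's survey) shows how the same kind of customer information might look in structured, semi-structured and unstructured form.

```python
# Structured: a fixed-schema row, as it would appear in an RDBMS table.
structured_row = ("C1001", "Prasad", "Mumbai", 4999.00)

# Semi-structured: a JSON-like document; fields can vary from record to record.
semi_structured = {"id": "C1001", "name": "Prasad",
                   "orders": [{"sku": "A12", "qty": 2}]}

# Unstructured: free text (or an image/video), with no schema at all.
unstructured = "Customer called to say the delivery was late but the product was fine."
```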


2.0 Big Data Infrastructure

2.1.1 Why RAID Fails at Scale

RAID schemes are based on parity, and at their root, if more than two drives fail simultaneously, data is not recoverable. The statistical likelihood of multiple drive failures has not been an issue in the past. However, as drive capacities continue to grow beyond the terabyte range and storage systems continue to grow to hundreds of terabytes and petabytes, the likelihood of multiple drive failures is now a reality.

Further, drives aren't perfect, and typical SATA drives have a published bit rate error (BRE) of 1 in 10^14, meaning that for every 100,000,000,000,000 bits read there will be a bit that is unrecoverable. Doesn't seem significant? In today's big data storage systems, having one drive fail and then encountering a bit rate error while rebuilding from the remaining RAID set is highly probable in real-world scenarios. To put this into perspective, when reading 10 terabytes the probability of hitting an unreadable bit is likely (56.5%), and when reading 100 terabytes it is nearly certain (99.97%).
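The percentages quoted above follow directly from the published bit rate error: with an unrecoverable-read probability of 1 in 10^14 bits, the chance of hitting at least one bad bit while reading N bits is 1 - (1 - 10^-14)^N. A small sketch of that arithmetic, using the assumed figures from the text:

```python
# Probability of at least one unrecoverable bit when reading `terabytes` of data,
# assuming a bit rate error (BRE) of 1 unreadable bit per 10**14 bits read.
def unrecoverable_read_probability(terabytes, bre=1e-14):
    bits_read = terabytes * 1e12 * 8          # 1 TB = 10**12 bytes = 8 * 10**12 bits
    return 1 - (1 - bre) ** bits_read

print(f"10 TB : {unrecoverable_read_probability(10):.1%}")   # roughly 55%, i.e. likely
print(f"100 TB: {unrecoverable_read_probability(100):.2%}")  # ~99.97%, nearly certain
```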

2.1.2 Scale-up vs. Scale-out NAS

A traditional scale-up system would provide a small number of access points, or data servers, that would sit in front of a set of disks protected with RAID. As these systems needed to provide more data to more users, the storage administrator would add more disks to the back end, but this only turned the data servers into a choke point. Larger and faster data servers could be built using faster processors and more memory, but this architecture still had significant scalability issues. Scale-out uses the approach of more of everything: instead of adding drives behind a pair of servers, it adds servers, each with its own processors, memory, network interfaces and storage capacity. As I need to add capacity to a grid - the scale-out version of an array - I insert a new node with all the available resources. This architecture required a number of things to make it work, from both a technology and a financial aspect.

Some of these factors include:

1. Clustered architecture

For this model to work, the entire grid needs to work as a single entity, and each node in the grid must be able to pick up a portion of the function of any other node that may fail.

2. Distributed/parallel file system

The file system must allow a file to be accessed from any one, or any number, of nodes and sent to the requesting system. This requires different mechanisms underlying the file system: distribution of data across multiple nodes for redundancy, a distributed metadata and locking mechanism, and data scrubbing/validation routines.

3. Commodity hardware

For these systems to be affordable, they must rely on commodity hardware that is inexpensive and easily accessible, instead of purpose-built systems.

Benefits of Scale-Out

There are a number of significant benefits to these new scale-out systems that meet the needs of big data challenges.

1. Manageability

When data can grow in a single file system namespace, the manageability of the system increases significantly, and a single data administrator can now manage a petabyte or more of storage versus 50 or 100 terabytes on a scale-up system.

2. Elimination of stovepipes

Since these systems scale linearly and do not have the bottlenecks that scale-up systems create, all data is kept in a single file system in a single grid, eliminating the stovepipes introduced by the multiple arrays and file systems otherwise required.

3. Just-in-time scalability

As my storage needs grow, I can add an appropriate number of nodes to meet my needs at the time I need them. With scale-up arrays I would have to guess at the final size my data might grow to while using that array, which often led to the purchase of large data servers with only a few disks behind them initially, so I would not hit a bottleneck in the data server as I added disks.

4. Increased utilization rates

Since the data servers in these scale-out systems can address the entire pool of storage, there is no stranded capacity. The five core tenets of scale-out NAS are that it should be simple to scale, offer predictable performance, be efficient to operate, be always available, and be proven to work in large enterprises.

2.1.3 EMC Isilon

EMC Isilon is a scale-out platform that delivers ideal storage for big data. Powered by the OneFS operating system, Isilon nodes are clustered to create a high-performing single pool of storage. EMC Corporation announced in May 2011 the world's largest single file system with the introduction of EMC Isilon's new IQ 108NL scale-out NAS hardware product. Leveraging three-terabyte (TB) enterprise-class Hitachi Ultrastar drives in a 4U node, the 108NL scales to more than 15 petabytes (PB) in a single file system and single volume, providing the storage foundation for maximizing the big data opportunity.

EMC also announced Isilon's new SmartLock data retention software application, delivering immutable protection for big data to ensure the integrity and continuity of big data assets from initial creation to archival.

Object-Based Storage

Object storage is based on a single, flat address space that enables the automatic routing of data to the right storage systems, and to the right protection levels within those systems, according to its value and stage in the data life cycle.

Better Data Availability than RAID

In a properly configured object storage system, content is replicated so that a minimum of two replicas assure continuous data availability. If a disk dies, all the other disks in the cluster join in to replace the lost replicas while the system still runs at nearly full speed. Recovery takes only minutes, with no interruption of data availability and no noticeable performance degradation.

Provides Unlimited Capacity and Scalability

In object storage systems there is no directory hierarchy, and the object location does not have to be specified the way a directory path has to be known in order to retrieve a file. This enables object storage systems to scale out to petabytes and beyond without limits on the number of files (objects), file size or file system capacity, such as the 2-terabyte restriction that is common for Windows and Linux file systems.


Backups are Eliminated

With a well-designed object storage system, backups are not required. Multiple replicas ensure that content is always available, and an offsite disaster recovery replica can be created automatically if desired.

Automatic Load Balancing

A well-designed object storage cluster is totally symmetrical, which means that each node is independent, provides an entry point into the cluster, and runs the same code. Companies that provide this are Cleversafe, Compuverde, Amplidata, Caringo, EMC, Hitachi Data Systems (Hitachi Content Platform), NetApp (StorageGRID) and Scality.

2.2 Apache Hadoop

Apache Hadoop has been the driving force behind the growth of the big data industry. It is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion.

MapReduce is the core of Hadoop. Created at Google in response to the problem of building web search indexes, the MapReduce framework is the powerhouse behind most of today's big data processing. In addition to Hadoop, you will find MapReduce inside MPP and NoSQL databases such as Vertica or MongoDB. The important innovation of MapReduce is the ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes. Distributing the computation solves the issue of data too large to fit on a single machine. Combine this technique with commodity Linux servers and you have a cost-effective alternative to massive computing arrays.
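As a toy illustration of that divide-and-run-in-parallel idea, here is a sketch of the classic word count in plain Python (not Hadoop's actual Java API): a map step is applied to chunks of input in parallel, and a reduce step merges the partial results. The sample chunks are invented.

```python
from collections import Counter
from multiprocessing import Pool

def map_phase(chunk):
    """Map: turn one chunk of raw text into partial (word, count) results."""
    return Counter(chunk.lower().split())

def reduce_phase(partials):
    """Reduce: merge the partial counts emitted by every mapper."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    chunks = ["big data is big", "data about data", "hadoop processes big data"]
    with Pool() as pool:                       # each chunk could live on a different node
        partial_counts = pool.map(map_phase, chunks)
    print(reduce_phase(partial_counts).most_common(3))
```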

HDFS: we discussed the ability of MapReduce to distribute computation over multiple servers. For that computation to take place, each server must have access to the data. This is the role of HDFS, the Hadoop Distributed File System. HDFS and MapReduce are robust: servers in a Hadoop cluster can fail without aborting the computation process. HDFS ensures data is replicated with redundancy across the cluster. On completion of a calculation, a node will write its results back into HDFS. There are few restrictions on the data that HDFS stores; data may be unstructured, and no schema needs to be defined before storing it. With HDFS, making sense of the data is the responsibility of the developer's code.

Why would a company be interested in Hadoop?

The number one reason is that the company is interested in taking advantage of unstructured or semi-structured data. This data will not fit well into a relational database, but Hadoop offers a scalable and relatively easy-to-program way to work with it. This category includes emails, web server logs, instrumentation of online stores, images, video and external data sets. All this data can contain information that is critical to the business and should reside in your data warehouse, but it needs a lot of pre-processing, and this pre-processing will not happen in an Oracle RDBMS.


Another reason to look into Hadoop is for information that exists in the database but can't be efficiently processed within the database. This is a wide use case, and it is usually labeled ETL because the data is going out of an OLTP system and into a data warehouse. You use Hadoop when 99% of the work is in the T of ETL: transforming the data into useful information.

2.3 Data Appliances

Purpose-built solutions like Teradata, IBM Netezza, EMC Greenplum, SAP HANA (High-Performance Analytic Appliance), HP Vertica and Oracle Exadata are forming a new category. Data appliances are one of the fastest growing categories in big data. Data appliances integrate database, processing and storage in an integrated system optimized for analytics, offering:

1. Processing close to the data source
2. Appliance simplicity
3. Massively parallel architecture
4. A platform for advanced analytics
5. Flexible configurations and extreme scalability.

2.3.1 HP Vertica

The Vertica Analytics Platform is purpose-built from the ground up to enable companies to extract value from their data at the speed and scale they need to thrive in today's economy. Vertica was designed and built since its inception for today's most demanding analytic workloads, and each Vertica component is able to take full advantage of the others by design.


Key features of the Vertica Analytics Platform

1. Real-Time Query & Loading >> Capture the time value of data by continuously loading information while simultaneously allowing immediate access for rich analytics.
2. Advanced In-Database Analytics >> An ever-growing library of features and functions to explore and process more data closer to the CPU cores, without the need to extract it.
3. Database Designer & Administration Tools >> Powerful setup, tuning and control with minimal administration effort; continual improvements can be made while the system remains online.
4. Columnar Storage & Execution >> Perform queries 50x to 1000x faster by eliminating costly disk I/O, without the hassle and overhead of indexes and materialized views.
5. Aggressive Data Compression >> Accomplish more with less CAPEX while delivering superior performance with an engine that operates on compressed data.
6. Scale-Out MPP Architecture >> Vertica automatically scales linearly and limitlessly by just adding industry-standard x86 servers to the grid.
7. Automatic High Availability >> Runs non-stop with automatic redundancy, failover and recovery, optimized to deliver superior query performance as well.
8. Optimizer, Execution Engine & Workload Management >> Get maximum performance without worrying about the details of how it gets done. Users just think about questions; Vertica delivers answers, fast.
9. Native BI, ETL & Hadoop/MapReduce Integration >> Seamless integration with a robust and ever-growing ecosystem of analytics solutions.


2.3.2 Teradata Aster

To gain business insight using MapReduce and Apache Hadoop with SQL-based analytics, below is a summary of a unified big data architecture that blends the best of Hadoop and SQL, allowing users to:

1. Capture and refine data from a wide variety of sources
2. Perform necessary multi-structured data preprocessing
3. Develop rapid analytics
4. Process embedded analytics, analyzing both relational and non-relational data
5. Produce semi-structured data as output, often with metadata and heuristic analysis
6. Solve new analytical workloads with reduced time to insight
7. Use massively parallel storage in Hadoop to efficiently store and retain data.

The figure below offers a framework to help enterprise architects most effectively use each part of a unified big data architecture. This framework allows a best-of-breed approach that you can apply to each schema type, helping you achieve maximum performance, rapid enterprise adoption and the lowest TCO.

3.0 Domain-Wise Challenges in the Big Data Era

3.1 Log Management

Log data does not fall into the convenient schemas required by relational databases. Log data is, at its core, unstructured or in fact semi-structured, which leads to a deafening cacophony of formats; the sheer variety in which logs are generated presents a major problem in how they are analyzed. The emergence of big data has been driven not only by the increasing amount of unstructured data to be processed in near real time, but also by the availability of new tools to deal with these challenges. There are two things that don't receive enough attention in the log management space. The first is real scalability, which means thinking beyond what data centers can do. That inevitably leads to ambient cloud models for log management. Splunk has done an amazing job of pioneering an ambient cloud model with the way they created an eventual consistency model, which allows you to make a query and get a good-enough answer quickly, or a perfect answer in more time.

The second thing is security. Log data is next to useless if it is not non-repudiable. Basically, all the log data in the world is not useful as evidence unless you can prove that nobody changed it. Sumo Logic, Loggly and Splunk are the primary companies that currently have products around log management.
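To show why log data is called semi-structured, the sketch below (not tied to Splunk or any product named above) uses a regular expression to pull fields out of a web-server-style access log line; the sample line and field names are invented.

```python
import re

# A typical access-log line: free text, but with a recoverable internal pattern.
LINE = '10.0.0.7 - - [30/Sep/2013:10:15:32 +0530] "GET /index.html HTTP/1.1" 200 5123'

PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

match = PATTERN.match(LINE)
if match:
    record = match.groupdict()          # semi-structured text -> structured record
    print(record["ip"], record["path"], record["status"])
```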

3.2 Data Integrity and Reliability in the Big Data Era

Consider standard business practices and how nearly all physical forms of documentation and transactions have evolved into digitized versions; with them come the inherent challenges of validating not just the authenticity of their contents but also the impact of acting upon an invalid data set, something which is highly possible in today's high-velocity big data business environment. With this view we can begin to identify the scale of the challenge. With cybercrime and insider threats clearly emerging as a much more profitable business for the criminal element, the need to validate and verify is going to become critical to all business documentation and related transactions, even within existing supply chains.


Keyless signature technology is a relatively new concept in the market and will require a different set of perspectives when put under consideration. A keyless signature provides an alternative to key-based technologies by providing proof and non-repudiation of electronic data using only hash functions for verification. The implementation of keyless signatures is done via a globally distributed machine, taking hash values of data as inputs and returning keyless signatures that prove the time, integrity and origin of the input data.

A primary goal of keyless signature technology is to provide mass-scale, non-expiring data validation while eliminating the need for secrets or other forms of trust, thereby reducing or even eliminating the need for more complex certificate-based solutions, as these are rife with certificate management issues, including expiration and revocation.
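The core idea of proving integrity with hash functions alone can be sketched in a few lines. This is only an illustration of hashing for tamper-evidence, not an implementation of the keyless signature infrastructure described above; the record contents are invented.

```python
import hashlib

def fingerprint(record: bytes) -> str:
    """Return a SHA-256 digest that acts as a tamper-evident fingerprint."""
    return hashlib.sha256(record).hexdigest()

original = b"2013-09-30,INV-4471,amount=15000.00"
stored_digest = fingerprint(original)          # published / archived at write time

# Later: recompute the digest and compare to detect any modification.
tampered = b"2013-09-30,INV-4471,amount=95000.00"
print(fingerprint(original) == stored_digest)  # True  -> data unchanged
print(fingerprint(tampered) == stored_digest)  # False -> data was altered
```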

As more organisations become affected by the big data phenomenon, the clear implication is that many businesses will potentially be making business decisions based on massive amounts of internal and third-party data.

Consequently, the demand for novel and trusted approaches to validating data will grow. Extend this concept to the ability to validate a virtual machine, switch logs or indeed security logs, and then multiply by the clear advantages that cloud computing (public or private) has over the traditional datacenter design, and we begin to understand why keyless data integrity technology that can ensure self-validating data is likely to experience adoption.


The ability to move away from reliance on a third-party certification authority will be welcomed by many, although this move away from the traditionally accepted approach to verifying data integrity needs to be more fully publicised and understood before mass-market adoption and acceptance.

Another solution for monitoring the stability, performance and security of your big data environment comes from a company called Gazzang. Enterprises and SaaS solution providers have new needs that are driven by the new infrastructures and opportunities of cloud computing. For example, business intelligence analysts use big data stores such as MongoDB, Hadoop and Cassandra. The data is spread across hundreds of servers in order to optimize processing time and return business insight to the user. Leveraging its extensive experience with cloud architectures and big data platforms, Gazzang is delivering a SaaS solution for the capture, management and analysis of massive volumes of IT data. Gazzang zOps is purpose-built for monitoring big data platforms and multiple cloud environments. The powerful engine collects and correlates vast amounts of data from numerous sources in a variety of forms.

3.3 Backup Management in the Big Data Era

For protection against user or application error, Ashar Baig, a senior analyst and consultant with the Taneja Group, said snapshots can help with big data backups. Baig also recommends a local disk-based system for quick and simple first-level data recovery. "Look for a solution that provides you an option for local copies of data so that you can do local restores, which are much faster," he said. "Having a local copy, and having an image-based technology to do fast image-based snaps and replications, does speed it up and takes care of the performance concern."

Faster Scanning Needed

One of the issues big data backup systems face is scanning each time the backup and archiving solutions start their jobs. Legacy data protection systems scan the file system each time a backup job is run and each time an archiving job is run. For file systems in big data environments, this can be time-consuming. CommVault's solution to the scanning issue in its Simpana data protection software is its OnePass feature. According to CommVault, OnePass is an object-level converged process for collecting backup, archiving and reporting data. The data is collected and moved off the primary system to a ContentStore virtual repository for completing the data protection operations. Once a complete scan has been accomplished, the CommVault software places an agent on the file system to report on incremental backups, making the process even more efficient.

Casino doesn’t want to gamble on backups

Pechanga Resort and Casino in Temecula, Calif., went live with a cluster of 50 EMC Isilon X200 nodes in February to back up data from its surveillance cameras. The casino has 1.4 PB of usable Isilon storage to keep the data, which is critical to operations because the casino must shut down all gaming operations if its surveillance system is interrupted.

"In gaming, we're mandated to have surveillance coverage," said Michael Grimsley, director of systems for Pechanga Technology Solutions Group. If surveillance is down, all gaming has to stop. If a security incident occurs, the team pulls footage from the X200 nodes, moves it to WORM-compliant storage and backs it up with NetWorker software to EMC Data Domain DD860 deduplication target appliances. The casino doesn't need tape for WORM capability because WORM is part of Isilon's SmartLock software. Another possibility is adding replication to a DR site so the casino can recover quickly if the surveillance system goes down.

Scale-Out Systems

Another option for solving the performance and capacity issues is using a scale-out backup system, one similar to scale-out NAS but built for data protection: you add nodes with additional performance and capacity resources as the amount of protected data grows. "Any backup architecture, especially for the big data world, has to properly balance the performance and the capacity," said Jeff Tofano, Sepaton Inc.'s chief technology officer. "Otherwise, at the end of the day, it's not a good solution for the customer and is a more expensive solution than it should be."

Sepaton's S2100-ES2 modular virtual tape library (VTL) was built for data-intensive large enterprises. According to the company, its 64-bit processor nodes back up data at up to 43.2 TB per hour regardless of the data type, and it can store up to 1.6 PB. You can add up to eight performance nodes per cluster as your needs require, and add disk shelves to add capacity.

3.4 Database Management in the Big Data Era

There are currently three trends in the industry:

1. NoSQL databases, designed to meet the scalability requirements of distributed architectures and/or schemaless data management requirements.
2. NewSQL databases, designed to meet the requirements of distributed architectures, or to improve performance such that horizontal scalability is no longer needed.
3. Data grid/cache products, designed to store data in memory to increase application and database performance.

Computerworld's Tam Harbert explored the skills organizations are searching for in the quest to manage the big data challenge and identified five job titles emerging in the big data world. Along with Harbert's findings, here are seven new types of jobs being created by big data:

1. Data scientists: This emerging role is taking the lead in processing raw data and determining what types of analysis would deliver the best results.

2. Data architects: Organizations managing big data need professionals who can build a data model and plan out a roadmap of how and when the various data sources and analytical tools will come online, and how they will all fit together.

3. Data visualizers: These days a lot of decision-makers rely on information that is presented to them in a highly visual format, either on dashboards with colorful alerts and dials, or in quick-to-understand charts and graphs. Organizations need professionals who can harness the data and put it in context, in layman's language, exploring what the data means and how it will impact the company.

4. Data change agents: Every forward-thinking organization needs change agents, usually an informal role, who can evangelize and marshal the necessary resources for new innovation and new ways of doing business. Harbert predicts that data change agents will drive changes in internal operations and processes based on data analytics. They need to be good communicators, and a Six Sigma background - meaning they know how to apply statistics to improve quality on a continuous basis - also helps.

5. Data engineers/operators: These are the people who make the big data infrastructure hum on a day-to-day basis. They develop the architecture that helps analyze and supply data in the way the business needs, and make sure systems are performing smoothly, says Harbert.

6. Data stewards: Not mentioned in Harbert's list, but essential to any analytics-driven organization, is the emerging role of the data steward. Every bit and byte of data across the enterprise should be owned by someone, ideally a line of business. Data stewards ensure that data sources are properly accounted for, and may also maintain a centralized repository as part of a master data management approach in which there is one "gold copy" of enterprise data to be referenced.

7. Data virtualization/cloud specialists: Databases themselves are no longer as unique as they used to be. What matters now is the ability to build and maintain a virtualized data service layer in a consistent, easy-to-access manner. Sometimes this is called "database as a service." No matter what it is called, organizations need professionals who can build and support these virtualized layers or clouds.

4.0 Big Data Use Cases

4.1 Potential Use Cases

The key to exploiting big data analytics is focusing on a compelling business opportunity as defined by a use case: what exactly are we trying to do, and what value is there in proving the hypothesis?

Use cases are emerging in a variety of industries that illustrate different core competencies around analytics. The figure below illustrates some use cases along two dimensions: data velocity and variety.

Raw data -> aggregated data -> intelligence -> insights -> decisions -> operational impact -> financial outcomes -> value creation.

Insurance: Individualize auto-insurance policies based on newly captured vehicle telemetry data. The insurer gains insight into customer driving habits, delivering 1) more accurate assessments of risk, 2) individualized pricing based on actual individual customer driving habits, and 3) the ability to influence and motivate individual customers to improve their driving habits.

Travel: Optimize the buying experience through web log and social media data analysis. 1) The travel site gains insight into customer preferences and desires; 2) up-selling products by correlating current sales with subsequent browsing behavior increases browse-to-buy conversions via customized offers and packages; 3) deliver personalized travel recommendations based on social media data.

Gaming: Collect gaming data to optimize spend within and across games. 1) The games company gains insight into the likes, dislikes and relationships of its users; 2) enhance games to drive customer spend within games; 3) recommend other content based on analysis of player connections and similar likes, and create special offers or packages based on browsing and buying behaviour.


4.2 Big Data Actual Use Cases

The graphic below shows the results of a survey undertaken by InformationWeek, which indicated the percentage of respondents who would opt for open source solutions for big data.

1. Use Case: Amazon will pay shoppers $5 to walk out of stores empty-handed

This is an interesting use of consumer data entry to power next-generation retail price competition: Amazon is offering consumers up to $5 off a purchase if they compare prices using its mobile phone application in a store. The promotion serves as a way for Amazon to increase usage of its bar code scanning application while also collecting intelligence on prices in the stores.

Amazon's Price Check app, which is available for iPhone and Android, allows shoppers to scan a bar code, take a picture of an item, or conduct a text search to find the lowest prices. Amazon is also asking consumers to submit the prices of items with the app, so Amazon knows whether it is still offering the best prices. It is a great way to feed data into its learning engine from brick-and-mortar retailers. This is an interesting trend that should terrify brick-and-mortar retailers: while real-time everyday-low-price information empowers consumers, it terrifies retailers, who increasingly feel like showrooms where shoppers come to check out the merchandise but ultimately decide to walk out and buy online instead.


2. Smart meters

1) Because of smart meters, electricity providers can read the meter once every 15 minutes rather than once a month. This not only eliminates the need to send someone for meter reading, but, as the meter is read once every fifteen minutes, electricity can be priced differently for peak and off-peak hours. Pricing can be used to shape the demand curve during peak hours, eliminating the need to create additional generating capacity just to meet peak demand, and saving electricity providers millions of dollars' worth of investment in generating capacity and plant maintenance costs.

2) There is a smart electric meter in a residence in Texas, and one of the electricity providers in the area is using smart meter technology to shape the demand curve by offering free night-time energy charges - all night, every night, all year long.

3) In fact, they promote their service as "do your laundry or run the dishwasher at night and pay nothing for your energy charges." What TXU Energy is trying to do here is reshape energy demand using pricing so as to manage peak-time demand, resulting in savings for both TXU and the customer. This wouldn't have been possible without smart electric meters.

4) T-Mobile USA has integrated big data across multiple IT systems to combine customer transaction and interaction data in order to better predict customer defections. By leveraging social media data along with transaction data from CRM and billing systems, T-Mobile USA has been able to cut customer defections in half in a single quarter.

5) US Xpress, a provider of a wide variety of transportation solutions, collects about a thousand data elements, ranging from fuel usage to tire conditions to truck engine operations to GPS information, and uses this data for optimal fleet management and to drive productivity, saving millions of dollars in operating costs.

6) The McLaren Formula One racing team uses real-time car sensor data during races, identifies issues with its racing cars using predictive analytics, and takes corrective action proactively before it is too late.

7) How Morgan Stanley uses Hadoop: Gary Bhattacharjee, executive director of enterprise information management at the firm, had worked with Hadoop as early as 2008 and thought that it might provide a solution, so the IT department hooked up some old servers. At the Fountainhead conference on Hadoop in finance in New York, Bhattacharjee said the investment bank started by stringing together 15 end-of-life boxes: "It allowed us to bring really cheap infrastructure into a framework and install Hadoop and let it run." One area that Bhattacharjee would talk about was IT and log analysis. A typical approach would be to look at web logs and database logs to find problems, but one log alone wouldn't show whether a web delay was caused by a database issue. Morgan Stanley put the logs into Hadoop and ran time-based correlations; now they can see market events and how they correlate with web issues and database read/write problems.

8) Big data at Ford

With analytics now embedded into the culture of Ford, the rise of big data analytics has created a whole host of new possibilities for the automaker. "We recognize that the volumes of data we generate internally - from our business operations and also from our vehicle research activities - as well as the universe of data that our customers live in and that exists on the internet, all of those things are huge opportunities for us that will likely require some new specialized techniques or platforms to manage," said Ginder. "Our research organization is experimenting with Hadoop and we are trying to combine all of these various data sources that we have access to. We think the sky is the limit. We recognize that we are just kind of scraping the tip of the iceberg here." The other major asset that Ford has going for it when it comes to big data is that the company is tracking enormous amounts of useful data in both the product development process and the products themselves.

Ginder noted, "Our manufacturing sites are all very well instrumented. Our vehicles are very well instrumented. They are closed-loop control systems. There are many, many sensors in each vehicle. Until now, most of that information stayed in the vehicle, but we think there is an opportunity to grab that data, understand better how the car operates and how consumers use the vehicles, and feed that information back into our design process to help optimize the user experience in the future as well."

Of course, big data is about a lot more than just harnessing all of the runaway data sources that most companies are trying to grapple with; it is about structured data plus unstructured data. Structured data is all the traditional stuff most companies have in their databases (as well as the kind of data Ford is talking about with the sensors in its vehicles and assembly lines). Unstructured data is the stuff that is now freely available across the internet, from public data being exposed by governments on sites such as data.gov in the US, to treasure troves of consumer intelligence such as Twitter. Mixing the two and coming up with new analysis is what big data is all about.

"The amount of that data is only going to grow, and there is an opportunity for us to combine that external data with our internal data in new ways," said Ginder. "For better forecasting or for better insight into product design, there are many, many opportunities."

Ford is also digging into the consumer intelligence aspect of unstructured data. Ginder said, "We recognize that the data on the internet is potentially insightful for understanding what our customers or our potential customers are looking for and what their attitudes are, so we do some sentiment analysis around blog posts, comments and other types of content on the internet."

That kind of thing is pretty common, and a lot of Fortune 500 companies are doing similar kinds of things. However, there is another way that Ford is using unstructured data from the web that is a little more unique, and it has impacted the way the company predicts future sales of its vehicles.

"We use Google Trends, which measures the popularity of search terms, to help form our own internal sales forecasts," Ginder explained. "Along with other internal data we have, we use that to build a better forecast. It is one of the inputs for our sales forecast. In the past it would just be what we sold last week; now it is what we sold last week plus the popularity of the search terms. Again, I think we are just scratching the surface. There is a lot more I think we will be doing in the future."

The computer and electronics products and information sectors, traded globally, stand out as sectors that have already been experiencing very strong productivity growth and that are poised to gain substantially from the use of big data. Two services sectors, finance and insurance and government, are positioned to benefit very strongly from big data as long as barriers to its use can be overcome.

Several sectors have experienced negative productivity growth, probably indicating that these sectors face strong systemic barriers to increasing productivity. Among the remaining sectors, we see that globally traded sectors tend to have experienced higher productivity growth, while local services (mainly cluster E) have experienced lower growth.

While all sectors will have to overcome barriers to capture value from the use of big data, barriers are structurally higher for some than for others. For example, the public sector, including education, faces higher hurdles because of a lack of a data-driven mindset and of available data. Capturing value in health care faces challenges given the relatively low IT investment performed so far. Sectors such as retail, manufacturing and professional services may have relatively lower degrees of barriers to overcome, for precisely the opposite reasons.


4.3 What Big Data is used for at IBM

1971: SPEECH RECOGNITION

Speech recognition (SR) is the translation of spoken words into text. It is also known as "automatic speech recognition" (ASR). Some SR systems use "speaker independent

speech recognition" while others use "training" where an individual speaker reads

sections of text into the SR system. These systems analyze the person's specific voice

and use it to fine tune the recognition of that person's speech, resulting in more

accurate transcription. Systems that do not use training are called "speaker

independent" systems. Systems that use training are called "speaker dependent"

systems.

Speech recognition applications include voice user interfaces such as voice dialling

(e.g. "Call home"), call routing (e.g. "I would like to make a collect call"),

domotic appliance control, search (e.g. find a podcast where particular words were

spoken), simple data entry (e.g., entering a credit card number), preparation of

structured documents (e.g. a radiology report), speech-to-text processing (e.g., word

processors or emails), and aircraft (usually termed Direct Voice Input).

1980: RISC architecture (Reduced Instruction Set Computer). In older IBM servers, the performance speed level was improved.

1988: NSFNET, connecting networks between many universities in the US; it spread to 92 countries before the backbone service transitioned to commercial ISPs in 1995.


1993: Scalable parallel systems. A multiprocessor is a tightly coupled computer system having two or more processing units (multiple processors), each sharing main memory and peripherals, in order to process programs simultaneously. Sometimes the term multiprocessor is confused with the term multiprocessing.

1996: DEEP THUNDER. It produces daily weather reports, showing the calculation and manipulation of weather projections.

1997: DEEP BLUE. An IBM RS/6000-based supercomputer with parallel processing, breaking a task up into smaller subtasks and executing them in parallel.

2000: Linux operating system.

2004: BLUE GENE. One of the fastest supercomputers of its time, used for a wide range of applications, including medical and climate research.

2009: The first nationwide smart energy and water grid, addressing water shortages and skyrocketing energy costs; it monitors waste, incentivizes efficient resource use, detects theft, and reduces dependency, among other utilities.

2009: STREAM COMPUTING (video streaming). For example, when there are events in an organisation, all the audio and video file recordings are stored on a server.

2009: CLOUD. The future of the cloud is going to be a hybrid combination of public

and private cloud, not one or the other. There will be times when you want to run a

workload in a private cloud, then move it up to a public cloud, and later move it back

again to your private cloud. We see a Microsoft private cloud as the first step towards

building a cloud that allows you to go into the public cloud, which is what we call

Windows Azure. With Microsoft, our cloud offerings are designed so that your

private cloud and public cloud work together.

2010: GPFS-SNC. The General Parallel File System is a shared-disk clustered file system stored on a SAN; GPFS provides high availability, disaster recovery, security and a hierarchical storage system.

EXECUTIVE SUMMARY

The internet has made new sources of vast amounts of data available to business executives. Big data is comprised of data sets too large to be handled by traditional systems. To remain competitive, business executives need to adopt the new technologies and techniques emerging due to big data. Big data includes structured, semi-structured and unstructured data. Structured data are those data formatted for use in a database management system. Semi-structured and unstructured data include all types of unformatted data, including multimedia and social media content. Big data are also produced by a myriad of hardware objects, including sensors and actuators embedded in physical objects.


DESIGN OF SURVEY

ANALYSIS OF BIG DATA


Name of the person filling the form: ____

Organisation: ____

Department Name: ____


Are you aware of Big Data?
[ ] Yes
[ ] No
[ ] Other: ____

What type(s) of data do you use in your organisation?
[ ] Microsoft Office
[ ] Open Office
[ ] Tally
[ ] Libre Office
[ ] Lotus Office
[ ] None
[ ] Other: ____

What kind of module(s) do you use in your organisation? *
[ ] ERP
[ ] SAP
[ ] SUN
[ ] TALLY
[ ] WEB PORTAL
[ ] NONE
[ ] Other: ____


What type(s) of data format do you use?
[ ] Document format (.doc, .docx)
[ ] Excel format (.xls)
[ ] PDF format (.pdf)
[ ] JPEG format
[ ] Video format
[ ] None
[ ] Other: ____

If your data is lost, is there any process for data restoration?
[ ] Yes
[ ] No
[ ] Other: ____

Do you have any procedure for data backup?
[ ] Daily backup
[ ] Weekly backup
[ ] Monthly backup
[ ] None
[ ] Other: ____


What media do you use to take backup of data stored on your system?
[ ] CD/DVD
[ ] USB external storage drive
[ ] USB pen drive
[ ] Local drive(s) on your system
[ ] None
[ ] Other: ____

Which email service do you use?
[ ] Zimbra
[ ] Microsoft Exchange
[ ] Lotus
[ ] None
[ ] Other: ____

Do you have an archived backup of your data and email service?
[ ] Yes
[ ] No
[ ] Other: ____

Which service do you use for offline backup of your email service?
[ ] Microsoft Outlook
[ ] Outlook Express
[ ] Mozilla Thunderbird
[ ] Zimbra Client
[ ] Lotus Notes
[ ] None
[ ] Other: ____

Please provide your View(s) on Big Data ? Explain in 1 line

 

Do you have any Suggestions on the concept of Big Data ?

 



DATA COLLECTION SUMMARY

For data collection we received responses from only 73 people.

Web-Based Questionnaires: A new and steadily growing methodology is the use of internet-based research. This means receiving an e-mail containing a link that takes you to a secure website where you fill in a questionnaire. This type of research is often quicker and less detailed. Disadvantages of this method include the exclusion of people who do not have a computer or are unable to access one. The validity of such surveys is also in question, as people might be in a hurry to complete them and so might not give accurate responses.

(https://docs.google.com/a/welingkarmail.org/forms/d/1OYNpCCQyEzCqn42ih_LX7NksW5GR_frU5wnhA_OPVAY/viewform)

Questionnaires often make use of checklists, checkboxes and rating scales. These devices help simplify and quantify people's behaviours and attitudes. A checklist is a list of behaviours, characteristics, or other entities that the researcher is looking for; either the researcher or the survey participant simply checks whether each item on the list is observed, present or true, or not. A rating scale is more useful when behaviour needs to be evaluated on a continuum; rating scales are also known as Likert scales.
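To make the rating-scale idea concrete, the following is a minimal Python sketch; the labels, numeric mapping and sample responses are hypothetical and not taken from this survey. It shows how Likert-style answers can be converted to numbers and summarised.

# A hypothetical mapping from Likert labels to numbers, used to summarise a
# rating-scale question; the labels and responses below are illustrative only.
LIKERT = {"Strongly disagree": 1, "Disagree": 2, "Neutral": 3, "Agree": 4, "Strongly agree": 5}

responses = ["Agree", "Neutral", "Strongly agree", "Agree", "Disagree"]
scores = [LIKERT[r] for r in responses]
mean_score = sum(scores) / len(scores)
print(f"Mean rating: {mean_score:.2f} on a 1-5 scale")   # 3.60 for this sample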


DATA ANALYSIS

In this data analysis, about 70% of respondents indicated that they know about big data. The diagrams in the Analysis of Big Data output show the scales and ratings obtained from the analysis.

73 responses

Summary

Name of the person filling the form:

Chitin Salian Vikram Madhav Shinde Priti Nikita Shoaib Momim Gouri rohitkhana Neha ganesh devarushi sheetal siddhant Bitla Nazir Kanoor Trupti Mengle Bhoja Amin Bhavesh dodia Archana Rathod Deepakl praveen surakha kamble Vinitha Nair suhila amin mehmood Kanoor Lynette Priyanka Salunke Subodh durgesh Deepak Supriya More Ankur Thakkar Parveen Shaikh Vajreshwari Amin Siddhi Deshpande prashant desai Tanuja Latesh Poojary Vidya satish Priyanka Ajgaonkar Jitin Salian Pramod Mulik Heena Shailkh Sandeep Kelkar piyush Maneesha Mhatre Rao Dilip Vishwasrao ajay Rupal Choudhari Vanmala Bhagwati Mrunal Shivan Naina naina Girish santosh kadam ajay desai Mehek somappa salian parinita Maitreyee Anagheswar Rutuja Deshmukh Jeetendra Velhal dipesh nagalkar Neeta Papal Ryan Rodricks Akshayadevi Sawant Santosh Rajeesh Nair Surjit Singh deepak husain kanor

ORGANISATION

lokandwala SBI Salian Daulat Exim Pvt Ltd Accentures Lupin Ltd. web print SBI Life welingkar institutes Ugam Solutions MT Educare Pvt. Ltd WeSchool JLL Redington India Ltd poojary Kotian jsw steel Welingkar DATA Welingkar Institute WeSchool cargo AB Enterprises Desperado Inc accenture Cybertech Ferrero autonomous Welingkar Institute Of Management Accenture Services Pvt. Ltd. coca cola Godrej Infotech 3dplm wipro bajaj auto ltd TCS orange business service SIBER Maxwell Industries Ltd deloitte punjab national bank Tetra Pack College of ABM, Oros Rai University none Dhruv infotech Sai Service Agency Pvt Ltd sorte Alitalia Airlines Atos origin central bank of india welingkar institute of management hdfc Welingkar Institute of Management Jacquor net magic ICFAI University Nokia eClerx Service Ltd xolo Amin Annet Tech. Sixsigma Pvt. Soft Solutions Yoga Vidya Niketan Ideate digital Welingkar capgemini HCL Infosystems Ltd. poona finance bank

Department Name

Quality Operation BMLP IT Infrastructure Support production Sales & Marketing Technology Distribution Marketing supply chain management support Customer Care GOC back office Event Management none Yoga Kendra quality Management CVN Testing Marketing Repairing printing dept Reservations and Holiday Packages mis accounting research store dept RAID HLDC admin finance Digital Marketing Designing manufacturing Coordination Admin Admin MBA Media Distance education Department Finance Administrator IT Design import dept UBI Data Analyst IT ADC Technology governance Accounts MIS Administration dispatch Accounts accounts logistic hardware HR & Admin it PGDM – FMB

Are you aware of Big Data ?

Yes 42 59%

No 27 38%

Other 2 3%
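As a small illustration of how the percentages in these summaries are derived, the Python sketch below recomputes the figures for the awareness question from its raw counts (42, 27 and 2 responses); the layout of the output is only for illustration.

# A minimal sketch of how the percentages in these result tables can be
# reproduced from the raw counts (counts taken from the question above).
counts = {"Yes": 42, "No": 27, "Other": 2}
total = sum(counts.values())            # 71 respondents answered this question

for option, n in counts.items():
    print(f"{option}: {n} ({n / total:.0%})")   # Yes: 42 (59%), No: 27 (38%), Other: 2 (3%)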

What type(s) of data do you use in your organisation ?

Microsoft Office 66 51%

Open Office 18 14%

Tally 17 13%

Libra Office 4 3%

Lotus Office 9 7%

None 4 3%


Other 12 9%

What kind of Module(s) do you use in your organisation ?

ERP 28 21%

SAP 12 9%

SUN 4 3%

TALLY 20 15%

WEB PORTAL 49 37%

NONE 7 5%

Other 13 10%

What type(s) of Data Format do you use?

Document Format (.doc, docx) 64 23%

Excel Format(.xls) 67 24%

PDF Format() 60 22%

JPEG Format 47 17%

Video Format() 28 10%

None 3 1%

Other 6 2%

If your data is lost, is there any process for data restoration?

Yes 51 72%

No 19 27%

Other 1 1%

Do you have any procedure for Data Backup?

Daily backup 32 40%

Weekly backup 22 28%

Monthly backup 11 14%

None 12 15%

Other 3 4%

What media do you use to take backup of data stored on your system?

CD/DVD 25 18%

USB EXTERNAL Storage Drive 39 27%

USB Pendrive 29 20%

Local drive(s) on your system 31 22%

None 9 6%

Other 9 6%

Which Email Service do you use?

Zimbra 24 34%

Microsoft Exchange 25 35%

Lotus 7 10%

None 6 8%

Other 9 13%

Do you have an Archived Backup of your data and email service?

Yes 40 58%

No 28 41%

Other 1 1%

Which service do you use for offline Backup of email service?

Microsoft Outlook 21 29%

Outlook Express 8 11%

Mozilla ThunderBird 5 7%

Zimbra Client 10 14%

Lotus notes 9 13%

None 18 25%

Other 1 1%

Please provide your View(s) on Big Data ? Explain in 1 line

The biggest challenge for any huge organisation is to figure out who should own the big data initiatives that straddle the entire organisation.

Big Data is a must in every organisation as there is always a chance of losing a big chunk of the important data.

no idea abt big data

Great

cloud backup

Big data is collocation of all the relevant data.

big data should be have backup

Student DATA

Big data is useful for viewing structured data.

I don't deal with big data on day to day basis. Hence not able to justify.

other

no idea about Big data

Harddisk

Most of the organizations maintain a mix of data sets in various forms. Irrespective of your profession apps are require to make your data more accessible, usable and valuable.

More reliable

Should have more capacity for Data use.

Big data refers to groups of data that are so large and unwieldy that regular database management tools have difficulty capturing, storing, sharing and managing the information.

I don't know any thing about Big data

big data

Big Data Required More Efficient

Big data Should be stored in centralised format

The file which is more than 10 MB.

Big Data is not limited to just email but its more about the business data running on servers (Oracle database, data backup of servers, etc)

none

no idea

we are use many types of data formats, for security reason we take daily backup.

Use cloud Service for video and jpeg format

require advance technology

cloud computing

not yet used, but aware of what big data is.

Big data is future.

no idea abt big data

It's a huge hype with only top guns moving into the technology. The investment is high and seems risk prone.

all the database should be stored in centralised

data should be in cloud

Big data require more security

Provide custopmization in big data

Big data should be disaster reovery

Accounts Statement

data should be in Icloud. It should be easily accessable for the person who is storing it ..

should provide in cloud

Do you have any Suggestions on the concept of Big Data?

no

big data cloud backup

NA

big data should be have backup

Information management is the most crucial in the organizations, hence more research on data management and analytics is required.

other

i suggestion data should be in cloud backup, it is very much secure & better, Thanks

Purchase New USB Hard disk Drive And Copy All Data

For economic feasibility in any successful organisation, it is essential to device ways and means to handle "big data" without driving up the hardware costs.

More reliable

No

1) Backup concept needs to be highlighted in this survey. 2) Software used should be asked 3) Kind of data to be backed up should be asked 4) Retrieval procedure should be asked 5) Incase of backup is corrupted there should be a concept of fail safe module which should be discussed

Store more Hard disk

Big data Should be stored in centralised format

The file which is more than 10 MB.

none

Same as above

no idea

If daily back up is taken then it would be of great help and if possible back up is taken dept. wise then the time consumed for retriving the data would be reduced.

require advance technology

no idea abt big data

data should be in cloud

cloud computing

Use cloud Serice for video and jpeg format

Big data require more security

Provide custopmization in big data

Big data should be disaster reovery

nope, not yet

data should be in Icloud. It should be easily accessable for the person who is storing it ..

should provide in cloud

(Chart: number of daily responses)

INFERENCES/KEY FINDINGS

1. Nearly half the data (49%) is unstructured (text), while 51% is structured. Also,

about 70% of the data is from internal sources.

2. Logistics and finance expect the greatest ROI, although sales and marketing have

a bigger share (30%) of the Big Data budget

3. Monitoring how customers use their products to detect product and design flaws is

seen as a critical application for Big Data


4. About half of the firms surveyed are using Big Data, and many of them projected

big returns for 2014

5. Big split in spending on Big Data, with a minority of companies spending massive

amounts and a larger number spending very little

6. Investments are geared toward generating and maintaining revenue.

7. The biggest challenges to getting business value from Big Data are as much

cultural as they are technological.

8. The biggest projected 2012 Big Data returns for leaders came from places that

laggards did not value as much: improving customers' offline experience and

location-based marketing.

9. Companies that do more business on the Internet spend more on Big Data and

project greater ROI.

10. Organizing a core unit of Big Data analysts in a separate function appears to be

important to success.

11. Big Data has become big news almost overnight, and there are no signs that

interest is waning. In fact, several indicators suggest executive attention will climb

even higher.

12. Over the last three years, few business topics have been mentioned in the media

and researched as extensively as Big Data. Hundreds of articles have appeared in

the general business press (for example, Forbes, Fortune, Bloomberg

BusinessWeek, The Wall Street Journal, The Economist), technology publications

and industry journals, and more seem to be written by the day. A March 2013

search on Amazon.com surfaces more than 250 books, articles and e-books on the

topic, most of them published in the last three years.


13. Dozens of studies have been conducted on Big Data as well, and every week

another one appears. Most of the big consulting firms and IT services companies

have weighed in, as well as (of course) the technology research community:

Gartner, Forrester, IDC and many of the rest.

CONCLUSION

It seems that the general conclusion is that big data is here to stay, that there are good arguments for moving forward with big data systems, and that the best way is to start small and prove the benefits. While this is not much different from any other new technology, it may be an especially good strategy for big data applications. Cloud computing may also prove valuable for big data, but it is not necessary that every piece of data should be in the cloud. For example, an organisation with big data could store its video, audio, JPEG and social media files in cloud computing, while keeping log data, confidential documents, event records, email backups and online transaction files within the internal organisation.


At the end of the day, big data provides an opportunity for “big analysis” leading to

“big opportunities” to gain a competitive edge, to advance the quality of life, or to

solve the mysteries of the world.

SUGGESTIONS

1. Expanding customer intelligence

2. Improving operational efficiencies

3. Adding mobility to big data 

4. “Big Data” and “Analytics” – As a service

5. Define big data problems

6. Technology infrastructure recommendation, setup, ongoing operations &

support.

7. Ingestion of big data

8. Analytics – algorithms, map/reduce, statistical functions (a minimal map/reduce sketch follows this list)

9. Integration with enterprise systems

10. Recommendation Engine for e-Commerce portals


11. Big data should be in cloud computing (e.g. video and JPEG files)

12. Replication and Disaster Recovery
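To illustrate the map/reduce style of analytics mentioned in point 8 above, here is a minimal Python sketch; the sample records are made up for illustration. The input is split into chunks, a map step counts words in each chunk in parallel, and a reduce step merges the partial counts.

from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_chunk(lines):
    """Map step: count words in one chunk of records."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(a, b):
    """Reduce step: merge two partial counts."""
    a.update(b)
    return a

if __name__ == "__main__":
    records = ["big data needs map reduce", "map reduce splits big jobs", "data data data"]
    chunks = [records[i::2] for i in range(2)]       # split the input into 2 chunks
    with Pool(processes=2) as pool:
        partials = pool.map(map_chunk, chunks)       # run the map step in parallel
    totals = reduce(reduce_counts, partials, Counter())
    print(totals.most_common(3))

Production systems such as Hadoop apply the same split/map/shuffle/reduce pattern across many machines rather than local processes.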

Any security measure added to a big data environment should meet the following requirements:

It must not compromise the basic functionality of the cluster.

It should scale in the same manner as the cluster.

It should not compromise the essential characteristics of big data.

It should address – or at least mitigate – a security threat to big data environments or to data stored within the cluster.

So how can we secure big data repositories today? The following is a list of common challenges, with security measures to address them:

1. User access: We use identity and access management systems to control users,

including both regular and administrator access.

2. Separation of duties: We use a combination of authentication, authorization, and

encryption to provide separation of duties between administrative personnel.

We use application space, namespace, or schemata to logically segregate user

access to a subset of the data under management.

3. Indirect access: To close “back doors” – access to data outside permitted

interfaces – we use a combination of encryption, access control, and configuration

management.

4. User activity: We use logging and user activity monitoring (where available) to

alert on suspicious activity and enable forensic analysis.

5. Data protection: Removal of sensitive information prior to insertion and data

masking (via tools) are common strategies for reducing risk. But the majority of

big data clusters we are aware of already store redundant copies of sensitive data.

This means the data stored on disk must be protected against unauthorized access,

and data encryption is the de facto method of protecting sensitive data at rest. In

keeping with the requirements above, any encryption solution must scale with the


cluster, must not interfere with MapReduce capabilities, and must not store keys

on hard drives along with the encrypted data – keys must be handled by a secure

key manager.
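As a minimal sketch of encrypting a record before it reaches disk, assuming the Python cryptography package is available, the example below encrypts and decrypts a record with a symmetric key; in a real deployment the key would come from an external key manager rather than being generated locally, in line with the requirement above.

# pip install cryptography
from cryptography.fernet import Fernet

# In production the key would come from an external key manager (KMS), never
# from the same disks that hold the encrypted data; here it is generated in
# memory only to illustrate the flow.
key = Fernet.generate_key()
fernet = Fernet(key)

plaintext = b"customer_id=1001,email=someone@example.com"
token = fernet.encrypt(plaintext)          # what gets written to the data node
restored = fernet.decrypt(token)           # only possible with the external key
assert restored == plaintext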

6. Eavesdropping: We use SSL and TLS encryption to protect network

communications. Hadoop offers SSL, but its implementation is limited to client

connections. Cloudera offers good integration of TLS; otherwise look for third

party products to close this gap.
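A minimal Python sketch of a TLS-protected client connection is shown below; the host name and port are hypothetical placeholders for a cluster service that has been configured with a certificate, so the snippet assumes such an endpoint is reachable.

import socket
import ssl

# Hypothetical host/port for illustration; in a real cluster this would be a
# node or service endpoint that has been configured with a TLS certificate.
HOST, PORT = "datanode.example.internal", 9871

context = ssl.create_default_context()             # verifies certificates by default
context.minimum_version = ssl.TLSVersion.TLSv1_2   # refuse older protocol versions

with socket.create_connection((HOST, PORT)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=HOST) as tls_sock:
        print("Negotiated protocol:", tls_sock.version())
        tls_sock.sendall(b"ping")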

7. Name and data node protection: By default Hadoop HTTP web consoles

(JobTracker, NameNode, TaskTrackers, and DataNodes) allow access without any

form of authentication. The good news is that Hadoop RPC and HTTP web

consoles can be configured to require Kerberos authentication. Bi-directional

authentication of nodes is built into Hadoop, and available in some other big data

environments as well. Hadoop’s model is built on Kerberos to authenticate

applications to nodes, nodes to applications, and client requests for MapReduce

and similar functions. Care must be taken to secure granting and storage of

Kerberos tickets, but this is a very effective method for controlling what nodes

and applications can participate on the cluster.

Application protection: Big data

clusters are built on web-enabled platforms – which means that remote injection,

cross-site scripting, buffer overflows, and logic attacks against and through client

applications are all possible avenues of attack for access to the cluster.

Countermeasures typically include a mixture of secure code development

practices (such as input validation, and address space randomization), network

segmentation, and third-party tools (including Web Application Firewalls, IDS,

authentication, and authorization). Some platforms offer built-in features to

bolster application protection, such as YARN’s web application proxy service.


Archive protection: As backups are largely an intractable problem for big data, we

don’t need to worry much about traditional backup/archive security. But just

because legitimate users cannot perform conventional backups does not mean an

attacker would not create at least a partial backup. We need to secure the

management plane to keep unwanted copies of data or data nodes from being

propagated. Access controls, and possibly network segregation, are effective

countermeasures against attackers trying to gain administrative access, and

encryption can help protect data in case other protections are defeated.

In the end,

our big data security recommendations boil down to a handful of standard tools

which can be effective in setting a secure baseline for big data environments:

Use Kerberos: This is an effective method for keeping rogue nodes and applications

off your cluster. And it can help protect web console access, making

administrative functions harder to compromise. We know Kerberos is a pain to set

up, and (re-)validation of new nodes and applications takes work. But without bi-

directional trust establishment it is too easy to fool Hadoop into letting malicious

applications into the cluster, or into accepting malicious nodes – which

can then add, alter, or extract data. Kerberos is one of the most effective security

controls at your disposal, and it’s built into the Hadoop infrastructure, so use it.
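As a sketch of the configuration side of this recommendation, the Python snippet below writes the two core Hadoop properties that switch authentication from the default "simple" mode to Kerberos. The output path is a placeholder, and a real deployment additionally needs per-service principals, keytab locations and further properties that are omitted here.

import xml.etree.ElementTree as ET

# Hypothetical output path; on a real node this would be $HADOOP_CONF_DIR/core-site.xml.
CONF_PATH = "core-site.xml"

# Two well-known Hadoop security properties; everything else (principals,
# keytab locations, per-service settings) is omitted in this sketch.
properties = {
    "hadoop.security.authentication": "kerberos",   # default is "simple"
    "hadoop.security.authorization": "true",        # enable service-level authorization
}

configuration = ET.Element("configuration")
for name, value in properties.items():
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value

ET.ElementTree(configuration).write(CONF_PATH, xml_declaration=True, encoding="utf-8")
print(f"Wrote {len(properties)} security properties to {CONF_PATH}")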

File layer encryption: File encryption addresses two attacker methods for

circumventing normal application security controls. Encryption protects in case

malicious users or administrators gain access to data nodes and directly inspect

files, and it also renders stolen files or disk images unreadable. Encryption

protects against two of the most serious threats. Just as importantly, it meets our


requirements for big data security tools – it is transparent to both Hadoop and

calling applications, and scales out as the cluster grows. Open source products are

available for most Linux systems; commercial products additionally offer external

key management, trusted binaries, and full support. This is a cost-effective way to

address several data security threats.

8. Management: Deployment consistency is difficult to ensure in a multi-node

environment. Patching, application configuration, updating the Hadoop stack,

collecting trusted machine images, certificates, and platform discrepancies, all

contribute to what can easily become a management nightmare. The good news is

that most of you will be deploying in cloud and virtual environments. You can

leverage tools from your cloud provider, hypervisor vendor, and third parties

(such as Chef and Puppet) to automate pre-deployment tasks. Machine images,

patches, and configuration should be fully automated and updated prior to

deployment. You can even run validation tests, collect encryption keys, and

request access tokens before nodes are accessible to the cluster. Building the

scripts takes some time up front but pays for itself in reduced management time

later, and additionally ensures that each node comes up with baseline security in

place.

Log it!: Big data is a natural fit for collecting and managing log data (a small log-summary sketch in Python follows at the end of this section). Many

web companies started with big data specifically to manage log files. Why not add

logging onto your existing cluster? It gives you a place to look when something

fails, or if someone thinks perhaps you have been hacked. Without an event trace

you are blind. Logging MR requests and other cluster activity is easy to do, and

increases storage and processing demands by a small fraction, but the data is

indispensable when you need it.

Secure communication: Implement secure

communication between nodes, and between nodes and applications. This requires


an SSL/TLS implementation that actually protects all network communications

rather than just a subset. Cloudera appears to get this right, and some cloud

providers offer secure communication options as well; otherwise you will likely

need to integrate these services into your application stack.
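The sketch below, referenced from the "Log it!" recommendation above, shows the kind of lightweight log summarisation meant there; the log lines are invented for illustration, and on a real cluster they would be read from collected log files.

from collections import Counter

# Hypothetical job-tracker style log lines; real cluster logs would be read
# from files collected on the cluster itself.
log_lines = [
    "2013-09-12 10:01:22 INFO  mapreduce.Job job_001 SUBMITTED user=amin",
    "2013-09-12 10:04:09 WARN  hdfs.DataNode slow block report",
    "2013-09-12 10:05:41 ERROR mapreduce.Job job_002 FAILED user=guest",
    "2013-09-12 10:07:13 INFO  mapreduce.Job job_003 SUBMITTED user=amin",
]

by_level = Counter(line.split()[2] for line in log_lines)   # INFO/WARN/ERROR counts
by_user = Counter(
    token.split("=")[1]
    for line in log_lines
    for token in line.split()
    if token.startswith("user=")
)

print("Events by severity:", dict(by_level))
print("Job events by user:", dict(by_user))

Counting events by severity and by user is only a starting point, but even this level of visibility helps when investigating a failure or a suspected intrusion.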

(Chart: data types in use – transaction log data, events, email, social media, sensor data, external feeds, RFID scans and free-form text – with shares ranging from 93% down to 37%.)

Bibliography

1. IBM – Analysis of Big Data.

2. John Webster – "Understanding Big Data Analytics", Aug, searchstorage.techtarget.com.

3. Bill Franks – "What's up With In-Memory?", May 7, 2012, iilanalytics.com.

4. Pankaj Maru – "Data Scientist: The new kid on the IT block", Sep 3, 2012, CIOL.com.

5. Yellowfin White Paper – "In-Memory Analytics", www.yellowfin.bi.

6. "Morgan Stanley takes on Big Data With Hadoop", March 30, 2012, Forbes.com.

7. Ravi Kalakota – "New Tools For New Times – Primer on Big Data, Hadoop and 'In-Memory' Data Clouds", May 15, 2011, practicalanalytics.wordpress.com.

8. McKinsey Global Institute – "Big Data: The next frontier for innovation, competition, and productivity", June 2011.

9. Harish Kotadia – "4 Excellent Big Data Case Studies", July 2012, hkotadia.com.

10. Jeff Kelly – "Big Data: Hadoop, Business Analytics and Beyond", Aug 27, 2012, Wikibon.org.

11. Joe McKendrick – "7 new types of jobs created by big data", Sep 20, 2012, SmartPlanet.com.

12. Jean-Jacques Dubray – "NoSQL, NewSQL and Beyond", Apr 19, 2011, InfoQ.com.
