
Page 1: Apache Hadoop India Summit 2011 talk "Provisioning Hadoop’s MapReduce in cloud for Effective Storage as a Service" by S. M. Shalinie

Provisioning Hadoop’s MapReduce in

Cloud for Effective Storage as a Service

Dr. S.M.Shalinie,

Associate Professor and Head,

Department of Computer Science and Engineering,

Thiagarajar College of Engineering, Madurai 625 015

Page 2:

Introduction

• The advent of Web 2.0 has made organizations move towards cloud computing models with IT as a Service

• The explosive growth of audio, video and user-generated content clearly implies that maintaining data center hardware infrastructure is a major challenge

• Major concerns related to huge data are:
- Security
- Storage Management
- Data Reduction Techniques
- Data Archiving

Page 3:

Impact of Data Growth

According to a recent Gartner survey report:

• 47% of enterprises identified 'data growth' as their top challenge; the other two top challenges were 'system performance and scalability' (37%) and 'network congestion and connectivity' (36%)

• This is because data growth is particularly associated with increased costs for hardware, software, associated maintenance, administration and services

Source: http://www.gartner.com/it/page.jsp?id=1460213

Page 4:

Traditional Datacenters

• High performance and a high degree of control

• Building scalable and reliable storage requires an experienced, skilled engineering team

• Upfront and maintenance costs are significant, so using resources efficiently is a key factor in saving cost

• Consumes heavy Internet bandwidth

• Requires additional Internet connections and equipment for redundancy or load balancing

• By Moore's law, the hardware price per gigabyte is dropping every day; if a company has deployed too much storage equipment without full utilization, the equipment is wasted

Page 5:

Application categories

Video Surveillance

- To store outdated video clips

Huge data store

- ERP, Industry and Consumer statistics

Backup and archiving

- Server/Desktop offsite backup

Content Distribution

- Static content to save bandwidth

Versioning

- Reduce storage cost

File sharing

Page 6:

Cloud based services

Feature                              Google Apps  Amazon  Dropbox  Other cloud-based services
Basic upload/download/delete files   Yes          Yes     Yes      Yes
Edit file                            Yes          No      No       No
On-demand and scalable               Yes          Yes     Yes      No
Online management                    Yes          Yes     Yes      Yes
Data encryption                      No           No      Yes      Yes
Folder/file ACL support              Yes          Yes     Yes      Yes
Program library                      Plenty       Plenty  Plenty   Basic

Page 7:

Amazon's S3

Object-store

– URL PUT and GET

– Simple usage

Proprietary and Unique

– Coding required

Variable performance

Infinitely scalable

Provision for archives

[Figure: objects are stored and retrieved with simple Put and Get operations]
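The object-store model above (opaque objects addressed by key, accessed via PUT and GET) can be mimicked with a minimal in-memory sketch. This is only an illustration of the semantics; real S3 access goes over HTTP with signed requests, which this deliberately omits.

```python
class ToyObjectStore:
    """In-memory stand-in for an S3-style object store (illustrative only)."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data      # overwrite semantics, like S3 PUT

    def get(self, key: str) -> bytes:
        return self._objects[key]      # raises KeyError if the object is absent

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)   # idempotent delete

store = ToyObjectStore()
store.put("bucket/report.txt", b"quarterly numbers")
```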

Page 8:

Data at Rest

• Integrity
- Accuracy and consistency of data

• Confidentiality
- Ensuring privacy of data
- Ensuring data is accessed only by authorized users

• Information Assurance
- Measures to ensure availability

• Information Security
- Protecting data from unauthorized access, use, disclosure, disruption and modification [2]

Page 9:

Security methods

• Encrypting data at rest
- Maintains confidentiality of data
- Security tradeoff against processing time
- Size complexity issue addressed by compression

• Securing the keys
- Encryption using 128/192/256-bit keys

• Compressing the data
- Process of encoding information with fewer bits

• Deduplication
- Specialized data compression technique for eliminating coarse-grained redundant data to improve storage utilization

Page 10:

Parallelizing Encryption Process

• Encryption consumes significant resources and time

• Making full use of the available resources makes the encryption process effective

• Hadoop's MapReduce provides a large-scale parallel data processing framework for high-end computing applications

• A suitable algorithm is required to perform the encryption process

Page 11:

AES

• To meet a better security standard, the US government agency NIST selected Rijndael's algorithm as the Advanced Encryption Standard (AES), which is accepted as an industry standard [5]

• AES is designed to accept three different key sizes: 128, 192 or 256 bits. The algorithm is capable of encrypting bulk data on top-end 32-bit and 64-bit CPUs

• It is efficient in encrypting all sorts of data deployed in the cloud, from text to audio and video

• The performance of the AES algorithm varies dramatically across CPUs based on key size, and it can be improved remarkably when parallelized

Page 12:

Key Generation

• The key should be kept secret and must be completely random

• It should not follow a particular pattern

• The key has to be generated such that the user has control over the data

• The key should be strong enough that it is not vulnerable to attacks (such as brute force)

[Figure: the user logs in with a user name (e.g. "hadoop") and password; once validated, data upload proceeds with a separate file password]

Page 13:

Key Management

Generation of unique key per user

[Figure: the user name and file password are hashed with SHA-1 and the digest is reduced to 128 bits to form the unique per-user key (DES output shown)]

Page 14:

Overall Process

[Figure: overall flow — the user name "hadoop" and the file password are hashed with SHA-1, the digest is reduced to a 128-bit key, and the key drives the encryption (DES shown)]
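The key derivation the slides depict can be sketched in a few lines. This is a hedged illustration: the exact input encoding and concatenation order are assumptions, and an unsalted SHA-1 of a password is weak by modern standards (a key-derivation function such as PBKDF2 would normally be preferred).

```python
import hashlib

def derive_user_key(username: str, file_password: str) -> bytes:
    # SHA-1 produces a 160-bit digest; per the slides, it is reduced
    # to 128 bits for use as the per-user encryption key.
    digest = hashlib.sha1((username + file_password).encode("utf-8")).digest()
    return digest[:16]  # 16 bytes = 128 bits

key = derive_user_key("hadoop", "file-password")
```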

Page 15:

Encryption modes

• To enhance the effect of the cryptographic algorithm

• To adapt the algorithm to a particular application

• For parallelization, the mode should support encryption of subsequent blocks independently of each other

ELECTRONIC CODE BOOK (ECB) MODE
- Plaintext is handled one block at a time
- Each block is encrypted using the same key

XEX-TCB-CTS (XTS) MODE
- Each block is encrypted using 2 different keys
- The tweak key varies based on the position of the block
- Handles the last, incomplete block of plaintext [1]
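The difference between the two modes can be seen with a toy cipher (a keyed-XOR stand-in, not real AES — an assumption purely for illustration): under ECB, identical plaintext blocks encrypt to identical ciphertext blocks, while an XTS-style positional tweak hides that equality. In both modes each block is independent of the others, which is what makes parallel encryption possible.

```python
import hashlib

BLOCK = 16  # 128-bit blocks, as in AES

def toy_encrypt_block(block: bytes, key: bytes) -> bytes:
    # Toy stand-in for a block cipher: XOR with a key-derived pad.
    # NOT real AES; it only illustrates how the modes differ.
    pad = hashlib.sha256(key).digest()[:BLOCK]
    return bytes(b ^ p for b, p in zip(block, pad))

def ecb_encrypt(blocks, key):
    # ECB: every block is encrypted independently with the same key.
    return [toy_encrypt_block(b, key) for b in blocks]

def xts_like_encrypt(blocks, key, tweak_key):
    # XTS-like: a position-dependent tweak is mixed into each block,
    # so identical plaintext blocks no longer encrypt identically.
    out = []
    for i, b in enumerate(blocks):
        tweak = hashlib.sha256(tweak_key + i.to_bytes(8, "big")).digest()[:BLOCK]
        tweaked = bytes(x ^ t for x, t in zip(b, tweak))
        out.append(toy_encrypt_block(tweaked, key))
    return out

blocks = [b"A" * BLOCK, b"A" * BLOCK]  # two identical plaintext blocks
ecb = ecb_encrypt(blocks, b"k1")
xts = xts_like_encrypt(blocks, b"k1", b"k2")
```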

Page 16:

Parallelizing using Hadoop MR

Data is stored as contiguous blocks, and every block is represented by a unique block_id (I0, I1, I2, ..., In-1)

A MapReduce job includes a set of mappers (M1, M2, ..., Mr) and reducers (R1, R2, ..., Rr)

The input is given to a mapper in the form <block_id, object>, where the object is the data stored in the corresponding block id [3][4]

Page 17:

MapReduce Paradigm

• Execution of the mapper
- Block <I(r-1), object> is given to mapper Mr
- The mapper generates the corresponding output I'r and sends it to reducer Rr
- Let W = {<I0, object1>, <I1, object2>, <I2, object3>, ..., <I(n-1), object n>}; then I'r = I(r-1) ∈ W
- <block_id, object> → Mr → <block_id, Enc_comp(object)>

• Execution of the reducer
- The collected outputs from the various mappers are written to disk in sequential order (I'1, I'2, ..., I'n)
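A minimal sketch of this map/reduce flow, with a thread pool standing in for the Hadoop cluster and a keyed-XOR toy standing in for AES-XTS (both assumptions for illustration only):

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

def mapper(pair):
    # Mapper Mr: <block_id, object> -> <block_id, Enc_comp(object)>.
    # The keyed XOR below is a toy stand-in for the real cipher.
    block_id, data = pair
    pad = hashlib.sha256(b"key" + block_id.to_bytes(8, "big")).digest()
    enc = bytes(b ^ pad[i % len(pad)] for i, b in enumerate(data))
    return block_id, enc

def reducer(mapped):
    # Reducer: write mapper outputs back in sequential block_id order.
    return b"".join(data for _, data in sorted(mapped))

blocks = [(i, bytes([65 + i]) * 8) for i in range(4)]  # <block_id, object> pairs
with ThreadPoolExecutor(max_workers=4) as pool:        # mappers run in parallel
    mapped = list(pool.map(mapper, blocks))
ciphertext = reducer(mapped)
```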

Page 18:

Encryption using MapReduce

[Figure: the name node distributes blocks across Rack 1 and Rack 2; mappers Map 1 ... Map N each apply AES+XTS in parallel, and a reducer assembles the output]

Page 19:

[Figure: plaintext submitted through a web server is encrypted via MapReduce and stored on HDFS across the cluster's racks (Rack 1-3), delivering Storage as a Service]

Page 20:

Performance of the Algorithm

[Figure: three plots of time (mins) versus data size (GB): (i) AES-ECB with mapper only, (ii) AES-XTS with reducer, (iii) AES-XTS with mapper only]

Page 21:

Deduplication

• Technique to improve storage utilization by eliminating coarse-grained redundant data

• The process involves deleting duplicates and leaving only one copy of the data

• The unique copy of the data is referenced using a symbolic link

• By default, Hadoop does not support data deduplication
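Since Hadoop lacks this by default, the idea can be sketched on a local directory standing in for HDFS (the hash choice and layout here are assumptions for illustration): hash each file's contents, keep the first copy of each distinct content, and replace later duplicates with symbolic links.

```python
import hashlib
import os
import tempfile

def dedupe(directory: str) -> None:
    # Replace duplicate files with symbolic links to the first copy seen.
    seen = {}  # content digest -> path of the canonical copy
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen:
            os.remove(path)
            os.symlink(seen[digest], path)  # only one real copy remains
        else:
            seen[digest] = path

# Mirror the slide's example: file1 and file3 share content <abcd>.
d = tempfile.mkdtemp()
for name, text in [("file1", "abcd"), ("file2", "wxyz"), ("file3", "abcd")]:
    with open(os.path.join(d, name), "w") as f:
        f.write(text)
dedupe(d)
```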

[Figure: User1 uploads File1 <abcd>, File2 <wxyz> and File3 <abcd> to HDFS; since File3 duplicates File1's content, only one copy is stored and File3 becomes a symbolic link to File1]

Page 22:

Compression

• To exploit statistical redundancy in the data and represent it using fewer bits

• Among many algorithms, analysis proved that bzip2 has a better compression ratio for text files

• MapReduce can be used to compress a set of large text files in an efficient manner
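Python's standard bz2 module illustrates the compression step on its own (the sample text and compression level are arbitrary choices; highly repetitive text compresses far better than typical data):

```python
import bz2

# A deliberately redundant text sample: 1000 copies of one sentence.
text = ("the quick brown fox jumps over the lazy dog\n" * 1000).encode("utf-8")

compressed = bz2.compress(text, compresslevel=9)
ratio = len(text) / len(compressed)  # >1 means the data shrank
```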

Page 23:

Deduplication and Compression Using MapReduce

[Figure: plain text, audio and video from User1, User2 and User3 pass through a MapReduce stage performing deduplication and compression, then a MapReduce stage performing encryption, producing the compressed output]

Page 24:

Text Data Results

[Figure: time (mins) versus data size (GB) for AES+XTS encryption of text data with and without compression, and compression ratio versus data size (MB)]

Page 25:

Image Data Results

[Figure: time (mins) versus data size (GB) and compression ratio versus data size (GB) for image data]

Page 26:

Inference

• Encrypting data at rest is the ideal option for maintaining the integrity and confidentiality of user data

• Encryption using AES-XTS gives better performance

• The compression results prove that storage requirements have been reduced by a ratio of 1:10 for text data and 1:2 for image data

Page 27:

Future Enhancement

• Deduplication by classification of similar images based on fuzzy matching techniques

• Extend the bucket system to store objects in buckets securely and efficiently

• Validate the results using standard data sets such as Enron

Page 28:

Our Contribution

• Provisioning Hadoop in a cloud data store through AES encryption

• Deduplication and compression using MapReduce

• Providing integrity and confidentiality of data, thereby assisting business applications

• The 'Secure Storage as a Service' methodology is well suited for cloud-based services

Page 29:

Summary

• AES supports encrypting large-scale data well

• The MapReduce model is suitable for running the encryption process in parallel

• Storage space can be managed efficiently by applying compression before the encryption step

• Experimental results prove that compression followed by encryption using MapReduce suits securing data at rest in the cloud

Page 30:

Other projects

1. TCE MR Simulator
- To reduce the execution time of MapReduce jobs
- To design a scheduler with pre-emption support
- To address the HDFS scalability issue
- To index larger files before searching

2. Securing Hadoop Environment
- To develop a bucket management system
- To maintain the integrity of data between nodes during the MapReduce process

3. Parallelization of Machine Learning Algorithms
- To generate frequent item sets using MapReduce for large datasets

Page 31:

References

[1] M. Dworkin, "Recommendation for Block Cipher Modes of Operation: The XTS-AES Mode for Confidentiality on Storage Devices", NIST Special Publication 800-38E, US Nat'l Inst. of Standards and Tech., 2010.

[2] Lori M. Kaufman, "Data Security in the World of Cloud Computing", IEEE Security and Privacy, Vol. 2, pp. 61-64, 2010.

[3] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Communications of the ACM, Vol. 51, No. 1, 2008.

[4] http://hadoop.apache.org

[5] Bruce Schneier and Doug Whiting, "A Performance Comparison of the Five AES Finalists", Second AES Candidate Conference, 2000.