Apache Hadoop India Summit 2011 talk: "Provisioning Hadoop's MapReduce in Cloud for Effective Storage as a Service"
TRANSCRIPT
Provisioning Hadoop’s MapReduce in
Cloud for Effective Storage as a Service
Dr. S.M.Shalinie,
Associate Professor and Head,
Department of Computer Science and Engineering,
Thiagarajar College of Engineering, Madurai 625 015
Introduction
• The advent of Web 2.0 has made organizations move towards Cloud computing models with IT as a Service
• Explosive growth of audio, video and user-generated content clearly implies that maintaining data center hardware infrastructure is the biggest challenge
• Major concerns related to huge data are:
  - Security
  - Storage Management
  - Data Reduction Techniques
  - Data Archiving
Impact of Data Growth
According to a recent Gartner survey report:
• 47% of enterprises identified 'data growth' as their top challenge; the next two challenges were 'system performance and scalability' (37%) and 'network congestion and connectivity' (36%)
• This is because data growth is particularly associated with increased costs for hardware, software, and the associated maintenance, administration and services
Source: http://www.gartner.com/it/page.jsp?id=1460213
Traditional Datacenters
• High performance and high degree of control
• Building scalable and reliable storage requires an experienced, skillful engineering team
• Upfront and maintenance costs are high, so using resources efficiently is a key factor in saving cost
• Consumes heavy Internet bandwidth
• Additional Internet connections and equipment are needed for redundancy or load balancing
• By Moore's law, hardware price per gigabyte is dropping every day; if a company deploys too much storage equipment without fully utilizing it, the equipment is wasted
Application categories
Video Surveillance
- To store outdated video clips
Huge data store
- ERP, Industry and Consumer statistics
Backup and archiving
- Server/Desktop offsite backup
Content Distribution
- Static content to save bandwidth
Versioning
- Reduce storage cost
File sharing
Cloud based services

Feature                            | Google Apps | Amazon | Dropbox | Other cloud based services
Basic upload/download/delete files | Yes         | Yes    | Yes     | Yes
Edit File                          | Yes         | No     | No      | No
On-demand and scalable             | Yes         | Yes    | Yes     | No
Online Management                  | Yes         | Yes    | Yes     | Yes
Data Encryption                    | No          | No     | Yes     | Yes
Support Folder/File ACL            | Yes         | Yes    | Yes     | Yes
Program Library                    | Plenty      | Plenty | Plenty  | Basic
Amazon's S3
Object-store
– URL PUT and GET
– Simple usage
Proprietary and Unique
– Coding required
Variable performance
Infinitely scalable
Provision for archives
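As a rough illustration of this PUT/GET object interface, here is a minimal sketch using the boto3 SDK; the bucket name, object key and payload are hypothetical placeholders, and AWS credentials are assumed to be configured.

```python
# Minimal sketch of S3-style object storage: PUT and GET an object by key.
import boto3

s3 = boto3.client("s3")

# PUT: upload an object under a key
s3.put_object(Bucket="example-bucket",
              Key="videos/clip-001.dat",
              Body=b"raw bytes of the archived clip")

# GET: retrieve the same object back by key
obj = s3.get_object(Bucket="example-bucket", Key="videos/clip-001.dat")
data = obj["Body"].read()
print(len(data), "bytes retrieved")
```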
Data at Rest
• Maintain Integrity
  - Accuracy and consistency of data
• Confidentiality
  - Ensuring privacy of data
  - Ensuring data access only by authorized users
• Information Assurance
  - Measures to ensure availability
• Information Security
  - Protecting data from unauthorised access, use, disclosure, disruption and modification [2]
Security methods
• Encrypting data at rest
  - Maintains confidentiality of data
  - Security trade-off against processing time
  - Data size overhead mitigated by compressing before encryption
• Securing the keys
  - Encryption using 128/192/256-bit keys
• Compressing the data
  - Process of encoding information with fewer bits
• Deduplication
  - Specialized data compression technique that eliminates coarse-grained redundant data to improve storage utilization
Parallelizing Encryption Process
• Encryption consumes substantial resources and time
• Making full use of the available resources makes the encryption process effective
• Hadoop's MapReduce provides a large-scale parallel data processing framework for high-end computing applications
• A suitable algorithm is required to perform the encryption process
AES
• To meet a better security standard, the US government agency NIST selected the Rijndael algorithm as the Advanced Encryption Standard (AES), which is accepted as an industry standard [5]
• AES is designed to accept three different key sizes: 128, 192 or 256 bits. The algorithm is capable of encrypting bulk data on top-end 32-bit and 64-bit CPUs.
• It is efficient in encrypting all sorts of data deployed in the cloud, from text to audio and video
• The performance of the AES algorithm varies dramatically across CPUs depending on key size, and it can be improved remarkably when it is parallelized
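A small sketch of selecting among the three key sizes, using the Python `cryptography` package; the CBC mode and the random key/IV here are illustrative choices only, not the modes discussed on the next slides.

```python
# Sketch: constructing AES ciphers with 128-, 192- and 256-bit keys.
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

plaintext = b"16-byte block!!!"               # exactly one 128-bit block
for key_bits in (128, 192, 256):
    key = os.urandom(key_bits // 8)           # AES accepts 128/192/256-bit keys
    iv = os.urandom(16)
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    ciphertext = enc.update(plaintext) + enc.finalize()
    print(key_bits, ciphertext.hex())
```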
Key Generation
• The key should be kept secret and must be completely random
• It should not follow a particular pattern
• The key has to be generated such that the user has control over the data
• The key should be strong enough that it is not vulnerable to attacks (like brute force)
Key Management
• Generation of a unique key per user
[Figure: the user logs in with a user name (e.g. "hadoop") and password; once validated as a genuine user, the data upload proceeds along with a separate file password.]

Overall Process
[Figure: the user name and file password are hashed with SHA1 (e.g. 1ff360f124b6e2453597010ea589ee6871681840), and the digest is passed through DES to derive a 128-bit key (e.g. so5y/8WBOZlSg4d8).]
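A minimal sketch of this per-user key derivation, assuming the user name and file password are concatenated, hashed with SHA1, and the result reduced to 128 bits; the truncation here stands in for the DES step shown in the figure and is an assumption, not the talk's exact procedure.

```python
# Sketch: derive a per-user 128-bit key from user name + file password.
# SHA1 follows the slides; truncating the digest to 16 bytes is an assumption.
import hashlib

def derive_key(username: str, file_password: str) -> bytes:
    digest = hashlib.sha1((username + file_password).encode("utf-8")).digest()
    return digest[:16]   # 128-bit key suitable for AES

key = derive_key("hadoop", "secret-file-password")
print(key.hex(), len(key) * 8, "bits")
```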
Encryption modes
• To enhance the effect of the cryptographic algorithm
• To adapt the algorithm for a particular application
• For parallelisation, the mode should support encryption of subsequent blocks independently of each other

ELECTRONIC CODE BOOK (ECB) MODE
- Plaintext handled one block at a time
- Each block encrypted using the same key

XEX-TCB-CTS (XTS) MODE
- Each block encrypted using 2 different keys
- Tweak key varies based on the position of the block
- Handles the last incomplete block of plaintext [1]
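Below is a minimal AES-XTS sketch in which the tweak is derived from the block's position, so each block can be encrypted independently of the others; it uses the Python `cryptography` package, and the 512-bit key, 4 KB block size and little-endian tweak encoding are illustrative assumptions.

```python
# Sketch: AES-XTS encryption of fixed-size data blocks, one tweak per block.
# The tweak encodes the block position, so blocks can be processed in parallel.
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

XTS_KEY = os.urandom(64)        # XTS uses a double-length key (2 x AES-256)
BLOCK_SIZE = 4096               # data unit size; a design choice

def encrypt_block(block_id: int, data: bytes) -> bytes:
    tweak = block_id.to_bytes(16, "little")          # position-dependent tweak
    enc = Cipher(algorithms.AES(XTS_KEY), modes.XTS(tweak)).encryptor()
    return enc.update(data) + enc.finalize()

ciphertext = encrypt_block(7, os.urandom(BLOCK_SIZE))
print(len(ciphertext), "encrypted bytes for block 7")
```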
Parallelizing using Hadoop MR
Data is stored as contiguous blocks, and every block is represented by a unique block_id (I0, I1, I2, ..., In-1)
A MapReduce job includes a set of mappers (M1, M2, ..., Mr) and reducers (R1, R2, ..., Rr)
The input is given to a mapper in the form <block_id, object>
The object is the data stored in the corresponding block id [3][4]
MapReduce Paradigm
• Execution of Mapper
  - Block <Ir-1, object> is given to mapper Mr
  - The mapper generates the corresponding output I'r and sends it to the reducer Rr
  Let W = {<I0, object1>, <I1, object2>, <I2, object3>, ..., <In-1, objectn>}
  Then I'r is produced for each Ir-1 ∈ W:
  <block_id, object> → Mr → <block_id, Enc_comp object>
• Execution of Reducer
  - The collected outputs from the various mappers are written to disk in sequential order (I'1, I'2, ..., I'n)
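A rough Hadoop Streaming-style sketch of this mapper: each input record is a <block_id, object> pair and the mapper emits <block_id, Enc_comp object>; the framework's sort (or an identity reducer) then restores the sequential block order. The tab-separated base64 record format and the hard-coded demo key are assumptions, not the talk's exact implementation.

```python
#!/usr/bin/env python3
# Sketch of a Hadoop Streaming mapper: reads "<block_id>\t<base64 object>"
# records from stdin and emits "<block_id>\t<base64 of bzip2-compressed,
# AES-XTS-encrypted object>".  Blocks are assumed large (HDFS-sized), so the
# compressed payload is always at least one AES block.
import base64
import bz2
import sys
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

XTS_KEY = bytes(range(64))   # demo-only 512-bit key; in practice the per-user
                             # derived key would be distributed via the job config

def enc_comp(block_id: int, data: bytes) -> bytes:
    tweak = block_id.to_bytes(16, "little")            # tweak from block position
    enc = Cipher(algorithms.AES(XTS_KEY), modes.XTS(tweak)).encryptor()
    return enc.update(bz2.compress(data)) + enc.finalize()

for line in sys.stdin:
    block_id, payload = line.rstrip("\n").split("\t", 1)
    out = enc_comp(int(block_id), base64.b64decode(payload))
    print(f"{block_id}\t{base64.b64encode(out).decode('ascii')}")
```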
Encryption using MapReduce
[Figure: plaintext from the web server is split across mappers Map 1 ... Map N, each applying AES+XTS; a reducer collects the encrypted output, which is offered as Storage as a Service on the HDFS cluster (name node plus Racks 1-3).]
Performance of the Algorithm
[Figure: time (mins) versus data size (GB) for (i) AES-ECB with mapper only, (ii) AES-XTS with reducer, and (iii) AES-XTS with mapper only.]
Deduplication
• Technique to improve storage utilization by eliminating coarse-grained redundant data
• The process involves deleting duplicates and leaving only one copy of the data
• The unique copy of the data is referenced using a symbolic link (a sketch follows the figure below)
• By default, Hadoop does not support data deduplication
[Figure: two users store File1 <abcd>, File2 <Wxyz> and File3 <abcd> in HDFS; because File3 duplicates File1's content, only one physical copy is kept and File3 becomes a symbolic link to File1.]
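As a rough illustration of this idea, here is a minimal sketch that keeps one physical copy per content hash on a local filesystem and replaces duplicates with symbolic links; the store directory and the use of SHA-256 as the content fingerprint are illustrative assumptions, and an external layer like this would be needed since Hadoop has no native deduplication.

```python
# Sketch: keep one physical copy per unique content hash; later files with
# the same content become symbolic links to that copy.
import hashlib
import os

STORE = "dedup_store"          # where unique copies live (illustrative)
os.makedirs(STORE, exist_ok=True)

def store_file(path: str) -> str:
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    target = os.path.join(STORE, digest)
    if not os.path.exists(target):
        os.replace(path, target)          # first copy: move it into the store
    else:
        os.remove(path)                   # duplicate: drop the extra copy
    os.symlink(os.path.abspath(target), path)  # user still sees the original name
    return target
```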
Compression
• To exploit statistical redundancy in the data and represent it using fewer bits
• Among many algorithms, analysis showed that bzip2 gives a better compression ratio for text files
• MapReduce can be used to compress a set of large text files in an efficient manner
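For reference, a minimal bzip2 sketch using Python's standard bz2 module; the file name is a placeholder.

```python
# Sketch: bzip2-compress a text file and report the compression ratio.
import bz2
import os

src = "large_text_file.txt"        # placeholder input
dst = src + ".bz2"
with open(src, "rb") as fin, bz2.open(dst, "wb") as fout:
    fout.write(fin.read())
print("compression ratio: %.2f : 1" % (os.path.getsize(src) / os.path.getsize(dst)))
```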
Deduplication and Compression Using
MapReduce
[Figure: plain text, audio and video data from User1, User2 and User3 flow into a MapReduce job performing deduplication and compression; the compressed output then flows into a MapReduce job performing encryption.]
Text Data Results
[Figure: time (mins) versus data size (GB) for AES+XTS encryption of text data with and without compression, plus compression ratio versus data size (MB).]

Image Data Results
[Figure: time (mins) and compression ratio versus data size (GB) for image data.]
Inference
• Encrypting data at rest is the ideal option for maintaining the integrity and confidentiality of user data
• Encryption using AES-XTS gives better performance
• Compression results show that storage requirements were reduced by a ratio of 1:10 for text data and 1:2 for image data
Future Enhancement
• Deduplication by classification of similar images based on fuzzy matching techniques
• Integrate a bucket system so that objects are stored in buckets securely and efficiently
• Validate the results using standard data sets such as the Enron corpus
Our Contribution
• Provisioning Hadoop in a Cloud data store through AES encryption
• Deduplication and compression using MapReduce
• Providing integrity and confidentiality of data, thereby assisting business applications
• The 'Secure Storage as a Service' methodology is well suited for Cloud based services
Summary
• AES supports encrypting large-scale data effectively
• The MapReduce model is suitable for running the encryption process in parallel
• Storage space can be managed efficiently by applying compression before encryption
• Experimental results show that compression followed by encryption using MapReduce is well suited for securing data at rest in the cloud
Other projects
1.TCE MR Simulator
- To reduce the execution time of Map Reduce jobs
- To design a scheduler with pre-emption support
- To address the HDFS scalability issue
- To index larger files before searching
2.Securing Hadoop Environment
- To develop a bucket management system
- To maintain the integrity of data between nodes during MapReduce
process
3. Parallelization of Machine Learning Algorithms
- To generate frequent item sets using MapReduce for large datasets
References
[1] M. Dworkin, "Recommendation for Block Cipher Modes of Operation: The XTS-AES Mode for Confidentiality on Storage Devices", NIST Special Publication 800-38E, US National Institute of Standards and Technology, 2010.
[2] Lori M. Kaufman, "Data Security in the World of Cloud Computing", IEEE Security and Privacy, Vol. 2, pp. 61-64, 2010.
[3] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Communications of the ACM, Vol. 51, No. 1, 2008.
[4] http://hadoop.apache.org
[5] Bruce Schneier and Doug Whiting, "A Performance Comparison of the Five AES Finalists", Second AES Candidate Conference, 2000.