HW09: Terapot - Email Archiving with Hadoop
TRANSCRIPT
Next Revolution Toward Open Platform
Terapot: Massive Email Archiving with Hadoop & Friends
Jaesun Han, Founder & CEO of NexR ([email protected])
- Commercial Hadoop Application
#2 About NexR
Offering Hadoop & Cloud Computing Platform and Services
- icube-cc (Compute) & icube-sc (Storage): Cloud Computing Platform (compatible with Amazon AWS)
- Hadoop Provisioning & Management: Massive Data Storage & Processing Platform
- Hadoop & Cloud Computing Services: Academic Support Program, Massive Email Archiving, MapReduce Workflow
#3 What is Email Archiving?
The Objectives of Email Archiving
- Regulatory compliance
- e-Discovery: litigation and legal discovery
- E-mail backup and disaster recovery
- Messaging system & storage optimization
- Monitoring of internal and external e-mail content
#4 The Architecture of Email Archiving
An email archiving server sits between the email servers and the archival storage (email data and indexes), with three layers:
- Data Acquisition: journaling, mailbox crawling
- Data Processing: indexing, filtering
- Data Access: search, discovery
Employees, administrators, and auditors access the archive through search and discovery.
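The three layers above can be sketched in a few lines of Python. This is a toy model only; the `Email`/`Archive` types and function names are illustrative, not part of any real archiving product.

```python
# Toy sketch of the three archiving layers: acquisition, processing, access.
from dataclasses import dataclass, field

@dataclass
class Email:
    user: str
    subject: str
    body: str

@dataclass
class Archive:
    emails: list = field(default_factory=list)   # archival storage (email data)
    index: dict = field(default_factory=dict)    # term -> set of email ids

def acquire(archive, email):
    """Data acquisition: journaling/crawling appends to archival storage."""
    archive.emails.append(email)
    return len(archive.emails) - 1               # position doubles as email id

def process(archive, email_id):
    """Data processing: index every term of the archived email."""
    email = archive.emails[email_id]
    for term in (email.subject + " " + email.body).lower().split():
        archive.index.setdefault(term, set()).add(email_id)

def search(archive, term):
    """Data access: look up email ids by term for search/discovery."""
    return sorted(archive.index.get(term.lower(), set()))

arc = Archive()
eid = acquire(arc, Email("alice", "Q3 report", "numbers attached"))
process(arc, eid)
print(search(arc, "report"))   # [0]
```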
#5 The Challenges of Email Archiving
Explosive growth of digital data
- 6x growth from 2006 to 2010 (988 EB)
- 95% (939 EB) unstructured data, including email
- Increasing cost and complexity of archiving
=> Requires scalable & low-cost archiving
Reinforcement of data retention regulation
- Retention, disposal, e-Discovery, security
- HIPAA (healthcare) 21~23 yrs, SEC 17a (trading) 6 yrs, OSHA (toxic substances) 30 yrs, SOX (finance) 5 yrs, J-SOX, K-SOX
=> Requires scalable archiving & fast discovery
Needs for intelligent data management
- Knowledge management from email data
- Filtering, monitoring, data mining, etc.
=> Requires integration with intelligent systems
#6 New Requirements of Email Archiving
- High Scalability
- Low Cost
- High Performance
- Intelligence
#7 Terapot: When Hadoop Met Email Archiving…
Email servers feed a journaling server and distributed crawling; archived mail lands in Hadoop HDFS, MapReduce jobs handle crawling and indexing, and distributed search & discovery runs on top.
Scale-out architecture with Hadoop
- Hadoop HDFS for archiving email data
- Hadoop MapReduce for crawling & indexing
- Apache Lucene for search & discovery
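A minimal sketch of the MapReduce indexing idea in plain Python, assuming a toy map/shuffle/reduce: mappers emit (user, (term, email_id)) pairs, the shuffle groups them by user, and reducers build one inverted index per user account. The function names are illustrative, not Terapot's actual API.

```python
# Toy MapReduce: build a per-user inverted index from a batch of emails.
from collections import defaultdict

def map_email(email_id, email):
    """Map phase: emit (user, (term, email_id)) for every term."""
    user = email["user"]
    for term in email["body"].lower().split():
        yield user, (term, email_id)

def reduce_user(user, pairs):
    """Reduce phase: one inverted index (term -> email ids) per user."""
    index = defaultdict(set)
    for term, email_id in pairs:
        index[term].add(email_id)
    return user, dict(index)

def run_job(emails):
    shuffled = defaultdict(list)                 # shuffle: group by user
    for email_id, email in enumerate(emails):
        for user, pair in map_email(email_id, email):
            shuffled[user].append(pair)
    return dict(reduce_user(u, ps) for u, ps in shuffled.items())

emails = [
    {"user": "alice", "body": "quarterly audit results"},
    {"user": "bob",   "body": "audit schedule"},
]
indexes = run_job(emails)
print(sorted(indexes["alice"]["audit"]))   # [0]
```

In the real system each reducer's output would be a Lucene index file in HDFS rather than an in-memory dict.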
#8 Features of Terapot
Distributed Massive Email Archiving
- High scalability via a shared-nothing architecture: thousands of servers, billions of emails
- Low cost via inexpensive hardware: entry servers under $5,000
- High performance via parallelism: search within 1-2 seconds per user account, fast discovery in parallel with MapReduce
- Intelligence via data mining: contact network analysis, content analysis, statistics
- Supports both an on-premise version and a cloud (hosted) version
- Developed with various open source software
#9 The Architecture of Terapot
Built on Hadoop MapReduce, Lucene, and Hive, Terapot has 4 key components behind a frontend that serves Terapot clients over SOAP, REST, and JSON:
- Batch processing: crawling, indexing, and merging, driven by an MR workflow manager
- Real-time indexing: fed directly by the mail server
- Searching: a search gateway over the index files
- Analysis: ETL, analyzer, and mining
Email sources: POP3 server, HTTP/FTP/SFTP server, mail server, NAS/NFS. Storage: local file system for indexes, HDFS for email.
#10 Batch Processing Component
- Crawling (MR): pulls from the email sources according to the archiving policies and configured period, writing an archive file per user (sequence file) to HDFS; each crawling run produces several archive files
- Indexing (MR): builds a temporary index file per user (Lucene index file)
- Merging: produces a merged index file (for backing up) and index shards (shard 0, shard 1, ...; 3-copy replication) on the local file system for search
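The batch flow can be sketched under stated assumptions: crawling groups mail into one archive per user (standing in for the per-user sequence files), indexing builds a temporary index per archive, and merging unions temporary indexes into the backed-up merged index. All names here are illustrative.

```python
# Sketch of the batch pipeline: crawl -> per-user archive -> index -> merge.
from collections import defaultdict

def crawl(raw_emails):
    """One archive per user, mirroring the per-user sequence files in HDFS."""
    archives = defaultdict(list)
    for email in raw_emails:
        archives[email["user"]].append(email["body"])
    return archives

def index_archive(bodies, base_id=0):
    """Temporary per-user index: term -> set of email ids."""
    index = defaultdict(set)
    for offset, body in enumerate(bodies):
        for term in body.lower().split():
            index[term].add(base_id + offset)
    return index

def merge(merged_index, new_index):
    """Merge a temporary index into the merged index kept for backup."""
    for term, ids in new_index.items():
        merged_index.setdefault(term, set()).update(ids)
    return merged_index

archives = crawl([{"user": "alice", "body": "budget draft"},
                  {"user": "alice", "body": "final budget"}])
merged = merge({}, index_archive(archives["alice"]))
print(sorted(merged["budget"]))   # [0, 1]
```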
#11 Real-Time Indexing Component
The journaling server forwards incoming email to the real-time indexing node, which holds a real-time index in memory alongside a database. Buffered email is periodically flushed to the batch processing component, whose crawling, indexing, and archiving steps place the archive and index into HDFS.
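The real-time path above can be sketched as a small class: journaled mail is indexed in memory so it is searchable immediately, then flushed to the batch side once the buffer crosses a threshold. The class, threshold, and hand-off are illustrative assumptions, not Terapot internals.

```python
# Sketch of real-time indexing with a flush to the batch component.
class RealTimeIndexer:
    def __init__(self, flush_limit=2):
        self.buffer = []          # journaled emails awaiting batch archiving
        self.index = {}           # in-memory real-time index: term -> ids
        self.flushed = []         # stands in for the batch/HDFS side
        self.flush_limit = flush_limit

    def forward(self, email_id, body):
        """Journaling server forwards one email; index it right away."""
        self.buffer.append((email_id, body))
        for term in body.lower().split():
            self.index.setdefault(term, set()).add(email_id)
        if len(self.buffer) >= self.flush_limit:
            self.flush()

    def flush(self):
        """Hand buffered mail to batch processing and reset memory."""
        self.flushed.extend(self.buffer)
        self.buffer.clear()
        self.index.clear()        # batch indexing now owns these emails

rt = RealTimeIndexer(flush_limit=2)
rt.forward(0, "urgent contract review")
print(sorted(rt.index["contract"]))      # [0], searchable before any flush
rt.forward(1, "signed contract")
print(len(rt.flushed), len(rt.buffer))   # 2 0
```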
#12 Search & Discovery Component
- ZooKeeper coordinates the search cluster: assigning shards, updating shard status, and locating index shards
- Search nodes copy index shards from HDFS to the local file system
- The search gateway fans distributed search out across the search nodes and the real-time indexing nodes
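The gateway's scatter-gather pattern can be sketched as follows, with index shards reduced to plain dicts and ZooKeeper's shard-location role reduced to a static list; both simplifications are assumptions for illustration.

```python
# Scatter-gather sketch of distributed search across index shards.
def search_shard(shard, term):
    """One search node answering for its local shard."""
    return shard.get(term, set())

def gateway_search(shards, realtime_index, term):
    """The search gateway: scatter the query, gather and merge the hits."""
    hits = set()
    for shard in shards:                        # scatter to search nodes
        hits |= search_shard(shard, term)
    hits |= search_shard(realtime_index, term)  # include not-yet-flushed mail
    return sorted(hits)                         # merged result set

shard0 = {"audit": {0, 3}}
shard1 = {"audit": {7}}
realtime = {"audit": {9}}
print(gateway_search([shard0, shard1], realtime, "audit"))  # [0, 3, 7, 9]
```

Including the real-time index in the gather step is what lets just-journaled mail show up in search results before batch indexing runs.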
#13 Data Analysis Component
- ETL (MR): Extract-Transform-Load from the email archive files in HDFS into a Hive table
- Hive queries run as MapReduce jobs; the mining engine writes analysis results to a database
- The web analyzer/reporter generates reports: personal contact network analysis, domain statistics
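The analysis path can be sketched in Python under stated assumptions: an ETL step flattens archived emails into rows (as the MR job loads a Hive table), and a GROUP-BY-style aggregation counts sender-to-recipient edges for the contact network report. Field names and functions are illustrative.

```python
# Sketch of ETL plus a contact-network aggregation over the rows.
from collections import Counter

def etl(archived_emails):
    """Extract-transform-load: one (sender, recipient) row per pair."""
    rows = []
    for email in archived_emails:
        for recipient in email["to"]:
            rows.append((email["from"], recipient))
    return rows

def contact_network(rows):
    """Edge weights, like a GROUP BY sender, recipient over the table."""
    return Counter(rows)

rows = etl([{"from": "alice", "to": ["bob", "carol"]},
            {"from": "alice", "to": ["bob"]}])
print(contact_network(rows)[("alice", "bob")])   # 2
```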
#14 Installation & Quantitative Analysis
Installation: 2 master nodes + 10 worker nodes (datanode, tasktracker, searcher, etc.)
Per-node hardware:
  Part    Description                          Qty
  CPU     Intel Xeon Nehalem E5504 2.0GHz      2 (8 cores)
  Memory  DDR3 2GB PC3-10600 Registered DIMM   9 (18GB)
  HDD     1TB 7200 RPM SATA2                   4 (4TB)
Quantitative Analysis
Assumptions
- 1000 employees
- 16 emails per day per person
- 215 KB average email size (content 142 KB + attachment 73 KB)
- 1.25 GB per year per employee
Storage
- index size: about 80% of the email data
- compression ratio: about 50%
Disk volume required for 1 year
- email archive (HDFS): 1881 GB
- indexes (HDFS + local): 4559 GB
- total: about 6.4 TB per year
40 TB may cover 6 years of archiving
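The sizing arithmetic can be checked directly. The 3x factor below is HDFS's default replication, an assumption on our part, but applying it with the stated 50% compression lands very close to the slide's archive figure.

```python
# Reproduce the slide's storage sizing (decimal GB, values from the slide).
EMPLOYEES  = 1000
EMAILS_DAY = 16
EMAIL_KB   = 215             # 142 KB content + 73 KB attachment

per_user_gb = EMAILS_DAY * EMAIL_KB * 365 / 1_000_000
print(round(per_user_gb, 2))  # 1.26, close to the slide's 1.25 GB/employee/year

raw_gb     = per_user_gb * EMPLOYEES
archive_gb = raw_gb * 0.5 * 3  # 50% compression, assumed 3x HDFS replication
print(round(archive_gb))       # 1883, close to the slide's 1881 GB
```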
#15 Demonstration
Hadoop & Cloud Computing Company
www.nexrcorp.com
For more information
- www.nexrcorp.com
- www.terapot.com
- [email protected]
- @jaesun_han