HW09: Terapot - Email Archiving with Hadoop
TRANSCRIPT
Next Revolution Toward Open Platform
Terapot: Massive Email Archiving with Hadoop & Friends
Jaesun Han, Founder & CEO of NexR ([email protected])
- Commercial Hadoop Application
#2 About NexR
Offering Hadoop & Cloud Computing Platform and Services
- icube-cc (Compute) & icube-sc (Storage): Cloud Computing Platform (compatible with Amazon AWS)
- Hadoop Provisioning & Management: Massive Data Storage & Processing Platform
- Hadoop & Cloud Computing Services: Academic Support Program, Massive Email Archiving, MapReduce Workflow
#3 What is Email Archiving?
The Objectives of Email Archiving
- Regulatory compliance
- e-Discovery: litigation and legal discovery
- E-mail backup and disaster recovery
- Messaging system & storage optimization
- Monitoring of internal and external e-mail content
#4 The Architecture of Email Archiving
An email archiving server sits between the email servers and the archival storage (email data and indexes), with three layers:
- Data Acquisition: journaling, mailbox crawling
- Data Processing: indexing, filtering
- Data Access: search, discovery
Employees, administrators, and auditors access the archive through search and discovery.
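The three layers above can be sketched in a few lines of Python. This is a toy model only; the `Email`/`Archive` types and function names are illustrative, not part of any real archiving product.

```python
# Toy sketch of the three archiving layers: acquisition, processing, access.
from dataclasses import dataclass, field

@dataclass
class Email:
    user: str
    subject: str
    body: str

@dataclass
class Archive:
    emails: list = field(default_factory=list)   # archival storage (email data)
    index: dict = field(default_factory=dict)    # term -> set of email ids

def acquire(archive, email):
    """Data acquisition: journaling/crawling appends to archival storage."""
    archive.emails.append(email)
    return len(archive.emails) - 1               # position doubles as email id

def process(archive, email_id):
    """Data processing: index every term of the archived email."""
    email = archive.emails[email_id]
    for term in (email.subject + " " + email.body).lower().split():
        archive.index.setdefault(term, set()).add(email_id)

def search(archive, term):
    """Data access: look up email ids by term for search/discovery."""
    return sorted(archive.index.get(term.lower(), set()))

arc = Archive()
eid = acquire(arc, Email("alice", "Q3 report", "numbers attached"))
process(arc, eid)
print(search(arc, "report"))   # [0]
```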
#5 The Challenges of Email Archiving
Explosive growth of digital data
- 6x growth from 2006 to 2010 (988 EB)
- 95% (939 EB) unstructured data, including email
- Increasing cost and complexity of archiving
=> Requires scalable & low-cost archiving
Reinforcement of data retention regulation
- Retention, disposal, e-Discovery, security
- HIPAA (healthcare) 21~23 yrs, SEC 17a (trading) 6 yrs, OSHA (toxic substances) 30 yrs, SOX (finance) 5 yrs, J-SOX, K-SOX
=> Requires scalable archiving & fast discovery
Needs for intelligent data management
- Knowledge management from email data
- Filtering, monitoring, data mining, etc.
=> Requires integration with intelligent systems
#6 New Requirements of Email Archiving
- High Scalability
- Low Cost
- High Performance
- Intelligence
#7 Terapot: When Hadoop Met Email Archiving…
Email servers feed a journaling server and distributed crawling; archived mail lands in Hadoop HDFS, MapReduce jobs handle crawling and indexing, and distributed search & discovery runs on top.
Scale-out architecture with Hadoop
- Hadoop HDFS for archiving email data
- Hadoop MapReduce for crawling & indexing
- Apache Lucene for search & discovery
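A minimal sketch of the MapReduce indexing idea in plain Python, assuming a toy map/shuffle/reduce: mappers emit (user, (term, email_id)) pairs, the shuffle groups them by user, and reducers build one inverted index per user account. The function names are illustrative, not Terapot's actual API.

```python
# Toy MapReduce: build a per-user inverted index from a batch of emails.
from collections import defaultdict

def map_email(email_id, email):
    """Map phase: emit (user, (term, email_id)) for every term."""
    user = email["user"]
    for term in email["body"].lower().split():
        yield user, (term, email_id)

def reduce_user(user, pairs):
    """Reduce phase: one inverted index (term -> email ids) per user."""
    index = defaultdict(set)
    for term, email_id in pairs:
        index[term].add(email_id)
    return user, dict(index)

def run_job(emails):
    shuffled = defaultdict(list)                 # shuffle: group by user
    for email_id, email in enumerate(emails):
        for user, pair in map_email(email_id, email):
            shuffled[user].append(pair)
    return dict(reduce_user(u, ps) for u, ps in shuffled.items())

emails = [
    {"user": "alice", "body": "quarterly audit results"},
    {"user": "bob",   "body": "audit schedule"},
]
indexes = run_job(emails)
print(sorted(indexes["alice"]["audit"]))   # [0]
```

In the real system each reducer's output would be a Lucene index file in HDFS rather than an in-memory dict.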
#8 Features of Terapot
Distributed Massive Email Archiving
- High scalability via a shared-nothing architecture: thousands of servers, billions of emails
- Low cost via inexpensive hardware: entry servers under $5,000
- High performance via parallelism: search within 1-2 seconds per user account, fast discovery in parallel with MapReduce
- Intelligence via data mining: contact network analysis, content analysis, statistics
- Supports both an on-premise version and a cloud (hosted) version
- Developed with various open source software
#9 The Architecture of Terapot
Built on Hadoop MapReduce, Lucene, and Hive, Terapot has 4 key components behind a frontend that serves Terapot clients over SOAP, REST, and JSON:
- Batch processing: crawling, indexing, and merging, driven by an MR workflow manager
- Real-time indexing: fed directly by the mail server
- Searching: a search gateway over the index files
- Analysis: ETL, analyzer, and mining
Email sources: POP3 server, HTTP/FTP/SFTP server, mail server, NAS/NFS. Storage: local file system for indexes, HDFS for email.
#10 Batch Processing Component
- Crawling (MR): pulls from the email sources according to the archiving policies and configured period, writing an archive file per user (sequence file) to HDFS; each crawling run produces several archive files
- Indexing (MR): builds a temporary index file per user (Lucene index file)
- Merging: produces a merged index file (for backing up) and index shards (shard 0, shard 1, ...; 3-copy replication) on the local file system for search
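The batch flow can be sketched under stated assumptions: crawling groups mail into one archive per user (standing in for the per-user sequence files), indexing builds a temporary index per archive, and merging unions temporary indexes into the backed-up merged index. All names here are illustrative.

```python
# Sketch of the batch pipeline: crawl -> per-user archive -> index -> merge.
from collections import defaultdict

def crawl(raw_emails):
    """One archive per user, mirroring the per-user sequence files in HDFS."""
    archives = defaultdict(list)
    for email in raw_emails:
        archives[email["user"]].append(email["body"])
    return archives

def index_archive(bodies, base_id=0):
    """Temporary per-user index: term -> set of email ids."""
    index = defaultdict(set)
    for offset, body in enumerate(bodies):
        for term in body.lower().split():
            index[term].add(base_id + offset)
    return index

def merge(merged_index, new_index):
    """Merge a temporary index into the merged index kept for backup."""
    for term, ids in new_index.items():
        merged_index.setdefault(term, set()).update(ids)
    return merged_index

archives = crawl([{"user": "alice", "body": "budget draft"},
                  {"user": "alice", "body": "final budget"}])
merged = merge({}, index_archive(archives["alice"]))
print(sorted(merged["budget"]))   # [0, 1]
```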
#11 Real-Time Indexing Component
The journaling server forwards incoming email to the real-time indexing node, which holds a real-time index in memory alongside a database. Buffered email is periodically flushed to the batch processing component, whose crawling, indexing, and archiving steps place the archive and index into HDFS.
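The real-time path above can be sketched as a small class: journaled mail is indexed in memory so it is searchable immediately, then flushed to the batch side once the buffer crosses a threshold. The class, threshold, and hand-off are illustrative assumptions, not Terapot internals.

```python
# Sketch of real-time indexing with a flush to the batch component.
class RealTimeIndexer:
    def __init__(self, flush_limit=2):
        self.buffer = []          # journaled emails awaiting batch archiving
        self.index = {}           # in-memory real-time index: term -> ids
        self.flushed = []         # stands in for the batch/HDFS side
        self.flush_limit = flush_limit

    def forward(self, email_id, body):
        """Journaling server forwards one email; index it right away."""
        self.buffer.append((email_id, body))
        for term in body.lower().split():
            self.index.setdefault(term, set()).add(email_id)
        if len(self.buffer) >= self.flush_limit:
            self.flush()

    def flush(self):
        """Hand buffered mail to batch processing and reset memory."""
        self.flushed.extend(self.buffer)
        self.buffer.clear()
        self.index.clear()        # batch indexing now owns these emails

rt = RealTimeIndexer(flush_limit=2)
rt.forward(0, "urgent contract review")
print(sorted(rt.index["contract"]))      # [0], searchable before any flush
rt.forward(1, "signed contract")
print(len(rt.flushed), len(rt.buffer))   # 2 0
```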
#12 Search & Discovery Component
- ZooKeeper coordinates the search cluster: assigning shards, updating shard status, and locating index shards
- Search nodes copy index shards from HDFS to the local file system
- The search gateway fans distributed search out across the search nodes and the real-time indexing nodes
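The gateway's scatter-gather pattern can be sketched as follows, with index shards reduced to plain dicts and ZooKeeper's shard-location role reduced to a static list; both simplifications are assumptions for illustration.

```python
# Scatter-gather sketch of distributed search across index shards.
def search_shard(shard, term):
    """One search node answering for its local shard."""
    return shard.get(term, set())

def gateway_search(shards, realtime_index, term):
    """The search gateway: scatter the query, gather and merge the hits."""
    hits = set()
    for shard in shards:                        # scatter to search nodes
        hits |= search_shard(shard, term)
    hits |= search_shard(realtime_index, term)  # include not-yet-flushed mail
    return sorted(hits)                         # merged result set

shard0 = {"audit": {0, 3}}
shard1 = {"audit": {7}}
realtime = {"audit": {9}}
print(gateway_search([shard0, shard1], realtime, "audit"))  # [0, 3, 7, 9]
```

Including the real-time index in the gather step is what lets just-journaled mail show up in search results before batch indexing runs.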
#13 Data Analysis Component
- ETL (MR): Extract-Transform-Load from the email archive files in HDFS into a Hive table
- Hive queries run as MapReduce jobs; the mining engine writes analysis results to a database
- The web analyzer/reporter generates reports: personal contact network analysis, domain statistics
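The analysis path can be sketched in Python under stated assumptions: an ETL step flattens archived emails into rows (as the MR job loads a Hive table), and a GROUP-BY-style aggregation counts sender-to-recipient edges for the contact network report. Field names and functions are illustrative.

```python
# Sketch of ETL plus a contact-network aggregation over the rows.
from collections import Counter

def etl(archived_emails):
    """Extract-transform-load: one (sender, recipient) row per pair."""
    rows = []
    for email in archived_emails:
        for recipient in email["to"]:
            rows.append((email["from"], recipient))
    return rows

def contact_network(rows):
    """Edge weights, like a GROUP BY sender, recipient over the table."""
    return Counter(rows)

rows = etl([{"from": "alice", "to": ["bob", "carol"]},
            {"from": "alice", "to": ["bob"]}])
print(contact_network(rows)[("alice", "bob")])   # 2
```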
#14 Installation & Quantitative Analysis
Installation: 2 master nodes + 10 worker nodes (datanode, tasktracker, searcher, etc.)
Per-node hardware:
  Part    Description                          Qty
  CPU     Intel Xeon Nehalem E5504 2.0GHz      2 (8 cores)
  Memory  DDR3 2GB PC3-10600 Registered DIMM   9 (18GB)
  HDD     1TB 7200 RPM SATA2                   4 (4TB)
Quantitative Analysis
Assumptions
- 1000 employees
- 16 emails per day per person
- 215 KB average email size (content 142 KB + attachment 73 KB)
- 1.25 GB per year per employee
Storage
- index size: about 80% of the email data
- compression ratio: about 50%
Disk volume required for 1 year
- email archive (HDFS): 1881 GB
- indexes (HDFS + local): 4559 GB
- total: about 6.4 TB per year
40 TB may cover 6 years of archiving
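The sizing arithmetic can be checked directly. The 3x factor below is HDFS's default replication, an assumption on our part, but applying it with the stated 50% compression lands very close to the slide's archive figure.

```python
# Reproduce the slide's storage sizing (decimal GB, values from the slide).
EMPLOYEES  = 1000
EMAILS_DAY = 16
EMAIL_KB   = 215             # 142 KB content + 73 KB attachment

per_user_gb = EMAILS_DAY * EMAIL_KB * 365 / 1_000_000
print(round(per_user_gb, 2))  # 1.26, close to the slide's 1.25 GB/employee/year

raw_gb     = per_user_gb * EMPLOYEES
archive_gb = raw_gb * 0.5 * 3  # 50% compression, assumed 3x HDFS replication
print(round(archive_gb))       # 1883, close to the slide's 1881 GB
```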
#15 Demonstration
Hadoop & Cloud Computing Company
www.nexrcorp.com
For more information
- www.nexrcorp.com
- www.terapot.com
- [email protected]
- @jaesun_han