introduction to apache accumulo
Post on 27-Jan-2015
142 Views
Preview:
DESCRIPTION
TRANSCRIPT
Introduction to Apache Accumulo
Boulder/Denver BigData Meetup - March 21,2012Jared Winick@jaredwinick
Accumulo /əˈkjuːmjʊˌloʊ/
1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing
http://yourmotivational.com/uploads/8604.jpg
Annotation AddedJeff Dean: Designs, Lessons and Advice from Building Large Distributed Systemshttp://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
Enables interactive access to…
Trillions of recordspetabytes of indexed data
across 100s-1000s of servers
Short Accumulo History Lesson
http://www.flickr.com/photos/mr_t_in_dc/4249886990/sizes/l/in/photostream/
2006
2008
http://upload.wikimedia.org/wikipedia/commons/8/84/National_Security_Agency_headquarters%2C_Fort_Meade%2C_Maryland.jpg
2011
2012
Uses of BigTable and Kin
(BigTable)
•Google Analytics1
•Crawl1
•AppEngine Datastore2
•Many more1
(HBase)
•Messages3,4,6
•Insights5,6
(Accumulo)
•???
(Cassandra)
•Rainbird (realtime analytics)7
1.) http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf2.) http://code.google.com/appengine/articles/storage_breakdown.html3.) http://www.facebook.com/note.php?note_id=4549916089194.) http://mvdirona.com/jrh/TalksAndPapers/KannanMuthukkaruppan_StorageInfraBehindMessages.pdf5.) http://www.facebook.com/note.php?note_id=101501039002589206.) http://borthakur.com/ftp/SIGMODRealtimeHadoopPresentation.pdf7.) http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
Accumulo /əˈkjuːmjʊˌloʊ/
1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing
Multi-dimension Key
Key
Row ID
Column
Timestamp
Family Qualifier Visibility
Value
http://incubator.apache.org/accumulo/user_manual_1.4-incubating/Accumulo_Design.html
Keys Sorted Lexicographically
Row ID, Column Family, Column Qualifier, Column Visibility, Timestamp
Everything is a byte[] except the Timestamp which is a long
Physical Layout
Row ID Col Fam Col Qual Col Vis Time Value
Alice properties age public March 2011 31
Alice properties phone private Feb 2011 555-1234
Alice purchases Xbox public Feb 2011 $299
Bob properties phone private March 2011 555-4321
Bob purchases iPhone Public Feb 2011 $399
Key Value
Queries
•By exact Key or range of Keys•Data is always returned in sorted order
Query Requirements Drive Data Model Design
http://incubator.apache.org/accumulo/user_manual_1.4-incubating/Accumulo_Design.html
Accumulo
Hadoop HDFS Zookeeper
Storage Configuration/State
Hadoop MapReduce
Analytics
Clients
Read/Write
Accumulo
Hadoop HDFS
Tablet Server
Tablet Server
Tablet Server
Master
Data Node
Data Node
Data Node
Name Node
. . .
. . .
TableTablets
… … … … … …
Accumulo
Hadoop HDFS
Tablet Server
Tablet Server
Tablet Server
Master
Data Node
Data Node
Data Node
Name Node
. . .
. . .
TableTablets
Tablet Server Failure
1.) Detect Failure
2.) Reassign
Accumulo
Hadoop HDFS
Tablet Server
Writes
Client
Write-Ahead
Log (WAL)
1MemTable2
Data Node
Data Node
Data Node. . .
Tablet
Accumulo
Hadoop HDFS
Tablet Server
Writes
Client
Write-Ahead
Log (WAL)
1MemTable2
Data Node
Data Node
Data Node. . .
File 1
3
Tablet
Compactions
Minor Major
The process of flushing a MemTable of a Tablet to a single file in HDFS
The process of combining multiple files into a single file
• Tablets are split when they reach a max size• Always split on row boundary• Master assigns a split Tablet to another Tablet
server (no data is moved!)
Tablet Splits
Reads
AccumuloTablet Server
Client MemTable
File 1
Tablet
File 1
Merge
Accumulo /əˈkjuːmjʊˌloʊ/
1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing
http://wiki.eeng.dcu.ie/ee557/287-EE/version/default/part/ImageData/data/server-side_intro.gif
Iterators: Server-side programming
Iterators
Can be run at:•Scan Time•Minor Compaction•Major Compaction
Can do things like:•Aggregation (Combiners)•Age-Off•Filtering (access control)•Transformation
Push Processing to the Data
Accumulo /əˈkjuːmjʊˌloʊ/
1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing
• Every key-value has a visibility label• Label is defined with boolean operators• Label is arbitrary and ad-hoc
• Authorizations presented at scan time• Data is filtered out automatically by system-
level Iterator
Access Control
Public Private | Admin Finance | (HR & Manager)
Access Control – Typical Architecture
Web Server
Enterprise Identity
Management
Accumulo1.) Pass Credentials
2.) Lookup User
3.) Return Authorizations
4.) Proxy Authorization
Trusted Zone
5.) Return Visible Data6.) Return Data
Access Control – Typical Architecture
3.) Auths:[SECRET, UNCLASSIFIED,PROJECT X, PROJECT Y]
Web Server
Enterprise Identity
Management
Accumulo
1.) PKI Cert
2.) Lookup Bob
4.) Proxy Bob’s Auths
Trusted Zone
5.) Return [6,8]6.) Return [6,8]
Bob
SECRET&PROJECT X, 6SECRET&PROJECT Y, 8SECRET&PROJECT Z, 3
Demo
Application Requirements
Build an application to analyze trends in Twitter messages.
•Query for word/phrase and view real-time activity in a time series graph•View at different time ranges (1 day, 7 days, 30 days, etc)•Allow multiple query terms to compare activity (ex. Breakfast,Lunch)•Automatically extract daily trends for the user
Demo Setup/Data
• Twitter Streaming API• US country codes only messages• 1,2,3-grams built• Data since Dec 24 – Live• Running on average workstation, 1 SATA disk,
6 GB memory.• 72GB, 2.6 billion entries and counting
Data Model• Tweets table– Row ID: n-gram– Column Family: Date Granularity (DAY, HOUR)– Column Qual: Date Value– Value: Count– SummingCombiner (Iterator) used to update Count
Row ID Col Fam Col Qual Value
breakfast DAY 20120318 31
breakfast DAY 20120319 56
… … … …
lunch HOUR 2012031801 3
lunch HOUR 2012031802 4
Data Model• Trends table– Row ID: (Date Granularity + Date Value)– Column Family: (Integer.MAX_VALUE – trendScore)– Column Qual: n-gram– Value: []
Row ID Col Fam Col Qual Value
DAY:20120318 2147483145 church
DAY:20120318 2147483316 hangover
… … … …
DAY:20120319 2147476521 the broncos
DAY:20120319 2147477704 tim tebow
• Utilize MapReduce for building trends• AccumuloInputFormat reads from tweets
table• AccumuloOutputFormat writes to trends
table• AccumuloStorage LoadFunc for Pig
available on github
MapReduce Analytics
Summary
•Accumulo exploits locality to enable interactive access to huge data sets while adding cell-level access control and server-side programming
•Nothing in life is free. Accumulo comes with the complexity and responsibility of managing a distributed system and designing indexes on your data
References
http://incubator.apache.org/accumulo/
http://www.slideshare.net/cloudera/h-base-and-accumulo-todd-lipcom-jan-25-2012
• Documentation, Mailing Lists, Links
• HBase Shootout
• Trendulohttps://github.com/jaredwinick/trendulo
top related