Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data
DESCRIPTION
Apache Accumulo, originally developed by the National Security Agency and now an Apache Software Foundation project, builds upon Google's Bigtable design to provide a scalable, lightly-structured database capability complementing the ubiquitous Hadoop environment. The core capabilities of Accumulo include cell-level security, flexible schemas, real-time analytics, bulk I/O, and linear scalability beyond trillions of entries and petabytes of data. These new capabilities lead to techniques that unlock the power of Big Data, but don't fit into traditional database design patterns. Learn about the advantages of Apache Accumulo and how it fits into the Hadoop and NoSQL ecosystem. Presenter: Adam Fuchs, CTO, sqrrl
TRANSCRIPT
sqrrl data, INC.
Secure. Scale. Adapt.
[email protected] | @sqrrl_inc | 617.520.4375 sqrrl data, INC., All Rights Reserved
Adam Fuchs, Chief Technology Officer
Who We Are
sqrrl is the commercial provider of:
Mature Database Technology - Apache Accumulo
Fine-Grained Access Controls - Data Integration and Sharing
Proven Performance - Petabytes and Beyond
Advanced Analytics - Search, Statistics, and Graphs
Contents
Core Philosophy
Technology
Techniques
Application APIs
Apache Accumulo Perspective
Integration across:
Multiple business lines
Multiple data sets
Multiple applications
Multiple security, privacy, legal, policy, regulatory, and compliance constraints
New demands
[Figure: diagram with three “Application” elements and three “Data” elements]
Accumulo Design Drivers
1. Cell-Level Security
   Express common security requirements in the infrastructure, not just in the application
   Data-centric approach encourages secure sharing
2. Scalability
   Near linear performance improvements at thousands of nodes
   Durable and reliable under increased failures that come with scale
3. Diverse, Interactive Analytics
   Sorted key/value core performs well in a diverse set of domains
   Information retrieval, statistics, graph analysis, geo indexing, and more
4. Flexible, Adaptive Schema
   Start with universal structures and indexing
   Refine the schema over time
Accumulo Key Structure
An Accumulo key is a 5-tuple, consisting of:
Row: Controls Atomicity
Column Family: Controls Locality
Column Qualifier: Controls Uniqueness
Visibility Label: Controls Access
Timestamp: Controls Versioning

Accumulo Key/Value Example
Row       Col. Fam.     Col. Qual.     Visibility    Timestamp  Value
John Doe  Notes         PCP            PCP_JD        20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD     20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD   20120801   Pass
John Doe  Test Results  X-Ray          JD|PHYS_JD    20120513   1010110110100…
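
The 5-tuple key and its sort order can be sketched in a few lines of Python. This is only an illustration, not the Accumulo client API: the `Key` class and `sort_order` helper are names invented here, Python string comparison stands in for Accumulo's byte-wise comparison, and the extra 2011 cholesterol entry is hypothetical data added to show versioning. The one behavior it relies on is that Accumulo sorts keys ascending by row, column family, column qualifier, and visibility, with newer timestamps first.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Illustrative stand-in for an Accumulo key (not the real client class)."""
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int


def sort_order(key: Key):
    # Ascending on the first four fields; negate the timestamp so that the
    # newest version of a cell sorts first.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


table = {
    Key("John Doe", "Test Results", "X-Ray", "JD|PHYS_JD", 20120513): "1010110110100…",
    Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912): "183",
    Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912): "Patient suffers from an acute …",
    Key("John Doe", "Test Results", "Mental Health", "JD|PSYCH_JD", 20120801): "Pass",
    # Hypothetical older version of the cholesterol cell, added for illustration.
    Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20110101): "201",
}

ordered = sorted(table, key=sort_order)
for key in ordered:
    print(key, "->", table[key])
```

Because all entries for a row are adjacent in this order, and all entries for a column family within a row are adjacent too, a single range scan can fetch one patient's record or just that patient's test results.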
Visibility Syntax & Semantics
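
The slide body did not survive extraction, so the following is a hedged sketch rather than the presenter's material. It assumes the visibility syntax documented for Accumulo: terms combined with `&` (and), `|` (or), and parentheses, where mixing `&` and `|` at the same level requires parentheses. The function `is_visible` and its tokenizer are hypothetical helpers written for this example; they are not Accumulo classes, and quoted terms are not handled.

```python
import re

# A token is either a simple term or one of the operators / parentheses.
TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expr, authorizations):
    """Return True if the set of authorizations satisfies the label."""
    tokens = tokenize(expr)
    if not tokens:
        return True  # an empty label is visible to everyone
    pos = 0

    def term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = expression()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok in authorizations

    def expression():
        nonlocal pos
        value = term()
        op = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if op is None:
                op = tokens[pos]
            elif tokens[pos] != op:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            rhs = term()
            value = (value and rhs) if op == "&" else (value or rhs)
        return value

    result = expression()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


print(is_visible("JD|PCP_JD", {"PCP_JD"}))    # label taken from the example table
print(is_visible("JD|PSYCH_JD", {"PCP_JD"}))
```

With the labels from the key/value example, a reader holding only `PCP_JD` would see the notes and cholesterol entries but not the mental health or X-ray entries.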
Tablets
Collections of KV pairs form Tables
Tables are partitioned into Tablets
Metadata tablets hold info about other tablets, forming a 3-level hierarchy
A Tablet is a unit of work for a Tablet Server

[Figure: tablet hierarchy]
Well-Known Location (zookeeper)
  Root Tablet: -∞ to ∞
    Metadata Tablet 1: -∞ to “Encyclopedia:Ocelot”
    Metadata Tablet 2: “Encyclopedia:Ocelot” to ∞
      Table: Adam’s Table
        Data Tablet -∞ : thing
        Data Tablet thing : ∞
      Table: Encyclopedia
        Data Tablet -∞ : Ocelot
        Data Tablet Ocelot : Yak
        Data Tablet Yak : ∞
      Table: Foo
        Data Tablet -∞ to ∞
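
Finding the tablet that holds a row is a search over sorted split points. The sketch below covers only that last step for one table. It assumes Accumulo's convention that a tablet covers the range (previous end row, end row], and the function names are invented for this example. A real lookup would first go from the well-known location to the root tablet and then to a metadata tablet.

```python
from bisect import bisect_left


def tablet_index(row, split_points):
    """Index of the tablet whose range (previous split, split] contains row."""
    return bisect_left(split_points, row)


def tablet_range(index, split_points):
    """(start, end) of a tablet; None stands for -infinity or +infinity."""
    start = split_points[index - 1] if index > 0 else None
    end = split_points[index] if index < len(split_points) else None
    return (start, end)


# Split points of the Encyclopedia table from the figure.
encyclopedia_splits = ["Ocelot", "Yak"]

for row in ["Aardvark", "Ocelot", "Panda", "Zebra"]:
    index = tablet_index(row, encyclopedia_splits)
    print(row, "->", tablet_range(index, encyclopedia_splits))
```

A table with no split points, like Foo in the figure, has a single tablet covering -∞ to ∞.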
Accumulo Architecture
[Figure: architecture diagram. Components: Applications, Tablet Servers (each hosting Tablets), Master, Zookeeper, Hadoop. Arrow labels: Read/Write, Store/Replicate, Assign/Balance, Delegate Authority]
Tablet Data Flow
[Figure: data flow within a Tablet. Writes go to the In-Memory Map and the Write Ahead Log (for recovery). Minor Compaction writes the In-Memory Map out as a sorted, indexed file. Merging / Major Compaction combines sorted, indexed files. Reads (Scan) and compactions each pass through an Iterator Tree]
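
The data flow in the figure resembles a log-structured merge design, and a toy model may make it concrete. The `Tablet` class below is a simplified sketch with invented names, not Accumulo code. It resolves duplicate keys by preferring the newest source, whereas Accumulo itself uses timestamps and iterators for versioning, and it leaves out the iterator trees.

```python
import heapq


class Tablet:
    """Toy model of the write path: log, in-memory map, sorted files."""

    def __init__(self):
        self.wal = []      # write-ahead log, kept only for recovery
        self.memory = {}   # in-memory map of recent writes
        self.files = []    # sorted, immutable (key, value) lists, oldest first

    def write(self, key, value):
        self.wal.append((key, value))  # log first, then apply
        self.memory[key] = value

    def minor_compact(self):
        # Flush the in-memory map to a new sorted file.
        if self.memory:
            self.files.append(sorted(self.memory.items()))
            self.memory = {}
            self.wal = []  # flushed data no longer needs the log

    def major_compact(self):
        # Merge all sorted files into one.
        if len(self.files) > 1:
            self.files = [list(self._merge(self.files))]

    def scan(self):
        # Reads merge the files with the in-memory map (newest source last).
        return list(self._merge(self.files + [sorted(self.memory.items())]))

    @staticmethod
    def _merge(sources):
        # Tag entries so that, for equal keys, the newest source sorts first.
        tagged = [[(k, -age, v) for k, v in src] for age, src in enumerate(sources)]
        last = object()
        for key, _, value in heapq.merge(*tagged):
            if key != last:
                yield (key, value)
                last = key


tablet = Tablet()
tablet.write("b", "1")
tablet.write("a", "2")
tablet.minor_compact()
tablet.write("b", "3")   # newer value, still only in memory
print(tablet.scan())
```

Because every source is already sorted, both scans and compactions are streaming merges and never need a full re-sort.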
Hierarchical Decomposition
Row: <person>
Column Family: attribute, purchases, returns
Column Qualifier: age, discount, hat, sneakers
Value: <age>, <40%>, <cost>, <cost>
Materialized Table
Row: george
  Column Family: attribute, purchases, returns
  Column Qualifier: age, hat, sneakers
  Value: 27, $83, $42
Row: bill
  Column Family: attribute, purchases
  Column Qualifier: age, discount, sneakers
  Value: 49, 40%, $100
[Callout in figure: Key/Value Pair]
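
One way to read the decomposition is as a walk over a nested record that emits one key/value pair per leaf. The sketch below assumes a three-level nesting (person, column family, column qualifier). The function name and the sample values are hypothetical, since the extracted slide does not show reliably which value belongs to which column.

```python
def decompose(records):
    """Flatten {row: {family: {qualifier: value}}} into sorted key/value pairs."""
    pairs = []
    for row, families in records.items():
        for family, columns in families.items():
            for qualifier, value in columns.items():
                pairs.append(((row, family, qualifier), value))
    return sorted(pairs)


# Hypothetical sample data in the shape suggested by the slide.
people = {
    "george": {
        "attribute": {"age": "27"},
        "purchases": {"hat": "$10"},
        "returns": {"sneakers": "$20"},
    },
    "bill": {
        "attribute": {"age": "49", "discount": "40%"},
        "purchases": {"sneakers": "$100"},
    },
}

for key, value in decompose(people):
    print(key, "->", value)
```

Since the pairs sort by row first, everything known about one person is contiguous, and families can be added per person without changing a schema.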
Forward and Inverted Index
Table:             Forward Index   Inverted Index
Row:               <UUID>          <Term>
Column Family:     <Type>          <Type> + <Field>
Column Qualifier:  <Field>         <UUID>
Value:             <Term>          <Digest of Event>
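
The two layouts can be produced from the same event. The sketch below is a minimal illustration: `index_entries` and `lookup` are invented names, joining type and field with a literal `+` is an assumption about what "<Type> + <Field>" means, and the digest strings are made up. In Accumulo the lookup would be a range scan on the row rather than a filter over a list.

```python
def index_entries(uuid, event_type, fields, digest):
    """Return (forward, inverted) key/value pairs for one event."""
    forward, inverted = [], []
    for field, term in fields.items():
        # Forward index: find the terms of a known event.
        forward.append(((uuid, event_type, field), term))
        # Inverted index: find the events that contain a known term.
        inverted.append(((term, f"{event_type}+{field}", uuid), digest))
    return forward, inverted


def lookup(inverted, term):
    """UUIDs and digests of all events containing the term."""
    return [(key[2], value) for key, value in sorted(inverted) if key[0] == term]


# Hypothetical events.
fwd1, inv1 = index_entries("uuid-1", "login", {"user": "alice", "host": "web01"}, "alice@web01")
fwd2, inv2 = index_entries("uuid-2", "login", {"user": "bob", "host": "web01"}, "bob@web01")
inverted_index = inv1 + inv2

print(lookup(inverted_index, "web01"))
```

Storing a digest of the event as the value lets a query return a summary from the inverted index alone, without a second lookup in the forward index.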
Graph Analysis
Table: Graph Table
Row: <Node ID>
Column Family   Column Qualifier (Tuples)   Value
“Node Info”     <Field>                     <Value>
“Out Edges”     <Node ID>, <Edge ID>        <Edge Info>
“In Edges”      <Node ID>, <Edge ID>        <Edge Info>
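
A small sketch shows how each edge becomes two entries in such a table and how a traversal reads them. The helper names and the sample edges are hypothetical, and the linear filter in `out_neighbors` stands in for what would be a range scan over one row and column family in Accumulo.

```python
from collections import deque


def edge_entries(source, target, edge_id, info):
    """Each edge is stored twice: once under each endpoint's row."""
    return [
        ((source, "Out Edges", (target, edge_id)), info),
        ((target, "In Edges", (source, edge_id)), info),
    ]


def out_neighbors(table, node):
    return [key[2][0] for key, _ in table
            if key[0] == node and key[1] == "Out Edges"]


def reachable(table, start):
    """Breadth-first traversal along out edges."""
    seen, queue = {start}, deque([start])
    while queue:
        for neighbor in out_neighbors(table, queue.popleft()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Hypothetical edges: a -> b -> c and d -> a.
graph = sorted(
    edge_entries("a", "b", "e1", "follows")
    + edge_entries("b", "c", "e2", "follows")
    + edge_entries("d", "a", "e3", "follows")
)

print(sorted(reachable(graph, "a")))
```

Writing every edge in both directions doubles the storage but keeps both a node's successors and its predecessors in a single row.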
Geospatial Queries
Table: Geo Index
Row: <GeoHash>
Column Family: <Event Type>
Column Qualifier: <UUID>
Value: <Digest of Event>

GeoHash (bits of each dimension interleaved):
Latitude:  10110101001
Longitude: 00111010010
Depth:     11010110110
GeoHash:   101001110111010101011100001011100
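
The long bit string on the slide is what results from interleaving the latitude, longitude, and depth bits, one bit from each dimension in turn. A short sketch reproduces it; the function name is invented here, and converting real coordinates to fixed-width bit strings is left out.

```python
def interleave(*dimensions):
    """Interleave equal-length bit strings, one bit from each in turn."""
    if len({len(bits) for bits in dimensions}) != 1:
        raise ValueError("all dimensions need the same number of bits")
    return "".join("".join(group) for group in zip(*dimensions))


latitude = "10110101001"
longitude = "00111010010"
depth = "11010110110"

geohash = interleave(latitude, longitude, depth)
print(geohash)
```

Interleaving puts the most significant bits of every dimension at the front of the row key, so points that are close in all three dimensions tend to share a key prefix and sort near one another.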
Document Partitioning
Table: Shard Table
Row: <Partition ID>
Column Family   Column Qualifier (Tuples)   Value
“Docs”          <UUID>, <Field>             <Value>
“Inv. Index”    <Term>, <UUID>
“Field Index”   <Field:Term>, <UUID>
“Geo”           <Hash>, <UUID>
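
A sketch of how one document could be written into a shard table follows. Several details are assumptions rather than the presenter's design: the partition is chosen by hashing the UUID, terms come from lowercased whitespace splitting, the "Docs" qualifier is read as (UUID, field), and index entries carry empty values. All function names are hypothetical, and the "Geo" family is omitted.

```python
import hashlib


def partition_id(uuid, num_partitions):
    """Stable partition number for a document (assumed scheme: hash of UUID)."""
    digest = hashlib.md5(uuid.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def shard_entries(uuid, fields, num_partitions):
    """All (row, family, qualifier, value) entries for one document."""
    row = f"{partition_id(uuid, num_partitions):04d}"
    entries = []
    for field, value in fields.items():
        entries.append((row, "Docs", (uuid, field), value))
        for term in value.lower().split():
            entries.append((row, "Inv. Index", (term, uuid), ""))
            entries.append((row, "Field Index", (f"{field}:{term}", uuid), ""))
    return entries


# Hypothetical document.
for entry in shard_entries("uuid-1", {"title": "Ocelot facts"}, 16):
    print(entry)
```

Because a document and all of its index entries share one row, they live in the same tablet, and a query can be evaluated inside each partition independently.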
Intersecting Iterator
Query: ‘foo’ and (‘bar’ or ‘baz’)
Row: <Partition ID>
Column Family   Column Qualifier (Tuples)   Value
“Docs”          <UUID>, <Field>             <Value>
“Inv. Index”    <Term>, <UUID>
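
Within one partition, the UUIDs listed under each term of the inverted index come back in sorted order, so a boolean query can be answered by merging sorted lists. The sketch below shows that idea for the query on the slide. It is not Accumulo's iterator implementation; the function names and the posting lists are hypothetical.

```python
import heapq


def union(*postings):
    """Merge sorted UUID lists, dropping duplicates (logical OR)."""
    merged = []
    for uuid in heapq.merge(*postings):
        if not merged or merged[-1] != uuid:
            merged.append(uuid)
    return merged


def intersect(left, right):
    """Walk two sorted UUID lists in step, keeping matches (logical AND)."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical inverted-index contents for one partition.
inv_index = {
    "foo": ["d1", "d3", "d4"],
    "bar": ["d2", "d3"],
    "baz": ["d4", "d5"],
}

# 'foo' and ('bar' or 'baz')
hits = intersect(inv_index["foo"], union(inv_index["bar"], inv_index["baz"]))
print(hits)
```

Each list is read once from front to back, which fits server-side evaluation: every partition returns only its matching UUIDs, and the corresponding “Docs” entries can then be fetched from the same row.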
acorn
Key/Value pairs are great! How do I construct a document partitioning key again?
Techniques should be built into an API
Let the people have polyglot
Lucene, SQL, SPARQL, JAQL, Matlab (not just Key, Value, Range)
Combined IR + Graph Search
Schema-less Stats
Get Involved
http://accumulo.apache.org
Help us make Accumulo even better!
Contact
Adam Fuchs, CTO
sqrrl data, Inc.
617-520-4375
www.sqrrl.com
@sqrrl_inc