sqrrl - the apache software...
TRANSCRIPT
sqrrl Secure. Scale. Adapt
Sqrrl Data, Inc. All Rights Reserved
Adam Fuchs, CTO 11 April, 2013
Management
Ely Kahn sqrrl VP BizDev,
White House
Investors
Adam Fuchs
sqrrl CTO, NSA
Who We Are
20+ years of combined Apache Accumulo engineering expertise
Mark Terenzoni sqrrl CEO, F5
• Founded July 2012
• Funded August 2012
• Team includes former Tech Director of Accumulo at NSA and 6 committers/contributors
Our Mission
Security
Adaptivity
Scalability
Apache Accumulo
• Sorted, Distributed Key/Value Store
• Based on Google’s BigTable Design
• Built on Top of Apache Hadoop and Apache ZooKeeper
• Augments and Integrates With the Hadoop Ecosystem
• Originally developed at the National Security Agency, now an Apache Software Foundation project
Applications
Analytics APIs
Security & Access Controls
Data Integration
Search, Statistics, Graph, Lucene, SQL, Custom Extensions
IAM, Encryption, DAM, Secure Code
ETL, Hadoop
Accumulo
Sqrrl Enterprise Architecture
• Start small, but design for scalability
  – One application first, then grow to hundreds
  – One gigabyte first, then grow to petabytes
• Iterative schema refinement
  – Initially, let the data define the schema
  – Refine the schema in bulk as you better understand the data
  – Middle ground between flat files and complete ontologies
• Discovery analytics as application building blocks
  – Universal search: structured and unstructured data, across data sets, low latency
  – Basic statistics: aggregations of query results, parallelized, low latency, to support big-picture analysis
  – Graphs: scalable graph analytics for analyzing how everything is connected
• Data-centric security
  – Separate modeling of security and analysis
  – Simplifies multi-tenancy and application accreditation
Big Data Lessons Learned
Schema Discovery
The future of Big Data innovation is Apps, built on:
• Universal Search
• Schema-less Statistics
• Graphs
• Intuitive Languages
• Secure, Scalable, and Adaptable platforms
Lightweight Apps
Targeted Analysis
Big-Picture Analytics
Definition: A form of security in which data carries with it the elements of provenance that are required to make policy decisions on its releasability.
• Separate data modeling for Security and Analysis
• Reusability of applications across security domains
• Distributed development of ingest and query applications
• Supported by Accumulo’s cell-level security
Data-Centric Security
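To make the idea of data that carries its own release policy concrete, here is a minimal Python sketch of how a label such as JD|PCP_JD (from the key/value example later in the deck) could be checked against a user's authorizations. It is not Accumulo's implementation. The function names `tokenize` and `is_visible` and the label alphabet are assumptions for illustration. The sketch follows the Accumulo convention that `&` means "and", `|` means "or", and mixing the two requires parentheses.

```python
import re

OPERATORS = ("&", "|")


def tokenize(expr):
    """Split a label expression into labels and the symbols & | ( )."""
    # Assumed, simplified label alphabet; no whitespace is allowed.
    tokens = re.findall(r"[A-Za-z0-9_.:/-]+|[&|()]", expr)
    if "".join(tokens) != expr:
        raise ValueError(f"unexpected character in {expr!r}")
    return tokens


def is_visible(expr, authorizations):
    """Return True if a user holding `authorizations` may read a cell labeled `expr`."""
    if expr == "":
        return True  # an empty label places no restriction on the cell
    tokens = tokenize(expr)
    pos = 0

    def parse_expression():
        nonlocal pos
        result = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in OPERATORS:
            if operator is None:
                operator = tokens[pos]
            elif tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            right = parse_term()  # always parse, even if the result is already known
            result = (result and right) if operator == "&" else (result or right)
        return result

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expression()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in OPERATORS or token == ")":
            raise ValueError(f"unexpected token {token!r}")
        return token in authorizations

    result = parse_expression()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result
```

Because the decision depends only on the label stored with the cell and the authorizations presented by the reader, the same check can serve any application that queries the data, which is the reusability point made above.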
Cell-Level Security
Scalable Data-Centric Security
Data Labeler Accumulo Apps
User Attributes
Audits
Policies
HDFS, Zookeeper
End Users
Auth. Service
Policy Engine
Accumulo’s Strengths
• Security
  – Cell-level security reduces the cost of application development in the presence of complex legal or policy restrictions on data use
  – IAM and encryption tie into enterprise security standards
• Scalability
  – Proven reliability and performance at the multi-petabyte scale
  – High-performance parallel I/O library
• Adaptivity
  – Flexible schema support to quickly ingest new data sources
  – Sorted key/value paradigm supports a multitude of search and analysis applications
  – Server-side programming framework “iterator trees” supports best-in-class aggregation, filtering, and complex query semantics
An Accumulo key is a 5-tuple, consisting of:
• Row: Controls Atomicity
• Column Family: Controls Locality
• Column Qualifier: Controls Uniqueness
• Visibility Label: Controls Access
• Timestamp: Controls Versioning

Row      | Col. Fam.    | Col. Qual.    | Visibility  | Timestamp | Value
John Doe | Notes        | PCP           | PCP_JD      | 20120912  | Patient suffers from an acute …
John Doe | Test Results | Cholesterol   | JD|PCP_JD   | 20120912  | 183
John Doe | Test Results | Mental Health | JD|PSYCH_JD | 20120801  | Pass
John Doe | Test Results | X-Ray         | JD|PHYS_JD  | 20120513  | 1010110110100…
Accumulo Key/Value Example
Accumulo Key Structure
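The following Python sketch models the 5-tuple key and the order in which entries are stored, using the rows from the example. Accumulo sorts keys ascending by row, column family, column qualifier, and visibility, and descending by timestamp so that the newest version of a cell comes first. The `Key` class and the `sort_order` and `latest_versions` helpers are illustrative names, not the Accumulo API, and the older cholesterol reading is an invented entry added only to show versioning.

```python
from typing import NamedTuple


class Key(NamedTuple):
    row: str
    column_family: str
    column_qualifier: str
    visibility: str
    timestamp: int


def sort_order(key):
    """Ascending on the first four elements, descending on timestamp."""
    return (key.row, key.column_family, key.column_qualifier,
            key.visibility, -key.timestamp)


def latest_versions(sorted_entries):
    """Keep only the first (newest) entry for each cell in already-sorted input."""
    seen = set()
    result = []
    for key, value in sorted_entries:
        cell = key[:4]  # everything except the timestamp identifies the cell
        if cell not in seen:
            seen.add(cell)
            result.append((key, value))
    return result


entries = [
    (Key("John Doe", "Test Results", "X-Ray", "JD|PHYS_JD", 20120513), "1010110110100…"),
    (Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912), "183"),
    # Hypothetical older reading, not in the slide, added to illustrate versioning.
    (Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20110301), "190"),
    (Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912), "Patient suffers from an acute …"),
    (Key("John Doe", "Test Results", "Mental Health", "JD|PSYCH_JD", 20120801), "Pass"),
]

ordered = sorted(entries, key=lambda entry: sort_order(entry[0]))
current = latest_versions(ordered)
```

Python compares strings by code point while Accumulo compares raw bytes; for the ASCII values used here the two orders agree.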
Accumulo Architecture
Tablet Server
Tablet
Tablet Server
Tablet
Tablet Server
Tablet
Application
Zookeeper
Zookeeper
Zookeeper
Master
HDFS
Read/Write
Store/Replicate
Assign/Balance
Delegate Authority
Delegate Authority
Application
Application
Tablet Data Flow
In-Memory Map
Write Ahead Log
(For Recovery)
Sorted, Indexed File
Sorted, Indexed File
Sorted, Indexed File
Tablet Reads
Iterator Tree
Minor Compaction
Merging / Major Compaction
Iterator Tree
Writes Iterator Tree
Scan
Iterator Framework
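The data flow in the diagram can be approximated with a toy model: writes go to a write-ahead log and an in-memory map, a minor compaction flushes the map to a sorted file, a major compaction merges files, and a scan merges every source so that newer data wins. The Python sketch below is a simplification under stated assumptions and not Accumulo's code. The `Tablet` class, its `memory_limit` parameter, and the choice to clear the log on every flush are all hypothetical.

```python
import heapq


class Tablet:
    def __init__(self, memory_limit=2):
        self.write_ahead_log = []   # replayed after a crash (recovery not modeled)
        self.memory_map = {}        # most recent writes, held in memory
        self.files = []             # sorted lists of (key, value), oldest first
        self.memory_limit = memory_limit

    def write(self, key, value):
        self.write_ahead_log.append((key, value))
        self.memory_map[key] = value
        if len(self.memory_map) >= self.memory_limit:
            self.minor_compact()

    def minor_compact(self):
        """Flush the in-memory map to a new sorted file."""
        if self.memory_map:
            self.files.append(sorted(self.memory_map.items()))
            self.memory_map = {}
            self.write_ahead_log = []  # flushed data no longer needs the log

    def major_compact(self):
        """Merge all sorted files into a single file."""
        if self.files:
            self.files = [list(self._merged(self.files))]

    def scan(self):
        """Merge files and the in-memory map into one sorted view."""
        sources = self.files + [sorted(self.memory_map.items())]
        return list(self._merged(sources))

    @staticmethod
    def _merged(sources):
        # Tag each entry with the negated age of its source so that, for equal
        # keys, the entry from the newest source sorts first.
        tagged = (
            [(key, -age, value) for key, value in source]
            for age, source in enumerate(sources)
        )
        previous = object()  # sentinel that equals no key
        for key, _, value in heapq.merge(*tagged):
            if key != previous:
                yield key, value
                previous = key
```

A short usage example: after `t = Tablet()`, writing `("b", "1")`, `("a", "2")`, and then `("a", "3")` leaves one sorted file plus one in-memory entry, and `t.scan()` returns the newer value for `a` followed by `b`.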
Iterator Operations:
• File Reads
• Block Caching
• Merging
• Deletion
• Isolation
• Locality Groups
• Range Selection
• Column Selection
• Cell-level Security
• Versioning
• Filtering
• Aggregation
• Partitioned Joins
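Several of the operations listed above compose naturally as a stack in which each stage consumes a sorted stream and emits a sorted stream. The Python sketch below illustrates that composition with generators. It is an analogy rather than the Accumulo iterator interface. The function names, the tuple layout `(row, column, visibility, timestamp, value)`, the sample data, and the treatment of a visibility label as a simple `|`-separated list are all assumptions.

```python
from itertools import groupby, islice


def visibility_filter(source, authorizations):
    """Cell-level security: drop entries the reader is not authorized to see."""
    for entry in source:
        labels = entry[2].split("|")  # simplified: a label is an OR of terms
        if any(label in authorizations for label in labels):
            yield entry


def versioning(source, max_versions=1):
    """Versioning: keep the newest `max_versions` entries per (row, column)."""
    for _, versions in groupby(source, key=lambda e: (e[0], e[1])):
        yield from islice(versions, max_versions)


def summing(source):
    """Aggregation: sum the integer values of each (row, column)."""
    for (row, column), group in groupby(source, key=lambda e: (e[0], e[1])):
        yield row, column, sum(int(e[4]) for e in group)


# Hypothetical entries, already sorted by row, column, visibility, then newest first.
scan = [
    ("doc1", "hits", "public", 3, "5"),
    ("doc1", "hits", "public", 2, "7"),
    ("doc1", "hits", "secret", 1, "100"),
    ("doc2", "hits", "public", 1, "1"),
]

totals = list(summing(visibility_filter(scan, {"public"})))
newest = list(versioning(visibility_filter(scan, {"public"})))
```

Placing the security filter at the bottom of the stack means that every stage above it, including aggregation, only ever sees data the reader is allowed to see.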
[email protected] | @sqrrl_inc | 617.520.4375 sqrrl data, INC., All Rights Reserved
• No built-in secondary indices
• Sort Order ⇔ Index
• Balance between ingest and query
• Avoid introducing bottlenecks
• Preserve cell-level security and scalability
Table Design
Table:         | Row    | Column Family    | Column Qualifier | Value
Forward Index  | <UUID> | <Type>           | <Field>          | <Term>
Inverted Index | <Term> | <Type> + <Field> | <UUID>           | <Digest of Event>
Ecosystem Architecture
Apache HDFS
Apache Accumulo
Sqrrl Enterprise
Custom Ingester Web Server Custom Analytic Map/Reduce Task
Sqrrl API over Apache Thrift RPC : Hierarchical Documents + Graphs, Lucene + SQL + more
Accumulo RPC : Sorted Key/Value I/O
Hadoop RPC : File I/O
sqrrl data, inc. 275 Third St.
Cambridge, MA 02142
617-902-0784 www.sqrrl.com @sqrrl_inc
Contact