Recommendation Engines & Accumulo - Sqrrl

Data Science Group

May 21, 2013

Agenda

6:15 - 6:30 More data trumps better algorithms by Michael Walker

6:30 - 7:30 Recommendation Engines by Tom Rampley

7:30 - 8:30 Accumulo - Sqrrl by John Dougherty

8:30 - 9:30 Network at Old Chicago at 14th and Market.

Data Science Group New Sponsors

Cloudera

O'Reilly Media

More data is better

Even if less exact or messier

One sensor = strict accuracy

Multiple sensors = less accurate & messy

More data points = greater value

Aggregate = more comprehensive picture

Increase frequency of sensor readings

One measure per min = accurate

100 readings per second = less accurate

> volume vs. exactitude

Accept messiness to get scale

Sacrifice accuracy in return for knowing general trend

Big data = probabilistic (not precise)

Good yet has problems

Internet of Things

"Data Science" means the scientific study of the creation, manipulation and transformation of data to create meaning.

"Data Scientist" means a professional who uses scientific methods to liberate and create meaning from raw data.

"Big Data" means large data sets that have different properties from small data sets and require special data science methods to differentiate signal from noise and extract meaning, as well as special compute systems and power.

Data Science

"Signal" means a meaningful interpretation of data based on science that may be transformed into scientific evidence and knowledge.

"Noise" means a competing interpretation of data not grounded in science that may not be considered scientific evidence. Yet noise may be manipulated into a form of knowledge (what does not work).

Machine Learning

Field of study that gives computers the ability to learn without being explicitly programmed.

Algorithms

Process or set of rules to be followed in calculations or other problem-solving operations to achieve a goal, especially a mathematical rule or procedure used to compute a desired result, produce the answer to a question or the solution to a problem in a finite number of steps.

More data trumps better algorithms

Microsoft Word Grammar Checker

Improve algorithms

New techniques

New features

Feed more data into existing methods

Most machine learning algorithms were trained on one million words or less

Experiment: 10 million, 100 million, and 1 billion words

Results: algorithms improved dramatically

The simple algorithm that was the worst performer with half a million words performed better than all the others with 1 billion words

The algorithm that worked best with half a million words performed worst with 1 billion words

Conclusions:

More trumps less

More trumps smarter (not always)

Tradeoff between spending time and money on algorithm development versus spending it on data development

Google language translation

1 billion words

1 trillion words

larger yet messier data set - entire internet

Tom Rampley

Recommendation Engines: an Introduction

A Brief History of Recommendation Engines

What Does a Recommender Do?

Recommendation engines use algorithms of varying complexity to suggest items based upon historical information

• Item ratings or content

• Past user behavior/purchase history

Recommenders typically use some form of collaborative filtering

Collaborative Filtering

The name:

• ‘Collaborative’ because the algorithm takes the choices of many users into account to make a recommendation

• Relies on user taste similarity

• ‘Filtering’ because you use the preferences of other users to filter out the items most likely to be of interest to the current user

Collaborative filtering algorithms include:

• K nearest neighbors

• Cosine similarity

• Pearson correlation

• Bayesian belief nets

• Markov decision processes

• Latent semantic indexing methods

• Association rules learning

Cosine Similarity Example

Let’s walk through an example of a simple collaborative filtering algorithm, namely cosine similarity

Cosine similarity can be used to find similar items, or similar individuals. In this case, we’ll be trying to identify individuals with similar taste

Imagine individual ratings on a set of items to be a [user,item] matrix. You can then treat the ratings of each individual as an N-dimensional vector of ratings on items: {r1, r2, …, rN}

The similarity of two vectors A and B (two individuals’ ratings) can be computed as the cosine of the angle θ between them:

cos(θ) = (A · B) / (‖A‖ ‖B‖)

The closer the cosine is to 1, the more alike the two individuals’ ratings are
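As a minimal sketch, the cosine of the angle between two rating vectors can be computed with the Python standard library alone. The function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for this illustration, not something the slides specify.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an all-zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score close to 1; orthogonal vectors score 0.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
print(cosine_similarity([1, 0], [0, 1]))
```

Because only the angle matters, a user who rates everything twice as high as another user still counts as having the same taste.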

Cosine Similarity Example Continued

Let’s say we have the following matrix of users and ratings of TV shows:

          True Blood  CSI  JAG  Star Trek  Castle  The Wire  Twin Peaks
Bob       5           2    1    4          3       2         5
Mary      4           4    2    1          3       1         2
Jim       1           1    5    2          5       2         3
George    3           4    3    5          5       4         3
Jennifer  5           2    4    2          4       1         0
Natalie   0           5    0    4          4       1         4
Robin     5           5    0    0          4       2         2

And we encounter a new user, James, who has only seen and rated 5 of these 7 shows:

          True Blood  CSI  JAG  Star Trek  Castle
James     5           5    3    1          0

Of the two remaining shows, which one should we recommend to James?

Cosine Similarity Example Continued

To find out, we’ll see who James is most similar to among the folks who have rated all the shows, by calculating the cosine similarity between the vectors of the 5 shows that each individual has in common with James:

Cosine similarity with James

Bob       0.73
Mary      0.89
Jim       0.47
George    0.69
Jennifer  0.78
Natalie   0.50
Robin     0.79

It seems that Mary is the closest to James in terms of show ratings among the group. Of the two remaining shows, The Wire and Twin Peaks, Mary slightly preferred Twin Peaks so that is what we recommend to James
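The similarity column above can be reproduced with a short script. This is a sketch that copies the ratings from the TV show matrix; the variable names (`SHOWS`, `RATINGS`, `JAMES`) are made up for the illustration, and zeros are treated as ordinary ratings, as the slide's arithmetic appears to do.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

SHOWS = ["True Blood", "CSI", "JAG", "Star Trek", "Castle", "The Wire", "Twin Peaks"]
RATINGS = {
    "Bob":      [5, 2, 1, 4, 3, 2, 5],
    "Mary":     [4, 4, 2, 1, 3, 1, 2],
    "Jim":      [1, 1, 5, 2, 5, 2, 3],
    "George":   [3, 4, 3, 5, 5, 4, 3],
    "Jennifer": [5, 2, 4, 2, 4, 1, 0],
    "Natalie":  [0, 5, 0, 4, 4, 1, 4],
    "Robin":    [5, 5, 0, 0, 4, 2, 2],
}
JAMES = [5, 5, 3, 1, 0]  # ratings for the first five shows only

# Compare James with each user on the five shows they have in common.
similarities = {user: cosine_similarity(r[:5], JAMES) for user, r in RATINGS.items()}
closest = max(similarities, key=similarities.get)

# Recommend whichever unseen show the closest user rated highest.
unseen = {SHOWS[i]: RATINGS[closest][i] for i in range(5, len(SHOWS))}
recommendation = max(unseen, key=unseen.get)

for user, score in similarities.items():
    print(f"{user:9s}{score:.2f}")
print(f"Closest user: {closest}; recommend: {recommendation}")
```

Taking only the single closest user is the simplest possible choice; averaging over several near neighbours is a common alternative.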

Collaborative Filtering Continued

This simple cosine similarity example could be extended to extremely large datasets with hundreds or thousands of dimensions

You can also compute item to item similarity by treating the items as the vectors for which you’re computing similarity, and the users as the dimensions

• Allows for recommending similar items to a user after they’ve made a purchase

• Amazon uses a variant of this algorithm

• This is an example of item-to-item collaborative filtering
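Item-to-item similarity can be sketched by transposing the same ratings matrix, so that each show becomes a vector with one dimension per user. The helper name `similar_items` is hypothetical, and this is only an illustration of the idea, not the variant any particular retailer uses.

```python
import math

SHOWS = ["True Blood", "CSI", "JAG", "Star Trek", "Castle", "The Wire", "Twin Peaks"]
# Rows are users (Bob, Mary, Jim, George, Jennifer, Natalie, Robin), columns are shows.
RATINGS = [
    [5, 2, 1, 4, 3, 2, 5],
    [4, 4, 2, 1, 3, 1, 2],
    [1, 1, 5, 2, 5, 2, 3],
    [3, 4, 3, 5, 5, 4, 3],
    [5, 2, 4, 2, 4, 1, 0],
    [0, 5, 0, 4, 4, 1, 4],
    [5, 5, 0, 0, 4, 2, 2],
]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def similar_items(target):
    """Rank the other shows by cosine similarity to `target`, most similar first."""
    columns = list(zip(*RATINGS))  # transpose: one vector of user ratings per show
    t = SHOWS.index(target)
    scores = [
        (cosine_similarity(columns[t], columns[i]), show)
        for i, show in enumerate(SHOWS)
        if i != t
    ]
    return sorted(scores, reverse=True)

for score, show in similar_items("The Wire"):
    print(f"{show:11s}{score:.2f}")
```

The item vectors can be computed ahead of time, which is one reason this approach suits "customers who bought this also bought" style suggestions.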

Adding ROI to the Equation: an Example with Naïve Bayes

When recommending products, some may generate more margin for the firm than others

Some algorithms can take cost into account when making recommendations

Naïve Bayes is a commonly used classifier that allows for the inclusion of marginal value of a product sale in the recommendation decision

Naïve Bayes

Bayes’ theorem tells us the probability of our beliefs being true given prior beliefs and evidence

Naïve Bayes is a classifier that utilizes Bayes’ theorem (with simplifying assumptions) to generate a probability of an instance belonging to a class

Class likelihood can be combined with expected payoff to generate the optimal payoff from a recommendation

Naïve Bayes Continued

How does the NB algorithm generate class probabilities, and how can we use the algorithmic output to maximize expected payoff?

Let’s say we want to figure out which of two products to recommend to a customer

Each product generates a different amount of profit for our firm per unit sold

We know the target customer’s past purchasing behavior, and we know the past purchasing behavior of twelve other customers who have bought one of the two potential recommendation products

Let’s represent our knowledge as a series of matrices and vectors

Naïve Bayes Continued

Past Customer Purchasing Behavior

          Toys        Games     Candy     Books         Boat
John      Squirt Gun  Chess     Skittles  Harry Potter  Speedboat
Mary      Doll        Life      M&Ms      Emma          Speedboat
Pete      Kite        Chess     M&Ms      Twilight      Sailboat
Kevin     Squirt Gun  Life      Snickers  Emma          Sailboat
Dale      Doll        Life      Skittles  Twilight      Speedboat
Jane      Kite        Monopoly  Skittles  Twilight      Speedboat
Raquelle  Squirt Gun  Monopoly  Skittles  Harry Potter  Sailboat
Joanne    Kite        Chess     Snickers  Twilight      Speedboat
Susan     Squirt Gun  Chess     Skittles  Twilight      Sailboat
Tim       Doll        Life      M&Ms      Harry Potter  Sailboat
Larry     Kite        Chess     M&Ms      Twilight      Speedboat
Regina    Doll        Monopoly  Snickers  Harry Potter  Sailboat
Eric      Squirt Gun  Life      Snickers  Harry Potter  ?

Naïve Bayes Continued

NB uses (independent) probabilities of events to generate class probabilities

Using Bayes’ theorem (and ignoring the scaling constant) the probability of a customer with past purchase history α (a vector of past purchases) buying item θj is:

P(α1, …, αi | θj) P(θj)

Where P(θj) is the frequency with which the item appears in the training data, and P(α1, …, αi | θj) is Π P(αi | θj) for all i items in the training data

That P(α1, …, αi | θj) P(θj) = Π P(αi | θj) P(θj) is dependent upon the assumption of conditional independence between past purchases

Naïve Bayes Continued

In our example, we can calculate the following probabilities:

              Sailboat  Speedboat
P(θ)          6/12      6/12

              Sailboat  Speedboat
Squirt Gun    3/6       1/6
Kite          1/6       3/6
Doll          2/6       2/6
Life          2/6       2/6
Monopoly      2/6       1/6
Chess         2/6       3/6
Skittles      2/6       3/6
M&Ms          2/6       2/6
Snickers      2/6       1/6
Harry Potter  3/6       1/6
Twilight      2/6       4/6
Emma          1/6       1/6

Naïve Bayes Continued

Now that we can calculate P(α1, …, αi | θj) P(θj) for all instances, let’s figure out the most likely boat purchase for Eric:

           P(θ)  Toys        Games  Candy     Books         Boat
Eric             Squirt Gun  Life   Snickers  Harry Potter  ?
Sailboat   6/12  3/12        2/12   2/12      3/12          0.00086806
Speedboat  6/12  1/12        2/12   1/12      1/12          0.00004823

These probabilities may seem very low, but recall that we left out the scaling constant in Bayes’ theorem since we’re only interested in the relative probabilities of the two outcomes
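The class scores for Eric can be recomputed directly from the purchase history. The sketch below is an illustration under stated assumptions: the names `TRAINING`, `ERIC`, and `naive_bayes_scores` are invented here, no smoothing is applied (so an item never seen with a class would zero out that class), and each conditional probability is a count divided by the 6 buyers in that class. The Eric table on the slide divides the same counts by 12 instead, which scales both scores by the same factor of 1/16 and leaves their 18-to-1 ratio unchanged.

```python
from collections import Counter
from fractions import Fraction

# (toys, games, candy, books) -> boat purchased, copied from the purchase table.
TRAINING = [
    (("Squirt Gun", "Chess", "Skittles", "Harry Potter"), "Speedboat"),    # John
    (("Doll", "Life", "M&Ms", "Emma"), "Speedboat"),                       # Mary
    (("Kite", "Chess", "M&Ms", "Twilight"), "Sailboat"),                   # Pete
    (("Squirt Gun", "Life", "Snickers", "Emma"), "Sailboat"),              # Kevin
    (("Doll", "Life", "Skittles", "Twilight"), "Speedboat"),               # Dale
    (("Kite", "Monopoly", "Skittles", "Twilight"), "Speedboat"),           # Jane
    (("Squirt Gun", "Monopoly", "Skittles", "Harry Potter"), "Sailboat"),  # Raquelle
    (("Kite", "Chess", "Snickers", "Twilight"), "Speedboat"),              # Joanne
    (("Squirt Gun", "Chess", "Skittles", "Twilight"), "Sailboat"),         # Susan
    (("Doll", "Life", "M&Ms", "Harry Potter"), "Sailboat"),                # Tim
    (("Kite", "Chess", "M&Ms", "Twilight"), "Speedboat"),                  # Larry
    (("Doll", "Monopoly", "Snickers", "Harry Potter"), "Sailboat"),        # Regina
]
ERIC = ("Squirt Gun", "Life", "Snickers", "Harry Potter")

def naive_bayes_scores(training, purchases):
    """Unnormalised score P(purchases | class) * P(class) for every class."""
    class_counts = Counter(label for _, label in training)
    scores = {}
    for label, n in class_counts.items():
        score = Fraction(n, len(training))  # prior P(class)
        for position, item in enumerate(purchases):
            matches = sum(
                1 for features, l in training
                if l == label and features[position] == item
            )
            score *= Fraction(matches, n)   # conditional P(item | class)
        scores[label] = score
    return scores

scores = naive_bayes_scores(TRAINING, ERIC)
for label, score in scores.items():
    print(f"{label:10s}{float(score):.8f}")
print("Sailboat is", scores["Sailboat"] / scores["Speedboat"], "times more likely")
```

Exact fractions are used so that the ratio between the two classes comes out as a whole number rather than a rounded float.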

Naïve Bayes Continued

So it seems like the sailboat is a slam dunk to recommend. It’s much more likely (18 times!) for Eric to buy than the speedboat.

But let’s consider a scenario: let’s say our hypothetical firm generates $20 of profit whenever a customer buys a speedboat, but only $1 when they buy a sailboat (outboard motors are apparently very high margin)

In that case, it would make more sense to recommend the speedboat, because our expected payoff from the speedboat recommendation would be 11% greater ($20/$1 * 0.00004823/0.00086806 ≈ 1.11) than our expected payout from the sailboat recommendation

This logic can be applied to any number of products, by multiplying the set of purchase probabilities by the set of purchase payoffs, taking the maximum value as the recommended item
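The payoff-weighted choice can be sketched in a few lines. The function name `recommend_by_expected_payoff` is hypothetical; the scores are the unnormalised values from the Eric table and the payoffs are the $1 and $20 figures from the scenario above.

```python
def recommend_by_expected_payoff(purchase_scores, payoffs):
    """Return (best_item, expected) where expected[item] = score * payoff."""
    expected = {item: purchase_scores[item] * payoffs[item] for item in purchase_scores}
    best = max(expected, key=expected.get)
    return best, expected

purchase_scores = {"Sailboat": 0.00086806, "Speedboat": 0.00004823}
payoffs = {"Sailboat": 1.0, "Speedboat": 20.0}  # profit per sale, in dollars

best, expected = recommend_by_expected_payoff(purchase_scores, payoffs)
ratio = expected["Speedboat"] / expected["Sailboat"]
print(f"Recommend: {best} (speedboat/sailboat expected payoff ratio {ratio:.2f})")
```

Because the scores are only used relative to each other, the missing scaling constant does not affect which item wins.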

Challenges

While recommendation algorithms in many cases are relatively simple as machine learning goes, there are a couple of difficult problems that all recommenders must deal with:

Cold start problem

• How do you make recommendations to someone for whom you have very little or no data?

Data sparsity

• With millions of items for sale, most customers have bought very few individual items

Grey and Black sheep problem

• Some people have very idiosyncratic taste, and making recommendations to them is extremely difficult because they don’t behave like other customers

Dealing With Cold Start

Typically only a problem in the very early stages of a user-system interaction

Requiring creation of a profile for new users can mitigate the problem to a certain extent, by making early recommendations contingent upon supplied personal data

A recommender system can also start out using item-item recommendations based upon the first items a user buys, and gradually change over to a person-person system as the system learns the user’s taste

Dealing With Data Sparsity

Data sparsity can be dealt with primarily by two methods:

• Data imputation

• Latent factor methods

Data imputation typically uses an algorithm like cosine similarity to impute the rating of an item based upon the ratings of similar users

Latent factor methods typically use some sort of matrix decomposition to reduce the rank of the large, sparse matrix while simultaneously adding ratings for unrated items based upon latent factors
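One way to picture data imputation is a similarity-weighted average: a missing rating is estimated from the ratings other users gave that item, weighted by how similar each of them is to the target user. The sketch below reuses three rows of the TV show matrix; `impute_rating` and the use of `None` for an unrated show are choices made for this illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def impute_rating(target, neighbours, item):
    """Estimate target's missing rating for column `item` as a
    similarity-weighted average of the neighbours' ratings for it."""
    shared = [i for i, rating in enumerate(target) if rating is not None]
    weighted_sum = weight_total = 0.0
    for ratings in neighbours:
        weight = cosine_similarity(
            [target[i] for i in shared], [ratings[i] for i in shared]
        )
        weighted_sum += weight * ratings[item]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None

# James from the TV show example; None marks the two shows he has not rated.
james = [5, 5, 3, 1, 0, None, None]
neighbours = [
    [5, 2, 1, 4, 3, 2, 5],  # Bob
    [4, 4, 2, 1, 3, 1, 2],  # Mary
    [1, 1, 5, 2, 5, 2, 3],  # Jim
]
the_wire = impute_rating(james, neighbours, 5)
twin_peaks = impute_rating(james, neighbours, 6)
print(f"The Wire: {the_wire:.2f}, Twin Peaks: {twin_peaks:.2f}")
```

With non-negative weights the estimate always falls between the lowest and highest neighbour rating for that item.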

• Techniques like principal components analysis/singular value decomposition allow for the creation of low rank approximations to sparse matrices with relatively little loss of information
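To give a feel for low-rank approximation without any numerical library, the sketch below finds the best rank-1 approximation of a small matrix by power iteration, which converges to the leading term of the singular value decomposition. The function name and the tiny ratings matrix are invented for the example. Treating unrated cells as 0, as done here, is a simplification; latent factor recommenders generally fit only the observed ratings.

```python
import math

def rank_one_approximation(matrix, iterations=100):
    """Approximate `matrix` by sigma * u * v^T, its leading singular triple,
    found by power iteration. Assumes the matrix is not all zeros."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    sigma = 0.0
    for _ in range(iterations):
        # u = A v, normalised to unit length
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        if norm_u == 0:
            raise ValueError("matrix has no non-zero component to approximate")
        u = [x / norm_u for x in u]
        # v = A^T u; its length converges to the largest singular value
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return [[sigma * u[i] * v[j] for j in range(cols)] for i in range(rows)]

# Hypothetical user-by-item ratings; 0 stands in for "not rated".
ratings = [
    [5.0, 4.0, 0.0],
    [4.0, 0.0, 2.0],
    [5.0, 4.0, 2.0],
]
for row in rank_one_approximation(ratings):
    print([round(x, 2) for x in row])
```

The approximation assigns a non-zero value to every cell, including the ones that were unrated, which is how a low-rank model ends up suggesting ratings for items a user has never seen.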

Dealing With Sheep of Varying Darkness

To a large extent, these cases are unavoidable

Feedback on recommended items post purchase, as well as the purchase rate of recommended items, can be used to learn even very idiosyncratic preferences, but this takes longer than for a normal user

Grey and black sheep are doubly troublesome because their odd tendencies can also weaken the strength of your engine to make recommendations to the broad population of white sheep

sqrrl and Accumulo

Presented by: John Dougherty, CIO, Viriton

5/21/2013

Which NoSQL solution?

© sqrrl, Inc.

There are a lot of places to fit sqrrl, and Accumulo.

What is sqrrl?

Based on Accumulo

A proven, secure, multi-tenant data platform for building real-time applications

Scales elastically to tens of petabytes of data and enables organizations to eliminate their internal data silos

Seamless integration with Hadoop, and most of its variants

Meets a much-needed demand for security built in from the ground up

Already deployed and utilized by defense and government industries

A history of sqrrl

© sqrrl, Inc. - Accumulo Meetup Presentation

A sqrrl’s architecture

What is Accumulo?

Development began at the NSA in 2008

Base foundation for sqrrl

Cell-level security reduces the cost of app development by working around complex, sometimes impossible, legal or policy restrictions

Provides the ability to scale to >PB levels

Highly adaptive schema and sorted key/value paradigm

Stores key/value pairs in parsed, sorted, secure controls

Where does Accumulo fit?

How does Accumulo provide security?

Security Labels are applied to keys

Cell-level security is implemented to allow for security policy enforcement, using data labeler tags

These policies are applied when data is ingested

Tablets contain the data and are controlled using security policies

Stores key/value pairs in parsed, sorted, secure controls, a 5-tuple key system

Accumulo Security (cont.)

Why Cell-Level Security Is Important:

Many databases insufficiently implement security through row- and column-level restrictions. Column-level security is only sufficient when the data schema is static, well known, and aligned with security concerns. Row-level security breaks down when a single record conveys multiple levels of information. The flexible, fine-grained cell-level security within Sqrrl Enterprise (or its root Accumulo) supports flexible schemas, new indexing patterns, and greater analytic adaptability at scale.

Accumulo Security (cont.)

An Accumulo key is a 5-tuple key, consisting of:

• Row: Controls Atomicity

• Column Family: Controls Locality

• Column Qualifier: Controls Uniqueness

• Visibility Label: Controls Access

• Timestamp: Controls Versioning

Keys are sorted:

• Hierarchically: Row first, then column family, and so on

• Lexicographically: Compare first byte, then second, and so on

(Values are byte arrays)
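The sort order can be modelled in plain Python, since tuples compare field by field and `bytes` compare byte by byte. This is a toy model rather than Accumulo client code: the `Key` class and `sort_order` helper are made up for the illustration. One detail not on the slide is included as an assumption from the Accumulo documentation: timestamps sort in descending order, so the newest version of a cell comes first.

```python
from typing import NamedTuple

class Key(NamedTuple):
    """Toy stand-in for the 5-tuple key described above."""
    row: bytes
    column_family: bytes
    column_qualifier: bytes
    visibility: bytes
    timestamp: int

def sort_order(key):
    # Hierarchical: tuples compare field by field, row first.
    # Lexicographic: bytes objects compare byte by byte.
    # Timestamp is negated so newer versions sort ahead of older ones.
    return (key.row, key.column_family, key.column_qualifier,
            key.visibility, -key.timestamp)

keys = [
    Key(b"user2", b"profile", b"name", b"public", 10),
    Key(b"user1", b"profile", b"name", b"public", 10),
    Key(b"user1", b"profile", b"name", b"public", 20),
    Key(b"user1", b"orders", b"2013-05-21", b"finance", 5),
]
ordered = sorted(keys, key=sort_order)
for key in ordered:
    print(key)
```

Because all keys for one row end up next to each other, choosing the row ID is the main data-model decision when designing a table.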

Accumulo Security (cont.)

An example of column usage

Accumulo Architecture

Accumulo servers (tablets) utilize a multitude of big data technologies, but their layout is different than Map/Reduce, HDFS, MongoDB, Cassandra, etc. used alone.

Data is stored in HDFS

Zookeeper is utilized for configuration management

Password-less SSH for node configuration

An emphasis, more of an imperative, on data model and data model design

Accumulo Architecture (cont.)

Tablets

• Partitions of tables, collections of sorted key/value pairs

• Held and managed by Tablet Servers

Accumulo Architecture (cont.)

Tablet Servers

• Receive writes and respond to reads from clients

• Write to a write-ahead log, sorting new key/value pairs in memory, while periodically flushing sorted key/value pairs to new files in HDFS

Managed by Master

• Responsible for detecting and responding to Tablet Server failure, load balancing

• Coordinates startup, graceful shutdown, and recovery of write-ahead logs

ZooKeeper

• An Apache project, open source

• Utilized for distributed locking mechanism, with no single point of failure
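The write path described for tablet servers (log first, hold pairs in memory, periodically flush sorted pairs to new files) can be illustrated with a toy class. Everything here is a simplified stand-in invented for this sketch: `ToyTabletServer`, its `flush_threshold`, and the in-memory lists that play the role of the write-ahead log and of files in HDFS. It is not how Accumulo is implemented.

```python
class ToyTabletServer:
    """Toy model of a log-then-memory-then-flush write path."""

    def __init__(self, flush_threshold=3):
        self.write_ahead_log = []  # stands in for the durable write-ahead log
        self.memory = {}           # new key/value pairs held in memory
        self.files = []            # each flush adds one sorted list ("file in HDFS")
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.write_ahead_log.append((key, value))  # log first, for recovery
        self.memory[key] = value
        if len(self.memory) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the in-memory pairs out in sorted key order, then start afresh.
        self.files.append(sorted(self.memory.items()))
        self.memory = {}

    def read(self, key):
        if key in self.memory:             # newest data lives in memory
            return self.memory[key]
        for file in reversed(self.files):  # then newest file to oldest
            for stored_key, value in file:
                if stored_key == key:
                    return value
        return None

server = ToyTabletServer(flush_threshold=3)
for key, value in [(b"c", b"3"), (b"a", b"1"), (b"b", b"2"), (b"a", b"1-updated")]:
    server.write(key, value)
print(server.files)
print(server.read(b"a"), server.read(b"c"))
```

The sketch leaves out compaction of old files and truncation of the log, both of which a real system needs.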

Integration with users/access

1. Gather an organization’s information security policies and dissect them into data-centric and user-centric components

2. As data is ingested into Accumulo, a data labeler tags individual key/value pairs with the appropriate data-centric visibility labels based on these policies.

3. Data is then stored in Accumulo where it is available for real-time queries by operational applications. End users are authenticated through these applications and authorized to access underlying data

4. As an end user performs an operation via the app (e.g., performs a search request), the visibility label on each candidate key/value pair is checked against his or her attributes, and only the data that he or she is authorized to see is returned.

The visibility labels are a feature that is unique to Accumulo. No other database can apply access controls at such a fine-grained level.

Labels are generated by translating an organization’s existing data security and information sharing policies into Boolean expressions
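The check in step 4 can be sketched as a small Boolean-expression evaluator. Visibility expressions combine labels with `&` (and), `|` (or), and parentheses, for example `admin&(audit|finance)`. The code below is an illustrative re-implementation, not Accumulo's own evaluator: the function names and label names are invented, the token pattern accepts only letters, digits, and underscores, and `&` is given higher precedence than `|` when they are mixed without parentheses, which Accumulo itself may not accept.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")

def tokenize(expression):
    """Split a visibility expression into labels, operators, and parentheses."""
    expression = expression.strip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def is_visible(expression, authorizations):
    """True if a user holding `authorizations` satisfies the expression."""
    tokens = tokenize(expression)
    if not tokens:
        return True  # an empty label places no restriction on the cell
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            right = parse_and()  # always parse the right side before combining
            result = result or right
        return result

    def parse_and():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            right = parse_term()
            result = result and right
        return result

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in {"&", "|", ")"}:
            raise ValueError(f"unexpected token {token!r}")
        return token in authorizations

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result

# Hypothetical cells: (value, visibility label). Only permitted cells are returned.
cells = [("salary=90k", "admin&(audit|finance)"), ("name=Eric", ""), ("ssn=...", "hr")]
user_authorizations = {"admin", "finance"}
visible = [value for value, label in cells if is_visible(label, user_authorizations)]
print(visible)
```

Filtering happens per cell, so two values in the same row can be shown to different audiences.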

Making sqrrl work

sqrrl’s extensions to Accumulo allow it to process millions of records per second, as either static or streaming objects

These records are converted into hierarchical JSON documents, giving document store capabilities

Passing this data to the analytics layer is designed to make integration and development of real-time analytics possible, and accessible

sqrrl combines Accumulo’s cell-level access control with integration of Identity and Access Management (IAM) systems (LDAP, RADIUS, etc.)

Making sqrrl work (cont.)

Sqrrl process

• Data Ingest: JSON, or graph format

• HDFS: File storage system, compatible with both open source (OSS) and commercial versions

• Apache Accumulo: The core of transactional and online analytical data processing in sqrrl

• Apache Thrift: Enables development in diverse language choices

• Apache Lucene: Custom iterators, providing developers with real-time capabilities, such as full-text search, graph analysis, and statistics, for analytical applications and dashboards.

Who is sqrrl for?

CTOs/CIOs: Unlock the value in fractured and unstructured datasets across your organization

Developers: More easily create apps on top of Big Data and distributed databases

Infrastructure Managers: Simplify administration of Big Data through highly scalable and multitenant distributed systems

Data Analysts: Dig deeper into your data using advanced analytical techniques, such as graph analysis

Business Users: Use Big Data seamlessly via apps developed on top of sqrrl enterprise

sqrrl/Accumulo wrap-up

Accumulo bridges the gap for security perspectives that restrict a large swath of industries

Accumulo Setup:

1. HDFS and ZooKeeper must be installed and configured

2. Password-less SSH should be configured between all nodes (with emphasis on master <> tablet)

3. Install Accumulo (from http://accumulo.apache.org/downloads/ using http://accumulo.apache.org/1.4/user_manual/Administration.html#Installation)

Or get started using their AMI (http://www.sqrrl.com/downloads#getting-started)

sqrrl combines the best of available technologies, develops and contributes their own, and designs big apps for big data.