TRANSCRIPT
Recommendation Engines &
Accumulo - Sqrrl
Data Science Group
May 21, 2013
Agenda
6:15 - 6:30 More data trumps better algorithms by Michael Walker
6:30 - 7:30 Recommendation Engines by Tom Rampley
7:30 - 8:30 Accumulo - Sqrrl by John Dougherty
8:30 - 9:30 Networking at Old Chicago at 14th and Market
Data Science Group New Sponsors
Cloudera
O'Reilly Media
More data is better
Even if less exact or messier
One sensor = strict accuracy
Multiple sensors = less accurate & messy
More data points = greater value
Aggregate = more comprehensive picture
Increase frequency of sensor readings
One measure per min = accurate
100 readings per second = less accurate
Greater volume vs. exactitude
Accept messiness to get scale
Sacrifice accuracy in return for knowing general trend
Big data = probabilistic (not precise)
Good yet has problems
Internet of Things
"Data Science" means the scientific study of the creation, manipulation, and transformation of data to create meaning.
"Data Scientist" means a professional who uses scientific methods to liberate and create meaning from raw data.
"Big Data" means large data sets that have different properties from small data sets, require special data science methods to differentiate signal from noise and extract meaning, and require special compute systems and power.
Data Science
"Signal" means a meaningful interpretation of data based on science that may be transformed into scientific evidence and knowledge.
"Noise" means a competing interpretation of data not grounded in science that may not be considered scientific evidence. Yet noise may be manipulated into a form of knowledge (what does not work).
Machine Learning
Field of study that gives computers the ability to learn without being explicitly programmed.
Algorithms
A process or set of rules to be followed in calculations or other problem-solving operations to achieve a goal: especially a mathematical rule or procedure used to compute a desired result, produce the answer to a question, or find the solution to a problem in a finite number of steps.
More data trumps better algorithms
Microsoft Word Grammar Checker
• One approach: improve the algorithms with new techniques and new features
• Another: feed more data into existing methods
• Most ML algorithms had been trained on one million words or less
• Experiment: 10 million, 100 million, and 1 billion words
More data trumps better algorithms
Results: algorithms improved dramatically
• A simple algorithm that was the worst performer with half a million words performed better than all the others with one billion words
• The algorithm that worked best with half a million words performed worst with one billion words
More data trumps better algorithms
Conclusions:
• More trumps less
• More trumps smarter (not always)
• Tradeoff between spending time and money on algorithm development versus spending it on data development
More data trumps better algorithms
Google language translation
• From 1 billion words to 1 trillion words
• A larger yet messier data set: the entire internet
Tom Rampley
Recommendation Engines: an Introduction
A Brief History of Recommendation Engines
What Does a Recommender Do?
Recommendation engines use algorithms of varying complexity to suggest items based upon historical information:
• Item ratings or content
• Past user behavior/purchase history
Recommenders typically use some form of collaborative filtering
Collaborative Filtering
The name:
• 'Collaborative' because the algorithm takes the choices of many users into account to make a recommendation
• Relies on user taste similarity
• 'Filtering' because you use the preferences of other users to filter out the items most likely to be of interest to the current user
Collaborative Filtering
Algorithms include:
• K nearest neighbors
• Cosine similarity
• Pearson correlation
• Bayesian belief nets
• Markov decision processes
• Latent semantic indexing methods
• Association rules learning
Cosine Similarity Example
Let's walk through an example of a simple collaborative filtering algorithm, namely cosine similarity. Cosine similarity can be used to find similar items, or similar individuals. In this case, we'll be trying to identify individuals with similar taste.
Imagine individual ratings on a set of items to be a [user, item] matrix. You can then treat the ratings of each individual as an N-dimensional vector of ratings on items: $\{r_1, r_2, \dots, r_N\}$.
The similarity of two vectors (individuals' ratings) can be computed as the cosine of the angle between them:

$$\cos(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{N} A_i B_i}{\sqrt{\sum_{i=1}^{N} A_i^2}\,\sqrt{\sum_{i=1}^{N} B_i^2}}$$

The closer the cosine is to 1, the more alike the two individuals' ratings are.
Cosine Similarity Example Continued
Let's say we have the following matrix of users and ratings of TV shows:

          True Blood  CSI  JAG  Star Trek  Castle  The Wire  Twin Peaks
Bob            5       2    1       4        3         2         5
Mary           4       4    2       1        3         1         2
Jim            1       1    5       2        5         2         3
George         3       4    3       5        5         4         3
Jennifer       5       2    4       2        4         1         0
Natalie        0       5    0       4        4         1         4
Robin          5       5    0       0        4         2         2

And we encounter a new user, James, who has only seen and rated 5 of these 7 shows:

          True Blood  CSI  JAG  Star Trek  Castle
James          5       5    3       1        0

Of the two remaining shows, which one should we recommend to James?
Cosine Similarity Example Continued
To find out, we'll see who James is most similar to among the folks who have rated all the shows, by calculating the cosine similarity between the vectors of the 5 shows that James and each individual have in common:

          Cosine Similarity with James
Bob                 0.73
Mary                0.89
Jim                 0.47
George              0.69
Jennifer            0.78
Natalie             0.50
Robin               0.79

It seems that Mary is the closest to James in terms of show ratings among the group. Of the two remaining shows, The Wire and Twin Peaks, Mary slightly preferred Twin Peaks, so that is what we recommend to James.
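To make the arithmetic concrete, here is a minimal Python sketch of the same computation (NumPy only; the ratings are copied from the tables above):

```python
import numpy as np

# Ratings on the 5 shows James has rated:
# True Blood, CSI, JAG, Star Trek, Castle
users = {
    "Bob":      [5, 2, 1, 4, 3],
    "Mary":     [4, 4, 2, 1, 3],
    "Jim":      [1, 1, 5, 2, 5],
    "George":   [3, 4, 3, 5, 5],
    "Jennifer": [5, 2, 4, 2, 4],
    "Natalie":  [0, 5, 0, 4, 4],
    "Robin":    [5, 5, 0, 0, 4],
}
james = np.array([5, 5, 3, 1, 0])

def cosine(a, b):
    """Cosine of the angle between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank everyone by similarity to James
sims = {name: cosine(np.array(r), james) for name, r in users.items()}
for name, s in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {s:.2f}")
# Mary tops the list (0.89), and of the two shows James hasn't seen
# she rated Twin Peaks (2) over The Wire (1), so recommend Twin Peaks.
```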
Collaborative Filtering Continued
This simple cosine similarity example could be extended to extremely large datasets with hundreds or thousands of dimensions
You can also compute item-to-item similarity by treating the items as the vectors for which you're computing similarity, and the users as the dimensions:
• Allows for recommending similar items to a user after they've made a purchase
• Amazon uses a variant of this algorithm
• This is an example of item-to-item collaborative filtering, sketched below
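Item-to-item similarity is the same cosine computation with the matrix transposed, so items become the vectors and users the dimensions. A short NumPy sketch using the show-ratings matrix from earlier:

```python
import numpy as np

# Same [user, item] ratings matrix as above (7 users x 7 shows)
R = np.array([
    [5, 2, 1, 4, 3, 2, 5], [4, 4, 2, 1, 3, 1, 2], [1, 1, 5, 2, 5, 2, 3],
    [3, 4, 3, 5, 5, 4, 3], [5, 2, 4, 2, 4, 1, 0], [0, 5, 0, 4, 4, 1, 4],
    [5, 5, 0, 0, 4, 2, 2],
], dtype=float)

# Compare columns (shows) instead of rows (users)
norms = np.linalg.norm(R, axis=0)
item_sim = (R.T @ R) / np.outer(norms, norms)  # 7x7 show-to-show cosine matrix
print(np.round(item_sim, 2))
```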
Adding ROI to the Equation: an Example with Naïve Bayes
When recommending products, some may generate more margin for the firm than others
Some algorithms can take cost into account when making recommendations
Naïve Bayes is a commonly used classifier that allows for the inclusion of the marginal value of a product sale in the recommendation decision
Naïve Bayes
Bayes' theorem tells us the probability of our beliefs being true given prior beliefs and evidence.
Naïve Bayes is a classifier that utilizes Bayes' theorem (with simplifying assumptions) to generate a probability of an instance belonging to a class.
Class likelihood can be combined with expected payoff to generate the optimal payoff from a recommendation.
Naïve Bayes Continued
How does the NB algorithm generate class probabilities, and how can we use the algorithmic output to maximize expected payoff?
Let's say we want to figure out which of two products to recommend to a customer. Each product generates a different amount of profit for our firm per unit sold.
We know the target customer's past purchasing behavior, and we know the past purchasing behavior of twelve other customers who have bought one of the two potential recommendation products.
Let's represent our knowledge as a series of matrices and vectors.
Naïve Bayes Continued
Past Customer Purchasing Behavior

          Toys        Games     Candy     Books         Boat
John      Squirt Gun  Chess     Skittles  Harry Potter  Speedboat
Mary      Doll        Life      M&Ms      Emma          Speedboat
Pete      Kite        Chess     M&Ms      Twilight      Sailboat
Kevin     Squirt Gun  Life      Snickers  Emma          Sailboat
Dale      Doll        Life      Skittles  Twilight      Speedboat
Jane      Kite        Monopoly  Skittles  Twilight      Speedboat
Raquelle  Squirt Gun  Monopoly  Skittles  Harry Potter  Sailboat
Joanne    Kite        Chess     Snickers  Twilight      Speedboat
Susan     Squirt Gun  Chess     Skittles  Twilight      Sailboat
Tim       Doll        Life      M&Ms      Harry Potter  Sailboat
Larry     Kite        Chess     M&Ms      Twilight      Speedboat
Regina    Doll        Monopoly  Snickers  Harry Potter  Sailboat
Eric      Squirt Gun  Life      Snickers  Harry Potter  ?
Naïve Bayes Continued
NB uses (independent) probabilities of events to generate class probabilities.
Using Bayes' theorem (and ignoring the scaling constant), the probability of a customer with past purchase history $\alpha$ (a vector of past purchases) buying item $\theta_j$ is:

$$P(\alpha_1, \dots, \alpha_i \mid \theta_j)\, P(\theta_j)$$

where $P(\theta_j)$ is the frequency with which the item appears in the training data, and $P(\alpha_1, \dots, \alpha_i \mid \theta_j) = \prod_i P(\alpha_i \mid \theta_j)$ over all $i$ items in the training data.
That $P(\alpha_1, \dots, \alpha_i \mid \theta_j)\, P(\theta_j) = \prod_i P(\alpha_i \mid \theta_j)\, P(\theta_j)$ depends upon the assumption of conditional independence between past purchases.
Naïve Bayes Continued
In our example, we can calculate the following probabilities:

              Sailboat  Speedboat
P(θ)            6/12      6/12

              Sailboat  Speedboat
Squirt Gun      3/6       1/6
Kite            1/6       3/6
Doll            2/6       2/6
Life            2/6       2/6
Monopoly        2/6       1/6
Chess           2/6       3/6
Skittles        2/6       3/6
M&Ms            2/6       2/6
Snickers        2/6       1/6
Harry Potter    3/6       1/6
Twilight        2/6       4/6
Emma            1/6       1/6
Naïve Bayes Continued
Now that we can calculate $P(\alpha_1, \dots, \alpha_i \mid \theta_j)\, P(\theta_j)$ for all instances, let's figure out the most likely boat purchase for Eric:

Eric's purchases: Squirt Gun (Toys), Life (Games), Snickers (Candy), Harry Potter (Books), Boat = ?

           P(θ)  Squirt Gun  Life  Snickers  Harry Potter  Product
Sailboat   6/12     3/6      2/6     2/6         3/6       0.01389
Speedboat  6/12     1/6      2/6     1/6         1/6       0.00077

These probabilities may seem very low, but recall that we left out the scaling constant in Bayes' theorem, since we're only interested in the relative probabilities of the two outcomes.
So it seems like the sailboat is a slam dunk to recommend. It's much more likely (18 times!) for Eric to buy than the speedboat.
Naïve Bayes Continued
But let's consider a scenario: let's say our hypothetical firm generates $20 of profit whenever a customer buys a speedboat, but only $1 when they buy a sailboat (outboard motors are apparently very high margin).
In that case, it would make more sense to recommend the speedboat, because our expected payoff from the speedboat recommendation would be 11% greater ($20/$1 × 0.00077/0.01389) than our expected payoff from the sailboat recommendation.
This logic can be applied to any number of products, by multiplying the set of purchase probabilities by the set of purchase payoffs and taking the maximum value as the recommended item.
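The whole worked example fits in a short Python sketch (data copied from the tables above; a from-scratch illustration, not any particular library's API):

```python
from collections import Counter, defaultdict

# (purchases, boat) pairs from the "Past Customer Purchasing Behavior" table
training = [
    (("Squirt Gun", "Chess", "Skittles", "Harry Potter"), "Speedboat"),   # John
    (("Doll", "Life", "M&Ms", "Emma"), "Speedboat"),                      # Mary
    (("Kite", "Chess", "M&Ms", "Twilight"), "Sailboat"),                  # Pete
    (("Squirt Gun", "Life", "Snickers", "Emma"), "Sailboat"),             # Kevin
    (("Doll", "Life", "Skittles", "Twilight"), "Speedboat"),              # Dale
    (("Kite", "Monopoly", "Skittles", "Twilight"), "Speedboat"),          # Jane
    (("Squirt Gun", "Monopoly", "Skittles", "Harry Potter"), "Sailboat"), # Raquelle
    (("Kite", "Chess", "Snickers", "Twilight"), "Speedboat"),             # Joanne
    (("Squirt Gun", "Chess", "Skittles", "Twilight"), "Sailboat"),        # Susan
    (("Doll", "Life", "M&Ms", "Harry Potter"), "Sailboat"),               # Tim
    (("Kite", "Chess", "M&Ms", "Twilight"), "Speedboat"),                 # Larry
    (("Doll", "Monopoly", "Snickers", "Harry Potter"), "Sailboat"),       # Regina
]

class_counts = Counter(boat for _, boat in training)
item_counts = defaultdict(Counter)            # item_counts[boat][item]
for items, boat in training:
    for item in items:
        item_counts[boat][item] += 1

def nb_score(items, boat):
    """P(items | boat) * P(boat), ignoring Bayes' scaling constant."""
    score = class_counts[boat] / len(training)
    for item in items:
        score *= item_counts[boat][item] / class_counts[boat]
    return score

eric = ("Squirt Gun", "Life", "Snickers", "Harry Potter")
payoff = {"Sailboat": 1.0, "Speedboat": 20.0}  # profit per unit sold

for boat in class_counts:
    p = nb_score(eric, boat)
    print(f"{boat:9s} score={p:.5f} expected payoff={p * payoff[boat]:.5f}")
# Sailboat is ~18x more likely, but the speedboat's expected payoff is ~11% higher
```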
Challenges
While recommendation algorithms are in many cases relatively simple as machine learning goes, there are a couple of difficult problems that all recommenders must deal with:
• Cold start problem: how do you make recommendations to someone for whom you have very little or no data?
• Data sparsity: with millions of items for sale, most customers have bought very few individual items
• Grey and black sheep problem: some people have very idiosyncratic taste, and making recommendations to them is extremely difficult because they don't behave like other customers
Dealing With Cold Start
Typically only a problem in the very early stages of a user-system interaction.
Requiring creation of a profile for new users can mitigate the problem to a certain extent, by making early recommendations contingent upon supplied personal data.
A recommender system can also start out using item-item recommendations based upon the first items a user buys, and gradually change over to a person-person system as the system learns the user's taste.
Dealing With Data Sparsity
Data sparsity can be dealt with primarily by two methods:
• Data imputation
• Latent factor methods
Data imputation typically uses an algorithm like cosine similarity to impute the rating of an item based upon the ratings of similar users.
Latent factor methods typically use some sort of matrix decomposition to reduce the rank of the large, sparse matrix while simultaneously adding ratings for unrated items based upon latent factors.
Dealing With Data Sparsity (cont.)
• Techniques like principal components analysis/singular value decomposition allow for the creation of low-rank approximations to sparse matrices with relatively little loss of information, as sketched below
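As a rough illustration of the latent factor idea, the sketch below builds a low-rank approximation of the ratings matrix from the earlier cosine-similarity example with a truncated SVD (NumPy only; rank 2 is an arbitrary choice for the sketch):

```python
import numpy as np

# User x show ratings matrix from the earlier example (0 = unrated/low)
R = np.array([
    [5, 2, 1, 4, 3, 2, 5],   # Bob
    [4, 4, 2, 1, 3, 1, 2],   # Mary
    [1, 1, 5, 2, 5, 2, 3],   # Jim
    [3, 4, 3, 5, 5, 4, 3],   # George
    [5, 2, 4, 2, 4, 1, 0],   # Jennifer
    [0, 5, 0, 4, 4, 1, 4],   # Natalie
    [5, 5, 0, 0, 4, 2, 2],   # Robin
], dtype=float)

# Truncated SVD: keep only the top-k singular values/vectors
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# R_approx is a rank-k estimate of every user/show cell, including
# cells that were zero; those estimates can serve as imputed ratings
print(np.round(R_approx, 1))
```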
Dealing With Sheep of Varying Darkness
To a large extent, these cases are unavoidable
Feedback on recommended items post purchase, as well as the purchase rate of recommended items, can be used to learn even very idiosyncratic preferences, but learning them takes longer than for a normal user.
Grey and black sheep are doubly troublesome because their odd tendencies can also weaken your engine's ability to make recommendations to the broad population of white sheep.
References
A good survey of recommendation techniques
Matrix factorization for use in recommenders
Article on the BellKor solution to the Netflix challenge
Article on Amazon's recommendation engine
sqrrl and Accumulo
Presented by: John Dougherty, CIO, Viriton
5/21/2013
Which NoSQL solution?
There are a lot of places to fit sqrrl and Accumulo.
What is sqrrl?
Based on Accumulo
A proven, secure, multi-tenant data platform for building real-time applications
Scales elastically to tens of petabytes of data and enables organizations to eliminate their internal data silos
Seamless integration with Hadoop and most of its variants
Meets a much-needed demand for ground-up security
Already deployed and utilized by the defense and government sectors
A history of sqrrl
A sqrrl’s architecture
What is Accumulo?
Development began at the NSA in 2008
The base foundation for sqrrl
Cell-level security reduces the cost of app development, circumventing complex, sometimes impossible, legal or policy restrictions
Provides the ability to scale to petabyte levels and beyond
Highly adaptive schema and sorted key/value paradigm
Stores key/value pairs parsed, sorted, and under fine-grained security controls
Where does Accumulo fit?
How does Accumulo provide security?
Security labels are applied to keys
Cell-level security is implemented to allow for security policy enforcement, using data labeler tags
These policies are applied when data is ingested
Tablets contain the data and are controlled using security policies
Key/value pairs are stored parsed and sorted under a 5-tuple key system with per-key access controls
Accumulo Security (cont.)
Why Cell-Level Security Is Important:
Many databases insufficiently implement security through row- and column-level restrictions. Column-level security is only sufficient when the data schema is static, well known, and aligned with security concerns. Row-level security breaks down when a single record conveys multiple levels of information. The flexible, fine-grained cell-level security within Sqrrl Enterprise (or Accumulo, at its root) supports flexible schemas, new indexing patterns, and greater analytic adaptability at scale.
Accumulo Security (cont.)
An Accumulo key is a 5-tuple, consisting of:
• Row: controls atomicity
• Column Family: controls locality
• Column Qualifier: controls uniqueness
• Visibility Label: controls access
• Timestamp: controls versioning
Keys are sorted:
• Hierarchically: row first, then column family, and so on
• Lexicographically: compare first byte, then second, and so on
(Values are byte arrays)
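A toy model of that key structure and sort order in Python (a conceptual sketch only, not Accumulo's actual Java API; the example keys are invented):

```python
from typing import NamedTuple

class Key(NamedTuple):
    """The Accumulo 5-tuple key, fields in sort-priority order."""
    row: bytes               # atomicity
    column_family: bytes     # locality
    column_qualifier: bytes  # uniqueness
    visibility: bytes        # access
    timestamp: int           # versioning

keys = [
    Key(b"user#42", b"purchases", b"2013-05-01", b"analyst", 2),
    Key(b"user#42", b"profile", b"name", b"public", 1),
    Key(b"user#7", b"profile", b"name", b"public", 1),
]

# Tuples of bytes compare field-by-field, byte-by-byte, which mirrors the
# hierarchical + lexicographic ordering described above.
# (Real Accumulo sorts timestamps descending so newest versions come first.)
for k in sorted(keys):
    print(k.row, k.column_family, k.column_qualifier)
```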
Accumulo Security (cont.)
An example of column usage
Accumulo Architecture
Accumulo servers (tablet servers) utilize a multitude of big data technologies, but their layout is different from Map/Reduce, HDFS, MongoDB, Cassandra, etc. used alone.
Data is stored in HDFS
Zookeeper is utilized for configuration management
Password-less SSH is used for node configuration
An emphasis, more of an imperative, on the data model and data model design
Accumulo Architecture (cont.)
Tablets
• Partitions of tables: collections of sorted key/value pairs
• Held and managed by Tablet Servers
Accumulo Architecture (cont.)
Tablet Servers
• Receive writes and respond to reads from clients
• Write to a write-ahead log, sorting new key/value pairs in memory, while periodically flushing sorted key/value pairs to new files in HDFS
• Managed by the Master, which is responsible for detecting and responding to Tablet Server failure, load balancing, and coordinating startup, graceful shutdown, and recovery of write-ahead logs
Zookeeper
• An Apache project, open source
• Utilized as a distributed locking mechanism with no single point of failure
Integration with users/access
1. Gather an organization's information security policies and dissect them into data-centric and user-centric components.
2. As data is ingested into Accumulo, a data labeler tags individual key/value pairs with the appropriate data-centric visibility labels based on these policies.
3. Data is then stored in Accumulo, where it is available for real-time queries by operational applications. End users are authenticated through these applications and authorized to access underlying data.
4. As an end user performs an operation via the app (e.g., performs a search request), the visibility label on each candidate key/value pair is checked against his or her attributes, and only the data that he or she is authorized to see is returned.
The visibility labels are a feature that is unique to Accumulo. No other database can apply access controls at such a fine-grained level.
Labels are generated by translating an organization’s existing data security and information sharing policies into Boolean expressions
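As a toy illustration, such a label check against a user's authorizations might look like the following Python sketch (conceptual only; Accumulo's real evaluator is its Java ColumnVisibility class, and the policy shown is invented):

```python
import re

def satisfies(expression: str, user_auths: set) -> bool:
    """Evaluate a Boolean visibility expression like
    '(admin & audit) | superuser' against a user's authorizations."""
    # Replace each label token with True/False, then evaluate the expression
    tokens = re.sub(
        r"[A-Za-z_][A-Za-z0-9_]*",
        lambda m: str(m.group(0) in user_auths),
        expression,
    )
    return eval(tokens.replace("&", " and ").replace("|", " or "))

# A policy like "visible to auditors who are admins, or to superusers"
label = "(admin & audit) | superuser"
print(satisfies(label, {"admin", "audit"}))  # True
print(satisfies(label, {"admin"}))           # False
```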
Making sqrrl work
sqrrl's extensions to Accumulo allow it to process millions of records per second, as either static or streaming objects
These records are converted into hierarchical JSON documents, giving sqrrl document store capabilities
Passing this data to the analytics layer is designed to make integration and development of real-time analytics possible and accessible
Combining Accumulo's cell-level access controls, sqrrl integrates with Identity and Access Management (IAM) systems (LDAP, RADIUS, etc.)
Making sqrrl work (cont.)
Sqrrl process
• Data Ingest: JSON, or graph format
• HDFS: file storage system, compatible with both open source (OSS) and commercial versions
• Apache Accumulo: the core of transactional and online analytical data processing in sqrrl
• Apache Thrift: enables development in diverse language choices
• Apache Lucene: custom iterators, providing developers with real-time capabilities, such as full-text search, graph analysis, and statistics, for analytical applications and dashboards
Who is sqrrl for?
CTOs/CIOs: Unlock the value in fractured and unstructured datasets across your organization
Developers: More easily create apps on top of Big Data and distributed databases
Infrastructure Managers: Simplify administration of Big Data through highly scalable and multitenant distributed systems
Data Analysts: Dig deeper into your data using advanced analytical techniques, such as graph analysis
Business Users: Use Big Data seamlessly via apps developed on top of sqrrl enterprise
sqrrl/Accumulo wrap-up
Accumulo bridges the security gap that restricts a large swath of industries
Accumulo Setup:
1. HDFS and ZooKeeper must be installed and configured
2. Password-less SSH should be configured between all nodes (especially master <> tablet)
3. Install Accumulo (from http://accumulo.apache.org/downloads/ using http://accumulo.apache.org/1.4/user_manual/Administration.html#Installation)
Or get started using their AMI (http://www.sqrrl.com/downloads#getting-started)
sqrrl combines the best of available technologies, develops and contributes its own, and designs big apps for big data.