TRANSCRIPT
Recommendation Engines &
Accumulo - Sqrrl
Data Science Group
May 21, 2013
Agenda
6:15 - 6:30 More data trumps better algorithms by Michael Walker
6:30 - 7:30 Recommendation Engines by Tom Rampley
7:30 - 8:30 Accumulo - Sqrrl by John Dougherty
8:30 - 9:30 Networking at Old Chicago at 14th and Market
Data Science Group New Sponsors
Cloudera
O'Reilly Media
More data is better
Even if less exact or messier
One sensor = strict accuracy
Multiple sensors = less accurate & messy
More data points = greater value
Aggregate = more comprehensive picture
Increase frequency of sensor readings
One measure per min = accurate
100 readings per second = less accurate
Greater volume vs. exactitude
Accept messiness to get scale
Sacrifice accuracy in return for knowing general trend
Big data = probabilistic (not precise)
Good yet has problems
Internet of Things
"Data Science" means the scientific study of the creation, manipulation, and transformation of data to create meaning.
"Data Scientist" means a professional who uses scientific methods to liberate and create meaning from raw data.
"Big Data" means large data sets that have different properties from small data sets, require special data science methods to differentiate signal from noise and extract meaning, and require special compute systems and power.
Data Science
"Signal" means a meaningful interpretation of data based on science that may be transformed into scientific evidence and knowledge.
"Noise" means a competing interpretation of data not grounded in science that may not be considered scientific evidence. Yet noise may be manipulated into a form of knowledge (what does not work).
Machine Learning
Field of study that gives computers the ability to learn without being explicitly programmed.
Algorithms
A process or set of rules to be followed in calculations or other problem-solving operations to achieve a goal: especially a mathematical rule or procedure used to compute a desired result, produce the answer to a question, or find the solution to a problem in a finite number of steps.
More data trumps better algorithms
Microsoft Word Grammar Checker
• One approach: improve the algorithms with new techniques and new features
• Another: feed more data into existing methods
• Most ML algorithms had been trained on one million words or less
• Experiment: 10 million, 100 million, and 1 billion words
More data trumps better algorithms
Results: algorithms improved dramatically
• A simple algorithm that was the worst performer with half a million words performed better than all the others with one billion words
• The algorithm that worked best with half a million words performed worst with one billion words
More data trumps better algorithms
Conclusions:
• More trumps less
• More trumps smarter (not always)
• Tradeoff between spending time and money on algorithm development versus spending it on data development
More data trumps better algorithms
Google language translation
• From 1 billion words to 1 trillion words
• A larger yet messier data set: the entire internet
Tom Rampley
Recommendation Engines: an Introduction
A Brief History of Recommendation Engines
What Does a Recommender Do?
Recommendation engines use algorithms of varying complexity to suggest items based upon historical information:
• Item ratings or content
• Past user behavior/purchase history
Recommenders typically use some form of collaborative filtering
Collaborative Filtering
The name:
• 'Collaborative' because the algorithm takes the choices of many users into account to make a recommendation
• Relies on user taste similarity
• 'Filtering' because you use the preferences of other users to filter out the items most likely to be of interest to the current user
Collaborative Filtering
Algorithms include:
• K nearest neighbors
• Cosine similarity
• Pearson correlation
• Bayesian belief nets
• Markov decision processes
• Latent semantic indexing methods
• Association rules learning
Cosine Similarity Example
Let's walk through an example of a simple collaborative filtering algorithm, namely cosine similarity. Cosine similarity can be used to find similar items, or similar individuals. In this case, we'll be trying to identify individuals with similar taste.
Imagine individual ratings on a set of items to be a [user, item] matrix. You can then treat the ratings of each individual as an N-dimensional vector of ratings on items: $\{r_1, r_2, \dots, r_N\}$.
The similarity of two vectors (individuals' ratings) can be computed as the cosine of the angle between them:

$$\cos(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{N} A_i B_i}{\sqrt{\sum_{i=1}^{N} A_i^2}\,\sqrt{\sum_{i=1}^{N} B_i^2}}$$

The closer the cosine is to 1, the more alike the two individuals' ratings are.
Cosine Similarity Example Continued
Let's say we have the following matrix of users and ratings of TV shows:

          True Blood  CSI  JAG  Star Trek  Castle  The Wire  Twin Peaks
Bob            5       2    1       4        3         2         5
Mary           4       4    2       1        3         1         2
Jim            1       1    5       2        5         2         3
George         3       4    3       5        5         4         3
Jennifer       5       2    4       2        4         1         0
Natalie        0       5    0       4        4         1         4
Robin          5       5    0       0        4         2         2

And we encounter a new user, James, who has only seen and rated 5 of these 7 shows:

          True Blood  CSI  JAG  Star Trek  Castle
James          5       5    3       1        0

Of the two remaining shows, which one should we recommend to James?
Cosine Similarity Example Continued
To find out, we'll see who James is most similar to among the folks who have rated all the shows, by calculating the cosine similarity between the vectors of the 5 shows that James and each individual have in common:

          Cosine Similarity with James
Bob                 0.73
Mary                0.89
Jim                 0.47
George              0.69
Jennifer            0.78
Natalie             0.50
Robin               0.79

It seems that Mary is the closest to James in terms of show ratings among the group. Of the two remaining shows, The Wire and Twin Peaks, Mary slightly preferred Twin Peaks, so that is what we recommend to James.
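To make the arithmetic concrete, here is a minimal Python sketch of the same computation (NumPy only; the ratings are copied from the tables above):

```python
import numpy as np

# Ratings on the 5 shows James has rated:
# True Blood, CSI, JAG, Star Trek, Castle
users = {
    "Bob":      [5, 2, 1, 4, 3],
    "Mary":     [4, 4, 2, 1, 3],
    "Jim":      [1, 1, 5, 2, 5],
    "George":   [3, 4, 3, 5, 5],
    "Jennifer": [5, 2, 4, 2, 4],
    "Natalie":  [0, 5, 0, 4, 4],
    "Robin":    [5, 5, 0, 0, 4],
}
james = np.array([5, 5, 3, 1, 0])

def cosine(a, b):
    """Cosine of the angle between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank everyone by similarity to James
sims = {name: cosine(np.array(r), james) for name, r in users.items()}
for name, s in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {s:.2f}")
# Mary tops the list (0.89), and of the two shows James hasn't seen
# she rated Twin Peaks (2) over The Wire (1), so recommend Twin Peaks.
```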
Collaborative Filtering Continued
This simple cosine similarity example could be extended to extremely large datasets with hundreds or thousands of dimensions
You can also compute item-to-item similarity by treating the items as the vectors for which you're computing similarity, and the users as the dimensions:
• Allows for recommending similar items to a user after they've made a purchase
• Amazon uses a variant of this algorithm
• This is an example of item-to-item collaborative filtering, sketched below
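Item-to-item similarity is the same cosine computation with the matrix transposed, so items become the vectors and users the dimensions. A short NumPy sketch using the show-ratings matrix from earlier:

```python
import numpy as np

# Same [user, item] ratings matrix as above (7 users x 7 shows)
R = np.array([
    [5, 2, 1, 4, 3, 2, 5], [4, 4, 2, 1, 3, 1, 2], [1, 1, 5, 2, 5, 2, 3],
    [3, 4, 3, 5, 5, 4, 3], [5, 2, 4, 2, 4, 1, 0], [0, 5, 0, 4, 4, 1, 4],
    [5, 5, 0, 0, 4, 2, 2],
], dtype=float)

# Compare columns (shows) instead of rows (users)
norms = np.linalg.norm(R, axis=0)
item_sim = (R.T @ R) / np.outer(norms, norms)  # 7x7 show-to-show cosine matrix
print(np.round(item_sim, 2))
```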
Adding ROI to the Equation: an Example with Naïve Bayes
When recommending products, some may generate more margin for the firm than others
Some algorithms can take cost into account when making recommendations
Naïve Bayes is a commonly used classifier that allows for the inclusion of the marginal value of a product sale in the recommendation decision
Naïve Bayes
Bayes' theorem tells us the probability of our beliefs being true given prior beliefs and evidence.
Naïve Bayes is a classifier that utilizes Bayes' theorem (with simplifying assumptions) to generate a probability of an instance belonging to a class.
Class likelihood can be combined with expected payoff to generate the optimal payoff from a recommendation.
Naïve Bayes Continued
How does the NB algorithm generate class probabilities, and how can we use the algorithmic output to maximize expected payoff?
Let's say we want to figure out which of two products to recommend to a customer. Each product generates a different amount of profit for our firm per unit sold.
We know the target customer's past purchasing behavior, and we know the past purchasing behavior of twelve other customers who have bought one of the two potential recommendation products.
Let's represent our knowledge as a series of matrices and vectors.
Naïve Bayes Continued
Past Customer Purchasing Behavior

          Toys        Games     Candy     Books         Boat
John      Squirt Gun  Chess     Skittles  Harry Potter  Speedboat
Mary      Doll        Life      M&Ms      Emma          Speedboat
Pete      Kite        Chess     M&Ms      Twilight      Sailboat
Kevin     Squirt Gun  Life      Snickers  Emma          Sailboat
Dale      Doll        Life      Skittles  Twilight      Speedboat
Jane      Kite        Monopoly  Skittles  Twilight      Speedboat
Raquelle  Squirt Gun  Monopoly  Skittles  Harry Potter  Sailboat
Joanne    Kite        Chess     Snickers  Twilight      Speedboat
Susan     Squirt Gun  Chess     Skittles  Twilight      Sailboat
Tim       Doll        Life      M&Ms      Harry Potter  Sailboat
Larry     Kite        Chess     M&Ms      Twilight      Speedboat
Regina    Doll        Monopoly  Snickers  Harry Potter  Sailboat
Eric      Squirt Gun  Life      Snickers  Harry Potter  ?
Naïve Bayes Continued
NB uses (independent) probabilities of events to generate class probabilities.
Using Bayes' theorem (and ignoring the scaling constant), the probability of a customer with past purchase history $\alpha$ (a vector of past purchases) buying item $\theta_j$ is:

$$P(\alpha_1, \dots, \alpha_i \mid \theta_j)\, P(\theta_j)$$

where $P(\theta_j)$ is the frequency with which the item appears in the training data, and $P(\alpha_1, \dots, \alpha_i \mid \theta_j) = \prod_i P(\alpha_i \mid \theta_j)$ over all $i$ items in the training data.
That $P(\alpha_1, \dots, \alpha_i \mid \theta_j)\, P(\theta_j) = \prod_i P(\alpha_i \mid \theta_j)\, P(\theta_j)$ depends upon the assumption of conditional independence between past purchases.
Naïve Bayes Continued
In our example, we can calculate the following probabilities:

              Sailboat  Speedboat
P(θ)            6/12      6/12

              Sailboat  Speedboat
Squirt Gun      3/6       1/6
Kite            1/6       3/6
Doll            2/6       2/6
Life            2/6       2/6
Monopoly        2/6       1/6
Chess           2/6       3/6
Skittles        2/6       3/6
M&Ms            2/6       2/6
Snickers        2/6       1/6
Harry Potter    3/6       1/6
Twilight        2/6       4/6
Emma            1/6       1/6
Naïve Bayes Continued
Now that we can calculate $P(\alpha_1, \dots, \alpha_i \mid \theta_j)\, P(\theta_j)$ for all instances, let's figure out the most likely boat purchase for Eric:

Eric's purchases: Squirt Gun (Toys), Life (Games), Snickers (Candy), Harry Potter (Books), Boat = ?

           P(θ)  Squirt Gun  Life  Snickers  Harry Potter  Product
Sailboat   6/12     3/6      2/6     2/6         3/6       0.01389
Speedboat  6/12     1/6      2/6     1/6         1/6       0.00077

These probabilities may seem very low, but recall that we left out the scaling constant in Bayes' theorem, since we're only interested in the relative probabilities of the two outcomes.
So it seems like the sailboat is a slam dunk to recommend. It's much more likely (18 times!) for Eric to buy than the speedboat.
Naïve Bayes Continued
But let's consider a scenario: let's say our hypothetical firm generates $20 of profit whenever a customer buys a speedboat, but only $1 when they buy a sailboat (outboard motors are apparently very high margin).
In that case, it would make more sense to recommend the speedboat, because our expected payoff from the speedboat recommendation would be 11% greater ($20/$1 × 0.00077/0.01389) than our expected payoff from the sailboat recommendation.
This logic can be applied to any number of products, by multiplying the set of purchase probabilities by the set of purchase payoffs and taking the maximum value as the recommended item.
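The whole worked example fits in a short Python sketch (data copied from the tables above; a from-scratch illustration, not any particular library's API):

```python
from collections import Counter, defaultdict

# (purchases, boat) pairs from the "Past Customer Purchasing Behavior" table
training = [
    (("Squirt Gun", "Chess", "Skittles", "Harry Potter"), "Speedboat"),   # John
    (("Doll", "Life", "M&Ms", "Emma"), "Speedboat"),                      # Mary
    (("Kite", "Chess", "M&Ms", "Twilight"), "Sailboat"),                  # Pete
    (("Squirt Gun", "Life", "Snickers", "Emma"), "Sailboat"),             # Kevin
    (("Doll", "Life", "Skittles", "Twilight"), "Speedboat"),              # Dale
    (("Kite", "Monopoly", "Skittles", "Twilight"), "Speedboat"),          # Jane
    (("Squirt Gun", "Monopoly", "Skittles", "Harry Potter"), "Sailboat"), # Raquelle
    (("Kite", "Chess", "Snickers", "Twilight"), "Speedboat"),             # Joanne
    (("Squirt Gun", "Chess", "Skittles", "Twilight"), "Sailboat"),        # Susan
    (("Doll", "Life", "M&Ms", "Harry Potter"), "Sailboat"),               # Tim
    (("Kite", "Chess", "M&Ms", "Twilight"), "Speedboat"),                 # Larry
    (("Doll", "Monopoly", "Snickers", "Harry Potter"), "Sailboat"),       # Regina
]

class_counts = Counter(boat for _, boat in training)
item_counts = defaultdict(Counter)            # item_counts[boat][item]
for items, boat in training:
    for item in items:
        item_counts[boat][item] += 1

def nb_score(items, boat):
    """P(items | boat) * P(boat), ignoring Bayes' scaling constant."""
    score = class_counts[boat] / len(training)
    for item in items:
        score *= item_counts[boat][item] / class_counts[boat]
    return score

eric = ("Squirt Gun", "Life", "Snickers", "Harry Potter")
payoff = {"Sailboat": 1.0, "Speedboat": 20.0}  # profit per unit sold

for boat in class_counts:
    p = nb_score(eric, boat)
    print(f"{boat:9s} score={p:.5f} expected payoff={p * payoff[boat]:.5f}")
# Sailboat is ~18x more likely, but the speedboat's expected payoff is ~11% higher
```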
Challenges
While recommendation algorithms are in many cases relatively simple as machine learning goes, there are a couple of difficult problems that all recommenders must deal with:
• Cold start problem: how do you make recommendations to someone for whom you have very little or no data?
• Data sparsity: with millions of items for sale, most customers have bought very few individual items
• Grey and black sheep problem: some people have very idiosyncratic taste, and making recommendations to them is extremely difficult because they don't behave like other customers
Dealing With Cold Start
Typically only a problem in the very early stages of a user-system interaction.
Requiring creation of a profile for new users can mitigate the problem to a certain extent, by making early recommendations contingent upon supplied personal data.
A recommender system can also start out using item-item recommendations based upon the first items a user buys, and gradually change over to a person-person system as the system learns the user's taste.
Dealing With Data Sparsity
Data sparsity can be dealt with primarily by two methods:
• Data imputation
• Latent factor methods
Data imputation typically uses an algorithm like cosine similarity to impute the rating of an item based upon the ratings of similar users.
Latent factor methods typically use some sort of matrix decomposition to reduce the rank of the large, sparse matrix while simultaneously adding ratings for unrated items based upon latent factors.
Dealing With Data Sparsity (cont.)
• Techniques like principal components analysis/singular value decomposition allow for the creation of low-rank approximations to sparse matrices with relatively little loss of information, as sketched below
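As a rough illustration of the latent factor idea, the sketch below builds a low-rank approximation of the ratings matrix from the earlier cosine-similarity example with a truncated SVD (NumPy only; rank 2 is an arbitrary choice for the sketch):

```python
import numpy as np

# User x show ratings matrix from the earlier example (0 = unrated/low)
R = np.array([
    [5, 2, 1, 4, 3, 2, 5],   # Bob
    [4, 4, 2, 1, 3, 1, 2],   # Mary
    [1, 1, 5, 2, 5, 2, 3],   # Jim
    [3, 4, 3, 5, 5, 4, 3],   # George
    [5, 2, 4, 2, 4, 1, 0],   # Jennifer
    [0, 5, 0, 4, 4, 1, 4],   # Natalie
    [5, 5, 0, 0, 4, 2, 2],   # Robin
], dtype=float)

# Truncated SVD: keep only the top-k singular values/vectors
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# R_approx is a rank-k estimate of every user/show cell, including
# cells that were zero; those estimates can serve as imputed ratings
print(np.round(R_approx, 1))
```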
Dealing With Sheep of Varying Darkness
To a large extent, these cases are unavoidable
Feedback on recommended items post purchase, as well as the purchase rate of recommended items, can be used to learn even very idiosyncratic preferences, but learning them takes longer than for a normal user.
Grey and black sheep are doubly troublesome because their odd tendencies can also weaken your engine's ability to make recommendations to the broad population of white sheep.
References
A good survey of recommendation techniques
Matrix factorization for use in recommenders
Article on the BellKor solution to the Netflix challenge
Article on Amazon's recommendation engine
sqrrl and Accumulo
Presented by: John Dougherty, CIO, Viriton
5/21/2013
Which NoSQL solution?
There are a lot of places to fit sqrrl and Accumulo.
What is sqrrl?
Based on Accumulo
A proven, secure, multi-tenant data platform for building real-time applications
Scales elastically to tens of petabytes of data and enables organizations to eliminate their internal data silos
Seamless integration with Hadoop and most of its variants
Meets a much-needed demand for ground-up security
Already deployed and utilized by the defense and government sectors
A history of sqrrl
A sqrrl’s architecture
What is Accumulo?
Development began at the NSA in 2008
The base foundation for sqrrl
Cell-level security reduces the cost of app development, circumventing complex, sometimes impossible, legal or policy restrictions
Provides the ability to scale to petabyte levels and beyond
Highly adaptive schema and sorted key/value paradigm
Stores key/value pairs parsed, sorted, and under fine-grained security controls
Where does Accumulo fit?
How does Accumulo provide security?
Security labels are applied to keys
Cell-level security is implemented to allow for security policy enforcement, using data labeler tags
These policies are applied when data is ingested
Tablets contain the data and are controlled using security policies
Key/value pairs are stored parsed and sorted under a 5-tuple key system with per-key access controls
Accumulo Security (cont.)
Why Cell-Level Security Is Important:
Many databases insufficiently implement security through row- and column-level restrictions. Column-level security is only sufficient when the data schema is static, well known, and aligned with security concerns. Row-level security breaks down when a single record conveys multiple levels of information. The flexible, fine-grained cell-level security within Sqrrl Enterprise (or Accumulo, at its root) supports flexible schemas, new indexing patterns, and greater analytic adaptability at scale.
Accumulo Security (cont.)
An Accumulo key is a 5-tuple, consisting of:
• Row: controls atomicity
• Column Family: controls locality
• Column Qualifier: controls uniqueness
• Visibility Label: controls access
• Timestamp: controls versioning
Keys are sorted:
• Hierarchically: row first, then column family, and so on
• Lexicographically: compare first byte, then second, and so on
(Values are byte arrays)
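A toy model of that key structure and sort order in Python (a conceptual sketch only, not Accumulo's actual Java API; the example keys are invented):

```python
from typing import NamedTuple

class Key(NamedTuple):
    """The Accumulo 5-tuple key, fields in sort-priority order."""
    row: bytes               # atomicity
    column_family: bytes     # locality
    column_qualifier: bytes  # uniqueness
    visibility: bytes        # access
    timestamp: int           # versioning

keys = [
    Key(b"user#42", b"purchases", b"2013-05-01", b"analyst", 2),
    Key(b"user#42", b"profile", b"name", b"public", 1),
    Key(b"user#7", b"profile", b"name", b"public", 1),
]

# Tuples of bytes compare field-by-field, byte-by-byte, which mirrors the
# hierarchical + lexicographic ordering described above.
# (Real Accumulo sorts timestamps descending so newest versions come first.)
for k in sorted(keys):
    print(k.row, k.column_family, k.column_qualifier)
```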
Accumulo Security (cont.)
An example of column usage
Accumulo Architecture
Accumulo servers (tablet servers) utilize a multitude of big data technologies, but their layout is different from Map/Reduce, HDFS, MongoDB, Cassandra, etc. used alone.
Data is stored in HDFS
Zookeeper is utilized for configuration management
Password-less SSH is used for node configuration
An emphasis, more of an imperative, on the data model and data model design
Accumulo Architecture (cont.)
Tablets
• Partitions of tables: collections of sorted key/value pairs
• Held and managed by Tablet Servers
Accumulo Architecture (cont.)
Tablet Servers
• Receive writes and respond to reads from clients
• Write to a write-ahead log, sorting new key/value pairs in memory, while periodically flushing sorted key/value pairs to new files in HDFS
• Managed by the Master, which is responsible for detecting and responding to Tablet Server failure, load balancing, and coordinating startup, graceful shutdown, and recovery of write-ahead logs
Zookeeper
• An Apache project, open source
• Utilized as a distributed locking mechanism with no single point of failure
Integration with users/access
1. Gather an organization's information security policies and dissect them into data-centric and user-centric components.
2. As data is ingested into Accumulo, a data labeler tags individual key/value pairs with the appropriate data-centric visibility labels based on these policies.
3. Data is then stored in Accumulo, where it is available for real-time queries by operational applications. End users are authenticated through these applications and authorized to access underlying data.
4. As an end user performs an operation via the app (e.g., performs a search request), the visibility label on each candidate key/value pair is checked against his or her attributes, and only the data that he or she is authorized to see is returned.
The visibility labels are a feature that is unique to Accumulo. No other database can apply access controls at such a fine-grained level.
Labels are generated by translating an organization’s existing data security and information sharing policies into Boolean expressions
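As a toy illustration, such a label check against a user's authorizations might look like the following Python sketch (conceptual only; Accumulo's real evaluator is its Java ColumnVisibility class, and the policy shown is invented):

```python
import re

def satisfies(expression: str, user_auths: set) -> bool:
    """Evaluate a Boolean visibility expression like
    '(admin & audit) | superuser' against a user's authorizations."""
    # Replace each label token with True/False, then evaluate the expression
    tokens = re.sub(
        r"[A-Za-z_][A-Za-z0-9_]*",
        lambda m: str(m.group(0) in user_auths),
        expression,
    )
    return eval(tokens.replace("&", " and ").replace("|", " or "))

# A policy like "visible to auditors who are admins, or to superusers"
label = "(admin & audit) | superuser"
print(satisfies(label, {"admin", "audit"}))  # True
print(satisfies(label, {"admin"}))           # False
```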
Making sqrrl work
sqrrl's extensions to Accumulo allow it to process millions of records per second, as either static or streaming objects
These records are converted into hierarchical JSON documents, giving sqrrl document store capabilities
Passing this data to the analytics layer is designed to make integration and development of real-time analytics possible and accessible
Combining Accumulo's cell-level access controls, sqrrl integrates with Identity and Access Management (IAM) systems (LDAP, RADIUS, etc.)
Making sqrrl work (cont.)
Sqrrl process
• Data Ingest: JSON, or graph format
• HDFS: file storage system, compatible with both open source (OSS) and commercial versions
• Apache Accumulo: the core of transactional and online analytical data processing in sqrrl
• Apache Thrift: enables development in diverse language choices
• Apache Lucene: custom iterators, providing developers with real-time capabilities, such as full-text search, graph analysis, and statistics, for analytical applications and dashboards
Who is sqrrl for?
CTOs/CIOs: Unlock the value in fractured and unstructured datasets across your organization
Developers: More easily create apps on top of Big Data and distributed databases
Infrastructure Managers: Simplify administration of Big Data through highly scalable and multitenant distributed systems
Data Analysts: Dig deeper into your data using advanced analytical techniques, such as graph analysis
Business Users: Use Big Data seamlessly via apps developed on top of sqrrl enterprise
sqrrl/Accumulo wrap-up
Accumulo bridges the security gap that restricts a large swath of industries
Accumulo Setup:
1. HDFS and ZooKeeper must be installed and configured
2. Password-less SSH should be configured between all nodes (especially master <> tablet)
3. Install Accumulo (from http://accumulo.apache.org/downloads/ using http://accumulo.apache.org/1.4/user_manual/Administration.html#Installation)
Or get started using their AMI (http://www.sqrrl.com/downloads#getting-started)
sqrrl combines the best of available technologies, develops and contributes its own, and designs big apps for big data.