Digital Trails (Dave King, 1/5/10): Part 2
TRANSCRIPT
Collective Intelligence (CI): Defined
• The intelligence that’s extracted out from the collective set of interactions and contributions made by your users.
• The use of this intelligence to act as a filter for what’s valuable in your application for a user
Source: Alag, S. Collective Intelligence in Action. Manning Press (2009)
Collective Intelligence: Explicit Resources
Ways to Harness CI
Source: Alag, S. Collective Intelligence in Action. Manning Press (2009)
CI Requirements
You need to:
1. Allow users to interact with your site and with each other, learning about each user through their interactions and contributions.
2. Aggregate what you learn about your users and their contributions using some useful models.
3. Leverage those models to recommend relevant content to a user (yielding higher retention & completion rates).
Source: Alag, S. Collective Intelligence in Action. Manning Press (2009)
Forms of CI Data
• Data comes in two forms: structured data and unstructured data.
– Structured data has a well-defined form that makes it easy to store and query (e.g. user ratings, content articles viewed, and items purchased).
– Unstructured data is typically raw text (e.g. reviews, discussion forum posts, blog entries, and chat sessions).
• Most applications transform unstructured data into structured data.
Source: Alag, S. Collective Intelligence in Action. Manning Press (2009)
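As a sketch of that transformation, a raw (unstructured) review can be reduced to a structured bag-of-words count; the review text below is invented for illustration:

```python
from collections import Counter
import re

def to_word_counts(text):
    """Turn raw (unstructured) text into a structured bag-of-words count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

counts = to_word_counts("Great phone. The battery is great, the screen is OK.")
# counts["great"] == 2, counts["battery"] == 1
```

Once text is in this tabular form, it can be stored and queried like any other structured data.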
CI Data Model
Most applications generally consist of users and items. An item is any entity of interest in your application. If your application is a social-networking application, or you're looking to connect one user with another, then a user is also a type of item.
Source: Alag, S. Collective Intelligence in Action. Manning Press (2009)
Users
Metadata
Items
Classification of Recommender Engines
Non-Personalized Collaboration
Non-personalized recommendations are identical for each user. The recommendations are either manually selected (e.g. editor choices) or based on the popularity of items (e.g. average ratings, sales data).
Non-Personalized: Example
Demographic Recommendation
The users are categorized based on the attributes of their demographic profiles in order to find users with similar features. The engine then suggests or recommends (explicitly or implicitly) items that are preferred by these similar users.
Demographic Recommendation: Example
Demographic Recommendation: Mystery Movie
Demographic Recommendation: Guilt by Association
• Advantages
– New users can get recommendations before they have rated any item.
– Technique is domain independent.
• Limitations
– Gathering the required demographic data leads to privacy issues.
– Demographic classification is too crude for highly personalized recommendations.
– Users with an unusual taste may not get good recommendations (the "gray sheep" problem).
– Once established, user preferences do not change easily (the stability vs. plasticity problem).
Collaborative Filtering
• Employs user-item ratings (or votes) as its information source. The concept is to make correlations between users or between items.
• The correlations are used to predict user behavior and make recommendations.
• Widely implemented and the most mature recommendation technique.
Collaborative Filtering: Example
Collaborative Filtering: Main Approaches
• User-based
• Item-based
• Model-based
Collaborative Filtering: User-Based
• Assumes that users who rated the same items similarly probably have the same taste.
• It makes user-to-user correlations by using the rating profiles of different users to find highly correlated users.
• These users form like-minded neighborhoods based on their shared item preferences.
• The engine then can recommend the items preferred by the other users in the neighborhood.
     j1  j2  j3  j4  j5  i
u     1   2   3   4   5  ?
v1    1   2   3   4   5  5
v2    5   4   3   2   1  1
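A minimal sketch of the user-based idea on this toy matrix, using Pearson correlation (one common similarity choice) to show that v1 is the like-minded neighbor of u, so u's missing rating for i would be predicted from v1's rating of 5:

```python
def pearson(a, b):
    """Pearson correlation between two rating vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den else 0.0

u  = [1, 2, 3, 4, 5]
v1 = [1, 2, 3, 4, 5]   # rated the shared items exactly like u
v2 = [5, 4, 3, 2, 1]   # rated the shared items in the opposite way

# pearson(u, v1) is +1.0 and pearson(u, v2) is -1.0, so a user-based
# engine would take v1's rating of item i (5) as the prediction for u.
```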
Collaborative Filtering: Similarity?
Mathematical concept analogous to the notion of Euclidean Distance
[Figure: points A, B, and C plotted on a grid with axes running 0–3; the Euclidean distance from A to B is 1 and from A to C is 2.24.]
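The distances in the figure can be reproduced directly; the coordinates chosen below for A, B, and C are assumptions picked to match the 1 and 2.24 values shown:

```python
from math import hypot

def euclidean(p, q):
    """Euclidean distance between two 2-D points."""
    return hypot(p[0] - q[0], p[1] - q[1])

A, B, C = (1, 1), (1, 2), (3, 2)   # hypothetical coordinates matching the figure

round(euclidean(A, B), 2)  # 1.0
round(euclidean(A, C), 2)  # 2.24  (i.e. sqrt(5))
```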
Collaborative Filtering: Similarity?
• Cosine
• Correlation
• Adjusted Cosine
Collaborative Filtering: Cosine Similarity (an Example)
Step 1: For each row of scores, find the square root of the sum of squares.

        Photo1  Photo2  Photo3    Sum Sq  Sqrt(Sum Sq)
Jane    2       2       4         24      4.90
John    3       4       2         29      5.39
Doe     1       3       5         35      5.92

Step 2: Divide each score in the row by the square root of its sum of squares.

        Photo1  Photo2  Photo3
Jane    0.41    0.41    0.82
John    0.56    0.74    0.37
Doe     0.17    0.51    0.85

Step 3: Calculate the cosine similarity between users by summing the cross-products of their normalized scores (from Step 2).

        Jane    John    Doe
Jane    1.00    0.83    0.97
John    0.83    1.00    0.78
Doe     0.97    0.78    1.00
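The three steps above can be sketched directly in code; this reproduces the similarity matrix from the example:

```python
from math import sqrt

ratings = {
    "Jane": [2, 2, 4],
    "John": [3, 4, 2],
    "Doe":  [1, 3, 5],
}

def normalize(v):
    norm = sqrt(sum(x * x for x in v))   # Step 1: sqrt of sum of squares
    return [x / norm for x in v]         # Step 2: divide each score by it

def cosine(a, b):
    # Step 3: sum the cross-products of the normalized scores
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

round(cosine(ratings["Jane"], ratings["John"]), 2)  # 0.83
round(cosine(ratings["Jane"], ratings["Doe"]), 2)   # 0.97
round(cosine(ratings["John"], ratings["Doe"]), 2)   # 0.78
```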
Collaborative Filtering: User-Based Predictions and Recommendations
Critic          Similarity  Night  Sim*Night  Lady  Sim*Lady  Luck  Sim*Luck
Rose            0.99        3.0    2.97       2.5   2.475     3     2.97
LaSalle         0.92        3.0    2.76       3     2.76      2     1.84
Puig            0.89        4.5    4.01       -     -         3     2.67
Matthews        0.66        3.0    1.98       3     1.98      -     -
Seymour         0.38        3.0    1.14       3     1.14      1.5   0.57
Total                              12.86             8.355           8.05
Sim Sum                     3.84              2.95              3.18
Total/Sim Sum                      3.35              2.83            2.53

(The Sim Sum for each movie excludes critics with no rating for it: Puig for Lady, Matthews for Luck.)
              Lady in the Water  Snakes on a Plane  Just My Luck  Superman Returns  You, Me and Dupree  The Night Listener
Lisa Rose     2.5                3.5                3.0           3.5               2.5                 3.0
Gene Seymour  3.0                3.5                1.5           5.0               3.5                 3.0
Claudia Puig  -                  3.5                3.0           4.0               2.5                 4.5
Mick LaSalle  3.0                4.0                2.0           3.0               2.0                 3.0
Jack Matthews 3.0                4.0                -             5.0               3.5                 3.0
Toby          ?                  4.5                ?             4.0               1.0                 ?
N             6                  6                  4             7                 7                   6
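The prediction row (Total / Sim Sum) is a similarity-weighted average of the neighbors' ratings. A minimal sketch, using the Night Listener column of the prediction table (`predict` is an illustrative helper name, not from the source):

```python
def predict(neighbors):
    """Weighted-average prediction from (similarity, rating) pairs
    for the users who actually rated the item."""
    num = sum(sim * r for sim, r in neighbors)   # Total = sum of Sim*Rating
    den = sum(sim for sim, _ in neighbors)       # Sim Sum over raters only
    return num / den

# (similarity to Toby, rating of The Night Listener) for each critic
night = [(0.99, 3.0), (0.92, 3.0), (0.89, 4.5), (0.66, 3.0), (0.38, 3.0)]

round(predict(night), 2)  # 3.35 -- matches Total/Sim Sum in the table
```

Dividing by the sum of similarities (rather than the number of raters) keeps the prediction on the original rating scale while weighting like-minded critics more heavily.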
Collaborative Filtering: User-Based Disadvantages
• Cold Start: What do you do with users who have no or few ratings?
• Sparsity: What do you do if there is little overlap in user ratings across users in the data set?
• Scale: What if there are millions of users? Does this scale well as the number of comparisons increases?
• Real-time: How do you do these calculations in real time?
Collaborative Filtering: Item-Based Example
www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf
Amazon.com has more than 29 million customers and several million catalog items. Other major retailers have comparably large data sources. While all this data offers opportunity, it’s also a curse, breaking the backs of algorithms designed for data sets three orders of magnitude smaller. Almost all existing algorithms were evaluated over small data sets.
Collaborative Filtering: Item-Based
• Assumes that items rated similarly are probably similar.
• Compares items based on the shared appreciation of users, in order to create neighborhoods of similar items.
• The engine then recommends the neighboring items of the user’s known preferred ones.
     j1  j2  j3  j4  j5  i
u     1   2   3   4   5  ?
v1    1   2   3   4   5  5
v2    5   4   3   2   1  1

Item-based: item i is more similar to j5 than to the other items, so predict ? = 5.
Collaborative Filtering: Item-Based Example
Movie       Rating (R)  Night  R*Night  Lady   R*Lady  Luck   R*Luck
Snakes      4.5         0.182  0.819    0.222  0.999   0.105  0.473
Superman    4.0         0.103  0.412    0.091  0.364   0.065  0.260
Dupree      1.0         0.148  0.148    0.400  0.400   0.182  0.182
Total                   0.433  1.379    0.713  1.763   0.352  0.915
Normalized                     3.185           2.473          2.598
                    Lisa Rose  Gene Seymour  Michael Phillips  Claudia Puig  Mick LaSalle  Jack Matthews  Toby
Lady in the Water   2.5        3.0           2.5               3.5           3.0           3.0            -
Snakes on a Plane   3.5        3.5           3.0               -             4.0           4.0            4.5
Just My Luck        3.0        1.5           -                 3.0           2.0           -              -
Superman Returns    3.5        5.0           3.5               4.0           3.0           5.0            4.0
You, Me and Dupree  2.5        3.5           4.0               2.5           2.0           3.5            1.0
The Night Listener  3.0        3.0           4.5               4.5           3.0           3.0            -
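The "Normalized" row is the item-based analogue of the user-based weighted average: each of Toby's known ratings is weighted by that item's similarity to the target movie, then divided by the sum of similarities. A sketch for The Night Listener column (`item_based_predict` is an illustrative name, not from the source):

```python
def item_based_predict(user_ratings, sims):
    """user_ratings: item -> the user's known rating.
    sims: item -> that item's similarity to the target item."""
    num = sum(user_ratings[i] * sims[i] for i in sims)   # Total of R*sim
    den = sum(sims.values())                             # Total of sims
    return num / den

toby = {"Snakes": 4.5, "Superman": 4.0, "Dupree": 1.0}
sims_night = {"Snakes": 0.182, "Superman": 0.103, "Dupree": 0.148}

round(item_based_predict(toby, sims_night), 3)  # 3.185 -- the Normalized value
```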
Collaborative Filtering: Item-Based Advantages
• Scalable: More scalable than the user-based approach because correlations are drawn among a limited number of products, instead of a potentially very large number of users.
• Sparsity: Because the number of items is naturally smaller than the number of users, the item-based approach has a reduced sparsity problem in comparison to the user-based approach.
Collaborative Filtering: There's Money in CF – The Netflix Prize
Collaborative Filtering: Netflix Prize
Collaborative Filtering: Group Lens Rating Data Sets for Testing
MovieLens, Wikilens (Beers), Book-Crossing, Jester Joke, HP EachMovie
Collaborative Filtering: Other Applications

Metadata

Item or User  Value1  Value2  Value3  Value4  …
I or U1       n11     n12     n13     n14
I or U2       n21     n22     n23     n24
I or U3       n31     n32     n33     n34
I or U4       n41     n42     n43     n44
…

Anything that can be represented in matrix form, where n is a number representing a nominal (e.g. 0/1 for present/absent), ordinal, interval, or ratio value.

Tags

Item   Tag1  Tag2  Tag3  Tag4  …
Item1  0     1     1     0
Item2  1     1     0     0
Item3  1     0     1     1
Item4  1     0     0     1
…

Word Counts

Text Entry  Word1  Word2  Word3  Word4  …
Entry1      1      2      5      0
Entry2      6      2      1      1
Entry3      2      5      1      1
Entry4      4      0      0      5
…

Senders or Recipients

Item   S or R1  S or R2  S or R3  S or R4  …
Item1  0        1        1        0
Item2  1        1        0        0
Item3  1        0        1        1
Item4  1        0        0        1
…

Uploads, Downloads, or Bookmarks

Users  Item1  Item2  Item3  Item4  …
User1  1      0      0      1
User2  0      0      0      1
User3  1      0      1      0
User4  0      1      1      0
…
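Matrices like these are easy to build from raw interaction data. A sketch constructing a binary item-by-tag matrix of the kind shown above (the tag names and items are invented for illustration):

```python
def tag_matrix(item_tags, vocabulary):
    """Build a binary item x tag matrix: 1 if the item carries the tag, else 0."""
    return [[1 if tag in tags else 0 for tag in vocabulary]
            for tags in item_tags]

vocab = ["Tag1", "Tag2", "Tag3", "Tag4"]
items = [{"Tag2", "Tag3"},   # Item1
         {"Tag1", "Tag2"}]   # Item2

tag_matrix(items, vocab)  # [[0, 1, 1, 0], [1, 1, 0, 0]]
```

Once in matrix form, the same similarity and prediction machinery shown for ratings applies to tags, word counts, senders/recipients, or bookmarks.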
CI from Content: Text Mining Defined
• Text mining (also known as text data mining or knowledge discovery in textual databases) is the semi-automated process of extracting patterns (useful information and knowledge) from large amounts of unstructured data sources.
– Information extraction. Identification of key phrases and relationships within text by looking for predefined sequences in text via pattern matching.
– Topic tracking. Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user.
– Summarization. Summarizing a document to save time on the part of the reader.
– Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes.
– Clustering. Grouping similar documents without having a predefined set of categories.
– Concept linking. Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods.
– Question answering. Finding the best answer to a given question through knowledge-driven pattern matching.
CI from Content: Resources
• Ronen Feldman, “Information Extraction: Theory and Practice,” Bar-Ilan University, ISRAEL, u.cs.biu.ac.il/~feldman/icml_tutorial.html
• Seth Grimes, “Text Analytics for BI/DW Practitioners.” altaplana.com/TDWI2008Aug-TextAnalyticsForBIDWPractitioners.pdf
• Bing Liu, “Opinion Mining & Summarization – Sentiment Analysis.” April 2008. cs.uic.edu/~liub/FBS/opinion-mining-sentiment-analysis.pdf
CI from Content: Some Interesting Data Sets for Research and Training
• Natural Language Toolkit (NLTK)
– nltk.org: Diverse set of "corpora," used in conjunction with Natural Language Processing with Python.
– Data set description: nltk.googlecode.com/svn/trunk/nltk_data/index.xml
• Enron Email
– 500K Enron emails sent primarily by sr. managers over a 3.5 year period covering the height of the scandal. There are multiple versions of the set, including database versions.
– cs.cmu.edu/~enron/
• 9/11 Pager Messages
– Approximately 500K messages sent in and around the WTC area before, during, and after the attacks
– 911.wikileaks.org/release/messages.zip
• Web Site APIs
– Del.icio.us
– Technorati
– Twitter
• Web Sites Devoted to Data Sets
– http://www.datawrangling.com/some-datasets-available-on-the-web
– http://blog.jonudell.net/2007/07/05/show-me-the-data/
CI from Content: Text Mining Process
CI from Content: Preparing Data for Term Document Matrix
• Tokenization — Parse the text to generate terms. Sophisticated analyzers can also extract phrases from the text.
• Normalize — Convert terms to lowercase.
• Eliminate stop words — Eliminate terms that appear very often (e.g. the).
• Stemming — Convert the terms into their stemmed form (e.g. remove plurals).
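The four steps above can be sketched as a single pipeline. The stop-word list is a tiny illustrative sample, and the stemmer is a crude plural-stripper standing in for a real one such as Porter's:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "of", "to"}   # tiny illustrative stop list

def stem(word):
    """Crude plural stemmer, for illustration only; real systems use e.g. Porter."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def prepare(text):
    tokens = re.findall(r"[a-z]+", text.lower())           # tokenize + normalize
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal
    return [stem(t) for t in tokens]                       # stemming

prepare("The cats chased the dogs")  # ['cat', 'chased', 'dog']
```

The resulting term lists feed directly into the term-document matrix.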
CI for Unstructured Contents: Analyzing Blogs
CI for Unstructured Contents: Analyzing Blogs (RSS Feed)
CI for Unstructured Contents: Analyzing Blogs (Source of RSS Feed)
CI for Unstructured Contents: Structure of an “Atom” Feed
CI for Unstructured Contents: Preparing Blog RSS Feed for Analysis
[Diagram: pipeline for preparing a collection of blog RSS feeds (Entry1, Entry2, …) for analysis]
1. Access & parse each feed, and retrieve the contents of each entry → a collection of entry contents (HTML) for each blog.
2. Normalize, remove stop words, and stem → a list of word stems for each entry.
3. Compute the word/stem counts for each word in each collection, and compute word counts for each blog by summing word counts across entries → a matrix of word counts for each blog.
4. Select words for analysis based on word counts → a subset of words for analysis, yielding the final matrix of word counts by blog.
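A condensed sketch of this pipeline using only the standard library. The two-entry Atom feed is inlined as a stand-in for a fetched RSS/Atom feed, and stop-word removal and stemming are omitted for brevity:

```python
from collections import Counter
import re
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Stand-in for a feed fetched over HTTP
feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <entry><title>Post one</title><summary>mining blog text</summary></entry>
  <entry><title>Post two</title><summary>more blog text here</summary></entry>
</feed>"""

# Steps 1-2: parse the feed and retrieve the content of each entry
root = ET.fromstring(feed_xml)
entries = [e.findtext(ATOM + "summary") for e in root.iter(ATOM + "entry")]

# Steps 3-4: normalize/tokenize each entry, then sum word counts across entries
counts = Counter()
for text in entries:
    counts.update(re.findall(r"[a-z]+", text.lower()))

counts["blog"]  # 2 -- "blog" appears once in each entry
```

One such `Counter` per blog forms a row of the word-count-by-blog matrix.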
CI for Unstructured Contents: Word Counts for Collection of Blogs
CI from Content: Data Mining applied to Prepared Text Data
• http://www.kdnuggets.com/index.html?lg
CI for Unstructured Contents: Blog Dendrogram
CI for Unstructured Contents: Blog Results for K-Means Clustering
[Table: blogs assigned to ten k-means clusters (Cluster 1 through Cluster 10). Members include 'Online Marketing', 'The Superficial', 'Publishing 2.0', 'Mashable!', 'Download Squad', 'Blog Maverick', 'Joystiq', 'BuzzMachine', 'Gizmodo', 'Wonkette', 'GigaOM', 'Engadget', 'Lifehacker', 'Techdirt', 'Daily Kos', 'Boing Boing', 'Slashdot', 'Michelle Malkin', 'TechCrunch', and many other popular blogs; the per-cluster column layout was lost in extraction.]
CI from Content: Simple Example “We Feel Fine”
• Scours the Internet every ten minutes, harvesting human feelings from a large number of blogs (generally identifying and saving between 15,000 and 20,000 feelings per day).
• Scans blog posts for sentences with the phrases "I feel" and "I am feeling", extracts the sentence, and looks to see if it includes one of about 5,000 pre-identified "feelings". If a valid feeling is found, the sentence is said to represent one person who feels that way.
• The URL format of many blog posts can be used to extract the username of the post's author, which in turn is used to extract the age, gender, country, state, and city of the blog's owner.
• Given the country, state, and city, the local weather conditions for that city at the time the post was written can also be retrieved. As much of this information as possible is extracted and saved along with the post.
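The "I feel" extraction step can be sketched with a small regex; the feeling list here is a tiny stand-in for the project's roughly 5,000 pre-identified feelings, and the helper name is illustrative:

```python
import re

FEELINGS = {"happy", "sad", "fine", "anxious"}   # stand-in for the ~5,000-term list

def extract_feelings(post):
    """Find 'I feel ...' / 'I am feeling ...' sentences containing a known feeling."""
    found = []
    for sentence in re.split(r"[.!?]", post):
        m = re.search(r"\bI (?:feel|am feeling)\b(.*)", sentence, re.IGNORECASE)
        if m:
            words = set(re.findall(r"[a-z]+", m.group(1).lower()))
            found.extend(words & FEELINGS)   # keep only recognized feelings
    return found

extract_feelings("Long day. I feel pretty anxious about tomorrow!")  # ['anxious']
```

Each match counts as one person feeling that way, exactly as the description above states.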
CI from Content: Simple Example “We Feel Fine” Visualizations
Visualization "movements": Madness, Murmurs, Montage, Mounds, Metrics, Mobs
CI from Content: 9/11 Pager Data
2001-09-11 08:52:46 Skytel [002386438] B ALPHA [email protected]||Reports of a plane crash near World Trade Center - no more details at this point. WNBC's LIVE pix - Network working on coverage.
Dataveillance: Roger Clarke
• The systematic use of personal data systems in the investigation or monitoring of the actions or communications of one or more persons.
• The terms personal surveillance and mass surveillance are commonly used, but seldom defined.
– Personal surveillance is the surveillance of an identified person. In general, a specific reason exists for the investigation or monitoring.
– Mass surveillance is the surveillance of groups of people, usually large groups. In general, the reason for investigation or monitoring is to identify individuals who belong to some particular class of interest to the surveillance organization.
Dataveillance: Resources
Dataveillance

[Diagram: Internet & other communication data sources feeding into data mining & social network analysis.]

Example data sources: ChoicePoint (17B records), Acxiom, Equifax (400M credit holders), Experian, …
Issues with Privacy and Dataveillance
• Worldwide Privacy Protection
– There is a tangled matrix of laws and regulations worldwide governing the privacy and protection of this data.
– Anytime we interact on the Web we're likely to cross a number of jurisdictions.
• Widely held belief that the data produced from our activities is protected.
• Widely held belief among internet users that it's hard to identify or link specific traces & trails with specific individuals.
• Law enforcement and intelligence agencies (worldwide) are persistent in their requests for internet and communications data.
Re-Identifiability of Information
• Deals with the linkage of datasets without explicit identifiers such as name and address.
• Examples of re-identification:
– A large portion of the US population can be re-identified using a combination of 5-digit ZIP code, gender, and date of birth.
– AOL case 4417749 (2006 release of 20 million search queries of over 650,000 users).
– CMU study of predicting SSNs: it is possible to guess many, if not all, of the nine digits in an individual's Social Security number using publicly available information (about location and birth date).
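The ZIP/gender/date-of-birth attack is just a join on shared quasi-identifiers. A minimal sketch with fabricated records (all names and values below are invented for illustration):

```python
# A "de-identified" medical table and a public voter roll share the
# quasi-identifiers (ZIP, gender, date of birth); joining on them
# re-identifies the supposedly anonymous record.
medical = [
    {"zip": "02138", "gender": "F", "dob": "1945-07-01", "diagnosis": "X"},
]
voters = [
    {"zip": "02138", "gender": "F", "dob": "1945-07-01", "name": "Jane Q. Voter"},
    {"zip": "02139", "gender": "M", "dob": "1970-01-02", "name": "John Q. Voter"},
]

def reidentify(medical, voters):
    key = lambda r: (r["zip"], r["gender"], r["dob"])
    lookup = {key(v): v["name"] for v in voters}
    return [(lookup.get(key(m)), m["diagnosis"]) for m in medical]

reidentify(medical, voters)  # [('Jane Q. Voter', 'X')]
```

The join needs no names or addresses in the medical table at all, which is exactly why removing explicit identifiers is not enough to anonymize a dataset.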