![Page 1: CS246: Mining Massive Datasets Jure Leskovec, ...snap.stanford.edu/class/cs246-2012/slides/18-review.pdf1. Shingling: convert docs to sets 2. Minhashing: convert large sets to short](https://reader033.vdocuments.mx/reader033/viewer/2022060920/60ac5b5c4fe3ea69904344d4/html5/thumbnails/1.jpg)
CS246: Mining Massive Datasets Jure Leskovec, Stanford University
http://cs246.stanford.edu
Discovery of patterns and models that are:
- Valid: hold on new data with some certainty
- Useful: should be possible to act on the item
- Unexpected: non-obvious to the system
- Understandable: humans should be able to interpret the pattern

Subsidiary issues:
- Data cleansing: detection of bogus data
- Visualization: something better than MBs of output
- Warehousing of data (for retrieval)
3/13/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2
Overlaps with machine learning, statistics, artificial intelligence, databases, and visualization, but with more stress on:
- scalability of number of features and instances
- algorithms and architectures
- automation for handling large data

[Figure: Data Mining shown overlapping with Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]
Apriori, MapReduce, Association rules, Frequent itemsets, PCY, Recommender systems, PageRank, TrustRank, HITS, SVM, Decision Trees, Perceptron, Web Advertising, DGIM method, Margin, BFR, Locality Sensitive Hashing, Random hyperplanes, MinHash, SVD, CUR, Clustering, Matrix factorization, Bloom filters, Flajolet-Martin method, CURE, Stochastic Gradient Descent, Collaborative Filtering, SimRank, Spectral clustering, AND-OR constructions, K-means
Based on different types of data:
- Data is high dimensional
- Data is a graph
- Data is never-ending
- Data is labeled

Based on different models of computation:
- MapReduce
- Streams and online algorithms
- Single machine in-memory
Based on different applications:
- Recommender systems
- Association rules
- Link analysis
- Duplicate detection

Based on different “tools”:
- Linear algebra (SVD, Rec. Sys., Communities)
- Optimization (stochastic gradient descent)
- Dynamic programming (frequent itemsets)
- Hashing (LSH, Bloom filters)
- High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
- Graph data: PageRank, SimRank; Community Detection; Spam Detection
- Infinite data: Filtering data streams; Web advertising; Queries on streams
- Machine learning: SVM; Decision Trees; Perceptron, kNN
- Apps: Recommender systems; Association Rules; Duplicate document detection
- Data is high-dimensional: Locality Sensitive Hashing; Dimensionality reduction; Clustering
- Data is a graph: Link Analysis (PageRank, TrustRank, Hubs & Authorities)
- Data is labeled (Machine Learning): kNN, Perceptron, SVM, Decision Trees
- Data is infinite: Mining data streams; Advertising on the Web
- Applications: Association Rules; Recommender systems
1. Shingling: convert docs to sets
2. Minhashing: convert large sets to short signatures, while preserving similarity
3. Locality-sensitive hashing: focus on pairs of signatures likely to be similar

[Figure: pipeline. Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets, and reflect their similarity → Locality-sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity]
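The first two steps of this pipeline (shingling and minhashing) can be sketched in a few lines. This is a minimal illustration, not the course's reference code: the function names, the use of CRC32 to turn shingles into integers, the linear hash family standing in for random row permutations, and the two sample documents are all assumptions made for the example. The banding step of locality-sensitive hashing is not shown.

```python
import random
import zlib

PRIME = 4_294_967_311  # a prime just above 2**32, so every CRC32 value fits below it

def shingles(text, k=5):
    """Return the set of all length-k substrings (k-shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

def make_hash_params(n, seed=0):
    """Draw n pairs (a, b) defining hash functions x -> (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]

def minhash_signature(shingle_set, params):
    """For each hash function, keep the minimum hash value over the set."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]

def estimated_similarity(sig1, sig2):
    """Fraction of signature positions on which the two signatures agree."""
    return sum(1 for x, y in zip(sig1, sig2) if x == y) / len(sig1)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dogs"
params = make_hash_params(200)
s1, s2 = shingles(doc1), shingles(doc2)
sig1, sig2 = minhash_signature(s1, params), minhash_signature(s2, params)

print("exact Jaccard:    ", jaccard(s1, s2))
print("minhash estimate: ", estimated_similarity(sig1, sig2))
```

The estimate should approach the exact Jaccard similarity as the number of hash functions grows; 200 is an arbitrary choice here.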
[Figure: singular value decomposition of an m × n matrix A]

$$A_{[m \times n]} \approx U \, \Sigma \, V^{T}$$
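As a small worked instance of the decomposition, the sketch below finds only the largest singular value and its singular vectors by power iteration on AᵀA, then rebuilds the rank-1 approximation σ·u·vᵀ. It is a pure-Python illustration under stated assumptions: the function name, the iteration count, and the tiny rank-1 test matrix are hypothetical, and a real application would use a linear algebra library instead.

```python
import math

def top_singular_triple(A, iters=200):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value.
    Assumes A is a non-zero matrix given as a list of rows."""
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        # w = A v (length m), then v_new = A^T w (length n), then normalize.
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        v_new = [sum(A[i][j] * w[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in v_new))
        v = [x / norm for x in v_new]
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return sigma, u, v

# Rank-1 example: the outer product of [2, 1] and [1, 2].
A = [[2.0, 4.0], [1.0, 2.0]]
sigma, u, v = top_singular_triple(A)
rank1 = [[sigma * u[i] * v[j] for j in range(2)] for i in range(2)]
print(sigma)
print(rank1)
```

Because this A has rank 1, the rank-1 reconstruction recovers it up to rounding; for a general matrix it would be only the best rank-1 approximation.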
Hierarchical:
- Agglomerative (bottom up):
  - Initially, each point is a cluster
  - Repeatedly combine the two “nearest” clusters into one
  - Represent a cluster by its centroid or clustroid

Point Assignment:
- Maintain a set of clusters
- Points belong to “nearest” cluster
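The agglomerative procedure can be sketched as follows. This is a minimal illustration that assumes Euclidean points, represents each cluster by its centroid, and stops at a target number of clusters k; the function names and the sample points are hypothetical, and the quadratic pair search is kept for clarity rather than speed.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerative(points, k):
    """Start with one cluster per point; merge the two clusters with the
    nearest centroids until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = agglomerative(points, k=2)
print(clusters)
```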
LSH:
- Find somewhat similar pairs of items while avoiding O(N²) comparisons

Clustering:
- Assign points into a prespecified number of clusters
- Each point belongs to a single cluster
- Summarize the cluster by a centroid (e.g., topic vector)

SVD (dimensionality reduction):
- Want to explore correlations in the data
- Some dimensions may be irrelevant
- Useful for visualization, removing noise from the data, detecting anomalies
Rank nodes using link structure

PageRank:
- Link voting:
  - Page P with importance x has n out-links; each link gets x/n votes
  - Page R’s importance is the sum of the votes on its in-links
- Complications: spider traps, dead ends
- At each step, the random surfer has two options:
  - With probability β, follow a link at random
  - With probability 1-β, jump to some page uniformly at random
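The voting scheme and the random-surfer fix can be sketched as a power iteration. This is an illustrative sketch under assumptions, not the lecture's code: the function name, the value β = 0.8, the iteration count, and the three-node graph are chosen for the example, and rank lost at dead ends is redistributed uniformly so the scores keep summing to 1.

```python
def pagerank(out_links, beta=0.8, iters=100):
    """Power iteration for PageRank with random teleports.

    out_links maps every node to the list of nodes it links to
    (each linked node must itself be a key of out_links).
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: 0.0 for v in nodes}
        for p in nodes:
            targets = out_links[p]
            for r in targets:
                # A page with importance x and n out-links gives x/n votes per link,
                # damped by the probability beta of following a link.
                new[r] += beta * rank[p] / len(targets)
        # Whatever mass is missing (teleports plus dead ends) is spread uniformly.
        leaked = 1.0 - sum(new.values())
        rank = {v: new[v] + leaked / n for v in nodes}
    return rank

graph = {"y": ["y", "a"], "a": ["y", "m"], "m": []}  # "m" is a dead end
ranks = pagerank(graph)
print(ranks)
```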
TrustRank: topic-specific PageRank with a teleport set of “trusted” pages

SimRank: measure similarity between items
- A k-partite graph with k types of nodes
  - Example: picture nodes and tag nodes
- Perform a random walk with restarts from node N, i.e., teleport set = {N}
- Resulting probability distribution measures similarity to N
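A random walk with restarts can be sketched by iterating the walker's probability distribution directly. The picture/tag graph below is a made-up bipartite example, and the function name, β = 0.8, and the iteration count are assumptions; the sketch also assumes every node has at least one neighbor.

```python
def random_walk_with_restart(neighbors, start, beta=0.8, iters=100):
    """Distribution of a walker that follows a random edge with probability beta
    and otherwise teleports back to start (teleport set = {start})."""
    prob = {v: 0.0 for v in neighbors}
    prob[start] = 1.0
    for _ in range(iters):
        new = {v: 0.0 for v in neighbors}
        for v, nbrs in neighbors.items():
            for u in nbrs:
                new[u] += beta * prob[v] / len(nbrs)
        new[start] += 1.0 - beta
        prob = new
    return prob

# Hypothetical 2-partite graph: pictures p1..p3 on one side, tags on the other.
graph = {
    "p1": ["sky", "beach"],
    "p2": ["sky", "beach"],
    "p3": ["city"],
    "sky": ["p1", "p2"],
    "beach": ["p1", "p2"],
    "city": ["p3"],
}
scores = random_walk_with_restart(graph, "p1")
print(scores)
```

In this toy graph p2 shares both tags with p1 and so receives a positive score, while p3 is unreachable from p1 and scores zero.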
Detecting clusters of densely connected nodes
- Trawling: discover complete bipartite subgraphs
  - Frequent itemset mining
- Graph partitioning: “cut” few edges to separate the graph in two pieces
  - Spectral clustering
  - Graph Laplacian
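The link between the graph Laplacian and the cut can be checked on a small example. For an undirected graph with Laplacian L = D − A and a partition encoded as a vector x with entries +1 or −1, the quadratic form xᵀLx equals four times the number of cut edges; spectral clustering relaxes x to real values and reads the partition off an eigenvector of L. The sketch below only verifies the identity; the helper names and the two-triangle graph are assumptions for illustration.

```python
def laplacian(n, edges):
    """Graph Laplacian L = D - A for an undirected graph on nodes 0..n-1."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] -= 1
        L[j][i] -= 1
    return L

def quadratic_form(L, x):
    """Compute x^T L x."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))

def cut_size(edges, x):
    """Number of edges whose endpoints lie on different sides of the partition."""
    return sum(1 for i, j in edges if x[i] != x[j])

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
L = laplacian(6, edges)

good = [1, 1, 1, -1, -1, -1]   # separates the two triangles
bad = [1, -1, 1, -1, 1, -1]    # alternates sides

print(cut_size(edges, good), quadratic_form(L, good))
print(cut_size(edges, bad), quadratic_form(L, bad))
```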
Prediction = sign(w⋅x + b)
- Model parameters w, b

Margin:

$$\gamma = \frac{\|w\|}{w \cdot w} = \frac{1}{\|w\|}$$

SVM optimization problem:

$$\min_{w,\,b,\,\xi_i \ge 0} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \forall i,\; y_i (w \cdot x_i + b) \ge 1 - \xi_i$$

Find w, b using stochastic gradient descent

[Figure: positive (+) and negative (-) examples on either side of a linear boundary, with slack variables ξi]
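One way to find w and b with stochastic gradient descent is to work on the equivalent hinge-loss form of the objective, ½‖w‖² + C·Σ max(0, 1 − yᵢ(w⋅xᵢ + b)). The sketch below is a simplified illustration: the function names, the step size, the epoch count, and the six training points are assumptions, and the regularizer is applied in full at every step rather than scaled per example.

```python
def train_svm_sgd(data, C=1.0, eta=0.01, epochs=100):
    """data: list of (x, y) with x a tuple of floats and y in {-1, +1}."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:
                # Hinge loss is active: step along -(w - C*y*x) for w and +C*y for b.
                w = [wj - eta * (wj - C * y * xj) for wj, xj in zip(w, x)]
                b += eta * C * y
            else:
                # Only the regularizer contributes to the gradient.
                w = [wj - eta * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Prediction = sign(w.x + b), with ties broken toward +1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical linearly separable training set.
data = [
    ((2.0, 2.0), 1), ((-2.0, -2.0), -1),
    ((3.0, 3.0), 1), ((-3.0, -3.0), -1),
    ((2.0, 3.0), 1), ((-2.0, -3.0), -1),
]
w, b = train_svm_sgd(data)
print(w, b, [predict(w, b, x) for x, _ in data])
```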
Building decision trees using MapReduce
- How to predict?
  - Predictor: avg. yi of the examples in the leaf
- When to stop?
  - # of examples in the leaf is small
- How to build?
  - One MapReduce job per level
  - Need to compute split quality for each attribute and each split value for each current leaf

[Figure: example decision tree with nodes A through I, splits X1<v1, X2<v2, X3<v4, and X2<v5, node sizes |D| (90 and 10; 45 and 45; 25, 20, 30, and 15), a leaf value of .42, and FindBestSplit invoked at the nodes being expanded]
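The split-quality computation for a single leaf can be sketched on one machine as below; in the MapReduce setting described above, the same statistics would be gathered in parallel for every current leaf. The sketch assumes numeric attributes, a regression target, and summed squared error as the quality measure; the function names and the toy data are hypothetical.

```python
def sse(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def find_best_split(rows, labels):
    """Try every attribute j and threshold v (split: x[j] < v) and return the
    (error, j, v) triple minimizing the summed squared error of both sides."""
    best = None
    for j in range(len(rows[0])):
        for v in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] < v]
            right = [y for r, y in zip(rows, labels) if r[j] >= v]
            if not left or not right:
                continue  # a split must send examples to both children
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, j, v)
    return best

rows = [(1, 7), (2, 3), (3, 9), (10, 4), (11, 8), (12, 2)]
labels = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
err, attr, threshold = find_best_split(rows, labels)
print(err, attr, threshold)
```

Each child would then predict the average label of its examples, matching the predictor described above.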
SVM:
- Classification
- Millions of numerical features (e.g., documents)
- Simple (linear) decision boundary
- Hard to interpret model

kNN:
- Classification or regression
- (Many) numerical features
- Many parameters to play with (distance metric, k, weighting, …); there is no simple way to set them!

Decision Trees:
- Classification or regression
- Relatively few features (handles categorical features)
- Complicated decision boundary: overfitting!
- Easy to explain/interpret the classification
- Bagged Decision Trees: very very hard to beat!
[Figure: data stream model. Streams entering over time (. . . 1, 5, 2, 7, 0, 9, 3; . . . a, r, v, t, y, h, b; . . . 0, 0, 1, 0, 1, 1, 0) feed a Processor with Limited Working Storage and Archival Storage; the Processor handles Standing Queries and Ad-Hoc Queries and produces Output]
Sampling data from a stream:
- Each element is included with prob. k/N

Queries over sliding windows:
- How many 1s are in last k bits?

Filtering a stream: Bloom filters
- Select elements with property x from stream

Counting distinct elements:
- Number of distinct elements in the last k elements of the stream

[Figure: example bit stream 1001010110001011010101010101011010101010101110101010111010100010110010]

[Figure: Bloom filter. An item is hashed by hash func h into bit array B (0010001011000); output the item since it may be in S, or drop the item]
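The filtering step can be sketched with a small Bloom filter. This is an illustrative sketch: the class and method names, the bit-array size, the use of salted SHA-256 digests as hash functions, and the sample keys are all assumptions, and the figure above uses a single hash function h where this sketch uses three.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = [0] * n_bits  # bit array B

    def _positions(self, item):
        # Derive n_hashes bit positions from salted SHA-256 digests of the item.
        for salt in range(self.n_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "may be in S" (false positives possible);
        # False means "surely not in S".
        return all(self.bits[pos] for pos in self._positions(item))

S = ["alice@example.com", "bob@example.com", "carol@example.com"]
bloom = BloomFilter()
for key in S:
    bloom.add(key)

stream = ["bob@example.com", "mallory@example.com", "alice@example.com"]
passed = [item for item in stream if bloom.might_contain(item)]
print(passed)
```

Every member of S always passes the filter; a non-member is usually dropped but can occasionally slip through as a false positive.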
You get to see one input piece at a time, and need to make irrevocable decisions

Competitive ratio = min over all inputs I of (|M_my_alg| / |M_opt|)

AdWords problem:
- A query arrives at a search engine
- Several advertisers bid on the query
- Pick a subset of advertisers whose ads are shown

Greedy online matching: competitive ratio ≥ 1/2

[Figure: bipartite graph matching Boys 1, 2, 3, 4 to Girls a, b, c, d]
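Greedy online matching can be sketched as follows. The arrival order and the edges are a hypothetical instance (the edges of the figure are not recoverable from the transcript), chosen so that greedy reaches exactly half of the optimum; the optimal matching is written out by hand rather than computed.

```python
def greedy_online_matching(arrivals):
    """arrivals: list of (left_node, [acceptable right nodes]) in arrival order.
    Each arriving node is irrevocably paired with its first free neighbor, if any."""
    taken = set()
    matching = []
    for node, options in arrivals:
        for candidate in options:
            if candidate not in taken:
                taken.add(candidate)
                matching.append((node, candidate))
                break
    return matching

# Hypothetical instance: 1 and 3 each grab the only partner that 2 and 4 could use.
arrivals = [(1, ["a", "b"]), (2, ["a"]), (3, ["c", "d"]), (4, ["c"])]
greedy = greedy_online_matching(arrivals)

# An optimal (offline) matching for the same graph, written out by hand.
optimal = [(1, "b"), (2, "a"), (3, "d"), (4, "c")]

ratio = len(greedy) / len(optimal)
print(greedy, ratio)
```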
Supermarket shelf management (market-basket model):
- Goal: to identify items that are bought together by sufficiently many customers
- Approach: process the sales data collected with barcode scanners to find dependencies among items

| TID | Items |
| --- | --- |
| 1 | Bread, Coke, Milk |
| 2 | Beer, Bread |
| 3 | Beer, Coke, Diaper, Milk |
| 4 | Beer, Bread, Diaper, Milk |
| 5 | Coke, Diaper, Milk |

Rules discovered:
- {Milk} --> {Coke}
- {Diaper, Milk} --> {Beer}
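The support and confidence of the two rules can be checked directly on the five baskets. The sketch below is a brute-force count for illustration only (the function names are assumptions); it is not the Apriori or PCY algorithm, which avoid scanning all itemsets on large data.

```python
baskets = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Number of baskets that contain every item of itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(lhs, rhs):
    """Fraction of the baskets containing lhs that also contain rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"Milk"}), confidence({"Milk"}, {"Coke"}))
print(support({"Diaper", "Milk"}), confidence({"Diaper", "Milk"}, {"Beer"}))
```

On this data {Milk} --> {Coke} holds in 3 of the 4 baskets containing Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 baskets containing both Diaper and Milk.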
User-user collaborative filtering:
- Consider user c
- Find set D of other users whose ratings are “similar” to c’s ratings
- Estimate user’s ratings based on the ratings of users in D

Item-item collaborative filtering:
- Estimate rating for item based on ratings for similar items

[Figure: Items, Search, Recommendations]
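User-user collaborative filtering can be sketched as below. The ratings, the choice of cosine similarity (with missing ratings treated as 0), and the similarity-weighted average over the k most similar users are assumptions made for the example; other similarity measures and weightings are equally plausible.

```python
import math

# Hypothetical ratings: user -> {item: rating}. User "c" has not rated item "D".
ratings = {
    "c":  {"A": 5, "B": 4, "C": 1},
    "u1": {"A": 5, "B": 5, "C": 1, "D": 5},
    "u2": {"A": 1, "B": 2, "C": 5, "D": 1},
}

def cosine(r1, r2):
    """Cosine similarity of two sparse rating vectors (missing ratings count as 0)."""
    dot = sum(r1[i] * r2[i] for i in r1 if i in r2)
    n1 = math.sqrt(sum(v * v for v in r1.values()))
    n2 = math.sqrt(sum(v * v for v in r2.values()))
    return dot / (n1 * n2)

def predict(user, item, k=1):
    """Similarity-weighted average of the item's ratings by the k users
    most similar to user (the set D) among those who rated the item."""
    others = [(cosine(ratings[user], r), r[item])
              for name, r in ratings.items() if name != user and item in r]
    top = sorted(others, reverse=True)[:k]
    return sum(s * r for s, r in top) / sum(s for s, _ in top)

print(predict("c", "D", k=1))
print(predict("c", "D", k=2))
```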
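User-movie interaction:
- Characterizes the matching between users and movies
- Attracts most research in the field

Baseline predictor:
- Separates users and movies
- Benefits from insights into user’s behavior

$$\min_{Q,P} \sum_{(u,i) \in R} \left( r_{ui} - (\mu + b_u + b_i + q_i^{T} p_u) \right)^2 + \lambda \left( \sum_i \|q_i\|^2 + \sum_u \|p_u\|^2 + \sum_u b_u^2 + \sum_i b_i^2 \right)$$

Here $\mu$ is the overall mean rating, $b_u$ is the user bias, $b_i$ is the movie bias, and $q_i^{T} p_u$ is the user-movie interaction.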
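An objective of this kind (global mean, user bias, movie bias, and a factor interaction, with L2 regularization) can be minimized with stochastic gradient descent, as in the sketch below. All names, hyperparameters, and the six ratings are assumptions for illustration, and the constant factor 2 of the gradient is absorbed into the step size.

```python
import random

def predict(model, u, i):
    """mu + b_u + b_i + q_i . p_u"""
    mu, bu, bi, P, Q = model
    return mu + bu[u] + bi[i] + sum(q * p for q, p in zip(Q[i], P[u]))

def rmse(model, ratings):
    """Root-mean-square error of the model on a list of (u, i, r) triples."""
    se = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5

def train(ratings, n_users, n_items, k=2, lam=0.05, eta=0.02, epochs=200, seed=0):
    """SGD over the known ratings; returns the model (mu, bu, bi, P, Q)."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = [0.0] * n_users
    bi = [0.0] * n_items
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    model = (mu, bu, bi, P, Q)
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(model, u, i)
            bu[u] += eta * (err - lam * bu[u])
            bi[i] += eta * (err - lam * bi[i])
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += eta * (err * qi - lam * pu)
                Q[i][f] += eta * (err * pu - lam * qi)
    return model

# Hypothetical (user, movie, rating) triples.
ratings = [(0, 0, 5), (0, 1, 1), (1, 0, 4), (1, 2, 2), (2, 1, 1), (2, 2, 5)]
mean_only = (3.0, [0.0] * 3, [0.0] * 3, [[0.0, 0.0]] * 3, [[0.0, 0.0]] * 3)
model = train(ratings, n_users=3, n_items=3)
print(rmse(mean_only, ratings), rmse(model, ratings))
```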
- MapReduce
- Association Rules
  - Apriori algorithm
- Finding Similar Items
  - Locality Sensitive Hashing
  - Random Hyperplanes
- Dimensionality Reduction
  - Singular Value Decomposition
  - CUR method
- Clustering
- Recommender systems
  - Collaborative filtering
- PageRank and TrustRank
- Hubs & Authorities
- k-Nearest Neighbors
- Perceptron
- Support Vector Machines
- Stochastic Gradient Descent
- Decision Trees
- Mining data streams
  - Bloom Filters
  - Flajolet-Martin
- Advertising on the Web
How do you want that data?
We are producing more data than we are able to store!
[The Economist, 2010]
How to analyze large datasets to discover patterns and models that are:
- valid: hold on new data with some certainty
- novel: non-obvious to the system
- useful: should be possible to act on the item
- understandable: humans should be able to interpret the pattern

How to do this using massive data (that does not fit into main memory)
Seminars:
- InfoSeminar: http://i.stanford.edu/infoseminar
- RAIN Seminar: http://rain.stanford.edu

Conferences:
- KDD: ACM Conference on Knowledge Discovery and Data Mining
- ICDM: IEEE International Conference on Data Mining
- WWW: World Wide Web Conference
- ICML: International Conference on Machine Learning
- NIPS: Neural Information Processing Systems
CS341: Project in Data Mining (Spring 2012)
- Research project on big data
- Groups of 3 students
- We provide interesting data, computing resources and mentoring
- You provide project ideas

Information session: Monday 3/19 5pm in Gates 415 (there will be pizza)
Other relevant courses:
- CS224W: Social and Information Network Analysis
- CS276: Information Retrieval and Web Search
- CS229: Machine Learning
- CS245: Database System Principles
- CS347: Transaction Processing and Distributed Databases
- CS448g: Interactive Data Analysis
You Have Done a Lot!!!
- And (hopefully) learned a lot!!!
- Answered questions and proved many interesting results
- Implemented a number of methods
- And did excellently on the final!

Thank You for the Hard Work!!!