![Page 1: Working AWS codes are out 605 waitlist ~= 25, slots ~= 15 10-805 ...wcohen/10-605/2016/workflow-2.pdf · JOIN data BY λ(docid, term):term, docFreq BY λ(term, df):term ... Join syntax,](https://reader034.vdocuments.mx/reader034/viewer/2022042411/5f297554e9583d35b9668421/html5/thumbnails/1.jpg)
Announcements
• Working AWS codes are out
• 605 waitlist ~= 25, slots ~= 15
• 10-805 project deadlines now posted
• William has no office hours next week
Recap
• An algorithm for testing a huge naïve Bayes classifier
  – More generally: for evaluating a linear classifier on a test set efficiently on-disk, using stream-and-sort or map-reduce ops only
• Sketch of algorithm for Rocchio training/testing
Recap
• Abstractions for map-reduce (TFIDF example)
• map-side vs reduce-side joins
Proposed syntax: table2 = MAP table1 TO λ row : f(row)
Proposed syntax: table2 = FILTER table1 BY λ row : f(row), where f(row) → {true, false}
Proposed syntax: table2 = FLATMAP table1 TO λ row : f(row), where f(row) → list of rows
Proposed syntax: GROUP table BY λ row : f(row)
Could define f via: a function, a field of a defined record structure, …
Proposed syntax: JOIN table1 BY λ row : f(row), table2 BY λ row : g(row)
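The proposed operators can be mimicked on in-memory Python lists. This is only a sketch of their semantics, assuming each table fits in memory; the function names (`map_rows`, `filter_rows`, `flatmap_rows`, `group_rows`, `join_rows`) are made up for illustration and are not part of any library.

```python
from collections import defaultdict

def map_rows(table, f):
    """MAP: one output row per input row."""
    return [f(row) for row in table]

def filter_rows(table, f):
    """FILTER: keep the rows for which f(row) is true."""
    return [row for row in table if f(row)]

def flatmap_rows(table, f):
    """FLATMAP: f returns a list of rows; the lists are concatenated."""
    return [out for row in table for out in f(row)]

def group_rows(table, f):
    """GROUP: collect the rows that share the key f(row)."""
    groups = defaultdict(list)
    for row in table:
        groups[f(row)].append(row)
    return dict(groups)

def join_rows(table1, f, table2, g):
    """JOIN: pair rows r1, r2 whose keys agree, f(r1) == g(r2)."""
    index = group_rows(table2, g)
    return [(r1, r2) for r1 in table1 for r2 in index.get(f(r1), [])]

# Illustrative data: (docid, text) rows.
docs = [("d1", "a b"), ("d2", "b c")]
pairs = flatmap_rows(docs, lambda row: [(row[0], t) for t in row[1].split()])
by_term = group_rows(pairs, lambda row: row[1])
```

In a real map-reduce setting GROUP and JOIN would be implemented with a sort or shuffle rather than an in-memory dictionary.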
Today • Less abstract abstractions
PIG: A WORKFLOW/DATAFLOW LANGUAGE
PIG: word count
• Declarative “data flow” language
A PIG program is a sequence of assignments where every LHS is a relation. No loops, conditionals, etc. are allowed.
More on Pig
• Pig Latin
  – atomic types + compound types like tuple, bag, map
  – execute locally/interactively or on Hadoop
• can embed Pig in Java (and Python and …)
• can call out to Java from Pig
Tokenize – built-in function
Flatten – special keyword that applies to the next step in the process, i.e., it turns the FOREACH from a MAP into a FLATMAP
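The difference that Flatten makes can be seen in a small Python sketch of word count. The `tokenize` function below is a hypothetical stand-in for Pig's built-in tokenizer (its exact splitting rules are not reproduced here), and the input lines are made up.

```python
import re
from collections import Counter

def tokenize(line):
    """Hypothetical stand-in for a built-in tokenizer: split on non-word characters."""
    return [tok for tok in re.split(r"\W+", line.lower()) if tok]

lines = ["the quick brown fox", "the lazy dog"]

# Without flatten, the step is a MAP: one output row (a bag of tokens) per line.
bags = [tokenize(line) for line in lines]

# With flatten, the step becomes a FLATMAP: one output row per token.
words = [tok for line in lines for tok in tokenize(line)]

# GROUP BY word, then COUNT each group.
counts = Counter(words)
```

Here `bags` has two rows while `words` has seven, which is the MAP versus FLATMAP distinction in miniature.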
PIG Features
• LOAD ‘hdfs-path’ AS (schema)
  – schemas can include int, double, bag, map, tuple, …
• FOREACH alias GENERATE … AS …, …
  – transforms each row of a relation
• DESCRIBE alias / ILLUSTRATE alias -- debugging
• GROUP alias BY …
• FOREACH alias GENERATE group, SUM(….)
  – GROUP/GENERATE … aggregate op together act like a map-reduce
• JOIN r BY field, s BY field, …
  – inner join to produce rows: r::f1, r::f2, … s::f1, s::f2, …
• CROSS r, s, …
  – use with care unless all but one of the relations are singleton
• User defined functions as operators
  – also for loading, aggregates, …
PIG parses and optimizes a sequence of commands before it executes them. It’s smart enough to turn GROUP … FOREACH … SUM … into a single map-reduce.
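To see why a GROUP followed by a SUM fits in one map-reduce, here is a minimal stream-and-sort sketch in Python. It is not how PIG is implemented; it only assumes that the sort between the two phases brings equal keys together, so the reducer can sum each run of identical keys in one pass.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(rows):
    """Emit (key, value) pairs; here the key is the word itself."""
    for word, count in rows:
        yield (word, count)

def reduce_phase(sorted_pairs):
    """Sum the values of each run of identical keys in the sorted stream."""
    for key, run in groupby(sorted_pairs, key=itemgetter(0)):
        yield (key, sum(value for _, value in run))

rows = [("a", 1), ("b", 2), ("a", 3)]
# sorted() plays the role of the shuffle/sort between map and reduce.
result = list(reduce_phase(sorted(map_phase(rows))))
```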
Example: the optimizer will compress these steps into one map-reduce operation
ANOTHER EXAMPLE: COMPUTING TFIDF IN PIG LATIN
Abstract Implementation: [TF]IDF
data = pairs (docid, term) where term is a word that appears in the document with id docid
operators:
• DISTINCT, MAP, JOIN
• GROUP BY …. [RETAINING …] REDUCING TO a reduce step

docFreq = DISTINCT data
  | GROUP BY λ(docid,term):term REDUCING TO count /* (term,df) */
docIds = MAP DATA BY=λ(docid,term):docid | DISTINCT
numDocs = GROUP docIds BY λdocid:1 REDUCING TO count /* (1,numDocs) */
dataPlusDF = JOIN data BY λ(docid, term):term, docFreq BY λ(term, df):term
  | MAP λ((docid,term),(term,df)):(docId,term,df) /* (docId,term,document-freq) */
unnormalizedDocVecs = JOIN dataPlusDF by λrow:1, numDocs by λrow:1
  | MAP λ((docId,term,df),(dummy,numDocs)): (docId,term,log(numDocs/df)) /* (docId, term, weight-before-normalizing) : u */
1/2
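| docId | term |
|---|---|
| d123 | found |
| d123 | aardvark |

| key | value |
|---|---|
| found | (d123,found),(d134,found),… |
| aardvark | (d123,aardvark),… |

| key | value |
|---|---|
| 1 | 12451 |

| key | value | count |
|---|---|---|
| found | (d123,found),(d134,found),… | 2456 |
| aardvark | (d123,aardvark),… | 7 |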
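The first half of the abstract pipeline can be traced in plain Python. This is a sketch on a tiny made-up `data` relation, with dictionaries standing in for the GROUP and JOIN steps; the variable names are illustrative only.

```python
import math
from collections import Counter

# data = pairs (docid, term); values are made up for illustration.
data = [("d1", "found"), ("d1", "aardvark"), ("d2", "found"), ("d3", "found")]

# docFreq = DISTINCT data | GROUP BY term REDUCING TO count
doc_freq = Counter(term for _, term in set(data))

# docIds = MAP data TO docid | DISTINCT ; numDocs = count of docIds
num_docs = len({docid for docid, _ in data})

# JOIN data with docFreq on term and with numDocs, then MAP to log(numDocs/df).
unnormalized = [
    (docid, term, math.log(num_docs / doc_freq[term]))
    for docid, term in data
]
```

A term that occurs in every document, like "found" here, gets weight log(1) = 0.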
Abstract Implementation: TFIDF
normalizers = GROUP unnormalizedDocVecs BY λ(docId,term,w):docid RETAINING λ(docId,term,w): w^2 REDUCING TO sum /* (docid,sum-of-square-weights) */
docVec = JOIN unnormalizedDocVecs BY λ(docId,term,w):docid, normalizers BY λ(docId,norm):docid
  | MAP λ((docId,term,w), (docId,norm)): (docId,term,w/sqrt(norm)) /* (docId, term, weight) */
2/2
| key | value | sum of squared weights |
|---|---|---|
| d1234 | (d1234,found,1.542), (d1234,aardvark,13.23),… | 37.234 |
| d3214 | …. | 29.654 |

| docId | term | w |
|---|---|---|
| d1234 | found | 1.542 |
| d1234 | aardvark | 13.23 |

| docId | w |
|---|---|
| d1234 | 37.234 |
| d1234 | 37.234 |
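The normalization step can be sketched the same way. The weights below are made up so the arithmetic is easy to follow, and a dictionary stands in for the GROUP and JOIN on docid.

```python
import math
from collections import defaultdict

# (docid, term, weight-before-normalizing); illustrative values only.
unnormalized = [("d1", "found", 3.0), ("d1", "aardvark", 4.0), ("d2", "found", 2.0)]

# normalizers = GROUP BY docid RETAINING w^2 REDUCING TO sum
norm = defaultdict(float)
for doc_id, term, w in unnormalized:
    norm[doc_id] += w * w

# docVec = JOIN on docid | MAP to (docid, term, w / sqrt(norm))
doc_vec = [(d, t, w / math.sqrt(norm[d])) for d, t, w in unnormalized]
```

After this step every document vector has unit Euclidean length, so a dot product of two vectors is their cosine similarity.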
(docid,token) → (docid,token,tf(token in doc))
(docid,token,tf) → (docid,token,tf,length(doc))
(docid,token,tf,n) → (…,tf/n)
(docid,token,tf,n,tf/n) → (…,df)
ndocs.total_docs: relation-to-scalar casting
(docid,token,tf,n,tf/n) → (docid,token,tf/n * idf)
group outputs a record with “group” as the field name
Debugging/visualization
TF-IDF in PIG - another version
GUINEA PIG
GuineaPig: PIG in Python
• Pure Python (< 1500 lines)
• Streams Python data structures
  – strings, numbers, tuples (a,b), lists [a,b,c]
  – No records: operations defined functionally
• Compiles to Hadoop streaming pipeline
  – Optimizes sequences of MAPs
• Runs locally without Hadoop
  – compiles to stream-and-sort pipeline
  – intermediate results can be viewed
• Can easily run parts of a pipeline
• http://curtis.ml.cmu.edu/w/courses/index.php/Guinea_Pig
GuineaPig: PIG in Python
• Pure Python, streams Python data structures
  – not too much new to learn (e.g., field/record notation, special string operations, UDFs, …)
  – codebase is small and readable
• Compiles to Hadoop or stream-and-sort, can easily run parts of a pipeline
  – intermediate results often are (and always can be) stored and inspected
  – plan is fairly visible
• Syntax includes high-level operations but also fairly detailed description of an optimized map-reduce step
  – Flatten | Group(by=…, retaining=…, reducingTo=…)
A wordcount example
class variables in the planner are data structures
Wordcount example ….
• Data structure can be converted to a series of “abstract map-reduce tasks”
More examples of GuineaPig
Join syntax, macros, Format command
Incremental debugging, when intermediate views are stored:
% python wrdcmp.py --store result
…
% python wrdcmp.py --store result --reuse cmp
More examples of GuineaPig
Full Syntax for Group
Group(wc, by=lambda (word,count):word[:k], retaining=lambda (word,count):count, reducingTo=ReduceToSum())
equiv to:
Group(wc, by=lambda (word,count):word[:k], reducingTo=ReduceTo(int, lambda accum,(word,count): accum+count))
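The equivalence of the two forms can be checked with a small stand-alone Python 3 sketch. The `group` helper below is hypothetical and only imitates the by / retaining / reducingTo semantics described on the slide; it is not GuineaPig's implementation, and it uses indexing instead of tuple-unpacking lambdas, which Python 3 does not allow.

```python
def group(rows, by, init, combine, retaining=lambda row: row):
    """Hypothetical helper: fold the retained part of each row into a per-key accumulator."""
    acc = {}
    for row in rows:
        key = by(row)
        if key not in acc:
            acc[key] = init()          # e.g. int() == 0
        acc[key] = combine(acc[key], retaining(row))
    return sorted(acc.items())

wc = [("apple", 3), ("apply", 2), ("bat", 5)]   # illustrative (word, count) rows
k = 2

# Form 1: retain only the count, then reduce to a sum.
with_retaining = group(wc, by=lambda r: r[0][:k], init=int,
                       combine=lambda accum, count: accum + count,
                       retaining=lambda r: r[1])

# Form 2: no retaining step; the reducer picks the count out of the row.
without_retaining = group(wc, by=lambda r: r[0][:k], init=int,
                          combine=lambda accum, r: accum + r[1])
```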
ANOTHER EXAMPLE: COMPUTING TFIDF IN GUINEA PIG
Actual Implementation
Actual Implementation
| docId | w |
|---|---|
| d123 | found |
| d123 | aardvark |
Actual Implementation
| key | value | count |
|---|---|---|
| found | (d123,found),(d134,found),… | 2456 |
| aardvark | (d123,aardvark),… | 7 |
Actual Implementation
Augment: loads a preloaded object b at mapper initialization time, cycles through the input rows a, and generates pairs (a,b)
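The behaviour described for Augment can be sketched as a Python generator. The names `augment` and `load_num_docs` are hypothetical, and the side object here is just a number (for example, a document count); the point is that b is loaded once, not once per row.

```python
def augment(rows, load_side_object):
    """Hypothetical sketch: load b once, then pair it with every input row a."""
    b = load_side_object()        # happens once, at "mapper initialization"
    for a in rows:
        yield (a, b)

load_calls = []

def load_num_docs():
    """Stand-in loader that records how often it is called."""
    load_calls.append(1)
    return 12451

pairs = list(augment([("d123", "found"), ("d123", "aardvark")], load_num_docs))
```

This is effectively a map-side join against a small relation: no sort is needed because b fits in memory.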
Full Implementation
Outline: Soft Joins with TFIDF
• Why similarity joins are important
• Useful similarity metrics for sets and strings
• Fast methods for K-NN and similarity joins
  – Blocking
  – Indexing
  – Short-cut algorithms
  – Parallel implementation
In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest threat to his continued good health, for--the stories go--once an enemy, even a weak unskilled enemy, learned the sorcerer's true name, then routine and widely known spells could destroy or enslave even the most powerful. As times passed, and we graduated to the Age of Reason and thence to the first and second industrial revolutions, such notions were discredited. Now it seems that the Wheel has turned full circle (even if there never really was a First Age) and we are back to worrying about true names again:
The first hint Mr. Slippery had that his own True Name might be known--and, for that matter, known to the Great Enemy--came with the appearance of two black Lincolns humming up the long dirt driveway ... Roger Pollack was in his garden weeding, had been there nearly the whole morning.... Four heavy-set men and a hard-looking female piled out, started purposefully across his well-tended cabbage patch.…
This had been, of course, Roger Pollack's great fear. They had discovered Mr. Slippery's True Name and it was Roger Andrew Pollack TIN/SSAN 0959-34-2861.
SOFT JOINS WITH TFIDF: WHY AND WHAT
Motivation
• Integrating data is important
• Data from different sources may not have consistent object identifiers
  – Especially automatically-constructed ones
• But databases will have human-readable names for the objects
• But names are tricky….
Sim Joins on Product Descriptions
• Similarity can be high for descriptions of distinct items:
o AERO TGX-Series Work Table -42'' x 96'' Model 1TGX-4296 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop edge ...
o AERO TGX-Series Work Table -42'' x 48'' Model 1TGX-4248 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop ..
• Similarity can be low for descriptions of identical items:
o Canon Angle Finder C 2882A002 Film Camera Angle Finders Right Angle Finder C (Includes ED-C & ED-D Adapters for All SLR Cameras) Film Camera Angle Finders & Magnifiers The Angle Finder C lets you adjust ...
o CANON 2882A002 ANGLE FINDER C FOR EOS REBEL® SERIES PROVIDES A FULL SCREEN IMAGE SHOWS EXPOSURE DATA BUILT-IN DIOPTRIC ADJUSTMENT COMPATIBLE WITH THE CANON® REBEL, EOS & REBEL EOS SERIES.
One solution: Soft (Similarity) joins
• A similarity join of two sets A and B is
  – an ordered list of triples (s_ij, a_i, b_j) such that
    • a_i is from A
    • b_j is from B
    • s_ij is the similarity of a_i and b_j
    • the triples are in descending order
• the list is either the top K triples by s_ij or ALL triples with s_ij > L … or sometimes some approximation of these….
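The definition translates directly into a brute-force Python sketch. Token-level Jaccard is used here purely as an assumed placeholder similarity, and the product names are made up. The quadratic loop over all pairs is exactly what faster methods try to avoid.

```python
def jaccard(a, b):
    """Placeholder similarity: overlap of lower-cased token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def similarity_join(A, B, sim, k=None, threshold=None):
    """Return triples (s_ij, a_i, b_j) in descending order of s_ij.

    Keeps the top k triples, or all triples with s_ij > threshold, or both.
    """
    triples = [(sim(a, b), a, b) for a in A for b in B]
    if threshold is not None:
        triples = [t for t in triples if t[0] > threshold]
    triples.sort(key=lambda t: t[0], reverse=True)
    return triples[:k] if k is not None else triples

A = ["canon angle finder c", "aero work table"]
B = ["canon 2882a002 angle finder", "office chair"]
top = similarity_join(A, B, jaccard, k=1)
above = similarity_join(A, B, jaccard, threshold=0.0)
```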
Softjoin Example - 1
A useful scalable similarity metric: IDF weighting plus cosine distance!
How well does TFIDF work?
There are refinements to TFIDF distance, e.g., ones that extend it with soft matching at the token level (such as softTFIDF)
Semantic Joining with Multiscale Statistics
William Cohen Katie Rivard, Dana Attias-Moshevitz
CMU
SOFT JOINS WITH TFIDF: HOW?
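Rocchio’s algorithm

$$
\begin{aligned}
\mathrm{DF}(w) &= \#\text{ different docs } w \text{ occurs in} \\
\mathrm{TF}(w,d) &= \#\text{ different times } w \text{ occurs in doc } d \\
\mathrm{IDF}(w) &= \frac{|D|}{\mathrm{DF}(w)} \\
u(w,d) &= \log(\mathrm{TF}(w,d)+1) \cdot \log(\mathrm{IDF}(w)) \\
\mathbf{u}(d) &= \langle u(w_1,d), \ldots, u(w_{|V|},d) \rangle \\
\mathbf{u}(y) &= \alpha \frac{1}{|C_y|} \sum_{d \in C_y} \frac{\mathbf{u}(d)}{\|\mathbf{u}(d)\|_2} - \beta \frac{1}{|D - C_y|} \sum_{d' \in D - C_y} \frac{\mathbf{u}(d')}{\|\mathbf{u}(d')\|_2} \\
f(d) &= \arg\max_y \frac{\mathbf{u}(d)}{\|\mathbf{u}(d)\|_2} \cdot \frac{\mathbf{u}(y)}{\|\mathbf{u}(y)\|_2} \\
\|\mathbf{u}\|_2 &= \sqrt{\sum_i u_i^2}
\end{aligned}
$$

Many variants of these formulae …as long as u(w,d)=0 for words not in d!
Store only non-zeros in u(d), so size is O(|d|)
But size of u(y) is O(|nV|)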
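The formulas above can be exercised on a toy corpus with sparse dictionaries for the vectors. This is an in-memory sketch, not the stream-and-sort version discussed in the course; the function names, the default values alpha = 1.0 and beta = 0.5, and the four documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def unit(vec):
    """Scale a sparse vector to unit L2 norm (empty if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {w: x / norm for w, x in vec.items()} if norm > 0 else {}

def doc_vectors(docs):
    """docs: docid -> token list. Returns docid -> sparse u(d), non-zeros only."""
    n = len(docs)
    df = Counter(w for toks in docs.values() for w in set(toks))
    vecs = {}
    for d, toks in docs.items():
        tf = Counter(toks)
        vecs[d] = {w: math.log(tf[w] + 1) * math.log(n / df[w]) for w in tf}
    return vecs

def train(vecs, labels, alpha=1.0, beta=0.5):
    """Build one prototype u(y) per class y, following the formula for u(y)."""
    proto = {}
    for y in set(labels.values()):
        pos = [d for d in vecs if labels[d] == y]
        neg = [d for d in vecs if labels[d] != y]
        u_y = defaultdict(float)
        for d in pos:
            for w, x in unit(vecs[d]).items():
                u_y[w] += alpha * x / len(pos)
        for d in neg:
            for w, x in unit(vecs[d]).items():
                u_y[w] -= beta * x / len(neg)
        proto[y] = unit(u_y)
    return proto

def classify(vec, proto):
    """f(d): the class whose normalized prototype has the largest dot product."""
    v = unit(vec)
    return max(proto, key=lambda y: sum(x * proto[y].get(w, 0.0) for w, x in v.items()))

docs = {"d1": ["cat", "purr"], "d2": ["cat", "meow"],
        "d3": ["dog", "bark"], "d4": ["dog", "woof"]}
labels = {"d1": "cat", "d2": "cat", "d3": "dog", "d4": "dog"}
vecs = doc_vectors(docs)
proto = train(vecs, labels)
```

Note that each document vector stores only the words in that document, while each prototype can grow to cover the whole vocabulary, matching the size remarks above.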
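TFIDF similarity

$$
\begin{aligned}
\mathrm{DF}(w) &= \#\text{ different docs } w \text{ occurs in} \\
\mathrm{TF}(w,d) &= \#\text{ different times } w \text{ occurs in doc } d \\
\mathrm{IDF}(w) &= \frac{|D|}{\mathrm{DF}(w)} \\
u(w,d) &= \log(\mathrm{TF}(w,d)+1) \cdot \log(\mathrm{IDF}(w)) \\
\mathbf{u}(d) &= \langle u(w_1,d), \ldots, u(w_{|V|},d) \rangle \\
\mathbf{v}(d) &= \frac{\mathbf{u}(d)}{\|\mathbf{u}(d)\|_2} \\
\mathrm{sim}(\mathbf{v}(d_1), \mathbf{v}(d_2)) &= \mathbf{v}(d_1) \cdot \mathbf{v}(d_2) = \sum_w \frac{u(w,d_1)}{\|\mathbf{u}(d_1)\|_2} \, \frac{u(w,d_2)}{\|\mathbf{u}(d_2)\|_2}
\end{aligned}
$$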
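A direct reading of these definitions in Python, with sparse dictionaries so that the sum over w only touches words the two documents share. The three tiny documents are made up, and `tfidf_unit_vectors` and `sim` are illustrative names.

```python
import math
from collections import Counter

def tfidf_unit_vectors(docs):
    """docs: docid -> token list. Returns docid -> sparse unit vector v(d)."""
    n = len(docs)
    df = Counter(w for toks in docs.values() for w in set(toks))
    out = {}
    for d, toks in docs.items():
        tf = Counter(toks)
        u = {w: math.log(tf[w] + 1) * math.log(n / df[w]) for w in tf}
        norm = math.sqrt(sum(x * x for x in u.values()))
        out[d] = {w: x / norm for w, x in u.items()} if norm > 0 else {}
    return out

def sim(v1, v2):
    """Dot product of two sparse vectors, iterating over the shorter one."""
    if len(v1) > len(v2):
        v1, v2 = v2, v1
    return sum(x * v2.get(w, 0.0) for w, x in v1.items())

docs = {"a": ["canon", "angle", "finder"],
        "b": ["canon", "angle", "finder"],
        "c": ["work", "table"]}
v = tfidf_unit_vectors(docs)
```

Identical documents score (up to rounding) 1 and documents with no shared words score 0.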
Soft TFIDF joins
• A similarity join of two sets of TFIDF-weighted vectors A and B is
  – an ordered list of triples (s_ij, a_i, b_j) such that
    • a_i is from A
    • b_j is from B
    • s_ij is the dot product of a_i and b_j
    • the triples are in descending order
• the list is either the top K triples by s_ij or ALL triples with s_ij > L … or sometimes some approximation of these….
PARALLEL SOFT JOINS
SIGMOD 2010
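TFIDF similarity: variant for joins

$$
\begin{aligned}
\mathrm{DF}(A,w) &= \#\text{ different docs } w \text{ occurs in from } A \\
\mathrm{DF}(B,w) &= \#\text{ different docs } w \text{ occurs in from } B \\
\mathrm{TF}(w,d) &= \#\text{ different times } w \text{ occurs in doc } d \\
\mathrm{IDF}(w,d) &= \frac{|C_d|}{\mathrm{DF}(C_d,w)}, \quad \text{where } C_d \in \{A,B\} \\
u(w,d) &= \log(\mathrm{TF}(w,d)+1) \cdot \log(\mathrm{IDF}(w,d)) \\
\mathbf{u}(d) &= \langle u(w_1,d), \ldots, u(w_{|V|},d) \rangle \\
\mathbf{v}(d) &= \frac{\mathbf{u}(d)}{\|\mathbf{u}(d)\|_2} \\
\mathrm{sim}(\mathbf{v}(d_1), \mathbf{v}(d_2)) &= \mathbf{v}(d_1) \cdot \mathbf{v}(d_2) = \sum_w \frac{u(w,d_1)}{\|\mathbf{u}(d_1)\|_2} \, \frac{u(w,d_2)}{\|\mathbf{u}(d_2)\|_2}
\end{aligned}
$$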
Parallel Inverted Index Softjoin - 1
want this to work for long documents or short ones…and keep the relations simple
Statistics for computing TFIDF with IDFs local to each relation
Parallel Inverted Index Softjoin - 2
What’s the algorithm?
• Step 1: create document vectors as (C_d, d, term, weight) tuples
• Step 2: join the tuples from A and B: one sort and reduce
  • Gives you tuples (a, b, term, w(a,term)*w(b,term))
• Step 3: group the common terms by (a,b) and reduce to aggregate the components of the sum
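Steps 2 and 3 can be sketched in memory as follows, taking the Step 1 output as given. Dictionaries stand in for the sort-and-reduce stages, the function name is hypothetical, and the weights are made-up values chosen so the products are easy to check by hand.

```python
from collections import defaultdict

def inverted_index_softjoin(tuples):
    """tuples: iterable of (relation, doc, term, weight), relation in {'A', 'B'}.

    Returns (score, a, b) triples in descending order of score.
    """
    # Step 2: bring together all postings for the same term (the "sort"),
    # then emit one partial product per (a, b) pair sharing that term.
    by_term = defaultdict(lambda: {"A": [], "B": []})
    for rel, doc, term, w in tuples:
        by_term[term][rel].append((doc, w))
    products = []
    for term, postings in by_term.items():
        for a, wa in postings["A"]:
            for b, wb in postings["B"]:
                products.append((a, b, term, wa * wb))

    # Step 3: group the partial products by (a, b) and sum them.
    scores = defaultdict(float)
    for a, b, _term, p in products:
        scores[(a, b)] += p
    return sorted(((s, a, b) for (a, b), s in scores.items()), reverse=True)

# Step 1 output, assumed already computed.
tuples = [("A", "a1", "canon", 0.5), ("A", "a1", "finder", 0.5),
          ("B", "b1", "canon", 1.0),
          ("B", "b2", "finder", 0.5), ("B", "b2", "canon", 0.25)]
result = inverted_index_softjoin(tuples)
```

Only pairs that share at least one term ever appear in the output, which is the main saving over comparing every a with every b; very frequent terms still generate many partial products.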
An alternative TFIDF pipeline
Inverted Index Softjoin – PIG 1/3
Inverted Index Softjoin – 2/3
Inverted Index Softjoin – 3/3
Results…..
Making the algorithm smarter….
Inverted Index Softjoin - 2
we should make a smart choice about which terms to use
Adding heuristics to the soft join - 1
Adding heuristics to the soft join - 2