bahman bahmani ashish goel stanford university
DESCRIPTION
Social Search and Related Problems. Bahman Bahmani Ashish Goel Stanford University. Challenge. Careful extension of existing algorithms to modern data models Large body of theory work Distributed Computing PRAM models Streaming Algorithms Sparsification , Spanners, Embeddings - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/1.jpg)
Bahman BahmaniAshish Goel
Stanford University
Social Search and Related Problems
![Page 2: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/2.jpg)
Challenge• Careful extension of existing algorithms to
modern data models• Large body of theory work– Distributed Computing– PRAM models– Streaming Algorithms– Sparsification, Spanners, Embeddings– LSH, MinHash, Clustering– Primal Dual
• Adapt the wheel, not reinvent it
![Page 3: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/3.jpg)
Data Model #1: Map Reduce• An immensely successful idea which transformed offline analytics and bulk-data
processing. Hadoop (initially from Yahoo!) is the most popular implementation.
• MAP: Transforms a (key, value) pair into other (key, value) pairs using a UDF (User Defined Function) called Map. Many mappers can run in parallel on vast amounts of data in a distributed file system
• SHUFFLE: The infrastructure then transfers data from the mapper nodes to the “reducer” nodes so that all the (key, value) pairs with the same key go to the same reducer
• REDUCE: A UDF that aggregates all the values corresponding to a key. Many reducers can run in parallel.
![Page 4: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/4.jpg)
A Motivating Example: Continuous Map Reduce
• There is a stream of data arriving (eg. tweets) which needs to be mapped to timelines
• Simple solution?– Map: (user u, string tweet, time t)
(v1, (tweet, t))(v2, (tweet, t))…(vK, (tweet, t))
where v1, v2, …, vK follow u.– Reduce : (user v, (tweet_1, t1), (tweet_2, t2), …
(tweet_J, tJ)) sort tweets in descending order of time
![Page 5: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/5.jpg)
Data Model #2: Active DHT
• DHT (Distributed Hash Table): Stores key-value pairs in main memory on a cluster such that machine H(key) is responsible for storing the pair (key, val)
• Active DHT: In addition to lookups and insertions, the DHT also supports running user-specified code on the (key, val) pair at node H(key)
• Like Continuous Map Reduce, but reducers can talk to each other
![Page 6: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/6.jpg)
Problem #1: Incremental PageRank
• Assume social graph is stored in an Active DHT• Estimate PageRank using Monte Carlo: Maintain a
small number R of random walks (RWs) starting from each node
• Store these random walks also into the Active DHT, with each node on the RW as a key– Number of RWs passing through a node ~= PageRank
• New edge arrives: Change all the RWs that got affected• Suited for Social Networks
![Page 7: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/7.jpg)
Incremental PageRank• Assume edges are chosen by an adversary, and arrive in
random order• Assume N nodes• Amount of work to update PageRank estimates of every
node when the M-th edge arrives = (RN/ε2)/M which goes to 0 even for moderately dense graphs
• Total work: O((RN log M)/ε2)• Consequence: Fast enough to handle changes in edge
weights when social interactions occur (clicks, mentions, retweets etc)
[Joint work with Bahmani and Chowdhury]
![Page 8: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/8.jpg)
Data Model #3: Batched + Stream
• Part of the problem is solved using Map-Reduce/some other offline system, and the rest solved in real-time
• Example: The incremental PageRank solution for the Batched + Stream model: Compute PageRank initially using a Batched system, and update in real-time
• Another Example: Social Search
![Page 9: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/9.jpg)
Problem Statement: Real-Time Social Search
• Real-Time Social Search: Find a piece of content that is exciting to my extended network right now and matches my search criteria
• Hard technical problem: imagine building 100M real-time indexes over real-time content
![Page 10: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/10.jpg)
Related Work: Social Search
• Social Search problem and its variants heavily studied in literature: o Name search on social networks: Vieira et al. '07o Social question and answering: Horowitz et al. '10o Personalization of web search results based on user’s
social network: Carmel et al. '09, Yin et al. '10o Social network document ranking: Gou et al. '10o Search in collaborative tagging nets: Yahia et al '08
• Shortest paths proposed as the main proxy
![Page 11: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/11.jpg)
Current Status: No Known Efficient, Systematic Solution...
![Page 12: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/12.jpg)
... Even without the Real-Time Component
![Page 13: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/13.jpg)
Related Work: Distance Oracles
• Approximate distance oracles: Bourgain, Dor et al '00, Thorup-Zwick '01, Das Sarma et al '10, ...• Family of Approximating and Eliminating Search
Algorithms (AESA) for metric space near neighbor search: Shapiro '77, Vidal '86, Mico et al. '94, etc.• Family of "Distance-based indexing" methods for
metric space similarity searching: surveyed by Chavez et al. '01, Hjaltason et al. '03
![Page 14: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/14.jpg)
Formal Definition• The Data Model– Static undirected social graph with N nodes, M edges– A dynamic stream of updates at every node– Every update is an addition or a deletion of a keyword
• Corresponds to a user producing some content (tweet, blog post, wall status etc) or liking some content, or clicking on some content
• Could have weights
• The Query Model– A user issues a single keyword query, and is returned
the closest node which has that keyword
![Page 15: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/15.jpg)
The Processing Model: Active DHT• DHT (Distributed Hash Table): Stores key-value pairs in
main memory on a cluster such that machine H(key) is responsible for storing the pair (key, val)
• Active DHT: In addition to lookups and insertions, the DHT also supports running user-specified code on the (key, val) pair at node H(key)
• Examples: Twitter’s Storm; LinkedIn’s Kafka; Yahoo’s S4 (all open source)– Largely subsumes “Streaming MapReduce” and
Pregel
![Page 16: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/16.jpg)
Our Contribution• Bridge the gap between distance oracles and social
search. Propose a Scheme that– Takes O~(M) time for offline graph processing (uses
Das Sarma et al’s oracle)– Takes O~ (1) time for index operations (i.e. query and
update)– Can be efficiently implemented on an Active DHT
with O~ (C) total memory where C is the corpus size, and with O~ (1) DHT calls per index operation in the worst case, and two DHT calls in a common case
• Empirical validation
![Page 17: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/17.jpg)
Partitioned Multi-Indexing: Overview
• Maintain a small number (e.g., 100) indexes of real-time content, and a corresponding small number of distance sketches [Hence, ”multi”]• Each index is partitioned into up to N/2
smaller indexes [Hence, “partitioned”]• Content indexes can be updated in real-time;
Distance sketches are batched• Real-time efficient querying on Active DHT
![Page 18: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/18.jpg)
Distance Sketch: Overview• Sample sets Si of size N/2i from the set of all nodes V,
where i ranges from 1 to log N• For each Si, for each node v, compute:– The “landmark node” Li(v) in Si closest to v
– The distance Di(v) of v to L(v)• Intuition: if u and v have the same landmark in set Si then
this set witnesses that the distance between u and v is at most Di(u) + Di(v), else Si is useless for the pair (u,v)
• Repeat the entire process O(log N) times for getting good results
![Page 19: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/19.jpg)
Distance Sketch: Overview• Sample sets Si of size N/2i from the set of all nodes V,
where i ranges from 1 to log N• For each Si, for each node v, compute:– The “landmark” Li(v) in Si closest to v
– The distance Di(v) of v to L(v)• Intuition: if u and v have the same landmark in set Si then
this set witnesses that the distance between u and v is at most Di(u) + Di(v), else Si is useless for the pair (u,v)
• Repeat the entire process O(log N) times for getting good results
BFS-LIKE
![Page 20: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/20.jpg)
Distance Sketch: Overview• Sample sets Si of size N/2i from the set of all nodes V,
where i ranges from 1 to log N• For each Si, for each node v, compute:– The “landmark” Li(v) in Si closest to v
– The distance Di(v) of v to L(v)• Intuition: if u and v have the same landmark in set Si then
this set witnesses that the distance between u and v is at most Di(u) + Di(v), else Si is useless for the pair (u,v)
• Repeat the entire process O(log N) times for getting good results
![Page 21: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/21.jpg)
Distance Sketch: Overview• Sample sets Si of size N/2i from the set of all nodes V,
where i ranges from 1 to log N• For each Si, for each node v, compute:– The “landmark” Li(v) in Si closest to v
– The distance Di(v) of v to L(v)• Intuition: if u and v have the same landmark in set Si then
this set witnesses that the distance between u and v is at most Di(u) + Di(v), else Si is useless for the pair (u,v)
• Repeat the entire process O(log N) times for getting good results
![Page 22: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/22.jpg)
![Page 23: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/23.jpg)
Node Si
u v Landmark
![Page 24: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/24.jpg)
Node Si
u v Landmark
![Page 25: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/25.jpg)
Node Si
u v Landmark
![Page 26: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/26.jpg)
Node Si
u v Landmark
![Page 27: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/27.jpg)
Node Si
u v Landmark
![Page 28: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/28.jpg)
Node Si
u v Landmark
![Page 29: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/29.jpg)
Node Si
u v Landmark
![Page 30: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/30.jpg)
Node Si
u v Landmark
![Page 31: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/31.jpg)
Node Si
u v Landmark
![Page 32: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/32.jpg)
Node Si
u v Landmark
![Page 33: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/33.jpg)
Node Si
u v Landmark
![Page 34: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/34.jpg)
Node Si
u v Landmark
![Page 35: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/35.jpg)
Partitioned Multi-Indexing: Overview
• Maintain a priority queue PMI(i, x, w) for every sampled set Si, every node x in Si, and every keyword w
• When a keyword w arrives at node v, add node v to the queue PMI(i, Li(v), w) for all sampled sets Si
– Use Di(v) as the priority
– The inserted tuple is (v, Di(v))• Perform analogous steps for keyword deletion• Intuition: Maintain a separate index for every Si,
partitioned among nodes in Si
![Page 36: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/36.jpg)
Querying: Overview• If node u queries for keyword w, then look for
the best result among the top results in exactly one partition of each index Si
– Look at PMI(i, Li(u), w)
– If non-empty, look at the top tuple <v,Di(v)>, and return the result <i, v, Di(u) + Di(v)>
• Choose the tuple <i, v, D> with smallest D
![Page 37: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/37.jpg)
Intuition• Suppose node u queries for keyword w, which
is present at a node v very close to u
– It is likely that u and v will have the same landmark in a large sampled set Si and that landmark will be very close to both u and v.
![Page 38: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/38.jpg)
Node Si
u w Landmark
![Page 39: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/39.jpg)
Node Si
u w Landmark
![Page 40: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/40.jpg)
Node Si
u w Landmark
![Page 41: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/41.jpg)
Node Si
u w Landmark
![Page 42: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/42.jpg)
Node Si
u w Landmark
![Page 43: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/43.jpg)
Node Si
u w Landmark
![Page 44: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/44.jpg)
Distributed Implementation
• Sketching easily done on MapReduce• Indexing operations (updates and search queries)
can be implemented on an Active DHT• Basic Idea: Shard sketches based on node, and indexes
based on word• Only 2 network accesses per query/update (assumes that
the total index size for the keyword is small enough to fit on one machine; else O~(1))
• Total network communication almost constant per update or per search result.
![Page 45: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/45.jpg)
Theorems1. Efficiency: Already specified
![Page 46: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/46.jpg)
Our Contribution (revisited)• Bridge the gap between distance oracles and social search– Propose an algorithm that take O~(M) time for offline
graph processing (uses Das Sarma et al’s oracle)– Takes O~ (1) time for index operations (i.e. query and
update)– Can be efficiently implemented on an Active DHT with
O~ (C) total memory where C is the corpus size, and with O~ (1) DHT calls per index operation in the worst case, and two DHT calls per in a common case
• Empirical validation
![Page 47: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/47.jpg)
Theorems2. Correctness: Suppose– Node v issues a query for word w– There exists a node x with the word w
Then we find a node y which contains w such that, with high probability,
d(v,y) = O(log N)d(v,x)Builds on Das Sarma et al; much better in
practice (typically,1 + ε rather than O(log N))
![Page 48: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/48.jpg)
Extensions: Combining with Other Relevance Measures
• Examples of important relevance measures: PageRank, tf-idf, and recency (for real-time search results).• Rank based on combined score function:
• Approximate variant decomposes:
and leads to same guarantees.
![Page 49: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/49.jpg)
Extensions: Multiple Results
• Can easily extend scheme to get J results per query
![Page 50: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/50.jpg)
Extensions: Other Distance Measures
• Distance measure only used for computing the distance sketch
• Hence, our scheme extends to any distance measure where O~(1) “bfs-like” computations can be performed efficiently offline (eg. on MapReduce)
![Page 51: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/51.jpg)
Experimental Setup
• The twitter network is a sub-sample (~4M nodes); the corpus was bios of users; queries synthesized using random walks to model a user's behavior
• Measures:• Number of failed queries, i.e. where none of the top J results is as
good as the known target node• Average depth of successful queries
![Page 52: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/52.jpg)
Experimental Results
![Page 53: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/53.jpg)
![Page 54: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/54.jpg)
Experimental Results
![Page 55: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/55.jpg)
![Page 56: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/56.jpg)
Experimental Results
![Page 57: Bahman Bahmani Ashish Goel Stanford University](https://reader036.vdocuments.mx/reader036/viewer/2022062222/568161eb550346895dd21cb1/html5/thumbnails/57.jpg)
Future Work• Formal analysis over generative models
• Multi-keyword queries
• Other measures of social closeness (eg. PageRank/hitting times)
• Implementation over active DHT (eg. Storm)
– Partial progress