OverCite: A Distributed, Cooperative CiteSeer
Jeremy Stribling, Jinyang Li, Isaac G. Councill,
M. Frans Kaashoek, Robert Morris
MIT Computer Science and Artificial Intelligence Laboratory
UC Berkeley/New York University
Pennsylvania State University
People Love CiteSeer
• Online repository of academic papers
• Crawls, indexes, links, and ranks papers
• Important resource for CS community
typical unification of access points and rereliable web services
People Love CiteSeer Too Much
• Burden of running the system forced on one site
• Scalability to large document sets uncertain
• Adding new resources is difficult
What Can We Do?
• Solution #1: All your © are belong to ACM
• Solution #2: Donate money to PSU
• Solution #3: Run your own mirror
• Solution #4: Aggregate donated resources
Solution: OverCite
Client
Rest of the talk focuses on how to achieve this
CiteSeer Today: Hardware
• Two 2.8-GHz servers at PSU
Client
CiteSeer Today: Search
[Screenshot callouts: search keywords, results meta-data, context]
CiteSeer Today: Documents
[Screenshot callouts: cached doc, cited by]
CiteSeer: Local Resources
# documents 675,000
Document storage 803 GB
Meta-data storage 45 GB
Index size 22 GB
Total storage 870 GB
Searches 250,000/day
Document traffic 21 GB/day
Total traffic 34.4 GB/day
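To put the daily figures above on a per-second scale, the short sketch below converts them to average rates and checks the storage total. The variable names are made up for illustration, and the bandwidth conversion assumes decimal units (1 GB = 1000 MB), which the slide does not state.

```python
# Figures copied from the "CiteSeer: Local Resources" slide.
SECONDS_PER_DAY = 24 * 60 * 60

searches_per_day = 250_000
document_traffic_gb_per_day = 21.0
total_traffic_gb_per_day = 34.4
storage_gb = {"documents": 803, "meta-data": 45, "index": 22}

# Average load if searches were spread evenly over the day (real load is bursty).
avg_searches_per_sec = searches_per_day / SECONDS_PER_DAY

# Average outbound bandwidth in Mbit/s, assuming 1 GB = 1000 MB (an assumption).
avg_mbit_per_sec = total_traffic_gb_per_day * 1000 * 8 / SECONDS_PER_DAY

# Traffic that is not document downloads (search results, meta-data pages, ...).
other_traffic_gb_per_day = round(total_traffic_gb_per_day - document_traffic_gb_per_day, 1)

total_storage_gb = sum(storage_gb.values())

print(f"average searches/sec: {avg_searches_per_sec:.2f}")
print(f"average bandwidth:    {avg_mbit_per_sec:.2f} Mbit/s")
print(f"non-document traffic: {other_traffic_gb_per_day} GB/day")
print(f"total storage:        {total_storage_gb} GB")
```

These are averages only; peak load on the single site would be higher.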
Goals and Challenge
• Goals
  – Parallel speedup
  – Lower burden per site
• Challenge: Distribute work over wide-area nodes
  – Storage
  – Search
  – Crawling
OverCite’s Approach
• Storage:
  – Use DHT for documents and meta-data
  – Achieve parallelism, balanced load, durability
• Search:
  – Divide docs into partitions, hosts into groups
  – Less search work per host
• Crawling:
  – Coordinate activity via DHT
The Life of a Query
The Life of a Download
[Diagram: a client, a web-based front end, index servers in Groups 1 to 4, and DHT storage (documents and meta-data).
Query labels: Query, Keywords, Hits w/ meta-data, rank and context, Meta-data req/resp, Results page.
Download labels: Document req, Document blocks, Document.]
Store Docs and Meta-data in DHT
• DHT stores papers for durability
• DHT stores meta-data tables
  – e.g., document IDs → {title, author, year, etc.}
• DHT provides load-balance and parallelism
[Diagram: four servers]
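The storage idea on this slide can be sketched as a toy in-memory DHT. This is not the Chord/DHash interface: the class name `ToyDHT`, its methods, and the server names are hypothetical, and the sketch assumes SHA-1 ring IDs with each block kept on the first `replicas` successors of its key.

```python
import bisect
import hashlib


def ring_id(data: bytes) -> int:
    """Map bytes to a position on the hash ring (SHA-1 digest as an integer)."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


class ToyDHT:
    """Illustrative only: every 'server' is a dict held in one process."""

    def __init__(self, node_names, replicas=2):
        self.replicas = replicas
        self.ring = sorted((ring_id(n.encode()), n) for n in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        self.stores = {n: {} for n in node_names}

    def _successors(self, key_id):
        """Return the nodes responsible for key_id, walking clockwise."""
        start = bisect.bisect_left(self.ids, key_id) % len(self.ring)
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def put(self, value: bytes, name=None) -> str:
        """Store a block under the hash of its content, or of `name` if given."""
        key = hashlib.sha1(name.encode() if name else value).hexdigest()
        for node in self._successors(int(key, 16)):
            self.stores[node][key] = value
        return key

    def get(self, key: str, down=()):
        """Fetch a block from the first responsible node that is not down."""
        for node in self._successors(int(key, 16)):
            if node not in down and key in self.stores[node]:
                return self.stores[node][key]
        return None


dht = ToyDHT(["server-1", "server-2", "server-3", "server-4"], replicas=2)
doc_key = dht.put(b"%PDF... paper bytes ...")
meta_key = dht.put(b'{"title": "OverCite", "year": 2006}', name="docid:42")
print(dht.get(doc_key) is not None, dht.get(meta_key))
```

Because keys are spread by hash, blocks land on different servers (load balance), and a second copy lets a read succeed when the first holder is unavailable (durability).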
Parallelizing Queries
• Partition by document
• Divide the index into k partitions
• Each query sent to only k nodes
[Diagram, full index: peer {1,2,3}, DHT {4}, mesh {2,4,6}, hash {1,2,5}, table {1,2,4}]
[Diagram, four servers with one partition each:
  Documents 1,5: peer {1}, hash {1,5}, table {1}
  Documents 2,6: peer {2}, mesh {2,6}, hash {2}, table {2}
  Document 3: peer {3}
  Document 4: DHT {4}, mesh {4}, table {4}]
[Diagram, four servers labeled Part. 1, Part. 1, Part. 2, Part. 2, in Group 1 and Group 2:
  Part. 1 (Documents 1,2,3): peer {1,2,3}, mesh {2}, hash {1,2}, table {1,2}
  Part. 2 (Documents 4,5,6): DHT {4}, hash {5}, mesh {4,6}, table {4}]
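Partition-by-document search can be sketched with the example postings from this slide. The function names, host names, and the rule that maps a document ID to a partition are assumptions chosen so the output matches the slide's two partitions; the sketch also assumes that a group is the set of hosts holding a copy of the same partition. It is not OverCite's code.

```python
import random

# Example postings from the slide: keyword -> document IDs.
FULL_INDEX = {
    "peer": {1, 2, 3},
    "DHT": {4},
    "mesh": {2, 4, 6},
    "hash": {1, 2, 5},
    "table": {1, 2, 4},
}


def partition_of(doc_id, k, num_docs):
    """Hypothetical rule: contiguous ranges of document IDs per partition."""
    return (doc_id - 1) * k // num_docs


def build_partitions(full_index, k, num_docs):
    """Split one inverted index into k smaller ones, by document."""
    partitions = [{} for _ in range(k)]
    for word, docs in full_index.items():
        for doc in docs:
            part = partitions[partition_of(doc, k, num_docs)]
            part.setdefault(word, set()).add(doc)
    return partitions


def search_partition(index, keywords):
    """AND query inside one partition: intersect the local posting lists."""
    postings = [index.get(word, set()) for word in keywords]
    return set.intersection(*postings) if postings else set()


def search(partitions, hosts_by_partition, keywords, rng=random):
    """Contact one host per partition (k hosts in total) and merge the hits."""
    hits, contacted = set(), []
    for part_id, hosts in enumerate(hosts_by_partition):
        contacted.append(rng.choice(hosts))  # any replica of this partition
        hits |= search_partition(partitions[part_id], keywords)
    return sorted(hits), contacted


partitions = build_partitions(FULL_INDEX, k=2, num_docs=6)
hosts = [["host-a", "host-b"], ["host-c", "host-d"]]  # made-up host names
print(search(partitions, hosts, ["hash", "table"]))
```

Because each document's postings live entirely in one partition, a multi-keyword query can be intersected locally on each host, and only the per-partition hit lists need to be merged.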
Considerations for k
• If k is small
  + Send queries to fewer hosts → less latency
  + Fewer DHT lookups
  – Less opportunity for parallelism
• If k is big
  + More parallelism
  + Smaller index partitions → faster searches
  – More hosts → some node likely to be slow
  – More DHT lookups
• Current deployment: k = 2
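The trade-off above can be made concrete with a back-of-the-envelope model. It is only a sketch: it assumes each contacted host is slow independently with probability `p_slow`, a made-up parameter, and that documents are split evenly across partitions. Neither assumption comes from the talk.

```python
def prob_some_host_slow(k, p_slow):
    """Chance that at least one of the k contacted hosts is slow,
    assuming hosts are slow independently with probability p_slow."""
    return 1.0 - (1.0 - p_slow) ** k


def docs_per_partition(total_docs, k):
    """Documents each index partition covers if the split is even."""
    return total_docs / k


TOTAL_DOCS = 675_000  # document count from the "Local Resources" slide
P_SLOW = 0.05         # hypothetical value, for illustration only

for k in (1, 2, 4, 8, 16):
    print(
        f"k={k:2d}  docs/partition={docs_per_partition(TOTAL_DOCS, k):9.0f}  "
        f"P(some host slow)={prob_some_host_slow(k, P_SLOW):.3f}"
    )
```

Larger k shrinks the work per host but raises the chance that a query waits on a straggler, which is the tension the slide describes.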
Implementation
• Storage: Chord/DHash DHT
• Index: Searchy search engine
• Web server: OKWS
• Anycast service: OASIS
• Event-based execution, using libasync
• 11,000 lines of C++ code
Deployment
• 27 nodes across North America
  – 9 RON/IRIS nodes + private machines
  – 47 physical disks, 3 DHash nodes per disk
  – Large range of disk and memory
Map source: http://www.coralcdn.org/oasis/servers
Evaluation Questions
• Does OverCite achieve parallel speedup?
• What is the per-node storage burden?
• What is the system-wide storage overhead?
Configuration
• Index first 5,000 words/document
• 2 partitions (k = 2)
• 20 results per query
• 2 replicas/block in the DHT
Evaluation Methods
9 Web front end servers
18 index servers
27 DHT servers
Client
1 client at MIT
1000 queries from CS trace
128 queries in parallel
Group 1
Group 2
More Servers → More Queries/sec
[Plot: queries/second (y-axis, 0 to 25) vs. number of index servers (x-axis, 2 to 18)]
• 9x servers → 7x query throughput
• CiteSeer serves 4.8 queries/sec
(All experiments use 27 DHT servers)
Per-node Storage Burden
Property: Individual Cost
Document/meta-data storage 18.1 GB
Index size 6.8 GB
Total storage 24.9 GB
System-wide Storage Overhead
Property: System Cost
Document/meta-data storage 18.1 GB * 47 = 850.7 GB
Index size 6.8 GB * 27 = 183.6 GB
Total storage 1034.3 GB
4x as expensive as raw underlying data
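The per-node and system-wide totals on the two storage slides can be reproduced with a few lines of arithmetic. The constant and function names below are made up for illustration; the numbers are copied from the slides, and results are rounded to one decimal place to match them.

```python
# Figures copied from the "Per-node Storage Burden" and
# "System-wide Storage Overhead" slides.
DOC_META_GB = 18.1   # document/meta-data storage, per-unit cost
INDEX_GB = 6.8       # index size, per-unit cost
DOC_META_UNITS = 47  # multiplier used on the slide (47 physical disks)
INDEX_UNITS = 27     # multiplier used on the slide (27 nodes)


def per_node_total_gb():
    """Individual cost: document/meta-data storage plus one index copy."""
    return round(DOC_META_GB + INDEX_GB, 1)


def system_totals_gb():
    """System cost: (document/meta-data, index, total), each rounded to 0.1 GB."""
    doc_meta = DOC_META_GB * DOC_META_UNITS
    index = INDEX_GB * INDEX_UNITS
    return round(doc_meta, 1), round(index, 1), round(doc_meta + index, 1)


print("per node:", per_node_total_gb(), "GB")
print("system (doc/meta, index, total):", system_totals_gb(), "GB")
```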
Future Work
• Production-level public deployment
• Distributed crawler
• Public API for developing new features
Related Work
• Search on DHTs
  – Partition by keyword [Li et al. IPTPS ’03, Reynolds & Vahdat Middleware ’03, Suel et al. IWWD ’03]
  – Hybrid schemes [Tang & Dwarkadas NSDI ’04, Loo et al. IPTPS ’04, Shi et al. IPTPS ’04, Rooter WMSCI ’05]
• Distributed crawlers [Loo et al. TR ’04, Cho & Garcia-Molina WWW ’02, Singh et al. SIGIR ’03]
• Other paper repositories [arXiv.org (Physics), ACM and Google Scholar (CS), Inspec (general science)]
Summary
• A system for storing and coordinating a digital repository using a DHT
• Spreads load across many volunteer nodes
• Simple to take advantage of new resources
• Run CiteSeer as a community
• Implementation and deployment
http://overcite.org