Post on 20-Jan-2016
![Page 1: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/1.jpg)
Web Communities: The World Online
Raghu Ramakrishnan
Chief Scientist for Audience and Cloud Computing
Research Fellow
Yahoo! (On leave, Univ. of Wisconsin-Madison)
![Page 2: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/2.jpg)
Evolution of Online Communities
![Page 3: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/3.jpg)
- 4 -Research
![Page 4: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/4.jpg)
Rate of content creation
• Estimated growth of content
– Published content from traditional sources: 3-4 Gb/day
– Professional web content: ~2 Gb/day
– User-generated content: 8-10 Gb/day
– Private text content: ~3 Tb/day (200x more)
– Upper bound on typed content: ~700 Tb/day
(Towards a PeopleWeb, Ramakrishnan & Tomkins, IEEE Computer, August 2007)
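The "200x more" figure can be sanity-checked with quick arithmetic. The sketch below assumes (this is a reading of the slide, not something it states) that the comparison is between private text and the sum of the three public categories, and that 1 Tb is taken as 1000 Gb.

```python
# Rough check of the "200x more" claim, under the assumptions stated above.
public_low = 3 + 2 + 8     # Gb/day: traditional + professional web + user-generated (low end)
public_high = 4 + 2 + 10   # Gb/day: same categories, high end
private_gb = 3 * 1000      # ~3 Tb/day of private text, assuming 1 Tb = 1000 Gb

ratio_low = private_gb / public_high   # most conservative ratio
ratio_high = private_gb / public_low   # most generous ratio
print(f"private text is roughly {ratio_low:.0f}x to {ratio_high:.0f}x public content")
```

Under these assumptions the ratio brackets 200, which is consistent with the rounded figure on the slide.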
![Page 5: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/5.jpg)
Metadata
• Estimated growth of metadata
– Anchortext: 100Mb/day
– Tags: 40Mb/day
– Pageviews: 100-200Gb/day
– Reviews: Around 10Mb/day
– Ratings: <small>
Drove most advances in search from 1996-present
Increasingly rich and available, but not yet useful in search
This is despite interactions on the web currently being limited because each site is essentially a silo
![Page 6: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/6.jpg)
PeopleWeb: Site-Centric → People-Centric
• Common web-wide id for objects (incl. users)
– Even common attributes? (e.g., pixels for camera objects)
• As users move across sites, their personas and social networks will be carried along
• Increased semantics on the web through community activity (another path to the goals of the Semantic Web)
Global Object Model
Portable Social Environment
Community Search
(Towards a PeopleWeb, Ramakrishnan & Tomkins, IEEE Computer, August 2007)
![Page 7: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/7.jpg)
Content Access and Ownership
(Slide courtesy Andrew Tomkins)
![Page 8: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/8.jpg)
Facebook Apps, Open Social
• Web site provides canvas
– Third party apps can paint on this canvas
– “Paint” comes from data on and off-network
    • Via APIs that each site chooses to expose
What is the core asset of a web portal?
• What are the computational implications?
– App hosting and caching
– Dynamic, personalized content
– Searching over “spaghetti” information threads
![Page 9: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/9.jpg)
Trends in Search
![Page 10: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/10.jpg)
Search and Content Supply
• Premise:
– People don’t want to search
– People want to get tasks done
I want to book a vacation in Tuscany.
Start → Finish
Broder 2002, A Taxonomy of web search
![Page 11: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/11.jpg)
“seafood san francisco”
Category: restaurant Location: San Francisco
Reserve a table for two tonight at SF’s best Sushi Bar and get a free sake, compliments of OpenTable!
Category: restaurant Location: San Francisco
Alamo Square Seafood Grill - (415) 440-2828 803 Fillmore St, San Francisco, CA - 0.93mi - map
Category: restaurant Location: San Francisco
Structure Intent
![Page 12: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/12.jpg)
Y! Shortcuts
![Page 13: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/13.jpg)
Google Base
![Page 14: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/14.jpg)
Search as Killer App for Web Data Semantics
• Publishers and search engine collaborate
– Example: Abstracts surfacing structured content
• Users see richer search experience
– Accomplish their tasks faster and more effectively
![Page 15: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/15.jpg)
Social Search
![Page 16: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/16.jpg)
Social Search
• Explicitly open up search
– Enable communities, sites and consumers to explicitly re-define search results (e.g., SearchMonkey, BOSS)
• What is the right unit for a “search result”? Can we intelligently “stitch together” more informative abstracts, possibly from multiple sources?
• Facilitate creation of specialized ranking engines based on different kinds of tasks, or aimed at different communities of users
• Implicitly leverage socially engaged users and their interactions
– Learning from shared community interactions, and leveraging community interactions to create and refine content
• Expanding search results to include sources of information
– E.g., Experts, sub-communities of shared interest, particular search engines (in a world with many, this is valuable!)
Reputation, Quality, Trust, Privacy
![Page 17: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/17.jpg)
Opening Up Yahoo! Search
Phase 1: Giving site owners and developers control over the appearance of Yahoo! Search results.
Phase 2: BOSS takes Yahoo!’s open strategy to the next level by providing Yahoo! Search infrastructure and technology to developers and companies to help them build their own search experiences.
(Slide courtesy Prabhakar Raghavan)
![Page 18: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/18.jpg)
What Is It?
Before After
An open platform for using structured data to build more useful and relevant search results
(Slide courtesy Amit Kumar)
![Page 19: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/19.jpg)
What’s New?
• task links: buy this, user reviews, best trips
• structured data: review ratings, product prices, hours of operation
• favicon
• send result: share this rich result with others
• media: product images, business photos, profile pictures
• user choice: remove, report spam
(Slide courtesy Amit Kumar)
![Page 20: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/20.jpg)
How Does It Work?
(1) Site owners/publishers share structured data with Yahoo!.
(2) Site owners & third-party developers build SearchMonkey apps.
(3) Consumers customize their search experience with Enhanced Results or Infobars.
Diagram labels: Acme.com’s database, Acme.com’s Web Pages, RDF/Microformat Markup, DataRSS feed, Web Services, Page Extraction, Index
(Slide courtesy Amit Kumar)
![Page 21: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/21.jpg)
Publishing Structured Data: Support for Emerging Semantic Web Standards ++
• Microformats
– hCard, hEvent, hReview, hAtom, XFN
– More as they get adopted
• RDFa and eRDF markup
• OpenSearch
– +extensions to return structured data
• Atom/RSS Feeds
– +extensions to embed structured data
markup (crawl)
apis (pull)
push
(Slide courtesy Andrew Tomkins)
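To make the "markup (crawl)" path concrete, here is a minimal sketch of how a crawler might pull hCard properties out of a page. It is an illustration only, not how SearchMonkey worked: the `HCardParser` class is hypothetical, it handles just the `fn`, `org` and `tel` properties, and it assumes well-formed markup with no unclosed void elements such as a bare `<br>`. The sample listing reuses the restaurant from the earlier search example.

```python
from html.parser import HTMLParser


class HCardParser(HTMLParser):
    """Hypothetical helper: collects text for a few hCard properties."""

    PROPERTIES = {"fn", "org", "tel"}  # small illustrative subset of hCard

    def __init__(self):
        super().__init__()
        self.cards = []    # one dict per element with class "vcard"
        self._stack = []   # property name (or None) for each open element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "vcard" in classes:
            self.cards.append({})
        prop = next((c for c in classes if c in self.PROPERTIES), None)
        self._stack.append(prop)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute text to the innermost enclosing property, if any.
        prop = next((p for p in reversed(self._stack) if p), None)
        text = data.strip()
        if prop and self.cards and text:
            card = self.cards[-1]
            card[prop] = (card.get(prop, "") + " " + text).strip()


page = """
<div class="vcard">
  <span class="fn">Alamo Square Seafood Grill</span>
  <span class="tel">(415) 440-2828</span>
</div>
"""
parser = HCardParser()
parser.feed(page)
print(parser.cards)
```

A production extractor would need to cover the full microformat vocabulary and tolerate messy HTML; the point here is only that class names in the markup give the crawler typed fields instead of free text.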
![Page 22: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/22.jpg)
Infobars: Integrating 3rd Party Data
Pull in data from any web service
(Slide courtesy Amit Kumar)
![Page 23: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/23.jpg)
Search Results of the Future
babycenter
epicurious
yelp.com
answers.com
webmd
Gawker
New York Times
(Slide courtesy Andrew Tomkins)
![Page 24: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/24.jpg)
BOSS Offerings
BOSS offers two options for companies and developers and has partnered with top technology universities to drive search experimentation, innovation and research into next generation search.
API
A self-service, web services model for developers and start-ups to quickly build and deploy new search experiences.
CUSTOM
Working with 3rd parties to build a more relevant, brand/site specific web search experience. This option is jointly built by Yahoo! and select partners.
ACADEMIC
Working with the following universities to allow for wide-scale research in the search field:
• University of Illinois Urbana-Champaign
• Carnegie Mellon University
• Stanford University
• Purdue University
• MIT
• Indian Institute of Technology Bombay
• University of Massachusetts
(Slide courtesy Prabhakar Raghavan)
![Page 25: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/25.jpg)
BOSS Could Enable Custom Search Experiences
Social Search
Vertical Search
Visual Search
(Slide courtesy Prabhakar Raghavan)
![Page 26: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/26.jpg)
Partner Examples
![Page 27: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/27.jpg)
Web Search Results for “Lisa”
Latest news results for “Lisa”. Mostly about people because Lisa is a popular name
Web search results are very diversified, covering pages about organizations, projects, people, events, etc.
41 results from My Web!
![Page 28: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/28.jpg)
Save / Tag Pages You Like
You can save / tag pages you like into My Web from toolbar / bookmarklet / save buttons
You can pick tags from the suggested tags based on collaborative tagging technology
Type-ahead based on the tags you have used
Enter your note for personal recall and sharing purpose
You can specify a sharing mode
You can save a cache copy of the page content
(Courtesy: Raymie Stata)
![Page 29: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/29.jpg)
My Web 2.0 Search Results for “Lisa”
Excellent set of search results from my community because a couple of people in my community are interested in Usenix Lisa-related topics
![Page 30: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/30.jpg)
Google Co-Op
This query matches a pattern provided by Contributor…
…so SERP displays (query-specific) links programmed by Contributor.
Subscribed Link
edit | remove
Query-based direct-display, programmed by Contributor
Users “opt in” by “subscribing” to them
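The subscribed-link mechanism described on this slide can be sketched as pattern matching plus an opt-in check. Everything below is hypothetical and for illustration: the `SubscribedLink` class, the regular-expression pattern syntax, and the `example.com` URL are assumptions, not Google Co-Op's actual format.

```python
import re
from dataclasses import dataclass
from urllib.parse import quote


@dataclass
class SubscribedLink:
    contributor: str
    pattern: str         # regular expression the whole query must match
    link_template: str   # may reference named groups from the pattern

    def render(self, query):
        match = re.fullmatch(self.pattern, query.strip(), flags=re.IGNORECASE)
        if match is None:
            return None
        # URL-encode captured query fragments before filling the template.
        fields = {k: quote(v) for k, v in match.groupdict(default="").items()}
        return self.link_template.format(**fields)


def direct_display(query, links, subscriptions):
    """Return links to show: the user must subscribe AND the query must match."""
    results = []
    for link in links:
        if link.contributor in subscriptions:
            rendered = link.render(query)
            if rendered is not None:
                results.append(rendered)
    return results


links = [
    SubscribedLink(
        contributor="weather-fan",
        pattern=r"weather (?P<city>.+)",
        link_template="https://example.com/weather?city={city}",
    )
]
print(direct_display("weather Madison", links, {"weather-fan"}))
print(direct_display("weather Madison", links, set()))  # not subscribed
```

The second call returns nothing because the user has not opted in, which is the property the slide emphasizes: contributors program the display, but users decide whose programs run.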
![Page 31: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/31.jpg)
![Page 32: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/32.jpg)
Tech Support at COMPAQ
“In newsgroups, conversations disappear and you have to ask the same question over and over again. The thing that makes the real difference is the ability for customers to collaborate and have information be persistent. That’s how we found QUIQ. It’s exactly the philosophy we’re looking for.”
“Tech support people can’t keep up with generating content and are not experts on how to effectively utilize the product … Mass Collaboration is the next step in Customer Service.”
– Steve Young, VP of Customer Care, Compaq
![Page 33: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/33.jpg)
How It Works
Diagram labels: Customer, QUESTION, SELF SERVICE, KNOWLEDGEBASE, ANSWER, Support Agent
– Partner Experts
– Customer Champions
– Employees
Answer added to power self service
![Page 34: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/34.jpg)
Timely Answers
77% of answers provided within 24h
Questions: 6,845 (74% answered)
Answers provided in 3h: 40% (2,057)
Answers provided in 12h: 65% (3,247)
Answers provided in 24h: 77% (3,862)
Answers provided in 48h: 86% (4,328)
• No effort to answer each question
• No added experts
• No monetary incentives for enthusiasts
![Page 35: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/35.jpg)
Power of Knowledge Creation
~80%
Support Incidents Agent Cases
5-10%
Self-Service *)
Customer Mass Collaboration *)
Knowledge Creation
SHIELD 1
SHIELD 2
*) Averages from QUIQ implementations
SUPPORT
![Page 36: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/36.jpg)
Mass Contribution
Users who on average provide only 2 answers provide 50% of all answers
Contributing Users: 7% (120) top users, 93% (1,503) mass of users
Answers: 50% (3,329) of 100% (6,718) contributed by mass of users
![Page 37: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/37.jpg)
Interesting Problems
• Question categorization
• Detecting undesirable questions & answers
• Identifying “trolls”
• Ranking results in Answers search
• Finding related questions
• Estimating question & answer quality
(Byron Dom: SIGIR talk)
![Page 38: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/38.jpg)
Supplying Structured Search Content
• Semantic Web?
• Unleash community computing—PeopleWeb!
• Three ways to create semantically rich summaries that address the user’s information needs:
– Editorial, Extraction, UGC
Challenge: Design social interactions that lead to creation and maintenance of high-quality structured content
![Page 39: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/39.jpg)
Better Search via Information Extraction
• Extract, then exploit, structured data from raw text:
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

PEOPLE

| Name | Title | Organization |
| --- | --- | --- |
| Bill Gates | CEO | Microsoft |
| Bill Veghte | VP | Microsoft |
| Richard Stallman | Founder | Free Soft.. |

Select Name From PEOPLE Where Organization = 'Microsoft'

Bill Gates
Bill Veghte

(from Cohen’s IE tutorial, 2003)
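The "extract, then exploit" idea can be tried end to end with an in-memory SQLite table. This is a sketch of the exploit half only: the rows are typed in by hand as a stand-in for real extractor output, and the lowercase table and column names are choices made for this example. The organization name for Richard Stallman is taken from the passage text.

```python
import sqlite3

# Hand-entered rows standing in for the output of an extraction step.
extracted = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "Founder", "Free Software Foundation"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, title TEXT, organization TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", extracted)

# The query from the slide, with the organization passed as a parameter.
cursor = conn.execute(
    "SELECT name FROM people WHERE organization = ? ORDER BY name",
    ("Microsoft",),
)
names = [row[0] for row in cursor]
print(names)
```

Once the text has been turned into rows, the question "who works at Microsoft?" becomes a one-line structured query rather than a keyword search over the raw passage.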
![Page 40: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/40.jpg)
Community Information Management (CIM)
• Many real-life communities have a Web presence
– Database researchers, movie fans, stock traders
• Each community = many data sources + people
• Members want to query and track at a semantic level:
– Any interesting connection between researchers X and Y?
– List all courses that cite this paper
– Find all citations of this paper in the past one week on the Web
– What is new in the past 24 hours in the database community?
– Which faculty candidates are interviewing this year, where?
![Page 41: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/41.jpg)
DBLife
Integrated information about a (focused) real-world community
Collaboratively built and maintained by the community
Semantic web via extraction & community
![Page 42: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/42.jpg)
DBLife
• Faculty: AnHai Doan & Raghu Ramakrishnan
• Students: P. DeRose, W. Shen, F. Chen, R. McCann, Y. Lee, M. Sayyadian
• Prototype system up and running since early 2005
• Plan to release a public version of the system in Spring 2007
• 1164 sources, crawled daily, 11000+ pages / day
• 160+ MB, 121400+ people mentions, 5600+ persons
• See DE overview article, CIDR 2007 demo
![Page 43: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/43.jpg)
DBLife Papers
• Efficient Information Extraction over Evolving Text Data, F. Chen, A. Doan, J. Yang, R. Ramakrishnan. ICDE-08.
• Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach, P. DeRose, W. Shen, F. Chen, A. Doan, R. Ramakrishnan. VLDB-07.
• Declarative Information Extraction Using Datalog with Embedded Extraction Predicates, W. Shen, A. Doan, J. Naughton, R. Ramakrishnan. VLDB-07.
• Source-aware Entity Matching: A Compositional Approach, W. Shen, A. Doan, J.F. Naughton, R. Ramakrishnan: ICDE 2007.
• OLAP over Imprecise Data with Domain Constraints, D. Burdick, A. Doan, R. Ramakrishnan, S. Vaithyanathan. VLDB-07.
• Community Information Management, A. Doan, R. Ramakrishnan, F. Chen, P. DeRose, Y. Lee, R. McCann, M. Sayyadian, and W. Shen. IEEE Data Engineering Bulletin, Special Issue on Probabilistic Databases, 29(1), 2006.
• Managing Information Extraction, A. Doan, R. Ramakrishnan, S. Vaithyanathan. SIGMOD-06 Tutorial.
![Page 44: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/44.jpg)
DBLife
• Integrate data of the DB research community
• 1164 data sources
Crawled daily, 11000+ pages = 160+ MB / day
![Page 45: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/45.jpg)
Entity Extraction and Resolution
Raghu Ramakrishnan
co-authors = A. Doan, Divesh Srivastava, ...
![Page 46: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/46.jpg)
Resulting ER Graph
Nodes: “Proactive Re-optimization”, Jennifer Widom, Shivnath Babu, David DeWitt, Pedro Bizarro, SIGMOD 2005
Edge labels: coauthor (×3), advise (×2), write (×3), PC-Chair, PC-member
![Page 47: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/47.jpg)
Challenges
• Extraction
– Domain-level vs. site-level extraction “templates”
    • Compositional, customizable approach to extraction planning
– Blending extraction with other sources (feeds, wiki-style user edits)
• Maintenance of extracted information
– Managing information extraction
– Incremental maintenance of “extracted views” at large scales
– Mass Collaboration: community-based maintenance
• Exploitation
– Search/query over extracted structures in a community
– Search across communities: Semantic Web through the back door!
– Detect interesting events and changes
![Page 48: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/48.jpg)
Mass Collaboration
• We want to leverage user feedback to improve the quality of extraction over time.
– Maintaining an extracted “view” on a collection of documents over time is very costly; getting feedback from users can help
– In fact, distributing the maintenance task across a large group of users may be the best approach
![Page 49: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/49.jpg)
Mass Collaboration: A Simplified Example
Not David!
Picture is removed if enough users vote “no”.
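The removal rule on this slide can be sketched as a simple vote threshold. The slide does not say what "enough" means, so both thresholds below (`min_no_votes` and `min_no_fraction`) and the function name are assumptions chosen for illustration.

```python
from collections import Counter


def should_remove(votes, min_no_votes=3, min_no_fraction=0.6):
    """Decide whether a picture should be removed.

    votes maps user id -> "yes" or "no", so each user counts once.
    Both thresholds are hypothetical values, not taken from the slide.
    """
    counts = Counter(votes.values())
    no_votes = counts["no"]
    total = sum(counts.values())
    # Require an absolute number of "no" votes and a clear majority.
    return no_votes >= min_no_votes and no_votes / total >= min_no_fraction


print(should_remove({"u1": "no", "u2": "no", "u3": "no", "u4": "yes"}))
print(should_remove({"u1": "no"}))
```

Requiring both a minimum count and a minimum fraction keeps a single early "no" from removing a picture, though it does nothing against coordinated false votes.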
![Page 50: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/50.jpg)
Mass Collaboration Meets Spam
Jeffrey F. Naughton swears that this is David J. DeWitt
![Page 51: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/51.jpg)
Incorporating Feedback
A. Gupta, D. Smith, Text mining, SIGMOD-06
System extracted “Gupta, D” as a person name
User says this is wrong
System extracted “Gupta, D” using rules:
(R1) David Gupta is a person name
(R2) If “first-name last-name” is a person name, then “last-name, f” is also a person name.
Knowing this, system can potentially improve extraction accuracy.
(1) Discover corrective rules
(2) Find and fix other incorrect applications of R1 and R2
A general framework for incorporating feedback?
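One way to picture step (2) is to record, for every extraction, which rules produced it, so that a single user correction can flag other results with the same provenance. The sketch below is a toy under stated assumptions: the `Extraction` record, the substring matching, and the second document are all hypothetical, and only R1 and R2 come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    text: str      # extracted string, e.g. "Gupta, D"
    source: str    # id of the document it was found in
    rules: tuple   # provenance: rules that produced this extraction


def abbreviate(full_name):
    """Rule R2: 'first-name last-name' -> 'last-name, f'."""
    first, last = full_name.split()
    return f"{last}, {first[0]}"


def extract(documents, known_names):
    found = []
    for source, text in documents.items():
        for name in known_names:            # R1: name is a known person name
            variant = abbreviate(name)      # R2: derive the abbreviated form
            if variant in text:
                found.append(Extraction(variant, source, ("R1", "R2")))
    return found


def suspects(extractions, rejected):
    """Other extractions produced the same way as one a user rejected."""
    return [
        e for e in extractions
        if e is not rejected and e.rules == rejected.rules and e.text == rejected.text
    ]


documents = {
    "doc1": "A. Gupta, D. Smith, Text mining, SIGMOD-06",
    "doc2": "B. Gupta, D. Jones, Data cleaning",  # hypothetical second document
}
found = extract(documents, ["David Gupta"])
rejected = found[0]                # the user says the doc1 extraction is wrong
print(suspects(found, rejected))   # the doc2 extraction is now worth re-checking
```

Discovering a corrective rule, step (1), is the harder problem; here it might be "do not apply R2 when the match spans an author-list separator", but learning such rules automatically is exactly the open question the slide poses.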
![Page 52: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/52.jpg)
Collaborative Editing
• Users should be able to
– Correct/add to the imported data
– E.g., User imports a paper, system provides bib item
• Challenges
– Incentives, reputation
– Handling malicious/spam users
– Ownership model
    • My home page vs. a citation that appears on it
– Reconciliation
    • Extracted vs. manual input
    • Conflicting input from different users
![Page 53: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/53.jpg)
The Purple SOX Project (SOcial eXtraction)
Application Layer
• Shopping, Travel, Autos
• Academic Portals (DBLife/MeYahoo)
• Enthusiast Platform
• …and many others
Extraction Management System (e.g., Vertex, Societek)
Operator Library
![Page 54: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/54.jpg)
Web Data Management: Massively Distributed Hosted Systems
![Page 55: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/55.jpg)
An Example Web App
Updates: uploads; tags as “flower”
Queries:
» Friend activity: Sonja uploaded, Brandon tagged a photo
» Your Photos
» Photos tagged as “flower”
Heavy use of simple database operations
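The photo-sharing example boils down to a handful of single-row inserts and indexed lookups, which the sketch below reproduces with an in-memory SQLite database. The schema and function names are assumptions made for illustration; the slide shows only the user-facing operations.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE photos (id INTEGER PRIMARY KEY, owner TEXT NOT NULL);
    CREATE TABLE tags (photo_id INTEGER REFERENCES photos(id), tagger TEXT, tag TEXT);
    CREATE INDEX tags_by_tag ON tags(tag);
""")


def upload(owner):
    """Update: record a new photo and return its id."""
    return db.execute("INSERT INTO photos (owner) VALUES (?)", (owner,)).lastrowid


def tag_photo(photo_id, tagger, label):
    """Update: attach a tag to a photo."""
    db.execute("INSERT INTO tags VALUES (?, ?, ?)", (photo_id, tagger, label))


def photos_of(owner):
    """Query: 'Your Photos'."""
    rows = db.execute("SELECT id FROM photos WHERE owner = ? ORDER BY id", (owner,))
    return [r[0] for r in rows]


def photos_tagged(label):
    """Query: photos tagged with a given label."""
    rows = db.execute(
        "SELECT DISTINCT photo_id FROM tags WHERE tag = ? ORDER BY photo_id", (label,)
    )
    return [r[0] for r in rows]


photo = upload("Sonja")
tag_photo(photo, "Brandon", "flower")
print(photos_of("Sonja"), photos_tagged("flower"))
```

None of these operations needs joins across many tables or long transactions, which is why such workloads are good candidates for a simple hosted API rather than a full relational system per application.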
![Page 56: Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)](https://reader035.vdocuments.mx/reader035/viewer/2022081515/56649d2b5503460f94a0075d/html5/thumbnails/56.jpg)
The Problem
What does it take to build the next big app?
Why Hosted?
• Simple API
• No maintenance worries for application
• Single ops team
• Resource sharing leads to savings
Data Analysis Platforms
• Understanding online communities, and provisioning their data needs
– Exploratory analysis over massive data sets
• Challenges: Analyze shared, evolving social networks of users, content, and interactions to
– Learn models of individual preferences and characteristics, and of community structure and dynamics
– Develop robust frameworks for evolution of authority and trust
– Extract and exploit structure from web content …
• Examples:
– Bigtable, Map-Reduce, Hadoop, PIG
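The Map-Reduce style named above can be illustrated in a few lines. The following is a single-process sketch of the map, shuffle, and reduce phases, not the Hadoop API; the event format `(user, photo, tag)` and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(records, mapper):
    # Apply the mapper to every record and flatten the emitted (key, value) pairs.
    return chain.from_iterable(mapper(r) for r in records)


def shuffle(pairs):
    # Group all values by key, as the framework would do between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    # Apply the reducer once per key.
    return {key: reducer(key, values) for key, values in groups.items()}


def run(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Example job: how often is each tag used?
def tag_mapper(event):
    user, photo, tag = event
    yield (tag, 1)


def count_reducer(key, values):
    return sum(values)


events = [
    ("sonja", "p1", "flower"),
    ("brandon", "p1", "flower"),
    ("brandon", "p2", "beach"),
]
tag_counts = run(events, tag_mapper, count_reducer)
```

In a real deployment the map and reduce calls run in parallel across many machines and the shuffle moves data over the network. The sketch shows only the programming model.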
The Bigger Picture
• Software-as-a-service
– E.g., Salesforce.com
• Hosted data systems
– E.g., Amazon’s S3/Dynamo and EC2
• Web application development
– E.g., Ning, Ruby on Rails
• Change tracking
– Stream management
Implications
• Data management as a service
– Scientists and others who’ve resisted (installing, maintaining, and) using DBMSs will find it much easier to reap the benefits
– “Data centers” and “Computing Centers” will come into vogue again
• Hosted back-ends and RAD tools will make Web application development accessible to all
– The Web is becoming open
• E.g., OpenSocial, OpenID
• Ideas will be the most valuable currency, not the wherewithal to build complex systems
• Paradigm shifts possible for how we do research in many fields
– Build applications that embed your algorithms and test them directly in the field: computer scientists can interact directly with users (ironically, this would still be a breakthrough of sorts after four decades!)
– Many other disciplines (e.g., sociology, microeconomics) can design and conduct online experiments involving unprecedented numbers of participants
Summary
• Online communities represent a tremendous resource for organizing information online
– Open APIs and cloud services = mass engagement
– Extraction + mass collaboration = semantics
• Web is becoming
– More people-centric, less site-centric
– Highly intertwined, distributed, dynamic, personalized
– Models of ownership, trust, incentives?
– Next generation of search algorithms and infrastructure?
Further Reading
• Content, Metadata, and Behavioral Information: Directions for Yahoo! Research, The Yahoo! Research Team, IEEE Data Engineering Bulletin, Dec 2006 (Special Issue on Web-Scale Data, Systems, and Semantics)
• Systems, Communities, Community Systems on the Web, Community Systems Group at Yahoo! Research, SIGMOD Record, Sept 2007
• Towards a PeopleWeb, R. Ramakrishnan and A. Tomkins, IEEE Computer, August 2007 (Special Issue on Web Search)