1
One Table Stores All: Enabling Painless Free-and-Easy Data Publishing and Sharing
Bei Yu (1), Guoliang Li (2), Beng Chin Ooi (1), Li-zhu Zhou (2)
(1) National University of Singapore, (2) Tsinghua University
2
Folksonomy (folk+taxonomy)
Examples
  Delicious http://del.icio.us/
  Flickr http://www.flickr.com/
  Google Base http://base.google.com/
  YouTube http://www.youtube.com/
Internet-based information sharing methodology
Users collaboratively publish information resources, e.g., webpages, photos, using self-defined metadata
Users' collaborative behavior determines the data semantics
The system categorizes information resources based on user-defined metadata to facilitate searching, browsing, etc.
3
Our Attempt
Devise a general system framework for supporting folksonomy-based data sharing
  Allow a rich and flexible structure for the metadata (called data units) that describes published resources
  Categorize data units
  Efficiently store all data units
  Provide browsing and querying services
4
Data Units
Title: Uzzer's blog
Fields:
  Homepage: http://uzzer.livejournal.com
  Author: uzzer
  Blog type: art-blog
  Language: english accepted, russian
Tags: art, blog, comments, design, fun, livejournal, photos, pictures, uzzer, web
Title: China's Internet services market reached 18 billion in 2005
Fields:
  Author: Analysis International
  News Source: Analysis International
  Publish Date: 02/22/2006
Tags: China, Internet services, News and Articles
The metadata, called a data unit, consists of a user-created title, fields (attribute-value pairs), and tags
5
Data Model
A generic relational table for storing all data units, e.g.

  Columns: id | title | author | homepage | blog type | language | tags | news source | publish date
  Row 0: 0 | Uzzer's blog | uzzer | http://uzzer.livejournal.com | art-blog | english accepted, russian | art, blog, comments, design, fun, livejournal, photos, pictures, uzzer, web | null | null
  Row 1: 1 | China's Internet services market reached 18 billion in 2005 | Analysis International | null | null | null | China, Internet services, News and Articles | Analysis International | 02/22/2006

A set of virtual relations (VR) as views over the generic table, as querying interface, e.g.

  VR1: (id, title, author, homepage, blog type, language, tags)
  VR2: (id, title, author, news source, publish date, tags)
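The idea of virtual relations as views over one generic table can be sketched with SQLite from Python. This is only an illustration under stated assumptions, not the system's actual storage layer: the table name `generic`, the mapping table `vr_member`, and the view names `vr1` and `vr2` are hypothetical, and a real deployment would not store the sparse table as one wide row per data unit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One wide, sparse table holds every data unit (column names are illustrative).
conn.execute(
    """
    CREATE TABLE generic (
        id INTEGER PRIMARY KEY,
        title TEXT, author TEXT, homepage TEXT, blog_type TEXT,
        language TEXT, tags TEXT, news_source TEXT, publish_date TEXT
    )
    """
)
rows = [
    (0, "Uzzer's blog", "uzzer", "http://uzzer.livejournal.com", "art-blog",
     "english accepted, russian",
     "art, blog, comments, design, fun, livejournal, photos, pictures, uzzer, web",
     None, None),
    (1, "China's Internet services market reached 18 billion in 2005",
     "Analysis International", None, None, None,
     "China, Internet services, News and Articles",
     "Analysis International", "02/22/2006"),
]
conn.executemany("INSERT INTO generic VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Assumption: VR membership is kept in a separate mapping table.
conn.execute("CREATE TABLE vr_member (vr TEXT, id INTEGER)")
conn.executemany("INSERT INTO vr_member VALUES (?, ?)", [("vr1", 0), ("vr2", 1)])

# Each VR is a view that projects only the columns relevant to its data units.
conn.execute(
    """
    CREATE VIEW vr1 AS
    SELECT g.id, g.title, g.author, g.homepage, g.blog_type, g.language, g.tags
    FROM generic g JOIN vr_member m ON m.id = g.id
    WHERE m.vr = 'vr1'
    """
)
conn.execute(
    """
    CREATE VIEW vr2 AS
    SELECT g.id, g.title, g.author, g.news_source, g.publish_date, g.tags
    FROM generic g JOIN vr_member m ON m.id = g.id
    WHERE m.vr = 'vr2'
    """
)

# Queries are written against the VRs, not against the generic table.
vr1_rows = conn.execute("SELECT title, blog_type FROM vr1").fetchall()
vr2_rows = conn.execute("SELECT title, publish_date FROM vr2").fetchall()
```

Users see narrow, dense relations, while the system keeps a single physical table underneath.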
6
System Framework
[Architecture diagram] Components shown:
  Publish Interface
  Browsing and Search Interface (queries)
  Data Units Categorizer
  Multi-function Query Processor
  Storage Manager
  Generic Table
  Example VRs: VR1: products, VR2: recipes, VR3: blogs, VR4: patents, VR5: restaurants, VR6: travel
7
Data Units Categorizer
Constructs and maintains VRs dynamically, as data units are published constantly
  Clustering based on attributes and tags
  VR ≡ cluster of data units with similar topics
Needs an on-line, one-pass clustering model
  Accepts a data unit u and extracts its attributes and tags
  Compares u with the existing VRs and assigns it to those that match
  If no VR is suitable for u, creates a new VR with u as its only tuple
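The one-pass assignment loop described on this slide can be sketched as follows. The slides do not specify the similarity measure or the match criterion, so the Jaccard similarity, the equal weighting of attributes and tags, and the `threshold` value are all assumptions made for illustration; the class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataUnit:
    title: str
    attributes: set
    tags: set


@dataclass
class VirtualRelation:
    attributes: set = field(default_factory=set)
    tags: set = field(default_factory=set)
    units: list = field(default_factory=list)

    def add(self, u):
        """Store the unit and extend the VR's attribute and tag sets."""
        self.units.append(u)
        self.attributes |= u.attributes
        self.tags |= u.tags


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(u, vr):
    # Assumption: attributes and tags contribute equally.
    return 0.5 * jaccard(u.attributes, vr.attributes) + 0.5 * jaccard(u.tags, vr.tags)


def assign(u, vrs, threshold=0.3):
    """Assign u to every matching VR, or create a new VR holding only u."""
    matches = [vr for vr in vrs if similarity(u, vr) >= threshold]
    if not matches:
        new_vr = VirtualRelation()
        vrs.append(new_vr)
        matches = [new_vr]
    for vr in matches:
        vr.add(u)
    return matches


vrs = []
stream = [
    DataUnit("Uzzer's blog",
             {"homepage", "author", "blog type", "language"},
             {"art", "blog", "photos"}),
    DataUnit("China's Internet services market reached 18 billion in 2005",
             {"author", "news source", "publish date"},
             {"china", "internet services"}),
    DataUnit("Another blog",  # hypothetical third data unit
             {"homepage", "author", "blog type"},
             {"blog", "design"}),
]
for unit in stream:  # each data unit is seen exactly once
    assign(unit, vrs)
```

With these inputs the two blog entries end up in one VR and the news item in another. Each data unit is examined once and never reassigned, which is what makes the scheme one-pass.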
8
Challenges for Categorizing
Uncontrolled vocabulary for both attributes and tags
A large portion of the attributes and tags is "noise" that occurs very infrequently
The number of unique attributes and tags keeps growing
Problems with synonyms, polysemy, etc.
[Figure: Distribution of attribute frequencies. x-axis: attributes; y-axis: frequency (0 to 6000)]
[Figure: Distribution of tag frequencies. x-axis: tags; y-axis: frequency (0 to 1800)]
9
Our Current Approach
Characterize each VR with sets of popular attributes (PAS) and tags (PTS), for representing the dominating features
Compare new data units with PAS and PTS, for limiting the effect of “noise”
Maintain PAS and PTS when assigning each new data unit
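One way to read the PAS/PTS idea is as frequency counters with a support threshold, sketched below. The slides do not say how "popular" is defined or how the comparison is scored, so the `min_support` fraction, the `match_score` formula, and the class name `VRProfile` are assumptions for illustration only.

```python
from collections import Counter


class VRProfile:
    """Tracks attribute and tag frequencies for one VR (hypothetical helper)."""

    def __init__(self, min_support=0.5):
        # Assumption: a feature is "popular" if at least this fraction
        # of the VR's data units carry it.
        self.min_support = min_support
        self.size = 0
        self.attr_counts = Counter()
        self.tag_counts = Counter()

    def _popular(self, counts):
        if self.size == 0:
            return set()
        return {k for k, c in counts.items() if c / self.size >= self.min_support}

    @property
    def pas(self):
        return self._popular(self.attr_counts)

    @property
    def pts(self):
        return self._popular(self.tag_counts)

    def match_score(self, attributes, tags):
        """Fraction of the VR's popular features that the data unit carries."""
        popular = len(self.pas) + len(self.pts)
        if popular == 0:
            return 0.0
        hits = len(self.pas & attributes) + len(self.pts & tags)
        return hits / popular

    def add(self, attributes, tags):
        """Assign a data unit and update the counts behind PAS and PTS."""
        self.size += 1
        self.attr_counts.update(attributes)
        self.tag_counts.update(tags)


blogs = VRProfile()
blogs.add({"homepage", "author", "blog type"}, {"blog", "art"})
blogs.add({"homepage", "author", "language"}, {"blog", "photos"})
blogs.add({"homepage", "author", "blog type", "mood"}, {"blog", "xyzzy"})

blog_like = blogs.match_score({"homepage", "author", "mood"}, {"blog", "rare"})
news_like = blogs.match_score({"news source", "publish date"}, {"china"})
```

Rare attributes and tags such as `mood` or `xyzzy` never enter PAS or PTS here, so they cannot raise or lower the score of a new data unit.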
10
Storage Manager
Function
  Store and index the generic table (very sparse)
  Maintain mappings with VRs
Challenge
  Space efficiency
  Scalable over the number of attributes and data volume
  Efficient for both retrieval and update
11
Storage with Sparse Table
Only store the non-null values for each tuple
Build an inverted index over attributes for processing attribute-based queries
Build an inverted index over keywords for processing keyword queries
Other approaches? Bitmap index?
[Figure: Index and Data. The index lists attributes (attr1, attr2, attr3, attr4) pointing into the data, where each tuple is stored as a sequence of (attribute, value) pairs, e.g. (attr1, val1), (attr4, val2), (attr7, val3)]
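A minimal in-memory sketch of this storage scheme is given below: each tuple keeps only its non-null (attribute, value) pairs, and two inverted indexes map attributes and keywords to tuple ids. The class name `SparseStore`, its method names, and the word-level tokenization are assumptions; the slides do not describe the on-disk layout.

```python
import re
from collections import defaultdict


class SparseStore:
    """Stores only non-null values and indexes them (illustrative sketch)."""

    def __init__(self):
        self.tuples = {}                       # tuple id -> {attribute: value}
        self.attr_index = defaultdict(set)     # attribute -> tuple ids
        self.keyword_index = defaultdict(set)  # keyword -> tuple ids

    def insert(self, tid, fields):
        # Drop nulls so that sparse tuples cost space only for what they hold.
        row = {a: v for a, v in fields.items() if v is not None}
        self.tuples[tid] = row
        for attr, value in row.items():
            self.attr_index[attr].add(tid)
            for word in re.findall(r"\w+", str(value).lower()):
                self.keyword_index[word].add(tid)

    def with_attribute(self, attr):
        """Attribute-based lookup: ids of tuples that define attr."""
        return sorted(self.attr_index.get(attr, ()))

    def with_keyword(self, word):
        """Keyword lookup: ids of tuples whose values contain word."""
        return sorted(self.keyword_index.get(word.lower(), ()))


store = SparseStore()
store.insert(0, {
    "title": "Uzzer's blog", "author": "uzzer",
    "homepage": "http://uzzer.livejournal.com", "blog type": "art-blog",
    "news source": None, "publish date": None,
})
store.insert(1, {
    "title": "China's Internet services market reached 18 billion in 2005",
    "author": "Analysis International", "homepage": None, "blog type": None,
    "news source": "Analysis International", "publish date": "02/22/2006",
})
```

Adding a new attribute only adds a key to `attr_index`, so the number of distinct attributes can keep growing without changing a schema.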
12
Browsing and Query Processing
The VRs are ordered based on popularity for browsing
  May be presented in different views, e.g., based on attributes or based on tags
Support both keyword query and structured query
  Inverted index
  Effective ranking
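The sketch below illustrates the two query styles and popularity-ordered browsing on a tiny data set. The slides do not define the ranking function or the popularity measure, so ranking by summed term occurrences and measuring popularity by the number of data units in a VR are assumptions; the third data unit and all function names are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"\w+", text.lower())


units = {
    0: {"title": "Uzzer's blog", "author": "uzzer",
        "blog type": "art-blog", "tags": "art, blog, photos"},
    1: {"title": "China's Internet services market reached 18 billion in 2005",
        "author": "Analysis International",
        "tags": "China, Internet services, News and Articles"},
    2: {"title": "Design notes", "author": "uzzer",  # hypothetical data unit
        "tags": "design, web"},
}

# Inverted index: keyword -> {unit id: number of occurrences}.
keyword_index = defaultdict(Counter)
for uid, fields in units.items():
    for value in fields.values():
        for word in tokenize(value):
            keyword_index[word][uid] += 1


def keyword_query(query):
    """Rank units by total occurrences of the query terms (assumed scoring)."""
    scores = Counter()
    for word in tokenize(query):
        for uid, tf in keyword_index.get(word, {}).items():
            scores[uid] += tf
    return sorted(scores, key=lambda uid: (-scores[uid], uid))


def structured_query(conditions):
    """Return ids of units whose fields equal every attribute/value condition."""
    return [uid for uid, fields in units.items()
            if all(fields.get(a) == v for a, v in conditions.items())]


# Browsing: order VRs by popularity, assumed here to be their number of units.
vr_members = {"news": [1], "blogs": [0, 2]}
browse_order = sorted(vr_members, key=lambda name: (-len(vr_members[name]), name))

by_keyword = keyword_query("uzzer design")
by_structure = structured_query({"author": "uzzer"})
```

Here the keyword query ranks unit 2 ahead of unit 0 because it matches both terms, while the structured query returns exact matches without ranking.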
13
Conclusion
We have presented the design for a folksonomy-based data sharing system
We devise a generic table data model for representing and storing the data units
Future work
  Port the system to P2P networks