Exploring Collaboratively Annotated Data for Automatic ... ?· Exploring Collaboratively Annotated Data…

Download Exploring Collaboratively Annotated Data for Automatic ... ?· Exploring Collaboratively Annotated Data…

Post on 29-Aug-2018




0 download

Embed Size (px)


<ul><li><p>Exploring Collaboratively Annotated Data for AutomaticAnnotation</p><p>Jiafeng Guo, Xueqi Cheng, Huawei Shen, Shuo BaiInstitute of Computing Technology, CAS</p><p>Beijing, P.R.China</p><p>{guojiafeng,shenhuawei}@software.ict.ac.cn, cxq@ict.ac.cn, sbai@sse.com.cn</p><p>ABSTRACTSocial tagging, as a collaborative form of annotation pro-cess, has grown in popularity on the web due to its effec-tiveness in organizing and accessing web resources. Thispaper addresses the issue of automatic annotation, whichaims to predict social tags for web resources automaticallyto help future navigation, filtering or search. We explore thecollaboratively annotated data in social tagging services, inwhich collaborative annotations (i.e., combined social tagsof many users) serve as a description of web resources fromthe crowds point of views, and analyze its three importantproperties. Based on these properties, we propose an au-tomatic annotation approach by leveraging a probabilistictopic model, which captures the relationship between re-sources and annotations. The topic model is an extensionof conventional LDA (Latent Dirichlet Allocation), referredas Word-Tag LDA, which effectively reflects the generativeprocess of the collaboratively annotated data and models theconditional distribution of annotations given resources. Ex-periments are carried out on a real-world annotation data setsampled from del.icio.us. Results demonstrate that our ap-proach can significantly outperform the other baseline meth-ods in automatic annotation.</p><p>Categories and Subject DescriptorsH.3.3 [Information Storage and Retrieval]: InformationSearch and Retrieval</p><p>General TermsAlgorithms, Experimentation, Performance, Theory</p><p>Keywordssocial tagging, automatic annotation, topic model</p><p>1. INTRODUCTIONAnnotating resources with descriptive keywords or tags</p><p>is a common way to help organize and access large collec-tions of resources. In the recent years, social tagging, asa collaborative form of the annotation process, is gainingpopularity on the web. Social tagging services (e.g., Deli-cious1 and Citeulike2) enable users to not only freely tag the</p><p>1http://del.icio.us2http://www.citeulike.org</p><p>Copyright is held by the author/owner(s).CAW2.0 2009, April 21, 2009, Madrid, Spain..</p><p>web resources but also publicly share these information withothers. Such collaborative annotations (i.e., combined socialtags of many users) actually reflect how the crowds describeand organize the web resources. Therefore, a natural ques-tion is that whether we can learn from the wisdom of thecrowds, the collaboratively annotated data, to predict so-cial tags for web resources automatically. This is what weaddressed by automatic annotation here. Note that in thispaper, we focus on document resources that are consisted ofwords, which represent a major type of resources on the web.Automatic annotation is potentially useful for wide applica-tions including resource discovery, tag recommendation, webpage search and clustering. For example, a large amount ofun-annotated web resources can be automatically annotatedto help discover similar resources to the existing annotatedresources, which users might be interested in. Moreover,predicted tags might also be reasonable to suggest to usersto assist their annotation process.</p><p>There is much previous work focused on social tagging,including analysis of tagging systems [9, 18], study of tagvisualization [7], recommendation of tags [19, 14, 21], anddiscovery of hidden semantics in tags [22]. However, in thispaper, we investigate exploiting collaboratively annotateddata for automatic annotation. Such a problem fits in theframework of Semantic Web [2], which aims to automaticallyassign metadata to web resources. However, the primary ap-proaches on Semantic Web either presume that an ontologyis built before annotation [17, 12] or rely on text keywordsextraction [5]. We propose to annotate resources with socialtags, which are freely generated by web users without anya-priori formal ontology to conform to.</p><p>Our work is inspired by the following three importantproperties of the collaboratively annotated data in socialtagging services: (1) Stable, which means that collaborativeannotations for a particular web resource give rise to a sta-ble distribution of social tags. (2) Consistent, which meansthat collaborative annotations reflect the content of web re-sources with its own literature. (3) Shareable, which meansthat most tags in collaborative annotations are shared by alarge variety of web resources. Based on these properties, wepropose an automatic annotation approach by leveraging aprobabilistic topic model, which learns the relationship be-tween resources and collaborative annotations. Specifically,the topic model is an extension of conventional LDA (La-tent Dirichlet Allocation) [4], referred as Word-Tag LDA,which captures the conditional relationships between lower-dimensional representations (i.e. hidden topics) of words (inresources) and tags (in annotations). The approach simulta-</p></li><li><p>neously models the way how the web resource is generated aswell as the way how the resource is subsequently annotatedwith collaborative annotations. With the Word-Tag LDAmodel learnt in hand, we are able to predict the conditionaldistribution of social tags given a new web resource. Notehere we are interested in predicting a ranked list of socialtags for web resources, not only a yes/no decision for eachpossible tag. Obviously, there are always a great amountof tags which can be assigned to web resources. Therefore,predicting a ranked list rather than a set of social tags makesmore sense and is more effective in describing web resources.</p><p>We demonstrate the effectiveness of our automatic an-notation approach by comparing with three types of base-line approaches, including keyword approach, similarity ap-proach and joint approach. Experiments are conducted ona real-world annotation data set sampled from Delicious.Experimental results show that our model can significantlyoutperform all the other baselines in automatic annotation.</p><p>The rest of the paper is organized as follows. Section 2discusses related work. Section 3 introduces the problem ofautomatic annotation and conducts some preliminary stud-ies. Section 4 describes the proposed Word-Tag LDA indetails. Experimental results are presented in Section 5.Conclusions are made in the last section.</p><p>2. RELATED WORK</p><p>2.1 Research on Social AnnotationSocial tagging has received major attention in the recent</p><p>literature. The early discussion of the social tagging canbe found in [9, 18, 20, 11]. In [9], Golder et al. providedempirical study of the tagging behavior and tag usage inDelicious. Quintarelli [18] gave a general introduction ofsocial annotation and suggested that we should take it asan information organizing tool.</p><p>Research based on social tagging has been done in variousareas such as emergent semantics [16, 22], tag recommenda-tion [19, 14, 21], visualization [7], and information retrieval[1, 23]. Wu et al. [22] explored emergent semantics fromthe social annotations in a statistical way and applied thederived semantics to discover and search shared web book-marks. Sigurbjornsson et al. [19] proposed a method whichuses tag co-occurrence to suggest additional tags that com-plement user-defined tags of photographs in Flickr. Hey-mann et al. [14] predicted tags for web pages based on theanchortext of incoming links, web page content, surround-ing hosts and other tags applied to the URL. In [7], Dubinkoet al. presented a novel approach to visualize the evolutionof tags over the time. Bao et al. utilized social annotationsto benefit web search [1]. They introduced SocialPageRank,to measure page authority based on its annotations, andSocialSimRank for similarities among annotations.</p><p>Different from the above work, we investigated the capa-bility of social tagging in automatic annotation by utilizingthe correlation between web resources and collaborative an-notations.</p><p>2.2 Research on Automatic AnnotationAutomatically assigning metadata to web resources is a</p><p>key problem in Semantic Web area. A lot of work has beendone about this topic. Early work [17, 12] mainly focusedon the utilization of an ontology engineering tool. An on-tology is usually built first and then manual annotation for</p><p>web resources is conducted with the tool. In order to helpautomate the manual process, many approaches have beenproposed and evaluated. Chirita et al. proposed P-TAG,a method which suggests personalized tags for web pagesbased on both the content of the page and the data on theusers desktop [5]. Dill et al. [6] presented a platform forlarge-scale text analysis and automatic semantic tagging.They proposed to learn from a small set of training exam-ples and then automatically tag concept instances on theweb. Handschuh et al. [13] considered the web pages thatare generated from a database, and proposed to automati-cally produce sematic annotations from the database withan ontology. However, the primary approaches on SemanticWeb either presume that an ontology is built before anno-tation or rely on text keywords extraction. We proposed toassign social tags to content, which are freely generated byusers without any a-priori ontology to conform to.</p><p>3. PRELIMINARY STUDIESThe idea of a social approach to the semantic annotation</p><p>is enlightened and enabled by the now widely popular socialtagging services on the web. These social tagging servicesprovide users a convenient collaborative form of annotationprocess, so that they can not only freely tag the web re-sources but also publicly share these information with oth-ers. Since social tagging has become an effective way of or-ganizing and accessing large collections of web resources, anatural question is that whether we can predict social tagsfor web resources automatically to help future navigation,filtering or search. This is the automatic annotation prob-lem addressed in our paper.</p><p>Before we discuss the technical aspects of our work, itis worth spending a moment to conduct some preliminarystudies on social tagging. There are many popular socialtagging services on the web, e.g. the social bookmarkingsystems [11]. Here we conduct some studies on a typicalsocial bookmarking system, Delicious. In Delicious, an an-notation activity (i.e. bookmark) typically consists of fourelements: a user, a link to the web resource, one or moretags, and a tagging time. We define an annotation as</p><p>(User, Url, Tag, Time).</p><p>In this paper, we focus on the collaboratively annotateddata, in which collaborative annotations (i.e., combined so-cial tags of many users) serve as a description of web re-sources from the crowds point of views. Therefore, we dis-regard the roles of User and Time, combine Tags of thesame URL together, crawl the web resource (i.e. documentwords) referred by the URL and obtain the pair of (Words,Tags) as the collaboratively annotated data. For our pre-liminary studies, we randomly sampled 70,000 annotatedURLs from Delicious and collected the corresponding word-tag pairs. We made some investigation over such data andexplore the following three important properties, which be-come the foundation of our work:</p><p>(1) Stable, which means that collaborative annotationsfor a particular web resource give rise to a stable distributionof social tags. Although individuals annotation may havepersonal preference and varying tag vocabulary, collabora-tive annotations form a stable pattern in which the tag fre-quency distribution is nearly fixed. Fig. 1 shows this prop-erty, which was obtained by plotting the Kullback-Leibler(KL) divergence of tag frequency distribution for each step</p></li><li><p>Table 1: The top 10 frequent keywords and collaborative annotations of an example URL.Url http://www.xml.com/pub/a/2002/12/18/dive-into-xml.htmlTop tf Words rss xml feed format schema web content feeds news shareTop Tags rss xml syndication web reference programming web2.0 article tutorial howto</p><p>0 50 100 150 2000</p><p>0.1</p><p>0.2</p><p>0.3</p><p>0.4</p><p>0.5</p><p>0.6</p><p>0.7</p><p>0.8</p><p>0.9</p><p>1</p><p>Units of Bookmarks Added</p><p>Kullb</p><p>ack</p><p>Leib</p><p>ler (</p><p>KL) d</p><p>iver</p><p>genc</p><p>e</p><p>Figure 1: Kullback-Leibler (KL) divergence of tagfrequency distribution at each step (as units of book-marks added), with respect to the final tag distri-bution.</p><p>(as units of bookmarks added) with respect to the final tagdistribution for the resource (where final denotes distri-bution at the time the measurement was taken, or the lastobservation in the data). From Fig. 1 we can see that empir-ically after the first 100 or so bookmarks, the KL divergencebetween each step and the final one converges to zero (orclose to zero), which indicates that tag distribution for theresource has become stable. Similar observations have alsobeen obtained in previous research work [9, 10]. This stableproperty is believed to be caused by the two public factorsof social tagging, imitation and shared knowledge [9].</p><p>(2) Consistent, which means that collaborative annota-tions reflect the content of web resources, although tags maydiffer literally from words in the resource. On one hand,with the wisdom of the crowds, collaborative annotationsare more likely to be a less biased semantic description ofweb resources, which have the comparable expression capa-bility as words in web resources. For example, we show thetop 10 frequent keywords in a bookmarked resource as wellas the top 10 frequent tags annotated by users in Table 1.The collaborative annotations clearly reflect what the corre-sponding web resource talks about. We further found thatwhen top 10 frequent keywords considered, 73.6% of all re-sources are fully covered by their collaborative annotations.On the other hand, tags may differ literally from words inweb resources, since they are created by large number ofweb users and usually represent a higher-level abstractionon the content [15]. Take the annotations shown in Table 1as an example, beyond those matched with words, we canalso observe some high-level abstraction tags, such as pro-gramming and tutorial.</p><p>(3) Shareable, which means that most tags in collabo-rative annotations are shared by a large variety of web re-sources. We plot the average document frequency (DF) ofthe top N frequent tags over the data in Fig. 2. The re-sults show that the top 10 frequent tags for web resourcesare shared by more than 300 resources on average. As col-laborative annotations usually form a Power-Law frequencydistribution, these top frequent shareable tags actually rep-</p><p>10 20 30 40 50 60 70 80 90 1000</p><p>50</p><p>100</p><p>150</p><p>200</p><p>250</p><p>300</p><p>350</p><p>Aver</p><p>age </p><p>Doc</p><p>umen</p><p>t Fre</p><p>quen</p><p>cy</p><p>TopK Tags</p><p>Figure 2: Average Document Frequency (DF) of topfrequent tags.</p><p>resent the major part of the collaborative annotations forweb resources. This shareable property is believed to be re-lated to two reasons: One is that users are more likely toassign similar tags for the similar content in web resources;The other is that tags which represent a semantic abstrac-tion on the resource content are more likely to be used by alarge number of users than t...</p></li></ul>