
Jornada de Seguimiento de Proyectos, 2010
Programa Nacional de Tecnologías Informáticas

QEAVis: Quantitative Evaluation of Academic Websites Visibility

TIN 2007-67581-C02

M. F. Verdejo, E. Amigó, L. Araujo, J. Artiles, V. Fresno, G. Garrido, P. Hernández, R. Martínez, A. Peñas, A. Pérez, J.R. Pérez, A. Rodrigo, J. Romo, A. Zubiaga

NLP&IR Group, UNED

I. Aguillo, M. Fernández, A. M. Utrilla

Cybermetrics Lab, CCHS - CSIC

Abstract

The application of Human-Language Technologies (HLT) to the web raises new technological challenges. First, the page structure and textual content of websites are not comparable to the traditional domains of textual processing, such as news repositories. Second, processing large portions of the web poses a scalability problem and new challenges for the development of methodologies, techniques and algorithms for textual processing. The project applies HLT to an important problem, the measurement of academic visibility on the web, laying the basis for a quantitative evaluation of university departments' commitment to public access to their information. Web indicators (cybermetrics) must be developed and applied to the study of academic website visibility, with special focus on the presence of the Spanish language (of strategic importance) and on the academic areas related to the humanities (which need special help for their web positioning). First, we will determine the main web mediators of academic contents at the web subdomain level. These subdomains will be crawled to download, store and manage their web pages, so that the pages are ready for automatic classification and extraction. Web subdomains will be classified by language, academic category (Humanities, Science, etc.) and discipline (Philosophy, Philology, etc.). Furthermore, the information necessary for creating the profile of each subdomain will be automatically extracted. All this information will be used to elaborate a profile and a description of each university department. A series of web indicators will be applied to the information of the subdomains in order to quantify their presence, visibility, impact and popularity. The resulting quantitative values will be used to build a ranking of subdomains/departments for each academic category. In the ranking, the top positions will go to those departments whose commitment to the visibility of their information is the largest. The rankings, together with the criteria used in their construction and the recommendations and resources needed to improve the results, will be publicly available. Finally, we expect the application of HLT to enable the development of new cybermetric indicators with finer granularity.

Keywords: automatic classification of web pages, information extraction, cybermetrics, access and visibility of multilingual information on the Internet.


1 Objectives

The application of Human-Language Technologies (HLT) to mine the web in order to automatically measure academic visibility poses new technological challenges. The main goals of the QEAVis project are the following: (i) to advance the state of the art of classification and extraction techniques so that the process of identifying and obtaining relevant information from websites, in particular the data needed to evaluate the impact of a website within a research community, can be automated effectively; (ii) to advance the state of the art on web indicators and the methodological approach to test their reliability; (iii) to gain insight into the presence and impact of the Humanities fields on the WWW, especially the websites of Spanish-speaking university departments.

1.1 Groups involved and approach

The project is carried out by the NLP&IR Group (UNED) and the Cybermetrics Lab (CCHS-CSIC). UNED expertise includes crawling, multilingual information retrieval, and classification and extraction techniques. The Cybermetrics Lab has developed cybermetric methods to analyze and rank (mainly in an intellectual way) website visibility. In QEAVis we first determine the main websites with academic content at the web subdomain level. These subdomains are crawled to download, store and manage their web pages, so that the pages are prepared for automatic classification and information extraction. Web subdomains are classified by language, academic category and discipline. Furthermore, the information necessary for creating the micro-format of each subdomain is automatically extracted. This information is used to elaborate a profile and a description of each university department. Finally, a variety of web indicators are applied to the information of the subdomains in order to quantify their presence, visibility, impact and popularity. The resulting quantitative values are used to build a ranking of subdomains/departments for each academic category. In the ranking, the top positions go to those departments whose commitment to the visibility of their information is the largest. The rankings, together with the criteria used in their construction and the recommendations and resources needed to improve the results, will be made public. In this way we expect: (1) to stimulate continuous improvement of the accessibility and visibility of academic information on the web; (2) to provide new cybermetric indicators with finer granularity; and (3) to test the results in specific applications such as portals or search engines.

2 Current state and achievements of the project

In this section we describe in detail the current state of the project, highlighting the achievements and the relevance of the results obtained in its first and second years per WP (two of the indicators required for evaluation purposes). The following table summarizes the WPs and the schedule of the project, which has been carried out on time as planned.

WP 1.1: Selection and Crawling of academic websites. This WP is organized in three tasks. Next, we briefly describe the work carried out in each of them.

Task 1.1.1 Infrastructure set-up. The cluster machines and the crawling software were set up for the massive downloading of several million pages. Up to 100,000 sites were queued and 7,000,000 pages had to be finally crawled and indexed. The automatic collection of initial seeds for crawling was extracted from search engines, and it was done by a series of servers emulating close to 50 different PCs, each one with its own IP.

Task 1.1.2 Crawling of websites in specific domains. Regarding crawling tasks, first of all we analyzed the existing crawling software, taking into account its performance and scalability. We also considered the availability of open-source solutions, which offer more flexibility than commercial ones. After the analysis and trial process, we finally selected Nutch as a suitable web-search software to carry out our experiments. Nutch is part of the Apache project and is built on Lucene Java, adding a crawler, a link-graph database, parsers for HTML and other document formats, etc. We made some initial experiments downloading websites in specific domains, with successful results.
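As an illustration of how such a crawl can be launched, the sketch below drives a Nutch 1.x-style one-step crawl from Python. The NUTCH_HOME path, seed directory name and the depth/topN values are illustrative assumptions, not the project's actual configuration.

# Sketch: launching a Nutch 1.x-style one-step crawl over a prepared seed list.
# Assumes NUTCH_HOME points at a local Nutch installation and that
# seeds/urls.txt already contains one URL per line (both are assumptions).
import os
import subprocess

NUTCH_HOME = os.environ.get("NUTCH_HOME", "/opt/nutch")
SEED_DIR = "seeds"              # directory containing urls.txt
CRAWL_DIR = "crawl-academic"    # where Nutch stores crawldb, segments, index

cmd = [
    os.path.join(NUTCH_HOME, "bin", "nutch"),
    "crawl", SEED_DIR,
    "-dir", CRAWL_DIR,   # output directory
    "-depth", "3",       # link depth from the seeds (illustrative)
    "-topN", "50000",    # pages fetched per round (illustrative)
]
subprocess.run(cmd, check=True)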

Task 1.1.3 Crawling of academic websites. Once we had selected the software to create our web page collections, we proceeded to generate the first list of URLs to seed our crawler. This initial list was created by experts in information science: the Cybermetrics Research Group from CCHS-CSIC provided a list of nearly 300,000 subdomains underneath academic domains, compiled from an automatic extraction of records from the Yahoo Search public search engine. We then had to clean this list by removing invalid and duplicated URLs, as well as non-existing or non-responding URLs. We used the resulting list, with roughly 124,000 subdomains, as our first seed, and obtained a collection with almost 7 million web pages from academic websites. Nonetheless, this collection was not labeled (only the titles were provided), so we could not perform later classification tasks with it. The Cybermetrics group then provided us with another list of URLs, this time with their corresponding labels over a 27-category taxonomy defined from UNESCO codes. This was a smaller list, with approximately 4,400 subdomains. In the same manner, we created a new web page collection with this list as a seed. Once we had the two collections, one labeled and one unlabeled, we needed to increase our set of labeled subdomains. To achieve this, the Cybermetrics Lab manually labeled a set of more than 6,000 subdomains randomly extracted from the unlabeled collection. Each subdomain was analyzed by two different indexers, with an extra opinion when the labels differed; this task was done manually. To ease this task, we developed Qannotate, a tool to help indexers in the labeling of each subdomain. Analyzing these annotations, we concluded that roughly 75% of the subdomains were not strictly academic or lacked enough academic content, although they were underneath academic domains, while the remainder was clearly classifiable into the defined taxonomy. As a result, we have a broader labeled collection, with approximately 6,000 classified URLs and almost 5,000 non-academic URLs. This collection will be used later for testing several applications, including portals with directories, subject search engines and rankings.
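The cleaning step described above (removing invalid, duplicated and non-responding URLs) can be sketched as follows. The validity checks and timeout are illustrative assumptions rather than the exact procedure used in the project.

# Sketch of the seed-cleaning step: drop malformed and duplicate URLs,
# then keep only subdomains that answer an HTTP request.
# Standard library only; thresholds and timeouts are illustrative.
import urllib.request
from urllib.parse import urlparse

def clean_seeds(raw_urls, timeout=10):
    seen, cleaned = set(), []
    for url in raw_urls:
        url = url.strip()
        parsed = urlparse(url if "://" in url else "http://" + url)
        if not parsed.netloc or "." not in parsed.netloc:
            continue                      # malformed entry
        canonical = "http://" + parsed.netloc.lower() + "/"
        if canonical in seen:
            continue                      # duplicate subdomain
        seen.add(canonical)
        try:
            urllib.request.urlopen(canonical, timeout=timeout)
        except Exception:
            continue                      # non-existing or non-responding
        cleaned.append(canonical)
    return cleaned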

WP 1.2: Automatic classification of academic websites. This WP is organized in two related subtasks, as follows.

Task 1.2.1 Extraction of domain descriptors. Concerning terminology extraction, UNED has developed a tool to automatically extract relevant terminology from the term index corresponding to a web domain [1, 4, 5, 16, 17, 18]. This has been done by calculating the divergence between language models; specifically, we use the Kullback-Leibler divergence (see Figure 1), where M_D is the language model for the category D and M_C is the model for the whole collection. UNESCO codes have then been used to select those pages from the collection that correspond to a particular academic domain at different levels of detail, and an index has been generated for each level using the documents of the category associated with that level. Applying the terminology-extraction tool to each index, term lists have been generated that characterize each academic domain considered. These lists are used in the classification phase, which requires some kind of global weighting of the terms. The KLD value assigned to each term during the extraction step cannot be used directly, since the range of values is different for the index of each category: it depends on features such as the number of documents in the category and the number of terms. Accordingly, we have developed a normalized KLD value, computing the divergence of a term between the category and the whole collection and normalizing over all the terms within a category.

KLD(t) = M_D(t) · log( M_D(t) / M_C(t) )

Figure 1: Kullback-Leibler divergence
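A minimal sketch of this scoring, assuming simple unigram language models built from relative term frequencies. The per-category normalization shown is one plausible reading of the normalized KLD described above, not the exact formula used in the project.

# Sketch of KLD-based term scoring. MD and MC are unigram language models
# (relative frequencies) for a category and for the whole collection;
# the per-category normalization is illustrative.
import math
from collections import Counter

def language_model(token_lists):
    counts = Counter(t for tokens in token_lists for t in tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def kld_terms(category_docs, collection_docs, eps=1e-9):
    md = language_model(category_docs)   # M_D
    mc = language_model(collection_docs) # M_C
    raw = {t: p * math.log(p / mc.get(t, eps)) for t, p in md.items()}
    z = sum(abs(v) for v in raw.values()) or 1.0   # normalize within category
    return sorted(((t, v / z) for t, v in raw.items()),
                  key=lambda x: x[1], reverse=True)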


Task 1.2.2 Automatic classification of web documents from academic websites. With regard to automatic classification, we first carried out a study on whether unlabeled data could improve results for multiclass web page classification using Support Vector Machines (SVM). In the light of the results ([23], [24] and [27]), it was decided to rely only on labeled data, both for good performance and to reduce the computational cost. The next step aimed at reducing the number of UNESCO categories to the 20-25 that best fitted our crawled collections, without taking into account the deeper levels of the taxonomy. To evaluate how well the collections fit that taxonomy, an unsupervised grouping was tried: the UNESCO categories, each represented by means of all the documents belonging to it, were clustered using a Self-Organizing Map (SOM) ([9], [10] and [28]). The inconsistent results led us to dismiss the use of the UNESCO taxonomy in the future. Based on the conclusions of the clustering results, the Cybermetrics group decided to create a new taxonomy adapted to the domain. First, a combined classification of the most popular international codes (UDC, LC, UNESCO) was prepared, including several multilingual versions. This combined classification consists of a hierarchical association of over 500 entries grouped into a 27-category taxonomy. In order to test this taxonomy, some experiments were done using the English and Spanish versions of the headings to recover from web search engines the university departments with these words in their names. This allowed us to detect some issues with the preliminary taxonomy. On the one hand, some headings were excluded, changed or merged, and the taxonomy was globally redefined; the Cybermetrics group finally proposed a taxonomy with 25 categories. On the other hand, since our dataset relies on subdomains that belong to academic domains, we found that many of these subdomains are not strictly academic. This is why we decided to create a filtering classifier as a step prior to the multiclass classifier. This filtering classifier discards the non-academic subdomains, cleaning the subsequent input for the multiclass classifier. The resulting classification consists of two phases: (1) selecting the truly academic subdomains, and (2) classifying the filtered subdomains into the 25 established categories. We also studied a web page classification task based on information from social tagging sites ([25, 26]) for a later application to the academic context; later analysis showed that many URLs had no social information with which to improve content-based representations.
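The two-phase scheme can be illustrated with the following sketch, which assumes scikit-learn and linear SVMs over TF-IDF features; the actual feature representations and learning setup of the project may differ.

# Sketch of the two-phase classification: a binary filter that discards
# non-academic subdomains, followed by a 25-way classifier on the rest.
# scikit-learn is assumed here for illustration; texts and labels are
# placeholders for the subdomain representations built in the project.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_two_stage(texts, is_academic, categories):
    # Phase 1: academic vs. non-academic filter.
    filter_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    filter_clf.fit(texts, is_academic)
    # Phase 2: 25-category classifier, trained on academic subdomains only.
    academic = [(t, c) for t, a, c in zip(texts, is_academic, categories) if a]
    cat_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    cat_clf.fit([t for t, _ in academic], [c for _, c in academic])
    return filter_clf, cat_clf

def classify(filter_clf, cat_clf, texts):
    kept = [t for t, a in zip(texts, filter_clf.predict(texts)) if a]
    return list(zip(kept, cat_clf.predict(kept))) if kept else []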

WP 1.3: Website Descriptive Information Extraction and Named Entity Recognition. The work is organized in three tasks. Next, we briefly describe the work carried out in each task.

Task 1.3.1 Named Entity Recognition. In order to evaluate and optimize the IE system, we have developed a manually annotated corpus. This corpus contains 1,124 entities from 30 subdomains in the QEAVis dataset and includes the name of the main organization, related organizations, related people, place and contact information. We have implemented the infrastructure necessary for: (i) extracting entities and relevant features from each entity appearance (context, links between pages, anchor texts, etc.); (ii) the evaluation against manually generated templates using similarity metrics between strings; (iii) a mapping process for avoiding redundant entities; and (iv) a framework for generating machine learning models for template filling. In addition to those features, we have studied the frequency of entity occurrences as a feature for template filling. This feature captures redundancy, which is particularly relevant in IE over subdomains. For each subdomain the system extracts named entity candidates in order to fill the corresponding IE template. These candidates are generated using the Stanford Named Entity Recognizer (NER), which provides an implementation of linear-chain Conditional Random Field (CRF) sequence models and a three-class (person, organization, and location) named entity recognizer for English. Named entity occurrences referring to the same individual are grouped in the mapping step. The mapper component uses an implementation of the agglomerative clustering algorithm with single linkage and an empirically set clustering threshold. The distance between entity occurrences is the normalized Levenshtein edit distance; other string distance metrics have been discarded because of their time cost or accuracy, although we are still studying possible improvements to this basic metric. For each slot in the IE template the system generates a set of classification instances. Each instance is represented by a vector of features and assigned a Boolean value indicating whether it is a correct candidate for a particular template slot. The classifier has been trained for each template slot using a Support Vector Machine learning method; other algorithms have been discarded because of the characteristics of the features. The learning process is supported by an evaluation component that also uses string distances to compare the classifier output against the manual templates. Regarding the results, we have seen that the evaluation results are affected to a great extent by the performance of the string distances in both the mapping and the automatic evaluation component. It is necessary to deal with these issues, and we are currently analyzing the data in depth.
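The mapping step, single-linkage grouping of mentions under a normalized Levenshtein distance threshold, can be sketched as follows; the threshold value is illustrative, since the project sets it empirically.

# Sketch of the mapping step: group entity mentions that refer to the same
# individual with single-linkage clustering over the normalized Levenshtein
# distance, cut at a fixed threshold. Standard library only.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalized_distance(a, b):
    return levenshtein(a, b) / max(len(a), len(b), 1)

def group_mentions(mentions, threshold=0.2):
    # Single linkage at a fixed cut == connected components of the graph
    # linking every pair of mentions closer than the threshold.
    clusters = []
    for m in mentions:
        linked = [c for c in clusters
                  if any(normalized_distance(m, x) <= threshold for x in c)]
        for c in linked:
            clusters.remove(c)
        clusters.append([m] + [x for c in linked for x in c])
    return clusters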

Task 1.3.2 Pattern acquisition for Information Extraction. Regarding this task, we have implemented the extraction of the features needed for the pattern acquisition process (HTML paths, sentence segmentation, location in the document). By training decision trees, we have analyzed the contribution of different features to information extraction as a first step of the pattern acquisition process. We have observed that some features are very discriminative for certain template slots; for instance, the anchor text of links pointing to the page containing the entity occurrence determines to a great extent the related people, because of staff pages. In order to tackle the variability of HTML structures in subdomains, for the second version we are applying Visual Page Segmentation, which makes use of page layout features such as font color and size to construct a vision tree for a page. Furthermore, we have also analyzed patterns to capture other aspects of the web pages that can help to characterize them, such as the presence of broken links or spam [11, 12, 13, 14, 15].
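A minimal sketch of this kind of feature analysis, assuming scikit-learn decision trees and a simplified stand-in feature set: it shows how the discriminative power of features such as anchor text can be inspected, not the project's exact feature definitions.

# Sketch: train a decision tree on simple structural features of a candidate
# occurrence and inspect which features the tree finds discriminative for a
# template slot. scikit-learn is assumed; the feature names are placeholders.
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["anchor_mentions_staff", "html_path_depth",
            "position_in_document", "in_contact_page"]

def feature_importance(candidate_features, is_correct_slot_value):
    # candidate_features: list of numeric rows following the FEATURES order
    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    tree.fit(candidate_features, is_correct_slot_value)
    return sorted(zip(FEATURES, tree.feature_importances_),
                  key=lambda x: x[1], reverse=True)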

Task 1.3.3 Generation of descriptive summaries from subdomains. According to the work plan, this task is in its first stage. We have defined the template that constitutes the kernel of the summary, and we have also extracted the textual context of entities in order to enrich the system output.

WP 1.4: System integration

Task 1.4.1 Specification, architecture and integration. The initial source of information to be integrated is the set of subdomains obtained after identifying the web mediators of academic contents. They are fed to the crawler in order to obtain a sizable subset of the World Wide Web. A filtering classifier distinguishes pages with academic content from those without it, which are considered non-relevant for our purposes. Other minor preprocessing is carried out to clean up the collection, which is then indexed for persistence and retrievability. A multiclass classifier for academic websites is then applied to feed the category information into the index, so that the system allows for immediate categorization of the retrieved and presented results. The Information Extraction engine processes the information obtained in the crawling phase and produces the entity profiles, which are stored in the database. The presentation layer and user interface is a search interface: the user interacts with it, submitting queries to the system that are answered readily by obtaining information from both the integrated index and the database. The overall architecture, database and interfaces have been specified (Figure 2). The crawler component has been integrated. Regarding the automatic classifier component, we concluded that roughly 75% of the subdomains were not strictly academic, although they were underneath academic domains, while the remainder was clearly classifiable into the defined taxonomy; therefore, a new classifier for academic and non-academic sites is required, and its integration is pending. The integration of the IE components is in progress. A first prototype of an academic search engine has been implemented. The search interface offers the user the information that has been integrated into the system. Its appearance is similar to the search interfaces users are familiar with: it allows the user to type a keyword query into a search box and returns a ranked list of the results most relevant to the query. This familiar design enhances usability. We have then worked on integrating the information collected and extracted by the system components and showing it to the user, in two main directions: (i) aggregating the information by academic domain, offering the user those results belonging to the same institution; the user can also obtain profile information about the institution by clicking on its name, and can restrict the query to a certain institution; (ii) ordering the results according to a variety of precomputed webometric criteria. Both directions converge in the general objective of giving the user a simple interface that at the same time exposes the additional information collected and extracted by the system, and enables him or her to better decide which results are more promising in order to fulfill his or her information needs. Figure 3 shows a snapshot of the interface.

Figure 2: System components
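The integration flow can be summarized in a skeleton like the one below, where each component is reduced to a placeholder callable; only the ordering of the stages and the data handed between them follow the description above.

# Skeleton of the integration flow, with every component injected as a
# callable. The names are placeholders for the real modules.
def build_system(seed_subdomains, crawler, academic_filter, indexer,
                 category_classifier, extractor, database):
    pages = crawler(seed_subdomains)                  # crawl subdomains
    academic_pages = [p for p in pages if academic_filter(p)]
    index = indexer(academic_pages)                   # persistence + retrieval
    for page in academic_pages:
        index.add_category(page, category_classifier(page))
    for profile in extractor(academic_pages):         # entity profiles
        database.store(profile)
    return index, database                            # backing the search UI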

Task 1.4.2 Database construction and management. The integration of crawling, preprocessing and indexing is completed. The integration of the classifiers is partial, but the component design ensures that the final version of the classifiers will be easily connected once ready. The Information Extraction component is yet to be integrated: a general high-level description is defined, but the information extraction experiments will show the amount of information that can be integrated into the system to allow for an optimal trade-off between performance and information availability. The overall architecture of the user interface remains stable, while different presentations of the results are being tried at the moment, in order to obtain adequate performance and to address the information needs of potential users.

Figure 3: Interface prototype

WP 2.1: Academic web mediators. The market of search engines is changing every year, reducing the number of actors involved. The actual number of engines is very low, probably fewer than ten, and some of them have very small databases or are not frequently updated. We have identified five public search engines with good and even coverage, large databases and a set of powerful commands for recovering information. We evaluated whether the figures provided for specific operators were trustworthy, with varied results [30, 36, 40]. At least one of the engines selected was a subject (academic) one, in this case Google Scholar; the preliminary analysis should be repeated, as the way this engine works has changed very recently. Some of these innovations have been taken into consideration in the new design of the webometric indicators. API availability for automatic retrieval is not a factor in the selection of the engines, as tests show that the databases exposed through the APIs are smaller than the commercial ones, usually because they are older. However, we are following their terms of use when designing the collection procedures, in this case limiting the number of requests to 5,000 per day. The research showed that most of the engines have several "flavors": in some cases there are mirrors (Altavista and Alltheweb are using Yahoo's database), different interfaces (Hereuare is using Gigablast's database), or the results can be provided by randomly selected data centers (Google). A table describing all the operators available for the engines, focusing on those useful for webometric purposes, has been published [30, 36, 40, 51].
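A collection procedure that respects such a quota can be sketched as follows; the even spacing of requests and the placeholder query_engine call are illustrative assumptions, not the project's actual collection scripts.

# Sketch of a collection procedure that respects a 5,000-requests-per-day
# limit by spacing queries evenly. query_engine is a placeholder for the
# actual call to a search engine API or scraper.
import time

DAILY_LIMIT = 5000
SECONDS_PER_DAY = 24 * 60 * 60

def collect(queries, query_engine):
    interval = SECONDS_PER_DAY / DAILY_LIMIT   # ~17.3 s between requests
    results = {}
    for i, q in enumerate(queries[:DAILY_LIMIT]):
        results[q] = query_engine(q)
        if i + 1 < min(len(queries), DAILY_LIMIT):
            time.sleep(interval)
    return results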

WP 2.2: Research groups with web presence. In order to obtain a comprehensive list of research groups' websites, a feasible strategy was needed. It was decided that only those groups with their own subdomain would be included. This decision takes into consideration not only technical reasons but also organizational ones, as a web subdomain is usually reserved for true units. From a technical point of view there is an easy way to extract subdomains, namely the feature command available from Yahoo. With the help of this operator it is possible to capture a large number of subdomains for a selected academic domain. The list obtained is very noisy, which means that a cleaning process must be performed. The URL cleaning is pretty simple but not complete; an additional cleaning using the title is more time-consuming, as it cannot be automated. The scheduled plan is to scan the domains of the top world universities (about 3,000) to obtain a list of over 200,000 research groups. A subset of this list, focusing on the Humanities, will be the primary population for analysis.
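The automatable part of this cleaning, collapsing a noisy hit list into the distinct subdomains under a given academic domain, can be sketched as follows; the hostname handling is illustrative, and the title-based cleaning mentioned above still requires manual review.

# Sketch: reduce a noisy URL hit list to the distinct subdomains that hang
# under a given academic domain. Standard library only.
from urllib.parse import urlparse

def subdomains_under(urls, academic_domain):
    found = set()
    for url in urls:
        host = urlparse(url).netloc.lower().split(":")[0]  # drop any port
        if host.endswith("." + academic_domain):
            found.add(host)
    return sorted(found)

# e.g. subdomains_under(hits, "example-univ.edu")
#      -> ["cs.example-univ.edu", "history.example-univ.edu", ...]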

WP 2.3: Web indicators. As previously stated, the indicators to be used derive from the available operators with webometric capabilities of the search engines. It is generally acknowledged that web data follow power-law distributions, so the normalization of the data implies the use of a log transformation. The geographical coverage of the search engines is very biased, so the use of several of them is mandatory to decrease the impact of individual biases. The data are collected twice for each engine, and then the median of the results of all the search engines is calculated. We are studying the use of the G factor and PageRank, which require complex mathematical analysis and are very computing-intensive. A composite indicator has been developed combining activity and visibility indicators obtained from the search engines [30, 36, 33, 34, 42, 44].
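A sketch of how such an indicator can be computed from raw counts, assuming a base-10 log transform, medians over the two collection rounds and over engines, and an illustrative 50/50 weighting of activity and visibility; the published composite formula may differ.

# Sketch of the indicator computation: log-transform the raw counts
# (power-law data), take medians over the two collection rounds and over
# engines, and combine activity and visibility into a composite score.
import math
from statistics import median

def normalize(count):
    return math.log10(count + 1)          # log transform for power-law data

def indicator(rounds_per_engine):
    # rounds_per_engine: {"engineA": [c1, c2], "engineB": [c1, c2], ...}
    per_engine = [median(counts) for counts in rounds_per_engine.values()]
    return normalize(median(per_engine))

def composite(activity_counts, visibility_counts, w_activity=0.5):
    # The 50/50 weighting is illustrative, not the published formula.
    return (w_activity * indicator(activity_counts)
            + (1 - w_activity) * indicator(visibility_counts))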

Figure 4: Geographical distribution of the top ranking positions

WP 2.4: Digital divide. The final result is the compilation of the Rankings Web, including universities, research centers, hospitals, business schools and repositories. These rankings are published twice per year and their coverage is truly global; for example, the Ranking of Universities provides information for close to 18,000 Higher Education Institutions worldwide. The geographical analysis of the results shows a concerning academic digital divide, as in the top positions there are far more North American universities than European ones (Figure 4). We are exploring the causes of this divide, analyzing the different aspects of the gap [31, 33, 35, 38, 41, 43, 48, 49, 50].

3 Indicators

In this section we include information related to the rest of the result indicators.

3.1 UNED subproject

Scientific and technological outcome. In two years we have published 22 papers in relevant events and 7 in journals related to the topics of the project (3 included in the SCI and 4 in other databases). For the conferences and workshops we list the reference indexes. Further submissions to journals are on the way, and more are foreseen in the future; the review process for journals is much longer, and consequently this kind of publication happens more towards the end of, and beyond, the period of the project. The whole list is included in the References. Besides publications, we have created resources and software: annotated corpora for the different tasks of the project have been generated and will be published as open resources for other researchers; a filtering classifier for academic websites; a multiclass (25-category) classifier for academic websites; an information extraction system for the automatic profiling of academic websites; and a prototype of an academic search engine.

Technology Transfer. The project has Alma Technologies S.A. as EPO. They are especially interested in opinion tracking, and the results of the Information Extraction task are very relevant for this purpose. In fact, a new project, WebOpinion, funded under the umbrella of the Plan Avanza, is currently ongoing, involving our group and the company.

Training PhD researchers engaged in the project. The junior researchers listed below were included in the proposal as members of the research team. Their profiles, as well as the international relationships in which they have been involved, are summarized.

• Álvaro Rodrigo, funded by a research grant of the Comunidad de Madrid. He is now finishing his PhD. He visited New York University from September 2007 to 30 March 2008; his advisor was Prof. Grishman, founder of the Proteus Project, which conducts a wide range of research in natural language processing. The theme of the collaborative work was the adaptation of the JET tool for Spanish. A. Rodrigo has also participated in the organization committee of the CLEF answer validation task ([19]), the TAC track ([21]) and the new ResPubliQA evaluation campaign in CLEF 2009.

• José Ramón Pérez: PhD finished on 9/12/2008. Currently working as a Clinical Assistant Professor at the University of North Carolina at Chapel Hill.

• Javier Artiles Picón: He obtained his PhD in February 2010. He was on a stay at New York University with Prof. Sekine (1/10/2008 – 1/4/2009). They have been jointly involved in the organization of the Web People Search evaluation task in 2008 and 2009.

The following junior researchers have been included after the start of the project:

• Arkaitz Zubiaga: Master's degree (28/9/2008); PhD in progress. He has a research grant from the Comunidad de Madrid.

• Guillermo Garrido: Currently enrolled in the Master "Inteligencia Artificial y Sistemas Informáticos" at UNED. He has a research grant (FPI) associated with the project. As part of his training he carried out a two-month stay at Yahoo! Research Barcelona in 2009, where he worked on studying the community structure of endorsement networks. A three-month stay is planned at Google Labs in Zurich, March-May 2010.

• Juan Martínez: PhD in progress, teaching assistant collaborating part time. He has carried out a three-month stay visiting the WEST (Web Exploration and Search Technology) Lab at the Polytechnic Institute of NYU. His internship advisor at Poly-NYU is Professor Torsten Suel. The theme of the joint work is related to the exploration and compression of the Wayback Machine, a huge archive preserving 40 billion web pages created since 1996, where the time stamp plays an important role in providing search results.

• Alberto Pérez: PhD in progress, teaching assistant collaborating part time. He has carried out a three-month stay at Helsinki University of Technology, with the Computational Cognitive Systems research group. His internship advisor is Professor Timo Honkela. The theme of the joint work was hierarchical clustering using Self-Organizing Maps.

• Patricia Hernández, hired by the project from September 2009 until November 2010. She is involved in the information extraction task.

All of them have contributed to the project, as reflected in the publications. Their stays in different research groups have consolidated our international relationships with leaders in the respective topics.

Coordination. The two teams are complementary in competences, and their expertise is required to achieve the goals of the project. The collaboration in the different tasks has been explained above. At this stage, joint publications are in preparation, aiming at disseminating the results in both areas, computer science and information science.

International Collaboration. Related to the topics of the Information Extraction task, we have co-organized two international evaluation campaigns, briefly described below.

• ResPubliQA, a track of the Cross-Language Evaluation Forum (CLEF), http://www.clef-campaign.org/. Given a pool of 500 independent natural language questions, systems must return the passage (not the exact answer) that answers each question from the JRC-Acquis collection of EU documentation. Both questions and documents are translated and aligned for a subset of languages (at least Bulgarian, Dutch, English, French, German, Italian, Portuguese, Romanian and Spanish). The track was coordinated by CELCT (IT) and UNED (ES). The central website is http://nlp.uned.es/clef-qa/.

• WePS, the Second Web People Search Evaluation Workshop, April 21st, Madrid, Spain, co-located with the WWW2009 Conference. Finding information about people in the World Wide Web is one of the most common activities of Internet users ([7, 8]). Person names, however, are highly ambiguous: in most cases, the results for a person name search are a mix of pages about different people sharing the same name. The Web People Search (WePS) Evaluation is a shared evaluation campaign focused on this problem. In this second evaluation, 19 research teams from around the world participated in two tasks: (i) clustering web pages to resolve the ambiguity of search results, and (ii) extracting 18 kinds of "attribute values" for target individuals whose names appear on a set of web pages. Participants were provided with task guidelines and development data, and later evaluated their systems on a new testbed. The WePS-2 organizers were Javier Artiles, NLP&IR Group (UNED), Julio Gonzalo, NLP&IR Group (UNED), and Satoshi Sekine, Proteus Project (NYU). The website is http://nlp.uned.es/weps/. The Web People Search Task ([6, 22]) has attracted the attention of several IT companies: Spock (USA) has sponsored the workshop, and Google (USA) and Alias-i (USA) are participating in the Programme Committee for the task. Evaluation measures are also an open problem for this task, and we have contributed a new proposal ([2, 3]).

We are involved in the MAVIR consortium, http://www.mavir.net/, where we have organized advanced seminars and workshops with invited speakers from New York University and Yahoo! Research Labs, among others. These events have been part of the planning for establishing collaborative actions with these research groups. Besides, new collaborations have been established, through our junior researchers, with Poly-NYU, Google Labs in Zurich and Helsinki University of Technology. Other well-known researchers have visited us, for example E. Hovy from the Information Sciences Institute in California, USA, and A. Peñas is currently on sabbatical in his group. In addition, members of the NLP&IR Group have been involved in three EU-funded projects related to QEAVis: Multimatch (Multilingual multimedia search engine in the Cultural Heritage domain), MedIEQ (Quality labeling of medical web content using multilingual information extraction) and TrebleCLEF (Evaluation, best practice and collaboration for Multilingual Information Access). In 2009 we successfully finished the MedIEQ and Multimatch projects, and TrebleCLEF in January 2010.

Project Management. The project is being carried out as planned, with some changes in junior personnel as previously indicated. There are no major deviations from the technical plan. Regular as well as special-purpose meetings are periodically organized with the participation of different team members from both subprojects. The project has a website and has been presented at different events, such as [29]. Results are disseminated in a timely manner through the mentioned international evaluation campaigns, as well as in the journal and conference contributions listed in the References.

3.2 Cybermetrics Lab

Scientific and technological outcome. In two years we have published 11 papers in ISI-refereed journals, 3 papers in other journals, 3 contributions to books and 26 presentations at national and international conferences.

International collaboration. Members of our group have participated in two expert groups of the DG Research of the European Commission, both targeting developments in the monitoring of the ERA (European Research Area): research and mobility indicators. Currently we are participating in the OpenAIRE EU project, providing statistics for the Framework Programmes' repository, and we have made a proposal (ACUMEN) for funding the development of indicators for individual researchers (FP7). We collaborate with several research groups, including CNRS, the University of Bordeaux, the University of Wolverhampton, the Dutch Royal Academy of Sciences, the Scimago group and others.

Training PhD researchers engaged in the project. Not funded.

References

[1] Enrique Amigó, Juan Martínez-Romo, Lourdes Araujo and Víctor Peinado: UNED at WebCLEF 2008: Applying High Restrictive Summarization, Low Restrictive Information Retrieval and Multilingual Techniques. Lecture Notes in Computer Science, 2008.

[2] E. Amigó, J. Gonzalo, J. Artiles, F. Verdejo: A Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints. Information Retrieval Journal, 2008. JCR, impact (0.69).

[3] E. Amigó, J. Gonzalo, J. Artiles: Combining Evaluation Metrics via the Unanimous Improvement Ratio and its Application in WePS Clustering Task. 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009. (Workshop in WWW Conference, CORE A.)

[4] Lourdes Araujo, José R. Pérez-Agüera: Improving Query Expansion with Stemming Terms: A New Genetic Algorithm Approach. EvoCOP 2008: 182-193. (Workshop in EuroGP Conference, CORE B.)

[5] Lourdes Araujo: Stochastic Parsing and Evolutionary Algorithms. Applied Artificial Intelligence, 23(4), 2009, 346-372. JCR, impact (0.79).

[6] J. Artiles, J. Gonzalo, S. Sekine: WePS 2 Evaluation Campaign: Overview of the Web People Search Clustering Task. 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009. (Workshop in WWW Conference, CORE A.)

[7] J. Artiles, E. Amigó, J. Gonzalo: The Role of Named Entities in Web People Search. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 534-542, 2009. CORE A.

[8] J. Artiles, E. Amigó, J. Gonzalo: The Impact of Query Refinement in the Web People Search Task. Proceedings of the ACL-IJCNLP 2009 Conference, 361-364, 2009. CORE B.

[9] García-Plaza, A. P., Fresno, V., Martínez, R. 2009. Una Representación Basada en Lógica Borrosa para el Clustering de Páginas Web con Mapas Auto-Organizativos. Curso de Tecnologías Lingüísticas (Fundación Duques de Soria), Soria, 7-11 July 2008.

[10] Alberto P. García-Plaza, Víctor Fresno, Raquel Martínez: Web Page Clustering Using a Fuzzy Logic Based Representation and Self-Organizing Maps. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Sydney, Australia, Dec. 9-12. CORE: B, Citeseer 807. Web Intelligence: 0.29 (top 66.09

[11] Juan Martínez-Romo, Lourdes Araujo: Recommendation System for Automatic Recovery of Broken Web Links. IBERAMIA 2008: 302-311. Citeseer.

[12] Juan Martínez-Romo, Lourdes Araujo: Web Spam Identification Through Language Model Analysis. Proc. Fifth International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 09), pp. 21-28. (Workshop in WWW, CORE A.)

[13] Juan Martínez-Romo, Lourdes Araujo: Sistema de Recomendación para la Recuperación Automática de Enlaces Web Rotos. Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural, SEPLN'08, Leganés (Madrid), Spain, 10-12 Sept. 2008. Indexed in SCOPUS, REDALYC, Latindex and DBLP.

[14] Juan Martínez-Romo, Lourdes Araujo: Detección de Web Spam basada en la Recuperación Automática de Enlaces. Curso de Tecnologías Lingüísticas (Fundación Duques de Soria), Soria, 7-11 July 2008.

[15] Juan Martínez-Romo and Lourdes Araujo: Retrieving Broken Web Links Using an Approach Based on Contextual Information. Proc. ACM Conference on Hypertext and Hypermedia (Hypertext 2009), 351-352. CORE A.

[16] José R. Pérez-Agüera, Hugo Zaragoza, Lourdes Araujo: Exploiting Morphological Query Structure Using Genetic Optimisation. NLDB 2008: 124-135. CORE C.

[17] José R. Pérez-Agüera, Lourdes Araujo: Comparing and Combining Methods for Automatic Query Expansion. CICLing 2008. Advances in Natural Language Processing and Applications, 33, pp. 177-188. CORE B.

[18] Joaquín Pérez-Iglesias and Lourdes Araujo: Ranking List Dispersion as a Query Performance Predictor. Proc. Advances in Information Retrieval Theory, Second International Conference on the Theory of Information Retrieval (ICTIR 2009), 371-374. (Second edition of this conference.)

[19] A. Peñas, A. Rodrigo, V. Sama, F. Verdejo: Testing the Reasoning for Question Answering Validation. Special Issue on Natural Language and Knowledge Representation, Journal of Logic and Computation. JCR, impact 0.53.

[20] F. Rangel, A. Peñas: Clasificación de Páginas Web en Dominio Específico. Revista de la Sociedad Española para el Procesamiento del Lenguaje Natural, SEPLN, 89-96, 2008. Indexed in SCOPUS, REDALYC, Latindex and DBLP.


[21] A. Rodrigo, A. Peñas, F. Verdejo: Towards an Entity-based Recognition of Textual Entailment. Text Analysis Conference (TAC) 2008 Workshop, 2008. Organized by NIST.

[22] S. Sekine, J. Artiles: WePS2 Attribute Extraction Task. 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009. (Workshop in WWW, CORE A.)

[23] Zubiaga, A., Fresno, V., Martínez, R. 2009. Comparativa de Aproximaciones a SVM Semisupervisado Multiclase para Clasificación de Páginas Web. Curso de Tecnologías Lingüísticas (Fundación Duques de Soria), Soria, 7-11 July 2008.

[24] Zubiaga, A. 2008. Comparativa de Aproximaciones a SVM Semisupervisado Multiclase para Clasificación de Páginas Web. Master Thesis Report.

[25] A. Zubiaga, V. Fresno, R. Martínez: Clasificación de Páginas Web con Anotaciones Sociales. Revista de la Sociedad Española para el Procesamiento del Lenguaje Natural, ISSN 1135-5948, Vol. 43, pp. 225-233, 2009. Indexed in SCOPUS, REDALYC, Latindex and DBLP.

[26] A. Zubiaga, R. Martínez, V. Fresno: Getting the Most Out of Social Annotations for Web Page Classification. Proceedings of the 9th ACM Symposium on Document Engineering, pp. 74-83, 2009, ACM. CORE A. Indexed in the Computer Science Conference Ranking under the area "Artificial Intelligence and Related Subjects".

[27] A. Zubiaga, V. Fresno, R. Martínez: Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification? Proceedings of the NAACL-HLT 2009 Workshop on Semi-supervised Learning for Natural Language Processing, pp. 28-36, Boulder, CO, United States, 2009, ACM. CORE A. Indexed in the Computer Science Conference Ranking under the area "Artificial Intelligence and Related Subjects".

[28] Zubiaga, A., García-Plaza, A. P., Fresno, V., Martínez, R.: Content-based Clustering for Tag Cloud Visualization. Proceedings of the International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2009), pp. 316-319. IEEE Computer Society and CPS. (First edition of this conference.)

[29] M. F. Verdejo, E. Amigó, L. Araujo, V. Fresno, G. Garrido, R. Martínez, A. Peñas, A. Pérez, J.R. Pérez, A. Rodrigo, J. Romo, A. Zubiaga, I. Aguillo, M. Fernández, A. M. Utrilla: QUEAVis: Quantitative evaluation of academic websites visibility. Revista de la Sociedad Española para el Procesamiento del Lenguaje Natural, ISSN 1135-5948, Vol. 43. Indexed in SCOPUS, REDALYC, Latindex and DBLP.

[30] Aguillo, I. (2009). Measuring the institution's footprint in the web. Library Hi Tech, 27(4), pp. 540-556.

[31] Aguillo, I.F. (2009). Web Networks of Collaboration. International conference: Exploring the Emergent European Research System. IPTS-JRC, Sevilla, 16-17 November 2009.

[32] Aguillo, I.F. (2009). Ranking Web of Repositories: Metrics, Results and a Plea for a Change. Seminario Metrics, Repositories and Social Sciences and Humanities. CCHS-CSIC, Madrid, 20 November 2009.

[33] Aguillo, I.F. (2009). Webometrics Indicators for the CNRS. Journée Les réseaux des savoirs, Université Bordeaux 3, 17 December 2009.

[34] Aguillo, I.F. (2009). Rankings! Spanish Session. Online Information Meeting 2009, London, 2 December 2009.

[35] Aguillo, I.F. (2009). Evaluating International Relationships in the Web: A Cybermetrics Analysis of the EINIRAS Network. EINIRAS Meeting 2009, Real Instituto Elcano, Madrid, 17 September 2009.

[36] Aguillo, I.F. (2009). Problemas técnicos, metodológicos y documentales en la elaboración de Rankings basados en indicadores Web. XI Jornadas Españolas de Documentación FESABID 2009, Zaragoza, 20-21 May 2009.


[37] Aguillo, I.F. (2009). Acceso Abierto: Métrica e indicadores (mesa redonda). XI Jornadas Españolas de Documentación FESABID 2009, Zaragoza, 20-21 May 2009.

[38] Aguillo, I.F., Bar-Ilan, J., Levene, M., Ortega, J.L. (2009). Comparing University Rankings. Proceedings of the International Conference on Scientometrics and Informetrics, Volume 1, pages 97-107.

[39] Aguillo, I.F. (2009). Métrica de repositorios y evaluación de la investigación. Anuario ThinkEPI, ISSN 1886-6344, No. 1, 2009, pp. 40-41.

[40] Aguillo, I.F. (2009). Your Institution's Footprint in the Web. 9th International Bielefeld Conference, Bielefeld University, 3-5 November 2009. (Opening keynote.)

[41] Aguillo, I.F. (2009). Asian Universities and the World Rankings: The Webometrics Scenario. International Conference on World University Ranking 2009, Universitas Indonesia, 16 April 2009.

[42] Aguillo, I.F. (2009). The Ranking Web. 2nd International Workshop on University Web Rankings 2009, CCHS-CSIC, Madrid, 20 April 2009.

[43] Aguillo, I.F., Bar-Ilan, J., Levene, M., Ortega, J.L. (2009). Comparing University Rankings. International Conference on Scientometrics and Informetrics (ISSI), Rio de Janeiro, 14-17 July 2009.

[44] Barré, R.; Regibeau, P.; Aguillo, I.F.; Lepori, B.; Siedschlag, I.; Soboll, H.; Tubbs, M.; Veugelers, R.; Ziarko, E.; Stierna, J. (2009). ERA Indicators and Monitoring. Expert Group Report. Brussels: European Commission. EUR 24171 EN, 140 pages.

[45] Fernández, M., Zamora, H., Ortega, J.L., Utrilla, A.M., Aguillo, I.F. (2009). Género y visibilidad Web de la actividad de profesores universitarios españoles: El caso de la Universidad Complutense de Madrid. Revista Española de Documentación Científica, 32(2), pp. 51-65.

[46] Ortega, J.L., Cothey, V., Aguillo, I.F. (2009). How old is the web? Characterizing the age and the currency of the European scientific web. Scientometrics, 81(1), pp. 295-309.

[47] Ortega, J.L., Aguillo, I. (2009). Minería del uso de webs. Profesional de la Información, 18(1), pp. 20-26.

[48] Ortega, J.L., Aguillo, I.F. (2009). Mapping world-class universities on the web. Information Processing and Management, 45(2), pp. 272-279.

[49] Ortega, J.L., Aguillo, I.F. (2009). Structural analysis of the Iberoamerican academic web. Revista Española de Documentación Científica, 32(3), pp. 51-65.

[50] Ortega, J.L., Aguillo, I.F. (2009). North America Academic Web Space: Multicultural Canada vs. The United States Homogeneity. ASIST & ISSI Pre-Conference Symposium on Informetrics and Scientometrics, Vancouver, BC, Canada, 7 November 2009.

[51] Utrilla Ramírez, A.M., Fernández, M., Ortega, J.L., Aguillo, I.F. (2009). Clasificación Web de hospitales del mundo: situación de los hospitales en la red. Medicina Clínica, 132(4), pp. 144-153.

[52] Utrilla-Ramírez, A.M. (2009). La visibilidad de los hospitales en la Web: Variables, riesgos y retos de los Rankings. Comunicación en salud 2.0: Retos y oportunidades de las redes sociales y las nuevas herramientas de comunicación. Jornada en el marco del Máster en Comunicación Científica, Médica y Ambiental, IDEC-Universidad Pompeu Fabra / Observatorio de la Comunicación Científica, Barcelona, 26 October 2009.