
Slovak University of Technology in Bratislava
Faculty of Informatics and Information Technologies

FIIT-5220-64389

Bc. Martin Lipták

RESEARCHER MODELING IN PERSONALIZED DIGITAL LIBRARY

Master thesis

Study program: Software Engineering
Field of Study: 9.2.5 Software Engineering
Place: Institute of Informatics and Software Engineering
Supervisor: prof. Ing. Mária Bieliková, PhD.

2014, May

Annotation

Slovak University of Technology in Bratislava
FACULTY OF INFORMATICS AND INFORMATION TECHNOLOGIES
Degree Course: SOFTWARE ENGINEERING

Author: Bc. Martin Lipták
Master thesis: Researcher Modeling in Personalized Digital Library
Supervisor: prof. Ing. Mária Bieliková, PhD.
2014, May

Digital libraries are crucial resources for researchers regardless of their fields of study. In order to deal with the amount of information in digital libraries and to improve the researcher’s experience, many parts of the digital library can be personalized. Digital library applications provide various user features and collect diverse metadata about documents. We propose a researcher model (a user model for a digital library) that comprises all types of user and domain data available in digital libraries. All the researcher’s interactions, along with all relevant content metadata, are extracted from the digital library data model into graph relations. Higher-level relations are deduced from the extracted relations. The final relation is a representation of the researcher’s interests in the digital library. The exact composition of relations and entities inside the graph and other parameters are adjusted to the requirements of each digital library. From the outside, the researcher model is a vector of terms; inside, it is a graph, so the components of the model are reusable and the model is flexible and extensible. We realize the researcher model in the Annota digital library. We evaluate the researcher model by determining the correlation between the researcher model terms and what the researchers themselves perceive as their research interests.


Annotation (translated from Slovak)

Slovak University of Technology in Bratislava
FACULTY OF INFORMATICS AND INFORMATION TECHNOLOGIES
Degree Course: SOFTWARE ENGINEERING

Author: Bc. Martin Lipták
Master thesis: Researcher Modeling in Personalized Digital Library
Supervisor: prof. Ing. Mária Bieliková, PhD.
May 2014

Digital libraries are an important source of information for every researcher regardless of the field of research. To cope with the large amount of information in digital libraries and to improve the researcher’s work, many functions of a digital library can be personalized. Digital library applications provide various features and collect diverse metadata about documents. We propose a researcher model (a user model for a digital library) that comprises all kinds of user and domain data available in digital libraries. All the researcher’s interactions, together with all relevant content metadata, are extracted from the digital library data model into relations in a graph. Higher-level relations are derived from the extracted relations. The top-level relation is a representation of the researcher model in the digital library. The exact composition of relations and entities inside the graph and other parameters are set according to the requirements of each digital library. The researcher model is a vector of terms from the outside, but a graph inside. Therefore the relations and terms in the model are reusable and the model is flexible and extensible. We realize the researcher model in the Annota digital library. We verify it by determining the correlation between the terms from the model and what the researchers themselves perceive as their own research interests.


I dedicate this work to all the researchers making constant effort to bring innovation.


ACKNOWLEDGEMENTS

I would like to thank Mária Bieliková for her ideas and supervision. I would also like to thank all members of the Annota team, all participants of the experiments and all participants of the pilot user study.


Contents

1 Introduction

2 Researchers in Digital Libraries
  2.1 Academic Publishers and Digital Libraries
    2.1.1 ACM
    2.1.2 IEEE
  2.2 Academic Search Engines
    2.2.1 Google Scholar
  2.3 Academic Social Networks
    2.3.1 ResearchGate
    2.3.2 Mendeley
    2.3.3 Annota
  2.4 Discussion

3 Personalization and User Modeling
  3.1 User Modeling Process
  3.2 User Characteristics and Context
  3.3 Types of User Models
  3.4 User Model Representation
  3.5 Discussion

4 Researcher Model Design
  4.1 External Perspective
  4.2 Internal Perspective
    4.2.1 Structure
    4.2.2 Content
    4.2.3 Properties
  4.3 Discussion

5 Researcher Model Realization in Annota
  5.1 Specification
    5.1.1 Available User Data
    5.1.2 Available Domain Data
    5.1.3 Personalisable User Features
  5.2 Design
    5.2.1 Graphene Framework
    5.2.2 Default Researcher Model
  5.3 Implementation
  5.4 Discussion

6 Evaluation
  6.1 Experiment 1
    6.1.1 Hypotheses
    6.1.2 Data
    6.1.3 Participants
    6.1.4 Methodology
    6.1.5 Results
  6.2 Experiment 2
    6.2.1 Data
    6.2.2 Participants
    6.2.3 Methodology
    6.2.4 Results
  6.3 Discussion

7 Conclusions

References

Resumé v slovenskom jazyku (Resumé in Slovak language)

A Pilot User Study
  A.1 Objectives
  A.2 Methodology
  A.3 Questions
    A.3.1 Research
    A.3.2 Papers
    A.3.3 Applications
  A.4 Results

B User Documentation

C Developer Documentation

D Journal Article Proposal

E Attached CD Content

Chapter 1

Introduction

Digital libraries are an important resource for all researchers regardless of their field of study. They are used to publish, search and access research works. Researchers can find solutions for particular problems, follow the latest trends in domains of their interest or publish results of their work. The amount of information in every domain has grown exponentially in recent years and continues to grow at an increasing rate. Digital libraries are no exception to this trend, with rapid growth in the number of published works. This has two serious consequences. Firstly, researchers miss many works that would be relevant to their interests. Secondly, they spend a large amount of time reading works that turn out to be inapplicable to their research. These issues are generally addressed with personalization, which, besides tackling information overload, enhances the user experience of digital library applications. Personalized digital library applications adapt their interfaces and contents to the current user in order to provide a better service for each individual user.

Personalization involves adaptation of the application to the user based on estimates of her characteristics. User modeling is the process of acquiring these estimates. There are various user characteristics that can be modeled [6]. Knowledge is important for adaptive learning systems, where the user model (or student model) reflects the student’s knowledge of discrete parts of the learning materials. Interests are modeled in news recommendation systems, which depend on a user model estimating the relevance of various article features to the user. Goals are modeled in information access systems to estimate what type of information the user currently needs. The user model is utilized together with a domain model, which provides content characteristics. Domain models of adaptive learning systems contain all the learning materials and their dependencies. User models often overlay domain models: user characteristics are estimated for units of content in the domain model [6]. For example, the student’s knowledge is estimated for the concepts which the student is supposed to learn.

When we talk about the domain of digital libraries, we use the terms researcher model and researcher modeling for user model and user modeling respectively. The rationale behind this terminology is the domain: users in digital library systems are researchers. User characteristics are important for modeling of researchers, who have their knowledge, interests, goals, etc. [8]. In 2003, a report on the state of personalization and recommender systems in digital libraries stated that a robust, flexible and portable researcher model is needed. The researcher model would be used in multiple digital libraries and could handle heterogeneous user data [7]. There are studies on characteristics relevant for user modeling in digital libraries [8, 9]. There are papers on recommender system design in digital libraries [10]. However, we have not seen a work thoroughly designing a general user model which could be used in any digital library. We approach this problem with three objectives.

• Digital library applications provide various user features like navigation, search, recommendations, tagging, categorization or personal annotations, and all the researcher’s interactions with these features can be logged. Domain models in digital libraries contain documents, tags, categories and diverse document metadata, and further metadata can be obtained from external services or other digital libraries. Digital libraries therefore provide vast user and domain data from many sources, and there are multiple user features that can use estimates of the researcher’s characteristics. We intend to design a flexible researcher modeling framework which can be used in any digital library, takes advantage of diverse researcher and domain data and provides estimates of user characteristics to diverse personalized user features in the digital library application.

• New user features or new sources of user and domain data can be added to digital library applications. We intend to design an extensible researcher modeling framework. All components inside the researcher model can be reused and the framework allows multiple researcher models to coexist at the same time. Besides enhancing the extensibility of the whole digital library application, an extensible framework is also suitable for experiments comparing various user models or information retrieval techniques.

• We intend to propose general researcher modeling principles covering the specifics of the digital libraries domain.

This work designs and evaluates a flexible and extensible user model for digital libraries. Chapter 2 analyzes user and domain data available in digital libraries and other specifics of digital library applications. Chapter 3 summarizes the state of the art of user modeling and personalization. Chapter 4 brings us to the specification of requirements and the design of a general researcher model realizable in any digital library. Chapter 5 realizes the designed researcher model in the Annota digital library. Some requirements on the designed researcher model are evaluated by its realization; other requirements are evaluated in Chapter 6. We summarize the contributions of this work in Chapter 7. At the end of the work we include a summary of the most important parts of the thesis in Slovak, since this is a requirement for theses written in foreign languages in the Slovak Republic.

In Appendix A we perform a pilot user study with the objective of gaining deeper insight into research and digital libraries. Appendix B contains user documentation for Annota. Appendix C contains developer documentation of the researcher model in Annota. Appendix D summarizes this work in the form of a journal article proposal. Appendix E lists the contents of the attached CD.


Chapter 2

Researchers in Digital Libraries

A digital library is a potentially virtual organization, that comprehensively collects, manages and preserves for the long depth of time rich digital content, and offers to its targeted user communities specialized functionality on that content, of defined quality and according to comprehensive codified policies [18]. To clarify this general definition, we can imagine a web information system that stores documents published by its users and makes them available to other users. The web information system offers specialized functionality for searching, tagging and categorizing documents. Publishing may involve special users who review documents. Such a system is an instance of a digital library.

Digital libraries are utilized in all domains where sharing of digital content is crucial, like research, education, culture or arts. We denote users of digital libraries as researchers, because they search and publish digital content in order to share knowledge and inspiration.

This chapter analyzes the domain aspects of researchers in digital libraries that are relevant to researcher modeling, which is the aim of this work. Examples used throughout the chapter favor research in information technologies, since we are familiar with this domain.

We have performed a pilot user study with 7 researchers with the objective of gaining deeper insight into research and digital libraries and of understanding researchers and how they work (see Appendix A). We reference the study many times throughout this and the following chapters.

2.1 Academic Publishers and Digital Libraries

Important scientific publishers provide online digital libraries with papers, journals and books in digital formats. Many works in the field of information technologies are published by professional organizations like ACM and IEEE, which we cover in this section. Other works are published by Elsevier¹, Springer² or other scientific publishers.

All these works are available to subscribers who have purchased membership in the digital library of the publisher. Only metadata like title, authors, editors, abstract, etc. are publicly available online. This has two main reasons. The first is the cost of editing, peer review, adding of metadata and sometimes printing of research works. The second is the monopoly of the traditional research publishers [17]. There is a growing movement in the scientific community to publish research works freely and without restrictions on their availability. Many researchers also use non-profit publishers like Public Library of Science³ (the organization is non-profit, but the publishing costs have to be paid by the researcher). Even more researchers publish pre-review versions of their works on their personal web sites, in services like ArXiv⁴ or in academic social networks that are covered in the following sections.

¹ http://www.elsevier.com/
² http://www.springer.com/

2.1.1 ACM

Association for Computing Machinery (ACM) is a professional organization for computer scientists and engineers. It defines itself as ’the world’s largest educational and scientific computing society’. Along with many services for its members, it provides a comprehensive digital library. ACM Digital Library⁵ (ACM DL) contains articles published in numerous ACM journals, transactions or magazines, proceedings from ACM conferences and other publications provided by ACM or affiliated organizations.

ACM DL offers a simple search interface. After researchers specify keywords, they can refine the search with facets (like authors, editors, publisher or publication year). See Figure 2.1.

Figure 2.1: ACM Digital Library search interface with facets

³ http://www.plos.org/
⁴ http://arxiv.org/
⁵ http://dl.acm.org/


Figure 2.2: ACM Digital Library paper details with metadata, comments, bibliometrics and a tool to generate citations

ACM DL provides a web page with the available metadata for every paper. Researchers can submit reviews and comments, view article download statistics and generate article citations in multiple formats (see Figure 2.2). Some participants of the pilot user study stated that, thanks to these features, they search directly in the ACM DL instead of using an academic search engine (see Appendix A). ACM DL provides the following article metadata.

• Title

• Publication year

• Abstract

• Authors

• References (other works referenced in the article)

• Cited by (other works that reference the article)

• Index Terms (ACM taxonomical categorization of the article)

• DOI⁶

⁶ A Digital Object Identifier (DOI) uniquely identifies an object such as an electronic document. The DOI of a document is permanent. It is a more stable reference to the document than a URL, which can change when, for example, the publisher changes its digital library system.


Every article belongs to a publication, which has its own metadata.

• Publication type (journal, proceedings, etc.)

• Volume

• Year

• Issue

• Publisher (e.g. ACM New York)
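The two metadata lists above can be read as a small data model in which every article points to the publication it belongs to. The following Python sketch illustrates that reading only; the class and field names (Publication, Article, describe) are assumptions chosen for this example and do not correspond to any ACM DL interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Publication:
    """Container an article appears in (journal, proceedings, etc.)."""
    publication_type: str          # e.g. "journal" or "proceedings"
    publisher: str                 # e.g. "ACM New York"
    year: int
    volume: Optional[int] = None   # not every publication has a volume
    issue: Optional[int] = None    # not every publication has an issue


@dataclass
class Article:
    """Article-level metadata, mirroring the list of ACM DL fields."""
    title: str
    publication_year: int
    abstract: str
    authors: List[str]
    doi: str
    publication: Publication
    references: List[str] = field(default_factory=list)   # works this article cites
    cited_by: List[str] = field(default_factory=list)     # works citing this article
    index_terms: List[str] = field(default_factory=list)  # taxonomy categories


def describe(article: Article) -> str:
    """Return a one-line, human-readable summary of an article."""
    authors = ", ".join(article.authors)
    return (f"{authors} ({article.publication_year}). "
            f"{article.title}. {article.publication.publisher}.")
```

An article would be created by first building its Publication and then passing it to Article; all values used for this (titles, names, identifiers) would be placeholders supplied by the caller.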

2.1.2 IEEE

Institute of Electrical and Electronics Engineers (IEEE) is a professional organization whose members are mostly electrical, electronics and computer engineers. It defines itself as ’the world’s largest professional association dedicated to advancing technological innovation and excellence for the benefit of humanity’. The IEEE digital library, IEEE Xplore⁷, offers access to scientific and technical content published by IEEE and its partners.

IEEE Xplore provides an intuitive search interface with keyword search and facet refinements (like content type, publication year or author). See Figure 2.3.

Figure 2.3: IEEE Xplore search interface with facets

There is a page for every article in IEEE Xplore, where researchers can read metadata, share articles and generate article citations in multiple formats (see Figure 2.4). The following metadata are available for articles in IEEE Xplore.

⁷ http://ieeexplore.ieee.org/


Figure 2.4: IEEE Xplore paper details with metadata

• Title

• Publication year

• Abstract

• Authors

• IEEE terms

• Author keywords

• DOI

• References (works referenced in the article)

2.2 Academic Search Engines

Researchers work with multiple digital libraries. There are numerous other digital libraries on the Web that they may not be familiar with or may not have access to due to membership restrictions. Some research works may also be found elsewhere on the Web. Just as a web search engine provides access to vast information on the Web, academic search engines provide access specifically to the huge amounts of research works available on the Web (including digital libraries whose metadata are publicly available). The most used academic search engines in the field of information technologies are Google Scholar, CiteSeerX⁸ and Microsoft Research.

⁸ http://csxstatic.ist.psu.edu/


2.2.1 Google Scholar

Google Scholar⁹ searches across all on-line digital libraries and also research works found elsewhere on the Web. Its web interface is very similar to the search interfaces of digital libraries: keyword search with faceted query refinements. It tracks citations of articles, searches for related articles and computes author statistics. It also provides a direct link to download the article if the researcher has a valid membership in the library where the article is stored. See Figure 2.5.

Besides searching for research papers, Google Scholar provides paper recommendations and various kinds of notifications (the user’s paper has been cited etc.) for registered users. Many participants of the pilot user study find Google Scholar practical, and one participant emphasized the recommendations, which he found useful (see Appendix A).

Figure 2.5: Google Scholar

2.3 Academic Social Networks

Academic social networks connect researchers and facilitate spreading of information in their disciplines far beyond libraries and search engines (ResearchGate, Academia.edu¹⁰). Many of them started as simple tools that, with the inclusion of social features, grew into full social networks (Mendeley, Annota).

⁹ http://scholar.google.sk/
¹⁰ http://academia.edu/


2.3.1 ResearchGate

ResearchGate¹¹ is an academic social network with over 2.7 million users. The researcher can ask her followees a question or add a publication. She can be part of a project or an institution. See Figure 2.6.

Figure 2.6: ResearchGate

2.3.2 Mendeley

Mendeley¹² is a free reference manager and academic social network. In November 2012 Mendeley celebrated 2 million users¹³. It consists of a desktop client (Mendeley Desktop; Figure 2.7) and a web application (mendeley.com; Figure 2.8). An official iPhone application has also been released recently. Many participants of the pilot user study use Mendeley (see Appendix A).

• Mendeley Desktop is used to organize, read and add annotations to documents on the researcher’s computer. The researcher reviews document metadata.

• Documents are uploaded to mendeley.com and synchronized across all computers and other devices of the researcher. They are also accessible anywhere on-line via mendeley.com.

• Researchers can follow other researchers. They can see works of others, write comments and respond to comments. They can create groups, share papers and discuss across these groups.

¹¹ http://www.researchgate.net
¹² http://www.mendeley.com/
¹³ http://blog.mendeley.com/academic-life/mendeley-has-two-million-users-to-celebrate-were-releasing-the-global-research-report/


Figure 2.7: Mendeley Desktop

Figure 2.8: Mendeley.com

• Relevant papers are recommended to the researcher based on what she is currently reading.

• Researchers can automatically generate bibliographies.


Mendeley also provides an external API for Mendeley application developers¹⁴. Examples of Mendeley applications are Mendeley clients for other platforms (Android), bibliography widgets for content management systems (Wordpress, Drupal) or various statistics and mash-ups. The Mendeley API consists of public methods and user-specific methods. Public methods provide, among other features, document searching and access to document metadata and related documents. User-specific methods require the user’s authentication and provide the following features.

• Statistics about authors, tags and publications concerning documents in user’s library

• User’s library documents along with their metadata

• User’s groups

• User’s folders

• User’s profile

Mendeley stores metadata about each document.

• Document type (Journal Article, Working Paper, Book, Web Page, Patent, Computer Program etc.)

• Title

• Authors

• Abstract

• Tags

• Keywords

• URL

• Catalog IDs (ArXivID, DOI, PMID)

Other metadata are added depending on document type (like journal, issue, volume and year for journal articles).

2.3.3 Annota

Annota¹⁵ is a web application developed as a research project at FIIT STU in Bratislava. Annota can be described as a tool for annotating documents and a simple academic social network [20]. The project started as a Firefox extension to annotate web pages and research papers (see Figure 2.9). Later a web interface was added to display the user’s bookmarked web pages, papers and annotations (see Figure 2.10). The web interface has developed into a small academic social network with researchers sharing papers, web pages and their annotations, tagging and categorizing papers, following each other and organizing themselves in groups.


Figure 2.9: Annota browser extension

Figure 2.10: Annota web application

Annota stores diverse metadata about every paper. Metadata are enriched by importing from linked Mendeley user accounts, scraping the ACM digital library and extracting keywords using AlchemyAPI. There are currently 182 registered users, 53,243 papers and 2,687 tags (April 2014). Annota contains the following data about every user.

¹⁴ http://dev.mendeley.com/
¹⁵ http://annota.fiit.stuba.sk


• Papers and web pages in the library

• Papers and web pages marked as to be read later

• Papers and web pages marked as favorite

• Tags of papers and web pages created by the user or imported from Mendeley

• Keywords of papers and web pages extracted using AlchemyAPI, obtained from ACM metadata or imported from Mendeley

• Annotations of papers and web pages

• Folders of papers and web pages created by the user or imported from Mendeley

• User’s membership in groups

• Logs of user’s activities: user’s access to articles in Annota, user’s searches in Annota, user’s activity on the ACM web site

The author of this thesis is a member of the Annota developer team. Annota is used to get the data required to build the researcher model and to evaluate it.

2.4 Discussion

Researchers publish their works in digital libraries. Digital library applications provide many user features like navigation using facets, tags or folders, full-text and metadata search, pages with metadata for every digital work, diverse citation export options and statistics. Besides digital library web interfaces, the researchers use other applications.

• Academic search engines to search for digital content across multiple digital libraries (Google Scholar)

• Academic social networks to share the works of their interest (ResearchGate, Mendeley, Annota)

• Tools to annotate documents (Annota)

• Tools for personal library organization (Mendeley, Annota)

Among the participants of the pilot user study, the most used applications are Google Scholar and Mendeley. The most relevant digital library among the participants of the study (and also of the future experiments) is ACM DL. See Appendix A.

We analyzed the metadata and APIs that are provided by the mentioned applications. Annota takes advantage of the Mendeley API and ACM DL metadata and extracts additional keywords using AlchemyAPI. It collects logs about the user’s activity in the Annota web application and on the ACM web pages. As Annota provides the necessary data and access to user feedback, we use it for implementation and evaluation of the researcher model.


Chapter 3

Personalization and User Modeling

Personalization improves the efficiency of the user in information spaces like digital libraries or the Web. It takes advantage of the user’s characteristics to adapt content and user interfaces in information systems. Personalization is a special type of adaptation, where the adaptive system adapts to the user [3]. Each user might see a modified user interface and different content parts that are more suitable to her needs. Many adaptive systems benefit from personalization, for example learning systems, news recommenders, interactive guides and digital libraries.

Personalized systems use the following models [6].

• Domain model represents characteristics of content in the information system. It comprises keywords, terms or concepts. Keywords are small parts of text extracted from content. Terms are keywords grouped using normalization, stemming, lemmatization or text similarity measures. Concepts usually do not have a direct counterpart in the original information source. They can be assigned by an expert or extracted using named entity recognition, tags or ontologies. Individual entities of the domain model can be linked with relations, which reflect their similarity, order or dependency.

• User model estimates characteristics of the user like knowledge, interests, goals, tasks or individual traits.

• Context model estimates particularities of the current interaction of the user (i.e. the current user session) like device, browser or location.
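The step from keywords to terms described for the domain model can be illustrated with a minimal Python sketch. It assumes a deliberately crude suffix-stripping function in place of a real stemmer or lemmatizer; the function names (naive_stem, normalize, group_keywords) are hypothetical and are not taken from the cited literature.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def naive_stem(word: str) -> str:
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(keyword: str) -> str:
    """Lower-case a keyword, drop punctuation and stem each word."""
    words = re.findall(r"[a-z0-9]+", keyword.lower())
    return " ".join(naive_stem(word) for word in words)


def group_keywords(keywords: Iterable[str]) -> Dict[str, List[str]]:
    """Group surface keywords under the term they normalize to."""
    terms: Dict[str, List[str]] = defaultdict(list)
    for keyword in keywords:
        terms[normalize(keyword)].append(keyword)
    return dict(terms)


groups = group_keywords(["User models", "user model", "User-Modeling"])
print(groups)  # {'user model': ['User models', 'user model', 'User-Modeling']}
```

The three surface forms collapse into one term. A crude stemmer like this one also makes mistakes (for instance, it maps "libraries" to "librari" but leaves "library" unchanged), which is why text similarity measures or a proper lemmatizer are usually preferred in practice.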

Personalization comprises the following steps.

1. Domain, user and context data are collected. Their extent depends on their availability, the privacy preferences of the user and the purpose of personalization.

2. Domain, user and context characteristics are estimated and stored in the respective models. The models serve as middleware between the collected data and the requirements of the personalized user features of the application.

3. The information system interface and content are adapted based on the models.
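As a rough illustration of the third step, the sketch below ranks documents by matching a term-based user model against a term-based domain model and lets a context model shorten the result list on mobile devices. It is a simplified example under stated assumptions: the dictionary layouts, the dot-product scoring and the "device" context key are choices made for this illustration, not a method prescribed by the cited works.

```python
from typing import Dict, List

DomainModel = Dict[str, Dict[str, float]]  # document id -> term -> weight
UserModel = Dict[str, float]               # term -> estimated interest
ContextModel = Dict[str, str]              # e.g. {"device": "mobile"}


def score(document_terms: Dict[str, float], user_model: UserModel) -> float:
    """Dot product between a document's terms and the user's interests."""
    return sum(weight * user_model.get(term, 0.0)
               for term, weight in document_terms.items())


def personalize(domain: DomainModel, user_model: UserModel,
                context: ContextModel,
                default_limit: int = 10, mobile_limit: int = 3) -> List[str]:
    """Return document ids ordered by estimated relevance to the user.

    The context model only influences how many results are shown.
    """
    limit = mobile_limit if context.get("device") == "mobile" else default_limit
    ranked = sorted(domain,
                    key=lambda doc_id: score(domain[doc_id], user_model),
                    reverse=True)
    return ranked[:limit]
```

Because the adaptation step reads only the three models, the way the models are estimated can change without touching the personalized feature itself, which is the middleware role described in step 2.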


User modeling is a process of collecting data about the user, estimating characteristics of the user (building a user model) and adapting the information system based on these data. This chapter describes user and context characteristics important for user modeling, types and representations of user models, and takes a closer look at the user modeling process. The applicability of existing approaches to the modeling of researchers in digital libraries is discussed.

3.1 User Modeling Process

The process of user modeling consists of the following 3 phases [5].

1. Acquisition of data. The system collects data about every individual user from multiple sources [1].

• Using predefined values. For example, an educational system assumes that a new user knows nothing and thus initializes the user’s knowledge of all concepts to zero.

• Acquiring user data directly from the user (explicit user modeling). It is usually performed by letting the user fill in a prepared form. Almost every adaptive system uses explicit user modeling to some extent (for example, user profile information on Facebook is important for adaptation of advertisements). Explicit user modeling alleviates the cold-start problem, as the information needed for successful initial adaptation can be provided upon registration of a new user.

• Acquiring user data indirectly from logs of the user’s interaction with the system (implicit user modeling). Implicit user modeling can require monitoring of the user’s access to resources (e.g. web pages), the user’s feedback (implicit or explicit) or the user’s behavior (clicks, keystrokes, scrolling). It is important to record the time of the user’s interactions, as it can later be used to model how the user’s interests change over time. Implicit user modeling is sensitive to the amount of collected data and therefore needs to be combined with other approaches to cope with the cold-start problem.

• Using the information about and from other users. When the user declares a relationship with other users in the system (a friendship in a social network), this relationship can be used to infer similar interests or other characteristics of the users.

• Acquiring information from external sources. There are many applications that store personal user data and have means to share these data with other applications. Examples are external APIs the user has authorized access to (e.g. Twitter API, Mendeley API) or file import from another application (e.g. an XML file with bookmarks from a web browser).


The user’s contextual information (location, platform etc.) needs to be collected along with the data required for the user model. The extent of collected data depends on the nature of the personalized system. We must also find a balance between the required data and the user’s privacy, and we may not share the data with any third party without the user’s prior consent. The process of collecting user data should be unobtrusive and run in the background to place minimal burden on the user.

2. Characteristics estimation and update. Collected data are used to estimate each user’s characteristics (knowledge, interests etc.). Previously estimated characteristics are updated. Estimated characteristics are stored in the user model representation. The characteristics chosen to be estimated depend on the nature of the personalized system. Keeping the estimated user characteristics up to date while the real user characteristics keep changing is the main challenge of user modeling.

3. Adaptation. The personalized system uses the estimated characteristics to adapt its interface, recommendations, search results or any other adaptive features to the user.

User modeling is an iterative process in which all three steps are repeated. For example, a recommender collects the user’s feedback on recommended items, estimates the user’s current interests and adapts future recommendations. Later it collects the user’s feedback on those recommendations, updates the user’s interests and so on.
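The three phases can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems; the class `InterestModel`, its method names and the additive feedback update are assumptions chosen only to make the acquire, estimate and adapt loop concrete.

```python
from collections import Counter


class InterestModel:
    """Toy user model (hypothetical): term -> accumulated interest weight."""

    def __init__(self):
        self.weights = Counter()
        self.log = []

    def acquire(self, term, feedback):
        # Phase 1: record raw interaction data (feedback assumed in [-1, 1]).
        self.log.append((term, feedback))

    def estimate(self):
        # Phase 2: update the estimated characteristics from the new data.
        for term, feedback in self.log:
            self.weights[term] += feedback
        self.log.clear()

    def adapt(self, candidates):
        # Phase 3: order candidate items by the estimated interest.
        return sorted(candidates, key=lambda t: self.weights[t], reverse=True)


model = InterestModel()
model.acquire("user modeling", 1.0)
model.acquire("databases", -0.5)
model.estimate()
ranking = model.adapt(["databases", "user modeling", "compilers"])
```

Repeating `acquire`, `estimate` and `adapt` on every new batch of feedback corresponds to the iteration described above.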

3.2 User Characteristics and Context

Modeled user characteristics either depend on the domain of the personalized system or are domain-independent. They also reflect the goals of its personalization, that is, the services of the system that are going to be personalized. Brusilovsky and Millán [6], who deal mostly with personalized educational systems, identify the five most important features of the user:

• Knowledge is the most important user characteristic for personalized educational systems. It increases as the user learns new concepts and decreases as the user forgets concepts already learned. Knowledge can be conceptual or procedural. Conceptual knowledge represents facts and their relationships (e.g. theory of software engineering, grammar of a foreign language). Procedural knowledge represents problem-solving skills (e.g. math problems, programming exercises). A knowledge model can also be inverted and represent errors or misconceptions of the user.

• Interests are the most important user characteristics for personalized information retrieval systems (e.g. news readers, microblog readers, social network timeline filters). Interests can change over time. For example, there are trending topics and there are topics the user is steadily interested in.

• Goals and tasks represent the immediate purpose of the user’s current actions within the system. In the domain of educational systems, this can be what the student currently wants to learn.


• Background represents the user’s experience outside the core domain of the system. For example, an educational system changes the language of exercises for foreign students.

• Individual traits is an aggregate name for a set of features that together represent a user as an individual, like personality traits (introvert, extrovert), cognitive styles (holist, serialist), cognitive factors (working memory capacity) or learning styles.

Brusilovsky and Millán identify other features of the user as context features. Context modeling is conceptually different from user modeling, as many context features cannot be considered information about the user in a pure sense. However, they are tightly related to the user and are also important for personalization. That is why we describe them in this section.

• User platform represents the hardware, operating system, web browser or any other software the user employs to access the personalized system.

• User location represents the user’s current location like work, home etc.

• Affective state represents the user’s current emotional state when working with the system.

Frias-Martinez et al., who have been researching digital libraries, mention nine dimensions of user modeling [8].

• Cognitive style represents the way in which the user processes information. There are studies of how users of different cognitive styles interact differently with the system [15]. Frias-Martinez et al. have also evaluated methods for automatic cognitive style recognition [9].

• System experience represents the user’s experience with the system.

• Domain expertise represents the user’s knowledge of the topics in the digital library she is interested in.

• History captures user’s past interaction with the system.

• Device represents the user’s hardware when interacting with the system.

• Context represents the user’s location when interacting with the system. Their terminology is different from the terminology of Brusilovsky and Millán and from ours. We consider context more generally, as described above.

• Personal data include gender, age, language, culture etc.

• Interests represent relevant topics in digital library for the user.

• Goals represent the user’s reason for current session to work with the system.


The nine dimensions of user modeling in digital libraries are similar to the five user characteristics with context features in general and educational web-based systems. Similar user characteristics and context features can be used for user modeling in other domains as well. However, two aspects vary across domains. Firstly, the importance of individual user and context features is different. For example, the most important feature in the domain of education is knowledge, whereas in the domain of information retrieval systems the most important feature might be the user’s interests. Secondly, the distinction between user model and context model is different. The user’s goals and tasks in the domain of education are user characteristics, because what the student wants to learn is an essential part of the student model. However, in the domain of digital libraries, the user’s goals, for example current search terms, are a part of the user’s context.

Some user models store only the values of user characteristics [4]. Other user models also store a reference to the source used to acquire them [11]. It is generally advantageous to store additional metadata along with the characteristics, for instance confidence and relevance [2]. Relevance captures the importance of a characteristic to the user. Confidence captures the way a characteristic was acquired. In certain situations characteristics provided by the users have higher confidence and characteristics provided by other adaptive systems about the users (acquired using external APIs) have lower confidence. In other situations this is inverted: we trust characteristics provided by external APIs more than those provided by the users themselves (e.g. when we access user data in applications that the users have been using longer than our application).
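A minimal sketch of a characteristic stored together with its metadata follows. The field names, the numeric scales and the selection rule in `most_trusted` are illustrative assumptions, not a structure prescribed by the cited works.

```python
from dataclasses import dataclass


@dataclass
class Characteristic:
    name: str
    value: float
    relevance: float   # importance of the characteristic to the user
    confidence: float  # trust in the way the value was acquired
    source: str        # reference to where the value came from


def most_trusted(candidates):
    """Pick the candidate estimate with the highest confidence."""
    return max(candidates, key=lambda c: c.confidence)


# Two competing estimates of the same characteristic (values are made up).
estimates = [
    Characteristic("interest: user modeling", 0.9, 0.8, 0.6, "registration form"),
    Characteristic("interest: user modeling", 0.4, 0.8, 0.9, "external API"),
]
chosen = most_trusted(estimates)
```

Here the externally acquired value wins because it was assigned the higher confidence; swapping the two confidence values models the opposite situation described above.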

3.3 Types of User Models

There are two general types of user models.

• Stereotype model represents the user as a group member. All users belonging to the same group are considered the same from the viewpoint of user modeling. When the user’s characteristics change, the user’s group membership is changed.

The simplest stereotype model is the scalar model, where the user evaluates herself (or is objectively tested) by a single value on some scale [5].

• Feature-based model represents every user individually based on her characteristics. When the user’s characteristics change, the feature-based user model is adjusted.

The most common feature-based model is the overlay model, where concepts from the domain model have their copies in the user model. For example, a domain model reflects expert-level knowledge and the overlay model represents the user’s knowledge of individual concepts [5].

Stereotype user models are simple to use and require little initial user data (e.g. an answer to a question after the user’s account in the system has been created). Feature-based models offer much more flexibility, as they store multiple features for every user individually. However, this flexibility requires much more initial user data. When the system does not have enough data about the user (for example, a new user has just created an account), a feature-based user model cannot truly reflect her characteristics. This is called the cold-start problem. A good way to use the strength of a feature-based user model and avoid the cold-start problem is to combine both models: use a stereotype model initially, when the system does not have enough user data, and switch to a feature-based user model when it has acquired all the necessary information about the user.
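The combination of both model types can be sketched as follows. The stereotype profiles, the threshold `MIN_OBSERVATIONS` and the function `user_model` are hypothetical; a real system would choose its own switching criterion.

```python
# Hypothetical group profiles: stereotype -> {interest: weight}.
STEREOTYPES = {
    "student": {"data mining": 0.5, "thesis writing": 0.8},
    "senior researcher": {"grant proposals": 0.7},
}
MIN_OBSERVATIONS = 3  # assumed threshold for "enough user data"


def user_model(stereotype, observed_weights, n_observations):
    """Return the stereotype profile until enough individual data exists."""
    if n_observations < MIN_OBSERVATIONS:
        return dict(STEREOTYPES[stereotype])
    return dict(observed_weights)


new_user = user_model("student", {}, 0)
active_user = user_model("student", {"graph models": 0.9}, 12)
```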

Overlay user models often have increased demands on system resources. There are multiple strategies to cope with this problem. Firstly, the coupling between the user and domain layers can be loosened. For example, in the domain of job offers, the user layer can store relations to job offer attributes, which are shared among all job offers, instead of full job offers [1]. Another strategy is dividing the user model into multiple distinct layers. For example, we can model the parts of the system’s information space visited by the user in one layer and use them to update another layer with other parts of the system’s information space [21]. Layered user models often have an evidence layer (also called usage data layer), which contains the original user data (logs of the user’s interaction with the system, original data extracted from external APIs). Other layers contain higher-level characteristics inferred from the evidence layer.

3.4 User Model Representation

There are four common ways to represent the user model. Selection of a suitable representation depends mostly on the purpose of the personalized system, the requirements on its scalability (number of users, amount of data) and the user characteristics that are being modeled.

• Bag of words represents the user as a set of keywords that indicate her attitude to content items. Items in the domain model can be either included in the user model or not, so a bag of words implies a binary value. Multiple attitudes can be modeled using multiple bags of words.

This approach is suitable when the most important user characteristic to be modeled is interests, and thus it can be used in information retrieval systems (like search engines or article recommendation systems) [23].

• Vector represents user’s attitude to all items in the domain model.

$X = (x_1, x_2, x_3, \ldots, x_n)$

$x_i$ represents the user’s attitude to the i-th item in the domain model. If we want to express multiple attitudes, we have to employ multiple vectors. The values can be booleans indicating whether the user has understood a concept, or integers representing the degree of knowledge of the concept [4].

This representation offers more precision than the bag of words model and is still quite simple and lightweight. It is suitable for open-corpus user interest modeling¹ for two reasons. Firstly, new keywords may appear at any time and such a lightweight representation is very efficient and scalable. Secondly, it can be difficult to identify the relations, create an expert graph-based domain model and base the user model on this domain model. For example, for a domain-independent user model depicting the user’s interest in the web pages she has visited, it is easier to work with keywords and their weights than with concepts and relations (as the web pages can belong to different domains and the relations are difficult, if not impossible, to model).

• Graph represents the user as a network. Nodes are concepts and edges are relationships between them. Nodes and edges in the user model copy nodes and edges in the domain model, with additional attributes indicating the user’s attitude to the domain model concept.

For example, in an educational system, every corresponding user model concept has an attribute representing the user’s knowledge of the domain model concept. The graph representation is used to keep track of concept prerequisites (some topics need to be mastered before continuing to other ones).

Another example is news recommendation systems, where a graph representation is used instead of simpler representations like vector or bag of words. With a graph, relations between keywords can be added. These relations connect keywords with similar meaning or, on the other hand, disambiguate different meanings of the same keyword. This leads to a more accurate user model and thus more precise recommendations [14].

• Ontology representation is usually based on a graph representation. An ontology adds hierarchical relationships, semantics and the possibility of using inference algorithms. Ontology-based models can be shared with other adaptive applications. Their disadvantage is that they are more difficult to obtain automatically.
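To make the difference between the first three representations concrete, the following Python sketch stores the same interests as a bag of words, a vector and a small graph. The terms, weights and the helper `vector_to_bag` are illustrative assumptions only.

```python
# Bag of words: a term is either in the model or not.
bag_of_words = {"user modeling", "digital libraries"}

# Vector: one weight per item of the domain model (fixed term order).
domain_terms = ["user modeling", "digital libraries", "compilers"]
vector = [0.8, 0.5, 0.0]

# Graph: weighted user-to-concept edges plus concept-to-concept relations.
graph = {
    "user": {"user modeling": 0.8, "digital libraries": 0.5},
    "user modeling": {"personalization": 1.0},  # related concept
}


def vector_to_bag(terms, weights, threshold=0.0):
    """Reduce a vector to a bag of words by keeping terms above a threshold."""
    return {t for t, w in zip(terms, weights) if w > threshold}


reduced = vector_to_bag(domain_terms, vector)
```

The reduction shows that the vector carries strictly more information than the bag of words: the weights are lost, while the set of terms is preserved.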

3.5 Discussion

Following the user characteristics described in section 3.2, we propose the characteristics relevant to the domain of digital libraries.

• Interests. We assume that papers the researcher has stored in the library or web pages the researcher has bookmarked imply the researcher’s interest in them. Interests change with time and thus the temporal dynamics of the interest model is important.

• Knowledge. The researcher’s knowledge can be represented as papers the researcher has authored or read. Knowledge also changes with time, as a researcher can forget notions from papers that she read a long time ago.

¹ Open-corpus modeling deals with domains where new concepts appear after the system has been designed. Closed-corpus modeling, on the other hand, deals with domains where concepts are known while the system is being designed and are added manually.


• Goals. A sudden change of the researcher’s interests can be caused by a change of the researcher’s goals. For example, a student is not currently working on her thesis, but searches for research papers for a course on data mining. Goals can be modeled by detecting drift [12] in the interests. Goals and interests are sometimes difficult to distinguish automatically.

We focus on modeling the researcher’s interests in the digital library. Her goals can be further extracted from the temporal change of her interests using the approaches described in [12], [22] and [13]. The domain of digital libraries is full of metadata. There are relations between papers (citation, common authors, common keywords), authors (co-authored papers, friendship in social networks), conferences (common authors), etc. Graph or ontology user model representations are suitable in this case, because they are able to reflect these relations. Of the two user model types considered, stereotype and overlay, we find an overlay model more suitable. It can take advantage of domain models containing relations and of the availability of the researcher’s interest in individual parts of the domain model (we know which papers are in the researcher’s library, which keywords she searches for, etc.).


Chapter 4

Researcher Model Design

We model the user’s interests in digital libraries (see Discussion in chapter Personalization and User Modeling 3.5). Taking into account the specifics of the domain of digital libraries (see Discussion in chapter Researchers in Digital Libraries 2.4), we set the following requirements.

• Accuracy. Every model is an estimation of what is being modeled. The researcher model reflects the researcher’s interests in the digital library. The estimations should be as accurate as possible. However, as we deal with models and estimations, we do not strictly require them to reflect the researcher’s interests exactly. It is not even possible, because there is no way to know the exact interests of the user. We propose a general researcher model, which is close to the researcher’s interests and can be utilized to personalize multiple user features in a digital library application.

• Flexibility. Digital library applications provide many user features and the researcher’s interactions with these user features can be logged. Domain models in digital libraries contain diverse document metadata, which are often enriched by external services or other digital libraries. Distinct user features also have different requirements on the researcher model. The researcher model takes advantage of all the sources of user and domain data and responds to the requirements of different user features in digital libraries.

• Extensibility. New user features or data sources can be added to a digital library application, and new user features place new requirements on the researcher model. The researcher model can be extended to respond to these changes.

We intend to make the design general so that it is realizable in any digital library system. We present the designed model from two perspectives. The external perspective is more conceptual and gives a picture of how the researcher model is utilized by personalized user features in digital libraries. The internal perspective, on the contrary, focuses on the creation process and extensibility of the researcher model.


4.1 External Perspective

We propose an overlay user model with a vector-based representation of the researcher’s interests. Every researcher ($j$) has her own vector with weights ($x_i$) of all the interests in the digital library. Representation of interests depends on the particular realization of the design: keywords, terms or concepts can be used. There are multiple researcher model types ($k$), since every user feature in digital library applications might require a different model. See Equation 4.1.

$$RM_{jk} = (x_1, x_2, x_3, \ldots, x_i, \ldots, x_n) \tag{4.1}$$

Multiple researcher model types address the flexibility requirement. Every user feature in a digital library application can have its own researcher model tailored to its needs.
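Seen from outside, the researcher model can be held as one sparse vector per researcher and model type. The sketch below is only an illustration of Equation 4.1; the dictionary layout, the researcher name and the model type names are assumptions.

```python
# researcher_models[(researcher j, model type k)] -> {interest i: weight x_i}
# Interests that are missing from a dictionary have an implicit weight of 0.
researcher_models = {
    ("alice", "terms"): {"user modeling": 0.7, "digital libraries": 0.3},
    ("alice", "stems"): {"user model": 0.7, "digit librari": 0.3},
}


def top_interests(researcher, model_type, n=1):
    """Return the n interests with the highest weight for one model type."""
    vec = researcher_models[(researcher, model_type)]
    return sorted(vec, key=vec.get, reverse=True)[:n]


best_term = top_interests("alice", "terms")
```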

4.2 Internal Perspective

We present the internal representation in two separate parts. Structural design describes the principles of how the researcher model is computed and stored, which are also usable outside of the digital libraries domain. Content design describes specifics for the domain of digital libraries. Then we discuss properties of the designed model.

4.2.1 Structure

The researcher model is internally represented as a graph of all the relevant relations and entities in an information system (e.g. users, documents, tags, extracted keywords). These relations have weights, which denote their strength. Relevant relations and entities are copied from the original data model in the information system (e.g. extracted keyword weight for a document, number of times the user has accessed a document). New entities and relations are created from the existing relations in the graph model in four ways.

• Combination. New relations are created by combining a matrix of existing relations. Weights of all the combined relations are simply either summed or multiplied. See Figure 4.1.

The relations that are placed one after another are multiplied (one relation weights the importance of the other, so they are put into intersection) and the results are summed (two relations are parallel, so they are put into union). Coefficients are utilized to weight the summands. The resulting weight of the A to B relation in Figure 4.1 is a linear function.

$$w(A \to B) = c_1 \cdot w(A \to X) \cdot w(X \to B) + c_2 \cdot w(A \to Y) \cdot w(Y \to B)$$

• Normalization. New entities are created by applying a normalization function to existing entities. New relations connecting the original entities with the new entities are created, with one-to-one connections between the original entities and the new entities.

Figure 4.1: Computation of Weights

• Extraction. New entities are created by applying an extraction function to existing entities. New relations connecting an original entity with one or more new entities are created, with one-to-many connections between the original entities and the new entities.

• Joining. A new relation is created by joining two similar entities. There is a one-to-one connection between existing entities.

The entities and relations of the model are created in a determined order, as they can depend on one another. Relations depend on the entities they connect. Combined relations depend on the relations they combine. Normalized and extracted entities depend on the entities they normalize and extract respectively. Joining relations depend on the entities they join.
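The operations above can be sketched in a few lines of Python. The function names, the example weights and the concrete normalization and extraction functions are assumptions; only the weight formula of the combination follows the text.

```python
def combine(paths, coefficients):
    """w(A->B) = sum over paths k of c_k * product of the weights on path k."""
    total = 0.0
    for path, c in zip(paths, coefficients):
        product = 1.0
        for w in path:
            product *= w  # relations placed one after another are multiplied
        total += c * product  # parallel paths are summed with coefficients
    return total


def normalize(entity):
    # One-to-one: each original entity maps to exactly one new entity.
    return entity.strip().lower()


def extract(entity):
    # One-to-many: one original entity yields one or more new entities.
    return entity.split()


# Path over X has weights 0.5 and 0.4, path over Y has weights 1.0 and 0.2.
w_ab = combine([[0.5, 0.4], [1.0, 0.2]], [1.0, 2.0])
```

With the assumed coefficients $c_1 = 1$ and $c_2 = 2$, the result is $1 \cdot 0.5 \cdot 0.4 + 2 \cdot 1.0 \cdot 0.2 = 0.6$.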
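One way to obtain such an order is a topological sort of the dependency graph. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or newer); the entity and relation names are hypothetical, and the text does not state how the order is actually determined.

```python
from graphlib import TopologicalSorter

# Each key depends on the entities and relations listed in its set.
dependencies = {
    "researcher-document": {"researcher", "document"},
    "document-term": {"document", "term"},
    "researcher-term": {"researcher-document", "document-term"},
}

# static_order() yields every node after all of its dependencies.
build_order = list(TopologicalSorter(dependencies).static_order())
```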

4.2.2 Content

There are two principal entities in the researcher model: Researcher and Interest. All the other entities in the digital library are set between them. Researcher is connected to content entities with a has relation, which indicates the researcher’s interest in them. Content entities are further connected to the Interest entity with a contains relation, which indicates the importance of the interest for describing the content entity. The relation between Researcher and Interest depends on all the other relations between them. The weights of this relation can be conceptually considered the researcher model represented as a vector of the researcher’s interests, that is, the external perspective. Figure 4.2 shows an example researcher model.

Each digital library system provides a different set of content entities. We list a few examples to make clear what content entities are.

• Document. Digital libraries store documents. Researchers have various relations with documents: they save them to their personal libraries, they are their authors or they read them. The has relation between Researcher and Document combines all these relations to estimate the overall researcher-document relation. Documents are tagged, categorized into folders and annotated, and they have authors, categorization keywords and other metadata. These are further entities related to documents, and they are normalized or extracted to relate them to interests. The contains relation combines all the relations between documents and interests and estimates the document-interest relation.

Figure 4.2: Researcher Model

• Tag. Researchers tag documents or use tags for navigation in the digital library. The has relation between Researcher and Tag combines these relations to estimate the overall researcher-tag relation. The contains relation normalizes tags to interests or combines multiple relations if a more complex transformation of tags to interests is needed. Researchers also categorize documents into folders, which can be treated in a similar way as tags.

• Annotation. Researchers annotate documents in digital libraries. The has relation between Researcher and Annotation indicates that a researcher has created an annotation. The contains relation extracts interests from annotations.

• Query. Researchers search in digital libraries. Search queries can be considered content entities as well. The has relation between Researcher and Query indicates that a researcher has used a query. The contains relation extracts interests from search queries.

The has relation between Researcher and Interest combines all the mentioned relations and strengthens the weights of the individual interests depending on the activity of every researcher in the digital library system.
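A minimal sketch of how the has and contains relations can be composed into the researcher-interest relation follows. It assumes a single content entity type (documents), coefficients equal to 1 and made-up weights; a realization would combine several content entity types with their own coefficients.

```python
from collections import defaultdict

# has: researcher -> content entity; contains: content entity -> interest
has = {"alice": {"doc1": 1.0, "doc2": 0.5}}
contains = {
    "doc1": {"user modeling": 0.8, "graphs": 0.2},
    "doc2": {"graphs": 0.6},
}


def researcher_interests(researcher):
    """Sum over content entities of w(has) * w(contains)."""
    vector = defaultdict(float)
    for entity, w_has in has[researcher].items():
        for interest, w_contains in contains.get(entity, {}).items():
            vector[interest] += w_has * w_contains
    return dict(vector)


alice = researcher_interests("alice")
```

For the interest "graphs" the weight is $1.0 \cdot 0.2 + 0.5 \cdot 0.6 = 0.5$, which illustrates how activity on several documents strengthens one interest.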

4.2.3 Properties

The weights of the edges in the graph are conceptually partial sums in the computation of the researcher model vector values. Availability of the partial values enables extensibility, since these values can be reused. For example, we represent interests as terms. Then we create word stems of all the terms, with a term-stem relation connecting them. Then we build another user model on top of the researcher-term relation, which connects researchers with the respective stems of their terms (researcher-stem). Now we have two user models at the same time. Each has its own terms and weights. In a similar way we add other user models that comply with the specific requirements of each user feature in the digital library. They combine different relations with different coefficients.
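The reuse of the researcher-term relation can be sketched as follows. The function `toy_stem` is a deliberately naive, hypothetical stand-in for a real stemming algorithm, and the weights are made up.

```python
from collections import defaultdict

researcher_term = {"user models": 0.4, "user modeling": 0.3, "graphs": 0.2}


def toy_stem(term):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    words = []
    for word in term.split():
        for suffix in ("ing", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        words.append(word)
    return " ".join(words)


# term-stem relation, then researcher-stem built on top of researcher-term
term_stem = {term: toy_stem(term) for term in researcher_term}
researcher_stem = defaultdict(float)
for term, weight in researcher_term.items():
    researcher_stem[term_stem[term]] += weight
```

Both `researcher_term` and `researcher_stem` remain available, so a user feature can pick whichever model suits it.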

Having a graph with all the entities and relations addresses the flexibility requirement. When a new type of data is added to the digital library system, we create new entity and relation types in the graph model and incorporate them into the nearest higher-level relation. They are automatically used by all the other higher-level relations. For example, a new keyword extraction service is added to the digital library. We already have the document-keyword and keyword-interest relations for the original service; we add another one for the new service. We incorporate the new relation into the document-interest relation, so that the new service is used in the researcher model.

If we need to keep using the original researcher model, we create a new paper-term relation instead of modifying the original one. We also create a new researcher-interest relation. Now we have two researcher models: the original one and the new one using our keyword extraction service. We can use both of them at the same time. We can compare them and use the approach for experimenting with various keyword extraction services.

4.3 Discussion

We have proposed a researcher model that comprises all types of user and domain data available in digital libraries. All the researcher’s interactions along with all relevant content metadata are extracted from the digital library data model to graph relations. Every relation has a weight denoting its strength. Higher-level relations are deduced from the extracted relations and their weight is computed as a linear function of the original relations. The final relation is a representation of the researcher’s interests in the digital library. The exact composition of relations and entities inside the graph and the coefficients used in the linear functions can be adjusted for the requirements of each digital library. The researcher model is a vector of terms from outside, but a graph inside, whereby the components of the model are reusable and the model is flexible and extensible.

We set three requirements on the model: accuracy, flexibility and extensibility. The model is extensible and flexible by design. The accuracy needs to be evaluated by realization of the researcher model in a particular digital library.


Chapter 5

Researcher Model Realization in Annota

We realize the designed researcher model in the Annota digital library. Annota takes advantage of the Mendeley API and ACM DL metadata, and extracts additional keywords using AlchemyAPI. It collects logs about the user’s activity in the Annota web application and on the ACM web pages. Annota provides the necessary data and access to the user feedback. See Discussion in chapter Researchers in Digital Libraries 2.4.

The realization of the researcher model comprises the specification of available data in Annota, the specification of the requirements on the model imposed by personalized user features in Annota, the design and finally the implementation of the researcher model in Annota.

5.1 Specification

We specify the available user and domain data, which will be inputs to the realized researcher model. We also analyze personalisable user features in Annota and set the requirements for the outputs of the designed and implemented researcher model.

5.1.1 Available User Data

Annota stores the following data about researchers.

• Papers in their personal libraries

• Tags, folders and annotations they have used

• Social relations (following relations, membership in groups)

• Activity logs in Annota (navigation with tags and folders, access to articles, used search queries)

• Activity logs in the ACM DL (access to articles, used search queries)


5.1.2 Available Domain Data

Annota stores the following data about stored research papers.

• Keywords extracted from papers using multiple keyword extraction services

• Tags assigned to papers

• Folders used to categorize papers

• Annotations used in papers

• Other paper metadata (author, publication, citations, etc.)

5.1.3 Personalisable User Features

There are many examples of user features that can benefit from personalization.

• Search of papers, users and groups relies on elasticsearch¹. The researcher model can be customized to provide similarities between users and papers. These similarities can be saved to elasticsearch and used in the relevance computation of the search results.

• Recommendation of similar papers utilizes keywords extracted from the paper the user is currently reading. The extraction method uses a graph of neighboring words and activation spreading to find the most relevant keywords. If the user highlights parts of the paper or adds annotations, this is also taken into account and relevance values are adjusted. Keywords are used to build an enhanced query and the results yield similar papers [19]. The relevance values of found keywords can be further modified using the weights of keywords from the researcher model.

• Navigation using a cloud of keywords [16]. The size of a keyword can depend on the keyword’s position in the researcher model of the current user.

We have implemented another feature, which directly visualizes the vector of terms and is useful for quick evaluation of the researcher model during its development: a cloud of the 50 terms most relevant to the researcher. The more relevant a term is, the larger its font size. Color is used to distinguish terms, because terms can contain spaces and the user would otherwise not know where one term ends and the next one begins (color has no other interpretation). The aim of the cloud is to provide a quick visualization of the best terms. The term cloud can be seen in the user profile in Annota. See Figure 5.1.

The search with elasticsearch requires similarities between users and papers. Other features can be personalized using a vector of terms. The researcher’s term cloud visualizes a vector of terms. We focus on building a researcher model which computes the vector of terms.

¹ http://www.elasticsearch.org/
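The mapping from relevance to font size can be sketched as below. The limit of 50 terms follows the text; the pixel range, the linear scaling and the function name `term_cloud` are assumptions, since the text does not specify how Annota computes the sizes.

```python
def term_cloud(weights, limit=50, min_px=10, max_px=32):
    """Map the `limit` most relevant terms to font sizes (linear scaling)."""
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:limit]
    if not top:
        return []
    high, low = top[0][1], top[-1][1]
    span = (high - low) or 1.0  # avoid division by zero when all weights match
    return [
        (term, round(min_px + (w - low) / span * (max_px - min_px)))
        for term, w in top
    ]


cloud = term_cloud({"user modeling": 0.9, "graphs": 0.5, "search": 0.1})
```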


Figure 5.1: Terms Cloud in the user’s profile in Annota

5.2 Design

We design a framework to create arbitrary graph models based on the flexibility and extensibility principles of the general researcher model designed in chapter 4 Researcher Model Design. Then we design a default researcher model, which uses the framework. Other researcher models can be designed in the future and reuse the parts of the default researcher model.

5.2.1 Graphene Framework

The designed framework contains base classes for entities and relations. Entities are nodes in the graph and relations are edges. There are multiple models and each one has its own set of entities and relations. All the entities and relations can be reused among the models. The framework is intended to simplify building of any graph model. See Figure 5.2.

Builders are tasks which create entities and relations. Entities and relations are created in a determined order, as there are dependencies among them. Builders are run periodically to update the user model. Builders also have their configuration, which can be used to save the time of the last run, so that the next time they are run, the update of models can be incremental.
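The incremental update can be sketched as follows. The class `Builder`, the `last_run` configuration key and the integer timestamps are illustrative assumptions and do not reproduce the actual Graphene builder interface.

```python
class Builder:
    """Toy builder: copies new log records into the graph incrementally."""

    def __init__(self):
        self.config = {"last_run": 0}  # timestamp of the previous run
        self.graph = []

    def run(self, records, now):
        # Process only the records created since the last run.
        fresh = [r for r in records if r["created_at"] > self.config["last_run"]]
        self.graph.extend(fresh)
        self.config["last_run"] = now
        return len(fresh)


records = [{"id": 1, "created_at": 5}, {"id": 2, "created_at": 15}]
builder = Builder()
first = builder.run(records[:1], now=10)   # only record 1 exists so far
second = builder.run(records, now=20)      # record 1 is skipped this time
```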

Entities have the following attributes.

• Type

• Reference to the original object in Annota for entities that have been copied from Annota domain entities.

• Content for entities that need to contain some data.


Figure 5.2: Graphene Framework. Graphene::Entity and Graphene::Relation are base classes. Graphene::Annota is an example model (with its namespace). Graphene::Annota::Researcher, Graphene::Annota::Term and Graphene::Annota::ResearcherHasTerm extend framework classes. Graphene::Annota::ResearcherHasTerm is a relation connecting Graphene::Annota::Researcher and Graphene::Annota::Term entities.

• Time when the entity was updated according to the original Annota entity.

• Time when the entity was created according to the original Annota entity.

• Custom properties

Relations have the following attributes.

• Type

• Reference to the original object in Annota for relations that have been copied from Annota domain relations.

• Weight indicates strength of the relationship.

• Time when the relation was updated according to the original Annota relation.

• Time when the relation was created according to the original Annota relation.

• Custom properties
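The two attribute lists can be pictured as two record types. The following is a minimal plain-Ruby sketch, not the Annota implementation (where both are database-backed models). The attribute names original_created_at, original_updated_at and properties, and the use of direct from/to object references instead of identifiers, are assumptions made for illustration.

```ruby
# Plain-Ruby stand-ins for the two base classes (hypothetical attribute names).
Entity = Struct.new(:type, :original_id, :content,
                    :original_created_at, :original_updated_at, :properties,
                    keyword_init: true)

Relation = Struct.new(:type, :original_id, :from, :to, :weight,
                      :original_created_at, :original_updated_at, :properties,
                      keyword_init: true)

researcher = Entity.new(type: "Researcher", original_id: 1)
term       = Entity.new(type: "Term", content: "user model")

# An edge of the graph: the researcher is related to the term with weight 0.8.
has_term = Relation.new(type: "ResearcherHasTerm",
                        from: researcher, to: term, weight: 0.8)
```

Attributes that are not given stay nil, which matches the description that only copied entities carry a reference to an original object.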

Figure 5.3: Graphene Framework - all classes

There are more types of entities and relations. They reflect common patterns occurring during the creation process of entities and relations. See Figure 5.3.

• CopyingEntity copies entities from the original Annota data model. Attribute original_id is set to the original entity. Attribute content contains custom data if it is needed. Timestamp attributes are set according to the original entity.

• CopyingRelation copies relations from the original Annota data model. Attribute original_id is set to the original relation. Attributes from_id and to_id are set to the entities the relation connects. Weight is set to the original relation weight. Timestamp attributes are set according to the original relation.

• CombiningRelation combines multiple relations based on the weights computation proposed in the researcher model design. The combined relations are specified with a matrix. Weights of the relations in a row are multiplied, every row product is further multiplied by a coefficient and finally the rows are summed. See part 4.2 Internal Perspective.

• NormalizingRelation creates new entities by normalizing existing ones. TermNormalizer puts entity content to lower case, substitutes joining characters like - or _ with spaces and squeezes spaces (more than one successive space is converted to exactly one space). StemNormalizer splits entity content into words, stems each one of them and finally joins them. A custom normalizer can be used as well.

• ExtractingRelation creates multiple new entities from an existing entity. AlchemyExtractor uses AlchemyAPI to extract keywords from entity content. A custom extractor can be used as well.

• JoiningRelation joins two existing entities with equal content attribute.
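The weight computation of CombiningRelation described in the list above can be illustrated with a minimal sketch. The helper combined_weight, the relation names and all numbers are hypothetical; the sketch only shows the arithmetic for one pair of entities (product within a row, multiplication by the row coefficient, sum over rows) and leaves out how the weights are gathered from the graph.

```ruby
# Hypothetical helper, not the Graphene API.
# rows:    list of [coefficient, [relation names]], one entry per matrix row
# weights: relation name => weight of that relation for one pair of entities
def combined_weight(rows, weights)
  rows.sum do |coefficient, relations|
    # Product of the weights in the row; a missing relation contributes 0.
    coefficient * relations.map { |name| weights.fetch(name, 0.0) }.reduce(1.0, :*)
  end
end

rows = [
  [0.5, [:researcher_to_paper, :paper_to_term]],
  [2.0, [:researcher_to_tag, :tag_to_term]]
]
weights = {
  researcher_to_paper: 0.5, paper_to_term: 0.5,
  researcher_to_tag: 0.25,  tag_to_term: 1.0
}

# 0.5 * (0.5 * 0.5) + 2.0 * (0.25 * 1.0) = 0.625
result = combined_weight(rows, weights)
```

Treating a missing relation as weight 0 is an assumption of this sketch; it makes a row vanish whenever one of its relations does not exist for the given pair.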

5.2.2 Default Researcher Model

We have built a default researcher model on top of the Graphene framework. Figure 5.4 shows its overall architecture. Researcher and Term are the main entities. The ResearcherToTerm relation connects them and the resulting weights are computed by combining all the relations between Researcher and Term. Paper, Tag, Folder, Annotation and Query are content entities copied from the original Annota data model. There is a connection between Researcher and every content entity denoting the researcher’s relation with the content entity. For example, the ResearcherToPaper relation indicates the strength of the researcher’s relation to a given paper. These relations are combined, as they are more complex and consist of other relations. There are connections between content entities and terms. Some of them are combined from multiple relations, others are simply normalized. For example, the TagToTerm relation normalizes tags and connects them to the resulting terms.
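The normalization step used by relations such as TagToTerm can be sketched in a few lines of Ruby. The function name normalize_term is hypothetical, and treating exactly "-" and "_" as joining characters is an assumption; the three steps (lower case, joining characters to spaces, squeezing spaces) follow the TermNormalizer description.

```ruby
# Hypothetical stand-in for TermNormalizer; the joining characters are assumed.
def normalize_term(content)
  content.downcase        # lower case
         .gsub(/[-_]/, " ") # joining characters become spaces
         .squeeze(" ")     # runs of spaces collapse to one space
end

normalize_term("User_Modeling")      # => "user modeling"
normalize_term("Digital--Libraries") # => "digital libraries"
```

Two tags that normalize to the same string end up connected to the same term, which is what lets differently written tags contribute to one interest.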

Figure 5.4: Default Researcher Model Overview

Figure 5.5 shows the ResearcherToPaper relation. It combines all the relations that denote a relation between researchers and papers. ResearcherHasPaper indicates that the researcher has stored the paper in her personal library. ResearcherOpenedPaper, ResearcherClickedInPaper, ResearcherScrolledInPaper and ResearcherCopiedTextInPaper are all relations collected from activity logs on paper web pages in the ACM DL.

ResearcherAuthorsPaper connects researchers with the papers they have published. Researchers first need to be mapped to authors by their names. An exact mapping could not be determined from the available metadata. See Figure 5.6.

Figure 5.5: ResearcherToPaper Relation

Figure 5.6: ResearcherAuthorsPaper Relation

Figure 5.7 shows the PaperToTerm relation. Papers contain keywords, are categorized using tags and folders and have authors. All these entities are normalized to terms. There are three sources of keywords in Annota: AlchemyAPI, Mendeley import and ACM metadata. There are thus three relations reflecting these sources.

Figure 5.8 shows the AnnotationToTerm relation. Annotations contain text, which is stored in the entity content. AnnotationToAlchemyKeyword extracts keywords from the content and connects Annotation with the extracted AlchemyKeyword entities. AlchemyKeyword entities are further normalized to terms.

Figure 5.9 shows the ResearcherToQuery relation. Annota logs two types of user searches: search queries in Annota itself and queries used in the ACM DL. Both types of queries are normalized to a single Query entity. ResearcherToQuery connects researchers with the normalized queries.

Figure 5.10 shows the QueryToTerm relation. Keywords extracted from queries are normalized to terms in a similar manner as in the AnnotationToTerm relation.

Figure 5.7: PaperToTerm Relation

Figure 5.8: AnnotationToTerm Relation

Figure 5.9: ResearcherToQuery Relation

Figure 5.10: QueryToTerm Relation

Figure 5.11 shows four user models. ResearcherToTerm models the user’s interests by estimating weights of terms directly extracted from content. However, there are many terms that can be considered synonyms from the perspective of the requirements we have set on user modeling in Annota, for example ’user model’, ’user modeling’ and ’user modelling’. We stem all terms to Stem entities by stemming every word of the term content, so both ’user modelling’ and ’user modeling’ change to ’user model’. ResearcherHasStem connects researchers with the Stem entities instead of the original terms. There are duplicates, because many terms normalize to the same stems. ResearcherHasSynonym removes these duplicates by keeping only one relation per stem and summing all the weights. The stems can be visually incorrect terms (e.g. ’augment real’ instead of ’augmented reality’) and therefore unsuitable to be presented to the users. ResearcherHasBestSynonym selects for every stem the term that is the most important for the user and assigns it the summed weight of the stem.
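The chain from terms to stems, synonyms and best synonyms can be sketched as follows. This is an illustration under stated assumptions, not the Annota code: stem_word is a toy stemmer standing in for a real stemming algorithm, the weights are made up, and "most important for the user" is read as "highest weight among the original terms".

```ruby
# Toy stemmer (an assumption): strips a trailing "ing" and collapses a
# doubled final letter, which is enough for the example terms below.
def stem_word(word)
  word.sub(/ing\z/, "").sub(/(.)\1\z/, '\1')
end

# Stems every word of the term content and joins the words again.
def stem_term(term)
  term.split.map { |word| stem_word(word) }.join(" ")
end

# ResearcherToTerm weights of one researcher (illustrative values).
term_weights = {
  "user model" => 0.25, "user modeling" => 0.5, "user modelling" => 0.125
}

# ResearcherHasSynonym: one relation per stem, weights summed.
stem_weights = Hash.new(0.0)
term_weights.each { |term, weight| stem_weights[stem_term(term)] += weight }

# ResearcherHasBestSynonym: for every stem, the highest-weighted original
# term carries the summed weight of the stem.
best_synonyms = {}
term_weights.group_by { |term, _| stem_term(term) }.each do |stem, pairs|
  best_term, _weight = pairs.max_by { |_, weight| weight }
  best_synonyms[best_term] = stem_weights[stem]
end
```

In this example all three terms share the stem "user model", so the researcher ends up with a single presentable term, "user modeling", carrying the weight 0.875.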

Figure 5.11: The Four User Models in Annota

5.3 Implementation

The Graphene framework and the default researcher model are implemented in the Annota application. Annota is based on the Ruby on Rails2 web development framework. Graphene takes full advantage of ActiveRecord, the Ruby on Rails relational database library. Every database table is represented by a Ruby class called a model. Objects of models reflect rows in tables. Tables containing a type attribute use single-table inheritance: other classes extend the original model class while being stored in the same table and having the same attributes. Every class can be put into a namespace to prevent conflicts among different parts of the code.

2 http://rubyonrails.org/

UserModel::Entity and UserModel::Relation are ActiveRecord models. All Graphene entities and relations extend the UserModel::Entity and UserModel::Relation models respectively. The extending classes use single-table inheritance, therefore the whole graph model is kept in two database tables: user_model_entities and user_model_relations. Figure 5.2 shows their attributes.

Builders are classes providing a method that is called periodically. This method delegates the building process to the entities and relations in a determined order, as relations depend on entities and other relations. Entity and relation classes provide the method called by builders. They either implement it themselves or use framework base classes depending on the desired creation process: CopyingEntity, CopyingRelation, CombiningRelation, NormalizingRelation, ExtractingRelation or JoiningRelation. See Figure 5.3.

All the default researcher model classes are developed using test-driven development. There are tests to ensure that they work as intended. This is important to enable future refactoring or improvement of the default researcher model. All the Graphene classes and the important parts of the default researcher model are documented to make it easy for developers to build custom researcher models. See Appendix C Developer Documentation and Appendix B User Documentation.

5.4 Discussion

We have realized the researcher model designed in the previous chapter in the Annota digital library. The realization consisted of requirements analysis, design and implementation. We specified available inputs and requirements for outputs of the researcher model. We described how personalisable user features in Annota could use the researcher model. We designed a graph-modeling framework Graphene and a default researcher model built on top of this framework. We implemented both parts in Annota together with a cloud of computed interests, which visualizes the researcher model (see Figure 5.1).

The implemented researcher model collects data from multiple sources and provides a vector of terms, which are visualized by the cloud of computed interests. This evaluates the flexibility requirement of the researcher model. There are actually four vectors of researcher’s interests provided by the implemented researcher model. They depend on each other and share all the steps in the building process. The researcher model is not only extensible by design, but its extensibility has been evaluated by its realization in Annota. We evaluate the accuracy requirement in the following chapter.

Chapter 6

Evaluation

In chapter 4 Researcher Model Design we proposed objectives for the designed researcher model: accuracy, flexibility and extensibility. Flexibility has been evaluated by the realization of the designed researcher model in Annota, where we implemented a researcher model inferring interests from diverse user and domain data. Extensibility has been evaluated by implementing multiple user models in Annota, which reuse the same internal parts.

We evaluate the accuracy of the researcher model by investigating how the researcher perceives her own researcher model terms. We verify whether terms in the researcher model are related to the interests of the researcher. We compare the level of importance of particular terms assigned by the researchers with the importance of the terms resulting from the researcher model. Each digital library application user feature can have specific requirements on user modeling, but our objective is to prepare a general baseline user model. We assume that researcher model terms positively identified by researchers themselves as their interests are a good foundation for further personalization of the entire digital library.

We have performed two quantitative experiments, where users of Annota evaluated terms from their researcher models. After the experiments, we asked the users for their opinion on the terms they had been evaluating to bring qualitative insights into the evaluation.

6.1 Experiment 1

We performed the first experiment to evaluate the researcher model terms. Before the experiment, we performed three pilot experiments with 1-4 participants to make sure the final experiment is valid (after the first iteration we completely changed the proposed way of term evaluation).

6.1.1 Hypotheses

We have set two hypotheses.

• Researchers identify terms in their researcher models as their interests.

• The order of the terms in the researcher model correlates with the importance of the terms to the researcher.

6.1.2 Data

The researcher model implemented in Annota provides the terms relevant to the researcher’s interests and estimations of their importance. We have used the ResearcherHasBestSynonym relation, since it provides the terms most relevant to the researcher in a visualizable form.

We order all the participant’s terms by weight. We ask the users to evaluate the first 50, 100, 150, ... terms. When they have finished the first 50 terms, we add another 50 terms and so on. The terms are split into groups of 20 terms and their order is randomized within these groups. The users are presented the most relevant groups first. The intention is to make the more relevant terms in the researcher model more likely to get evaluated by the user, because they are more likely to be used by the personalized user features in the application (for example the cloud of computed interests uses the first 50 terms).
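The ordering of terms for the experiment can be sketched as follows. The helper presentation_order and the generated term names and weights are hypothetical; the sketch assumes only what is stated above: terms are sorted by weight, cut into groups of 20 and shuffled inside each group, with the most relevant group first.

```ruby
# Hypothetical helper: returns the terms in the order they are presented.
def presentation_order(term_weights, group_size: 20, rng: Random.new)
  sorted = term_weights.sort_by { |_, weight| -weight }.map(&:first)
  # Shuffle inside each group only, so group order still follows the weights.
  sorted.each_slice(group_size).flat_map { |group| group.shuffle(random: rng) }
end

# 50 illustrative terms with decreasing weights (not data from the experiment).
term_weights = (1..50).map { |i| ["term #{i}", 1.0 / i] }.to_h
order = presentation_order(term_weights, rng: Random.new(42))
```

Shuffling within groups hides the exact ranking from the participant while still guaranteeing that any of the 20 highest-weighted terms is seen before any lower-ranked term.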

6.1.3 Participants

We performed the experiment with 15 users of Annota. The experiment was carried out in an individual session with every participant, with a supervisor instructing the participants and asking them questions. Every participant was encouraged to comment on the terms she or he was presented with and asked to summarize her or his experience after the experiment.

The extent to which the users had been using Annota varied. Some of them used the Annota extension in the ACM DL, others used only the Annota web application. The length of time the users had been using Annota also varied.

6.1.4 Methodology

The experiment was a game, where the participants were presented with terms and asked to evaluate them (see Figure 6.1). They had three options to select.

• Not at all. The participant does not identify the term as her or his research interest.

• Possibly. The participant thinks that the term might be her or his research interest.

• For sure. The participant is sure that the term is her or his research interest.

The participants were motivated to play the game with a score and a position among other players. The more terms they evaluated, the more points they were awarded. To avoid random or thoughtless clicking in the game, the score computation considers the time spent thinking. The longer the pause between the player’s actions, the higher the score gained in the round. However, this is limited to 7 seconds. If the player thinks longer, no points are added. This limit prevents crafty players from getting points for leaving the page open and doing nothing. The players also got points for coming back and revising their former decisions.
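One possible reading of this scoring rule is sketched below. The function round_score and the linear scale of one point per second are assumptions; the text only states that a longer pause gives a higher score and that the effect is limited to 7 seconds. The sketch reads "no points are added" as "no further points beyond the limit".

```ruby
# Hypothetical scoring rule: one point per second of thinking (assumed scale),
# capped at the limit stated in the text.
def round_score(pause_seconds, limit: 7)
  [pause_seconds, limit].min
end

round_score(3)  # => 3
round_score(60) # => 7, leaving the page open earns nothing extra
```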

Figure 6.1: Researcher Model Game

6.1.5 Results

Figure 6.2 shows all the evaluations of terms performed by the participants. Terms are aggregated in groups by their position in the researcher model. Evaluations by all the users are summed in groups of 20 terms. We included only the best 100 terms for every user, as the other groups did not have enough evaluations to be comparable with the first groups. Figure 6.3 shows all the evaluations without terms originating from author names, as the participants had problems identifying authors in their library and therefore these evaluations are biased. However, removing author names from the evaluated terms resulted in only a small change of the results. Figure 6.4 shows the average evaluation for every group of terms on a scale from 0 to 1 (0 means Not at all, 1 means For sure). The highest average was in the first group: 0.62. The correlation coefficient of group numbers (increasing) and the averages (decreasing) was -0.93 with authors and -0.97 without authors.
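The two metrics can be made concrete with a short sketch. It assumes that the reported coefficient is the Pearson correlation coefficient and that Possibly is valued at 0.5; neither is stated explicitly in the text. The answers below are made up for illustration and are not data from the experiment.

```ruby
# Numeric value of each answer; 0.5 for "Possibly" is an assumption.
def answer_value(answer)
  { not_at_all: 0.0, possibly: 0.5, for_sure: 1.0 }.fetch(answer)
end

def mean(values)
  values.sum / values.size.to_f
end

# Pearson correlation coefficient of two equally long series.
def pearson(xs, ys)
  mx = mean(xs)
  my = mean(ys)
  covariance = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  spread_x = Math.sqrt(xs.sum { |x| (x - mx)**2 })
  spread_y = Math.sqrt(ys.sum { |y| (y - my)**2 })
  covariance / (spread_x * spread_y)
end

# Made-up answers for three consecutive groups of terms.
groups = [
  [:for_sure, :for_sure, :possibly, :not_at_all],
  [:for_sure, :possibly, :not_at_all, :not_at_all],
  [:possibly, :not_at_all, :not_at_all, :not_at_all]
]

averages = groups.map { |answers| mean(answers.map { |a| answer_value(a) }) }
correlation = pearson((1..groups.size).to_a, averages)
```

Here the averages fall by the same amount from group to group (0.625, 0.375, 0.125), so the coefficient is very close to -1; a strongly negative value means that later groups are rated as less related to the researcher’s interests.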

After the experiment, all the participants stated that the terms they had been presented with generally reflected their real research interests. According to most participants, there were more highly related terms at the beginning of the game, then the number of related terms started decreasing.

6.2 Experiment 2

We have performed another experiment with the same hypotheses. We enhanced the user data used to create the researcher model, increased the number of participants and asked the participants for additional information when evaluating their researcher models.

Figure 6.2: Experiment 1

Figure 6.3: Experiment 1 without authors

6.2.1 Data

In Experiment 1 some participants stated that they use Mendeley and could import all their data to Annota. Before Experiment 2, we asked everyone to link their Mendeley account with Annota and we imported all the available data. We updated the researcher model to include all the imported data and used it in the experiment.

Figure 6.4: Experiment 1 averages

We asked the participants to focus on the first 100 terms and make sure they are evaluated correctly. Then we offered them an option to evaluate another 100 terms, which can be useful to improve the researcher model after the experiments.

6.2.2 Participants

We have asked all the users of Annota to take part in the experiment. This time the participants evaluated the terms individually without any supervision. They were given instructions and asked to take part in the experiment on their own any time they liked. By the time the experiment was closed, 19 users of Annota had participated.

6.2.3 Methodology

We made one change in the game. We asked the participants to give a reason for choosing the Not at all option for term evaluation (see Figure 6.5). The reasons are based on the common comments of the participants in Experiment 1.

• Good keyword, but out of my interests

• Good keyword, but I have no idea what it is

• Keyword is too general

• Keyword isn’t correctly extracted

• Good name, but I don’t know this author

• Other (the participants write a reason themselves)

Figure 6.5: Researcher Model Game

6.2.4 Results

Figure 6.6 shows all the evaluations of the first 200 terms in the researcher models. Figure 6.7 shows the same evaluations, but we excluded all the Not at all evaluations except Good keyword, but out of my interests. We can divide all the other reasons for selecting Not at all into 3 groups.

• Good keyword, but I have no idea what it is and Good name, but I don’t know this author indicate that the participant could not decide and the evaluation can be biased.

• Keyword is too general and Keyword isn’t correctly extracted indicate that a good keyword extraction service could improve the results. The realized researcher model depends on keyword extraction services, but keyword extraction services are not part of it.

• The number of Other reasons is insignificant and does not influence the results.

After removing the biased and keyword extraction related evaluations, the number of Not at all evaluations decreased significantly. Figure 6.8 shows average evaluations for every compared group. The highest average appeared in the first group: 0.83 on a scale from 0 to 1 (0 means Not at all, 1 means For sure). The correlation coefficient of group numbers (increasing) and the averages (decreasing) was -0.86 with all the evaluations included and -0.80 without biased and wrong-keyword-extraction evaluations for 200 terms. The correlation is closer to 0 than the correlation in Experiment 1, because in Experiment 1 we took into account only the first 100 terms. When we compute the correlation coefficient for the first 100 terms in Experiment 2, we get -0.87 with all the evaluations included and -0.93 without biased and wrong-keyword-extraction evaluations.
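The filtering step can be sketched as follows. The record layout and the symbols used for answers and reasons are hypothetical; the rule itself follows the text: a Not at all evaluation is kept only when its reason is Good keyword, but out of my interests.

```ruby
# Hypothetical records: an answer plus the reason given for "Not at all".
evaluations = [
  { answer: :for_sure },
  { answer: :possibly },
  { answer: :not_at_all, reason: :out_of_my_interests },
  { answer: :not_at_all, reason: :no_idea_what_it_is },
  { answer: :not_at_all, reason: :too_general },
  { answer: :not_at_all, reason: :not_correctly_extracted }
]

# Drop "Not at all" answers whose reason points to an undecided participant
# or to a keyword extraction problem; keep everything else.
kept = evaluations.reject do |evaluation|
  evaluation[:answer] == :not_at_all && evaluation[:reason] != :out_of_my_interests
end
```

In this made-up sample three of the six evaluations remain, and the averages and correlation are then computed on the kept evaluations only.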

6.3 Discussion

We have evaluated the accuracy of the researcher model by investigating how researchers themselves perceive their researcher model terms. The experiments confirmed both hypotheses we had set.

Figure 6.6: Experiment 2

Figure 6.7: Experiment 2 only Good keyword, but out of my interests

• Researchers identify terms in their researcher models as their interests.

• The order of the terms in the researcher model correlates with the importance of the terms to the researcher.

All the participants after Experiment 1 stated that the terms they had been presented with were related to what they were interested in. The best average evaluation of a group of terms was 0.82 (in Experiment 2, excluding biased and incorrectly extracted terms), which means that the first 20 terms had mostly been evaluated as For sure.

Figure 6.8: Experiment 2 averages

All of the participants after Experiment 1 stated that the number of terms related to their interests was decreasing as they were advancing in the researcher model game. The correlation between the position of the terms in the researcher model (with a granularity of 20) and their average evaluation ranges from -0.83 to -0.97. This leads us to conclude that there is a clear correlation between the position of the terms in the researcher model and their relatedness to the researcher’s interests.

In spite of having numerical results, the experiments were more qualitative than quantitative. The objective was to investigate if a realized researcher model can reflect the researcher’s interests. According to the statements of participants and the numerical results of the experiments, the researcher model realized in Annota provides a good baseline estimation of the researcher’s interests. The evaluation of terms in the experiments was individual and many times dependent on the participant’s subjective understanding of the instructions. However, a completely objective experiment cannot be performed when the proposed way of evaluation, investigating how researchers themselves perceive their own researcher model terms, is subjective itself. We also tried a few machine learning approaches to use the data acquired during the experiments to improve the researcher model realization, but we concluded that the evaluations were too subjective to obtain credible results. The metrics we have used can also be discussed and improved to become more reliable for comparing the accuracy of user models, but they have been sufficient to confirm both hypotheses.

Chapter 7

Conclusions

We analyzed the state of the art of user modeling and the application domain of digital libraries. We set the following objectives.

• Design a flexible researcher modeling framework, which can be used in any digital library, takes advantage of diverse researcher and domain data and provides estimates of user characteristics to diverse personalized user features in digital library applications.

• Design an extensible researcher modeling framework. All components inside the researcher model can be reused and the framework allows coexistence of multiple researcher models at the same time. Besides enhancing the extensibility of the whole digital library application, an extensible framework is also suitable for experiments comparing various user models or information retrieval techniques.

• Propose general researcher modeling principles covering the specifics of the digital libraries domain.

We designed a researcher model structure that implies flexibility and extensibility. All the researcher’s interactions along with all relevant content metadata are extracted from the digital library data model to graph relations. Every relation has a weight denoting its strength. Higher-level relations are deduced from the extracted relations and their weight is computed as a linear function of the original relations. The final relation is a representation of the researcher’s interests in the digital library. The exact composition of relations and entities inside the graph and the coefficients used in the linear functions can be adjusted for the requirements of each digital library. The researcher model is a vector of terms from outside, but a graph inside: entities and relations are flexible in reflecting the original user and domain data and the researcher model is extensible, as the entities and relations can be reused.

We designed a general researcher model content that has to be customized to every digital library. We proposed two principal entities: Researcher and Interest. Researcher is connected to content entities with a relation, which indicates the researcher’s interest in them. Content entities are further connected to the Interest entity with a relation, which indicates the importance of the interest for describing the content entity. The relation between Researcher and Interest is the researcher model. We provided various content entities as examples: documents, tags, folders, annotations and queries.

We realized the designed researcher model in Annota: we specified the needs of the particular digital library and implemented the researcher model. The implemented researcher model collects data from multiple sources and provides a vector of terms, which are visualized by the cloud of computed interests. This evaluates the flexibility requirement of the researcher model. There are actually four vectors of researcher’s interests provided by the implemented researcher model. They depend on each other and share all the steps in the building process. This evaluates the extensibility requirement of the researcher model.

We evaluated the accuracy of the researcher model by investigating how researchers themselves perceive their researcher model terms. We performed two experiments. Both experiments confirmed that researchers identify terms in their researcher models as their interests and the order of the terms in the researcher model correlates with the importance of the terms to the researcher.

Besides reaching the objectives of the work by designing a general user model for digital libraries, we have created a base for more research in the domain of digital libraries. The researcher model realized in Annota is extensible and can be used to implement and compare various user modeling and information retrieval techniques. The result of our effort is useful to evaluate many research ideas: to name a few, degradation of researcher’s interests with time, real-time modeling of researcher’s interests, design of recommender systems for digital libraries or modeling of other researcher’s characteristics. We hope that our effort becomes a step on the way towards more innovative research ideas to improve sharing of knowledge in digital libraries.

References

[1] Michal Barla. Towards Social-based User Modeling and Personalization. PhDthesis, Faculty of Informatics and Information Technologies Slovak University ofTechnology in Bratislava, 2010.

[2] Michal Barla and Mária Bieliková. Estimation of User Characteristics using Rule-based Analysis of User Logs. In Proc. of Workshop held at International Conferenceon User Modeling UM 2007, CORFU, pages 5–14, 2007.

[3] Mária Bieliková. Adaptívna prezentácia hypermédií na webe. In DATAKON 2003:Proceedings of the Annual Database Conference, pages 1–19, Brno, 2003.

[4] Paul De Bra and Licia Calvi. AHA: a Generic Adaptive Hypermedia System. InProceedings of the 2nd Workshop on Adaptive Hypertext and Hypermedia HYPER-TEXT’98, 1998.

[5] Peter Brusilovsky. Adaptive Hypermedia for Education and Training. In AdaptiveTechnologies for Training and Education, pages 46–68. Cambridge University Press,Cambridge, UK, 2012.

[6] Peter Brusilovsky and Eva Millán. User Models for Adaptive Hypermedia andAdaptive Educational Systems. In LNCS 4321: The Adaptive Web, pages 3–53.Springer-Verlag Berlin Heidelberg, 2007.

[7] Jamie Callan, Alan Smeaton, Micheline Beaulieu, Pia Borlund, Information Science,Peter Brusilovsky, Matthew Chalmers, Clifford Lynch, John Riedl, Barry Smyth,Umberto Straccia, and Elaine Toms. Personalisation and Recommender Systems inDigital Libraries Joint NSF-EU DELOS Working Group Report. Technical ReportMay, 2003.

[8] E. Frias-Martinez, G. Magoulas, S. Chen, and R. Macredie. Automated user modelingfor personalized digital libraries. International Journal of Information Management,26(3):234–248, June 2006.

[9] Enrique Frias-martinez, Sherry Y Chen, and Xiaohui Liu. Automatic Cognitive StyleIdentification of Digital Library Users for Personalization. Journal of the AmericanSociety for Information Science and Technology, 58(2):237–251, 2007.

[10] Zan Huang, Wingyan Chung, Thian-Huat Ong, and Hsinchun Chen. A graph-basedrecommender system for digital library. Proceedings of the second ACM/IEEE-CSjoint conference on Digital libraries - JCDL ’02, page 65, 2002.

49

[11] Judy Kay, Bob Kummerfeld, and Piers Lauder. Personis: A Server for User Models.In Adaptive Hypermedia and Adaptive Web-Based Systems, pages 203–212. SpringerBerlin Heidelberg, 2006.

[12] Ivan Koychev and Ingo Schwab. Adaptation to Drifting User’s Interests. In Proceed-ings of ECML2000 Workshop: Machine Learning in New Information Age, pages39–46, 2000.

[13] W. Lam, S. Mukhopadhyay, J. Mostafa, and M. Palakal. Detection of Shifts in UserInterests Filtering for Personalized. In Proceedings of the 19th annual internationalACM SIGIR conference on Research and development in information retrieval, pages317–325. ACM, 1996.

[14] Bernardo Magnini and Carlo Strapparava. User Modelling for News Web Siteswith Word Sense Based Techniques. User Modeling and User-Adapted Interaction,14(2/3):239–257, June 2004.

[15] George D. Magoulas, Yparisia Papanikolaou, and Maria Grigoriadou. Adaptive web-based learning: accommodating individual differences through system’s adaptation.British Journal of Educational Technology, 34(4):511–527, September 2003.

[16] Samuel Molnár and Mária Bieliková. Trending Words in Navigation History forTerm Cloud-based Navigation ´. In Proceedings of SMAP 2013: 8th InternationalWorkshop on Semantic and Social Media Adaptation and Personalization, pages53–58, 2013.

[17] Richard Van Noorden. Open access: The true cost of science publishing, 2013.

[18] Seventh Framework Programme, I C T Programme, Cultural Heritage, Technol-ogy Enhanced Learning, and Project Number. DL . org : Coordination Action onDigital Library Interoperability , Best Practices. 2011.

[19] Jakub Ševcech and Mária Bieliková. Related Documents Search Using User CreatedAnnotations. In Proceedings of FedCSIS 2013 - Federated Conference on ComputerScience and Information Systems, 2013.

[20] Jakub Ševcech, Mária Bieliková, Roman Burger, and Michal Barla. Zaznamenávanieaktivity výskumníka v digitálnej knižnici vedeckých zdrojov obohatené o poznámky.In 7th Workshop on Intelligent and Knowledge Oriented Technologies, pages 197–200,2012.

[21] Michal Šimún, Anton Andrejko, and Mária Bieliková. Maintenance of Learner’sCharacteristics by Spreading a Change. In Learning to Live in the Knowledge Society,pages 223–226. Springer US, 2008.

[22] Dwi H. Widyantoro, Thomas R. Ioerger, and John Yen. An adaptive algorithmfor learning changes in user interests. In Proceedings of the eighth internationalconference on Information and knowledge management - CIKM ’99, pages 405–412,New York, New York, USA, 1999. ACM.

50

[23] Yun Zhang and Boqin Feng. Tag-based user modeling using formal concept analysis.In 2008 8th IEEE International Conference on Computer and Information Technology,pages 485–490. IEEE, July 2008.

51

Resumé v slovenskom jazyku (Resumé in Slovak language)

We provide a summary of the work in the Slovak language. All theses written in English are required to contain a resumé in Slovak. We covered the most important parts of the thesis.

Appendix A

Pilot User Study

We have performed a pilot user study with 7 participants. The participants were senior master’s students, doctoral students and postdoctoral staff at FIIT STU in Bratislava. They work on their master’s or doctoral theses or supervise these works and thus can be called researchers and users of digital libraries. All participants do their research in the field of web engineering or science.

A.1 Objectives

There are two objectives of the pilot user study.

• We would like to get a deeper insight into the problem domain of research and digital libraries. Understanding the problem domain gives us valuable ideas of how to apply and thus evaluate the results of our work.

• We would like to understand the researcher, since she or he is the object that we are going to model. We are interested in the way researchers acquire inspiration, search the Web, look for papers, what papers they find interesting and what applications they use and how they use them.

A.2 Methodology

The study consisted of a single interview with every participant. The interviews took place in a room with nobody present besides the interviewer and the participant, so that the participants would not feel anxious when answering questions. At the beginning of the interview, every participant was told that the results would be anonymized. However, these precautions turned out not to be necessary in the end, as the questions asked were not too personal and all the participants felt comfortable answering them.

The interviewer wrote notes to record the answers of participants.


A.3 Questions

We prepared a set of questions to cover our study objectives. The question set evolved slightly during the interviews, so the first participants did not get some questions that the participants interviewed later did. Notably, the interview with the first participant showed that some questions were difficult to understand as they were intended. The interviews were also quite informal, and many answers to open questions gradually changed the topic of conversation. This was welcome, however, as the change of topic gave us valuable opinions and ideas from the participant. It also gave us feedback on our question set, which could be improved for subsequent participants.

We have divided the question set into three sections: research, papers and applications.

A.3.1 Research

The aim of these general questions is to start the conversation and make the participant feel calm and at ease. They also investigate the researcher’s motivation, area of interest and the way she or he looks for new ideas.

• How would you define research generally?

• How would you define research in informatics and information technologies? Is there a difference?

• Why are you interested in research? What is your motivation?

• What is the area of your research interest?

• Where do you look for ideas and inspiration for your research?

A.3.2 Papers

These questions investigate the researcher’s way of working with research papers, digital libraries and search engines, and her or his opinion on the quality of a paper.

• What do you do when you get an idea? Do you search for research papers?

• Do you search for research papers when you have no specific idea in mind? When?

• How do you look for research papers? What are the steps? Which digital libraries do you use? Which search engines do you use? Describe the last time you searched for a research paper.

• How would you define a good paper? Do you reference papers that are not good?

• Do you read anything else to keep pace with the newest trends? Survey papers? Blogs? Articles?

A.3.3 Applications

These questions investigate the researcher’s way of working with document organization tools and academic social networks.

• Do you use Mendeley, ResearchGate, Annota? How?

• Do you add notes to the content? What are the features you like in Annota? What other features would you like Annota to have?

• Do you have an idea for an application that could help you in doing research?

A.4 Results

All the participants provided similar definitions of research. Their motivation for doing research is that they do something more interesting and creative than standard problem solving in IT. One participant stated that it is exciting to find something that nobody has found before. Another participant revealed that research satisfies his curiosity.

Participants confirmed getting ideas from blogs, web articles, existing applications, publications, conference talks and conversations with colleagues. One participant gets inspired by ’real life’, solving problems he has himself. Another participant described incremental improvement of state-of-the-art methods, which can sometimes turn into a completely novel approach. Another participant explained that, because he has been reading papers and attending conferences for some time, he knows many problems that need to be solved and always chooses one that he is interested in.

The participants search the Web and digital libraries to find out whether anyone has already worked on ideas similar to theirs, or when they need to review related work for their theses or papers. They have to describe their ideas in terms that others are likely to have used. One participant stated that he always finds a solution to the problem he is interested in, but after enough work he can always find a better one. When the idea concerns an application, they search the Web using Google. When the idea concerns research, they search in digital libraries, mostly ACM, since it contains the most relevant sources for their area of interest. Sometimes they also use IEEE or Springer. Many papers in Springer are paid, however, and the participants often bypass this restriction by having Google Scholar find and download the papers for them. They also use Google Scholar for searching research papers, so that they do not have to search in every digital library separately. While some participants use Google Scholar and do not search digital libraries directly at all, one participant stated that he prefers searching directly in ACM, because it is easier to work with paper metadata (like references and citations) there than in Google Scholar; he also finds the most relevant papers there. Another participant added that having a paper published by ACM also assures the quality of the source. Besides direct searching for terms, another common way of discovering new papers is following the references of papers already read. Another way is searching for papers by particular authors, as there are authors who are authorities in their areas and their papers are certainly worth reading. Some participants also search for papers presented at the conferences they are interested in.

One participant mentioned how he filters the results of a digital library search. He reads the paper’s title, the summary shown in the search results and the abstract. Most papers are filtered out when reading the title. Another participant stated that after filtering the search results, he reads the abstract, introduction and conclusions of each remaining article.

The participants have defined a good paper as follows.

• Understandable. This concerns both correct English and a clear description of ideas (no cryptic or vague language). The paper should also be focused on one clear, relevant idea.

• Replicable. It should be possible to follow the steps mentioned in the paper and get the same results.

• Good evaluation. A clear evaluation and clear conclusions about what the authors have achieved.

• Good conference. Most ACM papers were presented at good conferences.

• The paper is not older than five years. However, some old articles are worth referencing, as they are important in their area.

• The paper is not too long. Papers that are too long either should have been split into multiple papers or use cryptic and vague language. Journal articles are an exception and can be longer.

• The paper has some citations.

• The paper references relevant sources and provides a comprehensive review of related work.

• The abstract, introduction and conclusions give a good indication of what the article is about.

• The paper does not require many other papers to be read to understand.

The participants use the following sources for keeping pace with the newest trends.

• Twitter for interesting web articles, blogs and other socially shared content. The participant who uses Twitter also heavily filters the news there, as there is too much unrelated information in social media.

• Mailing lists that regularly send news from particular areas of interest (conferences, programming languages etc.).

• Journals like Communications of the ACM or XRDS ACM. The participant who occasionally reads these journals usually has no time to read the long articles in them.

• Google Scholar recommendations of papers related to the researcher’s interests. The participant who uses this service stated that he finds many of these recommendations interesting.

• Articles for review often extend the scope of knowledge of the reviewer, as she or he has to study literature in areas other than her or his core domain of interest.

• Keynotes are often a good way to get insights into new research areas, as they review the state-of-the-art methods and approaches of those areas.

One participant uses Annota to annotate web pages and articles, tag them and mark them to be read later. He uses Mendeley just to have his papers accessible on networks without access to digital libraries. Another one uses Annota similarly, but likes to use Mendeley for document categorization (folders). The same participant has also been using ResearchGate for a short time. He has no time to read the stream of news that ResearchGate provides, but he likes the idea of quick and clear information on what is happening right now. No other participants use ResearchGate. However, one participant regularly visits the web sites of selected authors. Since this is exactly the use case of ResearchGate, he might start using the service later. One participant uses Mendeley mostly for full-text search in the articles he has read before. All the participants find reading articles in Mendeley more comfortable than using Annota with the Firefox built-in PDF reader. One participant uses no tools for paper organization: he simply stores all the papers in folders on the disk and laboriously searches them manually when he needs something.

The participants told us which applications they find useful. One participant, probably thanks to the interview, came up with an idea that he might use as a topic for a future bachelor thesis: matching research papers in the ACM digital library with the conference talks and slides related to them.

Appendix B

User Documentation

We realized the proposed software design in the Annota digital library and therefore include the user documentation for Annota.


Appendix C

Developer Documentation

We realized the proposed software design in the Annota digital library and therefore append the related parts of the Annota developer documentation. We include the latest snapshot of the documentation available in the Annota development Wiki. If you are an Annota developer, use the latest documentation from the Wiki.

https://redmine.fiit.stuba.sk/projects/annota/wiki/Graphene



Appendix D

Journal Article Proposal

We have summarized the most important parts of our work in a draft of a journal article proposal.


Appendix E

Attached CD Content

The attached CD contains the following items.

• README.md: text file describing the CD contents

• annota/: git repository with the Ruby source code of the Annota project

• thesis/: git repository with the LaTeX source code of the thesis

• appendices/: appendices that are not included in the thesis source code
