how users explore ontologies on the web: a study of ncbo’s ...papers. · how users explore...

How Users Explore Ontologies on the Web:A Study of NCBO’s BioPortal Usage Logs

Simon WalkStanford University

[email protected]

Lisette Espín-NoboaGESIS & University of

[email protected]

Denis HelicGraz University of Technology

[email protected]

Markus StrohmaierGESIS & University of

[email protected]

Mark A. MusenStanford University

[email protected]

ABSTRACTOntologies in the biomedical domain are numerous, highlyspecialized and very expensive to develop. Thus, a cru-cial prerequisite for ontology adoption and reuse is effectivesupport for exploring and finding existing ontologies. To-wards that goal, the National Center for Biomedical Ontol-ogy (NCBO) has developed BioPortal—an online repositorycontaining more than 500 biomedical ontologies. In 2016,BioPortal represents one of the largest portals for explo-ration of semantic biomedical vocabularies and terminolo-gies, which is used by many researchers and practitioners.While usage of this portal is high, we know very little abouthow exactly users search and explore ontologies and whatkind of usage patterns or user groups exist in the first place.Deeper insights into user behavior on such portals can pro-vide valuable information to devise strategies for a bettersupport of users in exploring and finding existing ontologies,and thereby enable better ontology reuse. To that end, westudy and group users according to their browsing behavioron BioPortal and use data mining techniques to character-ize and compare exploration strategies across ontologies. Inparticular, we were able to identify seven distinct browsingtypes, all relying on different functionality provided by Bio-Portal. For example, Search Explorers extensively use thesearch functionality while Ontology Tree Explorers mainlyrely on the class hierarchy for exploring ontologies. Further,we show that specific characteristics of ontologies influencethe way users explore and interact with the website. Ourresults may guide the development of more user-orientedsystems for ontology exploration on the Web.

KeywordsBrowsing behavior; Semantic Web; BioPortal; Markov chain;Stationary distribution; Clustering

c©2017 International World Wide Web Conference Committee(IW3C2), published under Creative Commons CC BY 4.0 License.WWW 2017, April 3-7, 2017, Perth, Australia.ACM 978-1-4503-4913-0/17/04.http://dx.doi.org/10.1145/3038912.3052606

.

1. MOTIVATIONFacilitating reuse on the Semantic Web requires support

for effectively finding and exploring existing semantic re-sources such as ontologies or taxonomies. Particularly inthe biomedical domain, where ontologies are highly special-ized, costly and often created collaboratively by large groupsof domain experts, identifying and finding existing seman-tic vocabularies and terminologies is crucial. To supportresearchers and practitioners in this task, the National Cen-ter for Biomedical Ontology (NCBO) has developed Bio-Portal1—the world’s most comprehensive online repositoryof biomedical ontologies [21, 16, 33, 18].Problem. In this paper, we want to shed light on howusers are exploring ontologies and terminologies in BioPor-tal. Without a deep understanding of how to facilitate theexploration of existing ontological resources on the Web, in-creasing the reuse of ontologies on the Semantic Web willremain an elusive goal. Towards that goal, we want to (i)cluster traces of user interactions to obtain different explo-ration behavior types and (ii) investigate ontologies in termsof their user behavior. New insights into user behavior onsuch portals can potentially provide actionable informationto devise strategies that better support researchers and prac-titioners in exploring existing ontologies, and thereby mayenable better reuse of ontological content in general.Approach. To identify different types of user behavioron BioPortal, we adopt a clustering approach based on (i)dynamic browsing features extracted from the Apache logsof BioPortal and on (ii) calculating stationary distributionsfrom first order Markov chain representations of user tran-sitions. We demonstrate the effectiveness of our approachand investigate if and to what extent exploration strategiesof users differ when exploring biomedical ontologies on Bio-Portal. In addition, we use Principal Component Analysis(PCA) to visually inspect the obtained clusters.

Contributions. We present an approach for identify-ing different user behavior types in ontology exploration logdata using stationary distributions from first order Markovchains. We provide evidence for a total of seven distinctbrowsing-behavior types on BioPortal and offer an interpre-tation of their relevance and meaning. We use these differ-ent types of exploration behavior to characterize ontologiesin terms of the different user interactions they attract. We

1http://bioportal.bioontology.org

775

Table 1: Overview of all actions available in the BioPortal user interface.Action Labels Description

Main PageBrowse Main Page, Browse Ontologies, Browse Search, BrowseHelp, Browse Mappings, Browse Recommender, Browse Annotator,Browse Resource Index, Browse Projects, Browse Notes

Browsing main areas and functionality ofBioPortal.

Ontology Page

Ontology Summary, Browse Ontology Classes, Browse OntologyClass, Browse Ontology Class Tree, Browse Ontology Mappings, On-tology Analytics, Browse Ontology Widgets, Browse Ontology Vi-sualization, Browse Ontology Notes, Browse Ontology Properties,Browse Widgets, Browse Ontology Property Tree, Browse ClassNotes

Labels for actions that can be performedwhile browsing a specific ontology on

BioPortal.

Edit ContentCreate Ontology Submission, Validate Ontology File, Virtual Appli-ance Download, Browse Ontology Submission

Actions triggered when uploading newversions of ontologies.

User Account Login, Log-Out, Sign-Up, Lost Password, Browse Account, Feedback Actions to manage accounts on BioPortal.

Control Term BREAK ≥30 minutes inactivity between two actions.

find that particular characteristics of semantic resources, asrepresented on BioPortal, influence how users interact withthem. For example, BioPortal lacks the ability to visual-ize flat ontologies (i.e., ontologies without a structured hi-erarchy), which fuels the emergence of diverse explorationstrategies as identified by our approach. Overall, our workadvances our understanding of ontology exploration behav-ior on prominent repositories of biomedical ontologies on theWeb.

2. RELATED WORKBiomedical Ontology Repositories. There exists a vastvariety of repositories for ontologies and taxonomies withsimilarly diverse core interests. We provide a brief overviewof the most important projects from the biomedical domainand describe how they are related to BioPortal. The OpenBiomedical Ontologies (OBO) Foundry [25] initiative col-lects and maintains a set of different ontologies, which arespecifically designed for interoperability. The whole libraryof ontologies is available on GitHub2 and developers of on-tologies aspire to have their content included in the OBOFoundry. All ontologies of the OBO Library are automati-cally imported into BioPortal.

The European Bioinformatics Institute [11] maintains theOntology Lookup Service which provides an alternative me-chanism for accessing content in the OBO Library.

The Unified Medical Language System (UMLS) [1] is acollection of biomedical terminologies and ontologies dis-tributed by the National Library of Medicine. Terms inUMLS are mapped to one another through the UMLS Meta-thesaurus. Similarly to OBO Foundry, the content of UMLSis also available for exploration in BioPortal.

The majority of research about (biomedical) ontology re-positories focuses on the selection and maintenance (includ-ing provenance) of the included ontologies and taxonomies.In this paper, we present analyses that complement existingresearch by studying aspects about the interaction behav-ior of users on BioPortal [21, 16, 33, 18]—one of the largestbiomedical ontology and taxonomy repositories on the Web.User Interactions on the (Semantic) Web. A lot ofresearch has addressed the tasks, actions, tools and processesrequired for developing and evaluating ontologies [8]. Thishas led the Semantic Web community to develop a variety ofdifferent guidelines and methodologies [23] or compile bestpractices [17] for engineering ontologies.

2http://obofoundry.org

Engineering ontologies collaboratively can potentially leadto structural conflicts since several users can edit ontologiessimultaneously. To circumvent this, Hellman and Gal [9]propose locking mechanisms based on a graph depiction ofthe concept dependencies. Pesquita et al. [19] leveraged thelocation and specific structural features to show that thesecan be used to determine if and where the next change isgoing to take place in the Gene Ontology3.

To learn more about the impact and quality of collabo-ration in collaborative ontology-engineering projects, Stroh-maier et al. [26] investigated the hidden social dynamics thattake place in such projects from the biomedical domain andprovided new metrics to quantify various aspects of the col-laborative engineering processes. User roles have been alsostudied from many different angles. For example, Falconeret al. [6] investigated and classified users according to differ-ent roles in collaborative ontology-engineering projects. In2014, Van Laere et al. [27] used K-means and the GOSPLmethodology to classify users by clustering interactions thatusers engage in, while engineering an ontology. Wang etal. [32] used association-rule mining on the change-logs ofcollaborative ontology-engineering projects, and showcasedthe utility of the identified editing patterns in a predictionexperiment. Walk et al. [28] studied user editing trails ofontology-engineering projects by defining and comparing aset of hypotheses about how users edit ontologies.

Further, there has been increasing interest on analyzingand categorizing human sequences in the past, including mo-bility [5], page access [7, 12], edit [28, 30, 31] as well as clickstream logs [24]. Additionally, Markov models of various or-ders have been employed [2, 15, 20, 4, 22, 29] to modeland predict clicks or interactions of users on the (Semantic)Web. Recently, Chierichetti et al. [3] conducted an analy-sis that questioned if a first-order model best represents thenavigation behavior of humans on the Web. In this direc-tion, the work in Lakshimarayan et al. [13] demonstratesthat higher order Markov chains can be used to model Webbrowsing behavior and predict user intent towards buyingspecific products.

Making sense and interpreting such sequential data is nota trivial task. The work in Hoxha et al. [10] proposes aplatform to analyze user browsing patterns by semanticallyformalizing the usage logs of Web data, providing meansto query and retrieve sessions of users that satisfy certainsemantic and temporal conditions. Research has also been

3http://www.geneontology.org

776

(a) Seconds between Requests (b) Requests per IP (c) Ontologies per User (d) Requests per Session

Figure 1: BioPortal Characteristics. Entities on the x-axes for all figures are binned according to the displayedvalues and cut-off at their maximum values for reasons of readability. The seconds between requests (a) areheavy-tailed, meaning that the majority of requests are conducted within short periods of times. The numberof request actions per IP (b), a proxy for users on BioPortal, is heavy-tailed. The majority of users havevisited only a total of one or two ontologies and (c) the majority of sessions exhibit a small number of requests(d).

focused on human behavior beyond browsing patterns andstudy reading behavior of users in Wikipedia as well asthe stability of the reading patterns per user [14]. A moregeneric approach is proposed in [24] where the authors presentHypTrails—a Bayesian methodology that allows researchersto compare and rank hypotheses about digital trails on theWeb, which has also been applied on the edit logs of collab-orative ontology-engineering projects by Walk et al. [28].

To the best of our knowledge, analyzing how users exploreontologies in the context of ontology repositories on the Webhas not been addressed as in-depth before.

3. MATERIALS & METHODS

3.1 BioPortalBioPortal—an online biomedical ontology and taxonomy

repository—was created by the National Center for Biomed-ical Ontology (NCBO). The main goals of NCBO involve notonly the creation and maintenance of a comprehensive repos-itory of biomedical ontologies and terminologies, but also thetask of building novel tools and Web services to enable andaugment the (re)use of all stored semantic vocabularies andterminologies in clinical and translational research. BioPor-tal currently contains more than 500 biomedical ontologies,and supports a wide range of Web services, such as using on-tologies for annotating resources and generating value sets orspecific ontology views. Additionally, BioPortal allows usersto explore ontological content not only by using a standardtree browser, but it can also visualize resources using cus-tom tailored widgets, which help users in comprehendingthe complexity of large biomedical resources (see Table 1 forall click types). Similarly, the website provides functionalityto explore mappings between ontologies, which can be usedto directly compare the use of related terms as well as theoverlap between different ontologies.

3.2 Data Acquisition & PreprocessingFor the analyses presented in this paper, we have parsed

all the requests stored in the Apache logs of BioPortal from12/31/2015 to 07/19/2016 (around 7 months; cf. Table 2).Each line in the request logs contains information (amongothers) about the IP, the timestamp, the actual resource thatwas requested and the useragent, which is used to provideadditional information about the requester, such as browser,operating system or name of the bot/spider, for each request.

Preprocessing. In a first step, we reduced a total of roughly51.8M entries (7 months) to 16.7M by removing all requeststhat were conducted by bots, spiders and other automaticscripts via the combination of existing useragent and IPblacklists as well as through manual inspections.

Then, we reduced our dataset to only include requests,which were triggered by users and involve an actual inter-action with the website. For example, if users click on anontology name, many automatic requests (AJAX calls) aretriggered, which do not represent interactions of users, andare hence filtered from our dataset. Note that we considerall interactions that trigger a request for our analyses. Forexample, we consider search queries entered by users on Bio-Portal as interactions with the website. After preprocessing,we still have a total of 2.52M requests (see Table 1), gen-erated by 215, 908 unique IP addresses over the course of 7months. Note that the number of different ontologies thatusers requested is larger than the number of stored ontolo-gies in BioPortal. Whenever users manually type in requestsin the browser, typos can occur. Further, ontologies on Bio-Portal are also not only accessible through their abbrevia-tions but also through their internal IDs. In both situations,the number of unique ontologies in the request logs increases,while the number of stored ontologies remains unaffected.Dataset Characteristics. The distribution of time be-tween requests (seconds) is depicted in Figure 1(a). Theseconds between requests are heavy-tailed, meaning that themajority of requests are performed within 0-10 seconds.

To be able to asses if and to what extent we are able toobserve power users—very active users who contribute the

Table 2: Characteristics of the BioPortal dataset.Feature Value

Unique IPs 215, 908Unique IPs (with ≥ 2 requests) 168, 008Ontologies 1, 818 (517)Actions (ca.) 2.52M

Sessions (time between requests ≥ 30 mins) 513, 6591-request sessions 165, 543# requests in ≥ 2-request sessions 2, 36MAverage/Median session duration 217s/0s

First request 2015/12/31Last request 2016/07/19Observation period (ca.) 7 months

777

Table 3: Example of a session. Using the Apache logs of BioPortal, we can chronologically order the requests ofeach user (identified by IP) individually, assign action labels to each request and create a sequence of actions,which can not only be used to fit first-order Markov chains, but also to calculate stationary distributions foreach user.

Timestamp Type Request Action Labels (Sequence Step)

2016-03-14 09:07:32 GET / Browse Main Page (1)2016-03-14 09:07:46 GET /login?redirect=http%3A%2F%2Fbioportal.bioontology.org%2F Login (2)2016-03-14 09:07:48 POST /login Login (3)2016-03-14 09:07:50 GET / Browse Main Page (4)2016-03-14 09:08:04 GET /ontologies/MCCV Ontology Summary (5)2016-03-14 09:08:22 GET /ontologies/MCCV/submissions/new Create Ontology Submission (6)2016-03-14 09:09:34 POST /ontologies/MCCV/submissions Create Ontology Submission (7)2016-03-14 09:09:59 GET /ontologies/success/MCCV Create Ontology Submission (8)2016-03-14 09:10:14 GET /ontologies/MCCV Ontology Summary (9)

Sequence: Browse Main Page –> Login –> Login –> Browse Main Page –> Ontology Summary –> Create Ontology Submission –> Create Ontology Submission–> Create Ontology Submission –> Ontology Summary

majority of all requests—we have grouped users (in the formof IP addresses) according to the number of requests theyhave contributed and visualized the results in Figure 1(b).While the majority only conduct a very small number ofrequests (between 1-10), we were able to identify 30 userswho conducted more than 5, 000 interactions on BioPortalwithin 7 months. Additionally, we were able to identify thatthe majority of users only visit one or two ontologies on thewebsite (cf. Figure 1(c)).

Finally, we aggregated all requests of users into sessions.Each session contains all chronologically sorted interactionsof one user, unless two interactions are apart more than 30minutes. In that case, the current session is closed, a newsession is opened and the process is repeated. We have se-lected 30 minutes as threshold, as the requests per IP areheavy-tailed and the majority of timespans between requestsare within that threshold, introducing only a small numberof sessions. We were able to group all of the 2.52M requestsinto a total of 513, 659 sessions, where 165, 543 only con-sist of a single interaction. As depicted in Figure 1(d), themajority of sessions in BioPortal exhibit a low number ofrequests, while only very few sessions last for more than 100requests.Action Label Mapping. Finally, we created a set of regu-lar expressions to map each request to easier-readable labels(cf. Table 1 for all the different types of labels). In general,the generated action labels can be grouped into 5 differentcategories. The two most important groups of action labelsare the ones accessible from the main page of BioPortal, suchas Browse Ontologies or Browse Search, and the ones thatdescribe actions conducted on single ontology pages, such asBrowse Ontology Class or Browse Ontology Property Tree.

Action Sequence Generation. Using the action labels weare able to extract easily readable chronological sequences ofactions for each user on BioPortal (see Table 3). Wheneveran IP address exhibits two or more sessions, we connect thesessions with a BREAK state to be able to see where usersstop and resume browsing BioPortal.

3.3 Modeling Browsing BehaviorWe model user browsing behavior on BioPortal with mem-

oryless Markov chains. A Markov chain consists of n statessi from a finite state-space S with n = |S| and a transitionmatrix P ∈ Rn×n where pij defines the probability to tra-verse from state si to state sj . Since each row in P definesa probability distribution for each i we have

∑j pij = 1.

We represent each action on BioPortal as a single state ofa Markov chain (see Table 1). Then, an element pij from thetransition matrix P represents the probability of performingaction j after action i has been performed.

To compare users with each other we compute P and itsstationary distribution for each individual user. In our case,the stationary distribution is a probability distribution overactions, which defines how likely we will find a user per-forming a given action in the limit of large number of steps.Generally, the stationary distribution captures the way howusers move between actions, i.e. it encodes the sequentialinteractions within an n-dimensional vector, and thereforeprovides additional information over simple view counts (cf.Table 4).

To compute stationary distributions, we first define theweighted matrix A ∈ Rn×n, where each element aij is set tothe number of observed transitions between states i and j inour empirical dataset (see Table 1 for all possible states). A

Table 4: Stationary Distribution and Page Views Illustration. Given a vector of page views (2xA, 2xB and2xC) the stationary distribution depends on the order of the appearance of each state in the sequence. Forexample, given a sequence of “ABCABC”, we create a weighted matrix by counting the transitions betweenthe three states A, B and C. Normalizing each row produces the transition matrix P , which we use to calculatethe stationary distribution π (top row). For a user with the same page views, but a different sequence, weobtain a different stationary distribution (bottom row). The static page views remain unaffected by theordering of the sequence, as only the total number of occurrences of each state is considered.

Sequence Weighted Matrix Transition Matrix Stationary Distribution Page Views

ABCABC A =

0 2 00 0 21 0 0

P =

0 1 00 0 11 0 0

π =

0.35330.33560.3111

222

AABBCC A =

1 1 00 1 10 0 1

P =

0.5 0.5 00 0.5 0.50 0 1

π =

0.52290.29160.1855

222

778

(a) Selection of K (b) Stationary Distribution Clusters

Figure 2: User Behavior Clusters. To estimate the number of clusters to investigate, we have plotted thepercentage of the explained variance (y-axis) per cluster (x-axis) in (a) for the stationary distribution clusters(solid blue) and the page view clusters (dashed green). We select the number of clusters (7) to a value wherethe introduction of new clusters only minimally increases the explained variance (red circles). To furtherinvestigate the results of K-means, we have reduced and visualized the stationary distribution vectors of eachuser to three dimensions using principal component analysis in (b). Each point represents one user of ourdataset, while the colors represent the corresponding clusters obtained by K-means (before dimensionalityreduction). We provide labels for the clusters and the extremes of the axes. The latter represent the actionswith the smallest and highest coefficients of the corresponding principal components (see Table 5).

transition matrix P has to satisfy certain conditions to havea stationary distribution, most notably irreducibility (eachaction has to be reachable from all other actions) and ape-riodicity (the return times to actions have to be aperiodic).

To guarantee both irreducibility and aperiodicity, we adda teleportation factor α to A, which (technically) would al-low users to teleport between all actions with a very smallprobability, similar to PageRank. The teleportation factorconnects each state to all others (satisfying irreducibility)including a self-loop (satisfying aperiodicity):

W = A+α

n11T , (1)

where we set α = 0.15 and 1 is a vector of all ones fromRn. Finally, we normalize each row in W to sum up to 1 toobtain the transition matrix P .

Now, the stationary distribution π can be calculated asthe left eigenvector of P with its largest eigenvalue 1 and thestationary distribution satisfies the eigenvalue equation forthe matrix P : πT = πTP . An example calculation of thestationary distribution (without the teleportation factor) isdepicted in Table 4. Assuming two users exhibit the samenumber of page views on the pages A, B and C, we can in-corporate the information about the chronological orderingof the sequence by calculating the stationary distribution.Hence, a sequence of “ABCABC” page visits (top row ofTable 4) yields a different stationary distribution than a se-quence of “AABBCC” page visits (bottom row of Table 4).

3.4 Clustering Browsing BehaviorTo deepen our understanding about the browsing behav-

ior of users on BioPortal, we first group similar users. Tothat end, we use K-means—an unsupervised clustering al-gorithm—which clusters “close-by” users. For K-means weneed to embed users in a vector space so that their proximity

can be determined by, for example, pairwise Euclidean dis-tances between user vectors. For that purpose we representusers with their stationary distribution vectors.

Once, when users are represented as vectors, K-meansrandomly selects K initial central vectors (centroids), whereK has to be manually set and defines the number of clustersthe algorithm will find. In every iteration, each user vectoris assigned to the closest centroid. The positions of the cen-troids are then updated (i.e., moved towards the center ofthe corresponding vectors).

This process continues until convergence, that is until thedisplacement of the centroids between iterations is below acertain threshold. Note that the choice of the clustering al-gorithm is arbitrary and we leave it open to future work, todetermine which algorithm is best suited for a specific appli-cation. For illustration purposes, we have chosen K-meansas it is well studied and can handle large datasets (168, 008distinct users, clustered over 34 action-label dimensions).

4. RESULTS

4.1 Differences in User Browsing BehaviorWe have calculated the stationary distribution for each

user as described in Section 3.4, and used K-means to groupthem accordingly. Additionally, we clustered users usingstatic page view vectors.Estimating & Validating K. To estimate a plausiblenumber of clusters, which best represent the different typesof browsing behavior in BioPortal, we have calculated thepercentage of variance explained by 1 to 25 clusters for ourempirical data (cf. Figure 2(a)). This method is also oftenreferred to as “elbow”-method, and states that the optimalvalue for K, given empirical data, should be set so that theintroduction of additional clusters only minimally increases

779

Figure 3: Browsing-Behavior Types: We have extracted and visualized the transition probabilities betweenthe different actions for all 7 clusters obtained by K-means when using the stationary distribution vectors(bottom). Transitions are always read From State (left) to To State (bottom) and rows are normalizedindividually. The darker the colors, the higher the transition probability between two states. Histogramsof the actions are depicted on top of the transition matrices and indicate their absolute occurrences in theextracted sequences for each cluster. All of the clusters exhibit differences in the frequencies (histograms)and transition probabilities.

the explained variance. For both our experiments, the clus-tering via the stationary distributions and page views, thebest value for K is 7.

To be able to visually inspect and validate the resultingclusters, we have extracted three principle components byapplying PCA on the stationary distribution vectors and vi-sualized the results (see Figure 2(b)). Each point representsone user of our dataset and the colors indicate one of theseven clusters suggested by K-means. Additionally, we haveadded (i) the action labels with the largest and smallest co-efficients (see Table 5) to the corresponding axes to allow

for manual interpretation and (ii) provide labels for the dif-ferent clusters. We can see that the clusters obtained byK-means, when using the stationary distribution vectors,are easily distinguishable and warrant further inspection.Categorizing Clusters. Hence, we have extracted and vi-sualized the action sequences of all clusters (see Figure 3;the colors correspond to the clusters in Figure 2(b)) indi-vidually. Note that we have removed actions with very fewoccurrences from the visualizations for reasons of readabil-ity, which is why the transition probabilities of the rows inFigure 3 do not necessarily sum up to 1.0.

780

Using the stationary distribution as input for K-means,we were able to obtain a total of 7 distinguishable browsing-behavior types on BioPortal:The Main Page Visitors primarily visit the main pageof BioPortal and exhibit very few other actions. The clus-ter consists of 2, 813 users with an average of 32.2 actions(median of 2) per user.The Ontology Overview Visitors consist of 2, 668 users,who primarily visit ontology overview pages and refrain fromexploring the classes of the ontologies. The average user inthis group performs 22.7 actions (median of 3).The group of Class Explorers, the biggest cluster with atotal of 109, 159 users, specifically target and visit classes ofthe different ontologies hosted on BioPortal. The averagenumber of actions for this group is 3.6 (a median of 2).Ontology Tree Explorers use the hierarchical representa-tion of an ontology on BioPortal to explore and browse thecontent of an ontology. A total of 4, 490 users belong to thisgroup with an average of 77.3 actions (median of 14).A total of 6, 701 Search Explorers make extensive use ofthe search functionality provided by BioPortal to identify,explore and find classes in the ontologies. On average, usersof this cluster conduct 92.1 actions (median of 18).Specific Class Browsers represent the second largest brow-sing behavior type with 22, 436 users. In contrast to SearchExplorers, this group exhibits shorter browsing sessions, ev-ident in the higher number of BREAK states, and concen-trates on exploring multiple classes of a specific area of anontology (as opposed to whole ontologies, by inspecting theontology class tree). The average number of actions for usersin this group is 24.9 with a median of 8.The final group are BioPortal Experts, who use severalfeatures that BioPortal provides to find and explore ontolo-gies and classes, such as the annotator and recommender.On average, the 19, 741 users of this cluster performed 24actions (median of 4).Comparison to Page Views. We repeated the same ex-periment with the static page view vectors as input for K-means and manually inspected the results for K = 7 clus-ters. In contrast to the results shown in Figure 2(b), wewere only able to distinguish a total of four different kindsof browsing-behavior types. The biggest group of users arethe Search Explorers, consisting of a total of 167, 979 usersand a total of four different clusters with 167, 631, 1, 21 and326 users respectively. We consider them as single clusteras the actions and transition matrices of these groups werenearly indistinguishable upon manual inspection. The threeremaining clusters, Class Explorers, Ontology Tree Explorersand Feedback Providers, are again very small, with only 3, 6and 20 users respectively.Increasing K. Additionally, we have manually exploredup to 14 clusters obtained from K-means, using the station-ary distributions as input. Due to the information inherent

in the stationary distribution, we could further refine theobtained class labels. In additional to the seven clustersoutlined before, we identified Annotators (542 users), Wid-get Browsers (399 users), Search & Tree Explorers (3, 685users), Feedback Providers (9, 311 users) and Account Main-tainers (374 users).

4.2 Differences when Exploring OntologiesTo be able to learn more about exploration and usage

strategies of users, we were interested in identifying differ-ences in the distribution of browsing-behavior types thatinteract with the different ontologies on BioPortal. Hence,we have determined the number of users for each browsing-behavior type cluster for the 50 most visited ontologies inour dataset and aggregated the corresponding number ofactions. Finally, we have applied PCA on the aggregatednumbers of requests per cluster and visualized the results(similar to Figure 2(b)) to identify differences in the brow-sing-behavior types between projects (see Figure 4(a)). Thelarger the difference in actions performed by different brow-sing-behavior clusters for each project, the larger the dis-tance between the projects.

Due to limitations in space, we restrict the presentation ofresults to two projects. In particular, we compare the brows-ing behavior of users for the Current Procedural Terminol-ogy (CPT) as well as RxNorm (RXNORM). The former is aterminology, which is maintained by the American MedicalAssociation, and is used to describe services in the medi-cal, surgical, and diagnostic domain (mostly exclusive) forbilling purposes in the US. The latter was developed by theNational Library of Medicine as a public resource for variousapplications, and represents a terminology that contains allmedications available on the US market. We have selectedCPT and RXNORM, as they are among the ontologies thathave received the most visits, exhibit substantially differentrankings for the actions per browsing-behavior cluster andwere initially designed for different practical applications.Sequence Extraction. For this analysis, we have extractedall sequences of all users, who performed at least 20% of alltheir actions on one of the corresponding projects. As aconsequence, sequences are not mutually exclusive, mean-ing that one user could be present in both of our extractedaction sequences if that user performed at least 20% of allclicks on CPT and another 20% on RXNORM. Given thatthe majority of users only visit one or two ontologies (see

Table 6: Actions for CPT & RXNORM.User Type CPT Actions RXNORM Actions

Class Explorers 120, 926 (3) 49, 440 (2)Specific Class Browsers 187, 987 (2) 88, 068 (1)BioPortal Experts 19, 934 (5) 12, 755 (4)Main Page Visitors 195, 030 (1) 35, 869 (3)Ontology Tree Explorers 78, 485 (4) 1, 478 (5)Ontology Overview Visitors 252 (7) 69 (7)Search Explorers 1, 142 (6) 224 (6)

Table 5: Principal Component Coefficients. The ex-plained variance for the first principal component(PC1) is 63.67%, for the second (PC2) 73.57% and80.53% for the third (PC3).

Action Label PC1 PC2 PC3

Browse Search −0.1897 +0.9088 −0.0368Ontology Summary −0.1253 −0.2736 +0.6240Browse Main Page −0.1236 −0.2785 −0.7712Browse Ontology Class +0.9571 −0.0702 −0.0127

781

(a) Ontology Projects over Principal Components (b) Comparison of Exploration Behavior (CPT& RXNORM)

Figure 4: Comparison of Exploration Behavior. We have aggregated the number of actions per browsing-behavior cluster, applied PCA and plotted the landscape of exploration behavior types for the 50 most visitedontologies (blue points) in (a). The extremes of the axes correspond to the clusters with the largest andsmallest coefficients of PCA respectively. We have added labels for the ontologies and terminologies whereappropriate. The inset magnifies the densely concentrated group of ontologies. The larger the difference inactions performed by different browsing-behavior clusters for each ontology, the larger the distance betweenontologies. The detailed comparison of CPT (green) and RXNORM (brown) is depicted in (b). The histogram(top) depicts how often the different actions occurred in the extracted sequences in both ontologies. Thetransition matrix (bottom) depicts the importance of the transitions between actions of users while browsingCPT (blue) and RXNORM (brown). Transitions that are equally important in both projects are white.The transition from Browse Search (From State) to Browse Ontology Class (To State) is more dominant forRXNORM. Additionally, RXNORM exhibits a very small number of Browse Ontology Class Tree actions.

Figure 1(c)) and that there are many actions, which cannot be assigned to an ontology, 20% of all actions alreadyrepresents a very high filtering criterium. Overall, we col-lected a total of 61, 367 unique IP addresses for CPT, whichconducted on average 10.7 actions (median of 2). In con-trast, we extracted action sequences of 23, 757 IP addressesfor RXNORM, with an average of 8.6 actions (median of 2).The aggregated actions for each cluster for the two ontolo-gies are listed in Table 6.Comparing Browsing-Behavior Types. To visualizeand investigate the differences in browsing-behavior strate-gies between the two ontologies, we have fit a first-orderMarkov chain on the extracted sequences of both projectsand visualized the difference between the two transition ma-trices in Figure 4(b). Note that we only display the top 10most frequent actions for reasons of readability.

CPT—being the more popular ontology—received roughlythree times more requests than RXNORM, as shown in thehistogram on top of Figure 4(b). The most common actionsfor both projects are Browse Ontology Class and BrowseSearch. In contrast to RXNORM, users of CPT also rely onthe hierarchy of the ontology to browse classes (see BrowseOntology Class Tree in Figure 4(b)). This is not surpris-ing, as RXNORM is a flat ontology, which does not havea hierarchy that can be visualized by BioPortal. This alsoexplains why the relative importance of the transition from

Browse Search to Browse Ontology Class is more dominantfor RXNORM, than it is for CPT (evident in the dark browntransition probabilities in Figure 4(b)).

5. DISCUSSIONIn this paper we have analyzed, modeled and clustered

the browsing behavior of users on BioPortal. In particular,we have grouped users according to their browsing behaviorin Section 4.1 and compared how users explore ontologies inSection 4.2.Differences in Browsing-Behavior Types. We haveshown that we were able to obtain multiple, clearly dis-tinguishable browsing-behavior types, when using station-ary distribution as input for K-means. Further, we havecalculated the explained variance per cluster to determinethe number of browsing-behavior types to investigate. Themanual inspection of the obtained clusters shows that usersprimarily concentrate on browsing specific classes (Class Ex-plorers and Specific Class Browsers) of biomedical ontolo-gies on BioPortal. Two strategies of users to explore on-tologies involve the exploitation of the ontological hierarchy(Ontology Tree Explorers; if available) and the search func-tionality provided by BioPortal (Search Explorers). Users inthe third-largest cluster (19, 741)—BioPortal Experts—useadvanced functionality, such as the annotator or the recom-mender, to find classes and explore ontologies.

782

The comparison of our results with clusters obtained fromstatic page views indicate that the inclusion of informationabout the dynamic nature of a user’s interaction behaviorwith a website is important. We were able to distill a to-tal of 7 information consumption strategies, which we wouldhave otherwise not identified. Note that the presented re-sults reflect how users interact with BioPortal from a purelydata-driven analysis and further analyses are required to de-termine the generality of our findings.

In general, our results indicate that users exhibit a pref-erence to explore and browse the content of the classes ofbiomedical ontologies. However, this also indicates that ad-ditional efforts are warranted to promote and further en-hance and evaluate the utility of the functionality providedby BioPortal to explore and find ontologies. For example,only a small portion of users (the BioPortal Experts) tra-verse along and use mappings between ontologies when vis-iting BioPortal. Further, the sooner we can assign a usertype to any given visitor, the better we can leverage our re-sults and automatically adapt caching and recommendationstrategies to anticipate the next click.Differences when Exploring Ontologies. Our resultsindicate that several characteristics of ontologies, such asthe hierarchical structure (or lack thereof), can influencehow users explore and interact with these semantic vocab-ularies and terminologies on BioPortal. One particular ex-ample of such an ontology is RXNORM, which does notexhibit a structured hierarchy, and is thus not displayed inthe class explorer on BioPortal. The only ways to inter-act with classes of this ontology on the website is to eitheruse the search functionality, which requires prior knowledgeabout the content of the ontology, or to rely on mappingsfrom other ontologies. However, this also explains the strongfocus of users/visitors of RXNORM to use the search func-tionality (Browse Search) to explore, investigate and findclasses of this ontology on BioPortal (see Figure 4(b)).

In contrast, the structural hierarchy of CPT is used byBioPortal to visualize the class tree. Hence, to explore theontology, users can (and do)—aside from searching (BrowseSearch) for specific classes—use and explore the hierarchy(Browse Ontology Class Tree) to find a specific class. Notethat we have only presented the differences in browsing-behavior types for two of more than 500 ontologies! Yet,we have shown that simple differences (i.e., the lack of aclass hierarchy) between projects cause users of BioPortalto employ different strategies to find classes or explore theontological content. We leave it open for future work to con-tinue this line of investigation to identify characteristics andfeatures of ontologies causing differences in how users seekand consume information stored inside them.

6. CONCLUSIONS & FUTURE WORKIn this paper, we have presented an approach for iden-

tifying different types of user behavior in ontology explo-ration log data using stationary distributions from first or-der Markov chains. We provide evidence for a total of 7 dis-tinct browsing-behavior types on BioPortal and offer an in-terpretation of their relevance and meaning. In addition, wecluster ontologies by their user behavior and identify prag-matic commonalities among ontology projects. Our resultsadvance our understanding of the ways in which ontologi-cal repositories, such as BioPortal, are used and may guide

the development of more user-oriented systems for ontologyexploration and reuse on the Web.

For future work, we are particularly interested in furtherelaborating the comparison between the browsing behav-ior of different ontologies on BioPortal. In doing that, wemight be able to identify commonalities for groups of ontolo-gies that trigger differences in the way users interact withontologies on BioPortal. Understanding the different waysin which semantic structures may influence ontology explo-ration behavior has the potential to help in the design ofBioPortal as well as in the development of more effectiveand more reusable ontologies. Additionally, we would liketo compare the results presented in this paper to usabilitystudies, testing the utility of different features in BioPortaland how they—on a qualitative level—influence the brows-ing behavior of users.

7. ACKNOWLEDGMENTSThis work was supported by Grant GM086587 from the

U.S. National Institute of General Medical Sciences (NIGMS)of the National Institutes of Health. The Protege project issupported by NIGMS grant GM103316. NCBO is fundedby National Institutes of Health grant U54 HG004028.

8. REFERENCES[1] O. Bodenreider. The unified medical language system

(umls): integrating biomedical terminology. Nucleicacids research, 32(suppl 1):D267–D270, 2004.

[2] J. Borges and M. Levene. Evaluating variable-lengthmarkov chain models for analysis of user webnavigation sessions. IEEE Trans. on Knowl. and DataEng., 19(4):441–452, Apr. 2007.

[3] F. Chierichetti, R. Kumar, P. Raghavan, andT. Sarlos. Are web users really markovian? InProceedings of the 21st international conference onWorld Wide Web, pages 609–618. ACM, 2012.

[4] M. Deshpande and G. Karypis. Selective markovmodels for predicting web page accesses. ACM Trans.Internet Technol., 4(2):163–184, May 2004.

[5] L. Espın Noboa, F. Lemmerich, P. Singer, andM. Strohmaier. Discovering and characterizingmobility patterns in urban spaces: A study ofmanhattan taxi data. In Proceedings of the 25thInternational Conference Companion on World WideWeb, pages 537–542, 2016.

[6] S. M. Falconer, T. Tudorache, and N. F. Noy. Ananalysis of collaborative patterns in large-scaleontology development projects. In M. A. Musen andO. Corcho, editors, K-CAP, pages 25–32. ACM, 2011.

[7] B. Fortuna, D. Mladenic, and M. Grobelnik. Usermodeling combining access logs, page content andsemantics. arXiv preprint arXiv:1103.5002, 2011.

[8] A. Gomez, M. Fernandez, and O. Corcho. Ontologicalengineering. 2 nd Edition, Springer-Verlag, 2004.

[9] Z. Hellman and A. Gal. Structural Conflict Avoidancein Collaborative Ontology Engineering, pages 216–225.Springer Netherlands, Dordrecht, 2005.

[10] J. Hoxha, M. Junghans, and S. Agarwal. Enablingsemantic analysis of user browsing patterns in the webof data. arXiv preprint arXiv:1204.2713, 2012.

783

[11] S. Jupp, T. Burdett, J. Malone, C. Leroy, M. Pearce,J. McMurry, and H. Parkinson. A new ontologylookup service at embl-ebi.

[12] R. Kumar and A. Tomkins. A characterization ofonline browsing behavior. In Proceedings of the 19thinternational conference on World wide web, pages561–570. ACM, 2010.

[13] C. Lakshminarayan, R. Kosuru, and M. Hsu.Modeling complex clickstream data by stochasticmodels: Theory and methods. In Proceedings of the25th International Conference Companion on WorldWide Web, pages 879–884, 2016.

[14] J. Lehmann, C. Muller-Birn, D. Laniado, M. Lalmas,and A. Kaltenbrunner. Reader preferences andbehavior on wikipedia. In Proceedings of the 25thACM conference on Hypertext and social media, pages88–97. ACM, 2014.

[15] R. Lempel and S. Moran. The stochastic approach forlink-structure analysis (salsa) and the tkc effect.Comput. Netw., 33(1-6):387–401, June 2000.

[16] M. A. Musen, N. F. Noy, N. H. Shah, P. L. Whetzel,C. G. Chute, M.-A. Story, B. Smith, et al. Thenational center for biomedical ontology. Journal of theAmerican Medical Informatics Association,19(2):190–195, 2012.

[17] N. F. Noy, D. L. McGuinness, et al. Ontologydevelopment 101: A guide to creating your firstontology, 2001.

[18] N. F. Noy, N. H. Shah, P. L. Whetzel, B. Dai,M. Dorf, N. Griffith, C. Jonquet, D. L. Rubin, M.-A.Storey, C. G. Chute, et al. Bioportal: ontologies andintegrated data resources at the click of a mouse.Nucleic acids research, page gkp440, 2009.

[19] C. Pesquita and F. M. Couto. Predicting the extensionof biomedical ontologies. PLoS Comput Biol,8(9):e1002630, 09 2012.

[20] P. L. T. Pirolli and J. E. Pitkow. Distributions ofsurfers’ paths through the world wide web: Empiricalcharacterizations. World Wide Web, 2(1-2):29–45,1999.

[21] M. Salvadores, P. R. Alexander, M. A. Musen, andN. F. Noy. Bioportal as a dataset of linked biomedicalontologies and terminologies in rdf. Semantic web,4(3):277–284, 2013.

[22] R. Sen and M. Hansen. Predicting a web user’s nextaccess based on log data. Journal of ComputationalGraphics and Statistics, 12(1):143–155, 2003.

[23] E. Simperl and M. Luczak-Rosch. Collaborativeontology engineering: a survey. The KnowledgeEngineering Review, 29(01):101–131, 2014.

[24] P. Singer, D. Helic, A. Hotho, and M. Strohmaier.Hyptrails: A bayesian approach for comparinghypotheses about human trails on the web. InProceedings of the 24th International Conference onWorld Wide Web, WWW ’15, pages 1003–1013, 2015.

[25] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug,W. Ceusters, L. J. Goldberg, K. Eilbeck, A. Ireland,C. J. Mungall, et al. The obo foundry: coordinatedevolution of ontologies to support biomedical dataintegration. Nature biotechnology, 25(11):1251–1255,2007.

[26] M. Strohmaier, S. Walk, J. Poschko, D. Lamprecht,T. Tudorache, C. Nyulas, M. A. Musen, and N. F.Noy. How ontologies are made: Studying the hiddensocial dynamics behind collaborative ontologyengineering projects. Web Semantics: Science,Services and Agents on the World Wide Web, 20(0),2013.

[27] S. Van Laere, R. Buyl, and M. Nyssen. A Method forDetecting Behavior-Based User Profiles inCollaborative Ontology Engineering. In On the Moveto Meaningful Internet Systems: OTM 2014Conferences, pages 657–673. Springer, 2014.

[28] S. Walk, P. Singer, L. E. Noboa, T. Tudorache, M. A.Musen, and M. Strohmaier. Understanding how usersedit ontologies: Comparing hypotheses about fourreal-world projects. In The Semantic Web - ISWC2015 - 14th International Semantic Web Conference,pages 551–568, 2015.

[29] S. Walk, P. Singer, and M. Strohmaier. Sequentialaction patterns in collaborative ontology-engineeringprojects: A case-study in the biomedical domain. InProceedings of the 23rd ACM International Conferenceon Conference on Information and KnowledgeManagement, CIKM 2014, Shanghai, China,November 3-7, 2014, pages 1349–1358, 2014.

[30] S. Walk, P. Singer, M. Strohmaier, D. Helic, N. F.Noy, and M. A. Musen. How to apply markov chainsfor modeling sequential edit patterns in collaborativeontology-engineering projects. International Journal ofHuman-Computer Studies, 84:51 – 66, 2015.

[31] S. Walk, P. Singer, M. Strohmaier, T. Tudorache,M. A. Musen, and N. F. Noy. Discovering beatenpaths in collaborative ontology-engineering projectsusing markov chains. Journal of BiomedicalInformatics, 51:254–271, 2014.

[32] H. Wang, T. Tudorache, D. Dou, N. F. Noy, andM. A. Musen. Analysis of user editing patterns inontology development projects. In On the Move toMeaningful Internet Systems: OTM 2013 Conferences,pages 470–487. Springer, 2013.

[33] P. L. Whetzel, N. F. Noy, N. H. Shah, P. R.Alexander, C. Nyulas, T. Tudorache, and M. A.Musen. Bioportal: enhanced functionality via new webservices from the national center for biomedicalontology to access and use ontologies in softwareapplications. Nucleic acids research, 39(suppl2):W541–W545, 2011.

784

how users explore ontologies on the web: a study of ncbo’s ...papers. · how users explore...

Documents