

CHARACTERIZATION AND ANALYSIS OF USAGE PATTERNS IN LARGE MULTIMEDIA WEBSITES

Mike Wong, Bibek Bhattarai, Rahul Singh*

Department of Computer Science, San Francisco State University, San Francisco, CA 94132

{mikewong, bdb}@sfsu.edu, *[email protected]

ABSTRACT

User behavior on a website is a critical indicator of the website’s usability. Therefore an understanding of usage patterns is essential to website design optimization. This challenging problem is prominent in media-rich websites, due both to the complexity of analyzing media-based content and the challenge of understanding user-media interactions. Towards this goal, we present a novel paradigm encompassing a unique combination of interactive multimodal visualization and media-based web content mining for information goal extraction. This integrated approach offers several advantages: it delivers a gestalt view of user behavior, it offers an interactive, multimodal environment for pattern discovery, and it provides a feedback mechanism for improving website design. We evaluate these features with real data from the multimedia website of the SkyServer project.

1. INTRODUCTION

User behavior for large, media-rich websites is complex and difficult to characterize. However, understanding usage patterns is a key step in optimizing and improving website design. For multimedia websites, the challenge arises not just from the use of media to express information, but also from the different interaction modalities that may be supported. For example, the SkyServer website has several modes of interaction, including static content browsing, JavaScript-enabled clickable images, and parametric and SQL-based database queries. Users often browse a few static pages first to get to the starting points of the dynamic querying tools. Other users might browse through the static content to get an understanding of the SkyServer site. Advanced users employ combinations of the dynamic content query tools. These diverse interactions require different strategies to understand what information the user may be looking for and to determine how successful the user was. An important deficiency of current approaches to this problem is that they address some aspects of user behavior but ignore others. For instance, usage mining can discover patterns in how users browse text and images, but cannot provide a semantic analysis of media. Web content mining can reveal semantic similarities between pages in a user session, but overlooks usage information for a topic in the website. Therefore research based on a single general approach provides a limited view of user behavior.

We present an integrated multimodal approach which addresses the challenges of developing an understanding of user behavior. Our research focuses on discovering, specifying, and quantifying patterns in browsing text and image-based media. We realize these goals by combining multimodal visualization of user behavior, information foraging techniques, and knowledge discovery techniques. The guiding inspiration for our approach is the complementary interaction between website usage information (e.g. who is using the website) and semantic analysis of the content (e.g. why are they using the website; what information are they seeking?). This integrated approach offers a multi-faceted view of user behavior and a more complete picture than was previously available. We use real data from SkyServer as an example of a large multimedia, multimodal website [10]. SkyServer is a suite of web-based applications built around the Sloan Digital Sky Survey (SDSS) database. The SDSS database houses 15 terabytes of images and data on galaxies and stars. Some of the media interaction modes and services include: galaxy image access through a virtual telescope, astronomical spectrum data access, self-paced astronomy tutorials, direct database access through SQL queries and stored procedures, and annotated tours of the night sky.

2. RELATED RESEARCH AND OVERVIEW OF THE PROPOSED APPROACH

The direct approach to web usage analysis is web usage mining [4][11]. Web usage mining extracts patterns from the website usage logs [6][8]. Web usage mining typically begins with clustering user requests into user sessions [7]. Sessions are then clustered to describe usage patterns. Our approach uses multimodal visualization of usage data to allow the web designer to interactively cluster user sessions by user-driven criteria. Fundamentally, we treat usage information as vectors of attributes, and project the information onto context-sensitive, low-dimensional subspaces or other semantically meaningful representations. Each projection is implemented as an interface metaphor. For example, our system geolocates users and shows them on a map. All of the interface metaphors are reflectively interrelated; selections in one subspace constrain all subspaces. This multimodal visualization provides a flexible and semantically meaningful usage pattern discovery system.

SFSU-CS-TR-20

Another approach to web usage analysis is web content mining (predominantly text-oriented) [2][3][6]. These works start by exploring web content and web structure information. The information goal of a user session is extracted with information foraging techniques [5]. User sessions are clustered based on information goal similarity. We extend this approach by performing text and image semantic analysis in conjunction with web usage mining. We also broaden information goal extraction to include dynamic web page content and web database content. Usage analysis based solely on static page content mining cannot provide accurate usage information for websites which are fundamentally multimodal. Likewise, usage analyses based on web content mining do not reveal the fundamental usage relationships (e.g. spatio-temporal relationships), whereas visualizing usage information does. In short, both approaches have strengths and gaps in describing usage behavior. We believe that the strength of each approach addresses the weakness of the other.

3. DESCRIPTION OF THE PROPOSED APPROACH

We postulate that an integrated approach provides a more complete characterization of user behavior. In this paper, we describe how three techniques (information goal extraction, usage pattern discovery, and knowledge discovery in database content) reveal user behavior. Figure 1 shows a conceptual view of the information gleaned from this collaboration of techniques.

3.1. Usage Pattern Visualization

Our approach begins with a multimodal visualization interface which exposes different aspects of user behavior in the context of appropriate interface metaphors. Figure 2 shows the different modes of visualization. Information about user location is shown on a world map, along with referential information such as political borders and known observatories. Usage log information (e.g. hits per hour, sessions per browser, etc.) is shown on a chart. The generality of this approach is important to our goal of free discovery; only one visualization is domain-specific and that is the projection of web pages about galaxies and stars onto a Cartesian map of the night sky. Thus, the web designer may explore the usage data in this context-rich multimodal environment with fewer constraints than many automated approaches. A web designer can click on a feature shown in the various contexts; the system then reveals only usage patterns which include the selected features on all metaphors. The web designer can continue constraining the search by selecting features on other metaphors. Once a usage pattern is selected, the web designer can request usability feedback on the selected usage pattern, and then continue discovering usage patterns or start a new search.
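The linked-selection behavior described above can be illustrated with a minimal sketch. It assumes that each session is reduced to a small record of attributes and that a selection on any interface metaphor becomes an equality constraint; the `Session` fields, the sample sessions, and the function names are hypothetical and are not taken from the system itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    # Hypothetical attribute vector for one user session.
    session_id: int
    country: str
    browser: str
    month: str


def select(sessions, constraints):
    """Return the sessions that satisfy every (attribute, value) constraint."""
    return [
        s for s in sessions
        if all(getattr(s, attr) == value for attr, value in constraints.items())
    ]


def facet_counts(sessions, attribute):
    """Count sessions per value of one attribute (the data behind one chart)."""
    counts = {}
    for s in sessions:
        key = getattr(s, attribute)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Invented sample data, for illustration only.
SESSIONS = [
    Session(1, "UK", "Firefox", "2004-04"),
    Session(2, "UK", "IE", "2004-04"),
    Session(3, "US", "IE", "2004-04"),
    Session(4, "UK", "IE", "2003-11"),
]

# A click on the chart constrains the month; a click on the map adds a country.
constraints = {"month": "2004-04"}
april = select(SESSIONS, constraints)
constraints["country"] = "UK"
april_uk = select(SESSIONS, constraints)
browsers_in_selection = facet_counts(april_uk, "browser")
```

Because every view is recomputed from the same constrained session list, a selection made in one view narrows what all other views display.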

3.2. Knowledge Discovery in Database Content

We use knowledge discovery in database content (KDDC) to qualify the celestial features that users seek. For each celestial feature the system extracts: hits per feature, spectra, images, and morphology. This information feeds back into the interactive visualization by allowing usage pattern selection by celestial feature. The information also complements information goal extraction: it provides richer semantics for image media and supplies the data needed to reconstruct the dynamic web pages. This domain-specific information, derived from general data mining techniques, provides a common basis for communication between information goal extraction and usage pattern discovery.

3.3. Information Goal Extraction

Figure 3 shows the process flow for the proposed approach, with the usage pattern analysis visualization shown in the top-left and the database knowledge discovery step shown in the top-middle.

Semantic Analysis of Textual and Image-Based Content: The system constructs a web page vector and a term vector from the website structure and content. For dynamic web service content, scripting complicates content extraction. To address this in a general manner, the system first extracts the parameter-value pairs from the HTTP request. It then queries the data derived from KDDC for dynamic content. The dynamic content is used to reconstruct the content of the dynamic web page, which is then added to the web page vector. Also, each unique term from the dynamic content is extracted and added to the term vector. This process is repeated for each dynamic HTTP request present in the session.

For image analysis, the system extracts and analyzes the images present in every page of the website. Images are analyzed by color content to create a histogram-based image signature. The histogram-based approach was selected because it is more broadly applicable; for SkyServer, a shape recognition-based approach is also appropriate [1]. The images are semantically enriched by the surrounding text on the web page. The associated semantic information is weighted and added to the term vector. The system then constructs a structural adjacency matrix T. The importance of a term t in a page p is then calculated using normalized Term Frequency Inverse Document Frequency (TFIDF) [9]. Using the TFIDF values, the term vector, and the web page vector, we construct a term-page matrix, $TP_{TFIDF}$.

Figure 1. An integrated approach better characterizes user behavior

Figure 2. The interactive visualization interface showing the features of user behavior
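As an illustration of the term-page matrix, the following sketch builds normalized TFIDF weights from tokenized pages. The paper does not state which TFIDF variant or normalization it uses, so the weighting `tf * log(N / df)` and the scaling of each page column to unit length are assumptions; the function name and the three sample pages are hypothetical.

```python
import math
from collections import Counter


def term_page_matrix(pages):
    """Build a term-page matrix of normalized TF-IDF weights.

    pages maps a page id to its list of tokens. Returns (terms, page_ids,
    matrix), where matrix[i][j] is the weight of terms[i] in page_ids[j].
    Assumed weighting: tf * log(N / df), each page column scaled to unit length.
    """
    page_ids = sorted(pages)
    terms = sorted({t for tokens in pages.values() for t in tokens})
    n_pages = len(page_ids)
    # Document frequency: number of pages in which each term occurs.
    df = Counter(t for tokens in pages.values() for t in set(tokens))
    matrix = [[0.0] * n_pages for _ in terms]
    for j, pid in enumerate(page_ids):
        tf = Counter(pages[pid])
        weights = {t: tf[t] * math.log(n_pages / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for i, t in enumerate(terms):
            if norm > 0 and t in weights:
                matrix[i][j] = weights[t] / norm
    return terms, page_ids, matrix


# Invented pages, for illustration only.
PAGES = {
    "index": ["sky", "survey", "tools"],
    "search": ["galaxy", "search", "sky"],
    "tools": ["tools", "search", "galaxy", "galaxy"],
}

terms, page_ids, tp = term_page_matrix(PAGES)
```

Terms reconstructed from dynamic pages, or taken from the text surrounding an image, would simply be appended to the token list of the page they belong to before the matrix is built.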

Information Goal Extraction: The information goal is extracted as a subset of the information in the pages in a session. First, an importance value is assigned to each page visited by the user. Then the importance of a term is calculated as the sum, over the pages that contain it, of the term’s TFIDF value in each page multiplied by that page’s importance value. Finally, the term list is sorted and the 20 most important terms are used to form the user’s information goal summary. The image information goal is extracted as a weighted list of the images present in the pages visited during the session. Each image is weighted according to the importance value of the page to which it belongs.
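The term-ranking step can be sketched as follows. The text does not say how page importance values are assigned, so they are taken here as given inputs; the function name, the TFIDF values, and the importance values are all hypothetical.

```python
def information_goal(tfidf, page_importance, k=20):
    """Rank terms by the sum over session pages of TFIDF(term, page) * importance(page).

    tfidf maps term -> {page: weight}; page_importance maps each page visited
    in the session to its importance value. Returns the top-k (term, score) pairs.
    """
    scores = {}
    for term, per_page in tfidf.items():
        score = sum(
            weight * page_importance[page]
            for page, weight in per_page.items()
            if page in page_importance  # only pages visited in this session
        )
        if score > 0:
            scores[term] = score
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Invented values, for illustration only.
TFIDF = {
    "galaxy": {"search": 0.8, "tools": 0.5},
    "sky": {"index": 0.4, "search": 0.3},
    "survey": {"index": 0.9},
    "help": {"faq": 0.7},
}
IMPORTANCE = {"index": 0.2, "search": 1.0}

goal = information_goal(TFIDF, IMPORTANCE, k=2)
goal_terms = [term for term, _ in goal]
```

The same weighting idea applies to images: each image in a visited page would carry the importance value of that page.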

User Flow Computation: For the extracted information goal, a technique based on information foraging theory [5] is used to compute the user flow through the website. The computed user flow predicts the probability that other users with a similar goal will succeed in reaching their destination page. The information correlation between a link and the information goal is calculated as the sum of the TFIDF values of all the terms that are present in both the link and the information goal. Where the link contains no text, the text in the title of the distal page is used to calculate the information correlation. This approach is based on the approach used in [3], but we extend the idea by combining multimedia (text and image) information in the correlation calculation. Image information correlation is calculated as a quantitative comparison of the signature of each image in the information goal list with the signatures of the images present in each page of the website. Finally, we sum the text information correlation and image information correlation values to construct the information correlation matrix IC. The user flow is computed by simulating users through an activation function A(t). The total percentage of users in a page at a given time depends on the total information correlation value of all the links pointing to the page.

$$A(t) = \alpha \times IC \times A(t-1) + E \qquad (1)$$

The dampening factor α controls the fraction of users who continue browsing to the next step of the simulation. E simulates users flowing through the links from the entry (or start) page of the usage pattern. The initial activation vector is A(1) = E. The final activation vector, A(n), gives the percentage of users in each node of the website after n simulation steps.
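A minimal sketch of this simulation iterates A(t) = α × IC × A(t−1) + E starting from A(1) = E. It assumes that `ic[i][j]` holds the information correlation of the link from page j to page i, and that each column of IC is normalized to sum to 1 so that it acts as a distribution over outgoing links; the normalization, the function names, and the three-page example are assumptions rather than details from the paper.

```python
def normalize_columns(ic):
    """Scale each column of IC to sum to 1 (an assumed normalization)."""
    n = len(ic)
    col_sums = [sum(ic[i][j] for i in range(n)) for j in range(n)]
    return [
        [ic[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def simulate_flow(ic, entry, alpha, steps):
    """Return A(steps) for A(t) = alpha * IC * A(t-1) + E with A(1) = E."""
    n = len(entry)
    activation = list(entry)
    for _ in range(steps - 1):
        activation = [
            alpha * sum(ic[i][j] * activation[j] for j in range(n)) + entry[i]
            for i in range(n)
        ]
    return activation


# Invented site: 0 = index, 1 = tools, 2 = search.
# Links: index -> tools (3.0), index -> search (1.0), tools -> search (2.0).
RAW_IC = [
    [0.0, 0.0, 0.0],
    [3.0, 0.0, 0.0],
    [1.0, 2.0, 0.0],
]
IC = normalize_columns(RAW_IC)
E = [1.0, 0.0, 0.0]  # all users enter at the index page

a3 = simulate_flow(IC, E, alpha=0.6, steps=3)
shares = [a / sum(a3) for a in a3]  # activation expressed as a share per page
```

In this example the tools link on the index page carries three times the correlation of the search link, so most of the flow reaches the search page indirectly through the tools page.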

Shortest Path Computation and Comparison: For each user session in a usage pattern, the algorithm computes the shortest path from the start page to the final page. Our underlying assumption is that the shortest path represents the optimal (most direct) path to the desired information goal. Comparison of the actual user paths with the shortest path provides an analysis of how well the links are organized in the website.
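The comparison can be sketched with a breadth-first search over the link graph. The paper does not name the shortest-path algorithm; breadth-first search is assumed here because every hyperlink counts as one click. The link graph, the session path, and the function names are hypothetical.

```python
from collections import deque


def shortest_path(links, start, goal):
    """Breadth-first search on a directed link graph; returns a page list or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            path = []
            while page is not None:
                path.append(page)
                page = previous[page]
            return path[::-1]
        for target in links.get(page, ()):
            if target not in previous:
                previous[target] = page
                queue.append(target)
    return None


def extra_clicks(links, session_path):
    """Number of clicks the user made beyond the shortest start-to-end path."""
    best = shortest_path(links, session_path[0], session_path[-1])
    if best is None:
        return None
    return (len(session_path) - 1) - (len(best) - 1)


# Invented link graph, for illustration only.
LINKS = {
    "index": ["tools", "search"],
    "tools": ["search"],
    "search": [],
}
SESSION = ["index", "tools", "search"]

best_path = shortest_path(LINKS, "index", "search")
detour = extra_clicks(LINKS, SESSION)
```

A positive detour value marks sessions in which the user took a longer route than the link structure required, which is the kind of signal a web designer would inspect further.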

4. EXPERIMENT AND RESULTS

As a case study we analyze a simple usage pattern from a subset of the usage data from SkyServer. This study illustrates how our system finds usability issues in large multimedia websites. Using the parameters from the same study, we conducted an experiment to evaluate text-and-image information correlation versus text-based information correlation.

As previously mentioned, the system begins with the visual interface shown in Figure 2. A brief look at the hits per month in the chart reveals that April 2004 was very busy for the given usage data sample, contributing 28.57% of the traffic between May 2003 and October 2004. We select (at random) traffic from Oxford, England, which is shown on the map as a listed observatory. With these constraints, the system reveals 4 user sessions, one of which is long enough to have a meaningful information goal. This user starts the session from the index page of the website, then browses to a spatial search page and executes three dynamic queries, each looking for a galaxy at a specific point in the sky. The user information goal predicted by the system consists of the dynamic query results and static web page terms.

In Figure 4, lines represent hypertext links and nodes represent web pages. Figure 4 shows the user flow (orange solid line), the user session path (green dotted line), the shortest path (blue broken line) and the user percentage computed by the system. Selecting a node shows a thumbnail of the page and its URL. The system shows that the user missed the shortest path between the index and search pages of the site. The search page was ultimately reached through the tool page. We found that the tool page link visually dominated the search page link on the index page. This is a plausible explanation for the user’s longer, indirect path. A web designer can graphically redesign the search page link, and use our system to monitor the results. The user flow diagram shows the probability of success for other users with a similar information goal. For this study, we used a dampening factor of α = 60%.

Figure 3. Information goal extraction flow diagram

Table 1 shows the results of the experimental evaluation of text-based information correlation versus text-and-image information correlation. In the case of text correlation, no flow was directed to the traffic and project subsections of the website because these subsections had no textual correlation with the information goal. When we performed text and image correlation, some user flow was directed to the traffic and project sections because of the image contribution to information correlation. That is, images which were navigational category motifs revealed web structure information which text-based techniques could not discover.
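The image contribution can be illustrated with a small sketch that adds an image term to the text correlation. The paper describes the image signature only as histogram-based and the comparison only as quantitative, so the coarse RGB binning and the use of histogram intersection as the similarity measure are assumptions; all names and sample values are hypothetical.

```python
def color_signature(pixels, bins=4):
    """Normalized color histogram of (r, g, b) pixels with 0-255 channels."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        index = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[index] += 1.0
    total = len(pixels)
    return [h / total for h in hist] if total else hist


def histogram_intersection(sig_a, sig_b):
    """Similarity in [0, 1] between two normalized histograms (assumed measure)."""
    return sum(min(a, b) for a, b in zip(sig_a, sig_b))


def text_correlation(link_terms, goal_tfidf):
    """Sum of the TFIDF values of terms present in both the link and the goal."""
    return sum(weight for term, weight in goal_tfidf.items() if term in link_terms)


def information_correlation(link_terms, goal_tfidf, page_sigs, goal_sigs):
    """Text correlation plus the summed image-signature similarity."""
    image_part = sum(
        histogram_intersection(g, p) for g in goal_sigs for p in page_sigs
    )
    return text_correlation(link_terms, goal_tfidf) + image_part


# Invented images and goal terms, for illustration only.
goal_image = color_signature([(10, 10, 30)] * 3 + [(250, 250, 250)])
page_image = color_signature([(10, 10, 30)] * 2 + [(200, 30, 30)] * 2)
GOAL_TFIDF = {"galaxy": 0.8, "sky": 0.4}

text_only = text_correlation({"traffic"}, GOAL_TFIDF)
combined = information_correlation({"traffic"}, GOAL_TFIDF, [page_image], [goal_image])
```

In this toy case the link shares no terms with the goal, so its text-only correlation is zero, yet the shared color content of the images gives it a nonzero combined correlation, which parallels the Traffic and Project rows of Table 1.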

5. CONCLUSION

Our research shows that through visualization, web designers can explore and experiment with discovering usage patterns from user behavior. Furthermore, once a pattern has been selected, a combination of algorithmic techniques functioning at both the web-content and database levels can reveal the information goals of the users, and the extent to which they were optimally met. As the study indicates, this information can be used to infer the usability of the website design and determine how it may be improved. Furthermore, the experiment shows that text and image-based information goal extraction reveals web structure information that text-based techniques alone could not discover.

6. REFERENCES

[1] Anderson, B., Connolly, A., Moore, A., Nichol, R. Fast Nonlinear Regression via Eigenimages Applied to Galactic Morphology. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD ’04), Research Track, pp. 40-48, 2004.

[2] Blackmon M. H., Polson P. G., Kitajima M., Lewis C. Cognitive Walkthrough for the Web. ACM Proceedings of Conference on Human Factors in Computing Systems (CHI ’02), 2002.

[3] Chi E. H., Pirolli P. L., Chen K., Pitkow J. Using Information Scent to Model User Information Needs and Actions on the Web. ACM Proceedings of Conference on Human Factors in Computing Systems (CHI ’01), 2001.

[4] Kosala, R., and Blockeel H. Web Mining Research: A Survey. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD ’00) Explorations 2(1), pp. 1-15, 2000.

[5] Pirolli P. L., and Card S. K. (1999) Information foraging. Psychological Review. 106: p. 643-675.

[6] Pirolli, P., Pitkow, J. and Rao, R. Silk from a sow’s ear: Extracting usable structures from the web. ACM Proceedings of Conference on Human Factors in Computing Systems (CHI ’96), pp. 118-125, 1996.

[7] Pitkow, J. Characterizing browsing strategies in the World-Wide Web. Computer Networks and ISDN Systems, 27(6) pp. 1065-1073, 1995.

[8] Pitkow, J. and Pirolli, P. Mining longest repeating subsequences to predict World Wide Web surfing. In the Proceedings of the USENIX Conference on Internet, 1999.

[9] Schütze, H., Manning, C. (1999) Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.

[10] Sloan Digital Sky Survey project’s website SkyServer: http://skyserver.sdss.org/

[11] Srivastava, J. et al. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD ’00) Explorations 1(2), pp. 12-23, 2000.

Figure 4. User flow over static and dynamic content with semantic contributions from text and image media

Table 1. Text vs. Text/Image Correlation

Website Subsection    Text Only Correlation (%)    Text/Image Correlation (%)
Index Page            25.16                        25.29
Tools                 33.06                        28.81
Help                  1.65                         1.55
Traffic               0.00                         4.10
Project               0.00                         0.38
