logs miner : portal for data mining web access logs

LOGS MINER : PORTAL FOR DATA MINING WEB ACCESS LOGS Presented by Andrew Wong 9th Annual IUG meeting at HKU Library 8 December 2009


Logs Miner Training Workshop

Logs Miner: Portal for Data Mining Web Access Logs
Presented by Andrew Wong

9th Annual IUG meeting at HKU Library 8 December 2009

Agenda
- Definitions
- Motivations
- Architecture of Logs Miner
- Logs Miner User Interface
- Logs Miner reports
- Benefits
- Future development

Definitions
Web data mining: the application of data mining methodologies, techniques, and models to the variety of data forms, structures, and usage patterns that comprise the World Wide Web (Markov, Z. & Larose, D. T. 2007)

Three scopes of Web data mining:
- Web content mining
- Web structure mining
- Web log mining

Definitions
Web log mining
- Discovers user access patterns from Web usage logs
- Is also called web usage mining
- Three processing stages: Pre-processing, Pattern discovery, Pattern analysis

Purposes for web logs mining
- Identify and classify different groups of patrons
- Understand search patterns of different groups of patrons
- Adapt web user interfaces to suit users' needs
- Provide statistical data for collection management

Web logs

lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"

lbnxyz.ust.hk - - [16/Nov/2009:12:03:27 +0800] "GET /catalog/?s=brandy&feed=rss HTTP/1.1" 304 - "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=10486796160015392754)"

lbz222.ust.hk - - [16/Nov/2009:12:03:30 +0800] "GET /stream/xml/stream.xml HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"

lbz333.ust.hk - - [16/Nov/2009:12:03:33 +0800] "GET /catalog/?s=brandy HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"

lbz444.ust.hk - - [16/Nov/2009:12:03:35 +0800] "GET /stream/xml/stream.xml HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"

Web logs provide a large amount of information on user actions.

Web logs

Field                          Value
Remote host field              lbz000.ust.hk
Date/Time field                [16/Nov/2009:12:03:26 +0800]
HTTP request                   GET /catalog/ HTTP/1.1
Status code field              200
Transfer Volume (Bytes) field  20283
User agent field               "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"
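The field breakdown above can be reproduced with a short script. This is a minimal sketch, not part of Logs Miner or AWStats: the regular expression, the function name `parse_line`, and the key names are assumptions chosen for illustration, and the sample line is the one shown on the slide with its closing quote restored.

```python
import re
from datetime import datetime

# Assumed pattern for one line in the Apache log layout shown on the slide:
# host, identity, user, [time], "request", status, bytes, "referrer", "user agent".
LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # Convert the text fields that have a natural Python type.
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec

sample = ('lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] '
          '"GET /catalog/ HTTP/1.1" 200 20283 "-" '
          '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) '
          'Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"')

record = parse_line(sample)
print(record["host"], record["request"], record["status"], record["size"])
```

A "-" in the transfer volume position (as in the 304 responses above) is stored as 0 here; that is a convenience of this sketch, not something the slides specify.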

lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"

Various types of web log
Common Log Format: usually used by Apache web server logs and Apache Tomcat logs
e.g. Library web server, INNOPAC, SmartCAT, Institutional Repository

lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"

Microsoft IIS Log Format
e.g. ILLiad, Class Registration Form

2009-07-20 01:22:44 GET /ce/ - 66.249.71.201 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - 401 1891 0

Include:
- Remote host field
- Date field
- Time field
- HTTP request field
- Status code field
- Transfer Volume (Bytes)
- Referrer field
- User agent field

Various types of web log
Microsoft Streaming Server
e.g. Streaming video

143.89.160.133 2009-09-02 10:21:20 - /arc-open/oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv 0 6 5 200 {3300AD50-2C39-46c0-AE0A-41B7139D4722} 11.0.5721.5251 en-US WMFSDK/11.0.5721.5251_WMPlayer/11.0.5721.5268 - wmplayer.exe 11.0.5721.5145 Windows_XP 5.1.0.2600 Pentium 3816 216613290 2830093 rtsp TCP - - - 2244972 2244972 398 398 0 0 0 0 0 0 1 1 100 143.89.105.168 lbms07.ust.hk 1 0 - 245 file://C:\wmhome\hkust\arc-open\oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv mms://stream.ust.hk/arc-open/oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv - - 0

Fields only for streaming server:
- Video codec
- Audio codec
- Duration
- Client's player

Web Logfile analysis tools
Tools used to analyze web access logs:
- AccessWatch v1.33
- Analog 6.0
- Pwebstats
- RefStats 1.2
- INNOPAC Millennium Web Report Search Statistics

Others:
- AWStats
- Sawmill Analytics
- Webalizer

Motivations
- Create a portal for storing and analyzing all the different web access logs.
- Provide an interface for querying web access logs to generate dynamic statistical reports.

AWStats as core
- Able to analyze different log formats, including Apache NCSA combined log files, IIS log files (W3C), and streaming server log files
- Feasible to analyze non-standardized log formats
- Works from the command line and from a browser as CGI
- Build a web interface to query the data (Logs Miner)
- Pre-process the raw log data, running large-scale queries in a cron job
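The cron-driven pre-processing step could look roughly like the sketch below, which builds the AWStats command line for each configured site. The `-config=` and `-update` options are standard AWStats command-line options; the install path, the config names, and the helper names `update_command` and `update_all` are hypothetical and would differ on a real installation.

```python
import subprocess

# Hypothetical install path and config names; adjust to the local setup.
AWSTATS = "/usr/local/awstats/wwwroot/cgi-bin/awstats.pl"
CONFIGS = ["library", "repository", "archives"]

def update_command(config):
    """Build the command that asks AWStats to process new log lines for one config."""
    return ["perl", AWSTATS, f"-config={config}", "-update"]

def update_all(configs, run=subprocess.run):
    """Run the update for every config and return the configs whose update failed.

    `run` is injectable so the loop can be exercised without AWStats installed.
    """
    failed = []
    for config in configs:
        result = run(update_command(config))
        if result.returncode != 0:
            failed.append(config)
    return failed

# Show the command that a cron job would execute for one site.
print(" ".join(update_command("library")))
```

A cron entry would then call a script like this on a schedule, so that the heavy log processing happens ahead of time rather than when a user opens a report.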

AWStats as core
- Unlimited log file size
- Reports the number of unique visitors and visits
- Provides plug-ins to extend the functionality
- Open source

Requirements for AWStats
- Web log files: the raw data must contain web log components such as client IP address, status code, and HTTP request field
- Any OS platform that supports Perl

System configuration of Logs Miner:
- PC-level workstations
- CentOS release 5.4
- Apache web server 2.0
- Perl v5.8.8
- AWStats 6.9

Logs Miner architecture

[Architecture diagram. Labels: Raw logs (Library web server, INNOPAC, SmartCAT, Institutional repository, Digital archives ..); Preprocessing; AWStats; Pattern discovery, pattern analysis; Logs Miner UI; AWStats reports; Access statistics; Customized report]

Three phases: Preprocessing, Pattern discovery, Pattern analysis

Preprocessing consists of converting the usage, content, and structure information contained in the various available data sources into the data abstraction

Pattern discovery

Pattern analysis

Logs Miner user interface

A portal for mining web access log data and retrieving information about the usage of multiple web applications.

Built on top of AWStats, an open source logs analyzer.

Currently set up to analyze more than 20 library servers and applications including Library Web Server, INNOPAC, Institutional Repository, Digital Archives, SmartCAT, ILLiad, Streaming Video Server, etc.

Logs Miner user interface
URL: https://lbnx16.ust.hk/mining
- Includes 20+ applications
- Provides three types of report
- Filtered by URL or Host
- Generates yearly or monthly reports
- Query box which supports regular expressions

Logs Miner user interface
URL: https://lbnx16.ust.hk/mining
- Tips for constructing a query string

Three types of reports
- AWStats reports
- Access statistics, filtered by URL / Host
- Customized reports

AWStats report

Reports the number of unique visitors and the number of visits. These numbers exclude visits from robots.

AWStats report
Created by plugins: geoip

AWStats report
Work in progress
HKUST's iPhone application for receiving Library information and searching on SmartCAT
- Created by Publishing Technology Center of HKUST
- Released at the end of October

Access statistics report
Query box which supports regular expressions

Access statistics report filtered by URL

Access statistics report filtered by Host

Example (1) Usage of a database

Database title: Cambridge Journals Online
URL: http://library.ust.hk/cgi/db/cambridge.pl?subscribedTo
Server name: library.ust.hk (Library web server)
Parameters: /cgi/db/cambridge.pl?subscribedTo
Include pattern: cgi\/db\/cambridge\.pl.+

I am going to show you some examples. Through these examples, you will see how Logs Miner works in real life. Let's start with a simple one. In this example, we want to know the access statistics for a database title.

First, we need to know the access point to this database. In our library, patrons do not access the database via a direct link. They click on a CGI script, which is found in the database list on the library web page or in the Webpac holding box. The CGI script then redirects to the database. So we need to know the URL of this CGI script.

In this example, this is the URL of the CGI script for this title. The URL provides two pieces of information. One is the server name, which tells us which server's access logs we should go through. The second is the path to the CGI file. The query string is derived from this path. I put it into the Include pattern box and click the submit button. The statistical report comes out. The report shows how many times the database was accessed.
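To see what the include pattern from this example selects, here is a small sketch using Python's `re` module. It assumes the pattern is applied as a regular-expression search against the request path and query string; the slides do not say exactly how Logs Miner applies it. Apart from the first entry, the request paths are made up for illustration.

```python
import re

# Include pattern from the slide.
include = re.compile(r"cgi\/db\/cambridge\.pl.+")

# The first path comes from the example; the others are hypothetical.
requests = [
    "/cgi/db/cambridge.pl?subscribedTo",
    "/cgi/db/cambridge.pl",
    "/cgi/db/other.pl?subscribedTo",
    "/catalog/",
]

matched = [path for path in requests if include.search(path)]
print(matched)
```

Note that the trailing `.+` requires at least one character after `cambridge.pl`, so under this assumption a request for the bare script without a query string would not be counted.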


Example (2) Usage of a document of HKUST Institutional Repository

Document: Long, Jiafu 2005, Autoinhibition of X11/Mint scaffold proteins revealed by the closed
URL: http://repository.ust.hk/dspace/bitstream/1783.1/2496/1/nsmb958.pdf
Server name: repository.ust.hk (HKUST Institutional Repository)
Parameters: /dspace/bitstream/1783.1/2496/1/nsmb958.pdf
Include pattern: \/1783\.1\/2496\/1\/nsmb958\.pdf

Example (2) Usage of a document of HKUST Institutional Repository

Similarly, we can generate access statistics on an e-journal title in the same way. First, we locate the URL of the CGI script. I found this URL in the Webpac holding box. The URL gives me the server name and the path to the CGI script. I fill in the data in the appropriate positions and click submit. The access report comes out.


Example (3) Access by particular group

Number of accesses to the Library web page from Library public workstations

Library web page
URL: http://library.ust.hk/
Server name: library.ust.hk (Library web server)

Client name convention
- OPAC workstation (lbb[nnn].ust.hk)
- IC workstation (lbc[nnn].ust.hk)
- Computer Lab (lba[nnn].ust.hk)

Include pattern: lb(a|b|c)[\d]+\.ust\.hk

Apart from filtering the result by URL, it is possible to filter the result by host/IP address.

In this example, I show how to get the number of accesses to the Library web pages from the Library public workstations.

Before going on, I need to find out the hostname convention of the public workstations.

The table shows the hostname convention for the public PCs.

I put this regex string into the include pattern box.
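The host filter can be tried out on its own. The sketch below assumes the intended include pattern is `lb(a|b|c)[\d]+\.ust\.hk` and that it is applied as a regular-expression search against the remote host field; the workstation numbers are invented for illustration.

```python
import re

# Assumed form of the include pattern for public workstations.
host_pattern = re.compile(r"lb(a|b|c)[\d]+\.ust\.hk")

# Hypothetical remote hosts following the naming convention above,
# plus a staff workstation and a bare IP address for contrast.
hosts = [
    "lbb012.ust.hk",   # OPAC workstation
    "lbc101.ust.hk",   # IC workstation
    "lba007.ust.hk",   # Computer Lab
    "lbz000.ust.hk",   # Library staff workstation
    "143.89.160.133",
]

public = [h for h in hosts if host_pattern.search(h)]
print(public)
```

Only the three public workstation names pass the filter, so the resulting figures would be restricted to that group.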

Example (3) Access by particular group

- This time, I click the Host radio button, put the query string in the include pattern box, and click the submit button.

The figures will be restricted to the library public workstations only.


Example (4) Exclude particular group

Number of accesses to Digital Archives from the HKUST campus, excluding HKUST Library staff

Digital university archives
URL: http://archives.ust.hk/
Server name: archives.ust.hk (Digital Archives)

Client name convention
- Library staff workstation (lbz[nnn].ust.hk)

In this example, I am going to show how Logs Miner can exclude noise when you generate a report. We created a browsing page for accessing streaming videos. The year and month pull-down menus are created dynamically. It will only show the year and month. When the patron accesses this page, it calls a program to check against the library catalogs. This implies that when the patron accesses this page, more than one clickstream data line is created.

Example (4) Exclude particular group

Include pattern: ^.+\.ust\.hk$
Exclude pattern: lbz.+\.ust\.hk

- So if I put the path to the file in the include pattern box, the result will include the ajax-search.php file. I call these links noise.

- If I want more accurate figures, then ..
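Combining an include pattern with an exclude pattern can be sketched as follows. This assumes the exclude pattern is meant to be `lbz.+\.ust\.hk`, that both patterns are applied as regular-expression searches against the remote host, and that a host is counted only when it matches the include pattern and not the exclude pattern. The helper name `keep` and the host names other than the `lbz` and `lbb` conventions are hypothetical.

```python
import re

include = re.compile(r"^.+\.ust\.hk$")   # any host on campus
exclude = re.compile(r"lbz.+\.ust\.hk")  # Library staff workstations (assumed form)

def keep(host):
    """True when the host matches the include pattern and not the exclude pattern."""
    return bool(include.search(host)) and not exclude.search(host)

# Hypothetical remote hosts for illustration.
hosts = [
    "lbz000.ust.hk",    # staff workstation: excluded
    "lbb012.ust.hk",    # public workstation: kept
    "pc1.cse.ust.hk",   # another campus host: kept
    "66.249.71.201",    # off-campus address: not included
]

counted = [h for h in hosts if keep(h)]
print(counted)
```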


Example (5) Number of virtual visits
- A virtual visit is defined as a user's request on the library's website in order to use one of the services provided by the library.
- One Key Performance Indicator: virtual visits per capita
- Includes the main web applications: Library web server, Innopac, SmartCAT (Next generation Catalogs), HKUST Institutional Repository, Digital Archives, HKUST ILLiad

Example (5) Number of virtual visits
Reports the number of visits. A visit: a unique IP accesses a page and requests other pages without an hour passing between any of the requests.
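The visit definition above amounts to a session count with a one-hour inactivity limit. The following is a minimal sketch of that counting rule, not AWStats's own code: the function name `count_visits`, the treatment of a gap of exactly one hour as a new visit, and the IP addresses are assumptions for illustration.

```python
from datetime import datetime, timedelta

def count_visits(requests, gap=timedelta(hours=1)):
    """Count visits in an iterable of (ip, datetime) page requests.

    A request starts a new visit when the same IP has no earlier request,
    or when its previous request is at least `gap` older.
    """
    last_seen = {}
    visits = 0
    for ip, when in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(ip)
        if previous is None or when - previous >= gap:
            visits += 1
        last_seen[ip] = when
    return visits

start = datetime(2009, 11, 16, 12, 0)
requests = [
    ("10.0.0.1", start),                           # first visit of 10.0.0.1
    ("10.0.0.1", start + timedelta(minutes=30)),   # same visit
    ("10.0.0.1", start + timedelta(minutes=70)),   # same visit (40 minutes later)
    ("10.0.0.1", start + timedelta(minutes=180)),  # new visit (110 minutes later)
    ("10.0.0.2", start + timedelta(minutes=5)),    # first visit of 10.0.0.2
]
print(count_visits(requests))
```

Each request is compared with the previous request from the same IP, not with the first request of the visit, so a long browsing session with no pause of an hour still counts once.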

Example (5) Number of virtual visits

[Diagram: a sequence of requests, each within an hour of the previous one, counts as one visit]

Example (5) Number of virtual visits

Applications         Unique visit   Visit       Page        Visit/visitor   Pages/visit
Library web server   413,324        1,018,811   6,078,913   2.46            5.96
IR                   94,596         133,458     632,256     1.41            4.73
Digital Archives     1,497          3,511       90,489      2.34            25.77
E-Journal            21,833         42,768      376,473     1.95            8.8
E-theses             25,848         34,956      116,664     1.35            3.33
HKUST ILLiad         8,039          18,548      138,109     2.3             7.44
SmartCat             4,202          9,398       288,787     2.23            30.72
Streaming Videos     778            1,233       4,073       1.58            3.30
Total                570,117        1,262,683   7,725,764   2.21            6.11

Virtual visits in 2009: 1,262,683 (2.21 visits per visitor, 6.11 pages per visit)

Customized reports
Built-in customized reports provide a full picture of page visit figures for similar pages

From HKUST Library Web Server (http://library.ust.hk):
- Sitemap
- Databases List
- Course Guides
- Database Guides
- Subject Guides

Customized reports
SubSet: Sitemap, Databases List, Course Guides, Database Guides, Subject Guides

Customized reports
HKUST library web sitemap

Customized reports
Add more customized report templates:
- E-Journal list
- Library Forms

Benefits of Logs Miner
- Central place for storing, processing and analyzing Web Logs data
- Combined usage data from different server logs
- Statistics reports can be generated dynamically
- Flexible querying interface enabling users to construct their own statistical reports in real time

Privacy issue
- From web access logs, an individual client's actions can be tracked
- Protected by firewall, file permission, user authentication
- The Logs Miner User Interface can only be accessed from the library network

IMPORTANT: As data retrieved in your searches or reports may contain usage patterns of our users, please be careful not to re-distribute such information outside of the HKUST Library.

Future Development
- Include more web applications such as the HKUST PowerSearch server (federated search to the Library's subscription resources)
- Create more customized report templates such as an E-journal list

Reference

Han, J., & Kamber, M. 2006. Data mining: Concepts and techniques (2nd ed.). Amsterdam: Morgan Kaufmann.

Liu, H., & Keselj, V. 2007. Combined mining of web server logs and web contents for classifying user navigation patterns and predicting users' future requests. Data & Knowledge Engineering, 61(2): 304.

Markov, Z., & Larose, D. T. 2007. Data mining the web: Uncovering patterns in web content, structure, and usage. Hoboken, N.J.: Wiley-Interscience.

Thank you!
Email address: [email protected]