

The Bibliomining Process: Data Warehousing and Data Mining for Libraries

Sponsored by SIG LT

Scott Nicholson [Moderator] Syracuse University School of Information Studies, 4-127 Center for Science and Technology, Syracuse, NY 13244. Email: [email protected]

San-Yih Hwang Department of Information Management, National Sun Yat-sen U., Kaohsiung, Taiwan, 80424. Email: [email protected]

Paula Keezer Alexa Internet, An Amazon Company, Building 37, The Presidio of San Francisco, PO Box 29141, San Francisco, CA 94129. Email: [email protected]

Edward T. O'Neill Office of Research, OCLC Online Computer Library Center, Inc., 6565 Frantz Road, Dublin, Ohio 43017. Email: [email protected]

Bibliomining is the combination of data mining, bibliometrics, statistics, and reporting tools used to extract patterns of behavior-based artifacts and item-based metadata from library systems. The bibliomining process involves the identification of problem areas, the collecting and anonymizing of data into a data warehouse, the exploration of the data with data mining tools, and the analysis, validation, and implementation of the results. This panel will introduce the topic of bibliomining and present ways in which data warehousing and data mining are currently being used in library settings.

Introduction

For decades, libraries have been collecting information about their users, their staff, and other organizations. These data, usually locked within conceptually separate databases, contain artifacts of behavior that libraries could use to create a higher-quality patron experience, make better management decisions, and justify their performance to the parent organization. Traditionally, only small portions of these data have been analyzed, and the methods used for analysis are usually unsophisticated frequency-based techniques. More sophisticated analysis techniques can give managers a much better idea of common patterns of interaction among the library, patrons, and external organizations.

The bibliomining process provides one way of learning about the use of the library (Nicholson & Stanton, 2003). Bibliomining is the combination of data mining, text mining, multidimensional database analysis, and bibliometric analysis to discover novel and applicable patterns of data-based artifacts of behavior and item metadata within the library organization. The larger-scale bibliomining process, which is based on knowledge discovery in databases, consists of identifying problems, locating and collecting data, building the data warehouse, analyzing the data using bibliomining techniques, and validating the results with domain experts.

In order to perform bibliomining, the library must create a data warehouse that will provide a dependable source of cleaned data from internal library databases. One challenge facing bibliominers is the creation of warehouses of anonymized data that contain no personally identifiable information about the patrons involved. Useful external data sources must be identified and incorporated into this data warehouse. Depending upon the type of library involved and the areas of focus for analysis, this warehouse may contain only metadata about works in the collection or may contain the full text of the works. Once such a data warehouse is created, analysts can apply a number of techniques to discover patterns to aid decision-making and justification.

ASIST 2003 Panel 478

Patron privacy is an important issue. Some libraries are responding to the current privacy issues by deleting records. This loss of the data-based institutional memory severely hamstrings the ability of library administrators to effectively justify decisions and policies to funding bodies. One solution is to use data warehousing technology to create a database of transaction-level data that does not contain personally identifiable information about the patrons. Data mining techniques can then be used to extract patterns of use tied to patron categories.
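The anonymizing step described above might look roughly like the following Python sketch. It is an illustration only, not a method presented by the panel: the record fields (`patron_id`, `patron_name`, `patron_category`, `item_subject`, `checkout_date`) and the sample data are hypothetical. Each transaction keeps its patron category but loses every field that identifies an individual, and use is then counted per category.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkout:
    """A raw circulation record as it might sit in an operational system.

    All field names are hypothetical and chosen for illustration.
    """
    patron_id: str
    patron_name: str
    patron_category: str  # e.g. "undergraduate", "faculty"
    item_subject: str
    checkout_date: str    # ISO date, e.g. "2003-02-11"


def anonymize(record: Checkout) -> dict:
    """Keep only non-identifying fields and coarsen the date to a month."""
    return {
        "patron_category": record.patron_category,
        "item_subject": record.item_subject,
        "checkout_month": record.checkout_date[:7],
    }


def usage_by_category(warehouse: list[dict]) -> Counter:
    """Count transactions per (patron category, item subject) pair."""
    return Counter(
        (row["patron_category"], row["item_subject"]) for row in warehouse
    )


# Made-up sample transactions, not real library data.
raw = [
    Checkout("p001", "A. Reader", "undergraduate", "chemistry", "2003-02-11"),
    Checkout("p002", "B. Reader", "faculty", "chemistry", "2003-02-14"),
    Checkout("p001", "A. Reader", "undergraduate", "history", "2003-03-02"),
    Checkout("p003", "C. Reader", "undergraduate", "chemistry", "2003-03-09"),
]

warehouse = [anonymize(r) for r in raw]
counts = usage_by_category(warehouse)

for (category, subject), n in sorted(counts.items()):
    print(f"{category:15} {subject:10} {n}")
```

Dropping names and identifiers is only a first step. A category with very few members can still point to an individual, so a real warehouse would also need to consider how coarse the retained fields are.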

Scott Nicholson: An Introduction to Bibliomining

Nicholson will place bibliomining in context with other types of evaluation using his theoretical framework for holistic library evaluation and discuss theoretical support for the bibliomining method. In addition, he will outline the steps in the bibliomining process and present information about current bibliomining resources and activities. Finally, he will discuss privacy issues surrounding bibliomining and present methods to protect the privacy of users.

Edward O'Neill: Data Mining at OCLC

OCLC's WorldCat is an enormous bibliographic database consisting of over fifty million bibliographic records representing nearly a billion holdings in almost twenty thousand libraries. The primary function of the database is to support OCLC's online shared cataloging, reference tools, and interlibrary loan services. However, the database also includes a wealth of information on the history of publishing, library acquisitions, publishing patterns, library collections, publishers, and related topics. A new research effort at OCLC attempts to mine WorldCat to extract information on many of these topics. Initially the research is focusing on deriving the attributes of individual books, such as popularity and intellectual level, from the holdings patterns.
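As a rough illustration of deriving book attributes from holdings patterns, the Python sketch below computes two simple signals per title. It is not OCLC's method, which the abstract does not describe. The holdings tuples, the library types, and the two measures are all assumptions made for the example: the number of holding libraries stands in for popularity, and the share of holdings in academic libraries stands in as a crude audience indicator.

```python
from collections import defaultdict

# Hypothetical holdings rows: (title, library, library_type).
# Invented for illustration; not WorldCat data.
holdings = [
    ("Title A", "lib1", "public"),
    ("Title A", "lib2", "public"),
    ("Title A", "lib3", "academic"),
    ("Title B", "lib3", "academic"),
    ("Title B", "lib4", "academic"),
]


def holdings_profile(rows):
    """Summarize each title by holding count and academic share."""
    holders = defaultdict(set)
    for title, library, library_type in rows:
        # A set ignores duplicate rows for the same library.
        holders[title].add((library, library_type))

    profile = {}
    for title, libs in holders.items():
        academic = sum(1 for _, kind in libs if kind == "academic")
        profile[title] = {
            "popularity": len(libs),                # assumed proxy: libraries holding the title
            "academic_share": academic / len(libs), # assumed proxy for audience level
        }
    return profile


profile = holdings_profile(holdings)
for title, stats in sorted(profile.items()):
    print(title, stats["popularity"], round(stats["academic_share"], 2))
```

Both measures are deliberately naive. A real analysis would at least have to account for library size and for how many libraries of each type exist.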

San-Yih Hwang: Bibliomining at National Sun Yat-sen University Library

At National Sun Yat-sen University Library in Taiwan, we have been working on mining useful patterns from both the circulation log of the OPAC system and the web logs of electronic databases. These patterns are subsequently used to provide recommendation services for the respective collections. A new library book recommendation system that makes use of patrons' check-out patterns has been implemented and has been operational since 2002. An ETD recommendation system that utilizes ETD clusters has also been implemented. Experiences with the implementation and deployment of these systems will be reported.

Paula Keezer: Web Discovery at Alexa: Collection Development through Knowledge Discovery

Web Discovery is the process of searching Web page content for information relevant to a particular vertical interest. The discovery process starts with adapting knowledge provided by experts in the particular vertical interest to a 'set'-based text search of a web archive, usually bounded by some period of time. The results are then indexed based on the text 'sets' provided, and those indexes are presented in a simple web interface. The resulting special collection can then be easily browsed by knowledge experts and further refined into a permanent library 'display' collection. Additionally, the results can be used to identify specific web sites that may continue to be of value for the particular vertical interest collection effort and warrant further archiving.

Conclusion of Panel

The panel will conclude with a presentation of other bibliomining projects. For example, Joe Zucca ([email protected]) at the University of Pennsylvania created a data warehouse that takes information from "real-time" web interaction logging, web server logs, the patron database, and the topology of campus networks and merges it into one data warehouse. Details about this and other projects will be briefly presented to demonstrate the variety of bibliomining applications.

REFERENCE

Nicholson, S. & Stanton, J. (2003). Gaining strategic advantage through bibliomining: Data mining for management decisions in corporate, special, digital, and traditional libraries. In Nemati, H. & Barko, C. (Eds.). Organizational data mining: Leveraging enterprise data resources for optimal performance. Hershey, PA: Idea Group Publishing.