Advances in Knowledge Discovery and Data Mining

especially if no solution is offered. For these kinds of reasons, the formal literature tends not to match well, or in a timely fashion, the difficulties that technological change is continually bringing to library managers. As examples: Whether and how to filter Internet access; library involvement in providing access to locally produced electronic resources; gopher and website administration; the regular escalation of requirements for network connections and bandwidth; the growing dependence on network technicians; the integration of CD-ROM, mainframe, local and remote servers; managing the migration to client-server technology; provision of printing infrastructure; endless software compatibility problems; and so forth. This book provides very little on these topics. In 1997, a book with this title and more than 300 pages could reasonably have been expected to pay considerably more attention than it does to, for example, the outsourcing of technical services or to “information arcades.”

Individual items in a professional literature tend to be specific, particular, and local in scope, and it is difficult to organize a review of such a literature. The authors acknowledge the problem and have done a reasonable job, but the problems are deeper and are inherent in any attempt to provide an exposition of a topic by means of a literature review. It is desirable to convey concisely the character or flavor of the selected writings, but it is difficult to provide either an adequate account of the items reviewed or to give a coherent account of the topic. For example, a one-page section on “The Concept of the Digital Library” (pp. 179–180) is a summary of one particular paper on that topic. Other material can be found elsewhere through the index, but the result is not a convenient, effective introduction to the concept of the digital library.

It is unreasonable to expect the sum of the descriptions of individual writings to constitute a good, clear introduction to any topic. The final section, pages 242 to 259, on “The Future of the Librarian” illustrates the difficulty. It is a kaleidoscope of quotations and citations, useful for some purposes, but weak as a systematic, analytic account of the topic.

The literature review, which is what this book largely is, has a time-honored role. The authors have provided a useful service in what they have prepared. But this publication also demonstrates that the literature review as a genre has its limitations if what is wanted is a coherent, systematic introduction to a topic. Summarizing the writings is not the same as summarizing the issues. As a literature review, this work is quite good, within its significant limitations. But as a book on technology and management in library and information services, it is a weak and unsatisfying offering.

Michael Buckland
School of Information Management and Systems
University of California, Berkeley
Berkeley, CA 94720-4600
E-mail: [email protected]

References

Bishop, A. P., & Star, S. L. (1996). Social informatics and digital library use. Annual Review of Information Science and Technology, 31, 301–401.

Davies, C. (1997). Organizational influences on the university electronic library. Information Processing & Management, 33, 377–392.

Advances in Knowledge Discovery and Data Mining. Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padraic Smyth, and Ramasamy Uthurusamy, eds. Menlo Park, CA: AAAI Press/The MIT Press; 1996: 625 pp. Price: $50.00 (ISBN 0-262-56097-6.)

Fayyad and colleagues’ Advances in Knowledge Discovery and Data Mining is not the book it could have been, but it is still a valuable part of the literature. And this is a literature that seems to go out of print with hypersonic speed. A quick search through the online version of Books in Print indicates only 15 titles about the subjects, four of which are conference proceedings. One of these books was published in 1995, two in 1996, and 12 in 1997. The earliest in-print proceedings is from 1993. Obviously, this is a field that is growing very rapidly.

At the time that Advances in Knowledge Discovery and Data Mining was published, the primary applications of knowledge discovery in databases (KDD) and data mining were in business settings, especially marketing analysis. Newer titles (e.g., Data Mining Techniques: For Marketing, Sales & Customer Support, Berry & Linoff, 1997) indicate that business remains a cornerstone application. They also indicate that telecommunications is now an application (Data Warehousing & Data Mining for Telecommunications, Mattison, 1997).

Knowledge discovery in databases and data mining appear to be technologies just waiting to develop applications in areas like space exploration. This reviewer was watching the Mars rover landing recently and it was immediately clear how much data was going to be sent if the mission was successful. Now that it has been (and continues to be) more than successful, the amount of data that will result from the mission is almost incomprehensible. It is clear that some very robust method of data analysis will be required. It is equally clear that this technology will have to allow people to define the patterns looked for; otherwise, science will be constrained to old knowledge.

It is also clear that data warehousing has become inextricably linked to data mining and knowledge discovery in databases. According to Fayyad et al., the encompassing activity is knowledge discovery in databases; a multi-step process finds and defines significant patterns in data examined by a software agent. Data mining is the subset of steps which allows the agent to examine the database. And data warehousing must be the collection of data and its preparation so that data mining can efficiently proceed.

Advances in Knowledge Discovery and Data Mining comprises a preface followed by eight sections: Seven parts and a set of appendixes. The preface provides an overview of KDD and data mining; each part has from two to five papers covering aspects of the part’s topic; the appendixes claim to provide a glossary and a list of online resources about KDD.

Part one, “Foundations,” contains three chapters: A proposed approach to KDD, a discussion of the use of graphical models, and a review of statistical methods used in KDD.

Part two, “Classification and Clustering,” contains four chapters: An examination of the use of inductive logic in KDD, a discussion of a Bayesian classification product called AutoClass, a description of data cleaning and ways of finding informative patterns, and a paper regarding the conversion of rules and trees into KDD-recognizable knowledge structures.

Part three, “Trend and Deviation Analysis,” contains two chapters: A discussion of the use of dynamic programming for discovering patterns in time series and a description of a discovery assistant called Explora.

Part four, “Dependency Derivation,” contains three chapters: A discussion of Bayesian networks in KDD, a method of speeding the discovery of association rules, and a description of converting contingency tables to database knowledge.

Part five, “Integrated Discovery Systems,” contains three chapters, all focusing on data mining: A discussion of mixing inductive and deductive reasoning, a description of meta-queries, and an examination specifically of attribute-oriented logic in data mining.

Part six, “Next Generation Database Systems,” contains two chapters: A description of the use of inductive learning to [...] KDD rules and a discussion of Data Surveyor, a method of examining knowledge discoveries in parallel.

Part seven, “KDD Applications,” contains five chapters: A report of an automated sky survey catalog, an application of KDD to healthcare data, a description of modeling subjective uncertainty in image sets, an experimental use of KDD in a securities environment, and a discussion of current and future work needed in KDD and data mining.

The two appendixes are a discussion of KDD terminology and a list of Internet resources in KDD.

Finally, Advances in Knowledge Discovery and Data Mining has a rather weak index. There is one index page for each 53 pages of text; many name entries refer to referenced authors rather than content; superfluous conjunctions which begin subentries are used to alphabetize; and there are few if any references to definitions. A book about a multidisciplinary field like knowledge discovery in databases will have readers with many backgrounds; the ability to find the definitions used by the authors is central to achieving a common understanding. Normally, this is the job of the terminology list, but, for reasons discussed below, the terminology list of Advances in Knowledge Discovery and Data Mining does not fill the purpose.

Advances in Knowledge Discovery and Data Mining has both strengths and weaknesses. Its greatest strength is the inclusive view the editors take of the subject matter. They clearly expect readers from many disciplines and planned chapters for all. Unfortunately, many of the chapters are written and edited for readers already familiar with their contents; too little attention is paid to developing an organization that invites readers into material written for other communities.

One important improvement would have been to move the discussion of terminology to the beginning of the book immediately following the preface. This is especially important because the terminology is organized in the form of a classic thesaurus rather than a glossary. Therefore it acts as a concept organizing tool, not just a specialized dictionary. Had the concept organization been applied to the whole book, or the thesaurus structure been made to match the book as written, each reader could have shared in the sense of community.

Chapter 23, “From Data Mining to Knowledge Discovery: Current Challenges and Future Directions,” offered a wonderful glimpse of the work needed from each part of the scholarly community of readers; a retrospective examination of the contributions from each field would have greatly improved the book.

It is probably clear from the description of the contents of Advances in Knowledge Discovery and Data Mining that this reviewer is a committed indexer who deeply believes that inaccessible information might as well be non-existent information. At the same time that massive databases are a nightmare, defying all training and instincts, they are today’s data-collection reality. Therefore, knowledge discovery in databases and data mining (or techniques like them) will be necessary in the very near future.

Advances in Knowledge Discovery and Data Mining is not the ideal book to introduce the study, techniques, and practice of its subjects. On the other hand, that book is not currently in print. And, if the growing specialization of new titles in Books in Print is to be believed, it is not likely to be printed. Since practitioners of the diverse arts providing information access will be faced with unprecedented amounts of data tomorrow, learning about knowledge discovery in databases and data mining today would be a good idea.

Frank Exner, Little Bear
NORTEL (Northern Telecom)
and NC Central University
Durham, NC 27707
E-mail: [email protected]

References

Berry, M., & Linoff, G. (1997). Data mining techniques: For marketing, sales & customer support. New York: Wiley.

Mattison, R. (1997). Data warehousing & data mining for telecommunications. New York: Artech House.

Borders in Cyberspace: Information Policy and the Global Information Infrastructure. Brian Kahin and Charles Nesson, eds. Cambridge, MA: The MIT Press; 1997: 374 pp. Price: $25.00. (ISBN 0-262-61126-0.)

This collection of papers is concerned with some highly current and significant issues. The preface notes that the experience of geographic space has been transformed by the information revolution and implies that a spatially bound concept of jurisdiction has been inherited. Subsequent papers are concerned, in different ways, with the conflict between these two tendencies, with, for instance, difficulties in determining the location of commercial and criminal acts, and with the enforcement of sanctions against extra-territorial persons. The collection seems to have originated from a conference, although full information on this is not given, and appears to have been subject to light editorial control.

The papers cover a variety of topics within the broad area of concern. David Johnson and David Post discuss the nature of cyberspace and argue that it needs laws and legal institutions of its own. They simultaneously recognize that (in a quotation from a 1909 United States judgment) that “All law is prima facie territorial” (p. 4). They suggest a password boundary as solution to the problem of enforcement where territorial power does not subsist. Ingrid Volkmer addresses the conflict between cultural sovereignty and global information, pointing to television broadcasting as a precursor to the globalism of the Internet, particularly with regard to the innovation of global programming. Some forms of online provision are noted to have a regional focus. Joel Reidenberg notes that technical standards effectively exert substantial control over information flows and suggests that technical standardization should be deliberately used to achieve regulatory objectives. Christopher Kedzie argues, in a rather tautological presentation, that modern communication and information technologies have a democratic tendency. A. Michael Froomkin deals with the Internet as a source of regulatory arbitrage, the practice of evading disliked domestic regulations by communicating and conducting transactions under regulatory regimes with more favorable rules. Henry Perrit is concerned with jurisdiction in cyberspace and notes that state power has been closely related to judicial authority and that this has historically led to a localization of judicial authority. Dan Burk considers the market for digital piracy, contrasting data piracy with earlier forms of smuggling by reference to the lack of physical substance of data. He also indicates that Internet commerce may not follow conventional models of spatial distribution. Victor Schonberger and Teree Foster discuss free speech and the global information infrastructure, noting that there are inconsistent approaches to restricting freedom of expression, even among Western, democratic and liberal, countries. Robert Gellman deals with the possibility of overlapping rules and of

Journal of the American Society for Information Science, April 1, 1998, pp. 386–387