data mining - here we go again

0885-9000/96/$4.00

Bill Mark, National Semiconductor Architecture Laboratory

There is a lot of talk these days about data mining, and this special issue will add some more. To help the material here clarify and illustrate rather than add to the hype, here’s a brief perspective on the trends that shaped this issue, as well as a road map for the individual articles.

Trends

The growing perception of the advantage of putting information on line, beginning with the success of automated data processing in commercial and scientific enterprises, has led to the collection and storage of ever larger amounts of data. The effort that has gone into ensuring the stability, security, and accessibility of these data has resulted in the substantial database management technology we see today.

So, many enterprises are actively constructing and maintaining very large databases that enjoy strong organizational commitment and rely on specific underlying technologies. A common case is the large transaction database that relies on a software/hardware infrastructure of relational database and data-server technology. A relational database has a well-defined structure of rows and columns. Because the information in the database has been forced into this structure to enable the efficient execution of certain access and manipulation operations, in most environments this in turn bounds the uses to which the data will ever be put. And because only easily stated relational queries are well supported, the data is used, and indeed thought of, strictly in terms of basic query-and-display operations.

Industry has long recognized the limitation imposed by this codependence of databases and their underlying technology. There has, for example, been a decades-old call for better decision support: tools for data analysis beyond the retrieval, manipulation, and graphics meant for usual business processing. In fact, as we often hear, relational databases were originally intended for decision support. However, virtually all of the technology and, more important, the organizational culture of databases, has been directed toward transaction processing. Thus, while most database companies contend that their products can be used for some flavor of decision support, knowledgeable observers generally agree that in-depth decision support requires new technology. This new technology should enable the discovery of trends and predictive patterns in data, the creation and testing of hypotheses, and the generation of insight-provoking visualizations. Nondatabase experts and nonstatisticians should find the technology easy to use; it should also accommodate their ever-changing needs by giving clear, rapid answers to their unplanned-for and perhaps informally stated questions.

© 1996 IEEE, IEEE Expert

Many think data mining promises this technology. Just as there is nothing new about the cry for decision support, there is nothing particularly new about the technologies that underlie data mining: visualization, statistics, machine learning, and deductive databases. What is new is the confluence of (fairly) mature offshoots of these technologies at a time when the world is ready to see their value (Java is an interesting comparison). Also new is the emergence of an approach for applying these technologies to real problems.

In fact, one of this issue’s themes is that data mining is really a general approach that is supported to varying degrees by a set of technologies. The approach has been shaped by the challenges faced by data-mining technologists in applying their technologies to an existing infrastructure. As always happens when new technologies are applied to an existing infrastructure, the technologists need to do a great deal of “other” work to create an environment in which their technologies can operate effectively (several articles in this issue document this process). For data mining, a consensus is currently emerging on a process for creating the environment and applying the technologies.

Therefore, while this issue consciously attempts to sample the diversity of data-mining technology and applications, it also tries to place this diversity in the context of an overall approach and emerging process.

October 1996

Road map for this issue

The issue begins with a discussion of the scientific underpinnings and current activities of the field. The article by Usama Fayyad opens the discussion of terminology, positioning data mining as part of an overall endeavor of knowledge discovery and providing insight into the many issues that influence the success or failure of that endeavor.

Evangelos Simoudis’s article provides an excellent analysis of the data-mining process, including a detailed description of the contributions of the various underlying technologies. The article illustrates the process with a set of actual examples of how data mining has been used with noteworthy effect in the real world.

The rest of the issue provides in-depth descriptions of actual data-mining practice, covering both applications (solutions to specific problems) and techniques that can be applied to a wide range of problems. The articles show the range of problems and technologies that are relevant to this burgeoning field.

Edmond Mesrobian and his colleagues present Oasis, a data-mining environment designed to help geophysical scientists find interesting phenomena in large databases of collected data. Their article outlines the kind of system solution that is required to address the real-world nature of data mining as an ongoing process. They discuss the issues of dealing with software and hardware heterogeneity, and the need to help users find potentially relevant databases in a large distributed environment. It is useful to view this effort in the context of the guidelines and approach described in the initial Fayyad and Simoudis articles.

The article by Kazuo J. Ezawa and Steven W. Norton describes an enormous problem in the telecommunications industry and the way it is being addressed with data-mining technology. They show not only an important application but also a framework for integrating uncertainty reasoning, a major issue for data mining, as it is for much of AI. In particular, they describe a Bayesian network approach that has produced impressive results, and they very usefully put it in the context of related work and alternative approaches.

George H. John, Peter Miller, and Randy Kerber show the application of a very different technology, rule induction, to a very different problem, stock selection. This domain is important not only because of its role as one of the major “interested parties” for data-mining technology, but also because results are relatively easy (in fact, deceptively easy) to score: they are inherently numerical, and the problems have been studied intensively enough to provide solid metrics for success. Again, the results are impressive. The analysis of those results is particularly instructive, both for the thoroughness of the effort and for the illustration of how difficult it is to really understand results based on large amounts of complex data.

In the next article, Diane J. Cook, Lawrence B. Holder, and Surnjani Djoko show the value of combining technologies to improve data-mining results. They address the issue of automatically finding useful substructures in data (such as identifying reusable substructures in a complex circuit). An interesting aspect of their work is the explicit investigation of the value of domain knowledge, explored by running a substructure discovery algorithm on the same data with and without the addition of specific domain knowledge.

Finally, Hing-Yan Lee and Hwee-Leng Ong offer a visualization technique for multidimensional data. Their technique has broad applicability and is presented in terms of an easy-to-understand software package, amply demonstrating the appeal of data-mining technology to nonexpert users.

Summary

Data mining is an emerging field. It is reasonable to expect some differences in scope and definition, and this special issue certainly meets this expectation. But the main message here is that within the diverse set of technologies and applications, technologists are addressing a central thread of an important area: unlocking the information that is buried in the enormous stock of data we have already put on line and developing the underpinnings for better ways to handle data and support future decision making.

Bill Mark is the director of the National Semiconductor Architecture Laboratory and former Associate Editor-in-Chief of IEEE Expert magazine. His research interests include distributed systems of “smart things” and information appliances. He received his BS and MS in electrical engineering and computer science and his PhD in computer science from the Massachusetts Institute of Technology. He is a member of the AAAI and the ACM. Reach him at the Nat’l Semiconductor Architecture Lab, 2900 Semiconductor Dr. M/S E-100, Santa Clara, CA 95052; bmark@ampere.nsc.com.


http://ampere.nsc.com
