data mining: practical machine learning tools and ... mining: practical machine learning tools and...
TRANSCRIPT
Data Mining: Practical Machine Learning Tools andTechniques with Java Implementations
by Ian H. Witten and Eibe Frank
Morgan Kaufmann Publishers, 2000416 pages, Paper, $49.95
ISBN 1-55860-552-5
Review by: James Geller, New Jersey Institute of Technology
CS Department, 323 Dr. King Blvd., Newark, NJ [email protected]
http://web.njit.edu/~geller/
Summary of the book
Witten and Frank's textbook was one of twobooks that I used for a data mining class inthe Fall of 2001. The book covers all majormethods of data mining that produce aknowledge representation as output.Knowledge representation is herebyunderstood as a representation that can bestudied, understood, and interpreted byhuman beings, at least in principle. Thus,neural networks and genetic algorithms areexcluded from the topics of this textbook.We need to say “can be understood inprinciple” because a large decision tree or alarge rule set may be as hard to interpret as aneural network.
The book first develops the basic machinelearning and data mining methods. Theseinclude decision trees, classification andassociation rules, support vector machines,instance-based learning, Naive Bayesclassifiers, clustering, and numericprediction based on linear regression,regression trees, and model trees. It thengoes deeper into evaluation andimplementation issues. Next it moves on todeeper coverage of issues such as attributeselection, discretization, data cleansing, andcombinations of multiple models (bagging,boosting, and stacking). The final chapterdeals with advanced topics such as visualmachine learning, text mining, and Webmining.
A walk through the contents
The greatest strength of this Data Miningbook lies outside of the book itself. All thealgorithms described in this book areimplemented and freely available throughthe WEKA (Waikato Environment forKnowledge Analysis) Website(www.cs.waikato.ac.nz/ml/weka). Chapter 8of the book is a tutorial to the implementedalgorithms. The integration between thebook and the Web site is excellent, and theWeb site is alive, thriving and growing.Thus, the number of data mining algorithmsavailable on the Web site goes far beyondwhat is described in the book. Indeed, evenNeural Networks have been added to theWeb site since the book was first published.While many books offer an associated Website by now, the close linkage between bookand Web site and the rapid growth of theWeb site are highly commendable.
Another pleasant feature of the WEKAimplementation is that it is done in Java.This makes it possible to construct systems,based on Java, that capitalize on the otherstrengths of Java, such as access to relationaldatabases through JDBC and easy access toWeb pages from within Java programs.
Target audience
The book is written for academics andpractitioners and I believe it can be wellunderstood, even by undergraduate students.
In fact, it is probably the most accessiblesurvey of data mining in print, withoutsacrificing too much of precision and rigor.The book is written in a highly redundantstyle, which I would like to describe as anexercise in iterative deepening. Basicconcepts are repeated in several chapters,but covered to a deeper level in the laterchapters. This should make it easy forstudents to keep reading it, without havingto refer back to earlier chapters at every stepof the way. On the other hand, for a personthat is already familiar with the basics ofdata mining, this makes boring reading atsome places. However, I do not recommenda streamlining of the book. Instead, Irecommend that readers with someknowledge of the topic may skip paragraphsthat sound familiar without any guiltyfeelings.
Reviewer's appreciation
The book goes to great lengths to avoid“formula shock”. Formulas are developedstep-by-step and well explained. Onlyabsolutely necessary formulas are included.In many cases, where the derivation of acomplex result is irrelevant to the actual datamining issues, the authors defer to statisticstextbooks. While I am greatly in favor ofboth these approaches in writing textbooks, Ifeel that they have gone too far at a fewplaces. At a number of places, the authorsavoid introducing ``one more letter'' to keepthe text readable. However, the price theypay for that is that many of their formulashave no equal signs. Thus, a sentence isterminated with a colon and followed with aformula, which is presumably equal to thequantity described by the sentence. This isdone on many pages, e.g., 132--135, 137,196, 207, 222, etc. Not in my wildestdreams would I have thought that I couldever criticize a book author for having toofew formulas and too few variables. Butthis is exactly what I need to do here. WhileI do not recommend eliminating thepreviously mentioned redundancy ofdescription, I do recommend for the next
edition (which this book will undoubtedlyhave) to strengthen the formulas, withoutnecessarily adding new ones.
At a few places, the book could also beimproved by adding more explanations tofigures. Figure 3.6 is a prime example forthis issue. I found myself spending timeverifying that instance counts in twosubfigures truly add to the same total (of209). They do. The reader could be sparedthis effort by a better caption or a betterdescription in the body of the text.Similarly, the Apriori algorithm isintroduced in a figure, but only in the“Further Reading” subsection (followingmuch later) is the name of the algorithmmentioned. A better figure caption wouldhelp the scholarly advancement of studentswho might not take the “Further Reading”section that seriously.
In America we say “Actions speak louderthan words”. Thus, instead of summarizingthe book I will describe some actions that Iintend to take (or that I am already taking). (1) I am using WEKA for my research. (2) If I teach the same course again, I will
use Witten and Frank's book again.(3) If the book appears in a second edition, I
will acquire it.