Classifiers and Machine Learning
Data Intensive Linguistics, Spring 2008, LING 684.02
TRANSCRIPT
Decision Trees
What does a decision tree do? How do you prepare training data? How do you use a decision tree? The traditional example is a tiny data set about weather. Here I use Wagon; many other similar packages exist.
Decision processes
Challenge: Who am I? Q: Are you alive? A: Yes Q: Are you famous? A: Yes … Q: Are you a tennis player? A: No Q: Are you a golfer? A: Yes Q: Are you Tiger Woods? A: Yes
Decision trees Played rationally, this game has the
property that each binary question partitions the space of possible entities.
Thus, the structure of the search can be seen as a tree.
Decision trees are encodings of a similar search process. Usually, a wider range of questions is allowed.
Decision trees
In a problem-solving setup, we're not dealing with people but with a finite number of classes for the predicted variable. The task is essentially the same: given a set of available questions, narrow down the possibilities until you are confident about the class.
How to choose a question?
Look for the question that most increases your knowledge of the class. We can't tell ahead of time which answer will arise, so take the average over all possible answers, weighted by how probable each answer seems to be. The maths behind this is either information theory or an approximation to it.
How to be confident
Be confident if a simple majority classifier would achieve acceptable performance on the data in the current partition. Obvious generalization (Kohavi): be confident if some other baseline classifier would perform well enough.
Data format
Each row of the table is an instance. Each column of the table is an attribute (or feature). You also have to say which attribute is the predictee or class variable; in this case we choose Playable.
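As an illustration, a few rows of such a data file might look like this (the values here are invented, and the field order is assumed to match the description file shown later: outlook, temperature, humidity, windy, play; one whitespace-separated instance per line):

```
sunny 85 85 FALSE no
overcast 83 78 FALSE yes
rainy 70 96 FALSE yes
rainy 65 70 TRUE no
```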
Attribute types
We also need to understand the types of the attributes. For the weather data: Windy and Playable look boolean. Temperature and Humidity look as if they can take any numerical value. Cloudy looks as if it can take any of "sunny", "overcast", "rainy".
Wagon description files
Because guessing the range of an attribute is tricky, Wagon instead requires you to have a "description file". Fortunately (especially if you have big data files), Wagon also provides make_wagon_desc, which makes a reasonable guess at the desc file.
For the weather data:
(
(outlook overcast rainy sunny)
(temperature float)
(humidity float)
(windy FALSE TRUE)
(play no yes)
)
(needed a little help: replacing lists of numbers with “float”)
Commands for Wagon
wagon -data weather.dat -desc weather.desc -o weather.tree
This produces unsatisfying results, because we need to tell it that the data set is small by setting -stop 2 (or else it notices that there are < 50 examples in the top-level tree, and doesn't build a tree).
Using the stopping criterion
wagon -data weather.dat \
  -desc weather.desc \
  -o weather.tree \
  -stop 1
This allows the system to learn the exact circumstances under which Play takes on particular values.
Using Wagon to classify
wagon_test -data weather.dat \
  -tree weather.tree \
  -desc weather.desc \
  -predict play
Over-training
-stop 1 is over-confident, because it might build a leaf for every quirky example. There will be other quirky examples once we move to new data. Unless we are very lucky, what is learnt from the training set will be too detailed.
Over-training 2
The bigger -stop is, the more errors the system will commit on the training data. Conversely, the smaller -stop is, the more likely it is that the tree will learn irrelevant detail. The risk of over-training grows as -stop shrinks and as the set of available attributes increases.
Why over-training hurts
If you have a complex attribute space, your training data will not cover everything.
Unless you learn general rules, new instances will not be correctly classified.
Also, the system's estimates of how well it is doing will be very optimistic.
This is like doing Linguistics but only on written, academic English...
Setting -stop automatically
Split the training data in two. Use the first half to train, trying several different values of -stop. Use the second half for cross-validation: measure the performance of the various trees learnt.
[Diagram: data split into Train, Tune, and Test portions]
Cross-validation
If performance gain generalizes to cross-validation half, then probably also to unseen data.
Any problems?
Data efficiency
The train/tune split is wasteful, so reduce the tuning part to 10% of the data and train on 90%. Rotate the 10% through the training data.
Cross-validation
[Diagram: the 10% Tune portion rotating through successive folds of the Train data, with a held-out Test set]
Cross-validation
Because the tuning set was 10%, this is 10-fold cross-validation; 20-fold would use 5%. In the limit (very expensive, or small training data) we have "leave one out" cross-validation.
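The rotating split can be sketched as follows (a minimal sketch using plain list slicing, not any particular toolkit; setting k equal to the number of instances gives leave-one-out):

```python
def k_fold_splits(data, k):
    """Rotate a held-out tune portion through the data:
    yields (train, tune) pairs for k-fold cross-validation."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        tune = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, tune

data = list(range(20))
splits = list(k_fold_splits(data, 10))  # 10-fold: each tune set is 10%
```

Each instance appears in exactly one tune set, so every data point is used for both training and tuning across the k rounds.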
Clustering with decision trees
The standard stopping criterion is purity of the classes at the leaves of the tree. Another criterion uses a distance matrix measuring the dissimilarity of instances: stop when the groups of instances at the leaves form tight clusters.
What are decision trees?
A decision tree is a classifier. Given an input instance, it inspects the features and delivers a selected class.
But it knows slightly more than this: the set of instances grouped at a leaf may not be a pure class, and this set defines a probability distribution over the classes. So a decision tree is a distribution classifier.
There are many other varieties of classifier.
Nearest neighbour(s)
If you have a distance measure and a labelled training set, you can assign a class by finding the class of the nearest labelled instance.
Relying on just one labelled data point could be risky, so an alternative is to consider the classes of k neighbours.
You need to find a suitable distance measure, and you might use cross-validation to set an appropriate value of k.
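A minimal k-NN sketch (the Euclidean default and the toy points are assumptions for illustration; any distance measure could be passed in):

```python
from collections import Counter

def knn_classify(train, query, k=3, distance=None):
    """Label `query` by majority vote among the k nearest labelled
    training instances. `train` is a list of (point, label) pairs."""
    if distance is None:
        # Euclidean distance as a default; any suitable measure would do.
        distance = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbours = sorted(train, key=lambda pl: distance(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b")]
label = knn_classify(train, (1, 0), k=3)  # two "a" points are closest
```

With k=1 this is plain nearest-neighbour; larger k trades sensitivity to quirky points for smoother decision boundaries.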
Bellman's curse
Nearest-neighbour classifiers make sense if classes are well localized in the space defined by the distance measure.
As you move from lines to planes, volumes and high-dimensional hypervolumes, the chance that you will find enough labelled data points "close enough" decreases.
This is a general problem, not specific to nearest neighbour, and is known as Bellman's curse of dimensionality.
Dimensionality
If we had uniformly spread data and wanted to catch 10% of it, we would need 10% of the range of x in a 1-D space, but 31% of the range of each of x and y in a 2-D space, and 46% of the range of each of x, y, z in a cube. In 10 dimensions you need to cover ~80% of each range.
In high-dimensional spaces, most data points are closer to the boundaries of the space than they are to any other data point.
Text problems are very often high-dimensional.
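The percentages above follow from a simple calculation: to enclose a fraction p of uniformly spread data in d dimensions, a hypercube needs an edge covering p^(1/d) of each axis's range. A quick check:

```python
def edge_fraction(coverage, dims):
    """Fraction of each axis's range a hypercube must span to enclose
    `coverage` of uniformly spread data in `dims` dimensions."""
    return coverage ** (1.0 / dims)

# 10% of the data in 1, 2, 3 and 10 dimensions
fractions = {d: edge_fraction(0.10, d) for d in (1, 2, 3, 10)}
# 1-D: 0.10, 2-D: ~0.316, 3-D: ~0.464, 10-D: ~0.794
```

So in 10 dimensions a "local" neighbourhood holding just 10% of the data spans nearly 80% of every axis, which is hardly local at all.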
Decision trees in high-D space
Decision trees work by picking on an important dimension and using it to split the instance space into slices of lower dimensionality.
They typically don't use all the dimensions of the input space.
Different branches of the tree may select different dimensions as relevant.
Once the subspace is pure enough, or well enough clustered, the DT is finished.
Cues to class variables
If we have many features, any one could be a useful cue to the class variable. (If the token is a single upper-case letter followed by a ".", it might be part of A. Name.)
If cues conflict, we need to decide which ones to trust: "... S. p. A. In other news"
In general, we may need to take account of combinations of features. (The "[A-Z]\." feature is relevant only if we haven't already found an abbreviation.)
Compound cues
Unfortunately, there are very many potential compound cues; training them all separately would throw us into a very high-D space.
The naive Bayes classifier "deals with" this by adopting very strong assumptions about the relation between the features and the underlying class.
Assumption: each feature is independently affected by the class; nothing else matters.
The naïve Bayes classifier
P(F1, F2, ..., Fn | C) ≈ P(F1 | C) P(F2 | C) ... P(Fn | C)
Classify by finding the class with the highest score given the features and this (crass) assumption. Nice property: easy to train; just count the number of times that each Fi and the class co-occur.
[Diagram: naïve Bayes network, with class C as the sole parent of features F1, F2, ..., Fn]
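The count-based training and the product of per-feature probabilities can be sketched as follows (a minimal sketch with invented toy data and no smoothing, so unseen feature values zero out a class's score):

```python
from collections import Counter, defaultdict

def train_nb(instances):
    """Estimate P(c) and P(f_i = v | c) by counting co-occurrences
    in labelled (features, class) pairs."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)  # (position, class) -> value counts
    for features, cls in instances:
        class_counts[cls] += 1
        for i, value in enumerate(features):
            feature_counts[(i, cls)][value] += 1
    return class_counts, feature_counts

def classify_nb(model, features):
    """Pick the class maximising P(c) * prod_i P(f_i | c)."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for cls, n in class_counts.items():
        score = n / total
        for i, value in enumerate(features):
            score *= feature_counts[(i, cls)][value] / n
        if score > best_score:
            best, best_score = cls, score
    return best

data = [(("sunny", "windy"), "no"), (("sunny", "calm"), "yes"),
        (("rainy", "windy"), "no"), (("overcast", "calm"), "yes")]
model = train_nb(data)
```

Training really is just counting, which is why naive Bayes is so cheap compared with searching over compound cues.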
Comments on naïve Bayes
Clearly, the independence assumption is false. All features, relevant or not, get the same chance to contribute; if there are many irrelevant features, they may swamp the real effects we are after.
But it is very simple and efficient, so it can be used in schemes such as boosting that rely on combinations of many slightly different classifiers.
In that context, even simpler classifiers (a majority classifier, a single rule) can be useful.