Computational Intelligence for Information Selection

Post on 31-Dec-2015




<ul><li><p>Computational Intelligence for Information Selection. Filters and wrappers for feature selection and discretization methods. </p><p>Włodzisław Duch. Google: Duch</p></li><li><p>Concept of information. Information may be measured by the average amount of surprise of observing X (data, signal, object). 1. If P(X)=1 there is no surprise, so s(X)=0. 2. If P(X)=0 then this is a big surprise, so s(X)=∞. If two observations X, Y are independent, then P(X,Y)=P(X)P(Y), but the amount of surprise should be a sum, s(X,Y)=s(X)+s(Y).</p><p>The only suitable surprise function that fulfills these requirements is the logarithm, s(X) = -log P(X). The average amount of surprise is called information or entropy. Entropy is a measure of disorder; information is the change in disorder. </p></li><li><p>Information. Information derived from observations of variable X (vector variable, signal or some object) that has n possible values is thus defined as: H(X) = -Σi P(X=xi) log P(X=xi). If the variable X is continuous with distribution P(x), an integral is taken instead of the sum: H(X) = -∫ P(x) log P(x) dx. What type of logarithm should be used? Consider a binary event with P(X(1))=P(X(2))=0.5, like tossing a coin: how much do we learn each time? A bit. Exactly one bit. Taking lg2 gives: H(X) = -0.5 lg2(0.5) - 0.5 lg2(0.5) = 1 bit. </p></li><li><p>Distributions. For a scalar variable P(X(i)) may be displayed in the form of a histogram, and information (entropy) calculated for each histogram. </p></li><li><p>Joint information. Other ways of introducing the concept of information start from the number of bits needed to code a signal. Suppose now that two variables, X and Y, are observed. 
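A minimal sketch of the surprise/entropy definitions above; the function name and the example distributions are illustrative assumptions, not part of the original slides:

```python
import math

def entropy(probs, base=2):
    """Average surprise -sum p*log(p); terms with p=0 contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin carries exactly one bit of information:
coin = entropy([0.5, 0.5])       # -> 1.0
# A certain event (P(X)=1) carries no surprise at all:
certain = entropy([1.0])         # -> 0.0
# A biased coin carries less than one bit:
biased = entropy([0.9, 0.1])
```

The `if p > 0` guard implements the convention 0 log 0 = 0 used later in the slides.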
Joint information is: H(X,Y) = -Σx,y P(x,y) log P(x,y). For two uncorrelated features this is equal to the sum H(X,Y) = H(X) + H(Y), since P(x,y) = P(x)P(y) and the logarithm turns the product into a sum. </p></li><li><p>Conditional information. If the value of the Y variable is fixed and X is not quite independent, conditional information (average conditional surprise) may be useful: H(X|Y) = -Σy P(y) Σx P(x|y) log P(x|y). Prove that H(X,Y) = H(Y) + H(X|Y). </p></li><li><p>Mutual information. The information in one X variable that is shared with Y is: MI(X;Y) = H(X) + H(Y) - H(X,Y). </p></li><li><p>Kullback-Leibler divergence. If two distributions P, Q for the X variable are compared, their divergence is expressed by: DKL(P||Q) = Σx P(x) log(P(x)/Q(x)). KL divergence is the expected value of the log ratio of two distributions; it is non-negative, but not symmetric, so it is not a distance. Mutual information is the KL distance between the joint and the product (independent) distributions: MI(X;Y) = DKL(P(X,Y)||P(X)P(Y)). </p></li><li><p>Joint mutual information. The information in two X variables that is shared with Y is MI(X1,X2;Y). An efficient method to calculate joint MI uses the chain rule: MI(X1,X2;Y) = MI(X1;Y) + MI(X2;Y|X1), where the conditional joint information MI(X2;Y|X1) is the information shared by X2 and Y once X1 is known.</p></li><li><p>Graphical relationships. The total joint information is (prove it!): H(X,Y) = H(X|Y) + H(Y|X) + MI(X;Y). A Venn-style diagram relates H(X), H(Y), the conditional parts H(X|Y), H(Y|X), and the overlap MI(X;Y).</p></li><li><p>Some applications of info theory. Information theory has many applications in different CI areas. Only a few applications are mentioned here, with visualization, discretization, and feature selection treated in more detail. Information gain has already been used in decision trees (ID3, C4.5) to define the gain of information made by a split: for feature A, used to split node S into left Sl and right Sr sub-nodes, with classes w=(w1 ... wK): G(A,S) = H(S) - (|Sl|/|S|) H(Sl) - (|Sr|/|S|) H(Sr), with H(S) = -Σi P(wi|S) log P(wi|S) being the information contained in the class distribution w for vectors in the node S. Information is zero if all samples in the node are from one class (log 1 = 0, and 0 log 0 = 0), and maximum, H(S)=lg2 K, for a uniform distribution over K equally probable classes, P(wi|S)=1/K. </p></li><li><p>Model selection. How complex should our model be? 
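The identities relating joint, conditional and mutual information, and MI as a KL divergence, can be checked numerically. This sketch assumes a small made-up joint distribution over two binary variables; all names are my own:

```python
import math

def lg(x):  # base-2 logarithm
    return math.log(x, 2)

# Hypothetical joint distribution P(X, Y) over two binary variables.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals P(X) and P(Y).
Px = {x: sum(p for (a, _), p in P.items() if a == x) for x in (0, 1)}
Py = {y: sum(p for (_, b), p in P.items() if b == y) for y in (0, 1)}

H_XY = -sum(p * lg(p) for p in P.values() if p > 0)
H_X = -sum(p * lg(p) for p in Px.values() if p > 0)
H_Y = -sum(p * lg(p) for p in Py.values() if p > 0)
H_X_given_Y = H_XY - H_Y               # chain rule: H(X,Y) = H(Y) + H(X|Y)
MI = H_X + H_Y - H_XY                  # mutual information
# MI as KL divergence between joint and product (independent) distributions:
KL = sum(p * lg(p / (Px[x] * Py[y])) for (x, y), p in P.items() if p > 0)
```

For this distribution `MI` and `KL` agree to machine precision, as the identity requires.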
For example, what size of a tree, how many functions in a network, what degree of the kernel? Crossvalidation is a good method but sometimes costly. Another way to optimize model complexity is to measure the amount of information necessary to specify the model and its errors. Simpler models make more errors; complex models need a longer description. Minimum Description Length for model + errors (very simplified). General intuition: learning is compression, finding simple models, regularities in the data (ability to compress). Therefore estimate: L(M) = how many bits of information are needed to transmit the model. L(D|M) = how many bits to transmit information about the data, given M. Minimize L(M)+L(D|M). Data correctly handled need not be transmitted. Estimations of L(M) are usually nontrivial. </p></li><li><p>More on model selection. Many criteria for information-based model selection have been devised in computational learning theory; the two best known are AIC, the Akaike Information Criterion, and BIC, the Bayesian Information Criterion. The goal is to predict, using training data, which model has the best potential for accurate generalization. Although model selection is a popular topic, applications are relatively rare, and selection via crossvalidation is commonly used. Models may also be trained by maximizing the mutual information between outputs and classes; this may in general be any non-linear transformation, for example implemented via basis set expansion methods.</p></li><li><p>Visualization via max MI. A linear transformation Y = WX that maximizes mutual information between a set of class labels and a set of new input features Yi, i=1..d' with d' &lt; d, is sought; here W is a d' x d dimensional matrix of mixing coefficients. Maximization proceeds using gradient-based iterative methods. Left: FDA view of 3 clusters; right: linear MI view with better separation. FDA does not perform well if distributions are multimodal; separating the means may lead to overlapping clusters. 
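The AIC and BIC criteria mentioned above have simple closed forms (AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L, lower is better). A sketch with hypothetical log-likelihoods, to show how both criteria penalize the extra parameters of a slightly better-fitting model:

```python
import math

def aic(k, log_lik):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_lik

def bic(k, n, log_lik):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical fits on n=100 samples: a simple model with 2 parameters,
# and a bigger 10-parameter model with a slightly higher likelihood.
n = 100
simple = (aic(2, -120.0), bic(2, n, -120.0))
complex_ = (aic(10, -115.0), bic(10, n, -115.0))
```

Here both criteria prefer the simple model: the 5-nat likelihood gain does not pay for 8 extra parameters.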
</p></li><li><p>More examples. Torkkola (Motorola Labs) has developed the MI-based visualization; see more examples at: The Landsat Image data contain 36 features (spectral intensities of a 3x3 submatrix of pixels in 4 spectral bands), used to classify 6 types of land use; 1500 samples were used for visualization. Left: FDA; right: MI; note the violet/blue separation. Classification in the reduced space is more accurate. Movie 1: Reuters</p><p>Movie 2: Satimage (local only)</p></li><li><p>Feature selection and attention. Attention is a basic cognitive skill; without attention learning would not have been possible. First we focus on some sensation (visual object, sounds, smells, tactile sensations), and only then is the full power of the brain used to analyze this sensation.</p><p>Given a large database, to find relevant information you may: </p><p>discard features that do not contain information; use weights to express their relative importance; reduce dimensionality by aggregating information, making linear or non-linear combinations of subsets of features (FDA, MI), although the new features may not be as understandable as the original ones; create new, more informative features, introducing new higher-level concepts, which is usually left to human invention. </p></li><li><p>Ranking and selection. Feature ranking: treat each feature as independent, and compare features to determine their order of relevance or importance (rank). Note that: Several features may have identical relevance, especially for nominal features; this is either by chance, or because the features are strongly correlated and therefore redundant. Ranking depends on what we look for: for example, rankings of cars from the point of view of comfort, usefulness in the city, or rough terrain performance will be quite different. Feature selection: search for the best subsets of features, remove redundant features, create subsets of one, two, ..., k best features. 
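Feature ranking as described above can be sketched with mutual information as the relevance index; the dataset and names below are made up for illustration:

```python
import math
from collections import Counter

def mi_feature_class(xs, ys):
    """Mutual information MI(X;C) between one feature and the class labels."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Hypothetical dataset: X1 determines the class, X2 is pure noise.
classes = [0, 0, 0, 0, 1, 1, 1, 1]
X1 = [0, 0, 0, 0, 1, 1, 1, 1]   # perfectly informative: MI = 1 bit
X2 = [0, 1, 0, 1, 0, 1, 0, 1]   # independent of the class: MI = 0
ranking = sorted([("X1", mi_feature_class(X1, classes)),
                  ("X2", mi_feature_class(X2, classes))],
                 key=lambda t: -t[1])
```

Note that each feature is scored independently, so two strongly correlated (redundant) features would both rank highly; that is exactly the weakness selection methods address.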
</p></li><li><p>Filters and wrappers. Can ranking or selection be universal, independent of the particular decision system used? Some features are quite irrelevant to the task at hand. </p><p>Feature filters are model-independent, universal methods based on some criteria that measure relevance, used for information filtering. They are usually computationally inexpensive. </p><p>Wrappers are methods that check the influence of feature selection on the result of a particular classifier at each step of the algorithm. For example, LDA, kNN or NB methods may be used as wrappers for feature selection. Forward selection: add one feature, evaluate the result using the wrapper. Backward selection: remove one feature, evaluate the result. </p></li><li><p>NB feature selection. Naive Bayes assumes that all features are independent; results degrade if redundant features are kept. Naive Bayes predicts the class and its probability. For one feature Xi, P(X) does not need to be computed if NB returns a class label only. Comparing predictions with the desired class C(X) (or the probability distribution over all classes) for all training data gives an error estimate when a given feature Xi is used for the dataset D. </p></li><li>NB selection algorithm. Forward selection, best first: start with a single feature, find the Xi1 that minimizes the NB classifier error rate; this is the most important feature; set Xs ={Xi1}. Then repeat: add to Xs the feature that, together with Xs, gives the lowest NB error, until the error stops decreasing. </li><li><p>Filters. The complexity of the wrapper approach for d features is in the worst case O(d*(d-1)/2); usually the number of selected features m &lt; d. If the evaluation of the error on the training set is costly, or d is very large (in some problems it can be 10^4-10^5), then filters are necessary. The complexity of filters is always O(d), and evaluations are less expensive. Simplest filter for nominal data: the MAP classifier. K=2 classes, d binary features Xi=0, 1, i=1..d and N samples X(k). Since P(w0,0)+P(w0,1)=P(w0), P(w1,0)+P(w1,1)=P(w1)=1-P(w0), and P(w0)+P(w1)=1, there are only 2 free parameters here. 
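The NB forward-selection wrapper described above can be sketched as follows. This is a generic illustration, not the implementation from the lecture: the Laplace-smoothed Naive Bayes, the function names, and the toy data are my own assumptions:

```python
from collections import Counter

def nb_error(data, labels, feats):
    """Training-set error of Naive Bayes restricted to the subset `feats`."""
    n = len(labels)
    classes = sorted(set(labels))
    prior = Counter(labels)
    cond = {}   # Laplace-smoothed counts for P(X_i = v | class)
    for row, c in zip(data, labels):
        for i in feats:
            cond[(i, row[i], c)] = cond.get((i, row[i], c), 0) + 1
    def predict(row):
        def score(c):
            s = prior[c] / n
            for i in feats:   # +1 / +2: Laplace correction for binary X_i
                s *= (cond.get((i, row[i], c), 0) + 1) / (prior[c] + 2)
            return s
        return max(classes, key=score)
    return sum(predict(r) != c for r, c in zip(data, labels)) / n

def forward_select(data, labels, d):
    """Greedy best-first forward selection using the NB wrapper error."""
    selected, best_err = [], 1.0
    while len(selected) < d:
        cand = [(nb_error(data, labels, selected + [i]), i)
                for i in range(d) if i not in selected]
        err, i = min(cand)
        if selected and err >= best_err:
            break          # adding any further feature no longer helps
        selected.append(i)
        best_err = err
    return selected

# Hypothetical data: feature 0 predicts the class, feature 1 is noise.
data = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]
chosen = forward_select(data, labels, 2)
```

On this toy set the wrapper keeps only feature 0, since the noise feature cannot reduce the error further.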
Joint probability P(wj,Xi) is a 2x2 matrix carrying full information.</p></li><li><p>MAP Bayesian filter. The informed majority classifier (i.e. knowing the Xi value) makes, in the two-class problem with two feature values Xi=0,1, the optimal decisions: IF P(w0, Xi=0) &gt; P(w1, Xi=0) THEN class w0. Predicted accuracy: a fraction P(w0, Xi=0) correct, P(w1, Xi=0) errors. IF P(w0, Xi=1) &gt; P(w1, Xi=1) THEN class w0. Predicted accuracy: a fraction P(w0, Xi=1) correct, P(w1, Xi=1) errors. </p><p>In general the MAP classifier predicts the class with the largest joint probability, arg maxj P(wj, Xi=x). The accuracy of this classifier using feature Xi requires summing over all values x that the feature may take: A(MC,Xi) = Σx maxj P(wj, Xi=x). </p></li><li><p>MC properties. Since no more information is available, two features with the same accuracy A(MC,Xa) = A(MC,Xb) should be ranked as equal. If for a given value x all samples are from a single class, then the accuracy of MC is 100%, and a single feature is sufficient. Since optimal decisions are taken at each step, is the majority classifier an optimal solution? For binary features yes, but not always for others. Joint probabilities are difficult to estimate, especially for smaller datasets; smoothed probabilities lead to more reliable choices. For continuous features the results will strongly depend on discretization. Information theory weights contributions from each value of X taking into account not only the most probable class, but also the distribution of probabilities over other classes, so it may have some advantages, especially for continuous features. </p></li><li><p>Bayesian MAP index. The Bayesian MAP rule accuracy is the accuracy of the majority classifier, A(MC,Xi) = Σx maxj P(wj, Xi=x); normalizing it gives the Bayesian MAP index used for feature ranking.</p></li><li><p>MI index. The mutual information index MI(Xi;C) is frequently used. To avoid numerical problems with 0/0 for values xk that are not present in some dataset (e.g. in a crossvalidation partition), Laplace corrections are used. Their most common form is P(wi|X=x) = (N(wi,X=x)+1) / (N(X=x)+K), where N(X=x) is the number of samples with feature value x and K is the number of classes. 
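The Laplace correction just given can be sketched directly; the function name is my own, and the counts below are a made-up example:

```python
def laplace_prob(n_class_value, n_value, n_classes):
    """Laplace-corrected estimate of P(w_i | X = x):
    (N(w_i, X=x) + 1) / (N(X=x) + K), so an unseen value gives 1/K, not 0/0."""
    return (n_class_value + 1) / (n_value + n_classes)

# Value never seen in this partition (N(X=x)=0): estimates stay defined
# and uniform over the K=2 classes instead of producing 0/0.
unseen = [laplace_prob(0, 0, 2), laplace_prob(0, 0, 2)]   # [0.5, 0.5]

# Seen value: 8 of the 10 samples with this x belong to class w0.
p0 = laplace_prob(8, 10, 2)   # (8+1)/(10+2) = 0.75
p1 = laplace_prob(2, 10, 2)   # (2+1)/(10+2) = 0.25
```

By construction the corrected estimates still sum to 1 over the classes, which is the constraint the slides insist on.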
Some decision trees evaluate probabilities in this way; instead of 1 and K, other values may be used, as long as the probabilities sum to 1 over the number of classes.</p></li><li><p>Other entropy-based indices. The JBC and mutual information indices measure the concentration of probability around the maximum value; the simplest such measure is a Gini-type index, summing squared class probabilities. It measures something like entropy for each partition among classes. Other possibilities include the Renyi entropy, whose parameter q controls how strongly the most probable classes dominate. (Never used in decision trees or info selection? Joint or conditional?) </p></li><li><p>Confusion matrices for BC. Mapping from joint probability to confusion matrices for the Bayesian rule.</p></li><li><p>An example. Compare three binary features with class distributions: BC ranking: X3 &gt; X1 = X2; MI ranking: X1 &gt; X3 &gt; X2; Gini ranking: X3 &gt; X2 &gt; X1.</p><p>Which is the best? Why use anything else but BC?</p></li><li><p>Correlation coefficient. Perhaps the simplest index is based on the Pearson correlation coefficient (CC), which calculates expectation values for the product of feature values and class values: CC(Xj,C) = (E(Xj C) - E(Xj)E(C)) / (σ(Xj) σ(C)). For feature values that are linearly dependent on the class the correlation coefficient is 1 or -1, while for a class distribution completely independent of Xj it is 0. How significant are small correlations? It depends on the number of samples n. The answer (see Numerical Recipes) is given by P = erfc(|CC| sqrt(n/2)): for n=1000 even a small CC=0.02 gives P ~ 0.5, but for n=10 only 0.05. </p></li><li><p>Other relevance indices. Mutual information is based on the Kullback-Leibler distance; any distance measure between distributions may also be used, e.g. Jeffreys-Matusita. The Bayesian concentration measure is simply the sum of squared conditional class probabilities, averaged over the feature values. Many other such measures exist. Which is the best? In practice they are similar, although the accuracy of calculations is important: relevance indices should be insensitive to noise and unbiased in their treatment of features with many values. </p></li><li><p>Discretization. All indices of feature relevance require summation over probability distributions. 
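The correlation-coefficient index above is easy to sketch; the function name and data are illustrative assumptions:

```python
import math

def pearson_cc(xs, ys):
    """Pearson correlation coefficient between feature values and class labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

y = [0, 0, 1, 1]                       # class labels
cc_lin = pearson_cc([1, 1, 5, 5], y)   # linearly dependent feature -> 1.0
cc_ind = pearson_cc([0, 1, 0, 1], y)   # class-independent feature  -> 0.0
```

The significance of a given |CC| can then be checked with `math.erfc(abs(cc) * math.sqrt(n / 2))`, matching the Numerical Recipes estimate quoted above.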
What to do if the feature is continuous? There are two solutions: 1. Discretize the range of the feature values and calculate the sums: histograms with equal width of bins (intervals); histograms with equal number of samples per bin; Maxdiff histograms: bins starting in the middle, (x(i+1)-x(i))/2, of the largest gaps; V-optimal: the sum of variances within bins should be minimal (difficult). 2. Fit some functions to the histogram distributions using Parzen windows, e.g. a sum of several Gaussians, and integrate. </p></li><li><p>Tree (entropy-based) discretization. V-opt histograms are good but difficult to create (dynamic programming techniques should be used). Simple approach: use decision trees with a single feature, or a small subset of features, to find good splits; this avoids local discretization. Ex: C4.5 decision tree discretization maximizing information gain, or the SSV tree based on a separability criterion, vs. constant-width bins. Hypothyroid screening data, 5 continuous features, MI shown. EP = equal width partition; SSV = decision tree partition (discretization) into 4, 8, ..., 32 bins.</p></li><li><p>Discretized information. With a partition of the Xj feature values x into rk bins, the joint information is calculated from the binned probabilities P(wi, rk), and the mutual information from the binned joint and marginal distributions.</p></li><li><p>Feature selection. Selection requires evaluation of mutual information or other indices on subsets of features S={Xj1,Xj2,..,Xjl}, with discretization of the l-dimensional feature values X into Rk bins. The difficulty here is reliable estimation of distributions for a large number M(S) of l-dimensional partitions. The feedforward pair approximation tries to maximize the MI of a new feature while minimizing the sum of its MI with the features already in S. Ex: select the feature maximizing MI(Xj;C) - b Σ(Xs in S) MI(Xj;Xs), with some b &lt; 1. </p></li><li><p>Influence on classification. Selecting the best 1, 2, ..., k-dimensional subsets, check how different methods perform using the reduced number of features. 
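The first two discretization schemes above (equal-width and equal-frequency bins), followed by MI on the binned values, can be sketched as below; the helper names and the toy feature are my own:

```python
import math
from collections import Counter

def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / k or 1.0          # guard against a constant feature
    return [min(int((v - lo) / w), k - 1) for v in values]

def equal_freq_bins(values, k):
    """Assign bins so each holds (roughly) the same number of samples."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins

def mi(xs, ys):
    """Mutual information between two discrete variables, in bits."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# Hypothetical continuous feature that separates two classes cleanly.
feature = [0.1, 0.2, 0.3, 0.4, 5.1, 5.2, 5.3, 5.4]
classes = [0, 0, 0, 0, 1, 1, 1, 1]
mi_ew = mi(equal_width_bins(feature, 2), classes)
```

With two bins either scheme recovers the full 1 bit of class information here; as the slides note, a poor bin placement on real data can lose much of it.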
MI using SSV discretization. SSV bfs/beam: selection of features that are at the top of the SSV decision tree, with best-first search or beam search tree creation methods. kNN: ranking with a backward wrapper using kNN. Ba: pairwise approximation. Hypothyroid data, 21 features, 5 continuous. </p></li><li><p>GM example, wrapper/manual selection. Look at GM 3 wrapper-based feature selection. Try GM on the Australian Credit Data, with 14 features. Standardize the data. Select transform and classify, add the feature selection wrapper and choose SVM, leaving the default features. Run it and look at Feature Rank...</p></li></ul>
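The kNN backward wrapper mentioned in the comparison can be sketched generically (this is not the GhostMiner implementation; the leave-one-out 1-NN, the function names, and the toy data are my own assumptions):

```python
from collections import Counter

def knn_error(data, labels, feats, k=1):
    """Leave-one-out error of kNN using only the features in `feats`."""
    def dist(a, b):
        return sum((a[i] - b[i]) ** 2 for i in feats)
    errors = 0
    for j, (row, c) in enumerate(zip(data, labels)):
        neigh = sorted((dist(row, r), labels[m])
                       for m, r in enumerate(data) if m != j)[:k]
        vote = Counter(lbl for _, lbl in neigh).most_common(1)[0][0]
        errors += vote != c
    return errors / len(labels)

def backward_eliminate(data, labels, d):
    """Drop one feature at a time while the wrapper error does not grow."""
    feats = list(range(d))
    improved = True
    while improved and len(feats) > 1:
        improved = False
        base = knn_error(data, labels, feats)
        for i in list(feats):
            rest = [f for f in feats if f != i]
            if knn_error(data, labels, rest) <= base:
                feats = rest
                improved = True
                break
    return feats

# Hypothetical data: feature 0 is informative, feature 1 is random noise.
data = [(0.0, 3.1), (0.1, 0.2), (0.2, 2.7),
        (1.0, 0.1), (1.1, 2.9), (1.2, 0.3)]
labels = [0, 0, 0, 1, 1, 1]
kept = backward_eliminate(data, labels, 2)
```

Dropping the noisy feature lowers the leave-one-out error, so backward elimination keeps only feature 0, which is exactly the behavior the wrapper comparison on the Hypothyroid data relies on.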

