Chapter 3 Data Mining Techniques



3.1 Introduction

• Parametric models describe the relationship between input and output through the use of algebraic equations in which some parameters are not specified. These unspecified parameters are determined by providing input examples.

• Nonparametric techniques are more appropriate for data mining applications. A nonparametric model is one that is data-driven: recent techniques are able to learn dynamically as data are added to the input. This dynamic learning process allows the model to be created continuously. The more data, the better the model.

• Nonparametric techniques are particularly suitable for database applications with large amounts of dynamically changing data. Nonparametric techniques include neural networks, decision trees, and genetic algorithms.


3.2 Statistical Perspective: Point Estimation

• The bias of an estimator is the difference between the expected value of the estimator and the actual value. Let $E(\hat{\Theta})$ denote the expected value of an estimator $\hat{\Theta}$ of a parameter $\Theta$; then $\mathrm{Bias} = E(\hat{\Theta}) - \Theta$.

• One measure of the effectiveness of an estimate is the mean squared error (MSE), which is the expected value of the squared difference between the estimate and the actual value: $MSE(\hat{\Theta}) = E[(\hat{\Theta} - \Theta)^2]$.

• The root mean square error (RMSE) is found by taking the square root of the MSE. The root mean square (RMS) may also be used to estimate error or as another statistic to describe a distribution. Unlike the mean, it indicates the magnitude of the values.

• A popular estimating technique is the jackknife estimate. With this approach, an estimate of a parameter $\theta$ is obtained by omitting one value from the set of observed values: $\hat{\theta}_{(i)}$ denotes the estimate obtained when the $i$-th value is omitted. Given the set of jackknife estimates $\hat{\theta}_{(1)}, \ldots, \hat{\theta}_{(n)}$, we can obtain an overall estimate as their average, $\hat{\theta}_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{(i)}$ (a minimal sketch follows this list).

• When we determine a range of values within which the true parameter value should fall, this range is called a confidence interval.
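
As a rough illustration of the jackknife idea, here is a minimal sketch in Python, assuming NumPy; the estimator (the mean), the sample data, and the hypothetical true value are illustrative, not from the slides.

```python
import numpy as np

def jackknife_estimates(data, estimator=np.mean):
    """Leave-one-out estimates: theta_(i) omits the i-th observed value."""
    return np.array([estimator(np.delete(data, i)) for i in range(len(data))])

data = np.array([2.0, 4.0, 4.0, 5.0, 7.0])   # illustrative sample
theta_i = jackknife_estimates(data)          # one estimate per omitted value
theta_overall = theta_i.mean()               # overall jackknife estimate

# MSE/RMSE against a hypothetical true parameter value (assumed here):
theta_true = 4.5
mse = np.mean((theta_i - theta_true) ** 2)
rmse = np.sqrt(mse)
print(theta_i, theta_overall, rmse)
```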


3.2.2 Estimation and Summarization Models

• Maximum likelihood estimation (MLE) is a technique for point estimation. The approach obtains parameter estimates that maximize the probability that the sample data occur for the specific model. The likelihood function is thus defined as $L(\Theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \Theta)$. The value of $\Theta$ that maximizes $L$ is the estimate chosen; it can be found by taking the derivative of $L$ with respect to $\Theta$ and setting it to zero.

• The expectation-maximization (EM) algorithm can solve the estimation problem with incomplete data. The EM algorithm finds an MLE for a parameter (such as a mean) using a two-step process: estimation and maximization. These steps are applied iteratively until successive parameter estimates converge; each iterate must not decrease the likelihood, $L(\hat{\Theta}_{i+1}) \geq L(\hat{\Theta}_{i})$ (a sketch follows this list).

• Models based on summarization provide an abstraction and summarization of the data as a whole. Well-known statistical concepts such as the mean, variance, standard deviation, median, and mode are simple models of the underlying population. Fitting the population to a specific frequency distribution provides an even better model of the data.

• Visualization techniques help to display the structure of the data graphically (histograms, box plots, scatter diagrams).
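
A minimal sketch of the EM idea for estimating a mean when some values are missing, assuming the E-step imputes each missing value with the current mean estimate; the data, starting value, and tolerance are illustrative.

```python
import numpy as np

def em_mean(observed, n_missing, mu0=0.0, tol=1e-6, max_iter=100):
    """EM-style iteration for a mean with missing values:
    E-step imputes missing values with the current estimate,
    M-step re-estimates the mean from observed + imputed values."""
    mu = mu0
    n = len(observed) + n_missing
    for _ in range(max_iter):
        imputed_sum = n_missing * mu                  # E-step
        mu_next = (observed.sum() + imputed_sum) / n  # M-step
        if abs(mu_next - mu) < tol:                   # successive estimates converge
            return mu_next
        mu = mu_next
    return mu

print(em_mean(np.array([1.0, 5.0, 10.0, 4.0]), n_missing=2, mu0=3.0))
```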


3.2.3 Bayes Theorem

• Bayes rule is a technique to estimate the likelihood of a property given the set of data as evidence or input. Suppose that exactly one of the hypotheses $h_1, \ldots, h_m$ must occur and that $x_i$ is an observable event; Bayes rule states: $P(h_j \mid x_i) = \frac{P(x_i \mid h_j)\,P(h_j)}{\sum_{k=1}^{m} P(x_i \mid h_k)\,P(h_k)}$.

• $P(h_j \mid x_i)$ is called the posterior probability, while $P(h_j)$ is the prior probability associated with hypothesis $h_j$; $P(x_i)$ is the probability of the occurrence of data value $x_i$, and $P(x_i \mid h_j)$ is the conditional probability that, given hypothesis $h_j$, the tuple satisfies it. Bayes rule allows us to assign probabilities to hypotheses given a data value.

• Hypothesis testing helps to determine whether a set of observed variable values is statistically significant (differs from the expected case). This approach explains the observed data by testing a hypothesis against it. A hypothesis is first made; then the observed values are compared, based on this hypothesis, to those of the expected case. Assuming that $O$ represents the observed data and $E$ the expected values based on the hypothesis, the chi-squared statistic, $\chi^2$, is defined as: $\chi^2 = \sum \frac{(O - E)^2}{E}$.
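
A minimal sketch of both formulas in Python, assuming NumPy; the two hypotheses, their priors and likelihoods, and the observed/expected counts are made up for illustration.

```python
import numpy as np

def posterior(priors, likelihoods):
    """Bayes rule: P(h_j | x) = P(x | h_j) P(h_j) / sum_k P(x | h_k) P(h_k)."""
    joint = np.array(priors) * np.array(likelihoods)
    return joint / joint.sum()

def chi_squared(observed, expected):
    """Chi-squared statistic: sum over cells of (O - E)^2 / E."""
    o, e = np.array(observed, float), np.array(expected, float)
    return ((o - e) ** 2 / e).sum()

print(posterior(priors=[0.6, 0.4], likelihoods=[0.2, 0.7]))
print(chi_squared(observed=[50, 30, 20], expected=[40, 40, 20]))
```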


3.2.5 Correlations and Regression

• Linear regression assumes that a linear relationship exists between the input data and the output data. The common formula for a linear relationship is $y = c_0 + c_1 x_1 + \cdots + c_n x_n$.

• There are: $n$ input variables, which are called predictors or regressors; one output variable being predicted (called a response); and $n + 1$ constants, which are chosen to fit the model to the input sample. This is called multiple linear regression because there is more than one predictor.

• Both bivariate regression and correlation can be used to evaluate the strength of a relationship between two variables.

• One standard formula to measure linear correlation is the correlation coefficient $r = \frac{\sum (x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum (x_i - \bar{X})^2 \sum (y_i - \bar{Y})^2}}$. Here a negative correlation indicates that one variable increases while the other decreases (a sketch follows this list).

• When two data variables have a strong correlation, they are similar. Thus, the correlation coefficient can be used to define similarity for clustering or classification.
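
A minimal sketch of a bivariate fit and the correlation coefficient, assuming NumPy; the data points are illustrative.

```python
import numpy as np

# Illustrative data: one predictor, one response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Fit y = c0 + c1*x by least squares (polyfit returns [slope, intercept]).
c1, c0 = np.polyfit(x, y, deg=1)

# Correlation coefficient r from the formula above.
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

print(f"y = {c0:.3f} + {c1:.3f} x, r = {r:.3f}")
```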


3.3 Similarity Measures

Those tuples that answer the query should be more like each other than those that do not answer it. Each IR query provides the class definition in the form of the IR query itself, so the classification problem becomes one of determining the similarity between each tuple and the query. Common similarity measures used:

• Dice relates the overlap to the average size of the two sets together.
• Jaccard measures the overlap of two sets as related to the whole set caused by their union.
• Cosine relates the overlap to the geometric average of the two sets.
• Overlap determines the degree to which the two sets overlap.

Distance or dissimilarity measures are often used instead of similarity measures; these measure how unlike items are:

• Euclidean
• Manhattan

Since most similarity measures assume numeric (and often discrete) values, they may be difficult to use for general data types. A mapping from the attribute domain to a subset of the integers may be used, and some approach to determining the difference is needed.
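
The slide gives only verbal definitions; below is a minimal Python sketch of the usual vector forms of these measures, an assumption on my part since the formulas themselves did not survive the transcript.

```python
import numpy as np

def dice(a, b):       # overlap vs. average size of the two sets
    return 2 * np.dot(a, b) / (np.dot(a, a) + np.dot(b, b))

def jaccard(a, b):    # overlap vs. size of the union
    ab = np.dot(a, b)
    return ab / (np.dot(a, a) + np.dot(b, b) - ab)

def cosine(a, b):     # overlap vs. geometric average of the two sets
    return np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))

def overlap(a, b):    # degree to which the two sets overlap
    return np.dot(a, b) / min(np.dot(a, a), np.dot(b, b))

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

t1 = np.array([1.0, 0.0, 1.0, 1.0])   # illustrative tuple
t2 = np.array([1.0, 1.0, 0.0, 1.0])   # illustrative query vector
for f in (dice, jaccard, cosine, overlap, euclidean, manhattan):
    print(f.__name__, round(float(f(t1, t2)), 3))
```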


3.4 Decision Trees

A decision tree (DT) is a predictive modeling technique used in classification, clustering, and prediction. A computational DT model consists of three parts:

• A decision tree
• An algorithm to create the tree
• An algorithm that applies the tree to data and solves the problem under consideration (its complexity depends on the product of the number of levels and the maximum branching factor)

Most decision tree techniques differ in how the tree is created. An algorithm may examine data from a training sample with known classification values in order to build the tree, or the tree could be constructed by a domain expert.
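
A minimal sketch of the third part, applying an already-built tree to a tuple; the attribute, thresholds, and class labels are invented for illustration.

```python
# Each internal node tests one attribute; leaves carry a class label.
tree = {
    "attr": "height",
    "threshold": 1.7,
    "low": {"label": "short"},                      # height <= 1.7
    "high": {"attr": "height", "threshold": 1.95,   # height > 1.7
             "low": {"label": "medium"},
             "high": {"label": "tall"}},
}

def classify(node, tuple_):
    """Walk from the root to a leaf; the work is bounded by the number
    of levels times the branching factor, as noted above."""
    while "label" not in node:
        branch = "low" if tuple_[node["attr"]] <= node["threshold"] else "high"
        node = node[branch]
    return node["label"]

print(classify(tree, {"height": 1.85}))  # -> "medium"
```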


3.5 Neural Networks

• A neural network (NN) can be viewed as a directed graph consisting of vertices and arcs. The vertices are partitioned into source (input), sink (output), and internal (hidden) nodes; every arc is labeled with a numeric weight, and every node is labeled with a function. The NN as an information-processing system consists of a directed graph and various algorithms that access the graph (a propagation sketch follows this list).

• NNs usually work only with numeric data.

• Artificial NNs can be classified, based on the type of connectivity and learning, into feed-forward or feedback networks, with supervised or unsupervised learning.

• Unlike decision trees, after a tuple is processed the NN may be changed to improve future performance.

• NNs have long training times and thus may not be appropriate for applications that demand rapid model construction; they can, however, be used in massively parallel systems.
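
A minimal sketch of the directed-graph view: a tiny feed-forward net in Python with made-up weights, where each node combines its inputs and arc weights in a sum of products and applies an activation function.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Made-up weights for a 2-input, 2-hidden, 1-output feed-forward net.
W_hidden = np.array([[0.5, -0.3],
                     [0.8,  0.2]])   # arc weights into the 2 hidden nodes
W_output = np.array([1.0, -1.5])    # arc weights into the output node

def forward(x):
    h = sigmoid(W_hidden @ x)       # hidden nodes: sum of products + activation
    return sigmoid(W_output @ h)    # output node

print(forward(np.array([1.0, 0.0])))
```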


Activation Functions

The output of each node in the NN is based on the definition of an activation function $f$ associated with it. The activation function is applied to the input values $x_1, \ldots, x_n$ and weights $w_1, \ldots, w_n$. The inputs are usually combined in a sum-of-products form, $S = \sum_i w_i x_i$. The following are alternative definitions for the activation function at a node:

• Linear: $f(S) = cS$
• Threshold or step: $f(S) = 1$ if $S > T$, and $f(S) = 0$ otherwise
• Sigmoid: $f(S) = \frac{1}{1 + e^{-cS}}$. This function possesses a simple derivative, $f'(S) = c\,f(S)\,(1 - f(S))$
• Hyperbolic tangent: $f(S) = \tanh(S) = \frac{e^{S} - e^{-S}}{e^{S} + e^{-S}}$
• Gaussian: $f(S) = e^{-S^2 / v}$, where $v$ is a variance-like parameter
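
A minimal sketch of these functions in Python, assuming NumPy; the constants $c$, $T$, and $v$ default to illustrative values.

```python
import numpy as np

def linear(S, c=1.0):
    return c * S

def threshold(S, T=0.0):
    return np.where(S > T, 1.0, 0.0)

def sigmoid(S, c=1.0):
    return 1.0 / (1.0 + np.exp(-c * S))

def sigmoid_deriv(S, c=1.0):
    f = sigmoid(S, c)
    return c * f * (1.0 - f)   # the "simple derivative" noted above

def hyperbolic_tangent(S):
    return np.tanh(S)

def gaussian(S, v=1.0):
    return np.exp(-S**2 / v)

S = np.array([-2.0, 0.0, 2.0])   # example sum-of-products values
for f in (linear, threshold, sigmoid, hyperbolic_tangent, gaussian):
    print(f.__name__, np.round(f(S), 3))
```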


3.6 Genetic Algorithms

• Initially, a population of individuals is created; they typically are generated randomly. From this population, a new population of the same size is created. The algorithm repeatedly selects individuals from which to create new ones. These parents are then used to produce offspring (children) using a crossover process. Then mutants may be generated. The process continues until the new population satisfies the termination condition.

• A fitness function is used to determine the best individuals in a population. It is then used in the selection process to choose the parents to keep. Given an objective by which the population can be measured, the fitness function indicates how well the objective is being met by an individual.

• The simplest selection process is to select individuals based on their fitness: the probability of selecting individual $I_i$ is $P(I_i) = \frac{f(I_i)}{\sum_j f(I_j)}$, where $f$ is the fitness function. This type of selection is called roulette wheel selection (sketched after this list).

• A genetic algorithm (GA) is a computational model consisting of five parts: 1) a starting set of individuals, 2) a crossover technique, 3) a mutation algorithm, 4) a fitness function, and 5) the GA algorithm that applies them.
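
A minimal sketch of all five parts in Python: a toy GA that maximizes the number of 1-bits in a bit string; the encoding, mutation rate, population size, and termination condition are illustrative.

```python
import random

BITS, POP, GENERATIONS = 8, 10, 30   # illustrative sizes

def fitness(ind):                     # objective: count of 1-bits
    return sum(ind)

def roulette_select(pop):             # P(I_i) = f(I_i) / sum_j f(I_j)
    return random.choices(pop, weights=[fitness(i) or 1 for i in pop], k=2)

def crossover(p1, p2):                # single-point crossover
    cut = random.randrange(1, BITS)
    return p1[:cut] + p2[cut:]

def mutate(ind, rate=0.05):           # flip each bit with small probability
    return [b ^ 1 if random.random() < rate else b for b in ind]

# Starting set: a random population; then repeat select/crossover/mutate.
pop = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]
for _ in range(GENERATIONS):
    pop = [mutate(crossover(*roulette_select(pop))) for _ in range(POP)]
print(max(pop, key=fitness))
```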


References:

Dunham, Margaret H. Data Mining: Introductory and Advanced Topics. Pearson Education, Inc., 2003.