TRANSCRIPT
OUR SPONSORS AND PARTNERS
Data in Data Mining with SQL Server 2012
Marcin Szeliga
www.sqlexpert.pl
http://blog.sqlexpert.pl/
http://www.facebook.com/SQLExpertpl
Agenda
• Know Your Data
• What Kind of Data Do You Need?
• How Much Data Do You Need?
• The Problem of Missing Data
• Recap
Know Your Data
• Data is not ready for mining
– Even if it comes from a DW/BI system
• In data mining: garbage in, really bad garbage out
• Assessing the data is the key
– What and how much information it holds?
– How much data is invalid/missing?
– Does the data support the business problem?
• The Data Profiling Task and Naive Bayes come to the rescue
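The kind of assessment the Data Profiling Task automates (null ratios, value distributions) can be sketched in a few lines. This is a hypothetical illustration in Python, not the SSIS task itself; the function name and the returned keys are assumptions.

```python
from collections import Counter

def profile_column(values):
    """Basic column profile: null ratio, distinct count, dominant value.

    A rough stand-in for checks the SSIS Data Profiling Task performs
    (Column Null Ratio and Column Value Distribution profiles).
    """
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    counts = Counter(v for v in values if v is not None)
    top_value, top_count = counts.most_common(1)[0] if counts else (None, 0)
    return {
        "null_ratio": nulls / total,   # share of missing values
        "distinct": len(counts),       # number of distinct states
        "top_value": top_value,        # most frequent state
        "top_share": top_count / total,
    }

profile = profile_column(["M", "F", "M", None, "M", "F", None, "M"])
print(profile)  # → null_ratio 0.25, distinct 2, top value "M" with share 0.5
```

A column with `null_ratio` close to 1, or `top_share` close to 1 (a near single-valued attribute), holds little usable information for mining.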
Demo
• Data Profiling
• Checking Attributes Relationship
What Kind of Data Do You Need?
• Tabular data – most of the time a row = a case, and a column = an attribute (or variable)
• An attribute can be:
– Single-valued or multi-valued
– Discrete, ordered, continuous or cyclical
– Monotonic or not
• As far as relations are concerned, an attribute can be:
– Independent or not
– Redundant or not
– Anachronistic or not
• T-SQL and Mining Structure Column Properties come to the rescue
– Get rid of single-valued, monotonic, independent, redundant and anachronistic ones
– Convert discrete attributes into continuous ones or vice versa
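Converting a continuous attribute into a discrete one can be sketched as simple binning. This is an illustrative Python sketch using equal-width buckets; it is not one of the actual SSAS discretization methods (those are set via the DiscretizationMethod column property), and the function name is an assumption.

```python
def discretize(values, buckets=3):
    """Equal-width binning: map a continuous attribute to discrete buckets.

    Illustrates the idea behind discretizing a mining structure column;
    bucket boundaries here are plain equal-width ranges.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets or 1  # avoid zero width for constant columns
    # Clamp the top value into the last bucket.
    return [min(int((v - lo) / width), buckets - 1) for v in values]

ages = [18, 22, 25, 34, 41, 47, 52, 63]
print(discretize(ages))  # → [0, 0, 0, 1, 1, 1, 2, 2]
```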
Demo
• Adding variables
• Changing variables’ type
How Much Data Do You Need? Part 1
• The raw amount of data is mostly irrelevant for data mining algorithms
– Only the information hidden in the data matters
– The problem is not related to the algorithm
• Does this mean that you can mine with only a handful of data?
– Probably not! (more about this later)
• Data mining algorithms work by analyzing statistical relationships between variables
– The distribution of each variable's values is the most important factor that determines the results
• Algorithm parameters come to the rescue
Demo
• Who had the best chance to survive, according to Decision Trees?
• Tweaking Data Mining Algorithms
How Much Data Do You Need? Part 2
• Model parameters are called variables because each of them can take on a variety of values
– Those values contain some sort of pattern
– They are distributed across the variable's range in some specific way
• To see this pattern, display it graphically, as a curve
• At this stage you can only check whether there is too little data
• Statistics comes to the rescue
– You can measure the frequency of an attribute's states/values to get its variability
– Standard deviation can be used as the measure of variability
• It's a sort of average distance between the values and the mean
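The standard-deviation measure described above can be computed directly with Python's standard library (the salary figures below are made up for illustration):

```python
import statistics

# Variability of an attribute measured as standard deviation:
# roughly the average distance between the values and their mean.
salaries = [3200, 3400, 3100, 3300, 9800, 3250, 3350]

mean = statistics.mean(salaries)
stdev = statistics.pstdev(salaries)  # population standard deviation
print(f"mean={mean:.0f}, stdev={stdev:.0f}")  # → mean=4200, stdev=2288
```

Note how the single outlier (9800) dominates the result: the deviation is almost as large as the typical value itself, which is exactly the kind of thing a variability check should surface.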
Demo
• Checking Variability
• Variability as a data quantity measure
How Much Data Do You Need? Part 3
• In real projects the population is too big to be measured
– Most of the time we have to deal with sample data, data that represents only a part of the population
– Even if the whole population is available, we still need to divide it into at least two datasets (a training one and a test one)
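The training/test division mentioned above is a plain hold-out split. A minimal Python sketch, assuming a 70/30 ratio and a fixed seed for reproducibility (both are arbitrary choices, not anything mandated by SQL Server):

```python
import random

def train_test_split(cases, test_ratio=0.3, seed=7):
    """Shuffle the cases once, then cut them into a training part
    and a test part (hold-out validation)."""
    cases = list(cases)
    random.Random(seed).shuffle(cases)
    cut = int(len(cases) * (1 - test_ratio))
    return cases[:cut], cases[cut:]

train, test = train_test_split(range(100))
print(len(train), len(test))  # → 70 30
```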
• Convergence means that by adding more cases the variability will "settle down"
– When the sample is small, each new record can greatly change the value distribution
– As the sample gets bigger, adding new records barely makes any difference
• T-SQL and the OVER clause come to the rescue
– Checking how much the deviation has changed between samples is easy
– But what if we are unlucky, and in our sample some correlations between variables (e.g. between people under 18 and very high salaries) are not properly represented?
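The "settling down" of variability can be seen by recomputing the standard deviation over growing sample prefixes. The slide's approach uses T-SQL with OVER; the sketch below shows the same idea in Python on synthetic data (the normal distribution with mean 100 and deviation 15 is an arbitrary assumption):

```python
import random
import statistics

# Convergence check: compute the standard deviation over growing
# sample prefixes; for a representative sample the value settles down.
random.seed(42)
population = [random.gauss(100, 15) for _ in range(5000)]

checkpoints = [50, 500, 5000]
stdevs = [statistics.pstdev(population[:n]) for n in checkpoints]
for n, s in zip(checkpoints, stdevs):
    print(f"n={n:5d}  stdev={s:.2f}")

# The change between consecutive checkpoints shrinks as n grows;
# large jumps late in the sequence suggest the sample is still too small.
```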
Demo
• Converging on a representative sample
• What about correlations between variables?
The Problem of Missing Data
• NULL has two meanings:
– There is no data
– The data exists but it’s unknown
• I should stress the word "meaning"
– By removing NULLs you can lose valuable information
– By replacing NULLs you can severely skew your data
• A Missing Value Pattern (MVP) model comes to the rescue
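Before building such a model, the first step is usually to flag missingness rather than delete or overwrite it, so the pattern itself stays available as an input. A minimal Python sketch of that flagging step (the function name and column naming convention are assumptions, not part of the MVP approach as presented):

```python
# Instead of dropping or blindly replacing NULLs, flag them first:
# the flag preserves the "missingness" pattern as information a
# model can use; imputation can then be handled separately.
def flag_missing(rows, column):
    """Return copies of the rows with an extra <column>_missing
    indicator (0/1); the input rows are left untouched."""
    flagged = []
    for row in rows:
        row = dict(row)  # copy, so the original row is not mutated
        row[f"{column}_missing"] = 1 if row.get(column) is None else 0
        flagged.append(row)
    return flagged

customers = [
    {"id": 1, "income": 52000},
    {"id": 2, "income": None},
    {"id": 3, "income": 61000},
]
out = flag_missing(customers, "income")
print([r["income_missing"] for r in out])  # → [0, 1, 0]
```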
Demo
• Building Missing Value Pattern Model
• Flagging and predicting missing values
Recap
• Never mine unknown data
• Basic preparation can completely change the results
• There is no easy way to say how much data you need for a particular model
– However, you can check if the sample is representative by measuring the differences in variability
• This test should be done for the most important, if not for all, variables
• This has nothing to do with the data mining algorithm itself
• But as long as you plan to use the data mining model to solve real-world problems, you have to train it using representative data
• Missing data often hides important information – do not lose it
– Use separate data mining models to supplement it
Organization: Polskie Stowarzyszenie Użytkowników SQL Server - PLSSUG
Production: DATA MASTER Maciej Pilecki