The Art and (Data) Science of Data Cleansing and Quality

Upload: dataversity

Post on 22-Jan-2018

Category: Technology

TRANSCRIPT

Page 1: The Art and (Data) Science of Data Cleansing and Quality

The Art and (Data) Science of Data Cleansing and Quality

Page 2: The Art and (Data) Science of Data Cleansing and Quality

© 2017 Alteryx, Inc. | Confidential Download a FREE Trial: alteryx.com/trial

Today’s Speakers

Lisa Aguilar, Sr. Product Marketing Manager

Dan Putler, Chief Scientist

Page 3: The Art and (Data) Science of Data Cleansing and Quality

Agenda

• Thinking Through Predictive Modeling Use Cases

• Starting with the Right Data

• The “gotchas” of data hygiene in developing predictive models

• Choosing the Right Modeling Technique

Page 4: The Art and (Data) Science of Data Cleansing and Quality

Understanding the Business Issue

• What decision needs to be made?

• What information is needed to inform that decision?

• Developing a mental model of the process typically helps a great deal in identifying all the potentially relevant information

• What type of analysis can provide the exact information needed to inform the decision?

Page 5: The Art and (Data) Science of Data Cleansing and Quality

Two Specific Use Cases to Illustrate Business Issue Understanding

• How much electricity does a utility need to have the capacity to supply for any given hour tomorrow?

• To which of its customers should an outdoor sports retailer send a paddling sports catalog?

Page 6: The Art and (Data) Science of Data Cleansing and Quality

The Electricity Supply Use Case

• The question “How much electricity does a utility need to have the capacity to supply for any given hour tomorrow?” actually has two underlying decisions:
  • Which of our existing power plants should we start to bring online or start to take offline?
  • Should we purchase electricity from the spot market, and, if yes, how much?

• The critical information needed is how much electricity will be demanded in each hour of the day tomorrow

• Unfortunately, this information is not known at the time decisions need to be made, but it can be estimated with a predictive model

Page 7: The Art and (Data) Science of Data Cleansing and Quality

The Electricity Supply Use Case

• What factors are likely to drive the demand for electricity in a given hour tomorrow?
  • This is where having a mental model of the process can be very handy

• Some factors that are likely to be important:
  • Day of the week
  • Hour of the day
  • The temperature in that hour and in the preceding hour
  • The month of the year

• One issue is that, like electricity demand, the temperature in that hour (or even the preceding hour) tomorrow will not be known at the time decisions are made (today), but the temperature in each hour tomorrow can be predicted using a model
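
The demand drivers listed on this slide could be turned into model inputs roughly as follows. This is a minimal sketch, not the speakers' implementation: the function name `demand_features`, the record layout, and the sample readings are all assumptions made for illustration. Day of week and month are emitted as string labels so they are treated as categories rather than numbers.

```python
from datetime import datetime

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

def demand_features(timestamps, temperatures):
    """Build one feature row per hour from parallel lists of datetimes and temperatures.

    The first hour is skipped because it has no preceding-hour temperature.
    """
    rows = []
    for i in range(1, len(timestamps)):
        ts = timestamps[i]
        rows.append({
            "day_of_week": DAY_NAMES[ts.weekday()],  # categorical label
            "hour_of_day": ts.hour,                  # 0-23
            "month": f"{ts.month:02d}",              # categorical label, e.g. "07"
            "temp": temperatures[i],                 # temperature in that hour
            "temp_prev_hour": temperatures[i - 1],   # temperature one hour earlier
        })
    return rows

# Hypothetical readings, invented for illustration only.
hours = [datetime(2017, 7, 3, h) for h in (14, 15, 16)]
temps = [88.0, 90.5, 91.0]
features = demand_features(hours, temps)
```

When scoring tomorrow's hours, the two temperature fields would hold predicted rather than observed values.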

Page 8: The Art and (Data) Science of Data Cleansing and Quality

The Electricity Supply Use Case

• Available factors to predict hourly temperatures:
  • The forecast high and low for the day from the US National Weather Service or another organization
  • The number of minutes since sunrise or sunset at the start of each hour
  • The temperature for the same hour on the previous day

• In this case two different predictive models are needed:
  • Predict hourly temperatures for the next day
  • Predict hourly electricity use given temperatures and other factors
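
The chaining of the two models can be sketched as below. To stay short, each stage is a one-predictor least-squares line, which is a deliberate simplification: a real model would use all of the factors listed. The helper names `fit_line` and `predict` are hypothetical, and every number is a toy value invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(model, x):
    """Apply a fitted (intercept, slope) pair to a new predictor value."""
    intercept, slope = model
    return intercept + slope * x

# Toy history, invented for illustration only.
# Stage 1: temperature in a given hour explained by the forecast daily high.
forecast_high = [70.0, 80.0, 90.0]
temp_in_hour = [68.0, 78.0, 88.0]
temp_model = fit_line(forecast_high, temp_in_hour)

# Stage 2: electricity demand (MW) explained by the temperature in that hour.
observed_temp = [70.0, 80.0, 90.0]
demand_mw = [500.0, 600.0, 700.0]
demand_model = fit_line(observed_temp, demand_mw)

# Tomorrow: the stage 1 prediction becomes the stage 2 input.
predicted_temp = predict(temp_model, 85.0)
predicted_demand = predict(demand_model, predicted_temp)
```

The second model is fitted on observed temperatures but scored on predicted ones, so errors in the temperature forecast carry through to the demand forecast.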

Page 9: The Art and (Data) Science of Data Cleansing and Quality

The Paddle Sports Catalog Use Case

• The question “To which of its customers should an outdoor sports retailer send a paddling sports catalog?” has a definitive answer:
  • Send it to any customer for whom the full cost of sending the catalog is less than the expected margin dollars (item price less item cost) from the items that customer would purchase from the catalog

• While the criterion for deciding whether a specific customer should be sent a catalog is definitive, knowing whether that customer meets it is another matter

• Predictive models can help provide the information needed on whether a particular customer is expected to meet the criterion

• Two models would typically be used:
  • A model that predicts whether a customer will purchase anything from the catalog at all
  • A model of the margin dollars a customer will generate conditional on using the catalog
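
One way to combine the two model outputs into the mailing decision is sketched below. It assumes the expected margin is the purchase probability multiplied by the margin predicted for a customer who does purchase. The function names, field names, probabilities, and catalog cost are all hypothetical.

```python
def expected_margin(p_purchase, margin_if_purchase):
    """Expected margin dollars: purchase probability times margin given a purchase."""
    return p_purchase * margin_if_purchase

def should_send_catalog(p_purchase, margin_if_purchase, catalog_cost):
    """Send only when the full cost of the catalog is less than the expected margin."""
    return catalog_cost < expected_margin(p_purchase, margin_if_purchase)

# Hypothetical model outputs for two customers, invented for illustration only.
customers = [
    {"id": "A", "p_purchase": 0.10, "margin_if_purchase": 40.0},
    {"id": "B", "p_purchase": 0.01, "margin_if_purchase": 60.0},
]
CATALOG_COST = 1.50  # assumed full cost of printing and mailing one catalog

mailing_list = [
    c["id"] for c in customers
    if should_send_catalog(c["p_purchase"], c["margin_if_purchase"], CATALOG_COST)
]
```

Customer A has an expected margin of about 4 dollars and is mailed; customer B's expected margin of about 0.60 dollars does not cover the catalog cost.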

Page 10: The Art and (Data) Science of Data Cleansing and Quality

The Paddle Sports Catalog Use Case

• In selecting variables for the two models we have identified, we need to use information that is known prior to sending a catalog to a customer. There are a number of ready candidates:
  • Demographic and socioeconomic information: age, income, family status
  • Location information: the state of the store’s location; proximity to the sea, lakes, or rivers
  • Past purchase behavior, typically measured using the concept of Recency, Frequency, and Monetary Value (or RFM)

• We also need observations on an appropriate target variable. There are two ways to obtain them:
  • Use appropriate historical data (e.g., the response to last year’s paddle sports catalog)
  • Use a “test” approach, where we send the catalog to a sample of our customers, and then use this data to predict the behavior of all our customers
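
The RFM summary mentioned above could be computed along these lines. This sketch assumes recency is the number of days since the most recent purchase, frequency is the purchase count, and monetary value is total spend (some analysts use average spend instead). Only purchases made before the mailing date are counted, so the predictors use information known before the catalog is sent. The function name `rfm` and the transactions are invented for illustration.

```python
from datetime import date

def rfm(purchases, as_of):
    """Summarize (customer_id, purchase_date, amount) tuples per customer.

    Purchases on or after `as_of` are ignored because they would not be
    known when the mailing decision is made.
    """
    summary = {}
    for customer_id, purchase_date, amount in purchases:
        if purchase_date >= as_of:
            continue
        days_ago = (as_of - purchase_date).days
        entry = summary.setdefault(
            customer_id,
            {"recency_days": days_ago, "frequency": 0, "monetary": 0.0},
        )
        entry["recency_days"] = min(entry["recency_days"], days_ago)
        entry["frequency"] += 1
        entry["monetary"] += amount
    return summary

# Hypothetical transactions, invented for illustration only.
purchases = [
    ("c1", date(2017, 1, 10), 50.0),
    ("c1", date(2017, 3, 1), 25.0),
    ("c2", date(2016, 12, 1), 120.0),
    ("c2", date(2017, 4, 15), 30.0),  # after the mailing date, so excluded
]
rfm_by_customer = rfm(purchases, as_of=date(2017, 4, 1))
```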

Page 11: The Art and (Data) Science of Data Cleansing and Quality

The Nitty-Gritty of Developing a Predictive Model: Modeling Method

• There is a large (overwhelming?) number of different modeling methods available

• There are two criteria for selecting the final modeling method to use:
  • Selecting an appropriate modeling method, which is largely driven by the data type of the target variable (categorical or numeric)
  • Selecting the model (hence the method) with the greatest predictive efficacy for predicting new data among a set of appropriate models

• Basic model types:
  • Classification models, which predict the category into which a case (e.g., a customer) falls
  • Regression models, which predict numeric quantities

• Linking back to the use cases:
  • Classification: whether a customer will respond to the paddling sports catalog
  • Regression: the margin dollars from a customer who receives the catalog; hourly temperature and electricity demand
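
The two selection criteria can be sketched as two small helpers: one picks the model family from the target's data type, the other picks the candidate that scores best on data held out from fitting. Both function names are hypothetical, and the scores are placeholders rather than results from any real model.

```python
def model_family(target_values):
    """Return 'regression' for numeric targets and 'classification' otherwise."""
    present = [v for v in target_values if v is not None]
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return "regression" if present and numeric else "classification"

def best_model(holdout_scores):
    """Pick the candidate with the highest score on held-out data.

    Assumes a higher score is better (e.g. accuracy or R-squared).
    """
    return max(holdout_scores, key=holdout_scores.get)

# Targets from the two use cases, with invented values.
catalog_family = model_family(["responded", "no response", "responded"])
margin_family = model_family([12.5, 0.0, 48.1])

# Placeholder holdout scores for two candidate classifiers.
chosen = best_model({"logistic_regression": 0.71, "decision_tree": 0.68})
```

A categorical target stored as integer codes would be misread as numeric by `model_family`, so it should be relabeled as strings first.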

Page 12: The Art and (Data) Science of Data Cleansing and Quality

The Nitty-Gritty of Developing a Predictive Model: Data Hygiene

• The data hygiene requirements for developing predictive models are more exacting than those for reporting and building BI dashboards

• The common data hygiene “gotchas” are:
  • Fields with missing values. Some modeling methods can handle missing values in predictor variables; others cannot, and typically drop any record with a missing value in one or more of the selected predictors. No method can make use of records with a missing target variable
  • Categorical variables that have little variability (e.g., 99% of all records are in the same category) or have categories with a small number of records (leading to reliability problems and/or the possibility that new data cannot be predicted due to “unknown” categories)
  • Categorical variables that are disguised as integers. For target variables, this can mean that an inappropriate modeling method is used; for predictors, it can mean the variable is used in an inappropriate way
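
A quick screen for these three gotchas might look like the sketch below. The 99% share comes from the example above, the minimum of 20 records per category follows the rule of thumb given later in the talk, and the cap of 10 distinct integer values is purely an assumption. Missing values are represented as `None`, and the function name `audit_column` is hypothetical.

```python
from collections import Counter

def audit_column(values, dominant_share=0.99, min_category_count=20,
                 max_int_levels=10):
    """Return a list of data hygiene warnings for one column of values."""
    issues = []
    present = [v for v in values if v is not None]
    if len(present) < len(values):
        issues.append("missing values")
    if not present:
        return issues

    counts = Counter(present)

    # Integers with only a handful of distinct values are often category codes.
    all_ints = all(isinstance(v, int) and not isinstance(v, bool) for v in present)
    if all_ints and len(counts) <= max_int_levels:
        issues.append("possible categorical stored as integers")

    # String columns are treated as categorical.
    if all(isinstance(v, str) for v in present):
        top_count = counts.most_common(1)[0][1]
        if top_count / len(present) >= dominant_share:
            issues.append("dominated by one category")
        if any(c < min_category_count for c in counts.values()):
            issues.append("rare categories")
    return issues

# Invented columns for illustration only.
region_issues = audit_column(["east"] * 995 + ["west"] * 5)
code_issues = audit_column([1, 2, 3, 1, 2, None])
```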

Page 13: The Art and (Data) Science of Data Cleansing and Quality

The Nitty-Gritty of Developing a Predictive Model: Data Hygiene

• Addressing fields with missing values:
  • For predictor variables it makes sense to impute missing values in most cases, but if there are very few of them, then dropping records may be in order
  • For numeric variables, a fixed value such as the mean, median, or zero is commonly used to replace missing values. In addition, a categorical variable can be created for each predictor to indicate whether its value has been imputed. My recommendation is to use zero along with a categorical variable that indicates whether the value has been imputed
  • Missing values of categorical variables can be replaced with a new category indicating the value is missing (my recommendation) or with the mode of the variable
  • There are model-based methods that replace missing values with predicted values based on other available data
  • Records with missing values for the target variable should be filtered out of the data
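
The recommended fixed-value approach can be sketched as follows: zero plus an "imputed" flag for numeric predictors, a new "Missing" category for categorical predictors, and a filter for records without a target. Missing values are represented as `None`; the function names, the "Missing" label, and the sample records are assumptions made for illustration.

```python
def impute_numeric(values, fill=0.0):
    """Replace missing numbers with `fill` and flag which ones were imputed."""
    filled = [fill if v is None else v for v in values]
    was_imputed = ["yes" if v is None else "no" for v in values]
    return filled, was_imputed

def impute_categorical(values, label="Missing"):
    """Replace missing categories with an explicit category of their own."""
    return [label if v is None else v for v in values]

def drop_missing_target(records, target):
    """Keep only records (dicts) that have a value for the target field."""
    return [r for r in records if r.get(target) is not None]

# Invented values for illustration only.
income, income_imputed = impute_numeric([52000.0, None, 61000.0])
family_status = impute_categorical(["single", None, "married"])
usable = drop_missing_target(
    [{"id": 1, "responded": "yes"}, {"id": 2, "responded": None}],
    target="responded",
)
```

The indicator is kept as a string label ("yes"/"no") so that it is treated as a categorical variable rather than a number.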

Page 14: The Art and (Data) Science of Data Cleansing and Quality

The Nitty-Gritty of Developing a Predictive Model: Data Hygiene

• Addressing problematic categorical variables:
  • How to address categorical variables that are dominated by a single category (i.e., have little variability) depends on the amount of data available for creating a model. If there is a lot of data, and there is a reasonable number of records (at least 20) in each of the non-dominant categories, then including the field in the model is a viable choice. Otherwise, it makes sense not to include these fields as predictors
  • For categorical variables that have categories with few records, it makes sense to combine categories. The combination should have a sound logical basis, as opposed to being driven by the categories having a similar relationship with the target field
  • Fields that use integer values to identify the different categories should have their data type changed to a string type to indicate that the values are actually category labels
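
These three remedies can be sketched in a few lines. The grouping of states into a region is one example of a combination with a logical (geographic) basis. The function names, the mapping, and the sample values are hypothetical; the threshold of 20 records is the one quoted above.

```python
from collections import Counter

def combine_categories(values, grouping):
    """Map sparse categories onto broader ones; unmapped values are kept as-is."""
    return [grouping.get(v, v) for v in values]

def as_category_labels(values):
    """Turn integer category codes into strings so they are not treated as numbers."""
    return [None if v is None else str(v) for v in values]

def enough_records(values, dominant, min_count=20):
    """Check that every non-dominant category has at least `min_count` records."""
    counts = Counter(v for v in values if v is not None and v != dominant)
    return bool(counts) and all(c >= min_count for c in counts.values())

# Invented values for illustration only.
NORTHEAST = {"ME": "Northeast", "NH": "Northeast", "VT": "Northeast"}
regions = combine_categories(["ME", "NH", "CA"], NORTHEAST)
store_type = as_category_labels([1, 2, None, 1])
keep_field = enough_records(["online"] * 100 + ["store"] * 25, dominant="online")
drop_field = enough_records(["online"] * 100 + ["store"] * 5, dominant="online")
```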

Page 15: The Art and (Data) Science of Data Cleansing and Quality

Key Takeaways

• Clearly define the business issue – create a mental model

• Starting with the right data is critical to the accuracy of predictive models

• Data hygiene requirements for predictive modeling are more stringent than for BI/Reporting

• Data variable type – “numeric” or “categorical” – matters:
  • For selecting an appropriate modeling method
  • When imputing missing values

• The volume of data can be critical when addressing problematic categorical variables

Page 16: The Art and (Data) Science of Data Cleansing and Quality

A Leading Platform for Self-Service Data Analytics

[Diagram labels: Input All Relevant Data; Prep & Blend; Enrich; Analyze; Share; Output All Popular Formats]

Page 17: The Art and (Data) Science of Data Cleansing and Quality

What Makes the Alteryx Platform Different

• No Coding: Drag & drop tools using an intuitive user interface to prep, blend, and analyze data

• Unlock all your data: Securely connect business users to all data regardless of source or data type

• Eliminate silos: Bridge the gap between disparate teams and departments by collaborating in a secure, centralized analytic platform

• Analytic governance: Ensure data quality by providing transparent data management and auditability to data sources, authors and transformation

• Enterprise scalability: Scale analytics to service users in the systems/technologies they depend on

• Repeatable Workflow: Automate time-consuming, manual data tasks, and adjust analytic queries easily

Page 18: The Art and (Data) Science of Data Cleansing and Quality

@alteryx

See what Alteryx can do for you!

Download a free trial of Alteryx

alteryx.com/trial

or visit alteryx.com for more information

Thank you