

Feature Engineering Studio

February 23, 2015

Let’s start by discussing the HW

Assignment 3

• Data Cleaning
• Look for outliers in your data set
• Find 3 variables that have one or more outliers (if you can)
• Identify those variables
• Give the mean, median, SD, and some outlier values in them
• For each variable, write a 1 sentence “just so story” (or multiple just so stories) about what might have caused the outlier(s)
• Argue (briefly) for a reasonable approach to dealing with that variable’s outliers (and explain why your chosen approach is reasonable)
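The summary asked for above (mean, median, SD, and some outlier values) can be sketched outside Excel as follows. This is a minimal illustration, not part of the assignment: the `times` data, the function name `outlier_summary`, and the 2-SD cutoff are all assumptions chosen for the example.

```python
import statistics

def outlier_summary(values, z_cutoff=2.0):
    """Return mean, median, sample SD, and values more than z_cutoff SDs from the mean.

    The cutoff is an assumption for illustration; pick one you can justify for your data.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # sample standard deviation
    outliers = [v for v in values if sd > 0 and abs(v - mean) / sd > z_cutoff]
    return {"mean": mean, "median": median, "sd": sd, "outliers": outliers}

# Hypothetical seconds-per-action values with one suspicious entry.
times = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13, 900]
summary = outlier_summary(times)
print(summary)
```

A large gap between mean and median, as in this made-up data, is itself a hint that a few extreme values are pulling the mean.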

Everyone will present an outlier

• Alphabetical Order Based on First Name
  – Tie-Breaker: Last Name

• I’ll call out letters
  – Using the class roster failed last time

Tell us about your best outlier

• Mean, Median, SD, and some outlier values
• Give your “just so story” (or multiple just so stories) about what might have caused the outlier(s)

• What do you plan to do about it (if anything)?

Questions? Comments?

Things you can do in Excel part 2 of 3

Identifying specific cases of interest

Did event of interest ever occur for student?

Ratios between events of interest

How many students had 3 (or 4, 5, 2,…) of an event
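For readers working outside Excel, the three event-based features above (did the event ever occur, ratios between events, and how many students had N of an event) might look roughly like this. The log rows, the action names, and the function name `event_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical log rows: (student_id, action).
log = [
    ("s1", "help"), ("s1", "attempt"), ("s1", "help"), ("s1", "help"),
    ("s2", "attempt"), ("s2", "attempt"),
    ("s3", "help"), ("s3", "attempt"), ("s3", "help"), ("s3", "help"),
]

def event_counts(rows, event):
    """Count occurrences of `event` per student, including students with zero."""
    counts = {student: 0 for student, _ in rows}
    for student, action in rows:
        if action == event:
            counts[student] += 1
    return counts

help_counts = event_counts(log, "help")
attempt_counts = event_counts(log, "attempt")

# Did the event of interest ever occur for the student?
ever_helped = {s: n > 0 for s, n in help_counts.items()}

# Ratio between two events of interest (None when the denominator is zero).
help_per_attempt = {
    s: (help_counts[s] / attempt_counts[s] if attempt_counts[s] else None)
    for s in help_counts
}

# How many students had exactly 0, 1, 2, 3, ... help requests.
students_per_count = Counter(help_counts.values())
```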

Unitized actions (such as unitized time)

Last 3 or 5 unitized
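A rough sketch of unitized time and a "last 3 unitized" feature follows. It assumes that "unitized" means expressing each action's time as standard deviations above or below the average time for that step across students; the slides do not define the term, so treat that reading, the sample rows, and the function names as assumptions.

```python
import statistics
from collections import defaultdict, deque

# Hypothetical rows in time order: (student_id, step_id, seconds).
rows = [
    ("s1", "A", 10.0), ("s2", "A", 20.0), ("s3", "A", 30.0),
    ("s1", "B", 5.0),  ("s2", "B", 5.0),  ("s3", "B", 8.0),
    ("s1", "C", 40.0), ("s2", "C", 60.0), ("s3", "C", 50.0),
]

def unitize(rows):
    """Convert seconds to z-scores relative to all students' times on the same step."""
    by_step = defaultdict(list)
    for _, step, secs in rows:
        by_step[step].append(secs)
    stats = {
        step: (statistics.mean(v), statistics.pstdev(v))
        for step, v in by_step.items()
    }
    unitized = []
    for student, step, secs in rows:
        mean, sd = stats[step]
        z = (secs - mean) / sd if sd > 0 else 0.0
        unitized.append((student, step, z))
    return unitized

def sum_last_n(unitized, n=3):
    """Running sum of each student's most recent n unitized times (current action included)."""
    recent = defaultdict(lambda: deque(maxlen=n))
    feature = []
    for student, step, z in unitized:
        recent[student].append(z)
        feature.append((student, step, sum(recent[student])))
    return feature

unitized = unitize(rows)
feature = sum_last_n(unitized, n=3)
```

Whether the window should include the current action is a design choice; excluding it would mean appending to `recent` after recording the sum.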

Comparing earlier behaviors to later behaviors through caching

Counts-if

Percentages of action type

Percentages of time spent per action/location/KC/etc.
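The two percentage features above can be sketched together: for each student, the share of actions of each type and the share of time spent on each type. The rows, the action names, and the function name `percentages` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (student_id, action_type, seconds).
rows = [
    ("s1", "help", 30.0), ("s1", "attempt", 10.0), ("s1", "attempt", 20.0),
    ("s2", "attempt", 15.0), ("s2", "attempt", 45.0),
]

def percentages(rows):
    """Per student and action type, return (percent of actions, percent of time)."""
    n_actions = defaultdict(lambda: defaultdict(int))
    seconds = defaultdict(lambda: defaultdict(float))
    for student, action, secs in rows:
        n_actions[student][action] += 1
        seconds[student][action] += secs
    result = {}
    for student, by_action in n_actions.items():
        total_n = sum(by_action.values())
        total_s = sum(seconds[student].values())
        result[student] = {
            action: (
                100.0 * count / total_n,
                100.0 * seconds[student][action] / total_s if total_s else 0.0,
            )
            for action, count in by_action.items()
        }
    return result

pct = percentages(rows)
```

The same grouping works for location or KC by swapping what goes in the second column.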

List merging

Pearson Correlation

T-tests
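As a cross-check on spreadsheet results, Pearson correlation and a two-sample t statistic can be computed by hand. This sketch assumes the equal-variance (pooled) form of the t-test and returns only the statistic, not a p-value; the sample numbers are invented.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(group_a, group_b):
    """Two-sample t statistic with pooled variance (assumes equal variances)."""
    na, nb = len(group_a), len(group_b)
    pooled = (
        (na - 1) * statistics.variance(group_a)
        + (nb - 1) * statistics.variance(group_b)
    ) / (na + nb - 2)
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    return diff / math.sqrt(pooled * (1 / na + 1 / nb))

# Hypothetical feature values and outcome values.
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Hypothetical feature values split by a binary ground truth label.
t = t_statistic([1, 2, 3], [4, 5, 6])
```

Turning `t` into a p-value needs the t distribution with `na + nb - 2` degrees of freedom, which a stats package or spreadsheet provides.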

More complex stats in Excel

• I have worksheets that can do Chi-squared, Cohen’s Kappa, Extra-Sum-of-Squares F-test, and various meta-analytic methods in Excel

• But if you don’t really know what you’re doing, it’s better to use a stats package for these

What else might you want to do in Excel?

Questions? Comments?

HW4

• Feature Engineering 1

“Bring Me a Rock”

• Get your data set
• Open it in Excel
• Create as many features as you feel inspired to create
  – Features should be created with the goal of predicting your ground truth variable
  – At least 12 separate features that are not just variations on a theme (e.g. “time for last 3 actions” and “time for last 4 actions” are variations on a theme; but “time for last 3 actions” and “total time between help requests and next action” are two separate features)
• For each feature, write a 1-3 sentence “just so story” for why it might work
• Test how good each feature is
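The two example features named above, “time for last 3 actions” and “total time between help requests and next action”, could be computed along these lines. The timestamps, the action names, and the exact definitions are assumptions; in particular, the time for the last n actions is taken here as the elapsed time from n actions ago to the current action.

```python
# Hypothetical time-ordered actions for one student: (timestamp_seconds, action).
actions = [
    (0, "attempt"), (12, "help"), (20, "attempt"),
    (26, "help"), (41, "attempt"), (45, "attempt"),
]

def time_for_last_n_actions(actions, n=3):
    """Elapsed seconds spanning the most recent n actions at each row.

    Rows without n earlier actions get None.
    """
    times = [t for t, _ in actions]
    return [
        times[i] - times[i - n] if i >= n else None
        for i in range(len(times))
    ]

def total_time_help_to_next(actions):
    """Sum of the gaps between each help request and the action that follows it."""
    total = 0
    for (t, action), (t_next, _) in zip(actions, actions[1:]):
        if action == "help":
            total += t_next - t
    return total

last3 = time_for_last_n_actions(actions, n=3)
help_gap = total_time_help_to_next(actions)
```

Changing `n` from 3 to 4 produces a variation on a theme, while `total_time_help_to_next` captures a different behavior, which is the distinction the assignment draws.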

Testing Feature Goodness

• For this assignment, there are a bunch of ways to test feature goodness

• Single-feature prediction models in a data mining or stats package, giving Pearson correlation, Spearman’s rho, or Cohen’s kappa (special session this Wednesday)

• Compute Pearson correlation in Excel
• Compute t-test in Excel
• Compute other metrics in Excel (but see earlier disclaimer)
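For the two rank-based and agreement-based metrics mentioned above, a small hand-rolled sketch may help show what the numbers mean. It computes Spearman's rho as the Pearson correlation of average ranks, and Cohen's kappa between a categorical feature and a categorical label. The data are invented, and this is not a substitute for a stats package.

```python
import math
from collections import Counter

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def cohens_kappa(a, b):
    """Agreement between two label sequences, corrected for chance agreement."""
    n = len(a)
    observed = sum(1 for p, q in zip(a, b) if p == q) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1 - expected)

# Hypothetical monotonic but nonlinear relationship.
rho = spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 100])

# Hypothetical binarized feature versus binary ground truth.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```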

Were you right?

• Which of your “just so stories” seem to be correct?

• Did any of your features correlate in the opposite direction from what you expected?

Assignment 4

• Write a brief report for me
• Email me an Excel sheet with your features
• You don’t need to prepare a presentation
• But be ready to discuss your features in class

Next Classes

• 2/25 Special Session
  – Using RapidMiner to Produce Prediction Models
  – Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)
  – Statistical significance tests using linear regression don’t count…

• 3/2 Advanced Feature Distillation in Excel
  – HW4 due