Feature Engineering Studio
September 23, 2013
Welcome to Mucking Around Day
Sort into pairs
• Partner with the person next to you
• One group of 3 is allowed
Sort into pairs
• Do we have a group of 3?
• One of the 3 will work with me
Sort into pairs
• Go over your reports together
– A maximum of 5 minutes apiece
5 minutes for first person
5 minutes for second person
Re-assemble into one big group
Who here found something really cool while mucking around?
• Show us, tell us
Who here found a histogram with a normal distribution?
• Show us, tell us
Who here found a histogram with a hypermode?
• Show us, tell us
Who here found a histogram with a flat distribution?
• Show us, tell us
Who here found a histogram with a skewed distribution?
• Show us, tell us
Who here found a histogram with a bimodal distribution?
• Show us, tell us
Who here found a histogram with something else interesting?
• Show us, tell us
Who here found something surprising with their min, max, average, stdev?
Categorical variables
• Who here found something curious, weird, or interesting in the distribution of their categorical variables?
Who here hasn’t spoken yet? (and analyzed data)
• Tell us something interesting you found in your data
Who here played with pivot tables?
• What did you learn?
My turn to play with pivot tables
• Who wants to volunteer their data?
• (I might request a 2nd or 3rd data set, depending on how the 1st one goes)
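For anyone who wants to reproduce a pivot table outside Excel, here is a minimal sketch in Python. The log rows, the column meanings (student, action type, seconds), and the function name `pivot_mean` are all made up for illustration; they are not from any volunteered data set.

```python
from collections import defaultdict

# Hypothetical log rows: (student, action_type, seconds). Illustrative only.
rows = [
    ("s1", "attempt", 12.0),
    ("s1", "hint", 4.0),
    ("s1", "attempt", 8.0),
    ("s2", "attempt", 20.0),
    ("s2", "hint", 6.0),
    ("s2", "hint", 2.0),
]

def pivot_mean(rows):
    """Average seconds per (student, action_type) cell.

    Roughly what a pivot table shows with students as rows,
    action types as columns, and "Average of seconds" as the value.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for student, action, seconds in rows:
        sums[(student, action)] += seconds
        counts[(student, action)] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}

table = pivot_mean(rows)
for (student, action), avg in sorted(table.items()):
    print(student, action, avg)
```

Swapping the average for a count or a sum only changes the last line of `pivot_mean`.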
Who here played with vlookup?
• What did you learn?
My turn to play with vlookup
• Using the same volunteered data set(s)
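An exact-match vlookup is essentially a dictionary lookup. The sketch below assumes a hypothetical student-level table of post-test scores that we want to attach to each row of an action log; the names `post_test`, `log_rows`, and `vlookup` are invented for this example.

```python
# Hypothetical student-level table: student -> post-test score.
post_test = {"s1": 0.80, "s2": 0.55}

# Hypothetical action log: (student, action_type).
log_rows = [("s1", "attempt"), ("s2", "hint"), ("s3", "attempt")]

def vlookup(key, table, default=None):
    """Exact-match lookup; returns `default` where Excel would show #N/A."""
    return table.get(key, default)

# Attach the student-level value to every action-level row.
joined = [
    (student, action, vlookup(student, post_test))
    for student, action in log_rows
]
for row in joined:
    print(row)
```

Student "s3" has no entry in the lookup table, so that row gets `None`. Checking for such rows is a quick way to spot mismatched IDs between two sheets.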
Other cool things you can create with a few simple formulas (plus demos!)
Identifying specific cases of interest
Did event of interest ever occur for student?
Counts-so-far (and total value for student)
Counts-last-N-actions
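A rough sketch of counts-so-far and counts-last-N-actions in Python follows. It assumes the log is already sorted in time order, and that the count for a row looks only at that student's earlier rows (not the current one); whether to include the current row is a design choice, not something these slides specify. The data and function names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical time-ordered log: (student, action_type).
rows = [
    ("s1", "hint"),
    ("s1", "attempt"),
    ("s2", "hint"),
    ("s1", "hint"),
    ("s1", "hint"),
    ("s2", "attempt"),
]

def counts_so_far(rows, event):
    """For each row, how often `event` occurred earlier for that student."""
    seen = defaultdict(int)
    out = []
    for student, action in rows:
        out.append(seen[student])      # count before the current row
        if action == event:
            seen[student] += 1
    return out

def counts_last_n(rows, event, n):
    """For each row, how often `event` occurred in that student's previous n actions."""
    windows = defaultdict(lambda: deque(maxlen=n))
    out = []
    for student, action in rows:
        window = windows[student]
        out.append(sum(1 for a in window if a == event))
        window.append(action)          # oldest action drops out automatically
    return out

hints_so_far = counts_so_far(rows, "hint")
hints_last_2 = counts_last_n(rows, "hint", 2)
print(hints_so_far)
print(hints_last_2)
```

Using only earlier rows keeps the feature from containing information about the action it is meant to help predict.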
First attempts
Ratios between events of interest
How many students had 3 (or 4, 5, 2,…) of an event
Times-so-far
Cutoff-based features
Unitized actions (such as unitized time)
Last 3 or 5 unitized
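The slides do not define "unitized" here. One common reading, assumed in this sketch, is a z-score: how many standard deviations faster or slower an action was than the average time on the same problem step. The data, the `step` column, and the function name `unitize` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical log: (student, step, seconds).
rows = [
    ("s1", "A", 10.0),
    ("s2", "A", 20.0),
    ("s3", "A", 30.0),
    ("s1", "B", 5.0),
    ("s2", "B", 5.0),
]

def unitize(rows):
    """Z-score each row's time against the mean and SD for its step.

    Assumption: "unitized" means standardized within step. Steps where
    every time is identical (SD of 0) get 0.0 rather than a division error.
    """
    by_step = defaultdict(list)
    for _, step, seconds in rows:
        by_step[step].append(seconds)
    stats = {step: (mean(v), pstdev(v)) for step, v in by_step.items()}
    out = []
    for _, step, seconds in rows:
        mu, sd = stats[step]
        out.append((seconds - mu) / sd if sd > 0 else 0.0)
    return out

unitized = unitize(rows)
print(unitized)
```

A "last 3 or 5 unitized" feature would then be a sum or average of these values over a student's previous 3 or 5 actions.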
Comparing earlier behaviors to later behaviors through caching
Counts-if
Percentages of action type
Percentages of time spent per action/location/KC/etc.
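As an illustration of the two percentage features, the sketch below computes, per student, the share of actions of each type and the share of total time spent on each type. The log rows and the name `action_shares` are made up; the same grouping would work for location or KC instead of action type.

```python
from collections import defaultdict

# Hypothetical log: (student, action_type, seconds).
rows = [
    ("s1", "attempt", 12.0),
    ("s1", "hint", 4.0),
    ("s1", "attempt", 8.0),
    ("s2", "attempt", 20.0),
    ("s2", "hint", 6.0),
    ("s2", "hint", 2.0),
]

def action_shares(rows):
    """Return {student: {action: (share_of_actions, share_of_time)}}."""
    counts = defaultdict(lambda: defaultdict(int))
    times = defaultdict(lambda: defaultdict(float))
    for student, action, seconds in rows:
        counts[student][action] += 1
        times[student][action] += seconds
    shares = {}
    for student in counts:
        n = sum(counts[student].values())
        total = sum(times[student].values())
        shares[student] = {
            action: (
                counts[student][action] / n,
                times[student][action] / total if total > 0 else 0.0,
            )
            for action in counts[student]
        }
    return shares

shares = action_shares(rows)
print(shares["s1"])
```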
Questions? Comments?
Other cool ideas?
Assignment 3
• Feature Engineering 1
“Bring Me a Rock”
• Get your data set
• Open it in Excel
• Create as many features as you feel inspired to create
– Features should be created with the goal of predicting your ground truth variable
– At least 12 separate features that are not just variations on a theme (e.g. “time for last 3 actions” and “time for last 4 actions” are variations on a theme; but “time for last 3 actions” and “total time between help requests and next action” are two separate features)
• For each feature, write a 1-3 sentence “just so story” for why it might work
• Test how good each feature is
Testing Feature Goodness
• For this assignment, there are a bunch of ways to test feature goodness
• Single-feature prediction models in data mining or stats package, giving correlation or kappa (special session this Wednesday)
• Compute correlation in Excel (want to see?)
– You can do this with binary variables too, although it’s not really optimal
• Compute t-test in Excel (want to see?)
• Compute kappa in Excel (if you don’t know how, easier to do in RapidMiner)
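For reference, here is a minimal Python sketch of the three statistics named above, written out by hand so the arithmetic is visible. The feature values, labels, and the cutoff of 2 are invented for illustration. The t statistic uses the unequal-variance (Welch) form, which is one of several options and an assumption on my part; no p-value is computed.

```python
from math import sqrt
from statistics import mean, variance

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def welch_t(group_a, group_b):
    """Two-sample t statistic without assuming equal variances."""
    va, vb = variance(group_a), variance(group_b)
    return (mean(group_a) - mean(group_b)) / sqrt(va / len(group_a) + vb / len(group_b))

def cohen_kappa(pred, truth):
    """Agreement between predicted and true labels, corrected for chance."""
    n = len(truth)
    observed = sum(p == t for p, t in zip(pred, truth)) / n
    labels = set(pred) | set(truth)
    expected = sum((pred.count(l) / n) * (truth.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical feature values and binary ground truth, one per student.
feature = [1, 2, 3, 4, 5, 6]
truth = [0, 0, 0, 1, 1, 1]

r = pearson(feature, truth)
t = welch_t(
    [f for f, y in zip(feature, truth) if y == 1],
    [f for f, y in zip(feature, truth) if y == 0],
)
pred = [1 if f > 2 else 0 for f in feature]   # single-feature cutoff "model"
kappa = cohen_kappa(pred, truth)
print(r, t, kappa)
```

Kappa needs categorical predictions, so the feature is turned into a prediction with a simple cutoff first; a real single-feature model would choose that cutoff from the data.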
Were you right?
• Which of your “just so stories” seem to be correct?
• Did any of your features correlate in the opposite direction from what you expected?
Assignment 3
• Write a brief report for me
• Email me an Excel sheet with your features
• You don’t need to prepare a presentation
• But be ready to discuss your features in class
Next Classes
• 9/25 Special Session
– Using RapidMiner to Produce Prediction Models
– Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)
– Statistical significance tests using linear regression don’t count…
• 9/30 Advanced Feature Distillation in Excel
– Assignment 3 due
– Online Equation Solver Tutorials should be in your inbox
Upcoming Classes
• 10/2 Special session on prediction models
– Come to this if you don’t know why student-level cross-validation is important, or if you don’t know what J48 is
• 10/7 Advanced Feature Distillation in Google Refine
• 10/9 Special session? TBD.