data janitor 101
TRANSCRIPT
tl;dr
- KISS is the philosophy,
- take the long view, invest in durable knowledge,
- strive for fast and good enough,
- just because you can doesn't mean you should.

"... American MBA? ... if you don’t understand
something it must be simple and only take five
minutes."
Sean Murphy, PingThings

KPIs that matter
- DAU, WAU, MAU, LTV, churn,
- cohorts, segments, funnels,
- first hour, first day.

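A minimal sketch of how DAU, WAU and MAU could be counted from a raw event log. The event data, the `active_users` helper and the trailing 1/7/30-day windows are assumptions made for illustration, not definitions from the talk.

```python
from datetime import date, timedelta

# Hypothetical event log: (user_id, activity_date) pairs, illustrative data only.
events = [
    ("u1", date(2024, 1, 1)),
    ("u2", date(2024, 1, 1)),
    ("u1", date(2024, 1, 2)),
    ("u3", date(2024, 1, 5)),
    ("u1", date(2024, 1, 8)),
    ("u4", date(2024, 1, 20)),
    ("u2", date(2024, 1, 30)),
    ("u1", date(2024, 1, 30)),
]

def active_users(events, as_of, window_days):
    """Distinct users with at least one event in the trailing window ending on as_of."""
    start = as_of - timedelta(days=window_days - 1)
    return {user for user, day in events if start <= day <= as_of}

as_of = date(2024, 1, 30)
dau = len(active_users(events, as_of, 1))    # assumed window: the day itself
wau = len(active_users(events, as_of, 7))    # assumed window: trailing 7 days
mau = len(active_users(events, as_of, 30))   # assumed window: trailing 30 days
stickiness = dau / mau                       # share of monthly users active today

print(dau, wau, mau, stickiness)
```

Counting distinct users per window, rather than events, is what keeps the number honest: one noisy user cannot inflate it.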
Approach
- KPIs must hurt (aka no feelgood metrics),
- you are what you measure,
- you can run in one direction,
- is it actionable (the Friday 1700 test).

Don't
- just Apache it,
- build a Hadoop JENGA (10x-235x slower),
- real-time it,
- stream it,
- overengineer it.

Do
- embrace dirty reality (entity recognition makes a data engineer),
- ETL, events and DWH,
- data quality (know your leakage),
- testing (yes, you can even unit test data).

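One way "unit testing data" might look in practice: a small check function run against each batch before it is loaded. The row layout, field names and the three rules below are hypothetical, chosen only to show the pattern.

```python
# Hypothetical rows as they might arrive from an ETL step; field names are assumptions.
rows = [
    {"user_id": "u1", "signup_date": "2024-01-03", "revenue": 12.5},
    {"user_id": "u2", "signup_date": "2024-01-04", "revenue": 0.0},
    {"user_id": "u2", "signup_date": "2024-01-04", "revenue": 0.0},   # duplicate key
    {"user_id": None, "signup_date": "2024-01-05", "revenue": -3.0},  # missing key, negative revenue
]

def check_rows(rows):
    """Return a list of human-readable data quality failures (empty list means clean)."""
    failures = []
    seen = set()
    for i, row in enumerate(rows):
        uid = row.get("user_id")
        if uid is None:
            failures.append(f"row {i}: user_id is missing")
        elif uid in seen:
            failures.append(f"row {i}: duplicate user_id {uid}")
        else:
            seen.add(uid)
        if row.get("revenue", 0) < 0:
            failures.append(f"row {i}: negative revenue {row['revenue']}")
    return failures

failures = check_rows(rows)
for message in failures:
    print(message)
```

Returning the failures instead of raising on the first one lets a nightly batch report all of its leakage in one pass.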
Approach
- avoid GIGO,
- pedal to the metal, skip the overhead,
- know that big RAM is eating big data,
- use open source, pragmatic, cloud service agnostic tools.

Toolset
- UNIX (bash, make),
- Python,
- SQL,
- ETL in batch (mETL, night-shift),
- event tracking (Hamustro, logsanitizer, RPi?),
- DWH = MPP SQL (Azure DWH, Redshift, Vertica...).

Heroes of the day
- James Mickens: Computers are a Sadness, I am the Cure
- Dan McKinley: Choose Boring Technology
- David Beazley: Discovering Python

"Friends don’t let friends calculate p-values
(without fully understanding them)."
Scott Weingart

Don't
- expect CSVs and produce models whatever it takes,
- expect that you have to explore the laws of Universe,
- forget about Occam's razor,
- A/B test (only if it REALLY REALLY makes sense).

Do
- user testing to define context (usertesting.com),
- talk to users via surveys,
- embed yourself in departments (personas),
- have common sense.

Approach
- you mostly tell what not to do,
- it's hard, but still the only way,
- persist when not finding anything or trivialities,
- kill the lurking causation.

A/B
- think twice about TCO,
- the world isn’t identically distributed,
- random variation will cheat you in small samples,
- most A/B test results are illusory,
- small data -> go Bayesian = less certainty.

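A rough sketch of the "small data -> go Bayesian" point: the same observed lift looks far less certain on 30 users per arm than on 3000. The conversion counts are made up, and the uniform Beta(1, 1) prior and the Monte Carlo estimate are assumptions for illustration, not a method taken from the talk.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=42):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta posteriors.

    Assumes a uniform Beta(1, 1) prior on each conversion rate.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws

# Hypothetical results: 10% vs ~16.7% observed conversion in both cases.
small = prob_b_beats_a(3, 30, 5, 30)          # 30 users per arm
large = prob_b_beats_a(300, 3000, 500, 3000)  # 3000 users per arm

print(f"small sample: P(B > A) ~ {small:.2f}")
print(f"large sample: P(B > A) ~ {large:.2f}")
```

With the small sample the posterior leaves real room for B being no better than A, which is the "less certainty" the slide asks you to accept instead of a false win.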
Heroes of the day
- Evan Miller: Wizard Statistical Analyzer
- Chris Stucchio: talks and posts on testing

Don't
- need a PhD,
- develop new unique matrix algos, please,
- need more than Excel,
- give false hope.

Do
- deploy good enough fast,
- copy Kaggle (ensembles, random forest, XGBoost),
- feature engineer,
- build core data/feature (augment and enhance).

Approach
- the Mailchimp way (offline-built model, redeployed each quarter),
- hybrid approaches (domain expert, vanilla ML),
- you are a machine instructor,
- TensorFlow (logic to clients, handle models).

Don't
- believe the hype,
- trust no one, just benchmarks,
- let black box take over,
- expect hiring to be easy.

Do
- maintain data mythology,
- keep the view backwards straight,
- expect emotions,
- see the future.

Approach
- train to be the bearer of the bad news,
- laugh at endless growth without saturation,
- handle the cargo cult (inverse causality).

Marketing
- Google Analytics (sampling, off by 20%, no user granularity, no raw data, 150k per year),
- CPA, FB CPA, mobile CPA, conversion, attribution,
- Net Promoter Score.

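For reference, a small sketch of two of the marketing metrics above using their commonly cited definitions: CPA as spend divided by acquisitions, and NPS as percent promoters (scores 9-10) minus percent detractors (scores 0-6). The function names and the sample numbers are hypothetical.

```python
def cost_per_acquisition(spend, acquisitions):
    """CPA: total spend divided by the number of acquired customers."""
    if acquisitions <= 0:
        raise ValueError("acquisitions must be positive")
    return spend / acquisitions

def net_promoter_score(scores):
    """NPS on 0-10 survey answers: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("scores must not be empty")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)

# Hypothetical campaign and survey data, for illustration only.
cpa = cost_per_acquisition(500.0, 25)
nps = net_promoter_score([10, 9, 9, 8, 7, 6, 3, 10, 5, 9])

print(cpa, nps)
```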