![Page 1: Advanced ML in Google Cloud (2) - Stanford Universityweb.stanford.edu/class/cs341/slides/3-Data.pdf · 2019. 9. 26. · Advanced ML in Google Cloud (2) Abhay Agarwal (MS Design ‘19)](https://reader035.vdocuments.mx/reader035/viewer/2022063009/5fbf32b8966fe75aae6063e6/html5/thumbnails/1.jpg)
Advanced ML in Google Cloud (2)
Abhay Agarwal (MS Design ‘19)
CS341: Project in Mining Massive Datasets
Agenda
● ‘Productizing’ analytics
● Data wrangling
● Data fundamentals
● Data Studio vs Datalab vs Colab
‘Productizing’
● What does it mean to ‘productize’ your ML?
Pitfalls in Productizing
● My algorithm has 95% accuracy -- is it ready for production?
● My algorithm has 95% accuracy and 95% precision -- is it ready for production?
● My algorithm has 95% accuracy, 95% precision, and my training data is roughly sampled from real examples -- is it ready for production?
● My algorithm has 95% accuracy, 95% precision, training data sampled from real examples, and my algorithm tests hypotheses that match the use cases -- is it ready for production?
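One way to see why a single headline number may not settle these questions is to compute accuracy, precision, and recall side by side on an imbalanced test set. The sketch below is illustrative only: the labels (5 positives in 100 examples) and the degenerate classifier that never predicts the positive class are made up for this example and do not come from the lecture.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    # Convention assumed here: report 0.0 when nothing was predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical, illustrative data: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that never predicts the positive class.
always_negative = [0] * 100

acc = accuracy(y_true, always_negative)
prec = precision(y_true, always_negative)
rec = recall(y_true, always_negative)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

In this toy setup the classifier reaches 95% accuracy while finding none of the positive cases, which is why the later bullets go on to ask about the data and the use case as well as the metrics.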
Data wrangling
DATA COLLECTION FUNDAMENTALS
Key Concepts
Freshness
Quality
Structure
Cost
Quantity
Quantity
• Breadth
  • Number of entities or observations
  • E.g., people, companies, stars, shopping trips, …
  • Ideally: comprehensive
• Depth
  • Data gathered on each entity or observation
Breadth and Depth
[Figure: World Bank Development Indicators, with axes labeled Depth and Breadth]
Structure
• Structured
• Unstructured
• Semi-structured
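As a rough illustration of the three categories, the sketch below pulls the same made-up fact out of a structured CSV row, a semi-structured JSON record, and an unstructured sentence. The sample strings, field names, and the regular expression are hypothetical and chosen only for this example.

```python
import csv
import io
import json
import re

# Structured: fixed columns, every row carries the same fields.
csv_text = "name,city,age\nAda,London,36\n"
structured = list(csv.DictReader(io.StringIO(csv_text)))[0]

# Semi-structured: self-describing keys, but fields may nest or be absent.
json_text = '{"name": "Ada", "address": {"city": "London"}, "tags": ["math"]}'
semi_structured = json.loads(json_text)

# Unstructured: free text; any structure has to be extracted, here with a
# deliberately naive regular expression that only fits this one sentence.
free_text = "Ada, who is 36, lives in London."
match = re.search(r"(\w+), who is (\d+), lives in (\w+)", free_text)
unstructured = {
    "name": match.group(1),
    "age": match.group(2),
    "city": match.group(3),
}

print(structured["city"])
print(semi_structured["address"]["city"])
print(unstructured["city"])
```

The amount of code needed to reach the same field grows as the structure weakens, which is one practical reason the distinction matters when planning data wrangling.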
Graph Data
Graphs arise naturally in many settings
Many interesting techniques, e.g., PageRank, community detection
Image: Moz.com
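To make the PageRank mention concrete, here is a minimal power-iteration sketch on a tiny, hypothetical four-page link graph. The graph, the damping factor of 0.85, and the fixed iteration count are assumptions for illustration; this is a textbook-style version, not the method used in the course or by any particular library.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    graph maps each node to the list of nodes it links to. Every link
    target is assumed to also appear as a key in graph.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same "teleport" share each round.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            if out_links:
                share = damping * rank[node] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web graph (node -> pages it links to).
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

ranks = pagerank(graph)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 3))
```

Page "c" collects links from three other pages and ends up ranked highest, while "d", which nothing links to, keeps only the teleport share.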
Data Quality
• Errors
  • E.g., human labeling mistakes
• Missing data
  • E.g., missing addresses in customer records
• Bias
  • Sample bias, measurement bias, prejudice/stereotype
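A first-pass check for the missing-data case can be as simple as measuring how often a field is empty. The sketch below covers only that one bullet; the customer records and the `missing_rate` helper are hypothetical and exist only for this example.

```python
# Hypothetical customer records; None or "" marks a missing address.
customers = [
    {"id": 1, "name": "Ada", "address": "12 Elm St"},
    {"id": 2, "name": "Grace", "address": None},
    {"id": 3, "name": "Alan", "address": ""},
    {"id": 4, "name": "Edsger", "address": "7 Oak Ave"},
]


def missing_rate(records, field):
    """Fraction of records whose field is absent, None, or blank."""
    missing = sum(1 for r in records if not str(r.get(field) or "").strip())
    return missing / len(records)


rate = missing_rate(customers, "address")
print(f"missing address rate: {rate:.0%}")
```

A rate like this says nothing about why the values are missing; whether they are missing at random or for a systematic reason is a separate question that ties back to the bias bullet.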
Data Quality: Sample Bias
Day Driving vs Night Driving
Tank recognition
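A crude way to look for the kind of sample bias in the day versus night driving example is to compare how often each condition appears in the training set and in the data seen after deployment. The condition tags and counts below are invented for illustration, and the "largest share gap" is just one simple signal, not a standard test.

```python
from collections import Counter


def proportions(values):
    """Map each distinct value to its share of the list."""
    counts = Counter(values)
    total = len(values)
    return {key: count / total for key, count in counts.items()}


# Hypothetical condition tags for training images and deployment traffic.
training = ["day"] * 90 + ["night"] * 10
deployment = ["day"] * 55 + ["night"] * 45

train_share = proportions(training)
deploy_share = proportions(deployment)

# Largest absolute difference in share across all conditions.
conditions = set(train_share) | set(deploy_share)
gap = max(
    abs(train_share.get(c, 0.0) - deploy_share.get(c, 0.0)) for c in conditions
)
print(f"largest share gap: {gap:.2f}")
```

A large gap suggests the training sample under-represents conditions the model will actually face, so reported test metrics may not carry over.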
Data Quality: Prejudice/Stereotype Bias
Algorithmic Law Enforcement
The Economist, August 20, 2016
But what about perpetuating bias against minorities?
Data Quality: Measurement Bias
Data Freshness
Rate of data collection must match the rate of change of the underlying phenomenon
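The freshness rule can be illustrated with a toy simulation: a value that changes every `change_period` ticks is observed only every `collect_period` ticks, and we count how often the last observation is out of date. The function name, the tick model, and the chosen periods are all assumptions made for this sketch.

```python
def stale_fraction(change_period, collect_period, ticks=1000):
    """Fraction of ticks on which the last collected value is out of date."""
    stale = 0
    last_seen = None
    for t in range(ticks):
        # The underlying phenomenon takes a new value every change_period ticks.
        true_value = t // change_period
        # We only observe it every collect_period ticks.
        if t % collect_period == 0:
            last_seen = true_value
        if last_seen != true_value:
            stale += 1
    return stale / ticks


# Collecting faster than the phenomenon changes versus much slower.
fast = stale_fraction(change_period=10, collect_period=1)
slow = stale_fraction(change_period=10, collect_period=25)
print(f"stale when collecting every tick: {fast:.0%}")
print(f"stale when collecting every 25 ticks: {slow:.0%}")
```

When collection keeps up with the rate of change, the stored value is never stale; when it lags well behind, most reads return outdated data.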
Data manipulation in Google Cloud
● Data Studio
● Datalab
● Colab
● (offline!)
Data Studio
● Data Studio: glorified spreadsheets with a few integrations to Google Cloud to pull data
● Use cases: Excel-like functions, simple visualizations (e.g., geographic)
Datalab
● Datalab: hosted Jupyter instance with preset libraries
● Use cases: Python scripting, visualization, ML pipelining, some long-running scripting, versioned scripts and models
Colab
● Colab: shared, no-setup counterpart to Datalab, designed around sharing
● Use cases: creating publicly accessible work, collaboration, but no long-running scripting