![Page 1: DAWN: Infrastructure for Usable Machine LearningThe DAWN Stack Data Acquisition Feature Engineering Model Training Productionizing aces ms ms e … Snorkel DeepDive MacroBase(Streaming](https://reader034.vdocuments.mx/reader034/viewer/2022042407/5f21e333f2278f7a280ad449/html5/thumbnails/1.jpg)
DAWN: Infrastructure for Usable Machine Learning
Peter Bailis, Kunle Olukotun, Chris Ré, Matei Zaharia
dawn.cs.stanford.edu
It’s the Golden Age of Data
Incredible advances in image recognition, natural language processing, planning, info retrieval
Society-scale impact: autonomous vehicles, personalized medicine, real-time translation
No end in sight for advances in ML*
*for the best-funded, best-trained engineering teams
Building ML Products is Too Hard
Major successes (e.g., Siri, AlphaGo) require hundreds to thousands of engineers
Most effort in data acquisition, preparation, testing and productionizing: not just core ML!
“Only a fraction of real-world ML systems is composed of ML code”
The DAWN Question
What if anyone with domain expertise could build their own production-quality ML products?
• Without a PhD in machine learning
• Without being an expert in systems
• Without understanding the latest hardware
It’s happened before
It’s happened before: Search
Before: decades of research on information retrieval, indexes, ranking, etc.
After: any developer can add search to an app using a library (e.g. Solr); any user can use search
Key idea: end-to-end systems that tackle the barriers to access & production use
The DAWN Stack
Pipeline stages: Data Acquisition, Feature Engineering, Model Training, Productionizing
Layers: Interfaces, Algorithms, Systems, Hardware
Projects: Snorkel; DeepDive; MacroBase (Streaming Data); NoScope (Video); AutoRec, SimDex (Recommendation); Data Fusion; Mulligan (SQL+graph+ML); ModelQA; ModelSnap; …
End-to-End Compilers: Weld, Delite
New Hardware: FuzzyBit, Plasticine CGRA
Hardware targets: CPU, GPU, FPGA, Cluster, Mobile
Example: MacroBase for Continuous Analytics
End-to-end system for anomaly identification
multi-dimensional data streams → MacroBase → anomalies & explanations
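The two outputs named on this slide, anomalies and explanations, can be illustrated with a small sketch. It assumes a median-absolute-deviation (MAD) rule to flag outliers and a risk ratio to rank attribute values as explanations. Both are stand-in choices for illustration, and the function names, the `device` and `latency_ms` fields, and the threshold are hypothetical. This is not MacroBase's API.

```python
from statistics import median


def flag_outliers(values, threshold=3.0):
    """Flag values whose distance from the median exceeds threshold * MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate case: more than half of the values are identical.
        return [v != med for v in values]
    return [abs(v - med) / mad > threshold for v in values]


def risk_ratio(records, flags, attribute, value):
    """Outlier rate among records with attribute == value, divided by the rate among the rest."""
    with_out = with_total = without_out = without_total = 0
    for record, is_outlier in zip(records, flags):
        if record[attribute] == value:
            with_total += 1
            with_out += is_outlier
        else:
            without_total += 1
            without_out += is_outlier
    if with_total == 0 or without_total == 0:
        return 0.0  # Ratio is undefined; treat the attribute value as uninformative.
    if without_out == 0:
        return float("inf") if with_out else 0.0
    return (with_out / with_total) / (without_out / without_total)


# Hypothetical readings: device B produces most of the slow requests.
records = [
    {"device": "A", "latency_ms": 10},
    {"device": "A", "latency_ms": 11},
    {"device": "A", "latency_ms": 9},
    {"device": "A", "latency_ms": 10},
    {"device": "B", "latency_ms": 10},
    {"device": "B", "latency_ms": 95},
    {"device": "B", "latency_ms": 100},
    {"device": "A", "latency_ms": 12},
    {"device": "A", "latency_ms": 90},
]
flags = flag_outliers([r["latency_ms"] for r in records])
for device in ("A", "B"):
    print(device, risk_ratio(records, flags, "device", device))
```

A high risk ratio for an attribute value suggests that value is a candidate explanation for the flagged readings. A streaming system would have to maintain these statistics incrementally rather than over a fixed list.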
Too much data for manual inspection
Even harder when data is streaming
github.com/stanford-futuredata/macrobase
Early successful users: manufacturing, automotive, online video, mobile apps
Training data is the key enabler and the key barrier to entry
How can we leverage data that’s expensive to label at scale?
github.com/HazyResearch/snorkel
Snorkel’s Approach: Weak Supervision
1) User writes labeling functions (LFs): short programs that may not always give the right label
• E.g. a pattern to search for in text
2) Snorkel simultaneously learns the noise in the LFs and a noise-aware target model (e.g. LSTM)
Result: 4 hours writing labeling functions matches months of hand-labeling 10,000+ documents
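To make step 1 concrete, here is a minimal sketch of what labeling functions could look like: short pattern-based programs that vote on a label or abstain. The function names, patterns, and label constants are hypothetical, and the combination step is a plain majority vote used as a stand-in. It does not reproduce the noise model Snorkel learns in step 2, and it does not use Snorkel's API.

```python
import re
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_mentions_causes(text):
    """Vote POSITIVE when the text contains the word 'causes'."""
    return POSITIVE if re.search(r"\bcauses\b", text, re.IGNORECASE) else ABSTAIN


def lf_mentions_induced(text):
    """Vote POSITIVE when the text contains the word 'induced'."""
    return POSITIVE if re.search(r"\binduced\b", text, re.IGNORECASE) else ABSTAIN


def lf_mentions_negation(text):
    """Vote NEGATIVE when the text contains an explicit negation."""
    return NEGATIVE if re.search(r"\b(not|no evidence)\b", text, re.IGNORECASE) else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_causes, lf_mentions_induced, lf_mentions_negation]


def apply_lfs(texts):
    """Build the label matrix: one row per text, one column per labeling function."""
    return [[lf(text) for lf in LABELING_FUNCTIONS] for text in texts]


def majority_vote(votes):
    """Combine one row of votes; abstain on ties or when every function abstained."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    (top_label, top_count), *rest = counts.most_common()
    if rest and rest[0][1] == top_count:
        return ABSTAIN
    return top_label


texts = [
    "Aspirin causes stomach bleeding",
    "There is no evidence that aspirin causes headaches",
    "The weather was fine",
    "Drug-induced rash causes discomfort",
]
label_matrix = apply_lfs(texts)
weak_labels = [majority_vote(row) for row in label_matrix]
print(weak_labels)
```

The second sentence shows why simple voting is not enough: two functions disagree, so the vote abstains. Learning how reliable each function is, as step 2 describes, is what would let such conflicts be resolved.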
NoScope: Fast CNN-Based Video Queries
Opportunity: CNNs allow more accurate queries on visual data than ever
Challenge: processing 1 video in real time requires a $1000 GPU
Result: 100-3000x faster with <1% loss in accuracy via
• Model specialization
• Adaptive cascades
github.com/stanford-futuredata/noscope
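The cascade idea can be sketched in a few lines: a cheap, specialized model scores every frame, and only frames it is unsure about are passed to the expensive reference model. Everything below is a hypothetical illustration. The toy models, the frame representation, and the fixed thresholds are assumptions, and this is not NoScope's code. In an adaptive cascade the thresholds would be chosen per video and query rather than fixed by hand.

```python
def cascade_classify(frames, cheap_model, reference_model, low, high):
    """Label frames with the cheap model, deferring uncertain ones to the reference model.

    cheap_model returns a confidence score in [0, 1]; reference_model returns a bool.
    Scores at or below `low` are labeled False, scores at or above `high` are labeled
    True, and anything in between is sent to the reference model.
    """
    labels = []
    reference_calls = 0
    for frame in frames:
        score = cheap_model(frame)
        if score <= low:
            labels.append(False)
        elif score >= high:
            labels.append(True)
        else:
            labels.append(reference_model(frame))
            reference_calls += 1
    return labels, reference_calls


# Toy stand-ins: each "frame" is a single number, and both models are trivial functions.
def toy_cheap_model(frame):
    return frame


def toy_reference_model(frame):
    return frame > 0.5


frames = [0.05, 0.95, 0.45, 0.6, 0.1]
labels, reference_calls = cascade_classify(
    frames, toy_cheap_model, toy_reference_model, low=0.2, high=0.8
)
print(labels, reference_calls)
```

The speedup comes from how rarely the reference model runs. Widening the gap between `low` and `high` sends more frames to it, trading speed for accuracy.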
NoScope Results
DAWN: AI for everyone via new systems that tackle the barriers to real-world use
Whitepaper & more at dawn.stanford.edu