Shipping Data Science Products!
Turning raw data into valuable services
BudapestBI Forum 2015
License: CC By Attribution
Ian Ozsvald @IanOzsvald ModelInsight.io
[email protected] @IanOzsvald BudapestBI Forum October 2015
Who Am I?
● “Industrial Data Science” for 15 years
● Data Product Builder
● O'Reilly Author
● Teacher at PyCons
Who are you?
● Type A(nalysis) or B(uilding)
● Robert Chang - “Doing Data Science at Twitter”
What frustrations do we share?
● Lack of useful data
● Biggest time sink - cleaning & transforming
● Conservative management
● How can we derisk projects?
● Medium Data
● luckily we have Wes in the room
Which projects succeed?
● Explain existing data (visualisation!)
● Automate repetitive/slow processes (higher accuracy, more repeatable)
● Augment data to make new data (e.g. for search engines and ML)
● Predict the future (e.g. replace human intuition or use subtler relationships)
Visualising data
● Most data isn't interesting...
● Requires human curation + detective skills to get the good stuff
● Couple a researcher + a business person
Medical data (anti-allergy)
Perceived complexity might make sign-off more difficult...
Medical data (anti-allergy)
Predict using:
● food
● alcohol
● pollen
● pollution
● location
● cats
● ...
Extracting data from binary files
● Copy/pasting PDF/PNG data is laborious
● How can we scale it?
● textract/Tika - unified interface
● Specialised tools e.g. Sovren
● This might take months!
Augmenting data
● Identifying people, places, brands, sentiment
● “i love my apple phone”
● Context-sensitive (e.g. movies vs products)
● Build custom machine-learned tools
● Augment job titles
● Reconcile the same order in 2 tables
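One way to read "reconcile the same order in 2 tables" is as a key-normalisation problem: the same order appears with slightly different ids in each table. A minimal sketch of that idea follows. The order ids, the table layout and the `normalise_order_id` rule are all hypothetical and are not taken from the talk; real data will usually need fuzzier matching.

```python
import re


def normalise_order_id(raw):
    """Reduce an order id to upper-case letters and digits (hypothetical rule)."""
    return re.sub(r"[^A-Z0-9]", "", raw.upper())


def reconcile(table_a, table_b, key="order_id"):
    """Pair rows from two tables whose normalised order ids agree.

    Returns (matched pairs, rows only in A, rows only in B).
    Assumes ids are unique within each table after normalisation.
    """
    index_b = {normalise_order_id(row[key]): row for row in table_b}
    matched, only_a = [], []
    for row in table_a:
        norm = normalise_order_id(row[key])
        if norm in index_b:
            # pop() so that whatever remains in index_b is unmatched
            matched.append((row, index_b.pop(norm)))
        else:
            only_a.append(row)
    only_b = list(index_b.values())
    return matched, only_a, only_b


# Made-up rows, for illustration only
sales = [
    {"order_id": "ab-1001", "total": 20.0},
    {"order_id": "AB-1002", "total": 5.0},
]
shipping = [
    {"order_id": "AB 1001 ", "carrier": "post"},
    {"order_id": "ab1003", "carrier": "courier"},
]
matched, only_sales, only_shipping = reconcile(sales, shipping)
```

The unmatched lists are often the more useful output: they are the rows a person needs to look at.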
Machine Learning
● PyMC (Markov Chain Monte Carlo)
Please cite these projects! (it helps their funding)
Debugging Machine Learning?
● Thoughts from you?
● No obvious tools to show me:
● these examples were well-fitted
● these always wrongly-fitted
● these always uncertain
● No data-diagnostics to validate inputs (e.g. for Logistic Regression)
● No visualisers for most of the models
● Your hard-won knowledge -> new debug tools? (PLEASE!)
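The three groups above (well-fitted, always wrongly-fitted, always uncertain) could be approximated by collecting each example's held-out predicted probability over several cross-validation runs and bucketing on the average. This is only a sketch of one possible diagnostic, not a tool from the talk: the function name `bucket_examples`, the `margin` threshold and the probabilities are hypothetical, and it assumes a binary classifier.

```python
from statistics import mean


def bucket_examples(probabilities, labels, margin=0.2):
    """Sort examples into 'well_fitted', 'wrongly_fitted' and 'uncertain'.

    probabilities: {example_id: [P(class 1) from each held-out run]}
    labels: {example_id: true label, 0 or 1}
    margin: hypothetical half-width of the 'uncertain' band around 0.5
    """
    buckets = {"well_fitted": [], "wrongly_fitted": [], "uncertain": []}
    for example_id, probs in probabilities.items():
        avg = mean(probs)
        if abs(avg - 0.5) < margin:
            # the model never commits either way on this example
            buckets["uncertain"].append(example_id)
        elif (avg > 0.5) == (labels[example_id] == 1):
            buckets["well_fitted"].append(example_id)
        else:
            buckets["wrongly_fitted"].append(example_id)
    return buckets


# Made-up probabilities from three held-out runs per example
probabilities = {
    "a": [0.9, 0.95, 0.85],
    "b": [0.1, 0.2, 0.15],
    "c": [0.45, 0.6, 0.5],
}
labels = {"a": 1, "b": 1, "c": 0}
buckets = bucket_examples(probabilities, labels)
```

Reading the raw rows in the "wrongly_fitted" bucket is usually the quickest way to spot mislabelled or dirty inputs.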
Debugging Machine Learning?
Roelof Pieters PyDataLondon2015
Delivery: Keep It Simple (Stupid!)
● We're (probably) not publishing the best result
● Debuggability is key - 3am Sunday CTO beeper alert is no time for complexity
● “cult of the imperfect” Watson-Watt
● Dumb models + clean data beat other combinations
Don't Kill It!
● Your data is missing, it is poor and it lies
● Missing data kills projects!
● Log everything!
● Make data quality tools & reports
● More data -> desynchronisation
● R&D != Engineering
● Discovery-based
● Success and failure equally useful
engarde
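A "data quality report" can start very small. The sketch below counts missing cells per column in a CSV using only the standard library. The function name, the list of missing-value markers and the sample data are hypothetical; it is a starting point under those assumptions, not the tooling used in the talk.

```python
import csv
import io


def missing_value_report(csv_text, missing_markers=("", "NA", "null")):
    """Count missing cells per column in CSV text.

    missing_markers is a hypothetical list; adjust it to the markers
    your own data sources actually use.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in reader.fieldnames:
            value = row[name]
            # DictReader fills short rows with None
            if value is None or value.strip() in missing_markers:
                missing[name] += 1
    return row_count, missing


# Made-up sample data
sample = "user,age,city\nann,34,London\nbob,,NA\ncat,29,\n"
rows, report = missing_value_report(sample)
```

Running a report like this on every new data delivery, and logging the result, makes it visible when a source quietly starts dropping fields.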
Internal deployment
● CSVs/Reports
● Database updates
● IPython Notebook (not secure though!)
Deploying live systems
● Spyre (locked-down)
● Microservices
● Flask is my go-to tool
● Swagger docs
● (git pull / fabric / provisioned machines)
● Docker + Amazon ECS
Python Deployment
● Make Python modules (setup.py)
● python setup.py develop # symlink
● Unit tests + coverage
● Use a config system (e.g. github.com/ianozsvald/python_template_with_config)
● Keep Separation of Concerns!
● “12 Factor App” useful ideas
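One "12 Factor App" idea that pairs naturally with a config system is reading configuration from environment variables, so the same code runs unchanged in development and production. The sketch below illustrates that idea only; it is not the approach of the repository linked above, and the variable names (`APP_DATABASE_URL` and so on) and defaults are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    database_url: str
    log_level: str
    debug: bool


def load_config(environ=None):
    """Build a Config from environment variables, with development defaults.

    The variable names (APP_DATABASE_URL etc.) are hypothetical.
    Passing a dict instead of os.environ keeps the function easy to unit test.
    """
    env = os.environ if environ is None else environ
    return Config(
        database_url=env.get("APP_DATABASE_URL", "sqlite:///dev.db"),
        log_level=env.get("APP_LOG_LEVEL", "INFO"),
        debug=env.get("APP_DEBUG", "0") == "1",
    )


# Development: nothing set, so the defaults apply
dev = load_config({})
# Production: values come from the deployment environment
prod = load_config(
    {"APP_DATABASE_URL": "postgresql://db/prod", "APP_LOG_LEVEL": "WARNING"}
)
```

Keeping all settings in one frozen object also helps with separation of concerns: the rest of the code receives a `Config` and never reads the environment itself.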
Some common gotchas
● MySQL UTF8 is 3 byte by default #sigh
● JavaScript months are 0-based (not 1)
● Never compromise on datetimes (ISO 8601)
● iOS NSDate's epoch is 2001
● Windows CP1252 text (strongly prefer UTF8)
● MongoDB no_cursor_timeout=True (PyMongo)
● GitHub's 100MB file limits (new Large File Storage)
● Never throw data away! Never overwrite original data! Always transform it (e.g. Luigi)
● Data duplication bites you in the end...
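Three of these gotchas are easy to show in a few lines. The sketch below converts an NSDate-style offset (seconds since 2001-01-01 UTC) to an ISO 8601 string, shifts a JavaScript month to a calendar month, and re-encodes CP1252 bytes as UTF-8. The helper names and sample values are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Apple's NSDate reference date is 2001-01-01 00:00:00 UTC,
# not the Unix epoch of 1970-01-01
NSDATE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def nsdate_to_iso8601(seconds_since_2001):
    """Convert an NSDate-style offset in seconds to an ISO 8601 UTC string."""
    return (NSDATE_EPOCH + timedelta(seconds=seconds_since_2001)).isoformat()


def js_month_to_calendar(js_month):
    """JavaScript's Date.getMonth() returns 0-11; calendars use 1-12."""
    return js_month + 1


def cp1252_to_utf8(raw_bytes):
    """Re-encode Windows CP1252 bytes as UTF-8."""
    return raw_bytes.decode("cp1252").encode("utf-8")


iso = nsdate_to_iso8601(86400)  # one day after the NSDate reference date
month = js_month_to_calendar(0)  # JavaScript's January
utf8 = cp1252_to_utf8(b"\x93quoted\x94")  # CP1252 curly quotes
```

Treating such an offset as a Unix timestamp would put the date 31 years too early, which is why an explicit, timezone-aware ISO 8601 string is the safer interchange format.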
(Perhaps) Avoid Big Data
● Don't be in a rush - 50,000 lines of good data will beat a pile of Bad Big Data
● 244GB RAM EC2+many Xeons $2.80/hr
“Data Science Delivered”
● New mini project / pamphlet
● Includes dirty data strategies, ways to debug ML, thoughts on managing projects - 15 yrs experience (please critique and file bugs!)
● https://github.com/ianozsvald/data_science_delivered
● Please give me your feedback
Closing
● Tell me your dirty data stories, perhaps in a Ruin Pub? (I am automating some of this)
● Take-home - Keep it clean, keep it simple
● Come talk about your projects at our PyDataLondon monthly meetup or start your own!