Data Science in the Wild
Giri Iyengar
Cornell University
Jan 24, 2018
Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 1 / 27
Overview
1 Introduction
  About your Instructor
  What is Data Science?
  What will we cover in this course?
2 Class Mechanics
  Software tools
About me
EE from IIT Mumbai, India. PhD from MIT (Media Lab)
Researcher at IBM Research doing Audio-Visual Speech Recognition and Multimedia Mining
Startup No. 1 - Mobile and Social Media apps
Startup No. 2 - Big Data Machine Learning as a Service. Acquired by AOL/Verizon in 2015
Engineering Director of Merchandising, currently Head of Computer Vision, eBay
What is the excitement all about?
Harvard Business Review called it the sexiest job of the 21st century (https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/)
Mashable called it the best job in America (http://mashable.com/2016/01/20/the-best-jobs-in-america-2016/), based on a recently concluded Glassdoor annual survey
Elections Projections
Figure: Nate Silver Elections projections
A Definition of Data Science
Wikipedia
Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from large volumes of data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining and predictive analytics, as well as Knowledge Discovery in Databases (KDD).
Data scientists use their data and analytical ability to find and interpret rich data sources; manage large amounts of data despite hardware, software, and bandwidth constraints; merge data sources; ensure consistency of datasets; create visualizations to aid in understanding data; build mathematical models using the data; and present and communicate the data insights and findings.
Data Science
Figure: Drew Conway Venn Diagram
Who is a Data Scientist?
Josh Wills, Slack Data Scientist, Open Source Committer
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.
IBM Developer Works
Part Scientist, Part Artist
A broader perspective
Figure: Rob Hyndman Venn Diagram
The goal of this course
Teach skills beyond Machine Learning and Database management systems
Figure: Kernel Machine by Alisneaky, RDBMS by Scifipete
What is Machine Learning?
Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions.
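To make "building a model from example inputs" concrete, here is a minimal sketch in plain Python. It is an illustration only, not part of the original slides: the function names `fit_line` and `predict` and the example data are hypothetical, and ordinary least squares on a single feature stands in for whatever model a real project would use.

```python
def fit_line(xs, ys):
    """Learn a slope and intercept from example (x, y) pairs by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-variation of x with y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training examples (made up for illustration): y is roughly 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

model = fit_line(xs, ys)   # the "model" is built from the examples
print(predict(model, 5.0)) # a data-driven prediction for an unseen input
```

Nothing in `predict` hard-codes the relationship between x and y; the numbers it uses come entirely from the examples, which is the contrast with static program instructions that the definition draws.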
Machine Learning as per HBR
How Machines Learn (and you win) HBR
ML compared with DS
Machine Learning
1 Develop new models
2 Prove mathematical properties
3 Validate on relatively clean (possibly small) datasets
4 Publish paper

Data Science
1 Explore many models, focus on tuning
2 Understand empirical properties of models
3 Handle messy, massive datasets
4 Actionable systems
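The "explore many models, focus on tuning" point can be sketched as a small validation loop. This is a hedged illustration, not the course's method: the helper names (`knn_predict`, `validation_error`, `tune_k`) and the data are hypothetical, and one-dimensional k-nearest-neighbour regression is used only because it fits in a few lines.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points whose inputs are nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def validation_error(train, valid, k):
    """Mean squared error of k-NN predictions on held-out (x, y) pairs."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in valid) / len(valid)


def tune_k(train, valid, candidates):
    """Return the candidate k with the lowest held-out error."""
    return min(candidates, key=lambda k: validation_error(train, valid, k))


# Hypothetical data (made up for illustration).
train = [(0, 0.1), (1, 0.9), (2, 2.2), (3, 2.8), (4, 4.1), (5, 5.0)]
valid = [(0.5, 0.5), (2.5, 2.5), (4.5, 4.5)]

best_k = tune_k(train, valid, candidates=[1, 2, 3, 6])
print(best_k)
```

The choice of `k` is made empirically, by measuring error on data the model did not see, rather than by proving a property of the model. That is the difference in emphasis the two lists describe.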
DBMS compared with DS
Database Systems
1 Individual records valuable
2 Modest data volumes
3 Structured, Consistent, Auditable
4 ACID compliance

Data Science
1 Individual rows "cheap"
2 Massive data volumes
3 Structured, Unstructured, and everything in between
4 Lots of ad-hoc querying/transformations
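As a rough illustration of why database systems treat individual records as valuable, the sketch below uses Python's standard `sqlite3` module to show the atomicity part of ACID. The `accounts` table, the `transfer` helper, and the balances are hypothetical and made up for this example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between accounts atomically: both updates apply, or neither."""
    with conn:  # the connection context manager commits on success, rolls back on error
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Hypothetical in-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)])
conn.commit()

transfer(conn, "alice", "bob", 30)       # succeeds and is committed
try:
    transfer(conn, "bob", "alice", 500)  # would overdraw bob, so it is rolled back
except ValueError:
    pass

print(balances(conn))
```

The failed transfer leaves no partial debit behind. In the data science setting on the right-hand list, a single bad row among millions is usually tolerated or filtered out instead.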
The Data Science Process
Figure: Data Science Process by Farcaster
Data provides valuable insights
Figure: Seven Countries Study - Cholesterol vs Mortality
Good Data Visualization is Invaluable
Figure: Gapminder - Wealth vs Health
Good Data Visualization is Invaluable
Figure: Facebook World Connections
Good Data Visualization is Invaluable
Figure: World Bank OpenData
What makes Data Science hard?
Insufficient domain knowledge
Incorrect assumptions
Ad-hoc explanations of data patterns
Overreach
Validation/Data integrity
Complex data and modeling pipelines
Going from prototype to production
Communicating the implications
Topics Covered
Class Mechanics
Meet twice a week. Mondays, Wednesdays 4:45-6:00 PM
6 Programming assignments
1 Course Project
Software tools we’ll be using
PyTorch
Weekly Reading
Forrester Analyst Video
Hilary Mason Video
Hans Rosling TED Talk
Short History of Data Science Blog
O'Reilly Definition of Data Science