data science methodology - ibm · data science the interest in data science • solve problems and...
TRANSCRIPT
![Page 2: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/2.jpg)
2
“Every day, we create 2.5 quintillion bytes of data —so much that 90% of the data in the world today has been created in the last two years alone.”
![Page 3: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/3.jpg)
Datascience
Theinterestindatascience
• Solveproblemsandanswerquestionsusingdata
• Goaltoimprovefutureoutcomes
3
Whatisthedatascienceprocess?
![Page 4: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/4.jpg)
CRISP-DMMethodologydiagram
4
Business Understanding
Data Understanding
Data Preparation
Analytic Approach
Data Requirements
Data Collection
Modeling
Evaluation
Deployment
Feedback
CrossIndustryStandardProcessforDataMining
![Page 5: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/5.jpg)
Everyprojectbeginswithbusinessunderstanding.
• Projectobjective?• Businesssponsorsplaythemostcriticalrole• Whatarewetryingtodo– whatisthegoal?• Howdoyoudefine“success”andhowcanyoumeasureit?
5
1.Businessunderstanding
Business Understanding
![Page 6: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/6.jpg)
6
1.Businessunderstanding
Business Understanding
Traffic:
Problem:Trafficcongestionwastestimeandmoney
Clearquestion:Howcanweoptimizetrafficlightdurationusingdataontrafficpatterns,weather,andpedestriantraffic?
Measurableoutcomes:
- %decreaseincommutetime
- %decreaseinlength/durationoftrafficjams
![Page 7: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/7.jpg)
2.AnalyticApproach
• Expressproblemincontextofstatisticalandmachinelearningtechniques• Regression:
• “Predictingrevenueinthenextquarter?”• Classification:
• “DoesthispatienthavecancerA,cancerB,oraretheyhealthy?”• Clustering:
• “Aretheregroupsofusersthatseemtobehavesimilarlytoeachother?”• Recommendation/Personalization:
• “HowcanItargetdiscountstospecificcustomers?”• OutlierDetection
7
Business Understanding
Analytic Approach
![Page 8: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/8.jpg)
8
• Linearregression• Logisticregression• Clustering
• K-means• Hierarchical• Density-based
• ClassificationTrees• RandomForests• Neuralnetworks
• Textmining(naturallanguageprocessing)
• Principalcomponentanalysis
• SupportVectorMachines• HiddenMarkovModels• …
Statistical/machinelearningtechnique(s)
![Page 9: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/9.jpg)
Data Understanding
Data Requirements
Data Collection
Datacompilation
• Thechosenanalyticapproachdeterminesthedatarequirements.• Content,formats,representations
• Initialdatacollection isperformed.• AvailableData?• Obtaindata?• Revisedatarequirementsorcollectmoredata?
• Thendataunderstanding isgained.• Initialinsightsaboutdata• Descriptivestatisticsandvisualization• Additionaldatacollectiontofillgaps,ifneeded
9
![Page 10: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/10.jpg)
#1Whatcanyoutellmeaboutthisdata?
10
![Page 11: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/11.jpg)
#2Whatcanyoutellmeaboutthisdata?
11
![Page 12: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/12.jpg)
#3Whatcanyoutellmeaboutthisdata?
12
![Page 13: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/13.jpg)
#4Whatcanyoutellmeaboutthisdata?
13
![Page 14: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/14.jpg)
ImportanceofVisualization
14
Sameproperties:mean(x)=9mean(y)=7.5y=3.00+0.500xcorr(x,y)=0.816
Anscombe's Quartet
![Page 15: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/15.jpg)
CRISP-DMMethodologydiagram
15
Business Understanding
Data Understanding
Data Preparation
Analytic Approach
Data Requirements
Data Collection
Modeling
Evaluation
Deployment
Feedback
![Page 16: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/16.jpg)
Datapreparation
• Datapreparation encompassesallactivitiestoconstructandcleanthedataset.• Datacleaning
• Missingorinvalidvalues
• Eliminatingduplicaterows
• Formattingproperly
• Combiningmultipledatasources• Transformingdata• Featureengineering• Textanalysis
• Acceleratedatapreparationbyautomatingcommonsteps
16
Data Preparation
• Arguablythemosttime-consumingstep• “80%oftheentireDSprocessisin
datacleaningandpreparation”
![Page 17: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/17.jpg)
Modeling
Modeling
• Modeling:• Developingpredictiveordescriptivemodels• Maytryusingmultiplealgorithms
• Highlyiterativeprocess
17
Data Preparation
Evaluation
![Page 18: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/18.jpg)
K-meansClustering
Groupsimilarcuisinestogetherintok numberofclusters.
Example:Clustering
![Page 19: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/19.jpg)
Groupsimilarcuisinestogetherintok numberofclusters.
k =3
K-meansClustering
Example:Clustering
![Page 20: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/20.jpg)
Example:Clustering
20
[Age:18,Sex:M,BMI:23,Exercise:Frequent,Hobbies:Golf,…]
[Age:45,Sex:F,BMI:28,Exercise:Frequent,Hobbies:Baseball,…]
[Age:83,Sex:F,BMI:25,Exercise:Sedentary,Hobbies:Gymnastics,…]
[Age:28,Sex:M,BMI:23,Exercise:Normal,Hobbies:Softball,…]
[Age:30,Sex:F,BMI:25,Exercise:Normal,Hobbies:Golf,…]
[Age:15,Sex:M,BMI:22,Exercise:Frequent,Hobbies:Golf,…]
Model
CLUSTERA
CLUSTERB
CLUSTERC
CLUSTERB
CLUSTERA
CLUSTERA
![Page 21: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/21.jpg)
Example:Classification
21
[Age:32,Sex:M,BMI:23,Exercise:Frequent,… ,Condition:Disorder1]
[Age:45,Sex:F,BMI:28,Exercise:Frequent,… ,Condition:Healthy]
[Age:63,Sex:F,BMI:21,Exercise:Sedentary,… ,Condition:Disorder2]
Model
[Age:48,Sex:M,BMI:23,Exercise:Sedentary,… ,Condition:________]
Disorder1
![Page 22: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/22.jpg)
Modelevaluation
• Modelevaluation isperformedduringmodeldevelopmentandbeforemodeldeployment.• Understandthemodel’squality• Ensurethatitproperlyaddressesthebusinessproblem
• Diagnosticmeasures• Suitabletothemodelingtechniqueused• Training/Testingset• Refinemodelasneeded
• Statisticalsignificancetests
22
Evaluation
Modeling
![Page 23: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/23.jpg)
Deploymentandfeedback• Oncefinalized,themodelisdeployed intoaproductionenvironment.
• Maystartinalimited/testenvironment• Involvesotherroles:
• Solutionowner
• Marketing
• Applicationdevelopers
• ITadministration
• Getting Feedback:• Howwelldidthemodelperform?• Iterativeprocessformodelrefinementandredeployment• A/Btesting
23
Deployment
Feedback
BigDataUniversity:• Inactive->Active
![Page 24: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/24.jpg)
CRISP-DMMethodologydiagram
24
Business Understanding
Data Understanding
Data Preparation
Analytic Approach
Data Requirements
Data Collection
Modeling
Evaluation
Deployment
Feedback
![Page 25: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/25.jpg)
25
“All models are wrong but some are useful” – George Box, Statistician
![Page 26: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/26.jpg)
26
Variable#503
Varia
ble#5
03
Variable#503
![Page 27: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/27.jpg)
27
Variable#503
Varia
ble#5
03
Variable#503
![Page 28: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/28.jpg)
28
Variable#503
Varia
ble#5
03
Variable#503
![Page 29: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/29.jpg)
29
![Page 30: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/30.jpg)
LearningMoreAboutDataScienceWherecanyoulearnmoreaboutdatascience?
30
![Page 31: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/31.jpg)
31
![Page 32: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/32.jpg)
32
![Page 33: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/33.jpg)
33
![Page 34: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data](https://reader033.vdocuments.mx/reader033/viewer/2022041718/5e4cd2abd47eb429d720ef73/html5/thumbnails/34.jpg)
34
BigDataUniversity.comFreecourses!• DataScience• BigData• DataEngineering
Earnbadges!
Learnanytime!
Foryourorganizations• Wecancreatededicatedportals
foryouremployeestogainskillsindatascience