data science
DESCRIPTION
A quick introduction to the fascinating world of business and data analytics
TRANSCRIPT
![Page 1: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/1.jpg)
prithwis mukerjee, ph.d.
Introduction to Data Science
Prithwis Mukerjee, PhD
Praxis Business School, Calcutta
![Page 2: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/2.jpg)
Agenda
● Why data science ?
● Techniques
  ○ Statistics
  ○ Data Mining
  ○ Visualisation
● Tools & Platforms
  ○ R
  ○ Hadoop / MapReduce
  ○ Real Time Systems
● Business Domains
![Page 3: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/3.jpg)
![Page 4: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/4.jpg)
Volume
Data is being acquired from a variety of sources
● EFT in banks, credit card payments
● Cell phones
● Sensors attached to a variety of equipment
● Surveillance cameras, CCTV
● Social media updates
● Blogs
● Websites
![Page 5: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/5.jpg)
Variety / Velocity
● Numeric data
● Structured text data
● Unstructured text data
● Images
● Sound and video recordings
● Graph nodes
  ○ Social media “friends”
  ○ Websites linked to each other
Data is being generated fast and is becoming obsolete or useless equally fast
● Real-time (or near real-time) data from sensors, cameras
● Website traffic
● Social media “trends”
![Page 6: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/6.jpg)
So what is Big Data ?
● Volume
● Velocity
● Variety ?
A new term coined by IT vendors to push new technology like
● Map Reduce
● Hadoop
● NoSQL
A new way to
● collect
● store
● manage
● analyse
● visualise data
![Page 7: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/7.jpg)
Big Data is like Crude Oil { not new Oil }
Think of data as crude oil !
Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos
But what about refining ?
![Page 8: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/8.jpg)
The Science (and Art ) of Data
Think of data as crude oil !
Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos
Data Science
● Discovering what we do not know about the data
● Obtaining predictive, actionable insight
● Creating data products that have business impact
● Communicating relevant business stories
Refining
![Page 9: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/9.jpg)
Two Perspectives
[Diagram labels: Programming or “Hacking” Skills; Mathematics, Statistics Knowledge; Business Domain Knowledge; Machine Learning; Operations Research; RDBMS, ERP / BI; Data Science]
![Page 10: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/10.jpg)
10 Things {most} Data Scientists do ...
1. Ask good questions
   What is what ? We do not know ! We would like to know
2. Define and test hypotheses, run experiments
3. Scoop, scrape, sample business data
4. Wrestle and tame data
5. Play with data, discover unknowns
6. Create models, algorithms
7. Understand data relationships
8. Tell the machine how to learn from the data
9. Create data products that deliver actionable insights
10. Tell relevant business stories from data
![Page 11: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/11.jpg)
Statistics - World of Data
● Data comes in various types
  ○ Nominal - colour, gender, PIN code
  ○ Ordinal - scale of 1-10, {high, medium, low}
  ○ Interval - dates, temperature (Centigrade)
  ○ Ratio - length, weight, count
● Data comes in various structures
  ○ Structured data - nominal, ordinal, interval, ratio
  ○ Unstructured text - email, tweets, reviews
  ○ Images, voice prints
  ○ Graphs, networks - social media friendships, likes
![Page 12: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/12.jpg)
Descriptive Statistics
● Numeric Description
  ○ Mean, Median, Mode
  ○ Quartile, Percentile
  ○ Variance / Standard Deviation
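The numeric descriptions listed on this slide can be computed with a few lines of Python. This is only a sketch: the `sales` values are made up for illustration, and the standard-library `statistics` module is an assumed choice of tool (the deck itself recommends R).

```python
import statistics

# Hypothetical sample: units sold per day (made-up values for illustration)
sales = [12, 15, 15, 18, 20, 22, 30]

mean = statistics.mean(sales)                 # arithmetic average
median = statistics.median(sales)             # middle value of the sorted data
mode = statistics.mode(sales)                 # most frequent value
quartiles = statistics.quantiles(sales, n=4)  # three cut points: Q1, Q2, Q3
variance = statistics.variance(sales)         # sample variance
std_dev = statistics.stdev(sales)             # sample standard deviation

print(mean, median, mode, quartiles, variance, std_dev)
```

The median and mode are less sensitive to the single large value (30) than the mean, which is one reason to look at several of these measures together.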
![Page 13: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/13.jpg)
Statistics : The Path Ahead
Probability, Distributions
Testing of Hypothesis
Regression, Testing
Predictive Analysis
![Page 14: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/14.jpg)
Data Mining / Machine Learning
Is the process of obtaining
● novel
● valid
● potentially useful
● understandable
patterns in data
Typical tasks are
● classification
● clustering
● association rules
● sequential patterns
● regression
● deviation detection
![Page 15: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/15.jpg)
Some definitions
Instance (an item or record)
● an observation that is characterised by a number of attributes
  ○ person - with attributes like age, salary, qualification
  ○ sale - with product, quantity, price
Attribute
● a measured characteristic of an instance
Class
● grouping of an instance into
  ○ acceptable, not acceptable
  ○ mammal, fish, bird
Nominal
● colour, PIN code, state
Ordinal
● ranking : tall, medium, short or feedback on a scale of 1 - 10
Ratio
● length, price, duration, quantity
Interval
● date, temperature
![Page 16: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/16.jpg)
Data Mining : Classification
Classification
● Which loan applicant will not default on the loan ?
● Which potential customer will respond to a mailer campaign ?
![Page 17: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/17.jpg)
Classification Example
[Figure: a table with two categorical attributes, one continuous attribute and a class column; a classifier model is learned from the Training Set and then applied to the Test Set]
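The training set / model / test set flow can be sketched in Python as below. The slide does not name a learning algorithm, so a 1-nearest-neighbour classifier is swapped in here as an assumption. The records, the attribute names and the helper functions `learn`, `distance` and `classify` are all hypothetical.

```python
# Each record: (owns_home, marital_status, income_in_thousands, defaulted)
# Two categorical attributes, one continuous attribute, one class label.
# All values are made up for illustration.
TRAINING_SET = [
    ("yes", "single", 125, "no"),
    ("no", "married", 100, "no"),
    ("no", "single", 70, "no"),
    ("yes", "married", 120, "no"),
    ("no", "divorced", 95, "yes"),
    ("no", "married", 60, "no"),
    ("yes", "divorced", 220, "no"),
    ("no", "single", 85, "yes"),
    ("no", "married", 75, "no"),
    ("no", "single", 90, "yes"),
]
TEST_SET = [
    ("no", "single", 88, "yes"),
    ("yes", "married", 110, "no"),
]


def learn(training_set):
    """'Learn' a 1-nearest-neighbour model: keep the records and the income range."""
    incomes = [record[2] for record in training_set]
    return {"records": training_set, "income_range": max(incomes) - min(incomes)}


def distance(a, b, income_range):
    """Count categorical mismatches and add the income gap scaled to 0..1."""
    mismatches = (a[0] != b[0]) + (a[1] != b[1])
    return mismatches + abs(a[2] - b[2]) / income_range


def classify(model, record):
    """Predict the class of the closest training record."""
    nearest = min(
        model["records"],
        key=lambda r: distance(r, record, model["income_range"]),
    )
    return nearest[3]


model = learn(TRAINING_SET)
predictions = [classify(model, record) for record in TEST_SET]
accuracy = sum(p == r[3] for p, r in zip(predictions, TEST_SET)) / len(TEST_SET)
print(predictions, accuracy)
```

The class labels of the test set are used only to measure accuracy, never to build the model; that separation is the point of holding out a test set.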
![Page 18: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/18.jpg)
Data Mining : Clustering
Given a set of unclassified data points, how to find a natural grouping within them
● Can we segment the market in some way that is not yet known ?
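As a rough illustration of finding a natural grouping, the sketch below runs k-means on one-dimensional data. The slide does not prescribe an algorithm, so k-means is an assumed choice; the `spend` values, the starting centroids and the function name `k_means` are hypothetical.

```python
def k_means(points, centroids, iterations=10):
    """Plain 1-D k-means: assign each point to its nearest centroid, then re-centre."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per customer (made-up values)
spend = [10, 12, 11, 50, 52, 49, 95, 100, 98]
centroids, clusters = k_means(spend, centroids=[0, 60, 120])
print(centroids)
print(clusters)
```

No labels are supplied: the three segments emerge from the data alone, which is what distinguishes clustering from classification.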
![Page 19: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/19.jpg)
Example of Document Clustering
Clustering points : 3204 articles from the Los Angeles Times
Similarity Measure : How many words are common in these documents ( after excluding some common words )
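A similarity measure of this kind can be sketched as a count of shared words after dropping common ones. The stop-word list, the sample sentences and the function names below are assumptions made for illustration; they are not taken from the Los Angeles Times study.

```python
# Tiny illustrative stop-word list; a real one would be much longer
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}


def content_words(text):
    """Lower-case the text and keep the distinct words that are not stop words."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}


def shared_word_count(doc_a, doc_b):
    """Similarity = number of distinct non-stop words the two documents share."""
    return len(content_words(doc_a) & content_words(doc_b))


doc_finance = "The bank raised interest rates in the market"
doc_markets = "Interest rates moved the stock market"
doc_sports = "The team won the final match"

print(shared_word_count(doc_finance, doc_markets))
print(shared_word_count(doc_finance, doc_sports))
```

Documents with a higher shared-word count are treated as closer together, so the two finance-related sentences would fall into the same cluster before either joins the sports sentence.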
![Page 20: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/20.jpg)
Clustering of S&P Stock Data
● Observe Stock Movements every day.
● Clustering points: Stock-{UP/DOWN}
● Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
● We used association rules to quantify a similarity measure.
![Page 21: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/21.jpg)
Regression
● Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
  ○ Extensively studied in statistics and neural network fields.
● Examples:
  ○ Predicting sales amounts of a new product based on advertising expenditure.
  ○ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  ○ Time series prediction of stock market indices.
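For the linear case, the sales-versus-advertising example can be sketched with ordinary least squares on a single predictor. The numbers are made up (and deliberately lie exactly on a straight line), and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend and resulting sales (same arbitrary units)
advertising = [1, 2, 3, 4, 5]
sales = [12, 14, 16, 18, 20]

slope, intercept = fit_line(advertising, sales)
predicted_sales = intercept + slope * 6  # forecast for an unseen spend level
print(slope, intercept, predicted_sales)
```

Real data would scatter around the line, and the fitted slope would then be an estimate of how much sales change per extra unit of advertising.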
![Page 22: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/22.jpg)
Data Mining : Association Rules Mining
Association Rules
● which products should be kept along with other products
● which two products should never be discounted together
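Questions like these are usually answered with two standard measures of a rule, support and confidence. The sketch below computes both for a handful of made-up shopping baskets; the basket contents and function names are assumptions for illustration.

```python
# Hypothetical shopping baskets (made-up data)
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also hold the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"diapers", "beer"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
print(rule_support, rule_confidence)
```

In this toy data three of the five baskets contain both diapers and beer, and three of the four diaper baskets also contain beer, which would suggest shelving the two products near each other.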
![Page 23: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/23.jpg)
Visualisation : The need to tell a story
![Page 24: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/24.jpg)
Visualisation : The need to tell a story
![Page 25: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/25.jpg)
Definitions
Data Mining
● Is the process of extracting unknown, valid and actionable information from large databases and using this to make business decisions
● Non-trivial process of identifying valid, novel, potentially useful and understandable / explainable patterns in data
Data Science is a rare combination of multiple skills that include
● Technology : obviously !
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles
![Page 26: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/26.jpg)
![Page 27: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/27.jpg)
R : Your first step into Data Science
Try out this free interactive tutorial just now
![Page 28: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/28.jpg)
Statistical Tools
http://r4stats.com/articles/popularity/
![Page 29: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/29.jpg)
Some Comparisons
![Page 30: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/30.jpg)
Map Reduce
● Input : A set of (key, value) pairs
● User supplies two functions
  ○ Map (k,v) => List(k1,v1)
  ○ Reduce (k1, list(v1)) => v2
● Output is the set of (k1,v2) pairs
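The contract above can be simulated on a single machine with a word count, the customary first Map Reduce example. This is a sketch only: `run_map_reduce` is a hypothetical in-memory driver standing in for what a framework such as Hadoop does across a cluster, and the input lines are made up.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map (k, v) => list of (k1, v1): emit (word, 1) for every word in a line."""
    return [(word, 1) for word in value.lower().split()]


def reduce_fn(key, values):
    """Reduce (k1, list(v1)) => v2: add up the counts collected for one word."""
    return sum(values)


def run_map_reduce(pairs, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for k, v in pairs:
        for k1, v1 in map_fn(k, v):
            grouped[k1].append(v1)  # the "shuffle" step: collect values per key
    return {k1: reduce_fn(k1, v1_list) for k1, v1_list in grouped.items()}


# Input: (line number, line text) pairs, made up for illustration
lines = [(0, "big data is big"), (1, "data science refines data")]
word_counts = run_map_reduce(lines, map_fn, reduce_fn)
print(word_counts)
```

Because each call to `map_fn` and `reduce_fn` depends only on its own arguments, the calls can run on different machines, which is what makes the model suitable for large clusters.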
![Page 31: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/31.jpg)
Hadoop
A programming framework that allows you to run Map-Reduce jobs on a distributed cluster of low cost machines without having to bother about anything except
● the Map and Reduce functions
● loading data into HDFS
1. HIVE
   a. A plug-in that allows one to use SQL-like queries that are converted into map-reduce jobs
2. PIG
   a. A scripting language for writing long queries
3. HBASE
   a. A non-relational DBMS
4. SQOOP
   a. Moves data to and from HDFS
![Page 32: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/32.jpg)
Data-in-Flight
![Page 34: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/34.jpg)
Business Domain
● Financial Sector
  ○ Risk Management, Credit Scoring
  ○ Predict Customer Spend
  ○ Stock and Investment Analysis
  ○ Loan approval
● Telecom Sector
  ○ Fraud Detection
  ○ Churn Prediction
● Retail and Marketing
  ○ Market segmentation
  ○ Promotional strategy
  ○ Market Basket Analysis
  ○ Trend Analysis
● Healthcare & Insurance
  ○ Fraud Detection
  ○ Drug Development
  ○ Medical Diagnostic Tools
![Page 35: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/35.jpg)
Conclusion
Data Science is a rare combination of multiple skills that include
● Technology : obviously !
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles
● Why data science ?
● Techniques
  ○ Statistics
  ○ Data Mining
  ○ Visualisation
● Tools & Platforms
  ○ R
  ○ Hadoop / MapReduce
  ○ Real Time Systems
● Business Domains
![Page 36: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/36.jpg)
![Page 37: Data Science](https://reader033.vdocuments.mx/reader033/viewer/2022060108/555075d7b4c905cc0f8b503d/html5/thumbnails/37.jpg)
Thank You
Contact
Prithwis Mukerjee
Professor, Praxis Business School
[email protected]
This presentation is accessible at the blog http://blog.yantrajaal.com at the following URL
http://bit.ly/pm-datascience