intro to data science for non-data scientists
H2O.ai Machine Intelligence
Data Science for Non-Data Scientists
Erin LeDell Ph.D.
Silicon Valley Big Data Science August 2015
H2O.ai
H2O Company
• Team: 35. Founded in 2012, Mountain View, CA
• Stanford Math & Systems Engineers
H2O Software
• Open Source Software
• Ease of Use via Web Interface
• R, Python, Scala, Spark & Hadoop Interfaces
• Distributed Algorithms Scale to Big Data
Scientific Advisory Council
Dr. Trevor Hastie
• John A. Overdeck Professor of Mathematics, Stanford University
• PhD in Statistics, Stanford University
• Co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining
• Co-author with John Chambers, Statistical Models in S
• Co-author, Generalized Additive Models
• 108,404 citations (via Google Scholar)
Dr. Rob Tibshirani
• Professor of Statistics and Health Research and Policy, Stanford University
• PhD in Statistics, Stanford University
• COPPS Presidents’ Award recipient
• Co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining
• Author, Regression Shrinkage and Selection via the Lasso
• Co-author, An Introduction to the Bootstrap
Dr. Stephen Boyd
• Professor of Electrical Engineering and Computer Science, Stanford University
• PhD in Electrical Engineering and Computer Science, UC Berkeley
• Co-author, Convex Optimization
• Co-author, Linear Matrix Inequalities in System and Control Theory
• Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
What is Data Science?
Problem Formulation
• Identify an outcome of interest and the type of task: classification / regression / clustering
• Identify the potential predictor variables
• Identify the independent sampling units
Collect & Process Data
• Conduct research experiment (e.g. Clinical Trial)
• Collect examples / randomly sample the population
• Transform, clean, impute, filter, aggregate data
• Prepare the data for machine learning (X, Y)
Machine Learning
• Modeling using a machine learning algorithm (training)
• Model evaluation and comparison
• Sensitivity & Cost Analysis
Insights & Action
• Translate results into action items
• Feed results into research pipeline
Source: marketingdistillery.com
What is Machine Learning?
What it is:
✤ “Field of study that gives computers the ability to learn without being explicitly programmed.” (Samuel, 1959)
✤ “Machine learning and statistics are closely related fields. The ideas of machine learning, from methodological principles to theoretical tools, have had a long pre-history in statistics.” (Jordan, 2014)
✤ M.I. Jordan also suggested the term data science as a placeholder to call the overall field.
What it’s not:
Unlike rules-based systems which require a human expert to hard-code domain knowledge directly into the system, a machine learning algorithm learns how to make decisions from the data alone.
Machine Learning Overview
Regression
• Predict a real-valued response (viral load, weight)
• Gaussian, Gamma, Poisson and Tweedie
• MSE and R^2
Classification
• Multi-class or Binary classification
• Ranking
• Accuracy and AUC
Clustering
• Unsupervised learning (no training labels)
• Partition the data / identify clusters
• AIC and BIC
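As a rough illustration of the metrics named on this slide, here is a minimal Python sketch of MSE, R^2 and accuracy computed by hand. The function names and the toy numbers are assumptions made for this example; they are not taken from the talk or from H2O.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average squared gap between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R^2: 1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def accuracy(labels_true, labels_pred):
    """Share of predictions that exactly match the true class label."""
    hits = sum(t == p for t, p in zip(labels_true, labels_pred))
    return hits / len(labels_true)


# Hypothetical regression example (e.g. predicted vs. actual weight).
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]
print(mse(actual, predicted), r_squared(actual, predicted))

# Hypothetical classification example.
print(accuracy(["spam", "ham", "ham", "spam"], ["spam", "ham", "spam", "spam"]))
```

Lower MSE is better, while R^2 and accuracy are better when closer to 1.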
Machine Learning Workflow
Source: NLTK
Example of a supervised machine learning workflow.
ML Model Performance
Test & Train
• Partition the original data (randomly) into a training set and a test set. (e.g. 70/30)
• Train a model using the “training set” and evaluate performance on the “test set” or “validation set.”
K-fold Cross-validation
• Train & test K models as shown.
• Average the model performance over the K test sets.
• Report cross-validated metrics.
Performance Metrics
• Regression: R^2, MSE, RMSE
• Classification: Accuracy, F1, H-measure
• Ranking (Binary Outcome): AUC, Partial AUC
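The K-fold procedure described on this slide can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names (`k_fold_indices`, `cross_validate`), the toy "predict the training mean" model and the data are all hypothetical, and a real project would normally rely on a library implementation.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle the row indices and deal them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held_out_set = set(held_out)
        train = [i for i in range(len(xs)) if i not in held_out_set]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model,
                            [xs[i] for i in held_out],
                            [ys[i] for i in held_out]))
    return sum(scores) / k


# Toy "model" (assumption for illustration): always predict the training mean.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def mse_of_mean(model, test_x, test_y):
    return sum((y - model) ** 2 for y in test_y) / len(test_y)


xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]
cv_mse = cross_validate(xs, ys, k=5, fit=fit_mean, score=mse_of_mean)
print("cross-validated MSE:", cv_mse)
```

Every row is used for testing exactly once, which is why the averaged score is usually a steadier estimate than a single 70/30 split.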
What is Deep Learning?
What it is:
✤ “A branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, composed of multiple non-linear transformations.” (Wikipedia, 2015)
✤ Deep neural networks have more than one hidden layer in their architecture. That’s what’s “deep.”
✤ Very useful for complex input data such as images, video, audio.
What it’s not:
Deep learning architectures, specifically artificial neural networks (ANNs), have been around since 1980, so they are not new. However, breakthroughs in training techniques led to their recent resurgence (mid-2000s). Combined with modern computing power, they are quite effective.
Deep Learning Architecture
Example of a deep neural net architecture.
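To make "more than one hidden layer" concrete, here is a minimal sketch of a forward pass through a network with two hidden layers. The layer sizes, weights and the choice of ReLU and sigmoid activations are assumptions picked for illustration; no training happens here, and this is not how H2O implements deep learning.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum plus a bias for each output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Non-linear transformation applied after each hidden layer."""
    return [max(0.0, v) for v in values]


def sigmoid(v):
    """Squash the final value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, layers):
    """Pass x through every hidden layer (ReLU), then a single sigmoid output unit."""
    *hidden, (out_weights, out_biases) = layers
    for weights, biases in hidden:
        x = relu(dense(x, weights, biases))
    return sigmoid(dense(x, out_weights, out_biases)[0])


# Hypothetical, hand-picked weights (a trained network would learn these).
layers = [
    ([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1]),  # hidden layer 1: 2 inputs -> 2 units
    ([[0.3, 0.8], [-0.6, 0.2]], [0.0, 0.0]),  # hidden layer 2: 2 units -> 2 units
    ([[1.0, -1.0]], [0.0]),                   # output layer: 2 units -> 1 unit
]
score = forward([1.0, 2.0], layers)
print(score)
```

Each hidden layer re-describes the output of the previous one, which is the "multiple non-linear transformations" idea from the definition.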
What is Ensemble Learning?
What it is:
✤ “Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms.” (Wikipedia, 2015)
✤ Random Forests and Gradient Boosting Machines (GBM) are both ensembles of decision trees.
✤ Stacking, or Super Learning, is a technique for combining various learners into a single, powerful learner using a second-level metalearning algorithm.
What it’s not:
Ensembles typically achieve superior model performance over singular methods. However, this comes at a price: computation time.
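The simplest way to see the ensemble idea is a majority vote over several models. The sketch below is plain voting, not Random Forest, GBM or stacking, and the three "models" are hypothetical hand-written rules standing in for trained learners.

```python
from collections import Counter


def majority_vote(models, x):
    """Ask every model for a prediction and return the most common answer."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical weak "learners", each with a different cutoff.
models = [
    lambda x: x > 3,
    lambda x: x > 5,
    lambda x: x > 7,
]

print(majority_vote(models, 6))  # two of the three models vote True
print(majority_vote(models, 4))  # two of the three models vote False
```

Every prediction now costs three model evaluations instead of one, which is the computation-time price mentioned on the slide.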
Where to learn more?
• H2O Online Training (free): http://learn.h2o.ai
• H2O Slidedecks: http://www.slideshare.net/0xdata
• H2O Video Presentations: https://www.youtube.com/user/0xdata
• H2O Community Events & Meetups: http://h2o.ai/events
• Machine Learning & Data Science courses: http://coursebuffet.com
Customers, Community, Evangelists
November 9, 10, 11, Computer History Museum
H2OWORLD.H2O.AI
20% off registration using code: h2ocommunity
Questions?
@ledell on Twitter, GitHub [email protected]
http://www.stat.berkeley.edu/~ledell
Data Science for Non-Data Scientists
aka. How the Business Views Data Science
Chen Huang
August 20, 2015
Agenda
• Introduction
• Data Science Primer
• Working with Data Scientists
• Decoding the Data Science Lingo
• Q&A
Introduction
• Who am I?
• Why am I giving this talk?
Who am I?
• Data Strategist
• Career in Business Intelligence, Analytics, and Big Data
• Various roles
  • Consultant
  • Developer
  • Business and Data Analyst
  • Product Manager
  • Functional and Technical Trainer
  • Client Services
• Worked in various industries
  • Health care, pharmaceutics, communications and high tech, consumer products, automotive, finance, government contracting
August, 2015 – San Francisco, CA
Why am I giving this talk?
July, 2011 – Beijing, China
Data Science Primer
• What can Data Science do for the Business?
• Applications of Data Science
• Data-Driven Decisions
• What does a Data Scientist do?
• Data Science Skills
What can Data Science do for the Business?
A: Data science! Extracting useful information and knowledge from large volumes of data to improve business decision-making, and giving the business the insights it needs to make data-driven decisions.
Data Business
What can Data do?
Image: http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science
Applications of Data Science
Image: http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science
Data-Driven Decisions
• Practice of basing decisions on data, rather than purely on intuition
• There is evidence that data-driven decision making and big data technologies substantially improve business performance
The Art and Science of Data Science
• Discover unknowns in data
• Obtain predictive, actionable insights
• Communicate business data stories
• Build confidence in decision making
• Create valuable Data Products that have business impact
http://www.slideshare.net/datasciencelondon/big-data-sorry-data-science-what-does-a-data-scientist-do
What does a Data Scientist do?
• Data curiosity. Explore data. Discover unknowns
• Understand data relationships
• Understand the business, has domain knowledge
• Can tell relevant stories with data
• Holistic view of the business
• Knows machine learning, statistics, probability
• Can hack and code
• Define and test a hypothesis, run experiments
• Asks good questions
http://www.slideshare.net/andrewgardner5811/big-data-and-the-art-of-data-science
Data Science Skills
Image: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Image: http://www.slideshare.net/galvanizeHQ/how-to-become-a-data-scientist-by-ryan-orban-vp-of-operations-and-expansion-galvanize
Working with Data Scientists
• Collaboration
• Data Science Cycle
• Organizational Models for Data Science Teams
Working with Data Scientists
Data Science
Business
Data Engineering
Data Science Cycle
Image: https://en.wikipedia.org/wiki/Data_science
Organizational Models for Data Science Teams
Image: http://www.slideshare.net/emcacademics/building-data-science-teams-31057129
Decoding the Data Science Lingo
Machine Learning
Data Science Definition:
• A subfield of computer science and artificial intelligence (AI) that focuses on the design of systems that can learn from and make decisions and predictions based on data.
• Machine learning enables computers to act and make data-driven decisions rather than being explicitly programmed to carry out a certain task.
• Machine Learning programs are also designed to learn and improve over time when exposed to new data.
Business Application:
• Everything!
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
Unsupervised Learning
Data Science Definition:
• Where a program, given a dataset, can automatically find patterns and relationships within the dataset.
• The business decides how fine-grained the grouping is, that is, how many categories there are.
• Clustering or grouping of like data.
• Examples: k-means clustering, hierarchical clustering
Business Application:
• Customer segmentation
• Understanding users and behaviors
• Classifying unknown and pre-defined images into categories
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
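As an illustration of the k-means example mentioned above, here is a minimal one-dimensional sketch. The function name, the starting centers and the "monthly spend" figures are assumptions made up for this example; note that the number of clusters (two here) is chosen by the person running it, not discovered by the algorithm.

```python
def k_means_1d(values, centers, n_iter=10):
    """Plain k-means on 1-D numbers, starting from the given centers."""
    centers = list(centers)
    for _ in range(n_iter):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical monthly spend per customer; no labels are provided.
spend = [5, 7, 6, 48, 52, 50]
centers, clusters = k_means_1d(spend, centers=[0.0, 100.0])
print(centers)
print(clusters)
```

With this data the customers separate into a low-spend group and a high-spend group, which is the customer segmentation use case in miniature.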
Supervised Learning
Data Science Definition:
• Where a program is “trained” on a pre-defined dataset.
• Based on its training data, the program can make accurate decisions when given new data.
Business Application:
• Classifying Twitter sentiments
• Recommender systems
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
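The "train on labeled data, then decide on new data" pattern can be sketched with a very small classifier. This uses a nearest-centroid rule chosen only for brevity; the single feature (positive words minus negative words in a tweet), the labels and the numbers are all hypothetical.

```python
def train_nearest_centroid(features, labels):
    """Training: compute the average feature value seen for each label."""
    totals, counts = {}, {}
    for x, label in zip(features, labels):
        totals[label] = totals.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}


def predict(model, x):
    """Prediction: pick the label whose average is closest to the new value."""
    return min(model, key=lambda label: abs(x - model[label]))


# Hypothetical training set: (positive words - negative words) per tweet.
features = [3, 2, 4, -2, -3, -1]
labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

model = train_nearest_centroid(features, labels)
print(predict(model, 1))   # a new tweet, not seen during training
print(predict(model, -1))
```

The labels in the training set are what make this supervised: remove them and only clustering-style methods remain available.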
Score
Data Science Definition:
• Any of a number of ways to evaluate how well the model assigns the correct class value to the test instances.
Business Application:
• Confidence gauge
Definition: https://mlcorner.wordpress.com/tag/scoring/
Score Cont.
• True Positive (TP): If the instance is positive and it is classified as positive
• False Negative (FN): If the instance is positive but it is classified as negative
• True Negative (TN): If the instance is negative and it is classified as negative
• False Positive (FP): If the instance is negative but it is classified as positive
• Classification problems:
  • Precision = proportion of instances classified as positive that are actually positive = TP/(TP+FP)
  • Accuracy = proportion of correctly classified instances = (TP+TN)/(TP+TN+FP+FN)
  • Recall or Sensitivity = proportion of all the actual positives that you correctly classify = TP/(TP+FN)
  • Specificity = classifier’s ability to identify negative results = TN/(TN+FP)
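The four counts and the four formulas above translate directly into code. The sketch below is a minimal illustration; the function names and the ten example labels are assumptions, and it does not guard against division by zero (for example, when nothing is classified as positive).

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1
        elif a and not p:
            fn += 1
        elif not a and p:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Apply the slide's formulas to the four counts."""
    return {
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical test set: 4 actual positives, 6 actual negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
metrics = classification_metrics(tp, fp, tn, fn)
print(tp, fp, tn, fn)
print(metrics)
```

Here accuracy looks healthy at 0.8, yet one in four actual positives is still missed, which is why recall and precision are reported alongside it.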
Classification
Data Science Definition:
• Sub-category of Supervised Learning
• Classification is the process of taking some sort of input and assigning a label to it. The predictions are discrete: categories, or of a “yes or no” nature.
• Examples: Logistic Regression, Random Forest
Business Application:
• Which customers should a company target with its marketing campaigns?
• Is this Nigerian prince committing fraud? (Spam classification)
• Is this actually Barack Obama’s Facebook profile and review on Amazon? (Fraud detection)
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
Regression
Data Science Definition:
• Sub-category of Supervised Learning
• Regression is a type of algorithm that predicts continuous values.
Business Application:
• How much would a user spend on a mobile game like Candy Crush?
• How much would someone spend on healthcare out of pocket?
• How many attendees will come to this event based on past registration?
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
Decision Trees
Data Science Definition:
• Using a tree-like graph or model of decisions and their possible consequences.
Business Application:
• Medical Testing (e.g. health incidences)
• Genealogy breakdowns (e.g. eye color, blood type)
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
Deep Learning
Data Science Definition:
• A category of machine learning algorithms that often use Artificial Neural Networks to generate models.
Business Application:
• Image classification
• Language processing
• Audio processing
• Outlier and fraud detection
Definition: http://blog.aylien.com/post/121281850733/10-machine-learning-terms-explained-in-simple
Questions?