introduction to (big) data science

Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be

Data Science Company

Introduction to (big) data science

InfoFarm - Seminar 30/09/2014

Agenda

• About us

• What is Data Science?

• Data Science in practice
  – Models
  – Tools

• Case study


About us

InfoFarm - Company

• Data Science and BigData startup

• Part of the Cronos group
  – Largest independent IT services supplier in Belgium
  – Organized in limited-sized, highly focused competence centers
  – 3000+ consultants

• Incubated at Xplore Group, within the context of:
  – Java
  – PHP
  – e-commerce (Hybris, Intershop, Magento, DrupalCommerce, ...)
  – Mobile development (iOS, Android, ...)
  – Web development (HTML5, CSS3, ...)

InfoFarm - Team

• Mixed skills team
  – 2 Data Scientists
    • Mathematics
    • Statistics
  – 4 BigData Consultants
  – 1 Infra specialist
  – n Cronos colleagues with various backgrounds

• Certifications
  – CCDH - Cloudera Certified Developer for Apache Hadoop
  – CCAH - Cloudera Certified Administrator for Apache Hadoop
  – OCJP - Oracle Certified Java Programmer

InfoFarm - Focus

• Mission
  – “Help our customers to excel in their business activities by providing them with new information and insights of high business value. Identifying, extracting and using data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business value from it.”
• Focus Domains
  – Data Science
  – Machine Learning
  – Big Data

Introduction: what is Data Science?

What is Data Science?

• Data Science & Business decisions

• Data Science vs …
  – Statistics
  – Business Intelligence
  – Big Data

• What can Data Science do for your business?

• The Data Science maturity model

Business decisions

• Any business requires continuous decision making
  – Will we offer this customer a discount or not?
  – Do we need to keep extra stock for product X?
  – How do we answer this customer question?
  – At which supplier do we buy this product?
  – With which solution will we respond to this RFP?
  – Do we need to replace device X?
  – …
• The possible answers to these questions are based on prior experience with the business
• Each decision can turn out to be right or wrong; business knowledge should help avoid the wrong ones

Business decisions

– However …
  • Do you really know your business that well?
  • Hasn’t it evolved in this fast-changing world?
  • Are you sure your competitors aren’t making better decisions?
– You probably own a lot more information than you might realize!
  • All your business processes are generating data which you can use to your advantage!
  • Quotes you made vs deals you won
  • Historical sales records
  • Web logs showing user activity
  • Social media activity referring to your brand/product
  • Metering info on devices (internet of things)
  • …

Types of Data

– Proprietary data
  • ERP, CRM, Orders, Customers, Products, etc…
– “Dark Data”: currently unused, and the company may not even be aware of it
  • Unknown, but present in the company
  • Cost-efficient BigData tools might enable business cases using this data
– External data
  • Websites, social media, open data, …
– Data still to be captured
  • “If only we knew X or Y” …
– There might be a huge added value in “mashing up” proprietary data with public/open data!

Business Knowledge vs Data Science (Intuitive knowledge vs data-driven decisions)

Business Knowledge
• Acquired by experience
• (Assumed) insights
• RISK: too strong a bias towards past experience and gut feeling

Data Science
• Complementary to business knowledge
• Confirmative or new insights
• Data-driven decision making
• RISK: too naive data interpretation, disconnected from the business


Business decisions: marketing example

• Example: We want to send mailings about our new product

• Decisions to make:
  – Which mail to send to which customers?
  – We need customer segmentation!
• Risks in failing to do this correctly
  – Missing opportunities (not informing customers)
  – Annoying customers with irrelevant mailings (churn, reputation damage, …)
• Business knowledge based approach
  – “We know our segments: -25y, 25y-35y, 35y+ groups, and male/female”
  – But is this (still) true?
  – E.g.: do we really want to send an ad for the new iPhone to a long-time Android user because he’s a 30-something male customer?
• Data-driven approach: can we identify different segments automatically? (machine learning!)
  – WEB SERVER LOGS: Which customers have already looked at a similar product on our website?
  – ORDER HISTORY: Which customers own complementary products?
  – CRM INFORMATION: What is the typical profile of a customer that clicked through on the last e-mail campaign for a similar product?
  – …
• Business knowledge and Data Science become in- and output for each other!
  – Ideas/hypotheses and data to be examined should be identified from business knowledge!
  – A/B testing can be applied to test approaches and check results
  – Let the data speak for itself! New business insights are generated

Being a Data Scientist

• “Data Scientist: the sexiest job of the 21st century” - Thomas H. Davenport

• Data Scientist: “A person who is better at statistics than any software engineer and better at software engineering than any statistician” - Josh Wills

Data Science = team work!

Data Science vs Statistics

• Basic Statistics concepts
  – Reliability and validity
  – Probability
  – Descriptive statistics and graphics
• Inferential statistics (and hypothesis testing)
  – Probability distributions
  – Populations and samples
  – Confidence intervals
  – Correlation
• Data Science
  – Link with IT (tooling, scale, …)
  – Data preparation & hacking (get data from databases, websites, …)
  – Machine learning and automation
  – Working interactively together with business

Data Science vs Business Intelligence

• Basic BI concepts: structuring data to report and query upon it
  – DWH, OLAP, ETL processes
  – Star and snowflake schemas
  – Query-oriented architectures
  – Close to typical IT development cycle
• Data Science: working and experimenting with data to gain insights
  – Exploratory working
  – Work in a research cycle rather than a development cycle
  – Limited investment towards analysis that might or might not deliver
  – Tools designed to avoid heavy ETL (loosely structured data)
  – Eventually valuable analyses can be ported to BI systems

Data Science vs Business Intelligence

• Using tools that are designed to support exploratory working
  – Not requiring strict up-front schema design
  – Allowing fast and cheap hypothesis testing
  – Opening up opportunities to quickly integrate many data sources
    • Excel files, Text files, Word Documents
    • Log files
    • Relational databases
    • Sensor data
    • Timeseries data
    • ...
• Integrations with online (OLTP) and analytical (OLAP/BI) systems
  – Typically for automating repetitive analysis and reporting outputs

Data Science vs Big Data

• Process of statistical inference: sampling & induction

• BigData allows:
  – N=ALL (avoid sampling errors)
    • Sampling issues can be overcome by just processing ALL available data (process massive data)
  – N=1 (avoid issues with non-homogeneous datasets)
    • Categorization becomes true personalisation: project towards ONE individual (calculate per
    • Significance considerations are not applicable!


What can Data Science do for your business?

• Extract meaning from data
  – Using and combining data in ways that have never been done before
  – Finding patterns and correlations in data from all possible sources
  – Detecting anomalies and changes in known patterns
• Transform data of various types into valuable information
  – As a basis for management decisions
  – As a basis for data products
  – That can improve your business in any way

• Build and integrate Data Products– Recommendation engines, Prediction models, Automated classification,

• The key point is spotting opportunities to outperform your competitors using any data available!

Scientific cycle

Question → Hypothesis → Experiment (data) → Analyse results → Conclusion

• This is NOT a development cycle!

• Experimentation vs engineering

• Because it is a science, the outcome cannot be predicted in advance

• This makes it hard to integrate into an IT development process

Scientific cycle

• Take small steps

• Formulate hypotheses

• Actually build things

• Apply A/B testing

• Even without success, you learned something!
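As an illustration of the "Apply A/B testing" step, here is a minimal sketch of how two variants could be compared. It assumes each variant is summarised by a visitor count and a conversion count, and it uses a two-proportion z-test with a normal approximation; the slides do not prescribe a specific test, and the function name and the numbers are hypothetical.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variants A and B.

    Returns the z statistic and a two-sided p-value under the
    normal approximation (reasonable for large sample sizes).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the hypothesis "no difference".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 1000 visitors per variant.
z, p_value = two_proportion_z(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value suggests the difference between the variants is unlikely to be pure chance; whether the difference matters for the business remains a judgment call.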

The Data Science maturity model

• Don’t run before you can walk: the Data Science maturity model
  Each level builds on the quality of the underlying step. It’s science, not magic …
  – Start off by simply collecting the data you need (type, quantity, quality)
  – Then report on your current business (confirmative analysis)
  – Discover new and valuable information (exploratory analysis)
  – Build and test prediction models (predictive analysis)
  – Steer your business based on advice output from your predictions (data-driven)

Levels: Collect → Describe → Discover → Predict → Advise

This is where the hype around BigData and Data Science generates unrealistic expectations!

The Data Science maturity model

Phase: Collect
  Actions: Logging information; gathering data from different sources
  Examples in commerce: Logging user actions on a website; using loyalty cards to identify customers

Phase: Describe
  Actions: Explorative data analysis; basic analytical functions
  Examples in commerce: Checking quantity and quality of data; typical reporting; correlating data over sources

Phase: Discover
  Actions: Finding correlations; building models
  Examples in commerce: Finding similarly behaving customers

Phase: Predict
  Actions: Building prediction models; formulating expectations for the future based on past info
  Examples in commerce: Predict sales figures for a new product; predict whether a certain customer will or will not buy a certain product

Phase: Advise
  Actions: Use prediction models to evaluate decision possibilities and pick the best
  Examples in commerce: Target advertising to the right customer groups to optimize revenue

Data Science in practice

Overview

• Tools: R, Hive, Pig

• Modeling methods & statistics: Decision trees, Naive Bayes, Regression, Nearest Neighbor, K-means clustering, Apriori, …

Tools – Data Science

• Analytics: R
• Visualisation: Shiny
• Docs: Markdown
• Data retrieval
  – CSV, TAB, ... files
  – Apache Hive
• Data processing
  – Apache Pig
• Open Source based

Tools – Machine Learning

• Apache Mahout
• Apache Spark MLlib
• Open Source based

Tools - BigData

• Hadoop
  – HDFS
  – MapReduce
  – Pig
  – Hive
  – Oozie
  – Impala
  – ...
• Spark
  – Shark, SparkR
• Platforms
  – Open Source Apache Hadoop
  – CDH - Cloudera (partnership at Cronos level)
  – HDP - Hortonworks Data Platform

Tools - HDFS

Tools – MapReduce : Wordcount

Stages: Input → Splitting → Mapping → Shuffling → Reducing → Output

Mapping and Reducing are your own code; splitting, shuffling and writing the output are handled by the framework.
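The word count stages can be imitated in a few lines of plain Python. This is only a single-machine sketch of the idea, not Hadoop code: the function names are hypothetical, and a real cluster would run the map and reduce steps on different nodes.

```python
from collections import defaultdict


def split_input(text, n_splits):
    """Framework step: cut the input into chunks of lines."""
    lines = text.splitlines()
    size = max(1, -(-len(lines) // n_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_words(lines):
    """User code: emit a (word, 1) pair for every word in one split."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_splits):
    """Framework step: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for pairs in mapped_splits:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_counts(word, counts):
    """User code: sum the values collected for one word."""
    return word, sum(counts)


def word_count(text, n_splits=2):
    splits = split_input(text, n_splits)
    mapped = [map_words(split) for split in splits]  # parallel on a cluster
    grouped = shuffle(mapped)
    return dict(reduce_counts(w, c) for w, c in grouped.items())


sample = "Deer Bear River\nCar Car River\nDeer Car Bear"
counts = word_count(sample, n_splits=3)
print(counts)
```

Only `map_words` and `reduce_counts` correspond to code a developer writes; the other functions stand in for what the framework does.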

Modeling methods & statistics

• Basic patterns
  – Recommendations: based on known taste, propose items that might be liked as well
  – Clustering: detecting correlation groups in data without using pre-defined segmentation based on business knowledge
  – Classification: automated labeling, acceptance/rejection of data based on probability models
• Supervised & unsupervised learning methods
  – k-means, naive Bayes, k-nearest neighbors, random forests, logistic regression, Apriori, ...

Modeling methods: Decision Tree

• Query: which kind of fruit am I looking at?
  – More general: image recognition
• Clean your data
  – What to do with missing values?
    • Insert average value
    • Insert special value
    • Delete data
  – What to do with outliers?
    • Wrong data?
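The three options for missing values can be sketched as follows. The example assumes missing values are represented as `None`; the function names, the marker value and the sample weights are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Option 1: replace missing values by the average of the known ones."""
    known = [v for v in values if v is not None]
    average = mean(known)
    return [average if v is None else v for v in values]


def impute_special(values, marker=-1):
    """Option 2: replace missing values by a special marker value."""
    return [marker if v is None else v for v in values]


def drop_missing(values):
    """Option 3: delete the observations that have no value."""
    return [v for v in values if v is not None]


weights = [150, None, 170, 160]  # hypothetical fruit weights in grams
print(impute_mean(weights))
print(impute_special(weights))
print(drop_missing(weights))
```

Which option is appropriate depends on the data: deleting rows is the simplest choice but shrinks the dataset, while a special value only makes sense if the model can treat it as its own category.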

Modeling methods: Decision Tree

• Find the most decisive variable
  – Categorical variable: one leaf for each category, or one leaf for a group of categories
  – Numerical variable: find the best cut-off(s)

Example split on Color: Green / Yellow / Red
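Finding the best cut-off for a numerical variable can be sketched as below. The slides do not say which quality measure is used; this example assumes Gini impurity, a common choice, and tries every midpoint between two consecutive distinct values. The sizes and labels are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when all labels in the group are identical."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_cutoff(values, labels):
    """Return (cut-off, weighted impurity) of the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut-off possible between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (cut, score)
    return best


sizes = [1, 2, 2, 8, 9, 10]  # hypothetical diameters in cm
fruits = ["grape", "grape", "grape", "apple", "apple", "apple"]
print(best_cutoff(sizes, fruits))
```

A tree builder would run this for every numerical variable, compare the scores with those of the categorical splits, and keep the most decisive one.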

Modeling methods: Decision Tree

• For each leaf, repeat the process:
  – Size is actually numerical: find size cut-offs

[Figure: tree under construction; node labels: Query, Yellow, Big, Medium, Small, Round, Thin]

Modeling methods: Decision Tree

[Figure: finished decision tree for the fruit query; branch labels include Yellow and Medium; leaves: Watermelon, Green apple, Grapes, Grapefruit, Banana, Cherry; caption: Try it]

Modeling methods: Decision Tree - Distributed

• A big advantage of big data tools is their distributed processing power (running processes in parallel)
• Build your decision tree
  – Each leaf can be processed by another node
  – All your data should still be available to every mapper
• Upgrading your decision tree
  – Bagging trees (sampling your data)
  – Random Forest (sampling your variables)
  – Every mapper only has to read a part of your data
  – In general this still gives better results than a single decision tree

Modeling methods: Decision Tree

• QUESTION: Can we predict whether a customer will place an order during this web session?
• Modeling (data mining)
  – Input: historical surfing information
  – Decision tree algorithm
    • Look at historical data
    • Find most decisive variable
    • For each leaf, repeat
  – Avoid overfitting!
• Runtime usage
  – Pass current info through the tree model
  – Allow certain discounts to increase conversion?
  – Put user on checkout or in-store after putting product in basket?

[Figure: fitted decision tree with splits Date_added > 1.5, Hour_added > 16.29 and Date_added < 5.113; leaf values 0.06, 0.1136, 0.1829 and 0.3273]

Modeling methods: Naive Bayes

• QUESTION: Will I play tennis today?

• Start with labeled data from the past

• Again: clean your data!

• Often used with plain text

• Assumes that each variable is independent of all others, given the class

• Named after Bayes’ rule (statistics)

Modeling methods: Naive Bayes

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Modeling methods: Naive Bayes

• Consider the PlayTennis problem and a new instance (sun, cool, high, strong)

Modeling methods: Naive Bayes

• Estimate parameters
  – P(yes) = 9/14, P(no) = 5/14
  – P(Wind=strong|yes) = 3/9
  – P(Wind=strong|no) = 3/5
  – …
• We have
  P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = 9/14 · 2/9 · 3/9 · 3/9 · 3/9 ≈ 0.005
  P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n) = 5/14 · 3/5 · 1/5 · 4/5 · 3/5 ≈ 0.021
• Therefore this new instance is classified as “no”
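The calculation above can be reproduced with a short script. This is a minimal sketch that counts frequencies in the PlayTennis table and multiplies them; it applies no smoothing, so an unseen value would give a probability of zero. The function and variable names are illustrative.

```python
from collections import Counter, defaultdict

# Columns: Outlook, Temperature, Humidity, Wind, PlayTennis (D1..D14).
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def train(rows):
    """Count how often each class and each (column, class, value) occurs."""
    class_counts = Counter(row[-1] for row in rows)
    cond_counts = defaultdict(Counter)  # (column index, class) -> value counts
    for row in rows:
        label = row[-1]
        for i, value in enumerate(row[:-1]):
            cond_counts[(i, label)][value] += 1
    return class_counts, cond_counts


def score(instance, label, class_counts, cond_counts):
    """P(label) multiplied by P(value | label) for every column."""
    p = class_counts[label] / sum(class_counts.values())
    for i, value in enumerate(instance):
        p *= cond_counts[(i, label)][value] / class_counts[label]
    return p


class_counts, cond_counts = train(DATA)
new_instance = ("Sunny", "Cool", "High", "Strong")
scores = {label: score(new_instance, label, class_counts, cond_counts)
          for label in class_counts}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The scores are not normalised probabilities; only their relative size matters for picking the class.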

Modeling methods: Naive Bayes - distributed

• Vectorisation of training data (more or less wordcount) can easily be distributed:
  – Each text goes to one mapper
  – Even when dealing with a large text: cut your text into pieces
  – Every small block of data is only read once, by one mapper
• Vectorisation of your new instance
• The actual prediction is a multiplication of all conditional probabilities, so the calculation of the prediction is also easy to distribute

Modeling methods: Naive Bayes

• QUESTION: Can we route incoming questions (free text) to the right person/department?
• Modeling (data mining)
  – Input: historical information on questions and the handling person/department
  – Naive Bayes algorithm
    • For each word or n-gram (2 or 3 words): count occurrences per file
    • Very valuable are words with high frequency in a single document
    • Very valuable are words only used in a small number of documents
    • Remove stopwords, generic words, etc…
• Runtime usage
  – Vectorize incoming document (which words/n-grams occur how many times?)
  – Predict category based on comparison with historical documents
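The vectorisation step can be sketched as below. The example counts words and 2-grams after removing stopwords, and then weights each term so that terms frequent in one document but rare across documents score highest. The slides do not name a weighting scheme; TF-IDF is assumed here as one common way to express those two bullets. The stopword list and the sample questions are hypothetical, and punctuation handling is left out.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "is", "my", "to", "i", "for"}  # tiny illustrative list


def ngrams(text, max_n=2):
    """Return all 1..max_n word n-grams of a text, stopwords removed."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    grams = []
    for n in range(1, max_n + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams


def vectorize(text):
    """Count how many times each word or n-gram occurs in one document."""
    return Counter(ngrams(text))


def tf_idf(docs):
    """Weight term counts by how rare the term is across all documents."""
    vectors = [vectorize(doc) for doc in docs]
    doc_freq = Counter(term for vector in vectors for term in vector)
    n_docs = len(docs)
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in vector.items()}
        for vector in vectors
    ]


questions = ["the invoice is overdue", "invoice payment", "password reset"]
weights = tf_idf(questions)
print(vectorize(questions[0]))
print(weights[0])
```

In this toy example "overdue" gets a higher weight than "invoice" in the first question, because "invoice" also appears in another document.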

Modeling methods: k-means Clustering

• QUESTION: Which countries have the same type of food consumption?
• Your data is not labeled!
• You define labels for your clusters after applying the cluster algorithm
• Choose the number of clusters you are expecting
  – Try different numbers of clusters
  – Run an algorithm to decide the optimal number of clusters
• Plot your final results mapped on your principal components

Modeling methods: k-means Clustering

    Country         RedMeat  WhiteMeat  Eggs  Milk  Fish  Cereals  Starch  Nuts  Fr.Veg
 1  Albania            10.1        1.4   0.5   8.9   0.2     42.3     0.6   5.5     1.7
 2  Austria             8.9       14.0   4.3  19.9   2.1     28.0     3.6   1.3     4.3
 3  Belgium            13.5        9.3   4.1  17.5   4.5     26.6     5.7   2.1     4.0
 4  Bulgaria            7.8        6.0   1.6   8.3   1.2     56.7     1.1   3.7     4.2
 5  Czechoslovakia      9.7       11.4   2.8  12.5   2.0     34.3     5.0   1.1     4.0
 6  Denmark            10.6       10.8   3.7  25.0   9.9     21.9     4.8   0.7     2.4
 7  E Germany           8.4       11.6   3.7  11.1   5.4     24.6     6.5   0.8     3.6
 8  Finland             9.5        4.9   2.7  33.7   5.8     26.3     5.1   1.0     1.4
 9  France             18.0        9.9   3.3  19.5   5.7     28.1     4.8   2.4     6.5
10  Greece             10.2        3.0   2.8  17.6   5.9     41.7     2.2   7.8     6.5
11  Hungary             5.3       12.4   2.9   9.7   0.3     40.1     4.0   5.4     4.2
12  Ireland            13.9       10.0   4.7  25.8   2.2     24.0     6.2   1.6     2.9
13  Italy               9.0        5.1   2.9  13.7   3.4     36.8     2.1   4.3     6.7
14  Netherlands         9.5       13.6   3.6  23.4   2.5     22.4     4.2   1.8     3.7
15  Norway              9.4        4.7   2.7  23.3   9.7     23.0     4.6   1.6     2.7
16  Poland              6.9       10.2   2.7  19.3   3.0     36.1     5.9   2.0     6.6
17  Portugal            6.2        3.7   1.1   4.9  14.2     27.0     5.9   4.7     7.9
18  Romania             6.2        6.3   1.5  11.1   1.0     49.6     3.1   5.3     2.8
19  Spain               7.1        3.4   3.1   8.6   7.0     29.2     5.7   5.9     7.2
20  Sweden              9.9        7.8   3.5  24.7   7.5     19.5     3.7   1.4     2.0
21  Switzerland        13.1       10.1   3.1  23.8   2.3     25.6     2.8   2.4     4.9
22  UK                 17.4        5.7   4.7  20.6   4.3     24.3     4.7   3.4     3.3
23  USSR                9.3        4.6   2.1  16.6   3.0     43.6     6.4   3.4     2.9
24  W Germany          11.4       12.5   4.1  18.8   3.4     18.6     5.2   1.5     3.8
25  Yugoslavia          4.4        5.0   1.2   9.5   0.6     55.9     3.0   5.7     3.2

Modeling methods: k-means Clustering

• Define a metric: take every variable into account as much as all other variables
• Create random starting points (as many as clusters you expect)
• Assign each point to the closest center (or starting) point
• Calculate the center of each cluster
• Iterate the previous two steps
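The steps above can be sketched in plain Python. This minimal version assumes Euclidean distance on unscaled values and takes the starting centers as an argument instead of drawing them at random, so the run is reproducible. The four sample points are the (RedMeat, Fish, Fr.Veg) values of Denmark, Norway, Albania and Romania from the dataset; the function names are illustrative.

```python
import math


def distance(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centers):
    """Give every point the index of its closest center."""
    return [min(range(len(centers)), key=lambda k: distance(p, centers[k]))
            for p in points]


def update(points, labels, centers):
    """Move every center to the mean of the points assigned to it."""
    new_centers = []
    for k, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centers.append(old_center)  # keep the center of an empty cluster
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers


def k_means(points, centers, max_iter=100):
    """Alternate assignment and update until the centers stop moving."""
    labels = assign(points, centers)
    for _ in range(max_iter):
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
        labels = assign(points, centers)
    return labels, centers


# (RedMeat, Fish, Fr.Veg) for Denmark, Norway, Albania, Romania.
points = [(10.6, 9.9, 2.4), (9.4, 9.7, 2.7), (10.1, 0.2, 1.7), (6.2, 1.0, 2.8)]
labels, centers = k_means(points, centers=[points[0], points[2]])
print(labels)
```

To follow the first bullet more closely, each column could be rescaled (for example to zero mean and unit variance) before clustering, so that no single variable dominates the distance.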

Modeling methods: k-means clustering


Modeling methods: k-means Clustering

"cluster 1"
Country         RedMeat  Fish  Fr.Veg
Albania            10.1   0.2     1.7
Bulgaria            7.8   1.2     4.2
Romania             6.2   1.0     2.8
Yugoslavia          4.4   0.6     3.2

"cluster 2"
Country         RedMeat  Fish  Fr.Veg
Denmark            10.6   9.9     2.4
Finland             9.5   5.8     1.4
Norway              9.4   9.7     2.7
Sweden              9.9   7.5     2.0

"cluster 3"
Country         RedMeat  Fish  Fr.Veg
Czechoslovakia      9.7   2.0     4.0
E Germany           8.4   5.4     3.6
Hungary             5.3   0.3     4.2
Poland              6.9   3.0     6.6
USSR                9.3   3.0     2.9

"cluster 4"
Country         RedMeat  Fish  Fr.Veg
Austria             8.9   2.1     4.3
Belgium            13.5   4.5     4.0
France             18.0   5.7     6.5
Ireland            13.9   2.2     2.9
Netherlands         9.5   2.5     3.7
Switzerland        13.1   2.3     4.9
UK                 17.4   4.3     3.3
W Germany          11.4   3.4     3.8

"cluster 5"
Country         RedMeat  Fish  Fr.Veg
Greece             10.2   5.9     6.5
Italy               9.0   3.4     6.7
Portugal            6.2  14.2     7.9
Spain               7.1   7.0     7.2

Modeling methods: k-means Clustering - distributed

• Calculate conditional chances
  – Every mapper only needs one variable
• Assigning points to clusters:
  – All centers in distributed cache
  – Rest of the data only read once by one mapper
  – Calculate distances and assign to the closest center
• Update center points
  – One mapper for each cluster

Modeling methods: k-means Clustering

• QUESTION: In which different segments can we split our customer base?
• Modeling (data mining)
  – Input: any information on the customers (CRM, ERP, Social Media, …)
  – Very important to find columns to use (requires business knowledge to formulate hypotheses!)
  – K-means clustering algorithm
    • Define a “distance” formula to calculate how close two customers are to each other
    • Define starting points for each cluster center
    • Iterate and re-allocate customers to a cluster, move cluster centers
• Runtime usage
  – Quickly check the cluster in which a new customer could be residing

Modeling methods: Apriori

• QUESTION: Which books might be interesting for you, knowing which books you have read?
• Modeling (data mining)
  – Input: all titles of books someone has read
  – Make sure that the same books have the same titles (e.g.: drop the edition from the title)
  – Apriori algorithm
    • Make baskets of read books, labeled with the reader
    • Identify commonly occurring books
    • Tweak your recommendation rules:
      – Choose a large enough support
      – Confidence of recommendations can be calculated
      – The bigger the lift, the more valuable your recommendation might be for the reader
• Runtime usage
  – Check if a subset of the books occurs as the left-hand side of a rule
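The three rule metrics mentioned above can be computed as follows. This sketch only evaluates a given rule; it does not implement the candidate generation of the full algorithm. The baskets and book titles are made up for illustration.

```python
def support(baskets, items):
    """Fraction of baskets that contain all of the given items."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, lhs, rhs):
    """Of the baskets containing lhs, the fraction that also contains rhs."""
    return support(baskets, set(lhs) | set(rhs)) / support(baskets, lhs)


def lift(baskets, lhs, rhs):
    """How much more often rhs occurs with lhs than it does overall."""
    return confidence(baskets, lhs, rhs) / support(baskets, rhs)


# Hypothetical reading history: one set of titles per reader.
baskets = [
    {"Book A", "Book B"},
    {"Book A", "Book B", "Book C"},
    {"Book A", "Book C"},
    {"Book D"},
]

rule_lhs, rule_rhs = {"Book A"}, {"Book B"}
print(support(baskets, rule_lhs | rule_rhs))
print(confidence(baskets, rule_lhs, rule_rhs))
print(lift(baskets, rule_lhs, rule_rhs))
```

A lift above 1 means the two sides of the rule occur together more often than would be expected if they were unrelated, which is why a bigger lift suggests a more valuable recommendation.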

Modeling methods: Apriori

• Data consists of books bought online

• There were more than 40000 users buying more than one book (If they only bought one book, they are not useful to make your model)

• In total they bought more than 220000 books

• Notice the permutations in the rules

• As you might expect, sequel books are bought together

Modeling methods: Apriori

Modeling methods: Apriori - distributed

• Make list of books bought together (training data)
  – Similar to n-grams (Naive Bayes)
  – Every customer is only read once, by one mapper
• Make recommendations
  – Every mapper handles a number of rules

Modeling methods: Apriori

• QUESTION: Which ads can I show on a website?
• Modeling (data mining)
  – Input: all visited links, all bought items, …
  – Decide what you think is important: you want to show items others were also interested in, items others also bought, …
  – Apriori algorithm
    • Find items which occur together
    • Define the support, confidence and lift you want
• Runtime usage
  – Check if a subset of the visited links occurs as the left-hand side of a rule

Case study

End: Wrap up & Lunch
