Analytics and Data Mining Industry Overview

Analytics Industry Overview: To Big Data and Beyond! Gregory Piatetsky, www.KDnuggets.com/gps.html (c) KDnuggets 2011

Upload: gregory-piatetsky-shapiro

Post on 13-Jan-2015


DESCRIPTION

My keynote talk at the San Diego Superdata conference, looking at the history and current state of Analytics and Data Mining and examining the effects of Big Data.

TRANSCRIPT

Page 1: Analytics and Data Mining Industry Overview

(c) KDnuggets 2011 1

Analytics Industry Overview: To Big Data and Beyond!

Gregory Piatetsky
www.KDnuggets.com/gps.html

Page 2: Analytics and Data Mining Industry Overview

My Data Path

• PhD in applying Machine Learning to databases
• Researcher at GTE Labs – started first project on Knowledge Discovery in Databases in 1989
• Organized first 3 KDD workshops (1989-93), cofounded KDD conferences and ACM SIGKDD
• Chief Scientist at analytics startup 1998-2001
• Chair, SIGKDD, 2005-2009
• Analytics/Data Mining Consultant, 2001-

Page 3: Analytics and Data Mining Industry Overview

KDnuggets

• Stands for Knowledge Discovery Nuggets
• 1993 - started KDnuggets News email newsletter (~12,000 email subscribers now)
• early website in 1994, www.KDnuggets.com in 1997
  – 2011 best year, 45-50,000 unique visitors/month
• twitter.com/kdnuggets ~3,000 followers
• facebook.com/kdnuggets page
• group: KDnuggets Analytics & Data Mining
• Recently featured on CNN

Page 4: Analytics and Data Mining Industry Overview

KDnuggets mission

Cover Analytics and Data Mining field:
• News, Jobs, Software, Data (most popular)
• Also Academic positions, CFP, Companies, Consulting, Courses, Meetings, Polls, Publications, Solutions, Webcasts

• Subscribe to bi-weekly KDnuggets News at www.kdnuggets.com/subscribe.html

Page 5: Analytics and Data Mining Industry Overview

Analyzing Data or …

• Statistics
• Data mining
• Knowledge Discovery in Data
• KDD
• Analytics
• Data Science
• …?

Core:

Finding Useful Patterns in Data

Page 6: Analytics and Data Mining Industry Overview

History

• Statistics: 1800 –
• Data dredging, data “fishing”: 1960s
• Data Mining: 1980 –
• Database Mining ~ 1985 (was HNC trademark, not used)
• Knowledge Discovery in Data: 1989 –
  – KDD workshop in 1989
• Analytics: 2006 –
  – Google Analytics, “Competing on Analytics” book
• Data Science: 2010 –

Page 7: Analytics and Data Mining Industry Overview

Pre-history

From Google Ngram viewer – English language books

Note: Our analysis uses only English language data. Other languages, especially Chinese, need to be considered for the full picture.

Statistics is the biggest term in the 20th century, but data mining and analytics appear in the late 1990s.

Page 8: Analytics and Data Mining Industry Overview

Recent History: Analytics, Data Mining, Knowledge Discovery

• Analytics has been used since 1800, but started to rise in 2005.
• Data Mining jumps around 1996 (soon after the first KDD conference) but declines after 2003 (TIA controversy, associated with gov. invasion of privacy).
• Knowledge Discovery appears in 1989, jumps in 1996, and plateaus after 2000.

Page 9: Analytics and Data Mining Industry Overview

Google N-gram Results Are Case Sensitive

Different capitalizations change the counts, but using lowercase is probably appropriate for measuring general popularity.
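The effect of capitalization on counts can be sketched on a toy example. This is only an illustration and not how the Ngram viewer works internally; the function name `count_variants` and the sample sentence are assumptions made up for the sketch.

```python
import re
from collections import Counter


def count_variants(text, phrase):
    """Count each capitalization of `phrase` found in `text` separately."""
    # Match the phrase regardless of case, but keep the matched spelling
    # so that "Data Mining" and "data mining" are tallied separately.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return Counter(match.group(0) for match in pattern.finditer(text))


# Hypothetical sample text, not taken from any real corpus.
corpus = (
    "Data Mining is popular. Many use data mining daily. "
    "DATA MINING! data mining tools."
)

variants = count_variants(corpus, "data mining")
case_insensitive_total = sum(variants.values())

for spelling, count in variants.items():
    print(spelling, count)
print("all capitalizations:", case_insensitive_total)
```

A case-sensitive query for the lowercase form sees only part of the total here, which is why the choice of capitalization matters when comparing terms.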

Page 10: Analytics and Data Mining Industry Overview

Earliest use of “data mining” 1962?

Source: Google Books

After eliminating many “following data. Mining cost is” examples, which refer to mining of minerals, and books from “1958” that have a CD attached (errors in book year)

The earliest “data mining” reference I found is

Page 11: Analytics and Data Mining Industry Overview

Google Trends: After 2006, Data Mining < Analytics

Page 12: Analytics and Data Mining Industry Overview

Google Trends: Analytics observations

• Google Analytics introduced, Dec 2005
• Competing on Analytics book, Apr 2007
• December vacation drop

Page 13: Analytics and Data Mining Industry Overview

Half of “Analytics” searches are for “Google Analytics”

Page 14: Analytics and Data Mining Industry Overview

Excluding Google Analytics

Page 15: Analytics and Data Mining Industry Overview

Google Insights: searches for “data mining” and “analytics -google” are most popular in India, US

Page 16: Analytics and Data Mining Industry Overview

Data Mining >> Predictive Analytics

Page 17: Analytics and Data Mining Industry Overview

Business, Predictive, Text Analytics

Page 18: Analytics and Data Mining Industry Overview

Analytics > Data Mining > Data Science

Page 19: Analytics and Data Mining Industry Overview

Data Science, Big Data

Page 20: Analytics and Data Mining Industry Overview

Analytics Today

KDnuggets Polls Findings

www.KDnuggets.com/polls/

Page 21: Analytics and Data Mining Industry Overview

Where did you apply analytics/data mining? (avg 2.4 industries)

[Bar chart, horizontal axis 0% to 30%. Industries shown, in chart order: CRM/consumer analytics, Banking, Health care/HR, Fraud Detection, Direct Marketing/Fundraising, Finance, Telecom/Cable, Science, Insurance, Advertising, Education, Web usage mining, Credit Scoring, Retail, Medical/Pharma, Manufacturing, e-Commerce, Social Networks, Search/Web content mining, Government/Military, Biotech/Genomics, Investment/Stocks, Entertainment/Music, Security/Anti-terrorism, Travel/Hospitality, Social Policy/Survey analysis, Junk email/Anti-spam, Other]

www.KDnuggets.com/polls/2010/analytics-data-mining-industries-applications.html

Page 22: Analytics and Data Mining Industry Overview

Data Types Analyzed/Mined

www.KDnuggets.com/polls/2011/data-types-analyzed-mined.html

Page 23: Analytics and Data Mining Industry Overview

Data Types w. Most Growth in 2011

• location/geo/mobile data

• music / audio

• time series

• Genomics, according to John Mattison

Page 24: Analytics and Data Mining Industry Overview

Largest Dataset Analyzed?

2011 median dataset size ~10-20 GB, vs 8-10 GB in 2010.

Increase in the 10 GB to 1 PB range

www.KDnuggets.com/polls/2011/largest-dataset-analyzed-data-mined.html

Page 25: Analytics and Data Mining Industry Overview

Largest Dataset Analyzed by Region

Page 26: Analytics and Data Mining Industry Overview

Which methods/algorithms did you use for data analysis in 2011?

[Bar chart, horizontal axis: % analysts who used it, 0% to 70%. Methods shown, in chart order: Decision Trees, Regression, Clustering, Statistics, Visualization, Time series/Sequence analysis, Support Vector (SVM), Association rules, Ensemble methods, Text Mining, Neural Nets, Boosting, Bayesian, Bagging, Factor Analysis, Anomaly/Deviation detection, Social Network Analysis, Survival Analysis, Genetic algorithms, Uplift modeling]

www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html

Page 27: Analytics and Data Mining Industry Overview

Algorithms with highest Industry Affinity

www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html

Page 28: Analytics and Data Mining Industry Overview

“Academic” algorithms: lowest Industry affinity

www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html

Page 29: Analytics and Data Mining Industry Overview

Cloud Analytics is not common (yet)

www.KDnuggets.com/polls/2011/algorithms-analytics-data-mining.html

Page 30: Analytics and Data Mining Industry Overview

JOBS AND SKILLS

Page 31: Analytics and Data Mining Industry Overview

Shortage of Skills

• McKinsey: shortage by 2018 in the US of
  – 140-190,000 people with deep analytical skills
  – 1.5 M managers/analysts with the know-how to use the analysis of big data to make effective decisions.

Source: www.mckinsey.com/mgi/publications/big_data/

Page 32: Analytics and Data Mining Industry Overview

Job data: Data Scientist

Page 33: Analytics and Data Mining Industry Overview

Jobs: Data Mining >> Data Scientist

Page 34: Analytics and Data Mining Industry Overview

“Ground” Analytics (LinkedIn Skills)

~ 75,000 with Data Mining skill

~ 7,000 with Predictive Modeling

Also ~ 20,000 with Predictive Analytics (not related to Predictive Modeling??)

Page 35: Analytics and Data Mining Industry Overview

Cloud (Big Data) Analytics Skills

Page 36: Analytics and Data Mining Industry Overview

Analytics LinkedIn Skills

[Four panels: Machine Learning, Predictive Analytics, Text Mining, MapReduce]

Page 37: Analytics and Data Mining Industry Overview

Data Tsunami

• In 2010 enterprises stored 7 exabytes = 7,000,000,000 GB of new data (McKinsey)
• 90 percent of the world's data has been generated in the past two years (IBM)

Image with apologies to KDD-2011

Page 38: Analytics and Data Mining Industry Overview

Big Data Aspects?

• Volume – Terabytes to Petabytes …
• Velocity – online streaming
• Variety – numbers, text, links, images, audio, video, …

Page 39: Analytics and Data Mining Industry Overview

Volume + Velocity => No consistency

• CAP Theorem (Eric Brewer, 2000)
For highly scalable distributed systems, you can only have two of the following:
  – 1) consistency,
  – 2) high availability, and
  – 3) (network) partition tolerance (network failure tolerance)

http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

Implication: Big data solutions must stop worrying about consistency if they want high availability
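The trade-off can be sketched with a toy two-replica store. This is a deliberately simplified illustration, not a real distributed system: the class `TwoNodeStore`, the mode names "CP" and "AP", and the `partitioned` flag are all hypothetical names chosen for the sketch.

```python
class TwoNodeStore:
    """Toy key-value store with two replicas, "a" and "b".

    mode "CP": during a partition, refuse writes (consistent, not available).
    mode "AP": during a partition, accept writes locally (available, may diverge).
    """

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = {"a": {}, "b": {}}
        self.partitioned = False  # True simulates a network failure between a and b

    def write(self, node, key, value):
        """Return True if the write was accepted."""
        if not self.partitioned:
            # Healthy network: update both replicas together.
            for data in self.replicas.values():
                data[key] = value
            return True
        if self.mode == "CP":
            # Cannot reach the other replica, so reject to stay consistent.
            return False
        # AP: accept on the reachable replica only; the replicas now disagree.
        self.replicas[node][key] = value
        return True

    def read(self, node, key):
        return self.replicas[node].get(key)


cp = TwoNodeStore("CP")
cp.partitioned = True
cp_accepted = cp.write("a", "x", 1)

ap = TwoNodeStore("AP")
ap.partitioned = True
ap_accepted = ap.write("a", "x", 1)

print("CP write accepted:", cp_accepted)
print("AP write accepted:", ap_accepted)
print("AP replicas agree:", ap.read("a", "x") == ap.read("b", "x"))
```

In the "AP" run the write succeeds but the two replicas return different answers until they are reconciled, which is the consistency that is given up in exchange for availability.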

Page 40: Analytics and Data Mining Industry Overview

Big Data

• 2nd Industrial Revolution

• Do old activities better

• Create new activities/businesses

Page 41: Analytics and Data Mining Industry Overview

Application areas

• Doing old things better
  – Churn prediction
  – Direct marketing/Customer modeling
  – Recommendations
  – Fraud detection
  – Security/Intelligence
  – …

• Competition will level companies

Page 42: Analytics and Data Mining Industry Overview

Limit to Predicting Customer Behavior?

• There is fundamental randomness in human behavior, and once we find the first-level effects, more data or better algorithms will give diminishing returns in most cases

• Example: Netflix Prize: the most advanced algorithms were only a few percent better than basic algorithms

Page 43: Analytics and Data Mining Industry Overview

Direct Marketing: Random and Model-sorted Lists

[Chart: Random vs Model lists. Horizontal axis: Pct list (5 to 95). Vertical axis: CPH, Cumulative Pct Hits (0 to 100)]

5% of a random list has 5% of hits. 5% of a model-score ranked list has 21% of hits. Lift(5%) = 21%/5% = 4.2
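The lift arithmetic on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names `cumulative_pct_hits` and `lift` and the 20-record toy list are made up for the example, and the list is assumed to contain at least one hit.

```python
def cumulative_pct_hits(scores, hits, pct):
    """Fraction of all hits found in the top `pct` (0 < pct <= 1) of the
    list after ranking records by model score, highest first."""
    ranked = [h for _, h in sorted(zip(scores, hits),
                                   key=lambda pair: pair[0], reverse=True)]
    cutoff = max(1, round(len(ranked) * pct))
    return sum(ranked[:cutoff]) / sum(ranked)


def lift(cph, pct):
    """Lift = cumulative pct of hits / pct of list contacted."""
    return cph / pct


# The slide's numbers: 21% of hits in the top 5% of the model-ranked list.
slide_lift = lift(0.21, 0.05)
print("slide lift at 5%:", round(slide_lift, 2))

# Hypothetical toy list: 20 records with 4 hits (1 = hit, 0 = no hit).
scores = [i / 20 for i in range(20, 0, -1)]  # 1.0, 0.95, ..., 0.05
hits = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

for pct in (0.05, 0.25, 1.0):
    cph = cumulative_pct_hits(scores, hits, pct)
    print(f"top {pct:.0%}: CPH = {cph:.2f}, lift = {lift(cph, pct):.2f}")
```

A random list has lift 1 at every depth, and any ranking reaches lift 1 once the whole list is contacted, so the interesting comparison is at small percentages of the list.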

Page 44: Analytics and Data Mining Industry Overview

Most lift curves are surprisingly similar

Study of lift curves in banking, telecom

Best lift curves are similar

Special point T = Target percentage

Lift(T) ~ sqrt(1/T)

G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, in Proceedings of KDD-99 Conference, ACM Press, 1999.

[Chart: Actual lift(T) vs Est. lift(T). Horizontal axis: 100*T% (0 to 25). Vertical axis: Lift (0 to 14)]
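The rule of thumb Lift(T) ~ sqrt(1/T) can be tabulated directly. The sketch below reads T as the target percentage expressed as a fraction (an assumption about the slide's notation), and the function name `estimated_lift` is hypothetical; it reproduces only the estimate, not the actual lift values from the study.

```python
import math


def estimated_lift(target_fraction):
    """Rule-of-thumb lift at the special point T, with T given as a
    fraction (e.g. 0.01 for a 1% target rate)."""
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    return math.sqrt(1.0 / target_fraction)


# Tabulate the estimate over the range covered by the chart's axis.
for percent in (1, 4, 9, 16, 25):
    t = percent / 100
    print(f"T = {percent:>2}%  estimated lift = {estimated_lift(t):.1f}")
```

Under this estimate, rarer targets allow higher lift: a 1% target rate gives a lift of about 10, while a 25% target rate gives a lift of about 2.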

Page 45: Analytics and Data Mining Industry Overview

Big Data Enables New Things!

– Google – first big success of big data
– Social networks (Facebook, Twitter, LinkedIn, …): success depends on network size, i.e. big data
– Location analytics
– Health-care
  • Personalized medicine
– Semantics and AI?
  • Imagine IBM Watson, Siri in 2020?

Page 46: Analytics and Data Mining Industry Overview

Big Data Growth By Industry

Source: http://www.mckinsey.com/mgi/publications/big_data/

Page 47: Analytics and Data Mining Industry Overview

Research and Industry Disconnect?

• Uplift modeling – needs more research
• Association rules need fewer papers
• Data Mining with Privacy research – industry use?

• KDD conference aims to bring researchers and industry people together

Page 48: Analytics and Data Mining Industry Overview

Hot Growth Areas

• Social Analytics
  – Klout
  – many twitter micro-analytics (twitalyzer, TweetEffect, TweetStats)
• Mobile Analytics
  – Privacy and data tracks (KDD Lab, Pisa)

Page 49: Analytics and Data Mining Industry Overview


Big Data Bubble?


Gartner Hype Cycle

Big Data