Data Mining in Practice: Techniques and Practical Applications Junling Hu May 14, 2013


Page 1: Data Mining in Practice: Techniques and Practical Applications

Junling Hu

May 14, 2013

Page 2: What is data mining?

Mining patterns from data

Is it statistics?
• Functional form?
• Computation speed concern?
• Data size
• Variable size

Is it machine learning?
• Big data issue
• New methods: network mining

Page 3: Examples of data mining

• Frequently bought together
• Movie recommendation

Page 4: More examples of data mining

• Keyword suggestions
• Genome & disease mining
• Heart monitoring

Page 5: Overview of data mining

• Frequent pattern mining
• Machine learning
  • Supervised
  • Unsupervised
• Stream mining
• Recommender system
• Graph mining
• Unstructured data
  • Text, audio
  • Image and video
• Big data technology

Page 6: Frequent Pattern Mining

Diaper and Beer

• Product assortment
• Click behavior
• Machine breakdown

?

Page 7: The case of Amazon

User  Items
1     {Princess dress, crown, gloves, t-shirt}
2     {Princess dress, crown, gloves, pink dress, t-shirt}
3     {Princess dress, crown, gloves, pink dress, jeans}
4     {Princess dress, crown, gloves, pink dress}
5     {crown, gloves}

• Count frequency of co-occurrence
• Efficient algorithm
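The co-occurrence count on this slide can be sketched in a few lines. This is a brute-force pair count over the five baskets in the table, not the efficient algorithm the slide alludes to; the function name `count_pairs` and the lowercase item spellings are choices made for this illustration.

```python
from collections import Counter
from itertools import combinations

# Baskets copied from the slide's User/Items table.
baskets = [
    {"princess dress", "crown", "gloves", "t-shirt"},
    {"princess dress", "crown", "gloves", "pink dress", "t-shirt"},
    {"princess dress", "crown", "gloves", "pink dress", "jeans"},
    {"princess dress", "crown", "gloves", "pink dress"},
    {"crown", "gloves"},
]


def count_pairs(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


pair_counts = count_pairs(baskets)
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Crown and gloves appear together in all five baskets, which makes them the strongest "frequently bought together" candidate in this toy data.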

Page 8: Machine Learning Process

Page 9: Machine Learning

• Supervised
• Unsupervised (clustering)

Page 10: Binary classification

Checking  Duration (years)  Savings ($k)  Current Loans  Loan Purpose  Risky?
Yes       1                 10            Yes            TV            0
Yes       2                 4             No             TV            1
No        5                 75            No             Car           0
Yes       10                66            No             Repair        1
Yes       5                 83            Yes            Car           0
Yes       1                 11            No             TV            0
Yes       4                 99            Yes            Car           0

• Input features: the first five columns
• Output class: the last column (Risky?)
• Data point: one row of the table

Page 11: Classification (1)

Decision tree
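As a rough illustration of how a decision tree might pick its first split, the sketch below scores the three categorical columns of the loan table from the binary classification slide by Gini impurity. The slides do not say which splitting criterion was used, so Gini is an assumption; the numeric columns (duration, savings) are left out to keep the example short, and the names `gini` and `split_impurity` are made up for this sketch.

```python
from collections import Counter, defaultdict

# Categorical columns of the loan table from the binary classification slide.
# Each row: (checking, current_loans, purpose, risky)
rows = [
    ("Yes", "Yes", "TV", 0),
    ("Yes", "No", "TV", 1),
    ("No", "No", "Car", 0),
    ("Yes", "No", "Repair", 1),
    ("Yes", "Yes", "Car", 0),
    ("Yes", "No", "TV", 0),
    ("Yes", "Yes", "Car", 0),
]
FEATURES = {"checking": 0, "current_loans": 1, "purpose": 2}


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_impurity(rows, column):
    """Size-weighted Gini impurity after splitting on one categorical column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


scores = {name: split_impurity(rows, col) for name, col in FEATURES.items()}
best_feature = min(scores, key=scores.get)  # lower impurity is better
print(best_feature, round(scores[best_feature], 3))
```

On these seven rows the loan purpose gives the purest groups, so a tree built this way would split on it first and then recurse into each group.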

Page 12: Classification (2): Neural network

• Perceptron
• Multi-layer neural network
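A single perceptron can be sketched as follows. The slide shows only the name, so the training rule here is the textbook error-driven update, and the logical AND data set is a toy assumption chosen because it is linearly separable.

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Train a single perceptron with the classic error-driven update rule."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            # Weights change only when the prediction is wrong (error != 0).
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


def predict(weights, bias, x):
    """Step activation: output 1 when the weighted sum is positive."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


# Toy data (an assumption, not from the slides): the logical AND function.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
predictions = [predict(weights, bias, x) for x, _ in and_samples]
print(predictions)  # → [0, 0, 0, 1]
```

A multi-layer network stacks several such units with smooth activations and trains them jointly, which this sketch does not attempt.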

Page 13: Head pose detection

Page 14: Support Vector Machine (SVM)

• Search for a separating hyperplane
• Maximize margin
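To make "maximize margin" concrete, the sketch below measures the geometric margin of two candidate hyperplanes on a toy data set and compares them. It does not train an SVM; the points, the two hyperplanes, and the function name `geometric_margin` are all assumptions for illustration.

```python
import math


def geometric_margin(w, b, points):
    """Smallest signed distance from any labeled point to the hyperplane w.x + b = 0.

    A positive result means the hyperplane separates the two classes.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        label * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, label in points
    )


# Toy data (an assumption): two points per class, labels are +1 / -1.
points = [((2, 2), 1), ((3, 3), 1), ((0, 0), -1), ((1, 0), -1)]

# Two hypothetical separating hyperplanes to compare.
margin_a = geometric_margin((1, 1), -2.5, points)
margin_b = geometric_margin((0, 1), -1.0, points)
print(margin_a, margin_b)
```

Both hyperplanes separate the classes, but the first leaves more room on each side; an SVM searches over all separating hyperplanes for the one where this quantity is largest.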

Page 15: Perceived advantage of SVM

Transform data into higher dimension
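The idea of moving to a higher dimension can be shown with an explicit feature map on one-dimensional data. This is a hand-written lift, not a kernel computation, and the data, the map x → (x, x²), and the helper names are assumptions for this sketch.

```python
# Toy 1-D data (an assumption): the positive class sits on both sides of the
# negative class, so no single threshold on x can separate them.
samples = [(-2.0, 1), (-1.0, -1), (0.0, -1), (1.0, -1), (2.0, 1)]


def feature_map(x):
    """Lift a 1-D point into 2-D by adding its square as a second coordinate."""
    return (x, x * x)


def separates(w, b, mapped_samples):
    """True if the hyperplane w.z + b = 0 puts every sample on its label's side."""
    return all(
        label * (sum(wi * zi for wi, zi in zip(w, z)) + b) > 0
        for z, label in mapped_samples
    )


mapped = [(feature_map(x), label) for x, label in samples]

# In the lifted space the horizontal line x^2 = 2.5 separates the classes.
lifted_ok = separates((0.0, 1.0), -2.5, mapped)

# In the original space, try a threshold between every pair of neighbors,
# in both orientations; none of them works.
thresholds = [-1.5, -0.5, 0.5, 1.5]
original_ok = any(
    separates((sign,), -sign * t, [((x,), label) for x, label in samples])
    for t in thresholds
    for sign in (1.0, -1.0)
)
print(lifted_ok, original_ok)  # → True False
```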

Page 16: Applications of SVM: Spam Filter

Input features:

• Transmission
  • IP address -- 167.12.24.555
  • Sender URL -- one-spam.com
• Email header
  • From -- “[email protected]”
  • To -- “undisclosed”
  • cc
• Email body
  • # of paragraphs
  • # of words
• Email structure
  • # of attachments
  • # of links

Page 17: Logistic regression

Advantages:
• Simple functional form
• Can be parallelized
• Large scale
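The "simple functional form" is the sigmoid of a weighted sum. A minimal sketch, assuming a single input feature, batch gradient descent on the log loss, and toy data invented for the example:

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.5, steps=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean log loss is the mean of (prediction - label) * input.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


# Toy 1-D data (an assumption): label flips from 0 to 1 between x = 1 and x = 2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1]
w, b = train_logistic(xs, ys)
probabilities = [sigmoid(w * x + b) for x in xs]
print([round(p, 2) for p in probabilities])
```

The gradient is a plain sum over examples, so each machine can compute its share of the sum independently; that is one plausible reading of why the slide calls the method parallelizable.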

Page 18: Applications of logistic regression

Click prediction
• Search ranking (web pages, products)
• Online advertising
• Recommendation

The model
• Output: click / no click
• Input features: page content, search keyword, user information

Page 19: Regression

• Linear regression
• Non-linear regression

Applications:
• Stock price prediction
• Credit scoring
• Employment forecast
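For the linear case, a one-variable least-squares fit is short enough to write out. The data points and the function name `fit_line` are assumptions made for this sketch; real applications such as the ones listed would use many input features.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Toy data (an assumption): points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # → 2.0 1.0
```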

Page 20: History of Supervised learning

Page 21: Semi-supervised learning

Application: Speech dialog system

Page 22: Unsupervised learning: Clustering

No labeled data

Methods
• K-means
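K-means alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. A minimal sketch, assuming toy 2-D points and hand-picked starting centroids (real implementations usually choose the start randomly or with a seeding heuristic):

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) with caller-supplied starting centroids."""
    centroids = list(centroids)  # work on a copy
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Toy data (an assumption): two visibly separate groups of three points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, centroids=[(1, 1), (8, 8)])
print(labels)  # → [0, 0, 0, 1, 1, 1]
```

No labels are supplied anywhere; the grouping comes only from distances between points, which is what makes the method unsupervised.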

Page 23: Categories of machine learning

Page 24: Applications of Clustering

• Malware detection
• Document clustering: topic detection

Page 25: Graphs in our life

• Social network: friend recommendation
• Molecular compound: drug discovery

Page 26: Graph and its matrix representation

[Figure: undirected graph with nodes 1 to 6]

Adjacency matrix

     1  2  3  4  5  6
1    0  1  0  0  0  1
2    1  0  1  1  0  0
3    0  1  0  1  1  0
4    0  1  1  0  1  0
5    0  0  1  1  0  1
6    1  0  0  0  1  0
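The matrix on this slide can be rebuilt from a list of edges. The edge list below was read off the slide's matrix (entry 1 means the two nodes are connected); the function name `adjacency_matrix` is an assumption for this sketch.

```python
# Undirected edges read off the slide's adjacency matrix (nodes numbered 1 to 6).
edges = [(1, 2), (1, 6), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5), (5, 6)]


def adjacency_matrix(edges, n_nodes):
    """Build a symmetric 0/1 matrix; entry [i][j] is 1 when i and j share an edge."""
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for a, b in edges:
        # Node labels start at 1, list indices at 0.
        matrix[a - 1][b - 1] = 1
        matrix[b - 1][a - 1] = 1
    return matrix


matrix = adjacency_matrix(edges, 6)
degrees = [sum(row) for row in matrix]  # row sums give each node's degree
for row in matrix:
    print(row)
```

Because the graph is undirected, the matrix is symmetric, and each row sum equals the number of neighbors of that node.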

Page 27: The web graph

[Figure: web pages (Page 1, Page 2, Page 3) connected by hyperlinks, each labeled with its anchor text]

Page 28: PageRank as a steady state

Transition matrix P =

     1     2     3     4     5     6
1    0     0.33  0.33  0     0     0.33
2    0.5   0     0.5   0     0     0
3    0.25  0.25  0     0.25  0.25  0
4    0     1     0     0     0     0
5    0     0     0.33  0.33  0     0.33
6    0.5   0     0     0     0.5   0

PageRank is a probability vector π such that π P = π, that is, a steady state of the transition matrix P
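A steady state of this transition matrix can be approximated by power iteration. The sketch below reads the steady state as a row vector that one more step of the random walk leaves unchanged (rank = rank · P); that reading is an assumption, since the slide's equation is cut off. It uses exact thirds where the slide shows 0.33 and applies no damping factor.

```python
THIRD = 1.0 / 3.0  # the slide rounds this to 0.33

# Row-stochastic transition matrix from the slide: P[i][j] is the probability
# of moving from page i+1 to page j+1.
P = [
    [0.0, THIRD, THIRD, 0.0, 0.0, THIRD],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.25, 0.25, 0.0, 0.25, 0.25, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, THIRD, THIRD, 0.0, THIRD],
    [0.5, 0.0, 0.0, 0.0, 0.5, 0.0],
]


def step(rank, P):
    """One step of the random walk: returns the row vector rank * P."""
    n = len(P)
    return [sum(rank[i] * P[i][j] for i in range(n)) for j in range(n)]


def steady_state(P, tolerance=1e-12, max_steps=10_000):
    """Power iteration: apply P repeatedly until the vector stops changing."""
    rank = [1.0 / len(P)] * len(P)  # start from the uniform distribution
    for _ in range(max_steps):
        new_rank = step(rank, P)
        if sum(abs(a - b) for a, b in zip(new_rank, rank)) < tolerance:
            return new_rank
        rank = new_rank
    return rank


pagerank = steady_state(P)
print([round(r, 3) for r in pagerank])
```

Pages with larger entries in the result are the ones a random surfer visits more often in the long run.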

Page 29: Discover influencers on Twitter

The Twitter graph
• Node
• Link

A PageRank approach: TwitterRank

[Figure: graph of nodes 1 to 5 linked by “following” edges]

Page 30: Facebook graph search

• Entity graph
• Natural language search: “Restaurants liked by my friends”

Page 31: Recommending a game

Page 32: Recommendation in Travel site

Page 33: Prediction Problems

Rating Prediction
• Given how a user rated other items, predict the user’s rating for a given item

Top-N Recommendation
• Given the list of items liked by a user, recommend new items that the user might like

**** ?

Page 34: Explicit vs. Implicit Feedback Data

Explicit feedback
• Ratings and reviews

Implicit feedback (user behavior)
• Purchase behavior: recency, frequency, …
• Browsing behavior: # of visits, time of visit, time of staying, clicks

Page 35: Collaborative Filtering Hypotheses

User/Item Similarities
• Similar users purchase similar items
• Similar items are purchased by similar users

Matching characteristics
• Match exists between user’s and item’s characteristics

Page 36: User-User similarity

User’s movie rating

        Out of Africa  Star Wars  Air Force One  Liar, Liar
John    4              4          5              1
Adam    1              1          2              5
Laura   ?              4          5              2
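One way to fill in Laura's missing rating from similar users is sketched below. The slides do not name a similarity measure, so Pearson correlation over co-rated movies is an assumption, and the prediction rule (copy the rating of the single most similar user) is the simplest possible choice rather than a recommended one.

```python
import math

MOVIES = ["Out of Africa", "Star Wars", "Air Force One", "Liar, Liar"]
# Ratings from the slide; None marks Laura's unknown rating.
ratings = {
    "John": [4, 4, 5, 1],
    "Adam": [1, 1, 2, 5],
    "Laura": [None, 4, 5, 2],
}


def pearson(a, b):
    """Pearson correlation over the movies that both users have rated."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    mean_a = sum(x for x, _ in pairs) / len(pairs)
    mean_b = sum(y for _, y in pairs) / len(pairs)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    spread_a = math.sqrt(sum((x - mean_a) ** 2 for x, _ in pairs))
    spread_b = math.sqrt(sum((y - mean_b) ** 2 for _, y in pairs))
    return covariance / (spread_a * spread_b)


similarity = {
    user: pearson(ratings["Laura"], ratings[user]) for user in ("John", "Adam")
}
nearest = max(similarity, key=similarity.get)
# Simplest possible prediction: copy the most similar user's rating.
predicted = ratings[nearest][MOVIES.index("Out of Africa")]
print(similarity, nearest, predicted)
```

Laura's tastes track John's closely and run opposite to Adam's, so the sketch predicts John's rating of 4 for Out of Africa.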

Page 37: Item-item similarity

        Out of Africa  Star Wars  Air Force One  Liar, Liar
John    4              4          5              1
Adam    1              1          2              5
Laura   ?              4          5              2
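The item-based view compares columns instead of rows. In the sketch below, cosine similarity between movies and a similarity-weighted average of Laura's own ratings are both assumptions, since the slide gives only the table; the names `cosine` and `predict` are made up for the example.

```python
import math

# Same ratings as the slide, stored per movie as {user: rating}.
by_movie = {
    "Out of Africa": {"John": 4, "Adam": 1},
    "Star Wars": {"John": 4, "Adam": 1, "Laura": 4},
    "Air Force One": {"John": 5, "Adam": 2, "Laura": 5},
    "Liar, Liar": {"John": 1, "Adam": 5, "Laura": 2},
}


def cosine(item_a, item_b):
    """Cosine similarity between two movies over the users who rated both."""
    shared = sorted(set(item_a) & set(item_b))
    dot = sum(item_a[u] * item_b[u] for u in shared)
    norm_a = math.sqrt(sum(item_a[u] ** 2 for u in shared))
    norm_b = math.sqrt(sum(item_b[u] ** 2 for u in shared))
    return dot / (norm_a * norm_b)


def predict(user, target):
    """Similarity-weighted average of the user's ratings on the other movies."""
    weighted_sum = total_weight = 0.0
    for movie, movie_ratings in by_movie.items():
        if movie != target and user in movie_ratings:
            weight = cosine(by_movie[target], movie_ratings)
            weighted_sum += weight * movie_ratings[user]
            total_weight += weight
    return weighted_sum / total_weight


prediction = predict("Laura", "Out of Africa")
print(round(prediction, 2))
```

Star Wars and Air Force One were rated almost identically to Out of Africa by John and Adam, so Laura's high ratings for those two dominate the prediction.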

Page 38: Application of item-item similarity

Amazon

Page 39: SVD (Singular Value Decomposition)

Page 40: Latent factors

Page 41: Application of Latent Factor Model

GetJar

Page 42: Ranking-based recommendation

Page 43: Application in LinkedIn

Ranking-based model

Page 44: Thanks and Contact

Co-author: Patricia Hoffman

Contact: [email protected]

Twitter: @junling_tech