data mining in practice: techniques and practical applications junling hu may 14, 2013

Post on 01-Apr-2015


Data Mining in Practice:

Techniques and Practical Applications

Junling Hu

May 14, 2013

What is data mining?

Mining patterns from data

Is it statistics?
• Functional form?
• Computation speed concern?
• Data size
• Variable size

Is it machine learning?
• Big data issue
• New methods: network mining

Examples of data mining

• Frequently bought together
• Movie recommendation

More examples of data mining

• Keyword suggestions
• Genome & disease mining
• Heart monitoring

Overview of data mining

• Frequent pattern mining
• Machine learning: supervised, unsupervised
• Stream mining
• Recommender system
• Graph mining
• Unstructured data: text, audio, image and video
• Big data technology

Frequent Pattern Mining

Diaper and Beer

• Product assortment
• Click behavior
• Machine breakdown

?

The case of Amazon

User  Items
1     {Princess dress, crown, gloves, t-shirt}
2     {Princess dress, crown, gloves, pink dress, t-shirt}
3     {Princess dress, crown, gloves, pink dress, jeans}
4     {Princess dress, crown, gloves, pink dress}
5     {crown, gloves}

• Count frequency of co-occurrence
• Efficient algorithm
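Counting co-occurrence can be sketched directly on the five baskets listed above. The version below is a brute-force pair count, not the efficient algorithm the slide alludes to; the function name `count_pairs` and the threshold `min_support = 4` are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations

# Baskets copied from the slide's example (one set of items per user).
baskets = [
    {"Princess dress", "crown", "gloves", "t-shirt"},
    {"Princess dress", "crown", "gloves", "pink dress", "t-shirt"},
    {"Princess dress", "crown", "gloves", "pink dress", "jeans"},
    {"Princess dress", "crown", "gloves", "pink dress"},
    {"crown", "gloves"},
]


def count_pairs(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


pair_counts = count_pairs(baskets)

# Hypothetical support threshold: keep pairs seen in at least 4 baskets.
min_support = 4
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}

for pair, count in sorted(frequent_pairs.items(), key=lambda kv: -kv[1]):
    print(pair, count)
```

Enumerating every pair grows quadratically with basket size, which is why practical systems rely on pruning strategies rather than a full count like this one.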

Machine Learning Process

Machine Learning

Supervised

Unsupervised (clustering)

Binary classification

Checking  Duration (years)  Savings ($k)  Current Loans  Loan Purpose  Risky?
Yes       1                 10            Yes            TV            0
Yes       2                 4             No             TV            1
No        5                 75            No             Car           0
Yes       10                66            No             Repair        1
Yes       5                 83            Yes            Car           0
Yes       1                 11            No             TV            0
Yes       4                 99            Yes            Car           0

Input features

Output class

Data point

Classification (1)

Decision tree

Classification (2): Neural network

Perceptron

Multi-layer neural network
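The perceptron named above is the single-unit building block of a multi-layer network. A minimal sketch of its learning rule follows, assuming labels of +1 and -1 and a toy dataset (logical AND) that is not part of the slides; the function names are illustrative.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Perceptron rule: adjust the weights only when a point is misclassified."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):  # y is +1 or -1
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation > 0 else -1
            if predicted != y:
                # Move the decision boundary toward the misclassified point.
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
                errors += 1
        if errors == 0:  # every point classified correctly: stop early
            break
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


# Toy, linearly separable data (assumption for illustration): logical AND.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, -1, 1]

w, b = train_perceptron(X, y)
print([predict(w, b, x) for x in X])
```

A single perceptron can only separate classes with a straight line (or hyperplane), which is the usual motivation for stacking units into multiple layers.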

Head pose detection

Support Vector Machine (SVM)

• Search for a separating hyperplane
• Maximize margin

Perceived advantage of SVM

Transform data into higher dimension

Applications of SVM: Spam Filter

Input Features:

• Transmission: IP address -- 167.12.24.555; Sender URL -- one-spam.com
• Email header: From -- “admin@one-spam.cpm”; To -- “undisclosed”; cc
• Email body: # of paragraphs, # of words
• Email structure: # of attachments, # of links

Logistic regression

Advantages:
• Simple functional form
• Can be parallelized
• Large scale
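The "simple functional form" is a sigmoid applied to a weighted sum of the input features. The sketch below trains such a model with plain batch gradient descent on a made-up one-feature dataset; the data, learning rate, and epoch count are assumptions for illustration only.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """P(class = 1 | x) = sigmoid(w . x + b)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(samples, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the average log-loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        # Each example contributes to the gradient independently of the others.
        for x, y in zip(samples, labels):
            error = predict_proba(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy data (assumption): one feature, label 1 for the larger values.
X = [(0.0,), (1.0,), (2.0,), (3.0,)]
y = [0, 0, 1, 1]

w, b = train(X, y)
print(predict_proba(w, b, (0.0,)), predict_proba(w, b, (3.0,)))
```

Because the gradient is a sum of per-example terms, the inner loop can be split across workers and the partial sums combined, which is one reading of "can be parallelized".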

Applications of logistic regression

Click prediction
• Search ranking (web pages, products)
• Online advertising
• Recommendation

The model
• Output: click/no click
• Input features: page content, search keyword, user information

Regression

Linear regression

Non-linear regression

Application:
• Stock price prediction
• Credit scoring
• Employment forecast

History of Supervised learning

Semi-supervised learning

Application: Speech dialog system

Unsupervised learning: Clustering

No labeled data

Methods: K-means
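K-means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. A minimal sketch, assuming two-dimensional toy points and hand-picked starting centroids (neither comes from the slides):

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute means."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)

        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(values) / len(cluster) for values in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place

        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, clusters


# Toy data (assumption): two visibly separated groups and fixed starting centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

No labels are used anywhere: the grouping comes only from distances between points, and the result depends on the chosen number of clusters and the starting centroids.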

Categories of machine learning

Applications of Clustering

• Malware detection
• Document clustering: topic detection

Graphs in our life

• Social network: friend recommendation
• Molecular compound: drug discovery

Graph and its matrix representation

Adjacency matrix

    1  2  3  4  5  6
1   0  1  0  0  0  1
2   1  0  1  1  0  0
3   0  1  0  1  1  0
4   0  1  1  0  1  0
5   0  0  1  1  0  1
6   1  0  0  0  1  0

[Figure: graph with nodes 1 to 6]
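The matrix above can be rebuilt from an edge list. In the sketch below the edge list is read off the nonzero entries of the slide's matrix, and the function name `adjacency_matrix` is an illustrative choice.

```python
# Undirected edges read off the slide's matrix (node labels are 1-based).
edges = [(1, 2), (1, 6), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5), (5, 6)]


def adjacency_matrix(edges, n):
    """Build a symmetric n x n 0/1 matrix from undirected edges."""
    matrix = [[0] * n for _ in range(n)]
    for a, b in edges:
        matrix[a - 1][b - 1] = 1
        matrix[b - 1][a - 1] = 1  # undirected: mirror the entry
    return matrix


A = adjacency_matrix(edges, 6)

# The degree of a node is the sum of its row.
degrees = [sum(row) for row in A]

for row in A:
    print(row)
```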

The web graph

[Figure: web graph of Page 1, Page 2, and Page 3 linked by hyperlinks with anchor text]

PageRank as a steady state

Transition matrix P =

     1     2     3     4     5     6
1    0     0.33  0.33  0     0     0.33
2    0.5   0     0.5   0     0     0
3    0.25  0.25  0     0.25  0.25  0
4    0     1     0     0     0     0
5    0     0     0.33  0.33  0     0.33
6    0.5   0     0     0     0.5   0

PageRank is a probability vector $\pi$ such that $P^{T}\pi = \pi$.
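A steady state of this kind can be approximated by power iteration: start from a uniform distribution and apply the transition matrix repeatedly until the vector stops changing. The sketch below assumes the usual stationary condition (the rank vector is unchanged by one step of the chain), renormalizes the rows because the slide rounds 1/3 to 0.33, and omits the damping factor used in the full PageRank algorithm.

```python
# Transition matrix copied from the slide; row i holds the probabilities
# of moving from node i+1 to every other node.
P = [
    [0.0, 0.33, 0.33, 0.0, 0.0, 0.33],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.25, 0.25, 0.0, 0.25, 0.25, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.33, 0.33, 0.0, 0.33],
    [0.5, 0.0, 0.0, 0.0, 0.5, 0.0],
]


def normalize_rows(matrix):
    """Rescale each row to sum to exactly 1 (the slide rounds 1/3 to 0.33)."""
    return [[value / sum(row) for value in row] for row in matrix]


def step(rank, matrix):
    """One step of the chain: new_rank[j] = sum_i rank[i] * matrix[i][j]."""
    n = len(matrix)
    return [sum(rank[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def pagerank(matrix, tol=1e-12, max_iter=10000):
    """Power iteration from the uniform distribution (no damping factor)."""
    n = len(matrix)
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new_rank = step(rank, matrix)
        if sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol:
            return new_rank
        rank = new_rank
    return rank


P_exact = normalize_rows(P)
ranks = pagerank(P_exact)
for node, value in enumerate(ranks, start=1):
    print(node, round(value, 4))
```

Plain power iteration works here because every node in this small graph can reach every other node; on real web graphs, dead ends and disconnected parts are the reason a damping factor is normally added.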

Discover influencers on Twitter

The Twitter graph
• Node
• Link

A PageRank approach: TwitterRank

[Figure: graph of nodes 1 to 5 with edges labeled "Following"]

Facebook graph search

• Entity graph
• Natural language search: “Restaurants liked by my friends”

Recommending a game

Recommendation in Travel site

Prediction Problems

• Rating prediction: given how a user rated other items, predict the user’s rating for a given item.
• Top-N recommendation: given the list of items liked by a user, recommend new items that the user might like.

Explicit vs. Implicit Feedback Data

Explicit feedback
• Ratings and reviews

Implicit feedback (user behavior)
• Purchase behavior: recency, frequency, …
• Browsing behavior: # of visits, time of visit, time of staying, clicks

Collaborative Filtering Hypotheses

User/item similarities
• Similar users purchase similar items
• Similar items are purchased by similar users

Matching characteristics
• A match exists between the user’s and the item’s characteristics

User-User similarity

Users’ movie ratings:

        Out of Africa  Star Wars  Air Force One  Liar, Liar
John    4              4          5              1
Adam    1              1          2              5
Laura   ?              4          5              2
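One way to fill in Laura's missing rating is to weight the other users' ratings by how similar they are to her. The sketch below uses cosine similarity over co-rated movies and a similarity-weighted average; both are common choices but are assumptions here, since the slides do not name a specific similarity measure.

```python
import math

# Ratings copied from the slide; Laura has not rated "Out of Africa".
ratings = {
    "John": {"Out of Africa": 4, "Star Wars": 4, "Air Force One": 5, "Liar, Liar": 1},
    "Adam": {"Out of Africa": 1, "Star Wars": 1, "Air Force One": 2, "Liar, Liar": 5},
    "Laura": {"Star Wars": 4, "Air Force One": 5, "Liar, Liar": 2},
}


def cosine_similarity(a, b):
    """Cosine similarity between two users over the movies both have rated."""
    shared = [movie for movie in a if movie in b]
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def predict_rating(user, movie):
    """Similarity-weighted average of other users' ratings for the movie."""
    weighted_sum = 0.0
    similarity_sum = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += similarity * other_ratings[movie]
        similarity_sum += similarity
    return weighted_sum / similarity_sum


sim_john = cosine_similarity(ratings["Laura"], ratings["John"])
sim_adam = cosine_similarity(ratings["Laura"], ratings["Adam"])
laura_prediction = predict_rating("Laura", "Out of Africa")
print(sim_john, sim_adam, laura_prediction)
```

Laura's tastes track John's more closely than Adam's, so John's rating carries more weight in the prediction. Raw cosine similarity separates users only weakly on a 1 to 5 scale; mean-centering the ratings first is a common refinement.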

Item-item similarity

        Out of Africa  Star Wars  Air Force One  Liar, Liar
John    4              4          5              1
Adam    1              1          2              5
Laura   ?              4          5              2
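The item-based view compares columns instead of rows: a movie is similar to another when the same users rate both alike, and Laura's own ratings of similar movies drive the prediction. A minimal sketch, again assuming cosine similarity and a similarity-weighted average (illustrative choices, not taken from the slides):

```python
import math

# Ratings copied from the slide; Laura has not rated "Out of Africa".
ratings = {
    "John": {"Out of Africa": 4, "Star Wars": 4, "Air Force One": 5, "Liar, Liar": 1},
    "Adam": {"Out of Africa": 1, "Star Wars": 1, "Air Force One": 2, "Liar, Liar": 5},
    "Laura": {"Star Wars": 4, "Air Force One": 5, "Liar, Liar": 2},
}


def item_similarity(movie_a, movie_b):
    """Cosine similarity between two movies over users who rated both."""
    co_raters = [u for u, r in ratings.items() if movie_a in r and movie_b in r]
    vec_a = [ratings[u][movie_a] for u in co_raters]
    vec_b = [ratings[u][movie_b] for u in co_raters]
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)


def predict_item_based(user, target):
    """Average the user's own ratings, weighted by each movie's similarity to the target."""
    weighted_sum = 0.0
    similarity_sum = 0.0
    for movie, rating in ratings[user].items():
        similarity = item_similarity(target, movie)
        weighted_sum += similarity * rating
        similarity_sum += similarity
    return weighted_sum / similarity_sum


sim_star_wars = item_similarity("Out of Africa", "Star Wars")
sim_liar = item_similarity("Out of Africa", "Liar, Liar")
laura_item_prediction = predict_item_based("Laura", "Out of Africa")
print(sim_star_wars, sim_liar, laura_item_prediction)
```

Item similarities depend only on the rating columns, so they can be computed ahead of time and reused for every user.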

Application of item-item similarity

Amazon

SVD (Singular Value Decomposition)

Latent factors

Application of Latent Factor Model

GetJar

Ranking-based recommendation

Application in LinkedIn

Ranking-based model

Thanks and Contact

Co-author: Patricia Hoffman

Contact: junlinghu@gmail.com

Twitter: @junling_tech
