SAS HOMEWORK 3 REVIEW ASSOCIATION RULES MINING MIS2502 Data Analytics

Upload: julian-bishop

Post on 28-Dec-2015

Page 1: SAS HOMEWORK 3 REVIEW ASSOCIATION RULES MINING MIS2502 Data Analytics

SAS HOMEWORK 3 REVIEW: ASSOCIATION RULES MINING

MIS2502

Data Analytics

SAS Homework 3 Review Association Rules

• Using the Transactions data set
• Reject Store and Quantity – we don’t need them
• Assign ID to Transaction (Nominal) – this is our ‘basket’
• Assign Target to Product (Nominal) – this is what we’re trying to determine, but now it’s not a Y/N (binary) value
• Step 8 = Transaction!
• Add an Associations node (Model)
• In Properties, set Export Rule by ID = Yes
• Answer some questions regarding the Association Rules
• Evaluate Support, Confidence and Lift

Set Up

• Retail – associations between items purchased from Health/Beauty and Stationery
• 400K+ transactions collected from POS
• Products
  • bar soap
  • bows
  • candy bars
  • deodorant
  • greeting cards
  • magazines
  • markers
  • pain relievers
  • pencils
  • pens
  • perfume
  • photo processing
  • prescription medications
  • shampoo
  • toothbrushes
  • toothpaste
  • wrapping paper

We are using 2

Association Rules - Diagram

• Right-click the node and select Run, then view the results.

Process

• Set rule thresholds
• Define Item Sets
• Read through the Item Sets and create a list of all possible association rules (X => Y) for the Item Sets
• Compute Support, Confidence and Lift for each rule
  • Support: frequency count of occurrence / all transactions, for both the individual items (X and Y) and for the Item Set (X, Y)
  • Confidence: strength of the association, i.e. how often Y appears in baskets that contain X
    • count(X => Y) / count(X)
  • Expected Confidence of X => Y: the probability that any one basket contains Y, i.e. s(Y)
  • Lift = s(X => Y) / (s(X) * s(Y))
    • Or, in SAS, (confidence / expected confidence)
• Drop those that don’t meet thresholds
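The process above can be sketched in a few lines of Python. This is only an illustration of the formulas, not what the SAS Associations node runs internally: the baskets are invented toy data, the function names (`support`, `rule_stats`, `mine_rules`) and the threshold values are assumptions made for this example, and only two-item rules are considered.

```python
from itertools import permutations

# Toy baskets, invented for illustration (not the homework data set).
baskets = [
    {"pens", "pencils"},
    {"pens", "pencils", "markers"},
    {"pens", "markers"},
    {"pencils"},
    {"pens", "pencils"},
]

def support(items, baskets):
    """Fraction of all baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def rule_stats(x, y, baskets):
    """Support, confidence, expected confidence and lift for the rule x => y."""
    s_xy = support({x, y}, baskets)
    confidence = s_xy / support({x}, baskets)    # count(X => Y) / count(X)
    expected_confidence = support({y}, baskets)  # s(Y)
    return {
        "rule": f"{x} => {y}",
        "support": s_xy,
        "confidence": confidence,
        "expected_confidence": expected_confidence,
        "lift": confidence / expected_confidence,  # same as s(XY) / (s(X) * s(Y))
    }

def mine_rules(baskets, min_support=0.2, min_confidence=0.5):
    """List every two-item rule, then drop those below the (assumed) thresholds."""
    items = sorted(set().union(*baskets))
    rules = [rule_stats(x, y, baskets) for x, y in permutations(items, 2)]
    return [r for r in rules
            if r["support"] >= min_support and r["confidence"] >= min_confidence]

for rule in mine_rules(baskets):
    print(rule["rule"], {k: round(v, 3) for k, v in rule.items() if k != "rule"})
```

Because support and lift are computed from the Item Set (X, Y) as a whole, the rules X => Y and Y => X always share the same support and lift; only their confidence can differ.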

Evaluating the Statistics

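Confidence plot: left-hand side v right-hand side of each rule (red = high); the range is shown at the bottom

Support – frequency: % occurrence of the Item Set in the data
Confidence – strength: % of baskets containing the left-hand side that also contain the right-hand side
Lift – dependence: probability of dependent occurrence / probability of random occurrence (a value >1 means the items occur together more often than chance)

Support v Confidence plot: blue = 2-variable rules, red = 3-variable rules

Rules are ordered by lift on the x axis

Confidence v Expected Confidence plot: the gap between the two reflects Lift (Lift = confidence / expected confidence)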

Evaluating the Rules Table

View > Rules > Rule Table

In Class

1) Which rule(s) have the highest confidence?

MUSICSTREAM ==> WEBSITE

2) Which rule(s) have the highest support?

WEBSITE ==> PODCAST and PODCAST ==> WEBSITE

3) Which rule(s) have the highest lift?

ARCHIVE ==> WEBSITE and WEBSITE ==> ARCHIVE

4) What are the two rule “pairs” in the list above?

ARCHIVE ==> WEBSITE/WEBSITE ==> ARCHIVE and

WEBSITE ==> PODCAST/PODCAST ==> WEBSITE

5) What other service “goes the most” with visiting the website for general information (WEBSITE)? In other words, what other service are WEBSITE visitors most likely to seek out? What statistic did you use to figure this out?

ARCHIVE – LIFT is greater than 1. This implies that this isn’t just random chance – people are actively seeking out the WEBSITE if they’ve used the ARCHIVE.

6) What other service seems to “go the least” with visiting the website for general information (WEBSITE)? In other words, what other service are WEBSITE visitors least likely to seek out? What statistic did you use to figure this out?

PODCAST – LIFT is less than 1. This also implies that this isn’t just random chance – but this time, people who visit the website are particularly unlikely to also download a podcast.

7) The rule MUSICSTREAM ==> WEBSITE has poor lift (i.e., less than 1), but the rule has the highest confidence. Explain how this is possible.

It could be that many people use both MUSICSTREAM and WEBSITE, so the pair appears in visitors’ sets of services a lot, which gives the rule high confidence. However, lift compares that confidence with how often WEBSITE appears overall, so there can still be a negative effect of one on the other. For example, I use the website a lot, and I use music streaming a lot, but I’m still less likely to do one if I’ve done the other – possibly they are substitutes.
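A small worked example may make the answer to question 7 concrete. The counts below are hypothetical, chosen only to show the effect; they are not taken from the in-class data set.

```python
# Hypothetical counts, invented purely for illustration.
total_visitors = 1000
n_website = 900       # visitors who used WEBSITE
n_musicstream = 500   # visitors who used MUSICSTREAM
n_both = 425          # visitors who used both services

# Rule: MUSICSTREAM ==> WEBSITE
confidence = n_both / n_musicstream               # share of MUSICSTREAM users who used WEBSITE
expected_confidence = n_website / total_visitors  # share of all visitors who used WEBSITE
lift = confidence / expected_confidence

print(f"confidence          = {confidence:.2f}")           # 0.85
print(f"expected confidence = {expected_confidence:.2f}")  # 0.90
print(f"lift                = {lift:.3f}")                 # 0.944
```

With these numbers, 85% of MUSICSTREAM users also visit the WEBSITE, which is a high confidence. But 90% of all visitors use the WEBSITE anyway, so MUSICSTREAM users are slightly less likely than average to visit it, and the lift falls below 1.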