sas homework 3 review association rules mining mis2502 data analytics
Post on 28-Dec-2015
SAS HOMEWORK 3 REVIEW: ASSOCIATION RULES MINING
MIS2502
Data Analytics
SAS Homework 3 Review: Association Rules
• Using the Transactions data set
• Reject Store and Quantity – we don’t need them
• Assign the ID role to Transaction (Nominal) – this is our ‘basket’
• Assign the Target role to Product (Nominal) – this is what we’re trying to determine, but now it’s not a Y/N (binary) value
• Step 8 = Transaction!
• Add an Associations node (Model)
• In Properties, set Export Rule by ID = Yes
• Answer some questions regarding the association rules
• Evaluate Support, Confidence and Lift
Set Up
• Retail – associations between items purchased from Health/Beauty and Stationery
• 400K+ transactions collected from POS
• Products: bar soap, bows, candy bars, deodorant, greeting cards, magazines, markers, pain relievers, pencils, pens, perfume, photo processing, prescription medications, shampoo, toothbrushes, toothpaste, wrapping paper
We are using 2
Association Rules - Diagram
• Right-click and Run, then view the results.
Process
• Set rule thresholds
• Define Item Sets
• Read through the Item Sets and create a list of all possible association rules (X => Y) for the Item Sets
• Compute Support, Confidence and Lift for each rule
  • Support: frequency count of occurrence / all transactions, for both the individual items (X and Y) and for the Item Set (X, Y)
  • Confidence: strength of the association, i.e. how often Y appears in baskets that contain X
    • count(X => Y) / count(X), where count(X => Y) is the number of baskets containing both X and Y
  • Expected Confidence of X => Y: the probability that any one basket has Y, i.e. s(Y)
  • Lift = s(X => Y) / (s(X) * s(Y))
    • Or, in SAS, (confidence / expected confidence)
• Drop those that don’t meet thresholds
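The process above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not what the SAS Associations node does internally: the five baskets, the function names `support` and `rule_stats`, and the threshold values are all made up for the example, and only two-item rules are considered.

```python
from itertools import permutations

# Hypothetical baskets for illustration only (not the homework data set).
baskets = [
    {"pens", "pencils", "markers"},
    {"pens", "pencils"},
    {"pens", "shampoo"},
    {"shampoo", "toothpaste"},
    {"pencils", "markers"},
]


def support(items, baskets):
    """Fraction of all baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def rule_stats(x, y, baskets):
    """Support, confidence, expected confidence and lift for the rule x => y.

    Assumes x and y each appear in at least one basket.
    """
    s_xy = support({x, y}, baskets)
    s_x = support({x}, baskets)
    s_y = support({y}, baskets)
    confidence = s_xy / s_x          # how often y appears in baskets with x
    expected_confidence = s_y        # how often y appears in any basket
    lift = confidence / expected_confidence  # same as s_xy / (s_x * s_y)
    return {
        "support": s_xy,
        "confidence": confidence,
        "expected_confidence": expected_confidence,
        "lift": lift,
    }


stats = rule_stats("pens", "pencils", baskets)
print(stats)

# List every two-item rule, then drop those below the (assumed) thresholds.
items = sorted(set().union(*baskets))
min_support, min_confidence = 0.2, 0.5
kept = []
for x, y in permutations(items, 2):
    r = rule_stats(x, y, baskets)
    if r["support"] >= min_support and r["confidence"] >= min_confidence:
        kept.append((x, y, r))

for x, y, r in kept:
    print(f"{x} => {y}: conf={r['confidence']:.2f}, lift={r['lift']:.2f}")
```

Note that support and lift come out the same for X => Y and Y => X, while confidence generally differs, because it divides by the count of the left-hand item only.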
Evaluating the Statistics
Confidence Plot: Left v Right (red = high), range at bottom
• Support – frequency: % occurrence of the Item Set in the data
• Confidence – strength: % of baskets containing the left-hand side that also contain the right-hand side
• Lift – dependence: probability of dependent occurrence / probability of random occurrence (> 1 indicates a positive association)
Support v Confidence: blue = 2-variable rules, red = 3-variable rules
Ordered by lift on the x axis
Confidence v Expected Confidence: the gap between the two reflects Lift (Lift is their ratio)
Evaluating the Rules
Table view > Rules > Rule Table
In Class
1) Which rule(s) have the highest confidence?
MUSICSTREAM ==> WEBSITE
2) Which rule(s) have the highest support?
WEBSITE ==> PODCAST and PODCAST ==> WEBSITE
3) Which rule(s) have the highest lift?
ARCHIVE ==> WEBSITE and WEBSITE ==> ARCHIVE
4) What are the two rule “pairs” in the list above?
ARCHIVE ==> WEBSITE/WEBSITE ==> ARCHIVE and
WEBSITE ==> PODCAST/PODCAST ==> WEBSITE
5) What other service “goes the most” with visiting the website for general information (WEBSITE)? In other words, what other service are WEBSITE visitors most likely to seek out? What statistic did you use to figure this out?
ARCHIVE – LIFT is greater than 1. This implies that this isn’t just random chance – people are actively seeking out the WEBSITE if they’ve used the ARCHIVE.
6) What other service seems to “go the least” with visiting the website for general information (WEBSITE)? In other words, what other service are WEBSITE visitors least likely to seek out? What statistic did you use to figure this out?
PODCAST – LIFT is less than 1. This also implies that this isn’t just random chance – but this time, people who visit the web site are particularly unlikely to also download a podcast.
7) The rule MUSICSTREAM ==> WEBSITE has poor lift (i.e., less than 1), but the rule has the highest confidence. Explain how this is possible.
It could be that many people use both MUSICSTREAM and WEBSITE so it appears in visitors’ set of services a lot. However, there can still be a negative effect of one on the other. For example, I use the website a lot, and I use music streaming a lot, but I’m still less likely to do one if I’ve done the other – possibly they are substitutes.
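A small numeric sketch shows how high confidence and lift below 1 can coexist. The visitor counts below are invented purely for illustration and are not taken from the in-class data set; the point is only that when WEBSITE is very common overall, the expected confidence is already high, so even a high confidence can fall short of it.

```python
# Hypothetical counts, chosen only to illustrate the arithmetic.
n_visitors = 1000   # all visitors
n_website = 900     # visitors who used WEBSITE
n_music = 200       # visitors who used MUSICSTREAM
n_both = 170        # visitors who used both

# Confidence of MUSICSTREAM => WEBSITE: share of MUSICSTREAM users who also used WEBSITE.
confidence = n_both / n_music

# Expected confidence: share of all visitors who used WEBSITE.
expected_confidence = n_website / n_visitors

# Lift compares the two; below 1 means MUSICSTREAM users are slightly
# less likely than the average visitor to use WEBSITE.
lift = confidence / expected_confidence

print(f"confidence={confidence:.3f}, expected={expected_confidence:.3f}, lift={lift:.3f}")
```

With these made-up numbers, confidence is high because most visitors use WEBSITE anyway, yet it is still lower than the baseline rate, so lift drops below 1.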