Scott Burton and Richard Morris, CS 676 Presentation, 12 April 2011
Mining Rules from Surveys and Questionnaires
Scott Burton and Richard Morris
CS 676 Presentation
12 April 2011
Surveys and Questionnaires
Frequently used
Problems for data mining
• Rarity
• Related and dependent questions
• Ordinal / Likert scale
Association Rule Mining
Market basket analysis
Cookies -> Milk
Customer Milk Cookies Butter Bread
A x x
B x x x
C x x
D x x
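The two quantities behind a rule such as Cookies -> Milk can be sketched in a few lines. The column positions of the marks in the table above did not survive extraction, so the basket contents below are invented for illustration (only the number of items per customer matches the slide); `support` and `confidence` are illustrative helper names.

```python
# Hypothetical baskets: which items each customer bought is an assumption,
# because the original table's column alignment was lost.
baskets = {
    "A": {"milk", "cookies"},
    "B": {"milk", "cookies", "bread"},
    "C": {"milk", "butter"},
    "D": {"butter", "bread"},
}

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in baskets.values() if itemset <= items)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

# Rule: Cookies -> Milk
rule_support = support({"cookies", "milk"}, baskets)
rule_confidence = confidence({"cookies"}, {"milk"}, baskets)
print(rule_support, rule_confidence)
```

With these made-up baskets, half of the customers bought both items, and every customer who bought cookies also bought milk.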
Our Goal: Improve Precision
Standard Algorithms/Approaches
• Apriori, MS-Apriori
• Too many rules
• Rules are not “interesting” or actionable
• Finding the needle in the haystack
Our goal
• Improve precision
• How do you measure “interestingness”?
Interestingness Measures
Mostly based on support or confidence
Considered about 40 different metrics
All seemed to favor the wrong types of rules
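The slides do not list the roughly 40 metrics that were considered. As a minimal sketch, three widely used measures derived from support and confidence (lift, leverage, conviction) can be computed as follows; the probabilities in the example are hypothetical and are not taken from the survey data.

```python
def lift(p_a, p_c, p_ac):
    """Ratio of observed co-occurrence to what independence would predict."""
    return p_ac / (p_a * p_c)

def leverage(p_a, p_c, p_ac):
    """Difference between observed co-occurrence and the independent case."""
    return p_ac - p_a * p_c

def conviction(p_a, p_c, p_ac):
    """(1 - P(C)) / (1 - confidence); infinite when the rule never fails."""
    conf = p_ac / p_a
    if conf == 1.0:
        return float("inf")
    return (1 - p_c) / (1 - conf)

# Hypothetical rule A -> C with P(A)=0.5, P(C)=0.8, P(A and C)=0.45
p_a, p_c, p_ac = 0.5, 0.8, 0.45
rule_lift = lift(p_a, p_c, p_ac)
rule_leverage = leverage(p_a, p_c, p_ac)
rule_conviction = conviction(p_a, p_c, p_ac)
print(rule_lift, rule_leverage, rule_conviction)
```

All three are functions of the same three probabilities, which is consistent with the observation that such measures are mostly based on support or confidence.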
Our Datasets
Smoking habits of middle school students in Mexico
• Global Youth Tobacco Survey for the Pan American Health Organization (GYTSPAHO)
• ~65 questions and 13,000 responses
HINTS (Health Information National Trends Survey)
• hints.cancer.gov
• 2007 response data had ~475 questions and 8,000 responses
• We focused on a subset of ~100 questions
Apriori vs. MS-Apriori
Apriori (Figure 1)
MS-Apriori (Figure 2)
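The difference between the two algorithms matters for rare answers. Apriori applies one minimum support to every itemset, while MS-Apriori, as it is commonly described, gives each item its own minimum item support (MIS) and requires an itemset to meet the lowest MIS among its items. The sketch below shows only that threshold logic, not the full candidate generation; the item supports, `beta`, and `least_support` values are hypothetical.

```python
def assign_mis(item_support, beta, least_support):
    """MIS(item) = max(beta * support(item), least_support)."""
    return {item: max(beta * s, least_support) for item, s in item_support.items()}

def itemset_min_support(itemset, mis):
    """An itemset must reach the smallest MIS among its items."""
    return min(mis[item] for item in itemset)

# Hypothetical single-item supports for two survey answers
item_support = {"smokes=yes": 0.10, "parents_smoke=yes": 0.40}
mis = assign_mis(item_support, beta=0.5, least_support=0.02)

pair = {"smokes=yes", "parents_smoke=yes"}
pair_threshold = itemset_min_support(pair, mis)

observed_support = 0.06          # hypothetical support of the pair
single_minsup = 0.20             # a typical single Apriori threshold
kept_by_apriori = observed_support >= single_minsup
kept_by_ms_apriori = observed_support >= pair_threshold
print(pair_threshold, kept_by_apriori, kept_by_ms_apriori)
```

In this made-up case the pair involving the rare answer is discarded under the single threshold but kept under the per-item thresholds.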
Related and Dependent Questions
True but worthless rules
• Do you smoke=no -> Did you smoke last week=no
Our approach
• Cluster similar questions
• Remove any intra-cluster rules
(Diagram: questions numbered 1 through 9, drawn in three groups: 1-3, 4-6, and 7-9)
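Removing intra-cluster rules can be sketched as a simple filter. This assumes that a rule counts as intra-cluster when every question it mentions belongs to the same cluster; the slides do not spell out the exact criterion. The question names, cluster numbers, and the `question=answer` item format are hypothetical.

```python
def question_of(item):
    """Extract the question from an item written as 'question=answer'."""
    return item.split("=", 1)[0]

def is_intra_cluster(rule, cluster_of):
    """True when all questions in the rule fall in a single cluster (assumed criterion)."""
    antecedent, consequent = rule
    clusters = {cluster_of[question_of(item)] for item in antecedent | consequent}
    return len(clusters) == 1

def prune_intra_cluster(rules, cluster_of):
    """Keep only rules that connect at least two different clusters."""
    return [rule for rule in rules if not is_intra_cluster(rule, cluster_of)]

# Hypothetical cluster assignment: the two smoking questions share a cluster
cluster_of = {"do_you_smoke": 1, "smoked_last_week": 1, "parents_smoke": 2}

rules = [
    (frozenset({"do_you_smoke=no"}), frozenset({"smoked_last_week=no"})),
    (frozenset({"parents_smoke=no"}), frozenset({"do_you_smoke=no"})),
]

kept = prune_intra_cluster(rules, cluster_of)
print(len(kept))
```

The true-but-worthless rule between the two smoking questions is dropped, and the rule linking different clusters is kept.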
Creating Clusters
Distance metrics
◦ Bi-conditional prediction
Attribute vs. attribute-value pair
Involving the subject matter expert
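The slides do not define bi-conditional prediction. One plausible reading, offered here only as an assumption, is that two questions are close when each one's answer predicts the other's. The sketch works at the attribute (whole question) level rather than on attribute-value pairs, and the response data is invented.

```python
from collections import Counter, defaultdict

def predictability(xs, ys):
    """Accuracy of guessing y from x by always choosing the most common y for that x."""
    by_x = defaultdict(Counter)
    for x, y in zip(xs, ys):
        by_x[x][y] += 1
    correct = sum(max(counts.values()) for counts in by_x.values())
    return correct / len(xs)

def biconditional_distance(xs, ys):
    """Assumed distance: 1 minus the weaker of the two prediction directions."""
    return 1.0 - min(predictability(xs, ys), predictability(ys, xs))

# Hypothetical answers from six respondents
smoke = ["no", "no", "yes", "yes", "no", "yes"]
last_week = ["no", "no", "yes", "no", "no", "yes"]
gender = ["m", "f", "m", "f", "m", "f"]

d_related = biconditional_distance(smoke, last_week)
d_unrelated = biconditional_distance(smoke, gender)
print(d_related, d_unrelated)
```

A matrix of such pairwise distances could then be passed to any standard clustering routine, with a subject matter expert reviewing the resulting groups.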
A Sample Clustering of Questions
(see handout)
Effects of Cluster Pruning
MS-Apriori (Figure 2)
After cluster pruning (Figure 3)
Similar Rules
Abstract Viewpoint:
• A B -> C D
• A -> C D
• A B -> C
• A B Z -> C D
Similar Rule Pruning
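The pruning method itself is not described in the transcript. A minimal sketch of one possible approach is given below: rules are compared by Jaccard similarity over the items they mention (ignoring which side of the arrow an item is on, a simplification), and from each group of similar rules only the highest-scoring one is kept. The confidence values and the 0.7 threshold are hypothetical.

```python
def rule_items(rule):
    """All items mentioned by a rule, regardless of side."""
    antecedent, consequent, _score = rule
    return antecedent | consequent

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

def prune_similar(rules, threshold):
    """Greedy pass: visit rules by descending score, keep a rule only if it is
    not too similar to any rule already kept."""
    kept = []
    for rule in sorted(rules, key=lambda r: r[2], reverse=True):
        if all(jaccard(rule_items(rule), rule_items(k)) < threshold for k in kept):
            kept.append(rule)
    return kept

# The four similar rules from the abstract viewpoint, plus one unrelated rule.
# Each rule is (antecedent, consequent, hypothetical confidence).
rules = [
    (frozenset("AB"), frozenset("CD"), 0.90),
    (frozenset("A"), frozenset("CD"), 0.80),
    (frozenset("AB"), frozenset("C"), 0.85),
    (frozenset("ABZ"), frozenset("CD"), 0.88),
    (frozenset("E"), frozenset("F"), 0.70),
]

kept = prune_similar(rules, threshold=0.7)
print(len(kept))
```

With these numbers the four overlapping rules collapse to A B -> C D, and the unrelated rule E -> F survives.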
Effects of Similar Rule Pruning
After cluster pruning (Figure 3)
After Similar Rule Pruning (Figure 4)
Ordinal and Likert Data
Two Approaches
• Pre-process
• Post-process
Ordinal Likert
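The pre-process approach can be illustrated with a small binning step that merges neighboring Likert answers before mining, so that their support is pooled into one item. The bin boundaries, question name, and responses below are assumptions for illustration, not the authors' actual settings.

```python
from collections import Counter

# Hypothetical mapping from a 5-point Likert scale to three bins
LIKERT_BINS = {1: "disagree", 2: "disagree", 3: "neutral", 4: "agree", 5: "agree"}

def prebin(responses, question, bins):
    """Turn raw numeric answers into 'question=bin' items for rule mining."""
    return [f"{question}={bins[r]}" for r in responses]

# Hypothetical answers to one question from eight respondents
responses = [1, 2, 2, 4, 5, 5, 3, 1]
items = prebin(responses, "ads_influence_me", LIKERT_BINS)
counts = Counter(items)
print(counts)
```

After binning, the four low answers count toward a single item instead of being split between two separate answer values.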
Effects of Pre-Binning (Figure 5)
Other Examples
HINTS Data
(see handout, Figures 6-10)
Conclusions and Future Work
Conclusions
• Increased precision of “interesting” rules
• More work to be done
Future work
• Tuning of existing processes
• Handle numerical data
• Handle questions not asked to everyone
• Handle questions with multiple responses
• Try other record matching techniques for similar rule pruning