A Web-Based Data Mining Application for the Contraceptive

Method Choice Database

Kerey Carter

Presented To:

Dr. Donald Kraft (Chair)

Dr. John Tyler

Dr. Bert Boyce

Outline

• Introduction

• Technical Description

• Data Mining Algorithms

• System Layout

• Demo / Results

• Conclusion

Introduction - Overview

• Statement of Problem: According to the United Nations Population Fund, progress in the contraceptive prevalence rate (CPR) for Indonesia seems to have stalled at about 57 percent. A critical challenge for Indonesia remains access to affordable contraceptives for all its citizens, especially the poor.

• Solution: Use data mining techniques to predict the current contraceptive method choice (no use, long-term methods, or short-term methods) of an Indonesian woman based on her demographic and socio-economic characteristics. Predicting the contraceptive method choice of Indonesian women can assist the government with how and where to target and provide information on contraceptive choices for its female population.

Introduction – Project Description

• Architecture Used: Web-based 3-Tier Client/Server

• Features: Administrator and User Data Manipulation options; Access Logs

• Data Mining Algorithms used for Prediction:

– Naive Bayesian Classification

– One-Rule Classification

– Decision Tree

Introduction – The Dataset

• This dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. It was created and donated by Tjen-Sien Lim on June 7, 1997. The contents were downloaded from the UCI Machine Learning Repository.

• The samples contained in the survey are of married women who were either not pregnant or did not know if they were pregnant at the time of the interview.

• The number of instances is 1473, and the number of attributes is 11, including the primary key (ID) and the classifying attribute (cmchoice)

Introduction – Attribute Information

Introduction – Website Design

• Administrator Privileges:

– Add, Delete, Edit, and Search Records

– Add Users, Change Admin Password

• User Privileges:

– Add Records, Search Records

• Dual Access:

– Access Logs, Data Mining Results

Technical Description – Three-Tier Model

• Three-Tier Client/Server Architecture

– Increases scalability and reliability by separating the three major logical functions of an application (user interaction, business logic, data storage) from one another. It is easier to maintain than a two-tier client/server design and can reuse the same business logic for multiple interfaces.

– Client Side Front-End: Web browser (Internet Explorer) / HTML

– Middle-Tier: CGI, Perl, ASP, VBScript

– Server Side Back-End: Access, CSV Text File

A Typical Three-Tier Model

Technical Description – Middle Tier

• Common Gateway Interface (CGI)

– A standard for interfacing external applications with information servers, such as HTTP or Web servers

– A CGI program is executed in real-time, which means that it can output dynamic data to a Web page.

– Allows visitors to a Web page to run a program on the server where the CGI document is hosted

– CGI programs must reside in a special directory, so that the Web server knows to execute the program instead of merely displaying it to the browser.

• Ex. /cgi-bin
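
The CGI contract described above can be sketched in a few lines. The project itself used Perl for its CGI programs; this illustration uses Python instead, and the `name` form field and the `build_response` helper are hypothetical names chosen only for the example. It assumes a Web server configured to execute scripts placed in /cgi-bin.

```python
#!/usr/bin/env python3
"""Minimal CGI sketch: echo a 'name' query parameter as an HTML page."""
import html
import os
import sys
from urllib.parse import parse_qs


def build_response(query_string: str) -> str:
    """Return a full CGI response (header, blank line, body) for a query string."""
    params = parse_qs(query_string)
    # 'name' is a hypothetical form field used only for this illustration.
    name = params.get("name", ["visitor"])[0]
    # Escape user input before placing it in the page.
    body = f"<html><body><p>Hello, {html.escape(name)}!</p></body></html>"
    # A CGI program writes its headers, then a blank line, then the document.
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # The Web server passes the part of the URL after '?' in QUERY_STRING.
    sys.stdout.write(build_response(os.environ.get("QUERY_STRING", "")))
```

Because the server runs the program on every request, the page content can change with each call, which is what makes the output dynamic.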

Technical Description – Middle Tier

• Practical Extraction and Report Language (Perl)

– Language used in writing many CGI programs

– Combines many features of C, sed, awk, and sh, as well as csh, Pascal, and BASIC-PLUS

– Available for most operating systems, including virtually all Unix-like platforms

– Optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information

– Does not arbitrarily limit the size of the user’s data, as long as the required memory is available

Technical Description – Middle Tier

• Active Server Pages (ASP)

– Components that allow Web developers to create server-side scripted templates

– Templates generate dynamic, interactive web server applications

– HTML output by an Active Server Page is browser independent

– Visual Basic Script (VBScript)

• Scripting language based on the Visual Basic programming language used to create ASP Web pages

What is Data Mining?

• Data mining is the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules (Berry, Linoff, pg. 5)

• It is a process of taking data and analyzing, interpreting, and transforming it into useful information. Data mining can be used to make predictions after data has been transformed.

Data Mining Algorithms

• Naive Bayesian Classification

– Bayes' theorem shows how to calculate the probability of one event given that some other event is known to have occurred, expressed algebraically as:

P(A|B) = P(A) * P(B|A) / P(B)

That is, the probability that A takes place given that B has occurred, P(A|B), equals the probability that A occurs, P(A), times the probability that B occurs given that A has happened, P(B|A), divided by the probability of B occurring, P(B).
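
A minimal sketch of how the formula turns into a naive Bayesian classifier for categorical attributes is given below. It is not the project's implementation: the function names, the two toy attributes, and the seven toy records are invented for illustration, and no smoothing is applied, so a value never seen with a class gives that class a probability of zero.

```python
from collections import Counter


def train(records, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute_index, value, label)] = number of matching records
    value_counts = Counter()
    for record, label in zip(records, labels):
        for i, value in enumerate(record):
            value_counts[(i, value, label)] += 1
    return class_counts, value_counts


def posterior(record, class_counts, value_counts):
    """Return P(class | record) for every class using Bayes' theorem."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # P(A): prior probability of the class
        for i, value in enumerate(record):
            # P(B|A): attributes are assumed independent given the class
            score *= value_counts[(i, value, label)] / n
        scores[label] = score
    evidence = sum(scores.values())  # P(B): probability of the observed record
    if evidence == 0:
        return scores
    return {label: s / evidence for label, s in scores.items()}


# Invented toy data: (education, has_children) -> method choice.
toy_records = [("high", "yes"), ("high", "yes"), ("low", "yes"),
               ("low", "no"), ("low", "no"), ("high", "no"), ("low", "yes")]
toy_labels = ["short-term", "short-term", "short-term",
              "no use", "no use", "no use", "no use"]

model = train(toy_records, toy_labels)
probabilities = posterior(("high", "yes"), *model)
prediction = max(probabilities, key=probabilities.get)
```

The class with the largest posterior is taken as the prediction; dividing by P(B) does not change which class wins, but it makes the scores sum to one.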

Data Mining Algorithms

• Bayesian Network - consists of nodes and arcs that can connect pairs of nodes. For each variable, exactly one node exists. A major restriction for the Bayesian network is that arcs are not allowed to form loops.

• NOT a Bayesian network:
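
The restriction that arcs must not form loops can be checked mechanically. The sketch below is one way to do it, using a depth-first search over directed arcs; the function name and the node names in the usage lines are assumptions made for the example, not taken from the original slides.

```python
def has_cycle(arcs):
    """Return True if the directed arcs (parent, child) contain a loop."""
    children = {}
    for parent, child in arcs:
        children.setdefault(parent, []).append(child)
        children.setdefault(child, [])

    # unvisited: not seen yet; active: on the current path; done: fully explored
    UNVISITED, ACTIVE, DONE = 0, 1, 2
    state = {node: UNVISITED for node in children}

    def visit(node):
        state[node] = ACTIVE
        for nxt in children[node]:
            if state[nxt] == ACTIVE:
                return True  # reached a node already on the current path
            if state[nxt] == UNVISITED and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(state[n] == UNVISITED and visit(n) for n in list(state))


# A valid network structure: no arc path leads back to its starting node.
valid = has_cycle([("A", "C"), ("B", "C"), ("C", "D")])
# Not a Bayesian network: A -> B -> C -> A forms a loop.
invalid = has_cycle([("A", "B"), ("B", "C"), ("C", "A")])
```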

Data Mining Algorithms

• Bayesian network dependency model example:

– A and B are dependent on each other if we know something about C or D (or both).

– A and C are dependent on each other no matter what we know and what we don't know about B or D (or both).

– B and C are dependent on each other no matter what we know and what we don't know about A or D (or both).

– C and D are dependent on each other no matter what we know and what we don't know about A or B (or both)

– There are no other dependencies that do not follow from those listed above.

Data Mining Algorithms

• One-Rule Classification

– Creates one data mining rule for the dataset based on one attribute (one column in a database table).

– After comparing the error rates from all the attributes, the rule is chosen that gives the lowest classification error.

– The rule will assign to one category or class each distinct value of one chosen attribute

Data Mining Algorithms

• Pseudocode for One-Rule Classification:

• For each attribute in the data set

– For each distinct value of the attribute

• Find the most frequent classification

• Assign the classification to the value

• Calculate the error rate for the value

– Calculate the total error rate for the attribute

• Choose the attribute with the lowest error rate
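
The pseudocode above can be sketched as runnable Python. This is an illustration under stated assumptions rather than the project's code: records are assumed to be dictionaries, the `education` attribute and all six toy rows are invented, and only the `no_child` and `cmchoice` names come from the presentation.

```python
from collections import Counter, defaultdict


def one_rule(rows, attributes, target):
    """Return (best_attribute, rule, error_rate) for a list of dict records."""
    best = None
    for attribute in attributes:
        # Tally class frequencies for each distinct value of the attribute.
        tallies = defaultdict(Counter)
        for row in rows:
            tallies[row[attribute]][row[target]] += 1
        # Assign the most frequent classification to each value.
        rule = {value: counts.most_common(1)[0][0]
                for value, counts in tallies.items()}
        # Errors are the records not covered by the majority class of their value.
        errors = sum(sum(counts.values()) - counts.most_common(1)[0][1]
                     for counts in tallies.values())
        error_rate = errors / len(rows)
        # Keep the attribute with the lowest total error rate.
        if best is None or error_rate < best[2]:
            best = (attribute, rule, error_rate)
    return best


# Invented toy rows; values are illustrative only.
toy_rows = [
    {"no_child": "0", "education": "high", "cmchoice": "no use"},
    {"no_child": "0", "education": "low", "cmchoice": "no use"},
    {"no_child": "1+", "education": "high", "cmchoice": "short-term"},
    {"no_child": "1+", "education": "low", "cmchoice": "short-term"},
    {"no_child": "1+", "education": "high", "cmchoice": "long-term"},
    {"no_child": "1+", "education": "low", "cmchoice": "no use"},
]

attribute, rule, error_rate = one_rule(toy_rows, ["no_child", "education"], "cmchoice")
```

Ties between attributes are resolved here in favor of the attribute listed first; the slides do not say how the original application breaks ties.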

Data Mining Algorithms

• Decision Tree:

– (1) Begins by finding the test that does the best job of splitting the data among the desired categories

– (2) At each successive level of the tree, subsets created by the previous split are themselves split, making a path down the tree

– (3) Each of the paths through the tree represents a rule, and some rules are more useful than other ones

Data Mining Algorithms

• Decision Tree (cont.)

– (4) At each node of the tree, three things can be measured: the number of records entering the node, the percentage of records classified correctly at the node, and the way the records would be classified if it were a leaf node.

– (5) The tree continues to grow until it is no longer possible to locate more useful ways to split the incoming records
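
Steps (1) and (4) can be sketched as follows. The slides do not state which measure decides the best split, so information gain (entropy reduction) is used here purely as an assumption; the function names and the toy values are also invented for illustration.

```python
import math
from collections import Counter


def node_stats(labels):
    """Measure a node: records entering, class predicted as a leaf, % correct."""
    counts = Counter(labels)
    majority, hits = counts.most_common(1)[0]
    return len(labels), majority, 100.0 * hits / len(labels)


def entropy(labels):
    """Shannon entropy of the class labels, in bits."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total)
                for n in Counter(labels).values())


def best_threshold(values, labels):
    """Find the test 'value <= t' with the largest information gain."""
    base = entropy(labels)
    best_t, best_gain = None, 0.0
    # Skip the maximum value: splitting there would leave one side empty.
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


# Invented toy data: number of children and method choice.
toy_children = [0, 0, 1, 2, 3, 4]
toy_choice = ["no use", "no use", "short-term",
              "short-term", "long-term", "short-term"]

threshold, gain = best_threshold(toy_children, toy_choice)
root = node_stats(toy_choice)
```

A full tree would apply `best_threshold` recursively to the subsets on each side of the split and stop when no split yields a positive gain, which corresponds to step (5).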

System Layout

Demo / Results

Conclusion

• All 3 data mining algorithms were successful at making predictions of the contraceptive method choice.

– Naive Bayesian - the estimated classification accuracy of the best model found was 48.74%

– One Rule - determined that the no_child attribute was the best predictor of the contraceptive method choice

– Decision Tree -

• determined that the best predictor of the contraceptive method choice was the rule where no_child <= 0, which would predict the no use category (95.7%)

• The regular decision tree had an error rate of 25.0%, while the decision tree with fuzzy thresholds had an error rate of 25.5%