Mining Attribute Lifecycle to Predict Faults and Incompleteness in Database Applications

Posted on 29-Jul-2015
1. Mining Attribute Lifecycle to Predict Faults and Incompleteness in Database Applications
Presented by: Sandra Alex, Roll no: 40

2. Outline
- Introduction
- Attribute Lifecycle Characterization
- Proposed Approach
- Experiment
- Prediction
- Related Work
- Conclusion
- References

3. Introduction
- Each attribute value is created initially via insertion, then referenced, updated or deleted.
- These event occurrences, associated with states, constitute the attribute lifecycle.
- The lifecycle describes the behaviour of an attribute value from its insertion to its final deletion.
- We extract the attribute lifecycle out of a database application.

4. Introduction
- Our empirical studies discover that faults and incompleteness in database applications are highly associated with the attribute lifecycle.
- The learned prediction model can be applied in the development and maintenance of database applications.
- Experiments were conducted on PHP systems.

5. Attribute Lifecycle Characterization
For each attribute, a value is:
i. created -> insertion
ii. referenced -> selection
iii. updated -> updating
iv. deleted -> deletion
These event occurrences are associated with states, which constitute the attribute lifecycle.

6. Attribute Lifecycle Characterization
[State transition diagram of the attribute lifecycle]

7. Attribute Lifecycle Characterization
- Programs sustain the attribute lifecycle through four database operations: INSERT, SELECT, UPDATE and DELETE.
- We formulate the following characteristics to describe an attribute's lifecycle:
i. Create (C): the value of the attribute is inserted.
ii. Null Create (NC): the attribute is inserted without a value.
iii. Control Update (COU): the update is not influenced by the existing attribute value or by inputs from the user and database.

8. Attribute Lifecycle Characterization
iv. Overriding Update (OVU): the update is not influenced by the existing value.
v. Cumulating Update (CMU): the update is influenced by the existing value.
vi. Delete (D): the attribute is deleted as a result of the deletion of the record.
vii. Use (U): the value is used to support the insertion, updating or deletion of other database attributes, or is output to the external environment.
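The seven characteristics above can be written down as a small Python enum; this is an illustrative sketch, not the paper's implementation, and the names are assumptions:

```python
from enum import Enum

class Characteristic(Enum):
    """Hypothetical encoding of the seven lifecycle characteristics."""
    C = "Create"               # value of the attribute is inserted
    NC = "Null Create"         # attribute is inserted without a value
    COU = "Control Update"     # not influenced by existing value or user/db inputs
    OVU = "Overriding Update"  # not influenced by the existing value
    CMU = "Cumulating Update"  # influenced by the existing value
    D = "Delete"               # attribute deleted when its record is deleted
    U = "Use"                  # value supports other operations or external output

# The member order matches the slide order C, NC, COU, OVU, CMU, D, U,
# which is also the element order of the lifecycle vector introduced next.
print([c.name for c in Characteristic])
```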
9. Attribute Lifecycle Characterization
Hence, we characterize the attribute lifecycle by a seven-element vector [m1, m2, m3, m4, m5, m6, m7], where m1 to m7 denote whether a database operation of type C, NC, COU, OVU, CMU, D or U, respectively, is performed on the attribute.

10. Proposed Approach
A. Mining Attribute Lifecycle

11. Proposed Approach
B. Extracting Attribute Lifecycle Characterization Data
1) Query Extraction
- A query can be different at runtime.

12. Proposed Approach
- A control flow graph (CFG) is built for the code.

13. Proposed Approach
- The CFG generates a set of basis paths.
- When a query execution function such as mysql_query is encountered, the definition of every variable used is retrieved.
- Literals are replaced by their actual values.
- Variables whose values are not statically known are replaced by placeholders.
- The parts of query strings with replaced values are connected.

14. Proposed Approach
2) Analysis of Attribute Lifecycle
- The extracted queries are analysed with an SQL grammar parser to obtain the attribute lifecycle patterns.
- CREATE TABLE: parsed first to collect the schema of the table.
- VIEW: attributes are mapped between the view and its base table.

15. Proposed Approach
SELECT:
- The query is parsed, table aliases are restored to the actual table names, and attributes are identified.
- *: the schema of the table is consulted to get all attribute names.
- count(*): not considered; characterized as Use.

16. Proposed Approach
INSERT:
- The table name is identified first.
- No column list: all the attributes are inserted.
- With a column list: the listed attributes are extracted.
- Attributes that are auto-incremental or have non-null default values are also treated as inserted by the query.
- These attributes are characterized as Create.
- Attributes explicitly assigned to null are marked as Null Create.
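The INSERT analysis above can be sketched with a minimal regex-based parser. The slides describe a full SQL grammar parser; this simplified stand-in, with a hypothetical schema table, only shows how Create and Null Create are assigned:

```python
import re

# Hypothetical schema, as collected from CREATE TABLE: table -> attribute names.
SCHEMA = {"orders": ["id", "customer", "total", "note"]}

def analyze_insert(query):
    """Classify each attribute of an INSERT as 'C' (Create) or 'NC' (Null Create)."""
    m = re.match(
        r"INSERT\s+INTO\s+(\w+)\s*(?:\(([^)]*)\))?\s*VALUES\s*\(([^)]*)\)",
        query, re.IGNORECASE)
    table, cols, vals = m.group(1), m.group(2), m.group(3)
    # No column list -> all attributes of the table are inserted.
    columns = [c.strip() for c in cols.split(",")] if cols else SCHEMA[table]
    values = [v.strip() for v in vals.split(",")]
    # Attributes explicitly assigned NULL are marked as Null Create.
    return {c: ("NC" if v.upper() == "NULL" else "C")
            for c, v in zip(columns, values)}

print(analyze_insert("INSERT INTO orders (id, customer, note) VALUES (1, 'bob', NULL)"))
```

A real implementation would also consult the schema for auto-increment and default-valued columns, which the slides say are treated as inserted even when absent from the column list.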
17. Proposed Approach
UPDATE:
- Collect the attribute names.
- Identify the update pattern.
- Attribute assignments in the SET clause are separated.
- The value string is analysed to determine the update characteristic: either COU, OVU or CMU.
- Attributes used in the WHERE clause are marked as Use.

18. Proposed Approach
DELETE:
- Identify the table name.
- Mark all the attributes as Delete.
- Mark attributes in the WHERE clause as Use.
For each query, the attribute names are put into a collection, from which the attribute lifecycle vectors are created.

19. Proposed Approach
3) Generation of Attribute Lifecycle Vectors
For example, if there is at least one Create characteristic for an attribute, the first element of its vector is 1; otherwise it is 0. If there is no operation on an attribute, all elements are set to 0. We generate vectors for all attributes in a database application.

20. Experiment
A. Data Collection
We seed faults in open-source database applications to train our model. The chosen systems should have very few faults associated with the attribute lifecycle:
- Source code: publicly available.
- Application size: considerable (in transaction number and attribute number).
- Maturity: mature enough to have very few faults associated with the attribute lifecycle.

21. Experiment
- batavi: a web-based e-commerce system
- webERP: an accounting and business management system
- FrontAccounting: a professional web-based system
- OpenBusinessNetwork: an application designed for business
- SchoolMate: a solution for school administrations

22. Experiment
Attribute lifecycles have a number of common patterns; attributes that do not follow them can cause errors. We seeded the following common errors:
1) Missing function: attributes are provided, but the function is not catered for during the program design.
2) Inconsistency design: correcting the result of a transaction that updates an attribute by cumulative update using an overriding update.
3) Redundant function: new programs for different types of operations.
4) No Update: new attributes without any update functions.

23. Experiment
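The UPDATE-characteristic analysis and the vector generation described in slides 17 to 19 can be sketched as follows. The classification heuristics are illustrative assumptions based on the slide definitions, not the paper's actual analysis of the value string:

```python
import re

def classify_update(column, value_expr):
    """Heuristically classify one SET assignment as CMU, COU or OVU."""
    # Influenced by the existing value (e.g. "qty = qty + 1") -> Cumulating Update.
    if re.search(r"\b%s\b" % re.escape(column), value_expr):
        return "CMU"
    # A bare literal: not influenced by existing value or by user/db inputs
    # -> Control Update.
    if re.fullmatch(r"\s*(?:\d+(?:\.\d+)?|'[^']*'|NULL)\s*",
                    value_expr, re.IGNORECASE):
        return "COU"
    # Otherwise the old value is overridden by some input -> Overriding Update.
    return "OVU"

# Element order of the lifecycle vector, as defined in slide 9.
ORDER = ["C", "NC", "COU", "OVU", "CMU", "D", "U"]

def lifecycle_vector(observed):
    """Seven-element vector: 1 if the characteristic was observed, else 0."""
    return [1 if ch in observed else 0 for ch in ORDER]

print(classify_update("qty", "qty + 1"))         # CMU
print(lifecycle_vector({"C", "CMU", "U"}))
```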
B. Experimental Design
We use three classifiers to learn the prediction model.
1) C4.5 classifier
- A decision tree classification algorithm.
- Uses the normalized information gain of one attribute A to split the data.

24. Experiment
Info(D) is defined as:

Info(D) = -Σ_i p_i log2(p_i)

where p_i is the probability that one instance belongs to class i. In the training process, the classifier each time chooses the attribute with the highest normalized information gain to split the data, until all attributes are used.

25. Experiment
2) Naive Bayes classifier
- A generative probabilistic model based on Bayes' theorem.
- Assuming that the attributes are independent, we have P(Ci | x1, ..., xn) ∝ P(Ci) Π_i P(xi | Ci).
- For a categorical value, the probability P(xi | Ci) is the proportion of the instances in class Ci which have attribute value xi.

26. Experiment
3) SVM classifier
- The Support Vector Machine (SVM) is based on statistical learning theory.
- It trains the classification model by searching for the hyperplane which maximizes the margin between classes.

27. Experiment
C. Model Training
- Attributes from the five systems were labelled to create the training set.
- We manually checked and labelled each attribute as "missing function", "inconsistency design", "redundant function", "no update" or "normal".

28. Experiment
- The model was trained by the three classifiers.
- For the evaluation of the trained models, 10-fold cross-validation was performed on the training set.
- The set was randomly partitioned into 10 folds; each time, 9 of them served as the training set and 1 fold was the testing set.
- We computed the average measurements.

29. Experiment
D. Assessing Performance
- probability of detection: pd = tp / (tp + fn)
- probability of false alarm: pf = fp / (fp + tn)
- precision: pr = tp / (tp + fp)
Ideally, pd approaches 1 and pf approaches 0.

30. Experiment
- C4.5: pd > 87%; C4.5 outperforms Naive Bayes.
- SVM: pd > 95%.

31. Prediction
- The model is used to predict whether there are attributes with missing function, inconsistency design, redundant function and no update.
- We applied our prediction model learned by SVM to these systems and counted the attributes that were predicted.
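The performance measures from the assessing-performance slide can be computed directly from a confusion matrix. A minimal sketch; tp = 98 and fp = 9 echo the 98-of-107 confirmed predictions reported later, while the tn and fn counts are hypothetical:

```python
def assess(tp, fp, tn, fn):
    """Compute the three measures defined on the slide."""
    pd = tp / (tp + fn)   # probability of detection (recall)
    pf = fp / (fp + tn)   # probability of false alarm
    pr = tp / (tp + fp)   # precision
    return pd, pf, pr

# Ideally pd approaches 1 and pf approaches 0.
pd, pf, pr = assess(tp=98, fp=9, tn=880, fn=2)
print(round(pd, 4), round(pf, 4), round(pr, 4))
```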
32. Prediction
- Designers could take corresponding actions to fix these design faults and incompleteness.
- Further, we manually validated all the predicted attributes.
- Of all the 107 attributes, 98 are confirmed to be real; the prediction precision is 91.59%.

33. Conclusion
- For each attribute, we extract the set of characteristics that can be extracted from the code of database applications to characterize its lifecycle; a characterization vector is formed.
- Data mining techniques are applied to mine the attribute lifecycle using the data collected from open-source database systems.
- We seed errors in mature systems to simulate design faults and train the dataset for our classification method; five types of labelled attributes are obtained.

34. Conclusion
- A fault and incompleteness prediction model is then built.
- In our experiment, the model achieved 98.04% precision and 98.25% recall on average for SVM.
- We also applied the model on four open-source database applications for prediction.
- Future work: conduct more comprehensive experiments on a larger set of systems to further validate the merits of the proposed approach.

35. References
[1] N. Nagappan and T. Ball, "Static Analysis Tools as Early Indicators of Pre-release Defect Density," in Proceedings of the 27th International Conference on Software Engineering. ACM, 2005, pp. 580-586.
[2] A. Nikora and J. Munson, "Developing Fault Predictors for Evolving Software Systems," in Proceedings of the Ninth International Software Metrics Symposium. IEEE, 2003, pp. 338-350.
[3] A. Watson, T. McCabe, and D. Wallace, "Structured testing: A testing methodology using the cyclomatic complexity metric," NIST Special Publication, vol. 500, no. 235, pp. 1-114, 1996.
[4] W. Fan, M. Miller, S. Stolfo, W. Lee, and P. Chan, "Using artificial anomalies to detect unknown and known network intrusions," Knowledge and Information Systems, vol. 6, no. 5, pp. 507-527, 2004.

