data mining query languages

Post on 18-Dec-2014

TRANSCRIPT

  • 1. Data Mining Query Languages
      Kristen LeFevre
      April 19, 2004
      With thanks to Zheng Huang and Lei Chen
  • 2. Outline
      - Introduce the problem of querying data mining models
      - Overview of three different solutions and their contributions
      - Topic for discussion: What would an ideal solution support?
  • 3. Problem Description
      - You guys are armed with two powerful tools:
          - Database management systems
          - Efficient and effective data mining algorithms and frameworks
      - Generally, this work asks: How can we merge the two? How can we integrate data mining more closely with traditional database systems, particularly querying?
  • 4. Three Different Answers
      - DMQL: A Data Mining Query Language for Relational Databases (Han et al., Simon Fraser University)
      - Integrating Data Mining with SQL Databases: OLE DB for Data Mining (Netz et al., Microsoft)
      - MSQL: A Query Language for Database Mining (Imielinski & Virmani, Rutgers University)
  • 5. Some Common Ground
      - Create and manipulate data mining models through a SQL-based interface (command-driven data mining)
      - Abstract away the data mining particulars
      - Data mining should be performed on data in the database (should not need to export to a special-purpose environment)
      - Approaches differ on what kinds of models should be created, and what operations we should be able to perform
  • 6. DMQL
      Commands specify the following:
      - The set of data relevant to the data mining task (the training set)
      - The kinds of knowledge to be discovered:
          - Generalized relation
          - Characteristic rules
          - Discriminant rules
          - Classification rules
          - Association rules
  • 7. DMQL
      Commands specify the following:
      - Background knowledge: concept hierarchies based on attribute relationships, etc.
      - Various thresholds: minimum support, confidence, etc.
  • 8. DMQL Syntax
      use database <...>
      {use hierarchy <...> for <...>}                 (specify background knowledge)
      <...>                                           (specify rules to be discovered)
      related to <...>                                (relevant attributes or aggregations)
      from <...>                                      (collect the set of relevant data to mine)
      [where <...>]
      [order by <...>]
      {with [<...>] threshold = <...> [for <...>]}    (specify threshold parameters)
  • 9. DMQL Syntax
      find classification rules [as <...>] [according to <...>]
      find association rules [as <...>]
      generalize data [into <...>]
      others
  • 10. DMQL
      use database Hospital
      find association rules as Heart_Health
      related to Salary, Age, Smoker, Heart_Disease
      from Patient_Financial f, Patient_Medical m
      where f.ID = m.ID and m.age >= 18
      with support threshold = .05
      with confidence threshold = .7
  • 11. DMQL
      - DMQL provides a "display in" command to view the resulting rules, but no advanced way to query them
      - Suggests that a GUI might aid in presenting these results in different forms (charts, graphs, etc.)
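To make the support and confidence thresholds in the slide 10 query concrete, here is a minimal Python sketch of brute-force association rule discovery over a handful of rows. The data, the descriptor strings, and the function names are all hypothetical, and the thresholds are treated as inclusive lower bounds, which is an assumption. It is not how DMQL itself evaluates a query.

```python
from itertools import combinations

# Hypothetical toy data: each row is the set of descriptors true for one patient.
rows = [
    {"Smoker=yes", "Age=50+", "Heart_Disease=yes"},
    {"Smoker=yes", "Age=50+", "Heart_Disease=yes"},
    {"Smoker=yes", "Age=18-49", "Heart_Disease=no"},
    {"Smoker=no", "Age=50+", "Heart_Disease=no"},
    {"Smoker=no", "Age=18-49", "Heart_Disease=no"},
]

def support(itemset, rows):
    """Fraction of rows that contain every descriptor in itemset."""
    return sum(itemset <= row for row in rows) / len(rows)

def find_rules(rows, min_support, min_confidence):
    """Enumerate body -> consequent rules (one descriptor in the consequent)
    whose support and confidence reach the given thresholds."""
    items = sorted(set().union(*rows))
    rules = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            full = frozenset(itemset)
            s = support(full, rows)
            if s == 0 or s < min_support:
                continue
            for consequent in itemset:
                body = full - {consequent}
                # confidence = support(body and consequent) / support(body)
                conf = s / support(body, rows)
                if conf >= min_confidence:
                    rules.append((body, consequent, s, conf))
    return rules

# Same thresholds as the DMQL example: support .05, confidence .7
rules = find_rules(rows, min_support=0.05, min_confidence=0.7)
```

The exhaustive enumeration is only workable for toy inputs; it is used here because it keeps the two threshold checks easy to see.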
  • 12. MSQL
      - Focus on association rules
      - Seeks to provide a language both to selectively generate rules, and separately to query the rule base
      - Expressive rule generation language, and techniques for optimizing some commands
  • 13. MSQL
      Get-Rules and Select-Rules Queries
      - The Get-Rules operator generates rules over elements of argument class C that satisfy the conditions described in the where clause
      [Project Body, Consequent, confidence, support]
      GetRules(C) [as R1]
      [into <...>]
      [where <...>]
      [sql-group-by clause]
      [using-clause]
  • 14. MSQL
      The where clause may contain a number of conditions, including:
      - Restrictions on the attributes in the body or consequent (in, has, and is are rule subset, superset, and equality, respectively):
          rule.body HAS {(Job = Doctor)}
          rule1.consequent IN rule2.body
          rule.consequent IS {Age = *}
      - Pruning conditions (restrict by support, confidence, or size)
      - Stratified or correlated subqueries
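The three operators on slide 14 can be read as set tests. The sketch below models a rule body or consequent as a set of (attribute, value) pairs and treats "*" as matching any value. The function names and the exact wildcard semantics are assumptions made for illustration and are not taken from the MSQL paper.

```python
def matches(pattern, descriptor):
    """A pattern such as ("Age", "*") matches any descriptor on the same attribute."""
    (p_attr, p_val), (attr, val) = pattern, descriptor
    return p_attr == attr and p_val in ("*", val)

def has(rule_part, patterns):
    """HAS (superset): every pattern is matched by some descriptor in rule_part."""
    return all(any(matches(p, d) for d in rule_part) for p in patterns)

def is_in(rule_part, other_part):
    """IN (subset): every descriptor of rule_part also appears in other_part."""
    return set(rule_part) <= set(other_part)

def is_(rule_part, patterns):
    """IS (equality): same size, every pattern is matched, and no descriptor is left unmatched."""
    return (len(rule_part) == len(patterns)
            and has(rule_part, patterns)
            and all(any(matches(p, d) for p in patterns) for d in rule_part))

# Hypothetical rule body: {(Job = Doctor), (Age = 30)}
body = {("Job", "Doctor"), ("Age", "30")}
print(has(body, {("Job", "Doctor")}))        # body HAS {(Job = Doctor)}
print(is_in({("Age", "30")}, body))          # {(Age = 30)} IN body
print(is_({("Age", "30")}, {("Age", "*")}))  # {(Age = 30)} IS {Age = *}
```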
  • 15. MSQL
      GetRules(Patients) R1
      where Body has {Age = *}
        and Support > .05 and Confidence > .7
        and not exists ( GetRules(Patients) R2
                         where Support > .05 and Confidence > .7
                           and R2.Body HAS R1.Body )
      Retrieve all rules with descriptors of the form Age = x in the body, except when there is a rule with equal or greater support and confidence whose body contains a superset of those descriptors.
  • 16. MSQL
      Correlated:
        GetRules(C) R1 where <...>
          and not exists ( GetRules(C) R2 where <...> and R2.Body HAS R1.Body )
      Stratified:
        GetRules(C) R1 where <...> and consequent is {(X=*)}
          and consequent in ( SelectRules(R2) where consequent is {(X=*)} )
  • 17. MSQL
      Nested Get-Rules queries and their optimization
      - Stratified (non-correlated) queries are evaluated bottom-up. The subquery is evaluated first and replaced with its results in the outer query.
      - Correlated queries are evaluated either top-down or bottom-up (like loop unfolding), and there are rules for choosing between the two options.
  • 18. MSQL
      GetRules(Patients) R1
      where Body has {Age = *}
        and Support > .05 and Confidence > .7
        and not exists ( GetRules(Patients) R2
                         where Support > .05 and Confidence > .7
                           and R2.Body HAS R1.Body )
  • 19. MSQL
      Top-Down Evaluation
      GetRules(Patients) R1
      where Body has {Age = *}
        and Support > .05 and Confidence > .7
      For each rule produced by the outer query, evaluate the inner query:
        not exists ( GetRules(Patients) R2
                     where Support > .05 and Confidence > .7
                       and R2.Body HAS R1.Body )
  • 20. MSQL
      Bottom-Up Evaluation
        not exists ( GetRules(Patients) R2
                     where Support > .05 and Confidence > .7
                       and R2.Body HAS R1.Body )
      For each rule produced by the inner query, evaluate the outer query:
      GetRules(Patients) R1
      where Body has {Age = *}
        and Support > .05 and Confidence > .7
  • 21. MSQL
      Choosing between the two
      - In general, evaluate the expression with more restrictive conditions first
      - Heuristic rules:
          - Evaluate the query with higher support threshold first
          - Next consider confidence threshold
          - A (length = x) expression is in general more restrictive than (length > x), which is more restrictive than (length
      - Meant to prevent unconstrained
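The two evaluation orders for the correlated query on slides 18 to 20 can be sketched in Python as follows. The rule base, the `Rule` tuple, and the helper names are hypothetical. HAS is read here as a strict superset so that a rule does not exclude itself, which is an assumption about the intended semantics.

```python
from collections import namedtuple

Rule = namedtuple("Rule", "body consequent support confidence")

# Hypothetical rule base standing in for GetRules(Patients).
RULES = [
    Rule(frozenset({"Age=50+"}), "HeartDisease=yes", 0.30, 0.75),
    Rule(frozenset({"Age=50+", "Smoker=yes"}), "HeartDisease=yes", 0.20, 0.90),
    Rule(frozenset({"Age=18-49"}), "HeartDisease=no", 0.40, 0.80),
    Rule(frozenset({"Smoker=no"}), "HeartDisease=no", 0.35, 0.85),
    Rule(frozenset({"Age=50+", "Smoker=no"}), "HeartDisease=no", 0.02, 0.95),
]

def passes_thresholds(r):
    """Support > .05 and Confidence > .7"""
    return r.support > 0.05 and r.confidence > 0.7

def outer_condition(r):
    """Body has {Age = *} plus the thresholds."""
    return passes_thresholds(r) and any(d.startswith("Age=") for d in r.body)

def top_down(rules):
    """For each rule produced by the outer query, evaluate the inner query."""
    result = []
    for r1 in rules:
        if not outer_condition(r1):
            continue
        exists = any(passes_thresholds(r2) and r2.body > r1.body for r2 in rules)
        if not exists:
            result.append(r1)
    return result

def bottom_up(rules):
    """For each rule produced by the inner query, find the outer rules it rules out."""
    excluded = set()
    for r2 in rules:
        if not passes_thresholds(r2):
            continue
        for r1 in rules:
            if outer_condition(r1) and r2.body > r1.body:
                excluded.add(r1)
    return [r1 for r1 in rules if outer_condition(r1) and r1 not in excluded]

# Both orders return the same rules; they differ only in which query drives the loop.
print(top_down(RULES) == bottom_up(RULES))
```

Which order is cheaper depends on how many rules each side produces, which is what the heuristics on slide 21 try to estimate from the thresholds.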
  • 22. OLE DB for DM
      - An extension to the OLE DB interface for Microsoft SQL Server
      - Seeks to support the following ideas:
          - Define a model by specifying the set of attributes to be predicted, the attributes used for the prediction, and the algorithm
          - Populate the model using the training data
          - Predict attributes for new data using the populated model (none of the others seemed to support this)
          - Browse the mining model (not fully addressed because it varies a lot by model type)
  • 23. OLE DB for DM
      - Defining a mining model: identify the set of data attributes to be predicted, the set of attributes to be used for prediction, and the algorithm to be used for building the model
      - Populating the model: pull the information into a single rowset using views, and train the model using the data and algorithm specified
      - Supports complex objects, so the rowset may be hierarchical (see paper for more complex examples)
  • 24. OLE DB for DM
      Using the mining model to predict
      - Defines a new operator, the prediction join
      - A model may be used to make predictions on datasets by taking the prediction join of the mining model and the data set
  • 25. OLE DB for DM
      CREATE MINING MODEL [Heart_Health Prediction]
      (
          [ID] Int Key,
          [Age] Int,
          [Smoker] Int,
          [Salary] Double discretized,
          [HeartAttack] Int PREDICT    %Prediction column
      )
      USING [Decision_Trees_101]
      Identifies the source columns for the training data, the column to be predicted, and the data mining algorithm.
  • 26. OLE DB for DM
      INSERT INTO [Heart_Health Prediction] ([ID], [Age], [Smoker], [Salary])
      SELECT M.[ID], [Age], [Smoker], [Salary]
      FROM Patient_Medical M, Patient_Financial F
      WHERE M.ID = F.ID
      The INSERT represents using a tuple for training the model (not actually inserting it into the rowset).
  • 27. OLE DB for DM
      SELECT t.[ID], [Heart_Health Prediction].[HeartAttack]
      FROM [Heart_Health Prediction]
      PREDICTION JOIN (
          SELECT M.[ID], [Age], [Smoker], [Salary]
          FROM Patient_Medical M, Patient_Financial F
          WHERE M.ID = F.ID
      ) AS t
      ON [Heart_Health Prediction].Age = t.Age
         AND [Heart_Health Prediction].Smoker = t.Smoker
         AND [Heart_Health Prediction].Salary = t.Salary
      The prediction join connects the model and an actual data table to make predictions.
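The define, populate, and predict cycle on slides 25 to 27 can be mimicked in a few lines of Python. This is only a toy analogue under stated assumptions: a majority-vote lookup table stands in for the decision tree algorithm, the rows are invented, and `train` and `prediction_join` are hypothetical names rather than part of OLE DB for DM.

```python
from collections import Counter, defaultdict

INPUTS = ("Age", "Smoker", "Salary")   # attributes used for the prediction
TARGET = "HeartAttack"                 # attribute to be predicted

def train(rows):
    """Analogue of INSERT INTO the model: consume training tuples and keep
    only the learned mapping, not the tuples themselves."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[tuple(row[a] for a in INPUTS)][row[TARGET]] += 1
    default = Counter(row[TARGET] for row in rows).most_common(1)[0][0]
    cases = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    return {"cases": cases, "default": default}

def prediction_join(model, rows):
    """Match each new row to the model on the input attributes and
    emit (ID, predicted value); unseen combinations get the overall majority."""
    return [
        (row["ID"], model["cases"].get(tuple(row[a] for a in INPUTS), model["default"]))
        for row in rows
    ]

# Hypothetical training data (Salary is already discretized).
training = [
    {"ID": 1, "Age": 60, "Smoker": 1, "Salary": "low", "HeartAttack": 1},
    {"ID": 2, "Age": 60, "Smoker": 1, "Salary": "low", "HeartAttack": 1},
    {"ID": 3, "Age": 60, "Smoker": 1, "Salary": "low", "HeartAttack": 0},
    {"ID": 4, "Age": 30, "Smoker": 0, "Salary": "high", "HeartAttack": 0},
    {"ID": 5, "Age": 30, "Smoker": 0, "Salary": "high", "HeartAttack": 0},
]
new_patients = [
    {"ID": 10, "Age": 60, "Smoker": 1, "Salary": "low"},
    {"ID": 11, "Age": 30, "Smoker": 0, "Salary": "high"},
    {"ID": 12, "Age": 45, "Smoker": 1, "Salary": "high"},
]

model = train(training)
predictions = prediction_join(model, new_patients)
print(predictions)
```

The point of the analogy is the division of labor: training consumes rows and leaves behind a model, and the prediction join pairs that model with fresh rows on the input columns, much like the ON clause in slide 27.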
  • 28. Key Ideas
      - Important to have an API for creating and manipulating data mining models
      - The data is already in the DBMS, so it makes sense to do the data mining where the data is
      - Applications already use SQL, so a SQL extension seems logical
  • 29. Key Ideas
      - Need a method for defining data mining models, including algorithm specification, specification of various parameters, and training set specification (DMQL, MSQL, OLE DB for DM)
      - Need a method of querying the models (MSQL)
      - Need a way of using the data mining model to interact with other data in the database, for purposes such as prediction (OLE DB for DM)
  • 30. Discussion Topic: What functionality would an ideal solution support?