Transcript
  • 7/28/2019 Data Mining Query Languages (2)

    1/30

    Data Mining Query Languages

    Kristen LeFevre

    April 19, 2004

    With Thanks to Zheng Huang and Lei Chen

    Problem Description

    You guys are armed with two powerful tools

    Database management systems

    Efficient and effective data mining algorithms and frameworks

    Generally, this work asks:

    How can we merge the two?

    How can we integrate data mining more closely with traditional database systems, particularly querying?

    Three Different Answers

    DMQL: A Data Mining Query Language for Relational Databases (Han et al., Simon Fraser University)

    Integrating Data Mining with SQL Databases: OLE DB for Data Mining (Netz et al., Microsoft)

    MSQL: A Query Language for Database Mining (Imielinski & Virmani, Rutgers University)

    Some Common Ground

    Create and manipulate data mining models through a SQL-based interface (Command-driven data mining)

    Abstract away the data mining particulars

    Data mining should be performed on data in the database (should not need to export to a special-purpose environment)

    Approaches differ on what kinds of models should be created, and what operations we should be able to perform

    DMQL

    Commands specify the following:

    The set of data relevant to the data mining task (the training set)

    The kinds of knowledge to be discovered

    Generalized relation

    Characteristic rules

    Discriminant rules

    Classification rules

    Association rules

    DMQL

    Commands Specify the following:

    Background knowledge

    Concept hierarchies based on attribute relationships, etc.

    Various thresholds

    Minimum support, confidence, etc.

    DMQL

    Syntax

    use database
    {use hierarchy for}
    related to
    from
    [where ]
    [order by ]
    {with [] threshold = [for ]}

    Specify background knowledge

    Specify rules to be discovered

    Collect the set of relevant data to mine

    Specify threshold parameters

    Relevant attributes or aggregations

    DMQL

    Syntax

    find classification rules [as ]

    [according to ]

    find association rules [as ]

    generalize data [into ]

    others

    DMQL

    use database Hospital
    find association rules as Heart_Health
    related to Salary, Age, Smoker, Heart_Disease
    from Patient_Financial f, Patient_Medical m
    where f.ID = m.ID and m.age >= 18
    with support threshold = .05
    with confidence threshold = .7
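
    As a rough illustration of what the two thresholds in the query above mean, the sketch below computes support and confidence for one candidate rule (Smoker implies Heart_Disease) over a few made-up patient rows. The rows, the values, and the helper name `support_confidence` are assumptions for illustration only; this is not DMQL's mining algorithm.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def support_confidence(
    rows: List[Row],
    body: Callable[[Row], bool],
    consequent: Callable[[Row], bool],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule body => consequent.

    support    = fraction of all rows matching both body and consequent
    confidence = fraction of body-matching rows that also match the consequent
    """
    n_body = sum(1 for r in rows if body(r))
    n_both = sum(1 for r in rows if body(r) and consequent(r))
    support = n_both / len(rows) if rows else 0.0
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence


# Hypothetical rows, standing in for the join of the two patient tables.
rows = [
    {"Age": 45, "Smoker": True, "Heart_Disease": True},
    {"Age": 52, "Smoker": True, "Heart_Disease": True},
    {"Age": 30, "Smoker": True, "Heart_Disease": False},
    {"Age": 61, "Smoker": False, "Heart_Disease": False},
]

support, confidence = support_confidence(
    rows,
    body=lambda r: bool(r["Smoker"]),
    consequent=lambda r: bool(r["Heart_Disease"]),
)

# Apply the thresholds from the query: support .05, confidence .7.
keep = support >= 0.05 and confidence >= 0.7
```

    Here the rule clears the support threshold but not the confidence threshold, so it would be dropped.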

    DMQL

    DMQL provides a display in command to view resulting rules, but no advanced way to query them

    Suggests that a GUI interface might aid in the presentation of these results in different forms (charts, graphs, etc.)

    MSQL

    Focus on Association Rules

    Seeks to provide a language both to selectively generate rules, and separately to query the rule base

    Expressive rule generation language, and techniques for optimizing some commands

    MSQL

    Get-Rules and Select-Rules Queries

    Get-Rules operator generates rules over elements of argument class C, which satisfy conditions described in the where clause

    [Project Body, Consequent, confidence, support]
    GetRules(C) [as R1]
    [into ]
    [where ]
    [sql-group-by clause]
    [using-clause]

    MSQL

    The where clause may contain a number of conditions, including:

    Restrictions on the attributes in the body or consequent

    rule.body HAS {(Job = Doctor)}
    rule1.consequent IN rule2.body
    rule.consequent IS {Age = *}

    Pruning conditions (restrict by support, confidence, or size)

    Stratified or correlated subqueries

    in, has, and is are rule subset, superset, and equality respectively
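
    The subset, superset, and equality reading of in, has, and is can be sketched with ordinary set operations. In this sketch a rule body or consequent is modelled as a frozenset of (attribute, value) descriptors; that representation and the function names are assumptions for illustration, and wildcards such as (Age = *) are not handled.

```python
from typing import FrozenSet, Tuple

Descriptor = Tuple[str, str]          # (attribute, value), e.g. ("Job", "Doctor")
DescriptorSet = FrozenSet[Descriptor]


def has(part: DescriptorSet, descriptors: DescriptorSet) -> bool:
    """part HAS descriptors: part is a superset of descriptors."""
    return part >= descriptors


def is_in(part: DescriptorSet, descriptors: DescriptorSet) -> bool:
    """part IN descriptors: part is a subset of descriptors."""
    return part <= descriptors


def is_(part: DescriptorSet, descriptors: DescriptorSet) -> bool:
    """part IS descriptors: the two sets are equal."""
    return part == descriptors


# Hypothetical rule parts.
body = frozenset({("Job", "Doctor"), ("Age", "40")})
doctor_only = frozenset({("Job", "Doctor")})

body_has_doctor = has(body, doctor_only)      # superset test
doctor_in_body = is_in(doctor_only, body)     # subset test
body_is_doctor = is_(body, doctor_only)       # equality test
```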

    MSQL

    GetRules(Patients) R1
    where Body has {Age = *}
    and Support > .05 and Confidence > .7
    and not exists ( GetRules(Patients) R2
        where Support > .05 and Confidence > .7
        and R2.Body HAS R1.Body)

    Retrieve all rules with descriptors of the form Age = x in the body, except when another rule that meets the same support and confidence thresholds has a body containing a superset of those descriptors

    MSQL

    Correlated:

    GetRules(C) R1
    where
    and not exists ( GetRules(C) R2
        where
        and R2.Body HAS R1.Body)

    Stratified:

    GetRules(C) R1
    where
    and consequent is {(X=*)}
    and consequent in (SelectRules(R2)
        where consequent is {(X=*)})

    MSQL

    Nested Get-Rules Queries and their optimization

    Stratified (non-correlated) queries are evaluated bottom-up. The subquery is evaluated first, and replaced with its results in the outer query.

    Correlated queries are evaluated either top-down or bottom-up (like loop-unfolding), and there are rules for choosing between the two options

    MSQL

    GetRules(Patients) R1
    where Body has {Age = *}
    and Support > .05 and Confidence > .7
    and not exists ( GetRules(Patients) R2
        where Support > .05 and Confidence > .7
        and R2.Body HAS R1.Body)

    MSQL

    GetRules(Patients) R1
    where Body has {Age = *}
    and Support > .05 and Confidence > .7

    Top-Down Evaluation

    For each rule produced by the outer, evaluate the inner

    not exists ( GetRules(Patients) R2
        where Support > .05 and Confidence > .7
        and R2.Body HAS R1.Body)

    MSQL

    not exists ( GetRules(Patients) R2
        where Support > .05 and Confidence > .7
        and R2.Body HAS R1.Body)

    Bottom-Up Evaluation

    For each rule produced by the inner, evaluate the outer

    GetRules(Patients) R1
    where Body has {Age = *}
    and Support > .05 and Confidence > .7
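
    The two evaluation orders for this correlated query can be sketched over an in-memory list of rules. Everything here is a simplified assumption for illustration: the `Rule` class, the sample rules, and in particular the choice to read HAS as a proper superset so that a rule does not eliminate itself. It is not how MSQL is implemented.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Descriptor = Tuple[str, str]


@dataclass(frozen=True)
class Rule:
    name: str
    body: FrozenSet[Descriptor]
    support: float
    confidence: float


def outer_ok(r: Rule) -> bool:
    # Body has {Age = *} and Support > .05 and Confidence > .7
    return any(a == "Age" for a, _ in r.body) and r.support > 0.05 and r.confidence > 0.7


def inner_ok(r: Rule) -> bool:
    # Support > .05 and Confidence > .7
    return r.support > 0.05 and r.confidence > 0.7


def covers(r2: Rule, r1: Rule) -> bool:
    # R2.Body HAS R1.Body, read here as a proper superset (assumption).
    return r2.body > r1.body


def top_down(rules: List[Rule]) -> List[Rule]:
    """For each rule produced by the outer query, evaluate the inner query."""
    return [
        r1 for r1 in rules
        if outer_ok(r1) and not any(inner_ok(r2) and covers(r2, r1) for r2 in rules)
    ]


def bottom_up(rules: List[Rule]) -> List[Rule]:
    """For each rule produced by the inner query, strike the outer rules it covers."""
    candidates = [r for r in rules if outer_ok(r)]
    struck = set()
    for r2 in (r for r in rules if inner_ok(r)):
        struck.update(r1 for r1 in candidates if covers(r2, r1))
    return [r for r in candidates if r not in struck]


# Hypothetical rule base.
rules = [
    Rule("a", frozenset({("Age", "40")}), 0.20, 0.80),
    Rule("b", frozenset({("Age", "40"), ("Smoker", "yes")}), 0.10, 0.90),
    Rule("c", frozenset({("Age", "60")}), 0.10, 0.75),
    Rule("d", frozenset({("Smoker", "yes")}), 0.30, 0.80),
    Rule("e", frozenset({("Age", "60"), ("Job", "Doctor")}), 0.01, 0.90),
]

top_down_names = [r.name for r in top_down(rules)]
bottom_up_names = [r.name for r in bottom_up(rules)]
```

    Both orders return the same rules; they differ only in which side drives the loop, which is why the choice between them is a matter of cost.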

    MSQL

    Choosing between the two

    In general, evaluate the expression with more restrictive conditions first

    Heuristic rules

    Evaluate the query with higher support threshold first

    Next consider confidence threshold

    A (length = x) expression is in general more restrictive than (length > x), which is more restrictive than (length < x)

    Body IS (constant expression) is more restrictive than Body HAS, which is more restrictive than Body IN

    Next consider Consequent IN expressions

    Descriptors of the form (A = a) are more restrictive than wildcards such as (A = *)

    Meant to prevent unconstrained queries from being evaluated first
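
    The first two heuristics (support threshold, then confidence threshold) amount to a lexicographic comparison. The sketch below covers only those two; the class name, the field names, and the tie-breaking in favour of the first argument are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubqueryConditions:
    name: str
    min_support: float
    min_confidence: float


def evaluate_first(a: SubqueryConditions, b: SubqueryConditions) -> SubqueryConditions:
    """Pick the more restrictive subquery: higher support threshold wins,
    and the confidence threshold breaks ties. A full tie goes to `a`."""
    key_a = (a.min_support, a.min_confidence)
    key_b = (b.min_support, b.min_confidence)
    return a if key_a >= key_b else b


# Hypothetical thresholds for an outer and an inner Get-Rules query.
outer = SubqueryConditions("outer", min_support=0.05, min_confidence=0.7)
inner = SubqueryConditions("inner", min_support=0.10, min_confidence=0.5)

first = evaluate_first(outer, inner)
```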

    OLE DB for DM

    An extension to the OLE DB interface for Microsoft SQL Server

    Seeks to support the following ideas:

    Define a model by specifying the set of attributes to be predicted, the attributes used for the prediction, and the algorithm

    Populate the model using the training data

    Predict attributes for new data using the populated model

    Browse the mining model (not fully addressed because it varies a lot by model type)

    None of the others seemed to support this

    OLE DB for DM

    Defining a Mining Model

    Identify the set of data attributes to be predicted, the set of attributes to be used for prediction, and the algorithm to be used for building the model

    Populating the Model

    Pull the information into a single rowset using views, and train the model using the data and algorithm specified

    Supports complex objects, so rowset may be hierarchical (see paper for more complex examples)

    OLE DB for DM

    Using the mining model to predict

    Defines a new operator, prediction join.

    A model may be used to make predictions on datasets by taking the prediction join of the mining model and the data set.

    OLE DB for DM

    CREATE MINING MODEL [Heart_Health Prediction]
    (
        [ID] Int Key,
        [Age] Int,
        [Smoker] Int,
        [Salary] Double discretized,
        [HeartAttack] Int PREDICT  %Prediction column
    )
    USING [Decision_Trees_101]

    Identifies the source columns for the training data, the column to be predicted, and the data mining algorithm.

    OLE DB for DM

    INSERT INTO [Heart_Health Prediction]
    ([ID], [Age], [Smoker], [Salary])
    SELECT [ID], [Age], [Smoker], [Salary]
    FROM Patient_Medical M, Patient_Financial F
    WHERE M.ID = F.ID

    The INSERT represents using a tuple for training the model (not actually inserting it into the rowset).

    OLE DB for DM

    SELECT t.[ID],
        [Heart_Health Prediction].[HeartAttack]
    FROM [Heart_Health Prediction]
    PREDICTION JOIN (
        SELECT [ID], [Age], [Smoker], [Salary]
        FROM Patient_Medical M, Patient_Financial F
        WHERE M.ID = F.ID) as t
    ON [Heart_Health Prediction].Age = t.Age
    AND [Heart_Health Prediction].Smoker = t.Smoker
    AND [Heart_Health Prediction].Salary = t.Salary

    Prediction join connects the model and an actual data table to make predictions
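
    The define, populate, and predict steps can be mimicked in a few lines of ordinary code. The sketch below is only an analogy under stated assumptions: `TinyModel` is a hypothetical majority-vote lookup table rather than a decision tree, `prediction_join` is a made-up helper, and none of this is the OLE DB for DM API.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Sequence


class TinyModel:
    """Hypothetical stand-in for a mining model: input columns plus one predicted column."""

    def __init__(self, inputs: Sequence[str], target: str) -> None:
        self.inputs = list(inputs)
        self.target = target
        self._counts: Dict[tuple, Counter] = defaultdict(Counter)

    def insert(self, rows: List[dict]) -> None:
        """Analogue of INSERT INTO: rows train the model, they are not stored."""
        for row in rows:
            key = tuple(row[c] for c in self.inputs)
            self._counts[key][row[self.target]] += 1

    def predict(self, case: dict) -> Optional[object]:
        """Most frequent target value seen for this input combination, or None if unseen."""
        counts = self._counts.get(tuple(case[c] for c in self.inputs))
        return counts.most_common(1)[0][0] if counts else None


def prediction_join(model: TinyModel, rows: List[dict], on: Dict[str, str]) -> List[dict]:
    """Pair each input row with the model's prediction.

    `on` maps a model column name to the matching column in the input rows,
    playing the role of the ON clause.
    """
    result = []
    for row in rows:
        case = {model_col: row[src_col] for model_col, src_col in on.items()}
        result.append({"ID": row["ID"], model.target: model.predict(case)})
    return result


# Hypothetical training and scoring data.
model = TinyModel(inputs=["Smoker"], target="HeartAttack")
model.insert([
    {"Smoker": 1, "HeartAttack": 1},
    {"Smoker": 1, "HeartAttack": 1},
    {"Smoker": 1, "HeartAttack": 0},
    {"Smoker": 0, "HeartAttack": 0},
])
new_patients = [{"ID": 10, "Smoker": 1}, {"ID": 11, "Smoker": 0}, {"ID": 12, "Smoker": 2}]
predictions = prediction_join(model, new_patients, on={"Smoker": "Smoker"})
```

    The patient with an unseen Smoker value gets no prediction here; a real algorithm would generalize rather than look up exact matches.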

    Key Ideas

    Important to have an API for creating and manipulating data mining models

    The data is already in the DBMS, so it makes sense to do the data mining where the data is

    Applications already use SQL, so a SQL extension seems logical

    Key Ideas

    Need a method for defining data mining models, including algorithm specification, specification of various parameters, and training set specification (DMQL, MSQL, ODBDM)

    Need a method of querying the models (MSQL)

    Need a way of using the data mining model to interact with other data in the database, for purposes such as prediction (ODBDM)

    Discussion Topic:

    What Functionality would an Ideal Solution Support?
