DM Ext QP Solution 2015-16



1. Distinguishing features of OLTP and OLAP:

Users and system orientation:
OLTP: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals.
OLAP: An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.

Data contents:
OLTP: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making.
OLAP: An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use in informed decision making.

Database design:
OLTP: An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design.
OLAP: An OLAP system typically adopts either a star or snowflake model and a subject-oriented database design.

View:
OLTP: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations.
OLAP: An OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple storage media.

Access patterns:
OLTP: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms.
OLAP: Accesses to OLAP systems are mostly read-only operations (because most data warehouses store historical rather than up-to-date information), although many could be complex queries.
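As an illustration of the access-pattern contrast (not part of the original answer; the table name, columns, and values below are assumptions), a minimal Python/SQLite sketch runs a short OLTP-style update transaction and then an OLAP-style read-only aggregation over the same data:

    import sqlite3

    # Illustrative schema and data (assumed for this sketch).
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                    [("East", "Q1", 100.0), ("East", "Q2", 150.0),
                     ("West", "Q1", 120.0), ("West", "Q2", 90.0)])

    # OLTP-style access: a short, atomic transaction updating one record
    # (commits on success, rolls back on error -> concurrency/recovery concern).
    with con:
        con.execute("UPDATE sales SET amount = amount + 10 "
                    "WHERE region = 'East' AND quarter = 'Q1'")

    # OLAP-style access: a read-only query summarizing the stored history.
    for row in con.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
        print(row)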

2. Data mining refers to extracting or "mining" knowledge from large amounts of data. Data mining is an essential step in the knowledge discovery process. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base. Data mining is only one step, but an essential one, because it uncovers hidden patterns for evaluation.


3. Normalization, where the attribute data are scaled so as to fall within a small specified range. In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by computing

    v' = (v − Ā) / σ_A

where Ā and σ_A are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
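The worked numbers in the original answer are not legible in this transcript, so the following minimal sketch uses assumed example values (mean 54,000, standard deviation 16,000, value 73,600) purely for illustration:

    def z_score_normalize(v, mean_a, std_a):
        """Return the z-score normalized value v' = (v - mean_A) / std_A."""
        return (v - mean_a) / std_a

    # Assumed example values (not taken from the original worked answer):
    # mean income = 54,000, standard deviation = 16,000, value v = 73,600.
    print(z_score_normalize(73_600, 54_000, 16_000))  # -> 1.225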


4.

Agglomerative (bottom-up approach):
Starts with each object forming a separate group. It successively merges the objects that are close to one another, until all of the groups are merged into one, or until a termination condition holds.

Divisive (top-down approach):
Starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object forms its own cluster or until a termination condition holds.
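For illustration only (not part of the original answer), a minimal SciPy sketch of the agglomerative (bottom-up) case; the sample points and the single-linkage choice are assumptions:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Assumed toy data: five 2-D points.
    X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                  [5.0, 5.0], [5.2, 4.9]])

    # Agglomerative clustering: each point starts as its own cluster and the
    # closest pairs are merged successively (single linkage shown here).
    Z = linkage(X, method="single")

    # Cut the merge tree to obtain, e.g., two flat clusters.
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(labels)  # e.g. [1 1 1 2 2]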

5.

True positives (TP): These refer to the positive tuples that were correctly labeled by the classifier. Let TP be the number of true positives.
True negatives (TN): These are the negative tuples that were correctly labeled by the classifier. Let TN be the number of true negatives.
False positives (FP): These are the negative tuples that were incorrectly labeled as positive. Let FP be the number of false positives.
False negatives (FN): These are the positive tuples that were mislabeled as negative. Let FN be the number of false negatives.

CONFUSION MATRIX

    Actual \ Predicted    Yes    No     Total
    Yes                   TP     FN     P
    No                    FP     TN     N
    Total                 P'     N'     P + N

Precision = TP / P'
Recall (or sensitivity, or true positive (recognition) rate) is the proportion of positive tuples that are correctly identified: Recall = TP / P.
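A minimal Python sketch of these formulas; the confusion-matrix counts below are assumed example numbers, not values from the question:

    def precision_recall(tp, fp, fn):
        """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return precision, recall

    # Assumed example counts: TP = 90, FP = 10, FN = 30 (TN not needed here).
    p, r = precision_recall(tp=90, fp=10, fn=30)
    print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.90, recall=0.75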

6. MinPts, which specifies the density threshold of dense regions.


It is the minimum number of points required in an Eps-neighbourhood of a point for that point to be considered a core point.
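For context, a minimal scikit-learn sketch showing where MinPts (min_samples) and the Eps radius are supplied; the data points and parameter values are assumptions:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Assumed toy data: two dense groups plus one outlier.
    X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
                  [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],
                  [20.0, 20.0]])

    # eps         -> radius of the Eps-neighbourhood
    # min_samples -> MinPts, the density threshold of dense regions
    labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
    print(labels)  # core clusters get 0, 1, ...; noise points get -1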

7. Data analysis and decision support

Market analysis and management:
Target marketing, customer relationship management (CRM), market basket analysis, cross-selling, market segmentation.

Risk analysis and management:
Forecasting, customer retention, improved underwriting, quality control, competitive analysis.

Fraud detection and detection of unusual patterns (outliers).

Other applications:
Text mining (newsgroups, email, documents) and Web mining.
Stream data mining.
Bioinformatics and bio-data analysis.

11.(a) Incomplete, inconsistent, and noisy data are commonplace properties of large real-world databases. Attributes of interest may not always be available, and other data was included just because it was considered to be important at the time of entry. Relevant data may sometimes not be recorded. Furthermore, the recording of modifications to the data may not have been done. There are many possible reasons for noisy data (incorrect attribute values). They could have been human as well as computer errors that occurred during data entry. There could be inconsistencies in the naming conventions adopted. Sometimes duplicate tuples may occur.

Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies in the data. Although mining routines have some form of handling noisy data, they are not always robust. If you would like to include files from many sources in your analysis, then data integration is required. Naming inconsistencies may occur in this context. A large amount of redundant data may confuse or slow down the knowledge discovery process. In addition to data cleaning, steps must be taken to remove redundancies in the data.

Sometimes data would have to be normalized so that it is scaled to a specific range, e.g. [0.0, 1.0], in order for data mining algorithms such as neural networks or clustering to work.


Furthermore, you would require aggregating data, e.g., as sales per region, something that is not part of the original data, so data transformation methods need to be applied to the data.
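A minimal sketch of scaling an attribute to the [0.0, 1.0] range (min-max normalization); the sample values are assumed for illustration:

    def min_max_normalize(values, new_min=0.0, new_max=1.0):
        """Scale values linearly so they fall within [new_min, new_max]."""
        old_min, old_max = min(values), max(values)
        span = old_max - old_min
        return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]

    # Assumed example attribute values.
    print(min_max_normalize([12, 30, 45, 60]))  # [0.0, 0.375, 0.6875, 1.0]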

Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same or almost the same analytical results. There are a number of strategies for data reduction: data compression, numerosity reduction, and generalization.

Data reduction:

Normally the data used for data mining is huge. Complex analysis and data mining on huge amounts of data can take a very long time, making such analysis impractical or infeasible. Data reduction techniques can be applied to obtain a reduced representation of the data set that is smaller in volume yet closely maintains the integrity of the original data. That is, mining on the reduced set should be efficient and yet produce the same or almost the same analytical results.

    O"er"ie o Data +eduction (trategies

    #ata reduction strategies include dimensionality reduction, numerosityreduction, anddata compression.Di!ensionalit reduction is the process of reducing the number ofrandom"ariables or attributes under consideration. #imensionality reductionmethods include a"elet transor!s and $rinci$al co!$onentsanalsis hich transform or pro+ect the original data onto a smaller space.

    Attriute suset selection is a method of dimensionality reduction inhich irrele"ant, eakly rele"ant, or redundant attributes or dimensions aredetected and remo"ed .&u!erosit reduction techniues replace the original data "olume byalternati"e,smaller forms of data representation. These techniques may beparametric or nonparametric.Bor $ara!etric !ethods, a model is used to estimate the data, so thattypically only the data parameters need to be stored, instead of the actualdata. &Outliers may also be stored.) +egression and loglinear !odelsare e!amples.&on$ara!etric !ethodsfor storing reduced representationsof the data include histogra!s7 clustering 7sa!$ling and data cue

    aggregation

    n data compression, transformations are applied so as to obtain a reducedor /compressed0 representation of the original data. f the original data canbe reconstructed from the compressed data ithout any information loss, thedata reduction is called lossless. f, instead, e can reconstruct only anappro!imation of the original data, then the data reduction is called lossy.There are se"eral lossless algorithms for string compressionJ hoe"er, they


There are several lossless algorithms for string compression; however, they typically allow only limited data manipulation. Dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression.

Data Cube Aggregation:
Data cubes store multidimensional aggregated information. Each cell holds an aggregate data value, corresponding to the data point in multidimensional space. Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple levels of abstraction. For example, a hierarchy for branch could allow branches to be grouped into regions, based on their address. Data cubes provide fast access to precomputed, summarized data, thereby benefiting online analytical processing as well as data mining.

The cube created at the lowest level of abstraction is referred to as the base cuboid. Data cubes created at various levels of abstraction are called cuboids, so that a data cube may refer to a lattice of cuboids. Each higher level of abstraction further reduces the resulting data size. The lowest-level cuboid should be usable or used for data analysis.
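A minimal pandas sketch of aggregating detail data up to a coarser cuboid (total sales per region); the column names and rows are assumptions for illustration:

    import pandas as pd

    # Assumed detail-level sales data (base-cuboid granularity: branch, quarter).
    sales = pd.DataFrame({
        "region":  ["East", "East", "West", "West"],
        "branch":  ["B1",   "B2",   "B3",   "B4"],
        "quarter": ["Q1",   "Q2",   "Q1",   "Q2"],
        "amount":  [100.0,  150.0,  120.0,  90.0],
    })

    # Aggregate to a higher-level cuboid: total sales per region.
    per_region = sales.groupby("region", as_index=False)["amount"].sum()
    print(per_region)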

Dimensionality Reduction:
Data sets may contain hundreds of attributes for analysis, most of which may be irrelevant to the data mining task, or redundant. Although it may be possible for a domain expert to pick out certain attributes, this can be a difficult and time-consuming task. Leaving out relevant attributes or keeping irrelevant attributes may cause confusion for the mining algorithm employed. The redundant data may slow down the mining process.

Dimensionality reduction reduces the data set size by removing such attributes (dimensions) from it. Typically, methods for attribute subset selection are applied. The goal of attribute subset selection is to find the minimum number of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all the attributes. Mining on a reduced set has an additional benefit: it reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.

Basic heuristic methods of attribute selection include the following techniques:

1. Stepwise forward selection: The procedure starts with an empty set of attributes. The best of the original attributes is determined and added to the set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
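A minimal sketch of this greedy procedure; the scoring function (here, a hypothetical score(subset) that returns higher values for better attribute subsets) is an assumption, not part of the original answer:

    def forward_selection(attributes, score, k):
        """Greedily grow an attribute subset: at each step add the single
        remaining attribute that most improves the score, until k are chosen."""
        selected = []
        remaining = list(attributes)
        while remaining and len(selected) < k:
            best = max(remaining, key=lambda a: score(selected + [a]))
            selected.append(best)
            remaining.remove(best)
        return selected

    # Hypothetical usage: score could be a cross-validated accuracy estimate.
    # chosen = forward_selection(["age", "income", "zip"], score=my_score, k=2)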


… personal computer. There is a 7% probability (confidence, or certainty) that a student in this group owns a personal computer.

Classification differs from prediction in that the former constructs a set of models (or functions) that describe and distinguish data classes or concepts, whereas the latter builds a model to predict some missing or unavailable, and often numerical, data values. Their similarity is that they are both tools for prediction: classification is used for predicting the class label of data objects, and prediction is typically used for predicting missing numerical data values.

Clustering analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. Each cluster that is formed can be viewed as a class of objects. Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.

Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association, classification, or clustering of time-related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.

    12.

(a) Attribute-Oriented Induction: Proposed in 1989 (KDD '89 workshop). It is not confined to categorical data nor to particular measures. Collect the task-relevant data (initial relation) using a relational database query. Perform generalization by attribute removal or attribute generalization. Apply aggregation by merging identical, generalized tuples and accumulating their respective counts. Present the results interactively to users.
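A minimal pandas sketch of the generalize-and-merge step (attribute removal and attribute generalization followed by merging identical generalized tuples and accumulating counts); the concept hierarchy and data are assumed for illustration:

    import pandas as pd

    # Assumed initial (task-relevant) relation.
    rel = pd.DataFrame({
        "name":  ["Ann", "Bob", "Carl", "Dave"],   # many distinct values -> remove
        "major": ["CS",  "CS",  "Math", "Physics"],
        "age":   [21, 23, 22, 24],
    })

    # Attribute generalization: climb 'age' up an assumed concept hierarchy.
    rel["age"] = rel["age"].map(lambda a: "20-24" if 20 <= a <= 24 else "25+")

    # Attribute removal ('name') + merge identical generalized tuples with counts.
    generalized = (rel.drop(columns=["name"])
                      .groupby(["major", "age"])
                      .size()
                      .reset_index(name="count"))
    print(generalized)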

Basic Principles of Attribute-Oriented Induction
Data focusing: task-relevant data, including dimensions; the result is the initial relation.
Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or


12.(b) Three-tier data warehouse architecture

1. The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (e.g., customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse. The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connection) and OLE DB (Object Linking and Embedding Database) by Microsoft, and JDBC (Java Database Connection). This tier also contains a metadata repository, which stores information about the data warehouse and its contents.

2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations); or (2) a multidimensional


OLAP (MOLAP) model (i.e., a special-purpose server that directly implements multidimensional data and operations).

3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

13.(a) Given two data objects are (


Supremum (or Chebyshev) distance = abs(10 −
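The numeric values of the two objects are not legible in this transcript, so the sketch below uses two assumed vectors purely to show how the Manhattan, Euclidean, and supremum (Chebyshev) distances are computed:

    def minkowski(x, y, h):
        """Minkowski distance of order h between two equal-length vectors."""
        return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

    def supremum(x, y):
        """Supremum (Chebyshev, L-infinity) distance: max absolute difference."""
        return max(abs(a - b) for a, b in zip(x, y))

    # Assumed example objects (not the ones from the original question).
    x, y = (22, 1, 42, 10), (20, 0, 36, 8)
    print(minkowski(x, y, 1))   # Manhattan distance  -> 11.0
    print(minkowski(x, y, 2))   # Euclidean distance  -> ~6.708
    print(supremum(x, y))       # Chebyshev distance  -> 6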


The resulting data cube details the total sales per month rather than summarizing them by quarter.

Because a drill-down adds more detail to the given data, it can also be performed by adding new dimensions to a cube. For example, a drill-down on the central cube of Figure 4.12 can occur by introducing an additional dimension, such as customer group.

Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. Figure 4.12 shows a slice operation where the sales data are selected from the central cube for the dimension time using the criterion time = "Q1". The dice operation defines a subcube by performing a selection on two or more dimensions. Figure 4.12 shows a dice operation on the central cube based on the following selection criteria that involve three dimensions: (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q


Figure 4.12: Examples of typical OLAP operations on multidimensional data.
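A minimal pandas sketch of these operations on a small assumed table (columns location, time, item, sales); it is illustrative only and is not the cube of Figure 4.12:

    import pandas as pd

    # Assumed cube data at (location, quarter, item) granularity.
    cube = pd.DataFrame({
        "location": ["Toronto", "Toronto", "Vancouver", "Chicago"],
        "time":     ["Q1",      "Q2",      "Q1",        "Q1"],
        "item":     ["phone",   "phone",   "laptop",    "phone"],
        "sales":    [100,        120,       80,          95],
    })

    # Slice: select on ONE dimension, e.g. time = "Q1".
    slice_q1 = cube[cube["time"] == "Q1"]

    # Dice: select on TWO OR MORE dimensions.
    dice = cube[cube["location"].isin(["Toronto", "Vancouver"])
                & cube["time"].isin(["Q1", "Q2"])]

    # Roll-up (inverse of drill-down): aggregate to coarser granularity.
    rollup_by_location = cube.groupby("location", as_index=False)["sales"].sum()
    print(slice_q1, dice, rollup_by_location, sep="\n\n")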

14.(a)

The Apriori Algorithm: The first pass of the algorithm simply counts item occurrences to determine the large 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the large itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the Apriori candidate generation function (apriori-gen) described below. Next, the database is scanned and the support of the candidates in Ck is counted.

The Apriori algorithm is:


L1 = {large 1-itemsets};
for ( k = 2; Lk-1 ≠ ∅; k++ ) do begin
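The remainder of the pseudocode is lost in this transcript. As a stand-in (not the original answer's listing), here is a compact, self-contained Python sketch of the same candidate-generate-and-count loop; the tiny transaction database and the min_sup value are assumptions:

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Return all frequent itemsets (as frozensets) with support >= min_sup."""
        items = {frozenset([i]) for t in transactions for i in t}
        # L1: large 1-itemsets.
        Lk = {c for c in items if sum(c <= t for t in transactions) >= min_sup}
        frequent = set(Lk)
        k = 2
        while Lk:
            # apriori-gen: join Lk-1 with itself, keep only size-k candidates
            # whose (k-1)-subsets are all frequent (the Apriori property).
            candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
            candidates = {c for c in candidates
                          if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
            # Scan the database and count the support of each candidate.
            Lk = {c for c in candidates
                  if sum(c <= t for t in transactions) >= min_sup}
            frequent |= Lk
            k += 1
        return frequent

    # Assumed example database and minimum support count.
    db = [{"bread", "milk"}, {"bread", "butter"},
          {"bread", "milk", "butter"}, {"milk"}]
    print(apriori(db, min_sup=2))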


(Scan D for the count of each candidate; itemset / sup. count table.)

Median, max, min, quantiles, outliers, variance, etc. will help us in finding noise. From the data mining point of view it is important to examine how these measures are computed efficiently; this introduces the notions of distributive measure, algebraic measure, and holistic measure.

Measuring the Central Tendency

Mean (algebraic measure):

    x̄ = (1/n) Σ xi        (Note: n is the sample size.)

A distributive measure can be computed by partitioning the data into smaller subsets (e.g., sum and count). An algebraic measure can be computed by applying an algebraic function to one or more distributive measures (e.g., mean = sum/count). Sometimes each value xi is weighted, giving the weighted arithmetic mean. A problem is that the mean measure is sensitive to extreme (e.g., outlier) values.

Median (holistic measure):

The middle value if there is an odd number of values, or the average of the middle two values otherwise. A holistic measure must be computed on the entire data set. Holistic measures are much more expensive to compute than distributive measures, and can be estimated by interpolation (for grouped data):


    median ≈ L1 + ((N/2 − (Σ freq)l) / freq_median) × width

where the median interval is the interval containing the median frequency; L1 is the lower boundary of the median interval; N is the number of values in the entire data set; (Σ freq)l is the sum of the frequencies of all intervals below the median interval; and freq_median and width are the frequency and the width of the median interval.
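A minimal sketch of this interpolation; the interval boundaries and frequencies are assumed example values:

    def grouped_median(l1, n, cum_freq_below, freq_median, width):
        """Approximate median of grouped data:
        median ~= L1 + ((N/2 - (sum freq)_l) / freq_median) * width."""
        return l1 + ((n / 2.0 - cum_freq_below) / freq_median) * width

    # Assumed example: median interval [20, 30), N = 50 values in total,
    # 18 values fall below the interval and 12 fall inside it.
    print(grouped_median(l1=20, n=50, cum_freq_below=18, freq_median=12, width=10))
    # -> 25.83...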

Mode

The value that occurs most frequently in the data. It is possible that several different values have the greatest frequency: unimodal, bimodal, trimodal, multimodal. If each data value occurs only once then there is no mode.

Midrange

Can also be used to assess the central tendency. It is the average of the smallest and the largest values of the set. It is an algebraic measure that is easy to compute.

The degree to which data tend to spread is called the dispersion, or variance, of the data. The most common measures of data dispersion are the range, the five-number summary (based on quartiles), the interquartile range, and the standard deviation.

Range: The distance between the largest and the smallest values.

kth percentile: The value xi having the property that k percent of the data lies at or below xi. The median is the 50th percentile. The most popular percentiles other than the median are the quartiles Q1 (the 25th percentile) and Q3 (the 75th percentile).
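A minimal NumPy sketch of these quantities; the sample data are assumed:

    import numpy as np

    # Assumed sample data.
    x = np.array([4, 8, 15, 16, 23, 42, 5, 9, 11, 30])

    q1, median, q3 = np.percentile(x, [25, 50, 75])
    five_number_summary = (x.min(), q1, median, q3, x.max())
    iqr = q3 - q1                      # interquartile range
    value_range = x.max() - x.min()    # range

    print(five_number_summary, iqr, value_range)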


Standard deviation measures spread about the mean and should be used only when the mean is chosen as the measure of the center. The standard deviation is 0 only when there is no spread, that is, when all observations have the same value; otherwise the standard deviation is greater than 0. Variance and standard deviation are algebraic measures; thus, their computation is scalable in large databases.

Graphic Displays

Boxplot: graphic display of the five-number summary.
Histogram: the x-axis shows values, the y-axis represents frequencies.
Quantile plot: each value xi is paired with fi, indicating that approximately 100·fi% of the data are at or below xi.
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
Scatter plot: each pair of values is a pair of coordinates and is plotted as a point in the plane.
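For illustration only, a minimal matplotlib sketch of two of these displays (boxplot and histogram); the data are assumed:

    import numpy as np
    import matplotlib.pyplot as plt

    # Assumed sample data.
    x = np.random.default_rng(0).normal(loc=50, scale=10, size=200)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.boxplot(x)          # boxplot: graphic display of the five-number summary
    ax1.set_title("Boxplot")
    ax2.hist(x, bins=20)    # histogram: values on the x-axis, frequencies on the y-axis
    ax2.set_title("Histogram")
    plt.show()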

Measure: Formula
Mean: x̄ = (1/n) Σ xi
Weighted mean: x̄ = Σ wi·xi / Σ wi
Median: see the interpolation formula above (for grouped data)


Variance: σ² = (1/n) Σ (xi − x̄)²
Standard deviation: σ is the square root of the variance σ².
Empirical formula:

(c) Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers. Bayesian classifiers can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.

Bayes' Theorem

Bayes' theorem is named after Thomas Bayes. There are two types of probabilities:

Posterior probability, P(H|X)
Prior probability, P(H)

where X is a data tuple and H is some hypothesis.

According to Bayes' theorem,

    P(H|X) = P(X|H) P(H) / P(X)
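A minimal sketch of applying the formula; the probability values are assumed example numbers:

    def posterior(p_x_given_h, p_h, p_x):
        """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
        return p_x_given_h * p_h / p_x

    # Assumed example: P(X|H) = 0.8, P(H) = 0.1, P(X) = 0.2.
    print(posterior(0.8, 0.1, 0.2))  # -> 0.4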

Bayesian Belief Networks

Bayesian belief networks specify joint conditional probability distributions. They are also known as belief networks, Bayesian networks, or probabilistic networks.

A belief network allows class conditional independencies to be defined between subsets of variables.


It provides a graphical model of causal relationships on which learning can be performed.

We can use a trained Bayesian network for classification.

There are two components that define a Bayesian belief network:

a directed acyclic graph
a set of conditional probability tables

Directed Acyclic Graph

Each node in a directed acyclic graph represents a random variable. These variables may be discrete or continuous valued. These variables may correspond to the actual attributes given in the data.

Directed Acyclic Graph Representation

The following diagram shows a directed acyclic graph for six Boolean variables.

The arcs in the diagram allow representation of causal knowledge. For example, lung cancer is influenced by a person's family history of lung cancer, as well as whether or not the person is a smoker. It is worth noting that the variable PositiveXRay is independent of whether the patient has a family history of lung cancer or whether the patient is a smoker, given that we know the patient has lung cancer.

  • 7/25/2019 Dm Ext Qp Solution 2015-16

    26/26

Conditional Probability Table

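The table's contents are not reproduced in the transcript. As a hedged illustration, a conditional probability table for one node (LungCancer, conditioned on its assumed parents FamilyHistory and Smoker) might be represented as follows; every probability value below is made up for illustration:

    # Hypothetical CPT for the node LungCancer given its parents
    # (FamilyHistory, Smoker); all values are assumed illustrations.
    cpt_lung_cancer = {
        # (family_history, smoker): P(LungCancer=True | parents)
        (True,  True):  0.80,
        (True,  False): 0.50,
        (False, True):  0.70,
        (False, False): 0.10,
    }

    def p_lung_cancer(family_history, smoker, has_cancer=True):
        """Look up P(LungCancer = has_cancer | FamilyHistory, Smoker)."""
        p_true = cpt_lung_cancer[(family_history, smoker)]
        return p_true if has_cancer else 1.0 - p_true

    print(p_lung_cancer(family_history=True, smoker=False))   # 0.5
    print(p_lung_cancer(False, False, has_cancer=False))      # 0.9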