Download - Dm Ext Qp Solution 2015-16
7/25/2019 Dm Ext Qp Solution 2015-16
1. Distinguishing features of OLTP vs. OLAP:

Users and system orientation:
OLTP - An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals.
OLAP - An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.

Data contents:
OLTP - An OLTP system manages current data that, typically, are too detailed to be easily used for decision making.
OLAP - An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use in informed decision making.

Database design:
OLTP - An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design.
OLAP - An OLAP system typically adopts either a star or snowflake model and a subject-oriented database design.

View:
OLTP - An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations.
OLAP - An OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple storage media.

Access patterns:
OLTP - The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms.
OLAP - Accesses to OLAP systems are mostly read-only operations (because most data warehouses store historical rather than up-to-date information), although many could be complex queries.
2. Data mining refers to extracting or "mining" knowledge from large amounts of data. Data mining is an essential step in the knowledge discovery process. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base. Data mining is only one step in this process, but an essential one, because it uncovers hidden patterns for evaluation.
3. Normalization, where the attribute data are scaled so as to fall within a small specified range. In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean and standard deviation of A. A value, v, of A is normalized to v' by computing

v' = (v − Ā) / σA

where Ā and σA are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
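The transform can be sketched in a few lines of Python; the salary-like values below are invented sample data, not figures from the question.

```python
from statistics import mean, pstdev

def z_score_normalize(values):
    """Normalize values by subtracting the mean and dividing by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the attribute
    return [(v - mu) / sigma for v in values]

# Invented sample values for attribute A:
data = [30000, 40000, 50000, 60000, 70000]
normalized = z_score_normalize(data)
# The normalized values have mean 0 and standard deviation 1.
```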
4.
Agglomerative (bottom-up approach): Starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one, or until a termination condition holds.
Divisive (top-down approach): Starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object forms its own cluster or until a termination condition holds.
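The bottom-up strategy above can be sketched with single-linkage merging on one-dimensional points; the points and the target cluster count are illustrative only.

```python
def agglomerative(points, k):
    """Bottom-up clustering: start with singleton clusters, then repeatedly
    merge the two closest clusters (single linkage) until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

result = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], 3)
```

A divisive algorithm would run the same loop in reverse, starting from one all-inclusive cluster and splitting at each step.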
5.
True positives (TP): These refer to the positive tuples that were correctly labeled by the classifier. Let TP be the number of true positives.
True negatives (TN): These are the negative tuples that were correctly labeled by the classifier. Let TN be the number of true negatives.
False positives (FP): These are the negative tuples that were incorrectly labeled as positive. Let FP be the number of false positives.
False negatives (FN): These are the positive tuples that were mislabeled as negative. Let FN be the number of false negatives.
CONFUSION MATRIX

                 Predicted class
Actual class     yes    no     Total
yes              TP     FN     P
no               FP     TN     N
Total            P'     N'     P + N

Precision = TP / P' (the proportion of tuples labeled positive that actually are positive).
Recall (or sensitivity, or true positive recognition rate) = TP / P, the proportion of positive tuples that are correctly identified.
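The two measures follow directly from the matrix counts; a minimal sketch, with made-up counts:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP) = TP/P'; Recall = TP/(TP+FN) = TP/P."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts, not from the question:
p, r = precision_recall(tp=90, fp=10, fn=30)
```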
MinPts, which specifies the density threshold of dense regions.
MinPts is the minimum number of points required in an Eps-neighbourhood of a point for that point to be considered dense (a core point).
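Under these definitions, a core-point test is just a neighbourhood count; a one-dimensional sketch with illustrative values of Eps and MinPts:

```python
def eps_neighborhood(points, p, eps):
    """All points within distance eps of p (including p itself)."""
    return [q for q in points if abs(q - p) <= eps]

def is_core_point(points, p, eps, min_pts):
    """p is a core point if its eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

# Invented sample points: three close together, one far away.
pts = [1.0, 1.1, 1.2, 5.0]
```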
10. Data analysis and decision support
- Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross-selling, market segmentation
- Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
- Fraud detection and detection of unusual patterns (outliers)
Other applications:
- Text mining (newsgroups, email, documents) and Web mining
- Stream data mining
- Bioinformatics and bio-data analysis
11.(a) Incomplete, inconsistent, and noisy data are commonplace properties of large real-world databases. Attributes of interest may not always be available, and other data was included just because it was considered to be important at the time of entry. Relevant data may sometimes not be recorded. Furthermore, the recording of modifications to the data may not have been done. There are many possible reasons for noisy data (incorrect attribute values): there could have been human as well as computer errors during data entry, there could be inconsistencies in the naming conventions adopted, and sometimes duplicate tuples may occur.

Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies in the data. Although mining routines have some form of handling noisy data, they are not always robust. If you would like to include files from many sources in your analysis, then data integration is required. Naming inconsistencies may occur in this context. A large amount of redundant data may confuse or slow down the knowledge discovery process. In addition to data cleaning, steps must be taken to remove redundancies in the data.

Sometimes data would have to be normalized so that it is scaled to a specific range, e.g. [0.0, 1.0], in order for data mining algorithms such as neural networks or clustering to work. Furthermore, you would require aggregating data,
e.g. as sales per region - something that is not part of the stored data. For this, data transformation methods need to be applied to the data.
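Min-max scaling into a range such as [0.0, 1.0], as mentioned above, can be sketched as:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly scale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

# Invented sample values:
scaled = min_max_normalize([10, 20, 30])
```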
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same or almost the same analytical results. There are a number of strategies for data reduction: data compression, numerosity reduction, and generalization.
Data reduction:
Normally the data used for data mining is huge. Complex analysis and data mining on huge amounts of data can take a very long time, making such analysis impractical or infeasible. Data reduction techniques can be applied to obtain a reduced representation of the data set that is smaller in volume yet closely maintains the integrity of the original data. That is, mining on the reduced set should be efficient and yet produce the same or almost the same analytical results.
Overview of Data Reduction Strategies
Data reduction strategies include dimensionality reduction, numerosity reduction, and data compression.
Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration. Dimensionality reduction methods include wavelet transforms and principal components analysis, which transform or project the original data onto a smaller space.
Attribute subset selection is a method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.
Numerosity reduction techniques replace the original data volume by alternative, smaller forms of data representation. These techniques may be parametric or nonparametric. For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data. (Outliers may also be stored.) Regression and log-linear models are examples. Nonparametric methods for storing reduced representations of the data include histograms, clustering, sampling, and data cube aggregation.
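As one concrete nonparametric numerosity-reduction method named above, an equal-width histogram replaces the raw values with a handful of (bin, count) pairs; a minimal sketch:

```python
def equal_width_histogram(values, n_bins):
    """Summarize raw values as (bin_range, count) pairs -- a much smaller
    representation than the original data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)  # clamp the max value into the last bin
        counts[i] += 1
    return [((lo + i * width, lo + (i + 1) * width), counts[i])
            for i in range(n_bins)]

# Invented sample values:
h = equal_width_histogram([1, 2, 3, 4, 5, 6, 7, 8], n_bins=2)
```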
In data compression, transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy. There are several lossless algorithms for string compression; however, they
typically allow only limited data manipulation. Dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression.
Data Cube Aggregation:
Data cubes store multidimensional aggregated information. Each cell holds an aggregate data value, corresponding to the data point in multidimensional space. Concept hierarchies may exist for each attribute, allowing analysis of data at multiple levels of abstraction. For example, a hierarchy for branch could allow branches to be grouped into regions, based on their address. Data cubes provide fast access to precomputed, summarized data, thereby benefiting online analytical processing as well as data mining.
The cube created at the lowest level of abstraction is referred to as the base cuboid. Data cubes created at various levels of abstraction are called cuboids, so that a data cube may refer to a lattice of cuboids. Each higher level of abstraction further reduces the resulting data size. The lowest-level cuboid should be usable for, or useful to, data analysis.
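Rolling a toy fact table up to a coarser cuboid can be sketched as a group-by-and-sum; the branch names, quarters, and sales figures are invented for illustration:

```python
from collections import defaultdict

# Toy base-cuboid fact table: (branch, quarter, sales).
facts = [
    ("Downtown", "Q1", 100), ("Downtown", "Q2", 150),
    ("Airport",  "Q1",  80), ("Airport",  "Q2", 120),
]

def aggregate(facts, dims):
    """Aggregate the base cuboid up to the cuboid defined by the chosen
    dimension indices, summing the measure (last field) per group."""
    cube = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in dims)
        cube[key] += row[-1]
    return dict(cube)

by_branch = aggregate(facts, dims=(0,))  # roll up over the quarter dimension
```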
Dimensionality Reduction:
Data sets may contain hundreds of attributes for analysis, most of which may be irrelevant to the data mining task, or redundant. Although it may be possible for a domain expert to pick out certain attributes, this can be a difficult and time-consuming task. Leaving out relevant attributes or keeping irrelevant attributes may cause confusion for the mining algorithm employed, and the redundant data may slow down the mining process.
Dimensionality reduction reduces the data set size by removing such attributes (dimensions) from it. Typically, methods for attribute subset selection are applied. The goal of attribute subset selection is to find the minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all the attributes. Mining on a reduced set has an additional benefit: it reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
Basic heuristic methods of attribute selection include the following techniques:
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
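A greedy sketch of stepwise forward selection, assuming a hypothetical scoring function that rates attribute subsets (the attribute names and relevance scores are invented):

```python
def forward_select(attributes, score, k):
    """Stepwise forward selection: start with an empty reduced set and
    repeatedly add the attribute that yields the best-scoring subset."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical relevance scores standing in for a real subset-quality measure:
relevance = {"age": 0.9, "income": 0.7, "zip": 0.1}
chosen = forward_select(relevance,
                        lambda subset: sum(relevance[a] for a in subset),
                        k=2)
```

Backward elimination runs the same idea in reverse, starting from the full attribute set and dropping the worst attribute at each step.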
personal computer. There is an associated probability (confidence, or certainty) that a student in this group owns a personal computer.

Classification differs from prediction in that the former constructs a set of models (or functions) that describe and distinguish data classes or concepts, whereas the latter builds a model to predict some missing or unavailable, and often numerical, data values. Their similarity is that they are both tools for prediction: classification is used for predicting the class label of data objects, and prediction is typically used for predicting missing numerical data values.

Clustering analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. Each cluster that is formed can be viewed as a class of objects. Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.

Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association, classification, or clustering of time-related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
12.
(a) Attribute-Oriented Induction
Proposed in 1989 (KDD '89 workshop). Not confined to categorical data nor particular measures. Collect the task-relevant data (initial relation) using a relational database query. Perform generalization by attribute removal or attribute generalization. Apply aggregation by merging identical, generalized tuples and accumulating their respective counts. Interactive presentation with users.

Basic Principles of Attribute-Oriented Induction
Data focusing: task-relevant data, including dimensions, and the result is the initial relation.
Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
12.(b) Three-tier warehouse architecture
1. The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (e.g., customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse. The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connection) and OLE-DB (Object Linking and Embedding Database) by Microsoft, and JDBC (Java Database Connection). This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations); or (2) a multidimensional
OLAP (MOLAP) model (i.e., a special-purpose server that directly implements multidimensional data and operations).
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
13.(a) Given two data objects (
Supremum (Chebyshev) distance: the maximum absolute difference between the two objects over all attributes, d = max over attributes f of |x_f − y_f|.
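A sketch of the supremum distance; the two 2-attribute objects below are illustrative, not necessarily those given in the question:

```python
def supremum_distance(x, y):
    """Chebyshev (L-infinity) distance: the maximum attribute-wise
    absolute difference between two objects."""
    return max(abs(a - b) for a, b in zip(x, y))

d = supremum_distance((22, 1), (20, 0))  # illustrative objects
```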
The resulting data cube details the total sales per month rather than summarizing them by quarter.
Because a drill-down adds more detail to the given data, it can also be performed by adding new dimensions to a cube. For example, a drill-down on the central cube of Figure 4.12 can occur by introducing an additional dimension, such as customer group.
Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. Figure 4.12 shows a slice operation where the sales data are selected from the central cube for the dimension time using the criterion time = "Q1". The dice operation defines a subcube by performing a selection on two or more dimensions. Figure 4.12 shows a dice operation on the central cube based on the following selection criteria that involve three dimensions: (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home entertainment" or "computer").
Figure 4.12 '!amples of typical OLAP operations on multidimensional data.
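The slice and dice selections described above can be sketched over a toy set of cube cells; the cells and sales figures are invented for illustration:

```python
# Toy cube cells: (time, location, item) -> sales.
cells = {
    ("Q1", "Toronto",   "phone"): 100,
    ("Q1", "Vancouver", "phone"):  80,
    ("Q2", "Toronto",   "phone"):  90,
    ("Q2", "Vancouver", "tv"):     70,
}

def slice_cube(cells, dim, value):
    """Slice: select on a single dimension, yielding a subcube."""
    return {k: v for k, v in cells.items() if k[dim] == value}

def dice_cube(cells, criteria):
    """Dice: select on two or more dimensions; criteria maps a
    dimension index to its set of allowed values."""
    return {k: v for k, v in cells.items()
            if all(k[d] in allowed for d, allowed in criteria.items())}

q1 = slice_cube(cells, dim=0, value="Q1")                 # time = "Q1"
sub = dice_cube(cells, {0: {"Q1", "Q2"}, 1: {"Toronto"}})  # time and location
```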
14.(a)
The Apriori Algorithm: The first pass of the algorithm simply counts item occurrences to determine the large 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the large itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the Apriori candidate generation function (apriori-gen) described below. Next, the database is scanned and the support of the candidates in Ck is counted.
The Apriori algorithm is:
L1 = {large 1-itemsets};
for (k = 2; Lk-1 is non-empty; k++) do begin
    Ck = apriori-gen(Lk-1);          // new candidates
    forall transactions t in D do begin
        Ct = subset(Ck, t);          // candidates contained in t
        forall candidates c in Ct do
            c.count++;
    end
    Lk = {c in Ck | c.count >= minsup}
end
Answer = union over k of Lk;
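A compact Python rendering of the same level-wise procedure; the transactions and minimum support count are invented for illustration:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: build L1 from item counts; each later pass joins
    L(k-1) with itself, prunes candidates having an infrequent (k-1)-subset,
    then counts candidate support by scanning the transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    L = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    result = list(L)
    k = 2
    while L:
        # apriori-gen: join step, then prune step
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in set(L)
                             for s in combinations(c, k - 1))}
        L = [c for c in candidates if support(c) >= min_sup]
        result += L
        k += 1
    return result

freq = apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}],
               min_sup=2)
```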
Median, max, min, quantiles, outliers, variance, etc. will help us in finding noise. From the data mining point of view, it is important to examine how these measures are computed efficiently; this introduces the notions of distributive measure, algebraic measure, and holistic measure.
Measuring the Central Tendency
Mean (algebraic measure): x̄ = (1/n) Σ xi. (Note: n is the sample size.)
A distributive measure can be computed by partitioning the data into smaller subsets (e.g., sum and count). An algebraic measure can be computed by applying an algebraic function to one or more distributive measures (e.g., mean = sum/count). Sometimes each value xi is weighted, giving the weighted arithmetic mean. A problem is that the mean measure is sensitive to extreme (e.g., outlier) values.
Median (holistic measure)
The middle value if there is an odd number of values, or the average of the middle two values otherwise. A holistic measure must be computed on the entire data set. Holistic measures are much more expensive to compute than distributive measures, and can be estimated by interpolation (for grouped data):
median ≈ L1 + ((N/2 − (Σ freq)l) / freqmedian) × width
where the median interval is the interval containing the median frequency, L1 is the lower boundary of the median interval, N is the number of values in the entire data set, (Σ freq)l is the sum of the frequencies of all intervals below the median interval, freqmedian is the frequency of the median interval, and width is the width of the median interval.
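The interpolation formula can be sketched directly; the frequency table below is hypothetical:

```python
def grouped_median(intervals):
    """Estimate the median of grouped data by interpolation:
    median ~ L1 + ((N/2 - cum_freq_below) / freq_median) * width."""
    n = sum(freq for _, _, freq in intervals)
    cum = 0  # frequencies accumulated below the current interval
    for lo, hi, freq in intervals:
        if cum + freq >= n / 2:  # this is the median interval
            return lo + ((n / 2 - cum) / freq) * (hi - lo)
        cum += freq

# Hypothetical frequency table: (lower bound, upper bound, frequency).
m = grouped_median([(0, 10, 2), (10, 20, 6), (20, 30, 2)])
```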
Mode
The value that occurs most frequently in the data. It is possible that several different values share the greatest frequency: unimodal, bimodal, trimodal, multimodal. If each data value occurs only once, then there is no mode.
Midrange
Can also be used to assess the central tendency. It is the average of the smallest and the largest values in the set. It is an algebraic measure that is easy to compute.
The degree to which data tend to spread is called the dispersion, or variance, of the data. The most common measures of data dispersion are the range, the five-number summary (based on quartiles), the interquartile range, and the standard deviation.
Range: The distance between the largest and the smallest values.
kth percentile: The value xi having the property that k percent of the data lies at or below xi. The median is the 50th percentile. The most popular percentiles other than the median are the quartiles Q1 (the 25th percentile) and Q3 (the 75th percentile); the interquartile range is IQR = Q3 − Q1.
The standard deviation measures spread about the mean and should be used only when the mean is chosen as the measure of the center. The standard deviation is 0 only when there is no spread, that is, when all observations have the same value; otherwise the standard deviation is greater than 0. Variance and standard deviation are algebraic measures; thus, their computation is scalable in large databases.
Graphic Displays
Boxplot: graphic display of the five-number summary.
Histogram: the x-axis shows the values, the y-axis represents the frequencies.
Quantile plot: each value xi is paired with fi, indicating that approximately 100·fi % of the data are ≤ xi.
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
Scatter plot: each pair of values is a pair of coordinates, plotted as points in the plane.
Measure and formula:
Mean: x̄ = (1/n) Σ xi
Weighted mean: x̄ = (Σ wi xi) / (Σ wi)
Median: middle value (or, for grouped data, the interpolation estimate)
Variance: σ² = (1/n) Σ (xi − x̄)²
Standard deviation: σ is the square root of the variance.
Empirical formula: mean − mode ≈ 3 × (mean − median)
1. (c)
Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers. Bayesian classifiers can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayes' Theorem
Bayes' theorem is named after Thomas Bayes. There are two types of probabilities:
- Posterior probability, P(H|X)
- Prior probability, P(H)
where X is a data tuple and H is some hypothesis.
According to Bayes' theorem,
P(H|X) = P(X|H) P(H) / P(X)
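Applied numerically, with probabilities invented for illustration:

```python
def posterior(p_x_given_h, p_h, p_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return p_x_given_h * p_h / p_x

# Hypothetical values: P(X|H) = 0.8, P(H) = 0.1, P(X) = 0.2.
p = posterior(0.8, 0.1, 0.2)  # posterior P(H|X)
```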
Bayesian Belief Networks
Bayesian belief networks specify joint conditional probability distributions. They are also known as belief networks, Bayesian networks, or probabilistic networks.
A belief network allows class conditional independencies to be defined between subsets of variables.
It provides a graphical model of causal relationships, on which learning can be performed. We can use a trained Bayesian network for classification.
There are two components that define a Bayesian belief network:
- A directed acyclic graph
- A set of conditional probability tables
Directed Acyclic Graph
Each node in the directed acyclic graph represents a random variable. These variables may be discrete or continuous valued. These variables may correspond to the actual attributes given in the data.
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six Boolean variables.
The arcs in the diagram allow representation of causal knowledge. For example, lung cancer is influenced by a person's family history of lung cancer, as well as by whether or not the person is a smoker. It is worth noting that the variable PositiveXRay is independent of whether the patient has a family history of lung cancer or whether the patient is a smoker, given that we know the patient has lung cancer.
Conditional Probability Table
Each variable in the network has a conditional probability table (CPT) that specifies the conditional distribution P(Y | Parents(Y)), where Parents(Y) are the parents of Y in the graph.
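How a CPT is used can be sketched on a hypothetical two-node network (FamilyHistory → LungCancer, with invented probabilities): the joint probability of an assignment is the product of each node's CPT entry given its parents, and the full joint distribution sums to 1.

```python
# Hypothetical CPTs for a two-node network: FamilyHistory -> LungCancer.
p_fh = {True: 0.1, False: 0.9}                 # P(FamilyHistory)
p_lc_given_fh = {True: 0.6, False: 0.02}       # P(LungCancer=True | FamilyHistory)

def joint(fh, lc):
    """Joint probability of one assignment: the product of each node's
    CPT entry conditioned on its parents."""
    p_lc = p_lc_given_fh[fh] if lc else 1 - p_lc_given_fh[fh]
    return p_fh[fh] * p_lc

# Summing over every assignment should recover probability 1.
total = sum(joint(fh, lc) for fh in (True, False) for lc in (True, False))
```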