Download - Dm Ext Qp Solution 2015-16
7/25/2019 Dm Ext Qp Solution 2015-16
1. Distinguishing features of OLTP vs. OLAP:

Users and system orientation:
OLTP - An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals.
OLAP - An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.

Data contents:
OLTP - An OLTP system manages current data that, typically, are too detailed to be easily used for decision making.
OLAP - An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use in informed decision making.

Database design:
OLTP - An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design.
OLAP - An OLAP system typically adopts either a star or snowflake model and a subject-oriented database design.

View:
OLTP - An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations.
OLAP - An OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple storage media.

Access patterns:
OLTP - The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms.
OLAP - Accesses to OLAP systems are mostly read-only operations (because most data warehouses store historical rather than up-to-date information), although many could be complex queries.
2. Data mining refers to extracting or "mining" knowledge from large amounts of data. Data mining is an essential step in the knowledge discovery process. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base. Data mining is only one step in this process, but an essential one, because it uncovers hidden patterns for evaluation.
3. Normalization, where the attribute data are scaled so as to fall within a small specified range. In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean and standard deviation of A. A value, v, of A is normalized to v' by computing

v' = (v − Ā) / σA

where Ā and σA are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
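The transform can be sketched in a few lines of Python; the salary-like values below are invented sample data, not figures from the question.

```python
from statistics import mean, pstdev

def z_score_normalize(values):
    """Normalize values by subtracting the mean and dividing by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the attribute
    return [(v - mu) / sigma for v in values]

# Invented sample values for attribute A:
data = [30000, 40000, 50000, 60000, 70000]
normalized = z_score_normalize(data)
# The normalized values have mean 0 and standard deviation 1.
```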
4.
Agglomerative (bottom-up approach): Starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one, or until a termination condition holds.
Divisive (top-down approach): Starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object forms its own cluster or until a termination condition holds.
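The bottom-up strategy above can be sketched with single-linkage merging on one-dimensional points; the points and the target cluster count are illustrative only.

```python
def agglomerative(points, k):
    """Bottom-up clustering: start with singleton clusters, then repeatedly
    merge the two closest clusters (single linkage) until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

result = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], 3)
```

A divisive algorithm would run the same loop in reverse, starting from one all-inclusive cluster and splitting at each step.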
5.
True positives (TP): These refer to the positive tuples that were correctly labeled by the classifier. Let TP be the number of true positives.
True negatives (TN): These are the negative tuples that were correctly labeled by the classifier. Let TN be the number of true negatives.
False positives (FP): These are the negative tuples that were incorrectly labeled as positive. Let FP be the number of false positives.
False negatives (FN): These are the positive tuples that were mislabeled as negative. Let FN be the number of false negatives.
CONFUSION MATRIX

                 Predicted class
Actual class     yes    no     Total
yes              TP     FN     P
no               FP     TN     N
Total            P'     N'     P + N

Precision = TP / P' (the proportion of tuples labeled positive that actually are positive).
Recall (or sensitivity, or true positive recognition rate) = TP / P, the proportion of positive tuples that are correctly identified.
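The two measures follow directly from the matrix counts; a minimal sketch, with made-up counts:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP) = TP/P'; Recall = TP/(TP+FN) = TP/P."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts, not from the question:
p, r = precision_recall(tp=90, fp=10, fn=30)
```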
MinPts, which specifies the density threshold of dense regions.
MinPts is the minimum number of points required in an Eps-neighbourhood of a point for that point to be considered dense (a core point).
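Under these definitions, a core-point test is just a neighbourhood count; a one-dimensional sketch with illustrative values of Eps and MinPts:

```python
def eps_neighborhood(points, p, eps):
    """All points within distance eps of p (including p itself)."""
    return [q for q in points if abs(q - p) <= eps]

def is_core_point(points, p, eps, min_pts):
    """p is a core point if its eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

# Invented sample points: three close together, one far away.
pts = [1.0, 1.1, 1.2, 5.0]
```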
10. Data analysis and decision support
- Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross-selling, market segmentation
- Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
- Fraud detection and detection of unusual patterns (outliers)
Other applications:
- Text mining (newsgroups, email, documents) and Web mining
- Stream data mining
- Bioinformatics and bio-data analysis
11.(a) Incomplete, inconsistent, and noisy data are commonplace properties of large real-world databases. Attributes of interest may not always be available, and other data was included just because it was considered to be important at the time of entry. Relevant data may sometimes not be recorded. Furthermore, the recording of modifications to the data may not have been done. There are many possible reasons for noisy data (incorrect attribute values): there could have been human as well as computer errors during data entry, there could be inconsistencies in the naming conventions adopted, and sometimes duplicate tuples may occur.

Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies in the data. Although mining routines have some form of handling noisy data, they are not always robust. If you would like to include files from many sources in your analysis, then data integration is required. Naming inconsistencies may occur in this context. A large amount of redundant data may confuse or slow down the knowledge discovery process. In addition to data cleaning, steps must be taken to remove redundancies in the data.

Sometimes data would have to be normalized so that it is scaled to a specific range, e.g. [0.0, 1.0], in order for data mining algorithms such as neural networks or clustering to work. Furthermore, you would require aggregating data,
e.g. as sales per region - something that is not part of the stored data. For this, data transformation methods need to be applied to the data.
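Min-max scaling into a range such as [0.0, 1.0], as mentioned above, can be sketched as:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly scale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

# Invented sample values:
scaled = min_max_normalize([10, 20, 30])
```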
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same or almost the same analytical results. There are a number of strategies for data reduction: data compression, numerosity reduction, and generalization.
Data reduction:
Normally the data used for data mining is huge. Complex analysis and data mining on huge amounts of data can take a very long time, making such analysis impractical or infeasible. Data reduction techniques can be applied to obtain a reduced representation of the data set that is smaller in volume yet closely maintains the integrity of the original data. That is, mining on the reduced set should be efficient and yet produce the same or almost the same analytical results.
Overview of Data Reduction Strategies
Data reduction strategies include dimensionality reduction, numerosity reduction, and data compression.
Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration. Dimensionality reduction methods include wavelet transforms and principal components analysis, which transform or project the original data onto a smaller space.
Attribute subset selection is a method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.
Numerosity reduction techniques replace the original data volume by alternative, smaller forms of data representation. These techniques may be parametric or nonparametric. For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data. (Outliers may also be stored.) Regression and log-linear models are examples. Nonparametric methods for storing reduced representations of the data include histograms, clustering, sampling, and data cube aggregation.
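As one concrete nonparametric numerosity-reduction method named above, an equal-width histogram replaces the raw values with a handful of (bin, count) pairs; a minimal sketch:

```python
def equal_width_histogram(values, n_bins):
    """Summarize raw values as (bin_range, count) pairs -- a much smaller
    representation than the original data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)  # clamp the max value into the last bin
        counts[i] += 1
    return [((lo + i * width, lo + (i + 1) * width), counts[i])
            for i in range(n_bins)]

# Invented sample values:
h = equal_width_histogram([1, 2, 3, 4, 5, 6, 7, 8], n_bins=2)
```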
In data compression, transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy. There are several lossless algorithms for string compression; however, they
typically allow only limited data manipulation. Dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression.
Data Cube Aggregation:
Data cubes store multidimensional aggregated information. Each cell holds an aggregate data value, corresponding to the data point in multidimensional space. Concept hierarchies may exist for each attribute, allowing analysis of data at multiple levels of abstraction. For example, a hierarchy for branch could allow branches to be grouped into regions, based on their address. Data cubes provide fast access to precomputed, summarized data, thereby benefiting online analytical processing as well as data mining.
The cube created at the lowest level of abstraction is referred to as the base cuboid. Data cubes created at various levels of abstraction are called cuboids, so that a data cube may refer to a lattice of cuboids. Each higher level of abstraction further reduces the resulting data size. The lowest-level cuboid should be usable for, or useful to, data analysis.
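Rolling a toy fact table up to a coarser cuboid can be sketched as a group-by-and-sum; the branch names, quarters, and sales figures are invented for illustration:

```python
from collections import defaultdict

# Toy base-cuboid fact table: (branch, quarter, sales).
facts = [
    ("Downtown", "Q1", 100), ("Downtown", "Q2", 150),
    ("Airport",  "Q1",  80), ("Airport",  "Q2", 120),
]

def aggregate(facts, dims):
    """Aggregate the base cuboid up to the cuboid defined by the chosen
    dimension indices, summing the measure (last field) per group."""
    cube = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in dims)
        cube[key] += row[-1]
    return dict(cube)

by_branch = aggregate(facts, dims=(0,))  # roll up over the quarter dimension
```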
Dimensionality Reduction:
Data sets may contain hundreds of attributes for analysis, most of which may be irrelevant to the data mining task, or redundant. Although it may be possible for a domain expert to pick out certain attributes, this can be a difficult and time-consuming task. Leaving out relevant attributes or keeping irrelevant attributes may cause confusion for the mining algorithm employed, and the redundant data may slow down the mining process.
Dimensionality reduction reduces the data set size by removing such attributes (dimensions) from it. Typically, methods for attribute subset selection are applied. The goal of attribute subset selection is to find the minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all the attributes. Mining on a reduced set has an additional benefit: it reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
Basic heuristic methods of attribute selection include the following techniques:
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
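A greedy sketch of stepwise forward selection, assuming a hypothetical scoring function that rates attribute subsets (the attribute names and relevance scores are invented):

```python
def forward_select(attributes, score, k):
    """Stepwise forward selection: start with an empty reduced set and
    repeatedly add the attribute that yields the best-scoring subset."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical relevance scores standing in for a real subset-quality measure:
relevance = {"age": 0.9, "income": 0.7, "zip": 0.1}
chosen = forward_select(relevance,
                        lambda subset: sum(relevance[a] for a in subset),
                        k=2)
```

Backward elimination runs the same idea in reverse, starting from the full attribute set and dropping the worst attribute at each step.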
personal computer. There is an associated probability (confidence, or certainty) that a student in this group owns a personal computer.

Classification differs from prediction in that the former constructs a set of models (or functions) that describe and distinguish data classes or concepts, whereas the latter builds a model to predict some missing or unavailable, and often numerical, data values. Their similarity is that they are both tools for prediction: classification is used for predicting the class label of data objects, and prediction is typically used for predicting missing numerical data values.

Clustering analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. Each cluster that is formed can be viewed as a class of objects. Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.

Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association, classification, or clustering of time-related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
12.
(a) Attribute-Oriented Induction
Proposed in 1989 (KDD '89 workshop). Not confined to categorical data nor particular measures. Collect the task-relevant data (initial relation) using a relational database query. Perform generalization by attribute removal or attribute generalization. Apply aggregation by merging identical, generalized tuples and accumulating their respective counts. Interactive presentation with users.

Basic Principles of Attribute-Oriented Induction
Data focusing: task-relevant data, including dimensions, and the result is the initial relation.
Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
12.(b) Three-tier warehouse architecture
1. The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (e.g., customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse. The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connection) and OLE-DB (Object Linking and Embedding Database) by Microsoft, and JDBC (Java Database Connection). This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations); or (2) a multidimensional
OLAP (MOLAP) model (i.e., a special-purpose server that directly implements multidimensional data and operations).
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
13.(a) Given two data objects (
Supremum (Chebyshev) distance: the maximum absolute difference between the two objects over all attributes, d = max over attributes f of |x_f − y_f|.
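A sketch of the supremum distance; the two 2-attribute objects below are illustrative, not necessarily those given in the question:

```python
def supremum_distance(x, y):
    """Chebyshev (L-infinity) distance: the maximum attribute-wise
    absolute difference between two objects."""
    return max(abs(a - b) for a, b in zip(x, y))

d = supremum_distance((22, 1), (20, 0))  # illustrative objects
```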
The resulting data cube details the total sales per month rather than summarizing them by quarter.
Because a drill-down adds more detail to the given data, it can also be performed by adding new dimensions to a cube. For example, a drill-down on the central cube of Figure 4.12 can occur by introducing an additional dimension, such as customer group.
Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. Figure 4.12 shows a slice operation where the sales data are selected from the central cube for the dimension time using the criterion time = "Q1". The dice operation defines a subcube by performing a selection on two or more dimensions. Figure 4.12 shows a dice operation on the central cube based on the following selection criteria that involve three dimensions: (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home entertainment" or "computer").
Figure 4.12 '!amples of typical OLAP operations on multidimensional data.
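The slice and dice selections described above can be sketched over a toy set of cube cells; the cells and sales figures are invented for illustration:

```python
# Toy cube cells: (time, location, item) -> sales.
cells = {
    ("Q1", "Toronto",   "phone"): 100,
    ("Q1", "Vancouver", "phone"):  80,
    ("Q2", "Toronto",   "phone"):  90,
    ("Q2", "Vancouver", "tv"):     70,
}

def slice_cube(cells, dim, value):
    """Slice: select on a single dimension, yielding a subcube."""
    return {k: v for k, v in cells.items() if k[dim] == value}

def dice_cube(cells, criteria):
    """Dice: select on two or more dimensions; criteria maps a
    dimension index to its set of allowed values."""
    return {k: v for k, v in cells.items()
            if all(k[d] in allowed for d, allowed in criteria.items())}

q1 = slice_cube(cells, dim=0, value="Q1")                 # time = "Q1"
sub = dice_cube(cells, {0: {"Q1", "Q2"}, 1: {"Toronto"}})  # time and location
```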
14.(a)
The Apriori Algorithm: The first pass of the algorithm simply counts item occurrences to determine the large 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the large itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the Apriori candidate generation function (apriori-gen) described below. Next, the database is scanned and the support of the candidates in Ck is counted.
The Apriori algorithm is:
L1 = {large 1-itemsets};
for (k = 2; Lk-1 is non-empty; k++) do begin
    Ck = apriori-gen(Lk-1);          // new candidates
    forall transactions t in D do begin
        Ct = subset(Ck, t);          // candidates contained in t
        forall candidates c in Ct do
            c.count++;
    end
    Lk = {c in Ck | c.count >= minsup}
end
Answer = union over k of Lk;
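A compact Python rendering of the same level-wise procedure; the transactions and minimum support count are invented for illustration:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: build L1 from item counts; each later pass joins
    L(k-1) with itself, prunes candidates having an infrequent (k-1)-subset,
    then counts candidate support by scanning the transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    L = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    result = list(L)
    k = 2
    while L:
        # apriori-gen: join step, then prune step
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in set(L)
                             for s in combinations(c, k - 1))}
        L = [c for c in candidates if support(c) >= min_sup]
        result += L
        k += 1
    return result

freq = apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}],
               min_sup=2)
```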
Median, max, min, quantiles, outliers, variance, etc. will help us in finding noise. From the data mining point of view, it is important to examine how these measures are computed efficiently; this introduces the notions of distributive measure, algebraic measure, and holistic measure.
Measuring the Central Tendency
Mean (algebraic measure): x̄ = (1/n) Σ xi. (Note: n is the sample size.)
A distributive measure can be computed by partitioning the data into smaller subsets (e.g., sum and count). An algebraic measure can be computed by applying an algebraic function to one or more distributive measures (e.g., mean = sum/count). Sometimes each value xi is weighted, giving the weighted arithmetic mean. A problem is that the mean measure is sensitive to extreme (e.g., outlier) values.
Median (holistic measure)
The middle value if there is an odd number of values, or the average of the middle two values otherwise. A holistic measure must be computed on the entire data set. Holistic measures are much more expensive to compute than distributive measures, and can be estimated by interpolation (for grouped data):
median ≈ L1 + ((N/2 − (Σ freq)l) / freqmedian) × width
where the median interval is the interval containing the median frequency, L1 is the lower boundary of the median interval, N is the number of values in the entire data set, (Σ freq)l is the sum of the frequencies of all intervals below the median interval, freqmedian is the frequency of the median interval, and width is the width of the median interval.
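The interpolation formula can be sketched directly; the frequency table below is hypothetical:

```python
def grouped_median(intervals):
    """Estimate the median of grouped data by interpolation:
    median ~ L1 + ((N/2 - cum_freq_below) / freq_median) * width."""
    n = sum(freq for _, _, freq in intervals)
    cum = 0  # frequencies accumulated below the current interval
    for lo, hi, freq in intervals:
        if cum + freq >= n / 2:  # this is the median interval
            return lo + ((n / 2 - cum) / freq) * (hi - lo)
        cum += freq

# Hypothetical frequency table: (lower bound, upper bound, frequency).
m = grouped_median([(0, 10, 2), (10, 20, 6), (20, 30, 2)])
```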
Mode
The value that occurs most frequently in the data. It is possible that several different values share the greatest frequency: unimodal, bimodal, trimodal, multimodal. If each data value occurs only once, then there is no mode.
Midrange
Can also be used to assess the central tendency. It is the average of the smallest and the largest values in the set. It is an algebraic measure that is easy to compute.
The degree to which data tend to spread is called the dispersion, or variance, of the data. The most common measures of data dispersion are the range, the five-number summary (based on quartiles), the interquartile range, and the standard deviation.
Range: The distance between the largest and the smallest values.
kth percentile: The value xi having the property that k percent of the data lies at or below xi. The median is the 50th percentile. The most popular percentiles other than the median are the quartiles Q1 (the 25th percentile) and Q3 (the 75th percentile); the interquartile range is IQR = Q3 − Q1.
The standard deviation measures spread about the mean and should be used only when the mean is chosen as the measure of the center. The standard deviation is 0 only when there is no spread, that is, when all observations have the same value; otherwise the standard deviation is greater than 0. Variance and standard deviation are algebraic measures; thus, their computation is scalable in large databases.
Graphic Displays
Boxplot: graphic display of the five-number summary.
Histogram: the x-axis shows the values, the y-axis represents the frequencies.
Quantile plot: each value xi is paired with fi, indicating that approximately 100·fi % of the data are ≤ xi.
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
Scatter plot: each pair of values is a pair of coordinates, plotted as points in the plane.
Measure and formula:
Mean: x̄ = (1/n) Σ xi
Weighted mean: x̄ = (Σ wi xi) / (Σ wi)
Median: middle value (or, for grouped data, the interpolation estimate)
Variance: σ² = (1/n) Σ (xi − x̄)²
Standard deviation: σ is the square root of the variance.
Empirical formula: mean − mode ≈ 3 × (mean − median)
1. (c)
Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers. Bayesian classifiers can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayes' Theorem
Bayes' theorem is named after Thomas Bayes. There are two types of probabilities:
- Posterior probability, P(H|X)
- Prior probability, P(H)
where X is a data tuple and H is some hypothesis.
According to Bayes' theorem,
P(H|X) = P(X|H) P(H) / P(X)
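Applied numerically, with probabilities invented for illustration:

```python
def posterior(p_x_given_h, p_h, p_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return p_x_given_h * p_h / p_x

# Hypothetical values: P(X|H) = 0.8, P(H) = 0.1, P(X) = 0.2.
p = posterior(0.8, 0.1, 0.2)  # posterior P(H|X)
```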
Bayesian Belief Networks
Bayesian belief networks specify joint conditional probability distributions. They are also known as belief networks, Bayesian networks, or probabilistic networks.
A belief network allows class conditional independencies to be defined between subsets of variables.
It provides a graphical model of causal relationships, on which learning can be performed. We can use a trained Bayesian network for classification.
There are two components that define a Bayesian belief network:
- A directed acyclic graph
- A set of conditional probability tables
Directed Acyclic Graph
Each node in the directed acyclic graph represents a random variable. These variables may be discrete or continuous valued. These variables may correspond to the actual attributes given in the data.
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six Boolean variables.
The arcs in the diagram allow representation of causal knowledge. For example, lung cancer is influenced by a person's family history of lung cancer, as well as by whether or not the person is a smoker. It is worth noting that the variable PositiveXRay is independent of whether the patient has a family history of lung cancer or whether the patient is a smoker, given that we know the patient has lung cancer.
Conditional Probability Table
Each variable in the network has a conditional probability table (CPT) that specifies the conditional distribution P(Y | Parents(Y)), where Parents(Y) are the parents of Y in the graph.
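How a CPT is used can be sketched on a hypothetical two-node network (FamilyHistory → LungCancer, with invented probabilities): the joint probability of an assignment is the product of each node's CPT entry given its parents, and the full joint distribution sums to 1.

```python
# Hypothetical CPTs for a two-node network: FamilyHistory -> LungCancer.
p_fh = {True: 0.1, False: 0.9}                 # P(FamilyHistory)
p_lc_given_fh = {True: 0.6, False: 0.02}       # P(LungCancer=True | FamilyHistory)

def joint(fh, lc):
    """Joint probability of one assignment: the product of each node's
    CPT entry conditioned on its parents."""
    p_lc = p_lc_given_fh[fh] if lc else 1 - p_lc_given_fh[fh]
    return p_fh[fh] * p_lc

# Summing over every assignment should recover probability 1.
total = sum(joint(fh, lc) for fh in (True, False) for lc in (True, False))
```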