
8/7/2019 DW DM Question Bank Solutions


Data Warehousing and Data Mining Question Bank


Chapter 1 Introduction

1. With a simple block diagram, explain the steps involved in the process of knowledge discovery. Or justify the statement "Data mining is the process of discovering interesting knowledge."





2. Enumerate the differences between RDBMS, data mining and KDD.

a. Data Mining

b. KDD





c. RDBMS:





3. Write short notes on:

a. Frequent Patterns and Association rules

b. Classification and Prediction Analysis

c. Clustering Analysis

d. Outlier Analysis

Frequent Patterns:

Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.

• Motivation:

Finding inherent regularities in data

o What products were often purchased together? Beer and diapers?!

o What are the subsequent purchases after buying a PC?

o What kinds of DNA are sensitive to this new drug?

o Can we automatically classify web documents?

• Applications:

o Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

Association Rules:

o In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases.

o Piatetsky-Shapiro describes analyzing and presenting strong rules discovered in databases using different measures of interestingness.

o Based on the concept of strong rules, Agrawal introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets.

For example, the rule {onions, potatoes} ⇒ {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy burger. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements. In addition to the above example from market basket analysis, association rules are employed today in many application areas including Web usage mining, intrusion detection and bioinformatics.

Definition:

The problem of association rule mining is defined as:





o Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of n binary attributes called items.

o Let $D = \{t_1, t_2, \ldots, t_m\}$ be a set of m transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I.

o A rule is defined as an implication of the form $X \Rightarrow Y$ where $X, Y \subseteq I$ and $X \cap Y = \emptyset$.

o The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand-side or LHS) and the consequent (right-hand-side or RHS) of the rule respectively.

To illustrate the concepts, we use a small example from the supermarket domain.

• The set of items is I = {milk, bread, butter, beer}, and a small database containing the items (1 codes presence and 0 absence of an item in a transaction) is shown in the table below.

• An example rule for the supermarket could be {butter, bread} ⇒ {milk}, meaning that if butter and bread are bought, customers also buy milk.

Example database with 4 items and 5 transactions

transaction ID   milk   bread   butter   beer
1                1      1       0        0
2                0      0       1        0
3                0      0       0        1
4                1      1       1        0
5                0      1       0        0
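The example rule can be checked against the five transactions in the table. The sketch below is only illustrative: it uses support and confidence, the usual measures of rule strength, which the text above does not define, and the function names are assumptions made for this example.

```python
# Transactions transcribed from the example table (1 = item present).
transactions = [
    {"milk", "bread"},            # transaction 1
    {"butter"},                   # transaction 2
    {"beer"},                     # transaction 3
    {"milk", "bread", "butter"},  # transaction 4
    {"bread"},                    # transaction 5
]


def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)


def confidence(lhs, rhs, db):
    """support(lhs union rhs) / support(lhs) for the rule lhs => rhs."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)


# Rule {butter, bread} => {milk}
rule_support = support({"butter", "bread", "milk"}, transactions)
rule_confidence = confidence({"butter", "bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Only transaction 4 contains butter and bread together, and it also contains milk, so the rule holds in every transaction where its antecedent appears, although it is backed by a single transaction out of five.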

Cluster Analysis:





Outlier analysis:





4. Outline the major research challenges of mining in one specific application: Bioengineering or Bio-informatics

Chapter 2: Data processing

5. Discuss the various methods of data cleaning and transformation









Handling noisy data:









Data Transformation





6. For the following data:

a. Use smoothing by bin means to smooth the data

b. How will you determine outliers?

c. What are the other methods existing for smoothing?

13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 36, 40, 45, 46, 52, 70

a. Smoothing by bin means:

a. Partition the sorted entries into equal-frequency bins (e.g. 9 each into 3 bins)

b. Find the mean of all the entries in the bin

c. Replace all the entries with the mean value in the bin

Bin1: 13, 15, 16, 16, 19, 20, 20, 21, 22

Bin2: 22, 25, 25, 25, 25, 25, 30, 33, 33

Bin3: 35, 35, 35, 36, 40, 45, 46, 52, 70

Mean for Bin1: 162/9 = 18

Mean for Bin2: 243/9 = 27

Mean for Bin3: 394/9 = 43.78, rounded to 44

Smoothing by bin means yields:

• Bin1: 18, 18, 18, 18, 18, 18, 18, 18, 18

• Bin2: 27, 27, 27, 27, 27, 27, 27, 27, 27

• Bin3: 44, 44, 44, 44, 44, 44, 44, 44, 44
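The three steps above can be sketched in a few lines of Python. This is a minimal illustration that assumes equal-frequency bins of 9 values, as in the worked answer; the function name is an assumption for this example.

```python
data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning: replace each value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


smoothed = smooth_by_bin_means(data, bin_size=9)
# One representative value per bin (every value in a bin is identical).
bin_means = [smoothed[0], smoothed[9], smoothed[18]]
print([round(m, 2) for m in bin_means])
```

The third bin's exact mean is 394/9, about 43.78, which is the value reported as 44 in the hand calculation.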





b. Outliers are data objects that do not comply with the general behavior of the data.

c. Other smoothing techniques include:

a. Smoothing by equal frequency

b. Smoothing by bin medians

c. Smoothing by bin boundaries
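Smoothing by bin boundaries replaces each value with the nearer of its bin's minimum and maximum. The sketch below applies it to the data of this question using the same 3 bins of 9 values; the tie-breaking rule (ties go to the lower boundary) and the function name are assumptions for this example.

```python
data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def smooth_by_bin_boundaries(values, bin_size):
    """Replace each value by the closer of its bin's lowest and highest value."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        low, high = bin_values[0], bin_values[-1]
        # Assumption: a value equidistant from both boundaries goes to the lower one.
        smoothed.extend(low if v - low <= high - v else high for v in bin_values)
    return smoothed


by_boundaries = smooth_by_bin_boundaries(data, bin_size=9)
for start in range(0, len(by_boundaries), 9):
    print(by_boundaries[start:start + 9])
```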

7. Discuss the various methods of data discretization and concept

hierarchies for numerical data





a. Binning: refer to question 6 for an explanation

b. Histogram analysis

c. Clustering





d. Entropy based discretization













e. Interval merge by χ² analysis

f. Segmentation by natural partitioning





8. Explain why normalization of data set is often required before it is

analyzed. Differentiate between min-max and z-score methods of data transformation for

the following dataset:

Sl. No   Finger print feature   Face recognition feature
1        0.57                   120
2        0.28                   122
3        0.15                   85
4        0.72                   27
5        0.91                   12
6        0.40                   51

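Normalization scales the attribute data to fit in a specific range, so that an attribute with a large range (here the face recognition feature, 12 to 122) does not outweigh an attribute with a small range (here the finger print feature, 0.15 to 0.91). There are many types of normalization available: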

a. Min-Max Normalization

b. Z-score normalization

Min Max Normalization:

It transforms a value A to B, which fits in the range [C, D]. It is given by the formula below:

$$B = \frac{A - \min(A)}{\max(A) - \min(A)} \times (D - C) + C$$





In the dataset given above, let [C, D] be equal to [0.0, 1.0]. The normalized values are given below.

E.g.

1. Finger print feature:

Let A = 0.28, Min (A) = 0.15, Max (A) = 0.91.

Hence B = ((0.28 - 0.15) / (0.91 - 0.15)) * (1 - 0) + 0 = (0.13/0.76) * 1 = 0.17

Similarly fill in the values for the other rows.

2. Face recognition feature:

Let A = 85, Min (A) = 12, Max (A) = 122

Hence B = ((85 - 12) / (122 - 12)) * (1 - 0) + 0 = (73/110) * 1 = 0.66

Similarly fill in the values for the other rows.


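Min-max normalized values (rounded to two decimals):

Sl. No   Finger print feature   Face recognition feature
1        0.55                   0.98
2        0.17                   1.00
3        0.00                   0.66
4        0.75                   0.14
5        1.00                   0.00
6        0.33                   0.35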
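The min-max calculation can be reproduced for both columns with a short script. This is a minimal sketch: the function name and the default target range [0.0, 1.0] are assumptions for this example, and it assumes the column is not constant (otherwise the denominator would be zero).

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map each value A to B = (A - min) / (max - min) * (D - C) + C."""
    lo, hi = min(values), max(values)
    # Assumes hi != lo; a constant column would need separate handling.
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


fingerprint = [0.57, 0.28, 0.15, 0.72, 0.91, 0.40]
face = [120, 122, 85, 27, 12, 51]

fingerprint_scaled = [round(b, 2) for b in min_max_normalize(fingerprint)]
face_scaled = [round(b, 2) for b in min_max_normalize(face)]
print(fingerprint_scaled)
print(face_scaled)
```

After scaling, both features lie in [0, 1], so neither dominates a distance computation merely because of its original units.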

Z-score Method:

The standard score is

$$z = \frac{x - \mu}{\sigma}$$

Where:

1. $x$ is a raw score to be standardized;

2. $\mu$ is the mean of the population;

3. $\sigma$ is the standard deviation of the population.





The quantity z represents the distance between the raw score and the population mean in units of the standard deviation. z is negative when the raw score is below the mean, positive when above.

E.g. mean(x) = (0.57+0.28+0.15+0.72+0.91+0.40)/6 =0.51

To calculate standard deviation:

1. Compute the difference of each data point from the mean and square the result.

2. Add these squared differences, divide the sum by the number of data items, and take the square root to get the standard deviation.

E.g.

1. x = 0.57, (0.57 - 0.51) ^ 2 = 0.0036

2. x = 0.28, (0.28 - 0.51) ^ 2 = 0.0529

3. x = 0.15, (0.15 - 0.51) ^ 2 = 0.1296

4. x = 0.72, (0.72 - 0.51) ^ 2 = 0.0441

5. x = 0.91, (0.91 - 0.51) ^ 2 = 0.16

6. x = 0.40, (0.40 - 0.51) ^ 2 = 0.0121

Sum of the above results: s = 0.4023

Number of data items, n = 6

Standard Deviation:

SD = (s/n) ^ 0.5 = (0.4023/6) ^ 0.5 = 0.259

Z = (x - mean(x)) / SD

e.g. Z1 = (0.57 - 0.51)/0.259 = 0.23

Z-scores for the finger print feature (rounded to two decimals):

Sl. No   Z-score
1         0.23
2        -0.89
3        -1.39
4         0.81
5         1.54
6        -0.42
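The z-score steps can be cross-checked with Python's standard library. This sketch assumes the population standard deviation (dividing by n, as in the calculation above), which is what `statistics.pstdev` computes; the function name `z_scores` is an assumption for this example. Because the code keeps the unrounded mean 0.505 instead of 0.51, its results differ slightly from the hand calculation; the first score, for instance, comes out near 0.25.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values as (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD: divides by n, not n - 1
    return [(x - mu) / sigma for x in values]


fingerprint = [0.57, 0.28, 0.15, 0.72, 0.91, 0.40]
scores = z_scores(fingerprint)
print([round(z, 2) for z in scores])
```

A useful sanity check is that standardized scores always sum to zero, up to floating-point error.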









Chapter 3: Data Warehousing

9. Compare and contrast database with data warehouse

• OLTP (On-line transaction processing): traditional database systems

• OLAP (On-line analytical processing): data warehouse





10. Discuss 3 tier data warehouse architecture

The following are the 3 tiers of the data warehouse:

a. Bottom tier: data warehouse database server

i. It has the data warehouse server

ii. Data is fed using back end tools and utilities

iii. Data is extracted using programs called gateways

iv. It also contains the meta data repository

b. Middle tier: OLAP server

i. The middle tier is usually an OLAP server, which is implemented using either:

1. A relational OLAP model (ROLAP), i.e. an extended relational DBMS

that maps operations on multidimensional data to standard

relational operations.





2. A multidimensional OLAP model (MOLAP), i.e. a special purpose

server which directly implements the multidimensional data and

operations.

c. Top tier: front end tools

The top tier is a front end client layer that consists of querying and reporting tools, and/or data mining tools.

11. Clearly compare the following based on certain major issues: ROLAP,

MOLAP, and HOLAP.

In the OLAP world, there are mainly two different types, plus a hybrid of the two:

a. Multidimensional OLAP (MOLAP)

b. Relational OLAP (ROLAP)

c. Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

MOLAP

This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a

multidimensional cube. The storage is not in the relational database, but in proprietary

formats.

Advantages:

a. Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.

b. Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages:

a. Can handle only a limited amount of data: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.

b. Requires additional investment: Cube technologies are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.





ROLAP

This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
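The WHERE-clause equivalence can be seen with an in-memory SQLite table. The sketch below is illustrative only: the `sales` table, its columns and its rows are hypothetical, and real ROLAP tools generate far more elaborate SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (item TEXT, quarter TEXT, city TEXT, dollars_sold REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("PC", "Q1", "Delhi", 1000.0),
        ("PC", "Q2", "Delhi", 1200.0),
        ("Phone", "Q1", "Delhi", 400.0),
        ("Phone", "Q1", "Mumbai", 500.0),
    ],
)

# Slice: fix one dimension (quarter = Q1) with a WHERE clause.
slice_q1 = conn.execute(
    "SELECT item, city, SUM(dollars_sold) FROM sales "
    "WHERE quarter = ? GROUP BY item, city ORDER BY item, city",
    ("Q1",),
).fetchall()

# Dice: restrict two dimensions at once by adding a second condition.
dice_q1_delhi = conn.execute(
    "SELECT item, SUM(dollars_sold) FROM sales "
    "WHERE quarter = ? AND city = ? GROUP BY item ORDER BY item",
    ("Q1", "Delhi"),
).fetchall()

print(slice_q1)
print(dice_q1_delhi)
```

Each further restriction adds another condition to the WHERE clause, and the database re-aggregates the detail rows on every query.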

Advantages:

a. Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.

b. Can leverage functionalities inherent in the relational database: A relational database often already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:

a. Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.

b. Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.

HOLAP

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For

summary-type information, HOLAP leverages cube technology for faster performance.

When detail information is needed, HOLAP can "drill through" from the cube into the

underlying relational data.

12. What do you mean by a data cube? Explain any 2 multidimensional

data modeling paradigms with an example.

A data warehouse is based on a multidimensional data model which views data in the form

of a cube.

A data cube such as sales allows data to be modeled and viewed from multiple dimensions. It is defined by dimensions and facts:





a. Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year). These tables represent the dimensions.

b. The fact table contains the measures (such as dollars_sold) and keys to each of the related dimension tables.
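The relationship between the fact table and its dimension tables can be sketched with SQLite. The layout below, one central fact table referencing each dimension table by key, is the arrangement commonly called a star schema. The table names `item_dim`, `time_dim` and `sales_fact`, the key columns and all rows are hypothetical; only the attribute names item_name, brand, type, day, week, month, quarter, year and dollars_sold come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT
);
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, week INTEGER,
    month INTEGER, quarter TEXT, year INTEGER
);
-- The fact table holds the measure plus one key per dimension.
CREATE TABLE sales_fact (
    item_key INTEGER REFERENCES item_dim(item_key),
    time_key INTEGER REFERENCES time_dim(time_key),
    dollars_sold REAL
);
""")
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?)", [
    (1, "laptop", "BrandA", "computer"),
    (2, "phone", "BrandB", "mobile"),
])
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 5, 1, 1, "Q1", 2019),
    (2, 9, 15, 4, "Q2", 2019),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", [
    (1, 1, 900.0), (1, 2, 1100.0), (2, 1, 300.0), (2, 1, 200.0),
])

# View the measure along two dimensions by joining the fact table to them.
totals = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN item_dim AS i ON i.item_key = f.item_key
    JOIN time_dim AS t ON t.time_key = f.time_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()
print(totals)
```

Grouping by a coarser attribute of the same dimension, such as year instead of quarter, gives a more summarized view of the same facts without changing the fact table.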

Multidimensional data modeling schemes:









13. Discuss the various OLAP operations for a multidimensional data

model.




14. Suggest an integrated architecture for OLAM and OLAP for efficient

analytical processing of data cubes.
