databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“...

14
Knowledge Discovery in Databases T2: databases P. Berka, 2018 1/14 Databases Flat file client transaction account Relational database Querying: QBE vs. SQL date name surname street city account no. payment balance 980103 Jan Novak Dlouha 5 Praha 1 9945371 100.00 100.00 980105 Jan Novak Dlouha 5 Praha 1 9945371 1500.00 1600.00 980106 Jan Novak Dlouha 5 Praha 1 9945371 -1550.00 50.00 980106 Karel Nemec Podolska 4 Praha 2 24867134 3000.00 6000.00 980107 Karel Nemec Podolska 4 Praha 2 24867134 -4000.00 2000.00 980108 Jan Novak Dlouha 5 Praha 1 9945371 -150.00 -100.00 980111 Karel Nemec Podolska 4 Praha 2 24867134 5000.00 7000.00 . . . id_client name surname street city . . . id_transaction id_account date payment balance . . . id_account id_client . . . SELECT client.name, client.surname, client.street, client.city, account. id_client, transaction.balance FROM client, account, transaction WHERE client.id_client = account.id_account; AND transaction.id_account = account.id_account; AND transaction.balance < 100; GROUP BY client.city

Upload: others

Post on 05-Apr-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“ OLAP vs. ROLAP MOLAP vs. ROLAP physical implementation of the system: star schema,

Knowledge Discovery in Databases T2: databases

P. Berka, 2018 1/14

Databases

Flat file

client transaction

account

Relational database

Querying: QBE vs. SQL

date name surname street city account no. payment balance

980103 Jan Novak Dlouha 5 Praha 1 9945371 100.00 100.00

980105 Jan Novak Dlouha 5 Praha 1 9945371 1500.00 1600.00

980106 Jan Novak Dlouha 5 Praha 1 9945371 -1550.00 50.00

980106 Karel Nemec Podolska 4 Praha 2 24867134 3000.00 6000.00

980107 Karel Nemec Podolska 4 Praha 2 24867134 -4000.00 2000.00

980108 Jan Novak Dlouha 5 Praha 1 9945371 -150.00 -100.00

980111 Karel Nemec Podolska 4 Praha 2 24867134 5000.00 7000.00

. . .

id_client

name

surname

street

city

. . .

id_transaction

id_account

date

payment

balance

. . .

id_account

id_client

. . .

SELECT client.name, client.surname, client.street, client.city, account.

id_client, transaction.balance

FROM client, account, transaction

WHERE client.id_client = account.id_account;

AND transaction.id_account = account.id_account;

AND transaction.balance < 100;

GROUP BY client.city

Page 2: Databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“ OLAP vs. ROLAP MOLAP vs. ROLAP physical implementation of the system: star schema,

Knowledge Discovery in Databases T2: databases

P. Berka, 2018 2/14

Decision support using databases

1. Executive Information Systems

managerial information systems designed for

fast access to information

user friendly interface but less flexible

2. On-Line Analytical Processing

multidimensional concept of data storage and

manipulation (DATA CUBE),

intuitive data manipulation,

work with data collected from heterogeneous

data sources – data conversion is necessary,

use of analytical methods – statistical

summaries, what-if analysis,

Client/Server architecture,

support for multi-user access,

OLAP results stored separately from the

source data,

dynamic manipulation with sparse matrices,

missing values processing,

unlimited number of dimensions and

aggregation levels.

Page 3: Databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“ OLAP vs. ROLAP MOLAP vs. ROLAP physical implementation of the system: star schema,

Knowledge Discovery in Databases T2: databases

P. Berka, 2018 3/14

Database structure

basketID date product city sales

000001 10.1. screws Praha 241

000001 10.1. nuts Praha 61

000001 10.1. screws Brno 17

000001 10.1. nails Brno 42

000002 10.2. screws Praha 92

000002 10.2. nails Praha 27

000002 10.2. screws Kladno 35

Table SALES

Praha Brno Kladno

screws nuts nails screws nuts nails screws nuts nails

10.1. 241 61 17 42

10.2. 92 27 35

Sparse matrix

product date region

city sales

Page 4: Databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“ OLAP vs. ROLAP MOLAP vs. ROLAP physical implementation of the system: star schema,

Knowledge Discovery in Databases T2: databases

P. Berka, 2018 4/14

Data manipulation:

slice and dice

roll up vs. drill down

Microsoft Data Analyzer

sales

aggregation

cities

aggregation

regions

aggregation

products

Page 5: Databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“ OLAP vs. ROLAP MOLAP vs. ROLAP physical implementation of the system: star schema,

Knowledge Discovery in Databases T2: databases

P. Berka, 2018 5/14

Implementation:

hypercube

multicube

„true“ OLAP vs. ROLAP

MOLAP vs. ROLAP

physical implementation of the system:

star schema,

snowflake schema.

SQL engine

OLAP engine

summarized

data

granular data

User interface

ROLAP MOLAP

Page 6: Databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“ OLAP vs. ROLAP MOLAP vs. ROLAP physical implementation of the system: star schema,

Knowledge Discovery in Databases T2: databases

P. Berka, 2018 6/14

dimension store fact table dimension time

dimension product

Star

dimension store

fact table region

fact table district

fact table store

Snowflake

STORE KEY

PRODUCT KEY

PERIOD KEY

cena

počet

STORE KEY

data o prodejně

město

ID okresu

data o okresu

ID regionu

data o regionu

úroveň (level)

PRODUCT KEY

data o produktu

značka

výrobce

úroveň (level)

PERIOD KEY

data o období

rok

čtvrtletí

měsíc

den

STORE KEY

PRODUCT KEY

PERIOD KEY

cena

počet

STORE KEY

data o prodejně

město

ID okresu

data o okresu

ID regionu

data o regionu

úroveň (level) ID okresu

PRODUCT KEY

PERIOD KEY

cena

počet

ID regionu

PRODUCT KEY

PERIOD KEY

cena

počet

ID okresu

data o okresu

ID regionu

ID regionu

data o regionu

Page 7: Databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“ OLAP vs. ROLAP MOLAP vs. ROLAP physical implementation of the system: star schema,

Knowledge Discovery in Databases T2: databases

P. Berka, 2018 7/14

OLAP functionality „reachable“ by classic

technologies

Microsoft Access

Microsoft Excel

Page 8: Databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“ OLAP vs. ROLAP MOLAP vs. ROLAP physical implementation of the system: star schema,

Knowledge Discovery in Databases T2: databases

P. Berka, 2018 8/14

3. Data warehouse subject oriented,

integrated,

time variant,

nonvolatile data repository used for decision

support

ddddd Data

současná detailní

data

starší detailní data

m

e

t

a

d

a

t

a

středně

sumarizovaná data

silně

sumarizovaná data

1. vrstva

Data

Mart

Data Warehouse 2. vrstva

3. vrstva

produkční databáze

Page 9: Databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“ OLAP vs. ROLAP MOLAP vs. ROLAP physical implementation of the system: star schema,

Knowledge Discovery in Databases T2: databases

P. Berka, 2018 9/14

4. Business Intelligence

computerized tools and techniques used to

collect, integrate, analyze, interpret and

present (business) data and information. 4 main

components:

data warehouse

business analytics (querying, reporting,

statistical analyses, data mining)

business performance management

user interface (presentations)

Major Components of BI (Turban a kol., 2007)

Page 10: Databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“ OLAP vs. ROLAP MOLAP vs. ROLAP physical implementation of the system: star schema,

Knowledge Discovery in Databases T2: databases

P. Berka, 2018 10/14

Databases meet Data Mining

1. Query languages for KDD

Mine Rule (Boulicaut, 1998) - association rules

MSQL (Imielinski, Virmani, 1999) – association

rules, records

MSQL - rules

MSQL - exceptions

MINE RULE Priklad AS

SELECT DISTINCT 1..n produkt AS BODY, 1..1 produkt AS HEAD,

SUPPORT, CONFIDENCE

FROM Prodej

WHERE BODY.město = HEAD.město

AND BODY.datum = HEAD.datum

EXTRACTING RULES WITH SUPPORT: 0.1, CONFIDENCE: 0.5

Emp(Id,Age,Sex,Salary,Position,Car)

GetRules (Emp)

into R

where support > 0.1 and confidence > 0.9

SelectRules (R)

where body has {Age=*), (Sex=*)}

and body is {(Car=*)}

Select *

from Emp

where violates all (GetRules (Emp)

where body is {(Age=*)}

and head is {(Salary=*)}

and confidence > 0.3)

Page 11: Databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“ OLAP vs. ROLAP MOLAP vs. ROLAP physical implementation of the system: star schema,

Knowledge Discovery in Databases T2: databases

P. Berka, 2018 11/14

DMQL (Han et al., 1996) – different rules

DMQL – association rules

DMQL – classification rules

DMQL – discrimination rules

2. API standards

SQL/MM Data Mining

OLE DB for Data Mining

Find classification rules for computer_science_students

according to average_grading

related to birth_place, address

from student

where major = „computer_science“ and birth_place = „Canada“

Find association rules

related to average_grading, birth_place, address

from student

where major = „computer_science“ and birth_place = „Canada“

with support threshold = 0.05

with confidence threshold = 0.7

Find discriminant rule

for cs_grads with status = „graduate“

in contrast to cs_undergrads with status = „undergraduate“

related to average_grading, birth_place, address

from student

where major = „computer_science“ and birth_place = „Canada“

Page 12: Databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“ OLAP vs. ROLAP MOLAP vs. ROLAP physical implementation of the system: star schema,

Knowledge Discovery in Databases T2: databases

P. Berka, 2018 12/14

3. Extending databse systems with data

mining („in-database“ data mining)

MicroSoft SQL Server 2005

- Decision trees, association rules, naive bayes

classifier, neuronal nets, text mining, sequence

clustering, time series

- uses OLE DB DM and

PMML to describe

tasks and models) and

BI Development

studio as interface

Oracle Data Mining

tools for classification, regression, anomaly

detection, association mining, clustering, feature

extraction, attribute ranking

Page 13: Databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“ OLAP vs. ROLAP MOLAP vs. ROLAP physical implementation of the system: star schema,

Knowledge Discovery in Databases T2: databases

P. Berka, 2018 13/14

New Trend – Big Data

Big Data are data that cannot be handled using

standard data base systems and standard data

analysis tools.

Big Data sources: sensors, surveillance systems,

mobile phones, GPS devices, RFID readers,

social networks, computer networks, web logs,

scientific data …

Big Data Characteristics (4 V’s):

Volume: the size of Big Data goes beyond

standard data storage and manipulation

techniques

Velocity: Big Data is often available in real

time

Variety: Big Data contains not only

structured data (e.g. in tabular or relational

form) but also texts, images, audio or video

Veracity: the quality and reliability of Big

Data can vary

Page 14: Databasesberka/docs/4iz451/sl02-databaze-en.pdf · 2018-09-25 · hypercube multicube „true“ OLAP vs. ROLAP MOLAP vs. ROLAP physical implementation of the system: star schema,

Knowledge Discovery in Databases T2: databases

P. Berka, 2018 14/14

Challenges for collecting, storing and

manipulating Big Data

New forms of data storage: file systems

(e.g. Google File System), NoSQL databases

(data that cannot be expressed using

tabular relations of relational databases)

New forms of computation: parallel

computing, distributed computing, grid

computing, cloud computing

Example solution: Apache Hadoop

Hadoop distributed file system

Map/Reduce: divide and conquer approach to

break-down intractable problem into

tractable sub-problems