Bogdan Shishedjiev Data Analysis 1 Data Analysis OLTP and OLAP Data Warehouse SQL for Data Analysis Data Mining

Upload: dora-blanche-copeland

Post on 27-Dec-2015


Bogdan Shishedjiev Data Analysis 1

Data Analysis

OLTP and OLAP

Data Warehouse

SQL for Data Analysis

Data Mining


Data Processing

• Data processing types
  – OLTP (On-Line Transaction Processing)
  – OLAP (On-Line Analytical Processing)

• Database types
  – Transactional
    • Numerous users
    • Dynamic (constantly changing)
    • Always maintaining the current state
    • Critical (heavily loaded)
  – Warehouses
    • A few users (analysts)
    • Relatively stable
    • Maintaining the data history (all states over time)
    • Lightly loaded


Architecture


Architecture

• Source components
  – Filter – separates the data to be exported and ensures its coherency
  – Export – transfers portions of data at precise moments in time

• Warehouse components
  – Loader – performs the initial loading and prepares the warehouse
  – Refresh – loads the data portions
  – Access
  – Data mining
  – Export – to other warehouses; this creates a hierarchy of warehouses


Relational Schemes

• Star


Relational Schemes

• Snowflake


Data Warehouse Design

• Stages
  – Choose the activity processes to model
  – Choose the granularity of the activity process
  – Choose the dimensions that can be applied to every record of the fact table
  – Choose the facts that must be recorded in the fact table


Data Warehouse Design

• Fact types – the most valuable are continuous numerical values
  – Additive – they can be added along all dimensions (money amounts)
  – Semi-additive – they can be added along only some of the dimensions (precipitation, product quantities)
  – Non-additive – they cannot be added along any dimension (wind speed, wind direction)
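The three fact types can be illustrated with a minimal sketch. The rows, the wind readings, and the choice of which dimension the quantities may be summed along are all assumptions made for illustration; they are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical sales rows: (day, product, units, amount_in_dollars).
rows = [
    ("Mon", "Pasta", 10, 15.0),
    ("Mon", "Oil", 2, 12.0),
    ("Tue", "Pasta", 8, 12.0),
]

# Additive: money amounts can be summed over every dimension at once.
total_amount = sum(amount for _, _, _, amount in rows)

# Semi-additive (as interpreted in this sketch): units are summed over
# days, but kept separate per product.
units_per_product = defaultdict(int)
for _, product, units, _ in rows:
    units_per_product[product] += units

# Non-additive: wind speed readings are averaged, never summed.
wind_speeds = [12.0, 18.0, 15.0]  # hypothetical readings
mean_wind_speed = sum(wind_speeds) / len(wind_speeds)

print(total_amount, dict(units_per_product), mean_wind_speed)
```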


Data Warehouse Design

• Recommendations
  – Use continuous, additive numerical values
  – The fact table is highly normalized
  – Don't normalize the dimensions; the gain is < 1%
  – Design the dimension attributes thoroughly. Most often they are textual and discrete. They are used as headings and as sources of constraints in the answers to users


Example – Hypermarket Chain

• Dimension time
  time_key, day_of_week, day_no_in_month, day_no_overall, week_no_in_year, week_no_overall, month, month_no_overall, quarter, fiscal_period, holiday_flag, weekday_flag, last_day_in_month_flag, season, event

• Promotion
  promotion_key, promotion_name, price_reduction_type, ad_type, display_type, coupon_type, ad_media_name, display_provider, promo_cost, promo_begin_date, promo_end_date, ... and others

• Dimension product
  product_key, SKU_description (SKU = stock keeping unit), SKU_number, package_size, brand, subcategory, category, department, package_type, diet_type, weight, weight_unit_of_measure, units_per_retail_case, units_per_shipping_case, cases_per_pallet, shelf_width, shelf_height, shelf_depth, ... and others

• Dimension shop
  store_key, store_name, store_number, store_street_address, store_city, store_county, store_state, store_zip, store_manager, store_phone, store_FAX, floor_plan_type, photo_processing_type, finance_services_type, first_opened_date, last_remodel_date, store_surface, grocery_surface, frozen_surface, ... and others

• Facts – Sales
  time_key, product_key, store_key, promotion_key, dollar_sales, units_sales, dollar_cost, customer_count


Hypermarket Chain

• Fact table
  – Granularity – each sale of a product (SKU – Stock Keeping Unit)
  – Values – total cost (additive), SKU quantity (semi-additive), price (non-additive), customer count (non-additive)

• Dimensions
  – Time
  – Product
  – Store
  – Promotion
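As a rough sketch, the star schema of this example could be declared as follows, using Python's built-in sqlite3 module and an in-memory database. The table names (`time_dim`, `sales_fact`, and so on) and the column types are assumptions. The column names come from the example tables, with each dimension cut down to a few attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Four dimension tables (abbreviated) and one fact table that references them.
# Table names and column types are illustrative assumptions.
conn.executescript("""
CREATE TABLE time_dim      (time_key INTEGER PRIMARY KEY, day_of_week TEXT, month TEXT);
CREATE TABLE product_dim   (product_key INTEGER PRIMARY KEY, sku_description TEXT, brand TEXT);
CREATE TABLE store_dim     (store_key INTEGER PRIMARY KEY, store_name TEXT, store_city TEXT);
CREATE TABLE promotion_dim (promotion_key INTEGER PRIMARY KEY, promotion_name TEXT);
CREATE TABLE sales_fact (
    time_key       INTEGER REFERENCES time_dim(time_key),
    product_key    INTEGER REFERENCES product_dim(product_key),
    store_key      INTEGER REFERENCES store_dim(store_key),
    promotion_key  INTEGER REFERENCES promotion_dim(promotion_key),
    dollar_sales   REAL,
    units_sales    INTEGER,
    dollar_cost    REAL,
    customer_count INTEGER
);
""")

# List the fact table columns: four keys followed by four values.
fact_columns = [row[1] for row in conn.execute("PRAGMA table_info(sales_fact)")]
print(fact_columns)
```

The fact table holds only keys and numeric values; all descriptive text stays in the dimension tables.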


Hypermarket Chain

• Calculation of the disk space needed
  – Dimension time: 2 years × 365 days = 730 days
  – Dimension shop: 300 shops, with records every day
  – Dimension product: 30,000 products in each shop; 3,000 are sold every day in each shop
  – Dimension promotion: an article can participate in only one promotion in a shop during one day
  – Elementary fact records: 300 × 730 × 3,000 × 1 = 657 × 10^6 records
  – Number of key fields: 4; number of value fields: 4; total number of fields: 8
  – Fact table size: 657 × 10^6 records × 8 fields × 4 B ≈ 21 GB
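The arithmetic of this estimate can be checked with a few lines of Python. The constant names are chosen here for illustration, and 1 GB is taken as 10^9 bytes.

```python
# Figures from the estimate above.
DAYS = 2 * 365                   # dimension time: two years of daily records
SHOPS = 300                      # dimension shop
PRODUCTS_SOLD_PER_DAY = 3000     # products actually sold per shop per day
PROMOTIONS_PER_PRODUCT_DAY = 1   # one promotion per article, shop and day
FIELDS = 4 + 4                   # four key fields plus four value fields
BYTES_PER_FIELD = 4

records = SHOPS * DAYS * PRODUCTS_SOLD_PER_DAY * PROMOTIONS_PER_PRODUCT_DAY
size_bytes = records * FIELDS * BYTES_PER_FIELD

print(f"{records:,} records, {size_bytes / 1e9:.1f} GB")
```

The estimate counts only the 3,000 products sold per shop and day, not the 30,000 stocked, because a fact record exists only when a sale occurs.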


Data Operations for Data Analysis

• General form of a SQL statement

    select D1.C1, ..., Dn.Cn, Aggr1(F.C1), ..., Aggrn(F.Cn)
    from Fact as F, Dimension1 as D1, ..., DimensionN as Dn
    where join-condition(F, D1) and ... and join-condition(F, Dn)
      and selection-condition
    group by D1.C1, ..., Dn.Cn
    order by D1.C1, ..., Dn.Cn


Data Operations for Data Analysis

• Example

    select Time.Month, Product.Name, sum(Qty)
    from Sale, Time, Product, Promotion
    where Sale.TimeCode = Time.TimeCode
      and Sale.ProductCode = Product.ProductCode
      and Sale.PromoCode = Promotion.PromoCode
      and (Product.Name = 'Pasta' or Product.Name = 'Oil')
      and Time.Month between 'Feb' and 'Apr'
      and Promotion.Name = 'SuperSaver'
    group by Time.Month, Product.Name
    order by Time.Month, Product.Name
    pivot Time.Month

           Feb   Mar   Apr
    Oil    5K    5K    7K
    Pasta  45K   50K   51K
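A runnable approximation of this example, using Python's built-in sqlite3 module, might look like the sketch below. Several things are assumptions. The toy rows are invented, with quantities chosen so that the sums match the result table in thousands. A hypothetical `MonthNo` column replaces the `between 'Feb' and 'Apr'` test, because SQLite compares month names alphabetically. The `pivot` clause is not standard SQL, so the pivoting is done in Python.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time      (TimeCode INTEGER PRIMARY KEY, Month TEXT, MonthNo INTEGER);
CREATE TABLE Product   (ProductCode INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Promotion (PromoCode INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Sale      (TimeCode INTEGER, ProductCode INTEGER, PromoCode INTEGER, Qty INTEGER);
INSERT INTO Time VALUES (1,'Jan',1),(2,'Feb',2),(3,'Mar',3),(4,'Apr',4);
INSERT INTO Product VALUES (1,'Pasta'),(2,'Oil'),(3,'Rice');
INSERT INTO Promotion VALUES (1,'SuperSaver'),(2,'None');
-- Invented rows; the last three are filtered out by month, product and promotion.
INSERT INTO Sale VALUES
  (2,1,1,30),(2,1,1,15),(2,2,1,5),(3,1,1,50),(3,2,1,5),
  (4,1,1,51),(4,2,1,7),(1,1,1,99),(2,3,1,40),(3,1,2,60);
""")

query = """
SELECT Time.Month, Product.Name, SUM(Sale.Qty)
FROM Sale, Time, Product, Promotion
WHERE Sale.TimeCode = Time.TimeCode
  AND Sale.ProductCode = Product.ProductCode
  AND Sale.PromoCode = Promotion.PromoCode
  AND (Product.Name = 'Pasta' OR Product.Name = 'Oil')
  AND Time.MonthNo BETWEEN 2 AND 4
  AND Promotion.Name = 'SuperSaver'
GROUP BY Time.MonthNo, Time.Month, Product.Name
ORDER BY Time.MonthNo, Product.Name
"""

# Pivot on month: one entry per product, one column per month.
pivot = defaultdict(dict)
for month, name, qty in conn.execute(query):
    pivot[name][month] = qty

for name in sorted(pivot):
    print(name, pivot[name])
```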


Data Cube

The cube is used to represent data along some measure of interest. Although called a "cube", it can be 2-dimensional, 3-dimensional, or higher-dimensional. Each dimension represents some attribute in the database and the cells in the data cube represent the measure of interest.


Data Cube

• Data cube representation

    Combination                 Count
    {P1, Calgary, Vance}        2
    {P2, Calgary, Vance}        4
    {P3, Calgary, Vance}        1
    {P1, Toronto, Vance}        5
    {P3, Toronto, Vance}        8
    {P5, Toronto, Vance}        2
    {P5, Montreal, Vance}       5
    {P1, Vancouver, Bob}        3
    {P3, Vancouver, Bob}        5
    {P5, Vancouver, Bob}        1
    {P1, Montreal, Bob}         3
    {P3, Montreal, Bob}         8
    {P4, Montreal, Bob}         7
    {P5, Montreal, Bob}         3
    {P2, Vancouver, Richard}    11
    {P3, Vancouver, Richard}    9
    {P4, Vancouver, Richard}    2
    {P5, Vancouver, Richard}    9
    {P1, Calgary, Richard}      2
    {P2, Calgary, Richard}      1
    {P3, Calgary, Richard}      4
    {P2, Calgary, Allison}      2
    {P3, Calgary, Allison}      1
    {P1, Toronto, Allison}      2
    {P2, Toronto, Allison}      3
    {P3, Toronto, Allison}      6
    {P4, Toronto, Allison}      2


Data Cube

• Totals – a total over a dimension is stored under the special value ANY, ALL, or NULL in that dimension
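One way to picture totals is to store the cube sparsely and treat ALL as a wildcard at query time. The sketch below copies the counts from the combination table on the previous slide. The dimension names `part`, `city` and `person` and the helper `total` are assumptions, since the slides do not name them.

```python
ALL = "ALL"

# (part, city, person) -> count, copied from the combination table.
cube = {
    ("P1", "Calgary", "Vance"): 2, ("P2", "Calgary", "Vance"): 4,
    ("P3", "Calgary", "Vance"): 1, ("P1", "Toronto", "Vance"): 5,
    ("P3", "Toronto", "Vance"): 8, ("P5", "Toronto", "Vance"): 2,
    ("P5", "Montreal", "Vance"): 5, ("P1", "Vancouver", "Bob"): 3,
    ("P3", "Vancouver", "Bob"): 5, ("P5", "Vancouver", "Bob"): 1,
    ("P1", "Montreal", "Bob"): 3, ("P3", "Montreal", "Bob"): 8,
    ("P4", "Montreal", "Bob"): 7, ("P5", "Montreal", "Bob"): 3,
    ("P2", "Vancouver", "Richard"): 11, ("P3", "Vancouver", "Richard"): 9,
    ("P4", "Vancouver", "Richard"): 2, ("P5", "Vancouver", "Richard"): 9,
    ("P1", "Calgary", "Richard"): 2, ("P2", "Calgary", "Richard"): 1,
    ("P3", "Calgary", "Richard"): 4, ("P2", "Calgary", "Allison"): 2,
    ("P3", "Calgary", "Allison"): 1, ("P1", "Toronto", "Allison"): 2,
    ("P2", "Toronto", "Allison"): 3, ("P3", "Toronto", "Allison"): 6,
    ("P4", "Toronto", "Allison"): 2,
}


def total(part=ALL, city=ALL, person=ALL):
    """Sum the counts of every cell matching the pattern; ALL matches anything."""
    pattern = (part, city, person)
    return sum(
        count
        for key, count in cube.items()
        if all(p == ALL or p == k for p, k in zip(pattern, key))
    )


print(total(person="Vance"))   # total over all parts and cities for Vance
print(total(city="Calgary"))   # total over all parts and persons in Calgary
print(total())                 # grand total: ALL in every dimension
```

Computing totals on demand keeps the stored cube small; a warehouse would more likely precompute and store such cells.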


Data Cube

• Drill down – adding a dimension for more detailed results

    Time.Month  Product.Name  sum(Qty)
    Feb         Pasta         48K
    Mar         Pasta         50K
    Apr         Pasta         51K

    Time.Month  Product.Name  Zone    sum(Qty)
    Feb         Pasta         North   18K
    Feb         Pasta         Centre  18K
    Feb         Pasta         South   12K
    Mar         Pasta         North   18K
    Mar         Pasta         Centre  18K
    Mar         Pasta         South   14K
    Apr         Pasta         North   18K
    Apr         Pasta         Centre  17K
    Apr         Pasta         South   16K


Data Cube

• Roll-up – removing a dimension

    Time.Month  Product.Name  Zone    sum(Qty)
    Feb         Pasta         North   18K
    Feb         Pasta         Centre  18K
    Feb         Pasta         South   12K
    Mar         Pasta         North   18K
    Mar         Pasta         Centre  18K
    Mar         Pasta         South   14K
    Apr         Pasta         North   18K
    Apr         Pasta         Centre  17K
    Apr         Pasta         South   16K

    Product.Name  Zone    sum(Qty)
    Pasta         North   54K
    Pasta         Centre  53K
    Pasta         South   42K
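Roll-up and drill-down amount to grouping the same detailed rows by fewer or more dimensions. The following sketch uses the nine detailed rows from the table, with quantities in thousands. The helper `aggregate` and the dimension names are hypothetical.

```python
from collections import defaultdict

# (month, product, zone, qty in thousands), copied from the detailed table.
rows = [
    ("Feb", "Pasta", "North", 18), ("Feb", "Pasta", "Centre", 18), ("Feb", "Pasta", "South", 12),
    ("Mar", "Pasta", "North", 18), ("Mar", "Pasta", "Centre", 18), ("Mar", "Pasta", "South", 14),
    ("Apr", "Pasta", "North", 18), ("Apr", "Pasta", "Centre", 17), ("Apr", "Pasta", "South", 16),
]
DIMENSIONS = ("month", "product", "zone")


def aggregate(rows, keep):
    """Sum qty grouped by the dimensions named in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[i] for i in idx)] += row[3]
    return dict(out)


# Roll-up: the month dimension is removed.
rolled_up = aggregate(rows, ("product", "zone"))
# The coarser view that a drill-down on zone would start from.
by_month = aggregate(rows, ("month", "product"))

print(rolled_up)
print(by_month)
```

Drill-down cannot be computed from the rolled-up result alone; it needs the detailed rows, which is why the warehouse keeps facts at the finest granularity.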


Data Cube

• The whole data cube

    Time.Month  Product.Name  Zone    sum(Qty)
    Feb         Pasta         North   18K
    Feb         Pasta         Centre  18K
    Feb         Pasta         South   12K
    Mar         Pasta         North   18K
    Mar         Pasta         Centre  18K
    Mar         Pasta         South   14K
    Apr         Pasta         North   18K
    Apr         Pasta         Centre  17K
    Apr         Pasta         South   16K
    ALL         Pasta         North   54K
    ALL         Pasta         Centre  53K
    ALL         Pasta         South   42K
    Feb         Pasta         ALL     48K
    Mar         Pasta         ALL     50K
    Apr         Pasta         ALL     51K
    ALL         Pasta         ALL     149K
    ALL         ALL           ALL     149K
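The whole cube can be materialized by letting every dimension of every detailed row either keep its value or collapse to ALL. This is a minimal sketch under that assumption; the function `data_cube` is hypothetical. It produces every combination, including rows with ALL as the product that the table above omits except for the grand total.

```python
from collections import defaultdict
from itertools import product as cartesian

ALL = "ALL"

# (month, product, zone, qty in thousands): the nine detailed rows of the cube.
rows = [
    ("Feb", "Pasta", "North", 18), ("Feb", "Pasta", "Centre", 18), ("Feb", "Pasta", "South", 12),
    ("Mar", "Pasta", "North", 18), ("Mar", "Pasta", "Centre", 18), ("Mar", "Pasta", "South", 14),
    ("Apr", "Pasta", "North", 18), ("Apr", "Pasta", "Centre", 17), ("Apr", "Pasta", "South", 16),
]


def data_cube(rows, n_dims):
    """Aggregate qty over every combination of kept and collapsed dimensions."""
    cube = defaultdict(int)
    for row in rows:
        dims, qty = row[:n_dims], row[n_dims]
        # Each dimension is either kept or replaced by ALL,
        # so one detailed row contributes to 2**n_dims cells.
        for mask in cartesian((False, True), repeat=n_dims):
            key = tuple(ALL if hide else d for d, hide in zip(dims, mask))
            cube[key] += qty
    return dict(cube)


cube = data_cube(rows, 3)
for key in sorted(cube):
    print(key, cube[key])
```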