introducing parallel data warehouse (the project formerly known as madison)

Thomas Kejser Senior Program Manager Microsoft Corp. Introducing Parallel Data Warehouse (The project formerly known as Madison)


Post on 24-Feb-2016



TRANSCRIPT

Page 1: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Thomas Kejser
Senior Program Manager
Microsoft Corp.

Introducing Parallel Data Warehouse (The project formerly known as Madison)

Page 2: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Agenda
• The Typical Problem with Data Warehouses
• MPP vs SMP
• SQL Server Parallel Data Warehouse
    • Hardware architecture
    • Query Processing
    • Data Loading

My email: [email protected]

Page 3: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Introducing Parallel Data Warehouse

The Typical Problem with Data Warehouses

Page 4: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Microsoft DW Solutions

SSRS, SSAS, SSIS

Microsoft & Partner Services

Page 5: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Symmetric Multi-Processing vs. Massively Parallel Processing

SMP
• HW advancements increasing ability to scale up
• But scaling limited by design
• High-end SMP very expensive
• Extremely high concurrency for simple workloads
• With less than 1-2 TB of data, SMP will almost always be better. At higher sizes, it depends.
• Workloads: OLTP, Transactional, Data Warehousing

MPP
• HW advancements increasing ability to scale out
• Scaling to 1 PB+
• Scale-out is relatively low cost
• Relatively high concurrency for complex workloads
• > 2 TB up to 1 PB for DW workloads
• Workloads: Data Warehousing (esp. VLDB, complex workloads)

Page 6: Introducing Parallel Data Warehouse (The project formerly known as Madison)

PDW: No Assembly Required
• Software
• Servers
• Storage arrays
• Network switches
• Cables
• Licenses
• Power distribution units
• Racks

Comes fully assembled
Software is installed at the factory
Fully configured

Page 7: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Basic Building Blocks
• Compute Nodes: Handle the CPU cycles required to answer queries
• Storage Nodes: Store data using fiber-attached disks. Scaled to support the CPUs with enough throughput
• Other nodes: More about those later

Page 8: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Anatomy of a Compute Node

Pre-configured For Each SQL Server Instance On Each Compute Node.

Drives Configured As RAID1 To Avoid Appliance Failover for a Single Drive Failure
• IBM Compute Nodes Will Have 1 LUN (1 RAID1 Pair)
• Dell Compute Nodes Will Have 2 LUNs (2 RAID1 Pairs)
• HP Compute Nodes Will Have 3 LUNs (3 RAID1 Pairs)

TempDB
• Sort-work Area For Data Loading Into Clustered Index Tables
• Work Area for PDW Temporary Work Files
• Spill Area For Hash Joins Not Fitting Into Memory

Page 9: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Anatomy of a Storage Node

Pre-configured
• 4 RAID10 Pairs for Primary User Data
• 1 RAID10 Pair for Database Logs
• 2 LUNs Are Spread Across Each RAID Pair

User Databases are Separate Physical SQL Server Databases
Staging Database (Optional) Used for Loading & to Minimize Fragmentation

Page 10: Introducing Parallel Data Warehouse (The project formerly known as Madison)

More Node Types

Backup node:
• Stores backup files from the appliance
• Can be logged into by authorized Windows users
• Can be augmented with 3rd party H/W and S/W

Landing Zone:
• Used as a holding place for data to be loaded
• Can be logged into by authorized Windows users
• Can be augmented with 3rd party H/W and S/W

Management node:
• Runs the Windows domain controller (Active Directory)
• Used for deploying patches to all nodes in the appliance
• Holds images in case a node needs reimaging

Page 11: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Putting It All Together - PDW

Control Node

Failover Protection:
• Redundant Control Node
• Redundant Compute Node
• Cluster Failover
• Redundant Array of Inexpensive Databases

Spare Node

Page 12: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Software Architecture

[Diagram. Recoverable labels:]

Control Node
• MPP Engine
• Data Movement Service
• IIS
• SQL Server: DW Authentication, DW Configuration, DW Schema, TempDB

Compute Nodes (several shown, each with)
• SQL Server
• Data Movement Service
• User Data

Landing Zone Node
• Data Movement Service

Client tools
• Query Tool
• Admin Console
• MS BI (AS, RS)
• Other 3rd Party Tools
• DWSQL
• Internet Explorer

Connectivity: OLEDB, ODBC, ADO.Net, JDBC

Page 13: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Create Database

CREATE DATABASE database_name
WITH (
    AUTOGROW = ON,
    REPLICATED_SIZE = 1024,
    DISTRIBUTED_SIZE = 16384,
    LOG_SIZE = 300
)
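The slide does not say how the three sizes relate to individual nodes. The sketch below assumes the reading given in the PDW product documentation: REPLICATED_SIZE is allocated on every compute node, while DISTRIBUTED_SIZE and LOG_SIZE are totals spread across the appliance, all in GB. The function name and the eight-node appliance are hypothetical and used only for the arithmetic.

```python
def per_node_allocation_gb(replicated_gb, distributed_gb, log_gb, compute_nodes):
    """Estimate space used on each compute node (assumed semantics, see lead-in)."""
    return {
        # Assumption: replicated tables are stored in full on every node.
        "replicated": replicated_gb,
        # Assumption: distributed data and log are split evenly across nodes.
        "distributed": distributed_gb / compute_nodes,
        "log": log_gb / compute_nodes,
    }


# Values from the CREATE DATABASE example, on a hypothetical 8-node appliance.
allocation = per_node_allocation_gb(1024, 16384, 300, compute_nodes=8)
```

Under these assumptions each node would hold 1024 GB for replicated tables plus one eighth of the distributed and log space.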

Page 14: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Distribution and Replication

Database Distributed & Replicated Tables

Date Dim (D): D_DATE_SK, D_DATE_ID, D_DATE, D_MONTH
Item (I): I_ITEM_SK, I_ITEM_ID, I_REC_START_DATE, I_ITEM_DESC
Store Sales (SS): Ss_sold_date_sk, Ss_item_sk, Ss_customer_sk, Ss_cdemo_sk, Ss_store_sk, Ss_promo_sk, Ss_quantity
Promotion (P): P_PROMO_SK, P_PROMO_ID, P_START_DATE_SK, P_END_DATE_SK
Store (S): S_STORE_SK, S_STORE_ID, S_REC_START_DATE, S_REC_END_DATE, S_STORE_NAME
Customer (C): C_CUSTOMER_SK, C_CUSTOMER_ID, C_CURRENT_ADDR
Customer Demographics (CD): CD_DEMO_SK, CD_GENDER, CD_MARITAL_STATUS, CD_EDUCATION

Data Distribution with Replication

[Diagram: six nodes. Each node is labeled C, I, D, CD, S, P (a copy of every dimension table) and holds one SS block (a slice of Store Sales).]
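The layout in the diagram can be sketched in a few lines of Python: the fact table is split across nodes by hashing a distribution key, and each dimension table is copied in full to every node. This is only an illustration. The CRC32 hash, the helper names, and the sample rows are assumptions, and the slides do not say which hash function PDW uses.

```python
import zlib

NUM_NODES = 6  # the diagram shows six nodes


def node_for_key(key, num_nodes=NUM_NODES):
    """Map a distribution key to a node (illustrative hash, not PDW's own)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, key_column, num_nodes=NUM_NODES):
    """Hash-distribute rows: each row lands on exactly one node."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for_key(row[key_column], num_nodes)].append(row)
    return nodes


def replicate(rows, num_nodes=NUM_NODES):
    """Replicate rows: every node gets a full copy."""
    return [list(rows) for _ in range(num_nodes)]


# Hypothetical sample data using column names from the schema above.
store_sales = [{"ss_item_sk": i % 50, "ss_quantity": 1} for i in range(1000)]
item = [{"i_item_sk": k, "i_item_desc": f"item {k}"} for k in range(50)]

sales_by_node = distribute(store_sales, "ss_item_sk")
item_by_node = replicate(item)
```

Because every node holds the whole Item table, a join between Store Sales and Item can be answered locally on each node without moving any rows.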

Page 15: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Table Creation

CREATE TABLE table_name
    [ ( { <column_definition> } [ ,...n ] ) ]
    [ AS SELECT select_criteria ]
    [ WITH ( <table_option> ) ] [;]

<column_definition> ::= column_name <data_type> [ NULL | NOT NULL ]

<data_type> ::= type_name [ ( precision [ , scale ] ) ]

<table_option> ::=
    {
        [ CLUSTER_ON ( column_name [ ,...n ] ) ]
      , [ DISTRIBUTE_ON ( column_name ) ] | [ REPLICATE ]
      , [ PARTITION_ON column_name ( RANGE { LEFT | RIGHT } FOR VALUES ( [ boundary_value [ ,...n ] ] ) ) ]
    }

Type Class        Types Supported
Integers          tinyint, smallint, int, bigint
Floating point    float, real
Character         char, varchar, nchar, nvarchar
Date & time       date, time, datetime, datetime2, datetimeoffset, timestamp, smalldatetime
Fixed point       decimal, money, smallmoney
Binary            binary, varbinary (8192)
Other             uniqueidentifier (?)

Page 16: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Create Table – Behind the Scenes

Create Table store_sales with
    distribute_on (ss_item_sk)
    partition_on (ss_sold_date_sk)
    cluster_on (ss_sold_date_sk)

[Diagram, from the outside in:]
• 8 Filegroups (one per core) - 1 Table per Filegroup
• 12 Partitions (ss_sold_date_sk)
• N-number of Pages (8K each)
• Row
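As a rough illustration of what this layout implies, the arithmetic below counts the physical SQL Server objects behind one logical table. It assumes the diagram describes a single compute node and that the 12 partitions apply to each per-filegroup table. The function name and the ten-node appliance are hypothetical.

```python
FILEGROUPS_PER_NODE = 8     # "8 Filegroups (one per core) - 1 Table per Filegroup"
PARTITIONS_PER_TABLE = 12   # "12 Partitions (ss_sold_date_sk)"


def physical_layout(compute_nodes):
    """Count physical tables and partitions behind one distributed table."""
    tables = compute_nodes * FILEGROUPS_PER_NODE
    return {
        "physical_tables": tables,
        "partitions": tables * PARTITIONS_PER_TABLE,
    }


# Hypothetical appliance with 10 compute nodes.
layout = physical_layout(10)
```

On that hypothetical appliance, one CREATE TABLE statement would fan out into 80 physical tables with 960 partitions in total.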

Page 17: Introducing Parallel Data Warehouse (The project formerly known as Madison)


Physical File Layout (Per Compute Node)

Page 18: Introducing Parallel Data Warehouse (The project formerly known as Madison)

MPP Query Processing

Control Node
• Query Rewritten Into Steps That Run Efficiently On Compute Nodes
• ODBC/JDBC
• SQL92 with Analytical Extensions
• Distribution-incompatible Joins Resolved Using High Speed Dynamic Re-distribution

Select location, year, sum(b.sales_amt)
from customer a, sales b
where b.sales > 500 and a.custid = b.custid
group by 2, 1
order by 1, 2

Page 19: Introducing Parallel Data Warehouse (The project formerly known as Madison)

MPP Execution Plans

The MPP engine creates parallel execution plans from client SQL.
The plans can include the following types of operations:
• SQL operations: used to pass SQL directly to SQL Server on 1 or more nodes.
• DMS operations: used to move data among the nodes in an appliance for further processing.
• Temp table operations: used to stage data for further processing.
• Return operations: push data back to the client.

Simple plans may include just one type of operation.
Complex plans may include all of these operations.
Plans are executed serially, one step at a time.
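One way to picture such a plan is as an ordered list of typed steps that a driver walks through one at a time. The sketch below is a toy model, not the MPP engine's real data structures: the `Op`, `Step`, and `run_plan` names are invented for illustration, and "running" a step only records it in a log.

```python
from dataclasses import dataclass
from enum import Enum


class Op(Enum):
    """The four operation types named on the slide."""
    SQL = "sql"                # pass SQL to SQL Server on one or more nodes
    DMS = "dms"                # move data among nodes
    TEMP_TABLE = "temp_table"  # stage data for further processing
    RETURN = "return"          # push data back to the client


@dataclass
class Step:
    op: Op
    description: str


def run_plan(steps):
    """Execute steps serially, in order; here a step is only logged."""
    log = []
    for number, step in enumerate(steps, start=1):
        log.append(f"Step {number}: {step.op.name} - {step.description}")
    return log


# A hypothetical complex plan that uses all four operation types.
plan = [
    Step(Op.TEMP_TABLE, "create staging table on control node"),
    Step(Op.SQL, "aggregate locally on each compute node"),
    Step(Op.DMS, "move partial results to the staging table"),
    Step(Op.RETURN, "send final rows to the client"),
]
log = run_plan(plan)
```

A simple plan would be the same list with a single entry.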

Page 20: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Example Schema

Date Dim: D_DATE_SK, D_DATE_ID, D_DATE, D_MONTH
Item: I_ITEM_SK, I_ITEM_ID, I_REC_START_DATE, I_ITEM_DESC
Store Sales: Ss_sold_date_sk, Ss_item_sk, Ss_customer_sk, Ss_cdemo_sk, Ss_store_sk, Ss_promo_sk, Ss_quantity
Promotion: P_PROMO_SK, P_PROMO_ID, P_START_DATE_SK, P_END_DATE_SK
Store: S_STORE_SK, S_STORE_ID, S_REC_START_DATE, S_REC_END_DATE, S_STORE_NAME
Customer: C_CUSTOMER_SK, C_CUSTOMER_ID, C_CURRENT_ADDR
Customer Demographics: CD_DEMO_SK, CD_GENDER, CD_MARITAL_STATUS, CD_EDUCATION

Data Distribution with Replication
Sales table distributed on customer... and partitioned by time

Page 21: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Distribution Compatible Query

SELECT CustomerId, SUM(Amount) AS TotalSales,
       SUM(Quantity) AS TotalUnitsSold
FROM Sales s
JOIN Item i ON s.ItemId = i.ItemId
WHERE SaleDate BETWEEN '2009-08-01' AND '2009-08-31'
  AND Description LIKE '%gadgets%'
GROUP BY CustomerId
ORDER BY CustomerId;

Page 22: Introducing Parallel Data Warehouse (The project formerly known as Madison)

MPP Query Plan

Step 1 – On each compute node:

SELECT s.[customerid], sum(s.[amount]) AS totalsales,
       sum(s.[quantity]) AS totalunitssold
FROM [tpch_3].[dbo].[h_sales_34] s
JOIN [tpch_3].[dbo].item_37 i ON (s.[itemid] = i.[itemid])
WHERE (s.[saledate] BETWEEN '2009-08-01' AND '2009-08-31'
       AND i.[description] LIKE '%gadgets%')
GROUP BY s.[customerid]
ORDER BY s.[customerid];
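The plan needs only one step because the example schema distributes Sales on customer and the query groups by the same column: every row for a given customer sits on one node, so each node's result is already final for its customers. A small Python simulation of that property follows. The modulo placement, helper names, and sample rows are assumptions made for the sketch.

```python
from collections import defaultdict


def local_aggregate(rows):
    """GROUP BY customer on one node: sum amount and quantity."""
    totals = defaultdict(lambda: [0, 0])
    for customer_id, amount, quantity in rows:
        totals[customer_id][0] += amount
        totals[customer_id][1] += quantity
    return {customer: tuple(sums) for customer, sums in totals.items()}


def run_on_nodes(rows, num_nodes):
    """Distribute on customer id, aggregate per node, then concatenate."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[row[0] % num_nodes].append(row)  # illustrative placement rule
    combined = {}
    for node_rows in nodes:
        result = local_aggregate(node_rows)
        # No customer appears on two nodes, so nothing needs re-aggregating.
        assert combined.keys().isdisjoint(result.keys())
        combined.update(result)
    return sorted(combined.items())


# Hypothetical rows: (customer_id, amount, quantity).
sales = [(i % 7, 10 + i, 1 + i % 3) for i in range(100)]
distributed_result = run_on_nodes(sales, num_nodes=4)
single_node_result = sorted(local_aggregate(sales).items())
```

Both results are identical, which is what lets the control node return the per-node output to the client without a second aggregation pass.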

Page 23: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Query 1 Processing Flow

[Diagram. Recoverable labels:]

Query Tool

Control Node
• MPP Engine: Parse SQL, Validate & Authorize, Build MPP Plan, Execute Plan, Return Data to Client
• Data Movement Service
• SQL Server: DW Authentication, DW Configuration, DW Schema, TempDB

Compute Node 1 ... Compute Node N (each with)
• SQL Server
• Data Movement Service
• User Data

Page 24: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Reshuffling the data

SELECT SaleDate, SUM(Amount) AS TotalSales,
       SUM(Quantity) AS TotalUnitsSold
FROM Sales s
JOIN Item i ON s.ItemId = i.ItemId
WHERE SaleDate BETWEEN '2009-08-01' AND '2009-08-31'
  AND Description LIKE '%gadgets%'
GROUP BY SaleDate
ORDER BY SaleDate;

Page 25: Introducing Parallel Data Warehouse (The project formerly known as Madison)

MPP Query Plan

Step 1 – Create temp table on control node:

CREATE TABLE [tempdb].[dbo].Q_[TEMP_ID_6760]
    ( saledate DATE, totalsales DECIMAL(38, 2), totalunitssold INTEGER )
WITH (DATA_COMPRESSION = PAGE);

Step 2 – Run on each compute node:

SELECT s.[saledate], sum(s.[amount]) AS totalsales,
       sum(s.[quantity]) AS totalunitssold
FROM [tpch_3].[dbo].[h_sales_34] s
JOIN [tpch_3].[dbo].item_37 i ON (s.[itemid] = i.[itemid])
WHERE (s.[saledate] BETWEEN '2009-08-01' AND '2009-08-31'
       AND i.[description] LIKE '%gadgets%')
GROUP BY s.[saledate]

Page 26: Introducing Parallel Data Warehouse (The project formerly known as Madison)

MPP Query Plan continued

Step 3:

SELECT [saledate], sum([totalsales]) AS totalsales,
       sum([totalunitssold]) AS totalunitssold
FROM [tempdb].[dbo].Q_[TEMP_ID_6760]
GROUP BY [saledate]
ORDER BY [saledate]

Step 4:

DROP TABLE [tempdb].[dbo].Q_[TEMP_ID_6760];
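Steps 1 to 4 amount to a two-phase aggregation: each compute node produces partial sums per sale date, the partial rows are collected in a temp table on the control node, and a final GROUP BY adds the partials together. The Python sketch below mimics that flow with a plain list standing in for the temp table. The placement rule, function names, and sample rows are assumptions.

```python
from collections import Counter


def partial_by_date(rows):
    """Step 2: per-node GROUP BY saledate, producing partial sums."""
    totals = Counter()
    for _customer_id, sale_date, amount in rows:
        totals[sale_date] += amount
    return totals


def two_phase_aggregate(rows, num_nodes):
    # Data is distributed on customer, so one date spans several nodes.
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[row[0] % num_nodes].append(row)

    temp_table = []  # Step 1: stand-in for the temp table on the control node
    for node_rows in nodes:
        temp_table.extend(partial_by_date(node_rows).items())

    final = Counter()  # Step 3: sum the partial sums per date
    for sale_date, partial_total in temp_table:
        final[sale_date] += partial_total

    temp_table.clear()  # Step 4: drop the temp table
    return sorted(final.items())


# Hypothetical rows: (customer_id, sale_date, amount).
sales = [(i, f"2009-08-{1 + i % 5:02d}", 10) for i in range(40)]
result = two_phase_aggregate(sales, num_nodes=4)
```

Summing partial sums works here because SUM can be combined across nodes; an aggregate such as a distinct count would need a different plan.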

Page 27: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Reshuffling – Query Processing Flow

[Diagram. Recoverable labels:]

Query Tool

Control Node
• MPP Engine: Parse SQL, Validate & Authorize, Build MPP Plan, Execute Plan, Return Data to Client
• Data Movement Service
• SQL Server: DW Authentication, DW Configuration, DW Schema, TempDB

Compute Nodes (each with)
• SQL Server
• Data Movement Service
• User Data

Page 28: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Data Loading

[Diagram. Recoverable labels: Control Node, Spare Node, Landing Zone Node holding four text files.]

Tables Are Hash Distributed Or Replicated

Page 29: Introducing Parallel Data Warehouse (The project formerly known as Madison)

Data Loader Process

Flow: Load File -> Bulk Insert -> Partitioned Staging Table (Heap) -> Insert-Select -> Partitioned Final Table (CIDX)
• Bulk Insert: Sort each BATCH in memory or TempDB
• Insert-Select: Sort each partition in memory or TempDB

Bulk Insert Phase
• Trace Flags: None
• BATCHSIZE: Calculated
• TABLOCK: ON
• TempDB: Entire BATCHSIZE for Sort
• TempDB Log: Minimal
• StageDB Log: Minimal
• ROLLBACK: Commits per BATCHSIZE, Rollback to last BATCH Only

Insert-Select Phase
• Trace Flags: 610 per NUMA Session
• MAXDOP: 1 Per NUMA Session
• TABLOCK: OFF
• TempDB: Entire PARTITION for sort
• TempDB Log: Minimal
• UserDB Log: Twice Data File Size
• ROLLBACK: Commits Full TRANSACTION, Rollback Full TRANSACTION
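The practical difference between the two ROLLBACK behaviours can be shown with a toy model: the bulk insert phase commits batch by batch, so a failure loses only the batch in flight, while the insert-select phase is a single transaction that either lands completely or not at all. The function names and the `fail_at` and `fail` switches are hypothetical, and Python lists stand in for the staging and final tables.

```python
def bulk_insert(rows, batch_size, fail_at=None):
    """Load rows into staging, committing after every batch.

    fail_at is the index of a row whose load fails (hypothetical switch).
    Returns (committed_rows, succeeded).
    """
    staged = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        if fail_at is not None and start <= fail_at < start + len(batch):
            # Roll back to the last committed batch only.
            return staged, False
        staged.extend(batch)  # commit this batch
    return staged, True


def insert_select(staged, fail=False):
    """Move staged rows to the final table in one transaction, sorted."""
    if fail:
        return [], False  # full rollback: nothing reaches the final table
    return sorted(staged), True


rows = list(range(10))
staged_after_failure, ok = bulk_insert(rows, batch_size=4, fail_at=5)
final_rows, final_ok = insert_select(bulk_insert(rows, batch_size=4)[0])
```

In the failing run, the first batch of four rows stays committed in staging and the second batch is lost, which matches the "Rollback to last BATCH Only" entry above.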

Page 30: Introducing Parallel Data Warehouse (The project formerly known as Madison)


© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.