Scaling Out SSIS with Parallelism: Diving Deep into the Dataflow Engine
DESCRIPTION
Scaling out Integration Services (SSIS) with parallelism, incorporating a deep dive into the dataflow engine with XPerf.
TRANSCRIPT
Scaling Out SSIS with
Parallelism
An independent SQL consultant
A user of SQL Server from version 2000 onwards, with 12+ years of experience
A DBA / developer hybrid
About Me . . .
Techniques for scaling out the data flow, and how well they scale
A look into the inner workings of the dataflow engine using XPerf
How ‘elastic’ scalability might be achieved
A wrap-up with some key ‘takeaway’ points
What Will Be Covered ?
There is no parallel ‘on’ switch. Parallelism has to be implemented by design, at:
Package level
In the execution flow
In the data flow, by hand and / or through:
Transforms that come with SSIS
Third-party components
Separating out synchronous transforms
Integration Services Parallelism 101
This flow helps determine:
1. Maximum data flow performance <= source extract speed. Does the source need to be parallelized?
2. The CPU and I/O profile of the source when no back pressure is taking place. Does this swamp the available hardware resources?
Integration Services Performance 101
Good parallel throughput requires:
An even distribution of work between child threads (data flows)
Hardware configured to be “hot spot free”
SQL Server and SSIS configured such that hardware resources are utilised evenly
In other words, the SSIS equivalent of bad CXPACKET waits is to be avoided.
Parallel Throughput 101
Four different ways of extracting data from the source will be looked at:
NTILE
DELETE statement with an OUTPUT clause
Hash partitioning the source table
SELECT statement to ‘partition’ the source by TransactionID
Parallel Source Extract
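As a hedged sketch of the DELETE-with-OUTPUT approach listed above (column names assumed from the Big Adventure schema, not taken from the original material), a destructive read hands deleted rows straight to the pipeline:

```sql
-- Illustrative destructive read: each data flow source repeatedly deletes
-- a batch from its slice of the table and streams the deleted rows out
-- via the OUTPUT clause, so the extract and the delete are one operation.
DELETE TOP (10000) bth
OUTPUT deleted.TransactionID,
       deleted.ProductID,
       deleted.TransactionDate,
       deleted.Quantity,
       deleted.ActualCost
FROM dbo.bigTransactionHistory AS bth;
```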
Using “WITH RESULT SETS” To Use Stored Procedures As The Source
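A minimal sketch of this technique, assuming a hypothetical procedure dbo.GetTransactionSlice: WITH RESULT SETS declares the shape of the procedure's output, so the SSIS source component can resolve its metadata at design time.

```sql
-- The procedure name and parameters here are illustrative assumptions;
-- the WITH RESULT SETS clause is the technique the slide refers to.
EXEC dbo.GetTransactionSlice @ThreadNumber = 1, @MaxThreads = 4
WITH RESULT SETS
(
    (
        TransactionID   int      NOT NULL,
        ProductID       int      NOT NULL,
        TransactionDate datetime NOT NULL,
        Quantity        int      NOT NULL,
        ActualCost      money    NOT NULL
    )
);
```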
SQL Server 2012 SP1
Windows Server 2008 R2
Adam Machanic’s “Big Adventure” database
Hardware:
Intel i960, 6 cores / 12 logical threads, 3.2 GHz
22 GB memory
2 x 80 GB Fusion-io (Gen 1) ioDrives
The “Lab” Environment
Demo 1: Scaling out the source extract
Scaling beyond three threads was initially hampered by PAGELATCH_EX, LCK_M_X, LCK_M_IX and SOS_SCHEDULER_YIELD waits.
The ‘winning’ approach:
Partition bigTransactionHistory evenly across twelve file groups, one per logical processor
Assign specific threads to specific partitions
Turn page and row locking off on the table, and set lock escalation to AUTO on the clustered primary key, in order to force partition-level locking
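The locking changes above can be sketched as follows; the index name is an assumption, not taken from the original material:

```sql
-- Disallow row and page locks on the clustered primary key so that
-- partition-level (HoBT) locks are taken instead.
ALTER INDEX pk_bigTransactionHistory ON dbo.bigTransactionHistory
    SET (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF);

-- On a partitioned table, AUTO lock escalation escalates to the
-- partition rather than the whole table, keeping the twelve data
-- flows out of each other's way.
ALTER TABLE dbo.bigTransactionHistory
    SET (LOCK_ESCALATION = AUTO);
```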
Destructive Read
Destructive Read Tuning For Four Data Flows
Test                                                        Execution Time (s)   CPU Consumption (%)   IO Throughput (MB/s)   % Improvement From Baseline
Baseline                                                    57                   40                    130                    -
Forced partition level locking                              33                   46                    215                    42
OLE DB provider for SQL used instead of SQL Native Client   28                   50                    240                    51
Packet size changed from 4K default to 8K                   22                   50                    275                    61
[Chart: Execution Time (s) per data flow (thread) count, comparing Destructive Read, Partition Scan, Range Scan and NTILE]
[Chart: Average percentage CPU consumption per data flow (thread) count, comparing Destructive Read, Partition Scan, Range Scan and NTILE]
[Chart: Wait event breakdown (percentage) for Destructive Read, Range Scan, Partition Scan and NTILE. Wait types shown: ASYNC_NETWORK_IO, PREEMPTIVE_OS_WAITFORSINGLEOBJECT, ASYNC_IO_COMPLETION, SOS_SCHEDULER_YIELD, WRITELOG, LOGBUFFER, PAGEIOLATCH_SH]
NTILE is clearly the slowest approach.
The range scan and partition scan can only be separated by CPU consumption.
Wait activity is dominated by ASYNC_NETWORK_IO and PREEMPTIVE_OS_WAITFORSINGLEOBJECT.
The source is outperforming the rest of the flow.
Conclusions From Scaling The Extract
Use a heap version of the bigTransactionHistory table, partitioned across twelve file groups on (TransactionID % 12) + 1.
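One way the (TransactionID % 12) + 1 scheme might be materialised is a persisted computed column used as the partitioning key; all object and file group names below are assumptions for illustration:

```sql
-- Twelve partitions: RANGE LEFT over boundary values 1..11 maps
-- PartitionKey values 1..12 onto partitions 1..12, one per file group.
CREATE PARTITION FUNCTION pf_mod12 (int)
    AS RANGE LEFT FOR VALUES (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11);

CREATE PARTITION SCHEME ps_mod12
    AS PARTITION pf_mod12
    TO (FG1, FG2, FG3, FG4, FG5, FG6, FG7, FG8, FG9, FG10, FG11, FG12);

-- Heap (no clustered index) partitioned on the computed modulo key;
-- the key must be PERSISTED to be usable as a partitioning column.
CREATE TABLE dbo.bigTransactionHistory_heap
(
    TransactionID   int      NOT NULL,
    ProductID       int      NOT NULL,
    TransactionDate datetime NOT NULL,
    Quantity        int      NOT NULL,
    ActualCost      money    NOT NULL,
    PartitionKey    AS (TransactionID % 12) + 1 PERSISTED
) ON ps_mod12 (PartitionKey);
```

Because consecutive TransactionIDs round-robin across the twelve partitions, each file group receives an even share of the rows.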
Compare the scalability of the balanced data distributor versus the conditional split.
Source is a single straight select from the bigTransactionHistory table.
Scaling Out Destination
Synchronous: non-blocking, rows in = rows out
Asynchronous: rows out usually <> rows in; semi-blocking or blocking; “magic” virtual buffer ;-)
A Recap On Transforms
Demo 2: Scaling out the destination
Conditional Split Vs The Balanced Data Distributor
Results on next slide
[Chart: Execution Time (s) per output count, comparing the Balanced Data Distributor and the Conditional Split]
Saturation point, time to scale out
[Chart: IO Throughput (MB/s) per output thread count, comparing the Balanced Data Distributor and the Conditional Split]
The two Fusion-io cards are capable of more throughput than appears on any of the graphs in this material. What is presented is sustained throughput; during the actual tests, ‘spikes’ of much higher throughput were observed during checkpoints.
[Chart: Average CPU consumption (%) per thread count, comparing the Balanced Data Distributor and the Conditional Split]
A transform-level view of CPU consumption can be obtained via XPerf, as per the next slide . . .
TxBDD.dll weight = 79,997,966
TxSplit.dll weight = 13,004,998.777
Too few threads = CPU starvation
Too many threads = context switching
The “sweet spot” is somewhere in between \O/
Elements in the dataflow that can create new threads:
Execution paths
Conditional splits, multicasts and the balanced data distributor, which create threads for their outputs
Synchronous transforms
Threading
A section of the dataflow starting with an asynchronous component and ending with a transform or destination that has no synchronous output.
. . . as the next slide will help illustrate.
Execution Paths, What Are They ?
Execution Path 1
EXECUTION PATH
Execution Path 2
Demo 4: Scaling out by splitting synchronous transforms up
[Chart: Execution time (s) per thread count, comparing Union and Pass Through]
[Chart: CPU consumption per data flow (thread) count, comparing Union and Pass Through]
[Chart: IO throughput (MB/s) per data flow (thread) count, comparing Union and Pass Through]
One execution path = 37,039 context switches
Two execution paths = 69,986 context switches
Most of the demos so far have achieved data flow scale-out via “copy and paste”.
Service Broker is highly elastic: the number of readers associated with a queue can be increased via the ALTER QUEUE command.
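For reference, the Service Broker mechanism being alluded to is a single DDL statement; the queue name here is illustrative:

```sql
-- Scale the number of concurrent activation readers on a queue up or
-- down without touching the consumers themselves.
ALTER QUEUE dbo.WorkQueue
    WITH ACTIVATION (MAX_QUEUE_READERS = 8);
```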
SSIS has no “out of the box” equivalent to this. However, the work pile pattern can be adapted in order to achieve ‘elastic’-style scale-out, as the next slide will illustrate.
‘Elastic’ Scale Out
Package 1
Package N
“WORK PILE”
Package 2
DTExec . . . /Set \Package.Variables[MaxThreads].Value;3 /Set \Package.Variables[ThreadNumber].Value;1
DTExec . . . /Set \Package.Variables[MaxThreads].Value;3 /Set \Package.Variables[ThreadNumber].Value;2
DTExec . . . /Set \Package.Variables[MaxThreads].Value;3 /Set \Package.Variables[ThreadNumber].Value;3
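The shared “work pile” itself can take several forms; one common sketch (table and column names assumed, not taken from the original material) has each package instance atomically claim work items, so adding more DTExec instances scales consumption without repartitioning the work up front:

```sql
-- Each running package repeatedly claims one unclaimed work item.
-- READPAST lets concurrent packages skip rows already locked by
-- another claimer, and OUTPUT returns the claimed item's key.
UPDATE TOP (1) wp
SET    ClaimedBy = @@SPID
OUTPUT inserted.WorkItemID
FROM   dbo.WorkPile AS wp WITH (ROWLOCK, UPDLOCK, READPAST)
WHERE  wp.ClaimedBy IS NULL;
```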
SSIS Server 1
SSIS Server 2
SSIS Server N
SSIS “Server Farm”
With dedicated server hardware for both SSIS and SQL Server, how does resource utilisation vary on each as the various scale-out-via-parallelisation techniques are used?
How does SSIS perform with hyper-threading turned on and off?
L2/L3 cache is touted as the “new flash memory”:
How does the “performance curve” behave in relation to L2/L3 misses?
What can be done to influence L2/L3 cache misses?
Areas For Future Investigation
The performance and scalability of extracting from the source is paramount; the only wait events you want to see are ASYNC_NETWORK_IO and PREEMPTIVE_OS_WAITFORSINGLEOBJECT.
When deleting from partitions (and inserting into them), significant performance gains can be had by forcing partition-level locking.
Packages with fewer execution paths will tend to incur fewer context switches and scale better.
Seek out opportunities to scale out synchronous transforms by splitting them up as much as possible.
Look to leverage the work pile pattern for ‘Elastic’ scale out.
Takeaways
Integration Services: Performance Tuning Techniques, by Elizabeth Vitt (Intellimentum and Hitachi Corporation)
SQL Server Integration Services Performance Design Patterns, by Matt Masson (Senior Program Manager, Microsoft)
Increasing Throughput of Pipelines by Splitting Synchronous Transformations into Multiple Tasks, by Sedat Yogurtcuoglu, Henk van der Valk and Thomas Kejser
Resources for SSIS Performance Best Practices, by Matt Masson and others
References and Material For Further Reading
Questions ?
http://uk.linkedin.com/in/wollatondba
Contact Details
ChrisAdkin8
Coming up…
#SQLBITS
Speaker Title Room
Jan Pieter Posthuma ETL with Hadoop and MapReduce Theatre
Phil Quinn XML: The Marmite of SQL Server Exhibition B
Laerte Junior The Posh DBA: Troubleshooting SQL Server with PowerShell Suite 3
James Skipwith Table-Based Database Object Factories Suite 1
Neil Hambly SQL Server 2012 Memory Management Suite 2
Matija Lah SQL Server 2012 Statistical Semantic Search Suite 4