Scaling Out SSIS with Parallelism: Diving Deep into the Dataflow Engine
DESCRIPTION
Scaling out Integration Services (SSIS) with parallelism, incorporating a deep dive into the dataflow engine with XPerf.
TRANSCRIPT
Scaling Out SSIS with
Parallelism
An independent SQL consultant
A user of SQL Server from version 2000 onwards, with 12+ years of experience
A DBA / developer hybrid
About Me . . .
Techniques for scaling out the data flow, and how well they scale
A look into the inner workings of the dataflow engine using XPerf
How ‘elastic’ scalability might be achieved
A wrap-up with some key ‘takeaway’ points
What Will Be Covered ?
There is no parallel ‘on’ switch. Parallelism has to be implemented by design, at:
Package level
In the execution flow
In the data flow, by hand and / or through:
Transforms that come with SSIS
Third-party components
Separating out synchronous transforms
Integration Services Parallelism 101
This flow helps determine:
1. Maximum data flow performance <= source extract speed. Does the source need to be parallelized?
2. The CPU and I/O profile of the source when no back pressure is taking place. Does this swamp the available hardware resources?
Integration Services Performance 101
Good parallel throughput requires:
An even distribution of work between child threads (data flows)
Hardware configured to be “hot spot free”
SQL Server and SSIS configured such that hardware resources are utilised evenly
In other words, the SSIS equivalent of bad CXPACKET waits is to be avoided.
Parallel Throughput 101
Four different ways of extracting data from the source will be looked at:
NTILE
DELETE statement with an OUTPUT clause
Hash partitioning the source table
SELECT statement to ‘partition’ the source by TransactionID
Parallel Source Extract
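As a hedged sketch of the DELETE-with-OUTPUT approach listed above (column names assumed from the Big Adventure schema, not taken from the original material), a destructive read hands deleted rows straight to the pipeline:

```sql
-- Illustrative destructive read: each data flow source repeatedly deletes
-- a batch from its slice of the table and streams the deleted rows out
-- via the OUTPUT clause, so the extract and the delete are one operation.
DELETE TOP (10000) bth
OUTPUT deleted.TransactionID,
       deleted.ProductID,
       deleted.TransactionDate,
       deleted.Quantity,
       deleted.ActualCost
FROM dbo.bigTransactionHistory AS bth;
```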
Using “WITH RESULT SETS” To Use Stored Procedures As The Source
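A minimal sketch of this technique, assuming a hypothetical procedure dbo.GetTransactionSlice: WITH RESULT SETS declares the shape of the procedure's output, so the SSIS source component can resolve its metadata at design time.

```sql
-- The procedure name and parameters here are illustrative assumptions;
-- the WITH RESULT SETS clause is the technique the slide refers to.
EXEC dbo.GetTransactionSlice @ThreadNumber = 1, @MaxThreads = 4
WITH RESULT SETS
(
    (
        TransactionID   int      NOT NULL,
        ProductID       int      NOT NULL,
        TransactionDate datetime NOT NULL,
        Quantity        int      NOT NULL,
        ActualCost      money    NOT NULL
    )
);
```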
SQL Server 2012 SP1
Windows Server 2008 R2
Adam Machanic’s “Big Adventure” database
Hardware:
Intel i960, 6 cores / 12 logical threads, 3.2 GHz
22 GB memory
2 x 80 GB Fusion-io (Gen 1) ioDrives
The “Lab” Environment
Demo 1: Scaling out the source extract
Scaling beyond three threads was initially hampered by PAGELATCH_EX, LCK_M_X, LCK_M_IX and SOS_SCHEDULER_YIELD waits.
The ‘winning’ approach:
Partition bigTransactionHistory evenly across twelve file groups, one per logical processor
Assign specific threads to specific partitions
Turn page and row locking off on the table, and set lock escalation to AUTO on the clustered primary key, in order to force partition-level locking
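The locking changes above can be sketched as follows; the index name is an assumption, not taken from the original material:

```sql
-- Disallow row and page locks on the clustered primary key so that
-- partition-level (HoBT) locks are taken instead.
ALTER INDEX pk_bigTransactionHistory ON dbo.bigTransactionHistory
    SET (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF);

-- On a partitioned table, AUTO lock escalation escalates to the
-- partition rather than the whole table, keeping the twelve data
-- flows out of each other's way.
ALTER TABLE dbo.bigTransactionHistory
    SET (LOCK_ESCALATION = AUTO);
```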
Destructive Read
Destructive Read Tuning For Four Data Flows
Test                                                        Execution Time (s)   CPU Consumption (%)   IO Throughput (MB/s)   % Improvement From Baseline
Baseline                                                    57                   40                    130                    -
Forced partition level locking                              33                   46                    215                    42
OLE DB provider for SQL used instead of SQL Native Client   28                   50                    240                    51
Packet size changed from 4K default to 8K                   22                   50                    275                    61
[Chart: Execution Time (s) per data flow (thread) count, comparing Destructive Read, Partition Scan, Range Scan and NTILE]
[Chart: Average percentage CPU consumption per data flow (thread) count, comparing Destructive Read, Partition Scan, Range Scan and NTILE]
[Chart: Wait event breakdown (percentage) for Destructive Read, Range Scan, Partition Scan and NTILE. Wait types shown: ASYNC_NETWORK_IO, PREEMPTIVE_OS_WAITFORSINGLEOBJECT, ASYNC_IO_COMPLETION, SOS_SCHEDULER_YIELD, WRITELOG, LOGBUFFER, PAGEIOLATCH_SH]
NTILE is clearly the slowest approach.
The range scan and partition scan can only be separated by CPU consumption.
Wait activity is dominated by ASYNC_NETWORK_IO and PREEMPTIVE_OS_WAITFORSINGLEOBJECT.
The source is outperforming the rest of the flow.
Conclusions From Scaling The Extract
Use a heap version of the bigTransactionHistory table, partitioned across twelve file groups on (TransactionID % 12) + 1.
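One way the (TransactionID % 12) + 1 scheme might be materialised is a persisted computed column used as the partitioning key; all object and file group names below are assumptions for illustration:

```sql
-- Twelve partitions: RANGE LEFT over boundary values 1..11 maps
-- PartitionKey values 1..12 onto partitions 1..12, one per file group.
CREATE PARTITION FUNCTION pf_mod12 (int)
    AS RANGE LEFT FOR VALUES (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11);

CREATE PARTITION SCHEME ps_mod12
    AS PARTITION pf_mod12
    TO (FG1, FG2, FG3, FG4, FG5, FG6, FG7, FG8, FG9, FG10, FG11, FG12);

-- Heap (no clustered index) partitioned on the computed modulo key;
-- the key must be PERSISTED to be usable as a partitioning column.
CREATE TABLE dbo.bigTransactionHistory_heap
(
    TransactionID   int      NOT NULL,
    ProductID       int      NOT NULL,
    TransactionDate datetime NOT NULL,
    Quantity        int      NOT NULL,
    ActualCost      money    NOT NULL,
    PartitionKey    AS (TransactionID % 12) + 1 PERSISTED
) ON ps_mod12 (PartitionKey);
```

Because consecutive TransactionIDs round-robin across the twelve partitions, each file group receives an even share of the rows.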
Compare the scalability of the balanced data distributor versus the conditional split.
Source is a single straight select from the bigTransactionHistory table.
Scaling Out Destination
Synchronous: non-blocking, rows in = rows out
Asynchronous: rows out usually <> rows in; semi-blocking or blocking; “magic” virtual buffer ;-)
A Recap On Transforms
Demo 2: Scaling out the destination
Conditional Split Vs The Balanced Data Distributor
Results on next slide
[Chart: Execution Time (s) per output count, comparing the Balanced Data Distributor and the Conditional Split]
Saturation point, time to scale out
[Chart: IO Throughput (MB/s) per output thread count, comparing the Balanced Data Distributor and the Conditional Split]
The two Fusion-io cards are capable of more throughput than appears on any of the graphs in this material. What is presented is sustained throughput; during the actual tests, ‘spikes’ of much higher throughput were observed during checkpoints.
[Chart: Average CPU consumption (%) per thread count, comparing the Balanced Data Distributor and the Conditional Split]
A transform-level view of CPU consumption can be obtained via XPerf, as per the next slide . . .
TxBDD.dll weight = 79,997,966
TxSplit.dll weight = 13,004,998.777
Too few threads = CPU starvation
Too many threads = context switching
The “sweet spot” is somewhere in between \O/
Elements in the dataflow that can create new threads:
Execution paths
Conditional splits, multicasts and the balanced data distributor, which create threads for their outputs
Synchronous transforms
Threading
A section of the dataflow starting with an asynchronous component and ending with a transform or destination that has no synchronous output.
. . . as the next slide will help illustrate.
Execution Paths, What Are They ?
Execution Path 1
EXECUTION PATH
Execution Path 2
Demo 4: Scaling out by splitting synchronous transforms up
[Chart: Execution time (s) per thread count, comparing Union and Pass Through]
[Chart: CPU consumption per data flow (thread) count, comparing Union and Pass Through]
[Chart: IO throughput (MB/s) per data flow (thread) count, comparing Union and Pass Through]
One execution path = 37,039 context switches
Two execution paths = 69,986 context switches
Most of the demos so far have achieved data flow scale-out via “copy and paste”.
Service Broker is highly elastic: the number of readers associated with a queue can be increased via the ALTER QUEUE command.
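For reference, the Service Broker mechanism being alluded to is a single DDL statement; the queue name here is illustrative:

```sql
-- Scale the number of concurrent activation readers on a queue up or
-- down without touching the consumers themselves.
ALTER QUEUE dbo.WorkQueue
    WITH ACTIVATION (MAX_QUEUE_READERS = 8);
```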
SSIS has no “out of the box” equivalent to this. However, the work pile pattern can be adapted in order to achieve ‘elastic’-style scale-out, as the next slide will illustrate.
‘Elastic’ Scale Out
Package 1
Package N
“WORK PILE”
Package 2
DTExec . . . /Set \Package.Variables[MaxThreads].Value;3 /Set \Package.Variables[ThreadNumber].Value;1
DTExec . . . /Set \Package.Variables[MaxThreads].Value;3 /Set \Package.Variables[ThreadNumber].Value;2
DTExec . . . /Set \Package.Variables[MaxThreads].Value;3 /Set \Package.Variables[ThreadNumber].Value;3
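The shared “work pile” itself can take several forms; one common sketch (table and column names assumed, not taken from the original material) has each package instance atomically claim work items, so adding more DTExec instances scales consumption without repartitioning the work up front:

```sql
-- Each running package repeatedly claims one unclaimed work item.
-- READPAST lets concurrent packages skip rows already locked by
-- another claimer, and OUTPUT returns the claimed item's key.
UPDATE TOP (1) wp
SET    ClaimedBy = @@SPID
OUTPUT inserted.WorkItemID
FROM   dbo.WorkPile AS wp WITH (ROWLOCK, UPDLOCK, READPAST)
WHERE  wp.ClaimedBy IS NULL;
```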
SSIS Server 1
SSIS Server 2
SSIS Server N
SSIS “Server Farm”
With dedicated server hardware for both SSIS and SQL Server, how does resource utilisation vary on each as the various scale-out-via-parallelisation techniques are used?
How does SSIS perform with hyper-threading turned on and off?
L2/L3 cache is touted as the “new flash memory”:
How does the “performance curve” behave in relation to L2/L3 misses?
What can be done to influence L2/L3 cache misses?
Areas For Future Investigation
The performance and scalability of extracting from the source is paramount; the only wait events you want to see are ASYNC_NETWORK_IO and PREEMPTIVE_OS_WAITFORSINGLEOBJECT.
When deleting from partitions (and inserting into them), significant performance gains can be had by forcing partition-level locking.
Packages with fewer execution paths will tend to incur fewer context switches and scale better.
Seek out opportunities to scale out synchronous transforms by splitting them up as much as possible.
Look to leverage the work pile pattern for ‘Elastic’ scale out.
Takeaways
Integration Services: Performance Tuning Techniques, by Elizabeth Vitt (Intellimentum and Hitachi Corporation)
SQL Server Integration Services Performance Design Patterns, by Matt Masson (Senior Program Manager, Microsoft)
Increasing Throughput of Pipelines by Splitting Synchronous Transformations into Multiple Tasks, by Sedat Yogurtcuoglu, Henk van der Valk and Thomas Kejser
Resources for SSIS Performance Best Practices, by Matt Masson and others
References and Material For Further Reading
Questions ?
http://uk.linkedin.com/in/wollatondba
Contact Details
ChrisAdkin8
Coming up…
#SQLBITS
Speaker Title Room
Jan Pieter Posthuma ETL with Hadoop and MapReduce Theatre
Phil Quinn XML: The Marmite of SQL Server Exhibition B
Laerte Junior The Posh DBA: Troubleshooting SQL Server with PowerShell Suite 3
James Skipwith Table-Based Database Object Factories Suite 1
Neil Hambly SQL Server 2012 Memory Management Suite 2
Matija Lah SQL Server 2012 Statistical Semantic Search Suite 4