distributed query deep dive conor cunningham

DISTRIBUTED QUERY Conor Cunningham Principal Architect SQL Server Engineering Team

Upload: radityo-prasetianto-wibowo

Post on 12-Apr-2017

TRANSCRIPT

Page 1: Distributed query deep dive   conor cunningham

DISTRIBUTED QUERY
Conor Cunningham
Principal Architect
SQL Server Engineering Team

Page 2: Distributed query deep dive   conor cunningham

WHO AM I?

• I work on the SQL Server Core Engine
• Specialize in Query Processing/Optimization
• 14+ years at Microsoft
• 3rd year speaking at SQLBits – 3 talks this year
• I love hearing about how you use the product
• I take that back to the Engineering team so we can work on the next versions of SQL Server/Azure

Page 3: Distributed query deep dive   conor cunningham

TALK AGENDA

• Problem Statement
• (Quick) Summary of SQL Server’s Optimizer
• DQ Optimization Approach
• Under-the-hood Examples
• Distributed Partitioned Views
• Common Troubleshooting Techniques

Page 4: Distributed query deep dive   conor cunningham

PROBLEM STATEMENT

• Data Living on Different Servers?
• Data Living on non-SQL Server?
• Need to Manage Many Servers?
• Want to move data from one server to another without dealing with SSIS?
• …
• There are many reasons to use Distributed Query – it fills many holes

Page 5: Distributed query deep dive   conor cunningham

OPTIMIZER OVERVIEW

• I gave a SQLBits talk on this 2 years ago
• You can watch that talk on sqlbits.com
• Key Concepts in the Optimizer:
  • Operators shaped into trees
  • Trees and Sub-Trees have Properties
  • Rules transform sub-trees into new sub-trees
  • Equivalent sub-trees get stored in a management structure called the “Memo”
  • The sequence of rules and heuristics is applied to try to generate good query plans efficiently

Page 6: Distributed query deep dive   conor cunningham

DQ OPTIMIZATION GOAL

• DQ tries to make remote tables appear to be local (so you don’t care that they are remote)

Server 1 (Local):
SELECT SUM(col1), col2 FROM <remotetbl> WHERE col3 > 10000 GROUP BY col2

Server 2 (Remote):
SELECT SUM(col1), col2 FROM localtbl WHERE col3 > 10000 GROUP BY col2

Expectation: Push operations to remote server
• That works for the basic cases
• What about more complex cases?
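
To make the "push operations to the remote server" expectation concrete, here is a minimal Python sketch. It is not how SQL Server implements remoting: an in-memory SQLite database stands in for Server 2, and the table contents and the helper names `pull_everything` and `push_to_remote` are assumptions made up for illustration. It only counts how many rows have to travel under each strategy.

```python
import sqlite3

# Stand-in for "Server 2 (Remote)": a separate in-memory SQLite database.
# The data below is invented purely for the illustration.
remote = sqlite3.connect(":memory:")
remote.execute("CREATE TABLE localtbl (col1 INTEGER, col2 TEXT, col3 INTEGER)")
remote.executemany(
    "INSERT INTO localtbl VALUES (?, ?, ?)",
    [(i, "even" if i % 2 == 0 else "odd", i * 100) for i in range(1, 1001)],
)


def pull_everything(conn):
    """Naive plan: ship every row, then filter and aggregate locally."""
    rows = conn.execute("SELECT col1, col2, col3 FROM localtbl").fetchall()
    sums = {}
    for col1, col2, col3 in rows:
        if col3 > 10000:
            sums[col2] = sums.get(col2, 0) + col1
    return sums, len(rows)


def push_to_remote(conn):
    """Remoted plan: the remote side filters and aggregates; only groups travel."""
    rows = conn.execute(
        "SELECT SUM(col1), col2 FROM localtbl WHERE col3 > 10000 GROUP BY col2"
    ).fetchall()
    return {col2: total for total, col2 in rows}, len(rows)


naive_result, naive_shipped = pull_everything(remote)
pushed_result, pushed_shipped = push_to_remote(remote)
print(naive_shipped, pushed_shipped)
```

Both strategies return the same sums; the difference is only in the number of rows shipped, which is the quantity a distributed plan tries to keep small.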

Page 7: Distributed query deep dive   conor cunningham

NEXT EXAMPLE – SHOULD IT REMOTE?

• Let’s try a cross product:

SELECT * FROM <remotetbl> as t1, <remotetbl> as t2

Server 1 (Local) Server 2 (Remote)

Should it remote?

Page 8: Distributed query deep dive   conor cunningham

HOW ABOUT THIS ONE?

• Join Small Local Table to Large Remote Table

SELECT * FROM smalllocal as L JOIN <bigremote> as R ON L.col1=R.col1

Server 1 (Local)

• Pulling a big table over the network is expensive
• It would be great if we could get that join condition to remote…
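
One way to picture "getting the join condition to remote" is a lookup per local row, sent to the remote side as a parameterized query. The Python sketch below is an illustration under assumptions: two in-memory SQLite databases play the two servers, and the table contents, the index name, and both helper functions are hypothetical. It compares rows shipped when the whole remote table is pulled against rows shipped when only matching rows are requested.

```python
import sqlite3

# Hypothetical stand-ins: two in-memory SQLite databases play the two servers.
local = sqlite3.connect(":memory:")
remote = sqlite3.connect(":memory:")

local.execute("CREATE TABLE smalllocal (col1 INTEGER, name TEXT)")
local.executemany("INSERT INTO smalllocal VALUES (?, ?)", [(7, "a"), (42, "b")])

remote.execute("CREATE TABLE bigremote (col1 INTEGER, payload TEXT)")
remote.executemany(
    "INSERT INTO bigremote VALUES (?, ?)",
    [(i, f"row{i}") for i in range(10000)],
)
remote.execute("CREATE INDEX ix_bigremote_col1 ON bigremote (col1)")

LOCAL_SCAN = "SELECT col1, name FROM smalllocal ORDER BY col1"


def join_by_pulling():
    """Ship the whole remote table, then join locally."""
    remote_rows = remote.execute("SELECT col1, payload FROM bigremote").fetchall()
    by_key = {}
    for col1, payload in remote_rows:
        by_key.setdefault(col1, []).append(payload)
    joined = [
        (col1, name, payload)
        for col1, name in local.execute(LOCAL_SCAN)
        for payload in by_key.get(col1, [])
    ]
    return joined, len(remote_rows)


def join_by_remote_lookup():
    """Send the join condition to the remote side as a parameterized query."""
    joined, shipped = [], 0
    for col1, name in local.execute(LOCAL_SCAN):
        matches = remote.execute(
            "SELECT payload FROM bigremote WHERE col1 = ?", (col1,)
        ).fetchall()
        shipped += len(matches)
        joined.extend((col1, name, payload) for (payload,) in matches)
    return joined, shipped


pulled, pulled_shipped = join_by_pulling()
looked_up, lookup_shipped = join_by_remote_lookup()
print(pulled_shipped, lookup_shipped)
```

The lookup variant pays one remote call per local row, so it only wins when the local side is small, which is exactly the situation on this slide.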

Page 9: Distributed query deep dive   conor cunningham

DQ OPTIMIZATION DIFFERENCES

• Data is remote, expensive to move (network)
• Often the desired behavior is pretty basic – remote if you can do so
• Sweet spots for several optimizations change
  • We force several optimizations we use only for “expensive” local queries (example: pushing local group by to the remote source)

Page 10: Distributed query deep dive   conor cunningham

ONE LAYER DEEPER…

• SQL Server’s QP acts like a SQL Server client
• Based on OLEDB
  • It can talk to most OLEDB providers, not just SQL Server
  • So you can pull data from Oracle or DB2 or Excel or Text Files or even write your own provider
• Each phase of query compilation and execution is overridden to use remote data instead

Page 11: Distributed query deep dive   conor cunningham

QUERY BINDING

• We load metadata from OLEDB schema rowsets instead of our own system tables
  • DBSCHEMA_TABLES, _COLUMNS, _INDEXES, …
• Metadata is cached locally to avoid round trips
• OLEDB Types converted to closest SQL type
  • Lossy conversions possible for non-SQL Server
• We ask for the output schema for views and sprocs by compiling them on the remote side
  • If we do, we try to cache this connection for execution

Page 12: Distributed query deep dive   conor cunningham

OPTIMIZATION

• General Goal: Remote Large Subtrees
• We do use statistics, indexes, and some constraint information from remote sources
• We can work against SQL providers, Index providers, or simple table providers
• We start with a “get all data from remote source” plan and try to find better plans
• Startup and per-row costs for remote sources are expensive
• We also tweak lots and lots of rules to run differently for DQ (no trivial plan, different join reordering, aggressive local-global agg pushdown)
• Finally, we generate lots of subtrees that remote and pick the “cheapest” one per our cost model
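
The "start with get-all-data, generate alternatives, pick the cheapest" idea can be sketched with a toy cost model. Everything in the Python below is an assumption for illustration: the `Candidate` fields, the three candidate plans, and the cost constants are invented and are not SQL Server's real cost formulas. The only property carried over from the slide is that remote startup and per-row costs are much higher than local ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    remote_calls: int  # how many times a remote source is opened
    remote_rows: int   # rows shipped over the network
    local_rows: int    # rows processed locally after they arrive


# Hypothetical cost constants, chosen only so that remote work is much
# more expensive than local work; they are not SQL Server's real numbers.
REMOTE_STARTUP = 500.0
REMOTE_PER_ROW = 1.0
LOCAL_PER_ROW = 0.01


def cost(c: Candidate) -> float:
    """Startup cost per remote call plus per-row costs on both sides."""
    return (
        c.remote_calls * REMOTE_STARTUP
        + c.remote_rows * REMOTE_PER_ROW
        + c.local_rows * LOCAL_PER_ROW
    )


# The first entry is the baseline "get all data from remote source" plan;
# the others remote progressively larger subtrees.
candidates = [
    Candidate("get all data, do everything locally", 2, 1_000_000, 1_000_000),
    Candidate("remote the filter only", 2, 50_000, 50_000),
    Candidate("remote the join and the group by", 1, 200, 200),
]

best = min(candidates, key=cost)
print(best.name, cost(best))
```

With these made-up numbers the plan that remotes the largest subtree wins, but a different mix of startup and per-row costs could flip the choice, which is why close cost estimates can make a plan "stop remoting".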

Page 13: Distributed query deep dive   conor cunningham

OPTIMIZATION SEARCH

[Figure: alternative plan trees considered during the optimization search. The trees combine group-by operators (GB(b,c) SUM(C.d), GB(a,c) SUM(C.d), GB(c) SUM(C.d)) and Joins over the remote tables RmtA and RmtC and the table B, arranged in different shapes.]

Page 14: Distributed query deep dive   conor cunningham

EXECUTION

• Mostly similar to regular OLEDB clients
  • Open DB, SetCommandText, Execute, Read Rows
• Some parts are more unique
  • Compile and Execute are 2 separate steps
  • We have to validate the plan is still valid
  • So we compare the schema compile vs. execute (and recompile if needed)
    o We find many provider bugs nobody else does here
• Note: we can remote lock hints in remote queries
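
The compile-versus-execute schema check can be sketched as follows. This is a simplified illustration, not the engine's actual logic: the plan is a plain dictionary, the schema is a list of (name, type) pairs, and `compile_plan` and `execute` are hypothetical helpers.

```python
def compile_plan(schema):
    """Pretend compilation: remember the schema the plan was built against."""
    columns = ", ".join(name for name, _type in schema)
    return {"schema": tuple(schema), "text": f"SELECT {columns} FROM t"}


def execute(plan, current_schema):
    """Validate the cached plan against the schema seen at execution time."""
    recompiled = False
    if tuple(current_schema) != plan["schema"]:
        # The remote object changed since compilation: build a fresh plan.
        plan = compile_plan(current_schema)
        recompiled = True
    return plan, recompiled


schema_at_compile = [("id", "int"), ("name", "varchar")]
plan = compile_plan(schema_at_compile)

# Same schema at execute time: the cached plan is reused.
plan, recompiled_same = execute(plan, schema_at_compile)

# The remote table gained a column between compile and execute.
schema_at_execute = schema_at_compile + [("created", "datetime")]
plan, recompiled_changed = execute(plan, schema_at_execute)
print(recompiled_same, recompiled_changed)
```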

Page 15: Distributed query deep dive   conor cunningham

READING DQ EXECUTION PLANS

1. ICommand::Execute/IOpenRowset opens each scan initially
2. We retrieve rows in batches (50-100) when possible
3. Each new NLJ scan of inner side calls IRowset::RestartPosition
4. We stop reading when we have satisfied the query requirements (only do complete scans when necessary)

[Figure: execution plan annotated with callouts 1 to 4, matching the steps above.]
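
The batch-fetch, restart, and early-stop behavior in the steps above can be mimicked with a small Python class. This is a toy model, not the OLEDB interfaces themselves: `RemoteScan`, its counters, and the batch size of 100 are assumptions, and `restart` only plays the role that IRowset::RestartPosition plays in the real engine.

```python
from itertools import islice


class RemoteScan:
    """Toy stand-in for a remote rowset; counts how many rows were shipped."""

    def __init__(self, rows, batch_size=100):
        self._rows = list(rows)
        self._batch_size = batch_size
        self._pos = 0
        self.rows_shipped = 0
        self.restarts = 0

    def restart(self):
        # Plays the role of IRowset::RestartPosition: rewind without reopening.
        self._pos = 0
        self.restarts += 1

    def next_batch(self):
        batch = self._rows[self._pos:self._pos + self._batch_size]
        self._pos += len(batch)
        self.rows_shipped += len(batch)
        return batch

    def __iter__(self):
        # Rows are only fetched when the consumer asks for more.
        while True:
            batch = self.next_batch()
            if not batch:
                return
            yield from batch


scan = RemoteScan(range(10_000), batch_size=100)

# A consumer that needs only five qualifying rows stops early.
first_five = list(islice((r for r in scan if r % 2 == 1), 5))
shipped_after_first_pass = scan.rows_shipped

# A second pass over the same scan rewinds instead of reopening it.
scan.restart()
first_three = list(islice(scan, 3))
print(first_five, shipped_after_first_pass, scan.rows_shipped)
```

Only one batch is fetched per pass here even though the source holds 10,000 rows, which is the point of step 4.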

Page 16: Distributed query deep dive   conor cunningham

(DISTRIBUTED) TRANSACTIONS

• Transactions ensure correctness
• Distributed Transactions require multiple databases to either commit or abort together
• Microsoft ships a component called MSDTC that:
  • Provides a common service for dist. Transactions
  • Works for non-database things (queues, etc.)
  • Brokers between transaction protocols of different vendors
• DQ uses this component
• Not all queries require transactions, and DQ optimizes performance by only starting a DTC when necessary
• Configuring MSDTC is done on the Domain Controller by the Domain Administrator…

Page 17: Distributed query deep dive   conor cunningham

DOUBLE-HOP AUTHENTICATION

• Use Integrated Auth? Get Errors through DQ?
• This scenario happens in different places
  • User->IIS->SQL Server
  • User->SQL-(DQ)->SQL
• This is known as the “double hop problem”
• Don’t be afraid! It is possible to flow credentials and use your domain identities – talk to your domain administrator to define your SPN and permissions!

Links to read:
http://msdn.microsoft.com/en-us/library/ms189580.aspx
http://support.microsoft.com/kb/238477

Page 18: Distributed query deep dive   conor cunningham

DISTRIBUTED PARTITIONED VIEWS

• DPVs were an early scale-out model in DQ
• You split a table by ranges and put each on its own server (check constraints for the ranges)
• A UNION ALL view tied them together
• DQ then did various optimizations including:
  • Pruning of unneeded partitions
  • Startup predicates to do dynamic pruning
• Downsides:
  • Compilation time was high
  • Commands not done in parallel to each server
• This feature influenced our partitioned tables design
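
Partition pruning over range-partitioned member tables can be illustrated in a few lines of Python. The server names, the integer key, and the half-open ranges below are assumptions invented for the sketch; in a real DPV the ranges come from CHECK constraints on each member table.

```python
# Hypothetical layout: each member table lives on its own server and holds a
# half-open range [low, high) of an integer partitioning key, as a CHECK
# constraint on that member table would declare.
PARTITIONS = [
    ("Server1", 0, 1000),
    ("Server2", 1000, 2000),
    ("Server3", 2000, 3000),
]


def prune(partitions, low, high):
    """Return the servers whose range can overlap the predicate low <= key < high."""
    return [
        server
        for server, p_low, p_high in partitions
        if p_low < high and low < p_high
    ]


def servers_for_key(partitions, key):
    """Dynamic pruning: the key is only known at run time, like a startup predicate."""
    return prune(partitions, key, key + 1)


print(prune(PARTITIONS, 1500, 1600))
print(servers_for_key(PARTITIONS, 2500))
```

`prune` corresponds to pruning with a predicate known at compile time, while `servers_for_key` mirrors the startup-predicate case where the decision is deferred until the parameter value is available.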

Page 19: Distributed query deep dive   conor cunningham

TROUBLESHOOTING

• Biggest problem in DQ is “it didn’t remote”
• Various reasons:
  • Some function isn’t supported by DQ
  • Exotic data types (XML, CLR types)
  • Correctness issues – for most date operations we only trust the local clock (otherwise results can differ when you remote)
  • Sometimes the costing model will be close on 2+ plan choices and a plan will “stop remoting” (switch plans) to one that we think is similar in cost but is not
• Workarounds: In most cases, OPENQUERY() can be used to specify the exact text you wish to remote. Think of this as plan forcing for Distributed Query
• Also note:
  • SQL Server – SQL Server remoting is much better than SQL-Other DBMS vendors (our algebra and theirs do not always align)
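
Because OPENQUERY takes the remote text as a string literal, client code that generates such statements has to double any single quotes in that text. The Python helper below is a small sketch of that step; the function name `openquery`, the linked server name `RemoteSrv`, and the table and column names are hypothetical.

```python
import re


def openquery(linked_server: str, remote_sql: str) -> str:
    """Build a T-SQL statement that sends remote_sql verbatim to a linked server."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", linked_server):
        raise ValueError("linked server name must be a plain identifier")
    # OPENQUERY takes the remote text as a string literal, so single quotes
    # inside it have to be doubled.
    escaped = remote_sql.replace("'", "''")
    return f"SELECT * FROM OPENQUERY([{linked_server}], '{escaped}')"


statement = openquery(
    "RemoteSrv",  # hypothetical linked server name
    "SELECT SUM(col1), col2 FROM tbl WHERE col4 = 'x' GROUP BY col2",
)
print(statement)
```

The generated statement runs the inner query exactly as written on the remote server, so the optimizer no longer decides what to remote for that part of the query.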

Page 20: Distributed query deep dive   conor cunningham

CONCLUSION

• Thank you for your attention
• Questions?

Page 21: Distributed query deep dive   conor cunningham

© 2011 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.