Deep Dive on new Search Features in Denali CTP1
Naveen Garg, Principal Program Manager
Microsoft Corporation
Search Improvements
FullText Search
• Revamped Codebase for Significant Performance and Scale Improvement
• New Property-scoped Search
• Customizable NEAR Search
Performance & Scale Goals
• Scale up to 350M documents
• Queries orders of magnitude faster than the 2008 release
• Worst-case query response time < 3 sec
• On par with or better than key DBMS players
Code Investments - Versioned Updates
• SQL 2008 Design Issue
– Queries block updates to internal table that maintains state of index wrt document updates (such as for auto change tracking)
• Denali Investment
– Track batch commits in order
– Track lowest timestamp below which all batches are committed
– Select index data for query and merge from below this timestamp
– Block update of the lowest timestamp only (instead of index updates)
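The timestamp tracking described above can be sketched as follows. This is a minimal illustration, not the engine's implementation: the class name `CommitWatermark`, its methods, and the use of integer timestamps are all assumptions made for the example. It shows batches committing out of order while the "lowest timestamp below which all batches are committed" only advances once every older batch is done, and a short lock protecting only that value.

```python
import threading


class CommitWatermark:
    """Hypothetical tracker for the lowest timestamp below which all batches are committed."""

    def __init__(self):
        self._lock = threading.Lock()   # guards only the watermark state
        self._next_ts = 1               # timestamp handed to the next batch
        self._pending = set()           # batches started but not yet committed
        self._low_ts = 1                # every batch with ts < _low_ts is committed

    def begin_batch(self):
        with self._lock:
            ts = self._next_ts
            self._next_ts += 1
            self._pending.add(ts)
            return ts

    def commit_batch(self, ts):
        with self._lock:
            self._pending.remove(ts)
            # The watermark stops at the oldest batch that is still pending.
            self._low_ts = min(self._pending) if self._pending else self._next_ts

    def snapshot(self):
        """A query or merge reads the watermark once and uses it throughout."""
        with self._lock:
            return self._low_ts


def visible(batch_ts, snapshot_ts):
    # Index data is selected only from batches below the snapshot timestamp.
    return batch_ts < snapshot_ts
```

In this sketch a query holds the lock only long enough to read the watermark, so index updates from later batches proceed without waiting on it.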
Code Investments – Single STVF for Query
Goals
• Improve Query Execution Performance
• Lower costs, better plans and hints
• Better code organization
Code Changes
• Query Preparation
– Rewrite CONTAINS/FREETEXT in terms of CONTAINSTABLE/FREETEXTTABLE during binding
– Rewrite it as SELECT [TOP N] key, score FROM STVF [ORDER BY score] during QP prepare
• Compilation
– Parse parameters to a tree and bind specific columns, Word breaking
– Tree Expansion with appropriate AND/OR, Noise word filtering
– Tree Reduction, Load Stats
• Execution
– Transform to execution tree including Ranking function
– Iterate to produce resulting rows
Code Investments - Predicate Folding
• Multiple CONTAINS, FREETEXT Folded together, e.g.
– CT(1) AND CT(2) AND CT(3) => CT(1 AND 2 AND 3)
– CT(1) AND CT(2) OR CT(3) => CT(1 AND 2 OR 3)
– CT(1) AND NOT (NOT CT(2) OR CT(3)) => CT(1 AND 2 AND NOT 3)
– CT(1) AND (CT(2) OR i=10 OR CT(3)) => CT(1) AND (CT(2 OR 3) OR i=10)
• Except…
– CT(1) AND NOT (CT(2) AND CT(3)) => CT(1) AND (NOT CT(2) OR NOT CT(3)): NO folding
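The folding examples above can be reproduced with a small tree rewrite. The slide does not state the rule, so this sketch assumes one that fits all five examples: push NOT down with De Morgan's laws, then merge sibling full-text predicates, allowing "AND NOT" inside a merged condition but never "OR NOT" (which CONTAINS does not support). The node classes and the `fold` function are hypothetical names, and nested AND/AND or OR/OR nodes are not flattened.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CT:
    """A full-text predicate; `op` is the top-level operator inside `cond`."""
    cond: str
    op: str = ""


@dataclass(frozen=True)
class Rel:
    """Any non-full-text predicate, kept as opaque text."""
    text: str


@dataclass(frozen=True)
class Not:
    arg: object


@dataclass(frozen=True)
class And:
    args: Tuple[object, ...]


@dataclass(frozen=True)
class Or:
    args: Tuple[object, ...]


def _wrap(ct, parent_op):
    # Parenthesize a compound condition unless it uses the parent's operator.
    return ct.cond if ct.op in ("", parent_op) else "(" + ct.cond + ")"


def fold(node, negate=False):
    """Push NOT down (De Morgan), then merge sibling CT predicates."""
    if isinstance(node, Not):
        return fold(node.arg, not negate)
    if isinstance(node, (CT, Rel)):
        return Not(node) if negate else node
    is_and = isinstance(node, And) != negate       # negation swaps AND and OR
    kids = [fold(k, negate) for k in node.args]
    pos = [k for k in kids if isinstance(k, CT)]
    neg = [k.arg for k in kids if isinstance(k, Not) and isinstance(k.arg, CT)]
    if is_and and pos and len(pos) + len(neg) > 1:
        # "AND NOT" is expressible, so negated CTs fold in after the positives.
        cond = " AND ".join(_wrap(p, "AND") for p in pos)
        cond += "".join(" AND NOT " + _wrap(n, "") for n in neg)
        merged, used = CT(cond, "AND"), pos + [Not(n) for n in neg]
    elif not is_and and len(pos) > 1:
        # "OR NOT" is not expressible, so negated CTs stay outside.
        merged, used = CT(" OR ".join(_wrap(p, "OR") for p in pos), "OR"), pos
    else:
        return (And if is_and else Or)(tuple(kids))
    rest = [k for k in kids if k not in used]
    return merged if not rest else (And if is_and else Or)((merged, *rest))
```

For instance, `fold(And((CT("1"), Not(Or((Not(CT("2")), CT("3")))))))` yields a single `CT` with condition `1 AND 2 AND NOT 3`, while the "Except" case keeps `CT("1")` next to an OR of two negated predicates.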
Code Investments – Query Parallelism
Goals
• Retain basic assumptions to avoid complete rewrite
• Scale to 1.6x latency reduction for doubling the cores
• Work well on both NUMA and UMA architectures
Changes
• Query Optimizer and Execution updates to allow fulltext query parallelism
• Fulltext STVF Updates to support multiple threads per query
– Use DocID histogram to slice doc ranges for each thread
– Rebuild Autostats as part of background/master merge
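One way to picture "use DocID histogram to slice doc ranges for each thread" is an equal-count partition over histogram buckets. The sketch below is an assumption about how such slicing could work, not a description of the engine: the bucket layout `(first_docid, last_docid, doc_count)` and the function name `slice_doc_ranges` are hypothetical.

```python
def slice_doc_ranges(histogram, workers):
    """Split DocID space into contiguous ranges with roughly equal doc counts.

    histogram: sorted, non-overlapping (first_docid, last_docid, doc_count) buckets.
    Returns at most `workers` (first_docid, last_docid) ranges; fewer when the
    buckets are too coarse to split further.
    """
    if not histogram or workers < 1:
        return []
    total = sum(count for _, _, count in histogram)
    ranges = []
    start = histogram[0][0]
    seen = 0
    for i, (_, last, count) in enumerate(histogram):
        seen += count
        last_bucket = i == len(histogram) - 1
        # Cumulative count at which the current slice should end.
        target = total * (len(ranges) + 1) / workers
        if last_bucket or (seen >= target and len(ranges) < workers - 1):
            ranges.append((start, last))
            if not last_bucket:
                start = histogram[i + 1][0]
    return ranges
```

Slicing on cumulative counts rather than on raw DocID width keeps threads balanced when documents are unevenly distributed across the DocID space.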
Summary Of Code Improvements
• Faster Execution
– Numerous code and data layout improvements
– No blocking during high index update workloads
– Improved mixed relational query processing
– Optimize Top N by Rank
• 10x: Select top 1K by score for keyword in 1M docs (250ms -> 28ms)
• Leverage CPU
– Cache for Operators and Core Algorithms
• Batch decompression and rank computation, virtual functions
– Vector CPU instructions (SSE*) for scalar computations
• Ranking, TOP N, and Stale Test as major benefiters
• Leverage multicore
– Parallel Query execution
– Parallel Master Merge
* SSE: Streaming SIMD (Single Instruction Multiple Data) Extensions
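The slides do not say how "Top N by Rank" was optimized. As a generic illustration of why a top-N query need not sort every match, the sketch below keeps a bounded min-heap of the N best scores seen so far; the function name and the `(key, score)` tuple layout are assumptions for the example.

```python
import heapq


def top_n_by_score(matches, n):
    """Return the n highest-scoring (key, score) pairs, best first.

    matches: iterable of (key, score); it is consumed in one pass and only
    n entries are held in memory at a time.
    """
    if n <= 0:
        return []
    heap = []  # min-heap of (score, key); heap[0] is the weakest kept match
    for key, score in matches:
        if len(heap) < n:
            heapq.heappush(heap, (score, key))
        elif (score, key) > heap[0]:
            heapq.heapreplace(heap, (score, key))
    return [(key, score) for score, key in sorted(heap, reverse=True)]
```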
Query Throughput on 350M Documents
[Chart: Throughput (qps) with DML. X axis: CPU (0 to 35); Y axis: Throughput (qps), 0 to 140. Series: SQL Server Denali, SQL Server 2005. Data labels: 3.014, 10.501, 43.775, 119.017 and 0.597, 0.534, 0.385, 0.853, 1.364]
[Chart: Throughput (qps) without DML. X axis: CPU (0 to 35); Y axis: Throughput (qps), 0 to 180. Series: SQL Server Denali, SQL Server 2005. Data labels: 3.009, 13.571, 64.825, 157.93 and 4.772, 8.147, 17.102, 48.27, 61.374]
Throughput & Execution Time on a Customer Workload
[Chart: Throughput/#Connections. X axis: Number of Connections (0 to 2500); Y axis: Queries/Second, 0 to 140. Series: SQL Server Denali, SQL Server 2005]
[Chart: AvgExecTime (ms)/#connections. X axis: Number of Connections (0 to 2500); Y axis: AvgExecTime (ms), 0 to 70000. Series: SQL Server Denali, SQL Server 2005]
Query Throughput on another Customer Workload
2X Query performance improvement compared with SQL Server 2005
[Chart: Scaling Queries/Seconds. X axis: Users (0 to 450); Y axis: Queries/Seconds, 0 to 9. Series: SQL Server Denali, SQL Server 2005]
Performance & Scale Summary
• Index and query tested at scale up to 350 million documents with < ~2 sec response
– ~3X better throughput without DML and ~9X better with DML
– Scales easily with increasing number of connections
• TAP customers already reporting significant performance improvement on their workloads
Property Scoped Search
• Load Office Filters (needed once per database instance)
– EXEC sp_fulltext_service 'load_os_resources', 1;
– EXEC sp_fulltext_service 'restart_all_fdhosts';
• Create a property list
– CREATE SEARCH PROPERTY LIST p1;
• Add properties to be extracted
– ALTER SEARCH PROPERTY LIST [p1] ADD N'System.Author' WITH
  (PROPERTY_SET_GUID = 'f29f85e0-4ff9-1068-ab91-08002b27b3d9',
  PROPERTY_INT_ID = 4, PROPERTY_DESCRIPTION = N'System.Author');
• Create/Alter Fulltext index to specify property list to be extracted
– ALTER FULLTEXT INDEX ON fttable... SET SEARCH PROPERTY LIST = [p1];
• Query for properties
– SELECT * FROM fttable WHERE
  CONTAINS(PROPERTY(ftcol, 'System.Author'), 'fernlope');
Identifying Property GUIDs
• Commonly known Property GUIDs documented in MSDN
• For the rest…
– Enable TF 7603
– Create and fully populate a Fulltext index with property search
– Check error log for Property GUIDs
– Recreate Index with required properties
• OR use FiltDump.EXE (Windows SDK)
– Get property details
Attribute = {F29F85E0-4FF9-1068-AB91-08002B27B3D9}\2 (System.Title)
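Turning a FiltDump attribute line into a property-list statement can be scripted. The sketch below assumes the output always has the shape shown on the slide (GUID in braces, a backslash, the integer ID, then a label in parentheses) and that the label is the name you want to register; the function name `property_ddl` and the regular expression are hypothetical helpers, not part of any Microsoft tool.

```python
import re

# Matches lines shaped like: Attribute = {GUID}\<int id> (<label>)
ATTRIBUTE_LINE = re.compile(
    r"Attribute\s*=\s*\{([0-9A-Fa-f-]{36})\}\\(\d+)\s*\(([^)]+)\)")


def property_ddl(filtdump_line, list_name):
    """Build an ALTER SEARCH PROPERTY LIST statement from one FiltDump line.

    Returns None when the line does not look like an attribute line.
    """
    m = ATTRIBUTE_LINE.search(filtdump_line)
    if m is None:
        return None
    guid, int_id, name = m.group(1).lower(), int(m.group(2)), m.group(3)
    quoted = name.replace("'", "''")   # escape quotes for the T-SQL literal
    return (f"ALTER SEARCH PROPERTY LIST [{list_name}] ADD N'{quoted}' "
            f"WITH (PROPERTY_SET_GUID = '{guid}', PROPERTY_INT_ID = {int_id});")
```

Applied to the sample line above with list name `p1`, this produces an `ADD N'System.Title'` statement with `PROPERTY_INT_ID = 2`, mirroring the System.Author statement in the previous slide.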
Indexing Properties with Keywords
• Stored along with keywords but with additional Internal Property ID(s)
Customizable ‘NEAR’ operator
• NEAR ( ( { <simple_term> | <phrase> | <prefix_term> } [ ,…n ] ) [, <maximum_distance> [, <match_order> ] ] )
<maximum_distance> ::= { integer | MAX }
<match_order> ::= { TRUE | FALSE }
• E.g., resumes in the human resources DB containing the term "SQL Server" within no more than 5 words of "expertise":
SELECT candidate_name FROM Candidates
WHERE CONTAINS(Resume, 'NEAR(("SQL Server", expertise), 5, FALSE)');
Customize Maximum Gap between terms/phrases when using NEAR operator
Customizable NEAR
• Search for documents with two words a distance apart
Old NEAR Usage
SELECT * FROM fttable WHERE CONTAINS(*, 'test NEAR Space')
New NEAR Usages
• Specify Distance
SELECT * FROM fttable WHERE CONTAINS(*, 'NEAR((test, Space), 5, FALSE)')
• Reduce Distance
SELECT * FROM fttable WHERE CONTAINS(*, 'NEAR((test, Space), 2, FALSE)')
• Mandate Order of words
SELECT * FROM fttable WHERE CONTAINS(*, 'NEAR((test, Space), 5, TRUE)')
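To make the distance and order arguments concrete, here is a simplified model of the matching rule in plain Python. It is a sketch under stated assumptions, not how the engine evaluates NEAR: terms are single words, text is split on whitespace with no word breaker or stemming, and distance is counted as the number of non-search words between the first and last matched terms. The function name `near_match` is hypothetical; `max_distance=None` stands in for MAX.

```python
from itertools import product


def near_match(text, terms, max_distance=None, match_order=False):
    """Return True if every term occurs within max_distance of the others."""
    tokens = text.lower().split()
    # For each search term, the token positions where it occurs.
    positions = [[i for i, tok in enumerate(tokens) if tok == term.lower()]
                 for term in terms]
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue                      # one token cannot satisfy two terms
        if match_order and list(combo) != sorted(combo):
            continue                      # terms must appear in the order given
        gap = max(combo) - min(combo) + 1 - len(combo)   # non-search words in span
        if max_distance is None or gap <= max_distance:
            return True
    return False
```

With this model, `near_match("test one two space", ["test", "space"], 5)` matches, lowering the distance to 1 rejects it, and setting `match_order=True` rejects text where "space" comes before "test".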