Scylla Summit 2017: Planning Your Queries for Maximum Performance


Planning your queries for maximum performance

Shlomi Livne, VP R&D, ScyllaDB

Shlomi is VP of R&D at ScyllaDB. Prior to ScyllaDB he led the research and development team at Convergin, which was acquired by Oracle.



How Scylla executes your queries



Cluster View

[Figure: a client sends a request to a cluster of nodes numbered 1 to 8; the diagram labels the coordinator node and the replica nodes.]



Coordinator Tasks

1. Prepare the statement
2. Single partition queries
   a. Select replicas (using cache heat info) and send query / digest requests, requesting a page of results
   b. Compare the digests; if there is a mismatch:
      i. Request data from selected replicas
      ii. Repair the data on replicas
   c. Return result
3. Partition scan queries
   a. Split the request up based on the ring
   b. Send requests for data using ranges, requesting a page of results
   c. Merge results
   d. Return result
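The digest comparison in step 2 can be sketched as follows. This is a minimal simulation, not Scylla's implementation: the cell layout `{column: (value, write_timestamp)}`, the function names, and the use of MD5 are all assumptions made for illustration, and the "request data from selected replicas" step is collapsed because the simulation already holds every replica's full result.

```python
import hashlib


def digest(cells):
    """Stable digest of one replica's result; cells = {column: (value, write_ts)}."""
    h = hashlib.md5()  # stand-in hash, chosen only for convenience
    for column in sorted(cells):
        value, ts = cells[column]
        h.update(f"{column}={value}@{ts};".encode())
    return h.hexdigest()


def reconcile(replica_results):
    """Merge full results from all replicas, keeping the newest write per column."""
    merged = {}
    for cells in replica_results:
        for column, (value, ts) in cells.items():
            if column not in merged or ts > merged[column][1]:
                merged[column] = (value, ts)
    return merged


def coordinator_read(replica_results):
    """Return (result, indexes of replicas that need repair).

    The first replica answers with data, the others with digests only.
    """
    data = replica_results[0]
    expected = digest(data)
    if all(digest(cells) == expected for cells in replica_results[1:]):
        return data, []  # digests match: return the data response as is
    merged = reconcile(replica_results)
    stale = [i for i, cells in enumerate(replica_results) if cells != merged]
    return merged, stale
```

For example, with one replica holding `{"A": (8, 1), "B": (7, 2)}` and another holding an older `B`, the call returns the first replica's cells and reports the second replica as needing repair.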



Replica Tasks

1. Receive a data/digest/range request
2. Split the request up according to shards
3. On each shard:
   a. Execute the request merging data from memtables + cache/sstables
   b. For data request:
      i. prepare a result and return it (compute digest if RF > 1)
   c. For digest request:
      i. compute digest and return it
   d. For partition scan request:
      i. return the partition range data (do not prepare a result)



Replica Shard Read Diagram

[Figure sequence: a read request on a shard consults the row cache and the memtable; on a cache miss each sstable is read through its Bloom Filter, Summary, Index, Compression and Data components. Worked example, Read: P8:R1. The memtable holds P8:R1:C=3, one sstable holds P8:R1:A=8 and another holds P8:R1:B=7. The sstable data populates the row cache (P8:R1:A=8,B=7), is merged with the memtable, and the result is P8:R1: A=8,B=7,C=3.]
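The read of P8:R1 in the diagram can be traced with a small sketch. It is a simplification under stated assumptions: `read_row` and the dictionary layout are hypothetical, sstables are assumed to be listed oldest to newest, and the memtable is assumed to hold the newest writes, so plain dictionary updates stand in for timestamp-based reconciliation.

```python
def read_row(key, row_cache, memtable, sstables):
    """Serve a row from cache + memtable, filling the cache from sstables on a miss."""
    if key not in row_cache:
        # Cache miss: merge the row's cells from every sstable that has the key
        # and populate the row cache with the merged on-disk view.
        on_disk = {}
        for sstable in sstables:
            on_disk.update(sstable.get(key, {}))
        row_cache[key] = on_disk
    # The cached on-disk view is then merged with the memtable contents.
    result = dict(row_cache[key])
    result.update(memtable.get(key, {}))
    return result


sstables = [{"P8:R1": {"A": 8}}, {"P8:R1": {"B": 7}}]
memtable = {"P8:R1": {"C": 3}}
row_cache = {}
print(read_row("P8:R1", row_cache, memtable, sstables))
```

After the first read the cache holds `A=8,B=7` for P8:R1, so a second read of the same row no longer needs the sstables.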




Row Cache

▪ Cache stores complete row data
▪ In addition to storing existing rows, cache stores information about completeness of clustering ranges (continuity), so it doesn't miss between cached rows.
▪ Cache is populated on:
  o Queries
  o Memtable flush:
    • Data is merged - to keep it up to date with new sstables written.
    • Data is inserted - in case there is no data for that partition on disk.
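The continuity idea can be illustrated with a toy model. This is an assumption-laden sketch, not Scylla's data structure: each cached row carries a hypothetical `continuous` flag meaning "the cache knows no other rows exist between the previous cached row and this one", and for simplicity a range can only be served from cache when both of its bounds are cached rows.

```python
from dataclasses import dataclass


@dataclass
class CachedRow:
    ck: int           # clustering key
    continuous: bool  # True: nothing exists on disk between the previous cached row and this one


def range_is_cached(rows, lo, hi):
    """True if the clustering range [lo, hi] can be answered from cache alone.

    `rows` must be sorted by clustering key.
    """
    keys = {row.ck for row in rows}
    if lo not in keys or hi not in keys:
        return False  # simplification: both bounds must be cached rows
    # Every gap inside (lo, hi] must be known to be complete.
    return all(row.continuous for row in rows if lo < row.ck <= hi)


rows = [CachedRow(1, False), CachedRow(3, True), CachedRow(7, False), CachedRow(9, True)]
print(range_is_cached(rows, 1, 3))  # gap between 1 and 3 is known to be empty
print(range_is_cached(rows, 3, 7))  # gap between 3 and 7 is unknown: disk must be consulted
```

Without the flag, a cache holding rows 1 and 3 could not tell whether row 2 exists on disk, so every range read would have to go to the sstables.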



Selecting Sstables

▪ Given a partition key (pk), the current set of sstables is reduced so that sstable X will be included iff:
  o min_partition_key(sstable X) <= pk <= max_partition_key(sstable X)
  o bloom_filter(sstable X, pk) = True
▪ Scylla 2.0: SStables will be read in parallel
▪ Scylla 2.1:
  o The reduced set of sstables is searched newest to oldest until a result can be constructed and we can prove that older sstables are not relevant.
  o SStables read parallelism will grow starting from a single sstable
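The two filtering conditions can be sketched as below. The `SSTable` class and its fields are hypothetical, the key bounds are treated as inclusive, and an exact key set stands in for the bloom filter (a real bloom filter may return false positives but never false negatives).

```python
from dataclasses import dataclass


@dataclass
class SSTable:
    name: str
    min_key: int
    max_key: int
    keys: frozenset  # stand-in for the bloom filter's contents

    def bloom_filter_may_contain(self, pk):
        # Exact membership here; a real bloom filter is probabilistic.
        return pk in self.keys


def select_sstables(sstables, pk):
    """Keep only sstables whose key range covers pk and whose bloom filter accepts it."""
    return [
        s for s in sstables
        if s.min_key <= pk <= s.max_key and s.bloom_filter_may_contain(pk)
    ]


sstables = [
    SSTable("a", 1, 10, frozenset({1, 8, 10})),
    SSTable("b", 5, 20, frozenset({5, 20})),   # range covers 8, bloom filter rejects it
    SSTable("c", 9, 30, frozenset({9, 30})),   # range does not cover 8
]
print([s.name for s in select_sstables(sstables, 8)])
```

The range check is cheap and runs first; the bloom filter then removes sstables whose range covers the key but which do not contain it.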



7 Rules To Optimize your Queries



Rule #1 - Use Prepared statements

▪ With unprepared statements, the coordinator needs to pre-process every query:
  o A lot of repetitive work that could be done only once
  o Adds overhead to the execution of each query, which directly translates to throughput and latency
▪ With unprepared statements, the driver is not able to send the request to a coordinator node that holds the data (an additional hop)
▪ tip: compare scylla_query_processor_statements_prepared to the number of executed scylla_transport_requests_served



Sample: single Scylla server, using c-s

Results                      Unprepared   Prepared
op rate                           13037      18704
partition rate                    13037      18704
row rate                          13037      18704
latency mean                        1.5        1.1
latency median                      1.3          1
latency 95th percentile             2.9        1.6
latency 99th percentile             6.2        2.5
latency 99.9th percentile          12.2        7.1
latency max                        31.1       16.9
Total partitions                 100000     100000
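To put the sample numbers in relative terms, the short calculation below compares three of the rows from the table. The helper name `relative_change` is only illustrative; the figures are copied from the table.

```python
# (unprepared, prepared) pairs taken from the sample results
results = {
    "op rate": (13037, 18704),
    "latency mean": (1.5, 1.1),
    "latency 99th percentile": (6.2, 2.5),
}


def relative_change(unprepared, prepared):
    """Percentage change when moving from unprepared to prepared statements."""
    return (prepared - unprepared) / unprepared * 100


for name, (unprepared, prepared) in results.items():
    print(f"{name}: {relative_change(unprepared, prepared):+.1f}%")
```

In this sample, prepared statements give roughly 43% more operations per second and cut the 99th percentile latency by about 60%.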



Rule #2 - Use Paging

▪ Paging Disabled: Coordinator will be forced to prepare a single result that holds all the data and send it back:
  o If the coordinator is not able to return a response (allocate enough memory for the single result), an error will be returned to the client
  o tip: compare scylla_transport_unpaged_queries to scylla_cql_reads to detect if many of your read queries are unpaged



Rule #3 - Use correct Page Size

▪ Drivers enable paging by default with a default page_size of 5000 rows (java, python, gocql)
▪ CQL requires returning at least one result and allows returning fewer results than the page size
▪ Scylla utilizes this:
  o Scylla caps a page at ~1MB of memory - Scylla will return fewer rows than requested when rows are large
  o Do not use the number of returned results as an indication of whether there are more results
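Why the row count is not a reliable end-of-results signal can be shown with a small simulation. Everything here is a stand-in: `fetch_page` imitates a server that caps each page at an assumed 1,000,000 bytes, and an integer offset plays the role of the opaque paging state that a real driver would hand back.

```python
def fetch_page(rows, paging_state, page_size, max_page_bytes=1_000_000):
    """Simulated server: return (page, next_paging_state); state None means no more pages."""
    start = paging_state or 0
    page, used = [], 0
    for row in rows[start:start + page_size]:
        if page and used + len(row) > max_page_bytes:
            break  # byte cap reached; at least one row is always returned
        page.append(row)
        used += len(row)
    end = start + len(page)
    return page, (end if end < len(rows) else None)


def fetch_all(rows, page_size):
    """Client loop that relies only on the paging state to detect the end."""
    fetched, state, pages = [], None, 0
    while True:
        page, state = fetch_page(rows, state, page_size)
        fetched.extend(page)
        pages += 1
        if state is None:  # the only reliable "no more pages" signal
            return fetched, pages


rows = [b"x" * 400_000] * 7            # seven large rows
first_page, state = fetch_page(rows, None, page_size=5)
print(len(first_page), state is None)  # fewer rows than page_size, yet more pages remain
print(len(fetch_all(rows, page_size=5)[0]))
```

A client that stopped as soon as a page came back with fewer than `page_size` rows would stop after the first page here and silently lose the remaining rows.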




Has more pages



Scylla 2.0: does the default page_size make sense?

page size   10^6 rows of    10^5 rows of    10^4 rows of    1000 rows of
            100 bytes       1000 bytes      10^4 bytes      10^5 bytes
10          timed out       2104.492031     331.087871      173.932543
50          5679.087615     737.148927      202.113023      168.165375
100         4034.920447     573.046783      186.384383      168.951807
500         2663.383039     415.760383      183.894015      173.015039
1000        2451.570687     395.313151      182.976511      168.427519
5000        2285.895679     400.031743      184.942591      169.345023
10000       2281.701375     399.769599      183.369727      169.738239
50000       2273.312767     396.099583      183.107583      170.000383

Test: duration in milliseconds fetching a single wide partition with 10^8 bytes split into rows using different page size



C* 3.11.0: does the default page_size make sense?

page size   10^6 rows of    10^5 rows of    10^4 rows of    1000 rows of
            100 bytes       1000 bytes      10^4 bytes      10^5 bytes
10          timed out       4030.726143     903.872511      364.380159
50          12876.51328     1535.115263     419.430399      300.941311
100         8992.587775     1202.716671     405.274623      316.407807
500         6400.507903     907.542527      354.680831      348.651519
1000        6077.546495     874.512383      360.972287      370.409471
5000        5620.367359     791.674879      422.051839      358.612991
10000       5490.343935     793.772031      389.021695      360.447999
50000       5662.310399     913.833983      383.516671      355.467263

Test: duration in milliseconds fetching a single wide partition with 10^8 bytes split into rows using different page size

tip: consider changing the page size if your rows are large
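A back-of-the-envelope way to reason about the tip, as a sketch: if a page is capped at roughly 1MB, the number of rows that actually come back per page is the smaller of the requested page size and the number of rows that fit under the cap. The exact cap of 1,000,000 bytes and both function names are assumptions for illustration, and per-row overhead is ignored.

```python
def effective_rows_per_page(page_size, row_bytes, cap_bytes=1_000_000):
    """Rows returned per page when the server caps a page at cap_bytes."""
    rows_under_cap = max(1, cap_bytes // row_bytes)  # at least one row is always returned
    return min(page_size, rows_under_cap)


def pages_needed(total_rows, rows_per_page):
    """Round trips needed to fetch total_rows (ceiling division)."""
    return -(-total_rows // rows_per_page)


# The four row shapes used in the test: a 10^8 byte partition split into rows
for total_rows, row_bytes in ((10**6, 100), (10**5, 1_000), (10**4, 10_000), (1_000, 100_000)):
    per_page = effective_rows_per_page(5000, row_bytes)
    print(row_bytes, per_page, pages_needed(total_rows, per_page))
```

Under these assumptions the default page size of 5000 only governs for the 100-byte rows; for the larger rows the byte cap decides how many rows a page holds, so raising the page size further changes nothing, while a very small page size adds round trips.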



Rule #4 - Beware of Multi Partition CQL IN queries

▪ Multi-Partition CQL IN queries force the coordinator node to split the query up into single partition queries and aggregate the results.
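One common alternative, which the slide does not spell out, is to do the split on the client and issue the single partition queries concurrently, so that no single coordinator has to aggregate everything. The sketch below assumes a hypothetical `execute(statement, params)` callable standing in for a driver call that returns a list of rows; the function names are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor


def split_in_query(table, key_column, keys):
    """Return one (statement, params) pair per key instead of a single IN query."""
    statement = f"SELECT * FROM {table} WHERE {key_column} = ?"
    return [(statement, (key,)) for key in keys]


def run_split(execute, table, key_column, keys, max_workers=8):
    """Run the single partition queries in parallel and concatenate their rows.

    `execute` is a stand-in for the driver call; results keep the order of `keys`.
    """
    queries = split_in_query(table, key_column, keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_query = pool.map(lambda query: execute(*query), queries)
        return [row for rows in per_query for row in rows]


# Usage with an in-memory stand-in for the cluster
data = {1: ["a"], 2: ["b", "c"], 3: []}
print(run_split(lambda statement, params: data[params[0]], "ks.cf", "pk", [1, 2, 3]))
```

With a prepared statement and a token-aware driver, each of the split queries can be sent directly to a node that owns the partition.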



Rule #5 - Beware of Single Partition CQL IN queries

Question: Should I split the CQL IN Query?

Sample:

▪ CQL: “Select * from ks.cf where pk = X and ck in (Y1, Y2, … Yn)”

Translated to:

▪ CQL:
  o “Select * from ks.cf where pk = X and ck = Y1”
  o “Select * from ks.cf where pk = X and ck = Y2”
  o ...
  o “Select * from ks.cf where pk = X and ck = Yn”






Question: Should I split the CQL IN Query?

Answer: It depends on how wide your rows are

Comments:

▪ Prior to Scylla 2.0, single partition CQL IN queries performed very badly in some wide partition cases.
▪ All reported results are using Scylla 2.0



Rule #6 - There’s a faster way to do full scans

▪ The blog post efficient-full-table-scans-with-scylla (http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/) outlined an algorithm to do full scans; at a high level:
  o split the range up into small sub ranges
  o run “enough” sub ranges in parallel
▪ The follow-up blog post “How to scan 475 million partitions 12x faster using efficient full table scan” (http://www.scylladb.com/2017/03/28/parallel-efficient-full-table-scan-scylla/) provided a sample implementation applying this
▪ Is there an even “faster” way?



▪ Yes there is:
  o Using the token ownership of nodes in the ring, one can select ranges of tokens. Once a “range” has been processed, the next “range” can be selected based on the ownership in the ring.
  o An even more optimized solution would use the “sharding” information and aim ranges based on shards on a machine, so that all cores are executing requests in parallel.
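Both ideas, splitting the token range into sub ranges and picking ranges by ring ownership, can be sketched as below. The assumptions: tokens span the signed 64-bit range used by the Murmur3 partitioner, `ring` is a hypothetical `{token: node}` map, only primary ownership is modeled (replication and shard awareness are ignored), and the query text is an illustrative template.

```python
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1

# Illustrative statement for scanning one sub range (start, end]
SCAN_QUERY = "SELECT * FROM ks.cf WHERE token(pk) > ? AND token(pk) <= ?"


def split_token_range(parts, start=MIN_TOKEN, end=MAX_TOKEN):
    """Split (start, end] into `parts` contiguous sub ranges."""
    step = (end - start) // parts
    bounds = [start + i * step for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))


def owner_ranges(ring):
    """Return (start, end, node) ranges; a node owns (previous token, its token]."""
    tokens = sorted(ring)
    ranges, previous = [], MIN_TOKEN
    for token in tokens:
        ranges.append((previous, token, ring[token]))
        previous = token
    if previous < MAX_TOKEN:
        # The wrap-around part of the ring belongs to the node with the first token.
        ranges.append((previous, MAX_TOKEN, ring[tokens[0]]))
    return ranges


ring = {-100: "n1", 0: "n2", 100: "n3"}
for start, end, node in owner_ranges(ring):
    print(node, start, end)
```

A scanner could keep one queue of ranges per node and always take the next range from a node that has no request in flight, so that every node stays busy; each range can be cut further with `split_token_range(parts, start, end)` when it is too large for one request.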



Rule #7 - Use the tools

▪ Probabilistic tracing
▪ Slow query tracing
▪ Wireshark
▪ CQL Trace
▪ Enable client-side tracing



THANK YOU


@ShlomiLivne

Any questions?
