
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

SEIZE THE DATA. 2015


VERTICA PERFORMANCE TUNING

Practical Lessons

Curtis Bennett, Vertica Professional Services

August 10, 2015


Logical / Physical Modeling


Denormalize!

It's the Data Model, Stupid!

• Do not offload your relational, 3rd normal form model right into Vertica. Bad idea

• Get rid of any 1:1 relationships that you had in your row-oriented database because the table got too wide

• It may be advantageous to split "intelligent" fields such as credit card or phone numbers into their component parts to increase the amount of run-length encoding (RLE)

− where phone_number = '123-456-7890'

becomes

where area_code = '123'

and phone_number = '123-456-7890'
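Why the split helps RLE can be sketched in a few lines of Python. This is an illustration only, not Vertica code: the phone numbers are made up, and `rle_runs` is a hypothetical helper that counts how many (value, count) pairs a run-length encoder would have to store.

```python
from itertools import groupby

def rle_runs(values):
    """Count run-length-encoding runs: one run per group of equal adjacent values."""
    return sum(1 for _ in groupby(values))

# Hypothetical data: 3 area codes with 1000 subscribers each, stored sorted.
phones = sorted(
    f"{area}-555-{n:04d}" for area in ("212", "415", "617") for n in range(1000)
)
area_codes = [p[:3] for p in phones]

phone_runs = rle_runs(phones)     # every value is distinct, so RLE gains nothing
area_runs = rle_runs(area_codes)  # one run per area code
print(phone_runs, area_runs)
```

The full phone number column needs one run per row, while the separate area code column collapses to one run per area code, so a predicate on area_code touches far less data.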


Partitioning

Benefits of Partitioning include:

• Ability to drop partitions easily

− No delete vectors

− Very fast

• Ignore huge chunks of irrelevant information, known as "Partition Pruning"

• Take Advantage of some powerful functions:

− SWAP_PARTITIONS_BETWEEN_TABLES

− MOVE_PARTITIONS_TO_TABLE

• Increased optimizer parallelism

Partitioning by dates works best. Avoid complex partition keys such as modulus values.

Finding something by knowing where it is not
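Partition pruning can be pictured with a small Python model. This is a sketch under simplifying assumptions, not how Vertica stores data: the rows, the (year, month) partition key, and the helpers `build_partitions` and `query_month` are all hypothetical.

```python
from collections import defaultdict
from datetime import date

def build_partitions(rows):
    """Group rows by (year, month) of their date, like a date-based partition key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[(row["day"].year, row["day"].month)].append(row)
    return partitions

def query_month(partitions, year, month):
    """Return the matching rows and the number of rows actually scanned."""
    scanned = partitions.get((year, month), [])
    return list(scanned), len(scanned)

# Hypothetical fact table: 28 rows for each month of 2015.
rows = [{"day": date(2015, m, d), "amount": d} for m in range(1, 13) for d in range(1, 29)]
partitions = build_partitions(rows)
matches, scanned = query_month(partitions, 2015, 8)
print(len(rows), scanned)
```

A query for one month reads a single partition and never looks at the other eleven, which is the "knowing where it is not" idea.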


Pre-Join Projections

There are some very good use-cases

Pre-Join projections can solve the problem of inefficient GROUP BY operations when the GROUP BY list spans across tables, and you need to have a single set of ordered columns in order to facilitate a GROUP BY PIPELINE

Caveats:

• Slight load penalty

• Enforces referential integrity

• Cascades deletes (due to RI)


Live Aggregate Projections

• Can replace the creation of aggregation tables

• Projection which is maintained and aggregated on the fly as data is loaded

• Supports:

− Count

− Max

− Min

− Sum

− Combinations of any of the above

• Restrictions apply, check the documentation

• See also Top-K projections and Projections with Expressions


Some More Physical Model Tidbits

• Speaking of Referential Integrity, don't forget to create the Primary Key and Foreign Key constraints on your tables

− Prevents the optimizer from making a poor decision by flipping an Inner/Outer join

• Don't use data types that are larger than necessary

− NUMBER defaults to NUMBER(38), which consumes 3 binary words. NUMBER(37) consumes 2 binary words and thus would be slightly faster

− CHAR(1) is not as efficient as a BOOLEAN type

− Don't bloat your VARCHARs. If VARCHAR(200) is sufficient, don't make it VARCHAR(1000) just to be safe; you'll add excess processing time to your queries, as much as a 20% overhead

• Joining on INT types is WAY faster than joining on large CHAR values.


PROJECTIONS


Replication vs. Segmentation

• Replicate dimensions, Segment facts

• In a large cluster (somewhere north of 20 nodes), segment almost everything

• If speed is of paramount importance, replicate

• A single node cluster is faster than a 3 node cluster, but obviously doesn't scale

• Define your segmentation keys with simplicity and consistency in mind

• The key should be unique, or nearly unique

• The segmentation value should be consistently applied in order to facilitate local joins as often as possible

Remember that if left to its own devices, Vertica will choose to segment by default.
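The advice to pick a unique, or nearly unique, segmentation key can be illustrated with a toy model in Python. Everything here is an assumption for illustration: MD5 stands in for Vertica's own hash function, and the 3-node cluster, the key columns, and the helper `rows_per_node` are hypothetical.

```python
import hashlib
from collections import Counter

def rows_per_node(keys, node_count):
    """Assign each row to a node by hashing its segmentation key."""
    counts = Counter({node: 0 for node in range(node_count)})
    for key in keys:
        digest = hashlib.md5(str(key).encode()).digest()
        counts[int.from_bytes(digest[:8], "big") % node_count] += 1
    return [counts[node] for node in range(node_count)]

NODES = 3
order_ids = range(30_000)                                 # unique key
genders = ["M" if i % 2 else "F" for i in range(30_000)]  # only two distinct values

by_order_id = rows_per_node(order_ids, NODES)
by_gender = rows_per_node(genders, NODES)
print(by_order_id, by_gender)
```

The unique key spreads rows across all nodes. The two-valued key can reach at most two of the three nodes, so at least one node sits idle while the others do all the work.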


The ORDER BY Clause

Two primary methods:

• ORDER BY low cardinality to high cardinality, ending with the primary key

• Promotes great RLE and generates good compression and performance

• ORDER BY for predicate-based lookups

• Predicates the predicate values

• Super fast first, then joins
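The effect of ordering from low cardinality to high cardinality can be checked with a small Python model. This is a sketch with a made-up table, not Vertica's storage format: `rle_runs` and `total_runs` are hypothetical helpers that count the runs an RLE encoder would store for each column under a given sort order.

```python
from itertools import groupby

def rle_runs(column):
    """Number of RLE runs in a column: one per group of equal adjacent values."""
    return sum(1 for _ in groupby(column))

def total_runs(rows, sort_columns):
    """Sort rows by the given column order, then sum RLE runs over all columns."""
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in sort_columns))
    return sum(rle_runs([r[c] for r in ordered]) for c in rows[0])

# Hypothetical table: 2 regions, 10 stores, 1000 unique order ids.
rows = [
    {"region": i % 2, "store": i % 10, "order_id": i}
    for i in range(1000)
]

low_to_high = total_runs(rows, ["region", "store", "order_id"])
high_to_low = total_runs(rows, ["order_id", "store", "region"])
print(low_to_high, high_to_low)
```

Sorting low to high leaves long runs in the region and store columns; leading with the unique key breaks every column into one run per row.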


Don't Overlook Encoding!

Encoding & Compression

• Vertica supports nearly a dozen kinds of column encoding

• A powerful feature of the columnar architecture

• Having good compression can result in tremendous performance gains

• Don't be afraid to experiment with different encoding types if performance is critical

• Sometimes AUTO actually works really well

• Let the Database Designer decide which encoding to use

• Familiarize yourself with the function: DESIGNER_DESIGN_PROJECTION_ENCODINGS()


Database Designer & Correlations

Pro Tip:

Check out the function ANALYZE_CORRELATIONS()

In order to generate correlations, you must run Database Designer manually through API calls directly in VSQL.

See the following functions to get started:

• DESIGNER_CREATE_DESIGN

• DESIGNER_ADD_DESIGN_TABLES

• DESIGNER_SET_ANALYZE_CORRELATIONS_MODE

• DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY

Correlation statistics replace regular statistics, so be careful not to create regular statistics on your table if correlation statistics are the better fit.


Miscellaneous

• Speaking of Database Designer - USE IT!

• Feel free to experiment with projection design if performance is critical

• Replace the table name with the projection name in the query to test different projection designs

• Projections that are no longer used should be removed -> fewer projections = fewer choices = slightly faster queries

Probably 90% of all performance-related problems are solvable with good projection design

EXPLAIN your queries. If performance is key, avoid HASH JOINS and HASH GROUP BYs.

Avoid RESEGMENTATION and BROADCAST at all cost!

Don't forget to update statistics!
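The preference for MERGE JOIN over HASH JOIN comes down to sort order: when both inputs arrive already sorted on the join key, they can be joined in one streaming pass without building a hash table in memory. The Python sketch below shows the general merge join technique on made-up data; it is not Vertica's implementation, and `merge_join` is a hypothetical helper.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key.

    Streams through both inputs once; no hash table is built.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Emit every right-side row sharing this key, then advance left.
            k = j
            while k < len(right) and right[k][0] == lkey:
                out.append((lkey, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out

customers = [(1, "Ann"), (2, "Bob"), (4, "Cy")]                 # sorted by customer id
orders = [(1, "o-10"), (1, "o-11"), (3, "o-12"), (4, "o-13")]   # sorted by customer id
joined = merge_join(customers, orders)
print(joined)
```

This is why projection ORDER BY clauses that match the join keys matter: they give the optimizer pre-sorted inputs to work with.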


QUERIES


What NOT to do

• Avoid using IN() clauses with a sub-query that produces a set of key values for use in an outer query

• UNION statements inside a subquery

• Don’t select more than you need - in a columnar database, there is a cost associated with selecting lots of columns

• Don't go crazy with analytics when a simple aggregate will do

• Avoid inequality or negation predicates: !=, <>, >=, <= are all inefficient

• LIKE and ILIKE are slow. If possible, avoid % at the beginning of the string, e.g., use query ilike 'select%' instead of query ilike '%select%'

• Avoid OR, if possible

• Replace GROUP BY 1,2,3 or ORDER BY 1,2,3 with the actual column names, especially in production code
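The leading-wildcard advice in the LIKE bullet reflects a general property of sorted data, sketched here in Python. This is an illustration of the principle only, not a description of Vertica's internals; the query strings and the helpers `prefix_matches` and `contains_matches` are hypothetical. A prefix pattern can jump to the first candidate and stop at the first miss, while a pattern with a leading % has to inspect every value.

```python
from bisect import bisect_left

def prefix_matches(sorted_values, prefix):
    """Values starting with `prefix`, located by binary search on sorted data."""
    start = bisect_left(sorted_values, prefix)
    matches = []
    examined = 0
    for value in sorted_values[start:]:
        examined += 1
        if not value.startswith(prefix):
            break  # sorted input: no later value can match
        matches.append(value)
    return matches, examined

def contains_matches(values, fragment):
    """Values containing `fragment` anywhere; every value must be examined."""
    matches = [v for v in values if fragment in v]
    return matches, len(values)

queries = sorted([
    "alter table t", "delete from t", "insert into t select 1",
    "select 1", "select 2", "update t",
])

prefix_hits, prefix_examined = prefix_matches(queries, "select")
contains_hits, contains_examined = contains_matches(queries, "select")
print(prefix_examined, contains_examined)
```

Note that the two patterns are not equivalent: the contains search also returns the insert statement. Switch to the prefix form only when that narrower result is what you actually want.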


Query Experimentation

• Try WITH CLAUSE materialization

• SELECT add_vertica_options('OPT', 'ENABLE_WITH_CLAUSE_MATERIALIZATION') ;

• SELECT clr_vertica_options('OPT', 'ENABLE_WITH_CLAUSE_MATERIALIZATION') ;

• Play with the Syntactic Optimizer

• SELECT /* +syntactic_optimizer */ col1, col2 from table1 …

• Try adding an ORDER BY clause into your sub-query, especially if it forces an outer to get a MERGE JOIN; it may be worth the cost of the sort

• Break the query up into sub-queries - Vertica can only choose one projection for each query, so having multiple sub-queries in a SQL can sometimes provide the optimizer with additional options

• Familiarize yourself with the analytic functions - perhaps you've coded something brute force that an analytic can solve more elegantly


Pinned Projections

Pro Tip:

• CREATE GLOBAL TEMP TABLE foo(i int, j int) NO PROJECTION;

• CREATE PROJECTION foo_p(i, j) AS select i, j from foo order by j PINNED ;

Creates a temp table ONLY on the initiator node

Useful for staging tables and working tables

VERY fast

Temporary tables are faster than regular ones, whether they are Pinned or not.


INTERNALS


Resource Manager

Take advantage of the Resource Manager

• Increase the PlannedConcurrency in order to decrease budgeted RAM

− Different Resource Pools may have different footprints

− Compare the execution_engine_profile's counter_name values for 'memory allocated' and 'memory reserved'

• Lower ExecutionParallelism

− If hyper-threading is enabled, the defaults here are very high because they are based on physical core counts, not logical

− Even if hyper-threading is off, it should usually be about 2/3rds of the default

− Take advantage of Cascading Pools

− If a query is important, raise the Priority to HIGH


Problems can adversely affect performance

System Health

• Check for projection skew - projections should be evenly distributed - remember that Vertica is only as fast as the slowest node

• Make sure statistics have been analyzed and are current

• Remove all the delete vectors - they can have a profound negative impact on system performance

• Check ROS fragmentation - improper loading methodologies can create ROS fragmentation, which can create performance problems

• Make sure SEQUENCE cache refill sizes are reasonable - the default is 250,000 - setting it lower can create excessive catalog locking

• When in doubt, leave it to the professionals - a Vertica HealthCheck evaluates over 130 different audit points. If we can't find your bottleneck, no one can!


Catalog

Catalog bloat can be a real problem

• Keep the number of objects to a minimum

• Empty tables and unused tables should be removed

− Tables that have no projections

• Unused projections should be removed

− Check the projection_usage table

• Partitions should be kept to a reasonable number

• Keep delete vectors in check

• Large clusters should segment everything

• Turn on Catalog Compression


Useful Knobs

Pro Tip:

• NewEEGroupBySmallMemMB - set to 16

− Increases RAM allocated for Group Bys. Improves performance

• MaxOptMemMB - increase to 200

− Amount of memory to allocate to optimizer. Some large queries require more and fail

• GlobalEEProfiling - set to 0

− Turn off Global Execution Engine Profiling - too costly

If you have lots of RAM, try increasing:

• NewEEROSSubdivisionRows

• MaxDesiredEEBlockSize

• GBHashMemCapMB


HARDWARE


Disks

Should have as many disks as physical cores

10k RPM SAS at least.

SSDs are fast, but very expensive

Catalog and Data on separate mounts

Avoid LVM - not fully supported

EXT4 filesystem

Increase ReadAhead to 4096 or 8192

When in doubt, throw more hardware at it


Memory & CPU

Memory:

• 8GB per physical core recommended

• Identical across servers

CPU:

• Greater than 2500 MHz

• At least 8 cores

• Frequency Scaling disabled


Network

10Gbit network preferred

Bonded

Private - keep your Vertica cluster isolated

Large clusters might require control nodes

QUESTIONS?

Please attend our Q&A with HP Big Data experts today

Marina Ballroom, Lobby level

10:15 am - 10:30 am

12:00 pm - 1:00 pm

2:30 pm - 3:00 pm

4:30 pm - 5:00 pm
