© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.1 SEIZE THE DATA. 2015
VERTICA PERFORMANCE TUNING
Practical Lessons
Curtis Bennett, Vertica Professional Services
August 10, 2015
Logical / Physical Modeling
Denormalize!
It's the Data Model, Stupid!
• Do not offload your relational, third-normal-form model straight into Vertica. It is a bad idea
• Get rid of any 1:1 relationships that exist only because a table got too wide in your row-oriented database
• It may be advantageous to separate out "intelligent" fields such as credit cards or phone numbers to increase the amount of RLE
− where phone_number = '123-456-7890'
becomes
where area_code = '123' and phone_number = '123-456-7890'
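Why the split helps RLE can be sketched in a few lines. This is a conceptual illustration in Python, not Vertica code; the sample phone numbers and the rle_runs helper are made up for the example. A sorted column of full phone numbers has one run per number, while the extracted area code collapses into a handful of runs.

```python
from itertools import groupby

def rle_runs(values):
    """Number of (value, run_length) pairs needed to run-length encode values."""
    return sum(1 for _ in groupby(values))

# Hypothetical sample data: ten phone numbers spread over two area codes.
phones = sorted(f"{area}-555-{n:04d}" for area in ("123", "456") for n in range(5))
area_codes = [p.split("-")[0] for p in phones]

print(rle_runs(phones))      # one run per distinct phone number
print(rle_runs(area_codes))  # one run per area code
```

A predicate on area_code can then skip whole runs before the full phone_number comparison is evaluated.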
Partitioning
Benefits of Partitioning include:
• Ability to drop partitions easily
− No delete vectors
− Very fast
• Ignore huge chunks of irrelevant information, known as "Partition Pruning"
• Take Advantage of some powerful functions:
− SWAP_PARTITIONS_BETWEEN_TABLES
− MOVE_PARTITIONS_TO_TABLE
• Increased optimizer parallelism
Partitioning by dates works best. Avoid complex partition keys such as modulus values.
Finding something by knowing where it is not
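Partition pruning can be pictured as follows. This is a toy model in Python, not how Vertica stores data; the monthly keys, the row labels and the scan function are hypothetical. The point is that the partition key alone is enough to skip irrelevant partitions without reading their rows.

```python
from datetime import date

# Hypothetical partitions keyed by month; each holds that month's rows.
partitions = {
    date(2015, 6, 1): ["row-a", "row-b"],
    date(2015, 7, 1): ["row-c"],
    date(2015, 8, 1): ["row-d", "row-e"],
}

def scan(partitions, wanted_month):
    """Return the matching rows and how many partitions were actually read."""
    read = 0
    rows = []
    for key, part_rows in partitions.items():
        if key != wanted_month:
            continue  # pruned: the partition key alone rules it out
        read += 1
        rows.extend(part_rows)
    return rows, read

rows, partitions_read = scan(partitions, date(2015, 8, 1))
print(rows, partitions_read)
```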
Pre-Join Projections
There are some very good use-cases
Pre-Join projections can fix inefficient GROUP BY operations when the GROUP BY list spans tables and you need a single set of ordered columns to get a pipelined GROUP BY (GROUPBY PIPELINED in the query plan)
Caveats:
• Slight load penalty
• Enforces referential integrity
• Cascades deletes (due to RI)
Live Aggregate Projections
• Can replace the creation of aggregation tables
• Projection which is maintained and aggregated on the fly as data is loaded
• Supports
− Count
− Max
− Min
− Sum
− Combinations of any of the above
• Restrictions apply, check the documentation
• See also Top-K projections and Projections with Expressions
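The idea of an aggregate that is maintained as data loads can be sketched like this. It is a conceptual Python model only; the LiveAggregate class and the sample batches are hypothetical and say nothing about how Vertica implements these projections. Each load folds new rows into the running count, sum, min and max per group, so a query reads the small aggregate rather than the raw rows.

```python
class LiveAggregate:
    """Keeps count/sum/min/max per group, updated as each batch loads."""

    def __init__(self):
        self.groups = {}

    def load(self, batch):
        for key, value in batch:
            agg = self.groups.get(key)
            if agg is None:
                # First row for this group seeds every aggregate.
                self.groups[key] = {"count": 1, "sum": value, "min": value, "max": value}
            else:
                agg["count"] += 1
                agg["sum"] += value
                agg["min"] = min(agg["min"], value)
                agg["max"] = max(agg["max"], value)

agg = LiveAggregate()
agg.load([("east", 10), ("west", 4)])  # first load
agg.load([("east", 2)])                # later load updates the same groups
print(agg.groups)
```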
Some More Physical Model Tidbits
• Speaking of Referential Integrity, don't forget to create the Primary Key and Foreign Key constraints on your tables
− Prevents the optimizer from making a poor decision by flipping an Inner/Outer join
• Don't use data types that are larger than necessary
− NUMBER defaults to NUMBER(38), which consumes 3 binary words. NUMBER(37) consumes 2 binary words and thus would be slightly faster
− CHAR(1) is not as efficient as a BOOLEAN type
− Don't bloat your VARCHARs. If VARCHAR(200) is sufficient, don't make it VARCHAR(1000) just to be safe; you'll add excess processing time to your queries, as much as a 20% overhead
• Joining on INT types is WAY faster than joining on large CHAR values.
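The NUMBER(38) versus NUMBER(37) point is simple arithmetic. The sketch below assumes the sizing rule of one 64-bit word for the first 18 digits of precision plus one word for each additional 19 digits; that rule is an assumption taken from memory of the Vertica data type documentation and should be checked against your version. The numeric_words function is hypothetical.

```python
import math

def numeric_words(precision):
    """64-bit words for a NUMERIC of the given precision (assumed sizing rule)."""
    if precision <= 18:
        return 1
    # One word for the first 18 digits, one more per additional 19 digits.
    return 1 + math.ceil((precision - 18) / 19)

for p in (18, 37, 38):
    print(p, numeric_words(p))
```

Under that rule, dropping one digit of precision from 38 to 37 saves a full word per value.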
PROJECTIONS
Replication vs. Segmentation
• Replicate dimensions, Segment facts
• In a large cluster (somewhere north of 20 nodes), segment almost everything
• If speed is of paramount importance, replicate
− A single node cluster is faster than a 3 node cluster, but obviously doesn't scale
• Define your segmentation keys with simplicity and consistency in mind
− The key should be unique, or nearly unique
− The segmentation value should be consistently applied in order to facilitate local joins as often as possible
Remember that if left to its own devices, Vertica will choose to segment by default.
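Why consistent segmentation gives local joins can be shown with a small model. This is a Python sketch under stated assumptions: a 3-node cluster, zlib.crc32 standing in for Vertica's hash function, and made-up orders and customers tables. When both tables are segmented on the same key with the same function, matching rows land on the same node.

```python
import zlib

NODES = 3  # hypothetical cluster size

def node_for(key):
    """Map a segmentation key to a node with a stable hash (stand-in for Vertica's HASH)."""
    return zlib.crc32(str(key).encode()) % NODES

def segment(rows, key_index):
    """Distribute rows across nodes by hashing the column at key_index."""
    nodes = [[] for _ in range(NODES)]
    for row in rows:
        nodes[node_for(row[key_index])].append(row)
    return nodes

orders = [(order_id, order_id % 4) for order_id in range(8)]  # (order_id, customer_id)
customers = [(cid, f"name-{cid}") for cid in range(4)]        # (customer_id, name)

orders_by_node = segment(orders, key_index=1)        # segmented on customer_id
customers_by_node = segment(customers, key_index=0)  # segmented on customer_id

# Every order finds its customer on the same node, so the join moves no rows.
local = all(
    any(c[0] == o[1] for c in customers_by_node[n])
    for n in range(NODES)
    for o in orders_by_node[n]
)
print(local)
```

The sketch segments orders on a non-unique key only to keep the example small; the uniqueness advice above is about avoiding skew.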
The ORDER BY Clause
Two primary methods:
• ORDER BY low cardinality to high cardinality, ending with the primary key
− Promotes great RLE and generates good compression and performance
• ORDER BY for predicate-based lookups
− Lead with the columns used in predicates
− Predicate columns first for very fast lookups, then join columns
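The low-to-high cardinality rule can be checked by counting RLE runs. The sketch below is plain Python with a made-up two-column table; rle_runs and total_runs are hypothetical helpers. It sorts the same rows both ways and sums the runs across columns.

```python
from itertools import groupby

def rle_runs(values):
    """Number of runs a run-length encoder would emit for values."""
    return sum(1 for _ in groupby(values))

def total_runs(rows):
    """Sum of RLE runs over every column of an already sorted row set."""
    return sum(rle_runs(column) for column in zip(*rows))

# Hypothetical table: column 0 has 2 distinct values, column 1 has 5.
rows = [(i % 2, i % 5) for i in range(20)]

low_to_high = sorted(rows)                              # ORDER BY col0, col1
high_to_low = sorted(rows, key=lambda r: (r[1], r[0]))  # ORDER BY col1, col0

print(total_runs(low_to_high), total_runs(high_to_low))  # 12 15
```

The gap widens as row counts and cardinality differences grow; this tiny table only shows the direction of the effect.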
Don't Overlook Encoding!
Encoding & Compression
• Vertica supports nearly a dozen kinds of column encoding
• A powerful feature of the columnar architecture
• Having good compression can result in tremendous performance gains
• Don't be afraid to experiment with different encoding types if performance is critical
• Sometimes AUTO actually works really well
• Let the Database Designer decide which encoding to use
• Familiarize yourself with the function: DESIGNER_DESIGN_PROJECTION_ENCODINGS()
Database Designer & Correlations
Pro Tip:
Check out the function ANALYZE_CORRELATIONS()
In order to generate correlations, you must run Database Designer manually through API calls directly in VSQL.
See the following functions to get started:
• DESIGNER_CREATE_DESIGN
• DESIGNER_ADD_DESIGN_TABLES
• DESIGNER_SET_ANALYZE_CORRELATIONS_MODE
• DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY
Correlation statistics replace regular statistics, so be careful not to create regular statistics on a table where correlation statistics serve it better.
Miscellaneous
• Speaking of Database Designer - USE IT!
• Feel free to experiment with projection design if performance is critical
• Replace the table name with the projection name in the query to test different projection designs
• Projections that are no longer used should be removed -> fewer projections = fewer choices = slightly faster queries
Probably 90% of all performance-related problems are solvable with good projection design
EXPLAIN your queries. If performance is key, avoid HASH JOINS and HASH GROUP BYs.
Avoid RESEGMENTATION and BROADCAST at all costs!
Don't forget to update statistics!
QUERIES
What NOT to do
• Avoid IN() clauses whose sub-query produces a set of key values for use in the outer query
• Avoid UNION statements inside a sub-query
• Don’t select more than you need - in a columnar database, there is a cost associated with selecting lots of columns
• Don't go crazy with analytics when a simple aggregate will do
• Avoid inequality or negation predicates: !=, <>, >=, <= are all inefficient
• LIKE and ILIKE are slow. If possible, avoid % at the beginning of the string, e.g., use query ilike 'select%' instead of query ilike '%select%'
• Avoid OR, if possible
• Replace GROUP BY 1,2,3 or ORDER BY 1,2,3 with the actual column names, especially in production code
Query Experimentation
• Try WITH CLAUSE materialization
− SELECT add_vertica_options('OPT', 'ENABLE_WITH_CLAUSE_MATERIALIZATION') ;
− SELECT clr_vertica_options('OPT', 'ENABLE_WITH_CLAUSE_MATERIALIZATION') ;
• Play with the Syntactic Optimizer
− SELECT /* +syntactic_optimizer */ col1, col2 from table1 …
• Try adding an ORDER BY clause into your sub-query, especially if it forces an outer to get a MERGE JOIN; it may be worth the cost of the sort
• Break the query up into sub-queries - Vertica can only choose one projection for each query, so having multiple sub-queries in one SQL statement can sometimes provide the optimizer with additional options
• Familiarize yourself with the analytic functions - perhaps you've coded something brute force that an analytic can solve more elegantly
Pinned Projections
Pro Tip:
• CREATE GLOBAL TEMP TABLE foo(i int, j int) NO PROJECTION;
• CREATE PROJECTION foo_p(i, j)
AS select i, j from foo order by j PINNED ;
Creates a temp table ONLY on the initiator node
Useful for staging tables and working tables
VERY fast
Temporary tables are faster than regular ones, whether they are Pinned or not.
INTERNALS
Resource Manager
Take advantage of the Resource Manager
• Increase the PlannedConcurrency in order to decrease budgeted RAM
− Different Resource Pools may have different footprints
− Compare the execution_engine_profile's counter_name values for 'memory allocated' and 'memory reserved'
• Lower ExecutionParallelism
− If hyper-threading is enabled, the defaults here are very high because they are based on logical core counts, not physical
− Even if hyper-threading is off, it should usually be about 2/3rds of the default
− Take advantage of Cascading Pools
− If a query is important, raise the Priority to HIGH
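The link between PlannedConcurrency and budgeted RAM is roughly a division. The sketch below is a deliberate simplification that assumes the per-query budget is the pool's memory divided by PlannedConcurrency; the real calculation also depends on other pool settings. The 64 GB pool size and the query_budget_mb function are hypothetical.

```python
def query_budget_mb(pool_memory_mb, planned_concurrency):
    """Approximate per-query memory budget for a pool (simplified assumption)."""
    return pool_memory_mb / planned_concurrency

pool_memory_mb = 64 * 1024  # hypothetical 64 GB pool

for concurrency in (8, 16, 32):
    # Doubling PlannedConcurrency halves the memory each query is budgeted.
    print(concurrency, query_budget_mb(pool_memory_mb, concurrency))
```

Comparing the budget with what queries actually reserve and allocate shows whether the budget is oversized for the workload.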
Problems can adversely affect performance
System Health
• Check for projection skew - projections should be evenly distributed - remember that Vertica is only as fast as the slowest node
• Make sure statistics have been analyzed and are current
• Remove all the delete vectors - they can have a profound negative impact on system performance
• Check ROS fragmentation - improper loading methodologies can create ROS fragmentation, which can create performance problems
• Make sure SEQUENCE cache refill sizes are reasonable - the default is 250,000 - setting it lower can create excessive catalog locking
• When in doubt, leave it to the professionals - a Vertica HealthCheck evaluates over 130 different audit points. If we can't find your bottleneck, no one can!
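The projection skew check from the list above boils down to comparing per-node row counts. This is a small Python sketch; the node names, row counts and the skew_ratio metric are made up for illustration, and in practice the counts would come from the system tables.

```python
def skew_ratio(rows_per_node):
    """Largest node's row count relative to the mean; 1.0 means perfectly even."""
    mean = sum(rows_per_node.values()) / len(rows_per_node)
    return max(rows_per_node.values()) / mean

even = {"node01": 1000, "node02": 1000, "node03": 1000}
skewed = {"node01": 2400, "node02": 300, "node03": 300}

print(skew_ratio(even))    # 1.0
print(skew_ratio(skewed))  # 2.4
```

A ratio well above 1.0 means one node does most of the work, and the query is only as fast as that node.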
Catalog
Catalog bloat can be a real problem
• Keep the number of objects to a minimum
• Empty tables and unused tables should be removed
− Tables that have no projections
• Unused projections should be removed
− Check the projection_usage table
• Partitions should be kept to a reasonable number
• Keep delete vectors in check
• Large clusters should segment everything
• Turn on Catalog Compression
Useful Knobs
Pro Tip:
• NewEEGroupBySmallMemMB - set to 16
− Increases RAM allocated for Group Bys. Improves performance
• MaxOptMemMB - increase to 200
− Amount of memory to allocate to optimizer. Some large queries require more and fail
• GlobalEEProfiling - set to 0
− Turn off Global Execution Engine Profiling - too costly
If you have lots of RAM, try increasing:
• NewEEROSSubdivisionRows
• MaxDesiredEEBlockSize
• GBHashMemCapMB
HARDWARE
Disks
Should have as many disks as physical cores
10k RPM SAS at least.
SSDs are fast, but very expensive
Catalog and Data on separate mounts
Avoid LVM - not fully supported
EXT4 filesystem
Increase ReadAhead to 4096 or 8192
When in doubt, throw more hardware at it
Memory & CPU
Memory:
• 8GB per physical core recommended
• Identical across servers
CPU:
• Greater than 2500 MHz
• At least 8 cores
• Frequency Scaling disabled
Network
10Gbit network preferred
Bonded
Private - keep your Vertica cluster isolated
Large clusters might require control nodes
QUESTIONS?
Please attend our Q&A with HP Big Data experts today
Marina Ballroom, Lobby level
10:15 am - 10:30 am
12:00 pm - 1:00 pm
2:30 pm - 3:00 pm
4:30 pm - 5:00 pm