
Oracle SQL Parallel Execution

An Oracle White Paper
June 2008


CONTENTS

Executive Summary
Introduction
Why Parallel Execution?
    The ultimate goal: scalability
    Shared everything – the Oracle advantage
Fundamental Concepts of Oracle's Parallel Execution
    Processing parallel SQL statements
        Query Coordinator (QC) and parallel servers
        Producer/consumer model
        Granules
        Data redistribution
    Enabling parallel execution in Oracle
    Controlling SQL Parallel Execution in Oracle
        Understand your target workload
        Controlling the degree of parallelism
        Controlling the usage of parallelism
Oracle SQL Parallel Execution best practices
    Start with a balanced system
        Calibrate your configuration
        Stripe And Mirror Everything (S.A.M.E.) – use ASM
    Set database initialization parameters for good performance
        Memory allocation
        Controlling parallel servers
        Enabling efficient I/O throughput
    Use parallel execution with common sense
        Don't enable parallelism for small objects
        Use parallelism to achieve your goals, not to exceed them
        Avoid using hints
    Combine parallel execution with Oracle Partitioning
    Ensure statistics are good enough
    Monitor parallel execution activity
    Whether or not to use parallel execution in RAC
    Use Database Resource Manager
    Don't try to solve hardware deficiencies with other features
    Don't ignore other features
Monitoring SQL Parallel Execution
    (G)V$ parallel execution views
    Interpreting parallel SQL execution plans
        Parallel plan without partitioning
        Parallel plan with partitioning and partition-wise join
    Oracle Enterprise Manager
        Wait events
        Input/Output (I/O) monitoring
        Parallel execution monitoring
        SQL monitoring
Upgrade considerations coming from Oracle Database 9i
    More parallel operations
        If on Oracle Database 9i you used hints to enable SQL parallel execution
        If on Oracle Database 9i you used session settings to enable SQL parallel execution
        If on Oracle Database 9i you used object level settings to enable SQL parallel execution
    Execution plan changes
    Changes in database defaults
    Use Resource Manager
Conclusion


EXECUTIVE SUMMARY

Parallel execution is one of the fundamental database technologies that enable organizations to manage and access tens, if not hundreds, of terabytes of data. Without parallelism, these large databases, commonly used for data warehouses but increasingly found in operational systems as well, would not exist.

Parallel execution is the ability to apply multiple CPU and I/O resources to the execution of a single database operation. While every major database vendor today provides parallel capabilities, there remain key differences in the architectures provided by the various vendors.

SQL parallel execution was first introduced in Oracle more than a decade ago1 and has been enriched and improved ever since. This paper discusses the parallel execution architecture of Oracle Database 11g and shows its superiority over alternative architectures for real-world applications. This paper also touches on how to control and monitor parallel execution; lastly, it gives insight into upgrade considerations when migrating from earlier versions of Oracle.

While the focus of this paper is on Oracle Database 11g, the fundamental concepts are also applicable to earlier versions of Oracle.

INTRODUCTION

Databases today, irrespective of whether they are data warehouses, operational data stores, or OLTP systems, contain a wealth of information. However, finding and presenting the right information in a timely fashion can be a challenge because of the vast quantity of data involved.

Parallel execution is the capability that addresses this challenge. Using parallelism, terabytes of data can be processed in minutes or even less, not hours or days. Parallel execution uses multiple processes to accomplish a single task – to complete a SQL statement in the case of SQL parallel execution. The more effectively the database software can leverage all hardware resources – multiple cores, multiple I/O channels, or even multiple nodes in a cluster - the more efficiently queries and other database operations will be processed.

1 Parallel execution was first introduced in Oracle Version 7.3 in 1996


Examples of resource-intensive database operations include:

– Large (long-running) queries: for example data warehouse analysis comparing one year's results with the results of the year prior

– Building indexes on large tables

– Gathering statistics in a large database

– Loading a large amount of data into a database

– Taking a database backup

Large data warehouses should always use parallel execution to achieve good performance. Specific operations in OLTP applications, such as batch operations, can also significantly benefit from parallel execution. Oracle SQL parallel execution requires Oracle Database 11g Enterprise Edition.

The paper covers four main topics:

– The first section discusses the fundamental concepts of parallel processing in the Oracle Database; the reader will become familiar with Oracle's parallel architecture, learn Oracle-specific terminology around parallel execution, and understand the basics of how to control and identify parallel SQL processing.

– The second section focuses on best practices around parallel execution to ensure optimal usage of your hardware resources.

– The third section provides an insight into how to monitor an environment using parallel execution, leveraging either SQL or Oracle Enterprise Manager Database/Grid Control.

– The fourth section focuses on upgrade considerations when migrating an environment from an earlier release of Oracle to Oracle Database 11g.


WHY PARALLEL EXECUTION?

The ultimate goal: scalability

Imagine that your task is to count the number of cars in your street.

– Scenario 1: You can go through the street by yourself and count the number of cars.

– Scenario 2: If your friend is available then the two of you could start on opposite ends of the street, count cars until you meet each other and add the results of both counts to complete the task.

Assuming your friend counts as fast as you do, you expect to complete the task of counting all cars in the street in approximately half the time it would take you to do the job by yourself. If this is the case, then your operation scales linearly: 2x the number of resources halves the total processing time.

The database is not very different from the car counting example. If you allocate twice the number of resources and achieve a processing time that is half of what it was with the original amount of resources, then the operation scales linearly. Figure 1 below shows graphically how the processing time decreases for a linearly scalable operation.

The graph may not look linear at first glance, but look again: it shows the absolute processing time, not a relative speedup factor. For example, using 2x the resources reduces the processing time from 360 to 180, and going from 2x to 4x reduces it further from 180 to 90; both are cases of linear scalability. It is just that the absolute performance gain decreases with a higher number of resources; we will come back to this in the best practices section.

Figure 1: Processing time as a function of resources for linear scalability (relative processing time on the vertical axis, resources in units of x on the horizontal axis; data points: 1x = 360, 2x = 180, 3x = 120, 4x = 90, 5x = 72, 6x = 60, 7x = 51.43, 8x = 45, 9x = 40, 10x = 36).

Now imagine your friend gets tired easily and has to rest regularly throughout the job. The total amount of time it takes to count all cars still goes down, but doubling the resources does not halve the processing time; maybe you spend two thirds of the original processing time to complete the task. In this case the operation does not scale as well: doubling the resources does not give the expected linear reduction in processing time.

In a database there are multiple components involved in processing a query, each having its own maximum processing power. Most notably CPUs, memory and Input/Output (I/O) all have to collaborate. For database processing you may experience a lack of scalability if you do not allocate resources in the correct quantities across the various components. For example, if you add CPU resources but do not add I/O resources, the CPUs may not be able to retrieve the data fast enough to keep processing at full speed.

Shared everything – the Oracle advantage

Traditionally, two approaches have been used for the implementation of parallel execution in database systems. The main differentiation is whether or not the physical data layout is used as a base – and static pre-requisite – for dividing, thus parallelizing, the work.

These fundamental approaches are known as shared everything architecture and shared nothing architecture.

In a shared nothing system CPU cores are solely responsible for individual data sets, and the only way to access a specific piece of data is to use the CPU core that owns this subset of data2; such systems are also commonly known as MPP (Massively Parallel Processing) systems. In order to achieve a good workload distribution MPP systems have to use a hash algorithm to distribute (partition) data evenly across the available CPU cores. As a result MPP systems introduce mandatory, fixed parallelism in order to perform operations that involve table scans; this fixed parallelism relies completely on static data partitioning done at database or object creation time. Most non-Oracle data warehouse systems are MPP systems.

2 Some implementations allow a small, static number of cores as the smallest unit; for the sake of simplicity we discuss them as one core, as the architectural trade-offs are identical.

Figure 2: Shared everything versus shared nothing

Thanks to Oracle's shared everything architecture the Oracle Database does not require any pre-defined data partitioning to enable parallelism. Oracle can parallelize almost every operation, independent of the underlying data distribution. If, however, the data has been pre-partitioned (using Oracle Partitioning), Oracle can use the same optimizations and algorithms shared nothing vendors claim.

Oracle's shared everything architecture enables flexible parallel execution and high concurrency without overloading the system, providing a superset of the parallel execution capabilities of shared nothing vendors.


FUNDAMENTAL CONCEPTS OF ORACLE'S PARALLEL EXECUTION

The Oracle Database provides functionality to perform a complex task in parallel, without manual intervention. Operations that can be executed in parallel include:

– SQL loader and SQL-based data loads

– Queries

– RMAN backups

– Index builds

– Gathering statistics

– And more

This paper focuses on SQL parallel execution only, which consists of parallel query, parallel DML (Data Manipulation Language) and parallel DDL (Data Definition Language). While the paper focuses on Oracle Database 11g, the information also applies to Oracle Database 10g and higher, unless explicitly stated otherwise.

Processing parallel SQL statements

When you execute a SQL statement in the Oracle Database it is decomposed into individual steps (a.k.a. rowsources), identified as separate lines in an execution plan. Below is an example of a simple serial SQL statement and its execution plan. The statement returns the total number of customers in the CUSTOMERS table:

select count(*) from customers c;

----------------------------------------
| Id  | Operation          | Name      |
----------------------------------------
|   0 | SELECT STATEMENT   |           |
|   1 |  SORT AGGREGATE    |           |
|   2 |  TABLE ACCESS FULL | CUSTOMERS |
----------------------------------------

Figure 3: customer count, serial plan
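Plans like the one in Figure 3 can be generated, for example, with EXPLAIN PLAN and the DBMS_XPLAN package. The following is a minimal sketch, assuming the CUSTOMERS table from the example schema used throughout this paper exists in your schema:

-- explain the statement without executing it
explain plan for
select count(*) from customers c;

-- display the plan that was just explained
select * from table(dbms_xplan.display);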


A second serial example, showing all customer purchase information, is shown below:

select c.name, s.purchase_date, s.amount
from customers c, sales s
where s.customer_id = c.id ;

----------------------------------------
| Id  | Operation          | Name      |
----------------------------------------
|   0 | SELECT STATEMENT   |           |
|*  1 |  HASH JOIN         |           |
|   2 |   TABLE ACCESS FULL| CUSTOMERS |
|   3 |   TABLE ACCESS FULL| SALES     |
----------------------------------------

Figure 4: customer purchase information, serial plan

If you execute a statement in parallel (via mechanisms described later), the Oracle Database will parallelize as many of the individual steps in the execution plan as possible and reflect this in the execution plan. The two plans shown above change as follows:

-------------------------------------------------------------------------------
| Id  | Operation              | Name      |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |           |        |      |            |
|   1 |  SORT AGGREGATE        |           |        |      |            |
|   2 |   PX COORDINATOR       |           |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000  |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |           |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |           |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| CUSTOMERS |  Q1,00 | PCWP |            |
-------------------------------------------------------------------------------

Figure 5: customer count, parallel plan

-----------------------------------------------------------------------------------------
| Id  | Operation               | Name      |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |           |        |      |            |
|   1 |  PX COORDINATOR         |           |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10001  |  Q1,01 | P->S | QC (RAND)  |
|   3 |    HASH JOIN            |           |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE          |           |  Q1,01 | PCWP |            |
|   5 |      PX SEND BROADCAST  | :TQ10000  |  Q1,00 | P->P | BROADCAST  |
|   6 |       PX BLOCK ITERATOR |           |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| CUSTOMERS |  Q1,00 | PCWP |            |
|   8 |     PX BLOCK ITERATOR   |           |  Q1,01 | PCWC |            |
|   9 |      TABLE ACCESS FULL  | SALES     |  Q1,01 | PCWP |            |
-----------------------------------------------------------------------------------------

Figure 6: customer purchase information, parallel plan


These plans look quite a bit different than before, mainly because of additional "logistical" processing steps introduced by the parallel processing3.

SQL parallel execution in the Oracle Database is based on a few fundamental concepts. The following sections discuss these concepts; they will help you understand the parallel execution setup in your database and read the basics of parallel SQL execution plans.

Query Coordinator (QC) and parallel servers

SQL parallel execution in the Oracle Database is based on the principles of a coordinator (often called the Query Coordinator – QC for short) and parallel servers. The QC is the session that initiates the parallel SQL statement and the parallel servers are the individual sessions that perform work in parallel. The QC distributes the work to the parallel servers and may have to perform a minimal – mostly logistical – portion of the work that cannot be executed in parallel. For example a parallel query with a SUM() operation requires adding the individual sub-totals calculated by each parallel server.

The QC is easily identified in the parallel execution plans above as 'PX COORDINATOR' (for example ID 1 in Figure 6 shown above). The process acting as the QC of a parallel SQL operation is the actual user session process itself.

The parallel servers are taken from a pool of globally available parallel server processes and assigned to a given operation (the setup is discussed in a later section). All the work shown in a parallel plan BELOW the QC in our sample parallel plans (Figure 5, Figure 6) is done by the parallel servers.

Parallel server processes can be easily identified at the OS level; for example, on Linux they are the oracle processes ora_p***:

oracle 23473 1 0 17:46 ? 00:00:00 ora_p000_linux111
oracle 23475 1 0 17:46 ? 00:00:00 ora_p001_linux111
oracle 23477 1 0 17:46 ? 00:00:00 ora_p002_linux111
oracle 23479 1 0 17:46 ? 00:00:00 ora_p003_linux111
...

Figure 7: parallel server processes seen at the OS level using 'ps -ef'

3 Parallel plans will look different in versions prior to Oracle Database 10g.
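From inside the database you can see the same pool of parallel server processes through the V$ views; a minimal sketch, assuming you have the privileges to query them:

select server_name, status, pid, spid
from v$px_process
order by server_name;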

Going back to the example of counting the cars, there would be a third person – the QC – telling you and your friend – the two parallel servers – to go ahead and count the cars; this is equivalent to the operation with ID 2 in Figure 5, and is illustrated in Figure 8.

– You can do on the road exactly what is being done internally in the database with the SQL and execution plan shown in Figure 5: you and your friend go ahead and count only the cars on your own side of the road; this is equivalent to the operations with ID 4, ID 5, and ID 6, where ID 5 is the equivalent of telling each of you to count only the cars on your side of the road (details follow in the granules section).

– Finally, each of you tells the third person your individual subtotal (ID 3) and he adds them up to the final result (ID 1). This is the hand-over from the parallel servers (the processes doing the actual work) to the QC for the final "assembly" of the result before returning it to the user process.

Figure 8: QC and parallel servers (the parallel servers retrieve the subtotals and the QC adds them up)

Producer/consumer model

Continuing with our car counting example, imagine the job is to count the total number of cars per car color. If you and your friend each cover one side of the road, each of you potentially sees the same colors and gets a subtotal per color, but not the complete result for the street. You could go ahead, memorize all this information and report it back to the third person (the "person in charge"), but this poor individual would then have to sum up all of the results by himself – what if all cars in the street had a different color? The third person would redo exactly the same work as you and your friend. To parallelize the counting, you ask two more friends to help you out4. They both walk in the middle of the road, one of them keeping count of all dark colors, the other one of all bright colors (assuming this "car color separation" splits the information approximately in half). Whenever you count a new car, you tell the person that is in charge of this color about the new encounter – you produce the information and redistribute it based on the color, and the color counter consumes the information. At the end, both color-counting friends report their results to the person in charge and you are done; we had two sets, each with two friends doing a part of the job, working hand in hand.

That's how the database works: in order to execute a statement in parallel efficiently, sets of parallel servers work in pairs: one set produces rows (the producers) and one set consumes the rows (the consumers).

For example for the parallel join between the SALES and CUSTOMERS tables, rows have to be redistributed based on the join key to make sure that matching join keys from both tables are sent to the same parallel server process doing the join. In this example one set of parallel servers reads and sends the data from table CUSTOMERS (producer) and another set receives the data (consumer) and joins it with table SALES, as shown in Figure 9.

Operations (rowsources) that are processed by the same set of parallel servers can be identified in an execution plan by looking at the 'TQ' column. As shown in Figure 9, the first slave set (Q1,00) is reading table CUSTOMERS in parallel and producing rows that are sent to slave set 2 (Q1,01), which consumes these records and joins them with table SALES. Whenever data is distributed from producers to consumers you will also see an entry of the form :TQxxxxx (Table Queue x) in the 'NAME' column. Please disregard the content of the other columns for now.

4 Note that the number of additional friends is not related to the number of distinct car colors, but matches exactly the number of people that are counting cars. We want to use our additional friends in the most optimal manner and – assuming that all "car scanners" deliver equally distributed incremental results on a continuous basis – having as many "car color counters" keeps them continuously busy as well; using more friends to count the car colors would leave all three of them without work for 30% of their time (on average).

Figure 9: Producer and consumer (slave set 1 "produces" rows from table CUSTOMERS; slave set 2 "consumes" the records and joins them with table SALES)

This has a very important consequence for the number of parallel servers that are spawned for a given parallel operation: the producer/consumer model expects two sets of parallel servers (a.k.a. slave sets) for a parallel operation, so the number of parallel server processes is twice the requested Degree Of Parallelism (DOP, the number of parallel servers working on an individual task). For example, if the parallel join in Figure 9 runs with a DOP of 4, then 8 parallel server processes will be used for this statement.

The only case when parallel servers do not work in pairs is if the statement is so basic that one set of parallel servers can complete the entire statement in parallel. For example select count(*) from customers requires only one parallel server set (see Figure 5).

Granules

A granule is the smallest unit of work when accessing data. Oracle Database uses a shared everything architecture, which from a storage perspective means that any CPU core in a configuration can access any piece of data; this is the most fundamental architectural difference between Oracle and all other major database products on the market. Unlike all other systems, Oracle can – and will - choose this smallest unit of work solely dependent on a query's requirements.

The basic mechanism the Oracle Database uses to distribute work for parallel execution is block ranges on disk – so-called block-based granules. This methodology is unique to Oracle and does not depend on whether the underlying objects have been partitioned. Access to the underlying objects is divided into a large number of granules, which are handed out to parallel servers to work on (and when a parallel server finishes the work for one granule the next one is handed out). The number of granules is always much higher than the requested DOP in order to get a good distribution of the workload between the parallel servers. As the first parallel step of the parallel processing, the operation 'PX BLOCK ITERATOR' shown in Figure 10 literally is an iteration over all generated block-range granules.

Figure 10: Block-based granules in the customer count example; 'Block Iterator' is the operation name for block-based granules.

Although block-based granules are the basis to enable parallel execution for almost any operation, there are some operations that can benefit from the underlying data structure and leverage individual partitions as granules of work. With partition-based granules only one parallel server performs the work for all data in a single partition. The Oracle Optimizer considers partition-based granules if the number of (sub)partitions accessed in the operation is at least equal to the DOP (and ideally much higher if there may be skew in the sizes of the individual (sub)partitions). The most common operations that use partition-based granules are partition-wise joins which will be discussed later.

Based on the SQL statement and the degree of parallelism, the Oracle Database decides whether block-based or partition-based granules lead to a more optimal execution; you cannot influence this behavior.

In the car counting example, one side of the street – or even a block of a long street – could be considered the equivalent of a block-based granule. The existing data volume – the street – is subdivided into physical pieces on which the parallel servers – you and your friend – work independently.

Data redistribution

Parallel operations – except for the most basic ones – typically require data redistribution. Data redistribution is required in order to perform operations such as parallel sorts, aggregations and joins. At the block-granule level there is no knowledge about the actual data content of an individual granule. Data has to be redistributed as soon as a subsequent operation relies on the actual content. Remember the last car example? The car color mattered, but you do not know – or even control – which color car is parked where on the street. You redistributed the information about the number of cars per color to the two additional friends based on their color responsibility, enabling them to do the total counting for the colors they are in charge of.

Data redistribution takes place between individual parallel servers either within a single machine, or, in the case of parallel execution across multiple machines in a Real Application Clusters (RAC) database, between parallel servers on multiple machines. Of course in the latter case interconnect communication is used for the data redistribution while shared-memory is used for the former.

Data redistribution is not unique to the Oracle Database. In fact, this is one of the most fundamental principles of parallel processing, being used by every product that provides parallel capabilities. The fundamental difference and advantage of Oracle's capabilities, however, is that parallel data access (discussed in the granules section earlier) and therefore the necessary data redistribution are not constrained by any given hardware architecture or database setup.

Shared-nothing (MPP) database systems also require data redistribution unless operations can take advantage of partition-wise joins (as explained further down in this section). In shared-nothing systems parallel operations that cannot benefit from a partition-wise join – such as a simple three-way table join on two different join keys - always make heavy use of interconnect communication. Because the Oracle Database also enables parallel execution within the context of a node, parallel operations do not always have to use interconnect communication, thus avoiding a potential bottleneck at the interconnect channel.

The following section will explain Oracle's data redistribution capabilities using the simple example of table joins without any secondary data structures, such as indexes or materialized views.

Serial join

In a serial join a single session reads both tables and performs the join. In this example we assume two large tables CUSTOMERS and SALES are involved in the join.

The database uses full table scans to access both tables. For a serial join the single serial session (red arrows) can perform the full join because all matching values from the CUSTOMERS table are read by one process. Figure 11 depicts the serial join5.

5 Please note that the figures in this section represent logical diagrams to explain data redistribution. In an actual database environment data would typically be striped across multiple physical disks, accessible to any parallel server. This complexity has deliberately been left out from the images.


Figure 11: Serial join based on two full table scans.


Parallel joins

When the same simple join is processed in parallel, a redistribution of rows becomes necessary. Parallel servers scan parts of either table based on block ranges, and in order to complete the join, rows have to be redistributed between the parallel servers. Figure 12 depicts the data redistribution for a parallel join at a DOP of 2, represented by the green and red arrows respectively. Both tables are read in parallel by both the red and the green process (using block-range granules) and then each parallel server has to redistribute its result set, based on the join key, to the subsequent parallel join operator.

Figure 12: Data redistribution for a simple parallel join.

There are many data redistribution methods. The following 5 are the most common ones:

– HASH: Hash redistribution is very common in parallel execution in order to achieve an equal distribution of work among the individual parallel servers based on a hash function. Hash (re)distribution is the basic parallel execution enabling mechanism for most data warehouse database systems, most notably MPP systems.

– BROADCAST: Broadcast redistribution happens when one of the two result sets in a join operation is much smaller than the other result set. Instead of redistributing rows from both result sets, the database sends the smaller result set to all parallel servers in order to guarantee that the individual servers are able to complete their join operation. The small result set may be produced in serial or in parallel.

– RANGE: Range redistribution is generally used for parallel sort operations. Individual parallel servers work on data ranges so that the QC does not have to do any sorting but only to present the individual parallel server results in the correct order.


– KEY: Key redistribution ensures that result sets for individual key values are clumped together. This is an optimization that is primarily used for partial partition-wise joins (see further down) to ensure that only one side of the join has to be redistributed.

– ROUND ROBIN: Round-robin data redistribution can be the final redistribution operation before sending data to the requesting process. It can also be used in an early stage of a query when no redistribution constraints are required.

As a variation on the data redistribution methods you may see a LOCAL suffix in a parallel execution plan on a Real Application Clusters (RAC) database. LOCAL redistribution is an optimization in RAC to minimize interconnect traffic for inter-node parallel queries. For example you may see a HASH LOCAL redistribution in an execution plan indicating that the row set is produced on the local node and only sent to the parallel servers on that node.

Data redistribution is shown in the SQL execution plan in the 'PQ Distrib' column. The execution plan for the simple parallel join is illustrated in Figure 13.

Figure 13: Data redistribution for a simple parallel join using a HASH redistribution on the join column for both tables.

Parallel partition-wise joins

If at least one of the tables accessed in the join has been partitioned on the join key the database may decide to use a partition-wise join. If both tables are equi-partitioned on the join key the database may use a full partition-wise join. Otherwise a partial partition-wise join may be used in which one of the tables is dynamically partitioned in memory followed by a full partition-wise join.

A partition-wise join does not require any data redistribution because individual parallel servers will work on the equivalent partitions of both joined tables.


As shown in Figure 14, the red parallel process reads data partition one of the CUSTOMERS table AND data partition one of the SALES table; the equi-partitioning of both tables on the join key guarantees that there will be no matching rows for the join outside of these two partitions. The red parallel process will always be able to complete the full join by reading just these matching partitions. The same is true for the green parallel server process, and for any pair of partitions of these two tables. Note that partition-wise joins use partition-based granules rather than block-based granules.

Figure 14: Full partition-wise joins do not require data redistribution.

The partition-wise join is the fundamental enabler for shared nothing systems. Shared nothing systems typically scale well as long as they can take advantage of partition-wise joins. As a result, the choice of the partitioning (distribution) key in a shared nothing system is critical, as is the access path to the tables. Operations that do not use partition-wise operations in an MPP system often do not scale well.
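To give the optimizer the option of a full partition-wise join between the example tables, both tables could be equi-partitioned on the join key. The following is an illustrative sketch only: the table names with the _P suffix, the choice of hash partitioning and the partition count of 16 are assumptions, not part of the example schema used in this paper.

-- create hash-partitioned copies of the example tables,
-- both partitioned on the join key with the same partition count
create table customers_p
partition by hash (id) partitions 16
as select * from customers;

create table sales_p
partition by hash (customer_id) partitions 16
as select * from sales;

Because both tables use the same number of hash partitions on the join key, a parallel join on s.customer_id = c.id between these two tables can be executed as a full partition-wise join.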

Enabling parallel execution in Oracle

Consider the following example. Your database stores historical sales data and customer data. Following are the relevant table definitions:

SQL> desc customers
 Name              Null?    Type
 ----------------- -------- ------------
 ID                NOT NULL NUMBER(38)
 NAME              NOT NULL VARCHAR2(60)
 YEAR_OF_BIRTH              NUMBER(38)
 EMAIL_ADDRESS              VARCHAR2(50)
 STREET_NUMBER              VARCHAR2(10)
 STREET_NAME                VARCHAR2(60)
 CITY                       VARCHAR2(60)
 STATE_PROVINCE             VARCHAR2(40)
 ZIP_CODE                   VARCHAR2(10)
 COUNTRY           NOT NULL VARCHAR2(40)


SQL> desc sales
 Name              Null?    Type
 ----------------- -------- ------------
 PURCHASE_DATE     NOT NULL DATE
 ITEM_ID           NOT NULL NUMBER(38)
 CUSTOMER_ID       NOT NULL NUMBER(38)
 STORE_ID                   NUMBER(38)
 QUANTITY                   NUMBER(38)
 AMOUNT            NOT NULL NUMBER(7,2)
 TAX                        NUMBER(7,2)

The tables are initially not partitioned and there are no indexes on the tables.

You want to know the total revenue for the last two months of 2007 in the United States, by state. The following query retrieves this result:

select c.state_province, sum(s.amount) revenue
from customers c, sales s
where s.customer_id = c.id
and s.purchase_date between to_date('01-NOV-2007','DD-MON-YYYY')
                        and to_date('31-DEC-2007','DD-MON-YYYY')
and c.country = 'United States of America'
group by c.state_province
/

You run the query without enabling parallel execution and let's say it takes 10 minutes to execute the query.

The end user who runs the query expects a faster response time (less than 3 minutes) and one way to achieve this, assuming there are surplus resources available, is to execute in parallel.

By default the Oracle Database is configured to support parallel execution out-of-the-box. The most relevant database initialization parameters are:

– parallel_max_servers: the maximum number of parallel servers that can be started by the database instance. In order to execute an operation in parallel, parallel servers must be available (i.e. not in use by another parallel operation). By default the value for parallel_max_servers is derived from other database settings and will be discussed later in this paper. Going back to the example of counting cars and using help from friends: parallel_max_servers is the maximum number of friends that you can call for help.

– parallel_min_servers: the minimum number of parallel servers that are always started when the database instance is running. parallel_min_servers enables you to avoid any delay in the execution of a parallel operation caused by the startup of parallel servers. Again going back to the example of counting cars: parallel_min_servers is the number of friends that are already there with you, whom you do not have to call in order to start the job of counting the cars.

Verify that parallel execution is enabled for your database instance (connect to the database as a DBA or SYSDBA):

SQL> show parameter parallel_max_servers

NAME                  TYPE        VALUE
--------------------- ----------- --------
parallel_max_servers  integer     80
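If you want a number of parallel servers to be pre-started and immediately available, you can raise parallel_min_servers; the following is a sketch only, and the value of 16 is an arbitrary assumption:

SQL> alter system set parallel_min_servers = 16;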

There are three ways to enable a query to execute in parallel.

1) Enable the table(s) for parallel execution:

alter table sales parallel ;
alter table customers parallel ;

Use this method if you generally want to execute operations accessing these tables in parallel.

2) Use a parallel hint:

select /*+ parallel(c) parallel(s) */
       c.state_province, sum(s.amount) revenue
from customers c, sales s
where s.customer_id = c.id
and s.purchase_date between to_date('01-JAN-2007','DD-MON-YYYY')
                        and to_date('31-DEC-2007','DD-MON-YYYY')
and c.country = 'United States'
group by c.state_province
/

This method is mainly useful for testing purposes, or if you have a particular statement or few statements that you want to execute in parallel, but most statements run in serial.

3) Use an alter session command:

alter session force parallel query ;

This method is useful if your application always runs in serial except for this particular session that you want to execute in parallel. A batch operation in an OLTP application may fall into this category.


All three of these methods enable the so-called DEFAULT parallel capabilities, where Oracle chooses the DOP. By default, Oracle will spawn two parallel server processes per CPU core on most systems, so if you run your query on an environment with 2 CPU cores the DOP will be 4. The original query, which initially took 10 minutes to complete, should then complete in less than 3 minutes at a DOP of 4, assuming sufficient resources are available.

Controlling SQL Parallel Execution in Oracle

Now that you know how to enable parallel execution and understand the concepts behind Oracle's parallel execution model, you may wonder where the limits of parallel processing lie. Obviously, you can use more resources to speed up response times, but if too many operations take this approach, the system may soon be starved for resources – you cannot use more resources than you have.

Oracle Database has built-in limits and settings to prevent system overload and ensure the database remains available to applications. The database initialization parameter parallel_max_servers is a good example of one of these limits. All processes in the database require resources: memory, and, while active, CPU and I/O. The system will not allocate more parallel servers to users than the setting of this initialization parameter allows.

Understand your target workload

Parallel execution can enable a single operation to utilize all system resources. While this may not be a problem in certain scenarios there are many cases in which this would not be desirable. Consider the workload to which you want to apply parallel execution to get optimum use of the system while satisfying your requirements.

Single-user workload

The single-user workload is a workload in which there is a single operation executing on the database and the objective is for this operation to finish as fast as possible. An example for this type of workload is a large overnight batch load that populates database tables or gathers statistics. Also benchmark situations often measure maximum performance in a single-user workload.

In a single-user workload all resources can be allocated to improve performance for the single operation.

Multi-user concurrent workload

Most production environments have a multi-user workload. Users concurrently execute queries – often ad-hoc type queries – and/or concurrent data load operations take place.

In a multi-user environment, workload resources must be divided amongst concurrent operations. End-users will expect a fair amount of resources to be allocated to their operation in order to get predictable response times.


Controlling the degree of parallelism

Oracle's parallel execution framework enables you to either explicitly choose – or even enforce – a specific degree of parallelism (DOP) or to rely on Oracle to control it.

DEFAULT parallelism

In the earlier example of our parallel query we used so-called DEFAULT parallelism. DEFAULT parallelism uses a formula to determine the DOP based on the system configuration6 (typically the DOP is 2 x [number of CPU cores]; in a cluster configuration 2 x [number of CPU cores] x [number of nodes]). So, on a four node cluster with each node having 8 CPU cores, the default DOP would be 2 x 8 x 4 = 64.

The DEFAULT algorithm was designed to use maximum resources assuming the operation will finish faster if you use more resources. DEFAULT parallelism targets the single-user workload. In a multi-user environment DEFAULT parallelism will rapidly starve system resources leaving no available resources for other users to execute in parallel.

Fixed Degree Of Parallelism (DOP)

Unlike the DEFAULT parallelism, a specific DOP can be requested from the Oracle database. For example, you can set a fixed DOP at a table or index level:

alter table customers parallel 8 ;

alter table sales parallel 16 ;

In this case queries accessing just the customers table request a DOP of 8, and queries accessing the sales table request a DOP of 16. A query accessing both the sales and the customers table will be processed with a DOP of 16 and potentially allocate 32 parallel servers (producers and consumers); whenever different DOPs are specified, Oracle uses the higher DOP7.
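You can verify the parallel degree currently set at the object level in the data dictionary; a minimal sketch, assuming the tables reside in your own schema:

select table_name, degree
from user_tables
where table_name in ('SALES', 'CUSTOMERS');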

Adaptive parallelism

When using Oracle's adaptive parallelism capabilities, the database will use an algorithm at SQL execution time to determine whether a parallel operation should receive the requested DOP or be throttled down to a lower DOP.

In a system that makes aggressive use of parallel execution by using a high DOP, the adaptive algorithm will throttle the DOP down even when only a few operations are running in parallel. While the algorithm will still ensure optimal resource utilization, users may experience inconsistent response times. Relying solely on the adaptive parallelism capabilities in an environment that requires deterministic response times is not advised.

6 We are oversimplifying here for the purpose of an easy explanation. The multiplication factor of two is derived from the init.ora parameter parallel_threads_per_cpu, an OS-specific parameter that is set to two on most platforms.

7 Some statements do not fall under this rule, such as a parallel CREATE TABLE AS SELECT; a discussion of these exceptions is beyond the scope of this paper.

Adaptive parallelism is controlled through the database initialization parameter parallel_adaptive_multi_user.
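For example, to check the current setting and, if consistent response times matter more to you than adaptive throttling, disable it (a sketch only; evaluate the trade-off for your workload first):

SQL> show parameter parallel_adaptive_multi_user

SQL> alter system set parallel_adaptive_multi_user = false;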

Guaranteeing a minimal DOP

Once a SQL statement starts execution at a certain DOP it will not change the DOP throughout its execution. However if you start at a low DOP – either as a result of adaptive parallel execution or because there were simply not enough parallel servers available - it may take a very long time to complete the execution of the SQL statement. If the completion of a statement is time-critical then you may want to either guarantee a minimal DOP or not execute at all (and maybe warn the DBA or programmatically try again later when the system is less loaded).

To guarantee a minimal DOP, use the initialization parameter parallel_min_percent. This parameter controls the minimal percentage of parallel server processes that must be available to start the operation; it defaults to 0, meaning that Oracle will always execute the statement, irrespective of the number of available parallel server processes.

For example, if you want to ensure to get at least 50% of the requested parallel server processes for a statement:

SQL> alter session set parallel_min_percent=50 ;

SQL> select /*+ parallel(s,128) */ count(*)
     from sales s ;

select /*+ parallel(s,128) */ count(*) from sales s
*
ERROR at line 1:
ORA-12827: insufficient parallel query slaves available

If there are insufficient parallel query servers available – in this example less than 64 parallel servers for a simple SQL statement (or less than 128 slaves for a more complex operation, involving producers and consumers) - you will see ORA-12827 and the statement will not execute. You can capture this error in your code and retry later.
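One way to capture ORA-12827 programmatically is a PL/SQL block along the following lines; this is a sketch only, and the retry or notification logic is left to the application:

declare
  insufficient_px exception;
  pragma exception_init(insufficient_px, -12827);
  l_cnt number;
begin
  execute immediate
    'select /*+ parallel(s,128) */ count(*) from sales s' into l_cnt;
  dbms_output.put_line('count: ' || l_cnt);
exception
  when insufficient_px then
    -- not enough parallel servers available; warn the DBA or retry later
    dbms_output.put_line('ORA-12827 raised - retry when the system is less loaded');
end;
/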

Controlling the usage of parallelism

Depending on your expected workload pattern you might want to ensure that Oracle's parallel execution capabilities are used most optimally for your environment. This implies two basic tasks: (a) to control the usage of parallelism and (b) to ensure that the system does not get overloaded while adhering to the potentially different priorities of different user classes in the case of mixed workload environments.

Whether or not a SQL operation runs in parallel, and which DOP is chosen, is determined by the following rules, applied in the following priority order:

– Parallelism was requested in a hint:

select /*+ parallel(s,16) */ count(*)
from sales s ;

The requested DOP for this query is 16.

– The DOP was requested in an alter session command. For example:

alter session force parallel query ;

The requested DOP for any operation in that session will be DEFAULT parallelism.

– Tables and/or indexes accessed in the select statement have a parallel degree setting at the object level. If objects have a DEFAULT setting, then the database determines the DOP value that DEFAULT corresponds to. For a query that processes objects with different DOP settings, the object with the highest parallel degree setting accessed in the query determines the requested DOP.

Using Oracle Database Resource Manager

Oracle Database Resource Manager (DBRM) enables you to group users based on their characteristics, and to restrict parallel execution for some of these groups. DBRM is the ultimate authority in determining the maximum degree of parallelism, and no user in a resource group (using a specific resource plan) will ever be able to run with a higher DOP than the resource group's maximum. For example, if your resource plan has a policy of using a maximum DOP of 4 and you request a DOP of 16 via a hint, your SQL will run with a DOP of 4. Figure 15 shows an Enterprise Manager Database Control screenshot restricting parallel execution to a DOP of 4 for a resource plan named 'DW_USERS'.

Figure 15: Restricting parallel execution in Oracle Database Control.

Furthermore, DBRM can control the maximum number of active sessions for a given resource group. So for the shown resource plan 'DW_USERS' a maximum of 4 sessions are allowed to be active, resulting in a total maximum resource consumption of 4 (sessions) x 4 (DOP) x 2 (slave sets) = 32 parallel server processes.
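A setup similar to the 'DW_USERS' example could be created with the DBMS_RESOURCE_MANAGER package; the plan name, consumer group name and limits below are assumptions for illustration only:

begin
  dbms_resource_manager.create_pending_area();
  dbms_resource_manager.create_consumer_group('DW_USERS', 'DW query users');
  dbms_resource_manager.create_plan('DW_PLAN', 'Limit parallelism for DW users');
  -- limit the consumer group to a DOP of 4 and 4 active sessions
  dbms_resource_manager.create_plan_directive(
    plan                     => 'DW_PLAN',
    group_or_subplan         => 'DW_USERS',
    comment                  => 'max DOP 4, max 4 active sessions',
    parallel_degree_limit_p1 => 4,
    active_sess_pool_p1      => 4);
  -- every plan needs a directive for OTHER_GROUPS
  dbms_resource_manager.create_plan_directive(
    plan             => 'DW_PLAN',
    group_or_subplan => 'OTHER_GROUPS',
    comment          => 'all other users');
  dbms_resource_manager.submit_pending_area();
end;
/

Users or services still have to be mapped to the consumer group (for example with dbms_resource_manager.set_consumer_group_mapping) and the plan has to be activated via the resource_manager_plan initialization parameter.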

ORACLE SQL PARALLEL EXECUTION BEST PRACTICES

This section lists best practices that you should bear in mind when you consider using SQL parallel execution, or that you may revisit if you already use it to ensure you get the optimum performance out of your system.

Start with a balanced system

A good foundation is the basis for successful use of SQL parallel execution. In the case of SQL parallel execution the foundation consists of the hardware configuration that you use to run your database. All system resources, CPUs, I/O and memory, should be able to support the use of SQL parallel execution. If you use Real Application Clusters (RAC) then you have to also size the interconnect appropriately. Parallel execution is very I/O intensive by nature, so every built-in imbalance in a system may have a bigger and more visible impact on the overall scalability of the hardware platform than for less I/O intensive workloads.

For example, if your system is intended to run an I/O intensive workload, then plan the system conservatively, assuming that every CPU core can process approximately 200 MB/s sustained. For example, if you want to keep 4 CPU cores busy in such a configuration, then the entire I/O subsystem should be able to support 800 MB/s sustained for optimum performance. Note that I/O throughput requirement has to be guaranteed throughout the whole hardware system: the Host Bus Adapters (HBAs) in the compute nodes, any switches you use, and the I/O subsystem, incl. storage controllers and physical spindles. The weakest link is going to limit the performance and scalability of operations in this configuration. If you rely on storage shared with other applications then the throughput performance for your database is not guaranteed and you will likely see inconsistent response times for your parallel operations.

SQL parallel execution is also a heavy consumer of memory. Per CPU core you should have at least 4 GB of RAM.

If you use RAC and inter-node parallel query – parallel operations that span multiple nodes – then you also have to size the interconnect appropriately; some of the data redistribution is going to happen over the interconnect, making it as crucial as the overall I/O capabilities. Oracle has more optimizations to minimize interconnect traffic than classical shared nothing architectures – such as choosing to run a parallel operation, or a subset of it, within a single node – but in the worst case the throughput required on the interconnect for good scalability is at least equal to the throughput going to disk. Use (multiple) 10 GigE or InfiniBand interconnects if you plan to use inter-node parallel query.

Work with your hardware vendor and Oracle representative to ensure you start with a good foundation.

Calibrate your configuration

You should set a baseline for the performance you expect out of the Oracle Database. The Oracle Database software will not achieve better performance than the hardware configuration can achieve. Hence you should know what the operating system can achieve before you introduce the Oracle software, and use it as a baseline if later you think the performance is insufficient.

SQL parallel execution is typically very I/O intensive, so you want to measure the maximum I/O performance you can achieve without the Oracle Database. You can use ORION8 (ORacle I/O Number calibration tool, a free Oracle-provided utility designed to simulate Oracle I/O workloads) or basic operation system utilities (such as the Linux/Unix dd command) to measure the I/O performance for your system. Make sure to calibrate the configuration in the way Oracle will use it (how the data will be laid out across storage devices) and use a calibration workload that resembles the type of workload the Oracle Database will perform when running SQL statements in parallel (typically large random I/Os).

Stripe And Mirror Everything (S.A.M.E.) – use ASM

Conservatively, any physical disk may be able to sustain 20-30 MB/s for large random reads. Considering that you need about 200 MB/s to keep a single CPU core busy (i.e. 8 - 10 physical disks), you should realize that you need a lot of physical spindles to get good performance for database operations running in parallel. Do not use a single 1 TB disk for your 800 GB database, because you will not get good performance running operations in parallel against the database; this might work well for your single-user home video archive, but not for a database leveraging parallel query with multiple users.

The way to utilize multiple physical spindles with Oracle's shared everything architecture is to stripe across multiple devices. For high availability you should use a RAID configuration (storage-based RAID1 or RAID5 are commonly used) to ensure you can survive the failure of a single disk. For many years Oracle has recommended that its users follow the Stripe And Mirror Everything (S.A.M.E.) methodology, using a stripe size of 1 MB. Such a configuration is relatively simple to set up and provides good performance for pretty much any workload (OLTP, reporting, data warehouse).

8 Orion is downloadable from the Oracle Technology Network, http://www.oracle.com/technology/software/tech/orion/index.html

Starting with Oracle Database 10g you can use Oracle's Automatic Storage Management (ASM), included with the database. ASM can be used to store Oracle Database files (including online redo log files and archivelog files). ASM will stripe across all devices you present to it in the context of a disk group. Most importantly, if and when you expand your configuration, ASM will automatically rebalance (re-stripe) the data across all devices, so that you always benefit from all storage devices in your configuration, without running into hot spots in the storage layout. ASM implements Oracle's S.A.M.E. methodology and automatically maintains it as devices are added or removed. ASM can also be used to mirror data, or it can be used with hardware RAID configurations.

Use ASM if you are using Oracle Database 10g or higher.

Set database initialization parameters for good performance

Once you have ensured a balanced system you install the Oracle Database software and create a database. There are a few parameters that you should pay attention to when it comes down to achieving good performance for SQL parallel execution.

Memory allocation

Large parallel operations may use a lot of execution memory, and you should take this into account when allocating memory to the database. You should also bear in mind that the majority of operations that execute in parallel bypass the buffer cache. A parallel operation will only use the buffer cache if the object has either been explicitly created with the CACHE option or is smaller than 2% of the buffer cache. If the object size is less than 2% of the buffer cache then the cost of the checkpoint to start the direct read is deemed more expensive than just reading the blocks into the cache.

shared_pool_size

Parallel servers communicate among themselves and with the Query Coordinator by passing messages. The messages are passed via memory buffers that are allocated from the shared pool. When a parallel server is started it allocates buffers in the shared pool so it can communicate; if there is not enough free space in the shared pool to allocate the buffers, the parallel server will fail to start. In order to size your shared pool appropriately, use the following formulas to calculate the additional overhead parallel servers place on the shared pool. If you are not running cross-instance (inter-node) parallel operations:

(((2 + (cpu_count X parallel_threads_per_cpu)) X 2) X (cpu_count X parallel_threads_per_cpu)) X parallel_execution_message_size X # concurrent queries

When you use cross-instance parallel operations in a RAC environment:

(((2 + (cpu_count X 2)) X 4) X (cpu_count X 2)) X parallel_execution_message_size X # concurrent queries

Note the results are returned in bytes.
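For illustration only, assume cpu_count = 8, parallel_threads_per_cpu = 2, parallel_execution_message_size = 16384 and 10 concurrent parallel queries; the first formula then gives ((2 + 16) X 2) X 16 X 16384 X 10 = 94,371,840 bytes, i.e. roughly 90 MB of additional shared pool memory.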

Only the memory needed for parallel_min_servers will be pre-allocated from the shared pool at database startup. As additional parallel servers are needed, their memory buffers are allocated "on the fly" from the shared pool. These rules apply irrespective of whether you set shared_pool_size directly, or use sga_target (10g and higher) or memory_target (starting with 11g).
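On a running system you can check how much shared pool memory the parallel execution message buffers actually consume; a minimal sketch, assuming the buffers appear under a name such as 'PX msg pool' (the exact name can vary by release and configuration):

select pool, name, round(bytes/1024/1024) mb
from   v$sgastat
where  name like 'PX%';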

pga_aggregate_target

The pga_aggregate_target parameter controls the total amount of execution memory that can be allocated by Oracle. Oracle attempts to keep the amount of private memory below the target you specify by adapting the size of the work areas. When you increase the value of this parameter, you indirectly increase the memory allotted to work areas. Consequently, more memory-intensive operations are able to run fully in memory and fewer of them spill to disk. For environments that run a lot of parallel operations you should set pga_aggregate_target as large as possible. A good rule of thumb is a minimum of 100 MB X parallel_max_servers.
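For example, assuming parallel_max_servers is set to 128, the rule of thumb above suggests at least 12800 MB (an illustrative value; it must fit within the physical memory of your server):

alter system set pga_aggregate_target = 12800M scope=both;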

parallel_execution_message_size

As mentioned above, parallel servers communicate among themselves and with the Query Coordinator by passing messages via memory buffers. If you execute a lot of large operations in parallel, it is advisable to reduce the messaging latency by increasing parallel_execution_message_size (the size of the buffers). By default the message size is 2 KB. Ideally you should increase it to 16 KB (16384). Bear in mind, however, that a larger value for parallel_execution_message_size increases the memory requirement on the shared pool: if you increase it from 2 KB to 16 KB, the parallel server memory requirement becomes 8 times larger.
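A minimal sketch of making this change (the parameter is static, so it only takes effect after an instance restart):

alter system set parallel_execution_message_size = 16384 scope=spfile;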

Controlling parallel servers

In order for a parallel operation to execute optimally, enough parallel servers must be available. If no parallel servers are available, the operation will be executed serially.

cpu_count

cpu_count is automatically derived by the Oracle Database and is used to determine the default number of parallel servers and the default degree of parallelism for an object. Do not change the value of this parameter.

parallel_threads_per_cpu

This parameter describes the number of parallel execution processes or threads that a CPU can handle during parallel execution. It is used to calculate the default degree of parallelism for the instance and determines the maximum number of parallel servers if parallel_max_servers is not set. The default is platform-dependent (two on most platforms) and is adequate in most cases.

parallel_min_servers

This parameter determines the number of parallel servers that are started at database startup. By default the value is 0. It is recommended that you set parallel_min_servers to “average number of concurrent queries * maximum degree of parallelism needed by a query”. This ensures that there are ample parallel server processes available for the majority of the queries executed on the system, and queries will not suffer the additional overhead of having to spawn extra parallel servers. However, if extra parallel servers are required for additional queries above your average workload, they can be spawned “on the fly” up to the value of parallel_max_servers. Bear in mind that any additional parallel server processes spawned above parallel_min_servers will be terminated after they have been inactive for a certain amount of time and will have to be re-spawned if they are needed again in the future.

parallel_max_servers

This parameter determines the maximum number of parallel servers that may be started for a database instance, should there be demand for them. The default value on Oracle Database 10g and higher is 10 * cpu_count * parallel_threads_per_cpu. A good rule of thumb is to ensure parallel_max_servers is set to a number greater than the “maximum number of concurrent queries * maximum degree of parallelism needed by a query”. By doing this you ensure that every query gets the appropriate number of parallel servers.
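As a hedged example, assume on average 4 concurrent parallel queries with a maximum DOP of 8, and at peak 8 concurrent queries with a maximum DOP of 16; the rules of thumb above then suggest values along these lines (adjust to your own workload):

-- average concurrency (4) x maximum DOP needed (8) = 32
alter system set parallel_min_servers = 32 scope=spfile;

-- maximum concurrency (8) x maximum DOP needed (16) = 128
alter system set parallel_max_servers = 128 scope=spfile;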

parallel_adaptive_multi_user

This parameter controls whether Oracle proactively downgrades parallel operations to prevent overloading the system. Depending on the workload and the user expectations you should set this parameter to true or false. Realize that if you set the parameter to true, parallel operations may be downgraded aggressively, which can significantly impact execution times. For predictable response times on a busy system it is better to set this parameter to false.
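For example, to favor predictable response times over automatic downgrades (a judgment call that depends on your workload):

alter system set parallel_adaptive_multi_user = false scope=spfile;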

Enabling efficient I/O throughput

db_file_multiblock_read_count

SQL parallel execution is generally used for queries that access a lot of data, for example when doing a full table scan. Since parallel execution bypasses the buffer cache and accesses data directly from disk, you want each I/O to be as efficient as possible, and using large I/Os is a way to reduce latency. Set db_file_multiblock_read_count such that, when it is multiplied by the block size, you end up with 1 MB. For example, for an 8 KB block size, use db_file_multiblock_read_count = 128.
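For example, with the common 8 KB block size (128 x 8 KB = 1 MB per multiblock read):

alter system set db_file_multiblock_read_count = 128 scope=both;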

disk_asynch_io

For optimum performance make sure you use asynchronous I/O; it is the default on the majority of platforms.

Use parallel execution with common sense

While parallel execution provides a very powerful and scalable framework to speed up SQL operations, you should not forget to apply some common sense. Parallel execution might buy you an incremental performance boost, but it requires more resources and might also have side effects on other users or operations on the same system. You cannot use more resources than you have available.

Don't enable parallelism for small objects

Small tables and indexes (up to thousands of records; up to tens of data blocks) should never be enabled for parallel execution. Operations that only hit small tables will not benefit much from executing in parallel, yet they would occupy parallel servers that you want to be available for operations accessing large tables. Remember also that once an operation starts at a certain DOP, it will not change its DOP throughout its execution. Customers that use object size as the main driving factor for parallelism commonly align the DOP with some kind of step function, e.g.

– objects smaller than 200 MB do not use any parallelism

– objects between 200 MB and 5 GB use a DOP of 4

– objects beyond 5 GB get a DOP of 32

Needless to say, your optimal settings may vary, either in size ranges or in DOP, and depend highly on your target workload and business requirements.
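A minimal sketch of such a step function; the table names are placeholders and the thresholds follow the example above:

-- below 200 MB: no parallelism
alter table sales_small  noparallel;

-- between 200 MB and 5 GB: DOP of 4
alter table sales_medium parallel 4;

-- beyond 5 GB: DOP of 32
alter table sales_large  parallel 32;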

Use parallelism to achieve your goals, not to exceed them

Use parallelism to achieve your business requirements, not to over-achieve them. If a certain class of queries has to run within 2 minutes, don't increase the DOP to run them in 30 seconds. Remember Figure 1? Assuming linear scalability, you need four times the number of parallel processes for this speed-up, resources you could give to three additional queries of the same class, getting four times the work done while still meeting your business goals (obviously this is somewhat of a simplification, since you should plan for some headroom, but the message should be clear).

Avoid using hints

In general you should avoid using hints to enable parallel execution. Hints are hard to maintain and may not give the right behavior over time when objects and business requirements change.

Combine parallel execution with Oracle Partitioning

Oracle Partitioning9 is powerful database functionality that is useful for managing large database objects and for achieving good performance when accessing them. Partitioning enables you to store one logical object – a table or index – transparently in several independent physical segments. The data placement is controlled by additional information about the object, such as ranges of order dates or hash buckets of customer ID values.

There are specific optimizations between SQL parallel execution and Oracle Partitioning that you should bear in mind when you plan to use these features together: for example, two large partitioned tables that can take advantage of parallel partial or full partition-wise joins (as discussed on page 18) can be joined faster than tables without partitioning. Ideally, (sub)partitions are similar in size, which can be achieved by using hash (sub)partitioning on a unique or almost unique column, with the number of hash (sub)partitions being a power of 2.

For example: consider two large tables SALES and CUSTOMERS. Partition the SALES table using composite partitioning RANGE on ORDER_DATE, HASH on CUSTOMER_ID. Partition the CUSTOMERS table using HASH partitioning on CUSTOMER_ID. Parallel table joins between SALES and CUSTOMERS can now take advantage of full partition-wise joins.

Ensure statistics are good enough

Executing a SQL statement with the wrong execution plan usually results in poor execution performance. If you execute in parallel then using the wrong execution plan may exaggerate the performance issue.

The Oracle Database computes the optimal execution plan for a SQL operation based on good information about table sizes, data distribution and so on. Gathering statistics in a timely manner is the key to getting the statistics right so that the optimizer can generate a good execution plan.

9 Oracle Partitioning is an extra licensable option of Oracle Enterprise Edition

Starting with Oracle Database 10g also make sure to gather system statistics. System statistics describe the system's hardware characteristics, such as I/O and CPU performance and utilization, to the query optimizer. System statistics enable the query optimizer to more accurately estimate I/O and CPU costs, enabling the query optimizer to choose a better execution plan.
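A sketch of gathering both object and system statistics with DBMS_STATS; the schema, table name and the 60-minute gathering window are illustrative:

begin
  -- object statistics for a large table, gathered in parallel
  dbms_stats.gather_table_stats(
    ownname          => 'SH',
    tabname          => 'SALES',
    estimate_percent => dbms_stats.auto_sample_size,
    degree           => dbms_stats.auto_degree,
    cascade          => true);
  -- workload system statistics, gathered over a representative 60-minute window
  dbms_stats.gather_system_stats(
    gathering_mode => 'INTERVAL',
    interval       => 60);
end;
/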

Monitor parallel execution activity

Use database utilities to monitor the activity on your system, focusing on SQL parallel execution if you suspect problems in that area. Use the Enterprise Manager performance page in Database Control or Grid Control to monitor wait events. Alternatively you can use statspack (on Oracle Database 9i) and AWR reports (Oracle Database 10g and higher) to analyze system performance. For more information also see the following section about parallel execution monitoring.

Whether or not to use parallel execution in RAC

RAC provides an excellent architecture to incrementally scale your hardware configuration as you require system resources. You can use the additional resources to support additional users (and hence reduce the load on the other servers) and/or use the additional resources to directly improve the performance of the operations running on the database. Do take into account that inter-node parallel execution may result in a lot of interconnect traffic, so ensure you size the interconnect appropriately. By default the Oracle database enables inter-node parallel execution (parallel execution of a single statement involving more than one node).

If your interconnect is relatively weak compared to the I/O bandwidth from the servers to the storage configuration, you may be better off restricting parallel execution to a single node or to a limited number of nodes; inter-node parallel execution will not scale with an undersized interconnect. As a general rule of thumb, your interconnect must provide the total I/O throughput of all nodes in the cluster (since all nodes can redistribute data at the same point in time at the speed the data is read from disk). So, if you have a four-node cluster, each node being able to read 1 GB/s from the I/O subsystem, the interconnect must be able to support 4 x 1 GB/s = 4 GB/s to scale linearly for operations involving inter-node parallel execution. It is not recommended to use inter-node parallel execution unless your interconnect satisfies this requirement (or comes very close).

Use instance_groups and parallel_instance_groups, or database services (starting with Oracle Database 11g), to limit inter-node parallel execution. Beginning with Oracle Database 11g it is recommended to use services; the instance_groups parameter is going to be deprecated and is only retained for backward compatibility.
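As a sketch, assuming a service named ETL_PQ has been created that runs only on a subset of the cluster nodes, a session can restrict its parallel execution to those nodes (the service name is a placeholder):

alter session set parallel_instance_group = 'ETL_PQ';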

Use Database Resource Manager

Database Resource Manager ultimately decides the final DOP for a parallel SQL operation before executing it. Consider using the Database Resource Manager if you want to restrict users from using unlimited parallelism (and hence overload the system). Database Resource Manager is an excellent tool to guarantee resources for operations that require a certain response time.
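A minimal Database Resource Manager sketch that caps the DOP for a reporting consumer group at 8; all names and limits are illustrative, and the plan still needs to be activated and users mapped to the group:

begin
  dbms_resource_manager.create_pending_area();
  dbms_resource_manager.create_consumer_group('REPORTING_GROUP', 'ad-hoc reporting users');
  dbms_resource_manager.create_plan('DAYTIME_PLAN', 'limit parallelism for reporting');
  dbms_resource_manager.create_plan_directive(
    plan                     => 'DAYTIME_PLAN',
    group_or_subplan         => 'REPORTING_GROUP',
    comment                  => 'cap reporting DOP at 8',
    parallel_degree_limit_p1 => 8);
  dbms_resource_manager.create_plan_directive(
    plan             => 'DAYTIME_PLAN',
    group_or_subplan => 'OTHER_GROUPS',
    comment          => 'all other sessions');
  dbms_resource_manager.submit_pending_area();
end;
/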

Don't try to solve hardware deficiencies with other features

The most common “problem” with parallel execution (besides overloading a system) is that people try to achieve scalability with parallel execution on unbalanced systems, which obviously will not work. Rather than addressing the underlying problem by implementing a balanced system, people often fight the symptoms, e.g. by creating indexes or additional summary tables.

While such measures might alleviate existing deficiencies in the short term, they will not fix them; they rather introduce unnecessary complexity and delay solving the real problem. When you need a hammer because you have a nail, using a wrench might work for one or two nails, but not for building a whole house.

Don't ignore other features

On the other hand, having a hammer and making everything look like a nail is as bad as not having the hammer at all. SQL parallel execution is a great way to get better performance for expensive database operations, but do not forget that there may be other functionality that is equally if not more appropriate to achieve better performance for specific business requirements. For example, a cube-organized materialized view for multi-dimensional reporting and analysis might deliver a level of performance that would require an orders-of-magnitude larger hardware configuration to satisfy the same queries using parallel execution against the detail data records. All of Oracle's warehousing functionality works together in harmony, so use specific features where their strengths solve your business requirements, rather than religiously sticking to a single set of functionality.

MONITORING SQL PARALLEL EXECUTION

There are several ways to monitor parallel execution. This section discusses various options.

(G)V$ parallel execution views

Specific parallel execution performance views start with (G)V$PQ_ and (G)V$PX_. While the V$ views give you an instance-specific view, the GV$ views are useful in a Real Application Clusters (RAC) environment to extract cluster-wide information. In addition to the columns of the equivalent V$ view, a GV$ view contains the instance ID (nothing more, nothing less). For example, if you wanted to know the parallel execution activity across a cluster, you could run:

select inst_id, status, count(1) px_servers#
from   gv$px_process
group  by inst_id, status
order  by inst_id, status;

   INST_ID STATUS    PX_SERVERS#
---------- --------- -----------
         1 AVAILABLE           4
         1 IN USE             12
         2 AVAILABLE           8
         2 IN USE              8
         3 AVAILABLE           6
         3 IN USE             10
         4 AVAILABLE           2
         4 IN USE             14

Interpreting parallel SQL execution plans

Starting with Oracle Database 10g, for a given query there is a single cursor that is executed by all parallel servers. All parallel execution information is in the single execution plan that is used by every parallel server process. You can get the parallel plan information through various mechanisms, e.g. using the EXPLAIN PLAN utility, selecting the plan from the cursor cache, or using the Automatic Workload Repository. The basic plan information is the same for all these mechanisms, so we will discuss how to identify and interpret the most fundamental parallel execution optimization, namely a partition-wise join.
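For example, a statement that is still in the cursor cache can be displayed as follows (the sql_id is a placeholder for the statement you want to inspect):

select * from table(dbms_xplan.display_cursor('&sql_id', null, 'TYPICAL'));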

Parallel plan without partitioning

Initially tables SALES and CUSTOMERS are not partitioned. The following shows a portion of the execution plan.

explain plan for
select c.state_province, sum(s.amount) revenue
from   customers c, sales s
where  s.customer_id = c.id
and    s.purchase_date between to_date('01-NOV-2007','DD-MON-YYYY')
                           and to_date('31-DEC-2007','DD-MON-YYYY')
and    c.country = 'United States of America'
group by c.state_province;

select * from table(dbms_xplan.display);10

10 Note that some columns in the execution plan have been removed to improve the readability of this example.

Using all the information discussed in the concept section of this paper, you will be able to identify the following parallel processing steps:

– The CUSTOMERS table is read in parallel (ID 11) and is then broadcast to all parallel servers (ID 9), which read the SALES table.

– After the join, the data is redistributed using a HASH redistribution (ID 5) on the group by column.

– Hash join and hash group by take place in parallel without a need for redistribution (ID 6 and ID 7). Every parallel server process is doing the incremental aggregation of their disjoint data set.

– Results are returned to the query coordinator in random order (ID 2), since no order by was specified in the SQL statement; whenever a parallel server finishes the computation of its incremental result it is returned to the QC.

Parallel plan with partitioning and partition-wise join

Large databases and particularly data warehouses – the types of databases that mostly use parallel execution – should always use Oracle Partitioning. Partitioning can provide great performance improvements because of partition elimination (pruning) capabilities, but also because parallel execution plans can take advantage of partitioning.

Let's recreate the tables SALES and CUSTOMERS as follows:

– HASH partitioning on the ID column for the CUSTOMERS table using 128 partitions.

– HASH partitioning on the CUSTOMER_ID column for the SALES table using 128 partitions.

– Tables SALES and CUSTOMERS are now equi-partitioned on the join column (see the DDL sketch below).
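A sketch of the corresponding DDL; the column lists are shortened for readability:

create table customers (
  id             number,
  state_province varchar2(40),
  country        varchar2(60)
)
partition by hash (id) partitions 128
parallel;

create table sales (
  customer_id   number,
  purchase_date date,
  amount        number
)
partition by hash (customer_id) partitions 128
parallel;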

--------------------------------------------------------------
| Id | Operation                    | Name      | PQ Distrib |
--------------------------------------------------------------
|  0 | SELECT STATEMENT             |           |            |
|  1 |  PX COORDINATOR              |           |            |
|  2 |   PX SEND QC (RANDOM)        | :TQ10002  | QC (RAND)  |
|  3 |    HASH GROUP BY             |           |            |
|  4 |     PX RECEIVE               |           |            |
|  5 |      PX SEND HASH            | :TQ10001  | HASH       |
|  6 |       HASH GROUP BY          |           |            |
|  7 |        HASH JOIN             |           |            |
|  8 |         PX RECEIVE           |           |            |
|  9 |          PX SEND BROADCAST   | :TQ10000  | BROADCAST  |
| 10 |           PX BLOCK ITERATOR  |           |            |
| 11 |            TABLE ACCESS FULL | CUSTOMERS |            |
| 12 |         PX BLOCK ITERATOR    |           |            |
| 13 |          TABLE ACCESS FULL   | SALES     |            |
--------------------------------------------------------------

Figure 16: customer purchase information per state, parallel plan

Figure 17 shows the execution plan for the same query using the now hash-partitioned tables. Unlike in previous examples, you do not see the granules for tables SALES and CUSTOMERS right away in the plan. The simple reason is that we are now using partition-based granules, so Oracle does not have to divide the data into granules for parallel access at runtime; the database simply iterates over the existing partitions.

Furthermore, we are joining two equi-partitioned tables leveraging a partition-wise join. The partition-based granules are not only identical for both tables, but the iteration (processing) of granules is now a processing of pairs of partitions that includes the join as well; one parallel server process is working on one equivalent partition pair at a given point in time. Consequently, the partition-based granule iterator is ABOVE the hash join operation in the execution plan.

Besides the familiar parallel execution processing steps, the new behavior of a partition-wise join can be seen in the execution plan in Figure 17:

– Tables SALES and CUSTOMERS are accessed in parallel, iterating over the existing equi-partitioned hash partition-based granules (ID 7). You can read this operation as “loop over all hash partitions and process the operations below”. A set of parallel servers works on n partitions at a time (n equals the DOP), from partition 1 to 128 (identified through the columns 'Pstart' and 'Pstop').

– For each HASH partition pair, a parallel server process joins the table CUSTOMERS and SALES.

No data redistribution is taking place to join tables SALES and CUSTOMERS. In the case of inter-node parallel query, no data transfer would be necessary between the compute nodes; the Oracle database – although built on the shared everything paradigm – would behave like a shared nothing system for this operation.

------------------------------------------------------------------------------------------
| Id | Operation                    | Name      | Pstart| Pstop | TQ    | PQ Distrib |
------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT             |           |       |       |       |            |
|  1 |  PX COORDINATOR              |           |       |       |       |            |
|  2 |   PX SEND QC (RANDOM)        | :TQ10001  |       |       | Q1,01 | QC (RAND)  |
|  3 |    SORT GROUP BY             |           |       |       | Q1,01 |            |
|  4 |     PX RECEIVE               |           |       |       | Q1,01 |            |
|  5 |      PX SEND HASH            | :TQ10000  |       |       | Q1,00 | HASH       |
|  6 |       SORT GROUP BY          |           |       |       | Q1,00 |            |
|  7 |        PX PARTITION HASH ALL |           |     1 |   128 | Q1,00 |            |
|  8 |         HASH JOIN            |           |       |       | Q1,00 |            |
|  9 |          TABLE ACCESS FULL   | CUSTOMERS |     1 |   128 | Q1,00 |            |
| 10 |          TABLE ACCESS FULL   | SALES     |     1 |   128 | Q1,00 |            |
------------------------------------------------------------------------------------------

Figure 17: customer purchase information, parallel plan, hash partitioning with partition-wise joins

A full partition-wise join only requires the partitioning strategy on the join column(s) to be identical for both tables. If we change the SALES table to a composite RANGE-HASH partitioned table, using PURCHASE_DATE for range partitioning (7 years' worth of data, partitioned by month) and CUSTOMER_ID for hash subpartitioning with 128 subpartitions, we still satisfy the condition for a full partition-wise join, and the plan changes only slightly, as shown in Figure 18.

However, the query against the newly partitioned tables returns even faster than before. Besides the benefits of the parallel full partition-wise join, a big performance improvement is achieved through partition elimination: the Oracle database analyzes all predicates in the query to see whether some partitions can be ruled out from processing completely. In our case, the composite range-hash partitioned table SALES has 84 x 128 = 10,752 subpartitions in total. Analyzing the filter predicate on the purchase date reduces this to two range partitions (#72 and #73, shown in Pstart/Pstop of ID 10); we only have to access 256 out of 10,752 subpartitions, providing approximately a 40x performance improvement.

Partition-wise joins can also be leveraged when joining REF-partitioned tables, or as so-called partial partition-wise joins when a small table is joined with a significantly larger table and the database enforces a data redistribution to match the partitioning strategy of the larger table. To keep the focus on parallel execution we will not further discuss partition-wise joins for REF-partitioned tables, nor partial partition-wise joins.

--------------------------------------------------------------------------------------
| Id  | Operation                           | Name      | Pstart| Pstop | PQ Distrib |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |           |       |       |            |
|   1 |  PX COORDINATOR                     |           |       |       |            |
|   2 |   PX SEND QC (RANDOM)               | :TQ10001  |       |       | QC (RAND)  |
|   3 |    HASH GROUP BY                    |           |       |       |            |
|   4 |     PX RECEIVE                      |           |       |       |            |
|   5 |      PX SEND HASH                   | :TQ10000  |       |       | HASH       |
|   6 |       HASH GROUP BY                 |           |       |       |            |
|   7 |        PX PARTITION HASH ALL        |           |     1 |   128 |            |
|*  8 |         HASH JOIN                   |           |       |       |            |
|*  9 |          TABLE ACCESS FULL          | CUSTOMERS |     1 |   128 |            |
|  10 |          PX PARTITION RANGE ITERATOR|           |    72 |    73 |            |
|* 11 |           TABLE ACCESS FULL         | SALES     |  9089 |  9344 |            |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   8 - access("S"."CUSTOMER_ID"="C"."ID")
   9 - filter("C"."COUNTRY"='United States of America')
  10 - filter("TIME_ID">=TO_DATE(' 2007-11-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "TIME_ID"<=TO_DATE(' 2007-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Figure 18: customer purchase information, parallel plan with PWJ

Oracle Enterprise Manager

Oracle Enterprise Manager Database Control 11g provides new monitoring capabilities useful from a parallel execution perspective. The functionality will also be available in Oracle Enterprise Manager Grid Control 11g.

Wait events

The main performance screen in Oracle Enterprise Manager Database Control or Grid Control – starting with Oracle Database 10g – shows a graph of wait events over time. This screen is useful to identify what the system workload looks like at any point in time. It is easy to figure out whether the system is using a lot of CPU resources or whether it is waiting on a particular resource and if so, what resource that is. If a significant portion of the workload consists of SQL statements executing in parallel then it is typical to see a high CPU utilization and/or significant user I/O waits.

Figure 19 shows an Oracle Enterprise Manager Database Control screenshot of the performance page focused on the graph with wait events. The parallel execution workload shows a lot of I/O waits and not a very high CPU utilization on the system.

The most common PX events deal with the message (data) exchange of the producer/consumer model: to mitigate waits, the parallel execution infrastructure uses buffers; producers fill a buffer and consumers read it. The mechanism works both ways to ensure efficient processing. As a result of this model you will very likely see wait events in the database instance that are due to producers waiting for consumers to accept data (PX Deq Credit: send blkd) or consumers waiting for producers to produce data (PX Deq Credit: need buffer). The wait events due to the producer/consumer model are unavoidable to a large extent and don't really hurt performance (they fall in the “idle” wait class). Other wait events that you might see are related to parallel server startup and shutdown, and to the coordinator acquiring the parallel servers it needs. These wait events should be rare and should not take up a lot of time in a production environment.

Figure 19: Oracle Enterprise Manager Database Console 11g performance page - wait events.

In a statspack output or an Automatic Workload Repository (AWR) report you will see all parallel execution wait events reported. As mentioned earlier, most of the parallel execution (PX) events are either idle wait events or non-tunable, unavoidable events caused by the additional process communication in a parallel environment.

Generally it is not the parallel execution specific wait events that cause slow system performance, but rather the waits introduced by the workload running in parallel, such as I/O waits, or high CPU utilization. An increase of the idle parallel execution events can often be considered a symptom of a performance problem rather than its cause. For example, an increase of consumers waiting for producers to produce data (PX Deq Credit: need buffer) very likely indicates slow I/O when the producer operation involves disk I/O (e.g. a parallel full table scan).

Input/Output (I/O) monitoring

Almost all SQL statements executing in parallel will read data directly from disk rather than out of memory. As a result parallel statements can be very I/O intensive. Oracle Enterprise Manager Database Control 11g provides I/O throughput information on the main performance page – on the “I/O tab” – as well as on the detailed I/O pages.

Figure 20: Detailed I/O page in OEM 11g Database Console for a parallel DML workload.

The example in Figure 20 shows the I/O page for a parallel DML workload. A lot of the I/Os per second are for the database writer and a significant portion of the throughput is large writes. For a predominantly parallel query environment you expect the majority of the throughput (in MB/s or GB/s) from large reads. If parallel SQL operations are bottlenecked by I/O it is usually because the maximum throughput (MB/s) has been reached rather than the maximum I/O operations per second (IOPS).

Parallel execution monitoring

Oracle Enterprise Manager Database Control 11g also introduced parallel execution monitoring on the performance page. The screens help you identify whether the system is running a large number of statements in parallel and whether the majority of the resources are used for a few statements running at a large DOP versus a large number of statements running at a lower DOP. Figure 21 shows a screenshot of the Parallel Execution tab on the performance page in Oracle Enterprise Manager 11g Database Control.

Figure 21: Parallel execution monitoring in OEM 11g Database Console.

SQL monitoring

Oracle Database 11g introduced a new dynamic view GV$SQL_MONITOR11. This view enables real-time monitoring of long-running SQL statements and all parallel SQL statements without any overhead.

With Oracle Database 11.1.0.6 you can only use textual output from the view. Starting with Oracle Enterprise Manager database console 11.1.0.7 there is a graphical interface to GV$SQL_MONITOR. Oracle Enterprise Manager Grid Control 11g will also provide the graphical interface.
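A sketch of retrieving the textual report with DBMS_SQLTUNE (the sql_id is a placeholder; the Tuning Pack license mentioned in the footnote below applies here as well):

select dbms_sqltune.report_sql_monitor(
         sql_id => '&sql_id',
         type   => 'TEXT') as report
from   dual;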

The examples and screenshots in this section show Oracle Enterprise Manager 11.1.0.7 database console on a single instance 2 CPU database server12.

The SQL Monitoring screen shows the execution plan of a long-running statement or a statement that is running in parallel. In near real-time (the default refresh cycle is 5 seconds) you can monitor which step in the execution plan is being worked on and if there are any waits (see Figure 22). A parallel statement shows the parallel server sets. The SQL Monitor output is extremely valuable to identify which parts of an execution plan are expensive throughout the total execution of a SQL statement.

The SQL Monitoring screens also provide information about the parallel server sets and work distribution between individual parallel servers on the “Parallel” tab (see Figure 23).

11 Oracle Database Enterprise Manager Tuning Pack must be licensed in order to access (G)V$SQL_MONITOR.

12 As of publication Oracle Database 11.1.0.7 is not yet available. The example shows screenshots of an early version of database console on a development version.

Figure 22: Monitoring a parallel execution query in near real-time.

Ideally you see an equal distribution of work across the parallel servers. If there is a skew in the distribution of work between the parallel servers of one parallel server set, then you have not achieved optimal performance: the statement has to wait for the parallel server performing the most work to complete.

The third tab in the SQL Monitoring interface shows the activity for the statement over time in near real-time (see Figure 24). Use this information to identify at statement level what resources are used most intensely.

Figure 23: Parallel server sets and work distribution in SQL Monitoring.

Figure 24: Wait activity in SQL Monitoring.

UPGRADE CONSIDERATIONS COMING FROM ORACLE DATABASE 9I

Oracle Database 10g introduced a completely rewritten internal parallel execution infrastructure. Many parallel execution restrictions that existed in Oracle Database 9i have been lifted starting with Oracle Database 10g.

If you are using SQL parallel execution on Oracle Database 9i, and you plan to upgrade to Oracle Database 10g or beyond, then you should be aware of some changes in the SQL parallel execution infrastructure. These changes may lead to unexpected behavior, mainly in the form of more statements being parallelized and more parallel resources being used on the system, which can result in different execution times for parallel operations and different system utilization compared to Oracle Database 9i.

More parallel operations

The internal code rewrite introduced with Oracle Database 10g lifted a number of parallel execution restrictions that existed in Oracle Database 9i. As a result you might see that some operations that used to run serially are now executed in parallel when you use parallel settings at the table level. This may be great for the execution time of these operations, but it also means that the system will use a lot more parallel resources than it used to. In the worst case, operations that would run in parallel in Oracle Database 9i are now starved for parallel resources and may either run at a lower DOP or even be serialized. This problem is exacerbated if your system already runs close to its resource limits on Oracle Database 9i.

When you plan to upgrade from Oracle Database 9i you should review your SQL parallel execution settings. In all cases you should validate the parallel execution behavior between Oracle Database 9i and the new release through a representative test of the workload on your production system.

If on Oracle Database 9i you used hints to enable SQL parallel execution

If you always use hints, and nothing but hints, to enable SQL parallel execution on Oracle Database 9i then there is little to worry about when upgrading. You should verify whether every operation with parallel hints actually runs in parallel in Oracle Database 9i, but if it does, it will do so in Oracle Database 10g and beyond as well.

If on Oracle Database 9i you used session settings to enable SQL parallel execution

If you only use session settings to enable parallel execution, then you should look at the operations that are executed in the sessions that enable or force parallel execution. Expect more operations to execute in parallel after an upgrade to Oracle Database 10g or beyond. If your parallel-enabled sessions contained only parallel operations on Oracle Database 9i, then you should expect minimal changes, if any, after an upgrade.

If on Oracle Database 9i you used object level settings to enable SQL parallel execution

If you set the parallel properties at the table or index level in order to enable parallel execution, then you are most likely to experience changes. Expect some operations that access parallel-enabled objects, and that would not execute in parallel on Oracle Database 9i, to run in parallel after an upgrade.

Carefully review the parallel settings at the table level, and reset the parallel setting on small database objects to noparallel (database objects with fewer than thousands of records and/or only a few database blocks in size). Operations that complete in a few seconds or less when running serially benefit little from executing in parallel. Rather, you want operations that take minutes or even hours to complete serially to benefit from parallel execution.
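A sketch for finding parallel-enabled tables and resetting a small one back to serial execution; the table name is a placeholder and every object should be reviewed before it is changed:

-- list tables that carry a parallel setting (an explicit DOP or DEFAULT)
select owner, table_name, trim(degree) as degree
from   dba_tables
where  trim(degree) not in ('0', '1');

-- reset a small table to serial execution
alter table sh.small_dimension noparallel;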

Execution plan changes

As mentioned before in this paper, in Oracle Database 10g and beyond you will only see a single execution plan for a parallel statement, which is used by all parallel servers. As a result the execution plan is easier to read. However, if you automate the comparison of execution plans between the old database release and the new database release, then you will see changes. You should understand where these changes come from, and you may have to manually compare the execution plans to ensure they do not change for the worse.

Furthermore, due to the change to a single cursor model you will only see multiple parallel servers executing the single cursor of the parallel execution plan, instead of seeing different SQL statements representing fragments of the parallel plan (a.k.a. slave SQL in versions prior to Oracle Database 10g), so the way you monitor and analyze parallel execution will change. Note that the change to a single cursor model by itself has no impact on the operation of your system; an impact, if any, only relates to the additional parallel capabilities in Oracle Database 10g and beyond.

Changes in database defaults

Some of the default values for database initialization parameters for SQL parallel execution have changed from Oracle Database 9i. The most notable changes are:

parallel_max_servers

The default value in Oracle Database 9i was 10. For Oracle Database 10g and higher, assuming you use automatic memory management for execution memory (i.e. pga_aggregate_target or, starting with Oracle Database 11g, memory_target), the default equates to 10 * cpu_count * parallel_threads_per_cpu. This is generally a lot more than 10, which means that SQL parallel execution may end up using a lot more system resources. If your system was heavily loaded on Oracle Database 9i with some operations running in parallel, you may see the overall system throughput go down when you upgrade to Oracle Database 10g or beyond. The remedy for this change is to manually set parallel_max_servers to 10 in the database initialization parameter file (pfile or spfile).

parallel_adaptive_multi_user

In Oracle Database 9i parallel_adaptive_multi_user was by default derived from parallel_automatic_tuning and defaulted to false. In Oracle Database 10g and beyond parallel_adaptive_multi_user defaults to true. As a result the database will aggressively reduce the DOP for parallel SQL operations when other statements are already using parallel servers. If you did not explicitly change parallel_automatic_tuning or parallel_adaptive_multi_user on Oracle Database 9i, then you should explicitly set parallel_adaptive_multi_user to false when you upgrade to Oracle Database 10g or beyond.

Use Resource Manager

Consider the use of Resource Manager beyond Oracle Database 9i to ensure operations get the resources they need when they need them. If there is a class of user or a type of application that should never execute in parallel, consider ensuring that this application cannot execute in parallel using a specific consumer group and an appropriate resource plan in Resource Manager. That way the application will not unexpectedly consume parallel resources, potentially starving operations that do require parallel execution in order to complete in a reasonable amount of time.

CONCLUSION

The objective of parallel execution is to reduce the total execution time of an operation by using multiple resources concurrently. Resource availability is the most important prerequisite for scalable parallel execution.

The Oracle Database provides a powerful SQL parallel execution engine that can run almost any SQL-based operation – DDL, DML and queries – in the Oracle Database in parallel. This paper explained how to enable SQL parallel execution and provided some best practices to ensure its successful use.

Oracle SQL Parallel Execution

June 2008

Authors: Mark Van de Wiel, Hermann Baer

Contributing Authors: Thierry Cruanes, Maria Colgan

Oracle Corporation

World Headquarters

500 Oracle Parkway

Redwood Shores, CA 94065

U.S.A.

Worldwide Inquiries:

Phone: +1.650.506.7000

Fax: +1.650.506.7200

oracle.com

Copyright © 2008, Oracle. All rights reserved.

This document is provided for information purposes only and the contents hereof are subject to change without notice. This document is not warranted to be error-free, nor subject to any other warranties or conditions, whether expressed orally or implied in law, including implied warranties and conditions of merchantability or fitness for a particular purpose. We specifically disclaim any liability with respect to this document and no contractual obligations are formed either directly or indirectly by this document. This document may not be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without our prior written permission.

Oracle, JD Edwards, PeopleSoft, and Siebel are registered trademarks of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.