


Reviewed by Oracle Certified Master Korea Community

( http://www.ocmkorea.com http://cafe.daum.net/oraclemanager )

10G ADVANCED SQL AND PERFORMANCE

PAPER 615

INTRODUCTION

We will cover advanced SQL concepts, real-world problems and solutions, and the performance impact of advanced SQL coding decisions. ANSI-compliant and Oracle 9i and 10g functions and features, as well as some creative solutions to SQL problems, will be discussed. This will be a fast-paced look at a fun topic. Some of the things we will look at are:

• 9i/10g Full outer joins and join indexes

• Case and Decode

• 9i and 10g Analytics (e.g. Rankings)

• Medians

• First and Last Functions

• Model and (of course) more…

Finding the First/Last N Rows

Limiting Rows with Rownum

Want to find the first or last n rows in a table? Use Rownum!

For each row returned by a query, the ROWNUM pseudo-column returns a number indicating the order in which Oracle selects the row from a table or set of joined rows. This numbering, however, does not follow the query's ORDER BY. An example of this is:

SQL> select username, rownum from dba_users order by username;

USERNAME                       ROWNUM
------------------------------ ----------
AURORA$ORB$UNAUTHENTICATED     6
CTXSYS                         10
DBSNMP                         4
MDSYS                          9
OPS$ORACLE                     5
ORDPLUGINS                     8
ORDSYS                         7
OUTLN                          3
SYS                            1
SYSTEM                         2
TESTUSER                       11

11 rows selected.

Using ORDER BY does not necessarily solve the problem, since ROWNUM is assigned to each row before the rows are sorted. An example where we try to return the first 3 rows in a table is shown below:

select username, rownum from dba_users where rownum < 4 order by username;

USERNAME                       ROWNUM
------------------------------ ----------
OUTLN                          3
SYS                            1
SYSTEM                         2

This can be corrected by retrieving and sorting the rows in a subquery and then applying ROWNUM in the outer query:

SELECT username, rownum
FROM (SELECT username FROM dba_users ORDER BY username)
WHERE ROWNUM < 4;

USERNAME                       ROWNUM
------------------------------ ----------
AURORA$ORB$UNAUTHENTICATED     1
CTXSYS                         2
DBSNMP                         3

A greater-than sign used with ROWNUM and a positive integer will never return a row, as shown below:

SELECT username, rownum
FROM (SELECT username FROM dba_users ORDER BY username)
WHERE ROWNUM > 4;

no rows selected

To show the last 3 rows we therefore cannot use >; instead we must use < and order the rows descending in the subquery, as shown below:

SELECT username, rownum
FROM (SELECT username FROM dba_users ORDER BY username desc)
WHERE ROWNUM < 4;

USERNAME                  ROWNUM
------------------------- ------
TESTUSER                  1
SYSTEM                    2
SYS                       3

ANALYTIC FUNCTIONS

• Specialized functions that return aggregate values based on a grouping of rows.

• Multiple rows can be returned for each group.

• Each group can be called a "window," although, unfortunately, the term "partition" is used in SQL.


• Calculations can be performed on rows in the window.

• Can only appear in the "Select" or "Order By" clause.

  o They are the last operations performed in a query, except for the final "Order By".

• Let's look at some…

Finding the First/Last N rows

Limiting Rows with the Row_Number Function

This function has no relationship to ROWNUM. It is an analytic function that assigns a unique, sequential number (in the order defined by ORDER BY) to each row in a partition. This can be shown in the example below:

SELECT sales_rep, territory, total_sales, ROW_NUMBER() OVER

(PARTITION BY territory ORDER BY total_sales DESC) as row_number

FROM sales;

SALES_REP TERRITORY TOTAL_SALES ROW_NUMBER

-------------------- ---------- ----------- ----------

Simpson 1 990 1

Lee 1 800 2

Smith 1 400 3

Jones 1 200 4

Jones 1 145 5

Blake 5 2850 1

Allen 5 1600 2

Turner 5 1500 3

Ward 5 1250 4

We will discuss other analytic functions later in this paper.

Analytic Example with Row_Number

SELECT sales_rep, terr, total_sales,
       row_number() OVER (PARTITION BY terr ORDER BY total_sales DESC) as row_number
FROM sales

call     count  cpu   elapsed  disk  query  current  rows
Parse    1      0.02  0.09     0     1      0        0
Execute  1      0.00  0.00     0     0      0        0
Fetch    2      0.00  0.00     0     7      0        4
total    4      0.02  0.09     0     8      0        4

Rows  Execution Plan
0     SELECT STATEMENT MODE: ALL_ROWS
4     WINDOW (SORT)
4     TABLE ACCESS (FULL) OF 'SALES' (TABLE)
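ROW_NUMBER() is now widely available outside Oracle; as a minimal sketch (not the paper's own test environment), the example above can be reproduced with Python's built-in sqlite3 module, assuming a SQLite build of 3.25 or later (the first with window functions). The table name and data mirror the paper's sales example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sales_rep TEXT, territory INTEGER, total_sales INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("Simpson", 1, 990), ("Lee", 1, 800), ("Smith", 1, 400),
                  ("Jones", 1, 200), ("Jones", 1, 145),
                  ("Blake", 5, 2850), ("Allen", 5, 1600),
                  ("Turner", 5, 1500), ("Ward", 5, 1250)])

# Number the rows within each territory, highest sales first
result = conn.execute("""
    SELECT sales_rep, territory, total_sales,
           ROW_NUMBER() OVER (PARTITION BY territory
                              ORDER BY total_sales DESC) AS row_number
    FROM sales
    ORDER BY territory, row_number
""").fetchall()
```

Filtering the outer query on `row_number <= n` gives a top-n-per-partition query, which is the analytic replacement for the ROWNUM-in-subquery trick shown earlier.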


Ranking Functions

Ranking Values in a List

You have a list of values and want to rank them from 1 to n. What is the best way to do this? Let's look at a modified example from Joe Celko (note 1), which takes the following sales information from the Sales table:

SALES
sales_rep  territory  total_sales
Jones      1          345
Smith      1          345
Lee        1          200
Simpson    1          990

Given the above list, we want to achieve the following result:

sales_rep  territory  total_sales  rank
Simpson    1          990          1
Jones      1          345          2
Smith      1          345          2
Lee        1          200          3

The SQL we can use to perform this is:

Select s1.sales_rep, s1.territory, s1.total_sales,
       (Select count(Distinct total_sales)
        From Sales s2
        Where s2.total_sales >= s1.total_sales
          AND s2.territory = s1.territory) as rank
From Sales s1
order by rank;

Note the use of a Select statement embedded in the query to generate the Rank column.

Ranking Values in a List with the Rank Function

In Oracle, we can accomplish the above with the Rank analytic function, which is more concise, though it takes some explaining. Analytic functions allow us to compare a row to a "window" of rows. In the Sales example, the window is the entire result set, or the territory. Look at an example of this, which returns nearly the same result as the query above (note that after the tie, Rank gives Lee 4 where the count-distinct query gave 3; the Dense_Rank function, discussed later, matches the count-distinct behavior):

SELECT sales_rep, territory, total_sales,
       RANK() OVER (PARTITION BY territory ORDER BY total_sales DESC) as rank
FROM sales;

SALES_REP            TERRITORY  TOTAL_SALES  RANK
-------------------- ---------- ------------ -----
Simpson              1          990          1
Jones                1          345          2
Smith                1          345          2
Lee                  1          200          4
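The equivalence between the Celko-style scalar subquery and the analytic form can be checked mechanically. This is a sketch using Python's sqlite3 (assuming SQLite 3.25+); it uses DENSE_RANK, since that is the window function whose tie handling matches the count-distinct subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sales_rep TEXT, territory INTEGER, total_sales INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("Jones", 1, 345), ("Smith", 1, 345),
                  ("Lee", 1, 200), ("Simpson", 1, 990)])

# Portable count-distinct ranking (the scalar-subquery approach)
subquery = conn.execute("""
    SELECT s1.sales_rep,
           (SELECT COUNT(DISTINCT s2.total_sales) FROM sales s2
            WHERE s2.total_sales >= s1.total_sales
              AND s2.territory = s1.territory) AS rank
    FROM sales s1 ORDER BY rank, s1.sales_rep
""").fetchall()

# Window-function version; DENSE_RANK reproduces the subquery's tie handling
dense = conn.execute("""
    SELECT sales_rep,
           DENSE_RANK() OVER (PARTITION BY territory
                              ORDER BY total_sales DESC) AS rank
    FROM sales ORDER BY rank, sales_rep
""").fetchall()
```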


The window determines the range of rows on which the current row is evaluated. The syntax used above states that we want to partition the data by territory and order the data within each partition by total_sales in descending order. Note that analytic functions are the last operations performed, except for a final Order By clause. For a better example of this, let's add a few more territories and values. The same query returns the following:

SALES_REP            TERRITORY  TOTAL_SALES  RANK
-------------------- ---------- ------------ -----
Simpson              1          990          1
Jones                1          345          2
Smith                1          345          2
Lee                  1          200          4
Ford                 3          3000         1
Clark                3          2450         2
Miller               3          1300         3
Adams                3          1100         4
Carter               3          800          5
Blake                5          2850         1
Allen                5          1600         2
Turner               5          1500         3
Ward                 5          1250         4

To rank the entire table, remove "partition by territory" from the query. Other bounding and windowing options are available to this function.

Ranking and Aggregates

Suppose our sales table had sales amounts by territory, sales_rep and product, and that we wanted to determine the ranking of sales by territory and product. We could easily do this by combining the Rank and Group By features of SQL. Take the sample table below:

SALES_REP            TERRITORY  PRODUCT  TOTAL_SALES
-------------------- ---------- -------- -----------
Jones                1          8        200
Jones                1          8        145
Smith                1          8        400
Lee                  1          9        800
Simpson              1          9        990
Blake                5          8        2850
Allen                5          9        1600
Turner               5          8        1500
Ward                 5          9        1250

The combined Rank and Group By query looks as shown below:

SELECT territory, product, sum(total_sales),
       RANK() OVER (PARTITION BY territory
                    ORDER BY sum(total_sales) DESC) as rank
FROM sales
group by territory, product;

This results in:

TERRITORY  PRODUCT  SUM(TOTAL_SALES)  RANK
---------- -------- ----------------- -----
1          9        1790              1
1          8        745               2
5          8        4350              1
5          9        2850              2
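The grouped ranking above can be sketched with sqlite3 as well. Oracle lets you write RANK() directly over SUM(total_sales) in the same query block; for portability this sketch (an assumption, not the paper's own code) aggregates in a subquery first, which makes the "GROUP BY runs before the window function" ordering explicit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales "
             "(sales_rep TEXT, territory INTEGER, product INTEGER, total_sales INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                 [("Jones", 1, 8, 200), ("Jones", 1, 8, 145), ("Smith", 1, 8, 400),
                  ("Lee", 1, 9, 800), ("Simpson", 1, 9, 990),
                  ("Blake", 5, 8, 2850), ("Allen", 5, 9, 1600),
                  ("Turner", 5, 8, 1500), ("Ward", 5, 9, 1250)])

# Aggregate first, then rank the aggregated rows within each territory
result = conn.execute("""
    SELECT territory, product, amt,
           RANK() OVER (PARTITION BY territory ORDER BY amt DESC) AS rank
    FROM (SELECT territory, product, SUM(total_sales) AS amt
          FROM sales GROUP BY territory, product)
    ORDER BY territory, rank
""").fetchall()
```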

Rankings with different boundaries

The Rank analytic function can be used to operate on different groupings within the same query. We will look at an example where 2 ranks are kept: one showing the ranking of a product's total sales within a territory (column rank_prod_per_terr), and a second giving the rank of a product across all territories (column rank_product_total).

SELECT territory, product, sum(total_sales),

RANK() OVER

(PARTITION BY territory ORDER BY sum(total_sales) DESC)

as rank_prod_per_terr,

RANK() OVER

(ORDER BY sum(total_sales) DESC)

as rank_product_total

FROM sales

GROUP BY territory, product

ORDER by territory;

TERRITORY PRODUCT SUM(TOTAL_SALES) RANK_PROD_PER_TERR RANK_PRODUCT_TOTAL

---------- ---------- ---------------- ------------------ ------------------

1 7 1390 1 6

1 9 800 2 8

1 8 345 3 9

3 8 5100 1 1

3 9 2550 2 4

3 7 1000 3 7

5 8 2850 1 2

5 9 2850 1 2

5 7 1500 3 5

Notice how ties are handled for products in Territory 5. This can be changed using the Dense_Rank function.

Explain for Rankings with Different Boundaries

SELECT terr, prod, sum(total_sales),
       RANK() OVER (PARTITION BY terr
                    ORDER BY sum(total_sales) DESC) as rank_prod_by_terr,
       RANK() OVER (ORDER BY sum(total_sales) DESC) as rank_prod_ttl
FROM sales
GROUP BY terr, prod
ORDER by terr;

Rows  Execution Plan
0     SELECT STATEMENT MODE: ALL_ROWS
4     WINDOW (SORT)
4     WINDOW (SORT)
4     SORT (GROUP BY)
7     TABLE ACCESS (FULL) OF 'SALES' (TABLE)
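The tie behavior just mentioned, and the Dense_Rank alternative, can be seen side by side in a small sqlite3 sketch (assuming SQLite 3.25+; the tied values are borrowed from the Territory 5 data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sales_rep TEXT, total_sales INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("Blake", 2850), ("Allen", 2850), ("Turner", 1500), ("Ward", 1250)])

result = conn.execute("""
    SELECT sales_rep, total_sales,
           RANK()       OVER (ORDER BY total_sales DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY total_sales DESC) AS drnk
    FROM sales ORDER BY total_sales DESC, sales_rep
""").fetchall()
# After the 2850 tie, RANK jumps to 3 while DENSE_RANK continues with 2
```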


Ranking per-cube and rollup-group

Rankings can also be performed using the cube and rollup features. We will show an example below where we do a cube ranking of total_sales for each product. We then have total_sales and cube rankings by territory (shown where product is null), by product (where territory is null), and finally a total_sales amount for all products and territories combined (where territory and product are both null).

-- Rankings with per cube and rollup
SELECT territory, product, sum(total_sales),

RANK() OVER

(PARTITION BY GROUPING(territory),

GROUPING(product)

ORDER BY sum(total_sales) DESC) as rank_per_cube

FROM sales

GROUP BY CUBE(territory, product)

ORDER by GROUPING(territory), GROUPING(product), territory;

TERRITORY PRODUCT SUM(TOTAL_SALES) RANK_PER_CUBE

---------- ---------- ---------------- -------------

1 7 1390 6

1 9 800 8

1 8 345 9

3 8 5100 1

3 7 1000 7

3 9 2550 4

5 8 2850 2

5 7 1500 5

5 9 2850 2

1 2535 3

3 8650 1

5 7200 2

8 8295 1

9 6200 2

7 3890 3

18385 1

Dense Rank vs. Rank

Dense_Rank is the same as Rank except that it handles ties differently: Rank skips rank values after each group of tied rows, where Dense_Rank does not skip numbers and simply moves to the next value.

Cume-Dist Ranking (Inverse Percentiles)


Cume_Dist computes the relative position of a value within its partition. It returns the result as a decimal greater than 0 and at most 1.

SELECT territory, product, sum(total_sales) amt,
       CUME_DIST() OVER (PARTITION BY territory
                         ORDER BY sum(total_sales)) as cume_dist
FROM sales
group by territory, product
order by territory, amt;

TERRITORY  PRODUCT  AMT   CUME_DIST
---------- -------- ----- ----------
1          8        345   .333333333
1          9        800   .666666667
1          7        1390  1
3          7        1000  .333333333
3          9        2550  .666666667
3          8        5100  1
5          7        1500  .333333333
5          8        2850  1
5          9        2850  1

Percent Rank Function

This is similar to Cume_Dist except that it uses the row's rank as the numerator ((rank - 1) divided by (rows - 1)) and returns values between 0 and 1, including 0.

SELECT territory, product, sum(total_sales) amt,
       PERCENT_RANK() OVER (PARTITION BY territory
                            ORDER BY sum(total_sales)) as pct_rank
FROM sales
group by territory, product
order by territory, amt
/

TERRITORY  PRODUCT  AMT   PCT_RANK
---------- -------- ----- ---------
1          8        345   0
1          9        800   .5
1          7        1390  1
3          7        1000  0
3          9        2550  .5
3          8        5100  1
5          7        1500  0
5          8        2850  .5
5          9        2850  .5

Ntile Function

Allows you to perform calculations and statistics for tertiles, quartiles, deciles and other summary statistics. If the rows do not divide evenly into the buckets, each bucket will have the same number of rows as the others or at most 1 row more. An example of this is shown below:

SELECT sales_rep, total_sales,

       NTILE(4) OVER (ORDER BY total_sales DESC NULLS FIRST) AS quartile
FROM sales;

SALES_REP            TOTAL_SALES  QUARTILE
-------------------- ------------ ----------
Blake                2850         1
Allen                1600         1
Turner               1500         1
Ward                 1250         2
Simpson              990          2
Lee                  800          3
Smith                400          3
Jones                200          4
Jones                145          4

Hypothetical Rank

Analytic functions can be used to determine the rank a "hypothetical" row would have if it were inserted into a table. For example, consider a set of product categories that have total_sales determined by sub-category.

• What would the rank of a new product subcategory with sales of $2,000 be?

• What would the rank of a new product subcategory with sales of $1,480 be?

Let's look at an original list of the data and see how the hypothetical rank works with this list.

Original List

PROD  TOTAL_SALES
----- ------------
7     700
7     750
7     1600
7     2100
8     150
8     200
8     300
8     400
8     1200
9     1400
9     1475
9     1500
9     1600
9     2400

What is the rank of a new sub-product category with $2,000 of total_sales?

select prod,
       RANK(2000) within group (ORDER BY total_sales desc) as HRANK,
       TO_CHAR(PERCENT_RANK(2000) WITHIN GROUP (ORDER BY total_sales),'9.999') as HPERC,
       TO_CHAR(CUME_DIST(2000) WITHIN GROUP (ORDER BY total_sales),'9.999') as HCUME
FROM sales
Group by prod;

PROD  HRANK  HPERC  HCUME
----- ------ ------ ------
7     2      .750   .800
8     1      1.000  1.000
9     2      .800   .833

What is the rank of a new sub-product category with $1,480 of total_sales?

select prod,
       RANK(1480) within group (ORDER BY total_sales desc) as HRANK,
       TO_CHAR(PERCENT_RANK(1480) WITHIN GROUP (ORDER BY total_sales),'9.999') as HPERC,
       TO_CHAR(CUME_DIST(1480) WITHIN GROUP (ORDER BY total_sales),'9.999') as HCUME
FROM sales
Group by prod;

PROD  HRANK  HPERC  HCUME
----- ------ ------ ------
7     3      .500   .600
8     1      1.000  1.000
9     4      .400   .500

Lead/Lag Analytic Functions

These analytic functions provide the capability to determine the next value in a list without performing a self-join. Though this is not the same as traversing a tree or bill-of-materials type of structure, it has great benefits in helping us find the next row without needing to code a subquery – though the syntax may be even more difficult to understand than subquery logic. Let's take the example of a table employee with columns ename and hiredate. Suppose that we want to develop a query where each row has the employee's name, their hire date, and the next employee's hire date. This could be accomplished through a join to a subquery that retrieves the lowest hiredate that is larger than the one on the current row. Or, more simply, we can use the Lead function as follows:

SELECT ename, hiredate,
       LEAD(hiredate, 1) OVER (ORDER BY hiredate) AS next_hire_date
FROM employee;

ENAME           HIREDATE    NEXT_HIRE_DATE
--------------- ----------- --------------
COHEN           1991-APR-01 1991-OCT-31
KING            1991-OCT-31 1992-JAN-10
LEE             1992-JAN-10 1992-FEB-12
ALLEN           1992-FEB-12 1995-NOV-25
JONES           1995-NOV-25 1999-DEC-15
SMITH           1999-DEC-15

The offset of 1 tells Oracle to find the next row. If you do not specify an offset, its default is 1. To retrieve the date of the employee hired before the employee on a row, use the LAG analytic function:

SELECT ename, hiredate,
       LAG(hiredate, 1) OVER (ORDER BY hiredate) AS prev_hire_date
FROM employee;

ENAME           HIREDATE    PREV_HIRE_DATE


--------------- ----------- --------------
COHEN           1991-APR-01
KING            1991-OCT-31 1991-APR-01
LEE             1992-JAN-10 1991-OCT-31
ALLEN           1992-FEB-12 1992-JAN-10
JONES           1995-NOV-25 1992-FEB-12
SMITH           1999-DEC-15 1995-NOV-25

This can be very useful for determining things like effective and expiry dates on a row where only one date exists.

Explain for Lead Analytic Function

SELECT ename, hiredate,
       LEAD(hiredate, 1) OVER (ORDER BY hiredate) AS next_hire_date
FROM employee;

Rows  Execution Plan
0     SELECT STATEMENT MODE: ALL_ROWS
3     WINDOW (SORT)
3     TABLE ACCESS (FULL) OF 'EMPLOYEE' (TABLE)
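LEAD and LAG behave the same way in other engines with window functions. A minimal sqlite3 sketch (assuming SQLite 3.25+; dates are stored as ISO text so they sort correctly, and a subset of the employees above is used):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (ename TEXT, hiredate TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [("COHEN", "1991-04-01"), ("KING", "1991-10-31"),
                  ("LEE", "1992-01-10"), ("SMITH", "1999-12-15")])

# LEAD looks one row ahead, LAG one row back, in hiredate order;
# the first/last rows get NULL where there is no neighbor
result = conn.execute("""
    SELECT ename, hiredate,
           LEAD(hiredate) OVER (ORDER BY hiredate) AS next_hire_date,
           LAG(hiredate)  OVER (ORDER BY hiredate) AS prev_hire_date
    FROM employee ORDER BY hiredate
""").fetchall()
```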

First/Last Functions

The First/Last functions are analytic aggregate functions that operate on a set of values from a set of rows. Use them when you need a value from the rows that sort first or last on one column as the input to an aggregate such as min, max, sum, avg or count. Let's look at an example to find the highest salary among employees who have received the highest bonus, and the lowest salary among employees who have received the lowest bonus.

The input data is shown below:

salary     deptid  bonus
1,000,000  100     20,000
100,000    100     20,000
60,000     100     15,000
40,000     100     15,000

SELECT deptid,
       min(salary) keep (dense_rank FIRST order by bonus) "low",
       max(salary) keep (dense_rank LAST order by bonus) "high"
FROM emp_salary
group by deptid;

The result is below:

deptid  low    high
100     40000  1000000

In other words, the lowest salary that received the lowest bonus of 15,000 was 40,000, and the highest salary that received the highest bonus of 20,000 was 1,000,000.

Explain for First/Last Functions

SELECT deptid,
       min(salary) keep (dense_rank FIRST order by bonus) "low",
       max(salary) keep (dense_rank LAST order by bonus) "high"
FROM emp_salary
group by deptid;

Rows  Row Source Operation
0     SELECT STATEMENT MODE: ALL_ROWS
1     SORT (GROUP BY)
4     TABLE ACCESS (FULL) OF 'EMP_SALARY' (TABLE)

To achieve the same thing without these functions, you would have needed to code the following SQL:

SELECT a.deptid deptid, min(a.salary) low, max(c.salary) high
FROM emp_salary a,
     (select min(bonus) bonus from emp_salary) b,
     emp_salary c,
     (select max(bonus) bonus from emp_salary) d
WHERE a.bonus = b.bonus
  and c.bonus = d.bonus
GROUP BY a.deptid;

Explain for SQL Not Using First/Last Functions

SELECT STATEMENT
SORT GROUP BY
MERGE JOIN
SORT JOIN
NESTED LOOPS
MERGE JOIN
VIEW
SORT AGGREGATE
TABLE ACCESS FULL EMP_SALARY
FILTER
TABLE ACCESS FULL EMP_SALARY
VIEW
SORT AGGREGATE
TABLE ACCESS FULL EMP_SALARY
SORT JOIN
TABLE ACCESS FULL EMP_SALARY
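KEEP (DENSE_RANK FIRST/LAST) is Oracle-specific. As a hedged sketch of what it computes, the same result can be obtained in sqlite3 with correlated subqueries: restrict the aggregate to the rows holding the minimum (or maximum) bonus within the department. The table and data mirror the emp_salary example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_salary (salary INTEGER, deptid INTEGER, bonus INTEGER)")
conn.executemany("INSERT INTO emp_salary VALUES (?, ?, ?)",
                 [(1000000, 100, 20000), (100000, 100, 20000),
                  (60000, 100, 15000), (40000, 100, 15000)])

# "low": min salary among rows with the department's lowest bonus
# "high": max salary among rows with the department's highest bonus
result = conn.execute("""
    SELECT DISTINCT deptid,
           (SELECT MIN(salary) FROM emp_salary e2
            WHERE e2.deptid = e1.deptid
              AND e2.bonus = (SELECT MIN(bonus) FROM emp_salary e3
                              WHERE e3.deptid = e1.deptid)) AS low,
           (SELECT MAX(salary) FROM emp_salary e2
            WHERE e2.deptid = e1.deptid
              AND e2.bonus = (SELECT MAX(bonus) FROM emp_salary e3
                              WHERE e3.deptid = e1.deptid)) AS high
    FROM emp_salary e1
""").fetchall()
```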


First_Value, Last_Value Analytics

• These are similar functions, so we will look at first_value.

• An analytic function that gets the first value in an ordered set of rows.

• e.g. get the name of the employee with the lowest bonus.

Select emp_name, emp_id, bonus,
       first_value(emp_name) Over (order by bonus asc
                                   rows unbounded preceding) as low_bonus
From (select * from emp_sal order by emp_id);

EMP_NAME  EMP_ID  BONUS   LOW_BONUS
carter    2       20,000  carter
rose      3       20,000  carter
williams  7       25,000  carter
bosh      8       30,000  carter
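FIRST_VALUE is also available in sqlite3 (SQLite 3.25+). A sketch of the query above; note that the 20,000 bonus is tied between carter and rose, so this version adds emp_id as a tiebreaker (an addition not in the original query) to make the result deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_sal (emp_name TEXT, emp_id INTEGER, bonus INTEGER)")
conn.executemany("INSERT INTO emp_sal VALUES (?, ?, ?)",
                 [("carter", 2, 20000), ("rose", 3, 20000),
                  ("williams", 7, 25000), ("bosh", 8, 30000)])

# Every row sees the name of the employee with the lowest bonus so far
result = conn.execute("""
    SELECT emp_name, bonus,
           FIRST_VALUE(emp_name) OVER (ORDER BY bonus ASC, emp_id ASC
                                       ROWS UNBOUNDED PRECEDING) AS low_bonus
    FROM emp_sal ORDER BY bonus, emp_id
""").fetchall()
```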

Histogram Function in 9i/10g: Width_Bucket

This is not the optimizer histogram that you are familiar with from pre-9i releases. Those were height-based histograms that place the same number of values into each range and are used for query optimization. The new histogram function is width-based: each column value is assigned to a bucket of equal width, and for each row the function returns the number of the bucket that the value falls into. An example of this histogram function is shown below:

SELECT salesrep_id, total_sales,
       WIDTH_BUCKET(total_sales, 0, 1000.1, 10) "sale group"
From Sales
Where cityname = 'GOTHAM'
order by total_sales;

The histogram syntax is WIDTH_BUCKET(expr, min_value, max_value, num_buckets). This call breaks all values from 0 up to 1000.1 into 10 equal-width buckets; values below 0 fall into bucket 0, and values of 1000.1 and above fall into bucket 11. So values from 0 up to roughly 100 are in bucket 1, values up to roughly 200 are in bucket 2, and so on. The results of this query are shown below. Note that the total_sales value 1475 is in bucket 11 and the negative value -20 is in bucket 0.

salesrep_id  total_sales  sale group
149          -20          0
150          0            1
152          150          2
151          200          2
153          400          4
154          400          4
155          785          8
156          800          8
157          1000         10
158          1475         11

Medians in SQL

A common way to look for trends in SQL is to take the average of a set of values, as shown by:

Select avg(number_column1) from table1;

A better way to see the central tendency is to find the median of a set of values. There are 2 categories of median: the statistical median, which has as many values above it as below it, and the financial median, where all values are split into 2 groups and we take the average of the lowest value from the highest group and the highest value from the lowest group. Confused yet? Take a look at the example below of a list of 4 salaries.

Person_Id  Salary

1          1,000,000
2          100,000
3          60,000
4          40,000

The average salary from the list above is $1.2M / 4 = $300,000, which is misleading, since there is a single high value of $1 million. The financial median can be determined by splitting the above list into 2 groups, taking the lowest value of the top group ($100,000) and the highest value of the bottom group ($60,000), and averaging these 2 values, which gives $80,000. This is a better picture of what 'normal' salaries are, given this list. There is no Median function in standard SQL, but we can create our own in several ways. This is not a simple task, but there are many different solutions from SQL experts such as C. Date and J. Celko (see note 1), among others. This tip will include a solution based on Date's second median, which builds on Celko's first median. Given the Emp_Salary table above, find the median salary of $80,000.

First, split the table in 2 and get the lowest value of the top group ($100,000):

Select MIN(e.salary)
FROM emp_salary e
Where e.salary in
      (Select E2.salary                   -- top half rows
       FROM Emp_Salary E1, Emp_Salary E2
       WHERE E2.salary <= E1.salary
       GROUP BY E2.salary
       HAVING count(*) <= (Select CEIL(count(*) / 2) FROM Emp_Salary))
/

MIN(E.SALARY)
-------------
100000

Next, get the top value of the bottom half of the rows ($60,000):

Select MAX(e3.salary)
FROM emp_salary e3
Where e3.salary in
      (Select E4.salary                   -- bottom half rows
       FROM Emp_Salary E5, Emp_Salary E4
       WHERE E4.salary >= E5.salary
       GROUP BY E4.salary
       HAVING count(*) <= (Select CEIL(count(*) / 2) FROM Emp_Salary))


MAX(E3.SALARY)
--------------
60000

Now, combine the rows and take the average to get the median value of $80,000, as shown below:

Select avg(E.salary) AS median
From Emp_Salary E
Where E.salary in
      -- Get the lowest salary of the top half rows
      (Select MIN(e.salary)
       FROM emp_salary e
       Where e.salary in
             (Select E2.salary            -- top half rows
              FROM Emp_Salary E1, Emp_Salary E2
              WHERE E2.salary <= E1.salary
              GROUP BY E2.salary
              HAVING count(*) <= (Select CEIL(count(*) / 2) FROM Emp_Salary))
       UNION
       -- Get the highest salary of the bottom half rows
       Select MAX(e3.salary)
       FROM emp_salary e3
       Where e3.salary in
             (Select E4.salary            -- bottom half rows
              FROM Emp_Salary E5, Emp_Salary E4
              WHERE E4.salary >= E5.salary
              GROUP BY E4.salary
              HAVING count(*) <= (Select CEIL(count(*) / 2) FROM Emp_Salary)));

MEDIAN
----------
80000

If another row is added to the table with a value of 80,000, the result will be the same, since both queries that take the top and bottom values will return the middle row.

Explain for the Median SQL shown above

0  SELECT STATEMENT MODE: ALL_ROWS
1  SORT (AGGREGATE)
2  HASH JOIN
2  VIEW OF 'VW_NSO_3' (VIEW)
2  SORT (UNIQUE)
2  UNION-ALL
1  SORT (AGGREGATE)
2  HASH JOIN
2  VIEW OF 'VW_NSO_1' (VIEW)
2  FILTER
4  SORT (GROUP BY)
10 NESTED LOOPS
4  TABLE ACCESS (FULL) OF 'EMP_SALARY' (TABLE)


10 TABLE ACCESS (FULL) OF 'EMP_SALARY' (TABLE)
1  SORT (AGGREGATE)
4  TABLE ACCESS (FULL) OF 'EMP_SALARY' (TABLE)
4  TABLE ACCESS (FULL) OF 'EMP_SALARY' (TABLE)
1  SORT (AGGREGATE)
2  HASH JOIN
2  VIEW OF 'VW_NSO_2' (VIEW)
2  FILTER
4  SORT (GROUP BY)
10 NESTED LOOPS
4  TABLE ACCESS (FULL) OF 'EMP_SALARY' (TABLE)
10 TABLE ACCESS (FULL) OF 'EMP_SALARY' (TABLE)
1  SORT (AGGREGATE)
4  TABLE ACCESS (FULL) OF 'EMP_SALARY' (TABLE)
4  TABLE ACCESS (FULL) OF 'EMP_SALARY' (TABLE)
4  TABLE ACCESS (FULL) OF 'EMP_SALARY' (TABLE)

Medians in 9i

Medians can be simulated using the new inverse percentile analytic function. This is an inverse distribution function that assumes a discrete distribution model. It may, however, return a different value from the median computed in the previous example. To find a median value, do the following:

• Write an expression that maps each data value to a position between 0 and 1 using the cume_dist analytic function, then use percentile_disc to pull out the value at the 0.5 level. Let's take a look at a SQL statement using the cume_dist and percentile_disc features:

SELECT salary, deptid,
       CUME_DIST() OVER (PARTITION BY deptid ORDER BY salary DESC) cume_dist,
       PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY salary DESC)
                            OVER (PARTITION BY deptid) percentile_disc
FROM emp_salary;

salary   deptid  cume_dist  percentile_disc
1000000  100     .25        100000
100000   100     .5         100000
60000    100     .75        100000
40000    100     1          100000

Using the above, the inverse percentile finds the value at the 0.5 cume_dist level.

Page 17: 10G ADVANCED SQL AND PERFORMANCE · 10G ADVANCED SQL AND PERFORMANCE PAPER 615 . INTRODUCTION We will cover advanced SQL concepts, real-world problems and solutions and the performance

Identify the track name here

Paper #

In the example this is 100,000; using 0.51 would have given 60,000.

This may not be the median you are looking for, but you can see how this can be used in some cases to find the median value.
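The financial-median logic described above is easy to check outside SQL. A sketch in plain Python (the function name `financial_median` is ours, not from the paper); for an even row count it matches Python's built-in `statistics.median`, which also averages the two middle values:

```python
from statistics import median

def financial_median(values):
    """Average the highest value of the bottom half and the lowest
    value of the top half; for an odd count, take the middle value."""
    s = sorted(values)
    n = len(s)
    if n % 2:
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2

salaries = [1000000, 100000, 60000, 40000]
```

For the four salaries above this yields 80,000, and, as the paper notes, inserting another row of 80,000 leaves the result unchanged.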

Medians in 10g Finally, a Median function!

• Inverse distribution function assuming continuous distribution.

• Null values are ignored.

• Works on numeric datatypes and on nonnumeric datatypes that can be converted to numeric.

• Median first orders the rows.

• With N as the number of rows, Oracle determines the median row number (MRN) as: MRN = 1 + (0.5*(N-1))

• Once Oracle has determined the MRN, it takes the ceiling row number CRN = ceiling(MRN) and the floor row number FRN = floor(MRN) and uses these to get the median.

• For an odd number of rows, CRN = FRN = MRN, and the median is the value at row MRN.

• For an even number of rows, the median is interpolated between the two middle rows: median = (value at FRN) * (CRN - MRN) + (value at CRN) * (MRN - FRN). E.g., for 4 rows, MRN = 2.5, so median = (row 2 value * 0.5) + (row 3 value * 0.5).
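The MRN/CRN/FRN steps above can be written out directly. A sketch in plain Python (the name `median_10g` is ours, chosen to echo the feature being described):

```python
import math

def median_10g(values):
    """Continuous inverse-distribution median, following the MRN rule:
    MRN = 1 + 0.5 * (N - 1), interpolating between the floor and
    ceiling rows when MRN is not a whole number."""
    s = sorted(values)
    n = len(s)
    mrn = 1 + 0.5 * (n - 1)
    frn, crn = math.floor(mrn), math.ceil(mrn)
    if frn == crn:
        return s[frn - 1]                 # odd row count: exact middle row
    # even row count: weight each neighbor by its distance from MRN
    return s[frn - 1] * (crn - mrn) + s[crn - 1] * (mrn - frn)
```

For the four-salary example, MRN = 2.5, so the function returns half of row 2 plus half of row 3, i.e. 80,000.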

Select region, median(dollar_sales) From Sales_Hist Group By region;

Explain for the Median Function

Select deptid, median(salary)
From emp_salary
Group By deptid;

Rows  Execution Plan
0     SELECT STATEMENT MODE: ALL_ROWS
1     SORT (GROUP BY)
4     TABLE ACCESS (FULL) OF 'EMP_SALARY' (TABLE)


Full Outer Joins – 9i/10g

Oracle9i and 10g have implemented many SQL92 and SQL99 features, though Oracle cannot be considered fully SQL99 compliant. One such feature is the full outer join. The pre-9i method of performing a full outer join can still be used; it consists of a left outer join statement UNIONed with a right outer join. You can see an example of this below.

Old Proprietary Outer Join Example:

select t1.t1col01, t2.t2col01

from t1, t2

where t1col01 = t2col01 (+)

UNION

select t1.t1col01, t2.t2col01

from t1 , t2

where t1.t1col01 (+) = t2.t2col01;

Result:

t1col  t2col
C1
C5     C5
       C9
C4

This can be rewritten using ANSI-SQL outer join syntax as shown below:

Select t1.t1col01, t2.t2col01
From t1 FULL OUTER JOIN t2 ON t1col01 = t2col01
order by t1.t1col01;

The above statement processes T1 and T2 only once each, as opposed to the first statement, which must access them twice.

ANSI-SQL-compliant syntax exists for full, left and right outer joins.
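The old union-of-outer-joins pattern is portable. As a sketch, here it is in sqlite3 (an assumption on our part: native FULL OUTER JOIN only arrived in SQLite 3.39, so the union form is the safe choice there), reproducing the T1/T2 result above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (t1col01 TEXT)")
conn.execute("CREATE TABLE t2 (t2col01 TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?)", [("C1",), ("C4",), ("C5",)])
conn.executemany("INSERT INTO t2 VALUES (?)", [("C5",), ("C9",)])

# Left outer join in each direction, UNIONed: the pre-9i full outer join
result = conn.execute("""
    SELECT t1col01, t2col01 FROM t1 LEFT JOIN t2 ON t1col01 = t2col01
    UNION
    SELECT t1col01, t2col01 FROM t2 LEFT JOIN t1 ON t1col01 = t2col01
    ORDER BY t1col01
""").fetchall()
```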

Explains from the 2 Outer Join Methods First, using the “old” syntax Rows Execution Plan Full Outer Join with Union (OLD) Syntax 0 SELECT STATEMENT MODE: ALL_ROWS 3 SORT (UNIQUE) 4 UNION-ALL 2 HASH JOIN (OUTER) 2 TABLE ACCESS (FULL) OF 'T1' (TABLE) 2 TABLE ACCESS (FULL) OF 'T2' (TABLE)


2    HASH JOIN (OUTER)
2     TABLE ACCESS (FULL) OF 'T2' (TABLE)
2     TABLE ACCESS (FULL) OF 'T1' (TABLE)

Second, using the "new" outer join syntax:

Rows Execution Plan
0 SELECT STATEMENT MODE: ALL_ROWS
3  SORT (ORDER BY)
3   VIEW
3    UNION-ALL
2     HASH JOIN (OUTER)
2      TABLE ACCESS (FULL) OF 'T1' (TABLE)
2      TABLE ACCESS (FULL) OF 'T2' (TABLE)
1     HASH JOIN (ANTI)
2      TABLE ACCESS (FULL) OF 'T2' (TABLE)
2      TABLE ACCESS (FULL) OF 'T1' (TABLE)

Partitioned Outer Joins in 10g

• Can convert sparse data into dense data.
• aka: a Group Join

o Union of outer joins

• Logically partitioned data based on “Partition By” clause.

• Accepted by ISO and ANSI for SQL standards

• E.g. HR application tracks the number of employees in the company every week. If the number does not change, no entry is inserted into the emp_count table. We want an entry for every week regardless of whether this number has changed.

Partitioned Outer Joins in 10g - Example

An example uses 2 tables: emp_count and week. The employee count is taken every week and, when it changes, a new entry is added to the emp_count table for each region. To get a row for every week, use the following:

INPUT TABLE 1: EMP_COUNT

Select * From emp_count;

REG  WEEK       COUNT
R1   01-JAN-05  1023
R1   08-JAN-05  1030
R1   22-JAN-05  1033
R1   05-FEB-05  1032

INPUT TABLE 2: WEEK

Select start_dt from week;

Start_dt
01-JAN-05
08-JAN-05
15-JAN-05
22-JAN-05
29-JAN-05
05-FEB-05

RESULT

Select region, start_dt, count
From emp_count


Partition By (region)
Right Outer Join week On (emp_count.week=week.start_dt)
Order by region, start_dt;

REG  START_DT   COUNT
R1   01-JAN-05  1023
R1   08-JAN-05  1030
R1   15-JAN-05         (s.b. 1030)
R1   22-JAN-05  1033
R1   29-JAN-05         (s.b. 1033)
R1   05-FEB-05  1032

Notice the 2 null entries: the employee count did not change for that region in those weeks. We want to fill each such week in with the value from the week before, since that is really the correct value. First we will show the explain plan from the above SQL, and then the new SQL that gives us the result we really want.

Partitioned Outer Join Explain

Select region, start_dt, count
From emp_count
Partition By (region)
Right Outer Join week On (emp_count.week=week.start_dt)
Order by region, start_dt;

Rows Execution Plan
------- ---------------------------------------------------
0 SELECT STATEMENT MODE: ALL_ROWS
6  VIEW
6   MERGE JOIN (PARTITION OUTER)
7    SORT (JOIN)
6     TABLE ACCESS (FULL) OF 'WEEK' (TABLE)
4    SORT (PARTITION JOIN)
4     TABLE ACCESS (FULL) OF 'EMP_COUNT' (TABLE)

Partitioned Outer Joins in 10g - Example (cont.): Making Partitioned Data Dense

The SQL below fills in the values that were null. The previous SQL is wrapped inside the LAST_VALUE function to carry forward the emp_count from the prior week for that region (since the value is the same).
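The gap-filling that LAST_VALUE … IGNORE NULLS performs is essentially a forward fill within each partition. A rough procedural sketch (illustrative Python using the example's rows; not Oracle code):

```python
def densify(rows):
    # rows: (region, start_dt, count) ordered by region, start_dt.
    # A None count is replaced by the last non-null count seen in the
    # same region partition -- i.e. LAST_VALUE(... IGNORE NULLS).
    filled, last = [], {}
    for region, start_dt, count in rows:
        if count is None:
            count = last.get(region)   # carry the prior week's value forward
        else:
            last[region] = count
        filled.append((region, start_dt, count))
    return filled

sparse = [("R1", "01-JAN-05", 1023), ("R1", "08-JAN-05", 1030),
          ("R1", "15-JAN-05", None), ("R1", "22-JAN-05", 1033),
          ("R1", "29-JAN-05", None), ("R1", "05-FEB-05", 1032)]
dense = densify(sparse)
```

The two None weeks (15-JAN and 29-JAN) pick up 1030 and 1033 respectively, matching the dense result shown in the paper.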


Select region, start_dt,
       LAST_VALUE(count ignore nulls)
         OVER (Partition By region Order by region, start_dt) week
FROM ( Select region, start_dt, count
       From emp_count
       Partition By (region)
       Right Outer Join week On (emp_count.week=week.start_dt))
Order by region, start_dt;

REG  START_DT   COUNT
R1   01-JAN-05  1023
R1   08-JAN-05  1030
R1   15-JAN-05  1030   (dense)
R1   22-JAN-05  1033
R1   29-JAN-05  1033   (dense)
R1   05-FEB-05  1032

The explain plan for this query is shown below.

Making Partitioned Data Dense Explain

Select region, start_dt,
       LAST_VALUE(count ignore nulls)
         OVER (Partition By region Order by region, start_dt) week
FROM ( Select region, start_dt, count
       From emp_count
       Partition By (region)
       Right Outer Join week On (emp_count.week=week.start_dt))
Order by region, start_dt;

Rows Execution Plan
------- ---------------------------------------------------
0 SELECT STATEMENT MODE: ALL_ROWS
6  WINDOW (BUFFER)          <-- new since the last explain
6   VIEW
6    MERGE JOIN (PARTITION OUTER)
7     SORT (JOIN)
6      TABLE ACCESS (FULL) OF 'WEEK' (TABLE)
4     SORT (PARTITION JOIN)


4      TABLE ACCESS (FULL) OF 'EMP_COUNT' (TABLE)

GROUP BY, CUBE AND ROLLUP

Group By and Order By

Though they seem similar, it is important to understand the differences between the Group By and Order By clauses. With Group By, grouping is done on the tables in the From clause (on the input), while ordering is done on the columns in the select list (on the output). This leads to the following:

- 'order by' columns need to be in the select list while 'group by' columns do not.
- 'order by' can use an integer or expression while 'group by' cannot.

These differences are subtle but can explain some incompatibilities that you may have noticed between the two clauses.

Using Group By and Order By together

You may have occasion to group data to perform a function, but want the data ordered differently than the grouped columns. To do this, use "Group By" and "Order By" together to ensure that your data is ordered properly. An example is shown below:

Select postal_code, customer_id, count(*)
From Sales
Group by postal_code, customer_id
Order by customer_id, postal_code;

The above query groups the data on input to perform the count(*) and then orders the data differently on output, displaying it by customer_id and postal_code. You may even have run across cases where you used "Group By" without "Order By" and the data was not returned in the order of the "Group By" columns. To ensure that data is returned in the order you want, use "Order By" with "Group By". Remember that "Order By" sorts the output data.

Group By, Having and Where

SQL statements with a Group By clause may have data filtered by a Where clause and by a Having clause. The Having clause filters data after the Group By aggregation, while the Where clause filters it before the Group By is performed. It is more efficient to eliminate rows in a Where clause when possible. An example of such a statement is:

Select owner, sum(num_rows)

From all_tables
Where num_rows > 10000
Group by owner
Having sum(num_rows) > 1000000;
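The Where-before-Having evaluation order can be sketched procedurally. This is an illustrative Python model with made-up (owner, num_rows) catalog rows, not real dictionary data:

```python
def owners_with_many_rows(tables, min_table_rows=10000, min_total=1000000):
    # WHERE phase: filter individual rows before any grouping happens.
    kept = [(owner, n) for owner, n in tables if n > min_table_rows]
    # GROUP BY phase: aggregate the surviving rows per owner.
    totals = {}
    for owner, n in kept:
        totals[owner] = totals.get(owner, 0) + n
    # HAVING phase: filter the aggregated groups.
    return {o: t for o, t in totals.items() if t > min_total}

tables = [("APP", 600000), ("APP", 500000), ("APP", 5000),
          ("HR", 20000), ("HR", 30000)]
result = owners_with_many_rows(tables)
```

Note that APP's 5,000-row table is discarded by the Where phase before aggregation (so it never contributes to the sum), and HR is discarded afterward by the Having phase.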


The above statement sums the number of rows for all tables having more than 10,000 rows and displays those owners with more than 1,000,000 rows combined. The filtering for 10,000 rows is performed first, then the Group By is performed on only those tables with > 10,000 rows, and finally each owner is checked to see whether its remaining tables total > 1,000,000 rows.

Extending Group By

Group By allows aggregates to be performed on a single column or on a set of columns. It is a powerful but simple feature of SQL. "Group By" alone, however, does not provide all of the grouping information we may want. There are 2 other features that may be used to enhance Group By: Cube and Rollup. Rollup extends Group By to aggregate at many levels, including the grand total. Cube further extends Rollup by calculating all possible combinations of subtotals for a Group By; it makes cross-tab reports possible for Group By columns.

Group By with Rollup

It is simple to Group By a set of columns and perform a Sum function based on this grouping. This example groups by sales region and territory:

Select region, territory, sum(sales_dollars) TOTAL_SALES
From sales
group by region, territory;

The result from this query is:

REGION  TERRITORY  TOTAL_SALES
------- ---------- -----------
EAST    1          1500.00
EAST    2          2000.00
WEST    1          3000.00
WEST    2          500.00

What if we also want to summarize at the region level and for the entire result set? This requires the ROLLUP function, used as follows:

Select nvl(region,'TOTAL COMPANY') REGION,
       nvl(territory,'Total Region') TERRITORY,
       sum(sales_dollars) TOTAL_SALES
From sales
group by rollup (region, territory);

The result from this query would be:

REGION  TERRITORY     TOTAL_SALES
------- ------------- -----------
EAST    1             1500.00
EAST    2             2000.00
EAST    Total Region  3500.00
WEST    1             3000.00


WEST           2             500.00
WEST           Total Region  3500.00
TOTAL COMPANY                7000.00

The NVL represents null-value logic: when the rollup row for territory is null, we substitute the label 'Total Region', and when the rollup row for region is null, we substitute 'TOTAL COMPANY'. These nulls appear because the rollup rows summarize across all territories or regions.

What is a Cube?

Group By syntax can be enhanced with the CUBE feature, which provides totals and subtotals for the different combinations of columns and categories in a query. If we query our Sales table grouping by region and territory with Total_Sales, the Group By gives us totals for each unique combination of region and territory. Rollup adds subtotals for each region and a total for all regions combined. Cube takes this one step further and also gives totals for each territory regardless of region. The Cube therefore gives us totals for all combinations of the columns in the Group By clause. This is a trademark of OLAP services used in data warehousing. Extending the above query one last time with Cube gives us totals by territory regardless of region, as shown below:

Select decode(grouping(region),1,'TOTAL COMPANY', region) REGION,
       decode(grouping(territory),1,'Total Region', territory) TERRITORY,
       sum(sales_dollars) TOTAL_SALES
From sales
Group By Cube (region, territory);

The result from this query would be:

REGION         TERRITORY     TOTAL_SALES
-------------- ------------- -----------
EAST           1             1500.00
EAST           2             2000.00
EAST           Total Region  3500.00
WEST           1             3000.00
WEST           2             500.00
WEST           Total Region  3500.00
TOTAL COMPANY  1             4500.00
TOTAL COMPANY  2             2500.00
TOTAL COMPANY  Total Region  7000.00

The decode function translates the grouping indicator of 1 into the label 'TOTAL COMPANY' or 'Total Region'.

Dealing with Nulls in CUBE and Rollup

You may have noticed with CUBE and ROLLUP that we check for NULL values to determine that a higher-level summary (rollup) has been reached. There is a catch: what if the value returned at a higher level, such as the territory level in our example, actually is NULL? How do we distinguish a NULL returned by the query from one produced by a ROLLUP?


Rollup and Cube provide a facility for exactly this: the GROUPING function returns 1 if the NULL is a result of CUBE or ROLLUP, and 0 if it is a natural NULL in the data. To take advantage of this in Oracle, write the query as follows:

Select decode(grouping(region),1,'TOTAL COMPANY', region) REGION,
       decode(grouping(territory),1,'Total Region', territory) TERRITORY,
       sum(sales_dollars) TOTAL_SALES
From sales
group by rollup (region, territory);

The decode function translates the grouping indicator of 1 into the label 'TOTAL COMPANY' or 'Total Region'.

Explain for Group By with Cube

Select decode(grouping(region),1,'Total Company', region),
       decode(grouping(terr),1,'Total Region', terr),
       sum(total_sales) Total_Sales
FROM sales
GROUP BY CUBE (region, terr);

Rows Row Source Operation
0 SELECT STATEMENT MODE: ALL_ROWS
9  SORT (GROUP BY)
16  GENERATE (CUBE)
4    SORT (GROUP BY)
10    TABLE ACCESS (FULL) OF 'SALES' (TABLE)

Grouping Sets – 9i/10g

This new 9i feature enhances the SQL groupings offered by the Cube and Rollup clauses. Grouping Sets allow us to specify the exact levels of aggregation across various combinations of columns. Let's look at an example of a set of aggregations across 3 different groupings. Doing this with a Cube would compute many unneeded groupings, and a "Union All" approach would require 3 separate queries. Consider the input data listed below.

month  terr  prod  total_sales
1      1     8     200
2      1     8     150
1      2     8     300
2      2     8     400
1      1     9     1400
2      1     9     1600
1      2     9     1500
2      2     9     1475

We want to provide 3 sets of groupings by:

1) month, terr, prod
2) month, prod
3) terr, prod


This can be accomplished with the following SQL (note the "Group By Grouping Sets" syntax):

Select month, terr, prod, sum(total_sales) sum_sales
From sales
Group By Grouping Sets ((month, terr, prod), (month, prod), (terr, prod));

The result of this is below:

month  terr  prod  sum_sales
1      1     8     200
1      2     8     300
1      1     9     1400
1      2     9     1500
2      1     8     150
2      2     8     400
2      1     9     1600
2      2     9     1475
1            8     500
1            9     2900
2            8     550
2            9     3075
       1     8     350
       1     9     3000
       2     8     700
       2     9     2975

Grouping Sets

• Prunes aggregates you don't need

o Does not aggregate as much as Cube or Rollup

• Computes all groupings in Grouping Sets and combines results with a Union All.

• Composite columns can be specified by grouping columns in parentheses, to be treated as a single unit by the Cube or Rollup.

• Concatenated groupings let you take multiple Grouping Sets, Cube or Rollup operations and separate them with commas to form a Group By.
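Conceptually, Grouping Sets computes one GROUP BY per listed set and combines the results with UNION ALL. A rough sketch in Python (illustrative only; the column order follows the result table above, and the helper names are our own):

```python
def group_sum(rows, cols, keys, measure):
    # Aggregate `measure` over each distinct combination of `keys`.
    out = {}
    for r in rows:
        k = tuple(r[cols.index(c)] for c in keys)
        out[k] = out.get(k, 0) + r[cols.index(measure)]
    return out

def grouping_sets(rows, cols, sets, measure):
    # One GROUP BY per grouping set; results are kept side by side,
    # which is what UNION ALL produces in the SQL form.
    return {keys: group_sum(rows, cols, list(keys), measure) for keys in sets}

cols = ("month", "terr", "prod", "total_sales")
sales = [(1, 1, 8, 200), (2, 1, 8, 150), (1, 2, 8, 300), (2, 2, 8, 400),
         (1, 1, 9, 1400), (2, 1, 9, 1600), (1, 2, 9, 1500), (2, 2, 9, 1475)]
result = grouping_sets(sales, cols,
                       [("month", "terr", "prod"), ("month", "prod"),
                        ("terr", "prod")],
                       "total_sales")
```

Only the three requested aggregation levels are computed; a CUBE over the same three columns would compute eight.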

Grouping Sets Explain

See the explain below for the Grouping Sets query. One efficiency of this statement is that aggregates that are not needed are never calculated; notice, however, that the plan performs considerably more work (a temp-table transformation) than a plain Cube or Rollup statement.


Select month, terr, prod, sum(total_sales) sum_sales
From sales
Group By Grouping Sets ((month, terr, prod), (month, prod), (terr, prod));

Rows Execution Plan
0 SELECT STATEMENT MODE: ALL_ROWS
21  TEMP TABLE TRANSFORMATION
0    MULTI-TABLE INSERT
0     DIRECT LOAD INTO OF 'SYS_TEMP_0FD9D6605_88850'
0     DIRECT LOAD INTO OF 'SYS_TEMP_0FD9D6606_88850'
0     SORT (GROUP BY ROLLUP)
0      TABLE ACCESS (FULL) OF 'SALES' (TABLE)
0    LOAD AS SELECT
21    SORT (GROUP BY)
21     TABLE ACCESS (FULL) OF 'SYS_TEMP_0FD9D6605_88850' (TABLE (TEMP))
21   VIEW
10    VIEW
11     UNION-ALL
2       TABLE ACCESS (FULL) OF 'SYS_TEMP_0FD9D6605_88850' (TABLE (TEMP))
15      TABLE ACCESS (FULL) OF 'SYS_TEMP_0FD9D6606_88850' (TABLE (TEMP))

10g Inter-row and Inter-array Calculations: The Model Clause

• Another use of analytic capabilities
• Spreadsheet-type functionality
  o But, this is not Excel!
• Maps columns into partitions, dimensions and measures
  o Partitions: each is viewed as an independent array
  o Dimensions: cells in a partition that define its characteristics
  o Measures: data cells (aka facts)
• The Model clause is processed after all other clauses except Order By.
• The "return updated rows" clause displays only changed rows.
• To insert, update or merge values in a table, you need to use the Model results as input to the insert, update or merge statement.

The Model Clause: Example

In this example, we will project next quarter's sales based on the last 2 quarters.

INPUT DATA: Select region reg, product prod, quarter qtr, sales from Qtr_Sales;

Region  Product  Qtr   Sales


E       1        04Q4  4550
E       1        05Q1  5000
E       2        04Q4  6300
E       2        05Q1  6700
W       1        04Q4  7900
W       1        05Q1  7700
W       2        04Q4  2500
W       2        05Q1  4000

The SQL statement to calculate next quarter's sales is below:

Select * from Qtr_Sales
MODEL return updated rows
Partition By (region)
Dimension By (product,quarter)
Measures(sales)
Rules ( sales[1,'05Q2']=sales[1,'05Q1']-sales[1,'04Q4']+sales[1,'05Q1'] ,
        sales[2,'05Q2']=sales[2,'05Q1']*0.5)
Order by region, product;

QUERY RESULT

Reg  Prod  Qtr   Sales
E    1     05Q2  5450
E    2     05Q2  3350
W    1     05Q2  7500
W    2     05Q2  2000
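The Model clause's rule evaluation behaves like formulas over a spreadsheet of cells keyed by (partition, dimensions). A rough sketch of the two rules above (illustrative Python; the function name is our own, and the data is the example's):

```python
def project_q2(qtr_sales):
    # qtr_sales: {(region, product, quarter): sales}.
    # Each region is an independent partition; the rules are applied
    # within each one, creating the new '05Q2' cells.
    cells = dict(qtr_sales)
    for r in {reg for reg, _, _ in qtr_sales}:
        # sales[1,'05Q2'] = sales[1,'05Q1'] - sales[1,'04Q4'] + sales[1,'05Q1']
        cells[(r, 1, "05Q2")] = 2 * cells[(r, 1, "05Q1")] - cells[(r, 1, "04Q4")]
        # sales[2,'05Q2'] = sales[2,'05Q1'] * 0.5
        cells[(r, 2, "05Q2")] = cells[(r, 2, "05Q1")] * 0.5
    # "return updated rows": emit only the newly created cells.
    return {k: v for k, v in cells.items() if k not in qtr_sales}

data = {("E", 1, "04Q4"): 4550, ("E", 1, "05Q1"): 5000,
        ("E", 2, "04Q4"): 6300, ("E", 2, "05Q1"): 6700,
        ("W", 1, "04Q4"): 7900, ("W", 1, "05Q1"): 7700,
        ("W", 2, "04Q4"): 2500, ("W", 2, "05Q1"): 4000}
updated = project_q2(data)
```

Running this on the input rows reproduces the four projected '05Q2' rows from the query result above.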

Explain of the above Model Clause

Notice the efficient access path below, where the table is accessed only once and the "SQL MODEL" access plan is used.

Select * from Qtr_Sales
MODEL return updated rows
Partition By (region)
Dimension By (product,quarter)
Measures(sales)
Rules ( sales[product=1,quarter='05Q1']=sales[1,'05Q1']-sales[1,'04Q4']+sales[1,'05Q1'] ,
        sales[2,'05Q1']=sales[2,'05Q1']*0.5)
order by region, product;


Rows Execution Plan
------- ---------------------------------------------------
0 SELECT STATEMENT MODE: ALL_ROWS
4  SORT (ORDER BY)
4   SQL MODEL (ORDERED FAST)
8    TABLE ACCESS MODE: ANALYZED (FULL) OF 'QTR_SALES'

The Model Clause: Dimensions

• A cell reference must qualify all dimensions in the “dimension by” clause.

• Positional Reference
  o Dimension By (product,quarter) … sales[2,'05Q2']=sales[2,'05Q1']*0.5

• Symbolic Reference
  o sales[product=2,quarter='05Q1']=…
  o Only for updating existing cells. If the second quarter of 05 has no data yet, then no rows will be returned for the following: sales[product=2,quarter='05Q2']=…

The Model Clause: Ordering of Rules

• Sequential (the default)
  o Rules are evaluated in the order they are listed in the Model clause.
    Select … Model … Rules Sequential Order …
• Automatic
  o Dependencies among the rules are evaluated, and processing follows that dependency order.
    Select … Model … Rules Automatic Order …

The Model Clause: Current Value Function cv()

• Applies specs from the left side of a formula to the right side.
• Like a short-form version of a join condition.

Select * from Qtr_Sales
MODEL return updated rows
Partition By (region)
Dimension By (product,quarter)
Measures(sales)
Rules ( sales[2,quarter between '04Q4' and '05Q1']=sales[1,CV(quarter)]*1.1)
Order by region, product;


The result from the above query is shown below:

R  PRODUCT  QUAR  SALES
E  2        04Q4  5005
E  2        05Q1  5500
W  2        04Q4  8690
W  2        05Q1  8470
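CV(quarter) simply means "the quarter of the cell currently being assigned on the left side," so each product-2 cell in the range is set from the product-1 cell of the same quarter. A rough Python sketch (illustrative only; the function name is our own, and the cells come from the example's product-1 rows):

```python
def apply_cv_rule(cells, quarters=("04Q4", "05Q1")):
    # Rule: sales[2, quarter between '04Q4' and '05Q1']
    #         = sales[1, CV(quarter)] * 1.1
    # For each region and each quarter in the range, the left side's
    # current quarter value is reused on the right side.
    out = dict(cells)
    for r in {reg for reg, _, _ in cells}:
        for q in quarters:
            out[(r, 2, q)] = round(out[(r, 1, q)] * 1.1, 2)
    return out

cells = {("E", 1, "04Q4"): 4550, ("E", 1, "05Q1"): 5000,
         ("W", 1, "04Q4"): 7900, ("W", 1, "05Q1"): 7700}
updated = apply_cv_rule(cells)
```

The computed product-2 values (5005, 5500, 8690, 8470) match the query result above.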

Example of the Model Clause with the Current Value Function

Select * from Qtr_Sales
MODEL return updated rows
Partition By (region)
Dimension By (product,quarter)
Measures(sales)
Rules ( sales[2,quarter between '04Q4' and '05Q1']=sales[1,CV(quarter)]*1.1)
Order by region, product;

Rows Execution Plan
------- ---------------------------------------------------
0 SELECT STATEMENT MODE: ALL_ROWS
4  SORT (ORDER BY)
4   SQL MODEL (ORDERED)
8    TABLE ACCESS MODE: ANALYZED (FULL) OF 'QTR_SALES'

The Model Clause: Reference Models and Main Models

• Reference models
  o Allow you to reference multiple multi-dimensional arrays from a main model.
  o Reference models are read-only and are used as lookup tables.
  o Reference models have a name.
  o They cannot have a Partition clause.
• Main model
  o The multi-dimensional array that has its cells updated.
  o Can have one or more reference models.

An example of Reference Models and Main Models

Let’s look at an example that uses a reference model of inflation rates by quarter. This will be used in the main model to convert sales dollars into today’s dollars taking inflation rates into consideration. The 2 input tables are shown below:


Reference Model Select Statement:

select qtr, rate_pct as rt from inflation_rate;

Qtr   rate_pct
04Q4  1
05Q1  0.5

Qtr_Sales table used in the Main Model:

select * from qtr_sales where product=1 and quarter='04Q4';

R  PRODUCT  QUAR  SALES
E  1        04Q4  4550
W  1        04Q4  7900
SUM ............. 12450

Query with Main and Reference Model using the above 2 tables:

Select product, quarter, sl from Qtr_Sales
Group By product, quarter
MODEL return updated rows
Reference inf_rate on ( select qtr, rate_pct as rt from inflation_rate)
  Dimension By (qtr) Measures (rt)
MAIN sale_model
  Dimension By (product,quarter)
  Measures(sum(sales) sales, 0 sl)
  Rules (sl[1,'04Q4']=sales[1,'04Q4']
         + (sales[1,'04Q4']*inf_rate.rt['05Q1']/100)
         + (sales[1,'04Q4']*inf_rate.rt['04Q4']/100) );

The query above uses the inflation rates as input and calculates the current value of '04Q4' sales by adjusting it by both the '04Q4' inflation rate and the '05Q1' rate. Note that 12636.75 = 12450 + 124.50 + 62.25. The result is below.

RESULT OF THE QUERY

PRODUCT  QUAR  SL
1        04Q4  12636.75
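The arithmetic behind the rule can be checked in isolation (plain Python, using the example's figures):

```python
# '04Q4' total sales for product 1 (4550 + 7900) and the two
# inflation percentages from the reference table.
sales_04q4 = 4550 + 7900            # 12450
rate = {"04Q4": 1.0, "05Q1": 0.5}   # rate_pct per quarter

# sl = sales + sales * rt['05Q1']/100 + sales * rt['04Q4']/100
sl = (sales_04q4
      + sales_04q4 * rate["05Q1"] / 100    # 62.25
      + sales_04q4 * rate["04Q4"] / 100)   # 124.50
```

This reproduces the 12636.75 in the query result.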

Model Example: Reference Models and Main Model


Considering the amount of work being performed in this example, you can see that a relatively efficient access path has been chosen by the optimizer.

Select product, quarter, sl from Qtr_Sales
Group By product, quarter
MODEL return updated rows
Reference inf_rate on (select qtr, rate_pct as rt from inflation_rate)
  Dimension By (qtr) Measures (rt)
MAIN sale_model
  Dimension By (product,quarter)
  Measures(sum(sales) sales, 0 sl)
  Rules (sl[1,'04Q4']=sales[1,'04Q4']
         + (sales[1,'04Q4']*inf_rate.rt['05Q1']/100)
         + (sales[1,'04Q4']*inf_rate.rt['04Q4']/100) );

Rows Execution Plan
------- ---------------------------------------------------
0 SELECT STATEMENT MODE: ALL_ROWS
1  SQL MODEL (ORDERED FAST)
2   REFERENCE MODEL OF 'INF_RATE'
2    TABLE ACCESS (FULL) OF 'INFLATION_RATE' (TABLE)
1   FILTER
1    SORT (GROUP BY)
2     TABLE ACCESS MODE: ANALYZED (FULL) OF 'QTR_SALES'

Some other features of the Model clause

• A "For loop" can be used.
  o Rules (sales[FOR product in (2,3), quarter between '04Q4' and '05Q1']=sales[1,CV(quarter)]*1.1)
  o One formula calculates many cells.
• Ignoring Nulls
  o Use the IGNORE NAV feature to substitute a value for NULLs.
    Defaults: number=0; date=01-JAN-2000; char=' '.
    Select … Model Ignore NAV return updated rows … ;
• The Iterate clause calculates formulas iteratively a certain number of times.

BONUS SECTION

We will cover this section below if time allows. There are 2 main areas in this next section:

• Comparing join techniques: Joins, sub-queries and anti-joins

• Dealing with subqueries: Note: in 9i and 10g access paths are improved.


The optimizer is better able to distinguish and rewrite queries in the optimal manner. For example, you will now often find correlated and non-correlated queries take the same access path in 10g.

COMPARING JOIN TECHNIQUES

We will compare Merge Scan joins, Nested-Loop joins, Hash joins and Star queries.

Nested Loop Joins

The Nested Loop join is one of the 2 most common join techniques, the other being the Sort Merge join. It is also a relatively simple join to understand. Take the example of 2 tables joined as follows:

Select * From Table1 T1, Table2 T2
Where T1.Table1_Id = T2.Table1_id;
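The nested-loop flow can be sketched procedurally: scan the outer table, and for each outer row probe an index on the inner table's join column. This is an illustrative Python model with made-up rows and column names, not Oracle internals:

```python
def nested_loop_join(outer_rows, inner_index):
    # For each outer row, probe the inner table's index on the join
    # key; every index hit produces one joined row.
    joined = []
    for row in outer_rows:
        for match in inner_index.get(row["table1_id"], []):
            joined.append((row, match))
    return joined

# Hypothetical data: Table1 is the outer table; Table2 is represented
# by a dict simulating an index on its table1_id column.
table1 = [{"table1_id": 1, "a": "x"}, {"table1_id": 2, "a": "y"}]
table2_index = {1: [{"table1_id": 1, "b": "p"}, {"table1_id": 1, "b": "q"}]}
rows = nested_loop_join(table1, table2_index)
```

Note that without the index (the dict lookup here), the inner loop would degrade into a full scan of Table2 for every Table1 row, which is exactly the situation the text warns against.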

In the case of the Nested Loop join, an outer table is chosen (say Table1 in this case) and, for each row in the outer table, the inner table (Table2) is accessed with an index to retrieve the matching rows. Once all matching rows from Table2 are found, the next row of Table1 is retrieved and the matching against Table2 is performed again. It is important that efficient index access is used on the inner table (Table2 in this example), or that the inner table be very small. This is critical to prevent a table scan of the inner table for each row retrieved from the outer table. Nested Loop joins are usually used where relatively few rows are accessed in the outer table and joined to the inner table via efficient index access. This is the most common join performed by transactional (OLTP) systems.

Cluster Joins are a special case of Nested-Loop join where the outer table can access the inner table with the cluster index. Clusters can be very efficient but have many maintenance and performance drawbacks, so they are rarely used.

Merge Scan Joins

The Merge Scan join, also called a Sort-Merge join, is a very popular join performed when accessing a large number of rows. It is often seen in background tasks, batch processing and decision support systems. Take an example of 2 tables being joined and returning a large number of rows (say, thousands):

Select * From Table1 T1, Table2 T2
Where T1.Table1_Id = T2.Table1_id;

The Merge Scan join will be chosen because the database has detected that a large number of rows need to be processed, and it may also notice that index access to the rows is not efficient since the data is not clustered (ordered) well for this join. There may also be an inequality predicate such as <, <=, > or >=. This join is fast because of multi-block fetches and the fact that each table is accessed only once. It can be faster than a hash join if the rows are already sorted, so that no sorts need to be performed. The steps followed to perform this type of join are:

1) Pick an inner and an outer table.
2) Access the inner table, choosing the rows that match the predicates in the Where clause of the SQL statement.
3) Sort the rows retrieved from the inner table by the joining columns and store them as a temporary table. This step may be skipped if the data is already ordered by the keys and efficient index access can be performed.
4) The outer table may also need to be sorted by the joining columns so that both tables are sorted the same way. This step is also optional, depending on whether the outer table is already well ordered by the keys and whether efficient index access can be used.
5) Read both outer and inner tables (these may be the sorted temporary tables created in the previous steps), choosing rows that match the join criteria. This operation is very quick since both tables are sorted in the same manner and database prefetch can be used.
6) Optionally sort the data one more time if an 'Order By' was requested on columns other than those used to perform the join.

The Merge join can be deceptively fast due to the database's multi-block fetch capability (helped by the initialization parameter db_file_multiblock_read_count) and the fact that each table is accessed only once. Merge joins are only used for equi-joins. The other init.ora parameter that can be tuned to help performance is sort_area_size.

Hash Join

The Hash Join is a very efficient join when used in the right situation. With the hash join, one table is chosen as the outer table; this is the larger of the two tables in the join. The other is chosen as the inner table. Both tables are broken into sections, and the inner table's join columns are stored in memory (if hash_area_size is large enough) and hashed.
This hashing provides an algorithmic pointer that makes data access very efficient. Oracle attempts to keep the inner table in memory since it will be 'scanned' many times. The outer rows that match the query predicates are then selected and, for each outer row, the join key is hashed and the hash value is used to quickly find the matching rows in the inner table. This join can often outperform a Sort Merge join, particularly when one table is much larger than the other. No sorting is performed, and index access can be avoided since the hash algorithm locates the block where the inner row is stored. Hash joins are also only used for equi-joins. Other important init.ora parameters are hash_join_enabled, sort_area_size and hash_multiblock_io_count. You can also set the init.ora parameter pga_aggregate_target to size the SQL work areas automatically.

Star Joins

A particular type of join common to Data Marts and Data Warehouses is known as the "star join" or "star query". This is a join of a large "Fact" table with 2 or more smaller tables commonly called "Dimensions". Fact tables may be thought of as having transactional properties. An example of a Fact table is Sales, which contains keys and some measures; this is commonly a large table with millions, and sometimes billions, of rows. The Dimension tables describe the Fact table; examples are Customer, Product and Time. Star queries get their name because there is a single Fact table in the middle and the smaller dimension tables are directly related to it. To explain how star joins work, consider a central Fact table and 3 small Dimension tables related to it. The first and simplest approach for resolving a star query is to query the 3 dimension tables using the predicates in the Where clause to filter out unnecessary rows.
The resulting rows from the 3 dimension tables can be joined together, forming a Cartesian product. This is not as bad as it sounds, since these tables are relatively small and only the rows matching the Where criteria are included. The Cartesian product can be sorted and stored as a temporary table, and a merge-scan join can then be performed between that temporary table and the Fact table.
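This simple star-query strategy (filter each dimension, form the Cartesian product of the surviving keys, then match against the Fact table) can be sketched as follows. All table data, column names and predicates below are hypothetical, and the matching step stands in for the merge-scan join:

```python
from itertools import product

def star_query(fact, dims, filters):
    # Step 1: filter each small dimension table with its own predicate.
    filtered = [[r for r in d if keep(r)] for d, keep in zip(dims, filters)]
    # Step 2: Cartesian product of the filtered dimension keys.
    key_combos = {tuple(r["id"] for r in combo) for combo in product(*filtered)}
    # Step 3: match the combinations against the Fact table's key columns
    # (the paper does this with a merge-scan join against a temp table).
    return [f for f in fact
            if (f["dim1_id"], f["dim2_id"], f["dim3_id"]) in key_combos]

dim1 = [{"id": 1, "name": "north"}, {"id": 2, "name": "south"}]
dim2 = [{"id": 10, "desc": "retail"}]
dim3 = [{"id": 100, "text": "a"}, {"id": 101, "text": "z"}]
fact = [{"dim1_id": 1, "dim2_id": 10, "dim3_id": 100, "sales": 5},
        {"dim1_id": 2, "dim2_id": 10, "dim3_id": 101, "sales": 7}]
rows = star_query(fact, [dim1, dim2, dim3],
                  [lambda r: r["name"] == "north",
                   lambda r: r["desc"] == "retail",
                   lambda r: r["text"] < "b"])
```

Because the dimensions are filtered first, the Cartesian product stays small even though the Fact table may be huge.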


There is a more complex approach, used since Oracle8i, that the database optimizer can take toward star joins. Consider the case of a central Fact table being joined to 3 smaller Dimension tables as shown below:

Select *
From Fact, Dim1, Dim2, Dim3
Where Fact.dim1_id = Dim1.id
and Fact.dim2_id = Dim2.id
and Fact.dim3_id = Dim3.id
and Dim1.name like :input_variable_name
and Dim2.Description between :input_var1 and :input_var2
and Dim3.Text < :input_variable_text;

The optimizer can transform this by adding 3 subqueries, so that the transformed query looks like this:

Select *
From Fact, Dim1, Dim2, Dim3
Where Fact.dim1_id in (Select dim1.id from dim1 where dim1.name like :input_variable_name)
and Fact.dim2_id in (Select dim2.id from dim2 where Dim2.description between :input_var1 and :input_var2)
and Fact.dim3_id in (Select dim3.id from dim3 where Dim3.Text < :input_variable_text);

Given the above, the subselects are performed first. If bitmap indexes exist on the Fact join columns, then the bitmap index entries that result from each subquery can be merged (in this case ANDed) together, and the Fact table data can be retrieved using the resulting index values. This is a very quick operation. The Fact entries retrieved can then be joined to the Dimension tables to complete the query. Using this approach, a cartesian product is not required. To implement star query transformation:

• Set the init.ora parameter star_transformation_enabled = TRUE

• Create bitmap indexes for all of the foreign-key columns on the fact table

• Implement referential integrity (R.I.) between the fact table and the dimension tables.

DEALING WITH SUBQUERIES


Subqueries are one of SQL's most commonly used features. Despite this, the performance impact of using them is not well understood. We'll look at different coding techniques and their performance impact.

Correlated Queries (Exists)

A subquery is said to be correlated when it is joined to the outer query within the subquery. An example of this is:

Select last_name, first_name
From Customer
Where customer.city = 'Chicago'
and Exists
(Select customer_id
From Sales
Where sales.total_sales_amt > 10000
and sales.customer_id = customer.customer_id);

Notice that the last line in the query above joins the outer Customer table and the inner Sales table. This join is in the subquery. Given the query above, the outer query is read, and for each Customer in Chicago the outer row is joined to the subquery. Therefore, in the case of a correlated subquery, the inner query is executed once for every row read in the outer query. This is efficient where a relatively small number of rows are processed by the query, but considerable overhead is incurred when a large number of rows are read.

Uncorrelated Queries (In)

A subquery is said to be uncorrelated (aka non-correlated) when the two tables are not joined together in the inner query. In effect, the inner (sub)query is processed first and then the result set is joined to the outer query. An example of this is:

Select last_name, first_name
From Customer
Where customer_id IN
(Select customer_id
From Sales
Where sales.total_sales_amt > 10000);

The Sales table will be processed first, and then all entries with a total_sales_amt > 10000 will be joined to the Customer table. This is efficient when a large number of rows are being processed.

Turning Subqueries into Joins

There are cases where a query may be written as either a Subquery or as a Join. When faced with this, what should you do? When it is possible, the query should be written as a Join. Care must be taken to make sure that the Join works correctly and returns the same result set as the subquery. The advantage of the Join is that it gives the optimizer more


choices when determining which query plan and join approach to use. The optimizer can choose between nested-loop, merge-scan, hash and star joins when a join is used. The options are limited when the compiler and optimizer are presented with a subquery. The following subquery can be rewritten as a join:

Select last_name, first_name
From Customer
Where customer_id IN
(Select customer_id
From Sales
Where sales.total_sales_amt > 10000);

It can be rewritten as:

Select cust.last_name, cust.first_name
From Customer cust, Sales
Where cust.customer_id = sales.customer_id
and sales.total_sales_amt > 10000;
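The equivalence (and its main pitfall) can be checked directly. The sketch below is illustrative, not from the paper's test environment: it uses SQLite through Python's sqlite3 module with invented Customer/Sales data. Note the DISTINCT in the join form; without it, a customer with several qualifying sales rows would come back once per sale, which the subquery form does not do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (customer_id INTEGER, last_name TEXT, first_name TEXT)")
cur.execute("CREATE TABLE sales (customer_id INTEGER, total_sales_amt INTEGER)")
cur.executemany("INSERT INTO customer VALUES (?,?,?)",
                [(1, "Smith", "Ann"), (2, "Jones", "Bob"), (3, "Lee", "Cho")])
# Customer 1 has two qualifying sales rows; customer 3 has none over 10000.
cur.executemany("INSERT INTO sales VALUES (?,?)",
                [(1, 15000), (1, 20000), (2, 12000), (3, 500)])

# Subquery form: each qualifying customer appears once.
subq = cur.execute("""
    SELECT last_name, first_name FROM customer
    WHERE customer_id IN (SELECT customer_id FROM sales
                          WHERE total_sales_amt > 10000)""").fetchall()

# Join form: DISTINCT is needed to collapse the duplicate for customer 1.
join = cur.execute("""
    SELECT DISTINCT cust.last_name, cust.first_name
    FROM customer cust JOIN sales ON cust.customer_id = sales.customer_id
    WHERE sales.total_sales_amt > 10000""").fetchall()

print(sorted(subq) == sorted(join))  # True
```

This is the "care must be taken" point above: the join only returns the same result set as the subquery once the duplicates are handled.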

In and Exists

Subqueries can be written using the IN and EXISTS clauses. In general, the IN clause is used in the case of an uncorrelated subquery, where the inner query is processed first and the temporary result set that is created is joined to the outer table. This is very efficient for queries that return a large number of rows. An example of the use of the IN clause is shown below:

Select last_name, first_name
From Customer
Where customer_id IN
(Select customer_id
From Sales
Where sales.total_sales_amt > 10000);

The EXISTS clause, on the other hand, is used in the case of a correlated subquery, where the outer query is processed first and, as each row from the outer query is retrieved, it is joined to the inner query. Therefore the inner query is performed once for each result row in the outer query (as opposed to the IN query shown above, where the inner query is performed only one time). An example of the use of the EXISTS clause is shown below:

Select last_name, first_name
From Customer


Where EXISTS
(Select customer_id
From Sales
Where customer.customer_id = sales.customer_id
and sales.total_sales_amt > 10000);

The optimizer is more likely to translate an IN into a join. It is important to understand the number of rows to be returned by a query and then decide which approach to use.

In vs. Exists:

• Use a join where possible

• IN executes a subquery once while Exists executes it once per outer row

• IN is like a merge-scan

• EXISTS is like a nested-loop join

• EXISTS tries to satisfy the subquery as quickly as possible and returns 'true' if the subquery returns 1 or more rows. The subquery's join column should be indexed to optimize its execution.

• When in doubt, favor IN over EXISTS (i.e. a non-correlated subquery over a correlated subquery).

• You need to understand the number of rows processed and the access paths being used.
o A large outer table and small inner table favors "in" over "exists".
o A small outer table and large inner table favors "exists" over "in".
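The difference between the two forms is execution strategy, not results: on a not-null join column they return the same rows. As a sanity check, here is a small sketch using SQLite via Python's sqlite3 module; the tables and data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (customer_id INTEGER, city TEXT)")
cur.execute("CREATE TABLE sales (customer_id INTEGER, total_sales_amt INTEGER)")
cur.executemany("INSERT INTO customer VALUES (?,?)",
                [(1, "Chicago"), (2, "Chicago"), (3, "Boston")])
cur.executemany("INSERT INTO sales VALUES (?,?)",
                [(1, 15000), (2, 500), (3, 20000)])

# Uncorrelated form: the subquery runs once.
in_rows = cur.execute("""
    SELECT customer_id FROM customer
    WHERE city = 'Chicago'
      AND customer_id IN (SELECT customer_id FROM sales
                          WHERE total_sales_amt > 10000)""").fetchall()

# Correlated form: the subquery is probed once per outer row.
exists_rows = cur.execute("""
    SELECT customer_id FROM customer c
    WHERE city = 'Chicago'
      AND EXISTS (SELECT 1 FROM sales s
                  WHERE s.customer_id = c.customer_id
                    AND s.total_sales_amt > 10000)""").fetchall()

print(in_rows == exists_rows)  # True: only customer 1 qualifies
```

The choice between the two is therefore purely a question of which access path is cheaper for the row counts involved, as the bullets above describe.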

Comparisons of In and Exists

Exists access paths use the nested-loop join algorithm, while In uses a merge-scan join algorithm, as shown below.

Exists access path (same as a Nested Loop Join):

select count(*) TEST2_with_indexes
from cust_test ct
where exists
(select 1 from cust_addr_test cat
where cat.cust_no = ct.cust_no)

Rows Execution Plan
------- ---------------------------------------------------
0 SELECT STATEMENT GOAL: CHOOSE
1 SORT (AGGREGATE)
20000 FILTER
25001 TABLE ACCESS GOAL: ANALYZED (FULL) OF 'CUST_TEST'


25000 INDEX GOAL: ANALYZED (UNIQUE SCAN) OF 'CUST_ADDR_TEST_A01' (UNIQUE)

In access path (same as a Merge Join):

select count(*) TEST1_no_indexes
from cust_test ct
where ct.cust_no in
(select cat.cust_no from cust_addr_test cat)

Rows Execution Plan
------- ---------------------------------------------------
0 SELECT STATEMENT GOAL: CHOOSE
1 SORT (AGGREGATE)
20000 MERGE JOIN
20001 SORT (JOIN)
20000 VIEW OF 'VW_NSO_1'
20000 SORT (UNIQUE)
20000 TABLE ACCESS GOAL: ANALYZED (FULL) OF 'CUST_ADDR_TEST'
20000 SORT (JOIN)
25000 TABLE ACCESS GOAL: ANALYZED (FULL) OF 'CUST_TEST'

Not In vs. Not Exists

Subqueries may be written using NOT IN and NOT EXISTS clauses. The NOT EXISTS clause is sometimes more efficient, since the database only needs to verify non-existence and may not need to materialize the entire result set; with NOT IN, the entire result set must be materialized. Another consideration when using NOT IN is that if the subquery returns NULLs, no results may be returned at all. With NOT EXISTS, a value in the outer query that has a NULL value in the inner query will be returned. By default, NOT IN and NOT EXISTS both use the nested-loop algorithm. However, NOT IN performs very well as a Hash Anti-Join under the cost-based optimizer and often outperforms NOT EXISTS when this access path is used. Outer joins can also be a very fast way to accomplish this.

Comparisons of Not In vs. Not Exists

Using the cost-based optimizer (which you should all be using), NOT IN works well as an anti-join. A NOT IN with a HASH_AJ hint is very fast. Outer joins are also fast.

• Try to code subqueries as Joins wherever possible.

• Use an index on the inner query's join column


• Another useful approach is with a HASH_AJ hint (the hint is placed inside the subquery):

select count(*) TEST11
from cust_test ct
where ct.cust_no not in
(select /*+ HASH_AJ */ cat.cust_no from cust_addr_test cat);

Another useful approach is with an outer join:

select count(*) TEST12
from cust_test ct, cust_addr_test cat
where ct.cust_no = cat.cust_no (+)
and cat.cust_no is null;
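The outer-join anti-join pattern can be sketched in portable ANSI syntax, with Oracle's (+) notation written as a LEFT JOIN. The example below is illustrative only: it uses SQLite via Python's sqlite3 module, with made-up data in tables named after the paper's cust_test/cust_addr_test:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE cust_test (cust_no INTEGER)")
cur.execute("CREATE TABLE cust_addr_test (cust_no INTEGER)")
cur.executemany("INSERT INTO cust_test VALUES (?)", [(n,) for n in range(1, 6)])
cur.executemany("INSERT INTO cust_addr_test VALUES (?)", [(1,), (2,), (3,)])

# Anti-join: keep the cust_test rows with no matching address row.
# The LEFT JOIN pads unmatched rows with NULL, and the IS NULL test keeps them.
count = cur.execute("""
    SELECT COUNT(*) FROM cust_test ct
    LEFT JOIN cust_addr_test cat ON ct.cust_no = cat.cust_no
    WHERE cat.cust_no IS NULL""").fetchone()[0]

print(count)  # 2 (cust_no 4 and 5 have no address row)
```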

Comparisons with indexes:

select count(*) TEST3_with_indexes
from cust_test ct
where ct.cust_no not in
(select cat.cust_no from cust_addr_test cat)

call count cpu elapsed disk query current rows
------------------------------------------------------------------------------------
total 4 291.24 298.83 0 5287981 115946 1

Rows Execution Plan
------- ---------------------------------------------------
0 SELECT STATEMENT GOAL: CHOOSE
1 SORT (AGGREGATE)
5000 FILTER
25001 TABLE ACCESS GOAL: ANALYZED (FULL) OF 'CUST_TEST'
25000 TABLE ACCESS GOAL: ANALYZED (FULL) OF 'CUST_ADDR_TEST'

select count(*) TEST4_with_indexes
from cust_test ct
where not exists
(select 1 from cust_addr_test cat
where cat.cust_no = ct.cust_no)

call count cpu elapsed disk query current rows
------------------------------------------------------------------------------------
total 4 0.36 0.36 0 50359 5 1


Rows Execution Plan
------- ---------------------------------------------------
0 SELECT STATEMENT GOAL: CHOOSE
1 SORT (AGGREGATE)
5000 FILTER
25001 TABLE ACCESS GOAL: ANALYZED (FULL) OF 'CUST_TEST'
25000 INDEX GOAL: ANALYZED (UNIQUE SCAN) OF 'CUST_ADDR_TEST_A01' (UNIQUE)

Test the anti-join using a hash anti-join with Not In: using a hash anti-join with NOT IN can often provide the quickest response. The syntax is shown below; ensure that you see the anti-join in the explain output.

select count(*) TEST5_with_indexes
from cust_test ct
where ct.cust_no not in
(select /*+ HASH_AJ */ cat.cust_no from cust_addr_test cat)

When using NOT IN, ensure the column is NOT NULL. If it is nullable and contains a null value, 0 rows may be returned, since Oracle cannot evaluate the NOT IN predicate against a NULL. Consider two tables, T1 and T2:

T1.COL01   T2.COL01
<null>     <null>
0          0
2          2
4          3
6          7
8          8
10         9

select count(*) from t1 where col01 not in (select col01 from t2);
COUNT(*) result = 0

select count(*) from t1 where not exists (select 1 from t2 where t1.col01=t2.col01);
COUNT(*) result = 4 (the unmatched values 4, 6 and 10, plus the NULL row, which never equals any T2 row)

This is solved by filtering the NULLs out of the subquery:

select count(*) from t1 where col01 not in (select col01 from t2 where col01 is not null);
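The NULL trap follows from standard SQL three-valued logic, so any engine reproduces it. Below is a runnable sketch of the T1/T2 example, using SQLite via Python's sqlite3 module rather than Oracle:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t1 (col01 INTEGER)")
cur.execute("CREATE TABLE t2 (col01 INTEGER)")
cur.executemany("INSERT INTO t1 VALUES (?)",
                [(None,), (0,), (2,), (4,), (6,), (8,), (10,)])
cur.executemany("INSERT INTO t2 VALUES (?)",
                [(None,), (0,), (2,), (3,), (7,), (8,), (9,)])

# NOT IN against a list containing NULL: no comparison can ever be proven
# true for every element, so no row qualifies.
not_in = cur.execute(
    "SELECT COUNT(*) FROM t1 WHERE col01 NOT IN (SELECT col01 FROM t2)"
).fetchone()[0]

# NOT EXISTS compares row by row; NULL simply never matches, so the NULL
# row in t1 is counted along with the unmatched values 4, 6 and 10.
not_exists = cur.execute(
    """SELECT COUNT(*) FROM t1 WHERE NOT EXISTS
       (SELECT 1 FROM t2 WHERE t1.col01 = t2.col01)""").fetchone()[0]

# The fix: filter the NULLs out of the subquery.
fixed = cur.execute(
    """SELECT COUNT(*) FROM t1 WHERE col01 NOT IN
       (SELECT col01 FROM t2 WHERE col01 IS NOT NULL)""").fetchone()[0]

print(not_in, not_exists, fixed)  # 0 4 3
```

Note that the fixed NOT IN still differs from NOT EXISTS by one row: the NULL row in T1 qualifies for NOT EXISTS but not for NOT IN.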


Test the anti-join using an outer join:

select count(*) TEST6_with_indexes
from cust_test ct, cust_addr_test cat
where ct.cust_no = cat.cust_no (+)
and cat.cust_no is null

call count cpu elapsed disk query current rows
-----------------------------------------------------------------------------------
total 4 0.29 0.29 0 48360 5 1

Rows Execution Plan
------- ---------------------------------------------------
0 SELECT STATEMENT GOAL: CHOOSE
1 SORT (AGGREGATE)
5000 FILTER
25000 NESTED LOOPS (OUTER)
25001 TABLE ACCESS GOAL: ANALYZED (FULL) OF 'CUST_TEST'
20000 INDEX GOAL: ANALYZED (UNIQUE SCAN) OF 'CUST_ADDR_TEST_A01' (UNIQUE)

The anti-join above using an outer join is very efficient.

USING MINUS

Some databases use the MINUS operator to return rows from one table that do not exist in another. This can also achieve good performance. An example of an SQL statement using MINUS is shown below:

Select customer_id, customer_name
from Customer
MINUS
Select customer_id, customer_name
from Invoice;

The above statement will show all Customers that have not been invoiced. Notice that when using MINUS, the number of columns and the column types must be the same for the two queries; in this way it is similar to the way a Union statement operates. It is also limiting, since the same columns and values used in the statements must exist in both tables, so there are many cases where the MINUS operator will not be useful.
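MINUS is Oracle's spelling; the SQL standard and databases such as SQLite and PostgreSQL call the same set operator EXCEPT. A small sketch with invented Customer/Invoice data, using SQLite via Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (customer_id INTEGER, customer_name TEXT)")
cur.execute("CREATE TABLE invoice (customer_id INTEGER, customer_name TEXT)")
cur.executemany("INSERT INTO customer VALUES (?,?)",
                [(1, "Acme"), (2, "Globex"), (3, "Initech")])
cur.executemany("INSERT INTO invoice VALUES (?,?)", [(1, "Acme"), (3, "Initech")])

# Oracle spells this operator MINUS; SQLite uses the standard EXCEPT.
# Both queries must project the same number and types of columns.
rows = cur.execute("""
    SELECT customer_id, customer_name FROM customer
    EXCEPT
    SELECT customer_id, customer_name FROM invoice""").fetchall()

print(rows)  # [(2, 'Globex')] - the one customer never invoiced
```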


Conclusion

• Become familiar with Oracle's SQL functions.
• Try out functions as well as your own solutions to problems. This can help you improve your SQL skills and can build a repertoire that will help even your most experienced developers.

• SQL is becoming more complicated.

• We can now do things in native SQL that used to be possible only in advanced query packages.
o Get familiar with Oracle's supplied PL/SQL packages.

• Oracle is becoming more ANSI SQL-92 and SQL:1999 compliant.

• Also with 10g: SQL Model clause and regular expressions.

• SQL can be fun! (or at least interesting)