troubleshooting guide mmr

Troubleshooting Guide: Replication Propagation [ID 1035874.6]

Modified 10-MAY-2010 Type TROUBLESHOOTING Status PUBLISHED

In this Document Purpose Scope & Application Last Review Date Instructions for the Reader Troubleshooting Details 1. Overview of how data moves between Replicated Site 2. Checking the replication propagation environment is configured correctly 2.1 Replication group status 2.2 Replicated object triggers, packages and status 2.3 Status of existing admin requests 3. Checking the automatic propagation mechanism is working 3.1 Checking for errors 3.2 Check job_queue_processes is set 3.3 Has a push / purge job been scheduled 3.4 Check the status of the propagation jobs 3.5 Check the propagator and their private database links 3.6 Check if the push job is currently running 3.7 Terminating a deferred queue push job that is currently running 4. Diagnosing the progress and status of replicated transactions 4.1 Verify transactions can be manually pushed 4.2 Verifying transactions are being propagated 4.3 Deferred transaction propagated but not applied to receiving site 4.3.1 Checking for deferred errors in the queue 4.3.2 Stop on Error 4.4 Verify the purge transaction operation 4.4.1 Verify transactions can be manually purged 4.4.2 Verifying transactions are being purged 4.5 Large deferred transactions and slow propagation 5. Diagnosing hanging propagation 5.1 Check for locks 5.2 Check the wait events 5.3 Check if a large error is being queued 5.4 Advanced Analysis 6. How to clear down large deferred queues 7. Example of how deferred transactions are propagated 7.1 Data is inserted into the table: 7.2 Interrogating the deferred queue: 7.3 Identify the next transaction to be pushed: 7.4 Manually push transactions to REP901 and interrogate the deferred queue 7.5 Identify unpurged transactions

Applies to:

Oracle Server - Enterprise Edition - Version: 7.2.2.0 to 10.2.0.1 - Release: 7.2.2 to 10.2Oracle Server - Enterprise Edition - Version: 7.2.2.0 to 10.2.0.1 [Release: 7.2.2 to 10.2]Information in this document applies to any platform.

Checked for relevance on 27-Nov-2007Checked for relevance on 16-Mar-2010.

Purpose

The purpose of this article is to provide basic steps for troubleshooting advanced replication propagation and the underlying mechanism it uses; the deferred queue. Additional notes are referenced through out this article that address specific issues or provide additional information on a particular component used by Advanced Replication.

Scope & Application

To be used by Oracle support analysts and replication DBA's to understand and employ basic troubleshooting techniques for advanced replication propagation.

Last Review Date

March 16, 2010

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

1. Overview of how data moves between Replicated Site

Oracle uses internalised triggers to capture and store any data changes made to tables that have been defined as replicated. It stores the data changes as remote procedure calls (RPC's) in a deferred transaction queue table (defcall) for propagation at a predefined time and interval. Propagation to remote master sites is automated using the database job queue mechanism. The job either applies the transaction to the remote master or if a transaction fails it is placed in the remote deferred transaction error queue for analysis at a laterdate.

2. Checking the replication propagation environment is configured correctly

A large majority of the problems encountered by replication users are due to incorrect setup and configuration of the replicated environment. Refer to section 1 of note:122039.1 and note:117434.1 before checking the following :

2.1 Replication group status

For propagation to be successful the status of the replication groups must be normal:

select gname, status from dba_repgroup;

GNAME STATUS------------------------------ ---------GROUP1 NORMAL

2.2 Replicated object triggers, packages and status

For the propagation and queuing of data changes to be successful to remote sites, replicated tables (objects) must display as valid at all replication sites. They must also have the associated internalised triggers and packages defined. Run the following query to check the replicated tables (objects), in releases prior to Oracle 8.1.x:

select oname, status, generation_statusfrom dba_repobjectwhere status != 'VALID';

Run the following query to check the replicated tables (objects), in Oracle 8.1.x and above:

select oname, status, generation_status,replication_trigger_exists, internal_package_existsfrom dba_repobjectwhere status != 'VALID'or replication_trigger_exists != 'Y'or internal_package_exists != 'Y';

If the above query returns replication objects that do not have the associated internalised triggers or packages, it may be necessary to re-generate replication support. If after re-generation the replication objects still show as invalid, run the following statement to ensure that all dependant SYS and SYSTEM objects are valid:

column owner format a25column object_name format a30column object_type format a15

select owner, object_name, object_typefrom dba_objectswhere status != 'VALID'and owner in ('SYS', 'SYSTEM');

If problems persist, contact Oracle Support Services.

2.3 Status of existing admin requests

Data cannot move between replicated sites if there are outstanding admin requests for the associated master group and there should not be any such requests if the group displays with status normal. Use the following query to check for admin requests:

column gname format a18

select gname, request, status, errnumfrom dba_repcatlogorder by id, gname;

For additional analysis of the replication administrative requests queueplease refer to note:180014.1 and note:122039.1.

2.4 Check database links

The correct configuration of database link is essential for replication propagation to operate correctly. The following articles cover the links required by replication:

note:50593.1 Initial steps required to create Multi Master and Snapshot Replication v8.0note:117434.1 Initial steps required to a create Multi Master Replication environment v8.1 / v9.x/ v10.x

Check the existence of the public and private links with the following query at each site involved to propagation:

column owner format a15column db_link format a45column username format a15select owner, db_link, username from dba_db_links;

OWNER DB_LINK USERNAME--------------- ----------------------------------------- ---------------PUBLIC DB2REPADMIN DB2 REPADMIN

Test each of the links and ensure that the global name matches the link name with the following query:

connect as database link ownerselect * from global_name@;

It is important that the correct links exist for the user who owns the job that performs the replication push job.

3. Checking the automatic propagation mechanism is working

Oracle Replication uses the job queue to automate the propagation and purging of deferred transactions. If the jobs are not configured properly, they will not run as expected and their associated tasks will not be completed.

Administrators usually become aware that there may be a problem with the job queue mechanism when they discover that the replication deferred transaction queue is building up. To check if the queue is growing run :

select count(*) from deftrandest ;

If the queue appears to be growing to an unusual size, use the following 3.x sections to ensure that the automated jobs are not the cause. Do not attempt to quiesce the replication system with SUSPEND_MASTER_ACTIVITY, as that will just try to push the queue first.

3.1 Checking for errors

If a job fails while attempting to push or purge replication data changes, errors will be written to the alertSID.log. Additional and more detailed information will go into the following files referred to by the alert.log:

- Pre V9 : SID_snpx_nnnnn.TRC- V9 and above : SID_cjq0_nnnnn.TRC and SID_jnnn_nnnn.TRC

The format of these files may vary between operating systems, their location can be determined by running the following from SQLPLUS:

SQL> show parameter dump_dest

3.2 Check job_queue_processes is set

The init.ora parameters that control the Oracle job queues must be set, execute the following statement to check that they are correctly set at each replication site:

SQL> show parameter job

NAME TYPE VALUE-------------------------------- ------- --------job_queue_interval integer 30 job_queue_processes integer 4 <= must be > 0

job_queue_processes : A job queue process executes a single job at a time and this parameter determines the maximum concurrent number of these. In most replication environments configure this to be:

= number of replicated destinations (or connection qualifier destinations)+ 1 (for administrative requests)+ enough to service other concurrent non replication jobs if they exist

job_queue_interval : This parameter only applies to versions prior to Oracle9i, it specifies in seconds how frequently each SNPn background process of the instance wakes up and checks if there is a job to run. Its default value is 60 and if jobs need to execute more frequently reduce this value.

3.3 Has a push / purge job been scheduled

For transactions to be propagated to remote master sites automatically, there needs to be a job in the job queue to push them. There should be one push job per remote master site, check the job exists with:

column what format a64

select job, what from dba_jobs where upper(what) like '%DBMS_DEFER_SYS%';

JOB WHAT-------- ----------------------------------------------------------------43 declare rc binary_integer; begin rc := sys.dbms_defer_sys.purge(delay_seconds=>0); end;

44 declare rc binary_integer; begin rc := sys.dbms_defer_sys.push(d estination=>'DB2.WORLD', stop_on_error=>FALSE, delay_seconds=>0, parallelism=>2); end;

45 declare rc binary_integer; begin rc := sys.dbms_defer_sys.push(d estination=>'DB3.WORLD', stop_on_error=>FALSE, delay_seconds=>0, parallelism=>2); end;

If the push job exists it is important to know when it last ran and when it is next scheduled to run, check the jobs schedule with:

column dblink format a24column last_date format a9column next_date format a9column interval format a22

select job push_job, dblink, SUBSTR(TO_CHAR(LAST_DATE,'MM/DD/YY HH24:MI:SS'),1,20) LAST_DATE,SUBSTR(TO_CHAR(NEXT_DATE,'MM/DD/YY HH24:MI:SS'),1,20) NEXT_DATE,intervalfrom defschedulewhere job in (select job from dba_jobswhere upper(what) like '%DBMS_DEFER_SYS.PUSH%');

PUSH_JOB DBLINK LAST_DATE NEXT_DATE INTERVAL---------- --------------------- --------- --------- ---------------------44 DB2.WORLD 05/15/02 05/15/02 sysdate + 10/(60*24)14:27:22 14:37:22

45 DB3.WORLD 05/15/02 05/15/02 sysdate + 1/144014:30:33 14:31:33

There may be two reasons that the next_date has been passed. The first is the job has failed, to check this see section 3.4. The second is that the job is still running, to check this see section 3.6. It is important to note that the next_date is set when the job completes, so if the interval is 10 minutes and the job took 9 minutes to run the next_date will be 19 minutes from the time the job started. Please refer to DBMS_JOB.CHANGE in note:61730.1 if the next date of the job needs to be altered.

Oracle uses a lazy algorithm to purge deferred transactions from the local queue. It is important that the purge job runs regularly to clear down this queue because the same underlying table is used for transactions waiting to go to remote sites as for those which have been pushed but not purged. If there is a large number of transactions to be purged it can

affect the performance of propagation, check the scheduled purge job exists with:

column last_date format a9column next_date format a9column interval format a40

select job purge_job,SUBSTR(TO_CHAR(LAST_DATE,'MM/DD/YY HH24:MI:SS'),1,20) LAST_DATE,SUBSTR(TO_CHAR(NEXT_DATE,'MM/DD/YY HH24:MI:SS'),1,20) NEXT_DATE,intervalfrom defschedulewhere job in (select job from dba_jobswhere upper(what) like '%DBMS_DEFER_SYS.PURGE%');

PURGE_JOB LAST_DATE NEXT_DATE INTERVAL---------- --------- --------- ----------------------------------------43 05/15/02 05/15/02 /*1:Hr*/ sysdate + 1/2414:03:16 15:03:16

3.4 Check the status of the propagation jobs

If a job returns an error during execution, Oracle will try it again. The first attempt is made after one minute, the second attempt after two minutes, the third after four minutes, etc. This happens up to the original interval, so if the interval is two minutes, its 1 minute, 2 minutes, 2 minutes, ... If the job fails 16 times it is marked as broken, use the following to check how many failures there have been and if the job has been marked as broken:

column broken format a6select job, broken, failures from dba_jobswhere upper(what) like '%DBMS_DEFER_SYS%';

JOB BROKEN FAILURES---------- ------ ----------26 N 025 N 3

If the job is showing failures follow the advice in section 3.1, once the underlying problem has been resolved unbreak the job with DBMS_JOB.BROKEN (see note:1018453.102). Please note that the job may show failures = 0 and broken = Y is the job was manually broken with DBMS_JOB.BROKEN.

3.5 Check the propagator and their private database links

The owner of the job that performs the push must be the replication propagator and that user must have a private database link to the site where the job is pushing data to. See section 2.4 for details of the required database links and how to check they are working correctly.

Use the following SQL to check the propagator, links and push jobs all match up:

select * from defpropagator;

USERNAME USERID STATUS CREATED------------------------------ ---------- ------- ---------REPADMIN 31 VALID 18-APR-02

select db_link from dba_db_linkswhere owner = (select username from defpropagator);

DB_LINK----------------------------------DB2.WORLDDB3.WORLD

column pushed_site_by_propagator format a25select job, dblink pushed_site_by_propagatorfrom defschedulewhere job in (select job from dba_jobswhere log_user in (select username from defpropagator)and upper(what) like '%DBMS_DEFER_SYS.PUSH%');

JOB PUSHED_SITE_BY_PROPAGATOR---------- -------------------------43 DB2.WORLD44 DB3.WORLD

3.6 Check if the push job is currently running

Oracle replication only allows a single push operation to run to a master site at a time, although multiple push operations can occur concurrently but they have to be to different remote master sites. If connection qualifiers are being used then there can be multiple push jobs to the same master site, but with different database links (see note:1024982.6).

It may also be difficult to know if data is moving between replicated sites by using the deftrandest table because new transactions will be added all the time and for transactions with many calls they may take some time to process. The following query identifies push jobs that are currently running:

column dblink format a30select /*+ ORDERED */ j.job, j.sid, d.dblink,SUBSTR(TO_CHAR(J.THIS_DATE,'MM/DD/RRRR HH24:MI:SS'),1,20) START_DATEfrom defschedule d, dba_jobs_running jwhere j.job in (select job from dba_jobswhere upper(what) like '%DBMS_DEFER_SYS.PUSH%')and j.job = d.job;

JOB SID DBLINK START_DATE---------- ---------- ------------------------------ -------------------44 9 DB2.WORLD 05/16/2002 12:14:47

When a job runs it create a Job Queue Lock to protect it from being run more than once (i.e.

run manually from a users session), there have been conditions observed where the lock is still held by the Job Queue Process due to network failures. Use the following query to identify Job Queue (JQ) lock:

select s.sid, s.serial#, s.usernamefrom v$lock l, v$session swhere l.type = 'JQ'and l.sid = s.sid;

SID SERIAL# USERNAME---------- ---------- ------------------------------9 164 REPADMIN

It is important to note that the above queries will not identify manual push operations that have been initiated from a users session, see section 4.1 for more details of identifying these. After evaluating the above and section 4 it may be necessary to terminate the push process, please follow the steps defined in section 3.7 to do this.

3.7 Terminating a deferred queue push job that is currently running

There will be situations where the running push job needs to be terminated and prevented from running again, until the current problem that is being encountered is resolved. Perform the following steps:

- Break the job with :

execute dbms_job.broken(, true);

- Kill the Job Queue Process from the Operating System. To do this use the sid from section 3.6 to identify the process in v$session, v$process and v$bgprocess. The process will generally be named SNPx or Jxxx.

- After killing the process, wait approximately 1 minute, to ensure the job is removed from dba_jobs_running.

Killing the background process from the operating system will release the Job Queue lock and the User Lock used to protect the push operation. Once the underlying problem has been resolved by working through sections 4, 5 and 6, restart the jobs with:

execute dbms_job.broken(, false);

4. Diagnosing the progress and status of replicated transactions

The following section provides database administrators with a step by step guide to diagnosing the progress of deferred transactions. Prior to following these steps ensure the job queue is correctly configured, see section 3 above.

4.1 Verify transactions can be manually pushed

If it has been established that the job queue is configured correctly, but still there are problems with propagation, the first step should be to analyse the problem by trying a manual push of the deferred transaction queue.

Remember to make sure that the job that would normally perform the push is not currently running and has been prevented from running when the manual push is being tested (see sections 3.6 and 3.7). If this is not the case the manual push will normally return immediately without pushing any rows.

Log on as the propagation user and run the following:

declare x integer;beginx := sys.dbms_defer_sys.push('',...);end;/

If you get errors during the manual push, generally rows will not be pushed, use My Oracle Support to search for known problems that relate to these errors.

If the manual push completes without errors but the entries in deftrandest remain unchanged, the following could be the cause:

- Another users session is performing the same push operation, when a manual push starts it allocates a User Lock to ensure there is only one push at a time, use note:1059290.6 to identify the blocking session.- BUG:734902 (fixed in 8.0.6), which may manifest itself as a hang.

If neither of the above are the cause, check through the following 4.x sections and if the cause cannot be identified, raise a call with Oracle Support Services.

If the manual push hangs check through the following 4.x sections and if they do not identify the cause use section 5 to diagnose the problem.

4.2 Verifying transactions are being propagated

In Oracle release 7.x, the replication routines deleted all transactions from the deferred transaction queue once they had been pushed. Where as from Oracle 8.x onwards transactions (and calls) are no longer immediately deleted after application at the remote site, they are instead purged from the local queue on a regular basis (interval).

There are two different views of the deferred transaction queue:

- DEFTRAN : contains all unpurged transactions

- DEFTRANDEST : contains all transaction that have not yet been pushed to a remote master site. Transactions appear once per master site they have to be pushed to.

To view all unpurged transactions:

select * from deftran order by delivery_order;

DEFERRED_TRAN_ID DELIVERY_ORDER D START_TIM------------------------------ -------------- - ---------9.13.904 280185 R 16-MAY-021.5.559 280654 R 16-MAY-023.11.549 281112 R 16-MAY-02

** Note the deftran view also includes transactions from the deferred error queue, see sections 4.3.1 and 4.4 for additional information.

To view all transactions yet to be propagated:

column deferred_tran_id format a16column dblink format a30select * from deftrandest order by delivery_order;

DEFERRED_TRAN_ID DELIVERY_ORDER DBLINK---------------- -------------- ------------------------------1.5.559 280654 DB2.WORLD3.11.549 281112 DB2.WORLD

The transaction with the lowest delivery order will be the next transaction to be pushed to the remote replication site. Prior to Oracle9 the DEFTRANDEST view was the only way to identify how propagation was progressing. From Oracle8 onwards use the following to investigate the current push:

- Transactions that are currently being pushed appear in the target sites DEF$_ORIGIN table.- A transaction has been pushed if system.def$_destination.last_delivered is greater than system.def$_aqcall.cscn.

Oracle9 includes a mechanism for identifying how far through the current transaction the current push is. In Oracle9 and above, use the following query to identify if a transaction is being pushed and how many rows have been pushed:

column xid format a12column dblink format a30

select /*+ ORDERED */ p.xid, p.dblink,(MAX(c.callno) + 1) "Calls in Tran",(p.sequence/(MAX(c.callno) + 1)) * 100 "% Processed Calls"from v$replprop p, defcall cwhere p.name like '%Slave%'and c.deferred_tran_id = p.xidgroup by p.xid, p.dblink, p.sequence;

XID DBLINK Calls in Tran % Processed Calls------------ ------------------------------ ------------- -----------------10.20.556 DB2.WORLD 9999 32.446489

6.6.581 DB2.WORLD 7000 15.153031

The current implementation of v$replprop only applies to transactions that are pushed using parallel propagation, however Oracle recommends all customers use parallel propagation. On systems that are CPU bound pushing the deferred queue, it may be better to run the following query:


select p.xid, p.dblink,p.sequence "Processed Calls"from v$replprop pwhere p.name like '%Slave%';

In Oracle9 and above running the following query should assist database administrators is monitoring the overall activity in the deferred transaction queue:

select * from v$replqueue;

TXNS_ENQUEUED CALLS_ENQUEUED TXNS_PURGED LAST_ENQU LAST_PURG------------- -------------- ----------- --------- ---------6543 21299 300 17-MAY-02 17-MAY-02

See Section 7 for an example of a transaction being propagated to a remote site. If the current transaction being pushed appears to be hung or running very slowly, check the following 4.x sections of this document, if they do not help section 5 contains a more detailed analysis method.

For further information on replication parallel propagation, refer to:note:76447.1 Replication Parallel Propagation

4.3 Deferred transaction propagated but not applied to receiving site

If deferred transactions are leaving the local master site (being removed from the DEFTRANDEST view) and the data changes do not appear to be applied at the remote master site. The most likely cause is they are failing at the remote site; the most common reasons for this are:

- Database space management or rollback problems.- Data conflicts, where another user has updated a row that is being pushed to this replication site.

When the replication propagation mechanism encounters an error in applying a transaction to a remote master site, it rolls back the transaction and writes the whole transaction (all rows/calls) to the deferred error queue at the site to which the transactions was being pushed. The whole transaction is copiedeven if only one row causes the problem.

4.3.1 Checking for deferred errors in the queue

To obtain information of a transaction in error query deferror with:

column deferred_tran_id heading 'Deferred|Transaction|ID' format a11column callno heading 'Call|Number' format 99999999column origin_tran_db heading 'Origin |Database' format a16column destination heading 'Destination|Database' format a16column time_of_error heading 'Date Of|Error' format a9column error_number heading 'Oracle|Error|Number' format 999999

select deferred_tran_id, callno, origin_tran_db, destination,TO_CHAR(start_time, 'MM/DD/YY HH24:MI:SS') TIME_OF_ERROR,error_numberfrom deferrororder by start_time;

Deferred OracleTransaction Call Origin Destination Date Of ErrorID Number Database Database Error Number----------- --------- ---------------- ---------------- --------- -------5.11.152 0 DB3.WORLD DB4.WORLD 05/16/02 140320:03:078.10.153 2 DB3.WORLD DB4.WORLD 05/16/02 140320:08:01

In general to resolve the errors that the above query returns, search MetaLink for ORA-. If the error is an ORA-1403 as in the above example data between the replication sites tables has diverged, this data will have to be manually resynchronised and Oracle would recommend customers implement a conflict resolution mechanism.

Use note:2065172.102 to discover details of the call that is in error.

4.3.2 Stop on Error

The majority of customers configure their push jobs with stop_on_error = false, the consequence of this is:

- If one transaction fails and is written to the error queue, then the following transactions succeed, there is a possibility that data at the remote master site could be logically inconsistent, because the earliertransaction has been overtaken.

If a customer has configured stop_on_error = true, then the first transaction to fail will be written to the error queue and subsequent transactions will not be pushed. This makes it much easier to resolve divergent data.

Use the following query to identify what stop_on_error has been set to:

column what format a64select job, what from dba_jobs where upper(what) like '%STOP_ON_ERROR%';

JOB WHAT---------- ----------------------------------------------------------------44 declare rc binary_integer; begin rc := sys.dbms_defer_sys.push(destination=>'DB2.WORLD', stop_on_error=>FALSE, delay_seconds=>0,parallelism=>2); end;

4.4 Verify the purge transaction operation

As stated in section 4.2 deferred transactions are not immediately purged from the local replication site after they are propagated. It is very important that the purge happens at a regular interval, such that the queue to purge does not grow too large. If the queue to purge gets too large it will start to impact the performance of propagation.

4.4.1 Verify transactions can be manually purged

Use section 3.3 to check that the purge job is correctly scheduled and section 3.4 to identify if the job is failing. If the purge job is failing or the number of transactions to be purged does not appear to be decreasing follow the steps in this section.

The purge job should be owned (log_user) by a replication administrator, so log on as a replication administrator user (normally repadmin) and run the following:

declare x integer;beginx := sys.dbms_defer_sys.purge(delay_seconds=>0);end;/

If the manual purge raises errors, check MetaLink for likely causes, address the errors and run the purge again. If the manual purge returns without error follow the steps described in section 4.4.2 to check that transactions are being correctly purged.

4.4.2 Verifying transactions are being purged

The easiest way to check if the queue of deferred transactions to be purged is decreasing is to use the following query, which eliminates deferror entries in the local deferred queue:

select min(delivery_order) "Oldest Unpurged"from deftranwhere destination_list != 'D';

Oldest Unpurged---------------781633

In Oracle9 and above it may be better to run the following query particularly if the deferred queue is very large, it shows the number of transactions that have been purged since the database was last started:

select txns_purged from v$replqueue;

TXNS_PURGED-----------356

There are two types of purge that Oracle can perform:

- Lazy purge (the default).- Precise purge.

The lazy purge will purge transactions with a system.def$_aqcall.cscn lower than the local low water mark for propagated transactions (this is calculated based on the minimum last_delivered in the local system.def$_destination). This low water mark can be lower than some cscn numbers of some previously pushed transactions, so they will not be purged immediately. This can happen if not all the push jobs have run and still have active transactions. In this case, the transactions will remain in def$_aqcall until the low water mark rises above the cscn for the transaction.

A precise purge will purge transactions with a cscn lower than the low water mark for propagated transactions to it's specific destination. This means that the purge will query the last_delivered for each dblink destination. All transactions that have been pushed from the local site to that destination will usually fall below the low water mark and be purged. To perform the precise purge execute:

declare x integer;beginx := sys.dbms_defer_sys.purge(purge_method=>0);end;/

4.5 Large deferred transactions and slow propagation

The two main reasons for slow propagation of the deferred queue are:

- Limited systems resources, usually caused by CPU or network bottlenecks.- Large deferred transaction or error queues.

For the replication propagation mechanism to achieve maximum throughput the deferred queue needs to be kept as small as possible and transactions need to be propagated at regular intervals. There are two types of large queue:

- Queues with one or two transactions with tens or hundreds of thousands of calls, usually cause by bulk update operations (DML) or SQL loads.

- Queues with tens or hundreds of thousands of transactions, usually caused by a failure in the propagation job due to network outage or space management issues at the remote site.

In the majority of cases the queue will have already become large before the database administrator becomes aware of the problem and running queries against the DEF.... views will prove difficult because with enormous queues the views are slow.

The following queries should help the database administrator to make a decision about what to do with the large deferred queue, please note on some systems it may not be practical to run these queries.

Check how many rows are in the current or next transaction to be pushed:

select d.deferred_tran_id, count(*) callsfrom defcall c, deftrandest dwhere d.deferred_tran_id = c.deferred_tran_idand d.delivery_order = (select min(delivery_order) from deftrandest)group by d.deferred_tran_id;

DEFERRED_TRAN_ID CALLS------------------------------ ----------1.0.704 4999

Check how many rows exist in transactions to come (ordered by rows):

select deferred_tran_id, count(callno) callsfrom defcallgroup by deferred_tran_idorder by calls;

DEFERRED_TRAN_ID CALLS------------------------------ ----------1.0.704 49991.21.691 34301.7.725 11210.0.669 102

Oracle 9.x and above, check how many rows have been propagated from the current transaction so far:


select p.xid, p.dblink,p.sequence "Processed Calls"from v$replprop pwhere p.name like '%Slave%';

XID DBLINK Processed Calls------------ ------------------------------ ---------------

9.13.1016 DB2.WORLD 49994.10.628 DB2.WORLD 636

Oracle 9.x and above, check the overall number of transactions and rows that have been queued since the instance was last started:

select * from v$replqueue;

TXNS_ENQUEUED CALLS_ENQUEUED TXNS_PURGED LAST_ENQU LAST_PURG------------- -------------- ----------- --------- ---------72438 2179964 2400 22-MAY-02

After making the above assessment it may be the case that the only real option is to clear down the deferred queue and manually resynchronise the data. See Section 6 for details of how to perform this operation.

To avoid large deferred queues building up in the future:

- Ensure conflict resolution handlers are defined for tables that receive large updates, by handling the conflicts we avoid the overhead of rolling back the transaction and re-pushing it into the remote error queue. This operation can take considerably longer than the original transaction.- Monitor the push and purge jobs with Enterprise Manager events, as soon as a failure occurs the database administrator will then be alerted and can address the problem before the queues build up.- There is no easy way to avoid large transactions that are generated by mistake or adhoc user access, but for planned batch operations consider using procedural replication.

5. Diagnosing hanging propagation

If after completing the analysis described in section 4, rows do not appear to be moving between replication sites, perform the diagnostic steps described in the section.

5.1 Check for locks

If a row being pushed to a remote master is locked by a users session for an extended period of time, propagation will appear to be hung. Use the following select statement to identify sessions blocking propagation at the site where propagation is pushing data too:

column holder format 99999column holder_name format a16column rowno format 9999999999column object_name format a25column owner format a16

select w.holding_session holder,h.username holder_name,s.row_wait_row# rowno,o.object_name,

o.ownerfrom dba_waiters w, dba_objects o, v$session s, v$session hwhere w.waiting_session = s.sidand s.username = 'REPADMIN'and w.holding_session = h.sidand s.row_wait_obj# = o.object_id;

HOLDER HOLDER_NAME ROWNO OBJECT_NAME OWNER------ ---------------- ----------- ------------------------- --------------13 SCOTT 32 T2 REPDBA

For this query to be executed successfully; replace 'REPADMIN' with the user that the pushing site's replication propagator pushes to (receiver user) and make sure CATBLOCK.SQL has been run.

If the ROWNO / OBJECT_NAME column does not change the blocking users session will have to be killed to allow propagation to continue. Under some circumstances more than one row may be returned because each query slave used by parallel propagation opens a separate session at the remote site.

note:76447.1 contains additional information about Parallel Propagation

note:62354.1 and UTLLOCKT.SQL contain additional information aboutidentifying locks.

5.2 Check the wait events

If propagation is not waiting for a lock then the next step in diagnosing the hang is to find out what the pushing and receiving sessions are waiting for. The easiest way to do this is to use the V$SESSION_WAIT view; customers should note that Job Queue processes (DBMS_JOB) are not always recorded inthis view.

Run the following query at the pushing site if propagation is being executed from DBMS_JOB:

column event format a44

select sw.sid, sw.event, sw.p1, sw.p2from v$session_wait swwhere sw.sid in ( select qs.sidfrom dba_jobs_running jr, dba_jobs j,v$px_session qswhere jr.job = j.joband jr.sid = qs.qcsidand upper(j.what) like '%DBMS_DEFER_SYS.PUSH%')and sw.wait_time = 0;

Run the following query at the pushing site if propagation is being executed by a users

session, replace REPADMIN with the propagation user:


select sw.sid, sw.event, sw.p1, sw.p2from v$session_wait sw, v$session swhere sw.sid = s.sidand sw.sid in (select qs.sid from v$px_session qs)and sw.wait_time = 0and s.username = 'REPADMIN';

Run the following query at the site where data is being pushed to and replace 'REPADMIN' with the user that the propagation user pushes to (receiver user), note each parallel propagation slave will appear as a separate session:


select sw.sid, sw.event, sw.p1, sw.p2from v$session_wait sw, v$session swhere sw.sid = s.sidand s.username = 'REPADMIN'and sw.wait_time = 0;

If the sessions appear to be stuck on the same event for a long period of time, consult note:61998.1 or raise a call with Oracle Support Services.

5.3 Check if a large error is being queued

Writing failed replicated transactions to the remote deferred transaction error queue is one of the longest operations replication can perform, failed transactions often take four or more times as long to write to the error queue as they did to create.

Section 4.3 can be used to identify when small transactions are being written to the error queue, however it will not assist in identifying if there is a single transaction with thousands of rows being written to the error queue. The following query can be used to identify if a large error is being queued at the remote master site:

select count(*)from v$sqlarea a, v$session swhere s.sql_address = a.addressand s.sql_hash_value = a.hash_valueand a.sql_text like '%DEF$_AQERROR%'and s.username = 'REPADMIN';

Replace 'REPADMIN' with the user that the pushing sites replication propagator pushes to (receiver user).

5.4 Advanced Analysis

If after following all other sections in this article the push still seems to be hung or stuck, collect the information described in this section and supply it to Oracle Support Services. It may be necessary to raise a call with Oracle Support Services to assist in collecting the required information.

The first step is to collect an errorstack from the pushing session, this must include all query slaves used by the push. Use the query defined in section 5.2 to identify these.

The second step is to collect an errorstack from all sessions that are applying changes at the remote site. Use the query defined in section 5.2 to identify these.

Collect the errorstack by running the following from SQLPLUS for each session:

connect .... as sysdbaoradebug setospid oradebug unlimit oradebug dump errorstack 3

As hung or stuck push operations may in fact be spinning or looping operations, it may also be necessary to collect in depth sql_trace of all sessions listed above. Collect the required trace by running the following from SQLPLUS for each session:

connect .... as sysdba oradebug setospid oradebug unlimit oradebug dump 10046 12

If after analysis the sessions need to be killed refer to section 3.7.

6. How to clear down large deferred queues

Database administrators frequently have to make the decision to terminate propagation and manually resynchronise their replicated environments when the deferred queue has become large and or slow. See section 4.5 for additional information. Once the decision has been made to clear down the queue and resynchronise the data, follow : note:190885.1 How to Clear Down the Deferred Queue andDBMS_DEFER_SYS.DELETE_TRAN

Oracle normally recommends customers use dbms_defer_sys.delete_tran with allarguments set to null, to remove transactions from the deferred queue. Howeverwhen the queue is very large this may not be the most efficient mechanism forclearing the queue.

The basic steps to clear down the queue are listed below, if the queue islarge, slow or hanging refer to note:190885.1:

- Terminate the current push operation and prevent it from re-running,if the push is running from dbms_job see section 3.7.- Stop remote sites from replicating to the local site.- DO NOT attempt to suspend or quiesce the replicated environment, thatwill try to push the queue again and introduce admin requests that alsoneed to be cleaned out.

- If the system schema has optimiser statistics defined, make sure they are up to date, by analyzing the schema with compute statistics. In general Oracle does not recommend adding statistics to the system schema tables.- From SQLPLUS :

execute dbms_defer_sys.delete_tran(null, null);

7. Example of how deferred transactions are propagated

The following is an example of a transaction being replicated from ORA9I.WORLD to REP9I.WORLD.

7.1 Data is inserted into the table:

insert into scott.dept values (80,'MARKETING','ORLANDO');commit;insert into scott.dept values (90,'MARKETING','LONDON');commit;

7.2 Interrogating the deferred queue:

After applying DML to replicated tables, modifications are logged in the system.def$_aqcall table. These changes can be seen through the deftran and deftrandest views:

select * from deftran;

DEFERRED_TRAN_ID DELIVERY_ORDER D START_TIM------------------------------ -------------- - ---------4.49.495 479001 R 22-JAN-023.0.494 478972 R 22-JAN-02

column deferred_tran_id format a10column dblink format a30select * from deftrandest order by delivery_order;

--OR--

select enq_tid, cscnfrom system.def$_aqcallwhere cscn is not nullorder by cscn;

DEFERRED_T DELIVERY_ORDER DBLINK---------- -------------- ------------------------------4.49.495 479001 REP9I.WORLD

7.3 Identify the next transaction to be pushed:

Find out what the current LAST_DELIVERED value is for SYSTEM.DEF$_DESTINATION:

select dblink, last_delivered from system.def$_destination;

DBLINK LAST_DELIVERED------------------------------ --------------REP9I.WORLD 478992ORA9I.WORLD 478878

At this point, DEFERRED_TRAN_ID OR ENQ_TID 4.49.495 has not yet been pushed because its DELIVERY_ORDER OR CSCN of 479001 is greater than the current LAST_DELIVERED value for all transactions going to REP9I.WORLD which is 478992.

DEF$_AQCALL.CSCN > DEF$_DESTINATION.LAST_DELIVERED means the transaction has NOT been pushed.

7.4 Manually push transactions to REP901 and interrogate the deferred queue

Push the queue:

..... dbms_defer_sys.push(destination=>'REP9I.WORLD');

Check the views:

select * from deftrandest;

no rows selected

select * from deftran;

DEFERRED_TRAN_ID DELIVERY_ORDER D START_TIM------------------------------ -------------- - ---------4.49.495 479001 R 22-JAN-023.0.494 478972 R 22-JAN-02

select enq_tid, cscn from system.def$_aqcall;

ENQ_TID CSCN------------------------------ ----------4.49.495 4790013.0.494 478972

As expected, the unpurged queue view deftran shows all transactions and the deftrandest view is emply because all transactions have been pushed.

select dblink, last_delivered from system.def$_destination;

DBLINK LAST_DELIVERED---------------------------- --------------REP9I.WORLD 479017ORA9I.WORLD 478878

The LAST_DELIVERED column has increased and all transactions with DEF$_AQCALL.CSCN < DEF$_DESTINATION.LAST_DELIVERED have been pushed, which is all of the transactions currently in deftran.

7.5 Identify unpurged transactions

The following query identifies the number of transactions that have been pushed but not yet purged:

select count(*)from system.def$_aqcallwhere cscn < (select last_delivered fromsystem.def$_destination where dblink ='[dblink]');

COUNT(*)----------2

Related

Products

Oracle Database Products > Oracle Database > Oracle Database > Oracle Server - Enterprise Edition

Oracle Database Products > Oracle Database > Oracle Database > Oracle Server - Enterprise Edition

Keywords

troubleshooting guide mmr

Documents