

Operating System

Deployment, monitoring and troubleshooting of the Windows 2000® File Replication Service using the SONAR, TOPCHK, CONNSTAT and IOLOGSUM tools

Abstract

The Windows File Replication Service (FRS) is a multi-master file replication system that is included in

all Windows 2000 Server and Windows Server 2003 SKUs.

Windows 2000® domain controllers use FRS to replicate system policy and login scripts for Windows

2000® and down-level clients. In addition, the Windows 2000® Distributed File System (DFS) uses

FRS in order to keep file shares synchronized for load balancing.

This document describes how to configure, monitor and troubleshoot the most common FRS problems

that customers might experience. All the information in this whitepaper refers to systems running

Windows 2000® Service Pack 2 and later.

White Paper Version 1.00d. January 2003.

Windows Server 2003 RC2 Note: This paper is based on information available at the time of the

Windows Server 2003 RC2 release. Information provided in this paper is subject to change before the

final release of Windows Server 2003. This paper will be updated with the release candidate cycles of

the product, and a final release will be made closer to the release date of Windows Server 2003.

© Copyright 2003 Microsoft Corporation. All rights reserved 1


© 2003 Microsoft Corporation. All rights reserved.

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.

The BackOffice logo, Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation.

Other product or company names mentioned herein may be the trademarks of their respective owners.

Microsoft Corporation • One Microsoft Way • Redmond, WA 98052-6399 • USA

0399


CONTENTS

1. FRS INTRODUCTION AND TECHNOLOGY OVERVIEW

1.1. GLOSSARY OF FRS TERMS

1.2. GLOSSARY OF OTHER FRS-RELATED TERMS

2. FILE REPLICATION SERVICE SCENARIOS

2.1. THE SYSVOL SCENARIO

2.2. USING FRS FOR REPLICATING DFS LINK TARGETS

2.3. LEAST APPROPRIATE SCENARIOS FOR FRS REPLICATION

2.4. MOST APPROPRIATE SCENARIOS FOR FRS REPLICATION

3. DESIGNING AN FRS DEPLOYMENT

3.1. BUILDING A DEPLOYMENT PLAN FOR AN FRS REPLICA SET

3.2. OPTIMIZING A COMPUTER FOR FRS

3.3. BUILD & MAINTAIN A PICTURE OF THE ENVIRONMENT

3.4. OPERATIONS STAFFING AND TRAINING

4. MONITORING THE FILE REPLICATION SERVICE

4.1. VALIDATING REPLICATION TOPOLOGY

4.2. HIGH-LEVEL MONITORING OF REPLICA SETS USING THE ‘SONAR’ TOOL

4.3. MONITORING USING EVENT LOGS

5. TROUBLESHOOTING, RECOVERY & REPAIR

5.1. DRILL-DOWN USING HEALTH CHECK AND ANALYSIS TOOLS

5.2. ANALYSING FRS STATUS REPORTS AND LOGS

5.3. VERIFY THE FRS TOPOLOGY AS STORED IN THE ACTIVE DIRECTORY

5.4. OPTIMIZING REPLICATION SCHEDULES

5.5. INSPECTING THE CONNECTION STATE FOR SYSVOL AND DFS/FRS WITH CONNSTAT

5.6. TROUBLESHOOTING FILES NOT REPLICATING

5.7. UNDERSTANDING MORPHED FOLDERS IN FRS

5.8. TROUBLESHOOTING THE SYSVOL SHARE

5.9. TROUBLESHOOTING EXCESSIVE DISK/CPU USAGE BY NTFRS.EXE

6. RECOVERY USING D2 (NON-AUTHORITATIVE RESTORE) AND D4 (AUTHORITATIVE RESTORE)

6.1. WHEN TO USE D2. NON-AUTHORITATIVE RESTORE

6.2. WHEN TO USE D4. AUTHORITATIVE RESTORE

7. REFERENCE PAGES FOR IOLOGSUM, TOPCHK, CONNSTAT

7.1. IOLOGSUM: FRS INBOUND AND OUTBOUND LOGS SUMMARY

7.2. TOPCHK: DFS AND SYSVOL REPLICATION TOPOLOGY ANALYSIS

7.3. CONNSTAT: FRS CONNECTION STATUS

8. FRS EVENT LOG MESSAGES


1. FRS Introduction and technology overview

The Windows 2000® File Replication Service (FRS) is a multi-master file replication service for NTFS that allows the member machines to be interconnected in an arbitrary topology and uses last-writer-wins conflict resolution. It uses authenticated RPC with encryption as the communication protocol between members, and can also support disconnected operation. FRS supports multiple replication sets per computer, each with potentially distinct topology.

Windows 2000 domain controllers and servers use FRS to replicate system policy and login scripts for Windows 2000 and down-level clients – this is known as the ‘SYSVOL’ folder.

FRS can also replicate content between Windows 2000 servers hosting the same fault-tolerant DFS roots or child node replicas.

1.1. Glossary of FRS terms

FRS: The Windows 2000® File Replication Service.

File Replication: The process of automatically copying updated data from a specified directory tree on a source computer to an equivalent directory on one or more destination computers such that, in the absence of further changes, the data converges to identical content over time.

Replica Set: The collection of servers all replicating a given set of directories is called a replica set. With an appropriate topology design and sufficient network support, a Windows 2000 FRS replica set can span thousands of computers. It is also possible for a single computer to be a member of multiple replica sets.

Replica Set Members: The computers participating in replicating a particular replica set.

Multi-master replication: A replication model in which any computer in the replica set can originate changes to the replica set and can accept changes from any other member. This differs from 'single-master' replication models, which are limited to having a single nominated 'master' computer where data updates must originate.

NTFS: FRS only replicates the contents of directory trees between NTFS-formatted volumes, since it exploits a number of unique NTFS capabilities: the NTFS change journal and object ID support. NTFS is the recommended file system for Windows 2000® systems because of its support for large volumes, security permissions, and advanced disk management (stripe sets, mirrored volumes, RAID support).

Topology: The set of connections that are used to send updates between members of a replica set. The topology definition includes both the connections and the properties of those connections: schedule, enabled/disabled flags, etc.

Conflict resolution: If differing changes are made to files and directories on two different members of the replica set, then FRS applies rules to decide which change will win. It does not attempt to merge or reconcile the conflicting updates. For files, the rule


is essentially that the latest change wins – known as ‘last-writer-wins’; for folders, the rule is that the first directory created wins the original name, and later ones get renamed using a naming convention. The full set of conflict rules is discussed later in this whitepaper.

Disconnected operation: FRS can operate even if some or all members are disconnected from each other for periods of time. Changes can be accepted by any computer and will be replicated to other members once connectivity is re-established.

Staging: To ensure that files can be replicated even if they are in use, FRS makes copies of the files to be replicated and places them, in compressed form, in a 'staging' area ready to be sent to each replication partner.

Authenticated RPC with encryption: To provide secure communications, FRS uses the Kerberos authentication protocol for authenticated remote procedure call (RPC) to encrypt and tamper-proof the data sent between replication partners.

1.2. Glossary of other FRS-related terms

DFS: The Windows Distributed File System, which provides a UNC namespace that can span and virtualize multiple file servers or NAS devices.

CSC: Client-Side Caching. Another term for the Windows 'Offline Folders' feature.

RPC: Remote Procedure Call.

EFS: The NTFS Encrypting File System. Files stored on NTFS volumes can optionally be encrypted using EFS and used transparently by applications.

SYSVOL: System Volume – a volume of files shared by Domain Controllers.

DC: Domain Controller – a computer in an Active Directory domain that provides access to the Active Directory and participates in Active Directory replication.

KCC: The Knowledge Consistency Checker – a part of the ISTG role in Active Directory. The KCC is responsible for checking and optionally re-creating topology information for the Active Directory domain.

ISTG: Inter-Site Topology Generator – a role that one DC in each Active Directory site must perform.

MMC: Microsoft Management Console.

SCM: Service Control Manager – the element of Windows that is responsible for automatic/manual starting and stopping of system services.

SP: Service Pack.

GC: Global Catalog – a role that one or more DCs in an Active Directory domain may perform.

BURFLAGS: Backup/Restore Flags – an FRS-related registry setting used to change the behaviour of the FRS service at startup. Common values are D2 and D4.

D2: A BURFLAGS value used to trigger the 'non-authoritative restore' process in FRS. This causes the service on a specified member to resynchronize its state with its replication partners in the case of a failure on that member.

D4: A BURFLAGS value used to trigger the 'authoritative restore' process


in FRS – this is used to re-initialize EVERY member of the replica set, and is only performed in extreme circumstances, under supervision of your support provider or Microsoft Product Support Services.
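The D2 and D4 entries above can be summarized in code. The sketch below is illustrative only: the real setting is the BurFlags registry value under the FRS service parameters key, and the helper function and descriptive strings here are hypothetical paraphrases of the definitions above, not the output of any Microsoft tool.

```python
# Illustrative mapping of the two common BURFLAGS values to the FRS
# restore behaviour they trigger. The hex values D2/D4 come from the
# glossary above; the function and strings are hypothetical.

BURFLAGS = {
    0xD2: "non-authoritative restore: resynchronize this member's "
          "state from its replication partners",
    0xD4: "authoritative restore: re-initialize every member of the "
          "replica set from this member (extreme circumstances only)",
}

def describe_burflag(value):
    """Return the restore behaviour a BurFlags value would trigger."""
    return BURFLAGS.get(value, "no restore action at startup")
```

For example, describe_burflag(0xD2) names the non-authoritative restore process discussed further in section 6.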


2. File Replication Service scenarios

The Windows File Replication Service (FRS) is widely used by Windows 2000® for a variety of scenarios, from SYSVOL replication to application & document distribution.

In Windows 2000®, FRS is used for SYSVOL replication between Active Directory Domain Controllers ('DCs') and for replication between targets for Distributed File System (DFS) links.

2.1. The SYSVOL scenario

In a Windows 2000® Active Directory domain, there are two important folders shared out by each domain controller:

\\<domain_name>\SYSVOL - used by XP and W2K clients to obtain Group Policy

\\<domain_name>\NETLOGON - used by the Windows logon process and by clients accessing legacy logon scripts and policy

The Windows 2000® Distributed File System (DFS) is automatically used to create this \\domain_name\... network name. These shares actually map via DFS to shared folders hosted by each of the Domain Controllers in the domain 'domain_name' – for example, the DFS link may resolve to the shares \\domain_controller_X\SYSVOL and \\domain_controller_X\NETLOGON.

Active Directory requires the contents of these individual shares to be consistent throughout the domain, and the Windows 2000® File Replication Service is used to achieve this goal.

In large domains, the SYSVOL replica set can span hundreds or even thousands of computers. SYSVOL is used for a key part of Windows 2000® system infrastructure, including ‘default domain policy’ and other parts of group policy. For this reason, there are specific recommendations for the type of data that should and should not be stored in SYSVOL. The key parts of these recommendations are that:

Policy files and login scripts should be kept in SYSVOL.
o Login scripts should be designed so that clients do not hold them open for indefinite periods of time – e.g. with 'press-a-key' type delays.

Application or hotfix install points should not be kept in SYSVOL. It is recommended that a separate DFS/FRS replica set be set up for these.

For SYSVOL, FRS uses the same connection topology that is manually or automatically1 created for Active Directory replication. Because the connection objects are the same, the schedule and topology for intra- and inter-site replication are the same for FRS and Active Directory. Like Active Directory replication, FRS compresses all replicated content between sites, uses a ‘trigger replication’ scheme, and implicitly uses an always-on schedule between members in the same site. However, unlike Active Directory replication, FRS also compresses replicated content within a site.

1 Created by the Active Directory Knowledge Consistency Checker (KCC)

© Copyright 2003 Microsoft Corporation. All rights reserved 7

Page 8: Troubleshooting Frs-Sonar

2.2. Using FRS for replicating DFS link targets

Unlike SYSVOL replication, which is enabled by default, replication for DFS roots and links must be explicitly enabled using the DFS administrative console. Only targets of roots and links in a domain-based DFS can use FRS; stand-alone DFS does not support automatic file replication using FRS.

You can enable replication of files and folders between computers using the Configure Replication Wizard in the DFS Administrative console. The replication policy can be different for each root and link in the DFS namespace. You must have at least two root targets or two link targets configured in order to enable replication. Microsoft recommends that you enable FRS replication only on DFS links and NOT on DFS roots.

2.2.1 A DFS/FRS scenario for publication of applications

FRS and DFS are two separate services that can be combined to build a way of distributing/publishing applications.

DFS provides the means to construct a global name space spanning multiple servers and shares. It provides the capability to have any portion of the namespace (DFS Links) hosted by multiple servers (DFS Link targets) in one or more sites. In this way it provides availability, load balancing and reduced latency and bandwidth usage by referring the client to a server in a local or nearby site.

FRS complements DFS by providing a method for automatically replicating file data between the multiple targets specified for a given DFS link. The Windows 2000® DFS MMC snap-in allows the administrator to enable and configure FRS between DFS link targets.

In the DFS/FRS publication scenario, a distribution topology is defined (typically hub-and-spoke, a multilevel tree, or a redundant hub-and-spoke) and, by convention, new applications are deployed/updated to a single machine (usually the hub/root of the topology or another centrally located computer2), and FRS then propagates/distributes the changes to the spoke machines.

These configurations have proven very effective in the field, as long as the ‘golden rules’ described in the following section are adhered to. DFS provides a ‘service abstraction’ that allows the backing file servers to be scaled-out or consolidated as required, and FRS ensures consistent distribution of data.

2.2.2 Additional optimizations for the 'run-from-network' scenario

Some organizations have adopted a policy whereby they run applications from a network share instead of installing them onto local systems. For this scenario, and where Windows XP® clients are in use, administrators can take advantage of Offline Folders in order to cache application files on the local hard disks on client machines and thus reduce server load and increase application availability.

2.2.3 A DFS/FRS scenario for publication of data

An almost identical scenario is publication of data across an enterprise. Common examples include documents, diagrams, operational procedures, and historical/test result data. This scenario follows the same advice given for publication of applications.

2 Windows® Server 2003 clients also support this configuration

© Copyright 2003 Microsoft Corporation. All rights reserved 8

Page 9: Troubleshooting Frs-Sonar

One difference is in the way Offline Folder ‘client side caching’ policy might be defined.

2.2.3.1 Reverse publication

'Reverse publication' is where the data flows from a number of dispersed machines to a centralized data collection server. One typical scenario is when collections of log files and reports generated on individual machines are gathered to a central site. Another similar scenario is when data is collected to a central server for backup purposes.

Typically this is configured using a separate replica set for each (hub, remote site) pair, since the intent is that the data flows to the central data collection server and not to each of the servers generating the data. This is sometimes described as a 'petal topology', since the diagram for such a configuration shows a central hub, with each replica set – replicating between the hub and a single machine – looking like the petal of a flower.
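As a rough sketch of the petal topology just described, the configuration can be thought of as one two-member replica set per (hub, branch) pair. The function and naming convention below are hypothetical, purely to illustrate the shape of the configuration:

```python
# Hypothetical sketch of a 'petal topology': one replica set per
# (hub, branch) pair, so data flows between the hub and each branch
# server but never directly between branches.

def petal_topology(hub, branches):
    """Return one illustrative replica-set definition per branch."""
    return [
        {"replica_set": f"COLLECT-{branch}", "members": [hub, branch]}
        for branch in branches
    ]
```

For example, petal_topology("HUB01", ["BR01", "BR02"]) yields two separate replica sets, each pairing the hub with a single branch server.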

2.2.4 A DFS/FRS scenario for data availability of read-write user data

An alternate set of scenarios revolves around supporting a large number of independent users who want to read and write data in their own personal folders on a file server. In order to provide availability in the face of disk or system failure, it is possible to use FRS to replicate the data for availability and to use DFS to provide failover.

However, these scenarios need more careful consideration, since this is not a core scenario for DFS/FRS; it is instead a core scenario for Microsoft Cluster Service (MSCS).

You can use FRS to replicate data in domain-based DFS namespaces on file servers running Windows 2000 or Windows Server 2003. When evaluating using FRS for this scenario, you must determine whether your organization can tolerate inconsistent data that can occur within a replica set. Data inconsistency can occur at the file and folder level as follows:

FRS uses a “last writer wins” algorithm for files. The most recent update to a file in a replica set becomes the version of the file that replicates to the other members of the replica set, which might result in data loss. In addition, FRS cannot enforce file-sharing restrictions or file locking between two users who are working on the same file on two different replica set members.

o Note that deletes are a "last write" event and can ripple through all members of the set, taking a single file, a directory, or the entire tree with it.

FRS uses a “first writer wins” algorithm for folders. If two identically named folders on different servers are added to the replica tree, FRS identifies the conflict during replication and renames (morphs) the folder that was most recently created. Both folders are replicated to all servers in the replica set, and administrators can later merge the contents of two folders or take some other measure to re-establish the single folder.
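The two conflict rules above can be sketched in code. This is a simplified simulation of the behaviour described in the text, not FRS's actual implementation; in particular, the morphed-name suffix used here is only illustrative.

```python
from datetime import datetime, timedelta

# Simplified simulation of the FRS conflict rules described above:
# 'last writer wins' for files, 'first writer wins' for folders
# (the later folder is renamed, i.e. 'morphed').

def resolve_file_conflict(update_a, update_b):
    """Last writer wins: the update with the later timestamp survives."""
    return max(update_a, update_b, key=lambda u: u["modified"])

def resolve_folder_conflict(folder_a, folder_b):
    """First writer wins: the folder created first keeps its name;
    the later one is renamed with an illustrative morph suffix."""
    first, later = sorted((folder_a, folder_b), key=lambda f: f["created"])
    later = dict(later, name=later["name"] + "_NTFRS_morphed")
    return first, later
```

Note that resolve_file_conflict simply discards the losing update, mirroring the silent data loss described above, while resolve_folder_conflict keeps both folders so an administrator can merge them later.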

Temporary data inconsistency due to replication latency is more likely to occur in geographically diverse sites with infrequent replication across slow WAN links. If you want to use replication among servers in the same site, then consistency is probably not an issue because the replication can occur quickly after the file changes — assuming that only one user makes changes to the data. If two users


make changes to the data, then replication conflicts occur, and one user will lose those changes.

2.3. Least appropriate scenarios for FRS replication

2.3.1 In organizations that do not have an operations group or dedicated administrators

Organizations that do not have the staff or the time to monitor FRS event logs, data logs and data consistency on each replica member should not implement FRS.

Organizations must also have well-defined procedures in place to ensure that data in the replica set is not accidentally or unintentionally deleted, because deleting a file or folder from one replica member causes the file or folder (and its contents) to be deleted from all replica members. In addition, if a folder is moved out of the replica tree, FRS will delete the folder and its contents on the remaining replica members. To avoid having to restore the files or folders from backup, you can enable shadow copies on some of the replica members so that you can easily restore a file or folder that was accidentally deleted. For more information about shadow copies, see “Designing a Shadow Copy Strategy” later in this chapter. For more information about FRS logs, see “File Replication Service” in the Distributed Services Guide of the Windows Server 2003 Resource Kit (or see “File Replication Service” on the Web at http://www.Microsoft.com/reskit).

2.3.2 The amount of data changes exceeds what FRS can replicate

If you plan to schedule replication to occur during a specified replication window, you need to ensure that FRS can replicate all the changed files within the window. Replication throughput is determined by a number of factors:

The rate, number and size of changed files.
The speed of the disk subsystem.
The speed of the network.
The schedule between replication partners.
Whether you have optimized the servers by placing the replica tree, the staging directory, and the FRS data on separate disks.

Each organization will have different FRS throughput rates depending on the previous factors. In addition, if your data compresses extremely well, your file throughput will be higher. To determine the replication rate, perform testing in a lab environment that resembles your production environment.

If the amount of data changes exceeds what FRS can replicate within a given period of time, you need to change one of these factors, such as increasing the speed of the disk subsystem (number of disks, mechanical speed, or disk cache) or network. If no change is possible, then FRS is not recommended for your organization.
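The feasibility check described above amounts to simple arithmetic: the (compressed) volume of changed data must fit within the window at the measured throughput. A minimal sketch follows, where the throughput and compression figures are assumptions to be replaced by numbers measured in your own lab environment:

```python
# Back-of-envelope check of whether a batch of changed data can
# replicate within a scheduled window. All inputs are assumptions to
# be replaced with figures measured in a lab environment.

def fits_in_window(changed_bytes, window_hours,
                   throughput_bytes_per_sec, compression_ratio=1.0):
    """True if the compressed changed data fits in the window."""
    to_send = changed_bytes / compression_ratio
    capacity = window_hours * 3600 * throughput_bytes_per_sec
    return to_send <= capacity
```

For example, 2 GB of daily change over a 128 kB/s link with 2:1 compression fits in a 6-hour overnight window, but not in a 1-hour window without compression.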

2.3.3 In organizations that always use clustered file servers

Some organizations use clustered file servers regardless of whether the server contains business-critical data. Although it might seem that storing FRS-replicated content on the shared cluster storage of a clustered file server would increase the availability of the data even more, combining clustering and FRS is not recommended because you then have the weaker guarantees of asynchronous file replication, but the more stringent configuration requirements of a cluster.


In addition, Windows® 2000 and Windows® Server 2003 do not support configuring FRS to replicate data on shared cluster storage.

2.3.4 Locks prevent updates to files and directories from occurring

FRS cannot replicate locked files or directories to other replica members, nor can FRS install a file on a replica member if the local file is held open by another process that denies other applications access to modify the file.

If users or processes frequently leave files open for extended periods, consider using clustering instead of FRS.

2.3.5 Files are changed but held open

FRS does not replicate a file until it has been closed for 3 seconds. If files are changed but are then held open (or re-opened) exclusively before FRS is able to create a staging file of that change, then FRS cannot replicate the file until it has been closed again.
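The 3-second rule above can be sketched as a simple eligibility test. The constant comes from the text; the function shape and timestamps are illustrative, not FRS internals:

```python
# Sketch of the 'closed for 3 seconds' rule: a changed file becomes
# eligible for staging only after a quiet period with no reopen.
# Timestamps are seconds; the helper is purely illustrative.

QUIET_PERIOD_SECONDS = 3.0

def eligible_for_staging(closed_at, now, reopened_at=None):
    """True once the file has stayed closed for the quiet period."""
    if reopened_at is not None and reopened_at < closed_at + QUIET_PERIOD_SECONDS:
        return False  # reopened before the quiet period elapsed
    return now - closed_at >= QUIET_PERIOD_SECONDS
```

A file closed at t=0 and reopened at t=2 never becomes eligible; one left closed becomes eligible at t=3.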

2.3.6 In organizations that use Remote Storage

Remote Storage is a feature in Windows Server 2003 that automatically copies eligible files on local volumes to a library of magnetic tapes or magneto-optical disks. Organizations that use Remote Storage must not use FRS on the same volume. Specifically, do not perform either of the following tasks:

o Do not create a replica set on a volume that is managed by Remote Storage.
o Do not add a volume that contains directories that are part of an FRS replica set to Remote Storage.

2.3.7 For files encrypted using EFS

Using EFS in FRS replica sets is not supported – FRS will skip any files that are encrypted using EFS.

2.4. Most appropriate scenarios for FRS replication

2.4.1 The data to be replicated is read-only or read-mostly

In these scenarios, because changes occur infrequently, the data is usually consistent. In addition, FRS has less data to replicate, so network bandwidth is not heavily affected.

2.4.2 Your organization has geographically dispersed sites, and consistency is not an issue

Geographically dispersed sites might have slower bandwidth connections, but if your organization does not require the data in those sites to always be consistent with each other, you can configure replication in those sites at a schedule that makes sense for your organization.

For example, if your organization has sites in Los Angeles and Zimbabwe, then you can place one or more replicas of the data in servers in those sites and schedule replication to occur at night or during periods of low bandwidth use. Because in this scenario replication could take hours or days to update every member, the delay must be acceptable to your organization.


2.4.3 Each file is changed by one person who accesses the file from one location

Replication conflicts rarely occur if only a single user changes a given file from a single location. Some common scenarios for single authorship are redirected My Documents folders and other home directories. Conversely, if users roam between sites, then replication latency could cause the file to be temporarily inconsistent between sites.

2.4.4 Replication takes place among a small number of servers in the same site

If members are connected by always-on, high-speed LAN connections, then replication latency is reduced. As a result, data tends to be more consistent.

2.4.5 As a fileserver failover configuration, if some data inconsistency between servers can be tolerated

It is possible to use DFS and FRS to replicate read-write user data so that if one file server fails, another can take its place. However, before deploying such a scenario, the following factors must be taken into account in order to determine if the behaviour of such a system is appropriate in the context of the planned scenario.

The issues to consider are:

DFS does not guarantee which file server a client will be referred to; there are clear rules around how DFS load balancing and site selection work, but depending on transient network issues, a DFS client might attach to any candidate server that advertised the file share used by an enabled DFS link target.

The FRS ‘last-writer-wins’ conflict resolution model means that if two client computers (with either the same or different users logged in) access a replicated DFS link, and are directed to different link targets, then these two clients can make changes to the two copies of the same file without being aware of each other’s locks on the files. One of these two clients will silently lose the changes they make.

FRS can only replicate file changes after the file is closed.
FRS replicates whole files each time; if a file is only modified in one small area, then FRS still transmits the complete file. This is acceptable for many files but, depending on bandwidth, may not be appropriate for files such as .PST files, which are large but typically undergo change in small areas of the file.

In some scenarios, this can still be acceptable – the key question is how likely it is that conflicting edits may be made by two different client computers to the same file before the data has had time to replicate.

Another approach to consider is a mechanism (such as scripts) whereby only one of the link targets raises its shared folder at a time. In this case, DFS can only ever successfully route a client computer to one file server, and so such write conflicts cannot occur. Failover is provided by deciding that one other member of the replica set can raise its share, while the failing member is disconnected and has its file share lowered.

The final issue to consider in this scenario is bandwidth usage. Since users are updating files, there is no clear bound to how much replication traffic they may generate, and this should be considered carefully in replica sets that are intended to span a wide area network (WAN).
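Because FRS resends whole files, a rough traffic estimate for a WAN-spanning replica set multiplies full file sizes by the number of changes and outbound partners. The helper below is a hypothetical illustration of that arithmetic, not an FRS tool:

```python
# Rough estimate of outbound replication traffic when whole files are
# resent for every change: total bytes = sum of full file sizes for
# each change event, multiplied by the number of outbound partners.

def daily_replication_bytes(changes, outbound_partners):
    """changes: list of (file_name, file_size_bytes) tuples,
    one entry per change event during the day."""
    return sum(size for _name, size in changes) * outbound_partners
```

For example, a 500 MB .PST file saved three times a day and replicated to two partners generates roughly 3 GB of traffic, even if only a few kilobytes of the file actually changed.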


3. Designing an FRS deployment

There are a small number of high-level golden rules that should be followed when planning an FRS deployment in order to ensure consistent service levels. These are:

1. Once you have chosen FRS as the means to replicate data, you should support that service, and not supplement or override it.

o If replication stops for some reason, the very worst thing one can do is to copy files manually to replication partners. This will cause additional replication traffic, backlog, and possible replication conflicts.

o The correct action is to find the root cause that stopped replication from progressing and resolve that – common causes include low disk free space, poor connectivity, excessive file updates, and files that are in use and cannot be replaced.

o "Support the FRS replicator – don't try to be the replicator."
o Section 5. "Troubleshooting, Recovery & Repair" of this whitepaper

discusses the most common scenarios and recovery steps.

2. Pro-actively monitor the status of replication.
o If a system problem has prevented replication from progressing,

then it is important to determine this and remedy the issue in a timely manner.

o If replication outages persist unnoticed for long periods of time, then failures start to compound upon each other. For example, backlogs of data on one machine can cause accumulation of files in the staging directory to radiate out to other machines and progressively cause network-wide congestion, and also can affect other replica sets on the same machines.

o Section 4. “Monitoring the File Replication Service” of this whitepaper describes techniques for monitoring the status of the FRS.

3. Look for unexpected sources of replication traffic and file locks.
o Any application can potentially alter a file. Some applications can

unexpectedly alter many files on a regular basis. Examples include disk defragmentation tools, anti-virus products, and file system policy templates.

o Some applications may hold files open for long periods of time, preventing updated files from being applied from another computer. The most common example of this is ‘press-a-key’ prompts and message boxes in login scripts where the user goes away for a long period of time after logging in.

o An operator/user in one part of a system may make changes unaware that another operator is making conflicting changes in the same area from some other computer. This is known as ‘duelling admins’ and in some cases a manual replication feedback loop is created as the admins keep re-applying their changes unaware that another is also making changes.

o Section 5.9. “Troubleshooting excessive disk/CPU usage by NTFRS.EXE” of this whitepaper discusses how to detect and resolve such situations.

4. Build an FRS deployment plan that handles bandwidth availability, topology definition, data quantity, data change rate, and monitoring procedures.

© Copyright 2003 Microsoft Corporation. All rights reserved 13


o There are no fixed limits on how much data, or what rate of data change, FRS can support; however, with more data and more volatile data, the system designer needs to consider how to optimize the topology and replica member configuration to support the required level of replication traffic.

o Section 3.1. “Building a deployment plan for an FRS replica set” of this whitepaper describes the steps in building such a plan.

5. Take the time to understand how FRS works.

o A significant amount of detailed FRS training material is now available. If you are using FRS in an advanced manner (large/complex topology, large amounts of replication traffic, etc.), then nominate a staff member to undergo this training so that they can be aware of potential issues and how to troubleshoot and repair any that occur.

o Section 5. “Troubleshooting, Recovery & Repair” of this whitepaper discusses the most common scenarios and recovery steps.

3.1. Building a deployment plan for an FRS replica set

This section discusses how to plan a single replica set. The aim is to ensure that the rate of change in the data set can be accommodated by the available communication bandwidth and schedule.

3.1.1 Planning for bandwidth availability

The first aspect to consider is the bandwidth availability between nodes in the replication set:

Determine the raw available bandwidth between nodes, especially the bandwidth between the most poorly-connected nodes.

Define a replication schedule for FRS. Each FRS connection may have its own schedule defined on an hourly basis.

o Note: If using schedules and hubs with many outgoing connections, it is a best practice to stagger the schedules so that the hub computer does not try to deliver a backlog of replication traffic to all partners at once. This reduces the amount of simultaneous disk I/O that the hub server needs to perform and helps to reduce timeouts from partners. The extent to which the schedules should be staggered should be based on the number of connections to the hub, and on the CPU and disk performance of the hub server.

Note that scheduling is the only mechanism that FRS provides for throttling replication traffic.

3.1.2 Planning for data size and change rate

When a file is modified, FRS will send a complete copy of the resultant file to the computer’s replication partners.

If you have an existing data set that you wish to replicate, it is possible to use the following procedure to estimate how much replication traffic will be generated in a given time period:

1. Use ATTRIB –A <path> /S to clear the archive attribute on every file.
2. Use the data set at <path> as normal.
3. After the chosen interval (for example, daily), use XCOPY /A /S <path> <temp_dir> to copy only the files changed since step 1. Note that both <path> and <temp_dir> should be on NTFS volumes.


4. Ensure the files in <temp_dir> are compressed by executing COMPACT /C /S:<temp_dir>. When this command completes, the summary report indicates how many bytes are being used to store the files, for example:

405 files within 11 directories were compressed.
19,862,282 total bytes of data are stored in 10,677,439 bytes.
The compression ratio is 1.9 to 1.
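As a sanity check, the summary figures above can be turned into a traffic estimate directly. A minimal sketch, using the example numbers from the report above (substitute the figures COMPACT reports for your own data set):

```python
# Illustrative only: derive the compression ratio and per-interval replication
# traffic from a COMPACT /C /S summary report.
total_bytes = 19_862_282    # uncompressed bytes of changed files
stored_bytes = 10_677_439   # compressed bytes actually stored (and replicated)

ratio = total_bytes / stored_bytes
print(f"Compression ratio: {ratio:.1f} to 1")               # 1.9 to 1
print(f"Estimated replicated bytes this interval: {stored_bytes:,}")
```

The compressed figure approximates what FRS would place in the staging area, since staging files are compressed from SP2 onward.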

Note that the replication schedule also plays an important role here. If there are multiple versions of a file in the FRS replication queue from a computer, then FRS will only send the most recent version, not all of the intermediate versions. For this reason a schedule coalesces file changes and reduces bandwidth usage at the expense of a short-term backlog.
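This coalescing behavior can be illustrated with a small sketch (a simplified model, not FRS internals): only the latest queued version of each file survives until the schedule window opens.

```python
# Simplified model of schedule coalescing: multiple updates to one file
# between replication windows collapse into a single transfer of the
# most recent version.
def coalesce(queue):
    """queue: list of (file_name, version) in the order written."""
    latest = {}
    for name, version in queue:
        latest[name] = version   # a later write replaces the queued version
    return latest

# Three edits to logon.bat plus one to policy.pol -> only two files are sent:
updates = [("logon.bat", 1), ("policy.pol", 1), ("logon.bat", 2), ("logon.bat", 3)]
print(coalesce(updates))   # {'logon.bat': 3, 'policy.pol': 1}
```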

3.1.3 Expected propagation requirements

If the topology has multiple levels from the master to its furthest node (based upon the number of hops in the replication topology), and the connections have schedules defined, then the system designer should consider the likely propagation delay for a file to reach that final target.
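A rough worst-case estimate is hops multiplied by the schedule window. This is an assumption-laden sketch – real delays also depend on bandwidth, backlog, and transfer time:

```python
# Back-of-envelope propagation estimate. Assumes each hop may wait up to one
# full schedule window before its connection opens; ignores transfer time.
def worst_case_propagation_hours(hops, schedule_window_hours):
    return hops * schedule_window_hours

# A master 3 hops from the furthest spoke, with connections open every 4 hours:
print(worst_case_propagation_hours(3, 4), "hours")   # 12 hours
```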

3.1.4 Planning the number of replica sets per server

When using FRS to replicate DFS links as replica sets, there is no fixed limit to the number of replica sets that a single file server can be involved in; however, it is recommended that you host no more than 150 different replica sets on a single server to ensure replication performance. The optimal number of replica sets for servers in your organization depends on the CPU, memory, disk input/output (I/O) throughput, and the amount of data changed.

3.1.5 Configuring files or folders to exclude from replication

You can use the Distributed File System snap-in to exclude subfolders or files with certain extensions from replication. By default, the following files are excluded from FRS replication:

File names starting with a tilde (~) character
Files with .bak or .tmp extensions

3.1.6 Topology definition

FRS supports arbitrary topologies, subject to the requirements of the replica set type.

SYSVOL replica sets require that each DC in the same domain have at least one inbound and one outbound connection to another DC in the same domain.

DFS replica sets only require that each member has a single inbound connection - so it is possible to have an all “one-way” topology for all members.

In ANY valid topology, there must be a direct or indirect path between any two members of the replica set.

FRS does not rely on broadcast technology and can provide fault tolerant distribution via multiple connection paths between members. If a given replica member is unavailable, the data will flow via a different route assuming one exists. Dampening logic prevents a data file from being sent more than once to any given member.

Replicating data between sites that are connected with slow WAN links requires careful planning of the topology and schedule. If the sites have a high-bandwidth connection, but business-critical databases and other applications use that connection as well, you will need to schedule replication so that it does not consume bandwidth needed for other uses.


Topology trade-offs

A key part of topology definition involves choosing a trade-off between redundancy and scalability.

Topologies with more redundant paths offer greater fault tolerance by providing the largest number of alternate routes for data transfer - but at the same time this adds administrative overhead (more connections to monitor and support) and potential liability (a single offline downstream partner now affects additional upstream partners). The ‘full mesh’ topology in which every node has a connection to every other is the most extreme case: it provides the most redundancy, but the least scalability.

By contrast, the ‘single-hub-and-spoke’ topology provides the least redundancy, but provides better scalability.

Consequently, the best topology for a given scenario typically lies in a continuum between the extremes of ‘full mesh’ and ‘single-hub-and-spoke’, and balances scalability and redundancy.

In the majority of cases (especially large ones) the optimal topology is often “redundant hub-and-spoke with staggered schedules”. In this arrangement, a small number of members (the ‘hubs’) have direct inbound and outbound connections with every other member (the ‘spokes’ and the other ‘hubs’). There are no additional connections. The connections from the hubs to the spokes have ‘staggered schedules’ so that the hubs do not get overloaded by replicating files to all partners at once. The hubs are well-connected to each other and typically have constant (or very frequent) schedules.

Figure 1: A full-mesh topology

Figure 2: A hub-and-spoke topology

Figure 3: A redundant-hub-and-spoke topology

A case that is common within sites is the ring topology. In this case, all members are connected as if they were points on a circle – each member has inbound and outbound connections with only its immediate neighbours.

Figure 4: A ring topology

3.1.7 Configuring the desired replication topology (DFS case)

Administrators should use the updated DFSGUI.MSC in the Windows Server 2003 Admin pack installed on XP Professional domain members for defining FRS topologies. This tool offers a number of important benefits over the version of the tool that shipped in Windows 2000®:

The updated tool provides the ability to define a customized replication topology for both Windows 2000® and Windows Server 2003 systems. In comparison, the Windows 2000® tool always created full-mesh topologies with a constant schedule.

The updated tool allows schedules to be defined on each connection.

The updated tool allows ‘connection priority’ to be defined on each connection. This is an important parameter to set for redundant-hub-and-spoke topologies, since it can be used to identify to a hub where the other hubs are. This information is used during the recovery process (non-authoritative restore, also known as the ‘D2 process’) to ensure that data is sourced from the well-connected hub server.

The updated tool allows the administrator to specify the staging directory location as each target is added to the FRS replica set.

The Windows Server 2003 RC1 Admin Pack is available for download from http://www.microsoft.com/downloads/release.asp?ReleaseID=34032 . It requires a Windows XP® Professional client to run on, but can be used to manage both Windows 2000® and Windows Server 2003 systems.

3.1.8 Configuring the desired replication topology (SYSVOL case)

The techniques and tools for configuring SYSVOL replica sets are described at length, along with tools and scripts, in the following document:

Windows 2000® Branch Office Deployment Guide: http://www.microsoft.com/technet/treeview/default.asp?url=/TechNet/prodtechnol/windows2000serv/proddocs/adguide/adbranch.asp

It is essential that administrators with large configurations (over 100 sites) review the Branch Office deployment guide (and the topology section especially) in order to decide if a custom, scripted, topology is required. Even for smaller configurations, this is a very important and useful document and toolset.

3.1.9 Plan for future growth

3.1.9.1 Expected growth of replicated data

You need to know whether you plan to replicate larger and larger amounts of data over time so that you can ensure that your topology, schedule, and bandwidth can handle the additional data.


3.1.9.2 Expected increase in the number of replica members

If you plan to deploy a small number of servers at first and then deploy additional servers over time, you need to ensure that your topology and bandwidth can handle the new servers.

3.2. Optimizing a computer for FRS

3.2.1 Service Pack level

Windows 2000® Service Pack 3 contains a number of important fixes and supportability improvements. These are documented in Q321557 “Improvements in the Post-SP2 Release of Ntfrs.exe”.

It is highly recommended that FRS systems run at least Windows 2000 SP2; the Q321557 hotfix or Windows 2000® Service Pack 3 is preferred, primarily due to improved handling of staging areas.

3.2.2 Change Journal size

There is a crucially important configuration parameter that must be considered when using FRS: the size of the NTFS ‘USN Change Journal’.

The NTFS ‘USN Change Journal’ is a log file that NTFS maintains and which describes the nature of changes that have occurred on the file system. NTFS updates this log transactionally and so it is kept in sync with the filesystem state, even in the case of a power-fail or crash. The USN Change Journal has a bounded maximum size – if it exceeds the defined size, then NTFS discards a number of the older records in order to keep the Journal within the defined size limits.

FRS uses this mechanism to track changes on the file tree being replicated. If items are discarded from the NTFS USN Change Journal before FRS has processed them, then FRS loses track of the file system state – a condition known as ‘Journal Wrap’. In this case, the machine must undergo the ‘non-authoritative restore’ process (also known as the ‘D2 process’).

As a rule of thumb, the NTFS ‘USN Change Journal’ for an NTFS volume should be sized at 128MB per 100,000 files being managed by FRS on that NTFS volume. Note that:

Multiple FRS replication trees may exist on the same NTFS volume. Thus, the size of the journal should be decided based upon the total number of files managed by FRS on that volume.

Other files may also be stored on the same NTFS volume, and file operations on these files will be entered in the USN Change Journal, even though they are ignored by FRS. A rule of thumb is to allow an extra 8MB per 100,000 such files (this is very approximate – it depends upon how much activity occurs on those files).

For best results, put FRS-related shares on their own NTFS volume (or volumes) so that the USN journal cannot be affected by other file activity.
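The two rules of thumb above can be combined into a simple sizing calculation. This is a sketch of the paper's guidance only, not an FRS API:

```python
# USN journal sizing per the rules of thumb in this section:
#   128 MB per 100,000 FRS-managed files on the volume, plus
#   ~8 MB per 100,000 other (non-FRS) files on the same volume.
def journal_size_mb(frs_files, other_files=0):
    return 128 * frs_files / 100_000 + 8 * other_files / 100_000

# A volume hosting 250,000 replicated files and 50,000 unrelated files:
print(f"Recommended journal size: {journal_size_mb(250_000, 50_000):.0f} MB")  # 324 MB
```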

Knowledge Base article Q221111 describes how the USN journal size can be increased.

3.2.3 Keep the FRS Service running

In order to minimize the chance of encountering a ‘journal wrap’, it is a best practice to keep the FRS service running at all times – especially when bulk add/delete changes are being made to FRS-replicated files.


3.2.4 Staging area size and quota

The Staging Directory is an area where modified files are stored temporarily prior to being propagated to other replication partners. FRS encapsulates the data and attributes associated with a replicated file (or directory) object in a staging file. FRS needs adequate staging area space on both upstream and downstream machines in order to replicate files.

When the staging file has been generated on the originating computer, FRS compresses it - this saves space in the staging file and causes less data to be replicated between members. It also ensures that the file data can be supplied to partners regardless of any file activity that might prevent access to the original file.

The default size of the staging area is 660 MB per computer. The minimum size that can be configured is 10MB; the maximum size that may be configured is 2TB.

The registry value that affects this is named "Staging Space Limit in KB". It is a DWORD registry value in the range {10*1024 ... MAXLONG } (Note MAXLONG = 2*1024*1024*1024 = 0x7fffffff).

FRS interprets this value with the units being KB (i.e. 1024 bytes). This means that the minimum value corresponds to 10*1024*1024 bytes, and the maximum value corresponds to 2*1024*1024*1024*1024 bytes, or 2TB. For full details, see “Staging Space Limit in KB” in Q221111 “Description of FRS Entries in the Registry”, and also Q264822 “File Replication Service Stops Responding when staging area full”.
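The unit conversion is easy to get wrong, so here is a sketch that checks a proposed value against the documented range. It is a range check only; it does not touch the registry:

```python
# "Staging Space Limit in KB" is interpreted in KB (1024-byte units).
MIN_KB = 10 * 1024       # minimum: 10 MB
MAX_KB = 0x7FFFFFFF      # maximum: MAXLONG KB, i.e. ~2 TB
DEFAULT_KB = 660 * 1024  # the 660 MB default

def valid_staging_limit_kb(value_kb):
    return MIN_KB <= value_kb <= MAX_KB

print(valid_staging_limit_kb(DEFAULT_KB))              # True
print(f"Maximum ~= {MAX_KB * 1024 / 1024**4:.2f} TB")  # ~2.00 TB
```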

Determining the correct staging size for a given dataset:

The ratio of staging area size to data set size depends upon a range of factors

Factor Description

Largest file to be replicated

The largest file that FRS can replicate is determined by the staging area size on both the upstream and downstream machines. Therefore, the largest possible file that FRS can replicate is 2 Terabytes, when the staging area size has been set to this maximum value.

If using Windows 2000 SP2 or later

Windows 2000 SP2 and later compresses the data in the staging area. Some filetypes (text, some binaries, documents) are more compressible than others (e.g. compressed archives, and multimedia files).

If using Windows 2000 SP3 or later

If you are using Windows 2000 SP2 or earlier, be aware that FRS stops replicating if the staging area runs out of free space. This means that if a replica set member goes offline for an extended period of time, it can block replication on an upstream member because the staging area fills up. Consequently, you should use a generous estimate for staging area size. However, Windows 2000 SP3 and later has an updated staging file management algorithm. On these systems, when FRS tries to allocate space for a staging file and is not successful (because either there is not enough space or because the amount of space in use has reached 90 percent of the staging space limit parameter), FRS starts to delete staging files. Staged files are deleted (in order of the longest time since last access) until the amount of space in use has dropped below 60 percent of the staging space limit parameter. Consequently, it is not as critical to use as generous an estimate for staging area size as it was for pre-SP3 systems, but it is still advised to do so – in order to prevent disk/CPU performance being consumed by repeatedly staging and deleting files.
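The SP3 cleanup behavior can be modeled with a short simulation. This is a simplified sketch of the documented 90%/60% thresholds, not the actual NTFRS code:

```python
# Simplified model: when staging usage reaches 90% of the limit, evict staged
# files in least-recently-accessed order until usage drops below 60%.
def staging_cleanup(staged, limit):
    """staged: list of (last_access_time, size); limit: staging space limit."""
    if sum(size for _, size in staged) < 0.9 * limit:
        return staged                       # below the 90% trigger: no action
    remaining = sorted(staged)              # oldest last-access time first
    while remaining and sum(s for _, s in remaining) >= 0.6 * limit:
        remaining.pop(0)                    # evict the least recently accessed
    return remaining

# Three 300-unit staging files against a 1000-unit limit (90% trigger reached):
print(staging_cleanup([(1, 300), (2, 300), (3, 300)], 1000))   # [(3, 300)]
```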


For machines with outbound partners, the recommended minimum version of FRS is Windows 2000 SP2. See Q322157 for more details on FRS changes in Windows 2000 SP3 and the SP3 QFE rollup.

"Production rate" - how much change needs to be replicated

FRS replicates whole files which have been changed so the rate of change is sum(sizes of files modified), not sum(size of changes to files).

There is also the issue of the multiple changes to the same file. FRS can enter a file into the staging area multiple times - once for each time it was written and closed (but note that the FRS 'aging cache' prevents more than one change order + staging file being generated within 3 seconds).
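The effect of the aging cache can be sketched as a per-file debounce. This is a simplified model; the real cache behavior may differ in detail:

```python
# Simplified model of the 3-second 'aging cache': a write-and-close on a file
# does not generate a new change order if one was generated for that file
# within the last 3 seconds.
AGING_WINDOW_SECONDS = 3.0

def change_orders(events):
    """events: list of (timestamp_seconds, file_name); returns accepted events."""
    last = {}
    accepted = []
    for ts, name in events:
        if name not in last or ts - last[name] >= AGING_WINDOW_SECONDS:
            accepted.append((ts, name))
        last[name] = ts
    return accepted

writes = [(0.0, "a.txt"), (1.0, "a.txt"), (2.5, "a.txt"), (6.0, "a.txt")]
print(change_orders(writes))   # [(0.0, 'a.txt'), (6.0, 'a.txt')]
```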

"Consumption rate" - ability of downstream machines to accept files

Staging area for a file is eventually released when all outbound partners receive the staged files. The ability of downstream partners to accept files is a key factor in determining Staging Area Size. Sub-factors here include:

Replication schedule between partners - how long must the files wait for a chance to be replicated?

Availability of partners - issues like planned/unplanned downtime can cause backlogs. An outbound partner that has not connected for a while can cause a lot of staging space to be required.

Bandwidth available between partners - how long will it take to replicate the files?

Number of downstream partners - FRS staging space will be decided by the slowest partner

Other rules to remember include:

The quality of monitoring of replication backlogs is an important issue. If replication backlogs are not carefully monitored, then the staging area can be exhausted (with Windows 2000® Service Pack 2) or can ‘churn’ and consume disk/CPU resources (with Windows 2000® Service Pack 3 or later).

If using SP2 or earlier, 'unnecessary' replication can be provoked by Anti-Virus or File System Policy.

When adding a new member, FRS on the upstream partner needs to generate special  'directed' staging files that will be used to replicate only to the new member. FRS throttles staging space usage in this scenario, but it requires additional staging space to support up to 128 additional outstanding staging files per new downstream partner during this process. The amount of space this uses depends on the size of the files currently waiting replication - in the worst case, it would be the 128 largest files in the replica set. Note that the SP2 compression and SP3 LRU behavior on staging files applies in this case as well, easing demands on staging space.

3.2.5 Appropriate hardware

FRS can consume considerable amounts of disk space, disk I/O, CPU, and bandwidth, depending upon the replication traffic it needs to manage.

This applies especially to hub systems, and it is recommended that system designers consider providing hubs with faster CPU and faster/larger disk subsystems. In particular, using separate disks for the database, staging files, and replicated root will provide the most significant performance gains.

3.3. Build & maintain a picture of the environment

For monitoring and troubleshooting, one of the most important things you need to maintain is a clear understanding of your environment – for example the replication topology, the inter-site networking bandwidth available, and the space available on each system.

It is important to have clear documentation of the replica set configuration, including:

how many servers are involved
the bandwidth between servers/sites
the replication topology
the definition of master/hub/spoke machines
the available disk space on members
how much data is in the replica set
how much data gets changed, and how often (and is therefore replicated on a daily basis)

A common approach is to maintain a Visio document that describes the replica set configuration.

3.4. Operations staffing and training

Correct training of operations staff is very important for FRS for two key reasons: (a) to ensure that the computing/storage/communication resources that FRS requires are available, and (b) to ensure that staff are aware that changes on one system will eventually occur on all other systems.

3.4.1 Defining monitoring and support procedures and staffing

File replication uses a number of finite resources – bandwidth availability, disk space, CPU – and yet the File Replication Service has no way of limiting how much change can be introduced in the file system. For this reason, it is essential that file replication be monitored regularly in order to ensure that the resources are matched to the replication demands, and to ensure that the underlying technologies that FRS uses are available.

It is essential that clear operation procedures are put into place that ensure there is regular, accountable monitoring of the file replication service.

3.4.2 Training operators about the implications of Multi Master replication

Administrators who are likely to write data to replicated data areas should be trained in the basics of FRS, and especially the following points:

FRS supports a multi-master file replication model in which any computer accepts and replicates changes to any other computer taking part in the replication configuration.

They should be aware of the bandwidth impacts of changes that they make in replicated file areas

Any changes they apply to the file system will eventually occur on all other members of the replication set. If replication does not appear to be happening, they should not ‘help the system’ by making the same change on other machines.

If they take an action on some files, and it later seems to have been ‘altered back’, then it may be that they are in a ‘duelling admins situation’ – another operator is making changes in the same area and has overwritten the earlier changes. In this case, they should try to contact other operators to see if this is the case.

Any files they delete on one member will also be deleted on all other members – a deletion replicates just like any other change.


If they rename a file or folder so that it is moved out of the replication tree, then FRS will treat that as a delete on the other replication set members – since the file(s) have disappeared from the scope of the replica set.

If two operators create directories with the same name at the same time (or even before the first created directory has reached the other computer), then the directories will ‘morph’ – the folder created second will get a modified name, e.g. folder_ntfrs_012345678. FRS behaves this way in order to avoid data loss in such situations.
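The morphing rule can be illustrated as follows. The suffix shown is the example from this section; the actual suffix FRS generates may differ:

```python
# Illustrative only: when a folder name collides with one created elsewhere,
# the later creation survives under a modified ('morphed') name.
def morph_name(name, existing, suffix="012345678"):
    if name not in existing:
        return name
    return f"{name}_ntfrs_{suffix}"   # e.g. folder_ntfrs_012345678

print(morph_name("folder", {"folder"}))   # folder_ntfrs_012345678
print(morph_name("newdir", {"folder"}))   # newdir
```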

This training should also be given to new administrators as they join the operations team.


4. Monitoring the File Replication Service

Key FRS monitoring tasks include watching event logs, checking replication backlogs and looking for resource availability issues on individual systems. Monitoring FRS consists of three levels of tools:

1. Topology checking using the ‘topchk’ tool
2. High-level monitoring of replica sets using the ‘SONAR’ tool and event logs
3. Detailed drill-down into individual systems using command-line tools

In addition, please refer to the Microsoft Operations Manager documentation for details on using the MOM FRS monitoring tools.

SONAR, event logs and MOM are useful for locating trouble at an early stage across the replica set, and can each be thought of as an ‘early warning’ system or ‘trouble-finder’. However, for detailed troubleshooting, there is a range of command-line tools that can be used to drill down more deeply into an individual system. Those are covered in the next chapter.

4.1. Validating replication topology

An important task to perform is to ensure that the replication topology information is correct in the Active Directory. If it is not, then replication can stall and backlogs can occur.

A key issue to understand is that each member computer in a replica set reads the FRS topology information from its local Domain Controller. While the Domain Controllers are not yet in sync, FRS systems can potentially have differing views of the replication topology. To some extent, this is a normal operational status – but the topology information should converge once the AD has replicated the FRS topology information to all members, and the FRS servers have picked up the latest topology information via their hourly poll of this data.

This means that it can be acceptable in the short term (for a few hours after making an FRS topology change) for each member to have a different view of the topology (and this is a common cause of a temporary 13508 condition), but after a longer term, the replication should converge. Therefore, the operator should regularly monitor replication topology, but in a stable environment, possibly on a weekly rather than daily basis.

Section 5.3. “Verify the FRS topology as stored in the Active Directory.” describes the detailed steps of how to validate the FRS topology both as part of regular monitoring, and as part of a troubleshooting process.


4.2. High-level Monitoring of replica sets using ‘SONAR’

SONAR is a tool designed for monitoring key statistics about FRS members in a replica set, and is available for download from the http://www.microsoft.com website.

An administrator can use SONAR to easily watch key statistics on a replica set, so they may monitor for traffic levels, backlogs, free space and other issues. SONAR allows definition of filters that define rules for which rows to display, and also allows definition of column sets that can be viewed. SONAR does not modify any settings on the computers that you monitor; it just passively reads information.

SONAR can collect status information from FRS running on either Windows 2000 (all service packs), or Windows Server 2003.

Note that the SONAR display is member-oriented – it tries to roll up statistics into a per-member view, instead of a per-connection view. This is by design – in order to provide a way of monitoring a set at a high level for members that are in trouble. However, the actual trouble may be connection-specific; so troubleshooting often involves first finding a member that requires attention using SONAR, and then drilling into the connection-specific issues with the TOPCHK/CONNSTAT/IOLOGSUM tools as described later in this paper.

A successor to SONAR is planned for 2003: ULTRASOUND is a more sophisticated tool that performs deeper drill-down into situations, and also assists in recovery. You can contact the ULTRASOUND team with suggestions and release information via [email protected].

Sonar will be part of the Windows Server 2003 resource kit, and also available for download from www.microsoft.com.

4.2.1 Running SONAR

SONAR is implemented using Microsoft Visual Studio .NET and requires the machine on which it is running to have the Common Language Runtime (CLR) installed.

o To download the latest version of the CLR, then visit http://msdn.microsoft.com/downloads/default.asp?url=/downloads/sample.asp?url=/msdn-files/027/001/829/msdncompositedoc.xml.

NOTE: It is not necessary to have the CLR on the machines that are being monitored.

o SONAR requires ntfrsapi.dll on the local computer system. If you wish to run SONAR on a Windows® 2000 Professional or Windows® XP system, you will need to copy this DLL from a Windows® 2000 Server system.

By default, you need to copy the sonar.exe to a local folder before running it.

o To enable the CLR to execute code from a remote server, go to “.NET Framework Configuration” (under Admin Tools), right click on ‘Runtime Security Policy’, select ‘Adjust security’, and set the ‘local intranet’ policy to ‘full trust’.

The tool assumes you have Administrator access, and are able to access WMI, Performance Counters, Active Directory, and the FRS service on the members to be monitored.

In order to minimize network traffic and status calls into FRS servers, SONAR should be run by a small number of users in your organization at any one time.

o It is recommended to leave one system running SONAR continuously in logging & notification mode for a given replication set. Note that the ‘CYCLE’ counters are not available until SONAR has been running for 24 hours.

o It is recommended that an administrator use SONAR interactively at least daily to observe the traffic and status of the replica set using the suggested methods in this section.

When SONAR is started with no command line options, it allows a query to be defined - the administrator can choose a domain, and then a replica set within the domain. They can also choose to just view the hub machines in the domain – in this case, SONAR queries the FRS topology in Active Directory and then limits the view to just those machines with a larger than average number of connections. Note that members can be explicitly added or removed later.


Once the replica set is selected (or an existing query is loaded), SONAR displays replica member status in a grid view. The users should now click ‘Refresh All’ to collect the data from the member systems.

4.2.2 SONAR command line

Sonar uses the following syntax:

Sonar [/i] [/u] [/rpc] [/npc] [/s] [file] [/?]

/i    Installs Sonar.exe.
/u    Uninstalls Sonar.
file  Specifies the configuration file to load.
/rpc  Repair Performance counters. Use this option if you have seen a ‘Data Collection’ error of type ‘PerfCtr’.
/npc  No Performance counters. Use this option if you continue to see ‘Data Collection’ errors of type ‘PerfCtr’. In this mode, Sonar will no longer collect data using performance counters, but most of the data Sonar normally collects is still available.
/s    Start logging immediately.
/?    Displays help.

4.2.3 Running SONAR in a large environment

SONAR was designed to run against thousands of machines if necessary. When using it in a large environment with over 500 members in a replica set, best practices to follow include:

First, try it against all the machines in the replica set, but with a longer schedule – for example, 12 hours between refreshes.

o If data collection is unacceptably slow, run SONAR against the hub members only, plus a selection of ‘spoke’ machines from selected sites. If you have issues on spokes that are not being directly monitored, they will show up as backlogs instead of direct failure issues; you can then add SONAR monitoring of the downstream machines, or use the Connstat tool to find which connection is backlogged.

4.2.3.1 The File Menu


The File menu provides a way to create a new query, open an existing query, or save the current query. Queries are stored as XML files with the file type .SNR.

The File menu also provides a way to log output to a comma-separated (CSV) file, so that it may be analyzed by another tool.

4.2.3.2 The Options Menu

The Options menu allows definition of: servers in the query, filters on which rows to display, notifications, and column sets.

These definitions form part of the query saved in the .SNR files, for later re-use. The .SNR files are in XML format and can be externally edited using a text/xml editor.

The Options menu also allows selected rows to be either copied to the clipboard (for pasting into applications such as Word, Excel or Notepad), or for the selected rows to be refreshed.

4.2.3.3 Select Servers sub-menu

The ‘select servers’ sub-menu provides a way to define which members of the replica set will be queried for status.

The left-hand column lists servers in the replica set which could be added to the query. This may show candidates either (a) from the list of all members of the replica set, or (b) just members of the replica set with a greater than average number of connections – these are considered to be ‘hubs’.

The right-hand column lists the members of the replica set that are part of the current query.

This ability to select servers is useful for limiting network bandwidth usage on a WAN by excluding machines that are connected by low-bandwidth links.


4.2.3.4 Filter & Notification selection

The current active Filter & Column set is selected from a drop down box above the grid tool. The filters shown above are default filters that are created in any new query, but the user is free to define their own filters from the Options menu.

Rows are filtered using rules with a syntax like a SQL “WHERE” clause, supporting:

Comparison: > < >= <= = LIKE IN
Arithmetic: + - * / %
String ops: * LIKE
Aggregate: Sum Avg Min Max Count StDev Var
Functions: IIF (if operator)
Operators: AND OR NOT

Examples:
VerCompiledOn < MAX( VerCompiledOn )
SharingViolations > 0
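The filter examples above can be illustrated with a minimal sketch (not SONAR code) of evaluating a WHERE-style comparison such as “SharingViolations > 0” over per-member status rows. The column and server names come from this document; the evaluator itself is a hypothetical illustration.

```python
# Hypothetical evaluator for simple comparison filters over status rows.
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "=": operator.eq}

def filter_rows(rows, column, op, value):
    """Return the rows for which  row[column] <op> value  holds."""
    return [r for r in rows if OPS[op](r[column], value)]

rows = [
    {"Server": "frsdoga", "SharingViolations": 0},
    {"Server": "frsdogb", "SharingViolations": 3},
]

# A non-empty result is what would mark a 'notification' filter as triggered.
print([r["Server"] for r in filter_rows(rows, "SharingViolations", ">", 0)])
```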

Note that Filters can also be defined as notifications – if enabled, then at the end of each refresh cycle, SONAR will execute each query marked as a ‘notification’. An email message will be sent listing all filters that result in a non-empty rowset.


The ‘Min Notify Interval’ shows the minimum time between SONAR sending notifications by email. An example of a SONAR notification is:

From: SONAR [mailto:SONAR]
Sent: Wednesday, August 28, 2002 7:28 PM
To: FRSMON
Subject: SONAR FRS Notification [8/28/2002 7:28:25 PM]: Replica Set 'FRSDog|Pub1|Top' in domain test.reskit.com

Message from SONAR status viewer for Windows File Replication Service

- Issued by X:\frs\sonar\Sonar.exe on FOO.test.reskit.COM

- User: test\frsmon
- Query:
- Domain: test.reskit.com
- Replica set: 'FRSDog|Pub1|Top'

The following filters have been triggered as part of a data refresh:
--------------------------------------------
Filter: Currently involved in join
Servers: frsdogb
--------------------------------------------
For more details, use the SONAR tool to investigate the status.

4.2.3.5 Column set selection

Column Sets are selected from another drop-down box, and again there are a number of built-in column sets added to any new query, but the user is free to define their own from the Options -> Edit Columns menu.


4.2.3.6 Copy/Refresh row

Finally, the user is able to refresh all servers using the ‘refresh all’ button, or can select rows by right-clicking them, then choose to refresh just the selected ones from the Options -> Refresh menu. In addition, the default refresh period can be modified from the final drop-down box on this part of the window.

4.2.3.7 Tooltips

Many of the cells in the grid display have additional information available as tooltips. These are displayed when the mouse is over a cell and held stationary for a few seconds. For example:

o Column headers have tooltips that describe the column contents in more detail.

o In case of a data collection failure by Sonar, the ‘Data Collection State’ cells have tooltips that show which part of data collection failed.

o OutConnections has a tooltip that lists the replica partners, and shows which ones are currently joined.

4.2.3.8 Column sorting and resizing

The grid control can be sorted by clicking on a column heading, and columns can be resized by dragging the boundary between columns.

4.2.3.9 Double-clicking rows: Extensibility of analysis

When a row is double-clicked, SONAR can call out to a program defined in the HKLM\Software\Microsoft\FRSMon\SONAR\Expand registry value (REG_SZ), and will execute “program name” with the parameters “<local computer name> <dns suffix>


<replica root> <column name>”.3 The user can thus define what program is actually executed – for example it could be health_chk, or a connstat report, and this can even be column-sensitive.

4.2.3.10 SONAR Status bar

A status bar at the bottom of the window shows:

Queued: Number of requests queued up
Pending: Number of requests currently being serviced (up to 25)
Succeeded: Number of requests which succeeded
Failed: Number of requests which failed
Notify: Notifications enabled/disabled
Refresh in: Time to next scheduled refresh
Log: Indicates if the log is enabled (“started”) or disabled (“stopped”)

4.2.3.11 Column types

Some columns are expressed in terms of ‘totals’, ‘deltas’, or ‘cycles’:

TOTAL – data displayed as an absolute value or string. For some parameters it is defined as the total number of “occurrences” since the server last started; for others it is a quantity expressed by the service (for example, staging space in use).

DELTA – the change during this refresh period: the difference between the absolute values at the last two “refresh” intervals.

CYCLE – the total number of “occurrences” within a 24-hour period. If a 24-hour period has not elapsed (i.e. the server started less than 24 hours ago), the value is null.
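These three column types can be illustrated with a small sketch (a hypothetical helper, not SONAR code) that derives DELTA and CYCLE values from a history of TOTAL samples:

```python
# Hypothetical derivation of DELTA and CYCLE from (timestamp, TOTAL) samples.
from datetime import datetime, timedelta

def delta(samples):
    """DELTA: difference between the TOTAL values of the last two refreshes."""
    return samples[-1][1] - samples[-2][1]

def cycle(samples, now):
    """CYCLE: occurrences in the last 24 hours, or None if the sample
    history does not yet span 24 hours (server started too recently)."""
    day_ago = now - timedelta(hours=24)
    if samples[0][0] > day_ago:
        return None                      # less than 24 hours of history
    base = max(v for t, v in samples if t <= day_ago)
    return samples[-1][1] - base

t0 = datetime(2002, 8, 28, 0, 0)
# Samples every 6 hours; TOTAL grows by 10 occurrences per hour.
samples = [(t0 + timedelta(hours=h), h * 10) for h in range(0, 30, 6)]
print(delta(samples), cycle(samples, t0 + timedelta(hours=24)))  # 60 240
```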

4.2.4 Monitoring an FRS replica set using SONAR

Best practice for monitoring FRS is to work through the defined column sets for SONAR for the replica set(s) you are investigating, and then work through some filters for specific issues.

4.2.4.1 Column Set: “Error Conditions”

This column set shows the most common situations that require attention. This is the default column set that SONAR displays, and issues that show in this column set should be addressed first.

DataCollectionState: This is NOT an FRS-reported error. Instead, it indicates that SONAR was unable to collect one or more pieces of status information about the FRS system. The details of the failure can be determined by either hovering the mouse pointer over the cell, or by looking at the ‘DataCollectionError’ column in the “All Columns” view:

PerfCtr: Failed to read performance counter. Known to fail on Windows 2000 SP2 systems and Windows Server 2003 RC1 systems. On other platforms, such issues are normally resolved with a refresh.

Registry: Failed to read registry
SCM: Failed to query Service Control Manager
DS: Failed to query Active Directory

3 Note: Earlier releases of SONAR used ‘domain’ instead of DNS Suffix here.


TimeZone: Failed to get TimeZone information
Proc: Failed to collect Process information
WMI: Failed to query information via WMI
EventLog: Failed to query Event log
Sysvol: Failed to connect to SYSVOL share
FRSVer: Failure of the FRS Version RPC interface
FRSInlog: Failure of the FRS Inlog RPC interface
FRSSets: Failure of the FRS Sets RPC interface

FRSState: Servers are normally in the ACTIVE state; look for servers in other states:

ALLOCATED – Replica set is in an initializing state
INITIALIZING – Replica set is in an initializing state
STARTING – Starting the replica set
STOPPED – Replica set is now stopped, typically because the set no longer has connections to any partners
ERROR – Replica set is stopped due to an error
JRNL_WRAP_ERROR – Replica set is stopped due to data loss in the NTFS journal
REPLICA_DELETED – Replica set is marked as deleted

If a member is in one of the error states (shown in red in SONAR), it requires attention before it can replicate again.

SCM State: Shows if the FRS Service is running according to the Windows® ‘Service Control Manager’. Look for servers that are not in the “Running” state, and investigate. FRS Servers should normally be left running.

SYSVOLShared: If the replica set is a SYSVOL, look for “Not shared” or “Not a junction”.

If this is the case, then investigate as described in the ‘Troubleshooting the SYSVOL directory junction’ section.

LongJoinCycle: High numbers indicate servers which have had problems joining with a replication partner. For troubleshooting details, see the ‘Troubleshooting FRS Events 13508 without FRS Event 13509’ section.

ExcessiveReplicationCycle: Shows servers which have detected files that have been touched by some process, but not updated. If this is non-zero, see ‘Troubleshooting excessive disk/CPU usage by NTFRS.EXE’ for more details.

Huge File Cycle: This shows the number of 13523 events logged on this member in the last 24 hours. This indicates that a file could not be replicated since it was larger than the staging area size, and so cannot be replicated until the staging area has been reconfigured.

Increase the staging area on machines in the replica set so that these files can be replicated:

o HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters,


o Value "Staging Space Limit in KB", type REG_DWORD.

o See Q221111, “Description of FRS Entries in the Registry” for more details
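As an illustration only, the registry steps above might be applied from a command prompt with reg.exe (available in the Windows 2000 Support Tools and built into later versions). The value data shown here is an arbitrary example, not a sizing recommendation; see Q221111 for guidance.

```shell
REM Hypothetical example: set the FRS staging space limit to 1 GB (1048576 KB).
reg add "HKLM\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters" ^
    /v "Staging Space Limit in KB" /t REG_DWORD /d 1048576

REM Restart the FRS service so that the new limit takes effect.
net stop ntfrs
net start ntfrs
```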

Staging Full Cycle: This shows the number of 13522 (Staging Area Full) events logged on this member in the last 24 hours. This indicates that a file could not be replicated since the staging area was full. It is recommended that the staging area size be increased.

ChangeOrdersMorphedCycle:
ChangeOrdersMorphedTotal:

Shows the number of directories morphed to foldername_NTFRS_xxxxxxxx names due to conflicting updates. See Understanding Morphed folders for details on the causes and cleanup of this situation.

CommTimeoutDelta:
CommTimeoutCycle:
CommTimeoutTotal:

Look for machines with high numbers of Communication timeouts – this is an indication of network connectivity problems, and if it persists, should be investigated.

Example of “Error Conditions”:

In this example, the machine ‘frsdog-pssuk’ appears to be down, and this is the first issue to investigate. The next interesting items are the ‘Long Join’ conditions for frsdoga and frsdogb. This indicates that they are unable to contact a replication partner.4

4.2.4.2 Column set: “Backlog”

After addressing issues shown in the ‘Error Conditions’ column set, the next column set to investigate is ‘Backlog’. This column set shows the most common backlog situations that require attention. The additional information in this column set is:

OutConnections: Shows the number of outbound connections for this member in this replica set. This is helpful for identifying which machines are hubs in the replication topology.

BacklogConnections: This indicates the number of connections from this member that are in a state where the maximum number of files are being replicated in parallel. If this persists, troubleshoot as described below and in the next chapter.

4 Actually in this case, the cause of the Long Joins was precisely because frsdog-pssuk was down.

BacklogFiles:
BacklogSize: Look for servers with unusually high numbers of files backlogged awaiting replication, or large total quantities of data to replicate. If this persists, troubleshoot as described below and in the next chapter.

Note:
1. Backlogs are actually a per-connection property. Since SONAR just shows per-member information, it only displays the MOST backlogged connection for this member.
2. The number of files backlogged is determined by examining the difference between the ‘change order’ sequence numbers: BacklogFiles = NextChangeOrderSeq – TrailingChangeOrderSeq. However, there are some cases where this metric gets distorted: a ‘vvjoin’ operation creates ‘directed change orders’ which go to only one partner, but SONAR does not take account of this possibility in this metric.

LongJoinCycle: See above for details.

SharingViolations: This actually shows how many files are waiting to be installed on this machine (this list is known as the ‘inlog’ or ‘inbound log’). Files are most commonly in this state if there is a sharing violation preventing a file from being installed; however, there are other potential reasons why a change order might wait in the inlog. See section 5.6.1, “Detecting and resolving Sharing Violations on targets”, for details on this process.

VVjoinsActive: This shows machines which are in the process of joining (performing an initial synchronization) with their replication partners. Note that this process can occur in normal operation, and may cause a temporary replication backlog.

JoinsTotal: This shows the total number of joins (or attempted joins) this machine has been involved in since it restarted. This is not an indication of an error – it is merely informational.

Example of “Backlog”:

In this case, we can see that frsdoga and frsdogb are in VVJOIN, and have a backlog – however, this is again caused by the fact that frsdog-pssuk is down.
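The backlog arithmetic described in the notes above can be sketched as a hypothetical helper (not SONAR code): since SONAR shows per-member data, only the most backlogged connection is reported. The connection names and sequence numbers are invented for illustration.

```python
# Hypothetical per-connection backlog calculation for one member.
def backlog_files(next_seq, trailing_seq):
    """BacklogFiles = NextChangeOrderSeq - TrailingChangeOrderSeq."""
    return next_seq - trailing_seq

# Invented sequence numbers for two outbound connections of one member:
connections = {"to-frsdoga": (1050, 1000), "to-frsdogb": (2000, 1998)}

# SONAR would display only the worst (most backlogged) connection: 50 files.
worst = max(backlog_files(n, t) for n, t in connections.values())
print(worst)
```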


4.2.4.3 Column Set: “Replication Status”

This is an expanded column set that is useful when drilling down into issues with backlogs. It shows more context about what the cause of a backlog might be. New information in this column set includes:

InConnections: Shows the number of inbound connections for this member in this replica set. This is helpful for identifying which machines are hubs in the replication topology.

InJoinedConnections, OutJoinedConnections: This shows how many of the inbound/outbound connections are currently joined (and hence able to replicate).

LastInJoinInterval: Time interval since the last of this member’s Inbound partners joined to replicate.

LastOutJoinInterval: Time interval since the last of this member’s Outbound partners joined to replicate.

LocalChangeOrdersCycle:Shows changes initiated by this member in the last 24 hours. Regularly look to see that changes are being initiated from machines you expect. Look for unusual cases of many members suddenly originating changes, or for one machine unexpectedly initiating a lot of change.

StagingFilesRegeneratedCycle: On Windows® 2000 SP3 and later, this shows that the staging area was too small and that files are being purged from this cache. If this condition occurs frequently:

o Confirm that all direct outbound replication partners receiving updates from this member are online and receiving updates.

o Verify that the replication schedule for receiving partners is open or "on" for a sufficient window of time to accommodate the number of files being replicated.

o Consider increasing the staging area to improve system performance:

o HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters,

o Value "Staging Space Limit in KB", type REG_DWORD.

o See Q221111, “Description of FRS Entries in the Registry” for more details

Example of ‘Replication Status’:


4.2.4.4 Column Set: ‘Resource Usage’

The next column set to monitor is the ‘Resource Usage’ set, which shows the disk space available to support FRS replication:

DiskSpaceReplicaRoot: MB of free space on the replica root.

DiskSpaceStagingRoot: MB of free space in the staging area.

DiskSpaceDB: MB of free space on the volume holding the FRS DB.

StagingFilesRegeneratedCycle: See above for details.

Example of ‘Resource Usage’:

4.2.4.5 Column Set: ‘Settings’

This column set shows some key parameters of the FRS members. New information in this column set includes:

UpdateTime: The time SONAR last collected data from this member.

VerLatestChanges: The FRS version string for this member.

BURFLAGS: The ‘Backup Restore Flags’ active for this member. This is useful for confirming that the entire replica set has been correctly configured during the Authoritative Restore process (see Section 6.2, “When to use D4. Authoritative restore” on page 67 for more details).


USNJournalSize: The FRS registry setting for the NTFS USN journal size. If 0, this indicates the default (32MB before Windows 2000 SP3, 128MB for later versions). Note that 128MB of journal is recommended per 100,000 files in the replica set. See Section 3.2.2, “ChangeJournal size” on page 18 for more details.

Example of ‘Settings’:

4.2.4.6 Column Set: ‘Sources of change’

This column set indicates statistics about sources of change in the replica set. In normal use, most systems will have a limited number of systems that generate change in the replica set. Use this column set on a regular basis to confirm that change is being originated from the expected systems. New information in this column set includes:

UpdateTime: The time SONAR last collected data from this member

ExcessiveReplicationCycle: Shows servers which have detected files that have been touched by some process, but not updated. For troubleshooting details, see ‘Troubleshooting excessive disk/CPU usage by NTFRS.EXE’.

LocalChangeOrdersDelta:
LocalChangeOrdersCycle:
LocalChangeOrdersTotal:

Shows changes initiated by this member in the last 24 hours. Regularly look to see that changes are being initiated from machines you expect. Look for unusual cases of many members suddenly originating changes, or for one machine unexpectedly initiating a lot of change.

ChangeOrdersMorphedCycle:
ChangeOrdersMorphedTotal:

Shows the number of directories morphed to foldername_NTFRS_xxxxxxxx names due to conflicting updates. See Understanding Morphed folders for details on the causes and cleanup of this situation.

Example of ‘Sources of Change’:


4.3. Monitoring using event logs

In addition to regularly using SONAR, the FRS event logs are a key source of monitoring information. The following table summarizes the main FRS event log entries that should be monitored on a regular basis, and this section describes each event in more detail and the actions required in each case.

Event ID | Priority | Summary | Actions required
---------|----------|---------|-----------------
13508 | (depends) | Unable to RPC to a replication partner | Wait for 13509. If no 13509 follows within 4 hours (rule of thumb) then investigate.
13509 | (none) | Able to RPC to a replication partner | No action – this indicates the 13508 wait is over.
13511 | P1 | FRS database is out of disk space | Make more space available for the FRS database.
13512 | (none) | Enabled disk cache detected | Typically no action required.
13522 | SP2: P1 / SP3: P3 | Staging area full | On Windows 2000 before SP3, requires administrator intervention to clear the replication backlog. In Windows 2000 SP3 and later, automatic processes trim the size of the staging area and treat it like a cache; administrators should investigate an increase as an optimization if this is repeatedly logged.
13526 | P1 | SID cannot be determined from the distinguished name | Restart FRS on that system.
13548 | P1 | System clocks are too far apart | Correct clocks on one or more replica members.
13557 | P1 | Duplicate connections configured | Delete the unnecessary connection object.
13567 | P2 | Excessive replication detected and suppressed | Investigate what is causing excessive replication work; in the meantime the FRS server is using CPU and disk resources to damp the replication traffic by comparing files to previously sent copies.
13568 | SP2: P2 / SP3: P1 | Journal wrap | Depends on FRS version; see details below.

4.3.1 Troubleshooting FRS Events 13508 without FRS Event 13509

Event 13508 in the FRS log is a warning that the FRS service has been unable to complete the RPC connection to a specific replication partner. It indicates that FRS is having trouble enabling replication with that partner and will keep trying to establish the connection. A single event ID 13508 does not mean anything is broken or not working; simply look for event ID 13509 to make sure that the problem was resolved. Based on the time between event IDs 13508 and 13509, you can determine if there is a real problem that needs to be addressed.

Note that if FRS is stopped after a 13508 interval, and then later started at a time when the communication issue has been resolved, no 13509 will be entered in the event log. So an event indicating that FRS has started, without a subsequent 13508 message, indicates that replication connections are being made correctly.
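The rule of thumb above (a 13508 warrants investigation only if no 13509 follows within about four hours) can be sketched as a small scan over event records. The (timestamp, event_id) tuples are a hypothetical stand-in for real event-log entries, not an event-log API.

```python
# Hypothetical scan: flag 13508 events with no 13509 inside the grace window.
from datetime import datetime, timedelta

def unresolved_13508(events, grace=timedelta(hours=4)):
    """Return timestamps of 13508 events not followed by a 13509 in time."""
    flagged = []
    for ts, event_id in sorted(events):
        if event_id != 13508:
            continue
        if not any(e == 13509 and ts < t <= ts + grace for t, e in events):
            flagged.append(ts)
    return flagged

t0 = datetime(2002, 8, 28, 8, 0)
events = [(t0, 13508),                        # resolved by the 13509 an hour later
          (t0 + timedelta(hours=1), 13509),
          (t0 + timedelta(hours=6), 13508)]   # nothing follows: investigate
print(unresolved_13508(events))
```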

Since FRS servers gather their replication topology information from their closest Active Directory domain controller (on a domain controller that is also an FRS member, this is the local machine), there is also an expected case where a replica partner in another site will not be aware of the replica set until the topology information has been replicated to domain controllers in that site. When the topology information finally reaches that distant domain controller, the FRS partner in that site will be able to participate in the replica set, leading to FRS event ID 13509. Note that intra-site Active Directory replication partners replicate every 5 minutes. Inter-site replication only replicates when the schedule is open (the shortest delay is 15 minutes). In addition, FRS polls the topology in Active Directory at defined intervals: 5 minutes on domain controllers, and 1 hour on other member servers of a replica set. These delays and schedules (especially in topologies with multiple hops) can delay propagation of the FRS replication topology.

Procedures for troubleshooting FRS Event 13508 without Event 13509:

1. Examine the 13508 event in the FRS event log in order to determine which machine FRS has been unable to communicate with.

2. Determine whether the remote machine is working properly, and verify that FRS is running on it. A good method to do this is to execute ‘NTFRSUTL VERSION <FQDN_of_remote_DC_name>’ from the machine logging the 13508 event.

a. If this fails, check network connectivity by pinging <FQDN_of_remote_DC_name> from the machine logging the 13508 event. If this fails, then troubleshoot as a DNS or TCP/IP issue. If it succeeds, confirm the FRS service is started on the remote DC.

3. Determine whether FRS has ever been able to communicate with the remote computer by looking for 13509 in the event log and review recent change management to networking, firewalls, DNS configuration and Active Directory infrastructure to see if there is a correlation.

4. Determine whether there is anything between the two machines that is capable of blocking RPC traffic, such as a firewall or router.

5. Confirm that Active Directory replication is working.


4.3.2 Troubleshooting FRS Event 13511

FRS event ID 13511 is logged when the FRS database is out of disk space.

To correct this situation, free up some disk space on the volume containing the FRS database. If this is not possible, then consider moving the database to a larger volume with more free space. For more information about how to move the database to a larger volume, see Knowledge Base article Q221093, “How to Relocate the NTFRS Jet Database and Log Files”.

4.3.3 Troubleshooting FRS Event 13526

FRS event ID 13526 is logged when a domain controller becomes unreachable.

Event ID: 13526

The file replication service cannot replicate d:\Systemroot\sysvol\domain with the

computer DC1 because the computer SID cannot be determined from the distinguished

name "cn=dc1,ou=domain controller,dc=corp,dc=com".

The file Replication Service will try later.

SYSVOL and DFS content are not being replicated.

For additional domain controllers, FRS replication failures can prevent the

sharing of netlogon and sysvol shares, the application of policy, and the machine

registering itself in the Active Directory as a domain controller.

For existing computers, files and folders in SYSVOL or DFS replica set are

inconsistent between replica members.

This problem occurs because FRS polls Active Directory at regular intervals to read FRS configuration information. During the polling, an operation is performed to resolve the Security Identifier (SID) of an FRS replication partner. The binding handle might become invalid if the bound domain controller becomes unreachable over the network or restarts within a single polling interval (default 5 minutes).

To resolve this issue, restart FRS on the computer logging the error message.

4.3.4 Troubleshooting FRS Event 13548

FRS event ID 13548 is logged when two replica partners have divergent time settings:

Event ID: 13548

The File Replication Service is unable to replicate with its partner computer

because the difference in clock times is outside the range of plus or minus 30

minutes.

The detected time difference is: XX minutes

This error could be caused by the selection of an incorrect time zone on the local computer or its replication partner.

Check that the time zone and system clock are correctly set on both computers – they must be within 30 minutes of each other, but preferably much closer.

4.3.5 Troubleshooting FRS Event 13522

The Staging Directory is an area where modified files are stored temporarily (a) prior to being propagated to other replication partners or (b) when received from other replication partners. FRS encapsulates the data and attributes associated with a replicated file (or directory) object in a staging file. FRS needs adequate


staging area space on both upstream and downstream machines in order to replicate files.

On Windows 2000 before SP3, Event 13522 indicates that the FRS service has paused because the staging area is full. Replication will resume if staging space becomes available or if the staging space limit is increased.

On Windows 2000 before SP3, you must clear the replication backlog. There are four common reasons why the staging area may fill up:

o One or more downstream partners are not accepting changes. This could be a temporary condition due to the schedule being turned off and FRS waiting for it to open, or a permanent state because the service is turned off, or the downstream partner is in an error state.

o The rate of change in files exceeds the rate at which FRS can process them.

o There are no obvious changes made but the staging area is filling up anyway. This is typically the "excessive replication" issue described in "Troubleshooting FRS Event 13567".

o A parent directory for a large number of changes is failing to replicate in, so all changes underneath it are blocked.

Often, there is a combination of these problems: changes cannot be replicated, and the number of change orders does not map to the amount of change actually made.

4.3.6 Troubleshooting FRS Event 13557

FRS event ID 13557 is logged when duplicate connections are detected between two members:

Event ID: 13557

The File Replication Service has detected a duplicate connection object between

this computer "<Computer 1>" and a computer named "<Computer 2>".

This was detected for the following replica set:

"DOMAIN SYSTEM VOLUME (SYSVOL SHARE)"

To resolve this problem, it is necessary to delete duplicate connection objects between the direct replication partners that are noted in the event text.

4.3.7 Troubleshooting FRS Event 13567

Event 13567 in the FRS event log is generated on Windows 2000® Service Pack 3 and later systems when unnecessary file change activity is detected.

Unnecessary file change activity means that a file has been written by some user/application, but no change is actually made to the file. FRS detects that the file has not changed, and maintains a count of how often this happens. If the condition is detected more than 15 times per hour during a 3 hour period, the FRS service logs the 13567 event.

Such events should be investigated to find the application/user that is modifying file content. See Troubleshooting excessive disk/CPU usage by NTFRS.EXE for details on this procedure. More information can also be found in Knowledge Base article Q315045, “FRS Event 13567 Is Recorded in the FRS Event Log with SP3”.


4.3.8 Troubleshooting FRS Event 13568

FRS event ID 13568 contains the following message:

The File Replication Service has detected that the replica set "1" is in JRNL_WRAP_ERROR.

The NTFS file system maintains a special log called the NTFS USN journal, which is a high-level description of all the changes to files and directories on an NTFS volume. The FRS service uses this mechanism to track changes to the filesystem in directories of interest, and to queue those changes for replication to other machines. The NTFS USN journal has defined size limits and will discard old log information on a first-in, first-out basis in order to maintain its correct size.

If FRS processing falls behind the NTFS USN journal, and if USN journal information that FRS needed has been discarded, then FRS enters a "journal wrap" condition. FRS then needs to rebuild its current replication state with respect to the file system and other replication partners.

Each file change on the NTFS volume occupies approximately 100 bytes in this journal (possibly more, depending on filename size). As a rule of thumb, the NTFS USN journal for an NTFS volume should be sized at 128MB per 100,000 files being managed by FRS on that NTFS volume.

Prior to Windows 2000® SP3, the default journal size was 32MB and the maximum journal size was 128MB. In Windows 2000® SP3 and later, the default journal size is 128MB, and the maximum journal size is 10,000MB.
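The sizing rule of thumb above (128MB of journal per 100,000 files, clamped to the default and maximum sizes just quoted) can be sketched as a hypothetical helper; this is an illustration, not part of FRS or SONAR:

```python
# Hypothetical journal-sizing helper based on this section's rule of thumb.
def recommended_journal_mb(file_count, sp3_or_later=True):
    default_mb = 128 if sp3_or_later else 32
    max_mb = 10_000 if sp3_or_later else 128
    chunks = -(-file_count // 100_000)          # ceiling division
    return min(max(chunks * 128, default_mb), max_mb)

print(recommended_journal_mb(250_000))          # 384 MB for 250,000 files
print(recommended_journal_mb(2_000_000, False)) # capped at the pre-SP3 128 MB maximum
```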

The Journal size may be configured with a registry key, but keep in mind that once you increase it you should not lower it again since this will cause a “journal wrap”. Knowledge base article Q221111 describes how the USN journal size can be increased.

FRS can encounter journal wrap conditions in the following cases:

o Many files are added at once to a replica tree while FRS is very busy, starting up, or not running.

o On a server that is being used for authoritative restore, or as the primary for a new replica partner, there is a lot of file activity at the start of this process and this can consume USN journal records. Following the ‘rule of thumb’ mentioned above is sufficient to avoid this condition.

o An NTFS filesystem needs to be processed with Chkdsk and Chkdsk corrects the filesystem structure. In this case, NTFS creates a new NTFS USN journal for the volume or deletes the corrupt entries from the end of the journal.

The NTFS USN journal is deleted or reduced in size.

The NTFRS service is in an error state that prevents it from processing changes in the USN journal.

If FRS is experiencing journal wrap errors on a particular server, it cannot replicate files until the condition has been cleared. To continue replication, the administrator must stop FRS on that server and perform a non-authoritative restore of the data so that the system may synchronize with its replication partners. Note the following:

In Windows 2000 SP1, FRS cannot perform this process automatically.

In Windows 2000 SP2, FRS performs this process automatically.

In Windows 2000 SP3 and later, FRS does not perform this process automatically. The reason for this change was that the restore was typically being performed at times that administrators had not planned. However, a registry setting is available that allows FRS to perform the automatic non-authoritative restore, just as in Windows 2000 SP2. Microsoft recommends leaving this as a manual process.

Knowledge base article Q292438 describes how to perform the non-authoritative restore process on a server.

4.3.9 Details on other FRS Event log messages

See section 8, “FRS Event log messages” on page 74 for the full list of FRS event log messages.


5. Troubleshooting, Recovery & Repair

There are a number of key concepts to understand about the Windows 2000® File Replication Service before starting a troubleshooting exercise:

FRS supports a multi-master file replication model in which any computer taking part in the replication configuration can originate changes and accept changes from any other member.

Administrators should be aware of the bandwidth impacts of changes that they make in replicated file areas including the bulk reset of permissions or other file attributes by administrators or applications

Any changes they apply to the file system will eventually occur on all other members of the replica set. If a change does not appear to be replicating, they should not ‘help the system’ by making the same change manually on other FRS replication partners.

If administrators note a change on some files that later seems to have been ‘altered back’ to an earlier version, they may be in a ‘duelling admins’ situation – another operator or application is making changes in the same area and has overwritten the earlier changes. In this case, they should try to contact the other operators to confirm whether this is so.

Any files they delete on one member will also be deleted on all other members.

If they rename a file or folder so that it is moved out of the replication tree, then FRS will treat that as a delete on the other replication set members – since the file(s) have disappeared from the scope of the replica set.

If two operators create directories with the same name at the same time (or before the first created directory has replicated to the other computer), the folders will ‘morph’ – the folder created second gets a modified name, e.g. folder_NTFRS_12345678. FRS behaves this way in order to avoid data loss in such situations.

The FRS service should be kept running at all times in order to avoid ‘journal wrap’ conditions.
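To spot morphed names in a replica tree, a simple pattern match is enough. This is an illustrative sketch only; the exact suffix format is an assumption based on the folder_NTFRS_NNNNNNNN form described above:

```python
import re

# Assumed pattern: "_NTFRS_" followed by 8 hex digits at the end of the name.
MORPHED = re.compile(r"_ntfrs_[0-9a-f]{8}$", re.IGNORECASE)

def morphed_names(names):
    """Return the directory names that look like FRS morphed folders."""
    return [n for n in names if MORPHED.search(n)]

print(morphed_names(["Policies", "Policies_NTFRS_003d9a72", "scripts"]))
# -> ['Policies_NTFRS_003d9a72']
```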

The following table lists the most common scenarios that lead to troubleshooting the File Replication Service:

Scenario: Errors in event log
Troubleshooting process: The event log is a key indicator of the health of DFS and SYSVOL replication and should be regularly monitored. See Understanding and resolving FRS event log messages in the previous chapter for details of how to understand and act upon these entries.

Scenario: Files not replicating
Troubleshooting process: Files can fail to replicate for a wide range of underlying reasons: DNS/communication issues, topology problems, incorrect file/folder filter definition, insufficient disk space, FRS servers in an error state, or sharing violations. See Troubleshooting Files not replicating for details on how to detect and resolve these issues.

Scenario: Modified folder names on replicas
Troubleshooting process: If duplicate folders are manually created on multiple Domain Controllers before they have been able to replicate, FRS preserves content by ‘morphing’ the names of the last folders to be created into a folder_NTFRS_NNNNNNNN form. See Understanding Morphed folders for details on the causes and cleanup of this situation.

Scenario: SYSVOL data appears present on DCs but the \\domain\SYSVOL share appears empty
Troubleshooting process: SYSVOL folders include a ‘junction point’ that points to the correct location of the data. If deleted by an administrator, junction points require special steps to recover. See Troubleshooting the SYSVOL directory junction for more details.

Scenario: Excessive disk/CPU activity from FRS
Troubleshooting process: If NTFRS.EXE is identified as having extensive, continuous high CPU/disk usage, see Troubleshooting excessive disk/CPU usage by NTFRS.EXE.

5.1. Drill-down using health check and analysis tools

5.1.1 Obtaining the health check and analysis tools

The Windows Server 2003 installation CD includes a ‘Support Tools’ folder which contains four tools that can be used to analyze FRS systems. These tools are:

Topchk: This tool processes the output of NTFRSUTL DS to analyze the FRS topology. This tool can be used on Windows 2000® or Windows Server 2003 systems.

Connstat: This tool processes the results of NTFRSUTL SETS to view the connection status and replication backlog of downstream partners. This tool can be used on Windows 2000® or Windows Server 2003 systems.

IoLogSum: This tool summarizes and sorts the lists of files:

a. The full table of files that FRS is replicating – NTFRSUTL IDTABLE,

b. Files that FRS has scheduled for outbound replication – NTFRSUTL OUTLOG

c. Files that FRS has scheduled for inbound replication - NTFRSUTL INLOG.

This tool can be used on Windows 2000® or Windows Server 2003 systems.

Health_chk: This command script collects a range of event, error log, NTFRSUTL status and other information useful for monitoring or troubleshooting an FRS system. This tool runs on a Windows Server 2003 system and can be used to collect information from Windows 2000® or Windows Server 2003 systems.

5.1.2 Collecting ‘health check’ data

Health check is an automated script that collects all of the common data required for either drill-down monitoring or initial troubleshooting of an FRS system.

Health check can take quite a while to run and may collect huge amounts of data, especially if you encounter problems with full staging areas (uncommon with Windows 2000 SP3 and later).

For troubleshooting, Health check should be run against the server that shows the symptoms of replication failures, and also against the other servers involved: for example, the direct replication partners or, if there are too many of those, a good sample set of them.

Steps required to collect the FRS ‘health check’ data:

FRS must be running on the target server to gather the data. Note that the error messages are misleading if the FRS service is not running on the target server.

The user must have administrative rights to execute the tools – membership in the local Administrators group on each computer targeted.

HEALTH_CHK can run remotely, so you can use any working server. However, it can gather additional useful data if run on the system it is analyzing.

Logs are written to a specified directory using the name of the target machine.

Note that this script does not modify any configuration settings on either the machine or in the Active Directory; it simply reads the configuration and status.

Here is a list of files collected by the health check script:

File: ds_showconn.txt
Contents: Output from ‘repadmin /showconn’

File: ds_showreps.txt
Contents: Output from ‘repadmin /showreps’

File: evl_application.txt, evl_dns.txt, evl_ds.txt, evl_ntfrs.txt, evl_system.txt
Contents: Event logs in .txt format. If possible, also obtain the event logs of all these servers in .evt format, as they can contain very valuable information and are much easier to read than the .txt versions produced by the health check.

File: ntfrs_config.txt
Contents: Output from ‘ntfrsutl configtable’ – lists all the replica sets

File: ntfrs_ds.txt
Contents: Output from ‘ntfrsutl ds’ – displays the machine’s (FRS’s) view of the topology definition in its local Domain Controller

File: ntfrs_errscan.txt
Contents: A scan of error messages from the FRS debug logs

File: ntfrs_inlog.txt
Contents: Output from ‘ntfrsutl inlog’ – shows inbound ‘change orders’

File: ntfrs_machine.txt
Contents: Local computer name, logon server and username

File: ntfrs_outlog.txt
Contents: Output from ‘ntfrsutl outlog’ – shows outbound ‘change orders’

File: ntfrs_reg.txt
Contents: Registry data from HKLM\system\currentcontrolset\services\NtFrs and below

File: ntfrs_sets.txt
Contents: Output from ‘ntfrsutl sets’ – list of inbound and outbound connections and their status

File: ntfrs_sysvol.txt
Contents: Output from ‘dir /s *.*’ of the SYSVOL share

File: ntfrs_version.txt
Contents: Output from ‘ntfrsutl version’ – shows FRS version information

Note that HEALTH_CHK does not collect the detailed FRS debug logs. If collecting data to provide to a support professional, it is a good idea to also collect this information.

5.1.3 Pre-processing the health check data

The ‘ntfrsutl ds’, ‘ntfrsutl sets’ and ‘ntfrsutl outlog’ output is typically too large to read and diagnose directly. To overcome this problem, it is recommended to run the following scripts against the corresponding files in order to create a summary of the data. Note that these pre-processing tools are implemented in the Perl language. To run them, you need to obtain a Perl interpreter from a site such as http://www.activestate.com/Products/ActivePerl/?_x=1

These tools are typically invoked as follows:

topchk ntfrs_ds.txt > top.txt
connstat ntfrs_sets.txt > con.txt
iologsum ntfrs_outlog.txt > out.txt
iologsum ntfrs_inlog.txt > in.txt

A best practice is to add this step to scripts you build to collect the health_check data from your chosen replication servers.

See also section 7. “Reference pages for IOLOGSUM, TOPCHK, CONSTAT” on page 69, below.

5.1.4 Collecting event logs

Organizations typically have well-established tools for collecting event logs from Windows 2000®-based systems, and these should also continue to be used – the event log dumps collected by ‘health_chk’ contain only a subset of the full event information.

5.2. Analysing FRS status reports and logs

This section describes how to analyze the data that has been collected.

5.2.1 Verify Active Directory replication

FRS replication depends on Active Directory replication, so the first step is always to ensure you have correct monitoring processes for AD replication. The simplest way to check for any suspicious AD issues is to execute:

findstr "FAIL" ds_showreps.txt

If any issues are found, consult the Active Directory troubleshooting guide and Q229896, “Using Repadmin.exe to Troubleshoot Active Directory Replication”.
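The same check can be scripted; this is an illustrative sketch (the helper name and sample lines are hypothetical) that filters repadmin /showreps output for failure lines, equivalent to the FINDSTR command above:

```python
# Illustrative equivalent of: findstr "FAIL" ds_showreps.txt

def failed_lines(lines, keyword="FAIL"):
    """Return the lines of 'repadmin /showreps' output that mention a failure."""
    return [line.rstrip() for line in lines if keyword in line]

# Typical usage: failed_lines(open("ds_showreps.txt"))
sample = [
    "Site1 DC01 via RPC: last attempt was successful.",
    "Site1 DC02 via RPC: last attempt FAILED, result 1722.",
]
print(failed_lines(sample))
```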

5.3. Verify the FRS topology as stored in the Active Directory.

Each FRS server reads the Replica set topology information from the closest available Domain Controller – not from some specified ‘master’ DC. If the topology information is changing, and if there are delays in AD replication due to backlog or schedule, then it is possible that different FRS servers will temporarily be out of sync.

Therefore, the first task in topology analysis is to decide the locations from which to check and compare the topology. This will typically include an FRS server from the main sites from which topologies are updated, along with a handful of branch sites, or any sites experiencing replication or join issues.

For each FRS server you choose to inspect topology from, run the NTFRSUTL DS command to obtain the raw topology information, then run the TOPCHK tool as described in section 5.1.3, “Pre-processing the health check data” above in order to generate a top.txt file.

Then, for each server from which to compare topology, follow the procedures described in this section.


5.3.1 Detecting a missing NTDS Settings reference

One possible reason for the differences could be that NTDS settings objects are missing in the Active Directory.

The ServerReference attribute on the FRS member object of a SYSVOL replica set points to the Distinguished Name (DN) of that member’s NTDS Settings object. If the NTDS Settings object is deleted, the ServerReference attribute is not rebuilt. This means that either (i) the NTDS Settings object is missing or (ii) the link to it in the ServerReference attribute does not exist.

In the top.txt file this situation is reported in the following manner (only on the servers where the NTDS Settings object is missing):

M I S S I N G N T D S S E T T I N G S R E F E R E N C E S

The following FRS Member objects have no Server Reference to an NTDS Settings Object

XYZA0314S01
XYZA0699S01
XYZA0281S01

5.3.2 Repairing a missing NTDS Settings reference

Note: this process involves directly editing the FRS configuration objects in the Active Directory, and must be performed with great care.

To recover from these problems you have to know whether the NTDS Settings object was removed intentionally (i.e., the server should have been demoted) or by accident. If it was intentional, remove the member object (DN=ERNI-VMDC1,CN=Domain System Volume (SYSVOL share),CN=File Replication Service,CN=System,DC=erni,DC=com) with ADSIEDIT or LDP. Also remove the whole server object (CN=ERNI-VMDC1,CN=Servers,CN=Site1,CN=Sites,CN=Configuration,DC=erni,DC=com). NTDSUTIL should do this job; if not, try to delete it in the Sites and Services MMC snap-in, or use ADSIEDIT or LDP again.

You’ll find a detailed description of how to recover missing/corrupted FRS AD objects here:

Q312862: Recovering Missing FRS Objects and FRS Attributes in AD
Q216498: Removing Data in Active Directory After an Unsuccessful Demotion

5.3.3 Investigating & repairing connection balancing and schedules

In this section, we look for four specific issues:

For SYSVOL replica sets, ensure that every inbound connection is matched by an outbound connection.

Ensure that no single Domain Controller is overloaded with an excessive number of inbound or outbound connections.

Ensure that connections are well balanced among the Domain Controllers in the hub site.

Ensure that schedules are reasonable for the number of partners, the speed of the network links, and the amount of data being replicated.


Here is an example from a top.txt file that shows a case where hubs are not well balanced – DC08CCAN has significantly more partners than the others:

Servers referenced from cxtions (From List)

XYZ\DC06CCAN  47   63
XYZ\DC07CCAZ  78   78
XYZ\DC08CCAN  152  162

To remedy such situations, one can manually rebuild the topology or use the MKDSX script (in both cases refer to the Branch Office Deployment Guide). You can also use this script to set an appropriate schedule.

The topchk report includes a section that shows schedules that have been set unnecessarily high – in this case continuous, since 24x7 = 168 hours. For a large deployment with more than 100 Domain Controllers in as many AD sites, it would be advisable to correct this.

M E M B E R S W I T H 1 6 8 H O U R C O N N E C T I O N S

The following FRS Member objects have connection objects with 168 hour replication schedules

Member: DC10CCAN cxtion: 03A1E67D-FCD3-4E6F-A17B-A81149A19206 host: CCA\DC10CCAN
Member: DC10CCAN cxtion: 40A91E15-9572-4B18-8B74-BF9B06DDD3C0 host: CCA\DC10CCAN
Member: DC10CCAN cxtion: 40EAAD73-511F-4D67-8555-EC69F2232370 host: CCA\DC10CCAN

The schedule as reported in the topchk report represents every hour of Monday – only Monday is shown in order to reduce the size of the report, by making the assumption that the Monday schedule is representative of the normal daily schedule:

RepHrs: 168 Sched: 111111111111111111111111

In this example FRS replicates once per hour during all 168 hours of the week (24*7=168). Each digit above represents one hour of the day as a 4-bit hexadecimal value, and each bit represents 15 minutes of that hour. So if the digit is 1 (0001 in binary), one bit is set and FRS replicates once in that hour.

If the digit is 5 (0101 in binary), FRS replicates twice per hour, for example:

RepHrs: 168 Sched: 555555555555555555555555

Finally, if the digit is F (1111 in binary), FRS replicates four times per hour:

RepHrs: 168 Sched: FFFFFFFFFFFFFFFFFFFFFFFF

A connection may be disabled (enabled: FALSE), or the schedule could be all “0” or “(none)”. If the connection is disabled, the FRS service does not use it and therefore does not replicate – this is true for both DFS/FRS and SYSVOL/FRS replication. If the schedule has 24 zeros, FRS also does not replicate. However, if the schedule is set to “(none)”, FRS replicates 4 times per hour, since this is the default schedule on Replica Set objects.
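The digit encoding above can be checked with a few lines of Python (an illustrative decoder, not part of the Support Tools; function names are hypothetical):

```python
# Each schedule digit is a 4-bit hex value; each set bit = one replication
# attempt in a 15-minute slot of that hour.

def attempts_per_hour(digit):
    """Replication attempts in the hour encoded by one topchk schedule digit."""
    return bin(int(digit, 16)).count("1")

def weekly_attempts(monday_sched):
    """Attempts per week, assuming the 24-digit Monday row repeats every day."""
    return 7 * sum(attempts_per_hour(d) for d in monday_sched)

print(attempts_per_hour("1"), attempts_per_hour("5"), attempts_per_hour("F"))  # 1 2 4
print(weekly_attempts("1" * 24))  # 168: once per hour, every hour of the week
```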

5.3.4 Special handling of Global Catalogs (GCs)

Domain Controllers that hold a copy of the Global Catalog and are in a different domain are listed under “servers missing inbound connections”. These servers most likely have inbound connections to replicate parts of the local domain naming context, but they are not members of the SYSVOL replica set. As a result these servers are listed, but they are not in an error state.


5.3.5 Every member must have at least one inbound connection

Every member of a replica set has to have at least one inbound connection. This is always true for SYSVOL and therefore Active Directory replication, as otherwise no new users or even password changes could replicate. A possible exception could be a custom DFS topology where changes are made on only one of the participating servers.

In the case of servers missing inbound connections, the following warning is shown in the top.txt report:

S E R V E R S M I S S I N G I N B O U N D C O N N E C T I O N S

The following FRS Member servers have outbound replication partners but no inbound connection objects. There could be several reasons for this:

1. There are no connection objects under the NTDS Settings object for this server. This is an error.
2. The ServerReference attribute for this server is null. This is an error.
3. This server could be in a different domain, so there will be no FRS member object for it.
4. The FRS member object may be missing. This is an error.

XYZ\DCFRCAN
XYZ\DCFRCAZ
XYZ\NLDC001

5.3.6 Members missing computer reference

S E R V E R S M I S S I N G C O M P U T E R R E F E R E N C E

This part of the topology report appears if any FRS member objects have no computer reference. For more information, refer to the Knowledge Base article Q312862: “Recovering Missing FRS Objects and FRS Attributes in AD.”

5.3.7 Members missing connection objects

S E R V E R S M I S S I N G C O N N E C T I O N O B J E C T S

This part of the topology report appears if any FRS member objects have no inbound connection objects. This is most commonly caused by an Administrator manually defining a replication topology, and not creating a connection object.

In this situation, an Administrator should check for NTDS connection objects. If none exists, the Administrator can create one by using Active Directory Sites & Services. For more information, refer to Knowledge Base article Q257338, “Troubleshooting Missing SYSVOL and NETLOGON Shares.”

5.3.8 Members with self-reference connection objects

This part of the topology report appears if any FRS member objects have connection objects that refer back to them. This is most commonly caused by an administrator manually defining a replication topology and mistakenly creating this condition. In this situation, the topology must be manually corrected by deleting the connection object that points back to the server.

5.3.9 The Summary section of the TOPCHK report


This section provides a statistical overview of the other parts of the report. Each potential entry for this section is described here.

5.3.9.1 Member objects with no NTDS Settings reference

This entry represents the number of FRS members that did not replicate their SYSVOL share and need to be investigated further.

5.3.9.2 Member objects with no connection objects

This entry represents the number of NTDS Settings objects with no connection objects, meaning that the servers shown here are not replicating inbound. Note that each SYSVOL member must have 1 inbound and 1 outbound connection object. DFS replica sets are permitted to have inbound connection objects with no outbound connection object when using ‘hub and spoke’ or custom topologies.

5.3.9.3 Connection objects set to disabled

This entry represents the number of disabled replication connections. This can be confirmed and repaired by editing the enabledConnection attribute on the nTDSConnection object within the configuration container.

The Administrator should investigate these to see which connections have been disabled, and determine if this is intended. The section titled ‘Connection Summary’ printed as part of the summary report (see below) can be used for this purpose.

5.3.9.4 Monday Schedule

This section appears if Topchk finds a connection object scheduled to replicate at least daily (Monday is assumed to be representative of the schedule for the business week). This provides some insight into whether the schedules have been staggered across connections (which is a best practice for large configurations).

5.3.9.5 Number of connections with per-week active replication hours

This section displays replication hours for each connection object. It begins with a number of rows showing (a) the number of servers and (b) the number of hours the schedule is enabled for those servers (note that this is approximate: any hour in which the schedule byte is non-zero is counted as an active replication hour). Objects used for intra-site communication may show high replication hours (in the range of 160 or over), but in most cases this level of network use is not advised.

5.3.9.6 Connection summary

After this summary information, each connection is listed, along with its Monday schedule, replication hours, and enabled/disabled status in order to facilitate further investigation of schedules.

Note: A connection with a GUID, for example “cxtion: CC9BA0F4-3B12-4ADE-86BA-C405D86D4D59” indicates a connection that has been auto-generated by the Active Directory Knowledge Consistency Checker (KCC) or that has been created by using the DFS MMC snap-in utility. When the connection is associated with a name such as “cxtion: DC22,” this indicates the connection was manually generated.


5.3.9.7 Server inbound/outbound partner report

The server inbound/outbound partner report (the final part of the Summary) lists the total number of inbound and outbound connections for a given server. Each server shows two entries: one with ‘<<’ signifying the inbound connection list, and one with ‘>>’ signifying the outbound connection list.

This part of the report is beneficial for investigating two aspects of the topology definition:

It shows whether SYSVOL members have the required minimum of 1 inbound and 1 outbound connection object.

It helps determine if the topology is reasonable with respect to the number of direct partners for each member, the amount of content being replicated (size * change), the link speed, and so forth.

An excessive number of connections to or from any one partner should be evaluated in order to see if the topology is balancing replication load appropriately.

5.4. Optimizing replication schedules

You might need to change the default FRS replication schedules for SYSVOL and DFS replicas in order to make the best use of your network bandwidth. Keep in mind as you consider your replication schedule that FRS replicates whole files, and this can impact the network bandwidth that is available. You might decide, for example, to schedule replication during off hours in order to maximize your bandwidth during business hours. Becoming familiar with the procedures for configuring replication schedules for both DFS and SYSVOL will assist you in scheduling replication for your organization.

One may build a staggered and well balanced topology using the MKDSX scripts from the Branch Office Deployment Guide.

Windows 2000® Branch Office Deployment Guide: http://www.microsoft.com/technet/treeview/default.asp?url=/TechNet/prodtechnol/windows2000serv/proddocs/adguide/adbranch.asp

Detailed discussion of this subject is beyond the scope of this troubleshooting document. However, for Active Directory replication, a rule of thumb is that a domain controller acting as a bridgehead server should not have more than 50 active simultaneous replication connections at any given time in a replication window. (This was determined on a reference server with four Pentium III Xeon processors, 2 gigabytes (GB) of RAM and 2 megabytes (MB) of L2 cache.) Keeping the number of partners below this limit has a significant positive impact on CPU utilization, network throughput, and I/O throughput on the domain controller.

NOTE: If you use the MKDSX script, be aware that the existence of multiple domains in the forest and the new Application Partitions in Windows Server 2003 make the use of manual topologies more complex. Also note that the updated KCC in Windows Server 2003 has improved scalability compared to the Windows 2000 version.

5.4.1 Scheduling SYSVOL replication

FRS replicates SYSVOL using the same intra-site connection objects and schedule built by the KCC for Active Directory replication. The connection object schedule is an attribute associated with each connection object. It contains a 7x24 array of bytes, one byte for each hour in a 7-day week. The low 4 bits of each byte indicate the number of times replication is attempted in that hour; the upper 4 bits are reserved for future use.


Intra-site SYSVOL replication occurs once per hour by default, unless changes are made that trigger replication. When information in the SYSVOL folder is modified, FRS replicates the information immediately to all replication partners.

The example connection object schedule below (each entry is 2 hexadecimal digits) would trigger both SYSVOL and Active Directory replication on the even hours of the week.

Hour of the day
     00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23
sun= 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00
mon= 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00
tue= 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00
wed= 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00
thu= 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00
fri= 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00
sat= 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00

Note that this is how the schedule information is interpreted by the service; the actual user interface does not present the schedule in this form. You can, however, view the connection object schedule for a particular computer by using ntfrsutl.exe.

Note that for SYSVOL, this schedule applies to inter-site replication and NOT to intra-site replication – for SYSVOL, schedules are implicitly always on.
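The 7x24 byte layout described above can be sketched in Python (illustrative only, mirroring the even-hours example; FRS itself stores this as a binary attribute on the connection object):

```python
# One byte per hour of the week; the low 4 bits give the number of
# replication attempts in that hour. 0x01 on even hours, 0x00 on odd hours.
schedule = [[0x01 if hour % 2 == 0 else 0x00 for hour in range(24)]
            for _day in range(7)]

# Count hours whose low nibble is non-zero (the 'active replication hours'
# figure that topchk approximates).
active_hours = sum(1 for day in schedule for byte in day if byte & 0x0F)
print(active_hours)  # 12 even hours per day * 7 days = 84
```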

5.4.2 Changing schedules on SYSVOL connection objects

It is possible to change schedules on connection objects, but this should be done only after careful consideration of the implications and alternatives:

Note that schedules exist on BOTH site links AND connection objects.

If you change the schedule on a connection object, the connection becomes a manual connection that can no longer be managed by the KCC.

To view and change the schedule on a connection object for SYSVOL:
1. In Control Panel, double-click the Administrative Tools icon.
2. Double-click the Active Directory Sites and Services icon.
3. In the console tree, expand Sites, and then expand the site of interest.
4. Expand Servers, expand the server of interest, and click NTDS Settings.
5. Right-click the connection object of interest, and click Properties.
6. Click Change Schedule to view or change the schedule for that connection.

5.4.3 Scheduling FRS replication for DFS replica sets

For DFS replica sets, FRS uses the connection objects, topology, and schedule built by the DFS administrative snap-in. However, a schedule for a DFS replica set can be assigned globally to the replica set or overridden on each individual connection. Generally, it is preferable to change the schedule on the replica set, unless you specifically want exceptions or want to stagger schedules – connection object schedules override a schedule assigned to a replica set.

For example, assigning a schedule to the replica set might be more appropriate for a replica set with a large number of replicas: it would be a tedious process to configure the schedules on all of the connection objects for a replica set with over 100 members.

You schedule replication for a DFS replica set by making replication either available or unavailable. To view or change the schedule on a connection object for a replica set:

1. In Control Panel, double-click the Administrative Tools icon.
2. Double-click the Active Directory Users and Computers icon.
3. On the View menu, click Advanced Features.
4. In the console tree, expand System, expand File Replication Service, and expand DFS Volumes.
5. Expand the DFS root.
6. Right-click the root/link of interest, and click Properties.
7. Click Change Schedule to view the schedule for that connection.

Note again that if a schedule is assigned to a connection object within this replica set, it overrides the schedule assigned to the replica set.

5.5. Inspecting the Connection State for SYSVOL and DFS/FRS with CONNSTAT

The ‘connstat’ report in the con.txt file will show the connection status and how up-to-date downstream partners are relative to the upstream partner. Additional elements shown by connstat include:

FRS version number

Service start status

Backlog indicators, which show whether any outbound partner of that server (a) does not replicate, (b) is far behind, or (c) is “stuck” for other reasons.

5.5.1 The CONNSTAT header section

The first line of the Report Header lists the file specified in the Connstat datafile parameter, and the date the file was last modified. The remainder of the Report Header gives information about the FRS service on the host where the input file was generated. The following table describes this information.

Element Description Comments

Replica Displays the replica set name and GUID

SYSVOL replica sets are always called ‘DOMAIN SYSTEM VOLUME (SYSVOL SHARE)’. Replica sets are uniquely identified by the GUID.

Member Displays the member name. Individual computers in a replica set are referred to as replica members.

Note that for DFS replica sets this name is often a GUID.

ServiceState Shows the state of the FRS service. The return values are 0-7, where

0 = Allocated - initializing state
1 = Initializing - initializing state
2 = Starting - initializing state
3 = Active - set is functioning
4 = Stopped - set stopped with error
5 = Error - set stopped with error
6 = Jrnl_Wrap_Error - set stopped with error
7 = Replica_Deleted - set is marked for deletion

The friendly name is displayed in parentheses next to the return value, for example 3 (ACTIVE).

© Copyright 2003 Microsoft Corporation. All rights reserved 55


OutLogSecNum Displays the current outbound log sequence number.

This is the next index that is used for a new change order.

OutLogCleanup Displays the outbound log sequence number where the next clean-up will begin.

This is the index of the last change order that was confirmed by all the downstream partners.

Delta Represents the difference between OutLogSecNum and OutLogCleanup. This is a good estimation of the number of change orders currently present in the outbound log.

The difference between OutLogSecNum and OutLogCleanup is the Delta. This delta represents the number of change orders that must be held for downstream partners, and therefore also roughly corresponds to the number of files in the staging area (some change orders do not represent file changes; these exist as part of a performance optimization in the SP3 and Windows Server 2003 versions of FRS). A high Delta therefore indicates that at least one partner is not collecting its data, or has just started a version vector join (as in the example above).

Note that NOT ALL change orders in this range may be destined for all partners – some operations such as VVjoins create ‘directed Change Orders’ which are only sent to one specified replication partner.
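The Delta arithmetic above can be sketched in a couple of lines. This is an illustration only; the parameter names mirror the connstat header elements, and the sample values are invented:

```python
# Minimal sketch of the header arithmetic: Delta is the gap between the
# next outbound-log index and the clean-up index. Sample values invented.
def outbound_log_delta(out_log_sec_num, out_log_cleanup):
    """Approximate number of change orders still held in the outbound log."""
    return out_log_sec_num - out_log_cleanup

print(outbound_log_delta(out_log_sec_num=10480, out_log_cleanup=9100))  # -> 1380
```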

Config Flags Displays state information about the replica set. Possible values are: ‘Multimaster,’ ‘Primary,’ or ‘Seeding.’

In ‘multi-master’ replica sets, any domain controller accepts and replicates directory changes to any other domain controller. A value of ‘Primary’ means the host was a primary member when the replica set was created. ‘Seeding’ indicates the replica set is in the initial, seeding state.

Root Path The root path used by the replica member.

Staging Path The staging path used by the replica member.

File Filter Lists file extensions that were excluded during replication, if any.

Directory Filter Lists directories that were excluded during replication, if any.

5.5.2 The CONNSTAT connection status section

After the header, the output is divided into two areas: the upstream partners at the top and the downstream partners at the bottom.

5.5.3 The upstream (inbound) partner section

This section gives state information for all upstream (inbound) connections for the given member machine. The upstream (inbound) partner section of the connstat report is generally not interesting and can be ignored – the analysis should be focused exclusively on outbound partners. (Note, however, that examining the inbound log using NTFRSUTL INLOG and IOLOGSUM is a very useful analysis – it shows which files are pending installation on the member.)


The following table lists columns and descriptions for this section:

Column Heading

Description Comments

Partner The name of the immediate inbound partner for this replica member (member machine).

To participate in a valid FRS replica set, each member must have at least one inbound partner.

I/O Whether the connection is inbound (In) or outbound (Out).

This column appears in both the Inbound and Outbound report sections.

State The current inbound connection state.

See Connection State table (See page 57, below) for a list of valid Inbound connection states and their meaning.

Rev Represents operating system version information for the inbound partner’s communication protocol.

Revision 0 means the partner never joined and therefore we do not know the FRS version.

Revision 3 is the FRS version before SP2 or an SP4 release with compression disabled

Revision 4 is the SP2 roll-up. The major difference between Revision 3 and Revision 4 is that FRS supports compression in Revision 4. This also means that in a mixed environment containing both Revision 3 and Revision 4 members, the Revision 4 servers will create two staging files for each change order: one compressed, one uncompressed.

Revision 6 is the first release of the FRS SP3 roll up (Q307319)

Revision 7 is the final release of the FRS SP3 roll up (Q321557)

Revision 8 is Windows Server 2003 RTM

NOTE: Microsoft strongly advises that systems be running at least the SP2 version of FRS (although the SP3 or later versions have significant improvements).
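As an informal summary, the Rev values above can be captured in a small lookup table. This sketch simply restates the list above; the table and the `revision_supported` helper are illustrative, with the latter encoding the SP2-or-later recommendation:

```python
# Informal mapping of connstat Rev values to FRS releases (restates the
# list above; not an exhaustive or authoritative table).
FRS_REVISIONS = {
    0: "partner never joined - FRS version unknown",
    3: "pre-SP2 FRS (or a later build with compression disabled)",
    4: "SP2 roll-up (adds staging-file compression)",
    6: "first FRS SP3 roll-up (Q307319)",
    7: "final FRS SP3 roll-up (Q321557)",
    8: "Windows Server 2003 RTM",
}

def revision_supported(rev):
    """True if the partner runs at least the SP2 (Rev 4) version of FRS."""
    return rev >= 4

print(revision_supported(3))  # -> False
```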

LastJoinTime A timestamp showing when the replica member last successfully joined with this partner.

This data is available only for partners running Windows 2000 SP2 or later. The last join time column is interesting when the last join exceeds the replication schedule between sites hosting the upstream and downstream partners.

5.5.4 Connection State defined

The Connection State is the connection status between the upstream and downstream partner. Partners process change orders from the upstream partner when the schedule on the connection "opens", allowing the upstream and downstream partners to negotiate a "join" session.

Direct replication partners in DFS/FRS replica sets remain in the joined state as long as the schedule is on and the service is running, regardless of what site they are located in. If the schedule is turned off on the connection object, the partner will be unjoined until the next schedule opening.


For SYSVOL, replication partners in the same AD site as the upstream partner remain in the joined state as long as the service is running. Downstream partners in different AD sites transition from unjoined to joined when the schedule opens.

The following table lists Connection State values and their meaning:

Connection State

Description

Init The connection is newly allocated.

Unjoined The connection is not yet joined to a partner.

Start The inbound partner has requested a join (StartJoin).

Starting The join has started.

Scanning FRS is scanning the inbound log for the replica set.

SendJoin The scan was completed, and the join request has been forwarded to the partner.

WaitJoin FRS is waiting for a reply from the partner.

Joined The join request succeeded.

Unjoining The join request failed after a series of retries.

Deleted The connection has been deleted

5.5.5 The downstream (outbound) partner section

The state of the downstream (outbound) partners of this server should be the most interesting part of the report, as this is where most problems show up.

Note: The first several columns are identical to the Inbound connection report, with some additional information in the State column (see the Connection State table for more information). The columns starting with "OLog State" contain outbound log-related state information for the connection.

The following table lists columns and descriptions for this section.

Column Heading

Description Comments

Partner The name of the immediate outbound partner for this replica member (member machine).

To participate in a valid FRS replica set, each member must have at least one inbound partner.

I/O Whether the connection is inbound (In) or outbound (Out).

This column appears in both the Inbound and Outbound report sections.

State The current outbound connection state.

See Connection State table (See page 57, above) for a list of valid Inbound connection states and their meaning.

In the outbound connections report, the appearance of the tag "-vv" means that this connection is performing a Version Vector-based join operation. Typically this is done when the outbound partner is first added to the replica set or if it is in the process of doing a non-authoritative restore (a


manual or automatic ‘D2’). Once this initial sync is complete, the connection leaves the VVJoin state.

Rev Represents operating system version information for the inbound partner’s communication protocol.

See the definitions of these values in the inbound connection table, above.

LastJoinTime A timestamp showing when the replica member last successfully joined with this partner.

This data is available only for partners running Windows 2000 SP2 or later. The last join time column is interesting when the last join exceeds the replication schedule between sites hosting the upstream and downstream partners.

OLog State The OLog state describes the state of outbound log processing for this connection.

Note: OLP_AT_QUOTA is a normal condition when FRS is actively replicating because the service always tries to keep eight change orders outstanding on each joined outbound connection.

OLP_UNJOINED: The partner is not joined

OLP_ELIGIBLE: The partner can accept further change orders (Connection joined, and ‘change orders out’ is less than the max limit)

OLP_STANDBY: The partner is ready to join the eligible list.

OLP_AT_QUOTA: The partner is at max quota for outstanding change orders. This is a normal operating condition.

OLP_INACTIVE: The partner is not accepting change orders.

Leadx The index into the outbound log for the next change order to be processed for this connection.

If a downstream partner is up to date, the Leadx (leading index) should be the same as OutLogSecNum. If the Leadx is behind, change orders still have to be sent to this partner; the Send Delta represents this number of changes. Leadx may be delayed if a large file is being processed.

Send Delta The difference between OutLogSeqNum (index of the most recent change order in the outlog) and the Leadx value for this connection. In other words, the number of change orders that remain to be processed by this connection.

A Send Delta of zero means that this connection is fully synchronized at this time. Unless there is very little activity in the replica set or the connection schedule is "always on", you are unlikely to find connections with a Send Delta of zero.

Trailx The index into the outbound log of the oldest unacknowledged outbound change order. Each change order sent to the partner for this connection must eventually be acknowledged.

Therefore we still have to hold the staging files for these changes in the staging area (or regenerate them when needed if running SP3 and staging space ran low).

Cleanup Delta The difference between the Trailx value for this connection and the OutLogCleanup index, the latter being the minimum Trailx value across all outbound connections.

This value indicates how far this specific downstream partner has advanced compared to the partner with the oldest change order that must still be held: the higher this value, the better the state of this connection. The partner with a Cleanup Delta of 0 is the one that is furthest behind.

Partners having connections with small cleanup deltas should be examined carefully to verify their operational status, since they are often the ones that are furthest behind in replicating content from this member.

LMT LMT is the acronym for ‘Leadx minus Trailx’, thus LMT is the difference between the values in the Leadx and Trailx columns.

The maximum value is limited by the size of the AckVector, typically 128.

A value of 127 in this column indicates that the connection may require investigation. The delta between the leading and trailing index can expand when the change orders at the extreme ends of the window represent large files, or change orders that cannot move on because of an error state.

If so, wait for a while to see if the condition changes. If it does not, first try to delete and recreate the related connection object in Active Directory ‘sites and services’. If the condition persists, then you may need to run a non-authoritative restore on the outbound partner.

COs out The number of active unacknowledged change orders pending at the partner. By default, FRS allows up to 8 change orders to be outstanding at a time on each outbound connection.

This limit is controlled by a registry parameter, which you may want to increase in an environment with high latency communication links. Note that FRS on the partner will try to initiate fetch requests for staging files for all eight change orders concurrently, so increasing this registry parameter will increase the load on the upstream member. See the Windows Server 2003 Resource Kit for more details.

Last VVJoin The last date and time that a Version Vector-based join was done on the connection.

VVjoin is interesting when you compare the date of the last join against the schedule between the partners and the amount of data being replicated. Machines that remain in VVjoin mode for longer than is required to perform a full sync of the data over the connection (taking into account bandwidth and schedule) should be investigated. An extreme example that would warrant investigation is a connection that entered VVjoin 3 days earlier (date of the TOPCHK run minus the last VVjoin date) in order to replicate 10-20 MB between intra-site replication partners.

Regarding OLP_AT_QUOTA and COs Out (change orders outstanding) - FRS has a default of 8 maximum outstanding (unacknowledged) change orders. If the schedule opens and FRS sends change orders, it will send them back to back until there are no more, or until it hits the maximum number of outstanding change orders. FRS sends the next one as soon as one of the outstanding change orders is acknowledged. So it is normal for a partner to be in the OLP_AT_QUOTA state.
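The outbound-connection arithmetic described in this section can be sketched as a small helper. This is a hedged illustration only; the field names mirror the connstat columns, and the sample values are invented:

```python
# Hedged sketch of the outbound-connection arithmetic described in this
# section; field names mirror the connstat columns, values are invented.
MAX_OUTSTANDING = 8  # FRS default for unacknowledged change orders

def describe_connection(out_log_sec_num, leadx, trailx, out_log_cleanup, cos_out):
    send_delta = out_log_sec_num - leadx      # change orders still to send
    cleanup_delta = trailx - out_log_cleanup  # lead over the slowest partner
    lmt = leadx - trailx                      # Leadx minus Trailx (ack window)
    notes = []
    if cos_out >= MAX_OUTSTANDING:
        notes.append("OLP_AT_QUOTA - normal while actively replicating")
    if lmt >= 127:
        notes.append("ack window nearly full - investigate if persistent")
    if cleanup_delta == 0:
        notes.append("furthest-behind partner for this outbound log")
    return send_delta, cleanup_delta, lmt, notes

print(describe_connection(5000, 4980, 4975, 4975, 8))
```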

5.6. Troubleshooting files not replicating

Files can fail to replicate for a wide range of underlying reasons: DNS problems, firewalls, topology problems, insufficient disk space, the FRS service being stopped, FRS servers in an error state, or sharing violations.


IMPORTANT: SYSVOL uses FRS as the means to replicate data. When troubleshooting the FRS replication service, you should focus on how to enable FRS to run again. You should not start replicating data using some additional, out-of-band mechanism, such as manual copying of files. If replication stops for some reason, the very worst thing one can do is to copy files manually to replication partners - this will cause additional replication traffic, backlog, and possible replication conflicts (see Understanding Morphed Folders for more details). If you need to manually copy folders, ensure you fully understand the implications of this action, or perform this task under the advice of your Product Support provider.

The correct course of action is to find the root cause. Common causes for files not replicating include lack of disk space for data or staging, poor connectivity, critical FRS objects/attributes missing in Active Directory, or files and directories that are in use and cannot be replaced.

A general procedure for troubleshooting FRS when files appear to not be replicating between any two direct replication partners A and B consists of the following steps:

1. Examine the Event logs on the machines involved – see Understanding and resolving FRS event log messages for details of how to understand and act upon these entries. The most common causes of replication failure are logged in the FRS event log.

2. Verify that both Computer A and Computer B are available on the network. Because FRS uses the fully qualified domain name (FQDN) of the replica members, a good first check is to use a ping command specifying the fully qualified name of the problem replicas. From the console of Computer A, send a ping command with Computer B's FQDN. Then, from the console of Computer B, send a ping command to Computer A's FQDN. Verify that the addresses returned by the ping command are the same as the addresses returned by an ipconfig /all command carried out on the command line of the destination computer.

3. Verify remote procedure call (RPC) connectivity between Computer A and Computer B. A good method to do this is to execute ‘NTFRSUTL VERSION <FQDN_of_other_computer>’ from both machines.

4. Use the Active Directory Sites and Services MMC Snap-in to verify the replication schedule on the Connection object, and to confirm that replication is enabled between the computers.

5. Check for files that are larger than the amount of free space on the source or destination server, larger than free space on the staging directory volume, or the size of the staging directory limit in the registry. Resolve the disk space problem or increase the maximum staging file space. See Troubleshooting FRS Event 13522 for more details.

6. Check whether the source file was excluded from replication by FRS file and folder filters (see the header of NTFRSUTL DS). Confirm that the file is not EFS encrypted, not an NTFS junction point (as created by LINKD.EXE from the Windows 2000® Resource Kit), not marked as temporary (see Knowledge Base article Q229928 - Design Decisions, Defaults and Behavior for FRS File and Folder), and not excluded by a file or folder filter on the originating replica member (see Knowledge Base article Q296944: How to use File Replication Service File and Folder Filters). If any of these conditions are true, FRS does not replicate the file or directory.

By default, the following files and folders are excluded from FRS replication:

i. File names starting with a tilde (~) character
ii. Files with .bak or .tmp extensions
iii. NTFS mount points
iv. All reparse points except HSM and SIS

Check whether the file on the originating server is locked on the target computer. See section 5.6.1, “Detecting and resolving Sharing Violations on targets” for details on this process.

7. As a last resort, you might try a service restart on the upstream and/or the downstream partner.
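Steps 2 and 3 above can be scripted. The sketch below merely builds the command lines to run on each console; the `ping` and `ntfrsutl` commands are real, but the wrapper function and host names are illustrative:

```python
# Hypothetical wrapper that builds the per-console checks from steps 2
# and 3 above: ping each partner by FQDN, then verify RPC connectivity
# with 'NTFRSUTL VERSION'. Host names here are invented examples.
def connectivity_checks(fqdn_a, fqdn_b):
    """Commands to run on each machine's console, keyed by host."""
    return {
        fqdn_a: [["ping", fqdn_b], ["ntfrsutl", "version", fqdn_b]],
        fqdn_b: [["ping", fqdn_a], ["ntfrsutl", "version", fqdn_a]],
    }

for host, cmds in connectivity_checks(
        "dc1.corp.example.com", "dc2.corp.example.com").items():
    for cmd in cmds:
        print(f"run on {host}: {' '.join(cmd)}")
```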

5.6.1 Detecting and resolving Sharing Violations on targets

If a file is being held open on a replication target by some user or process, this can prevent FRS from replacing the file with updated content. In this case, the file will be staged on the target computer, and FRS will retry installation of the file. Similarly, a sharing violation on a folder can block changes to the files and folders that reside in and under that locked folder.

If sharing violations are preventing a file from being installed, the NTFRSUTL INLOG report will show such files in the INSTALL_RETRY status, and this will continue until FRS is finally able to replace the file. The following methods can be used to confirm that this is the cause of the file not replicating:

Method

Details

NET File

A useful mechanism is to execute the ‘NET FILE’ command on the target computer – this will tell you which users are holding the file open via the network, though this will not report any files being held open by local processes.

If files are being held open by remote users, it is possible to use the ‘NET FILE <id> /CLOSE’ command in order to force the file closed

OH.EXE Note that it is also possible to use the OH.EXE command in the Windows 2000® Resource Kit to find locally opened files. See the Windows 2000® Resource Kit for details

Log files

It is possible to confirm from the FRS debug logs that a file is not being installed due to sharing violations.

NOTE: If the file is locked on the machine from which the replication should originate, then FRS will be unable to read the file to generate the staging file, hence delaying replication. If the file is locked on the replication target, then FRS will be unable to update the file. In this case FRS continues to retry the update until it succeeds. The retry interval is 30 to 60 seconds

5.6.2 Setting File and Folder Filters

File and folder filters are maintained for each FRS replica set, including SYSVOL and domain DFS with FRS replication enabled. By default, the following files and folders are excluded from FRS replication:

- File names starting with a tilde (~) character
- Files with .bak or .tmp extensions
- NTFS mount points
- All reparse points except HSM and SIS

Filters exclude new files and folders added to a replica set from being replicated to direct and transitive replication partners. They have no effect on existing files in the replica set. For example, if you change the existing file filter from "*.tmp, *.bak" to "*.old, *.bak," FRS does not go through the replica set and exclude all files that match *.old, nor does it go through the replica set and begin to replicate all files that match *.tmp. After the filter change, new files added to the replica set matching *.old are not replicated. New files added to the replica set matching *.tmp are replicated.


The more common point to make is that a file that is not excluded by a filter, but is later renamed to an excluded name, IS replicated to all members of the set.

In addition, any pre-existing file in the replica set that matched the old file filters (such as Test.tmp, created when the old filter was in force) is not automatically replicated when the filter changes. You must explicitly modify such files before they begin replicating. Likewise, you must explicitly delete any pre-existing files in the replica set that match *.old. Until that happens, changes to those files continue to replicate.

These rules apply in the same manner to the directory exclusion filter. If a directory is excluded, all subdirectories and files under that directory are also excluded.

These rules are designed to protect your system from user error. For example, if you accidentally change a filter to exclude a file like *.doc, FRS does not go through and delete every Microsoft Word file in the replica set. Similarly, if you unintentionally omit *.tmp from the filter, FRS does not go through each replica and begin replicating every temporary file that it finds.

You can use the Active Directory Users and Computers console to modify a file or folder filter. To modify the file or folder filter:

1. In Control Panel, double-click the Administrative Tools icon.
2. Double-click the Active Directory Users and Computers icon.
3. On the View menu, click Advanced Features.
4. In the console tree, expand System, expand File Replication Service, and expand DFS Volumes.
5. Expand the DFS root.
6. Right-click the root\link whose filter you want to modify, and then click Properties.
7. On the Replica Set tab, enter your changes in the File Filter field, and then click OK.

Tip: If the File Filter field is left blank, the default Active Directory filters (*.tmp, *.bak, ~*) will still be applied. Therefore, set the file filter to "," to make sure that no filters are applied. This will ensure that all files are replicated.
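The File Filter field semantics described in this section (a blank field means the defaults still apply; a lone "," disables filtering) can be modeled with a short sketch. The `parse_filter` and `is_excluded` helpers are hypothetical names, and this models only the wildcard filters, not the reparse-point or temporary-attribute exclusions:

```python
import fnmatch

# Default Active Directory file filters, per the tip above.
DEFAULT_FILTERS = ["*.tmp", "*.bak", "~*"]

def parse_filter(field):
    """Model the File Filter field: blank -> defaults still apply,
    a lone "," -> no filters at all, otherwise a comma-separated list."""
    stripped = field.strip()
    if stripped == "":
        return DEFAULT_FILTERS
    if stripped == ",":
        return []
    return [p.strip() for p in stripped.split(",") if p.strip()]

def is_excluded(name, field=""):
    """True if a NEW file with this name would be excluded from replication."""
    return any(fnmatch.fnmatch(name, pat) for pat in parse_filter(field))

print(is_excluded("edit.tmp"))                   # blank field -> defaults -> True
print(is_excluded("edit.tmp", ","))              # "," disables filtering -> False
print(is_excluded("notes.old", "*.old, *.bak"))  # custom filter -> True
```

Remember that, as explained above, filters affect only new files added to the replica set; existing files are untouched by a filter change.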

You can use the NTFRSUTL DS command to view the folder filters on an FRS replica.

Also see:
Q229928 Design Decisions, Defaults and Behavior for FRS File and Folder
Q296944 HOW TO: Use File Replication Service File and Folder Filters

5.7. Understanding Morphed Folders in FRS

All files and folders that FRS manages are uniquely identified by a special file ID called a file GUID. FRS uses file GUIDs as the canonical identifiers of the files and folders being replicated, and attempts to ensure that the GUID for each file or folder is identical on all members of the replica set. To FRS, the file or folder name visible in Explorer or the DIR command is just a property of the file or folder; the name/path is NOT the identifier of the file – the GUID is.

If an FRS member receives a change order to create a folder whose name already exists but has a different file GUID than the pre-existing folder, FRS detects a naming conflict - because the file GUIDs for the two directories are different it cannot be a change to the existing directory. In this case the conflicting folder will be given a new name of the form FolderName_NTFRS_<guidname> where


FolderName was the original name of the folder, and <guidname> is a unique character string such as “001a84b2”.

There are two common causes of this condition:

- A folder is created on multiple machines in the replica set before the folder has been able to replicate. This could be due to an administrator or application creating duplicate folders on multiple FRS members, perhaps in an effort by the administrator to make data consistent among all members with a manual copy.

- You initiate an authoritative restore (D4) on one server and either:
  o did not stop the service on all other members of the reinitialized replica set prior to restarting the NTFRS service following the authoritative restore, or
  o did not set the D2 registry key on all other members of the reinitialized replica set prior to such a server replicating outbound changes to reinitialized members of the replica set.

To recover from this condition, the best practice is to rename the ORIGINAL folders AND the ‘morphed’ folders, and wait for the renames to propagate through the system. This will ensure the folder then has a common name throughout the SYSVOL, and that the names and GUIDs match on all members.

NOTE: You cannot simply delete the unwanted folder and rename the other one, as this could lead to even more naming conflicts.

After the rename has propagated, choose the one you want to keep, and rename it back to the original name – other morphs can then be safely deleted.

NOTE: Before deleting any of the folders, it is a best practice to ensure you have a backup of the original (and complete) one.
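A quick way to spot morphed folders is to search for the FolderName_NTFRS_<guidname> pattern described above. The sketch below is illustrative only; it assumes the suffix is an 8-character hexadecimal string like “001a84b2”, which matches the example but is an assumption here:

```python
import re

# Morphed-name pattern FolderName_NTFRS_<guidname>; assumes an
# 8-character hex suffix like "001a84b2" (matches the example above,
# but the exact suffix length is an assumption).
MORPHED = re.compile(r"^(?P<orig>.+)_NTFRS_[0-9a-f]{8}$", re.IGNORECASE)

def find_morphed(folder_names):
    """Return (morphed_name, original_name) pairs found in the list."""
    hits = []
    for name in folder_names:
        m = MORPHED.match(name)
        if m:
            hits.append((name, m.group("orig")))
    return hits

print(find_morphed(["Policies", "Policies_NTFRS_001a84b2", "scripts"]))
# -> [('Policies_NTFRS_001a84b2', 'Policies')]
```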

5.8. Troubleshooting the SYSVOL share

5.8.1 SYSVOL folder is not shared after DCPROMO of a new DC in an existing domain

Confirm that Active Directory replication is occurring to and from a DC in the same domain – see Troubleshooting Access Denied Replication Errors. Then check the status of the SYSVOLREADY value in the HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters registry key.

If it is SYSVOLREADY=0, then this DC has not completed FRS replication from a source server. Review the FRS event log on the new DC and the domain controller used to source Active Directory during DCPROMO (found by reading the %systemroot%\debug\DCPROMO log). See Understanding and resolving FRS Event Log messages and Troubleshooting Files not Replicating.

Review the NTFRSAPI.log to see which DC FRS attempted to source SYSVOL from, and make sure that FRS is in good working order on that server.

If it is SYSVOLREADY=1:
  o Confirm the directory junctions exist – see Verifying the directory junctions, below.
  o Also, confirm that the SYSVOL\domain\scripts folder exists; if it does not, troubleshoot as a replication problem.
  o If the directory junctions are correct and the scripts folder exists, troubleshoot as a NETLOGON problem.
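The SYSVOLREADY triage above can be summarized in a small decision helper (a hypothetical function; it simply restates the branches described in this section):

```python
# Hypothetical triage helper restating the SYSVOLREADY branches above.
def sysvol_share_next_step(sysvol_ready):
    """Map the Netlogon SYSVOLREADY registry value to the next check."""
    if sysvol_ready == 0:
        return ("FRS has not completed replication from a source server - "
                "review the FRS event logs on both machines")
    return ("verify the directory junctions and the SYSVOL\\domain\\scripts "
            "folder, then troubleshoot NETLOGON")

print(sysvol_share_next_step(0))
```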


NOTE: Do NOT change the SYSVOLREADY parameter manually unless instructed to do so by technical support, since doing so can have unpredictable ramifications.

5.8.2 Verifying the directory junctions

The SYSVOL share (\\domain\SYSVOL) contains two folders that are not ordinary folders, but instead are ‘directory junctions’ which point to other folders, much like a symbolic link. Use the DIR command against the "Drive:\Path\SYSVOL\SYSVOL" and "Drive:\Path\SYSVOL\Staging Areas" paths to verify that the junction points are in place (evidenced by the "JUNCTION" string in the DIR output).

D:\WINNT\SYSVOL\sysvol>dir

06/26/2001  01:23p    <DIR>          .
06/26/2001  01:23p    <DIR>          ..
06/26/2001  01:23p    <JUNCTION>     a.com

D:\WINNT\SYSVOL\staging areas>dir

06/26/2001  01:23p    <DIR>          .
06/26/2001  01:23p    <DIR>          ..
06/26/2001  01:23p    <JUNCTION>     a.com

If either of the two junction points is missing, use LINKD.EXE from the Windows 2000® Resource Kit to recreate them, as in the example below:

c:\>Linkd d:\winnt\sysvol\sysvol\<FQDN> <drive>\<path>\sysvol\domain

where <FQDN> is the fully qualified domain name. You should also verify the same path for staging and staging areas.
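Checking for the "JUNCTION" string in DIR output can be automated with a small parser. This is illustrative only; it assumes the entry name immediately follows the <JUNCTION> tag and contains no spaces, and it ignores any junction target that some DIR versions append in brackets:

```python
# Illustrative parser for the DIR listings above: collect the names of
# <JUNCTION> entries. Assumes the name directly follows the tag and has
# no embedded spaces.
def junctions_in_dir_output(dir_output):
    """Return the names of <JUNCTION> entries in `dir` output."""
    names = []
    for line in dir_output.splitlines():
        parts = line.split()
        if "<JUNCTION>" in parts:
            names.append(parts[parts.index("<JUNCTION>") + 1])
    return names

listing = """06/26/2001  01:23p    <DIR>          .
06/26/2001  01:23p    <DIR>          ..
06/26/2001  01:23p    <JUNCTION>     a.com"""
print(junctions_in_dir_output(listing))  # -> ['a.com']
```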

NOTE: Great care should be taken when copying folders that include directory junctions. When XCOPY copies such a tree in Windows 2000®, it copies the junction, not the contents of the folder the junction points to. If not careful, an administrator can accidentally delete SYSVOL by running RD /S on a copy they made of SYSVOL – RD /S will follow the directory junction, whereas RD without /S will not.

For additional information on XCOPY, please see:
Q323007 HOW TO: Copy a Folder to another Folder and retain permissions

5.9. Troubleshooting excessive disk/CPU usage by NTFRS.EXE

Excessive replication generators are applications that change all or most of the files in a replica set on a regular basis without actually altering the file content – they perform writes that do not change the file. Because FRS monitors the USN journal for changes, any such write still causes the file to be replicated. Common examples are applications that re-write the existing ACL (e.g. file system security policy, antivirus products) or rewrite the existing file (defragmentation tools).

With FRS, this condition can cause FRS to appear to use excessive amounts of disk I/O or CPU as it tracks this activity.

The effect of this depends on the version of FRS being used:


From Windows 2000® Service Pack 3 onwards, Event 13567 in the FRS event log records that this kind of non-change was suppressed in order to prevent unnecessary replication, HOWEVER the underlying cause should still be investigated – since it is causing unnecessarily high disk and CPU usage on that FRS server.

For Windows® 2000 versions prior to SP3, excessive replication generators were the most common cause of replication backlogs and full staging areas. On these versions of FRS, administrators should investigate this problem urgently.

Methods to identify the cause of excessive replication include:

- Selectively turn off common causes such as antivirus products, defragmentation tools and file system policy, and determine whether this activity declines.
- Use the ‘FileSpy’ tool from the Windows 2000® Resource Kit to identify file information.
- Inspect the NTFRSUTL OUTLOG report to see which files are being replicated.

Relevant KB articles include:

- Q284947 “Norton AntiVirus 7.x Makes Changes to Security Descriptors”
- Q282791 “FRS: Disk Defragmentation Causes FRS Replication Traffic”
- Q279156 “Effects of Setting File System Policy on a Disk Drive or Folder”
- Q307777 “Possible Causes of a Full File Replication Service Staging Area”


6. Recovery using D2 (non-authoritative restore) and D4 (authoritative restore)

It is very important to make sure that you understand why the system ran into problems before using the D2/D4 process – otherwise there is a high chance that the same root cause will recur.

6.1. When to use D2. Non-authoritative restore

The non-authoritative restore (D2) is used to bring the local FRS replica set back up to date with its partners. This is done by comparing all the local files in a replica set with those of the upstream partners. After you set the BurFlags value to D2 (see Q290762 FRS: Using the BURFLAGS registry key to restore and reinitialize) and restart the service, FRS moves all the data into a “pre-existing” directory and then starts to compare the file IDs and MD5 checksums of the files on the upstream partner with the local ones. If the file ID and the MD5 checksum match, FRS copies the file from the pre-existing directory back into its original location; if they do not match, FRS copies the new file from the partner. Before Windows® 2000 SP3 this work is performed with all upstream partners at once; with Windows® 2000 SP3 and later it is performed sequentially, one partner at a time (see also Q315450 How and when to use FRS connection priority).

With the following registry settings you can set the D2 flag and define a seeding partner (the seeding part works only for SYSVOL replica sets):

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters
    SysVol Seeding
        Domain System Volume (SYSVOL share)
            Replica Set Name   = testdom.nttest.myDomain.com
            Replica Set Parent = \\sudarctest1.testdom.nttest.myDomain.com
    Backup/Restore
        Process at Startup
            BurFlags = REG_DWORD 0xD2

When the D2 process has finished, you will see the positive event log message 13516: “The File Replication Service is no longer preventing the computer … from becoming a domain controller.”

6.2. When to use D4. Authoritative restore

The D4 (authoritative restore) makes the files currently in the replica set authoritative for all members of that replica set. This is done by stamping a new unique file ID on every single file and directory. To get these new files replicated to all the partners, also initiate a non-authoritative (D2) restore on every other member of the replica set.
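The effect of the D4 step can be sketched as follows; the dictionary stands in for the FRS ID table and the function name is invented for the example (real FRS rewrites the IDs inside its Jet database):

```python
import uuid


def stamp_new_file_ids(id_table: dict) -> dict:
    """Give every file and directory in the replica set a fresh unique ID.

    Because no partner holds these new IDs, a subsequent D2 on the other
    members finds no matching entries, so every file is treated as new
    and replicated out from this (now authoritative) member.
    """
    return {path: uuid.uuid4().hex for path in id_table}
```

This is why the D4 member "wins" everywhere: the new IDs never match anything a D2 partner can reuse from its pre-existing directory.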


This D4 (authoritative restore) procedure should be performed only in consultation with your support provider; the process is a last resort, used only in selected situations.

IMPORTANT: The FRS service must be stopped on all replication partners of a replica set before you perform this D4 process. Failure to do so will cause morphed directories to appear on all replica partners. You can use the SONAR tool with the ‘settings’ column set to confirm that all members are stopped and have the correct ‘D2’ BURFLAGS set.


7. Reference pages for IOLOGSUM, TOPCHK, CONNSTAT

Note that these pre-processing tools are implemented in Perl. To run them, you need to obtain a Perl interpreter from a site such as http://www.activestate.com/Products/ActivePerl/?_x=1.

7.1. IOLOGSUM: FRS Inbound and Outbound Logs Summary

This command-line tool processes the output generated when the NTFRSUTL.EXE tool is run with the inlog, outlog, or idtable option.

FRS Inbound logs and Outbound logs

FRS maintains logs of 'change orders' that are either received from replication partners (the inbound log), or are to be sent to outbound replication partners (the outbound log).

The inbound log stores pending inbound change orders to be processed. As entries are processed, FRS sends acknowledgments to the inbound partners. Data stored in the inbound log includes the change order's globally unique identifier (GUID), file name, object ID, parent object ID, version number, and event time. The inbound log is typically small, just a few items, unless something is preventing files from being locally installed on this computer, for example sharing violations.

The outbound log stores pending change orders to be sent to outbound partners. Change orders remain in the outbound log until all outbound partners receive and acknowledge the change. Data stored in the outbound log is the same as that stored in the inbound log. Also in the outbound log is the leading (next change) and trailing (last acknowledged) index for each partner.

The outbound logs can become quite large, particularly when replica members are unavailable, links between replicas are slow, replication hours are restricted, or a large number of changes occur. For example, even when a member for a given replica set is unavailable, outbound log entries are maintained until the member becomes available. When the changes are then sent, the inbound partner sends all changes in log file order.
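The leading/trailing bookkeeping described above makes outbound backlog easy to quantify. A minimal sketch, assuming per-partner (leading, trailing) index pairs have already been read out of a report:

```python
def outbound_backlog(partner_indexes: dict) -> dict:
    """Backlog per outbound partner: leading index (next change to send)
    minus trailing index (last change acknowledged).

    A large or growing value for one partner usually means that partner
    is unavailable, on a slow link, or outside its replication schedule.
    """
    return {partner: leading - trailing
            for partner, (leading, trailing) in partner_indexes.items()}
```

For example, a partner whose backlog is hundreds of change orders while its peers sit near zero is the one to investigate first.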

FRS ID table

Running NTFRSUTL IDTABLE creates a report with version information for each file in the replica tree; this version information tracks the last change order accepted for that particular file or directory.

Running IOLOGSUM

Running NTFRSUTL INLOG or NTFRSUTL OUTLOG creates a summary inbound or outbound NTFRS log report showing the pending change orders for the replica member.

Running Iologsum.cmd sorts and formats this NTFRSUTL output.

IoLogsum uses the following syntax:


iologsum [-sort=xxx] datafile

Sort keywords (sort the output by):

-sort=seqnum      Sequence Number (default)
-sort=version     File Version Number
-sort=filename    File Name
-sort=size        File Size
-sort=fileguid    File Guid
-sort=origguid    Originator Guid
-sort=cxtion      Cxtion Name / Guid
-sort=eventtime   Event Time

The Perl interpreter should be installed and in the path for this command to function.

Parameters

datafile    Specifies a file that was created by running NTFRSUTL.EXE with the inlog, outlog, or idtable option. This is a required parameter.

IoLogSum takes the datafile as input, formats the data it contains, and then displays it to the screen. To save the IoLogSum results to a file, type the following at the command prompt:

iologsum datafile > in.txt

where ‘in.txt’ is an example filename for the saved IoLogSum report.

Example

NTFRSUTL INLOG > ntfrs_inlog.txt
NTFRSUTL OUTLOG > ntfrs_outlog.txt
iologsum ntfrs_inlog.txt > in.txt
iologsum -sort=fileguid ntfrs_outlog.txt > out.txt

Analyzing IOLOGSUM output

Each Iologsum report starts by identifying the name of the file that it processed, and the sort options used to generate the report. Then, for every replica set the computer participates in, the report lists one row per change order.

The following table defines and describes each column in the report.

Column Header    Description

SeqNumber        Unique sequence number for the change order. For idtable reports, this is always reported as 00000000.

EventTime        The USN journal entry timestamp; the time the file was changed on the originating machine.

FileVersNum      The file version number, incremented each time the file is closed.

FileUsn          The USN that the file must match on the Fetch request.

FileSize         File size (in KB).

FileGuid         The object ID of the file or directory; for brevity, only the last 8 digits are printed.

OriginatorGuid   The GUID of the originating member; for brevity, only the last 8 digits are printed. The originator GUID can be mapped back to the server name by searching the “ntfrsutl sets” output.

Cxtion           The GUID of the remote connection; for brevity, only the last 8 digits are printed.

FileName         The Unicode file name.

Flags            Change order flags:
                 Abort - Set when CO is being aborted.
                 VVAct - Set when VV activate request is made.
                 Content - Valid content command.


                 Locn - Valid location command.
                 LclCo - CO is locally generated.
                 Retry - CO needs to retry.
                 InstallInc - Local install not completed.
                 Refresh - CO is an upstream-originated file refresh request.
                 OofOrd - Don't check/update version vector.
                 NewFile - If CO fails, delete IDTable entry.
                 DirectedCo - This CO is directed to a single connection.
                 DemandRef - CO is a downstream demand for refresh.
                 VVjoinToOri - CO is from vvjoin to originator.
                 MorphGen - CO generated as part of name morph resolution.
                 MoveinGen - This CO was generated as part of a sub-dir MOVEIN.
                 OidReset - All CO did was reset OID back to FRS defined value.
                 CmpresStage - The stage file for this CO is compressed.
                 MD5 - The MD5 hash of the file is included.
                 DELETED - The file has been deleted. On IDTABLE reports, this indicates that the record is a tombstone, ensuring the file remains deleted throughout the replica set.

LocationCmd      Change order location command:
                 CREATE - Create a new file or directory.
                 DELETE - Delete a file or directory.
                 MOVEIN - Rename a file or dir from a point outside the replica tree into a replica tree on the same volume.
                 MOVEOUT - Rename a file or dir from a point inside the replica tree to a point outside the replica tree on the same volume.
                 MOVERS - Rename a file or dir from one replica tree to another replica tree on the same volume (currently unsupported).
                 MOVEDIR - Rename a file or dir from one directory to another in the same replica tree.

ContentCmd       Change order content changes from the USN reason: flags set in the NTFS change log describing modifications to the file.
                 Close - Change log close record.
                 Create - File or directory was created.
                 Delete - File or directory was deleted.
                 RenNew - File or directory was renamed.
                 DatOvrWrt - Main file data stream was overwritten.
                 DatExt - Main file data stream was extended.
                 DatTrunc - Main file data stream was truncated.
                 Info - Basic info change (attrib, last write time, etc.).
                 Oid - Object ID change.
                 StreamNam - Alternate data stream name change.
                 StrmOvrWrt - Alternate data stream was overwritten.
                 StrmExt - Alternate data stream was extended.
                 StrmTrunc - Alternate data stream was truncated.
                 EAChg - Extended file attribute was changed.
                 Security - File access permissions changed.
                 IndexableChg - File change requires re-indexing.
                 HLink - Hard link change.
                 CompressChg - File compression attribute changed.
                 EncryptChg - File encryption changed.
                 Reparse - Reparse point changed.

By default, the records are sorted by sequence number but it is possible to sort by other columns.

Sorting by File GUID makes it easy to detect repetitive changes to single files and the times at which they occurred (for example, once per hour).

Sorting by Event Time is useful for identifying older change orders that may indicate a particular partner has been unavailable for a while. The IoLogSum report, coupled with the report that Connstat generates, can tell you what change orders are pending for the delinquent partner.
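The repetitive-change pattern mentioned above (the same file replicated once per hour, for example) is easy to surface once report rows have been parsed. A sketch, assuming each change order has been reduced to a dict with an illustrative ‘fileguid’ field:

```python
from collections import Counter


def repeatedly_changed(change_orders: list, threshold: int = 3) -> list:
    """Return file GUIDs appearing in at least `threshold` change orders.

    Repeated change orders for the same GUID at regular intervals often
    point at a scheduled process (antivirus scan, defragmentation, file
    system policy) rewriting the same files over and over.
    """
    counts = Counter(co["fileguid"] for co in change_orders)
    return sorted(guid for guid, n in counts.items() if n >= threshold)
```

Cross-referencing the event times of the flagged GUIDs against scheduled-task start times usually identifies the culprit.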

7.2. TOPCHK: DFS and SYSVOL Replication Topology Analysis

This command-line tool processes the output generated when the NTFRSUTL.EXE tool is run with the ds option. Using the NTFRSUTL DS output as its data source, Topchk creates a report that summarizes the replication topology of one or more replica sets. This information includes:


- Topology
- Server Reference Table
- Server Host Table
- Server Inbound/Outbound partners
- Summary

The Topchk report is used to verify that configuration information for FRS replica sets has been defined and stored correctly within the Active Directory. It is useful both as a regular health-check tool to confirm that replication is correctly configured, and as a troubleshooting tool for investigating possible causes of FRS replication issues.

Running TOPCHK

TopChk uses the following syntax:

topchk datafile

The Perl interpreter should be installed and in the path for this command to function.

Parameters

datafile    Specifies a file that was created by running NTFRSUTL.EXE with the ds option. This is a required parameter.

Topchk takes the datafile as input, formats the data it contains, and then displays it to the screen. To save the Topchk results to a file, type the following at the command prompt:

topchk datafile > topchk_out.txt

where ‘topchk_out.txt’ is an example filename for the saved Topchk report.

Example

NTFRSUTL DS > ntfrs_ds.txt
topchk ntfrs_ds.txt > top.txt

Analyzing TOPCHK Output

See Section 5.3, “Verify the FRS topology as stored in the Active Directory,” on page 48 for details on analyzing TOPCHK output.

7.3. CONNSTAT: FRS Connection Status

This command-line tool processes the output generated when the NTFRSUTL.EXE tool is run with the sets option. Using the NTFRSUTL SETS output as its data source, CONNSTAT creates a report that summarizes the status of the FRS replication connections to and from a specified computer.

Running CONNSTAT

Connstat uses the following syntax:

connstat [-sort=send] [ -sort=name] [-sort=clean] [-sort=lmt] [-sort=lastjointime] [-sort=lastvvjoin] datafile


The Perl interpreter should be installed and in the path for this command to function.

Parameters

datafile             Specifies a file that was generated by running NTFRSUTL.EXE with the sets option.
-sort=send           Sorts outbound connections in ascending order by the Send Delta column of the Connstat report.
-sort=name           Sorts outbound connections in ascending order by the Server Name column of the Connstat report.
-sort=clean          Sorts outbound connections in ascending order by the Cleanup Delta column of the Connstat report.
-sort=lmt            Sorts outbound connections in ascending order by the LMT (Leading minus Trailing) column of the Connstat report.
-sort=lastjointime   Sorts outbound connections in ascending order by the LastJoinTime column of the Connstat report.
-sort=lastvvjoin     Sorts outbound connections in ascending order by the Last VVJoin column of the Connstat report.

Example

NTFRSUTL SETS > ntfrs_sets.txt
connstat ntfrs_sets.txt
connstat -sort=lastvvjoin ntfrs_sets.txt

Analyzing CONNSTAT output

See section 5.5. “Inspecting the Connection State for SYSVOL and DFS/FRS” on page 55 for details on analyzing this report.


8. FRS Event log messages

This section lists the main Event Viewer events that may be generated by the File Replication Service (FRS). These events are displayed in the File Replication Service event log.

Event ID  Event type  Details

13501  Informational  File Replication Service is starting.

13502  Informational  The File Replication Service is stopping.

13503  Informational  The File Replication Service has stopped.

13504  Error  The File Replication Service stopped without cleaning up.

13505  Error  The File Replication Service has stopped after taking an assertion failure.

13506  Error  The File Replication Service failed a consistency check (%3) in "%1" at line %2. The File Replication Service will restart automatically at a later time. If this problem persists a subsequent entry in this event log describes the recovery procedure. For more information about the automatic restart right click on My Computer and then click on Manage, System Tools, Services, File Replication Service, and Recovery.

13507  Error  The File Replication Service cannot start replica set %1 on computer %2 for directory %3 because the type of volume %4 is not NTFS 5.0 or later. The volume's type can be found by typing "chkdsk %4". The volume can be upgraded to NTFS 5.0 or later by typing "chkntfs /E %4".

13508  Warning  The File Replication Service is having trouble enabling replication from %1 to %2 for %3 using the DNS name %4. FRS will keep retrying. Following are some of the reasons you would see this warning.

[1] FRS can not correctly resolve the DNS name %4 from this computer.
[2] FRS is not running on %4.
[3] The topology information in the Active Directory for this replica has not yet replicated to all the Domain Controllers.

This event log message will appear once for each connection. After the problem is fixed you will see another event log message that indicates that the connection has been established.

13509  Warning  The File Replication Service has enabled replication from %1 to %2 for %3 after repeated retries.

13510  Error  The File Replication Service on the computer %1 cannot communicate with the File Replication Service on the computer %2. Verify that the computer %2 is up and running. Verify that the File Replication Service is running on %2 by typing "net start ntfrs" on %2. Verify that the network is functioning between %1 and %2 by typing "ping %1" on %2 and "ping %2" on %1. If the pings succeed then retry the failed operation. If the pings fail then there may be problems with the DNS server. The DNS server is responsible for mapping computer names to IP addresses. The commands "ipconfig" and "nslookup" help diagnose problems with the DNS server. Typing "ipconfig /all" will list the computer's IP address and the IP address of the computer's DNS servers. Type "ping DNS server's IP address" to verify that a DNS server is available. The DNS mapping for %2 or %1 can be verified by typing "nslookup" and then typing "%2" and then "%1" on %1 and %2. Be sure to check out the DNS server on both %1 and %2; a DNS problem on either computer will prevent proper communication.

Some network problems between %1 and %2 can be cleared up by flushing the DNS Resolver Cache. Type "ipconfig /flushdns". Some network problems between %1 and %2 can be cleared up by renewing the IP address. Type "ipconfig /release" followed by "ipconfig /renew".

Some network problems between %1 and %2 can be cleared up by resetting the computer's DNS entry. Type "net stop NetLogon" followed by "net start NetLogon". Some problems between %1 and %2 can be cleared up by restarting the File Replication Service. Type "net stop ntfrs" followed by "net start ntfrs". Some problems between %1 and %2 can be cleared up by restarting the computers %1 and %2 after closing running applications, especially dcpromo. Click on Start, Shutdown, select Restart, and click on OK. Other network and computer problems are beyond the scope of this event log message.

13511  Error  The File Replication Service is stopping on computer %1 because there is no free space on the volume containing %2. The available space on the volume can be found by typing "dir %2". Once free space is made available on the volume containing %2, the File Replication Service can be restarted immediately by typing "net start ntfrs". Otherwise, the File Replication Service will restart automatically at a later time. For more information about the automatic restart right click on My Computer and then click on Manage, System Tools, Services, File Replication Service, and Recovery. For more information about managing space on a volume type "copy /?", "rename /?", "del /?", "rmdir /?", and "dir /?".

13512  Warning  The File Replication Service has detected an enabled disk write cache on the drive containing the directory %2 on the computer %1. The File Replication Service might not recover when power to the drive is interrupted and critical updates are lost.

13513  Error  The File Replication Service on computer %1 is stopping because the database %2 is corrupted. The database can be recovered by typing "esentutl /d %2 /l %3 /s %4". Once the database has been successfully recovered the File Replication Service can be restarted by typing "net start ntfrs".

13514  Warning  The File Replication Service may be preventing the computer %1 from becoming a domain controller while the system volume is being initialized with data from another domain controller and then shared as SYSVOL. Type "net share" to check for the SYSVOL share. The File Replication Service has stopped preventing the computer from becoming a domain controller once the SYSVOL share appears. The initialization of the system volume can take some time. The time is dependent on the amount of data in the system volume, the availability of other domain controllers, and the replication interval between domain controllers.

13515  Warning  The File Replication Service may be preventing the computer %1 from becoming a domain controller while the system volume is being initialized and then shared as SYSVOL. Type "net share" to check for the SYSVOL share.

13516  Informational  The File Replication Service is no longer preventing the computer %1 from becoming a domain controller. The system volume has been successfully initialized and the Netlogon service has been notified that the system volume is now ready to be shared as SYSVOL. Type "net share" to check for the SYSVOL share.

13517  Warning  The File Replication Service will not check access to the API "%1". Access checks can be enabled for "%1" by running regedt32. Click on Start, Run, and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, Access Checks, "%1", and "%2". Change the string to Enabled. Permissions can be changed by highlighting "%1" and then clicking on the toolbar option Security and then Permissions.

13518  Warning  The File Replication Service did not grant the user "%3" access to the API "%1". Permissions for "%1" can be changed by running regedt32. Click on Start, Run, and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, Access Checks, and highlight "%1". Click on the toolbar option Security and then Permissions. Access checks can be disabled for "%1". Double click on "%2" and change the string to Disabled.

13519  Error  The File Replication Service could not grant an unknown user access to the API "%1". Access checks can be disabled for "%1" by running regedt32. Click on Start, Run, and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, Access Checks, "%1", and "%2". Change the string to Disabled. Permissions can be changed by highlighting "%1" and then clicking on the toolbar option Security and then Permissions.

13520  Warning  The File Replication Service moved the preexisting files in %1 to %2. The File Replication Service may delete the files in %2 at any time. Files can be saved from deletion by copying them out of %2. Copying the files into %1 may lead to name conflicts if the files already exist on some other replicating partner. In some cases, the File Replication Service may copy a file from %2 into %1 instead of replicating the file from some other replicating partner. Space can be recovered at any time by deleting the files in %2.

13521  Error  The File Replication Service cannot enable replication on the computer %1 until a backup/restore application completes. A backup/restore application has set a registry key that prevents the File Replication Service from starting until the registry key is deleted or the system is rebooted. The backup/restore application may still be running. Check with your local administrator before proceeding further. The computer can be rebooted by clicking on Start, Shutdown, and selecting Restart.

WARNING - DELETING THE REGISTRY KEY IS NOT RECOMMENDED!

Applications may fail in unexpected ways. The registry key can be deleted by running regedt32. Click on Start, Run, and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, Backup/Restore, "Stop NtFrs from Starting". On the toolbar, click on Edit and select Delete. Be careful! Deleting a key other than "Stop NtFrs From Starting" can have unexpected side effects.

13522  Warning  The File Replication Service paused because the staging area is full. Replication will resume if staging space becomes available or if the staging space limit is increased. The current value of the staging space limit is %1 KB. To change the staging space limit, run regedt32. Click on Start, Run and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, and the value "Staging Space Limit in KB".

13523  Warning  The File Replication Service paused because the size of a file exceeds the staging space limit. Replication will resume only if the staging space limit is increased. The staging space limit is %1 KB and the file size is %2 KB. To change the staging space limit, run regedt32. Click on Start, Run and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Double click on SYSTEM, CurrentControlSet, Services, NtFrs, Parameters, and the value "Staging Space Limit in KB".


13524  Error  The File Replication Service is stopping on the computer %1 because a universally unique ID (UUID) cannot be created. The SDK function UuidCreate() returned the error "%2". The problem may be the lack of an Ethernet address, token ring address, or network address. The lack of a network address implies an unsupported netcard. The File Replication Service will restart automatically at a later time. For more information about the automatic restart right click on My Computer and then click on Manage, System Tools, Services, File Replication Service, and Recovery.

13525  Warning  The File Replication Service cannot find the DNS name for the computer %1 because the "%2" attribute could not be read from the distinguished name "%3". The File Replication Service will try using the name "%1" until the computer's DNS name appears.

13526  Error  The File Replication Service cannot replicate %1 with the computer %2 because the computer's SID cannot be determined from the distinguished name "%3". The File Replication Service will retry later.

13539  Error  The File Replication Service cannot replicate %1 because the pathname of the replicated directory is not the fully qualified pathname of an existing, accessible local directory.

13540  Error  The File Replication Service cannot replicate %1 because the pathname of the customer designated staging directory %2 is not the fully qualified pathname of an existing, accessible local directory.

13541  Error  The File Replication Service cannot replicate %1 because it overlaps the File Replication Service's logging pathname %2.

13542  Error  The File Replication Service cannot replicate %1 because it overlaps the File Replication Service's working directory %2.

13548  Error  The File Replication Service is unable to replicate with its partner computer because the difference in clock times is outside the range of plus or minus %1 minutes. The connection to the partner computer is: "%2" The detected time difference is: %3 minutes.

Note: If this time difference is close to a multiple of 60 minutes then it is likely that either this computer or its partner computer was set to the incorrect time zone when the computer time was initially set. Check that the time zone and the system time are correctly set on both computers. If necessary, the default value used to test for computer time consistency may be changed in the registry on this computer. (Note: This is not recommended.)

To change this parameter, run regedt32. Click on Start, Run and type regedt32. Click on the window entitled HKEY_LOCAL_MACHINE. Click down the key path "System\CurrentControlSet\Services\NtFrs\Parameters". Double click on the value name "Partner Clock Skew In Minutes" and update the value. If the value name is not present you may add it with the Add Value function under the Edit menu item. Type the value name exactly as shown above using the registry data type REG_DWORD.

13549  Error  The File Replication Service is unable to replicate from a partner computer because the event time associated with the file to be replicated is too far into the future. It is %1 minutes greater than the current time. This can happen if the system time on the partner computer was set incorrectly when the file was created or updated. To preserve the integrity of the replica set this file update will not be performed or propagated further. The file name is: "%2" The connection to the partner computer is: "%3" Note: If this time difference is close to a multiple of 60 minutes then it is likely that this file may have been created or updated on the partner computer while the computer was set to the incorrect time zone when its computer time was initially set. Check that the time zone and time are correctly set on the partner computer.

13555  Error  The File Replication Service is in an error state. Files will not replicate to or from one or all of the replica sets on this computer until the following recovery steps are performed: Recovery Steps: [1] The error state may clear itself

13559  Error  The File Replication Service has detected that the replica root path has changed from "%2" to "%3". If this is an intentional move then a file with the name NTFRS_CMD_FILE_MOVE_ROOT needs to be created under the new root path. This was detected for the following replica set: "%1" Changing the replica root path is a two step process which is triggered by the creation of the NTFRS_CMD_FILE_MOVE_ROOT file. [1] At the first poll which will occur in %4 minutes this computer will be deleted from the replica set. [2] At the poll following the deletion this computer will be re-added to the replica set with the new root path. This re-addition will trigger a full tree sync for the replica set. At the end of the sync all the files will be at the new location. The files may or may not be deleted from the old location depending on whether they are needed or not.

13561  Error  The File Replication Service has detected that the replica set "%1" is in JRNL_WRAP_ERROR. Replica set name is: "%1" Replica root path is: "%2" Replica root volume is: "%3" A replica set hits JRNL_WRAP_ERROR when the record that it is trying to read from the NTFS USN journal is not found. This can occur because of one of the following reasons.

[1] Volume "%3" has been formatted.

[2] The NTFS USN journal on volume "%3" has been deleted.

[3] The NTFS USN journal on volume "%3" has been truncated. Chkdsk can truncate the journal if it finds corrupt entries at the end of the journal.

[4] File Replication Service was not running on this computer for a long time.

[5] File Replication Service could not keep up with the rate of Disk IO activity on "%3".

The following recovery steps will be taken to automatically recover from this error state.

[1] At the first poll which will occur in %4 minutes this computer will be deleted from the replica set.

[2] At the poll following the deletion this computer will be re-added to the replica set. The re-addition will trigger a full tree sync for the replica set.

13562  Warning  Following is the summary of warnings and errors encountered by the File Replication Service while polling the Domain Controller %1 for FRS replica set configuration information. %2 (Look for event ID 13560 to indicate that the error is automatically recovered.)

13563  Warning  The File Replication Service has detected that the staging path for the replica set %1 has changed. Current staging path = %2 New staging path = %3 The service will start using the new staging path after it restarts.

13564  Warning  The File Replication Service has detected that the volume holding the FRS debug logs is running out of disk space. This will not affect replication unless this volume hosts database, staging, or replica root paths as well. Path to the logs directory = %1 You can change the number and size of logs by adjusting the following registry values under the key "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters". Sample values are shown below.

Debug Log Files             REG_DWORD  0x5
Debug Log Severity          REG_DWORD  0x2
Debug Maximum Log Messages  REG_DWORD  0x2710

You can also change the path to the logs directory by changing the following value at the same location.

Debug Log File  REG_SZ  %windir%\debug

Changes to the registry values will take effect at the next polling cycle.

13565  Warning  File Replication Service is initializing the system volume with data from another domain controller. Computer %1 cannot become a domain controller until this process is complete. The system volume will then be shared as SYSVOL. To check for the SYSVOL share, at the command prompt, type net share.

When File Replication Service completes the initialization process, the SYSVOL share will appear. The initialization of the system volume can take some time. The time is dependent on the amount of data in the system volume, the availability of other domain controllers, and the replication interval between domain controllers.

13566 Warning File Replication Service is scanning the data in the system volume. Computer %1 cannot become a domain controller until this process is complete. The system volume will then be shared as SYSVOL. To check for the SYSVOL share, at the command prompt, type net share.

When File Replication Service completes the scanning process, the SYSVOL share will appear. The initialization of the system volume can take some time. The time is dependent on the amount of data in the system volume, the availability of other domain controllers, and the replication interval between domain controllers.

13567 Warning File Replication Service has detected and suppressed an average of %1 or more file updates every hour for the last %2 hours because the updates did not change the contents of the file. The tracking records in FRS debug logs will have the filename and event time for the suppressed updates. The tracking records have the date and time followed by :T: as their prefix. Updates that do not change the content of the file are suppressed to prevent unnecessary replication traffic. Following are common examples of updates that do not change the contents of the file.

[1] Overwriting a file with a copy of the same file.
[2] Setting the same ACLs on a file multiple times.
[3] Restoring an identical copy of the file over an existing one.

Suppression of updates can be disabled by running regedit. Click on Start, Run and type regedit. Expand HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters and create or update the value "Suppress Identical Updates To Files", setting it to 0 (the default is 1) to force identical updates to replicate.
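Equivalently, the same change could be sketched as a .reg fragment and imported, rather than edited by hand in regedit:

```reg
Windows Registry Editor Version 5.00

; Disable suppression of identical updates (default is 1).
; Setting this to 0 forces identical updates to replicate, which can
; increase replication traffic; use it only while diagnosing.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters]
"Suppress Identical Updates To Files"=dword:00000000
```

Restore the value to 1 once the investigation is complete, since suppression exists precisely to avoid unnecessary replication traffic.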

13568 Error The File Replication Service has detected that the replica set "%1" is in JRNL_WRAP_ERROR.

Replica set name is: "%1"
Replica root path is: "%2"
Replica root volume is: "%3"

A replica set hits JRNL_WRAP_ERROR when the record that it is trying to read from the NTFS USN journal is not found. This can occur because of one of the following reasons.

[1] Volume "%3" has been formatted.
[2] The NTFS USN journal on volume "%3" has been deleted.
[3] The NTFS USN journal on volume "%3" has been truncated. Chkdsk can truncate the journal if it finds corrupt entries at the end of the journal.


[4] File Replication Service was not running on this computer for a long time.
[5] File Replication Service could not keep up with the rate of disk IO activity on "%3".

Setting the "Enable Journal Wrap Automatic Restore" registry parameter to 1 will cause the following recovery steps to be taken to automatically recover from this error state.

[1] At the first poll, which will occur in %4 minutes, this computer will be deleted from the replica set. If you do not want to wait %4 minutes, then run "net stop ntfrs" followed by "net start ntfrs" to restart the File Replication Service.
[2] At the poll following the deletion this computer will be re-added to the replica set. The re-addition will trigger a full tree sync for the replica set.

WARNING: During the recovery process data in the replica tree may be unavailable. You should reset the registry parameter described above to 0 to prevent automatic recovery from making the data unexpectedly unavailable if this error condition occurs again.

To change this registry parameter, run regedit. Click on Start, Run and type regedit. Expand HKEY_LOCAL_MACHINE and click down the key path "System\CurrentControlSet\Services\NtFrs\Parameters". Double-click the value name "Enable Journal Wrap Automatic Restore" and update the value.

If the value name is not present you may add it with the Add Value function under the Edit menu item. Type the value name exactly as shown above, using the registry data type REG_DWORD.
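The regedit steps above could also be sketched as a .reg fragment, which creates the REG_DWORD value if it does not already exist:

```reg
Windows Registry Editor Version 5.00

; Enable automatic recovery from JRNL_WRAP_ERROR (event 13568).
; Per the warning in the message text, reset this value to 0 after
; recovery completes so a future journal wrap does not silently make
; the replica tree unavailable again.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters]
"Enable Journal Wrap Automatic Restore"=dword:00000001
```

After importing, either wait for the next poll or run "net stop ntfrs" followed by "net start ntfrs" to trigger the recovery immediately, as the message text describes.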

END
