

Virtual memory constraints in 32-bit Windows: an Update

Mark B. Friedman
Demand Technology

1020 Eighth Avenue South, Suite 6
Naples, FL USA 34102

[email protected]

Abstract. This paper discusses the signs that indicate a machine is suffering from a virtual memory constraint in 32-bit Windows. Machines configured with 2 GB or more of RAM installed are particularly vulnerable to this condition. It also discusses options to keep this from happening, including (1) changing the way 32-bit virtual address spaces are partitioned into private and shared ranges, (2) settings that govern the size of system memory pools, (3) hardware that supports 37-bit addressing, and (4) hardware that supports 64-bit addressing but can still run 32-bit applications in compatibility mode. Ultimately, the option of running Windows on 64-bit processors is the safest and surest way to relieve the virtual memory constraints associated with 32-bit Windows.

Introduction

The Microsoft Windows Server 2003 operating system creates a separate and independent virtual address space for each individual process that is launched. On 32-bit processors, each process virtual address space can be as large as 4 GB. (On 64-bit processors, process virtual address spaces can be as large as 8 TB in current versions of Windows Server 2003.) There are many server workloads that can easily exhaust the 32-bit virtual address space available under Windows Server 2003. Machines configured with 2 GB or more of RAM installed appear to be particularly vulnerable to these virtual memory constraints for reasons that are discussed below. When Windows Server 2003 workloads exhaust their 32-bit virtual address space, the consequences are usually catastrophic. This paper discusses the signs and symptoms that indicate there is a serious virtual memory constraint.

This paper also discusses the features and options that system administrators can employ to forestall running short of virtual memory. The Windows Server 2003 operating system offers several forms of relief from the virtual memory constraints that arise on 32-bit machines. These include (1) options to change the manner in which 32-bit process virtual address spaces are partitioned into private addresses and shared system addresses, (2) settings that govern the size of key system memory pools, (3) hardware options that permit 37-bit addressing, and (4) hardware options that support 64-bit addressing. By selecting among these options, system administrators can avoid many situations where virtual memory constraints impact system availability and performance. Since these virtual memory addressing constraints arise inevitably as the size of RAM grows, the most effective way to deal with these constraints in the long run is to move to processors that can access 64-bit virtual addresses, running the 64-bit version of Windows Server 2003.

Virtual addressing

Virtual memory is a feature supported by most advanced processors. Hardware support for virtual memory includes an address translation mechanism to map logical (i.e., virtual) memory addresses that application programs reference to physical (or real) memory hardware addresses. Virtual address translation is then performed transparently by the processor hardware during program execution. Only authorized operating system functions are capable of addressing physical memory locations directly. Figure 1 illustrates the hardware virtual address translation mechanism for Intel IA-32 processors. The hardware splits each 32-bit address reference into three segments. The high order ten bits of a virtual address are used to point to a specific Page Directory entry. The Page Directory entry points to a page of Page Table entries (PTEs), which is indexed using the middle ten bits of the virtual address to locate a specific four-byte PTE. The PTE that is cross-indexed in this fashion contains the high order 20 bits of the physical memory page, which are then merged during translation with the low order 12 bits (enough to address 0-4095 bytes) of the virtual address to create the physical memory address. The operating system is responsible for building and maintaining the Page Directory and associated PTEs in the proper hardware-specified format on behalf of each process address space that is created. The OS also ensures that, whenever a new program execution thread is dispatched, the processor's Control Register 3 is loaded with a pointer to the origin of the Page Directory for the process address space context that the running thread is associated with.
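The 10/10/12 field split described above can be sketched in a few lines of code. This is only an illustration of the bit arithmetic; the sample address is invented, and the real translation is, of course, performed by the processor hardware.

```python
def split_ia32_address(va):
    """Split a 32-bit virtual address into its IA-32 translation fields."""
    pde_index = (va >> 22) & 0x3FF   # high order 10 bits: Page Directory entry
    pte_index = (va >> 12) & 0x3FF   # middle 10 bits: index into the Page Table
    offset    = va & 0xFFF           # low order 12 bits: byte offset within the page
    return pde_index, pte_index, offset

# Decode a hypothetical user-mode address near the top of the private range.
pde, pte, off = split_ia32_address(0x7FFE1234)
print(pde, pte, off)   # -> 511 993 564
```

Note that each Page Directory or Page Table page holds 1024 four-byte entries, which is exactly why each index field is ten bits wide.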

To implement virtual address translation, when an executable program's image file is first loaded into memory, the logical memory address range of the application is divided into fixed-size chunks called pages. The operating system builds a PTE for each valid page of virtual memory that is loaded and continues to build PTEs for additional pages that a process allocates (for data structures and other working storage, etc.) as it executes. The PTE supplies the mapping information for those virtual memory pages that reside in physical memory. This mapping is dynamic, so logical addresses that are frequently referenced tend to reside in physical memory, while infrequently referenced pages are relegated to paging files on secondary disk storage. The active subset of virtual memory pages associated with a single process address space that are currently resident in RAM is known as the process's working set.

FIGURE 1. VIRTUAL MEMORY ADDRESS TRANSLATION ON 32-BIT INTEL-COMPATIBLE PROCESSORS.

The low order bit of a PTE, known as the Present bit, is set to reflect the status of a virtual memory page in physical memory. If the Present bit is not set, the operating system stores information in the PTE that points to the page's location on the paging file. A reference to an invalid page by an executing thread causes the hardware to report an addressing exception or page fault. The operating system then intercedes to allocate a free page in RAM, copy the contents of the page from disk to that physical memory location, update the PTE appropriately, and re-execute the instruction that originally failed. This page fault resolution process can have profound performance implications if performed frequently, due to the relatively slow speed of mechanical disks compared to the electronic speeds that prevail for accessing physical memory locations. Readers interested in a fuller understanding of this common hardware mechanism should consult [1].
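The Present-bit check above can be illustrated with a small sketch. The field layout (20-bit frame number in the high order bits, Present bit in bit 0) follows the description in the text; the PTE values below are invented, and a real fault handler does far more than raise an error.

```python
PAGE_SIZE = 4096

def resolve(pte, offset):
    """Return a physical address if the page is resident, else signal a page fault."""
    if not (pte & 0x1):              # Present bit clear: page is not in RAM
        raise LookupError("page fault: OS must bring the page in from disk")
    frame = pte >> 12                # high order 20 bits: physical page frame number
    return frame * PAGE_SIZE + offset

# A resident page: frame 0x3A, Present bit set.
phys = resolve(0x0003A001, 0x10)
print(hex(phys))   # -> 0x3a010
```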

Most performance-oriented treatises on virtual memory systems dissect the problems that can arise when physical memory is over-committed and excessive paging to disk occurs. Virtual memory systems usually work well because executing programs seldom require all their pages to be resident in physical memory concurrently in order to run. With virtual memory, only the active pages associated with a program's current working set remain resident in physical memory. On the other hand, virtual memory systems can run very poorly when the working sets of active processes greatly exceed the amount of RAM that the computer contains. A physical memory constraint is then transformed into an I/O bottleneck due to excessive amounts of activity to the paging disk (or disks).

Performance and capacity problems associated with virtual memory architectural constraints tend to receive far less scrutiny. These problems arise out of a hardware limitation, namely, the number of bits available to construct a memory address. The number of bits associated with a hardware memory address determines the memory addressing range. In the case of the 32-bit Intel-compatible processors that run Microsoft Windows, address registers are 32 bits wide, allowing for addressability of 0-4,294,967,295 bytes, which is conventionally denoted as a 4 GB range. This 4 GB range can be an architectural constraint, especially with workloads that need 2 GB or more of RAM to perform well. The suggestion that machines with 2 GB or more of RAM installed are most susceptible to addressability constraints is based on the observation that smaller machines are more prone to encounter paging performance problems before the architectural constraints discussed here are manifest.

Virtual memory constraints are manifest during periods of transition from one processor architecture to another. In the evolution of the Intel x86 processor, there was a period of transition from the 16-bit addressing that was a feature of the original 8086 and 8088 processors that launched the PC revolution, to the 24-bit segmented addressing mode of the 80286, to the current flat 32-bit addressing model that is implemented across all Intel IA-32 processors. Currently, the IA-32 architecture is in a state of transition, with two flavors of machines capable of 64-bit addressing starting to become widely available. As 64-bit machines become more commonplace, they will definitively relieve the virtual memory constraints that are evident today on the 32-bit platform.

Physical Address Extension. Functioning as a stop-gap solution on the road to full 64-bit memory addressing, Intel currently provides IA-32 machines with a Physical Address Extension (PAE) feature that supports a wider 37-bit address. PAE provides the capability to address up to 128 GB of physical RAM. Virtual addresses remain 32 bits when PAE is active, but PAE requires 8-byte PTEs and a novel three-level Page Directory structure, illustrated in Figure 2, for translating 32-bit virtual addresses into 37-bit physical addresses.
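The arithmetic behind the 128 GB figure, and the way the same 32-bit virtual address is re-divided under PAE, can be sketched as follows. The 2/9/9/12 field widths shown here follow the standard PAE page-table layout (a 2-bit index into the Page Directory Pointer Table, then 9-bit directory and table indexes against 8-byte entries); they are an assumption drawn from the architecture, not stated explicitly in the text above.

```python
def split_pae_address(va):
    """Split a 32-bit virtual address into its three-level PAE translation fields."""
    pdpt_index = (va >> 30) & 0x3    # top 2 bits: one of 4 Page Directories
    pde_index  = (va >> 21) & 0x1FF  # next 9 bits: Page Directory entry
    pte_index  = (va >> 12) & 0x1FF  # next 9 bits: Page Table entry (8-byte PTEs)
    offset     = va & 0xFFF          # low 12 bits: byte offset within the page
    return pdpt_index, pde_index, pte_index, offset

# 37 physical address bits yield 2**37 bytes, i.e. 128 GB of addressable RAM.
print(2 ** 37 // 2 ** 30)   # -> 128
```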

When a machine is booted in PAE mode, the Windows Server 2003 operating system builds PTEs and Page Directories in the proper format to support extended addressing. Individual process address spaces are still limited to 4 GB in size when Windows Server 2003 runs in PAE mode, so utilizing physical memory above 4 GB can be challenging, as discussed in more detail below. Because 64-bit PAE PTEs consume twice as much space as conventional 32-bit PTEs, enabling PAE mode should be restricted to those configurations where more than 4 GB of RAM is installed and PAE mode is required to address the physical RAM above the 4 GB line.

PAE is an interim fix, a pit stop on the road towards full 64-bit addressing in Intel hardware. As a stop-gap solution, it has an element of déjà vu about it. It bears an uncanny resemblance to interim actions taken by other processor manufacturers for families of machines that similarly evolved from 16- or 24-bit machines to 32-bit addresses, and ultimately to today's 64-bit machines. In the IBM mainframe world, there was a prolonged focus on Virtual Storage Constraint Relief (VSCR) for its popular 24-bit OS/360 hardware and software in almost every subsequent release of new hardware and software that IBM produced from 1980 to the present day. The popular book The Soul of a New Machine [2] chronicled the development of Data General's 32-bit address machines to keep pace with Digital Equipment Corporation's 32-bit Virtual Address Extension (VAX) of its original 16-bit minicomputer architecture. Even though today's hardware and software engineers are hardly ignorant of the relevant past history, they are still condemned to repeat the cycle of delivering stopgap solutions that provide a modicum of virtual memory constraint relief until the next major architectural step forward is taken.

Process virtual address spaces. The Windows operating system constructs a separate virtual memory address space on behalf of each running process, potentially addressing up to 4 GB of virtual memory on 32-bit machines. Each 32-bit process virtual address space is divided into two equal parts, as depicted in Figure 3. The lower 2 GB of each process address space consists of private addresses associated with that specific process only. This 2 GB range of addresses refers to pages that can only be accessed by threads running in that process address space context. Each per-process virtual address space can range from address 0x0001 0000 to address 0x7fff ffff, potentially spanning 2 GB. (The first 64 KB of addresses are protected from being accessed; it is possible to trap many common programming errors that way.) Each process gets its own unique set of user addresses in this range. Furthermore, no thread running in one process can access virtual memory addresses in the private range that is associated with another process. As it executes, a process allocates virtual memory from its user address space for code and data structures of all types.

FIGURE 2. VIRTUAL ADDRESS TRANSLATION IN PAE MODE.

The 2 GB upper limit on the size of a user-mode process virtual address space can be a constraint for a variety of application programs. How to recognize the processes that are affected and what actions can be taken to relieve their capacity constraints is the focus of this discussion. One initial complication is that it is seldom possible for a process to allocate its complete virtual memory range. This is due to fragmentation that occurs because virtual memory is allocated and de-allocated in non-uniform chunks. Because of this fragmentation, a virtual memory allocation request by a process can fail before the 2 GB limit on addressability is completely exhausted.
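The fragmentation effect described above can be shown with a toy calculation: the total free virtual memory in an address space can comfortably exceed a request, yet the allocation still fails because no single contiguous hole is large enough. The hole sizes below are invented for illustration.

```python
# Free regions (in MB) left in a process address space after repeated
# allocation and de-allocation in non-uniform chunks.
free_holes_mb = [300, 250, 180, 120]

request_mb = 400
total_free = sum(free_holes_mb)    # 850 MB free in total
largest_hole = max(free_holes_mb)  # but only 300 MB is contiguous

# A contiguous allocation needs a single hole of sufficient size.
can_allocate = largest_hole >= request_mb
print(total_free, largest_hole, can_allocate)   # -> 850 300 False
```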

A process that allocates virtual memory but neglects to free it suffers from a memory leak, a program bug that will eventually exhaust the supply of virtual memory available to the process. Application programs with virtual memory leaks will not be discussed any further here; please refer to an earlier version of this paper for a full account of their detection and diagnosis [4].

Since the operating system builds a unique address space for every process, Figure 4 is perhaps a better picture of what User virtual address spaces look like. Notice that the System portion of each process address space is identical. One set of System Page Table Entries (PTEs) maps the System portion of the virtual address space for every process. Because System addresses are common to all processes, they offer a convenient way for processes to communicate with each other, when necessary.

Shared system addresses

The upper half of each per-process address space, in the range of 0x8000 0000 to 0xffff ffff, consists of system addresses common to all virtual address spaces. All running processes have access to the same set of addresses in the system range. This feat is accomplished by combining the system's page tables with each unique per-process set of page tables. Commonly addressable system virtual memory locations play an important role in various forms of interprocess communication, or IPC. Win32 API functions can be used to allocate portions of commonly addressable system areas to share data between two or more distinct processes. For example, the mechanism Windows Server 2003 uses to allow multiple process address spaces to access common runtime modules known as Dynamically Linked Libraries (DLLs) utilizes this form of shared memory addressing. (DLLs are library modules that contain subroutines and functions which can be called dynamically at run-time, instead of being linked statically to the programs that utilize them.)

User mode threads running inside a process cannot directly address memory locations in the system range because system virtual addresses are allocated using Privileged mode. This restricts memory access to kernel threads that run in Privileged mode, a form of security that limits access to system memory to authorized kernel threads. When an application execution thread calls a system function, it transfers control to an associated kernel mode thread, and, in the process, routinely changes the execution state from User mode to Privileged mode. It is in this fashion that an application thread gains access to system virtual memory addresses.

System virtual memory constraints

It is entirely possible for virtual memory in the system range to be exhausted. It is also possible for one of the four major system memory pools to reach its maximum allocation level before the full range of system virtual memory addresses is exhausted. There are two sets of circumstances in 32-bit Windows where concerns about running out of system virtual memory are heightened. These are (1) Terminal Server machines intended to support large numbers of User mode processes and threads, and (2) machines booted with the /3GB option, which extends the User private area to 3 GB but shrinks the system area down to 1 GB.

FIGURE 3. THE 4 GB ADDRESS SPACE LAYOUT USED IN 32-BIT INTEL-COMPATIBLE MACHINES.

FIGURE 4. USER PROCESSES SHARE THE SYSTEM PORTION OF THE 4 GB VIRTUAL ADDRESS SPACE.

The shared system address range is used mainlyto store data areas for the following four componentsof the operating system:

• The system file cache, which holds segments of recently accessed file system objects. The initial size of the file cache in 32-bit Windows Server 2003 is 512 MB, but it can extend to about 960 MB, all of which is allocated on demand.

• System PTEs, which are necessary to map system pages into physical memory. System PTEs are used to map any virtual memory pages allocated in the system area. The main consumers of PTEs are I/O buffers, including those used by high-resolution video cards. The stacks for kernel-mode threads are also allocated from the System PTE pool. The OS earmarks about 660 MB for the pool of System PTEs, but if virtual memory is available, the pool maximum can be extended to 900 MB or more.

• The Nonpaged pool, where data structures accessed by kernel and driver functions that must always be resident in physical memory are stored. Nonpaged pool structures include TCP/IP session-oriented connection data, I/O buffers of all kinds, and kernel stacks (i.e., working storage) for kernel-mode threads. By default, the initial maximum size of the Nonpaged pool that the OS calculates is 256 MB.

• The Paged pool, where data structures that can be paged out are stored. The initial maximum size of the Paged pool that the OS calculates is normally 512 MB, but if virtual memory is available, the Paged pool maximum can be extended to 650 MB or more.

Operating system and driver resident code and the system's Hyperspace, a work area used by the operating system's Memory Manager, also occupy some amount of system area virtual memory, but these are usually more limited in size.

It is a mistake to think of these four major system pools as being essentially static in size. Adding up the potential sizes of these four major system memory pools, it is readily apparent that the system's virtual memory requirements would exceed 2 GB, assuming all the virtual memory earmarked for these pools could actually be allocated. At initialization, the operating system calculates preset allocation targets for the sizes of these shared data areas in the system range, with room left over for several designated overflow areas. These initial allocations are preliminary. Over time, as system virtual memory allocations occur, the OS manages its remaining free memory locations dynamically, servicing memory allocation requests that point to the Nonpaged or Paged pool on a first-come, first-served basis until all free memory in the system range is ultimately allocated. Thus, by design, running out of virtual memory in the system range is a distinct possibility.
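Using the maximum pool sizes quoted in the bullet list above, a quick sum confirms the point: granting every pool its full earmark would overrun the 2 GB system range, which is why the OS hands out free system virtual memory first-come, first-served instead.

```python
# Maximum pool sizes (in MB) as quoted in the text above.
pool_max_mb = {
    "system file cache": 960,
    "system PTEs": 900,
    "nonpaged pool": 256,
    "paged pool": 650,
}

total_mb = sum(pool_max_mb.values())
system_range_mb = 2048              # the default 2 GB system range

print(total_mb, total_mb > system_range_mb)   # -> 2766 True
```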

When operating system functions exhaust the amount of virtual memory available for System PTEs or for data areas in either the Nonpaged or Paged pools, the results are usually catastrophic. When there are no free System PTEs available, the OS will refuse to allocate additional virtual memory within the system range. If either the Nonpaged or Paged pool is exhausted, system functions that need working storage cannot allocate virtual memory. System functions that fail due to a shortage of virtual memory usually result in a system crash - the dreaded blue screen of death in Windows. In contrast, note that a shortage of virtual memory for the file cache is seldom catastrophic, although such a shortage could manifest itself as a severe performance problem.

In the case of extreme virtual memory shortages in the system range, the fatal memory allocation failure is apt to appear randomly distributed across any of the stressed data areas, which complicates identification and diagnosis of the problem. Troubleshooting the problem will normally require viewing a crash dump, where the evidence of a virtual memory constraint is usually clear and unambiguous.

There are also several tuning parameters available to offer "hints" to the operating system in order to handle situations where one specific pool tends to run out of space before the others do. The operating system provides performance counters that can assist you in setting these configuration parameters. The associated performance counters that track virtual memory usage in the system range are discussed below.

Terminal Server. Because they require system area storage to support large numbers of User mode processes and threads, large Terminal Server machines are among the workloads most likely to encounter a shortage of system area virtual memory. Each desktop process that Terminal Server executes requires a corresponding kernel mode thread, so the virtual memory area devoted to the pool of System PTEs can become depleted. Another shared system memory construct is the session space, where data structures shared by desktop processes are stored. Session space is usually of modest size on a typical server. However, Terminal Server machines, where multiple sessions are supported, are a notable exception. In Terminal Server, it is necessary to store global data structures shared by all the processes in the session in the system area. Session space global data structures include an individual copy of the Win32k.sys kernel mode Windows driver, the session desktop, windows, security tokens, and other objects unique to each User session. Terminal Server machines may also rely on the effectiveness of the system file cache to meet performance requirements.

/3GB boot option. When you use the /3GB boot option to extend the size of the User private address range to 3 GB, the system area is shrunk in half to 1 GB. This is the other scenario that most frequently leads to virtual memory shortages in the system area.
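The effect of /3GB on the address space layout can be sketched as a simple boundary test. Under the default 2 GB / 2 GB split the user/system boundary sits at 0x8000 0000; with /3GB it moves up to 0xC000 0000. The helper below is an illustration of the partitioning, not any Windows API.

```python
def region(va, three_gb=False):
    """Classify a 32-bit virtual address as user or system space."""
    boundary = 0xC0000000 if three_gb else 0x80000000
    return "user" if va < boundary else "system"

# The same address falls on different sides of the boundary
# depending on how the machine was booted.
print(region(0xA4000000))                 # system under the default split
print(region(0xA4000000, three_gb=True))  # user once /3GB is in effect
```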

Virtual memory Commit Limit

The Commit Limit is an upper limit on the total number of virtual memory pages the operating system will allocate. When the system is up against its Commit Limit, no more requests for virtual memory can be honored. This is usually catastrophic for the process that encounters the limitation. Fortunately, this is a straightforward problem that is easy to monitor and easy to anticipate - up to a point. Servers with only 1-2 GB of RAM installed are liable to run up against the virtual memory Commit Limit before the virtual memory constraints imposed by the 32-bit architecture are manifest. Programs with memory leaks - a bug in which a program allocates virtual memory continuously but fails to release it subsequently - are also apt to be caught by the Commit Limit.

The operating system builds page tables on behalf of each process that is created. A process's page tables are built on demand as virtual memory locations are allocated, potentially mapping the entire process virtual address space range. The Win32 VirtualAlloc API call provides both for reserving contiguous virtual address ranges and for committing specific virtual memory addresses. Reserving virtual memory does not trigger building page table entries, because you are not yet using the virtual memory address range to store data. Reserving a range of virtual memory addresses is something your application might want to do in advance for a data file intended to be mapped into virtual storage. Only later, when the file is being accessed, are those virtual memory pages actually allocated (or committed).

In contrast, committing virtual memory addresses causes the Virtual Memory Manager to fill out a page table entry (PTE) to map the address into RAM. Alternatively, a PTE contains the address of the page on one of the paging files that are defined to allow virtual memory pages to spill over onto disk. Any unreserved and unallocated process virtual memory addresses are considered free.

Commit Limit

The Commit Limit is the upper limit on the total number of page table entries (PTEs) the operating system will build on behalf of all running processes. The virtual memory Commit Limit prevents the system from building a page table entry (PTE) for a virtual memory page that cannot fit somewhere in either RAM or the paging files.

The Commit Limit is the sum of the amount of physical memory plus the allotted space on the paging files. When the Commit Limit is reached, it is no longer possible for a process to allocate virtual memory. Programs making routine calls to VirtualAlloc to allocate memory will fail.
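The Commit Limit formula above is simple enough to state as arithmetic. The configuration below (RAM size, number and size of paging files) is hypothetical.

```python
# Commit Limit = physical RAM + allotted paging file space.
ram_gb = 2
paging_files_gb = [4, 2]   # e.g., two paging files on separate logical disks

commit_limit_gb = ram_gb + sum(paging_files_gb)
print(commit_limit_gb)     # -> 8
```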

Paging file extension

Before the Commit Limit is reached, Windows Server 2003 will alert you to the possibility that virtual memory may soon be exhausted. Whenever a paging file becomes 90% full, a distinctive warning message is issued to the Console. A System Event log message with an ID of 26, illustrated in Figure 5, is also generated to document the condition.

Following the instructions in the message directs you to the Virtual Memory control (see Figure 6) from the Advanced tab of the System applet in the Control Panel, where additional paging files can be defined or the existing paging files can be extended (assuming disk space is available and the paging file does not already exceed 4 GB).

Windows Server 2003 creates an initial paging file automatically when the operating system is first installed. The default paging file is built on the same logical drive where the OS is installed. The initial paging file is built with a minimum allocation equal to 1.5 times the amount of physical memory. It is defined by default so that it can extend to approximately two times the initial allocation.
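The default sizing rule just described works out as follows for a hypothetical 2 GB machine; the exact figures on a real system may differ slightly, since the OS rounds its allocations.

```python
# Default paging file sizing: initial = 1.5 x RAM, maximum ~ 2 x initial.
ram_mb = 2048

initial_mb = int(1.5 * ram_mb)   # minimum allocation at install time
maximum_mb = 2 * initial_mb      # default extension ceiling

print(initial_mb, maximum_mb)    # -> 3072 6144
```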

The Virtual Memory applet illustrated in Figure 6 allows you to set initial and maximum values that define a range of allocated paging file space on disk for each paging file created. When the system appears to be running out of virtual memory, the Memory Manager will automatically extend a paging file that is running out of space, has a range defined, and is not currently at its maximum allocation value. This extension, of course, is also subject to space being available on the specified logical disk. The automatic extension of the paging file increases the amount of virtual memory available for allocation requests. This extension of the Commit Limit may be necessary to keep the system from crashing.

FIGURE 5. OUT OF VIRTUAL MEMORY EVENT LOG ERROR MESSAGE.

It is possible, but by no means certain, that extending the paging file automatically may have some performance impact. When the paging file allocation extends, it no longer occupies a contiguous set of disk sectors. Because the extension fragments the paging file, I/O operations to disk may suffer from longer seek times. On balance, this potential performance degradation is far outweighed by availability considerations. Without the paging file extension, the system is vulnerable to running out of virtual memory entirely and crashing.

Note that a fragmented paging file is not necessarily always a serious performance liability. Because your paging files coexist on physical disks with other application data files, some disk seek arm movement back and forth between the paging file and application data files is unavoidable. Having a big chunk of the paging file surrounded by application data files may actually reduce overall average seek distances on your paging file disk. [3]

Monitoring virtual memory usage

Performance counters that track overall virtual memory usage are available, as well as ones that allow you to drill down to see the amount of virtual memory allocated by process. These counters make it possible to identify a process that is leaking virtual memory or identify those machines that need more RAM to perform well. Unfortunately, a number of usage issues arise when using these counters to assist in identifying 32-bit server machines encountering virtual memory constraints. Critical gaps in the range of performance monitoring measurement data available make it difficult to anticipate all the varieties of virtual memory shortages that can occur.

One serious problem is that while current allocation levels of the major system pools can be monitored using performance monitoring tools, the maximum sizes of these pools are only available using debugging tools. Another potential problem is that system level measurements of virtual memory allocations cannot always be reconciled with the process level measurements. A final concern is that the performance counter that tracks the amount of space available for system PTEs provides a measurement that is inconsistent with the other system virtual memory usage counters, which are all reported in bytes.

Monitoring Committed Bytes

The Memory\Committed Bytes and Memory\Commit Limit counters provide a convenient way to determine if your system is running up against the Commit Limit, as illustrated in Figure 7. It shows a 32-bit Windows Server 2003 machine with 4 GB of RAM predominantly configured to run Terminal Server User sessions. As Users activate their sessions in the morning, the number of Committed Bytes allocated increases steadily. Committed Bytes starts to level off after 10:00 AM, at which time there is still plenty of room to accommodate additional virtual memory allocations. The number of Committed Bytes never comes close to exhausting the Commit Limit, which for this system is about 12 GB. The large gap between the upward trending Committed Bytes shown in the foreground and the static Commit Limit behind it is conveniently viewed as capacity headroom for virtual memory growth that should permit this workload to expand.

FIGURE 6. THE APPLET FOR CONFIGURING THE LOCATION AND SIZE OF THE PAGING FILES.

FIGURE 7. MONITORING COMMITTED BYTES.

The situation illustrated in Figure 7 is typical of a server configured with a combination of ample RAM and paging file space where virtual memory growth is not likely to be constrained by the virtual memory Commit Limit. Ironically, because their overall virtual memory growth is unconstrained by the Commit Limit, these are precisely the same machines that can be subject to fatal virtual memory shortages when one of the system memory areas within the 2 GB range allotted to system virtual addresses is exhausted due to the 32-bit addressing constraint.

Monitoring system memory pool usage

Figure 8 reports the values for five system memory pool allocation counters for the machine illustrated in Figure 7. The following five counters report on system virtual memory allocations:

• Pool Nonpaged Bytes

• Pool Paged Bytes

• System Code Total Bytes

• System Driver Total Bytes

• System Cache Resident Bytes

These five performance counters report the portion of the overall number of Committed Bytes that is associated with virtual memory allocations in the system range.

As illustrated in Figure 8, the number of system code and driver virtual bytes is usually minimal and can usually be safely dispensed with. The current size of both the Nonpaged and Paged Pools should be monitored carefully. The Cache Resident Bytes counter tracks the current usage of physical RAM by the file cache function, so it is a physical memory allocation counter that is here grouped with counters that measure virtual memory usage. There is a corresponding physical memory allocation counter called Pool Paged Resident Bytes that reports the number of allocated paged pool bytes that are currently resident in physical memory. Since the pages in the Nonpaged Pool are always resident in physical memory, one counter that tracks the size of the Nonpaged pool is sufficient.

In the example machine illustrated in Figures 7 and 8, neither the Nonpaged nor the Paged pool is approaching its maximum size. Because the Paged and Nonpaged Pools are dynamically re-sized as virtual memory in the system range is allocated, it is not always easy to pinpoint exactly when you have exhausted one of these pools. Fortunately, there is at least one server application, the file Server service, that reports on Paged Pool memory allocation failures when they occur. Non-zero values of the Server\Pool Paged Failures counter indicate virtual memory problems even for machines not primarily intended to serve as network file servers.

Nonpaged and paged pool maximums. Unfortunately, there are no counters available that report the maximum size of the Nonpaged and Paged pools, so even the most careful monitoring procedures leave you exposed to fatal virtual memory shortages striking by surprise. To determine the Nonpaged and Paged pool maximum sizes, you need to run the system debugger.

The kernel debugger !vm command extension provides more detail than the performance counters do in tracking how virtual memory is allocated. You can access the maximum allocation limits of these two pools and monitor current allocation levels, as illustrated in Listing 1.

The NonPagedPool Max and PagedPool Maximum rows show the values for these two virtual memory allocation limits. This information on virtual memory usage is also useful in a post-mortem review of a crash dump to determine whether a critical shortage of virtual memory was the source of the problem.

System PTEs. Another counter worth tracking in this context is Free System Page Table Entries, which reports the number of available system PTEs to service new allocations. It is necessary to chart the Free System Page Table Entries counter separately from the five system virtual memory usage counters shown in Figure 8 because it measures free space available, rather than usage, as in Figure 9, which is adapted from measurement data supplied by Microsoft from a series of Terminal Server benchmark stress tests.

System PTEs are built and used by system functions to address system virtual memory areas. They are allocated from a pool that is also used to service allocations for the stacks of any kernel mode threads that exist. When the system virtual memory range is exhausted, the number of Free System PTEs drops to zero and no more system virtual memory of any type can be allocated. On 32-bit systems with large amounts of RAM (1-2 GB, or more), it is also important to track the number of Free System PTEs.

FIGURE 8. MONITORING SYSTEM VIRTUAL MEMORY ALLOCATIONS BY POOL.

Virtual Memory constraints in Terminal Server

In Figure 9, Free System PTEs is tracked against the current number of active Terminal Server User sessions. Free System PTEs are plotted on a logarithmic scale as the number of active Terminal Server User sessions is increased during the course of this benchmark. When the number of free System PTEs drops below 100, it is no longer possible to add more Users to the system. In fact, it is apparent that some processes and threads associated with currently active Sessions start to fail when the pool where System PTEs are allocated becomes depleted. Processor utilization vs. the number of Users is plotted on the right y-axis to illustrate the progress of the benchmark. [5]

FIGURE 9. FREE SYSTEM PTES IN A VIRTUAL MEMORY-CONSTRAINED TERMINAL SERVER BENCHMARK WORKLOAD.

lkd> !vm

*** Virtual Memory Usage ***
Physical Memory:     130927 (  523708 Kb)
Page File: \??\C:\pagefile.sys
  Current:  786432 Kb  Free Space:  773844 Kb
  Minimum:  786432 Kb  Maximum:   1572864 Kb
Available Pages:      73305 (  293220 Kb)
ResAvail Pages:       93804 (  375216 Kb)
Locked IO Pages:        248 (     992 Kb)
Free System PTEs:    205776 (  823104 Kb)
Free NP PTEs:         28645 (  114580 Kb)
Free Special NP:          0 (       0 Kb)
Modified Pages:         462 (    1848 Kb)
Modified PF Pages:      460 (    1840 Kb)
NonPagedPool Usage:    2600 (   10400 Kb)
NonPagedPool Max:     33768 (  135072 Kb)
PagedPool 0 Usage:     2716 (   10864 Kb)
PagedPool 1 Usage:      940 (    3760 Kb)
PagedPool 2 Usage:      882 (    3528 Kb)
PagedPool Usage:       4538 (   18152 Kb)
PagedPool Maximum:   138240 (  552960 Kb)
Shared Commit:         4392 (   17568 Kb)
Special Pool:             0 (       0 Kb)
Shared Process:        2834 (   11336 Kb)
PagedPool Commit:      4540 (   18160 Kb)
Driver Commit:         1647 (    6588 Kb)
Committed pages:      48784 (  195136 Kb)
Commit limit:        320257 ( 1281028 Kb)

LISTING 1. OUTPUT FROM THE !VM DEBUGGER COMMAND.

After subtracting the five system virtual memory allocation counters discussed above from Committed Bytes, the remainder consists roughly of virtual memory allocations made by private area process address spaces. In the machine illustrated in Figures 7 and 8, the five system virtual memory allocation counters account for about 400 MB of the 2.84 GB Committed Bytes total reported at the end of the interval. This indicates that the various running process address spaces are responsible for allocating the remaining 2.44 GB of virtual memory. Unfortunately, a measurement anomaly associated with accounting for virtual memory allocations that are performed on behalf of shared program components makes it difficult to account for virtual memory allocations on a process-by-process basis in a straightforward manner. This measurement anomaly at the process address space level is discussed in the next section.

Accounting for process memory usage

When a memory leak occurs, or the system is otherwise running out of virtual memory, it is often useful to drill down to the individual process. There are three counters at the process level that describe how each process is allocating virtual memory. These are Process(n)\Virtual Bytes, Process(n)\Private Bytes, and Process(n)\Pool Paged Bytes.

Process(n)\Virtual Bytes shows the full extent of each process's virtual address space, including shared memory segments that are used to map data files and shareable image file DLLs in memory. The Process(n)\Working Set bytes counter, in contrast, tracks how many pages in RAM the virtual memory associated with the process currently has allocated. An interesting measurement anomaly is that Process(n)\Working Set bytes is often greater than Process(n)\Virtual Bytes. This is due to the way memory accounting is performed for shared DLLs, as discussed further below.

If a process is leaking memory, you should be able to tell by monitoring Process(n)\Private Bytes or Process(n)\Pool Paged Bytes, depending on the type of memory leak. If the memory leak is allocating, but not freeing, virtual memory in the process's private region, this will be reflected in monotonically increasing values of the Process(n)\Private Bytes counter. If the memory leak is allocating, but not freeing, virtual memory in the system range, this will be reflected in monotonically increasing values of the Process(n)\Pool Paged Bytes counter.
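The monotonic-growth signature of a leak can be checked mechanically against a series of counter samples. This is a sketch with made-up sample values; real monitoring data is noisier, so in practice a trend test over a long interval is more robust than strict monotonicity:

```python
def looks_like_leak(samples, min_growth=0):
    """Return True if a counter series never decreases and ends higher
    than it starts - the monotonically increasing pattern of a leak."""
    never_decreases = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_decreases and samples[-1] - samples[0] > min_growth

# Hypothetical Process(n)\Private Bytes samples taken at regular intervals:
leaking_process = [100_000, 120_000, 150_000, 180_000, 220_000]  # only grows
cyclical_load   = [100_000, 150_000, 90_000, 140_000, 100_000]   # rises and falls

print(looks_like_leak(leaking_process))  # True
print(looks_like_leak(cyclical_load))    # False
```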

Shared DLLs

Modular programming techniques encourage building libraries containing common routines that can be shared easily among running programs. In the Microsoft Windows programming environment, these shared libraries are known as Dynamic Link Libraries, or DLLs, and they are used extensively by Microsoft and other developers. The widespread use of shared DLLs complicates the bookkeeping that is done to figure out both the number of virtual pages and physical memory-resident pages associated with each process working set.

The OS counts all the resident pages associated with shared DLLs as part of every process working set that has the DLL loaded. All the resident pages of the DLL, whether the process has recently accessed them or not, are counted in the process working set. Processes can be charged for resident DLL pages they may never have touched, but at least this double counting is performed consistently across all processes that have the DLL loaded.

Unfortunately, this working set accounting procedure does make it difficult to account precisely for how physical memory is being used. It also leads to a measurement anomaly that is illustrated in Figure 10. For example, because the resident pages associated with shared DLLs are included in the process working set, it is not unusual for a process to acquire a working set larger than the number of Committed virtual memory bytes that it has requested. Notice the number of processes in Figure 10 with more working set bytes (Mem Usage) than Committed Virtual bytes (VM Size). None of the virtual memory Committed bytes associated with shared DLLs are included in the Process Virtual Bytes counter even though all the resident bytes associated with them are included in the Process Working set counter.

Trying to account for virtual memory usage at the process level leads to another measurement anomaly. For the machine shown in Figures 7 and 8, the number of Committed Bytes not associated with system virtual memory was calculated as approximately 2.4 GB. When you add up the number of virtual bytes consumed by all processes - this is readily accomplished by accessing the Process(_Total)\Virtual Bytes counter - it will likely exceed the value of system-wide Committed Bytes that is reported. Just as with the Process(*)\Working Set counter, it is not possible to break out which processes account for virtual memory allocation consistently due to double counting of shared memory segments.
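The double counting can be illustrated with toy numbers (the values are hypothetical; the point is only that per-process sums overstate the system-wide total):

```python
# Toy illustration of why summing per-process counters overstates totals:
# every process that loads a shared DLL is charged for all of its pages,
# so shared pages are counted once per process instead of once overall.

shared_dll_bytes = 50          # bytes of one shared DLL
private_bytes = {"A": 100, "B": 200, "C": 300}

# The true total counts the shared DLL once...
actual_total = sum(private_bytes.values()) + shared_dll_bytes

# ...but per-process accounting charges it to every process that loads it.
per_process_sum = sum(v + shared_dll_bytes for v in private_bytes.values())

print(actual_total)      # 650
print(per_process_sum)   # 750 - exceeds the true total by 100
```

The overstatement grows with the number of processes sharing the DLL, which is why the per-process sums on a Terminal Server machine, with many sessions loading the same DLLs, diverge so dramatically from Committed Bytes.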

Figure 11 charts the Process(*)\Virtual Bytes counter for only the 150 largest process virtual address spaces for the Terminal Server machine shown previously in Figures 7 and 8 over a one hour period beginning at 10:00 AM. Each bar in the chart represents the virtual byte count of an individual process address space. The number of virtual bytes that are shown allocated at the process level exceeds 10 GB for these 150 active processes, which far exceeds the Committed Bytes measurements for the same interval, which were shown back in Figure 7 to be only about 4 GB. The whopping difference between the expected measurement and the actual ones is disconcerting, to say the least.

Extended virtual addressing

The Windows 32-bit server operating systems support several extended virtual addressing options suitable for large Intel machines configured with 4 GB or more of RAM. These include:

• /3GB Boot switch, which allows the definition of process address spaces larger than 2 GB, up to a maximum of 3 GB.

• Physical Address Extension (PAE), which provides support for 37-bit physical addresses on Intel Xeon 32-bit processors. With PAE support enabled, 32-bit Intel processors can be configured to address as much as 128 GB of RAM.

• Address Windowing Extensions (AWE), which permits 32-bit process address spaces access to physical addresses outside their 4 GB virtual address limitations. AWE is used most frequently in conjunction with PAE.

FIGURE 10. WORKING SET BYTES COMPARED TO VIRTUAL BYTES.

Any combination of these extended addressing options can be deployed, depending on the circumstances. Under the right set of circumstances, one or more of these extended addressing functions can significantly enhance performance of 32-bit applications, which, however, are still limited to using 32-bit virtual addresses. These extended addressing options relieve the pressure of the 32-bit limit on addressability in different ways that are each discussed in more detail below. But they are also subject to the addressing limitations posed by 32-bit virtual addresses. Ultimately, the 32-bit virtual addressing limit remains a barrier to performance and scalability.

Extended process private virtual addresses

The /3GB boot switch extends the per process private address range from 2 to 3 GB. Windows Server 2003 permits a different partitioning of user and system addressable storage locations using the /3GB boot switch. This extends the private User address range to 3 GB and shrinks the system area to 1 GB, as illustrated in Figure 12. The /3GB switch supports an additional subparameter /userva=<SizeInMB>, where SizeInMB can be any value for the size of the largest User virtual address between 2048 and 3072.
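The partitioning is simple arithmetic over the 4 GB address space, and can be sketched as follows (an illustrative calculation; the function name is invented, and the valid range check follows the documented 2048-3072 MB bounds):

```python
# Sketch: partitioning the 32-bit 4 GB virtual address space between
# the User private area and the system area, per the /userva switch.

TOTAL_MB = 4096

def partition(userva_mb=2048):
    """Return (user_area_mb, system_area_mb) for a given /userva value."""
    if not 2048 <= userva_mb <= 3072:
        raise ValueError("/userva must be between 2048 and 3072 MB")
    return userva_mb, TOTAL_MB - userva_mb

print(partition())        # (2048, 2048) - the default 2 GB / 2 GB split
print(partition(3072))    # (3072, 1024) - the /3GB split
print(partition(3030))    # (3030, 1066) - leaves the system a little extra
```

Note how even the maximum /userva value still leaves the operating system a 1 GB range; the trade-off discussed below is whether that shrunken system area is enough.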

Only applications compiled and linked with the IMAGE_FILE_LARGE_ADDRESS_AWARE switch enabled can allocate a private address space larger than 2 GB. Applications that are Large Address Aware include MS SQL Server, MS Exchange, MS Internet Information Services (IIS), Oracle, and SAS.

For example, there is a database-oriented process called store.exe that is a central component of the MS Exchange Server application. The store.exe process in Exchange corresponds to the MS Exchange Information Store, the component that manages a repository of mail folders and other messaging content. Like other data-intensive server applications, the store process relies on memory-resident caching of frequently accessed objects to avoid disk I/O and improve performance. The store process will frequently benefit from running with the /3GB switch on because it can then acquire approximately 1 GB more private virtual memory for its internal data cache. In fact, Microsoft recommends using the /3GB switch with userva=3030 for any Exchange Server machine with at least 1 GB of RAM. See, for example, KB article 810371 entitled "Using the /Userva switch on Windows Server 2003-based computers that are running Exchange Server" posted on the Microsoft TechNet web site.

FIGURE 11. THE SUM OF PROCESS(*)\VIRTUAL BYTES COUNTERS CAN FAR EXCEED THE VALUE OF THE MEMORY\COMMITTED BYTES COUNTER SHOWN IN FIGURE 7 FOR THE SAME MACHINE.

Extending the user private area using the /3GB switch also shrinks the size of the system virtual address range. This can have serious drawbacks. While the /3GB option allows an application to grow its private virtual addressing range, it forces the operating system to squeeze into a narrower range of addresses. When the /3GB boot option is employed, the initial default allocations of the four main system memory pools are halved. In certain circumstances, this narrower range of system virtual addresses may not be adequate, and critical system functions can easily be constrained by virtual addressing limits. These concerns are discussed in more detail in the section entitled "Fine-tuning system virtual memory."

PAE

The Physical Address Extension (PAE) enables applications to address more than 4 GB of physical memory. It is supported by most current Intel processors running Windows Server 2003 Enterprise Edition or Windows 2000 Advanced Server. Instead of 32-bit addressing, PAE extends physical addresses to use 37 bits, allowing machines to be configured with as much as 128 GB of RAM.

When PAE is enabled, the operating system builds an additional virtual memory translation layer that is used to map 32-bit virtual addresses into this 128 GB physical address range, as illustrated in Figure 2. PAE utilizes 8-byte PTEs that are wide enough to contain the necessary extra address bits. A /PAE boot.ini switch informs the operating system to build PTEs in the format required for extended addressing. On newer server machines that support hot-add memory, Windows Server 2003 will boot in PAE mode automatically to ensure continuous operation if physical memory ever increases above the 4 GB upper limit for standard addressing.
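A back-of-the-envelope check on these figures (a sketch of the addressing arithmetic only, not a description of the exact PAE page table formats):

```python
GB = 1024 ** 3
PAGE = 4096  # 4 KB pages

# 32 physical address bits reach 4 GB; PAE's 37 bits reach 128 GB.
print((2 ** 32) // GB)   # 4
print((2 ** 37) // GB)   # 128

# Numbering 4 KB frames across 128 GB takes 37 - 12 = 25 bits, which
# overflows the 20-bit frame field of a standard 4-byte x86 PTE - hence
# the wider 8-byte PTE format that the /PAE switch enables.
frames = (2 ** 37) // PAGE
print(frames == 2 ** 25)   # True
```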

Server application processes running on machines with PAE enabled are still limited to using 32-bit virtual addresses. However, 32-bit server applications facing virtual memory addressing constraints can exploit PAE in two basic ways:

1. They can expand sideways by deploying multiple application server processes. Both MS SQL Server 2000 and IIS 6.0 support sideways scalability with the ability to define and run multiple application processes. Similarly, a Terminal Server machine with PAE enabled can potentially support the creation of more process address spaces than a machine limited to 4 GB physical addresses.

2. They can indirectly access physical memory addresses outside their 4 GB limit using the Address Windowing Extensions (AWE) API calls. Using AWE calls, server applications like SQL Server and Oracle can allocate database buffers in physical memory locations outside their 4 GB process virtual memory limit and manipulate them.

The PAE support the OS provides maps 32-bit process virtual addresses into the 37-bit physical addresses that the processor hardware supports. An application process, still limited to 32-bit virtual addresses, need not be aware that PAE is active. When PAE is enabled, operating system functions can only use addresses up to 16 GB. Only applications using AWE can access RAM addresses above 16 GB to the 128 GB maximum. Large Memory Enabled (LME) device drivers can also directly address buffers above 4 GB using 64-bit pointers. Direct I/O for the full physical address space is supported if both the devices and drivers support 64-bit addressing. For devices and drivers limited to 32-bit addresses, the operating system is responsible for copying buffers located in physical addresses greater than 4 GB to buffers in RAM below the 4 GB line that can be directly addressed using 32 bits.

Figure 13 illustrates an expand-sideways scenario for SQL Server 2000 where three named instances of the sqlservr.exe process are running concurrently on a machine configured with 12 GB of RAM. When the machine is booted with the /3GB option, each individual SQL Server instance can expand to use 3 GB of RAM. PAE-mode addressing allows multiple 32-bit 4 GB virtual memory process address spaces to be loaded anywhere in physical memory and execute. Each 4 GB private address space allows for a 1 GB range of virtual memory addresses where operating system code and data structures reside. Each SQL Server process address space is shown pointing to the same common OS pages resident in physical memory.

FIGURE 12. THE /USERVA BOOT SWITCH TO INCREASE THE SIZE OF THE USER VIRTUAL ADDRESS RANGE.

While expanding sideways by defining more process address spaces is a straightforward solution to virtual memory addressing constraints in selected circumstances, it is not a general purpose solution to the problem. When a single SQL Server database process is virtual memory-constrained running with no more than a 3 GB User-mode private area, extensive application changes may be required to partition the database across multiple instances of SQL Server. Moreover, not every processing task can be readily partitioned into subtasks that can be parceled out to multiple processes. While PAE brings extended physical addressing, there are no additional hardware-supported functions that extend interprocess communication (IPC). IPC functions in Windows rely on operating system shared memory, which is still constrained by 32-bit addressing. Machines with PAE enabled often run better without the /3GB option because the operating system range is subject to becoming rapidly depleted otherwise.

PAE is required if you want to enable the OS support for Cache Coherent Non-Uniform Memory Architecture (known as ccNUMA or sometimes NUMA, for short) machines, but it is not enabled automatically. On both AMD64 and x64-based systems running in long mode, PAE is required, is enabled automatically, and cannot be disabled.

AWE

The Address Windowing Extension (AWE) is an API that allows programs to address physical memory locations outside of their 4 GB virtual addressing range. AWE is used by applications in conjunction with PAE to extend their addressing range beyond 32 bits. Since process virtual addresses remain limited to 32 bits, AWE is a marginal solution that should be deployed cautiously. It has many pitfalls, limitations, and potential drawbacks.

The AWE API calls maneuver around the 32-bit address restriction by placing responsibility for virtual address translation into the hands of an application process. AWE works by defining an in-process buffering area called the AWE region that is used to map allocated physical pages dynamically to 32-bit virtual addresses. See Figure 14 for an illustration of the three-step procedure necessary to use AWE, which is similar to managing overlay structures.

1. The first step is to allocate an AWE region using nonpaged physical memory within the process address space by making a call to AllocateUserPhysicalPages in the AWE API. AllocateUserPhysicalPages locks down the pages in the AWE region and returns a Page Frame array structure that is the normal mechanism the operating system uses to keep track of which physical memory pages are mapped to which process virtual address pages. (An application must have the Lock Pages in Memory User Right to use this function.) Initially, of course, there are no virtual addresses mapped to the AWE region.

2. Then, the AWE application reserves physical memory (which may or may not be in the range above 4 GB) using a call to VirtualAlloc, specifying the MEM_PHYSICAL and MEM_RESERVE flags. Because physical memory is being reserved, the operating system does not build page table entries (PTEs) to address these data areas. (Because physical memory addresses are requested, User mode threads in the process cannot directly access these physical memory addresses; only authorized kernel threads can.)

3. Next, the process requests that the physical memory that was acquired be mapped to the AWE region using the MapUserPhysicalPages function. Once physical pages are mapped to the AWE region, the region is addressable by User mode threads running inside the process.

Using the AWE API calls, multiple sets of physical memory blocks, extending to 128 GB, can be mapped dynamically, one at a time, into the AWE region. The application, of course, must keep track of which set of physical memory buffers is currently mapped to the AWE region, what set is currently required to handle the current request, and perform virtual address unmapping and mapping as necessary to ensure addressability to the right physical memory locations.

FIGURE 13. EXPANDING SIDEWAYS BY RUNNING MULTIPLE INSTANCES OF 32-BIT SQL SERVER 2000.

In this example of an AWE implementation, the User process allocates four large blocks of physical memory that are literally outside the address space and one AWE region of the same size within the process virtual address space. Because the physical blocks of RAM are the same size as the AWE region of virtual memory, the AWE call to MapUserPhysicalPages is issued for one physical memory block at a time. In the example, the AWE region and the reserved physical memory blocks that are mapped to the AWE region are all the same size, but this is not required. Any sort of overlay structure can be implemented. For instance, there could be more than one AWE region, and each AWE region does not have to be the same size. Nor do physical memory blocks have to match the size of the AWE region. Applications can map multiple reserved physical memory blocks to the same AWE region, provided the AWE region address ranges they are mapped to are distinct and do not overlap.
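The overlay bookkeeping such an application must perform can be simulated in outline. This is a toy model of the mapping logic only — the class and method names are invented, and the real work is done by the Win32 calls (MapUserPhysicalPages rebinds the region's PTEs), not by Python dictionaries:

```python
# Toy model of AWE-style overlay bookkeeping: one virtual "AWE region"
# and several reserved physical blocks, only one mapped at a time.

class AweRegionModel:
    def __init__(self, block_names):
        self.blocks = {name: {} for name in block_names}  # block -> contents
        self.mapped = None                                # currently mapped block

    def map_block(self, name):
        """Model MapUserPhysicalPages: unmap the old block, map the new one."""
        if name not in self.blocks:
            raise KeyError(name)
        self.mapped = name

    def read(self, key):
        """Reads see only the block currently mapped into the region."""
        if self.mapped is None:
            raise RuntimeError("no physical block mapped to the AWE region")
        return self.blocks[self.mapped].get(key)

    def write(self, key, value):
        if self.mapped is None:
            raise RuntimeError("no physical block mapped to the AWE region")
        self.blocks[self.mapped][key] = value

region = AweRegionModel(["block0", "block1"])
region.map_block("block0")
region.write("row42", "data in block0")
region.map_block("block1")       # remap: block0's contents become unreachable
print(region.read("row42"))      # None - a different physical block is mapped
region.map_block("block0")
print(region.read("row42"))      # data in block0 - visible again after remap
```

The model makes the application's burden visible: it must remember which block holds which data and remap before every access, which is exactly the overlay-management overhead the text describes.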

In this example the User process private address space extends to 3 GB. It is desirable for processes using AWE to acquire an extended private area so that they can create a large enough AWE mapping region to manage physical memory overlays effectively. Obviously, frequent unmapping and remapping of physical blocks would slow down memory access considerably. The AWE mapping and unmapping functions, which involve binding physical addresses to a process address space's PTEs, must be synchronized across multiple threads executing on multiprocessors. Compared to performing physical I/O operations, of course, the speed-up in access times from AWE-enabled access to memory-resident buffers is still considerable.

AWE limitations. Besides forcing User processes to develop their own dynamic memory management routines, AWE has other limitations. For example, AWE regions and their associated reserved physical memory blocks must be allocated in pages. An AWE application can determine the page size using a call to GetSystemInfo. Physical memory can only be mapped into one process at a time. (Processes can still share data in non-AWE region virtual addresses.) Nor can a physical page be mapped into more than one AWE region at a time inside the same process address space. These limitations are apparently due to system virtual addressing constraints, which are significantly more serious when the /3GB switch is in effect. Executable code (.exe, .dll files, etc.) can be stored in an AWE region, but not executed from there. Similarly, AWE regions cannot be used as data buffers for graphics device output. Each AWE memory allocation must also be freed as an atomic unit. It is not possible to free only part of an AWE region.

FIGURE 14. AN AWE IMPLEMENTATION ALLOWING THE PROCESS TO ADDRESS LARGE AMOUNTS OF PHYSICAL MEMORY.

The physical pages allocated for an AWE region and associated reserved physical memory blocks are never paged out - they are locked in RAM until the application explicitly frees the entire AWE region (or exits, in which case the OS will clean up automatically). Applications that use AWE must be careful not to acquire so much physical memory that other applications can run out of memory to allocate. If too many pages in memory are locked down by AWE regions and the blocks of physical memory reserved for the AWE region overlays, contention for the RAM that remains can lead to excessive paging or prevent creation of new processes or threads due to lack of resources in the system area's Nonpaged Pool.

Application support

Database applications like MS SQL Server, Oracle and MS Exchange that rely on memory-resident caches to reduce the amount of I/O operations they perform are susceptible to running out of addressable private area in 32-bit Windows. These server applications all support the /3GB boot switch for extending the process private area. Support for PAE is transparent to these server processes, allowing both SQL Server 2000 and IIS 6.0 to scale sideways. Both SQL Server and Oracle can also use AWE to gain access to additional RAM beyond their 4 GB limit on virtual addresses.

Scaling sideways

SQL Server 2000 can scale sideways, allowing you to run multiple named instances of the sqlservr process, as illustrated in Figure 13. A white paper entitled "Microsoft SQL Server 2000 Scalability Project-Server Consolidation," available at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql2k/html/sql_asphosting.asp, documents the use of SQL Server and PAE to support multiple instances of the database process address space, sqlservr.exe, on a machine configured with 32 GB of RAM. Booting with PAE, this server consolidation project defined and ran 16 separate instances of MS SQL Server. With multiple database instances defined, it was no longer necessary to use the /3GB switch to extend the addressing capability of a single SQL Server address space.

IIS 6.0. The IIS 6.0 application server architecture supplies perhaps the most flexible set of configuration options available in any Windows server application in the face of 32-bit virtual memory constraints. Figure 15 illustrates the most salient features of the Microsoft Web server architecture that is designed to make the best use possible of potentially scarce virtual memory resources.

FIGURE 15. THE APPLICATION SERVER ARCHITECTURE OF IIS 6.0. [Diagram: the kernel-mode HTTP driver (http.sys) with its HTTP Response Cache in physical memory sits atop the TCP/IP stack and the network interface; above it, in User mode, are LSASS (Windows Authentication, SSL), Inetinfo (IIS Administration, FTP, Metabase, SMTP, NNTP), SVCHOST (W3SVC), and the w3wp.exe Application Pools, each hosting aspnet_filter.dll, aspnet_isapi.dll, mscoree.dll, Default.aspx, and its <code-behind>.dll.]

IIS 6.0 features a kernel-mode driver, http.sys, which is invoked directly whenever conventional HTTP Method calls are received and processed by the TCP/IP networking protocol stack. The HTTP kernel-mode driver eschews the standard system file cache, which is virtual-memory constrained in 32-bit Windows, in favor of a dedicated cache that can be addressed only using the physical memory addresses of the HTTP Response objects stored there. Using physical addressing mode exclusively allows the HTTP Response Cache to grow in size unconstrained by the system's virtual memory addressing limitations. Not only can the HTTP Response Cache grow larger than the 960 MB limit on the system file cache's virtual addressing range, it can do so without utilizing any of the limited range of virtual addresses reserved for other system memory pools. In cases where a fully rendered HTTP Response object is already available in the HTTP Response Cache, IIS 6.0 can respond to a Get Request without ever having to leave kernel mode or suffer a context switch to a User mode processing thread.

Web applications written using either Microsoft's Active Server Pages scripting extensions or the newer ASP.NET runtime facilities cannot be trusted to execute safely in kernel mode. IIS 6.0 supports a new feature called Application Pools (also known as Web Gardens) that queues ASP and ASP.NET requests to a User mode server process called w3wp.exe for execution. Figure 15 illustrates the ASP.NET ISAPI filter that initiates ASP.NET processing inside a w3wp runtime container process. The aspnet_isapi.dll, which is loaded by the ISAPI interface that IIS has supported for several generations, is also shown inside w3wp, along with the .NET Framework Common Language Runtime, mscoree.dll, that links to the extensive set of runtime services the .NET Framework provides. The aspx page containing the HTML that is necessary to render the page properly and its corresponding code-behind executable are also illustrated. (The .NET Framework 2.0 uses a slightly different mechanism than the .NET Framework version 1 that merges the aspx page and the code file at runtime.)

Web sites defined to IIS 6.0 can be assigned to run in separate Application Pools, which correspond to separate copies of the w3wp.exe container process. Additional options are also available at the Application Pool level to devote multiple worker processes to execute the web site programs, as illustrated in Figure 16. This set of multi-process options permits great flexibility in allowing the web application to expand sideways.

Exchange. Microsoft recommends that Exchange 2000 and 2003 use both the /3GB and /userva switches in boot.ini to allocate up to 3 GB of virtual memory for the Exchange Information Store application (store.exe). The store.exe process in Exchange is a database application that maintains the Exchange messaging data store. However, in Exchange 2000 store.exe will not allocate much more than about 1 GB of RAM unless you make registry setting changes, because the database cache will allocate only 900 MB by default. This value can be adjusted upward using the ADSI Edit tool to set higher values for the msExchESEParamCacheSizeMin and msExchESEParamCacheSizeMax runtime parameters.
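For reference, a boot.ini entry combining the two switches might look like the following sketch. The ARC path shown is machine-specific and purely illustrative; the /USERVA value is an example of the kind of figure Microsoft's Exchange tuning guidance suggests, and should be taken from the KB article appropriate to your configuration:

```
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003" /fastdetect /3GB /USERVA=3030
```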

Both Exchange 2000 and 2003 will run on PAE-enabled machines. However, Exchange 2000 doesn't make any calls to the AWE APIs to utilize virtual memory beyond its 4 GB address space. Exchange 2003 checks the memory management configuration when the store process starts. If the virtual memory settings are not optimal for Exchange, an event 9665 Warning is written to the Application Event log. The logic behind this virtual memory check is discussed in KB article 815372 entitled, "How to optimize memory usage in Exchange Server 2003," along with suggestions for tuning Exchange to run well on machines with 2 GB or more of RAM installed.

AWE Support

MS SQL Server. Using SQL Server 2000 with AWE brings a load of special considerations. You must specifically enable the use of AWE memory by an instance of SQL Server 2000 Enterprise Edition by setting the awe enabled option using the sp_configure stored procedure. When an instance of SQL Server 2000 Enterprise Edition is run with awe enabled set to 1, the instance does not dynamically manage the working set of the address space. Instead, the database instance acquires nonpageable memory and holds all virtual memory acquired at startup until it is shut down.

Because the virtual memory the SQL Server process acquires when it is configured to use AWE is held in RAM for the entire time the process is active, the max server memory configuration setting should also be used to control how much memory is used by each instance of SQL Server that uses AWE memory.
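A minimal configuration sketch of the two settings just described, run against the instance from Query Analyzer or osql (the 6144 MB cap is only an example value; choose one appropriate to the RAM you want to leave for the OS and other instances):

```sql
-- Enable AWE for this SQL Server 2000 Enterprise Edition instance
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure 'awe enabled', 1
-- Cap the nonpageable allocation so the OS and other instances keep some RAM
EXEC sp_configure 'max server memory', 6144   -- in MB; example value only
RECONFIGURE
GO
```

The awe enabled change takes effect only after the SQL Server instance is restarted.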

Oracle. AWE support in Oracle is enabled by setting the AWE_WINDOW_MEMORY Registry parameter. Oracle recommends that AWE be used along with the /3GB extended User area addressing boot switch. The AWE_WINDOW_MEMORY parameter controls how much of the 3 GB address space to reserve for the AWE region, which is used to map database buffers. This parameter is specified in bytes and has a default value of 1 GB for the size of the AWE region inside the Oracle process address space.

If AWE_WINDOW_MEMORY is set too high, there might not be enough virtual memory left for other Oracle database processing functions, including storage for buffer headers for the database buffers, the rest of the SGA, PGAs, stacks for all the executing program threads, etc. As Process(Oracle)\Virtual Bytes approaches 3 GB, out of memory errors can occur, and the Oracle process can fail. If this happens, you then need to reduce db_block_buffers and the size of the AWE_WINDOW_MEMORY.

Oracle recommends using the /3GB option on machines with only 4 GB of RAM installed. With Oracle allocating 3 GB of private area virtual memory, the Windows operating system and any other applications on the system are squeezed into the remaining 1 GB. However, according to Oracle, the OS normally does not need 1 GB of physical RAM on a dedicated Oracle machine, so there are typically several hundred MB of RAM available on the server. Enabling AWE support allows Oracle to access that unused RAM, perhaps allowing as much as an extra 500 MB of buffers to be allocated. On a machine with 4 GB of RAM, by bumping up db_block_buffers and turning on the AWE_WINDOW_MEMORY setting, Oracle virtual memory allocation may reach 3.5 GB.

FIGURE 16. MULTIPLE PROCESSES CAN SERVICE ASP AND ASP.NET WEB SITES IN IIS 6.0.

FIGURE 17. MEMORY MANAGEMENT POOL SIZING SETTINGS.

SAS. SAS support for PAE includes an option to place the Work library in extended memory to reduce the number of physical disk I/O requests. SAS is also enabled to run with the /3GB switch and can use the extra private area addresses for SORT work.

Fine-tuning System Virtual Memory

The upper half of the 32-bit 4 GB virtual address range is earmarked for System virtual memory addresses. The system virtual address range is limited to 2 GB, and using the /userva boot option it can be limited even further, to as little as 1 GB. On a large 32-bit machine, it is not uncommon to run out of virtual memory in the system address range. The culprit could be a program that is leaking virtual memory from the Paged Pool. Alternatively, it could be caused by active usage of the system address range by a multitude of important system functions, including kernel threads, TCP session data, the file cache, and many other normal functions. When the number of free System PTEs reaches zero, no function can allocate virtual memory in the system range. Unfortunately, you can sometimes run out of virtual addressing space in the Paged or Nonpaged Pools before all the System PTEs are used up. When you run out of system virtual memory addresses, whether due to a memory leak or virtual memory creep, the results are usually catastrophic.

The system virtual memory range, 2 GB wide, is divided into four major pools: the System PTEs, the Nonpaged Pool, the Paged Pool, and the File Cache. The size of the four main system area virtual memory pools is determined initially based on the amount of RAM. There are pre-determined maximum sizes for the Nonpaged and Paged Pools, but there is no guarantee that they will reach their predetermined limits before the system runs out of virtual addresses. This is because a substantial chunk of system virtual memory remains in reserve to be allocated on demand; it depends on which memory allocation functions requisition it first. Nonpaged and Paged Pool maximum extents are defined at system start-up, based on the amount of RAM that is installed, up to 256 MB for the Nonpaged Pool and 512 MB for the Paged Pool. In addition, a 900 MB region is earmarked for the pool where System PTEs are allocated. Meanwhile, the file cache is granted an initial virtual address range of 512 MB.
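To make the arithmetic concrete, the following sketch tallies the initial sizing requests just described against the 2 GB system range. The figures come from the text; they are sizing requests, not hard limits:

```python
# Tally the default sizing requests against the 2 GB system virtual range.
MB = 2**20
system_range = 2048 * MB

requests = {
    "Nonpaged Pool (max)": 256 * MB,
    "Paged Pool (max)": 512 * MB,
    "System PTEs": 900 * MB,
    "System File Cache (initial)": 512 * MB,  # can later grow to 960 MB
}

total = sum(requests.values())
print(f"{total // MB} MB requested of {system_range // MB} MB available")
# The requests oversubscribe the range, which is why no pool is guaranteed
# to reach its predetermined limit before system virtual addresses run out.
```

Since 256 + 512 + 900 + 512 = 2,180 MB exceeds the 2,048 MB range, whichever pools grow first consume the slack, and the others fall short of their maximums.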

Using the /userva or /3GB boot options that shrink the system virtual address range in favor of a larger process private address range substantially increases the risk of running out of system virtual memory. For example, the /3GB boot option reduces the system virtual memory range to 1 GB and cuts the default size of the Nonpaged and Paged Pools in half for a given amount of RAM.

The operating system's initial pool sizing decisions can also be influenced by a series of settings in the HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management key, as illustrated in Figure 17.

Three settings are available, NonpagedPoolSize, PagedPoolSize, and SystemPages, to specify explicitly the size of the Nonpaged Pool, the Paged Pool, and the System PTE pool, respectively. Using the Memory performance monitoring counters, if you are able to determine that the system is encountering a critical shortage of virtual memory in any of these three pools, the appropriate Registry setting can be used to boost the initial size of the pool. Alternatively, you might base your analysis on running the !vm debugger command against the crash dump of a system stricken with a critical shortage of virtual memory.

HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management settings:

  NonpagedPoolSize    Defaults based on the size of RAM, up to 256 MB.
                      Can be set explicitly; -1 extends NonpagedPoolSize
                      to its maximum.
  PagedPoolSize       Defaults based on the size of RAM, up to 650 MB.
                      Can be set explicitly; -1 extends PagedPoolSize
                      to its maximum.
  SystemPages         Defaults based on the size of RAM. Can be set
                      explicitly; -1 extends SystemPages to its maximum.
  LargeSystemCache    960 MB. A zero value allocates the minimum-sized
                      file cache.

TABLE 1. MEMORY MANAGEMENT REGISTRY SETTINGS

Rather than force you to partition the system area precisely, you can also set the SystemPages, NonpagedPoolSize, or PagedPoolSize settings to a -1 value (or 0xffffffff, since the Registry Editor does not allow for signed integers), which instructs the operating system to allow the designated pool to grow as large as possible. Note that setting more than one of the three system memory pool sizing parameters to -1 will backfire because the OS will no longer understand which one of the pools needs to be the largest.
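As a sketch, a .reg file that lets the Paged Pool grow as large as possible might look like the following. Remember to apply only one of the three -1 settings at a time:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management]
"PagedPoolSize"=dword:ffffffff
```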

A fourth setting, LargeSystemCache, is enabled by default on Windows Server 2003 machines. When LargeSystemCache is enabled, a 512 MB region of virtual memory is initially allocated for the system file cache, which is then allowed to extend into an overflow area of virtual memory to a maximum size of 960 MB, assuming virtual memory is available. Server applications like SQL Server, Oracle, Exchange and even Active Directory rely on their own internal caches, utilizing the system file cache very little. IIS 6.0 relies mainly on its dedicated kernel-mode physical memory cache, but may supply some load on the system file cache. If the main server applications are not using the system file cache, then LargeSystemCache can be set to 0, which minimizes the initial allocation of the system file cache. This allows the OS greater discretion in maximizing one of the other system memory pools.

64-bit addressing

The problems discussed above, where 32-bit Windows applications are virtual memory-constrained, vanish when 64-bit processors running 64-bit applications are deployed. The 64-bit architectures available today allow massive amounts of virtual memory to be allocated. Windows Server 2003 supports a 16 TB virtual address space for User mode processes running in 64-bit mode. As in 32-bit mode, this 16 TB address space is divided in half, with User mode private memory locations occupying the lower 8 TB and the operating system occupying the upper 8 TB. Table 2 compares the virtual memory addressing provided in 64-bit Windows to 32-bit machines in their default configuration.

On 64-bit systems, all operating system functions are provided in native 64-bit code. The massive amount of virtual memory available to operating system functions eliminates the architectural constraints that lead to critical virtual memory shortages in the situations discussed above. The concerns enumerated above simply evaporate. For example, in the Terminal Server example illustrated in Figure 9, space for Free System PTEs dropped to less than 100 when the benchmark workload reached 240 User sessions. When the same Terminal Server workload is moved to an x64 machine, the virtual memory constraint is removed. As a result, the number of Terminal Server User sessions could be increased dramatically, in effect, until some other bottlenecked resource in the configuration was manifest. In the case of this specific workload, Microsoft reported that processor limitations start to become evident when double the number of Terminal Server User sessions are executed.

  Component                       64-bit    32-bit
  Process Virtual Address Space   16 TB     4 GB
  User Address Space              8 TB      2 GB
  Nonpaged Pool                   128 GB    256 MB*
  Paged Pool                      128 GB    512 MB*
  System File Cache               1 TB      1 GB
  System PTEs                     128 GB    660 MB*
  Hyperspace                      8 GB      4 MB
  Paging File Size                512 TB    4 GB

* These are initial sizing requests, not hard limits on the size of the Paged Pool, Nonpaged Pool, and System PTEs in Windows Server 2003.

TABLE 2. A COMPARISON OF VIRTUAL MEMORY ALLOCATIONS IN 64-BIT AND 32-BIT SYSTEMS

The 64-bit machines arriving on the scene offer a modicum of relief to virtual memory-constrained 32-bit applications that are LARGE_ADDRESS_AWARE. Processes running 32-bit code use the thin WOW64 (Windows on Windows) translation layer on 64-bit systems to communicate with operating system services. 32-bit User mode applications that are linked with the IMAGE_FILE_LARGE_ADDRESS_AWARE switch enabled so they could take advantage of the /3GB switch on 32-bit machines can allocate a private address space as large as 4 GB when running on 64-bit Windows. Note that the /3GB boot switch is not supported in 64-bit Windows. The larger potential 32-bit User process address space is possible because operating system functions no longer need to occupy the upper half of the total 4 GB process address space. This allows User mode applications to allocate almost the full 4 GB 32-bit addressing range.
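The address-space arithmetic described above can be summarized in a short sketch. This is a simplification based on the figures in the text; it ignores per-process overheads that keep the real ceilings slightly below these round numbers:

```python
GB = 2**30

def private_address_space(large_address_aware, boot_3gb=False,
                          on_64bit_windows=False):
    """Approximate private virtual address space ceiling, in bytes,
    for a 32-bit Windows process under the configurations discussed."""
    if not large_address_aware:
        return 2 * GB      # default 2 GB private / 2 GB system split
    if on_64bit_windows:
        return 4 * GB      # WOW64: nearly the full 32-bit range
    if boot_3gb:
        return 3 * GB      # /3GB boot switch on 32-bit Windows
    return 2 * GB          # flag set, but default boot options

print(private_address_space(True, on_64bit_windows=True) // GB, "GB")
```

The key point the sketch captures is that the LARGE_ADDRESS_AWARE link flag by itself changes nothing; the extra headroom comes from where the operating system chooses (or no longer needs) to live in the 4 GB range.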

Running virtual memory-constrained 32-bit applications on 64-bit machines to supply additional virtual memory headroom is particularly attractive when the 64-bit machines can run native 32-bit instructions without exacting a severe performance penalty. That was not the case with the 64-bit Itanium IA-64 machines that were introduced several years ago and are supported by the Microsoft Server 2003 operating system. The IA-64, a radically new processor architecture featuring a new instruction set, executes the IA-32 instruction set using emulation, which results in a severe performance penalty. IA-64 machines are designed to run native 64-bit applications without compromising their performance, with backwards compatibility provided to run 32-bit applications as a secondary consideration. To date, however, there are a very limited number of native 64-bit server applications that can run on these machines. The short list of native 64-bit server applications currently includes IIS 6.0, Microsoft SQL Server, Oracle, and SAP, among others. (Microsoft maintains a far from complete list at http://www.microsoft.com/windowsserver2003/64bit/x64/app64catalog.aspx.)

The need to run 32-bit server-based applications during a long period of transition to 64-bit Windows makes the AMD-64 and Intel x64 architectures an attractive interim choice. (The Intel x64 is a clone of the original AMD-64 design.) Both AMD-64 and x64 machines are capable of executing native 32-bit instructions with little or no performance penalty. Processor-constrained 32-bit applications will suffer a slight performance hit, but virtual memory-constrained server applications like MS Exchange receive significant relief.

References

[1] Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 3rd edition. Morgan Kaufmann, 2002.

[2] Tracy Kidder, The Soul of a New Machine. Boston: Back Bay Books, 2000.

[3] Mark Friedman and Odysseas Pentakalos, Windows 2000 Performance Guide. O'Reilly & Associates, 2002.

[4] Mark Friedman, "Virtual memory constraints in 32-bit Windows," Proceedings, CMG 2003.

[5] Sreenivas Shetty, "Terminal Server Performance: Tuning, Methodologies and Windows Server 2003 x64 Impact." PowerPoint presentation, Microsoft TechEd 2005 conference.