Locality-Optimized Resource Alignment
Version 3.5 for HP-UX 11i v3 Update 5

Table of Contents

Executive summary
Background and motivation
  Structure of HP servers
    Interleaved memory
    Local memory
LORA scope
  Hardware platforms
  Operating system
  Virtual partitioning
  Application workload
  Variability in hardware resources
  When to use LORA
LORA configuration rules
  nPartitions
  vPars
  Integrity Virtual Machines
LORA system administration
  Server Tunables product in the Tune-N-Tools bundle
  loratune command
Benefits
  Performance
  Cost
  Power management
Summary
Glossary
Technical details
  Configuring nPartitions for LORA
    Converting an existing nPartition for LORA
    Creating a new nPartition for LORA
      Determine the size of the nPartition
      Create the nPartition
  Configuring vPars for LORA
    Considerations for memory granule size
    Creating new vPars instances
      Dividing the nPartition into vPars instances of equal size
      Establishing vPars instances by processor and memory requirements
    Modifying existing vPars instances
    Summary of vPars configuration rules
  Advanced tuning
    numa_mode kernel tunable parameter
    numa_sched_launch kernel tunable parameter
    numa_policy kernel tunable parameter
    mpsched command
      Using mpsched to enforce good alignment for Java
For more information


    Executive summary

Locality-Optimized Resource Alignment (hereinafter LORA) is a framework for increasing system performance by exploiting the locality domains in HP servers with Non-Uniform Memory Architecture. LORA consists of configuration rules and tuning recommendations, plus HP-UX 11i v3 enhancements and tools to support them.

LORA introduces a new mode to supplement the Symmetric Multiprocessing (SMP) mode originally implemented in HP-UX. LORA exploits locality in NUMA platforms to advantage, while the SMP approach treats the memory resources in a symmetric manner. For application workloads that exhibit locality of memory reference, systems configured in accordance with LORA will typically see a 20% performance improvement compared to the SMP mode used with interleaved memory.

The advanced power controls in HP servers offer the opportunity for great power savings when platform hardware is not fully utilized. Because the power domains generally correspond to the locality domains, LORA configurations naturally mesh with a power conservation strategy.

The body of this white paper contains sections describing background and motivation, scope, configuration rules, and system administration recommendations. The technical details behind these topics appear in the appendices.

LORA was first introduced in September 2008 with Update 3 to HP-UX 11i v3. Here are the major improvements delivered in September 2009 with Update 5:

  • The new parconfig command makes configuring nPartitions much simpler.
  • The procedure for creating well-aligned vPars instances is simpler, and those instances are fully compatible with gWLM dynamic processor migration operations.
  • HP now recommends deploying Integrity Virtual Machines in LORA mode.
  • LORA mode is now recommended for more application classes.
  • There is less need for system administrators to perform explicit tuning, because HP-UX implements heuristics to perform resource alignment automatically.
  • There is a new command, loratune, to tune up resource alignment.

Background and motivation

Structure of HP servers

HP midrange and high-end servers are constructed as a complex of multiple modular units containing the hardware processing resources. This structure yields great advantages: a single family of servers can span the range from an economical 4 processor cores up to 128 processor cores with world-class performance, with similar scaling in the amount of memory and number of I/O slots. Moreover, the complex can be partitioned to support multiple independent and isolated application workloads, with each partition sized to have the right amount of hardware resources for its workload.

A consequence of this structure is that the processing resources within the complex are grouped into a set of localities. For any given processor core, memory access latency time depends on where that memory is located. This is called Non-Uniform Memory Architecture (NUMA).

    Interleaved memory

Interleaved memory (ILM) is a technique for masking the NUMA properties of a system. Successive cache lines in the memory address space are drawn from different localities, making the average memory access latency time more-or-less uniform.


Sometimes interleaved memory is the best technique. ILM yields good performance when memory references are spread across the entire address space with equal probability. This is the case for applications using large global data sets with no spatial locality.

    Local memory

If memory is not interleaved, then the natural localities inherent in the structure of the server complex are evident. The processor cores in each locality enjoy fast access to their local memory. The counterpoint is that access to memory in a different locality is slower. When the memory reference pattern places the majority of accesses in local memory, LORA gives a significant performance advantage relative to interleaved memory.

    LORA scope

    Hardware platforms

LORA applies only to those HP servers that have a Non-Uniform Memory Architecture. Those are the servers built around the sx1000 and sx2000 chip sets.

For these servers, the localities are oriented around cells. Local memory is referred to as Cell Local Memory (CLM).

Memory performance is better when all the cells in an nPartition have the same amount of memory installed. This is good advice for the 100% interleaved case, because the deepest interleave is possible only when each cell contributes the same amount of memory. Having the same amount of memory on each cell is even more important for LORA: the memory symmetry promotes balanced utilization of the processing resources. Asymmetry of local memory can cause a slight degradation in overall system performance.

    The Integrity platform is the design center for LORA, and the architecture exploits features specific tothat platform. LORA is not supported on HP 9000 (PA-RISC) platforms.

    Operating system

Update 3 to HP-UX 11i v3, released in September 2008, was the first version to provide a rich set of mechanisms to support local memory. We recommend that LORA be used with this update or its successors, Update 4 and Update 5.

The earliest versions of HP-UX assumed a uniform memory architecture and implemented only the SMP mode. HP-UX 11i v2 was the first version to provide support for local memory, but it places the entire burden for managing local memory on the system administrator and on applications, so we do not recommend LORA with HP-UX 11i v2.

    Virtual partitioning

The two virtual partitioning solutions provided by HP are Virtual Partitions (vPars) and Integrity Virtual Machines. Because these solutions subdivide the physical resources of an nPartition, they present opportunities to exploit locality.

The virtualization model of vPars is particularly well-suited to LORA. The version of the T1335DC product first delivered in Update 4 to HP-UX 11i v3, version A.05.05, contains many optimizations to gain additional benefit from local memory.

The binding of virtual resources to physical resources in Integrity Virtual Machines is flexible and fluid, yet there are still opportunities to gain performance advantage through resource alignment. HP recommends deploying Integrity Virtual Machines with LORA starting with Update 4.

LORA configuration rules

HP-UX 11i supports a variety of partitioning models. The sections that follow explain which of these models are sensitive to the LORA configuration rules.

    nPartitions

At the nPartition level, each base cell should be configured with ⅞ths local memory to comply with the LORA configuration rules. The floating cells, if any, always have 100% local memory. The appendix Configuring nPartitions for LORA contains more detail.

For nPartitions containing exactly one locality, there is no difference between interleaved memory and local memory. For such a partition, there is no difference between SMP mode and LORA mode. If a second cell were added via Dynamic nPartitions, then there would be a difference between the two modes.

    vPars

Each vPars instance should be composed with ⅞ths local memory and ⅛th ILM. Since the underlying nPartition has this memory ratio, it is straightforward to reflect the same ratio into the vPars instances.

It is important that the processor and memory resources assigned to each vPars instance span the minimal set of localities. If a vPars instance must span multiple localities, then the processor and memory resources should be distributed symmetrically across those localities. Aligning I/O resources with the processors and memory is helpful, but it is a second-order effect. The appendix Configuring vPars for LORA explains these points in more detail.

HP recommends that vPars instances configured with ⅞ths local memory be operated in LORA mode. This can be achieved most easily by leaving the numa_mode parameter at its default value. When vPars is operating in LORA mode, the system will manage any CPU migration operations so as to adhere as closely as possible to the configuration rules given above.
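As an illustrative check (not part of the paper's procedure; assuming the standard kctune(1M) interface for kernel tunables), the current value can be displayed from within the vPars instance:

# show the current setting of the numa_mode tunable (0 means auto-sense at boot)
kctune numa_mode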

    Integrity Virtual Machines

The nPartition containing the Integrity Virtual Machines host should be composed with ⅞ths local memory and ⅛th ILM. The guest instances operate in a Uniform Memory Architecture environment, so it is neither necessary nor possible to configure local memory in the guest instances. The Integrity Virtual Machines host will allocate resources to the guest instances to gain the greatest possible benefit from memory local to the processors.

    LORA system administration

For the most part, managing a system in LORA mode is identical to managing it in SMP mode. Suggestions for adapting to unusual workload profiles are in the Advanced tuning appendix.

    Server Tunables product in the Tune-N-Tools bundle

We recommend using the Server Tunables product in the Tune-N-Tools bundle with LORA to improve application performance. This product was introduced with Update 3 and is also available on the web in the HP software depot: http://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=Tune-N-Tools

    loratune command

The loratune command is valuable in LORA mode. The command can be used to restore good resource alignment if it has been disturbed by an event such as:

  • terminating a major application or completing a backup
  • a dynamic platform operation such as online cell activation

    The simplest way to use the command is to invoke it with no arguments:

loratune

    More details are available in the man page.

    Benefits

    Performance

LORA reduces average memory access latency times in comparison to the interleaved memory mode. The magnitude of the reduction depends on the memory reference pattern and the number of localities in the partition. When processors spend less time waiting for memory references to be satisfied, all aspects of application performance improve. Typically, response times decrease and throughput increases at the same time that processor utilization drops. A rough estimate of the performance benefit is 20%. As Table 1 indicates, the benefit is greater for larger partitions.

    Cost

LORA makes processors operate more efficiently, which can be realized as an increase in performance. Alternatively, the increased efficiency can be used to reduce the number of processor cores allocated to an application workload. This reduces the hardware provisioning cost, and may also save on the cost of software licenses. The LORA configuration guidelines sometimes recommend an increase in the amount of memory, which may offset some of the cost savings due to increased processor efficiency.

    Power management

Power management has strong synergy with LORA. By its nature, LORA groups hardware components by their physical locality. These localities often match power domains, which gives opportunities for power savings at times of low hardware utilization.

The newest Itanium processors have multiple cores per socket, and have low-power modes. The greatest power savings are realized when all cores in a socket enter the low-power mode, as compared to having single cores in multiple sockets in the low-power mode. LORA tends to group cores by their proximity, increasing the chances that an entire socket can enter low-power mode when an application is experiencing a light load.

    Summary

Locality-Optimized Resource Alignment is a framework for improving performance on HP servers with a Non-Uniform Memory Architecture, introduced with HP-UX 11i v3 Update 3 and enhanced in subsequent updates. This paper explains the circumstances in which LORA is beneficial and gives guidelines for deployment in those cases. When LORA is used with commercial applications, performance is about 20% better than the SMP interleaved memory configuration. LORA simplifies server configuration by presenting uniform configuration guidelines. LORA dovetails nicely with power management strategies.


    Glossary

Socket: Receptacle on a motherboard for the physical package of processing resources.

Processor core: If the physical package of processing resources includes multiple independent functional entities, each of them is called a processor core.

Core: Same as processor core. The keyword cpu in the vPars commands refers to a core.

CPU: Acronym for Central Processing Unit. The term processor core is preferred.

Cell: The basic physical building block of a system complex. A cell contains processor sockets, memory, and I/O components.

Crossbar: A component of the interconnect fabric that allows the cells in a system complex to communicate with each other.

Locality domain: A set of processors, memory, and I/O system bus adapters identified by the operating system for resource alignment purposes.

Locality: Same as locality domain.

SMP: Acronym for Symmetric Multiprocessor. A model in which all of the processors in a system are equivalent to and interchangeable with each other.

NUMA: Acronym for Non-Uniform Memory Architecture. A hardware platform in which system memory is separated into localities based on memory access latency times for processors.

ILM: Acronym for Interleaved Memory. A technique in which successive cache lines of memory are drawn from different localities.

    Technical details

The appendices that follow contain the technical details to supplement the general information presented in the opening sections of this paper. The appendix Configuring nPartitions for LORA explains the steps needed for every deployment of LORA. The appendix Configuring vPars for LORA gives the additional steps needed when vPars is used in a LORA nPartition. Recommendations for fine-tuning workloads are given in the appendix Advanced tuning.

    Configuring nPartitions for LORA

This appendix discusses converting an existing 100% interleaved partition for use with LORA, and creating a new LORA partition to meet the needs of a specified workload. The details for configuring a server complex and for dividing it into nPartitions are available in the references.

    Converting an existing nPartition for LORA

If an existing interleaved nPartition is to be converted for LORA, it is only necessary to configure each base cell in the nPartition with 87.5% local memory. (Any floating cells always have 100% local memory.)


The parmodify command is used to change the local memory ratio. For example, for an nPartition built from cells 1, 2, and 3:

parmodify -p 2 -m 1:base:::87.5% -m 2:base:::87.5% -m 3:base:::87.5%

It is necessary to reboot the partition for the change to take effect. The man page for the parmodify(1m) command has the full details.
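As an illustrative follow-up (not part of the original procedure; assuming the standard nPartition command set), the new per-cell memory configuration can be inspected after the reboot:

# verbose status for partition 2, including per-cell memory details
parstatus -V -p 2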

    Creating a new nPartition for LORA

    Determine the size of the nPartition

Determine the size of the nPartition by estimating the amount of processing resources needed to provision the application workload, in terms of the number of processors, amount of memory, and number of I/O adapters. HP provides a variety of estimating tools, such as Capacity Advisor, to help provision workloads. Alternatively, the estimate could be based on comparison to a similar workload. Many workloads are CPU bound, so it is common to express the partition size as the number of processor cores or processor sockets.

    If iCAP or gWLM is to be used to manage the number of cores in the partition, then the maximumnumber of cores or sockets should be used in the size estimate.

If the sizing guides or past experience assume interleaved memory, the estimates should be adjusted for the special characteristics of LORA. Under LORA, processors are more efficient because they spend less time waiting for memory accesses to be satisfied, so fewer cores are needed. On the other hand, to ensure that local memory is available on the cell where it is needed, the amount of memory should be increased modestly. The following table gives the guidelines:

Table 1. Partition size adjustments for LORA

Number of sockets | Core adjustment | Memory adjustment
1 to 12           | None            | None
13 to 24          | Reduce by 10%   | Increase by 5%
25 to 48          | Reduce by 20%   | Increase by 10%
49 to 64          | Reduce by 25%   | Increase by 15%

Note: If memory utilization is below 75%, it is not necessary to increase the amount of memory at all.
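For example (an illustrative calculation, not from the paper): a workload sized under interleaved memory at 40 sockets (80 cores) and 320 GB of memory falls in the 25 to 48 row, so the LORA estimate becomes 80 × 0.8 = 64 cores and 320 × 1.1 = 352 GB.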

    Create the nPartition

The parconfig command is used to build the nPartition. For example, the following command creates a partition that requires 24 processor cores:

parconfig x -O HPUX_11iV3 X LORA -n mincpus:24 -Z high_performance

The man page for the parconfig(1m) command has the full details.


    Configuring vPars for LORA

    Considerations for memory granule size

Memory for vPars instances is allocated only in multiples of the memory granule size. The granule size can be different for interleaved memory and local memory. Default granule sizes are established when a server complex is shipped from the factory. The defaults can be changed by use of the vparenv command, or by adding options to the command that creates the very first vPars instance.

A small granule size permits the greatest flexibility in dividing the memory resources in an nPartition for assignment to vPars instances. However, a small granule size limits the size of contiguous physical memory objects, which are useful for increasing the performance of I/O drivers, and can result in longer boot times. Also, there is a limit on the total number of granules that can be created. A large granule size is more efficient, but it means that the quantum of memory allocation is large.

Under the LORA configuration guidelines, an nPartition has seven times as much local memory as interleaved memory. Therefore, it is reasonable for the granule size for local memory to be larger than the granule size for interleaved memory. For example, the local memory granule size could be four times or eight times larger than the interleaved memory granule size.
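For instance, a minimal sketch using the vparenv command (the sizes here are illustrative assumptions, expressed in MB):

# set the ILM granule to 128 MB and the CLM granule to 1024 MB (8 times larger)
vparenv -g ilm:128 -g clm:1024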

An alternative strategy would be to skip interleaved memory altogether for extremely small vPars instances. For example, a vPars instance with only 2 GB of memory could be configured with 100% local memory, instead of giving it 256 megabytes of interleaved memory, which would require a granule size at least that small.

In the examples that follow, it is assumed that the memory granule sizes have been established so that the requested memory allocations are an integral multiple of the relevant granule size.

    Creating new vPars instances

In this section, we'll assume that an nPartition has already been configured in accordance with the LORA guidelines and will now be configured for vPars. There are two important sub-cases: dividing the nPartition into a given number of vPars instances of equal size, and establishing a set of vPars instances each with its own processor and memory requirements.

In either case, the goal of the configuration is to create vPars instances where the cores and local memory are drawn from the minimal set of localities. If the I/O can also come from those same localities, that is a bonus, but it might not be possible because of the locations of the needed I/O adapters. The enhancements introduced with vPars version A.05.05 can take account of the location of I/O adapters when assigning processor cores to vPars instances.

The LORA configuration procedure is a refinement of the general vPars configuration procedure described in the HP-UX Virtual Partitions Administrator's Guide. That reference should be consulted for details not discussed in this paper.

We illustrate the procedure with an nPartition consisting of cells 1, 2, and 3 in an HP Integrity rx8640 server with 64 GB of memory on each cell. If the nPartition is configured according to the LORA guidelines, each cell would have 56 GB of local memory and would contribute 8 GB to the 24 GB of ILM in the partition. The available processor and memory resources are shown in the following diagram:


    Figure 1. Resources available in the example nPartition

    Dividing the nPartition into vPars instances of equal size

Suppose that it is desired to create six vPars instances of equal size in this partition. Since the partition contains 24 cores, each vPars instance should have 4 cores. Since the partition contains 24 GB of interleaved memory and 168 GB of local memory, each vPars instance should have 4 GB of interleaved memory and 28 GB of local memory. In practice, the actual memory allocations will be somewhat smaller, due to the memory consumed by firmware and by the vPars monitor.

    The commands that could be used to establish the processor and memory allocations are as follows:

vparcreate -p vp1 -a cpu::4 -a mem::4096 -a cell:1:mem::28672

vparcreate -p vp2 -a cpu::4 -a mem::4096 -a cell:1:mem::28672

vparcreate -p vp3 -a cpu::4 -a mem::4096 -a cell:2:mem::28672

vparcreate -p vp4 -a cpu::4 -a mem::4096 -a cell:2:mem::28672

vparcreate -p vp5 -a cpu::4 -a mem::4096 -a cell:3:mem::28672

vparcreate -p vp6 -a cpu::4 -a mem::4096 -a cell:3:mem::28672

Please note that the creation commands specify the cell identifiers for the local memory but only a count for the number of processors. The enhancements added in vPars version A.05.05 (the version delivered starting with Update 4) automatically allocate processor cores in close proximity to the memory.
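One way to confirm the resulting placement (an illustrative step, assuming the standard vPars command set) is to query each new instance:

# verbose status for one instance, listing its CPU and memory assignments
vparstatus -v -p vp1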

    For this example, the processor and memory allocations would be as shown in the following diagram:


    Figure 2. Six equal vPars instances configured according to LORA rules

To achieve alignment between processors and memory with the vPars version delivered in Update 3, it was necessary to specify the cell identifiers for the processors to match the local memory. This had the drawback of disallowing gWLM to perform processor deletion operations upon the cell local processors. With the enhancements added in Update 4, gWLM is free to delete any core, and online processor addition operations will select the cores that give the best possible resource alignment.

The creation commands did not specify any I/O allocations. The needed I/O could be added with subsequent commands. For example:

vparmodify -p vp5 -a io:1/0/0/2/0.6.0:BOOT -a io:1/0/0/1/0

vparmodify -p vp6 -a io:1/0/4/0/0.6.0:BOOT -a io:1/0/6/1/0

As a general rule, overall system performance is determined by processor-to-memory alignment, and is less sensitive to I/O location. Most I/O operations are asynchronous, and the operating system and applications employ a wide variety of latency-hiding techniques to make progress while I/O operations are pending. If a particular workload is sensitive to processor-to-I/O alignment, that alignment can be achieved by specifying the I/O location with the create command. For example:

vparcreate -p vp5 -a cpu::4 -a io:3/0/0/2/0.6.0:BOOT -a io:3/0/0/1/0

vparcreate -p vp6 -a cpu::4 -a io:3/0/4/0/0.6.0:BOOT -a io:3/0/6/1/0

It is also possible to specify both local memory and I/O in the vparcreate command, in which case the processor cores are allocated across the set of localities spanned by the memory and I/O. The following command would allocate processor cores preferentially from cells 2 and 3:

vparcreate -p vp5 -a cpu::4 -a mem::4096 -a cell:2:mem::28672
    -a io:3/0/0/2/0.6.0:BOOT -a io:3/0/0/1/0


This first example was extremely simple: the resources on each cell were divided in half. But it shows quite clearly the power behind the LORA concept: the alignment between processor cores and local memory guarantees that the preponderance of memory references will be satisfied through the fastest hardware path.

A slightly more complicated example involves dividing the nPartition into four equal vPars instances. Each instance must have 6 cores and 48 GB of memory, of which 42 GB is local memory and 6 GB is interleaved memory. The first three instances can be built easily, with each taking three quarters of the resources on a cell. The fourth instance must be built from the remaining resources and is therefore split across all three cells. The splintering of a vPars instance reduces performance and so should be avoided when possible, but sometimes it is inevitable.

    For this example, the processor and memory allocations would be as shown in the following diagram:

    Figure 3. Four equal vPars instances configured according to LORA rules

It would not have been so smooth if it had been requested to create 7 equal vPars instances from the nPartition of 3 cells. The arithmetic implies that each instance would have 3.4 cores, but only an integral number of cores can be allocated. To solve this problem, some instances could have 3 cores and others 4, or the entire allocation could be handled by the technique for creating custom instances described in the next section.


    Establishing vPars instances by processor and memory requirements

Suppose it is stipulated that the example partition be divided into vPars instances of the following sizes:

Table 2. Specification for size of vPars instances

Name | Number of cores | Amount of memory
cow  | 12              | 48 GB
dog  | 4               | 88 GB
elk  | 2               | 16 GB
fox  | 2               | 16 GB

The first vPars instance in the table, cow, requires 12 cores, so it will not fit within a single locality. It will, however, fit within two localities, so it should be confined to just two localities: spreading its resources across a third locality would incur additional memory latency overhead. It is best to split the resources evenly across the two localities: this symmetry promotes the most balanced utilization of resources. The 48 GB of memory would be allocated in the ratio of ⅞ths local memory and ⅛th interleaved memory. The 42 GB of local memory should be distributed evenly across the two localities, with 21 GB from each of cells 1 and 2.

The second vPars instance in the table, dog, requires 88 GB of memory, so it also will not fit within a single locality. Once again, the strategy is to confine it within the minimum number of localities, which is two, and divide the resources as evenly as possible between the two localities. In this case, cells 2 and 3 each contribute 2 cores, and 35 GB and 42 GB of local memory respectively, with 11 GB of memory coming from ILM. (⅞ths of 88 GB is 77 GB of local memory and ⅛th is 11 GB of ILM; cell 2 has only 35 GB of local memory left after cow's allocation, so cell 3 contributes the larger share.)

    Here are commands that could be used to establish the configuration:

vparcreate -p cow -a cpu::12 -a mem::6144 -a cell:1:mem::21504
    -a cell:2:mem::21504

vparcreate -p dog -a cpu::4 -a mem::11264 -a cell:2:mem::35840
    -a cell:3:mem::43008

Alternatively, a configuration of the same size including 1 GB of floating memory on each cell could be established with slightly different commands:

vparcreate -p cow -a cpu::12 -a mem::6144
    -a cell:1:mem::20480 -a cell:1:mem::1024:floating
    -a cell:2:mem::20480 -a cell:2:mem::1024:floating

vparcreate -p dog -a cpu::4 -a mem::11264
    -a cell:2:mem::34816 -a cell:2:mem::1024:floating
    -a cell:3:mem::41984 -a cell:3:mem::1024:floating

The commands for adding I/O to the vPars instances are not shown, because they would depend on the I/O devices needed in each instance and the location of the available I/O in the partition. Once again, the placement of the processors and memory usually has a greater impact on system performance than the location of the I/O.


The diagram in Figure 4 shows the example rx8640 with processor and memory resources allocated to the first two vPars instances from Table 2:

    Figure 4. Allocations for the first two custom vPars instances


The next two vPars instances, elk and fox, are easy to lay out: the processor and memory resources needed by each one are available within a single locality. The resource allocations would be distributed as shown in the following table:

Table 3. Resource distribution for the four custom vPars instances

Name | Cores cell 1 | Cores cell 2 | Cores cell 3 | GB CLM cell 1 | GB CLM cell 2 | GB CLM cell 3 | GB ILM
cow  | 6            | 6            |              | 21            | 21            |               | 6
dog  |              | 2            | 2            |               | 35            | 42            | 11
elk  | 2            |              |              | 14            |               |               | 2
fox  |              |              | 2            |               |               | 14            | 2

    The diagram in Figure 5 shows this graphically:


    Figure 5. Allocations for the custom vPars instances


    Modifying existing vPars instances

The vparmodify command can be used to change the composition of existing vPars instances. When the number of processor cores allocated is changed, this is referred to as "CPU migration". In LORA mode, the system automatically performs all CPU migration operations so as to conform as closely as possible to the LORA configuration rules.

For example, if the average processor utilization in the vPars instance cow is low while the utilization in instance dog is high, it might be desirable to migrate CPUs from instance cow to instance dog. Here is an example command sequence to migrate four cores:

vparmodify -p cow -d cpu::4

vparmodify -p dog -a cpu::4

After the first command is executed, the vPars monitor chooses 2 processor cores from cell 1 and 2 cores from cell 2 to be deleted from vPars instance cow. After the second command is executed, two processor cores from cell 2 and two cores from cell 3 will be added to instance dog. The new resource assignments would be as shown in the following diagram:


    Figure 6. Allocations after the CPU migration operations


It is also possible to migrate memory along with the processor cores, but only for memory that was specifically designated as floating memory. If 1 GB of memory from cell 2 in the vPars instance cow had been designated as floating memory, it could be migrated to instance dog with the following pair of commands:

vparmodify -p cow -d cell:2:mem::1024:floating

vparmodify -p dog -a cell:2:mem::1024:floating

    Summary of vPars configuration rules

    The rules for configuring vPars for LORA are simple:

  • Draw resources for each vPars instance from the minimal number of distinct localities.
  • If the number of localities is greater than one, balance the resources across those localities.
  • Do the best you can with I/O devices and with instances created from leftover resources.

A minor optimization, and one which also applies to 100% interleaved nPartitions, is to keep the cores on a socket in the same instance, because they share a common cache. Keeping such cores working together on the same application can give a small performance benefit, which is worth realizing if the choice of cores is otherwise arbitrary. Use mpsched -K to identify the cores that share a common socket. Similarly, mpsched -S shows which cores belong to the same proximity set, and those should be kept together if the choice of cores is otherwise unconstrained.
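As an illustrative recap (using the options named above; output formats vary by release):

# list the cores that share a common socket
mpsched -K
# list the cores that belong to the same proximity set
mpsched -S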


    Advanced tuning

An important part of the LORA value proposition is to deliver ease-of-use along with performance. Our goal is that LORA should work out-of-the-box, without the need for system administrators to perform explicit tuning. Several factors make the goal impossible to reach in every single case. The range of applications deployed across the HP-UX customer base is extremely diverse. So is the capacity of the servers: the applications could be deployed in a virtual partition with two processor cores and 3 GB of memory, or in a hard partition with 128 cores and 2 TB of memory. In addition, workloads can exhibit transient spikes in demand many times greater than the steady-state average.

Here is the LORA philosophy for coping with this dilemma: provide out-of-the-box behavior that is solid in most circumstances, but implement mechanisms that allow system administrators to adjust the behavior to suit the idiosyncrasies of their particular workload if they desire to do so. This section discusses some possibilities for explicit tuning to override the automatic LORA heuristics.

    numa_mode kernel tunable parameter

The numa_mode kernel tunable parameter controls the mode of the kernel with respect to NUMA platform characteristics. Because of the close coupling between memory configuration and kernel mode, it is recommended to accept the default value of numa_mode, which is 0, meaning to auto-sense the mode at boot time. Systems configured in accordance with the LORA guidelines will be auto-sensed into LORA mode; otherwise they will operate in SMP mode. As described in the numa_mode man page, the tunable can be adjusted to override the autosensing logic.

In LORA mode, HP-UX implements a number of heuristics for automatic workload placement to establish good alignment between the processes executing an application and the memory that they reference. Every process and every thread is assigned a home locality. Processes and threads may temporarily be moved away from their home localities to balance the system load, but they are returned home as soon as is practical. For process memory allocations, when the allocation policy stipulates the closest locality, the home locality of the process is used. For shared memory objects too large to fit within a single locality, the allocation is distributed evenly across the smallest number of localities that can accommodate the object. Any processes attaching to that shared memory object are then re-launched so as to be distributed evenly across the localities containing the memory.

    numa_sched_launch kernel tunable parameter

The numa_sched_launch parameter controls the default process launch policy. The launch policy refers to the preferred locality for processes forked as children of a parent process. In LORA mode, the default launch policy is PACKED, which places child processes in the same locality as their parent. Setting the parameter to the value 0 forces the default launch policy to be the same as it is in SMP mode. Individual processes can be launched with a custom policy by using the mpsched command.
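For example (an illustrative invocation, assuming the standard kctune(1M) interface; consult the man page for whether a reboot is required):

# revert the default launch policy to the SMP-mode behavior
kctune numa_sched_launch=0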

    numa_policy kernel tunable parameter

The numa_policy kernel tunable parameter governs the way HP-UX 11i v3 performs memory allocation on NUMA platforms. When the parameter is at its default value of 0, the kernel chooses the allocation policy at boot time based on the platform memory configuration. The system administrator can override the default choice at any time by changing the value of the parameter. The numa_policy man page contains the full details; a brief summary appears below.


Table 4. Values for the numa_policy tunable parameter

Value | Default memory allocation policy | Use cases
0 | automatically selected by the kernel at boot time | recommended for all common workloads and configurations
1 | from the locality closest to the allocating processor | in LORA mode or with lots of local memory configured
2 | from interleaved memory | in SMP mode or with lots of interleaved memory configured
3 | text and library data segments from interleaved memory; others from the locality closest to the allocating processor | highly threaded applications exhibiting spatial locality
4 | private objects from the closest locality; shared objects from interleaved memory | applications that access lots of global data

    mpsched command

In LORA mode, the kernel attempts to provide good alignment between the processors executing an application and the memory that they reference. To offer direct control over process placement, HP-UX 11i provides the mpsched command. The command reveals information about the localities in the system and controls the processor or locality on which a specific process executes. The man page has the full details on how to use the command.

The mpsched command can be used to bind processes to a particular locality. In the absence of this binding, the operating system might schedule processes in different localities as the workload ebbs and flows. The binding ensures that the processes are always executing in the same locality as the memory they allocate, and hence will experience low memory latency. The command can also be used to specify a launch policy to control the scheduling of children forked by a process.
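As an illustrative sketch (the process ID and locality number here are hypothetical), an already-running process can be bound to a locality:

# bind the process with PID 1234 to locality domain 2
mpsched -l 2 -p 1234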

Using mpsched to enforce good alignment for Java

HP-UX is usually able to achieve good alignment when running Java. Here are some guidelines for explicit tunes to enforce good alignment when running multiple instances of the Java virtual machine.

We'll assume that each Java instance is small enough in terms of processor and memory resources to fit into a single locality. It is best to subdivide the Java instance if it exceeds the capacity of a single locality.

Use the command mpsched -s to determine the number of localities in the partition and the locality identifier corresponding to each locality. It is usually not necessary to use any launch policy if there are fewer than three localities in the partition. If there are many localities, use commands similar to the following to launch Java, distributing the Java virtual machine instances evenly across two localities:

mpsched -l 2 java
mpsched -l 2 java
mpsched -l 3 java
mpsched -l 3 java

Since Java is highly threaded, it may be beneficial to set the numa_policy parameter to 3.


    For more information

For a discussion of SMP versus NUMA, see http://en.wikipedia.org/wiki/Symmetric_multiprocessing

Detailed information about the capabilities and configuration of the Superdome platform is available at http://h18000.www1.hp.com/products/quickspecs/archives_Division/11717_div_v1/11717_div.HTML

The definitive manual for nPartition administration is the nPartition Administrator's Guide at http://docs.hp.com/en/5991-1247B_ed2/index.html

The definitive manual for vPars administration is the HP-UX Virtual Partitions Administrator's Guide at http://docs.hp.com/en/T1335-90104/index.html

The best reference for configuring and tuning the Oracle database for HP-UX is the white paper The Oracle database on HP Integrity servers, at http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA2-0547ENW&cc=us&lc=en

    The new commands for LORA are described in the manual pages: parconfig(1m) and loratune(5).

To help us improve our documents, please provide feedback at www.hp.com/solutions/feedback

Technology for better business outcomes

Copyright 2009 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

    UNIX is a registered trademark of The Open Group.

    14655-ENW-LORA-TW, September 2009
