Guidewire Performance Guidelines

Please note: The content provided in this document is intended for the exclusive use of Guidewire partners and customers and is for educational purposes only. While great care has been taken to validate the content, Guidewire makes no warranties, either expressed or implied, concerning the accuracy, completeness, reliability, or suitability of the information.



Can be shared internally and with clients - June 8, 2007

Summary

This document provides recommendations on how to proactively design and tune high-performing infrastructures. It offers both generic and Guidewire-specific considerations and recommendations.

Table of Contents

Guidewire Performance Guidelines ... 1
1 Introduction ... 5
2 Disclaimer ... 5
3 Scope and variables ... 5
4 Generic considerations ... 5
4.1 32 bits platforms ... 5
4.2 Large pages ... 6
4.3 Generic performance considerations ... 6
4.3.1 Processing power ... 6
4.3.2 Memory ... 7
4.3.3 Network ... 7
4.3.4 Storage ... 7
5 Application tier ... 8
5.1 Generic guidelines ... 8
5.1.1 Java and performance ... 8
5.1.1.1 Java memory management ... 8
5.1.1.2 Memory leaks in Java ... 9
5.1.1.3 Java memory settings ... 9
5.1.1.3.1 Heap size and 32 bits platform ... 9
5.1.1.3.2 Java and contiguous memory allocation ... 10
5.1.1.3.3 Large pages ... 10
5.1.1.3.4 A look into the future ... 10
5.1.1.4 Conclusion ... 10
5.1.2 Web/Application server separation ... 10
5.1.3 Java client and server mode ... 11
5.2 Generic performance analysis methodology ... 12
5.2.1 Client interaction loads ... 12
5.2.1.1 Processing power ... 12
5.2.1.2 Memory ... 12
5.2.1.3 Network ... 12
5.2.1.4 Storage performance ... 13
5.2.2 Batch interaction loads ... 13
5.2.3 Database connection pooling ... 13
5.3 Guidewire specifics ... 13
5.3.1 Application server caching ... 13
5.3.1.1 Design ... 13
5.3.1.1.1 Cache architecture ... 13
5.3.1.1.2 Cache and stickiness ... 13
5.3.1.1.3 Concurrent data change prevention ... 14
5.3.1.1.4 Cache and clustering ... 14
5.3.1.2 Performance impact ... 14
5.3.1.2.1 Orders of magnitude ... 14
5.3.1.2.2 Insights into cache behavior ... 15
5.3.1.2.3 Cache trashing ... 15
5.3.1.2.4 Cache impact on memory utilization ... 15
5.3.1.3 Analyzing & Tuning ... 16
5.3.1.3.1 Cache settings ... 16
5.3.1.3.2 Analyzing cache settings ... 16
5.3.1.3.3 Data distribution analysis ... 16
5.3.1.3.4 Evidence of cache trashing ... 17
5.3.2 Application server clustering ... 17
5.3.3 Typelist refresh and interaction with application server ... 17
5.3.4 Guidewire application database connection pooling ... 17
6 Database tier ... 18
6.1 Generic guidelines ... 18
6.1.1 Storage, the "hidden" bottleneck ... 18
6.1.1.1 Some examples ... 18
6.1.1.2 Database storage workload ... 19
6.1.1.3 Storage performance considerations ... 19
6.1.1.4 Recommended solution ... 19
6.1.1.5 Designing storage for performance ... 21
6.1.1.6 Usual customer complexities ... 21
6.1.1.7 Some other non-trivial factors ... 22
6.1.1.8 Schedulers ... 22
6.1.2 Operating system configuration ... 22
6.1.2.1 Asynchronous IO ... 22
6.1.2.2 Raw/Direct IO ... 23
6.1.2.3 Logical Volume Management ... 24
6.1.2.4 IO size settings ... 24
6.1.2.5 Final points on S.A.M.E architecture ... 25
6.1.3 Testing storage ... 25
6.1.4 Database ... 25
6.1.4.1 Enabling & configuring Async IOs ... 25
6.1.4.2 Database and memory ... 25
6.1.4.2.1 32 bits systems and large database caches ... 25
6.1.4.2.2 Large memory optimization ... 26
6.1.4.2.3 Guidewire specifics ... 26
6.1.4.2.4 Oracle ... 26
6.1.4.2.4.1 Memory utilization optimization ... 26
6.1.4.2.4.2 The optimizer ... 26
6.1.4.2.4.2.1 Generic information ... 26
6.1.4.2.4.2.2 Guidewire specifics ... 27
6.1.4.2.4.2.3 Optimizer complexities ... 27
6.1.4.2.4.2.4 Outlines ... 27
6.1.4.2.4.3 Oracle settings ... 28
6.1.4.2.5 MS-SQL ... 28
6.2 Generic performance analysis methodology ... 28
6.2.1 Platform analysis ... 28
6.2.1.1 Client interaction loads ... 29
6.2.1.1.1 Processing power ... 29
6.2.1.1.2 Memory ... 29
6.2.1.1.3 Network ... 29
6.2.1.1.4 Storage performance ... 29
6.2.1.1.5 Database monitoring ... 29
6.2.1.1.5.1 Oracle ... 29
6.2.1.2 Batch interaction loads ... 29
6.2.1.2.1 Processing power ... 29
6.2.1.2.2 Memory ... 30
6.2.1.2.3 Network ... 30
6.2.1.2.4 Storage performance ... 30
6.2.1.2.5 Database monitoring ... 30
6.2.1.2.5.1 Oracle ... 30
6.3 Guidewire specifics ... 30
6.3.1 Data distribution ... 30
6.3.2 Batch loads ... 30
6.3.2.1 Validation & Population complexities ... 30
6.3.2.2 Storage settings ... 32
6.3.3 All loads ... 32
6.3.3.1 Application logs ... 32
6.3.3.2 Storage settings ... 32
6.3.3.3 Batch server and performance ... 32
6.3.3.4 Database catalog statistics update ... 32
6.3.3.4.1 When to update database catalog statistics ... 32
6.3.3.4.2 How to update database catalog statistics ... 33
6.3.3.4.3 Additional elements ... 33
6.3.3.5 Database catalog statistics page ... 33
6.3.3.6 Database default settings (currently Oracle only) ... 34
6.3.4 Adding indexes to the Guidewire schema ... 34
6.3.5 Finders ... 34
6.4 Experience from the field ... 35
6.4.1 Conversion performance issues ... 35
6.4.1.1 Forcing statistics updates during conversion ... 35
6.4.2 All load issues ... 35
6.4.2.1 Optimization rules settings (Oracle only) ... 35
7 References ... 35
8 Planned additions to the document ... 37
9 Lexicon ... 37
10 Appendix ... 37
10.1 RAID-10 versus RAID-5 ... 37
10.1.1 RAID-10 calculus ... 38
10.1.2 RAID-5 calculus ... 38
10.1.3 Conclusion ... 38


1 Introduction

Guidewire products (ClaimCenter, PolicyCenter, BillingCenter and ContactCenter) are typical N-tier applications. This document provides insight into Guidewire system performance and into designing a high-performance system. For every tier of the application, it provides:

• Generic, non-Guidewire-specific guidelines for that tier
• Fine-tuning recommendations
• Shared experience from past performance issues and the associated remediation strategies

2 Disclaimer

This document summarizes performance recommendations for Guidewire applications. Significant time and effort are dedicated to developing these recommendations and keeping them up to date. Guidewire supports its core products (ClaimCenter, PolicyCenter, BillingCenter and ContactCenter) but does not formally support the recommendations provided in this document. Each customer implementation may vary significantly (code customization, data distribution…). When applying these recommendations, Guidewire recommends that customers test them through a full development lifecycle (integration testing, user acceptance and performance testing) to ensure proper behavior in production.

3 Scope and variables

This document does not constitute an exhaustive performance tuning guide; it intends to address performance issues that would impact Guidewire products. Much documentation is available on performance tuning for every tier of the application. This document points to the most important generic performance considerations for each tier, along with some performance tunings more specific to Guidewire products. Specific tunings performed by customer staff should be coordinated with Guidewire, to ensure that no modifications with potential performance and/or integrity consequences are introduced. Guidewire products have been tuned to perform well while preserving integrity; a tuning that appears beneficial at first glance may prove detrimental to a customer's systems and data. The Guidewire Platform Guidelines should be consulted for sizing recommendations on:

• Application server (number of processors, memory)
• Database server (number of processors, memory)
• Storage (number of disks, design)

4 Generic considerations

4.1 32 bits platforms

Today, all common hardware platforms support 64-bit operation, and many provide both 32- and 64-bit modes; the operating system installed determines which mode the system runs in. Document reference 15 provides further details on 64-bit platforms. Some software platforms exist only in 32-bit form: the Sun JVM 1.4.2 for Linux and Windows, and the IBM JVM supporting WebSphere 5.1.1.x for AIX, Windows and Linux, are available only in 32-bit mode. 32-bit systems have an inherent limit of 4GB of addressable memory per application. This memory space is sliced into several regions (execution pages, data pages, shared memory). Additional limitations stem from arcane memory architectures, such as the low and high memory pages of the x86 architecture.
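As a back-of-the-envelope check on the 4GB figure, the 32-bit address space can be computed directly. This is a sketch only; which regions are reserved, and how large they are, varies by operating system:

```shell
# A 32-bit pointer can address at most 2^32 distinct bytes.
total_bytes=$(( 1 << 32 ))
total_gb=$(( total_bytes / 1024 / 1024 / 1024 ))
echo "32-bit addressable space: ${total_gb} GB total"
# Code, data, and shared-memory segments all compete for this
# space, so the usable heap or cache is well below 4 GB in practice.
```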


Practically, the memory available for certain segments, such as the database cache (shared memory) or the Java heap, is more limited than 4GB. The following chapters detail the implications per tier.

4.2 Large pages

Operating systems manage memory in chunks (pages) of a specific size; for example, the default Linux page size on x86 architectures is 4KB. This size is well adapted to typical workloads. However, when a system uses large amounts of memory, managing it in many small chunks overly burdens the operating system. Reducing the number of pages also increases the likelihood that a frequently used page still has its virtual address translation in the processor's Translation Lookaside Buffer (TLB), which benefits performance; the TLB is documented at document reference 37. For example, a Linux x86/x86_64 system managing 12GB of memory handles roughly 3 million 4KB pages.

Most operating systems (Linux, AIX, Windows) implement a mechanism to set aside amounts of memory that are managed in larger chunks, thereby reducing management load. On Linux x86/x86_64, the large chunk size is 2MB; for a 12GB pool, this corresponds to only about 6,000 pages. These memory pools can also be allocated at boot, which ensures that a properly sized contiguous chunk of memory is available for applications and generally allows larger contiguous allocations than are possible once memory has become fragmented. The feature is called Big, Large, Very Large or Huge pages depending on the operating system. Memory allocation APIs expose specific flags for allocating from large-page pools, so applications must be specifically coded to use large pages, and a specific flag is frequently needed to force an application to do so. Database platforms were the first to leverage this capability; some Java platforms also support it.

Both the application and the database run on servers whose performance can be assessed by analyzing different metrics; the following sections provide insight into these metrics. Platform-specific commands to gather them are provided in the "Guidewire Troubleshooting Guidelines".
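The page-count arithmetic above can be sketched in shell. The 2MB huge-page size and the `vm.nr_hugepages` sysctl are Linux x86_64 conventions; other platforms use different sizes and mechanisms:

```shell
# Huge pages needed to back a 12 GB pool on Linux x86_64,
# where the default huge page size is 2 MB.
pool_mb=$(( 12 * 1024 ))              # 12 GB expressed in MB
huge_pages=$(( pool_mb / 2 ))         # 2 MB per huge page
small_pages=$(( pool_mb * 1024 / 4 )) # versus standard 4 KB pages
echo "huge pages needed:    ${huge_pages}"   # about 6,000
echo "4 KB pages displaced: ${small_pages}"  # about 3 million
# The pool would typically be reserved at boot, e.g. via the
# vm.nr_hugepages sysctl on Linux.
```

The three-orders-of-magnitude drop in page count is what reduces both the management overhead and the TLB pressure described above.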

4.3 Generic performance considerations

4.3.1 Processing power

Processing power usage is sliced into different categories:

• User time tracks the time spent by applications (application server or database).
• System time tracks the time spent by the operating system processing application and hardware requests. This percentage should generally be very low (no higher than 5-10%). High values point to potentially excessive kernel activity, which may be caused by many factors and should be investigated.
• Iowait time tracks the time that applications have spent waiting for IO completions with nothing else to do. This metric is fundamental when investigating whether storage is performing correctly, but it can be very misleading for several reasons:

o When applications wait for IO completion, they are not active; the system (application and operating system) is actually still during iowaits.
o Because the system (application and operating system) is quiet during iowaits, a newly started CPU-intensive process will consume that time, lowering and possibly zeroing the iowait figure.


• Idle tracks the time the system (application and operating system) is idle and not waiting for IO completions.
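On Linux, these categories come from the aggregate `cpu` line of `/proc/stat`. The sketch below parses a captured sample line rather than the live file, so the figures are invented for illustration:

```shell
# /proc/stat aggregate line fields: user nice system idle iowait irq softirq ...
# (cumulative ticks; real monitoring diffs two samples over an interval)
sample="cpu 5000 0 500 4000 500 0 0"
set -- $sample
shift                                   # drop the "cpu" label
user=$1 nice=$2 system=$3 idle=$4 iowait=$5
total=$(( user + nice + system + idle + iowait ))
echo "user:   $(( 100 * user   / total ))%"
echo "system: $(( 100 * system / total ))%"   # investigate if above 5-10%
echo "iowait: $(( 100 * iowait / total ))%"
echo "idle:   $(( 100 * idle   / total ))%"
```

Tools such as vmstat, mpstat and sar report the same categories without manual parsing.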

4.3.2 Memory

Memory is used by both applications and the operating system. The Virtual Memory Manager adds a further virtualization layer for applications, between virtual memory and the underlying physical memory and swap space. It manages memory in pages; different operating systems can use drastically different page sizes, and the size is tunable. Systems with large amounts of memory may want to use a large page size to reduce the number of pages to manage. Performance-wise, virtual memory introduces two mechanisms:

• Swapping: as applications and their data overflow the available physical memory, the operating system decides to passivate some applications and associated data to disk (swap space).
• Paging: the system keeps track of each page's usage; pages that are rarely used may be passivated to disk.

Swapping is an absolute performance killer, while paging is normal and has little performance impact. When checking memory utilization, the following points should be kept in mind:

• On most operating systems, much of the available free memory is used as filesystem cache; if an application needs memory, the operating system frees cache accordingly. This behavior, which is tunable, often obscures memory usage analysis, since one may think little memory is available when most of it is merely cache.
• Swapping is the real indication of a physical memory shortage. Any swapping should be remediated promptly.
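Because filesystem cache makes free memory look scarce, the reliable signal is swap activity. The sketch below checks the si/so columns of a captured `vmstat` sample line; the column order follows the standard Linux `vmstat` output, and the numbers are illustrative:

```shell
# vmstat data columns: r b swpd free buff cache si so bi bo in cs us sy id wa
sample="1 0 204800 51200 80000 620000 0 0 12 40 300 500 20 5 70 5"
set -- $sample
si=$7 so=$8                     # pages swapped in / out per second
if [ "$si" -eq 0 ] && [ "$so" -eq 0 ]; then
  echo "no swapping: low free memory is likely just filesystem cache"
else
  echo "active swapping: investigate physical memory shortage promptly"
fi
```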

4.3.3 Network

Network can affect performance in two ways:

• Network usage can saturate a link's capacity. Document reference 25 provides metrics on the matter. In such a case, the link may need to be replaced by a faster one (with possible changes to the larger network) or coupled with another one (through NIC bonding).
• The network can experience errors resulting in re-transmissions, with significant adverse effects.

The following metrics should be captured:

• Number of bits in/out
• Packets discarded
• Errors
• TCP errors and failures

The first metric helps identify whether the network is nearing its maximum capacity; the others help identify network issues.
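On Linux, most of these counters are exposed per interface in `/proc/net/dev`. The sketch parses one captured interface line following that file's standard field layout; the interface name and figures are invented:

```shell
# /proc/net/dev per-interface fields:
#   rx: bytes packets errs drop fifo frame compressed multicast
#   tx: bytes packets errs drop fifo colls carrier compressed
sample="eth0: 1250000000 900000 0 3 0 0 0 0 980000000 850000 2 0 0 0 0 0"
set -- $sample
shift                                   # drop the "eth0:" label
rx_bytes=$1 rx_errs=$3 rx_drop=$4
tx_bytes=$9 tx_errs=${11} tx_drop=${12}
echo "rx: ${rx_bytes} bytes, ${rx_errs} errors, ${rx_drop} dropped"
echo "tx: ${tx_bytes} bytes, ${tx_errs} errors, ${tx_drop} dropped"
# Error/drop counters that keep climbing between samples point to
# re-transmissions and should be investigated.
```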

4.3.4 Storage

Applications such as databases require a high-performing storage system to function correctly. When investigating storage performance, several arcane metrics are critical to understand:

• Serve time tracks the time between the kernel's IO request (read or write) to the storage and the storage's response. During its 8,500-concurrent-user benchmark with AIX/pSeries and high-end storage, Guidewire measured average serve times around 3ms. Some benchmarks indicate that an


optimal metric is around 2-3 ms. Similarly Page 8 of document reference 42 documents the average serve time for an Oracle E-Business benchmark on Oracle RDBMS. In such case, the average serve time is at 2.54 ms.

• Wait time tracks the time between the application's IO request to the kernel (read or write) and the response to the application. This metric compounds the kernel processing time and the storage processing time (the serve time above). During benchmarks, Guidewire noted that the kernel processing time may differ significantly across platforms. When similar loads were applied to AIX 5.3 on pSeries and Redhat Linux 4 on commodity hardware, the kernel overhead was found to be significantly higher on the latter, resulting in a much higher average wait time while the serve time was similar on both platforms. Most platforms provide this metric directly, though on some, like AIX, it must be inferred from application-side metrics (e.g. Oracle Statspack). During the 8500 concurrent users benchmark with an AIX/pSeries database, Guidewire noted that the average wait time during the performance test was around 3 ms; the kernel processing time was small enough that average wait and serve times were both at 3 ms. During a 5000 concurrent users load with Redhat Linux 4, the average serve time was measured below 1 ms while the average wait time was at 9 ms. The difference of roughly 8 ms corresponds to the kernel processing time, which was therefore much higher with Linux/commodity hardware than with AIX/pSeries. Linux's less granular storage drivers constitute a partial root cause of this platform difference.

• Number of IOs and associated sizes. Document reference 25 provides some sizing metrics on the expected database storage load. The following metrics will help corroborate the system's behavior against these figures:

o Average IO size: during regular user load, the database storage load is composed of many IOs with a size equal to the database block size (8KB in general). During large batch jobs, the database may be able to process data in larger IO chunks; the average IO size allows witnessing this behavior. Some older OS and hardware platforms will unnecessarily split larger IOs. This should be avoided.

o Number of read and write IOs and the associated throughput.
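The relationships between these metrics can be made concrete with a small helper; the numbers below reuse figures from the text (8KB database blocks, and the Linux example where a 9 ms wait time against a sub-1 ms serve time implies roughly 8 ms of kernel overhead):

```python
# Sketch: derive the storage metrics discussed above from raw counters.
def avg_io_size_kb(kb_transferred, io_count):
    """Average IO size; near the database block size under user load."""
    return kb_transferred / float(io_count)

def kernel_overhead_ms(avg_wait_ms, avg_serve_ms):
    """Wait time compounds kernel time and storage serve time."""
    return avg_wait_ms - avg_serve_ms

if __name__ == "__main__":
    # Regular user load: many single-block IOs (8KB database block size).
    print(avg_io_size_kb(kb_transferred=80000, io_count=10000))
    # The Linux example from the text: 9 ms wait, roughly 1 ms serve.
    print(kernel_overhead_ms(9.0, 1.0))
```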

5 Application tier

5.1 Generic guidelines

5.1.1 Java and performance

5.1.1.1 Java memory management

Guidewire applications run within a Servlet engine container (Tomcat, WebSphere or WebLogic) in a Java Virtual Machine environment. Java provides platform-side memory management that significantly simplifies coding: the Java runtime periodically identifies unused objects and reclaims the associated memory. This process is called garbage collection. During garbage collection, the Java runtime freezes some or all execution threads, which negatively impacts performance. Recent Java releases have significantly improved performance by using different types of garbage collections:

• Minor garbage collections mainly reclaim memory allocated to new objects. They are relatively frequent and complete quickly (around 0.1 seconds).

• Full garbage collections reclaim memory from all objects and compact memory to avoid fragmentation. They are rare and can be long (2 to 5 seconds).
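A collection profile can be extracted from GC logs. The sketch below assumes the classic Sun JVM -verbose:gc line shapes shown in the comments; real log formats vary by JVM vendor and flags, so the regular expression is illustrative:

```python
# Sketch: classify -verbose:gc output into minor vs full collections
# and total pause time. Assumed line shapes:
#   [GC 32768K->8192K(65536K), 0.0312 secs]        (minor)
#   [Full GC 51200K->16384K(65536K), 2.8413 secs]  (full)
import re

GC_LINE = re.compile(r"\[(Full )?GC .*?, ([0-9.]+) secs\]")

def gc_profile(log_text):
    minor, full, pause = 0, 0, 0.0
    for match in GC_LINE.finditer(log_text):
        if match.group(1):
            full += 1
        else:
            minor += 1
        pause += float(match.group(2))
    return {"minor": minor, "full": full, "total_pause_secs": round(pause, 4)}

if __name__ == "__main__":
    log = ("[GC 32768K->8192K(65536K), 0.0312 secs]\n"
           "[GC 40960K->9000K(65536K), 0.0288 secs]\n"
           "[Full GC 51200K->16384K(65536K), 2.8413 secs]\n")
    print(gc_profile(log))
```

A healthy profile shows a large minor count, a very small full count, and short total pauses.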


A healthy garbage collection profile is composed of a vast majority of minor garbage collections with very rare full garbage collections. A more complex application causes the Java platform to manage more objects, which leads to more frequent collections and lower performance.

The heap size has an impact on the garbage collection profile: larger heaps cause less frequent full garbage collections, but potentially longer ones. On the other hand, a full garbage collection on a larger heap has more space available to compact the memory, thereby ensuring better compaction. Analysis of industry benchmarks (c.f. document reference 38) indicates that the best performance points are generally at the higher end of the possible sizes. Benchmarks on 32 bits JVMs will use the largest heap size available (c.f. Heap size and 32 bits platform). Benchmarks on 64 bits JVMs frequently use only slightly higher heap sizes than 32 bits JVMs. This indicates that the heap size equilibrium is close in both cases and is tied more to JVM and hardware scalability than to the maximum heap size. Therefore, specific heap sizing should be done on a case-by-case basis. Document references 16 & 17 provide some insights into Java 1.4.2 memory management. Document reference 19 details how to track garbage collection activity.

5.1.1.2 Memory leaks in Java

The Java runtime declares an object unused if it is not referenced by any other object. Coding errors may cause an object to no longer be used while still being referenced; in such a case, the Java runtime will not reclaim the memory associated with that object. The corresponding memory is lost and will not be reclaimed until the application is stopped. Alternatively, the JVM algorithms may contain errors that fail to reclaim certain objects that are no longer referenced. As on all platforms, memory leaks will, in the long run, cause out-of-memory errors. In Java, memory leaks also place extra hardship on the garbage collector: the frequency of minor and major garbage collections increases, which slows down the application. Document reference 18 provides additional insights about memory leaks. The garbage collection monitoring detailed earlier will help determine whether the application suffers from memory leaks.

5.1.1.3 Java memory settings

5.1.1.3.1 Heap size and 32 bits platform

Guidewire has tested ClaimCenter 3.0 and 3.1 with a default heap size of 1GB. Tests run on newer hardware have shown that additional memory provides performance improvements. This chapter explores means to raise the heap size. Guidewire tested this tuning with Tomcat 5.5.17/Sun JVM 1.4.2 and Redhat 4 only; recommendations for other platforms are theoretical. Some have been validated with IBM support while others have been gathered from vendor documentation. Up to and including ClaimCenter 4.0, all application server platforms supported by Guidewire are 32 bits. Analysis shows the following:

• Sun JVM 1.4.2, which supports the Tomcat 5.5.17/Redhat 4, the Tomcat 5.5.17/Windows 2003 and the WebLogic 8.1/Windows 2003 combinations, exists only in 32 bits mode for Windows and Linux.

• WebSphere 5.1.x is only supported with the 32 bits version of the IBM JVM 1.4.2. The WebSphere/JVM combination is supported on AIX 32 & 64 bits, Windows 2003 32 bits and Suse 9.3 32 bits.

Guidewire investigated means to raise the heap size on 32 bits platforms. Analysis with IBM support and from other sources indicates that the heap size limitation is a function of the operating system platform, not the application server and/or JVM. The following maximum heap sizes are available per platform:

• Linux (Redhat or Suse): Guidewire tested larger heap sizes with Tomcat 4.1.29/Sun JVM 1.4.2/Redhat 3 32 bits and Tomcat 5.5.17/Sun JVM 1.4.2/Redhat 4 32 bits. The default Linux elsmp kernel and the more memory-capable hugemem kernel were used (c.f. document reference 10 for more information about the hugemem kernel). A maximum heap of 2600MB was achieved with the elsmp kernel and 2900MB with the hugemem kernel. Performance tests were better, though, with the elsmp/2600MB combination. Both were significantly better than with a default 1024MB heap. Suse 9.3 is very similar to Redhat 4; therefore, the same should apply to the Suse 9.3/WebSphere 5.1.1.x combination.

• AIX: IBM indicates that a maximum heap size of 3.25GB can be allocated to AIX 32 bits. Document reference 26 provides further details on this matter.

• Windows: the Windows memory model is more limited than that of AIX or Linux. Therefore, the maximum heap size on Windows 2003 32 bits is around 1.5GB. Document references 27 & 28 provide some additional information on these limitations.

5.1.1.3.2 Java and contiguous memory allocation

Java allocates its heap in a single contiguous memory chunk. When a Virtual Machine starts, different memory segments are assigned to libraries, compiled code and other objects. These segments may reduce the size of the largest contiguous chunk available for the heap. As the application changes, the segments may move, changing the maximum contiguous chunk. This reinforces the need to thoroughly test increases in heap size.

5.1.1.3.3 Large pages

JVM 1.4.2 is the supported version for ClaimCenter 4.0 and some earlier releases. Among the supported 1.4.2 JVMs, the IBM JVM 1.4.2 for AIX is the only one which can leverage large pages. Customers running WebSphere 5.1.1.x on AIX should test this feature and evaluate its performance advantages.

5.1.1.3.4 A look into the future

Newer versions of Java (1.5 & 1.6) provide some very interesting features that will alleviate the issues described above:

• These platforms exist in 64 bits editions for most operating system platforms. These 64 bits platforms will allow allocating larger heap spaces.

• These platforms can use large pages on all operating systems (c.f. Large pages). This may improve performance by reducing the page management burden and allow allocating larger heap sizes by guaranteeing large contiguous memory chunks. Details are available in document reference 29.

5.1.1.4 Conclusion

Guidewire has performance tested ClaimCenter 3.0 & 3.1 with min and max heap sizes of 1GB. Later tests on Tomcat/Sun JVM/Redhat have shown that larger heap sizes improve performance. Therefore, larger heap sizes may be explored during performance tests. The maximum heap size differs per platform. Larger heap sizes must be very thoroughly performance tested to measure real performance gains and validate stability. Customers interested in exploring specific tunings can analyze the results provided in industry benchmarks. The Standard Performance Evaluation Corporation provides Java-specific benchmarks which can be analyzed to understand current tunings. More specifically, the JBB2000 and JBB2005 benchmarks are relevant: JBB2000 contains JVM 1.4.2 benchmarks while JBB2005 contains JVM 1.5 benchmarks. Document reference 38 provides the links to these sources of information.

5.1.2 Web/Application server separation

An N-tier architecture traditionally comprises the following server-side elements:

• A web server, which handles http requests, interacts with the application server and returns http replies
• An application server, which hosts the application


• A database server

More recent application servers, such as the ones supported by Guidewire, can handle web serving directly. For Tomcat, the Coyote connector (c.f. document reference 30) supports http requests directly. Similarly, starting with WebSphere 5.0.2, a similar mechanism exists (c.f. document reference 31, page 20). Web servers provide a performance improvement in the following cases:

• The application contains a lot of static content that can be cached by the web server
• Access to the application is encrypted, with encryption offloaded to the web server

The Guidewire application generally contains very little static content; therefore, the first case is generally not applicable. The second case applies if customers want to leverage encryption; document reference 32 details this. Otherwise, Guidewire recommends that customers configure their application servers to handle web serving directly.

5.1.3 Java client and server mode

The Java platform initially started as a client side technology. Later, the Java platform was used as a server side technology, supporting server applications accessed by remote clients. The Guidewire application is in the latter category, with the application layer supported by Java. Client and server usage is different in nature and requires different performance profiles:

• Client side applications are accessed by one user for a limited period of time. Load is intermittent and non intensive.

• Server side applications are accessed by many users for extended periods of time. Load is continuous and possibly intensive.

These very different usage patterns drive very different optimizations for client or server usage. Therefore, Java Virtual Machines generally have a client and a server mode, each enabling the appropriate optimizations for the corresponding usage. IBM promoted Java as a server-side technology very early on; IBM JVMs run in server mode by default, with no client mode available. Sun has promoted Java as both a client and a server technology, and both modes are available. For JVM 1.4.2, the Java Virtual Machine binary differs depending on the mode chosen. Additionally, depending on the platform (Solaris, Linux, Windows), the default is either server (Solaris) or client (Linux, Windows). Guidewire supports the following combinations on the Sun JVM:

• Tomcat on Windows 2003 & Redhat Linux
• WebLogic on Windows 2003

The support matrix, managed by Product Management, details this further. On these platforms, the Sun JVM runs in client mode by default, which is not optimal for Guidewire applications. Guidewire ran internal performance tests to compare both modes with the Tomcat/Redhat Linux combination; these tests showed that the server mode is 2 to 3 times faster than the client mode on this platform. Therefore, Guidewire recommends enabling the server mode on these platforms. This is done by adding a "-server" string to the Java startup command. This recommendation should be applied early in the development process and later validated in performance testing. It applies only to platforms using the Sun JVM; the IBM JVM runs in server mode by default. Further details are provided in document reference 35.


Default modes may change in newer versions of the Sun JVM. So far, both modes remain available on these newer platforms and the -server flag enables the server mode.

5.2 Generic performance analysis methodology

The infrastructure performance analysis methodology is provided at Generic performance considerations. Application server performance issues can occur due to issues with:

• Processing power
• Memory
• Network bandwidth

Storage is not a limiting factor for application servers.

5.2.1 Client interaction loads

The application server layer is stressed mainly by client requests. Therefore, analysis of application server performance is mainly relevant for such loads.

5.2.1.1 Processing power

During loads with significant application server participation (user actions), the typical application server CPU usage pattern is periods of stable usage with regular decreases due to garbage collection. Minor garbage collections will be hardly noticeable. Full garbage collections should be noticeable though hopefully rare (c.f. Java memory management).

5.2.1.2 Memory

The different memory spaces used by Java applications are:

• Heap: the main memory segment of a Java virtual machine, in which objects are allocated. Newer garbage collection mechanisms implement generational garbage collection (c.f. document reference 16). Heap sizing is discussed at: Java memory settings. Guidewire recommends that the same value be used for min and max heap size.

• Permanent generation: a memory segment which holds all the reflective data of the JVM, such as classes and object methods. It is considered a subpart of the heap. Its sizing differs per JVM:

o Sun VM: the permanent generation is allocated separately from the heap. By default it is limited to 64MB, which can cause conflicts and object unloading. The size can be raised by changing MaxPerm (c.f. reference 33).

o IBM VM: it is not clear whether the permanent generation is allocated separately from the heap. It has no maximum and will therefore be increased as needed.

• Other segments

With equal settings for min and max heap size, the Java virtual machine will gradually increase its memory utilization up to a maximum where it will remain stable. Consecutive garbage collections will free up heap space while the corresponding memory remains allocated to the Java virtual machine. If any swapping is experienced under load, memory settings should be resized immediately. Some paging may be acceptable, though it should remain minimal.
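As an illustration of the settings above, a hypothetical Sun JVM 1.4.2 startup configuration might look as follows. The values are placeholders to be validated in performance testing, not site recommendations; note that the actual Sun flag for the permanent generation limit is -XX:MaxPermSize:

```shell
# Illustrative JVM options for a Tomcat-style startup script:
# identical min/max heap (as Guidewire recommends), a raised permanent
# generation, server mode, and GC logging to track the collection profile.
JAVA_OPTS="-server \
  -Xms1024m -Xmx1024m \
  -XX:MaxPermSize=128m \
  -verbose:gc"
```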

5.2.1.3 Network

During interactive loads, the application server will exchange a lot of data with the web clients and the database server. Infrastructure tier provides the generic guidelines on network performance.


5.2.1.4 Storage performance

The application server layer makes very little use of the storage. Therefore, unless the system experiences significant disk issues, storage should not be a bottleneck.

5.2.2 Batch interaction loads

Batch loads mainly stress the database layer of the infrastructure. Therefore, during batch loads, the application server layer should experience very little usage. If performance issues are experienced during batch loads, the same components should be inspected as for client interaction loads, though the application server is an unlikely suspect.

5.2.3 Database connection pooling

Application servers communicate with the database using JDBC connections. Database connections have the following general characteristics:

• Opening a connection may take a long time (up to several seconds)
• A connection consumes memory on both the application server and database side
• A connection is not needed throughout the processing of a user request, but only when interacting with the database

For all these reasons, database connections are organized in pools. Connections are kept open and reassigned across worker threads upon database access. This modus operandi has the following benefits:

• A pool of open connections is available; a processing thread can immediately communicate with the database, without the delay associated with opening a connection
• Fewer connections are needed, minimizing the amount of memory consumed on both the application server and the database

Most application server containers provide a database connection pool. Alternatively, the application can implement its own database connection pool mechanism.
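The pooling behavior can be sketched as follows; ConnectionPool and the open_connection factory are illustrative stand-ins, since a real container-provided JDBC pool adds validation, timeouts and sizing policies:

```python
# Minimal connection-pool sketch: pay the expensive open cost up front,
# then hand the same open connections to successive worker threads.
import collections

class ConnectionPool(object):
    def __init__(self, open_connection, size):
        # Opening is slow (up to several seconds each); do it once.
        self._idle = collections.deque(open_connection() for _ in range(size))

    def acquire(self):
        # Immediate: no open() delay. A real pool would block or grow
        # instead of failing when the pool is exhausted.
        return self._idle.popleft()

    def release(self, conn):
        self._idle.append(conn)   # kept open for the next worker

if __name__ == "__main__":
    opened = []
    pool = ConnectionPool(lambda: opened.append(1) or object(), size=5)
    c = pool.acquire()
    pool.release(c)
    pool.acquire()
    print(len(opened))   # opens happen once, regardless of acquire count
```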

5.3 Guidewire specifics

5.3.1 Application server caching

Guidewire implements an object caching mechanism at the application server layer. This mechanism limits access to the database, thereby significantly improving User Interface performance.

5.3.1.1 Design

5.3.1.1.1 Cache architecture

A cache exists for every object type. In addition to the primary object cache, each object type maintains a cache of array key-sets for each of its virtual arrays. Each array key-set stores the set of keys for an array of a cached primary object. For example, if a Claim X has 3 Exposures, with keys E1, E2, and E3, then there will be an entry in the Claim.Exposures cache for X->{E1, E2, E3}. Each array key-set cache has the same maximum size as the primary object cache; the size of an individual key-set is not limited. For example, Claim has several virtual arrays, including Exposures. If the Claim cache size is 500, then there will be a cache of key-sets for Exposures with a maximum size of 500, while the list of Exposure keys in an entry of Claim.Exposures is not limited, e.g. {E1, E2... E(n)}. This mechanism allows identifying whether the objects associated with a primary object are also in cache. The cache size can be tuned for all objects ("DefaultCacheSize") or on a per-object basis.
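The structure can be sketched as follows, using the Claim/Exposure example from the text. Class and attribute names (ObjectCache, array_keysets) are illustrative, not Guidewire's actual implementation, and size-based eviction is omitted for brevity:

```python
# Sketch: a primary object cache plus per-virtual-array key-set caches.
class ObjectCache(object):
    def __init__(self, max_size):
        self.max_size = max_size
        self.primary = {}          # key -> cached object
        self.array_keysets = {}    # array name -> {primary key -> set of keys}

    def put(self, key, obj):
        self.primary[key] = obj

    def put_array(self, array_name, primary_key, keys):
        self.array_keysets.setdefault(array_name, {})[primary_key] = set(keys)

    def array_fully_cached(self, array_name, primary_key, member_cache):
        """True if every member of the virtual array is itself cached."""
        keys = self.array_keysets.get(array_name, {}).get(primary_key)
        return keys is not None and keys <= set(member_cache.primary)

if __name__ == "__main__":
    claims, exposures = ObjectCache(500), ObjectCache(500)
    claims.put("X", {"claim": "X"})
    claims.put_array("Exposures", "X", ["E1", "E2", "E3"])
    for k in ("E1", "E2", "E3"):
        exposures.put(k, {"exposure": k})
    print(claims.array_fully_cached("Exposures", "X", exposures))
```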

5.3.1.1.2 Cache and stickiness

Ensuring that a user returns to the same application server is absolutely critical to leverage the cache mechanism correctly. Data locality is one of the main reasons a user needs a "sticky session" if web requests go through a load balancer. An application server's global cache contains data specific to that user; if the user hits another application server, that server's global cache will need to be filled with the user's active data. Since each application server manages its own cache, it is possible for the same object, e.g. Claim 012-033-5043, to live in the cache of two or more application servers at the same time. In fact, some object sets, such as users, likely live in the global cache of all application servers in a cluster.

5.3.1.1.3 Concurrent data change prevention

Different users, either on the same application server instance or on different ones, may change objects concurrently. Guidewire implements a data versioning mechanism to prevent corruption in such cases. When an object value is updated in the database, a counter associated with the object is compared to the counter in the database. A counter value mismatch indicates that the object was concurrently changed; in such a case, the change is rejected and the cache mechanism throws a concurrent change exception. The user who initiated the concurrent change is presented with the error, reloads the latest data and reapplies his/her changes. Furthermore, changes in a Guidewire application are committed in bundles, ensuring transactional integrity; protection against concurrent data changes is therefore enforced across the whole transaction. This mechanism is a standard design pattern called optimistic locking.

Concurrent change exceptions occur only if two users modify the same object(s). A proper organization of the user community should avoid this. Nevertheless, if two users do modify the same object(s), the users should make the proper adjustments and resubmit the accurate change. Any automatic resolution carries a significant risk of causing unwanted modifications. Guidewire's experience is that its optimistic locking mechanism causes very few concurrent change exceptions and that exceptions are resolved easily by users.

Other design patterns exist for concurrent data changes. Pessimistic locking ensures that, when one user or batch process wants to modify an object, all other users or batch processes are prevented from modifying it until the change is completed. There are many cases where pessimistic locking becomes completely dysfunctional: for example, a user or batch process might fail to complete its modifications and would block all others. Pessimistic locking systems generally become impractical.
Cache entries will likely become obsolete over time, which increases the likelihood of concurrent change exceptions. A timeout mechanism limits this by marking cache entries obsolete based upon the time of the last synchronization with the database. To avoid added complexity, Guidewire chose not to proactively evict obsolete objects from the cache. Nevertheless, when the cache is full, newer entries will evict obsolete ones. Until evicted, obsolete objects reside in the cache and use the associated memory.
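The version-counter check can be sketched as follows (names are illustrative; in practice the compare would typically happen in the database, e.g. in an UPDATE ... WHERE version = :expected statement):

```python
# Sketch of optimistic locking with a version counter.
class ConcurrentChangeException(Exception):
    pass

def commit_update(db_row, cached_version, new_values):
    """Apply new_values only if no one changed the row since we read it."""
    if db_row["version"] != cached_version:
        # Someone else committed first: reject, let the user reload.
        raise ConcurrentChangeException("row changed since it was read")
    db_row.update(new_values)
    db_row["version"] += 1

if __name__ == "__main__":
    row = {"version": 7, "status": "open"}
    commit_update(row, cached_version=7, new_values={"status": "closed"})
    print(row["version"])
    try:
        # Second writer still holds version 7: stale, so it is rejected.
        commit_update(row, cached_version=7, new_values={"status": "open"})
    except ConcurrentChangeException:
        print("rejected")
```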

5.3.1.1.4 Cache and clustering

The cluster inter-communication ensures that, if an object is changed in one application server's cache, a message is sent to the other application servers to tag the object's cache entries as obsolete. Once a cache entry is marked obsolete, the object value will be reloaded from the database the next time it is needed. This mechanism is different from full cache synchronization, where the new value of the object would be broadcast to the other application server nodes. Network failures or other issues can disrupt the communication between nodes; in such a case, object change messages may be lost. The data versioning mechanism will prevent corruption by raising concurrent change exceptions. When the disruption is corrected, the cache mechanism functions fully again, and concurrent change exceptions should no longer occur. Details on the clustering infrastructure are provided in document reference 15.
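The invalidation-only broadcast can be sketched as follows; NodeCache and the message passing are illustrative simplifications of the clustering mechanism described in document reference 15:

```python
# Sketch: peers receive an invalidation message (not the new value)
# and reload from the database on next access.
class NodeCache(object):
    def __init__(self, load_from_db):
        self._cache = {}
        self._load = load_from_db

    def on_change_message(self, key):
        self._cache.pop(key, None)   # tag as obsolete: drop the entry

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._load(key)   # reload on next access
        return self._cache[key]

if __name__ == "__main__":
    db = {"claim-1": "v1"}
    node_a, node_b = NodeCache(db.get), NodeCache(db.get)
    node_a.get("claim-1"); node_b.get("claim-1")
    db["claim-1"] = "v2"                   # node A commits a change...
    node_b.on_change_message("claim-1")    # ...and broadcasts invalidation
    print(node_b.get("claim-1"))           # reloaded from the database
```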

5.3.1.2 Performance impact

5.3.1.2.1 Orders of magnitude

Guidewire gathered some additional information on performance orders of magnitude regarding the cache. This section distinguishes the following caches:


• Application server cache: the cache described in this chapter
• Database server cache: the cache a database uses to store data retrieved from storage

For an action involving approximately 2000 objects, the following metrics were gathered:

• Pure processing time: 2 seconds
• Time to load data from the database cache into the application cache: 4-5 seconds
• Time to load data from storage into the database cache: 15-18 seconds

Therefore, the same action takes 2 seconds if the application cache contains the data, and up to 25 seconds if the data is in neither the application cache nor the database cache.

5.3.1.2.2 Insights into cache behavior

Proper cache behavior is critical to performance. Analyzing & Tuning describes how to size a cache correctly. Several points must be made about application server cache:

• It is purely local to an application server. Therefore, if one application server node's cache is loaded with information on a specific object, another node may not contain that information. For example, if an adjuster works on a claim, the corresponding objects are loaded on the application server node he/she is running on. If the adjuster's action must be approved by an approver, the approver may be interacting with another application server node; that node will likely not have the corresponding information in cache, so the approver may experience slower performance while its cache is loaded.

• The cache content is lost when the application server node is stopped. Therefore, when an application server is started, some lower performance may be expected during a “ramp-up” phase.

• Batch processes leverage the cache mechanism. Batch jobs may work on many objects. Therefore, they may use the cache extensively. This may have the adverse effect of prematurely evicting objects from the cache, thereby forcing additional cache loads.

Customers with heavy batch processes may want to consider dedicating a specific application server instance to batches.

5.3.1.2.3 Cache thrashing

Cache thrashing is a phenomenon whereby objects are removed from the cache prematurely, forcing additional unnecessary reloads that are detrimental to performance. There are several cases where cache thrashing can happen:

• A single data set is too large to reside in the global cache, forcing the same set to be loaded from the database and subsequently evicted potentially thousands of times while loading a single web page, resulting in very poor performance.

• Some concurrent actions may result in thrashing. For example, if a user is logged on to the batch server, a batch job, which can load many objects into the cache, can evict objects needed by the user and force them to be reloaded.

The chapters below provide further information on how to isolate and possibly remediate single data set cache thrashing. If thrashing is seen when using the batch server through the User Interface, Guidewire recommends that the batch server be dedicated to batch jobs only and not be accessed by users.

5.3.1.2.4 Cache impact on memory utilization

The maximum size of the cache is dependent on the cache parameters. Java does not provide a good means to estimate the memory usage of its objects; therefore, the maximum memory footprint of a cache cannot be reliably estimated. If the collective size of all the filled caches exceeds the maximum heap size, the application may eventually run out of memory. Cache segments dedicated to very frequently used objects will rapidly grow to maximum usage, while segments dedicated to less frequently used objects will reach maximum usage over a much longer time. Therefore, the maximum size of a cache may be reached very long after the application server was started. The risk of out-of-memory errors increases with larger caches. Additionally, larger caches expand the memory footprint of the application: performance will decrease as garbage collections become more frequent and analyze more objects. The cache should be made as big as needed, but no bigger. In practice, few individual caches should be made larger than 2000 (Transactions, Notes, etc.) and these typically shouldn't be larger than 10000. Garbage collection should be monitored to extrapolate memory usage patterns and garbage collection statistics.

5.3.1.3 Analyzing & Tuning

5.3.1.3.1 Cache settings

Config.xml contains a section called “Cache Parameters” which holds the cache settings. The settings can be defined for all objects or on a per-object basis. The Performance team uses a default cache size of 2000 and a Transaction cache size of 5000. These settings are optimized for an application server handling 300 concurrent users, with a typical claim having:

• 2 Exposures
• 10 Activities
• 10 History
• 10 Documents
• 10 ClaimContacts
• 35 Notes

As a starting point, customers should use these same settings. In theory, the global cache should contain all the "active" data used by all of the users the application server is currently handling. For example, if an application server is handling 300 concurrent users, and at any given time each user is actively working on 3 claims, then the Claim cache needs to hold at least 900 objects. If every Claim has 2 Exposures, then the Exposure cache needs to hold at least 1800 objects. If across their Claims every user is simultaneously working on 5 Activities, then the Activity cache needs to hold at least 1500 objects. Although a Claim has several Notes, Documents, Contacts, Activities and other related objects, in most cases these objects are not loaded into the cache when they are displayed in an LVV (list view). The cache is typically used when data is displayed in an NVV (name value view).
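The sizing arithmetic above can be sketched as a quick back-of-the-envelope calculation. This is illustrative only; `min_cache` is not a Guidewire API, just a restatement of the rule "minimum cache entries = concurrent users × objects each user actively touches at once":

```python
def min_cache(users, per_user):
    """Minimum cache entries = concurrent users x objects each user
    actively touches at once (claims, or objects across their claims)."""
    return users * per_user

users = 300
claim_cache    = min_cache(users, 3)      # 3 active claims per user
exposure_cache = min_cache(users, 3 * 2)  # 2 exposures per active claim
activity_cache = min_cache(users, 5)      # 5 activities per user across claims
print(claim_cache, exposure_cache, activity_cache)  # 900 1800 1500
```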

5.3.1.3.2 Analyzing cache settings

The InternalTools page (http://server:port/$WebApp/InternalTools.do) provides a “Cache Info” link. To access that page, one must first log in as an application administrator. The % of evictions is currently always set to 0. In general, cache hit rates lower than 90% are a sign of a poorly sized cache. If an application server started recently or has not experienced much user load, the hit rates can be skewed low. For example, if the server was just started and only a few pages were visited, the hit rate will be very low because the first accesses are necessarily cache misses. As more pages on the Claim are visited, the hit rate will increase.

5.3.1.3.3 Data distribution analysis

Data distributions (number of exposures, activities, etc. per claim) should be pro-actively analyzed to prevent cache thrashing. For any instance of any object, the size of any array on that object should not exceed the size of the arrayed object's cache. For example, if data distribution analysis finds a single Claim with 3000 Transactions (Claim.Transactions), then the Transactions cache should be sized to at least 3000. Sometimes a rogue claim, such as one with 20000 Transactions, is present. In such cases, it might not be reasonable to set the Transaction cache size to 20000. The recommended approach is to investigate why such a claim exists, as it is likely the result of a bad data import and is significantly skewing the data distribution. If the claim appears to be legitimate and all of these components are required, then two options are available:

• The cache size can be adapted to deal with this claim, at the cost of valuable memory.
• The sluggishness experienced when working on that particular claim should be communicated to users.
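As a sketch of this kind of analysis, the following snippet takes per-claim Transaction counts (for example, extracted with a SQL GROUP BY) and separates the cache size actually needed from rogue outliers. The claim IDs, the input shape and the 20000 threshold are assumptions for illustration:

```python
# Illustrative data-distribution check: size the Transaction cache from
# normal claims and flag rogue claims that would skew it.
def analyze_distribution(counts_by_claim, rogue_threshold=20000):
    rogues = {c: n for c, n in counts_by_claim.items() if n >= rogue_threshold}
    normal = [n for c, n in counts_by_claim.items() if c not in rogues]
    needed_cache = max(normal) if normal else 0
    return needed_cache, rogues

counts = {"claim-001": 120, "claim-002": 3000, "claim-003": 20000}
needed, rogues = analyze_distribution(counts)
print(needed)   # 3000 -> size the Transaction cache to at least this
print(rogues)   # {'claim-003': 20000} -> investigate as a likely bad import
```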

5.3.1.3.4 Evidence of cache thrashing

Evidence of a thrashing cache can be found by:
• Analyzing the number of evictions on the CacheInfo page
• Resetting the CacheInfo
• Reproducing the operation
• Reanalyzing the CacheInfo page

If an individual cache reports hundreds or thousands of evictions and a low hit rate (less than 75%), then that cache is thrashing. If the thrashing is experienced on an application server node that is not processing batch jobs, the cache should be resized. Otherwise, the application server node should be dedicated to batch jobs. Once the proper action has been taken, the analysis should be repeated to ensure that the change yielded the expected results.
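The heuristic above can be expressed as a small check. The thresholds mirror the text (hundreds of evictions, hit rate below 75%); the function and its inputs are an illustrative stand-in for figures read off the CacheInfo page, not a Guidewire API:

```python
# A cache with many evictions and a hit rate below 75% is thrashing.
def is_thrashing(hits, misses, evictions,
                 min_evictions=100, min_hit_rate=0.75):
    total = hits + misses
    hit_rate = hits / total if total else 1.0
    return evictions >= min_evictions and hit_rate < min_hit_rate

print(is_thrashing(hits=600, misses=400, evictions=850))   # True
print(is_thrashing(hits=9500, misses=500, evictions=12))   # False
```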

5.3.2 Application server clustering

Application server clustering is done using a multicast IP address. Generic and setup considerations are provided in document reference 15. Enabling clustering and the associated multicasting is described in the installation documentation. The default multicast address provided in config.xml (228.9.9.9) can be used. Nevertheless, if several clusters must run concurrently on the same subnet, a different IP address must be allocated to each cluster.

5.3.3 Typelist refresh and interaction with application server

The Guidewire user interface makes extensive use of typelists. The entries available in one list may be modified by the values entered in other fields. Refreshing the available items in a list is either done locally or requires an interaction with the application server. Some standard cases involve filtering a larger typelist based on the values in another typelist, as configured in the typelist metadata definition; for example, vehicle make and model are updated as the vehicle drop-down changes. Other cases require a server refresh. A non-exhaustive list of these is:

• In ClaimCenter 3.1, fields hidden behind custom fields or with a “causesRefresh” override added to the NVV.
• In ClaimCenter 4.0 and PolicyCenter 2.0 (Pebble framework), fields marked “PostOnChange”.
• In general, typekeys that are filtered by more than one other field on the screen will prompt an interaction when one of the fields that filter that typekey changes.

Server refreshes have a system-wide (network, server) performance impact. Customers should be judicious about adding new fields to screens that override causesRefresh (ClaimCenter 3.1) or set postOnChange to true (PolicyCenter 2.0 or ClaimCenter 4.0). Nevertheless, most of the dynamic UI behavior is client-side, with a minimized performance impact confined to the user desktop.

5.3.4 Guidewire application database connection pooling

Guidewire applications use their own built-in database connection pool mechanism, based on DBCP, an open-source database connection pool library provided by Apache. DBCP is described at document reference 23. DBCP is the only database connection pooling mechanism available up to ClaimCenter 3.1.


WebSphere customers running ClaimCenter 4.0 have the option to use the built-in connection pooling mechanism or the WebSphere database connection pooling mechanism. Further details are described at document reference 24.

6 Database tier

The database is a key component of overall system performance, and proper configuration is essential.

6.1 Generic guidelines These guidelines apply to all database systems. Though generic, they are critical to maximizing the system's performance and ensuring a healthy, well-performing system. Database systems represent a significant investment for clients, mostly due to the high cost of database licenses (roughly $40k per CPU for Oracle, $20k for MS-SQL). Therefore, it is critical to correctly architect and tune the database to maximize this investment.

6.1.1 Storage, the “hidden” bottleneck

Modern computing has been dominated by the premise of Moore's law. In most people's minds, ever higher performance is achieved simply by upgrading to the latest processors. This “Intel” effect overshadows the role played by other parts of the infrastructure. For a database system, storage is generally the most overshadowed component of the infrastructure. The basic storage trends are as follows:

1. Disks are increasing rapidly in capacity.
2. CPUs are increasing rapidly in speed.
3. RAM capacity is increasing more slowly than disk capacity.
4. Disk speed is increasing very slowly.

Because of factor 1, system administrators tend to place larger databases on fewer disks. Because of factor 2, processors are able to issue more IO requests per second. Because of factor 3, the various database caches become smaller in comparison to the total database size. These three factors combine to increase IO pressure and storage bottlenecks. As indicated by factor 4, these increases in storage pressure are not alleviated by any significant increase in disk speed. The trends exposed above are long-term and should only worsen. Therefore, avoiding storage bottlenecks will become an ever more urgent goal.

6.1.1.1 Some examples

When discussing trends, some of the best examples can be found in benchmarks. One of the most reliable sources of benchmarks is the Transaction Processing Performance Council (www.tpc.org). Many vendors participate in benchmark tests for different types of workloads; moreover, software and hardware development is significantly influenced by these benchmarks. In the online transaction processing section (TPC-C), the 08/08/06 AIX p5 595/DB2 setup includes the following hardware:

• A database server with 32 dual-core processors and 2TB of memory
• 6400 36.4 GB 15K RPM disks configured in RAID 0 for data storage (other disks used for logs)

Out of a 213 TB capacity, only 112 TB is used in the benchmark: a space utilization ratio of 52%. In the decision support benchmark (TPC-H), the 09/19/05 3000GB AIX p5 595/Oracle 10G setup includes the following hardware:

• A database server with 64 CPU cores and 256 GB of memory
• 1152 36.4 GB 15K RPM disks configured in RAID 10 for data storage (other disks for logs)


The total data capacity is 38TB. Nevertheless, the disks are mirrored, so the corresponding usable capacity is 19TB, of which 3000GB is actually used: a space utilization of 15%. It must be noted that the data is stored on 36.4 GB disks; much larger disks already existed but were not selected. The following conclusions can be inferred from these two benchmarks:

• Optimal IO load must be spanned across many small disks rather than fewer larger disks.
• Optimal IO load is achieved by using only a small portion of the total disk capacity.
• Striping (RAID 0 or RAID 10) is used to span the load across as many disks as possible.

These conclusions are fully in line with the Oracle best practice described below. Furthermore, these benchmarks strive to be the most cost-efficient solutions in terms of cost per transaction. Therefore, these infrastructures are perfectly sized: no part of the infrastructure is undersized or oversized compared to the others. Net-net, these metrics are today's standard for optimal and cost-effective performance. The ClaimCenter load is a mix of OLTP (web user access) and datawarehouse (batch processing) loads. The guiding principles used in those benchmarks should be applied to ClaimCenter implementations. The following chapter provides summarized guidance.

6.1.1.2 Database storage workload

Databases issue storage IO in a mix of single-block and multi-block reads/writes:

• In OLTP (web user access) loads, the vast majority of reads/writes occur in single blocks. These blocks are 8K in size by default.
• In datawarehouse loads, a significantly larger number of reads/writes occur in multiple blocks. Database settings (see DB_FILE_MULTI_BLOCK_READ at Oracle settings (Oracle only)) can be adjusted to fine-tune the number of blocks to read in one IO.

6.1.1.3 Storage performance considerations

Document reference 25 provides the general storage metrics applicable to a database. These metrics clarify why a database load will inherently get limited throughput per disk, and consequently why database loads must be spanned across many disks. Vendors generally publicize throughput results from loads with maximum-size IOs, resulting in much higher figures that are not applicable to database loads.

6.1.1.4 Recommended solution

The two previous chapters clearly expose why storage matters significantly to database system performance. The current chapter exposes how good storage performance can be achieved relatively simply. Throughout database configuration history, DBAs and system administrators have struggled with the complexities of setting up database storage. The following aspects come into play in these rather esoteric discussions:

• RAID level: storage provides different setups for both performance and availability purposes:
  o Striping, aka RAID-0: for performance, data is scattered across many disks to load-balance the IO load across those disks.
  o Mirroring, aka RAID-1: for availability, data is kept on two disks.
  o RAID-5: for availability, several disks are grouped and redundancy (XOR parity) data is scattered across them.

These RAID types can be combined into setups such as RAID-10. The different RAID levels have the following read/write profiles:

  o RAID-5:
    • One read to the array results in a single IO to the disk where the data is available.
    • One write requires two reads (one to the main location and one to the XOR parity) and two writes (one to the main location and one to the XOR parity).
  o RAID-1 and RAID-10:
    • One read to the array results in a single IO to one of the mirrors where the data resides.
    • One write results in two IOs, one to each mirror.

For pure read loads, all disks will be leveraged with both RAID-10 and RAID-5, making the two architectures perform equally. For write loads, RAID-5 performs much more poorly than RAID-10. Document reference 41 provides many insights into RAID levels at page 177. Page 180 of that document recommends that RAID-10 be chosen over RAID-5 if the load profile has more than 10% writes. Guidewire's database has a load profile of 25% writes. Finally, RAID-5 is more prone to data loss in case of multiple disk failures, as documented on page 180. This last point will be documented further in document reference 15. Therefore, for both performance and availability, RAID-10 is the best choice for Guidewire database data.

• File placement: databases manage different types of files (data, temp, log, system, control, backup and others) with very different activity patterns (active/inactive, read/write...).

• Data on disk placement: on the outer part of the disk, throughput is twice as high as on the inner part.

• IO size: a significant part of the time processing an IO is spent seeking on the disk. When larger IOs are processed, this seek time is comparatively lower than for smaller IOs. Therefore, if possible, a database should request large amounts of contiguous data in one IO chunk. This is generally applicable only to batch jobs.
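The RAID-5 and RAID-10 read/write profiles described earlier translate directly into back-end IO amplification. The following sketch (illustrative arithmetic, not a vendor tool) computes the disk IOs per second an array must sustain for a given front-end load and write fraction:

```python
# Back-end disk IOs per front-end IO, per the read/write profiles above.
BACKEND_IOS = {
    "RAID-10": {"read": 1, "write": 2},
    "RAID-5":  {"read": 1, "write": 4},  # 2 reads + 2 writes (data + XOR parity)
}

def backend_iops(frontend_iops, write_fraction, raid):
    """Disk IOs/s the array must sustain for a given front-end load mix."""
    cost = BACKEND_IOS[raid]
    reads = frontend_iops * (1 - write_fraction) * cost["read"]
    writes = frontend_iops * write_fraction * cost["write"]
    return reads + writes

# Guidewire's 25%-write profile at 1000 front-end IOs/s:
print(backend_iops(1000, 0.25, "RAID-10"))  # 1250.0
print(backend_iops(1000, 0.25, "RAID-5"))   # 1750.0
```

At a 25% write mix, RAID-5 already needs 40% more disk IOs than RAID-10 for the same front-end load, which is consistent with the more-than-10%-writes rule of thumb cited above.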

All these constraints (and some others) have led to the creation of very esoteric storage architectures specifically adapted to particular types of loads (OLTP, DSS) and other arcane considerations. These setups are extremely complex to master and, as data loads evolve over time, they may become counter-productive. Oracle has recognized the complexity of designing a correct storage architecture and has invested much time and effort in defining a simple yet performant solution. Document reference 1, “Optimal Storage Configuration Made Easy”, addresses this issue. This recommendation is a watershed moment in database storage design. It recommends the S.A.M.E architecture (Stripe And Mirror Everything). Its precepts are:

1. Mirror data for high availability (RAID 1).
2. Stripe all files across all disks using a 1MB stripe width.
3. Place frequently accessed data on the outer half of the disk.
4. Subset data by partition, not disk.

Precept 1 indicates that availability should be achieved through the more predictable RAID 10, rather than RAID-5. Precept 2 indicates that the highest throughput is achieved by striping with a 1MB stripe width. The 1MB stripe allows data to be read and written in large chunks, which optimizes the percentage of time the disk actually transfers data. Precept 3 recognizes that active database files should be placed on the outer part of the disk; inner parts of the disk should be used for inactive data (back-ups, for example). This can be done easily when configuring the array. Precept 4 indicates that the different types of files (data, logs, temp...) can remain on separate subsets (typically filesystems) but all must be striped across all available disks.


The Oracle SAME architecture is deemed to be a high-performance, simple and manageable storage architecture. Historically, Microsoft has recommended that MS-SQL data and logs be separated on different spindles. Nevertheless, the same considerations apply to MS-SQL as to Oracle; therefore, SAME should be used for MS-SQL as well. For both Oracle and MS-SQL, Guidewire uses a SAME-like architecture for its performance tests. In Guidewire's case, data availability is less of a concern; therefore, data and logs are striped across all available disks. Some very exotic and complex architectures will always perform slightly better than SAME. These gains are sufficient to warrant using such architectures in benchmarks like the TPC ones described earlier, but these architectures are more difficult to build and administer. Therefore, SAME is the optimal architecture for production databases.

6.1.1.5 Designing storage for performance

When designing a storage architecture, many are tempted to optimize for capacity, making RAID-5 the seemingly logical choice. Nevertheless, as indicated previously, storage for production systems should be designed not for capacity but for performance. In that case, RAID-10 becomes the better choice, performing predictably for both read-heavy and write-heavy loads. The Guidewire sizing recommendation for storage is that 1 disk mirror (i.e., 2 disks) supports between 300 and 400 users. The appendix provides a comparison of RAID-10 and RAID-5 for both capacity and performance.
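As a minimal sketch of this sizing rule (the function and the choice of the conservative 300-users-per-mirror end of the range are assumptions for illustration):

```python
import math

# One disk mirror (2 disks) supports 300-400 users; RAID-10 mirrors everything.
def raid10_disks_for_users(users, users_per_mirror=300):
    mirrors = math.ceil(users / users_per_mirror)
    return mirrors * 2  # each mirror is a 2-disk pair

print(raid10_disks_for_users(300))   # 2
print(raid10_disks_for_users(1000))  # 8 (4 mirrors at the conservative figure)
```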

6.1.1.6 Usual customer complexities

Real-world situations may differ in many aspects from the examples provided earlier:

1. In the benchmark setups, one big server is connected to many small storage arrays. Frequently, many small servers are connected to one or a few storage systems.
2. Storage may be optimized for capacity by utilizing RAID-5, and would therefore perform poorly.
3. The same disk may be shared across several databases, resulting in one database behaving differently depending on the other databases' activity.

The combination of these three facts sometimes has serious consequences. Factors 1 and 2 contribute to general storage sluggishness by concentrating the load of many databases on few disks configured with a low-performing architecture. Factor 3 creates hot spots: some disks become highly utilized and overloaded while other disks stand idle. The load generated by one database will impact the performance of all other databases sharing disks with it. The systems connected to such a storage architecture will perform slowly and unpredictably. There are many horror stories that can be shared on this subject. In these cases, the SAME architecture can still be applied. Generally, an array will host different workloads (Oracle, MS-Exchange, etc.). Many vendors have identified that such workloads do not co-exist well on the same disks. Therefore, an array should be carved into separate per-application disk clusters. The database-specific partition should then be carved into one big RAID 10 group with a 1MB stripe. Each database instance's data would be located on that big RAID 10 group. The files associated with the faster databases could be located on the outer part of the disks while the slower ones could be located on the inner part. The inner half of the disk could be dedicated to back-up files, which are very rarely accessed.


6.1.1.7 Some other non-trivial factors

SAME aims to optimize disk utilization. As indicated in document reference 25, link technology is generally not a bottleneck for Guidewire platform user loads. Batch loads, which issue larger IO chunks, may put a greater strain on the link technology. Link technology bottlenecks can be remediated by adding several links and enabling multi-pathing. In these cases, the following considerations should be remembered:

• The array(s) must be capable of performing in multiple-path mode. Some low- and mid-range arrays do not perform correctly when IOs are processed through several paths at the same time. For example, lower-end arrays have distinct per-controller caches which thrash when IOs to the same disks are processed on different controllers. Higher-end arrays will deliver higher performance when accessed in multi-path mode.

• At the server level, multiple pathing is generally enabled at the Logical Volume Management level (see below). When multiple paths are available, LVM can use them in either fail-over or load-balancing mode. In the former case, another path is used only if the first path fails; in the latter, all paths are used in parallel. Veritas VxVm, Redhat LVM2 and HP-UX LVM are among the volume managers that provide load-balancing capabilities.

Many arrays implement a “write on cache” policy: the array can be set to report that an IO has been written once it has been cached by the array, with the actual write to disk occurring later in an optimized fashion. This functionality can be enabled when, upon a disrupting event (power outage...), the array is capable of writing the cache to disk while its internal batteries are still on. Enabling “write on cache” improves performance and should therefore be used if it can be enabled safely.

6.1.1.8 Schedulers

Some operating systems, such as those using Linux kernel version 2.6, support several types of IO schedulers. Different schedulers implement different algorithms adapted to various workloads. The default scheduler is not always the one best adapted to high storage loads. For example, Redhat 4 uses the Completely Fair Queuing scheduler by default. Guidewire experimented with different schedulers and found that the Deadline and Noop schedulers were best adapted to a high storage load. This experimentation was conducted with a SAME-like setup with all data objects (data, logs, etc.) on the same device. Ultimately, the Deadline scheduler was chosen because it also allows fast copies of large files, such as during a database cold recovery; the Noop scheduler was found to be much more limited in merging IOs, leading to much longer database recoveries. These results are specific to the configuration tested; other customers should conduct their own experiments to choose the best scheduler. Further information is provided at document reference 39.

6.1.2 Operating system configuration

Database applications are very refined and complex applications that leverage operating system features rarely used by other applications. Several of these features stand out:

• Asynchronous IO
• Raw/Direct IO
• Logical Volume Management
• IO size settings

6.1.2.1 Asynchronous IO

Under regular settings, when a process thread requires an IO, the application-side thread calls the operating system and remains blocked until the operating system has completed the IO. This is a synchronous call. Highly demanding applications such as databases request many concurrent reads/writes; to process many IOs synchronously, the database must be configured with many read/write processes. Another method is to use asynchronous IOs: the thread posts the IO to the operating system without waiting for completion, and upon IO completion the operating system calls the thread back. Using asynchronous IOs allows an application to optimize its storage performance without using many reader/writer processes. Different vendors have implemented widely varying asynchronous mechanisms with various degrees of setup complexity. Correctly configured asynchronous IO generally provides a performance advantage.
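The post-then-collect pattern can be illustrated in miniature. Note that this Python sketch only mimics the pattern with a thread pool and positional reads; real databases rely on OS-level facilities such as POSIX AIO or Linux io_submit, not threads:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Illustration of the async-IO pattern only: IOs are posted without
# blocking the caller, and completions are collected later.
def read_async(fd, offsets, size, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Post all reads at once; os.pread is positional, so no shared
        # file offset is mutated and the reads can run concurrently.
        futures = [pool.submit(os.pread, fd, size, off) for off in offsets]
        return [f.result() for f in futures]  # wait for completions

with tempfile.TemporaryFile() as f:
    f.write(b"A" * 8192 + b"B" * 8192)
    f.flush()
    chunks = read_async(f.fileno(), offsets=[0, 8192], size=8192)
    print([c[:1] for c in chunks])  # [b'A', b'B']
```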

6.1.2.2 Raw/Direct IO

Applications generally access data on storage through a filesystem. Filesystems provide the following services:

• OS-level data caching
• File access control
• Read-ahead: as one set of data is requested in read, more data is fetched in anticipation of future reads.

Databases implement similar mechanisms internally. Therefore, the filesystem services are useless for databases and possibly adverse to performance. Moreover, some mechanisms, such as inode access serialization (in read or write), can severely limit storage performance through filesystems. Document reference 5 describes this at greater length. For maximum performance, high-performance databases generally use raw IOs (i.e., without a filesystem) for some of their data sets. When using raw IOs, system administrators lose all direct visibility into the data set files, making a raw-IO setup difficult to manage. In recent times, many OS vendors have developed features that enable regular filesystem files to behave much like raw IOs while still providing a regular filesystem “vision”. These setups are generally called Direct I/O. In these modes, the filesystems disable the following services:

• OS caching
• File access control serialization
• Read-ahead

Because file access control is much less stringent, these Direct I/O features may lead to corruption by applications that do not implement their own access locking; databases, however, definitely implement their own locking mechanisms. Some common filesystems that allow a raw-IO-like mode are:

• Concurrent IO for AIX (with Direct IO, a similar but less performant version)
• Ext3 for Linux
• Veritas Quick I/O

Document reference 3 describes how to use the ext3 filesystem with raw-IO-like performance. Interestingly, it quantifies the performance benefits of Direct and Async IO:

• Synchronous/Direct IO provides a 74% improvement over Synchronous/Cached IO.
• Asynchronous/Direct IO provides about a 100% improvement over Synchronous/Cached IO.

Direct IO is not a performance “one size fits all”. For some workloads, Direct IO will be significantly slower than cached IO; typically, when recovering a back-up, cached IO should perform significantly better than Direct IO. Additionally, Direct IO usage can differ greatly per OS platform. With Linux, Direct IO is enabled when opening a file. With AIX, two modes exist: Concurrent IO and Direct IO are enabled at filesystem mount.


6.1.2.3 Logical Volume Management

Logical Volume Management is an additional abstraction layer between the device and the filesystem (cf. document reference 6). LVM virtualizes the disk device layer:

• Contiguous storage spaces can be created across one or more disks.
• Several RAID levels (0, 1, and 5) can be implemented at the LVM level.

Many storage arrays implement similar technologies. The balance between enabling such features at the OS (LVM) or array level is a difficult exercise. Nevertheless, it is the author’s experience that the following guidelines should be followed:

• Mirroring should be enabled at the array level. Mirroring at the LVM level forces 2 IOs for each write, therefore issuing more IOs on the link and to the array.

• Striping can be done at the array or LVM level depending on the following factors:
  o Array striping is highly performant. Nevertheless, if the database spans several arrays, LVM will be required to stripe across the multiple arrays.
  o RAID HBA striping: generally, a RAID card will stripe data across the underlying disks. RAID cards tend to fracture IOs into small fragments and hide the per-disk IO performance.

As storage pressure grows (or, less likely, as database files grow), the data must be reallocated across more disks. Therefore, additional disks must be added, either on the same array or on an additional one. When allocating space on the new disks, the same striping conditions should be applied to avoid creating any hotspots. This is clearly delineated in document reference 1. Some volume managers allow dynamic extension over new disks while maintaining the pre-defined stripe width without any downtime; VxVm (Veritas) is one such volume manager. If no such option exists, the database needs to be taken down, backed up and then recovered over the new storage architecture.

6.1.2.4 IO size settings

Document reference 1, “Optimal Storage Configuration Made Easy”, clearly delineates the need to read/write large IOs when possible. This recommendation is much more relevant for datawarehouse loads than for OLTP ones. Typically, a 16K IO will complete in 11ms and a 1MB one in 60ms. Therefore, if a 1MB IO is chunked into 64 * 16K IOs, it will complete in around 700ms, rather than 60ms if not chunked. Some settings (operating system, filesystem, LVM, driver) may cause an IO to be chunked into smaller ones. These settings must be adapted to avoid this and, in any case, not to chunk IOs into pieces smaller than 1MB. Some older hardware and operating systems may limit the maximum IO size to chunks much smaller than 1MB, thereby limiting storage performance for large IOs. On a final note, some understand striping to impact IO size this way: to limit the load on disks, a large IO would be chunked into one IO on every disk in the RAID-0 group. The performance metrics provided above show that, with this “micro-striping” model, disks would become overloaded by many small IOs, significantly reducing performance. Striping should be understood as a macroscopic phenomenon: non-chunked IOs are distributed across many disks, therefore reducing the number of IOs on each disk.
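The arithmetic in this paragraph, spelled out (the millisecond figures are the ones quoted in the text, not measurements):

```python
# A 16K IO completes in ~11 ms and a 1MB IO in ~60 ms, so chunking one
# 1MB request into 64 sequential 16K IOs multiplies elapsed time >10x.
MS_PER_16K_IO = 11
MS_PER_1MB_IO = 60

chunked_ms = (1024 // 16) * MS_PER_16K_IO  # 64 IOs of 16K each
print(chunked_ms)      # 704 ms when chunked (the "around 700ms" above)
print(MS_PER_1MB_IO)   # 60 ms as a single IO
```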

Page 25: Guidewire Performance Guidelines

Guidewire Performance Guidelines

The content provided in this document is intended for the exclusive use of Guidewire-authorized external parties and is for educational purposes only. While great care has been taken to validate the content, Guidewire makes no warranties, either expressed or implied, concerning the accuracy, completeness,

reliability, or suitability of the information.

Can be shared internally and with clients - 25 - June 8, 2007

6.1.2.5 Final points on S.A.M.E architecture

Oracle’s recommendation to use S.A.M.E constituted a significant moment in the history of storage architecture. Since the initial whitepaper, published in 2000 (document reference 1), Oracle has continued researching this architecture. Document reference 40 is a follow-up that compares S.A.M.E with other architectures. Its conclusion is that S.A.M.E is the best compromise in terms of performance and manageability.

6.1.3 Testing storage

The many recommendations provided in the earlier chapters range from the complex to the arcane. Implementing them can lead to errors and can run into operating system or hardware defects. It is therefore highly recommended to test the storage architecture before running the database on it. For example, a load of 8K IOs can be generated to test a storage subsystem's maximum capacity under OLTP-style loads. Document reference 19 describes the tool and provides some usage examples.
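As a rough illustration of the idea (not the tool in document reference 19), the following hypothetical sketch issues random 8K reads against a scratch file and reports a rate. A real test must bypass the filesystem cache (e.g. with Direct IO), which this demo deliberately does not do:

```python
# Hypothetical sketch of an 8K random-read test; file names and sizes are
# made up for the demo. Real storage tests must use Direct IO to avoid
# measuring the filesystem cache instead of the disks.
import os
import random
import tempfile
import time

BLOCK = 8 * 1024        # 8K, the OLTP-style IO size discussed above
FILE_BLOCKS = 1024      # small 8MB demo file

# Create a scratch file to read from.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(os.urandom(BLOCK * FILE_BLOCKS))

def random_read_test(path, n_ios):
    """Issue n_ios random 8K reads and return elapsed seconds."""
    with open(path, "rb") as f:
        start = time.perf_counter()
        for _ in range(n_ios):
            f.seek(random.randrange(FILE_BLOCKS) * BLOCK)
            if len(f.read(BLOCK)) != BLOCK:
                raise IOError("short read")
        return time.perf_counter() - start

elapsed = random_read_test(path, 1000)
print(f"{1000 / elapsed:.0f} 8K reads/s (filesystem cache inflates this)")
os.remove(path)
```

The dedicated tools additionally control queue depth and concurrency, which is what actually exercises a storage array's OLTP capacity.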

6.1.4 Database

6.1.4.1 Enabling & configuring Async IOs

Database systems can easily be configured to use Async IO, with performance improvements. Async IO may require some configuration fine-tuning to adapt it to the specifics of the infrastructure. On AIX, the minservers, maxservers, and maxreqs OS settings should be adapted to both the server (number of CPUs) and the storage (disk queue depth), and must be revisited whenever the number of CPUs or the storage changes. Information about these settings is available on page 27 of document reference 2 and on pages 6 and 9 of document reference 4. On Linux systems, enabling Async IO may require relinking the Oracle binary, as described in document reference 14.

6.1.4.2 Database and memory

The availability of ever-increasing amounts of memory continues to have a significant impact on servers. Database applications leverage memory as a cache to minimize and optimize storage access. For example, the database server in the TPC-C 08/08/06 AIX p5 595/DB2 setup has 2 terabytes of memory, most of which is used by database caches. When sizing and tuning database caches, careful consideration must be given to the following facts.

6.1.4.2.1 32 bits systems and large database caches

The chapter 32 bits platforms describes the inherent memory addressability limitations of this platform. These typically limit database cache sizes (shared memory) to 1.7GB on x86 systems (generally, Windows and Linux). Methods exist to raise this 1.7GB limit; they are described in detail for RedHat 2.1, 3, and 4 in document reference 14. PAE (Physical Address Extension) allows raising the amount of addressable memory to an even higher limit (around 62GB); the specific modifications needed are also detailed in document reference 14. Additional information can be found in document references 9, 21, and 10.

For systems with applications using a lot of memory, the 32 bits architecture can bring significant complications. For example, a system may have a significant amount of free memory for certain types of memory pages (execution, data, or shared memory) while no memory is available for other types. In such a case, a cursory look at the memory statistics will point to a system with plenty of memory available, even though the system is using its swap space due to starvation of other memory types. Document reference 14 provides some insight into these complications and the complex 32 bits memory mapping logic.


Additionally, some operating systems have experienced data corruption when using these optimizations, as described in document reference 22. All common hardware sold now supports 64 bits, which is the best choice for database platforms. Document reference 8 delineates the performance improvements associated with using 64 bits processors with large database caches.

6.1.4.2.2 Large memory optimization

The chapter Large pages describes the possible optimizations associated with large pages. Databases were among the first applications to leverage this feature. Some configuration, and potentially some Oracle patching, is necessary to leverage the functionality correctly. Document reference 14 describes this for Oracle on Linux.

6.1.4.2.3 Guidewire specifics

Document reference 25 provides some Guidewire-specific database cache sizing. The information provided at Orders of magnitude shows the benefits of database caching on performance: the very long action described takes around 7 seconds with the application server cache empty and the database cache full, and up to 25 seconds with both the application server and database caches empty (thereby requiring a lot from storage). Proper sizing of the database cache increases the chance of keeping in the database cache data that has been removed from the application server cache(s). Furthermore, the database cache is common to all application servers, so data cached by the action of one user (an adjuster, for example) can be reused by another user (an approver, for example) running on another application server node. Nevertheless, the cache mechanism removes objects from the cache as newer ones are cached, and the cache is empty upon database start.

6.1.4.2.4 Oracle

6.1.4.2.4.1 Memory utilization optimization

Cache size increases, even well within the limit of the physical RAM available on the system, may hit a point of diminishing returns. Oracle, through its performance tool Statspack, provides analysis tools that allow a database administrator to optimize the database cache size. Document reference 19 provides additional information on that tool. Proper usage of the tool will prevent unneeded memory allocation to database caches.

6.1.4.2.4.2 The optimizer

6.1.4.2.4.2.1 Generic information

Before executing a query, the Oracle optimizer determines the best plan to execute it. The optimizer relies heavily on the following elements to choose the correct plan:

• Statistics on rows and blocks
• Presence of indexes
• Other undocumented internal metrics

The optimizer uses the database statistics to select a good query plan; up-to-date statistics are therefore critical to proper performance. The Database catalog statistics page provides the current statistics for Guidewire applications. If the statistics are out of date, they should be updated using the procedure described in the ClaimCenter documentation. Document reference 11 provides more insight into the Oracle optimizer.


Starting with Oracle 9i, Oracle can also use infrastructure statistics (CPU, storage) to help choose the optimal query plan. This feature, called CPU costing, is described at greater length in document reference 12. In Oracle 9i, CPU costing requires that infrastructure statistics be gathered before they can be used by the cost-based optimizer; this gathering must be done intentionally and is not done by default.

6.1.4.2.4.2.2 Guidewire specifics

Cost Based Optimization is the recommended mode for Oracle 9i and up.

6.1.4.2.4.2.3 Optimizer complexities

The optimizer algorithm is proprietary and nontrivial. While Oracle chooses the best plan in most cases, it has sometimes been seen to execute extremely poor plans. The difference between a suboptimal and an optimal plan is very significant and can be hours versus seconds. Frequently, a bad query plan results in very high processor utilization during query execution. Guidewire has witnessed the exact same actions, executed with the same application and the same data but on different hardware, use drastically different execution plans. Document reference 34 provides some possible explanations of this phenomenon. Document reference 19 provides some insights into gathering information about Oracle performance and query plans.

6.1.4.2.4.2.4 Outlines

Guidewire extensively tests its applications’ performance and goes to great efforts to verify that application queries perform correctly. Nevertheless, some elements outside of Guidewire’s control, such as extreme data distributions or extensive application configurations, may cause some queries to get bad plans even though they performed correctly under test conditions. In such cases, another query may have provided better performance; unfortunately, that requires a new application release or patch, which is not workable in urgent situations. Customers can then use outlines, an Oracle feature that allows DBAs to force a specific query plan. Using outlines involves the following steps:

• Identify the queries that perform very badly, and identify their plans using the regular database investigation tools.

• Have the DBA analyze the plans. Some testing will allow identifying a better plan; for example, additional hints in the query code may yield significantly better plans.

• Use outlines to associate those better plans with the queries. Upon execution of the query, Oracle will use the plan stored in the outline rather than call the optimizer to find a plan.

Outlines come with the following caveats:

• It should be assumed that the Oracle optimizer chooses correct query plans unless proven wrong. Therefore, outlines should be used only for severely bad plans with significant impact; a typical example is a Validate & Populate with queries running for hours. In no case should this functionality be used to bypass the optimizer for a large number of acceptably performing queries.

• Significant investigation and testing should be done to validate that the better plan is indeed better, and robust in all cases.

• Upon a change to Oracle (patch, upgrade...), the optimizer may change its behavior and select a correct plan for queries that previously had bad plans. Therefore, outlines should initially be disabled after an Oracle patch or upgrade. The queries that previously generated bad plans should then be re-tested to check whether the optimizer has changed its plan decisions. Only then should outlines be re-enabled, and only for the queries that still get bad plans.


Document reference 36 provides some detailed instructions on how to use outlines. The system’s DBA is ultimately responsible for managing outlines.

6.1.4.2.4.3 Oracle settings

For Oracle, the following parameters are important to consider:

• Database block size: the block size that Oracle uses is configurable; Oracle can even use different block sizes for different tablespaces. Guidewire recommends keeping the default database block size (8K), which is the commonly accepted size for OLTP loads. Document reference 20 provides some insights on database block sizes; its guidance for OLTP loads is in line with Guidewire’s recommendation.

• DB_FILE_MULTIBLOCK_READ_COUNT is used by Oracle to perform read-ahead-like behavior. Oracle recommends adapting this setting so that DB_FILE_MULTIBLOCK_READ_COUNT × block size is greater than or equal to the stripe width (1MB in our recommendation). The theoretical sizing of this parameter depends heavily on the storage’s capacity to process IOs equal in size to the stripe width; caution is therefore recommended when adjusting it. This parameter can be especially interesting when the database performs large reads, such as during large batch jobs (Validation/Population, for example); in those cases it could be temporarily oversized. Guidewire has experimented with this parameter for UI loads and found no benefit from values larger than the default (16).

• To use both Async and Direct IO, the Oracle filesystem IO setting should enable all available optimizations: filesystemio_options="SetAll". The infrastructure may need to be configured to support those options (see Operating system configuration).

• Database cache size: document reference 25 provides the recommendation for the database cache. Tuning should be finalized during performance tests to adapt it to customer specifics.
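The multiblock-read rule above reduces to simple arithmetic. With the recommended 8K block size and 1MB stripe width, the value that makes a multiblock read match the stripe is:

```python
# Sizing check for the rule: DB_FILE_MULTIBLOCK_READ_COUNT * block size
# >= stripe width. Values are the recommendations from this document.
KB = 1024
block_size = 8 * KB        # recommended default block size
stripe_width = 1024 * KB   # 1MB stripe width from the storage sections

# Smallest count satisfying count * block_size >= stripe_width:
mbrc = stripe_width // block_size
print(mbrc)  # 128
```

Whether the storage can actually service 1MB reads efficiently must be verified before setting such a value, per the caution above.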

6.1.4.2.5 MS-SQL

For better or worse, MS-SQL is a more self-tuning platform than Oracle. In the author’s opinion, this is essentially due to two main factors:

• Microsoft intends to deliver products requiring less tuning expertise.

• MS-SQL is targeted to run on only one platform, Windows. MS-SQL is therefore tuned out of the box for a specific OS platform, which significantly reduces the amount of OS-specific tuning.

Guidewire Engineering runs its performance tests on MS-SQL with no specific tuning. It is Guidewire’s experience that MS-SQL, given appropriate memory, processing, and storage capacity, will perform correctly.

6.2 Generic performance analysis methodology

6.2.1 Platform analysis

Infrastructure performance analysis methodology is provided at Generic performance considerations. More specific platform recommendations are provided in document reference 19. Database server performance issues can occur due to issues with:

• Processing power
• Memory
• Network bandwidth
• Storage


6.2.1.1 Client interaction loads

The database server is used for both client interaction and batch processes. Different performance issues may arise under these different loads.

6.2.1.1.1 Processing power

Server monitoring will allow gathering information about the system’s processing usage. Generic performance considerations provides many details about the meaning of each metric. The specifics of the database client-interaction load are:

• %user should stay below 70-80%. Higher loads indicate that the system is pegging and potentially needs more processing power. High-end Unix systems are generally capable of sustaining higher continuous loads than Linux or Windows.

• System load: the system load should be minimal (5-6%). Higher loads indicate that the system performs a lot of activity. Possible reasons are:

o Filesystem caching, which should have been eliminated by using Direct IO
o Managing many IOs. Many smaller IOs cause the operating system to work much harder. The goal should be to force the database and the system to issue fewer, larger IOs when possible.

• Iowait: if this metric is high (20% or more), storage becomes a performance suspect. Nevertheless, this metric is highly unreliable, for reasons explained in Generic performance considerations. The device average wait time is a much better indicator of storage performance.

6.2.1.1.2 Memory

The system should not swap and should page normally (system dependent).

6.2.1.1.3 Network

During interactive loads, the network is significantly used. The advice provided at Generic performance considerations will help identify issues or bottlenecks.

6.2.1.1.4 Storage performance

As indicated above, storage performance is absolutely critical to overall database performance. The methodology provided at Generic performance considerations will help identify issues or bottlenecks.

6.2.1.1.5 Database monitoring

6.2.1.1.5.1 Oracle

During normal operations, Statspack snapshots should be taken regularly and the associated reports generated. An aggressive option is to take snapshots every hour and generate reports for every hour slice; a less aggressive approach is to take two snapshots 15 minutes apart once a week. In any case, these archived reports can be re-used later if the system’s performance degrades and the current Statspack output must be compared to older ones. Information on creating Statspack snapshots and associated reports is provided in document reference 19.

6.2.1.2 Batch interaction loads

6.2.1.2.1 Processing power

The following metric values should be expected:

• Batch loads are generally single-threaded, so a batch interaction will generally use one CPU. The percentage of time taken by the application (%user) can therefore not be higher than one CPU’s worth (50% on a dual-CPU system without hyper-threading, 25% on a dual-CPU system with hyper-threading). Depending on the batch process phase, the CPU usage may reach this ceiling at some times while being very low at others.

• System: this metric should be low, as for the interactive load. Higher numbers (more than 5-6%) indicate a filesystem buffering phenomenon or too many small IOs.


• Iowait may vary significantly during the batch process, as some phases are IO bound and others CPU bound. During the IO-bound phases, the IO wait % should not go higher than 30-50%; otherwise, storage is a likely bottleneck.
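The %user ceiling for a single-threaded batch job follows directly from the number of logical CPUs the OS sees:

```python
# One saturated CPU out of N logical CPUs caps %user at 100/N, as the
# batch-load discussion above describes.
def max_user_pct(physical_cpus, hyperthreaded=False):
    """Ceiling on %user when a single-threaded job saturates one logical CPU."""
    logical = physical_cpus * (2 if hyperthreaded else 1)
    return 100.0 / logical

print(max_user_pct(2))                      # 50.0 on a dual-CPU system
print(max_user_pct(2, hyperthreaded=True))  # 25.0 when hyper-threaded
```

Seeing %user pinned at exactly this ceiling during a batch run is expected behavior, not a sign of an overloaded server.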

6.2.1.2.2 Memory

The system should not swap and should page normally (system dependent).

6.2.1.2.3 Network

During batch loads, the network load is very limited.

6.2.1.2.4 Storage performance

As indicated above, storage performance is absolutely critical to overall database performance. The insights provided at Generic performance considerations will help identify issues or bottlenecks. Storage pressure may swing significantly during different phases of a batch load. Nevertheless, the storage should never experience high IO wait times (more than 20 ms).

6.2.1.2.5 Database monitoring

6.2.1.2.5.1 Oracle

During specific batch activities, such as validation & population, Statspack snapshots should be triggered at every significant event. For example, if the validation/population is deemed to perform badly, Statspack reports should be triggered before starting the validation/population (time point 1), when the validation is finished (time point 2), and when the population is finished (time point 3). The application server cclog.log should be monitored during the process to view these events. These steps are detailed in document reference 19.

6.3 Guidewire specifics

6.3.1 Data distribution

Similarly to the application server, data distribution is critical for the database server component. When running performance tests (both interactive loads and batch jobs), the perf team uses a data distribution where claims have on average:

• 2 Exposures
• 10 Activities
• 10 History
• 10 Documents
• 10 ClaimContacts
• 35 Notes

Some clients may have significantly different data distributions. Workers’ Compensation insurance companies, for example, have much heavier data distributions, with many more activities, contacts, and notes. This can have a significant impact on both interactive and batch database loads. It is therefore strongly recommended that, at project initiation, the data distribution be evaluated and shared with Engineering. This will help identify possible future issues with both batch loads (Validation & Population) and interactive loads.

6.3.2 Batch loads

6.3.2.1 Validation & Population complexities

Validation & Population is a very demanding process that has shown some complexities. During the Validation/Population, the following happens:


• Validation: many checks are done on the staging tables. This process is mainly CPU intensive. As parallel processing is not enabled, it will use only one processor (or one thread in a hyper-threaded architecture).

• Population: if validation is successful, the population transfers the data from the staging tables to the production tables in a single transaction. This last fact is critical to understand and has many consequences:

o To allow rollbacks, the population writes heavily to the undo segments. This severely hits the storage system which, if not appropriately configured, will become a bottleneck in this phase.

o During the population process, the database relies on database statistics that were accurate at the start of the process. These statistics become more and more inaccurate as the process progresses; this is critical if the validation & population starts with empty production tables, since skewed statistics affect the database’s choice of query plan. Furthermore, the entire population phase happens in a single transaction, to ensure that either the full population succeeds or no change occurs. During the population, a number of de-normalizations (columns and rows) must be updated or inserted after the insert/selects, and the source tables may change significantly in size during the insert/selects. The database statistics have not been updated, so the optimizer still sees the old statistics; they cannot be updated through the usual code path, because that would cause an implicit commit. Guidewire therefore provides an option to compute statistics derived from the statistics of both the production tables (prior to the population) and the staging tables. These new statistics are computed as follows:

  Source table row count = source table row count + staging table row count
  Source table block count = source table block count + function(staging table block count)

Using this alternate statistics computation is required if the validation & population significantly changes the production tables’ row or block counts; this is why the method is provided as an option. For the SOAP API call, setting a specific Boolean parameter to True forces this behavior. For SQL commands, if validateandpopulate is called from table_import, the additional parameter -updatedatabasestatistics forces it. More information is provided in the table_import command documentation.
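The alternate statistics computation described above can be sketched as follows. The real computation, and in particular the block-count function, is internal to Guidewire and undocumented here; `blocks_after_merge` is a placeholder assumption for illustration only:

```python
# Hedged sketch of the alternate statistics computation. The block-count
# function is undocumented; this placeholder assumes staged rows repack
# into the same number of blocks, which is only an illustration.
def blocks_after_merge(staging_blocks):
    """Placeholder for function(staging table block count)."""
    return staging_blocks

def adjusted_stats(source_rows, source_blocks, staging_rows, staging_blocks):
    """Estimate post-population statistics before any commit can occur."""
    return {
        "rows": source_rows + staging_rows,
        "blocks": source_blocks + blocks_after_merge(staging_blocks),
    }

# A population adding 9,000 staged claims to 1,000 existing ones:
print(adjusted_stats(1000, 50, 9000, 450))  # {'rows': 10000, 'blocks': 500}
```

The point of the computation is to give the optimizer statistics that anticipate the final table sizes, since the single-transaction design forbids a real statistics refresh mid-population.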

Guidewire recommends the following approach to achieve success with Validation & Population:

• Database statistics should be updated prior to running the process, especially when running it with empty production tables.

• The validation & population should be run with the updatedatabasestatistics option.

Several issues have occurred with Validation & Population:

• The process can complete very slowly, which tends to indicate a performance issue.

• It can hang at some point of the validation & population. The cclog.log will typically show a “Begin execution of LoaderCallback…” with no corresponding “End execution of LoaderCallback…”. These occurrences have been traced to Oracle choosing a bad execution plan. The Oracle optimizer chapter details these possible issues and provides some strategies to work around them.

In either case, Guidewire recommends gathering both infrastructure and database monitoring information. In the latter case (hangs), several options are believed to limit the risk:

• Run the Validation & Population in small slices.
• Run a first Validation & Population on a smaller slice and, once it succeeds, on the remainder.


Guidewire plans to pro-actively address these risks in the future by:

• Identifying the client data distribution
• Generating a data set that resembles the future database content
• Testing the data set in its labs and identifying any issues (sluggishness or hangs)
• Providing that data set to the client early in the project, to allow identifying similar issues

6.3.2.2 Storage settings

Batch loads retrieve a significant amount of data from storage. DB_FILE_MULTIBLOCK_READ_COUNT may be changed to allow multiblock reads in 1MB chunks, thereby reducing the number of reads issued to the disks.

6.3.3 All loads

6.3.3.1 Application logs

The current cclog.log file and its predecessors should be collected and archived for analysis, whether in the field or in San Mateo. If the configuration contains several application servers, logs from all of them should be collected. The logs provide timelines for the conversion processes (such as validate & populate); proper information is logged for conversion calls from both the SOAP API and the table_import shell. In a vanilla Windows installation, those files are located in the C:\claimcenter_logs directory.

6.3.3.2 Storage settings

During regular load, an overly large DB_FILE_MULTIBLOCK_READ_COUNT may create unnecessarily large reads and evict needed cached data in favor of data fetched during the multiblock read. The more conservative default value of 16 is likely to be more appropriate.

6.3.3.3 Batch server and performance

Guidewire applications process their long-running jobs (batches, escalations...) through a dedicated server, the batch server. The batch server may at times experience significant load. This load is a combination of:

• Application load: in this case, users connected to the batch server will experience a significant performance slow-down. It is therefore recommended that, on large implementations, no users connect to the batch server.

• Database/storage load: in this case, all users will experience a performance slow-down. This issue should be remediated by increasing the database and/or storage performance.

Information provided in the Guidewire Infrastructure Troubleshooting Guideline may be leveraged to identify the corresponding bottleneck and adopt the correct remediation strategy.

6.3.3.4 Database catalog statistics update

As indicated earlier, the Oracle optimizer depends heavily on the accuracy of the database statistics to choose a good query plan when executing a query. Statistics collect information on tables, such as the number of rows/blocks and possibly histograms. Inaccurate or out-of-date statistics have a very adverse effect on query performance. This cannot be stressed enough.

6.3.3.4.1 When to update database catalog statistics

Statistics should be updated whenever the following happens:

• Database performance is degraded. Whenever a performance issue arises and the database is a suspect, the database catalog statistics should be updated before reproducing the issue.
• Before processing large batch jobs (such as validation/population)
• Data has been significantly changed, such as after a conversion
• 10-20% of the data has been changed by normal usage of the system


• On a regular schedule, dependent on the statistics drift.

DBAs should regularly review the database catalog statistics and identify the speed of the statistics drift. Current statistics can be retrieved using the usual DBA commands. Additionally, for ClaimCenter 3.0.3, 3.1.1 and higher, an internal tool provides the current database statistics (see the Database catalog statistics page). If a significant drift (more than 10-20%) is identified on certain tables, the statistics should be updated.
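The 10-20% drift guideline above amounts to a simple comparison between the row count recorded in the catalog statistics and the table's actual row count:

```python
# Simple drift check matching the 10-20% guideline above: compare the
# row count in the catalog statistics with the actual row count.
def drift_pct(stats_rows, actual_rows):
    """Relative drift of the catalog statistics, in percent."""
    if stats_rows == 0:
        return float("inf") if actual_rows else 0.0
    return abs(actual_rows - stats_rows) * 100 / stats_rows

print(drift_pct(100_000, 118_000))  # 18.0 -> within the "update now" band
print(drift_pct(100_000, 102_000))  # 2.0  -> no update needed yet
```

In practice the stats-side row count comes from the catalog (or the Database catalog statistics page) and the actual count from a COUNT(*) or a reliable estimate.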

6.3.3.4.2 How to update database catalog statistics

Updating statistics is a very resource-intensive process; updating all statistics on a large database may take many hours. As an example, a full update on Guidewire’s 1-million-claim database takes 14 hours. Several means exist to update statistics:

• The following command updates all statistics on the database:
maintenance_tools -server <server> -password <password> -startprocess dbstatistics

• Database catalog statistics analysis (either manual or through the Guidewire tool) may show that only a few tables’ statistics have drifted significantly. In that case, the DBA may choose to update the statistics on those specific tables only. The following command lists all the SQL statements executed during a full database statistics update:
maintenance_tools -server <server> -password <password> -getdbstatisticsstatements
Its output can be analyzed to identify the commands to run to update the statistics on the tables that have drifted most.

In both cases, <server> is the application server name and <password> is the associated admin password.

6.3.3.4.3 Additional elements

Updating statistics can be done while users are on the system, but it will have an adverse effect on performance. It is therefore recommended to update the full statistics during off or low-usage hours and to put the ClaimCenter server into maintenance mode. A mix of full and partial database catalog statistics updates can be adopted by:

• Updating the statistics on targeted tables showing statistical drift, on a regular basis
• Updating the full database statistics during maintenance periods

Monitoring the performance of the database should be an ongoing task, and the database catalog statistics should be inspected at least weekly. Database catalog statistics are saved in the database's system tables: they are not lost upon database shutdown/restart, and a full database export/import (with system tables) carries the statistics over.

Database catalog statistics are updated using a sampling rate. A 20% sample is standard in the Oracle world; nevertheless, Guidewire experienced performance problems when using that sample rate, and such issues did not arise with 100% sampling. Therefore, Guidewire recommends using 100% sampling when running manual statistics updates on tables with outdated statistics, and the software forces a default value of 100%. Note that a 100% sampling rate causes statistics updates to take longer than they would at 20%, and that there is an issue with Oracle 9.2.0.4 and 100% sampling, detailed in document reference 34.
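As an illustrative sketch of a manual, 100%-sample update on a single drifted table, the standard Oracle DBMS_STATS package can be invoked from SQL*Plus as follows. The schema and table names below are placeholders, not part of this guide's prescribed procedure:

```sql
-- Gather fresh statistics on one drifted table with a 100% sample,
-- cascading to its indexes. Schema and table names are placeholders.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'CCUSER',      -- hypothetical schema name
    tabname          => 'CC_CLAIM',    -- table identified as drifted
    estimate_percent => 100,           -- full sample, per the guidance above
    cascade          => TRUE);         -- refresh index statistics as well
END;
/
```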

6.3.3.5 Database catalog statistics page

The database catalog statistics are available, as part of the Internal Tools, from versions 3.0.3 and 3.1.1 on. Log in to the product as admin, then modify the URL to point to the DatabaseCatalogStatistics page (http://localhost:8080/cc/DatabaseCatalogStatistics.do in a vanilla installation). A link to that page is also available from the InternalTools URL (http://localhost:8080/cc/InternalTools.do in a vanilla installation).


When troubleshooting performance, this page should be analyzed; the analysis helps determine whether the drift is likely to cause optimization issues. If Guidewire has been contacted about performance issues, the output of the database catalog statistics page should be provided to Guidewire support for analysis.

6.3.3.6 Database default settings (currently Oracle only)

Guidewire generally uses the default settings provided by the certified database. Some specific default settings, known to be inappropriate, are forced to a correct value through the software. Guidewire does not test with non-default values, and some changes may have a significant impact on performance and/or integrity. Past experience shows that certain setting modifications (made either through code or through database settings) have proven detrimental; in consequence, code adjustments have been made to force the right values. For example, cursor sharing was modified to FORCE at a customer site; from 3.0.3 and 3.1.1 on, the code forces that value to EXACT, matching the default.

ClaimCenter 3.0.4 and 3.1.1 and higher provide the DatabaseParameters page in the Internal Tools. That tool displays all database parameters of interest in a single web page, which can be saved and returned to Guidewire support for further analysis. The feature was initially introduced in 3.0.4 and 3.1.1 for Oracle; other databases will be supported later. With Oracle, it requires that the DBA grant SELECT privilege on v$parameter2 and v$option to the Guidewire database user. On prior releases, the same information can be gathered from Oracle by running the following commands:

• select * from v$parameter2 order by name;
• select * from v$option order by parameter;

6.3.4 Adding indexes to the Guidewire schema

Since ClaimCenter 2.1, Guidewire supports the addition of indexes to both core and extension tables. This is done by modifying the config/extensions/extensions.xml file (following the semantics defined in extensions.xsd). If the desired index type is not supported by extensions.xsd, Guidewire should be consulted to ensure correctness.

Adding indexes may be necessary to support integration code or customization; for example, some customers add an index to improve the performance of a batch query needed for custom reports. Such cases qualify as an appropriate reason to add an index. Adding indexes to remediate perceived product performance issues, however, should be considered with great care. In Guidewire's experience, most performance issues initially deemed to be caused by a missing index have other root causes, and in those cases adding an index would have been a distraction from resolving the real issue. The performance analysis techniques in this guide should be used to investigate and rule out other root causes before index addition is considered. In all cases, the addition of an index can have a significant performance impact on all queries, so Guidewire recommends quantifying the impact of the change through performance testing.
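Purely as a hypothetical sketch of what an extension index definition might look like, along the lines of the mechanism described above: the element and attribute names below are illustrative inventions, and the authoritative grammar is the extensions.xsd shipped with your installation.

```xml
<!-- Hypothetical sketch only: consult extensions.xsd for the actual
     element and attribute names before adding an index. -->
<extension entityName="Claim">
  <index name="ClaimCustomReportIdx" unique="false">
    <!-- Columns the custom report query filters and sorts on -->
    <indexcol name="LossDate" keyposition="1"/>
    <indexcol name="State" keyposition="2"/>
  </index>
</extension>
```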

6.3.5 Finders

Through finders, the Guidewire platform provides the ability to write SQL queries while following an object-oriented model and ensuring type safety. As the previous chapters indicate, SQL queries are prone to ineffective query plans; finder queries should therefore be tested very thoroughly during performance testing. During that analysis, one may consider adding indexes, as described in the previous chapter. Index changes should be considered with great care, as they may change the performance behavior of the whole application.


6.4 Experience from the field

6.4.1 Conversion performance issues

6.4.1.1 Forcing statistics updates during conversion

During a validation & population run, the updatedatabasestatistics option does not change statistics for tables that have no associated staging table. In some validation & population cases, this has been shown to cause performance issues; the financial denormalization tables, for instance, have no staging tables associated with them. An option is to update the additional statistics manually, prior to running the validation & population. In one case, a large amount of data was present in cc_checkrpt, which has no staging table associated with it. The populated ccst_check was the closest staging table, and its row and block counts were used to seed the statistics of cc_checkrpt prior to running the validation/population:

• The row count was retrieved from ccst_check: SELECT COUNT(*) FROM ccst_check;
• The block count was retrieved from ccst_check: SELECT blocks FROM user_tables WHERE table_name = 'CCST_CHECK';
• The row and block counts for cc_checkrpt were set: dbms_stats.set_table_stats(<schemaName>, 'cc_checkrpt', numrows=><rowcount>, numblks=><blockcount>);
• For each index on cc_checkrpt, the row and leaf block counts were set: dbms_stats.set_index_stats(<schemaName>, <indexName>, numrows=><rowcount>, numlblks=><blockcount>/10);

The validation/population was then retried and showed improved performance.
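Put together, the procedure above can be sketched as a SQL*Plus session. The placeholders <schemaName>, <indexName>, <rowcount> and <blockcount> are kept as in the text and must be substituted with the values retrieved from ccst_check:

```sql
-- Retrieve seed values from the populated staging table.
SELECT COUNT(*) FROM ccst_check;
SELECT blocks FROM user_tables WHERE table_name = 'CCST_CHECK';

-- Seed cc_checkrpt with those counts; <rowcount> and <blockcount>
-- stand for the two values returned above.
BEGIN
  dbms_stats.set_table_stats(<schemaName>, 'cc_checkrpt',
                             numrows => <rowcount>,
                             numblks => <blockcount>);
  -- Repeat for each index on cc_checkrpt.
  dbms_stats.set_index_stats(<schemaName>, <indexName>,
                             numrows  => <rowcount>,
                             numlblks => <blockcount>/10);
END;
/
```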

6.4.2 All load issues

6.4.2.1 Optimization rules settings (Oracle only)

Guidewire products use Oracle cost-based optimization. Other optimization modes, such as rule-based optimization, are supported by neither Oracle nor Guidewire and should not be used, as they may impede performance and/or database integrity.

7 References

The following references have been used throughout this document:

• Reference 1: "Optimal Storage Configuration Made Easy", Oracle, http://www.oracle.com/technology/deploy/availability/pdf/oow2000_same.pdf
• Reference 2: "Oracle architecture and performance tuning on AIX", IBM, http://www-1.ibm.com/support/docview.wss?uid=tss1wp100657&aid=1
• Reference 3: "Tuning Oracle 10g database for ext3 filesystem", Red Hat, http://www.redhat.com/magazine/013nov05/features/oracle/
• Reference 4: "Oracle AIX tips", IBM, http://www-03.ibm.com/support/techdocs/atsmastr.nsf/5cb5ed706d254a8186256c71006d2e0a/c82a72e602d0fc4b86256fc100683d73/$FILE/ora9i10g_5L_v2.1_090705.pdf
• Reference 5: "Improve DB2 performance on AIX using Concurrent IO", IBM, http://www3.software.ibm.com/ibmdl/pub/software/dw/dm/db2/dm-0408lee/CIO-article.pdf
• Reference 6: "Logical Volume Management", Wikipedia, http://en.wikipedia.org/wiki/Logical_volume_management
• Reference 7: "Oracle9iR2 on Linux", Red Hat, http://www.redhat.com/whitepapers/rhel/OracleonLinux.pdf
• Reference 8: "Oracle 9iR2 on Intel EM64T architecture", Dell, http://www.dell.com/downloads/global/power/ps4q04-20040167-Radhakrishnan.pdf


• Reference 9: "Operating Systems and PAE Support", Microsoft, http://www.microsoft.com/whdc/system/platform/server/PAE/pae_os.mspx
• Reference 10: "Upgrading from Red Hat Enterprise Linux 2.1 AS To Red Hat Enterprise Linux 3", Oracle, http://www.oracle.com/technology/pub/notes/technote_rhel3.html
• Reference 11: "Cost Control: Inside the Oracle Optimizer", Oracle, part 1: http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/burleson_cbo_pt1.html and part 2: http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/burleson_cbo_pt2_pt1.html
• Reference 12: "Understanding System Statistics", Jonathan Lewis, http://www.oracle.com/technology/pub/articles/lewis_cbo.html
• Reference 13: "An Introduction to IP multicast", http://ntrg.cs.tcd.ie/undergrad/4ba2/multicast/
• Reference 14: "Tuning and optimizing RHEL for Oracle 9i and 10G", Werner Puschitz, http://www.puschitz.com/TuningLinuxForOracle.shtml
• Reference 15: "Guidewire Scalability, High Availability and Disaster Recovery Guidelines", Guidewire
• Reference 16: "Tuning Garbage Collection with the 1.4.2 Java[tm] Virtual Machine", Sun Microsystems, http://java.sun.com/docs/hotspot/gc1.4.2/index.html
• Reference 17: "Heap Dark Matter", IBM, http://www-1.ibm.com/support/docview.wss?uid=swg21214654
• Reference 18: "Handling memory leaks in Java programs", IBM, http://www-128.ibm.com/developerworks/java/library/j-leaks/
• Reference 19: "Guidewire Troubleshooting Guidelines", Guidewire
• Reference 20: "Oracle Metalink Note 46757.1", Oracle
• Reference 21: "Large Memory Support in Windows 2003", Microsoft, http://support.microsoft.com/kb/283037/
• Reference 22: "Data corruption when PAE is enabled on Windows 2003", Microsoft, http://support.microsoft.com/kb/834628/
• Reference 23: "Commons DBCP", Apache Software Foundation, http://jakarta.apache.org/commons/dbcp/
• Reference 24: "Guidewire WebSphere Guidelines", Guidewire Software
• Reference 25: "Guidewire Platform Guidelines", Guidewire Software
• Reference 26: "IBM 32 bit SDK for AIX", IBM, http://www-128.ibm.com/developerworks/java/jdk/aix/142/sdkguide.aix32.html#vlpm
• Reference 27: "Windows Java Address Space", IBM, http://download.boulder.ibm.com/ibmdl/pub/software/dw/jdk/diagnosis/dw3gbswitch3.pdf
• Reference 28: "What is the largest maximum Java heap size (Xmx) allowed on Windows platform?", IBM, http://www-1.ibm.com/support/docview.wss?rs=203&context=SW000&dc=DB510&dc=DB520&dc=D800&dc=D900&dc=DA900&dc=DA800&dc=DB530&dc=DA600&dc=D600&dc=D700&dc=DA500&dc=D200&dc=DA410&dc=DA450&dc=DA430&dc=DA440&dc=DB540&dc=DB400&dc=DA420&dc=DA460&dc=DB300&dc=DA470&dc=DA480&dc=DB100&dc=DA4A10&dc=DA4A20&dc=DA700&dc=DA4A30&dc=DB550&dc=D100&q1=maximum+heap+size+websphere&uid=swg21249894&loc=en_US&cs=UTF-8&lang=all
• Reference 29: "Supersizing Java: Large Pages on the Opteron Processor", DevX, part 1: http://www.devx.com/amd/Article/30529 and part 2: http://www.devx.com/amd/Article/30785
• Reference 30: "The Coyote HTTP/1.1 Connector", Apache Software Foundation, http://tomcat.apache.org/tomcat-4.1-doc/config/coyote.html
• Reference 31: "IBM WebSphere Application Server, Version 5.1, Server and Environment", IBM, ftp://ftp.software.ibm.com/software/webserver/appserv/library/wasv51base_servenv.pdf


• Reference 32: "Guidewire Security Guidelines", Guidewire
• Reference 33: "Java HotSpot Options", Sun Microsystems, http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp
• Reference 34: "Guidewire Infrastructure Technical Bulletin", Guidewire
• Reference 35: "Frequently Asked Questions About the Java HotSpot VM", Sun Microsystems, http://java.sun.com/docs/hotspot/PerformanceFAQ.html
• Reference 36: "Stored Outlines and Plan Stability", Oracle Base, http://www.oracle-base.com/articles/misc/Outlines.php
• Reference 37: "Translation Lookaside Buffer", Wikipedia, http://en.wikipedia.org/wiki/Translation_Lookaside_Buffer
• Reference 38: "Standard Performance Evaluation Corporation", SPEC, http://www.spec.org/ with JBB2000, http://www.spec.org/jbb2000/ and JBB2005, http://www.spec.org/jbb2005/
• Reference 39: "Choosing an I/O Scheduler for Red Hat Enterprise Linux 4 and the 2.6 Kernel", Red Hat, http://www.redhat.com/magazine/008jun05/features/schedulers/
• Reference 40: "SAME and the HP XP512", Oracle, http://www.oracle.com/technology/deploy/availability/pdf/SAME_HP_WP_112002.pdf
• Reference 41: "IBM System Storage DS4000 Series and Storage Manager", IBM whitepaper, http://www.redbooks.ibm.com/redbooks/pdfs/sg247010.pdf
• Reference 42: "E-Business Applications 11i (11.5.10) benchmark - using Oracle 10g on IBM System p570 Power6 Processor technology and System p5 servers", Oracle, http://www.oracle.com/apps_benchmark/doc/E-Bus-11i-OASB_ORA_Med_IBM-p570-3000.pdf

8 Planned additions to the document

There are no planned additions to the document at this time.

9 Lexicon

• HBA: Host Bus Adapter, the card used to connect a server or an array to a transport link (Fibre Channel or SCSI).

• LVM: Logical Volume Manager, a technology that virtualizes the disk device layer to allow for easy partition resizing and striping. Reference 6 describes this further.

10 Appendix

10.1 RAID-10 versus RAID-5

Guidewire indicates that each user issues between 0.5 and 0.75 storage IOs per second, depending on factors such as database size. For the purpose of this exercise, we will use the lower figure (0.5). Furthermore, the Guidewire load is composed of approximately 75% reads and 25% writes, and the corresponding IOs are small (around the 8K database block size). In such a workload, the storage limiting factor is the maximum number of IOs a disk can process per second: peak-performance 15k RPM disks can process 150 IOs per second, and a more conservative metric is 125 IOs per second. The purpose of the exercise is to determine the number of users that 40 disks can support in either RAID-10 or RAID-5. The following paradigms apply for these architectures:

• RAID-5: Disks are aggregated in groups of N+1. Parity is scattered across all disks, so the usable capacity corresponds to N disks. N cannot be too high; otherwise, the risk of concurrent failure of two disks in the same group becomes unacceptable. For the purpose of this exercise, N is equal to 5, and the XOR (parity) information is spanned across the N+1 disks in each group. When 1 read IO is processed by the array, 1 IO is issued to the disk where the data resides. When 1 write IO is processed by the array, 4 IOs are issued to the disks: 1 read of the data location, 1 read of the XOR location, 1 write to the data location and 1 write to the XOR location.

• RAID-10: Disks are coupled in mirrors. When 1 read IO is processed by the array, 1 IO is issued to one of the disks where the data resides. When 1 write IO is processed by the array, 1 IO is issued to both disks.

U is the number of users. For the purpose of the exercise, 1 user issues 0.5 IO per second.

10.1.1 RAID-10 calculus

Per second:

• Number of user reads to the array: 3/4 * 1/2 * U = 3/8 * U
• Number of user writes to the array: 1/4 * 1/2 * U = 1/8 * U
• Number of reads issued to the disks: 3/8 * U
• Number of writes issued to the disks: 1/8 * U * 2 = 1/4 * U
• IO capacity (40 disks at 125 IOs per second): 5000

3/8 * U + 1/4 * U = 5000 => U = 8000. 40 disks in RAID-10 support 8000 users.

10.1.2 RAID-5 calculus

Per second:

• Number of user reads to the array: 3/4 * 1/2 * U = 3/8 * U
• Number of user writes to the array: 1/4 * 1/2 * U = 1/8 * U
• Number of reads issued to the disks: 3/8 * U
• Number of writes issued to the disks: 1/8 * 4 * U = 1/2 * U
• IO capacity (40 disks at 125 IOs per second): 5000

3/8 * U + 1/2 * U = 5000 => U = 5714. 40 disks in RAID-5 support 5714 users.

10.1.3 Conclusion

For the same number of disks, RAID-10 supports a significantly higher number of concurrent users. RAID-10 also has the following additional advantages:

• Under heavier write loads, the advantage sways even further toward RAID-10. Document reference 41, page 180, estimates that RAID-5 should not be used for write loads higher than 10%.

• RAID-5 redundancy is inherently poorer than RAID-10, as indicated on page 180 of document reference 41. This point is discussed further in document reference 15.

Oracle has expressed its preference for RAID-10 for simplicity and predictability (c.f. document reference 1). On the other hand, RAID-5 can support a much larger amount of data than RAID-10: if the 40 disks are 70GB each, 40 disks in RAID-5 (with N=5) provide 40 * 5/6 disks of usable space, i.e. about 2333GB, while the same 40 disks support only 1400GB in RAID-10. This is the conundrum of sizing for performance versus sizing for capacity.
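The arithmetic in this appendix can be checked with a short script. This is a sketch: the disk count, per-disk IOPS, per-user IO rate, read/write mix, and 70GB disk size all follow the assumptions stated above.

```python
# Verify the RAID-10 vs RAID-5 sizing arithmetic from this appendix.
DISKS = 40
IOPS_PER_DISK = 125          # conservative figure for a 15k RPM disk
IO_PER_USER = 0.5            # storage IOs per second per user
READ_FRACTION = 0.75         # workload is ~75% reads, 25% writes
DISK_GB = 70                 # assumed per-disk capacity

capacity = DISKS * IOPS_PER_DISK            # 5000 disk IOs per second

def users(write_penalty):
    """Users supported, given how many disk IOs one array write costs."""
    reads = READ_FRACTION * IO_PER_USER             # 3/8 per user
    writes = (1 - READ_FRACTION) * IO_PER_USER      # 1/8 per user
    return capacity / (reads + writes * write_penalty)

raid10_users = users(write_penalty=2)   # a mirrored write hits both disks
raid5_users = users(write_penalty=4)    # read data+parity, write data+parity

raid10_gb = DISKS // 2 * DISK_GB        # half the disks hold mirror copies
raid5_gb = int(DISKS * 5 / 6 * DISK_GB) # 5 data disks per 6-disk group

print(int(raid10_users), int(raid5_users))  # 8000 and 5714
print(raid10_gb, raid5_gb)                  # 1400 and 2333
```

The write penalty is the only parameter that differs between the two layouts, which is why the read term (3/8 * U) is identical in both calculations above.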