AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability


Page 1: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

© 2015 IBM Corporation

Preparing to Fail: Practical WebSphere Application Server High Availability
Tom Alcott, STSM

Page 2: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Agenda

• Level Set and Definitions

• WebSphere Application Server Request Processing

• WebSphere Application Server High Availability

• Application Availability

• Preparing to Fail

• Final Thoughts

Page 3: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Why Do We Care About High Availability?

• Aside from the numerous application functional requirements, one critical and often assumed non-functional requirement is application availability

• Increasing impacts from downtime are driving shorter tolerance for service interruptions

• “One minute of system downtime can cost an organization anywhere from $2,500 to $10,000 per minute. Using that metric, even 99.9% data availability can cost a company $5 million a year” – The Standish Group


Page 4: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Introduction

• Specific to a Highly Available Infrastructure

• Have you prepared for failure?

o Hardware Components

o Software Components

o Applications

o Procedures

• Have you tested your infrastructure under failure conditions?

• If not, this could apply to you…

"Failing to prepare is preparing to fail“

– John Wooden

Page 5: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Definitions

• High Availability (HA)

o Ensuring that the system can continue to process work within one location after routine single component failures

• Usually we assume a single failure

• Usually the goal is very brief disruptions for only some users for unplanned events

• Continuous Operations

o Ensuring that the system is never unavailable during planned activities

• E.g., if the application is upgraded to a new version, we do it in a way that avoids all downtime

Page 6: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Definitions

• Continuous Availability (CA)

o High Availability coupled with Continuous Operations

• No tolerance for planned downtime

• As little unplanned downtime as possible

• Very expensive

• Note that while achieving CA almost always requires an aggressive DR plan, they are not the same thing

• Disaster Recovery (DR)

o Ensuring that the system can be reconstituted and/or activated at another location and can process work after an unexpected catastrophic failure at one location

• Often multiple single failures (which are normally handled by high availability techniques) are considered catastrophic

• There may or may not be significant downtime as part of a disaster recovery

Page 7: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Definitions - Disaster Recovery vs. High Availability

• Both High Availability and Disaster Recovery have a common goal

• Business continuity

• But under different conditions

o HA: localized failures, e.g., server crash

• DR: loss of entire production system

o Natural disasters – flood, fire, earthquake

o Man-made disasters

Page 8: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

WAS-ND HA Architecture

• WAS-ND Full Profile and Liberty Profile Cluster Composed of Multiple Identical Peers

• Each Capable of Performing the Same Work

• Application Servers Independent of Management Runtime and Each Other

• Application Servers Load Configuration From Local File System

• JNDI Lookups

o Each Application Server Has Its Own JNDI Service

• Security

o Each Application Server Has Its Own Security Server

• Transactions

o Each Application Server Logs and Manages Distributed Transactions

• Systems Management

o Each Application Server Has Its Own JMX MBean Server

Above Applies to WAS V5.x, WAS V6.x, WAS V7.x and WAS V8.x

Page 9: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Realities

• 100% of requests aren't going to work perfectly 100% of the time

• But WAS makes provisions to ensure that important requests (transactions and persistent messages) always work (eventually).

• Optimized HA/CA Requires

• Considerable expense and planning to execute properly

• An architecture purpose-built around this level of requirement.

• Alignment of Processes and Procedures with the Architecture and Operational Requirements

• No Single Checklist Covers It All

o Consider Engaging Services for Assistance


Page 10: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Request Processing Definitions

• Request Distribution

o Incoming Requests are Distributed Across Multiple Identical Server Instances

o Distribution “Agent” Typically Employs an Algorithm

• Request Redirection (see the sketch after this list)

o There is an attempt to contact a server to make a request; when the request fails, the request is redirected to another server

o Also Known as Failover; Assumes That There are Multiple Identical Servers (a cluster)

• Work Load Management

o Balances Client Requests Across Servers

o Active Monitoring of Response Time and Capacity

o Request Distribution Based on Workload
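A minimal Java sketch of request distribution plus redirection, assuming a list of identical cluster members; it illustrates the general "round robin, mark down, retry later" pattern described above and is not the actual plug-in or ORB implementation (all names are hypothetical):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class FailoverRouter {
    private final List<String> members;                        // identical cluster members
    private final Map<String, Long> downUntil = new ConcurrentHashMap<>();
    private static final long RETRY_INTERVAL_MS = 60_000;      // periodic retry of a marked-down member
    private int next = 0;                                      // round-robin cursor

    FailoverRouter(List<String> members) { this.members = members; }

    synchronized String send(String request) throws Exception {
        Exception last = null;
        for (int i = 0; i < members.size(); i++) {
            String m = members.get(next);                      // request distribution: round robin
            next = (next + 1) % members.size();
            Long until = downUntil.get(m);
            if (until != null && System.currentTimeMillis() < until) continue; // marked unavailable
            try {
                return forward(m, request);                    // a connection or I/O timeout surfaces here
            } catch (Exception e) {
                downUntil.put(m, System.currentTimeMillis() + RETRY_INTERVAL_MS); // mark down
                last = e;                                      // redirection: fall through to the next member
            }
        }
        throw last != null ? last : new IllegalStateException("no members configured");
    }

    private String forward(String member, String request) throws Exception {
        return member + " handled " + request;                 // placeholder for the real HTTP/IIOP call
    }
}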

Page 11: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Agenda

• Level Set and Definitions

• WebSphere Application Server Request Processing

• WebSphere Application Server High Availability

• Application Availability

• Preparing to Fail

• Final Thoughts

Page 12: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

WAS V8.x (and before) HTTP Request Distribution


• HTTP Server Plug-in

– Distributes Requests to Cluster Members

• Round Robin or Random

– Maintains Client Affinity Using HTTP Session

– Detects Application Server Failure

• Connection or I/O Timeout

– Marks Container as Unavailable

– Periodic Retry

– Tries Next Cluster Member

[Diagram: Web clients → HTTP Server (with HTTP Server Plug-in) → Web Containers in two Application Servers forming a (static) cluster]

http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.doc/ae/crun_srvgrp.html

Page 13: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

WAS V8.x (and before) Web Service Requests

• WS Client Direct Connection: No WLM/Failover

• Stateful Client to Web Service in WAS-ND via WAS-ND Proxy or DataPower

• Stateful Client, using HTTP Session, to Web Service in WAS-ND via HTTP Server, WAS-ND Proxy or DataPower

• Employ WS-Addressing for Clustering/Failover

http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.multiplatform.doc/ae/cwbs_wsa_eprs.html

Page 14: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

WAS-ND V8.x (and Before) IIOP Request Distribution


• Java ORB Plug-in

– Weighted Round Robin Request Distribution

– Maintains Client Affinity As Appropriate

– Stateful Requests

– Transactions

– Detects Failure

• Connection or I/O Timeout

– Marks Container as Unavailable

– Periodic Retry

– Tries Next Cluster Member

[Diagram: a Java client's ORB plug-in distributing IIOP requests across EJB Containers in multiple Application Servers]

http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.doc/ae/crun_srvgrp.html

Page 15: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

WAS-ND V8.5 Intelligent Management HTTP ** WLM


[Diagram: HTTP Server (with Plug-in) → On Demand Router(s) → Web Containers in Application Servers forming a (dynamic) cluster]

• On Demand Router

– Distributes Requests to Cluster Members

• Weighted Outstanding Requests

– Maintains Client Affinity Using HTTP Session

– Detects Application Server Failure

• Connection or I/O Timeout (as with plugin)

• Change in Server Status (On Demand Configuration)

– Works In Conjunction with

– Autonomic Request Flow Management

– Controls Request Flow

– May suspend and re-order requests in order to prevent overload and achieve service policy

– Health Management Controller

– Routes Requests to Replacement Server

– Application Placement Controller

– Adjusts Cluster Size


** IIOP and JMS Requests are not managed by the ODR, but are WLM’d at the Application Server

Page 16: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

V8.5.5 Intelligent Management for Webservers

The ODR (On Demand Router) provides features such as automatic discovery, edition-aware routing and caching, health policies, dynamic clusters, maintenance mode, conditional trace, etc.

[Diagram: a web server tier (IHS/Apache with ODRLIB) and an ODR tier routing to application server tiers]

Page 17: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

V8.5.5 Intelligent Management for Webservers - Exclusions

• ODR routing rules

o E.g., no load balancing or failover for the same application in multiple cells

• CPU/memory overload protection

o Throttles traffic when the CPU utilization or heap utilization goes above a configured threshold on an application server host

• Application Lazy Start

• Request prioritization

o No queuing and re-ordering of requests based on service policies

• Highly available deployment manager

• Request Classification based on the user identity in an LTPA token

• Workload and storm-drain health policies

Page 18: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Service Policies

• Service policies are used to define application service level goals

• Allow workloads to be classified, prioritized and intelligently routed

• Enable application performance monitoring

• Resource adjustments are made if needed to consistently achieve service policies

Service Policies define the relative importance and response time goals of application services; they are defined in terms of the end user result the customer wishes to achieve

Page 19: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Agenda

• Level Set and Definitions

• WebSphere Application Server Request Processing

• WebSphere Application Server High Availability

• Application Availability

• Preparing to Fail

• Final Thoughts

Page 20: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

WAS-ND : HA Architecture – A Brief Review

• Peer Recovery Model with Active Hot Standbys for persistent services

• Transactions

• Messaging

• If a JVM fails, then any Singletons running in that JVM are restarted on a Peer once the failure is detected

• Starting on an already running Peer eliminates the start-up time of a new process, which could take minutes

• Planned failover takes a few seconds

• This low failover time means WAS can tolerate many failures without exceeding the yearly maximum outage dictated by a 99.999% SLA (0.001% of 525,600 minutes per year ≈ 5.26 minutes)


[Diagram: inside a WAS-ND JVM, the High Availability Manager, built on Distribution and Consistency Services (DCS) and Reliable Multicast Messaging (RMM), supports the Transaction Service, Workload Management (WLM), Data Replication Services (DRS), the Messaging Engine, and On-Demand Configuration (ODC)]

Page 21: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

HA: Example Transaction Log Peer Failover

• Provides Failover of In-flight 2 PCtransactions

• WAS-ND Can Be Configured toStore Transaction Logs For EachServer on a Shared File System ***

• Allows All Peers to See All TransactionLogs

• Automatic HAManager Triggered Failover

• When a WAS-ND cluster MemberFails, a Peer is Elected to Processthe Transaction Log From the FailedServer

• In Doubt Transactions From a FailedServer Are Processed Very Quickly,Typically In Seconds (or less!)

• Significantly Faster Than HardwareClustering Which Can Take Minutes

• Resource Managers Locks ReleasedQuickly

*** Database option in V8.0.07 and later

[Diagram: a WAS-ND cluster of two Application Servers, each with a Transaction Manager writing its TranLog to a shared file system; when one server crashes, a peer processes its log and releases locks held in the Resource Managers (database, message queue)]

Page 22: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Service Integration Bus - High Availability

[Diagram: a WAS-ND cluster of members A, B and C; the SIBus Messaging Engine (ME) runs on one member and fails over to another]

• The SIBus Messaging Engine is Managed by the HA Manager

• HA is provided by failing over the ME service to a different cluster member

• Default is “One of N” Core Group Policy

• Options Exist for Multiple/Partitioned Queues

• Options Exist for Multiple MDB Consumers as well as a Single Consumer

Page 23: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

WAS V8.5 ME Enhancements

• Restrict long-running Database Locks

o Active ME now holds only short locks on the SIBOWNER table while revalidating its ownership at regular intervals

• Ability for the SIBus to detect a hang in the “active” ME and switch over to the “standby” ME

o Adds ME Last Update Time to the SIBOWNER Table

o Backup ME Can Safely Take Ownership and Avoid Split Brain

• ME able to gracefully stop on database failures instead of killing the entire JVM

o Other Applications In the JVM Hosting the ME Continue to Run

• Automatically “re-enable” an ME if it enters a “disabled” state

o In a Large Cluster It Can Be Difficult to Administratively Determine Which ME is “disabled”

• Configure a new ME to recover data from an orphaned persistence store

o Reads and Updates the ME UUID from Persistent Records

• Persist the JMS re-delivery count value

o Avoids Reprocessing of Messages That May Cause an Outage

• Utilization of multiple cores for quicker ME start-up when a large number of messages and destinations are present

Page 24: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

WAS-ND V8.5 Health Management

Automate “Sick” Application Server Restart

Predefined health policies and custom health policies can be defined for common server health conditions

When a health policy's condition is true, corrective actions execute automatically or require approval:

• Notify administrator (send email or SNMP trap)

• Capture diagnostics (generate heap dump, java core)

• Restart server

Excessive response time means you are monitoring what matters most: your customer's experience!

Application server restarts are done in a way that prevents outages and service policy violations

Each health policy can be in supervise or automatic mode. Supervise mode is like training wheels, allowing you to verify that a health policy does what you want before making it automatic.

Health Conditions

• Excessive request timeouts: % of timed out requests

• Excessive response time: average response time

• Excessive garbage collection: % of time spent in GCs

• Excessive memory: % of maximum JVM heap size

• Age-based: amount of time server has been running

• Memory leak: JVM heap size after garbage collection

• Storm drain: significant drop in response time

• Workload: total number of requests

Page 25: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

WAS-ND V8.5 Deployment Manager HA

• Each deployment manager on a separate machine

o Only one is active

o Others are standby

• Shared file system required for the dmgrs to share the configuration repository

o File system with recoverable locks required, e.g. SAN FS, NFS v4, GPFS

• JMX traffic proxied through the WVE On-Demand Router (ODR)

o SOAP connector only

• Clustered ODRs recommended

o (they're recommended for any HA production component anyway)

• hadmgrAdd command line utility provided to perform the configuration

http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.wve.doc/ae/rwve_xdhadmgrAdd.html

• Pre-WAS V8.5 options outlined in “The WebSphere Contrarian: Runtime management high availability options, redux” are still applicable

http://www.ibm.com/developerworks/websphere/techjournal/1001_webcon/1001_webcon.html

Page 26: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

• Liberty Collective: NEW multi-server administrative domain architected exclusively for Liberty!

• Liberty Controller: NEW scalable admin server built on Liberty.

• Liberty Clusters: NEW support for Liberty cluster management


Liberty Collective: Liberty Specific Management

[Diagram: v8.5.5 Liberty Collective: a Liberty Controller (WLP) managing a set of Liberty member servers (WLP)]

WLP = WebSphere Liberty Profile

Page 27: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Liberty Profile in a WAS-ND Cell

• Manage Liberty profiles as an integral part of an ND Cell!

• Built on Intelligent Management Middleware Server support

– Available in v8.5.5.1

• Assisted Lifecycle Management

– Basic console/scripting access to Liberty

– config access (i.e. server.xml)

– lifecycle (start/stop/status)

– log access (messages.log, etc)

– Dynamic Clusters and Health Management

[Diagram: an ND Cell containing the dmgr, HTTP/ODR tier and database, plus nodes each running a nodeagent, traditional appservers, and Liberty servers]

Page 28: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Capabilities Comparison


Capability               | Liberty ND Cell | Liberty Collective
Lightweight              | No              | Yes
Setup speed              | Low             | High
Memory use               | High            | Low
Reconfigurability        | Low             | High
Domain Scalability       | Low             | High
Admin Scalability        | Low             | High
Liberty deploy           | No              | Yes
Admin HA                 | Yes             | Yes
Static clusters          | Yes             | Yes
Health Manager           | Yes             | No (for now)
Dynamic clusters         | Yes             | Yes
Extends existing ND env  | Yes             | No

Page 29: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Agenda

• Level Set and Definitions

• WebSphere Application Server Request Processing

• WebSphere Application Server High Availability

• Application Availability

• Preparing to Fail

• Final Thoughts

Page 30: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Application Resiliency

• Efficient Request Processing

o Avoid Long-Running SQL Queries: Employ setMaxRows(int) and setFetchSize(int)

o Explicitly catch the WAS StaleConnectionException

• Asynchronous Application Architecture (see the sketch after this list)

o Limit Work Manager Thread Timeouts:

• waitForAll(workItems, timeout_ms) / join(workItems, JOIN_AND, timeout_ms)

• waitForAny(workItems, timeout_ms) / join(workItems, JOIN_OR, timeout_ms)

• startWork(work, timeout_ms, workListener)

• Java EE 7 Concurrency Thread Timeout

o service.submit(new Timeout()).get(2000, TimeUnit.MILLISECONDS);

• Stateless Application Architecture

o Or Minimize Application State Overhead

o Externalize State for Recovery/Failover
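A minimal sketch of the bounded-work techniques above, assuming an injected DataSource and any ExecutorService (for example, a Java EE 7 ManagedExecutorService); the table name, row caps and timeouts are illustrative:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import javax.sql.DataSource;

public class BoundedWork {
    // Cap result size and fetch in batches so one query cannot monopolize a connection
    static int countOrders(DataSource ds) throws Exception {
        try (Connection c = ds.getConnection();
             PreparedStatement ps = c.prepareStatement("SELECT ID FROM ORDERS")) {
            ps.setMaxRows(500);       // never return more than 500 rows
            ps.setFetchSize(50);      // fetch in batches of 50
            int n = 0;
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) n++;
            }
            return n;
        }
    }

    // Bound how long the caller waits for asynchronous work
    static String boundedCall(ExecutorService service) throws Exception {
        Future<String> f = service.submit(() -> "result");     // the real task goes here
        try {
            return f.get(2000, TimeUnit.MILLISECONDS);         // give up after 2 seconds
        } catch (TimeoutException te) {
            f.cancel(true);                                    // fail this request fast instead of stalling
            return "fallback";
        }
    }
}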

Page 31: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

WAS-ND Application Update (Pre V8.5)

• Admin console “Rollout Update” & wsadmin “updateAppOnCluster”

• Stops Cluster Member(s) on a Node

• Distributes Update to Node

• Re-starts Cluster Member(s) on Node

• Employs “Application Update” Function for Correct Event Registration and Synchronization

• While attractive in theory, this doesn’t provide for seamless updates from the end user’s perspective

o Plug-in detects Server Outage and Can Then Select Another Cluster Member

o Additional Effort May Be Required for Uninterrupted Service (see below)

• Primary Benefit is that it’s Superior to Manual & Scripted Approaches Using Stop Server, Sleep, etc.

• Better Approaches for Minimizing Downtime

• Dual Cells

• Single-cell wsadmin script that sets ServerWeight to 0, employs isAppReady and getDeployStatus, manually syncs each node, then resets ServerWeight

Page 32: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

WAS-ND V8.5 Application Edition Management

• Interruption-free update of an application on existing deployment targets (e.g. a dynamic cluster)

• Workload is quiesced on and diverted from each server or cluster as the edition swap is performed

• Group Rollout

• Old edition is replaced with the new edition one server at a time or in ‘batches’

• Atomic Rollout

• Old edition is completely offline before the new edition is available

• Application requests arriving in the window are queued by the on-demand router

• Edition back-out

• Ability to undo an edition rollout

• Simply use the edition rollout capability to roll out a previous edition

• Validation

• Hosting of the new edition in the production environment on a ‘clone’ of the original deployment targets

• Use routing policy to control edition visibility, e.g. only ‘test’ personnel

Page 33: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Application State

• Typically with a Planned Outage, Application State (requests) Can Be “Drained”

• For Unplanned Outages, is it Worth Investing in Application State Failover?

• Application State/ Session Failover

• Application Code Transparent

• State is Automatically Retrieved From the Remote Copy if Not Present in the Local Copy

• Session Distribution Options for Update of Remote/Backup Session Object

• Time-Based Write (default of 10 seconds)

– Employ the "NoAffinitySwitchBack" custom property when using TBW

• At End of the Servlet Service Method

• Manually (Requires Use of the IBM Extension to HttpSession; see the sketch after this list)

• Session Manager is Distribution Client

• No Application Visibility to DB or Replicator/Session Outage

• WAS V6.0.2 and above: Updates Occur in the Local Copy During a Replication/DB Outage

o Messages In Logs and Administration Client During Outage

• Performance Will Degrade As Remote Updates are Attempted
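A sketch of the manual distribution option above, assuming the session manager is configured for manual update mode; IBMSession is the IBM extension to HttpSession mentioned on this slide, while the servlet and attribute names are illustrative:

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.ibm.websphere.servlet.session.IBMSession;

public class CheckoutServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        IBMSession session = (IBMSession) req.getSession();
        session.setAttribute("cart", req.getParameter("cart"));
        session.sync();   // explicitly push this update to the remote/backup copy now
        resp.setStatus(HttpServletResponse.SC_OK);
    }
}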

Page 34: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Session Manager to DB

• We Need Your Help!

• RFE 34283 - HTTP Session manager DB resiliency

• Vote at http://www.ibm.com/developerworks/rfe/

• In the Interim

• Employ StaleConnectionRetry = 0

– APAR PI04871

– Will not suppress the error messages, but it will reduce them

– Default is 3 retries

Page 35: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

WAS-ND HTTP Session Failover – DRS Peer to Peer

Peer to Peer Configuration

Page 36: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

WAS-ND HTTP Session Failover – DRS Client Server

Client Server Configuration

Page 37: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

WAS Full Profile and Liberty Profile Session Failover – Database

[Diagram: HTTP Server (with Plug-in) → a cluster of Application Servers, each holding a local state copy, with the remote state copy stored in a database]

Page 38: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

WebSphere eXtreme Scale

• Alternative to DB Replication for Application State

• Independent of the WAS-ND Cell Infrastructure

• Servlet Filter Replacement for the Session Manager

• Installs in Any J2EE Application

• Replication Zones Allow Alignment Along Data Center Boundaries

WXS Session Failover (WAS Full Profile and Liberty Profile)

Page 39: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Agenda

• Level Set and Definitions

• WebSphere Application Server Request Processing

• WebSphere Application Server High Availability

• Application Availability

• Preparing to Fail

• Final Thoughts

Page 40: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Typical WebSphere Queuing Network

[Diagram: Web Clients → queue → Web Server (100 threads) → queue → Web Container (50 threads) → queue → ORB Pool (25 threads) → queue → JDBC Pool (10 objects) → Database]

• Guiding Principle – Keep the website moving!

• Don’t allow a large request queue to build in the App Server.

• It’s better to prematurely fail a small number of latent/long-running requests than to stall the entire website.

Page 41: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Sizing Pools, Queues and Timeouts

• Monitor Performance* at Each Layer and Set a Budget

• Pool Size = Req/Sec × Latency + (~20%) Cushion (see the worked example after this list)

o E.g. 36 = 30 Req/Sec × 1 Sec + 6

• Queue Size

o As Close to Zero as Practical

• Request Timeout = Latency Timeout + Successful Retry

o E.g. 3.0 seconds = 2.0 Seconds + 1 Second

• Connect Timeout = 99% Average Network Latency

o E.g. 5ms

• 1 Second is the WAS Minimum in Many Cases, So You’ll Need to Round Up for Some Settings

* See Chart in Backup Slides For PMI Suggestions
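A worked version of the budget rules above, using the slide's own illustrative numbers (30 requests/second at 1 second latency; a 2 second latency timeout plus 1 second for a successful retry):

public class SizingBudget {
    public static void main(String[] args) {
        double reqPerSec = 30.0, latencySec = 1.0;
        double poolSize = reqPerSec * latencySec * 1.20;      // Req/Sec x Latency + ~20% cushion = 36
        double requestTimeoutSec = 2.0 + 1.0;                 // latency timeout + one successful retry = 3.0
        System.out.printf("pool size ~%.0f, request timeout = %.1f s%n", poolSize, requestTimeoutSec);
    }
}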

Page 42: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

A Slight Digression On Tuning for Performance

• The Guidance on the Preceding Slide is Focused on Resiliency and Failover

• As it Turns Out, This Same General Guidance, without the Additional Attention to Queue Depth and Timeouts, Also Optimizes Performance

Page 43: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Web Server and Plugin

[Diagram: Web Clients → queue → Web Server (100 threads)]

• Web Server

o Threads

o Processes (StartServers in Apache/IHS)

• Web Server Plugin

o MaxConnections

o ConnectTimeout

o ServerIOTimeout

o PostSizeLimit

o PostBufferSize

o ServerIOTimeoutRetry (originally introduced in WAS V8.5, included in the V8.0 and V7.0 service streams)

APAR PM94198 adds URI-specific ServerIOTimeout, ServerIOTimeoutRetry, and Extended Handshake rules, available in 7.0.0.31, 8.0.0.8 and 8.5.5.2.

Page 44: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

HTTP Server & Plugin Tuning

• MaxConnections and MaxClients

• MaxConnections = ceiling(#ConcurrentUsers / #IHS / #JVMs)

• Then: MaxClients = MaxConnections × #JVMsPerNode (see the worked example below)
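A worked version of the two formulas above; only the formulas come from the slide, and the numbers (3000 concurrent users, 2 IHS instances, 6 JVMs spread 3 per node) are assumptions for illustration:

public class PluginSizing {
    public static void main(String[] args) {
        int concurrentUsers = 3000, ihsInstances = 2, jvms = 6, jvmsPerNode = 3;
        int maxConnections = (int) Math.ceil((double) concurrentUsers / ihsInstances / jvms); // 250
        int maxClients = maxConnections * jvmsPerNode;                                        // 750
        System.out.println("MaxConnections=" + maxConnections + ", MaxClients=" + maxClients);
    }
}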

Page 45: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Web Container

[Diagram: queue → Web Container (50 threads) → queue]

• Web Container

o Thread Pool (Web Container)

o Read timeout

o Write timeout

o Persistent Connections/Request

o Maximum Open Connections

o listenBackLog (TCP Channel Custom Property)

Page 46: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Web Service Requests

Transport Policy Set:
http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.multiplatform.doc/ae/rxml_wsfphttptransport.html

Timeout and Message Properties:
http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.multiplatform.doc/ae/rwbs_jaxwstimeouts.html

com.ibm.ws.websvcs.transport.common.TransportConstants.READ_TIMEOUT (default 300 seconds)
com.ibm.ws.websvcs.transport.common.TransportConstants.WRITE_TIMEOUT (default 300 seconds)
com.ibm.ws.websvcs.transport.common.TransportConstants.CONN_TIMEOUT (default 180 seconds)

• Use a WS Policy Set to configure timeouts, message properties, etc. (see the client-side sketch below)

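A hedged sketch of overriding those defaults per client proxy via the JAX-WS request context, assuming the WAS runtime classes are on the classpath; the keys are the TransportConstants named above, while the value type and units (seconds here, matching the listed defaults) should be verified against the linked topics:

import java.util.Map;
import javax.xml.ws.BindingProvider;
import com.ibm.ws.websvcs.transport.common.TransportConstants;

public class WsClientTimeouts {
    static void tighten(Object portProxy) {
        Map<String, Object> ctx = ((BindingProvider) portProxy).getRequestContext();
        ctx.put(TransportConstants.READ_TIMEOUT, "60");   // default 300 seconds
        ctx.put(TransportConstants.WRITE_TIMEOUT, "60");  // default 300 seconds
        ctx.put(TransportConstants.CONN_TIMEOUT, "30");   // default 180 seconds
    }
}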

Page 47: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

ORB Service (EJB Container)

[Diagram: queue → ORB Pool (25 threads) → queue]

• EJB Container

o Thread Pool (ORB)

o Thread pool timeout

o Request timeout

o Request retries count

• EJB Client

o -Dcom.ibm.websphere.wlm.unusable.interval

o -Dcom.ibm.CORBA.RequestTimeout

o -Dcom.ibm.CORBA.ConnectTimeout

o Locate request timeout

Page 48: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Connection Pool and Database

[Diagram: queue → JDBC Pool (10 objects) → Database]

• Connection Pool (see the retry sketch after this list)

o Maximum Connections (Pool Size)

o Connection Timeout

o PurgePolicy: EntirePool (Default)

• JDBC Provider

o Read Timeout *

o Login Timeout *

• Database

o Maximum Connections *

* Name varies by vendor
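A sketch of the classic client-side pattern that goes with these pool settings: catch the WAS StaleConnectionException (thrown when the pool hands back a connection the database has already dropped) and retry the unit of work a bounded number of times; the SQL and retry count are illustrative:

import java.sql.Connection;
import java.sql.PreparedStatement;
import javax.sql.DataSource;
import com.ibm.websphere.ce.cm.StaleConnectionException;

public class StaleRetry {
    static void updateWithRetry(DataSource ds, String sql) throws Exception {
        final int maxAttempts = 2;                   // one retry is usually enough after a pool purge
        for (int attempt = 1; ; attempt++) {
            try (Connection c = ds.getConnection();
                 PreparedStatement ps = c.prepareStatement(sql)) {
                ps.executeUpdate();
                return;                              // success
            } catch (StaleConnectionException sce) {
                if (attempt >= maxAttempts) throw sce;  // give up and surface the failure
                // with PurgePolicy=EntirePool the stale connections are discarded,
                // so the next getConnection() should hand back a fresh one
            }
        }
    }
}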

Page 49: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

WAS V8 Resource Adapter High Availability

• Resource failover and retry logic for relational data sources and JCA connection factories

• Simplifies application development

o Minimizes the application code required to handle failure of connections to relational databases and other JCA resources

o Provides a common mechanism for applications to uniformly respond to planned or unplanned outages

• Typically Employed with Database Replication (e.g. DB2 HADR, Oracle RAC)

• Administrator can tailor data source and connection factory configuration based on application needs:

o Alternate/failover resource reference on the primary data source

o Optionally:

• Number of connection retries

• Pre-population of the alternate/failover resource connection pool

• Auto failback

• Full control of functionality available to scripts and programs via a management MBean

http://pic.dhe.ibm.com/infocenter/wasinfo/v8r0/topic/com.ibm.websphere.nd.doc/info/ae/ae/cdat_dsfailover.html

Page 50: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Agenda

• Level Set and Definitions

• WebSphere Application Server Request Processing

• WebSphere Application Server High Availability

• Application Availability

• Preparing to Fail

• Final Thoughts

Page 51: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

High Availability for Non-WAS Components

• Firewall

• IP Sprayer

• WebSphere MQ

• Security Registry

• Database Server

• SOA Appliance (DataPower)

• File System

• Make them all HA!

• Via hardware clustering or software clustering

• 99.999% Can Only Be Achieved When All Components Are Engineered for This Availability Level

• WAS-ND Without an Overall 99.999% Infrastructure Will Not Assure 99.999% Availability (see the sketch below)
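A worked version of the point above: the availability of serially dependent components multiplies, so a single weaker tier caps the whole chain no matter how well WAS-ND itself is engineered. The tier values are illustrative:

public class ChainAvailability {
    public static void main(String[] args) {
        // firewall, IP sprayer and WAS cluster at five nines; one database tier at 99.9%
        double[] tiers = {0.99999, 0.99999, 0.99999, 0.999};
        double endToEnd = 1.0;
        for (double a : tiers) endToEnd *= a;
        System.out.printf("end-to-end availability = %.3f%%%n", endToEnd * 100); // ~99.897%
    }
}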

Page 52: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

“Gold Standard” – Dual WAS Cells or Liberty Collectives in One Data Center

Sharing HTTP Session Between Cells is NOT Recommended

Page 53: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

“Gold Standard”

• Two (or More) Cells/Collectives

• Provide Hardware Isolation

• Provide Software Isolation

• Infrastructure for Planned Maintenance without Outage

• Insurance Against Catastrophic Administrative Outage

• Requires More Administrative Effort and Rigor (Scripting)

• Don’t Forget “Rule of 3”

• Discussion Typically Is in the Context of HA Clusters of Size 2

• With “Only” 2 of “Everything”

– An Outage (Planned or Unplanned) Reduces Capacity by 50%

– The Cluster Is No Longer Fault Tolerant

Page 54: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Causes of Downtime

Primary causes of downtime

– Hardware and environmental problems: 20%

– People and process problems: 80%

WAS addresses the 20%

Solve the other 80% first

– Dedicated, well-trained system administrators

– Strict change control

– Load testing of new applications

– Carefully-planned automated production deployment

Page 55: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Simulating Failure

• Essential for “Preparing to Fail”

• “Pull the Plug and Duck”

o Disconnect Network Cable

• Hang an Application Server (or Other Process)

o Many Monitoring and Diagnostic Tools Assist Here

• Inject a Local or Global Failure for the WAS Messaging Engine

o Local = “auto restart”

o Global = “use intervention”

http://www-01.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/tjt0037_.html

• Hang the OS

o Write a script to consume CPU

o Insidious, but Very Effective

o See Backup for Sample


Page 56: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Learn from Your Mistakes

• Mistakes and failures will occur, learn from them

• What separates mediocre organizations from the good and great isn't so much perfection as it is the constant striving to get better – to not repeat mistakes

• After every outage, perform root cause analysis

o Capture diagnostic information

o Meet as a team including all key players to discuss

o Determine precisely what went wrong

• Wrong doesn't mean “Bob made an error.”

• Find the process flaw that led to the problem

• Determine a corrective action that will prevent this from happening again

o If you can't, determine what diagnostic information is needed next time this happens and ensure it is collected

• Implement that corrective action

o All too often this last step isn't done

o Verify that the action corrected the problem

• A senior manager must own this process

Page 57: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Indispensable When Preparing to Fail

Page 58: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Questions?

Page 59: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Notices and Disclaimers

Copyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.

Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided.

Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.

Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.

References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business.

Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.

It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.

Page 60: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Notices and Disclaimers (con’t)

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.

• IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise Document Management System™, Global Business Services®, Global Technology Services®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z®, z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.

Page 61: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Thank You

Your Feedback is Important!

Access the InterConnect 2015 Conference CONNECT Attendee Portal to complete your session surveys from your smartphone, laptop or conference kiosk.

Page 62: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Backup Slides

Page 63: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Shameless Self Promotion

IBM WebSphere Deployment and Advanced Configuration
By Roland Barcia, Bill Hines, Tom Alcott and Keys Botzum
ISBN: 0131468626

Page 64: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Another Recommended Book

IBM WebSphere v5.0 System Administration
By Leigh Williamson, Lavena Chan, Roger Cundiff, Shawn Lauzon and Christopher C. Mitchell
ISBN: 0131446045

Page 65: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability


WebSphere Application Server PMI Monitoring Detail (1/2)

Set PMI to custom and enable just the following metrics:

Connection Pools (JDBC): AllocateCount, ReturnCount, CreateCount, CloseCount, FreePoolSize, PoolSize, JDBCTime, UseTime, WaitTime, WaitingThreadCount, PrepStmtCacheDiscardCount

Connection Pools (JMS/JCA, i.e. JMS Queue Connection Factory connection pools): Pool Size, Percent Maxed, Percent Used, Wait Time

JVM Runtime: HeapSize, UsedMemory, ProcessCpuUsage; Optional: % Free after GC, % Time spent in GC

HTTP (Servlet) Session Manager: ActiveCount, CreateCount, InvalidateCount, LiveCount, LifeTime, TimeSinceLastActivated, TimeoutInvalidationCount; Optional: SessionObjectSize **

ORB/EJB Thread Pool: ActiveCount, ActiveTime, CreateCount, DestroyCount, PoolSize, DeclaredThreadHungCount

ORB/EJB Requests: WaitTime **, MethodResponseTime **

** Usually in test only

Page 66: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability


WebSphere Application Server PMI Monitoring Detail (2/2)

Set PMI to custom and enable just the following metrics:

Web ODR Requests: Requests, ResponseTime, ConcurrentRequests, ErrorCount

Web ODR Thread Pool: ActiveCount, ActiveTime, CreateCount, DestroyCount, PoolSize, DeclaredThreadHungCount

Web ODR Proxy Module: ActiveOutboundConnectionCount, RequestCount, ResponseTime (TTLB)

Web ODR odrStatModule: TotalNumberOfRequests, CurrentOutstandingRequests, PercentOfErrors

Messaging Engine (SIB): BufferedReadBytesCount, BufferedWriteBytesCount, CacheStoredDiscardCount ***, CacheNotStoredDiscardCount ****; Optional: AvailableMessageCount, LocalMessageWaitTime

**** Not PMI; appears in the SystemOut log

Page 67: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Example Server/OS Hang Script

#!/usr/bin/ksh

clear
echo 'This script will burn CPU cycles and chew up all available memory'
echo
echo 'Start "nmon" in another window to watch when the memory is gone and it locks up. (the display will stop updating every 2 seconds when it is hung)'
echo
echo -e "Press 'Enter' to continue and crash the system... \c"
read ANS

echo -e ".\c"
( x=0; while true; do ((x=x+1)); done ) &
sleep 5

echo -e ".\c"
( while true; do true; done ) &
sleep 5

echo -e ".\c"
cat /dev/urandom >/dev/null &
sleep 5

echo -e ".\c"
tail /dev/zero &

echo -e "\n\nThe system should crash soon."

#-- Loop to display time
while true
do
  sleep 1
  TM=$(date '+%T')
  echo -e "$TM \b\b\b\b\b\b\b\b\b\c"
done

Page 68: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Licensing Servers as Backup Servers

From the IBM Contracts and Practices Database

• The policy is to charge for HOT, and not for WARM or COLD, backups. The following are definitions of what constitutes HOT, WARM and COLD backups:

• All programs running in backup mode must be under the customer's control, even if running at another enterprise's location.

• COLD: a copy of the program may be stored for backup purposes on a machine as long as the program has not been started.

o There is no charge for this copy.

• WARM: a copy of the program may reside for backup purposes on a machine and is started, but is "idling" and is not doing any work of any kind.

o There is no charge for this copy.

• HOT: a copy of the program may reside for backup purposes on a machine, is started, and is doing work. However, this program must be ordered.

o There is a charge for this copy.

• "Doing Work" includes, for example, production, development, program maintenance, and testing. It also could include other activities such as mirroring of transactions, updating of files, synchronization of programs, data or other resources (e.g. active linking with another machine, program, database or other resource, etc.), or any activity or configurability that would allow an active hot-switch or other synchronized switch-over between programs, databases, or other resources to occur.

Refer to http://www-03.ibm.com/software/sla/sladb.nsf/pdf/policies/$file/Feb-2003-IPLA-backup.pdf for more information

Page 69: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

V8.5.5 Intelligent Management for Webservers

Automatic routing

– Automatically discovers and recognizes all changes which affect routing: server/cluster create/start/stop/delete, application install/start/stop/uninstall, virtual host updates, session affinity configuration changes, dynamic server weight changes, etc.

– Lower administrative overhead: simply connect a cell and go. When new clusters are created in target cells, no change is made or needed to the plugin-cfg.xml.

Application edition routing

– Upgrade applications without interruption to end users

– Easy-to-use validation mode allowing new versions of an application to be validated before sending production traffic

– Concurrently run multiple editions of a single application, using routing policy to route users to the appropriate edition

Application edition caching

– The plugin's ESI (Edge Side Include) cache is edition-aware, which means that edition 1 and edition 2 content is stored separately in the cache

Health policy support for ODR-related health policies

– Recognize a sick server and automatically take corrective action

– “Excessive Response Time” and “Excessive Request Timeout” health policy support

Page 70: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

V8.5.5 Intelligent Management for Webservers

• Dynamic clusters

• JVM and VM elasticity

• APC (Application Placement Controller) dynamically starts/stops servers and calls IBM Workload Deployer (IWD) to provision/de-provision servers in order to meet current demand. IHS/Apache automatically routes appropriately.

• Node and server maintenance mode

• When a node or server is placed into maintenance mode, application optimization automatically routes appropriately

• Multi Cell Routing

• WLOR (Weighted Least Outstanding) load balancing

• Evens out response times due to dynamically changing weights

• Quick to send less traffic to slow or hung servers

Page 71: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Deployment Manager HA

“Warm-standby Model”

Page 72: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Deployment Manager HA

Take-over after primary failure…

Page 73: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Application Edition Management – Group Rollout

quiesce &stop

Edition 1.0

Edition 1.0

Edition 1.0

On-demandrouters

Dynamic cluster

Edition 2.0

restart

application

requests

Page 74: AAI-2013 Preparing to Fail: Practical WebSphere Application Server High Availability

Application Edition Management – Atomic Rollout

[Diagram: half of the dynamic cluster is quiesced and stopped, restarted with Edition 2.0, and then the remaining Edition 1.0 servers are quiesced and replaced; the on-demand routers queue requests arriving during the switch so Edition 1.0 and Edition 2.0 never serve traffic at the same time]