
Cornell Information Technology Root Cause Analysis

Exchange E-Mail Response Root Cause Analysis

Service Disruption: From: October 17, 2011   To: 12:30 AM, November 8, 2011

Executive Summary

Cornell's Exchange email system suffered three weeks of increasingly poor response. Microsoft flew a field engineer to Ithaca to assist in diagnosing the problem. Working with Cornell staff, he determined that the root cause of the problem was a feature of the network interfaces on the Exchange servers that disrupted communication within the cluster. This triggered a second bug in the Microsoft cluster software that caused lengthy delays in resuming cluster operation following each disruption. Once both of these were addressed, response time returned to normal levels. During the investigation, a number of other potential causes were identified and eliminated. In the end, these other factors were only minor contributors that helped push an unstable system over the edge.

Timeline

Beginning October 17, users of CIT's Exchange email system saw increasingly poor response. CIT staff identified and eliminated several apparent contributors to the problem, but ultimately came to an impasse. Paradoxically, while the problem initially appeared to be a resource load issue, adding resources to the cluster made it worse. In reviewing the timeline, it is now apparent that the growth of the cluster, as servers were moved from the Exchange 2007 cluster to the Exchange 2010 cluster, caused the network interface errors to reach a critical level.

In the first two weeks, a number of factors were identified that appeared to cause the problem. These included a set of bad antivirus signatures coinciding with a malware storm, power management settings that reduced CPU clock speeds on the servers, and an Exchange 2010 feature that caused many more mailboxes to be opened than before. Each seemed at the time to be an isolated problem, and rectifying it provided temporary relief.

The problem was escalated to Microsoft, which flew in a Field Engineer on the evening of Wednesday, November 2, to help diagnose the problem. The following sections detail the troubleshooting stages.

Network Load Suspected

Each Exchange database server (MBX) has two network interfaces that it uses to connect via the Tier 1 and 2 networks to the Client Access Servers (CAS) and to the other MBX servers in the cluster. It has a third interface that connects to the Tier 3 network for cluster heartbeats and a fourth that connects to the Backup network.


It was hypothesized that Exchange database replication traffic was overwhelming the client traffic on the Tier 1 and 2 interfaces, causing poor response time. The Microsoft Field Engineer said that this is seen in some large installations of Exchange; he defined large as more than 5,000 users, while we have 20,000. Exchange settings were changed to route replication traffic over the Tier 3 network, which resulted in some improvement in performance. This was later determined to be only a tertiary contributor to the overall problem.
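For reference, this kind of change is made through the Exchange 2010 DAG network settings. The sketch below is a minimal illustration only, using hypothetical DAG and network names; the report does not record the exact commands that were used.

```python
# Minimal sketch (not the commands CIT ran): dedicate one DAG network to
# replication so log shipping stops competing with client (MAPI) traffic.
# "DAG1", "MapiNetwork", and "ReplicationNetwork" are hypothetical names.
import subprocess

def ems(command: str) -> None:
    """Run an Exchange Management Shell command (snap-in assumed loaded)."""
    subprocess.run(["powershell.exe", "-Command", command], check=True)

# Disable replication on the client-facing DAG network...
ems('Set-DatabaseAvailabilityGroupNetwork -Identity "DAG1\\MapiNetwork" '
    '-ReplicationEnabled:$false')
# ...and allow it only on the dedicated (Tier 3) network.
ems('Set-DatabaseAvailabilityGroupNetwork -Identity "DAG1\\ReplicationNetwork" '
    '-ReplicationEnabled:$true')
# Review the result.
ems('Get-DatabaseAvailabilityGroupNetwork "DAG1" | '
    'Format-List Name,ReplicationEnabled,Subnets')
```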

Clustering Issue

The servers continued to report communication errors to their peer servers in the cluster. The physical network was examined for errors or capacity issues, but none were found. The Field Engineer escalated the case to Microsoft internal resources and eventually engaged a Network Engineer in Austin. This engineer identified three unreleased hot fixes for the cluster service that appeared relevant to the issue. These fixes addressed issues where cluster timeouts were too short to allow a cluster of our size to restabilize following a transient error, and problems in electing a new cluster manager. Those fixes were applied to the cluster on Thursday night and made a large improvement in the stability of the system.

The issue was related to the number of machines in the cluster. No problems had been observed during the first phase of the 2007 to 2010 migration, when there were only two pairs of mailbox servers in production. Some problems were observed when the third pair was added during the third week of September, and more frequent problems were observed when the fourth pair was added in mid-October. Our analysis now indicates that this was the secondary root cause of the problems. However, the improvement was sufficient that the problem was believed to be resolved, and the Microsoft field engineer departed at noon on Friday.
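The content of the unreleased hot fixes is not public, but they relate to the same class of tuning exposed by the failover cluster heartbeat properties. The sketch below is illustrative only; the values are assumptions, not the settings delivered in the hot fixes.

```python
# Illustrative sketch only: inspect and relax Windows failover cluster
# heartbeat settings so a transient interruption is less likely to be
# treated as a node failure. Values are examples, not Microsoft's fixes.
import subprocess

def ps(command: str) -> None:
    """Run a PowerShell command (FailoverClusters module assumed installed)."""
    subprocess.run(["powershell.exe", "-Command", command], check=True)

# Current heartbeat interval (ms) and missed-heartbeat threshold.
ps("Import-Module FailoverClusters; "
   "Get-Cluster | Format-List SameSubnetDelay,SameSubnetThreshold,"
   "CrossSubnetDelay,CrossSubnetThreshold")

# Example: give a large cluster more time to restabilize after a glitch.
ps("Import-Module FailoverClusters; "
   "$c = Get-Cluster; $c.SameSubnetDelay = 2000; $c.SameSubnetThreshold = 10")
```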

Problem with New CAS Servers

Reports of connection problems on Friday morning appeared to be geographically clustered. Around noon, a connectivity problem between the load balancer and the new CAS servers was suspected, and the new servers were removed from rotation. All units that had reported problems confirmed that this resolved their connectivity issues. Analysis now indicates that it was again the combination of the network interface errors and the size of the cluster causing the problem.

Network Interface Issues

Communication errors remained present in the cluster, even though the problem of improper responses to those errors by the cluster software had been addressed. While response time seemed improved on Friday, by Monday it was apparent that it was still seriously degraded. CIT staff re-engaged with the network engineer in Austin, who worked through the logs. He first identified a problem with the standby network adaptor invoking power-saving mode. This appeared to take the primary adaptor offline momentarily, because the system software considered the pair of adaptors a team. Turning off power management again reduced the magnitude of the problem.


Finally, using network traces, the engineer identified that heartbeat packets were being corrupted in transit between the systems, and identified a TCP offloading feature of the Broadcom network adaptors as a probable cause. This feature, called NetDMA, was turned off between midnight and 1 AM on Tuesday morning. Since that change, the cluster has been highly responsive, and CPU utilization on all servers has dropped to normal levels.
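The report does not state the exact mechanism used to turn these features off. On Windows Server 2008 R2 they are normally controlled through the global TCP settings, so the following is a hedged sketch covering NetDMA as well as the TCP Chimney and RSS offload settings changed earlier under change request artf36042 (listed in the table below).

```python
# Hedged sketch: disable TCP offload features on a Windows Server 2008 R2
# mailbox server using the standard global TCP settings. The report does
# not record the exact commands CIT ran.
import subprocess

def set_tcp_global(setting: str) -> None:
    subprocess.run(["netsh", "int", "tcp", "set", "global", setting], check=True)

set_tcp_global("netdma=disabled")   # NetDMA, the feature named as the trigger
set_tcp_global("chimney=disabled")  # TCP Chimney offload (artf36042, 10/29)
set_tcp_global("rss=disabled")      # Receive-side scaling (artf36042, 10/29)

# Confirm the resulting state; per the change request, the mailbox servers
# were still rebooted one at a time for the NetDMA change to take effect.
subprocess.run(["netsh", "int", "tcp", "show", "global"], check=True)
```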

Postscript: Network Outage and Load Balancer Problems

After the cluster was healthy, a major network outage struck CCC. This affected both load balancers, including the one in Rhodes Hall. People were unable to reach the Exchange system, and after the network was restored, the load balancer did not re-establish contact with Exchange. Manual intervention was required. This is mentioned in the timeline because of its proximity to the other problems; it is entirely unrelated.

Specific Services and Configuration Items Impacted

Exchange

Actions Taken to Resolve Incident

Event: Migrations, 2007 to 2010
Date/Time: 9/5/11 12:00 AM
Action: CCAB CHANGE REQUEST: artf35047: Exchange Mailbox Moves
CHANGE DESCRIPTION: Migrate Exchange mailboxes from Exchange 2007 to 2010.
TEST PROCEDURE: Process is already underway with select groups of users.
NOTE: Migrations completed October 2.

Event: MBXA Added to Cluster
Date/Time: 9/20/11
Action: Work was fast-tracked because of issues with the dual 2007/2010 Exchange environment, as decided in a meeting with Ted and Dave earlier in the month. The CCAB covering the event was inadvertently not filed.

Event: MBXB Added to Cluster
Date/Time: 10/14/11
Action: Work was fast-tracked because of issues with the dual 2007/2010 Exchange environment, as decided in a meeting with Ted and Dave earlier in the month. The CCAB covering the event was inadvertently not filed.

Event: Communication
Date/Time: 10/17/11 10:01 AM
Action: CIT posts an alert for Exchange performance issues (http://www.cit.cornell.edu/services/alert.cfm?id=1500; see Appendix III).

Event: Communication
Date/Time: 10/17/11 11:46 AM
Action: CIT posts an alert for the usps.com malware attack (http://www.cit.cornell.edu/services/alert.cfm?id=1503; see Appendix III).

Event: Communication
Date/Time: 10/17/11 9:23 PM
Action: CIT posts an update to the Exchange performance issue (http://www.cit.cornell.edu/services/alert.cfm?id=1500; see Appendix III).

Event: Communication/Description of Service Disruption
Date/Time: 10/18/11 1:56 PM
Action: E-Mail sent to net-admin-l:

Please pass this message along to individuals you support who may have been affected by the issues described below. (This message has been sent to Net-Admin-L and ITMC-L.)

On Monday, October 17, from approximately 8:15 am to 4:15 pm, the Exchange email and calendar system experienced performance problems related to load, and some individuals reported unstable connections, slow response, reduced functionality, and error messages.

It appears that Macintosh clients and BlackBerry devices were most seriously impacted. A few Outlook Web App connections may also have been affected, and response times for Windows Outlook were slow at times.

The apparent cause was a significantly higher than normal load triggered by the receipt of tens of thousands of virus-laden messages. Cornell's perimeter anti-virus/anti-spam defenses kept most of the virus-laden messages from reaching Exchange, but the ones that got through triggered Exchange 2010's own anti-virus defense, which affected overall performance. We are also investigating whether a virus engine update played a role.

The virus-laden messages were from a forged usps.com address, so one defensive step was to temporarily block all usps.com mail until legitimate mail could be distinguished from forged mail. This block resulted in approximately 6 legitimate messages being returned to the sender with a "blacklisted" alert.

At this time, Exchange load is back to typical levels, so we believe individuals should no longer be seeing performance issues. We are investigating why this event had the effects that it did, and what, if anything, could be adjusted in Exchange 2010.

Event: Communication
Date/Time: 10/20/11 10:30 AM
Action: CIT posts an alert for a 5-minute spike on Exchange servers. The issue is closed at 4:13 PM. (http://www.cit.cornell.edu/services/alert.cfm?id=1511; see Appendix III.)

Event: Communication
Date/Time: 10/20/11 3:54 PM
Action: CIT posts an alert for Exchange authentication errors which had been reported during the previous half hour. (http://www.cit.cornell.edu/services/alert.cfm?id=1512; see Appendix III.)

Event: Power Management Settings Turned Off
Date/Time: 10/24/11 11:00 AM
Action: CCAB CHANGE REQUEST: artf35973: Change Power Management Settings on Exchange
CHANGE DESCRIPTION: The Exchange servers are currently subject to power management, which is causing the CPU clock frequencies to be lowered. This is not a recommended setting for Exchange and is a contributing factor to the Exchange performance issues we have been seeing. This change will set the processors to the maximum power settings.
TEST PROCEDURE: Standard operating system setting.
BACKOUT STRATEGY: Revert to current settings.
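For reference, the throttling described in this change request typically corresponds to the Balanced power plan that Windows Server uses by default; the usual remedy is to switch to the built-in High Performance plan. A minimal sketch follows; the GUID shown is the standard built-in identifier, but it should be verified with `powercfg -list` on the host.

```python
# Minimal sketch: switch a Windows server to the built-in High Performance
# power plan so CPU clock frequencies are no longer throttled. Verify the
# GUID with "powercfg -list" first; the value below is the standard one.
import subprocess

HIGH_PERFORMANCE = "8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c"

subprocess.run(["powercfg", "-list"], check=True)                 # available plans
subprocess.run(["powercfg", "-setactive", HIGH_PERFORMANCE], check=True)
subprocess.run(["powercfg", "-getactivescheme"], check=True)      # confirm change
```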

Event: Communication
Date/Time: 10/28/11 1:34 PM
Action: CIT posts an alert for Exchange performance, which was affected from 10:00 AM to 12:30 PM due to an issue with the server farm switch that morning. (http://www.cit.cornell.edu/services/alert.cfm?id=1526; see Appendix III.)

Event: Change Request
Date/Time: 10/29/11 5:00 AM
Action: CCAB CHANGE REQUEST: artf36042: Modify Network Settings on Exchange 2010
CHANGE DESCRIPTION: Turn off chimney and RSS offloading on the network adapters on the Exchange 2010 mailbox servers. This change is recommended by Microsoft to resolve some database replication problems we have been experiencing in the Exchange 2010 environment. The change will occur between 5:00 AM and 7:00 AM on Saturday for the CCC data center servers, followed by 5:00 AM and 7:00 AM on Sunday for the Rhodes data center. The actual time to complete this task should be only around 15 minutes. Users will not see any downtime, as we will always have an available copy of the Exchange databases.
TEST PROCEDURE: These settings have been recommended by Microsoft.
BACKOUT STRATEGY: Revert to the current settings.

Event: Communication
Date/Time: 10/31/11 8:46 AM
Action: CIT posts an alert regarding reports that users cannot access Outlook Web Access (OWA). (http://www.cit.cornell.edu/services/alert.cfm?id=1529; see Appendix III.)

Event: Communication
Date/Time: 10/31/11 8:54 AM
Action: CIT posts an alert regarding Exchange email and calendar (http://www.cit.cornell.edu/services/alert.cfm?id=1530; see Appendix III). Updates are provided throughout the day; the cause was believed to be primarily a feature of Exchange Server 2010 SP2 that became apparent when a script for Exchange Group Accounts was run on the weekend of October 29.

Event: Communication/Description of Service Disruption
Date/Time: 10/31/11 7:25 PM
Action: E-Mail sent to net-admin-l:

The Exchange performance problems the morning of October 31 have been traced to Outlook 2007/2010 on Windows attempting to connect to all mailboxes to which each user had access. Fixing this problem may have disconnected previously connected shared mailboxes.

Affected individuals may need to re-add the Exchange Group Accounts and other mailboxes they want to see in Outlook.

To make Exchange Group Accounts visible again: http://www.cit.cornell.edu/services/ega/howto/config.cfm

To make other shared mailboxes visible again: http://www.cit.cornell.edu/services/outlook/howto/email/email-view-shared.cfm

CIT is adjusting the scripts for Exchange Group Accounts and filing an issue report with Microsoft.

--------
DETAILS

The performance problems are believed to have been primarily caused by a feature introduced by Microsoft in Exchange Server 2010 SP2. The feature's effects were not apparent until scripts for Exchange Group Accounts (which had been in place for two years) were run the weekend of October 29.

The feature causes Outlook to automount all mailboxes to which an individual has full access. This behavior creates a huge load on both CornellAD and Exchange. It was primarily responsible for the Exchange performance problems the morning of October 31, as thousands of additional mailbox connections, in aggregate, were made. For individuals with full access to many Exchange Group Accounts or other mailboxes, the start-up time for Outlook may have taken several minutes.

These automounted mailboxes supplanted the accounts that individuals had previously added to Outlook, so when the automounting was stopped, those mailboxes disappeared from view in Outlook. Re-adding the affected mailboxes resolves the issue for individuals.

We apologize for the inconvenience this issue has caused, and appreciate your patience and assistance in helping individuals restore their Outlook views.
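The email above notes that the Exchange Group Account scripts were adjusted but does not show the change. In Exchange 2010 SP1 and later, the automapping behavior is controlled per permission grant, so a script adjustment of this kind typically grants Full Access with automapping disabled. A minimal sketch, with hypothetical mailbox and user names:

```python
# Minimal sketch (hypothetical names, not CIT's actual script): grant Full
# Access to a shared mailbox or EGA without Outlook automapping it.
import subprocess

grant = ('Add-MailboxPermission -Identity "ega-example" '
         '-User "netid123" -AccessRights FullAccess -AutoMapping:$false')

# Assumes an Exchange Management Shell profile or remote PowerShell session.
subprocess.run(["powershell.exe", "-Command", grant], check=True)
```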

Event: Communication
Date/Time: 11/1/11 12:02 PM
Action: CIT posts an alert regarding performance issues with Exchange email and calendar (http://www.cit.cornell.edu/services/alert.cfm?id=1533; see Appendix III). This alert remains open until 9:53 AM on November 10; during this period, approximately 28 updates are provided.

Event: Microsoft Engineer on Site
Date/Time: 11/2/11 9:00 PM
Action: See the narrative description above for the detailed timeline of events. See also Appendix V.

Event: Change Request
Date/Time: 11/2/11 2:00 PM
Action: CCAB CHANGE REQUEST: artf36126 (Emergency: Urgent Service Request): Add Exchange Client Access Server to Load Balancer
CHANGE DESCRIPTION: Add 4 additional client access servers to the load balancer configuration so they may later be enabled. This will provide additional client access capacity to our Exchange environment and should reduce the slowdowns and dropped connections users are currently experiencing.
TEST PROCEDURE: This is the same procedure used for the current client access servers that have been in production for several months.
BACKOUT STRATEGY: Revert to previous configuration.

Event: PATCH APPLIED TO SYSTEM
Date/Time: 11/3/11
Action: Thursday night, November 3, a patch was applied to the systems.
Communication: Wednesday evening, November 2, Microsoft flew in a field engineer. With his help, we first identified a network bottleneck, which reduced but did not eliminate the problem. Digging deeper, a bug was identified in Microsoft's clustering software that caused the cluster to believe that it was in failure mode and caused the active mailboxes to flip repeatedly between the redundant Exchange systems in Rhodes and CCC. Since this behavior was related to the number of machines in the cluster, we inadvertently worsened the problem by adding capacity.

Event: CLIENT ACCESS SERVERS REMOVED FROM SYSTEM
Date/Time: 11/4/11
Action: Friday morning, November 4, pockets of connectivity problems led to the discovery that a few of the ten Client Access Servers were not responding to connections; they were removed from the pool. At this time we believe that we have resolved the problems.

Event: Change Request
Date/Time: 11/8/11 12:00 AM
Action: CCAB CHANGE REQUEST: artf36210 (Emergency: Urgent Service Request): Disable NetDMA on Exchange Mailbox Servers
CHANGE DESCRIPTION: Microsoft recommends that we disable NetDMA (a feature of the network adapters) on the Exchange mailbox servers. NetDMA can cause timing problems with cluster communications and is a contributing factor to the issues we have been encountering with Exchange. The change requires a reboot of the mailbox servers. This process will be done one server at a time, so users should not see any additional downtime as a result of this update. These recommendations come out of a Severity 1 case we have open with Microsoft regarding the Exchange performance issues.

Metrics (See Appendix I)

Detection Time (Detection – Incident Occurrence): Indefinite (difficult to determine what to designate as the beginning time, i.e., the Incident Occurrence)
Response Time (Diagnosis – Detection): 21 days
Repair Time (Recovery – Diagnosis): 30 minutes
Recovery Time (Restore – Recovery): 0
Time to Repair (Recovery – Incident Occurrence): ~21 days
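For clarity, the interval arithmetic behind these figures can be reproduced from dates given in this report; the sketch below takes the first symptoms on the morning of October 17 as both occurrence and detection (detection time was indefinite) and the NetDMA change early on November 8 as diagnosis and recovery.

```python
# Reproduce the headline metrics from dates given in this report.
from datetime import datetime

incident_occurrence = datetime(2011, 10, 17, 8, 15)  # first symptoms (artf35912)
detection = incident_occurrence                      # detection time reported as indefinite
diagnosis = datetime(2011, 11, 8, 0, 0)              # NetDMA identified; change started
recovery  = datetime(2011, 11, 8, 0, 30)             # offload disabled, cluster stable

print("Response Time: ", diagnosis - detection)            # ~21 days
print("Repair Time:   ", recovery - diagnosis)             # 30 minutes
print("Time to Repair:", recovery - incident_occurrence)   # ~21 days ("downtime")
```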

Root Cause of Incident – The Reason(s) for the Service Disruption

The primary cause of this incident was:

• A network interface feature that offloads network processing from the CPU caused corruption in heartbeat packets. This caused the cluster to believe communication had been lost and to commence failover negotiations. This is the default setting for Windows Server 2008 R2.

The problem was made worse by:

• An unpublished bug in Microsoft clustering software that reacted inappropriately to the missed heartbeat packets, flipping the active cluster node back and forth between Rhodes and CCC.

Two additional factors contributed to triggering the problem, but were not by themselves a problem:

• Power management at the network interface level turned off power to the backup network path, causing interrupted communications. This is the default setting for Windows Server 2008 R2.

• Replication network traffic combined with client traffic to increase load on the network interfaces of the Exchange database servers.

Issues

There was no guidance from Microsoft on avoiding the NetDMA feature, despite internal knowledge that there had been problems with it in cluster environments. There was also no information available to customers or first-level Microsoft engineers on the cluster problems resolved by their patches. Both of these gaps have been addressed by Microsoft since our incident; their knowledge base article is attached as Appendix VI.

Recommendations

Recommendation: The following work is identified as important for hardening the Exchange system against failures and improving performance under load:
• CIT is re-certifying the four client access servers that were briefly placed in service, and will add their capacity to the service within the next few weeks.
• The work to have the cornell.edu DNS name resolve to the domain controllers is expected to remove some timeout issues with Exchange management commands.
• CIT is planning to re-enable the Forefront antivirus protection on the Exchange servers. This was disabled as a troubleshooting step but was found not to be a contributing factor.
• CIT is planning to re-enable the host-based firewalls that were turned off as a troubleshooting step.
• CIT is adding additional network interfaces to the Exchange servers to provide separate paths for replication and user network traffic.
Action Item(s) Created to Address: 1, 2, 3, 4, 5

Recommendation: Provide for faster resolution of critical problems by upgrading our Microsoft support contract to provide immediate access to Level 3 engineers.
Action Item(s) Created to Address: 6

Recommendation: Communicate Root Cause findings to campus and give an opportunity for input; investigate general options for "open" forums or reviews of major service disruptions with strategic customers and users.
Action Item(s) Created to Address: 9, 10

Recommendation: Perform a yearly risk assessment and health check of our Active Directory and Exchange systems by an outside vendor.
Action Item(s) Created to Address: 7

Recommendation: Provide a checklist for Service Owners of tasks to complete with respect to communication and notification during Service Disruptions.
Action Item(s) Created to Address: 8

Recommendation: Consider additional options for when the "big red button" should be pushed for similar incidents/problems in the future.
Action Item(s) Created to Address: 8

Action Plan

1. Re-certify the four client access servers that were briefly placed in service, and add their capacity to the service within the next few weeks.
   Responsible Party: Infrastructure Division – Messaging
   Completion Date: 11/27/2011

2. Modify DNS so that the cornell.edu DNS name resolves to the domain controllers; this is expected to remove some timeout issues with Exchange management commands.
   Responsible Party: Infrastructure Division – Identity Management
   Completion Date: 11/20/2011

3. Re-enable the Forefront antivirus protection on the Exchange servers. This was disabled as a troubleshooting step but was found not to be a contributing factor.
   Responsible Party: Infrastructure Division – Messaging
   Completion Date: 12/4/2011

4. Re-enable the host-based firewalls that were turned off as a troubleshooting step.
   Responsible Party: Infrastructure Division – Messaging/Systems Admins
   Completion Date: 11/27/2011

5. Add additional network interfaces to the Exchange servers to provide separate paths for replication and user network traffic.
   Responsible Party: Infrastructure Division – Messaging/Systems
   Completion Date: 1/22/2012

6. Upgrade our Microsoft support contract to provide immediate access to Level 3 engineers.
   Responsible Party: Infrastructure Division – Messaging
   Completion Date: 1/31/2012

7. Perform a yearly risk assessment and health check of our Active Directory and Exchange systems by an outside vendor.
   Responsible Party: Infrastructure Division – Messaging
   Completion Date: 3/1/2012

8. Update CIT Process and Procedure 2007-002 regarding "Sev 1" incidents. Provide awareness to Service Owners and others.
   Responsible Party: CIT Process Improvement – Jim Haustein
   Completion Date: 1/31/2012

9. Schedule an Exchange SIG with a review of the incident as one of the topics.
   Responsible Party: Infrastructure Division – Messaging
   Completion Date: 12/31/2011

10. Investigate general options for "open" forums or reviews of major service disruptions with strategic customers and users.
    Responsible Party: CIT Process Improvement – Jim Haustein
    Completion Date: 2/28/2012

     


Approvals:

___/s/ James R. Haustein___________ (Submitter)    12/6/2011
Jim Haustein

___e-mail to Jim Haustein__________ (Director, Infrastructure)    12/7/2011
Dave Vernon


APPENDIX I: DESCRIPTION OF METRICS

• Incident Occurrence – when an incident occurs.
• Detection – when IT is made aware of the issue.
• Diagnosis – when the diagnosis to determine the underlying cause of the incident has been completed.
• Repair – when the incident has been repaired.
• Recovery – when component recovery has been completed.
• Restoration – when normal business operations resume.

   

         

[Figure: metric intervals across the incident lifecycle (Incident Occurrence, Detection, Diagnosis, Repair, Recovery, Restore), showing Detection Time, Response Time, Repair Time, Recovery Time, Time to Repair ("downtime"), Time Between Failures ("uptime"), and Time Between Incidents.]


APPENDIX II: TIMELINE OF EVENTS

   


APPENDIX III: COMMUNICATIONS

Performance:  Exchange  Performance  Issues    

Date:  Oct  17,  2011,  10:01  AM    Duration:  Unknown  

Status:  Closed  

Description:  

We are currently investigating reported connection issues with Exchange. The symptoms include refused connections when sending and/or receiving mail. The connections recover after a minute but appear to recur occasionally. This is affecting client access only; we are not seeing delivery issues at this time. After the connection recovers, mail is sent from and to the client.

Timeline:  

10/19/2011  11:32  AM:  The  Exchange  servers  have  been  running  normally  since  4:15pm  on  Monday.    

10/17/2011  09:23  PM:  The  CIT  Exchange  Admins  report  that  system  performance  is  much  improved  this  evening.  They  are  still  investigating  this  problem  and  continue  to  monitor  the  issue.  

Affected  Services:  Exchange      

Performance:  Messaging  Malware  Attack  Block:  usps.com    

Date:  Oct  17,  2011,  11:46  AM    Duration:  Unknown  

Status:  Closed  

Description:  

This  morning  CIT  Messaging  staff  blocked  a  large-­‐‑scale  mail  attack  purporting  to  be  from  addresses  at  usps.com,  carrying  malware  that  could  infect  client  machines.  Since  it  is  impossible  to  distinguish  these  forged  addresses  from  legitimate  usps.com  addresses,  no  mail  from  usps.com  is  currently  getting  through.  This  action  was  necessary  to  protect  the  Cornell  mail  system  and  other  IT  systems  from  the  attack.    

Timeline:  

10/17/2011 07:40 PM: The complete block of any email with a @usps.com address has been lifted. We have isolated the appropriate information and we are blocking solely on that. Initially, due to the volume and variants of the infected email, it seemed prudent to block all @usps.com traffic, even though almost all of it was already being blocked by our normal systems. We apologize for any inconvenience this may have caused.

10/17/2011  04:58  PM:  The  CIT  Exchange  Admins  are  still  investigating  this  problem  and  continue  to  monitor  the  issue.  

10/17/2011  11:49  AM:  We  will  restore  incoming  mail  from  legitimate  usps.com  addresses  as  soon  as  we  have  a  way  to  do  so.    

 

Performance:  Problems  With  Exchange  This  Morning    

Date:  Oct  20,  2011,  10:30  AM    Duration:  Unknown  

Status:  Closed  

Description:  

There was a five-minute load spike on some of the Exchange servers this morning, causing momentary slowness and denied connections. It appears from reports that some email programs did not recover gracefully from that incident. We recommend quitting and restarting your email program if you are experiencing problems.

Timeline:  

10/20/2011  04:13  PM:  This  problem  has  been  resolved.  

10/20/2011 12:48 PM: Exchange experienced a momentary period of slowness and denied connections this morning.

Affected  Services:  Exchange    

   

Performance:  Exchange  Authentication  Errors    

Date:  Oct  20,  2011,  03:25  PM    Duration:  Unknown  

Status:  Closed  


Description:  

An issue occurred where some users were unable to authenticate to Exchange. This problem occurred between 3:25pm and 3:30pm. One of our client access servers was unable to authenticate users against Active Directory. We removed the server from service while we investigate and correct the problem. All connections should now have re-established on the remaining client access servers. In some cases users may have to restart their clients.

Timeline:  

10/21/2011  03:54  PM:  This  issue  has  now  been  resolved.  

10/20/2011 04:16 PM: An issue occurred where some users were unable to authenticate to Exchange.

Affected  Services:  Exchange    

   

Performance:  Exchange  Service    

Date:  Oct  28,  2011,  01:30  PM    Duration:  Until  10/28/2011  at  2:00  PM    

Status:  Closed  

Description:  

Due to the network issue this morning, the Exchange system's performance was affected from 10am to approximately 12:30pm today (10/28). To improve performance we had split the databases up such that half were primary in Rhodes and half in CCC. The network issue caused databases to fail over, and all the databases ended up on one side instead of being split. Once usage rose high enough, performance suffered.

Timeline:  

10/28/2011 01:34 PM: The databases have been split out again and all appears to be well. The Exchange 2007 servers are being rebuilt as Exchange 2010 servers, which will increase our overall capacity to better handle these sorts of situations.

Affected  Services:  Exchange    

   

Unplanned  Outage:  Outlook  Web  Access  Service  (OWA)    

Date:  Oct  31,  2011,  08:44  AM    Duration:  Unknown  


Status:  Closed  

Description:  

CIT  has  received  reports  that  users  are  unable  to  access  the  Outlook  Web  Access  (OWA)  service.    

Timeline:  

10/31/2011  08:46  AM:  We  are  currently  investigating  this  problem  and  will  notify  you  with  updates  on  this  situation.  

Affected  Services:  Outlook  Web  Access    

   

Unplanned  Outage:  Exchange  Email  and  Calendar    

Date:  Oct  31,  2011,  08:52  AM    Duration:  Unknown  

Status:  Closed  

Description:  

The  Exchange  performance  problems  the  morning  of  October  31  have  been  traced  to  Outlook  2007/2010  on  Windows  attempting  to  connect  to  all  mailboxes  to  which  each  user  had  access.  Individuals  may  need  to  re-­‐‑add  the  Exchange  Group  Accounts  they  want  to  see  in  Outlook  (see  http://www.cit.cornell.edu/services/ega/howto/config.cfm).    

Timeline:  

10/31/2011 05:50 PM: The performance problems are believed to be primarily caused by a feature introduced by Microsoft in Exchange Server 2010 SP2. The feature's effects were not apparent until a script for Exchange Group Accounts was run the weekend of October 29. The feature causes Outlook to automount all mailboxes to which an individual has full access. The result was that start-up time for Outlook may have taken several minutes for some individuals. When the automounting was stopped, the accounts appeared to disappear from Outlook. Re-adding the affected accounts resolves the issue for individuals. CIT is adjusting the scripts for Exchange Group Accounts and filing an issue report with Microsoft.

10/31/2011 10:10 AM: The load spike has abated this morning. Exchange staff are continuing to work on monitoring the system and addressing the root cause.


10/31/2011  09:10  AM:  Some  Exchange  users  are  experiencing  slow  Exchange  response  or  difficulty  connecting  to  Exchange.  One  of  the  mailbox  servers  is  experiencing  a  heavy  load  spike  at  this  time.  Exchange  admins  are  working  on  determining  the  source  of  the  load  and  taking  measures  to  address  it.  

10/31/2011  08:54  AM:  We  are  currently  investigating  this  problem  and  will  notify  you  with  updates  on  this  situation.  

Affected  Services:  Exchange    

   

Performance:  Exchange  Email  and  Calendar    

Date:  Nov  1,  2011,  12:00  PM    Duration:  Unknown  

Status:  Closed  

Description:  

For the past several days, Cornell's Exchange email and calendar services have had performance issues. Re-establishing stable service levels is CIT's highest priority. Please bear with us as we continue working on the problem.

Timeline:  

11/10/2011 09:53 AM: The immediate issues with Exchange have been resolved. Over the next several weeks, additional changes will be made to increase Exchange's ability to handle normal growth in load over time and load associated with traffic spikes. A notice to all Exchange users will be sent later today. Please report any issues with Exchange email or calendar to the CIT HelpDesk (255-8990), noting your email client and OS, and the location from which you observe the problem.

11/08/2011  04:57  PM:  Our  assessment  of  today'ʹs  experience  with  the  campus  Exchange  service  is  that  the  fixes  applied  yesterday  and  early  this  morning  have  addressed  performance  issues  seen  over  the  past  several  days.  We  have  been  working  on  what  appear  to  be  pockets  of  client  issues  remaining  for  a  limited  number  of  users.  We  will  keep  this  alert  open,  however,  until  more  time  has  elapsed  and  we  can  be  certain  there  are  no  more  infrastructure  issues  remaining.  If  you  have  an  open  ticket  with  the  CIT  Help  Desk,  please  update  us  with  your  current  status.  If  you  see  any  renewed  or  continuing  problems,  please  report  those  to  the  Help  Desk  with  details  including  your  client  and  OS,  and  the  location  from  which  you  observe  the  problem.    Unfortunately,  there  was  a  network  outage  in  the  CIT  data  center  this  afternoon  that  impacted  Exchange  access  from  about  1:00  to  2:00  PM.  During  the  outage  connections  were  refused.  Some  clients  required  a  restart  before  they  were  able  to  connect  once  the  network  was  restored,  so  some  users  may  have  seen  problems  after  2:00  PM.    


11/08/2011  10:15  AM:  After  making  the  recommended  changes  to  the  Exchange  network  configuration,  which  was  complete  by  1am  today,  the  Exchange  team  has  seen  no  recurrence  of  the  server  errors  that  indicate  this  problem.  Spot  checks  with  the  community  have  indicated,  in  general,  much  improved  performance  this  morning.  If  you  have  an  open  ticket  with  the  CIT  Help  Desk,  please  update  us  with  your  current  status.  If  you  see  any  renewed  or  continuing  problems,  please  report  those  to  the  Help  Desk  with  details  including  your  client  and  OS,  and  the  location  from  which  you  observe  the  problem.  

11/07/2011 10:25 PM: We have received some isolated reports of continued problems following the configuration change this afternoon around 4:00 PM, although we've seen a reduction in server-side errors. Microsoft has recommended an additional change to the server configuration, which we are implementing between 12:00 midnight and 12:15 AM on Tuesday. The change requires rebooting the servers, but we do not anticipate a service disruption. If you experienced the problems described in an earlier update today (see list below) and continue to see them Tuesday morning, please report them to us. Known symptoms are: sporadic slow or failed logins, failure to send messages, and slow operations (spinning hourglass or beach ball, depending on the client system).

11/07/2011 06:33 PM: Microsoft has recommended that NetDMA be disabled in the Exchange cluster because it is a contributing factor to Cornell's Exchange issues. From 12 midnight to 12:15 am on Tuesday, November 8, CIT will restart the Exchange mailbox servers to disable NetDMA. This work will be done one server at a time. No outage is expected.

11/07/2011 04:34 PM: CIT has made some changes to network settings on the Exchange cluster at Microsoft's recommendation. We are monitoring the performance to determine the effects of this change.

11/07/2011  03:49  PM:  CIT  and  Microsoft  experts  are  still  diagnosing  the  cause  of  cluster  communication  failures.  They  are  currently  analyzing  network  traces  for  further  information  on  anomalies  identified  in  the  review  of  Exchange  data.  

11/07/2011 01:39 PM: CIT staff continue to gather log data for Microsoft engineers to identify the source of the problem, which appears to remain in the cluster communications layer. Resolving the issues with Exchange remains the highest priority for both CIT and Microsoft. The main symptoms are sporadic slow or failed logins, failure to send messages, and slow operations (spinning hourglass or beach ball, depending on the client system). These have appeared a number of times throughout the morning, with a larger interruption from noon to 1pm for users hosted on one of the four mailbox servers. The server became non-responsive and required a reboot. At this point, we have collected the data we need on client problems. If we need additional data to be reported, a request will be posted here.

11/07/2011 01:12 PM: That database server is now online again. The outage began at about 12:30, so it lasted about half an hour.


11/07/2011 12:51 PM: One of the Exchange database servers (out of four) went offline and unmounted its mailbox databases. Exchange staff are working to get the databases back online. This problem does appear to be related to the ongoing issue. Expected time to restore the service is 30 minutes.

11/07/2011 09:30 AM: While the patches that were applied to the Exchange cluster on Friday greatly reduced the rate of errors, it's now apparent that some level of errors still persists. The Exchange team remains engaged with Microsoft to locate the source of these problems. Symptoms include timeouts in connection, refused connections, and errors in using OWA. If you receive these errors, please wait for a short time and retry the operation. The patch applied on Friday makes recovery from such problems much more rapid than before.

11/04/2011  04:45  PM:  If  people  are  still  seeing  problems  with  their  email  or  calendar,  as  a  first  step,  they  should  quit  and  restart  their  email  client,  and  give  it  some  time  to  catch  up.  In  a  few  cases,  it  may  be  necessary  to  reboot  their  system.  If  problems  persist,  they  should  contact  the  CIT  HelpDesk  with  these  details:  problem  description,  date  and  times  the  problem  has  occurred,  and  the  operating  system  and  email  client  being  used.  Having  issues  reported  is  critical.    TIME  LINE  OF  ACTIONS  TAKEN    Early  on,  CIT  staff  identified  and  eliminated  several  apparent  contributions  to  the  problem,  but  ultimately  came  to  an  impasse.  Paradoxically,  adding  additional  resources  to  the  cluster  made  the  problem  worse.    Wednesday  evening,  November  2,  Microsoft  flew  in  a  field  engineer.  With  his  help,  we  first  identified  a  network  bottleneck,  which  reduced  but  did  not  eliminate  the  problem.  Digging  deeper,  a  bug  was  identified  in  Microsoft'ʹs  clustering  software  that  caused  the  cluster  to  believe  that  it  was  in  failure  mode,  and  caused  the  active  mailboxes  to  flip  repeatedly  between  the  redundant  Exchange  systems  in  Rhodes  and  CCC.  Since  this  behavior  was  related  to  the  number  of  machines  in  the  cluster,  we  inadvertently  worsened  the  problem  by  adding  capacity.      Thursday  night,  November  3,  a  patch  was  applied  to  the  systems,  and  all  the  server  side  problems  were  eliminated.      Friday  morning,  November  4,  pockets  of  connectivity  problems  led  to  discovering  that  a  few  of  the  ten  Client  Access  Servers  were  not  responding  to  connections;  they  were  removed  from  the  pool.  At  this  time  we  believe  that  we  have  resolved  the  problems.    

11/04/2011 01:41 PM: The root cause of recent Exchange problems has been addressed with hot fixes and reconfiguration of network traffic accomplished last night. Nonetheless, a subset of campus users experienced problems with the service today related to:

• A brief load spike at 9:00 AM this morning. This resulted in the temporary inability to connect to Exchange for some users. We are still investigating this event.

• A new problem introduced with the addition of client access server capacity. These servers were not handling connections properly, so we have eliminated them from the rotation. We have been working directly with the IT staff in the units impacted and believe that removing these servers has resolved those cases. We will continue to monitor reports until we are certain that no access issues remain.


11/04/2011  12:24  PM:  Overall,  Exchange  performance  is  much  improved.  However,  we  are  still  receiving  reports  from  a  subset  of  users  who  are  having  trouble  connecting  to  their  accounts.  We  are  working  with  the  Microsoft  engineer  to  diagnose  these  cases  and  solve  them.  

11/04/2011  08:49  AM:  CIT  staff  with  the  Microsoft  engineer  who  has  been  assisting  us  this  week  have  applied  patches  to  the  cluster  service  supporting  the  Exchange  system.  These  patches  have  eliminated  the  network  errors  and  subsequent  database  restarts  that  have  caused  the  extremely  poor  performance  this  week.  At  this  time  the  Exchange  service  appears  much  healthier.  Some  email  programs  may  have  become  confused  when  the  Exchange  system  became  unresponsive.  If  problems  persist,  we  recommend  that  you  quit  and  restart  your  email  programs,  and  contact  the  CIT  Help  Desk  if  problems  continue  after  that.    

11/03/2011  09:23  PM:  Technical  staff  working  on  Exchange  performance  issues  have  applied  a  patch  to  the  server  cluster  to  address  a  bug  that  was  causing  communication  failures.  This  should  improve  stability  and  allow  the  reconfiguration  work  to  proceed.  

11/03/2011 07:14 PM: Exchange mailboxes may be temporarily unavailable due to a cluster communications problem. We expect this condition to last for less than 30 minutes.

11/03/2011 04:09 PM: We are still working on reconfiguring the network path for Exchange communications to better distribute the traffic. We have engaged additional Microsoft resources over the phone to expedite resolution of issues we've encountered with this change.

11/03/2011  02:20  PM:  We  are  still  working  with  the  Microsoft  engineer  to  accomplish  the  reconfiguration  referenced  in  the  last  communication.  Although  we  initially  anticipated  that  work  would  be  completed  around  1  PM,  we  now  expect  it  will  take  several  more  hours.  We  expect  these  changes  will  result  in  a  stable  service  very  soon  after  they  are  completed  but  we  will  continue  to  take  incremental  steps  to  increase  capacity  to  better  accommodate  future  unplanned  events.    

11/03/2011  10:56  AM:  Between  now  and  approximately  1  PM  we  will  be  making  configuration  changes  to  the  Exchange  environment  to  improve  performance.  The  changes  themselves  are  not  expected  to  impact  the  user  community.  However,  until  these  changes  are  complete  we  may  see  events  similar  to  those  we'ʹve  experienced  over  the  past  several  days  that  result  in  access  issues  for  users.  Such  an  event  did  occur  this  morning  at  10  AM.  It  affected  a  significant  number  of  users  whose  mailboxes  live  on  the  affected  server.  Those  users  would  have  experienced  performance  issues  or  the  momentary  inability  to  connect  to  their  Exchange  accounts.    We  anticipate  that  very  soon  after  we  complete  the  configuration  changes  users  will  see  the  improvement  in  service  performance.  

11/03/2011  07:30  AM:  Working  in  concert  with  the  Microsoft  engineer  last  evening  we  have  made  configuration  changes  to  alleviate  Exchange  performance  issues.  Measures  included  client  access  network  reconfiguration,  changes  to  the  replication  configuration,  and  deploying  four  additional  client  access  servers.  While  we  believe  we  have  determined  the  root  cause  of  these  issues  we  will  continue  to  analyze  performance  data  to  confirm.  

11/02/2011 03:29 PM: CIT continues to work on resolving the Exchange performance issues. Additional servers will be added to Exchange tonight (November 2) to spread the load. Problems with the replication service are being investigated, including determining whether a Microsoft patch would resolve them. A Microsoft engineer will be on site tonight (November 2), and CIT will be taking additional measures based on those recommendations.

11/02/2011  08:55  AM:  CIT  is  continuing  to  work  on  solutions  to  the  Exchange  performance  issues.  Our  next  step  is  to  address  a  communications  problem  between  the  two  halves  of  the  Exchange  cluster.  We  are  also  working  to  add  another  Exchange  2010  server  as  soon  as  tonight.  In  our  test  environment,  we  will  be  assessing  a  newly  released  Microsoft  patch  that  contains  fixes  for  some  of  the  problems  we  have  been  seeing.    

11/01/2011 07:22 PM: Exchange performance has been stabilized for the moment. Some Microsoft-recommended changes to the Active Directory Domain Controllers were implemented, as well as monitors that will capture diagnostic information if the problems return tomorrow during periods of high load. We also have a fourth Exchange database server ready to go into production, which will give us 33% more capacity to deal with load issues. A fifth server will be added in another week. These will have a gradual effect as user mailboxes migrate transparently onto them.

11/01/2011 05:10 PM: CIT understands the importance of email and calendar for your work, and we realize we have fallen short of your expectations. We are working hard to regain those service levels. We have been working with Microsoft and others to understand what is causing these problems. So far the causes have been elusive, appearing at times to be a high CPU load causing poor response time, and at other times seeming to be an intermittent network problem. Several apparent causes have been addressed, including anti-virus updates, network adapter offload settings, power management settings, and the mailbox automounting setting. Please bear with us as we continue working on the problem.

11/01/2011  04:06  PM:  Exchange  Admins  are  actively  working  with  Microsoft  to  resolve  the  problem  swiftly.  Additional  information  will  be  posted  as  it  becomes  available.  

11/01/2011 02:30 PM: CIT is still receiving reports that some users are unable to access their Exchange email. CIT is continuing to investigate and will provide further updates.

11/01/2011  12:02  PM:  We  are  currently  investigating  this  problem  and  will  notify  you  with  updates  on  this  situation.  

Affected  Services:  Exchange    

         


APPENDIX IV: CCAB SERVICE DISRUPTION REPORTS

The CCAB Service Disruption reports below were completed in conjunction with the Exchange service disruption described in this document.

artf35310   Start: 9/15/2011 9:30 PM (Thursday)   End: 9/15/2011 11:59 PM (Thursday)   Service: Exchange [4236]

2 mailbox DBs on mbcx outage: Mailbox databases 19 and 22 and the public folder database did not mount after patching last night. It appears possible that this was an early symptom of the communications problem.

artf35362   Start: 9/19/2011 8:00 AM (Monday)   End: 9/19/2011 1:30 PM (Monday)   Service: Exchange [4236]

Exchange slow response times: Longer than anticipated run times for a large set of Exchange 2010 migrations coincided with a failed backup run that restarted at the same time. The two activities, neither of which could be halted, combined to slow down response time for client access to Exchange.

artf35567   Start: 9/26/2011 7:00 AM (Monday)   End: 9/26/2011 7:00 PM (Monday)   Service: E-Mail Routing [3979]

Exchange connections hanging: Connections began to hang on two new Client Access Servers placed into production on Sunday. The problem was resolved when the new servers were removed from service. Only a fraction of Exchange users were affected, and only certain clients had problems. No cause of the problem has yet been determined.

 


artf35912   Start: 10/17/2011 8:15 AM (Monday)   End: 10/17/2011 4:15 PM (Monday)   Service: Exchange [4236]

Exchange performance -- malware attack: Exchange experienced slow response and dropped client connections after receiving a large attack of malware messages. This did not affect mail delivery, only client access. There may have been some interaction with a set of virus definitions in effect that day on the Exchange anti-virus engine. Anti-virus signatures are automatically delivered several times per day by Microsoft.

artf36074   Start: 10/28/2011 10:00 AM (Friday)   End: 10/28/2011 12:30 PM (Friday)   Service: Exchange [4236]

Exchange performance slowdown: Due to the network issue this morning, the Exchange system's performance was affected. To improve performance, we had split the databases so that half were primary in Rhodes and half in CCC. The network issue caused databases to fail over, leaving all the databases on one side instead of split. Once usage rose high enough, performance suffered. The databases have been split out again and all appears to be well. The Exchange 2007 servers are being rebuilt as Exchange 2010 servers, which will increase our overall capacity to better handle these sorts of situations.

artf36151   Start: 10/31/2011 7:00 AM (Monday)   End: 10/31/2011 3:00 PM (Monday)   Service: Exchange [4236]

Outlook automapping / Exchange performance: A new 'feature' of Exchange 2010 is that Outlook 2007/2010 will automatically open *all* mailboxes to which the user has full access permission. All EGAs and resources grant those permissions to their owners. This only took effect when the permissions for a specific EGA or resource were updated; however, a maintenance script over the weekend updated permissions on all EGAs. This resulted in many more connections to mailboxes on Monday morning, contributing to ongoing performance problems. The automatic mounts were removed late in the morning. An unexpected side effect was that a previously manually mounted mailbox, overridden by the automatic mount of the same mailbox, was subsequently forgotten. People reported they had 'lost access' to shared mailboxes, when they had in fact simply been disconnected. The remedy was for them to reopen the shared mailbox.

artf36226   Start: 11/01/2011 12:00 AM (Tuesday)   End: 11/07/2011 11:59 PM (Monday)   Service: Exchange [4236]

Exchange performance problems: Severe performance problems affected Exchange throughout this period. The underlying symptom was that the cluster repeatedly lost and re-established quorum. The cause appeared to be communications problems between the cluster nodes. A Microsoft engineer came onsite to assist in diagnosis. A number of steps were taken to eliminate the problems, listed from the apparently most important contributing cause through lesser contributors (a configuration sketch follows the list):

- Turned off NetDMA on all network adapters. This was causing corrupted heartbeat packets.
- Applied three hotfixes from Microsoft that improved the cluster's resiliency to network errors.
- Turned off power management on the network adapters. (The failover NICs were trying to go to sleep.)
- Ensured that replication traffic does not use the same NIC as MAPI traffic to the CAS servers.
- Turned off power management on the CPUs.
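For reference, NetDMA and RSS are global TCP settings on Windows Server 2008 R2 that can be toggled with netsh, as documented in Microsoft KB 951037. The following is a minimal sketch of that check-and-disable step, assuming administrative rights on each DAG member; whether these features should be disabled is a per-environment decision, and a reboot may be needed before the change fully takes effect.

# Sketch: disable NetDMA and RSS on a Windows Server 2008 R2 host, then show the result.
# The netsh commands are those documented in Microsoft KB 951037; that article also describes
# the EnableTCPA registry value (HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters)
# as an alternative way to turn NetDMA off.
import subprocess

def run(cmd):
    """Run a command and return its standard output, raising if it fails."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def disable_offload_features():
    # NetDMA was corrupting cluster heartbeat packets in this incident.
    run(["netsh", "int", "tcp", "set", "global", "netdma=disabled"])
    # RSS had already been disabled during earlier troubleshooting; shown here for completeness.
    run(["netsh", "int", "tcp", "set", "global", "rss=disabled"])

def show_tcp_globals():
    # Displays the current state of NetDMA, RSS, chimney offload, and related settings.
    print(run(["netsh", "int", "tcp", "show", "global"]))

if __name__ == "__main__":
    disable_offload_features()
    show_tcp_globals()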

artf36231   Start: 11/08/2011 12:37 PM (Tuesday)   End: 11/08/2011 12:52 PM (Tuesday)   Service: Campus Area Network [2208]

Server Farm network disruption: The network switch sfcdist1-1-6600 failed at 12:37 and was restored to service at 12:52. A network issue on tier 3 prevented the firewalls from failing over properly, and the extra tier had no connectivity during this same interval. A second switch, sfc1-1-5400, also had no connectivity, and some single-attached servers were affected.

artf36227   Start: 11/08/2011 12:52 PM (Tuesday)   End: 11/08/2011 2:00 PM (Tuesday)   Service: Exchange [4236]

Exchange affected by network outage: Exchange access was affected by the network switch outage. After the end of the outage, the load balancer did not reestablish connections to the CAS servers. Services needed to be stopped and started on the CAS servers before the load balancer would restart the connections. We had many reports that client programs also required a stop/start or reboot before they would let go of their previous connection to Exchange via the load balancer.
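A service restart of the kind described above can be scripted across the CAS servers. The sketch below is illustrative only: the host names are placeholders, and MSExchangeRPC (the RPC Client Access service) is an assumed target, since the report does not name the services that were bounced.

# Sketch: stop and restart a Windows service on each CAS server so the load balancer
# can re-establish its connections. Host names and the service name are assumptions.
import subprocess
import time

CAS_SERVERS = ["cas-01", "cas-02"]   # placeholder host names
SERVICE = "MSExchangeRPC"            # assumed service; substitute whatever actually needs bouncing

def sc(server, verb, service):
    """Invoke sc.exe against a remote server and return its textual output."""
    result = subprocess.run(["sc", r"\\" + server, verb, service],
                            capture_output=True, text=True)
    return result.stdout + result.stderr

for server in CAS_SERVERS:
    print(sc(server, "stop", SERVICE))
    time.sleep(15)                   # crude settle time; polling 'sc query' would be more robust
    print(sc(server, "start", SERVICE))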

 


APPENDIX  V:  MICROSOFT  FINAL  REPORT    Mail  from  Microsoft  Engineer  to  CIT  team:    From:  John  Chappelle    Sent:  Tuesday,  November  15,  2011  4:40  PM  To:  [email protected]  Cc:  Gregg  Koop;  MSSolve  Case  Email;  Gregg  Koop  Subject:  [REG:111100371705359]  Exchange  2010  SP1|Experiencing  two  databases  where  the  issue  is  happening  frequently.      Bill,      I  am  writing  to  check  on  your  DAG  today,  and  I  am  also  including  a  summary  of  our  troubleshooting  efforts  on  this  case.      When  we  first  started,  we  observed  an  issue  with  the  cluster  losing  quorum  and  the  copy  queue  length  changing  to  a  very  large  number.    This  was  the  result  of  a  cluster  disconnect.    We  installed  three  patches  (KB2549472,  KB2549448,  and  2552040)  to  allow  nodes  to  join  properly  when  they  go  offline,  as  well  as  to  correct  an  issue  with  the  cluster  not  regrouping  properly  following  a  communication  failure.    This  alleviated  the  issue  for  a  period  of  time,  although  it  seems  likely  at  this  point  that  it  was  really  the  reboots  that  brought  the  cluster  back  together.    Those  patches  are  still  important  to  the  proper  operation  of  the  cluster,  and  we  recommend  them  for  any  2008R2  cluster  that  experiences  any  quorum  issues  at  all.      We  saw  the  issue  crop  up  again  the  next  week,  and  this  time  we  brought  in  both  a  Cluster  engineer  and  one  of  our  Networking  engineers.    From  their  analysis,  we  found  in  the  cluster  logs:      00001124.00001e84::2011/11/07-­‐‑19:36:12.823  INFO    [CONNECT]  169.254.7.84:~3343~  from  local  169.254.2.231:~0~:  Established  connection  to  remote  endpoint  169.254.7.84:~3343~.  00001124.00001e84::2011/11/07-­‐‑19:36:12.823  INFO    [Reconnector-­‐‑MBXB-­‐‑01]  Successfully  established  a  new  connection.  00001124.00001e84::2011/11/07-­‐‑19:36:12.823  INFO    [SV]  Route  local  (169.254.2.231:~43912~)  to  remote  MBXB-­‐‑01  (169.254.7.84:~3343~)  exists.  Forwarding  to  alternate  path.  00001124.00001e84::2011/11/07-­‐‑19:36:12.823  INFO    [SV]  Securing  route  from  (169.254.2.231:~43912~)  to  remote  MBXB-­‐‑01  (169.254.7.84:~3343~).  00001124.00001e84::2011/11/07-­‐‑19:36:12.823  INFO    [SV]  Got  a  new  outgoing  stream  to  MBXB-­‐‑01  at  169.254.7.84:~3343~  00001124.00001e84::2011/11/07-­‐‑19:36:12.823  INFO    [SV]  Authentication  and  authorization  were  successful  00001124.00001e84::2011/11/07-­‐‑19:36:12.838  INFO    [SV]  Security  Handshake  successful  while  obtaining  SecurityContext  for  NetFT  driver  00001124.00001e84::2011/11/07-­‐‑19:36:12.838  ERR      [CORE]  mscs::Reconnector::ConnectionEstablished:  HrError(0x8009030f)'ʹ  because  of  'ʹSignature  Verification  Failed'ʹ  00001124.00001e84::2011/11/07-­‐‑19:36:12.838  WARN    [Reconnector-­‐‑MBXB-­‐‑01]  Failed  to  handle  new  connection  with  error  ERROR_SYSTEM_POWERSTATE_COMPLEX_TRANSITION(783),  ignoring  connection.      In  addition,  we  saw  simultaneous  TCP  Resets  that  were  unexpected.    We  know  this  because  the  remote  node  in  the  conversation  continued  to  attempt  communication  after  the  resets:      2060  54  0    14:36:12.8425000  13:36:12  07-­‐‑Nov-­‐‑11  14.4811462  0.0000191  {TCP:41,  IPv4:33}  169.254.2.231  169.254.7.84  TCP  TCP:Flags=...A.R..,  SrcPort=43912,  DstPort=3343,  PayloadLen=0,  Seq=3063920255,  Ack=2252985581,  Win=0  


(scale  factor  0x8)  =  0  2061  86  32    14:36:12.8425199  13:36:12  07-­‐‑Nov-­‐‑11  14.4811661  0.0000199  {TCP:42,  IPv4:33}  169.254.7.84  169.254.2.231  TCP  TCP:Flags=...AP...,  SrcPort=3343,  DstPort=43912,  PayloadLen=32,  Seq=2252985581  -­‐‑  2252985613,  Ack=3063920254,  Win=514  2062  54  0    14:36:12.8425356  13:36:12  07-­‐‑Nov-­‐‑11  14.4811818  0.0000157  {TCP:42,  IPv4:33}  169.254.2.231  169.254.7.84  TCP  TCP:Flags=.....R..,  SrcPort=43912,  DstPort=3343,  PayloadLen=0,  Seq=3063920254,  Ack=3063920254,  Win=0  2063  54  0    14:36:12.8429705  13:36:12  07-­‐‑Nov-­‐‑11  14.4816167  0.0004349  {TCP:43,  IPv4:33}  169.254.7.84  169.254.2.231  TCP  TCP:Flags=...A....,  SrcPort=3343,  DstPort=43912,  PayloadLen=0,  Seq=2252985613,  Ack=3063920255,  Win=514      This  “POWERSTATE”  event  and  the  resets  led  us  to  examine  the  NICs  on  the  server,  where  we  found  the  power  save  functions  were  enabled.    We  disabled  those,  and  both  the  “POWERSTATE”  and  TCP  Reset  issues  abated  immediately.      Our  Cluster  engineer  also  researched  the  NetDMA  settings  and  determined  that  they  should  be  disabled,  so  we  turned  off  NetDMA  along  with  the  power  save  settings.      As  a  side  note,  I  received  the  information  on  the  Broadcom  driver  versions,  and  I  am  looking  around  to  see  if  there  is  a  known  issue  with  them.      Thank  you,  John  Chappelle  Senior  Support  Escalation  Engineer  [email protected]  469-­‐‑775-­‐‑5153  M-­‐‑F  0900-­‐‑1800  Central      My  manager:  Melissa  Stroud  [email protected]  469-­‐‑775-­‐‑7246            Followup  email  identifying  NetDMA  as  a  primary  cause:    From: William Effinger [mailto:[email protected]] Sent: Friday, November 18, 2011 10:43 AM To: William T Holmes Cc: Gregg Koop; John Chappelle Subject: [REG:111100371705359] Exchange 2010 SP1|Experiencing two databases where the issue is happening frequently      Bill,  John  asked  me  to  give  you  a  shout  with  a  writeup  of  my  notes      Looking  in  your  cluster  log  Node  MBXD-­‐02  14744  000015d0.000025c0::2011/11/07-­‐17:34:50.725  INFO    [GUM]  Node  2:  Processing  RequestLock  7:1242  14745  000015d0.00002ad8::2011/11/07-­‐17:34:50.725  INFO    [GUM]  Node  2:  Processing  GrantLock  to  7  (sent  by  1  gumid:  80208)  14746  000015d0.00001718::2011/11/07-­‐17:35:01.349  WARN    [PULLER  MBXA-­‐02]  ReadObject  failed  with  HrError(0x8009030f)'  because  of  'Signature  Verification  Failed'  14747  000015d0.00001718::2011/11/07-­‐17:35:01.349  ERR      [NODE]  Node  2:  Connection  to  Node  6  is  broken.  Reason  


HrError(0x8009030f)'  because  of  'Signature  Verification  Failed'  14748  000015d0.00001718::2011/11/07-­‐17:35:01.349  WARN    [NODE]  Node  2:  Initiating  reconnect  with  n6.  14749  000015d0.00001718::2011/11/07-­‐17:35:01.349  INFO    [MQ-­‐MBXA-­‐02]  Pausing  14750  000015d0.000018b0::2011/11/07-­‐17:35:01.349  INFO    [Reconnector-­‐MBXA-­‐02]  Reconnector  from  epoch  7  to  epoch  8  waited  00.000  so  far.  14751  000015d0.000018b0::2011/11/07-­‐17:35:01.349  INFO    [CONNECT]  169.254.6.224:~3343~  from  local  169.254.2.172:~0~:  Established  connection  to  remote  endpoint  169.254.6.224:~3343~.  14752  000015d0.000018b0::2011/11/07-­‐17:35:01.349  INFO    [Reconnector-­‐MBXA-­‐02]  Successfully  established  a  new  connection.  14753  000015d0.000018b0::2011/11/07-­‐17:35:01.349  INFO    [SV]  Route  local  (169.254.2.172:~14524~)  to  remote  MBXA-­‐02  (169.254.6.224:~3343~)  exists.  Forwarding  to  alternate  path.  14754  000015d0.000018b0::2011/11/07-­‐17:35:01.349  INFO    [SV]  Securing  route  from  (169.254.2.172:~14524~)  to  remote  MBXA-­‐02  (169.254.6.224:~3343~).  14755  000015d0.000018b0::2011/11/07-­‐17:35:01.349  INFO    [SV]  Got  a  new  outgoing  stream  to  MBXA-­‐02  at  169.254.6.224:~3343~  14756  000015d0.000025c0::2011/11/07-­‐17:35:01.349  WARN    [PULLER  MBXB-­‐01]  ReadObject  failed  with  HrError(0x8009030f)'  because  of  'Signature  Verification  Failed'  14757  000015d0.000025c0::2011/11/07-­‐17:35:01.349  ERR      [NODE]  Node  2:  Connection  to  Node  7  is  broken.  Reason  HrError(0x8009030f)'  because  of  'Signature  Verification  Failed'  14758  000015d0.000025c0::2011/11/07-­‐17:35:01.349  WARN    [NODE]  Node  2:  Initiating  reconnect  with  n7.  14759  000015d0.000025c0::2011/11/07-­‐17:35:01.349  INFO    [MQ-­‐MBXB-­‐01]  Pausing          15063  000015d0.00001614::2011/11/07-­‐17:35:47.681  INFO    [GUM]  Node  2:  Processing  GrantLock  to  1  (sent  by  4  gumid:  80222)  15064  000015d0.00004628::2011/11/07-­‐17:35:51.035  INFO    [GUM]  Node  2:  Processing  RequestLock  7:1246  15065  000015d0.00003964::2011/11/07-­‐17:35:51.035  INFO    [GUM]  Node  2:  Processing  GrantLock  to  7  (sent  by  1  gumid:  80223)  15066  000015d0.00003f7c::2011/11/07-­‐17:36:02.704  WARN    [PULLER  MBXA-­‐02]  ReadObject  failed  with  HrError(0x8009030f)'  because  of  'Signature  Verification  Failed'  15067  000015d0.00003f7c::2011/11/07-­‐17:36:02.704  ERR      [NODE]  Node  2:  Connection  to  Node  6  is  broken.  Reason  HrError(0x8009030f)'  because  of  'Signature  Verification  Failed'  15068  000015d0.00003f7c::2011/11/07-­‐17:36:02.704  WARN    [NODE]  Node  2:  Initiating  reconnect  with  n6.  15069  000015d0.00003f7c::2011/11/07-­‐17:36:02.704  INFO    [MQ-­‐MBXA-­‐02]  Pausing  15070  000015d0.00003a78::2011/11/07-­‐17:36:02.704  INFO    [Reconnector-­‐MBXA-­‐02]  Reconnector  from  epoch  10  to  epoch  11  waited  00.000  so  far.  15071  000015d0.00004628::2011/11/07-­‐17:36:02.704  WARN    [PULLER  MBXB-­‐01]  ReadObject  failed  with  HrError(0x8009030f)'  because  of  'Signature  Verification  Failed'  15072  000015d0.00004628::2011/11/07-­‐17:36:02.704  ERR      [NODE]  Node  2:  Connection  to  Node  7  is  broken.  Reason  HrError(0x8009030f)'  because  of  'Signature  Verification  Failed'  15073  000015d0.00004628::2011/11/07-­‐17:36:02.704  WARN    [NODE]  Node  2:  Initiating  reconnect  with  n7.  
15074  000015d0.00004628::2011/11/07-17:36:02.704  INFO  [MQ-MBXB-01]  Pausing

SEC_E_MESSAGE_ALTERED: The message or signature supplied for verification has been altered (0x8009030f)

Doing research with our internal knowledge base, I can see that 'Signature Verification Failed' can be caused by one of two features in Windows Server 2008: Receive Side Scaling and Network Direct Memory Access. As you have already turned off RSS, we disabled NetDMA.

Info on this tech: http://technet.microsoft.com/sk-sk/magazine/2007.01.cableguy(en-us).aspx


How to turn off RSS & NetDMA: http://support.microsoft.com/?id=951037

Best Regards,
William Effinger | MCP | MCSA | MCSE | MCTS | MCITP EA
Office Hours: Monday - Friday | 7a - 4p | EST
Phone: 980.776.8887
Email: [email protected]
Blog: http://blogs.technet.com/askcore/
Alternative contact information: local country phone number found here: http://support.microsoft.com/globalenglish, Extension 1168887
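The failure signature quoted in these mails (repeated HrError(0x8009030f) / 'Signature Verification Failed' entries and the POWERSTATE reconnect error) is straightforward to scan for once the cluster log has been exported, for example with 'cluster log /g'. A minimal sketch follows; the log path is a placeholder.

# Sketch: count the error signatures discussed in this case in an exported cluster log.
# The path below is a placeholder; point it at the file produced by 'cluster log /g'.
import re
from collections import Counter

LOG_PATH = r"C:\Windows\Cluster\Reports\Cluster.log"   # placeholder path

# Patterns drawn from the log excerpts quoted above.
PATTERNS = {
    "signature_verification_failed": re.compile(r"Signature Verification Failed"),
    "hrerror_0x8009030f": re.compile(r"HrError\(0x8009030f\)"),
    "powerstate_transition": re.compile(r"ERROR_SYSTEM_POWERSTATE_COMPLEX_TRANSITION"),
    "connection_broken": re.compile(r"Connection to Node \d+ is broken"),
}

def scan(path):
    counts = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            for name, pattern in PATTERNS.items():
                if pattern.search(line):
                    counts[name] += 1
    return counts

if __name__ == "__main__":
    for name, count in scan(LOG_PATH).most_common():
        print(name, count)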

 


APPENDIX VI: MICROSOFT KNOWLEDGEBASE ARTICLE  In a post-mortem discussion with Microsoft, CIT staff pointed out the lack of information available that would have allowed us to prevent this problem or diagnose it once it occurred. In response, Microsoft published the following article: (http://blogs.technet.com/b/exchange/archive/2011/11/20/recommended-windows-hotfix-for-database-availability-groups-running-windows-server-2008-r2.aspx)

Recommended Windows Hotfix for Database Availability Groups running Windows Server 2008 R2

Scott Schnoll [MSFT] 20 Nov 2011 7:41 AM

In early August of this year, the Windows SE team released the following Knowledge Base (KB) article and accompanying software hotfix regarding an issue in Windows Server 2008 R2 failover clusters:

KB2550886 - A transient communication failure causes a Windows Server 2008 R2 failover cluster to stop working

This hotfix is strongly recommended for all database availability groups that are stretched across multiple datacenters. For DAGs that are not stretched across multiple datacenters, this hotfix is good to have as well. The article describes a race condition and cluster database deadlock issue that can occur when a Windows Failover cluster encounters a transient communication failure. There is a race condition within the reconnection logic of cluster nodes that manifests itself when the cluster has communication failures. When this occurs, it will cause the cluster database to hang, resulting in quorum loss in the failover cluster.

As described on TechNet, a database availability group (DAG) relies on specific cluster functionality, including the cluster database. In order for a DAG to be able to operate and provide high availability, the cluster and the cluster database must also be operating properly.

Microsoft has encountered scenarios in which a transient network failure occurs (a failure of network communications for about 60 seconds) and, as a result, the entire cluster is deadlocked and all databases within the DAG are dismounted. Since it is not very easy to determine which cluster node is actually deadlocked, if a failover cluster deadlocks as a result of the reconnect logic race, the only available course of action is to restart all members of the cluster to resolve the deadlock condition.

The problem typically manifests itself in the form of cluster quorum loss due to an asymmetric communication failure (when two nodes cannot communicate with each other but can still communicate with other nodes). If there are delays among other nodes in the receiving of cluster regroup messages from the cluster’s Global Update Manager (GUM), regroup messages can end up being received in unexpected order. When that happens, the cluster loses quorum instead of invoking the expected behavior, which is to remove one of the nodes that experienced the initial communication failure from the cluster.

Generally, this bug manifests when there is asymmetric latency (for example, where half of the DAG members have latency of 1 ms, while the other half of the DAG members have 30 ms latency) for two cluster nodes that discover a broken connection between the pair. If the first node detects a connection loss well before the second node, a race condition can occur:

• The first node will initiate a reconnect of the stream between the two nodes. This will cause the second node to add the new stream to its data.


• Adding the new stream tears down the old stream and sets its failure handler to ignore. In the failure case, the old stream is the failed stream that has not been detected yet.

• When the connection break is detected on the second node, the second node will initiate a reconnect sequence of its own. If the connection break is detected in the proper race window, the failed stream's failure handler will be set to ignore, and the reconnect process will not initiate a reconnect. It will, however, issue a pause for the send queue, which stops messages from being sent between the nodes. When the messages are stopped, this prevents GUM from operating correctly and forces a cluster restart.

If this issue does occur, the consequences are very bad for DAGs. As a result, we recommend that you deploy this hotfix to all of your Mailbox servers that are members of a DAG, especially if the DAG is stretched across datacenters. This hotfix can also benefit environments running Exchange 2007 Single Copy Clusters and Cluster Continuous Replication environments.
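The reconnect race described above can be reduced to a toy model for intuition. The sketch below is a deliberately simplified abstraction; the class and method names are invented and do not correspond to actual cluster service internals.

# Toy model of the reconnect race: the node that detects the break first pushes a
# replacement stream to its peer, which marks the peer's old stream as "ignore".
# When the peer finally detects the break itself, it neither reconnects nor fails over;
# it only pauses its send queue, which stalls GUM traffic and deadlocks the cluster.

class Stream:
    def __init__(self, name):
        self.name = name
        self.failure_handler = "reconnect"   # default: a detected failure triggers a reconnect

class Node:
    def __init__(self, name):
        self.name = name
        self.old_stream = Stream(name + "-original")
        self.current_stream = self.old_stream
        self.send_queue_paused = False

    def accept_replacement_stream(self, stream):
        # The peer's reconnect tears down the old stream and sets its failure handler
        # to "ignore" before this node has noticed the failure locally.
        self.old_stream.failure_handler = "ignore"
        self.current_stream = stream

    def detect_break_on_old_stream(self):
        # Later, this node notices the original connection failure.
        if self.old_stream.failure_handler == "ignore":
            # Race window: no reconnect is initiated, only a pause of the send queue.
            self.send_queue_paused = True
            return "send queue paused -> cluster deadlock"
        return "reconnect initiated -> cluster recovers"

slow_detector = Node("node2")
# node1 (not modeled) detects the failure first and re-establishes the stream:
slow_detector.accept_replacement_stream(Stream("node1-replacement"))
print(slow_detector.detect_break_on_old_stream())    # -> send queue paused -> cluster deadlock

lone_detector = Node("node3")                        # no replacement stream arrived first
print(lone_detector.detect_break_on_old_stream())    # -> reconnect initiated -> cluster recovers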

In addition to fixing the issue described above, KB2550886 also includes other important Windows Server 2008 R2 hotfixes that are also recommended for DAGs:

• http://support.microsoft.com/kb/2549472 - Cluster node cannot rejoin the cluster after the node is restarted or removed from the cluster in Windows Server 2008 R2

• http://support.microsoft.com/kb/2549448 - Cluster service still uses the default time-out value after you configure the regroup time-out setting in Windows Server 2008 R2

• http://support.microsoft.com/kb/2552040 - A Windows Server 2008 R2 failover cluster loses quorum when an asymmetric communication failure occurs

Comments

William Holmes 21 Nov 2011 9:59 AM # This helpful article comes about 3 weeks too late. We experienced this issue and have in fact installed the hotfixes. In addition to these fixes you may want to examine other aspects of your networking recommendations. For instance: support.microsoft.com/.../951037 the features mentioned in this KB all contributed to triggering the problems that the hotfixes address. Disabling the features mentioned improved the stability and responsiveness of our entire Exchange Organization.

daliu 21 Nov 2011 5:53 PM # I take it from the kb's these are "Windows" clustering hotfixes & therefore won't be rolled up into Exchange 2010 SP2 later this year, correct?

Marcus L 22 Nov 2011 2:14 AM # This is a question for William Holmes, when you say "Disabling the features mentioned improved stability", which features exactly, all of them?

Martijn 22 Nov 2011 4:33 AM # Will this info be part of the Installation Guide Template - DAG Member? Then it would be clear which hotfixes to install along with the latest Windows 2008 R2 & Exchange 2010 Service Packs and Update Rollups.

Rob A 22 Nov 2011 7:17 AM # MSFT needs to update ExBPA so that we don't have to comb through articles like this for obscure fixes and optimizations. ExBPA makes life easier for us and for PSS. I don't think I have seen an update for ExBPA in a very long time.

Brian Day [MSFT] 22 Nov 2011 8:12 AM # @Rob A, ExBPA updates are released in Service Packs and Update Rollups. If you want to make sure you have the latest ExBPA ruleset in place then install the latest SP and rollup on the machine you are running the ExBPA from.

Eugene 22 Nov 2011 9:33 AM # In our environment, using latest drivers available for IBM x3550 M2 servers and firmware, we can only stabilize a high-throughput server by disabling NetDMA in each and every case.

Eugene 22 Nov 2011 9:34 AM # In fact, IBM has documented recommendations for many of their products to disable NetDMA. But since our drivers are the latest available you'd think we'd expect a feature so heavily recommended by Microsoft perf. tuning guides to fundamentally work, which it fundamentally doesn't. www-304.ibm.com/.../docview.wss


Serhad MAKBULOĞLU 23 Nov 2011 1:46 AM # Thanks.

andy 25 Nov 2011 1:03 PM # tried to request the hotfix but got below: "The system is currently unavailable. Please try back later, or contact support if you want immediate assistance." When will the hotfix be available from WSUS? We need some quality assurance from Microsoft in order to get it approved on our production environment.

William Holmes 25 Nov 2011 7:49 PM # For Marcus: Yes all of them. NetDMA in particular seems to have caused cluster communications to be disrupted. This in turn caused a number of exchange problems as might be expected.

       


APPENDIX VII: MICROSOFT CLOSEOUT

From: Gregg Koop <[email protected]>
Subject: Recent Exchange/Broadcom case
Date: November 22, 2011 3:13:27 PM EST
To: Chuck Boeheim <[email protected]>, Andrea Beesing <[email protected]>, William T Holmes <[email protected]>

Hi everyone,

I am in the process of closing out your case and classifying this as a bug (Broadcom or otherwise) so that you don't get charged the hours against your contract.

Is there anything else you need from the engineers assigned to this case?

Otherwise, is it OK to close this out?

Thank you.

Kind regards,

Gregg Koop
Sr. Technical Account Manager, MCTS, MBA, PMP, 6σ Black Belt
Microsoft US Public Sector Services - State and Local Government & Education
[email protected]
office: (732) 476-5581
cell: (908) 391-5656