the john hancock monitoring story, futurestack17
TRANSCRIPT
We operate as John Hancock in the United States, and Manulife in other parts of the world.
The John Hancock Monitoring Story:
Implementation OR Adaptation?
What does it take to succeed with New Relic?
September 2017
Navpreet Singh, Head of Technical Resolution at John Hancock
Manulife & John Hancock
Source: http://www.manulife.com/Our-Story
A Global company
22 million customers,
35,000 employees, 70,000 agents,
thousands of distribution partners
Global Assets Under Management
and Administration exceeded
$1 trillion in the first quarter of 2017
Technology Landscape @ John Hancock
150 Year-old Business
Early IT Adopter
Using mainframe, cloud, and everything in between
Mainframe, COBOL, Micro Focus…
VB, PB, Progress, VFP…
Java, .Net, Ruby, Node, Angular, React, PHP…
Windows, Linux, Solaris, AIX…
SQL Server, Oracle, DB2, MySQL…
…And every version of these!
Serverless, Microservices, In Cloud
Technology Landscape @ John Hancock
600+ applications developed both in-house
and with vendors
Hosted on multiple models
Thousands of IT/IS professionals
The Manulife/John Hancock Reality
Before New Relic
Disparate Monitoring Solutions
Many different approaches to monitor applications
No monitoring software for many applications
Basic hardware monitoring for ops and vendors
But… applications talk to each other all the time!
Result: Large holes in end-to-end monitoring
Example Scenarios
Web page loading slow
Batch process running slow
Don’t know whether it is a CPU, RAM, disk, SQL, or app issue
Dev team can only access app logs; can’t capture CPU/RAM usage
Need server admin & DBA
Meet service admin to capture CPU/RAM usage
Wait for assigned admins to respond
Takes hours to days just to obtain data before troubleshooting
Performance Issues
Example Scenarios
Web page errors
App layer / Business layer errors
SQL errors
Dev team uses app logs; limited insight
Need to reproduce the error in lower regions and debug the code
Time-consuming exercise, lack of real-time trace:
Web page -> App component -> SQL invoked from App
Lack of detailed thread-level tracing for performance issues
Need architect / admins
Application Errors
Increased Priority Incidents = Need for Better Monitoring
Move from reactive to proactive
We needed a central monitoring standard
Resolve issues quickly
Improve understanding of application behavior
Improve visibility into applications in production
Enter New Relic
We’re All a Product of Our Environment!
What Else Was Happening When New Relic Was Being Introduced?
What Else Was Happening?
Move to Cloud
Predominantly Azure IaaS with some PaaS, App Service
Some AWS
Move to Agile
Largely Scrum, SAFe with some advanced concepts like TDD + Pairing
Push to DevOps
New Relic push aligns with DevOps and Agile
CIO/COO sets a Clear Goal!
An aggressive, clear, & unambiguous goal:
All applications in Production must be monitored by New Relic within one year
What’s Next?
What’s the right Team Structure?
Who should own monitoring setup and responsibilities?
Monitoring Ownership
Goal: End-to-end monitoring solution
which spans tiers, hardware, and software
Monitoring Servers
Ops team has clear ownership
Monitoring Applications
Not so clear
Monitoring Ownership options
1. A specialized central monitoring team focused on application monitoring
2. Ops team owns all monitoring, drives it with the application teams
3. Each app team owns setting up monitoring
Our Ownership Solution at JH: It’s a Hybrid!
Each app team owns setting up monitoring for their applications
Center of Excellence set up to drive the effort
Culture change – very important. This distinguishes adaptation from a simple software implementation
For one BU with 100+ apps, a central monitoring team established within the BU
Engagement Methodology with App Teams
1st set of Meetings:
New Relic Buy-in
2nd set of Meetings:
App’s Tech
Proposal:
App + New Relic = Great Things!
Periodic Check-ins
Adaptation: Best Practices & Suggestions
Culture Change: Get Buy-In
Highlight the Wins & Success Stories
to Top Leadership
Nurture an Internal Community
Monitoring Maturity Curve: Different types of monitoring
Alerts – Getting them right
Insights – IT Analytics
Insights – Business Analytics
Agile mindset to the project
Bias towards action
Don’t sit in a room discussing / researching until you know all the answers
Figure out enough to get started, start executing, find answers in the process – Inspect and Adapt
Progress Shared monthly with all Senior IT Leaders
Metrics showed:
# of users
Growth over a period:
% Apps by Status
Monthly growth by BU
Metrics Highlighted to Track Progress
Agent Type                              Min. Contracted   Apr    May    Jun
APM (Application Performance Monitors)  264               61     98     126
Servers                                 Unlimited         575    675    725
Mobile Apps                             250000            0      0      298
Browser (Million Checks)                75                1.5    8.3    11
Synthetic* (Million Checks)             1.5               1.4    1.4    0.7
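One way to read the adoption numbers above is as a share of the contracted minimum. The sketch below is only an illustration of that arithmetic, not something shown in the talk. It assumes the columns are Min. Contracted, Apr, May, and Jun, and it uses the June figures. The names `CONTRACTED`, `JUNE_USAGE`, `utilization_percent`, and `june_utilization` are hypothetical.

```python
# Figures copied from the slide's table. "Servers" is left out because its
# contracted minimum is listed as "Unlimited", so a percentage is undefined.
# Assumption: the last numeric column of each row is June usage.
CONTRACTED = {
    "APM": 264,
    "Mobile Apps": 250000,
    "Browser (million checks)": 75,
    "Synthetic (million checks)": 1.5,
}
JUNE_USAGE = {
    "APM": 126,
    "Mobile Apps": 298,
    "Browser (million checks)": 11,
    "Synthetic (million checks)": 0.7,
}


def utilization_percent(used, contracted):
    """Return usage as a percentage of the contracted minimum."""
    return 100.0 * used / contracted


def june_utilization():
    """Map each agent type to its June usage, as a rounded percentage."""
    return {
        name: round(utilization_percent(JUNE_USAGE[name], CONTRACTED[name]), 1)
        for name in CONTRACTED
    }


if __name__ == "__main__":
    for name, pct in june_utilization().items():
        print(f"{name}: {pct}% of contracted minimum")
```

Under this reading, APM stood at a little under half of the contracted minimum by June, while mobile monitoring had only just started.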
[Chart: ‘In Progress’ and ‘Completed’, Jan-17 through Jun-17; series labels: JH, DA]
Speed Bumps?
Before You Can Live Happily Ever After…
Some speed bumps we faced?
Firewall – took a long time to resolve internally
SSL issue with older Java apps
Sweet spot – Great with tech within the last 20-30 years and upcoming technologies
IBM technologies
PMI Metrics with WebSphere
Private Locations Azure deployable image
Server Agents (& breadth)
Some Happy Endings…
Results - Success Stories
APM: A group cut page load time by 3 seconds per page after identifying tuning opportunities in a SQL statement that was executed multiple times on every page load
Synthetics: A group identified that a 100+ MB static file was being served by web servers in MA instead of the Akamai CDN
SQL Server Plugin: A team identified that their Page Life Expectancy had deteriorated drastically since the DB moved to a new server, indicating inadequate RAM allocation
Insights: A team identified that uneven load distribution across servers was causing severely degraded performance
Server API + Synthetics: A team uses alerts on memory exhaustion to avoid what used to be definite downtime
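The APM story above describes a SQL statement that ran several times for every page load. The talk does not say how the group fixed it, so the sketch below is only a generic illustration of that pattern. It uses an in-memory SQLite database to compare one query per item with a single batched query. The `products` table, the `CountingConnection` wrapper, and both loader functions are hypothetical names invented for this example.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a small, made-up products table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO products (id, name) VALUES (?, ?)",
        [(i, f"product-{i}") for i in range(1, 6)],
    )
    return conn


class CountingConnection:
    """Wrap a connection and count how many statements are executed."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def execute(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params)


def load_names_one_by_one(db, ids):
    """Repeated pattern: one SELECT per id, so N ids cost N round trips."""
    return [
        db.execute("SELECT name FROM products WHERE id = ?", (i,)).fetchone()[0]
        for i in ids
    ]


def load_names_batched(db, ids):
    """Batched pattern: a single SELECT fetches every requested row."""
    placeholders = ",".join("?" for _ in ids)
    rows = db.execute(
        f"SELECT id, name FROM products WHERE id IN ({placeholders})",
        tuple(ids),
    ).fetchall()
    by_id = dict(rows)
    return [by_id[i] for i in ids]


if __name__ == "__main__":
    db = CountingConnection(make_db())
    ids = [1, 2, 3, 4, 5]
    load_names_one_by_one(db, ids)
    print("one-by-one statements:", db.count)
    db.count = 0
    load_names_batched(db, ids)
    print("batched statements:", db.count)
```

Both loaders return the same names. The difference is the statement count per page, which is the kind of detail a per-transaction SQL trace makes visible.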
Going Forward… The Journey Continues
Recently Acquired Infrastructure Product
NR Software Analysis Review
NR Expert Services
Increased Insights Retention Period
Miles to go…
Questions?