Practical solutions to detecting bugs Karl Norling @karlnorling

Post on 11-Apr-2017

Page 1: Practical solutions to detecting bugs

Practical solutions to detecting bugs

Karl Norling @karlnorling

Page 2: Practical solutions to detecting bugs

I’m Karl Norling

• Swedish, moved to the U.S. ~13 years ago

• Spend most of my time in Brooklyn with my family

• Lead software engineer at Quartet

Page 3: Practical solutions to detecting bugs

Quartet

• Healthcare technology company focused on improving Behavioral Healthcare

• 1 in 4 Americans experienced a mental disorder in the last year; most were moderate to severe

• Behavioral Health side is often ignored, leading to poor outcomes (medication non-adherence & ER visits)

Page 4: Practical solutions to detecting bugs

Quartet

• Quartet delivers scalable behavioral health integration for our partners, leading to better patient care and cost savings

• Quartet’s product is a marriage of our three pillars: Data & Analytics, Collaborative Platform, and Engagement and Support.

Page 5: Practical solutions to detecting bugs

Engineering at Quartet

• Work with highly sensitive and regulated information that demands high reliability (PHI/HIPAA)

• Develop for very distinct users with very different challenges (BHP, PCP, Patients, Quartet Users)

• Need to deliver a robust solution

Page 6: Practical solutions to detecting bugs

How do you know something is wrong?

Page 7: Practical solutions to detecting bugs

Catching errors

Page 8: Practical solutions to detecting bugs

How do I know that something is wrong?

Identify critical “this should never happen, but if it does” cases, and log them.

    if (user.isAuthenticated) {
      …
    } else {
      // Should never happen; log it so an alert can pick it up.
      this.logger.log('warn', 'Not authorized user trying to access ..', { user });
    }

Page 9: Practical solutions to detecting bugs

How do I know that something is wrong?

Wrap code in try/catch statements.

    try {
      const patient = new Patient(metrics, logger);
      patient.hydrate(rawPatient);
    } catch (error) {
      // Log enough context to find the failing record.
      this.logger.log('error', 'Failed to hydrate patient', { patientId: rawPatient.id });
    }

Page 10: Practical solutions to detecting bugs

How do I know that something is wrong?

Measure events with metrics.

    const res = request.authenticate(email, pw);
    if (res.status === 200) {
      this.metrics('user_login_success');
    } else if (res.status === 403) {
      this.metrics('user_login_unauthorized');
    } else if (res.status >= 500) {
      this.metrics('user_login_error');
    }
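To make the login-metrics pattern runnable on its own, here is a minimal sketch with an in-memory counter standing in for a real metrics client. The names `createMetrics` and `recordLogin` are assumptions for illustration and are not Quartet's actual API.

```javascript
// Hypothetical in-memory stand-in for a metrics client.
function createMetrics() {
  const counts = new Map();
  return {
    increment(name) {
      counts.set(name, (counts.get(name) || 0) + 1);
    },
    count(name) {
      return counts.get(name) || 0;
    },
  };
}

// Translate a login response status into a named metric.
// Statuses outside these branches are deliberately not counted,
// mirroring the slide's example.
function recordLogin(metrics, status) {
  if (status === 200) {
    metrics.increment('user_login_success');
  } else if (status === 403) {
    metrics.increment('user_login_unauthorized');
  } else if (status >= 500) {
    metrics.increment('user_login_error');
  }
}

const metrics = createMetrics();
[200, 200, 403, 503, 404].forEach((status) => recordLogin(metrics, status));
```

Once events are counted by name, an alert can be defined on a count or on a ratio (for example, errors relative to successes) rather than on individual log lines.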

Page 11: Practical solutions to detecting bugs

How do I know that something is wrong?

Measure everything

• Environment alerts: CPU usage, disk space, etc.

• External reporting: a customer or employee reports an issue.

Page 12: Practical solutions to detecting bugs

Create alerts

Page 13: Practical solutions to detecting bugs

Create alerts

Create search queries in your logging software (e.g. Kibana, Sumo Logic, Splunk) that alert on a specific log message, level, or threshold.

At Quartet we’re using ElastAlert.
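The idea behind a threshold alert can be sketched in plain JavaScript. This is not ElastAlert's configuration format or API. The log entry shape (`level`, `message`, `timestamp`), the rule fields, and the function name `shouldAlert` are all assumptions for illustration.

```javascript
// Hypothetical sketch: fire when at least `threshold` matching log
// entries fall inside a time window ending at `now`.
function shouldAlert(entries, rule, now) {
  const windowStart = now - rule.windowMs;
  const matches = entries.filter(
    (entry) =>
      entry.timestamp >= windowStart &&
      entry.timestamp <= now &&
      entry.level === rule.level &&
      entry.message.includes(rule.messageContains)
  );
  return matches.length >= rule.threshold;
}

const MINUTE_MS = 60 * 1000;
const now = Date.parse('2017-04-11T12:00:00Z');

const rule = {
  level: 'error',
  messageContains: 'Failed to hydrate patient',
  threshold: 3,
  windowMs: 5 * MINUTE_MS,
};

const entries = [
  { level: 'error', message: 'Failed to hydrate patient', timestamp: now - 1 * MINUTE_MS },
  { level: 'error', message: 'Failed to hydrate patient', timestamp: now - 2 * MINUTE_MS },
  { level: 'error', message: 'Failed to hydrate patient', timestamp: now - 4 * MINUTE_MS },
  { level: 'error', message: 'Failed to hydrate patient', timestamp: now - 10 * MINUTE_MS },
  { level: 'warn', message: 'Not authorized user', timestamp: now - 1 * MINUTE_MS },
];

console.log(shouldAlert(entries, rule, now)); // true: three matches in five minutes
```

In practice the logging tool evaluates the query on a schedule; the sketch only shows the matching and counting logic.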

Page 14: Practical solutions to detecting bugs

Create alerts

Metrics are good to use to detect trends.

Example: If we haven’t had any logins in the last 24 hours, it’s time to investigate.
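The "no logins in 24 hours" example is an alert on the absence of events. A minimal sketch, assuming login events are available as millisecond timestamps (the function name `isFlatlined` is hypothetical):

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical sketch: report true when no event has been seen
// within `maxQuietMs` of `now`.
function isFlatlined(eventTimestamps, now, maxQuietMs = DAY_MS) {
  if (eventTimestamps.length === 0) {
    return true; // nothing recorded at all counts as quiet
  }
  const latest = Math.max(...eventTimestamps);
  return now - latest > maxQuietMs;
}

const checkedAt = Date.parse('2017-04-11T12:00:00Z');
console.log(isFlatlined([checkedAt - 2 * DAY_MS], checkedAt)); // true
console.log(isFlatlined([checkedAt - 60 * 60 * 1000], checkedAt)); // false
```

The quiet period is a parameter because the right value depends on normal traffic: a busy service might warrant minutes rather than a full day.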

Page 15: Practical solutions to detecting bugs

Create alerts

There should be a way for employees and customers to report issues, either from the website or via an email address.

Example: An employee using an internal tool cannot change the shipping address for an order.

Page 16: Practical solutions to detecting bugs

Organize alerts

Page 17: Practical solutions to detecting bugs

Organize alerts

Add tags to log messages.

Then, search queries are easier to group, delegate, and report upon.

Page 18: Practical solutions to detecting bugs

Organize alerts

Define a naming convention system for your tags.

Either prefix them with functional areas or team names.
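One way to keep a tag convention honest is to check it in code. The sketch below assumes a made-up convention of `<area>.<event>` in lowercase with hyphens; the pattern, the tag names, and the helper names are illustrative and not taken from the talk.

```javascript
// Hypothetical convention: "<area>.<event>", lowercase, hyphen-separated.
const TAG_PATTERN = /^([a-z][a-z0-9-]*)\.([a-z][a-z0-9-]*)$/;

// Returns { area, event } for a valid tag, or null otherwise.
function parseTag(tag) {
  const match = TAG_PATTERN.exec(tag);
  return match ? { area: match[1], event: match[2] } : null;
}

// Groups tags by their area prefix; invalid tags land under "untagged".
function groupByArea(tags) {
  const groups = new Map();
  for (const tag of tags) {
    const parsed = parseTag(tag);
    const key = parsed ? parsed.area : 'untagged';
    if (!groups.has(key)) {
      groups.set(key, []);
    }
    groups.get(key).push(tag);
  }
  return groups;
}

const grouped = groupByArea([
  'auth.login-failed',
  'patient.hydrate-failed',
  'auth.token-expired',
  'Oops',
]);
```

With a consistent prefix, grouping alerts by functional area or team becomes a string operation instead of a lookup someone has to maintain by hand.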

Page 19: Practical solutions to detecting bugs

Organize alerts

Alerts should create tickets.

When an alert is triggered, a ticket should be generated in whichever tool is used to track work (e.g., JIRA). Tickets should be created within the project associated with the team that owns the service.
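Routing a ticket to the owning team's project can be as simple as a lookup on the tag prefix. The sketch below only builds a plain ticket object and does not call JIRA or any other tracker. The project keys, the mapping, and `buildTicket` are assumptions for illustration.

```javascript
// Hypothetical mapping from tag area to the owning team's project key.
const PROJECT_BY_AREA = new Map([
  ['auth', 'APP'],
  ['patient', 'CORE'],
  ['deploy', 'INFRA'],
]);
const FALLBACK_PROJECT = 'TRIAGE';

// Builds the ticket payload for an alert; unknown areas go to triage
// so that no alert is silently dropped.
function buildTicket(alert) {
  const area = alert.tag.split('.')[0];
  const project = PROJECT_BY_AREA.get(area) || FALLBACK_PROJECT;
  return {
    project,
    summary: `[${alert.tag}] ${alert.message}`,
    labels: [alert.tag],
  };
}

const ticket = buildTicket({
  tag: 'patient.hydrate-failed',
  message: 'Failed to hydrate patient',
});
console.log(ticket.project); // "CORE"
```

The fallback project matters: an alert with an unrecognized tag still produces a ticket that someone will see.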

Page 20: Practical solutions to detecting bugs

Communicate

Page 21: Practical solutions to detecting bugs

Communicate

Choose the right tool for communicating the alert to the person on call (e.g. Slack, HipChat, email, JIRA).

At Quartet we’re using Slack.

Page 22: Practical solutions to detecting bugs

Communicate

Make sure the tool can be configured to send alerts via different channels depending on the alert, so the correct team or on-call person sees it.

At Quartet we’re using PagerDuty.
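Channel selection is usually expressed as an ordered list of rules where the first match wins. The sketch below is a generic illustration, not PagerDuty or Slack configuration. The channel names, the alert fields (`severity`, `team`), and `routeAlert` are hypothetical.

```javascript
// Hypothetical routing table: rules are checked in order, first match wins.
const ROUTES = [
  { match: (alert) => alert.severity === 'critical', channel: 'page-on-call' },
  { match: (alert) => alert.team === 'infra', channel: '#infra-alerts' },
  { match: () => true, channel: '#eng-alerts' }, // catch-all, must stay last
];

function routeAlert(alert) {
  return ROUTES.find((route) => route.match(alert)).channel;
}

console.log(routeAlert({ severity: 'critical', team: 'infra' })); // "page-on-call"
console.log(routeAlert({ severity: 'warn', team: 'infra' })); // "#infra-alerts"
console.log(routeAlert({ severity: 'warn', team: 'app' })); // "#eng-alerts"
```

Putting the critical rule first ensures that urgent alerts page a person even when a team-specific channel would also match.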

Page 23: Practical solutions to detecting bugs

What happens next?

Page 24: Practical solutions to detecting bugs

On-call acknowledges the issue

Page 25: Practical solutions to detecting bugs

Who is on-call

On-call is the employee who responds to alerts. Other terms might be red-hot, on-duty, etc.

Page 26: Practical solutions to detecting bugs

On-call acknowledges the issue

An on-call schedule should be created, rotating weekly (depending on the number of employees).

You may also have a secondary on-call, in case the primary is unavailable (e.g., on the subway ride home).

At Quartet we have one for app devs, core, and infra.
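A weekly rotation can be computed from a roster and a start date. This is a sketch under stated assumptions: the roster names are invented, the rotation turns over exactly every seven days from `rotationStart`, and the secondary is simply the next person in the roster. A scheduling tool would also handle overrides and vacations.

```javascript
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical sketch: pick primary and secondary on-call for a timestamp.
// `date` and `rotationStart` are millisecond timestamps.
function onCallFor(date, roster, rotationStart) {
  const weeksElapsed = Math.floor((date - rotationStart) / WEEK_MS);
  const size = roster.length;
  // Double modulo keeps the index valid for dates before rotationStart.
  const primaryIndex = ((weeksElapsed % size) + size) % size;
  return {
    primary: roster[primaryIndex],
    secondary: roster[(primaryIndex + 1) % size],
  };
}

const roster = ['ana', 'ben', 'cy'];
const rotationStart = Date.parse('2017-04-03T00:00:00Z');

console.log(onCallFor(Date.parse('2017-04-05T09:00:00Z'), roster, rotationStart));
// { primary: 'ana', secondary: 'ben' }
console.log(onCallFor(Date.parse('2017-04-11T09:00:00Z'), roster, rotationStart));
// { primary: 'ben', secondary: 'cy' }
```

Making the secondary next week's primary means each person gets a week of lower-pressure context before taking over, though with a roster of one the primary and secondary are the same person.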

Page 27: Practical solutions to detecting bugs

On-call acknowledges the issue

Page 28: Practical solutions to detecting bugs

On-call acknowledges the issue

The primary on-call receives the alert and acknowledges it through the same channel within a defined time range.

If the time expires, the issue is escalated to the secondary.

Make sure to set a time range that makes sense for your organization.

At Quartet, we use 15 minutes.
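The acknowledge-or-escalate rule reduces to a timestamp comparison. The sketch below assumes an alert object with `triggeredAt` and `acknowledgedAt` fields in milliseconds; those field names and `whoToNotify` are hypothetical, and the 15-minute timeout is the value quoted above.

```javascript
const ACK_TIMEOUT_MS = 15 * 60 * 1000; // 15 minutes

// Hypothetical sketch: decide who should be notified about an alert.
// Returns null once the alert has been acknowledged.
function whoToNotify(alert, now) {
  if (alert.acknowledgedAt != null) {
    return null; // someone already owns it
  }
  const waited = now - alert.triggeredAt;
  return waited >= ACK_TIMEOUT_MS ? 'secondary' : 'primary';
}

const MIN_MS = 60 * 1000;
const openAlert = { triggeredAt: 0, acknowledgedAt: null };

console.log(whoToNotify(openAlert, 5 * MIN_MS)); // "primary"
console.log(whoToNotify(openAlert, 15 * MIN_MS)); // "secondary"
console.log(whoToNotify({ triggeredAt: 0, acknowledgedAt: 3 * MIN_MS }, 20 * MIN_MS)); // null
```

Keeping the timeout in one named constant makes it easy to tune per organization or per severity.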

Page 29: Practical solutions to detecting bugs

How to respond to the issue

Page 30: Practical solutions to detecting bugs

How to respond to the issue

Alerts need to be investigated to determine how urgently they need to be addressed.

For critical issues, the on-call should be empowered to reach out and involve the owner of the code causing the issue, even if it’s after hours.

At Quartet, alerts create tickets automatically. For all issues, the on-call makes sure tickets are assigned to the right team.

Page 31: Practical solutions to detecting bugs

How to respond to the issue

You need to define a process for marking issues resolved that makes sense for your organizational model.

It’s helpful if there’s a link to a handbook in the reported issue. The handbook should contain steps for how to investigate and possibly resolve the issue.

Page 32: Practical solutions to detecting bugs

How to respond to the issue

If it’s an employee or customer filing the issue, there needs to be an established process for communicating back to them, e.g., an internal email or involving customer service.

Depending on the SLA, this needs to happen within a set timeframe.

Page 33: Practical solutions to detecting bugs

On-call best practices

Page 34: Practical solutions to detecting bugs

Establish process and guidelines

At Quartet, we have a doc that details our process and best practices. New employees shadow the primary on-calls for a month before they get added to the rotation.

Page 35: Practical solutions to detecting bugs

Example guidelines

• Each engineering team member will have the PagerDuty app installed on their phone.

• If your PagerDuty schedule overlaps with planned vacation, arrange a schedule override in PagerDuty.

• Each team is responsible for creating PagerDuty alerts for the services that they are responsible for.

Page 36: Practical solutions to detecting bugs

Ensure information is maintained

To ensure continuity from one on-call to the next, there should be a hand-off meeting.

The outgoing on-call is responsible for walking the next on-call through the weekly report for the previous week.

Page 37: Practical solutions to detecting bugs

Evolve

How did we get here?

Stop the noise - if an error happens over and over again, dedupe it. Investigate why.

Downgrade - is the error actually an error, or should we measure it via a metric?
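The "stop the noise" step can be sketched as collapsing repeated alerts into one record with a count. Treating two alerts as the same when their tag and message match is an assumption, as are the field names and the `dedupe` helper. Real tools often use a configurable fingerprint instead.

```javascript
// Hypothetical sketch: collapse repeated alerts into one entry per
// (tag, message) pair, keeping a count and first/last seen times.
function dedupe(alerts) {
  const byKey = new Map();
  for (const alert of alerts) {
    const key = `${alert.tag}|${alert.message}`;
    const existing = byKey.get(key);
    if (existing) {
      existing.count += 1;
      existing.firstSeen = Math.min(existing.firstSeen, alert.timestamp);
      existing.lastSeen = Math.max(existing.lastSeen, alert.timestamp);
    } else {
      byKey.set(key, {
        tag: alert.tag,
        message: alert.message,
        count: 1,
        firstSeen: alert.timestamp,
        lastSeen: alert.timestamp,
      });
    }
  }
  return [...byKey.values()];
}

const deduped = dedupe([
  { tag: 'patient.hydrate-failed', message: 'Failed to hydrate patient', timestamp: 100 },
  { tag: 'auth.login-failed', message: 'Login failed', timestamp: 150 },
  { tag: 'patient.hydrate-failed', message: 'Failed to hydrate patient', timestamp: 200 },
]);
console.log(deduped.length); // 2
```

The count is what makes the follow-up question answerable: a record that shows hundreds of repeats is a candidate either for a root-cause fix or for downgrading to a metric.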

Page 38: Practical solutions to detecting bugs

On-call

On-call dedicates 100% of their time to investigating bugs. This makes sense where we’re at: shipping a lot of code. More code naturally generates more bugs.

Page 39: Practical solutions to detecting bugs

Tools

Tools ❤ Process

Tooling alone will not solve your issues; you need a process for how to use the tools.

“If you only have a hammer, everything looks like a nail”

Page 40: Practical solutions to detecting bugs

Thanks