system insight without interference

Post on 11-May-2015

DESCRIPTION

Talk at Wordnik HQ about how to monitor application performance and business goals without intrusive engineering work on your core product.

TRANSCRIPT

Insight without Interference

Monitoring with Scala, Swagger, MongoDB and Wordnik OSS

Tony Tam

@fehguy

Nagios Dashboard

Monitoring?

IT Ops 101

Host Checks

System

Load

Disk Space

Network

Monitoring?

Necessary (but insufficient)

Why Insufficient?

•What about Services?

• Database running?

• HTTP traffic?

•Install Munin Node!

• Some (good) service-level insight

Your boss LOVES charts

“OH pretty colors!”

“up and to the right!”

“it MUST be important!”

Good vs. Bad?

•Database calls avg 1ms?

• Great! DB working well

• But called 1M times per page load/user?

•Most tools are for system, not your app

•By the time you know, it’s too late

Need business metrics monitoring!

Enter APM

•Application Performance Monitoring

•Many flavors, degrees of integration

• Heavy: transaction monitoring, code performance, heap, memory analysis

• Medium: home-grown profiling

• Light: digest your logs (failure forensics)

•What you need depends on architecture, business + technology stage

APM @ Wordnik

•Micro Services make the System

Monolithic application

API Calls are the unit of work!

Monitoring API Calls

•Every API must be profiled

•Other logic as needed

• Database calls

• Connection manager

• etc...

•Anything that might matter!

How?

•Wordnik-OSS Profiler for Scala

• Apache 2.0 License, available in Maven Central

•Profiling an arbitrary code block:

import com.wordnik.util.perf.Profile

Profile("create a cat", {/* do something */})

•Profiling an API call:

Profile("/store/purchase", {/* do something */})
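To make the idea of wrapping a block concrete, here is a minimal sketch of what a named timing wrapper with simple aggregation might look like. It is not the Wordnik-OSS profiler: `MiniProfile`, `Counter` and `snapshot` are hypothetical names chosen for illustration, and the curried call shape is an assumption rather than the library's API.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a profiler; NOT the Wordnik-OSS implementation.
final case class Counter(name: String, count: Long, totalMillis: Long) {
  def avgMillis: Double = if (count == 0) 0.0 else totalMillis.toDouble / count
}

object MiniProfile {
  private val counters = mutable.Map.empty[String, Counter]

  // Runs `block`, records its wall-clock duration under `name`, and returns its result.
  def apply[T](name: String)(block: => T): T = {
    val start = System.nanoTime()
    try block
    finally {
      val elapsedMillis = (System.nanoTime() - start) / 1000000L
      synchronized {
        val c = counters.getOrElse(name, Counter(name, 0L, 0L))
        counters(name) = c.copy(count = c.count + 1, totalMillis = c.totalMillis + elapsedMillis)
      }
    }
  }

  // Immutable copy of the aggregated counters, safe to print or serialize.
  def snapshot: Map[String, Counter] = synchronized(counters.toMap)
}

object MiniProfileDemo {
  def main(args: Array[String]): Unit = {
    MiniProfile("create a cat") { Thread.sleep(5) }
    val total = MiniProfile("/store/purchase") { 40 + 2 }
    println(s"result=$total, counters=${MiniProfile.snapshot}")
  }
}
```

The duration is recorded in a `finally` block, so a call that throws is still counted.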

Profiler gives you…

•Nearly free*** tracking

•Simple aggregation

•Trigger mechanism

• Actions on time spent “doing things”:

Profile.triggers += new Function1[ProfileCounter, Unit] {
  def apply(counter: ProfileCounter): Unit = {
    // Escalate when a "getDb" call exceeds the 5000 threshold
    if (counter.name == "getDb" && counter.duration > 5000)
      wakeUpSysAdminAndEveryoneWhoCanFixShit(Urgency.NOW)
  }
}
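The trigger mechanism can be pictured as a list of callbacks that is run after every measurement. The sketch below shows only that shape; `Triggers`, `TimedCall`, `register` and `fire` are hypothetical names, and the millisecond unit is an assumption.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical names throughout; this mirrors only the shape of the trigger on the slide.
final case class TimedCall(name: String, durationMillis: Long)

object Triggers {
  private val triggers = ListBuffer.empty[TimedCall => Unit]

  def register(trigger: TimedCall => Unit): Unit = synchronized {
    triggers += trigger
    ()
  }

  // Called once per completed measurement; each trigger decides whether to act.
  def fire(call: TimedCall): Unit = {
    val current = synchronized(triggers.toList)
    current.foreach(t => t(call))
  }
}

object TriggerDemo {
  def main(args: Array[String]): Unit = {
    val alerts = ListBuffer.empty[String]
    Triggers.register { c =>
      if (c.name == "getDb" && c.durationMillis > 5000)
        alerts += s"slow ${c.name}: ${c.durationMillis} ms"
    }
    Triggers.fire(TimedCall("getDb", 120))  // below the threshold, ignored
    Triggers.fire(TimedCall("getDb", 7500)) // above the threshold, alert recorded
    println(alerts.mkString("\n"))
  }
}
```

Triggers run on the calling thread here, so a slow trigger would slow the profiled code. A real setup would likely hand alerts off to a queue.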

This is intrusive on your codebase

Accessing Profile Data

•Easy to get in code:

ProfileScreenPrinter.dump

•Output where you want:

logger.info(ProfileScreenPrinter.toString)

•Send to logs, email, etc.

Accessing Profile Data

•Easier to get via API with Swagger-JAXRS

import com.wordnik.resource.util

@Path("/activity.json")
@Api("/activity")
@Produces(Array("application/json"))
class ProfileResource extends ProfileTrait

Inspect without bugging devs!

Is Aggregate Data Enough?

•Probably not

•Not Actionable

• Have calls increased? Decreased?

• Faster response? Slower?

Make it Actionable

•“In a 3 hour window, I expect 300,000 views per server”

• Poll & persist the counters

• Example: Log page views, every min

{
  "_id" : "web1-word-page-view-20120625151812",
  "host" : "web1",
  "count" : 627172,
  "timestamp" : NumberLong("1340637492247")
},
{
  "_id" : "web1-word-page-view-20120625151912",
  "host" : "web1",
  "count" : 627372,
  "timestamp" : NumberLong("1340637552778")
}
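Because the persisted counts are cumulative, the useful number is the difference between consecutive samples. The sketch below shows one way to derive it, using the two sample records from the slide; `CounterSample` and `CounterDeltas` are hypothetical names, not part of any Wordnik library.

```scala
// Hypothetical record type matching the fields of the persisted samples.
final case class CounterSample(host: String, count: Long, timestamp: Long)

object CounterDeltas {
  // Turns cumulative counts into (elapsedMillis, delta) pairs between consecutive samples.
  def deltas(samples: Seq[CounterSample]): Seq[(Long, Long)] = {
    val sorted = samples.sortBy(_.timestamp)
    sorted.zip(sorted.drop(1)).map { case (a, b) =>
      (b.timestamp - a.timestamp, b.count - a.count)
    }
  }

  // Total events observed between the first and last sample in the window.
  def totalInWindow(samples: Seq[CounterSample]): Long =
    deltas(samples).map(_._2).sum
}

object CounterDeltasDemo {
  def main(args: Array[String]): Unit = {
    val samples = Seq(
      CounterSample("web1", 627172L, 1340637492247L),
      CounterSample("web1", 627372L, 1340637552778L)
    )
    println(CounterDeltas.deltas(samples))        // one (elapsedMillis, delta) pair
    println(CounterDeltas.totalInWindow(samples)) // views between the two samples
  }
}
```

A process restart resets a cumulative counter and would show up here as a negative delta, so a production version would need to detect and skip that case.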

Your boss LOVES charts

That’s not Actionable!

•But it’s pretty

What’s missing?

APIs to track?

Low + High Watermarks

Custom Time window

Too much custom Engineering

Call to Action!

Make it Actionable

•Swagger + a tiny bit of engineering

• Let your *product* people create monitors, set goals

•A Check: specific API call mapped to a service function

{
  "name": "word-page-view",
  "path": "/word/*/wordView (post)",
  "checkInterval": 60,
  "healthSpan": 300,
  "minCount": 300,
  "maxCount": 100000
}
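As a rough sketch of how such a check could be evaluated, assume `minCount` and `maxCount` are low and high watermarks on the number of calls seen during the last `healthSpan` seconds. That reading of the fields is an assumption, and `CheckEvaluator`, `Health` and friends are hypothetical names.

```scala
// Field names follow the JSON on the slide; their exact semantics are assumed.
final case class Check(name: String, path: String, checkInterval: Int,
                       healthSpan: Int, minCount: Long, maxCount: Long)

sealed trait Health
case object Healthy extends Health
final case class Unhealthy(reason: String) extends Health

object CheckEvaluator {
  // `observed` is the number of calls seen during the last `healthSpan` seconds.
  def evaluate(check: Check, observed: Long): Health =
    if (observed < check.minCount)
      Unhealthy(s"${check.name}: $observed calls is below the low watermark ${check.minCount}")
    else if (observed > check.maxCount)
      Unhealthy(s"${check.name}: $observed calls is above the high watermark ${check.maxCount}")
    else Healthy
}

object CheckEvaluatorDemo {
  def main(args: Array[String]): Unit = {
    val check = Check("word-page-view", "/word/*/wordView (post)", 60, 300, 300L, 100000L)
    println(CheckEvaluator.evaluate(check, 200L))
    println(CheckEvaluator.evaluate(check, 5000L))
  }
}
```

Keeping the thresholds in data rather than in code is what lets product people adjust them without a deploy.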

Make it Actionable

•A Service Type: a collection of checks which make a functional unit

{
  "name": "www-api",
  "checks": [
    "word-of-the-day",
    "word-page-view",
    "word-definitions",
    "user-login",
    "api-account-signup",
    "api-account-activated"
  ]
}

Make it Actionable

•A Host: “directions” to get to the checks

{
  "host": "ip-10-132-43-114",
  "path": "/v4/health.json/profile?api_key=XYZ",
  "serviceType": "www-api"
},
{
  "host": "ip-10-130-134-82",
  "path": "/v4/health.json/profile?api_key=XYZ",
  "serviceType": "www-api"
}

Make it Actionable

•And finally, a simple GUI

Make it Actionable

•Point Nagios at this!

serviceHealth.json/status/www-api?explodeOnFailure=true

•Get a 500, get an alert
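The "500 means alert" idea reduces to a small mapping from check results to an HTTP status. In the sketch below, the behavior of `explodeOnFailure` is an assumption inferred from its name (fail the request when any check fails), and `ServiceStatus` is a hypothetical helper, not the actual endpoint code.

```scala
object ServiceStatus {
  // Names of failing checks, sorted for stable output.
  def failing(results: Map[String, Boolean]): List[String] =
    results.collect { case (name, false) => name }.toList.sorted

  // Assumed semantics: with explodeOnFailure set, any failing check yields a 500
  // so that an HTTP-based monitor raises an alert; otherwise the endpoint returns 200.
  def statusCode(results: Map[String, Boolean], explodeOnFailure: Boolean): Int =
    if (explodeOnFailure && failing(results).nonEmpty) 500 else 200
}

object ServiceStatusDemo {
  def main(args: Array[String]): Unit = {
    val results = Map("word-page-view" -> true, "user-login" -> false)
    println(ServiceStatus.statusCode(results, explodeOnFailure = true))
    println(ServiceStatus.failing(results))
  }
}
```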

Metrics from Product

Based on YOUR app

Treat like system failure

Make it Actionable

Is this Enough?

System monitoring

Aggregate monitoring

Windowed monitoring

Object monitoring?

• Action on a specific event/object

Why!?

Object-level Actions

•Any back-end engineer can build this

• But shouldn’t

•ETL to a cube?

•Run BI queries against production?

•Best way to “siphon” data from production w/o intrusive engineering?

Avoiding Code Invasion

•We use MongoDB everywhere

•We use > 1 server wherever we use MongoDB

•We have an opLog record against everything we do

What is the OpLog

•All participating members have one

•Capped collection of all write ops

[Diagram: oplog entries on the primary and two replicas along a time axis (t1, t2, t3)]

So What?

•It’s a “pseudo-durable global topic message bus” (PDGTMB)

• WTF?

•All DB transactions in there

•It’s persistent (cyclic collection)

•It’s fast (as fast as your writes)

•It’s non-blocking

•It’s easily accessible
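To illustrate why a capped, cyclic log can act like a message bus, here is a toy in-memory model: writers append, the oldest entries fall off once the cap is reached, and a reader "tails" the log by remembering the last sequence number it saw. `CappedLog` is a hypothetical teaching aid and says nothing about how MongoDB implements the oplog internally.

```scala
import scala.collection.mutable

// Toy model of a capped, append-only log; not MongoDB's implementation.
final class CappedLog[A](capacity: Int) {
  require(capacity > 0, "capacity must be positive")
  private val entries = mutable.Queue.empty[(Long, A)]
  private var nextSeq = 0L

  // Appends an entry, evicting the oldest one when the log is full. Returns its sequence number.
  def append(a: A): Long = synchronized {
    if (entries.size == capacity) entries.dequeue()
    val seq = nextSeq
    entries.enqueue((seq, a))
    nextSeq += 1
    seq
  }

  // Entries newer than `afterSeq` that are still retained, oldest first.
  def readAfter(afterSeq: Long): List[(Long, A)] = synchronized {
    entries.filter(_._1 > afterSeq).toList
  }
}

object CappedLogDemo {
  def main(args: Array[String]): Unit = {
    val log = new CappedLog[String](2)
    log.append("insert fred")
    log.append("insert bill")
    log.append("insert tom") // evicts "insert fred"
    println(log.readAfter(-1L))
  }
}
```

The eviction is also the caveat: a reader that falls too far behind silently misses entries, which is why the slide calls it only "pseudo-durable".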

More about this

{
  "ts" : { "t" : 1340948921000, "i" : 1 },
  "h" : NumberLong("5674919573577531409"),
  "op" : "i",
  "ns" : "test.animals",
  "o" : { "_id" : "fred", "type" : "cat" }
},
{
  "ts" : { "t" : 1340948935000, "i" : 1 },
  "h" : NumberLong("7701120461899338740"),
  "op" : "i",
  "ns" : "test.animals",
  "o" : { "_id" : "bill", "type" : "rat" }
}

Tapping into the Oplog

•Made easy for you!

https://github.com/wordnik/wordnik-oss

Snapshots

Replication

Incremental Backup

Same Technique!

Tapping into the Oplog

•Create an OpLogProcessor

class OpLogReader extends OplogRecordProcessor {
  val recordTriggers = new HashSet[Function1[BasicDBObject, Unit]]

  @throws(classOf[Exception])
  def processRecord(dbo: BasicDBObject) = {
    recordTriggers.foreach(t => t(dbo))
  }

  @throws(classOf[IOException])
  def close(string: String) = {}
}

Tapping into the Oplog

•Attach it to an OplogTailThread

val util = new OpLogReader
val coll: DBCollection = (MongoDBConnectionManager.getOplog("oplog", "localhost", None, None)).get
val tailThread = new OplogTailThread(util, coll)
tailThread.start

Tapping into the Oplog

•Add some observer functions

util.recordTriggers += new Function1[BasicDBObject, Unit] {
  def apply(e: BasicDBObject): Unit = Profile("inspectObject", {
    totalExamined += 1
    /* do something here */
  })
}
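The observer pattern can be tried out without a database by modelling an oplog entry as a plain `Map`. The sketch below registers observers per namespace and passes them the inserted document from the `"o"` field. `OplogDispatch` and its methods are hypothetical names, and plain maps stand in for `BasicDBObject`.

```scala
import scala.collection.mutable

// Hypothetical dispatcher; plain Maps stand in for the driver's BasicDBObject.
object OplogDispatch {
  type Record = Map[String, Any]
  private val handlers = mutable.Map.empty[String, List[Record => Unit]]

  // Registers an observer for inserts into one namespace, e.g. "test.animals".
  def onInsert(ns: String)(handler: Record => Unit): Unit = synchronized {
    handlers(ns) = handler :: handlers.getOrElse(ns, Nil)
  }

  // Routes one oplog entry to the observers for its namespace.
  // Returns the number of observers invoked.
  def process(entry: Record): Int = {
    val targets = synchronized {
      (entry.get("op"), entry.get("ns")) match {
        case (Some("i"), Some(ns: String)) => handlers.getOrElse(ns, Nil)
        case _                             => Nil
      }
    }
    entry.get("o") match {
      case Some(doc: Map[_, _]) =>
        val typed = doc.asInstanceOf[Record]
        targets.foreach(h => h(typed))
        targets.size
      case _ => 0
    }
  }
}

object OplogDispatchDemo {
  def main(args: Array[String]): Unit = {
    OplogDispatch.onInsert("test.animals")(doc => println(s"new animal: ${doc("_id")}"))
    val entry: OplogDispatch.Record = Map(
      "op" -> "i", "ns" -> "test.animals",
      "o" -> Map("_id" -> "fred", "type" -> "cat"))
    OplogDispatch.process(entry)
  }
}
```

Entries for other namespaces or other operation types are dropped early, which keeps the per-write cost low.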

/* do something here */

•Like?

•Convert to business objects and act!

• OpLog to domain object is EASY

• Just process the ns that you care about

"ns" : "test.animals"

•How?

Converting OpLog to Object

•Jackson makes this trivial

case class User(username: String, email: String, createdAt: Date)

val user = jacksonMapper.convertValue( dbo.get("o").asInstanceOf[DBObject], classOf[User])

•Reuse your DAOs? Bonus points!

•Got your objects!

Now What?

“o” is for “Object”

Use Case 1: Alert on Action

•New account!

obj match {
  case newAccount: UserAccount => { /* ring the bell! */ }
  case _ => { /* ignore it */ }
}

Use case 2: What’s Trending?

•Real-time activity

case o: VisitLog =>
  Profile("ActivityMonitor:processVisit", {
    wordTracker.add(o.word)
  })
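The slide does not show what `wordTracker` does internally. One plausible minimal version is a thread-safe counter with a top-N query, sketched below; the `WordTracker` class and its `top` method are assumptions for illustration only.

```scala
import scala.collection.mutable

// Hypothetical tracker; the real wordTracker behind the slide is not shown in the talk.
final class WordTracker {
  private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)

  // Records one visit to `word`.
  def add(word: String): Unit = synchronized {
    counts(word) += 1
  }

  // The n most-visited words, ties broken alphabetically.
  def top(n: Int): List[(String, Long)] = synchronized {
    counts.toList.sortBy { case (word, count) => (-count, word) }.take(n)
  }
}

object WordTrackerDemo {
  def main(args: Array[String]): Unit = {
    val tracker = new WordTracker
    Seq("cat", "dog", "cat", "rat", "dog", "cat").foreach(tracker.add)
    println(tracker.top(2))
  }
}
```

This version counts since startup. A real "trending" view would also need a time window or decay so that old activity ages out.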

Use case 3: External Analytics

case o: UserProfile => {
  getSqlDatabase().executeSql(
    "insert into user_profile values(?,?,?)",
    o.username, o.email, o.createdAt)
}

Don’t mix runtime & OLAP!

Your Data pushes to Relational!

Use case 4: Cloud analysis

case o: NewUserAccount => {
  getSalesforceConnector().create(
    Lead(Account.ID, o.firstName, o.lastName,
      o.company, o.email, o.phone))
}

We didn’t interrupt core engineering!

Pushed directly to Salesforce!

Examples

Polling profile APIs cross cluster

Siphoning hashtags from opLog

Page view activity from opLog

Health check w/o engineering

Summary

•Don’t mix up monitoring servers & your application

•Leave core engineering alone

•Make a tiny engineering investment now

•Let your product folks set metrics

•FOSS tools are available (and well tested!)

•The opLog is incredibly powerful

• Hack it!

Find out more

•Wordnik: developer.wordnik.com

•Swagger: swagger.wordnik.com

•Wordnik OSS: github.com/wordnik/wordnik-oss

•Atmosphere: github.com/Atmosphere/atmosphere

•MongoDB: www.mongodb.org
