
5 lines I couldn’t draw

Hi everybody

I’m Dave
dave@librato.com
@davejosephsen

github: djosephsen

I’m Dave and I work on the Ops team at Librato. In fact you’ve caught me in a bit of a transitional period, because I’ve recently decided to move back to ops after spending two years as Librato’s developer evangelist, which has been a fascinating role that’s given me the opportunity to really branch out and learn many new things.

Thought Leader Power Moves

For example, here’s a shot of me doing an all-male panel, which is just one of the many thought-leader power moves I’ve perfected over the last couple years in my former role as developer evangelist.

Thought Leader Power Moves

I can also do that faux-earnest, touching my fingers together while wearing a blazer and pontificating at you thing. That’s totally in my repertoire, so if you’re ever in doubt about the validity of my argument I can touch my fingers together and become super reasonable looking.

• “Resource”: Human being who works here

• “Lead”: Human being who doesn’t work here

• “Content Marketing”: Annoying people on Twitter

• “Engagement”: Tricking people into talking to you

I’ve penetrated the marketing team’s vernacular, so yeah.. that took two years, but I can definitely sit in those meetings now and.. I mean I pretty much know mostly what’s going on I think.

Thought Leader Power Moves

git push -f origin master!

I can force-push master from a vendor booth. That was a huge achievement get in this role. By the way, I’ve heard if you force-push master from an all-male panel they have to make you CTO. Pro tip. But yeah, the vendor booth is like a second home to me now. I’ve done quite a bit of vendorboothing over the last few years.

Graphing Nagios

Vendorboothing

Here we are at Twilio Signal a few months ago. And something happened at Signal I want to tell you about, because it happens a lot in the course of my vendorboothing endeavors, where

Vendorboothing

I’ll be there in the booth, you know, my fingernails lightly resting against one another. And I’ll be engaging with leads, and filling our funnel top with branding, and this neckbeardy dude

will kind of slide up, and lurk. He’ll just stand there glaring at me as I do my booth dance, impossible to ignore. He’s like, heavy. I mean not literally heavy, I’m not body-shaming him, I just mean he’s like… laden with discontent, you know?

Graphing Nagios

BLARG ANOMALY DETECTION!!!

and after listening for a while he’ll just spontaneously interrupt by blurting out something like “WHAT ABOUT ANOMALY DETECTION!” And by the way, in every case this has been what my wife calls a “catastrophic digression”,

Graphing Nagios

like this interruption will be so awkward and so rude that it will send people scurrying away from the booth out of either embarrassment or genuine fear for their own safety. And I’ll want to run away too, but not running away from incredulous neckbeards is another skill I’ve developed, so instead

Graphing Nagios

I’ll don my thought-leader cap and be like “what an informed and thought-provoking question. Let me ask you, what kind of monitoring tools are you using today?”

and inevitably, he’ll launch into this apocalyptic tale of woe, wherein he will describe the Faustian hell in which he is currently trapped. Like, his company’s product

Perl!

is, I’m not making this up, a bunch of Perl scripts that

OTHER PEOPLE’S Perl!

some consultant wrote back in 2006, and for monitoring they have these

MORE Perl!

other Perl scripts that a different consultant wrote in 2008, and those Perl scripts are watching those first Perl scripts

WINDOWS!!

And by the way they all run on Windows, so it’s not just Perl

ACTIVESTATE PERL!

but ActiveState Perl, and these ActiveState Perl scripts

EMAILZ!

send emails when unexpected things happen or error states are detected

MAPI EMAILZ!!!

and they’re using… MAPI on Windows XP to do this,

THOUSANDS OF MAPI EMAILZ!!!

And of course they’re sending 5000 emails per day, because unexpected state is basically always happening, and of course nobody is paying any attention to the emails.

THOUSANDS OF MAPI EMAILZ!!!

and usually by this point he feels awful and I feel awful, so I’ll say something like “wow, that sounds awful, it must be a really frustrating and stressful environment for you, like I’m sorry to hear it.” And he’ll be all

Graphing Nagios

it’s fine… and I’ll be like, man, you sure? Because that sounds really bad, like I sincerely want to just hug you right now. Would you like a hug?

Graphing Nagios

And he’ll be like no really it’s fine, I don’t need a AAAHHH oh my god hug me. And we’ll hug it out and he’ll cry a little bit, and then I’ll be like. You know what you should do? You should vi that .qmail-root file mister neckbeard. You know?

| cat > /dev/null && echo 'emailz,1,g' | /usr/bin/statsd

Throw something like this in there. Just dev null those messages and count them instead. I’m not saying that’s your permanent solution or anything

Graphing Nagios

but I mean you don’t even know what that signal LOOKS like. You might be able to alert on simple message volume, or the derivative, or maybe just seeing that line will TEACH you something about the behavior or rhythm of your system.
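(A minimal sketch of that idea, assuming a local statsd daemon on its usual UDP port: a tiny script that a .qmail-root line pipes each message into, which throws the body away and bumps a counter. The script path and metric name are made up for illustration, not the exact snippet from the slide.)

#!/usr/bin/env python3
"""Discard an incoming alert email and count it in statsd instead.

A .qmail-root line like `| /usr/local/bin/count_emailz` would hand each
message to this script on stdin. Host, port, and metric name are assumptions.
"""
import socket
import sys

STATSD_HOST = "127.0.0.1"   # assumed local statsd daemon
STATSD_PORT = 8125          # statsd's conventional UDP port

def main():
    sys.stdin.read()        # consume the message body: this is our /dev/null
    # statsd's wire format is "<name>:<value>|<type>"; "c" marks a counter
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b"emailz:1|c", (STATSD_HOST, STATSD_PORT))

if __name__ == "__main__":
    main()

Every email becomes one tick on an "emailz" counter; that counter is the line.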

Ya think?

And he’ll be like, “fa foa foa… You think?”

DO EEET!

And I’ll be like heck yeah, mister neckbeard. Go giterdone, and he’ll leave happy and with a lighter heart. It’s a funny story. But it’s also a true story. I’ve had this conversation at least 20 times. And today, surrounded by all of you and my Librato teamies, it’s easy to forget, but if I’m being honest?

This was me one job ago

I was this neckbeard. This was me. Not a long time ago either, but like one job ago, I was him. And I more or less knew everything about monitoring that I know now, and yet I couldn’t…

Graphing Nagios

#1

draw this line, any more than he could. And I can hear you like, really Dave, emails per second? I mean yes, of course I could draw a line of emails per second, but then so can he. What neither of us could do is make that cognitive leap of applying monitoring tools as a means of understanding system behavior independent of alerting. So this is line number one I couldn’t draw.

Spoiler alert:

There will be 4 more

And I come across these pretty commonly these days, where I’ll see some data that we’re using internally and be like, man… I never could have drawn that one job ago. And I just kind of thought it’d be interesting to explore the reasons why. Live on stage in front of a room full of visionaries.

I was carrying a misapprehension about what monitoring was and

whom it was for

Because in every case the problem wasn’t a lack of technical aptitude, it was wrapped up in my beliefs and expectations. In this particular case, the thing that was causing our cognitive dissonance is that those Perl scripts are sending alerts already. So to us, to mister neckbeard and me, it seemed like monitoring was already in place. Box checked. Monitoring done.

Because to us, monitoring was the thing that made alerts happen. We had no means of describing an undertaking called monitoring that was meaningfully discrete from alerting. That’s why monitoring is an ops thing. It’s about uptime. So obviously ops owns it. So as an ops person, the pattern was that we would install or inherit this thing

this, guard-dog-like entity. I would tell it where to sit, and it would bark whenever it felt like something wasn’t right.

And it was great at barking. It’d bark all day and night, bark bark bark bark bark. And that sucked. So I thought, maybe I should train it. So I’d put in a ticket to train it

Squirrel! ZOMG SQUIRREL!!! It’s RIGHT THERE!

YOUGUISE? SQUIRRELL!

but I’d never work the ticket, because it was literally nobody’s priority but mine. I’d feel guilty if I worked on it because it seemed self-indulgent, or I’d feel guilty if I didn’t, because the dog’s sitting over there barking at squirrels all night and it’s just embarrassing

But what about squirrels?

and maybe eventually I’d get some time to take it aside and be like, listen. Thread contention? Bark. Ice cream trucks? NO BARK. But no matter what I told it, it was never something that could help me out with stuff like those Perl emails, because

Oh sure. Blame the puppies Dave

ultimately our relationship was prescriptive in the wrong direction. The dog was always telling me what I was interested in, so the best we could ever hope to achieve was this ongoing negotiation about what to bark at and how much barking was enough barking. It always became about the barking.

Monitoring is not FOR alerting

Here are two important things I no longer believe. These are the things that I think make me different from mister neckbeard. First, I don’t believe monitoring is for alerting. It’s not about uptime.

Nobody OWNS monitoring

Next, monitoring is neither my responsibility nor is it my burden. It’s not mine, not my pet. It’s a tape measure. It’s a tape measure that I get to share with every engineer I work with.

Ops owns Monitoring

Everyone owns Monitoring

I think if you’ve been listening to the real speakers, basically all the ones who aren’t me, you’ll find that this is probably the most important underlying belief that differentiates them from mister neckbeard. People who run effective monitoring infrastructure

believe we all get to ask questions. We all get to measure things. Not just ops, not just dev, not just DBA, everybody who cares, gets to measure, and we all get to use the same tape measure, and it’s perfectly reasonable to expect to get accurate, timely answers to our questions, and that’s what MONITORING is for.

Monitoring is FOR asking questions

It’s the infrastructure that makes it possible for everyone to understand system behavior. That’s what monitoring is for me now, today. That’s how it works. Not because I installed or bought some particular collection of tools or learned about percentiles, but exactly because my expectations have changed. Does that make sense?

What If I Told You:

You might be fascinated with anomaly detection

because your input signal sucks?

And Mark was right when he said tools matter, but the tools are there today and yet many still suffer. Like, you can build the metrics infrastructure you need right now; that’s hard, and expensive, but possible. Or you can buy it, and that’s easier, and expensive, but possible. But to make either of those actually work, you still need to change the people. That’s a lot harder. No combination of bleeding-edge tools, no amount of fancy anomaly detection is going to save mister neckbeard until he understands that monitoring is not for alerting, and that measuring things is everyone’s job.

Complexity Isolates

And Mr Neckbeard has another problem too. He’s embracing complexity. See? To him, those emails are his burden to bear, his lot in life. They are the hand grenade upon which he will jump to save us all. And that belief isolates him. It mires him in complexity, and he believes that’s fine. Let me show you what I mean..

#2

Here’s line number two that I couldn’t draw. And I can hear you saying Dave, that isn’t even a line. Like, you literally had one job dude and the first line was disappointing and this isn’t even a line. But the line you aren’t seeing here is actually a REALLY important … lack of a line.

because what you aren’t seeing here is the number of people currently using the Librato API who are being throttled. So, slight digression for context: this is a pretty common problem when new users are wiring us up for the first time. And what things do they send?

All the things! Cheslock knows all about that, you can ask him. So rather than surprise people with a million-dollar bill, we catch unlikely new ingest with throttling, and we’ll shoot an email or whatever that’s like, hey, um, maybe dial that back unless you actually want to pay us the GDP of Uruguay every month.

and a whole bunch of metrics like this are physically mounted to the wall next to our support team, because they’re the ones who are going to be there to help the user understand why they’re suddenly getting HTTP 500s. But the interesting part is that these weren’t created for support. These metrics, in fact, were originally put in place by the engineer who implemented throttling, to understand what that signal looked like.
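(I don’t have that engineer’s actual instrumentation to show, but the pattern is small enough to sketch. Assuming a statsd-style counter and made-up metric names, the throttling path just counts every request it rejects, and that counter is the line that ends up on the wall; the toy handler below is purely illustrative, not Librato’s real ingest code.)

import socket

STATSD_ADDR = ("127.0.0.1", 8125)   # assumed local statsd endpoint
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def record_throttle(account_id):
    """Count one throttled ingest request, in aggregate and per account."""
    for metric in ("api.ingest.throttled", "api.ingest.throttled.%s" % account_id):
        _sock.sendto(("%s:1|c" % metric).encode(), STATSD_ADDR)

def handle_ingest(account_id, measurements, plan_limit):
    """Toy ingest handler: throttle submissions that blow past the plan limit."""
    if len(measurements) > plan_limit:
        record_throttle(account_id)     # the signal support watches on the wall
        return 429                      # ask the client to back off
    return 200                          # accept the payload (storage omitted)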

And this made me wonder, like, how did this happen? How did first-level support begin repurposing API metrics?

With whom shall I share my bounty of hard-won metric data?

Was there some cuddly API engineer who, in a spontaneous bout of altruism, went to go

devops unicorn cuddle with the support team and make rainbow metrics babies of team spirit? That’s amazing, I want to meet these API engineers, so I went to go talk to them and

they seemed like typical software engineers who exhibit the typical demeanor and mannerisms that one expects software engineers to manifest. So yeah, long story short, I think what’s happening here

Cynefin

is something like emergent cynefin.

and if you’re not familiar, Cynefin is a framework that helps us make sense of complexity

The idea is that you categorize the complexity you’re dealing with, and then you attempt to move from whatever category you’re in, to the next less complex category until you hit

obvious right here, which you can see is sort of an uphill climb

Things you need to move:

• Control
• Understanding
• Standardization

but my time at Librato has convinced me that Cynefin can be an emergent property of effective monitoring systems. By which I mean effective monitoring just sort of organically provides you a lot of the stuff you need to move toward obvious.

And I can hear you like, really, Dave? Emergent Cynefin properties? You should have stuck with unicorn babies of team spirit. Like, do you even know what you sound like when you say shit like that? Which, yes, I hear myself

Cynefin (no, for reals tho)

I mean the processes line up pretty well. What you need to climb the cynefin ladder is pretty much what you give a decent telemetry system to people who understand that their job is to measure things.

And this is a perfect example. Our support team was able to move from a very opaque and chaotic form of complexity straight to obvious by repurposing monitoring data from another team, and today my friend Nik on the support team can look up at that very real, very physical wall and say

“Behold. Throttled users!”, and go talk to them about it. That’s a first-level support team that organically understands the concept of HTTP return codes, and service-oriented architecture, and API backpressure. Nobody wrote them a manual for that. That’s Cynefin at work.

I could never draw lines like this… not-line. And the reason I couldn’t draw this not-line was because, like mister neckbeard, I thought that embracing complex things like 5000 janky Perl emails was my job. I thought complexity was my lot in life,

Me dissecting somebody’s JavaScript circa 2003

I was totally the guy who would sit down with that incomprehensible bowl of spaghetti code that some mean-spirited consultant wrote in 1997, and I’d be like, I hate not understanding this. “I’m not leaving until this is understood. I’m not going to a meeting, I’m not going to lunch, I’m not going home.” And managers would come looking for me like, did Dave show up today? And my teamies would have to be like

don’t bug him, he’s dissecting some janky code. This is a real picture, this is my boss Steve at IBM Global Services bringing me a sandwich. So for the millennials in the audience, this is actually what team spirit looked like in corporate America in the late ’90s. But then once I had it figured out, I’d become the owner of that janky Perl forever. The only person who ever understood it. And then people would be like

yo Dave, that janky Perl thing is broke again. Right? They’d dump it on me when it broke, and that’s perfectly rational, because why should they crawl down there with me? Why should I want them to?

EXPERIENCE

5000 JANKY EMAILS

Pain hurts, y'all. It’s painful, so embracing it just isolates you, even from other engineers, because there’s only so much pain each of us can endure. We just can’t really go around willy-nilly embracing each other’s pain. It’s just not a tenable scaling model.

But simplicity feels fantastic. Simplicity wants to be shared and celebrated. I should have always been working to reduce complexity instead of just accepting it, but I never realized that my monitoring tools could help simplify things.

I used to think this was about as simple as simple got. I used to make things like this when I understood something. Well, I still do; I’ll draw a really complicated picture of the really complicated thing. I was so close, but I just never took that next step. The one that was like, let’s

#3

Simplify that into something that isn’t painful to understand. This is line 3, and now you’re like, Dave, that’s also not a line, so not only are you one for three on following through with the clickbait listicle title you sold us, line one was super boring, and you, sir, are a lying, deceitful fraud.

#3

So OK, Captain Pedantic, here’s the line. I couldn’t have drawn it because I never had an amazing dashboard like this beneath the lines I would draw

#3

this dashboard is actually a simplification of

This diagram. The curator of those metrics took this diagram and made it obvious by making one row

Row Per SLB

for every SLB in that architecture diagram, which, if you think about it, is an interesting way to simplify your understanding of service ingress, because what does every service have in common? A load balancer.

Latency
Availability
Traffic

so for each load balancer, let’s break down a few golden signals. And these are like, if any blocking outage happens inside any of these services, you’re going to see it in one of these signals, it’s guaranteed. If you don’t see it in one of these signals, then it’s by definition not a blocking problem.
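(To make “a row per SLB” a bit more concrete, here’s a rough sketch of the idea: the same three golden signals stamped out for every load balancer under a predictable naming scheme, so the dashboard is just that mapping rendered as charts, one row per balancer. The metric names and the list of SLBs are assumptions for illustration, not the actual schema behind this dashboard.)

# One row per software load balancer (SLB), three golden signals per row.
# All names below are illustrative assumptions.
GOLDEN_SIGNALS = {
    "latency": "slb.{slb}.latency_ms.p99",        # how slow are responses?
    "availability": "slb.{slb}.success_ratio",    # what fraction of requests succeed?
    "traffic": "slb.{slb}.requests_per_sec",      # how much load is arriving?
}

SLBS = ["api", "ingest", "ui", "alerts"]          # hypothetical balancers

def dashboard_rows():
    """Return one dashboard row per SLB, each holding its three golden-signal metrics."""
    return {
        slb: {name: pattern.format(slb=slb) for name, pattern in GOLDEN_SIGNALS.items()}
        for slb in SLBS
    }

for slb, row in dashboard_rows().items():
    print(slb, row)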

#3

And I want to stress that again, the person who curated this view was not the person who wrote the instrumentation to get this data. Different people, different concerns, and the monitoring system is enabling them to work together to reduce complexity, and aid comprehension.

I never could have drawn that line over this amazing dashboard, I never realized I could use monitoring tools to build bridges to help other people understand the pain I was experiencing.

Everybody gets to measure things

Nobody OWNS monitoring

And then one day I hire into this shop where everybody can measure things, and nobody owns monitoring

and all these people are building all this stuff, and they’re taking measurements as they go

And then other people get a hold of those signals and refine them, and cynefin happens

And bam, suddenly first-level support understands API backpressure. And I’m trippin out like three weeks ago I was trapped in a perpetual

Srsly tho; squirrels. Bark or don’t bark?

tire-fire with this clueless watchdog and NOBODY cared. Like, nobody even KNEW. I was alone with my bij despite being surrounded by other engineers.

#4

I’m sorry, it’s just a stark contrast. Like, check out line number four here. This measures the storage latency introduced by our API matching metric names to UIDs. Point being, this is a subtle latency metric. Like, I can’t describe it to you in fewer than 15 words. But check this out,

<redacted>

An ops guy named Benjo is working with it. That’s pretty crazy, right? I mean where I come from, in the tire-fire, it’s atypical for ops people to work with latency data that describes job execution inside the database. I mean, in the tire-fire this is what we referred to as somebody else’s problem. But OK, maybe Ben’s just a really savvy guy.

<redacted>

But wait, Benjo the ops person is not only wise to this intricate DB latency issue, but he’s correlating it back to system metrics. OK, huh, that seems extraordinarily astute to me, I mean even if you have the domain knowledge

<redacted>

Wait, hold on a sec, he is actually talking to Data Engineering about this? OK: savvy, astute, and brave. Or maybe he just doesn’t know that data engineers are mean

<redacted>

<redacted>

Whoa, what? They’re actually responding and working together with him? And evidently so is the Front-End Team? Like, what in the actual HELL is happening here? How does Ben the ops guy have all this Data Engineering domain knowledge? And why isn’t anyone being mean to him? Where are the passive-aggressive insults? The hostility and mistrust I’ve come to expect from engineers working in other teams? I mean this is dev and ops, this is

This is DOGS AND CATS. LIVING TOGETHER. I’m SHOCKED. It’s Shocking!

#4

I’ve certainly never been able to draw lines like that. I never knew enough about what was going on around me to even work with data like this. I mean, this is a line that literally bridges disciplines.

#4

Look, if you squint, you can almost see the bridge that this line creates, between

#4

Data Engineering and Operations. And again I can’t help but wonder how this happened. Is it possible that effective monitoring can bring about cultural change?

#4

Because it looks to me like that’s what’s happened here. It looks like the combination of an effective telemetry infrastructure and people who understand that measuring things is their job has ultimately changed how people interact with each other in this shop. Good monitoring changes people. That’s kind of mind-blowing.

#5

So, speaking of culture, how much time do I have? OK, good, because this is the good part: line number 5. So this is a funny story about Bryan, one of our integrations engineers, and this happened a few months ago now. And for the record, I publicly apologize in advance to Bryan, I’m sorry dude, if you’re watching, for shaming you like this on the internet, but in my defense… it was pretty funny though.

So what happened was, Bryan was working on making our UI faster. And up top he rolls out a change, and that change? Makes the UI faster. So mission accomplished, good jorb Bryan, you done it. And to be clear, he’s graphed the performance data. I mean job done, homework done, he HAS a graph showing the stuff becoming faster… but what he’s pasted in here

is not that graph. It’s the mouse-over tooltip. Like, he drew the graph, moused over it, took a screenshot of the tooltip of the individual datapoints, for a single polling interval, and then pasted THAT in channel. And for context, not only does Bryan work for a startup whose singular purpose is the drawing of line graphs depicting time-series data, but his boss at the time is literally

Line graphs FTW

the inventor of Graphite AND this conference. Basically he works at line-graph-co for the godfather of line graphs. And he’s basically just walked up to the godfather of line graphs like “Behold my assortment of individual datapoints!” But wait… it gets better…

And then he says… ignore the zeros! omg so amazing.

I mean look… if we ignore the zeros, literally we have two values. Like, Bryan, sit down, I think it’s time we had the data-to-ink ratio talk, bro. Once upon a time there was a man named Tufte…

And I also want to point out the timestamps here, because it’s only a matter of seconds before his team begins expressing their confusion, like, wait… what?

Is he messing with me RN?

Like, I can almost see Dixon at home, head tilted to the side, unsure if this is some kind of elaborate troll. He’s like, maybe everybody got together and agreed to not paste any line graphs in channel for the whole week. Which actually is kind of a brilliant troll and also something we’d totally do, but no, this was all Bryan.

Anyway, then Bryan facepalms… and throws line number 5 in the channel

And I never could have drawn this line, because I’ve never had a team around me that actually cared this much about what I was working on. If I came to someone with some amazing data that I was super proud of they’d look at it like

thatscoolIguessorwhatever

yeah um wow. That’s cool I guess or whatever, Doctor McShowoff. Anywayz pretty busy so please stop being in my cubicle now.

but look at the love here, these people want to geek out on the data with you. They want to celebrate your win. Not just in the Fortune 500 goals-and-gift-cards way, but by actually quantifying the y-axis of your success. They want to comprehend your win so bad they are confused when they lack sufficient data to comprehend your win, which I find astounding.

And at the risk of sounding campy, I guess I just wanted to say I love mah teamies at Librato and I love all of you as well, and I wanted to thank you for working to make effective monitoring happen in your shops, and for building tools to make it happen for other people. So, sincerely, thank you. You make me want to come to work every day.

Questions?

@davejosephsen

And at this point the conference organizers have insisted that I allow you to ask questions, but I have read the code of conduct, which has literally nothing to say on the subject of speakers shouting “smoke bomb” and running off stage if they are confronted with a hostile question. So that is a right I reserve.
