Killer Bugs From Outer Space Jérôme Petazzoni — @jpetazzo LinuxCon — Chicago — 2014

Upload: jerome-petazzoni

Post on 15-Jan-2015

DESCRIPTION

Working with software means working with bugs. Bugs in software, bugs in hardware; bugs in Open Source code, bugs in proprietary code. If software is eating the world, bugs might end up taking the first bite. We will present a few typical bugs, some of them famous, some of them infamous (including bugs that actually killed people). Since one can never be too well-prepared to fend off the next infestation, we will give tools, tips, and best practices to fix bugs in Open Source software. We will give real world examples of Really Mysterious Bugs (sometimes nicknamed "Heisenbugs" because they tend to disappear when you try to observe them), and how they were fixed, in Node.js, Docker, and the Linux Kernel.

TRANSCRIPT

Page 1: Killer Bugs From Outer Space


Page 2: Killer Bugs From Outer Space
Page 3: Killer Bugs From Outer Space

Why this talk?

Codito, ergo erro

I code, therefore I make mistakes

Page 4: Killer Bugs From Outer Space

Introduction(s)

● Hi, I’m Jérôme.
● Sometimes, I write code.
● Sometimes, the code has bugs.
● Sometimes, I fix the bugs in my code.
● Sometimes, I fix the bugs in other people’s code.

I like bullet points!

● And I carry a pager.

Page 11: Killer Bugs From Outer Space

Introduction(s)

A pager is a device that wakes you up, or tells you to stop whatever you’re doing, so you can fix other people’s bugs.

WE HATESSS THEMSS.

Page 13: Killer Bugs From Outer Space

What about you?

● Do you write code?
● Does it sometimes have bugs?
● Do you fix them?
● Do you fix other people’s code too?
● Do you carry a pager?
● Do you love it?

Page 14: Killer Bugs From Outer Space

Outline

● Let’s talk about some really nasty bugs
● How they were found, how they were fixed
● How to be prepared next time
● This is not about testing, TDD, etc. (when the bugs are there, it’s too late anyway)

Page 15: Killer Bugs From Outer Space

Outline

● Node.js
● Harmless hardware bugs
● Docker
● Harmful hardware bugs
● Linux

Page 16: Killer Bugs From Outer Space

Node.js

Page 17: Killer Bugs From Outer Space

Context

● Hipache* is a reverse-proxy in Node.js
● Handles a bit of traffic
  ○ >100 req/s
  ○ >10K virtual hosts
  ○ >10K different containers
● Vhosts and containers change all the time (more than once per minute)

*Hipache is Hipster’s Apache. Sorry.

Page 18: Killer Bugs From Outer Space

The bug

It all starts with an angry customer.

“Sometimes, our application will crash, because this 700 KB JSON file is truncated by Hipache!”

What about Content-Length?
The client code should scream, but it doesn’t.

Page 19: Killer Bugs From Outer Space

Let’s sniff some packets

Log into the load balancer (running Hipache)...

# ngrep -tipd any -Wbyline '/api/v1/download-all-the-things' tcp port 80
interface: any
filter: (ip or ip6) and ( tcp port 80 )
match: /api/v1/download-all-the-things
####
T 2013/08/22 04:11:27.848663 23.20.88.251:55983 -> 10.116.195.150:80 [AP]
GET /api/v1/download-all-the-things.json HTTP/1.0.
Host: angrycustomer.com
X-Forwarded-Port: 443.
X-Forwarded-For: ::ffff:24.13.146.16.
X-Forwarded-Proto: https.
...

Page 20: Killer Bugs From Outer Space

Too much traffic, not enough visibility!

# tcpdump -peni any -s0 -w dump tcp port 80
(Wait a bit)
^C

Transfer dump file

DEMO TIME!

Page 21: Killer Bugs From Outer Space
Page 22: Killer Bugs From Outer Space

What did we find out?

● Truncated files happen because a chunk (probably exactly one) gets dropped.

But:
● Impossible to reproduce locally.
● Only the customer sees the problem.

TONIGHT, WE DINE IN CODE!

Page 23: Killer Bugs From Outer Space

This is Node.js.
I have no idea what I’m doing.

● Warm up the debuggers!
● … but Node.js is asynchronous, callback-driven, spaghetti code
● Hmmmm, spaghetti

Page 26: Killer Bugs From Outer Space

This is Node.js.
I have no idea what I’m doing.

● Plan B: PRINT ALL THE THINGS

Page 27: Killer Bugs From Outer Space

You need a phrasebook!

● How do you say “printf” in your language?

● How do you find where a function comes from?

● How do you trace the standard library?
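As one possible phrasebook entry, here is how those three questions might be answered in Python rather than Node.js. This is only a sketch: `traced` and `where_is` are hypothetical helper names made up for this example, and `json.dumps` is just a convenient standard library function to practice on.

```python
import functools
import inspect
import json
import logging

log = logging.getLogger("phrasebook")


def traced(func):
    """Wrap func so that every call and return value is logged ("printf")."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        log.debug("return %s -> %r", func.__name__, result)
        return result
    return wrapper


def where_is(func):
    """Return the file that defines func, or None for C builtins."""
    try:
        return inspect.getsourcefile(func)
    except TypeError:
        return None


logging.basicConfig(level=logging.DEBUG)

# Where does a function come from?
print(where_is(json.dumps))

# Trace the standard library: swap in the wrapper, then put things back.
original = json.dumps
json.dumps = traced(original)
try:
    encoded = json.dumps({"bug": "found"})
finally:
    json.dumps = original
```

Swapping the function temporarily (and restoring it in `finally`) keeps the tracing local to the code under investigation.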

Page 28: Killer Bugs From Outer Space

Shotgun debugging

● Add console.log() statements everywhere:
  ○ in Hipache
  ○ in node-http-proxy
  ○ in node/lib/http.js
● For the last one (part of std lib), we need to:
  ○ replace require('http') with require('_http')
  ○ add our own _http.js to our node_modules
  ○ do the same to net.js (in “our” _http.js).
● Now analyze big stream of obscure events!
● Let There Be Light

Page 29: Killer Bugs From Outer Space

Interlude about pauses

● With Node.js, you can pause a TCP stream. (Node.js will stop reading from the socket.)
● Then whenever you are ready to continue, you are supposed to send a resume event.
● Hipache does that: when a client is too slow, it will pause the socket to the backend.

SO FAR, SO GOOD

Page 30: Killer Bugs From Outer Space

What really happens

● There are two layers in Node: tcp and http.
● When the tcp layer reads the last chunk, the backend closes the socket (it’s done).
● The tcp layer notices that the socket is now closed, and emits an end event.
● The end event bubbles up to the http layer.
● The http layer finishes what it was doing, without sending a resume.
● Node never reads the chunks in the kernel buffers. They are lost, forever alone.

Page 31: Killer Bugs From Outer Space

How do we fix this?

Pester Node.js folks

Catch that end event, and when it happens, send a resume to the stream to drain it.

(Implementation detail: you only have the http socket, and you need to listen for an event on the tcp socket, so you need to do slightly dirty things with the http socket. But eh, it works!)
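The lost-chunk scenario and the "resume on end" fix can be illustrated with a toy model. This is a sketch in Python, not the actual Hipache or Node.js code: `ToySocket` and `relay` are hypothetical names, and the "kernel buffer" is just a queue.

```python
from collections import deque


class ToySocket:
    """Toy stand-in for a backend socket; not Node.js's real stream API."""

    def __init__(self):
        self.kernel_buffer = deque()  # data received by the OS, not yet read
        self.delivered = []           # chunks the proxy actually forwarded
        self.paused = False

    def receive(self, chunk):
        """A chunk arrives from the backend."""
        self.kernel_buffer.append(chunk)
        self._read()

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False
        self._read()

    def _read(self):
        # Forward buffered chunks, unless the stream is paused.
        while self.kernel_buffer and not self.paused:
            self.delivered.append(self.kernel_buffer.popleft())


def relay(chunks, drain_on_end):
    """Relay chunks while the client is slow; return what got forwarded."""
    sock = ToySocket()
    sock.receive(chunks[0])   # first chunk is read and forwarded
    sock.pause()              # client is too slow: stop reading
    for chunk in chunks[1:]:
        sock.receive(chunk)   # the rest piles up in the kernel buffer
    # The backend closes the socket here: this is the "end" event.
    if drain_on_end:
        sock.resume()         # the fix: drain what is still buffered
    return sock.delivered


print(relay(["a", "b", "c"], drain_on_end=False))  # ['a']
print(relay(["a", "b", "c"], drain_on_end=True))   # ['a', 'b', 'c']
```

In the toy model everything after the pause is lost, whereas the real bug dropped (probably) a single chunk; the point is only that data left behind a paused stream vanishes unless something resumes it when the end event fires.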

Page 32: Killer Bugs From Outer Space
Page 33: Killer Bugs From Outer Space

What did we learn?

When you can’t reproduce a bug at will, record it in action (tcpdump) and dissect it (wireshark).

Spraying code with print statements helps. (But it’s better to use the logging framework!)

You don’t have to know Node.js to fix Node.js!

Page 34: Killer Bugs From Outer Space

Harmless hardware bugs

Page 35: Killer Bugs From Outer Space

Intel Pentium
(insert appropriate ©™ where required)

● Pentium FDIV bug (1994)
  ○ errors at 4th decimal place
  ○ fixed by replacing CPUs
  ○ cost (for Intel): $475,000,000
  ○ cost (for users): approx. $0
● Pentium F00F bug (1997)
  ○ using the wrong instruction hangs the machine
  ○ fixed in software
  ○ cost: ???

Page 36: Killer Bugs From Outer Space

ATA ribbon cables

● Touch or move those cables: the transfer speed changes
● SATA was introduced in 2003, and (mostly) addresses the issue
● Vibration is still an issue, though

Page 37: Killer Bugs From Outer Space

Docker
(because even when it’s not about Docker, it’s still about Docker)

Page 38: Killer Bugs From Outer Space

Bug: It never works the first time

# docker run -t -i ubuntu echo hello world
2013/08/06 23:20:53 Error: Error starting container 06d642aae1a: fork/exec /usr/bin/lxc-start: operation not permitted
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world

Page 39: Killer Bugs From Outer Space
Page 40: Killer Bugs From Outer Space

Strace to the rescue!

Steps:
1. Boot the machine.
2. Find pid of process to analyze. (ps | grep, pidof docker...)
3. strace -o log -f -p $PID
4. docker run -t -i ubuntu echo hello world
5. Ctrl-C the strace process.
6. Repeat steps 3-4-5, using a different log file.

Note: can also strace directly, e.g. “strace ls”.

Page 41: Killer Bugs From Outer Space

Let’s compare the log files

● Thousands and thousands of lines.
● Look for the error message in file A. (e.g. “operation not permitted”)
● If lucky: it will reveal the issue.
● Otherwise, look what happens in file B.

● Other approach: start from the beginning or the end, and try to find the point when things started to diverge.
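The "find where things diverge" approach can be roughly automated. The sketch below is an assumption-laden helper, not a tool used in the talk: `normalize` and `first_divergence` are hypothetical names, and replacing every number with `N` is a crude heuristic that hides pids and file descriptor numbers (and could hide real differences too). The sample lines are taken from the strace output shown later in this deck.

```python
import re
from itertools import zip_longest


def normalize(line):
    """Collapse whitespace and replace every run of digits with N."""
    line = re.sub(r"\s+", " ", line.strip())
    return re.sub(r"\d+", "N", line)


def first_divergence(log_a, log_b):
    """Return (index, line_a, line_b) for the first differing line, else None."""
    for index, (a, b) in enumerate(zip_longest(log_a, log_b)):
        if a is None or b is None or normalize(a) != normalize(b):
            return index, a, b
    return None


FIRST_RUN = [
    "[pid 1331] setsid() = 1331",
    "[pid 1331] dup2(10, 0) = 0",
    "[pid 1331] ioctl(0, TIOCSCTTY) = -1 EPERM (Operation not permitted)",
]
SECOND_RUN = [
    "[pid 1414] setsid() = 1414",
    "[pid 1414] dup2(14, 0) = 0",
    "[pid 1414] ioctl(0, TIOCSCTTY) = 0",
]

# Index 2: the ioctl call is the first place where the two runs differ.
print(first_divergence(FIRST_RUN, SECOND_RUN))
```

On real logs, pass the result of `open("log").readlines()` for each file; expect to tune `normalize` (timestamps, addresses) before the first reported divergence is meaningful.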

Page 42: Killer Bugs From Outer Space

Specialized hardware helps

Page 43: Killer Bugs From Outer Space

Specialized hardware helps

● Now you have a good reason to ask your CFO about that dual 30” monitor setup!

Page 44: Killer Bugs From Outer Space

Investigation results

First time
[pid 1331] setsid() = 1331
[pid 1331] dup2(10, 0) = 0
[pid 1331] dup2(10, 1) = 1
[pid 1331] dup2(10, 2) = 2
[pid 1331] ioctl(0, TIOCSCTTY) = -1 EPERM (Operation not permitted)
[pid 1331] write(12, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 1331] _exit(253) = ?

Second time (and every following attempt)
[pid 1414] setsid() = 1414
[pid 1414] dup2(14, 0) = 0
[pid 1414] dup2(14, 1) = 1
[pid 1414] dup2(14, 2) = 2
[pid 1414] ioctl(0, TIOCSCTTY) = 0
[pid 1414] execve("/usr/bin/lxc-start", ["lxc-start", "-n", ...]) <...>

Page 45: Killer Bugs From Outer Space

What does that mean?

● For some reason, the code wants file descriptor 0 (stdin) to be a terminal.

● The first time we run, it fails, but in the process, we acquire a terminal. (UNIX 101: when you don’t have a controlling terminal and open a file which is a terminal, it becomes your controlling terminal, unless you open the file with flag O_NOCTTY)

● Next attempts are therefore successful.

Page 46: Killer Bugs From Outer Space

… Really?

To confirm that this is indeed the bug:
● reproduce the issue (start the process with “setsid”, to detach from controlling terminal)
● check the output of “ps” (it shows controlling terminals)

#before
23083 ? Sl+ 0:12 ./docker -d -b br0
#after
23083 pts/6 Sl+ 0:12 ./docker -d -b br0

Page 47: Killer Bugs From Outer Space

V I C T O R Y

Page 48: Killer Bugs From Outer Space

What did we learn?

You can attach to running processes.

● strace is awesome. It traces syscalls.
● ltrace is awesome too. It traces library calls.
● gdb is your friend. (A very peculiar friend, but a friend nonetheless.)

Page 49: Killer Bugs From Outer Space

Harmful hardware bugs

Page 50: Killer Bugs From Outer Space

“Errare humanum est, perseverare autem diabolicum”

“To err is human, but to really foul things up, you need a computer”

Page 51: Killer Bugs From Outer Space

Really nasty (and sad) bug: The Therac-25

● Radiotherapy machine (shoots beams to cure cancer)
● Two modes:
  ○ low energy (direct exposure)
  ○ high energy (beam hits a special target/filter first)

Page 52: Killer Bugs From Outer Space

The problem

● In older versions of the machine, a hardware interlock prevented the high energy beam from shooting if the filter was not in place.

● On the Therac-25, it’s in software.

● What could possibly go wrong?

Page 53: Killer Bugs From Outer Space

What went wrong

● 6 people got radiation burns
● 3 people died
● … over the course of 3 years (1985 to 1987)

Page 54: Killer Bugs From Outer Space

Konami Code of Death

On the keyboard, press: (in less than 8 seconds)

X ↑ E [ENTER] B

...And the high energy beam shoots, unfiltered!

Page 55: Killer Bugs From Outer Space

How could it happen?

● Race condition in the software.
● Never happened during tests:
  ○ the tests did not include “unusual sequences” (which were not that unusual after all)
  ○ test operators were slower than real operators
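Why a fast operator triggers such a bug while a slow tester never does can be shown with a deliberately simplified, deterministic model. This is an illustration of a timing-dependent stale value, not a reconstruction of the actual Therac-25 software: `configure`, `is_consistent`, and the two "settings" are hypothetical, and only the 8 second window comes from the slides.

```python
SETUP_SECONDS = 8  # the window quoted on the slide


def configure(edits):
    """Apply operator edits, given as (seconds_since_start, mode) pairs.

    Returns (beam_setting, filter_setting) once all edits are applied.
    """
    _, first_mode = edits[0]
    beam_setting = first_mode      # latched when the slow setup begins
    filter_setting = first_mode
    for seconds, mode in edits[1:]:
        filter_setting = mode      # this part follows every edit
        if seconds >= SETUP_SECONDS:
            beam_setting = mode    # only edits made after setup are seen
    return beam_setting, filter_setting


def is_consistent(edits):
    """True when both settings agree on the mode."""
    beam_setting, filter_setting = configure(edits)
    return beam_setting == filter_setting


fast_operator = [(0, "X"), (3, "E")]    # correction typed within the window
slow_tester = [(0, "X"), (12, "E")]     # same correction, typed later

print(is_consistent(fast_operator))  # False
print(is_consistent(slow_tester))    # True
```

The same keystrokes give a safe or an unsafe result depending only on timing, which is why tests run by slower operators never caught it.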

Page 56: Killer Bugs From Outer Space

Aggravating details

● Many engineering and institutional issues
  ○ No code review
  ○ No evaluation of possible failures
  ○ Undocumented error codes
  ○ No sensor feedback
● The machine had tons of “normal errors”
  ○ And operators learned to ignore them
● So the “real errors” were ignored
  ○ Just hit retry, same player shoot again!

Page 57: Killer Bugs From Outer Space

Let’s get back to weird Linux Kernel bugs

Page 58: Killer Bugs From Outer Space

Linux Kernel
and spinlocks and Xen and ...

Page 59: Killer Bugs From Outer Space


Page 60: Killer Bugs From Outer Space

Random crashes on EC2

● Pool of ~50 identical instances
● Same role (run 100s of containers)
● Sometimes, one of them would crash
  ○ Total crash
  ○ no SSH
  ○ no ping
  ○ no log
  ○ no nothing
● EC2 console won’t show anything
● Impossible to reproduce

Page 61: Killer Bugs From Outer Space

Try a million things...

● Different kernel versions
● Different filesystem tunings
● Different security settings (GRSEC)
● Different memory settings (overcommit, OOM)
● Different instance sizes
● Different EBS volumes
● Different differences
● Nothing changed

Page 62: Killer Bugs From Outer Space

And one fine day...

● One machine crashes very often (every few days, sometimes few hours)

CLONE IT!
ONE MILLION TIMES!

Page 63: Killer Bugs From Outer Space

A New Hope!

● Change everything (again!)
● Find nothing (again!)
● Do something crazy: contact AWS support
● Repeat tests on “official” image (AMI) (this required porting our stuff from Ubuntu 10.04 to 12.04)

Page 64: Killer Bugs From Outer Space

Happy ending

● Re-ran tests with official image
● Eventually got it to crash
● Left it in crashed state
● Support analyzed the image...

“oh yeah it’s a known issue, see that link.”

U SERIOUS?

Page 67: Killer Bugs From Outer Space

● The bug only happens:
  ○ on workloads using spinlocks intensively
  ○ only on Xen VMs with many CPUs
● Spinlocks = actively spinning the CPU
● On VMs, you don’t want to hold the CPU
● Xen has special implementation of spinlocks

When waking up CPUs waiting on a spinlock, the code would only wake up the first one, even if there were multiple CPUs waiting.

I can explain!

Page 68: Killer Bugs From Outer Space

The patch (priceless)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index d69cc6c..67bc7ba 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -328,7 +328,6 @@ static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
                 if (per_cpu(lock_spinners, cpu) == xl) {
                         ADD_STATS(released_slow_kicked, 1);
                         xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
-                        break;
                 }
         }
 }
--
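What that single removed `break` changes can be sketched with a toy model of the wake-up loop. This is Python, not kernel code: `unlock_slow` and the `lock_spinners` dictionary are simplified stand-ins that borrow names from the patch, and "kicking" a CPU is just appending it to a list.

```python
def unlock_slow(lock, lock_spinners, stop_after_first):
    """Return the CPUs that get kicked when `lock` is released.

    lock_spinners maps a CPU number to the lock it is waiting on.
    """
    kicked = []
    for cpu, waiting_on in sorted(lock_spinners.items()):
        if waiting_on == lock:
            kicked.append(cpu)   # stands in for xen_send_IPI_one(cpu, ...)
            if stop_after_first:
                break            # the line that the patch removes
    return kicked


spinners = {0: "A", 1: "B", 2: "A", 3: "A"}

print(unlock_slow("A", spinners, stop_after_first=True))   # [0]
print(unlock_slow("A", spinners, stop_after_first=False))  # [0, 2, 3]
```

With the `break`, CPUs 2 and 3 are never kicked by this release and keep waiting; in the toy model that is harmless, but it mirrors the behavior described above, where only the first waiting CPU was woken up.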

Page 69: Killer Bugs From Outer Space
Page 70: Killer Bugs From Outer Space

What did we learn?

We didn’t try all the combinations. (Trying on HVM machines would have helped!)

AWS support can be helpful sometimes. (This one was a surprise.)

Trying to debug a kernel issue without console output is like trying to learn to read in the dark. (Compare to local VM with serial output…)

Page 71: Killer Bugs From Outer Space
Page 72: Killer Bugs From Outer Space

Overall Conclusions

When facing a mystic bug from outer space:
● reproduce it at all costs!
● collect data with tcpdump, ngrep, wireshark, strace, ltrace, gdb; and log files, obviously!
● don’t be afraid of uncharted places!
● document it, at least with a 2 AM ragetweet!

Page 73: Killer Bugs From Outer Space

One last thing...

● Get all the help you can get!
● Your developers will rarely reproduce bugs (Ain’t nobody got time for that)
● Your support team will (They talk to your customers all the time)
● Help your support team to help your devs
● Bonus points if your support team fixes bugs

Page 74: Killer Bugs From Outer Space

Thank you! Questions?