© Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Large Scale Identification of Race Conditions
How we find race conditions in OpenStack

Joe Gordon
Sean Dague
May 21st, 2014


Development Scale

Development Principles

● Never break trunk
– Master branch is always green
– Developers are never blocked on broken trunk
– Support continuous deployment
● Transparency
● Automate everything
● Egalitarian
● Be strict. Reduce burden on reviewers

What Happens When You Submit Code

[Diagram: Proposed Change → Pep8, Unit Tests, Devstack / Tempest, Devstack / Grenade; ~180 Guests]

WAT?

1 Proposed Change generates …

● 5 – 10 Devstacks
● ~10K integration tests
● ~1000 2nd Level Guests
● ~1 GB of Log Data (uncompressed)

● 1 week = 250-500 changes merged
● 1 week = 1500-3000 change revisions (including updates to existing changes)
● 10,000 new first time changes proposed every 42 days
– 42 days between gerrit 70k – 80k and 80k – 90k

And these add up...
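To make "add up" concrete, here is a rough back-of-the-envelope sketch using the per-change and per-week figures from this slide. It assumes every change revision triggers one full set of jobs, which the slide implies but does not state; the function and constant names are illustrative only.

```python
# Per-change figures from the slide (approximate).
TESTS_PER_CHANGE = 10_000   # ~10K integration tests
LOG_GB_PER_CHANGE = 1       # ~1 GB of uncompressed log data

# Assumption: each change revision runs one full set of jobs.
REVISIONS_PER_WEEK = (1500, 3000)


def weekly_totals(revisions):
    """Return (integration tests, log volume in TB) for one week."""
    tests = revisions * TESTS_PER_CHANGE
    log_tb = revisions * LOG_GB_PER_CHANGE / 1000  # 1 TB = 1000 GB
    return tests, log_tb


for revisions in REVISIONS_PER_WEEK:
    tests, log_tb = weekly_totals(revisions)
    print(f"{revisions} revisions/week -> {tests:,} tests, {log_tb:.1f} TB of logs")
```

Under that assumption the weekly load is on the order of tens of millions of integration tests and a few terabytes of logs.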

Statistics of Large Numbers

● Factors
– Chance of Events - P(E)
– Number of Events / run – N(E/R)
– Number of runs - N(R)
● Ex: GitHub is down 0.05% of the time
– 0.0005 * 20 clones/run * 1500 runs/week = 15
– 15 test failures every week (on average) because of GitHub
– We no longer clone from GitHub

P(E) x N(E/R) x N(R) = Failure Rate
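The formula can be checked with a few lines of Python. This is a minimal sketch using the GitHub numbers from the slide; the function names are made up for illustration, and the second helper (chance that a single run is hit at least once) is an extra that assumes the events are independent, which the slide does not claim.

```python
def expected_failures(p_event, events_per_run, runs):
    """Expected failure count: P(E) x N(E/R) x N(R)."""
    return p_event * events_per_run * runs


def chance_run_hits_event(p_event, events_per_run):
    """Probability that at least one event in a run fails.

    Assumes the events within a run are independent.
    """
    return 1 - (1 - p_event) ** events_per_run


# GitHub example: down 0.05% of the time, 20 clones/run, 1500 runs/week.
weekly = expected_failures(p_event=0.0005, events_per_run=20, runs=1500)
print(f"Expected failures per week: {weekly:.1f}")
print(f"Chance a single run is hit: {chance_run_hits_event(0.0005, 20):.2%}")
```

The point of the product form is that even a very small P(E) becomes a steady stream of failures once N(E/R) and N(R) are large.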

Where do these failures come from?

How we did this in Grizzly

● Someone's change fails
– They run recheck, it passes
– No one ever knew about the issue
● Someone has a large patch series (15 patches)
– 1/3 of patches fail
– Different 1/3 of patches fail next time around
– “Hey, have you seen this failure → URL”
● My brain is a poor big data solution
● … and then we turned on parallel testing – KABOOM!

“Have you seen this recently?”

Elastic Recheck


Elastic Recheck Flow

[Diagram, steps numbered 1 to 4 in the original:
- Test completes; results go to Gerrit
- All artifacts go to logs.openstack.org
- Logs at INFO+ are selected into logstash.openstack.org
- recheckbot checks known patterns and reports to Gerrit and irc.freenode.net < 15 minutes after fail
- er data scripts check known patterns every 30 mins and publish to status.openstack.org/elastic-recheck]
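The "known patterns" step in the flow can be illustrated with a small sketch. This is not the Elastic Recheck implementation: the real flow searches logs indexed in logstash, while the version below scans an in-memory list of log lines with regular expressions. The bug ids, log messages, and function names are all hypothetical.

```python
import re

# Hypothetical known patterns: bug id -> regex over log lines.
# Both the ids and the messages are made up for illustration.
KNOWN_PATTERNS = {
    "bug-0001": re.compile(r"Connection to mirror .* timed out"),
    "bug-0002": re.compile(r"DBDeadlock"),
}


def classify_failure(log_lines, patterns=KNOWN_PATTERNS):
    """Return the sorted ids of all known patterns seen in a failed run."""
    hits = set()
    for line in log_lines:
        for bug_id, regex in patterns.items():
            if regex.search(line):
                hits.add(bug_id)
    return sorted(hits)


def report(change_id, log_lines):
    """Build the message a bot might post back for a failed run."""
    bugs = classify_failure(log_lines)
    if bugs:
        return f"Change {change_id}: failure matches known bug(s): {', '.join(bugs)}"
    return f"Change {change_id}: unrecognized failure, needs categorization"


logs = ["INFO starting job", "ERROR DBDeadlock while updating instance"]
print(report(12345, logs))
```

The useful property is the second branch: a failure that matches no known pattern is flagged for a human instead of being silently rechecked.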

We expected...

● 6 – 10 major bugs
● Frequency rates > 1%
– Human detection rates for patterns

We found...

Upstream Service Breaks

Examples:
- PyPI bad cert
- GitHub outages
- IaaS DNS blacklisting
- IaaS provider network

Fixes:

Assume touching the network is poison; cache or bring resources local

Infra Breaks

Examples:
- bad nodepool images
- service outages
- mirrors broken

Fixes:

Make infra more resilient and self healing

Bugs in OpenStack

Examples:
- state corruption
- races w/ async messaging
- races w/ multiple workers
- db deadlocks

Fixes:

Ferret out races in the code

● Currently tracking ~100 unique bugs in the system - seen in last 2 weeks
● Most at < 0.1% occurrence rate

Bugs in Tests

Examples:
- Unsafe global state expectations
- Comparing timestamps

Fixes:

Fix the tests
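As an illustration of the "comparing timestamps" class of test bug, here is a hypothetical sketch (not taken from any OpenStack test). The racy check compares two wall-clock reads and so depends on timing; the stable check injects a deterministic clock. All names are made up for the example.

```python
import itertools
import time


def make_record(clock=time.time):
    """Create a record with two timestamps taken one after the other."""
    return {"created_at": clock(), "updated_at": clock()}


def fake_clock(start=1000.0, step=1.0):
    """Deterministic clock: successive calls return start, start + step, ..."""
    ticks = itertools.count()
    return lambda: start + step * next(ticks)


def racy_check():
    """Timing dependent: true only if both clock reads return the same value."""
    record = make_record()
    return record["created_at"] == record["updated_at"]


def stable_check():
    """Deterministic: the test controls the clock and asserts an ordering."""
    record = make_record(clock=fake_clock())
    return record["created_at"] < record["updated_at"]
```

A check like `racy_check` can pass for months on a lightly loaded machine and then fail a small fraction of the time at scale, which is exactly the kind of rate the failure formula earlier in the talk turns into weekly failures.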

Bugs in Dependencies

Examples:
- kernel nbd vs. Open vSwitch
- libvirt wedging

Fixes:

Get the bug reported upstream; try to provide a workaround for buggy versions in OpenStack

Contributing Patterns

Keeping up with categorization

Next Steps

● Deprecating old /rechecks/ page
● Finding patterns in the patterns
– Is this only some providers?
– Is this only some configurations?
● Converting from frequency to percentages
– frequency graphs are cool, but misleading at times
– add error bars!
● Packaging up for easier consumption
● Optimizations on data collection
– We hit Elasticsearch really hard
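One way to turn raw failure counts into percentages with error bars is a binomial confidence interval. The sketch below uses the Wilson score interval; that choice is an assumption made here for illustration, since the slides do not say which method was planned.

```python
import math


def wilson_interval(failures, runs, z=1.96):
    """Wilson score interval for a failure proportion (z=1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = failures / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Same 1% failure rate, very different certainty.
for failures, runs in [(1, 100), (15, 1500)]:
    low, high = wilson_interval(failures, runs)
    print(f"{failures}/{runs}: {failures / runs:.2%} (interval {low:.2%} to {high:.2%})")
```

Both samples show the same percentage, but the interval for the smaller sample is much wider, which is the information a bare frequency graph hides.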

Thank You!

Elastic Recheck's Valiant Contributors

Joe Gordon, Sean Dague, Matt Riedemann, Matthew Treinish, Clark Boylan, Salvatore Orlando, James E. Blair, Peter Portante, Davanum Srinivas, Sergey Lukjanov, Attila Fazekas

Masayuki Igawa, Jeremy Stanley, Dolph Mathews, Brant Knudson, Anita Kuno, Michael Still, Allison Randal, Russell Bryant, Jerry Zhao, Christopher Yeoh, Thierry Carrez

Akihiro Motoki, Adam Gandelman, Mark McLoughlin, Sean M. Collins, Michael Krotscheck, Alexis Lee, Dean Troyer, Ken'ichi Ohmichi, Andrew Laski, Mohammed Naser, Sahid Orentino Ferdjaoui
