HKG15-204: OpenStack: 3rd-Party Testing and Performance Benchmarking
LEG HKG15-204
OpenStack Testing and Performance Benchmarking
3rd-Party CI, Rally and Tempest

Presented by: Andrew McDermott, Clark Laughlin
Date: Tuesday, 10 Feb 2015
Agenda
● Update on 3rd-party CI testing
  ○ why, where and how
● Update on Tempest
  ○ analysis of results
  ○ current issues
  ○ plan going forward
● Explanation of Rally
  ○ results
  ○ how we can make use of it
OpenStack 3rd-Party CI
● Goal: Get ARM recognized as an equal, supported platform for OpenStack
  ○ Path to recognition requires setting up a 3rd-party CI system
  ○ Must be able to demonstrate stability before being allowed to vote on patches
OpenStack 3rd-Party CI
● What it is:
  ○ Run Tempest against OpenStack, triggered by gerrit events
  ○ Report results back to OpenStack gerrit
  ○ Functional test of OpenStack components
● What it is not:
  ○ A general-purpose arm64 test environment
  ○ Testing hypervisor functionality
  ○ Testing performance or functionality of VMs
OpenStack 3rd-Party CI
● How?
  ○ Set up using OpenStack CI components
  ○ OpenStack deployment with KVM as hypervisor
  ○ Run devstack/tempest configured to use QEMU instances
Image credit: http://thoughtsoncloud.com/2014/09/creating-continuous-integration-environment-openstack/
[Diagram: arm64 nova-compute nodes]
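As a rough sketch of the setup just described, a minimal devstack local.conf that selects plain QEMU emulation instead of hardware-accelerated KVM might look like the following. The settings shown are assumptions for illustration, not the actual Linaro 3rd-party CI configuration:

```ini
# Illustrative devstack local.conf fragment (assumed values only,
# not the actual Linaro 3rd-party CI configuration).
[[local|localrc]]
# Run nova instances under plain QEMU emulation rather than KVM.
LIBVIRT_TYPE=qemu
# Enable the Tempest service so the functional tests can be run.
enable_service tempest
```

After stack.sh completes, Tempest can then be run against the resulting deployment.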
OpenStack 3rd-Party CI
● Setting up a dedicated testing environment in the Linaro co-lo facility
  ○ HP Moonshot
    ■ Single chassis
    ■ ~5 HP m300 cartridges (8-core amd64) running CI infrastructure services
    ■ ~20 HP m400 cartridges (8-core arm64) running test instances (KVM)
OpenStack 3rd-Party CI
● Plans:
  ○ Initially handle gerrit events for nova
  ○ Over time, scale to handle additional projects:
    ■ cinder
    ■ glance
    ■ swift
    ■ neutron
Questions
● What other projects does Linaro need to work towards adding test support for?
  ○ Network/storage plugins?
Questions
● Would anyone like to help?
  ○ Help debugging / fixing Tempest failures?
  ○ Experience setting up an OpenStack CI?
OpenStack Rally
● Rally is an OpenStack project that provides a framework for performance measurement, benchmarking and validation
  ○ runs benchmarks that show how the deployment scales
  ○ provides a historical view of the benchmarks that were run
  ○ details how fast they ran
  ○ validates that the workload ran successfully
Rally versus Tempest
● Rally is a higher-level tool than Tempest
  ○ Tempest is typically about running something once
  ○ Rally is more about testing across your data centres with 1000s of machines, each with 1000s of users/tenants
● Note: validation could use Tempest as a workload
Rally high-level use cases
● Rally for devops
  ○ use an existing cloud, simulate real-world load, aggregate results, verify SLAs have been met
● Rally for developers and QA
  ○ deploy, simulate real-world load, iterate on performance issues, aggregate results, make OpenStack better by upstreaming patches
● Rally for Continuous Integration / Delivery
  ○ deploy on a specific h/w configuration with the latest versions from tip, run a specific set of benchmarks, store performance data for historical trend analysis, report results - this use case is our initial focus
Rally Benchmark Scenarios
● A scenario is a benchmark specification
  ○ Typically grouped into OpenStack functional areas
● A scenario performs a small set of atomic operations
  ○ nova: boot then delete an instance
  ○ keystone: create a user, then list users
● Benchmark scenarios are also customisable
  ○ which image to use, how much RAM, disk, CPU
Rally Benchmark Runners
● Control the execution of a benchmark
● Provide different strategies for applying load to the deployment:
  ○ constant - generate a constant load N times
  ○ constant-for-duration - constant, but time-limited
  ○ periodic - intervals between consecutive runs
● The key aspect is concurrency
  ○ Run the same test but with concurrent invocations
  ○ This is quite different to Tempest testing
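The three load strategies above correspond to the "runner" section of a Rally task file. The fragment below is a hedged sketch: the fields follow the style of the example scenario later in this deck, but exact runner type names and supported fields vary between Rally versions, and the top-level keys here are illustrative labels only, not Rally syntax:

```json
{
  "constant_example":   { "type": "constant", "times": 15, "concurrency": 2 },
  "time_boxed_example": { "type": "constant_for_duration", "duration": 60, "concurrency": 2 },
  "periodic_example":   { "type": "periodic", "times": 10, "period": 5 }
}
```

Raising "concurrency" is what distinguishes a Rally run from a one-shot Tempest run: the same atomic operations are issued by several simulated users at once.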
Rally Benchmark Context
● A Context typically specifies:
  ○ the number of users/tenants
  ○ the roles granted to those users/tenants
  ○ whether they have extended or narrowed quotas
● Running a test on your laptop is different to running the test at scale
Rally Example Scenario

{
    "NovaServers.boot_server": [
        {
            "args": {
                "flavor_id": 42,
                "image_id": "73257560-c59b-4275-a1ec-ab140e5b9979"
            },
            "runner": {
                "type": "constant",
                "times": 15,
                "concurrency": 2
            },
            "context": {
                "users": {
                    "tenants": 1,
                    "users_per_tenant": 3
                },
                "quotas": {
                    "nova": {
                        "instances": 20
                    }
                }
            }
        }
    ]
}
Rally Benchmark Database
● Rally stores results in a database
  ○ data mining & trend analysis
  ○ looking at historical results
  ○ results can be arbitrarily tagged, then used in SQL queries
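Rally's own database schema is internal to the tool, but the idea of tagging benchmark results and mining them with SQL can be sketched in a few lines of Python with sqlite3. The table layout and values below are invented purely for illustration and are not Rally's actual schema:

```python
import sqlite3

# Illustrative only: a made-up table of tagged benchmark results,
# not Rally's actual database schema.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE results
              (scenario TEXT, tag TEXT, duration_s REAL)""")
db.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [("NovaServers.boot_server", "juno", 11.2),
     ("NovaServers.boot_server", "juno", 12.8),
     ("NovaServers.boot_server", "icehouse", 15.1)])

# Trend analysis: average boot time per tagged release.
for tag, avg in db.execute(
        "SELECT tag, AVG(duration_s) FROM results "
        "GROUP BY tag ORDER BY tag"):
    print(tag, round(avg, 1))
```

Tagging each run (here by release name) is what makes historical comparisons such as "icehouse vs juno" a simple GROUP BY query.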
$ rally task list
+--------------------------+---------------------+----------+--------+
| uuid                     | created_at          | status   | failed |
+--------------------------+---------------------+----------+--------+
| fbdf6a3e-...fe47d6345d13 | 2014-10-22 15:26:37 | finished | False  |
| ab231519-...3a72b7460fad | 2014-10-22 15:29:32 | finished | False  |
| 67ff34c4-...a6a651f1c458 | 2014-10-24 13:33:15 | finished | False  |
| 495598c5-...98b0e9b005e6 | 2014-11-12 11:02:46 | finished | False  |
+--------------------------+---------------------+----------+--------+
Nova “boot-and-delete” scenario
● Manual runs of the “boot-and-delete” scenario
  ○ Results for 1 controller, 2 compute nodes
  ○ Results for 1 controller, 3 compute nodes
● Disclaimer: results and timings are illustrative only - machines and network were shared
How we use and run Rally
● Deployment, testing and running of Rally through LAVA and manually
● Start with nova scenarios
  ○ grow and expand to other OpenStack components
  ○ Future: benchmark ODP and NFV
● Run scenarios against icehouse, juno and tip
OpenStack Tempest Update
● Summary from LEG-SC meeting
● Analysis of results
● What are the current issues
● What we plan to do next cycle
Tempest Result Summary
● Bundle Stream: https://validation.linaro.org/dashboard/streams/private/team/mustang/mwhudson-devstack/bundles/7c4d42405460a199ae694d0affe8d9e3ae96c64e/

        ARMv8   x86 (OpenStack CI)
Pass     1379   2051
Fail       36      0
Skip      322    200
Understanding “skips”
● Components not installed
  ■ cinder, neutron, trove, sahara, ceilometer, zaqar, etc.
● Config setting not enabled
  ■ Nova v3 API, suspend, live migration
● Currently disabled (existing bugs)
● Configuration errors
  ■ ping/ssh access not enabled
  ■ not enough images in glance
Examining Tempest failures
● Some reasons:
  ○ HTTP timeouts in test setup
  ○ Invalid configuration when creating instances (attempting to use an IDE bus)
● Common ARM and x86 failures
  ○ Unable to locate instance/image by ID
  ○ Unable to establish SSH connection to running instance
  ○ Tempest test suite can hang when running concurrently (e.g., --concurrency=8)
Getting more tests passing
● We need to enable subsystems like cinder (needs PCIe)
● Get live migration working
  ○ Live migration is planned for 2015.03
  ○ PCIe (hot plug) is planned for 2015 Q2
● Neutron: getting it configured and working on ARMv8
Ongoing LAVA testing plan (1)
● Dedicate 3 (new) machines in LAVA for OpenStack testing
● Will improve test execution time
  ○ no reboot
  ○ no reinstall of the base OS for each run
  ○ not shared
● Machines will also be used for Rally benchmarking
Ongoing LAVA testing plan (2)
● Establish baseline results for:
  ○ icehouse vs juno vs tip
● CI jobs for both ARM and x86
  ○ Want a baseline to make comparisons
  ○ x86 is minimal, best effort only
● Investigate LAVA results
  ○ some lab issues
  ○ some test jobs fail very early
Linaro OpenStack Bugzilla

● Bug database set up:
  ○ https://bugs.linaro.org/enter_bug.cgi?product=OpenStack
● Capturing ARMv8-only bugs
  ○ Common bugs will be reported upstream