OpenStack Summit Tokyo 2015 presentation, NTT Resonant
TRANSCRIPT
OpenStack at NTT Resonant: Lessons Learned in Web Infrastructure
Tomoya Hashimoto, Business Platform Division, NTT Resonant Inc.
Kazuhiro Tooriyama, Business Platform Division, NTT Resonant Inc.
Toshikazu Ichikawa, NTT Software Innovation Center, NTT Corporation
Presentation Video
This slide deck was presented at OpenStack Summit Tokyo 2015. You will find our video-recorded presentation at the following URL:
https://www.openstack.org/summit/tokyo-2015/videos/presentation/openstack-at-ntt-resonant-lessons-learned-in-web-infrastructure
2
Speakers
Tomoya Hashimoto
2001 – 2012 NTT Resonant: development and operation of core services (goo blog, oshiete goo Q&A service)
2012 – current NTT Resonant: architect of server platform

Kazuhiro Tooriyama
2010 – 2014 NTT Communications: development of ISP network (OCN)
2014 – current NTT Resonant: engineer of server platform

Toshikazu Ichikawa
2011 – 2014 Verio (NTT America): development of cloud service "Cloudn" and managed hosting service
2014 – current NTT: development of cloud service platform
3
1. About NTT Resonant
2. OpenStack Infrastructure Design
3. VM setup by Puppet with OpenStack
4. Monitoring OpenStack and VMs
5. Current Issues and Future Plan
4
Agenda
1. About NTT Resonant
5
1. About NTT Resonant
6
Regional Communications Business
Long Distance and International Communications Business
Mobile Communications Business
Data Communications Business
$112 billion in total revenue
240,000 employees worldwide
#1 in Data Center floor space
#2 in Global IP Backbone
Source: TeleGeography
All facts and figures accurate as of March 2014
R&D
1. About NTT Resonant
B2C services
Platform and B2B2C services
Portal Site
Smartphone application
goo milk feeder goo disaster prevention application
7
Services for Customers
Healthcare
Disaster prevention/response solutions
Phone Cloud / Developer support
e-commerce site for communications devices
NTT Resonant’s Business Area
1. About NTT Resonant
8
Dictionaries ZIP codes Laboratory Bodycloud
Housing and real estate Search Baby-care Movies
Maps Navigation Horoscopes Rankings Car and bike
News Weather Healthcare Smartphone applications Blogs
Job search Love and marriage Online store
Travel
Providing 60+ services including
• Web search
• Blogging
• News
• Oshiete! goo Q&A site
Launched in 1997, 18 years old
Web portal site “goo”
http://www.goo.ne.jp/
1. About NTT Resonant
9
How large is goo?
The 3rd largest web portal in Japan
Yahoo! Google
Rakuten
MSN
Scale of web portal “goo”
170 million unique browsers per month
1 billion page views per month
Source: 2015.02 NetRatings
2. OpenStack Infrastructure Design
10
2. OpenStack Infrastructure Design
• Migrate to another data center under a limited timeframe
–The termination of the existing data center (DC) contract was fixed; we needed to migrate our system from the existing DC to another DC by that time.
• Shorten the lead time for service release
–Speed up by replacing manual operations for creating and managing VMs
–Comparable to a public cloud service such as AWS
• Support all workflows needed to provide a service
–Not just introducing OpenStack as an infrastructure for web services
–Not only VM creation, but also the installation and configuration of software inside VMs
What was required of us for the OpenStack deployment
11
Service Teams
Platform Team
2. OpenStack Infrastructure Design
Organization and FormationNTT Resonant
Service Developer
60+ services
Platform service
Operation Partners
Operator (outsourcing)
12
~10 engineers
300+ engineers
…
Design Team
NTT R&D
OpenStack Community
… Joint experiment
Contribution Distribution
2. OpenStack Infrastructure Design
• It was decided to migrate our services to another data center.
–2014/03: project started; design and deployment of the OpenStack installation begins
–2014/10: OpenStack is ready, in production
–2014/10 – 2015/01 (4 months): 70 services, 1,300 VMs started
OpenStack deployment timeline with our services
13
Timeline: March 2014 – January 2015
Requirement definition, then OpenStack installation (design/deployment), about 6 months in total
★OpenStack started, in production (October 2014)
Migration of services from the old existing environment
★Migration completed; ★old existing environment closed
2. OpenStack Infrastructure Design
• Using OpenStack as a private cloud
• In production since October 2014
• As of now, it supports
–80+ services
–1 billion page views per month
• With
–400 hypervisors (2 Nova cells)
–4,800 physical cores
–1,800+ virtual servers
OpenStack Scale at main data center of NTT Resonant
14
2. OpenStack Infrastructure Design
OpenStack Components (Icehouse Release)
15
What we use:
• Horizon (Dashboard)
• Neutron (Network): virtual router and LAN, virtual load balancer
• Nova (Hypervisor): virtual server
• Glance (Image): VM template, image snapshot
• Cinder (Block Storage): virtual volume
• Swift (Object Storage): RESTful file store, replication
• Keystone (Identity)
Other components: Heat (Orchestration), Trove (Database services), Ceilometer (Telemetry)
• Distribution
–RDO with CentOS 6
–Icehouse version
• Automation
–Puppet for configuration management
•Thanks to the RDO community for the Puppet manifests
16
2. OpenStack Infrastructure Design
Deployment
• Provider network with VLAN
–No control of L3+ (router, NAT, load balancer, firewall)
–Using ML2 with the Linux bridge agent; we are familiar with it
• Service model
–An administrator prepares networks and subnets per tenant
–A tenant is not allowed to create/delete a network
• Close to "Scenario: Provider networks with Linux bridge" in the "OpenStack Networking Guide" [1]
17
[1] http://docs.openstack.org/networking-guide/scenario_provider_lb.html
Neutron layers: L4-7 (load balancer, VPN), L3 (router, NAT), L2 (network, port)
What we use: L2
2. OpenStack Infrastructure Design
Networking with Neutron
18
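The provider-VLAN design above can be sketched as an ML2 plugin configuration. This is a minimal sketch, not our actual config: the physical network name `physnet1`, the interface `eth1`, and the VLAN range are made-up placeholders.

```ini
; ml2_conf.ini (sketch): provider networks with the Linux bridge agent.
; "physnet1", "eth1", and the VLAN range 100:199 are hypothetical.
[ml2]
type_drivers = flat,vlan
tenant_network_types = vlan
mechanism_drivers = linuxbridge

[ml2_type_vlan]
; Admin-managed VLAN range; tenants cannot create networks themselves.
network_vlan_ranges = physnet1:100:199

[linux_bridge]
physical_interface_mappings = physnet1:eth1
```

With this model, the administrator creates the shared networks and subnets per tenant, and tenants only attach ports to them.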
Node Type | OpenStack Components | RabbitMQ (MQ) and MariaDB (Database) | HAProxy (LB) and Pacemaker (HA cluster)
Top cell controller | Nova, Glance, Keystone, Neutron, Horizon | RabbitMQ mirrored queue | Nova, Keystone, Neutron, Horizon, DB, MQ
Child cell controller | Nova | RabbitMQ mirrored queue | MQ
Database | N/A | MariaDB Galera Cluster | N/A
Swift Proxy | Swift, Glance | N/A | Swift, Glance
Swift Storage | Swift | N/A | N/A
Compute | Nova, Neutron | N/A | N/A
Node types and HA(High Availability) strategy
2. OpenStack Infrastructure Design
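The HAProxy/Pacemaker column above could look like the fragment below for the top-cell nova-api. This is a sketch only; the virtual IP, node names, and addresses are hypothetical, not taken from the deck.

```
# haproxy.cfg fragment (sketch): load-balancing nova-api on the top cell.
# 192.0.2.10 is a virtual IP assumed to be managed by Pacemaker.
listen nova-api
    bind 192.0.2.10:8774
    balance roundrobin
    option httpchk GET /
    server ctl01 192.0.2.11:8774 check
    server ctl02 192.0.2.12:8774 check
```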
2. OpenStack Infrastructure Design
Contribution to Community related to this project
19
• Bug fix [1]: this bug was a show-stopper for the project until we fixed it
–The shelve function didn't work in the Icehouse release with a nova-cells deployment
–We use shelve/unshelve for hypervisor maintenance
• Some bugs we found and fixed
–Security bug fix [2]: this was recently announced as OSSA 2015-017
–8 bug fixes other than the above
[1] "shelve api does not work in the nova-cell environment" https://bugs.launchpad.net/nova/+bug/1338451
[2] "Deleting instance while resize instance is running leads to unuseable compute nodes" https://bugs.launchpad.net/nova/+bug/1392527
2. OpenStack Infrastructure Design
• We modified code to enforce our operation rules
–We modified only Horizon, since users come through Horizon, not the API
• What we implemented
–Server naming restriction
–Access limits on the security group function
–About 40 items in total
• No modification to other components except bug-fix backports
–This minimizes the cost of maintenance
20
Server creation dialog of Horizon
Customization on Dashboard
3. VM setup by Puppet with OpenStack
21
3. VM setup by Puppet with OpenStack
Issue of VM setup (installation and configuration)
22
• Only 4 months from VM creation to service migration
– Time was limited for VM setup
– 1,300 VMs needed to be migrated onto OpenStack
– Automate procedures as much as possible
• The key is the Puppet manifests used at the existing data center (DC)
– We used Puppet manifests to set up VMs at the existing DC
• Making a bridge between OpenStack and Puppet
– The goal is to set up our services on top of OpenStack quickly and easily
We resolved this issue by using Puppet integrated with OpenStack
OpenStack
3. VM setup by Puppet with OpenStack
How we use puppet with OpenStack
23
• Our puppet design– Individual puppet master per tenant– Linux account, middleware, config file etc.– Single manifest repository
• What is required to use Puppet– Host name can be resolved with DNS– Host group is defined in LDAP– Puppet manifest has the entry for a host group
Tenant: A
Tenant : A User
VM-A
VM-B
SVN
PuppetMaster
DNS LDAP
Necessary
OpenStack
3. VM setup by Puppet with OpenStack
How we use puppet with OpenStack
24
• Synchronization tool
– Polls the Nova API to detect a new VM
– Registers the VM in DNS, LDAP and the Puppet manifest
– Completes the above steps every 5 minutes
• An OpenStack user can apply a Puppet manifest easily and quickly right after VM creation
Tenant: A
Tenant : A User
VM-A
VM-B
SVN
PuppetMaster
Synchronization tool
PollingNovaAPI
DNS LDAP
Add entry
Add entry
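The polling loop of the synchronization tool can be sketched in Python. This is a minimal sketch under assumptions: `list_servers` stands in for a Nova API client, and the `registry` methods stand in for the real DNS, LDAP and Subversion (Puppet manifest) integrations; none of these names come from the deck.

```python
def sync_new_vms(list_servers, registry, seen):
    """Detect servers not yet registered and run the registration steps.

    list_servers: callable returning dicts for all servers (a stub for
                  polling the Nova API).
    registry:     object with register_dns / register_ldap /
                  add_manifest_entry methods (stubs for DNS, LDAP, SVN).
    seen:         set of server IDs already processed.
    """
    for server in list_servers():
        if server["id"] in seen:
            continue
        registry.register_dns(server["name"], server["ip"])      # DNS entry
        registry.register_ldap(server["name"])                   # host group in LDAP
        registry.add_manifest_entry(server["name"])              # Puppet manifest entry
        seen.add(server["id"])
    return seen

# The real tool would repeat this every 5 minutes, e.g.:
#   while True: sync_new_vms(...); time.sleep(300)
```

After these three registrations, the VM's host name resolves, its host group exists, and Puppet has an entry for it, which is exactly the "Necessary" list on the previous slide.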
Outcome from VM setup framework with OpenStack
25
• Drastically shortened timeline and efficient workflow
–1,000 VMs deployed for services within 1 month
–Only 30 minutes from VM creation to service start; it took 5 business days without OpenStack
–Eliminated the tasks of two operators by reducing manual operation
• Common process to build a service environment
–Service engineers don't worry about the environment and can focus on their business
3. VM setup by Puppet with OpenStack
4. Monitoring OpenStack and VMs
26
4. Monitoring OpenStack and VMs
• Two monitoring environments
1. For the cloud infrastructure
• Network, physical servers, OpenStack itself
2. For web services
• Providing standard service monitoring methods on the private cloud
• Tools and their roles
– Zabbix
• Semi-automatic VM monitoring
– Redmine and Wiki
• As an issue (ticket) management system
• Automatically issuing 1 ticket per trouble
– Operation Center
• 24/7 monitoring and phone calls
• First response to simple occasions
27
Overview of our monitoring environment
Operation Center / Web Service Teams
Automatic issuing
Infra Team (us)
Watching 24/7; in case of trouble or provisioning
In case of a serious situation
4. Monitoring OpenStack and VMs
• Severity order
– API monitoring
• keystone-api, nova-api, neutron-api, horizon GUI, glance-api, swift-proxy
• Quite serious trouble
– Process failure detection
• nova-*, swift-*, keystone-*, rabbitmq-server, mysqld (MariaDB), etc.
– Process performance monitoring
• Depending on middleware, e.g. the number of MySQL connections
– Log monitoring
• Treat any log message at ERROR level or above as "trouble" from the beginning
–Lack of knowledge leads to doubt
• Filtering out problem-free logs day by day
28
1. For cloud infrastructure (OpenStack monitoring)
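The log-monitoring rule above (alert on ERROR and above, then whitelist known-harmless messages day by day) can be sketched as follows. The whitelist patterns are operator-maintained; the helper name is ours, not from any monitoring tool.

```python
import re

SEVERITIES = {"DEBUG": 0, "INFO": 1, "WARNING": 2, "ERROR": 3, "CRITICAL": 4}

def should_alert(line, whitelist, threshold="ERROR"):
    """Alert on any OpenStack log line at or above `threshold`,
    unless it matches a known problem-free pattern. The whitelist
    grows day by day as operators learn which messages are harmless."""
    m = re.search(r"\b(DEBUG|INFO|WARNING|ERROR|CRITICAL)\b", line)
    if not m or SEVERITIES[m.group(1)] < SEVERITIES[threshold]:
        return False
    return not any(re.search(p, line) for p in whitelist)
```

Starting strict and whitelisting downward matches the slide's approach: treat everything above ERROR as trouble first, then filter.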
4. Monitoring OpenStack and VMs
• What's this?
– Log messages from OpenStack launching one virtual machine
29
223 lines, 119,698 characters (only 24 lines without DEBUG level)
Problems: complicated logs
(icehouse release)
4. Monitoring OpenStack and VMs
• Analyzing without DEBUG logs– In a case of failure to create new instance
30
2015-07-XX 17:00:YY TopCellController INFO nova.osapi_compute.wsgi.server 172.X.X.X "GET <API_URL>/servers/<VM-UUID> HTTP/1.1" status: 200 ->Accepting the request of creating new one
2015-07-XX 17:00:YY TopCellController INFO nova.scheduler.filter_scheduler Attempting to build 1 instance(s) ->Just reporting
2015-07-XX 17:00:YY ChildCellController WARNING nova.scheduler.driver [instance:<VM-UUID>] Setting instance to ERROR state. ->The beginning of sleepless night
2015-07-XX 17:00:YY ChildCellController INFO nova.filters Filter DiskFilter returned 0 hosts -> Lack of free disks? Where is the processing sequence?
(icehouse release)
Problems: complicated logs
It's not friendly to newbies.
4. Monitoring OpenStack and VMs
31
2015-07-XX 17:00:YY ChildCellController DEBUG nova.filters Filter RamFilter returned 88 host(s) get_filtered_objects /usr/lib/python2.6/site-packages/nova/filters.py:88 ->report: enough memory
2015-07-XX 17:00:YY ChildCellController DEBUG nova.scheduler.filters.disk_filter (<hypervisor-name>) ram:46581 disk:731136 io_ops:0 instances:3 does not have 1433600 MB usable disk, it only has 731136.0 MB usable disk. ->report: not enough disk * 88 times
2015-07-XX 17:00:YY ChildCellController INFO nova.filters Filter DiskFilter returned 0 hosts ->Lacks of free disk space. We need to add more disk rapidly.
• Analyzing DEBUG logs – In a case of failure to create new instance
(icehouse release)
Problems: complicated logs
DEBUG log shows internal processing, but it’s quite scruffy.
4. Monitoring OpenStack and VMs
• New function to trace logs easily, even across components
– Targets: nova, cinder, glance, neutron, keystone, etc.
• Current situation
– Each component has its own request ID per request
– Need to map request IDs to trace logs across components
– Difficulty finding the IDs, e.g. creating a new volume from an image (cinder calls the glance API)
• NTT's suggestion
– Log the request ID mapping within 1 line in each caller
– Approved as a cross-project spec, to be implemented
• https://review.openstack.org/#/c/156508
32
Log Request ID mapping
cinder-volume / glance-api
2015-10-08 16:14:33.498 DEBUG cinder.volume.manager [req-A admin] image down load from glance req-B
015-10-08 16:14:33.521 DEBUG glanceclient.common.http [req-A admin]HTTP/1.1 200 OKcontent-length: 0x-image-meta-status: activex-image-meta-owner: 46e99ee00fd14957b9d75d997cbbbcd8…x-openstack-request-id: req-B…x-image-meta-disk_format: ami log_http_response /usr/local/lib/python2.7/dist-packages/glanceclient/common/http.py:136
…2015-10-08 16:14:33.517 11610 DEBUG glance.registry.client.v1.client [req-B 924515e485e846799215a0c9be9789cf 46e99ee00fd14957b9d75d997cbbbcd8 - - -] Registry request GET /images/c95a9731-77c8-4da7-9139-fedd21e9756d HTTP 200 request id req-req-5cb606e5-ea1c-4afc-a626-a4deb83c56a1 do_request /opt/stack/glance/glance/registry/client/v1/client.py:1242015-10-08 16:14:33.520 11610 INFO eventlet.wsgi.server [req-B 924515e485e846799215a0c9be9789cf 46e99ee00fd14957b9d75d997cbbbcd8 …
Buried deep!
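Once the caller logs the mapping in a single line, as the spec proposes, recovering the caller-to-callee request ID pair becomes a one-line parse. A sketch (the helper name and the simple "first two IDs" heuristic are ours):

```python
import re

def extract_request_mapping(line):
    """Given a caller log line such as
    '... cinder.volume.manager [req-A admin] image download from glance req-B',
    return (caller_id, callee_id), or None if the line holds no mapping.
    Assumes the caller's own request ID appears before the callee's."""
    ids = re.findall(r"req-[0-9A-Za-z-]+", line)
    if len(ids) >= 2:
        return ids[0], ids[1]
    return None
```

With such pairs collected per component, a trace across cinder and glance reduces to a join on request IDs instead of digging the callee ID out of a DEBUG HTTP dump.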
4. Monitoring OpenStack and VMs
• We've been providing a standard monitoring system inside our company
– Standardized monitoring workflow for internal service developers
• Standard monitoring item sets and rules
• Alert parameter thresholds
– Monitoring configuration entered into Zabbix (or Nagios) by hand
• Rethinking the monitoring scheme with OpenStack
– Over 1,000 virtual machines are born, and also suddenly die
– By our hands? Instead, our Zabbix was given a new function
• Detecting new VMs and starting monitoring semi-automatically
• Before getting along with OpenStack...
– Consider your current workflow deeply for efficient operation
33
2. For web services - Changing operation work-flow
5. Current Activity and Future Plan
34
5. Current Activity and Future Plan
• Changing sizing and improving VM density
• Initial flavors were designed with a focus on the migration project
• Compatibility with the old DC rather than resource efficiency
• VM specs identical to the old DC were best for the migration plan
• Current usage
–Disk capacity is too large
• Design: 37 GB of disk per 1 GB of memory
• Actual usage: 7 GB of disk per 1 GB of memory
• Providing new flavors based on actual usage; asking users to return unused disk capacity
• Doubling physical server memory
• Aiming to increase VM density by 1.3 – 2 times
35
Current Activity
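The right-sizing above implies a large reclaimable margin. A quick back-of-the-envelope check (the 400 GB memory pool in the example is ours, not a figure from the deck):

```python
def reclaimable_disk(memory_gb, designed_ratio=37, actual_ratio=7):
    """Disk capacity (GB) freed per pool of allocated VM memory when
    flavors move from the designed 37 GB-disk-per-GB-memory ratio to
    the observed 7 GB-per-GB ratio."""
    return memory_gb * (designed_ratio - actual_ratio)

# e.g. a hypothetical pool of 400 GB of VM memory frees
# reclaimable_disk(400) == 12000 GB of disk
```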
5. Current Activity and Future Plan
• Upgrade OpenStack
–Load Balancer as a Service (LBaaS) is desired
• Current: manual operation of the load balancer
• LBaaS API v1 is not enough
• Waiting for our vendor's driver for LBaaS API v2
–Establish an upgrade procedure
• We need to apply our patches, and to develop and test them
• This prevents us from upgrading frequently
–Toward the Mitaka release
• NTT R&D is located in "Mitaka"
36
Future Plan
37
Summary
1. About NTT Resonant, which operates the web portal site "goo"
• 170 million unique browsers and 1 billion page views per month
2. OpenStack infrastructure design
• It increased our business speed and agility
• We successfully deployed 400 hypervisors in 6 months
• Stable in production for more than 1 year
3. VM setup by Puppet with OpenStack
• We could start 70+ services on 1,300 VMs in 4 months
• It shortened the time to deploy a service from 5 days to 30 minutes
4. Monitoring both OpenStack and VMs, with Zabbix
5. Current activity and future plans
• Current: sizing to improve VM density
• Future: upgrade, LBaaS and more toward the Mitaka release
Appendix: Our monitoring environment
TIPS) Semi-automatic monitoring setup: Zabbix polls VMs, reads monitoring.conf, and then applies the specified template.
Zabbix server and monitored servers
Redmine
(1) The Zabbix server polls the IP segments of OpenStack VMs, finding Zabbix agents => registering them as monitoring targets
(2) Getting the monitoring definition via the agent
(3) Applying the corresponding monitoring template according to monitoring.conf
(4) In case of catching trouble, kicking the script for auto-issuing => sending a request to Redmine
API
X.Y.Z.0/24
Monitoring.conf
apache_prod
mysql_prod
linux_prod
alert_on
Zabbix Agent
Examples
apache_prod = Apache in production monitor
apache_dev = Apache in development monitor
linux_prod = Linux OS in production monitor
alert_on = sending alerts to the VM users
alert_off = maintenance (silent) mode
…
Script
Polling VMs(Auto Discovery)
Trouble Ticket
Issuing
New!
38
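The monitoring.conf flow in the appendix can be sketched as follows: keywords in the file select Zabbix templates and the alert mode. The template names mirror the examples on the slide, but the mapping code itself (and the mysql_prod description) is our assumption, not part of the tool.

```python
# Hypothetical keyword-to-template mapping, modeled on the slide's examples.
TEMPLATES = {"apache_prod": "Apache in production monitor",
             "apache_dev": "Apache in development monitor",
             "mysql_prod": "MySQL in production monitor",
             "linux_prod": "Linux OS in production monitor"}

def parse_monitoring_conf(text):
    """Return (templates_to_apply, alerting_enabled) from a monitoring.conf
    fetched via the Zabbix agent. Unknown keywords are ignored; alert_off
    (maintenance / silent mode) wins over alert_on."""
    keys = [line.strip() for line in text.splitlines() if line.strip()]
    templates = [TEMPLATES[k] for k in keys if k in TEMPLATES]
    alerting = "alert_on" in keys and "alert_off" not in keys
    return templates, alerting
```

Step (3) of the flow would then apply the returned templates to the discovered host, and step (4) would only fire tickets when alerting is enabled.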