On Building Public Cloud Service for PostgreSQL and MySQL
Guangzhou Zhang, Alibaba Cloud [email protected]
Agenda
• Resource isolation
  • CPU/IO/MEM
  • Disk
• Privilege management
• Critical issues
  • Handling IO stalls
  • Handling OOM
• Enhance the service with Proxy
  • Transparent switch-over with Proxy
  • Transparent connection pool with Proxy
  • Read-write auto routing
Self-Intro
• Guangzhou Zhang, developer on the Alibaba Cloud RDS team
• The Alibaba Cloud RDS team:
  • Covers MySQL/PostgreSQL/SQL Server
  • Handles infrastructure, database kernel (MySQL/PG), operations, customer support, etc.
  • Runs one of the world's largest database instance clusters, still growing by more than 100% year over year
Architecture
[diagram: user requests arrive through a VIP at the Server Load Balancer, pass through the proxy layer, and reach the master/slave database instances, which are monitored by an HA component]
Resource isolation – CPU/IO/MEM
• Use bare-metal machines (no VMs, no network storage)
• Locate up to hundreds of instances on the same host
• Take full control over the whole stack; minimize external dependencies for a predictable SLA and predictable performance
• Control the resource allocation layout
Resource isolation – CPU/IO/MEM
• Use cgroups for CPU/IO/MEM isolation
[diagram: on one host machine, instance A's master (and its backend processes) runs inside one cgroup, while instance B's slave (and its backends) runs inside another]
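A minimal sketch of how such a per-instance cgroup (v1) could be populated. The cgroup root, instance name, and concrete limits are illustrative, not Alibaba's actual tooling; on a real host this runs as root against /sys/fs/cgroup.

```python
from pathlib import Path

def setup_instance_cgroup(root: str, instance: str,
                          cpu_cores: int, mem_bytes: int) -> None:
    """Create cpu and memory cgroups for one instance and set hard limits.

    `root` would be /sys/fs/cgroup on a real host; a scratch directory
    works for a dry run. cgroup-v1 interface files are assumed.
    """
    cpu = Path(root) / "cpu" / instance
    mem = Path(root) / "memory" / instance
    cpu.mkdir(parents=True, exist_ok=True)
    mem.mkdir(parents=True, exist_ok=True)

    period = 100_000                                    # 100 ms scheduling period
    (cpu / "cpu.cfs_period_us").write_text(str(period))
    (cpu / "cpu.cfs_quota_us").write_text(str(cpu_cores * period))
    (mem / "memory.limit_in_bytes").write_text(str(mem_bytes))
    # The instance's postmaster PID would then be written to cgroup.procs
    # so every forked backend inherits the limits automatically.
```

Writing the postmaster's PID into `cgroup.procs` is what makes the whole instance (all current and future backends) share the one budget.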
Resource isolation – CPU/IO/MEM
• Use cgroups for CPU/IO/MEM isolation
• Trick: put the data and write-ahead-log (WAL) files on separate disks, and loosen the IOPS limit on the WAL disk
• Why do we have to loosen the limit?
  • Maintenance work inside the database requires heavy IO and should finish as soon as possible
  • A strict limit on WAL IOPS on a slave can result in severe replication lag
  • A strict limit on WAL IOPS on a master can easily trigger a failover switch
  • It gives better write performance
• Why are we able to loosen the limit?
  • No one writes to the database at a high pace constantly, since database storage has an upper limit (2 TB) and is expensive
  • The storage used by WAL files is charged to the user
  • 'Hot' instances can be relocated to different hosts as needed
[diagram: instance A's cgroup, with the master and its backends writing to a WAL disk limited to 10,000 IOPS and a data disk limited to 500 IOPS]
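The asymmetric limits above could be expressed with cgroup-v1 blkio throttling, roughly as below. The device major:minor numbers, function name, and cgroup root are placeholders; only the 10,000/500 figures come from the slide.

```python
from pathlib import Path

# Slide's figures: generous WAL-disk limit, strict data-disk limit.
WAL_IOPS, DATA_IOPS = 10_000, 500

def throttle_instance_io(root: str, instance: str,
                         wal_dev: str, data_dev: str) -> None:
    """Record per-device write-IOPS throttles for one instance's blkio cgroup.

    `wal_dev`/`data_dev` are "major:minor" strings, e.g. "8:16". The real
    kernel interface takes one "major:minor iops" rule per write(); both
    rules are written together here for brevity.
    """
    blkio = Path(root) / "blkio" / instance
    blkio.mkdir(parents=True, exist_ok=True)
    rules = f"{wal_dev} {WAL_IOPS}\n{data_dev} {DATA_IOPS}\n"
    (blkio / "blkio.throttle.write_iops_device").write_text(rules)
```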
Resource isolation – Disk
• Regularly collect the disk usage of each database instance, and mark those with excessive usage as locked down
• Change the database kernel to add a parameter that controls user access in the locked-down state:
  • Allow read access
  • Disallow any writes to the instance
  • Still allow DROP/TRUNCATE commands, so users can free space
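The periodic usage sweep could look roughly like this. The directory layout and quota handling are hypothetical; the real control plane presumably tracks per-instance quotas and then flips the kernel-side access parameter.

```python
import os

def dir_size(path: str) -> int:
    """Total bytes of all regular files under `path`."""
    total = 0
    for dirpath, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

def find_lockdown_candidates(datadir_root: str, quota_bytes: int) -> list[str]:
    """Return instance directory names whose data exceeds the quota."""
    over = []
    for inst in sorted(os.listdir(datadir_root)):
        path = os.path.join(datadir_root, inst)
        if os.path.isdir(path) and dir_size(path) > quota_bytes:
            over.append(inst)
    return over
```

Instances returned here would be locked down to read-only; DROP and TRUNCATE remain allowed so the user can free space and be unlocked on the next sweep.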
Privilege Management
• The superuser role is withheld from end users for security reasons
  • Prevents users from messing up critical configurations for HA or maintenance
  • Makes it convenient for us to handle end users' roles and maintenance roles differently in code
• Users keep asking for more of the superuser privilege set, for managing data, monitoring, extensions, roles, etc.
• We have to loosen the privilege checks in some cases to keep users happy
Privilege Management
• Create a new role – rds_superuser
  • Allowed to manage all other non-superusers' data, kill sessions, etc.
  • Allowed to set specific configuration parameters
Handling OOM
• OOM (out-of-memory) cases were frequently observed:
  • Cloud users tend to buy a small instance class first, then regularly upgrade to a higher class
  • Large objects / big temporary memory allocations are widely used in some instances
  • The number of concurrent connections is not suitably configured by the application
• When an instance goes OOM, one of its processes is picked and killed by the Linux kernel, and the whole instance is then restarted with all connections lost
• How can we minimize the impact of OOM on users?
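One way to act before the kernel's OOM killer does (the public-cgroup + SIGUSR2 approach shown on the next slides) is to watch each instance cgroup's memory usage and signal the fattest backend ourselves. The 90% threshold and the helpers below are illustrative; only the SIGUSR2 choice and the public cgroup come from the talk.

```python
import os
import signal

def over_threshold(usage_bytes: int, limit_bytes: int, pct: int = 90) -> bool:
    """True when the cgroup has reached `pct`% of its memory limit."""
    return usage_bytes * 100 >= limit_bytes * pct

def pick_victim(rss_by_pid: dict[int, int]) -> int:
    """Choose the backend with the largest resident set as the victim."""
    return max(rss_by_pid, key=rss_by_pid.get)

def defuse(rss_by_pid: dict[int, int]) -> None:
    """Move the victim into the shared 'public' cgroup and signal it, so
    only one session dies instead of the kernel OOM-killing a process and
    forcing a full instance restart."""
    victim = pick_victim(rss_by_pid)
    # echo <victim> > /sys/fs/cgroup/memory/public/cgroup.procs  (needs root)
    os.kill(victim, signal.SIGUSR2)
```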
Handling OOM
[diagram: a Linux box with two per-instance memory cgroups, each containing several backend processes]
Handling OOM
[diagram: the same Linux box with an extra 'public' cgroup added; when an instance's cgroup nears its memory limit, a victim backend is moved into the public cgroup and sent 'kill -USR2', instead of the kernel OOM-killer taking the whole instance down]
Handling IO stalls
• Symptoms of IO stalls (ext4 filesystem mounted with data=ordered on Linux):
  • Long checkpoint fsync times
  • Nearly all operations, including setting up a new connection, hang for > 10 s; all instances on the same machine are affected
  • Yet IO utilization is low (< 10%) for the problematic disk
Handling IO stalls – fsync can be slow
[diagram: the checkpointer calls fsync(); with data=ordered, the ext4 journal commit for file metadata must first flush the dirty data queued for those files, so fsync can be slow]
Handling IO stalls – write() can be blocked by fsync()
[diagram: while the checkpointer's fsync() is committing the ext4 journal, an ordinary backend write() that needs to update file metadata blocks behind that same journal commit]
Handling IO stalls
• Mount ext4 with the option data=writeback
• Change the database kernel to call sync_file_range() before calling fsync() in the checkpointer process, so most dirty pages are already under writeback by the time fsync() runs
• Linux kernel enhancements
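The sync_file_range() trick can be sketched from user space with ctypes (Linux-only; the flag value is from the kernel headers). In the real fix this call lives in the checkpointer, in C; the function name here is made up.

```python
import ctypes
import os

libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.sync_file_range.argtypes = [ctypes.c_int, ctypes.c_int64,
                                 ctypes.c_int64, ctypes.c_uint]

SYNC_FILE_RANGE_WRITE = 2    # start writeback now, don't wait for it

def checkpoint_fsync(fd: int) -> None:
    """Kick off writeback for the whole file first (offset 0, nbytes 0 =
    'to end of file'), so the later fsync() -- and the ext4 journal commit
    it triggers -- has little left to flush and blocks concurrent write()
    calls only briefly."""
    if libc.sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) != 0:
        raise OSError(ctypes.get_errno(), "sync_file_range failed")
    os.fsync(fd)
```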
Architecture
[diagram: the same architecture as before – user requests through a VIP reach the Server Load Balancer and proxy layer, then the master/slave database instances monitored by HA]
Transparent connection pool with Proxy
[diagram: each client connection to the Proxy maps to its own backend thread/process in the instance]
• Connection establishment is costly
• Performance is poor for short-connection applications
Transparent connection pool with Proxy
[diagram: the Proxy now keeps a connection pool in front of the DB instance, so many client connections share a smaller set of backend connections]
Transparent connection pool with Proxy
• The Proxy needs to restore all the resources held by a connection before putting it into the connection pool
• Authentication needs to be performed for the incoming user when a pooled connection is reused
• Change the database kernel to support "SET AUTH_REQ = 'user/database/md5password/salt'" for that authentication
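A toy model of the proxy-side pool logic. The SET AUTH_REQ string follows the talk; the use of PostgreSQL's DISCARD ALL as the "restore resources" step, and the class and its API, are my illustration, not the real proxy.

```python
from collections import defaultdict

class ProxyPool:
    """Minimal sketch: pool server connections per (user, database) and
    re-authenticate on reuse via the talk's SET AUTH_REQ kernel extension."""

    def __init__(self):
        self._idle = defaultdict(list)           # (user, db) -> [conn, ...]

    def checkin(self, conn, user: str, db: str) -> None:
        # Restore the connection to a pristine state (temp tables,
        # prepared statements, session GUCs, ...) before pooling it.
        conn.execute("DISCARD ALL")
        self._idle[(user, db)].append(conn)

    def checkout(self, user: str, db: str, md5pw: str, salt: str):
        bucket = self._idle[(user, db)]
        if not bucket:
            return None          # pool miss: the proxy opens a fresh connection
        conn = bucket.pop()
        # Re-run authentication for the incoming client on the reused
        # server connection, as the kernel change allows.
        conn.execute(f"SET AUTH_REQ = '{user}/{db}/{md5pw}/{salt}'")
        return conn
```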
Transparent switch-over with Proxy
• Quite a few cases require an instance restart, and we implement such restarts as a switch-over:
  • Upgrading or downgrading an instance's class (e.g. from 2 GB to 4 GB of memory)
  • Moving an instance from one machine to another
  • Database kernel version upgrades
• A transparent switch-over performs the switch-over without breaking existing user connections. This is key to the SLA and to customer satisfaction.
Transparent switch-over with Proxy
[diagram: the proxy layer sits between clients and the master/slave DB instances; during a switch-over, HA promotes the slave and the Proxy reconnects to the new master]
Transparent switch-over with Proxy
• The Proxy re-establishes connections outside transaction boundaries
• Only for connections that hold no state that cannot be re-created, e.g. no:
  • Temporary tables
  • Prepared statements
• Change the kernel to support a command like "set connection_user to xxx", so the Proxy can re-establish a connection without an explicit authentication process
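The eligibility check the Proxy must make before moving a session can be sketched as a predicate over tracked per-connection state; the dataclass fields are illustrative bookkeeping, not the proxy's real data model.

```python
from dataclasses import dataclass, field

@dataclass
class SessionState:
    in_transaction: bool = False
    temp_tables: set = field(default_factory=set)
    prepared_statements: set = field(default_factory=set)

def can_switch_over(s: SessionState) -> bool:
    """A session may be transparently re-established on the new master only
    at a transaction boundary, and only if it holds no server-side state
    the Proxy cannot re-create after reconnecting."""
    return (not s.in_transaction
            and not s.temp_tables
            and not s.prepared_statements)
```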
Transparent switch-over with Proxy
[diagram: the application user stays connected to the proxy layer; after reconnecting to the new master, the Proxy switches the session back to the application user ('change user to app user')]
Read-write auto routing
[diagram: the proxy layer routes the application user's writes to the master and reads to a read-only replica, with HA managing the master/slave pair]
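A naive version of the routing decision the proxy makes. A real proxy parses SQL properly, pins transactions to the master, and handles cases like SELECT ... FOR UPDATE; the verb classification here is deliberately simplistic.

```python
READ_VERBS = {"SELECT", "SHOW"}

def route(sql: str, in_transaction: bool) -> str:
    """Send reads outside explicit transactions to the read-only replica,
    everything else to the master."""
    if in_transaction:
        return "master"          # keep transactional consistency on one node
    tokens = sql.lstrip().split(None, 1)
    verb = tokens[0].upper() if tokens else ""
    return "replica" if verb in READ_VERBS else "master"
```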