TRANSCRIPT
Supporting Multi-Processors
Bernard Wong, February 17, 2003
Uni-processor systems
• Began with uni-processor systems
• Simple to implement a uni-processor OS; allows for many assumptions: UMA, efficient locks (small impact on throughput), straightforward cache coherency
• Hard to make faster
Small SMP systems
• Multiple symmetric processors
• Requires some modifications to the OS
• Still allows for UMA
• System/memory bus becomes a contended resource
• Locks have a larger impact on throughput
  • e.g., a lock held by one process can block another process (running on another processor) from making progress
• Must introduce finer-grained locks to improve scalability
• System bus limits system size
Large Shared Memory Multi-processor
• Consists of many nodes, each of which may be a uni-processor or an SMP
• Access to memory is often NUMA, and sometimes cache coherency is not even provided
• Performance is very poor with an off-the-shelf SMP OS
• Requirements for good performance: locality of service to request, and independence between services
DISCO
• Uses a Virtual Machine Monitor to run multiple commodity OSes on a scalable multi-processor
Virtual Machine Monitor
• Additional layer between the OS and the hardware
• Virtualizes the processor, memory, and I/O
• OS unaware of virtualization (ideally)
• Exports a simple, general interface to the commodity OS
DISCO Architecture
[Architecture diagram: commodity OSes (an OS, an SMP-OS, a thin OS) run in virtual machines on top of DISCO, which runs on the processing elements (PE) of a ccNUMA multiprocessor joined by an interconnect]
Implementation Details: Virtual CPUs
• Uses direct execution on the real CPU
  • Fast; most instructions run at native speed
• Must detect and emulate operations that cannot be safely exported to the VM (see the trap handler sketch below)
  • Primarily privileged instructions: TLB modification, direct physical memory or I/O operations
• Must also keep a data structure to save registers and other state
  • For when a virtual CPU is not scheduled on a real CPU
• Virtual CPUs use affinity scheduling to maintain cache locality
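In outline, the detect-and-emulate path is a trap handler. The sketch below is a minimal illustration of the idea; the VCpu fields, PrivOp names, and onPrivilegedTrap hook are assumptions for exposition, not DISCO's actual MIPS-specific code:

    #include <cstdint>

    // Illustrative saved state for a virtual CPU (used while it is not
    // scheduled on a real CPU, and while emulating trapped instructions).
    struct VCpu {
        uint64_t regs[32];    // general-purpose registers
        uint64_t pc;          // program counter
        bool     kernelMode;  // the VM's virtual privilege level
    };

    enum class PrivOp { TlbWrite, PhysMemAccess, IoAccess };

    // Called when direct execution traps on a privileged instruction:
    // the monitor applies the instruction's effect to the virtual CPU
    // state instead of letting it touch the real hardware.
    void onPrivilegedTrap(VCpu& vcpu, PrivOp op) {
        switch (op) {
        case PrivOp::TlbWrite:
            // rewrite the entry with a physical-to-machine translation
            break;
        case PrivOp::PhysMemAccess:
        case PrivOp::IoAccess:
            // redirect to the VM's machine memory / a virtual device
            break;
        }
        vcpu.pc += 4;  // skip the emulated instruction (fixed-width ISA assumed)
    }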
Implementation Details: Virtual Physical Memory
• Adds a level of address translation: maintains physical-to-machine address mappings
  • Needed because each VM uses physical addresses that start at 0 and continue up to the size of the VM's memory
• Performed by emulating TLB instructions (sketched below)
  • When the OS tries to insert an entry into the TLB, DISCO intercepts it and inserts the translated version
• TLB flushed on virtual CPU switches
  • TLB lookups are also more expensive due to the required trap
  • A second-level software TLB is added to improve performance
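Concretely, the intercepted TLB write might look like the following; the pmap table and TlbEntry layout are assumptions for illustration, not DISCO's real data structures:

    #include <cstdint>
    #include <unordered_map>

    struct TlbEntry { uint64_t virtPage, framePage; bool writable; };

    // Per-VM physical-to-machine page mapping (the extra translation level).
    std::unordered_map<uint64_t, uint64_t> pmap;

    // The guest OS supplies a "physical" frame (0..VM size); the monitor
    // swaps in the corresponding machine frame before the entry reaches
    // the hardware TLB.
    TlbEntry rewriteTlbEntry(TlbEntry guest) {
        guest.framePage = pmap.at(guest.framePage);  // physical -> machine
        return guest;
    }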
Implementation Details: Virtual I/O
• Intercepts all device accesses from the VM through special OS device drivers
• Virtualizes both disk and network I/O
• DISCO provides both persistent and non-persistent disks
  • Persistent disks cannot be shared
  • Non-persistent disks are implemented via copy-on-write (sketched below)
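A non-persistent disk's copy-on-write behavior can be pictured as below; CowDisk and its private block map are illustrative assumptions, not DISCO's implementation:

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>

    constexpr std::size_t kBlockSize = 512;
    using Block = std::array<uint8_t, kBlockSize>;

    struct CowDisk {
        const Block* base;                         // shared read-only base image
        std::unordered_map<uint64_t, Block> priv;  // this VM's modified sectors

        // Reads hit the private copy if one exists, else the shared base.
        const Block& read(uint64_t sector) const {
            auto it = priv.find(sector);
            return it != priv.end() ? it->second : base[sector];
        }

        // The first write to a sector copies it into private storage,
        // leaving the shared base image untouched.
        void write(uint64_t sector, const Block& data) { priv[sector] = data; }
    };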
Why use a VMM?
• DISCO is aware of the NUMA-ness, but hides it from the commodity OS
• Requires less work than engineering a NUMA-aware OS
• Performs better than a NUMA-unaware OS
• A good middle ground
• How? Dynamic page migration and page replication
  • Maintain locality between a virtual CPU's cache misses and the memory pages on which those misses occur
Memory Management
• Pages heavily accessed by only one node are migrated to that node (see the sketch below)
  • Change the physical-to-machine address mapping
  • Invalidate TLB entries that point to the old location
  • Copy the page to the local node's machine memory
• Pages that are heavily read-shared are replicated to the nodes accessing them heavily
  • Downgrade TLB entries pointing to the page to read-only
  • Copy the page
  • Update the relevant TLB entries to point at the local machine copy (replicas stay read-only)
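Migration then reduces to a remap plus a TLB shootdown, roughly as below; the helper functions are hypothetical and pmap is the physical-to-machine map from the earlier sketch:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <unordered_map>

    extern std::unordered_map<uint64_t, uint64_t> pmap;  // physical -> machine pages
    void invalidateTlbEntries(uint64_t machinePage);     // hypothetical TLB shootdown
    uint64_t allocPageOnNode(int node);                  // hypothetical local allocator
    void* machineAddr(uint64_t machinePage);             // hypothetical address lookup

    // Migrate a hot page to the node generating the cache misses.
    void migratePage(uint64_t physPage, int hotNode, std::size_t pageSize) {
        uint64_t oldM = pmap[physPage];
        uint64_t newM = allocPageOnNode(hotNode);  // page in local machine memory
        invalidateTlbEntries(oldM);                // drop mappings to the old location
        std::memcpy(machineAddr(newM), machineAddr(oldM), pageSize);
        pmap[physPage] = newM;                     // change the phys-to-machine mapping
    }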
Page Replication
Aren't VMs memory inefficient?
• Traditionally, VMs tend to replicate the memory used for each system image
• Additionally, structures such as the disk cache are not shared
• DISCO uses the notion of a global buffer cache to reduce the memory footprint
Page sharing
• DISCO keeps a data structure that maps disk sectors to memory addresses (sketched below)
• If two VMs request the same disk sector, both are assigned the same read-only buffer page
• Modifications to pages are performed via copy-on-write
• This only works for non-persistent copy-on-write disks
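The sector map can be pictured as follows; the names and the doDiskRead stub are illustrative assumptions:

    #include <cstdint>
    #include <unordered_map>

    // Stub standing in for a real disk read into a fresh machine page.
    uint64_t doDiskRead(uint64_t /*sector*/) { static uint64_t next = 0; return next++; }

    // Global map from disk sector to the machine page caching it.
    std::unordered_map<uint64_t, uint64_t> sectorToPage;

    // A second VM reading the same sector is handed the existing page,
    // mapped read-only, instead of a duplicate; a later write triggers
    // copy-on-write.
    uint64_t readSector(uint64_t sector) {
        auto it = sectorToPage.find(sector);
        if (it != sectorToPage.end())
            return it->second;               // share the cached page
        uint64_t page = doDiskRead(sector);
        sectorToPage[sector] = page;
        return page;
    }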
Page sharing
Page sharing
• Sharing is effective even across network packets, e.g., when sharing data over NFS
Virtualization overhead
Data sharing
Workload scalability
Performance Benefits of Page Migration/Replication
Tornado
• An OS designed to take advantage of shared memory multi-processors
• Object-oriented structure
  • Every virtual and physical resource is represented by an independent object
  • Ensures natural locality and independence
    • A resource's lock and data structures are stored on the same node as the resource
    • Resources are managed independently and at a fine grain
    • No global source of contention
OO structure
• Example: a page fault
• Separate File Cache Manager (FCM) objects for different regions of memory
• COR -> Cached Object Representative
• All objects are specific to either the faulting process or the file(s) backing the process
• Problem: hard to implement global policies
Clustered objects
• Even with OO, widely shared objects can be expensive due to contention
• Need replication, distribution, and partitioning to reduce contention
• Clustered objects are a systematic way to do this
• Give the illusion of a single object, but are actually composed of multiple component ("rep") objects
• Each component handles a subset of the processors
• Must handle consistency across reps
Clustered objects
Clustered object implementation
• Per-processor translation table (see the sketch below)
  • Contains a pointer to the local rep of each clustered object
  • Entries are created on demand via a combination of a global miss-handling object and a clustered-object-specific miss-handling object
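A minimal sketch of that lookup path, with Rep, ClusteredObj, and createRepFor as illustrative stand-ins for Tornado's actual classes:

    #include <cstddef>
    #include <vector>

    struct Rep { /* one component: state and methods for a processor subset */ };

    struct ClusteredObj {
        std::vector<Rep*> table;  // per-processor translation entries

        explicit ClusteredObj(std::size_t nProcs) : table(nProcs, nullptr) {}

        // Return this processor's rep; on a miss, an object-specific
        // handler materializes it on demand, so reps exist only on
        // processors that actually use the object.
        Rep& repFor(std::size_t proc) {
            if (!table[proc])
                table[proc] = createRepFor(proc);
            return *table[proc];
        }

        // Stand-in for the clustered-object-specific miss handler.
        Rep* createRepFor(std::size_t /*proc*/) { return new Rep(); }
    };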
Memory Allocation
• Need an efficient, highly concurrent allocator that maximizes locality
• Use local pools of memory
• However, for small block allocations there is still the problem of false sharing
• An additional small pool of strictly local memory is used (sketched below)
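The strictly local pool might look like the sketch below; the sizes and names are assumptions for illustration:

    #include <cstddef>
    #include <cstdint>

    // Per-processor pool for small blocks. Because only the owning
    // processor ever touches memory from this pool, two blocks that
    // happen to share a cache line cannot cause false sharing between
    // processors.
    struct LocalPool {
        uint8_t     buf[1 << 16];  // strictly processor-local backing memory
        std::size_t used = 0;

        void* allocSmall(std::size_t n) {
            if (used + n > sizeof(buf)) return nullptr;  // caller falls back
            void* p = buf + used;
            used += n;
            return p;
        }
    };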
Synchronization
• The use of objects, and additionally of clustered objects, reduces the scope of each lock and limits lock contention to that of a single rep
• Existence guarantees are hard
  • A thread must determine whether an object is currently being de-allocated by another thread
  • This often requires a lock hierarchy whose root is a global lock
• Tornado instead uses a semi-automatic garbage collector (sketched below)
  • Threads never need to test for existence, so no locking is required
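The existence guarantee can be pictured as a quiescence-based deferred free; the counter scheme below is an illustrative assumption, not Tornado's exact collector:

    #include <atomic>
    #include <cstddef>

    constexpr std::size_t kProcs = 4;  // assumed processor count

    // Per-processor count of in-flight kernel operations.
    std::atomic<long> inFlight[kProcs];

    void opBegin(std::size_t p) { inFlight[p].fetch_add(1); }
    void opEnd(std::size_t p)   { inFlight[p].fetch_sub(1); }

    // Free an object already unlinked from all tables: once every
    // processor has drained the operations started before unlinking,
    // no thread can still hold a reference, so no existence check or
    // lock is ever needed on the fast path.
    template <typename T>
    void deferredFree(T* obj) {
        for (std::size_t p = 0; p < kProcs; ++p)
            while (inFlight[p].load() != 0) { /* real code defers, not spins */ }
        delete obj;
    }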
Protected Procedure Calls
• Since Tornado is a microkernel, IPC traffic is significant
• Need a fast IPC mechanism that maintains locality
• Protected Procedure Calls (PPCs) maintain locality by:
  • Spawning a new server thread on the same processor as the client to service the client's request
  • Keeping all client-specific data in data structures stored on the client's processor
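The locality property can be sketched as below; the types and the handler are illustrative assumptions, not Tornado's interface:

    struct Request { int op; };
    struct Reply   { int status; };
    struct PerClientState { /* kept on the client's processor */ };

    // Stand-in for the server's entry point.
    Reply serverHandler(PerClientState&, const Request& req) { return Reply{req.op}; }

    // A PPC behaves like a cross-address-space local call: the server
    // thread is spawned on the *caller's* processor, and per-client data
    // lives there too, so neither the request nor the state it touches
    // crosses to a remote CPU.
    Reply ppc(PerClientState& local, const Request& req) {
        // conceptually: protection-domain switch, then run the server here
        return serverHandler(local, req);
    }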
Protected Procedure Calls
Performance: Comparison to other large shared-memory multi-processors
Performance (n threads in 1 process)
Performance (n threads in n processes)
Conclusion
• Illustrated two different approaches to making efficient use of shared memory multi-processors
• DISCO adds an extra layer between the hardware and the OS
  • Less engineering effort, but more overhead
• Tornado redesigns the OS to take advantage of locality and independence
  • More engineering effort and less overhead, but local and independent algorithms may work poorly with real-world loads