Posted on 20-Jan-2016

Computer Architecture II

Lecture 10

Today

• Synchronization for shared-memory multiprocessors
– test&set, LL/SC, array-based locks
– barriers

• Scalable Multiprocessors
– What is a scalable machine?


Synchronization

• Types of synchronization
– Mutual exclusion
– Event synchronization
• point-to-point
• group
• global (barriers)

• All solutions rely on hardware support for an atomic read-modify-write operation

• We look today at synchronization for cache-coherent, bus-based multiprocessors


Components of a Synchronization Event

• Acquire method
– Acquire right to the synch (e.g. enter critical section)

• Waiting algorithm
– Wait for synch to become available when it isn’t
– busy-waiting, blocking, or hybrid

• Release method
– Enable other processors to acquire


Performance Criteria for Synch. Ops

• Latency (time per op)
– especially under light contention

• Bandwidth (ops per sec)
– especially under high contention

• Traffic: load on critical resources
– especially on failures under contention

• Storage
• Fairness


Strawman Lock

lock:   ld  register, location   /* copy location to register */
        cmp location, #0         /* compare with 0 */
        bnz lock                 /* if not 0, try again */
        st  location, #1         /* store 1 to mark it locked */
        ret                      /* return control to caller */

unlock: st  location, #0         /* write 0 to location */
        ret                      /* return control to caller */

Busy-Waiting

Location is initially 0

Why doesn’t the acquire method work?
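The acquire sequence fails because the load, the test, and the store are three separate operations, so two processors can interleave them. A minimal C sketch (names are mine, and it deterministically replays one bad interleaving by hand rather than running real concurrent code):

```c
/* Replay one problematic interleaving of the strawman lock: each statement
 * below is one step executed by processor P0 or P1. */
static int demo_race(void) {
    int location = 0;              /* lock word, initially 0 (free) */

    int r0 = location;             /* P0: ld register, location */
    int r1 = location;             /* P1: ld register, location, before P0 stores */

    int p0_acquires = (r0 == 0);   /* P0: cmp/bnz falls through */
    int p1_acquires = (r1 == 0);   /* P1: cmp/bnz also falls through */

    if (p0_acquires) location = 1; /* P0: st location, #1 */
    if (p1_acquires) location = 1; /* P1: st location, #1 */

    /* Both processors now believe they hold the lock: mutual exclusion is
     * broken because the read-test-write sequence is not atomic. */
    return p0_acquires && p1_acquires;
}
```

This is exactly the gap that the atomic read-modify-write instructions on the next slides close.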


Atomic Instructions

• Specifies a location, register, & atomic operation

– Value in location read into a register

– Another value (function of value read or not) stored into location

• Many variants

– Varying degrees of flexibility in second part

• Simple example: test&set

– Value in location read into a specified register

– Constant 1 stored into location

– Successful if value loaded into register is 0

– Other constants could be used instead of 1 and 0


Simple Test&Set Lock

lock:   t&s register, location   /* atomically test and set */
        bnz lock                 /* if not 0, try again */
        ret                      /* return control to caller */

unlock: st  location, #0         /* write 0 to location */
        ret                      /* return control to caller */

The same lock in pseudocode:

while (not acquired)        /* lock is acquired by another one */
    test&set(location);     /* try to acquire the lock */

• Condition: architecture supports atomic test and set
– Copy location to register and set location to 1

• Problem:
– t&s modifies the variable location in its cache each time it tries to acquire the lock => cache block invalidations => bus traffic (especially under high contention)
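A sketch of the same spinlock with C11 atomics, using atomic_exchange as the test&set analogue (it atomically writes 1 and returns the old value); the thread and iteration counts are my own choices for the demo:

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int location = 0;    /* the lock word, 0 = free */
static long counter = 0;           /* shared data protected by the lock */

static void ts_lock(void)   { while (atomic_exchange(&location, 1) != 0) ; }
static void ts_unlock(void) { atomic_store(&location, 0); }

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        ts_lock();
        counter++;                 /* critical section */
        ts_unlock();
    }
    return NULL;
}

/* Run 4 threads; with a correct lock, no increment is lost. */
long run_ts_demo(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    return counter;
}
```

Every spin iteration here performs a write to the lock word, which is precisely what causes the invalidation storm described above.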


[Figure: time (s) vs. number of processors, for Test&set (c = 0), Test&set with exponential backoff (c = 3.64), Test&set with exponential backoff (c = 0), and Ideal]

T&S Lock Microbenchmark: SGI Challenge

Each processor repeatedly executes: lock; delay(c); unlock;

• Why does performance degrade?
– Bus transactions on t&s


Other read-modify-write primitives

• Fetch&op
– Atomically read, modify (using the op operation), and write a memory location
– E.g. fetch&add, fetch&incr

• Compare&swap
– Three operands: location, register to compare with, register to swap with


Enhancements to Simple Lock

• Problem of t&s: lots of invalidations if the lock cannot be taken
• Reduce frequency of issuing test&sets while waiting
– Test&set lock with exponential backoff

i = 0;
while (!acquired) {          /* lock is acquired by another one */
    test&set(location);
    if (!acquired) {         /* test&set didn't succeed */
        wait(t_i);           /* back off for some time */
        i++;
    }
}

• Fewer invalidations
• May wait longer than necessary
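The backoff loop can be sketched in C as follows; the doubling schedule, the cap, and the crude spin-delay are my own illustrative choices, not from the slides:

```c
#include <stdatomic.h>

static atomic_int lk = 0;   /* lock word, 0 = free */

/* Crude calibrated delay: just burn iterations. */
static void cpu_delay(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; i++) ;
}

void backoff_lock(void) {
    unsigned delay = 1;
    while (atomic_exchange(&lk, 1) != 0) {   /* test&set failed */
        cpu_delay(delay);                    /* back off before retrying */
        if (delay < (1u << 16)) delay <<= 1; /* exponential, capped */
    }
}

void backoff_unlock(void) { atomic_store(&lk, 0); }

int backoff_locked(void) { return atomic_load(&lk); }
```

Doubling the wait after each failure spaces out the invalidating writes; the cap keeps the worst-case release-to-acquire delay bounded.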




Enhancements to Simple Lock

• Reduce frequency of issuing test&sets while waiting
– Test-and-test&set lock

while (!acquired) {          /* lock is acquired by another one */
    if (location == 1)       /* test with ordinary load */
        continue;
    else {
        test&set(location);
        if (acquired)        /* succeeded */
            break;
    }
}

• Keep testing with an ordinary load
– Just a hint: the cached lock variable will be invalidated when release occurs
– If location becomes 0, use t&s to modify the variable atomically
– On failure, start over

• Further reduces bus transactions
– the load produces bus traffic only when the lock is released
– t&s produces bus traffic each time it is executed
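The test-and-test&set idea maps directly onto C11 atomics: spin with plain loads (which hit in the local cache once the line is cached) and only issue the invalidating exchange when the lock looks free. A minimal sketch:

```c
#include <stdatomic.h>

static atomic_int ttas = 0;   /* lock word, 0 = free */

void ttas_lock(void) {
    for (;;) {
        while (atomic_load(&ttas) == 1) ;     /* test: read-only spin */
        if (atomic_exchange(&ttas, 1) == 0)   /* test&set: try to grab it */
            return;                           /* succeeded */
        /* failed: someone beat us to it; back to read-only spinning */
    }
}

void ttas_unlock(void) { atomic_store(&ttas, 0); }

int ttas_locked(void) { return atomic_load(&ttas); }
```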


Lock performance

• t&s
– Latency: low under low contention, high under high contention
– Bus traffic: a lot
– Scalability: poor
– Storage: low (does not increase with processor number)
– Fairness: no

• t&s with backoff
– Latency: low under low contention (as t&s), high under high contention
– Bus traffic: less than t&s
– Scalability: better than t&s
– Storage: low (does not increase with processor number)
– Fairness: no

• t&t&s
– Latency: low under low contention (a little higher than t&s), high under high contention
– Bus traffic: less than t&s and t&s with backoff
– Scalability: better than t&s and t&s with backoff
– Storage: low (does not increase with processor number)
– Fairness: no


Improved Hardware Primitives: LL-SC

• Goals:
– Problem of test&set: generates a lot of bus traffic
– Failed read-modify-write attempts shouldn't generate invalidations
– Nice if a single primitive can implement a range of r-m-w operations

• Load-Locked (or -Linked), Store-Conditional
– LL reads the variable into a register
– Work on the value in the register
– SC tries to store back to the location
– Succeeds if and only if there has been no other write to the variable since this processor's LL
• indicated by a condition flag

• If SC succeeds, all three steps happened atomically
• If it fails, it doesn't write or generate invalidations
– must retry the acquire


Simple Lock with LL-SC

lock:   ll   reg1, location   /* LL location to reg1 */
        bnz  reg1, lock       /* if locked, try again */
        sc   location, reg2   /* SC reg2 (holding 1) into location */
        beqz reg2, lock       /* if SC failed, start again */
        ret

unlock: st   location, #0     /* write 0 to location */
        ret

• Can simulate the atomic ops t&s, fetch&op, compare&swap by changing what’s between LL & SC (exercise)
– Only a couple of instructions, so SC is likely to succeed
– Don’t include instructions that would need to be undone (e.g. stores)

• SC can fail (without putting a transaction on the bus) if:
– It detects an intervening write even before trying to get the bus
– It tries to get the bus but another processor’s SC gets the bus first

• LL and SC are not lock and unlock respectively
– They only guarantee no conflicting write to the lock variable between them
– But they can be used directly to implement simple operations on shared variables
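C has no LL/SC, but atomic compare-and-swap gives the same read-compute-conditionally-store retry pattern, so the exercise above can be sketched with it. Here is fetch&add built that way (the function name is mine; atomic_compare_exchange_weak fails, like SC, if the location changed since the value was read, and it refreshes `old` on failure):

```c
#include <stdatomic.h>

/* fetch&add built from a CAS retry loop, standing in for LL/SC. */
int my_fetch_and_add(atomic_int *loc, int n) {
    int old = atomic_load(loc);                          /* "LL": read value */
    while (!atomic_compare_exchange_weak(loc, &old, old + n))
        ;  /* "SC" failed: old now holds the fresh value, retry */
    return old;                                          /* value before add */
}

atomic_int fa_ctr = 0;   /* demo counter */
```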


Advanced lock algorithms

• Problems with the presented approaches
– Unfair: the order of arrival does not count
– All processors try to acquire the lock when it is released
– Several processes may incur a read miss when the lock is released
• Desirable: only one miss


Ticket Lock

• Draw a ticket with a number, wait until that number is shown
• Two counters per lock (next_ticket, now_serving)
– Acquire: my_ticket = fetch&inc(next_ticket); wait until now_serving == my_ticket
• the atomic op happens when arriving at the lock, not when it’s free (so less contention)
– Release: increment now_serving

• Performance
– low latency under low contention
– O(p) read misses at release, since all spin on the same variable
– FIFO order
– like the simple LL-SC lock, but no invalidation when SC succeeds, and fair
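A minimal C sketch of the ticket lock, with atomic_fetch_add standing in for fetch&inc (type and function names are mine):

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket currently allowed in */
} ticket_lock_t;

void ticket_acquire(ticket_lock_t *l) {
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);  /* fetch&inc on arrival */
    while (atomic_load(&l->now_serving) != my) ;         /* spin until my turn */
}

void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);                /* admit next waiter */
}

ticket_lock_t tl = { 0, 0 };
```

Unsigned wraparound keeps the comparison correct even after the counters overflow, as long as fewer than 2^32 waiters are queued.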


Array-based Queuing Locks

• Waiting processes poll on different locations in an array of size p
– Acquire
• fetch&inc to obtain the address on which to spin (next array element)
• ensure that these addresses are in different cache lines or memories
– Release
• set the next location in the array, thus waking up the process spinning on it
– O(1) traffic per acquire with coherent caches
– FIFO ordering, as in the ticket lock, but O(p) space per lock
– Not so great for non-cache-coherent machines with distributed memory
• the array location I spin on is not necessarily in my local memory
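A hedged sketch of such an array-based (Anderson-style) lock; the slot count, the 64-byte-cache-line padding, and the names are my own assumptions:

```c
#include <stdatomic.h>

#define NSLOTS 8   /* = p, number of processors (illustrative choice) */

typedef struct {
    /* One slot per waiter, padded so each lands in its own cache line
     * (assuming 64-byte lines), so a release invalidates only one spinner. */
    struct { atomic_int go; char pad[60]; } slot[NSLOTS];
    atomic_uint next;                        /* next free slot */
} array_lock_t;

array_lock_t alock = { .slot = { [0] = { .go = 1 } } };  /* slot 0 starts open */

unsigned array_acquire(array_lock_t *l) {
    unsigned me = atomic_fetch_add(&l->next, 1) % NSLOTS; /* fetch&inc */
    while (atomic_load(&l->slot[me].go) == 0) ;           /* spin on my slot */
    atomic_store(&l->slot[me].go, 0);                     /* consume the grant */
    return me;                          /* holder must remember its slot */
}

void array_release(array_lock_t *l, unsigned me) {
    atomic_store(&l->slot[(me + 1) % NSLOTS].go, 1);      /* wake next slot */
}
```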


Lock performance

• t&s
– Latency: low under low contention, high under high contention
– Bus traffic: a lot
– Scalability: poor
– Storage: O(1)
– Fairness: no

• t&s with backoff
– Latency: low under low contention (as t&s), high under high contention
– Bus traffic: less than t&s
– Scalability: better than t&s
– Storage: O(1)
– Fairness: no

• t&t&s
– Latency: low under low contention (a little higher than t&s), high under high contention
– Bus traffic: less (no traffic while waiting)
– Scalability: better than t&s with backoff
– Storage: O(1)
– Fairness: no

• ll/sc
– Latency: low under low contention, better than t&t&s under high contention
– Bus traffic: like t&t&s, plus no traffic on a missed attempt
– Scalability: better than t&t&s
– Storage: O(1)
– Fairness: no

• ticket
– Latency: low under low contention, better than ll/sc under high contention
– Bus traffic: a little less than ll/sc
– Scalability: like ll/sc
– Storage: O(1)
– Fairness: yes (FIFO)

• array
– Latency: low under low contention (like t&t&s), better than ticket under high contention
– Bus traffic: less than ticket
– Scalability: more scalable than ticket (only one processor incurs the miss)
– Storage: O(p)
– Fairness: yes (FIFO)

Transactional memory

Transactional memory benefits

Transactional memory drawbacks

Point-to-Point Event Synchronization

• Software methods:
– Busy-waiting: use ordinary variables as flags
– Blocking: semaphores
– Interrupts

• Full hardware support: a full-empty bit with each word in memory
– Set when the word is “full” with newly produced data (i.e. when written)
– Unset when the word is “empty” due to being consumed (i.e. when read)
– Natural for word-level producer-consumer synchronization
• producer: write if empty, set to full
• consumer: read if full, set to empty
– Hardware preserves read/write atomicity
– Problem: flexibility
• multiple consumers
• multiple updates by a producer
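The full-empty discipline can be emulated in software with a flag next to each word; a hedged sketch (in real full-empty-bit machines the bit lives in hardware per memory word, and the wait and state change are atomic, which this busy-wait sketch only approximates for one producer and one consumer):

```c
#include <stdatomic.h>

typedef struct {
    atomic_int full;   /* 0 = empty, 1 = full */
    int value;
} fe_word_t;

void fe_write(fe_word_t *w, int v) {   /* producer: write if empty, set full */
    while (atomic_load(&w->full)) ;    /* wait until consumed */
    w->value = v;
    atomic_store(&w->full, 1);
}

int fe_read(fe_word_t *w) {            /* consumer: read if full, set empty */
    while (!atomic_load(&w->full)) ;   /* wait until produced */
    int v = w->value;
    atomic_store(&w->full, 0);
    return v;
}

fe_word_t word = { 0, 0 };
```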


Barriers

• Hardware barriers
– Wired-AND line separate from the address/data bus
• Set input to 1 when you arrive, wait for the output to be 1 to leave
– Useful when barriers are global and very frequent
– Difficult to support an arbitrary subset of processors
• even harder with multiple processes per processor
– Difficult to dynamically change the number and identity of participants
• e.g. the latter due to process migration
– Not common today on bus-based machines

• Software algorithms implemented using locks, flags, counters


A Simple Centralized Barrier

• Shared counter maintains the number of processes that have arrived
– increment when arriving (under a lock), check until it reaches numprocs
– Problem?

struct bar_type {
    int counter;
    struct lock_type lock;
    int flag = 0;
} bar_name;

BARRIER(bar_name, p) {
    LOCK(bar_name.lock);
    if (bar_name.counter == 0)
        bar_name.flag = 0;            /* reset flag if first to reach */
    mycount = ++bar_name.counter;     /* mycount is private */
    UNLOCK(bar_name.lock);
    if (mycount == p) {               /* last to arrive */
        bar_name.counter = 0;         /* reset for next barrier */
        bar_name.flag = 1;            /* release waiters */
    } else
        while (bar_name.flag == 0) {} /* busy-wait for release */
}


A Working Centralized Barrier

• Entering the same barrier consecutively doesn’t work
– Must prevent a process from entering until all have left the previous instance
– Could use another counter, but that increases latency and contention

• Sense reversal: wait for the flag to take a different value in consecutive instances
– Toggle this value only when all processes have reached the barrier

BARRIER(bar_name, p) {
    local_sense = !(local_sense);     /* toggle private sense variable */
    LOCK(bar_name.lock);
    mycount = bar_name.counter++;     /* mycount is private */
    if (bar_name.counter == p) {      /* last to arrive */
        UNLOCK(bar_name.lock);
        bar_name.counter = 0;
        bar_name.flag = local_sense;  /* release waiters */
    } else {
        UNLOCK(bar_name.lock);
        while (bar_name.flag != local_sense) {}
    }
}
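The sense-reversal scheme can be exercised with real threads. A hedged C11 sketch (for brevity it replaces the LOCK/UNLOCK pair around the counter with a single fetch&inc; the sense-reversal logic itself is unchanged, and the constants are my own):

```c
#include <pthread.h>
#include <stdatomic.h>

enum { P = 4, ROUNDS = 100 };

static atomic_int counter = 0;   /* arrivals in current instance */
static atomic_int flag = 0;      /* global sense */
static atomic_long sum = 0;      /* work done across all rounds */

static void barrier_wait(int *local_sense) {
    *local_sense = !*local_sense;                 /* toggle private sense */
    if (atomic_fetch_add(&counter, 1) == P - 1) { /* last to arrive */
        atomic_store(&counter, 0);                /* reset for next instance */
        atomic_store(&flag, *local_sense);        /* release waiters */
    } else {
        while (atomic_load(&flag) != *local_sense) ;  /* busy-wait */
    }
}

static void *worker(void *arg) {
    (void)arg;
    int local_sense = 0;                          /* private per thread */
    for (int r = 0; r < ROUNDS; r++) {
        atomic_fetch_add(&sum, 1);                /* this round's "work" */
        barrier_wait(&local_sense);               /* everyone finishes round r */
    }
    return NULL;
}

/* Returns P * ROUNDS iff no thread ever slipped into the next instance early
 * and no one deadlocked. */
long run_barrier_demo(void) {
    pthread_t t[P];
    for (int i = 0; i < P; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < P; i++) pthread_join(t[i], NULL);
    return atomic_load(&sum);
}
```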


Centralized Barrier Performance

• Latency
– critical path length at least proportional to p (the accesses to the critical section are serialized by the lock)

• Traffic
– p bus transactions to obtain the lock
– p bus transactions to modify the counter
– 2 bus transactions for the last processor to reset the counter and release the waiting processes
– p-1 bus transactions for the first p-1 processors to read the flag

• Storage cost
– very low: centralized counter and flag

• Fairness
– the same processor should not always be last to exit the barrier

• Key problems for the centralized barrier are latency and traffic
– especially with distributed memory, where traffic goes to the same node


Improved Barrier Algorithms for a Bus

Software combining tree
• Only k processors access the same location, where k is the degree of the tree (k = 2 in the example below)

[Figure: flat arrangement (contention) vs. tree-structured combining (little contention)]

– Separate arrival and exit trees, and use sense reversal
– Valuable in a distributed network: communicate along different paths
– On a bus, all traffic goes on the same bus, and there is no less total traffic
– Higher latency (log p steps of work, and O(p) serialized bus transactions)
– The advantage on a bus is the use of ordinary reads/writes instead of locks


Scalable Multiprocessors


Scalable Machines

• Scalability: capability of a system to grow by adding processors, memory, and I/O devices

• 4 important aspects of scalability
– bandwidth increases with the number of processors
– latency does not increase, or increases slowly
– cost increases slowly with the number of processors
– physical placement of resources


Limited Scaling of a Bus

• Small configurations are cost-effective

Characteristic              Bus
Physical length             ~ 1 ft
Number of connections       fixed
Maximum bandwidth           fixed
Interface to comm. medium   extended memory interface
Global order                arbitration
Protection                  virtual -> physical
Trust                       total
OS                          single
Comm. abstraction           HW


Workstations in a LAN?

• No clear limit to physical scaling, little trust, no global order

• Independent failure and restart

Characteristic              Bus                       LAN
Physical length             ~ 1 ft                    km
Number of connections       fixed                     many
Maximum bandwidth           fixed                     ???
Interface to comm. medium   memory interface          peripheral
Global order                arbitration               ???
Protection                  virtual -> physical       OS
Trust                       total                     none
OS                          single                    independent
Comm. abstraction           HW                        SW


Bandwidth Scalability

• Bandwidth limitation: a single set of wires
• Must have many independent wires (remember bisection width?) => switches

[Figure: four processor-memory nodes connected through switches S; typical switches: bus, multiplexers, crossbar]


Dancehall MP Organization

• Network bandwidth demand: scales linearly with the number of processors
• Latency: increases with the number of switch stages (remember the butterfly?)
– Adding local memory would offer fixed latency

[Figure: dancehall organization, processors with caches on one side of a scalable switched network, memories on the other]


Generic Distributed Memory Multiprocessor

• Most common structure

[Figure: nodes, each with processor, cache, memory, and communication assist (CA), connected by a scalable switched network]


Bandwidth scaling requirements

• Large number of independent communication paths between nodes: many concurrent transactions using different wires
• Independent transactions
• No global arbitration
• The effect of a transaction is visible only to the nodes involved
– Broadcast is difficult (it was easy on a bus): additional transactions are needed


Latency Scaling

T(n) = Overhead + Channel Time (Channel Occupancy) + Routing Delay + Contention Time

• Overhead: processing time in initiating and completing a transfer
• Channel Time(n) = n/B for an n-byte transfer over a channel of bandwidth B
• RoutingDelay(h, n): a function of the number of hops h
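Plugging numbers into the model makes the terms concrete; the figures below are my own hypothetical values, not from the slides, and contention time is assumed zero:

```c
/* T(n) = Overhead + n/B + RoutingDelay(h, n), modeling the routing delay as
 * a fixed per-hop cost d, so RoutingDelay(h, n) = h * d. */
double latency_us(double n_bytes, double overhead_us,
                  double bw_bytes_per_us, int hops, double per_hop_us) {
    return overhead_us                  /* software/hardware overhead */
         + n_bytes / bw_bytes_per_us    /* channel time n/B */
         + hops * per_hop_us;           /* routing delay */
}
```

For a 1000-byte transfer with 10 us overhead, 100 bytes/us channel bandwidth, and 5 hops at 0.5 us each, this gives 10 + 10 + 2.5 = 22.5 us, showing how overhead can dominate small transfers while n/B dominates large ones.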


Cost Scaling

• Cost(p, m) = fixed cost + incremental cost(p, m)
• Bus-based SMP
– add more processors and memory
• Scalable machines
– processors, memory, network

• Parallel efficiency(p) = Speedup(p) / p
• Costup(p) = Cost(p) / Cost(1)
• Cost-effective: Speedup(p) > Costup(p)


Cost Effective?

• 2048 processors: 475-fold speedup at 206x cost

[Figure: speedup and costup vs. number of processors (0 to 2048), with Speedup = P/(1 + log P) and Costup = 1 + 0.1 P]


Physical Scaling

• Chip-level integration
– Multicore
– Cell

• Board level
– Several multicores on a board

• System level
– Clusters, supercomputers


Chip-level integration: nCUBE/2

• Network integrated onto the chip: 14 bidirectional links => up to 8192 nodes
• Entire machine synchronous at 40 MHz

[Figure: single-chip node (MMU, I-fetch & decode, 64-bit integer and IEEE floating-point execution unit, operand cache, DRAM interface, DMA channels, router), basic module, and a 1024-node hypercube network configuration]


Chip-level integration: Cell

• PPE (Power Processing Element)
• 3.2 GHz
• Synergistic Processing Elements


Board level integration: CM-5

• Use standard microprocessor components
• Scalable network interconnect

[Figure: machine organization with data, control, and diagnostics networks connecting processing partitions, control processors, and an I/O partition; node: SPARC processor with FPU, cache, and network interface on an MBUS, plus DRAM controllers and optional vector units]


System Level Integration

• Loose packaging
• IBM SP-2
• Cluster blades

[Figure: IBM SP-2 node: POWER2 CPU with L2 cache, memory controller, and 4-way interleaved DRAM on a memory bus; network interface card with i860 and DMA on the MicroChannel I/O bus; nodes joined by a general interconnection network formed from 8-port switches]


Roadrunner

• Next-generation supercomputer to be built at the Los Alamos National Laboratory in New Mexico
• 1 petaflops; funded by the US Department of Energy
• Hybrid design
– more than 16,000 AMD Opteron cores (~2200 IBM x3755 4U servers, each holding four dual-core Opterons, connected by InfiniBand)
– a comparable number of Cell microprocessors
– Red Hat Linux operating system
• When completed (expected to be operational in 2008), it will be the world's most powerful computer and will cover approximately 12,000 square feet (1,100 square meters)
• Mission: simulating how nuclear materials age and whether the aging nuclear weapon arsenal of the United States is safe and reliable
