TRANSCRIPT
Advanced Computer Architecture
Week 1: Introduction
ECE 154B
Dmitri Strukov
1
Outline
• Course information
• Trends (in technology, cost, performance) and issues
2
Course organization
• Class website: http://www.ece.ucsb.edu/~strukov/ece154bWinter2015/home.htm
• Instructor office hours: Wed, 2:00 pm – 4:00 pm
• Teaching Assistant: David McCarthy
office hours: Friday 2:00 pm – 4:00 pm
email: [email protected]
3
Textbook
• Computer Architecture: A Quantitative Approach, John L. Hennessy and David A. Patterson, Fifth Edition, Morgan Kaufmann, 2012, ISBN: 978-0-12-383872-8
• Modern Processor Design: Fundamentals of Superscalar Processors, John Paul Shen and Mikko H. Lipasti, Waveland Press, 2013, ISBN: 978-1-47-860783-0
4
Class topics and tentative schedule
• Computer fundamentals (historical trends, performance metrics) – 1 week
• Memory hierarchy design – 2 weeks
• Instruction level parallelism (static and dynamic scheduling, speculation) – 2 weeks
• Data level parallelism (vector, SIMD and GPUs) – 2 weeks
• Thread level parallelism (shared-memory architectures, synchronization and cache coherence) – 2 weeks
• Warehouse-scale computers, or detailed analysis of some specific uP – 1 week
5
Ultimate goal of the class
• To get intuition behind main techniques for improving performance
• To understand advanced microprocessors such as
- ARM Cortex A8
- Intel Core i7
- Tesla GPU
6
5-STAGE MIPS PIPELINE
This is what you are supposed to know!
7
This is what we learn in this class!
Grading
• Projects: 70%
• Final: 30%
• Project course work will involve program performance analysis and architectural optimizations for superscalar processors using SimpleScalar simulation tools
• A number of problems will be assigned before the final (but not graded)
8
Course prerequisites
• ECE 154A or equivalent
9
A bit of history: ENIAC - Electronic Numerical Integrator And Computer, 1946
10
VLSI Developments
1946: ENIAC electronic numerical integrator and computer
• Floor area – 140 m^2
• Performance – multiplication of two 10-digit numbers in 2 ms
2011: High performance microprocessor
• Chip area – 100-400 mm^2 (for multi-core)
• Board area – 200 cm^2; improvement of 10^4
• Performance – 64-bit multiply in a few ns; improvement of 10^6
11
Computer trends: Performance of a (single) processor
12
The next series of questions is centered around understanding this important graph
Question
• Q1: What is the performance shown on the figure and how do we define it?
– A1a: Performance is typically related to how fast a certain task can be executed, i.e. the reciprocal of execution time:
Performance = 1 / ExecTime
ExecTime = IC * CCT * <CPI>
• Wall clock time: includes all system overheads
• CPU time: only computation time
– A1b: There are many different metrics of performance today because of the different applications of uPs
– What kind of metrics?
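The performance equation above can be sketched directly in code; the instruction count, CPI and clock cycle time below are illustrative values, not measurements:

```python
# Sketch of the slide's performance equation:
#   ExecTime = IC * CCT * <CPI>,  Performance = 1 / ExecTime
# The numbers are hypothetical, for illustration only.

def exec_time(ic, cpi, cct_ns):
    """Execution time in seconds: instruction count x average CPI x cycle time."""
    return ic * cpi * cct_ns / 1e9

# Assumed program: 1e9 instructions, average CPI of 1.5, 0.5 ns cycle (2 GHz)
t = exec_time(1e9, 1.5, 0.5)
performance = 1.0 / t    # Performance = 1 / ExecTime

print(t)            # 0.75 (seconds)
print(performance)
```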
13
Measuring Performance
• Typical performance metrics:
– Execution time (or latency)
– Throughput
• Q2: How is throughput related to latency?
– A2: In general these are two different concepts. Throughput can be improved by providing more parallelism, but also by reducing latency. For example, with no parallelism, throughput is inversely proportional to latency.
– Energy
• Q3: Is the energy metric the same as the power consumption one?
– A3: Power = energy / time, so in general it is the same metric only when the execution time is the same.
– Response time
• The typical way to measure performance is to run a benchmark (i.e. a collection of applications representative of the tested hardware's workload)
– Kernels (e.g. matrix multiply)
– Toy programs (e.g. sorting)
– Synthetic benchmarks (e.g. Dhrystone)
– Benchmark suites (e.g. SPEC06fp, TPC-C)
• Speedup of X relative to Y = Execution time_Y / Execution time_X
14
Bandwidth vs. Latency
• Bandwidth or throughput– Total work done in a given time
– 10,000-25,000X improvement for processors
– 300-1200X improvement for memory and disks
• Latency or response time– Time between start and completion of an event
– 30-80X improvement for processors
– 6-8X improvement for memory and disks
15
Computer trends: Performance of a (single) processor
16
Questions:
• Reasons behind performance improvement?
• Q4: Why was it improving originally (from ~1978 to ~1984 on the figure)?
– A4: Moore's law and the resulting increase in clock frequency
17
18
CMOS improvements:
• Transistor density: 4x / 3 yrs
• Die size: 10-25% / yr
Scaling with Feature Size
(for short channel devices, before running into leakage problems)
19
Let's
1) scale all the dimensions of the transistors and wires down by a factor of s
and
2) scale the supply voltage V down by a factor of s (together with the threshold voltage Vth).
Then
• Density: ~ s^2
• Logic gate capacitance Cgate (traditionally the dominating parasitic): ~ 1/s
• Saturation current ION: ~ 1/s
• Gate delay Tgate: ~ Cgate*V/ION = 1/s
• Clock frequency: ~ s, i.e. it is inversely proportional to gate delay. The clock cycle time is typically ten or more logic gate delays.
See, e.g., page 124 of Digital Integrated Circuits by Jan Rabaey et al., 2nd edition
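The ideal scaling rules above compose mechanically, which a short sketch can check; the exponents follow the slide, and the choice of s = 2 (one full halving of dimensions) is only an example:

```python
# Ideal (Dennard) scaling sketch: dimensions and voltage scale down by s.
# From the slide: density ~ s^2, Cgate ~ 1/s, ION ~ 1/s,
# Tgate ~ Cgate*V/ION ~ 1/s, f ~ s. Dynamic power per gate is then
# Cgate*V^2*f ~ 1/s^2, so power per unit area stays constant.

def dennard_scaling(s):
    density = s ** 2
    c_gate = 1 / s
    v = 1 / s
    i_on = 1 / s
    t_gate = c_gate * v / i_on        # = 1/s
    f = 1 / t_gate                    # = s
    p_gate = c_gate * v ** 2 * f      # dynamic power per gate = 1/s^2
    p_density = p_gate * density      # power per unit area ~ 1
    return {"density": density, "f": f, "t_gate": t_gate,
            "p_gate": p_gate, "p_density": p_density}

r = dennard_scaling(2.0)   # example: halve all dimensions
print(r["density"], r["f"], r["p_density"])   # 4.0 2.0 1.0
```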
Frequency Scaling with Feature Size
20
• If s is the scaling factor, then density scales as s^2
• Voltage V: 1/s
• Logic gate capacitance C (traditionally dominating): ~ 1/s
• Saturation current ION: ~ 1/s
• Gate delay: ~ CV/ION = 1/s
Computer trends: Performance of a (single) processor
21
Question:
• Q5: Reasons behind further performance improvement? What happened in 1986?
– A5: The move from CISC to RISC, which enabled additional architectural improvements (see next slide)
Review: Dimensions of ISA
(1) Class of ISA: register-memory vs load-store
(2) Memory addressing: byte addressable
(3) Addressing modes (what the operands are and how memory is addressed): registers, immediate, displacement, indirect, indexed, absolute
(4) Types and sizes of operands: byte, half-word, word
(5) Operations: data transfer, arithmetic logical, control and fp
(6) Control flow instructions: conditional branches, unconditional jumps, returns
(7) Encoding an ISA: variable versus fixed length
22
Question:
• Reasons behind performance improvement? What happened in 1986? – CISC to RISC
– Q6: How are the terms in the performance equation affected by this move, and in particular how are they affected by pipelining?
– A6:
23
Design                      | InstCount | CPI                              | CCT
Single cycle (SC)           | 1         | 1                                | 1
Multi cycle (MC)            | 1         | N ≥ CPI > 1 (closer to N than 1) | > 1/N
Multi cycle pipelined (MCP) | 1         | > 1                              | > 1/N
ExecTime = IC * CCT * <CPI>
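Plugging illustrative numbers into the table gives a feel for the three designs; the multi-cycle <CPI> of 4.2 and the stage count N = 5 below are assumptions, and time is normalized to the single-cycle clock period:

```python
# Illustrative comparison of the three designs using ExecTime = IC * CPI * CCT.
# Time units are normalized to the single-cycle clock period (CCT_SC = 1).
# The multi-cycle <CPI> of 4.2 and N = 5 stages are assumed for illustration.

IC = 1000     # instruction count
N = 5         # pipeline stages / max steps per instruction

t_sc = IC * 1 * 1                 # single cycle: CPI = 1, CCT = 1
t_mc = IC * 4.2 / N               # multi cycle: assumed <CPI> = 4.2, CCT ~ 1/N
cycles_mcp = N + IC - 1           # fill the pipe once, then 1 instruction/cycle
t_mcp = cycles_mcp / N            # CPI = cycles/IC > 1, CCT ~ 1/N

print(t_sc, t_mc, t_mcp)   # 1000 840.0 200.8
```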
Question:
• Pipelining improves performance (instructions per cycle, with respect to a non-pipelined multi-cycle processor, by overlapping instructions)
• It is one kind of instruction level parallelism (ILP)
• Q7: Problems with improving ILP? What are the problems in pipelines?
– A7a: The clock cycle is determined by the slowest component
» What is typically the slowest component? Memory
– A7b: Data and control hazards (pipeline stalls and flushes)
• Further improvement in ILP?
– A7c: Limited parallelism in ILP
24
“Memory Wall” problem
25
• A DRAM access (main memory) can take hundreds of cycles
• The memory hierarchy comes to the rescue to alleviate the problem
– We will spend much time later in class reviewing advanced techniques for reducing the effective access time to main memory
Bandwidth and Latency
Log-log plot of bandwidth and latency milestones
Performance Milestones
• Processor: ‘286, ‘386, ‘486, Pentium, Pentium Pro, Pentium 4 (21x,2250x)
• Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x,1000x)
• Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x,120x)
• Disk : 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
CPU high, memory low ("Memory Wall")
Bandwidth is much easier to improve – why?
27
ILP techniques
Summary of Trends in Technology
• Integrated circuit technology
– Transistor density: 35%/year
– Die size: 10-20%/year
– Integration overall: 40-55%/year
• DRAM capacity: 25-40%/year (slowing)
• Flash capacity: 50-60%/year
– 15-20X cheaper/bit than DRAM
• Magnetic disk technology: 40%/year
– 15-25X cheaper/bit than Flash
– 300-500X cheaper/bit than DRAM
29
Computer trends: Performance of a (single) processor
30
The area of a high performance chip has been close to ~1 cm^2 – why?
Question:
• Q8: Why did the die size only grow by 10% / year?
– The performance of a single processor could be improved by using more hardware (larger caches, more sophisticated branch prediction, etc.)
31
Drawing a single-crystal Si ingot from the furnace... then slicing it into wafers and patterning them...
8” MIPS64 R20K wafer (564 dies)
Trends in Cost
• Cost driven down by the learning curve
– Yield
• DRAM: price closely tracks cost
• Microprocessors: price depends on volume
– 10% less for each doubling of volume
32
IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield
Die cost = Wafer cost / (Dies per wafer * Die yield)
Final test yield: fraction of packaged dies which pass the final testing stage
Die yield: fraction of good dies on a wafer
Integrated Circuits Costs
33
Defects per unit area = 0.016-0.057 defects per square cm (2010)
N = process-complexity factor = 11.5-15.5 (40 nm, 2010)
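The cost and yield formulas above can be sketched using the textbook's (Hennessy & Patterson, 5th ed.) die-yield model with the slide's 2010 parameters; the wafer cost and diameter below are assumptions for illustration:

```python
import math

# Die-cost sketch. Defect density (0.016 /cm^2) and N = 11.5 are from the
# slide; the $5000 wafer cost and 30 cm wafer diameter are assumed.

def dies_per_wafer(wafer_diam_cm, die_area_cm2):
    # Wafer area over die area, minus dies lost along the circular edge
    r = wafer_diam_cm / 2
    return (math.pi * r**2 / die_area_cm2
            - math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2))

def die_yield(defects_per_cm2, die_area_cm2, n):
    # Yield = 1 / (1 + D * A)^N  (wafer yield taken as 1 here)
    return 1 / (1 + defects_per_cm2 * die_area_cm2) ** n

def die_cost(wafer_cost, wafer_diam_cm, die_area_cm2, defects, n):
    good = dies_per_wafer(wafer_diam_cm, die_area_cm2) * die_yield(defects, die_area_cm2, n)
    return wafer_cost / good

# 1 cm^2 die on an assumed $5000, 30 cm wafer
print(die_cost(5000, 30, 1.0, 0.016, 11.5))
```

Re-running with a larger die area shows the super-linear cost growth the plots on the next slide illustrate.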
34
[Two log-log plots vs. die area (0.1 to 100 cm^2): (top) die yield / wafer yield for defect densities of 0.016 and 0.057 per cm^2 with N = 11.5; (bottom) die cost in arbitrary units.]
Answer to Q8
ASIC vs. uP
35
[Log-log plot: total cost per unit ($) vs. volume for an ASIC ($1M NRE, IC cost = $1) and a uP (IC cost = $100); the two curves cross at a break-even volume.]
Total cost = NRE / volume + IC cost
NRE: non-recurring engineering cost
Q9:
– What is typically denser for the same task, an ASIC or a uP? An ASIC
– What is typically more energy efficient and faster? An ASIC
– Which costs less to produce, an ASIC or a uP? It depends on volume (see graph above)
(The $1M figure is just an example of an NRE cost; it may vary a lot. In general the total NRE cost for a uP is greater than that of an ASIC.)
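The cost trade-off in the graph follows directly from Total cost = NRE/volume + IC cost; using the figure's example numbers ($1M ASIC NRE, $1 ASIC die, $100 uP die) and assuming the uP NRE is negligible, the break-even volume falls out in one line:

```python
# Per-unit cost model from the slide: Total cost = NRE/volume + IC cost.
# The $1M NRE, $1 ASIC die and $100 uP die are the figure's example
# numbers; the uP's NRE is assumed negligible here.

def unit_cost(nre, ic_cost, volume):
    return nre / volume + ic_cost

def breakeven_volume(nre_asic=1e6, ic_asic=1.0, ic_up=100.0):
    # Solve nre_asic/v + ic_asic = ic_up for v
    return nre_asic / (ic_up - ic_asic)

v = breakeven_volume()
print(v)   # ~10101 units: above this volume, the ASIC is cheaper per part
```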
[Spectrum: density and speed increase to the left, flexibility to the right]
Application Specific Integrated Circuit (ASIC) — Field Programmable Gate Array (FPGA) — Microprocessor
Major computing platforms
In this class the focus is on microprocessors only
Computer trends: Performance of a (single) processor
37
Questions:
• Reasons behind performance improvement?
• Q10: What happened after ~2002 on the performance figure?
• A10: The power wall
• A10: The end of ILP
– Limits to pipelining
– Limits to superscalar
38
Power consumption
39
• Intel 80386 consumed ~ 2 W
• 3.3 GHz Intel Core i7 consumes 130 W
Problem: Get power in, get power out
Thermal Design Power (TDP) characterizes sustained power consumption. It is used as a target for the power supply and cooling system; it is lower than peak power but higher than average power consumption.
Maximum power density for fan-based cooling: 200 W/cm^2; for water-based cooling: 1000 W/cm^2
Typical max temperature: ~70 C
40
Ambient temperature (Tlow), chip temperature (Thigh), heat flux (Q), thermal conductance K.
Fourier's law in 1D is similar to Ohm's law when replacing:
– thermal conductance with electrical conductance
– the heat source (total generated power) with a current source
– temperature with voltage
Thigh = Tlow + Q/K   ↔   Vhigh = Vlow + I*R
Temperature is roughly (in a 1D lumped model) linearly proportional to Q, the total dissipated power.
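The 1D lumped model is one line of arithmetic; the package-plus-heatsink conductance K and ambient temperature below are assumed values for illustration, while the 130 W figure is the Core i7 TDP from the earlier slide:

```python
# 1D lumped thermal model from the slide: T_high = T_low + Q / K,
# the thermal analogue of V_high = V_low + I*R.
# K = 3 W/K and the 25 C ambient are assumptions for illustration.

def chip_temperature(t_ambient_c, power_w, conductance_w_per_k):
    return t_ambient_c + power_w / conductance_w_per_k

t = chip_temperature(25, 130, 3.0)   # 130 W part (Core i7 TDP from the slide)
print(t)   # ~68.3 C: close to the ~70 C maximum quoted on the slide
```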
Scaling with Feature Size
(for short channel devices, before running into leakage problems)
41
Let's
1) scale all the dimensions of the transistors and wires down by a factor of s
and
2) scale the supply voltage V down by a factor of s (together with the threshold voltage Vth).
Then
• Density: ~ s^2
• Logic gate capacitance Cgate (traditionally the dominating parasitic): ~ 1/s
• Saturation current ION: ~ 1/s
• Gate delay Tgate: ~ Cgate*V/ION = 1/s
• Clock frequency f: ~ s, i.e. it is inversely proportional to gate delay. The clock cycle time is typically ten or more logic gate delays.
• Power (dynamic component only): ~ 1/2 Ctotal*V^2*f ~ 1
(If the chip area remains the same, power scaling is the same as power density scaling.) There is no issue with power (or temperature) scaling, but the problem is that the supply voltage is no longer scaled down by a factor of s. Why? – see the next slides.
Static vs. dynamic power
Leakage (static power) increases exponentially when lowering V! It cannot be neglected anymore.
Static power is dissipated permanently; dynamic power only when switching.
Leakage power ~ V^2/Roff, with Roff/Ron ~ exp(V)
Techniques for Reducing Power Consumption
– Do nothing well
• Low power states for DRAM, disks
• The energy proportionality concept (don't consume energy when no work is done) is very important for data centers, for which power is a huge portion of the running cost
• Power gating to reduce the static component
– Dynamic Voltage-Frequency Scaling (DVFS)
• Q11: Any benefits for multiprocessors?
– A11: If a task is easily parallelizable, then running it on m processors in parallel at a lower V (say V/m) and slower f (say f/m) can lead to the same execution time but much lower dynamic power: Ctotal*V^2*f ~ 1/m^3 (not accounting for static power)
– Overclocking, turning off cores
• Race-to-halt
• Thermal capacitance / turbo mode
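The A11 argument can be checked numerically: with the total switched capacitance unchanged, scaling both V and f down by m cuts dynamic power by m^3 while m-way parallelism keeps the execution time constant. The normalized values below are illustrative, and static power is ignored as on the slide:

```python
# DVFS-on-m-cores sketch: run a fully parallelizable task on m cores
# at V/m and f/m. Dynamic power ~ C_total * V^2 * f drops as 1/m^3;
# execution time (work / aggregate rate) stays the same. Static power
# and parallelization overheads are ignored, as in the slide's argument.

def dynamic_power(c_total, v, f):
    return c_total * v**2 * f

def scaled_run(m, c_total=1.0, v=1.0, f=1.0):
    power = dynamic_power(c_total, v / m, f / m)
    exec_time = 1.0 / (m * (f / m))   # m cores at f/m = one core at f
    return power, exec_time

p1, t1 = scaled_run(1)
p4, t4 = scaled_run(4)
print(p4 / p1)   # 0.015625 = 1/m^3 = 1/64
print(t4 / t1)   # 1.0: same execution time
```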
43
Since the saturation current ION ~ V^2, f ~ 1/Tgate ≈ ION/(Cgate*V) ~ V.
Lowering the voltage reduces the dynamic power consumption and the energy per operation, but decreases performance because of the increased CCT.
Reducing energy consumption: Choice of optimal voltage supply
Other problems with scaling: Transistors and Wires
• Feature size
– Minimum size of a transistor or wire in the x or y dimension
– From 10 microns in 1971 to 0.032 microns in 2011
– Transistor performance scales linearly
– Wire delay does not improve with feature size!
– There is always a need for long wires
• A problem related to Rent's Rule (number of pins versus number of gates)
45
Questions:
• Reasons behind performance improvement?
• Q10: What happened after ~2002 on the performance figure?
• A10: Power wall
• A10: End of ILP
– Limits to pipelining
– Limits to superscalar
» Will discuss this in detail after covering advanced ILP topics
46
What is next: Current Trends in Architecture
• Cannot continue to leverage instruction-level parallelism (ILP)
– Single processor performance improvement ended in 2003
• New ways of improving performance:– Data-level parallelism (DLP)
– Thread-level parallelism (TLP)
– Request-level parallelism (RLP)
• These require explicit restructuring of the application
47
New applications appear: Classes of computers now
• Personal Mobile Device (PMD)
– e.g. smart phones, tablet computers
– Emphasis on energy efficiency and real-time performance
• Desktop Computing– Emphasis on price-performance
• Servers– Emphasis on availability, scalability, throughput
• Clusters / Warehouse Scale Computers
– Used for "Software as a Service (SaaS)"
– Emphasis on availability and price-performance
– Sub-class: supercomputers; emphasis: floating-point performance and fast internal networks
• Embedded Computers– Emphasis: price
48
Dark silicon
49
Only some parts of a chip can be active at a time
Q12: Do specialized cores now make sense in general purpose microprocessors?
Qualcomm Zeroth chip
50
Summary of trends in uP
Not covered in class
51
Quantitative Principles of Design
• Take Advantage of Parallelism
• Principle of Locality
• Focus on the Common Case
– Amdahl’s Law
– E.g. common case supported by special hardware; uncommon cases in software
• The Performance Equation
52
1. Parallelism
How to improve performance?
• (Super-)pipelining
• Powerful instructions
– MD-technique: multiple data operands per operation
– MO-technique: multiple operations per instruction
• Multiple instruction issue
– single instruction/program stream
– multiple streams (or programs, or tasks)
53
Flynn’s Taxonomy
• Single instruction stream, single data stream (SISD)
• Single instruction stream, multiple data streams (SIMD)
– Vector architectures
– Multimedia extensions
– Graphics processing units
• Multiple instruction streams, single data stream (MISD)
– No commercial implementation
• Multiple instruction streams, multiple data streams (MIMD)
– Tightly-coupled MIMD
– Loosely-coupled MIMD
54
MIPS Pipeline
Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
[Diagram: a lw instruction passing through the five stages, IFetch, Dec, Exec, Mem, WB, over cycles 1-5.]
Review from Last Lecture
[Figures: (a) a task-time diagram and (b) a space-time diagram of the 5-stage pipeline over cycles 1-11, showing the start-up and drainage regions. Stage legend: f = Fetch, r = Reg read, a = ALU op, d = Data access, w = Writeback. A second diagram compares four instructions needing 3, 3, 4 and 5 cycles: a multi-cycle design allots each instruction only the time it needs, while a single-cycle design allots every instruction the worst-case time, showing the time saved.]
Design                      | InstCount | CPI                              | CCT
Single cycle (SC)           | 1         | 1                                | 1
Multi cycle (MC)            | 1         | N ≥ CPI > 1 (closer to N than 1) | > 1/N
Multi cycle pipelined (MCP) | 1         | > 1                              | > 1/N
Execution time = 1 / Performance = Inst count x CPI x CCT
CPI_ideal,MCP = N/InstCount + 1 – 1/InstCount; a large N and/or a small InstCount results in a worse CPI. The time to run one instruction is the same as for the single-cycle design (i.e. the latency of a single instruction is not reduced).
N = number of stages for the pipelined design, or ~ the maximum number of steps for MC
What are the other issues affecting CCT and CPI for MC and MCP?
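The ideal pipelined CPI formula follows from counting cycles (N cycles to fill the pipe, then one instruction per cycle), and a quick sketch shows the two regimes the slide mentions:

```python
# Ideal pipelined CPI from the slide: total cycles = N + InstCount - 1,
# so CPI = N/InstCount + 1 - 1/InstCount, which approaches 1 for large
# instruction counts and is dominated by pipeline fill for small ones.

def cpi_ideal_mcp(n_stages, inst_count):
    return n_stages / inst_count + 1 - 1 / inst_count

print(cpi_ideal_mcp(5, 5))        # 1.8: short program, pipeline fill dominates
print(cpi_ideal_mcp(5, 100000))   # ~1.00004: long program, CPI -> 1
```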
Pipelined Instruction Execution
[Diagram: four instructions in program order, each flowing through Ifetch, Reg, ALU, DMem, Reg, overlapped one cycle apart across cycles 1-7.]
57
Limits to pipelining
• Hazards prevent the next instruction from executing during its designated clock cycle
– Structural hazards: an attempt to use the same hardware to do two different things at once
– Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
– Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
[Diagram: the same overlapped execution of four instructions (Ifetch, Reg, ALU, DMem, Reg), illustrating where hazards arise between overlapping instructions.]
58
2. The Principle of Locality
• Programs access a relatively small portion of the address space at any instant of time.
• Two different types of locality:
– Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• For the last 30 years, HW has relied on locality for memory performance.
[Diagram: P – $ – MEM]
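A toy direct-mapped cache makes the two localities concrete: a sequential sweep (spatial locality) hits within each block, while a large-stride sweep with no reuse misses every time. The block and line counts are arbitrary illustration values, not any real cache's parameters:

```python
# Toy direct-mapped cache (word-addressed) to illustrate locality.
# BLOCK and LINES are arbitrary illustrative sizes.

BLOCK = 8     # words per block (spatial locality captured within a block)
LINES = 64    # number of cache lines

def simulate(addresses):
    tags = [None] * LINES
    hits = 0
    for a in addresses:
        block = a // BLOCK
        line = block % LINES
        if tags[line] == block:
            hits += 1              # reuse within a resident block
        else:
            tags[line] = block     # miss: fill the line
    return hits / len(addresses)

sequential = list(range(4096))             # spatial locality: walk an array
strided = [i * 512 for i in range(4096)]   # huge stride: a new block every access

print(simulate(sequential))   # 0.875: 7 of every 8 accesses hit
print(simulate(strided))      # 0.0: no locality, every access misses
```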
59
Memory Hierarchy Levels
Capacity, access time, cost (upper levels are faster; lower levels are larger):
• CPU registers: 100s of bytes, 300-500 ps (0.3-0.5 ns); staging unit to cache: instruction operands, 1-8 bytes (prog./compiler)
• L1 and L2 cache: 10s-100s of KBytes, ~1 ns - ~10 ns, ~$100s/GByte; staging unit: blocks, 32-64 bytes for L1 and 64-128 bytes for L2 (cache controller)
• Main memory: GBytes, 80-200 ns, ~$10/GByte; staging unit: pages, 4K-8K bytes (OS)
• Disk: 10s of TBytes, 10 ms (10,000,000 ns), ~$0.1/GByte; staging unit: files, GBytes (user/operator)
• Tape: infinite capacity, sec-min, ~$0.1/GByte – still needed?
60
3. Focus on the Common Case
• Favor the frequent case over the infrequent case
– E.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first
– E.g., if a database server has 50 disks per processor, storage dependability dominates system dependability, so optimize it first
• The frequent case is often simpler and can be made faster than the infrequent case
– E.g., overflow is rare when adding two numbers, so improve performance by optimizing the more common case of no overflow
– This may slow down overflow, but overall performance is improved by optimizing for the normal case
• What is the frequent case, and how much can performance be improved by making it faster? => Amdahl's Law
61
Amdahl's Law
Speedup_overall = T_exec,old / T_exec,new = 1 / ((1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
[Diagram: execution time split into a serial part and a parallel (enhanced) part; only the enhanced part shrinks.]
62
Amdahl’s Law
• Floating point instructions improved to run 2 times faster, but only 10% of actual instructions are FP
Speedupoverall =
Texec,new =
63
Amdahl's Law
• Floating point instructions improved to run 2X faster; but only 10% of actual instructions are FP
T_exec,new = T_exec,old x (0.9 + 0.1/2) = 0.95 x T_exec,old
Speedup_overall = 1 / 0.95 = 1.053
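The worked FP example is one application of the formula; a small sketch reproduces it and also shows the law's ceiling (even an infinite speedup of the enhanced fraction cannot beat 1/(1 – fraction)):

```python
# Amdahl's Law from the slide:
#   Speedup_overall = 1 / ((1 - f) + f / s)

def amdahl(fraction_enhanced, speedup_enhanced):
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(amdahl(0.1, 2))     # 1.0526...: the slide's FP example, 1/0.95
print(amdahl(0.1, 1e9))   # ~1.111: near-infinite FP speedup caps at 1/(1-0.1)
```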
64
Amdahl's law
65
Principles of Computer Design
• The Processor Performance Equation
66
Principles of Computer Design
• Different instruction types have different CPIs
67
Acknowledgements
Some of the slides contain material developed and copyrighted by Henk Corporaal (TU/e) and instructor material for the textbook
68