1
Fundamentals of Computer Design
Introduction
Classes of Computers
Defining Computer Architecture
Trends in Technology
Trends in Power and Energy in Integrated Circuits
Trends in Cost
Dependability
Measuring and Reporting Performance
Quantitative Principles of Computer Design
CDA 4102/5155 – Fall 2015 Copyright © 2015 Prabhat Mishra
2
Microprocessor Performance Trends
Relative to VAX-11/780 using SpecInt Benchmarks
Due to technological advances
Due to advances in architecture
Slowdown due to limits of power and available ILP
3
Design Complexity
Exponential Growth – doubling of transistors every couple of years
4
Technology and Demand
Technology: # of transistors doubles every 2 years
Demand: communication, multimedia, entertainment, networking
Exponential growth of design complexity and verification complexity
6
Computer Market
Desktop
  Driven by price-performance
  $1000 - $10,000 [$100 - $1000 per processor]
Server
  Throughput, availability, scalability
  $10K - $10M [$200 - $2000 per processor]
Embedded Systems
  Application specific
  Low cost, low power, real-time performance
  $10 - $100,000 [$0.20 - $200 per processor]
7
An Example Embedded System
Digital Camera Block Diagram
8
Components of Embedded Systems
Analog Digital Analog
Memory
Coprocessors
Controllers
Converters
Processor
Interface
Software (Application Programs)
ASIC
10
Computer Architecture
Definition
  Instruction set architecture (ISA): programmer (user) view
  Implementation
    Organization: CPU, memory, buses, I/O
    Hardware: logic design, packaging technology
Computer design must meet
  Functional requirements
  Area, performance, cost, power goals
Optimize, evaluate, and explore to find the best possible architecture
Consider other factors
  Time-to-market, technology trend, safety, reliability, …
11
Instruction-Set Architecture (ISA)
An instruction set architecture is a specification of a standardized programmer-visible interface to hardware, comprising:
A set of instructions (instruction types and operations)
  With associated argument fields, assembly syntax, binary encoding.
A set of named storage locations and addressing
  Registers, memory, … programmer-accessible caches?
A set of addressing modes (ways to name locations)
Types and sizes of operands
Control flow instructions
Often an I/O interface (usually memory-mapped)
12
Example: MIPS
[Figure: register file r0 (always 0), r1 … r31, plus PC, lo, hi]
Programmable storage
  2^32 x bytes
  31 x 32-bit GPRs (R0=0)
  32 x 32-bit FP regs (paired DP)
  HI, LO, PC
Data types? Format? Addressing modes?
Arithmetic/logical
  ADD, ADDU, SUB, SUBU, AND, OR, XOR, NOR, SLT, SLTU
  ADDI, ADDIU, SLTI, SLTIU, ANDI, ORI, XORI, LUI
  SLL, SRL, SRA, SLLV, SRLV, SRAV
Memory access
  LB, LBU, LH, LHU, LW, LWL, LWR
  SB, SH, SW, SWL, SWR
Control
  J, JAL, JR, JALR
  BEQ, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
13
MIPS64 Instruction Format
MIPS Implementation
Pipelined Implementation
17
Technology Trend
Component
  IC technology: transistors/chip increase 55% per year
  DRAM: density increases 40-60% per year
  Magnetic disk: density increases 100% per year
  Network: Ethernet from 10 Mb to 100 Mb took 10 years; 100 Mb to 1 Gb took 5 years
Scaling of performance, wires and power
  Feature size: 10 micron in 1971; 0.18 micron in 2001, …
  Microprocessor organization improvement
  Wiring delay
  Power issue: ~100 watts for 2GHz Pentium 4
18
Disk Comparison
CDC Wren I, 1983
3600 RPM
0.03 GBytes capacity
Tracks/Inch: 800
Bits/Inch: 9550
Three 5.25” platters
Bandwidth: 0.6 MBytes/sec
Latency: 48.3 ms
Cache: none
Seagate 373453, 2003
15000 RPM (4X)
73.4 GBytes (2500X)
Tracks/Inch: 64000 (80X)
Bits/Inch: 533,000 (60X)
Four 2.5” platters (in 3.5” form factor)
Bandwidth: 86 MBytes/sec (140X)
Latency: 5.7 ms (8X)
Cache: 8 MBytes
19
Memory Comparison
1980 DRAM (asynchronous)
0.06 Mbits/chip
64 K transistors, 35 mm²
16-bit data bus per module
16 pins/chip
13 Mbytes/sec
Latency: 225 ns
(no block transfer)
2000 Double Data Rate Synchronous (clocked) DRAM
256 Mbits/chip (4000X)
256 M transistors, 204 mm²
64-bit data bus per DIMM (4X)
66 pins/chip
1600 Mbytes/sec (120X)
Latency: 52 ns (4X)
Block transfers (page mode)
20
LAN Comparison
Ethernet 802.3
Year of Standard: 1978
10 Mbits/s link speed
Latency: 3000 µsec
Shared media
Coaxial cable
Ethernet 802.3ae
Year of Standard: 2003
10,000 Mbits/s link speed (1000X)
Latency: 190 µsec (15X)
Switched media
Category 5 copper wire
Coaxial cable: copper core, insulator, braided outer conductor, plastic covering
Twisted pair: copper, 1 mm thick, twisted to avoid antenna effect; "Cat 5" is 4 twisted pairs in a bundle
21
CPU Comparison
1982 Intel 80286
12.5 MHz
2 MIPS (peak)
Latency 320 ns
134,000 xtors, 47 mm²
16-bit data bus, 68 pins
Microcode interpreter, separate FPU chip
(no caches)
2001 Intel Pentium 4
1500 MHz (120X)
4500 MIPS (peak) (2250X)
Latency 15 ns (20X)
42,000,000 xtors, 217 mm²
64-bit data bus, 423 pins
3-way superscalar, dynamic translate to RISC, superpipelined (22 stage), out-of-order execution
On-chip 8KB data cache, 96KB instr. trace cache, 256KB L2 cache
22
Bandwidth vs. Latency
Processor: ‘286, ‘386, ‘486, Pentium, Pentium Pro, Pentium 4 (21x,2250x)
Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x,1000x)
Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x,120x)
Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
Latency improvement is 10X while bandwidth improvement is 100X to 1000X.
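The ratios quoted on this slide can be re-derived from the first and last milestones on the four comparison slides. The sketch below is only a check of that arithmetic; the dictionary layout and the `improvement` helper are illustrative names, not part of the course material.

```python
# Figures copied from the CPU, LAN, memory, and disk comparison slides.
# Each entry: (old bandwidth, new bandwidth, old latency, new latency).
# Units differ per row but cancel out in the ratios.
MILESTONES = {
    "Processor": (2, 4500, 320, 15),
    "Ethernet": (10, 10_000, 3000, 190),
    "Memory module": (13, 1600, 225, 52),
    "Disk": (0.6, 86, 48.3, 5.7),
}


def improvement(old_bw, new_bw, old_lat, new_lat):
    """Return (bandwidth gain, latency gain) as multiplicative factors."""
    return new_bw / old_bw, old_lat / new_lat


for name, figures in MILESTONES.items():
    bw_gain, lat_gain = improvement(*figures)
    print(f"{name:14s} bandwidth {bw_gain:6.0f}x   latency {lat_gain:4.1f}x")
```

In every row the bandwidth gain exceeds the latency gain by one to two orders of magnitude, which is the point the slide summarizes.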
24
Power and Energy
$E = \int P\,dt$
[Figure: power P plotted against time t; the area under the curve is the energy E]
In many cases, faster execution means less energy, but the opposite may be true if power has to be increased to allow faster execution.
Power and Energy
Power is drawn from a voltage source.
Power: $P(t) = i_{DD}(t)\,V_{DD}$
Energy: $E = \int_0^T P(t)\,dt = \int_0^T i_{DD}(t)\,V_{DD}\,dt$
Average Power: $P_{avg} = \frac{E}{T} = \frac{1}{T}\int_0^T i_{DD}(t)\,V_{DD}\,dt$
Dynamic Power
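[Figure: load capacitance $C$ switching at frequency $f_{sw}$, drawing current $i_{DD}(t)$ from supply $V_{DD}$]
$P_{dynamic} = \frac{1}{T}\int_0^T i_{DD}(t)\,V_{DD}\,dt = \frac{V_{DD}}{T}\int_0^T i_{DD}(t)\,dt = \frac{V_{DD}}{T}\left[T f_{sw} C V_{DD}\right] = C V_{DD}^2 f_{sw}$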
Power is needed to charge and discharge load capacitances when transistors switch.
The capacitor needs to charge for the output to be ‘1’.
For the output to be ‘0’, the capacitor needs to discharge.
This repeats $T \cdot f_{sw}$ times over an interval of $T$.
$P_{dynamic} = \alpha C V_{DD}^2 f$
Here, $\alpha$ is the activity factor and $f$ is the clock frequency.
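To see how strongly the supply voltage dominates the dynamic power formula (activity factor times capacitance times supply voltage squared times clock frequency), a minimal numeric sketch follows. The capacitance, voltage, frequency, and activity values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def dynamic_power(c_farads, vdd_volts, f_hz, activity=1.0):
    """Dynamic power in watts: activity * C * VDD^2 * f."""
    return activity * c_farads * vdd_volts ** 2 * f_hz


# Hypothetical operating point: 1 nF switched capacitance, 1.2 V supply,
# 1 GHz clock, and 10% of the capacitance switching each cycle.
base = dynamic_power(1e-9, 1.2, 1e9, activity=0.1)

# Same circuit with the supply voltage halved and everything else unchanged.
half_vdd = dynamic_power(1e-9, 0.6, 1e9, activity=0.1)

print(f"base:     {base:.3f} W")
print(f"half VDD: {half_vdd:.3f} W ({half_vdd / base:.2f} of base)")
```

Halving the voltage alone cuts dynamic power to a quarter, whereas halving the frequency alone would only cut it in half.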
28
Static Power
Because leakage current flows even when a transistor is off, static power is now important too.
$Power_{static} = Current_{static} \times Voltage$
Leakage current increases in processors with smaller transistor sizes.
Increasing the number of transistors increases power even if they are turned off.
In 2006, the goal for leakage was 25% of total power consumption; high-performance designs were at 40%.
Very low power systems even gate the voltage to inactive modules to control loss due to leakage.
Reducing Energy Consumption
[www.transmeta.com]
Pentium Crusoe
Running the same multimedia application.
Infrared Cameras (FLIR) can be used to detect thermal distribution.
Dynamic Power Management (DPM)
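RUN: operational
IDLE: a SW routine may stop the CPU when not in use, while monitoring interrupts
SLEEP: shutdown of on-chip activity
[Figure: power state machine for the StrongARM SA-1100. Power per state: RUN 400 mW, IDLE 50 mW, SLEEP 160 µW. Transition times labeled on the arcs: 10 µs, 90 µs, 160 ms]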
Dynamic Voltage Scaling (DVS)
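$E = P \times T$, $P \propto V^2$
E (energy), P (power), T (time), V (voltage)
Example
A task is given with workload (W) and deadline (D).
Assume that idle energy is negligible.
[Figure: two schedules against deadline $D$: at voltage $V$ the task runs for time $T$; at voltage $V/2$ it runs for time $2T$]
$E_1 \propto V_1^2 \cdot T_1 = V^2 \cdot T$
$E_2 \propto V_2^2 \cdot T_2 = \frac{V^2}{4} \cdot 2T = \frac{E_1}{2}$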
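The voltage-scaling example can be checked with a few lines of arithmetic. This is a sketch under the slide's own assumptions (power proportional to voltage squared, idle energy negligible) plus one more that the slide implies but does not state: running at half the voltage doubles the execution time. The function name and normalized units are illustrative.

```python
def relative_energy(voltage, time):
    """Energy up to a constant factor: E is proportional to V^2 * T."""
    return voltage ** 2 * time


V, T = 1.0, 1.0  # normalized voltage and execution time

e_fast = relative_energy(V, T)          # run at V, finish in T
e_slow = relative_energy(V / 2, 2 * T)  # run at V/2, finish in 2T

print(f"E2 / E1 = {e_slow / e_fast}")
```

The slower schedule uses half the energy, so it is the better choice whenever the deadline allows the task to take twice as long.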
Multicores – Low Power?
Multicore
  One core with frequency 2 GHz
  Two cores with 1 GHz frequency (each)
    Same performance
    Two 1 GHz cores require half the power/energy
      – Power ∝ freq^2
      – A 1 GHz core needs one-fourth the power of a 2 GHz core.
New challenges
  Performance concerns – how to keep them busy?
  Reliability concerns – MTTF gets worse!
  and more …
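The multicore power argument is a two-line calculation. The sketch below takes the slide's model at face value (power proportional to frequency squared, and two 1 GHz cores matching one 2 GHz core in performance); both are simplifications, and the helper name is illustrative.

```python
def relative_power(freq_ghz):
    """Power up to a constant factor, using the slide's model: P ~ f^2."""
    return freq_ghz ** 2


one_fast_core = relative_power(2.0)       # single 2 GHz core
two_slow_cores = 2 * relative_power(1.0)  # two 1 GHz cores

print(f"one 1 GHz core vs one 2 GHz core: {relative_power(1.0) / one_fast_core}")
print(f"two 1 GHz cores vs one 2 GHz core: {two_slow_cores / one_fast_core}")
```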
34
DRAM Pricing
© 2003 Elsevier Science (USA). All rights reserved.
35
Processor Pricing (Intel Pentium III)
© 2003 Elsevier Science (USA). All rights reserved.
36
Silicon wafer and microprocessor die
This 8-inch wafer contains 564 MIPS64 R20K processors (0.18 µ)
Intel Pentium 4 Microprocessor
37
Cost of an Integrated Circuit (IC)
Cost of IC: (die + packaging + test) / yield
See examples on pages 22-24
Cost of a system
  Processor board: ~ 37%
  I/O device: ~ 37%
  Cabinet: ~ 6%
  Software: ~ 20%
38
Cost
Unit cost
  Monetary cost of manufacturing one unit, excluding NRE cost
NRE cost (Non-Recurring Engineering cost)
  The one-time monetary cost of designing the system
Total cost
  NRE cost + unit cost * # of units
Per-product cost
  total cost / # of units = (NRE cost / # of units) + unit cost
• Example
– NRE = $2000, unit = $100
– For 10 units
– total cost = $2000 + 10*$100 = $3000
– per-product cost = $2000/10 + $100 = $300
Amortizing NRE cost over the units results in an additional $200 per unit.
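The cost definitions translate directly into two small functions. The sketch below reproduces the slide's example (NRE of $2000, unit cost of $100, 10 units) and then sweeps the volume to show how the amortized NRE share shrinks; the larger volumes are illustrative additions, not figures from the slide.

```python
def total_cost(nre, unit_cost, units):
    """Total cost = NRE cost + unit cost * number of units."""
    return nre + unit_cost * units


def per_product_cost(nre, unit_cost, units):
    """Per-product cost = total cost / number of units."""
    return total_cost(nre, unit_cost, units) / units


NRE, UNIT = 2000, 100  # values from the slide's example

for units in (10, 100, 1000):  # 10 is the slide's case; the rest are illustrative
    cost = per_product_cost(NRE, UNIT, units)
    print(f"{units:5d} units: ${cost:.2f} per product "
          f"(${cost - UNIT:.2f} of it amortized NRE)")
```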
39
NRE versus Unit Cost
[Figure: unit cost versus volume, with two curves: high NRE with low production cost, and low NRE with high production cost]
40
Cost versus Price
42
Define and Quantify Dependability
How to decide when a system is operating properly?
Infrastructure providers now offer Service Level Agreements (SLA) to guarantee that their networking or power service will be dependable
Systems alternate between 2 states of service with respect to an SLA:
State 1: Service accomplishment, where the service is delivered as specified in the SLA
State 2: Service interruption, where the delivered service is different from the SLA
Failure = transition from state 1 to state 2
Restoration = transition from state 2 to state 1
43
Dependability
Module reliability = measure of continuous service accomplishment (or time to failure)
Two metrics:
1. Mean Time To Failure (MTTF) – measures Reliability
2. Failures In Time (FIT) = 1/MTTF, the rate of failures
Traditionally reported as failures per billion hours of operation
Mean Time To Repair (MTTR) measures Service Interruption
Mean Time Between Failures (MTBF) = MTTF+MTTR
Module availability measures service as it alternates between the 2 states of accomplishment and interruption (a number between 0 and 1, e.g. 0.9)
Module availability = MTTF / ( MTTF + MTTR)
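The metrics above are related by simple arithmetic. The sketch below computes FIT, MTBF, and availability for a single module; the 1,000,000-hour MTTF and 24-hour MTTR are hypothetical values chosen for illustration, and the function names are not from the slides.

```python
def fit(mttf_hours):
    """Failures per billion hours of operation, i.e. 1/MTTF scaled by 1e9."""
    return 1e9 / mttf_hours


def mtbf(mttf_hours, mttr_hours):
    """Mean time between failures = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours, mttr_hours):
    """Fraction of time in the service-accomplishment state."""
    return mttf_hours / (mttf_hours + mttr_hours)


MTTF_HOURS, MTTR_HOURS = 1_000_000, 24  # hypothetical module

print(f"FIT          = {fit(MTTF_HOURS):.0f}")
print(f"MTBF         = {mtbf(MTTF_HOURS, MTTR_HOURS)} hours")
print(f"Availability = {availability(MTTF_HOURS, MTTR_HOURS):.6f}")
```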
44
Example
If modules have exponentially distributed lifetimes (age of module does not affect probability of failure), overall failure rate is the sum of failure rates of the modules
Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF), and 1 power supply (0.2M hour MTTF):
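$FailureRate = 10 \times \frac{1}{1{,}000{,}000} + \frac{1}{500{,}000} + \frac{1}{200{,}000} = \frac{10 + 2 + 5}{1{,}000{,}000} = \frac{17}{1{,}000{,}000}$ failures per hour
$FIT = 17{,}000$ (failures per billion hours)
$MTTF = \frac{1{,}000{,}000{,}000}{17{,}000} \approx 59{,}000$ hours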
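The same calculation can be written as a short script, which makes it easy to change the component list. It relies on the slide's assumption of exponentially distributed lifetimes, so the system failure rate is the sum of the component rates; the variable names are illustrative.

```python
# MTTF in hours for every component in the slide's example system.
component_mttf_hours = (
    [1_000_000] * 10   # 10 disks
    + [500_000]        # 1 disk controller
    + [200_000]        # 1 power supply
)

# With exponential lifetimes, failure rates (1/MTTF) simply add.
failure_rate_per_hour = sum(1 / mttf for mttf in component_mttf_hours)

system_fit = failure_rate_per_hour * 1e9      # failures per billion hours
system_mttf_hours = 1 / failure_rate_per_hour

print(f"FIT  = {system_fit:,.0f}")
print(f"MTTF = {system_mttf_hours:,.0f} hours")
```

The exact MTTF is a little under the slide's rounded figure of 59,000 hours.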
46
Performance Measurement
Performance metric: execution time
  Increasing performance decreases execution time
Other metrics
  Wall-clock time, response time, elapsed time
  CPU time: user or system
We will focus on CPU performance, i.e., user CPU time on an unloaded system
Machine x is n times faster than machine y when:
$n = \frac{ExecutionTime_y}{ExecutionTime_x} = \frac{Performance_x}{Performance_y}$
47
Choosing Programs to Evaluate Performance
Real applications
  For example: gcc compiler, Microsoft Word
Modified (or scripted) applications
  For example: remove I/O, script to simulate interactive behavior
Kernels
  For example: Livermore loops, Linpack
Toy benchmarks
  For example: Sieve of Eratosthenes, quicksort
Synthetic benchmarks
  For example: Whetstone, Dhrystone
[Arrow alongside the list, top to bottom: lower accuracy]
48
Benchmark Suites
Desktop
  New SPEC CPU2006
  SPEC CPU2000: 11 integer, 14 floating-point
  SPECviewperf, SPECapc: graphics benchmarks
Server
  SPEC CPU2000: running multiple copies
  SPECSFS: for NFS performance
  SPECWeb: Web server benchmark
  TPC-x: measure transaction-processing, queries, and decision making database applications
Embedded Processor
  EEMBC: EDN Embedded Microprocessor Benchmark Consortium
49
SPEC CPU Benchmarks
50
Reporting Performance
Performance should be reproducible
Description of the machine and compiler flags
Report for both baseline and optimized version
Source code modifications
Not allowed in SPEC benchmarks
Allowed but difficult or impossible
– TPC-C using Oracle or SQL database
Allowed in supercomputer benchmarks
– Modify or re-write algorithms
Hand-coding in assembly for EEMBC benchmark
51
Comparing Performance
Arithmetic Mean:
What is the mixture of programs in the workload?
$\frac{1}{n}\sum_{i=1}^{n} Time_i$
Arithmetic Mean: 500.5, 55, 20
52
Comparing Performance
Weighted Arithmetic Mean:
What if programs are fixed and inputs are not?
$\sum_{i=1}^{n} Weight_i \times Time_i$
53
Comparing Performance
Geometric Mean:
Execution time ratio is normalized to a base machine.
Reference machine is not important.
The arithmetic means differ depending on which machine is used as the basis, but the geometric means are the same.
Geometric mean does not predict execution time
$\sqrt[n]{\prod_{i=1}^{n} ExecutionTimeRatio_i}$
54
Normalized Execution Times (SPECRatio)
Geometric mean does not predict execution time
  Performance of machines A and B is the same only if program P1 is executed 100 times for every occurrence of program P2
Rewards easy enhancements
  Improving program P3 (2 to 1) is the same as improving program P4 (1000 to 500).
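The claims about the means can be demonstrated numerically. The execution times below are an assumption: the table from the slides did not survive extraction, so these values were chosen to reproduce the arithmetic means quoted earlier (500.5, 55, 20) and the 100-to-1 remark about P1 and P2. The helper names are illustrative.

```python
import math

# Assumed execution times in seconds for programs [P1, P2] on each machine.
TIMES = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))


def normalized(machine, reference):
    """Per-program execution time ratios of `machine` relative to `reference`."""
    return [t / r for t, r in zip(TIMES[machine], TIMES[reference])]


for ref in ("A", "B"):
    for machine in TIMES:
        ratios = normalized(machine, ref)
        print(f"ref={ref} machine={machine} "
              f"AM={arithmetic_mean(ratios):7.3f} GM={geometric_mean(ratios):.3f}")

# A and B take equal total time only when P1 runs 100 times per run of P2.
total_a = 100 * TIMES["A"][0] + TIMES["A"][1]
total_b = 100 * TIMES["B"][0] + TIMES["B"][1]
print(f"weighted totals: A={total_a}, B={total_b}")
```

With A as the reference, the arithmetic mean says B is about 5 times slower; with B as the reference, it says A is about 5 times slower. The geometric mean rates A and B as equal in both cases, yet their actual total times match only under the 100-to-1 workload mix.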
56
Amdahl’s Law
$Speedup = \frac{1}{(1 - f) + f/n}$
• Where:
f is the fraction of the execution time that can be enhanced
n is the enhancement factor
• Example: f = 0.1, n = 10 → Speedup = 1.1
Make the common case fast
Performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
57
Application of Amdahl’s Law
Amdahl’s law is useful for comparing overall performance of two design alternatives.
Example: Floating-point (FP) operations consume 50% of the execution time of a graphics application. FP square root (FPSQRT) is used 20% of the time.
1. Improve FPSQRT operation execution by 10 times
– Speedup = 1 / ((1-0.2) + 0.2/10) = 1.22
2. Improve all FP operations by 1.6 times
– Speedup = 1 / ((1-0.5) + 0.5/1.6) = 1.23
Because FP operations as a whole account for more of the execution time, the modest improvement to all FP operations (case 2) yields a slightly larger gain than the drastic improvement of FPSQRT alone (case 1).
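Amdahl's Law is a one-line function, which makes it easy to compare design alternatives like the two above. The sketch below re-computes the slide examples; the function name is illustrative.

```python
def amdahl_speedup(f, n):
    """Overall speedup when a fraction f of execution time is sped up n times."""
    return 1.0 / ((1.0 - f) + f / n)


cases = {
    "f=0.1, n=10 (introductory example)": (0.1, 10),
    "FPSQRT 10x faster (20% of time)": (0.2, 10),
    "all FP 1.6x faster (50% of time)": (0.5, 1.6),
}

for label, (f, n) in cases.items():
    print(f"{label:36s} speedup = {amdahl_speedup(f, n):.2f}")

# Upper bound for the FPSQRT case: even an infinitely fast unit cannot
# remove the 80% of time that is not enhanced.
print(f"FPSQRT limit as n grows: {1.0 / (1.0 - 0.2):.2f}")
```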
58
Measuring the Performance
Performance Equation
  CPU time = Instruction Count x Clock cycle time x CPI
How to compute these parameters
  Known for existing processors: clock cycle time
  Use of counters in new processors: CPI, instruction count
Simulation for performance analysis
  Profile based
  Trace-driven
  Execution-driven
59
CPU Performance Equation
The parameters are dependent
  Instruction Count: ISA and compiler technology
  CPI: Organization and ISA
  Cycle Time: Hardware technology and organization
Many performance-enhancing techniques improve one parameter with small or predictable impacts on the other two.
$CPUTime = InstructionCount \times CyclesPerInst \times CycleTime = InstructionCount \times CyclesPerInst \times \frac{1}{ClockRate}$
60
Example
Parameters:
  Frequency of FP operations (incl. FPSQR) = 25%
  CPI for FP operations = 4; CPI for others = 1.33
  Frequency of FPSQR = 2%; CPI of FPSQR = 20
Compare 2 designs:
  Decrease CPI of FPSQR to 2
  Decrease CPI of all FP to 2.5
$CPI_{orig} = \sum_{i=1}^{n} CPI_i \times \frac{IC_i}{IC_{Total}} = (4 \times 25\%) + (1.33 \times 75\%) = 2.0$
$CPI_{newFPSQR} = CPI_{orig} - 2\% \times (CPI_{oldFPSQR} - CPI_{newFPSQR}) = 2.0 - 2\% \times (20 - 2) = 1.64$
$CPI_{newFP} = (75\% \times 1.33) + (25\% \times 2.5) = 1.625$
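The CPI comparison can be reproduced with a weighted sum over the instruction mix. This is a sketch using the parameters stated on the slide; the helper name and the final speedup lines are additions for illustration. Since instruction count and clock rate are unchanged between designs, the speedup is just the ratio of CPIs.

```python
def weighted_cpi(mix):
    """Overall CPI from (instruction fraction, CPI) pairs."""
    return sum(fraction * cpi for fraction, cpi in mix)


# Original design: 25% FP at CPI 4, 75% other instructions at CPI 1.33.
cpi_orig = weighted_cpi([(0.25, 4.0), (0.75, 1.33)])

# Design 1: FPSQR (2% of instructions) drops from CPI 20 to CPI 2.
cpi_new_fpsqr = cpi_orig - 0.02 * (20 - 2)

# Design 2: all FP instructions drop from CPI 4 to CPI 2.5.
cpi_new_fp = weighted_cpi([(0.25, 2.5), (0.75, 1.33)])

print(f"CPI original        = {cpi_orig:.4f}")
print(f"CPI with new FPSQR  = {cpi_new_fpsqr:.4f}")
print(f"CPI with new FP     = {cpi_new_fp:.4f}")
print(f"speedup, new FPSQR  = {cpi_orig / cpi_new_fpsqr:.3f}")
print(f"speedup, new FP     = {cpi_orig / cpi_new_fp:.3f}")
```

The unrounded values differ slightly from the slide's 2.0, 1.64, and 1.625 because the slide rounds 1.33 × 75% to 1.0; the conclusion that improving all FP operations gives the lower CPI is unaffected.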
62
Fallacies and Pitfalls
The relative performance of two processors with the same ISA can be judged by clock rate or by the performance of a single benchmark suite.
1.7 GHz Pentium 4 relative to 1.0 GHz Pentium III
© 2003 Elsevier Science (USA). All rights reserved.
63
Fallacies and Pitfalls
Benchmarks remain valid indefinitely.
  One line in matrix300 (SPEC89) executes 99% of the time
Peak performance tracks observed performance.
The best design is the one that optimizes the primary objective without considering design costs.
Synthetic benchmarks predict performance for real programs.
  Compiler/hardware optimizations can inflate performance
MIPS is an accurate measure for comparing performance among computers
  Consider using FP hardware instead of FP routines.
$MIPS = \frac{InstCount}{ExecTime \times 10^6} = \frac{ClockRate}{CPI \times 10^6}$
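The FP-hardware remark can be made concrete with a small, entirely hypothetical comparison: the same program run with FP hardware (fewer, slower instructions) and with software FP routines (many more, simpler instructions). The clock rate, instruction counts, and CPI values below are invented for illustration only.

```python
def mips(clock_hz, cpi):
    """MIPS rating = clock rate / (CPI * 10^6)."""
    return clock_hz / (cpi * 1e6)


def exec_time(instruction_count, cpi, clock_hz):
    """CPU time = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_hz


CLOCK_HZ = 500e6  # hypothetical 500 MHz clock for both configurations

# Hypothetical workloads for the same program.
fp_hardware = {"instruction_count": 10e6, "cpi": 4.0}
fp_software = {"instruction_count": 100e6, "cpi": 1.5}

for name, cfg in (("FP hardware", fp_hardware), ("FP software", fp_software)):
    rating = mips(CLOCK_HZ, cfg["cpi"])
    seconds = exec_time(cfg["instruction_count"], cfg["cpi"], CLOCK_HZ)
    print(f"{name}: {rating:6.1f} MIPS, {seconds:.2f} s")
```

Under these assumed numbers the software-routine version earns the higher MIPS rating while taking longer to finish, which is why MIPS alone is a misleading basis for comparison.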