unit i-introduction

Fundamentals of Computer Design: Unit 1, Chapter 1. Reference: Computer Architecture: A Quantitative Approach, John L. Hennessy and David A. Patterson, 4th edition

Upload: akruthi-k

Post on 15-Apr-2017


TRANSCRIPT

Page 1: Unit i-introduction

Fundamentals of Computer Design
Unit 1, Chapter 1

Reference: Computer Architecture: A Quantitative Approach, John L. Hennessy and David A. Patterson, 4th edition

Page 2: Unit i-introduction


Outline

Introduction

Computing Classes

Task of Computer Designer

Measuring, Reporting, Summarizing Performance

Quantitative Principles of Design

Page 3: Unit i-introduction

Introduction

ENIAC, 1946 (widely regarded as the first general-purpose electronic computer)
- Occupied a 50 x 30 ft room
- Weighed 30 tonnes
- Contained 18,000 electronic valves
- Consumed 25 kW of electrical power
- Capable of performing 100K calculations per second

CHANGE!

PlayStation Portable (PSP)
- Dimensions: approx. 170 mm (L) x 74 mm (W) x 23 mm (D)
- Weight: approx. 260 g (including battery)
- CPU: PSP CPU (clock frequency 1 to 333 MHz)
- Main memory: 32 MB
- Embedded DRAM: 4 MB
- Profile: PSP Game, UMD Audio, UMD Video

Page 4: Unit i-introduction

A short history of computing

Continuous growth in performance due to advances in technology and innovations in computer design

- First 25 years (1945-1970)
  - 25% yearly growth in performance
  - Both forces contributed to performance improvement
  - Mainframes and minicomputers dominated the industry
- Late 70s: emergence of the microprocessor
  - 35% yearly growth in performance thanks to integrated circuit technology
  - Changes in the computer marketplace: the elimination of assembly language programming and the emergence of Unix made it easier to develop new architectures
- Mid 80s: emergence of RISCs (Reduced Instruction Set Computers)
  - 52% yearly growth in performance
  - Performance improvements through instruction-level parallelism (pipelining, multiple instruction issue) and caches
- Since 2002: end of 16 years of renaissance
  - 20% yearly growth in performance
  - Limited by 3 hurdles: maximum power dissipation, instruction-level parallelism, and the so-called “memory wall”
  - Switch from ILP to TLP and DLP (thread- and data-level parallelism)

Page 5: Unit i-introduction

Effect of this Dramatic Growth

- Significant enhancement of the capability available to the computer user
  - Example: today’s $500 PC has more performance, more main memory, and more disk storage than a $1 million computer in 1985
- Microprocessor-based computers dominate
  - Workstations and PCs have emerged as major products
  - Minicomputers: replaced by servers
  - Mainframes: replaced by multiprocessors
  - Supercomputers: replaced by large arrays of microprocessors

Page 6: Unit i-introduction

Changing Face of Computing

- In the 1960s, mainframes roamed the planet
  - Very expensive; operators oversaw operations
  - Applications: business data processing, large-scale scientific computing
- In the 1970s, minicomputers emerged
  - Less expensive, time sharing
- In the 1990s, the Internet and WWW, handheld devices (PDAs), and high-performance consumer electronics for video games and set-top boxes emerged
- These dramatic changes have led to 3 different computing markets
  - Desktop computing, servers, embedded computers

Page 7: Unit i-introduction

Computing Classes: A Summary

| Feature | Desktop | Server | Embedded |
|---|---|---|---|
| Price of the system | $500-$5K | $5K-$5M | $10-$100K (including network routers at the high end) |
| Price of the processor | $50-$500 | $200-$10K | $0.01-$100 |
| Sold per year (estimates for 2000) | 150M | 4M | 300M (only 32-bit and 64-bit) |
| Critical system design issues | Price-performance, graphics, compute performance | Throughput, availability, scalability | Price, power consumption, application-specific performance |

Page 8: Unit i-introduction

Desktop Computers

- Largest market in dollar terms
- Spans low-end (<$500) to high-end ($5K) systems
- Optimize price-performance
  - Performance is measured in the number of calculations and graphics operations
  - Price is what matters to customers
- Arena where the newest, highest-performance and cost-reduced microprocessors appear
- Reasonably well characterized in terms of applications and benchmarking

Page 9: Unit i-introduction

Servers

- Provide more reliable file and computing services (e.g., Web servers)
- Key requirements
  - Availability: effectively provide service (Yahoo!, Google, eBay)
  - Reliability: never fails
  - Scalability: server systems grow over time, so the ability to scale up the computing capacity is crucial
  - Performance: transactions per minute
- Related category: clusters / supercomputers

Page 10: Unit i-introduction

Embedded Computers

- Fastest growing portion of the market
- Computers as parts of other devices where their presence is not obviously visible
  - E.g., home appliances, printers, smart cards, cell phones, palmtops, gaming consoles, network routers
- Wide range of processing power and cost
  - $0.1 (8-bit, 16-bit processors)
  - $10 (32-bit, capable of executing 50M instructions per second)
  - $100-$200 (high-end video gaming consoles and network switches)
- Requirements
  - Real-time performance requirement (e.g., the time to process a video frame is limited)
  - Minimize memory requirements and power
- SOCs (systems-on-a-chip) combine processor cores and application-specific circuitry, DSP processors, ...

Page 11: Unit i-introduction

Task of Computer Designer

“Determine what attributes are important for a new machine; then design a machine to maximize performance while staying within cost, power, and availability constraints.”

- Aspects of this task
  - Instruction set design
  - Functional organization
  - Logic design and implementation (IC design, packaging, power, cooling, ...)

Page 12: Unit i-introduction

What is Computer Architecture?

- Instruction Set Architecture: the computer visible to the assembly language programmer or compiler writer (registers, data types, instruction set, instruction formats, addressing modes)
- Organization: high-level aspects of a computer’s design, such as the memory system, the bus structure, and the internal CPU (datapath + control) design
- Hardware: detailed logic design, interconnection and packaging technology, external connections

Computer Architecture covers all three aspects of computer design

Page 13: Unit i-introduction

Instruction Set Architecture, Organization, and Hardware


Page 14: Unit i-introduction

Instruction Set Architecture: Critical Interface

[Figure: the instruction set as the interface between software and hardware]

- Properties of a good abstraction
  - Lasts through many generations (portability)
  - Used in many different ways (generality)
  - Provides convenient functionality to higher levels
  - Permits an efficient implementation at lower levels

Page 15: Unit i-introduction

Measuring, Reporting, Summarizing Performance


Page 16: Unit i-introduction

Cost-Performance

- Purchasing perspective: from a collection of machines, choose the one which has
  - best performance?
  - least cost?
  - best performance/cost?
- Computer designer perspective: faced with design options, select the one which has
  - best performance improvement?
  - least cost?
  - best performance/cost?
- Both require a basis for comparison and a metric for evaluation

Page 17: Unit i-introduction

Two “notions” of performance

- Which computer has better performance?
  - User: the one which runs a program in less time
  - Computer centre manager: the one which completes more jobs in a given time
- Users are interested in reducing response time (or execution time): the time between the start and the completion of an event
- Managers are interested in increasing throughput (or bandwidth): the total amount of work done in a given time

Page 18: Unit i-introduction

An Example

| Plane | DC to Paris [hours] | Top speed [mph] | Passengers | Throughput [passengers/hour] |
|---|---|---|---|---|
| A | 6.5 | 610 | 470 | 72 (= 470/6.5) |
| B | 3 | 1350 | 132 | 44 (= 132/3) |

- Which has higher performance?
  - Time to deliver 1 passenger? B is 6.5/3 = 2.2 times faster
  - Number of passengers delivered per hour? A is 72/44 = 1.6 times faster
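The two answers in the plane example can be reproduced with a few lines of Python. This is only an illustrative sketch: the dictionary layout and the `throughput` helper are names chosen here, while the flight times and passenger counts are the ones given for planes A and B.

```python
# Flight data from the example: (hours from DC to Paris, passengers carried).
planes = {"A": (6.5, 470), "B": (3.0, 132)}

def throughput(hours, passengers):
    """Passengers delivered per hour of flight (hypothetical helper)."""
    return passengers / hours

# Response-time view: how many times faster B delivers a single passenger.
latency_ratio = planes["A"][0] / planes["B"][0]

# Throughput view: how many times more passengers per hour A delivers.
throughput_ratio = throughput(*planes["A"]) / throughput(*planes["B"])

print(f"By response time, B is {latency_ratio:.1f} times faster")
print(f"By throughput, A is {throughput_ratio:.1f} times faster")
```

The same data therefore ranks the planes differently depending on whether response time or throughput is the metric.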

Page 19: Unit i-introduction

Definition of Performance

- We are primarily concerned with response time
- Performance:

$$\text{Performance}(x) = \frac{1}{\text{Execution\_time}(x)}$$

- “X is n times faster than Y” means:

$$n = \frac{\text{Execution\_time}(y)}{\text{Execution\_time}(x)} = \frac{\text{Performance}(x)}{\text{Performance}(y)}$$

- Because “faster” means both increased performance and decreased execution time, to reduce confusion we will use “improve performance” or “improve execution time”

Page 20: Unit i-introduction

Execution Time and Its Components

- Execution time (response time, elapsed time): the latency to complete a task, including disk accesses, memory accesses, input/output activities, operating system overhead, ...
- CPU time: the time the CPU is computing, excluding I/O or running other programs with multiprogramming; often further divided into user and system CPU times
  - User CPU time: the CPU time spent in the program
  - System CPU time: the CPU time spent in the operating system

Page 21: Unit i-introduction

CPU Time for a program

$$\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$$

$$\text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$$

- Instruction count (IC) = number of instructions executed
- Clock cycles per instruction (CPI):

$$\text{CPI} = \frac{\text{CPU clock cycles for a program}}{\text{IC}}$$

- CPI is one way to compare two machines with the same instruction set, since the instruction count would be the same

Page 22: Unit i-introduction

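CPU Execution Time (cont’d)

$$\text{CPU time} = \text{IC} \times \text{CPI} \times \text{Clock cycle time}$$

Substituting the units of measurement:

$$\text{CPU time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Seconds}}{\text{Program}}$$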
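As a quick illustration of the CPU time equation, the sketch below multiplies the three factors for a made-up program. The function name `cpu_time` and the numbers (2 billion instructions, CPI of 1.5, 1 GHz clock) are assumptions for illustration and do not come from the slides.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds: IC x CPI x clock cycle time.

    The clock cycle time is taken as the reciprocal of the clock rate.
    """
    clock_cycle_time = 1.0 / clock_rate_hz  # seconds per clock cycle
    return instruction_count * cpi * clock_cycle_time

# Hypothetical program: 2e9 instructions, average CPI of 1.5, 1 GHz clock.
seconds = cpu_time(instruction_count=2e9, cpi=1.5, clock_rate_hz=1e9)
print(f"CPU time: {seconds:.2f} s")
```

Halving any one factor while holding the other two fixed halves the CPU time, which is why the three are usually examined together.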

Page 23: Unit i-introduction

CPU Execution Time (cont’d)

The basic technologies involved in changing each characteristic are interdependent, hence it is difficult to change one parameter without affecting the others.

| | IC | CPI | Clock rate |
|---|---|---|---|
| Program | X | | |
| Compiler | X | | |
| ISA | X | X | |
| Organisation | | X | X |
| Technology | | | X |

Page 24: Unit i-introduction

How to Calculate 3 Components?

- Clock cycle time: given in the specification of the computer (clock rate in advertisements)
- Instruction count
  - Count instructions in the loop of a small program
  - Use a simulator to count instructions
  - Hardware counter in a special register (Pentium II)
- CPI
  - Calculate: Execution time / Clock cycle time / Instruction count
  - Hardware counter in a special register (Pentium II)
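The CPI calculation above simply inverts the CPU time equation. A minimal sketch follows, assuming the execution time and instruction count have already been measured; the function name `measured_cpi` and all the numbers are hypothetical.

```python
def measured_cpi(execution_time_s, clock_cycle_time_s, instruction_count):
    """CPI = execution time / clock cycle time / instruction count."""
    total_cycles = execution_time_s / clock_cycle_time_s
    return total_cycles / instruction_count

# Hypothetical measurement: 4 s of CPU time on a 2 GHz machine
# (0.5 ns clock cycle) while executing 5e9 instructions.
cpi = measured_cpi(execution_time_s=4.0,
                   clock_cycle_time_s=0.5e-9,
                   instruction_count=5e9)
print(f"Measured CPI: {cpi:.2f}")
```

In practice the execution time used here should be CPU time rather than elapsed time, otherwise I/O waits inflate the result.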

Page 25: Unit i-introduction

Metrics

- For desktops and workstations, in most cases it is execution time
  - Also known as response time
  - Reciprocal of performance
- For servers, it is normally throughput
  - How many jobs can get done in unit time
  - Almost always a response time limit (per job) is also imposed
- For embedded processors, it is also execution time
  - In many situations a hard or soft real-time deadline is imposed
  - Hard deadlines cannot be missed; soft deadlines can be missed in a limited number of cases
  - Goal: maximize performance within the power budget

Page 26: Unit i-introduction

SPEC Benchmarks

- We want to compare two processors by measuring their performance
- Need some standardized set of programs
- These are called benchmark programs
- Each market sector has a different focus and needs a different set of benchmarks

Page 27: Unit i-introduction

Choosing Programs to Evaluate Performance

- Ideally, run typical programs with typical input before purchase, or before the machine is even built
  - An engineer uses a compiler, a spreadsheet
  - An author uses a word processor, a drawing program, compression software
- Workload: the mixture of programs and OS commands that users run on a machine

Page 28: Unit i-introduction

Benchmarks

- Different types of benchmarks
  - Real programs (e.g., MS Word, Excel, Photoshop, ...)
  - Kernels: small pieces from real programs (Linpack, ...)
  - Toy benchmarks: short, easy to type and run (Quicksort, Puzzle, ...)
  - Synthetic benchmarks: code that matches the frequency of key instructions and operations to real programs (Whetstone, Dhrystone)
- Need industry standards so that different processors can be fairly compared
- Companies exist that create these benchmarks: “typical” code used to evaluate systems

Page 29: Unit i-introduction

Desktop benchmarks

- SPEC CPU is the standard
  - Provided by the Standard Performance Evaluation Corporation (SPEC), http://www.spec.org/
  - Currently in its 5th generation: SPEC89, SPEC92, SPEC95, SPEC2000, SPEC2006
  - Has two types of programs, integer and floating-point, to stress different CPU units
  - SPEC2000 has 12 integer and 14 floating-point programs
  - If you are introducing a new processor for a desktop or uniprocessor workstation/server, you are supposed to report the performance of all 26 programs
- For graphics performance, two benchmarks are available: SPECviewperf and SPECapc

Page 30: Unit i-introduction

Server benchmarks

- SPECrate
  - Not the most reliable measure: run a copy of the same SPEC application on each processor and report the average number of jobs finished per unit time
- SPECSFS
  - Tests the performance of NFS using a series of file server requests; measures CPU, disk, and network throughput; also puts a response time limit on each request
- SPECWeb
  - Web server benchmark; simulates multiple clients requesting both static and dynamic pages from a server; clients can also upload data to the server

Page 31: Unit i-introduction

Server benchmarks

- Transaction processing (TP)
  - Probably the most widely used benchmark in the server community
  - Measures the database access and update throughput of a server
  - Simple examples: airline reservation, bank ATM
  - Complex systems: complicated query processing, e.g. online book shops (Amazon)
  - Provided by the Transaction Processing Performance Council (TPC)
  - TPC-A was the first one; now we have TPC-C (complex queries), TPC-H (unrelated queries), TPC-R (DBMS is optimized based on past query pattern), TPC-W (business-oriented transactional web server)
  - The measured metric is transactions per second, along with a response time limit for individual transactions

Page 32: Unit i-introduction

Embedded sector

- Quite difficult to come up with a good benchmark
  - Wide range of applications
  - Requirements vary widely from one system to another
- Reality: in many cases an embedded system designer designs his/her own benchmarks, which are either the target applications or kernels extracted from them
- Today the best-known benchmark set is offered by EEMBC (Embedded Microprocessor Benchmark Consortium)
  - EEMBC has five types of applications: automotive/industrial (16 kernels), consumer (5 multimedia kernels), networking (3 kernels), office automation (4 graphics and text processing kernels), telecommunications (6 filtering and DSP kernels)

Page 33: Unit i-introduction

Comparing and Summarising Performance

An Example

| Program | Computer A | Computer B | Computer C |
|---|---|---|---|
| P1 (sec) | 1 | 10 | 20 |
| P2 (sec) | 1000 | 100 | 20 |
| Total (sec) | 1001 | 110 | 40 |

- A is 20 times faster than C for program P1
- C is 50 times faster than A for program P2
- B is 2 times faster than C for program P1
- C is 5 times faster than B for program P2

What can we learn from these statements? We know nothing about the relative performance of computers A, B, C!

One approach to summarise relative performance: use the total execution times of the programs
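The total-execution-time summary can be computed directly from the per-program times in the example. The sketch below is illustrative only; the variable names are chosen here, and it assumes each program is run exactly once, which is what a plain sum implies.

```python
# Execution times in seconds for programs P1 and P2 on each computer.
times = {
    "A": {"P1": 1, "P2": 1000},
    "B": {"P1": 10, "P2": 100},
    "C": {"P1": 20, "P2": 20},
}

# Total execution time per computer (equal weight for each program).
totals = {name: sum(per_program.values()) for name, per_program in times.items()}

# The computer with the smallest total is the fastest under this summary.
fastest = min(totals, key=totals.get)

for name, total in totals.items():
    if name != fastest:
        ratio = total / totals[fastest]
        print(f"{fastest} is {ratio:.2f} times faster than {name} on the total")
```

Unlike the four pairwise statements, the totals give a single consistent ranking, although that ranking depends on the assumption that P1 and P2 are run equally often.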

Page 34: Unit i-introduction

Comparing and Summarising Performance (cont’d)

- A benchmark program is compiled and run on the computer under test; note the running time
- The same program is compiled and run on a reference computer
- Compute the SPEC ratio as:

$$\text{SPEC ratio} = \frac{\text{running time on the reference computer}}{\text{running time on the computer under test}}$$

- A SPEC ratio of 50 means the computer under test is 50 times faster than the reference computer

Page 35: Unit i-introduction

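Comparing and Summarising Performance (cont’d)

- The above process is repeated for all programs in the SPEC suite
- The overall SPEC ratio for the computer is the geometric mean of the individual ratios:

$$\text{SPEC}_{ratio} = \sqrt[n]{\prod_{i=1}^{n} \text{SPEC}_i}$$

where
- n is the number of programs in the suite
- SPEC_i is the ratio for program i in the suite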
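The overall SPEC ratio is a geometric mean, which can be sketched in a few lines of Python. The function name `overall_spec_ratio` and the three sample ratios are made up for illustration; they are not measurements from any SPEC suite.

```python
import math

def overall_spec_ratio(ratios):
    """Geometric mean: the n-th root of the product of n per-program ratios."""
    n = len(ratios)
    return math.prod(ratios) ** (1.0 / n)

# Hypothetical per-program SPEC ratios for one computer.
sample_ratios = [10.0, 20.0, 40.0]
print(f"Overall SPEC ratio: {overall_spec_ratio(sample_ratios):.2f}")
```

For long suites, summing logarithms and exponentiating the average is a common way to avoid overflow in the product; the direct form is kept here because it mirrors the formula.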

Page 36: Unit i-introduction

Comparing and Summarising Performance (cont’d)

- Suppose the SPECRatio of computer A on a benchmark is 1.25 times higher than that of computer B
- This means the performance of computer A is 1.25 times higher than that of computer B
- Note: the choice of the reference computer is irrelevant
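Why the reference computer drops out can be seen by expanding the ratio of the two SPEC ratios. The following is a short derivation sketch; the symbols T_ref, T_A, and T_B (running times of the benchmark on the reference computer and on computers A and B) are notation introduced here, not taken from the slides.

```latex
\[
1.25
= \frac{\text{SPECRatio}_A}{\text{SPECRatio}_B}
= \frac{T_{ref} / T_A}{T_{ref} / T_B}
= \frac{T_B}{T_A}
= \frac{\text{Performance}_A}{\text{Performance}_B}
\]
```

The reference time cancels, so only the two measured running times determine the comparison.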

Page 37: Unit i-introduction

Quantitative Principles of Design

- Where to spend time making improvements? Make the common case fast
  - Most important principle of computer design: spend your time on improvements where those improvements will do the most good
- Example
  - Instruction A represents 5% of execution
  - Instruction B represents 20% of execution
  - Even if you can drive the time for A to 0, the CPU will only be about 5% faster
- Key questions
  - What is the frequent case?
  - How much can performance be improved by making that case faster?

Page 38: Unit i-introduction

1. Focus on the Common Case - Amdahl’s Law

- Suppose that we make an enhancement to a machine that will improve its performance; speedup is the ratio:

$$\text{Speedup} = \frac{\text{ExTime for entire task without enhancement}}{\text{ExTime for entire task using enhancement}} = \frac{\text{Performance for entire task using enhancement}}{\text{Performance for entire task without enhancement}}$$

- Amdahl’s Law states that the performance improvement that can be gained by a particular enhancement is limited by the amount of time that enhancement can be used

Page 39: Unit i-introduction

Computing Speedup

[Figure: original execution time of 20 + 10 units; with the enhancement, 20 + 2 units]

- Fraction_enhanced = fraction of execution time in the original machine that can be converted to take advantage of the enhancement (e.g., 10/30)
- Speedup_enhanced = how much faster the enhanced code will run (e.g., 10/2 = 5)
- ExTime_new = execution time of the unenhanced part + new execution time of the enhanced part

$$\text{ExTime}_{new} = \text{ExTime}_{unenhanced} + \frac{\text{ExTime}_{enhanced}}{\text{Speedup}_{enhanced}}$$

Page 40: Unit i-introduction

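Computing Speedup (cont’d)

So the times are:

$$\text{ExTime}_{unenhanced} = \text{ExTime}_{old} \times (1 - \text{Fraction}_{enhanced})$$

$$\text{ExTime}_{enhanced} = \text{ExTime}_{old} \times \text{Fraction}_{enhanced}$$

Factor out ExTime_old and divide the enhanced part by Speedup_enhanced:

$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$

The overall speedup is the ratio of ExTime_old to ExTime_new:

$$\text{Speedup} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$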
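The closed form of Amdahl's Law can be checked against the step-by-step time calculation. The sketch below uses the numbers hinted at earlier in the slides (an enhanced fraction of 10/30 and an enhanced speedup of 10/2 = 5); the function name `overall_speedup` and the choice of 30 time units for the original run are assumptions made here.

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's Law: 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Step-by-step version: 30 time units originally, 10 of which can be enhanced.
old_time = 30.0
unenhanced_time = 20.0               # part that cannot use the enhancement
enhanced_time = 10.0                 # part that can use the enhancement
new_time = unenhanced_time + enhanced_time / 5.0   # enhanced part runs 5x faster

direct = old_time / new_time                       # ratio of old to new time
formula = overall_speedup(fraction_enhanced=10.0 / 30.0, speedup_enhanced=5.0)

print(f"Direct ratio: {direct:.3f}, Amdahl formula: {formula:.3f}")
```

Both routes give the same value, which is well below the factor of 5 applied to the enhanced part because two thirds of the run is untouched.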

Page 41: Unit i-introduction

Example 1

An enhancement runs 10 times faster and it affects 40% of the execution time; compute the overall speedup:

- Fraction_enhanced = 0.40
- Speedup_enhanced = 10
- Speedup_overall = ?

$$\text{Speedup}_{overall} = \frac{1}{(1 - 0.4) + \dfrac{0.4}{10}} = \frac{1}{0.64} = 1.56$$

Page 42: Unit i-introduction

Example 2

A common transformation required in graphics processors is square root. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10.

The other alternative is just to try to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for half of the execution time for the application. The design team believes that they can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root. Compare these two design alternatives.

Page 43: Unit i-introduction

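Apply Amdahl’s Law to each alternative:

$$\text{Speedup} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$

$$\text{Speedup}_{FPSQR} = \frac{1}{(1 - 0.2) + \dfrac{0.2}{10}} = \frac{1}{0.82} \approx 1.22$$

$$\text{Speedup}_{FP} = \frac{1}{(1 - 0.5) + \dfrac{0.5}{1.6}} = \frac{1}{0.8125} \approx 1.23$$

Hence improving the performance of the FP operations overall is slightly better because of the higher frequency.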
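The comparison in Example 2 can be reproduced numerically. This is a small sketch using the fractions and speedup factors stated in the example; the helper name `amdahl_speedup` is chosen here for illustration.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Alternative 1: FPSQR is 20% of execution time and becomes 10x faster.
speedup_fpsqr = amdahl_speedup(0.20, 10.0)

# Alternative 2: all FP instructions are 50% of execution time and become 1.6x faster.
speedup_fp = amdahl_speedup(0.50, 1.6)

better = "all FP" if speedup_fp > speedup_fpsqr else "FPSQR only"
print(f"FPSQR only: {speedup_fpsqr:.2f}, all FP: {speedup_fp:.2f}, better: {better}")
```

A modest improvement applied to half of the execution time edges out a large improvement applied to a fifth of it.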

Page 44: Unit i-introduction

Another method of calculating CPU time

The CPU clock cycles taken by a program are given by:

$$\text{CPU clock cycles} = \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i$$

where
- IC_i represents the number of times instruction i is executed in a program
- CPI_i represents the average number of clocks per instruction for instruction i

Page 45: Unit i-introduction

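The CPU time taken by a program is given by:

$$\text{CPU time} = \left( \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i \right) \times \text{Clock cycle time}$$

The overall CPI is expressed by:

$$\text{CPI} = \frac{\sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i}{\text{Instruction count}} = \sum_{i=1}^{n} \frac{\text{IC}_i}{\text{Instruction count}} \times \text{CPI}_i$$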
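The per-instruction-class view of CPU cycles and overall CPI can be sketched as a weighted sum. The instruction classes, counts, and per-class CPI values below are invented for illustration, as are the variable names; only the summation itself follows the definitions of IC_i and CPI_i.

```python
# Hypothetical instruction mix: class name -> (IC_i, CPI_i).
instruction_mix = {
    "alu":    (500, 1.0),
    "load":   (300, 2.0),
    "branch": (200, 3.0),
}

# CPU clock cycles = sum over classes of IC_i * CPI_i.
total_cycles = sum(count * cpi for count, cpi in instruction_mix.values())

# Total instruction count is the sum of the per-class counts.
instruction_count = sum(count for count, _ in instruction_mix.values())

# Overall CPI is the cycle total divided by the instruction count.
overall_cpi = total_cycles / instruction_count

# CPU time follows by multiplying cycles by an assumed 1 ns clock cycle time.
clock_cycle_time_s = 1e-9
cpu_time_s = total_cycles * clock_cycle_time_s

print(f"Cycles: {total_cycles:.0f}, overall CPI: {overall_cpi:.2f}")
```

Equivalently, the overall CPI is each class CPI weighted by that class's share of the instruction count.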

Page 46: Unit i-introduction

Example 3

Suppose we have made the following measurements:

- Frequency of FP operations = 25%
- Average CPI of FP operations = 4.0
- Average CPI of other instructions = 1.33
- Frequency of FPSQR = 2%
- CPI of FPSQR = 20

Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the processor performance equation.

Page 47: Unit i-introduction

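Find the original CPI with neither enhancement:

$$\text{CPI}_{original} = (4 \times 25\%) + (1.33 \times 75\%) = 2.0$$

Compute the CPI for the enhanced FPSQR by subtracting the cycles saved from the original CPI:

$$\text{CPI}_{newFPSQR} = \text{CPI}_{original} - 2\% \times (\text{CPI}_{oldFPSQR} - \text{CPI}_{newFPSQRonly}) = 2.0 - 0.02 \times (20 - 2) = 1.64$$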

Page 48: Unit i-introduction

Compute the CPI for the enhancement of all FP instructions the same way:

$$\text{CPI}_{newFP} = \text{CPI}_{original} - 25\% \times (\text{CPI}_{oldFP} - \text{CPI}_{newFPonly}) = 2 - 0.25 \times (4 - 2.5) = 2 - (0.25 \times 1.5) = 1.625$$

Performance is marginally better with the FP enhancement.

$$\text{Speedup}_{newFP} = \frac{\text{CPI}_{original}}{\text{CPI}_{newFP}} = \frac{2}{1.625} = 1.23$$
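Example 3 can be worked through in code using the measurements it states. This is a sketch under the assumption that the instruction count and clock rate are the same for both alternatives, so the speedup reduces to a ratio of CPIs; the variable names are chosen here.

```python
# Measurements given in Example 3.
freq_fp = 0.25       # fraction of instructions that are FP operations
cpi_fp = 4.0         # average CPI of FP operations
cpi_other = 1.33     # average CPI of all other instructions
freq_fpsqr = 0.02    # fraction of instructions that are FPSQR
cpi_fpsqr = 20.0     # CPI of FPSQR

# Original CPI: frequency-weighted average over FP and non-FP instructions.
cpi_original = freq_fp * cpi_fp + (1.0 - freq_fp) * cpi_other

# Alternative 1: the CPI of FPSQR drops from 20 to 2.
cpi_new_fpsqr = cpi_original - freq_fpsqr * (cpi_fpsqr - 2.0)

# Alternative 2: the average CPI of all FP operations drops from 4.0 to 2.5.
cpi_new_fp = cpi_original - freq_fp * (cpi_fp - 2.5)

# With IC and clock rate unchanged, speedup is the ratio of old to new CPI.
speedup_fpsqr = cpi_original / cpi_new_fpsqr
speedup_fp = cpi_original / cpi_new_fp

print(f"Original CPI: {cpi_original:.2f}")
print(f"New FPSQR: CPI {cpi_new_fpsqr:.2f}, speedup {speedup_fpsqr:.2f}")
print(f"New FP:    CPI {cpi_new_fp:.2f}, speedup {speedup_fp:.2f}")
```

The unrounded original CPI is slightly below 2.0, so the intermediate values differ a little from the hand calculation, but the conclusion is the same: the FP enhancement comes out marginally ahead.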

Page 49: Unit i-introduction

2. Take Advantage of Parallelism

- One of the most important methods for improving performance
- In a server:
  - Multiple processors and multiple disks can be used, which improves throughput
  - Scalability (being able to expand memory, number of processors, and disks) is a valuable asset for servers

Page 50: Unit i-introduction

2. Take Advantage of Parallelism

- In an individual processor:
  - Parallelism among instructions: pipelining
  - Set-associative caches use multiple banks of memory

Page 51: Unit i-introduction

3. Principle of Locality

- Programs tend to reuse data and instructions they have used recently
- A widely held rule of thumb: a program spends 90% of its execution time in only 10% of the code
- Two different types of locality
  - Temporal locality: recently accessed items are likely to be accessed in the near future
  - Spatial locality: items whose addresses are near one another tend to be referenced close together in time
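The effect of locality can be illustrated with a toy cache model. The sketch below is not from the slides: it assumes a small fully associative cache with LRU replacement, 8-word blocks, and 16 block frames, and the function name `hit_rate` and the access patterns are hypothetical.

```python
from collections import OrderedDict

def hit_rate(addresses, block_size=8, num_blocks=16):
    """Fraction of accesses that hit in a small fully associative LRU cache."""
    cache = OrderedDict()  # block number -> None, least recently used first
    hits = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            hits += 1
            cache.move_to_end(block)        # mark as most recently used
        else:
            cache[block] = None             # bring the block into the cache
            if len(cache) > num_blocks:
                cache.popitem(last=False)   # evict the least recently used block
    return hits / len(addresses)

ROWS, COLS = 64, 64
# Row-major walk touches consecutive addresses (good spatial locality).
row_major = [r * COLS + c for r in range(ROWS) for c in range(COLS)]
# Column-major walk jumps COLS words between accesses (poor spatial locality).
col_major = [r * COLS + c for c in range(COLS) for r in range(ROWS)]
# A small loop that rereads the same 32 words (good temporal locality).
hot_loop = list(range(32)) * 100

print(f"Row-major hit rate:    {hit_rate(row_major):.3f}")
print(f"Column-major hit rate: {hit_rate(col_major):.3f}")
print(f"Hot-loop hit rate:     {hit_rate(hot_loop):.3f}")
```

The same data is visited in the first two patterns, yet only the row-major order benefits from the cache, which is the practical consequence of spatial locality; the hot loop shows the temporal case.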

Page 52: Unit i-introduction

Principle of Locality

We can predict which instructions and data a program will use in the near future based on its accesses in the recent past
