
  • CS510: Computer Architectures, Fall 2001. Jung Wan Cho, [email protected], Ext. 3512, 8701

  • Administration
    Text Book: J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition, MKP, Inc.

    Grading: Homework 10%, Mid-Term and Final Exams 40% each, Others 10%
    Office: Room 3437 (3rd fl., East wing of CS Bldg)
    Office Hours: M/W 10:00~11:30 AM
    TA:

  • Table of Contents: Introduction, Cost, Performance, Instruction Set Architecture, Implementation, Pipelining, Instruction-level Parallelism, Memory Hierarchy, Storage Systems, Input/Output Systems, etc.

  • Lecture 1: Introduction

  • Hardware Resources in a Computer System

  • Hardware Resources
    - Central Processing Unit (CPU): Registers and Flags, Arithmetic and Logic Unit (ALU, Execution Unit), Buses
    - Memory System: Main Memory, Memory Bus
    - I/O System: I/O Devices, I/O Buses, I/O Controllers

  • Central Processing Unit

  • Central Processing Unit
    Hardware resources in a CPU can be classified into 3 classes:
    - Storage Resources: Registers, Flags
    - Functional Resources: ALU
    - Transfer Resources: Internal Buses

  • Storage Resources

  • Storage Resources
    Registers
    - Dedicated Registers
      - Data Registers
      - Address Registers: Index Register, Base Address Register, Stack Pointer Register, Page/Segment Pointer Register
      - Control Registers: Major State Register, Timing State Register, etc.
      - Special Registers: registers that cannot be accessed by the programmer (PC, MAR, MBR, IR, etc.)
    - General Purpose Registers: registers that can be used for all purposes
    Flags
    - Represent the resulting status of arithmetic operations
    - Represent the status of the CPU

  • Dedicated Registers vs. General Purpose Registers
    - Efficiency of register utilization: GPRs are utilized better than dedicated registers; with dedicated registers one may run short of data registers while some address registers sit unused.
    - Instruction length: dedicated registers make the instruction shorter than GPRs, since fewer bits are needed to address them.
    - Execution speed: access time of a dedicated register is faster than that of a GPR; a shorter register address results in a faster access time (register address decoding time).

  • Flags
    A flag is a 1-bit register (flip-flop).
    - Flags representing the result of an arithmetic operation: O_flag, Z_flag, N_flag (S_flag), C_flag, etc.
    - Flags representing the status of the CPU: IE_flag (interrupt enable), IR_flag (interrupt requested), IM_flag (interrupt mask), U_flag (User/Supervisor mode), etc.
    Flags can be packed in a register; packed flags along with other important special registers form the Program Status Word(s) (PSW), as sketched below.
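    To make the idea of packed flags concrete, here is a small, illustrative Python sketch (not tied to any particular ISA from the course): a handful of 1-bit flags are packed into a single PSW-like word and set from the result of an addition. The flag names, bit positions, and 8-bit width are all assumptions for illustration.

```python
# Illustrative sketch: packing 1-bit flags into a PSW-like word (assumed layout).
Z_FLAG = 1 << 0   # zero result
N_FLAG = 1 << 1   # negative result (sign bit set)
C_FLAG = 1 << 2   # carry out of the top bit
O_FLAG = 1 << 3   # signed overflow
IE_FLAG = 1 << 4  # interrupts enabled

def set_flags_after_add(a, b, width=8):
    """Add two 'width'-bit operands and return (result, packed flag bits)."""
    mask = (1 << width) - 1
    result = (a + b) & mask
    psw = 0
    if result == 0:
        psw |= Z_FLAG
    if result >> (width - 1):            # sign bit of the result
        psw |= N_FLAG
    if (a + b) > mask:                   # carry out of the top bit
        psw |= C_FLAG
    # signed overflow: operands share a sign that differs from the result's sign
    sa, sb, sr = a >> (width - 1), b >> (width - 1), result >> (width - 1)
    if sa == sb and sr != sa:
        psw |= O_FLAG
    return result, psw

print(set_flags_after_add(0x7F, 0x01))   # (128, 10): N and O flags set (signed overflow)
```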

  • Register Life
    A register life begins with storing information into the register and ends just before the next store of information into that register.
    Length of the register life: the number of machine instructions executed during the register life.

  • Statistics: How Many Registers?
    We need to evaluate the following in terms of time used or time saved by having or not having enough registers:
    - How many registers are used simultaneously? The average number of simultaneously live registers is 2~6.
    - How many would be sufficient most or all of the time? No program uses more than 15 registers simultaneously; 17 out of 41 programs would get by with fewer than 10 registers (10 registers would suffice 90% of the time).
    - What would the overhead be if the number of registers were reduced? For a live register, 2 pairs of LD/ST instructions; for a dead register life (a register with a long dormant period), 1 pair of LD/ST instructions.
    - For what purposes were the registers used during their lives? On average 39% (18~68%) of the lives are used for indexing.

  • Status of the CPU for the Running Program
    Context: the processor environment for the running program, i.e., the contents of registers and flags. The context of the program continuously changes as execution progresses.
    Context Switch: when the executing procedure changes, as in a procedure call or return, the context needs to be changed. A context switch is a very costly operation: multiple register load and store operations (LD multiple registers, ST multiple registers). Remedies: multiple register sets, or overlapping multiple register sets.

  • Context Switch
    - The instruction set supports Load Multiple Registers and Store Multiple Registers instructions.
    - Multiple Register Sets: instead of saving and restoring the context, simply change the register-set pointer value (contrasted in the sketch below).
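    A minimal Python sketch of the contrast, with made-up sizes (16 registers, 4 register sets): the single-set CPU must copy every register to and from memory on a context switch, while the multi-set CPU only rewrites a register-set pointer.

```python
# Hypothetical sketch of the two context-switch strategies from the slide.
NUM_REGS = 16
NUM_SETS = 4

class SingleSetCPU:
    """One register set: a context switch stores and reloads every register."""
    def __init__(self):
        self.regs = [0] * NUM_REGS
    def context_switch(self, save_area_old, save_area_new):
        save_area_old[:] = self.regs      # "store multiple registers"
        self.regs[:] = save_area_new      # "load multiple registers"

class MultiSetCPU:
    """Multiple register sets: a context switch just changes the set pointer."""
    def __init__(self):
        self.reg_sets = [[0] * NUM_REGS for _ in range(NUM_SETS)]
        self.set_ptr = 0
    @property
    def regs(self):
        return self.reg_sets[self.set_ptr]
    def context_switch(self, new_set):
        self.set_ptr = new_set            # no memory traffic at all

cpu = MultiSetCPU()
cpu.regs[0] = 42          # process 0's register 0
cpu.context_switch(1)     # process 1 now sees a different physical register set
```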

  • Functional Resources

  • Arithmetic and Logic Unit
    Arithmetic Unit
    - Basic arithmetic unit: adder
    - High-performance arithmetic unit (superscalar, superpipeline): adder, multiplier, divider, etc.; floating-point unit, decimal arithmetic unit, etc.
    Logic Unit
    - Basic logic unit: logic functions associated with one of the complete sets
    - High-performance logic unit: AND, OR, Invert, EXOR, etc.
    Shifter
    - Serial shifter
    - Parallel shifter

  • Speed-Enhanced Functional Units
    - Multiple Functional Units (Superscalar): more than one functional unit performs functions concurrently on different sets of data.
    - Pipelined Functional Unit (Superpipeline): one functional unit performs the same function, but different phases of the operation, on many sets of data.

  • Transfer Resources

  • Internal Bus
    The more internal buses, the more opportunities for concurrent operations.

  • Internal Bus

  • Memory System

  • Main Memory

  • Performance Gap

  • Memory Bus
    The SPEED of memory is always considered to be too SLOW, and the CAPACITY of memory is always considered to be too SMALL.

  • Memory Bandwidth
    Memory bandwidth: the information transfer rate in bits/sec. It is determined by the memory cycle (or access) time and the delay of the memory bus. Memory bandwidth can be increased by increasing the information unit (the size of a word), but this requires increasing the size of registers, the width of buses, etc. (a worked calculation follows).
    Bottleneck
    - Memory bandwidth is shared by the CPU and I/O: normally the CPU accesses memory in every memory cycle for instruction fetch and data fetch/store, and I/O steals some of the memory cycles from the CPU.
    - The memory cycle is too slow with respect to CPU speed: the CPU often waits for information from memory before continuing its operation(s).
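    A worked calculation with assumed numbers (a 100 ns memory cycle and 32- vs. 64-bit words), just to show how the slide's two levers, cycle time and word size, set the bandwidth:

```python
# Assumed numbers: bandwidth = (accesses per second) x (bits per access).
def bandwidth_bits_per_sec(cycle_time_ns, word_bits):
    cycles_per_sec = 1e9 / cycle_time_ns
    return cycles_per_sec * word_bits

print(bandwidth_bits_per_sec(100, 32))   # 100 ns cycle, 32-bit word -> 3.2e8 bits/sec
print(bandwidth_bits_per_sec(100, 64))   # doubling the word width doubles the bandwidth
```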

  • Speed Gap
    Cache
    - A small-capacity, fast storage placed between the CPU and main memory. Information anticipated to be needed by the CPU is brought into the cache ahead of time, so the CPU finds the information in the cache most of the time and the effective memory access time becomes close to the cache access time.
    Multiple-Module Memory and Address Interleaving
    - Main memory is organized as multiple memory modules, and the memory system maps consecutive addresses onto different modules. Accesses to consecutive addresses can then be pipelined: while one module sends out the accessed information, another module accesses its information, and meanwhile yet another module decodes its address, and so on (see the sketch below).
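    A minimal sketch of low-order address interleaving, assuming 4 modules: the module number comes from the low-order address bits, so consecutive addresses land in different modules and their accesses can overlap.

```python
# Low-order interleaving across memory modules (4 modules assumed for illustration).
NUM_MODULES = 4

def module_and_offset(address):
    """Consecutive addresses map to different modules, so accesses can be overlapped."""
    return address % NUM_MODULES, address // NUM_MODULES

for addr in range(8):
    print(addr, module_and_offset(addr))
# addresses 0,1,2,3 hit modules 0,1,2,3; address 4 wraps back to module 0
```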

  • Memory Capacity
    Virtual Storage System
    - Main memory is supplemented by a large-capacity secondary storage.
    - Information needed by the CPU is made accessible from main memory even though it is stored in the secondary storage; in this way the capacity of main memory virtually looks like the capacity of the secondary storage.
    - Some mechanism is needed to bring the information stored in the secondary storage into main memory before the CPU tries to access it (a simplified sketch follows).
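    A deliberately simplified Python sketch of that mechanism (the page size, table contents, and load_page_from_disk helper are all hypothetical): a per-page table records whether a virtual page currently resides in main memory or only in secondary storage, and a miss brings the page in before the access completes.

```python
PAGE_SIZE = 4096

# virtual page number -> ("memory", frame) or ("disk", block); contents are made up
page_table = {0: ("memory", 7), 1: ("disk", 1234)}

def load_page_from_disk(block):
    """Hypothetical helper: copy the page into a free main-memory frame, return the frame."""
    return 8                                  # pretend the page now occupies frame 8

def access(virtual_address):
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    kind, where = page_table[vpn]
    if kind == "disk":                        # page fault: bring the page into memory first
        where = load_page_from_disk(where)
        page_table[vpn] = ("memory", where)
    return where * PAGE_SIZE + offset         # physical address the CPU actually uses

print(access(0 * PAGE_SIZE + 100))            # page 0 is already in frame 7 -> 28772
print(access(1 * PAGE_SIZE + 10))             # page 1 faults in from disk to frame 8 -> 32778
```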

  • Trends in Computer Architectures
    - Pre-WWII: mechanical calculating machines
    - WWII-50s: technology improvement - relays --> vacuum tubes - high-level languages
    - 60s: miniaturization/packaging - transistors - integrated circuits
    - 70s: semantic gap - complex instruction sets - large support in hardware - microcoding
    - 80s: keep it simple - RISC - shift complexity to software
    - 90s: what to do with all these transistors? - large on-chip caches - prefetching hardware - speculative execution - special-purpose instructions and hardware - multiple processors on a chip

  • Review on Technology Trends

  • Development of Computer Architectures:

  • Dimension of Evolution
    - Storage Capacity: Main Memory, Storage Hierarchy
    - Speed: component-level (technology-driven) speedup - cycle time of Processor, Memory, I/O; instruction-level speedup; task/program-level speedup
    - Functionality: Instruction Set, High-Function Execution Unit
    - Friendliness: programmer-friendly, compiler-friendly

  • Focusing Points
    [Figure: a computer system characterized per subsystem - CPU (clock period, CPI, instruction count), memory (capacity, cycle time), secondary storage (capacity, data rate), and the bandwidth of the connecting buses.]

  • A Balanced Computer System
    [Figure: balance among memory capacity (C), CPU execution rate (E), and I/O data rate B (Mb/sec), expressed as the C/E ratio and the E/B ratio.]

  • A Balanced Computer System: Subsystem Characteristics
    CPU
    - Instruction execution rate: MIPS/MFLOPS. MIPS is misleading, since the amount of information processed by an instruction varies: an 8-bit machine at 1 MIPS does less work than a 32-bit machine at 1 MIPS.
    Memory
    - Cycle time is misleading - how fast can a unit of information be accessed? Bandwidth = cycles/sec x bytes/cycle; for a given cycle time, doubling the width of a word doubles the bandwidth.
    Secondary Storage
    - Data rate (B) does not completely represent the device performance: IRG, seek time, latency.

  • A Balanced Computer System: Gaining Higher MIPS
    FACT: Memory Cycle Time >> CPU Cycle Time
    - Design features: avoid memory accesses - large register file, cache
    FACT: I/O Transfer Rate (B)
  • A Balanced Computer System: Gaining Higher MIPS
    To gain higher MIPS (E), a balanced C/E ratio and a balanced E/B ratio are needed.

  • CPU Performance

    CPU time = IC x CPI x Clock Period
    - Clock Period: determined by component technology and hardware organization
    - CPI (Clocks per Instruction): determined by hardware organization and instruction set architecture
    - IC (Instruction Count): determined by instruction set architecture and compiler

    CPU performance depends on these three factors and can be improved by reducing one or more of them (a worked example follows).
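    A worked example with assumed numbers (2 million instructions, an average CPI of 1.5, and a 2 ns clock), just to show how the three factors multiply:

```python
# Worked example with assumed numbers: CPU time = IC x CPI x clock period.
IC = 2_000_000          # instructions executed
CPI = 1.5               # average clocks per instruction
clock_period = 2e-9     # 2 ns clock (500 MHz)

cpu_time = IC * CPI * clock_period
print(cpu_time)         # 0.006 s; halving any one factor halves the CPU time
```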

  • Clock Period
    Clock Period = (setup time + hold time) of registers (F/F) + delay of (registers + ALU)
    The clock period is determined by the timing characteristics of the components and by the organization (algorithm) of the execution unit.

  • Clock Period: Evolution of Component Technology
    - Compact packages, low power consumption, shorter delay, faster clock period, smaller-size systems
    - High-density packages, higher functionality, more reliable systems

  • Clock Period: Evolution of the Execution Unit (ALU)
    Add Time (Sam Winograd): the fastest add algorithm uses the Chinese Remainder (CR) Representation + a Conditional Sum Adder.

    The speed of arithmetic algorithms reached the theoretical upper bound: addition in the second generation, and multiplication in the early part of the third-generation computer era.

  • Chinese Remainder Representation
    Chinese Remainder Theorem
    Given a set of n pairwise relatively prime numbers m1, m2, ..., mn, a set of remainders uniquely determines an integer A in the range 0 <= A < M, where

        M = m1 x m2 x ... x mn

    Representation of A with the mi's:

        A = (a1, a2, ..., an) = (|A|m1, |A|m2, ..., |A|mn),   where |A|mi denotes A mod mi

    Conversion back to a conventional (decimal) number: let Ni = M / mi (the product of all moduli except mi); then

        A = | SUM(i = 1..n)  Ni x | ai / Ni |mi |M

    where the division ai / Ni is taken modulo mi, i.e., ai is multiplied by the multiplicative inverse of Ni mod mi (a sketch follows).
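    A small Python sketch of CRR with illustrative moduli (3, 5, 7): numbers are encoded as residues, digit-wise (carry-free) addition is performed, and the result is converted back with the CRT formula above. `pow(n, -1, m)` computes the modular inverse (Python 3.8+).

```python
from math import prod

MODULI = (3, 5, 7)            # small pairwise relatively prime moduli, for illustration
M = prod(MODULI)              # 105: every 0 <= A < 105 has a unique residue representation

def to_crr(a):
    return tuple(a % m for m in MODULI)

def from_crr(residues):
    """CRT reconstruction: A = | sum_i Ni * |ai * Ni^(-1)|mi |M  with Ni = M / mi."""
    total = 0
    for a_i, m_i in zip(residues, MODULI):
        n_i = M // m_i
        total += n_i * ((a_i * pow(n_i, -1, m_i)) % m_i)
    return total % M

a, b = 23, 48
# addition is digit-wise and carry-free: each residue pair is added modulo its own modulus
s = tuple((x + y) % m for x, y, m in zip(to_crr(a), to_crr(b), MODULI))
print(s, from_crr(s))         # (2, 1, 1) and 71 == 23 + 48
```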

  • Characteristics of CRR
    - Compact code: much shorter than the original number
    - Independent operations on individual code symbols: no need to consider carry propagation to adjacent code symbols
    - Operations on complements of numbers for subtraction or negative-number representation
    - Easy to generate the CRR and easy to convert back to decimal numbers

  • Conditional Sum Adder
    [Worked example: for each bit (or block) position, a (carry, sum) pair is precomputed both for carry-in = 0 and for carry-in = 1; the actual carries then select the correct pairs level by level. The slide steps through a small binary addition in this fashion; a sketch of the idea follows.]
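    A hedged Python sketch of the conditional-sum idea (the block size and LSB-first bit ordering are arbitrary choices for illustration): each block's sum and carry-out are precomputed for both possible carry-ins, and the actual carry then selects the correct pair block by block.

```python
def block_add(x_bits, y_bits, carry_in):
    """Add two equal-length bit blocks (LSB first); return (sum_bits, carry_out)."""
    out, c = [], carry_in
    for x, y in zip(x_bits, y_bits):
        s = x ^ y ^ c
        c = (x & y) | (c & (x ^ y))
        out.append(s)
    return out, c

def conditional_sum_add(x_bits, y_bits, block=2):
    result, carry = [], 0
    for i in range(0, len(x_bits), block):
        xb, yb = x_bits[i:i+block], y_bits[i:i+block]
        s0, c0 = block_add(xb, yb, 0)      # precomputed assuming carry-in = 0
        s1, c1 = block_add(xb, yb, 1)      # precomputed assuming carry-in = 1
        result += s1 if carry else s0      # the real carry selects the right pair
        carry = c1 if carry else c0
    return result, carry

# 11 + 13 = 24: returns ([0, 0, 0, 1], 1), i.e. sum bits 0001 (LSB first) with carry-out 1
print(conditional_sum_add([1, 1, 0, 1], [1, 0, 1, 1]))
```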

  • Upper Bound of Add Time

    T > ta + log_(r/2) (n / (r/2)) x ts
    - n: number of bits of the largest remainder
    - ta: 1-bit add time
    - ts: correct sum-and-carry selection time
    - r: fan-in of the functional module
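    An illustrative evaluation of the bound's shape under assumed parameters (16-bit remainders, fan-in 4, unit ta and ts), with the number of selection levels rounded up to an integer:

```python
from math import ceil

def add_time_lower_bound(n, r, t_a, t_s):
    """Evaluate t_a + (number of selection levels) * t_s for assumed parameters."""
    base = max(ceil(r / 2), 2)           # fan-in available per selection level
    blocks = ceil(n / base)
    levels = 0
    while blocks > 1:                    # each level combines groups of 'base' blocks
        blocks = ceil(blocks / base)
        levels += 1
    return t_a + levels * t_s

print(add_time_lower_bound(n=16, r=4, t_a=1.0, t_s=1.0))   # 1-bit add + 3 selection levels = 4.0
```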

    The speed of arithmetic operations can be improved further only by:
    - Component technology: faster switching, more functions in a package (larger fan-in)
    - Architecture: multiple functional units and concurrent operations

  • Clocks per Instruction
    Complex instructions take more clocks to complete.

  • Clocks per Instruction (CPI)
    Clocks per Instruction - the number of clock pulses needed to complete an instruction.

    - Dependent on the instruction set processor architecture
    - Concurrent execution of multiple instructions: vector, pipelined, superscalar, VLIW
    - Simple, shorter, direct (not encoded), non-memory-referencing instructions (RISC) are favorable: RR-type instructions

  • Instruction Count
    The number of machine instructions needed to implement an application or an algorithm.
    - Depends on the instruction set: in general, code becomes longer with simple instruction sets and shorter with complex instruction sets.
    - Depends on the compiler: it is a very difficult task for compilers to optimize code generation, especially with a general-purpose instruction set.

  • Clocks per Instruction and Instruction Count: Instruction Set
    Range of the instruction set. Issues in instruction set design: a trade-off among the 3 Es.
    - Elegance: completeness, symmetry, flexibility
    - Efficiency: instruction length, address map, frequency of use, memory BW utilization, instruction execution overhead
    - Environment: multiprogramming environment, code generation by compilers

  • Clocks per Instruction and Instruction Count: Complex Instruction Set (CISC)
    Design philosophy:
    - Elegance: general-purpose functions for all kinds of applications (programmer-friendly, compiler-friendly); support for a variety of addressing modes.
    - Efficiency: reduce the semantic gap between HLL and ML by including HLL primitives in the IS - this increases the CPI but significantly reduces the instruction count and is compiler-friendly; reduce the ratio of overhead to execution - high-performance instructions use the memory bandwidth efficiently and reduce the instruction count.
    - Environment: instruction-level support for sharing and protection of resources and for context switching reduces the instruction count.

  • Instruction Set Evolution: Criticism of CISC
    Large, powerful instruction set
    - Efficiency issues: large, powerful, general-purpose instruction sets are efficient in terms of flexibility, application adaptability, and memory-bandwidth utilization (instruction count). However, they are also inefficient: specialized instructions for several HLLs are excess baggage for any particular language, compilers may not utilize the right instructions, and program execution characteristics are ignored - for the most frequently used and most time-consuming instructions, CPI is what matters.

    - Control unit issues: the control unit also becomes larger and more complex, so microprogramming the CU is inevitable; microprogramming is inefficient (slow clock cycle) and consumes a significant portion of the processor chip area.

  • Clocks per Instruction and Instruction Count: Reduced Instruction Set (RISC)
    Philosophy:
    - Make the most frequently used statements (instructions) simple and fast.
    - Make the most time-consuming statements (instructions) fast.

  • Clocks per Instruction and Instruction Count: Reduced Instruction Set (RISC)

    Most Frequently Executed Instructions

  • Clocks per Instruction and Instruction Count: Reduced Instruction Set (RISC)

    Most Time-Consuming Instructions
    - The most time-consuming statements are procedure CALLs and RETURNs, which involve a large number of loads and stores. To make CALLs and RETURNs fast, i.e., to reduce CPI: use multiple sets of registers and switch register sets on a context switch; this avoids memory accesses for the context switch at the cost of R-R moves for parameters. Further optimization comes from overlapping multiple sets of registers (sketched below).
    - Other time-consuming statements are branch-type statements in pipelined execution architectures. To make branches fast: branch optimization by compilers.
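    A hedged Python sketch of overlapping register windows in the spirit of the slide (the window size, overlap, and register-file size are illustrative, not those of a specific machine): a CALL just slides the window pointer, so the caller's outgoing parameter registers become the callee's incoming ones with no loads or stores.

```python
WINDOW = 8           # logical registers visible to one procedure
OVERLAP = 2          # registers shared between caller and callee (parameter passing)
PHYS_SIZE = 64       # physical register file size (wraps around when exhausted)

class WindowedRegisterFile:
    def __init__(self):
        self.base = 0                                   # start of the current window
    def phys(self, logical_reg):
        """Map a logical register of the current procedure to a physical register."""
        return (self.base + logical_reg) % PHYS_SIZE
    def call(self):
        self.base = (self.base + WINDOW - OVERLAP) % PHYS_SIZE   # slide forward: no stores
    def ret(self):
        self.base = (self.base - (WINDOW - OVERLAP)) % PHYS_SIZE # slide back: no loads

rf = WindowedRegisterFile()
print(rf.phys(6))   # caller writes a parameter into logical register 6 -> physical register 6
rf.call()
print(rf.phys(0))   # callee reads its logical register 0 -> the same physical register 6
```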

  • Multiple Register Set

  • Overlapping MRS
    [Figure: the physical register file is partitioned into register sets; Proc A's and Proc B's logical register files map onto register sets that overlap, the overlapping registers being shared between the two procedures.]

  • Clocks per Instruction and Instruction Count: Registers
    Dimensions - width, number, and types of registers
    - Width of registers: 4, 8, 16, 32, 64, ... Wider registers improve both CPI and instruction count.
    - Number of registers: a large number of registers, in general, improves both CPI and instruction count. Logical registers: somewhere between 8 and 32 is optimal; when insufficient, the penalty is 2 loads and 2 stores per operation. Physical registers: simply the more the better, but on-chip space is limited; MRS and overlapping MRS improve CPU performance significantly.
    - Types of registers: general-purpose vs. dedicated special-purpose registers; flexibility vs. register addressing.

  • Penalty for Not Enough Registers

  • Clocks per Instruction and Instruction Count: Control Unit Implementation
    [Figure: Control Unit]

  • Clocks per Instruction and Instruction Count: Control Unit Implementation
    Hardwired control or microprogrammed control?
    - Flexibility (application adaptability, debugging, modification, tailorability, development): microprogrammed control has a definite advantage over hardwired control.
    - Speed: hardwired control is definitely faster, i.e., gives a smaller CPI.
    - Cost

  • Implementation of a Microprogrammed CPU
    [Figure: CPU block diagram with a microprogrammed control unit (CU).]

  • Clocks per Instruction and Instruction Count: Microprogrammed Control Unit
    Microprogramming made the computer-family concept feasible.
    - A computer family is a set of hardware-wise different computers that provide an identical IS. Developing a computer family with hardwired control is very expensive; with microprogramming, each of the computers in the family has a hardware-wise identical control unit with a different microprogram.
    - Microprogrammed computers have evolved into a new concept: microprogrammable computers. The control unit provides a writeable control storage so that the microprogram can be modified or replaced with a new microprogram for the user's needs (the Universal Host concept).
    - Microprogrammable computers could improve the instruction count: instructions tailored to the application can be implemented.

  • Computer Family, Universal Host...
    [Figure: different hardware; different target machines' ISAs.]

  • Further Optimization of CPI and IC
    - Further improvement in effective CPI can be made by executing multiple instructions in parallel: pipelining of instruction execution, superscalar.
    - Further improvement in IC can be made by executing a single instruction on multiple operands: vector processing (SIMD).
    - Further improvement in both CPI and IC can be made by executing a single instruction that specifies multiple operations: VLIW.

  • Reduction of CPI: Pipelining
    Reduce the effective CPI by overlapping the execution of multiple instructions in every clock cycle. Hazards degrade the pipeline performance (a small calculation follows).
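    A small, assumed-numbers illustration of how stalls pull the effective CPI away from the ideal value of 1:

```python
def effective_cpi(ideal_cpi, stalls_per_instruction):
    """Effective CPI of a pipeline: the ideal CPI plus average stall cycles per instruction."""
    return ideal_cpi + stalls_per_instruction

print(effective_cpi(1.0, 0.0))   # 1.0 - ideal pipeline: one instruction completes every clock
print(effective_cpi(1.0, 0.4))   # 1.4 - data/control hazards degrade the pipeline
```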

  • Reduction of CPI: Superscalar and Superpipeline

    Reduce the effective CPI by issuing multiple instructions for concurrent execution to the instruction-execution pipeline.
    - Superscalar - multiple copies of the pipeline stages; for m copies, the effective CPI is 1/m that of the simple pipeline (superscalar instruction execution, degree = 3).
    - Superpipeline - pipelined pipeline stages; for m stages, the effective CPI is 1/m that of the simple pipeline (superpipelined instruction execution, degree = 3).

  • Reduction of CPI: Superscalar and Superpipeline

    - Hazards are more serious, so performance-enhancement techniques are needed: register renaming for data hazards, out-of-order issue and out-of-order execution, branch prediction.
    - CPI can be reduced to less than 1 (sub-1 CPI).

  • Reduction of CPI: VLIW
    A single instruction specifies multiple concurrent operations.
    Requirements
    - A large number of data paths, to make a large number of concurrent operations possible
    - An intelligent compiler, to pack collections of operations representing a program into VLIW instructions
    Properties
    - Simple hardware implementation: no need for hardware to detect instruction parallelism, which is inherently specified by the instruction
    - Binary compatibility is absent among processors

  • Classification of Computer Architectures
    Mike Flynn's classification:
    - Instruction stream: a sequence of instructions executed by a single processor
    - Data stream: a sequence of data processed by a single processor
    - SISD, SIMD, MISD, and MIMD architectures

  • SIMD Architectures (Vector Processor or Array Processor)
    [Figure: a single control unit (CU) broadcasts one instruction stream (IS) to processing elements PE0, PE1, ..., PEn-1, each operating on its own data stream (DS0, DS1, ..., DSn-1) through shared memory modules M0, M1, ..., Mm-1.]

  • MIMD Architectures
    - Shared-Memory Multiprocessor
    - Message-Passing Multicomputer
    - Scalable Shared-Memory Multiprocessor
    - Multithreaded Architecture
    - Dataflow

  • Types of MIMD Architectures: Shared-Memory Multiprocessor
    [Figure: processors (P) and memory modules (M) connected through an interconnection network.]
    - Limitations: memory access latency, hot-spot problem, unscalable
    - Examples: Encore Multimax, Sequent Symmetry, BBN TC-2000
    - Scalable shared-memory multiprocessor: global single address space; Stanford DASH, KSR-1, IEEE SCI

  • Types of MIMD Architectures: Message-Passing Multicomputer
    [Figure: processor-memory (PM) nodes connected point-to-point through an interconnection network.]
    - Scalable
    - Limitations: communication overhead, hard to program
    - Examples: TMC CM-5, Intel Paragon, nCUBE

  • Future Trends of Processors
    Complexity
    - 50 million transistors on a 1-inch die
    - Multiple processors incorporated on a single chip
    Performance
    - Over 2,000 MIPS
    - Operation at over 250 MHz
    Architectural Features
    - 64-bit addresses and data types
    - 256-bit I/O paths to memory
    - Special-purpose processing units: vector floating-point operations; graphics, video, sound generation and processing
    - An entire personal computer or workstation on a chip

  • Future Trends of Processors
    [Figure: design space of modern processor families plotted as CPI (from 20 down to 0.1) versus clock rate (5 to 1000 MHz), with regions for scalar RISC, superscalar, superpipelined, VLIW, and vector supercomputers; the most likely future processor space lies at low CPI and high clock rate, beyond the 100 MIPS line.]

  • Architecture: Where Are We Heading?
    Parallel processing is the magic word.
    - History: the speed of traditional single-CPU computers has increased 10-fold every 5 years. But by the time a new parallel computer is developed and implemented, single-CPU computers will be just as fast.

    - Minsky's Conjecture: the speedup achievable by a parallel computer increases as the logarithm of the number of processors, so large-scale parallelism is unproductive.

    - Amdahl's Law: a small number of sequential operations can effectively limit the speedup of a parallel algorithm. If the sequentially operated portion is 10%, the maximum achievable speedup is 10, no matter how many processors the parallel computer has (see the calculation below).
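    A worked Amdahl's Law calculation matching the slide's 10% figure: the speedup approaches, but never exceeds, 1/0.10 = 10.

```python
def amdahl_speedup(sequential_fraction, processors):
    """Speedup when only the (1 - sequential_fraction) portion can be parallelized."""
    return 1.0 / (sequential_fraction + (1.0 - sequential_fraction) / processors)

for p in (10, 100, 1000, 1_000_000):
    print(p, round(amdahl_speedup(0.10, p), 2))   # 5.26, 9.17, 9.91, ~10.0
```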

  • Architecture: Where Are We Heading?
    Up to now, the class of parallel computers well-defined in the market is vector machines.

    - Many applications are being developed for them.
    - These machines are not suited for many classes of problems: machine understanding/translation, expert systems, knowledge-base applications, heuristics, ...

  • Building Block: Where Are We Heading?
    - A building block contains highly complex functions for all kinds of applications.
    - For a particular application, the unused functions are extra baggage.
    - This requires reconfiguration within the building block to mask out the unneeded functions - a reincarnation of the microprogram.