A. Sahu, Deptt. of Comp. Sc. & Engg., IIT Guwahati: Review of Computer Architecture

Computer Peripheral & Interfaces (Introduction)

Outline
- Computer organization vs. architecture
- Processor architecture
- Pipeline architecture
- Data, resource and branch hazards
- Superscalar & VLIW architecture
- Memory hierarchy
- References

Computer organization vs. architecture
- Computer organization => digital logic modules, logic and low-level design
- Computer architecture => ISA design, microarchitecture design
    Algorithms for designing the best microarchitecture, pipeline model, branch prediction strategy, memory management, etc.

Hardware abstraction
[Figure: hardware organization of a PC. CPU (PC, register file, ALU, bus interface) on the system bus; a bridge connects the system bus to the memory bus (main memory) and to the I/O bus, which carries the USB controller (mouse, keyboard), graphics adapter (display), disk controller (disk) and expansion slots for other devices such as network adapters.]

Hardware/software interface
[Figure: levels from software to hardware: C++, m/c instr, reg/adder, transistors, with the "Arch. focus" marked.]
- Instruction set architecture: lowest level visible to a programmer
- Microarchitecture: fills the gap between instructions and logic modules

Instruction Set Architecture
- Assembly language view
    Processor state (RF, mem)
    Instruction set and encoding
- Layer of abstraction
    Above: how to program the machine - HLL, OS
    Below: what needs to be built - tricks to make it run fast
[Figure: layers Application Program, Compiler, OS, ISA, CPU Design, Circuit Design, Chip Layout.]

The Abstract Machine
Programmer-visible state:
- PC: program counter
- Register file: heavily used data
- Condition codes
- Memory: byte array holding code + data and the stack
[Figure: CPU (PC, registers, condition codes) exchanging addresses, data and instructions with memory (code + data, stack).]

Instructions
- Language of the machine
- Easily interpreted
- Primitive compared to HLLs

Instruction set design goals
- Maximize performance, minimize cost, reduce design time

Instructions
- All MIPS instructions are 32 bits long; arithmetic instructions have 3 operands
- Operand order is fixed (destination first)
- Example:
    C code:    A = B + C
    MIPS code: add $s0, $s1, $s2    (registers are associated with variables by the compiler)
- Register numbers are 0 .. 31, e.g., $t0 = 8, $t1 = 9, $s0 = 16, $s1 = 17, etc.
- Instruction format (R-type), shown for add $t0, $s1, $s2:
    000000  10001  10010  01000  00000  100000
    op      rs     rt     rd     shamt  funct
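To make the field layout concrete, here is a minimal Python sketch that packs the six R-format fields into a 32-bit word. The helper names `encode_r_type` and `fields_as_bits` are illustrative and not part of the slides. The bit pattern shown above has rd = 01000 (register 8, i.e. $t0), so it corresponds to add $t0, $s1, $s2, and that is the instruction encoded here ($s2 = 18 is assumed to follow from $s0 = 16, $s1 = 17).

```python
def encode_r_type(rs, rt, rd, shamt, funct, op=0):
    """Pack R-format fields (6/5/5/5/5/6 bits) into one 32-bit word."""
    for value, width in ((op, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def fields_as_bits(word):
    """Render a 32-bit word as 'op rs rt rd shamt funct' bit groups."""
    bits = format(word, "032b")
    cuts = (0, 6, 11, 16, 21, 26, 32)
    return " ".join(bits[a:b] for a, b in zip(cuts, cuts[1:]))


# add $t0, $s1, $s2: rs = $s1 = 17, rt = $s2 = 18, rd = $t0 = 8, funct = 32 (add)
ADD_T0_S1_S2 = encode_r_type(rs=17, rt=18, rd=8, shamt=0, funct=32)
print(fields_as_bits(ADD_T0_S1_S2))
```

The printed bit groups can be compared field by field with the op / rs / rt / rd / shamt / funct row above.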

Instructions: load/store and control
- Load and store instructions
    Example:
      C code:    A[8] = h + A[8];
      MIPS code: lw  $t0, 32($s3)
                 add $t0, $s2, $t0
                 sw  $t0, 32($s3)
- Instruction format (I-type), example: lw $t0, 32($s2)
    35   18   8    32
    op   rs   rt   16-bit number
- Control example:
    C code:                MIPS code:
    if (i != j)                  beq $s4, $s5, Lab1
        h = i + j;               add $s3, $s4, $s5
    else                         j   Lab2
        h = i - j;         Lab1: sub $s3, $s4, $s5
                           Lab2: ...

What constitutes an ISA?
- Set of basic/primitive operations
    Arithmetic, logical, relational, branch/jump, data movement
- Storage structure: registers/memory
    Register-less machine, ACC-based machine, a few special-purpose registers, several general-purpose registers, large number of registers
- How addresses are specified
    Direct, indirect, base vs. index, auto-increment and auto-decrement, pre/post increment/decrement, stack
- How operands are specified
    3-address machine: r1 = r2 + r3
    2-address machine: r1 = r1 + r2
    1-address machine: Acc = Acc + x (Acc is implicit)
    0-address machine: add values on top of stack
- How instructions are encoded
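The operand-specification styles listed above can be contrasted with two toy interpreters. This is a minimal sketch, not any real ISA: the instruction names (`push`, `add`, `load`, `store`) and the function names are hypothetical and chosen only for illustration. The 0-address machine takes both operands implicitly from the top of the stack, while the 1-address machine names one operand and uses an implicit accumulator.

```python
def run_stack_machine(program):
    """0-address machine: 'add' pops two values and pushes their sum."""
    stack = []
    for instr in program:
        op = instr[0]
        if op == "push":
            stack.append(instr[1])
        elif op == "add":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return stack


def run_accumulator_machine(program, memory):
    """1-address machine: every instruction names one memory operand;
    the accumulator (Acc) is the implicit second operand."""
    acc = 0
    for op, name in program:
        if op == "load":
            acc = memory[name]
        elif op == "add":
            acc = acc + memory[name]
        elif op == "store":
            memory[name] = acc
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return memory


# z = x + y on both machines
stack_result = run_stack_machine([("push", 2), ("push", 3), ("add",)])
acc_result = run_accumulator_machine(
    [("load", "x"), ("add", "y"), ("store", "z")],
    {"x": 2, "y": 3, "z": 0},
)
print(stack_result, acc_result["z"])
```

Both programs compute the same sum; what differs is how many operands each instruction has to name, which is one of the choices that shapes instruction encoding.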

RISC vs. CISC
- RISC: uniformity of instructions, simple set of operations and addressing modes, register-based architecture with 3-address instructions
- RISC examples: virtually all new ISAs since 1982 (ARM, MIPS, SPARC, HP's PA-RISC, PowerPC, Alpha), with the CDC 6600 (1964) as an early precursor
- CISC: minimize code size, make assembly language easy
    VAX: instructions from 1 to 54 bytes long!
    Motorola 680x0, Intel 80x86

MIPS subset for implementation
- Arithmetic-logic instructions: add, sub, and, or, slt
- Memory reference instructions: lw, sw
- Control flow instructions: beq, j
Incremental changes in the design to include other instructions will be discussed later.

Design overview
- Use the program counter (PC) to supply the instruction address
- Get the instruction from memory
- Read registers
- Use the instruction to decide exactly what to do
[Figure: PC supplying the address to instruction memory; the instruction supplying register numbers to the registers; register data feeding the ALU; ALU output addressing data memory.]

Division into data path and control
[Figure: controller and data path; control signals go from the controller to the data path, status signals come back from the data path.]

Building block types
Two types of functional units:
- Elements that operate on data values (combinational)
    Output is a function of the current input; no memory
    Examples: gates (and, or, nand, nor, xor, inverter), multiplexer, decoder, adder, subtractor, comparator, ALU, array multipliers
- Elements that contain state (sequential)
    Output is a function of current and previous inputs
    State = memory
    Examples: flip-flops, counters, registers, register files, memories
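The distinction between the two building block types can be sketched in a few lines of Python. This is only a behavioral model under simplifying assumptions (32-bit wrap-around, an explicit `clock()` call standing in for a clock edge); the names `adder32`, `mux2` and `Register` are illustrative and not taken from the slides.

```python
MASK32 = 0xFFFFFFFF


def adder32(a, b):
    """Combinational: the output depends only on the current inputs."""
    return (a + b) & MASK32


def mux2(select, in0, in1):
    """Combinational 2-to-1 multiplexer."""
    return in1 if select else in0


class Register:
    """Sequential: holds state that changes only on a clock edge."""

    def __init__(self, value=0):
        self.value = value      # currently visible output
        self._next = value      # value waiting at the input

    def set_input(self, data):
        self._next = data & MASK32

    def clock(self):
        self.value = self._next


# PC update built from both kinds of element: PC <- PC + 4
pc = Register(0)
pc.set_input(adder32(pc.value, 4))
before_edge = pc.value   # still the old value: the edge has not happened yet
pc.clock()
after_edge = pc.value
print(before_edge, after_edge)
```

The adder and multiplexer return a result as soon as their inputs are given, whereas the register's output only changes when `clock()` is called, which is the "state = memory" property noted above.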

Components for MIPS subset
- Register, adder, ALU, multiplexer, register file, program memory, data memory, bit manipulation components

Components - register
[Figure: PC register with 32-bit input and output and a clock input.]

Components - adder
[Figure: 32-bit adders computing PC + 4 and (PC + 4) + offset.]

Components - ALU
[Figure: ALU with 32-bit inputs a and b, an operation input, a 32-bit result, and a = b and overflow outputs.]

Components - multiplexers
[Figure: 32-bit 2-to-1 mux with a select input, choosing between PC + 4 (input 0) and PC + 4 + offset (input 1).]

Components - register file
[Figure: register file with 5-bit register numbers (read register 1, read register 2, write register), a write data input, read data 1 and read data 2 outputs, and the RegWrite control.]

Components - program memory
[Figure: instruction memory with an instruction address input and an instruction output.]

MIPS components - data memory
[Figure: data memory with address and write data inputs, a read data output, and the MemRead and MemWrite controls.]

Components - bit manipulation circuits
[Figure: sign extend (16 -> 32 bits) and shift (32 -> 32 bits, with a 0 inserted), MSB and LSB marked.]

Datapath for add, sub, and, or, slt
Actions required:
- Fetch instruction
- Address the register file
- Pass operands to ALU
- Pass result to register file
- Increment PC
Format: add $t0, $s1, $s2
    000000  10001  10010  01000  00000  100000
    op      rs     rt     rd     shamt  funct

Fetching instruction
[Figure: PC driving the address (ad) of IM, which outputs the instruction (ins).]

Addressing RF
[Figure: ins[25-21] and ins[20-16] driving the RF read addresses rad1 and rad2.]

Passing operands to ALU
[Figure: RF outputs rd1 and rd2 feeding the ALU.]

Passing the result to RF
[Figure: ALU result driving RF write data (wd); ins[15-11] driving the write address (wad).]

Incrementing PC
[Figure: adder computing PC + 4.]

Load and store instructions
Format: I
Example: lw $t0, 32($s2)
    35   18   8    32
    op   rs   rt   16-bit number

Adding sw instruction
[Figure: datapath extended with data memory DM (ad, rd, wd), sign extension (sx) of the 16-bit ins[15-0], and a multiplexer.]

Adding lw instruction
[Figure: datapath with PC, IM, RF, ALU, DM, sx and further multiplexers; fields ins[25-21], ins[20-16], ins[15-11], ins[15-0].]

Adding beq instruction
[Figure: datapath with a shift-by-2 unit (s2), a second adder and a multiplexer for the branch target.]

Adding j instruction
[Figure: datapath with the jump address ja[31-0] formed from PC+4[31-28] and the 28-bit shifted ins[25-0].]

Control signals
[Figure: datapath annotated with the control signals Rdst, jmp, Z, MR, MW, RW, M2R, Asrc, Psrc and the 3-bit op.]

Datapath + Control
[Figure: control block driven by ins[31-26] and Actrl block driven by ins[5-0]; signals Rdst, jmp, Psrc, Z, brn, MR, MW, M2R, op, RW, Asrc, opc.]

Analyzing performance
Component delays:
    Register                       0
    Adder                          t+
    ALU                            tA
    Multiplexer                    0
    Register file                  tR
    Program memory                 tI
    Data memory                    tM
    Bit manipulation components    0

Delay for {add, sub, and, or, slt}
[Figure: path through PC, IM, RF, ALU and back to RF, with the +4 adder alongside.]

Delay for {sw}
[Figure: path through PC, IM, RF, ALU (with sx of ins[15-0]) and DM, with the +4 adder alongside.]
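The component delays above can be combined into a per-class delay estimate. The sketch below is a hedged illustration: the critical-path expressions are assumptions derived from the usual single-cycle datapath (registers, multiplexers and bit-manipulation circuits counted as zero delay, PC + 4 computed in parallel with the main path), and the numeric delays are hypothetical values in arbitrary time units, not figures from the slides.

```python
def class_delays(t_I, t_R, t_A, t_M, t_plus):
    """Assumed critical-path delay for each instruction class."""
    return {
        # IM -> RF read -> ALU -> RF write, PC + 4 in parallel
        "R-class": max(t_I + t_R + t_A + t_R, t_plus),
        # IM -> RF read -> ALU (address) -> DM read -> RF write
        "lw": max(t_I + t_R + t_A + t_M + t_R, t_plus),
        # IM -> RF read -> ALU (address) -> DM write
        "sw": max(t_I + t_R + t_A + t_M, t_plus),
        # ALU comparison vs. branch target (offset needs the fetched instruction)
        "beq": max(t_I + t_R + t_A, max(t_I, t_plus) + t_plus),
        # jump address needs ins[25-0] and PC+4[31-28]
        "j": max(t_I, t_plus),
    }


def single_cycle_period(delays):
    """Every instruction must finish in one cycle, so the slowest class
    bounds the clock period from below."""
    return max(delays.values())


# Hypothetical delays (arbitrary units), chosen only for illustration
delays = class_delays(t_I=2, t_R=1, t_A=2, t_M=2, t_plus=1)
period = single_cycle_period(delays)
print(delays, period)
```

With these assumed numbers the load class is the slowest, so it alone determines the single-cycle clock period even though the other classes would finish earlier.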

Clock period in single cycle design
[Figure: delay components tI, tR, tA, tM and t+ stacked per instruction class (R-class, lw, sw, beq, j), with the clock period marked.]

Clock period in multi-cycle design
[Figure: the same per-class delay components (R-class, lw, sw, beq, j), with the clock period marked.]

Cycle time and CPI
[Figure: CPI (low to high) against cycle time (short to long), positioning the single cycle design, multi-cycle design and pipelined design.]

Pipelined datapath (abstract)
[Figure: PC, IM, RF, ALU and DM divided into the stages IF, ID, EX, Mem, WB by the pipeline registers IF/ID, ID/EX, EX/Mem, Mem/WB.]

Fetch new instruction every cycle
[Figure: the same pipelined datapath.]

Pipelined processor design
[Figure: pipelined datapath with control, Actrl and multiplexers, plus the signals bubble, flush, PCw and IF/IDw (PCw = 0, IF/IDw = 0, bubble = 1).]

Graphical representation
[Figure: 5 stage pipeline. Stages: IF, ID, EX, Mem, WB. Actions: IM, RF, ALU, DM, RF.]

Usage of stages by instructions
[Figure: use of IM, RF, ALU, DM and RF by lw, sw, add and beq.]

Pipelining
[Figure: stages IF, D, RF, EX/AG, M, WB.]
- Faster throughput with pipelining
- Simple multicycle design:
    Resource sharing across cycles
    All instructions may not take the same number of cycles

Degree of overlap and depth
- Degree of overlap: serial, overlapped, pipelined
- Depth: shallow, deep

Hazards in Pipelining
- Procedural dependencies => control hazards
    Conditional and unconditional branches, calls/returns
- Data dependencies => data hazards
    RAW (read after write)
    WAR (write after read)
    WAW (write after write)
- Resource conflicts => structural hazards
    Use of the same resource in different stages

Data Hazards
[Figure: previous instr and current instr, each with read/write; delay = 3.]

Structural Hazards
- Use of a hardware resource in more than one cycle
- Different sequences of resource usage by different instructions
- Non-pipelined multi-cycle resources
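The first cause listed above, a resource used in more than one cycle, can be checked mechanically. The following is a minimal sketch under assumptions: an instruction is described by the resource it occupies in each successive cycle, and the usage sequence `["A", "B", "A", "C"]` is a hypothetical example. The function name `conflicting_issue_distances` is illustrative; such distances are often called forbidden latencies.

```python
def conflicting_issue_distances(usage):
    """usage[i] is the resource an instruction occupies in its i-th cycle.

    Returns the issue distances d (in cycles) at which a second instruction
    with the same usage pattern, started d cycles after the first, would
    need some resource in the same cycle as the first instruction.
    """
    conflicts = set()
    for d in range(1, len(usage)):
        # In absolute cycle i + d the first instruction uses usage[i + d]
        # while the second (started d cycles later) uses usage[i].
        if any(usage[i + d] == usage[i] for i in range(len(usage) - d)):
            conflicts.add(d)
    return conflicts


# Hypothetical instruction that needs resource A in its 1st and 3rd cycle
reused = conflicting_issue_distances(["A", "B", "A", "C"])
# Each resource used in exactly one cycle: no structural hazard
clean = conflicting_issue_distances(["A", "B", "C", "D"])
print(reused, clean)
```

In the first case a following instruction cannot be issued two cycles later without a stall, while the second pattern allows a new instruction every cycle.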

[Figure: example resource usage sequences for the three cases: A B A C (repeated), A B C D vs. A C B D, and F D X X (repeated).]
Caused by resource conflicts

Control Hazards
[Figure: branch instr, next inline instr and target instr, with cond eval and target addr gen; delays of 5 and 2 are marked.]
- The order of cond eval and target addr gen may be different
- Cond eval may be done in the previous instruction

Pipeline Performance
    CPI  = 1 + (S - 1) * b
    Time = CPI * T / S
    S: number of pipeline stages
    T: time to pass through all S stages (so the cycle time is T / S)
    b: frequency of interruptions

Improving Branch Performance
- Branch elimination: replace branch with other instructions
- Branch speed up: reduce time for computing CC and TIF
- Branch prediction: guess the outcome and proceed, undo if necessary
- Branch target capture: make use of history

Branch Elimination
- Use conditional instructions (predicated execution)
[Figure: flow with condition C (T/F) guarding statement S, written C : S.]
    With branch:              With conditional instruction:
    OP1                       OP1
    BC CC = Z, + 2            ADD R3, R2, R1, NZ
    ADD R3, R2, R1            OP2
    OP2

Branch Speed Up: early target address generation
- Assume each instruction is a branch
- Generate target address while decoding
- If target is in the same page, omit translation
- After decoding, discard target address if not a branch
[Figure: timing for BC with IF IF IF, D, AG and TIF TIF TIF.]

Branch Prediction
- Treat conditional branches as unconditional branches / NOP
- Undo if necessary
Strategies:
- Fixed (always guess inline)
- Static (guess on the basis of instruction type)
- Dynamic (guess based on recent history)

Static Branch Prediction
- Total correct: 68.2% (the per-type breakdown appears in the table at the end of these notes)

Branch Target Capture
- Branch Target Buffer (BTB): instr addr | pred stats | target addr
- Target Instruction Buffer (TIB): instr addr | pred stats | target instr
- Probability of target change < 5%

BTB Performance
    BTB lookup    decision        result          delay
    miss (0.4)    go inline       inline (0.8)    0
                                  target (0.2)    5
    hit  (0.6)    go to target    inline (0.2)    4
                                  target (0.8)    0
    Effective delay = 0.4*0.8*0 + 0.4*0.2*5 + 0.6*0.2*4 + 0.6*0.8*0 = 0.88

Compute/fetch scheme (no dynamic branch prediction)
[Figure: I-cache, IFAR, instruction fetch address, adder for the next sequential address, compute BTA; address sequence A, I, I + 1, I + 2, I + 3 and BTI, BTI+1, BTI+2, BTI+3.]

BTAC scheme
[Figure: as above, with a BTAC holding BA and BTA entries supplying the BTA.]

ILP in VLIW processors
[Figure: cache/memory, fetch unit, a single multi-operation instruction issued to several FUs sharing a register file.]

ILP in Superscalar processors
[Figure: cache/memory, fetch unit, multiple instructions from a sequential stream of instructions, decode and issue unit, several FUs sharing a register file. Legend: instruction/control, data, FU = functional unit.]

Why Superscalars are popular?
- Binary code compatibility among scalar & superscalar processors of the same family
- Same compiler works for all processors (scalars and superscalars) of the same family
- Assembly programming of VLIWs is tedious
- Code density in VLIWs is very poor - instruction encoding schemes

Hierarchical structure
[Figure: CPU followed by successive memory levels. Size: smallest to biggest; cost/bit: highest to lowest; speed: fastest to slowest.]

Data transfer between levels

- Unit of transfer = block
- Access outcomes: hit, miss

Principle of locality & Cache Policies
- Temporal locality: references repeated in time
- Spatial locality: references repeated in space
    Special case: sequential locality
Cache policies:
- Read: sequential / concurrent; simple / forward
- Load: block load / load forward / wrap around
- Replacement: LRU / LFU / FIFO / random

Load policies
[Figure: 4 AU block (AUs 0 to 3) with a cache miss on AU 1, comparing block load, load forward and fetch bypass (wrap around load).]

Fetch Policies
- Demand fetching: fetch only when required (miss)
- Hardware prefetching: automatically prefetch next block
- Software prefetching: programmer decides to prefetch
    Questions: how much ahead (prefetch distance), how often

Write Policies
- Write hit: write back / write through
- Write miss: write back / write through, with write allocate / with no write allocate

Cache Types
- Instruction | Data | Unified | Split
    Split vs. unified: split allows specializing each part; unified allows best use of the capacity
- On-chip | Off-chip
    On-chip: fast but small; off-chip: large but slow
- Single level | Multi level

References
- Patterson, D. A.; Hennessy, J. L. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, 2000.
- Sima, D.; Fountain, T.; Kacsuk, P. Advanced Computer Architectures: A Design Space Approach. Pearson Education, 1998.
- Flynn, M. J. Computer Architecture: Pipelined and Parallel Processor Design. Narosa Publishing, India, 1999.
- Hennessy, J. L.; Patterson, D. A. Computer Architecture: A Quantitative Approach, 2nd ed. Morgan Kaufmann, 2001.

Thanks

Static branch prediction: accuracy by instruction type
    Instr       % of branches   Guess    Branch taken   Correct
    uncond      14.5            always   100%           14.5%
    cond        58              never    54%            27%
    loop        9.8             always   91%            9%
    call/ret    17.7            always   100%           17.7%
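The Correct column of the static prediction table can be reproduced with a short calculation. This sketch assumes that the "%" column is each type's share of all branch instructions and that the "Branch" column is the fraction of those branches that are taken; under that reading an "always" guess is right when the branch is taken and a "never" guess is right when it is not. The names `ROWS` and `correct_share` are illustrative.

```python
ROWS = [
    # (type, share of all branches in %, guess, fraction taken)
    ("uncond", 14.5, "always", 1.00),
    ("cond", 58.0, "never", 0.54),
    ("loop", 9.8, "always", 0.91),
    ("call/ret", 17.7, "always", 1.00),
]


def correct_share(share, guess, taken):
    """Percentage points of all branches guessed correctly for one type."""
    if guess == "always":
        return share * taken
    if guess == "never":
        return share * (1.0 - taken)
    raise ValueError(f"unknown guess {guess!r}")


per_type = {name: correct_share(share, guess, taken)
            for name, share, guess, taken in ROWS}
total = sum(per_type.values())

for name, value in per_type.items():
    print(f"{name:9s} {value:6.2f}")
print(f"total     {total:6.2f}")
```

The unrounded per-type values sum to about 67.8; the 68.2% total quoted in the slides matches the sum of the rounded table entries (14.5 + 27 + 9 + 17.7).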
