Linear Pipeline Collision Vector Analysis


  • 7/29/2019 Linear Pipeline Collision Vector Analysis

    1/23

    Pipeline Hazards Architecture of Parallel Computers 1

    Collision Analysis

    Assume we could implement an on-chip cache and get the cache access time down to 1 clock, but implement it as a unified cache.

    Our new pipeline is:

    Our new reservation table is:

    Clock       1  2  3  4  5  6  7
    Memory Op   X        X     X
    Inst Dec.      X
    Addr Gen          X
    Execute                X
    Update PC                    X

    And the serial execution time is 7 x 5 ns = 35 ns.

    How often can we initiate an instruction with this configuration?

    Instruction Decode -- 5 ns

    Address Generate -- 5 ns

    Operand Fetch -- 5 ns

    Execute -- 5 ns

    Operand Store -- 5 ns

    Update Program Counter -- 5 ns

    Instruction Fetch -- 5 ns


    1997, 1999 E.F. Gehringer, G.Q. Kenney CSC 506, Summer 1999 2

    The Collision Vector

    As the pipeline becomes more complicated, we can use a collision vector to analyze the pipeline and control initiation of execution. The collision vector is a method of analyzing how often we can initiate a new operation into the pipeline and maintain synchronous flow without collisions.

    We construct the collision vector by overlaying two copies of the reservation table, successively shifting one copy one clock to the right, and recording whether or not a collision occurs at each step. If a collision occurs, record a 1 bit; if a collision does not occur, record a 0 bit.

    For example, our reservation table would result in the following collision vector:

    Collision vector = 011010
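The overlay-and-shift construction can be sketched in a few lines of Python. This is an illustration, not part of the original notes; the stage-to-clock assignments below are assumptions consistent with the unified-cache pipeline (the Memory Op unit busy on clocks 1, 4, and 6) and with the vector 011010.

```python
# Reservation table: stage name -> set of clocks on which the stage is busy.
# (Assumed placement: Memory Op handles Inst Fetch, Operand Fetch, Op Store.)
table = {
    "Memory Op": {1, 4, 6},
    "Inst Dec.": {2},
    "Addr Gen":  {3},
    "Execute":   {5},
    "Update PC": {7},
}

def collision_vector(table, length):
    """Bit i (left to right, i = 1..length-1) is 1 if initiating a new
    operation i clocks after the previous one collides in some stage."""
    bits = []
    for shift in range(1, length):
        collides = any(
            (clock + shift) in clocks
            for clocks in table.values()
            for clock in clocks
        )
        bits.append("1" if collides else "0")
    return "".join(bits)

print(collision_vector(table, 7))  # → 011010, matching the text
```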

    Using the collision vector, we construct a reduced state diagram to tell us when we can initiate new operations.

    The Reduced State Diagram

    The reduced state diagram is a way to determine when we can initiate a new operation into the pipeline and avoid collisions when some operations are already in process in the pipeline.


    Steps to create the reduced state diagram:

    Shift the collision vector left one position, filling in a 0 at the right end.

    If the left-most bit shifted out is a 1, you cannot initiate a new operation into the pipeline.

    If the left-most bit shifted out is a 0, you can initiate a new operation into the pipeline. Create a new state with a collision vector that is the shifted collision vector ORed with the original pipeline collision vector.

    Draw an arc to the new collision vector and label it with the number of shifts from the previous vector.

    Following is the resulting reduced state diagram:

    Note: Some texts reverse this notation, building the collision vector from right to left and shifting the vector right to determine when to initiate a new operation.

    [State diagram: initial state 011010. Latency 1 leads to state 111110, and latency 4 leads to state 111010, which loops back to itself on latency 4. Both states return to 011010 on latency 6.]
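The shift-and-OR procedure above can also be sketched as a small Python program. This is an illustrative reconstruction, not from the original notes; it follows the notes' convention of numbering bits left to right and shifting left with 0 fill.

```python
def successors(state, initial, n):
    """All (latency, next-state) arcs out of `state` for latencies 1..n."""
    arcs = []
    for latency in range(1, n + 1):
        if (state >> (n - latency)) & 1:     # bit `latency` is 1: collision
            continue
        shifted = (state << latency) & ((1 << n) - 1)  # shift left, keep n bits
        arcs.append((latency, shifted | initial))      # OR in the initial CV
    return arcs

def reduced_state_diagram(cv):
    """Map each reachable state (as a bit string) to its outgoing arcs."""
    n = len(cv)
    initial = int(cv, 2)
    table, work = {}, [initial]
    while work:
        s = work.pop()
        if s in table:
            continue
        table[s] = successors(s, initial, n)
        work.extend(t for _, t in table[s])
    return {format(s, f"0{n}b"): [(l, format(t, f"0{n}b")) for l, t in arcs]
            for s, arcs in table.items()}

for state, arcs in reduced_state_diagram("011010").items():
    print(state, "->", arcs)
```

Running this on 011010 reproduces the three states of the diagram above (011010, 111110, 111010) and their arcs.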


    The reduced state diagram tells us that we can initiate a new operation into the pipeline one cycle after we initiated one in an empty pipe. However, this brings us to a state from which we cannot safely initiate another operation until 6 more clock periods have passed.

    Since we can initiate a second instruction on the next clock period but must wait six clock periods before we can initiate another instruction, we can initiate only two instructions every seven clock periods. We get only 2/7 of 100% efficiency (ideal speedup of 7), so our speedup is only 2 for the seven-stage pipeline.

    An alternative would be to wait 4 cycles after the initial initiation, and then initiate a new operation every 4 cycles. But this would give us a speedup of only 7(0.25) = 1.75.


    Improving the speedup

    One way to improve this situation is to insert delays at appropriate points in the pipeline. Stone goes to great lengths to analyze where to insert the delays. As an example, if we add a delay in the pipeline after the Execute stage, we get:

    Clock       1  2  3  4  5  6  7  8
    Memory Op   X        X        X
    Inst Dec.      X
    Addr Gen          X
    Execute                X
    Delay                     X
    Update PC                       X

    And our new collision vector is:

    Collision vector = 0010010


    The new reduced state diagram follows.

    Note that all states have an arc back to the beginning state with 7 clocks, in addition to those noted.

    We can now look for movements from state to state that would improve our pipeline speedup. If we took the greedy cycle, we could initiate 3 operations out of every 9 cycles for a speedup of (3/9) × 7 = 2.33. However, if we did not take the first possible initiation and waited 2 cycles, we would get into the 2, 5, 2 cycle and also initiate an operation 3 out of every 9 cycles. There appears to be one other 3-out-of-9 cycle, but none better.
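As a rough check of the greedy-cycle claim, the greedy policy (always take the first permissible initiation) can be simulated directly from the collision vector. This is an illustrative sketch, not part of the original notes, and the greedy cycle is not always the optimal one.

```python
def greedy_cycle(cv):
    """Follow the greedy policy (always the smallest permissible latency)
    from the initial state and return the latency cycle it settles into."""
    n = len(cv)
    initial = int(cv, 2)
    state, seen, lats = initial, {}, []
    while state not in seen:
        seen[state] = len(lats)
        # First 0 bit (left to right) = smallest collision-free latency.
        lat = next(i for i in range(1, n + 1)
                   if (state >> (n - i)) & 1 == 0)
        state = ((state << lat) & ((1 << n) - 1)) | initial
        lats.append(lat)
    return lats[seen[state]:]

print(greedy_cycle("0010010"))  # → [1, 1, 7]: 3 initiations every 9 clocks
print(greedy_cycle("011010"))   # → [1, 6]: 2 initiations every 7 clocks
```

The first result matches the 2.33 speedup computed above for the delayed pipeline; the second matches the speedup of 2 found earlier for the undelayed one.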

    [State diagram: initial state 0010010. From 0010010: latency 1 → 0110110, latency 2 → 1011010, latency 4 → 0110010, latency 5 → 1010010. From 0110110: 1 → 1111110, 4 → 1110010. From 1011010: 2 → 1111010, 5 → 1010010. From 1110010: 4 → 0110010, 5 → 1010010. From 1111010: 5 → 1010010. From 1010010: 2 → 1011010, 4 → 0110010, 5 → 1010010 (self-loop). From 0110010: 1 → 1110110, 4 → 0110010 (self-loop), 5 → 1010010. From 1110110: 4 → 1110010. All states return to 0010010 on latency 7.]


    Other Pipeline Hazards

    Pipeline collisions occur when there is contention for shared hardware that is needed by more than one stage of a pipeline. Potential collisions prevent us from initiating (and thus completing) a new operation every clock period, and so slow down the effective execution rate of a processor.

    Other hazards that can prevent us from completing an instruction every clock period are:

    Conditional Branches

    Data dependencies

    Conditional Branches (Jumps)

    A conditional branch changes the location from which we are fetching instructions. A conditional branch instruction must execute before we know which location to fetch subsequent instructions from.

    Example Instruction Stream

    ------- ; Instruction

    Cmp A, B ; Compare A to B

    BE NewLoc ; Branch on condition code = 0 to NewLoc

    ------- ; Instruction

    ------- ; Instruction

    ------- ; Next Sequential Instruction (NSI)

    ------- ; Instruction

    NewLoc ------- ; Instruction

    ------- ; Instruction


    Reservation Table Analysis

    Assume we have the following reservation table:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  X    X
    Inst Dec.             X
    Addr Gen                   X
    Data Fetch                      X    X
    Execute                                 X
    Op Store                                      X

    We can show successive instruction execution through the pipeline by indicating the instruction in each cell. Here, I will use:

    CC to indicate the instruction that sets the condition code.

    BR to indicate the branch condition instruction.

    NSI to indicate the next sequential instruction after the branch.

    2SI to indicate the 2nd sequential instruction after the branch, etc.

    BT to indicate the branch target instruction.

    Following would be the instruction sequence for a branch not taken:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  CC   CC   BR   BR   NSI  NSI  2SI  2SI  3SI  3SI  4SI  4SI
    Inst Dec.             CC        BR        NSI       2SI       3SI
    Addr Gen                   CC        BR        NSI       2SI       3SI
    Data Fetch                      CC   CC   BR   BR   NSI  NSI  2SI  2SI
    Execute                                 CC        BR        NSI
    Op Store                                      CC        BR        NSI


    Following would be the instruction sequence for a branch taken:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  CC   CC   BR   BR   NSI  NSI  2SI  2SI  3SI  3SI  BT   BT
    Inst Dec.             CC        BR        NSI       2SI       wait
    Addr Gen                   CC        BR        NSI       wait      wait
    Data Fetch                      CC   CC   BR   BR   NSI  NSI  wait wait
    Execute                                 CC        BR        wait
    Op Store                                      CC        BR        wait

    We have taken a penalty of 6 clock cycles because we assumed that we were going to be executing sequential instructions. We started these instructions into the pipeline, only to find that we had to abort executing them because the conditional branch was taken.

    The assumption here is that we know the outcome of the branch instruction at the end of its execute cycle, and so we can stop further execution of the sequential instructions following the branch. The new program counter gets sent to the Instruction Fetch unit during the Operand Store cycle of the branch instruction, so it can begin to fetch the branch target instruction and succeeding instructions on the next cycle.

    Reducing Branch Penalties

    We can use several methods to reduce the effects of branching:

    Delayed Branch Instruction

    Multiple Condition Codes (discussed with data dependencies)

    Branch Prediction with and without Branch History

    Speculative Execution


    Delayed Branch Instruction

    We can push some of the problem back on the programmer (or compiler) by designing a new branch instruction that telegraphs an intent to branch:

    Branch Condition after executing the Next Sequential Instruction.

    The instruction sequence for a branch not taken, using this new branch instruction (BA), is identical to that for a normal branch:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  CC   CC   BA   BA   NSI  NSI  2SI  2SI  3SI  3SI  4SI  4SI
    Inst Dec.             CC        BA        NSI       2SI       3SI
    Addr Gen                   CC        BA        NSI       2SI       3SI
    Data Fetch                      CC   CC   BA   BA   NSI  NSI  2SI  2SI
    Execute                                 CC        BA        NSI
    Op Store                                      CC        BA        NSI

    However, the instruction sequence for a branch taken, using the new branch-after-next-instruction, would save us two clocks:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  CC   CC   BA   BA   NSI  NSI  2SI  2SI  3SI  3SI  BT   BT
    Inst Dec.             CC        BA        NSI       2SI       wait
    Addr Gen                   CC        BA        NSI       wait      wait
    Data Fetch                      CC   CC   BA   BA   NSI  NSI  wait wait
    Execute                                 CC        BA        NSI
    Op Store                                      CC        BA        NSI

    Our penalty is now only 4 clock cycles instead of 6, because we followed through and completed execution of NSI (per the definition of the delayed branch instruction). We had to abort only 2SI and beyond as a result of the conditional branch taken.


    Branch Prediction

    We can make a better guess about whether or not a branch will be taken, rather than just always assuming it will not be taken.

    Assume that a special end-of-loop branch instruction is usually taken.

    Assume that a branch to a location earlier in the code will usually be taken.

    Keep a history table of how this particular branch instruction behaved in the recent past.

    Some processors define special instructions to be used to terminate a loop.

    For example, BXLE (branch on index low or equal) combines decrementing an index register with a branch on condition. The processor can safely assume that whenever it fetches a BXLE instruction, the branch will normally be taken. This can be determined back at the Instruction Decode step. Note that the unconditional branch is a special case of this, in that it will always be taken.

    A conditional branch to an earlier address can be determined at the Address Generate stage.

    However, note that we are making an educated guess. Even when we guess correctly, we take some penalty. The instruction sequence for a branch taken, when we predict that it will be taken:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  CC   CC   BR   BR   NSI  NSI  BT   BT   NT   NT   2T   2T
    Inst Dec.             CC        BR        wait      BT        NT
    Addr Gen                   CC        BR        wait      BT        NT
    Data Fetch                      CC   CC   BR   BR   wait wait BT   BT
    Execute                                 CC        BR        wait
    Op Store                                      CC        BR        wait


    Branch History

    Rather than depending on special instructions and branch target locations, we can keep a history of how each particular branch instruction behaved in the recent past, and assume that it will continue to behave that way in the future. Some implementations are:

    The branch-history table (Stone page 196):

    The instruction fetch unit searches a branch-history table (BHT), similar to a TLB, on every instruction fetch. If we have a hit, use the corresponding address in the BHT for the next fetch instead of the real NSI.

    At the execute stage of the branch, update the BHT with the actual target (NSI or BT).

    Of course, we need to keep track of which way we predicted, and abort instructions on mispredictions.
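A toy sketch of the branch-history-table idea may help. The dictionary structure and the addresses below are hypothetical illustrations, not Stone's exact design (a real BHT is a small associative hardware structure, not a software map):

```python
bht = {}  # branch address -> address the branch last transferred control to

def predict(fetch_addr, default_next):
    """At instruction fetch: on a BHT hit, fetch from the recorded target
    instead of the real next sequential instruction."""
    return bht.get(fetch_addr, default_next)

def resolve(branch_addr, actual_target):
    """At the execute stage of the branch: record where it actually went."""
    bht[branch_addr] = actual_target

# E.g. a loop-closing branch at (hypothetical) address 0x100 targeting 0x40:
resolve(0x100, 0x40)
print(hex(predict(0x100, 0x104)))  # hit: predict the recorded target, 0x40
print(hex(predict(0x200, 0x204)))  # miss: fall through to NSI, 0x204
```

Mispredictions are handled exactly as the text says: the prediction is compared against the resolved target at execute time, and the wrongly fetched instructions are aborted.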

    Decode-history table (similar to Stone page 196):

    The instruction decode unit searches a decode-history table (DHT) when it encounters a conditional branch instruction. If we have a hit, redirect the instruction fetch unit to abort NSI and give it the BT address from the branch instruction.

    At the execute stage of the branch, add (or keep) a DHT entry for this branch when the branch is taken. Delete the DHT entry for this branch (if it exists) when the branch is not taken.

    Note that we always abort the prefetch of NSI on predicted taken branches.


    Extra bits in the Instruction Cache

    For a processor with a fixed-length instruction set and a Harvard cache, we can organize the instruction cache so that we add an extra bit or two to each instruction (in the cache) and use them to keep a history on branch instructions. This works the same as the decode-history table, without the time and logic for the lookup.

    When a cache line is loaded from main memory, all branch indicator bits (BIB) for the line are set to 00.

    When a branch is taken, increment the corresponding BIB.

    When a branch is not taken, decrement the corresponding BIB.

    When the instruction is fetched:

    Use NSI if the BIB is 00 or 01.

    Use BT if the BIB is 10 or 11.
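The four-state scheme above is a 2-bit saturating counter. A minimal sketch (illustrative only; function names are mine, not from the notes):

```python
def update(bib, taken):
    """2-bit saturating counter: count up on taken, down on not taken."""
    return min(bib + 1, 3) if taken else max(bib - 1, 0)

def use_branch_target(bib):
    """Fetch from BT when the counter is 10 or 11; otherwise use NSI."""
    return bib >= 2

bib = 0b00  # cache line just loaded from memory: strongly not taken
for taken in [True, True, False, True]:
    bib = update(bib, taken)
print(format(bib, "02b"), use_branch_target(bib))  # → 10 True (weakly taken)
```

The saturation at 00 and 11 is what makes the predictor tolerate a single anomalous outcome (e.g. the final fall-through of a loop) without immediately flipping its prediction.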

    [Figure: an instruction cache in which each instruction carries a 2-bit branch indicator; the branch entries shown hold values 01, 11, 10, and 00, while other instructions hold 00. Legend: 00 = strongly not taken, 01 = weakly not taken, 10 = weakly taken, 11 = strongly taken.]


    Speculative Execution

    The brute-force approach: provide enough logic in the processor to:

    Replicate the first several stages of the pipeline.

    Always follow both paths of execution (branch taken and branch not taken).

    When the outcome of the branch is known, discard the intermediate results of the wrong path(s) and continue execution with the correct path.

    For deep pipelines, the processor must be prepared to follow several paths in order to keep things moving along.

    Stone (page 197) says that these mechanisms have not been widely used in practice (as of 1986). In fact, they have since become very popular as a way to speed up execution of modern processors.

    Note: some literature defines speculative execution to mean performing any processing steps before you know the outcome of a conditional branch. That is, if there is any chance that you may need to discard the intermediate results of an instruction, it is defined as speculative execution. We will not use this definition.


    Data Dependencies

    An instruction may be stalled in the pipeline because it needs data that has not yet been produced by a prior instruction that is still in the pipeline.

    The data dependencies among instructions can take the following forms:

    READ/READ: one instruction reads a data item and a following instruction reads the same data item.

    READ/WRITE: one instruction reads a data item and a following instruction writes that same data item.

    WRITE/READ: one instruction writes a data item and a following instruction reads that same data item.

    WRITE/WRITE: one instruction writes a data item and a following instruction writes that same data item.

    The READ/READ combination is not a problem with pipelines because the data item does not change. However, the other three combinations can all produce invalid results unless we detect and interlock on them. We deal first with the WRITE/READ combination, and defer the others to a later discussion on superpipelined machines.
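The four combinations can be sketched as a small classifier over the register sets each instruction reads and writes. This is an illustration only; the register names below come from the WRITE/READ example that follows.

```python
def dependencies(first, second):
    """first/second: (reads, writes) register sets, in program order.
    Returns every dependence class the pair exhibits."""
    r1, w1 = first
    r2, w2 = second
    deps = []
    if r1 & r2: deps.append("READ/READ")    # harmless in a pipeline
    if r1 & w2: deps.append("READ/WRITE")   # anti-dependence
    if w1 & r2: deps.append("WRITE/READ")   # true dependence
    if w1 & w2: deps.append("WRITE/WRITE")  # output dependence
    return deps

# R2 <- R3 + R4 followed by R5 <- R2 + R4: a WRITE/READ hazard on R2
s2 = ({"R3", "R4"}, {"R2"})
u2 = ({"R2", "R4"}, {"R5"})
print(dependencies(s2, u2))  # → ['READ/READ', 'WRITE/READ']
```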

    WRITE/READ

    Consider the following sequence of instructions:

    -------         ; Instruction
    -------         ; Instruction
    -------         ; Instruction
    R2 ← R3 + R4    ; Store Register 2
    R5 ← R2 + R4    ; Use Register 2


    The reservation table, where S2 is the instruction that stores a new value into register 2, and U2 is the instruction that uses the new value in register 2:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  S2   S2   U2   U2   NSI  NSI  2SI  2SI  3SI  3SI
    Inst Dec.             S2        U2        NSI       2SI
    Addr Gen                   S2        U2        NSI       2SI
    Data Fetch                      S2   S2   wait wait U2   U2   NSI  NSI
    Execute                                 S2                  U2
    Op Store                                      S2                  U2

    The data fetch unit must detect that the value in register 2 that it needs is pending update by a prior instruction that has not yet completed. It must wait until the new value has been stored into register 2 by the Operand Store unit. The penalty is the 2 cycles that we had to stall the pipeline.

    Internal Forwarding and Register Renaming

    A way to reduce the penalty due to data dependencies is to forward the results of a computation directly to the data fetch unit or to the execute unit, rather than waiting for the data to be stored into the proper register.

    If we forward the results of the addition in instruction S2 to the data fetch unit, we reduce the data interlock penalty to one cycle.

    If we forward the results directly to the execute unit, we can eliminate the penalty altogether.

    The data is really available when we need it; it is just not in the right place. We rename the input register for the next operation from register R2 to the register where the computation results will appear. Note that the Operand Store unit still needs to put the results into register 2 as well.
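A back-of-the-envelope sketch of the three cases may help. The clock numbers are taken from the reservation tables in this discussion (U2 would naturally begin its Data Fetch in clock 7); the dictionary keys are just descriptive labels.

```python
# Clock at which the new R2 value becomes usable by U2's data fetch:
r2_ready = {
    "no forwarding":         9,  # only after S2's Operand Store in clock 8
    "forward to data fetch": 8,  # as soon as S2's Execute result (clock 7) exists
    "forward to execute":    7,  # Execute receives it directly; U2 never waits
}

u2_fetch_start = 7  # U2's natural Data Fetch clock

for scheme, ready in r2_ready.items():
    stall = max(0, ready - u2_fetch_start)
    print(f"{scheme}: {stall} stall cycle(s)")
```

This reproduces the penalties shown in the tables: 2 stall cycles with no forwarding, 1 with forwarding to the data fetch unit, and 0 with forwarding directly to the execute unit.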


    The new reservation table if we forward the results to the data fetch unit:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  S2   S2   U2   U2   NSI  NSI  2SI  2SI  3SI  3SI  4SI
    Inst Dec.             S2        U2        NSI       2SI       3SI
    Addr Gen                   S2        U2        NSI       2SI
    Data Fetch                      S2   S2   wait U2   U2   NSI  NSI  2SI
    Execute                                 S2             U2        NSI
    Op Store                                      S2             U2

    The reservation table if we forward the results directly to the Execute unit:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  S2   S2   U2   U2   NSI  NSI  2SI  2SI  3SI  3SI  4SI  4SI
    Inst Dec.             S2        U2        NSI       2SI       3SI
    Addr Gen                   S2        U2        NSI       2SI       3SI
    Data Fetch                      S2   S2   U2   U2   NSI  NSI  2SI  2SI
    Execute                                 S2        U2        NSI
    Op Store                                      S2        U2        NSI

    The Condition Code Dependency

    Another type of data dependency is that between an instruction that generates a condition code setting and the branch instruction that uses the condition code.

    Internal forwarding can again be used to reduce or eliminate delays.

    Another variant of the branch after NSI, called multiple condition codes, puts the problem back on the programmer (or compiler). Multiple condition codes make it easier for the programmer to have intervening instructions between the instruction that generates the CC and the branch instruction that uses it.


    Superscalar Architectures

    Up to now, we have been discussing computer architectures with a single pipeline for processing instructions. The objective was to complete one instruction per clock period by breaking the instructions into (approximately) equal pieces of work and pipelining them through the processor in serial fashion. However, all of the hazards prevent us from ever achieving a processing rate of 1 instruction per clock.

    Given the circuit density we have today, we can replicate many of the pipeline units and process instructions in parallel, so long as we ensure that we produce results that are indistinguishable from those obtained if we executed the code in a strictly sequential fashion.

    This brings us back to data dependencies. We must now consider the READ/WRITE and WRITE/WRITE sequences, because one instruction may get ahead of another through the parallel pipelines.

    [Figure: a superscalar organization. An I-cache supplies instructions to four parallel pipelines, each with Decode, Op Fetch, Execute, and Store Results stages; three of the execute units are fixed point and one is floating point.]


    Consider the following sequence of instructions:

        -------         ; Instruction
        -------         ; Instruction
        -------         ; Instruction
        R3 ← R2 + R4    ; Use Register 2
        R2 ← R5 + R4    ; Store Register 2

    We must interlock on register 2 to ensure that the new value (R5 + R4) does not get stored into it before we obtain the old value to add to R4.

    And the following sequence of instructions:

        -------         ; Instruction
        -------         ; Instruction
        -------         ; Instruction
        Cmp A, B        ; Compare A to B
        BE NewLoc       ; Possible branch
        R2 ← R5 + R4    ; Store Register 2
        R2 ← R3 + R4    ; Store Register 2

    We must ensure that the second value of R2 gets stored if the branch is not taken.


    Extra Internal Registers

    When we have multiple pipelines and speculative execution in the processor, it is beneficial to have several extra sets of registers to keep intermediate results.

    Several paths are being followed due to speculative execution.

    Parallel execution is proceeding along each serial path.

    Many intermediate results are being forwarded to other instructions.

    Many tentative final results must be held until the final outcome is known.

    Retiring Instructions

    When the final outcome of a series of branches and data dependencies is known, the winning instruction is retired.

    Its tentative results are marked final.

    Any data in a renamed register is stored into the real named register.

    All other tentative instructions and results (the losers) are discarded, and any resources held are made available for processing new instructions.

    Only the retired instructions count toward the processing rate (the MIPS) of the processor.

    The objective of the computer architect is to retire more than one instruction per clock period.


    CISC versus RISC (Stone page 210)

    CISC: Complex Instruction Set Computer

    RISC: Reduced Instruction Set Computer

    CISC Architectures

    Traditional processor architectures (e.g. IBM S/360, Intel 8086) use variable-length instructions and provide variations on basic instructions with several addressing modes.

    8086 example:

    Instructions can vary in length from 1 to 12 bytes long.

    There are 14 variations of the integer ADD instruction.

    There are 14 variations of the integer ADD with Carry instruction.

    There are 14 variations of the integer SUB (subtract) instruction.

    There are about 100 different instructions.

    There are four different prefixes that can modify instructions.

    This gives a lot of flexibility to the programmer and compiler writer, but causes many problems for the computer architect.


    RISC architectures

    RISC architectures attempted to make life easy for the computer architect by drastically simplifying the instruction set.

    John Cocke (IBM) reasoned that only compilers generate machine code, and so making life easier for the assembly language programmer should not be an objective.

    Example:

    Make all instructions four bytes long and aligned on a word boundary.

    Make lots of general-purpose registers so that most intermediate data can be held in the fast processor storage.

    Make all arithmetic instructions register-to-register addressing only.

    All instructions execute in a single clock.

    Add instructions to help the CPU architect make a fast processor.

    Over time, the CISC architectures have adopted RISC techniques and the RISC architectures have added CISC instructions.

    Today, the only real difference between the two is that CISC processors still have variable-length instructions and RISC processors have fixed-length instructions.


    Superpipelined Architecture (Stone page 218)

    In the discussion on superscalar architectures, Stone describes a superpipelined architecture as one where the internal clock for issuing instructions is N times faster than the main clock.

    Virtually all processors today are superpipelined: the internal clock runs faster than the external bus clock.

    VLIW: Very Long Instruction Word Architecture (Stone page 219)

    VLIW is typically called microcode, and the machine architectures are not general-purpose. They may be used in graphics processors, hard disk controllers, or other dedicated function units.

    The advantage of a VLIW architecture is that the fields of the instruction directly control the hardware latches and gates, and thus can directly perform multiple functions in parallel. Normally, engineers program the microcontrollers, and the programs are relatively short.

    VLIW microcontrollers were formerly used to implement the complex instructions of CISC-architecture machines.