1
Implementing An Associative Processor on FPGAs
2
A Conceptual View of the KSU ASC Model
[Figure: four Memory/PE pairs attached to the Cell Interconnection Network, together with the Instruction Stream Control Unit.]
3
An Example of the Data Memory Organization: Auto Information Stored in the PE Cells
PE    ID  Color     Model   State  Rebate  …
PE0   1   Burgundy  Focus   OH     170     …
PE1   2   Blue      Taurus  OH     160     …
PE2   3   Blue      Focus   OH     190     …
PE3   4   Red       Focus   PA     180     …
4
The Prototype of the Byte-serial ASC Processor
[Figure: block diagram with the following blocks: IS Control Unit; CPU (for sequential and parallel instructions); 32-bit Instruction Memory; Data Memory; PE Array; Associative Processing Array; Responder Resolution Circuitry; MAX/MIN Circuitry; 16 8-bit Common Registers.]
5
Prototype of the 4-PE Associative Processing Array
[Figure: four PE cells (PE Cell 0 to PE Cell 3), each with its own data memory (Data Memory0 to Data Memory3), together with the Responder Resolution Circuitry, the MAX/MIN Circuitry, and the At_Least_One_Responder signal.]
6
A Processing Element Overview
Components shown in the PE block diagram:
• 8-bit ALU (with CarryOut)
• 1-bit ALU
• Two MUXes
• 16 8-bit General-Purpose Registers
• 16 1-bit Logical Registers
• 16-deep 1-bit Mask Stack
• 1-bit Responder Register
• Find/Step/ResolveFirst unit
• Comparator
• Inputs from the Common Registers, the General-Purpose Registers, and the Logical Registers
7
Instruction Set and Assembly Language (1)
• Data Transfer Instructions
  – LD address, dstreg
  – LDI immediate, dstreg
  – LDRR srcreg, dstreg
  – LDRRSPD srcreg
  – ST srcreg, address
• Arithmetic and Logical Instructions (mnemonic srcreg1, srcreg2, dstreg)
  – ADD SUB
  – AND OR XOR NOT
  – SLL SRL
  – SLT SLE SGT SGE SEQ SNE
8
Instruction Set and Assembly Language (2)
• Mask Stack and Responder Instructions
  – SETMSK
  – TOPMSK TOPMSKRSPD
  – POPMSK POPMSKRSPD
  – POPTHEM POPTHEMRSPD
  – RPCMSK RPCMSKRSPD
  – PUSHMSK PUSHMSKRSPD
  – PUSHTHEM
  – PUSHMSKTHEM
  – STKTOMEM MEMTOSTK
  – FIND
  – STEP
  – RESFST
9
Instruction Set and Assembly Language (3)
• Maximum and Minimum Searching Instructions
  – SETMXMI
  – LDMXMI
  – STMXMI
  – MAX
  – MIN
• Branch/Jump Instructions
  – BNR
  – BRS
  – J
10
Associative Operations
Related PE Components:
• The Responder Register:
  indicates whether or not a PE is a responder to a particular associative search
• The Step/Find/ResolveFirst Unit:
  supports processing multiple responders in various ways
• The Mask Stack:
  represents at most 16 levels of association. The top of the Mask Stack always represents the current status of the PE: whether it is masked (‘1’) or unmasked (‘0’)
11
Example of Associative Search: Find all Focus cars located in Ohio
• Perform the comparison: model == “Focus”, and store the result (‘1’ or ‘0’) into $LR1
• Perform the comparison: location == “Ohio”, and store the result into $LR2
• AND $LR1 with $LR2, and store the result into the Responder Register
(Note: the instructions above are performed by all PEs in parallel; such instructions are called unmasked instructions.)
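The search steps above can be mimicked in software. The following is a minimal Python sketch, not the hardware implementation: the list `cars` is a hypothetical stand-in for the four PE data memories (values taken from the auto table), and `associative_search` is an illustrative name. Each list comprehension plays the role of one instruction executed by all PEs at once.

```python
# Hypothetical software model of the four PE data memories (PE0..PE3).
cars = [
    {"id": 1, "color": "Burgundy", "model": "Focus",  "state": "OH", "rebate": 170},
    {"id": 2, "color": "Blue",     "model": "Taurus", "state": "OH", "rebate": 160},
    {"id": 3, "color": "Blue",     "model": "Focus",  "state": "OH", "rebate": 190},
    {"id": 4, "color": "Red",      "model": "Focus",  "state": "PA", "rebate": 180},
]

def associative_search(records):
    """Return one responder bit per PE for: model == Focus AND state == OH."""
    # Comparison 1, result stored in each PE's $LR1.
    lr1 = [int(r["model"] == "Focus") for r in records]
    # Comparison 2, result stored in each PE's $LR2.
    lr2 = [int(r["state"] == "OH") for r in records]
    # AND $LR1 with $LR2; the result goes to each PE's Responder Register.
    return [a & b for a, b in zip(lr1, lr2)]

responders = associative_search(cars)
print(responders)  # [1, 0, 1, 0]: PE0 and PE2 respond
```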
12
Unmasked and Masked Instructions
Unmasked Instruction:
Executed by all the PEs regardless of the state of the Mask Stack
Masked Instruction:
Executed only by those PEs with a ‘1’ on the top of their Mask Stack
13
Example of Associative Search Using Masked Instructions: Find all Focus cars located in Ohio
• Initialize the top of the Mask Stack to ‘1’
• Perform the comparison: model == “Focus”, and store the result (‘1’ or ‘0’) into $LR1
• Perform the comparison: location == “Ohio”, and store the result into $LR2
• AND $LR1 with $LR2, and store the result into the Responder Register
• AND the Responder Register with the top of the Mask Stack, push the result onto the Mask Stack, and also store it into the Responder Register
• Increase the rebate of all Focus cars in Ohio by 10 (masked instruction)
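A minimal Python sketch of the masked sequence above, under stated assumptions: the `PE` class and its field names are hypothetical, and "initialize the top of the Mask Stack" is modeled as pushing ‘1’ onto an empty stack. The final loop stands in for a masked instruction, which only takes effect in PEs whose Mask Stack top is ‘1’.

```python
class PE:
    """Hypothetical software model of one processing element."""
    def __init__(self, model, state, rebate):
        self.model, self.state, self.rebate = model, state, rebate
        self.mask_stack = []   # the hardware stack is 16 deep
        self.responder = 0

pes = [PE("Focus", "OH", 170), PE("Taurus", "OH", 160),
       PE("Focus", "OH", 190), PE("Focus", "PA", 180)]

# Initialize the top of every Mask Stack to 1 (assumed: push onto empty stack).
for pe in pes:
    pe.mask_stack.append(1)

# Unmasked search steps, performed by every PE.
for pe in pes:
    lr1 = int(pe.model == "Focus")
    lr2 = int(pe.state == "OH")
    pe.responder = lr1 & lr2
    # AND the responder with the top of the Mask Stack, push the result,
    # and also store it back into the Responder Register.
    result = pe.responder & pe.mask_stack[-1]
    pe.mask_stack.append(result)
    pe.responder = result

# Masked instruction: only PEs with 1 on top of the Mask Stack execute it.
for pe in pes:
    if pe.mask_stack[-1] == 1:
        pe.rebate += 10

rebates = [pe.rebate for pe in pes]
print(rebates)  # [180, 160, 200, 180]
```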
14
The MAX/MIN Circuitry, the Responder Resolution Circuitry, and PE3
[Figure: PE 0 (Mask Stack, Responder register, Step/Find/ResolveFirst unit, General Purpose Registers) connected to the Responder Resolution circuitry (inputs R0 to R3 from PE0 to PE3; outputs V0 to V3 to PE0 to PE3 and V4 to the CU) and to the MAX/MIN circuitry (inputs D0 to D3 and R0 to R3 from each PE's GPR and RPD; outputs MM0 to MM3 to PE0 to PE3; clr input).]
15
Using the Falkoff Algorithm for MAX/MIN Search
Maximum-Value Searching (the following steps are performed in parallel for all the data)
• Search the bit slices of the data from the most significant bit to the least significant bit. As each bit slice is processed, each bit is ANDed with the corresponding MM bit (a 1-bit register used to indicate whether or not a data item is the maximum after processing a bit)
• Check the results of the AND to ensure that at least one new maximum value remains:
16
Using the Falkoff Algorithm for MAX/MIN Search (continued)
If this condition is true, then the MM bits are updated by the results of AND; if all the results are 0, then the MM bits are not updated at this time
• Continue to process the remaining bit slices as above until all bits are processed
• After the least significant bit slice is processed:
If only one MM bit is ‘1’, it marks the largest number; if more than one MM bit is ‘1’, those data are tied for the maximum value
17
Using the Falkoff Algorithm for MAX/MIN Search (continued)
Minimum-Value Searching:
• Similar to maximum-value searching, but complement each bit slice before ANDing it with the MM bits
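A minimal Python sketch of the minimum search as described above, assuming 8-bit unsigned data. The function name `falkoff_min` and its parameters are illustrative and not part of the ASC instruction set. The only difference from the maximum search is that each bit slice is complemented before the AND.

```python
def falkoff_min(values, width=8):
    """Return one MM bit per value; a 1 marks a value tied for the minimum."""
    mm = [1] * len(values)                      # every item starts as a candidate
    for bit in range(width - 1, -1, -1):        # MSB down to LSB
        # Complement the bit slice, then AND it with the MM bits.
        inverted = [1 - ((v >> bit) & 1) for v in values]
        anded = [m & b for m, b in zip(mm, inverted)]
        # Update the MM bits only if at least one candidate survives.
        if any(anded):
            mm = anded
    return mm

min_flags = falkoff_min([170, 160, 190, 180])
print(min_flags)  # [0, 1, 0, 0]: the rebate of 160 is the minimum
```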
18
Search for the Maximum Rebate in the Data Memories
Bits are processed from MSB to LSB; the MM columns show the value of each MM bit at initialization and after processing each bit.

Rebate  Bit slices (7..0)  MM bit  Init  7  6  5  4  3  2  1  0
170     10101010           MM0     1     1  1  1  0  0  0  0  0
160     10100000           MM1     1     1  1  1  0  0  0  0  0
190     10111110           MM2     1     1  1  1  1  1  1  1  1   (max)
180     10110100           MM3     1     1  1  1  1  0  0  0  0
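The rebate trace can be reproduced with a short Python sketch of the maximum search, assuming 8-bit unsigned data. The function name `falkoff_max` and the `trace` list are illustrative only; in the hardware the same steps operate on bit slices across all PEs in parallel.

```python
def falkoff_max(values, width=8):
    """Return the final MM bits and a trace of the MM bits after each bit slice."""
    mm = [1] * len(values)                      # initialization: all candidates
    trace = [mm[:]]
    for bit in range(width - 1, -1, -1):        # MSB down to LSB
        bit_slice = [(v >> bit) & 1 for v in values]
        anded = [m & b for m, b in zip(mm, bit_slice)]
        # If every AND result is 0, the MM bits are left unchanged.
        if any(anded):
            mm = anded
        trace.append(mm[:])
    return mm, trace

mm, trace = falkoff_max([170, 160, 190, 180])
print(mm)  # [0, 0, 1, 0]: the rebate of 190 is the maximum

# Print one row per MM bit: initialization, then bits 7 down to 0.
for i in range(4):
    print(f"MM{i}", [step[i] for step in trace])
```

Each printed row corresponds to one MM row of the trace, for example MM3 reads 1 1 1 1 1 0 0 0 0.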
19
MAX/MIN Circuit using the Falkoff Algorithm
[Figure: MAX/MIN circuit for four PEs. For each PE i (0 to 3) the diagram shows a Data i input, an RPD i input, an 8-bit shift register i, a “not” stage, an MMi bit, and an output to RPDi; the control signals are OP and Mask_W.]
21
Functionality of Responder Resolution Circuit
• Responder resolution:
Send an At-Least-One-Responder signal to the IS control unit
• Support responder selection:
Send a corresponding Responder_Before_Me signal to each PE’s Find/Step/ResolveFirst unit
22
The Responder Resolution Circuitry for 4 PEs
• R0 to R3: from the responder registers
• V0 to V3: the Responder_Before_Me signals
• V4: the At_Least_One_Responder signal
[Figure: the Responder Resolution Circuitry spanning PE0 to PE3, with inputs R0 to R3, a constant ‘0’ input, and outputs V0 to V4.]
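The signal names suggest a simple chain, sketched below in Python. This is an assumption about the logic, not the circuit from the slides: V0 is taken to be the constant ‘0’, and each later V is assumed to be the OR of the previous V with the previous responder bit, so that V4 is ‘1’ exactly when some PE responds. The function name `responder_resolution` is illustrative.

```python
def responder_resolution(r):
    """Given responder bits R0..Rn-1, return (Responder_Before_Me bits, At_Least_One_Responder).

    Assumed logic: V0 = 0 and V[i+1] = V[i] OR R[i].
    """
    v = [0]                          # V0: no PE comes before PE0
    for bit in r:
        v.append(v[-1] | bit)        # propagate "a responder has been seen"
    return v[:-1], v[-1]             # V0..Vn-1 go to the PEs, Vn goes to the CU

before_me, at_least_one = responder_resolution([1, 0, 1, 0])
print(before_me)     # [0, 1, 1, 1]
print(at_least_one)  # 1
```

Under this assumption, the first responder is the PE whose responder bit is ‘1’ while its Responder_Before_Me input is ‘0’.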
23
Responder Processing
• Process responders in parallel:
  – Use masked instructions
• Process responders sequentially:
  – Needs responder selection instructions
  – Needs a responder selection mechanism
24
Responder Selection Instructions
• Step: used repetitively to pick one responding PE at a time for further processing (“for” loop)
  e.g., step through all the Focus cars in Ohio to list the features available on each car
• Find: select a responding PE while still keeping all responders identifiable (“while” loop)
  e.g., retrieve the tax rate from one of the cars located in OH, increment it by a certain amount, then apply the new tax rate to all the cars located in OH
25
Responder Selection Instructions (continued)
• ResolveFirst: select a responder and keep only this responder identifiable
  e.g., resolve one PE from several PEs whose values are tied for the maximum value
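The three selection instructions can be contrasted with a small Python sketch. It is an interpretation of the descriptions above, not the hardware behaviour: the function names are hypothetical, and choosing the lowest-numbered responder as the "first" one is an assumption.

```python
def first_responder(r):
    """Return a one-hot list marking the lowest-numbered responder (assumed order)."""
    seen = 0
    selected = []
    for bit in r:
        selected.append(bit & (1 - seen))   # responder with no responder before it
        seen |= bit
    return selected

def step(r):
    """Yield responding PE indices one at a time, dropping each after it is picked."""
    r = r[:]
    while any(r):
        i = first_responder(r).index(1)
        yield i
        r[i] = 0                            # this responder has been processed

def find(r):
    """Select one responder but leave the full responder set unchanged."""
    sel = first_responder(r)
    return (sel.index(1) if any(sel) else None), r[:]

def resolve_first(r):
    """Keep only the selected responder identifiable."""
    return first_responder(r)

print(list(step([1, 0, 1, 0])))        # [0, 2]
print(find([1, 0, 1, 0]))              # (0, [1, 0, 1, 0])
print(resolve_first([0, 1, 1, 1]))     # [0, 1, 0, 0]
```

The last call mirrors the tie-breaking use case: several PEs tied for the maximum are reduced to a single responder.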
26
The Responder Resolution Circuitry, MAX/MIN Circuitry, and PE3
[Figure: PE 3 (Mask Stack, Responder register, Step/Find/ResolveFirst unit, General Purpose Registers) connected to the Responder Resolution circuitry (inputs R0 to R3 from PE0 to PE3; outputs V0 to V3 to PE0 to PE3 and V4 to the CU) and to the MAX/MIN circuitry (inputs D0 to D3 and R0 to R3 from each PE's GPR and RPD; outputs MM0 to MM3 to PE0 to PE3; clr input).]
27
Design Language: VHDL
• A standard hardware description language used to model and design digital hardware
  – Supports concurrent events
  – Can be translated into hardware by design tools
• Good for managing large design structures
• Supported by many CAD tool and programmable logic vendors
28
Altera MAX+PLUS II Development System
• Design Entry: Graphic Editor, Text Editor, Waveform Editor, Symbol Editor, Floorplan Editor, other design entry tools
• Design Compilation: MAX+PLUS II Compiler
• Design Verification: Simulator, Waveform Editor, Timing Analysis, other verification tools
• Device Programming: Programmer, Data I/O, other programmers
29
Altera FLEX 10K FPLD
• FLEX10K70 Device:
  – 3,744 LEs
  – 9 EABs
  – 70,000 gates in total
[Figure: Partial FLEX10K20 FPLD Architecture, showing EABs and LABs joined by the FastTrack Interconnect and surrounded by IOEs (I/O elements).]
30
Simulation on FLEX 10K 70 Chip
• The ISCU runs at about 10 MHz and uses 50% of the logic gates
• One EAB is used as the local memory for each PE; the 4 PEs and the support circuitry run at about 14 MHz and use 82% of the logic cells
The simulation results show that the FLEX10K70 chip is not large enough for the 4-PE processor, so our current work targets Altera APEX 20K devices with 1 million gates on one chip.
31
Future Work
• Explore more arithmetic features and associative operations
• Develop the complete ASC assembly language and the ASC back-end compiler
• Implement the PE cell interconnection network
• Implement the whole ASC processor on bigger and faster FPGA chips
• Develop the multiple instruction stream MASC model