effective compilation support for variable instruction set architecture
Post on 05-Jan-2016
Effective Compilation Support for Variable Instruction Set Architecture
Jack Liu, Timothy Kong, Fred Chow
Cognigine Corp.
www.cognigine.com
Outline
1. VISC Architecture
2. Compile-time Configurable Code Generation
3. Managing the Dictionary
4. Concluding Remarks
Configurable Computing
Motivation:
• Higher performance
  • processor and instruction set customized to the type of application
• Lower hardware cost
  • non-essential features excluded
• Shorter time-to-market
Variable Instruction Set Architecture (VISC Architecture™)
A new approach to configurable computing:
• Fixed processor hardware
• Many types of operations provided
• Numerous instruction variants (CISC-style)
• Per-program instruction set tailoring during compile time
Background of this work
Cognigine CGN16100 Network Processor
• Single-chip, fully programmable network processor
• Processing cores: 16 Re-configurable Communications Unit (RCU) processor cores
  • VISC architecture
  • 4 64-bit parallel execution units
  • Multi-threaded
  • 512 KB on-chip memory (text and data)
VISC Architecture™
[Figure: an instruction whose opcode selects an entry in the dictionary]
Instruction: opcode opnd0 opnd1 opnd2 opnd3
• opcode: 8-bit
Dictionary (instruction set for current program): 256 entries
Dictionary entry:
• 32-bit: 2 operations
• 64-bit: 4 operations
• 128-bit: 8 operations
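The lookup described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `DictEntry`, `Instruction`, and `decode` are hypothetical, and the sketch does not model how operand fields are routed to individual operations. The sample values are taken from the IDF processing example later in the deck.

```python
from dataclasses import dataclass

DICT_SIZE = 256  # an 8-bit opcode can select one of 2**8 dictionary entries


@dataclass(frozen=True)
class DictEntry:
    """Hypothetical model of a dictionary entry: the operations it bundles."""
    ops: tuple


@dataclass(frozen=True)
class Instruction:
    """Hypothetical model of an instruction: opcode plus operand fields."""
    opcode: int      # index into the per-program dictionary
    operands: tuple  # opnd0..opnd3 as written in the assembly


def decode(instr, dictionary):
    """Look up the operations named by the opcode and pair them with the operands."""
    if not 0 <= instr.opcode < DICT_SIZE:
        raise ValueError("opcode does not fit in 8 bits")
    entry = dictionary[instr.opcode]
    return entry.ops, instr.operands


# Dictionary entry 3 and the instruction that refers to it.
dictionary = {3: DictEntry(("add", "xor", "sub", "nop"))}
instr = Instruction(3, ("8($p1)", "$w70", "0x55", "0xf8"))
ops, operands = decode(instr, dictionary)
```

The point of the sketch is that the operation names live only in the dictionary, so the instruction itself carries nothing but a small index and the operands.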
Motivation for VISC Architecture
1. Efficient way to encode/decode the many operation variants with different addressing modes
• Not all used in each program
2. High instruction encoding density
• Small opcode bit count
• Operands shared among multiple operations
3. Simplified control logic for VLIW-style ILP
• Up to 8 operations per cycle
Operation Specification
In Dictionary Entry (only specified once):
1. Operation name
2. Operation variants:
   • Signed and unsigned
   • Operand and result sizes: 8-bit, 16-bit, 32-bit, 64-bit
   • Support different sizes among operand(s) or result
   • Vector: 64v8, 64v16, 64v32, 32v8, 32v16
3. Data path to each operand/result
In Instruction:
1. Operands' encoding formats
2. Actual operands
RCU Architecture
• 5-stage pipeline
• 4-way multi-threaded
• Hardware RSF synchronization
• 128-bit reconfigurable address path
• 256-bit reconfigurable data path
[Block diagram: four Execution Units, four Source Routes, Pointer File, Dictionary, Dictionary Decode, Address Calculation, Pipeline & Thread Control, Data Flow Synchronization, Registers and Scratch Memory, Packet Buffers, Data Memory, Instruction Cache, RSF Connector, and "Back-side" Ports to the RSF; paths are labeled 64, 128, and 256]
Roles of Compiler for VISC Architecture
1. Determine best instruction set stored in dictionary for best execution time performance
2. Generate optimized code sequence based on best instruction set
3. Cater to various hardware limitations:
• Dictionary limit
• Data path constraints
• Dictionary and Instruction encoding constraints
New Compilation Approach: Configurable Code Generation
• Exact form of generated instructions decided in the last instruction scheduling phase
• Direct result of instruction compaction based on what is allowed by the hardware
Compiler Implementation Method
• Retarget SGI Pro64 (Open64) compiler to an Abstract Machine
• Code generator operates on an Abstract Operation Representation
  – Code generation optimizations left intact
• Add new Instruction and Dictionary Finalization (IDF) phase as post-pass
  IDF Phase 1:
  – Instruction scheduling and folding
  – Abstract operations converted to target code sequence
  IDF Phase 2:
  – Output VISC instructions and dictionary entries
Compiler Phase Structure
[Figure: compiler phase flow; the diagram also carries the label "Pro64™ Back-end"]
C
→ GNU / Pro64™ Front-end
→ WHIRL Optimizer
→ Code Generator
→ IDF
→ Assembly Program: Instructions, Dictionary
Abstract Operation Representation (AOR)
Each operation corresponds to a micro-operation in the core execution units
• RISC-like formats
  – r1 = op r2, r3
  – r2 = load <offset>(<base>)
  – store r2 <offset>(<base>)
  – r1 = loadimm <imm>
• Optimizations in AOR reflected in final code
• No pre-disposition of compiler to any specific instruction format
Multiple AOR ops can be combined into a single target operation
Operations taking immediate operand
  r2 = move <imm>        =>  r3 = addi r1 <imm>
  r3 = add r1, r2
Operations supporting memory operands
  r2 = load 4(sp)        =>  r3 = add r1 4(sp)
  r3 = add r1, r2
Post-increment/decrement memory operations
  r2 = load 0(r1)        =>  r2 = load 0(r1++)
  r1 = addi r1, 4
Branches on condition codes
  r1 = add r2, r3 . . .      r1 = add r2, r3
  compare (r1 != 0)      =>  br.z label (only if immediately after)
  br.z label
Others
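The first two folding patterns above can be illustrated with a small peephole pass. This is a minimal sketch, not the IDF implementation: the tuple encoding of operations and the name `fold_pairs` are hypothetical, only adjacent pairs are examined, and the intermediate register is assumed to be dead after the consuming `add`.

```python
def fold_pairs(ops):
    """Fold adjacent AOR-style pairs into one target operation.

    ops is a list of (dest, opcode, operands) tuples in program order.
    Assumption: the register written by the first op of a folded pair is a
    temporary that is not read again after the second op.
    """
    out = []
    i = 0
    while i < len(ops):
        if i + 1 < len(ops):
            d1, op1, src1 = ops[i]
            d2, op2, src2 = ops[i + 1]
            feeds_add = (
                op2 == "add"
                and len(src2) == 2
                and src2[1] == d1   # temporary is the second source
                and src2[0] != d1   # and is not also the first source
            )
            if feeds_add and op1 == "move":
                # Immediate operand: move + add becomes addi.
                out.append((d2, "addi", (src2[0], src1[0])))
                i += 2
                continue
            if feeds_add and op1 == "load":
                # Memory operand: load + add becomes add with a memory source.
                out.append((d2, "add", (src2[0], src1[0])))
                i += 2
                continue
        out.append(ops[i])
        i += 1
    return out


imm_case = [("r2", "move", ("<imm>",)), ("r3", "add", ("r1", "r2"))]
mem_case = [("r2", "load", ("4(sp)",)), ("r3", "add", ("r1", "r2"))]
folded_imm = fold_pairs(imm_case)
folded_mem = fold_pairs(mem_case)
```

In the approach described in these slides, folding is decided during scheduling rather than in a separate pass like this one, so the sketch only shows what a single fold does to the operation stream.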
IDF Approach
Instruction scheduling + following tasks:
– Instruction folding
– Opcode selection
– Modelling of irregular hardware constraints
– Modelling of encoding constraints
– Monitoring of states of condition codes and transient registers
– Keeping track of dictionary contents
Use enumeration (branch and bound) approach
Example of IDF Processing
Input:
  $w80 = move 0x55
  $w91 = move 0xf8
  $w70 = add $w70, $w80
  $w71 = xor $w92, $w80
  $w90 = sub $w92, $w91
  store 8($p1) = $w90
Dictionary (entry 3): add xor sub nop
Instruction: op3 8($p1) $w70 0x55 0xf8
• move and store instructions subsumed
• $w71, $w92 mapped to transient registers
IDF Scheduling Algorithm
To speed up the search:
Shrink solution space by:
– Coming up with high initial bound_sch
– Prune useless search paths continuously
• Tight hardware constraints help
[Flowchart]
Input: Sequence of operations in BB
1. Start
2. Estimate initial bound_sch
3. Search for schedule with length <= bound_sch
4. Succeed? If yes, end. If no, set bound_sch = bound_sch + 1 and repeat step 3
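The flowchart's loop (estimate a schedule-length bound, search within it, raise the bound by one on failure) can be sketched as follows. This is a toy model under stated assumptions, not the IDF scheduler: every operation takes one cycle, the only hardware constraint is a fixed issue width, operations are listed in topological order, and the names `critical_path`, `search`, and `schedule` are hypothetical.

```python
import math
from itertools import combinations


def critical_path(deps):
    """Longest dependence chain; assumes deps[i] only names ops with index < i."""
    depth = []
    for preds in deps:
        depth.append(1 + max((depth[j] for j in preds), default=0))
    return max(depth, default=0)


def search(deps, width, bound):
    """Enumerate schedules; return one with at most `bound` cycles, else None."""
    n = len(deps)

    def rec(done, cycles):
        if len(done) == n:
            return cycles
        # Prune: the remaining ops cannot fit in the cycles left under the bound.
        if n - len(done) > (bound - len(cycles)) * width:
            return None
        ready = [i for i in range(n)
                 if i not in done and all(j in done for j in deps[i])]
        # Try the fullest cycles first, then progressively emptier ones.
        for k in range(min(width, len(ready)), 0, -1):
            for group in combinations(ready, k):
                result = rec(done | set(group), cycles + [group])
                if result is not None:
                    return result
        return None

    return rec(frozenset(), [])


def schedule(deps, width):
    """Start from an estimated lower bound and add 1 until a schedule is found."""
    bound = max(critical_path(deps), math.ceil(len(deps) / width))
    while True:
        result = search(deps, width, bound)
        if result is not None:
            return bound, result
        bound += 1


# Op 0 feeds ops 1, 2, and 3; with two ops per cycle the first bound (2) fails.
length, cycles = schedule([set(), {0}, {0}, {0}], width=2)
```

Starting from the highest bound that is still a valid lower bound keeps the number of failed searches small, which is the reason the slide asks for a high initial estimate.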
Managing the Dictionary
• Dictionary usage increases due to:
  – Program size: more variety of operations
  – High ILP: more combinations of operations
  – Library code linked in
• Currently, dictionary contents fixed for each executable
• Role of linker:
  – Merge dictionary entries with identical contents across files/libraries
  – Error message on dictionary overflow
• Role of compiler:
  – Maximize dictionary entry re-use
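The linker's two duties listed above (merging identical entries and reporting overflow) can be sketched like this. It is an illustration only: the names `merge_dictionaries` and `DictionaryOverflow` are hypothetical, each object file's dictionary is modeled as a list of hashable entries, and the limit of 256 is taken from the dictionary size shown earlier in the deck.

```python
DICT_LIMIT = 256  # dictionary size from the architecture slide


class DictionaryOverflow(Exception):
    """Raised when the merged dictionary would exceed the hardware limit."""


def merge_dictionaries(object_dicts, limit=DICT_LIMIT):
    """Merge per-file dictionaries, sharing entries with identical contents.

    Returns the merged dictionary and, for each input file, a map from its
    local entry index to the index in the merged dictionary, so that the
    opcodes in that file's instructions can be rewritten.
    """
    merged = []    # final index -> entry contents
    index_of = {}  # entry contents -> final index
    remaps = []
    for entries in object_dicts:
        remap = {}
        for local_index, entry in enumerate(entries):
            if entry not in index_of:
                if len(merged) >= limit:
                    raise DictionaryOverflow(
                        f"more than {limit} distinct dictionary entries")
                index_of[entry] = len(merged)
                merged.append(entry)
            remap[local_index] = index_of[entry]
        remaps.append(remap)
    return merged, remaps


file_a = [("add", "nop"), ("sub", "xor")]
file_b = [("sub", "xor"), ("load", "add")]
merged, remaps = merge_dictionaries([file_a, file_b])
```

Here the entry shared by both files occupies one slot in the merged dictionary, and the second file's opcodes are remapped to point at it.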
Dictionary Compilation
Strategy:
• Keep track of existing dictionary entries during compilation
  – Extract dictionary entries from:
    • Libraries and .s files being linked
    • .o files compiled before current file
    Example: cc a.c b.o c.s
  – Maintain table of existing dictionary entries
  – Add to table as new entries are generated
• Re-use existing dictionary entries
• Bias scheduling towards dictionary conservation as dictionary fills up
User Control of Dictionary Compilation
Best program performance demands a near-full dictionary.
When the dictionary overflows, the program needs to be re-compiled.
Provide user control mechanisms:
– Trade-off between dictionary consumption and program performance
– Command line option: -CG:dict_usage=n, n = 0…10
– Embedded in code: #pragma dict_usage n
dict_usage is the dictionary budget guideline for IDF
– Low dict_usage:
  • Fewer new dictionary entries created
  • Low ILP
– High dict_usage:
  • Tighter instruction schedule
  • More dictionary entries created
IDF Support of dict_usage
Additional search goal bound_dict
– Number of new dictionary entries allowed for current BB
– Automatically adjusted lower with more pre-existing entries
When bound_dict is reached during enumeration, disallow creating a new dictionary entry (unless single operation)
[Chart: instructions and dict entries (vertical axis 0 to 800) plotted for dict_usage = 10, 8, 3, 2, 0]
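The rule for the dictionary bound on new entries can be sketched as a predicate that the enumeration might call before committing to a candidate instruction. This is an assumption-laden illustration: the names `may_use_entry` and `is_single_operation` are hypothetical, entries are modeled as tuples of operation names, and a "single operation" entry is taken to mean one with exactly one non-nop operation. How the bound itself is derived from dict_usage is not given in the slides and is not modeled here.

```python
def is_single_operation(entry):
    """Assumption: nop slots are padding, so count only non-nop operations."""
    return sum(1 for op in entry if op != "nop") == 1


def may_use_entry(entry, existing_entries, new_entries_in_bb, bound_dict):
    """Decide whether the search may schedule an instruction using `entry`.

    existing_entries: entries already in the dictionary before this basic block
    new_entries_in_bb: entries created so far while scheduling this basic block
    bound_dict: number of new entries allowed for this basic block
    """
    if entry in existing_entries or entry in new_entries_in_bb:
        return True  # re-use never consumes budget
    if is_single_operation(entry):
        return True  # single-operation entries stay allowed past the bound
    return len(new_entries_in_bb) < bound_dict


existing = {("add", "xor", "sub", "nop")}
reuse_ok = may_use_entry(("add", "xor", "sub", "nop"), existing, set(), 0)
single_ok = may_use_entry(("add", "nop"), existing, set(), 0)
blocked = may_use_entry(("add", "sub"), existing, set(), 0)
```

Keeping single-operation entries exempt means the search can always fall back to one operation per instruction, so a schedule exists even when the budget is exhausted.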
Experimental Results
Summary (with dict_usage=10):
• ILP from IDF scheduling: 1.38 ops per instruction
• ILP from relaxed scheduling: 1.51 ops per instruction
• 23% of all subsumable operations subsumed
• Each dictionary entry referred to by 2.63 instructions (statically)
• Scheduling via enumeration: 100 times slower than one-pass schedulers
• Compilation time: 1 to 2 minutes per program
Concluding Remarks
• VISC approach most suitable for embedded processors
  – Limited program size
  – Dictionary space less of an issue
  – Slow compilation tolerable
  – CISC-style instructions enable small code size
• Compilation support key to deploying applications on VISC
  – Very hard to write in assembly language
  – Advanced optimizations performed by compiler
  – Dictionary managed by compiler with user hints
• Compile-time configurable code generation enables RISC compilation techniques to generate CISC output