1
Dependence in C and Hardware Design
Allen and Kennedy, Chapter 12
Presented by Tali Shragai
2
Today’s lecture…
3
Introduction
So far, we’ve discussed dependence analysis in Fortran
Dependence analysis applies to any language and translation context where arrays and loops are useful
Application to C and C++: modern features (pointers, structures…)
Application to hardware design: a language-based approach
4
Outline
Optimizing C: overview; the challenges
HW design: overview; HW Description Languages (HDL); optimizing simulation; synthesis optimization methods
Summary
5
Problems of C
C/C++ focuses on simplifying software development, at the expense of optimizability
Optimization may not always be desired. Example: polling a keyboard with while (!(t = *p)); an optimizer would hoist the load of *p out of the loop, so the poll would never see the change (which is why such a p must be declared volatile)
Use of C/C++ has expanded into areas where optimization is required…
6
Problems of C – Example

void vadd(double *a, double *b, double *c, int n) {
    while (n--)
        *a++ = *b++ + *c++;
}

This would be easily vectorized & optimized in Fortran, but not in C:
Pointers: the memory locations accessed through pointers are not clear (unlike for arrays…)
Aliasing: C does not guarantee that arrays passed into a subroutine do not overlap
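C99 later addressed exactly this aliasing gap; the qualifier is not part of the chapter's discussion, so this is a hedged sketch. With restrict, the programmer asserts the no-overlap guarantee that Fortran provides, and the loop becomes vectorizable:

```c
#include <stddef.h>

/* Sketch: with restrict, the caller promises that a, b, and c
   point to non-overlapping storage, restoring the guarantee
   Fortran gives and letting the compiler vectorize freely. */
void vadd_restrict(double *restrict a, const double *restrict b,
                   const double *restrict c, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

Violating the promise (overlapping arguments) is undefined behavior, so the burden of the safety assertion moves from the compiler to the programmer.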
7
Problems of C – Example (cont.)

void vadd(double *a, double *b, double *c, int n) {
    while (n--)
        *a++ = *b++ + *c++;
}

Side-effect operators: pre/post-increment operators conceal the index calculations used for addressing the arrays, so the optimizer must spend extra effort on transformations (induction-variable substitution…)
Loops: Fortran loops provide values and restrictions that simplify optimization
8
Outline
Optimizing C: overview; the challenges
HW design: overview; HW Description Languages (HDL); optimizing simulation; synthesis optimization methods
Summary
9
Pointers
C optimizers’ most difficult challenge is unrestricted pointers:
Indirect pointer accesses are hard to resolve: a pointer variable can point to different memory locations during its use
Aliased memory locations: a memory location can be accessed through more than one pointer variable at any given time
The result is much more difficult and expensive dependence testing
10
Pointers dependence testing
The compiler can replace a pointer indirection like *p by a subscripted array reference n[e] for dependence testing.
But another pointer q might access the same location, so it must be replaced with the pseudo array n too…
In the worst case, the compiler must assume that every pair of references is dependent!
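A minimal illustration of the rewrite described above (the function names and the doubling operation are invented for the example): the pointer-indirection form is re-expressed as subscripted references that a dependence tester can analyze like a Fortran array:

```c
/* Pointer-indirection form: every *(p + i) is opaque to a
   dependence tester that only understands subscripts. */
void scale_ptr(double *p, int len) {
    for (int i = 0; i < len; i++)
        *(p + i) = *(p + i) * 2.0;
}

/* The same loop after the rewrite: references go through the
   pseudo-array form p[i] that dependence testing understands. */
void scale_subscripted(double p[], int len) {
    for (int i = 0; i < len; i++)
        p[i] = p[i] * 2.0;
}
```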
11
Dependence testing strategies
Safety assertions: use compiler options / pragmas to indicate “disciplined” code
Safe parameters: all pointer parameters point to independent storage
Safe pointers: all pointer variables (parameter, local, global) point to independent storage
Whole-program analysis: without separate compilation, analyzing dependences over the entire program is solvable, but still unsatisfactory
12
Naming and Structures
In Fortran, unlike C, a block of storage can be uniquely identified by a single name, which simplifies dependence analysis
Dependence analysis requires a single name for all references to the same location
C’s constructs complicate this: p, *p, **p, *(p+4), *(&p+4), p[1]; expressions at different levels of indirection (&p, p, *p, **p) can reach the same storage
13
Naming and Structures (cont.)
Troublesome structures: the naming problem (what is the name of ‘a.b’?)
Unions allow different-sized objects to overlap the same storage
Need to reduce references to a common unit of the smallest storage possible
14
Loops: lack of constraints in C
Jumping into a loop body is permitted
The induction variable (if there is one) can be modified in the body of the loop
The loop increment value may also be changed
The conditions controlling the initiation, increment, and termination of the loop have no constraints on their form
It might be hard to identify a loop variable with start and end values
15
Loops (cont.) – rewrite while as a DO loop
The induction variable: only one!
Must be initialized with the same value on all paths into the loop
Must have one and only one increment in the loop
The increment must be executed on every iteration
The termination condition must match the increment
No jumps into the loop body from outside
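A sketch of a while loop that satisfies all of the conditions above, and the counted for loop (C's closest analogue of a Fortran DO) it can be rewritten into; the function names are illustrative:

```c
/* A "disciplined" while loop: one induction variable i, one
   initialization on the single entry path, exactly one increment
   executed every iteration, and a matching termination test. */
int sum_while(const int *a, int n) {
    int s = 0;
    int i = 0;
    while (i < n) {
        s += a[i];
        i++;
    }
    return s;
}

/* The equivalent canonical counted loop after the rewrite. */
int sum_for(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```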
16
Scoping and Statics
Scoping rules might create extra aliasing; this is handled by creating unique symbols for variables with the same name but different scopes
Static variables: a file-static variable can only be modified by procedures that see its declaration, so access to the variable can be determined from scope information in the symbol table
But storing an address parameter in a static variable makes it accessible from any other procedure
17
Problematic C Dialects
Some C code might look “tidy”, as in Fortran
“Messy” style conventions:
Use of pointers instead of arrays
Use of address and dereference operators
Use of side-effect operators
These once mapped directly to machine instructions, but they complicate the work of optimizers
18
Problematic C Dialects (cont.)
Titan C compiler: remove the side-effect operators!
But this requires enhancements to some transformations:
Constant propagation: treat address operators as constants and propagate them where possible; replace a generic pointer in a dereference with the actual address
Expression simplification and recognition: stronger recognition is needed of which variable within an expression is actually the ‘base variable’
19
Problematic C Dialects (cont.)
Conversion of pointers into array references simplifies dependence testing
Induction-variable substitution needs to be enhanced to deal with indirect access to array references through pointers
Recognize and remove usage of side-effect operators
20
C Miscellaneous
Volatile variables: functions containing them are best left without optimization; volatile code usually isn’t a target for optimization anyway (example: vector unit initialization)
setjmp and longjmp: commonly used for error handling; calling setjmp saves the current context in a buffer, and longjmp can then be called to bypass a section of the calling chain
Storing and restoring the current state of the computation is complex when optimization has allocated variables to registers, so no optimization is applied!
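A minimal sketch of the setjmp/longjmp idiom described above (the function names and the error code 42 are invented for the example):

```c
#include <setjmp.h>

jmp_buf env;  /* buffer holding the context saved by setjmp */

/* Deep in the call chain: on error, longjmp unwinds straight
   back to the matching setjmp, bypassing normal returns. */
void inner(int fail) {
    if (fail)
        longjmp(env, 42);
}

/* setjmp returns 0 on the initial call, and the value passed
   to longjmp when control comes back the second way. */
int run(int fail) {
    int rc = setjmp(env);
    if (rc != 0)
        return rc;      /* resumed here after longjmp */
    inner(fail);
    return 0;           /* normal path, no error */
}
```

Any non-volatile local modified between the setjmp and the longjmp is indeterminate afterwards, which is exactly why the slide says the optimizer keeps such variables out of registers or gives up entirely.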
21
C Miscellaneous (cont.)
varargs and stdargs: a variable number of arguments, as in printf(…)
Implemented by compiler directives: save all register parameters to the stack, then access them using pointer manipulation over the stack
The pointer variable is an alias for many parameters in the program, so no optimization is applied
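A sketch of the mechanism in its portable form, using the standard stdarg.h macros (sum_ints is an invented example function); the va_ macros are exactly the pointer manipulation over the argument area that defeats the optimizer:

```c
#include <stdarg.h>

/* Variadic sum: va_start/va_arg walk the saved arguments with
   what is, underneath, pointer arithmetic over the stack area,
   so one pointer aliases every extra argument. */
int sum_ints(int count, ...) {
    va_list ap;
    int s = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        s += va_arg(ap, int);
    va_end(ap);
    return s;
}
```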
22
Outline
Optimizing C: overview; the challenges
HW design: overview; HW Description Languages (HDL); optimizing simulation; synthesis optimization methods
Summary
23
Hardware Design: Overview
In the past, HW design was done at the gate/transistor level
Today, HW design is language-based, similarly to SW development
The level of abstraction may vary; the current trend is high-level behavioral specification
Key factor: the compiler’s efficiency
24
Abstraction levels for HW Design
Circuit / physical level: diagrams of electronic components
Logic level: Boolean equations
Register transfer level (RTL): control state transitions and data transfers; synthesis converts RTL to gates and flip-flops (most common!)
System level: behavior expressed by variables, no timing; behavioral synthesis selects arithmetic units and imposes timing
25
Hardware Design
Behavioral synthesis is really a compilation problem
Two fundamental tasks: verification (simulation) and implementation (synthesis)
Optimization is essential for both: HW simulation is inherently very slow, and efficient synthesis raises the device’s value
26
Outline
Optimizing C: overview; the challenges
HW design: overview; HW Description Languages (HDL); optimizing simulation; synthesis optimization methods
Summary
27
Hardware Description Languages
Two main HDLs are used today:
Verilog: supports all abstraction levels, mostly used for gates and RTL; essentially an extended C
VHDL: essentially an extended Ada
The primitives and extensions used for HW description provide similar functionality in both
28
Verilog extensions
Multi-valued logic: 0, 1, x, z (x = unknown state, z = high impedance); e.g. division by zero produces the x state
Higher data types (integer…) are vectors of multi-valued bits
Operations with x result in the x state, so the simulation can’t execute an addition directly…
Reactivity: automatic propagation of changes

always @(b or c)
    a = b + c;
29
Verilog extensions (cont.)
Objects: a specific area of silicon, with its own state and registers; the semantics differ from C’s function calls; data encapsulation using module
Connectivity: continuous passing of information through input and output ports

module add(a, b, c);
    output a;
    input b, c;
    integer a, b, c;
    always @(b or c)
        a = b + c;
endmodule
30
Verilog extensions (cont.)
Instantiation: Verilog only allows static instantiation; each instance is a different area of the silicon

    integer x, y, z;
    add adder1(x, y, z);

Vector operations: viewing other data structures as vectors of scalars
Bit selection: A[1]
Vector concatenation: {A[0], A[1:15]}
31
Optimization in Verilog
Advantages: no aliasing; restricted subscripts; no separate compilation
Disadvantages: non-procedural continuation semantics; lack of loops (iteration only implicit via “always” blocks); HW designs are large
32
Outline
Optimizing C: overview; the challenges
HW design: overview; HW Description Languages (HDL); optimizing simulation; synthesis optimization methods
Summary
33
Optimizing Simulation
Philosophy: the higher the abstraction level, the more efficient the simulation! Details consume simulation time and obscure the behavioral functionality
Next: optimization techniques, illustrated on this adder:

module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    always @(b or c)
        a = b + c;
endmodule
34
Inlining Modules
Data encapsulation hides details from the programmer, and hides optimization info from the compiler
HDLs have two properties that make module inlining simpler: the whole design is reachable at one time, and recursion is not permitted, so inlining can be done in linear time using a topological order
There is no use inlining above the level of functional units
35
Inlining example – before

module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry;
    add2 add_l(a[0:1], 0, b[0:1], c[0:1], carry);
    add2 add_r(a[2:3], carry, b[2:3], c[2:3], 0);
endmodule

module add2(sum, c_out, op1, op2, c_in);
    input op1[0:1], op2[0:1], c_in;
    output sum[0:1], c_out;
    wire carry;
    add1 add_l(sum[0], carry, op1[0], op2[0], c_in);
    add1 add_r(sum[1], c_out, op1[1], op2[1], carry);
endmodule

module add1(sum, c_out, op1, op2, c_in);
    input op1, op2, c_in;
    output sum, c_out;
    always @(op1 or op2 or c_in) begin
        sum = op1 ^ op2 ^ c_in;
        c_out = (op1 & op2) | (op2 & c_in) | (op1 & c_in);
    end
endmodule
36
Inlining example – after

module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry, temp, temp1;
    always @(b[1] or c[1] or carry) begin
        a[1] = b[1] ^ c[1] ^ carry;
        temp = (b[1]&c[1]) | (c[1]&carry) | (carry&b[1]);
    end
    always @(b[0] or c[0] or temp) begin
        a[0] = b[0] ^ c[0] ^ temp;
        0 = (b[0]&c[0]) | (c[0]&temp) | (temp&b[0]);
    end
    always @(b[3] or c[3] or 0) begin
        a[3] = b[3] ^ c[3] ^ 0;
        temp1 = (b[3]&c[3]) | (c[3]&0) | (0&b[3]);
    end
    always @(b[2] or c[2] or temp1) begin
        a[2] = b[2] ^ c[2] ^ temp1;
        carry = (b[2]&c[2]) | (c[2]&temp1) | (temp1&b[2]);
    end
endmodule
37
Execution Ordering
The order of statement execution can dramatically affect efficiency!
HW is fast thanks to triggering on each bit’s change
SW simulation cannot afford to track individual bits (too much memory overhead); it may consider all bits as a group
Execute blocks in topological order based on the dependence graph of individual array elements: no memory overhead
38
Ordering example

module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry, temp, temp1;
    always @(b[3] or c[3] or 0) begin
        a[3] = b[3] ^ c[3] ^ 0;
        temp1 = (b[3]&c[3]) | (c[3]&0) | (0&b[3]);
    end
    always @(b[2] or c[2] or temp1) begin
        a[2] = b[2] ^ c[2] ^ temp1;
        carry = (b[2]&c[2]) | (c[2]&temp1) | (temp1&b[2]);
    end
    always @(b[1] or c[1] or carry) begin
        a[1] = b[1] ^ c[1] ^ carry;
        temp = (b[1]&c[1]) | (c[1]&carry) | (carry&b[1]);
    end
    always @(b[0] or c[0] or temp) begin
        a[0] = b[0] ^ c[0] ^ temp;
        0 = (b[0]&c[0]) | (c[0]&temp) | (temp&b[0]);
    end
endmodule
39
Dynamic vs. Static Scheduling
Dynamic scheduling: dynamically track changes in values and propagate them; naturally mimics hardware; incurs the overhead of change checks, especially if performed per bit
Static scheduling: based on a topological model; blindly sweeps through all values for all objects regardless of changes; no need for change checks
40
Dynamic vs. Static Scheduling (cont.)
Prefer static scheduling for a highly active circuit: when changes are frequent, there is no need to check before updating
Can we really tell in advance?
Common strategy: use static analysis to locate where dynamic scheduling improves simulation performance!
41
Fusing always blocks
The high cost of change checks motivates fusing always blocks
Blocks with the same trigger conditions can be fused, saving the overhead of invoking blocks
Most useful for synchronous designs
But fusion may change the design’s output: still semantically correct, yet a bad surprise for the designer… simulators try to avoid output changes
42
Fusion example

module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry, temp, temp1;
    always @(posedge(clk)) begin
        a[3] = b[3] ^ c[3] ^ 0;
        temp1 = (b[3]&c[3]) | (c[3]&0) | (0&b[3]);
    end
    always @(posedge(clk)) begin
        a[2] = b[2] ^ c[2] ^ temp1;
        carry = (b[2]&c[2]) | (c[2]&temp1) | (temp1&b[2]);
    end
    always @(posedge(clk)) begin
        a[1] = b[1] ^ c[1] ^ carry;
        temp = (b[1]&c[1]) | (c[1]&carry) | (carry&b[1]);
    end
    always @(posedge(clk)) begin
        a[0] = b[0] ^ c[0] ^ temp;
        0 = (b[0]&c[0]) | (c[0]&temp) | (temp&b[0]);
    end
endmodule
43
Vectorizing always blocks
Regroup low-level operations back together into higher-level abstractions
Vectorize the bit operations
44
Vectorizing example

module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry[0:3];
    always @(b[3] or c[3] or carry[3]) begin
        a[3] = b[3] ^ c[3] ^ carry[3];
        carry[2] = (b[3]&c[3]) | (c[3]&carry[3]) | (carry[3]&b[3]);
    end
    always @(b[2] or c[2] or carry[2]) begin
        a[2] = b[2] ^ c[2] ^ carry[2];
        carry[1] = (b[2]&c[2]) | (c[2]&carry[2]) | (carry[2]&b[2]);
    end
    always @(b[1] or c[1] or carry[1]) begin
        a[1] = b[1] ^ c[1] ^ carry[1];
        carry[0] = (b[1]&c[1]) | (c[1]&carry[1]) | (carry[1]&b[1]);
    end
    always @(b[0] or c[0] or carry[0]) begin
        a[0] = b[0] ^ c[0] ^ carry[0];
        cout = (b[0]&c[0]) | (c[0]&carry[0]) | (carry[0]&b[0]);
    end
endmodule

Scalar expansion
45
Vectorizing example – merge blocks
module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry[0:3];
    always @(b[3] or c[3] or carry[3] or
             b[2] or c[2] or carry[2] or
             b[1] or c[1] or carry[1] or
             b[0] or c[0] or carry[0]) begin
        a[3] = b[3] ^ c[3] ^ carry[3];
        carry[2] = (b[3]&c[3]) | (c[3]&carry[3]) | (carry[3]&b[3]);
        a[2] = b[2] ^ c[2] ^ carry[2];
        carry[1] = (b[2]&c[2]) | (c[2]&carry[2]) | (carry[2]&b[2]);
        a[1] = b[1] ^ c[1] ^ carry[1];
        carry[0] = (b[1]&c[1]) | (c[1]&carry[1]) | (carry[1]&b[1]);
        a[0] = b[0] ^ c[0] ^ carry[0];
        cout = (b[0]&c[0]) | (c[0]&carry[0]) | (carry[0]&b[0]);
    end
endmodule
46
Vectorizing example – rearranged following dependencies
module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry[0:3];
    always @(b[3] or c[3] or carry[3] or
             b[2] or c[2] or carry[2] or
             b[1] or c[1] or carry[1] or
             b[0] or c[0] or carry[0]) begin
        carry[2] = (b[3]&c[3]) | (c[3]&carry[3]) | (carry[3]&b[3]);
        carry[1] = (b[2]&c[2]) | (c[2]&carry[2]) | (carry[2]&b[2]);
        carry[0] = (b[1]&c[1]) | (c[1]&carry[1]) | (carry[1]&b[1]);
        cout = (b[0]&c[0]) | (c[0]&carry[0]) | (carry[0]&b[0]);
        a[3] = b[3] ^ c[3] ^ carry[3];
        a[2] = b[2] ^ c[2] ^ carry[2];
        a[1] = b[1] ^ c[1] ^ carry[1];
        a[0] = b[0] ^ c[0] ^ carry[0];
    end
endmodule
Can Vectorize!!
47
Vectorizing example – vectorize now
module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry[0:3];
    always @(b or c or carry) begin
        carry[0:2] = (b[1:3]&c[1:3]) | (c[1:3]&carry[1:3]) | (carry[1:3]&b[1:3]);
        cout = (b[0]&c[0]) | (c[0]&carry[0]) | (carry[0]&b[0]);
        a = b ^ c ^ carry;
    end
endmodule

Pattern matching can then recover the high-level form:

    always @(b or c) begin
        a = b + c;
    end
48
Two-state vs. four-state logic
4-state simulation carries extra overhead; do we want HW with unknown states?! 2-state logic can be 3-5 times faster!
But… it is hard to find regions guaranteed free of unknowns
Use interprocedural analysis; check for unknowns, but default to 2-state
The test for detecting an unknown is low cost: 2-3 instructions
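One common way simulators make that test cheap (this encoding is an illustration, not from the chapter): keep each signal as two machine words, val and unk, where a set unk bit marks X; the unknown check is then a single compare.

```c
#include <stdint.h>

/* Illustrative 4-state encoding: two bit-vectors per signal.
   A bit whose unk bit is set is unknown (X); otherwise val
   holds its ordinary 2-state value. */
typedef struct { uint32_t val, unk; } logic4;

/* 4-state AND: a result bit is known 0 if either operand bit
   is known 0; otherwise it is X if either operand bit is X. */
logic4 and4(logic4 a, logic4 b) {
    uint32_t a0 = ~a.val & ~a.unk;   /* bits known to be 0 in a */
    uint32_t b0 = ~b.val & ~b.unk;   /* bits known to be 0 in b */
    logic4 r;
    r.unk = (a.unk | b.unk) & ~(a0 | b0);
    r.val = (a.val & b.val) & ~r.unk;
    return r;
}

/* The cheap "anything unknown?" test the slide prices at a
   couple of instructions. */
int has_unknown(logic4 a) { return a.unk != 0; }
```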
49
Rewriting block conditions
Semantics of a synchronous block: changes are updated with every clock update
In Verilog: recompute the results every clock tick

always @(posedge(clk)) begin
    sum = op1 ^ op2 ^ c_in;
    c_out = (op1 & op2) | (op2 & c_in) | (c_in & op1);
end
50
Rewriting block conditions (cont.)
Actually, HW only computes when an input changes; clocking is simply a matter of gating results through a register
Rewrite the code to achieve the same effect in the simulator and avoid excessive computation:
always @(op1 or op2 or c_in) begin
t_sum = op1 ^ op2 ^ c_in;
t_c_out = (op1 & op2) | …
end
always @(posedge(clk)) begin
sum = t_sum;
c_out = t_c_out;
end
51
Using Basic Compiler Optimizations
Mostly useful at high abstraction levels
Optimize inside an always block; control flow between blocks is too complex
Useful methods: loop vectorization, constant propagation, dead-code elimination, common-subexpression elimination
52
Outline
Optimizing C: overview; the challenges
HW design: overview; HW Description Languages (HDL); optimizing simulation; synthesis optimization methods
Summary
53
Synthesis Optimization
Goal: automatically insert the details
Analogous to standard compilation, but harder: not targeted at a fixed machine, and with many goals:
Minimize cycle time
Minimize area
Minimize power consumption
54
Basic Framework
Fundamental problem: reduce this computation to a series of gates

for (i = 0; i < 100; i++)
    t = t + a[i] * b[i];

The simple approach (convert add/multiply into gates, optimize later) is not so easy in the real world
Converting high-level code directly to gates is inefficient; it is better to select components first
There are various ways to perform multiply, add, etc.; you need a good library to select from, and you need to make the optimal selection
55
Basic Framework (cont.)
One option:

    t = ADD(t, MUL(LOAD(A[i]), LOAD(B[i])));

will take 3 cycles. Another option:

    t = MAC(t, LOAD(A[i]), LOAD(B[i]));

takes only 2 cycles, since the multiply-accumulate unit folds the MUL and ADD into one step… There are a lot more options for the unrolled version:

for (i = 0; i < 100; i += 4)
    t = t + a[i]*b[i] + a[i+1]*b[i+1] + a[i+2]*b[i+2] + a[i+3]*b[i+3];
56
Basic Framework (cont.)
Analogous to instruction selection for a CISC architecture, which is extensively researched: fast & effective tree-matching algorithms are applicable to synthesis
For the undefined target, the algorithms adapt to the available HW
To achieve the multiple goals, bound the types & number of functional units
Then we just need to minimize time…
57
Loop Transformations
Execution order affects functional-unit utilization & the efficiency of the synthesized HW

for (i = 0; i < 100; i++) {
    t[i] = 0;
    for (j = 0; j < 3; j++)
        t[i] = t[i] + (a[i-j] >> 2);
}

for (i = 0; i < 100; i++) {
    o[i] = 0;
    for (j = 0; j < 100; j++)
        o[i] = o[i] + m[i][j] * t[j];
}
58
Loop Transformations (cont.)
Distribute the loops & rearrange topologically:
for(i=0; i<100; i++)
t[i] = 0;
for(i=0; i<100; i++)
o[i] = 0;
for(i=0; i<100; i++)
for(j=0; j<3; j++)
t[i] = t[i] + (a[i-j] >> 2);
for(i=0; i<100; i++)
for(j=0; j<100; j++)
o[i] = o[i] + m[i][j] * t[j];
No improvement so far…
59
Loop Transformations (cont.)
Let’s try some fusion… can you spot the interchange?

for (i = 0; i < 100; i++)
    o[i] = 0;
for (i = 0; i < 100; i++) {
    t[i] = 0;
    for (j = 0; j < 3; j++)
        t[i] = t[i] + (a[i-j] >> 2);
    for (j = 0; j < 100; j++)
        o[j] = o[j] + m[j][i] * t[i];
}
60
Loop Transformations (cont.)
Scalar replacement on t; exploit the input dependence on a

for (i = 0; i < 100; i++)
    o[i] = 0;
a0 = a[0]; a1 = a[-1]; a2 = a[-2]; a3 = a[-3];
for (i = 0; i < 100; i++) {
    t = 0;
    t = t + (a0 >> 2) + (a1 >> 2) + (a2 >> 2) + (a3 >> 2);
    a3 = a2; a2 = a1; a1 = a0; a0 = a[i+1];
    for (j = 0; j < 100; j++)
        o[j] = o[j] + m[j][i] * t;
}
61
Loop transformation - summary
Loop fusion: fuse 2 loops which use different functional units
Loop distribution: separate loops using the same functional units
Vectorization: when a functional unit can be pipelined
Loop interchange: mostly to help other transformations
62
Control and Data Flow
Von Neumann architecture:
Data flow = data movement among memory and registers
Control flow = changes in the PC due to sequential execution and branches
Synthesized hardware:
Data flow = data movement among functional units
Control flow = which functional unit should be active, on what data, at which time step; requires a state machine
63
Control and Data Flow – special HW constructs
Wires: immediate data transfer
Latches: values held throughout one clock cycle
Registers: like static variables in C; values held for one or more clock cycles
Memories: like arrays in C; special registers with large size and long lifetime
64
Memory Reduction
Memory access is slower than functional-unit access, so strive to minimize the frequency & volume of accesses
Applicable techniques: loop interchange, loop fusion, scalar replacement, strip mining, unroll and jam, prefetching
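As one worked illustration of these techniques (matvec_uj and the unroll factor 2 are invented for the example), unroll-and-jam on a matrix-vector product lets each value fetched from t feed two accumulations, roughly halving the memory traffic on t:

```c
/* Unroll-and-jam sketch: the outer i loop is unrolled by 2 and
   the copies are jammed into one inner loop, so each t[j] load
   is reused for two rows of m. Uses C99 variable-length array
   parameters for clarity. */
void matvec_uj(int n, double o[n], double m[n][n], const double t[n]) {
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        double s0 = 0.0, s1 = 0.0;
        for (int j = 0; j < n; j++) {
            double tj = t[j];          /* loaded once, used twice */
            s0 += m[i][j] * tj;
            s1 += m[i + 1][j] * tj;
        }
        o[i] = s0;
        o[i + 1] = s1;
    }
    if (i < n) {                       /* leftover row when n is odd */
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += m[i][j] * t[j];
        o[i] = s;
    }
}
```

In a synthesis setting the same rewrite reduces the number of reads the memory port must serve per result, which is exactly the frequency & volume the slide asks to minimize.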
65
Outline
Optimizing C: overview; the challenges
HW design: overview; HW Description Languages (HDL); optimizing simulation; synthesis optimization methods
Summary
66
Summary
The application of dependence analysis is not limited to Fortran!
The analysis framework can be adapted to C (the Ardent Titan compiler)
Several techniques are useful for HW simulation & synthesis
This is still an early stage of research…
67
Questions???
Thanks for listening…