1
Dependence in C and Hardware Design
Allen and Kennedy, Chapter 12
Presented by Tali Shragai
2
Today’s lecture…
3
Introduction
So far, we’ve discussed dependence analysis in Fortran
Dependence analysis applies to any language and translation context where arrays and loops are useful
Application to C and C++: modern features (pointers, structures…)
Application to hardware design: a language-based approach
4
Outline
Optimizing C: overview; the challenges
HW design: overview; HW Description Languages (HDL); optimizing simulation; synthesis optimization methods
Summary
5
Problems of C
C/C++ focuses on simplifying software development, at the expense of optimizability
Optimization may not always be desired. Example: polling a keyboard with while (!(t = *p)); an optimizer would hoist the load of *p out of the loop, so the poll would never see the change (which is why such a p must be declared volatile)
Use of C/C++ has expanded into areas where optimization is required…
6
Problems of C – Example

void vadd(double *a, double *b, double *c, int n) {
    while (n--)
        *a++ = *b++ + *c++;
}

This would be easily vectorized & optimized in Fortran, but not in C:
Pointers: the memory locations accessed through pointers are not clear (unlike for arrays…)
Aliasing: C does not guarantee that arrays passed into a subroutine do not overlap
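C99 later addressed exactly this aliasing gap; the qualifier is not part of the chapter's discussion, so this is a hedged sketch. With restrict, the programmer asserts the no-overlap guarantee that Fortran provides, and the loop becomes vectorizable:

```c
#include <stddef.h>

/* Sketch: with restrict, the caller promises that a, b, and c
   point to non-overlapping storage, restoring the guarantee
   Fortran gives and letting the compiler vectorize freely. */
void vadd_restrict(double *restrict a, const double *restrict b,
                   const double *restrict c, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

Violating the promise (overlapping arguments) is undefined behavior, so the burden of the safety assertion moves from the compiler to the programmer.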
7
Problems of C – Example (cont.)

void vadd(double *a, double *b, double *c, int n) {
    while (n--)
        *a++ = *b++ + *c++;
}

Side-effect operators: pre/post-increment operators conceal the index calculations used for addressing the arrays, so the optimizer must spend extra effort on transformations (induction-variable substitution…)
Loops: Fortran loops provide values and restrictions that simplify optimization
8
Outline
Optimizing C: overview; the challenges
HW design: overview; HW Description Languages (HDL); optimizing simulation; synthesis optimization methods
Summary
9
Pointers
C optimizers’ most difficult challenge is unrestricted pointers:
Indirect pointer accesses are hard to resolve: a pointer variable can point to different memory locations during its use
Aliased memory locations: a memory location can be accessed through more than one pointer variable at any given time
The result is much more difficult and expensive dependence testing
10
Pointers dependence testing
The compiler can replace a pointer indirection like *p by a subscripted array reference n[e] for dependence testing.
But another pointer q might access the same location, so it must be replaced with the pseudo array n too…
In the worst case, the compiler must assume that every pair of references is dependent!
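A minimal illustration of the rewrite described above (the function names and the doubling operation are invented for the example): the pointer-indirection form is re-expressed as subscripted references that a dependence tester can analyze like a Fortran array:

```c
/* Pointer-indirection form: every *(p + i) is opaque to a
   dependence tester that only understands subscripts. */
void scale_ptr(double *p, int len) {
    for (int i = 0; i < len; i++)
        *(p + i) = *(p + i) * 2.0;
}

/* The same loop after the rewrite: references go through the
   pseudo-array form p[i] that dependence testing understands. */
void scale_subscripted(double p[], int len) {
    for (int i = 0; i < len; i++)
        p[i] = p[i] * 2.0;
}
```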
11
Dependence testing strategies
Safety assertions: use compiler options / pragmas to indicate “disciplined” code
Safe parameters: all pointer parameters point to independent storage
Safe pointers: all pointer variables (parameter, local, global) point to independent storage
Whole-program analysis: without separate compilation, analyzing dependences over the entire program is solvable, but still unsatisfactory
12
Naming and Structures
In Fortran, unlike C, a block of storage can be uniquely identified by a single name, which simplifies dependence analysis
Dependence analysis requires a single name for all references to the same location
C’s constructs complicate this: p, *p, **p, *(p+4), *(&p+4), p[1]; expressions at different levels of indirection (&p, p, *p, **p) can reach the same storage
13
Naming and Structures (cont.)
Troublesome structures: the naming problem (what is the name of ‘a.b’?)
Unions allow different-sized objects to overlap the same storage
Need to reduce references to a common unit of the smallest storage possible
14
Loops: lack of constraints in C
Jumping into a loop body is permitted
The induction variable (if there is one) can be modified in the body of the loop
The loop increment value may also be changed
The conditions controlling the initiation, increment, and termination of the loop have no constraints on their form
It might be hard to identify a loop variable with start and end values
15
Loops (cont.) – rewrite while as a DO loop
The induction variable: only one!
Must be initialized with the same value on all paths into the loop
Must have one and only one increment in the loop
The increment must be executed on every iteration
The termination condition must match the increment
No jumps into the loop body from outside
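A sketch of a while loop that satisfies all of the conditions above, and the counted for loop (C's closest analogue of a Fortran DO) it can be rewritten into; the function names are illustrative:

```c
/* A "disciplined" while loop: one induction variable i, one
   initialization on the single entry path, exactly one increment
   executed every iteration, and a matching termination test. */
int sum_while(const int *a, int n) {
    int s = 0;
    int i = 0;
    while (i < n) {
        s += a[i];
        i++;
    }
    return s;
}

/* The equivalent canonical counted loop after the rewrite. */
int sum_for(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```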
16
Scoping and Statics
Scoping rules might create extra aliasing; this is handled by creating unique symbols for variables with the same name but different scopes
Static variables: a file-static variable can only be modified by procedures that see its declaration, so access to the variable can be determined from scope information in the symbol table
But storing an address parameter in a static variable makes it accessible from any other procedure
17
Problematic C Dialects
Some C code might look “tidy”, as in Fortran
“Messy” style conventions:
Use of pointers instead of arrays
Use of address and dereference operators
Use of side-effect operators
These once mapped directly to machine instructions, but they complicate the work of optimizers
18
Problematic C Dialects (cont.)
Titan C compiler: remove the side-effect operators!
But this requires enhancements to some transformations:
Constant propagation: treat address operators as constants and propagate them where possible; replace a generic pointer in a dereference with the actual address
Expression simplification and recognition: stronger recognition is needed of which variable within an expression is actually the ‘base variable’
19
Problematic C Dialects (cont.)
Conversion of pointers into array references simplifies dependence testing
Induction-variable substitution needs to be enhanced to deal with indirect access to array references through pointers
Recognize and remove usage of side-effect operators
20
C Miscellaneous
Volatile variables: functions containing them are best left without optimization; volatile code usually isn’t a target for optimization anyway (example: vector unit initialization)
setjmp and longjmp: commonly used for error handling; calling setjmp saves the current context in a buffer, and longjmp can then be called to bypass a section of the calling chain
Storing and restoring the current state of the computation is complex when optimization has allocated variables to registers, so no optimization is applied!
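A minimal sketch of the setjmp/longjmp idiom described above (the function names and the error code 42 are invented for the example):

```c
#include <setjmp.h>

jmp_buf env;  /* buffer holding the context saved by setjmp */

/* Deep in the call chain: on error, longjmp unwinds straight
   back to the matching setjmp, bypassing normal returns. */
void inner(int fail) {
    if (fail)
        longjmp(env, 42);
}

/* setjmp returns 0 on the initial call, and the value passed
   to longjmp when control comes back the second way. */
int run(int fail) {
    int rc = setjmp(env);
    if (rc != 0)
        return rc;      /* resumed here after longjmp */
    inner(fail);
    return 0;           /* normal path, no error */
}
```

Any non-volatile local modified between the setjmp and the longjmp is indeterminate afterwards, which is exactly why the slide says the optimizer keeps such variables out of registers or gives up entirely.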
21
C Miscellaneous (cont.)
varargs and stdargs: a variable number of arguments, as in printf(…)
Implemented by compiler directives: save all register parameters to the stack, then access them using pointer manipulation over the stack
The pointer variable is an alias for many parameters in the program, so no optimization is applied
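A sketch of the mechanism in its portable form, using the standard stdarg.h macros (sum_ints is an invented example function); the va_ macros are exactly the pointer manipulation over the argument area that defeats the optimizer:

```c
#include <stdarg.h>

/* Variadic sum: va_start/va_arg walk the saved arguments with
   what is, underneath, pointer arithmetic over the stack area,
   so one pointer aliases every extra argument. */
int sum_ints(int count, ...) {
    va_list ap;
    int s = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        s += va_arg(ap, int);
    va_end(ap);
    return s;
}
```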
22
Outline
Optimizing C: overview; the challenges
HW design: overview; HW Description Languages (HDL); optimizing simulation; synthesis optimization methods
Summary
23
Hardware Design: Overview
In the past, HW design was done at the gate/transistor level
Today, HW design is language-based, similarly to SW development
The level of abstraction may vary; the current trend is high-level behavioral specification
Key factor: the compiler’s efficiency
24
Abstraction levels for HW Design
Circuit / physical level: diagrams of electronic components
Logic level: Boolean equations
Register transfer level (RTL): control state transitions and data transfers; synthesis converts RTL to gates and flip-flops (most common!)
System level: behavior expressed by variables, no timing; behavioral synthesis selects arithmetic units and imposes timing
25
Hardware Design
Behavioral synthesis is really a compilation problem
Two fundamental tasks: verification (simulation) and implementation (synthesis)
Optimization is essential for both: HW simulation is inherently very slow, and efficient synthesis raises the device’s value
26
Outline
Optimizing C: overview; the challenges
HW design: overview; HW Description Languages (HDL); optimizing simulation; synthesis optimization methods
Summary
27
Hardware Description Languages
Two main HDLs are used today:
Verilog: supports all abstraction levels, mostly used for gates and RTL; essentially an extended C
VHDL: essentially an extended Ada
The primitives and extensions used for HW description provide similar functionality in both
28
Verilog extensions
Multi-valued logic: 0, 1, x, z (x = unknown state, z = high impedance); e.g. division by zero produces the x state
Higher data types (integer…) are vectors of multi-valued bits
Operations with x result in the x state, so the simulation can’t execute an addition directly…
Reactivity: automatic propagation of changes

always @(b or c)
    a = b + c;
29
Verilog extensions (cont.)
Objects: a specific area of silicon, with its own state and registers; the semantics differ from C’s function calls; data encapsulation using module
Connectivity: continuous passing of information through input and output ports

module add(a, b, c);
    output a;
    input b, c;
    integer a, b, c;
    always @(b or c)
        a = b + c;
endmodule
30
Verilog extensions (cont.)
Instantiation: Verilog only allows static instantiation; each instance is a different area of the silicon

    integer x, y, z;
    add adder1(x, y, z);

Vector operations: viewing other data structures as vectors of scalars
Bit selection: A[1]
Vector concatenation: {A[0], A[1:15]}
31
Optimization in Verilog
Advantages: no aliasing; restricted subscripts; no separate compilation
Disadvantages: non-procedural continuation semantics; lack of loops (iteration only implicit via “always” blocks); HW designs are large
32
Outline
Optimizing C: overview; the challenges
HW design: overview; HW Description Languages (HDL); optimizing simulation; synthesis optimization methods
Summary
33
Optimizing Simulation
Philosophy: the higher the abstraction level, the more efficient the simulation! Details consume simulation time and obscure the behavioral functionality
Next: optimization techniques, illustrated on this adder:

module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    always @(b or c)
        a = b + c;
endmodule
34
Inlining Modules
Data encapsulation hides details from the programmer, and hides optimization info from the compiler
HDLs have two properties that make module inlining simpler: the whole design is reachable at one time, and recursion is not permitted, so inlining can be done in linear time using a topological order
There is no use inlining above the level of functional units
35
Inlining example – before

module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry;
    add2 add_l(a[0:1], 0, b[0:1], c[0:1], carry);
    add2 add_r(a[2:3], carry, b[2:3], c[2:3], 0);
endmodule

module add2(sum, c_out, op1, op2, c_in);
    input op1[0:1], op2[0:1], c_in;
    output sum[0:1], c_out;
    wire carry;
    add1 add_l(sum[0], carry, op1[0], op2[0], c_in);
    add1 add_r(sum[1], c_out, op1[1], op2[1], carry);
endmodule

module add1(sum, c_out, op1, op2, c_in);
    input op1, op2, c_in;
    output sum, c_out;
    always @(op1 or op2 or c_in) begin
        sum = op1 ^ op2 ^ c_in;
        c_out = (op1 & op2) | (op2 & c_in) | (op1 & c_in);
    end
endmodule
36
Inlining example – after

module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry, temp, temp1;
    always @(b[1] or c[1] or carry) begin
        a[1] = b[1] ^ c[1] ^ carry;
        temp = (b[1]&c[1]) | (c[1]&carry) | (carry&b[1]);
    end
    always @(b[0] or c[0] or temp) begin
        a[0] = b[0] ^ c[0] ^ temp;
        0 = (b[0]&c[0]) | (c[0]&temp) | (temp&b[0]);
    end
    always @(b[3] or c[3] or 0) begin
        a[3] = b[3] ^ c[3] ^ 0;
        temp1 = (b[3]&c[3]) | (c[3]&0) | (0&b[3]);
    end
    always @(b[2] or c[2] or temp1) begin
        a[2] = b[2] ^ c[2] ^ temp1;
        carry = (b[2]&c[2]) | (c[2]&temp1) | (temp1&b[2]);
    end
endmodule
37
Execution Ordering
The order of statement execution can dramatically affect efficiency!
HW is fast thanks to triggering on each bit’s change
SW simulation cannot afford to track individual bits (too much memory overhead); it may consider all bits as a group
Execute blocks in topological order based on the dependence graph of individual array elements: no memory overhead
38
Ordering example

module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry, temp, temp1;
    always @(b[3] or c[3] or 0) begin
        a[3] = b[3] ^ c[3] ^ 0;
        temp1 = (b[3]&c[3]) | (c[3]&0) | (0&b[3]);
    end
    always @(b[2] or c[2] or temp1) begin
        a[2] = b[2] ^ c[2] ^ temp1;
        carry = (b[2]&c[2]) | (c[2]&temp1) | (temp1&b[2]);
    end
    always @(b[1] or c[1] or carry) begin
        a[1] = b[1] ^ c[1] ^ carry;
        temp = (b[1]&c[1]) | (c[1]&carry) | (carry&b[1]);
    end
    always @(b[0] or c[0] or temp) begin
        a[0] = b[0] ^ c[0] ^ temp;
        0 = (b[0]&c[0]) | (c[0]&temp) | (temp&b[0]);
    end
endmodule
39
Dynamic vs. Static Scheduling
Dynamic scheduling: dynamically track changes in values and propagate them; naturally mimics hardware; incurs the overhead of change checks, especially if performed per bit
Static scheduling: based on a topological model; blindly sweeps through all values for all objects regardless of changes; no need for change checks
40
Dynamic vs. Static Scheduling (cont.)
Prefer static scheduling for a highly active circuit: when changes are frequent, there is no need to check before updating
Can we really tell in advance?
Common strategy: use static analysis to locate where dynamic scheduling improves simulation performance!
41
Fusing always blocks
The high cost of change checks motivates fusing always blocks
Blocks with the same trigger conditions can be fused, saving the overhead of invoking blocks
Most useful for synchronous designs
But fusion may change the design’s output: still semantically correct, yet a bad surprise for the designer… simulators try to avoid output changes
42
Fusion example

module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry, temp, temp1;
    always @(posedge(clk)) begin
        a[3] = b[3] ^ c[3] ^ 0;
        temp1 = (b[3]&c[3]) | (c[3]&0) | (0&b[3]);
    end
    always @(posedge(clk)) begin
        a[2] = b[2] ^ c[2] ^ temp1;
        carry = (b[2]&c[2]) | (c[2]&temp1) | (temp1&b[2]);
    end
    always @(posedge(clk)) begin
        a[1] = b[1] ^ c[1] ^ carry;
        temp = (b[1]&c[1]) | (c[1]&carry) | (carry&b[1]);
    end
    always @(posedge(clk)) begin
        a[0] = b[0] ^ c[0] ^ temp;
        0 = (b[0]&c[0]) | (c[0]&temp) | (temp&b[0]);
    end
endmodule
43
Vectorizing always blocks
Regroup low-level operations back together into higher-level abstractions
Vectorize the bit operations
44
Vectorizing example

module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry[0:3];
    always @(b[3] or c[3] or carry[3]) begin
        a[3] = b[3] ^ c[3] ^ carry[3];
        carry[2] = (b[3]&c[3]) | (c[3]&carry[3]) | (carry[3]&b[3]);
    end
    always @(b[2] or c[2] or carry[2]) begin
        a[2] = b[2] ^ c[2] ^ carry[2];
        carry[1] = (b[2]&c[2]) | (c[2]&carry[2]) | (carry[2]&b[2]);
    end
    always @(b[1] or c[1] or carry[1]) begin
        a[1] = b[1] ^ c[1] ^ carry[1];
        carry[0] = (b[1]&c[1]) | (c[1]&carry[1]) | (carry[1]&b[1]);
    end
    always @(b[0] or c[0] or carry[0]) begin
        a[0] = b[0] ^ c[0] ^ carry[0];
        cout = (b[0]&c[0]) | (c[0]&carry[0]) | (carry[0]&b[0]);
    end
endmodule

Scalar expansion
45
Vectorizing example – merge blocks
module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry[0:3];
    always @(b[3] or c[3] or carry[3] or
             b[2] or c[2] or carry[2] or
             b[1] or c[1] or carry[1] or
             b[0] or c[0] or carry[0]) begin
        a[3] = b[3] ^ c[3] ^ carry[3];
        carry[2] = (b[3]&c[3]) | (c[3]&carry[3]) | (carry[3]&b[3]);
        a[2] = b[2] ^ c[2] ^ carry[2];
        carry[1] = (b[2]&c[2]) | (c[2]&carry[2]) | (carry[2]&b[2]);
        a[1] = b[1] ^ c[1] ^ carry[1];
        carry[0] = (b[1]&c[1]) | (c[1]&carry[1]) | (carry[1]&b[1]);
        a[0] = b[0] ^ c[0] ^ carry[0];
        cout = (b[0]&c[0]) | (c[0]&carry[0]) | (carry[0]&b[0]);
    end
endmodule
46
Vectorizing example – rearranged following dependencies
module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry[0:3];
    always @(b[3] or c[3] or carry[3] or
             b[2] or c[2] or carry[2] or
             b[1] or c[1] or carry[1] or
             b[0] or c[0] or carry[0]) begin
        carry[2] = (b[3]&c[3]) | (c[3]&carry[3]) | (carry[3]&b[3]);
        carry[1] = (b[2]&c[2]) | (c[2]&carry[2]) | (carry[2]&b[2]);
        carry[0] = (b[1]&c[1]) | (c[1]&carry[1]) | (carry[1]&b[1]);
        cout = (b[0]&c[0]) | (c[0]&carry[0]) | (carry[0]&b[0]);
        a[3] = b[3] ^ c[3] ^ carry[3];
        a[2] = b[2] ^ c[2] ^ carry[2];
        a[1] = b[1] ^ c[1] ^ carry[1];
        a[0] = b[0] ^ c[0] ^ carry[0];
    end
endmodule
Can Vectorize!!
47
Vectorizing example – vectorize now
module adder(a, b, c);
    input b[0:3], c[0:3];
    output a[0:3];
    wire carry[0:3];
    always @(b or c or carry) begin
        carry[0:2] = (b[1:3]&c[1:3]) | (c[1:3]&carry[1:3]) | (carry[1:3]&b[1:3]);
        cout = (b[0]&c[0]) | (c[0]&carry[0]) | (carry[0]&b[0]);
        a = b ^ c ^ carry;
    end
endmodule

Pattern matching can then recover the high-level form:

    always @(b or c) begin
        a = b + c;
    end
48
Two-state vs. four-state logic
4-state simulation carries extra overhead; do we want HW with unknown states?! 2-state logic can be 3-5 times faster!
But… it is hard to find regions guaranteed free of unknowns
Use interprocedural analysis; check for unknowns, but default to 2-state
The test for detecting an unknown is low cost: 2-3 instructions
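One common way simulators make that test cheap (this encoding is an illustration, not from the chapter): keep each signal as two machine words, val and unk, where a set unk bit marks X; the unknown check is then a single compare.

```c
#include <stdint.h>

/* Illustrative 4-state encoding: two bit-vectors per signal.
   A bit whose unk bit is set is unknown (X); otherwise val
   holds its ordinary 2-state value. */
typedef struct { uint32_t val, unk; } logic4;

/* 4-state AND: a result bit is known 0 if either operand bit
   is known 0; otherwise it is X if either operand bit is X. */
logic4 and4(logic4 a, logic4 b) {
    uint32_t a0 = ~a.val & ~a.unk;   /* bits known to be 0 in a */
    uint32_t b0 = ~b.val & ~b.unk;   /* bits known to be 0 in b */
    logic4 r;
    r.unk = (a.unk | b.unk) & ~(a0 | b0);
    r.val = (a.val & b.val) & ~r.unk;
    return r;
}

/* The cheap "anything unknown?" test the slide prices at a
   couple of instructions. */
int has_unknown(logic4 a) { return a.unk != 0; }
```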
49
Rewriting block conditions
Semantics of a synchronous block: changes are updated with every clock update
In Verilog: recompute the results every clock tick

always @(posedge(clk)) begin
    sum = op1 ^ op2 ^ c_in;
    c_out = (op1 & op2) | (op2 & c_in) | (c_in & op1);
end
50
Rewriting block conditions (cont.)
Actually, HW only computes when an input changes; clocking is simply a matter of gating results through a register
Rewrite the code to achieve the same effect in the simulator and avoid excessive computation:
always @(op1 or op2 or c_in) begin
t_sum = op1 ^ op2 ^ c_in;
t_c_out = (op1 & op2) | …
end
always @(posedge(clk)) begin
sum = t_sum;
c_out = t_c_out;
end
51
Using Basic Compiler Optimizations
Mostly useful at high abstraction levels
Optimize inside an always block; control flow between blocks is too complex
Useful methods: loop vectorization, constant propagation, dead-code elimination, common-subexpression elimination
52
Outline
Optimizing C: overview; the challenges
HW design: overview; HW Description Languages (HDL); optimizing simulation; synthesis optimization methods
Summary
53
Synthesis Optimization
Goal: automatically insert the details
Analogous to standard compilation, but harder: not targeted at a fixed machine, and with many goals:
Minimize cycle time
Minimize area
Minimize power consumption
54
Basic Framework
Fundamental problem: reduce this computation to a series of gates

for (i = 0; i < 100; i++)
    t = t + a[i] * b[i];

The simple approach (convert add/multiply into gates, optimize later) is not so easy in the real world
Converting high-level code directly to gates is inefficient; it is better to select components first
There are various ways to perform multiply, add, etc.; you need a good library to select from, and you need to make the optimal selection
55
Basic Framework (cont.)
One option:

    t = ADD(t, MUL(LOAD(A[i]), LOAD(B[i])));

will take 3 cycles. Another option:

    t = MAC(t, LOAD(A[i]), LOAD(B[i]));

takes only 2 cycles, since the multiply-accumulate unit folds the MUL and ADD into one step… There are a lot more options for the unrolled version:

for (i = 0; i < 100; i += 4)
    t = t + a[i]*b[i] + a[i+1]*b[i+1] + a[i+2]*b[i+2] + a[i+3]*b[i+3];
56
Basic Framework (cont.)
Analogous to instruction selection for a CISC architecture, which is extensively researched: fast & effective tree-matching algorithms are applicable to synthesis
For the undefined target, the algorithms adapt to the available HW
To achieve the multiple goals, bound the types & number of functional units
Then we just need to minimize time…
57
Loop Transformations
Execution order affects functional-unit utilization & the efficiency of the synthesized HW

for (i = 0; i < 100; i++) {
    t[i] = 0;
    for (j = 0; j < 3; j++)
        t[i] = t[i] + (a[i-j] >> 2);
}

for (i = 0; i < 100; i++) {
    o[i] = 0;
    for (j = 0; j < 100; j++)
        o[i] = o[i] + m[i][j] * t[j];
}
58
Loop Transformations (cont.)
Distribute the loops & rearrange topologically:
for(i=0; i<100; i++)
t[i] = 0;
for(i=0; i<100; i++)
o[i] = 0;
for(i=0; i<100; i++)
for(j=0; j<3; j++)
t[i] = t[i] + (a[i-j] >> 2);
for(i=0; i<100; i++)
for(j=0; j<100; j++)
o[i] = o[i] + m[i][j] * t[j];
No improvement so far…
59
Loop Transformations (cont.)
Let’s try some fusion… can you spot the interchange?

for (i = 0; i < 100; i++)
    o[i] = 0;
for (i = 0; i < 100; i++) {
    t[i] = 0;
    for (j = 0; j < 3; j++)
        t[i] = t[i] + (a[i-j] >> 2);
    for (j = 0; j < 100; j++)
        o[j] = o[j] + m[j][i] * t[i];
}
60
Loop Transformations (cont.)
Scalar replacement on t; exploit the input dependence on a

for (i = 0; i < 100; i++)
    o[i] = 0;
a0 = a[0]; a1 = a[-1]; a2 = a[-2]; a3 = a[-3];
for (i = 0; i < 100; i++) {
    t = 0;
    t = t + (a0 >> 2) + (a1 >> 2) + (a2 >> 2) + (a3 >> 2);
    a3 = a2; a2 = a1; a1 = a0; a0 = a[i+1];
    for (j = 0; j < 100; j++)
        o[j] = o[j] + m[j][i] * t;
}
61
Loop transformation - summary
Loop fusion: fuse 2 loops which use different functional units
Loop distribution: separate loops using the same functional units
Vectorization: when a functional unit can be pipelined
Loop interchange: mostly to help other transformations
62
Control and Data Flow
Von Neumann architecture:
Data flow = data movement among memory and registers
Control flow = changes in the PC due to sequential execution and branches
Synthesized hardware:
Data flow = data movement among functional units
Control flow = which functional unit should be active, on what data, at which time step; requires a state machine
63
Control and Data Flow – special HW constructs
Wires: immediate data transfer
Latches: values held throughout one clock cycle
Registers: like static variables in C; values held for one or more clock cycles
Memories: like arrays in C; special registers with large size and long lifetime
64
Memory Reduction
Memory access is slower than functional-unit access, so strive to minimize the frequency & volume of accesses
Applicable techniques: loop interchange, loop fusion, scalar replacement, strip mining, unroll and jam, prefetching
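As one worked illustration of these techniques (matvec_uj and the unroll factor 2 are invented for the example), unroll-and-jam on a matrix-vector product lets each value fetched from t feed two accumulations, roughly halving the memory traffic on t:

```c
/* Unroll-and-jam sketch: the outer i loop is unrolled by 2 and
   the copies are jammed into one inner loop, so each t[j] load
   is reused for two rows of m. Uses C99 variable-length array
   parameters for clarity. */
void matvec_uj(int n, double o[n], double m[n][n], const double t[n]) {
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        double s0 = 0.0, s1 = 0.0;
        for (int j = 0; j < n; j++) {
            double tj = t[j];          /* loaded once, used twice */
            s0 += m[i][j] * tj;
            s1 += m[i + 1][j] * tj;
        }
        o[i] = s0;
        o[i + 1] = s1;
    }
    if (i < n) {                       /* leftover row when n is odd */
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += m[i][j] * t[j];
        o[i] = s;
    }
}
```

In a synthesis setting the same rewrite reduces the number of reads the memory port must serve per result, which is exactly the frequency & volume the slide asks to minimize.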
65
Outline
Optimizing C: overview; the challenges
HW design: overview; HW Description Languages (HDL); optimizing simulation; synthesis optimization methods
Summary
66
Summary
The application of dependence analysis is not limited to Fortran!
The analysis framework can be adapted to C (the Ardent Titan compiler)
Several techniques are useful for HW simulation & synthesis
This is still an early stage of research…
67
Questions???
Thanks for listening…