philips research iccd 1999 - 1 1 trimedia cpu64 application domain and benchmark suite a.k. riemens...

100
Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Upload: stella-jackson

Post on 31-Dec-2015

222 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 1

1

TriMedia CPU64Application Domain and

Benchmark Suite

A.K. RiemensPhilips Research

Eindhoven, The Netherlands

Page 2: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 2

2

Outline

• Introduction• Approach• Benchmark suite• Example• Results

Page 3: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 3

3

Design Problem

VLIWcpu

I$

video-in

video-out

audio-in

SDRAM

audio-out

PCI bridge

Serial I/O

timers I2C I/O

D$

Page 4: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 4

4

Application Domain

• High volume consumer electronics productsfuture TV, home theatre, set-top box, etc.

• Embedded core• Media processing:

audio, video, graphics, communication

Page 5: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 5

5

Media Processing CPU

Design considerations:• Cost (embedded in high volume products)• Performance for the application domain• Ease of use (programming model)

Benchmark suite needed

Page 6: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 6

6

Application: Natural Motion

n

n-1

n-1/2

picture number

-D/2

D/2

Page 7: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 7

7

Outline

• Introduction• Approach• Benchmark suite• Example• Results

Page 8: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 8

8

Project Approach: Y-chart

MachineDescription

BenchmarkApplicationsCompiler

Simulator

PerformanceNumbers

Page 9: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 9

9

Outline

• Introduction• Approach• Benchmark suite• Example• Results

Page 10: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 10

10

Terminology

Application domain

Application class

Application

Benchmark suite: set of all applications

Page 11: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 11

11

Benchmark Suite Characteristics

• Each application: typical for a class of applications within application domain

• The set covers a significant area of the application domain

• Each benchmark is sufficiently well tuned to the architecture to measure its performance

Page 12: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 12

12

The Benchmark SuiteData comm. Viterbi decoding

Audio coding AC3 decode

Video coding MPEG2 encodeDVC decode

Video proc. Layered natural motionDynamic noisereductionPeaking

Graphics 3D renderer library

Page 13: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 13

13

• Signal ratesaudio, video (at block and pixel rate)

• Basic data typesbyte (8), half-words (16), words (32), float (32)

• Data access patternssample stream, bitstream, random access

• Data independent and dependent load• Control processing and signal processing

Processing Characteristics

Page 14: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 14

14

Application DevelopmentAlgorithm

Reference C

TM1000 optimized CPU64 optimized

simulationexplorationICvideo video

knowledge

Page 15: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 15

15

Initial Design Considerations

• Goal: 6-8 times TM1000 performance• Standard ANSI-C, reuse of code• Utilize instruction and data parallelism• Limited complexity (embedded core)• Compatibility through recompilation• VLIW architecture

Page 16: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 16

16

Initial Design Choices

• Double vector length: 32 64• Enriched media instruction set

Page 17: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 17

17

Instruction Set Considerations

• A machine operation must be sufficiently generic within the application domain

• Sufficiently powerful operations• Limited number of operations• Consistency and orthogonality

Page 18: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 18

18

Outline

• Introduction• Approach• Benchmark suite• Example• Results

Page 19: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 19

19

Vector Instruction Example

|.| |.| |.| |.|

x3 x2 x1 x0

z

y3 y2 y1 y0

- - - -

|.| |.| |.| |.|

x7 x6 x5 x4

y7 y6 y5 y4

- - - -

0

Sum of absolute differences (“ub_me”)

Page 20: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 20

20

Application Code Example

int calc_sad(vec64ub *prv, vec64ub *cur, int s)

{

vec64ub left, right;

int i; int sad = 0;

for (i=0; i<8; i++)

{ left = prv[i*s]; right = cur[i*s];

sad += ub_me(left, right);

}

return(sad);

}

Page 21: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 21

21

Outline

• Introduction• Approach• Benchmark suite• Example• Results

Page 22: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 22

22

Results: Natural Motion

nops

cache stalls

instructions

Averagegain: 2.7x(cycles)

10

15

5

10

15

5

CPU64 cycles x 10 7 TM1000 cycles x 107

SAD

MED

SUB

UPC

SEG

SAD MEDSUB

UPC

SEG

0 20 40 60 80 100% 0 20 40 60 80 100%0 0

Page 23: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 23

23

Natural Motion Dynamic Load

10 20 30 40 50 60 70 80 90 100 110 0

50

100

150

Field number

cpu

lo

ad (

%)

90%

16%

tm1000 @ 125MHz

cpu64 @ 300MHz

Page 24: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 24

24

Summary

• Application domain:– Media processing for CE industry

• Benchmark suite:– Targeted to application domain – Optimized for class of processors

• Future products:– Gradual shift to software implementations of

signal processing functions

Page 25: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 25

25

TriMedia CPU64Architecture

J. T. J. van EijndhovenPhilips Research

Eindhoven, The Netherlands

Page 26: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 26

26

Outline

• Introduction• VLIW architecture• Instruction set• IDCT example• Status & conclusion

Page 27: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 27

27

Application Target

• Processor core, to be embedded in different ICs and products

• Real time processing of media streams• Cost-sensitive consumer electronics market

Page 28: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 28

28

Embedded Application

VLIWcpu

I$

video-in

video-out

audio-in

SDRAM

audio-out

PCI bridge

Serial I/O

timers I2C I/O

D$

Page 29: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 29

29

Performance Target

• 6x to 8x performance increase,to process f.i. higher-resolution video streams or more tasks in parallel

• Not more than double transistor count

Relative to TriMedia TM1000 (5-slot VLIW,100MHz, 32-bit datapath, media operations):

Page 30: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 30

30

Efficiency

Good performance/silicon cost ratio by:• Optimize CPU architecture and media

benchmark source code towards each other• Solve resource conflicts at compile time

Page 31: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 31

31

Outline

• Introduction• VLIW architecture• Instruction set• IDCT example• Results

Page 32: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 32

32

Architectural Speedup

Not simply increasing the VLIW issue width:• Diminishing gain of compiler-generated ILP• Increasing implementation complexity

(area, timing)

Page 33: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 33

33

Architectural Speedup

• Extended SIMD: uniform 64-bit design,vectors of 1-, 2-, and 4-byte elements(data throughput per cycle)

• New, extensive, media instruction set(functionality per cycle)

• Improved cache control(prefetch, alloc)

Page 34: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 34

34

CPU64 Architecture

datacache16KB

datacache16KB

mmummu

mmummu

64-bitmemory

bus multi-port 128 words x 64 bits register filemulti-port 128 words x 64 bits register file

FUFU FUFU FUFU FUFUFUFU

instructioncache32 KB

instructioncache32 KB

mmummu

bypass networkbypass network

PCPCexceptions exceptions

32-bitperipheral

bus

VLIW instruction decode and launch

Page 35: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 35

35

Architecture VLIW+SIMD

• Issues a 5-slot instruction every cycle• Each slot supports a selection of FUs• All FUs support vectorized (SIMD) data• Double-slot FU allows powerful multi-

argument, multi-result operations• All FUs are pipelined, latency 1 to 4

(except floating point divide and sqrt)

Page 36: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 36

36

Outline

• Introduction• VLIW architecture• Instruction set• IDCT example• Results

Page 37: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 37

37

Instruction Format

Per operation, per slot:• Up to 2 arguments, up to 1 result• Extra register for guarding (conditional execution)• Immediate argument size can be 32 bit

Instructions have compressed variable length format, decompressed during decode

Page 38: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 38

38

Operations

Intends to cover combinations of:• 1-, 2-, 4-byte elements in an 8-byte vector• Signed or unsigned type• Clipped versus wrap-around arithmetic• Set of basic functions

(ld, st, add, mul, compare, shift, ...)

Page 39: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 39

39

Instruction SetCategory Cnt CommentLoad/store ops 39 vectorized, endian-correctByte shuffles 67 vector type convertBit shifts 48 round, fieldsMultiplies 54 round, clip, wrap aroundInteger arith 104 clip, wrap aroundFloating point 59 scalar and vectorizedLookup table 6 vectorizedSpecial ops 23 MMU, cache, special regsBranch 10 (un)interruptible, trap

Page 40: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 40

40

Example: msp-multiply

Multiply tointernal doubleprecision integer

Each argument:4-way 16-bit int

Round lower half into upper half

Return upper half

+ + + +1 1 1 1

Page 41: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 41

41

Example: Transpose

mm nn oo pp

ii jj kk ll

ee ff gg hh

aa bb cc dd

bb ff jj nn

aa ee ii mm

2-slot ‘super-op’:takes 4 arguments,shuffles bytes,produces 2 results

(2-dimensional filtering)

Page 42: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 42

42

Example: 4-way average

(MPEG 2-dimensional half-pixel average)

+ + + + + + + +

+ + + + + + + +

1

add with extended precision

round

1

Page 43: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 43

43

Branches

• Branch operations have 3 branch delay slots• Compiler & scheduler try to fill these

(profiling, loop unrolling, inlining, guarding)• Branch units are properly pipelined• Up to 3 branch ops in 1 instruction• No branch prediction hardware• Branches are preferred moments for interrupt

servicing

Page 44: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 44

44

Memory Management Units

• Separate IMMU and dual-port DMMU• Variable page size: 4K, 16K, … ,16M• Indexed with 32-bit VA and 8-bit task ID• Inter-task, X/W, and supervisor-mode

protections• TLBs are software managed through precise

exceptions

Page 45: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 45

45

Outline

• Introduction• VLIW architecture• Instruction set• IDCT example• Results

Page 46: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 46

46

2D-IDCT Example

• IDCT code was tested early in the project, also for studying the programming model

• Mapped through our experimental C compiler and instruction scheduler

• Operates entirely on (vectors of) 16-bit data elements

• Simulation showed IEEE 1180 accuracy compliancy

Page 47: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 47

47

2D-IDCT Instructions

0

50

100

150

200

250

300

350

400CPU64

TM-1000

TI 320C62

Altivec

PA-8000

PentiumI

PentiumII

NEC V830OKOK

OK

Execution time in cycles, excluding data cache misses

OK = accuracy compliant

Page 48: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 48

48

Outline

• Introduction• VLIW architecture• Instruction set• IDCT example• Status & conclusion

Page 49: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 49

49

Status

Now in construction at Philips Semiconductors, Sunnyvale, CA

• 7M transistors (target)• 300Mhz clock (target)• 0.18 technology• 1.8 volt• couple of watts power

Page 50: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 50

50

Conclusion

• The 5-slot 64-bit VLIW provides a powerful architecture for media processing

• Performance gain of 6x to 8x on TM1000• Rich and regular media instruction set• Powerful multi-argument, multi-result ‘super-

op’ naturally fits VLIW architecture• Embedded DSP supports robust multi-tasking

through MMU

Page 51: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 51

51

TriMedia CPU64Application Development

Environment

E.J.D. PolPhilips Research

Eindhoven, The Netherlands

Page 52: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 52

52

Outline

• Introduction• Machine description• Vector models• Compilation trajectories• Optimization• Conclusion

Page 53: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 53

53

Basic architecture

• CPU64 is based on TM1000 VLIW DSPCPU• Application domain: digital media processing• Register-rich• Custom operations• Applications show several levels of parallelism

Page 54: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 54

54

Parallelism

• MIMD– Implicit in program (C)– Compile-time scheduling (ILP)

• SIMD– Explicit in program (vector types)– Mixes well with MIMD

• Task level parallelism

Page 55: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 55

55

Instruction Set

• General-purpose (scalar) operations for control• Special-purpose (custom, vector) operations

for media processing• 5 scalar types (8,16,32,64 bit ints, 32 bit floats)• 7 vector types (8,16,32 bit (un)signed ints, 32

bit floats)

Page 56: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 56

56

Goal

Make application development as simple as possible!

• Higher levels of parallelism • More special-purpose hardware• Range of CPUs will be available

Page 57: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 57

57

Outline

• Introduction• Machine description• Vector models• Compilation trajectories• Optimization• Conclusion

Page 58: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 58

58

Traditional Compilation

App 1App 2

App 3

Compiler AExe 1A

Exe 2A

Exe 3A

Exe 1BExe 2B

Exe 3BCompiler B

Machine A

Machine B

Page 59: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 59

59

Retargetable Compilation

App 1App 2

App 3

Compiler

Exe 1AExe 2A

Exe 3A

Exe 1BExe 2B

Exe 3B

Machine A

Machine B

Machine ADescription

Machine BDescription

Page 60: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 60

60

MD file contents

• MD describes instance from class of machines

• Contains information such as:– number & width of registers– number of issue slots– number & placement of function units– instruction latencies

Page 61: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 61

61

Y-Chart

MachineDescription

BenchmarkApplicationsCompiler

Simulator

PerformanceNumbers

Page 62: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 62

62

Outline

• Introduction• Machine description• Vector models• Compilation trajectories• Optimization• Conclusion

Page 63: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 63

63

Memory Endianness

• Endianness affects storage of scalars • Programmability requires vectors to be

subranges of scalar arrays• C address arithmetic requires increasing

address with increasing array index

Page 64: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 64

64

Register Endianness

• Endianness does not affect storage of scalars• Specific vector elements are addressed by

certain custom ops (if necessary)• Endianness might affect ordering of vector

elements in registers

Page 65: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 65

65

Endianness design choices

• Memory endianness is fixed for intra-element order (C requirement)

• Register endianness is fixed for both inter- and intra-element order, but it is irrelevant which (programming ease)

• Load/Store operations translate inter-element order between registers and memory

Page 66: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 66

66

Vector Load/Store operations

A[1].msbA[1].lsb A[0].msbA[0].lsb

A[0].lsbA[0].msbA[1].lsbA[1].msb register

(bit number)31

memory3 2 1 0

….

….

(byte address)

0

load/store

Page 67: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 67

67

Outline

• Introduction• Machine description• Vector models• Compilation trajectories• Optimization• Conclusion

Page 68: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 68

68

Software and Hardware Operations

• Hardware view of instruction set– Implement minimal, changeable set

• Software view of instruction set– provide regular, orthogonal, stable set

• Two instruction sets were defined for these purposes

Page 69: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 69

69

Instruction set libraries

• Hardware operations library– untyped, minimal, changing

• Software operations library– vector-typed, orthogonal, stable

• Mapping in MD file and library

Page 70: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 70

70

Library structure

simulator hw_ops sw_ops application

simulator application application

sourcecode

workstationexecutable

CPU64executable

Page 71: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 71

71

Functional Development

hw_ops sw_ops application

application

sourcecode

workstationexecutable

gcc

Page 72: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 72

72

Code Tuning

simulator hw_ops application

simulator application

gcc

interpretation

tmcc

sourcecode

workstationexecutable

CPU64executable

Page 73: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 73

73

Outline

• Introduction• Machine description• Vector models• Compilation trajectories• Optimization• Conclusion

Page 74: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 74

74

General guidelines

• Design application architecture (data flow)• Determine data formats (scalar/vector types)• Arrange data locale (memory/cache/registers)• Design control architecture (loop structure)• Optimize computations (custom-ops)• Fine tune cache behavior (prefetch/alloc)

Page 75: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 75

75

Outline

• Introduction• Machine description• Vector models• Compilation trajectories• Optimization• Conclusion

Page 76: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 76

76

Conclusions

• Applications are written in C code only• Machine Description supports changes in ISA• Vector types keep code legible• Endianness is not visible in application code• Fast functional development track• Accurate application tuning track

Page 77: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 77

77

TriMedia CPU64Design Space Exploration

G.J. HekstraPhilips Research

Eindhoven, The Netherlands

Page 78: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 78

78

Outline

• Introduction• Tools• Exploration• Real numbers• Conclusions

Page 79: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 79

79

Methodology: the Y-chart

MachineDescription

BenchmarkApplicationsCompiler

Simulator

PerformanceNumbers

Page 80: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 80

80

64-bit VLIW CPU

• Many parameters, large design space(s)

datacache16KB

datacache16KB

mmummu

mmummu

multi-port 128 words x 64 bits register filemulti-port 128 words x 64 bits register file

FUFU FUFU FUFU FUFUFUFU

64-bitmemory

bus

instructioncache32 KB

instructioncache32 KB

mmummu

bypass networkbypass network

VLIW instruction decode and launch

Page 81: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 81

81

Multimedia benchmark

• Nine multimedia applications from:– Data communication– Audio coding– Video coding– Video processing– Graphics

• Representative– Applications– Code– Datasets

Page 82: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 82

82

Challenge for design space exploration

• Simulation time for the benchmark for a single machine is 18 hours

• The number of design points for functional unit configuration alone:

• Results in 2000,000,000,000 years of computation time

1510

Page 83: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 83

83

Outline

• Introduction• Tools• Exploration• Real numbers• Conclusions

Page 84: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 84

84

Essential tooling

• Retargetable toolchain– Core compiler, scheduler, simulator

• Design and experiment management• DSE support tools

– Machine generation, pseudosim, analysis• Glue software

Page 85: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 85

85

Pseudo-simulationPseudosim: a retargetable pseudo-simulator• Performs a cycle-accurate calculation of the

amount of instruction cycles, without doing the actual machine simulation

• Gathers other machine statistics, such as utilisation of slots, functional units, operations

• Time for the whole benchmark is reduced from 18 hours to 4 minutes

Page 86: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 86

86

Pseudo-simulation

once,

18 hours

many times,

4 minutesSimulator

Compiler

Pseudosim

Machinedescription

Benchmarkapplications

Machinedescription’

code

statisticsresults

Page 87: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 87

87

Outline

• Introduction• Tools• Exploration• Real numbers• Conclusions

Page 88: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 88

88

Functional unit configuration

Problem:• How many of each type

of FU do I need?• Where do I place them?• How do I prevent

simulating too many machine variations?

slot

FU

type

Page 89: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 89

89

Naпve calculation of the design space size

• 24 types of FU’s31 configurations per type

• 6 types of super-op FU’s7 configurations per type

Size of design space: 40624 10731

Page 90: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 90

90

Not so naпve calculation of the design space size

• Assume relationship between FU types• Best-guess lower and upper bounds on

number of FU’s per type• Reduce permutations

Size of design space: 1510

Page 91: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 91

91

Systematic exploration

• It is not feasible to exhaustively compute all design points in the design space

• Therefore we probe, partition, and then explore the design space

• Careful set-up of experiments

Page 92: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 92

92

Reduction of the design space

• Observation: over 93% of operations are done in 30% of FU types

• Action: partition space – Exhaustive exploration

of important FU’s– Greedy exploration of

the remainder

70% 30%

7% 93%

operations

functional unit types

Page 93: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 93

93

Exhaustive experiment

• Close to 3000 machine variations

• Only 4 machines survive• Large performance

variation due to placement

• Time taken = 200 hours

194 195 196 197 198 199 200 201 202 8.6

8.8

9

9.2

9.4

9.6

9.8

10x 107

area

cy

cle

s

Exhaustive experiment

Page 94: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 94

94

Greedy experiments

191 192 193 194 195 196 197 8.6

8.7

8.8

8.9

9

9.1

9.2

9.3 x 107

area

First greedy experiment

cy

cle

s

• Less machine variations per experiment

• More machines survive• Less performance variation

due to placement• Close to another 3000

machine variations over all greedy experiments

Page 95: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 95

95

Outline

• Introduction• Tools• Exploration• Real numbers• Conclusions

Page 96: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 96

96

Simulated machines

• 180 Gbyte data transferred

• 800 Mbyte compressed data stored

• 2T cycles simulated• 10T operations issued

Experiment number

Probing 185

Exhaustive 3023

Greedy 2519

Total 5727

Page 97: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 97

97

Outline

• Introduction• Tools• Exploration• Real numbers• Conclusions

Page 98: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 98

98

Summary and conclusions

• A full-fledged design space exploration was done for the 64-bit TriMedia VLIW CPU core

• The use of both pseudo-simulation and a systematic exploration has made this feasible

• The outcome is:– A range of machines

in A-T space– Rules for functional

unit placement– A flexible, integrated

environment for continuation of DSE

Page 99: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 99

99

Summary and conclusions

• A large amount of time is spent in developing tools to enable the DSE

• The design of experiments and the analysis of results takes up more time than the execution

• Performing a DSE stretches the capabilities of the tools to the maximum

• The pseudo-simulator is a powerful tool that supports the full design space offered by the machine description

Page 100: Philips Research ICCD 1999 - 1 1 TriMedia CPU64 Application Domain and Benchmark Suite A.K. Riemens Philips Research Eindhoven, The Netherlands

Philips Research

ICC

D 1

999

- 10

0

100