
Design and Development of a Heterogeneous Hardware Search Accelerator

This dissertation is submitted for the degree of Doctor of Philosophy

Tan, Shawn Ser Ngiap

Magdalene College

May 21, 2009


Abstract

Search is a fundamental computing problem and is used in any number of applications

that are invading our everyday lives. However, it has not received as much attention as

other fundamental computing problems. Historically, there have been several attempts

at designing complex machines to accelerate search applications. However, with the cost

of transistors falling dramatically, it may be useful to design a novel on-chip hardware

accelerator for search applications.

A search application is any application that traverses a data set in order to find one

or more records that meet certain fitting criteria. These applications can be broken down

into several low level operations, which can be accelerated by specialised hardware units.

A special search stack can be used to visualise the different levels of a search operation.

Three hardware accelerator units were designed to work alongside a host processor.

A significant speed-up in performance when compared against pure software solutions

was observed under ideal simulation conditions. An unconventional method for virtually

saving and loading search data was developed within the simulation construct to reduce

simulation time.

This method of acceleration is not the only possible solution as search can be ac-

celerated at a number of levels. However, the proposed architecture is unique in the

way that the accelerator units can be combined like LEGO bricks, giving this solution

flexibility and scalability.

Search is memory intensive, but the performance of regular cache memory that

exploits temporal and spatial locality was found wanting. A cache memory that

exploited structural locality instead of temporal and spatial locality was also developed

to improve the performance.

As search is a fundamental computational operation, it is used in almost every

application, not just obvious search applications. Therefore, the hardware accelerator

units can be applied to almost every software application. Obvious examples include

genetics and law enforcement while less obvious examples include gaming and operating

system software. In fact, it would be useful to integrate accelerator units with slower

microprocessors to improve general search performance.

The accelerator units can be implemented using an off-the-shelf FPGA at speeds of

around 200MHz, or as an ASIC at 333MHz (0.35µm) and 1.0GHz (0.18µm). A

regular FPGA is able to accelerate up to five parallel simple queries or two heterogeneous

boolean queries, or a combination of the two, when used with regular DDR2 memory. This

solution is particularly low-cost for accelerating search, avoiding the need for expensive

system-level solutions.


Declaration

I hereby declare that my thesis, entitled Design and Development of a Heterogeneous Hardware Search Accelerator, is not substantially the same as any that I have

submitted for a degree or diploma or other qualification at any other University. I further

state that no part of my thesis has already been or is being concurrently submitted for

any such degree, diploma or other qualification.

This dissertation is the result of my own work and includes nothing which is the

outcome of work done in collaboration except where specifically indicated in the text.

This dissertation does not exceed the limit of length prescribed by the Degree Committee

of the Engineering Department. The length of my thesis is approximately 45,000 words

with 41 figures and 25 listings.

Signed,

Shawn Tan


Acknowledgements

I would like to take this opportunity to express my gratitude to the following people

who have helped me, in one way or another, throughout the duration of my research at

Cambridge and the write-up at home in Malaysia.

Dr David Holburn, for being the nicest supervisor that one can hope for, without

whom this work would have been difficult to accomplish. I want to express my thanks for

everything you’ve done for me in the past four years; welcoming me into your family,

getting things done within the department and patiently reading through my thesis.

All the members of the department and division, for making it a nice place and easy

environment to work in. Mr Stephen Mounsey, Mr John Norcott and Miss Eleanor Blair

for technical assistance in setting up the various software tools that I needed. Mr Mick

Furber for all the assistance in the electrical teaching lab.

Friends from college, for helping me through tough times and keeping me sane. Jack

Nie for helping me print out my thesis and handling all of the administrative issues in

submitting my thesis. Drs Ray Chan and Ming Yeong Lim for being my companions on

my many travels. Zen Cho for being my shoulder to cry on when things were not going

well.

All my friends and family in Malaysia, for their belief in me and support throughout

the duration of this research. I would like to thank my sister and my parents for all the

patience and tolerance that they have shown me during the final stretch of this work. My

niece and nephews, Jarellynn, Jareick and Jarell for lending me their bubbling energy

when I needed a boost. This thesis is dedicated to them.


Contents

1 Introduction 1

1.1 Justifying Search Acceleration . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Historical Justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Search Basics 7

2.1 Search Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 Categorising Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2.1 Primary Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2.2 Secondary Search . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3 Data Structures & Algorithms . . . . . . . . . . . . . . . . . . . . . . . 11

2.3.1 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.4 Search Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3 Search Application 16

3.1 Search Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.1.1 Example Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1.2 Pipeline Breakdown . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1.3 Query Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.2 Search Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.2.1 Key Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.2.2 List Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.2.3 Result Collation . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.2.4 Overall Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22


4 General Architecture 25

4.1 Initial Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.2 Hardware Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.2.1 Multi-Core Processing . . . . . . . . . . . . . . . . . . . . . . . . 26

4.2.2 Word Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.2.3 Host Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.3 Software Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.3.1 Software Toolchain . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.3.2 Standard Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.3.3 Custom Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.4 Initial Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.4.1 Stack Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5 Streamer Unit 32

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5.1.1 Design Considerations . . . . . . . . . . . . . . . . . . . . . . . . 33

5.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.2.1 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.2.2 Operating Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.2.3 State Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.3 Streamer Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.3.1 Kernel Functional Simulation . . . . . . . . . . . . . . . . . . . . 38

5.3.2 Kernel Timing Simulation . . . . . . . . . . . . . . . . . . . . . . 39

5.3.3 Kernel Performance Simulation . . . . . . . . . . . . . . . . . . . 44

5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

6 Sieve Unit 46

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

6.1.1 Design Considerations . . . . . . . . . . . . . . . . . . . . . . . . 47

6.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

6.2.1 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6.2.2 Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

6.2.3 Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

6.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

6.3.1 Kernel Functional Simulation . . . . . . . . . . . . . . . . . . . . 51

6.3.2 Kernel Software Pump Timing . . . . . . . . . . . . . . . . . . . 51

6.3.3 Kernel Software Pump Performance . . . . . . . . . . . . . . . . 56

6.3.4 Kernel Hardware Pipe Timing . . . . . . . . . . . . . . . . . . . 57

6.3.5 Kernel Hardware Pipe Performance . . . . . . . . . . . . . . . . 59


6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

7 Chaser Unit 63

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

7.1.1 Design Considerations . . . . . . . . . . . . . . . . . . . . . . . . 64

7.2 Chaser Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

7.2.1 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

7.2.2 Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

7.3 Kernel Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . 68

7.3.1 Kernel Functional Simulation . . . . . . . . . . . . . . . . . . . . 68

7.3.2 Kernel Single Key Timing . . . . . . . . . . . . . . . . . . . . . . 69

7.3.3 Kernel Single Key Performance . . . . . . . . . . . . . . . . . . . 72

7.3.4 Kernel Multi Key Timing . . . . . . . . . . . . . . . . . . . . . . 73

7.3.5 Kernel Multi Key Performance . . . . . . . . . . . . . . . . . . . 75

7.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

8 Memory Interface 79

8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

8.2 Cache Primer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

8.3 Cache Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

8.3.1 Instruction Cache . . . . . . . . . . . . . . . . . . . . . . . . . . 82

8.3.2 Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

8.4 Cache Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

8.4.1 Instruction Cache . . . . . . . . . . . . . . . . . . . . . . . . . . 86

8.4.2 Data Cache Trends (Repeat Key) . . . . . . . . . . . . . . . . . . 87

8.4.3 Data Cache Trends (Random Key) . . . . . . . . . . . . . . . . . 89

8.5 Data Cache Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

8.5.1 Static Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . 90

8.5.2 Dynamic Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . 92

8.5.3 Prefetched Data Cache . . . . . . . . . . . . . . . . . . . . . . . . 92

8.6 Cache Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

8.6.1 Cache Size Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

8.6.2 Structural Locality . . . . . . . . . . . . . . . . . . . . . . . . . . 95

8.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

9 Search Pipelines 97

9.1 Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

9.1.1 Primary Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

9.1.2 Simple Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99


9.1.3 Range Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

9.1.4 Boolean Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

9.2 System Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

10 Implementation 103

10.1 Fabric Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

10.1.1 Dynamic Fabric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

10.1.2 Static Fabric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

10.2 Integration Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . 105

10.2.1 Tight Coupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

10.2.2 Loose Coupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

10.3 FPGA Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

10.3.1 Chaser Implementation . . . . . . . . . . . . . . . . . . . . . . . 107

10.3.2 Streamer Implementation . . . . . . . . . . . . . . . . . . . . . . 107

10.3.3 Sieve Implementation . . . . . . . . . . . . . . . . . . . . . . . . 107

10.3.4 Resource & Power . . . . . . . . . . . . . . . . . . . . . . . . . . 111

10.3.5 Physical Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

10.4 ASIC Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

10.4.1 Area Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

10.4.2 Power Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

10.4.3 Speed Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

10.5 Cost Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

10.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

11 Analysis & Synthesis 117

11.1 Important Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

11.2 Host Processor Performance . . . . . . . . . . . . . . . . . . . . . . . . . 117

11.2.1 Software Optimisation . . . . . . . . . . . . . . . . . . . . . . . . 118

11.2.2 Processor Architecture . . . . . . . . . . . . . . . . . . . . . . . . 118

11.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

11.3.1 Processor Scalability . . . . . . . . . . . . . . . . . . . . . . . . . 123

11.3.2 Accelerator Scalability . . . . . . . . . . . . . . . . . . . . . . . . 124

11.3.3 Memory Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . 125

11.4 Acceleration Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

11.4.1 Configuration A . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

11.4.2 Configuration B . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

11.4.3 Configuration Comparisons . . . . . . . . . . . . . . . . . . . . . 128

11.5 Alternative Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

11.5.1 Improved Software . . . . . . . . . . . . . . . . . . . . . . . . . . 129


11.5.2 Content-Addressable Memories . . . . . . . . . . . . . . . . . . . 130

11.5.3 Multicore Processors . . . . . . . . . . . . . . . . . . . . . . . . . 130

11.5.4 Data Graph Processors . . . . . . . . . . . . . . . . . . . . . . . 131

11.5.5 Other Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

11.6 Suggestions for Future Work . . . . . . . . . . . . . . . . . . . . . . . . 132

11.6.1 Conjoining Arithmetic Units . . . . . . . . . . . . . . . . . . . . 133

11.6.2 Conjoining Stream Buffers . . . . . . . . . . . . . . . . . . . . . . 133

11.6.3 Memory Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

12 Conclusion 135


List of Figures

2.1 Search abstraction stack . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.1 Typical search pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.1 Initial hardware search accelerator architecture . . . . . . . . . . . . . . 29

4.2 Initial stack based accelerator architecture . . . . . . . . . . . . . . . . . 31

5.1 Streamer data flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.2 Streamer block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.3 Streamer configuration stack . . . . . . . . . . . . . . . . . . . . . . . . 35

5.4 Streamer operating modes . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.5 Streamer machine states . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.6 Accelerator unit simulation setup . . . . . . . . . . . . . . . . . . . . . . 37

5.7 Streamer timing diagram . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5.8 Streamer performance simulation . . . . . . . . . . . . . . . . . . . . . . 44

6.1 Sieve data flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6.2 Sieve Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6.3 Sieve configuration register . . . . . . . . . . . . . . . . . . . . . . . . . 48

6.4 Sieve operating modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

6.5 Sieve FSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

6.6 Sieve software pumped timing diagram . . . . . . . . . . . . . . . . . . . 54

6.7 Sieve software pumped simulation . . . . . . . . . . . . . . . . . . . . . 56

6.8 Sieve with hardware piped timing diagram . . . . . . . . . . . . . . . . . 58

6.9 Sieve with streamer piped simulation . . . . . . . . . . . . . . . . . . . . 60

7.1 Chaser data flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65


7.2 Chaser unit block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

7.3 Chaser configuration stack . . . . . . . . . . . . . . . . . . . . . . . . . . 66

7.4 Chaser machine states . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

7.5 Single key chaser timing diagram . . . . . . . . . . . . . . . . . . . . . . 71

7.6 Chaser simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

7.7 Multiple key chase kernel timing . . . . . . . . . . . . . . . . . . . . . . 74

7.8 Chaser simulation (multi-key) . . . . . . . . . . . . . . . . . . . . . . . . 76

8.1 Cache simulation setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

8.2 Basic cache operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

8.3 Instruction cache hit ratio . . . . . . . . . . . . . . . . . . . . . . . . . . 86

8.4 Repetitive heap cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

8.5 Random heap cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

8.6 Random heap cache (with prefetch) . . . . . . . . . . . . . . . . . . . . 93

8.7 Cache structure comparison . . . . . . . . . . . . . . . . . . . . . . . . . 94

8.8 Structural cache architecture . . . . . . . . . . . . . . . . . . . . . . . . 95

9.1 Search pipeline abstraction . . . . . . . . . . . . . . . . . . . . . . . . . 97

10.1 Implementation architectures . . . . . . . . . . . . . . . . . . . . . . . . 103

10.2 System level implementation . . . . . . . . . . . . . . . . . . . . . . . . 105

10.3 ASIC area and power estimates . . . . . . . . . . . . . . . . . . . . . . . 112


List of Tables

3.1 Search Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

10.1 ASIC area and power estimates at speed . . . . . . . . . . . . . . . . . . 115

10.2 Fabrication cost per accelerator unit . . . . . . . . . . . . . . . . . . . . 115

11.1 Code profile for std::set::find() . . . . . . . . . . . . . . . . . . . . . 122

11.2 Specifications for 0.35µm CMOS DPRAM blocks . . . . . . . . . . . . . 134


Listings

3.1 Verilog profiling construct . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.2 Key search profile kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.3 List retrieval profile kernel . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.4 Result collation profile kernel . . . . . . . . . . . . . . . . . . . . . . . . 24

5.5 Streaming pseudocode . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.2 Software streamer kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.3 Hardware streamer kernel . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.4 Streamer kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

6.1 Sieve software kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

6.2 Sieve hardware kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

6.3 Sieve kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

6.6 Hardware streamer-sieve kernel . . . . . . . . . . . . . . . . . . . . . . . 62

7.2 Software chaser kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

7.3 Hardware chaser kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

7.4 Chaser kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

7.6 Software multi-key chaser kernel . . . . . . . . . . . . . . . . . . . . . . 77

7.7 Hardware multi-key chaser kernel . . . . . . . . . . . . . . . . . . . . . . 78

8.1 Verilog simulation LOAD/SAVE . . . . . . . . . . . . . . . . . . . . . . 82

8.2 Cache tree fill kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

8.3 Cache simulation kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

11.1 AEMB disassembly (GCC 4.1.1) . . . . . . . . . . . . . . . . . . . . . . 120

11.2 ARM disassembly (GCC 4.2.3) . . . . . . . . . . . . . . . . . . . . . . . 120

11.3 PPC disassembly (GCC 4.1.1) . . . . . . . . . . . . . . . . . . . . . . . . 121

11.4 68K disassembly (GCC 3.4.6) . . . . . . . . . . . . . . . . . . . . . . . . 121

11.5 X86 disassembly (GCC 4.2.3) . . . . . . . . . . . . . . . . . . . . . . . . 122


List of Reports

10.1 Chaser FPGA implementation results (excerpt) . . . . . . . . . . . . . . 108

10.2 Streamer FPGA implementation results (excerpt) . . . . . . . . . . . . . 109

10.3 Sieve FPGA implementation results (excerpt) . . . . . . . . . . . . . . . 110


CHAPTER 1

Introduction

Search is a fundamental problem in computing and as computers are

increasingly invading our everyday lives, search is also becoming an everyday

problem for everyone. Historically, search has received less attention than

other computing problems. The main objective of this research is to design

a hardware device that can offload the mundane tasks from a microprocessor

and speed up bottlenecks in search processing.

1.1 Justifying Search Acceleration

Search is becoming increasingly important in the consumer space. Where once, it was

the province of massive supercomputers owned by large corporations, search is moving

downstream. This is evident with the present emphasis placed on desktop search1 and

other localised search applications. Search has grown from being a fundamental

computing problem into an everyday problem[RSK04] for everyone.

Computing is also becoming ever more personal, with mobile devices today far

exceeding the computing power of enterprise servers from the past. As a result, our

personal computing devices have to juggle more information than before. Personal

computing storage capacities have grown from the tens of megabytes in the 1980s to the

hundreds of gigabytes today[Por]. This reflects the amount of data that search

applications have to work on. So, it will be useful to see how modern search might be

accelerated today.

1Desktop search is the name for the field of search tools which search the contents of a user’s own computer files, rather than searching the Internet[Wik09a].


With transistors being so cheap, there is good reason to explore ways of adding

processor functionality to improve performance and add value. The floating-point

unit has become an integral component in general-purpose computers to accelerate

floating-point calculations. Graphics accelerators are also being integrated into general-

purpose computers for stream processing2, which is useful for media centric and scientific

computations[HB07, LB08]. For everything else, the generic solution to every problem

is to devote more general-purpose computing power to it.

According to [Knu73], search is the most time-consuming part of many programs,

and the substitution of a good search method for a bad one often leads to a substantial

increase in performance; in fact, it is sometimes possible to arrange the data or the

data structure so that searching is eliminated entirely. However, there are many cases

when search is necessary, so it is important to have efficient algorithms for searching.

Although the problem of sorting received considerable attention in the earliest days of

computing, less has been done about searching.

Search can be loosely defined as traversing through a data space to find solutions

that fit a set of criteria. As an abstract task, it is not limited to just database search and

similar applications. It is a fact that almost every task performed by a computer involves

some form of search. Some abstract examples include chess playing and cryptography

while less obvious examples include task scheduling and language parsing. Even the

simple task of creating a document with a modern word processor will cause many

searches to be performed for both grammar and spell checking.

In less abstract applications, search is involved in almost every aspect of data

manipulation, regardless of how the data is organised in a computer or for what application.

Besides performing a search to find a record, searches are also performed during record

insertion, updating and deletion. Therefore, search is a very fundamental computing

task and any hardware acceleration for search would contribute significantly to overall

programme speed.

The search problem can be solved by devoting more hardware resources to it or

by designing better algorithms. In the case of the leading search engine in this world,

Google, both methods are employed. Google uses a patented algorithm called

PageRank3 to improve search result quality while also employing parallelisation on a massive

scale, to perform complex searches with lightning speed. There are some lessons that

can be taken from them.

However, there may be a more elegant way of tackling the problem, one that involves

using less, not more, hardware to support the algorithms. This research sets out to answer

the question by looking at search algorithms, how they behave and which parts of the

2Stream processing is a computer programming paradigm, related to SIMD, that allows some applications to more easily exploit a limited form of parallel processing[Wik09d].

3http://web.archive.org/web/20071114010112/http://www.google.com/corporate/tech.html


algorithm cause bottlenecks in the microprocessor. Then, it attempts to develop a

versatile architecture that can support search acceleration in hardware and measures

the potential performance of such an accelerator.

Chapters 2 and 3 address software issues where search is defined and classified into

different categories and the problems with each are identified. Chapters 5 through 10

detail the hardware accelerators, their design considerations, functional configurations,

operating modes and implementation technologies. Chapter 11 discusses various over-

arching issues including the validity of the results, scalability of the architecture and

other potential competing solutions.

1.2 Historical Justification

As mentioned earlier, search has received less attention than sorting [Knu73] and this

is also evident in hardware. While there have been past attempts at designing a search

processor, there are not many major ones. Some of these attempts have initially targeted

non-indexed queries, treated search as a computational problem, and simply

devoted more processing power to the problem. As mentioned in [Sto90], the performance

gains are at the expense of sorting. In most cases, simply indexing the database would

reduce the effectiveness of the solutions. However, there are also some solutions that

used unique storage hardware to perform acceleration at a fine-grained level. The following

traces some of the major evolutions in the past.

CASSM [GLS73, SL75] was a cellular system for very large databases. It was an early

research project that looked into hardware methods for accelerating search applications and is

one that is often cited. It focused on a context addressed cellular system for information

processing using a unique but inexpensive large memory device. This device allowed the

creation of hardware dependent data structures that were closer to the abstraction of

the data as perceived by a human, rather than a machine. Therefore, high level search

queries were implemented directly in this device. This memory device was implemented

using a floppy disk but could be expanded to include other storage mechanisms including

electronic memory. These devices were used in a distributed fashion in order to increase

parallelism, using a number of non-numeric microprocessors to process the data in a

parallel and associative manner.

DIRECT [DeW78] was a multiprocessor architecture for supporting relational database

management systems. It is a form of MIMD computer using a number of micro-

programmable off-the-shelf PDP11 microprocessors. These processors were attached

to pseudo-associative memory through a cross-point switch. The number of processors

allocated to a specific query were dynamically determined, based on the complexity and


size of the query. It was software compatible, ran a modified version of Ingres and could

be used as a relational database accelerator. Its operations were database specific,

including such primitives as CREATEDB, DESTROY, NEXTPAGE, JOIN, INSERT, RESTRICT and

other database specific operations. Therefore, it had a custom programming language

that used its query primitives like assembly opcodes. The resultant programming code

resembled stored procedure4 languages used in present day databases. However, it did

not use indices.

CAFS [Bab79] attempted to design an entire hardware relational database system by

means of specialised hardware. It claimed that regular computers were fundamentally

unsuited to implement relational operations and that database systems were ultimately

I/O limited. Therefore, this system used content-addressable hardware, even at the disk

level, to speed up relational queries. It worked on both indexed and non-indexed queries

using a temporary storage as a core hardware index. This allowed complex relational

operations to be accelerated by manipulating the information stored in this hardware

index. However, it ran on specific types of hardware and was limited to searching and

filtering table rows stored in disk storage.

GAMMA [DGG+86] was a relational database machine that exploited dataflow query

processing techniques. It was built as a cluster of off-the-shelf VAX11 based machines

connected via a token ring network and was a direct descendant of DIRECT. However,

it took into account the fact that the use of indices would improve search performance

tremendously by reducing I/O transactions. In addition to the I/O bandwidth

limitation, this machine tried to address the bandwidth limitation in the message passing

interface of a multiprocessor system. This machine demonstrated that parallelism could

be made to work in a database machine context. Moreover, it also showed how

parallelism was controlled with minimum overhead through a combination of hashing-based

algorithms and pipelining between processes. However, this was an expensive

demonstration of how standard computing power can be scaled in a cluster to perform search

acceleration.

GRACE [FKT86] was a parallel relational database machine. It was also built as a

cluster of off-the-shelf machines connected in two rings: a processing ring and a staging

ring. Both these rings share the same cluster of shared memory modules. While previous

machines employed processing at the database page level, this one works on databases at

a higher level of granularity, at the task level. Each task uses a number of primitive

database oriented operations including joins, selections, sorting and relational algebra.

4A stored procedure is a subroutine available to applications accessing a relational database system[Wik09c].


The machine tried to achieve high performance for join-intensive applications by using

a data stream oriented processing technique. A parallel join algorithm based on the

clustering property of hashing and sorting was used to support this processing technique.

In addition, it reduced the I/O bottleneck by using a combination of unique algorithms

and disk systems.

RINDA [ISH+91] was a relational database machine with specialised hardware for

searching and sorting. It was built as a cluster of standard computers with standard

disk controllers and storage. This database processor accelerated non-indexed relational

database queries. As the data is non-indexed and consequently unsorted, it has to

handle both a search problem and a sorting problem at the same time. It is composed

of content search processors and relational operational accelerating processors. The

former searches rows stored in disk storage, and the latter sorts rows stored in the

main memory. The processors connect to a general-purpose host computer with channel

interfaces.

GREO [FK93] was a commercial database machine based on a pipelined hardware

sorter. This machine was designed for commercial usage in existing installations. As

the clients were not interested in rewriting whole applications to cater to a new

architecture, support of legacy data structures and algorithms was important. It was made up of

a hardware merge sorter alongside a number of data stream microprocessors. The data

stream microprocessors were made up of a number of MC68020 microprocessors on a

board level multi-processor system. These processors performed the database primitive

operations such as selections, projections, joins and other computations. The host

computer compiled a given query into a sorting-oriented dataflow graph, which was executed

by the hardware sorter and the 68K microprocessors.

It would seem that in practically all of these cases, the solution presented is one

that employs various multi-processor arrangements: at a networked cluster level down

to the system board level. The solutions were not targeted at chip-level integration,

including those that employed custom chips at a board level. This may be due to the

nature of the industry and cost constraints on building systems on chip. However, with

the low cost of transistors today, it is feasible to explore building a chip level accelerator.

Some of these solutions also employ unique storage hardware and content addressable

storage to improve I/O performance. While these solutions are fast, customised storage

hardware would introduce incompatibilities with existing computing architectures and

content addressable devices are notoriously expensive to build. Therefore, these are

custom solutions that would be difficult to implement in today’s world where

computing is a commodity.


1.3 Objectives

This document is broadly organised according to the objectives of this research, which

are summarised as follows:

• Justify the need for hardware search acceleration. Search is shown to be an impor-

tant and common operation that is performed by a computer. Therefore, it will

be beneficial to accelerate the operation, while incurring the minimum penalty of

additional hardware and software.

• Categorise the types of search and the problems faced by each. Search needs to

be understood and broken down into sub-problems. Each sub-problem can then

be studied and accelerated in hardware.

• Design a hardware device that can be used to accelerate common present-day

search operations in a cost effective manner. The term accelerate is defined as

comprising two functions:

1. It will need to offload many of the mundane search processing tasks from

the host processor. This will free up the host processor to perform other

computational operations.

2. It will be designed to speed up the bottlenecks in the search operation. This

speed-up will be obtained by performing many of the operations in hardware,

at the fastest rate possible.


CHAPTER 2

Search Basics

A search stack helps visualise how the different hardware and software

components work together in any search application. The survey of the software

layers begins with the primary search and secondary search layers, which each

exhibit different characteristics and encounter different problems. It is also

important to know the basic data structures and algorithms that are used in

search applications as they are used in application software. In the end, the

problems of search become evident and regular methods may not address

them adequately.

2.1 Search Stack

The search stack of figure 2.1 is an abstraction framework for illustrating how different

parts of the hardware and software fit together to perform search operations. Although

the details of each layer in the stack may not be evident at the moment, they will be

further elaborated in subsequent chapters.

The search stack is separated between hardware and software abstraction layers.

The hardware layers are composed of hardware devices from accelerator units to the

host processor while the software layers represent different software functions performed

during a search operation. The different layers are dependent upon each other and as a

result, any improvement in search performance on one layer will ultimately improve the

overall performance. As the stack clearly illustrates, improvements can come from either

the software or hardware domains and each layer can be substituted with alternative

technologies that perform the same functions.


[Figure: the stack lists, from top to bottom, the Software Application, Secondary Search, Primary Search and Interface Library layers on the software side, and the Host Processor, Search Pipeline and Accelerator Units layers on the hardware side.]

Figure 2.1: Search abstraction stack

Software Application: This represents the user application that uses search

operations and ultimately benefits from any form of search acceleration. These

applications include classic examples such as database applications but are by no

means limited to them. Depending on the application, it may be possible for the software

application to directly use the primary search but it will normally depend directly

on secondary searches.

Secondary Search: This represents the software algorithms that perform different

types of queries classified as secondary searches. These algorithms form the bulk

of complex search operations used in different kinds of software applications. All

these algorithms are fundamentally dependent on primary search primitives.

Primary Search: This represents the basic software algorithms that perform primitive

searches directly on fundamental data structures. As mentioned above, certain

applications may only depend on primary searches and not use secondary searches.

Interface Library: This represents the interface layer between the hardware and

software layers. It provides the software hooks that convert the software requirements

into hardware machine operations. This can take the form of a driver, a shared

library, a language dependent source library or some other form dependent on

application requirements.

Host Processor: This represents the main processor that runs the application soft-

ware. The host processor is controlled by the software layers through the interface

library and controls the hardware layers below to actually perform the search oper-

ation. Different host processor architectures may be used, as long as the interface

library is changed accordingly.

Search Pipeline: This represents a specific combination of accelerator units that can

be used to accelerate search operations. The pipeline is controlled by the host

processor and performs the different operation stages by using the different accelerator

units.

Accelerator Units: This represents the hardware primitive units that perform the

basic operations of search. These are made up of the chaser, streamer and sieve

units that provide the acceleration capability in hardware. These can be replaced

by alternative hardware technologies, some of which are discussed later.

It may be useful to keep a mental picture of this search stack while reading the rest

of this document. It will not appear again until the end of the document. The bulk of

this document is loosely organised around this search stack.

2.2 Categorising Search

Different texts will categorise search algorithms differently. As an example, in [Knu73],

search is broadly divided into the following categories:

Internal and External searches are defined based on the data storage method. An

internal search uses data stored only inside primary memory. An external search

involves data stored inside disk storage.

Static and Dynamic searches are categorised based on the data structure used. A

static search uses data that does not change with time. A dynamic search uses

data that is subject to frequent record insertion and deletion.

Comparison and Attribute searches are distinguished based on the algorithm

used. A comparison search accomplishes the search by selecting data based on

key comparisons. An attribute search does not involve comparisons but

filters out data by property flags.

But, for the purpose of this research, search is broadly organised into primary

and secondary searches. This method of categorisation was chosen as the two searches

present different problems.

2.2.1 Primary Search

Primary search involves searching a data space for primary keys, which uniquely identify

a specific record within the data space. These keys are usually, though not necessarily,

sorted into an in-memory index. The decision to sort the keys depends on how often the

search is performed. If search is performed regularly, the cost of sorting the index during

insertion will be minimal. The cost of sorting could potentially be further reduced with

9

Page 24: Design and Development of a Heterogeneous Hardware Search … · 2009. 7. 16. · way that the accelerator units can be combined like LEGO bricks, giving this solution flexibility

the aid of special purpose hardware sorting networks. Examples of sorting networks can

be found in [CLRS01, Sto90, Shi06].

Primary search is analogous to finding local maxima along a function. In terms of

computational resources, this rarely consumes complex operations, unless the values had

to be computed on-the-fly. In most cases, comparison algorithms are usually used to

traverse a tree-like structure. Operations are then limited to: addition, subtraction and

conditional branches. Addition is used to keep track of the tree position, subtraction is

used to perform a comparison and conditional branching is used for decision making.
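
To make this concrete, the sketch below (not taken from the original text; the node layout and names are assumed) shows a primary key search over a binary tree index whose inner loop reduces to exactly these primitives: a comparison, a pointer update and a conditional branch.

    #include <cstdint>

    // Hypothetical in-memory index node; field names are illustrative only.
    struct Node {
        uint32_t key;
        Node*    left;   // sub-tree holding smaller keys
        Node*    right;  // sub-tree holding larger keys
        void*    record; // payload associated with the primary key
    };

    // Each iteration performs one comparison, one pointer update and one branch.
    void* find(const Node* root, uint32_t key) {
        for (const Node* n = root; n != nullptr; ) {
            if (key == n->key) return n->record;
            n = (key < n->key) ? n->left : n->right;
        }
        return nullptr; // key not present
    }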

From [Sto90], multiple processors will not be very efficient if they are used to perform

a search by making multiple probes into a file ordered by a single search key. We cannot

expect a multiprocessor to perform any single-key search much faster than a single

processor can, but we can expect a multiprocessor to do many different searches in

parallel with high efficiency.

This becomes obvious when we realise that the surplus processing power provided

by multi-core processors is wasted when complex computations are not used. Therefore,

multiple processors should be used to conduct multiple independent searches. In turn,

this causes memory to become the bottleneck as memory bandwidth requirements increase linearly

with the number of parallel search processes.

According to [Sto90], when database keys are unsorted, a serial search might have

to examine the entire database. In this case, a multiprocessor search has a potential for

excellent speedup. It is possible to spawn a search process whenever we hit a branch in

a tree and have the processes move in opposite directions. But from [Sto90], the true

saving from parallelism is not the speedup observed; it is the saving in the overhead used

to sort the database and maintain that sorted order. If this overhead is small, then the

effectiveness of the parallelism is small.

In many applications the cost of sorting or building an index can be amortised over

hundreds or thousands of searches. Rarely in such instances does it pay to perform

parallel search. On the other hand, some problems in cryptography are essentially

enormous searches that are only performed once per data space. The equivalent of

building an index is far more costly than searching the database in parallel by using

multiple processors.

Therefore, although multiple processors could potentially speed up a search, this

approach only works for applications that cannot be indexed. However, similar to the

earlier case, memory then becomes the bottleneck. Thus, the main technical

problem for primary search is the memory bottleneck.


2.2.2 Secondary Search

Secondary searches offer a different problem. Secondary search involves the search for

secondary keys, which are generally non-unique values. Once again, the keys may or may

not be sorted into an index. According to [Knu73], secondary search queries are usually

restricted to at most the following three types: simple, range and boolean. The problem

of discovering efficient search techniques for these three types of queries is already quite

difficult, and therefore queries of more complicated types are usually not considered.

Simple Query is the search for a specified key within the search space, such as

YEAR = 2008. In many ways, this may look similar to a primary search key except

that results are non-unique. If sorted, it will return a chain of references to the different

data while an unsorted one would require a complete traversal. As a result, such a

search will degenerate into a linear traversal. There are software techniques available

to optimise the query. One method is to batch process a few queries at a time. This

suffers from the same problems as primary search. It does not scale well and will not

benefit significantly from higher computational power.
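
As an illustration, a simple query over an unsorted secondary key degenerates into the linear scan sketched below; the record layout and names are assumptions, not taken from the original text.

    #include <cstddef>
    #include <vector>

    struct Record {
        int year;   // the secondary key from the YEAR = 2008 example
        // ... other fields ...
    };

    // Collects every matching record index; the key is non-unique, so the
    // whole table must be traversed.
    std::vector<std::size_t> simpleQuery(const std::vector<Record>& table, int year) {
        std::vector<std::size_t> results;
        for (std::size_t i = 0; i < table.size(); ++i)
            if (table[i].year == year) results.push_back(i);
        return results;
    }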

Range Query is a search for values that fit within a specified range of values, such

as YEAR in [2004:2008]. Just like the simple query, it looks similar to a primary key

search and will ultimately degenerate into a linear traversal. Although it is still possible

to optimise such queries in software, it may involve multiple traversals or more complex

comparisons. Therefore, it uses more computational power than a simple query.

Boolean Query can combine any primary and secondary searches with boolean

operators. Regardless of how it is optimised in software, it would still involve a large

number of traversals. Trying to combine different result sets can be graphically easy,

but computationally more difficult. In fact, it is suggested in [Knu73] to let people do

part of the work, by providing them with suitable printed indexes to the information but

we will not consider this here. These types of queries are also complicated to optimise

in software because the data is not known beforehand.
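
For illustration, combining two result sets for a boolean AND query might look like the sketch below, assuming each operand is already a sorted list of record indices; the function name is hypothetical, and an OR query would use std::set_union instead.

    #include <algorithm>
    #include <cstddef>
    #include <iterator>
    #include <vector>

    // Intersects two sorted result sets to answer a boolean AND query.
    std::vector<std::size_t> booleanAnd(const std::vector<std::size_t>& a,
                                        const std::vector<std::size_t>& b) {
        std::vector<std::size_t> out;
        std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                              std::back_inserter(out));
        return out;
    }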

In the case of a secondary search, a larger proportion of the problem consumes

computational power. The most difficult problem is the combination of multiple result

sets in the boolean query and this happens to be a very common form of query on large

data sets. Therefore, a major technical problem for secondary search is computational

complexity, which makes it a suitable candidate for hardware acceleration.


2.3 Data Structures & Algorithms

It is mentioned in [Knu69] that data is rarely simply stored as an amorphous mass

of numerical values. The way in which the data is stored can also provide important

structural relationships between the data elements. It is important to acquire a good

understanding of the structural relationships present within the data, and of the techniques

for representing and manipulating such structures within a computer.

2.3.1 Data Structures

From [Kor87], data structures are important because the way the programmer chooses

to represent data significantly affects the clarity, conciseness, speed of execution, and

storage requirements of the programme. Data structures are chosen so that

information can be easily selected, traversed, inserted, deleted, searched and sorted. For this

research, data structures can be classified along two axes: static versus dynamic, and

structures versus implementations. Any potential hardware acceleration of data structures

would directly improve algorithm performance.

Static Structures change only their values, not their structure, and include arrays

and records. Because their structure stays the same, even large structures are well

defined and can benefit from hardware processing. Moreover, their layout is known

during compile time and can be scheduled efficiently in software. These structures

are often used in signal processing applications and are often hardware accelerated in

stream processors through specialised memory addressing modes[KG05] and specialised

hardware. However, these structures lack the power of dynamic structures and are rarely

used to store complex relationships of data.

Dynamic Structures change their size, shape as well as values, and include stacks,

heaps, lists and trees. Dynamic structures are often used to store high-dimensional,

non-linear data that may not be efficiently stored in a static structure because resources

are only allocated when needed. These structures do not have a defined layout, either

during compile-time or run-time. So, it is fairly difficult to accelerate them in hardware

or software. From [KY05], the prevailing technique used for accelerating these structures

is data pre-loading or pre-caching. These structures are often used to store large sets of

data and need to be accelerated for search applications.
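
A minimal sketch of such pre-caching, assuming a GCC/Clang-style compiler that provides __builtin_prefetch, is shown below: the next node of a pointer-linked list is requested from memory while the current node is still being processed.

    // Hypothetical pointer-linked list node.
    struct ListNode {
        int       value;
        ListNode* next;
    };

    long sumWithPrefetch(const ListNode* head) {
        long sum = 0;
        for (const ListNode* n = head; n != nullptr; n = n->next) {
            if (n->next != nullptr)
                __builtin_prefetch(n->next); // hint: start fetching the next node early
            sum += n->value;                 // work on the current node
        }
        return sum;
    }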

Static Implementations implement data structures statically in hardware and would

certainly speed up all the operations on them. A content addressable memory (CAM) is

fully associative and will allow information to be searched and retrieved almost instantly.

Certain applications, such as network routers, implement such a hardware structure to


facilitate routing table lookups[PS06b]. But such structures are expensive and would

not be feasible for any large data set.

Other common data structures are regularly implemented in hardware, such as stacks

and heaps. There has also been some work done [MHH02] on implementing complex

graph structures directly in hardware. Such implementations would essentially move

the search algorithm from software into hardware. However, whatever is gained in speed

is sacrificed in flexibility.

Dynamic Implementations would typically be built in software as a pointer linked

structure. Memory can be dynamically and quickly allocated and freed as the structure

grows and shrinks. Pointer linked structures can usually only be traversed and searched

from one direction at a time. Hence, these structures are notoriously difficult to accel-

erate in hardware. However, existing data sets are almost entirely implemented using

this method and should be accelerated where possible.
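
To make the later discussion concrete, the sketch below shows what such a pointer-linked structure typically looks like in C++; the node layout is illustrative only and is not tied to any particular library implementation.

    #include <cstddef>

    // Minimal sketch of a pointer-linked list node. The exact layout (and
    // therefore the offsets of 'data' and 'next' from the start of the node)
    // depends on the compiler and the library in use.
    struct ListNode {
        int       data;   // value stored in this node
        ListNode *next;   // pointer to the next node, or NULL at the end
    };

    // Traversal can only follow the links in one direction, one node at a time.
    int sumList(const ListNode *head) {
        int sum = 0;
        for (const ListNode *n = head; n != NULL; n = n->next)
            sum += n->data;
        return sum;
    }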

2.3.2 Algorithms

From [CLRS01], informally, an algorithm is any well-defined computational procedure

that takes some value, or set of values, as input and produces some value, or set of values

as output. An algorithm is thus a sequence of computational steps that transform an

input into the output. Most of the algorithms studied by computer scientists that solve

problems are types of search algorithms. There are many types of basic search algorithm, and learning how they progress from one type to another will help us understand how to accelerate them.

Linear Search is the most basic search algorithm, as it merely steps through the

data space, one element at a time, until the key is found. The data structure could be

either static or dynamic. While the worst case would take O(N) steps to finish, there

are different methods to improve this search algorithm such as sorting the data based

on value or frequency of access.

Binary Search can be employed if the data is stored in a sorted array. There are

several variants on this method such as Fibonacci[Fer60] and interpolation search. At

each iteration, the algorithm would quickly eliminate at least half the search space. The

worst case would take O(log N) steps.
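
As a simple illustration (not code taken from this thesis), a binary search over a sorted array can be sketched as follows; each iteration discards half of the remaining search space.

    #include <vector>

    // Minimal binary search sketch: returns the index of 'key' in the sorted
    // vector 'v', or -1 if it is not present. Worst case O(log N) steps.
    int binarySearch(const std::vector<int> &v, int key)
    {
        int lo = 0, hi = (int)v.size() - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;   // midpoint, written to avoid overflow
            if (v[mid] == key)
                return mid;
            else if (v[mid] < key)
                lo = mid + 1;               // discard the lower half
            else
                hi = mid - 1;               // discard the upper half
        }
        return -1;                          // key not found
    }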

Binary Tree Search can be employed if the data is structured in a tree. In a binary

tree, one branch will contain values that are always smaller than the other branch and

each branch is a binary sub-tree. A binary tree search starts at the root of a tree and


eliminates half the tree with each step, similar to the binary search. The number of

entries traversed will depend on the maximum height of the tree. In the worst case,

where there is only one branch at each node, it could degenerate into a linear search

through a linked list.

Balanced Tree Search is the solution to a badly grown binary tree search. A

balanced tree, such as a red-black tree, bounds the difference between its longest and shortest branches (for a red-black tree, the longest branch is at most twice the length of the shortest). Therefore, a near-minimum height tree is guaranteed and the balanced tree search will never degenerate into a linear search.

However, these binary trees all depend on the entire tree being in-memory and only

allow entry at the root.

Multi-way Tree Search is used for large trees, where data may need to be split into

multiple sub-trees and accessed individually. Multi-way trees like B-trees store the

root tree in-memory, but store the large sub-trees on disk. The multi-way tree search

can quickly traverse through the in-memory tree and locate a particular sub-tree that

needs to be loaded. This allows the sub-trees to be entered at different points while only

consuming a modest amount of memory.

A survey of real-world databases [Bor99, MA06, PDG05] shows that indices are often

built using these trees, keeping parts of the index on disk and swapping pages when

necessary. These trees can be considered a more advanced form of a balanced tree.

Therefore, they could also benefit from balanced tree enhancements.

Radix Tree Search works like any other tree search, with one critical difference -

it does not rely on value comparisons to work. On certain processor architectures, the

implementation of comparisons could be expensive. Instead, it examines the bit value at a specific position of the key and branches left or right depending on that value.

Therefore, the amount of time it takes is dependent on the size of the key, rather than

the number of elements. As an added benefit, it can also do lexicographical matching

or wildcard matching, which is extremely powerful.
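
The branching step can be sketched as follows; this is a simplified illustration that assumes fixed-width 32-bit integer keys, and is not the structure used elsewhere in this thesis.

    // Simplified radix (bit-wise) tree search sketch. Instead of comparing whole
    // keys, each step tests one bit of the key and follows the corresponding
    // child, so the number of steps is bounded by the key width rather than by
    // the number of stored elements.
    struct RadixNode {
        RadixNode *child[2];   // child[0] taken when the bit is 0, child[1] when it is 1
        bool       isLeaf;     // true if this node terminates a stored key
        unsigned   key;        // the full key stored at a leaf
    };

    const RadixNode *radixSearch(const RadixNode *root, unsigned key)
    {
        const RadixNode *node = root;
        for (int pos = 31; node != 0; --pos) {
            if (node->isLeaf)
                return (node->key == key) ? node : 0;   // exact match check at the leaf
            if (pos < 0)
                return 0;                               // ran out of key bits
            unsigned bit = (key >> pos) & 1u;           // bit value at this position
            node = node->child[bit];                    // branch left or right on that bit
        }
        return 0;                                       // fell off the tree: not found
    }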

2.4 Search Problems

It should be evident at this point that search presents several distinct problems. Most com-

putational problems can be solved by devoting more computational hardware to the

problem. Although secondary search may benefit from having additional computational

resources, primary search would resist such attempts. Therefore, the current trend of

increasing computational performance by means of additional processor cores, would


increase the throughput of multiple searches but would do almost nothing for a single

search.

Additional computational units go hand in hand with increases in memory band-

width requirements. As primary search is a memory problem, one may think that

increasing on-chip cache is a useful solution. However, this is not a panacea as the

increase in memory cannot continue indefinitely. There may come a day when we have

multi-gigabyte cache memories but by then, our databases would probably be terabytes

large. Also, subsequent chapters will show that a larger cache size does not necessarily

result in better performance.

In standard databases, dynamic structures such as trees are often used to store and

sort information. They are often defined dynamically during run-time and implemented

dynamically in the data memory heap. This means that their characteristics are not well defined before they are actually used. As a result, they are not easy to accelerate directly in software or hardware.

It can be argued that search is primarily an I/O limited problem. However, with the

cost of present technology, it is feasible to circumvent this by storing entire databases

in primary memory. Therefore, this thesis assumes that databases are stored entirely in

memory and any attempt to accelerate search would need to deal with the problem of

slower primary memory only. Improvements in memory technology will help, but not

solve the problem until the day when whole databases fit inside fast cache memory.


CHAPTER 3

Search Application

The application layer is the top-most software layer. Almost any application

that searches through a data set for records that match a number of fitting

criteria is considered a search application. This task can be further broken

down into several stages that work collectively as a search pipeline. An

analysis of actual code profiles for each pipeline stage will reveal that they

are different from regular computing code and may require special attention.

3.1 Search Application

Search is a broad problem and search applications encompass a large number of problem

types that go beyond the scope of this research. Every type of computer application

exploits search operations at its core, which differ depending on the application type.

Encryption cracking software performs a search for the encryption key, while a chess

playing programme searches through a tree of potential positions to find the best move.

Although many of these problems are unique and very difficult, they are often less

commonly used and would not benefit much from hardware acceleration.

An alternative type of search, one performed regularly, is the search over a process table whenever an operating system starts or stops a process. In this type

of search, the computer has to go through a finite data set, looking for one or more

records that match a fitting criterion or criteria. This form of search is a generic search

operation and will benefit directly from any hardware acceleration.


3.1.1 Example Query

Assume for a moment, that there is a flu outbreak that only kills cats and the local

authorities want to inform all cat owners of this outbreak. Also, assume that there

exists a massive directory of the entire human population of the United Kingdom and

it holds all kinds of information about individuals including pets they own and their

city of residence. So, if the local council wished to find one or more individuals who are

resident in Cambridge and who own pet cats, an example query can be performed. This

query can be characterised using the following SQL-like statement¹:

SELECT person FROM population WHERE pet=cat AND city=cambridge

SELECT, FROM, WHERE and AND are all SQL keywords. The person represents the indi-

vidual or individuals being searched for. It may return one or more results, depending

on how many pet cat owners reside in Cambridge, or no results if no one owns cats in

Cambridge. The population represents the massive directory that needs to be searched.

In search algorithms, this massive directory is called the search space or database and

the size of this search space would be represented by N records. The cat and cambridge

criteria represent the fitting criteria used to filter out the results. In this case, only pet

cat owners residing in Cambridge should be identified.

It is easy to look at things from this perspective as it exhibits many characteristics

of a common search. Looking at this search query with an abstract eye, it reduces itself

to searching a data set for one or more records that match one or more fitting criteria.

Any application that performs this type of operation regularly is classified as a search

application.
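
In software terms, the whole query can be expressed over standard containers roughly as in the sketch below; the container layout and names are purely illustrative and are not the data structures used later in this thesis.

    #include <algorithm>
    #include <iterator>
    #include <list>
    #include <map>
    #include <string>
    #include <vector>

    // Illustrative index: each criterion value maps to a sorted list of record IDs.
    typedef std::map<std::string, std::list<int> > Index;

    // Rough software equivalent of:
    //   SELECT person FROM population WHERE pet=cat AND city=cambridge
    std::vector<int> runQuery(const Index &petIndex, const Index &cityIndex)
    {
        std::vector<int> results;

        // Key search: locate the "cat" and "cambridge" keys in their indices.
        Index::const_iterator cat  = petIndex.find("cat");
        Index::const_iterator camb = cityIndex.find("cambridge");
        if (cat == petIndex.end() || camb == cityIndex.end())
            return results;                    // a missing key means no results

        // List retrieval and result collation: intersect the two sorted ID lists.
        std::set_intersection(cat->second.begin(),  cat->second.end(),
                              camb->second.begin(), camb->second.end(),
                              std::back_inserter(results));
        return results;
    }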

3.1.2 Pipeline Breakdown

The example query above can be broken down into a number of simple sub queries,

which return multiple result streams that are then combined into a final result stream.

From the above description, the search operation can be broken down into a series of

operations. Figure 3.1 illustrates these operations.

[Figure 3.1: Typical search pipeline — key search, list retrieval and result collation stages, taking the data set as input and producing a key list and, finally, the results stream]

¹ An SQL statement is merely used for illustration purposes as this research is not SQL focused.


Key Search can be performed on each criterion of the query statement. The input to

this stage is the index structure and the output is a key. Indices are often stored

in a balanced tree structure. Unique key searches involve a balanced tree search

through the tree.

There are many algorithms that can be used to search through a tree, depending

on the size of the tree. For sufficiently large trees, the amount of time taken to

perform the search by the best algorithms is in the order of O(log N). For the

example query, depending on the processor and the number of criteria, a number

of key searches may be performed in parallel to speed it up. Once the keys are

located, a list of results can be retrieved.

List Retrieval searches are not actually searches but form part of the search pipeline.

The input to this stage is the starting point of the structure and the output is a list

of potential results. A list can be organised in many ways but is most commonly

organised as a pointer linked structure, such as a linked list. If the list is not in

sorted order, the bottleneck will once again be in the data structure. So for most

intents and purposes, the list can be assumed to be sorted.

In this case, the list retrieval is an operation to pull in data from memory into the

processor. For such an operation, the best algorithms will take O(N)

time to completely retrieve each list. Once again, depending on the power of the

processor and the number of criteria, retrievals may be performed in parallel. The

final stage of the operation is to collate the results.

Result Collation operations can themselves be considered a form of search, as they involve comparing and filtering the retrieved results. In the example query, the two results lists need to be intersected.

This is often a bottleneck in the search operation as the number of results returned

from the list retrieval may be significantly larger than the actual number of final

results needed. As described earlier, this is a computationally intensive operation

and can benefit from hardware acceleration.

There are many possible software algorithms that can be applied to this operation.

If the results lists are sufficiently short and randomly accessible, a fast intersection

algorithm could take O(log N) time to complete. However, for a sufficiently large

list that is not randomly accessible, a typical algorithm would take O(N) time to

complete. This stage is difficult to parallelise because all the earlier results need

to be fed in. After this, the bulk of the search operation is essentially complete

and the results are available.

Each individual stage of the pipeline can be performed in software by modern pro-

cessors. However, each stage needs to wait for results from the previous stage before


it is able to continue. Therefore, the above series of operations can be considered a

search pipeline. This project will treat each stage of the pipeline as its own operation

and accelerate each stage in hardware.

3.1.3 Query Illustration

There are many ways to organise information and just as many ways to search through

them. In our example query, there could be a single file on each resident that is sorted

by alphabetical order in folders, files, cabinets and rooms. Alternatively, there could

be a number of ledgers that hold a list of numbers identifying specific resident files by

room, cabinet, file and folder. There could be a ledger representing pet cat owners and

another ledger representing residents of Cambridge and these ledgers are stored in book

shelves by category.

In the first case, the data is structured badly, without any index or keys built for the

information. A badly organised database would not benefit from any form of hardware

or software acceleration. As a result, the only way to search through this information

is to inspect each individual resident file to see if they are pet cat owners and residents

of Cambridge. There is no point in working on this scenario as no amount of hardware

acceleration is going to help, when the bottleneck is the database itself. This search will

take a very long time to perform, even on the fastest supercomputers and the best way

to accelerate the search operation would be to reorganise the database.

In the second case, the data is structured, with a number of indices built for the

information. The cat and cambridge ledgers will first need to be retrieved from the pet

and city book shelves. Once located, each ledger contains a list of records (assumed

to be sorted) that reference specific individual files. So, the cat ledger will identify

specific pet cat owners and the cambridge ledger will identify residents. An intersection

operation can then be performed on both the lists to find the common records of both

ledgers. The resultant references can then be used to locate the specific person from the

massive population directory.

Although the types of algorithms that are involved in the above operations are

beyond the scope of this research, assume that the data is already organised in an

optimised manner and that the algorithms used to process them are also optimised.

Therefore, any bottleneck that exists will be caused by the processing of the search algorithm. This is where hardware acceleration may be able to help.

Theoretically, the actual number of results retrieved could range from nothing (no one in Cambridge owns cats) to the entire population of the UK (assuming that everyone is resident in Cambridge and they all own cats). As all search operations are dependent on

the size of the search space (N limited), hardware acceleration would be more beneficial

for large data sets than small ones. Therefore, this research will focus on methods to


accelerate significantly large data sets.

3.2 Search Profile

Before we proceed, it is important to understand how the individual search pipeline

stages are performed in a regular microprocessor. In order to do this, a basic search

software kernel was written using the C++ STL library. The software was then run

through a simulator and the resultant operations profiled.

Listing 3.1 illustrates how this is done in the Verilog simulation construct. Profiling

was turned on in the Verilog simulation, just before the search functions were called.

Profiling sends the status of certain important pipeline registers to the output (lines

103–127) when the dump variable is set (line 100) including ASM, which represents the

instruction register. The dump variable is toggled (line 186) by a virtual device mapped

to memory location 0x40009000, which was activated by a memory write in software.

The output was passed through a parser and statistics were taken on each type of

instruction.
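
On the software side, toggling the profiler therefore only requires a store to that memory-mapped address. The following sketch shows the idea; the pointer name and helper function are hypothetical, but the address matches the virtual device described above.

    // Hypothetical sketch: toggle the simulator's profiling dump by writing to the
    // virtual device at 0x40009000 (the written value is ignored; the write itself
    // toggles the 'dump' register in the Verilog testbench).
    static volatile unsigned int *const PROFILE_TOGGLE =
        (volatile unsigned int *)0x40009000;

    static inline void toggleProfiling()
    {
        *PROFILE_TOGGLE = 1;
    }

    // Usage: bracket the kernel under test so that only its instructions are dumped.
    //   toggleProfiling();     // profiling on
    //   swchase(setA, pkey);   // run the search kernel being profiled
    //   toggleProfiling();     // profiling off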

According to [FKS97], branches account for about 20% of general-purpose code and about 5-10% of scientific code, with conditional branches making up about 80% of all branches. Loads and stores are frequent operations in RISC code, making up about 25-35% load and about 10% store instructions.

Therefore, general-purpose code has about 35-45% memory operations, 20% branches

and 35-45% arithmetic operations. Table 3.12 shows the profile of a typical search kernel

and a breakdown of the operations for the kernel codes shown in listings 3.2, 3.3 and

3.4.

3.2.1 Key Search

Glancing at this profile, it is evident that search code has a similar profile to general-

purpose computing code. However, significant differences appear on closer inspection.

The number of memory operations is similar to the 35% as suggested in general-

purpose code. But the number of store operations is almost half that of general-purpose code, while the number of load operations is significantly higher. The

large number of read operations means that key search code essentially behaves in a

mainly ‘read’ manner. This suggests that speeding up memory reads might be beneficial.

However, reading is usually a cheaper operation than writing and any benefits gained

may be insignificant.

The proportion of conditional branches is only slightly higher than that of general-

purpose code. However, the total number of branches taken is almost 50% more than

² The total percentage is 100 ± 1% due to rounding errors.


Type              Key Search   List Retrieval   Result Collation

Arithmetic            37%           21%              47%
  Compare             44%            0%              13%
  Logic                2%            1%               0%
  Addition            39%            3%              27%
  Subtraction         15%           96%              50%
Branch                29%           20%              28%
  Conditional         84%           99%              71%
  Unconditional        9%            0%              27%
  Return               6%            1%               1%
Memory                32%           59%              26%
  Load                81%           67%              99%
  Store               19%           33%               1%
Miscellaneous          3%            1%               3%

Table 3.1: Search Profiles

that of general-purpose code. This indicates that decision making code is more common

for search operations. This suggests that speeding up branch penalties or eliminating

branches might be beneficial.

The bulk of the key search code comprises arithmetic operations. Almost half of

the arithmetic operations performed were comparison operations. This is indicative of

search operations, as searches mainly involve comparing values with a key. This suggests

again, that accelerating comparisons in hardware may be beneficial.

3.2.2 List Retrieval

Looking at the middle column of Table 3.1 immediately tells us that list retrieval is very

different from general-purpose code. Although the 20% branches are expected as for

general-purpose code, the amount of memory operations at 59% is significantly higher

at the expense of arithmetic operations of only 21%.

As its name suggests, the list retrieval operation is memory intensive, as it tries to

retrieve the entire list from memory. In this case, the ratio of writes to reads is about

1:2 because each node that is read is also written to memory by the kernel to simulate result buffering. However, if the data read in is used internally, the number of write operations drops dramatically and the operation becomes 'read-only'.

No comparison instructions are used, whilst subtraction is used the most. Although code listing 3.3 does not employ any explicit subtraction, it is used to decrement the list counter.

These results show that list retrieval is a memory intensive operation and may benefit

from accelerated memory operations but it will not benefit much from accelerating


computational operations.

3.2.3 Result Collation

Looking at the right column of Table 3.1, the profile is again significantly different from

the general-purpose profile. A large number of branches are again used, with most of

them being conditional ones. This is indicative of the decision making process that is

involved in results collation. There are relatively few memory operations and they are

almost all read operations.

As the name suggests, results collation is a compute-intensive operation. Almost

half the operations performed are computational. However, the majority of these are subtractions. While code listing 3.4 does not employ any explicit subtractions, they are often

used in decision making code to affect the sign and overflow conditions, which allow

microprocessors to perform conditional branches.

Surprisingly, only about a quarter of the code operations are memory operations.

This is a very different profile from the previous two profiles that are memory intensive.

Result collation is a computationally intensive operation and is a good candidate for

hardware acceleration.

3.2.4 Overall Profile

From the overall result, it can be shown that search is indeed an expensive operation. It

is well known that branches and memory operations are expensive operations, when com-

pared to simple arithmetic operations. Search algorithms consume significantly more branches and memory operations than general-purpose code, while only certain types of computational operations (compares and subtractions) are heavily used in search.

In addition, as each stage is so different, it might be better to design stage-specific

hardware accelerators than to design a universal hardware accelerator for search. The

key search stage resembles a general purpose operation but is memory and computa-

tionally intensive. The list retrieval stage is primarily memory intensive. The result

collation stage is mainly computationally intensive. Each accelerator stage can then be

combined in different ways, in order to accelerate different kinds of search.


 97  // DUMP CYCLES
     reg dump;
     always @(posedge sys_clk_i)
       if (dump & core0.risc0.cpu0.dena) begin
         // begin
102  `ifdef AEMB2_SIM_KERNEL
         $displayh("TME=", ($stime/10),
                   ",PHA=", core0.risc0.cpu0.gpha,
                   ",IWB=", {core0.risc0.cpu0.rpc_if, 2'o0},
                   ",ASM=", core0.risc0.cpu0.ich_dat,
107                ",OPA=", core0.risc0.cpu0.opa_of,
                   ",OPB=", core0.risc0.cpu0.opb_of,
                   ",OPD=", core0.risc0.cpu0.opd_of,
                   ",MSR=", core0.risc0.cpu0.msr_ex,
                   ",MEM=", {core0.risc0.cpu0.mem_ex, 2'o0},
112                ",BRA=", core0.risc0.cpu0.bra_ex,
                   ",BPC=", {core0.risc0.cpu0.bpc_ex, 2'o0},
                   ",MUX=", core0.risc0.cpu0.mux_ex,
                   ",ALU=", core0.risc0.cpu0.alu_mx,
                   //",WRE=", dwb_wre_o,
117                ",SEL=", dwb_sel_o,
                   //",DWB=", dwb_dat_o,
                   ",REG=", core0.risc0.cpu0.regs0.gprf0.wRW0,
                   //",DAT=", core0.risc0.cpu0.regs0.gprf0.regd,
                   ",MUL=", core0.risc0.cpu0.mul_mx,
122                ",BSF=", core0.risc0.cpu0.bsf_mx,
                   ",DWB=", core0.risc0.cpu0.dwb_mx,
                   ",LNK=", {core0.risc0.cpu0.rpc_mx, 2'o0},
                   ",SFR=", core0.risc0.cpu0.sfr_mx,
                   ",E"
127                );
     `endif
       end // if (uut.dena)

     always @(posedge sys_clk_i)
154    begin
         // DATA WRITE
         if (dwb_stb_o & dwb_wre_o & dwb_ack_i) begin
           case (dwb_adr_o[31:28])
             4'h0: // INTERNAL MEMORY
159            case (dwb_sel_o)
                 4'hF: rDLMB[dwb_adr_o[DLMB:2]] <= #1 dwb_dat_o;
                 4'hC: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {dwb_dat_o[31:16], wDLMB[15:0]};
                 4'h3: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {wDLMB[31:16], dwb_dat_o[15:0]};
                 4'h8: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {dwb_dat_o[31:24], wDLMB[23:0]};
164              4'h4: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {wDLMB[31:24], dwb_dat_o[23:16], wDLMB[15:0]};
                 4'h2: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {wDLMB[31:16], dwb_dat_o[15:8], wDLMB[7:0]};
                 4'h1: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {wDLMB[31:8], dwb_dat_o[7:0]};
               endcase // case (dwb_sel_o)
             4'h8: // EXTERNAL MEMORY
169            begin
                 case (dwb_sel_o)
                   4'hF: rDOPB[dwb_adr_o[DOPB:2]] <= #1 dwb_dat_o;
                   4'hC: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {dwb_dat_o[31:16], wDOPB[15:0]};
                   4'h3: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {wDOPB[31:16], dwb_dat_o[15:0]};
174                4'h8: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {dwb_dat_o[31:24], wDOPB[23:0]};
                   4'h4: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {wDOPB[31:24], dwb_dat_o[23:16], wDOPB[15:0]};
                   4'h2: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {wDOPB[31:16], dwb_dat_o[15:8], wDOPB[7:0]};
                   4'h1: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {wDOPB[31:8], dwb_dat_o[7:0]};
                 endcase // case (dwb_sel_o)
179  `ifdef SAVEMEM
                 #1 $fdisplayh(swapmem, "\@", dwb_adr_o[DOPB:2], " ", rDOPB[dwb_adr_o[DOPB:2]]);
     `endif
               end
             4'h4: // I/O DEVICES
184            case (dwb_adr_o[15:12])
                 4'h0: $write("%c", dwb_dat_o[31:24]);
                 4'h9: dump <= !dump;
               endcase // case (dwb_adr_o[15:12])
             default: $display("DWB@ %h<=%h", {dwb_adr_o, 2'o0}, dwb_dat_o);
189        endcase // case (dwb_adr_o[31:28])
           // $display("DWB@ %h<=%h", {dwb_adr_o, 2'o0}, dwb_dat_o);
         end // if (dwb_stb_o & dwb_wre_o & dwb_ack_i)

Listing 3.1: Verilog profiling construct


int swchase(std::set<int> &setA, int pkey)
{
#ifdef DEBUG
  iprintf("PKEY\t: 0x%X\n", pkey);
#endif

  volatile int j = (int)&*setA.find(pkey)._M_node;

#ifdef DEBUG
  iprintf("FIND\t: 0x%X\n", j);
#endif

  return EXIT_SUCCESS;
}

Listing 3.2: Key search profile kernel

int swstream(std::list<int> &listA)
{
  for (std::list<int>::iterator node = listA.begin();
       node != listA.end(); node++)
  {
    volatile int j = *node;
#ifdef DEBUG
    iprintf("HIT\t: 0x%X\n", j);
#endif
  }
}

Listing 3.3: List retrieval profile kernel

int swsieve(std::list<int> &listA, std::list<int> &listB)
{
  std::list<int>::iterator idxA, idxB;

  idxA = listA.begin();
  idxB = listB.begin();

  while ( (idxA != listA.end()) && (idxB != listB.end()) )
  {
    if (*idxA == *idxB)
    {
      volatile int j = *idxA;
      // HIT!!
#ifdef DEBUG
      iprintf("HIT\t: 0x%X\n", j);
#endif

      idxA++;
      idxB++;
    }
    else if (*idxA < *idxB)
    {
      idxA++;
    }
    else
    {
      idxB++;
    }
  }

  return EXIT_SUCCESS;
}

Listing 3.4: Result collation profile kernel


CHAPTER 4

General Architecture

Some general architectural decisions need to be made at the very start. The

accelerator units were designed to work alongside a host processor in a het-

erogeneous computing environment. An existing host microprocessor was

used instead of designing a system from the ground up, in order to exploit

existing software tools for development and testing. C++ was chosen as

the primary software language along with the use of STL as the default li-

brary. Some initial ideas of using a stack processor were also explored but

ultimately discarded.

4.1 Initial Considerations

It was clear that the work would involve studying both hardware and software operations.

Working at the hardware-software boundary allows greater leeway in determining where to draw the line between hardware and software functionality.

On the hardware side, microprocessor cores are increasingly becoming commodities

that can be allocated to a problem to solve it. Hence, an obvious method for accelerating

search operations would be to spread the task across multiple processors. This is the

most obvious path being taken presently by various microprocessor vendors from Intel

(desktop), Sun (server) and ARM (embedded).

Alternatively, an architecture can be designed to provide small and fast algorithmic

support for individual search sub-operations. This is a more flexible method of address-

ing the problem of generic search and would benefit the most applications. It must not

be an attempt at designing a hardware search engine as a hardware search engine would


certainly be speedy, but it would not be useful for much else.

On the software side, search algorithms can also be improved by changing the algo-

rithm architecture. Any improvement in the class of algorithm can have dramatic effects

on the performance. If a hardware accelerator can be designed to support improvements

in algorithm structure, it would prove to be doubly useful.

4.2 Hardware Architecture

From earlier considerations, it appears that the best way to accelerate search is to

spread out the work across multiple hardware cores. The question is the form of multi-

processing that this should take. Independent searches can definitely be split up across

multiple independent NUMA¹ machines. However, it may not be feasible to do this for

secondary searches that involve data from a common data set.

4.2.1 Multi-Core Processing

Homogeneous processors use multiple replicated copies of the same hardware core to

increase processing power. This is useful for software engineers as it would be easy to

model and distribute multiple tasks across a homogeneous hardware platform[Mer08].

Such hardware is suitable for general purpose computing but is less suitable for special

application computing as it consumes a large amount of chip resources, which may not

be used for the specialised computational task.

Heterogeneous processors use multiple dissimilar cores to increase processing power.

This is useful for hardware engineers as it is a more efficient usage of chip resource[Mer08].

However, it is more difficult for software to be written for it as each different core has

different computational and memory requirements. In fact, each core may even be of an

entirely different computing architecture.

However, as the focus of this research is very application specific (search only), it is

possible to use heterogeneous processing as the way to accelerate computational per-

formance. This can be accomplished by designing extensions to an existing processor

instead of designing an entirely new processor. This allows some functions to be im-

plemented in the host processor software instead of having to implement everything in

accelerator hardware. It would also allow us to easily benchmark the performance of

the software running with and without the accelerator. By exploiting an existing host

processor architecture, it would not be necessary to design an accompanying software

toolchain. This will ultimately simplify software development. Therefore, this was the

path chosen for the development of the hardware search accelerator.

¹ NUMA is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor [Wik09b].


4.2.2 Word Size

Although 64-bit microprocessors are slowly becoming the norm, most applications are

still overwhelmingly 32-bit. Therefore, there is little attraction in using a 64-bit word

length for either the hardware accelerator or the host processor. For expediency, the

internal architecture was selected to use a basic 32-bit word-length for the proof of

concept. However, there is no reason that the design cannot be completely converted

into a 64-bit design or higher, if necessary, for future work.

4.2.3 Host Processor

The role of the host processor is mainly to configure the accelerator and to supply it

with data. Its secondary role would be to provide comparisons between the software

and hardware search operations. This means that the host processor has a minor role

to play and should be minimalist. The focus of the host processor should be on size and

simplicity rather than raw computational performance.

Using an open-source processor would provide full access to the design, which will

facilitate hardware accelerator integration and simulation. There are several popular

microprocessor designs available under an open-source license. Many of these [Sun06,

Int07, LCM+06] microprocessors are unnecessarily complex. Therefore, a simpler soft

microprocessor architecture was chosen for the host processor.

An open-source Verilog implementation[Tan04] of the Microblaze[Xil04] is used as

the host processor. It is a DLX-like 32-bit RISC microprocessor which is mainly designed

for small embedded applications. In addition to an instruction and data memory bus,

it has the advantage of having a dedicated accelerator bus. This third bus can be used

as a private communication and configuration bus between the accelerator and host

processor. It was also designed by the author of this thesis who has intimate knowledge

of its inner workings and can easily modify it when necessary. It is also sufficiently

mature and independently tested by users of the processor.

4.3 Software Architecture

Hardware development has to go hand in hand with software development. Otherwise,

there will not be any way to exploit the advances made on the hardware platform. In

order to accelerate software development, libraries are used where possible.

4.3.1 Software Toolchain

The chosen processor has a mature C/C++ software toolchain based on the GNU Com-

piler Collection (GCC version 4.1.1). This simplifies writing software for the host pro-


cessor and also leverages on existing software libraries. This allows certain functionality

to be emulated in software, where necessary.

Although many arguments may be made about the relative performance of C versus C++ code, a decision was made to use C++ for development. The main reason is the

prevalence of high-level libraries for C++, while still being able to use C code that is

closer to the hardware. Using techniques from [Sak02], the accelerator can be accessed through low-level C routines in a library for the host processor. Simple code tests also show that the code generated for both languages is very similar in performance. The

main factor that determines code performance is the optimisation level used and not

the language.

The optimisation level of -O2 was used for almost all the code compiled for testing.

In early tests, it was found that the -O3 optimisation often resulted in a larger code

size and slightly slower running code. The -O1 optimisation often resulted in code that

made more memory accesses than was necessary. So, the final chosen optimisation level

was the best trade-off[Jon05] in terms of performance and size.

4.3.2 Standard Libraries

Initially, several third party libraries were used during testing. However, this caused

problems as varying results with different software libraries were encountered. As a

result, it was decided that a standard library had to be used for development and

testing. Certain libraries have been optimised for their specific applications, but of the

many popular libraries available, a decision was made to support the C++ Standard

Template Library.

Being a standard library, it is both robust and mature[Ste06]. It has met much

success through the years, making it suitable for the widest range of applications[Str94].

As a standard C++ library, it is the first port of call for many programmers, as most are familiar with it. As a result, it has been heavily used in existing applications.

Hence, there would be more trust in the integrity and performance of its results and

a hardware accelerator that is proven to be capable of accelerating C++ STL library

routines would have the widest benefit.

The C++ STL library presents many basic data structures (trees, lists, queues,

stacks) and related algorithms that operate on these data structures[SL95]. Through its

template architecture, these data structures can be easily wrapped around various data

types including user defined data types and data structures. This makes the C++ STL,

a very powerful library for writing software algorithms.
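
For example, the same container templates can wrap a user-defined record type without changing the algorithms that traverse them; the Person type below is purely illustrative.

    #include <list>
    #include <map>
    #include <string>

    // Illustrative user-defined record type wrapped by standard STL containers.
    struct Person {
        std::string name;
        std::string city;
        std::string pet;
    };

    // A record table keyed by ID, and an index mapping a pet type to the IDs of
    // its owners: both reuse the same template containers and algorithms.
    typedef std::map<int, Person>                   PersonTable;
    typedef std::map<std::string, std::list<int> >  PetIndex;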


4.3.3 Custom Library

Both custom mid-level and low-level accelerator libraries were written. This deci-

sion was taken to make the accelerator platform agnostic. The low-level library provides

different, primitive read and write routines to access the accelerator registers directly.

These are the only routines that need to be modified for different host processor archi-

tectures and systems level designs.

The mid-level library provides a user-friendly structure to the low-level interface

library. It provides data structures that can be manipulated by external software and

mid-level software routines to access the accelerator. This allows user software to be

abstracted as simple functions instead of calling the hardware routines directly.
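
As an indication of how the two layers divide the work, the sketch below uses hypothetical routine and structure names; only the low-level layer touches the hardware, so it is the only part that changes between host platforms.

    typedef unsigned int u32;

    // --- Low-level layer (platform specific) ---------------------------------
    // Primitive register access over the accelerator bus. These are the only
    // routines that need rewriting for a different host processor or bus.
    static inline void hsx_reg_write(volatile u32 *reg, u32 value) { *reg = value; }
    static inline u32  hsx_reg_read(volatile u32 *reg)             { return *reg;  }

    // --- Mid-level layer (platform agnostic) ---------------------------------
    // A user-visible structure describing one accelerator channel, and friendly
    // routines that hide the register-level details from user software.
    struct hsx_channel {
        volatile u32 *conf;   // location of the channel's configuration register
    };

    static inline void hsx_configure(hsx_channel *ch, u32 value)
    {
        hsx_reg_write(ch->conf, value);   // user code never pokes registers directly
    }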

4.4 Initial Architecture

Figure 4.1 shows an early conception of a potential accelerator architecture. Although

the final design is significantly different from this, it is beneficial to introduce this early

design. It helps to understand the train of thought and also the reasons for changes that

were made in the end. The accelerator was broken up into three modular sections: bus

interface, accelerator and memory interface.

[Figure 4.1: Initial hardware search accelerator architecture — an N-element array of HSE accelerator cores with L1/L2 caches and a crossbar (X-CON), placed between a bus interface unit (external bus to the host controller) and a DDR interface unit connected to the data space memory (DDR SDRAM)]

Bus interfacing was made a modular section in order to make the accelerator platform

agnostic. This interface is used to interact with any host processor and will depend

on the host architecture. For example, a HyperTransport module could be used for

interfacing with AMD processors on a Torrenza platform or a PCIe module could

be used for generic PC interfacing. However, this is the subject of an end-user

application and is not directly pertinent to this research.

RAM interfacing was made a modular section for a similar reason. This allows the

accelerator to access the popular memory technology of the day. Today, this is

29

Page 44: Design and Development of a Heterogeneous Hardware Search … · 2009. 7. 16. · way that the accelerator units can be combined like LEGO bricks, giving this solution flexibility

DDR2-SDRAM but there is no reason that the accelerator should not be used on

any future technology. Again, this is subject to an end-user application and is not

directly pertinent to this research.

Element Array actually makes up the main part of this research. This holds a number

of small accelerator cores with different interfaces on each side. The cores can

be linked together to form a search pipeline.

It was important to see if this accelerator architecture was viable. In this case, it

was, because there was a method to configure a number of accelerators with access from

the host processor and to memory. For this exercise, the details of each block were not

important. However, there were a couple of potential bottlenecks with this architecture.

Firstly, memory was going to be a bottleneck as all the accelerator units access mem-

ory through the same memory interface. However, without actually changing existing

computing architecture practices, there is little that can be done. Ultimately, all the

data sits in a common main memory that needs to be accessed by the accelerators.

There are existing techniques to increase memory bandwidth and such details can be

handled by the modular memory interface.

Secondly, the communication with the host processor through the accelerator bus

is another bottleneck. This is used for configuration purposes, which should be fairly

light on traffic. However, it is also used to access the results of the search operations.

Depending on how the search pipelines were configured, this may be a fairly significant

amount of traffic. Therefore, as much of the operation should be offloaded to the accel-

erators as possible. This will reduce the traffic to only the ones relevant to retrieving

results.

4.4.1 Stack Processors

In the past, stack processors have had limited success in mainstream applications. How-

ever, many general purpose stack processors have been studied in [Koo89] and they

are just as powerful as mainstream RISC/CISC architectures. In recent years, stack

processors are starting to see a revival such as in [LaF06, Pay00] for general purpose

use.

All recursive algorithms use a stacking model of operation. Although it is not strictly

necessary to traverse a tree recursively as explained in [Kor87], recursive algorithms

are usually implemented as such. Hence, it made sense to use a stack architecture in

the design as it was intrinsically suited. There are many advantages to using a stack

processor such as, fast procedure calls and returns and reduced instruction complexity,

all of which reduces computational cost for search. So, the search accelerator design

began around a stack architecture.
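
To see why a stack maps so naturally onto tree traversal, consider an in-order traversal written with an explicit stack instead of recursion; this sketch is illustrative only and is not the accelerator's algorithm.

    #include <stack>

    // Illustrative binary tree node.
    struct TreeNode {
        int       value;
        TreeNode *left;
        TreeNode *right;
    };

    // In-order traversal with an explicit stack instead of recursion. Every push
    // corresponds to descending one level and every pop to returning from one, so
    // a stack machine can track the traversal depth simply by counting its own
    // push and pop operations.
    template <typename Visit>
    void inorder(const TreeNode *root, Visit visit)
    {
        std::stack<const TreeNode *> path;
        const TreeNode *node = root;
        while (node != 0 || !path.empty()) {
            while (node != 0) {          // descend as far left as possible
                path.push(node);
                node = node->left;
            }
            node = path.top();           // pop = return up one level
            path.pop();
            visit(node->value);          // process this node
            node = node->right;          // then traverse the right sub-tree
        }
    }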


However, designing a custom stack microprocessor also required writing a custom

toolchain for it. Forth is a commonly used high-level language for programming stack

based machines, though it is also possible to use other languages. As the research was

into the design of an accelerator, a choice was made, fairly early in the design process,

to design it independent of processor architecture. Therefore, it did not make sense to

design a special purpose microprocessor architecture to process it.

Figure 4.2 shows an initial idea for a custom stack-based accelerator unit. Although

it was ultimately decided to abandon the idea of a stack-based processor, this figure

presents some interesting ideas. The figure shows the use of a pointer engine, which is

an addressing device to off-load the calculation of pointers. However, as most dynamic

data structures are not calculated, the pointer engine ultimately became a simple look

up device.

[Figure 4.2: Initial stack based accelerator architecture — an HSE element with decode, branch and ASLU logic, data and address stacks (TOS/NOS), a program counter, a pointer engine (PENGINE) and instruction, data and micro caches (ICACHE, DCACHE, UCACHE) behind the L1/L2 cache and crossbar (X-CON)]

Using a stack architecture also added an extra level of information available to the

processor. It allowed the hardware to keep track of the level of tree traversal by keeping

count of the push and pop operations. This idea was retained in order to exploit the

stack level information. This will be explored later in Section 8.6.2.


CHAPTER 5

Streamer Unit

The streamer is the simplest accelerator unit to understand. Functional and

timing simulation results show that the streamer off-loads the work of the

host processor but does not achieve any significant speed-up. However, as a

simple to understand unit, it is used to illustrate the steps taken in writing

the simulation software kernel and simulating the device.

5.1 Introduction

The design of the accelerator can start from any part of the pipeline. But, of the

different stages of the search pipeline, the simplest operation to perform is the list

retrieval operation. Therefore, this can be used to illustrate the processes involved,

while keeping everything else simple.

In any search query, once a key has been located, the next operation is to extract

one or more records that are related to the key. In STL terms, a map data structure

could be used to map a key to any secondary data structure, such as a list. In the case

of the example search query, once the key cat is found, a list of records that contain cat

can then be retrieved from memory.

Hence, the next operation is to pull results into the accelerator. This secondary

structure would typically be stored in a pointer linked data structure. Although it is

by no means limited to being a linked list, a linked list is used as an example because it is the

most primitive dynamically linked data structure. In order to accelerate the processing

of this list, a streamer unit can be used.


5.1.1 Design Considerations

From the beginning, it is fairly clear that a streamer unit would provide no significant

improvement in performance. List retrieval operations are inherently memory bound.

The software operation of processing a list is an O(N) bound function and the streamer

unit is also similarly bound.

Therefore, the objective of the streamer unit is not to speed things up, but to offload

the task from the host processor. As this is not computationally complex, there is little

to differentiate the hardware and software operations - it can either be implemented in

hardware or software. The only question is the differing amounts of acceleration and

costs involved. Its main function is to bypass the host processor in supplying data to

the other accelerator units from main memory. This task of pulling in data can be likened to a form of Direct Memory Access (DMA).

However, a regular DMA is designed to move data between main memory and devices

in large contiguous blocks but it is not data-structure-aware as it does not treat complex

data structures any differently from random memory blocks. This is not suitable as the

individual nodes of a data structure could be located anywhere within the heap and

need not necessarily be contiguous, which results in bandwidth wastage. A streamer

unit is designed to be data structure aware and it moves data that is needed, in a set

order, and from the correct memory locations, into the accelerator.

In addition to being data-structure-aware, it is also results-aware and the streamer

will walk through a data structure and extract potential results from it. Therefore,

although its primary function is to supply the accelerator units with data, it can also

be used standalone, to independently extract data for use by the host processor in any

application.

5.2 Architecture

Figure 5.1 illustrates an abstract level view of the flow of data through a streamer. A

streamer unit walks through a data structure and converts the data structure into a

stream of data values. These data values represent the results from the list retrieval

operation. All that is needed to achieve this simple operation is a simple machine

structure. This simplicity means that the streamer can be easily implemented at low

cost.

[Figure 5.1: Streamer data flow — a data structure enters the streamer and leaves as a stream of data values]


Figure 5.2 shows the architectural view for a streamer unit. It is a three port device,

with one memory port, one output port and one configuration port. The memory port is

connected to the data memory and cannot be accessed from the host processor directly.

The configuration port is used to access the configuration stack. The retrieved data

stream is available on the output port. The output port and configuration port can be

accessed via the accelerator bus.

[Figure 5.2: Streamer block — the configuration stack (NODE, DATA, NEXT, SIZE, CONF) on the configuration port, a memory port to the data memory, and an output FIFO on the stream port (STREAM0)]

A note needs to be made about the memory. Memory limitations will be considered

in detail at a later stage. For now, and for the next few chapters, memory can be

considered an abstract device with unlimited space and bandwidth. However, for the

simulation results, the host processor and accelerator units are connected to a shared

memory pool via a round-robin memory arbiter. Although connected this way, this will

not present any problems in our simulations as both hardware and software operations

are run separately. So, the issue of memory contention is avoided.

5.2.1 Configuration

The software library hsx/stream.hh provides several software functions to access and

configure the streamer unit. There are four streamer channels defined in hsx/types.hh

as HSX_STREAM0 through to HSX_STREAM3 but these are not hard limits and are easily

changed in software. These identifiers specify the exact streamer channel to access on

the accelerator bus. This allows the streamers to be configured and accessed by the host

processor.

The configuration of a streamer is managed by a series of registers organised as a

stack. Figure 5.3 illustrates the structure of the streamer configuration stack. There is

no reason that the configuration registers cannot be organised in a different way, such

as a memory mapped structure. The reason that this structure was chosen is to simplify

the configuration operation in hardware as only one configuration port is needed.

Only the CONF register is actually accessible on the accelerator bus and functions as

the top-of-stack register. Each write to this register will push the values down the stack.

To completely configure the streamer unit, the values need to be written in the order:

NODE, DATA, NEXT, SIZE, CONF. All these details are managed by the hsxSetStream()

function and the user need not be concerned with the details.


[Figure 5.3: Streamer configuration stack — the CONF register (bits 3-0: ROK, MOD, ENA, RST) forms the top of stack, above the SIZE, NEXT, DATA and NODE registers, whose two lowest bits are always zero]

NODE contains the pointer to the base node, which is the first data item in the data struc-

ture. This can be obtained for standard STL data structures using the begin()

method for each structure. Both the following offset registers specify positive

offsets from this value. The two lowest bits are zero as data is assumed to be

word-aligned in memory, which is a fair assumption for most 32-bit processors.

DATA contains the offset to the data value within the node. This offset is added to the

NODE pointer to obtain a memory location, which holds the actual value that gets

pulled in from the data structure into the results stream. Again, this offset is

assumed to be word-aligned.

NEXT contains the offset to the pointer for the next node. This offset is added to the

NODE pointer to obtain a memory location, which holds the link pointer to the

next node in the data structure. This pointer overwrites the existing base pointer

before iterating through the stream cycle. The stream cycle forms the actual

pointer following operation.

SIZE contains the number of items that are to be retrieved. This value is used in a

counter that is decremented after each iteration of the stream cycle. When this

counter reaches zero, the stream cycle is halted. The size of a STL data structure

can be obtained using the size() method.

CONF is the configuration and status register. Figure 5.3 shows the configuration bits for

the CONF register. As pointers are all word-aligned in memory, the enable and reset

bits would not be enabled by any of the other registers. This feature ensures that

the streamer is cleared and enabled only when it is needed. The hsxSetStream()

function also resets the unit before configuring and enabling it.
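
Putting the register descriptions together, a configuration call might look roughly like the sketch below. The write order (NODE, DATA, NEXT, SIZE, CONF) and the hsxSetStream() name follow the description above, but the node layout, the argument list and the low-level write primitive are illustrative assumptions rather than the library's actual code.

    #include <stddef.h>

    // Illustrative node layout; in practice the offsets depend on the structure
    // being streamed (for example, an STL list node).
    struct Node {
        unsigned int data;   // value to be streamed out
        Node        *next;   // pointer to the next node
    };

    // Hypothetical low-level primitive: push one word onto the given streamer
    // channel's configuration stack via its CONF register. Shown as a stub here
    // so that the sketch is self-contained.
    static void hsxWriteConf(int channel, unsigned int value)
    {
        (void)channel; (void)value;   // real code would write to the accelerator bus
    }

    // Sketch of hsxSetStream(): push the five values in the required order
    // NODE, DATA, NEXT, SIZE, CONF (32-bit target assumed for the pointer cast).
    void hsxSetStream(int channel, const Node *head, unsigned int count,
                      unsigned int confBits)
    {
        hsxWriteConf(channel, (unsigned int)(size_t)head);    // NODE: base node pointer
        hsxWriteConf(channel, offsetof(Node, data));          // DATA: offset of the value
        hsxWriteConf(channel, offsetof(Node, next));          // NEXT: offset of the link
        hsxWriteConf(channel, count);                         // SIZE: number of items
        hsxWriteConf(channel, confBits);                      // CONF: reset, enable and mode bits
    }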

5.2.2 Operating Modes

There are two modes for the streamer output that can be configured using the CONF:MOD

register bit. These modes determine the operating mode of a streamer unit. Figure 5.4


depicts these basic modes.

MODE PUMP stops the data stream from being automatically streamed away. Essentially,

it disables the read enable signal on the output buffer, which prevents any other

attached accelerator unit from streaming the data away. The host processor now

needs to manually read the streamed values, using the accelerator bus. This con-

figures the streamer to be used in standalone mode, where the streamer is used as

an independent device to pull data into the processor from memory.

MODE PIPE tells the streamer to pipe the data stream directly through to another at-

tached device. This is another accelerator unit, typically the sieve unit. This

mode bypasses the host processor and provides the highest streaming speed pos-

sible within hardware as the process flow is directly controlled in hardware.

[Figure 5.4: Streamer operating modes — in MODE_PUMP the host reads each streamer's output FIFO directly, while in MODE_PIPE the streamer FIFOs feed another accelerator unit such as the sieve]
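
In MODE_PUMP the host therefore has to drain the output FIFO itself over the accelerator bus. A rough sketch of such a loop is given below; hsxStreamReady() and hsxGetStream() are hypothetical stand-ins for whatever read routines the low-level library provides (the ready check corresponds to something like the CONF:ROK bit).

    #include <vector>

    // Hypothetical low-level reads over the accelerator bus: a flag that tells the
    // host whether the streamer's output FIFO holds data, and a routine that reads
    // one streamed word out of that FIFO.
    bool         hsxStreamReady(int channel);
    unsigned int hsxGetStream(int channel);

    // Sketch of standalone (MODE_PUMP) use: the host polls the streamer and copies
    // each streamed value out of the output FIFO itself.
    std::vector<unsigned int> drainStream(int channel, unsigned int expected)
    {
        std::vector<unsigned int> values;
        values.reserve(expected);
        while (values.size() < expected) {
            if (hsxStreamReady(channel))                 // wait for data in the FIFO
                values.push_back(hsxGetStream(channel));
        }
        return values;
    }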

5.2.3 State Machine

Figure 5.5 shows the finite state machine controlling the streamer. There are four states,

each running at full clock speed. The main stream cycle consists of the NULL, DATA and

NEXT states.

[Figure 5.5: Streamer machine states — IDLE (configuration load), NULL (cycle check while SIZE > 0), DATA (data load) and NEXT (pointer load)]

IDLE is the default state and is entered as a result of either a soft or hard reset. All

internal control signals are reset to their default values during this stage. The

value of NODE register is copied to the internal node pointer, the DATA register is

copied to an internal offset register and the SIZE register is copied to the internal

counter.

NULL state performs a cycle check. If the internal counter is zero or if the internal node pointer is a null pointer, the stream cycle is terminated by locking in this state, blocking until a reset is received. Otherwise, the data value is addressed by adding the internal node pointer value to the internal offset register. The NEXT register is then copied to the internal offset register.

DATA state performs a single data read transfer on the memory port. The appropriate

memory control signals are asserted and de-asserted according to the transfer

protocol and the data item is read directly into the output buffers. At the same

time, the next data pointer is addressed by adding the internal node pointer value

with the internal offset register. The DATA register is then copied to the internal

offset register as before.

NEXT state performs another single data read transfer on the memory port. The pointer

is loaded directly into the internal node pointer, over-writing the existing pointer

and the internal counter is decremented by one.

From this state machine, it takes three clock cycles to stream out a single 32-bit word of data. Assuming that the streamer channel runs at a nominal 100 MHz, the theoretical maximum data streaming speed of a single channel is 1.066 Gbps (1/3 × 100 MHz × 32 bits). However, it loads a 32-bit data word and a 32-bit pointer during each iteration, giving a theoretical maximum memory bandwidth of 2.13 Gbps.
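As a behavioural summary of this cycle (a software sketch only, not the hardware implementation; the parameter names mirror the internal registers described above):

    // One loop iteration corresponds to the NULL -> DATA -> NEXT cycle:
    // NULL checks the counter and the node pointer, DATA reads the value at
    // node + data_offset into the output stream, NEXT follows the link at
    // node + next_offset and decrements the counter.
    void stream_cycle(char *node, int data_offset, int next_offset,
                      int size, int *out)
    {
        int counter = size;                               // copied from SIZE
        while (counter != 0 && node != 0)                 // NULL: cycle check
        {
            *out++ = *(int *)(node + data_offset);        // DATA: read transfer
            node    = *(char **)(node + next_offset);     // NEXT: follow pointer
            counter--;
        }
    }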

5.3 Streamer Simulation

To measure the streamer performance, software simulation was used. A streamer kernel

was written in C++ to compare the performance of software and hardware streaming

methods. Extracts of the source code are listed at the end of the chapter. Each method

was timed and measured using a unitless tick count, which was then used to obtain the

speed-up factor.

The kernel first created and filled an input list (lines 72–76) with random values,

the number of which was determined during compile time. The input was then sorted

(line 79) and the same data set was used for both the hardware and software streams for

comparison. Debug output was obtained via iprintf(), an integer-optimised version

of printf() with a smaller memory footprint. Debug and non-debug builds were selectively enabled using conditional defines, and the results for both streaming operations were

compared using simple text scripts.

Figure 5.6 shows the simulation virtual hardware setup. This setup is used for

simulating the other accelerator units as well. The host processor is connected to the

accelerator units via a dedicated accelerator interface, which is used for both data and

control operations. The processor reads software instructions from instruction memory

that is directly connected. Each accelerator and the host processor use a shared data memory pool that is accessed through a round-robin memory arbiter. Contention in the memory arbiter between the host processor and an accelerator unit is avoided by performing each software and hardware streaming operation individually.

Figure 5.6: Accelerator unit simulation setup

5.3.1 Kernel Functional Simulation

Listing 5.1 shows the debug output for a data set of N = 30. This verifies that the

same results are obtained for software and hardware streamers. It is important to

verify that the hardware operation produces the same result before useful performance

measurements can be made.

Initially, the streamer unit appears to be of little worth as an accelerator. Although the bulk of the tick count was consumed by the iprintf() function, the number of ticks required to perform either software or hardware streaming was similar. In fact, the hardware accelerated operation took slightly longer. However, this is not the whole picture and the situation will be explored further in subsequent chapters.

Listing 5.3 shows the sample software used to configure the hardware streamer as an independent DMA device to free up the processor for other functions. In actual applications it would be prudent to check that the buffers are not empty before extracting a result; in this example, the results were extracted (line 60) without checking that they exist in the buffers, because the hardware streamer unit pulls in data at a much higher rate than the host processor consumes it. Checking can be done by reading the status of the CONF:ROK bit.
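A sketch of such a check is shown below; the polling pattern follows the one used for the sieve in Listing 6.6, and the assumptions that hsxGetConf() also works on streamer channels and that CONF:ROK sits at bit 5 are illustrative only:

    // Poll CONF:ROK before each extraction so that an empty output buffer is
    // never read. The bit position and use of hsxGetConf() on a streamer
    // channel are assumptions for illustration.
    for (int i = 0; i < LIST_MAX; ++i)
    {
        while (!(hsxGetConf(HSX_STREAM2) & (1 << 5)))   // wait until data is ready
            ;
        volatile int j = hsxGetData(HSX_STREAM2);
    }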

The way in which the configuration parameters for the list were obtained may look

somewhat convoluted and may require some explaining. This depends entirely on how

the data structure was defined and this was just one method of extracting the param-

eters. The values that were needed are the base pointer, data and next pointer offsets

and the structure size. The information was extracted from studying the C++ STL

linked list data structure source file bits/stl_list.h. For a user-defined data struc-

ture, extracting the necessary offsets and pointers should be fairly straightforward. Any

slow-down due to this added complexity was taken into account by considering this


configuration time as a fixed overhead cost.
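For a user-defined structure the parameters can be obtained far more directly. The following sketch is illustrative only (the node layout, head pointer and count variable are hypothetical and not part of the accelerator library); it fills the same configuration fields using offsetof():

    #include <cstddef>      // offsetof

    // Hypothetical user-defined list node; the field names are illustrative.
    struct MyNode {
        int     value;
        MyNode *next;
    };

    // Fill the streamer configuration directly from the node layout.
    hsxStreamConfig cfg;
    cfg.conf.bits.mode = HSX_STREAM_PUMP;
    cfg.node = (int) head;                      // base pointer to the first node (assumed)
    cfg.data = offsetof(MyNode, value);         // DATA: offset of the stored value
    cfg.next = offsetof(MyNode, next);          // NEXT: offset of the link pointer
    cfg.size = count;                           // SIZE: number of items to stream (assumed)
    hsxSetStream(HSX_STREAM2, cfg);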

Listing 5.2 shows an equivalent function performed by the software to stream data

in. In essence, the software operation needed only a few instructions to perform the

stream cycle. As mentioned earlier, this is not computationally intensive and is reflected

by the simple for-loop used. In pseudo assembly, the code assembles into:

    DO
        data    = LOAD (pointer + data_offset)
        pointer = LOAD (pointer + pointer_offset)
        STORE (data, output_location)
    LOOP

Listing 5.5: Streaming pseudocode

5.3.2 Kernel Timing Simulation

To obtain a more accurate timing measurement, a non-debug streamer kernel was used. This removes the iprintf() overhead, so all tick counts were consumed purely by the streamer operation. Listing 5.6 shows the output of the kernel timing simulation. The counts were still similar, with the hardware configuration overhead consuming several memory cycles more than the software operation.

Figure 5.7 shows an extract from the streamer kernel simulation timing diagram for

one iteration. The time values quoted are all unitless and ten units correspond to one

tick. The diagram focuses on the important signals while running the hardware streamer

kernel to detect events as they happen in hardware. There are a number of markers on

the diagram, indicated by the vertical dotted lines, from left to right:

A (1067870) is an estimated mark for the start of the configuration phase of the

streamer. Some parts of the configuration phase happen before this and can be

considered kernel function call overhead. This point corresponds to the first trans-

fer initiated by the hsxSetStream(HSX_STREAM2, cfg) function call. The first

part of this function asserts the CONF:RST bit to reset the streamer state machine

and flush the buffers as indicated by the clr_i signal (1). After this, the con-
figuration registers are transferred onto the configuration stack, indicated by the
multiple xwb_stb_i and xwb_ack_o handshakes (2). The same technique of writ-

ing to the CONF:RST bit is used to reset, flush, and configure the other accelerator

units.

B (1068880) marks the end of the configuration phase of the streamer. The final task of

the hsxSetStream(HSX_STREAM2, cfg) function call is to enable the accelerator.

At this point, the CONF:ENA bit is asserted, which starts streaming data from

memory immediately as indicated by the ena_i signal (3). The memory transfers
are indicated by the multiple dwb_stb_o and dwb_ack_i handshakes (4). The same

technique is used to start the other accelerator units.


C (1071360) shows the time at which the output buffers are full. The output buffers

are 15 levels deep and are pushed and popped on each wre_i and rde_i assertion.

By counting the number of buffer pushes and pops, it is evident that the buffers
will stall at C (5). The streamer slows down and waits for items to be extracted

from its output buffer. This is because the hardware streams data in at a much

higher rate than the software extraction.

D (1071780) marks the time when the streamer stops running. The streamer does not

stop until the size counter decrements to zero. The number of items read into

the output buffer is 30 in this case, as defined during compile time. This can be

determined by counting the number of wre_i assertions after B (6).

E (1074010) marks the estimated point when the stream kernel function ends. The

number of items extracted into the host processor is also 30 in this case. This

can be determined by counting the number of rde_i assertions after B (7). After

this, there are a few more operations to perform before the kernel function returns

control to the main process; these can be considered the kernel function return

overhead.

As seen from the result, the time taken to perform the actual hardware streaming was much shorter than the total hardware operation time. Although the total hardware streamer operation time was TAE = 614 ticks, the time taken to extract the stream into the host processor was only TBE = 513 ticks (83.6%). This makes sense as the hardware streamer immediately began to fetch data into the processor at point A and did not stop until it was completed. The hardware configuration overhead was TAB = 101 ticks (16.4%).

From the output, the total streamer kernel consumed THW = 852 ticks, which makes the function call and return overhead T+ = 238 ticks (+38.8%). Assuming a similar function call and return overhead for the software operation gives TSW = 414 ticks. Therefore, the hardware streamer operation actually ran slower than the software operation. From the timing diagram estimate, the speed-up factor is:

TSW / THW = 0.67, N = 30

However, there are several points to note in this estimation. Inspecting the source

code will show that the function call and return overhead is not 238 ticks. A large

proportion of it was actually used up by the operation to extract the values for configu-

ration, which happens prior to point A. Assuming the function call and return overhead

is similar to that of the other accelerator units, of about 100 ticks, the results are then

different.

Figure 5.7: Streamer timing diagram

The total hardware operation would require THW = 752 ticks, of which a total of TAB = 224 ticks (29.8%) are used for hardware configuration overhead. The software operation time would require TSW = 552 ticks. This gives a slow-down factor of 0.73, which makes more sense, but is still very slow.

On closer inspection, the amount of time taken to actually complete the hardware stream was TBD = 290 ticks, including stalling. The additional time at the end,

TDE = 223 ticks is used by the host processor to retrieve the result values from the

operation. Therefore, if the results are not retrieved in software but piped directly to

the other accelerator units, there is a potential hardware speed-up of 1.90. This is a good

sign as the hardware streamer unit is mainly going to be used to stream information

into another hardware accelerator unit.
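For reference, the two figures quoted above are simply the tick-count ratios, restated here under the same overhead assumption:

\[
  \frac{T_{SW}}{T_{HW}} = \frac{552}{752} \approx 0.73 ,
  \qquad
  \frac{T_{SW}}{T_{BD}} = \frac{552}{290} \approx 1.90
\]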

5.3.3 Kernel Performance Simulation

In order to eliminate the large uncertainties from the timing diagram estimates, multiple

sampling was used for simulation. The streamer kernel was compiled for different data

set sizes between N = 10 and N = 150 to extrapolate a trend. Each kernel was run and

sampled 50 times to obtain a range of software and hardware tick counts. In each case

the data set was first prepared with random values.

The number of samples chosen was a trade-off between simulation time and accuracy.

Increasing the sample size would improve statistical accuracy in the results. However,

sampling 50 times for each data set took about a day to complete the entire simulation

run. The simulation had to be re-run each time the design was modified.

Figure 5.8 shows the simulation results for different data sets. The points were plotted with y-error bars to mark the mean and standard deviation of the result set, but the error bars are not visible as the results are fairly consistent. Both the software and hardware curves reflect the O(N)-bound nature of a list retrieval operation. Extrapolating linearly from the graphs yields the following relationships:

Figure 5.8: Streamer performance simulation


Msw(N) = 22.0N + 94.0 (5.1)

Mhw(N) = 22.6N + 241.7 (5.2)

Equation 5.1 describes the performance of the software streamer. The intercept of

94 is similar to the timing-estimated function call and return overhead of 100. Equation

5.2 describes the performance of the hardware streamer unit. The intercept of 242

is close enough to the estimated overhead from the timing diagram, given the complex parameter extraction. From the graph, the speed-up ratio Msw(N)/Mhw(N) of the streamer for N = 30 agrees well with the timing estimate of 0.73, and the ratio for a sufficiently large data set, N → ∞, is:

Mup = Msw(N)/Mhw(N) = 0.97
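Since both fits are linear in N, this asymptotic figure is simply the ratio of the slopes in equations 5.1 and 5.2:

\[
  M_{up} = \lim_{N \to \infty} \frac{M_{sw}(N)}{M_{hw}(N)} = \frac{22.0}{22.6} \approx 0.97
\]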

This is, in effect, a slight (3%) slow-down rather than a speed-up. As mentioned earlier (section 5.3.1), streaming is a very simple operation that the host processor can, if necessary, handle fairly efficiently. The hardware state machine performs the same series of loops with the same number of memory accesses. Therefore, it does not run any faster than the software kernel. It runs slightly slower due to the difference in memory contention between the software and hardware kernels. The software kernel had the entire memory bus to itself, while the hardware kernel had to share parts of it with the running software application.

However, from section 5.3.2, it is apparent that the streamer unit actually pulls in

data at a higher rate than the software is able to remove it. Therefore, the streamer unit

would spend a large proportion of its time stalling while waiting for the host processor

to pull data off. In addition, the hardware streamer works independently of the host

processor and can consequently be used to offload the streaming operation from the host

processor. In terms of using it to feed data to other acceleration units, the exact effects

are not presently known but will be explored in subsequent chapters.

5.4 Conclusion

The streamer unit can be used to accelerate a list-retrieval operation on a dynamically

linked structure in two ways. When used as a standalone unit, it can be used to offload

the task of pulling data into the host processor. This will reduce the workload of the

host processor but provides little additional benefit. If used in combination with other

acceleration units, in addition to offloading, it can potentially pipe data through at a

much higher rate than the host processor can. However, this assumes that the attached


unit would consume data at a sufficiently high rate to prevent the streamer unit from

stalling, which is a limitation.

The streamer unit is ultimately an O(N)-bound machine. The hardware design is fairly optimised, with multiple operations layered across the machine states. It has a speed-up factor of Mup = 0.97, which is close enough to the software operation. This makes it useful, as the worst case is not significantly detrimental to the list retrieval operation while it frees up the host processor for other operations.

The maximum external memory bandwidth that is required for each streamer unit

is 2.13Gbps at 100MHz while the maximum internal data transfer rate is 1.066Gbps at

100MHz.


Software Stream

HIT : 0x56

HIT : 0x121A

HIT : 0x1244

HIT : 0x281B

HIT : 0x2BBD

HIT : 0x2D13

HIT : 0x2E80

HIT : 0x37AE

HIT : 0x38D4

HIT : 0x3F17

HIT : 0x5147

HIT : 0x536E

HIT : 0x64C8

HIT : 0x65E6

HIT : 0x6F80

HIT : 0x7007

HIT : 0x7CB4

HIT : 0x7D4E

HIT : 0x8957

HIT : 0x9903

HIT : 0xAFF3

HIT : 0xB06A

HIT : 0xC596

HIT : 0xCD1F

HIT : 0xD3EE

HIT : 0xD8CF

HIT : 0xE89A

HIT : 0xE9B2

HIT : 0xF54E

HIT : 0xFBF4

182638 swticks

10079 swmemticks

Hardware Stream

HIT : 0x56

HIT : 0x121A

HIT : 0x1244

HIT : 0x281B

HIT : 0x2BBD

HIT : 0x2D13

HIT : 0x2E80

HIT : 0x37AE

HIT : 0x38D4

HIT : 0x3F17

HIT : 0x5147

HIT : 0x536E

HIT : 0x64C8

HIT : 0x65E6

HIT : 0x6F80

HIT : 0x7007

HIT : 0x7CB4

HIT : 0x7D4E

HIT : 0x8957

HIT : 0x9903

HIT : 0xAFF3

HIT : 0xB06A

HIT : 0xC596

HIT : 0xCD1F

HIT : 0xD3EE

HIT : 0xD8CF

HIT : 0xE89A

HIT : 0xE9B2

HIT : 0xF54E

HIT : 0xFBF4

183040 hwticks

10129 hwmemticks

Listing 5.1: Streamer kernel debug output

int swstream (std::list<int> &listA)
{
    for (std::list<int>::iterator node = listA.begin();
         node != listA.end(); node++)
    {
        volatile int j = *node;

#ifdef DEBUG
        iprintf("HIT\t: 0x%X\n", j);
#endif
    }
}

Listing 5.2: Software streamer kernel


 44  int hwstream (std::list<int> &listA)
     {
 46      std::list<int>::iterator node;
         hsxStreamConfig cfg;

         cfg.conf.bits.mode = HSX_STREAM_PUMP;
         cfg.node = (int) &*listA.begin()._M_node;                            // node base
 51      cfg.next = (int) &node._M_node->_M_next;                             // next offset
         cfg.data = (int) &((std::_List_node<int>*) node._M_node)->_M_data;
         cfg.size = LIST_MAX; // listA.size();

         hsxSetStream (HSX_STREAM2, cfg);
 56
         // pull data
         for (int i=0; i<LIST_MAX; ++i)
         {
             volatile int j = hsxGetData (HSX_STREAM2);
 61
     #ifdef DEBUG
             iprintf("HIT\t: 0x%X\n", j);
     #endif
         }
 66  }

Listing 5.3: Hardware streamer kernel

 68  int stream()
     {
         std::list<int> listA;
 71
         // prefill lists
         for (int i=0; i<LIST_MAX; ++i)
         {
             listA.push_back(getrand() & 0x0000FFFF);
 76      }

         // sort lists
         listA.sort();

 81      // do sieve
         int ticks;
         int memtick;

         // SOFTWARE STREAM
 86      iprintf("Software Stream\n");
         memtick = getmemtick();
         ticks = gettick();
         swstream(listA);
         ticks = gettick() - ticks;
 91      memtick = getmemtick() - memtick;
         iprintf("%d swticks\n", ticks);
         iprintf("%d swmemticks\n", memtick);

         // HARDWARE STREAM
 96      iprintf("Hardware Stream\n");
         memtick = getmemtick();
         ticks = gettick();
         hwstream(listA);
         ticks = gettick() - ticks;
101      memtick = getmemtick() - memtick;
         iprintf("%d hwticks\n", ticks);
         iprintf("%d hwmemticks\n", memtick);

         return EXIT_SUCCESS;
106  }

Listing 5.4: Streamer kernel

Software Stream

652 swticks

98 swmemticks

Hardware Stream

852 hwticks

120 hwmemticks

Listing 5.6: Streamer simulation output (non-debug)


CHAPTER 6

Sieve Unit

The sieve unit is a primitive computational unit. It can be configured to

perform multiple operations and also takes inputs from multiple sources.

The functional and timing simulation results show that it can improve per-

formance of a simple boolean query by 5.2 times when used in conjunction

with the streamer unit in a hardware pumped configuration.

6.1 Introduction

After a list of results is extracted during list retrieval, a common operation required at

the end of a search is to collate the results. If the results do not need to be collated,

they can be buffered and passed through to the host processor as the final results. Both

of these functions are performed by a sieve unit.

Of the two operations, collation is computationally intensive and is a suitable can-

didate for hardware acceleration. It involves traversing the results list and comparing

each individual element to see if it matches one or more filter criteria. The example

query would need to compare a list of records to see if each contains both the words cat and cambridge.

Unlike the streamer, a sieve works on the actual values of the results and is not

data-structure-aware. As the name suggests, it is designed to filter through a results list

to extract only the results that fit a specific criterion. Common operations that would

need to be performed are intersection and union of two lists.

As the sieve unit is used at the end of a search pipeline, it can also serve as a results buffer, extending the output buffer capacity of any other acceleration unit. This can help alleviate bottlenecks that are caused by buffer stalls, as seen in the streamer unit previously.

6.1.1 Design Considerations

To perform these functions, some factors need to be considered. The process of intersecting two lists can be computationally intensive. Each item in the first list needs to be checked against the items in the second list. Assuming that the two lists are of equal length, a naïve algorithm would do this operation in O(N²) time. If the lists are unsorted, this is the best possible performance of any sieve operation.

However, in this case, the problem lies with the data structure rather than the

computation. For optimal operation, data items should be inserted into a list in sorted

order. If the lists are in sorted order, a smarter algorithm would be capable of reducing

this operation to O(N) time.

However, sorting is a computationally intensive operation that is O(N log N) bound.

Assuming that insertion happens far less often than data selection, it should be per-

formed during data insertion. For most common applications, this assumption is valid.

In this case, the cost of sorting the data can be easily amortised over a number of select

operations, which is more efficient than sorting the data after a select operation.

If the minimum and maximum values are known, the sieve operation can be further

reduced to O(log N) time. This can allow a binary search to be used to quickly elim-

inate blocks of items that need to be read. However, this may need better hardware

capabilities to decide where to split the range. Furthermore, it requires the entire list

to be available in advance, to know the list range.

A hybrid technique is used in the hardware sieve unit. It runs in O(N) time while

being able to eliminate blocks of data at a time. The two lists are stored in input buffers

and the minima and maxima of each input buffer are tracked. When possible, the input

buffers are flushed to eliminate whole blocks. The result is a very fast sieve unit that

provides significant acceleration.
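A minimal software model of this block-elimination idea is sketched below; the buffer handling and names are illustrative only, and the real unit makes the equivalent decisions in its state machine on its 15-deep buffers:

    #include <cstddef>
    #include <vector>

    // Intersect two sorted buffers block by block. If the largest value in one
    // buffer is still smaller than the smallest value in the other, the whole
    // buffer can be discarded without element-by-element comparison.
    void sieve_and_block(std::vector<int> &bufA, std::vector<int> &bufB,
                         std::vector<int> &out)
    {
        if (bufA.empty() || bufB.empty())
            return;

        if (bufA.back() < bufB.front()) { bufA.clear(); return; }   // flush A
        if (bufB.back() < bufA.front()) { bufB.clear(); return; }   // flush B

        // Ranges overlap: fall back to the ordinary sorted merge-intersection.
        std::size_t a = 0, b = 0;
        while (a < bufA.size() && b < bufB.size())
        {
            if      (bufA[a] == bufB[b]) { out.push_back(bufA[a]); ++a; ++b; }  // pass
            else if (bufA[a] <  bufB[b]) { ++a; }                               // pop A
            else                         { ++b; }                               // pop B
        }
        bufA.erase(bufA.begin(), bufA.begin() + a);
        bufB.erase(bufB.begin(), bufB.begin() + b);
    }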

6.2 Architecture

Figure 6.1 illustrates an abstract level data view of the sieve process. Multiple result

streams generated by streamer units or software are combined by the sieve unit into

a final result stream. Although a sieve unit is designed to combine two streams into

one, multiple sieve units can be combined in parallel and in cascade, for more complex

collation operations.

Figure 6.2 shows a single sieve unit organised as two individual channels. Each

individual channel has an input and output port, linked directly to internal input and


Figure 6.1: Sieve data flow

output buffers. Unlike the streamer unit, the channels are controlled and configured in

pairs because the sieve works on paired data streams. Configuring either SIEVE_0 or

SIEVE_1 channel results in configuring the same channel pair.

Figure 6.2: Sieve Block

6.2.1 Configuration

The software library hsx/sieve.hh provides several software functions to access and

configure the sieve unit. There are four sieve channels defined in hsx/types.hh as

HSX_SIEVE0 through to HSX_SIEVE3 in this library. Just as before, these identifiers

specify the exact sieve channel to access on the accelerator bus.

Unlike the streamer unit, the sieve unit has a single configuration register that con-

trols the operation of the channel pair, instead of a stack. The configuration of a sieve

unit is managed through the accelerator bus using the hsxSetSieve() function. This

function will write the appropriate values to the configuration register.

Figure 6.3 describes the configuration bits of the register. The configuration register

also doubles as a status register for each individual channel. Although writes to each

sieve will configure the same register, reads from each sieve will return only the status

of each individual channel. So, a read will return the data available status for read

and buffer available status for write (CONF:ROK and CONF:WOK) bits individually for

HSX_SIEVE0 and HSX_SIEVE1.

Figure 6.3: Sieve configuration register


6.2.2 Modes

Figure 6.4: Sieve operating modes

The sieve unit is able to perform several functions. Figure 6.4 depicts some of the

basic functions and other functionality can be added to the sieve unit when desired. The

basic functionality is sufficient to perform the most common collation functions. These

functions are configured by setting the CONF:MOD bits to certain values:

MODE PAS links the output buffers and input buffers directly in pass-through mode. This

mode makes the sieve extend any results buffer of any other accelerator unit. In

this mode, the sieve can stream data at maximum speed and store additional

elements in each buffer.

MODE SWP crosses the output buffers with the input buffers in swap-through mode. This

swaps the two channels but does not filter any of the results. In this mode, each

channel can stream at maximum speed and store additional elements in each buffer.

If a number of sieves are used in cascaded swap mode, it is possible to use them

as routers to move data around.

MODE AND filters the two input buffers into a single output buffer in intersection mode.

Only duplicate results that appear on both input streams are filtered into the

output. This performs a logical AND operation on the input streams. In the figure,

B2 is the same as A1. Only A1 is filtered through while B2 is dropped. In this mode,

whole buffers are flushed if the values in a buffer are outside the intersection range.

MODE IOR filters and sorts the two input buffers into a single output buffer in union

mode. Duplicate results that appear on both input streams are filtered into a

single occurrence. At the same time, the results are sent out in sorted order. This

performs a logical inclusive OR operation on the input streams. In the figure, B1

is smaller than A1, which is the same as B2. Only A1 is filtered through while B2

is dropped.


6.2.3 Operation

Although the channel pairs are configured together, each individual input and output

port can be accessed through the accelerator bus using hsxGetData() and hsxPutData().

Data can be fed directly into the sieve unit by the host processor or streamed from a

streamer unit. Using software to pump data allows the sieve unit to be used as a stan-

dalone data collation unit to accelerate the collation of data streams retrieved from

various other sources. However, the fastest performance can be achieved if the data is

streamed directly from a streamer unit instead of via the host processor.

The sieve unit works by having data pushed into it in sorted order. The output of

a sieve is also sorted and can be used as a direct input to another sieve. This allows

sieve operations to be cascaded for complicated collation operations. Figure 6.5 shows

the basic finite state machine controlling the sieve. There are only two states, each running at full clock speed.

Figure 6.5: Sieve FSM

IDLE state is the default state, where the data items in the input buffers are compared.

The sieve unit maintains a record of the minimum and maximum data items

pushed into the input buffer. A decision is made to either flush, pop, pass, swap

or stall the buffers based on the contents of each input buffer.

WORK state is when the actual sieve operation occurs. A flush clears all data items from

an input buffer, a pop drops a single data item from the input buffer, a pass feeds

a single data item from the input to the output buffer, a swap feeds a single data

item from the input to the opposite output buffer, and a stall will wait for more

data items to be made available on the input buffers.

Therefore, the maximum theoretical transfer speed of a single channel is 1.6 Gbps

at 100 MHz. With two independent channels, the sieve unit can have a maximum data

transfer rate of 3.2 Gbps at 100 MHz. This is ample data bandwidth to handle the

incoming data stream that comes from either a streamer unit or the host processor.

The maximum theoretical transfer rate of a streamer unit is only 1.066Gbps and the

processor transfer speed is even slower. Therefore, there is no possibility of saturating

the sieve unit either by software or hardware piping of the inputs, and this makes the

sieve operation bound by the streaming input operation.
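These rates follow from the two-state cycle in the same way as the streamer figures in section 5.2.3, presumably one 32-bit item every two clock cycles per channel; restating the arithmetic only:

\[
  R_{channel} = \tfrac{1}{2} \times 100\,\mathrm{MHz} \times 32\,\mathrm{bits} = 1.6\,\mathrm{Gbps} ,
  \qquad
  R_{unit} = 2 \times R_{channel} = 3.2\,\mathrm{Gbps}
\]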


6.3 Simulation Results

Once again, simulation was used to measure the performance of the sieve unit. Listing 6.3 shows parts of the sieve kernel written to perform an intersect collation in both software and hardware. Two input streams were created and filled (lines 100-105) with a number of different random values. Once again, the data set size was defined at compile time.

Several random values were then inserted (lines 107-113) into both streams to ensure

that intersections exist. The two streams were then sorted in O(N log N) time (lines

115-117) and used for both software and hardware input streams. As the two streams

contain different values, the common values were located at different parts along each

list. The results of both the hardware and software sieves are inspected visually using

the debug output.

6.3.1 Kernel Functional Simulation

Listing 6.4 shows the debug output. It verifies that the same results were obtained for

both software and hardware sieves. This shows that the hardware sieve can at least be

used to offload the operation from the host processor.

Listing 6.2 shows how to use the hardware sieve as an independent unit configured for intersection mode under software pump. Both lists were manually pumped into the sieve through software, using the hsxPutData() function. In this example code, the software kernel did not check the status of the read and write buffers because the software pump will never saturate the buffers. However, the CONF:WOK and CONF:ROK status bits should be checked in actual applications.
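A sketch of such a check is given below; the WOK bit position and mask constants are assumptions for illustration (only the ROK polling of bit 5 is confirmed by Listing 6.6), and the loop otherwise mirrors the software pump of Listing 6.2:

    // Hypothetical mask constants; the actual bit positions are defined by the
    // CONF register layout in Figure 6.3. ROK is polled as bit 5 in Listing 6.6;
    // the WOK position is assumed here.
    const unsigned HSX_CONF_ROK = (1u << 5);
    const unsigned HSX_CONF_WOK = (1u << 6);

    while ( (idxA != listA.end()) && (idxB != listB.end()) )
    {
        while (!(hsxGetConf(HSX_SIEVE2) & HSX_CONF_WOK)) ;   // space in input buffer?
        hsxPutData(HSX_SIEVE2, *idxA++);

        while (!(hsxGetConf(HSX_SIEVE3) & HSX_CONF_WOK)) ;
        hsxPutData(HSX_SIEVE3, *idxB++);
    }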

Listing 6.1 shows an equivalent intersection computed in software. There are different

methods to compute intersections. The method used computes the intersection in O(N)

time.

6.3.2 Kernel Software Pump Timing

Listing 6.5 shows the non-debug kernel output. Unlike the streamer unit, it is evident that the hardware operation is faster than the software operation, even at this stage.

Figure 6.6 shows the resultant timing diagram from running the sieve with software

pump input. It is a single sample with a data set size of N = 33 for each list. Again,

the timing diagram is unitless and 10 units are equal to a tick. There are three markers

on this timing diagram:

A (1885940) marks the point when sieve configuration began. This corresponds to the

beginning of the hsxSetSieve(HSX_SIEVE2, cfg) function. Again, the parts


int swsieve(std::list<int> &listA, std::list<int> &listB)
{
    std::list<int>::iterator idxA, idxB;

    idxA = listA.begin();
    idxB = listB.begin();

    while ( (idxA != listA.end()) && (idxB != listB.end()) )
    {
        if (*idxA == *idxB)
        {
            volatile int j = *idxA;
            // HIT !!
#ifdef DEBUG
            iprintf("HIT\t: 0x%X\n", j);
#endif
            idxA++;
            idxB++;
        }
        else if (*idxA < *idxB)
        {
            idxA++;
        }
        else
        {
            idxB++;
        }
    }
    return EXIT_SUCCESS;
}

Listing 6.1: Sieve software kernel

int hwsieve(std::list<int> &listA, std::list<int> &listB)
{
    hsxSieveConfig cfg;
    cfg.conf.bits.mode = HSX_SIEVE_AND;
    hsxSetSieve(HSX_SIEVE2, cfg);

    std::list<int>::iterator idxA, idxB;
    idxA = listA.begin();
    idxB = listB.begin();

    // data pump
    while ( (idxA != listA.end()) && (idxB != listB.end()) )
    {
        hsxPutData(HSX_SIEVE2, *idxA++);
        hsxPutData(HSX_SIEVE3, *idxB++);
    }

    // data pull
    for (int i=0; i<(LIST_MAX/10); ++i)
    {
        // HIT !!
        volatile int j = hsxGetData(HSX_SIEVE2);

#ifdef DEBUG
        iprintf("HIT\t: 0x%X\n", j);
#endif
    }

    return EXIT_SUCCESS;
}

Listing 6.2: Sieve hardware kernel


 96  int sieve()
 97  {
         std::list<int> listA, listB;

         // prefill lists
         for (int i=0; i<LIST_MAX; ++i)
102      {
             listA.push_back(getrand() & 0x0000FFFF);
             listB.push_back(getrand() & 0x0000FFFF);
         }

107      // plant about 10% hits
         for (int i=0; i<(LIST_MAX/10); ++i)
         {
             int j = getrand() & 0x0000FFFF;
             listA.push_back(j);
112          listB.push_back(j);
         }

         // sort lists
         listA.sort();
117      listB.sort();

         // do sieve
         int ticks;
         int memtick;
122
         // SOFTWARE SIEVE
         iprintf("Software Sieve\n");
         memtick = getmemtick();
         ticks = gettick();
127      swsieve(listA, listB);
         ticks = gettick() - ticks;
         memtick = getmemtick() - memtick;
         iprintf("%d swticks\n", ticks);
         iprintf("%d swmemticks\n", memtick);
132
         // HARDWARE SIEVE
         iprintf("Hardware Sieve\n");
         memtick = getmemtick();
         ticks = gettick();
137      hwsieve(listA, listB);
         ticks = gettick() - ticks;
         memtick = getmemtick() - memtick;
         iprintf("%d hwticks\n", ticks);
         iprintf("%d hwmemticks\n", memtick);
142
         return EXIT_SUCCESS;
     }

Listing 6.3: Sieve kernel


Figure 6.6: Sieve software pumped timing diagram


Software Sieve

HIT : 0x116E

HIT : 0x22F1

HIT : 0x30B1

HIT : 0x8601

HIT : 0xD2C6

30874 ticks

1859 memticks

Hardware Sieve

HIT : 0x116E

HIT : 0x22F1

HIT : 0x30B1

HIT : 0x8601

HIT : 0xD2C6

28080 ticks

1769 memticks

Listing 6.4: Sieve kernel output (debug)

Software Sieve

2330 swticks

198 swmemticks

Hardware Sieve

1454 hwticks

141 hwmemticks

Listing 6.5: Sieve kernel output (non-debug)

before this were considered the function call overhead and similar to the streamer,

it resets and flushes the sieve before configuring it.

B (1886800) marks the point when sieve configuration was completed. This corresponds

to the end of the hsxSetSieve(HSX_SIEVE2, cfg) function. The sieve is en-

abled at this point. After this point, data from the two lists were then pumped

into the sieve unit via software. Each assertion of the xwb_stb_i signal (1) corre-

sponds to writing a value to the sieve input. As both input channels are pumped

by the software, there are 66 data items written.

C (1899680) marks the point when the sieve operation was completed. The parts after

this were considered the function return overhead. The wre_i assertions show the

number of hits 2© 3© 4© written into the output buffer. In this case, there were

3 hits, which was correct for a list size of 33 items (30 dummies and 3 hits) as

prepared by the software. The last three xwb_stb_i assertions were used to pull

the items off the sieve output.

Reading from the timing diagram, the complete operation took TAC = 1374 ticks. The configuration operation took TAB = 86 ticks (6.25%) while the data pump and pull operation took TBC = 1288 ticks (93.75%). The total tick count reported was 1454, so the overhead of T+ = 80 ticks (+5.8%) was used for function call and return. Both the configuration and software

overheads are minor as the bulk of the time is taken up by the sieve operation.

According to the non-debug output, the software sieve required 2330 ticks. Using

a similar function call and return overhead, the software sieve consumes TSW = 2250


ticks. The hardware accelerated operation gives a speed-up factor:

TSW / TAC = 1.64

Although this is an acceleration, it is not very significant. Hence, the sieve unit

would not benefit the host processor very much in standalone operation. In this config-

uration, the host processor is kept busy pumping data to the sieve unit. Therefore, the

performance bottleneck is governed by the host processor software pumping the data.

Once this bottleneck is removed, the results are much better as shown in section 6.3.4.

6.3.3 Kernel Software Pump Performance

Figure 6.7 shows the simulation results of the software and hardware tick count for sieve

operations on different data sets. Again, each data set was randomly prepared and the

simulation was sampled 50 times for each data set size. The calculated averages and

standard deviations are used to plot the curves.

Figure 6.7: Sieve software pumped simulation

Graphically, the software and hardware timings were both linearly proportional to

the size of the data sets while the speed-up showed diminishing returns. Extrapolating

the software and hardware curves, their linear relationships can be extracted into the

following equations:

Vsw(N) = 78.3N + 90.1 (6.1)

Vhw(N) = 45.9N + 162.7 (6.2)

As before, the intercepts of each graph represent the fixed overhead costs while the

slope shows the cost per data item. A software overhead of 90 is very close to the timing

estimate of 80 ticks. A hardware overhead value of 163 is similar to the timing estimate of TAB + T+ = 166 ticks.


Graphically, the speed-up factor of 1.6 for N = 30 is similar to the timing estimate of 1.64, and it is clear that for large data sets the factor reaches a plateau at about 1.7. Once again, the speed-up factor is not very impressive as it is entirely limited by software. Even for a sufficiently large data set, the speed-up factor merely resolves to:

Vup = Vsw(N)/Vhw(N) = 1.71

6.3.4 Kernel Hardware Pipe Timing

As clearly evidenced by the earlier results, the host processor pump is a bottleneck for

supplying data to the sieve. It is possible to use one streamer unit to pump one data

stream, alongside software pumping the second data stream, to speed things up. But,

by using two streamer units to pump data to both sieve channels, the software pump

bottleneck is removed entirely.

Listing 6.6 shows the modified kernel that used two streamer units to pipe data into

the sieve unit. In this case, unlike in Listing 6.2, the status of the output buffer needed

to be checked for valid outputs. This was done by means of a polling mechanism here,

but this can also be done using interrupts.

Listing 6.7 shows the output from the modified kernel software. As can be clearly

seen, the software operation consumed a similar amount of time but the hardware op-

eration was much faster than the one in section 6.3.2. In this case, the data set was set

to N = 30.

Figure 6.8 shows the results of running the sieve with two streamer inputs. There

are six markers on this timing diagram:

A (1869080) marks the beginning of the configuration overhead. This corresponds to

the hsxSetSieve(HSX_SIEVE2) function. Everything before this is considered the

function call overhead.

B (1871050) marks the point when HSX_STREAM2 configuration starts. This corresponds

to the hsxSetStream(HSX_STREAM2, cfgA) function.

C (1872060) marks the point when HSX_STREAM2 configuration ends. At this point,

the streamer unit began to read data in from memory and piped it directly onto

the sieve unit. The number of data units read can be counted by the number of
dwb_stb_o assertions (1). This point can be considered the starting point for the

operation as data starts to stream in here.

D (1873910) marks the point when HSX_STREAM3 configuration starts. This corresponds

to the hsxSetStream(HSX_STREAM3, cfgB) function.


Figure 6.8: Sieve with hardware piped timing diagram


E (1875100) marks the point when HSX_STREAM3 configuration ends. At this point, all the configuration overhead ends. The streamer unit began to supply the second stream of information to the sieve unit (2). The sieve unit now had enough

information to begin the intersection operation.

F (1877300) marks the point when the sieve operation is completed. The number of

hits can be counted by the number of wre_i assertions at the top (3) (4) (5). In this case, there were 3 hits, which is the correct number of hits for a list size of 30. The distribution of the hits indicated that the intersections occurred towards the end of the list, but they could occur anywhere.

A stall can be observed around E as the buffers become full. By default, the buffer

depths were configured to 15 levels deep. As there was both an output buffer in the

streamer and an input buffer in the sieve, the stall happened after 30 data items were

read. At this point, there were 15 data items sitting in each input and output buffer.

Reading the values directly from the timing diagram, the total time for completing the whole hardware piped operation was TAF = 822 ticks. The configuration time was TAC = 298 ticks (36.3%) and the actual sieve operation consumed TCF = 524 ticks (63.7%). The actual number of ticks reported by the terminal output was THW = 938 ticks. The additional T+ = 116 ticks (+14.1%) are attributed to the function call and return overhead.

Using a similar value for the software sieve operation, the software operation took

TSW = 2398 ticks. Using this result, the hardware accelerated operation achieved a speed-up factor of 2.9. This is a significant acceleration and a promising result when

compared to the earlier software-pumped estimate.

A point of note is that the total cost of the overhead is a much larger portion of the hardware accelerated operation. This indicates that the actual effective search operation is much faster than before. As the number of data elements increases, the configuration overhead will ultimately become negligible.

For most purposes, these overheads can be considered a constant value while the

operation time is proportional to the number of elements in the list. In other trials, it

was discovered that using the .size() function is dependent on N and can increase the

hardware overhead to 1400 ticks. Therefore, the setup time can also be reduced if the

method for extracting the pointers and offsets is less convoluted.

6.3.5 Kernel Hardware Pipe Performance

Figure 6.9 shows the simulation results of the software and hardware tick count for sieve

operations on different data sets, as before. Immediately noticeable is the slope of the

hardware line and the range of the speed-up, which have changed tremendously.


Figure 6.9: Sieve with streamer piped simulation

Extrapolating the results linearly, we end up with the following equations describing the software and hardware sieves:

Vsw(N) = 78.4N + 77.9 (6.3)

Vhw(N) = 15.0N + 520.2 (6.4)

Equation 6.3 shows an insignificant change from the equation 6.1 software simulation results, as expected. The slopes are the same and the intercepts are very similar. The timing estimate of 116 ticks for function call and return is close enough to the intercept of 78.

Equation 6.4 shows a significant difference from equation 6.2. The intercept of 520 is

similar enough to the timing estimate of TAC + T+ = 414 ticks for the hardware opera-

tion overhead, considering the estimation method. The hardware configuration overhead

has increased significantly as the streamer unit configurations require a significant time

to extract the configuration options and push these values into the stack. This is evident

when comparing timing diagrams and is illustrated experimentally here.

The timing estimate of 2.9 for the speed-up factor corresponds to the ratio of the two curves at N = 30, as shown on the graph. Once again, the speed-up factor shows a trend of diminishing returns, but the hardware piped speed-up factor for a sufficiently

large data set is:

Vup = 5.23

This is a significant speed-up value. Therefore, it can be safely said that the hardware

piped sieve provides a very significant acceleration over the pure software operation.


6.4 Conclusion

A sieve unit designed as described can evidently be used as an accelerator unit to offload the filtering operation from the host processor and accelerate result collation. The speed-up of a sieve unit is 1.71 when used standalone, but it gives a Vup of 5.2 when combined with the hardware streamer unit under ideal conditions. In this situation, the performance of the sieve unit is bound by the performance of the streamer unit.

However, this analysis considers only the hardware versus software tradeoffs when

intersecting two lists. For a more complicated operation with multiple lists, the hardware

speed-up could be greater. Multiple hardware sieves can perform operations in parallel

and in cascade without any significant problems.

The sieve unit does not consume any external memory bandwidth as it does not

deal with the data set directly. This contributes to a larger acceleration when compared

to software methods that need to access memory regularly. The maximum internal

bandwidth of each sieve channel is 1.6 Gbps for a combined total of 3.2 Gbps per unit

at 100 MHz.


int hwsieve(std::list<int> &listA, std::list<int> &listB)
{
    // configure sieve
    hsxSieveConfig cfg;
    cfg.conf.bits.mode = HSX_SIEVE_AND;
    hsxSetSieve(HSX_SIEVE2, cfg);

    // configure streamers
    std::list<int>::iterator node;
    hsxStreamConfig cfgA, cfgB;

    cfgA.conf.bits.mode = HSX_STREAM_PIPE;
    cfgA.node = (int) &*listA.begin()._M_node;                            // node base
    cfgA.next = (int) &node._M_node->_M_next;                             // next offset
    cfgA.data = (int) &((std::_List_node<int>*) node._M_node)->_M_data;
    cfgA.size = LIST_MAX + LIST_MAX/10; // listA.size();

    cfgB.conf.bits.mode = HSX_STREAM_PIPE;
    cfgB.node = (int) &*listB.begin()._M_node;                            // node base
    cfgB.next = (int) &node._M_node->_M_next;                             // next offset
    cfgB.data = (int) &((std::_List_node<int>*) node._M_node)->_M_data;
    cfgB.size = LIST_MAX + LIST_MAX/10; // listB.size();

    hsxSetStream(HSX_STREAM2, cfgA);
    hsxSetStream(HSX_STREAM3, cfgB);

    // data pull
    for (int i=0; i<(LIST_MAX/10); ++i)
    {
        while (!(hsxGetConf(HSX_SIEVE2) & (1<<5))); // wait for result
        // HIT !!
        volatile int j = hsxGetData(HSX_SIEVE2);
#ifdef DEBUG
        iprintf("HIT\t: 0x%X\n", j);
#endif
    }

    return EXIT_SUCCESS;
}

Listing 6.6: Hardware streamer-sieve kernel

Software Combi

2514 swticks

198 swmemticks

Hardware Combi

938 hwticks

189 hwmemticks

Listing 6.7: Streamer-sieve kernel output


CHAPTER 7

Chaser Unit

The chaser unit operates on the key search stage of the search pipeline. It

can be configured to work with different data structures and applications.

Functional and timing results of the chaser simulation show that it has

the ability to accelerate multiple key searches by up to 3.43 times when

compared against a pure software operation.

7.1 Introduction

The final accelerator unit is the chaser unit. Any search pipeline usually begins with a key search. This can be considered the most common search operation, performed during both primary and secondary search operations. In primary search operations, it is the only search operation that is performed. Therefore, it is a common computational task and a considerable number of applications would benefit if it were accelerated in hardware.

The task of chasing down the search key involves a few operations: loading the

data from memory and comparing it against a search key. Based on the result of

the comparison, a decision is made on what action to take next. At a machine level,

this is a fairly mundane task that does not exploit the full powers of a microprocessor

but unnecessarily consumes valuable computational power. This presents an excellent

opportunity for offloading the operation to free up the processor for other compute

intensive tasks, without compromising on the raw performance.


7.1.1 Design Considerations

As mentioned previously, keys are often stored in a tree structure. Therefore, it is evident

that a tree traversal algorithm would be used to search it. In many implementations,

travelling down a branch and up it again involves remembering and recalling where

the algorithm has been. This is essentially similar to the action of pushing down and

popping back up the procedural call stack.

Hence, a tree traversal algorithm would benefit from having a stack machine ar-

chitecture, which is naturally suited to efficient stack operations. The fact that the

popular SQLite database engine uses a stack-based virtual machine [Hip07] to process

SQL queries lends some credence to this idea. This formed the initial idea of a hardware

chaser design.

An initial design was envisioned, involving a simple dual-stack processor that could

be programmed with a few primitive operations. The basic operations needed were:

a memory load, a compare operation and a conditional operation. The top of stack

contains a pointer that points to the current data node. As the device steps down the

tree, the stack is pushed and the new node pointer loaded from memory into the top

of stack. When going back up the stack, the pointers merely need to be popped off the

stack.

The design had a distinct advantage of keeping the previously loaded pointers in the

stack. This meant that the pointers would not need to be recalculated nor reloaded from

memory, which would save both computational and bandwidth resources. The stack

machine was also modest in its use of resources as it merely needed a small memory

block with a simple ALU.

However, after spending some time on this design, it was abandoned. It was found

that a key search is more akin to a list traversal than a tree traversal. Any search that

involves traversing an entire tree is best optimised by reorganising the data into a different structure. The whole idea of using a tree is to eliminate entire branches with every step

of the traversal. There is no need to move back up a tree and the path traced through

the tree is only one way. As a result, it was possible to redesign the chaser unit in a

simpler form.

7.2 Chaser Architecture

The function of the chaser is to process a data structure and extract part of it as a result.

Figure 7.1 illustrates an abstract level view of a chaser data flow. Like the streamer, the

chaser processes a data structure, not the data values directly, and extracts the data

node that matches a key. To enable it to do this, the chaser unit has four ports: an

input port, an output port, a configuration port and a memory port.


Figure 7.1: Chaser data flow (a data structure is processed by the chaser, which extracts the matching data node)

Figure 7.2 illustrates the structural view of the chaser. The memory port is connected

directly to data memory and used to read in the data structure. The input, output and

configuration ports are only accessible over the accelerator bus. The input port is used

to write the primary key into the appropriate configuration register. The output port is

used to retrieve results from the chaser. The configuration port is used to write values

into the configuration stack.

Figure 7.2: Chaser unit block (the CHASE0 chaser with its configuration stack of NODE, DATA, EQCC, LTCC, GTCC and CONF registers, a PKEY input FIFO, a result FIFO, and a memory port to data memory)

7.2.1 Configuration

The software library hsx/chase.hh provides several software functions to access and

configure the chaser unit. There are four chaser channels defined in hsx/types.hh as

HSX_CHASE0 through to HSX_CHASE3. These identifiers specify the exact chaser channel

to access on the accelerator bus.

Like the streamer unit, the chaser holds its configuration in a stack. The only

configuration register not accessed on the stack is the key register, PKEY, which is written

to separately from the rest of the configuration. This separates the search key from the

data set configuration, which simplifies the chaser for multi-key searches on a single data

set.

Figure 7.3 lists the registers in the configuration stack. It works in the same way as the streamer configuration stack described in section 5.2.1. All the details are managed by the hsxSetChase() function. To configure the chaser unit, the values need to be written in order: NODE, DATA, EQCC, LTCC, GTCC, CONF.

NODE contains the base pointer. This will typically contain the pointer to the root node

of a tree. All the following offsets are calculated as a positive offset from this base

pointer.


Figure 7.3: Chaser configuration stack (registers NODE, DATA, EQCC, LTCC, GTCC and CONF; the CONF register carries the RST, ENA, ROK and WOK bits)

DATA contains the key offset. A key will be loaded from this offset from the node pointer

and compared with the primary key to decide the next operation. The operation

performed will depend on the values of the following three CC registers.

EQCC contains an offset to a pointer. This offset is used when the loaded value is equal

to the primary key value. All offsets are considered positive values from the base

pointer. If a negative offset is supplied, this is interpreted as a hit condition and

the present base pointer will be pushed into the result output buffer. Therefore,

this register will normally be set to the HSX_CHASE_HIT value in software. This

register can be set to a branch node offset if the search is looking for a ‘less than’

or ‘greater than’ key instead. If the pointer is a NULL pointer, the search will stop.

LTCC contains an offset to a pointer. This offset is used when the loaded key value

is less than the primary key value. This generally means that the primary key

exists in the right branch of the present node. The new node pointer will then

be loaded from this offset. If a negative offset is supplied, this is interpreted as a

hit condition and the present base pointer will be pushed into the result output

buffer. If this value is a NULL pointer, the search will stop.

GTCC contains an offset to a pointer. This offset is used when the loaded key value is

greater than the primary key value. This generally means that the primary key

exists in the left branch of the present node. The new node pointer will then be

loaded from this offset. If a negative offset is supplied, this is interpreted as a

hit condition and the present base pointer will be pushed into the result output

buffer. If this value is a NULL pointer, the search will stop.

CONF The only register that can be read on the configuration bus is the CONF register,

which also functions as a status register. Figure 7.3 lists the bits of the configu-

ration register. Most of it works the same as the configuration registers for the

other accelerator units.
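As an illustration of how these registers map onto an application data structure, the following sketch configures a chaser channel for a simple, hand-rolled binary tree. The TreeNode layout is hypothetical; the hsxChaseConfig fields, HSX_CHASE_HIT and the hsx calls are those used in Listing 7.3.

#include <cstddef>
#include "hsx/chase.hh"

// Hypothetical node layout for a hand-written binary search tree.
struct TreeNode
{
  TreeNode *left;   // followed when the stored key is greater than PKEY
  TreeNode *right;  // followed when the stored key is less than PKEY
  int       key;    // value compared against PKEY
};

void configureChaser(TreeNode *root, int pkey)
{
  hsxChaseConfig cfg;
  cfg.node = (int) root;                  // NODE: base pointer (the tree root)
  cfg.data = offsetof(TreeNode, key);     // DATA: offset of the key within a node
  cfg.eqcc = HSX_CHASE_HIT;               // EQCC: an equal compare is a hit
  cfg.gtcc = offsetof(TreeNode, left);    // GTCC: loaded value > PKEY, follow the left child
  cfg.ltcc = offsetof(TreeNode, right);   // LTCC: loaded value < PKEY, follow the right child

  hsxSetChase(HSX_CHASE2, cfg);           // writes NODE, DATA, EQCC, LTCC, GTCC, CONF in order
  hsxPutData(HSX_CHASE2, pkey);           // the primary key is written separately
}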


7.2.2 Operation

Figure 7.4: Chaser machine states (IDLE, NULL, DATA, COMP, NEXT and BASE, with transitions on configuration load, data load, pointer load, NULL pointer and key found)

Figure 7.4 shows the internal machine states of the chaser. It has a number of states,

each capable of being run at one clock cycle:

IDLE is the default state and is entered whenever the accelerator is reset by hardware or software. In this state, the registers from the configuration stack are copied onto internal machine registers.

NULL is the state where the node pointer is checked for a NULL pointer. If a NULL pointer

is detected, the machine goes back to an IDLE state. Otherwise, the base pointer

and data offset are added and written into the internal node pointer, which now

points to the data value.

DATA is the state where the actual data value is read from memory via the memory port.

The necessary memory signals are asserted and deasserted to complete a memory

transfer. The data is read into a holding data register.

COMP is the state where the loaded data is compared with the key. The result of the comparison is stored in a conditional register. This is given its own state so that it can share the same ALU as the other memory calculations.

NEXT is the state where the next pointer is calculated. The appropriate offset is selected

based on the result of the earlier comparison. If the offset is a negative value,

the chase is completed and the base pointer is written into the output buffer.

Otherwise, the offset is then added to the base pointer and written into the internal

node pointer, which now points to the next node pointer.

BASE is the state where the pointer to the next node is loaded from memory. Again,

the necessary memory signals are asserted to complete a memory transfer. The

pointer is loaded as the new base pointer.

The main loop of the machine runs through: NULL, DATA, COMP, NEXT, BASE. At one state per cycle, this means one 32-bit key is processed every five cycles, which makes the theoretical maximum internal bandwidth 640 Mbps at a 100 MHz core speed. However, the loop loads both a data word and a pointer during each iteration. Therefore, its theoretical maximum external bandwidth consumption at 100 MHz is 1.28 Gbps.
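In software terms, one pass of this loop behaves roughly as follows (a behavioural sketch only, not the chaser RTL; the function and parameter names are illustrative):

// Behavioural sketch of the chaser main loop. Each pass corresponds to the
// five states NULL, DATA, COMP, NEXT and BASE, i.e. five cycles per node.
int *chase(char *base, int pkey, int dataOff, int eqcc, int ltcc, int gtcc)
{
  while (base != 0) {                              // NULL: stop on a NULL pointer
    int value = *(int *)(base + dataOff);          // DATA: load the key (one 32-bit word)
    int off;                                       // COMP: compare against PKEY
    if      (value == pkey) off = eqcc;
    else if (value <  pkey) off = ltcc;
    else                    off = gtcc;
    if (off < 0) return (int *)base;               // NEXT: a negative offset is a hit
    base = *(char **)(base + off);                 // BASE: load the next node pointer
  }
  return 0;                                        // key not found
}
// One 32-bit key every five cycles at 100 MHz gives the 640 Mbps internal figure;
// the two 32-bit loads per pass (key and pointer) give the 1.28 Gbps external figure.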


Although primarily designed for tree traversal, the chaser unit can also be used to

traverse other linked data structures. This can be done by configuring the less than and

greater than pointer offsets to point to the appropriate next node in the link. When

configured this way, a chaser can be used to search for a primary key that is stored

within a linked list or other structure. Alternatively, software can be used to translate

other data structures into trees for processing.
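For example, a (hypothetical) singly linked list can be walked by making both branch offsets point at the same next field, so that any miscompare simply steps to the following node; the hsx calls are again those of Listing 7.3.

#include <cstddef>
#include "hsx/chase.hh"

struct ListNode      // hypothetical linked list node layout
{
  ListNode *next;
  int       key;
};

void configureListChase(ListNode *head, int pkey)
{
  hsxChaseConfig cfg;
  cfg.node = (int) head;
  cfg.data = offsetof(ListNode, key);
  cfg.eqcc = HSX_CHASE_HIT;              // equal: report this node as the result
  cfg.ltcc = offsetof(ListNode, next);   // less than: step to the next node
  cfg.gtcc = offsetof(ListNode, next);   // greater than: also step to the next node
  hsxSetChase(HSX_CHASE3, cfg);
  hsxPutData(HSX_CHASE3, pkey);
}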

7.3 Kernel Simulation Results

As before, simulation was used to measure the performance of a chaser. A chaser kernel

was written to perform a primary key search in both software and hardware. Listing

7.4 shows the main chaser kernel.

The kernel first filled an input tree (lines 82–86) with a number of different random

values. As before, the size of the tree is defined during compile time. A key value is then

inserted into the tree (line 87) to ensure that there is at least a single result hit. The

software and hardware kernels were compiled and run, with the debug output inspected

visually to ensure functionality.

7.3.1 Kernel Functional Simulation

Listing 7.1 shows the debug output. Visual inspection confirms that the software and

hardware chasers both work and produce the same result. A primary key of 0x1690 is

found at the node located at 0x80000680, which is in the heap memory space. This

shows that the hardware chaser is capable of performing the same task as the host

processor software. Hence, it can be used to offload the primary key search work from

the host processor.

Software Chaser

PKEY : 0x1690

FIND : 0x80000680

11844 swticks

740 swmemticks

Hardware Chaser

PKEY : 0x1690

FIND : 0x80000680

11918 hwticks

758 hwmemticks

Listing 7.1: Chaser simulation output (debug)

Listing 7.3 shows a hardware chaser kernel. The method for extracting the config-

uration parameters is rather complicated owing to the structure of the C++ STL set

library. This is because the internal variable used to represent the red-black tree is

a private member of the set class. There is no easy way to access a private member

directly from an external application.


Hence, the pointer to the root of the red-black tree was extracted (lines 58–59)

using manual offsets. The manual offsets were obtained by studying the STL header

for trees to determine where the root pointer was stored. The next few lines extracted

the relevant pointers for a specific data structure node, depending on the result of the

comparison operation.

Listing 7.2 shows a software chaser kernel. It simply calls the built-in STL tree

search function that searches the red-black tree in O(log N) time. It then returns the

pointer to the base node, which is the same result that the hardware operation returns.

The pointer value extracted using either method can be used to cast a node pointer in

the application to access the data.
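For example, relying on the same libstdc++ internals that Listing 7.3 already assumes, the returned value can be cast back to an STL tree node to read the matched key:

#include <set>

// Cast the base-node pointer returned by either chaser back to an STL
// red-black tree node and read the stored value. This depends on the
// std::_Rb_tree_node layout used by Listing 7.3 and is shown only as an
// illustration.
int valueAt(int result)
{
  std::_Rb_tree_node<int> *node = (std::_Rb_tree_node<int> *) result;
  return node->_M_value_field;
}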

7.3.2 Kernel Single Key Timing

Listing 7.5 shows the output from the simulation of chasing a primary key in a red-

black tree of size N = 50 elements. With a large N, it takes a long time to simulate

it as building the tree is O(N log N) bound. With a small N, the period of interest

is very short as the search is completed in O(log N) time. This tree size was chosen

as a trade-off between simulation time and observability. This listing shows the total

hardware and software timing results without any debug output. There are a few extra

memory accesses for the hardware operation, which are largely incurred during the hardware

configuration operation.

Figure 7.5 shows only the timing diagram of the important signals in the hardware

operation portion of the above simulation. As before, the time values are unitless but

10 units are equivalent to a clock tick. There are four visible markers on the diagram:

A (1590631) marks the beginning of the hardware configuration overhead. This is when

the hsxPutData(HSX_CHASE2, pkey) operation was called. The parts before this

are considered the function call and return overhead.

B (1591781) marks the end of the hardware configuration overhead. This is when the hsxSetChase(HSX_CHASE2, cfg) function was completed. The actual chaser operation began to run immediately after this as indicated by the activity on the dwb_stb_o signal.

C (1592401) marks the point when the primary key was found by the chaser as indicated by the rok_o assertion. Although the key was found, the host processor did not realise it yet. It took a few more ticks for the host processor to check the status of the output buffer and retrieve the results from the output buffer as indicated by the rde_i assertion.


31 int swchase(std::set<int> &setA, int pkey)
   {
   #ifdef DEBUG
     iprintf("PKEY\t: 0x%X\n", pkey);
35 #endif

     volatile int j = (int)&*setA.find(pkey)._M_node;

   #ifdef DEBUG
40   iprintf("FIND\t: 0x%X\n", j);
   #endif

     return EXIT_SUCCESS;
   }

Listing 7.2: Software chaser kernel

46 int hwchase(std::set<int> &setA, int pkey)
   {
   #ifdef DEBUG
     iprintf("PKEY\t: 0x%X\n", pkey);
   #endif
51
     // Configure the hardware chaser.
     hsxChaseConfig cfg;
     std::set<int>::iterator node;

56   int *tree = (int *)&setA + 2;                 // Manual tree header offset.
     cfg.node = (int) *tree;                       // Extract pointer to tree ROOT.
     cfg.eqcc = HSX_CHASE_HIT;                     // Hit when memory = key.
     cfg.gtcc = (int) &node._M_node->_M_left;      // Offset when memory is > key.
     cfg.ltcc = (int) &node._M_node->_M_right;     // Offset when memory is < key.
61   cfg.data = (int) &((std::_Rb_tree_node<int>*) node._M_node)->_M_value_field; // Data offset

     hsxSetChase(HSX_CHASE2, cfg);
     hsxPutData(HSX_CHASE2, pkey);
66
     while (!(hsxGetConf(HSX_CHASE2) & (1<<2)));   // wait for result
     volatile int j = hsxGetData(HSX_CHASE2);

   #ifdef DEBUG
71   iprintf("FIND\t: 0x%X\n", j);
   #endif

     return EXIT_SUCCESS;
   }

Listing 7.3: Hardware chaser kernel


77 int chaser()
   {
     std::set<int> setA;
     int pkey = getrand() & 0x0000FFFF;
81
     // prefill lists
     for (int i=0; i<LIST_MAX; ++i)
     {
       setA.insert(getrand() & 0x0000FFFF);
86   }
     setA.insert(pkey);

     // sort lists
     // listA.sort();
91
     // do sieve
     int ticks;
     int memtick;

96   // SOFTWARE CHASE
     iprintf("Software Chaser\n");
     memtick = getmemtick();
     ticks = gettick();
     swchase(setA, pkey);
101  ticks = gettick() - ticks;
     memtick = getmemtick() - memtick;
     iprintf("%d swticks\n", ticks);
     iprintf("%d swmemticks\n", memtick);

106  // HARDWARE CHASE
     iprintf("Hardware Chaser\n");
     memtick = getmemtick();
     ticks = gettick();
     hwchase(setA, pkey);
111  ticks = gettick() - ticks;
     memtick = getmemtick() - memtick;
     iprintf("%d hwticks\n", ticks);
     iprintf("%d hwmemticks\n", memtick);

116  return EXIT_SUCCESS;
   }

Listing 7.4: Chaser kernel


Figure 7.5: Single key chaser timing diagram (markers A–D; signals shown include cfg_node, cfg_data, cfg_eqcc, cfg_ltcc, cfg_gtcc, cfg_conf, cfg_pkey, dwb_stb_o, dwb_adr_o, dwb_dat_i, dwb_ack_i, rde_i and rok_o)


Software Chaser

420 swticks

31 swmemticks

Hardware Chaser

460 hwticks

47 hwmemticks

Listing 7.5: Chaser simulation output

D (1592571) marks the end of the hardware chase kernel. At this point, the results were retrieved and the operation was completed. The hardware kernel then returned control back to the main kernel.

The total operation took TAD = 194 ticks. From this, TBD = 79 ticks (40.7%) was

used in the actual chase operation. The balance TAB = 115 ticks (59.3%) was used up by

the hardware configuration overhead. A significantly large proportion of the operation

time was actually spent configuring the hardware parameters as opposed to performing

the chase.

From the simulation output, another T+ = 266 ticks (+137.1%) is used by the func-

tion call and return overhead. These are consumed before A and after D. Using a similar

function call overhead, from the simulation output, the software operation completed in

TSW = 154 ticks. From this single timing simulation, the timing estimate speed-up is:

TSW/TAD = 154/194 = 0.79

If the hardware configuration overhead is discounted as a fixed cost, the hardware

chase operation speed-up of TSW/TBD = 154/79 = 1.95 is much faster. This is a significant acceleration

for a very basic search function.

7.3.3 Kernel Single Key Performance

Figure 7.6 shows the simulation results for the chaser software and hardware operation.

The data set sizes chosen were between 10 to 300 for tree depths between 4 to 9 levels.

The time taken for simulation grows quickly with a significantly large data set size

because the insertion process is O(N lg N)1 bound. Therefore, the data set is kept to a

fair size to keep simulation times reasonable.

As the performance of the search algorithm is O(lg N) limited, the graphs are plotted

against the lg N values. The performance for a single key search over one trial depends

greatly on the positional level of the key in the tree. As the data set is prepared

randomly, the vertical error bars on each data point reflect the standard deviation of

the values obtained across 50 trials.

Figure 7.6: Chaser simulation, showing software and hardware ticks and the resulting speed-up against data set size (lg N)

Extrapolating and extracting the linear relationship

of the points gives the following two equations to describe them:

Csw,s(N) = 22.9 lg N + 280.6 (7.1)

Chw,s(N) = 9.04 lg N + 390.2 (7.2)

Equation 7.1 describes the relationship of the software operation. The intercept

of 281 ticks agrees with the timing estimate of 266 ticks for function call and return

overhead. The single key chase operation is a very different operation compared to the

other accelerator devices. This is because a search can very quickly find a key, due to

the lg N nature of the search process. There is, therefore, a fairly large possible range

for the intercept point.

Equation 7.2 describes the relationship of the hardware operation. The intercept

of 390 corresponds to the timing estimate of TAB + T+ = 381 for the total hardware

overhead costs. For a sufficiently large N , the speed-up factor is:

Cup = Csw,s(N)/Chw,s(N) = 22.9/9.04 = 2.54

The cross over point in the graph is at about lg N = 8, which is N = 256. This

means that for any tree that is larger than 256 elements, the hardware chaser affords

a hardware acceleration while the software method is faster for small trees. This will

prove useful in everyday applications. Most indices will benefit because any data set

worth accelerating will definitely be larger than 256 entries.

7.3.4 Kernel Multi Key Timing

The chaser has a more interesting mode, where it can perform better acceleration than

before. The chaser can be configured to perform multiple key searches on the same data

structure. In this mode, a dedicated chaser can be allocated to each data structure that

75

Page 90: Design and Development of a Heterogeneous Hardware Search … · 2009. 7. 16. · way that the accelerator units can be combined like LEGO bricks, giving this solution flexibility

needs to be searched, simplifying the configuration process. This is useful for a structure

that is commonly used, such as an operating system process table.

Listings 7.6 and 7.7 show a multiple key search kernel. In each case, the number of

keys searched is N/10 (integer division) of each data set of size N . The thing to note in

listing 7.7 is the order in which the hardware is configured. The configuration registers

were written to and the chaser was enabled, before the keys that need to be searched

are loaded. If the order is swapped, the multiple keys in the input buffers will be flushed

during the configuration process.

Listing 7.8 shows the debug output of the simulation. The results were verified

through visual inspection and shown to be the same for both software and hardware

methods.

Figure 7.7 shows the results of one simulation trial for N = 30 and 3 key searches.

There are six markers on the timing diagram:

A (1225782) marks the start of the hsxSetChase() function call. The configuration

parameters were all extracted before this point. Since the parameters are the

same as that for a single search, it can be assumed to consume the same amount

of time.

B (1226882) marks the end of the hsxSetChase() function call. At this point, the

chaser is configured but has not begun chasing the key as seen by the absence of

dwb_stb_o signal assertions.

C (1227112) marks the time when the first key is written into the chaser. Almost

immediately, the loading of the key is signalled by the rde_i signal. The chaser

then began operating soon after this, as seen by the multiple dwb_stb_o assertions.

D (1227462) marks the time when all the multiple keys have been written into the chaser, as indicated by the wre_i signal. This operation overlaps the actual chase operation, so the keys to be searched can all be safely written in advance if there are fewer of them than the size of the input buffer. Otherwise, the multiple key search operation may need to be broken up into several phases; a sketch of such a phased approach is given after this list of markers.

E (1228202) marks the time when the last key was loaded, as indicated by the rde_i signal. The amount of time between C and E can be used to calculate the average key search time.

F (1228862) marks the time when all the results have been retrieved and the multiple

key chase operation is completed.
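A sketch of this phased submission is shown below; HSX_CHASE_FIFO_DEPTH is a hypothetical constant standing in for the real input buffer depth, and the chaser is assumed to have already been configured with hsxSetChase() as in Listing 7.7.

#include <vector>
#include "hsx/chase.hh"

// Illustrative only: break a large multi-key search into phases so that no
// more keys are outstanding than the chaser input buffer can hold.
const unsigned HSX_CHASE_FIFO_DEPTH = 4;   // assumed input buffer depth

void hwchasePhased(const std::vector<int> &pkey)
{
  for (unsigned base = 0; base < pkey.size(); base += HSX_CHASE_FIFO_DEPTH)
  {
    unsigned batch = pkey.size() - base;
    if (batch > HSX_CHASE_FIFO_DEPTH) batch = HSX_CHASE_FIFO_DEPTH;

    for (unsigned i = 0; i < batch; ++i)              // queue one batch of keys
      hsxPutData(HSX_CHASE2, pkey[base + i]);

    for (unsigned i = 0; i < batch; ++i)              // drain the matching results
    {
      while (!(hsxGetConf(HSX_CHASE2) & (1 << 2)));   // wait for a result
      volatile int j = hsxGetData(HSX_CHASE2);
      (void) j;
    }
  }
}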


32 int swchase(std::set<int> &setA, std::vector<int> &pkey)
   {
     for (int i=0; i<(LIST_MAX/20); ++i)
35   {
       volatile int j = (int)&*setA.find(pkey[i])._M_node;
   #ifdef DEBUG
       iprintf("FIND\t: 0x%X\n", j);
   #endif
40   }

     return EXIT_SUCCESS;
   }

Listing 7.6: Software multi-key chaser kernel

45 int hwchase(std::set<int> &setA, std::vector<int> &pkey)
46 {
     // Configure the hardware chaser.
     hsxChaseConfig cfg;
     std::set<int>::iterator node;

51   int *tree = (int *)&setA + 2;                 // Manual tree header offset.
     cfg.node = (int) *tree;                       // Extract pointer to tree ROOT.
     cfg.eqcc = HSX_CHASE_HIT;                     // Hit when memory = key.
     cfg.gtcc = (int) &node._M_node->_M_left;      // Offset when memory is > key.
     cfg.ltcc = (int) &node._M_node->_M_right;     // Offset when memory is < key.
56   cfg.data = (int) &((std::_Rb_tree_node<int>*) node._M_node)->_M_value_field; // Data offset

     hsxSetChase(HSX_CHASE2, cfg);

     for (int i=0; i<(LIST_MAX/20); ++i)           // write multiple keys
61     hsxPutData(HSX_CHASE2, pkey[i]);

     for (int i=0; i<(LIST_MAX/20); ++i)
     {
       while (!(hsxGetConf(HSX_CHASE2) & (1<<2))); // wait for result
66     volatile int j = hsxGetData(HSX_CHASE2);
   #ifdef DEBUG
       iprintf("FIND\t: 0x%X\n", j);
   #endif
     }
71
     return EXIT_SUCCESS;
   }

Listing 7.7: Hardware multi-key chaser kernel


Figure 7.7: Multiple key chase kernel timing (markers A–F; signals shown include cfg_pkey, wre_i, rde_i, xwb_stb_i, xwb_ack_o, xwb_dat_i, xwb_dat_o, dwb_stb_o, dwb_dat_i and dwb_ack_i)


Software Chaser

FIND : 0x80000860

FIND : 0x80000888

FIND : 0x800008B0

20652 swticks

1198 swmemticks

Hardware Chaser

FIND : 0x80000860

FIND : 0x80000888

FIND : 0x800008B0

19974 hwticks

1198 hwmemticks

Listing 7.8: Chaser multi-key simulation output (debug)

Software Chaser

886 swticks

75 swmemticks

Hardware Chaser

574 hwticks

71 hwmemticks

Listing 7.9: Chaser multi-key simulation output

The complete hardware operation took TAF = 308 ticks in this one trial. Of this,

TAC = 133 ticks (43.2%) was made up of hardware configuration overhead. The balance

TCF = 175 ticks (56.8%) was consumed by the actual chase operation. The average key

search time TCE for three keys is 109 ticks, or 36 ticks per key, which is fairly close to the

number in the single key timing estimate. From the timing estimate in listing 7.9, the

function call and return overhead is estimated to be T+ = 266 ticks (+86.4%).

Using a similar function call and return overhead, the software operation takes

TSW = 620 ticks. This gives an estimated speed-up factor of 2.0 times. However, these

estimates are only an indicator as they only reflect the results from one single trial and

a more accurate value is estimated in the next section.

7.3.5 Kernel Multi Key Performance

Figure 7.8 shows the results of a series of multiple key searches, each repeated 50 times.

In each case, the graphs are plotted against the number of searches performed, which is

set to be 10% of the data set size. The performance is linearly related to the number of

keys that have to be searched, with each key taking a slightly different amount of time

to be searched.

Extrapolating and extracting the linear relationship of the points gives the following

two equations to describe them:

Csw,m(N) = 285.2 N + 4.6 (7.3)

Chw,m(N) = 83.2 N + 318.2 (7.4)

Figure 7.8: Chaser simulation (multi-key), showing software and hardware ticks and the resulting speed-up against the number of searches (n)

Equation 7.3 describes the relationship of the software operation. Equation 7.4

describes the relationship of the hardware operation. The intercept of 318 is similar

to the timing estimate of TAC + T+ = 399 for the total hardware overhead costs. The speed-up factor Csw,m(N)/Chw,m(N) = 1.51 for N = 3, which is close to the timing estimate of 2.0. However, for a sufficiently large N, the speed-up factor approaches the ratio of the two gradients:

Cup = 285.2/83.2 = 3.43

7.4 Conclusion

A primary key search is a common task that needs to be performed for both primary and

secondary searches. The hardware chaser unit can be used to both offload and accelerate

the primary key search process. However, it only provides a significant saving if the data

set that needs to be searched is larger than N = 256. It provides an acceleration that

is O(lg N) bound.

If a chaser is used to search multiple keys in the same data structure repeatedly,

the acceleration becomes O(N) bound with the number of keys to be searched even

though each individual key search is still O(lg N) bound. For a sufficiently large data

set, the maximum acceleration factor is Cup = 3.43 for multiple key searches. The

maximum external memory bandwidth that is required for each chaser unit is 1.28 Gbps

at 100 MHz.


CHAPTER 8

Memory Interface

As a big part of search is primarily memory limited, cache and the memory

hierarchy are explored. A special cache that takes into account structural

locality, instead of just temporal and spatial locality, is also designed. How-

ever, the improvements gained are only 3% and not sufficiently significant

to warrant its use in the search accelerator unless absolutely necessary.

8.1 Introduction

All the results thus far have been obtained with one underlying assumption: the simula-

tions were all run without the use of any cache memory. All memory accesses were sent

out through the memory arbiter to a simulated external memory device as described

in section 5.3. Typical computer architecture practice exploits the benefit of a memory

hierarchy, using cache, to speed up operations. However, the effects of a cache memory

on search need to be studied.

8.2 Cache Primer

Search algorithms are typically limited by the number of records that have to be searched

through, which translates into the size of the search space N . As main memory is slow,

for search algorithms that have to traverse through in-memory data sets, this becomes a

major bottleneck, which is usually alleviated by use of cache memory. The performance

of existing cache architectures is fairly well understood [Han98, Gen04, van02]. Existing

cache memories are designed around two core principles: temporal and spatial locality;


while cache performance is typically regulated by three basic parameters: cache size,

line length and associativity[HS89].

Performance is improved by retaining more data within the cache. It is common to

find more than half the die area of a modern processor taken up by cache memory as

processor speeds have outpaced memory speeds[FH05]. However, this directly increases

cost by increasing chip area when this valuable space could alternatively be used to

increase the functionality of a processor. Alternatively, a reduction in area could

lower its cost.

Moreover, improving search performance by merely increasing cache size is not sus-

tainable simply because there will always be a significant amount of data stored in main

memory. Even if cache sizes do reach gigabyte values, the primary memory by then

would be larger and the data set sizes potentially even larger. Therefore, the search

space will always be stored in external memory instead of on-chip cache.

A cache line represents the amount of information read whenever data is read from

main memory. Longer cache lines affect performance by bringing in larger blocks of data

at a time, exploiting spatial locality. This will naturally benefit linked data structures

because each node generally holds multiple words of information. However, these com-

plex data structures may not have data nodes located in contiguous locations, which

reduces the effectiveness of spatial locality.

Associativity works by replicating cache blocks to reduce the problem of multiple areas in main memory mapping to a single cache block (aliasing). This improves the

probability of retaining data in the cache. If information is widely scattered across

memory, there is less spatial locality to exploit. Hence, higher associativity may be

useful by retaining multiple blocks within cache. However, the cost increases quickly

with associativity due to replication.

For instruction cache, both temporal and spatial locality are equally important. As

instructions are executed sequentially, instructions next to an existing one are likely to

be used (spatial locality). In the case of loops in algorithms, recently used instructions

are likely to be used repeatedly (temporal locality). However, it is less clear that a data

cache benefits from the same design features for reasons set out below.

For search applications, the data space exhibits ephemeral characteristics. Data

structures are usually traversed in one direction and once a node is used it is unlikely to

be used again, which reduces the effectiveness of temporal locality. For accessing data

structures, structural locality may be more important because in every data structure,

once a node is checked against the search key, it will normally need to access a child

node next, which can be located anywhere in data memory.


8.3 Cache Principles

As microprocessor speeds outpace increases in memory speed, the processor spends

more time waiting for data. The present trend in general-purpose microprocessors is to

increase the amount of cache to reduce this penalty. There are two problems with this

trend.

Firstly, increasing cache improves general purpose performance, but may not help

with search operations. Secondly, this strategy is not cost-effective from area, power

and efficiency points of view. As cache performance can severely affect the performance

of software, how it can help a search accelerator needs to be investigated.

To facilitate analysis, design and testing, a parameterisable cache memory block was

first designed and tested. The cache uses a pseudo-LRU replacement mechanism for

2-way and 4-way associativity configurations. The associativity, size and line width of

the cache are set by conditional defines passed on the command line by

simulation scripts.
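The RTL of the cache block is not reproduced here, but a small software model of a tree pseudo-LRU policy for one 4-way set illustrates the general replacement technique; the exact bit encoding used in the hardware may differ.

// Software model of a tree pseudo-LRU replacement policy for one 4-way set.
struct Plru4
{
  bool b0, b1, b2;   // b0: LRU half (0 = ways 0-1, 1 = ways 2-3);
                     // b1: LRU way in the left half; b2: LRU way in the right half
  Plru4() : b0(false), b1(false), b2(false) {}

  // Pick a victim by following the bits towards the least-recently-used way.
  int victim() const
  {
    if (!b0) return b1 ? 1 : 0;
    else     return b2 ? 3 : 2;
  }

  // On an access, point the bits on the path away from the way just used.
  void touch(int way)
  {
    if (way < 2) { b0 = true;  b1 = (way == 0); }  // left half used, so LRU is the right half
    else         { b0 = false; b2 = (way == 2); }  // right half used, so LRU is the left half
  }
};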

Figure 8.1: Cache simulation setup (a 32-bit RISC CPU with L1 instruction and data caches; .text at 0x00000000–0x000FFFFF, .data at 0x00100000–0x001FFFFF, and the heap and stack at 0x80000000–0x81FFFFFF)

Figure 8.1 illustrates how the caches and memory were set up for simulation. The

memory transfers were monitored by simulation scripts and dumped as text, which was

then post-processed using text processing tools on the host computer. The memory

space was divided into the following main address spaces for easy monitoring:

.text is reserved for read-only instruction memory. This memory can either be im-

plemented as simulated on-chip ROM or off-chip flash. In this case, it is was an

on-chip device. Therefore, all transfers happen at the fastest possible rate to avoid

slow downs in simulation due to excessive wait times for instructions.

.data is reserved for read-write initialised data memory. This memory was implemented

as a small block of on-chip RAM. It mainly holds certain variables, constants,

strings and other pre-initialised values.

.heap is reserved for the read-write heap memory. It represents an uninitialised block of

external memory. This is where the dynamically allocated data (using malloc()


and free()) is located. Entire data structures were stored within this area in-

cluding trees, lists and other dynamically linked data structures.

.stack is reserved for the software stack. This was used for function call and return

overheads and for passing parameters between functions. Some parts of the data

structure may also be stored within this area, such as the structural information

of a tree that is located in the heap.

A software kernel was written to test the operation of the cache memory block by

performing a key search in a tree. Two types of code were used: a random value search

and a repeated value search. The simulation trials were conducted with a different

number of loop iterations defined as ITERS in the code. For each value of ITERS, 50

samples were collected.

Listing 8.1 shows part of the cache simulation construct in Verilog. Lines 210–225

save the external memory contents into a Verilog memory (VMEM) file. Lines 227–230 load the contents from the Verilog memory file into external memory. These operations simulate the load and save operations on a computer. Constructing each tree takes O(N log N) time, which translates into more than a day of real-world simulation time for

such a large tree. Therefore, reusing a saved tree reduces simulation time tremendously.

Listing 8.2 shows the data preparation code used to pre-build the search tree. Lines

39–44 fill a red-black tree with 2^16 (65,536) records. The resultant data structure has a node that

is 6 words in size, which results in a 400kbyte data set. Line 48 triggers the simulation

construct that saves the external memory into a Verilog memory (VMEM) file.

Listing 8.3 shows the cache simulation kernel. Line 44 triggers the simulation con-

struct to transfer the pre-built tree structure into the heap. Lines 50 and 62 enable and

disable the data cache for simulation, which is used to limit the cache results to pure

search code. Lines 52–59 iterate through the red-black tree search a number of times.

210 if (dwb_stb_o & dwb_ack_i & dwb_wre_o & (dwb_adr_o[31:16] == 16'h0200)) begin
      $strobe("SAVE MEMORY");
213   fname = $fopen("dopb.vmem");
      $fdisplayh(fname, "/* Save OPB RAM */");
      // save heap
      for (save = 0; save < 32'h00080000; save = save + 1) begin
        $fdisplayh(fname, "@", save, " ", {rDOPB[save]});
218   end
      // save stack - important!!! as some information is pushed onto the stack by the compiler
      for (save = 32'h07D0000; save < 32'h07EFFFF; save = save + 1) begin
        $fdisplayh(fname, "@", save, " ", {rDOPB[save]});
      end
223   $fclose(fname);
    end

    if (dwb_stb_o & dwb_ack_i & dwb_wre_o & (dwb_adr_o[31:16] == 16'h0400)) begin
228   $strobe("LOAD MEMORY");
      $readmemh("dopb.vmem", rDOPB);
    end

Listing 8.1: Verilog simulation LOAD/SAVE


31 #define NODE_MAX 0x010000 // 64k nodes

   int main() // works with -O1
   {
35   // declare and create a tree
     std::set<int> *rbtree;
     rbtree = new std::set<int>();

     // pre-fill the tree
40   for (int i = 0; i < NODE_MAX; ++i)
     {
       *hsx::STDO = i;
       rbtree->insert(i << 16);
     }
45
     // save/load the tree
     *hsx::STDO = rbtree->size();
     *hsx::SAVE = -1;

50   rbtree->clear();
     *hsx::STDO = rbtree->size();

     *hsx::LOAD = -1;
     *hsx::STDO = rbtree->size();
55
     aemb::enableDataCache(); // start the cache test

     // list the tree
     for (std::set<int>::iterator node = rbtree->begin(); node != rbtree->end(); node++)
60   {
       *hsx::STDO = *node;
     }

     aemb::disableDataCache(); // disable the cache test
65
     exit(0);
   }

Listing 8.2: Cache tree fill kernel

37 int main()
   {
39   // declare and create a tree
     std::set<int> *rbtree;
     rbtree = new std::set<int>(); // create a rbtree object in the heap
     rbtree->clear(); // !!! do not skip this step

44   *hsx::LOAD = -1; // simulator heap load

     // search for 10 values
     std::set<int>::iterator node;

49   // enable cache
     aemb::enableDataCache();

     int j = *hsx::PRNG << 16;
     for (int i=0; i<ITERS; ++i)
54   {
       *hsx::PRNG = -1; // invalidate the PRNG cache entry. remember to discount this.
       j = *hsx::PRNG << 16;
       node = rbtree->find(j);
       *hsx::STDO = *node;
59   }

     // disable cache
     aemb::disableDataCache();

64   exit(0);
   }

Listing 8.3: Cache simulation kernel


Lines 55–56 are commented out for the repeated search case or left uncommented for

the random search case.

Figure 8.2: Basic cache operation, showing data cache and instruction cache hit-miss ratios (h:m) against loop iteration (n) for the repetitive and random searches with the 2K1W2L configuration; the data cache plots show separate Data, Stack and Heap curves

8.3.1 Instruction Cache

At this point, the intention was to verify the functionality of basic data and instruction

cache blocks. The specific results are less important as cache parameter effects on

search operations are simulated later. Figure 8.2, however, yields some useful early

considerations. Extrapolating the points for the instruction cache linearly yields the

following relationships:

Irep(N) = 1.25 N + 17.3 (8.1)

Irnd(N) = 1.21 N + 16.3 (8.2)

Both equations 8.1 and 8.2 show that the instruction hit ratio improves linearly with

the number of iterations through the loop for both repetitive and random cases. The

two graphs are very similar in nature and within 3.2% of each other, and any difference

is insignificant.

The expected linear trend is correct because the same code loop and the same in-

structions are being run each time. This will only happen when the instruction code

size is small enough to be loaded once and fit entirely in the cache. In the case of


the simulation kernel, the software runs only through the red-black tree search routine,

which is only a sub-section of the total code. The simulation results suggest that the

cache is correctly capturing and retaining needed data. Spatial and temporal locality

are beneficially exploited.

8.3.2 Data Cache

The situation for the data cache is very different. Both the stack and heap were cached

in the data cache and they exhibit different characteristics. The search tree was stored in

the heap, while the function call and return overheads were stored in the stack. As can

be seen from the individual curves, the stack exhibits a higher hit ratio than the heap.

The performance of the stack cache indicates that a data cache would be a valuable

addition for software function call and return operations.

However, the heap ratio for the repeat case is 0.9 while the random case is 0.09 only.

The 10 times difference is due to temporal locality. It is expected for the random case

as different searches traverse down different branches of the tree for each iteration. But

the repeat case is expected to perform linearly, like the instruction cache, as it steps

through the same data nodes each iteration.

However, the size of the data cache is very much smaller than the size of the data

space. The 2K1W2L parameter gives a 2kbyte cache size (2K), organised in a direct-

mapped (1W) configuration with a 2 word cache line (2L) in 256 blocks. The task of

searching a single key involves calling several C++ STL subroutines, which overwrite the

heap cache with stack values. This results in the characteristic curve of diminishing

returns for the data cache in each case.

It is highly unlikely that multiple search operations are going to search the exact

same data every time. Although the same algorithm may be used, the data sets and

keys may change. The results are an early indicator that a search data set may not be

suitable for caching inside a data cache. The results from the stack performance also

show that at least 30 iterations are needed to get a result that is close to the value for a

large number of iterations. This helps to determine the minimum number of iterations

that need to be performed in later parts of the chapter.

8.4 Cache Parameters

The next issue is how different cache configurations affect search operation, with par-

ticular attention to the heap cache. The cache was checked by running a software loop

through 50 iterations that searches for a specific key in a tree. Then, each combination

of cache parameters is sampled over 50 iterations and the hit-miss-ratio is recorded.

The cache parameters are changed one at a time and the entire simulation repeated. A


complete simulation run took several days.

Figures 8.3, 8.4, 8.5 and 8.6 are grouped by memory size as this translates directly

into physical cost, which is an important factor in any physical implementation. Each

figure contains 6 sub-figures labeled with a different cache size. Each sub-figure is a

heat-map with associativity (N-way) and cache line width (2^N words) as the param-

eters, while the colour represents the hit-miss-ratio. The results provide us with some

interesting trends and guidelines for search oriented caches.

8.4.1 Instruction Cache

Figure 8.3 shows the results of the instruction hit ratio, for different cache parameters.

For the instruction hit ratio simulation, the entire search kernel was about 4kbytes in

size. It was compiled using a -O2 optimisation level. The instruction cache performed as

expected, based on previous results. All the plots exhibit similar trends, which indicate

consistency in the results. Performance improves with increased size, line width and

associativity.

Interpolating from the raw data provides the following approximation of the instruc-

tion hit ratio. The cache-line width is defined as 2^L words and the cache size is 2^K bytes.

Ihit = 2^(0.04KL + 0.22L − 0.06K + 5.1) (8.3)
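As a quick check of what the fitted model predicts, equation 8.3 can be evaluated for two representative configurations; the helper function below is purely illustrative and not part of the simulation flow.

#include <cmath>
#include <cstdio>

// Evaluate the fitted instruction hit-ratio model of equation 8.3, where the
// cache line is 2^L words and the cache size is 2^K bytes.
double ihit(double K, double L)
{
  return std::pow(2.0, 0.04 * K * L + 0.22 * L - 0.06 * K + 5.1);
}

int main()
{
  std::printf("2K cache, 2-word lines  : %.0f\n", ihit(11, 1));
  std::printf("64K cache, 64-word lines: %.0f\n", ihit(16, 6));
  return 0;
}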

Line Width. The most obvious improvement was seen with each line width increment.

This is because instructions are stored and executed in sequence. Increasing the line

width will improve the spatial locality performance. Longer lines capture more sequential

data at once but also increase the line refill time. However, due to the sequential nature

of instruction operation, the increased hit ratio makes up for the increased cache refill

times.

Associativity. Increased associativity also increased performance, but not by a sig-

nificant amount. We can see graphically that increasing the associativity from direct-

mapped to 4-way did not change the hit ratio. Increased associativity only helps if the

kernel size is significantly larger than the cache size. If it is smaller, memory aliasing

does not occur as the entire kernel is retained within the cache block. Equation 8.3

dropped the associativity value entirely as it was insignificant.

Cache Size. Figure 8.3 did not show any improvement for larger caches. Larger caches

are redundant when the kernel is only 4kbytes in size. For a much larger software kernel,

an observable improvement is expected over increasing cache sizes. Equation 8.3 shows that size variability is a less significant factor on cache performance than line width.

Figure 8.3: Instruction cache hit ratio (hit-miss ratio heat maps plotted against line width, 2^N words, and associativity, 2^N ways, for cache sizes from 2K to 64K)

8.4.2 Data Cache Trends (Repeat Key)

Now that the basic instruction cache trends for executing search algorithms are known,

the data cache trends for search need to be observed. The investigation focuses on the

heap hit ratio only, as the data structures are almost entirely stored in the heap. Figure

8.4 shows the results of the repeat simulation with different cache parameters. All the

plots have a similar shape, which indicate consistency in the results.

Figure 8.4: Repetitive heap cache (hit-miss ratio heat maps plotted against cache line width, 2^L words, and associativity, 2^W ways, for cache sizes from 2K to 64K)

Line Width. Once again, the most visible improvements were due to an increase in cache line width. There is a visible leap in hit ratio when moving from a line width

of 8 into 16 and 32 words. However, each data node is only 6 words in size. In this

specific case, the nodes were inserted into the tree in-order. For a tree that is built in-

order, the child nodes are located spatially close to each other, which results in hit ratio

improvements for longer cache-lines. For a tree that is built with randomly inserted data,

adjacent nodes are scattered through the heap, losing the advantage of spatial locality

for longer cache-lines. In a real-world scenario, it is unlikely that the tree nodes will be

spatially close to each other due to frequent data insertions and removals. Therefore, in

real-world applications, longer cache-lines will be beneficial up until they are about the

size of a single data node.

Cache Size. Unlike the instruction cache, there was a visible benefit in increasing the

size of the overall cache. This is reflected in the colour values that increased from a

maxima of 2.8 to 8.0 on the plots. It can be safely concluded that a larger cache size


will improve the hit ratio until the size of the cache approximates the data set size of

400kbytes. However, the improvement is not linear as doubling the cache size did not

double the hit ratio.

Associativity. Associativity improvements seem a little more complex at first. Counterintuitively, an increase in associativity worsens the hit-ratio when the cache line is very

short. However, unless the data used were stored in multiple memory locations aliased

by a single cache block, increased associativity is less effective. As associativity increases,

the number of memory locations that map to a single block of cache actually increases.

This increases the chance of old data being evicted, when it otherwise would not have

been.

8.4.3 Data Cache Trends (Random Key)

Now that a baseline data cache performance has been established, a better approxima-

tion to real-world operation can be observed. Figure 8.5 shows the simulation results

for similar conditions as above, except that a different key is searched each time. All

the plots have a similar shape, which indicate consistency in the results. But this gives

a very different picture from that of section 8.4.2.

Line Width. Once again, the most visible improvement in hit ratio came from an

increase in line width. However, the improvements are non-linear as a doubling of line

width is not accompanied by a corresponding doubling in performance. This adds weight

to the earlier assertion that for real-world applications, increasing the line widths would

not continue to increase the hit ratio significantly. The reasons for this are the same as

before. The data structure will benefit from spatial locality, but only to the extent of

the data structure size of a single data node.

Cache Size. This gave a very different result as compared to the section before.

Increasing the cache size does not seem to have any significant effect on the hit-ratio as

the maximum values stay within the 1.6–1.8 range only while it increased steadily from

2.8–8.0 in the repeat case. This is because the data structure traversed exhibits only

limited temporal locality characteristics. The data held in cache is wasted because it is

not needed again and the cache has to continuously fetch new data from main memory.

Associativity. For the same reasons as above, changes to the associativity did not bring any visible benefit. Increasing the associativity did not change the hit-ratio significantly, as evidenced by the lack of change in the heat map from 1-way to 4-way. Associativity only helps when there is a problem with cache thrashing. In this case, there


[Figure 8.5: Random heap cache. Heat maps of the random-search hit-miss ratio for cache sizes of 2K, 4K, 8K, 16K, 32K and 64K, plotted against cache line length (2^L words) and cache associativity (2^W ways).]

is little cache contention because there is little temporal locality to begin with. So,

practically all the data in cache can be discarded.

8.5 Data Cache Prefetching

The results obtained show that data caches tend to exhibit a low hit-ratio for search operations. It is possible that prefetching techniques can be used to improve the

performance. Prefetching techniques can be categorised into two main categories: static

and dynamic. More details can be found elsewhere [KY05].


8.5.1 Static Prefetching

Static prefetching involves making software modifications during compile-time to initiate

prefetching. Hardware changes are needed to implement a non-blocking method of

initiating memory access or a specialised sub-block to prefetch data into cache. This

operation is then accessed through a special software prefetch instruction.

There are two main approaches used: latency tolerance and latency reduction. There

are two categories of data that the techniques work on: arrays and linked data.

Latency tolerance involves prefetching memory in order to reduce the number of

misses. This will still incur a heavy cost on memory bandwidth as all the search data

still need to be prefetched. As bandwidth is finite, the performance will still eventually

reach a memory bandwidth limit.

Latency reduction involves retaining data in the cache to reduce the number of

misses. This method is more sustainable as it does not significantly increase the band-

width cost. However, it may not help for search data.

For arrays, data are usually stored in sequential locations and would benefit from

spatial locality. Also, data arrays are usually statically allocated during compile time.

So, it is trivial for the compiler to figure out which addresses to prefetch during compile-

time and safely reorder instructions to insert prefetch instructions.

For linked data, the next block to prefetch cannot be predicted during compile time.

Furthermore, there are no guarantees that the memory allocation would be sequential or

linear. Therefore, such cases are more difficult to handle than the arrays. Unfortunately,

this is the category of data that will be present for search applications.

Direct Arrays are the easiest data structures to prefetch. Access to these arrays is statically defined, and software operations that act on them can be thoroughly analysed

during compile-time. Instructions can then be accurately reordered and prefetched.

However, this method is only useful for searching and sorting through arrays and not

more complex search structures.

Indirect Arrays are slightly more difficult to handle than direct arrays. Access to

these arrays depends on an index that is only available during run-time. However, the

structure of the data is well defined. All that is needed is to precalculate the index value

and prefetch it. The rest is similar to direct arrays.

Tiling essentially breaks down and reorders loop operations into smaller chunks. This

allows work to be done on smaller chunks of the array at a time, to exploit temporal

locality. This would be fairly useful for signal processing applications but not for search.


Natural Pointers involve reordering instructions and inserting a special operation to

fetch the next child node before it is needed. It suffers from a few weaknesses. It will

only be useful if there is a significant number of other operations between the prefetch

and when the node is needed. There will be no benefit if the next node is operated

on immediately after the prefetch, as in the case of streaming data. Also, it is only

able to prefetch nodes directly next to the present node, which limits its effectiveness at

prefetching longer paths.
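To make this concrete, the following is a minimal C sketch of a compiler-inserted natural-pointer prefetch, using the GCC/Clang __builtin_prefetch intrinsic; the node layout and the do_work() routine are illustrative assumptions, not part of the accelerator design.

#include <stddef.h>

struct node {
    int          key;
    struct node *next;   /* natural pointer to the adjacent node */
};

/* Stand-in for the work done on each node; the prefetch only pays off if
 * this takes long enough to cover the latency of fetching n->next. */
static int do_work(const struct node *n) { return n->key * 2; }

static long walk_with_prefetch(const struct node *head)
{
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        if (n->next != NULL)
            __builtin_prefetch(n->next, 0, 0);   /* read, low temporal locality */
        sum += do_work(n);   /* if this is short (streaming), the prefetch gains little */
    }
    return sum;
}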

Jump Pointers involve modifying the software data structure to store additional

pointers that hint on nodes to prefetch. These can point further down the path, solv-

ing the problem of natural pointers. However, this method would require large modifications to existing software and hardware, which limits its utility.
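As an illustration only (the field names are hypothetical), the jump-pointer variant adds a pointer several hops ahead to each node, so the prefetch can reach further down the path than the natural pointer allows:

struct jp_node {
    int             key;
    struct jp_node *next;   /* natural pointer */
    struct jp_node *jump;   /* points several nodes ahead; must be maintained
                             * by the software on every insertion and removal */
};

static void walk_with_jump_prefetch(const struct jp_node *head)
{
    for (const struct jp_node *n = head; n != NULL; n = n->next) {
        if (n->jump != NULL)
            __builtin_prefetch(n->jump, 0, 0);   /* fetch well ahead of use */
        /* ... operate on n ... */
    }
}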

Data Linearisation works by reordering the location of data nodes to improve spatial

locality, during data insertion or deletion. However, this involves changing the malloc()

routine to re-order data structures on-the-fly. It mainly benefits data structures that

rarely change. Otherwise, the penalty for re-ordering will be significant.

8.5.2 Dynamic Prefetching

Dynamic prefetching is a hardware technique to detect and initiate prefetching. This

can be performed at various levels of the memory hierarchy. The closer the hardware

prefetcher is located to the processor, the more information it has access to, in order to

calculate the prefetch location. Therefore, it would make sense to include the hardware

prefetcher within the processor, where possible.

Stride prefetching works by detecting and prefetching sequential accesses that are a

fixed distance apart. This can be implemented with extra hardware, using a stream

buffer. This is particularly useful for signal processing applications, which often access

data at a fixed distance, such as for filters. But for a scattered heap, it is less useful.

Correlation prefetching records the sequence of memory accesses and uses the in-

formation to decide on which blocks to prefetch. Most linked data are often stored at

random locations in memory. Although seemingly random, a repeating sequence can

be detected during run time. For example, multiple searches through a tree will always

access the root node, followed by one of its child nodes and so on. However, it requires

some initial time to build up the correlation tables and a lot of additional hardware.


Content-Based prefetching uses the content stored inside a data structure to predict

which block to prefetch next. It examines the data structure itself to identify potential

pointers stored within them. These can be identified by comparing the most significant

bits of the content with those of the current data being fetched. This is based on the

assumption that the nodes would be allocated within a similar section of memory (e.g.

within the heap area). This can be tricky when the data structure contains data that

have similar upper bits. Also, like natural pointers, it can only look down the search

path by one level.
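A rough software model of this heuristic is sketched below; the heap base address and mask are illustrative assumptions, and a real implementation would sit in the cache fill path rather than in software.

#include <stdint.h>
#include <stddef.h>

#define HEAP_BASE 0x10000000u   /* assumed start of the heap region */
#define HEAP_MASK 0xF0000000u   /* compare only the most significant bits */

/* Scan a freshly fetched cache line and prefetch anything that looks like a
 * pointer into the heap.  Plain data that happens to share the heap's upper
 * bits causes useless prefetches, and only one level of the path is covered. */
static void content_prefetch(const uint32_t *line, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        if ((line[i] & HEAP_MASK) == (HEAP_BASE & HEAP_MASK))
            __builtin_prefetch((const void *)(uintptr_t)line[i], 0, 0);
    }
}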

8.5.3 Prefetched Data Cache

Most of the static techniques do not reduce the memory transaction cost but merely hide the cost by performing the fetch before the data are actually needed. Therefore, a

hardware technique was investigated to see how it affects the hit ratio. Content-based

dynamic prefetching was chosen as it seemed the most suitable for accelerating search

through linked data. Figure 8.6 shows the simulation result of a data cache with dynamic

content-based prediction hardware.

A comparison of this result with the result in figure 8.5 will show that it is virtually

the same. In fact, the raw numbers show a very slight decrease in hit ratio for the

prefetched cache.

The reason that prefetching does not work well for search is the nature of search itself. Prefetching works on the principle of structural locality, where the data element that is linked next is prefetched. This should hide any cache misses. Due to temporal locality, the data that are prefetched and retained can then be used later.

However, search data are needed for only a brief period of time. Temporal locality

is virtually non-existent for search operations. Pointer chasing would not benefit from

prefetching unless there was a significant time gap between the pointer fetch and data

use. In the case of a chase or stream operation, neither of these is true. Therefore, any

performance advantage that can be gained from prefetching is lost because the data are

needed soon after.

Hence, prefetching data into the cache is more suitable for applications that require regular data access, such as signal processing. For unpredictable data access, prefetching offers little advantage over a cache with thoroughly random reads. In fact, it may even incur a small penalty due to the wasted prefetching

overhead.


[Figure 8.6: Random heap cache (with prefetch). Heat maps of the random-search hit-miss ratio for cache sizes of 2K to 64K, plotted against cache line length (2^L words) and cache associativity (2^W ways), with content-based prefetching enabled.]

8.6 Cache Integration

For all the simulations above, the data set size is in the order of 1.6Mbytes of memory, while the different caches used are between 2kbytes (0.1%) and 64kbytes (4.0%) of the data set size.

These sizes were chosen to reflect real-world L1 cache sizes such as those on AMD processors¹. But this extremely small cache ratio means that hardly any part of the data set can be stored in cache. Therefore, the performance does not change significantly even with different cache sizes. The instruction cache performs better because the ratio of cache size to instruction memory size is much higher.

¹ http://web.archive.org/web/20080123090140/http://www.sandpile.org/impl/k8.htm

However, increasing the cache size to improve performance is extremely expensive, as static RAM is a very expensive resource on a chip. A block of single-port 64kbit memory takes up 3.44mm² (€1,995) in 0.35µm and 0.21mm² (€230) in 0.13µm, according to recent prices from Europractice². Therefore, it is important to determine a good cache size that uses minimal resources while still providing some reduction in memory bandwidth. Earlier results show that associativity does not help much, while a longer cache line helps more. Therefore, a direct-mapped cache design can be used for the investigation.

Of the different accelerators, the chaser would benefit from a cache if used for multi-key searches on a single data structure, which exhibit temporal locality. A data structure was constructed to fit almost entirely into the largest cache size. A multi-

key search was then performed to search a portion of the data set.

[Figure 8.7: Cache structure comparison. Left: cache performance with structure (multi-key), plotting memory transactions and the N/S ratio against the cache size ratio (%). Right: cache performance with size (multi-key), plotting memory transactions and the speed-up ratio against the cache size ratio (%).]

8.6.1 Cache Size Ratio

The cache was inserted between the chaser unit and memory, and the multi-key simulation was performed with different cache size ratios. Each simulation was repeated 50 times

and the values are plotted in a graph. The left sub-figure of Figure 8.7 shows the re-

sults of the simulation against the size ratio, from 1% to 100%. The speed-up ratio was

measured against the number of memory transactions on a cacheless solution.

Looking at the cache performance with size, it is evident that as the cache size

increased, there was also an increase in performance. However, increasing the cache size

from about 1.5% to 100% only speeds up the search operation by about 50%. From the

graph, the optimal cache size would be about 10% of the data structure. Beyond this

point, the improvement in performance diminishes with the increase in cost.

² http://web.archive.org/web/20080120104958/http://www.europractice-ic.com/docs/MPW2008-general-v3.htm

8.6.2 Structural Locality

In order to exploit structural locality, a structural cache was designed. Figure 8.8 shows

the basic architecture of a structural cache. The basic concept behind it is a cache that

is segmented based on the levels of the tree. This can be likened to a form of enforced

pseudo-associativity, with the set chosen based on the level of the tree the data is in.

In this design, the LEVEL of the tree is produced by the chaser unit as it branches down

the tree and is not obtained from the memory address.

[Figure 8.8: Structural cache architecture. The memory address is split into TAG, INDEX, LINE and WORD fields; the tag RAM output is compared against the TAG to produce a HIT, and the LEVEL value supplied by the chaser selects the cache segment.]

Data nodes can be scattered randomly across the entire memory space. Therefore,

it is very possible that the nodes close to the root of a tree are aliased by the branches

of the tree. By associating only certain blocks of the cache with certain levels of the

tree, the chance of aliasing is reduced. This should increase the probability of keeping

the nodes close to the root of the tree, in cache.
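As a rough software model (the field widths and segment count below are illustrative assumptions, not the thesis parameters), the only change from a conventional direct-mapped lookup is that the tree LEVEL supplied by the chaser, rather than further address bits, picks the cache segment:

#include <stdint.h>

#define SEG_BITS   2                      /* 4 segments, one per group of tree levels */
#define INDEX_BITS 6                      /* 64 lines per segment (illustrative) */
#define NUM_SEGS   (1u << SEG_BITS)
#define NUM_LINES  (1u << INDEX_BITS)

/* Conventional direct-mapped index: taken purely from the address. */
static uint32_t flat_index(uint32_t addr, uint32_t line_bytes)
{
    return (addr / line_bytes) % (NUM_SEGS * NUM_LINES);
}

/* Structural index: the segment comes from the tree level reported by the
 * chaser, so nodes near the root cannot be evicted by deep branch nodes. */
static uint32_t structural_index(uint32_t addr, uint32_t line_bytes, uint32_t level)
{
    uint32_t seg  = level % NUM_SEGS;               /* enforced pseudo-associativity */
    uint32_t line = (addr / line_bytes) % NUM_LINES;
    return seg * NUM_LINES + line;
}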

A structural cache was designed and simulated using the same parameters as before.

As visible in the right sub-figure of Figure 8.7, the structural cache shifts the threshold

ratio down, allowing smaller caches to approximate the performance of a larger cache.

The use of this structural cache gives an improvement in performance of about 3% at

small cache sizes, for no extra cost. However, its significance reduces greatly for larger

caches, when almost all the data structure is stored in cache.

Therefore, if there are extra resources, a small structural cache can be integrated

with a chaser unit. This can give a slight performance boost by reducing the amount of

memory transactions needed to perform a multi-key search on a single data structure.

8.7 Conclusion

It is clear that instruction and data caches exhibit different characteristics and need

to be designed and configured separately. The instruction cache for search operations

would benefit from having a larger line width, associativity and size. Data caches would

only benefit from having a larger line width.

However, linked data structures used in search do not benefit significantly from data

caching. Data locality does not take into account the actual structure of the data as


each data node is usually scattered throughout the memory space. Furthermore, caches

designed to exploit temporal locality will fail when the data or key is different for each

search.

Spatial locality will help under certain circumstances. Neighbouring nodes that need

to be traversed, should be stored in spatially close locations. This will convert structural

locality into spatial locality and benefit from increased line widths in a cache structure.

None of these solutions, however, should be handled in hardware; they are all better handled in software. Unfortunately, most software requires that the memory be randomly accessible, which makes it difficult to enforce the necessary changes.

As the results show, for the case of search operations, data cache size does not matter

much. So, if there is a need to include any data cache for search operations, a simple

and small direct-mapped cache is all that is needed, with a larger line width. This cache

can benefit the chaser unit slightly, but it will have minimal impact on the streamer

unit. As a result, it may actually be better to allocate the limited chip resources to

something else instead of cache.


CHAPTER 9

Search Pipelines

The hardware layer right above the accelerator units is the search pipeline

layer. The different hardware units can be combined in different ways to

accomplish the task of processing different types of search queries.

9.1 Pipelines

Now that the different accelerator units have been introduced and discussed, it is essen-

tial to visualise how the units work in combination to accomplish search. The different

categories of search problems were described in Section 2.2. The most primitive form of

search is primary search, as described in Section 2.2.1, which forms the most common

search operation. Secondary searches were described in section 2.2.2 and covers other

kinds of search operations. The different search problems can be accelerated using a

different combination of accelerator units described below.

[Figure 9.1: Search pipeline abstraction — the three stages: Search, Retrieval and Collation.]

9.1.1 Primary Search

A primary search is a search for primary keys, which are expected to return a single unique result from the whole data set. This means searching for an equivalence key only, as any other search criteria may return more than a single unique result. Such an application can typically be accelerated using a single chaser unit connected to memory.


Whether or not an optional cache is used would depend entirely on the nature of the

search and the data set.

A primary search would typically only involve the first stage of the pipeline: search.

This is the most important stage as later stages and secondary searches are dependent

upon it. An analogy can be drawn with an instruction fetch stage, where the throughput

of operation is dependent on the issue rate of instructions. In this case, the search

pipeline throughput is dependent on the issue rate of primary key searches.

From equation 7.2, the size of the data set does not adversely affect the completion time for a chase operation. The completion time will only double if the data sets approach the size of $2^{43}$ nodes. Since this is unlikely to be reached, the chase operation can, for most estimations, be considered to complete in O(1) time with a dominant configuration overhead.

Assuming that the data set for indices is large, the key issue rate can be estimated to be $C'_{hw,m}(N) = 83.2$ ticks per key, or 0.012 keys per tick. From the assumption above, the amount of time for a single search is $\frac{390.2}{83.2} \geq 4.68$ times this rate. Although it will be

possible to reduce the overhead costs associated with key searching, it will not improve

the issue rate, which is quickly dominated by the actual key search, rather than the

overhead. Therefore, in order to amortise the configuration overhead cost, a hardware

chaser should be used for searching at least 5 keys on a single data structure before

switching over to a different data structure.

For a single primary search, a single chaser can provide a speed-up of $\frac{C'_{sw,s}(N)}{C'_{hw,s}(N)} \geq 2.53$ for a single key search on a sufficiently large data set. Although this is not much of an

acceleration, it is fundamentally all that can be done for this type of search unless there

is a fundamental change in the types of memory, data structures and algorithms used.

For applications where a data set needs to be regularly searched for keys, permanently

assigning a chaser to each data set would be beneficial. This would result in a multi-key

acceleration of $\frac{C'_{sw,m}(N)}{C'_{hw,m}(N)} \geq 3.43$.

These values are obtained without the use of any cache. Figure 8.7 shows the speed-up of using a cache. A small (10%) structural cache memory can increase the performance to $3.43 \times 1.3 = 4.46$ times, which is more significant. The key issue rate can also be slightly reduced to $\frac{83.2}{1.3} = 64.0$ ticks per key.

As mentioned in section 2.2.1 the only way to increase this performance is to assign

multiple chasers to perform a number of independent searches on different data sets

in parallel, preferably on parallel memory channels. In such a configuration, memory

contention would become a more significant issue, which is why it is important to have

multiple memory channels to service the memory requirements of the multiple chasers.


9.1.2 Simple Query

A simple query is the most primitive secondary search operation. It differs from the

primary key search and is expected to return one or more results from the whole data

set for a single key search. This form of query is accelerated using the most primi-

tive pipeline, with only two stages: search and retrieval. This is implemented using a

streamer unit in cascade with a chaser unit.

The first part of the search is a primary key search as described above. Once a key is

found, it can be mapped to a list of records that match the key. The streamer can then

be used to pull the records into the accelerator. As a single stream retrieval operation

does not provide any additional acceleration on top of key search, the basic pipeline only

accelerates as much as a single key search on a chaser as described before.

As before, the nature of the problem means that performance is increased by paral-

lelising multiple searches. The most obvious way to do this is to replicate the chaser–

streamer pipeline multiple times and to run independent queries on each. This will

result in a linear increase in hardware cost with the number of parallel search pipelines.

However, there is an alternative way of building a pipeline for simple queries.

Multiple streamers can be paired with a single chaser if the chase time is assumed

to be much lower than the time taken to retrieve the results stream. Knowing that the

key issue rate is about 83.2 ticks per key, the size of the data set that can be streamed

during the interim can be estimated. Using expression 5.2, this value is estimated to be $\frac{C'_{hw,m}(N)}{M'_{hw}(N)} \geq 3.68$.

So, for the first four results returned, the chaser will be kept busy with a second search. But if the streamer returns more than four results, the chaser will be left blocked until the streamer is free to service it. A simple query is likely to return more

than four results per key search. Therefore, it would be possible to match up multiple

streamer units with a single chaser instead of pairing them up one to one.

For sufficiently large data structures, the optimal number of streamers to service a

single chaser can be estimated using:

\[ \frac{M'_{hw}(N_1)}{C'_{hw,s}(N_2)} = 2.5 \times \frac{N_1}{\lg N_2} \tag{9.1} \]

It is safe to assume that for most common cases, N2 ≫ N1 ≫ 1 but it is difficult

to predict the exact or relative sizes of each data set. Therefore it will be impossible

to estimate exactly how many streamers should be paired. However, expression 9.1 can

provide an indication of the lower bound.

As mentioned before, a chaser would only be worth using if N2 ≥ 256, which gives a value of lg N2 ≥ 8 for the denominator. When N2 grows, the denominator grows only logarithmically. In addition, from the estimate above, pairing multiple streamers only pays off once N1 ≥ 4, and it is easy for N1 ≥ 8


in most applications, and when N1 grows, the numerator grows linearly. Therefore, it is likely that N1 ≫ lg N2. So, an assumption is made that $\frac{N_1}{\lg N_2} \approx 1$ as a minimum for practical applications.

This is a useful relationship to have as it can be a guide for deciding how many

streamers and chasers to place in a chip. Although no exact value is obtainable without

making prior assumptions about the data, a lower bound of 2.5 streamers per chaser

can be inferred from the relationship.
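As a purely illustrative check of expression 9.1 (the figures below are hypothetical, not measured), a query that returns $N_1 = 16$ records per key against an index of $N_2 = 65536$ nodes gives

\[ \frac{M'_{hw}(N_1)}{C'_{hw,s}(N_2)} = 2.5 \times \frac{16}{\lg 65536} = 2.5 \times \frac{16}{16} = 2.5, \]

which matches the lower bound of about 2.5 streamers per chaser quoted above; larger result sets only push this number up.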

9.1.3 Range Query

A range query is similar to a simple query. There are a number of configurations that

can be used to accelerate this form of search, depending on how the range is bound.

The range can be bound at both ends or only one end.

For a range that is bound at both ends, there are two methods that can be used,

depending on the size of the range. A pure hardware method can be used to turn a

range query into a multi-key simple query for all the values within the range. Each

result can then be sent off to a streamer to be retrieved from memory. One or more

sieve units can be used in union mode to combine the results into a final results list.

A hybrid method can be used, where a chaser unit is used to chase down either the

lower or upper bound node. Once this value is retrieved, a software algorithm can be

used to traverse the tree, retrieving the other nodes within the range. These nodes can

then be fed to the streamer in a conventional way and the results can be combined in

hardware using a sieve unit. Which method is chosen would depend entirely on the size

of the data structure and the range of items to chase.
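A minimal sketch of the software half of this hybrid method is shown below, assuming a plain binary search tree with left/right child pointers; the node layout and the emit_to_streamer() hook are hypothetical. The walk prunes subtrees that cannot contain keys in the requested range and hands each matching node to the retrieval stage:

struct tnode {
    int           key;
    struct tnode *left, *right;
};

/* Hypothetical hook that queues a node's record list on a streamer unit. */
static void emit_to_streamer(const struct tnode *n) { (void)n; }

/* Visit every node with lo <= key <= hi, pruning subtrees that cannot
 * contain keys in the range; each hit is handed to the retrieval stage. */
static void range_walk(const struct tnode *n, int lo, int hi)
{
    if (n == NULL)
        return;
    if (n->key > lo)
        range_walk(n->left, lo, hi);
    if (n->key >= lo && n->key <= hi)
        emit_to_streamer(n);
    if (n->key < hi)
        range_walk(n->right, lo, hi);
}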

Assuming that the data structure is significantly large and the range (R) is fairly large, at about 10% of the data structure size, equations 7.1 and 7.4 can be used

to compare the options. Equation 7.1 needs to be modified slightly where lg N ⇒ R to

estimate the amount of time needed to traverse a tree, by observing the fact that lg N

actually symbolises the number of nodes visited when going down a tree.

\[ \frac{C_{sw,s}(R)}{C_{hw,m}(R)} = \frac{22.9R + 280.6}{83.2R + 318.2} = 0.275, \quad R \gg 1 \tag{9.2} \]

Equation 9.2 shows that regardless of the size of the range, the hybrid software

method will be faster than a multi-key hardware method. Although this can be improved

by shifting the root of the tree once one bound is found, the multi-key hardware method

would still be slower than the software traversal. Therefore, the only advantage that

the hardware method has over the hybrid method is offloading and not speed-up unless

a method can be found to allow hardware traversal of trees.

A slight variation of this hybrid method can also be used for a single bound range,


which traverses all the nodes on one branch of the tree. This exploits the fact that for

any node in a tree, all the nodes to the right branch are greater while the nodes to the

left branch are smaller. So, a chaser unit can be used to chase down the lower or upper

bound. When this is found, all the nodes to one side of the branch can be retrieved.

A hardware method of tree traversal is to use a streamer that is configured such that

its DATA and NEXT register is set to only one branch. This will force the streamer to

perform a depth-first traversal by traversing one branch and returning a list of pointers

down that branch. To estimate the effectiveness of this solution, equation 5.2 can be

used.

\[ \frac{C_{sw,s}(R)}{M_{hw}(R)} = \frac{22.9R + 280.6}{22.6R + 241.7} = 1.013, \quad R \gg 1 \tag{9.3} \]

Equation 9.3 tells us that there will not be any significant difference between a

hardware stackless tree traversal and a software method. If the software is modified

slightly to build stackless trees, this is all that is needed to retrieve the entire branch of

a tree. Otherwise, a hybrid method can be used, where the software is used to configure and set off multiple streamers down different branches, or to extract the nodes entirely in software, which will do no worse.

9.1.4 Boolean Query

A boolean query may involve all the different forms of queries above. It involves an

additional layer of result collation after any number of simple and range queries. Two

or more result streams can be collated with one or more sieve units.

In the most primitive applications, a sieve unit can be connected in cascade with

two streamer units. As measured earlier, this configuration alone will give a 5.2 times

speed-up. At this point, it may seem logical that combining this with chasers would

provide additional acceleration. Unfortunately, this is not true because the bottleneck

in such a pipeline is the streamer.
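For reference, a hedged software model of what a single sieve stage does in its intersect mode is sketched below; the hardware operates on streams of sorted record pointers, modelled here as sorted arrays of IDs, and the function names are illustrative:

#include <stddef.h>
#include <stdint.h>

/* Merge-style intersection of two ascending result streams, as a single
 * sieve stage would perform in intersect mode; union mode is analogous. */
static size_t sieve_intersect(const uint32_t *a, size_t na,
                              const uint32_t *b, size_t nb,
                              uint32_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])       i++;
        else if (a[i] > b[j])  j++;
        else { out[k++] = a[i]; i++; j++; }   /* record satisfies both criteria */
    }
    return k;   /* number of collated results */
}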

However, there is a situation where a sieve unit can provide additional acceleration.

For more complex collation operations, multiple sieves can be combined, in parallel

and cascade, in a logical manner. Depending on the complexity of the operation, the

speed-ups obtained can be significantly more than just 5.2 times.

It is also potentially possible to design a slightly more advanced sieve, which collates

more than two streams at a go. The only reason that this was not done was to simplify

things by working with primitive structures. While this will consume similar resources

with multiple cascaded sieves, it will reduce the number of stages in the search pipeline

and increase throughput for complex collations.


9.2 System Pipelining

With regard to throughput, the accelerator units can be combined in different ways,

which are described in the next chapter. Certain combinations of accelerator units can

then be constructed to address the different pipeline types expressed in this chapter. In

a situation where the pipelines are all single staged without overlapping, the amount

of acceleration would be similar to the values quoted in earlier sections. If multiple

pipelines are used in parallel, throughput can be increased until the bandwidth limit is

reached.

However, just like any other pipeline, the different stages could also be overlapped

to increase throughput. In such a situation, each accelerator unit has an output buffer that is treated as a pipeline buffer. The host processor will then be in charge of synchronisation

and reorganising the sequencing of data where necessary. While this has not been

explored specifically, a potentially larger total acceleration could be achieved, with a

modest number of hardware units.


CHAPTER 10

Implementation

The accelerator units can be combined together either in a dynamic or static

fashion and they can be used either as a bridge, a co-processor or even an I/O

device. In addition, the accelerators have primarily been designed for FPGA

implementation. However, some potential ASIC implementations (0.35µm and 0.13µm) are also explored.

10.1 Fabric Architectures

Although it is important to figure out the number of accelerator units that can be

implemented, there is also the question of how the units will be interconnected. Figure

10.1 shows some possible interconnection architectures for the accelerator itself. These

figures assume that there is a host interface and memory interface to each side. The

main difference in these two implementations is how the pipeline is structured, either

statically or dynamically. The choice of which to use would depend on the type of search

operations that need to be accelerated and the physical resources available.

10.1.1 Dynamic Fabric

In the dynamic fabric, the search pipeline is dynamically structured in hardware. De-

pending on the type of search being accelerated, the necessary accelerator units can be

allocated and linked together. The data path through this dynamic pipeline is mainly

controlled by configurable routers in hardware. These routers can be simple switch

fabrics that connect the outputs from one stage to the inputs of another and can be

adapted from existing switch architectures and also network-on-chip architectures.


[Figure 10.1: Implementation architectures — (A) a dynamic fabric, in which chaser (C), streamer (M) and sieve (V) units are linked through configurable routers (R), and (B) a static fabric of fixed chaser-streamer-sieve pipelines.]

There is also a possibility of using the sieve unit as a form of router. Each sieve unit

can be configured to perform a swap, as one of its operation modes. If alternated in

cascade, it is possible to build a swapping network to route signals around. It would also

be trivial to couple the swap operation with the intersect or union modes to perform

routing and result collation at the same time. In this case, the cost of routing can be

amortised as part of the search pipeline operation.

The advantage of such a configuration is the flexibility of the pipeline and the types

of problems it can solve. However, this flexibility comes at the expense of using multiple

hardware router units. This will increase both the limited hardware implementation

costs, and also the configuration overhead as the pipelines will now need to be configured,

on top of the accelerator units. Therefore, this would only be useful for performing

well defined search operations on extremely large databases such as searching a DNA

database.

10.1.2 Static Fabric

For other common operations, the ability to have such flexibility is expensive and may

be excessive. Most common searches would involve one of the common forms, such

as the example query. Therefore, it is also possible to accelerate most common search

problems without needing any dynamic configurability. The pipelines can be statically

configured and any additional complexity can be absorbed in the software. In this case,

the types of pipelines laid out may need to be considered.

There is no reason why static fabric should be composed of only one type of pipeline

accelerating one type of query. It would be better to mix different types of pipelines in

the accelerator and allow the software to select the pipeline to be used for acceleration.

The configuration shown in the figure can be used to accelerate a fairly complex boolean

query with four criteria or it could be used to accelerate four individual simple queries

or two simple queries and one boolean query. This is by no means the only pipeline

configuration possible.

Although hardware routers are not used, software routing could be used as a complementary method. Software routing could be used to link multiple serial queries or

link parallel parts of a single query or to link complex queries that would not otherwise

be accelerated. In the case of figure 10.1, the last sieve stage may be removed and

replaced with a pure software sieve, or a software-pumped sieve. Software can also play

the role of a router and move results between the pipelines and between stages. The

only disadvantage of software routing would be the slow-down factor, which is not a

problem if the results are produced at a slower rate than an earlier pipeline stage such

as for certain accelerator units like the sieve or chaser.

10.2 Integration Architectures

There is the question of how the entire accelerator will fit into a standard computer

architecture. Figure 10.2 shows some possible integration architectures. The only re-

quirement for the accelerators is to have access to the main memory pool and the host

processor. Therefore, it is possible to integrate the accelerator in any number of ways.

It can be tightly integrated as a bridge and co-processor device or loosely integrated as

an external I/O device. Each method has its advantages and disadvantages.

In either case, the accelerator can be an on-chip or off-chip device. An on-chip

accelerator, like an FPU or PadLock device, will be very tightly integrated with the

host processor and can share the main I/O ports. An off-chip accelerator will be easy

to integrate and dropped into existing systems with only slight modifications but will

need to replicate many resources of the host processor.

[Figure 10.2: System level implementation — the accelerator (HSX) connected to the CPU and main memory as (A) a bridge, (B) a co-processor, or (C) an I/O device.]

10.2.1 Tight Coupling

In a bridge mode, the accelerator is placed between the host processor and main memory.

To the host processor, the accelerator should behave like a memory device while to the

memory, it behaves like a host processor. When used in this mode, the accelerator is

closer to the main memory pool than the host processor, which allows the accelerator

to intercept the host’s access to memory. This allows the accelerator to regulate the


memory bandwidth used by the processor and block the processor when necessary. The

advantage of this arrangement is that all available memory bandwidth can be consumed

by the accelerator for search operations. The disadvantage is that the host processor

access to main memory will suffer from slower latency.

In a co-processor mode, the accelerator has the same priority as the host processor

and is placed next to the main memory. When used in this mode, the accelerator is

equally far from the main memory pool as the host. The accelerator communicates with

the host processor via a dedicated accelerator bus. However, as it does not have direct

control over the memory bus, it will be forced to share it with the main processor. This

arrangement reduces the latency for the host processor as compared to the bridge mode. Although it may still allow the accelerator to consume a large amount of memory bandwidth as necessary, the disadvantage is that the accelerator is not in control

of the amount of memory bandwidth it uses.

In both these closely coupled configurations, the accelerator is platform specific. In order to sit comfortably with the host processor, it needs to conform to the appropriate bus standards used to communicate with the host processor, which are typically different between microprocessor vendors and often proprietary. So, it further trades off platform

flexibility for faster access to the main memory pool.

10.2.2 Loose Coupling

In an I/O device mode, the accelerator sits alongside other I/O devices on an I/O bus

and behaves no differently from any other I/O device. Each individual accelerator unit can be mapped to an external memory or I/O space and accessed via the memory bus or a dedicated I/O bus. In this mode, the accelerator will need to include a host interface block, which could be chosen from any number of standard protocols, such as PCIe. The advantage of loose coupling is its simplicity; unlike the tightly coupled devices, it can be totally platform agnostic and be used on any platform.

However, this comes at the cost of reduced memory bandwidth. Any available band-

width would need to be shared between the accelerator and other I/O devices. Access to

main memory will also be affected as it will have to go through the main bus. Therefore,

this will have the slowest performance of the different coupling mechanisms.

10.3 FPGA Implementation

Owing to developments from mainstream computer and FPGA vendors, the accelerator units were designed with a potential FPGA implementation in mind. The AMD Torrenza¹ and Intel QuickAssist² programmes are two platform initiatives designed to open up standard computer systems. They both provide varying levels of support for the integration of specialised co-processors in their previously closed systems.

Major FPGA vendors have embraced this development by developing custom products for it. Both Xilinx and Altera have specialised products that can plug into a

co-processor socket on a suitable motherboard. These products connect the FPGA directly to the host CPU and provide memory access through the socket's DDR memory slots. They are designed to work alongside the host processor.

For the purpose of a sample implementation, a low-cost Xilinx Spartan3A FPGA is used as a test platform. All the implementation results quoted below are based on the Spartan3A FPGA device, which is built on 90nm technology. There are other FPGA families that can give better results, in terms of area, power and speed, than the Spartan3A. That is why the Spartan3A was chosen as a baseline representation of the worst-case performance available today.

10.3.1 Chaser Implementation

Report 10.1 shows relevant performance figures extracted from the implementation re-

ports for a chaser. It shows that the chaser unit is capable of running at 100MHz on an

FPGA, with the resource and power consumption for a single chaser unit being scored

as:

\[ C_{lut} = 515, \qquad C_{pow} = 9 \]

10.3.2 Streamer Implementation

Report 10.2 shows relevant performance figures extracted from the implementation re-

ports for a streamer. The report shows that the streamer unit is capable of running at

100MHz on a Spartan3A. Furthermore, it shows the resource and power consumption

for a single streamer unit running at 100MHz can each be scored as:

\[ M_{lut} = 347, \qquad M_{pow} = 4 \]

¹ http://web.archive.org/web/20071215011032/http://enterprise.amd.com/us-en/AMD-Business/Technology-Home/Torrenza.aspx
² http://www.intel.com/technology/platforms/quickassist

Report 10.1 Chaser FPGA implementation results (excerpt)

Design Summary

--------------

Number of errors: 0

Number of warnings: 63

Logic Utilization:

Number of Slice Flip Flops: 359 out of 11,776 3%

Number of 4 input LUTs: 514 out of 11,776 4%

Logic Distribution:

Number of occupied Slices: 387 out of 5,888 6%

Number of Slices containing only related logic: 387 out of 387 100%

Number of Slices containing unrelated logic: 0 out of 387 0%

*See NOTES below for an explanation of the effects of unrelated logic.

Total Number of 4 input LUTs: 515 out of 11,776 4%

Number used as logic: 388

Number used as a route-thru: 1

Number used for Dual Port RAMs: 126

(Two LUTs used per Dual Port RAM)

Number of bonded IOBs: 210 out of 372 56%

IOB Flip Flops: 67

Number of BUFGMUXs: 1 out of 24 4%

Power summary | I(mA) | P(mW) |

----------------------------------------------------------------

Total estimated power consumption | | 85 |

---

Total Vccint 1.20V | 40 | 49 |

Total Vccaux 2.50V | 14 | 35 |

Total Vcco25 2.50V | 1 | 1 |

---

Clocks | 8 | 9 |

Inputs | 0 | 0 |

Logic | 7 | 9 |

Outputs |

Vcco25 | 0 | 1 |

Signals | 0 | 0 |

---

Quiescent Vccint 1.20V | 26 | 31 |

Quiescent Vccaux 2.50V | 14 | 35 |

Quiescent Vcco25 2.50V | 0 | 1 |

Timing summary:

---------------

Timing errors: 0 Score: 0

Constraints cover 26762 paths, 0 nets, and 2247 connections

Design statistics:

Minimum period: 9.607ns{1} (Maximum frequency: 104.091MHz)


Report 10.2 Streamer FPGA implementation results (excerpt)

Design Summary

--------------

Number of errors: 0

Number of warnings: 2

Logic Utilization:

Number of Slice Flip Flops: 225 out of 11,776 1%

Number of 4 input LUTs: 347 out of 11,776 2%

Logic Distribution:

Number of occupied Slices: 255 out of 5,888 4%

Number of Slices containing only related logic: 255 out of 255 100%

Number of Slices containing unrelated logic: 0 out of 255 0%

*See NOTES below for an explanation of the effects of unrelated logic.

Total Number of 4 input LUTs: 347 out of 11,776 2%

Number used as logic: 283

Number used for Dual Port RAMs: 64

(Two LUTs used per Dual Port RAM)

Number of bonded IOBs: 210 out of 372 56%

IOB Flip Flops: 97

Number of BUFGMUXs: 1 out of 24 4%

Power summary | I(mA) | P(mW) |

----------------------------------------------------------------

Total estimated power consumption | | 71 |

---

Total Vccint 1.20V | 29 | 35 |

Total Vccaux 2.50V | 14 | 35 |

Total Vcco25 2.50V | 0 | 1 |

---

Clocks | 0 | 0 |

Inputs | 0 | 0 |

Logic | 4 | 4 |

Outputs |

Vcco25 | 0 | 0 |

Signals | 0 | 0 |

---

Quiescent Vccint 1.20V | 26 | 31 |

Quiescent Vccaux 2.50V | 14 | 35 |

Quiescent Vcco25 2.50V | 0 | 1 |

Timing summary:

---------------

Timing errors: 0 Score: 0

Constraints cover 5024 paths, 0 nets, and 1769 connections

Design statistics:

Minimum period: 7.427ns{1} (Maximum frequency: 134.644MHz)


10.3.3 Sieve Implementation

Report 10.3 shows the relevant numbers from the implementation report. It shows that the sieve unit is capable of running at 100MHz on an FPGA and gives the resource and power consumption for a single sieve unit running at that speed. They can each be scored as:

\[ V_{lut} = 634, \qquad V_{pow} = 17 \]

Report 10.3 Sieve FPGA implementation results (excerpt)

Design Summary

--------------

Number of errors: 0

Number of warnings: 2

Logic Utilization:

Number of Slice Flip Flops: 118 out of 11,776 1%

Number of 4 input LUTs: 629 out of 11,776 5%

Logic Distribution:

Number of occupied Slices: 332 out of 5,888 5%

Number of Slices containing only related logic: 332 out of 332 100%

Number of Slices containing unrelated logic: 0 out of 332 0%

*See NOTES below for an explanation of the effects of unrelated logic.

Total Number of 4 input LUTs: 634 out of 11,776 5%

Number used as logic: 373

Number used as a route-thru: 5

Number used for Dual Port RAMs: 256

(Two LUTs used per Dual Port RAM)

Number of bonded IOBs: 211 out of 372 56%

IOB Flip Flops: 37

Number of BUFGMUXs: 1 out of 24 4%

Power summary | I(mA) | P(mW) |

----------------------------------------------------------------

Total estimated power consumption | | 95 |

---

Total Vccint 1.20V | 45 | 54 |

Total Vccaux 2.50V | 14 | 35 |

Total Vcco25 2.50V | 2 | 6 |

---

Clocks | 5 | 5 |

Inputs | 0 | 0 |

Logic | 15 | 17 |

Outputs |

Vcco25 | 2 | 5 |

Signals | 0 | 0 |

---

Quiescent Vccint 1.20V | 26 | 31 |

Quiescent Vccaux 2.50V | 14 | 35 |

Quiescent Vcco25 2.50V | 0 | 1 |

Timing summary:

---------------

Timing errors: 0 Score: 0

Constraints cover 13855 paths, 0 nets, and 2648 connections

Design statistics:

Minimum period: 9.365ns{1} (Maximum frequency: 106.781MHz)

The retail price³ of a Spartan3A FPGA ranges from $5.75 for 1,408 LUTs ($0.004 per LUT) to $35.60 for 22,528 LUTs ($0.0016 per LUT). For the price estimates, a mean value of $0.0028 (£0.0017 ± 0.0003) per LUT is used.

³ Prices from the NuHorizons online store.


10.3.4 Resource & Power

From the reports above, an expression for the resource consumption and power dissipa-

tion of the hardware accelerator can be developed.

\[ Q_{lut} = 515 N_C + 357 N_M + 634 N_V + 126 \tag{10.1} \]
\[ Q_{pow} = 9 N_C + 4 N_M + 17 N_V + 75 \tag{10.2} \]

Expression 10.1 calculates the total resource consumption of the different units.

The constant factor of 126 is used by the accelerator bus arbitration device, which is

insignificant compared to the rest once the number of accelerator units starts to increase.

Expression 10.2 estimates the power consumption of the different units, in mW.

Furthermore, these figures only apply to the design as implemented on a Spartan3A

FPGA. If a different FPGA family is used, the values will be different. Despite this, the

expressions have some value as they give us an idea of the relative resource and power

consumption of each accelerator unit. These will be useful when estimating the number

of accelerator units to place in an FPGA with limited constraints.
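As a convenience, expressions 10.1 and 10.2 can be wrapped in a small budgeting helper. This is only a sketch using the coefficients quoted above for the Spartan3A; the unit mix in main() is hypothetical:

#include <stdio.h>

/* Resource and power estimates from expressions 10.1 and 10.2
 * (Spartan3A coefficients; other FPGA families will differ). */
static unsigned q_lut(unsigned nc, unsigned nm, unsigned nv)
{
    return 515 * nc + 357 * nm + 634 * nv + 126;   /* 126 = bus arbitration */
}

static unsigned q_pow_mw(unsigned nc, unsigned nm, unsigned nv)
{
    return 9 * nc + 4 * nm + 17 * nv + 75;         /* 75 mW quiescent baseline */
}

int main(void)
{
    /* Hypothetical mix: 4 chasers, 10 streamers, 5 sieves. */
    unsigned nc = 4, nm = 10, nv = 5;
    printf("LUTs: %u, power: %u mW\n", q_lut(nc, nm, nv), q_pow_mw(nc, nm, nv));
    return 0;
}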

10.3.5 Physical Limits

From the chip resource requirements, the absolute maximum limit of each type of unit can be calculated. The largest Spartan3 FPGA available has 22,528 LUT units [Xil08]; this works out to 51 chaser, 63 streamer or 35 sieve units. These numbers indicate

that there are more than enough resources on a low cost FPGA to hold the accelera-

tor units. With such high numbers of accelerators, the main limitation in the FPGA

implementation is memory bandwidth.

The sieve unit does not consume any external memory bandwidth, so it is purely resource-limited, but the streamer and chaser units both consume significant amounts of memory bandwidth. Assuming the maximum unit numbers above, the amount of memory bandwidth required would be 65.3Gbps and 134.2Gbps for the chasers and streamers running at 100MHz. Assuming that the system is connected to the fastest standard DDR2-1066 memory [JED07] available, the maximum memory bandwidth available to the system is only 68.2Gbps. Therefore, this memory bandwidth ultimately limits the numbers to NM ≤ 32 streamers or NC ≤ 53 chasers.
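For clarity, the per-unit figures implied by these numbers can be worked through explicitly; the steps below simply re-derive the quoted limits:

\begin{align*}
\text{per chaser:} \quad & 65.3\,\mathrm{Gbps} / 51 \approx 1.28\,\mathrm{Gbps}, & N_C &\leq \lfloor 68.2 / 1.28 \rfloor = 53, \\
\text{per streamer:} \quad & 134.2\,\mathrm{Gbps} / 63 \approx 2.13\,\mathrm{Gbps}, & N_M &\leq \lfloor 68.2 / 2.13 \rfloor = 32.
\end{align*}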

10.4 ASIC Implementation

Although the hardware accelerator is not fabricated as an ASIC, the accelerator units were synthesised for a standard cell ASIC implementation process for estimation purposes. The sample technologies chosen were AMS 0.35µm and UMC 0.13µm technologies.


[Figure 10.3: ASIC area and power estimates — area (mm²) and power (mW) plotted against core speed (MHz) for the chaser, streamer and sieve units in UMC 0.13µm and AMS 0.35µm technologies.]

Although 0.35µm technology is no longer used for new designs, it is useful to obtain some results for comparison purposes. The chosen 0.13µm technology is fairly recent, and the best possible timings are obtained for this technology. It is also the more expensive fabrication technology.

The retail fabrication prices⁴ for 0.35µm and 0.13µm are €720 and €1,168 (£572 and £928) per mm² of area, with minimum die size requirements of 10mm² and 25mm² (0.35µm and 0.13µm). The minimum quantities of chips obtained for these prices are 30 (0.35µm) and 45 (0.13µm) units. Therefore, the cost per mm² per chip is actually £19.06 and £20.62 respectively.

⁴ Prices from Europractice.

All the estimates were obtained using the typical case library only. This is useful

for getting a ballpark figure of merit for each accelerator unit but for actual fabrication

purposes, a comprehensive programme of testing is necessary using the best and worst

case libraries to ensure that the accelerators work correctly.

Although the area-speed graphs are drawn using straight lines, actual area-speed

curves are typically non-linear. This non-linear trend can actually be observed in the

distribution of the points in the graphs. However, a linear extrapolation is good enough

to provide an area estimate within a small margin of error.

Although dynamic and static power values are also available, the dynamic power

estimates are used as a measure. For regular applications, the dynamic power is more


important as it is the main source of power dissipation. The static power values are

only useful for mobile and embedded battery powered applications. In either case, it is

ultimately dependent upon fabrication technology. For the purpose of estimation and

comparison, an estimated power figure is sufficient.

Figure 10.3 shows the area and power estimates for each accelerator unit at different

operating frequencies. In each graph, the highest speed plotted is the one where the

operation completed with successful timing closure. These area estimates are obtained

without integrated cache memories.

For all cases, the estimates only apply to the core units themselves, as no pad cells

were used. Whether or not pad cells are used in actual implementations will depend

on how the accelerator units are integrated into the host system. These integration

architectures were discussed in an earlier section.

10.4.1 Area Estimates

Area size directly translates to cost as a function of fabrication process and yield. Al-

though some care was taken during the design process to make design choices that con-

sume less resources, the accelerators were designed to run at a fast speed and there is still

room for some improvement in terms of area by trading-off raw speed. However, this is

dependent both on the hardware technology chosen for implementation, and dependent

on the final application of the accelerator. So, the present designs are kept generic, to

allow final application customisation only when necessary. Linearly extrapolating each

line in the graph gives the following expressions for each unit (f in MHz):

C035(f) = 169.7 f + 527545 µm²
M035(f) = 197.7 f + 464857 µm²
V035(f) = 37.3 f + 593615 µm²
C013(f) = 2.57 f + 23769 µm²
M013(f) = 2.58 f + 14705 µm²
V013(f) = 1.27 f + 25893 µm²

These expressions show that there is a different rate of change for each accelerator

size with respect to speed but the effect of the speed on the area size will only become

significant for very large numbers of accelerator units (in the order of thousands). How-

ever, the changes in area size of the chaser and streamer are similar in each fabrication

technology but they are both significantly different from the sieve. This should be


taken into account when adding multiple chasers and streamers into a final application

under tight area constraints.

10.4.2 Power Estimates

Both dynamic and static power dissipation are very closely linked to the fabrication

technology used. Therefore, power optimisation is mainly a fabrication issue, rather

than an architecture issue. Of course, some minor steps can be taken to reduce power

consumption from within an architecture. For example, instead of using a 4-bit adder,

the FIFO counters were implemented as an LFSR with a single XOR gate. And, in-

stead of having multiple adders to calculate the pointers and offsets in the chaser and

streamer, a single adder was shared through multiplexing. All these steps are designed

to reduce the amount of resource consumption, hence transistor count. However, this

does not discount the fact that power is a process issue, more than an architectural one.

Extrapolating the lines in the graphs linearly gives the following expressions for each

unit (f in MHz):

C035(f) = 337.8 f + 18736.8 µW

M035(f) = 305.0 f + 13612.9 µW

V035(f) = 250.6 f + 11411.4 µW

C013(f) = 11.09 f + 551.9 µW

M013(f) = 7.02 f + 413.7 µW

V013(f) = 11.02 f + 475.3 µW
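
These fits can be checked by evaluating them directly. The short C++ sketch below uses the 0.35µm coefficients quoted above at the 333MHz limit (the 0.13µm fits can be evaluated the same way at 1GHz) and reproduces the area and power figures given later in Table 10.1; the structure names are illustrative only.

#include <cstdio>

// Linear fits a*f + b from the synthesis runs (f in MHz).
struct Fit { double a, b; };
struct Unit { const char* name; Fit area_um2; Fit power_uw; };

int main() {
    // 0.35um coefficients quoted above, evaluated at the 333MHz limit.
    const Unit units035[] = {
        {"chaser",   {169.7, 527545}, {337.8, 18736.8}},
        {"streamer", {197.7, 464857}, {305.0, 13612.9}},
        {"sieve",    { 37.3, 593615}, {250.6, 11411.4}},
    };
    const double f = 333.0;
    for (const Unit& u : units035) {
        const double area_mm2 = (u.area_um2.a * f + u.area_um2.b) / 1e6;
        const double power_mw = (u.power_uw.a * f + u.power_uw.b) / 1e3;
        std::printf("%-8s %.3f mm2  %.1f mW\n", u.name, area_mm2, power_mw);
    }
    return 0;
}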

10.4.3 Speed Estimates

Looking at the graphs, the speed limit is about 333MHz (0.35µ) and 1.0GHz (0.13µ)

respectively. For regular applications, the fastest DDR2-1066 memory has a memory

bandwidth of 68.2Gbps at 533MHz.

On 0.35µ technology, the memory speed is still significantly higher than the accel-

erators. With the assumption that the memory runs at 533MHz while the core runs at

333MHz, the maximum theoretical bandwidth consumed is 4.3Gbps for the chaser and

7.1Gbps for the streamer. This gives a maximum limit of NC ≤ 15 chasers and NM ≤ 9

streamers per chip, assuming an unlimited area budget.

On 0.13µ, the accelerators run faster than the memory speeds. Therefore, the mem-

ory bandwidth becomes a serious bottleneck for high speed devices, as is the case for


Estimate        0.35µm, f=333MHz          0.13µm, f=1GHz
                C035    M035    V035      C013    M013    V013
Area (mm²)      0.584   0.531   0.606     0.026   0.017   0.027
Power (mW)      131.2   115.2   94.9      11.6    7.4     11.5

Table 10.1: ASIC area and power estimates at speed

general purpose processors. The maximum theoretical bandwidth at 1GHz is 12.8Gbps

for a chaser and 21.3Gbps for the streamer. At these speeds, the maximum number of

units are NC ≤ 5 chasers and NM ≤ 3 streamers.
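
The same bandwidth ceiling can be expressed as a simple budget division. The sketch below divides the DDR2-1066 bandwidth by the per-unit figures quoted above; the helper name is illustrative and nothing beyond those quoted figures is assumed.

#include <cmath>
#include <cstdio>

// How many units fit inside a fixed memory bandwidth budget.
int max_units(double mem_bw_gbps, double unit_bw_gbps) {
    return static_cast<int>(std::floor(mem_bw_gbps / unit_bw_gbps));
}

int main() {
    const double ddr2_1066 = 68.2;  // Gbps at 533MHz, as quoted above
    // Per-unit bandwidths quoted in the text for 333MHz and 1GHz cores.
    std::printf("0.35um @ 333MHz: %d chasers, %d streamers\n",
                max_units(ddr2_1066, 4.3), max_units(ddr2_1066, 7.1));
    std::printf("0.13um @ 1GHz:   %d chasers, %d streamers\n",
                max_units(ddr2_1066, 12.8), max_units(ddr2_1066, 21.3));
    return 0;
}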

10.5 Cost Estimates

For cost estimations the frequencies are fixed at 250MHz and 667MHz (0.35µ and 0.13µ respectively), which is the median overlap speed of each technology. From the expressions above, it is clear that the price for a unit in 0.35µ is very much higher than for an equivalent unit in 0.13µ, even after taking the fabrication cost difference into account. This difference is mainly due

to the blocks of memory used in each unit, which depends on the RAM cells for the

selected technology library. The 0.13µ technology library used has a compact design for

the standard cells and memory.

         Chaser        Streamer      Sieve
0.35µ    £10.87        £9.81         £11.50
0.13µ    £0.53         £0.34         £0.55
FPGA     £0.88±0.15    £0.59±0.10    £1.08±0.19

Table 10.2: Fabrication cost per accelerator unit

The numbers in Table 10.2 are for small production runs and will be lower if the chip

is mass produced. Although it may seem that it is more cost effective to implement the

design in 0.13µ than an FPGA, this number does not take into account the minimum

die area. As the designs have to interface directly with memory, the chip will be a

pad limited design and will definitely cost much more than this table seems to indicate.

In addition, it does not take into consideration the packaging costs either. Therefore,

it is ultimately more cost effective to implement the design in an FPGA for custom

applications.


10.6 Conclusion

The FPGA implementation is cheap but not very fast. However, having many hardware

accelerators running in parallel can make up for any lack of speed, as long as the FPGA

is paired with suitably fast memory. Going onto an ASIC implementation allows the

clock speeds to breach the 1GHz mark. However, the memory bottleneck then becomes

a very serious issue.

From the various implementations, it is evident that there are physical limitations

on how many accelerator units can be included in a chip. Although chip area and power

consumption are both important factors, the overriding physical limit in each case is the

memory bandwidth.

However, there is still a question of how the limited bandwidth can be best utilised.

It can potentially be used to run 51 parallel chasers only, or spread out evenly between

the other units. Therefore, it is important to work out how best to utilise this limited

resource.

Due to the fundamental nature of the accelerator units, there are a number of ways

in which they can be assembled. The units can be routed dynamically, or statically,

via hardware or software. The accelerator can also be integrated at different distances

between the host processor and memory pool. There are advantages and disadvan-

tages to the different configurations. However, these are all the subject of the end user

application and do not fundamentally change the nature of this research.


CHAPTER 11

Analysis & Synthesis

Some questions need to be answered with regard to the results obtained so

far, how the solutions presented can scale, and also the estimated cost for

implementation. Although the solution accomplished the job of accelerating

search, there are other possible avenues for accelerating search that improve

search on other layers. There is also room for improvement with regard to

the actual design of the accelerator units. Suggestions are made about how

these may be explored in future work.

11.1 Important Questions

Now that the accelerator units have all been presented, some very pertinent questions

need to be answered. Firstly, it needs to be checked that the accelerator units actually

accomplish an acceleration. Secondly, potential bottlenecks that affect scalability should

be identified. Thirdly, the potential acceleration cost needs to be estimated.

11.2 Host Processor Performance

The first question is whether any actual acceleration takes place. All the comparisons

have been made between the accelerator performance and the software performance of

the host processor. The biggest assumption made thus far, for obtaining all the speed-up

values, is that the host processor is running optimally. If the host processor performance

were degraded due to sub-optimal software, it could affect the results. In addition, a


different host processor architecture may also affect the actual results. Both these issues

need to be taken into account for consistency in the results.

11.2.1 Software Optimisation

The bulk of the code uses data structures and algorithms from the standard C++ STL

library. As mentioned before, one of the reasons that this library was chosen is because

it contains time-tested, optimised and mature code. It is unlikely that any custom code

would be significantly better than the STL code.

Next, the choice of compiler can also affect the quality of the code produced. All the

code was compiled with GCC4, which is the latest generation of this popular optimising

compiler. Once again, the compiler used is time-tested, optimised and mature[GS04].

Therefore, it is unlikely that any hand-written assembly would perform significantly

better.

Next, the choice of optimisation level will also affect the performance of the code

produced. All the code was compiled using -O2 optimisations. This optimisation level

was chosen as it reflects the most common optimisation level used in user software.

According to GCC documentation, it presents a balance of both size and performance

and is the best choice for deployment of a program[GS04].

It is arguable that the -O3 optimisation may produce more optimised code, but

only at the expense of a larger code size. This may slow things down, in turn, due to

instruction cache contention. Furthermore, -O3 optimisation only performs a few extra

optimisations compared to -O2 optimisation, the main one being loop unrolling, which

explains the larger code size. Therefore, -O2 optimisations are good enough to present

a practical indication of performance.

Although the software operations were mainly written using higher level C/C++,

the low level API library was mostly written using in-line functions, including in-line

assembly language. Certain functions had to be written in assembly language as it was

impossible to invoke the necessary instructions from within C/C++. In-line functions

were chosen as they work like macros and do not need any extra function call and

return overheads, which ensures that these hardware specific operations do not create

any unnecessary bottlenecks in the code.
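
As an illustration of the in-line wrapper pattern described above, a minimal sketch is given below. The register address, macro and function names are hypothetical and not part of the actual API (which also needed in-line assembly for instructions that cannot be expressed in C/C++); the point is only that a static inline wrapper expands at the call site and adds no call or return overhead.

#include <stdint.h>

// Hypothetical memory-mapped configuration register of an accelerator unit.
#define ACCEL_CFG_REG ((volatile uint32_t *)0xFFFF0010u)

// Expands at the call site like a macro: a single store, no call/return overhead.
static inline void accel_set_config(uint32_t value) {
    *ACCEL_CFG_REG = value;
}

int main(void) {
    accel_set_config(0x1u);  // hypothetical enable bit
    return 0;
}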

So, although the software kernels used may not technically be the fully optimised

versions, they are sufficiently optimised to reflect real-world usage and fully optimised

versions may not present any significant performance improvements. Therefore, it is safe

to say that the performance exhibited by the accelerator units reflects the true nature of

hardware acceleration and is not due to a slow-down caused by poorly written software.


11.2.2 Processor Architecture

It is also important to have an idea of how well the host processor performs against

other processor architectures. There is no advantage in having an accelerator that is 10

times faster than a host processor if that processor is 100 times slower than any other processor.

So, for this purpose, the chosen host processor is compared against a mix of common

processor architectures covering both RISC and CISC architectures.

Relative performance can be estimated by using a code profile of standard library

code. The code chosen is the STL library find() method for a set data structure.

Listings 11.1, 11.2, 11.3, 11.4 and 11.5 are disassembly listings of the compiled find()

for different architectures. This code was chosen as it is indicative of an optimised key

retrieval operation. The code was compiled using GCC with -O2 optimisation.

It is not easy to compare the performance of such a disparate group of microproces-

sors as the architectures vary widely, both in scale and type. Hence, a few assumptions

need to be made to simplify the comparison. To ignore the effects of superscalar or mul-

ticore architectures, all processors are assumed to execute one instruction every clock

cycle. To ignore the use of any cache prediction, it is assumed that memory access

inflicts a memory transaction penalty during cache misses. To ignore the use of any

branch prediction, it is assumed that branch instructions inflict a penalty.
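
One way to turn these assumptions into a rough cycle estimate is sketched below. The penalty and miss-rate values are placeholders rather than measured figures, and the instruction mix would come from a profile such as Table 11.1.

#include <cstdio>

// Rough cycle model under the stated assumptions: one instruction per clock,
// a penalty for memory transactions on cache misses, and a penalty per branch.
struct Profile { double instructions, memory_ops, branches; };

double estimate_cycles(const Profile& p, double miss_rate,
                       double mem_penalty, double branch_penalty) {
    return p.instructions
         + p.memory_ops * miss_rate * mem_penalty
         + p.branches * branch_penalty;
}

int main() {
    // AEMB column of Table 11.1: 27 instructions, roughly 33% memory, 33% branch.
    const Profile aemb{27, 0.33 * 27, 0.33 * 27};
    // Placeholder penalties: 50% miss rate, 20-cycle miss, 3-cycle branch.
    std::printf("rough estimate: %.0f cycles\n", estimate_cycles(aemb, 0.5, 20, 3));
    return 0;
}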

In addition, as the architectures are very different, some of the instructions may not

fit the categories exactly and some further simplifications were made:

• The 68K TST instruction is counted as a compare instruction as it is essentially a

compare against zero.

• CISC architectures do not have explicit load and store instructions. Therefore,

the loads and stores are counted based on the addressing modes used instead.

• Branches were only considered for instructions that actually cause a change in the

program counter. Therefore, conditional instructions on the ARM were not clas-

sified as branches but counted as the relevant arithmetic or memory instruction.

• Miscellaneous instructions are instructions that moved data between registers and

instructions that provided convoluted linking and unlinking operations. Linking

and unlinking instructions on CISC machines are more complicated than RISC

machines.

A quick glance at the code will show that the pattern is mostly similar across archi-

tectures because the same compiler was used. This is further confirmed by tabulating

the profiles. The most important numbers in Table 11.1 are the memory access numbers

because both primary and secondary searches face memory bandwidth problems.


00000000 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >, std::allocator <int > >::find (int

const &)>:

0: e8660008 lwi r3, r6, 8

4: 30c60004 addik r6, r6, 4

8: be030034 beqid r3, 52 // 3c

c: 11460000 addk r10 , r6, r0

10: e9270000 lwi r9, r7, 0

14: b8100014 brid 20 // 28

18: 11030000 addk r8, r3, r0

1c: 11480000 addk r10 , r8, r0

20: e9080008 lwi r8, r8, 8

24: bc080018 beqi r8, 24 // 3c

28: e8680010 lwi r3, r8, 16

2c: 16491801 cmp r18 , r9, r3

30: bcb2ffec bgei r18 , -20 // 1c

34: e908000c lwi r8, r8, 12

38: bc28fff0 bnei r8, -16 // 28

3c: 16465000 rsubk r18 , r6, r10

40: bc120014 beqi r18 , 20 // 54

44: e8870000 lwi r4, r7, 0

48: e86a0010 lwi r3, r10 , 16

4c: 16432001 cmp r18 , r3, r4

50: bcb20010 bgei r18 , 16 // 60

54: f8c50000 swi r6, r5, 0

58: b60f0008 rtsd r15 , 8

5c: 10650000 addk r3, r5, r0

60: f9450000 swi r10 , r5, 0

64: b60f0008 rtsd r15 , 8

68: 10650000 addk r3, r5, r0

Listing 11.1: AEMB disassembly (GCC 4.1.1)

00000000 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >, std::allocator <int > >::find (int

const &)>:

0: e5903008 ldr r3, [r0, #8]

4: e2800004 add r0, r0, #4 ; 0x4

8: e3530000 cmp r3, #0 ; 0x0

c: e1a02001 mov r2, r1

10: e52de004 push {lr} ; (str lr , [sp , #-4]!)

14: e1a01000 mov r1, r0

18: 0a000008 beq 40 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x40 >

1c: e592e000 ldr lr, [r2]

20: e1a0c003 mov ip, r3

24: e59c3010 ldr r3, [ip, #16]

28: e153000e cmp r3, lr

2c: a1a0100c movge r1, ip

30: b59cc00c ldrlt ip, [ip, #12]

34: a59cc008 ldrge ip, [ip, #8]

38: e35c0000 cmp ip, #0 ; 0x0

3c: 1afffff8 bne 24 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x24 >

40: e1500001 cmp r0, r1

44: 0a000003 beq 58 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x58 >

48: e5922000 ldr r2, [r2]

4c: e5913010 ldr r3, [r1, #16]

50: e1520003 cmp r2, r3

54: a1a00001 movge r0, r1

58: e49de004 pop {lr} ; (ldr lr , [sp], #4)

5c: e12fff1e bx lr

Listing 11.2: ARM disassembly (GCC 4.2.3)


00000000 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >, std::allocator <int > >::find (int

const &)>:

0: 80 03 00 08 lwz r0 ,8( r3)

4: 39 43 00 04 addi r10 ,r3 ,4

8: 2f 80 00 00 cmpwi cr7 ,r0 ,0

c: 7d 43 53 78 mr r3,r10

10: 41 9e 00 38 beq - cr7 ,48 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x48 >

14: 81 64 00 00 lwz r11 ,0( r4)

18: 7c 09 03 78 mr r9,r0

1c: 48 00 00 14 b 30 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x30 >

20: 7d 23 4b 78 mr r3,r9

24: 81 29 00 08 lwz r9 ,8( r9)

28: 2f 89 00 00 cmpwi cr7 ,r9 ,0

2c: 41 9e 00 1c beq - cr7 ,48 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x48 >

30: 80 09 00 10 lwz r0 ,16( r9)

34: 7f 80 58 00 cmpw cr7 ,r0,r11

38: 40 bc ff e8 bge - cr7 ,20 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x20 >

3c: 81 29 00 0c lwz r9 ,12( r9)

40: 2f 89 00 00 cmpwi cr7 ,r9 ,0

44: 40 9e ff ec bne+ cr7 ,30 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x30 >

48: 7f 83 50 00 cmpw cr7 ,r3,r10

4c: 4d 9e 00 20 beqlr cr7

50: 80 04 00 00 lwz r0 ,0( r4)

54: 81 23 00 10 lwz r9 ,16( r3)

58: 7f 80 48 00 cmpw cr7 ,r0,r9

5c: 4c 9c 00 20 bgelr cr7

60: 7d 43 53 78 mr r3,r10

64: 4e 80 00 20 blr

Listing 11.3: PPC disassembly (GCC 4.1.1)

00000000 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >, std::allocator <int > >::find (int

const &)>:

0: 4e56 0000 linkw %fp ,#0

4: 2f0a movel %a2 ,%sp@ -

6: 206e 0008 moveal %fp@ (8) ,%a0

a: 246e 000c moveal %fp@ (12) ,%a2

e: 2268 0006 moveal %a0@ (6) ,%a1

12: 5488 addql #2,%a0

14: 2208 movel %a0 ,%d1

16: 4a89 tstl %a1

18: 6712 beqs 2c <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x2c >

1a: 2012 movel %a2@ ,%d0

1c: b0a9 0010 cmpl %a1@ (16) ,%d0

20: 6e1a bgts 3c <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x3c >

22: 2049 moveal %a1 ,%a0

24: 2269 0008 moveal %a1@ (8) ,%a1

28: 4a89 tstl %a1

2a: 66f0 bnes 1c <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x1c >

2c: b288 cmpl %a0 ,%d1

2e: 6708 beqs 38 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x38 >

30: 2452 moveal %a2@ ,%a2

32: b5e8 0010 cmpal %a0@ (16) ,%a2

36: 6c0e bges 46 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x46 >

38: 2041 moveal %d1 ,%a0

3a: 600a bras 46 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x46 >

3c: 2269 000c moveal %a1@ (12) ,%a1

40: 4a89 tstl %a1

42: 66d8 bnes 1c <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x1c >

44: 60e6 bras 2c <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x2c >

46: 2008 movel %a0 ,%d0

48: 245f moveal %sp@ +,%a2

4a: 4e5e unlk %fp

4c: 4e75 rts

Listing 11.4: 68K disassembly (GCC 3.4.6)


00000000 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >, std::allocator <int > >::find (int

const &)>:

0: 55 push %ebp

1: 89 e5 mov %esp ,%ebp

3: 57 push %edi

4: 56 push %esi

5: 53 push %ebx

6: 8b 45 0c mov 0xc(%ebp) ,%eax

9: 8b 75 08 mov 0x8(%ebp) ,%esi

c: 8b 7d 10 mov 0x10 (%ebp),%edi

f: 8b 50 08 mov 0x8(%eax) ,%edx

12: 8d 58 04 lea 0x4(%eax) ,%ebx

15: 89 d9 mov %ebx ,%ecx

17: 85 d2 test %edx ,%edx

19: 74 1b je 36 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x36 >

1b: 89 d0 mov %edx ,%eax

1d: 8b 17 mov (%edi),%edx

1f: eb 09 jmp 2a <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x2a >

21: 89 c1 mov %eax ,%ecx

23: 8b 40 08 mov 0x8(%eax) ,%eax

26: 85 c0 test %eax ,%eax

28: 74 0c je 36 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x36 >

2a: 39 50 10 cmp %edx ,0x10 (%eax)

2d: 7d f2 jge 21 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x21 >

2f: 8b 40 0c mov 0xc(%eax) ,%eax

32: 85 c0 test %eax ,%eax

34: 75 f4 jne 2a <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x2a >

36: 39 cb cmp %ecx ,%ebx

38: 74 07 je 41 <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x41 >

3a: 8b 07 mov (%edi),%eax

3c: 3b 41 10 cmp 0x10 (%ecx),%eax

3f: 7d 0b jge 4c <std::_Rb_tree <int , int , std::_Identity <int >, std::less <int >,

std::allocator <int > >::find (int const &)+0x4c >

41: 89 1e mov %ebx ,(% esi)

43: 89 f0 mov %esi ,%eax

45: 5b pop %ebx

46: 5e pop %esi

47: 5f pop %edi

48: 5d pop %ebp

49: c2 04 00 ret $0x4

4c: 89 0e mov %ecx ,(% esi)

4e: 89 f0 mov %esi ,%eax

50: 5b pop %ebx

51: 5e pop %esi

52: 5f pop %edi

53: 5d pop %ebp

54: c2 04 00 ret $0x4

Listing 11.5: X86 disassembly (GCC 4.2.3)


Type                 AEMB   ARM   PPC   68K   X86
Arithmetic            33%   25%   27%   19%   13%
  Comparison          22%   83%   86%   83%  100%
  Addition            67%   17%   14%   17%    0%
  Subtraction         11%    0%    0%    0%    0%
Branch                33%   17%   31%   28%   20%
  Conditional         67%   75%   75%   67%   67%
  Unconditional       33%   25%   25%   33%   33%
Memory                33%   38%   27%   34%   29%
  Load                78%   89%  100%   91%   77%
  Store               22%   11%    0%    9%   23%
Miscellaneous          0%   21%   15%   19%   38%
Instruction Count      27    24    26    31    44

Table 11.1: Code profile for std::set::find()

Looking at the averages and range for each category across architectures: memory

operations account for 32.2±5.5%, arithmetic operations account for 23.4±10% and

branch operations account for 25.8±8%. This tells us that the bulk of the code is taken

up by memory operations, and memory operations are also the most expensive in terms

of time cost. Therefore, it is safe to compare the performance of these architectures

using memory operations alone.
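
For reference, those averages and ranges follow directly from the rows of Table 11.1; the short sketch below shows the calculation used here (mean, plus or minus half of the max-to-min range).

#include <algorithm>
#include <cstdio>
#include <vector>

// Mean and half-range (±) of one instruction category across architectures.
static void summarise(const char* name, const std::vector<double>& v) {
    double sum = 0.0;
    for (double x : v) sum += x;
    const auto [lo, hi] = std::minmax_element(v.begin(), v.end());
    std::printf("%s: %.1f%% +/- %.1f%%\n", name, sum / v.size(), (*hi - *lo) / 2.0);
}

int main() {
    // Percentages for AEMB, ARM, PPC, 68K and X86 taken from Table 11.1.
    summarise("memory",     {33, 38, 27, 34, 29});
    summarise("arithmetic", {33, 25, 27, 19, 13});
    summarise("branch",     {33, 17, 31, 28, 20});
    return 0;
}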

Looking at the profile of the AEMB in comparison with the rest, it exhibits a memory

profile (33%) that is similar to the average (32.2%) across architectures. Furthermore,

memory operations are the most consistent across architectures as they have the smallest

range (±5.5%) between the largest and smallest values. The ARM has about 5/33 = 15%

more memory operations and the PPC has about 6/33 = 18% fewer memory operations.

This difference cannot turn Cup = 3.43 into Cup ≤ 1.0.

Therefore, on the basis of memory operations, it is safe to assume that the AEMB

architecture is neither significantly slower nor faster than that of other processor archi-

tectures. Any performance numbers gained through the use of accelerator units when

compared with this host processor, will be broadly similar with a different host processor

architecture.

Using higher order processors that include many complex functions such as multiple

cores, large caches, and branch prediction will improve the performance of the host

processor for search applications, but at the cost of higher complexity. Accelerator

units are highly specialised but are simpler in concept and implementation. Then, the

issue becomes one of choosing between cost and performance of a complex processor

versus a simple accelerator unit.


11.3 Scalability

The next issue that needs to be addressed is whether this acceleration is scalable. Al-

though the accelerators can accelerate a single search thread, it would be more useful

if it was possible to accelerate N search threads in hardware. It should be obvious that

there are a number of bottlenecks in the system and these will be the limiting factors

on scalability.

11.3.1 Processor Scalability

Although the search acceleration is performed by the accelerator units, a processor

bottleneck exists at two points: the available communications bandwidth between the

host processor and the accelerator units, and the processing bandwidth, or the ability

of the processor to allocate search threads to the different accelerator units. These are

two separate issues that can be considered together as the latter is partly dependent on

the former.

Each accelerator unit comes with an individual host processor interface, which is

used by the host processor to send configuration information, receive status information,

put data items into the input buffers and get data items out of the output buffers.

Depending on how each accelerator is used in the search pipeline, it may only take a few

transactions to configure and retrieve a single result, or it may take a large number of

transactions to configure and stream data items in and out of the accelerator unit. Hence,

the communication requirements for each accelerator unit and pipeline are different

depending on application.

The host processor interface in the prototype was configured as a shared bus. The

reason this was used is that the accelerators were tested individually. Hence,

this shared bus was essentially a dedicated bus for a specific accelerator unit, one at

a time. This ensures that in each situation, the entire bandwidth is available to the

application kernel. A shared bus was perfectly suited to prototyping, but is not suitable

for real-world applications.

Therefore, the way to scale up the communication requirements would be to adopt

a different communication architecture. The host processor interface could be changed

into a packet based interface [SSTN03, LZJ06, Art05] running on a number of different

non-bus layouts to increase the number of channels for traffic. Alternatively, it is possible

to split up the host processor interface into two separate interfaces: a low traffic one for

configuration, and a high traffic one for streaming data. This will further improve the

use of available bandwidth.

Another potential bottleneck is the processing bandwidth. The research prototype

uses a single RISC processor core. As is evident from the simulation results of the


different accelerator units, the accelerator units are generally able to work faster than

the processor core and able to generate useful results at a faster rate than can be

consumed by the software kernel. Therefore, another potential bottleneck in the search

pipeline is the search issue rate and result consumption rate of the software.

Again, the most obvious way to increase processing bandwidth is to increase par-

allelism, whether at the fine or coarse grain level. There are a number of different

ways[Wal95, TEL95] to do this such as increasing pipeline depths, hardware threads

and processing cores. However, all these options are contingent upon an increase in host

communications bandwidth. Otherwise, the communications bandwidth will present the

major bottleneck.

11.3.2 Accelerator Scalability

Besides the physical cost constraints, the other problem that will affect the scalability of

the accelerator units is the inter-accelerator communications, which involves bandwidth

capacity and routing issues.

Inter-accelerator bandwidth is only used to stream data from one accelerator unit to

another, in a point-to-point fashion. The term streaming is used because it reflects what

happens in hardware. The output buffers of a transmitting unit are directly connected to

the input buffers of a receiving unit. These buffers are FIFOs and the control signals are

crossed in order to provide hardware flow control, which gives a maximum bandwidth

of 3.2Gbps at 100MHz for each channel.

This bandwidth is more than necessary for each accelerator unit, which can only

generate results at the maximum rate of 1.6Gbps per channel. Furthermore, this inter-

accelerator communication bypasses the host processor and is not affected by processor

scalability. Hence, the inter-accelerator communication has plenty of room to spare and

is scalable, subject to physical and architectural constraints.

The larger issue is that of routing data streams that are not connected directly. Rout-

ing can be achieved either by software routing or dedicated hardware routers. Software

routing would suffer from the processor scalability issues highlighted earlier while dedi-

cated hardware routing suffers from a number of physical constraints. Even if physical

constraints are discounted, hardware routing would still present issues on an architec-

tural level.

These problems stem from the fact that the order of complexity for a hardware

router tends to increase exponentially with the number of nodes. There are different

methods for reducing this problem by constraining the number of physical routes while

still providing the ability to route from one node to another node. However, for the

search pipeline, the number of physical routes can be further reduced by considering

the fact that not all units need to have access to every other unit. For example, it is


unlikely that results from a chaser unit need to be routed to another chaser unit or sieve

unit. Furthermore, the routes are uni-directional ones as data flows from one stage to

the next.

However, the best way to work around this problem is to remove the necessity for

routing altogether. This can be done by using the static routing architecture as proposed

in section 10.1.2. This will both reduce architectural and physical constraints on the

routing and due to the well defined nature of search, this architecture can still be used to

solve most common search problems. Therefore, architecturally speaking, it is possible

to scale up the performance of the accelerators by replicating multiple accelerator units

in fixed chains.

The only issues that will limit this are physical constraints. While area size is

definitely a limiting factor, it is less of a major concern. The main physical constraint

will be on the layout of the interconnects that form the host processor channel. These

long lines will become significant as the number of accelerator units go up and will at

least affect the overall speed of the system. Although there can be some creativity in

laying out these lines and the processor cores, this will ultimately limit the scalability

of the accelerator units.

11.3.3 Memory Scalability

Memory scalability is also a major bottleneck for a search application. We have shown

that the memory bandwidth requirements for each accelerator unit are fairly high and

that the use of cache memory will not help much in hiding this problem. Therefore,

memory will prove to be the ultimate bottleneck in implementation scalability. A mem-

ory bottleneck exists at two potential locations: the actual memory bandwidth available,

and the memory contention between the accelerator units. The one positive note is that

memory technology is constantly improving [Lin08, CW08] and that will help alleviate

the bottleneck.

The actual memory bandwidth available can be increased by using faster memory

technologies and multi-channel memory. While faster memory technology is able to

retrieve more data from memory in each clock cycle and thus increase bandwidth, it

usually comes at the cost of higher latency[Woo08]. Due to the random nature of data

access for search applications, this higher latency may prove to be a hidden problem,

which can still be hidden by using higher speed memory.

Multi-channel memory[ZZ05] can increase the bandwidth by accessing multiple mem-

ory locations at a time and is seeing increased use in consumer level computing. This

mainly involves striping different memory locations across different memory modules

accessed through separate memory channels. This is a scalable solution on an architec-

tural level but would end up consuming extra I/O and board space on a physical level,


which will again limit the scalability. Therefore, although there are work-arounds for the

different scalability issues, physical limitations will ultimately limit memory scalability.
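
To make the striping idea concrete, the sketch below maps an address to a memory channel and a channel-local address. The channel count and stripe size are arbitrary example values and are not parameters of the prototype.

#include <cstdio>

// Illustrative address striping: consecutive stripes of the address space
// are served by alternating memory channels.
struct ChannelAddress {
    unsigned channel;           // which memory channel serves this address
    unsigned long long local;   // address within that channel
};

ChannelAddress stripe(unsigned long long addr, unsigned channels,
                      unsigned long long stripe_bytes) {
    const unsigned long long idx = addr / stripe_bytes;
    return {static_cast<unsigned>(idx % channels),
            (idx / channels) * stripe_bytes + addr % stripe_bytes};
}

int main() {
    // Example only: 4 channels, 64-byte stripes.
    for (unsigned long long a : {0ULL, 64ULL, 128ULL, 200ULL}) {
        const ChannelAddress ca = stripe(a, 4, 64);
        std::printf("addr %3llu -> channel %u, local %llu\n", a, ca.channel, ca.local);
    }
    return 0;
}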

11.4 Acceleration Cost

The term cost needs to be defined for the purpose of deriving the acceleration per unit

cost. In the case of this accelerator design, cost is measured in terms of monetary cost to

produce the desired result. Some of the basic values have been calculated in section 10.5.

However, these should be consolidated into specific values for specific configurations.

A number of boundary conditions need to be assumed before the costs can be esti-

mated.

Implementation Technology is assumed to be an off-the-shelf FPGA. With present

technology, this will have a potential I/O clock of about 200MHz using a high

speed FPGA. The FPGA is assumed to have the necessary I/O connections to

communicate with the outside world at that speed. The accelerator is assumed to

be closely coupled to the host processor and implemented as a bridge device.

Memory Technology is assumed to be regular DDR2 memory. With an I/O clock of

200MHz, this will limit the memory technology to DDR2-400 technology, which

has a maximum bandwidth of 25.6Gbps¹. The host processor consumption of this

memory bandwidth is assumed to be negligible.

Host Communication Interface is assumed to connect directly to an x86 processor

via HyperTransport. With an I/O clock of 200MHz, this has a maximum commu-

nication bandwidth of 12.8Gbps² in one direction. The aggregated bandwidth will

not be used as it assumes a 50:50 bi-directional communication ratio. This band-

width is more than sufficient to handle data streams coming in at the maximum

rate from memory.

11.4.1 Configuration A

Using memory as the limiting factor, the absolute maximum number of chaser units and

streamer units can each be easily computed.

Mmax = (25.6Gbps / 2.13Gbps) × (100MHz / 200MHz) = 6.009 ≈ 6

Cmax = (25.6Gbps / 1.28Gbps) × (100MHz / 200MHz) = 10

¹200MHz × 2 transfers per clock × 64 bits = 25.6Gbps
²200MHz × 2 transfers per clock × 32 bits = 12.8Gbps


Assuming that each chaser is directly paired with a streamer and they are configured

in a simple query pipeline:

Mmax = Cmax = (25.6Gbps / (2.13Gbps + 1.28Gbps)) × (100MHz / 200MHz) = 3.75 ≈ 4

This maximum figure is rounded up for two reasons: the sieve unit works better in even

channel pairs; and the chaser will not consume the maximum bandwidth as it has to be

reconfigured between searches.

Assuming that each streamer channel pair is then connected to a sieve unit and

configured in a simple boolean query pipeline:

Vmax = 2

The total resource consumption of the accelerator units can be computed from equa-

tion 10.1:

Qlut = 515NC + 357NM + 634NV + 126 = 4882

This will be able to fit inside a medium to large FPGA and will not fit into smaller ones.

Any additional resources can be used to implement additional sieve units to enable more

complicated queries.

Using the figures from Table 10.2, the approximate monetary cost of such an accelerator

will be:

Kfpga = £0.88NC + £0.59NM + £1.08NV ≈ £8.04

Under these conditions, the accelerator unit will be able to accelerate entirely in

hardware:

• Four parallel simple queries.

• Two parallel boolean queries, each with two streams and one operand.

• A combination of the above.

11.4.2 Configuration B

Assuming that the ratio of chasers to streamers is 1:2.5 as suggested in section 9.1.2 and

they are configured with dynamic software routing instead:

Cmax = (25.6Gbps / (2.5 × 2.13Gbps + 1.28Gbps)) × (100MHz / 200MHz) = 1.938 ≈ 2

Mmax = 2.5 × Cmax ≈ 5


This results in an odd number of streamer units. While not very homogeneous, this

oddity is acceptable if other types of queries are considered such as boolean queries with

three streams and two operands.

Assuming that each streamer channel pair is connected to a sieve unit while the odd

channel is joined with the output of an existing boolean sieve unit:

Vmax = 3

The total resource consumption of the accelerator units can be computed from equa-

tion 10.1:

Qlut = 4843

Using the figures from Table 10.2, the approximate monetary cost of such an accelerator

will be:

Kfpga ≈ £7.95

Under these conditions, the accelerator unit will be able to accelerate with some

software assisted routing:

• Five parallel simple queries.

• Two parallel boolean queries, one with two streams and one operand, the other

with three streams and two operands.

• A combination of the above.

This configuration is more versatile than the earlier configuration and at a similar

cost. As the absolute maximum number of streamer units that can be implemented is

six, this configuration also represents the maximum practical number of streamers that

can be implemented for a search pipeline.
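
Both configurations can be re-checked with a short calculation. The sketch below applies the LUT coefficients of equation 10.1 and the FPGA unit costs of Table 10.2 to the unit counts chosen above; nothing beyond those quoted figures is assumed.

#include <cstdio>

// FPGA resource and cost model for a mix of accelerator units, using the
// LUT coefficients of equation 10.1 and the per-unit costs of Table 10.2.
struct Mix { int chasers, streamers, sieves; };

int luts(const Mix& m) {
    return 515 * m.chasers + 357 * m.streamers + 634 * m.sieves + 126;
}

double cost_gbp(const Mix& m) {
    return 0.88 * m.chasers + 0.59 * m.streamers + 1.08 * m.sieves;
}

int main() {
    const Mix configA{4, 4, 2};  // four simple query pipelines, two sieves
    const Mix configB{2, 5, 3};  // 1:2.5 chaser-to-streamer ratio, three sieves
    std::printf("A: %d LUTs, ~GBP %.2f\n", luts(configA), cost_gbp(configA));
    std::printf("B: %d LUTs, ~GBP %.2f\n", luts(configB), cost_gbp(configB));
    return 0;
}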

11.4.3 Configuration Comparisons

There are a couple of comparisons that can be made from the above estimates. While the

cost of implementing a more complex configuration is similar to the cost of implementing

a simpler configuration, the complex configuration is capable of handling a different

range of search pipelines. Therefore, it is possible to implement different combinations of

pipelines for different applications, within the bounds of the absolute maximum number

of accelerator units. Although the number of sieve units was chosen to be a minimum,

a number of additional sieve units can be added to form more complex pipelines that

form more complicated boolean queries in addition to routing.

Another thing to note is that, although CMOS implementations have a fundamen-

tally faster clock speed, the maximum number of units is fundamentally bound by


memory bandwidth and is exactly the same as that for FPGAs because the bandwidth

requirements scale linearly with the core clock when paired with faster memory. There-

fore, the number of pipelines that can actually be accelerated in a CMOS implementation

is similar to that of the FPGA. The only difference is that they can run at a much faster

clock rate and complete more search operations per unit time.

One way to increase the number of search pipelines is by increasing the memory

bandwidth through the different methods suggested in section 11.3.3. This is ultimately

the bottleneck in any search system. Another way to increase it is to redistribute the

usage of the existing memory bandwidth, as shown between the two configurations

above. However, this can only help improve the situation slightly.

11.5 Alternative Technologies

There are, of course, other possible ways of achieving a performance boost for search

applications. A dedicated hardware search accelerator may, or may not, be the best

solution for an application. Therefore, it is prudent to look at the solution presented

in this research against the many alternative solutions, which will show the advantages and disadvantages that the present solution has against the rest. It is also a good time

to have another look at the search stack presented in Figure 2.1.

11.5.1 Improved Software

The simplest way to achieve any search acceleration would be to replace search algo-

rithms with alternatives at the primary and secondary search layers. This would allow

the search operation to be accelerated with the minimum amount of problems and cost.

A quick search will reveal that there is plenty of ongoing research in the area of ap-

plication specific search algorithms. As this is a pure software alternative, it does not

compete directly with the hardware accelerator alternatives and can actually exist in

parallel with hardware options. By providing an accelerated hardware layer that helps

in performing common operations, the hardware can potentially benefit a wide variety

of software search algorithms.

Another way to improve search without affecting any hardware is to improve the data

structures used. The choice of one data structure over another can affect the performance

of algorithms significantly[Knu73]. There is also ongoing research in the area of esoteric

data structures for use with newer algorithms. The search accelerator presented here

should not be considered the best solution for the problem. It is important to exhaust

possible software alternatives in addition to any other hardware alternative.


11.5.2 Content-Addressable Memories

The content-addressable memory (CAM) compares input search data against a table of

stored data and returns the address of the matching data[PS06a]. CAMs have a single

clock cycle throughput making them faster than other hardware-based and software-

based search systems. However, the speed of a CAM comes at the cost of increased silicon

area and power consumption. Most CAMs are implemented using expensive SRAM cells

instead of DRAM cells. Furthermore, a typical CAM cell is half the capacity of an SRAM

cell, which further exacerbates the problem. Due to this, the largest CAMs are only

about 18Mbit in size[PS06a]. Therefore, it is a fairly expensive hardware solution to the

search problem in terms of power and area.

A binary CAM performs only exact-match searches, while a more powerful ternary

CAM allows pattern matching with the use of “don’t cares”[ACS03]. Don’t cares act as

wildcards during a search and are particularly attractive for implementing longest-prefix-

match searches in routing tables. This makes it more suitable for performing attribute

search rather than comparison searches (section 2.2). Although it can be argued that

the CAM can replace a chaser unit for equality searches, it is less able to replace it for

comparison searches such as greater-than or less-than searches. Therefore, while there

is a place for CAMs in hardware search acceleration, they are better suited to a different

class of search operations than the solution presented here.

11.5.3 Multicore Processors

The most obvious method of accelerating search, as readily suggested in [Sto90], is by

performing search operations in parallel on multiple processors. This accelerates search

at the host processor layer. Most popular multi-processors used today are homogeneous

multi-processors such as those employed in x86 processors. Symmetric multi-processing

has been the mainstay in general purpose computing acceleration over the years.

The heterogeneous multi-processor system presented in this research is an alternative

method to accelerate applications. Both methods suffer from the same restrictions

and pitfalls caused by limited memory bandwidth. However, it has been conclusively

[KTJR05] demonstrated that heterogeneous multiprocessor systems are more efficient

than homogeneous systems from several perspectives.

Using a heterogeneous processor can significantly reduce processor power dissipation.

An increase in power consumption and heat dissipation will typically lead to higher

costs for thermal packaging, fans, electricity, and even air conditioning. To reduce this,

industry currently uses two broad classes of techniques for power reduction: gating-based

and voltage or frequency scaling-based.

Given the low core voltages of around 1 volt, there is very little more that voltage


scaling can do to improve power consumption. Any further significant decrease in voltage

will eat away at the noise margin, reducing the accuracy of the digital signal.

Gating circuitry itself has power and area overhead, limiting its application at the

lowest granularity. This means that power still gets dissipated even when dynamic

blocks are gated off. It is only feasible to use gating techniques at a large block level,

which is where it is principally applied today. This can be easily used for the accelerator

designed here by gating off specific accelerator units and only turning them on when

they are needed in the pipeline.

Given a fixed circuit area, a heterogeneous processor can provide significant advan-

tages. It can match applications to the core that best meets the performance demands

and it can provide an improved area-efficient coverage of various real-world work de-

mands. For the area size devoted to a single additional host processor core, it is possible

to include 1 sieve, 3 streamers and 2 chasers instead. This is a better allocation of

resources than an existing solution of pure homogeneous multicore processors. For a

similar amount of resources, a heterogeneous hardware accelerator can accelerate search

by 5 times, possibly more.

Further reductions in area consumption are possible if the accelerator units are made

to share resources. In the design of the accelerator units, each unit has an individual

ALU unit that is not 100% exploited. The ALU units are only used during less than

50% of the machine states. Therefore, it is certainly feasible to join dual accelerator

units to share the same ALU unit. Others [SKV+06] have shown that sharing ALU

units for general purpose processor can reduce area consumption by almost 20%. In the

case of the accelerator units, the reduced area consumption will be significantly more as

the bulk of the unit is made up of the ALU device itself.

11.5.4 Data Graph Processors

Another class of heterogeneous processor that can be feasibly used to accelerate search

is the class of graph processors [MHH02, NK04]. This provides an alternative to the

accelerator unit layer. These processors work at an associative level by reconfiguring

hardware to build the data structures physically instead of virtual data structures in

memory. This has the advantage of manipulating data structures in hardware, which

can be very fast.

It has been described elsewhere [CLRS01] that a tree is merely a specific represen-

tation of a generic graph. Therefore, a graph processor can definitely represent and

accelerate any tree functions. In fact, this is a very interesting class of processor as it

attempts to represent complex data structures in hardware.

This suggests that since primitive data structures such as stacks and queues are

already explicitly implemented in hardware, there is no reason why more complex data


structures cannot be similarly treated. One can certainly appreciate the logic behind

this and understand the benefits that come from representing data structures in hard-

ware, which would allow the data structures to be quickly searched and easily manipulated, bypassing the limitations of memory bandwidth by simply not using much memory.

However, hardware data structures suffer from the very physical limitations of hard-

ware - the cost of hardware would grow exponentially in proportion to the problem set.

The way to reduce this cost is to partly move the problem up to the software domain by

swapping data graphs between memory and processor. But this will reduce many of the

benefits associated with implementing complex data structures in hardware. Therefore,

although exciting, the real-world use of such a processor is limited.

Furthermore, in search applications, only a subset of the data structure is usually

needed. Although the graph processor will likely defeat any other method of graph

traversal, a search has been shown earlier to be a linear traversal and will not benefit

much from a graph processor.

11.5.5 Other Processors

There are many other types of hardware accelerators in use in the real world today,

including media processors and physics processors. These processors provide alternatives

at the hardware layers. These processors are typically used for computationally intensive

operations and are not necessarily suited nor efficient for search applications.

For example, the CELL[GHF+05, CHI+05] processor is capable of performing com-

pares and data ordering, which would allow it to perform a sieve operation. It has a

load-store unit that connects to a high bandwidth memory interconnection architecture

to supply it with the data it needs at a raw rate of 60 GB/s. Therefore, it can definitely

be programmed to perform a search operation. However, it is a little too costly for

search applications especially with its very high power consumption.

11.6 Suggestions for Future Work

The research prototype was designed in such a way as to facilitate the collection of data.

This means that it was designed as a stripped-down design, so that it can be assembled

in different configurations and tested under various conditions. To accommodate this,

many helpful assumptions about the usage of the accelerators could not be incorporated.

Therefore, there is still room to improve the design, particularly in the optimisation of

resources used.


11.6.1 Conjoining Arithmetic Units

Care was taken while designing each individual accelerator unit to optimise the use of

resources. For example, the chaser unit uses the same ALU to perform calculations on

the data and next pointer locations. Both these operations perform the addition of the

node pointer with a static offset value but at different times. Studying figures 5.5, 6.5,

7.4 and their respective descriptions, the use of the ALU has been interleaved in each

accelerator.

While the ALU has already been interleaved within each accelerator, there were still

clock cycles where the ALU was not in use. For example, the ALU unit for a chaser

unit is only used to add pointers during the NULL and NEXT states. The entire main

machine loop has five states in it, which means that the ALU is not in use for the other

three states. A second chaser unit, could feasibly share the same ALU and interleave its

calculations as well. This sort of conjoining of resources can also be applied elsewhere

in the design of other accelerator units.

While use of a single ALU may not result in massive savings of resources, the ALU is

one of the larger blocks of an accelerator unit (the other being memory). If a significant

number of accelerator units are used in an application, the number of ALU units saved

would become significant. Therefore, this is definitely one form of design optimisation

that should be undertaken for real-world implementations.

11.6.2 Conjoining Stream Buffers

The accelerator units used have individual input and output data buffers that are de-

signed as hardware FIFOs. These FIFOs contain a multi-port memory block, which

is typically expensive in terms of area cost[VST03b, VST03c, VST03a]. One port is

connected to the internal accelerator unit while the other port is used externally. As

the number of accelerator units increases, the number of memory blocks used increases

linearly.

The obvious way to reduce the cost of memory is to reduce the size of the FIFOs.

However, this method has limited advantages as memory blocks tend to come in fairly

standard sizes. Furthermore, the increase in chip area with respect to memory size is

non-linear. Simply reducing the size of the stream buffers will not save a significant

area. Halving the memory capacity of a block results in a block that is still more than

half of the original area size. In fact, the size of the FIFO used in the research prototype

is already extremely small (15×32bit).

An alternate way to reduce the amount of memory used is to conjoin the stream

buffers. Instead of having a separate output buffer on one accelerator unit and an input

buffer on another, the two buffers can be merged into a single buffer. While a larger


memory block is needed to store the same amount of data, this proves to be an advantage

because a single 2kbit memory block is much smaller in size than two individual 1kbit

memory blocks³, as evident from Table 11.2.

            1kbit   2kbit   4kbit   8kbit   16kbit   32kbit   64kbit
Area (mm²)   0.34    0.48    0.75    1.26    2.23     3.78     7.15
Time (ns)    2.93    3.07    3.33    3.42    3.54     4.41     4.69

Table 11.2: Specifications for 0.35µm CMOS DPRAM blocks
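
As a rough check of this argument, the sketch below compares two separate 1kbit blocks against one merged 2kbit block using the figures in Table 11.2; the resulting percentage is specific to this library and is only indicative.

#include <cstdio>

int main() {
    // DPRAM macro areas from Table 11.2 (0.35um CMOS), in mm^2.
    const double area_1kbit = 0.34;
    const double area_2kbit = 0.48;

    const double separate = 2.0 * area_1kbit;  // output FIFO plus input FIFO
    const double merged   = area_2kbit;        // one conjoined stream buffer

    std::printf("separate buffers: %.2f mm2\n", separate);
    std::printf("merged buffer:    %.2f mm2 (%.0f%% smaller)\n",
                merged, 100.0 * (separate - merged) / separate);
    return 0;
}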

Although this is the easiest way to reduce memory area size, it will complicate

matters for dynamic pipeline architectures. However, it is still possible to route the

data dynamically using software, by treating the buffers as a unified buffer. Data can

still be pumped in at the front end and extracted from the back end via software.

Alternatively, all the buffers could just be considered as either input or output buffers,

but not both, and treated that way instead.

Another way to reduce memory resources is to conjoin the memory blocks them-

selves. However, this can only be done after the accelerator units are paired up and

pipelines are well defined. Similar to the situation for the ALU, the memory block is

only being accessed every other clock cycle on each individual port. Therefore, it is

possible to interleave multiple operations on the same memory port. This can turn

a dual-port memory into a quad-port memory by time division multiplexing memory

operations[SD02, Xil05]. This method is more complicated than merely merging the

buffers but can be used in tandem with resultant additional savings.

11.6.3 Memory Interface

For the research prototype, all memory access goes through a central memory arbiter

that handles transactions in a round-robin manner. This is fine for the research pro-

totype because the accelerator units are tested one at a time, which means that each

accelerator unit actually has full access to the memory. In real-world applications, this

central arbiter would prove to be a bottleneck to performance and would consume sig-

nificant resources as the number of accelerator units increases. Some other alternatives

that can be used instead were briefly mentioned in section 11.3.3.

However, the different methods mentioned mainly deal with generic memory access

by multiple masters, typically a number of processor cores. The memory access patterns

for each accelerator unit are not totally random and generic. Each accelerator unit

is controlled by a finite-state-machine that performs memory transactions at periodic

intervals. If this is taken into account, it is possible to design a memory interface that knows in advance when to process the transactions. This will result in a simpler design.
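A minimal sketch of such an interface is shown below. Instead of arbitrating round-robin, a small grant generator walks a precomputed slot table that would be programmed to match the periodic memory phases of each accelerator unit's finite-state machine. The table format and all names are assumptions made for illustration rather than part of the prototype.

// Illustrative sketch only: a memory interface that grants access on a
// fixed, precomputed schedule rather than arbitrating round-robin. The
// slot table is assumed to be programmed to match the periodic memory
// phases of each accelerator unit's state machine; names are hypothetical.
module slot_scheduler #(parameter UNITS = 4, parameter SLOTS = 8) (
  input                    clk, rst,
  input  [UNITS*SLOTS-1:0] slot_table, // one-hot unit id per slot, flattened
  output reg [UNITS-1:0]   grant       // one-hot grant for the current slot
);
  reg [$clog2(SLOTS)-1:0] slot;

  always @(posedge clk) begin
    if (rst) begin
      slot  <= 0;
      grant <= {UNITS{1'b0}};
    end else begin
      // Grants follow the schedule, so no per-cycle arbitration is needed.
      grant <= slot_table[slot*UNITS +: UNITS];
      slot  <= slot + 1'b1;   // wraps naturally when SLOTS is a power of two
    end
  end
endmodule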

CHAPTER 12

Conclusion

This dissertation has proposed a solution to the search problem. Search is a fundamental

problem in computing and as computers are increasingly invading our everyday lives,

search is also becoming an everyday problem for everyone. Historically, search has

received less attention than other computing problems.

Search was first divided into different categories and characterised. In order to visualise the different components involved in a search, a novel search stack was developed.

This stack links the different hardware and software components of a complex search

operation together. It also serves to illustrate how search can be accelerated at different

layers using alternative technologies.

Furthermore, a generic search was broken down into a three-stage search pipeline.

Each stage can then be individually accelerated by different types of accelerator units

as they are characterised by very different operations and problems. The accelerator

units form fundamental building blocks that are only capable of performing one task and

performing it efficiently. They can be used on their own to offload some fundamental

tasks from the host processor.

The use of accelerator units gives added flexibility to the overall accelerator design.

On top of these unit tasks, complex search acceleration can be built. The solution

presented here is novel in that these accelerator units can be combined like LEGO

bricks to solve various complex search problems. Different numbers and configurations of

accelerator units can be used together to form various pipelines for performing different

types of search, depending on the specific application.

In order to investigate the performance of these units, simulation was heavily used.

Initially, a single iteration of a complex search simulation took days to run. The bulk

of this time was used by the data set preparation process, which is O(N log N) bound.

To speed this up, a novel simulation method was developed. It involved freezing the prepared simulation data onto a disk file using Verilog constructs, so that the data could be reused across multiple simulation runs.

While the bulk of the work done is related to hardware design, a large part of it was

software focused. A number of search kernels were written to compare the performance

of hardware acceleration against a pure software operation. These kernels were written

in C++, exploiting the Standard Template Library (STL) to use optimised algorithms

and data structures. The code was compiled using the optimising GCC compiler to

produce compact and efficient code for testing purposes. Against these software baselines, the accelerators were shown to achieve significant speed-ups over pure software solutions.

The chaser unit was designed to perform key search, which is a primary search and

is typically the first stage of any search pipeline. It is also a very common computer

operation, used by any number of operations including results selection, insertion and

deletion. A multi-key search can be accelerated by up to 3.43 times using this chaser unit

as compared to a pure software operation. However, it does not provide as significant an improvement when used for a single-key search.

The streamer unit was designed to offload the mundane list retrieval task, which is

a supporting task used in different search applications. On its own, it does not speed

up the operation when compared to a pure software operation. However, it works as an

excellent offloader, used to extract data values from fundamental data structures while

freeing up the host processor for other tasks.

The sieve unit was designed to perform result collation, which is a secondary search

task and is typically the last stage of any secondary search pipeline. A number of these

units can be combined to form different types of search operations including list union

and intersection. It is capable of accelerating secondary boolean queries by up to 5.2

times as compared to a pure software operation. In addition to result collation, it can

also be used to buffer and route results from other units.

While memory is a major search bottleneck, increasing the cache size has been shown

to have little overall effect on performance. This method of increasing general purpose

microprocessor performance does not work as well when it comes to search applications.

This can be readily understood from the ephemeral nature of search data. The results show that unless the cache is increased to a size matching that of the data set, there is little benefit in enlarging it.

In order to test a better way of constructing cache, a structural cache, which exploits

structural locality in addition to temporal and spatial locality, was developed. However,

at small sizes, a structural cache only provides a small 3% boost in performance. There-

fore, there is little reason to integrate a structural cache unless the cost can be justified

by the small increase in performance.

The accelerator units are a better solution than either inventing a whole new computing paradigm or designing a new microprocessor. Both of those alternatives, while unique, would bring a whole host of other problems, including incompatibilities with present

tools and platforms. These accelerators can be immediately integrated into existing

computing platforms either as an on-chip bridge, co-processor or I/O device.

The accelerator units were designed for FPGA implementation. With mainstream

microprocessor companies opening up their platforms to hybrid computation initiatives,

this is a potentially easier path for the adoption of this technology. In addition, these

units can also be targeted for ASIC implementation, which will allow them to run at

much higher clock speeds for a higher search throughput rate.

The accelerator design is also scalable. These units are designed to be simple and

small in order to simplify and reduce implementation costs. However, there is still

room for improvement when it comes to resource usage. There are different parts of

the design that can be shared and conjoined to further reduce resource consumption.

However, these optimisations are not dealt with directly here and are left as future work.

The most important end-result of this programme of research has been the iden-

tification and development of a low-cost method of search acceleration. While using

the accelerator is one possible way of accelerating search, there are many other ways

of achieving acceleration. However, the solution presented in this dissertation has the

advantage of being flexible, cheap and fast. It is flexible enough to be adapted for search

applications and other potential uses, while still being small and simple enough to be

integrated onto existing designs at little extra cost.

Bibliography

[20t93] 20th International Symposium on Computer Architecture. A Case for Two-

Way Skewed-Associative Caches, May 1993.

[AB05] Jeff Andrews and Nick Baker. Xbox 360 system architecture. In Hot Chips, number 17, 2005.

[ABB+03] Dave Abrahams, Mike Ball, Walter Banks, Greg Colvin, Hiroshi Fukutomi,

Lois Goldthwaite, Yenjo Han, John Hauser, Seiji Hayashida, Howard Hin-

nant, Brendan Kehoe, Robert Klarer, Jan Kristofferson, Dietmar Kuhl,

Jens Maurer, Fusako Mitsuhashi, Hiroshi Monden, Nathan Myers, Masaya

Obata, Martin O’Riordan, Tom Plum, Dan Saks, Martin Sebor, Bill Sey-

mour, Bjarne Stroustrup, Detlef Vollmann, and Willem Wakker. Technical

report on c++ performance. Technical Report PDTR 18015, ISO/IEC, Au-

gust 2003.

[ACS03] I. Arsovski, T. Chandler, and A. Sheikholeslami. A ternary content-

addressable memory (tcam) based on 4t static storage and including

a current-race sensing scheme. Solid-State Circuits, IEEE Journal of,

38(1):155–158, Jan 2003.

[ADS81] Sudhir K. Arora, S. R. Dumpala, and K. C. Smith. Wcrc: An ansi sparc

machine architecture for data base management. In ISCA ’81: Proceedings

of the 8th annual symposium on Computer Architecture, pages 373–387, Los

Alamitos, CA, USA, 1981. IEEE Computer Society Press.

[Alt07] Altera, Inc. Cyclone III Device Handbook, July 2007.

[Art05] Arteris SA. A Comparison of Network-on-Chip Busses, 2005.

[ASN+99] Shinsuke Azuma, Takao Sakuma, Takashi Nakano, Takaaki Ando, and Kenji

Shirai. High performance sort chip. In Hot Chips, number 11, 1999.

[Bab79] E. Babb. Implementing a relational database by means of specialized hard-

ware. ACM Trans. Database Syst., 4(1):1–29, 1979.

[BDH03] Luiz Andre Barroso, Jeffrey Dean, and Urs Holzle. Web search for a planet:

The google cluster architecture. IEEE Micro, 23(2):22–28, March 2003.

[BGH+92] T. F. Bowen, G. Gopal, G. Herman, T. Hickey, K. C. Lee, W. H. Mansfield,

J. Raitz, and A. Weinrib. The datacycle architecture. Commun. ACM,

35(12):71–81, 1992.

[Bor99] Borland/Inprise. Interbase 6.0 documentation, 1999.

[Bro04] Leo Brodie. Thinking Forth, chapter 4,6,7,8. Creative Commons, 2004.

[BTRS05] Florin Baboescu, Dean M. Tullsen, Grigore Rosu, and Sumeet Singh. A tree

based router search engine architecture with single port memories. In ISCA

’05: Proceedings of the 32nd annual international symposium on Computer

Architecture, pages 123–133, Washington, DC, USA, 2005. IEEE Computer

Society.

[CHI+05] Scott Clark, Kent Haselhorst, Kerry Imming, John Irish, Dave Krolak, and

Tolga Ozguner. Cell broadband engine interconnect and memory interface.

In Hot Chips, number 17, 2005.

[CLRS01] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford

Stein. Introduction to Algorithms. The MIT Press, 2nd edition, 2001.

[CW08] Judy Chen and Fred Ware. The next generation of mobile memory. Presented

at MEMCON’08, July 2008.

[DeW78] David J. DeWitt. Direct - a multiprocessor organization for supporting

relational data base management systems. In ISCA ’78: Proceedings of the

5th annual symposium on Computer architecture, pages 182–189, New York,

NY, USA, 1978. ACM.

[DG92] David DeWitt and Jim Gray. Parallel database systems: the future of high

performance database systems. Commun. ACM, 35(6):85–98, 1992.

[DGG+86] David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Kr-

ishna B. Kumar, and M. Muralikrishna. GAMMA — A high performance

dataflow database machine. In Proceedings of the 12th International Con-

ference on Very Large Data Bases, pages 228–237, 1986.

[Fer60] David E. Ferguson. Fibonaccian searching. Commun. ACM, 3(12):648, 1960.

[FFP+05] Daniel Fallmann, Helmut Fallmann, Andreas Prambock, Horst Reiterer,

Martin Schumacher, Thomas Steinmaurer, and Roland Wagner. Comparison

of the enterprise functionality of open source database management systems,

Apr 2005.

[FH05] Michael J. Flynn and Patrick Hung. Microprocessor design issues: Thoughts

on the road ahead. IEEE Micro, 25(3):16–31, 2005.

[FK93] Shinya Fushimi and Masaru Kitsuregawa. Greo: a commercial database

processor based on a pipelined hardware sorter. In SIGMOD ’93: Proceedings

of the 1993 ACM SIGMOD international conference on Management of data,

pages 449–452, New York, NY, USA, 1993. ACM.

[FKS97] Terrence Fountain, Peter Kacsuk, and Dezso Sima. Advanced Computer

Architectures: A Design Space Approach, chapter 10-18. Addison-Wesley,

1st edition, 1997.

[FKT86] Shinya Fushimi, Masaru Kitsuregawa, and Hidehiko Tanaka. An overview of

the system software of a parallel relational database machine grace. In VLDB

’86: Proceedings of the 12th International Conference on Very Large Data

Bases, pages 209–219, San Francisco, CA, USA, 1986. Morgan Kaufmann

Publishers Inc.

[Gen04] Paul Genua. A cache primer. Technical report, Freescale Semiconductor,

October 2004.

[GG00] Pierre Guerrier and Alain Greiner. A generic architecture for on-chip packet-

switched interconnections. In DATE ’00: Proceedings of the conference on

Design, automation and test in Europe, pages 250–256, New York, NY, USA,

2000. ACM.

[GHF+05] Michael Gschwind, Peter Hofstee, Brian Flachs, Martin Hopkins, Yukio

Watanabe, and Takeshi Yamazaki. A novel simd architecture for the cell

heterogeneous chip-multiprocessor. IBM, 2005.

[GLS73] George P. Copeland, Jr., G. J. Lipovski, and Stanley Y.W. Su. The archi-

tecture of cassm: A cellular system for non-numeric processing. SIGARCH

Comput. Archit. News, 2(4):121–128, 1973.

[GS04] Brian J. Gough and Richard M. Stallman. An Introduction to GCC, chap-

ter 6. Network Theory Ltd., 2004.

[Han98] Jim Handy. The Cache Memory Book. Academic Press, 2nd edition, 1998.

[HB07] Simon Harding and Wolfgang Banzhaf. Fast genetic programming on GPUs.

In Marc Ebner, Michael O’Neill, Aniko Ekart, Leonardo Vanneschi, and

Anna Isabel Esparcia-Alcazar, editors, Proceedings of the 10th European

Conference on Genetic Programming, volume 4445 of Lecture Notes in Com-

puter Science, pages 90–101, Valencia, Spain, 11 - 13 April 2007. Springer.

[Hea95] Steve Heath. Microprocessor Architectures RISC, CISC and DSP, chapter 8.

Newnes, 2nd edition, 1995.

[Hil88] Mark D. Hill. A case for direct-mapped caches. Computer, 21(12):25–40,

1988.

[Hip07] D. Richard Hipp. The virtual database engine of sqlite, 2007.

[HLW87] Gary Herman, K. C. Lee, and Abel Weinrib. The datacycle architecture for

very high throughput database systems. In SIGMOD ’87: Proceedings of

the 1987 ACM SIGMOD international conference on Management of data,

pages 97–103, New York, NY, USA, 1987. ACM.

[HP96] John L. Hennessy and David A. Patterson. Computer Architecture: A Quan-

titative Approach, chapter 2,5,8,C,E. Morgan Kaufmann, 2nd edition, 1996.

[HS89] Mark D. Hill and Alan Jay Smith. Evaluating associativity in cpu caches.

IEEE Transactions on Computers, 38(12):1612–1630, December 1989.

[Int07] International Business Machines Corporation. Power ISA Version 2.05, Oc-

tober 2007.

[ISH+91] U. Inoue, T. Satoh, H. Hayami, H. Takeda, T. Nakamura, and H. Fukuoka.

Rinda: a relational database processor with hardware specialized for search-

ing and sorting. Micro, IEEE, 11(6):61–70, Dec 1991.

[JED07] JEDEC Solid State Technology Association. JEDEC Standard: Specialty

DDR2-1066 SDRAM, November 2007.

[JED08] JEDEC Solid State Technology Association. JEDEC Standard: DDR2

SDRAM Specification, April 2008.

[Jon05] M. Tim Jones. Optimization in gcc.

http://www.linuxjournal.com/article/7269, January 2005.

[Kan81] Gerry Kane. 68000 Microprocessor Handbook. Osborne/McGraw-Hill, 1981.

[Kan87] Gerry Kane. MIPS R2000 RISC Architecture. Prentice Hall, 1987.

[KG05] Sen M. Kuo and Woon-Seng Gan. Digital Signal Processors: Architectures,

Implementations, and Applications. Pearson Education Inc, 1 edition, 2005.

[Knu69] Donald E. Knuth. The Art of Computer Programming: Fundamental Algo-

rithms, volume 1 of Computer Science and Information Processing. Addison-

Wesley, 2nd edition, 1969.

[Knu73] Donald E. Knuth. The Art of Computer Programming: Sorting and Search-

ing, volume 3 of Computer Science and Information Processing. Addison-

Wesley, 1st edition, 1973.

[Knu81] Donald E. Knuth. The Art of Computer Programming: Seminumerical Algo-

rithms, volume 2 of Computer Science and Information Processing. Addison-

Wesley, 2nd edition, 1981.

[Koo89] Philip J. Koopman. Stack Computers: The New Wave, chapter 1-9,B-C.

Ellis Horwood, 1989.

[Kor87] James F. Korsh. Data structures, algorithms, and program style. PWS

Publishers, 1987.

[KTJR05] Rakesh Kumar, Dean M. Tullsen, Norman P. Jouppi, and Parthasarathy

Ranganathan. Heterogeneous chip multiprocessors. Computer, 38(11):32–

38, 2005.

[KY05] David Kaeli and Pen-Chung Yew, editors. Speculative Execution in High Per-

formance Architectures. Computer and Information Science Series. Chapman

& Hall, 2005.

[LaF06] Eric LaForest. Next generation stack computing, 2006.

[Lan07] Joe Landman. The need for acceleration technologies to achieve cost-effective

supercomputing performance for advanced applications. Technical report,

AMD, 2007.

[LB08] William B. Langdon and Wolfgang Banzhaf. A SIMD interpreter for genetic

programming on GPU graphics cards. In Michael O’Neill, Leonardo Van-

neschi, Steven Gustafson, Anna Isabel Esparcia Alcazar, Ivanoe De Falco,

Antonio Della Cioppa, and Ernesto Tarantino, editors, Proceedings of the

11th European Conference on Genetic Programming, EuroGP 2008, vol-

ume 4971 of Lecture Notes in Computer Science, pages 73–85, Naples, 26-28

March 2008. Springer.

[LCM+06] Damjan Lampret, Chen-Min Chen, Marko Minar, Johan Rydberg, Matan

Ziv-Av, Bob Gardner, Chris Ziomkowski, Greg McGary, Rohit Mathur, and

Maria Bolado. OpenRISC 1000 Architecture Manual. OpenCores.Org, April

2006.

[LFM88] K. C. Lee, O. Frieder, and V. Mak. A parallel vlsi architecture for unfor-

matted data processing. In DPDS ’88: Proceedings of the first international

symposium on Databases in parallel and distributed systems, pages 80–86,

Los Alamitos, CA, USA, 1988. IEEE Computer Society Press.

[Lin08] Joseph Lin. Rambus memory technologies update. www.rambus.com, June

2008.

[LSY02] Ruby Lee, Zhijie Shi, and Xiao Yang. How a processor can permutate n bits

in o(1) cycles. In Hot Chips, number 14, 2002.

[LZJ06] ZhongHai Lu, MingChen Zhong, and Axel Jantsch. Evaluation of on-chip

networks using deflection routing. In GLSVLSI ’06: Proceedings of the 16th

ACM Great Lakes symposium on VLSI, pages 296–301, New York, NY, USA,

2006. ACM Press.

[MA06] MySQL-AB. Mysql 5.1 reference manual, Aug 2006.

[McC07] Ian McCallum. Intel quickassist technology accelerator abstraction layer

(aal). Technical report, Intel, 2007.

[McF06] Grant McFarland. Microprocessor Design. McGraw-Hill, 2006.

[Mer08] Rick Merritt. Cpu designers debate multi-core future. EE-Times, February

2008.

[MHH02] Oskar Mencer, Zhining Huang, and Lorenz Huelsbergen. Hagar: Efficient

multi-context graph processors. In 12th International Conference on Field-

Programmable Logic and Applications, pages 915–924. Springer, 2002.

[Mil00] Veljko Milutinovic. Surviving the Design of Microprocessor and Multiproces-

sor Systems. John Wiley & Sons Inc, 2000.

[MK04] Morris M. Mano and Charles R. Kime. Logic and Computer Design Funda-

mentals, chapter 9,14. Pearson Prentice-Hall, 3rd edition, 2004.

[NK04] Anna Nepomniaschaya and Zbigniew Kokosinski. Associative graph proces-

sor and its properties. In PARELEC ’04: Proceedings of the international

conference on Parallel Computing in Electrical Engineering, pages 297–302,

Washington, DC, USA, 2004. IEEE Computer Society.

[NP08] Wolfgang Nejdl and Raluca Paiu. I know I stored it somewhere - contextual information and ranking on our desktop. 2008.

[Okl01] Vojin G. Oklobdzija. The Computer Engineering Handbook: Electrical En-

gineering Handbook. CRC Press, Inc., Boca Raton, FL, USA, 2001.

[Pay00] Bernd Paysan. A four stack processor, 2000.

[PDG05] PostgreSQL-Development-Group. Postgresql 8.1 documentation, 2005.

[Pel05] Stephen Pelc. Programming Forth, chapter 2,5. Microprocessor Engineering

Limited, 2005.

[PH05] David A. Patterson and John L. Hennessy. Computer Organization and De-

sign: The Hardware/Software Interface, chapter 2,7,9,C,D. Morgan Kauf-

mann, 2005.

[Por] James N. Porter. Five decades of disk drive industry firsts.

http://www.disktrend.com/5decades2.htm.

[PS06a] K. Pagiamtzis and A. Sheikholeslami. Content-addressable memory (cam)

circuits and architectures: a tutorial and survey. Solid-State Circuits, IEEE

Journal of, 41(3):712–727, March 2006.

[PS06b] Kostas Pagiamtzis and Ali Sheikholeslami. Content-addressable memory

(CAM) circuits and architectures: A tutorial and survey. IEEE Journal of

Solid-State Circuits, 41(3):712–727, March 2006.

[Rob78] David C. Roberts. A specialized computer architecture for text retrieval. In

CAW ’78: Proceedings of the fourth workshop on Computer architecture for

non-numeric processing, pages 51–59, New York, NY, USA, 1978. ACM.

[RSK04] Pamela Ravasio, Sissel Guttormsen Schar, and Helmut Krueger. In pursuit

of desktop evolution: User problems and practices with modern desktop

systems. ACM Trans. Comput.-Hum. Interact., 11(2):156–180, 2004.

[Sak02] Dan Saks. Representing and manipulating hardware in standard c and c++.

Embedded Systems Conference San Francisco, 2002.

[SB88] Gerard Salton and Chris Buckley. Parallel text search methods. Communi-

cations of the ACM, 31(2):202–215, Feb 1988.

[SD02] Nick Sawyer and Marc Defossez. Quad-Port Memories in Virtex Devices.

Xilinx Inc, September 2002. XAPP228.

[SF96] Robert Sedgewick and Philippe Flajolet. An Introduction to the Analysis of

Algorithms, chapter 5-8. Addison-Wesley, 1st edition, 1996.

[Shi06] Sajjan G. Shiva. Advanced Computer Architecture. Taylor & Francis, 1st

edition, 2006.

[Sil02] Silicore and Opencores. WISHBONE System-on-Chip (SOC) Interconnect

Architecture for Portable IP Cores, b3 edition, Sept 2002.

[SKV+06] David Sheldon, Rakesh Kumar, Frank Vahid, Dean Tullsen, and Roman

Lysecky. Conjoining soft-core fpga processors. In ICCAD ’06: Proceedings

of the 2006 IEEE/ACM international conference on Computer-aided design,

pages 694–701, New York, NY, USA, 2006. ACM.

[SL75] Stanley Y. W. Su and G. Jack Lipovski. Cassm: a cellular system for very

large data bases. In VLDB ’75: Proceedings of the 1st International Confer-

ence on Very Large Data Bases, pages 456–472, New York, NY, USA, 1975.

ACM.

[SL95] Alexander Stepanov and Meng Lee. The standard template library. Technical

Report 95-11, HP Laboratories, November 1995.

[Smy03] Bill Smyth. Computing Patterns in Strings. Pearson Addison-Wesley, 1st

edition, 2003.

[SSTN03] Ilkka Saastamoinen, David Siguenza-Tortosa, and Jari Nurmi. An ip-based

on-chip packet-switched network. pages 193–213, 2003.

[Sta06] William Stallings. Computer Organization & Architecture: Designing for

Performance, chapter 18. Pearson Prentice-Hall, 7th edition, 2006.

[Ste06] Alexander Stepanov. Short history of stl, August 2006.

[Sto90] Harold S. Stone. High-Performance Computer Architecture. Addison-Wesley,

2nd edition, 1990.

[Str94] Bjarne Stroustrup. The Design and Evolution of C++. Addison-Wesley Pub

Co, March 1994.

[Sun06] Sun Microsystems, Inc. OpenSPARC T1 Microarchitecture Specification,

August 2006.

[Tan04] Shawn Tan. AEMB: 32-bit RISC Microprocessor Core Data Sheet. Open-

Cores.Org, 2004.

[Tan05] Andrew S. Tanenbaum. Structured Computer Organization. Pearson

Prentice-Hall, 5th edition, 2005.

[TDB+06] Xuan-Tu Tran, Jean Durupt, Francois Bertrand, Vincent

Beroulle, and Chantal Robach. A dft architecture for asynchronous networks-

on-chip. In ETS ’06: Proceedings of the Eleventh IEEE European Test Sym-

posium, pages 219–224, Washington, DC, USA, 2006. IEEE Computer Soci-

ety.

[TEL95] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous mul-

tithreading: maximizing on-chip parallelism. In ISCA ’95: Proceedings of

the 22nd annual international symposium on Computer architecture, pages

392–403, New York, NY, USA, 1995. ACM.

[van02] Ruud van der Pas. Memory hierarchy in cache-based systems. Technical

report, Sun Microsystems, November 2002.

[Vir03] Virtual Silicon Inc. Virtual Silicon: 0.13um High Density Standard Cell

Library, 1.2 edition, Aug 2003.

[Vir04] Virtual Silicon Inc. Virtual Silicon: 0.18um VIP Standard Cell Library Tape

Out Ready, 1.0 edition, Jul 2004.

[Vit01] Jeffrey Scott Vitter. External memory algorithms and data structures: deal-

ing with massive data. ACM Comput. Surv., 33(2):209–271, 2001.

[VST03a] Dual-Port SRAM Compiler UMC 0.13um (L130E-HS-FSG), June 2003.

[VST03b] Single-Port SRAM Compiler UMC 0.13um (L130E-HS-FSG), March 2003.

[VST03c] Two-Port SRAM Compiler UMC 0.13um (L130E-HS-FSG), June 2003.

[VST04] Single-Port SRAM Compiler UMC 0.18um (L180 GII), August 2004.

[Wal95] David W. Wall. Limits of instruction-level parallelism. pages 432–444, 1995.

[Wik09a] Wikipedia. Desktop search — wikipedia, the free encyclopedia, 2009. [On-

line; accessed 17-March-2009].

[Wik09b] Wikipedia. Non-uniform memory access — wikipedia, the free encyclopedia,

2009. [Online; accessed 17-March-2009].

[Wik09c] Wikipedia. Stored procedure — wikipedia, the free encyclopedia, 2009. [On-

line; accessed 17-March-2009].

[Wik09d] Wikipedia. Stream processing — wikipedia, the free encyclopedia, 2009.

[Online; accessed 17-March-2009].

[Woo08] Steven Woo. Memory system challenges in the multi-core era. Presented at

MEMCON’08, July 2008.

[Xil04] Xilinx, Inc. Microblaze Processor Reference Guide: EDK6.2i, June 2004.

[Xil05] Xilinx Inc. Using Block RAM in Spartan3 Generation FPGAs, March 2005.

XAPP463.

[Xil08] Xilinx, Inc. Spartan-3A FPGA Family: Data Sheet, April 2008.

[ZZ05] Zhichun Zhu and Zhao Zhang. A performance comparison of dram memory

system optimizations for smt processors. In HPCA ’05: Proceedings of the

11th International Symposium on High-Performance Computer Architecture,

pages 213–224, Washington, DC, USA, 2005. IEEE Computer Society.
