
Thesis Proposal

Pandora: A Knowledge-Encapsulating IP Development Paradigm to Overcome the Complexity Wall in Hardware Design

Michael K. Papamichael

Computer Science Department

Carnegie Mellon University

Pittsburgh, PA 15213

May 2013

Thesis Committee:

James C. Hoe, Chair (Carnegie Mellon University)

Mark Horowitz (Stanford University)

Ken Mai (Carnegie Mellon University)

Todd Mowry (Carnegie Mellon University)

Onur Mutlu (Carnegie Mellon University)

Submitted in partial fulfillment of the requirements

for the degree of Doctor of Philosophy.

Copyright © 2013 Michael K. Papamichael

Abstract

Rapidly increasing transistor counts coupled with recent technology advances, such as 3D stacking, are leading to the development of massive, complex and diverse Systems-on-Chip (SoCs) with tens or even hundreds of interacting modules. Despite the existence of an ever-growing number of rich Intellectual Property (IP) block catalogs that can greatly reduce individual module development effort, time and cost and facilitate reuse, building a chip today requires more time and people and is more expensive than ever. As designs contain tens or hundreds of modules that are developed and maintained by different engineers or even third-party IP vendors, managing and tuning the myriad of low-level parameters associated with each module becomes a very inefficient and bug-prone process. Moreover, low-level module-specific parameters often become irrelevant or too hard to tune for a non-domain-expert.

This research proposal introduces Pandora, a hardware design paradigm that aims at retaining the benefits of highly parameterized IP design and generation, while at the same time tackling the critical issue of rapidly increasing complexity in hardware design. In Pandora, IP blocks not only capture the microarchitectural or structural view of a design, but also encapsulate rich domain-expert knowledge, which comes in the form of i) high-level tuning knobs that are tailored to the specific domain and are meaningful to the application developer, ii) characterization meta-data and optimization mechanisms that map and help effectively navigate the design space, iii) domain-aware monitoring and introspection mechanisms that gather and analyze low-level experimental data to identify or even diagnose higher order correctness and performance issues and iv) a set of auxiliary supporting tools, mechanisms and frameworks that are packaged along with the IP and enhance how the user interacts with the IP. In addition to keeping complexity under control and boosting productivity, the Pandora approach also dramatically reduces the combined total effort, because work that would otherwise potentially be repeated by each IP user is now performed only once and can be leveraged by others.

To demonstrate the effectiveness of the Pandora hardware design approach, I plan to perform an FPGA-driven study of the proposed ideas in the context of Networks-on-Chip (NoCs). In the process of pursuing the overarching goal of reining in hardware design complexity, I expect this research to generate a series of contributions across a variety of aspects surrounding HW design, including: i) a detailed NoC design space characterization, ii) a flexible systematic interconnect generation framework, iii) a study of static and dynamic feedback-driven NoC design space exploration and optimization approaches and iv) an FPGA-based application-driven evaluation of the proposed HW design approach. To disseminate the results of this work, I plan to incorporate all findings in a sophisticated RTL generation engine, which I plan to publicly release in the form of a flexible user-friendly web-based NoC generator that will also serve as a demonstration vehicle for Pandora.


1 Proposal Overview

Rapidly increasing transistor counts, driven by Moore's Law [23], coupled with recent technology advances, such as 3D stacking, are leading to the development of massive chips containing billions of transistors. At the same time, as power dissipation has become a major concern in all facets of computing, increased use of more energy-efficient application-specific special-purpose hardware is a promising approach to keep power dissipation under control. The confluence of these trends is leading to the development of complex and diverse Systems-on-Chip (SoCs) that include tens or hundreds of interacting modules.

The Complexity Wall. Despite this enormous rise in scale and complexity, the vast majority of chips are still developed using conventional design methodologies that have fundamentally remained unchanged since the introduction of modern Hardware Description Languages (HDLs) in the 1980s and the proliferation of Application Specific Integrated Circuit (ASIC)-based design more than two decades ago [1]. In the majority of cases, hardware design is still a primitive, low-level, complex process, which essentially pertains to transcribing schematics, specifications or block diagrams into Register Transfer Level (RTL) code using HDLs. Even when leveraging existing extensive IP libraries to accelerate the development cycle by quickly assembling the required submodules comprising a modern chip, designers are still burdened with the complexity of integrating everything together and configuring the myriad and often cryptic (for the non-domain-expert) low-level design parameters. Incidentally, these are also some of the reasons preventing application developers from experimenting with their ideas in hardware, even though they are often comfortable doing so in a software setting.

As chip density continues to increase and designs contain tens or hundreds of interacting modules that are developed and maintained by different engineers or even third-party IP vendors, current hardware design methodologies struggle to keep up with the complexity. In addition to the complexity involved in developing and validating individual pieces of hardware, designers are typically also burdened with integrating and tuning the growing number of IP blocks in a chip. Managing and calibrating countless low-level parameters associated with the various submodules of a design becomes a very inefficient and bug-prone process. Moreover, these low-level module-specific parameters often become irrelevant or too hard to tune for non-domain-experts. The end result is a "complexity explosion" as chips are built as a "fragile" collection of crude submodules and IPs, each with its own collection of low-level knobs. Consequently, the complexity and the associated development time and cost of building a chip today are higher than ever.

Reining in Design Complexity. In an effort to tackle this problem, which I refer to as the "Complexity Wall" in this work, and advance the state of current hardware design methodologies, this thesis will introduce and study Pandora [2], a new hardware design paradigm for IP development that marks a departure from the current aging status quo by:

[1] Some recent proposals pertaining to new hardware design paradigms are discussed later.

• Encoding and embodying domain-expert knowledge through the enrichment of the IP with extensive qualitative and quantitative information that encompasses how the various knobs and parameter settings affect the IP behavior with respect to implementation (e.g. speed, area, power) and higher-level performance characteristics (e.g. average MIPS in the case of a processor IP or bandwidth and latency in the case of a Network-on-Chip IP). This will allow the IP user to effectively explore the often multi-dimensional design landscape and make informed trade-off decisions.

• Raising the level of abstraction by exposing high-level configuration and tuning knobs that are meaningful at the application level and to the end-user of the IP. This abstraction layer will empower non-domain-experts to easily and effectively navigate the design space and meet application-level design goals. Moreover, it will also act as a filter to isolate and guard the IP user from the inner workings or low-level details of the IP.

• Including powerful instrumentation that constantly monitors the hardware and enables accelerated validation, as well as more effective performance and cost optimization, by guiding the designer through meaningful feedback. To be amenable to the non-domain-expert, instead of presenting the user with raw data, the IP should also contain the necessary introspection mechanisms for interpreting the gathered data in order to: i) capture high-level effects (e.g. in the case of a NoC, detect deadlocks, capture congestion effects, or identify bottlenecks) and ii) trace back and inform the user about the root cause of any correctness and performance issues.

• Bundling the IP with supporting material, tools, mechanisms and frameworks that enhance how the user interacts with the IP. These can range from simple testbenches that are tailored to generated IP instances to more advanced supporting infrastructure, such as sophisticated optimization frameworks that can aid in tuning the IP, identifying sweet spots within the design space or helping gradually approach better solutions through feedback-driven optimization loops.
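As a concrete illustration of the abstraction-raising idea above, the sketch below shows how application-level goals might be translated into low-level structural parameters for a NoC IP. This is a minimal, hypothetical example: the function name, the parameter names and the mapping rules are illustrative assumptions, not part of any actual Pandora implementation.

```python
# Hypothetical sketch: mapping high-level, application-meaningful NoC knobs
# to the low-level structural parameters a router RTL generator might expect.
# All names and mapping rules here are illustrative assumptions.

def configure_noc(num_endpoints, peak_bandwidth_gbps, max_latency_cycles):
    """Translate application-level goals into low-level IP parameters."""
    # Wider flits raise per-link bandwidth; assuming a 1 GHz clock,
    # a 64-bit flit sustains roughly 64 Gbps per link.
    flit_width = 64 if peak_bandwidth_gbps <= 64 else 128

    # Fewer hops lower latency: pick a topology from a crude latency budget.
    # (A real generator would consult characterization data instead.)
    topology = "ring" if max_latency_cycles >= 2 * num_endpoints else "mesh"

    # Deeper buffers tolerate bursts; scale depth with endpoint count.
    buffer_depth = max(4, num_endpoints // 4)

    return {
        "topology": topology,
        "num_routers": num_endpoints,
        "flit_width": flit_width,
        "buffer_depth": buffer_depth,
        "virtual_channels": 2,  # fixed here; a real flow would tune this too
    }

cfg = configure_noc(num_endpoints=16, peak_bandwidth_gbps=32, max_latency_cycles=40)
print(cfg)
```

A real Pandora-style IP would derive such mappings from measured characterization data rather than the fixed rules above; the point is only that the user states application-level goals and never touches flit widths or buffer depths directly.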

Network-on-Chip Focus. To demonstrate the effectiveness of the proposed approach, I plan to perform an FPGA-driven study of the proposed ideas in the context of Networks-on-Chip (NoCs), which are a fundamental class of IPs that not only play a central and often performance-critical role in modern SoCs, but also deeply encompass the issues that pertain to this thesis. In particular, I consider NoCs to be an ideal research vehicle, because they: i) play a central, ubiquitous role in modern chips, ii) are complex, costly and performance-critical, iii) form a rich design space, iv) are hard to configure and optimize and v) typically require expert knowledge to be configured.

Impact and Contributions. I expect this work to have multiple research contributions that will potentially span and impact various levels of the HW design process. The major and most ambitious contribution pertains to overcoming the "Complexity Wall" in hardware design. I hope to move a step closer towards this goal through Pandora, a novel hardware design paradigm that raises the level of abstraction and allows non-domain-experts to easily and efficiently navigate and identify sweet spots within the design space, without having to deal with the low-level and often cryptic details of modern IP design. Even though I plan to demonstrate Pandora through an FPGA-driven NoC-focused study, I expect that the developed approaches and methodologies will also be useful or directly applicable to other aspects of HW design or IP instances that are approaching or have already hit the "Complexity Wall".

[2] The name Pandora is inspired by Greek mythology and has a dual meaning. The first meaning relates to how Pandora was created through unique gifts from each god, which resembles how the proposed design paradigm encapsulates rich domain-expert ("gods") knowledge to support complexity-reducing interfaces, mechanisms and tools ("gifts"). The second meaning pertains to Pandora's box, which kept sealed all of the evils of the world, similar to how the proposed hardware design paradigm tries to hide or restrain complexity within the IP and avoid exposing the user to the "evils" or complexities of hardware design.

In the process of pursuing the overarching goal of reining in hardware design complexity, I expect this research to generate a series of smaller-scale, more tangible and directly applicable contributions across a variety of aspects surrounding HW design, including: i) a detailed NoC design space characterization, ii) a flexible systematic interconnect infrastructure, iii) a study of static and dynamic feedback-driven NoC optimization approaches and iv) an FPGA-based application-driven evaluation of the proposed HW design methodologies and optimization approaches.

To disseminate the results of this work, I will incorporate all of the findings of this thesis in a sophisticated RTL generation engine, which I plan to publicly release in the form of a flexible, user-friendly, web-based NoC generator that will also serve as a demonstration vehicle for the conducted research. To carry out this goal I will leverage the experience I gained while building and publicly releasing CONNECT (http://www.cs.cmu.edu/~mpapamic/projects/connect.html), an online NoC generator that has gained significant traction and is actively used by multiple researchers.

Ultimately, the long-reaching goal of this thesis is to contribute towards advancing the status quo in HW design, especially in the context of NoCs, which are emerging as the de facto interconnection fabric for integrating multiple IPs within a modern SoC.

2 Background

2.1 The Rise of Design Complexity

Over the last few decades technology scaling has closely tracked Moore's Law [23], which refers to the empirical observation that the number of transistors in a chip doubles approximately every 18 months. This exponential growth in transistor counts, which has surprisingly persisted until today, along with Dennard scaling [13], which predicted that power density remains constant as transistors get smaller, has served as a major driver in the semiconductor industry for multiple decades. As a result, by the 1990s integrated circuits already contained tens of millions of transistors, while power consumption still remained a second-order concern.
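To make the compounding concrete, a doubling period of 18 months implies roughly a hundredfold increase in transistor count over a single decade, as the following back-of-the-envelope calculation (plain Python, added here purely for illustration) shows.

```python
# Growth factor implied by a fixed doubling period (the Moore's Law
# variant cited above: doubling roughly every 18 months).
def transistor_growth(years, doubling_period_months=18):
    """Multiplicative growth in transistor count after `years` years."""
    return 2 ** (years * 12 / doubling_period_months)

# A decade at this rate is roughly a 100x increase.
print(transistor_growth(10))
```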

The Productivity Gap. As chip density continued its exponential growth, hardware designers struggled to keep up. This discrepancy between the number of transistors available on a single chip and the ability of designers to efficiently use these transistors was identified about fifteen years ago and was labeled the "design productivity gap", which is illustrated in Figure 1. This worrying trend sparked research in multiple aspects of hardware design, including high-level synthesis techniques, validation tools, more powerful hardware description languages, as well as new hardware design methodologies, such as platform-based design [18].

Figure 1: The Design Productivity Gap: The difference between the transistors available on a single die and designer productivity (number of transistors designers are able to effectively use).

In particular, the increased (re)use of Intellectual Property (IP) blocks, which refer to pre-made, pre-validated, reusable packaged units of hardware, has been recognized as a very promising approach to alleviate the productivity gap. Instead of designing every component in a chip from scratch, designers can build entire chips or portions thereof by leveraging third-party prepackaged IP blocks, which can greatly reduce the development time and cost of individual submodules within a larger chip. Compared to other approaches trying to tackle the growing productivity concerns, IP reuse fared as a simpler, more tangible and immediate solution that can be quickly adopted by the semiconductor industry, because it does not require significant changes to the design process. Sure enough, the proliferation of an ever-growing number of rich IP catalogs came as a much-needed productivity boost that would bridge the productivity gap to some extent through modular design and heavy IP reuse.

The Power-Constrained Era. Since this design productivity gap was identified in the 1990s, transistor counts have continued to rapidly increase, driven both by Moore's Law and by recent technological advances in Integrated Circuit (IC) fabrication, such as the use of silicon interposers or other forms of 3D stacking technologies. However, the inability to further scale supply voltage (due to leakage concerns) has led to the breakdown of classical CMOS scaling as described by Dennard [13]. This, in turn, has cast power dissipation as a first-order concern in HW design affecting all facets of computing, from embedded systems and smart phones to datacenter servers or even high-performance computing. In an effort to continue increasing performance in the power-constrained setting of the post-Dennard era, designers are turning to increased use of application-specific special-purpose hardware, which is a promising path towards more energy-efficient computing, because it can lower the energy required to perform a task. Recent research efforts have also recognized the power efficiency benefits of including more special-purpose hardware in a chip and have proposed it as a promising solution to the dark silicon problem [14], which refers to portions of a chip that are only rarely powered on due to power constraints.

The confluence of these trends — namely the ever-growing availability of transistors, which are now in the billions, and the increased need for power-efficient special-purpose hardware within a chip — is leading to the development of massive chips that include tens or hundreds of interacting modules, organized in intricate multi-level hierarchies. Yet, despite the clear power efficiency benefits of special-purpose hardware and our ability to fabricate denser chips with billions of transistors, current design methodologies have not evolved at the same pace to handle the complexity associated with such massive, diverse designs. As a result, designing a chip today requires large skilled hardware design teams and costs more than ever, even without considering manufacturing costs.

The Complexity Wall. As integration continues and designs contain tens or hundreds of interacting modules that are developed and maintained by different engineers or even third-party IP vendors, current HW design methodologies struggle or are unable to keep up with the complexity. Managing and tuning the myriad of low-level parameters associated with each submodule and connecting them together becomes a very inefficient and bug-prone process. Moreover, low-level module-specific parameters often become irrelevant or too hard to tune for non-domain-experts. Similarly, as the number of modules scales, building a flexible interconnect that satisfies the often application-specific communication needs in an efficient manner is also becoming a challenging task.

Despite the enormous rise in scale and complexity and the push towards modular design and IP reuse, hardware and more specifically IP design has not fundamentally changed since the introduction of modern Hardware Description Languages (HDLs) in the 1980s and the proliferation of Application Specific Integrated Circuit (ASIC)-based design more than two decades ago. In the majority of cases, hardware design is still a very primitive low-level process, which essentially pertains to transcribing schematics, specification documents or block diagrams into RTL code using HDLs. The end result is a "complexity explosion" as designers build chips as a "fragile" collection of crude submodules and IP blocks, each with its own set of cryptic low-level knobs, which can often be traced back to the hardware schematic or specification document that they originated from.

Similar to the "design productivity gap" identified more than a decade ago, this work draws attention to a similarly alarming trend, but this time at a different scale: not at the level of the transistor, but at the level of the module or IP block. Current hardware design methodologies struggle or are unable to keep up with the complexity involved in configuring, tuning, integrating and validating the multiple interacting IP blocks within a modern chip. In this work I refer to this problem as the "Complexity Wall" in hardware design.

2.2 Existing Efforts to Tackle Design Complexity

Other researchers have also recognized complexity as a major obstacle in hardware design, which has led to a number of proposals that attempt to mitigate various aspects of this problem. These efforts vary in scale and range from the development of new hardware description languages, tools and algorithms that can enhance existing chip development flows, to novel design methodologies that fundamentally rethink the way we design hardware. Despite the wide variety, most proposed approaches share many similar underlying themes, such as design reuse, modularity, abstraction, hierarchical design and orthogonalization of concerns. The remainder of this section highlights some of these efforts that are relevant to or aligned with the work in this proposal.

Design Reuse and Patterns. The reuse of existing design efforts has been a common recurrent theme across many efforts to tackle design complexity. Design reuse can take various forms, from module or IP replication to extensive parameterization to reusing concepts and techniques, and can span multiple levels, from smaller hardware elements, such as an adder circuit, to larger chip modules or components, such as a complex processor block or memory controller. Recent research has also studied the use of design patterns [12], which refer to a more systematic and organized way of classifying and cataloging existing hardware designs in an effort to more effectively leverage design reuse.

Alternative Hardware Description Languages. The majority of hardware design today is carried out using the Verilog and VHDL Hardware Description Languages (HDLs), which were introduced in the 1980s. Even though both of these languages have been updated over the years and do include support for modular design and some primitive forms of parameterization, they still require that designs are described at a low structural level. This makes hardware design a very tedious and bug-prone process and has a negative impact on designer productivity.

In recent years there have been a number of proposals to replace the aging Verilog and VHDL HDLs and allow designers to shift their focus to the behavior, functionality and algorithmic view of their hardware designs instead of their structural composition and implementation details. Bluespec [7] and Chisel [33] are two examples of recently proposed hardware description languages that allow designers to describe hardware at a higher level, borrowing software constructs, such as loops, conditionals and recursion. Such languages enable high degrees of parameterization and modularity by allowing for well-structured typed interfaces and support for polymorphism. In the case of Bluespec, designers also benefit from compiler-enabled type checking and scheduling.

High-level Synthesis. A more aggressive approach towards the same goal of raising the level of abstraction and increasing designer productivity is the synthesis of hardware from high-level languages. The main idea behind this approach is to allow designers to express their algorithms or desired behaviors using high-level software-like languages. High-level synthesis has gained significant traction in recent years, and current research and commercial solutions allow designers to write code using existing software languages (e.g. C/C++), which is eventually converted into hardware. Examples of high-level synthesis tools include AutoESL [1], LegUp [8], ROCCC [3], Catapult [19] and Impulse [2].

Despite having a lot of potential, high-level synthesis is still a long way from replacing traditional HDL-based hardware design. For arbitrary hardware designs, current high-level synthesis solutions typically produce lower quality results compared to conventional hardware design flows and are also limited in terms of their expressiveness, as they can only handle subsets of existing software languages. Still, recent research has shown promising results when using high-level synthesis tools within specific problem domains, such as signal processing [25] or nested loop transformations [24, 34].

Communication Architectures. As the number of interacting modules on a chip continued to rapidly grow, communication was quickly recognized as a critical and time-consuming part of the design process. To keep design time and cost under control, both academia and industry have studied tools and frameworks [5, 6, 17, 27, 28] that automate and facilitate the process of designing and implementing communication mechanisms. As the number of interacting modules kept increasing, more recent efforts have shifted their focus from traditional bus-based and ad-hoc interconnect solutions to the more scalable approach of using Networks-on-Chip [11, 16]. This shift also came with a push for interface standardization to promote modularity and design reuse.

New Design Approaches and Methodologies. In an effort to overcome the increasing challenges in chip design, researchers have also explored higher-level unified approaches and methodologies to tackle design complexity, which often combine or leverage some of the techniques and approaches already described above, such as design reuse or interface standardization. In platform-based design [18], chips are built as platform instances, which are compositions of library elements that are represented by models of varying fidelity and adhere to a common set of rules and standard interconnection interfaces defined by the "platform". The "platform" serves as an abstraction layer or API that allows for quick design space exploration and shields designers from low-level details. This approach is particularly useful in a System-on-a-Chip (SoC) setting, where designs are implemented as collections of pre-made IP blocks.

Complementary approaches to platform-based design that also aim towards the same goal of raising the level of abstraction have looked at new ways to describe hardware, how to embed designer knowledge in IP blocks and how to create templates or generators that can produce multiple variants of a hardware design from a common description. These efforts have led to the development of higher-level software-like languages, such as SystemC [4], as well as highly specialized hardware description languages, often specifically tuned to a particular domain, such as digital signal processing [22, 25]. To enable embedding of designer knowledge and template-based design, the Genesis2 project [15] offers a powerful framework that builds on top of SystemVerilog and allows designers to build chip generators [29], such as a multiprocessor generator [31] or a floating-point unit generator [30].

3 Proposed Research

The overarching goal of this work is to tackle the problem of increasing complexity in hardware design. In particular, this work focuses on Pandora, a design paradigm for developing individual hardware modules, often called Intellectual Property (IP) blocks. IP blocks refer to common building modules that implement a specific function (e.g. processor, Network-on-Chip, FFT core) and can be shared among designers and reused across different projects. Multiple such IP blocks are often composed to build a multi-module chip. The main advantage of packaging and distributing hardware in the form of IP blocks is facilitating modularity and reuse, which can boost designer productivity and reduce chip development time and cost.

Unfortunately, even though IP reuse has been fully embraced by the hardware design community, hardware design today still remains a very complex process, which requires large and skilled hardware design teams and can cost tens to hundreds of millions of dollars. Despite having rich IP libraries at their disposal, designers are still burdened with the complexity of integrating everything together and configuring the myriad and often cryptic (for the non-domain-expert) low-level design parameters. This alarming trend is amplified by the increasingly specialized nature of individual IP blocks, which often require domain experts to be configured, validated and fine-tuned.

To make matters worse, hardware development today has not fundamentally changed since the introduction of Verilog and VHDL, the two dominant Hardware Description Languages (HDLs), which were both introduced in the 1980s. Even though HDLs have seen minor updates over the years to improve support for modular design and some primitive forms of parameterization, they still require that designs are described at a very low structural level. Designing a hardware block is a crude low-level process, which essentially pertains to transcribing schematics, specifications or block diagrams into low-level, often cryptic, Register Transfer Level (RTL) code using HDLs. This primitive state of affairs affects multiple aspects of hardware design, including the development and use of IP blocks. As a result, IP developers lack the proper tools, frameworks and language support to easily develop and validate their IP blocks. Developing a highly-parameterized IP block is a very tedious and bug-prone process. Moreover, the parameterization is typically constrained to low-level structural features of the design, such as bus widths, buffer depths or the type of storage primitive.

To alleviate the situation, recent commercial [7, 33] and academic efforts [15] have laid the groundwork to provide designers with frameworks and tools that allow for much more flexible hardware and IP design, including software-like tools to aid extensive parameterization and validation, as well as extensive static elaboration mechanisms for hardware generation. However, such efforts are mostly geared towards empowering the IP developers or providers, instead of the IP end-users or clients, who are still exposed to the low-level details of the design and have to interact with IPs at a primitive structural level. As a result, albeit steps in the right direction, building IPs that support high degrees of parameterization and pushing for building hardware generators instead of instances typically comes at the cost of increased complexity for the IP user.

3.1 Pandora

This proposal introduces Pandora, a hardware design paradigm that is aligned with existing efforts to

tackle design complexity and aims at retaining the benefits of highly parameterized IP design and gen-

eration, while at the same time addressing the associated complexity explosion. In Pandora, IP blocks

not only capture the microarchitectural or structural view of a design, but also encapsulate rich domain-

expert knowledge , which comes in the form of i) high-level tuning knobs that are tailored to the specific

domain and are meaningful to the application developer, ii) characterization meta-data and optimization

mechanisms that map and help effectively navigate the design space, iii) domain- or application-aware

monitoring and introspection mechanisms that can analyze low-level information to identify and help cap-

ture or even diagnose higher order correctness and performance issues and iv) a set of auxiliary supporting

tools, mechanisms and frameworks that are packaged along with the IP and enhance how the user interacts

with the IP. In addition to keeping complexity under control and boosting productivity, this approach also


dramatically reduces the combined total effort, because work that might otherwise be repeated by each IP user is now performed only once and can be leveraged by others.

Pandora marks a departure from the current status quo in hardware design by combining a set of key

ideas and principles that aim at empowering both IP developers and users. In addition to encompassing

the crude low-level hardware description of a design, the IP is now also enriched with domain-expert

knowledge and includes supporting mechanisms, tools and a diverse set of interface layers to match the

expertise of the IP user. Combined, these features give the IP a sense of ”smartness” or ”self awareness”

that enhance how the user interacts with it and can simplify and accelerate the integration, tuning and

validation phases of the design cycle. Pandora’s underlying principles and mechanisms span multiple

aspects of the IP development and usage cycle and are discussed below.

3.1.1 Encoding and Embodying Knowledge

In addition to capturing the microarchitectural and structural view of a design, in Pandora IPs carry

qualitative and quantitative meta-data. This information captures how the various knobs and parameter

settings affect the IP design space with respect to hardware implementation and higher-level domain-

specific properties and performance characteristics. This additional knowledge will allow the IP user to

effectively explore the often multi-dimensional design landscape and make informed trade-off decisions.

Hardware Implementation Characterization: This refers to capturing how the IP parameters affect

implementation characteristics of the design, such as area, critical path and power dissipation. In its

simplest form this can be a database or characterization library that is attached to and maintained along

with the IP. Alternatively, this data can be complemented by, or take the form of, analytical formulas that approximate implementation trends. Implementation information can be maintained at different degrees

of fidelity, ranging from classes of hardware devices (e.g., FPGAs or ASICs) to specific device families (e.g., Altera Stratix V), device instances (e.g., Virtex-6 XC6VLX760) or technology libraries (e.g., TSMC 32nm).

Moreover, additional information can be maintained depending on the specific IP and implementation

target. For instance, in an FPGA environment, area can be broken down into LookUp Tables (LUTs),

BlockRAMs, DSPs, etc., or similarly in a design with multiple clock domains, critical path information

can be kept on a per-clock basis.
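As a rough sketch of what such a characterization library might look like, the following Python fragment models measured implementation data keyed by target device and parameter settings. All device names, parameters and numbers are hypothetical placeholders, not actual characterization data.

```python
# Hypothetical sketch of a per-device characterization library for a
# parameterized router IP. Device names and numbers are made up.
char_db = {
    # (device, ports, vcs, buffer_depth) -> implementation characteristics
    ("xc6vlx760", 4, 2, 8): {"luts": 3200, "brams": 4, "fmax_mhz": 210},
    ("xc6vlx760", 4, 4, 8): {"luts": 5100, "brams": 8, "fmax_mhz": 190},
    ("stratix5", 4, 2, 8): {"luts": 2900, "brams": 4, "fmax_mhz": 240},
}

def characterize(device, ports, vcs, buffer_depth):
    """Look up measured data for a configuration; None if uncharacterized."""
    return char_db.get((device, ports, vcs, buffer_depth))
```

A configuration tool could consult such a table to report, say, `characterize("xc6vlx760", 4, 2, 8)["luts"]` without rerunning synthesis.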

Domain-Specific Metrics and Properties: Besides hardware implementation details, which typically

take the same form across all different types of IP, Pandora also captures how the various IP parameters

affect higher-level metrics that are specific to the domain at hand. These are also typically the metrics that

the IP end-users are interested in and try to adjust to meet application-specific goals. For instance, in the

case of a processor core, such a metric could be IPC (Instructions Per Cycle), or in the case of a NoC IP,

such metrics could be the saturation bandwidth and idle latency of the network. This characterization can

also include high-level properties, e.g. in the case of a NoC IP, capture how IP parameters affect packet

delivery and ordering guarantees, traffic isolation or Quality-of-Service properties.

These characterization libraries are meant to aid the design process and do not necessarily need to be

exhaustive or perfectly accurate. Depending on the parameterization degree and complexity of the IP, the


characterization of the IP design space can be done in many ways and at varying degrees of detail: e.g., i)

collectively for the entire IP (i.e. sweeping over externally-exposed parameters), ii) per submodule (i.e.

considering the internal parameters of each submodule in the hierarchy), iii) through selective sampling of

the design space (driven by domain-expert knowledge), or iv) through the derivation of analytical formulas

that predict implementation and performance results based on expert knowledge or experience.
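Approach iv) above can be sketched as a simple analytical cost model. The formula below is purely illustrative: the terms and coefficients are invented to show the shape such a model might take (a crossbar term growing quadratically in port count, plus buffering and allocation terms), not derived from any real characterization.

```python
# Hypothetical analytical area model for a virtual-channel router.
# Coefficients are illustrative placeholders, not measured values.
def estimate_router_luts(ports, vcs, flit_width, buffer_depth):
    crossbar = 0.5 * ports * ports * flit_width   # switch grows ~quadratically in ports
    buffers = 2.0 * ports * vcs * buffer_depth    # per-VC input buffering
    alloc = 30.0 * ports * vcs                    # allocation/arbitration logic
    return int(crossbar + buffers + alloc)
```

Even such a coarse model can answer trend questions (e.g., how much area a port-count increase costs) before any synthesis run.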

Even in its raw form, whether it consists of detailed characterization libraries based on experimental data (e.g., synthesis runs or simulations) or coarser-grain predictive data based on analytical formulas and designer experience, this extra knowledge coupled with the IP is already very useful to the IP user. Not

only does it facilitate faster and more informed navigation of the design space, it can also help drastically

prune the design space by identifying parameter combinations that do not constitute interesting or feasible

design points. As described in the next sections, Pandora also leverages this knowledge to raise the level

of abstraction at which the user interacts with the IP and also provide supporting toolsets and frameworks

that increase designer productivity.

3.1.2 Raising the Level of Abstraction

While building richer and more flexible parameterized IP blocks increases design space coverage and

facilitates reuse, it comes at the cost of added complexity, as IPs expose an ever-increasing number of

low-level parameters and settings. For example, the top-level router module of the Stanford Open Source

Network-on-Chip Router project [32] exposes 42 parameters, with multiple additional parameters per sub-module. Such parameterized hardware designs typically expose a set of raw "structural" parameters,

which can usually be traced back to a hardware block diagram, and are directly tied to and affect low-

level implementation details of the hardware, such as datapath widths, buffer depths, memory dimensions,

arithmetic operation precision, etc. Access to such low-level parameters allows fine-grain control to expert

hardware designers familiar with the domain pertaining to the IP. However, as hardware designs scale in

size and comprise a growing number of IP blocks spanning different expertise domains and are often

developed and maintained by different engineers or even third-party IP vendors, managing and tuning the

myriad of low-level parameters associated with each submodule becomes a very inefficient and bug-prone

process.

To alleviate this problem, Pandora raises the level of abstraction by exposing high-level configuration

and tuning knobs that are meaningful at the application level and to the end-user of the IP. Pandora IP

blocks contain high-level information about their functionality, parameter settings, tuning, capabilities,

and also capture how the various knob settings relate to hardware implementation details, such as area,

power or clock frequency. This abstraction layer empowers non-domain-experts to easily and effectively

navigate the design space and meet application-level design goals. Moreover, it also acts as a filter to

isolate and guard the IP user from the inner workings or low-level details of the IP.

Pandora achieves this higher level of abstraction by leveraging the extensive characterization infor-

mation that is embedded into the IP and combining it with domain-specific expert knowledge to build

high-level polymorphic interface layers, a process illustrated in Figure 2. These high-level interfaces sim-


Figure 2: Pandora raises the level of abstraction through high-level interfaces that leverage domain-expert and design space characterization knowledge.

plify how the user interacts with the IP and allow configuring and tuning the IP through a set of prescriptive

knobs that capture high-level objectives or requirements and can even be of a qualitative or "fuzzy" descriptive nature. Depending on different user objectives and levels of expertise, these interfaces can take

different forms, span varying levels of abstraction and be tailored to match different classes of users. The

goal of each interface is to combine a proper set of configuration parameters and tuning knobs that are

meaningful and intuitive to the application developer that is using the IP.

To give a better idea of what these high-level interfaces might look like and highlight their polymorphic

nature, the list below includes some examples of different interface types:

• Set of Good Configurations or Personalities. In its most basic form, an interface can simply provide the user with a set of good predetermined configurations or "personalities". These configurations essentially correspond to points in the design space identified as "sweet spots" that are balanced or well suited for specific classes of applications. Picking this set of configurations is typically done by a domain expert who is familiar with the inner workings of the IP. Potentially this could also be done by gathering IP usage information and statistics that capture which configurations have produced good results, which ties well with the idea of deploying IP generators as a service, discussed later in this proposal. These predetermined configurations can also serve as good

starting design points that are later refined using other interfaces.

As an example, in the case of a processor IP generator, such configurations could be based on high-level specifications, e.g. "simple in-order", "fine-grain multi-threaded", "superscalar", "SIMD", but could also be driven by implementation goals or properties, e.g. "3-stage pipeline low area", "5-stage pipeline high-frequency", etc. These can also take the form of "personalities", which relate to the specific type of use within an application, e.g. "high-throughput", "image processing", "cryptography", etc. Similarly, for a NoC IP, these configurations could correspond to different


topologies, e.g. "mesh", "ring", "fat-tree", "butterfly", etc., or be based on the properties and architecture of the routers comprising the network, e.g. "Virtual Channel" or "Virtual Output Queued". The set of design points can also be organized into a hierarchy. For example, in the case of the NoC IP, each topology could offer a set of subconfigurations that allow more localized movements within the design space, e.g. "Virtual Channel Mesh" or "Virtual Output Queued Mesh".

• Objective and Constraint-Driven Queries. These types of interfaces heavily leverage the char-

acterization libraries or other expert knowledge that is embedded into the IP to provide a powerful

means for effectively navigating the design space. Instead of going through a trial-and-error process

or exhaustively sweeping different low-level parameters, the IP user can specify a set of objectives

and/or a set of constraints. The IP contains the necessary supporting mechanisms to sift through

characterization data and leverage designer knowledge to identify design points that match the user

specifications.

For example, for a processor IP within an FPGA environment, a simple query could be of the form "highest IPC that fits within an area budget of 30K LUTs". Depending on the coverage of

the characterization libraries, the extra embedded expert knowledge and the sophistication of the

implemented design space navigation mechanisms, these queries can be more complex, e.g., for a NoC IP, a query could be of the form "minimal-area NoC that runs at 800MHz and offers 50Gbps of bisection bandwidth with 2 virtual channels for traffic isolation and prioritization".

• Tuning Interfaces. While the two previous interface examples allow users to directly pinpoint specific configurations within the design space, Pandora can also offer interfaces that allow the user to start from a given design and navigate towards another design point that is better suited to the

specific application at hand. These are especially useful for users that are working with designs that

are close to meeting their goals, but are less familiar with the domain and would otherwise find it

very difficult to configure and tune the IP using low-level parameters. In its simplest form, such an interface is a high-level knob that can, e.g., take the form of "higher performance" for a processor IP or "lower latency" or "higher frequency" for a NoC IP. Moving in the right direction within the design space might require a coordinated change of multiple low-level parameters that would otherwise require a domain expert.

• Fuzzy Trade-Off Knobs. These interfaces abstract away the entire design space reducing it to a

small set of high-level parameters that have trade-off relationships. For instance, from a hardware

implementation perspective, such an interface could be a "frequency vs. area" trade-off knob that

allows traversing through various points that lie on the Pareto front. These interfaces could also

combine trade-offs among multiple parameters, e.g. in the form of a triangle trade-off selector,

where the user picks a point within a triangular area that represents the relative trade-off between

three parameters, such as area, speed and power.
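As an illustration of the objective- and constraint-driven query style described above, the sketch below searches a small hypothetical characterization table for the configuration that maximizes an objective, subject to minimum-value constraints (or maximum-value constraints via a `_max` suffix). All configuration names and numbers are invented.

```python
# Sketch of an objective/constraint-driven query over a hypothetical
# characterization table. Names and figures are illustrative only.
configs = [
    {"name": "mesh_2vc", "luts": 21000, "fmax_mhz": 180, "bisection_gbps": 40},
    {"name": "mesh_4vc", "luts": 34000, "fmax_mhz": 160, "bisection_gbps": 70},
    {"name": "ring_2vc", "luts": 12000, "fmax_mhz": 220, "bisection_gbps": 20},
]

def query(configs, objective, **constraints):
    """Return the config maximizing `objective` among those meeting every
    constraint: `metric=v` means metric >= v, `metric_max=v` means metric <= v."""
    feasible = [c for c in configs
                if all(c[k[:-4]] <= v if k.endswith("_max") else c[k] >= v
                       for k, v in constraints.items())]
    return max(feasible, key=lambda c: c[objective], default=None)
```

A query such as `query(configs, "fmax_mhz", luts_max=30000)` then plays the role of "highest frequency that fits within a 30K-LUT budget".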

While the above list of interfaces is not exhaustive, it gives a good flavor of the types of high-level

interfaces that encompass the key ideas and principles of Pandora. Building a high-level Pandora interface


pertains to encoding the relation between the specific high-level configuration or tuning knobs and the

low-level structural parameters of the design. Bridging this gap between the IP user interface and the low-

level parameters requires extensive knowledge about the IP’s design space mapping, which can come in

the form of extensive characterization libraries, as well as algorithmic and architectural expert knowledge

that is embedded into the IP.
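A minimal sketch of this encoding might map each high-level knob to a coordinated set of low-level parameter adjustments chosen by a domain expert. The knob names and deltas below are hypothetical placeholders.

```python
# Hypothetical encoding of high-level tuning knobs as coordinated
# low-level parameter adjustments, as a domain expert might define them.
knob_maps = {
    "higher_throughput": {"vcs": +1, "buffer_depth": +4},
    "lower_area": {"vcs": -1, "buffer_depth": -4},
}

def apply_knob(params, knob):
    """Return a new parameter set with the knob's coordinated adjustments,
    clamping each low-level parameter to a legal minimum of 1."""
    new = dict(params)
    for name, delta in knob_maps[knob].items():
        new[name] = max(1, new[name] + delta)
    return new
```

The point is that a single knob turn moves several structural parameters together, which is exactly the coordination a non-expert user could not do by hand.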

3.1.3 Instrumentation and Introspection

While the previous section touched on the issue of configuring IP parameters and identifying design

space sweet spots to pick the proper IP instance for the application at hand, this section discusses how

Pandora accelerates the hardware development cycle once an IP instance has been chosen and possibly in-

tegrated within a larger hardware design. To this end, Pandora employs instrumentation and introspection

mechanisms that constantly monitor, collect and analyze detailed information about the IP’s operation. To

be accessible to non-domain-experts, instead of flooding the user with raw low-level data, Pandora also

includes the necessary tools and domain-expert-driven mechanisms that analyze and interpret the low-

level gathered data to draw high-level conclusions about correctness or performance issues affecting the

design. As a result, the IP user receives meaningful feedback that relates to and captures application-level

behaviors, which, in turn, accelerates the verification process and enables effective performance and cost

optimization.

Instrumentation. Instrumentation refers to tapping into a design for monitoring or validation pur-

poses. While this is something that is commonly done in an ad-hoc fashion by hardware developers

during the development process (e.g. when chasing a bug), Pandora does this in a systematic manner and

bundles it as part of the IP. An immediate benefit is reduced effort, as this process is done once during the

IP development and does not have to be repeated by each IP user. Additionally, when the instrumentation

is staged by the IP developer, who is typically more familiar with the domain and has a much better grasp

of the inner workings of the IP, it will be of higher quality and more effective at collecting the proper set

of low-level data needed to draw conclusions about potential IP correctness and performance issues.

Instrumentation can take many different forms, depending on its purpose and the nature of the IP; it can

range from a simple set of passive counters that monitor interesting events to more sophisticated stateful

pieces of logic that can keep track of sequences of events and even interact with the IP. For instance, in the

case of a processor IP, a simple example of instrumentation could monitor cache misses or keep track of

the types of events that cause stalls. Similarly, in a NoC IP, instrumentation can take many different forms

and span different levels of the internal IP hierarchy. At a basic level it can be used to monitor link

utilization, network load, packet latency, buffer occupancy, average number of hops, or lower level data

such as allocator unit matching quality, or how many times a higher-priority traffic class blocked a lower-priority traffic class. A more advanced instrumentation example could be a sophisticated monitoring

and diagnostic block that implements the network flow control protocol and is attached to a network port

to monitor or stress-test the NoC.
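To make the flavor of such passive counters concrete, here is a behavioral software sketch (not synthesizable hardware) of a simple NoC monitor; the event names and statistics are illustrative, not part of any real IP.

```python
# Behavioral sketch of the passive-counter style of NoC instrumentation.
# Event names ("buffer_full", etc.) are illustrative placeholders.
from collections import Counter

class NocMonitor:
    def __init__(self):
        self.events = Counter()
        self.latency_sum = 0
        self.packets = 0

    def record_packet(self, latency_cycles, hops):
        """Tally a delivered packet's latency and hop count."""
        self.packets += 1
        self.latency_sum += latency_cycles
        self.events["hops"] += hops

    def record_event(self, name):
        """Count a low-level event, e.g. "buffer_full" or "alloc_miss"."""
        self.events[name] += 1

    def avg_latency(self):
        return self.latency_sum / self.packets if self.packets else 0.0
```

In hardware, each counter would map to a small register incremented by the corresponding event signal; the software view simply shows what is being tallied.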

Synthesizable Instrumentation. Contrary to typical instrumentation approaches that are often con-


strained to only running in a simulation environment, Pandora argues for implementing all or the majority

of the instrumentation in hardware, i.e. keeping it synthesizable. While this would be very challenging

and complex using traditional HDLs, the use of modern HDLs with advanced static elaboration mechanisms, such as Bluespec [7] or Chisel [33], makes it much more tractable. In fact, this is not limited only

to instrumentation, but also applies to other traditionally simulation-bound portions of a design, such as

testbenches.

Synthesizable instrumentation can be especially useful in a reconfigurable FPGA setting for a number of reasons. Firstly, even considering the possible area and timing penalty of turning on instrumentation,

running in hardware is still multiple orders of magnitude faster than RTL simulations. This not only short-

ens the development cycle, but also improves coverage from a verification perspective, as the design can

now quickly reach states that would otherwise never be exercised in a simulation environment. Moreover, any area and timing overheads are instantly eliminated when instrumentation is turned off. Secondly, the

flexible nature of FPGAs allows for quickly turning instrumentation on and off, which can be, for exam-

ple, particularly useful when trying to diagnose performance bottlenecks. Instead of having to recreate

the same scenario in a simulation, the designer can quickly switch to an instance of the design with in-

strumentation and start collecting data. Thirdly, synthesizable instrumentation allows designers to capture

hardware artifacts and behaviors that would otherwise be very hard or impossible to reproduce in a simu-

lation environment (e.g. DRAM controller refresh). Finally, since all measurements are taken by directly

probing the actual running hardware, they are bound to be more accurate than those obtained within a

simulation environment.

Introspection. Introspection in Pandora refers to the supporting mechanisms and logic that are bun-

dled with the IP and can analyze or process data that are being collected during instrumentation. This can

happen dynamically while the IP is actively being used (or simulated) or in a separate post-processing

step that analyzes logs of collected data. Similar to how Pandora raises the level of abstraction by providing high-level polymorphic interfaces for navigating the design space, it also raises the level of abstraction by providing higher-level feedback about the operation of the IP. This goal is achieved by

embedding domain-expert knowledge that can interpret low-level information to detect or even diagnose

higher-order correctness and performance issues.

At a basic level, Pandora’s introspection mechanisms help accelerate the verification process by

quickly identifying problems. These range from static configuration mistakes (e.g. invalid routing ta-

ble that prevents packets from reaching their destination) to improper integration or use of the IP (e.g.

network endpoints do not properly implement flow control). At a more advanced level, Pandora leverages the embedded domain-expert knowledge to capture higher-level domain-specific dynamic effects

and behaviors to offer feedback that is more natural, meaningful and intuitive to the end-user of the IP.

For instance, in the case of an NoC IP, such feedback could range from detecting deadlocks to capturing

congestion effects or identifying bottlenecks.

If a design is suffering from performance or correctness issues, Pandora also includes mechanisms

that leverage the embedded domain-expert knowledge to identify and inform the user about the root cause


of the problem and offer possible solutions or even directly take corrective action to fix the problem with

minimal user intervention. In an NoC setting, a simple example would be notifying the user to increase

the internal network buffering to accommodate the maximum packet size. While, in this example, the

cause of the problem could be statically traced back to suboptimal configuration, detecting other issues

can require dynamic monitoring of the IP and fixing them can require a coordinated adjustment of multiple

aspects of the IP. For instance, if an NoC IP detects frequent allocator collisions, then a possible fix could

involve low-level changes, such as switching to a different allocator and different router architecture, or

potentially also require higher-level changes to the design, such as picking a different network topology.
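A software sketch of such an introspection rule might interpret raw instrumentation counters and emit higher-level diagnoses; the field names, thresholds and advice below are invented for illustration only.

```python
# Sketch of an introspection rule that turns raw instrumentation counters
# into higher-level feedback. Thresholds and advice are illustrative.
def diagnose(stats):
    findings = []
    # Static configuration check: buffers must hold the largest packet.
    if stats["max_packet_flits"] > stats["buffer_depth"]:
        findings.append("increase internal buffering to hold the largest packet")
    # Dynamic check: frequent backpressure suggests congestion.
    if stats["buffer_full_cycles"] / stats["cycles"] > 0.2:
        findings.append("frequent backpressure: likely congestion; consider "
                        "more VCs or a higher-bandwidth topology")
    return findings
```

The first rule mirrors the buffering example in the text; the second shows how a dynamic symptom can be translated into domain-level advice rather than raw numbers.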

3.1.4 The IP "Uncore"³

The Pandora principles presented up to this point were tied in one way or another to the core func-

tionality of the IP and described Pandora’s approach to configuring, tuning and debugging an IP. The

final principle of Pandora spans a variety of related topics and mechanisms that affect hardware design

and pertain to IP supporting material and toolsets, as well as release and packaging strategies. Although

not crucial to the functionality of the IP, this final set of Pandora principles greatly enhance the IP user

experience and contribute towards reducing the complexity involved in developing and using an IP block.

Supporting Infrastructure. Contrary to software projects, hardware designs are often released in a

very crude form, which can be attributed, at least in part, to the limited expressiveness of conventional

HDLs. Pandora argues for augmenting the IP with supporting infrastructure that boosts productivity

and enhances how the user interacts with the IP. In addition to elementary supporting material, such

as documentation and testbenches, this includes supporting toolsets, such as scripts and interfaces for

configuring the IP or processing output logs, as well as more advanced supporting infrastructure, such as

sophisticated optimization frameworks.

This supporting infrastructure is often domain-specific and, as such, needs to be tailored by domain experts who have a better grasp of the type of supporting material that might be useful to the end user.

When building these auxiliary mechanisms, Pandora can leverage the characterization and domain-expert

knowledge described previously. For example, in an NoC setting, a configuration tool could enhance

design space navigation by tapping into the characterization database to provide instant feedback on hard-

ware implementation characteristics (e.g. frequency, area), predict network performance (e.g. bisection

bandwidth, latency), show previews of the generated network topology or even give hints as to what

types of applications would be a good match for the selected configuration. A more advanced example

could be a sophisticated feedback-driven optimization framework that processes instrumentation data and

iteratively tunes network parameters to reach a design sweet spot.
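The feedback-driven optimization loop mentioned above could be sketched as follows; the "measurement" here is a stand-in cost model, whereas a real framework would consume instrumentation data from the running design. All numbers and the doubling rule are hypothetical.

```python
# Sketch of a feedback-driven tuning loop: measure, adjust, repeat.
# The cost model below stands in for real instrumentation feedback.
def measured_latency(buffer_depth):
    # Hypothetical: deeper buffers reduce latency with diminishing returns.
    return 60 / buffer_depth + 10

def tune(target_latency, buffer_depth=2, max_steps=10):
    """Double the buffer depth until the latency target is met (expert rule)."""
    for _ in range(max_steps):
        if measured_latency(buffer_depth) <= target_latency:
            break
        buffer_depth *= 2
    return buffer_depth
```

A real framework would of course tune several parameters jointly and use the characterization database to bound the search, but the measure-adjust-repeat structure is the same.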

Releasing the IP as a Service. Given the inability of conventional HDLs to express and capture

the high degrees of parameterization required to develop a flexible IP block, developers typically have

to either package their IP with ad-hoc auxiliary tools, such as Java applets, that generate instances of

³ The uncore is a term used by Intel to describe the functions of a microprocessor that are not in the core, but which are essential for core operation and performance.


the IP (e.g. Xilinx’s coregen) or develop their IP within special languages and frameworks that natively

support hardware generation, such as Bluespec [7] or Genesis2 [15]. Regardless of the specific packaging

approach, this typically entails additional effort and increases the complexity for the IP user, who typically

has to perform multiple steps to start using the IP, such as setting up a proper environment, acquiring licenses, installing a series of tools, taking care of any library dependencies, etc.

In an effort to reduce complexity and lower the barrier to entry for the IP user, Pandora argues for

packaging and releasing IP generators as a service that isolates the IP user/client from the IP developer/provider. This can be done in the form of a portal that combines the various high-level configuration

and tuning interfaces along with interactive feedback mechanisms and other supporting material, such as

documentation. The primary benefit of such an approach is that the user can focus on using the IP, without

having to worry about equipment, environment and tools setup or development and configuration logistics.

This service-oriented distribution of the IP also has multiple advantages from an IP developer stand-

point, such as facilitating prompt and transparent IP updates. This becomes especially interesting in the

presence of the high-level interfaces described earlier, e.g., in an NoC setting, the user can receive an

internally improved NoC instance that still meets the same high-level criteria. Providing IP generation

as a service through a single point of distribution also allows the IP provider to gather usage statistics,

which can be used to improve the IP and guide development or even potentially facilitate crowd-sourced

characterization of the IP.

3.2 NoC-Focused FPGA-Driven Study

To demonstrate the effectiveness of the Pandora hardware design approach, I plan to perform an

FPGA-driven study of the proposed ideas in the context of Networks-on-Chip (NoCs). To make the most

efficient use of resources and time, I will perform this study by leveraging my work on CONNECT [26], a

flexible NoC IP generator that I recently developed and released publicly. As part of this study I will mod-

ify and build on top of the CONNECT NoC generation framework to incorporate the Pandora principles

and experiment with the proposed hardware design techniques and mechanisms.

NoC Focus. Although I expect the techniques and methodologies presented in this thesis to be applicable to arbitrary hardware designs spanning various domains, this study focuses on NoCs, a fundamental class of IPs that plays a central and often performance-critical role in modern SoCs and also

deeply encompasses the issues that pertain to this thesis. In particular, I consider NoCs to be an ideal

research vehicle, because they:

• Play a central, ubiquitous role. As the number of modules within a chip scales and traditional connectivity solutions, such as design-specific global wiring, become inefficient or even infeasible, NoCs are quickly becoming the de facto means for communication within modern SoCs.

• Are complex, costly and performance-critical. In addition to their ubiquitous nature, NoCs also

often end up being one of the most complex, costly and performance-critical components in a chip.

• Form a rich design space. The complex and diverse nature of interconnects, coupled with the surge


in NoC-related research in recent years, has led to the formation of a very rich multi-dimensional

design space that spans a large number of design parameters, such as topology, flow-control, router

architecture, QoS (Quality-of-Service) guarantees, layout, etc.

• Are Hard to Configure and Optimize. In addition to forming such a rich design space, NoCs

often have to meet a diverse and often conflicting range of goals that include, but are not limited to,

bandwidth, latency, QoS guarantees, as well as clock frequency, area and power constraints. Not

only does this cast them as one of the potentially most complex components within an SoC design,

but it also makes them a very challenging component to properly configure and optimize.

• Require expert knowledge. Navigating this rich design space is hard; it is often a very time-consuming process and typically requires extensive expert knowledge, because existing NoC solutions do not capture or encode the relation between low-level design parameters (e.g. datapath width, pipelining options, allocator type, flow-control scheme) and high-level, often application-

specific goals (e.g., load-latency behavior, traffic isolation, suitability for specific traffic pattern).

FPGA-Driven Study. Field Programmable Gate Arrays (FPGAs) have experienced a rapid growth,

both in terms of raw capacity as well as features. Modern FPGAs offer millions of logic cells, megabytes

of on-chip storage and hardwired support for a multitude of diverse interfaces and functionalities, such as

Gigabit Ethernet, PCI Express, SATA, DSP blocks and full-fledged processors. This rapid growth, coupled

with the steady introduction of new features and the presence of a rich set of on-chip hardened IPs has

promoted FPGAs to an attractive and capable platform for hosting even extended System-on-Chip designs,

as well as other demanding systems applications, such as full-system prototyping and high performance

computing.

From the perspective of this thesis and my NoC-focused study, FPGAs and their reconfigurable nature

constitute a promising research vehicle that can accelerate the development cycle and allow for rapid

and low-cost prototype-based evaluation. FPGAs also create a unique flexible environment that allows

customizing the interconnect to an extreme degree that would typically not be considered in a conventional

ASIC-based design. Some of our recent work described in Section 4.2 demonstrates the potential of

application-specific NoC tuning within an FPGA environment.

4 Completed Work

In preparation for my thesis work, I have completed key pieces of work that are tightly coupled with

the proposed thesis research. In particular, I recently developed CONNECT [26], a flexible FPGA-tuned

NoC IP generation framework, and also worked on Shrinkwrap [10], our recent effort towards automating

the generation of application-specific memory interconnects within FPGA applications. These projects

helped lay the groundwork for my thesis research and also served to some degree as the inspiration for

the Pandora hardware design paradigm. In fact, as is highlighted below, both CONNECT and Shrinkwrap

already encompass elements, albeit at an elementary stage, of the Pandora design paradigm.


4.1 CONNECT

Fast, Flexible, FPGA-Optimized NoCs. As part of our work on the CONNECT [26] project, we

performed a Network-on-Chip (NoC) design study from the mindset of NoC as a synthesizable infrastruc-

tural element to support emerging System-on-Chip (SoC) applications on FPGAs. CONNECT embodies

a set of design guidelines and disciplines that try to make the most efficient use of the FPGA substrate and

in many cases go against ASIC-driven conventional wisdom in NoC design. Interestingly, we found that

these FPGA-motivated design principles uniquely influence key NoC design decisions, such as topology,

link width, router pipeline depth, network buffer sizing and flow control.

In particular, we took full consideration of FPGAs' special hardware mapping and operating characteristics to identify their own specialized NoC design sweet spot. Specifically, the considerations that motivated rethinking NoC design for FPGAs were (1) FPGAs' relative abundance of wires compared to logic and memory; (2) the scarcity of on-die storage resources, which makes the large number of modest-sized buffers in a NoC costly; (3) the rapidly diminishing return on performance from deep pipelining; and (4) the field reconfigurability that allows for an extreme degree of application-specific fine-tuning.

When compared against a high-quality publicly available synthesizable RTL-level NoC design in-

tended for ASICs, CONNECT consistently offers lower latencies and is able to achieve comparable net-

work performance at one-half the FPGA resource cost; or alternatively, three to four times higher net-

work performance at approximately the same FPGA resource cost. In addition to being more efficient,

CONNECT also provides the flexibility to create application-specific NoCs; our NoC generator is able to

produce synthesizable RTL designs of FPGA-tuned multi-node NoCs of arbitrary topology.

NoC Generation Framework. To support our FPGA-oriented NoC design study, I developed the

CONNECT NoC generation framework [21] that can produce synthesizable RTL designs of FPGA-tuned

multi-node NoCs of arbitrary topology. The CONNECT NoC generator is based on a very flexible, topology-agnostic router architecture that is highly parameterized. This high degree of parameterization spans multiple key NoC characteristics, such as topology, router architecture, flow control, allocation algorithms, pipelining options, and buffer sizing.
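To make the scale of this parameterization concrete, the sketch below models the kind of parameter record such a generator might expose. All field names, defaults, and validity rules here are illustrative assumptions, not CONNECT's actual API (the real generator emits Bluespec/RTL).

```python
# Hypothetical sketch of a NoC generator's parameter space; field names and
# defaults are invented for illustration, not CONNECT's actual interface.
from dataclasses import dataclass

@dataclass
class RouterConfig:
    topology: str = "ring"        # e.g. ring, mesh, fat tree, or custom
    num_routers: int = 4
    flit_width: int = 32          # link width in bits
    num_vcs: int = 2              # virtual channels per port
    flit_buffer_depth: int = 8    # buffer entries per input port
    allocator: str = "separable"  # allocation algorithm variant
    flow_control: str = "credit"  # flow-control scheme
    pipeline_stages: int = 1      # FPGA-tuned designs favor shallow pipelines

    def validate(self):
        """Reject configurations that cannot yield a working router."""
        assert self.flit_width > 0 and self.num_vcs >= 1
        assert self.flit_buffer_depth >= self.num_vcs  # at least one slot per VC
        return self

cfg = RouterConfig(topology="mesh", num_routers=16, flit_width=64).validate()
print(cfg.topology, cfg.num_routers)
```

Bundling parameters into a single validated record like this is one way to keep the many interacting knobs (buffering, flow control, pipelining) from being set to mutually inconsistent values.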

Public Release. In an effort to create a useful research tool for the community, we publicly released

CONNECT in March 2012, in the form of a flexible user-friendly web-based NoC generator [21] that

supports a variety of common network topologies, as well as custom user-defined networks that can be

created through a visual network editor [20]. Since its release, the CONNECT NoC generator has gained

significant traction (more than 3500 unique visitors and 650 network generation requests) and is actively

used by multiple researchers around the world. Figure 3 shows screenshots of CONNECT’s web interface.

Pandora-Inspired Baby Steps. Over time, I have been extending CONNECT with new features and functionality based on user feedback and our group's internal needs, and along with these updates I have also been enhancing how users interact with the NoC generation framework. In retrospect, although small in scale and not performed in a coordinated or structured fashion, a large part of the work performed within the scope of CONNECT is very well aligned with the principles of Pandora. The remainder of this section briefly touches on such "Pandora-inspired" CONNECT features, mechanisms and extensions.


Figure 3: Screenshots of the publicly released web-based CONNECT NoC Generator and Network Editor.

Building a NoC instance using the original prototype version of CONNECT was a complex, tedious, multi-step and bug-prone process, as it relied on the user to modify raw RTL source code to set parameters, properly configure router modules, arrange them in a valid network topology, populate routing tables, etc. However, since its original incarnation, the CONNECT project has come a long way and is now offered in the form of a web-based portal [21] that completely automates the NoC generation process and includes multiple supporting tools that enhance how users interact with the IP generator. From the user's perspective, CONNECT appears as a service that produces synthesizable NoC RTL; generating a network essentially requires only an internet connection.

In particular, the CONNECT web-based front-end offers multiple high-level interfaces for IP configuration that include documentation and are dynamically updated as the user interacts with them to guard against erroneous configurations. CONNECT's main interface offers support for a wide range of common network topologies and provides a dynamically generated visual preview of the router and endpoint arrangement for each candidate network. CONNECT also offers secondary interfaces for building custom arbitrary-topology NoCs, which can be configured either through a visual network editor [20] or using a custom network specification language. Finally, to allow for easier integration with traditional command-line tool flows, I recently also developed a command-line front-end that generates NoC instances by remotely connecting to the CONNECT NoC generation framework.
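To illustrate what a custom network specification language might look like, the sketch below defines a tiny invented syntax (router/link declarations) and a parser for it. The syntax is purely hypothetical and is not CONNECT's actual specification language.

```python
# Illustrative sketch of a minimal custom-topology specification format and
# parser. The 'router'/'link' syntax is invented for illustration only; the
# real CONNECT specification language may look quite different.
def parse_network_spec(spec: str):
    """Parse lines like 'router r0' and 'link r0 -> r1' into a topology graph."""
    routers, links = set(), []
    for line in spec.strip().splitlines():
        tokens = line.split()
        if tokens[0] == "router":
            routers.add(tokens[1])
        elif tokens[0] == "link":
            src, _, dst = tokens[1:4]
            # Guard against links referencing undeclared routers.
            assert src in routers and dst in routers, "undeclared router"
            links.append((src, dst))
    return routers, links

spec = """
router r0
router r1
router r2
link r0 -> r1
link r1 -> r2
link r2 -> r0
"""
routers, links = parse_network_spec(spec)
print(len(routers), len(links))  # a 3-router ring
```

A textual format like this complements a visual editor: it is easy to generate from scripts and to check into version control, which matters for the command-line tool-flow integration mentioned above.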

4.2 Shrinkwrap: Memory Interconnect Optimization

In the Shrinkwrap work [10], we experimented with compiler-guided development of application-specific Networks-on-Chip within the CoRAM FPGA memory abstraction [9]. For this work, I extended

CONNECT to support a class of tree-based topologies that were a good fit for the traffic patterns exer-

cised by various CoRAM applications. Compared to using a baseline generic NoC, across a number of

CoRAM application instances, our customized NoCs reduced FPGA resource usage by almost an order

of magnitude, while retaining the same application performance.

Apart from demonstrating the effectiveness of application-specific customization of NoCs, this work was also a first step towards exposing high-level configuration knobs in CONNECT, which is one of Pandora's core principles. In particular, the special version of CONNECT that was developed for this work allows for generating networks using higher-level prescriptive parameters or directives, such as the type of traffic (e.g., memory distribution or aggregation) and peak bandwidth requirements. Based on these high-level inputs, CONNECT can generate a suitable topology and scale the network to meet the user-specified requirements.
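The sketch below captures the spirit of this prescriptive interface: the user states intent (traffic type, endpoint count, peak bandwidth) and the generator derives a topology and a link width that meets the bandwidth target. The mapping rules, clock frequency, and numbers are assumptions for illustration, not Shrinkwrap's actual heuristics.

```python
# Hedged sketch of prescriptive, high-level NoC configuration: derive low-level
# parameters from stated requirements. All heuristics and constants here are
# illustrative assumptions, not the actual Shrinkwrap implementation.
import math

def prescribe_noc(traffic: str, num_endpoints: int, peak_gbps: float,
                  clock_mhz: float = 150.0):
    # Memory distribution/aggregation traffic maps naturally onto a tree.
    topology = "tree" if traffic in ("distribution", "aggregation") else "mesh"
    # Link width (bits) so that one flit per cycle meets the bandwidth target,
    # rounded up to the next power of two.
    bits_per_cycle = peak_gbps * 1e9 / (clock_mhz * 1e6)
    flit_width = 2 ** math.ceil(math.log2(max(bits_per_cycle, 8)))
    return {"topology": topology, "endpoints": num_endpoints,
            "flit_width": flit_width}

noc = prescribe_noc("distribution", num_endpoints=8, peak_gbps=9.6)
print(noc)  # tree topology, 64-bit links at the assumed 150 MHz clock
```

The key idea is the direction of the mapping: requirements flow down into parameters, rather than the user guessing parameter values and checking whether requirements are met.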

5 Research Tasks and Timeline

The proposed research is split into a series of tasks, which are briefly described below.

T1: Augment the CONNECT NoC Framework

This first task will serve as a preparatory step that will focus on developing the necessary infrastructure

that will allow me to apply and study the Pandora hardware design paradigm in the context of NoCs.

In particular, I will extend the CONNECT NoC generation framework by implementing and exposing

additional low-level knobs to cover an even larger portion of the NoC design space, and also modify some aspects of CONNECT's IP generation engine to allow finer-grain control over design parameters.

Finally, I plan to implement extensive instrumentation across the entire CONNECT NoC framework.

T2: Design Space Characterization

This task pertains to characterizing the design space of the augmented CONNECT NoC generation

framework from T1. As a first step, this will involve sensitivity studies to distill a set of essential CONNECT low-level parameters that capture a sufficiently large portion of the full design space spanned by all possible parameter configurations of the generated NoC IPs. As a second step, this will require identifying a core set of high-level parameters, metrics and properties that are meaningful and intuitive to the IP

user. Having identified the set of low-level parameters and high-level domain-specific parameters, metrics

and properties of interest, I will start characterizing the design space of the augmented CONNECT NoC

generator. This will include generating multiple NoC instances, running synthesis sweeps, simulating the generated designs under various traffic patterns, deriving and calibrating analytical formulas, and encoding my own knowledge as a developer and NoC domain expert.
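The structure of such a characterization sweep can be sketched as follows: enumerate combinations of the distilled low-level parameters and score each design point. The cost model below is a placeholder proxy; in the actual study each point would be scored by running synthesis and simulation.

```python
# Minimal sketch of a design-space sensitivity sweep: enumerate low-level
# parameter combinations and rank them. The cost function is a placeholder
# proxy for FPGA resource usage; real characterization would invoke synthesis
# and simulation per point.
from itertools import product

flit_widths = [32, 64, 128]
buffer_depths = [4, 8, 16]
num_vcs = [1, 2, 4]

def estimated_cost(width, depth, vcs):
    # Crude stand-in: logic scales with width*VCs, storage with width*depth*VCs.
    return width * vcs * 10 + width * depth * vcs

design_points = [
    {"flit_width": w, "buffer_depth": d, "vcs": v,
     "cost": estimated_cost(w, d, v)}
    for w, d, v in product(flit_widths, buffer_depths, num_vcs)
]
cheapest = min(design_points, key=lambda p: p["cost"])
print(len(design_points), cheapest["flit_width"])  # 27 points enumerated
```

Even this toy sweep shows why distilling an essential parameter subset matters: the point count grows multiplicatively with every parameter added, so synthesis-backed sweeps over the full space quickly become intractable.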

T3: Build Introspection Mechanisms

This task will focus on the development of introspection mechanisms, which can be used to identify

NoC performance and correctness issues. As a starting step I will modify the CONNECT NoC framework

to support dynamic instrumentation mechanisms that become part of each generated NoC and monitor

the design while it is running. To capture more elaborate or transient behaviors of a design, I also plan

to implement static instrumentation mechanisms that post-process data gathered while the network was actively in use.
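As a software model of the dynamic instrumentation idea, the sketch below shows per-router event counters that accumulate while the (here, simulated) network runs, so an introspection layer can flag hotspots. In hardware these counters would be generated as RTL alongside each NoC; the class and metric names are illustrative assumptions.

```python
# Software sketch of per-router dynamic instrumentation: counters accumulate
# events at runtime, and derived metrics flag performance problems. Names and
# metrics are invented for illustration; real instrumentation would be RTL.
from collections import Counter

class RouterMonitor:
    def __init__(self, name):
        self.name = name
        self.events = Counter()   # e.g. flits forwarded, cycles stalled

    def record(self, event, n=1):
        self.events[event] += n

    def stall_ratio(self):
        """Fraction of activity spent stalled; a hotspot indicator."""
        total = self.events["flits"] + self.events["stalls"]
        return self.events["stalls"] / total if total else 0.0

mon = RouterMonitor("r3")
mon.record("flits", 900)
mon.record("stalls", 100)
print(round(mon.stall_ratio(), 2))  # 0.1
```

The same counters serve both modes described above: dynamic introspection reads them while the design runs, while static introspection post-processes their logged values after the fact.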


T4: Implement High-Level Interfaces

This task pertains to building the mechanisms that capture how the various low-level parameters can

be modified in a coordinated fashion to achieve and meet high-level goals and constraints. To do this I

will leverage the characterization information gathered in T2 to build and experiment with different types

of high-level configuration and tuning interfaces that are tailored to the domain of NoCs. This will include

i) interfaces for picking an initial design point, such as providing a set of predefined good configurations, supporting objective/constraint-based queries, or configuring a NoC based on qualitative user input, and ii) interfaces for tweaking an existing design point, allowing more controlled or localized movements within the design space, such as prescriptive commands or trade-off knobs.
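An objective/constraint-based query of the kind envisioned here can be sketched as a filter-then-optimize step over the pre-characterized design points from T2. The data values and field names below are made up for illustration.

```python
# Sketch of a constraint-based query interface: filter pre-characterized
# design points by user constraints, then minimize an objective. All data
# values and field names are illustrative assumptions.
characterized_points = [
    {"name": "small",  "luts": 2000, "latency_cycles": 5, "peak_gbps": 4.0},
    {"name": "medium", "luts": 4500, "latency_cycles": 4, "peak_gbps": 9.0},
    {"name": "large",  "luts": 9000, "latency_cycles": 3, "peak_gbps": 18.0},
]

def query(points, max_luts=None, min_gbps=None, objective="latency_cycles"):
    """Return the feasible point minimizing the objective, or None."""
    feasible = [p for p in points
                if (max_luts is None or p["luts"] <= max_luts)
                and (min_gbps is None or p["peak_gbps"] >= min_gbps)]
    return min(feasible, key=lambda p: p[objective]) if feasible else None

best = query(characterized_points, max_luts=5000, min_gbps=8.0)
print(best["name"])  # medium
```

This is the payoff of the T2 characterization: once design points carry measured costs and performance, "give me the lowest-latency NoC under 5000 LUTs delivering 8 Gbps" becomes a simple query rather than a manual parameter search.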

T5: Putting it All Together

As part of this task, I will integrate the efforts of T1-T4 to build a unified NoC generation framework that embodies Pandora's principles. This also includes working on additional auxiliary infrastructure that is not included in previous tasks, such as developing test harnesses for evaluation or optimization tools that combine the introspection mechanisms developed in T3 with the high-level interfaces built in T4.

T6: Evaluation

This task will evaluate the resulting NoC generation framework, as well as the effectiveness of the Pandora hardware design approach. This evaluation will include simulations using traffic generation engines, synthetic benchmarks and traffic traces, as well as application-driven evaluations, possibly within

the CoRAM [9] framework. To study the effectiveness of the Pandora paradigm, I plan to compare the

performance and hardware implementation characteristics of NoCs generated leveraging Pandora against

a set of baseline conventionally-generated NoCs.

T7: Public Release

To disseminate the results of this work, I plan to incorporate all findings of this thesis in a sophisticated

NoC RTL generation engine that embodies Pandora’s principles and which I plan to release in the form of

a flexible user-friendly web-based NoC generation portal.

T8: Thesis Writeup and Defense

Timeline

Figure 4 shows an approximate timeline for the research tasks proposed above. Based on this timeline,

I expect to defend my thesis in the Fall semester of 2014.


Figure 4: Thesis Timeline. Research tasks T1 through T8 span June 2013 to November 2014.


References

[1] AutoESL High-Level Synthesis Tool. http://www.xilinx.com/tools/autoesl.htm.

[2] Impulse CoDeveloper. http://www.impulseaccelerated.com.

[3] Riverside Optimizing Compiler for Configurable Computing. http://www.jacquardcomputing.com/roccc/.

[4] SystemC: The Open SystemC Initiative. http://www.systemc.org.

[5] Altera. Qsys System Integration Tool. http://www.altera.com/products/software/quartus-ii/subscription-edition/qsys/qts-qsys.html.

[6] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini, and G. De Micheli. NoC Synthesis Flow for Customized Domain Specific Multiprocessor Systems-on-Chip. IEEE Transactions on Parallel and Distributed Systems, 2005.

[7] Bluespec, Inc. http://www.bluespec.com/products/bsc.htm.

[8] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H. Anderson, Stephen Brown, and Tomasz Czajkowski. LegUp: High-level Synthesis for FPGA-based Processor/Accelerator Systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '11), New York, NY, USA, 2011. ACM.

[9] Eric S. Chung, James C. Hoe, and Ken Mai. CoRAM: An In-Fabric Memory Abstraction for FPGA-based Computing. In FPGA, 2011.

[10] Eric S. Chung and Michael K. Papamichael. ShrinkWrap: Compiler-Enabled Optimization and Customization of Soft Memory Interconnects. In FCCM, 2013.

[11] W. J. Dally and B. Towles. Route Packets, Not Wires: On-chip Interconnection Networks. In Proceedings of the Design Automation Conference (DAC), 2001.

[12] A. DeHon, J. Adams, M. deLorimier, N. Kapre, Y. Matsuda, H. Naeimi, M. Vanier, and M. Wrighton. Design Patterns for Reconfigurable Computing. In 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2004), pages 13–23, 2004.

[13] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc. Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions. IEEE Journal of Solid-State Circuits, October 1974.

[14] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark Silicon and the End of Multicore Scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11), New York, NY, USA, 2011. ACM.

[15] Genesis 2. Creating Chip Generators. http://genesis2.stanford.edu.

[16] P. Guerrier and A. Greiner. A Generic Architecture for On-Chip Packet-Switched Interconnections. In Proceedings of Design, Automation and Test in Europe (DATE), 2000.

[17] IBM. The CoreConnect Bus Architecture. https://www-01.ibm.com/chips/techlib/techlib.nsf/products/CoreConnect_Bus_Architecture, 1999.

[18] K. Keutzer, A. R. Newton, J. M. Rabaey, and A. Sangiovanni-Vincentelli. System-level Design: Orthogonalization of Concerns and Platform-based Design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(12), 2000.

[19] Mentor Graphics. Catapult C. http://www.mentor.com/esl, 2009.

[20] Michael K. Papamichael. CONNECT Network Editor. http://users.ece.cmu.edu/~mpapamic/connect/network_editor/.

[21] Michael K. Papamichael. CONNECT NoC Generation Framework. http://users.ece.cmu.edu/~mpapamic/connect/.

[22] P. A. Milder, F. Franchetti, J. C. Hoe, and M. Puschel. Formal Datapath Representation and Manipulation for Implementing DSP Transforms. In Proceedings of the Design Automation Conference (DAC), pages 385–390, 2008.

[23] Gordon E. Moore. Cramming More Components onto Integrated Circuits. Electronics, 38(8), April 1965.

[24] A. Morvan, S. Derrien, and P. Quinton. Efficient Nested Loop Pipelining in High Level Synthesis Using Polyhedral Bubble Insertion. In International Conference on Field-Programmable Technology (FPT), 2011.

[25] Grace Nordin, Peter A. Milder, James C. Hoe, and Markus Puschel. Automatic Generation of Customized Discrete Fourier Transform IPs. In Design Automation Conference (DAC), pages 471–474, 2005.

[26] Michael K. Papamichael and James C. Hoe. CONNECT: Re-Examining Conventional Wisdom for Designing NoCs in the Context of FPGAs. In FPGA, 2012.

[27] A. Pinto, L. P. Carloni, and A. Sangiovanni-Vincentelli. COSI: A Framework for the Design of Interconnection Networks. IEEE Design & Test of Computers, 2008.

[28] Kyeong Keol Ryu and V. J. Mooney. Automated Bus Generation for Multiprocessor SoC Design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(11), 2004.

[29] O. Shacham, O. Azizi, M. Wachs, W. Qadeer, Z. Asgar, K. Kelley, J. P. Stevenson, S. Richardson, M. Horowitz, B. Lee, A. Solomatnikov, and A. Firoozshahian. Rethinking Digital Design: Why Design Must Change. IEEE Micro, 30(6), 2010.

[30] O. Shacham, S. Galal, S. Sankaranarayanan, M. Wachs, J. Brunhaver, A. Vassiliev, M. Horowitz, A. Danowitz, W. Qadeer, and S. Richardson. Avoiding Game Over: Bringing Design to the Next Level. In 49th ACM/EDAC/IEEE Design Automation Conference (DAC), 2012.

[31] Ofer Shacham. Chip Multiprocessor Generator: Automatic Generation of Custom and Heterogeneous Compute Platforms. PhD thesis, 2011.

[32] Stanford Concurrent VLSI Architecture Group. Open Source Network-on-Chip Router RTL. https://nocs.stanford.edu/cgi-bin/trac.cgi/wiki/Resources/Router.

[33] The Chisel Hardware Construction Language. http://chisel.eecs.berkeley.edu.

[34] G. Weisz and J. C. Hoe. C-To-CoRAM: Compiling Perfect Loop Nests to the Portable CoRAM Abstraction. In FPGA, 2013.