
Low Latency Network-on-Chip Router Using Static Straight Allocator

Including Other Progresses in Many-Core Research

Muhammad Nadzir Marsono

Dept. of Electronic and Computer Engineering
Faculty of Electrical Engineering

Universiti Teknologi Malaysia



Table of Content

1 Introduction

2 Low Latency Network-on-Chip Router Using Static Straight Allocator
Alireza Monemi, Chia Yee Ooi, Maurizio Palesi and Muhammad Nadzir Marsono

3 Other Progresses in Many-Core Research
Prototyping NoC based MCSoC
Mapping Applications on MCSoC
Hardware Transactional Memory
Dynamic Reconfiguration of Reconfigurable Hardware

4 Conclusion



Single-Core System-on-Chip

There is always demand for faster processing, even for embedded applications

Technology allows

Packing more transistors on silicon

More advanced functionality
Adding more processors on a chip

Increase in operating clock rate

All with diminishing return

Transistor size is approaching its minimum size
Faster clock rate leads to more power consumption
Multi-processor offers less than 3x speed-up



Multi-Core System-on-Chip

Allows parallel computation

Shared bus Communication

Not flexible – redesign when new core is added
Not extensible – support only limited number of cores



Many-Core System-on-Chip

Future many-core SoCs would have hundreds of heterogeneous processing cores

Network-on-chip communication instead of bus

Cores communicate through routers
Information is passed through the exchange of packets
Decouples communication from computation
Flexible, reusable and extensible

The overall performance of many-core SoCs is directly affected byNoC performance



Many-Core System-on-Chip Design Issues

Many-core SoCs come with their own set of challenges

Communication bandwidth

Bus-based architecture for communication becomes a bottleneck as the number of cores increases
Solution: Network-on-chip

Memory synchronization

Efficient software with lock synchronization requires long development time and is prone to programmer error
Solution: Transactional memory

Application and Task Mapping

Inefficient task mapping results in sub-par performance.
Solution: Optimal mapping (static and run-time)

Fixed architecture

Lacks the flexibility for run-time functional update
Solution: Dynamically reconfigurable platform



Network-on-chip Architectural Issues

Topology: Defines how routers are connected to each other

mesh, torus, ring, fat tree, star, ...

Wormhole flow control: the flits of a packet are buffered in several routers along the path

Reduces the required buffer size and power consumption

Several virtual channels can be implemented on a single physical channel

Increase throughput
Prevent deadlock
Provide quality-of-service (QoS) at application level

Routing algorithm: Determines how packets are routed to the destination

Deterministic
Partially adaptive
Fully adaptive

Router microarchitecture

Arbiter design

[Figure: Router microarchitecture]


Selected Prior Works on Network-on-Chip

A. Monemi, C.Y. Ooi, M.N. Marsono, “Low Latency Network-on-Chip Router Microarchitecture Using Request Masking Technique,” International Journal of Reconfigurable Computing, Volume 2015, Article ID 570836, 13 pages, 2015. doi:10.1155/2015/570836.

A. Monemi, C.Y. Ooi, M.N. Marsono, M. Palesi, “Improved Flow Control for Minimal Fully Adaptive Routing in 2D Mesh NoC,” in Ninth International Workshop on Network-on-Chip Architecture (NoCArch 2016), Taipei, October 2016.

A. Monemi, C.Y. Ooi, M.N. Marsono, “Virtual Channel and Switch Allocation for Low Latency Network-on-Chip Routers”, in Proceedings of the 2015 IEEE 23rd Annual Symposium on Field-Programmable Custom Computing Machines (FCCM 2015), Vancouver, Canada, May 2015. pp. 234.

N. Qaid, A. Monemi, M.N. Marsono, “Partially Adaptive Look-Ahead Routing for Low Latency Network-on-Chip”, in Proceedings of the 2014 Student Conference on Research and Development (SCOReD 2014), Penang, Malaysia, November 2014, pp. ?



Network-on-chip Router Latency

NoC may suffer from high communication latency

Multiple hops between a source and a destination
Router processing latency
Congestion (application specific; mapping and routing may alleviate it up to a certain extent)

Canonical NoC pipeline stages

1 Route computation (RC)
2 VC allocation (VA)
3 Switch allocation (SA)
4 Switch traversal (ST)

Canonical router uses two separate VA and SA modules – SA only after successful VA

This results in 4-cycle latency per router



Improving NoC Router Latency

Look-ahead routing (LR):

Routing is done one router in advance
Can be done in parallel with other router’s pipeline stages
Results in a 3-cycle latency

Combined VA and SA (VSA):

VSA shares hardware logic between SA and VA – results in smaller area
VSA performs SA and VA in parallel, VA only for packets that successfully get the SA1 grant

LR and VSA together result in a 2-cycle latency router
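
The effect of the per-router pipeline depth on end-to-end latency can be illustrated with a simple zero-load model. This is a minimal sketch, assuming latency is just the number of routers traversed times the cycles per router (link traversal, serialization and contention are ignored); the function name and the 6-router path are hypothetical and not taken from the slides.

```python
# Per-router pipeline depth (in cycles) for the configurations named on the slides.
ROUTER_CYCLES = {
    "canonical (RC, VA, SA, ST)": 4,
    "look-ahead routing (LR)": 3,
    "LR + combined VSA": 2,
    "single-cycle (ideal)": 1,
}


def zero_load_latency(routers_traversed: int, cycles_per_router: int) -> int:
    """Cycles spent inside routers along a path (no link delay, no contention)."""
    if routers_traversed < 1 or cycles_per_router < 1:
        raise ValueError("both arguments must be positive")
    return routers_traversed * cycles_per_router


if __name__ == "__main__":
    routers_on_path = 6  # hypothetical path length
    for name, cycles in ROUTER_CYCLES.items():
        print(f"{name}: {zero_load_latency(routers_on_path, cycles)} cycles")
```

Under this simplified model the router contribution scales linearly with pipeline depth, which is why removing pipeline stages matters most on long paths.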



Single-Cycle NoC Router

Single-cycle is the ideal latency for an NoC router

It requires bypassing all of the router’s pipeline stages

Related works presented complex solutions

Author | Year | Proposed Solution | Drawback
Mullins et al. [9] | 2004 | Optimistic ST speculation | Increasing network load results in high mis-speculation rate.
Park et al. [11] | 2007 | Dynamic priority-based NoC | Complex architecture due to priority-based arbiter and path frequency analyzer
Kumar et al. [5] | 2007 | Look-ahead crossbar control signals | Complex architecture due to the need for a three-level priority-based switch allocator
Michelogiannakis et al. [7] | 2007 | Reconfigurable preferred-path | VC is not supported. Efficient only for ASIC due to tri-state based crossbar
Krishna et al. [3] | 2008 | Express VC (EVC) | Requires starvation avoidance mechanism. EVC limits buffer sharing efficiency.
Kim et al. [2] | 2009 | Single-clock dimension-based NoC | May result in starvation/unfairness. Intermediate buffers act as bottleneck at high load
Krishna et al. [4] | 2010 | Look-ahead bypassing | Communication link overhead. High communication latency/starvation for stored flits
Matsutani et al. [6] | 2011 | Prediction router | Complex allocator as request signals must be prioritized towards the reservation signals
He et al. [1] | 2013 | Predict-more router | Complex router due to need for multiple kill signals and multi-cast crossbar switch support



Proposed Static Straight Allocator (SSA)

We propose single-cycle latency forwarding for packets traveling in the same direction

SSA is adapted from Prediction Router (PR) [6], which predicts packets moving in the same direction

PR sends reservation signals in advance, where request signals have higher priority over local reservation signals
The PR’s drawback is the need for a complex priority-based VSA and a mis-prediction correction mechanism
Compared to PR’s SS algorithm (SS-baseline), the proposed SSA improves prediction on edge routers

[Figure: Prediction hit rate (%) versus mesh size for SS-baseline and SS-modified]
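
To give a feel for why a "keep going straight" prediction becomes more useful as the mesh grows, the sketch below counts how often a packet leaves a router in the same direction it arrived. It is an illustration under stated assumptions, not the authors' evaluation: it assumes XY (dimension-order) routing, uniform all-to-all traffic, and counts the source and destination routers as misses. It does not model the SS-modified improvement for edge routers and is not expected to reproduce the plotted values; all function names are hypothetical.

```python
from itertools import product


def xy_path_directions(src, dst):
    """Hop-by-hop directions under XY routing: all X hops first, then all Y hops."""
    (sx, sy), (dx, dy) = src, dst
    x_dir = "E" if dx > sx else "W"
    y_dir = "N" if dy > sy else "S"
    return [x_dir] * abs(dx - sx) + [y_dir] * abs(dy - sy)


def straight_hit_rate(k: int) -> float:
    """Fraction of router traversals in a k x k mesh where output == input direction.

    Assumption: the source router (input from the local port) and the
    destination router (output to the local port) never count as hits.
    """
    if k < 2:
        raise ValueError("mesh must be at least 2 x 2")
    hits = total = 0
    nodes = list(product(range(k), repeat=2))
    for src in nodes:
        for dst in nodes:
            if src == dst:
                continue
            dirs = xy_path_directions(src, dst)
            total += len(dirs) + 1  # every router on the path, both ends included
            # An intermediate router is a hit when consecutive hops share a direction.
            hits += sum(a == b for a, b in zip(dirs, dirs[1:]))
    return hits / total


if __name__ == "__main__":
    for k in (2, 4, 8, 16):
        print(f"{k}x{k} mesh: {100 * straight_hit_rate(k):.1f}% straight traversals")
```

In this simplified model the hit rate rises with mesh size because longer paths contain more consecutive hops in one dimension.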



Static Straight Allocator (SSA) RTL Architecture

SSA allocates VC/SW for packets at flit capturing time

The proposed SSA can work in parallel with the router’s VSA

The router’s allocation result is the sum of both SSA and VSA results

SSA has a simple architecture

In order to achieve single-cycle latency, the internal buffer must be bypassed for SS packets

[Figure: Proposed SSA RTL]

[Figure: Buffer bypass module RTL]


SSA Implementation Result on FPGA

SSA-router is evaluated against three other NoC configurations

1 2-cycle NoC [8]: The early version of ProNoC with 2-cycle latency
2 1-cycle NoC: The modified version of 2-cycle NoC [8], which performs ST serially after VSA to achieve single-cycle latency
3 CONNECT NoC [10]: An FPGA-optimized single-cycle NoC router, similar to ProNoC

NoCs’ pipeline/timing comparison



SSA Implementation Result on FPGA (cont’d)

Synthesis results summary of different NoCs (VCs=4, flit size=32-bit, buffer size=4 flits) on Stratix IV EP4SGX230KF40C2 Altera FPGA

Architecture | Max freq. | Total BRAM num. (M9k) | Total LCs of 4x4 mesh | Avg. LCs of a 5-port router
2-cycle [8] | 165 MHz | 64 (5.2%) | 41406 (23%) | 3234 (1.8%)
2-cycle-SSA | 152 MHz | 64 (5.2%) | 46296 (25%) | 3616 (2.0%)
1-cycle | 129 MHz | - | 90915 (64%) | 7102 (5.0%)
CONNECT [10] | 94 MHz | - | 67666 (43%) | 5286 (3.6%)

Both single-cycle NoCs consume large LCU as they cannot use BRAMs

CONNECT serially implements all router processing stages – low operating frequency

SSA has the least trade-off relative to the 2-cycle NoC – 8% increase in CPD and 12% area overhead
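
The overhead figures quoted for SSA can be checked against the synthesis table with a few lines of arithmetic. This is a small sketch, assuming CPD is the reciprocal of the maximum frequency; the variable and function names are hypothetical.

```python
# Values copied from the synthesis table (2-cycle [8] versus 2-cycle-SSA).
F_2CYCLE_MHZ, F_SSA_MHZ = 165.0, 152.0
LC_2CYCLE, LC_SSA = 41406, 46296  # total LCs of the 4x4 mesh


def percent_change(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old


# Frequency drop and the corresponding clock-period (CPD) increase, with CPD = 1/f.
freq_drop = -percent_change(F_SSA_MHZ, F_2CYCLE_MHZ)
cpd_increase = percent_change(1e3 / F_SSA_MHZ, 1e3 / F_2CYCLE_MHZ)  # periods in ns
area_overhead = percent_change(LC_SSA, LC_2CYCLE)

if __name__ == "__main__":
    print(f"frequency drop: {freq_drop:.1f}%")
    print(f"CPD increase:   {cpd_increase:.1f}%")
    print(f"area overhead:  {area_overhead:.1f}%")
```

The frequency drop and the period increase differ slightly because they are taken relative to different baselines; both round to roughly the 8% quoted, and the area overhead rounds to 12%.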



SSA Latency Clock Domain

[Figure: Average latency per packet (clk) versus load per router (in flit/clk %) for 2-cycle, 2-cycle-ssa, 1-cycle and CONNECT under six traffic patterns: Random, Hot-spot, Tornado, Matrix-Transposed, Bit-complement and Bit-reverse]

1-cycle (50%) and SSA (24%) routers achieve lower average latency compared with the 2-cycle router

SSA-router works as a 1.5-cycle NoC on average


SSA Performance (Seconds)

[Figure: Average latency per packet (ns) versus load per router (in million flits/sec) for 2-cycle, 2-cycle-ssa, 1-cycle and CONNECT under six traffic patterns: Random, Hot-spot, Tornado, Matrix-Transposed, Bit-complement and Bit-reverse]

SSA has 37% less CPD than 1-cycle router

SSA has 8% lower frequency than the two-cycle router


SSA Overall Performance Comparison

Overall performance metrics normalized to the 2-cycle latency router

Metric | 2-cycle [8] | 2-cycle-SSA | 1-cycle | CONNECT [10]
Overall average comm. latency (ns) | 100 | 81 | 71 | 97
Average saturation load (ns) | 100 | 93 | 77 | 50
LCU | 100 | 118 | 220 | 163

SSA offers the best trade-off for reducing the communication latency

19% reduction in average communication latency (in ns)
7% lower average saturation load (in ns), 18% higher LCU

Proposed 1-cycle NoC

30% reduction in average communication latency (in ns)
23% lower average saturation load (in ns), 120% higher LCU

CONNECT

Similar average communication latency (in ns)
50% less average saturation throughput (in ns), 63% higher LCU



References I

[1] Yuan He, Hiromu Sasaki, Shinsuke Miwa, and Hajime Nakamura. Predict-more router: A low latency NoC router with more route predictions. In IEEE 27th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), pages 842–850. IEEE, 2013.

[2] John Kim. Low-cost router microarchitecture for on-chip networks. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 255–266. ACM, 2009.

[3] Tushar Krishna, Amit Kumar, Patrick Chiang, Mattan Erez, and Li-Shiuan Peh. NoC with near-ideal express virtual channels using global-line communication. In 16th IEEE Symposium on High Performance Interconnects, HOTI’08, pages 11–20. IEEE, 2008.

[4] Tushar Krishna, Jacob Postman, Christopher Edmonds, Li-Shiuan Peh, and Patrick Chiang. Swift: A swing-reduced interconnect for a token-based network-on-chip in 90nm CMOS. In Computer Design (ICCD), 2010 IEEE International Conference on, pages 439–446. IEEE, 2010.


[5] Amit Kumar, Partha Kundu, Arvind P. Singh, Li-Shiuan Peh, and Niraj K. Jha. A 4.6 Tbits/s 3.6 GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS. In 25th International Conference on Computer Design, ICCD, pages 63–70. IEEE, 2007.

[6] Hiroki Matsutani, Michihiro Koibuchi, Hideharu Amano, and Tsutomu Yoshinaga. Prediction router: A low-latency on-chip router architecture with multiple predictors. IEEE Transactions on Computers, 60(6):783–799, 2011.

[7] George Michelogiannakis, Dionisios Pnevmatikatos, and Manolis Katevenis. Approaching ideal NoC latency with pre-configured routes. In Proceedings of the First International Symposium on Networks-on-Chip, pages 153–162. IEEE Computer Society, 2007.

[8] Alireza Monemi, Chia Yee Ooi, and Muhammad Nadzir Marsono. Low latency network-on-chip router microarchitecture using request masking technique. International Journal of Reconfigurable Computing, 2015:2, 2015.


[9] Robert Mullins, Andrew West, and Simon Moore. Low-latency virtual-channel routers for on-chip networks. In ACM SIGARCH Computer Architecture News, volume 32, page 188, 2004.

[10] Michael K. Papamichael and James C. Hoe. CONNECT: Re-examining conventional wisdom for designing NoCs in the context of FPGAs. In Proceedings of the ACM/SIGDA international symposium on FPGA, pages 37–46, 2012.

[11] Dongkook Park, Reetuparna Das, Chrysostomos Nicopoulos, Jongman Kim, Narayanan Vijaykrishnan, Ravishankar Iyer, and Chita R Das. Design of a dynamic priority-based fast path architecture for on-chip interconnects. In 15th Annual IEEE Symposium on High-Performance Interconnects, pages 15–20. IEEE, 2007.



Prototyping NoC based System-on-Chip

Designing a complete many-core SoC is time consuming and complex in nature

We have developed ProNoC as a design automation tool to facilitate the development of large-scale NoC-based many-core SoCs in RTL targeting FPGA

Multiple NoC and processing core configurations



ProNoC IP Generator

IP Generator: Develops an intellectual property (IP) library from Verilog design files

Maps interfaces to modules’ ports
Detects Verilog parameters and allows the selection of an appropriate GUI element for redefining each design parameter



ProNoC Processing Tile (PT) Generator

Processing tile (PT) generator: Generates a processing tile using an arbitrary number of IPs connected using Wishbone bus

Automates generation of interconnect logic
Automates Wishbone addressing
Adjusts different IP parameters using GUI



ProNoC MCSoC Generator

MCSoC generator: Generates a heterogeneous NoC-based many-core SoC using generated PTs

Provides a GUI for setting the NoC and PT parameters



ProNoC NoC Emulation

NoC emulator: Generates an NoC emulator

Generates an NoC connected to programmable packet injector modules
Programs/reads packet injectors at run time using JTAG interface
Currently tested on Altera FPGAs



Mapping Applications on MCSoC

[Figure: Application characteristic graph (tasks T0 to T5) mapped to a 2x2-core MPSoC platform. CORE 1: T0, T1; CORE 2: T2; CORE 3: T3; CORE 4: T4, T5]

Application is specified as a Directed Acyclic Graph (DAG), containing execution and communication delays

Each task, Ti, is assigned to one core

Different mappings give different performance, e.g., throughput, execution time, energy consumption, ...

[Figure: Execution timeline on CORE 1 to CORE 4 over three execution periods (data 1, data 2, data 3), showing the execution delays of tasks T0 to T5 and the communication delays C01, C02, C13, C23, C24 and C35]
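
One way to see why different mappings perform differently is to score a mapping by how far its messages travel. The sketch below is illustrative only: it assumes the DAG edges are the ones named by the communication labels in the figure (C01, C02, C13, C23, C24, C35), unit communication volume per edge, a hypothetical placement of the four cores on a 2x2 mesh, and hop count (Manhattan distance) as the cost. The slides instead measure performance by FPGA emulation.

```python
# Edges recovered from the communication labels Cxy (task x -> task y).
EDGES = [(0, 1), (0, 2), (1, 3), (2, 3), (2, 4), (3, 5)]

# Hypothetical (x, y) coordinates of CORE 1..4 on a 2x2 mesh.
CORE_XY = {1: (0, 0), 2: (1, 0), 3: (0, 1), 4: (1, 1)}

# Mapping shown on the slide: task index -> core number.
SLIDE_MAPPING = {0: 1, 1: 1, 2: 2, 3: 3, 4: 4, 5: 4}


def hops(core_a: int, core_b: int) -> int:
    """Manhattan distance between two cores on the mesh."""
    (ax, ay), (bx, by) = CORE_XY[core_a], CORE_XY[core_b]
    return abs(ax - bx) + abs(ay - by)


def comm_cost(mapping: dict) -> int:
    """Total hop count over all DAG edges (unit volume per edge assumed)."""
    return sum(hops(mapping[src], mapping[dst]) for src, dst in EDGES)


if __name__ == "__main__":
    print("slide mapping cost:", comm_cost(SLIDE_MAPPING))
    # Moving T3 next to T4 and T5 changes which edges cross the mesh.
    alternative = dict(SLIDE_MAPPING, **{3: 4, 4: 3})
    print("alternative mapping cost:", comm_cost(alternative))
```

A hop-count cost is only a proxy; it ignores execution delays, congestion and the pipelined overlap of successive data items shown in the timeline.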



Application Mapping Exploration Framework

[Figure: Mapping exploration framework with three stages: application mapping in MATLAB/GNU Octave (Mapping), platform development using ProNoC, and performance measurement in FPGA emulation through emulation terminals (Emulating), producing an optimized mapping]

Mapping Algorithm:
● Branch-and-Bound
● Simulated Annealing
● Genetic Algorithm (GA)

Platform Description:
● No. of cores
● Routing algorithm
● No. of virtual channels

Performance:
● Execution delay
● Comm. delay
● Throughput

Mapping Exploration Flows:

1 Mapping algorithm analyzes the application graph and generates initial map

2 All tasks are assigned to cores in the MCSoC platform and emulated in FPGA to obtain the on-chip performance

3 The process iterates until optimized mapping is obtained
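
The three-step flow above (generate a mapping, evaluate it, iterate) can be sketched in software with simulated annealing, one of the algorithms listed. This is a minimal sketch under stated assumptions, not the framework itself: the FPGA emulation step is replaced by a simple hop-count cost, the task graph is the six-edge DAG from the previous slide, each core holds at most two tasks, and every name and parameter value is hypothetical.

```python
import math
import random

EDGES = [(0, 1), (0, 2), (1, 3), (2, 3), (2, 4), (3, 5)]  # task DAG edges
CORES = [(0, 0), (1, 0), (0, 1), (1, 1)]                   # 2x2 mesh coordinates
MAX_TASKS_PER_CORE = 2                                     # assumed capacity
INITIAL = [0, 1, 2, 3, 0, 1]                               # task i -> core index


def cost(mapping):
    """Stand-in for emulation: total Manhattan hop count over all edges."""
    return sum(abs(CORES[mapping[a]][0] - CORES[mapping[b]][0])
               + abs(CORES[mapping[a]][1] - CORES[mapping[b]][1])
               for a, b in EDGES)


def feasible(mapping):
    """Check the per-core capacity constraint."""
    return all(mapping.count(c) <= MAX_TASKS_PER_CORE for c in range(len(CORES)))


def anneal(iters=2000, t0=2.0, seed=0):
    """Iterate: perturb the mapping, evaluate it, accept or reject."""
    rng = random.Random(seed)
    current, cur_cost = INITIAL[:], cost(INITIAL)
    best, best_cost = current[:], cur_cost
    for i in range(iters):
        temp = t0 * (1 - i / iters) + 1e-9  # linear cooling schedule
        cand = current[:]
        cand[rng.randrange(len(cand))] = rng.randrange(len(CORES))
        if not feasible(cand):
            continue
        c = cost(cand)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if c <= cur_cost or rng.random() < math.exp((cur_cost - c) / temp):
            current, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand[:], c
    return best, best_cost


if __name__ == "__main__":
    mapping, value = anneal()
    print("initial cost:", cost(INITIAL), "best cost:", value, "mapping:", mapping)
```

Swapping the `cost` function for a measured metric (execution delay, communication delay or throughput) would bring the loop closer to the emulation-in-the-loop flow described on the slide.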



Multi-Applications Optimization using Genetic Algorithm

[Figure: Application throughput versus generations for app 1 and app 2. Mapping 2 applications with priority app1 > app2]

[Figure: Application throughput versus generations for app 1, app 2 and app 3. Mapping 3 applications with priority app1 > app2 > app3]

Examples of the mapping of multiple applications into MCSoC using FPGA emulation.

Genetic Algorithm (GA) is utilized to optimize throughput according to priority.

GA searches for better throughput for app1 in each generation



Selected Prior Works on Tasks and Application Mapping

J.W. Tang, Y.W. Hau, M.N. Marsono, “Hardware/software partitioning of embedded System-on-Chip applications”, in Proceedings of the 2015 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), Daejeon, South Korea, October 2015. pp. 331-336

Y.-Z. Tei, Y.-W. Hau, N. Shaikh-Husin, M.N. Marsono, “Network Partitioning Domain Knowledge Multiobjective Application Mapping for Large-Scale Network-on-Chip,” Applied Computational Intelligence and Soft Computing, Volume 2014, Article ID 867612, 10 pages, 2014. doi:10.1155/2014/867612.

Y.Z. Tei, Y.W. Hau, N. Shaikh-Husin, T. Andromeda, M.N. Marsono, “Network-on-Chip Application Mapping Based on Domain Knowledge Genetic Algorithm”, in Proceedings of the 2014 IAES International Conference on Electrical Engineering, Computer Science and Informatics (EECSI 2014), Yogyakarta, Indonesia, August 2014, pp. 8690.



Transactional Memory

Parallel programming on shared memory requires locks for synchronization

Coarse grain lock
Fine grain lock

TM provides non-blocking wait-free synchronization

Each transaction is atomic, isolated, and consistent
Simplifies multi-threaded programming in MCSoC
Combines coarse grain programming simplicity with fine grain performance

Software transactional memory (STM)

Flexibility
Poor performance

Hardware transactional memory (HTM)

Restricted to hardware resource allocation
Better performance



Hardware Transactional Memory Overview

Keeps track of all the changes made by transactions

Processors issue commits to make changes permanent

Version Management

Determines locations of modified transactions

Conflict Management

Defines how conflicts are detected and managed

[Figure: Processors P1, P2, ..., P(n) connected to a TM_buffer and Main Memory]



HTM Configurable Version Management

[Figure: Processors P1, P2, ..., P(n) with TM_buffer (OLD) and Main_Memory (NEW)]

Eager
Fast Commit
Slow Abort
Suitable for Low Contention

[Figure: Processors P1, P2, ..., P(n) with TM_buffer (NEW) and Main_Memory (OLD)]

Lazy
Slow Commit
Fast Abort
Suitable for High Contention
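
The commit/abort asymmetry between the two schemes follows from where the old and new values live. The sketch below models a single transaction against a dictionary standing in for main memory; it is a software illustration under that assumption, with hypothetical class and method names, and it does not model conflict detection or the HTM hardware itself.

```python
class EagerTM:
    """Eager versioning: new values go to main memory, old values to the TM buffer."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.tm_buffer = {}  # address -> OLD value (undo log)

    def write(self, addr, value):
        if addr not in self.tm_buffer:       # log the original value only once
            self.tm_buffer[addr] = self.memory[addr]
        self.memory[addr] = value            # update in place

    def read(self, addr):
        return self.memory[addr]

    def commit(self):
        self.tm_buffer.clear()               # fast: new values are already in memory

    def abort(self):
        for addr, old in self.tm_buffer.items():
            self.memory[addr] = old          # slow: every write must be undone
        self.tm_buffer.clear()


class LazyTM:
    """Lazy versioning: new values stay in the TM buffer, main memory keeps old values."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.tm_buffer = {}  # address -> NEW value (write buffer)

    def write(self, addr, value):
        self.tm_buffer[addr] = value

    def read(self, addr):
        return self.tm_buffer[addr] if addr in self.tm_buffer else self.memory[addr]

    def commit(self):
        self.memory.update(self.tm_buffer)   # slow: every write is copied to memory
        self.tm_buffer.clear()

    def abort(self):
        self.tm_buffer.clear()               # fast: memory was never touched


if __name__ == "__main__":
    mem = {0: 1, 1: 2}
    tx = EagerTM(mem)
    tx.write(0, 10)
    tx.abort()
    print("after eager abort:", mem)

    tx = LazyTM(mem)
    tx.write(1, 20)
    tx.commit()
    print("after lazy commit:", mem)
```

Since aborts are more frequent under high contention, the cheap abort of the lazy scheme suits that case, while the cheap commit of the eager scheme suits low contention, matching the slide's summary.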



Performance of HTM Configurable Version Management

Transaction size | Version Management: Lazy | Version Management: Eager
4 | 5.37% | 34.13%
8 | 9.39% | 37.23%
16 | 12.82% | 37.84%

[Figure: Clock cycles (x 10^5) versus probability of conflict. Curves: EAGER experimental, LAZY experimental, LAZY max, LAZY mix, EAGER max, EAGER min]



Selected Prior Works on Hardware Transactional Memory

J. Sirkunan, C.Y. Ooi, N. Shaikh-Husin, Y.W. Hau, M.N. Marsono, “Adaptive Configurable Transactional Memory for Multi-Processor FPGA Platforms”, in Proceedings of the 2015 IEEE 23rd Annual Symposium on Field-Programmable Custom Computing Machines (FCCM 2015), Vancouver, Canada, May 2015. pp. 234.

J. Sirkunan, C.Y. Ooi, N. Shaikh-Husin, Y.W. Hau, M.N. Marsono, “Hardware Transactional Memory on Multi-processor FPGA Platform”, in Proceedings of the 2014 IEEE International Symposium on Circuits and Systems (ISCAS 2014), Melbourne, Australia, June 2014, pp. 2744–2747.



Benefits of Dynamic Partial Reconfiguration

Reconfigurable hardware offers processing power similar to an ASIC solution

Gate level parallelism

Pipelined architecture

Distributed memory (BlockRAM)

Offer flexibility greater than software solution

Configuration can be changed at run-time

Dynamic partial reconfiguration feature



Configuration capability


Our Dynamically Reconfigurable Middlebox Platform

Supports remote dynamic reconfiguration for functionality updates through UDP/IP over Ethernet

Provides flexibility to customize and optimize the packet forwarding algorithm

Functionally extensible by modification of Partial Reconfigurable Modules

Achieves 350 Mbps of remote reconfiguration throughput, which is important for mass remote update and lower device down time during the remote updates



Functional Block Diagram of Middlebox Platform



Our Case Study on Network Protection

Remote dynamic reconfiguration

Extendable for network protection features at run-time for customization and optimization

Allows NIPS to be optimized with various types of string matching algorithm, which can be either CAM-based, hash-based or finite automata based

Enables functional patch on the developed platform, where design flaws can be fixed after deployment



Selected Prior Works on Dynamic Reconfiguration of Reconfigurable Hardware

T.H. Tan, C.Y. Ooi, M.N. Marsono, “rrBox: A Remote Dynamically Reconfigurable Network Processing Middlebox”, in Proceedings of the 25th International Conference on Field Programmable Logic and Applications (FPL2015), London, UK, September 2015. pp. 1-4

T.-H. Tan, C.-Y. Ooi, Y.-W. Hau, N. Shaikh-Husin, M.N. Marsono, “Remote Dynamically Reconfigurable Platform using NetFPGA”, in Proceedings of the 2014 IEEE International Symposium on Circuits and Systems (ISCAS 2014), Melbourne, Australia, June 2014, pp. 1239–1242.



Future Outlook on MCSoC Research

Many-core research is multi-faceted and requires investigation of different architectural aspects

Many-core SoCs require efficient, scalable and reliable inter-core communication infrastructure and automation tools

We have barely scratched the surface – there is a lot more to explore

Tasks and applications mapping/scheduling
Runtime optimization – grey/dark silicon
Cache and memory architecture
Workload characterization and evaluation of real application traces
Fault-tolerance and reliability
