Introduction Static Straight Allocator Other Progresses in Many-Core Research Conclusion
Low Latency Network-on-Chip Router Using Static Straight Allocator
Including Other Progresses in Many-Core Research
Muhammad Nadzir Marsono
Dept. of Electronic and Computer Engineering, Faculty of Electrical Engineering
Universiti Teknologi Malaysia
Table of Contents
1 Introduction
2 Low Latency Network-on-Chip Router Using Static Straight Allocator
Alireza Monemi, Chia Yee Ooi, Maurizio Palesi and Muhammad Nadzir Marsono
3 Other Progresses in Many-Core Research
Prototyping NoC based MCSoC
Mapping Applications on MCSoC
Hardware Transactional Memory
Dynamic Reconfiguration of Reconfigurable Hardware
4 Conclusion
Single-Core System-on-Chip
There is always demand for faster processing, even for embedded applications
Technology allows
Packing more transistors on silicon
More advanced functionality
Adding more processors on a chip
Increase in operating clock rate
All with diminishing return
Transistor size is approaching its minimum size
Faster clock rate leads to more power consumption
Multi-processor offers less than 3x speed-up
Multi-Core System-on-Chip
Allows parallel computation
Shared bus Communication
Not flexible – redesign is needed when a new core is added
Not extensible – supports only a limited number of cores
Many-Core System-on-Chip
Future many-core SoCs would have hundreds of heterogeneous processing cores
Network-on-chip communication instead of bus
Cores communicate through routers
Information is passed through the exchange of packets
Decouples communication from computation
Flexible, reusable and extensible
The overall performance of many-core SoCs is directly affected by NoC performance
Many-Core System-on-Chip Design Issues
Many-core SoCs come with their own set of challenges
Communication bandwidth
Bus-based communication architecture becomes a bottleneck as the number of cores increases
Solution: Network-on-chip
Memory synchronization
Efficient software with lock synchronization requires long development time and is prone to programmer error
Solution: Transactional memory
Application and Task Mapping
Inefficient task mapping results in sub-par performance
Solution: Optimal mapping (static and run-time)
Fixed architecture
Lacks the flexibility for run-time functional updates
Solution: Dynamically reconfigurable platform
Network-on-chip Architectural Issues
Topology: Defines how routers are connected to each other
mesh, torus, ring, fat tree, star, ...
Wormhole flow control: the flits of a packet are buffered in several routers along the path
Reduces the required buffer size and power consumption
Several virtual channels can be implemented on a single physical channel
Increase throughput
Prevent deadlock
Provide quality-of-service (QoS) at application level
Routing algorithm: Determines how packets are routed to the destination
Deterministic
Partially adaptive
Fully adaptive
Router microarchitecture
Arbiter design
[Figure: Router microarchitecture]
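As an illustration of the deterministic class of routing algorithms, dimension-order (XY) routing on a 2D mesh can be sketched as follows. This is a minimal sketch and is not taken from the slides: the (x, y) coordinate convention, the port names and the helper functions `xy_route` and `xy_path` are assumptions made for the example.

```python
# Hypothetical sketch of deterministic dimension-order (XY) routing on a 2D mesh.
# Port names and the (x, y) coordinate convention are assumptions for illustration.

def xy_route(current, dest):
    """Return the output port a packet takes at router `current` towards `dest`."""
    cx, cy = current
    dx, dy = dest
    if cx != dx:                      # finish the X dimension first
        return "EAST" if dx > cx else "WEST"
    if cy != dy:                      # then move along Y
        return "NORTH" if dy > cy else "SOUTH"
    return "LOCAL"                    # arrived: eject to the local core

STEP = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}

def xy_path(src, dest):
    """List every router visited from `src` to `dest` (both inclusive)."""
    path = [src]
    node = src
    while node != dest:
        sx, sy = STEP[xy_route(node, dest)]
        node = (node[0] + sx, node[1] + sy)
        path.append(node)
    return path

print(xy_path((0, 0), (2, 1)))
```

Because the route depends only on the current and destination coordinates, every packet between the same pair of routers follows the same path, which is what makes the algorithm deterministic.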
Selected Prior Works on Network-on-Chip
A. Monemi, C.Y. Ooi, M.N. Marsono, "Low Latency Network-on-Chip Router Microarchitecture Using Request Masking Technique," International Journal of Reconfigurable Computing, Volume 2015, Article ID 570836, 13 pages, 2015. doi:10.1155/2015/570836.
A. Monemi, C.Y. Ooi, M.N. Marsono, M. Palesi, "Improved Flow Control for Minimal Fully Adaptive Routing in 2D Mesh NoC," in Ninth International Workshop on Network-on-Chip Architecture (NoCArch 2016), Taipei, October 2016.
A. Monemi, C.Y. Ooi, M.N. Marsono, "Virtual Channel and Switch Allocation for Low Latency Network-on-Chip Routers," in Proceedings of the 2015 IEEE 23rd Annual Symposium on Field-Programmable Custom Computing Machines (FCCM 2015), Vancouver, Canada, May 2015, pp. 234.
N. Qaid, A. Monemi, M.N. Marsono, "Partially Adaptive Look-Ahead Routing for Low Latency Network-on-Chip," in Proceedings of the 2014 Student Conference on Research and Development (SCOReD 2014), Penang, Malaysia, November 2014, pp. ?
2 Low Latency Network-on-Chip Router Using Static Straight Allocator
Alireza Monemi, Chia Yee Ooi, Maurizio Palesi and Muhammad Nadzir Marsono
Network-on-chip Router Latency
NoCs may suffer from high communication latency
Multiple hops between a source and a destination
Router processing latency
Congestion (application specific; mapping and routing may alleviate it to a certain extent)
Canonical NoC pipeline stages
1 Route computation (RC)
2 VC allocation (VA)
3 Switch allocation (SA)
4 Switch traversal (ST)
Canonical router uses two separate VA and SA modules – SA only after successful VA
This results in a 4-cycle latency per router
Improving NoC Router Latency
Look-ahead routing (LR):
Routing is done one router in advance
Can be done in parallel with the router's other pipeline stages
Results in a 3-cycle latency
Combined VA and SA (VSA):
VSA shares hardware logic between SA and VA – results in smaller area
VSA performs SA and VA in parallel, with VA only for packets that successfully get the SA1 grant
LR and VSA together result in a 2-cycle latency router
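The effect of the per-router pipeline depth on end-to-end latency can be sketched with a simple zero-load model. The per-router cycle counts below are the ones quoted on the slides; the function name, the `link_cycles` parameter and the 6-hop example are assumptions for illustration, and congestion is ignored.

```python
# Per-router pipeline depths quoted on the slides (in clock cycles).
ROUTER_CYCLES = {
    "canonical (RC, VA, SA, ST)": 4,
    "look-ahead routing (LR)": 3,
    "LR + combined VSA": 2,
}

def head_latency_cycles(router_cycles, hops, link_cycles=0):
    """Zero-load head-flit latency over `hops` routers.

    `link_cycles` (extra cycles per link) is an assumption; the slides
    count router cycles only, so it defaults to 0.
    """
    return hops * (router_cycles + link_cycles)

for name, cycles in ROUTER_CYCLES.items():
    print(f"{name}: {head_latency_cycles(cycles, hops=6)} cycles over 6 hops")
```

Under this model the latency scales linearly with the hop count, so every pipeline stage removed from the router is saved once per hop.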
Single-Cycle NoC Router
Single-cycle is the ideal latency for an NoC router
It requires bypassing all of the router's pipeline stages
Related works presented complex solutions
Author | Year | Proposed solution | Drawback
Mullins et al. [9] | 2004 | Optimistic ST speculation | Increasing network load results in high mis-speculation rate
Park et al. [11] | 2007 | Dynamic priority-based NoC | Complex architecture due to priority-based arbiter and path frequency analyzer
Kumar et al. [5] | 2007 | Look-ahead crossbar control signals | Complex architecture due to the need for a three-level priority-based switch allocator
Michelogiannakis et al. [7] | 2007 | Reconfigurable preferred-path | VC is not supported. Efficient only for ASIC due to tri-state based crossbar
Krishna et al. [3] | 2008 | Express VC (EVC) | Requires starvation avoidance mechanism. EVC limits buffer sharing efficiency
Kim et al. [2] | 2009 | Single-clock dimension-based NoC | May result in starvation/unfairness. Intermediate buffers act as bottleneck at high load
Krishna et al. [4] | 2010 | Look-ahead bypassing | Communication link overhead. High communication latency/starvation for stored flits
Matsutani et al. [6] | 2011 | Prediction router | Complex allocator as request signals must be prioritized towards the reservation signals
He et al. [1] | 2013 | Predict-more router | Complex router due to need for multiple kill signals and multi-cast crossbar switch support
Proposed Static Straight Allocator (SSA)
We propose single-cycle latency forwarding for packets traveling in the same direction
SSA is adapted from the Prediction Router (PR) [6], which predicts that packets keep moving in the same direction
PR sends reservation signals in advance, where request signals have higher priority than local reservation signals
PR's drawback is the need for a complex priority-based VSA and a mis-prediction correction mechanism
Compared to PR's SS algorithm (SS-baseline), the proposed SSA improves prediction on edge routers
[Figure: Prediction hit rate (%) versus mesh size (4x4 to 16x16) for SS-baseline and SS-modified]
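The intuition behind a static straight prediction, namely that a packet entering a router usually leaves in the same direction, can be checked with a small counting experiment. The sketch below is not the authors' evaluation and does not reproduce the plotted curves: it assumes XY routing, uniform all-to-all traffic, and counts a hit only at intermediate routers (the source and destination routers are excluded). All function names are hypothetical.

```python
from itertools import product

def next_dir(node, dest):
    """XY routing: direction of travel out of `node`, or None at the destination."""
    (x, y), (dx, dy) = node, dest
    if x != dx:
        return (1, 0) if dx > x else (-1, 0)
    if y != dy:
        return (0, 1) if dy > y else (0, -1)
    return None

def straight_hit_rate(k):
    """Fraction of intermediate-router traversals where the packet keeps its direction."""
    hits = total = 0
    nodes = list(product(range(k), repeat=2))
    for src, dest in product(nodes, repeat=2):
        if src == dest:
            continue
        node, incoming = src, None
        while node != dest:
            outgoing = next_dir(node, dest)
            if incoming is not None:      # packet arrived from a neighbouring router
                total += 1
                hits += outgoing == incoming
            node = (node[0] + outgoing[0], node[1] + outgoing[1])
            incoming = outgoing
    return hits / total if total else 0.0

for k in (4, 8):
    print(f"{k}x{k} mesh: straight hit rate {straight_hit_rate(k):.3f}")
```

With XY routing a packet turns at most once, so under these assumptions the share of straight traversals grows with the mesh size as paths get longer.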
Static Straight Allocator (SSA) RTL Architecture
SSA allocates VC/SW for packets at flit capturing time
The proposed SSA can work in parallel with the router's VSA
The router's allocation result is the sum of both SSA and VSA results
SSA has a simple architecture
In order to achieve single-cycle latency, the internal buffer must be bypassed for SS packets
[Figure: Proposed SSA RTL]
[Figure: Buffer bypass module RTL]
SSA Implementation Result on FPGA
SSA-router is evaluated against three other NoC configurations
1 2-cycle NoC [8]: The early version of ProNoC with 2-cycle latency
2 1-cycle NoC: The modified version of the 2-cycle NoC [8], which performs ST serially after VSA to achieve single-cycle latency
3 CONNECT NoC [10]: An FPGA-optimized single-cycle NoC router, similar to ProNoC
[Figure: NoCs' pipeline/timing comparison]
SSA Implementation Result on FPGA (cont’d)
Synthesis results summary of different NoCs (VCs = 4, flit size = 32 bits, buffer size = 4 flits) on Altera Stratix IV EP4SGX230KF40C2 FPGA
Architecture | Max freq. | Total BRAM num. (M9k) | Total LCs of 4x4 mesh | Avg. LCs of a 5-port router
2-cycle [8] | 165 MHz | 64 (5.2%) | 41406 (23%) | 3234 (1.8%)
2-cycle-SSA | 152 MHz | 64 (5.2%) | 46296 (25%) | 3616 (2.0%)
1-cycle | 129 MHz | - | 90915 (64%) | 7102 (5.0%)
CONNECT [10] | 94 MHz | - | 67666 (43%) | 5286 (3.6%)
Both single-cycle NoCs have a large LCU as they cannot use BRAMs
CONNECT implements all router processing stages serially – low operating frequency
SSA has the smallest trade-off relative to the 2-cycle NoC – 8% increase in CPD and 12% area overhead
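The timing and area overheads can be re-derived from the synthesis table. The sketch below reads CPD as the critical path delay and assumes it is simply the reciprocal of the maximum frequency; the helper names are hypothetical, and the numbers are copied from the table above.

```python
# Figures copied from the synthesis table (4x4 mesh, Stratix IV).
FMAX_MHZ = {"2-cycle": 165, "2-cycle-SSA": 152, "1-cycle": 129, "CONNECT": 94}
TOTAL_LCS = {"2-cycle": 41406, "2-cycle-SSA": 46296, "1-cycle": 90915, "CONNECT": 67666}

def cpd_ns(fmax_mhz):
    """Critical path delay implied by the maximum frequency (assumed CPD = 1 / fmax)."""
    return 1000.0 / fmax_mhz

def percent_change(new, base):
    """Relative change of `new` with respect to `base`, in percent."""
    return 100.0 * (new - base) / base

base = "2-cycle"
for arch in FMAX_MHZ:
    d_cpd = percent_change(cpd_ns(FMAX_MHZ[arch]), cpd_ns(FMAX_MHZ[base]))
    d_area = percent_change(TOTAL_LCS[arch], TOTAL_LCS[base])
    print(f"{arch:12s} CPD {cpd_ns(FMAX_MHZ[arch]):5.2f} ns ({d_cpd:+5.1f}%), LCs {d_area:+6.1f}%")
```

For the 2-cycle-SSA row this gives a CPD increase of roughly 8.6% and about 11.8% more logic cells, in line with the rounded figures quoted on the slide.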
SSA Latency Clock Domain
[Figure: Average latency per packet (clk) versus load per router (flit/clk, %) for the 2-cycle, 2-cycle-SSA, 1-cycle and CONNECT routers, in six panels: Random, Hot-spot, Tornado, Matrix-Transposed, Bit-complement, Bit-reverse]
1-cycle (50%) and SSA (24%) routers give lower average latency cf. 2-cycle routers
SSA-router works as a 1.5-cycle NoC on average
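One way to read the 1.5-cycle figure is as a weighted average of the 1-cycle straight path and the 2-cycle regular path. The sketch below is only that simple model, under the assumption that a straight hit costs one cycle and anything else costs two; the function name and the model itself are not from the slides, and congestion is ignored.

```python
def expected_router_cycles(straight_hit_rate, hit_cycles=1, miss_cycles=2):
    """Average per-router latency when a fraction of flits takes the fast path."""
    if not 0.0 <= straight_hit_rate <= 1.0:
        raise ValueError("hit rate must be between 0 and 1")
    return straight_hit_rate * hit_cycles + (1.0 - straight_hit_rate) * miss_cycles

for rate in (0.0, 0.5, 1.0):
    print(f"hit rate {rate:.1f}: {expected_router_cycles(rate)} cycles per router")
```

Under this model an average of 1.5 cycles corresponds to about half of the router traversals taking the straight path.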
SSA Performance (Seconds)
[Figure: Average latency per packet (ns) versus load per router (million flits/sec) for the 2-cycle, 2-cycle-SSA, 1-cycle and CONNECT routers, in six panels: Random, Hot-spot, Tornado, Matrix-Transposed, Bit-complement, Bit-reverse]
SSA has 37% less CPD than the 1-cycle router
SSA has 8% lower frequency than the 2-cycle router
SSA Overall Performance Comparison
Overall performance metrics normalized to the 2-cycle latency router
Metric | 2-cycle [8] | 2-cycle-SSA | 1-cycle | CONNECT [10]
Overall average comm. latency (ns) | 100 | 81 | 71 | 97
Average saturation load (ns) | 100 | 93 | 77 | 50
LCU | 100 | 118 | 220 | 163
SSA offers the best trade-off for reducing the communication latency
19% reduction in average communication latency (in ns)
7% lower average saturation load (in ns), 18% higher LCU
Proposed 1-cycle NoC
30% reduction in average communication latency (in ns)
23% lower average saturation load (in ns), 120% higher LCU
CONNECT
Similar average communication latency (in ns)
50% lower average saturation throughput (in ns), 63% higher LCU
References
[1] Yuan He, Hiromu Sasaki, Shinsuke Miwa, and Hajime Nakamura. Predict-more router: A low latency NoC router with more route predictions. In IEEE 27th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), pages 842–850. IEEE, 2013.
[2] John Kim. Low-cost router microarchitecture for on-chip networks. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 255–266. ACM, 2009.
[3] Tushar Krishna, Amit Kumar, Patrick Chiang, Mattan Erez, and Li-Shiuan Peh. NoC with near-ideal express virtual channels using global-line communication. In 16th IEEE Symposium on High Performance Interconnects, HOTI'08, pages 11–20. IEEE, 2008.
[4] Tushar Krishna, Jacob Postman, Christopher Edmonds, Li-Shiuan Peh, and Patrick Chiang. Swift: A swing-reduced interconnect for a token-based network-on-chip in 90nm CMOS. In Computer Design (ICCD), 2010 IEEE International Conference on, pages 439–446. IEEE, 2010.
[5] Amit Kumar, Partha Kundu, Arvind P. Singh, Li-Shiuan Peh, and Niraj K. Jha. A 4.6 Tbits/s 3.6 GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS. In 25th International Conference on Computer Design, ICCD, pages 63–70. IEEE, 2007.
[6] Hiroki Matsutani, Michihiro Koibuchi, Hideharu Amano, and Tsutomu Yoshinaga. Prediction router: A low-latency on-chip router architecture with multiple predictors. IEEE Transactions on Computers, 60(6):783–799, 2011.
[7] George Michelogiannakis, Dionisios Pnevmatikatos, and Manolis Katevenis. Approaching ideal NoC latency with pre-configured routes. In Proceedings of the First International Symposium on Networks-on-Chip, pages 153–162. IEEE Computer Society, 2007.
[8] Alireza Monemi, Chia Yee Ooi, and Muhammad Nadzir Marsono. Low latency network-on-chip router microarchitecture using request masking technique. International Journal of Reconfigurable Computing, 2015:2, 2015.
[9] Robert Mullins, Andrew West, and Simon Moore. Low-latency virtual-channel routers for on-chip networks. In ACM SIGARCH Computer Architecture News, volume 32, page 188, 2004.
[10] Michael K. Papamichael and James C. Hoe. CONNECT: Re-examining conventional wisdom for designing NoCs in the context of FPGAs. In Proceedings of the ACM/SIGDA international symposium on FPGA, pages 37–46, 2012.
[11] Dongkook Park, Reetuparna Das, Chrysostomos Nicopoulos, Jongman Kim, Narayanan Vijaykrishnan, Ravishankar Iyer, and Chita R Das. Design of a dynamic priority-based fast path architecture for on-chip interconnects. In 15th Annual IEEE Symposium on High-Performance Interconnects, pages 15–20. IEEE, 2007.
3 Other Progresses in Many-Core Research
Prototyping NoC based MCSoC
Prototyping NoC based System-on-Chip
Designing a complete many-core SoC is time consuming and complex in nature
We have developed ProNoC as a design automation tool to facilitate the development of large-scale NoC-based many-core SoCs in RTL targeting FPGAs
Multiple NoC and processing core configurations
ProNoC IP Generator
IP Generator: Develops an intellectual property (IP) library from Verilog design files
Maps interfaces to modules' ports
Detects Verilog parameters and allows the selection of an appropriate GUI interface for redefining design parameters
ProNoC Processing Tile (PT) Generator
Processing tile (PT) generator: Generates a processing tile using an arbitrary number of IPs connected using a Wishbone bus
Automates generation of interconnect logic
Automates Wishbone addressing
Adjusts different IP parameters using the GUI
ProNoC MCSoC Generator
MCSoC generator: Generates a heterogeneous NoC-based many-core SoC using generated PTs
Provides a GUI for setting the NoC and PT parameters
ProNoC NoC Emulation
NoC emulator: Generates an NoC emulator
Generates an NoC connected to programmable packet injector modules
Programs/reads packet injectors at run time using the JTAG interface
Currently tested on Altera FPGAs
Mapping Applications on MCSoC
[Figure: Application characteristic graph (tasks T0 to T5) mapped to a 2x2-core MPSoC platform. Core 1: T0, T1; Core 2: T2; Core 3: T3; Core 4: T4, T5]
Application is specified as a Directed Acyclic Graph (DAG), containing execution and communication delays
Each task, Ti, is assigned to one core
Different mappings give different performance, i.e., throughput, execution time, energy consumption, ...
[Figure: Execution timeline of Cores 1 to 4 over three execution periods (data 1, data 2, data 3), showing the execution delays of T0 to T5 and the communication delays C01, C02, C13, C23, C24, C35]
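How a mapping changes on-chip communication can be sketched with a simple hop-count metric. The edges below are read from the communication labels (C01, C02, C13, C23, C24, C35) that survive in the figure; the placement of the four cores on a 2x2 mesh, the unit traffic per edge and the function `hop_cost` are assumptions for illustration. The slides measure performance by FPGA emulation, not by this metric.

```python
# Edges of the example task graph, read from the communication labels C01..C35.
EDGES = [("T0", "T1"), ("T0", "T2"), ("T1", "T3"),
         ("T2", "T3"), ("T2", "T4"), ("T3", "T5")]

# Assumed placement of the four cores on a 2x2 mesh (not given on the slide).
CORE_XY = {1: (0, 0), 2: (1, 0), 3: (0, 1), 4: (1, 1)}

def hop_cost(mapping, edges=EDGES, core_xy=CORE_XY):
    """Sum of Manhattan hop distances over all task-graph edges (unit traffic assumed)."""
    cost = 0
    for a, b in edges:
        (ax, ay), (bx, by) = core_xy[mapping[a]], core_xy[mapping[b]]
        cost += abs(ax - bx) + abs(ay - by)
    return cost

# Mapping shown on the slide, and a hypothetical alternative for comparison.
slide_mapping = {"T0": 1, "T1": 1, "T2": 2, "T3": 3, "T4": 4, "T5": 4}
alt_mapping = {"T0": 1, "T1": 1, "T2": 2, "T3": 2, "T4": 4, "T5": 4}

print(hop_cost(slide_mapping), hop_cost(alt_mapping))
```

The alternative needs fewer hops but leaves one core idle, which is why hop count alone is not a sufficient objective and throughput, execution time and energy are considered as well.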
Application Mapping Exploration Framework
[Figure: Application mapping exploration framework, from DAG through mapping and emulation to an optimized mapping]
Application Mapping in MATLAB/GNU Octave
Mapping Algorithm: Branch-and-Bound, Simulated Annealing, Genetic Algorithm (GA)
Platform Development using ProNoC
Platform Description: No. of cores, Routing algorithm, No. of virtual channels
Performance Measurement in FPGA Emulation (MPSoC platform with emulation terminals)
Performance: Execution delay, Comm. delay, Throughput
Mapping Exploration Flow:
1 The mapping algorithm analyzes the application graph and generates an initial map
2 All tasks are assigned to cores in the MCSoC platform and emulated in FPGA to obtain the on-chip performance
3 The process iterates until an optimized mapping is obtained
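The three-step flow above can be sketched as an iterative search loop. The sketch uses simulated annealing, one of the listed algorithms, and replaces the FPGA emulation step with a stand-in cost function (hop count plus an overload penalty). The task graph, the 2x2 mesh, the limit of two tasks per core and all names are assumptions for illustration, not the authors' implementation.

```python
import math
import random
from collections import Counter

EDGES = [(0, 1), (0, 2), (1, 3), (2, 3), (2, 4), (3, 5)]   # example DAG edges
CORES = [(0, 0), (1, 0), (0, 1), (1, 1)]                   # assumed 2x2 mesh
MAX_TASKS_PER_CORE = 2                                     # assumed load limit

def evaluate(mapping):
    """Stand-in for FPGA emulation: hop count plus a penalty for overloaded cores."""
    hops = sum(abs(CORES[mapping[a]][0] - CORES[mapping[b]][0]) +
               abs(CORES[mapping[a]][1] - CORES[mapping[b]][1]) for a, b in EDGES)
    overload = sum(max(0, n - MAX_TASKS_PER_CORE) for n in Counter(mapping).values())
    return hops + 10 * overload

def anneal(num_tasks=6, steps=500, start_temp=2.0, seed=0):
    rng = random.Random(seed)
    current = [rng.randrange(len(CORES)) for _ in range(num_tasks)]  # 1: initial map
    current_cost = evaluate(current)                                  # 2: evaluate
    initial_cost = current_cost
    best, best_cost = list(current), current_cost
    for step in range(steps):                                         # 3: iterate
        temp = start_temp * (1 - step / steps) + 1e-9
        candidate = list(current)
        candidate[rng.randrange(num_tasks)] = rng.randrange(len(CORES))
        cost = evaluate(candidate)
        # Always accept improvements; accept worse maps with decreasing probability.
        if cost <= current_cost or rng.random() < math.exp((current_cost - cost) / temp):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = list(candidate), cost
    return initial_cost, best, best_cost

initial_cost, best, best_cost = anneal()
print("initial cost:", initial_cost, "best cost:", best_cost, "best mapping:", best)
```

In the actual framework the call to `evaluate` would correspond to running the candidate mapping on the emulated platform and reading back the measured performance.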
Multi-Applications Optimization using Genetic Algorithm
[Figure: Application throughput versus generations (0 to 50) for app 1 and app 2. Mapping 2 applications with priority app1 > app2]
[Figure: Application throughput versus generations (0 to 50) for app 1, app 2 and app 3. Mapping 3 applications with priority app1 > app2 > app3]
Examples of mapping multiple applications onto an MCSoC using FPGA emulation
Genetic Algorithm (GA) is utilized to optimize throughput according to priority
GA searches for better throughput for app1 in each generation
Selected Prior Works on Tasks and Application Mapping
J.W. Tang, Y.W. Hau, M.N. Marsono, "Hardware/software partitioning of embedded System-on-Chip applications," in Proceedings of the 2015 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), Daejeon, South Korea, November 2015, pp. 331-336.
Y.-Z. Tei, Y.-W. Hau, N. Shaikh-Husin, M.N. Marsono, "Network Partitioning Domain Knowledge Multiobjective Application Mapping for Large-Scale Network-on-Chip," Applied Computational Intelligence and Soft Computing, Volume 2014, Article ID 867612, 10 pages, 2014. doi:10.1155/2014/867612.
Y.Z. Tei, Y.W. Hau, N. Shaikh-Husin, T. Andromeda, M.N. Marsono, "Network-on-Chip Application Mapping Based on Domain Knowledge Genetic Algorithm," in Proceedings of the 2014 IAES International Conference on Electrical Engineering, Computer Science and Informatics (EECSI 2014), Yogyakarta, Indonesia, August 2014, pp. 8690.
Hardware Transactional Memory
Transactional Memory
Parallel programming on shared memory requires locks for synchronization
Coarse grain lock
Fine grain lock
TM provides non-blocking, wait-free synchronization
Each transaction is atomic, isolated, and consistent
Simplifies multi-threaded programming, also in MCSoC
Combines coarse grain programming simplicity with fine grain performance
Software transactional memory (STM)
Flexible
Poor performance
Hardware transactional memory (HTM)
Restricted to hardware resource allocation
Better performance
Hardware Transactional Memory Overview
Keeps track of all the changes made by transactions
Processors issue commits to make changes permanent
Version Management
Determines where data modified by transactions is stored
Conflict Management
Defines how conflicts are detected and managed
[Figure: Processors P1, P2, ..., P(n) sharing a TM_buffer and Main Memory]
HTM Configurable Version Management
Eager: TM_buffer holds the OLD values, Main_Memory holds the NEW values
Fast commit
Slow abort
Suitable for low contention
Lazy: TM_buffer holds the NEW values, Main_Memory holds the OLD values
Slow commit
Fast abort
Suitable for high contention
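The difference between the two version management policies can be sketched in software. This is a behavioural illustration only, assuming a dictionary as main memory and a single transaction at a time; the class and method names are hypothetical, and the sketch does not model the HTM hardware or conflict detection.

```python
class EagerVersioning:
    """New values go straight to memory; old values are kept in an undo log."""

    def __init__(self, memory):
        self.memory = memory
        self.undo_log = {}                       # plays the role of TM_buffer (OLD)

    def write(self, addr, value):
        self.undo_log.setdefault(addr, self.memory[addr])  # save the old value once
        self.memory[addr] = value

    def commit(self):
        self.undo_log.clear()                    # fast: memory is already up to date

    def abort(self):
        for addr, old in self.undo_log.items():  # slow: restore every old value
            self.memory[addr] = old
        self.undo_log.clear()


class LazyVersioning:
    """New values are buffered; memory keeps the old values until commit."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}                   # plays the role of TM_buffer (NEW)

    def write(self, addr, value):
        self.write_buffer[addr] = value

    def read(self, addr):
        if addr in self.write_buffer:            # a transaction sees its own writes
            return self.write_buffer[addr]
        return self.memory[addr]

    def commit(self):
        self.memory.update(self.write_buffer)    # slow: copy every new value
        self.write_buffer.clear()

    def abort(self):
        self.write_buffer.clear()                # fast: just drop the buffer


memory = {0x10: 1, 0x14: 2}
tx = EagerVersioning(memory)
tx.write(0x10, 99)
tx.abort()
print(memory[0x10])
```

The work done in `commit` and `abort` mirrors the trade-off on the slide: eager pays on abort, lazy pays on commit, so the better choice depends on how often transactions conflict.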
Performance of HTM Configurable Version Management
Transaction size | Version management: Lazy | Version management: Eager
4 | 5.37% | 34.13%
8 | 9.39% | 37.23%
16 | 12.82% | 37.84%
[Figure: Clock cycles (x 10^5) versus probability of conflict (0 to 1) for EAGER experimental, LAZY experimental, LAZY max, LAZY min, EAGER max and EAGER min]
Selected Prior Works on Hardware Transactional Memory
J. Sirkunan, C.Y. Ooi, N. Shaikh-Husin, Y.W. Hau, M.N. Marsono, "Adaptive Configurable Transactional Memory for Multi-Processor FPGA Platforms," in Proceedings of the 2015 IEEE 23rd Annual Symposium on Field-Programmable Custom Computing Machines (FCCM 2015), Vancouver, Canada, May 2015, pp. 234.
J. Sirkunan, C.Y. Ooi, N. Shaikh-Husin, Y.W. Hau, M.N. Marsono, "Hardware Transactional Memory on Multi-processor FPGA Platform," in Proceedings of the 2014 IEEE International Symposium on Circuits and Systems (ISCAS 2014), Melbourne, Australia, June 2014, pp. 2744–2747.
Dynamic Reconfiguration of Reconfigurable Hardware
Benefits of Dynamic Partial Reconfiguration
Reconfigurable hardware offers processing power similar to an ASIC solution
Gate level parallelism
Pipelined architecture
Distributed memory (BlockRAM)
Offers flexibility greater than a software solution
Configuration can be changed at run-time
Dynamic partial reconfiguration feature
Configuration capability
Our Dynamically Reconfigurable Middlebox Platform
Supports remote dynamic reconfiguration for functionality updates through UDP/IP over Ethernet
Provides flexibility to customize and optimize the packet forwarding algorithm
Functionally extensible by modification of Partial Reconfigurable Modules
Achieves 350 Mbps of remote reconfiguration throughput, which is important for mass remote updates and for lowering device downtime during remote updates
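The link between reconfiguration throughput and device downtime can be estimated with a short calculation. The 350 Mbps figure is from the slide; the partial bitstream sizes are hypothetical values chosen only for illustration, and the estimate assumes the quoted throughput applies to the bitstream payload and ignores any other overhead.

```python
THROUGHPUT_BPS = 350e6   # 350 Mbps, as reported on the slide

def reconfig_time_ms(bitstream_bytes, throughput_bps=THROUGHPUT_BPS):
    """Transfer time of one partial bitstream in milliseconds, ignoring overhead."""
    return bitstream_bytes * 8 / throughput_bps * 1e3

# Hypothetical partial bitstream sizes, chosen only for illustration.
for size_kib in (256, 1024, 4096):
    print(f"{size_kib:5d} KiB -> {reconfig_time_ms(size_kib * 1024):7.2f} ms")
```

Under these assumptions the transfer time grows linearly with the bitstream size, so smaller partial reconfigurable modules translate directly into shorter downtime per update.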
Functional Block Diagram of Middlebox Platform
Our Case Study on Network Protection
Remote dynamic reconfiguration
Extendable with network protection features at run-time for customization and optimization
Allows NIPS to be optimized with various types of string matching algorithms, which can be CAM-based, hash-based or finite-automata-based
Enables functional patching of the developed platform, so that design flaws can be fixed after deployment
Selected Prior Works on Dynamic Reconfiguration of Reconfigurable Hardware
T.H. Tan, C.Y. Ooi, M.N. Marsono, "rrBox: A Remote Dynamically Reconfigurable Network Processing Middlebox," in Proceedings of the 25th International Conference on Field Programmable Logic and Applications (FPL 2015), London, UK, September 2015, pp. 1-4.
T.-H. Tan, C.-Y. Ooi, Y.-W. Hau, N. Shaikh-Husin, M.N. Marsono, "Remote Dynamically Reconfigurable Platform using NetFPGA," in Proceedings of the 2014 IEEE International Symposium on Circuits and Systems (ISCAS 2014), Melbourne, Australia, June 2014, pp. 1239–1242.
4 Conclusion
Future Outlook on MCSoC Research
Many-core research is multi-faceted and requires investigation of different architectural aspects
Many-core SoCs require efficient, scalable and reliable inter-core communication infrastructure and automation tools
We have barely scratched the surface – there is a lot more to explore
Task and application mapping/scheduling
Runtime optimization – grey/dark silicon
Cache and memory architecture
Workload characterization and evaluation of real application traces
Fault-tolerance and reliability