Using CoDeL to rapidly prototype network processor extensions
Nainesh Agarwal and Nikitas J. Dimopoulos
Department of Electrical and Computer Engineering
University of Victoria
SAMOS IV, July 19-24, 2004 2
[Figure: General structure of a “massively” parallel system. Each node contains processors (P), memory (M), an interconnect and a NIC; the nodes are joined by a system interconnect.]

Outline
The problem (latency)
Prediction
Architectural enhancements
CoDeL and Implementation

Latency
CG Benchmark Completed (both runs: Class = W, Size = 7000, Iterations = 15, Total processes = 8, Compiled procs = 8, Operation type = floating point, Verification = SUCCESSFUL, Version = 2.3, Compile date = 07 Mar 2001)

                     First run   Second run
Time in seconds      1.72        0.99
Mop/s total          244.10      426.95
Mop/s/process        30.51       53.37

Slide labels: Switch shared; Switch shared, user space; MPI over LAPI; Faster communications
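As a quick worked check, the improvement between the two reports can be computed directly from the figures they state. The variable names below are illustrative only.

```python
# Figures copied from the two CG benchmark reports on this slide.
time_first, time_second = 1.72, 0.99       # Time in seconds
mops_first, mops_second = 244.10, 426.95   # Mop/s total

# Ratio of run times and ratio of aggregate throughput.
speedup_time = time_first / time_second
speedup_mops = mops_second / mops_first

print(f"run-time ratio:   {speedup_time:.2f}x")
print(f"throughput ratio: {speedup_mops:.2f}x")
```

Both ratios come out at roughly 1.7, with the same problem class, size and process count in both reports.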

Latency
Minimizing communication latency is crucial in achieving high performance.
[Figure: Send Process with Send buffer, System buffer, NI, Network, NI, System buffer, Receive Process with Receive buffer]

Latency
Efficiency requires the message to be available to be consumed
[Figure: timeline for Sender, Receiver and Consumer. Labels: Send call; Receive call issued; Receive call executed (address resolution); idle; Copy to consumer space]

Latency
[Figure: timeline for Sender, Receiver thread and Consumer thread. Labels: Send call; Receive call issued; Receive call executed (address resolution); Cache miss]

Latency
Even when the network delays are minimized (non-existent)
  receiver synchronization,
  message copying,
  cache misses
delay execution.

The solution
Ensure that the received message is in the consumer’s cache at the point the consumer needs to consume the message.
[Figure: P, cache, M]

The solution
Enabling mechanisms
  In an asynchronous environment where many messages arrive at a node, can we decide which is the message to be consumed next?
  How do we place the message to be consumed in the cache?
[Figure: M, P, cache]

The solution
Learn the pattern of message consumption and use this to decide which is the message to be consumed next.
Develop a hardware environment that will facilitate the placement of the message in the consumer’s cache

Receive call predictors
History-based predictors predict subsequent receive calls at a given node in a message-passing application.

Locality
Message reception locality
  If a certain message reception call has been used, it will be re-used with high probability by a portion of code that is “near” the place where it was used earlier, and it will also be re-used in the near future

Messages vary in size from a few bytes to several kbytes

Predictors
Heuristics that predict the subsequent receive calls based on the past history of communication patterns on a per node basis.
  Tag Predictor
  Single-cycle Predictor
  Tag-cycle Predictor
  Tag-better-cycle Predictor
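The slides name the predictors without giving their algorithms. As a rough illustration of what a history-based heuristic can look like, here is a minimal sketch that searches the recent receive-call history for the shortest repeating cycle and predicts the call that would continue it. This is an assumed, simplified scheme and not the authors' Single-cycle Predictor; `CyclePredictor`, `find_cycle`, `max_cycle` and the use of an integer call identifier are all hypothetical.

```python
from typing import Hashable, List, Optional, Sequence


def find_cycle(history: Sequence[Hashable], max_len: int) -> Optional[int]:
    """Return the shortest period p whose last p calls repeat the p before them."""
    for p in range(1, max_len + 1):
        if len(history) >= 2 * p and list(history[-p:]) == list(history[-2 * p:-p]):
            return p
    return None


class CyclePredictor:
    """Hypothetical per-node predictor over observed receive calls."""

    def __init__(self, max_cycle: int = 16) -> None:
        self.max_cycle = max_cycle
        self.history: List[Hashable] = []

    def observe(self, call_id: Hashable) -> None:
        # Record one receive call (for example a source/tag pair).
        self.history.append(call_id)

    def predict(self) -> Optional[Hashable]:
        if not self.history:
            return None
        p = find_cycle(self.history, self.max_cycle)
        if p is None:
            # No cycle seen yet: fall back to repeating the last call.
            return self.history[-1]
        # The call one full period back is the one that continues the cycle.
        return self.history[-p]


predictor = CyclePredictor()
for call in [3, 5, 7, 3, 5, 7]:
    predictor.observe(call)
next_call = predictor.predict()
```

The fallback to the most recent call is one simple way to use reception locality before a cycle has been established; other fallbacks are equally plausible.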

Single-cycle Predictor
N = 64 for CG, and 49 for others

What next
Network Processor Extensions
  Achieve zero-copy through re-mapping
  Use the predictors to “optimize” size and performance.

Architecture
[Figure: node architecture with M, P, Interconnect, NIC, Network cache, cache]

Architectural Enhancements
[Figure: Network Cache between the Network Memory Space and the Process Memory Space. Labels: Network tag; Process tag; Message ID; cache data line; initial; final]
A separate Network Cache “ties” the Network Memory Space and the Process Memory Space

Definitions
Network Memory Space:
  Network buffers
  Received messages live here, waiting to be bound to the process address space.
Process Memory Space:
  Process address space
  Process objects, including bound messages, live here

Operation
The Network Cache includes three separate tags.
  The network tag is associated with the Network Memory Space.
  The process tag is associated with the Process Memory Space.
  The message ID tag holds the message ID.
All three tags can be searched associatively.

Operation cont’d
On message arrival, the message is cached in the network cache.
The network tag is set to the address of the buffer in Network Memory Space that is allocated to the message.
The message ID tag is set to the message ID.
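The arrival step can be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware design: `NetworkCacheLine`, `NetworkCache`, `on_arrival` and `find_by_message_id` are hypothetical names, tags are modelled as optional integers, and a linear scan stands in for the associative search.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NetworkCacheLine:
    """One network cache line with its three tags (None means invalid)."""
    network_tag: Optional[int] = None  # buffer address in Network Memory Space
    process_tag: Optional[int] = None  # address in Process Memory Space
    message_id: Optional[int] = None   # ID of the cached message
    data: bytes = b""


class NetworkCache:
    def __init__(self, num_lines: int) -> None:
        self.lines: List[NetworkCacheLine] = [NetworkCacheLine() for _ in range(num_lines)]

    def on_arrival(self, message_id: int, buffer_addr: int, payload: bytes) -> NetworkCacheLine:
        """Cache an arriving message; set its network tag and message ID tag."""
        for line in self.lines:
            if line.network_tag is None and line.process_tag is None:  # free line
                line.network_tag = buffer_addr
                line.message_id = message_id
                line.data = payload
                return line
        raise RuntimeError("no free line; a replacement policy would pick a victim here")

    def find_by_message_id(self, message_id: int) -> Optional[NetworkCacheLine]:
        """Stand-in for the associative search on the message ID tag."""
        return next((l for l in self.lines if l.message_id == message_id), None)


cache = NetworkCache(num_lines=2)
line = cache.on_arrival(message_id=7, buffer_addr=0x1000, payload=b"hello")
```

Note that the process tag stays unset on arrival: the message is not yet bound to any object in the process address space.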

Operation cont’d
The message lives in the network cache and migrates to the Network Memory Space according to a cache replacement policy which replaces the message that is least likely to be consumed next.
The receive-call prediction heuristics are used for this purpose.
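One way to picture this policy: given a predicted consumption order from the receive-call predictor, the victim is the cached message that appears latest in that order, or not at all. The sketch below assumes the predictor yields an ordered list of message IDs, soonest first; `pick_victim`, `evict` and the dictionaries keyed by message ID are hypothetical simplifications (real buffers would be addressed by the network tag).

```python
from typing import Dict, Sequence


def pick_victim(cached_ids: Sequence[int], predicted_order: Sequence[int]) -> int:
    """Return the cached message ID least likely to be consumed next."""
    rank = {mid: i for i, mid in enumerate(predicted_order)}
    # Messages absent from the prediction rank after every predicted one.
    return max(cached_ids, key=lambda mid: rank.get(mid, len(predicted_order)))


def evict(cache: Dict[int, bytes], network_memory: Dict[int, bytes],
          predicted_order: Sequence[int]) -> int:
    """Move the victim's payload from the network cache to network memory."""
    victim = pick_victim(list(cache), predicted_order)
    network_memory[victim] = cache.pop(victim)
    return victim


cache = {1: b"a", 2: b"b", 3: b"c"}
network_memory: Dict[int, bytes] = {}
victim = evict(cache, network_memory, predicted_order=[2, 3, 1])
```

Here message 1 is predicted to be consumed last, so it is the one that migrates to network memory while messages 2 and 3 stay cached.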

Late binding
A receive call invalidates the message ID and network tags and sets the process tag to point to the address of the object destined to receive the message in Process Memory Space.
The buffer in Network Memory Space is released and can be garbage collected.
From this point onward, the cache line is associated with the Process Memory Space. On cache replacement, the message is written back to its targeted object in Process Memory Space
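The key property of late binding is that only tags change; the payload is never copied at receive time. A minimal sketch, assuming tags are optional integers and memory is a dictionary keyed by address (`Line`, `bind`, `write_back` and `free_buffers` are hypothetical names):

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Line:
    message_id: Optional[int]
    network_tag: Optional[int]
    process_tag: Optional[int]
    data: bytes


def bind(line: Line, dest_addr: int, free_buffers: Set[int]) -> None:
    """Late binding on a receive call: re-tag the line without copying data."""
    if line.network_tag is not None:
        free_buffers.add(line.network_tag)  # network buffer is released
    line.message_id = None                  # invalidate the message ID tag
    line.network_tag = None                 # invalidate the network tag
    line.process_tag = dest_addr            # line now belongs to process space


def write_back(line: Line, process_memory: Dict[int, bytes]) -> None:
    """On replacement, a bound line is written to its target object."""
    if line.process_tag is None:
        raise ValueError("only bound lines are written to process memory")
    process_memory[line.process_tag] = line.data


line = Line(message_id=7, network_tag=0x1000, process_tag=None, data=b"payload")
free_buffers: Set[int] = set()
bind(line, dest_addr=0x8000, free_buffers=free_buffers)

process_memory: Dict[int, bytes] = {}
write_back(line, process_memory)
```

After `bind`, the data is reachable through the process tag alone, and the copy into process memory is deferred until the line is actually replaced.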

Large Messages
Are not dealt with in this work (TLB techniques would accomplish message re-binding)

ISA extensions
network_load
network_store
  Identical to the standard load and store instructions, except that they cause the network cache to be searched according to the network tag. No other cache is searched.

ISA extensions cont’d
Regular load and store instructions target both the normal data cache and the network cache; the network cache is searched according to the process tag.
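The difference between the two kinds of load is which caches are probed and which tag is matched. A minimal software sketch, assuming the data cache is a dictionary keyed by address and each line holds a single integer value; `Line`, `network_load` and `load` are hypothetical models of the instructions, and the sequential probing stands in for what hardware would do in parallel.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Line:
    network_tag: Optional[int]
    process_tag: Optional[int]
    data: int


def network_load(network_cache: List[Line], addr: int) -> Optional[int]:
    """network_load: search only the network cache, by network tag."""
    for line in network_cache:
        if line.network_tag == addr:
            return line.data
    return None  # miss


def load(data_cache: Dict[int, int], network_cache: List[Line], addr: int) -> Optional[int]:
    """Regular load: probe the data cache and the network cache (by process tag)."""
    if addr in data_cache:
        return data_cache[addr]
    for line in network_cache:
        if line.process_tag == addr:
            return line.data
    return None  # miss in both caches


network_cache = [
    Line(network_tag=0x1000, process_tag=None, data=11),  # unbound message
    Line(network_tag=None, process_tag=0x8000, data=22),  # bound message
]
data_cache = {0x4000: 33}
```

With this state, `network_load(network_cache, 0x1000)` hits the unbound message, while a regular `load` at `0x8000` finds the bound message through its process tag and never sees the unbound one.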

ISA extensions cont’d
remap message_id, new_process_tag
  Remaps the cache line identified by the message_id to the new_process_tag. The message_id and new_process_tag are in registers.

Implementation

Implementation -- cont’d
The network cache is implemented as m-way associative
Three sections
  Process section
  MessageID section
  Network Cache section

Implementation -- cont’d
The network cache section holds the message payload
The messageID and process sections hold pointers that point to payloads in the network cache section
The associativity of the messageID and process sections is larger than that of the network cache section to avoid unnecessary cache misses.

Implementation -- overall

CoDeL
CoDeL (Controller Description Language) targets specification and design at the behavioral level.
CoDeL is a procedural language in which the order of the statements implicitly represents the sequence of activities.
The compiler extracts the data and control flow from the program automatically, assigns the necessary hardware blocks and exploits inherent parallelism.

CoDeL
It is similar to the C programming language and is therefore easy to learn.
It includes a library of I/O protocols that simplify (sub)system interaction.
The CoDeL compiler produces synthesizable VHDL code which can be targeted to any technology including PLD, FPGA or ASIC.

CoDeL -- Ports and Protocols
CoDeL abstracts module interaction through ports and protocols.
Protocols define the sequence of events necessary to transfer information from one module to another

CoDeL -- Example
# Define a 16-bit address
# in 4 dimensions
bitstruct mixed_radix_4
{
(bits) field1[4];
(bits) field2[4];
(bits) field3[4];
(bits) field4[4];
}
# Define a 36-bit
# message header using
# the above
bitstruct data_frame
{
(mixed_radix_4) source_address;
(mixed_radix_4) destn_address;
(bits) header[4];
}
in (data_frame) p1 with input_handshake;
out (data_frame) p3 with output_handshake;
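
To make the field widths concrete, the following Python sketch packs the same fields into integers: four 4-bit fields form the 16-bit address, and two addresses plus the 4-bit header form the 36-bit frame (16 + 16 + 4). The bit ordering, with the first declared field in the most significant position, is an assumption; the CoDeL example above does not state how the compiler lays out a bitstruct. The function names are hypothetical.

```python
from typing import Tuple


def pack_mixed_radix_4(f1: int, f2: int, f3: int, f4: int) -> int:
    """Pack four 4-bit fields into a 16-bit address (field1 assumed most significant)."""
    for f in (f1, f2, f3, f4):
        if not 0 <= f < 16:
            raise ValueError("each field must fit in 4 bits")
    return (f1 << 12) | (f2 << 8) | (f3 << 4) | f4


def pack_data_frame(source: int, destn: int, header: int) -> int:
    """Pack source (16 bits), destination (16 bits) and header (4 bits) into 36 bits."""
    if not (0 <= source < 1 << 16 and 0 <= destn < 1 << 16 and 0 <= header < 16):
        raise ValueError("field out of range")
    return (source << 20) | (destn << 4) | header


def unpack_data_frame(frame: int) -> Tuple[int, int, int]:
    """Recover (source_address, destn_address, header) from a packed frame."""
    return (frame >> 20) & 0xFFFF, (frame >> 4) & 0xFFFF, frame & 0xF


src = pack_mixed_radix_4(1, 2, 3, 4)
dst = pack_mixed_radix_4(5, 6, 7, 8)
frame = pack_data_frame(src, dst, 0x9)
```

Unpacking `frame` with `unpack_data_frame` returns the original source address, destination address and header.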

CoDeL -- Example Protocol
Example of a handshake protocol

Network Processor Extension Implementation
The register file modules were implemented in VHDL. Each of these required about 60 lines of VHDL code. Each cache line is 32 bytes.
The network controller module, written in CoDeL, required about 697 lines of code, and generated close to 4011 lines of VHDL code.
Under simulation we see that the network load instruction requires 15 clock cycles, the network store takes 29 cycles, the remap takes 29 cycles, while the load requires 21 cycles.

Synthesis
This design has not been synthesized (Xilinx synthesis has failed)
We have been able to synthesize other designs (including the 5/3 Le Gall integer-to-integer wavelet)

Conclusions
A network processor extension has been proposed and designed using CoDeL.
Using CoDeL has allowed the rapid prototyping of the design.
CoDeL needs to be extended to enhance parallelism.
  Compiler directives (similar to the technique used in OpenMP) could be used.
  State collapsing and data forwarding would allow faster design.

What next
SMP nodes
  A cache-coherent based organization will migrate and bind received messages to the consuming processor
Refine the ISA.
  Is there any more functionality needed?
  Is the TLB-based re-mapping of the very large messages necessary?
  Can we live with one-sided communications?
Performance evaluation!!