Using CoDeL to rapidly prototype network processor extensions

Nainesh Agarwal and Nikitas J. Dimopoulos
Department of Electrical and Computer Engineering
University of Victoria

SAMOS IV, July 19-24, 2004

[Figure: General structure of a “massively” parallel system. Each node contains processors (P), memory (M), an interconnect, and a NIC; the nodes are joined by a system interconnect.]

Outline

The problem (latency)
Prediction
Architectural enhancements
CoDeL and Implementation

Latency

CG Benchmark Completed.

Field              First run        Second run
Class              W                W
Size               7000             7000
Iterations         15               15
Time in seconds    1.72             0.99
Total processes    8                8
Compiled procs     8                8
Mop/s total        244.10           426.95
Mop/s/process      30.51            53.37
Operation type     floating point   floating point
Verification       SUCCESSFUL       SUCCESSFUL
Version            2.3              2.3
Compile date       07 Mar 2001      07 Mar 2001

Slide labels: Switch shared / Switch shared, user space / MPI over LAPI / Faster communications
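
The two benchmark reports above can be compared directly. The following is a minimal sketch in Python that uses only the figures printed in the reports; the dictionary keys and function names are illustrative and do not come from the slides.

```python
# Figures copied from the two CG benchmark reports above.
runs = {
    "first run": {"seconds": 1.72, "mops_total": 244.10, "processes": 8},
    "second run": {"seconds": 0.99, "mops_total": 426.95, "processes": 8},
}


def speedup(slow, fast):
    """Ratio of execution times: how many times faster `fast` ran."""
    return slow["seconds"] / fast["seconds"]


def mops_per_process(run):
    """Per-process throughput, matching the 'Mop/s/process' field."""
    return run["mops_total"] / run["processes"]


print(f"speedup: {speedup(runs['first run'], runs['second run']):.2f}x")
for name, run in runs.items():
    print(f"{name}: {mops_per_process(run):.2f} Mop/s/process")
```

With the reported times, the second run is about 1.74 times faster than the first for the same problem class and process count.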

Latency

Minimizing communication latency is crucial in achieving high performance.

[Figure: message path between a send process and a receive process. Labels: Send Process, Send buffer, System buffer, NI, Network, NI, System buffer, Receive buffer, Receive Process.]

Latency

Efficiency requires the message to be available to be consumed

[Figure: timeline for Sender, Receiver, and Consumer. Labels: Send call; Receive call issued; Receive call executed (address resolution); Copy to consumer space; idle.]

Latency

[Figure: timeline for Sender, Receiver thread, and Consumer thread. Labels: Send call; Receive call issued; Receive call executed (address resolution); Cache miss.]

Latency

Even when the network delays are minimized (non-existent),
    receiver synchronization,
    message copying,
    cache misses
delay execution.

The solution

Ensure that the received message is in the consumer’s cache at the point the consumer needs to consume the message.

[Figure: processor (P), cache, and memory (M).]

The solution

Enabling mechanisms
    In an asynchronous environment where many messages arrive at a node, can we decide which is the message to be consumed next?
    How do we place the message to be consumed in the cache?

[Figure: memory (M), processor (P), and cache.]

The solution

Learn the pattern of message consumption and use this to decide which is the message to be consumed next.

Develop a hardware environment that will facilitate the placement of the message in the consumer’s cache.

Receive call predictors

History-based predictors predict subsequent receive calls at a given node in a message-passing application.

Locality

Message reception locality
    If a certain message reception call has been used it will be re-used with high probability by a portion of code that is “near” the place that was used earlier, and it will also be re-used in the near future

Messages vary in size from a few bytes to several kbytes

Predictors

Heuristics that predict the subsequent receive calls based on the past history of communication patterns on a per node basis.

Tag Predictor
Single-cycle Predictor
Tag-cycle Predictor
Tag-better-cycle Predictor
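
The slides name the predictors but do not define them, so the following Python sketch is not one of the four heuristics listed above. It is a deliberately simple, hypothetical history-based predictor (called `LastSuccessorPredictor` here) that only illustrates the idea of predicting the next receive call from past calls at one node: it predicts whichever call followed the current call the last time it was seen.

```python
class LastSuccessorPredictor:
    """Hypothetical predictor: remembers which receive call followed each call."""

    def __init__(self):
        self.successor = {}  # call id -> call id that followed it last time
        self.last = None     # most recently observed receive call

    def predict(self):
        """Return the predicted next receive call, or None if unknown."""
        return self.successor.get(self.last)

    def observe(self, call_id):
        """Record an actual receive call and update the history."""
        if self.last is not None:
            self.successor[self.last] = call_id
        self.last = call_id


def count_hits(calls):
    """Replay a trace of receive calls and count correct predictions."""
    predictor = LastSuccessorPredictor()
    hits = 0
    for call_id in calls:
        if predictor.predict() == call_id:
            hits += 1
        predictor.observe(call_id)
    return hits
```

On a trace that repeats a fixed cycle of calls, such a predictor misses during the first pass through the cycle (and once more when the cycle first wraps around) and then predicts every later call correctly, which is the kind of regularity that message reception locality suggests.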

Single-cycle Predictor

N = 64 for CG, and 49 for others

What next

Network Processor Extensions
    Achieve zero-copy through re-mapping
    Use the predictors to “optimize” size and performance.

Architecture

[Figure: node architecture. Labels: processor (P), cache, Network cache, memory (M), Interconnect, NIC.]

Architectural Enhancements

Separate Network Cache “ties” the Network Memory Space and the Process Memory Space

[Figure: Network Cache between the Network Memory Space and the Process Memory Space. Each line is labelled with a Network tag, a Message ID, a Process tag, and a cache data line; further labels: initial, final.]

Definitions

Network Memory Space:
    Network buffers
    Received messages live here, waiting to be bound to the process address space.
Process Memory Space:
    Process address space
    Process objects, including bound messages, live here.

Operation

The Network Cache includes three separate tags.
    network tag is associated with the Network Memory Space,
    process tag is associated with the Process Memory Space.
    message ID tag holds the message ID.
All three tags can be searched associatively.
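
The three-tag organization can be pictured as a small data structure. The Python sketch below is only an illustration under assumed names (`NetworkCacheLine`, `find_line`, and the field names are not from the slides); a sequential loop stands in for the associative search, which hardware would perform on all lines in parallel.

```python
from dataclasses import dataclass
from typing import List, Optional

TAG_NAMES = ("network_tag", "process_tag", "message_id")


@dataclass
class NetworkCacheLine:
    data: bytes
    network_tag: Optional[int] = None  # buffer address in Network Memory Space
    process_tag: Optional[int] = None  # object address in Process Memory Space
    message_id: Optional[int] = None   # ID of the cached message


def find_line(lines: List[NetworkCacheLine], tag_name: str, value: int):
    """Return the first line whose chosen tag equals `value`, else None."""
    if tag_name not in TAG_NAMES:
        raise ValueError(f"unknown tag: {tag_name}")
    for line in lines:
        if getattr(line, tag_name) == value:
            return line
    return None
```

For example, `find_line(lines, "message_id", 7)` looks a message up by its ID, while `find_line(lines, "process_tag", address)` looks it up by the address it has been bound to.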

Operation cont’d

On message arrival, the message is cached on the network cache.

The network tag is set to the address of the buffer in network memory space that is allocated to the message.

The message id tag is set to the message id.

Operation cont’d

The message lives at the network cache and it migrates to the Network Memory space according to a cache replacement policy which replaces the message that is least likely to be consumed next.

The receive-call prediction heuristics are used for this purpose.
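
The arrival and replacement behaviour described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the authors' design: the class name `NetworkCache`, the `predicted_order` argument (a list of message IDs, most likely to be consumed first, standing in for the output of a receive-call predictor), and the rule that unpredicted messages count as least likely are all hypothetical.

```python
class NetworkCache:
    """Toy model: holds arriving messages and evicts the least likely one."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.lines = {}           # message_id -> (network_tag, payload)
        self.network_memory = {}  # network_tag -> payload of evicted messages

    def on_arrival(self, message_id, network_tag, payload, predicted_order):
        """Cache an arriving message, evicting one first if the cache is full."""
        if len(self.lines) >= self.capacity:
            self._evict(predicted_order)
        self.lines[message_id] = (network_tag, payload)

    def _evict(self, predicted_order):
        def rank(message_id):
            # Assumption: messages absent from the prediction are least likely.
            if message_id in predicted_order:
                return predicted_order.index(message_id)
            return len(predicted_order)

        victim = max(self.lines, key=rank)
        network_tag, payload = self.lines.pop(victim)
        # The evicted message migrates to its buffer in Network Memory Space.
        self.network_memory[network_tag] = payload
```

The sketch never considers the newly arrived message as a victim; whether the real policy does so is not stated in the slides.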

Late binding

A receive call invalidates the message ID and network tags and will set the process tag to point to the address of the object destined to receive the message in Process Memory Space.

The buffer in Network Memory space is released and can be garbage collected.

From this point onward, the cache line is associated with the Process Memory Space. On cache replacement, the message is written back to its targeted object in Process Memory Space.
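
The late-binding steps can be traced with a short Python sketch. The names `Line`, `receive`, and `write_back`, the use of a set for allocated network buffers, and the use of a dictionary for Process Memory Space are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Line:
    payload: bytes
    network_tag: Optional[int]
    message_id: Optional[int]
    process_tag: Optional[int] = None


def receive(line, destination_address, network_buffers):
    """Bind a cached message to its destination object without copying it."""
    network_buffers.discard(line.network_tag)  # release the network buffer
    line.network_tag = None                    # invalidate the network tag
    line.message_id = None                     # invalidate the message ID tag
    line.process_tag = destination_address     # line now belongs to the process


def write_back(line, process_memory):
    """On replacement, store a bound line into its target object."""
    if line.process_tag is None:
        raise ValueError("line is not bound to the Process Memory Space")
    process_memory[line.process_tag] = line.payload
```

Note that `receive` changes only tags: the payload stays in the same cache line, which is how the remapping avoids a copy.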

Large Messages

Are not dealt with in this work (TLB techniques would accomplish message re-binding)

ISA extensions

network_load
network_store

Identical to standard load and store instructions with the exception that they cause the network cache to be searched according to the network tag. No other cache is searched.

ISA extensions cont’d

Regular load and store instructions target both the normal data cache and the network cache and the network cache is searched according to the process tag.

ISA extensions cont’d

remap message_id, new_process_tag

remaps the cache line identified by the message_id to the new_process_tag. The message_id and new_process_tag are in registers.
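
The lookup rules of the ISA extensions can be summarized in a behavioural Python sketch. It is a model under assumptions, not the hardware: the class `Machine` and its dictionary-based caches are hypothetical, whole lines are returned instead of words, the data cache is checked before the network cache, and `remap` is assumed to also invalidate the message ID and network tags as described for late binding.

```python
class Machine:
    """Toy model of the load paths through the data and network caches."""

    def __init__(self):
        self.data_cache = {}     # address -> value (normal data cache)
        self.network_cache = []  # lines: dicts with three tags plus "data"

    def network_load(self, network_address):
        """Search only the network cache, by network tag."""
        for line in self.network_cache:
            if line["network_tag"] == network_address:
                return line["data"]
        return None  # miss

    def load(self, address):
        """Search the data cache, then the network cache by process tag."""
        if address in self.data_cache:
            return self.data_cache[address]
        for line in self.network_cache:
            if line["process_tag"] == address:
                return line["data"]
        return None  # miss

    def remap(self, message_id, new_process_tag):
        """Bind the line holding `message_id` to `new_process_tag`."""
        for line in self.network_cache:
            if line["message_id"] == message_id:
                line["process_tag"] = new_process_tag
                line["message_id"] = None   # assumed: invalidated on binding
                line["network_tag"] = None  # assumed: invalidated on binding
                return True
        return False
```

After a successful `remap`, a regular `load` of the new address finds the message in the network cache, while `network_load` no longer does.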

Implementation

Implementation -- cont’d

Network cache is implemented as m-way associative
Three sections
    Process section
    MessageID section
    Network Cache section

Implementation -- cont’d

The network cache section holds the message payload

The messageID and process sections hold pointers that point to payloads in the network cache section

The associativity of the messageID and process sections is larger than that of the network cache section to avoid unnecessary cache misses.
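
The pointer-based split into sections can be sketched as follows. This Python model is an illustration only: the class `SectionedNetworkCache` and its method names are hypothetical, dictionaries stand in for the messageID and process sections, and associativity is not modelled.

```python
class SectionedNetworkCache:
    """Toy model: payload slots plus two sections of pointers into them."""

    def __init__(self, num_slots):
        self.payloads = [None] * num_slots  # network cache section
        self.by_message_id = {}             # messageID section: id -> slot
        self.by_process_tag = {}            # process section: address -> slot

    def insert(self, message_id, payload):
        """Store a payload in the first free slot and index it by message ID."""
        slot = self.payloads.index(None)    # raises ValueError if full
        self.payloads[slot] = payload
        self.by_message_id[message_id] = slot
        return slot

    def rebind(self, message_id, process_tag):
        """Move the pointer from the messageID section to the process section."""
        slot = self.by_message_id.pop(message_id)
        self.by_process_tag[process_tag] = slot
        return slot

    def read(self, process_tag):
        """Follow the process-section pointer to the payload."""
        return self.payloads[self.by_process_tag[process_tag]]
```

Rebinding touches only the pointer sections; the payload slot itself is never moved.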

Implementation--overall

CoDeL

CoDeL (Controller Description Language) targets the specification and design at the behavioral level.

CoDeL is a procedural language in which the order of the statements implicitly represents the sequence of activities.

It extracts the data and control flow from the program automatically, assigns the necessary hardware blocks and exploits inherent parallelism.

CoDeL

It is similar to the C programming language and is therefore easy to learn.

It includes a library of I/O protocols that simplify (sub)system interaction. The CoDeL compiler produces synthesizable VHDL code which can be targeted to any technology including PLD, FPGA or ASIC.

CoDeL--Ports and Protocols

CoDeL abstracts module interaction through ports and protocols.

Protocols define the sequence of events necessary to transfer information from one module to another

CoDeL--Example

# Define a 16-bit address
# in 4 dimensions
bitstruct mixed_radix_4
{
    (bits) field1[4];
    (bits) field2[4];
    (bits) field3[4];
    (bits) field4[4];
}

# Define a 36-bit
# message header using
# the above
bitstruct data_frame
{
    (mixed_radix_4) source_address;
    (mixed_radix_4) destn_address;
    (bits) header[4];
}

in (data_frame) p1 with input_handshake;
out (data_frame) p3 with output_handshake;
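
To make the bit widths in the example concrete, the following Python sketch packs the same fields into a 36-bit integer: two 16-bit addresses of four 4-bit fields each, plus a 4-bit header. The function names and the bit ordering (first declared field in the most significant bits) are assumptions; the slides do not say how CoDeL lays out a bitstruct.

```python
FIELD_BITS = 4
FIELDS_PER_ADDRESS = 4
ADDRESS_BITS = FIELD_BITS * FIELDS_PER_ADDRESS  # 16


def pack_mixed_radix_4(fields):
    """Pack four 4-bit fields into a 16-bit address (field1 most significant)."""
    if len(fields) != FIELDS_PER_ADDRESS:
        raise ValueError("expected exactly four fields")
    value = 0
    for field in fields:
        if not 0 <= field < (1 << FIELD_BITS):
            raise ValueError("each field must fit in 4 bits")
        value = (value << FIELD_BITS) | field
    return value


def pack_data_frame(source, destn, header):
    """Pack source address, destination address, and header into 36 bits."""
    if not 0 <= header < (1 << FIELD_BITS):
        raise ValueError("header must fit in 4 bits")
    frame = pack_mixed_radix_4(source)
    frame = (frame << ADDRESS_BITS) | pack_mixed_radix_4(destn)
    return (frame << FIELD_BITS) | header


def unpack_data_frame(frame):
    """Split a 36-bit frame back into (source, destn, header) integers."""
    header = frame & 0xF
    destn = (frame >> FIELD_BITS) & 0xFFFF
    source = (frame >> (FIELD_BITS + ADDRESS_BITS)) & 0xFFFF
    return source, destn, header
```

Under this ordering, `pack_data_frame([1, 2, 3, 4], [5, 6, 7, 8], 9)` yields `0x123456789`, which needs 16 + 16 + 4 = 36 bits, matching the comment in the CoDeL example.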

CoDeL--Example Protocol

Example of a handshake protocol
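
The handshake figure from this slide did not survive extraction, and the slides do not define the protocols in CoDeL's library. As a stand-in, the Python sketch below simulates a generic four-phase request/acknowledge handshake, a common way to express "the sequence of events necessary to transfer information from one module to another". The function name and signal names are assumptions and are not CoDeL syntax.

```python
def four_phase_transfer(values):
    """Move each value from sender to receiver using a req/ack handshake."""
    req = ack = False
    bus = None
    received = []
    trace = []

    for value in values:
        # 1. Sender drives the data and raises req.
        bus, req = value, True
        trace.append(("req", 1))
        # 2. Receiver sees req, latches the data, and raises ack.
        if req and not ack:
            received.append(bus)
            ack = True
            trace.append(("ack", 1))
        # 3. Sender sees ack and lowers req.
        if ack:
            req = False
            trace.append(("req", 0))
        # 4. Receiver sees req low and lowers ack; both sides are idle again.
        if not req:
            ack = False
            trace.append(("ack", 0))

    return received, trace
```

Each transferred value produces the same four events in order, so the receiver latches data only while `req` is high and the sender never changes the bus until `ack` has returned to low.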

Network Processor Extension Implementation

The register file modules were implemented in VHDL. Each of these required about 60 lines of VHDL code. Each cache line is 32 bytes.

The network controller module, written in CoDeL, required about 697 lines of code, and generated close to 4011 lines of VHDL code.

Under simulation we see that the network load instruction requires 15 clock cycles, the network store takes 29 cycles, the remap takes 29 cycles, while the load requires 21 cycles.
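
The figures above allow a quick back-of-the-envelope comparison. The Python snippet below uses only the numbers reported on this slide; the variable names are illustrative.

```python
# Line counts reported for the network controller module.
codel_lines = 697
vhdl_lines = 4011

# Generated VHDL lines per line of CoDeL source.
expansion = vhdl_lines / codel_lines

# Simulated latencies, in clock cycles, as reported above.
cycles = {"network_load": 15, "network_store": 29, "remap": 29, "load": 21}

print(f"expansion: {expansion:.2f} VHDL lines per CoDeL line")
for name, count in sorted(cycles.items(), key=lambda item: item[1]):
    print(f"{name}: {count} cycles")
```

The controller expands to roughly 5.75 lines of generated VHDL for each line of CoDeL.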

Synthesis

This design has not been synthesized (Xilinx synthesis has failed)

We have been able to synthesize other designs (including the 5/3 Le Gall integer-to-integer wavelet)

Conclusions

A network processor extension has been proposed and designed using CoDeL.
Using CoDeL has allowed the rapid prototyping of the design.
CoDeL needs to be extended to enhance parallelism.
    Compiler directives (similar to the technique used in OpenMP) could be used.
    State collapsing and data forwarding would allow faster design.

What next

SMP nodes
    A cache-coherent based organization will migrate and bind received messages to the consuming processor
Refine the ISA.
    Is there any more functionality needed?
    Is the TLB-based re-mapping of the very large messages necessary?
    Can we live with one sided communications?
Performance evaluation!!