A Configurable High-Throughput Linear Sorter System

A Configurable High-Throughput Linear Sorter System Jorge Ortiz Information and Telecommunication Technology Center 2335 Irving Hill Road Lawrence, KS [email protected] David Andrews Computer Science and Computer Engineering The University of Arkansas 504 J.B. Hunt Building, Fayetteville, AR [email protected]


Slide 1

A Configurable High-Throughput Linear Sorter System

Jorge Ortiz

Information and Telecommunication Technology Center

2335 Irving Hill Road

Lawrence, KS

[email protected]

David Andrews

Computer Science and Computer Engineering

The University of Arkansas

504 J.B. Hunt Building, Fayetteville, AR

[email protected]

Sorting is an important system function
Popular sorting algorithms are not efficient or fast in hardware implementations
Linear sorters are ideal for hardware, but sort at a rate of 1 value per cycle
Sorting networks are better at throughput, but with high area and latency cost
Need a better solution for high-throughput, low-latency sorting

Contributions
Expanding the linear sorter implementation and making it versatile, reconfigurable and better suited for streaming input and output
Parallelizing the linear sorter for increased throughput
Implementing the high-throughput linear sorter, and outmatching the performance of current linear sorter approaches

Background
Software quicksort, mergesort and heapsort use divide-and-conquer techniques to achieve efficiency
Hardware sorting is plagued with overhead from data movements, synchronization, bookkeeping and memory accesses
Hardware needs to make better use of concurrent data comparisons and swaps, rather than the extended execution of multiple assembly instructions used by its software counterpart

Sorting Networks
Swap comparators sort pairs of values
Sink the lowest value, then operate on the remaining S(n-1) items
Receive parallel data at inputs
High #PE and latency; re-sort with each new insertion

Bubble Sort example, input 3 2 5 4 1 (data after each swap):
3 2 5 4 1
2 3 5 4 1
2 3 4 5 1
2 3 4 1 5
2 3 1 4 5
2 1 3 4 5
1 2 3 4 5
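The bubble-sort frames above can be replayed with a few lines of Python. This is a minimal software sketch, assuming a plain adjacent compare-and-swap schedule; the function name `bubble_network_trace` is hypothetical and is not taken from the slides. It records the data after every swap, so the result can be compared directly with the frames.

```python
from typing import List


def bubble_network_trace(values: List[int]) -> List[List[int]]:
    """Return the data after each swap of a bubble-sort comparator schedule.

    Assumption: comparators act on adjacent positions (i, i + 1), one pass
    after another, which is how the frames in the transcript appear to evolve.
    """
    data = list(values)
    trace = [list(data)]                  # frame 0: the unsorted input
    for last in range(len(data) - 1, 0, -1):
        for i in range(last):             # comparator between positions i and i + 1
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                trace.append(list(data))  # one frame per swap
    return trace


for frame in bubble_network_trace([3, 2, 5, 4, 1]):
    print(" ".join(str(v) for v in frame))
```

For n inputs this schedule uses n(n-1)/2 comparators (10 for the five values shown), which is one way to read the slide's point about the high number of processing elements.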

Linear Sorters
Sorted insertions
Forwards the incoming value to all nodes
Each node shifts autonomously depending on its neighbors' values
Single clock latency, small logic and regular structure
Streaming input and output
Serial input; need higher throughput
Throughput: the rate at which data elements are processed

Linear sorter example (one insertion per clock cycle):
Incoming value   Sorter contents before insertion
3                (empty)
2                3
5                2 3
4                2 3 5
1                2 3 4 5
(none)           1 2 3 4 5
Output: 1 2 3 4 5

Configurable Linear Sorter
Increase versatility for linear sorters
Configurable:
  Linear sorter depth
  Sorting direction
  Sort on tags (for example, timestamps) rather than data
  User-defined data and tag size

Configurable Linear Sorter
Increase functionality for linear sorters
Detect full conditions
Buffer input while full
Retrieve output serially for streaming
Delete top value, freeing nodes
Augment with left shift functionality
Test tags before deleting them
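The node-local insertion rule and the configuration options listed above can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not the authors' hardware description: the class `LinearSorter`, its method names and the exact keep/insert/shift rule are hypothetical, chosen so that each node decides using only its own entry and its left neighbour's entry.

```python
from typing import List, Optional, Tuple

Entry = Tuple[int, object]  # (tag, data); sorting is done on the tag only


class LinearSorter:
    """Behavioural model of a linear sorter (assumed rule, not the authors' HDL)."""

    def __init__(self, depth: int, ascending: bool = True) -> None:
        self.nodes: List[Optional[Entry]] = [None] * depth  # configurable depth
        self.ascending = ascending                          # configurable direction

    def _after(self, entry: Optional[Entry], tag: int) -> bool:
        """True if `entry` belongs after `tag`; an empty node always does."""
        if entry is None:
            return True
        return entry[0] > tag if self.ascending else entry[0] < tag

    def full(self) -> bool:
        return self.nodes[-1] is not None

    def insert(self, tag: int, data: object = None) -> bool:
        """One clock cycle: broadcast (tag, data) and let every node decide."""
        if self.full():
            return False                  # caller has to buffer the input
        old = self.nodes
        new: List[Optional[Entry]] = []
        for i, own in enumerate(old):
            if not self._after(own, tag):
                new.append(own)           # keep: own tag sorts before the new tag
            elif i == 0 or not self._after(old[i - 1], tag):
                new.append((tag, data))   # this node is the insertion point
            else:
                new.append(old[i - 1])    # shift right: take the left neighbour's entry
        self.nodes = new
        return True

    def top(self) -> Optional[Entry]:
        return self.nodes[0]

    def delete_top(self) -> Optional[Entry]:
        """Left shift: remove the top entry and free the last node."""
        head = self.nodes[0]
        self.nodes = self.nodes[1:] + [None]
        return head

    def tags(self) -> List[int]:
        return [entry[0] for entry in self.nodes if entry is not None]


sorter = LinearSorter(depth=5)
for tag in [3, 2, 5, 4, 1]:
    sorter.insert(tag)
print(sorter.tags())  # [1, 2, 3, 4, 5]
```

A caller that wants to "test tags before deleting them" would inspect `top()` first and call `delete_top()` only when the tag is the one it is waiting for.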

Extended Linear Sorter System
Example with six nodes (Node 1 to Node 6). Sorter contents are listed in sorted order as they stand at the start of each clock cycle.

Clock cycle   Inserted tag   Sorted output   Sorter contents
0             5              -               (empty)
1             7              -               5
2             6              -               5 7
3             2              -               5 6 7
4             1              -               2 5 6 7
5             9              -               1 2 5 6 7
6             3              1               2 5 6 7 9
7             8              2               3 5 6 7 9
8             4              3               5 6 7 8 9
9             -              -               4 5 6 7 8 9
10            -              4               5 6 7 8 9
11            -              5               6 7 8 9
12            -              6               7 8 9
13            -              7               8 9
14            -              8               9
15            -              9               (empty)

Sorter Node Architecture
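The Extended Linear Sorter System trace can be reproduced with a short cycle-by-cycle simulation. This is a sketch under assumptions that are not stated on the slides: the top tag is deleted only when it equals the next expected contiguous tag (starting at 1), a deleted tag appears at the output one cycle later, and an insertion and a deletion can happen in the same cycle. The function name `simulate_extended_sorter` and its parameters are hypothetical.

```python
from bisect import insort
from typing import List, Optional, Tuple

# (clock cycle, inserted tag, sorted output, contents at start of cycle)
Row = Tuple[int, Optional[int], Optional[int], List[int]]


def simulate_extended_sorter(tags: List[int], depth: int = 6,
                             first_tag: int = 1,
                             max_cycles: int = 100) -> List[Row]:
    """Simulate streaming insertion with a tested delete of the top tag."""
    pending = list(tags)          # inputs still waiting (buffered while full)
    contents: List[int] = []      # sorted tags currently held by the nodes
    expected = first_tag          # assumption: outputs must be contiguous
    output: Optional[int] = None  # tag deleted during the previous cycle
    rows: List[Row] = []
    for cycle in range(max_cycles):       # guard against tags that never arrive
        if not pending and not contents and output is None:
            break
        inserted = None
        if pending and len(contents) < depth:
            inserted = pending.pop(0)
        rows.append((cycle, inserted, output, list(contents)))
        output = None
        if contents and contents[0] == expected:  # test the top tag before deleting
            output = contents.pop(0)              # left shift frees a node
            expected += 1
        if inserted is not None:
            insort(contents, inserted)            # sorted insertion
    return rows


for cycle, inserted, output, contents in simulate_extended_sorter(
        [5, 7, 6, 2, 1, 9, 3, 8, 4]):
    print(cycle, inserted, output, contents)
```

Under these assumptions the gap at clock cycle 9 falls out naturally: tag 4 is still being inserted during cycle 8, so the top tag at that point is 5 and nothing is deleted.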

Interleaved Linear Sorter
Increase throughput by using multiple linear sorters with parallel inputs
Interleave parallel inputs into linear sorters through modulo arithmetic
Distribute data evenly among linear sorters to avoid full conditions
Service each linear sorter in round-robin fashion to re-sort their outputs

Interleaved Linear Sorter
ILS width = 4
Four parallel inputs: {13, 04, 03, 10}
After interleaving mod 4: {04, 13, 10, 03}
Routed to linear sorters: [00, 01, 02, 03]
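The modulo routing in the example above can be written out directly. This is a minimal sketch; the function name `interleave` is hypothetical, and the only rule taken from the slides is that a tag goes to linear sorter `tag mod W`.

```python
from typing import Dict, List


def interleave(tags: List[int], width: int) -> Dict[int, List[int]]:
    """Route each tag to linear sorter number (tag mod width)."""
    lanes: Dict[int, List[int]] = {ls: [] for ls in range(width)}
    for tag in tags:
        lanes[tag % width].append(tag)
    return lanes


print(interleave([13, 4, 3, 10], 4))  # {0: [4], 1: [13], 2: [10], 3: [3]}
```

A lane that receives more than one tag in the same cycle cannot accept them all at once, since each linear sorter takes one value per cycle.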

But what happens for inputs {07, 05, 01, 06}?

Interleaved Linear Sorter
Tags inserted per clock cycle. A tag that loses contention for a linear sorter is preempted and inserted on the following cycle.

Clock cycle   Inserted tags
0             13 4 3 10
1             7 5 1 6
2             1 (preempted in cycle 1)
3             2 9 12 15
4             11 0 14 8
5             8 (preempted in cycle 4)
6             (none)

Final contents of each linear sorter:
LS 0: 0 4 8 12
LS 1: 1 5 9 13
LS 2: 2 6 10 14
LS 3: 3 7 11 15

Input Distribution and Latency
Assume uniformly distributed data
With W = 2, there is a 50% chance of an interleaving delay, adding an extra clock cycle
With W > 2, stalls become more likely, adding one or more clock cycles

Average latency for Interleaved Linear Sorters of width W:
ILS width W   Clock cycles
1             1.000
2             1.500
4             2.125
8             2.597
16            3.078

Larger ILS widths allow parallel sorting and increase throughput. However, they have complex routing and additional delays.

Output Logic
Accumulate values before output becomes relevant
Increase linear sorter depth to accumulate more data

Re-sort top values from each linear sorter to ensure continuity (particularly for contiguous values)
Service in round-robin fashion: test the top tag of each linear sorter before deleting
Streaming output

Figure: streaming output across LS 0 to LS 3 (values shown: 0 1 2 3 4 5 6 7 8 9 14 11 13 15). Two sorters are marked OK. Output {8, 9}; wait for 10.

Hardware Implementation

FPGA Area
Xilinx Virtex-5 FPGA
8-bit tags, 8-bit data
17 slices per sorter node
Linear sorter depth of 16

Interleaved Linear Sorter FPGA Area
Linear sorters, W   Total slices   Slices/node   Area overhead   Interleaving area (slices)
1                   278            17.4          2.3%            7
2                   641            20.0          17.6%           113
4                   1294           20.2          18.8%           243
8                   2612           20.4          20.0%           522
16                  5250           20.5          20.6%           1081

FPGA Throughput
Throughput = fclk x ILS width W
Includes logic and routing delay for interleaving data for W linear sorters
Averaged for sorter depth from 1 up to 256 nodes

Interleaved Linear Sorter FPGA Frequency & Throughput
Linear sorters, W   Frequency (MHz)   Throughput (millions/sec)
1                   299               299
2                   275               550
4                   275               1101
8                   132               1058
16                  40                645

FPGA Throughput
W = 16: large logic and routing delays at inputs
The first three cases need a single 6-input LUT for routing. Two and four LUTs are needed for W = 8 and W = 16, respectively.
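The throughput relation quoted above (fclk x ILS width W) is easy to check against the frequency table. The sketch below is only arithmetic on the rounded frequencies from that table; the names `FCLK_MHZ` and `peak_throughput` are hypothetical.

```python
from typing import Dict

# W -> clock frequency in MHz, copied from the frequency table above.
FCLK_MHZ: Dict[int, int] = {1: 299, 2: 275, 4: 275, 8: 132, 16: 40}


def peak_throughput(width: int, fclk_mhz: float) -> float:
    """Millions of values per second if all W sorters accept a value every cycle."""
    return width * fclk_mhz


baseline = peak_throughput(1, FCLK_MHZ[1])
for width, fclk in FCLK_MHZ.items():
    rate = peak_throughput(width, fclk)
    print(f"W={width:2d}  {rate:6.0f} M/s  speedup {rate / baseline:.2f}")
```

Because the table lists frequencies rounded to whole MHz, the products for W = 4, 8 and 16 come out as 1100, 1056 and 640, slightly below the 1101, 1058 and 645 reported in the table.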

Maximum ILS Throughput

Average speedups of 1.0, 1.8, 3.7 and 3.5 against a single linear sorter

Throughput considerations
Interleaving contention results in an average latency which increases with ILS width W

Average latency for Interleaved Linear Sorters of width W:
ILS width W   Clock cycles
1             1.000
2             1.500
4             2.125
8             2.597
16            3.078

Normalized ILS Throughput
Average speedups of 1.0, 1.3, 1.8 and 1.4 against a single linear sorter
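The first rows of the latency table are consistent with a simple occupancy model: if W uniformly distributed tags arrive together and each linear sorter accepts one tag per cycle, the number of cycles needed is the largest number of tags routed to any one sorter. The sketch below computes that expectation exactly. The model itself is an assumption, not something the slides state, and the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial
from typing import List


def prob_max_load_at_most(width: int, m: int) -> Fraction:
    """P(no sorter receives more than m of `width` uniformly routed tags).

    Uses the coefficient of x^W in (sum_{k<=m} x^k / k!)^W, scaled by W!/W^W.
    """
    base = [Fraction(1, factorial(k)) for k in range(min(m, width) + 1)]
    poly: List[Fraction] = [Fraction(1)]
    for _ in range(width):
        size = min(len(poly) + len(base) - 1, width + 1)  # drop terms above x^W
        nxt = [Fraction(0)] * size
        for i, a in enumerate(poly):
            for j, b in enumerate(base):
                if i + j <= width:
                    nxt[i + j] += a * b
        poly = nxt
    coeff = poly[width] if len(poly) > width else Fraction(0)
    return coeff * factorial(width) / Fraction(width) ** width


def expected_cycles(width: int) -> Fraction:
    """Expected largest per-sorter load, i.e. the mean cycles to absorb W tags."""
    return sum((1 - prob_max_load_at_most(width, m) for m in range(width)),
               Fraction(0))


for w in (1, 2, 4):
    print(w, float(expected_cycles(w)))
```

For W = 1, 2 and 4 this gives 1.0, 1.5 and 2.125, matching the table. The same function can be evaluated for W = 8 and W = 16 to compare against the remaining rows. Dividing a peak throughput by this latency is one plausible reading of "normalized" throughput, though the slides do not spell out the calculation.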

Virtex II-Pro implementation
100 MHz bus frequency
Data resided on BlockRAMs
Tested Interleaved Linear Sorter W = 4

Compared against MicroBlaze running quicksort in C
Timing includes bus arbitration, reads and writes over the OPB
Final result is saved in BlockRAM

Three test scenarios
1. MicroBlaze BRAM write & read-back
   MB writes unsorted data to BRAM
   MB sends start signal to ILS
   MB reads back sorted values over OPB
2. MicroBlaze BRAM write
   Same as scenario 1
   No need for read-back into MB
3. No MicroBlaze
   Hardware-only streaming output approach
   No need for OPB requests
   Output consumed by other hardware components immediately

ILS speedup over MicroBlaze
32-bit data, 16-bit tags, 64 values, ILS width of 4
System                    Clock cycles   Speedup
MicroBlaze quicksort      49,982         1
ILS 1: MB write & read    2272           22
ILS 2: MB write           732            68
ILS 3: Hardware-only      30             1666
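The speedup column follows from the cycle counts alone: each entry is the MicroBlaze quicksort cycle count divided by the cycle count of the scenario. A small check, with hypothetical names `BASELINE_CYCLES`, `ILS_CYCLES` and `speedup`:

```python
from typing import Dict

BASELINE_CYCLES = 49_982  # MicroBlaze quicksort, from the table above

# Scenario -> clock cycles, copied from the same table.
ILS_CYCLES: Dict[str, int] = {
    "ILS 1: MB write & read": 2272,
    "ILS 2: MB write": 732,
    "ILS 3: Hardware-only": 30,
}


def speedup(baseline_cycles: int, cycles: int) -> float:
    """Ratio of baseline cycles to measured cycles (higher is faster)."""
    return baseline_cycles / cycles


for name, cycles in ILS_CYCLES.items():
    print(f"{name:24s} {speedup(BASELINE_CYCLES, cycles):8.1f}")
```

Rounded to whole numbers, the ratios are 22, 68 and 1666, in agreement with the table.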

ILS speedup against Sorting Network
16-bit data-tags, 32 values, ILS width of 4
System               Execution time (ns)   Speedup
Batcher odd-even     95                    1.0
ILS Hardware only    123                   0.8
Sorting network requires a static data set
Re-sorts the full set of data upon a single new insertion

Conclusions
Linear sorters are appropriate to overcome sorting network disadvantages, but have limited throughput
Interleaved Linear Sorters allow high throughput and configuration of width, depth, tag and data size to match system requirements
ILS speedup of 1.8 over a regular linear sorter (3.7 for best case)
ILS speedup of 68 against the embedded software counterpart (1666 for best case)

Questions