
Exploiting Stable Data Dependency in Stream Processing Acceleration on FPGAs

YULIANG SUN, Tsinghua University
LANJUN WANG, University of Waterloo
CHEN WANG, Tsinghua University
YU WANG, Tsinghua University

With the unique feature of fine-grained parallelism, FPGAs (field-programmable gate arrays) show great potential for streaming algorithm acceleration. However, the lack of a design framework, restrictions on FPGAs, and ineffective tools impede the utilization of FPGAs in practice. In this study, we provide a design paradigm to support streaming algorithm acceleration on FPGAs. We first propose an abstract model to describe streaming algorithms with homogeneous sub-functions (HSF) and stable data dependency (SDD), which we call the HSF-SDD model. Using this model, we then develop an FPGA framework, PE-Ring, which has the advantages of (1) fully exploiting algorithm parallelism to achieve high performance, (2) leveraging block RAM to serve large-scale parameters, and (3) enabling flexible parameter adjustments. Based on the proposed model and framework, we finally implement a specific converter to generate the register-transfer level (RTL) representation of the PE-Ring. Experimental results show that our method outperforms ordinary FPGA design tools by one to two orders of magnitude. Experiments also demonstrate the scalability of the PE-Ring.

CCS Concepts: • Hardware → Reconfigurable logic and FPGAs; Hardware accelerators; Programmable logic elements

Additional Key Words and Phrases: Stream processing, algorithm model, high-level synthesis

ACM Reference Format:
Yuliang Sun, Lanjun Wang, Chen Wang, and Yu Wang. 2017. Exploiting stable data dependency in stream processing acceleration on FPGAs. ACM Trans. Embedd. Comput. Syst. V, N, Article A (January YYYY), 25 pages.
DOI: http://dx.doi.org/10.1145/0000000.0000000

1. INTRODUCTION
The growing prevalence of big data across all industries and sciences is causing a profound shift in the nature and scope of analytics. Increasingly complex computations are quickly becoming the norm, such as those seen in advanced statistics and machine learning. In databases, these types of tasks are expressed as workflows of user-defined functions (UDFs). However, compared with answering trivial queries (e.g., read and insert), processing UDFs places a heavier workload on CPUs.

The first two authors contributed equally to this work. This work was done when Lanjun Wang was working at IBM Research-China.
This work was supported by 973 project 2013CB329000, National Natural Science Foundation of China (No. 61373026, 61622403), Huawei Technologies Co. Ltd, and the Joint Fund of Equipment Pre-Research and Ministry of Education (No. 6141A02022608).
Authors' addresses: Y. Wang is the corresponding author. Y. Sun and Y. Wang, Electronic Engineering Department, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, China; e-mail: [email protected], [email protected]. L. Wang, David R. Cheriton School of Computer Science, University of Waterloo, Canada; e-mail: [email protected]. C. Wang, National Engineering Lab for Big Data Software, Tsinghua University, China; School of Software, Tsinghua University, China; e-mail: wang [email protected].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
© YYYY Copyright held by the owner/author(s). 1539-9087/YYYY/01-ARTA $15.00
DOI: http://dx.doi.org/10.1145/0000000.0000000


By moving computational tasks closer to data, UDFs avoid performance bottlenecks in data transfer and prevent security and resource management issues in third-party analytics platforms (e.g., Statistical Product and Service Solutions, Apache Streams, etc.).

A promising solution is to offload expensive UDFs to a co-processor in order to accomplish efficient parallel tasks [Sukhwani et al. 2012]. In this study, we focus on streaming algorithms for processing data streams when the input is presented as a sequence of tuples. More particularly, we address accelerating UDFs of streaming algorithms with FPGAs.

Recent studies have demonstrated that FPGAs are suitable for processing streaming algorithms [Mueller et al. 2012; Teubner and Mueller 2011]. This is because tuples from data streams are consumed only once in a pass of a streaming algorithm, which incurs a high cache miss rate on CPUs and GPUs [Kim et al. 2009]. In contrast, FPGAs realize a high degree of parallelism with fine-grained gate-level designs, and the problem of high cache miss rates is avoided. For example, an FPGA design for similar subsequence pattern matching outperforms the CPU-based solution by one to four orders of magnitude and outperforms the GPU-based solution by one to two orders of magnitude [Wang et al. 2013].

However, there are three main challenges when FPGAs are used as co-processors:

Lack of General Algorithm Model. If an FPGA framework can be designed based on a general algorithm model, then developers can map or customize a streaming algorithm to the model. When the algorithm is in compliance with the model, the corresponding FPGA implementation can be generated based on the framework. Unfortunately, most studies on FPGA acceleration are realized in an ad-hoc manner, such as stream join [Teubner and Mueller 2011], frequent item counting [Teubner et al. 2011], sorting networks [Mueller et al. 2012], similar subsequence searching [Sart et al. 2010], etc. Unlike them, recent work has summarized a design pattern named shifter-list [Woods et al. 2015]. This design pattern is based on the view of the hardware on-chip design, without an algorithm-level abstraction.

Restrictions on FPGAs. Here two restrictions on FPGA development are addressed: resource limitations, and recompilation when a parameter changes. First, FPGA implementations are often constrained by hardware resources, especially programmable logic units (i.e., lookup tables (LUTs)) and registers (i.e., flip-flops (FFs)). For example, the Space-Saving algorithm calculates approximate counts of frequent items by using a fixed number of bins [Metwally et al. 2006]. The number of bins is an important parameter that determines the error bound and the number of tracked items. In a previous FPGA-based study [Teubner et al. 2011], 1024 bins consume 76% of LUTs. Since the LUT consumption in this design increases linearly with the number of bins, this solution suffers from poor accuracy or cannot work when users need to discover a large number of distinct items. Second, the FPGA implementation is sensitive to parameter adjustments when parameters of the algorithm are hard-wired on the FPGA. Therefore, the modification of a single parameter might require several hours to recompile the FPGA. Even worse, the FPGA circuits may need to be completely redesigned for parameter updates.

Ineffective High-level Synthesis Tools. In comparison with software development, FPGA implementation usually needs a much longer development period, even for experts. To reduce design efforts, high-level synthesis (HLS) tools have been developed to convert software codes into the RTL representation on FPGAs (a.k.a. C-to-RTL tools). Although HLS tools have been growing fast recently (e.g., Vivado HLS [Vivado 2012] and OpenCL [OpenCL 2013]), the existing tools are designed for general-purpose applications and cannot make full use of the data dependency properties of streaming algorithms. As a result, the performance of the design generated by these HLS tools is not always as good as that of a customized FPGA design.


To address these challenges, we propose a design paradigm for streaming algorithm acceleration on FPGAs. First, according to observations on streaming algorithms, we propose an abstract model for streaming algorithms with homogeneous sub-functions (HSF) and stable data dependency (SDD), which is called the HSF-SDD model. Second, we develop an FPGA framework, PE-Ring, to implement the streaming algorithms that match the HSF-SDD model. Specifically, the PE-Ring organizes its processing elements (PEs) into a ring structure, where each PE is a computational unit that represents a sub-function of the streaming algorithm. The PE-Ring has the advantages of (1) fully exploiting the parallelism of streaming algorithms to achieve high performance, (2) leveraging block RAM to serve a large scale of parameters, and (3) decoupling the parameter storage and computational circuits for flexible parameter adjustments. Third, a specialized C-to-RTL tool for the PE-Ring is proposed to convert a streaming algorithm that matches the HSF-SDD model into RTL codes.

The main contributions are as follows:

— We propose the HSF-SDD model, which is an abstract algorithm model that covers a range of streaming algorithms. The HSF-SDD model has two important properties: (1) operations on each input tuple consist of homogeneous sub-functions (HSF), and (2) sub-functions follow a stable data dependency (SDD) pattern.

— We implement PE-Ring, which is an FPGA framework to support HSF-SDD matching algorithms. The PE-Ring can exploit the inherent parallelism of the algorithm design to break the limitations on FPGAs.

— We provide a customized C-to-RTL tool for the PE-Ring, which generates the RTL representation for algorithms that are compliant with the HSF-SDD model.

— Four algorithms are exemplified to demonstrate how to leverage the HSF-SDD model and the PE-Ring framework in FPGA implementations. One of the use cases, Dynamic Time Warping (DTW), is now published on the IBM OpenPOWER cloud [Informix 2015].

— Experimental results demonstrate that the PE-Ring (1) outperforms the designs from ordinary tools by one to two orders of magnitude, (2) achieves the same scalability as software solutions, and (3) has similar or even lower LUT consumption compared with the shifter-list.

We discuss related work in Sec. 2, propose the HSF-SDD model in Sec. 3, and present the design of the PE-Ring in Sec. 4. Four use cases are shown in Sec. 5. Sec. 6 addresses a customized C-to-RTL tool for the PE-Ring, and Sec. 7 demonstrates the experimental results. Finally, Sec. 8 concludes this paper.

2. RELATED WORK
In industry, major database vendors have products with FPGAs as co-processors, e.g., IBM Netezza [Netezza 2011] and Microsoft Cipherbase [Arasu et al. 2015]. Intel, as Oracle's partner, is also offering a Xeon E5 processor with a coherent FPGA. In academia, studies have presented the advantages of FPGAs as co-processors for database operations such as join [Teubner and Mueller 2011], transaction processing [Arasu et al. 2015], decompression, predicate evaluation [Sukhwani et al. 2012], and other streaming applications [Wei et al. 2017]. However, the lack of suitable abstractions and design patterns prevents the effort invested in one solution from carrying over to another application. For the abstraction model, User-Defined Aggregates (UDA) [Wang and Zaniolo 1999] and Generalized Linear Aggregates (GLA) [Arumugam et al. 2010] are well-known models that make databases "data-centric". However, the abstractions presented in these models are at a coarse-grained level. Thus, they cannot be applied in an FPGA design directly.


For the FPGA design pattern, Glacier [Mueller et al. 2010] is an SQL-to-VHDL compiler. Given a query in an SQL dialect with streaming extensions, Glacier can generate synthesizable VHDL. Another approach is provided by ASC [Mencer 2006], A Stream Compiler, a C++ library allowing designers to optimize the hardware implementation. ASC code is compiled to produce a hardware netlist circuit. Similarly, another synthesis framework [Takenaka et al. 2012] has been developed that generates complex event processing (CEP) engines for FPGAs. Researchers have also provided optimization methods for HLS [Zuo et al. 2013]. A recent study presents a streaming framework named shifter list [Woods et al. 2015], where a single shifter list acts as a basic unit of an algorithm and several shifter lists are assembled into a shifter list framework. The shifter list framework shares the same objective as the PE-Ring, which is to avoid ad-hoc designs.

There are two major differences between the shifter list and our work: the abstraction level and the topology. At the abstraction level, our study presents an algorithm-level abstraction model, which guides software developers to customize their algorithms for FPGA acceleration. At the topology level, in the shifter list the FPGA implementation is mapped as a list: the input data go through the framework from the first PE to the last PE, while the parameters are fixed in each PE. In the PE-Ring, however, the FPGA implementation is mapped as a ring, where the first PE and the last PE are connected; the input data are fixed in a corresponding PE and the parameters go through the PE-Ring. Based on the ring topology, the PE-Ring breaks the limitation of on-chip resources and becomes scalable by leveraging block RAM as FIFOs.

In addition, the systolic array is another type of architecture that can also support the acceleration of streaming algorithms. There are two main differences between the PE-Ring and the systolic array. First, in the systolic array structure, both the input data and the parameters are moved across multiple units during computation. In the PE-Ring, the input data are stably stored in a specific PE, while the parameters and computation functions are moved. For a wide range of streaming algorithms, systolic arrays have idle units in a given cycle, because both the data and the parameters need to be routed during computation. Therefore, compared with the fully parallel PE-Ring structure, the systolic array has a more complex structure and costs more hardware resources to support the same algorithm. This resource waste may result in a decline of clock frequency and performance. Second, we introduce a FIFO to obtain scalability for large-scale streaming algorithms. Therefore, even if the hardware resources cannot simultaneously support all the computations, we can still implement the streaming algorithm with a limited number of PEs. Such scalability is not supported by the standard systolic array structure. Torus systolic arrays may achieve it, but compared with the PE-Ring they again suffer from a more complex structure and lower performance.

3. THE HSF-SDD MODEL
In this section, we discuss our HSF-SDD model, which represents a range of streaming algorithms. These algorithms have two important properties, homogeneous sub-functions (HSF) and stable data dependency (SDD), which are fundamental to this study. Our HSF-SDD model represents the streaming algorithms introduced in Sec. 4.3.2, such as Space-Saving [Metwally et al. 2006] and dynamic time warping (DTW) [Ding et al. 2008].

Let the input of a streaming algorithm be a tuple sequence in order, denoted as s = {si : 1 ≤ i ≤ N}, where si is the i-th tuple. In practice, all tuples in s are homogeneous, and each tuple si is a number or a vector. The model is defined as:

ri = F(si, para, ri-1)    (1)


Fig. 1: Diagram of Data Dependency Pattern.

where ri is the result of F() on an input tuple si. In this work, ri is also called the state vector of the input tuple si. para = {para[j] : 1 ≤ j ≤ M} is a set of pre-defined parameters, ri-1 is the state vector of the previous input tuple si-1, and F() stands for all the operations in a streaming algorithm.

We propose that the function F() has two important properties. First, F() is composed of a set of homogeneous sub-functions, where each sub-function G() outputs a certain item of the current state vector, denoted as ri[j] (1 ≤ j ≤ M). Second, all these sub-functions {G()} follow a stable data dependency pattern, i.e., the coefficients describing the data dependency can be fixed. In particular, these two properties are formulated as:

ri[j] = G(si, ri-1(j-pH, j-pT], ri(j-qH, j-qT], para[j])    (2)

where ri[j] is the j-th item of the current state vector, ri-1(j-pH, j-pT] is a subset of its previous state vector, and ri(j-qH, j-qT] is a subset of the current state vector. The coefficients pH, pT, qH and qT describe the data dependency pattern, which is illustrated in Fig. 1.

There are three special cases for r0 and ri(-qH, 0]. First, r0 = ∅ when ri[j] does not rely on ri-1(j-pH, j-pT]. Second, ri(-qH, 0] = ∅ when ri[j] does not rely on ri(j-qH, j-qT]. Third, when j-qT is negative, ri[j-qT] = ∅. The final result can be the state vector of the last input tuple, as in Space-Saving [Metwally et al. 2006], or only a part of the state vector, as in DTW [Ding et al. 2008], or some aggregations on the state vector, as in time series correlation.

In detail, the coefficients pH and pT (pH ≥ pT) are the head and tail offsets relative to j, which describe the range of the subset in ri-1. The coefficients qH and qT (qH ≥ qT) are the head and tail offsets relative to j, which describe the range of the subset in ri. For example, if pH = 2 and pT = 0, then we have ri-1(j-2, j], which indicates that ri[j] relies on ri-1[j-1] and ri-1[j].
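To make the half-open range concrete, the following minimal C helper (purely illustrative; the function name print_dependency_window is not part of the PE-Ring toolflow) enumerates the items of ri-1 that ri[j] may read for given pH and pT.

#include <stdio.h>

/* Illustration of the half-open range (j-pH, j-pT] in Eq. (2):
 * prints the indices k such that r_{i-1}[k] may be read by r_i[j].
 * For j = 5, pH = 2, pT = 0 it prints 4 and 5, i.e., r_{i-1}[j-1]
 * and r_{i-1}[j], matching the example above. */
void print_dependency_window(int j, int pH, int pT) {
    for (int k = j - pH + 1; k <= j - pT; k++)
        printf("r_{i-1}[%d]\n", k);
}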

Note that unlike pT, whose range is flexible (i.e., pT can be positive, zero, or negative, and Fig. 1 demonstrates a case of pT < 0), qT must be at least 1 to ensure causality in the current state vector generation, because ri[j-1] has to be generated before ri[j]. A special case ri-1(j-pH, j-pT] = ∅ when pH = pT indicates that ri[j] does not depend on any item of its previous state vector. Similarly, ri(j-qH, j-qT] = ∅ when qH = qT indicates that ri[j] does not depend on any item in the current state vector.

Alg. 1 shows the implementation of the HSF-SDD model. In this model, in addition to the tuples s and parameters para, we have two other inputs, r0 and ri(-qH, 0], which are the initial conditions for the calculation of r1 and ri[1]. Given the above four inputs, one pass of a streaming algorithm is implemented by lines 1-5, where F() in Eq. (1) is implemented by lines 2-4.

Take DTW [Ding et al. 2008] as an example to demonstrate the relationship between the algorithm and the HSF-SDD model. DTW is used to measure the similarity between s = {si : 1 ≤ i ≤ N} and a pattern w = {w[j] : 1 ≤ j ≤ M}.


ALGORITHM 1: The HSF-SDD Model
Input: s, para, r0, ri(-qH, 0] for all i
Output: ri[j] for all i > 0, j > 0
1: for all si ∈ s do
2:   for all j from 1 to M do
3:     ri[j] ← G(si, ri-1(j-pH, j-pT], ri(j-qH, j-qT], para[j])
4:   end for
5: end for
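As a software-level illustration of Alg. 1, the following minimal C sketch implements the same one-pass loop; G(), get_tuple(), N, M, para, r and r_pre are placeholders, and the handling of the initial conditions r0 and ri(-qH, 0] is omitted.

int get_tuple(int i);                               /* provided elsewhere     */
int G(int s_i, const int *r_pre, const int *r,
      int j, int para_j);                           /* the sub-function G()   */

/* One pass of an HSF-SDD algorithm (Alg. 1). */
void hsf_sdd_pass(int N, int M, const int *para, int *r, int *r_pre) {
    for (int i = 1; i <= N; i++) {
        int s_i = get_tuple(i);         /* consume the i-th input tuple       */
        for (int k = 0; k < M; k++)     /* the current state vector becomes   */
            r_pre[k] = r[k];            /* the previous one for this tuple    */
        for (int j = 0; j < M; j++)
            /* every item is produced by the same homogeneous sub-function
             * G(), which reads fixed slices of r_pre and r around index j */
            r[j] = G(s_i, r_pre, r, j, para[j]);
    }
}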

For 1 ≤ i ≤ N and 1 ≤ j ≤ M, the recursive processing of the DTW distance is defined as:

D(i, j) = dist(si, w[j]) + min{D(i-1, j), D(i, j-1), D(i-1, j-1)}    (3)

where dist() is a distance measurement on tuples (e.g., Euclidean distance), and the initial conditions are D(0, 0) = 0 and D(i, 0) = D(0, j) = ∞. Based on Eq. (3), DTW can be rewritten as Alg. 2.

ALGORITHM 2: Dynamic Time Warping
Input: s, w, D(0, 0), D(i, 0) for all i, D(0, j) for all j
Output: D(i, M) for all i
1: for all i from 1 to N do
2:   for all j from 1 to M do
3:     di,j ← dist(si, w[j])
4:     D(i, j) ← di,j + min{D(i-1, j), D(i-1, j-1), D(i, j-1)}
5:   end for
6: end for

Alg. 2 shows that all tuples are consumed by lines 2-5 in order. For each input tuple si, lines 3-4 constitute the homogeneous sub-function G() of DTW, and the corresponding state vector ri is D(i, 1), ..., D(i, M), where ri[j] = D(i, j). In the sub-function, pH = 2 and pT = 0 because D(i, j) relies on D(i-1, j) and D(i-1, j-1), and qH = 2 and qT = 1 because D(i, j) depends on D(i, j-1). This dependency pattern is the same for all items in a state vector. Hence, DTW follows the HSF-SDD model, and the mapping relationships between the HSF-SDD model and DTW are listed in Table I.

4. THE PE-RING FRAMEWORK
In this section, we propose a framework to implement streaming algorithms that can be expressed by the HSF-SDD model. Sec. 4.1 introduces the concept of the PE, and Sec. 4.2 shows the design of the framework. Finally, Sec. 4.3 discusses advantages and limitations of the PE-Ring.

Table I: Mapping Relationships between HSF-SDD Model and DTW
HSF-SDD (Alg. 1) | DTW (Alg. 2)
G()              | Lines 3-4 of Alg. 2
ri[j]            | D(i, j)
pH               | 2
pT               | 0
qH               | 2
qT               | 1
para[j]          | w[j]


Fig. 2: Processing Element (PE)

4.1. Processing Element
A PE is the basic module in a PE-Ring framework to implement sub-functions in parallel, as shown in Eq. (2). A PE returns one item ri[j] at a time and outputs all items of a state vector ri gradually.

Fig. 2 displays the design of a PE, which has two components: a function block and a data persister. The function block maps the sub-function G() to circuits, where the detailed design depends on the specific algorithm.

The data persister is a set of pipeline registers on the FPGA that keeps the data dependency pattern in Eq. (2). All registers in the data persister are used to store the input tuple si, the parameter para[j], and the state vectors ri and ri-1. Based on the pattern in Eq. (2), the number of registers used for persisting ri-1 is pH - pT. The number of registers used for persisting ri is different: it is qH - 1 rather than qH - qT. That is to say, when qT > 1, a PE needs some additional registers to persist ri(j-qT, j-1], even though those registers do not contribute to ri[j]. The reason is that discarding them would damage computations on the following items of the state vector. For example, suppose qH = 3 and qT = 2, so ri[j] relies on ri[j-2]; then its next item ri[j+1] relies on ri[j-1], and so on. If ri[j-1] has not been persisted, this PE cannot work on ri[j+1] since ri[j-1] is lost.
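As a rough illustration of this register budget (the macro and array names below are only illustrative), a PE with compile-time coefficients could declare its data persister as follows.

/* Register budget of one PE's data persister, assuming the dependency
 * coefficients of Eq. (2) are fixed at synthesis time.               */
#define P_H 3
#define P_T 0
#define Q_H 3
#define Q_T 2

int prev_regs[P_H - P_T];  /* holds r_{i-1}(j-pH, j-pT]                      */
int curr_regs[Q_H - 1];    /* qH - qT of these feed G(); the extra qT - 1
                              only keep r_i(j-qT, j-1] alive for the
                              computation of the following items             */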

4.2. PE-Ring
Given the design of a PE, the next issue is how to control data movement between PEs. According to Eq. (2), the output of a PE is ri[j], which is the input of its next PE to produce ri+1. Thus, transferring state vectors (i.e., ri) between neighbor PEs is natural. Eq. (2) also shows that ri[j] depends on the parameter para[j]. The communication between neighbor PEs in our design therefore also includes the parameter para[j], which accompanies the corresponding item ri[j]. Furthermore, considering both bandwidth and synchronization issues, a central router is used to deliver an input tuple si to a PE, and si stays in the PE until the state vector ri is generated completely.

4.2.1. Basic PE-Ring. Based on the data movement above, the basic structure of the PE-Ring is shown in Fig. 3, where the output port of each PE connects to the input port of the next PE. Since all PEs are linked as a ring, we name this structure the PE-Ring. In a PE-Ring, a router is used to ship each input tuple to a PE and to schedule when each PE starts data processing.


Fig. 3: Basic PE-Ring

Fig. 4: PE-Ring Running Example on t > 1

Specifically, this router contains a counter and outputs a pointer that represents the index of the PE to start. When a new input tuple arrives, a validator at each PE tests whether its index matches the pointer.

The PE-Ring also leverages RAM to persist the initial boundary conditions (i.e., r0 and ri(-qH, 0] in Alg. 1) and the parameters para. We call the RAM storing initial boundary conditions the INIT RAM, and the RAM storing parameters the PARA RAM. Additionally, a PE-Ring uses a multiplexer to import data for PE I (the first PE).

From the perspective of PE I, the PE-Ring works as follows. In the first cycle, PE I receives both the initial boundary conditions init and the parameter para[1] from the multiplexer, as well as an input tuple s1 from the router. Then, PE I starts to calculate the first item r1[1]. In the next cycle, para[1] and r1[1] are delivered to PE II, while PE I receives para[2] to produce r1[2]. In this manner, PE I works on s1 until r1[M] is obtained.


The above manner is repeated by each PE with the parameters para to produce the state vector ri.

After PE I finishes r1, the multiplexer begins to read from PE R and the router sends sR+1 to PE I, which means PE I starts to process rR+1. Apart from PE I receiving para and init from RAM, since PEs always receive parameters and the previous state vector from their neighbors, there is no difference between processing ri and rR+i.

Given the above workflow of the PE-Ring, we should notice that there is a delay between the start times of PEs. For example, except for the first cycle (in which r0 is ready for PE I), ri[j] cannot be calculated immediately; it has to wait until ri-1[j-pT] is ready (according to Eq. (2)). As discussed in Sec. 3, the tail offset pT can be any integer, and if pT ≥ 0, then operations for ri[j] can start right after ri-1[j]. However, if pT < 0, then operations for ri[j] have to wait -pT cycles after ri-1[j] is ready, since ri[j] relies on ri-1[j + |pT|]. To sum up, PE II starts after PE I with a delay of t cycles (t = max{-pT, 0} + 1), PE III starts after PE II with the same delay t, and so forth. That is to say, in the (2t+1)-th cycle, PE I is computing r1[2t+1] for s1, PE II is computing r2[t+1] for s2, and PE III is computing r3[1] for s3.

Furthermore, pT also influences the maximal degree of parallelism P, which equals ⌈M/t⌉, where t = max{-pT, 0} + 1. In detail, for pT ≥ 0, we have P = M because a PE starts to calculate its ri[j] right after its previous one, and all M PEs work in parallel. That is to say, when PE I is calculating r1[M], PE M is working on rM[1]. For pT < 0, we have P = ⌈M/t⌉ since there are at most ⌈M/t⌉ PEs running at the same time due to the data dependency between adjacent state vectors.
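Both quantities are straightforward to compute; the following small C helper (added only for illustration) evaluates t = max{-pT, 0} + 1 and P = ⌈M/t⌉.

#include <stdio.h>

int start_delay(int pT)            { return (pT < 0 ? -pT : 0) + 1; }
int max_parallelism(int M, int pT) { int t = start_delay(pT);
                                     return (M + t - 1) / t; }   /* ceil(M/t) */

int main(void) {
    /* the setting of Fig. 4 below: M = 7, pT = -1  =>  t = 2, P = 4 */
    printf("t = %d, P = %d\n", start_delay(-1), max_parallelism(7, -1));
    return 0;
}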

Fig. 4 demonstrates how a PE-Ring works when PEs start with a delay t > 1. In this example, we assume that an algorithm follows the HSF-SDD model with pH = 1, pT = -1, qH = 2 and qT = 1, which means ri[j] depends on ri-1[j+1], ri-1[j] and ri[j-1]. Additionally, we assume that the parameter has 7 items (M = 7). Fig. 4 shows all parameter items, input tuples and the corresponding generated items of the state vector in each PE at each cycle. For example, r5[2] is an item of the state vector produced by PE I in the 10th cycle, and the data in dark cells are used to generate it. Due to pT = -1 (t = 2), PE II starts in the 3rd cycle, PE III starts in the 5th cycle, and so forth. Fig. 4 also shows that it is enough to have ⌈M/t⌉ = 4 PEs because more PEs will not increase the degree of parallelism.

4.2.2. FIFO-enabled PE-Ring. In the basic PE-Ring, we suppose that there are enough on-chip resources, and thus the number of PEs R is set to the maximal parallel degree P. That is to say, if pT ≥ 0, then R = M; otherwise, R = ⌈M/t⌉. However, in practice, R is restricted by the configuration of the FPGA (i.e., the number of LUTs and FFs) and the resources used by each PE. When the number of PEs R is limited by on-chip resources, say R < ⌈M/t⌉, PE I cannot finish r1 even after PE R starts.

To solve the problem of LUTs and FFs running out, we propose a FIFO-enabled PE-Ring that uses two FIFOs to store parameters and state vectors, respectively. The FIFO-enabled PE-Ring is illustrated in Fig. 5. These FIFOs are implemented by the block RAM on the FPGA chip.

In the FIFO-enabled PE-Ring, all state vectors generated by PE R, together with the corresponding parameters, are stored into the FIFOs. The router will not send the input tuple sR+1 to PE I until r1[M] is finished by PE I. This indicates that the multiplexer will not read from the FIFOs until all items of para have gone through PE I. When PE I completes r1, it receives the input tuple sR+1 from the router, as well as the state vector rR and the parameters para from the FIFOs. Then PE I continues by producing rR+1. In this way, R-degree parallelism is guaranteed all the time.

Fig. 5: FIFO-enabled PE-Ring

Fig. 6 demonstrates how a FIFO-enabled PE-Ring works when on-chip resources are limited. In this example, we suppose that an algorithm follows the HSF-SDD model with pH = 1, pT = 0, qH = 2 and qT = 1, which indicates that ri[j] depends on ri-1[j] and ri[j-1]. The parameters have 7 items (M = 7), and the maximal parallelism degree is 7 because pT = 0. However, due to limited on-chip resources, the PE-Ring is designed to have 4 PEs. Fig. 6 illustrates how parameters and state vectors are buffered in the FIFOs. In this example, r5[3] is the item produced by PE I in the 10th cycle, and the data in dark cells are used to generate it. Compared with the example in Fig. 4 with enough on-chip resources, the difference is that part of the data (e.g., r4[3] in Fig. 6) comes from the FIFO instead of a neighbor PE. Furthermore, we observe that PE I keeps busy on the calculation of r1 from the 5th cycle to the 7th cycle, while the FIFOs buffer the outputs of PE IV and the corresponding parameters. Finally, the outputs of PE IV are sent to PE I when PE I becomes idle again in the 8th cycle to compute r5.

To sum up, given the on-chip resources (sufficient/insufficient) and the data dependency pattern of an algorithm (pT ≥ 0 / pT < 0), the throughput T of the PE-Ring is:

T = R × f / M    (4)

where R is the number of PEs (the parallelism degree implemented by the PE-Ring), f is the clock frequency, and M is the scale of the parameter (the length of the state vector). Recall that producing one item of the state vector, ri[j], takes one clock cycle, and thus we need M/f to finish the computation of a state vector. If the design has no parallel mechanism, the throughput is f/M. In the PE-Ring, we realize R-degree parallelism, and so the throughput is R times f/M. Based on Eq. (4), we obtain that (1) when pT ≥ 0 (t = 1) and there are enough resources, the maximal throughput of the PE-Ring only depends on f; and (2) when on-chip resources are not enough, the actual throughput drops as the parameter scale M increases.
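As a quick sanity check of Eq. (4), the throughput can be estimated with a one-line C function; the example numbers in the comment are arbitrary and not measured results.

/* Throughput of Eq. (4): T = R * f / M, in tuples per second, assuming
 * one state-vector item is produced per clock cycle as in the PE-Ring. */
double pe_ring_throughput(int R, double f_hz, int M) {
    return (double)R * f_hz / (double)M;
}

/* e.g., pe_ring_throughput(64, 200e6, 1024) gives 12.5 million tuples/s */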

4.3. Discussions

4.3.1. Advantages of the PE-Ring. From the perspective of data movement, there are two approaches to design the framework: parameter-centric and data-centric. In a parameter-centric design, para[j] is fixed in a PE, and si is transferred between neighbor PEs. The shifter list [Woods et al. 2015] follows this approach. Conversely, in a data-centric design, the input tuple si is fixed in a PE and the parameter para[j] is communicated between neighbor PEs.


Fig. 6: PE-Ring Running Example on Insufficient On-chip Resources

The PE-Ring is data-centric, which has advantages in resource utility and design flexibility.

For resource utility, when pT < 0, the maximal parallelism degree of the two designs is ⌈M/t⌉, where t = |pT| + 1. In this case, the PE-Ring only needs ⌈M/t⌉ PEs, while the parameter-centric design still requires M PEs. Thus we have two conclusions. (1) PEs in the parameter-centric design are idle in some cycles because the maximal parallelism degree is ⌈M/t⌉. (2) With the FIFO enabled, the PE-Ring can handle problems with large scales of parameters, because the FIFO persists parameters temporarily while waiting for idle computational resources.

For design flexibility, since the parameters pass through the PE-Ring with the state vector, updated parameters can be automatically sent into the PE-Ring from the PARA RAM during data movement. We only need to flush the PARA RAM when there is a need to update parameters, which greatly saves time and effort.

4.3.2. Coverage of the PE-Ring. In Sec. 3, we use DTW as an example to illustrate how to map an algorithm to the HSF-SDD model. In addition to DTW, we have found that a range of streaming algorithms match the HSF-SDD model, such as edit distance with real penalty, time warp edit distance, skyline, aggregation, K-means, discrete Fourier transform, discrete wavelet transform, etc. In Sec. 5, use cases are shown to explain how to describe an algorithm with our HSF-SDD model and then implement it with the PE-Ring.

However, note that the PE-Ring does not fit all streaming algorithms, because not all of them are in compliance with the HSF-SDD model; this is the limitation of our work. For example, BIRCH (short for Balanced Iterative Reducing and Clustering using Hierarchies) is a streaming algorithm for clustering [Zhang et al. 1996]. This method builds up a tree structure where each leaf node represents a cluster. An input tuple is inserted into the tree by traversing the tree structure top-down to a leaf first and then bottom-up to the root. But the operations in the top-down pass (identify the cluster) and the bottom-up pass (update the hierarchical clustering structure) are not the same. That is to say, the function F() of each input is not composed of a set of homogeneous sub-functions. As a result, BIRCH does not match the HSF-SDD model and cannot be implemented by the PE-Ring.


Fig. 7: PE of DTW

5. USE CASES
In this section, we use four algorithms to demonstrate how to describe an algorithm with our HSF-SDD model and then implement it with the PE-Ring. Sec. 5.1 discusses Dynamic Time Warping (DTW), which is one of the best distance measurements [Ding et al. 2008]. Sec. 5.2 addresses Space-Saving, which solves the problem of frequent item counting [Metwally et al. 2006]. Sec. 5.3 focuses on a sliding-window-based correlation computation. The above three algorithms are single-pass streaming algorithms. For multiple-pass algorithms, we investigate Expectation-Maximization in Sec. 5.4, which is often used in clustering [Bilmes et al. 1998].

5.1. Dynamic Time Warping
In similarity search, a key component is to measure the distance between subsequences and a given pattern. Though hundreds of similarity distance measurements have been proposed in the last decade, recent evidence [Ding et al. 2008] shows that DTW is receiving increasing attention and becoming a widely used distance measurement.

The definition of DTW has been introduced in Sec. 3. Based on the mapping relationship between DTW and the HSF-SDD model shown in Table I, Fig. 7 shows a PE of DTW, which uses three registers to persist D(i, j-1), D(i-1, j) and D(i-1, j-1) and passes the parameter item w[j] between neighbor PEs. Since pT ≥ 0, the PE-Ring implementation of DTW achieves M-degree parallelism when on-chip resources are enough, by storing w in the PARA RAM and the initial boundary conditions D(0, 0), D(i, 0) and D(0, j) in the INIT RAM.

DTW is implemented as a UDF in IBM Informix. The prototype of the FPGA version implemented by the PE-Ring has been published on a public cloud [Informix 2015]. This prototype demonstrates the acceleration capability of the PE-Ring in healthcare and Internet of Things scenarios by comparing with the software solution.

5.2. Space-Saving
Frequent item counting picks out the items that occur most frequently. The problem is defined as: for a stream s of size N drawn from an alphabet A, the φ-frequent items include every item k whose frequency fk is at least φN.

The size of the alphabet in practical applications (such as finding frequently clicked advertisements on the Internet) poses challenges for answering exact queries, because the memory cannot persist all distinct items when the alphabet is too large. This motivates analysts to devise an approximate approach for the problem.


The ε-approximate problem is defined to return a set of items F such that for all items k ∈ F, fk > (φ - ε)N, and there is no k ∉ F with fk ≥ φN.

The state-of-the-art algorithm for approximate frequent item counting is Space-Saving [Metwally et al. 2006]. In Space-Saving, frequent items are kept and counted in bins. Each bin maintains two components: item and count. The number of bins M is set based on the error bound ε, where M = 1/ε.

ALGORITHM 3: Space-Saving
Input: s, M, bini[0] for all i, bin0[j] for all j
Output: binN[j] for all j
1: for all i from 1 to N do
2:   for all j from 1 to M do
3:     bini[j] ← bini-1[j]
4:     if bini[j].item = si then
5:       bini[j].count ← bini[j].count + 1, si ← Null
6:     end if
7:     if bini[j-1].count < bini[j].count then
8:       swap content of bini[j-1] and bini[j]
9:     end if
10:    if j = M and si ≠ Null then
11:      bini[M].count ← bini[M].count + 1
12:      bini[M].item ← si
13:    end if
14:  end for
15: end for

To remove the heterogeneous features in the sub-functions, we use a bubble-sort optimized Space-Saving algorithm [Teubner et al. 2011], shown in Alg. 3. Here, we annotate the bins with the index of the input tuple as a subscript to show the status of the bins, represented as bini[j]. As a result, for Space-Saving, the state vector of the input tuple si is the bin array bini = {bini[j] : 1 ≤ j ≤ M}. The final result is the bin array of the last input tuple sN. The initial conditions in Space-Saving are bini[0].count = ∞ for 1 ≤ i ≤ N and bin0[j].count = 0 for 1 ≤ j ≤ M.

Alg. 3 accomplishes the approximate frequent item counting. The item comparison and the count increment (if necessary) are in lines 4-6. Together with them, lines 7-9 compare the counts of neighbor bins to push the item with the smallest count toward the last bin. That is to say, after line 9, bini[j-1] is ready but bini[j] still needs to be compared with bini[j+1]. Besides, lines 10-13 place the new item in the last bin.

From the view of software developers, Alg. 3 is not efficient because lines 7-13 are redundant logic for some input tuples. But this logic makes Space-Saving fully match the HSF-SDD model, which improves the algorithm parallelism on FPGAs. The mapping relationships between the HSF-SDD model and Space-Saving are listed in Table II. According to the swap step (lines 7-9), bini[j] relies on bini[j-1], which determines qH = 2 and qT = 1. Also, pH = 1 and pT = -1 because (1) bini[j] starts when bini-1[j] is ready in line 3; and (2) bini-1[j] is ready only after comparing with bini-1[j+1] in the swap step in line 8. Therefore, bini[j] relies on bini-1[j], bini-1[j+1] and bini[j-1].

A PE of Space-Saving to count frequent items is illustrated in Fig. 8. Two registers are set to persist bini[j] and bini[j-1]. In the practical implementation, as shown in Fig. 8, the communication between neighbor PEs does not include the parameter M as in the general PE design in Fig. 2, because the parameter M is the same in all PEs.


Table II: Mapping Relationships between HSF-SDD Model and Space-Saving
HSF-SDD (Alg. 1) | Space-Saving (Alg. 3)
G()              | Lines 3-13 of Alg. 3
ri[j]            | bini[j]
pH               | 1
pT               | -1
qH               | 2
qT               | 1
para[j]          | M
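To cross-check this mapping, the Space-Saving sub-function G() (Alg. 3, lines 3-13) can be sketched in C in the f.c style of Sec. 6 as below; the bin_t struct, the NULL_ITEM sentinel and the 0-indexed bins are simplifications, and the bini[0] boundary is replaced by the j > 0 guard.

typedef struct { int item; int count; } bin_t;

#define NULL_ITEM (-1)

/* One invocation of G() for bin index j (0-indexed), given the bins of
 * the previous tuple in bin_pre and the bins being built in bin.      */
void space_saving_step(int *s_i, bin_t *bin, const bin_t *bin_pre,
                       int j, int M) {
    bin[j] = bin_pre[j];                           /* Alg. 3, line 3      */
    if (*s_i != NULL_ITEM && bin[j].item == *s_i) {
        bin[j].count += 1;                         /* lines 4-6           */
        *s_i = NULL_ITEM;
    }
    if (j > 0 && bin[j-1].count < bin[j].count) {  /* lines 7-9: push the */
        bin_t tmp = bin[j-1];                      /* smallest count      */
        bin[j-1] = bin[j];                         /* toward the last bin */
        bin[j]   = tmp;
    }
    if (j == M - 1 && *s_i != NULL_ITEM) {         /* lines 10-13         */
        bin[M-1].count += 1;
        bin[M-1].item   = *s_i;
    }
}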

Fig. 8: PE of Space-Saving

Since pT = -1, a PE starts 2 cycles after its previous one, as discussed in Sec. 4.2. That is to say, ignoring limited on-chip resources, PE I starts to compute the first input tuple s1 in the 1st cycle. PE II starts to compute the second input tuple s2 in the 3rd cycle because max{-pT, 0} + 1 = 2. PE III starts to calculate the third input tuple s3 in the 5th cycle, and so forth. Moreover, the maximal parallelism degree of the PE-Ring based Space-Saving is ⌈M/2⌉.

5.3. Correlation
In time series analytics, the Pearson correlation coefficient is a measure of the linear correlation between two subsequences P = p1, ..., pM and Q = q1, ..., qM, which is defined as:

ρ = (M Σ pm qm - Σ pm Σ qm) / (sqrt(M Σ pm² - (Σ pm)²) × sqrt(M Σ qm² - (Σ qm)²))    (5)

Accelerating a single correlation calculation between two subsequences is not very rewarding. Let us instead consider the following problem: given a pattern subsequence Q and a stream s, obtain how the correlation between the pattern and the stream varies with time. This is a sliding window problem with a window length of M and a sliding step size of 1, which can be implemented as shown in Alg. 4. The output ρi represents the correlation coefficient between Q and the subsequence si-M+1, ..., si.

From Alg. 4, we observe that all tuples from the input data stream are consumed in order. For each input tuple si, there is a corresponding state vector ri = {C(i, j) : 1 ≤ j ≤ M}, where C(i, j) contains three components: the product sum prod(i, j), the quadratic sum squa(i, j), and the sum sum(i, j). The initial condition of the correlation is that every component of C(0, 0), C(i, 0) for all i, and C(0, j) for all j is set to 0. By mapping Alg. 4 to Alg. 1 based on state vectors, we observe that Alg. 4 is in compliance with our HSF-SDD model. In Table III, we give all the mapping relationships between the HSF-SDD model and the Pearson correlation with a sliding window.

Regarding data formatting, in practice line 7 of Alg. 4 is executed by a floating-point division operator outside the PE-Ring to produce the final result. Inside the PE-Ring, the fixed-point format is used to execute the multiply-accumulate operations of lines 3-5 in Alg. 4.


ALGORITHM 4: Pearson Correlation on Sliding Window
Input: s, Q, C(0, 0), C(i, 0) for all i, C(0, j) for all j
Output: ρi for i = M, ..., N
1: for all i from 1 to N do
2:   for j = 1 : M do
3:     prod(i, j) ← prod(i-1, j-1) + q[j]·si
4:     squa(i, j) ← squa(i-1, j-1) + si²
5:     sum(i, j) ← sum(i-1, j-1) + si
6:   end for
7:   ρi ← (M·prod(i, M) - sum(i, M)·sumQ) / (varQ · sqrt(M·squa(i, M) - sum(i, M)²))
8: end for

Table III: Mapping Relationships between HSF-SDD Model and Correlation
HSF-SDD (Alg. 1) | Correlation (Alg. 4)
G()              | Lines 3-5 of Alg. 4
ri[j]            | C(i, j)
pH               | 2
pT               | 1
qH               | ∅
qT               | ∅
para[j]          | q[j]
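Likewise, the correlation sub-function (Alg. 4, lines 3-5) can be sketched in C in the f.c style of Sec. 6 as below; the C_state struct and the 64-bit accumulators stand in for the fixed-point format, and the final division of line 7 stays outside the PE-Ring.

typedef struct {
    long long prod;   /* running sum of q[j] * s_i */
    long long squa;   /* running sum of s_i * s_i  */
    long long sum;    /* running sum of s_i        */
} C_state;

/* One invocation of G() for window position j (0-indexed):
 * C(i, j) depends only on C(i-1, j-1), i.e., pH = 2 and pT = 1. */
void corr_step(int s_i, const int *q, C_state *r, const C_state *r_pre,
               int j) {
    C_state prev = (j > 0) ? r_pre[j-1] : (C_state){0, 0, 0};
    r[j].prod = prev.prod + (long long)q[j] * s_i;
    r[j].squa = prev.squa + (long long)s_i * s_i;
    r[j].sum  = prev.sum  + s_i;
}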

5.4. Expectation-Maximization
Gaussian Mixture Models (GMM) are powerful for probability density modeling and soft clustering [Bilmes et al. 1998]. A GMM with M Gaussian components can be represented as:

p(si) = Σ_{j=1}^{M} ωj G(si | μj, Θj)    (6)

where the input si is a vector with dimension D, represented as si = (si,1, ..., si,D), ωj is the weight of the j-th component, and G(si | μj, Θj) is a Gaussian probability density function (PDF)¹ with mean μj and variance matrix Θj.

One traditional parameter estimation solution is the Expectation-Maximization (EM) algorithm. EM for GMM (EM-GMM) [Bilmes et al. 1998] estimates the parameters in an iterative manner alternating between an E-step and an M-step, starting from a randomly generated initial parameter set. [Guo et al. 2012] proposed a streaming EM-GMM algorithm with multiple passes, shown in Alg. 5 with a fixed-point adjustment.

In this algorithm, lines 4-7 are the E-step, which computes the responsibility value of si; lines 15-20 are the M-step, which estimates the new parameters for the next round of iteration; and lines 8-14 generate some internal results for the M-step before the E-step finishes.

In this study, we observe that the E-step (lines 4-7) matches our HSF-SDD model because data are consumed chronologically by a homogeneous approach in each EM iteration. Following the multiply-accumulate pattern, we have the corresponding state vector ri[j] = yi,j. The final result is ri[M] = yi,M, which is used for the further computation (lines 8-14). According to lines 5-6, ri[j] = ri[j-1] + ωj G(si | μj, Θj), and so qH = 2 and qT = 1 because ri[j] relies on ri[j-1], and pT = pH because ri[j] does not rely on any item from ri-1.

¹ We assume that the covariance matrices of all Gaussian components are diagonal. This not only simplifies the evaluation of the Gaussian PDF, but also keeps or even improves the accuracy and robustness of the algorithm [Kumar et al. 2009].


The mapping relationships between the HSF-SDD model and EM-GMM are displayed in Table IV. Since ri does not depend on ri-1, the neighbor communication between PEs of EM-GMM includes only the parameters {ωj, μj, Θj}.

ALGORITHM 5: EM-GMM
Input: s, M
Output: {ωj, μj, Θj} for all j
1: while stop condition not met do
2:   for all i from 1 to N do
3:     ρi,j ← 0, τi,j ← 0, yi,0 ← 0
4:     for all j from 1 to M do
5:       gi,j ← ωj G(si | μj, Θj)
6:       yi,j ← yi,j-1 + gi,j
7:     end for
8:     for all j from 1 to M do
9:       ηi,j ← ηi-1,j + gi,j / yi,M
10:      for all d from 1 to D do
11:        ρi,j,d ← ρi-1,j,d + (gi,j / yi,M) si,d
12:        τi,j,d ← τi-1,j,d + (gi,j / yi,M) si,d²
13:      end for
14:    end for
15:  end for
16:  for all j from 1 to M do
17:    ωj ← ηN,j / N
18:    for all d from 1 to D do
19:      μj,d ← ρN,j,d / ηN,j,   σ²j,d ← (τN,j,d ηN,j - ρ²N,j,d) / η²N,j
20:    end for
21:  end for
22:  Θj ← diag(σ²j,d)
23: end while
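For reference, the E-step sub-function (Alg. 5, lines 5-6) can be sketched in C in the f.c style of Sec. 6 as below; gaussian_pdf() and the flat parameter layout are illustrative choices for evaluating ωj G(si | μj, Θj) with diagonal covariance.

#include <math.h>

/* Diagonal-covariance Gaussian PDF (illustrative helper). */
static double gaussian_pdf(const double *s_i, const double *mu_j,
                           const double *var_j, int D) {
    double expo = 0.0, norm = 1.0;
    for (int d = 0; d < D; d++) {
        double diff = s_i[d] - mu_j[d];
        expo += diff * diff / var_j[d];
        norm *= 2.0 * 3.14159265358979323846 * var_j[d];
    }
    return exp(-0.5 * expo) / sqrt(norm);
}

/* One invocation of G() for component j: y[j] = y[j-1] + w_j * G(...),
 * so qH = 2, qT = 1, and there is no dependence on the previous state
 * vector (pH = pT).                                                   */
void em_e_step(const double *s_i, const double *w, const double *mu,
               const double *var, double *y, int j, int D) {
    double g = w[j] * gaussian_pdf(s_i, &mu[j * D], &var[j * D], D);
    y[j] = (j > 0 ? y[j-1] : 0.0) + g;
}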

6. A CUSTOMIZED C-TO-RTL TOOL
This section proposes the workflow to generate the RTL codes of the PE-Ring framework, which is displayed in Fig. 9. The inputs of this workflow include C codes of F() in Eq. (1) and the parameters for the PE-Ring setup (e.g., word length, the number of PEs, and FIFO size). The output is the RTL codes. In this workflow, techniques from the Low Level Virtual Machine (LLVM, a middle layer of a compiler system) [Lattner and Adve 2004] are used to detect whether an algorithm is in compliance with our proposed HSF-SDD model.

In the following, we first discuss the design philosophy of our C-to-RTL tool (Sec. 6.1) and then show how to detect whether an algorithm matches our HSF-SDD model (Sec. 6.2). Finally, we describe the template used to generate hardware codes (Sec. 6.3).

6.1. Design Philosophy
To implement a C-to-RTL tool for the PE-Ring framework, there are two basic requirements: (1) it must be feasible to use the LLVM technique to detect whether a streaming algorithm is in compliance with the HSF-SDD model, and (2) the tool must guarantee a correct result. Here, correctness means not only recognizing an HSF-SDD compliant algorithm, but also determining the correct model coefficients (i.e., pH, pT, qH, qT as shown in Eq. (2)).


Table IV: Mapping Relationships between the HSF-SDD Model and EM-GMM

HSF-SDD (Alg. 1)  |  EM-GMM (Alg. 5)
G()               |  Lines 5-6 of Alg. 5
ri[j]             |  yi,j
pH; pT            |  not applicable (ri[j] does not depend on ri-1, so pT = pH)
qH                |  2
qT                |  1
para[j]           |  {ωj, μj, Θj}

Fig. 9: Customized Workflow for C to PE-Ring. (Figure: the LLVM-based stages, Loop Extraction, Critical Loop Analysis, and Dependence Detection, analyze the C codes of the algorithm to extract the sub-function G() and the coefficients pT, pH, qT, qH; these drive the register design and the PE module built by logic design (C-RTL tools), which, together with the parameter setup for the PE-Ring (e.g., word length, FIFO size), instantiate the PE-Ring template and produce the RTL.)

void main() {
    FILE *fp;
    int k;
    ... ...
    fp = fopen("output.txt", "a");
    while (get_data(s_i)) {
        for (k = 0; k < M; k++) { r_pre[k] = r[k]; r[k] = 0; }
        DTW(s_i, w, M, r_pre, r);
        fprintf(fp, "%d\n", r[M-1]);
    }
    ... ...
}

(a) main.c: Main Function of DTW

void DTW(int s_i, int *w, int M, int *r_pre, int *r) {
    int min, j = 0;
    r[j] = abs(s_i - w[j]);
    for (j = 1; j < M; j++) {
        min = (r[j-1] < r_pre[j-1]) ? r[j-1] : r_pre[j-1];
        min = (min < r_pre[j]) ? min : r_pre[j];
        r[j] = abs(s_i - w[j]) + min;
    }
    return;
}

(b) f.c: F() of DTW

Fig. 10: C Code of DTW

For the first requirement, the input software codes need to be tuned by users. We put two constraints on the input. First, the input must consist of two files: one is the main function of the algorithm, whose filename is "main.c", and the other is F() of the algorithm, whose filename is "f.c". According to Alg. 1, F() is the function called inside the top-level loop over input tuples in the main function. Second, in the input files, the variable name "r" is fixed to represent the state vector of the current input tuple (in the HSF-SDD model, we call the intermediate result the state vector). The variable name "r_pre" is likewise fixed to represent the intermediate result of the previous input tuple. Fig. 10 shows example codes of DTW that obey these two constraints. The main function is shown in Fig. 10(a), in which the top-level "while" loop contains all operations related to an input tuple. Function F() is shown in Fig. 10(b), where "r[j]" represents ri[j] in the HSF-SDD model, which depends on the input tuple "s_i", a parameter item "w[j]", "r[j-1]", "r_pre[j-1]", and "r_pre[j]".




For the second requirement, the input algorithm must be analyzed to summarize all operations on each parameter item. For example, in Sec. 5.2, Alg. 3 is customized from the original Space-Saving algorithm [Metwally et al. 2006] to make it fit for the FPGA implementation. Although there are some redundant operations in lines 7-13, the customized version is compliant with the HSF-SDD model and is easily implemented by the PE-Ring.

6.2. Model Detection
We use LLVM to detect whether a streaming algorithm matches our proposed HSF-SDD model. A typical LLVM library defines the data flow diagram (DFD) of an algorithm and the addresses of variables. In LLVM, passes perform the transformations and optimizations that make up the compiler and build the analysis results used by these transformations. In our C-to-RTL implementation, we design three passes for HSF-SDD model detection, as displayed in Fig. 9: loop extraction, critical loop analysis, and data dependency detection. Given an algorithm, the model detection works as follows:

— Loop Extraction analyzes the data dependency in "main.c" based on its DFD and searches for jumps between code blocks. In the DFD, a closed path corresponds to a loop in the algorithm. All closed paths are detected so that the critical one can be determined in the next step.

— Critical Loop Analysis determines the sub-function G() of Alg. 1. After all the loops are identified, we pick the critical one, which contains all operations that update ri[j]. This means that no operation outside the critical loop updates the content at the address of ri[j]. The address of ri[j] is determined by the fixed variable name "r" and the offset j. Specifically, the check starts from the most frequently used loop, because it is the innermost loop that updates ri[j]. If the most frequently used loop does not cover all operations, the second most frequently used loop is checked, and so forth, until the loop covering all operations on ri[j] is found.

— Dependency Detection is based on the addresses of the state vectors ri and ri-1, which are determined by the fixed variable names "r" and "r_pre" respectively. We keep the position of the currently updated item of ri (e.g., ri[j]) and then find the maximum and minimum positions of the items in ri and ri-1 that are required to update ri[j]. From these positions, we obtain the offsets pH, pT, qH, and qT in the sub-function G() (a conceptual sketch follows this list).
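To make the last step concrete, the sketch below mimics Dependency Detection in plain C; the actual implementation is an LLVM pass over the DFD, and the final mapping from the recorded head/tail positions to pH, pT, qH, and qT follows Eq. (2), which is not restated here. The structure and helper names are our own.

    /* Conceptual sketch (not the actual LLVM pass) of Dependency Detection:
     * for every read of "r" or "r_pre" performed while updating r[j], record
     * the maximum and minimum offsets relative to j. These head/tail positions
     * are what determine pH, pT, qH, and qT according to Eq. (2). */
    #include <limits.h>
    #include <stdio.h>

    typedef struct {
        int head;   /* maximum offset (j - accessed index) observed */
        int tail;   /* minimum offset (j - accessed index) observed */
    } dep_range_t;

    static void record_access(dep_range_t *dep, int j, int accessed_index) {
        int offset = j - accessed_index;
        if (offset > dep->head) dep->head = offset;
        if (offset < dep->tail) dep->tail = offset;
    }

    int main(void) {
        /* DTW (Fig. 10(b)): updating r[j] reads r[j-1], r_pre[j-1], and r_pre[j]. */
        dep_range_t dep_r     = { INT_MIN, INT_MAX };  /* offsets into r     */
        dep_range_t dep_r_pre = { INT_MIN, INT_MAX };  /* offsets into r_pre */
        int j = 5;                                     /* any position inside the loop */

        record_access(&dep_r,     j, j - 1);   /* r[j-1]     -> offset 1 */
        record_access(&dep_r_pre, j, j - 1);   /* r_pre[j-1] -> offset 1 */
        record_access(&dep_r_pre, j, j);       /* r_pre[j]   -> offset 0 */

        printf("r:     head=%d tail=%d\n", dep_r.head, dep_r.tail);
        printf("r_pre: head=%d tail=%d\n", dep_r_pre.head, dep_r_pre.tail);
        return 0;
    }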

If no sub-function G() can be found after model detection, the algorithm does not match the proposed HSF-SDD model and is hard to implement with the PE-Ring framework on FPGAs.

6.3. Hardware Code Generation
The hardware codes of a single PE can be generated from the sub-function of an algorithm that passes the model detection phase. Recall that, as discussed in Sec. 4.1, the function block and the data persister are the two key components of the PE design. For the function block, an ordinary C-to-RTL tool (e.g., Vivado HLS [Vivado]) can be used to transform the software codes of the sub-function G() into the corresponding hardware codes. For the data persister, the pipeline registers in a PE are designed to store the coefficients pH, pT, qH, and qT, as displayed in Fig. 2.
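As an illustration of this step (our sketch, not the code the tool actually emits), the DTW sub-function G() from Fig. 10(b) could be isolated as a standalone function and handed to Vivado HLS. PIPELINE is a standard Vivado HLS directive; applying it here, and the function name, are our assumptions.

    /* Sketch (our illustration) of the DTW sub-function G() isolated from
     * Fig. 10(b) so that an ordinary C-to-RTL tool such as Vivado HLS can
     * synthesize it into the function block of a single PE. */
    #include <stdlib.h>

    int dtw_G(int s_i, int w_j, int r_jm1, int r_pre_jm1, int r_pre_j) {
    #pragma HLS PIPELINE II=1
        /* r[j] = |s_i - w[j]| + min(r[j-1], r_pre[j-1], r_pre[j]) */
        int min = (r_jm1 < r_pre_jm1) ? r_jm1 : r_pre_jm1;
        min = (min < r_pre_j) ? min : r_pre_j;
        return abs(s_i - w_j) + min;
    }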

Fig. 11 shows our proposed design template for generating the hardware codes of the PE-Ring implementation. The main component in this template is the module PERING. There are seven parameters in this module, which are set by engineers. The parameter PE_NUM indicates the number of PEs that will be instantiated in the PE-Ring, which is determined by the FPGA configuration, the resource cost of a single PE, and the maximal parallelism degree of the algorithm (see Sec. 4.2).




(a) Template Overview in Structure: the R PEs (PE I, PE II, ..., PE R) are chained through their state ports (r1, ..., rR) and parameter ports (para); the state FIFO and para FIFO (①) and the INIT RAM and PARA RAM (②) feed a MUX (④), a router (③) selects which PE (⑤) receives the incoming tuples s1, ..., sR.

(b) Template Overview in Pseudo-code:

module PERING #(parameter PE_NUM, parameter para_LTH, parameter state_LTH,
                parameter PARA_RAM_SIZE, parameter INIT_RAM_SIZE,
                parameter para_FIFO_SIZE, parameter state_FIFO_SIZE)
              (input data, parameter, initial; output result);

    state_FIFO (input r[PE_NUM-1],    output state_f);
    para_FIFO  (input para[PE_NUM-1], output para_f);
    INIT_RAM   (input initial,        output state_r);
    PARA_RAM   (input pattern,        output para_r);
    TS_router  (output pointer);
    MUX        (input state_f, state_r, para_f, para_r,
                output state_in, para_in);

    PE First_PE();
    genvar cnt;
    generate
        for (cnt = 1; cnt < PE_NUM-1; cnt = cnt+1) begin:
            PE Middle_PE();
        end
    endgenerate

    result = r[pointer];
endmodule

(c) Interface of First PE:

PE First_PE(.data_i        (data),
            .State_i       (state_in),
            .Para_i        (para_in),
            .Result_o      (r[0]),
            .Para_o        (para[0]),
            .Tuple_Valid_i (pointer == 0));

(d) Interface of Other PEs:

PE Middle_PE(.data_i        (data),
             .State_i       (r[cnt-1]),
             .Para_i        (para[cnt-1]),
             .Result_o      (r[cnt]),
             .Para_o        (para[cnt]),
             .Tuple_Valid_i (pointer == cnt));

Fig. 11: Template of PE-Ring

If the FPGA on-chip resource is adequate, then PE_NUM = R. Otherwise, PE_NUM is the maximum value that the FPGA can handle. The parameters para_LTH and state_LTH specify the word lengths of the parameters and the state vectors in G() respectively. The parameters PARA_RAM_SIZE and INIT_RAM_SIZE define the RAM sizes used when instantiating the PARA_RAM and the INIT_RAM respectively. The last two parameters, para_FIFO_SIZE and state_FIFO_SIZE, define the FIFO sizes used to buffer parameters and state vectors when instantiating the para_FIFO and the state_FIFO respectively. Note that all parameters related to FIFOs are optional (i.e., if they are provided, a FIFO-enabled PE-Ring is built; otherwise, a basic PE-Ring is built).

In addition to the parameters, we also list the interfaces of the five components of the PE-Ring, which are illustrated in Fig. 11(a). The first two components (i.e., the FIFOs ① and the RAMs ②) are implemented with on-chip IP cores. The router ③ and the multiplexer ④ are designed based on the functions explained in Sec. 4.2. The last component ⑤ is the PE design generated by the ordinary C-to-RTL tools. According to Fig. 11(c) and 11(d), the interface of the first PE has some differences from the others.




Table V: Selected Characteristics of FPGA XC7V485T

Lookup tables (6-to-1 LUTs):        607,200
Flip-flops (FF, 1-bit registers):   303,600
Slices (4 LUTs, 8 FFs):             75,900
Block RAM (total kbit):             37,080
DSP slices:                         2,800

This is because the inputs of the first PE include the output signals of the multiplexer, called MUX in Fig. 11(a), while the inputs of the other PEs come from their preceding PEs.

The purpose of this template, as well as of the customized C-to-RTL tool, is to provide a PE-Ring implementation with little human effort. For a more efficient solution, we still encourage engineers, especially experts, to optimize the specific design.

7. EVALUATION
We experimentally evaluate the PE-Ring on the use cases introduced in Sec. 5. The PE-Ring is compared with: (1) the shifter list [Woods et al. 2015], the state-of-the-art FPGA design template; (2) FPGA implementations produced by an HLS tool, for which we use Vivado HLS 2015.4 [Vivado]; and (3) software solutions on the CPU, including single-thread and multi-thread versions.

7.1. Experimental Setup
Data Sets. We select two real and three synthetic data sets for the different use cases. UCR Suite [UCRSuite 2012] is used for DTW; it contains a day-long electrocardiograph tracing with 20M tuples. The data set used in Space-Saving is WebDocs [WebDocs 2003], a collection of web documents with 10M tuples. For Correlation, we generate a synthetic data set of 10M tuples with MATLAB. Each tuple from the above data sets has one dimension with a word length of 32 bits. The data used in Correlation are a superposition of a sinusoidal wave and white noise of different intensities. For EM-GMM, to evaluate higher-dimensional data, two 10M-tuple data sets are generated, where the dimension of tuples is set to 3 and 6 respectively. The word length of each dimension in these two data sets is also 32 bits. The data fitting the Gaussian mixture model framework are generated by MATLAB; more specifically, we randomly sample the data sets from real GMMs with added random noise.
Environments. The FPGA designs are implemented on a Xilinx Virtex-7 VC707 Evaluation Kit Board with an XC7V485T FPGA and a PCI Express (PCIe) x8 Gen2 edge connector. Table V lists the available resources of the FPGA. The workstation equipped with the FPGA board runs Ubuntu 14.04 and has a 2.90GHz Intel i7-4600M CPU and 8GB of DDR3 memory. To transfer data between the CPU and the FPGA, we use an open-source CPU-FPGA PCIe framework called RIFFA [RIFFA 2013]. For all use cases, the input is a data stream and the access pattern is sequential reads. Thus, the CPU reads the file and transfers the data to the FPGA through RIFFA, and then the FPGA runs the streaming algorithms.
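For readers unfamiliar with RIFFA, the following is a minimal host-side sketch of how a data stream could be pushed to the FPGA. It is our illustration based on the RIFFA 2.x C API (fpga_open, fpga_send, fpga_recv, fpga_close), not the authors' test harness; the channel number, chunk size, input filename, and result length are hypothetical, and the exact call signatures should be checked against the installed RIFFA release.

    /* Host-side sketch (assumption, not the authors' harness) of streaming
     * 32-bit tuples to the PE-Ring over PCIe with the RIFFA 2.x C API.
     * CHNL, CHUNK, RESULT_LEN, and "stream.dat" are hypothetical placeholders. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <riffa.h>

    #define CHNL       0          /* hypothetical RIFFA channel            */
    #define CHUNK      (1 << 20)  /* tuples sent per fpga_send() call      */
    #define RESULT_LEN 1024       /* hypothetical length of result buffer  */

    int main(void) {
        fpga_t *fpga = fpga_open(0);             /* first FPGA in the system */
        if (!fpga) { fprintf(stderr, "fpga_open failed\n"); return 1; }

        unsigned int *tuples = malloc(CHUNK * sizeof(unsigned int));
        unsigned int *result = malloc(RESULT_LEN * sizeof(unsigned int));
        FILE *in = fopen("stream.dat", "rb");
        if (!in) { fprintf(stderr, "cannot open input\n"); return 1; }

        size_t n;
        while ((n = fread(tuples, sizeof(unsigned int), CHUNK, in)) > 0) {
            /* len is in 32-bit words; last = 1 marks the end of a transaction */
            fpga_send(fpga, CHNL, tuples, (int)n, 0, 1, 0);
        }
        /* read back the results computed on the FPGA */
        fpga_recv(fpga, CHNL, result, RESULT_LEN, 0);

        fclose(in);
        free(tuples);
        free(result);
        fpga_close(fpga);
        return 0;
    }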

The C codes used in the software solutions are also used in the simple HLS configuration. The software solutions are executed on the same workstation, whose Intel i7-4600M CPU supports eight threads; based on this, we also present 8-thread solutions. All use cases can be implemented by dividing the input data into eight parts, except Space-Saving, where simply partitioning the data across multiple threads leads to errors. The 8-thread version of Space-Saving is therefore implemented with a mergeable Space-Saving algorithm [Agarwal et al. 2013].
Evaluation Metrics. We use throughput to evaluate performance. The throughput is calculated by dividing the number of tuples in the stream by the total execution time. We run each experiment 10 times and report the average throughput as the result.




Fig. 12: Comparisons between PE-Ring and shifter list: (a) DTW, (b) Space-Saving, (c) Correlation, (d) EM-GMM.

It is noted that the throughput of the FPGA designs is determined by the on-chip logic, because the data transfer over PCIe is not the bottleneck. In detail, the effective bandwidth of the PCIe link we use is 4GB/s; with a word length of 32 bits, this corresponds to 1G tuples/s, which is much higher than the throughput of the FPGA designs.

For the FPGA designs, we also list the resource consumption (LUTs and FFs) to analyze the scalability of the different implementations.

7.2. Comparison with Shifter List
Fig. 12 shows the performance of the PE-Ring and the shifter list. In general, the throughputs of the PE-Ring and the shifter list are comparable.

We also list the resource consumption of the FPGA designs in Table VI(a)-(d). According to these tables, the resource consumption of the shifter list and the PE-Ring is similar in most cases. However, in Space-Saving, the PE-Ring costs much fewer LUTs than the shifter list. In particular, when M ≥ 256, the LUT consumption of the shifter list is about twice that of the PE-Ring.

The reason for this phenomenon is that Space-Saving is a use case in which pT = −1 < 0. In Eq. (2), pT is defined as the tail offset in ri-1, which describes the data dependency pattern of the algorithm. As introduced in Sec. 4.2, pT influences the maximal parallelism degree of the algorithm and the number of PEs in the PE-Ring. When pT < 0, the PE-Ring only needs ⌈M/t⌉ PEs, where t = |pT| + 1. Thus, the PE-Ring implementation of Space-Saving requires ⌈M/2⌉ PEs. However, the shifter list uses M PEs in its structure. Since the maximal parallelism degree of Space-Saving is ⌈M/2⌉, half of the PEs in the shifter list are idle in each cycle. To sum up, for cases where pT < 0, the PE-Ring saves LUTs.
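A short worked instance of this counting (assuming the on-chip resources suffice so that PE_NUM = R):

t = |pT| + 1 = |−1| + 1 = 2,   R = ⌈M/t⌉ = ⌈M/2⌉;   e.g., M = 512 ⇒ R = 256 PEs for the PE-Ring versus 512 PEs for the shifter list,

which is consistent with the roughly halved LUT consumption of the PE-Ring in Table VI(b) at large M.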

The difference in LUT consumption between the PE-Ring and the shifter list also exists when M < 256 in Space-Saving, but it is not as obvious as when M ≥ 256, as shown in Table VI(b). The reason is that the LUT consumption of RIFFA is the same for the PE-Ring and the shifter list, which masks the difference in LUT usage when M is small.




Table VI: Resource Consumption

(a) DTW
        PE-Ring (%)     Shifter List (%)    HLS (%)
M       LUT    FF       LUT    FF           LUT    FF
32      4      2        4      2            4      2
64      9      3        7      3            6      3
128     17     5        15     5            9      3
256     30     9        27     8            16     4
512     57     18       51     17           33     6
1024    -      -        -      -            57     10

(b) Space-Saving
        PE-Ring (%)     Shifter List (%)    HLS (%)
M       LUT    FF       LUT    FF           LUT    FF
32      3      4        4      3            3      3
64      4      4        6      3            6      3
128     7      5        11     4            10     4
256     10     7        22     6            15     5
512     19     10       41     9            27     8
1024    37     17       81     17           53     13
2048    74     30       -      -            -      -

(c) Correlation
        PE-Ring (%)     Shifter List (%)    HLS (%)
M       LUT    FF       LUT    FF           LUT    FF
32      2      2        2      2            4      3
64      4      3        4      3            7      3
128     6      4        6      4            11     5
256     10     6        10     6            17     6
512     19     9        19     10           34     10

(d) EM-GMM
        PE-Ring (%)     Shifter List (%)    HLS (%)
D   M   LUT    FF       LUT    FF           LUT    FF
3   2   3      3        3      3            6      3
3   4   6      3        6      3            11     3
3   6   9      4        9      4            15     4
6   2   6      3        6      3            12     3
6   4   13     4        14     4            16     4
6   6   20     5        22     6            19     5

Fig. 13: Comparison among PE-Ring, simple HLS configuration, and single- and multiple-thread software solutions: (a) DTW, (b) Space-Saving, (c) Correlation, (d) EM-GMM.

7.3. Comparison with Simple HLS Configuration
Fig. 13 shows the throughput comparison between the PE-Ring and the simple HLS configuration. The throughput of the PE-Ring outperforms the simple HLS configuration by one to two orders of magnitude. The performance advantages of the PE-Ring come from a high degree of parallelism and from data reuse based on the stable data dependency between sub-functions G(). In the simple HLS configuration, we apply loop pipelining to the inner for-loop of function F(), which invokes the sub-function G() of Sec. 3, to improve performance. However, the main optimization of the HLS tool is still limited to the parallelism inside each sub-function G().

According to the resource consumption listed in Table VI(a)-(b), we also observe that when the scale of the parameter M is large enough, the resources run out in both the basic PE-Ring and the simple HLS configuration. In DTW and Space-Saving, LUTs are the key factor that prevents us from implementing larger M on the chip. In Correlation, the most demanded resource is the DSP slice, rather than LUTs or FFs: neither the basic PE-Ring nor the simple HLS configuration can support M > 512, because 90% of the DSPs are already used for multiplications when M = 512.




For EM-GMM, the objective is to demonstrate how the performance changes with the dimension of a tuple. Since it is not practical to use a large number of Gaussian distributions in real-life applications, we vary the scale of the parameter M only within a small range. Fig. 13(d) shows that, for the same M, when the dimension D of a tuple increases, the throughput of the PE-Ring drops slightly. This is because the clock frequency decreases when the structure of a PE becomes more complex in order to process multi-dimensional data.

In practice, streaming algorithms are applied to problems with different parameters. For example, DTW and Correlation are performed with patterns of different sizes. Space-Saving uses a parameter, the error bound ε, to determine the result precision; the number of bins in Space-Saving is M = 1/ε [Metwally et al. 2006]. The scale of the problem in a streaming algorithm therefore varies with the application-specific requirements.

When the scale of the problem increases, as claimed above, the simple HLS configuration cannot support the implementation. However, with block RAM, the FIFO-enabled PE-Ring overcomes the limitation of on-chip resources (see Sec. 4.2.2). The FIFO-enabled PE-Ring, whose performance is also shown in Fig. 13, supports values of M that a basic PE-Ring cannot handle. In conclusion, the PE-Ring achieves a scalability that general HLS tools cannot.

The time consumption of a C-to-RTL flow consists of two parts: (1) the time for hardware design and (2) the time for computation. For the former, the time taken by other HLS tools and by the proposed method is similar. Specifically, traditional HLS tools take the input software codes, generate the corresponding hardware structure, and then run the synthesis phase. The proposed C-to-RTL methodology likewise takes software codes as input, generates the application-specific structure of the accelerator, and then runs synthesis. Therefore, for the same target algorithm, both traditional HLS tools and the proposed method need minutes to hours for circuit generation, depending on the degree of parallelism and the circuit complexity required by the application. For the latter, the experimental results show that the proposed method obtains a speedup of one to two orders of magnitude compared with traditional HLS tools.

We can further analyze the power and energy consumption based on the above experimental results. For the software implementation, power is consumed only by the host CPU. For the traditional HLS method and the PE-Ring method, the power consumption consists of three parts: the CPU, the PCIe link, and the FPGA board. Since the additional power consumption of the PCIe link and the FPGA board is small, the total power consumption of the three methods is similar. However, the experimental results show that the proposed method obtains a speedup of one to two orders of magnitude over the software and HLS methods. Therefore, the energy consumption of the PE-Ring is much lower than that of the existing methods.

7.4. Comparison with Software
Fig. 13 also shows the comparison between the PE-Ring and the software solutions. We observe that the PE-Ring supports the same problem scales as the software solutions. In addition, the throughput of the PE-Ring outperforms the 8-thread solution by one to two orders of magnitude and the 1-thread solution by one to three orders of magnitude.

According to Fig. 13, we also observe that the throughput varies differently for the basic PE-Ring and the FIFO-enabled PE-Ring. For the basic PE-Ring, the throughput drops only slightly, and sometimes even stays stable, as M increases. Meanwhile, for the FIFO-enabled PE-Ring, the throughput drops heavily, just like the software solutions. This phenomenon can be explained by Eq. (4).




Fig. 14: Clock frequency of PE-Ring

The throughput depends on the number of PEs R, the clock frequency f, and the parameter scale M. As discussed in Sec. 4, the number of PEs of the basic PE-Ring is R = ⌈M/t⌉. That is to say, R/M in Eq. (4) is roughly constant for the basic PE-Ring. Moreover, when the parameter scale M increases, the resource consumption increases and leads to a decline in the clock frequency f, as shown in Fig. 14. Therefore, in the basic PE-Ring, the throughput declines slightly due to the clock frequency.
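Eq. (4) itself is not reproduced in this section; a form consistent with the discussion here, which we assume only for illustration, is

Throughput ≈ (R · f) / M  tuples per second,

so that for the basic PE-Ring, where R = ⌈M/t⌉, the ratio R/M is roughly constant and the throughput tracks the clock frequency f, whereas for the FIFO-enabled PE-Ring, where R is fixed, the throughput falls roughly as 1/M, matching the trends in Fig. 13.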

In the FIFO-enabled PE-Ring, the major reason for the throughput decline is the parameter scale M. Due to resource limitations, the number of PEs R is fixed, while a larger M brings more workload without any improvement in the parallelism degree. In addition, as shown in Fig. 14, the FIFOs in the PE-Ring do not influence the clock frequency very much. Thus, the reason for the throughput decline of the FIFO-enabled PE-Ring is the same as for the software solutions: the workload increases as M grows, while the computing capacity remains the same.

8. CONCLUSION
In this paper, we implement streaming algorithms that fit our proposed HSF-SDD model on the PE-Ring. The objective is to popularize FPGAs as co-processors for UDFs in databases. The PE-Ring framework fully exploits the parallelism of algorithms, leverages block RAM to support large-scale parameters, and delivers more flexible FPGA designs. A customized C-to-RTL tool for the PE-Ring is also proposed. Experimental results show that our method outperforms the design from Vivado HLS by one to two orders of magnitude. Experiments also demonstrate the scalability of the PE-Ring.

REFERENCES
Pankaj K Agarwal, Graham Cormode, Zengfeng Huang, Jeff M Phillips, Zhewei Wei, and Ke Yi. 2013. Mergeable summaries. ACM Transactions on Database Systems (TODS) 38, 4 (2013), 26.
Arvind Arasu, Ken Eguro, Manas Joglekar, Raghav Kaushik, Donald Kossmann, and Ravi Ramamurthy. 2015. Transaction processing on confidential data using Cipherbase. In 2015 IEEE 31st International Conference on Data Engineering. IEEE, 435–446.
Subi Arumugam, Alin Dobra, Christopher M Jermaine, Niketan Pansare, and Luis Perez. 2010. The DataPath system: a data-centric analytic processing engine for large data warehouses. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 519–530.
Jeff A Bilmes and others. 1998. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. International Computer Science Institute 4, 510 (1998), 126.
Hui Ding, Goce Trajcevski, Peter Scheuermann, Xiaoyue Wang, and Eamonn Keogh. 2008. Querying and mining of time series data: experimental comparison of representations and distance measures. Proceedings of the VLDB Endowment 1, 2 (2008), 1542–1552.
Ce Guo, Haohuan Fu, and Wayne Luk. 2012. A fully-pipelined expectation-maximization engine for Gaussian mixture models. In Field-Programmable Technology (FPT), 2012 International Conference on. IEEE, 182–189.
Informix. 2015. Informix subsequence similarity search. https://crl.ptopenlab.com:8800/accelerator/accelerator/4/. (2015). Accessed Oct. 25, 2016.
Changhoon Kim, Matthew Caesar, Alexandre Gerber, and Jennifer Rexford. 2009. Revisiting route caching: The world should be flat. In International Conference on Passive and Active Network Measurement. Springer, 3–12.
NSL Phani Kumar, Sanjiv Satoor, and Ian Buck. 2009. Fast parallel expectation maximization for Gaussian mixture models on GPUs using CUDA. In High Performance Computing and Communications, 2009. HPCC'09. 11th IEEE International Conference on. IEEE, 103–109.
Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In Code Generation and Optimization, 2004. CGO 2004. International Symposium on. IEEE, 75–86.
Oskar Mencer. 2006. ASC: a stream compiler for computing with FPGAs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 25, 9 (2006), 1603–1617.
Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2006. An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Transactions on Database Systems (TODS) 31, 3 (2006), 1095–1133.
Rene Mueller, Jens Teubner, and Gustavo Alonso. 2010. Glacier: a query-to-hardware compiler. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 1159–1162.
Rene Mueller, Jens Teubner, and Gustavo Alonso. 2012. Sorting networks on FPGAs. The VLDB Journal: The International Journal on Very Large Data Bases 21, 1 (2012), 1–23.
Netezza. 2011. http://www.ibm.com/software/data/netezza. (2011). Accessed Oct. 25, 2016.
OpenCL. 2013. https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.html. (2013). Accessed Oct. 25, 2016.
RIFFA. 2013. http://riffa.ucsd.edu/. (2013).
Doruk Sart, Abdullah Mueen, Walid Najjar, Eamonn Keogh, and Vit Niennattrakul. 2010. Accelerating dynamic time warping subsequence search with GPUs and FPGAs. In 2010 IEEE International Conference on Data Mining. IEEE, 1001–1006.
Bharat Sukhwani, Hong Min, Mathew Thoennes, Parijat Dube, Balakrishna Iyer, Bernard Brezzo, Donna Dillenberger, and Sameh Asaad. 2012. Database analytics acceleration using FPGAs. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. ACM, 411–420.
Takashi Takenaka, Masamichi Takagi, and Hiroaki Inoue. 2012. A scalable complex event processing framework for combination of SQL-based continuous queries and C/C++ functions. In 22nd International Conference on Field Programmable Logic and Applications (FPL). IEEE, 237–242.
Jens Teubner and Rene Mueller. 2011. How soccer players would do stream joins. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 625–636.
Jens Teubner, Rene Muller, and Gustavo Alonso. 2011. Frequent item computation on a chip. IEEE Transactions on Knowledge and Data Engineering 23, 8 (2011), 1169–1181.
UCRSuite. 2012. http://www.cs.ucr.edu/%7eeamonn/UCRsuite.html. (2012). Accessed Oct. 25, 2016.
Vivado. 2012. http://www.xilinx.com/products/design-tools/vivado.html. (2012). Accessed Oct. 25, 2016.
Haixun Wang and Carlo Zaniolo. 1999. User-defined aggregates in database languages. In International Symposium on Database Programming Languages. Springer, 43–60.
Zilong Wang, Sitao Huang, Lanjun Wang, Hao Li, Yu Wang, and Huazhong Yang. 2013. Accelerating subsequence similarity search based on dynamic time warping distance with FPGA. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 53–62.
WebDocs. 2003. http://fimi.ua.ac.be/data/. (2003). Accessed Oct. 25, 2016.
Xuechao Wei, Yun Liang, Tao Wang, Songwu Lu, and Jason Cong. 2017. Throughput optimization for streaming applications on CPU-FPGA heterogeneous systems. In Proceedings of the 22nd Asia and South Pacific Design Automation Conference (ASP-DAC).
Louis Woods, Gustavo Alonso, and Jens Teubner. 2015. Parallelizing data processing on FPGAs with shifter lists. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 8, 2 (2015), 7.
Tian Zhang, Raghu Ramakrishnan, and Miron Livny. 1996. BIRCH: an efficient data clustering method for very large databases. In ACM SIGMOD Record, Vol. 25. ACM, 103–114.
Wei Zuo, Yun Liang, Peng Li, Kyle Rupnow, Deming Chen, and Jason Cong. 2013. Improving high-level synthesis optimization opportunity through polyhedral transformations. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 9–18.
