TRANSCRIPT
www.symparallel.com
syMParallel — Efficient and Scalable OpenMP-based System-Level Design
syMParallel turns parallel programmers into hardware designers
Blending parallel programming with ESL design
Applications: Graphics Processing with syMParallel

Graphics processing is an important application domain for parallel computing due to the massive thread-level parallelism inherently present in most algorithms. A common example of graphics processing applications, the JPEG algorithm is used for compression and decompression of bitmap images. According to the JPEG standard, the first stages of image compression/decompression, i.e. the Discrete Cosine Transform, Quantization, and their inverse transformations, are applied independently to all 8x8 pixel sub-blocks in the image, enabling massive algorithm-level parallelism. We used syMParallel to perform the above steps of the JPEG standard, both for compression and decompression. Relying on the possibility of mapping the shared memory to a frame-buffer device, we could also validate the results graphically through a 640x480 VGA external display. During the Design Space Exploration stage, we chose to allocate the maximum number of hardware accelerators meeting the constraints of the FPGA device area, yielding a very fast compression/decompression cycle. The chart on the left exemplifies the trade-offs explored by means of the area estimation tool. Notice that, retaining a soft-core embedded processor in the final system architecture, we could adopt a dynamic scheduling strategy when coding the for cycles of the OpenMP application, i.e. we could use a fully-fledged OpenMP application as the design entry in the flow.
syMParallel is an innovative design framework for massively parallel applications, blending the OpenMP programming paradigm with electronic system-level (ESL) design methodologies. A de-facto standard for parallel programming, OpenMP is the most popular approach to shared-memory parallel applications in a number of different application domains, ranging from financial applications to scientific computing and graphics processing [1]. OpenMP naturally meets the characteristics of current multi-processor systems-on-chip (MPSoCs), in that it provides enough semantics to express the parallelism of typical data-intensive applications. The syMParallel design flow enables the automated translation from high-level OpenMP programs down to an FPGA-based multi-processor system-on-chip highly customized for the target application, possibly accelerated by hardware cores inferred from the source code through a high-level synthesis process.

A few literature works address the description of hardware accelerators through OpenMP. They focus either on the integration of hardware accelerators in essentially software OpenMP programs [6][2], or on the pure hardware translation [3]-[5]. When supporting OpenMP-based hardware design, they impose drastic restrictions on the constructs actually available to parallel programmers, effectively preventing the reuse of legacy OpenMP programs and kernels. Other limitations include the use of centralized mechanisms for controlling interactions among threads, causing scalability issues, the limited support for external memory and for efficiency-critical mechanisms such as caching, and several unsupported runtime library routines.

The syMParallel design flow enables the generation of heterogeneous systems, including one or more processors and dedicated hardware components, where OpenMP threads can be mapped to either software threads or hardware blocks. The control-intensive facets making the full OpenMP difficult to implement in hardware are managed in software, while the data-intensive part of the OpenMP application is addressed by dedicated software/hardware parallelism. Hardware threads are generated by means of high-level synthesis tools that perfectly fit the structure of an OpenMP program, where the application logic is still described by means of plain C/C++ code. This approach provides full support for standard-compliant OpenMP applications, as well as fundamental "general-purpose" characteristics such as memory hierarchies and management.
The figure below shows the typical architecture of a system generated by syMParallel. Each subsystem represents an OpenMP thread, or a group of threads. Software subsystems are processors executing compiled C code. Hardware subsystems are either blocks generated through HLS, equipped with DMA capabilities and synchronization ports, or ordinary peripherals such as timers or ad-hoc processing components. For the generation of these subsystems, syMParallel relies on third-party back-ends. Namely, the framework currently supports Xilinx FPGA devices and MPSoC architectures [10], and exploits Impulse CoDeveloper for high-level synthesis [9].
The syMParallel design environment exposes a Graphical User Interface (GUI) seamlessly integrated into the Eclipse IDE as an external plug-in. Programmers are provided with a quick, interactive, and user-friendly interface to control the whole design cycle, from functional simulation to system implementation and execution. The system-level design methodology underpinned by the syMParallel environment is mainly based on a high-level top-down strategy where each step can be controlled by the designer through the GUI.
Applications: Financial Computing with syMParallel
Computational finance provides numerical methods to tackle difficult problems in financial applications. Coding these numerical algorithms can be greatly simplified by using parallel programming paradigms like OpenMP, e.g. to exploit task-level parallelism for financial simulations.

Monte Carlo Option Price is a numerical method often used in computational finance to calculate the value of an option with multiple sources of uncertainty and random features, such as changing interest rates, stock prices, or exchange rates. Monte Carlo simulations can be easily parallelized, since tasks are largely independent and can be partitioned among different computation units, while communication requirements are limited. The parallelization with syMParallel was done by partitioning the Monte Carlo iterations among all threads with a dynamic strategy, hence balancing the load in a completely distributed fashion. As soon as the independent tasks are complete, a reduction operation allows the thread-safe merging of all results, letting the main thread compute the final results. During the Design Space Exploration step, the lowest-latency implementation was chosen. It retains a software thread, used to perform control-related and miscellaneous operations such as random number generation. In conclusion, syMParallel offered a direct path from parallel code to a parallel hardware engine for acceleration of Monte Carlo Option Price simulations, able to run many times faster than a software implementation.
References

[1] OpenMP Architecture Review Board. (2011) OpenMP application program interface, v3.1. [Online]. Available: www.openmp.org
[2] W.-C. Jeun and S. Ha, "Effective OpenMP implementation and translation for multiprocessor System-on-Chip without using OS," in Proceedings of the 2007 Asia and South Pacific Design Automation Conference - ASP-DAC '07, Jan. 2007, pp. 44–49.
[3] Y. Leow, C. Ng, and W. Wong, "Generating hardware from OpenMP programs," in IEEE International Conference on Field Programmable Technology (FPT 2006), Dec. 2006, pp. 73–80.
[4] P. Dziurzanski and V. Beletskyy, "Defining synthesizable OpenMP directives and clauses," in Proceedings of the 4th International Conference on Computational Science - ICCS 2004, ser. LNCS, vol. 3038. Springer, 2004, pp. 398–407.
[5] P. Dziurzanski, W. Bielecki, K. Trifunovic, and M. Kleszczonek, "A system for transforming an ANSI C code with OpenMP directives into a SystemC description," in Design and Diagnostics of Electronic Circuits and Systems, 2006. IEEE, Apr. 2006, pp. 151–152.
[6] D. Cabrera, X. Martorell, G. Gaydadjiev, E. Ayguade, and D. Jimenez-Gonzalez, "OpenMP extensions for FPGA accelerators," in International Symposium on Systems, Architectures, Modeling, and Simulation, 2009 - SAMOS '09, Jul. 2009, pp. 17–24.
[7] P. Coussy and A. M. (Eds.), High-Level Synthesis from Algorithm to Digital Circuit. Springer, 2008.
[8] EPCC. (2012) EPCC OpenMP benchmarks. [Online]. Available: http://www.epcc.ed.ac.uk/software-products/epcc-openmp-benchmarks/
[9] Impulse Accelerated Technologies. (2012) Impulse CoDeveloper. [Online]. Available: http://www.impulseaccelerated.com
[10] Xilinx. (2012) Platform studio and the embedded development kit (EDK). [Online]. Available: http://www.xilinx.com/tools/platform.htm
Any ideas? Contact [email protected]
[Figure: the syMParallel design flow — C/OpenMP-based system description → functional simulation → hardware/software partitioning → high-level synthesis and software code compilation → platform-based system composition → hardware synthesis and place & route, drawing on an IP-core library (e.g. timers, video controllers, …).]
[Charts a)–d): normalized overheads on FPGA vs. Intel i7 for the private, firstprivate, dynamic, static, and critical constructs, and a comparison of our approach against Ref. [3].]
syMParallel relies on state-of-the-art ESL design techniques helping reduce the design cycle for building a complete system. After the initial coding in C/OpenMP, a functional simulation is performed to verify the correctness of the application. The high-level executable specification, here, is key to enabling a fast, purely software simulation. Starting from the high-level specification, the ad-hoc syMParallel compiler generates all the files needed to build the heterogeneous MPSoC, completely hiding the underlying technology and the back-end synthesis tools.
To fully support the OpenMP specification, a library of IP cores is incorporated into the environment and automatically used by the compiler when specific OpenMP directives are detected. The bifurcation point in the flow corresponds to the hardware/software partitioning stage, where the physical architecture of the system is defined. The syMParallel technology enables a fast evaluation of all costs associated with a specific implementation. The area occupation is estimated by means of a suitable statistical analysis, while the evaluation of the overall latency relies on a cycle-accurate Instruction Set Simulator for the embedded processors and on timing analysis for the hardware accelerators. Subsequently, two distinct branches in the flow cover software compilation and high-level synthesis concurrently. They re-join at the system composition step, where the whole system is automatically built, usually assembling library hardware/software components and application-specific components generated by the previous steps, before the actual hardware synthesis takes place.
Results: efficient and scalable design solutions

syMParallel was benchmarked by measuring the implementation overhead caused by the support for OpenMP constructs. Precisely, we measured the overhead/execution-time ratio, i.e. the normalized overhead, relying on the well-known EPCC benchmarks [8]. We compared the normalized overhead with an OpenMP implementation for a Windows 7 OS on an Intel i7 processor at 1.8 GHz running the same benchmarks.

Chart a) presents the normalized overheads as the number of threads is increased. The important clue provided by the plot is that, in addition to being low in absolute value, the overhead tends to be horizontal. This shows the effects of the distributed architecture and synchronization mechanisms implemented by syMParallel, and their impact on the efficiency and the scalability of the resulting MPSoC.

Chart b) summarizes the overhead trends for the benchmarked constructs, depicting the average slopes of the normalized overhead measured during our experiments. For example, the figure tells us that, on average, the overhead for the

    #pragma omp parallel firstprivate (var)

clause grows by 0.006 per thread, while it grows by 0.147 per thread for the i7 implementation. Again, this provides a convincing demonstration of the scalability of the syMParallel approach compared to the pure software implementation.

We also compared syMParallel with [3], the only solution for OpenMP-to-hardware translation presenting a working tool and some performance figures. Charts c) and d) compare our results with [3] in terms of performance scalability and system frequency. Chart c) refers to a program implementing the Sieve of Eratosthenes algorithm, while Chart d) refers to a program implementing an infinite impulse response filter, as in [3].

The plots confirm the effectiveness of syMParallel, ensuring a satisfactory level of efficiency and scalability, concerning both the application speed-up and the complexity of the generated hardware, which directly affects the clock frequency.
Fast estimates of latency and area occupation for the chosen implementation can be displayed in the form of reports and graphical diagrams, helping the designer visualize and understand the actual system requirements. This allows design choices to be made as early as possible in the development cycle, enabling a fast and effective Design Space Exploration (DSE) process. After the complete hardware platform is generated, the early estimation parameters can be validated against the actual synthesis results, again relying on an intuitive and user-friendly graphical interface. Then, the designer can finally burn the FPGA chip, download the code, and run the application on the custom MPSoC.
Throughout the design process, the dedicated syMParallel console shows the state of the third-party back-end tools, notifying errors and warnings without blocking the Eclipse graphical interface. After coding and compiling OpenMP source code in the Eclipse text editor, users can execute and debug it on the host machine in order to complete the required functional simulation. Then, they can generate the source files taken as input by the next steps by just launching the syMParallel compiler. Several architectural constraints can be specified for the system being built. For example, users can determine the mapping of a given parallel thread by binding it to an embedded processor or to a dedicated, faster hardware accelerator. If an image or video application is being developed, the designer can map a specific portion of the shared memory onto a frame-buffer device: the compiler will automatically generate a system with an integrated display controller that can be used to drive an external display.
The parallel heterogeneous architecture is defined in such a way as to orchestrate its hardware/software components in a distributed fashion, avoiding centralized hardware elements. The memory components may be implemented in different technologies: they may be mapped to FPGA internal RAM blocks or to off-chip SDRAM memory, enabling the synthesis of real-world systems working on large amounts of data. Shared memory areas are accessed by all subsystems. Each hardware subsystem has its own local memory corresponding to the synthesized registers, while software subsystems can also have a local memory corresponding to the processor cache levels of the memory hierarchy. The atomic registers are the fundamental block for the implementation of mutual exclusion mechanisms, as they provide specific hardware support for it.

Data scoping and initialization mechanisms are very important to OpenMP, and are implemented by syMParallel as efficiently as possible. Just as an example, the overhead caused by the OpenMP firstprivate clause only depends on the number of listed variables and does not increase with the number of threads, preserving the application scalability. Another essential aspect of OpenMP is the dynamic scheduling strategy in the #pragma omp for directive. Supporting this strategy efficiently is vital for load balancing in heterogeneous systems. syMParallel supports it in a completely distributed fashion. Following is a simplified version of the algorithm executed independently by each processing element:

    while (iterations are not finished) {
        critical {
            read the iteration_counter
            re-check that iterations are not finished
            update the iteration_counter with (iteration_counter + chunk_size)
        }
        execute iterations
    }

As shown above, the support is completely distributed. Consequently, the threads execute a number of iterations determined at run time according to the computational power of the unit they are allocated to and the different load they happen to handle, fully implementing the semantics of the dynamic clause.
[Figure: typical generated architecture — software subsystems (processors with local memories), hardware subsystems (HLS-generated blocks with local memories, plus peripherals such as timers), memory subsystems holding each software subsystem's stack, heap, and text/global sections, and the atomic registers, all connected through the communication infrastructure and distributed synchronization.]