High Performance Computing Infrastructure: Past, Present, and Future

Posted on 25-May-2015




  • 1. High Performance Computing Infrastructure: Past, Present, and Future. By Clay Gloster, Jr., Ph.D., P.E., Associate Professor, Department of Electrical & Computer Engineering, Howard University. THE RARE PROJECT. cgloster@howard.edu. June 22, 2009

2. Presentation Outline
   - Introduction to Reconfigurable Computing
   - The Bison Configurable Digital Signal Processor
   - The BCDSP Design Flow
   - Current Function Cores and Modules
   - A Remote Reconfigurable Computer
   - A Parallel and Configurable Computer

3. Introduction to Reconfigurable Computing

4. Problem Statement
   Given: An application that is computationally intensive or requires considerable CPU execution time, e.g., weather modeling, remote sensing, target recognition, precision targeting, gene sequencing.
   Find: A solution that significantly improves performance and requires acceptable development time, at a reasonable cost.

5. Potential Solutions
   - Cluster-based computing: The use of several general-purpose computing systems, e.g., PCs. (Writing programs that execute on typical PCs/workstations.)
   - Application-Specific Integrated Circuit (ASIC) design: The use of special-purpose ICs or chips. (Designing a chip (hardware) that is highly optimized for the particular application.)
   - Reconfigurable computing: The merger of the two approaches. (Writing software to execute non-time-critical portions of the application on a PC while designing hardware to execute the time-critical portions of the application on an FPGA.)

6. A Reconfigurable Computer is:
   A PC attached to one or more Field Programmable Gate Arrays (FPGAs).
   (Figure label: PC Host.)

7. An FPGA is:
   A programmable integrated circuit. At time t1, it can be programmed as X1 (personal data assistant). At time t2, it can be programmed as X2 (calculator).
   (Figure labels: Programmable Pin, Configurable Logic Block, Programmable Interconnect.)

8. RC Systems Advantages
   - Several applications have been implemented on reconfigurable computing systems, with execution times an order of magnitude faster than the same application implemented on a typical desktop computer.
   - The same reconfigurable computing system hardware can be reused for diverse applications.
   - With an RC system, a system can be deployed and subsequently reprogrammed with new hardware to perform functions that were not available at the time of deployment.

9. RC Systems Disadvantages
   - Developing an RC system requires a system designer who is knowledgeable in both hardware design and software design.
   - The time required to design and implement an RC system that executes faster than a typical desktop computer can be several months.

10. Research Objectives
   - To obtain RC system implementations of several applications that achieve an order of magnitude speedup over executing the application on a typical desktop computer.
   - To develop tools that reduce RC system development time from months to weeks or days, and that allow users who are not knowledgeable in hardware design to implement RC systems while still gaining the potential benefits of increased system performance and system reuse.
   - To develop a resource management system to efficiently utilize available reconfigurable computing resources located at remote sites.

11. The Bison Configurable Digital Signal Processor

12. A Configurable Digital Signal Processor (BCDSP)
   (Block diagram labels: memories M0, M1, M2, M3, ..., Mn-2, Mn-1 marked Data; memory Mn marked Instructions; CONTROL UNIT; DATA UNIT; Function Core (FunCoreGen).)

13. Functional Cores
   - Have one or more 32-bit inputs.
   - Perform floating point vector operations.
   - Have simple control.
   - Can be built using other FunCores.
   - Can include conditional units.
   (Figure labels: R0, R1, ..., R7; ENABLE; FunCore; DONE.)

14. 2-D DCT Function Core
   (Figure labels: registers R0 through R15, eight multipliers (X), seven adders (+), output Z0.)

15. Optimizing System Performance with the BCDSP
   - Memory is 64 bits wide, allowing two single-precision floating point numbers to be fetched in a single memory access.
   - There are N=4 data memories, hence multiple data items can be read/written in a single cycle. Theoretically, the number of memory accesses can be reduced by a factor of N=4.
     (This number can be increased to an upper bound of 2N=8 if we store two floating point values per location.)
   - Multiple function cores can be used. For example, a typical processor may have 1 multiplier. In this case, K multiplies require K time units or clock cycles. With K multipliers, K multiplies can be executed in a single time unit or clock cycle.
   - Pipelining and DMA accesses are used to increase system performance.

16. BCDSP Software, Cores, and Processors

17. Distinguishing Features of RCCT
   (Flow diagram labels. Traditional Approach: Original Source Code; Special Compiler; Modified Source Code; HDL Source Code; High Level Compiler; Logic Synthesis; Placement & Routing; Executable Code; Bit Stream. Our Approach: Original Source Code; Module Definition File; RCCT Compiler; Modified Source Code; Session Files; High Level Compiler; Executable Code.)
   - Placement and routing is performed off-line.
   - The Hardware Module Library evolves continuously.
   - Compiler can easily recognize new modules.
   - As new modules are added, the Compiler has a better chance to improve performance for each user application.
   AIST-0016-0044

18. The Front-End Compiler
   - The purpose of the compiler is to map user applications to FPGA-based reconfigurable computers (RC), (i.e., the BISON reconfigurable computer).
   - The compiler takes the original source code written in C/C++ and a module library and produces two outputs: the modified source code and a session file for each modified section.
   (Figure labels: Original Source Code; Module Library; RCCT Compiler; Modified Source Code (Calls the Loader); Session files; Programming Language Compiler; New Application Executable.)

19. The BCDSP Processor Back-End Compiler
   (Tool flow labels from the figure: dct.c; c2hl; dct_hl.vhd; hl2cudu; dct_cu.vhd; dct_du.vhd; PECORE.vhd.)
   hl2cudu consists of approximately 15 programs!!!

20.
Execution Time for the 2D-DCT

   Image Size    Software (ms)    Hardware (ms)    Speedup
                 2.97 GHz PC      24 MHz BCDSP
   8x8               0.0400          0.0112         3.56
   16x16             0.095           0.0272         3.48
   32x32             0.264           0.09150        2.88
   64x64             0.849           0.3484         2.43
   128x128           3.080           1.3746         2.24
   256x256          12.154           5.478          2.22
   512x512          60.556          21.8942         2.76
   1024x1024       185.754          87.5560         2.12

   Reconfigurable hardware was 2.71 times faster on average!!!!

21. A Remote Configurable Computer

22. A Remote And Reconfigurable Environment (RARE)
   (Figure labels: Processor Library; Remote Environment; Resource Bank; Resource Controller; FPGA0 with M00, M01, ..., M0n; FPGA1 with M10, M11, ..., M1p; FPGAm with Mm0, Mm1, ..., Mmq; Automated BCDSP Tool Set; Application (C, Java, ...); User Parameters (power, size, weight).)

23. The RARE Project Infrastructure
   - The RARE software is developed using Java. The Java language is selected because it offers a number of advantages over other programming languages.
   - Java supports native methods, remote method invocation, and network security.
   - The native method feature allows the use of software routines written in other programming languages such as C/C++ to be called from Java applications.
   - Remote method invocation and network security features make it possible to execute Java programs from a remote site.
   (Figure labels: Client.java with RMI links; INTERNET; Server.java with RMI links; NMI; Function.c; FPGA Board.)

24. PNN Execution Times

   Implementation Type    Local (ms)    Remote (ms)
   Software (Java)          628.71        2887.74
   Software (Cpp)           861.04        3116.17
   Hardware                 104.07         371.01

   Remote hardware can be faster than local software!!!!

25. A Parallel and Configurable Computer System

26. A Parallel and Configurable Computer
   - NSF MRI Grant: A Parallel and Configurable Computer for Research in Engineering and the Computational Mathematical Sciences ($500K)
   - Projects related to RFID, an Electronic Nose, PET Image Reconstruction, Image Compression, and Computer Vision are using this equipment to solve real world problems.
   (Figure labels: Parallel CC; CCN0, CCN1, CCN2, ..., CCNi, ..., CCN6, CCN7; PC2i; FPGABrd2i; PC2i+1; FPGABrd2i+1.)

27.
Cluster Specifications
   8 Compute Nodes
   - 1 x PCI-X dual port Infiniband 4X HCA card
   - 1 x 250GB SATA Hard Drive 7200RPM w/ 16MB Cache
   - 8 x 1GB PC3200 ECC Reg DDR (400MHz)
   - 1 x PNY nVidia Quadro FX 3000G w/ 8X AGP, 256MB DDR, Dual DVI/DVI
   - 2 x AMD Opteron Model 250 (2.4GHz) 60-30-12921
   - 1 x Dual Opteron S2885 EATX Motherboard w/ 8X AGP, gigE, SATA, audio, firewire, 4x 64-bit PCI
   1 Head Node
   - 1 x PCI-X dual port Infiniband 4X HCA card
   - 8 x 1GB PC3200 ECC Reg DDR (400MHz)
   - 2 x AMD Opteron Model 250 (2.4GHz)
   - 1 x PNY nVidia Quadro FX 3000G w/ 8X AGP, 256MB DDR, Dual DVI/DVI
   - 1 x 10/100/1000 64bit PCI-X Gigabit Copper NIC
   9 FPGA Coprocessors
   - 16 WS2P/XC2VP100-6P/48D/256 Wildstar II PRO PCI board with 2 ea P100-6 parts & 48 MB DDR SRAM and 256 MB DDR SDRAM

28. RARE Project Past, Present, and Future

29. AIST Program Space Based NRA Technologies
   Hierarchical Algorithms and their Embedded Computational Realization in Reconfigurable Hardware
   ESTO, Earth Science Technology Office
   PI: Clay Gloster/Howard University
   Proposal No: AIST 0016-0044

   Description and Objectives
   This project addresses problems associated with developing data products for deployment in onboard RC systems. It involves the development of a compiler that reads algorithm descriptions written in C. The compiler will produce hardware and software components required for an RC implementation of typical NASA data products. The main objectives of this project are: efficient algorithm development and fast and reconfigurable hardware implementations (10X-100X speedup).
   (Figure labels: VLIW; Mem1 through Mem5; PE1 through PE5; FIFO 1 through FIFO 5; PCI Bus; bus widths 61 and 34.)

   Approach
   Develop a compiler to translate nested loops into a sequence of floating point vector instructions. These instructions correspond to modules in a library that is to be developed as a part of this project. Hardware modules will perform complex instructions i.e. matmult, vec-vecmult, FFT, etc.

   Deliverables
   - Prototype RC Testbed shown above
   - Prototype Compiler
   - Cloud Masking Data Product Demonstration
   - Final Compiler

   Application/Mission Co-I
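
The approach described on the last slide (translating nested loops into a sequence of vector instructions that correspond to library modules) can be illustrated with a small sketch. This is not the project's compiler and it is written in Python rather than C for brevity. The names MODULE_LIBRARY, VECVECMULT, translate_matmul, and run are hypothetical, and the "hardware module" is an ordinary software function standing in for an FPGA core.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Instruction = Tuple[str, int, int]  # (opcode, output row, output column)


def matmul_nested_loops(a: Matrix, b: Matrix) -> Matrix:
    """Original form: a plain triple nested loop."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                c[i][j] += a[i][k] * b[k][j]
    return c


def vec_vec_mult(x: Vector, y: Vector) -> float:
    """Software stand-in for a hardware vector-vector multiply module."""
    return sum(xi * yi for xi, yi in zip(x, y))


# Hypothetical module library: opcode -> implementation.
MODULE_LIBRARY: Dict[str, Callable[[Vector, Vector], float]] = {
    "VECVECMULT": vec_vec_mult,
}


def translate_matmul(rows: int, cols: int) -> List[Instruction]:
    """Replace the inner k-loop with one vector instruction per output element."""
    return [("VECVECMULT", i, j) for i in range(rows) for j in range(cols)]


def run(program: List[Instruction], a: Matrix, b: Matrix) -> Matrix:
    """Execute the instruction sequence by dispatching each opcode to the library."""
    columns_of_b = [list(col) for col in zip(*b)]
    c = [[0.0] * len(columns_of_b) for _ in range(len(a))]
    for opcode, i, j in program:
        c[i][j] = MODULE_LIBRARY[opcode](a[i], columns_of_b[j])
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    program = translate_matmul(len(a), len(b[0]))
    print(program)
    print(run(program, a, b))
```

In this sketch the innermost loop disappears from the application code and becomes a single instruction per output element. That is the part a hardware module would execute, while the surrounding control stays in software.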