Introduction to National Supercomputer Center in Tianjin TH-1A Supercomputer

Post on 14-Jun-2015

TRANSCRIPT

1. Introduction to National Supercomputer Center in Tianjin TH-1A Supercomputer

2. Agenda
  • National Supercomputer Center in Tianjin (NSCC-TJ)
  • TH-1A system
  • Hardware sub-system
  • Software sub-system
  • Applications

3. NSCC-TJ
  • National SuperComputer Center in Tianjin
  • Sponsored by
    • Chinese Ministry of Science and Technology
    • Tianjin Binhai New Area
  • Public information infrastructure
  • To accelerate the economy, education and industry of Northern China
  • To provide high performance computing service to the whole of China
  • Open platform for research and education

4. NSCC-TJ
  • Main building: office, computer room, transformer station & air conditioner
  • Total area: 2400 m²

5. NSCC-TJ
  • The first floor of central computing room: 1200 m²

6. NSCC-TJ
  • The second floor of central computing room: visualization environment, 1200 m²

7. NSCC-TJ
  • Electric transformer station

8. NSCC-TJ
  • Cooling water station
  • (Slide footer: 2011-6-28, TH-1)

9. NSCC-TJ
  • Layout of computing room

10. TH-1A system

11. TH-1A system
  • Enhanced system based on TH-1 system, Sep. 2009
  • Installed in NSCC-TJ, Aug. 2010
  • Debugging and performance testing, Sept.~Oct. 2010
  • In service after Nov. 2010

  Items         Configuration
  Processors    14336 Intel CPUs + 7168 NVIDIA GPUs + 2048 FT CPUs
  Memory        262 TB in total
  Interconnect  Proprietary high-speed interconnecting network
  Storage       2 PB
  Cabinets      120 compute/service cabinets, 14 storage cabinets, 6 communication cabinets

12. TH-1A system
  • TH-1A system architecture
    • Hybrid MPP structure: CPU & GPU
    • Proprietary compute nodes
    • Connected by proprietary high-speed interconnect network
    • Global shared parallel storage system
    • Custom software stack

13. TH-1A hardware sub-system
  • [Diagram: compute sub-system (CPU + GPU nodes), service sub-system (operation nodes), communication sub-system, storage sub-system (MDS, OSS), monitor and diagnosis sub-system]

14. Compute sub-system
  • 7,168 compute nodes
  • 2 six-core CPUs and 1 GPU per node
  • CPU: Xeon X5670 (Westmere), processor speed 2.93 GHz
  • GPU: NVIDIA Tesla M2050, connected with CPU by PCI-E
  • 32 GB memory per node
  • 2U height
  • Peak performance: 4,701,061 Gflops

15. Service sub-system
  • 1,024 service nodes
  • 2 eight-core domestic CPUs
  • CPU: FT-1000 SoC
    • 1.0 GHz
    • Eight-core, eight-thread per core
    • Peak performance 8 Gflops
  • 32 GB memory per node
  • For login, compile, and applications that need throughput computing

16. Proprietary interconnection network
  • Interconnection signal speed: 10 Gbps
  • Bi-directional bandwidth: 160 Gbps
  • Hierarchical fat-tree structure
    • First stage: 16 nodes connected by a 16-port switching board
    • Second stage: all parts connected to eleven 384-port switches

17. Proprietary interconnection network
  • High-radix router ASIC (NRC)
    • Feature size: 90 nm
    • Die size: 17.16 mm x 17.16 mm
    • Package: FC-PBGA, 2577 pins
    • Throughput of single NRC: 2.56 Tbps
  • Network interface ASIC (NIC)
    • Same feature size and package as NRC
    • Die size: 10.76 mm x 10.76 mm
    • 675 pins

18. Proprietary interconnection network
  • [Photos: 16-port switch board in cabinet; leaf switch blade and root switch blade of 384-port switch; back plane of 384-port switch, about 700 mm x 600 mm]

19. Proprietary interconnecting network
  • Switching board and high-radix switch
    • Based on network interface ASIC and router ASIC
    • Reduced user communication protocol
    • Throughput: 61.44 Tbps
  • [Photos: front and back of two 384-port high-radix switches]

20. Storage sub-system
  • Capacity: 2 PB
  • Connected by proprietary interconnection network
  • Lustre-based parallel file system

21. Monitor and diagnosis sub-system
  • Rich monitor & control functions
  • Real-time monitoring of hardware parameters
  • Precise fault location
  • Alarm and immediate action against emergency
  • Self-feedback cooling adjustment for environment status
  • I2C & JTAG diagnosis mechanism
  • Large scale console
  • Remote monitor and management

22. Computing cabinet
  • Node: 2 CPUs and 1 GPU
  • Blade: 2 nodes
  • Frame
    • 8 computing blades
    • 16-port switching board
    • 1 monitor and diagnosis board
  • Cabinet
    • 4 frames, 64 nodes
    • Close-coupled chilled water cooling
    • 128 CPUs, 64 GPUs
    • 56 kW cooling capacity in a cabinet
  • Footprint: 700 m²

23. TH-1A software sub-system
  • Software stack

24. Operating system
  • Kylin Linux compute node kernel
  • Provide virtual running environment
  • Isolated running environments for different users
  • Custom software package installation
  • QoS support
  • Power-aware computing

25. Compiler system
  • C, C++, Fortran, Java
  • OpenMP, MPI, OpenMP/MPI
  • CUDA, OpenCL
  • Heterogeneous programming framework
    • Accelerate large-scale, complex applications, especially applications still under development or whose full source code is not available
    • Use the computing power of CPUs and GPUs, hide GPU programming from users
    • Inter-node homogeneous parallel programming (users)
    • Intra-node heterogeneous parallel computing (computer experts)

26. Compiler system
  • Heterogeneous programming framework
    • Inter-node homogeneous parallel programming (JASMIN)
      • Patch-based object data structures
      • MPI communication, dynamic load balancing support
      • Zero-copy optimization in communication library

27. Compiler system
  • Heterogeneous programming framework
    • Intra-node heterogeneous parallel computing
      • Compiler optimized / hand-tuned threaded code
      • Optimizations include
        • Adaptive partitioning: balance the workloads between CPUs and GPU
        • Asynchronous data transfer / computing: overlap CPU operations with GPU operations
        • Software pipelining: overlap GPU computing with data transfer between host and GPU device memory

28. Compiler system
  • Heterogeneous programming framework
    • An example: 3-D short-range molecular simulations
      • For each time step
        • Split workload (force calculation) between CPU and GPU
        • For each patch allocated to GPU: start asynchronous operations (transfer the patch data to GPU, compute the patch, get results from GPU)
        • For each patch allocated to CPU: launch threads on CPU cores to compute the patch
        • CPU waits for GPU completion event
        • Adjust the split value according to the CPU/GPU performance (patches per second + empirical )
        • Other workload (velocity, position) computed on CPU
      • Performance: one NVIDIA M2050 GPU is 3 times faster than one Intel X5670 CPU

29. Programming environment
  • Virtual running environments
    • Provide services on demand
  • Parallel toolkits
    • Based on Eclipse
    • To integrate all kinds of tools
    • Editor, debugger, profiler
  • Work flow support
    • Support QoS negotiation
    • Reserve resources for future requirements

30. Visualization system
  • Application area
    • Numerical weather forecast
    • Computational fluid dynamics
    • Oil exploration
    • Other large-scale data
  • Computing platform: Tianhe-1A
  • Render server: 128 CPU + 64 GPU
  • Display device: 3x6 multi-channel display wall

31. Applications
  • Oil exploration
  • High-end equipment development
  • Bio-medical research
  • Animation design
  • New energy research
  • New material research
  • Weather and climate forecasting
  • Engineering design, simulation and analysis
  • Remote sensing data processing
  • Financial risk analysis

32. Thanks
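
The peak performance quoted for the compute sub-system (slide 14) can be cross-checked against the node counts. The sketch below is only a rough reconstruction and rests on two per-device rates that the slides do not state: 4 double-precision floating-point operations per cycle per Xeon X5670 core, and 515.2 Gflops double-precision peak per Tesla M2050. All constant and function names are illustrative.

```python
# Sketch: cross-check the compute sub-system peak quoted on slide 14.
# Assumptions NOT taken from the slides:
#   - 4 double-precision flops per cycle per Xeon X5670 (Westmere) core
#   - 515.2 Gflops double-precision peak per Tesla M2050

NODES = 7168            # compute nodes (slide 14)
CPUS_PER_NODE = 2       # six-core CPUs per node (slide 14)
CORES_PER_CPU = 6
CPU_GHZ = 2.93          # processor speed (slide 14)
FLOPS_PER_CYCLE = 4     # assumed
GPU_GFLOPS = 515.2      # assumed


def compute_peak_gflops():
    """Return the combined CPU + GPU peak of all compute nodes in Gflops."""
    gflops_per_cpu = CORES_PER_CPU * CPU_GHZ * FLOPS_PER_CYCLE
    cpu_total = NODES * CPUS_PER_NODE * gflops_per_cpu
    gpu_total = NODES * GPU_GFLOPS  # one GPU per node
    return cpu_total + gpu_total


if __name__ == "__main__":
    print(f"{compute_peak_gflops():,.0f} Gflops")
```

Under these assumed rates the total rounds to the 4,701,061 Gflops stated on the slide, which suggests the quoted figure counts the Intel CPUs and NVIDIA GPUs only and excludes the FT-1000 service nodes.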
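
The adaptive partitioning loop from the molecular simulation example (slide 28) can be illustrated with a small simulation. This is a minimal sketch, not the framework's actual code: the update rule (move the GPU share toward the ratio of measured throughputs), the `damping` parameter, and every function name are assumptions made for illustration. The slide only says the split is adjusted from measured patches per second plus an empirical term, which is not modeled here. Device work is replaced by fixed speeds, so no GPU is needed to run it.

```python
# Sketch of adaptive CPU/GPU partitioning, using simulated device speeds.
# The update rule and the damping factor are illustrative assumptions;
# the slides only state that the split follows measured patches per second.


def split_patches(n_patches, gpu_share):
    """Return (gpu_count, cpu_count) for a GPU share in [0, 1]."""
    gpu_count = min(n_patches, max(0, round(n_patches * gpu_share)))
    return gpu_count, n_patches - gpu_count


def update_share(gpu_share, gpu_count, gpu_time, cpu_count, cpu_time, damping=0.5):
    """Move the GPU share part of the way toward the measured throughput ratio."""
    if gpu_count == 0 or cpu_count == 0 or gpu_time <= 0 or cpu_time <= 0:
        return gpu_share  # one side was not measured; keep the current split
    gpu_rate = gpu_count / gpu_time  # patches per second on the GPU
    cpu_rate = cpu_count / cpu_time  # patches per second on the CPU cores
    target = gpu_rate / (gpu_rate + cpu_rate)
    return gpu_share + damping * (target - gpu_share)


def simulate(n_patches=400, steps=20, gpu_speed=3.0, cpu_speed=1.0):
    """Run simulated time steps and return the final GPU share."""
    share = 0.5  # start with an even split
    for _ in range(steps):
        gpu_count, cpu_count = split_patches(n_patches, share)
        # Stand-ins for timing the asynchronous GPU work and the CPU threads.
        gpu_time = gpu_count / gpu_speed
        cpu_time = cpu_count / cpu_speed
        share = update_share(share, gpu_count, gpu_time, cpu_count, cpu_time)
    return share


if __name__ == "__main__":
    print(f"final GPU share: {simulate():.3f}")
```

The default speeds echo the 3:1 GPU-to-CPU ratio quoted on the slide for one M2050 against one X5670, so the share settles near three quarters of the patches on the GPU. A real node has two CPUs and measured timings that vary from step to step, which is one reason to damp the update instead of jumping straight to the measured ratio.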