Simulation
CAD
CAE
Engineering
Price Modelling
Life Sciences
Automotive
Aerospace
Risk Analysis
Big Data Analytics
High Throughput Computing
High Performance Computing 2013/14
Technology Compass
Technology Compass
Table of contents and Introduction
High Performance Computing ........................................... 4
Performance Turns Into Productivity ...................................................... 6
Flexible Deployment With xCAT ..................................................................8
Service and Customer Care From A to Z ............................................. 10
Advanced Cluster Management Made Easy ........... 12
Easy-to-use, Complete and Scalable ...................................................... 14
Cloud Bursting With Bright ......................................................................... 18
Intelligent HPC Workload Management ................... 30
Moab HPC Suite – Basic Edition ................................................................ 32
Moab HPC Suite – Enterprise Edition .................................................... 34
Moab HPC Suite – Grid Option ................................................................... 36
Optimizing Accelerators with Moab HPC Suite .............................. 38
Moab HPC Suite – Application Portal Edition .................................. 42
Moab HPC Suite – Remote Visualization Edition ........................... 44
Using Moab With Grid Environments ................................................... 46
NICE EnginFrame ..................................................................... 52
A Technical Portal for Remote Visualization .................................... 54
Application Highlights ................................................................................... 56
Desktop Cloud Virtualization .................................................................... 58
Remote Visualization ...................................................................................... 60
Intel Cluster Ready ................................................................ 64
A Quality Standard for HPC Clusters ...................................................... 66
Intel Cluster Ready builds HPC Momentum ..................................... 70
The transtec Benchmarking Center ....................................................... 74
Parallel NFS ................................................................................. 76
The New Standard for HPC Storage ....................................................... 78
What’s New in NFS 4.1? .............................................................................. 80
Panasas HPC Storage ...................................................................................... 82
NVIDIA GPU Computing ....................................................... 94
What is GPU Computing? ............................................................................. 96
Kepler GK110 – The Next Generation GPU ......................................... 98
Introducing NVIDIA Parallel Nsight ..................................................... 100
A Quick Refresher on CUDA ...................................................................... 112
Intel TrueScale InfiniBand and GPUs .................................................. 118
Intel Xeon Phi Coprocessor .............................................122
The Architecture .............................................................................................. 124
InfiniBand ...................................................................134
High-speed Interconnects ........................................................................ 136
Top 10 Reasons to Use Intel TrueScale InfiniBand ..................... 138
Intel MPI Library 4.0 Performance ........................................................ 140
Intel Fabric Suite 7 ......................................................................................... 144
Parstream ...................................................................150
Big Data Analytics .......................................................................................... 152
Glossary .......................................................................................160
More than 30 years of experience in Scientific Computing
1980 marked the beginning of a decade where numerous start-
ups were created, some of which later transformed into big play-
ers in the IT market. Technical innovations brought dramatic
changes to the nascent computer market. In Tübingen, close to
one of Germany’s prime and oldest universities, transtec was
founded.
In the early days, transtec focused on reselling DEC computers
and peripherals, delivering high-performance workstations to
university institutes and research facilities. In 1987, SUN/Sparc
and storage solutions broadened the portfolio, enhanced by
IBM/RS6000 products in 1991. These were the typical worksta-
tions and server systems for high performance computing then,
used by the majority of researchers worldwide.
In the late 90s, transtec was one of the first companies to offer
highly customized HPC cluster solutions based on standard Intel
architecture servers, some of which entered the TOP 500 list of
the world’s fastest computing systems.
Thus, given this background and history, it is fair to say that
transtec looks back upon more than 30 years’ experience in sci-
entific computing; our track record shows more than 500 HPC in-
stallations. With this experience, we know exactly what custom-
ers’ demands are and how to meet them. High performance and
ease of management – this is what customers require today. HPC
systems are for sure required to peak-perform, as their name in-
dicates, but that is not enough: they must also be easy to handle.
Unwieldy design and operational complexity must be avoided
or at least hidden from administrators and particularly users of
HPC computer systems.
This brochure focuses on where transtec HPC solutions excel.
transtec HPC solutions use the latest and most innovative tech-
nology: Bright Cluster Manager as the technology leader for uni-
fied HPC cluster management, leading-edge Moab HPC Suite for
job and workload management, Intel Cluster Ready certification
as an independent quality standard for our systems, and Panasas
HPC storage systems for the highest performance and the ease of
management required of a reliable HPC storage system. Again,
with these components, usability, reliability and ease of man-
agement are central issues that are addressed, even in a highly
heterogeneous environment. transtec is able to provide custom-
ers with well-designed, extremely powerful solutions for Tesla
GPU computing, as well as thoroughly engineered Intel Xeon
Phi systems. Intel’s InfiniBand Fabric Suite makes managing a
large InfiniBand fabric easier than ever before. transtec master-
fully combines excellent, well-chosen, proven components into
a fine-tuned, customer-specific, and thoroughly designed HPC
solution.
Your decision for a transtec HPC solution means you opt for the
most intensive customer care and the best service in HPC. Our ex-
perts will be glad to bring in their expertise and support to assist
you at any stage, from HPC design to daily cluster operations, to
HPC Cloud Services.
Last but not least, transtec HPC Cloud Services provide custom-
ers with the possibility to have their jobs run on dynamically pro-
vided nodes in a dedicated datacenter, professionally managed
and individually customizable. Numerous standard applications
like ANSYS, LS-Dyna, OpenFOAM, as well as lots of codes like Gro-
macs, NAMD, VMD, and others are pre-installed, integrated into
an enterprise-ready cloud management environment, and ready
to run.
Have fun reading the transtec HPC Compass 2013/14!
High Performance Computing – Performance Turns Into Productivity
High Performance Computing (HPC) has been with us from the very beginning of the computer era. High-performance computers were built to solve numerous problems which the “human computers” could not handle. The term HPC just hadn’t been coined yet. More importantly, some of the early principles have changed fundamentally.
Performance Turns Into Productivity
HPC systems in the early days were much different from those
we see today. First, we saw enormous mainframes from large
computer manufacturers, including a proprietary operating
system and job management system. Second, at universities
and research institutes, workstations made inroads and sci-
entists carried out calculations on their dedicated Unix or
VMS workstations. In either case, if you needed more comput-
ing power, you scaled up, i.e. you bought a bigger machine.
Today the term High-Performance Computing has gained a
fundamentally new meaning. HPC is now perceived as a way
to tackle complex mathematical, scientific or engineering
problems. The integration of industry-standard, “off-the-shelf”
server hardware into HPC clusters facilitates the construction
of computer networks of such power as one single system
could never achieve. The new paradigm for parallelization is
scaling out.
Computer-supported simulation of realistic processes (so-
called Computer-Aided Engineering – CAE) has established
itself as a third key pillar in the field of science and research
alongside theory and experimentation. It is nowadays incon-
ceivable that an aircraft manufacturer or a Formula One rac-
ing team would operate without using simulation software.
And scientific calculations, such as in the fields of astrophys-
ics, medicine, pharmaceuticals and bio-informatics, will to a
large extent be dependent on supercomputers in the future.
Software manufacturers long ago recognized the benefit of
high-performance computers based on powerful standard
servers and ported their programs to them accordingly.
The main advantage of scale-out supercomputers is just that:
they are infinitely scalable, at least in principle. Since they are
based on standard hardware components, such a supercom-
puter can be given more power whenever the computational
capacity of the system is no longer sufficient, simply by adding
additional nodes of the same kind. A cumbersome switch to a
different technology can be avoided in most cases.
“transtec HPC solutions are meant to provide cus-
tomers with unparalleled ease-of-management and
ease-of-use. Apart from that, deciding for a transtec
HPC solution means deciding for the most intensive
customer care and the best service imaginable.”
Dr. Oliver Tennert Director Technology Management & HPC Solutions
Variations on the theme: MPP and SMP
Parallel computations exist in two major variants today. Applications
running in parallel on multiple compute nodes are
frequently so-called Massively Parallel Processing (MPP) appli-
cations. MPP indicates that the individual processes can each
utilize exclusive memory areas. This means that such jobs are
predestined to be computed in parallel, distributed across the
nodes in a cluster. The individual processes can thus utilize the
separate units of the respective node – especially the RAM, the
CPU power and the disk I/O.
Communication between the individual processes is imple-
mented in a standardized way through the MPI software inter-
face (Message Passing Interface), which abstracts the underlying
network connections between the nodes from the processes.
However, the MPI standard (current version 2.0) merely requires
source code compatibility, not binary compatibility, so an off-
the-shelf application usually needs specific versions of MPI
libraries in order to run. Examples of MPI implementations are
OpenMPI, MPICH2, MVAPICH2, Intel MPI or – for Windows clusters
– MS-MPI.
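To make the MPP model concrete, here is a minimal sketch of message passing between ranks – a token travelling around a ring of processes. It is illustrative only: it assumes one of the MPI implementations named above, its usual compiler wrapper and launcher (e.g. mpicc and mpirun), and at least two ranks.

```c
/* Minimal MPI sketch: pass a token around a ring of processes.
 * Illustrative only; build and run with your MPI implementation,
 * e.g. "mpicc ring.c -o ring && mpirun -np 4 ./ring" (needs >= 2 ranks). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID     */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */

    if (rank == 0) {
        token = 42;                          /* start the token       */
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("token passed through all %d ranks\n", size);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

The MPI library decides whether a given message travels over shared memory, Ethernet, or InfiniBand; the application code itself stays the same.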
If the individual processes engage in a large amount of commu-
nication, the response time of the network (latency) becomes im-
portant. Latency in a Gigabit Ethernet or a 10GE network is typi-
cally around 10 µs. High-speed interconnects such as InfiniBand
reduce latency by a factor of 10, down to as low as 1 µs. Therefore,
high-speed interconnects can greatly speed up total processing.
The other frequently used variant is called SMP applications.
SMP, in this HPC context, stands for Shared Memory Processing.
It involves the use of shared memory areas, the specific imple-
mentation of which is dependent on the choice of the underly-
ing operating system. Consequently, SMP jobs generally only run
on a single node, where they can in turn be multi-threaded and
thus be parallelized across the number of CPUs per node. For
many HPC applications, both the MPP and SMP variant can be
chosen.
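As a sketch of the SMP variant, the following fragment uses OpenMP – one common way, assumed here for illustration, to multi-thread a job within a single node (compiled e.g. with gcc -fopenmp). All threads work on the same shared array in the node’s RAM.

```c
/* Minimal shared-memory (SMP) sketch using OpenMP: one process,
 * many threads on one node, all sharing the same memory. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N];   /* shared among all threads */

int main(void)
{
    double sum = 0.0;

    /* Loop iterations are distributed across the cores of the node;
     * "reduction" safely merges the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++) {
        a[i] = (double)i;
        sum += a[i];
    }

    printf("sum = %.0f, using up to %d threads\n",
           sum, omp_get_max_threads());
    return 0;
}
```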
Many applications are not inherently suitable for parallel execu-
tion. In such a case, there is no communication between the in-
dividual compute nodes, and therefore no need for a high-speed
network between them; nevertheless, multiple computing jobs
can be run simultaneously and sequentially on each individual
node, depending on the number of CPUs.
In order to ensure optimum computing performance for these
applications, it must be examined how many CPUs and cores de-
liver the optimum performance.
We find applications of this sequential type of work typically in
the fields of data analysis or Monte-Carlo simulations.
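A hypothetical example of such a sequential job is the Monte-Carlo estimate of π sketched below: each instance runs on a single node without any inter-node communication, and many instances – differing only in their random seed, here taken from the command line purely for illustration – can run concurrently across the cluster.

```c
/* Sketch of an embarrassingly parallel, sequential-type job:
 * a Monte-Carlo estimate of pi. Independent copies of this program
 * can run on many nodes at once, with no communication between them. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* seed from the command line so that concurrent instances differ */
    unsigned int seed = (argc > 1) ? (unsigned int)atoi(argv[1]) : 1;
    long trials = 10000000, hits = 0;

    srand(seed);
    for (long i = 0; i < trials; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            hits++;   /* random point fell inside the unit circle */
    }
    printf("pi ~ %f (seed %u)\n", 4.0 * (double)hits / trials, seed);
    return 0;
}
```

A workload manager such as Moab or Torque would simply queue hundreds of such independent jobs onto the available cores.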
xCAT as a Powerful and Flexible Deployment Tool
xCAT (Extreme Cluster Administration Tool) is an open source
toolkit for the deployment and low-level administration of
HPC cluster environments, small as well as large ones.
xCAT provides simple commands for hardware control, node
discovery, the collection of MAC addresses, and node deployment
with (diskful) or without (diskless) local installation.
The cluster configuration is stored in a relational database.
Node groups for different operating system images can be
defined. Also, user-specific scripts can be executed automati-
cally at installation time.
xCAT Provides the Following Low-Level Administrative Features
ʎ Remote console support
ʎ Parallel remote shell and remote copy commands
ʎ Plugins for various monitoring tools like Ganglia or Nagios
ʎ Hardware control commands for node discovery, collect-
ing MAC addresses, remote power switching and resetting
of nodes
ʎ Automatic configuration of syslog, remote shell, DNS,
DHCP, and ntp within the cluster
ʎ Extensive documentation and man pages
For cluster monitoring, we install and configure the open
source tool Ganglia or the even more powerful open source
solution Nagios, according to the customer’s preferences and
requirements.
Local Installation or Diskless Installation
We offer a diskful or a diskless installation of the cluster
nodes. A diskless installation means the operating system is
hosted partially within the main memory; larger parts may or
may not be mounted via NFS or other means. This approach
allows for deploying large numbers of nodes very efficiently,
and the cluster is up and running within a very short time-
scale. Also, updating the cluster can be done in a very efficient
way. For this, only the boot image has to be updated, and the
nodes have to be rebooted. After this, the nodes run either a
new kernel or even a new operating system. Moreover, with
this approach, partitioning the cluster can also be very effi-
ciently done, either for testing purposes, or for allocating dif-
ferent cluster partitions for different users or applications.
Development Tools, Middleware, and Applications
According to the application, optimization strategy, or under-
lying architecture, different compilers lead to code results of
very different performance. Moreover, different, mainly commer-
cial, applications require different MPI implementations.
And even when the code is self-developed, developers often
prefer one MPI implementation over another.
According to the customer’s wishes, we install various compil-
ers, MPI middleware, as well as job management systems like
Parastation, Grid Engine, Torque/Maui, or the very powerful
Moab HPC Suite for the high-level cluster management.
High Performance Computing
Services and Customer Care from A to Z
transtec HPC as a Service
You will get a range of applications like LS-Dyna, ANSYS,
Gromacs, NAMD etc. from all kinds of areas pre-installed,
integrated into an enterprise-ready cloud and workload
management system, and ready to run. Do you miss your
application?
Ask us: [email protected]
transtec Platform as a Service
You will be provided with dynamically allocated compute
nodes for running your individual code. The operating
system will be pre-installed according to your require-
ments. Common Linux distributions like RedHat, CentOS,
or SLES are the standard. Do you need another distribu-
tion?
Ask us: [email protected]
transtec Hosting as a Service
You will be provided with hosting space inside a profes-
sionally managed and secured datacenter where you
can have your machines hosted, managed, and maintained
according to your requirements. Thus, you can build up
your own private cloud. What range of hosting and main-
tenance services do you need?
Tell us: [email protected]
HPC @ transtec: Services and Customer Care from A to Z
transtec AG has over 30 years of experience in scientific computing
and is one of the earliest manufacturers of HPC clusters. For nearly
a decade, transtec has delivered highly customized High Perfor-
mance clusters based on standard components to academic and in-
dustry customers across Europe with all the high quality standards
and the customer-centric approach that transtec is well known for.
Every transtec HPC solution is more than just a rack full of hard-
ware – it is a comprehensive solution with everything the HPC user,
owner, and operator need. In the early stages of any customer’s HPC
project, transtec experts provide extensive and detailed consult-
ing to the customer, who benefits from their expertise and experience.
Consulting is followed by benchmarking of different systems with
either specifically crafted customer code or generally accepted
benchmarking routines; this aids customers in sizing and devising
the optimal and detailed HPC configuration.

The transtec HPC services cycle: individual presales consulting; application-, customer-, and site-specific sizing of the HPC solution; benchmarking of different systems; burn-in tests of systems; software & OS installation; application installation; onsite hardware assembly; integration into the customer’s environment; customer training; maintenance, support & managed services; and continual improvement.
Each and every piece of HPC hardware that leaves our factory un-
dergoes a burn-in procedure of 24 hours or more if necessary. We
make sure that any hardware shipped meets our and our custom-
ers’ quality requirements. transtec HPC solutions are turnkey solu-
tions. By default, a transtec HPC cluster has everything installed and
configured – from hardware and operating system to important
middleware components like cluster management or developer
tools and the customer’s production applications. Onsite delivery
means onsite integration into the customer’s production environ-
ment, be it establishing network connectivity to the corporate net-
work, or setting up software and configuration parts.
transtec HPC clusters are ready-to-run systems – we deliver, you
turn the key, the system delivers high performance. Every HPC proj-
ect entails transfer to production: IT operation processes and poli-
cies apply to the new HPC system. Effectively, IT personnel are trained
hands-on, introduced to hardware components and software, and to
all operational aspects of configuration management.
transtec services do not stop when the implementation project
ends. Beyond transfer to production, transtec takes care. transtec
offers a variety of support and service options, tailored to the cus-
tomer’s needs. When you are in need of a new installation, a major
reconfiguration or an update of your solution – transtec is able to
support your staff and, if you lack the resources for maintaining
the cluster yourself, maintain the HPC solution for you. From Pro-
fessional Services to Managed Services for daily operations and
required service levels, transtec will be your complete HPC service
and solution provider. transtec’s high standards of performance, re-
liability and dependability assure your productivity and complete
satisfaction.
transtec’s HPC Managed Services offer customers the
possibility of having the complete management and administra-
tion of the HPC cluster handled by transtec service specialists, in
an ITIL-compliant way. Moreover, transtec’s HPC on Demand servic-
es provide customers with access to HPC resources whenever they
need them, for example because they cannot own and run an HPC
cluster themselves, due to lacking infrastructure, know-how, or
admin staff.
transtec HPC Cloud Services
Last but not least, transtec’s services portfolio evolves as customers’
demands change. Starting this year, transtec is able to provide HPC
Cloud Services. transtec uses a dedicated datacenter to provide
computing power to customers who are in need of more capacity
than they own, which is why this workflow model is sometimes
called computing-on-demand. These dynamically provided
resources give customers the possibility to have their jobs run on
HPC nodes in a dedicated datacenter, professionally managed and
secured, and individually customizable. Numerous standard ap-
plications like ANSYS, LS-Dyna, OpenFOAM, as well as lots of codes
like Gromacs, NAMD, VMD, and others are pre-installed, integrated
into an enterprise-ready cloud and workload management environ-
ment, and ready to run.
Alternatively, whenever customers are in need of space for hosting
their own HPC equipment because they do not have the space ca-
pacity or cooling and power infrastructure themselves, transtec is
also able to provide Hosting Services to those customers who’d like
to have their equipment professionally hosted, maintained, and
managed. Customers can thus build up their own private cloud!
Are you interested in any of transtec’s broad range of HPC related
services? Write us an email to [email protected]. We’ll be happy to
hear from you!
Advanced Cluster Management Made Easy
Bright Cluster Manager removes the complexity from the installation, management and use of HPC clusters — on-premise or in the cloud.

With Bright Cluster Manager, you can easily install, manage and use multiple clusters simultaneously, including compute, Hadoop, storage, database and workstation clusters.
The Bright Advantage
Bright Cluster Manager delivers improved productivity, in-
creased uptime, proven scalability and security, while reduc-
ing operating cost:
Rapid Productivity Gains
ʎ Short learning curve: intuitive GUI drives it all
ʎ Quick installation: one hour from bare metal to compute-
ready
ʎ Fast, flexible provisioning: incremental, live, diskful, diskless,
over InfiniBand, to virtual machines, auto node discovery
ʎ Comprehensive monitoring: on-the-fly graphs, Rackview, mul-
tiple clusters, custom metrics
ʎ Powerful automation: thresholds, alerts, actions
ʎ Complete GPU support: NVIDIA, AMD, CUDA, OpenCL
ʎ Full support for Intel Xeon Phi
ʎ On-demand SMP: instant ScaleMP virtual SMP deployment
ʎ Fast customization and task automation: powerful cluster
management shell, SOAP and JSON APIs make it easy
ʎ Seamless integration with leading workload managers: Slurm,
Open Grid Scheduler, Torque, openlava, Maui, PBS Profes-
sional, Univa Grid Engine, Moab, LSF
ʎ Integrated (parallel) application development environment
ʎ Easy maintenance: automatically update your cluster from
Linux and Bright Computing repositories
ʎ Easily extendable, web-based User Portal
ʎ Cloud-readiness at no extra cost, supporting scenarios “Clus-
ter-on-Demand” and “Cluster-Extension”, with data-aware
scheduling
ʎ Deploys, provisions, monitors and manages Hadoop clusters
ʎ Future-proof: transparent customization minimizes disrup-
tion from staffing changes
Maximum Uptime
ʎ Automatic head node failover: prevents system downtime
ʎ Powerful cluster automation: drives pre-emptive actions
based on monitoring thresholds
Advanced Cluster Management Made Easy
Easy-to-use, complete and scalable
The cluster installer takes you through the installation process and offers advanced options such as “Express” and “Remote”.
ʎ Comprehensive cluster monitoring and health checking: au-
tomatic sidelining of unhealthy nodes to prevent job failure
Scalability from Deskside to TOP500
ʎ Off-loadable provisioning: enables maximum scalability
ʎ Proven: used on some of the world’s largest clusters
Minimum Overhead / Maximum Performance
ʎ Lightweight: single daemon drives all functionality
ʎ Optimized: daemon has minimal impact on operating system
and applications
ʎ Efficient: single database for all metric and configuration
data
Top Security
ʎ Key-signed repositories: controlled, automated security and
other updates
ʎ Encryption option: for external and internal communications
ʎ Safe: X509v3 certificate-based public-key authentication
ʎ Sensible access: role-based access control, complete audit
trail
ʎ Protected: firewalls and LDAP
A Unified Approach
Bright Cluster Manager was written from the ground up as a to-
tally integrated and unified cluster management solution. This
fundamental approach provides comprehensive cluster manage-
ment that is easy to use and functionality-rich, yet has minimal
impact on system performance. It has a single, light-weight
daemon, a central database for all monitoring and configura-
tion data, and a single CLI and GUI for all cluster management
functionality. Bright Cluster Manager is extremely easy to use,
scalable, secure and reliable. You can monitor and manage all
aspects of your clusters with virtually no learning curve.
Bright’s approach is in sharp contrast with other cluster man-
agement offerings, all of which take a “toolkit” approach. These
toolkits combine a Linux distribution with many third-party
tools for provisioning, monitoring, alerting, etc. This approach
has critical limitations: these separate tools were not designed
to work together; were often not designed for HPC, nor designed
to scale. Furthermore, each of the tools has its own interface
(mostly command-line based), and each has its own daemon(s)
and database(s). Countless hours of scripting and testing by
highly skilled people are required to get the tools to work for a
specific cluster, and much of it goes undocumented.
Time is wasted, and the cluster is at risk if staff changes occur,
losing the “in-head” knowledge of the custom scripts.
By selecting a cluster node in the tree on the left and the Tasks tab on the right, you can execute a number of powerful tasks on that node with just a single mouse click.
Ease of Installation
Bright Cluster Manager is easy to install. Installation and testing
of a fully functional cluster from “bare metal” can be complet-
ed in less than an hour. Configuration choices made during the
installation can be modified afterwards. Multiple installation
modes are available, including unattended and remote modes.
Cluster nodes can be automatically identified based on switch
ports rather than MAC addresses, improving speed and reliabil-
ity of installation, as well as subsequent maintenance. All major
hardware brands are supported: Dell, Cray, Cisco, DDN, IBM, HP,
Supermicro, Acer, Asus and more.
Ease of Use
Bright Cluster Manager is easy to use, with two interface op-
tions: the intuitive Cluster Management Graphical User Interface
(CMGUI) and the powerful Cluster Management Shell (CMSH).
The CMGUI is a standalone desktop application that provides
a single system view for managing all hardware and software
aspects of the cluster through a single point of control. Admin-
istrative functions are streamlined as all tasks are performed
through one intuitive, visual interface. Multiple clusters can be
managed simultaneously. The CMGUI runs on Linux, Windows
and OS X, and can be extended using plugins. The CMSH provides
practically the same functionality as the CMGUI, but via a com-
mand-line interface. The CMSH can be used both interactively
and in batch mode via scripts.
Either way, you now have unprecedented flexibility and control
over your clusters.
Support for Linux and Windows
Bright Cluster Manager is based on Linux and is available with
a choice of pre-integrated, pre-configured and optimized Linux
distributions, including SUSE Linux Enterprise Server, Red Hat
Enterprise Linux, CentOS and Scientific Linux. Dual-boot installa-
tions with Windows HPC Server are supported as well, allowing
nodes to either boot from the Bright-managed Linux head node,
or the Windows-managed head node.
The Overview tab provides instant, high-level insight into the status of the cluster.
Extensive Development Environment
Bright Cluster Manager provides an extensive HPC development
environment for both serial and parallel applications, including
the following (some are cost options):
ʎ Compilers, including full suites from GNU, Intel, AMD and Port-
land Group.
ʎ Debuggers and profilers, including the GNU debugger and
profiler, TAU, TotalView, Allinea DDT and Allinea MAP.
ʎ GPU libraries, including CUDA and OpenCL.
ʎ MPI libraries, including OpenMPI, MPICH, MPICH2, MPICH-MX,
MPICH2-MX, MVAPICH and MVAPICH2; all cross-compiled with
the compilers installed on Bright Cluster Manager, and opti-
mized for high-speed interconnects such as InfiniBand and
10GE.
ʎ Mathematical libraries, including ACML, FFTW, Goto-BLAS,
MKL and ScaLAPACK.
ʎ Other libraries, including Global Arrays, HDF5, Intel IPP, TBB,
NetCDF and PETSc.
Bright Cluster Manager also provides Environment Modules to
make it easy to maintain multiple versions of compilers, libraries
and applications for different users on the cluster, without creating
compatibility conflicts. Each Environment Module file contains the
information needed to configure the shell for an application, and
automatically sets these variables correctly for the particular appli-
cation when it is loaded. Bright Cluster Manager includes many pre-
configured module files for many scenarios, such as combinations
of compilers, mathematical and MPI libraries.
Powerful Image Management and Provisioning
Bright Cluster Manager features sophisticated software image
management and provisioning capability. A virtually unlimited
number of images can be created and assigned to as many dif-
ferent categories of nodes as required. Default or custom Linux
kernels can be assigned to individual images. Incremental
changes to images can be deployed to live nodes without re-
booting or re-installation.
The provisioning system only propagates changes to the images,
minimizing time and impact on system performance and avail-
ability. Provisioning capability can be assigned to any number of
nodes on-the-fly, for maximum flexibility and scalability. Bright
Cluster Manager can also provision over InfiniBand and to ram-
disk or virtual machine.
Comprehensive Monitoring
With Bright Cluster Manager, you can collect, monitor, visual-
ize and analyze a comprehensive set of metrics. Many software
and hardware metrics available to the Linux kernel, and many
hardware management interface metrics (IPMI, DRAC, iLO, etc.)
are sampled.
Cluster metrics, such as GPU, Xeon Phi and CPU temperatures, fan speeds and network statistics can be visualized by simply dragging and dropping them into a graphing window. Multiple metrics can be combined in one graph and graphs can be zoomed into. A Graphing wizard allows creation of all graphs for a selected combination of metrics and nodes. Graph layout and color configurations can be tailored to your requirements and stored for re-use.
High performance meets efficiency
Initially, massively parallel systems constitute a challenge to
both administrators and users. They are complex beasts. Anyone
building HPC clusters will need to tame the beast, master the
complexity and present users and administrators with an easy-
to-use, easy-to-manage system landscape.
Leading HPC solution providers such as transtec achieve this
goal. They hide the complexity of HPC under the hood and match
high performance with efficiency and ease-of-use for both users
and administrators. The “P” in “HPC” gains a double meaning:
“Performance” plus “Productivity”.
Cluster and workload management software like Moab HPC
Suite, Bright Cluster Manager or QLogic IFS provide the means
to master and hide the inherent complexity of HPC systems.
For administrators and users, HPC clusters are presented as
single, large machines, with many different tuning parameters.
The software also provides a unified view of existing clusters
whenever unified management is added as a requirement by the
customer at any point in time after the first installation. Thus,
daily routine tasks such as job management, user management,
queue partitioning and management, can be performed easily
with either graphical or web-based tools, without any advanced
scripting skills or technical expertise required from the adminis-
trator or user.
Advanced Cluster Management Made Easy
Cloud Bursting With Bright
Bright Cluster Manager unleashes and manages the unlimited power of the cloud
Create a complete cluster in Amazon EC2, or easily extend your
onsite cluster into the cloud, enabling you to dynamically add
capacity and manage these nodes as part of your onsite cluster.
Both can be achieved in just a few mouse clicks, without the
need for expert knowledge of Linux or cloud computing.
Bright Cluster Manager’s unique data aware scheduling capabil-
ity means that your data is automatically in place in EC2 at the
start of the job, and that the output data is returned as soon as
the job is completed.
With Bright Cluster Manager, every cluster is cloud-ready, at no
extra cost. The same powerful cluster provisioning, monitoring,
scheduling and management capabilities that Bright Cluster
Manager provides to onsite clusters extend into the cloud.
The Bright advantage for cloud bursting
ʎ Ease of use: Intuitive GUI virtually eliminates user learning
curve; no need to understand Linux or EC2 to manage system.
Alternatively, cluster management shell provides powerful
scripting capabilities to automate tasks
ʎ Complete management solution: Installation/initialization,
provisioning, monitoring, scheduling and management in
one integrated environment
ʎ Integrated workload management: Wide selection of work-
load managers included and automatically configured with
local, cloud and mixed queues
ʎ Single system view; complete visibility and control: Cloud com-
pute nodes managed as elements of the on-site cluster; visible
from a single console with drill-downs and historic data.
ʎ Efficient data management via data aware scheduling: Auto-
matically ensures data is in position at start of computation;
delivers results back when complete.
ʎ Secure, automatic gateway: Mirrored LDAP and DNS services
over an automatically-created VPN connect local and cloud-
based nodes for secure communication.
ʎ Cost savings: More efficient use of cloud resources; support
for spot instances, minimal user intervention.
Two cloud bursting scenarios
Bright Cluster Manager supports two cloud bursting scenarios:
“Cluster-on-Demand” and “Cluster Extension”.
Scenario 1: Cluster-on-Demand
The Cluster-on-Demand scenario is ideal if you do not have a
cluster onsite, or need to set up a totally separate cluster. With
just a few mouse clicks you can instantly create a complete clus-
ter in the public cloud, for any duration of time.
Scenario 2: Cluster Extension
The Cluster Extension scenario is ideal if you have a cluster onsite
but you need more compute power, including GPUs. With Bright
Cluster Manager, you can instantly add EC2-based resources to
your onsite cluster, for any duration of time. Bursting into the
cloud is as easy as adding nodes to an onsite cluster — there are
only a few additional steps after providing the public cloud ac-
count information to Bright Cluster Manager.
The Bright approach to managing and monitoring a cluster in
the cloud provides complete uniformity, as cloud nodes are man-
aged and monitored the same way as local nodes:
ʎ Load-balanced provisioning
ʎ Software image management
ʎ Integrated workload management
ʎ Interfaces — GUI, shell, user portal
ʎ Monitoring and health checking
ʎ Compilers, debuggers, MPI libraries, mathematical libraries
and environment modules
Bright Cluster Manager also provides additional features that
are unique to the cloud.
The Add Cloud Provider Wizard and the Node Creation Wizard make the cloud bursting process easy, even for users with no cloud experience.
Amazon spot instance support
Bright Cluster Manager enables users to take advantage of the
cost savings offered by Amazon’s spot instances. Users can spec-
ify the use of spot instances, and Bright will automatically sched-
ule them as they become available, reducing compute costs with-
out the need to monitor spot prices and schedule manually.
Hardware Virtual Machine (HVM) virtualization
Bright Cluster Manager automatically initializes all Amazon
instance types, including Cluster Compute and Cluster GPU in-
stances that rely on HVM virtualization.
Data aware scheduling
Data aware scheduling ensures that input data is transferred to
the cloud and made accessible just prior to the job starting, and
that the output data is transferred back. There is no need to wait
(and monitor) for the data to load prior to submitting jobs (de-
laying entry into the job queue), nor any risk of starting the job
before the data transfer is complete (crashing the job). Users sub-
mit their jobs, and Bright’s data aware scheduling does the rest.
Bright Cluster Manager provides the choice:
“To Cloud, or Not to Cloud”
Not all workloads are suitable for the cloud. The ideal situation
for most organizations is to have the ability to choose between
onsite clusters for jobs that require low latency communication,
complex I/O or sensitive data; and cloud clusters for many other
types of jobs.
Bright Cluster Manager delivers the best of both worlds: a pow-
erful management solution for local and cloud clusters, with
the ability to easily extend local clusters into the cloud without
compromising provisioning, monitoring or managing the cloud
resources.
One or more cloud nodes can be configured under the Cloud Settings tab.
Examples include CPU, GPU and Xeon Phi temperatures, fan
speeds, switches, hard disk SMART information, system load,
memory utilization, network metrics, storage metrics, power
systems statistics, and workload management metrics. Custom
metrics can also easily be defined.
Metric sampling is done very efficiently — in one process, or out-
of-band where possible. You have full flexibility over how and
when metrics are sampled, and historic data can be consolidat-
ed over time to save disk space.
Cluster Management Automation
Cluster management automation takes pre-emptive actions
when predetermined system thresholds are exceeded, saving
time and preventing hardware damage. Thresholds can be con-
figured on any of the available metrics. The built-in configura-
tion wizard guides you through the steps of defining a rule:
selecting metrics, defining thresholds and specifying actions.
For example, a temperature threshold for GPUs can be estab-
lished that results in the system automatically shutting down an
overheated GPU unit and sending a text message to your mobile
phone. Several predefined actions are available, but any built-in
cluster management command, Linux command or script can be
used as an action.

The parallel shell allows for simultaneous execution of commands or scripts across node groups or across the entire cluster.

The status of cluster nodes, switches, other hardware, as well as up to six metrics can be visualized in the Rackview. A zoom-out option is available for clusters with many racks.
Comprehensive GPU Management
Bright Cluster Manager radically reduces the time and effort
of managing GPUs, and fully integrates these devices into the
single view of the overall system. Bright includes powerful GPU
management and monitoring capability that leverages function-
ality in NVIDIA Tesla and AMD GPUs.
You can easily assume maximum control of the GPUs and gain
instant and time-based status insight. Depending on the GPU
make and model, Bright monitors a full range of GPU metrics,
including:
ʎ GPU temperature, fan speed, utilization
ʎ GPU exclusivity, compute, display, persistence mode
ʎ GPU memory utilization, ECC statistics
ʎ Unit fan speed, serial number, temperature, power usage, vol-
tages and currents, LED status, firmware.
ʎ Board serial, driver version, PCI info.
Beyond metrics, Bright Cluster Manager features built-in support
for GPU computing with CUDA and OpenCL libraries. Switching
between current and previous versions of CUDA and OpenCL has
also been made easy.
Full Support for Intel Xeon Phi
Bright Cluster Manager makes it easy to set up and use the Intel
Xeon Phi coprocessor. Bright includes everything that is needed
to get Phi to work, including a setup wizard in the CMGUI. Bright
ensures that your software environment is set up correctly, so
that the Intel Xeon Phi coprocessor is available for applications
that are able to take advantage of it.
Bright collects and displays a wide range of metrics for Phi, en-
suring that the coprocessor is visible and manageable as a de-
vice type, as well as including Phi as a resource in the workload
management system. Bright’s pre-job health checking ensures
that Phi is functioning properly before directing tasks to the
coprocessor.
Multi-Tasking Via Parallel Shell
The parallel shell allows simultaneous execution of multiple
commands and scripts across the cluster as a whole, or across
easily definable groups of nodes. Output from the executed
commands is displayed in a convenient way with variable levels
of verbosity. Running commands and scripts can be killed eas-
ily if necessary. The parallel shell is available through both the
CMGUI and the CMSH.
Integrated Workload Management
Bright Cluster Manager is integrated with a wide selection
of free and commercial workload managers. This integration
provides a number of benefits:
ʎ The selected workload manager gets automatically installed
and configured
The automation configuration wizard guides you through the steps of defining a rule: selecting metrics, defining thresholds and specifying actions
ʎ Many workload manager metrics are monitored
ʎ The CMGUI and User Portal provide a user-friendly interface
to the workload manager
ʎ The CMSH and the SOAP & JSON APIs provide direct and po-
werful access to a number of workload manager commands
and metrics
ʎ Reliable workload manager failover is properly configured
ʎ The workload manager is continuously made aware of the
health state of nodes (see section on Health Checking)
ʎ The workload manager is used to save power through auto-
power on/off based on workload
ʎ The workload manager is used for data-aware scheduling
of jobs to the cloud
The following user-selectable workload managers are tightly in-
tegrated with Bright Cluster Manager:
ʎ PBS Professional, Univa Grid Engine, Moab, LSF
ʎ Slurm, openlava, Open Grid Scheduler, Torque, Maui
Alternatively, other workload managers, such as LoadLeveler
and Condor, can be installed on top of Bright Cluster Manager.
Integrated SMP Support
Bright Cluster Manager — Advanced Edition dynamically aggre-
gates multiple cluster nodes into a single virtual SMP node, us-
ing ScaleMP’s Versatile SMP™ (vSMP) architecture. Creating and
dismantling a virtual SMP node can be achieved with just a few
Creating and dismantling a virtual SMP node can be achieved with just a few clicks within the GUI or a single command in the cluster management shell.
Example graphs that visualize metrics on a GPU cluster.
clicks within the CMGUI. Virtual SMP nodes can also be launched
and dismantled automatically using the scripting capabilities of
the CMSH.
In Bright Cluster Manager a virtual SMP node behaves like any
other node, enabling transparent, on-the-fly provisioning, con-
figuration, monitoring and management of virtual SMP nodes as
part of the overall system management.
Maximum Uptime with Head Node Failover
Bright Cluster Manager — Advanced Edition allows two
head nodes to be configured in active-active failover mode. Both
head nodes are on active duty, but if one fails, the other takes
over all tasks, seamlessly.
Maximum Uptime with Health Checking
Bright Cluster Manager — Advanced Edition includes a
powerful cluster health checking framework that maxi-
mizes system uptime. It continually checks multiple health
indicators for all hardware and software components and
proactively initiates corrective actions. It can also automati-
cally perform a series of standard and user-defined tests just
before starting a new job, to ensure successful execution
and prevent the “black hole node syndrome”. Examples
of corrective actions include autonomous bypass of faulty
nodes, automatic job requeuing to avoid queue flushing, and
process “jailing” to allocate, track, trace and flush completed
user processes. The health checking framework ensures the
highest job throughput, the best overall cluster efficiency
and the lowest administration overhead.
Top Cluster Security
Bright Cluster Manager offers an unprecedented level of
security that can easily be tailored to local requirements.
Security features include:
ʎ Automated security and other updates from key-signed Linux
and Bright Computing repositories.
ʎ Encrypted internal and external communications.
ʎ X509v3 certificate based public-key authentication to the
cluster management infrastructure.
ʎ Role-based access control and complete audit trail.
ʎ Firewalls, LDAP and SSH.
User and Group Management
Users can be added to the cluster through the CMGUI or the
CMSH. Bright Cluster Manager comes with a pre-configured
LDAP database, but an external LDAP service, or alternative au-
thentication system, can be used instead.
Web-Based User Portal
The web-based User Portal provides read-only access to essen-
tial cluster information, including a general overview of the clus-
ter status, node hardware and software properties, workload
manager statistics and user-customizable graphs. The User Por-
tal can easily be customized and expanded using PHP and the
SOAP or JSON APIs.
Workload management queues can be viewed and configured from the GUI, without the need for workload management expertise.
Multi-Cluster Capability
Bright Cluster Manager — Advanced Edition is ideal for organiza-
tions that need to manage multiple clusters, either in one or in
multiple locations. Capabilities include:
ʎ All cluster management and monitoring functionality is
available for all clusters through one GUI.
ʎ Selecting any set of configurations in one cluster and ex-
porting them to any or all other clusters with a few mouse
clicks.
ʎ Metric visualizations and summaries across clusters.
ʎ Making node images available to other clusters.
Fundamentally API-Based
Bright Cluster Manager is fundamentally API-based, which
means that any cluster management command and any piece
of cluster management data — whether it is monitoring data
or configuration data — is available through the API. Both a
SOAP and a JSON API are available and interfaces for various
programming languages, including C++, Python and PHP are
provided.
Cloud Bursting
Bright Cluster Manager supports two cloud bursting sce-
narios: “Cluster-on-Demand” — running stand-alone clusters
in the cloud; and “Cluster Extension” — adding cloud-based
resources to existing, onsite clusters and managing these
cloud nodes as if they were local. In addition, Bright provides
data aware scheduling to ensure that data is accessible in
the cloud at the start of jobs, and results are promptly trans-
ferred back. Both scenarios can be achieved in a few simple
steps. Every Bright cluster is automatically cloud-ready, at no
extra cost.

“The building blocks for transtec HPC solutions
must be chosen according to our goals of ease-of-
management and ease-of-use. With Bright Cluster
Manager, we are happy to have the technology
leader at hand, meeting these requirements, and
our customers value that.”

Armin Jäger HPC Solution Engineer
Hadoop Cluster Management
Bright Cluster Manager is the ideal basis for Hadoop clusters.
Bright installs on bare metal, configuring a fully operational
Hadoop cluster in less than one hour. In the process, Bright pre-
pares your Hadoop cluster for use by provisioning the operating
system and the general cluster management and monitoring ca-
pabilities required as on any cluster.
Bright then manages and monitors your Hadoop cluster’s
hardware and system software throughout its life-cycle,
collecting and graphically displaying a full range of Hadoop
metrics from the HDFS, RPC and JVM sub-systems. Bright sig-
nificantly reduces setup time for Cloudera, Hortonworks and
other Hadoop stacks, and increases both uptime and Map-
Reduce job throughput.
This functionality is scheduled to be further enhanced in
upcoming releases of Bright, including dedicated manage-
ment roles and profiles for name nodes, data nodes, as well
as advanced Hadoop health checking and monitoring func-
tionality.
Standard and Advanced Editions
Bright Cluster Manager is available in two editions: Standard
and Advanced. The table on this page lists the differences. You
can easily upgrade from the Standard to the Advanced Edition as
your cluster grows in size or complexity.
The web-based User Portal provides read-only access to essential cluster in-formation, including a general overview of the cluster status, node hardware and software properties, workload manager statistics and user-customizable graphs.
Bright Cluster Manager can manage multiple clusters simultaneously. This overview shows clusters in Oslo, Abu Dhabi and Houston, all managed through one GUI.
Documentation and Services
A comprehensive system administrator manual and user man-
ual are included in PDF format. Standard and tailored services
are available, including various levels of support, installation,
training and consultancy.
Feature Standard Advanced
Choice of Linux distributions x x
Intel Cluster Ready x x
Cluster Management GUI x x
Cluster Management Shell x x
Web-Based User Portal x x
SOAP API x x
Node Provisioning x x
Node Identification x x
Cluster Monitoring x x
Cluster Automation x x
User Management x x
Role-based Access Control x x
Parallel Shell x x
Workload Manager Integration x x
Cluster Security x x
Compilers x x
Debuggers & Profilers x x
MPI Libraries x x
Mathematical Libraries x x
Environment Modules x x
Cloud Bursting x x
Hadoop Management & Monitoring x x
NVIDIA CUDA & OpenCL - x
GPU Management & Monitoring - x
Xeon Phi Management & Monitoring - x
ScaleMP Management & Monitoring - x
Redundant Failover Head Nodes - x
Cluster Health Checking - x
Off-loadable Provisioning - x
Multi-Cluster Management - x
Suggested Number of Nodes 4–128 129–10,000+
Standard Support x x
Premium Support Optional Optional
Cluster health checks can be visualized in the Rackview. This screenshot shows that GPU unit 41 fails a health check called “AllFansRunning”.
Intelligent HPC Workload Management
While all HPC systems face challenges in workload demand, resource complexity, and scale, enterprise HPC systems face more stringent challenges and expectations.

Enterprise HPC systems must meet mission-critical and priority HPC workload demands for commercial businesses and business-oriented research and academic organizations. They have complex SLAs and priorities to balance. Their HPC workloads directly impact the revenue, product delivery, and organizational objectives of their organizations.
Enterprise HPC Challenges
Enterprise HPC systems must eliminate job delays and failures.
They are also seeking to improve resource utilization and man-
agement efficiency across multiple heterogeneous systems. To
maximize user productivity, they must make it easier for users to
access and use HPC resources, or even expand to other clusters
or HPC cloud, to better handle workload demand surges.
Intelligent Workload Management
As today’s leading HPC facilities move beyond petascale towards
exascale systems that incorporate increasingly sophisticated and
specialized technologies, equally sophisticated and intelligent
management capabilities are essential. With a proven history of
managing the most advanced, diverse, and data-intensive sys-
tems in the Top500, Moab continues to be the preferred workload
management solution for next generation HPC facilities.
Moab is the “brain” of an HPC system, intelligently optimizing workload throughput while balancing service levels and priorities.
Moab HPC Suite – Basic Edition
Moab HPC Suite – Basic Edition is an intelligent workload man-
agement solution. It automates the scheduling, managing,
monitoring, and reporting of HPC workloads on massive-scale,
multi-technology installations. The patented Moab intelligence
engine uses multi-dimensional policies to accelerate running
workloads across the ideal combination of diverse resources.
These policies balance high utilization and throughput goals
with competing workload priorities and SLA requirements. The
speed and accuracy of the automated scheduling decisions
optimizes workload throughput and resource utilization. This
gets more work accomplished in less time, and in the right prio-
rity order. Moab HPC Suite - Basic Edition optimizes the value
and satisfaction from HPC systems while reducing manage-
ment cost and complexity.
Moab HPC Suite – Enterprise Edition
Moab HPC Suite – Enterprise Edition provides enterprise-ready HPC
workload management that accelerates productivity, automates
workload uptime, and consistently meets SLAs and business prior-
ities for HPC systems and HPC cloud. It uses the battle-tested and
patented Moab intelligence engine to automatically balance the
complex, mission-critical workload priorities of enterprise HPC
systems. Enterprise customers benefit from a single integrated
product that brings together key enterprise HPC capabilities, im-
plementation services, and 24x7 support. This speeds the realiza-
tion of benefits from the HPC system for the business including:
ʎ Higher job throughput
ʎ Massive scalability for faster response and system expansion
ʎ Optimum utilization of 90-99% on a consistent basis
ʎ Fast, simple job submission and management to increase
productivity
ʎ Reduced cluster management complexity and support costs
across heterogeneous systems
ʎ Reduced job failures and auto-recovery from failures
ʎ SLAs consistently met for improved user satisfaction
ʎ Reduced power usage and costs by 10-30%
Productivity Acceleration
Moab HPC Suite – Enterprise Edition gets more results deliv-
ered faster from HPC resources at lower costs by accelerating
overall system, user and administrator productivity. Moab pro-
vides the scalability, 90-99 percent utilization, and simple job
submission that is required to maximize the productivity of
enterprise HPC systems. Enterprise use cases and capabilities
include:
ʎ Massive multi-point scalability to accelerate job response
and throughput, including high throughput computing
ʎ Workload-optimized allocation policies and provisioning
to get more results out of existing heterogeneous resources
and reduce costs, including topology-based allocation
ʎ Unify workload management across heterogeneous clus-
ters to maximize resource availability and administration
efficiency by managing them as one cluster
ʎ Optimized, intelligent scheduling packs workloads and
backfills around priority jobs and reservations while balanc-
ing SLAs to efficiently use all available resources (see the
sketch after this list)
ʎ Optimized scheduling and management of accelerators,
both Intel MIC and GPGPUs, for jobs to maximize their utili-
zation and effectiveness
ʎ Simplified job submission and management with advanced
job arrays, self-service portal, and templates
ʎ Administrator dashboards and reporting tools reduce man-
agement complexity, time, and costs
ʎ Workload-aware auto-power management reduces energy
use and costs by 10-30 percent
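To make the backfill idea concrete, here is a minimal Python sketch (illustrative only: the job mix, node counts, and reservation horizon are invented, and Moab’s real algorithm and configuration are far richer). It reserves nodes for the top-priority job and starts shorter jobs only if they will finish before that reservation begins:

```python
# Illustrative backfill sketch -- not Moab code. All data is invented.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int      # nodes requested
    runtime: int    # estimated runtime in minutes
    priority: int

queue = [
    Job("crash-sim", nodes=8, runtime=240, priority=100),  # needs all 8 nodes
    Job("post-proc", nodes=2, runtime=30, priority=50),
    Job("param-sweep", nodes=2, runtime=45, priority=40),
    Job("long-tail", nodes=2, runtime=600, priority=30),   # too long to backfill
]
queue.sort(key=lambda j: -j.priority)
top = queue[0]

# Suppose 4 nodes are free now and the rest free up in 60 minutes,
# so the top job gets a reservation starting at t+60.
available_now = 4
reservation_start = 60

backfilled = []
for job in queue[1:]:
    fits_now = job.nodes <= available_now
    done_in_time = job.runtime <= reservation_start  # won't delay the top job
    if fits_now and done_in_time:
        backfilled.append(job.name)
        available_now -= job.nodes

print("reserved:", top.name, "| backfilled:", backfilled)
# -> reserved: crash-sim | backfilled: ['post-proc', 'param-sweep']
```

Note how the long-running job is skipped even though nodes are free: starting it would push back the reserved priority job, which is exactly what backfill policies must avoid.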
“With Moab HPC Suite, we can meet very demand-
ing customers’ requirements as regards unified
management of heterogeneous cluster environ-
ments, grid management, and provide them with
flexible and powerful configuration and reporting
options. Our customers value that highly.”
Thomas Gebert HPC Solution Architect
Uptime Automation
Job and resource failures in enterprise HPC systems lead to de-
layed results, missed organizational opportunities, and missed
objectives. Moab HPC Suite – Enterprise Edition intelligently au-
tomates workload and resource uptime in the HPC system to en-
sure that workload completes successfully and reliably, avoiding
these failures. Enterprises can benefit from:
ʎ Intelligent resource placement to prevent job failures with
granular resource modeling to meet workload requirements
and avoid at-risk resources
ʎ Auto-response to failures and events with configurable ac-
tions to pre-failure conditions, amber alerts, or other metrics
and monitors
ʎ Workload-aware future maintenance scheduling that helps
maintain a stable HPC system without disrupting workload
productivity
ʎ Real-world expertise for fast time-to-value and system
uptime with included implementation, training, and 24x7
support remote services
Auto SLA Enforcement
Moab HPC Suite – Enterprise Edition uses the powerful Moab
intelligence engine to optimally schedule and dynamically
adjust workload to consistently meet service level agree-
ments (SLAs), guarantees, or business priorities. This auto-
matically ensures that the right workloads are completed at
the optimal times, taking into account the complex mix of
user departments, priorities, and SLAs to be balanced. Moab
provides:
ʎ Usage accounting and budget enforcement that schedules
resources and reports on usage in line with resource shar-
ing agreements and precise budgets (includes usage limits,
usage reports, auto budget management and dynamic fair-
share policies)
ʎ SLA and priority policies that make sure the highest priority
workloads are processed first (includes Quality of Service,
hierarchical priority weighting; see the sketch after this list)
ʎ Continuous plus future scheduling that ensures priorities
and guarantees are proactively met as conditions and work-
loads change (i.e. future reservations, pre-emption, etc.)
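As a rough illustration of how such policies combine (the components, weights, and inputs below are invented for the example; Moab expresses priorities through its own configuration directives, not Python), a job’s priority can be computed as a weighted sum of its QoS level, its user’s fairshare deficit, and its queue time:

```python
# Illustrative priority calculation -- weights and inputs are invented.
QOS_WEIGHT = 1000       # service-level component dominates
FAIRSHARE_WEIGHT = 100  # boost users below their usage target
QUEUE_WEIGHT = 1        # slowly growing tie-breaker

def priority(qos_level, fs_target, fs_used, minutes_queued):
    """Higher value = scheduled earlier."""
    fairshare_deficit = max(0.0, fs_target - fs_used)  # only under-served users gain
    return (QOS_WEIGHT * qos_level
            + FAIRSHARE_WEIGHT * fairshare_deficit
            + QUEUE_WEIGHT * minutes_queued)

# A premium-SLA job beats a long-waiting standard job:
print(priority(qos_level=3, fs_target=0.25, fs_used=0.10, minutes_queued=5))    # 3020.0
print(priority(qos_level=1, fs_target=0.25, fs_used=0.40, minutes_queued=600))  # 1600.0
```

The weighting is the policy: raising the queue-time weight drifts the system toward first-come-first-served, while raising the QoS weight enforces strict service levels.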
Grid- and Cloud-Ready Workload Management
The benefits of a traditional HPC environment can be extended
to more efficiently manage and meet workload demand with
the multi-cluster grid and HPC cloud management capabilities in
Moab HPC Suite - Enterprise Edition. It allows you to:
ʎ Showback or chargeback for pay-for-use so actual resource
usage is tracked with flexible chargeback rates and report-
ing by user, department, cost center, or cluster
ʎ Manage and share workload across multiple remote clusters
to meet growing workload demand or surges by adding on
the Moab HPC Suite – Grid Option.
Moab HPC Suite – Grid Option
The Moab HPC Suite - Grid Option is a powerful grid-workload
management solution that includes scheduling, advanced poli-
cy management, and tools to control all the components of ad-
vanced grids. Unlike other grid solutions, Moab HPC Suite - Grid
Option connects disparate clusters into a logical whole, enabling
grid administrators and grid policies to have sovereignty over all
systems while preserving control at the individual cluster.
Moab HPC Suite - Grid Option has powerful applications that al-
low organizations to consolidate reporting; information gather-
ing; and workload, resource, and data management. Moab HPC
Suite - Grid Option delivers these services in a near-transparent
way: users are unaware they are using grid resources—they
know only that they are getting work done faster and more eas-
ily than ever before.
Moab manages many of the largest clusters and grids in the
world. Moab technologies are used broadly across Fortune 500
companies and manage more than a third of the compute cores
in the top 100 systems of the Top500 supercomputers. Adaptive
Computing is a globally trusted ISV (independent software ven-
dor), and the full scalability and functionality that Moab HPC Suite
with the Grid Option offers in a single integrated solution has
traditionally made it a significantly more cost-effective option
than other tools on the market today.
Components
The Moab HPC Suite - Grid Option is available as an add-on to
Moab HPC Suite - Basic and Enterprise Editions. It extends
the capabilities and functionality of the Moab HPC Suite compo-
nents to manage grid environments including:
ʎ Moab Workload Manager for Grids – a policy-based work-
load management and scheduling multi-dimensional deci-
sion engine
ʎ Moab Cluster Manager for Grids – a powerful and unified
graphical administration tool for monitoring, managing and
reporting across multiple clusters
ʎ Moab Viewpoint™ for Grids – a Web-based self-service
end-user job-submission and management portal and ad-
ministrator dashboard
Benefits
ʎ Unified management across heterogeneous clusters provides
the ability to move quickly from cluster to optimized grid
ʎ Policy-driven and predictive scheduling ensures that jobs
start and run in the fastest time possible by selecting optimal
resources
ʎ Flexible policy and decision engine adjusts workload pro-
cessing at both grid and cluster level
ʎ Grid-wide interface and reporting tools provide view of grid
resources, status and usage charts, and trends over time for
capacity planning, diagnostics, and accounting
ʎ Advanced administrative control allows various business
units to access and view grid resources, regardless of physical
or organizational boundaries, or alternatively restricts access
to resources by specific departments or entities
ʎ Scalable architecture to support peta-scale, high-throughput
computing and beyond
Grid Control with Automated Tasks, Policies, and Reporting
ʎ Guarantee that the most-critical work runs first with flexible
global policies that respect local cluster policies but continue
to support grid service-level agreements
ʎ Ensure availability of key resources at specific times with ad-
vanced reservations
ʎ Tune policies prior to rollout with cluster- and grid-level simu-
lations
ʎ Use a global view of all grid operations for self-diagnosis, plan-
ning, reporting, and accounting across all resources, jobs, and
clusters
Moab HPC Suite – Grid Option can be flexibly configured for centralized, centralized-plus-local, or peer-to-peer grid policies, decisions, and rules. It can manage multiple resource managers and multiple Moab instances.
Harnessing the Power of Intel Xeon Phi and GPGPUs
The mathematical acceleration delivered by many-core General
Purpose Graphics Processing Units (GPGPUs) offers significant
performance advantages for many classes of numerically in-
tensive applications. Parallel computing tools such as NVIDIA’s
CUDA and the OpenCL framework have made harnessing the
power of these technologies much more accessible to develop-
ers, resulting in the increasing deployment of hybrid GPGPU-
based systems and introducing significant challenges for ad-
ministrators and workload management systems.
The new Intel Xeon Phi coprocessors offer breakthrough perfor-
mance for highly parallel applications with the benefit of lever-
aging existing code and flexibility in programming models for
faster application development. Moving an application to Intel
Xeon Phi coprocessors has the promising benefit of requiring
substantially less effort and lower power usage. Such a small
investment can reap big performance gains for both data-inten-
sive and numerical or scientific computing applications that can
make the most of the Intel Many Integrated Cores (MIC) technol-
ogy. So it is no surprise that organizations are quickly embracing
these new coprocessors in their HPC systems.
The main goal is to create breakthrough discoveries, products
and research that improve our world. Harnessing and optimiz-
ing the power of these new accelerator technologies is key to
doing this as quickly as possible. Moab HPC Suite helps organi-
zations easily integrate accelerator and coprocessor technolo-
gies into their HPC systems, optimizing their utilization for the
right workloads and problems they are trying to solve.
Moab HPC Suite automatically schedules hybrid systems incor-
porating Intel Xeon Phi coprocessors and GPGPU accelerators,
optimizing their utilization as just another resource type with
policies. Organizations can choose the accelerators that best
meet their different workload needs.
Managing Hybrid Accelerator Systems
Hybrid accelerator systems add a new dimension of manage-
ment complexity when allocating workloads to available re-
sources. In addition to the traditional needs of aligning work-
load placement with software stack dependencies, CPU type,
memory, and interconnect requirements, intelligent workload
management systems now need to consider:
ʎ Workload’s ability to exploit Xeon Phi or GPGPU
technology
ʎ Additional software dependencies
ʎ Current health and usage status of available Xeon Phis
or GPGPUs
ʎ Resource configuration for type and number of Xeon Phis
or GPGPUs
Moab HPC Suite auto detects which types of accelerators are
where in the system to reduce the management effort and costs
as these processors are introduced and maintained together in
an HPC system. This gives customers maximum choice and per-
formance in selecting the accelerators that work best for each
of their workloads, whether Xeon Phi, GPGPU or a hybrid mix of
the two.
Moab HPC Suite optimizes accelerator utilization with policies that ensure the right ones get used for the right user and group jobs at the right time.
Optimizing Intel Xeon Phi and GPGPU Utilization
System and application diversity is one of the first issues
workload management software must address in optimizing
accelerator utilization. The ability of current codes to use GP-
GPUs and Xeon Phis effectively, ongoing cluster expansion,
and costs usually mean only a portion of a system will be
equipped with one or a mix of both types of accelerators.
Moab HPC Suite has a twofold role in reducing the complexity
of their administration and optimizing their utilization. First,
it must be able to automatically detect new GPGPUs or Intel
Xeon Phi coprocessors in the environment and their availabil-
ity without the cumbersome burden of manual administra-
tor configuration. Second, and most importantly, it must ac-
curately match Xeon Phi-bound or GPGPU-enabled workloads
with the appropriately equipped resources in addition to
managing contention for those limited resources according
to organizational policy. Moab’s powerful allocation and pri-
oritization policies can ensure the right jobs from users and
groups get to the optimal accelerator resources at the right
time. This keeps the accelerators at peak utilization for the
right priority jobs.
Moab’s policies are a powerful ally to administrators in auto-
matically determining the optimal Xeon Phi coprocessor or
GPGPU to use, and which ones to avoid, when scheduling jobs.
These allocation policies can be based on any of the GPGPU or
Xeon Phi metrics, such as memory (to ensure job needs are met),
temperature (to avoid hardware errors or failures), utilization, or
other metrics; the sketch after the metric lists below illustrates
the idea.
Use the following metrics in policies to optimize allocation of
Intel Xeon Phi coprocessors:
ʎ Number of Cores
ʎ Number of Hardware Threads
ʎ Physical Memory
ʎ Free Physical Memory
ʎ Swap Memory
ʎ Free Swap Memory
ʎ Max Frequency (in MHz)
Use the following metrics in policies to optimize allocation of
GPGPUs and for management diagnostics:
ʎ Error counts; single/double-bit, commands to reset counts
ʎ Temperature
ʎ Fan speed
ʎ Memory; total, used, utilization
ʎ Utilization percent
ʎ Metrics time stamp
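To illustrate how an allocation policy might consume these metrics (a sketch only: the metric names, thresholds, and device data are invented, and Moab states such policies in its own configuration language rather than in code), the following Python fragment filters out hot or memory-starved devices and picks the least-utilized one that remains:

```python
# Illustrative allocation policy over accelerator metrics -- not Moab configuration.
gpus = [
    {"id": "node01:gpu0", "temp_c": 91, "mem_free_mb": 4096, "util_pct": 20},
    {"id": "node01:gpu1", "temp_c": 70, "mem_free_mb": 1024, "util_pct": 10},
    {"id": "node02:gpu0", "temp_c": 65, "mem_free_mb": 5120, "util_pct": 35},
    {"id": "node02:gpu1", "temp_c": 60, "mem_free_mb": 6144, "util_pct": 55},
]

MAX_TEMP_C = 85           # avoid devices at risk of hardware errors or failures
job_mem_needed_mb = 2048  # ensure the job's memory needs are met

eligible = [g for g in gpus
            if g["temp_c"] < MAX_TEMP_C and g["mem_free_mb"] >= job_mem_needed_mb]

# Prefer the least-utilized eligible device to keep load spread evenly.
best = min(eligible, key=lambda g: g["util_pct"])
print(best["id"])  # -> node02:gpu0
```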
Improve GPGPU Job Speed and Success
The GPGPU drivers supplied by vendors today allow multiple
jobs to share a GPGPU without corrupting results, but sharing
GPGPUs and other key GPGPU factors can significantly impact
the speed and success of GPGPU jobs and the final level of ser-
vice delivered to the system users and the organization.
Consider the example of a user estimating the time for a GPGPU
job based on exclusive access to a GPGPU, but the workload man-
ager allowing the GPGPU to be shared when the job is scheduled.
The job will likely exceed its estimated time and be terminated
by the workload manager, leaving a very unsatisfied user and
disappointing the organization, group or project they represent.
Moab HPC Suite provides the ability to schedule GPGPU jobs to
run exclusively and finish in the shortest time possible for cer-
tain types, or classes of jobs or users to ensure job speed and
success.
Cluster Sovereignty and Trusted Sharing
ʎ Guarantee that shared resources are allocated fairly with
global policies that fully respect local cluster configuration
and needs
ʎ Establish trust between resource owners through graphical
usage controls, reports, and accounting across all shared
resources
ʎ Maintain cluster sovereignty with granular settings to con-
trol where jobs can originate and be processed
ʎ Establish resource ownership and enforce appropriate access
levels with prioritization, preemption, and access guarantees
Increase User Collaboration and Productivity
ʎ Reduce end-user training and job management time with
easy-to-use graphical interfaces
ʎ Enable end users to easily submit and manage their jobs
through an optional web portal, minimizing the costs of ca-
tering to a growing base of users
ʎ Collaborate more effectively with multicluster co-allocation,
allowing key resource reservations for high-priority projects
ʎ Leverage saved job templates, allowing users to submit mul-
tiple jobs quickly and with minimal changes
ʎ Speed job processing with enhanced grid placement options
for job arrays; optimal or single cluster placement

The Visual Cluster view in Moab Cluster Manager for Grids makes cluster and grid management easy and efficient.
Process More Work in Less Time to Maximize ROI
ʎ Achieve higher, more consistent resource utilization with in-
telligent scheduling, matching job requests to the best-suited
resources, including GPGPUs
ʎ Use optimized data staging to ensure that remote data trans-
fers are synchronized with resource availability to minimize
poor utilization
ʎ Allow local cluster-level optimizations of most grid workload
Unify Management Across Independent Clusters
ʎ Unify management across existing internal, external, and
partner clusters – even if they have different resource man-
agers, databases, operating systems, and hardware
ʎ Out-of-the-box support for both local area and wide area
grids
ʎ Manage secure access to resources with simple credential
mapping or interface with popular security tool sets
ʎ Leverage existing data-migration technologies, such as SCP
or GridFTP
Moab HPC Suite – Application Portal Edition
Remove the Barriers to Harnessing HPC
Organizations of all sizes and types are looking to harness the
potential of high performance computing (HPC) to enable and
accelerate more competitive product design and research. To do
this, you must find a way to extend the power of HPC resourc-
es to designers, researchers and engineers not familiar with or
trained on using the specialized HPC technologies. Simplifying
the user access to and interaction with applications running on
more efficient HPC clusters removes the barriers to discovery
and innovation for all types of departments and organizations.
Moab® HPC Suite – Application Portal Edition provides easy
single-point access to common technical applications, data, HPC
resources, and job submissions to accelerate the research analy-
sis and collaboration process for end users while reducing comput-
ing costs.
HPC Ease of Use Drives Productivity
Moab HPC Suite – Application Portal Edition improves ease of
use and productivity for end users using HPC resources with sin-
gle-point access to their common technical applications, data,
resources, and job submissions across the design and research
process. It simplifies the specification and running of jobs on
HPC resources to a few simple clicks on an application-specific
template, with key capabilities including:
ʎ Application-centric job submission templates for common
manufacturing, energy, life-sciences, education and other
industry technical applications
ʎ Support for batch and interactive applications, such as
simulations or analysis sessions, to accelerate the full proj-
ect or design cycle
ʎ No special HPC training required to enable more users to
start accessing HPC resources with intuitive portal
ʎ Distributed data management avoids unnecessary remote
file transfers for users, storing data optimally on the network
for easy access and protection, and optimizing file transfer
to/from user desktops if needed
Accelerate Collaboration While Maintaining Control
More and more projects are done across teams and with partner
and industry organizations. Moab HPC Suite – Application Portal
Edition enables you to accelerate this collaboration between indi-
viduals and teams while maintaining control as additional users,
inside and outside the organization, can easily and securely ac-
cess project applications, HPC resources and data to speed project
cycles. Key capabilities and benefits you can leverage are:
ʎ Encrypted and controllable access, by class of user, ser-
vices, application or resource, for remote and partner users
improves collaboration while protecting infrastructure and
intellectual property
ʎ Enable multi-site, globally available HPC services, accessible
anytime, anywhere, from many different types of devices via
a standard web browser
ʎ Reduced client requirements for projects mean new proj-
ect members can quickly contribute without being limited by
their desktop capabilities
Reduce Costs with Optimized Utilization and Sharing
Moab HPC Suite – Application Portal Edition reduces the costs of
technical computing by optimizing resource utilization and the
sharing of resources including HPC compute nodes, application
licenses, and network storage. The patented Moab intelligence
engine and its powerful policies deliver value in key areas of uti-
lization and sharing:
ʎ Maximize HPC resource utilization with intelligent, opti-
mized scheduling policies that pack and enable more work
to be done on HPC resources to meet growing and changing
demand
ʎ Optimize application license utilization and access by shar-
ing application licenses in a pool, re-using and optimizing us-
age with allocation policies that integrate with license man-
agers for lower costs and better service levels to a broader
set of users
ʎ Usage budgeting and priorities enforcement with usage
accounting and priority policies that ensure resources are
shared in-line with service level agreements and project pri-
orities
ʎ Leverage and fully utilize centralized storage capacity in-
stead of duplicative, expensive, and underutilized individual
workstation storage
Key Intelligent Workload Management Capabilities:
ʎ Massive multi-point scalability
ʎ Workload-optimized allocation policies and provisioning
ʎ Unify workload management across heterogeneous clusters
ʎ Optimized, intelligent scheduling
ʎ Optimized scheduling and management of accelerators
(both Intel MIC and GPGPUs)
ʎ Administrator dashboards and reporting tools
ʎ Workload-aware auto power management
ʎ Intelligent resource placement to prevent job failures
ʎ Auto-response to failures and events
ʎ Workload-aware future maintenance scheduling
ʎ Usage accounting and budget enforcement
ʎ SLA and priority policies
ʎ Continuous plus future scheduling
Moab HPC Suite – Remote Visualization Edition
Removing Inefficiencies in Current 3D Visualization Methods
Using 3D and 2D visualization, valuable data is interpreted and
new innovations are discovered. There are numerous inefficien-
cies in how current technical computing methods support users
and organizations in achieving these discoveries in a cost effec-
tive way. High-cost workstations and their software are often
not fully utilized by the limited set of users that have access to
them. They are also costly to manage, and they don’t keep pace
for long with workload technology requirements. In addition,
the workstation model requires slow transfers of the large data
sets between workstations. This is costly and inefficient in network band-
width, user productivity, and team collaboration while posing in-
creased data and security risks. There is a solution that removes
these inefficiencies using new technical cloud computing mod-
els. Moab HPC Suite – Remote Visualization Edition significantly
reduces the costs of visualization technical computing while
improving the productivity, collaboration and security of the de-
sign and research process.
Reduce the Costs of Visualization Technical Computing
Moab HPC Suite – Remote Visualization Edition significantly
reduces the hardware, network and management costs of vi-
sualization technical computing. It enables you to create easy,
centralized access to shared visualization compute, applica-
tion and data resources. These compute, application, and data
resources reside in a technical compute cloud in your data cen-
ter instead of in expensive and underutilized individual work-
stations.
ʎ Reduce hardware costs by consolidating expensive individual
technical workstations into centralized visualization servers
for higher utilization by multiple users; reduce additive or up-
grade technical workstation or specialty hardware purchases,
such as GPUs, for individual users.
ʎ Reduce management costs by moving from remote user work-
stations that are difficult and expensive to maintain, upgrade,
and back-up to centrally managed visualization servers that
require less admin overhead
ʎ Decrease network access costs and congestion as significantly
lower loads of just compressed, visualized pixels are moving to
users, not full data sets
ʎ Reduce energy usage, as centralized visualization servers con-
sume less energy for the user demand met
ʎ Reduce data storage costs by consolidating data into common
storage node(s) instead of under-utilized individual storage
The Application Portal Edition easily extends the power of HPC to your users and projects to drive innovation and competitive advantage.
Consolidation and Grid
Many sites have multiple clusters as a result of having multiple
independent groups or locations, each with demands for HPC,
and frequent additions of newer machines. Each new cluster
increases the overall administrative burden and overhead. Addi-
tionally, many of these systems can sit idle while others are over-
loaded. Because of this systems-management challenge, sites
turn to grids to maximize the efficiency of their clusters. Grids
can be enabled either independently or in conjunction with one
another in three areas:
ʎ Reporting Grids Managers want to have global reporting
across all HPC resources so they can see how users and proj-
ects are really utilizing hardware and so they can effectively
plan capacity. Unfortunately, manually consolidating all of
this information in an intelligible manner for more than just
a couple clusters is a management nightmare. To solve that
problem, sites will create a reporting grid, or share informa-
tion across their clusters for reporting and capacity-planning
purposes.
ʎ Management Grids Managing multiple clusters indepen-
dently can be especially difficult when processes change,
because policies must be manually reconfigured across all
clusters. To ease that difficulty, sites often set up manage-
ment grids that impose a synchronized management layer
across all clusters while still allowing each cluster some level
of autonomy.
ʎ Workload-Sharing Grids Sites with multiple clusters often
have the problem of some clusters sitting idle while other
clusters have large backlogs. Such inequality in cluster utili-
zation wastes expensive resources, and training users to per-
form different workload-submission routines across various
clusters can be difficult and expensive as well. To avoid these
problems, sites often set up workload-sharing grids. These
grids can be as simple as centralizing user submission or as
complex as having each cluster maintain its own user sub-
mission routine with an underlying grid-management tool
that migrates jobs between clusters (see the sketch below).
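At its simplest, the underlying routing decision looks like the following sketch (deliberately naive, with invented cluster names and load figures; a real grid metascheduler such as Moab also weighs policies, credentials, and data locality):

```python
# Naive workload-sharing sketch: route each job to the least-loaded cluster.
clusters = {"cae-cluster": 0.92, "physics-cluster": 0.35, "bio-cluster": 0.60}
# value = fraction of cores currently busy (invented figures)

def route(job_name):
    target = min(clusters, key=clusters.get)  # pick the cluster with most idle cores
    print(f"{job_name} -> {target}")
    return target

route("ls-dyna-run-17")   # -> physics-cluster
```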
Inhibitors to Grid Environments
Three common inhibitors keep sites from enabling grid environ-
ments:
ʎ Politics Because grids combine resources across users,
groups, and projects that were previously independent, grid
implementation can be a political nightmare. To create a
grid in the real world, sites need a tool that allows clusters
to retain some level of sovereignty while participating in the
larger grid.
ʎ Multiple Resource Managers Most sites have a variety of
resource managers used by various groups within the orga-
nization, and each group typically has a large investment in
scripts that are specific to one resource manager and that
cannot be changed. To implement grid computing effective-
ly, sites need a robust tool that has flexibility in integrating
with multiple resource managers.
ʎ Credentials Many sites have different log-in credentials for
each cluster, and those credentials are generally indepen-
dent of one another. For example, one user might be Joe.P on
one cluster and J_Peterson on another. To enable grid envi-
ronments, sites must create a combined user space that can
recognize and combine these different credentials (see the
sketch below).
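Conceptually, credential mapping is a lookup from each cluster’s local account name to a single grid-wide identity. The sketch below (the mapping-table format and function are invented for illustration, not any tool’s actual API) reuses the user names from the example above:

```python
# Illustrative credential map: local account names -> one grid-wide identity.
CREDENTIAL_MAP = {
    ("cluster-a", "Joe.P"): "jpeterson",
    ("cluster-b", "J_Peterson"): "jpeterson",
    ("cluster-a", "Sue.M"): "smiller",
}

def grid_user(cluster, local_name):
    try:
        return CREDENTIAL_MAP[(cluster, local_name)]
    except KeyError:
        raise LookupError(f"no grid identity for {local_name} on {cluster}")

# Both local accounts resolve to the same user for accounting and reporting:
assert grid_user("cluster-a", "Joe.P") == grid_user("cluster-b", "J_Peterson")
```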
Using Moab in a Grid
Sites can use Moab to set up any combination of reporting,
management, and workload-sharing grids. Moab is a grid metas-
cheduler that allows sites to set up grids that work effectively in
the real world. Moab’s feature-rich functionality overcomes the
inhibitors of politics, multiple resource managers, and varying
credentials by providing:
ʎ Grid Sovereignty Moab has multiple features that break
down political barriers by letting sites choose how each clus-
ter shares in the grid. Sites can control what information is
shared between clusters and can specify which workload is
passed between clusters. In fact, sites can even choose to let
each cluster be completely sovereign in making decisions
about grid participation for itself.
ʎ Support for Multiple Resource Managers Moab meta-
schedules across all common resource managers. It fully
integrates with TORQUE and SLURM, the two most-common
open-source resource managers, and also has limited in-
tegration with commercial tools such as PBS Pro and SGE.
Moab’s integration includes the ability to recognize when a
user has a script that requires one of these tools, and Moab
can intelligently ensure that the script is sent to the correct
machine. Moab even has the ability to translate common
scripts across multiple resource managers.
ʎ Credential Mapping Moab can map credentials across clus-
ters to ensure that users and projects are tracked appropri-
ately and to provide consolidated reporting.
Improve Collaboration, Productivity and Security
With Moab HPC Suite – Remote Visualization Edition, you can im-
prove the productivity, collaboration and security of the design
and research process by only transferring pixels instead of data
to users to do their simulations and analysis. This enables a wid-
er range of users to collaborate and be more productive at any
time, on the same data, from anywhere without any data trans-
fer time lags or security issues. Users also have improved imme-
diate access to specialty applications and resources, like GPUs,
they might need for a project so they are no longer limited by
personal workstation constraints or to a single working location.
ʎ Improve workforce collaboration by enabling multiple
users to access and collaborate on the same interactive ap-
plication data at the same time from almost any location or
device, with little or no training needed
ʎ Eliminate data transfer time lags for users by keeping data
close to the visualization applications and compute resourc-
es instead of transferring back and forth to remote worksta-
tions; only smaller compressed pixels are transferred
ʎ Improve security with centralized data storage so data no
longer gets transferred to where it shouldn’t, gets lost, or
gets inappropriately accessed on remote workstations, only
pixels get moved
Maximize Utilization and Shared Resource Guarantees
Moab HPC Suite – Remote Visualization Edition maximizes your
resource utilization and scalability while guaranteeing shared
resource access to users and groups with optimized workload
management policies. Moab’s patented policy intelligence en-
gine optimizes the scheduling of the visualization sessions
across the shared resources to maximize standard and specialty
resource utilization, helping them scale to support larger vol-
umes of visualization workloads. These intelligent policies also
guarantee shared resource access to users to ensure a high ser-
vice level as they transition from individual workstations.
ʎ Guarantee shared visualization resource access for users
with priority policies and usage budgets that ensure they
receive the appropriate service level. Policies include ses-
sion priorities and reservations, number of users per node,
fairshare usage policies, and usage accounting and budgets
across multiple groups or users.
ʎ Maximize resource utilization and scalability by packing
visualization workload optimally on shared servers using
Moab allocation and scheduling policies. These policies can
include optimal resource allocation by visualization session
characteristics (like CPU, memory, etc.), workload packing,
number of users per node, and GPU policies, etc.
ʎ Improve application license utilization to reduce software
costs and improve access with a common pool of 3D visual-
ization applications shared across multiple users, with usage
optimized by intelligent Moab allocation policies, integrated
with license managers.
ʎ Optimized scheduling and management of GPUs and other
accelerators to maximize their utilization and effectiveness
ʎ Enable multiple Windows or Linux user sessions on a single
machine managed by Moab to maximize resource and power
utilization.
ʎ Enable dynamic OS re-provisioning on resources, managed
by Moab, to optimally meet changing visualization applica-
tion workload demand and maximize resource utilization
and availability to users.
The Remote Visualization Edition makes visualization more cost effective and productive with an innovative technical compute cloud.
transtec HPC solutions are designed for maximum flexibility,
and ease of management. We not only offer our customers the
most powerful and flexible cluster management solution out
there, but also provide them with customized setup and site-
specific configuration. Whether a customer needs a dynamic
Linux-Windows dual-boot solution, unified management of dif-
ferent clusters at different sites, or the fine-tuning of the Moab
scheduler for implementing fine-grained policy configuration
– transtec not only gives you the framework at hand, but also
helps you adapt the system according to your special needs.
Needless to say, when customers are in need of special train-
ing, transtec will be there to provide customers, administra-
tors, or users with specially adjusted Educational Services.
Many years of experience in High Performance Computing
have enabled us to develop efficient concepts for installing and
deploying HPC clusters. For this, we leverage well-proven 3rd-
party tools, which we assemble into a total solution and adapt
and configure according to the customer’s requirements.
We manage everything from the complete installation and
configuration of the operating system and necessary middleware
components like job and cluster management systems up to
the customer’s applications.
Moab HPC Suite Use Case | Basic | Enterprise | App. Portal | Remote Visualization

Productivity Acceleration
Massive multifaceted scalability | x | x | x | x
Workload-optimized resource allocation | x | x | x | x
Optimized, intelligent scheduling of workload and resources | x | x | x | x
Optimized accelerator scheduling & management (Intel MIC & GPGPU) | x | x | x | x
Simplified user job submission & mgmt. (Application Portal) | x | x | x | x
Administrator dashboard and management reporting | x | x | x | x
Unified workload management across heterogeneous clusters | x (single cluster only) | x | x | x
Workload-optimized node provisioning | – | x | x | x
Visualization workload optimization | – | – | – | x
Workload-aware auto-power management | – | x | x | x

Uptime Automation
Intelligent resource placement for failure prevention | x | x | x | x
Workload-aware maintenance scheduling | x | x | x | x
Basic deployment, training and premium (24x7) support service | standard support only | x | x | x
Auto response to failures and events | x (limited, no re-boot/provision) | x | x | x

Auto SLA Enforcement
Resource sharing and usage policies | x | x | x | x
SLA and priority policies | x | x | x | x
Continuous and future reservations scheduling | x | x | x | x
Usage accounting and budget enforcement | – | x | x | x

Grid- and Cloud-Ready
Manage and share workload across multiple clusters in wide-area grid | x (w/Grid Option) | x (w/Grid Option) | x (w/Grid Option) | x (w/Grid Option)
User self-service: job request & management portal | x | x | x | x
Pay-for-use showback and chargeback | – | x | x | x
Workload-optimized node provisioning & re-purposing | – | x | x | x
Multiple clusters in single domain | | | |
InfiniBand – High-Speed Interconnects
InfiniBand (IB) is an efficient I/O technology that provides high-
speed data transfers and ultra-low latencies for computing and
storage over a highly reliable and scalable single fabric. The
InfiniBand industry standard ecosystem creates cost effective
hardware and software solutions that easily scale from genera-
tion to generation.
InfiniBand is a high-bandwidth, low-latency network interconnect
solution that has grown tremendous market share in the High
Performance Computing (HPC) cluster community. InfiniBand was
designed to take the place of today’s data center networking tech-
nology. In the late 1990s, a group of next generation I/O architects
formed an open, community driven network technology to provide
scalability and stability based on successes from other network
designs. Today, InfiniBand is a popular and widely used I/O fabric
among customers within the Top500 supercomputers: major Uni-
versities and Labs; Life Sciences; Biomedical; Oil and Gas (Seismic,
Reservoir, Modeling Applications); Computer Aided Design and En-
gineering; Enterprise Oracle; and Financial Applications.
NICE EnginFrame – A Technical Computing Portal

HPC systems and enterprise grids deliver unprecedented time-to-market and performance advantages to many research and corporate customers, struggling every day with compute- and data-intensive processes. These processes often generate or transform massive amounts of jobs and data, which need to be handled and archived efficiently to deliver timely information to users distributed across multiple locations, with different security concerns.

Poor usability of such complex systems often negatively impacts users’ productivity, and ad-hoc data management often increases information entropy and dissipates knowledge and intellectual property.

NICE EnginFrame
A Technical Portal for Remote Visualization
Having solved distributed computing issues for many customers,
we understand that a modern, user-friendly web front-end
to HPC and grids can drastically improve engineering productiv-
ity, if properly designed to address the specific challenges of the
Technical Computing market.
Nice EnginFrame overcomes many common issues in the areas
of usability, data management, security and integration, open-
ing the way to a broader, more effective use of the Technical
Computing resources.
The key components of our solutions are:
ʎ a flexible and modular Java-based kernel, with clear separati-
on between customizations and core services
ʎ powerful data management features, reflecting the typical
needs of engineering applications
ʎ comprehensive security options and a fine grained authoriza-
tion system
ʎ scheduler abstraction layer to adapt to different workload
and resource managers (sketched below)
ʎ responsive and competent support services
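The idea behind a scheduler abstraction layer can be sketched as a small interface that each plug-in implements (the class and method names here are invented for illustration; EnginFrame’s actual plug-ins are server-side XML services, not this Python API):

```python
# Illustrative scheduler abstraction -- invented interface, not EnginFrame's API.
from abc import ABC, abstractmethod

class SchedulerAdapter(ABC):
    """Uniform portal-facing interface; one subclass per workload manager."""

    @abstractmethod
    def submit(self, script_path: str) -> str:
        """Submit a job script, return a job ID."""

    @abstractmethod
    def status(self, job_id: str) -> str:
        """Return a normalized state: QUEUED, RUNNING, DONE, FAILED."""

class TorqueAdapter(SchedulerAdapter):
    def submit(self, script_path):
        # A real adapter would invoke the resource manager here; stubbed out.
        return "1234.torque-server"

    def status(self, job_id):
        # A real adapter would query the resource manager here; stubbed out.
        return "QUEUED"

# The portal code only sees SchedulerAdapter, so supporting another
# workload manager means adding a subclass, not changing the portal.
```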
End users can typically enjoy the following improvements:
ʎ user-friendly, intuitive access to computing resources, using a
standard web browser
ʎ application-centric job submission
ʎ organized access to job information and data
ʎ increased mobility and reduced client requirements
On the other hand, the Technical Computing Portal delivers sig-
nificant added value for system administrators and IT:
ʎ reduced training needs to enable users to access the resour-
ces
ʎ centralized configuration and immediate deployment of ser-
vices
ʎ comprehensive authorization to access services and informa-
tion
ʎ reduced support calls and submission errors
Coupled with our Remote Visualization solutions, our custom-
ers quickly deploy end-to-end engineering processes on their
Intranet, Extranet or Internet.
EnginFrame Highlights
ʎ Universal and flexible access to your Grid infrastructure: En-
ginFrame provides easy access over intranet, extranet or the
Internet using standard and secure Internet languages and
protocols.
ʎ Interface to the Grid in your organization: EnginFrame offers
pluggable server-side XML services to enable easy integration
of the leading workload schedulers on the market. Plug-in
modules exist for all major grid solutions, including Platform
Computing LSF and OCS, Microsoft Windows HPC Server
2008, Sun GridEngine, UnivaUD UniCluster, IBM LoadLeveler,
Altair PBS Pro and OpenPBS, Torque, Condor, and EDG/gLite
and Globus-based toolkits, and provides easy interfaces to the
existing computing infrastructure.
ʎ Security and Access Control You can give encrypted and con-
trollable access to remote users and improve collaboration
with partners while protecting your infrastructure and Intel-
lectual Property (IP). This enables you to speed up design cycles
while working with related departments or external partners
and eases communication through a secure infrastructure.
ʎ Distributed Data Management The comprehensive remote
file management built into EnginFrame avoids unnecessary
file transfers, and enables server-side treatment of data, as
well as file transfer from/to the user’s desktop.
ʎ Interactive application support Through the integration
with leading GUI virtualization solutions (including Desktop
Cloud Visualization (DCV), VNC, VirtualGL and NX), EnginFrame
enables you to address all stages of the design process, both
batch and interactive.
ʎ SOA-enable your applications and resources EnginFrame
can automatically publish your computing applications via
standard WebServices (tested both for Java and .NET intero-
perability), enabling immediate integration with other enter-
prise applications.
Business Benefits
Web and Cloud Portal
ʎ Simplify user access; minimize
the training requirements for
HPC users; broaden the user
community
ʎ Enable access from intranet,
internet or extranet
NICE EnginFrame – Application Highlights

Aerospace, Automotive and Manufacturing
The complex factors involved in CAE range from compute-in-
tensive data analysis to worldwide collaboration between de-
signers, engineers, OEMs and suppliers. To accommodate these
factors, Cloud (both internal and external) and Grid Computing
solutions are increasingly seen as a logical step to optimize IT
infrastructure usage for CAE. It is no surprise then, that automo-
tive and aerospace manufacturers were the early adopters of
internal Cloud and Grid portals.
Manufacturers can now develop more “virtual products” and
simulate all types of designs, fluid flows, and crash simula-
tions. Such virtualized products and more streamlined collab-
oration environments are revolutionizing the manufacturing
process.
With NICE EnginFrame in their CAE environment engineers can
take the process even further by connecting design and simula-
tion groups in “collaborative environments” to get even greater
benefits from “virtual products”. Thanks to EnginFrame, CAE en-
gineers can have a simple, intuitive collaborative environment
that takes care of issues related to:
ʎ Access & Security - Where an organization must give access
to external and internal entities such as designers, engineers
and suppliers.
ʎ Distributed collaboration - Simple and secure connection of
design and simulation groups distributed worldwide.
ʎ Time spent on IT tasks - By eliminating time and resources
spent using cryptic job submission commands or acquiring
knowledge of underlying compute infrastructures, engi-
neers can spend more time concentrating on their core de-
sign tasks.
EnginFrame’s web based interface can be used to access the
compute resources required for CAE processes. This means ac-
cess to job submission & monitoring tasks, input & output data
associated with industry standard CAE/CFD applications for Flu-
id-Dynamics, Structural Analysis, Electro Design and Design Col-
laboration, (like Abaqus, Ansys, Fluent, MSC Nastran, PAMCrash,
LS-Dyna, Radioss) without cryptic job submission commands or
knowledge of underlying compute infrastructures.
EnginFrame has a long history of usage in some of the most
prestigious manufacturing organizations worldwide including
Aerospace companies like AIRBUS, Alenia Space, CIRA, Galileo Avi-
onica, Hamilton Sunstrand, Magellan Aerospace, MTU and Auto-
motive companies like Audi, ARRK, Brawn GP, Bridgestone, Bosch,
Delphi, Elasis, Ferrari, FIAT, GDX Automotive, Jaguar-LandRover,
Lear, Magneti Marelli, McLaren, P+Z, RedBull, Swagelok, Suzuki,
Toyota, TRW.
Life Sciences and Healthcare
NICE solutions are deployed in the Life Sciences sector at com-
panies like BioLab, Partners HealthCare, Pharsight and the
M.D.Anderson Cancer Center; also in leading research projects like
DEISA or LitBio in order to allow easy and transparent use of com-
puting resources without any insight of the HPC infrastructure.
The Life Science and Healthcare sectors have some very strict re-
quirements when choosing an IT solution like EnginFrame, for in-
stance:
ʎ Security - To meet the strict security & privacy requirements
of the biomedical and pharmaceutical industry any solution
needs to take account of multiple layers of security and au-
thentication.
ʎ Industry specific software - ranging from the simplest cus-
tom tool to the more general purpose free and open middle-
wares.
EnginFrame’s modular architecture allows different Grid mid-
dlewares and software, including leading Life Science applications
(like Schroedinger Glide, EPFL RAxML, the BLAST family, Taverna,
and R), to be exploited. Users can compose elementary services
into complex applications and “virtual experiments”, and run,
monitor and build workflows via a standard Web browser. EnginFrame also has
highly tuneable resource sharing and fine-grained access control
where multiple authentication systems (like Active Directory, Krb5
or LDAP) can be exploited simultaneously.
Web and Cloud Portal
ʎ “One Stop Shop” - the portal as
the starting point for knowledge
and resources
ʎ Enable collaboration and sharing
through the portal as a virtual
meeting point (employees, part-
ners, and affiliates)
ʎ Single Sign-on
ʎ Integrations for many technical
computing ISV and open-source
applications and infrastructure
components
Business Continuity
ʎ Virtualizes servers and server
locations, data and storage loca-
tions, applications and licenses
ʎ Move to thin desktops for graphi-
cal workstation applications by
moving graphics processing into
the datacenter (also can increase
utilization and re-use of licenses)
ʎ Enable multi-site and global loca-
tions and services
ʎ Data maintained on network sto-
rage - not on laptop/desktop
ʎ Cloud and ASP – create, or take
advantage of, Cloud and ASP
(Application Service Provider)
services to extend your business,
grow new revenue, manage costs.
“The amount of data resulting from e.g. simulations
in CAE or other engineering environments can be in
the Gigabyte range. It is obvious that remote post-
processing is one of the most urgent topics to be tack-
led. NICE EnginFrame provides exactly that, and our
customers are impressed that such great technology
enhances their workflow so significantly.”
Robin Kienecker HPC Sales Specialist
Compliance
ʎ Ensure secure and auditable
access and use of resources (ap-
plications, data, infrastructure,
licenses)
ʎ Enables IT to provide better upti-
me and service-levels to users
ʎ Allow collaboration with partners
while protecting Intellectual
Property and resources
ʎ Restrict access by class of user,
service, application, and resource
Web Services
ʎ Wrap legacy applications
(command line, local API/SDK)
with a web services SOAP/.NET/
Java interface to access in your
SOA environment
ʎ Workflow your human and auto-
mated business processes and
supply chain
ʎ Integrate with corporate policy
for Enterprise Portal or Business
Process Management (works with
SharePoint®, WebLogic®, WebS-
phere®, etc).
Desktop Cloud Virtualization
Nice Desktop Cloud Visualization (DCV) is an advanced technol-
ogy that enables Technical Computing users to remotely access
2D/3D interactive applications over a standard network.
Engineers and scientists are immediately empowered by taking
full advantage of high-end graphics cards, fast I/O performance
and large memory nodes hosted in a “Public or Private 3D Cloud”,
rather than waiting for the next upgrade of their workstations.
The DCV protocol adapts to heterogeneous networking infrastruc-
tures like LAN, WAN and VPN, to deal with bandwidth and latency
constraints. All applications run natively on the remote machines,
which can be virtualized and share the same physical GPU.
In a typical visualization scenario, a software application sends a
stream of graphics commands to a graphics adapter through an
input/output (I/O) interface. The graphics adapter renders the data
into pixels and outputs them to the local display as a video signal.
When using Nice DCV, the scene geometry and graphics state are
rendered on a central server, and pixels are sent to one or more
remote displays.
This approach requires the server to be equipped with one or
more GPUs, which are used for the OpenGL rendering, while the
client software can run on “thin” devices.
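The essential data flow, render server-side and ship only compressed pixels, can be mimicked in a few lines of Python (illustrative only: DCV’s actual codecs, protocol, and adaptive quality logic are its own; the frame here is a synthetic stand-in for GPU output):

```python
# Illustrative "send pixels, not data" step -- not the DCV protocol.
import zlib

WIDTH, HEIGHT, BYTES_PER_PIXEL = 1920, 1080, 3

def render_frame(scene_step: int) -> bytes:
    # Stand-in for server-side GPU rendering: a flat-colored framebuffer.
    return bytes([scene_step % 256]) * (WIDTH * HEIGHT * BYTES_PER_PIXEL)

frame = render_frame(scene_step=1)
payload = zlib.compress(frame)  # compress before it leaves the server
print(f"framebuffer: {len(frame)/1e6:.1f} MB, on the wire: {len(payload)/1e3:.1f} kB")
# The (potentially huge) model data set never crosses the network at all;
# only the compressed pixel payload travels to the thin client.
```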
Nice DCV architecture consists of:
ʎ DCV Server, equipped with one or more GPUs, used for OpenGL
rendering
ʎ one or more DCV EndStations, running on “thin clients”, only
used for visualization
ʎ heterogeneous networking infrastructures (like LAN, WAN
and VPN), optimized to balance quality vs. frame rate
Nice DCV Highlights
ʎ enables high performance remote access to interactive 2D/3D
software applications on low bandwidth/high latency
ʎ supports multiple heterogeneous OSes (Windows, Linux)
ʎ enables GPU sharing
ʎ supports 3D acceleration for OpenGL applications running on
Virtual Machines
Remote Visualization
It’s human nature to want to ‘see’ the results from simula-
tions, tests, and analyses. Up until recently, this has meant
‘fat’ workstations on many user desktops. This approach pro-
vides CPU power when the user wants it – but as dataset size
increases, there can be delays in downloading the results.
Also, sharing the results with colleagues means gathering
around the workstation - not always possible in this global-
ized, collaborative workplace.
Increasing dataset complexity (millions of polygons, interact-
ing components, MRI/PET overlays) means that as time comes
to upgrade and replace the workstations, the next generation
of hardware needs more memory, more graphics processing,
more disk, and more CPU cores. This makes the workstation
expensive, in need of cooling, and noisy.
Innovation in the field of remote 3D processing now allows
companies to address these issues moving applications away
from the Desktop into the data center. Instead of pushing
data to the application, the application can be moved near
the data.
Instead of mass workstation upgrades, Remote Visualization
allows incremental provisioning, on-demand allocation, bet-
ter management and efficient distribution of interactive ses-
sions and licenses. Racked workstations or blades typically
have lower maintenance, cooling, replacement costs, and
they can extend workstation (or laptop) life as “thin clients”.
The Solution
Leveraging their expertise in distributed computing and Web-
based application portals, NICE offers an integrated solution to
access, load balance and manage applications and desktop ses-
sions running within a visualization farm. The farm can include
both Linux and Windows resources, running on heterogeneous
hardware.
The core of the solution is the EnginFrame Visualization plug-in,
which delivers Web-based services to access and manage applica-
tions and desktops published in the Farm. This solution has been
integrated with:
ʎ NICE Desktop Cloud Visualization (DCV)
ʎ HP Remote Graphics Software (RGS)
ʎ RealVNC
ʎ TurboVNC and VirtualGL
ʎ NoMachine NX
Coupled with these third party remote visualization engines
(which specialize in delivering high frame-rates for 3D graphics),
the NICE offering for Remote Visualization solves the issues of
user authentication, dynamic session allocation, session man-
agement and data transfers.
End users can enjoy the following improvements:
ʎ Intuitive, application-centric Web interface to start, control
and re-connect to a session
ʎ Single sign-on for batch and interactive applications
ʎ All data transfers from and to the remote visualization farm
are handled by EnginFrame
ʎ Built-in collaboration, to share sessions with other users
ʎ The load and usage of the visualization cluster is monitored
in the browser
The solution also delivers significant added-value for the sys-
tem administrators:
ʎ No need for SSH/SCP/FTP on the client machine
ʎ Easy integration into identity services, Single Sign-On (SSO),
Enterprise portals
ʎ Automated data life cycle management
ʎ Built-in user session sharing, to facilitate support
ʎ Interactive sessions are load balanced by a scheduler (LSF,
SGE or Torque) to achieve optimal performance and resource
usage
ʎ Better control and use of application licenses
ʎ Monitor, control and manage users’ idle sessions
transtec has long-term experience in Engineering environ-
ments, especially in the CAD/CAE sector. This allows us to
provide customers from this area with solutions that greatly en-
hance their workflow and minimize time-to-result.
This, together with transtec’s offerings of all kinds of services, al-
lows our customers to fully focus on their productive work, and
have us do the environmental optimizations.
ʎ Supports multiple user collaboration via session sharing
ʎ Enables attractive Return-on-Investment through resource
sharing and consolidation to data centers (GPU, memory, CPU,
...)
ʎ Keeps the data secure in the data center, reducing data load
and saving time
ʎ Enables right sizing of system allocation based on user’s dy-
namic needs
ʎ Facilitates application deployment: all applications, updates
and patches are instantly available to everyone, without any
changes to original code
Business Benefits
The business benefits of adopting Nice DCV can be summarized
into four categories:
Productivity
ʎ Increase business efficiency
ʎ Improve team performance by
ensuring real-time collaboration
with colleagues and partners,
anywhere.
ʎ Reduce IT management costs
by consolidating workstation
resources to a single point-of-
management
ʎ Save money and time on applica-
tion deployment
ʎ Let users work from anywhere
there is an Internet connection
Business Continuity
ʎ Move graphics processing and
data to the datacenter - not on
laptop/desktop
ʎ Cloud-based platform support
enables you to scale the visua-
lization solution “on-demand”
to extend business, grow new
revenue, and manage costs.
Data Security
ʎ Guarantee secure and auditable
use of remote resources (appli-
cations, data, infrastructure,
licenses)
ʎ Allow real-time collaboration
with partners while protecting In-
tellectual Property and resources
ʎ Restrict access by class of user,
service, application, and resource
Training Effectiveness
ʎ Enable multiple users to follow
application procedures alongside
an instructor in real-time
ʎ Enable collaboration and session
sharing among remote users (em-
ployees, partners, and affiliates)
Nice DCV is perfectly integrated into EnginFrame Views, lever-
aging 2D/3D capabilities over the Web, including the ability to
share an interactive session with other users for collaborative
working.
Windows:
ʎ Microsoft Windows 7 - 32/64 bit
ʎ Microsoft Windows Vista - 32/64 bit
ʎ Microsoft Windows XP - 32/64 bit

Linux:
ʎ RedHat Enterprise 4, 5, 5.5, 6 - 32/64 bit
ʎ SUSE Enterprise Server 11 - 32/64 bit
© 2012 by NICE
Intel Cluster Ready – A Quality Standard for HPC Clusters
Intel Cluster Ready is designed to create predictable expectations for users and providers of HPC clusters, primarily targeting customers in the commercial and industrial sectors. These are not the experimental “test-bed” clusters used for computer science and computer engineering research, nor the high-end “capability” clusters, closely targeted at specific computing requirements, that power high-energy physics at the national labs and other specialized research organizations.

Intel Cluster Ready seeks to advance HPC clusters used as computing resources in production environments by providing cluster owners with a high degree of confidence that the clusters they deploy will run the applications their scientific and engineering staff rely upon to do their jobs. It achieves this by providing cluster hardware, software, and system providers with a precisely defined basis for their products to meet their customers’ production cluster requirements.
Intel Cluster Ready
A Quality Standard for HPC Clusters
What are the Objectives of ICR?
The primary objective of Intel Cluster Ready is to make clusters
easier to specify, easier to buy, easier to deploy, and make it easi-
er to develop applications that run on them. A key feature of ICR
is the concept of “application mobility”, which is defined as the
ability of a registered Intel Cluster Ready application – more cor-
rectly, the same binary – to run correctly on any certified Intel
Cluster Ready cluster. Clearly, application mobility is important
for users, software providers, hardware providers, and system
providers.
ʎ Users want to know the cluster they choose will reliably run
the applications they rely on today, and will rely on tomorrow
ʎ Application providers want to satisfy the needs of their cus-
tomers by providing applications that reliably run on their
customers’ cluster hardware and cluster stacks
ʎ Cluster stack providers want to satisfy the needs of their cus-
tomers by providing a cluster stack that supports their custo-
mers’ applications and cluster hardware
ʎ Hardware providers want to satisfy the needs of their custo-
mers by providing hardware components that supports their
customers’ applications and cluster stacks
ʎ System providers want to satisfy the needs of their customers
by providing complete cluster implementations that reliably
run their customers’ applications
Without application mobility, each group above must either try
to support all combinations, which they have neither the time
nor resources to do, or pick the “winning combination(s)” that
best supports their needs, and risk making the wrong choice.
The Intel Cluster Ready definition of application portability sup-
ports all of these needs by going beyond pure portability (re-
compiling and linking a unique binary for each platform) to ap-
plication binary mobility (running the same binary on multiple
platforms), by more precisely defining the target system.
“The Intel Cluster Checker allows us to certify that
our transtec HPC clusters are compliant with an
independent high quality standard. Our customers
can rest assured: their applications run as they ex-
pect.”
Marcus Wiedemann, HPC Solution Engineer
A further aspect of application mobility is to ensure that regis-
tered Intel Cluster Ready applications do not need special pro-
gramming or alternate binaries for different message fabrics.
Intel Cluster Ready accomplishes this by providing an MPI imple-
mentation supporting multiple fabrics at runtime; through this,
registered Intel Cluster Ready applications obey the “message
layer independence property”. Stepping back, the unifying con-
cept of Intel Cluster Ready is “one-to-many,” that is
ʎ One application will run on many clusters
ʎ One cluster will run many applications
How is one-to-many accomplished? Looking at Figure 1, you see
the abstract Intel Cluster Ready “stack” components that always
exist in every cluster, i.e., one or more applications, a cluster soft-
ware stack, one or more fabrics, and finally the underlying clus-
ter hardware. The remainder of that picture (to the right) shows
the components in greater detail.
Applications, on the top of the stack, rely upon the various APIs,
utilities, and file system structure presented by the underly-
ing software stack. Registered Intel Cluster Ready applications
are always able to rely upon the APIs, utilities, and file system
structure specified by the Intel Cluster Ready Specification; if an
application requires software outside this “required” set, then
Intel Cluster Ready requires the application to provide that soft-
ware as a part of its installation. To ensure that this additional
per-application software doesn’t conflict with the cluster stack
or other applications, Intel Cluster Ready also requires the addi-
tional software to be installed in application-private trees, so the
application knows how to find that software while not interfer-
ing with other applications. While this may well cause duplicate
software to be installed, the reliability provided by the duplica-
tion far outweighs the cost of the duplicated files. A prime ex-
ample supporting this comparison is the removal of a common
file (library, utility, or other) that is unknowingly needed by some
other application – such errors can be insidious to repair even
when they cause an outright application failure.
[Figure 1: The ICR stack – registered applications (CFD, crash,
climate, QCD, bio, ...) run on a single solution platform: the
Intel MPI Library and Intel MKL Cluster Edition runtimes, Intel
development tools (C++, Intel Trace Analyzer and Collector, MKL,
etc.), Intel-selected Linux cluster tools, the fabrics (Gigabit
Ethernet, 10 Gbit Ethernet, InfiniBand/OFED), and the Intel Xeon
processor platform, with optional value-add by the individual
platform integrator (Intel, OEMs, platform integrators).]
Cluster platforms, at the bottom of the stack, provide the APIs,
utilities, and file system structure relied upon by registered ap-
plications. Certified Intel Cluster Ready platforms ensure the
APIs, utilities, and file system structure are complete per the Intel
Cluster Ready Specification; certified clusters are able to provide
them by various means as they deem appropriate. Because of
the clearly defined responsibilities ensuring the presence of all
software required by registered applications, system providers
have a high confidence that the certified clusters they build are
able to run any registered applications their customers rely on. In
addition to meeting the Intel Cluster Ready requirements, certi-
fied clusters can also provide their added value, that is, other fea-
tures and capabilities that increase the value of their products.
How Does Intel Cluster Ready Accomplish its Objectives?
At its heart, Intel Cluster Ready is a definition of the cluster as a
parallel application platform, as well as a tool to certify an actual
cluster to the definition. Let’s look at each of these in more de-
tail, to understand their motivations and benefits.
A definition of the cluster as parallel application platform
The Intel Cluster Ready Specification is very much written as
the requirements for, not the implementation of, a platform
upon which parallel applications, more specifically MPI appli-
cations, can be built and run. As such, the specification doesn’t
care whether the cluster is diskful or diskless, fully distributed or
single system image (SSI), built from “Enterprise” distributions or
community distributions, fully open source or not. Perhaps more
importantly, with one exception, the specification doesn’t have
any requirements on how the cluster is built; that one exception
is that compute nodes must be built with automated tools, so
that new, repaired, or replaced nodes can be rebuilt identically
to the existing nodes without any manual interaction, other
than possibly initiating the build process.
Some items the specification does care about include:
ʎ The ability to run both 32- and 64-bit applications, including
MPI applications and X-clients, on any of the compute nodes
ʎ Consistency among the compute nodes’ configuration, capa-
bility, and performance
ʎ The identical accessibility of libraries and tools across
the cluster
ʎ The identical access by each compute node to permanent and
temporary storage, as well as users’ data
ʎ The identical access to each compute node from the
head node
ʎ The MPI implementation provides fabric independence
ʎ All nodes support network booting and provide a
remotely accessible console
The specification also requires that the runtimes for specific
Intel software products are installed on every certified cluster:
ʎ Intel Math Kernel Library
ʎ Intel MPI Library Runtime Environment
ʎ Intel Threading Building Blocks
This requirement does two things. First and foremost, main-
line Linux distributions do not necessarily provide a sufficient
software stack to build an HPC cluster – such specialization is
beyond their mission. Secondly, the requirement ensures that
programs built with this software will always work on certified
clusters and enjoy simpler installations. As these runtimes are
directly available from the web, the requirement does not cause
additional costs to certified clusters. It is also very important
to note that this does not require certified applications to use
these libraries nor does it preclude alternate libraries, e.g., other
MPI implementations, from being present on certified clusters.
Quite clearly, an application that requires, e.g., an alternate MPI,
must also provide the runtimes for that MPI as a part of its instal-
lation.
A tool to certify an actual cluster to the definition
The Intel Cluster Checker, included with every certified Intel Clus-
ter Ready implementation, is used in four modes in the life of a
cluster:
ʎ To certify a system provider’s prototype cluster as a valid im-
plementation of the specification
ʎ To verify to the owner that the just-delivered cluster is a “true
copy” of the certified prototype
ʎ To ensure the cluster remains fully functional, reducing ser-
vice calls not related to the applications or the hardware
ʎ To help software and system providers diagnose and correct
actual problems to their code or their hardware.
While these are critical capabilities, in all fairness, this greatly
understates the capabilities of Intel Cluster Checker. The tool
verifies not only that the cluster is configured correctly, but
also that it is performing as expected. To do this, it runs
per-node and cluster-wide static and dynamic tests of both the
hardware and the software.
[Figure: Intel Cluster Checker architecture – the Cluster Checker
engine reads a cluster definition and configuration XML file,
drives test modules and parallel operation checks across all nodes
via its config and result APIs, and writes pass/fail results and
diagnostics to STDOUT and a logfile.]
The static checks ensure the systems are configured consistent-
ly and appropriately. As one example, the tool will ensure the
systems are all running the same BIOS versions as well as hav-
ing identical configurations among key BIOS settings. This type
of problem – differing BIOS versions or settings – can be the root
cause of subtle problems such as differing memory configura-
tions that manifest themselves as differing memory bandwidths
only to be seen at the application level as slower than expected
overall performance. As is well-known, parallel program perfor-
mance can be very much governed by the performance of the
slowest components, not the fastest. In another static check, the
Intel Cluster Checker will ensure that the expected tools, librar-
ies, and files are present on each node, identically located on all
nodes, as well as identically implemented on all nodes. This en-
sures that each node has the minimal software stack specified
by the specification, as well as identical software stack among
the compute nodes.
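The idea behind such static checks can be sketched in a few lines.
The following Python fragment is not the Intel Cluster Checker
itself, only an illustrative sketch: it probes each node for a fact
such as the BIOS version or a library checksum and requires all
answers to be identical. The node names and probed paths are
hypothetical.

import subprocess
from collections import defaultdict

NODES = ["node01", "node02", "node03", "node04"]  # hypothetical node names

def gather(node, command):
    # Run a probe command on a node via ssh and return its trimmed output.
    out = subprocess.run(["ssh", node, command],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def check_consistency(label, command):
    # Every node must report the identical value for the probed fact.
    values = defaultdict(list)
    for node in NODES:
        values[gather(node, command)].append(node)
    if len(values) == 1:
        print(f"PASS {label}: {next(iter(values))}")
        return True
    print(f"FAIL {label}: inconsistent values {dict(values)}")
    return False

if __name__ == "__main__":
    ok = check_consistency("BIOS version", "dmidecode -s bios-version")
    # A shared library must be identically implemented on every node.
    ok &= check_consistency("libmpi checksum",
                            "md5sum /usr/lib64/libmpi.so | cut -d' ' -f1")
    raise SystemExit(0 if ok else 1)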
A typical dynamic check ensures consistent system perfor-
mance, e.g., via the STREAM benchmark. This particular test en-
sures processor and memory performance is consistent across
compute nodes, which, like the BIOS setting example above, can
be the root cause of overall slower application performance. An
additional check with STREAM can be made if the user config-
ures an expectation of benchmark performance; this check will
ensure that performance is not only consistent across the clus-
ter, but also meets expectations. Going beyond processor per-
formance, the Intel MPI Benchmarks are used to ensure the net-
work fabric(s) are performing properly and, with a configuration
that describes expected performance levels, up to the cluster
provider’s performance expectations. Network inconsistencies
due to poorly performing Ethernet NICs, InfiniBand HBAs, faulty
switches, and loose or faulty cables can be identified. Finally, the
Intel Cluster Checker is extensible, enabling additional tests to
be added supporting additional features and capabilities. This
enables the Intel Cluster Checker to not only support the minimal
requirements of the Intel Cluster Ready Specification, but also the
full cluster as delivered to the customer.
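A dynamic consistency check of the STREAM kind can be illustrated
in the same spirit. The Python sketch below is, again, not Intel's
tool, and the measured numbers are made up; it flags nodes whose
STREAM triad bandwidth deviates from the cluster median or falls
below a configured expectation.

from statistics import median

TOLERANCE = 0.05        # 5% allowed deviation from the cluster median
EXPECTED_GBS = 12.0     # hypothetical configured expectation; None to skip

stream_triad = {        # hypothetical measured triad bandwidths in GB/s
    "node01": 12.41, "node02": 12.38,
    "node03": 12.44, "node04": 9.87,   # one slow node drags jobs down
}

med = median(stream_triad.values())
for node, bw in sorted(stream_triad.items()):
    if abs(bw - med) / med > TOLERANCE:
        print(f"FAIL {node}: {bw:.2f} GB/s deviates from median {med:.2f}")
    elif EXPECTED_GBS is not None and bw < EXPECTED_GBS:
        print(f"FAIL {node}: {bw:.2f} GB/s below expected {EXPECTED_GBS:.2f}")
    else:
        print(f"PASS {node}: {bw:.2f} GB/s")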
Conforming hardware and software
The preceding was primarily related to the builders of certified
clusters and the developers of registered applications. For end
users who want to purchase a certified cluster to run registered
applications, the ability to identify registered applications and
certified clusters is most important, as it reduces their effort to
evaluate, acquire, and deploy the clusters that run their
applications, and then keep that computing resource operating
properly, with full performance, directly increasing their
productivity.
Intel, Xeon and certain other trademarks and logos appearing in this
brochure are trademarks or registered trademarks of Intel Corporation.
[Figure: The Intel Cluster Ready program – the specification at the
center, surrounded by reference designs, software tools, ISV
enabling, demand creation, and support & training, all built on a
certified cluster platform of Intel processor, server platform,
interconnect, software, and configuration.]
Intel Cluster Ready Builds HPC Momentum
With the Intel Cluster Ready (ICR) program, Intel Corporation set
out to create a win-win scenario for the major constituencies
in the high-performance computing (HPC) cluster market. Hard-
ware vendors and independent software vendors (ISVs) stand
to win by being able to ensure both buyers and users that their
products will work well together straight out of the box. System
administrators stand to win by being able to meet corporate de-
mands to push HPC competitive advantages deeper into their
organizations while satisfying end users’ demands for reliable
HPC cycles, all without increasing IT staff. End users stand to win
by being able to get their work done faster, with less downtime,
on certified cluster platforms. Last but not least, with ICR, Intel
has positioned itself to win by expanding the total addressable
market (TAM) and reducing time to market for the company’s mi-
croprocessors, chip sets, and platforms.
The Worst of Times
For a number of years, clusters were largely confined to govern-
ment and academic sites, where contingents of graduate stu-
dents and midlevel employees were available to help program
and maintain the unwieldy early systems. Commercial firms
lacked this low-cost labor supply and mistrusted the favored
cluster operating system, open source Linux, on the grounds
that no single party could be held accountable if something
went wrong with it. Today, cluster penetration in the HPC mar-
ket is deep and wide, extending from systems with a handful of
processors to some of the world’s largest supercomputers, and
from under $25,000 to tens or hundreds of millions of dollars in
price. Clusters increasingly pervade every HPC vertical market:
biosciences, computer-aided engineering, chemical engineering,
digital content creation, economic/financial services, electronic
design automation, geosciences/geo-engineering, mechanical design,
defense, government labs, academia, and weather/climate.
But IDC studies have consistently shown that clusters remain
difficult to specify, deploy, and manage, especially for new and
less experienced HPC users. This should come as no surprise,
given that a cluster is a set of independent computers linked to-
gether by software and networking technologies from multiple
vendors.
Clusters originated as do-it-yourself HPC systems. In the late
1990s users began employing inexpensive hardware to cobble
together scientific computing systems based on the “Beowulf
cluster” concept first developed by Thomas Sterling and Don-
ald Becker at NASA. From their Beowulf origins, clusters have
evolved and matured substantially, but the system manage-
ment issues that plagued their early years remain in force to-
day.
The Need for Standard Cluster Solutions
The escalating complexity of HPC clusters poses a dilemma for
many large IT departments that cannot afford to scale up their
HPC-knowledgeable staff to meet the fast-growing end-user de-
mand for technical computing resources. Cluster management
is even more problematic for smaller organizations and business
units that often have no dedicated, HPC-knowledgeable staff to
begin with.
The ICR program aims to address burgeoning cluster complex-
ity by making available a standard solution (aka reference archi-
tecture) for Intel-based systems that hardware vendors can use
to certify their configurations and that ISVs and other software
vendors can use to test and register their applications, system
software, and HPC management software. The chief goal of this
voluntary compliance program is to ensure fundamental hard-
ware-software integration and interoperability so that system
administrators and end users can confidently purchase and de-
ploy HPC clusters, and get their work done, even in cases where
no HPC-knowledgeable staff are available to help.
The ICR program wants to prevent end users from having to
become, in effect, their own systems integrators. In smaller
organizations, the ICR program is designed to allow over-
worked IT departments with limited or no HPC expertise to
support HPC user requirements more readily. For larger or-
ganizations with dedicated HPC staff, ICR creates confidence
that required user applications will work, eases the problem
of system administration, and allows HPC cluster systems to
be scaled up in size without scaling support staff. ICR can help
drive HPC cluster resources deeper into larger organizations
and free up IT staff to focus on mainstream enterprise appli-
cations (e.g., payroll, sales, HR, and CRM).
The program is a three-way collaboration among hardware ven-
dors, software vendors, and Intel. In this triple alliance, Intel pro-
vides the specification for the cluster architecture implementa-
tion, and then vendors certify the hardware configurations and
register software applications as compliant with the specifica-
tion. The ICR program’s promise to system administrators and
end users is that registered applications will run out of the box
on certified hardware configurations.
ICR solutions are compliant with the standard platform architec-
ture, which starts with 64-bit Intel Xeon processors in an Intel-
certified cluster hardware platform. Layered on top of this foun-
dation are the interconnect fabric (Gigabit Ethernet, InfiniBand)
and the software stack: Intel-selected Linux cluster tools, an
Intel MPI runtime library, and the Intel Math Kernel Library. In-
tel runtime components are available and verified as part of
the certification (e.g., Intel tool runtimes) but are not required
to be used by applications. The inclusion of these Intel runtime
components does not exclude any other components a systems
vendor or ISV might want to use. At the top of the stack are Intel-
registered ISV applications.
At the heart of the program is the Intel Cluster Checker, a vali-
dation tool that verifies that a cluster is specification compliant
and operational before ISV applications are ever loaded. After
the cluster is up and running, the Cluster Checker can function
as a fault isolation tool in wellness mode. Certification needs
to happen only once for each distinct hardware platform, while
verification – which determines whether a valid copy of the
specification is operating – can be performed by the Cluster
Checker at any time.
Cluster Checker is an evolving tool that is designed to accept
new test modules. It is a productized tool that ICR members ship
with their systems. Cluster Checker originally was designed for
homogeneous clusters but can now also be applied to clusters
with specialized nodes, such as all-storage sub-clusters. Cluster
Checker can isolate a wide range of problems, including network
or communication problems.
transtec offers their customers a new and fascinating way to
evaluate transtec’s HPC solutions in real world scenarios. With
the transtec Benchmarking Center, solutions can be explored in
detail with the actual applications the customers will later run on
them. Intel Cluster Ready makes this feasible by simplifying system
maintenance and allowing clean systems to be set up easily, and as
often as needed. As High-Performance Computing (HPC)
systems are utilized for numerical simulations, more and more
advanced clustering technologies are being deployed. Because
of its performance, price/performance and energy efficiency ad-
vantages, clusters now dominate all segments of the HPC market
and continue to gain acceptance. HPC computer systems have be-
come far more widespread and pervasive in government, industry,
and academia. However, rarely does the client have the possibility
to test their actual application on the system they are planning
to acquire.
The transtec Benchmarking Center
transtec HPC solutions get used by a wide variety of clients.
Among those are most of the large users of compute power at
German and other European universities and research centers
as well as governmental users like the German army’s compute
center and clients from the high tech, the automotive and other
sectors. transtec HPC solutions have demonstrated their value in
more than 500 installations. Most of transtec’s cluster systems are
based on SUSE Linux Enterprise Server, Red Hat Enterprise Linux,
CentOS, or Scientific Linux. With xCAT for efficient cluster deploy-
ment, and Moab HPC Suite by Adaptive Computing for high-level
cluster management, transtec is able to efficiently deploy and
ship easy-to-use HPC cluster solutions with enterprise-class
management features. Moab has proven to provide easy-to-use
workload and job management for small systems as well as the
largest cluster installations worldwide. However, when selling
clusters to governmental customers as well as other large en-
terprises, it is often required that the client can choose from a
range of competing offers. Many times there is a fixed budget
available, and competing solutions are compared based on their
performance on certain custom benchmark codes.
So, in 2007 transtec decided to add another layer to their already
wide array of competence in HPC – ranging from cluster deploy-
ment and management, the latest CPU, board and network tech-
nology to HPC storage systems. The systems are assembled in
transtec’s HPC Lab, where transtec uses Intel Cluster Ready to
facilitate testing, verification, and documentation throughout the
actual build process. At the benchmarking center
transtec can now offer a set of small clusters with the “newest
and hottest technology” through Intel Cluster Ready. A standard
installation infrastructure gives transtec a quick and easy way to
set systems up according to their customers’ choice of operating
system, compilers, workload management suite, and so on. With
Intel Cluster Ready there are prepared standard set-ups available
with verified performance on standard benchmarks, while system
stability is guaranteed by our own test suite and the Intel
Cluster Checker.
The Intel Cluster Ready program is designed to provide a com-
mon standard for HPC clusters, helping organizations design
and build seamless, compatible and consistent cluster configu-
rations. Integrating the standards and tools provided by this pro-
gram can help significantly simplify the deployment and man-
agement of HPC clusters.
Parallel NFS – The New Standard for HPC Storage
HPC computation results in the terabyte range are not uncommon. The problem in this context is not so much storing the data at rest, but the performance of the necessary copying back and forth in the course of the computation job flow and the dependent job turn-around time.
For interim results during a job runtime or for fast storage of input and results data, parallel file systems have established themselves as the standard to meet the ever-increasing performance requirements of HPC storage systems. Parallel NFS is about to become the new standard framework for a parallel file system.
Parallel NFS
The New Standard for HPC Storage
Yesterday’s Solution: NFS for HPC Storage
The original Network File System (NFS) developed by Sun Micro-
systems at the end of the eighties – now available in version 4.1
– has been established for a long time as a de-facto standard for
the provisioning of a global namespace in networked computing.
A very widespread HPC cluster solution includes a central master
node acting simultaneously as an NFS server, with its local file
system storing input, interim and results data and exporting
them to all other cluster nodes.
There is of course an immediate bottleneck in this method: When
the load of the network is high, or where there are large num-
bers of nodes, the NFS server can no longer keep up delivering
or receiving the data. In high-performance computing especially,
the nodes are interconnected via at least Gigabit Ethernet,
so the aggregate throughput is well above what an NFS server
with a Gigabit interface can achieve. Even a powerful network
connection of the NFS server to the cluster, for example with
10-Gigabit Ethernet, is only a temporary solution to this prob-
lem until the next cluster upgrade. The fundamental problem re-
mains – this solution is not scalable; in addition, NFS is a difficult
protocol to cluster in terms of load balancing: either you have
to ensure that multiple NFS servers accessing the same data are
constantly synchronised, the disadvantage being a noticeable drop
in performance, or you manually partition the global namespace,
which is also time-consuming. NFS is not suitable for dynamic load
balancing as, on paper, it appears to be stateless but is, in
fact, stateful.
[Figure: A classical NFS server is a bottleneck – many cluster
nodes acting as NFS clients funnel through a single NAS head acting
as the NFS server.]
Today’s Solution: Parallel File Systems
For some time, powerful commercial products have been avail-
able to meet the high demands on an HPC storage system. The
open-source solutions FraunhoferFS (FhGFS) from the Fraun-
hofer Competence Center for High Performance Computing or
Lustre are widely used in the Linux HPC world, and also several
other free as well as commercial parallel file system solutions
exist.
What is new is that the time-honoured NFS is to be upgraded,
including a parallel version, into an Internet Standard with
the aim of interoperability between all operating systems. The
original problem statement for parallel NFS access was written
by Garth Gibson, a professor at Carnegie Mellon University and
founder and CTO of Panasas. Gibson was already renowned as
one of the authors of the original 1988 paper on RAID
architecture. The original statement
from Gibson and Panasas is clearly noticeable in the design of
pNFS. The powerful HPC file system developed by Gibson and
Panasas, ActiveScale PanFS, with object-based storage devices
functioning as central components, is basically the commercial
continuation of the “Network-Attached Secure Disk (NASD)”
project also developed by Garth Gibson at the Carnegie Mellon
University.
Parallel NFS
Parallel NFS (pNFS) is gradually emerging as the future standard
to meet requirements in the HPC environment. From the indus-
try’s as well as the user’s perspective, the benefits of utilising
standard solutions are indisputable: besides protecting end user
investment, standards also ensure a defined level of interoper-
ability without restricting the choice of products available. As
a result, less user and administrator training is required which
leads to simpler deployment and at the same time, a greater ac-
ceptance.
As part of the NFS 4.1 Internet Standard, pNFS will not only
adopt the semantics of NFS in terms of cache consistency or
security, it also represents an easy and flexible extension of
the NFS 4 protocol. pNFS is optional, in other words, NFS 4.1
implementations do not have to include pNFS as a feature. The
NFS 4.1 Internet Standard is today published as IETF
RFC 5661.
The pNFS protocol supports a separation of metadata and data:
a pNFS cluster comprises so-called storage devices which store
the data from the shared file system and a metadata server
(MDS), called Director Blade with Panasas – the actual NFS 4.1
server. The metadata server keeps track of which data is stored
on which storage devices and how to access the files, the so-
called layout. Besides these “striping parameters”, the MDS also
manages other metadata including access rights or similar,
which is usually stored in a file’s inode.
The layout types define which Storage Access Protocol is used
by the clients to access the storage devices. Up until now,
three potential storage access protocols have been defined
for pNFS: file, block and object-based layouts, the former being
described directly in RFC 5661, the latter two in RFC 5663 and
RFC 5664, respectively. Last but not least, a Control Protocol is also
used by the MDS and storage devices to synchronise status
data. This protocol is deliberately unspecified in the standard
to give manufacturers certain flexibility. The NFS 4.1 standard
does however specify certain conditions which a control pro-
tocol has to fulfil, for example, how to deal with the change/
modify time attributes of files.
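The division of labour between the metadata server and the storage
devices can be made concrete with a small model. The Python sketch
below is not the NFS 4.1 wire protocol; it merely mimics a client
that performs one layout request against the MDS and then reads
the stripe units directly, and in parallel, from the storage
devices. All classes and the simple round-robin striping are
simplifying assumptions.

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Layout:
    stripe_size: int        # bytes per stripe unit
    devices: List[str]      # round-robin order of storage devices
    file_size: int

class StorageDevice:
    """Stand-in for one storage device holding raw stripe units."""
    def __init__(self):
        self.units: Dict[int, bytes] = {}   # stripe number -> bytes

    def read(self, stripe_no):
        return self.units[stripe_no]

def pnfs_read(mds: Dict[str, Layout],
              osds: Dict[str, StorageDevice], path: str) -> bytes:
    """One metadata round trip, then parallel data I/O past the MDS."""
    layout = mds[path]                                    # ~ LAYOUTGET
    n_units = -(-layout.file_size // layout.stripe_size)  # ceiling division
    def fetch(i):
        # Stripe unit i lives on device i % len(devices), per the layout.
        return osds[layout.devices[i % len(layout.devices)]].read(i)
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(fetch, range(n_units)))

# Hypothetical setup: a 9-byte file striped over two devices.
osds = {"osd-a": StorageDevice(), "osd-b": StorageDevice()}
for i, chunk in enumerate([b"aa", b"bb", b"cc", b"dd", b"e"]):
    osds["osd-a" if i % 2 == 0 else "osd-b"].units[i] = chunk
mds = {"/data/run1.out": Layout(stripe_size=2, devices=["osd-a", "osd-b"],
                                file_size=9)}
assert pnfs_read(mds, osds, "/data/run1.out") == b"aabbccdde"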
What’s New in NFS 4.1?
NFS 4.1 is a minor update to NFS 4, and adds new features to it.
One of the optional features is parallel NFS (pNFS) but there is
other new functionality as well.
One of the technical enhancements is the use of sessions, a
persistent server object, dynamically created by the client. By
means of sessions, the state of an NFS connection can be stored,
no matter whether the connection is live or not. Sessions survive
temporary downtimes both of the client and the server.
Each session has a so-called fore channel, which is the connec-
tion from the client to the server for all RPC opera-
tions, and optionally a back channel for RPC callbacks from the
server that now can also be realized through firewall boundar-
ies. Sessions can be trunked to increase the bandwidth. Besides
session trunking there is also a client ID trunking for grouping
together several sessions to the same client ID.
By means of sessions, NFS can be seen as a really stateful proto-
col with a so-called “Exactly-Once Semantics (EOS)”. Until now, a
necessary but unspecified reply cache within the NFS server was
implemented to handle identical RPC operations that had been
sent several times. In reality, this statefulness is not very robust,
however, and sometimes leads to the well-known stale NFS han-
dles. In NFS 4.1, the reply cache is now a mandatory part of the
NFS implementation, storing the server replies to RPC requests
persistently on disk.
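The mechanics of Exactly-Once Semantics are easy to model. The
sketch below is an illustration, not an NFS implementation: each
session slot carries a sequence number, and a retransmitted request
is answered from the reply cache instead of being executed a second
time. The error name NFS4ERR_SEQ_MISORDERED is from the NFS 4.1
standard; everything else is invented for the example.

class SessionReplyCache:
    """Per-session slot table: (last sequence id, cached reply) per slot."""
    def __init__(self, n_slots):
        self.slots = {s: (0, None) for s in range(n_slots)}

    def handle(self, slot, seq, request, execute):
        last_seq, cached = self.slots[slot]
        if seq == last_seq:          # retransmission: replay the old reply
            return cached
        if seq != last_seq + 1:      # out-of-order use of this slot
            raise ValueError("NFS4ERR_SEQ_MISORDERED")
        reply = execute(request)     # run the operation exactly once
        self.slots[slot] = (seq, reply)   # NFS 4.1: persisted on disk
        return reply

cache = SessionReplyCache(n_slots=8)
r1 = cache.handle(slot=0, seq=1, request="WRITE",
                  execute=lambda op: f"{op} done")
r2 = cache.handle(slot=0, seq=1, request="WRITE",
                  execute=lambda op: f"{op} done")
assert r1 == r2      # the duplicate was answered from the cache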
Another new feature of NFS 4.1 is delegation for directories: NFS
clients can be given temporary exclusive access to directories.
Previously, this was only possible for regular files. With the
forthcoming version 4.2 of the NFS standard, federated filesys-
tems will be added as a feature, which represents the NFS coun-
terpart of Microsoft’s DFS (distributed filesystem).
pNFS supports backwards compatibility with non-pNFS com-
patible NFS 4 clients. In this case, the MDS itself gathers data
from the storage devices on behalf of the NFS client and pres-
ents the data to the NFS client via NFS 4. The MDS acts as a
kind of proxy server – which is e.g. what the Director Blades
from Panasas do.
pNFS Layout Types
If storage devices act simply as NFS 4 file servers, the file
layout is used. It is the only storage access protocol directly
specified in the NFS 4.1 standard. Besides the stripe sizes and
stripe locations (storage devices), it also includes the NFS file
handles which the client needs to use to access the separate
file areas. The file layout is compact and static: the striping
information does not change even if changes are made to the
file, enabling multiple pNFS clients to simultaneously cache
the layout and avoid synchronisation overhead between clients
and the MDS, or between the MDS and the storage devices.
File system authorisation and client authentication can be
well implemented with the file layout. When using NFS 4 as
the storage access protocol, client authentication merely de-
pends on the security flavor used – when using the RPCSEC_GSS
security flavor, for example, client access is kerberized, and
the server controls access authorization using specified
ACLs and cryptographic processes.
In contrast, the block/volume layout uses volume identifiers
and block offsets and extents to specify a file layout. SCSI
block commands are used to access storage devices. As the
block distribution can change with each write access, the lay-
out must be updated more frequently than with the file lay-
out.
Block-based access to storage devices does not offer any se-
cure authentication option for the accessing SCSI initiator.
Secure SAN authorisation is possible with host granularity
only, based on World Wide Names (WWNs) with Fibre Channel
or Initiator Node Names (IQNs) with iSCSI. The server cannot
enforce access control governed by the file system. On the
contrary, a pNFS client basically abides by the access rights
voluntarily, and the storage device has to trust the pNFS client
– a fundamental access control problem that is a recurrent is-
sue in the NFS protocol history.
The object layout is syntactically similar to the file layout,
but it uses the SCSI object command set for data access to
so-called Object-based Storage Devices (OSDs) and is heavily
based on the DirectFLOW protocol of the ActiveScale PanFS
from Panasas. From the very start, Object-Based Storage De-
vices were designed for secure authentication and access. So-
called capabilities are used for object access which involves
the MDS issuing so-called capabilities to the pNFS clients. The
ownership of these capabilities represents the authoritative
access right to an object.
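Capability-based access of this kind can be sketched as a signed
claim. The Python example below is purely illustrative – the key
provisioning and wire formats of real OSDs differ – but it shows
the principle: the MDS issues a capability naming the object and
the granted rights, and the storage device verifies it before
serving a request.

import hashlib, hmac

SHARED_KEY = b"mds-osd-shared-secret"      # hypothetical provisioned key

def mds_issue_capability(object_id, rights):
    """MDS hands the client a capability naming object and rights."""
    claim = f"{object_id}:{rights}".encode()
    tag = hmac.new(SHARED_KEY, claim, hashlib.sha256).hexdigest()
    return claim, tag

def osd_check(claim, tag, object_id, op):
    """Storage device verifies the capability before serving a request."""
    expected = hmac.new(SHARED_KEY, claim, hashlib.sha256).hexdigest()
    obj, rights = claim.decode().split(":")
    return (hmac.compare_digest(tag, expected)
            and obj == object_id and op in rights)

cap, tag = mds_issue_capability("0x2a17", rights="read")
assert osd_check(cap, tag, "0x2a17", "read")          # honoured
assert not osd_check(cap, tag, "0x2a17", "write")     # right not granted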
pNFS can be upgraded to integrate other storage access pro-
tocols and operating systems, and storage manufacturers
also have the option to ship additional layout drivers for their
pNFS implementations.
[Figure: pNFS architecture – pNFS clients speak NFS 4.1 to the
metadata server and a Storage Access Protocol to the storage
devices, while the metadata server and the storage devices
synchronise via the Control Protocol.]
Parallel NFS
Panasas HPC Storage
Having many years of experience in deploying parallel file sys-
tems like Lustre or FraunhoferFS (FhGFS) from smaller scales up
to hundreds of terabytes capacity and throughputs of several
gigabytes per second, transtec chose Panasas, the leader in HPC
storage solutions, as the partner for providing highest perfor-
mance and scalability on the one hand, and ease of manage-
ment on the other. Therefore, with Panasas as the technology
leader, and transtec’s overall experience and customer-oriented
approach, customers can be assured to get the best possible HPC
storage solution available.
The Panasas file system uses parallel and redundant access to
object storage devices (OSDs), per-file RAID, distributed metada-
ta management, consistent client caching, file locking services,
and internal cluster management to provide a scalable, fault
tolerant, high performance distributed file system. The clustered
design of the storage system and the use of client-driven RAID
provide scalable performance to many concurrent file system
clients through parallel access to file data that is striped across
OSD storage nodes. RAID recovery is performed in parallel by the
cluster of metadata managers, and declustered data placement
yields scalable RAID rebuild rates as the storage system grows
larger.
Introduction
Storage systems for high performance computing environments
must be designed to scale in performance so that they can be
configured to match the required load. Clustering techniques
are often used to provide scalability. In a storage cluster, many
nodes each control some storage, and the overall distributed file
system assembles the cluster elements into one large, seamless
storage system. The storage cluster can be hosted on the same
computers that perform data processing, or they can be a sepa-
rate cluster that is devoted entirely to storage and accessible to
the compute cluster via a network protocol.
The Panasas storage system is a specialized storage cluster, and
this paper presents its design and a number of performance
measurements to illustrate the scalability. The Panasas system
is a production system that provides file service to some of the
largest compute clusters in the world, in scientific labs, in seis-
mic data processing, in digital animation studios, in computa-
tional fluid dynamics, in semiconductor manufacturing, and
in general purpose computing environments. In these environ-
ments, hundreds or thousands of file system clients share data
and generate very high aggregate I/O load on the file system. The
Panasas system is designed to support several thousand clients
and storage capacities in excess of a petabyte.
The unique aspects of the Panasas system are its use of per-file,
client-driven RAID, its parallel RAID rebuild, its treatment of dif-
ferent classes of metadata (block, file, system) and a commodity
parts based blade hardware with integrated UPS. Of course, the
system has many other features (such as object storage, fault tol-
erance, caching and cache consistency, and a simplified manage-
ment model) that are not unique, but are necessary for a scalable
system implementation.
Panasas File System Background
The two overall themes to the system are object storage, which
affects how the file system manages its data, and clustering of
components, which allows the system to scale in performance
and capacity.
Object Storage
An object is a container for data and attributes; it is analogous to the
inode inside a traditional UNIX file system implementation. Special-
ized storage nodes called Object Storage Devices (OSD) store objects
in a local OSDFS file system. The object interface addresses objects
in a two-level (partition ID/object ID) namespace. The OSD wire pro-
tocol provides byte-oriented access to the data, attribute manipula-
tion, creation and deletion of objects, and several other specialized
operations. Panasas uses an iSCSI transport to carry OSD commands
that are very similar to the OSDv2 standard currently in progress
within SNIA and ANSI-T10.
The Panasas file system is layered over the object storage. Each file
is striped over two or more objects to provide redundancy and high
bandwidth access. The file system semantics are implemented by
metadata managers that mediate access to objects from clients of
the file system. The clients access the object storage using the
iSCSI/OSD protocol for Read and Write operations. The I/O op-
erations proceed directly and in parallel to the storage nodes,
bypassing the metadata managers. The clients interact with the
out-of-band metadata managers via RPC to obtain access capa-
bilities and location information for the objects that store files.
Object attributes are used to store file-level attributes, and direc-
tories are implemented with objects that store name to object
ID mappings. Thus the file system metadata is kept in the object
store itself, rather than being kept in a separate database or
some other form of storage on the metadata nodes.
System Software Components
The major software subsystems are the OSDFS object storage
system, the Panasas file system metadata manager, the Panasas
file system client, the NFS/CIFS gateway, and the overall cluster
management system.
ʎ The Panasas client is an installable kernel module that runs inside
the Linux kernel. The kernel module implements the standard
VFS interface, so that the client hosts can mount the file system
and use a POSIX interface to the storage system.
[Figure: Panasas system components – the PanFS client on the
compute node accesses OSDFS on the storage nodes via iSCSI/OSD,
and communicates via RPC with the SysMgr, metadata manager, and
NFS/CIFS services on the manager nodes.]
ʎ Each storage cluster node runs a common platform that is based
on FreeBSD, with additional services to provide hardware moni-
toring, configuration management, and overall control.
ʎ The storage nodes use a specialized local file system (OSDFS)
that implements the object storage primitives. They imple-
ment an iSCSI target and the OSD command set. The OSDFS
object store and iSCSI target/OSD command processor are
kernel modules. OSDFS is concerned with traditional block-
level file system issues such as efficient disk arm utilization,
media management (i.e., error handling), high throughput, as
well as the OSD interface.
ʎ The cluster manager (SysMgr) maintains the global configuration,
and it controls the other services and nodes in the storage clus-
ter. There is an associated management application that provi-
des both a command line interface (CLI) and an HTML interface
(GUI). These are all user level applications that run on a subset
of the manager nodes. The cluster manager is concerned with
membership in the storage cluster, fault detection, configuration
management, and overall control for operations like software up-
grade and system restart.
ʎ The Panasas metadata manager (PanFS) implements the file
system semantics and manages data striping across the object
storage devices. This is a user level application that runs on every
manager node. The metadata manager is concerned with distri-
buted file system issues such as secure multi-user access, main-
taining consistent file- and object-level metadata, client cache
coherency, and recovery from client, storage node, and metadata
server crashes. Fault tolerance is based on a local transaction log
that is replicated to a backup on a different manager node.
ʎ The NFS and CIFS services provide access to the file system
for hosts that cannot use our Linux installable file system
client. The NFS service is a tuned version of the standard
FreeBSD NFS server that runs inside the kernel. The CIFS ser-
vice is based on Samba and runs at user level. In turn, these
services use a local instance of the file system client, which
runs inside the FreeBSD kernel. These gateway services run
on every manager node to provide a clustered NFS and CIFS
service.
Commodity Hardware Platform
The storage cluster nodes are implemented as blades that are
very compact computer systems made from commodity parts.
The blades are clustered together to provide a scalable platform.
The OSD StorageBlade module and metadata manager Director-
Blade module use the same form factor blade and fit into the
same chassis slots.
Storage Management
Traditional storage management tasks involve partitioning avail-
able storage space into LUNs (i.e., logical units that are one or more
disks, or a subset of a RAID array), assigning LUN ownership to dif-
ferent hosts, configuring RAID parameters, creating file systems or
databases on LUNs, and connecting clients to the correct server for
their storage. This can be a labor-intensive scenario. Panasas pro-
vides a simplified model for storage management that shields the
storage administrator from these kinds of details and allows a single,
part-time admin to manage systems that are hundreds of tera-
bytes in size.
The Panasas storage system presents itself as a file system with a
POSIX interface, and hides most of the complexities of storage man-
agement. Clients have a single mount point for the entire system.
The /etc/fstab file references the cluster manager, and from that the
client learns the location of the metadata service instances. The ad-
ministrator can add storage while the system is online, and new re-
sources are automatically discovered. To manage available storage,
Panasas introduces two basic storage concepts: a physical storage
pool called a BladeSet, and a logical quota tree called a Volume.
The BladeSet is a collection of StorageBlade modules in one or more
shelves that comprise a RAID fault domain. Panasas mitigates the
risk of large fault domains with the scalable rebuild performance
described below in the text. The BladeSet is a hard physical bound-
86
ary for the volumes it contains. A BladeSet can be grown at any time,
either by adding more StorageBlade modules, or by merging two ex-
isting BladeSets together.
The Volume is a directory hierarchy that has a quota constraint and
is assigned to a particular BladeSet. The quota can be changed at any
time, and capacity is not allocated to the Volume until it is used, so
multiple volumes compete for space within their BladeSet and grow
on demand. The files in those volumes are distributed among all the
StorageBlade modules in the BladeSet.
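The relationship between the two concepts can be captured in a toy
model. The Python sketch below illustrates the semantics described
above and is not Panasas code: volumes draw space on demand from a
shared BladeSet, and quotas may oversubscribe the physical pool.

from dataclasses import dataclass
from typing import List

@dataclass
class BladeSet:
    """Physical pool: StorageBlade modules forming one RAID fault domain."""
    blades_tb: List[float]
    used_tb: float = 0.0

    @property
    def capacity_tb(self):
        return sum(self.blades_tb)

@dataclass
class Volume:
    """Quota tree assigned to one BladeSet; space is allocated on use."""
    name: str
    bladeset: BladeSet
    quota_tb: float
    used_tb: float = 0.0

    def write(self, tb):
        if self.used_tb + tb > self.quota_tb:
            raise IOError(f"{self.name}: volume quota exceeded")
        if self.bladeset.used_tb + tb > self.bladeset.capacity_tb:
            raise IOError("BladeSet full")
        self.used_tb += tb
        self.bladeset.used_tb += tb

pool = BladeSet(blades_tb=[4.0] * 10)          # ten 4 TB blades, 40 TB pool
home = Volume("/home", pool, quota_tb=30.0)
scratch = Volume("/scratch", pool, quota_tb=30.0)  # quotas oversubscribe
home.write(20.0)
scratch.write(15.0)                             # volumes compete for space
print(f"pool: {pool.used_tb} of {pool.capacity_tb} TB used")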
Volumes appear in the file system name space as directories.
Clients have a single mount point for the whole storage sys-
tem, and volumes are simply directories below the mount point.
There is no need to update client mounts when the admin cre-
ates, deletes, or renames volumes.
Automatic Capacity Balancing
Capacity imbalance occurs when expanding a BladeSet (i.e., add-
ing new, empty storage nodes), merging two BladeSets, and re-
placing a storage node following a failure. In the latter scenario,
the imbalance is the result of the RAID rebuild, which uses spare
capacity on every storage node rather than dedicating a specific
“hot spare” node. This provides better throughput during rebuild,
but causes the system to have a new, empty storage node after
the failed storage node is replaced. The system automatically bal-
ances used capacity across storage nodes in a BladeSet using two
mechanisms: passive balancing and active balancing.
Passive balancing changes the probability that a storage node will
be used for a new component of a file, based on its available capac-
ity. This takes effect when files are created, and when their stripe size
is increased to include more storage nodes. Active balancing is done
by moving an existing component object from one storage node to
another, and updating the storage map for the affected file. During
the transfer, the file is transparently marked read-only by the stor-
age management layer, and the capacity balancer skips files that are
being actively written. Capacity balancing is thus transparent to file
system clients.
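Passive balancing, in particular, amounts to a capacity-weighted
random choice. The following sketch is illustrative only – node
names and capacities are invented – and shows how weighting
placement by free space makes an empty, newly replaced blade
attract most new component objects.

import random

free_gb = {"sb01": 120, "sb02": 95, "sb03": 840}   # sb03 was just replaced

def pick_storage_node(free):
    """Capacity-weighted draw: a node's chance is proportional to
    its free space."""
    nodes = list(free)
    return random.choices(nodes, weights=[free[n] for n in nodes], k=1)[0]

placements = [pick_storage_node(free_gb) for _ in range(10000)]
for node in sorted(free_gb):
    share = placements.count(node) / len(placements)
    print(f"{node}: {share:.0%} of new component objects")
# sb03 receives roughly 840/1055 ~ 80% of new components until it
# catches up with the rest of the BladeSet.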
©2013 Panasas Incorporated. All rights reserved. Panasas, the
Panasas logo, Accelerating Time to Results, ActiveScale, Direct-
FLOW, DirectorBlade, StorageBlade, PanFS, PanActive and MyPa-
nasas are trademarks or registered trademarks of Panasas, Inc.
in the United States and other countries. All other trademarks
are the property of their respective owners. Information sup-
plied by Panasas, Inc. is believed to be accurate and reliable at
the time of publication, but Panasas, Inc. assumes no responsi-
bility for any errors that may appear in this document. Panasas,
Inc. reserves the right, without notice, to make changes in prod-
uct design, specifications and prices. Information is subject to
change without notice.
Object RAID and Reconstruction
Panasas protects against loss of a data object or an entire storage
node by striping files across objects stored on different storage
nodes, using a fault-tolerant striping algorithm such as RAID-1 or
RAID-5. Small files are mirrored on two objects, and larger files are
striped more widely to provide higher bandwidth and less capacity
overhead from parity information. The per-file RAID layout means
that parity information for different files is not mixed together, and
easily allows different files to use different RAID schemes alongside
each other. This property and the security mechanisms of the OSD
protocol make it possible to enforce access control over files even
as clients access storage nodes directly. It also enables what is per-
haps the most novel aspect of our system, client-driven RAID. That
is, the clients are responsible for computing and writing parity. The
OSD security mechanism also allows multiple metadata managers
to manage objects on the same storage device without heavyweight
coordination or interference from each other.
Client-driven, per-file RAID has four advantages for large-scale stor-
age systems. First, by having clients compute parity for their own
data, the XOR power of the system scales up as the number of clients
increases. We measured XOR processing during streaming write
bandwidth loads at 7% of the client’s CPU, with the rest going to the
OSD/iSCSI/TCP/IP stack and other file system overhead. Moving XOR
computation out of the storage system into the client requires some
additional work to handle failures. Clients are responsible for gener-
ating good data and good parity for it. Because the RAID equation is
per-file, an errant client can only damage its own data. However, if a
client fails during a write, the metadata manager will scrub parity to
ensure the parity equation is correct.
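The parity arithmetic the clients perform is plain XOR. The minimal
sketch below is illustrative – real stripe units are much larger –
and shows a client XOR-ing the data units of one stripe into a
parity unit, and how any single lost unit can be recovered from the
survivors.

def xor_blocks(blocks):
    """XOR equal-sized byte blocks into a single block."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

stripe = [b"unit-A..", b"unit-B..", b"unit-C.."]   # data units of one stripe
parity = xor_blocks(stripe)                         # computed by the client

# Recovery: XOR of the surviving units and parity restores the lost unit.
recovered = xor_blocks([stripe[0], stripe[2], parity])
assert recovered == stripe[1]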
The second advantage of client-driven RAID is that clients can per-
form an end-to-end data integrity check. Data has to go through
the disk subsystem, through the network interface on the storage
nodes, through the network and routers, through the NIC on the
client, and all of these transits can introduce errors with a very low
probability. Clients can choose to read parity as well as data, and
verify parity as part of a read operation. If errors are detected, the
operation is retried. If the error is persistent, an alert is raised and the
read operation fails. By checking parity across storage nodes within
the client, the system can ensure end-to-end data integrity. This is
another novel property of per-file, client-driven RAID.
Third, per-file RAID protection lets the metadata managers rebuild
files in parallel. Although parallel rebuild is theoretically possible in
block-based RAID, it is rarely implemented. This is due to the fact that
the disks are owned by a single RAID controller, even in dual-ported
configurations. Large storage systems have multiple RAID control-
lers that are not interconnected. Since the SCSI Block command set
does not provide fine-grained synchronization operations, it is dif-
ficult for multiple RAID controllers to coordinate a complicated op-
eration such as an online rebuild without external communication.
Even if they could, without connectivity to the disks in the affected
parity group, other RAID controllers would be unable to assist. Even
in a high-availability configuration, each disk is typically only at-
tached to two different RAID controllers, which limits the potential
speedup to 2x.
When a StorageBlade module fails, the metadata managers that
own Volumes within that BladeSet determine what files are affect-
ed, and then they farm out file reconstruction work to every other
metadata manager in the system. Metadata managers rebuild their
own files first, but if they finish early or do not own any Volumes in
the affected BladeSet, they are free to aid other metadata managers.
Declustered parity groups spread out the I/O workload among all
StorageBlade modules in the BladeSet. The result is that larger stor-
age clusters reconstruct lost data more quickly.
The fourth advantage of per-file RAID is that unrecoverable faults
can be constrained to individual files. The most commonly en-
countered double-failure scenario with RAID-5 is an unrecover-
able read error (i.e., grown media defect) during the reconstruc-
tion of a failed storage device. The second storage device is still
healthy, but it has been unable to read a sector, which prevents
rebuild of the sector lost from the first drive and potentially the
entire stripe or LUN, depending on the design of the RAID con-
troller. With block-based RAID, it is difficult or impossible to di-
rectly map any lost sectors back to higher-level file system data
structures, so a full file system check and media scan will be re-
quired to locate and repair the damage. A more typical response
is to fail the rebuild entirely. RAID controllers monitor drives in
an effort to scrub out media defects and avoid this bad scenario,
and the Panasas system does media scrubbing, too. However,
with high capacity SATA drives, the chance of encountering a me-
dia defect on drive B while rebuilding drive A is still significant.
With per-file RAID-5, this sort of double failure means that only
a single file is lost, and the specific file can be easily identified
and reported to the administrator. While block-based RAID sys-
tems have been compelled to introduce RAID-6 (i.e., fault tolerant
schemes that handle two failures), the Panasas solution is able
to deploy highly reliable RAID-5 systems with large, high perfor-
mance storage pools.
RAID Rebuild Performance
RAID rebuild performance determines how quickly the system can
recover data when a storage node is lost. Short rebuild times reduce
the window in which a second failure can cause data loss. There are
three techniques to reduce rebuild times: reducing the size of the
RAID parity group, declustering the placement of parity group ele-
ments, and rebuilding files in parallel using multiple RAID engines.
[Figure: Declustered parity groups – parity groups of four elements
each, distributed across eight storage devices; the capital letters
mark the parity groups that share the second storage node.]
The rebuild bandwidth is the rate at which reconstructed data is
written to the system when a storage node is being reconstruct-
ed. The system must read N times as much as it writes, depend-
ing on the width of the RAID parity group, so the overall through-
put of the storage system is several times higher than the rebuild
rate. A narrower RAID parity group requires fewer read and XOR
operations to rebuild, so will result in a higher rebuild band-
width. However, it also results in higher capacity overhead for
parity data, and can limit bandwidth during normal I/O. Thus,
selection of the RAID parity group size is a trade-off between ca-
pacity overhead, on-line performance, and rebuild performance.
Understanding declustering is easier with a picture. In the figure
on the left, each parity group has 4 elements, which are indicat-
ed by letters placed in each storage device. They are distributed
among 8 storage devices. The ratio between the parity group size
and the available storage devices is the declustering ratio, which
in this example is ½. In the picture, capital letters represent those
parity groups that all share the second storage node. If the second
storage device were to fail, the system would have to read the sur-
viving members of its parity groups to rebuild the lost elements.
You can see that the other elements of those parity groups occupy
about ½ of each other storage device.
For this simple example you can assume each parity element is the
same size so all the devices are filled equally. In a real system, the
component objects will have various sizes depending on the overall
file size, although each member of a parity group will be very close
in size. There will be thousands or millions of objects on each de-
vice, and the Panasas system uses active balancing to move compo-
nent objects between storage nodes to level capacity.
Declustering means that rebuild requires reading a subset
of each device, with the proportion being approximately the
same as the declustering ratio. The total amount of data read
is the same with and without declustering, but with decluster-
ing it is spread out over more devices. When writing the re-
constructed elements, two elements of the same parity group
cannot be located on the same storage node. Declustering
leaves many storage devices available for the reconstructed
parity element, and randomizing the placement of each file’s
parity group lets the system spread out the write I/O over all
the storage. Thus declustering RAID parity groups has the im-
portant property of taking a fixed amount of rebuild I/O and
spreading it out over more storage devices.
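The effect of declustering can be checked with a small simulation.
The sketch below places random 4-element parity groups across 8
devices, as in the figure, and measures what fraction of each
surviving device must be read to rebuild one failed device. It is a
rough Monte-Carlo illustration, not Panasas' placement algorithm.

import random
from collections import Counter

DEVICES, GROUP = 8, 4          # as in the figure: groups of 4 on 8 devices
random.seed(1)

groups = [random.sample(range(DEVICES), GROUP) for _ in range(100_000)]
failed = 1                     # rebuild the second storage device

reads = Counter()              # elements each survivor serves for the rebuild
for g in (g for g in groups if failed in g):
    for dev in g:
        if dev != failed:
            reads[dev] += 1

per_device = len(groups) * GROUP / DEVICES     # elements stored per device
for dev in sorted(reads):
    print(f"device {dev}: reads {reads[dev] / per_device:.2f} of its data")
# Each survivor serves about (GROUP-1)/(DEVICES-1) ~ 0.43 of its data,
# so the rebuild I/O is spread over all devices instead of a few.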
Having per-file RAID allows the Panasas system to divide the
work among the available DirectorBlade modules by assign-
ing different files to different DirectorBlade modules. This di-
vision is dynamic with a simple master/worker model in which
metadata services make themselves available as workers, and
each metadata service acts as the master for the volumes it
implements. By doing rebuilds in parallel on all DirectorBlade
modules, the system can apply more XOR throughput and uti-
lize the additional I/O bandwidth obtained with declustering.
Metadata Management
There are several kinds of metadata in the Panasas system. These
include the mapping from object IDs to sets of block addresses,
mapping files to sets of objects, file system attributes such as
ACLs and owners, file system namespace information (i.e., direc-
tories), and configuration/management information about the
storage cluster itself.
Block-level Metadata
Block-level metadata is managed internally by OSDFS, the file
system that is optimized to store objects. OSDFS uses a floating
block allocation scheme where data, block pointers, and object
descriptors are batched into large write operations. The write
buffer is protected by the integrated UPS, and it is flushed to disk
on power failure or system panics. Fragmentation was an issue
in early versions of OSDFS that used a first-fit block allocator, but
this has been significantly mitigated in later versions that use a
modified best-fit allocator.
OSDFS stores higher level file system data structures, such as the
partition and object tables, in a modified BTree data structure.
Block mapping for each object uses a traditional direct/indirect/
double-indirect scheme. Free blocks are tracked by a proprietary
bitmap-like data structure that is optimized for copy-on-write
reference counting, part of OSDFS’s integrated support for ob-
ject- and partition-level copy-on-write snapshots.
Block-level metadata management consumes most of the cycles
in file system implementations. By delegating storage manage-
ment to OSDFS, the Panasas metadata managers have an order
of magnitude less work to do than the equivalent SAN file system
metadata manager that must track all the blocks in the system.
File-level Metadata
Above the block layer is the metadata about files. This includes
user-visible information such as the owner, size, and modifica-
tion time, as well as internal information that identifies which
objects store the file and how the data is striped across those
objects (i.e., the file’s storage map). Our system stores this file
metadata in object attributes on two of the N objects used
to store the file’s data. The rest of the objects have basic at-
tributes like their individual length and modify times, but the
higher-level file system attributes are only stored on the two
attribute-storing components.
File names are implemented in directories similar to traditional
UNIX file systems. Directories are special files that store an ar-
ray of directory entries. A directory entry identifies a file with
a tuple of <serviceID, partitionID, objectID>, and also includes
two <osdID> fields that are hints about the location of the at-
tribute storing components. The partitionID/objectID is the
two-level object numbering scheme of the OSD interface, and
Panasas uses a partition for each volume. Directories are mir-
rored (RAID-1) in two objects so that the small write operations
associated with directory updates are efficient.
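As a rough illustration, a directory entry as described above might look like the following sketch (field names and sizes are assumed for illustration; this is not the Panasas on-disk format):

#include <cstdint>

struct DirEntry {
    char     name[256];      // file name within the directory
    uint64_t serviceID;      // hint: the metadata manager serving the file
    uint64_t partitionID;    // OSD partition; one partition per volume
    uint64_t objectID;       // object number within the partition
    uint64_t osdIDHint[2];   // hints: OSDs holding the two attribute-storing
                             // components; may go stale after reconstruction
                             // or active balancing
};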
Clients are allowed to read, cache and parse directories, or
they can use a Lookup RPC to the metadata manager to trans-
late a name to a <serviceID, partitionID, objectID> tuple and
the <osdID> location hints. The serviceID provides a hint about
the metadata manager for the file, although clients may be re-
directed to the metadata manager that currently controls the
file. The osdID hint can become out-of-date if reconstruction or
active balancing moves an object. If both osdID hints fail, the
metadata manager has to multicast a GetAttributes to the stor-
age nodes in the BladeSet to locate an object. The partitionID
and objectID are the same on every storage node that stores
a component of the file, so this technique will always work.
Once the file is located, the metadata manager automatically
updates the stored hints in the directory, allowing future ac-
cesses to bypass this step.
File operations may require several object operations. The figure
on the left shows the steps used in creating a file. The metadata
manager keeps a local journal to record in-progress actions so it
can recover from object failures and metadata manager crashes
that occur when updating multiple objects. For example, creating a file is a fairly complex task that requires updating the parent directory as well as creating the new file. There are two Create OSD operations to create the first two components of the file, and two Write OSD operations, one to each replica of the parent direc-
tory. As a performance optimization, the metadata server also
grants the client read and write access to the file and returns
the appropriate capabilities to the client as part of the FileCre-
ate results. The server makes a record of these write capabilities
to support error recovery if the client crashes while writing the
file. Note that the directory update (step 7) occurs after the reply,
so that many directory updates can be batched together. The de-
ferred update is protected by the op-log record that gets deleted
in step 8 after the successful directory update.
The metadata manager maintains an op-log that records the
object create and the directory updates that are in progress.
This log entry is removed when the operation is complete. If the
metadata service crashes and restarts, or a failure event moves
the metadata service to a different manager node, then the op-
log is processed to determine what operations were active at
the time of the failure. The metadata manager rolls the opera-
tions forward or backward to ensure the object store is consis-
tent. If no reply to the operation has been generated, then the
operation is rolled back. If a reply has been generated but pend-
ing operations are outstanding (e.g., directory updates), then
the operation is rolled forward.
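The recovery rule can be summarized in a short hypothetical sketch (not Panasas code; the names are invented for illustration):

struct OpLogEntry {
    bool replyGenerated;    // did the client already receive a reply?
    bool updatesPending;    // e.g., deferred directory updates outstanding
};

static void rollBack(OpLogEntry &)    { /* undo partial object creates/writes */ }
static void rollForward(OpLogEntry &) { /* finish the deferred updates */ }

void recoverOperation(OpLogEntry &op)
{
    if (!op.replyGenerated)
        rollBack(op);        // the client never saw a result: undo everything
    else if (op.updatesPending)
        rollForward(op);     // a result was promised: complete the updates
    // finally, the op-log entry is removed once the object store is consistent
}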
The write capability is stored in a cap-log so that when a meta-
data server starts it knows which of its files are busy. In addi-
tion to the “piggybacked” write capability returned by FileC-
reate, the client can also execute a StartWrite RPC to obtain a
separate write capability. The cap-log entry is removed when
the client releases the write cap via an EndWrite RPC. If the
client reports an error during its I/O, then a repair log entry is
made and the file is scheduled for repair. Read and write capa-
bilities are cached by the client over multiple system calls, fur-
ther reducing metadata server traffic.
Creating a file: the metadata server issues CREATE operations to two OSDs (steps 1 and 3), records the in-progress operation in its local transaction log (op-log, cap-log, and reply cache; steps 2, 4, and 5), replies to the client (step 6), then performs the deferred directory WRITE to both replicas (step 7) and deletes the op-log record (step 8).
System-level Metadata
The final layer of metadata is information about the overall sys-
tem itself. One possibility would be to store this information in
objects and bootstrap the system through a discovery protocol.
The most difficult aspect of that approach is reasoning about
the fault model. The system must be able to come up and be
manageable while it is only partially functional. Panasas instead chose a model with a small replicated set of system managers, each of which stores a replica of the system configuration metadata.
Each system manager maintains a local database, outside of the
object storage system. Berkeley DB is used to store tables that
represent our system model. The different system manager in-
stances are members of a replication set that use Lamport’s part-
time parliament (PTP) protocol to make decisions and update
the configuration information. Clusters are configured with one,
three, or five system managers so that the voting quorum has
an odd number and a network partition will cause a minority of
system managers to disable themselves.
System configuration state includes both static state, such as the
identity of the blades in the system, as well as dynamic state such
as the online/offline state of various services and error conditions
associated with different system components. Each state update
decision, whether it is updating the admin password or activating
a service, involves a voting round and an update round according to
the PTP protocol. Database updates are performed within the PTP
transactions to keep the databases synchronized. Finally, the sys-
tem keeps backup copies of the system configuration databases on
several other blades to guard against catastrophic loss of every sys-
tem manager blade.
Blade configuration is pulled from the system managers as part of
each blade’s startup sequence. The initial DHCP handshake conveys
the addresses of the system managers, and thereafter the local OS
on each blade pulls configuration information from the system
managers via RPC.
The cluster manager implementation has two layers. The lower-level PTP layer manages the voting rounds and ensures that partitioned or newly added system managers will be brought up to date with the quorum. The application layer above that uses the voting and update interface to make decisions. Complex system operations may involve several steps, and the system manager has to keep track of its progress so it can tolerate a crash and roll back or roll forward as appropriate.

“Outstanding in the HPC world, the ActiveStor solutions provided by Panasas are undoubtedly the only HPC storage solutions that combine highest scalability and performance with a convincing ease of management.“

Thomas Gebert, HPC Solution Architect
For example, creating a volume (i.e., a quota-tree) involves file system
operations to create a top-level directory, object operations to cre-
ate an object partition within OSDFS on each StorageBlade module,
service operations to activate the appropriate metadata manager,
and configuration database operations to reflect the addition of the
volume. Recovery is enabled by having two PTP transactions. The
initial PTP transaction determines if the volume should be created,
and it creates a record about the volume that is marked as incom-
plete. Then the system manager does all the necessary service acti-
vations, file and storage operations. When these all complete, a final
PTP transaction is performed to commit the operation. If the system
manager crashes before the final PTP transaction, it will detect the
incomplete operation the next time it restarts, and then roll the op-
eration forward or backward.
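The same pattern can be sketched in a few lines of hypothetical code (not Panasas code; ptpDecide() and ptpCommit() stand in for the PTP voting and update rounds):

static bool ptpDecide()      { return true; }  // first PTP transaction: may we create?
static void markIncomplete() { }               // record the volume as incomplete
static void doOperations()   { }               // directory, OSDFS partition, service work
static void ptpCommit()      { }               // final PTP transaction commits

void createVolume(void)
{
    if (!ptpDecide())
        return;            // creation rejected by the quorum
    markIncomplete();      // a crash after this point is detected on restart
    doOperations();        // service activations, file and storage operations
    ptpCommit();           // commit; an incomplete record found at restart
                           // is rolled forward or backward instead
}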
NVIDIA GPU Computing – The CUDA Architecture

INTELLIGENT MANAGEMENT OF HPC APPLICATION WORKFLOWS

While all HPC systems face great challenges in terms of complexity, scalability, and the demands of application workflows and resources, enterprise HPC systems face the most demanding challenges and expectations. Enterprise HPC systems must meet the mission-critical, high-priority application workflow requirements of commercial businesses and of commercially oriented research and academic organizations. They must balance complex SLAs and priorities; indeed, their HPC workflows have a direct impact on their revenue, their product delivery, and their business objectives.

The CUDA parallel computing platform is now widely deployed, with thousands of GPU-accelerated applications and thousands of published research papers, and a complete range of CUDA tools and ecosystem solutions is available to developers.
What is GPU Computing?

GPU computing is the use of a GPU (graphics processing unit) together with a CPU to accelerate general-purpose scientific and engineering applications. Pioneered more than five years ago by NVIDIA, GPU computing has quickly become an industry standard, enjoyed by millions of users worldwide and adopted by virtually all computing vendors.
GPU computing offers unprecedented application performance
by offloading compute-intensive portions of the application to
the GPU, while the remainder of the code still runs on the CPU.
From a user’s perspective, applications simply run significantly
faster.
CPU + GPU is a powerful combination because CPUs consist of a
few cores optimized for serial processing, while GPUs consist of
thousands of smaller, more efficient cores designed for parallel
performance. Serial portions of the code run on the CPU while
parallel portions run on the GPU.
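As a minimal sketch of this offload model (not from the brochure; buffer names and sizes are arbitrary), the serial host code prepares the data, launches the parallel portion as a CUDA kernel, and copies the result back:

#include <cuda_runtime.h>

__global__ void scale(float *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;                    // parallel portion: runs on the GPU
}

int main(void)
{
    const int n = 1 << 20;
    float *h = new float[n];                 // serial portion: runs on the CPU
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);   // offload the compute-intensive part
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d);
    delete[] h;
    return 0;
}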
Most customers can immediately enjoy the power of GPU computing by using any of the GPU-accelerated applications listed in our catalog, which highlights over one hundred industry-leading applications. For developers, GPU computing offers a vast ecosystem of tools and libraries from major software vendors.
“We are very proud to be one of the leading provid-
ers of Tesla systems who are able to combine the
overwhelming power of NVIDIA Tesla systems with
the fully engineered and thoroughly tested trans-
tec hardware to a total Tesla-based solution.”
Norbert Zeidler Senior HPC Solution Engineer
History of GPU Computing
Graphics chips started as fixed-function graphics processors but
became increasingly programmable and computationally pow-
erful, which led NVIDIA to introduce the first GPU. In the 1999-
2000 timeframe, computer scientists and domain scientists from
various fields started using GPUs to accelerate a range of scien-
tific applications. This was the advent of the movement called
GPGPU, or General-Purpose computation on GPU.
While users achieved unprecedented performance (over 100x
compared to CPUs in some cases), the challenge was that GPGPU
required the use of graphics programming APIs like OpenGL and
Cg to program the GPU. This limited accessibility to the tremen-
dous capability of GPUs for science.
NVIDIA recognized the potential of bringing this performance to the larger scientific community, invested in making the GPU fully programmable, and offered a seamless experience for developers with familiar languages like C, C++, and Fortran.
GPU computing momentum is growing faster than ever before.
Today, some of the fastest supercomputers in the world rely on
GPUs to advance scientific discoveries; 600 universities around
the world teach parallel computing with NVIDIA GPUs; and hun-
dreds of thousands of developers are actively using GPUs.
All NVIDIA GPUs—GeForce, Quadro, and Tesla — support GPU
computing and the CUDA parallel programming model. Develop-
ers have access to NVIDIA GPUs in virtually any platform of their
choice, including the latest Apple MacBook Pro. However, we
recommend Tesla GPUs for workloads where data reliability and
overall performance are critical.
Tesla GPUs are designed from the ground up to accelerate scientific and technical computing workloads. Based on innovative features in the “Kepler architecture,” the latest Tesla GPUs offer 3x more performance than the previous architecture and more than one teraflop of double-precision floating-point throughput, while dramatically advancing programmability and efficiency. Kepler is the world’s fastest and most efficient high performance computing (HPC) architecture.
Kepler GK110 – The Next Generation GPU Computing Architecture

As the demand for high performance parallel computing increases across many areas of science, medicine, engineering, and finance, NVIDIA continues to innovate and meet that demand with extraordinarily powerful GPU computing architectures.
The CUDA parallel architecture and the CUDA programming model: applications using C, C++, Fortran, Java, and Python, OpenCL, DirectX Compute, or the CUDA Driver API all run on the CUDA parallel compute engines inside NVIDIA GPUs, through CUDA support in the OS kernel and the CUDA driver; compute kernels (C for CUDA, OpenCL C, HLSL compute shaders) compile to PTX (ISA), spanning device-level APIs and language integration.

NVIDIA’s existing Fermi GPUs have already redefined and accelerated High Performance Computing (HPC) capabilities in areas
such as seismic processing, biochemistry simulations, weather
and climate modeling, signal processing, computational finance,
computer aided engineering, computational fluid dynamics, and
data analysis. NVIDIA’s new Kepler GK110 GPU raises the parallel
computing bar considerably and will help solve the world’s most
difficult computing problems.
By offering much higher processing power than the prior GPU
generation and by providing new methods to optimize and in-
crease parallel workload execution on the GPU, Kepler GK110
simplifies creation of parallel programs and will further revolu-
tionize high performance computing.
Kepler GK110 – Extreme Performance, Extreme Efficiency
Comprising 7.1 billion transistors, Kepler GK110 is not only the
fastest, but also the most architecturally complex microproces-
sor ever built. Adding many new innovative features focused on
compute performance, GK110 was designed to be a parallel pro-
cessing powerhouse for Tesla® and the HPC market.
Kepler GK110 provides over 1 TFlop of double precision
throughput with greater than 93% DGEMM efficiency versus
60-65% on the prior Fermi architecture.
In addition to greatly improved performance, the Kepler archi-
tecture offers a huge leap forward in power efficiency, delivering
up to 3x the performance per watt of Fermi.
The following new features in Kepler GK110 enable increased
GPU utilization, simplify parallel program design, and aid in the
deployment of GPUs across the spectrum of compute environ-
ments ranging from personal workstations to supercomputers:
ʎ Dynamic Parallelism – adds the capability for the GPU to ge-
nerate new work for itself, synchronize on results, and con-
trol the scheduling of that work via dedicated, accelerated
hardware paths, all without involving the CPU. By providing
the flexibility to adapt to the amount and form of parallelism
through the course of a program’s execution, programmers can expose more varied kinds of parallel work and make the most efficient use of the GPU as a computation evolves. This capability allows less-structured, more complex tasks to run easily and effectively, enabling larger portions of an application to run entirely on the GPU. In addition, programs are easier to create, and the CPU is freed for other tasks.
ʎ Hyper-Q – Hyper-Q enables multiple CPU cores to launch work
on a single GPU simultaneously, thereby dramatically increa-
sing GPU utilization and significantly reducing CPU idle times.
Hyper-Q increases the total number of connections (work
queues) between the host and the GK110 GPU by allowing 32
simultaneous, hardware-managed connections (compared to
the single connection available with Fermi). Hyper-Q is a flexi-
ble solution that allows separate connections from multiple
CUDA streams, from multiple Message Passing Interface (MPI)
processes, or even from multiple threads within a process.
Applications that previously encountered false serialization across tasks, thereby limiting achieved GPU utilization, can see a dramatic performance increase without changing any existing code.
ʎ Grid Management Unit – Enabling Dynamic Parallelism re-
quires an advanced, flexible grid management and dispatch
control system. The new GK110 Grid Management Unit (GMU)
manages and prioritizes grids to be executed on the GPU. The
GMU can pause the dispatch of new grids and queue pending
and suspended grids until they are ready to execute, providing
the flexibility to enable powerful runtimes, such as Dynamic
Parallelism. The GMU ensures both CPU- and GPU-generated
workloads are properly managed and dispatched.
ʎ NVIDIA GPUDirect – NVIDIA GPUDirect is a capability that
enables GPUs within a single computer, or GPUs in different
servers located across a network, to directly exchange data
without needing to go to CPU/system memory. The RDMA
feature in GPUDirect allows third party devices such as SSDs,
NICs, and IB adapters to directly access memory on multiple
GPUs within the same system, significantly decreasing the
latency of MPI send and receive messages to/from GPU me-
mory. It also reduces demands on system memory bandwidth
and frees the GPU DMA engines for use by other CUDA tasks.
Kepler GK110 also supports other GPUDirect features inclu-
ding Peer-to-Peer and GPUDirect for Video.
An Overview of the GK110 Kepler Architecture
Kepler GK110 was built first and foremost for Tesla, and its goal
was to be the highest performing parallel computing micropro-
cessor in the world. GK110 not only greatly exceeds the raw com-
pute horsepower delivered by Fermi, but it does so efficiently,
consuming significantly less power and generating much less
heat output.
Introducing NVIDIA Parallel Nsight

NVIDIA Parallel Nsight is the first development environment designed specifically to support massively parallel CUDA C, OpenCL, and DirectCompute applications. It bridges the productivity gap between CPU and GPU code by bringing parallel-aware hardware source code debugging and performance analysis directly into Microsoft Visual Studio, the most widely used integrated application development environment under Microsoft Windows.
Parallel Nsight allows Visual Studio developers to write and de-
bug GPU source code using exactly the same tools and interfaces
that are used when writing and debugging CPU code, including
source and data breakpoints, and memory inspection. Further-
more, Parallel Nsight extends Visual Studio functionality by of-
fering tools to manage massive parallelism, such as the ability
to focus on and debug a single thread out of the thousands of threads running in parallel, and the ability to simply and efficiently visualize the results computed by all parallel threads.
Parallel Nsight is the perfect environment to develop co-pro-
cessing applications that take advantage of both the CPU and
GPU. It captures performance events and information across
both processors, and presents the information to the devel-
oper on a single correlated timeline. This allows developers to
see how their application behaves and performs on the entire
system, rather than through a narrow view that is focused on
a particular subsystem or processor.
Parallel Nsight Debugger for GPU Computing
ʎ Debug your CUDA C/C++ and DirectCompute source code di-
rectly on the GPU hardware
ʎ As the industry’s only GPU hardware debugging solution, it
drastically increases debugging speed and accuracy
ʎ Use the familiar Visual Studio Locals, Watches, Memory and
Breakpoints windows
Parallel Nsight Analysis Tool for GPU Computing
ʎ Isolate performance bottlenecks by viewing system-wide
CPU+GPU events
ʎ Support for all major GPU Computing APIs, including CUDA C/
C++, OpenCL, and Microsoft DirectCompute
transtec has strived to develop well-engineered GPU Computing solutions from the very beginning of the Tesla era. From High-Performance GPU Workstations to rack-mounted Tesla server solutions, transtec has a broad range of specially designed systems available. As an NVIDIA Tesla Preferred Provider (TPP), transtec is able to provide customers with the latest NVIDIA GPU technology as well as fully engineered hybrid systems and Tesla Preconfigured Clusters. Thus, customers can be assured that transtec’s broad experience in HPC cluster solutions is seamlessly brought into the GPU computing world. Performance Engineering made by transtec.
Parallel Nsight Debugger for Graphics Development
ʎ Debug HLSL shaders directly on the GPU hardware. Drastical-
ly increasing debugging speed and accuracy over emulated
(SW) debugging
ʎ Use the familiar Visual Studio Locals, Watches, Memory and
Breakpoints windows with HLSL shaders, including Direct-
Compute code
ʎ The Debugger supports all HLSL shader types: Vertex, Pixel,
Geometry, and Tessellation
Parallel Nsight Graphics Inspector for Graphics Development
ʎ Graphics Inspector captures Direct3D rendered frames for
real-time examination
ʎ The Frame Profiler automatically identifies bottlenecks and
performance information on a per-draw call basis
ʎ Pixel History shows you all operations that affected a given
pixel
A full Kepler GK110 implementation includes 15 SMX units and
six 64-bit memory controllers. Different products will use differ-
ent configurations of GK110. For example, some products may
deploy 13 or 14 SMXs.
Key features of the architecture that will be discussed below in
more depth include:
ʎ The new SMX processor architecture
ʎ An enhanced memory subsystem, offering additional caching
capabilities, more bandwidth at each level of the hierarchy,
and a fully redesigned and substantially faster DRAM I/O im-
plementation.
ʎ Hardware support throughout the design to enable new pro-
gramming model capabilities
Kepler GK110 supports the new CUDA Compute Capability 3.5.
(For a brief overview of CUDA see Appendix A - Quick Refresher
on CUDA). The following table compares parameters of different
Compute Capabilities for Fermi and Kepler GPU architectures:
Performance per Watt
A principal design goal for the Kepler architecture was impro-
ving power efficiency.

Kepler GK110 full chip block diagram.

When designing Kepler, NVIDIA engineers applied everything learned from Fermi to better optimize the
Kepler architecture for highly efficient operation. TSMC’s 28nm
manufacturing process plays an important role in lowering pow-
er consumption, but many GPU architecture modifications were
required to further reduce power consumption while maintai-
ning great performance.
                                     Fermi    Fermi    Kepler    Kepler
                                     GF100    GF104    GK104     GK110
Compute Capability                   2.0      2.1      3.0       3.5
Threads / Warp                       32       32       32        32
Max Warps / Multiprocessor           48       48       64        64
Max Threads / Multiprocessor         1536     1536     2048      2048
Max Thread Blocks / Multiprocessor   8        8        16        16
32-bit Registers / Multiprocessor    32768    32768    65536     65536
Max Registers / Thread               63       63       63        255
Max Threads / Thread Block           1024     1024     1024      1024
Shared Memory Size                   16K/48K  16K/48K  16K/32K/  16K/32K/
Configurations (bytes)                                 48K       48K
Max X Grid Dimension                 2^16-1   2^16-1   2^32-1    2^32-1
Hyper-Q                              NO       NO       NO        YES
Dynamic Parallelism                  NO       NO       NO        YES
Every hardware unit in Kepler was designed and scrubbed to
provide outstanding performance per watt. The best example
of great perf/watt is seen in the design of Kepler GK110’s new
Streaming Multiprocessor (SMX), which is similar in many res-
pects to the SMX unit recently introduced in Kepler GK104, but
includes substantially more double precision units for compute
algorithms.
Streaming Multiprocessor (SMX) Architecture
Kepler GK110’s new SMX introduces several architectural inno-
vations that make it not only the most powerful multiprocessor
we’ve built, but also the most programmable and power-effi-
cient.
SMX Processing Core Architecture
Each of the Kepler GK110 SMX units features 192 single-precision CUDA cores, and each core has fully pipelined floating-point and
integer arithmetic logic units. Kepler retains the full IEEE 754-2008
compliant single- and double-precision arithmetic introduced in
Fermi, including the fused multiply-add (FMA) operation.
One of the design goals for the Kepler GK110 SMX was to signifi-
cantly increase the GPU’s delivered double precision performance,
since double precision arithmetic is at the heart of many HPC ap-
plications. Kepler GK110’s SMX also retains the special function
units (SFUs) for fast approximate transcendental operations as in
previous-generation GPUs, providing 8x the number of SFUs of the
Fermi GF110 SM.
Similar to GK104 SMX units, the cores within the new GK110
SMX units use the primary GPU clock rather than the 2x shader
clock. Recall the 2x shader clock was introduced in the G80 Tes-
la-architecture GPU and used in all subsequent Tesla- and Fermi-
architecture GPUs. Running execution units at a higher clock rate
SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).
allows a chip to achieve a given target throughput with fewer
copies of the execution units, which is essentially an area optimi-
zation, but the clocking logic for the faster cores is more power-
hungry. For Kepler, our priority was performance per watt. While
we made many optimizations that benefitted both area and pow-
er, we chose to optimize for power even at the expense of some
added area cost, with a larger number of processing cores running
at the lower, less power-hungry GPU clock.
Quad Warp Scheduler
The SMX schedules threads in groups of 32 parallel threads called
warps. Each SMX features four warp schedulers and eight instruc-
tion dispatch units, allowing four warps to be issued and execut-
ed concurrently. Kepler’s quad warp scheduler selects four warps,
and two independent instructions per warp can be dispatched
each cycle. Unlike Fermi, which did not permit double precision
instructions to be paired with other instructions, Kepler GK110
allows double precision instructions to be paired with other ins-
tructions.
Each Kepler SMX contains 4 Warp Schedulers, each with dual Instruction Dispatch Units. A single Warp Scheduler Unit is shown above.
We also looked for opportunities to optimize the power in the
SMX warp scheduler logic. For example, both Kepler and Fermi
schedulers contain similar hardware units to handle the schedu-
ling function, including:
ʎ Register scoreboarding for long latency operations (texture
and load)
ʎ Inter-warp scheduling decisions (e.g., pick the best warp to go
next among eligible candidates)
ʎ Thread block level scheduling (e.g., the GigaThread engine)
However, Fermi’s scheduler also contains a complex hardware
stage to prevent data hazards in the math datapath itself. A mul-
ti-port register scoreboard keeps track of any registers that are
not yet ready with valid data, and a dependency checker block
analyzes register usage across a multitude of fully decoded warp
instructions against the scoreboard, to determine which are eli-
gible to issue.
For Kepler, we recognized that this information is deterministic
(the math pipeline latencies are not variable), and therefore it is
possible for the compiler to determine up front when instructions
will be ready to issue, and provide this information in the instruc-
tion itself. This allowed us to replace several complex and power-
expensive blocks with a simple hardware block that extracts
the pre-determined latency information and uses it to mask out
warps from eligibility at the inter-warp scheduler stage.
New ISA Encoding: 255 Registers per Thread
The number of registers that can be accessed by a thread has been quadrupled in GK110, allowing each thread access to up to 255 registers. Codes that exhibit high register pressure or spilling behavior in Fermi may see substantial speedups as a result of the increased available per-thread register count. A compelling example can be seen in the QUDA library for performing lattice QCD (quantum chromodynamics) calculations using CUDA. QUDA fp64-based algorithms see performance increases of up to 5.3x due to the ability to use many more registers per thread and to fewer spills to local memory.
Shuffle Instruction

To further improve performance, Kepler implements a new Shuffle instruction, which allows threads within a warp to share data. Previously, sharing data between threads within a warp required separate store and load operations to pass the data through shared memory. With the Shuffle instruction, threads within a warp can read values from other threads in the warp in just about any imaginable permutation. Shuffle supports arbitrary indexed references, i.e., any thread reads from any other thread. Useful shuffle subsets, including next-thread (offset up or down by a fixed amount) and XOR “butterfly”-style permutations among the threads in a warp, are also available as CUDA intrinsics.
This example shows some of the variations possible using the new Shuffle instruction in Kepler.
Shuffle offers a performance advantage over shared memory, in that a store-and-load operation is carried out in a single step. Shuffle can also reduce the amount of shared memory needed per thread block, since data exchanged at the warp level never needs to be placed in shared memory. In the case of FFT, which requires data sharing within a warp, a 6% performance gain can be seen just by using Shuffle.
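A minimal sketch of a warp-level sum reduction built on Shuffle, written against the Kepler-era __shfl_down() intrinsic (CUDA 5.x; later CUDA versions renamed it __shfl_down_sync()):

__global__ void warpReduceSum(const float *in, float *out)
{
    float val = in[threadIdx.x];             // one element per lane in the warp
    // Each step pulls the partial sum from the lane `offset` positions up,
    // halving the number of live partial sums; no shared memory is involved.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);
    if (threadIdx.x == 0)
        *out = val;                          // lane 0 ends up with the warp total
}

// Launch with a single warp, e.g.: warpReduceSum<<<1, 32>>>(d_in, d_out);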
Atomic Operations
Atomic memory operations are important in parallel program-
ming, allowing concurrent threads to correctly perform read-
modify-write operations on shared data structures. Atomic oper-
ations such as add, min, max, and compare-and-swap are atomic
in the sense that the read, modify, and write operations are per-
formed without interruption by other threads. Atomic memory
operations are widely used for parallel sorting, reduction opera-
tions, and building data structures in parallel without locks that
serialize thread execution.
Throughput of global memory atomic operations on Kepler
GK110 is substantially improved compared to the Fermi genera-
tion. Atomic operation throughput to a common global memory
address is improved by 9x to one operation per clock. Atomic
operation throughput to independent global addresses is also
significantly accelerated, and logic to handle address conflicts
has been made more efficient. Atomic operations can often be
processed at rates similar to global load operations. This speed
increase makes atomics fast enough to use frequently within
kernel inner loops, eliminating the separate reduction passes
that were previously required by some algorithms to consoli-
date results. Kepler GK110 also expands the native support for
64-bit atomic operations in global memory. In addition to atomi-
cAdd, atomicCAS, and atomicExch (which were also supported by
Fermi and Kepler GK104), GK110 supports the following:
ʎ atomicMin
ʎ atomicMax
ʎ atomicAnd
ʎ atomicOr
ʎ atomicXor
Other atomic operations which are not supported natively (for
example 64-bit floating point atomics) may be emulated using
the compare-and-swap (CAS) instruction.
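The CAS emulation pattern looks like the following sketch (essentially the loop published in the CUDA C Programming Guide) for a 64-bit floating-point atomic add:

__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *addr = (unsigned long long int *)address;
    unsigned long long int old = *addr, assumed;
    do {
        assumed = old;
        // Add to the assumed value and swap it in only if no other thread
        // modified the location in the meantime; otherwise retry.
        old = atomicCAS(addr, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}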
Texture Improvements
The GPU’s dedicated hardware Texture units are a valuable re-
source for compute programs with a need to sample or filter
image data. The texture throughput in Kepler is significantly in-
creased compared to Fermi – each SMX unit contains 16 texture
filtering units, a 4x increase vs the Fermi GF110 SM.
In addition, Kepler changes the way texture state is managed. In
the Fermi generation, for the GPU to reference a texture, it had
to be assigned a “slot” in a fixed-size binding table prior to grid
launch. The number of slots in that table limits how many unique textures a program can read from at runtime; ultimately, a program was limited to accessing only 128 simultaneous textures in Fermi.
With bindless textures in Kepler, the additional step of using
slots isn’t necessary: texture state is now saved as an object in
memory and the hardware fetches these state objects on de-
mand, making binding tables obsolete. This effectively elimi-
nates any limits on the number of unique textures that can be
referenced by a compute program. Instead, programs can map
textures at any time and pass texture handles around as they
would any other pointer.
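A minimal sketch of the bindless-texture usage described above, using the CUDA 5.0 texture-object API (buffer and kernel names are arbitrary):

#include <cuda_runtime.h>

__global__ void sampleKernel(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, i);       // load via the texture path
}

cudaTextureObject_t makeTextureObject(float *devBuf, size_t n)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;         // plain linear device memory
    res.res.linear.devPtr = devBuf;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, NULL);  // no binding slot needed
    return tex;                      // pass to kernels like any other value
}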
Kepler Memory Subsystem – L1, L2, ECC
Kepler’s memory hierarchy is organized similarly to Fermi. The
Kepler architecture supports a unified memory request path for
loads and stores, with an L1 cache per SMX multiprocessor. Ke-
pler GK110 also enables compiler-directed use of an additional
new cache for read-only data, as described below.
64 KB Configurable Shared Memory and L1 Cache
In the Kepler GK110 architecture, as in the previous generation
Fermi architecture, each SMX has 64 KB of on-chip memory that
can be configured as 48 KB of Shared memory with 16 KB of L1
cache, or as 16 KB of shared memory with 48 KB of L1 cache. Ke-
pler now allows for additional flexibility in configuring the al-
location of shared memory and L1 cache by permitting a 32KB
/ 32KB split between shared memory and L1 cache. To support
the increased throughput of each SMX unit, the shared memory
bandwidth for 64b and larger load operations is also doubled
compared to the Fermi SM, to 256B per core clock.
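Selecting the new 32KB/32KB split is a one-line runtime call, as in this minimal sketch (the kernel is a hypothetical placeholder):

#include <cuda_runtime.h>

__global__ void stencilKernel(float *data) { /* kernel body omitted */ }

void configureCacheSplit(void)
{
    // Kepler-only 32KB/32KB split; the Fermi-era choices remain
    // cudaFuncCachePreferShared (48/16) and cudaFuncCachePreferL1 (16/48).
    cudaFuncSetCacheConfig(stencilKernel, cudaFuncCachePreferEqual);
}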
48KB Read-Only Data Cache
In addition to the L1 cache, Kepler introduces a 48KB cache for
data that is known to be read-only for the duration of the func-
tion. In the Fermi generation, this cache was accessible only by
the Texture unit. Expert programmers often found it advanta-
geous to load data through this path explicitly by mapping their
data as textures, but this approach had many limitations.
In Kepler, in addition to significantly increasing the capacity of
this cache along with the texture horsepower increase, we de-
cided to make the cache directly accessible to the SM for general
load operations. Use of the read-only path is beneficial because
it takes both load and working set footprint off of the Shared/L1
cache path. In addition, the Read-Only Data Cache’s higher tag
bandwidth supports full speed unaligned memory access pat-
terns among other scenarios.
Use of the read-only path can be managed automatically by the
compiler or explicitly by the programmer. Access to any variable
or data structure that is known to be constant through program-
mer use of the C99-standard “const __restrict” keyword may be
tagged by the compiler to be loaded through the Read-Only Data Cache. The programmer can also explicitly use this path
with the __ldg() intrinsic.
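Both routes into the read-only data cache are shown in this minimal sketch (compute capability 3.5; the kernel name is arbitrary):

__global__ void scaleReadOnly(const float * __restrict__ in,
                              float * __restrict__ out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * __ldg(&in[i]);  // explicit load through the read-only cache;
                                     // the const __restrict__ qualifiers alone
                                     // already let the compiler do this implicitly
}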
Improved L2 Cache
The Kepler GK110 GPU features 1536KB of dedicated L2 cache
memory, double the amount of L2 available in the Fermi archi-
tecture. The L2 cache is the primary point of data unification
between the SMX units, servicing all load, store, and texture re-
quests and providing efficient, high speed data sharing across
the GPU. The L2 cache on Kepler offers up to 2x the bandwidth per clock available in Fermi. Algorithms for which data addresses
are not known beforehand, such as physics solvers, ray tracing,
and sparse matrix multiplication especially benefit from the
cache hierarchy. Filter and convolution kernels that require mul-
tiple SMs to read the same data also benefit.
Memory Protection Support
Like Fermi, Kepler’s register files, shared memories, L1 cache, L2
cache and DRAM memory are protected by a Single-Error Correct
Double-Error Detect (SECDED) ECC code. In addition, the Read-Only
Data Cache supports single-error correction through a parity check;
in the event of a parity error, the cache unit automatically invalidates the failed line, forcing a read of the correct data from L2.
ECC checkbit fetches from DRAM necessarily consume some
amount of DRAM bandwidth, which results in a performance
difference between ECC-enabled and ECC-disabled operation,
especially on memory bandwidth-sensitive applications. Ke-
pler GK110 implements several optimizations to ECC checkbit
fetch handling based on Fermi experience. As a result, the ECC
on-vs-off performance delta has been reduced by an average
of 66%, as measured across our internal compute application
test suite.
Dynamic Parallelism
In a hybrid CPU-GPU system, enabling a larger amount of paral-
lel code in an application to run efficiently and entirely within
the GPU improves scalability and performance as GPUs increase
in perf/watt. To accelerate these additional parallel portions of
the application, GPUs must support more varied types of parallel
workloads.
Dynamic Parallelism is a new feature introduced with Kepler GK110
that allows the GPU to generate new work for itself, synchronize on
results, and control the scheduling of that work via dedicated, ac-
celerated hardware paths, all without involving the CPU.
Fermi was very good at processing large parallel data structures
when the scale and parameters of the problem were known at
kernel launch time. All work was launched from the host CPU,
would run to completion, and return a result back to the CPU.
The result would then be used as part of the final solution, or
would be analyzed by the CPU which would then send additional
requests back to the GPU for additional processing.
In Kepler GK110 any kernel can launch another kernel, and can create the necessary streams and events and manage the dependencies needed to process additional work without the need for host CPU interaction. This architectural innovation makes it easier for developers to create and optimize recursive and data-dependent execution patterns, and allows more of a program to be run directly on the GPU. The system CPU can then be freed up for additional tasks, or the system could be configured with a less powerful CPU to carry out the same workload.
Dynamic Parallelism allows more varieties of parallel algorithms
to be implemented on the GPU, including nested loops with dif-
fering amounts of parallelism, parallel teams of serial control-
task threads, or simple serial control code offloaded to the GPU
in order to promote data-locality with the parallel portion of the
application.
Because a kernel has the ability to launch additional workloads
based on intermediate, on-GPU results, programmers can now
intelligently load-balance work to focus the bulk of their re-
sources on the areas of the problem that either require the most
processing power or are most relevant to the solution.
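A minimal sketch of Dynamic Parallelism (requires compute capability 3.5 and compilation with nvcc -arch=sm_35 -rdc=true -lcudadevrt; kernel names are arbitrary), in which a parent kernel launches a child grid directly from the GPU:

__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.5f;          // refinement work, decided on the GPU
}

__global__ void parentKernel(float *data, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Launch new work from the device, based on on-GPU results,
        // without any CPU involvement.
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();         // device-side wait for the child grid
    }
}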
One example would be dynamically setting up a grid for a nu-
merical simulation – typically grid cells are focused in regions
of greatest change, requiring an expensive pre-processing pass
through the data. Alternatively, a uniformly coarse grid could be
used to prevent wasted GPU resources, or a uniformly fine grid could be used to ensure all the features are captured, but these options risk missing simulation features or “over-spending” compute resources on regions of less interest. With Dynamic Parallelism, the grid resolution can be determined dynamically at runtime in a data-dependent manner.

Dynamic Parallelism allows more parallel code in an application to be launched directly by the GPU onto itself (right side of image) rather than requiring CPU intervention (left side of image).
A Quick Refresher on CUDA

CUDA is a combined hardware/software platform that enables NVIDIA GPUs to execute programs written in C, C++, Fortran, and other languages. A CUDA program invokes parallel functions called kernels that execute across many parallel threads. The programmer or compiler organizes these threads into thread blocks and grids of thread blocks, as shown in the figure on the right side. Each thread within a thread block executes an instance of the kernel. Each thread also has thread and block IDs within its thread block and grid, a program counter, registers, per-thread private memory, inputs, and output results. A thread block is a set of concurrently executing threads that can cooperate among themselves through barrier synchronization and shared memory. A thread block has a block ID within its grid. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between dependent kernel calls. In the CUDA parallel programming model, each thread has a per-thread private memory space used for register spills, function calls, and C automatic array variables. Each thread block has a per-block shared memory space used for inter-thread communication, data sharing, and result sharing in parallel algorithms. Grids of thread blocks share results in global memory space after kernel-wide global synchronization.
CUDA Hardware Execution
CUDA’s hierarchy of threads maps to a hierarchy of processors
on the GPU; a GPU executes one or more kernel grids; a stream-
ing multiprocessor (SM on Fermi / SMX on Kepler) executes one
or more thread blocks; and CUDA cores and other execution
units in the SMX execute thread instructions. The SMX executes
threads in groups of 32 threads called warps. While programmers
can generally ignore warp execution for functional correctness
and focus on programming individual scalar threads, they can
greatly improve performance by having threads in a warp execute
the same code path and access memory with nearby addresses.
CUDA hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared, and per-application global memory spaces.
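The hierarchy described above maps directly onto kernel code, as in this minimal sketch (a simple vector add; names are arbitrary):

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Global index = block ID within the grid x block size
    // + thread ID within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host side: launch a grid of thread blocks covering all n elements, e.g.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);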
To return to the grid-refinement example: starting with a coarse grid, the simulation can “zoom in” on areas of interest while avoiding unnecessary calculation in areas
with little change. Though this could be accomplished using a se-
quence of CPU-launched kernels, it would be far simpler to allow
the GPU to refine the grid itself by analyzing the data and launch-
ing additional work as part of a single simulation kernel, eliminat-
ing interruption of the CPU and data transfers between the CPU
and GPU.
The above example illustrates the benefits of using a dynami-
cally sized grid in a numerical simulation. To meet peak preci-
sion requirements, a fixed resolution simulation must run at an
excessively fine resolution across the entire simulation domain,
whereas a multi-resolution grid applies the correct simulation
resolution to each area based on local variation.
Hyper-Q
One of the challenges in the past has been keeping the GPU supplied with an optimally scheduled load of work from multiple streams. The Fermi architecture supported 16-way concurrency of kernel launches from separate streams, but ultimately the streams were all multiplexed into the same hardware work queue. This allowed for false intra-stream dependencies, requiring dependent kernels within one stream to complete before ad-
ditional kernels in a separate stream could be executed. While this
could be alleviated to some extent through the use of a breadth-first launch order, as program complexity increases this becomes more and more difficult to manage efficiently.
Kepler GK110 improves on this functionality with the new Hyper-Q feature. Hyper-Q increases the total number of connections (work queues) between the host and the CUDA Work Distributor (CWD) logic in the GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting GPU utilization, can see up to a 32x performance increase without changing any existing code.
Each CUDA stream is managed within its own hardware work
queue, inter-stream dependencies are optimized, and opera-
tions in one stream will no longer block other streams, enabling
streams to execute concurrently without needing to specifically tailor the launch order to eliminate possible false dependencies.
Hyper-Q offers significant benefits for use in MPI-based paral-
lel computer systems. Legacy MPI-based algorithms were often
created to run on multi-core CPU systems, with the amount of
work assigned to each MPI process scaled accordingly. This can
lead to a single MPI process having insufficient work to fully oc-
cupy the GPU. While it has always been possible for multiple MPI
processes to share a GPU, these processes could become bottle-
necked by false dependencies. Hyper-Q removes those false de-
pendencies, dramatically increasing the efficiency of GPU shar-
ing across MPI processes.
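A minimal sketch of the multi-stream pattern that benefits from Hyper-Q (stream count and kernel are arbitrary): each kernel goes into its own CUDA stream, and on Kepler each stream gets its own hardware work queue, so no launch-order tricks are needed:

#include <cuda_runtime.h>

__global__ void busyKernel(float *x) { x[threadIdx.x] += 1.0f; }

void launchConcurrent(float *bufs[8])    // eight independent device buffers
{
    cudaStream_t streams[8];
    for (int s = 0; s < 8; ++s) {
        cudaStreamCreate(&streams[s]);
        // On Fermi this launch order could serialize behind the single
        // hardware work queue; with Hyper-Q the kernels can run concurrently.
        busyKernel<<<1, 32, 0, streams[s]>>>(bufs[s]);
    }
    for (int s = 0; s < 8; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}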
Grid Management Unit - Efficiently Keeping the GPU Utilized
New features in Kepler GK110, such as the ability for CUDA ker-
nels to launch work directly on the GPU with Dynamic Parallel-
ism, required that the CPU-to-GPU workflow in Kepler offer in-
creased functionality over the Fermi design. On Fermi, a grid of
thread blocks would be launched by the CPU and would always
run to completion, creating a simple unidirectional flow of work from the host to the SMs via the CUDA Work Distributor (CWD) unit.

Hyper-Q permits more simultaneous connections between CPU and GPU. Working with CUDA streams: in the Fermi model, only (C,P) and (R,X) can run concurrently due to intra-stream dependencies caused by the single hardware work queue; the Kepler Hyper-Q model allows all streams to run concurrently using separate work queues.

Kepler GK110 was designed to improve the CPU-to-GPU
workflow by allowing the GPU to efficiently manage both CPU-
and CUDA-created workloads.
We discussed the ability of the Kepler GK110 GPU to allow ker-
nels to launch work directly on the GPU, and it’s important to
understand the changes made in the Kepler GK110 architec-
ture to facilitate these new functions. In Kepler, a grid can be
launched from the CPU just as was the case with Fermi, how-
ever new grids can also be created programmatically by CUDA
within the Kepler SMX unit. To manage both CUDA-created and
host-originated grids, a new Grid Management Unit (GMU) was
introduced in Kepler GK110. This control unit manages and pri-
oritizes grids that are passed into the CWD to be sent to the
SMX units for execution.
The CWD in Kepler holds grids that are ready to dispatch, and it
is able to dispatch 32 active grids, which is double the capacity of the Fermi CWD.

The redesigned Kepler host-to-GPU workflow shows the new Grid Management Unit, which manages the actively dispatching grids, pauses dispatch, and holds pending and suspended grids.

The Kepler CWD communicates with the GMU via
a bidirectional link that allows the GMU to pause the dispatch
of new grids and to hold pending and suspended grids until
needed. The GMU also has a direct connection to the Kepler SMX
units to permit grids that launch additional work on the GPU via Dynamic Parallelism to send the new work back to the GMU to be prioritized and dispatched. If the kernel that dispatched the ad-
ditional workload pauses, the GMU will hold it inactive until the
dependent work has completed.
NVIDIA GPUDirect
When working with a large amount of data, increasing the data
throughput and reducing latency is vital to increasing compute
performance. Kepler GK110 supports the RDMA feature in NVIDIA
GPUDirect, which is designed to improve performance by allow-
ing direct access to GPU memory by third-party devices such as
IB adapters, NICs, and SSDs. When using CUDA 5.0, GPUDirect pro-
vides the following important features:
ʎ Direct memory access (DMA) between NIC and GPU without
the need for CPU-side data buffering.
ʎ Significantly improved MPISend/MPIRecv efficiency between
GPU and other nodes in a network.
ʎ Eliminates CPU bandwidth and latency bottlenecks
ʎ Works with variety of 3rd-party network, capture, and storage
devices
Applications like reverse time migration (used in seismic imag-
ing for oil & gas exploration) distribute the large imaging data
across several GPUs. Hundreds of GPUs must collaborate to
crunch the data, often communicating intermediate results.
GPUDirect enables much higher aggregate bandwidth for this GPU-to-GPU communication scenario within a server and across servers with the P2P and RDMA features.
Kepler GK110 also supports other GPUDirect features such as
Peer- to-Peer and GPUDirect for Video.
Conclusion
With the launch of Fermi in 2010, NVIDIA ushered in a new era
in the high performance computing (HPC) industry based on a
hybrid computing model where CPUs and GPUs work together
to solve computationally-intensive workloads. Now, with the
new Kepler GK110 GPU, NVIDIA again raises the bar for the HPC
industry.
Kepler GK110 was designed from the ground up to maximize
computational performance and throughput computing with
outstanding power efficiency. The architecture has many new
innovations such as SMX, Dynamic Parallelism, and Hyper-Q that
make hybrid computing dramatically faster, easier to program,
and applicable to a broader set of applications. Kepler GK110
GPUs will be used in numerous systems ranging from worksta-
tions to supercomputers to address the most daunting chal-
lenges in HPC.
GPUDirect RDMA allows direct access to GPU memory from 3rd-party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.
Executive Overview

The High Performance Computing market’s continuing need for improved time-to-solution and the ability to explore expanding models seems unquenchable, requiring ever-faster HPC clusters. This has led many HPC users to integrate graphics processing units (GPUs) into their clusters. While GPUs have traditionally been used solely for visualization or animation, today
tionally been used solely for visualization or animation, today
they serve as fully programmable, massively parallel proces-
sors, allowing computing tasks to be divided and concurrently
processed on the GPU’s many processing cores. When multiple
GPUs are integrated into an HPC cluster, the performance po-
tential of the HPC cluster is greatly enhanced. This processing
environment enables scientists and researchers to tackle some
of the world’s most challenging computational problems.
HPC applications modified to take advantage of the GPU pro-
cessing capabilities can benefit from significant performance
gains over clusters implemented with traditional processors. To
obtain these results, HPC clusters with multiple GPUs require a
high-performance interconnect to handle the GPU-to-GPU com-
munications and optimize the overall performance potential of
the GPUs. Because the GPUs place significant demands on the
interconnect, it takes a high-performance interconnect, such
as InfiniBand, to provide the low latency, high message rate,
and bandwidth that are needed to enable all resources in the
cluster to run at peak performance.
Intel worked in concert with NVIDIA to optimize Intel TrueScale
InfiniBand with NVIDIA GPU technologies. This solution sup-
ports the full performance potential of NVIDIA GPUs through
an interface that is easy to deploy and maintain.
Key Points
ʎ Up to 44 percent GPU performance improvement versus an implementation without GPUDirect – a GPU computing product from NVIDIA that enables faster communication between the GPU and InfiniBand
ʎ Intel TrueScale InfiniBand offers as much as 10 percent better GPU performance than other InfiniBand interconnects
ʎ Ease of installation and maintenance – Intel’s implementation offers a streamlined deployment approach that is significantly easier than alternatives

Intel is a global leader and technology innovator in high performance networking, including adapters, switches and ASICs.
Ease of Deployment

One of the key challenges with deploying clusters consisting of multi-GPU nodes is to maximize application performance. With-
out GPUDirect, GPU-to-GPU communications would require the
host CPU to make multiple memory copies to avoid a memory
pinning conflict between the GPU and InfiniBand. Each addi-
tional CPU memory copy significantly reduces the performance
potential of the GPUs.
Intel’s implementation of GPUDirect takes a streamlined ap-
proach to optimizing NVIDIA GPU performance with Intel TrueS-
cale InfiniBand. With Intel’s solution, a user only needs to update
the NVIDIA driver with code provided and tested by Intel. Other
InfiniBand implementations require the user to implement
a Linux kernel patch as well as a special InfiniBand driver. The
Intel approach provides a much easier way to deploy, support,
and maintain GPUs in a cluster without having to sacrifice per-
formance. In addition, it is completely compatible with other
GPUDirect implementations; the CUDA libraries and application
code require no changes.
Optimized Performance

Intel used AMBER molecular dynamics simulation software to test clustered GPU performance with and without GPUDirect.
Figure 15 shows that there is a significant performance gain of
up to 44 percent that results from streamlining the host memory
access to support GPU-to-GPU communications.
Clustered GPU Performance

HPC applications that have been designed to take advantage of parallel GPU performance require a high-performance inter-
connect, such as InfiniBand, to maximize that performance. In
addition, the implementation or architecture of the InfiniBand
interconnect can impact performance. The two industry-leading
InfiniBand implementations have very different architectures
and only one was specifically designed for the HPC market –
Intel’s TrueScale InfiniBand. TrueScale InfiniBand provides un-
matched performance benefits, especially as the GPU cluster is
scaled. It offers high performance in all of the key areas that influ-
ence the performance of HPC applications, including GPU-based
applications. These factors include the following:
ʎ Scalable non-coalesced message rate performance greater
than 25M messages per second
ʎ Extremely low latency for MPI collectives, even on clusters
consisting of thousands of nodes
ʎ Consistently low latency of one to two µs, even at scale
These factors and the design of Intel TrueScale InfiniBand en-
able it to optimize the performance of NVIDIA GPUs.

Figure 15: Performance with and without GPUDirect – AMBER Cellulose test, 8 GPUs, ns/day; TrueScale with GPUDirect delivers 44 percent better performance.

The follow-
ing tests were performed on NVIDIA Tesla 2050s interconnected
with Intel TrueScale QDR InfiniBand at Intel’s NETtrack Develop-
er Center. The Tesla 2050 results for the industry’s other leading
InfiniBand are from the published results on the AMBER bench-
mark site.
Figure 16 shows performance results from the AMBER Myoglo-
bin benchmark (2,492 atoms) when scaling from two to eight
Tesla 2050 GPUs. The results indicate that the Intel TrueScale
InfiniBand offers up to 10 percent more performance than the
industry’s other leading InfiniBand when both used their ver-
sions of GPUDirect. As the figure shows, the performance dif-
ference increases as the application is scaled to more GPUs.
The next test (Figure 17) shows the impact of the InfiniBand
interconnect on the performance of AMBER across models
of various sizes. The following Explicit Solvent models were
tested:
ʎ DHFR: 23,558 atoms
ʎ FactorIX: 90,906 atoms
ʎ Cellulose: 408,609 atoms
It is important to point out that the performance of the models
is dependent on the model size, the size of the GPU cluster, and
the performance of the InfiniBand interconnect. The smaller
the model, the more it is dependent on the interconnect due
to the fact that the model’s components (atoms in the case of
AMBER) are divided across the available GPUs in the cluster to
be processed for each step of the simulation. For example, the
DHFR test with its 23,558 atoms means that each Tesla 2050 in an
eight-GPU cluster is processing only 2,945 atoms for each step
of the simulation. The processing time is relatively small when
compared to the communication time. In contrast, the Cellulose
model with its 408K atoms requires each GPU to process 17 times
more data per step than the DHFR test, so significantly more
time is spent in GPU processing than in communications.
The preceding tests demonstrate that the TrueScale InfiniBand
performs better under load. The DHFR model is the most sensi-
tive to the interconnect performance, and it indicates that Tru-
eScale offers six percent more performance than the alternative
InfiniBand product. Combining the results from Figure 15 and
Figure 16 illustrates that TrueScale InfiniBand provides better
results with smaller models on small clusters and better model
scalability for larger models on larger GPU clusters.
Performance/Watt Advantage

Today the focus is not just on performance, but on how efficiently
that performance can be delivered.
This is an area in which Intel TrueScale InfiniBand excels. The
National Center for Supercomputing Applications (NCSA) has a
cluster based on NVIDIA GPUs interconnected with TrueScale
InfiniBand. This cluster is number three on the November 2010
Green500 list with a performance of 933 MFlops/Watt.
This on its own is a significant accomplishment, but it is even
more impressive when considering its original position on the
Top500 supercomputing list. In fact, the cluster is ranked at #404
on the Top500 list, but the combination of NVIDIA’s GPU perfor-
mance, Intel’s TrueScale performance, and low power consump-
tion enabled the cluster to move up 401 spots from the Top500
list to reach number three on the Green500 list. This is the most
dramatic shift of any cluster in the top 50 of the Green500. In
part, the following are the reasons for such dramatic perfor-
mance/watt results:
ʎ Performance of the NVIDIA Tesla 2050 GPU
ʎ Linpack performance efficiency of this cluster is 49 percent,
which is almost 20 percent better than most other NVIDIA
GPU-based clusters on the Top500 list
ʎ The Intel TrueScale InfiniBand Adapter required 25 to 50 per-
cent less power than the alternative InfiniBand product
Conclusion

The performance of the InfiniBand interconnect has a significant
impact on the performance of GPU-based clusters. Intel’s Tru-
eScale InfiniBand is designed and architected for the HPC mar-
ketplace, and it offers an unmatched performance profile with
a GPU-based cluster. Finally, Intel’s solution provides an imple-
mentation that is easier to deploy and maintain, and allows for
optimal performance in comparison to the industry’s other lead-
ing InfiniBand.
Figure 16: GPU scalable performance with the industry's leading InfiniBands – AMBER Myoglobin test on 2, 4, and 8 GPUs (ns/day); the TrueScale advantage grows with scale, from even at 2 GPUs to 3.2 percent at 4 and 9.6 percent at 8 GPUs

Figure 17: Explicit Solvent benchmark results for the two leading InfiniBands – 8 GPUs (ns/day); TrueScale leads by 6 percent on DHFR, 5 percent on FactorIX, and 1 percent on Cellulose
Intel Xeon Phi Coprocessor – The Architecture

Intel Many Integrated Core (Intel MIC) architecture combines many Intel CPU cores onto a single chip. Intel MIC architecture is targeted for highly parallel, High Performance Computing (HPC) workloads in a variety of fields such as computational physics, chemistry, biology, and financial services. Today such workloads are run as task parallel programs on large compute clusters.
The Intel MIC architecture is aimed at achieving high through-
put performance in cluster environments where there are rigid
floor planning and power constraints. A key attribute of the mi-
croarchitecture is that it is built to provide a general-purpose
programming environment similar to the Intel Xeon processor
programming environment. The Intel Xeon Phi coprocessors
based on the Intel MIC architecture run a full service Linux op-
erating system, support the x86 memory order model and IEEE 754
floating-point arithmetic, and are capable of running applica-
tions written in industry-standard programming languages such
as Fortran, C, and C++. The coprocessor is supported by a rich
development environment that includes compilers, numerous
libraries such as threading libraries and high performance math
libraries, performance characterizing and tuning tools, and de-
buggers.
The Intel Xeon Phi coprocessor is connected to an Intel Xeon
processor, also known as the “host”, through a PCI Express (PCIe)
bus. Since the Intel Xeon Phi coprocessor runs a Linux operating
system, a virtualized TCP/IP stack could be implemented over
the PCIe bus, allowing the user to access the coprocessor as a
network node. Thus, any user can connect to the coprocessor
through a secure shell and directly run individual jobs or submit
batch jobs to it. The coprocessor also supports heterogeneous
applications wherein a part of the application executes on the
host while a part executes on the coprocessor.

Figure: The first generation Intel Xeon Phi product, codenamed “Knights Corner”
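As a sketch of such a heterogeneous program – a minimal example using the offload pragma shipped with Intel's compilers for the coprocessor; the array names and sizes are illustrative assumptions, not Intel sample code:

#include <stdio.h>

#define N 1000000
static float a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) { b[i] = (float)i; c[i] = 2.0f * i; }

    /* This loop runs on the coprocessor; b and c are copied over
       PCIe on entry and a is copied back when the region ends. */
    #pragma offload target(mic) in(b, c) out(a)
    for (int i = 0; i < N; i++)
        a[i] = 2.0f * b[i] + c[i];

    printf("a[42] = %f\n", a[42]);
    return 0;
}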
Multiple Intel Xeon Phi coprocessors can be installed in a single
host system. Within a single system, the coprocessors can com-
municate with each other through the PCIe peer-to-peer inter-
connect without any intervention from the host. Similarly, the
coprocessors can also communicate through a network card
such as InfiniBand or Ethernet, without any intervention from
the host.
Intel’s initial development cluster named “Endeavor”, which is
composed of 140 nodes of prototype Intel Xeon Phi coproces-
sors, was ranked 150th in the TOP500 list of the world's supercomputers
based on its Linpack scores. Based on its power consumption,
this cluster was as good if not better than other heterogeneous
systems in the TOP500.

Figure: Linpack performance and power of Intel's cluster
These results on unoptimized prototype systems demonstrate
that high levels of performance efficiency can be achieved on
compute-dense workloads without the need for a new program-
ming language or APIs.
Microarchitecture

The Intel Xeon Phi coprocessor is primarily composed of process-
ing cores, caches, memory controllers, PCIe client logic, and a
very high bandwidth, bidirectional ring interconnect. Each core
comes complete with a private L2 cache that is kept fully coher-
ent by a globally distributed tag directory. The memory controllers
and the PCIe client logic provide a direct interface to the GDDR5
memory on the coprocessor and the PCIe bus, respectively. All
these components are connected together by the ring intercon-
nect.
Each core in the Intel Xeon Phi coprocessor is designed to be
power efficient while providing a high throughput for highly
parallel workloads. A closer look reveals that the core uses a
short in-order pipeline and is capable of supporting 4 threads in
hardware. It is estimated that the cost to support IA architecture
legacy is a mere 2% of the area costs of the core and is even less
at a full chip or product level. Thus the cost of bringing the Intel
Architecture legacy capability to the market is very marginal.

Figure: Intel Xeon Phi coprocessor core
Vector Processing Unit

An important component of the Intel Xeon Phi coprocessor's core is its vector processing unit (VPU). The VPU features a novel 512-bit SIMD instruction set, officially known as Intel Initial Many Core Instructions (Intel IMCI). Thus, the VPU can execute 16 single-precision (SP) or 8 double-precision (DP) operations per cycle: a 512-bit vector holds 512/32 = 16 SP or 512/64 = 8 DP lanes. The VPU also supports Fused Multiply-Add (FMA) instructions, and since an FMA (a×b+c) counts as two floating point operations per lane, it can execute 32 SP or 16 DP floating point operations per cycle. It also provides support for integers.
Vector units are very power efficient for HPC workloads. A single
operation can encode a great deal of work and does not incur en-
ergy costs associated with fetching, decoding, and retiring many
instructions. However, several improvements were required to
support such wide SIMD instructions. For example, a mask regis-
ter was added to the VPU to allow per lane predicated execution.
This helped in vectorizing short conditional branches, thereby
improving the overall software pipelining efficiency. The VPU
also supports gather and scatter instructions, which are simply
non-unit stride vector memory accesses, directly in hardware.
Thus for codes with sporadic or irregular access patterns, vector
scatter and gather instructions help in keeping the code vector-
ized.

Figure: Vector processing unit
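For intuition, a loop of the kind that benefits is sketched below; the names are illustrative, and the gather itself is generated by the compiler rather than written by hand:

/* Indirect accesses like x[idx[i]] are non-unit-stride loads; with
 * hardware gather support the compiler can keep this loop vectorized
 * instead of falling back to scalar code. */
void sparse_axpy(int n, const int *idx, const float *x,
                 const float *a, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a[i] * x[idx[i]];
}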
The VPU also features an Extended Math Unit (EMU) that can ex-
ecute transcendental operations such as reciprocal, square root,
and log, thereby allowing these operations to be executed in a
vector fashion with high bandwidth. The EMU operates by calcu-
lating polynomial approximations of these functions.
The Interconnect

The interconnect is implemented as a bidirectional ring. Each
direction is comprised of three independent rings. The first,
largest, and most expensive of these is the data block ring. The
data block ring is 64 bytes wide to support the high bandwidth
requirement due to the large number of cores. The address ring
is much smaller and is used to send read/write commands and
memory addresses. Finally, the smallest ring and the least expen-
sive ring is the acknowledgement ring, which sends flow control
and coherence messages.
When a core accesses its L2 cache and misses, an address re-
quest is sent on the address ring to the tag directories. The
memory addresses are uniformly distributed amongst the tag
directories on the ring to provide a smooth traffic characteristic
on the ring. If the requested data block is found in another core’s
L2 cache, a forwarding request is sent to that core’s L2 over the
address ring and the request block is subsequently forwarded
on the data block ring. If the requested data is not found in any
caches, a memory address is sent from the tag directory to the
memory controller.
The figure below shows the distribution of the memory con-
trollers on the bidirectional ring. The memory controllers are
symmetrically interleaved around the ring. There is an all-to-all
mapping from the tag directories to the memory controllers. The
addresses are evenly distributed across the memory controllers,
thereby eliminating hotspots and providing a uniform access
pattern which is essential for a good bandwidth response.
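The effect of this interleaving can be pictured with a simple address-hash sketch. This is purely illustrative – the chip's real mapping function is not described here, and the controller count is an assumption:

#include <stdint.h>

#define N_MEM_CTRLS 8   /* assumed count, for illustration only */

/* Spreading consecutive cache lines evenly across controllers
 * avoids hotspots: neighboring lines go to different controllers. */
static unsigned ctrl_for(uint64_t phys_addr)
{
    uint64_t line = phys_addr >> 6;          /* 64-byte cache lines */
    return (unsigned)(line % N_MEM_CTRLS);
}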
During a memory access, whenever an L2 cache miss occurs on a
core, the core generates an address request on the address ring
and queries the tag directories. If the data is not found in the
tag directories, the core generates another address request and
queries the memory for the data. Once the memory controller
fetches the data block from memory, it is returned to the
core over the data ring. Thus during this process, one data block,
two address requests (and by protocol, two acknowledgement
messages) are transmitted over the rings. Since the data block
rings are the most expensive and are designed to support the re-
quired data bandwidth, we need to increase the number of less
expensive address and acknowledgement rings by a factor of
two to match the increased bandwidth requirement caused by
the higher number of requests on these rings.

Figures: The interconnect; distributed tag directories
Figures: Interleaved memory access; interconnect with 2x AD/AK rings
Multi-Threaded Streams Triad

The figure below shows the core scaling results for the multi-
threaded streams triad workload. These results were generated
on a simulator for a prototype of the Intel Xeon Phi coprocessor
with only one address ring and one acknowledgement ring per
direction in its interconnect. The results indicate that in this case
the address and acknowledgement rings would become perfor-
mance bottlenecks and would exhibit poor scalability beyond 32
cores.
The production grade Intel Xeon Phi coprocessor uses two ad-
dress and two acknowledgement rings per direction and pro-
vides a good performance scaling up to 50 cores and beyond. It
is evident from the figure that the addition of the rings results in
an over 40% aggregate bandwidth improvement.
Streaming Stores

Streaming stores were another key innovation employed to further boost memory bandwidth. The pseudo code for the Streams Triad is shown below:

Streams Triad
for (i = 0; i < HUGE; i++)
    A[i] = k * B[i] + C[i];
The Streams Triad kernel reads two arrays, B and C, and writes a single array A to memory. Historically, a core reads a cache line before it writes the addressed data, so there is an additional read overhead associated with the write: per 64-byte cache line of results, four line transfers are needed (read B, read C, read A for ownership, write back A), or 256 bytes. A streaming store instruction allows the cores to write an entire cache line without reading it first, cutting this to three line transfers (read B, read C, write A) and thus reducing the traffic per iteration from 256 bytes to 192 bytes.
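With the Intel compiler, a streaming store can be requested for the triad roughly as sketched below; the pragma is a hint, and whether a streaming store instruction is actually emitted depends on the target and compiler version:

#include <stddef.h>

/* Triad with a non-temporal (streaming) store hint for A, so each
 * cache line of A is written without being read first. */
void triad(size_t n, float *restrict A, const float *restrict B,
           const float *restrict C, float k)
{
    #pragma vector nontemporal (A)
    for (size_t i = 0; i < n; i++)
        A[i] = k * B[i] + C[i];
}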
The figure below shows the core scaling results of stream triads
workload with streaming stores. As is evident from the results,
streaming stores provide a 30% improvement over previous re-
sults. In total, adding two rings per direction and implementing streaming stores improves bandwidth by more than a factor of 2 for the multi-threaded Streams Triad.
                                     Without Streaming Stores   With Streaming Stores
Behavior                             Read A, B, C; write A      Read B, C; write A
Bytes to/from memory per iteration   256                        192
Figures: Multi-threaded triad – saturation for 1 AD/AK ring; multi-threaded triad – benefit of doubling AD/AK
Other Design Features

Other micro-architectural optimizations incorporated into the
Intel Xeon Phi coprocessor include a 64-entry second-level Trans-
lation Lookaside Buffer (TLB), simultaneous data cache loads and
stores, and 512KB L2 caches. Lastly, the Intel Xeon Phi coproces-
sor implements a 16 stream hardware prefetcher to improve
the cache hits and provide higher bandwidth. The figure below
shows the net performance improvements for the SPECfp 2006
benchmark suite for a single core, single thread runs. The results
indicate an average improvement of over 80% per cycle not in-
cluding frequency.
Figures: Multi-threaded triad with streaming stores; per-core single-thread performance improvement (per cycle)
Caches

The Intel MIC architecture invests more heavily in L1 and L2 cach-
es compared to GPU architectures. The Intel Xeon Phi coproces-
sor implements a leading-edge, very high bandwidth memory
subsystem. Each core is equipped with a 32KB L1 instruction
cache and 32KB L1 data cache and a 512KB unified L2 cache.
These caches are fully coherent and implement the x86 memory
order model. The L1 and L2 caches provide an aggregate band-
width that is approximately 15 and 7 times, respectively, faster
compared to the aggregate memory bandwidth. Hence, effective
utilization of the caches is key to achieving peak performance on
the Intel Xeon Phi coprocessor. In addition to improving band-
width, the caches are also more energy efficient for supplying
data to the cores than memory. The figure below shows the en-
ergy consumed per byte of data transferred from the memory,
and L1 and L2 caches. In the exascale compute era, caches will
play a crucial role in achieving real performance under restric-
tive power constraints.
Stencils

Stencils are common in physics simulations and are classic
examples of workloads which show a large performance gain
through efficient use of caches.
Stencils are typically employed in simulation of physical sys-
tems to study the behavior of the system over time. When these
workloads are not programmed to be cache-blocked, they will
be bound by memory bandwidth. Cache blocking promises sub-
stantial performance gains given the increased bandwidth and
energy efficiency of the caches compared to memory. Cache
blocking improves performance by blocking the physical struc-
ture or the physical system such that the blocked data fits well
into a core’s L1 and/or L2 caches. For example, during each time-
step, the same core can process the data which is already resi-
dent in the L2 cache from the last time step, and hence does not
need to be fetched from the memory, thereby improving perfor-
mance. Additionally, the cache coherence further aids the stencil
operation by automatically fetching the updated data from the
nearest neighboring blocks which are resident in the L2 caches
of other cores. Thus, stencils clearly demonstrate the benefits of
efficient cache utilization and coherence in HPC workloads.
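A minimal sketch of spatial cache blocking for a 2D five-point stencil is shown below; the tile sizes TI and TJ are assumptions to be chosen so that a tile fits in a core's L2 cache:

#define TI 64   /* tile height – assumption */
#define TJ 64   /* tile width  – assumption */

/* One time step of a 2D 5-point stencil, processed tile by tile so
 * each tile of 'in' stays resident in the L1/L2 caches while in use. */
void stencil2d(int n, const float in[n][n], float out[n][n])
{
    for (int ii = 1; ii < n - 1; ii += TI)
        for (int jj = 1; jj < n - 1; jj += TJ) {
            int imax = ii + TI < n - 1 ? ii + TI : n - 1;
            int jmax = jj + TJ < n - 1 ? jj + TJ : n - 1;
            for (int i = ii; i < imax; i++)
                for (int j = jj; j < jmax; j++)
                    out[i][j] = 0.25f * (in[i-1][j] + in[i+1][j]
                                       + in[i][j-1] + in[i][j+1]);
        }
}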
Figures: Stencils example; caches – for or against?

Power Management

Intel Xeon Phi coprocessors are not suitable for all workloads.
In some cases, it is beneficial to run the workloads only on the
host. In such situations where the coprocessor is not being used,
it is necessary to put the coprocessor in a power-saving mode.
The figure above shows all the components of the Intel Xeon Phi
coprocessor in a running state. To conserve power, as soon as all
four threads on a core are halted, the clock to the core is gat-
ed. Once the clock has been gated for some programmable time,
the core power gates itself, thereby eliminating any leakage.
At any point, any number of the cores can be powered down or
powered up. Additionally, when all the cores are power gated
and the uncore detects no activity, the tag directories, the inter-
connect, L2 caches and the memory controllers are clock gated.
At this point, the host driver can put the coprocessor into a deep-
er sleep or an idle state, wherein all the uncore is power gated,
the GDDR is put into a self-refresh mode, the PCIe logic is put in
a wait state for a wakeup and the GDDR-IO is consuming very lit-
tle power. These power management techniques help conserve
power and make Intel Xeon Phi coprocessor an excellent candi-
date for data centers.
Figures: Power management states – all on and running; clock-gated core; power-gated core; package auto C3
InfiniBand – High-Speed Interconnects
InfiniBand (IB) is an efficient I/O technology that provides high-
speed data transfers and ultra-low latencies for computing and
storage over a highly reliable and scalable single fabric. The
InfiniBand industry standard ecosystem creates cost effective
hardware and software solutions that easily scale from genera-
tion to generation.
InfiniBand is a high-bandwidth, low-latency network interconnect
solution that has grown tremendous market share in the High
Performance Computing (HPC) cluster community. InfiniBand was
designed to take the place of today’s data center networking tech-
nology. In the late 1990s, a group of next generation I/O architects
formed an open, community driven network technology to provide
scalability and stability based on successes from other network
designs. Today, InfiniBand is a popular and widely used I/O fabric
among customers within the Top500 supercomputers: major Uni-
versities and Labs; Life Sciences; Biomedical; Oil and Gas (Seismic,
Reservoir, Modeling Applications); Computer Aided Design and En-
gineering; Enterprise Oracle; and Financial Applications.
InfiniBand was designed to meet the evolving needs of the
high performance computing market. Computational science
depends on InfiniBand to deliver:
ʎ High Bandwidth: Supports host connectivity of 10Gbps
with Single Data Rate (SDR), 20Gbps with Double Data
Rate (DDR), and 40Gbps with Quad Data Rate (QDR),
all while offering an 80Gbps switch for link switching
ʎ Low Latency: Accelerates the performance of HPC and enter-
prise computing applications by providing ultra-low latencies
ʎ Superior Cluster Scaling: Point-to-point latency re-
mains low as node and core counts scale – 1.2 µs. High-
est real message per adapter: each PCIe x16 adapter
drives 26 million messages per second. Excellent commu-
nications/computation overlap among nodes in a cluster
ʎ High Efficiency: InfiniBand allows reliable protocols like Re-
mote Direct Memory Access (RDMA) communication to occur
between interconnected hosts, thereby increasing efficiency
ʎ Fabric Consolidation and Energy Savings: InfiniBand can
consolidate networking, clustering, and storage data over
a single fabric, which significantly lowers overall power,
real estate, and management overhead in data centers. En-
hanced Quality of Service (QoS) capabilities support run-
ning and managing multiple workloads and traffic classes
ʎ Data Integrity and Reliability: InfiniBand provides the high-
est levels of data integrity by performing Cyclic Redundancy
Checks (CRCs) at each fabric hop and end-to-end across the
fabric to avoid data corruption. To meet the needs of mission
critical applications and high levels of availability, InfiniBand
provides fully redundant and lossless I/O fabrics with auto-
matic failover path and link layer multi-paths
Intel is a global leader and technology innovator in high performance networking, including adapters, switches and ASICs.
Components of the InfiniBand Fabric
InfiniBand is a point-to-point, switched I/O fabric architecture.
Point-to-point means that each communication link extends be-
tween only two devices. Both devices at each end of a link have full
and exclusive access to the communication path. To go beyond a
point and traverse the network, switches come into play. By adding
switches, multiple points can be interconnected to create a fabric.
As more switches are added to a network, aggregated bandwidth
of the fabric increases. By adding multiple paths between devices,
switches also provide a greater level of redundancy.
The InfiniBand fabric has four primary components, which are
explained in the following sections:
ʎ Host Channel Adapter
ʎ Target Channel Adapter
ʎ Switch
ʎ Subnet Manager
Host Channel Adapter
This adapter is an interface that resides within a server and
communicates directly with the server’s memory and proces-
sor as well as the InfiniBand Architecture (IBA) fabric. The adapt-
er guarantees delivery of data, performs advanced memory ac-
cess, and can recover from transmission errors. Host channel
adapters can communicate with a target channel adapter or a
switch. A host channel adapter can be a standalone InfiniBand
card or it can be integrated on a system motherboard. Intel Tru-
eScale InfiniBand host channel adapters outperform the com-
petition with the industry’s highest message rate. Combined
with the lowest MPI latency and highest effective bandwidth,
Intel host channel adapters enable MPI and TCP applications
to scale to thousands of nodes with unprecedented price per-
formance.
Target Channel Adapter
This adapter enables I/O devices, such as disk or tape storage, to
be located within the network independent of a host computer.
Target channel adapters include an I/O controller that is specific
to its particular device’s protocol (for example, SCSI, Fibre Chan-
nel (FC), or Ethernet). Target channel adapters can communicate
with a host channel adapter or a switch.
Switch
An InfiniBand switch allows many host channel adapters and
target channel adapters to connect to it and handles network
traffic. The switch looks at the “local route header” on each
packet of data that passes through it and forwards it to the
appropriate location. The switch is a critical component of
the InfiniBand implementation that offers higher availability,
higher aggregate bandwidth, load balancing, data mirroring,
and much more. A group of switches is referred to as a Fabric. If
a host computer is down, the switch still continues to operate.
The switch also frees up servers and other devices by handling
network traffic.
Figure 1: Typical InfiniBand high performance cluster – host channel adapters, an InfiniBand switch with subnet manager, and target channel adapters
Top 10 Reasons to Use Intel TrueScale InfiniBand
1. Predictable Low Latency Under Load – Less Than 1.0 μs.
TrueScale is designed to make the most of multi-core nodes
by providing ultra-low latency and significant message rate
scalability. As additional compute resources are added to
the Intel TrueScale InfiniBand solution, latency and message
rates scale linearly. HPC applications can be scaled without
having to worry about diminished utilization of compute re-
sources.
2. Quad Data Rate (QDR) Performance. The Intel 12000 switch
family runs at lane speeds of 10Gbps, providing a full bi-
sectional bandwidth of 40Gbps (QDR). In addition, the 12000
switch has the unique capability of riding through periods
of congestion with features such as deterministically low la-
tency. The Intel family of TrueScale 12000 products offers the
lowest latency of any IB switch and high performance trans-
fers with the industry’s most robust signal integrity.
3. Flexible QoS Maximizes Bandwidth Use. The Intel 12800
advanced design is based on an architecture that provides
comprehensive virtual fabric partitioning capabilities that
enable the IB fabric to support the evolving requirements of
an organization.
4. Unmatched Scalability – 18 to 864 Ports per Switch. Intel of-
fers the broadest portfolio (five chassis and two edge switch-
es) from 18 to 864 TrueScale InfiniBand ports, allowing cus-
tomers to buy switches that match their connectivity, space,
and power requirements.
“Effective fabric management has become the most important factor in maximizing performance in an HPC cluster. With IFS 7.0, Intel has addressed all of the major fabric management issues in a product that in many ways goes beyond what others are offering.”
Michael Wirth, HPC Presales Specialist
5. Highly Reliable and Available. Reliability, Availability, and Service-
ability (RAS) that is proven in the most demanding Top500
and Enterprise environments is designed into Intel’s 12000
series with hot swappable components, redundant com-
ponents, customer replaceable units, and non-disruptive
code load.
6. Lowest Per-Port Power and Cooling Requirements. The TrueScale
12000 offers the lowest power consumption and the highest
port density – 864 total TrueScale InfiniBand ports in a single
chassis makes it unmatched in the industry. This results in
delivering the lowest power per port for a director switch (7.8
watts per port) and the lowest power per port for an edge
switch (3.3 watts per port).
7. Easy to Install and Manage. Intel installation, configuration,
and monitoring Wizards reduce time-to-ready. The Intel In-
finiBand Fabric Suite (IFS) assists in diagnosing problems in
the fabric. Non-disruptive firmware upgrades provide maxi-
mum availability and operational simplicity.
8. Protects Existing InfiniBand Investments. Seamless vir-
tual I/O integration at the Operating System (OS) and appli-
cation levels that match standard network interface cards
and adapter semantics with no OS or app changes – they
just work. Additionally, the TrueScale family of products is
compliant with the InfiniBand Trade Association (IBTA) open
specification, so Intel products inter-operate with any IBTA-
compliant InfiniBand vendor. Being IBTA-compliant makes
the Intel 12000 family of switches ideal for network consoli-
dation, sharing and scaling I/O pools across servers, and pooling and sharing I/O resources between servers.
9. Modular Configuration Flexibility. The Intel 12000 series
switches offer configuration and scalability flexibility that
meets the requirements of either a high-density or high-
performance compute grid by offering port modules that
address both needs. Units can be populated with Ultra High
Density (UHD) leafs for maximum connectivity or Ultra High
Performance (UHP) leafs for maximum performance. The
high scalability 24-port leaf modules support configurations
between 18 and 864 ports, providing the right size to start
and the capability to grow as your grid grows.
10. Option to Gateway to Ethernet and Fibre Channel Net-
works. Intel offers multiple options to enable hosts on In-
finiBand fabrics to transparently access Fibre Channel based
storage area networks (SANs) or Ethernet based local area
networks (LANs).
Intel TrueScale architecture and the resulting family of products
deliver the promise of InfiniBand to the enterprise today.
Intel MPI Library 4.0 Performance

Introduction

Intel’s latest MPI release, Intel MPI 4.0, is now optimized to work
with Intel’s TrueScale InfiniBand adapter. Intel MPI 4.0 can now
directly call Intel’s TrueScale Performance Scale Messaging (PSM)
interface. The PSM interface is designed to optimize MPI applica-
tion performance. This means that organizations will be able to
achieve a significant performance boost with a combination of
Intel MPI 4.0 and Intel TrueScale InfiniBand.
Solution

Intel tuned and optimized its latest MPI release – Intel MPI Library 4.0 – to improve performance
when used with Intel TrueScale InfiniBand. With MPI Library 4.0,
applications can make full use of High Performance Computing
(HPC) hardware, improving the overall performance of the appli-
cations on the clusters.
Intel MPI Library 4.0 Performance

Intel MPI Library 4.0 uses the high performance MPI-2 specifica-
tion on multiple fabrics, which results in better performance for
applications on Intel architecture-based clusters. This library en-
ables quick delivery of maximum end user performance, even if
there are changes or upgrades to new interconnects – without requiring major changes to the software or operating environment. This high-performance message-passing interface library lets developers build applications that can run on multiple cluster fabric interconnects chosen by the user at runtime.

Figure: Elapsed time by core count (16, 32, 64) for Intel MPI Library 3.1 versus 4.0 on Intel TrueScale InfiniBand – a 35 percent improvement with version 4.0
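The point about interconnect independence holds for any MPI program: a minimal ping-pong like the sketch below – generic MPI, not Intel sample code – compiles and runs unchanged whether the runtime selects TrueScale's PSM interface or another fabric:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    char buf[8] = "ping";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {            /* send, then wait for the echo */
        MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 got: %s\n", buf);
    } else if (rank == 1) {     /* echo the message back */
        MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}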
Testing
Intel used the Parallel Unstructured Maritime Aerodynamics
(PUMA) Benchmark program to test the performance of Intel MPI
Library versions 4.0 and 3.1 with Intel TrueScale InfiniBand. The
program analyses internal and external non-reacting compress-
ible flows over arbitrarily complex 3D geometries. PUMA is writ-
ten in ANSI C, and uses MPI libraries for message passing.
Results
The test showed that MPI performance can improve by more
than 35 percent using Intel MPI Library 4.0 with Intel TrueScale
InfiniBand, compared to using Intel MPI Library 3.1.
Intel TrueScale InfiniBand Adapters

Intel TrueScale InfiniBand Adapters offer scalable performance,
reliability, low power, and superior application performance.
These adapters ensure superior performance of HPC applica-
tions by delivering the highest message rate for multicore
compute nodes, the lowest scalable latency for large node count
clusters, the highest overall bandwidth on PCI Express Gen1 plat-
forms, and superior power efficiency.
Purpose: Compare the performance of Intel MPI Library 4.0 and 3.1 with Intel InfiniBand
Benchmark: PUMA Flow
Cluster: Intel/IBM iDataPlex™ cluster
System Configuration: The NETtrack IBM Q-Blue Cluster/iDataPlex nodes were configured as follows:

Processor: Intel Xeon® CPU X5570 @ 2.93 GHz
Memory: 24GB (6x4GB) @ 1333MHz (DDR3)
QDR InfiniBand Switch: Intel Model 12300, firmware version 6.0.0.0.33
QDR InfiniBand Host Channel Adapter: Intel QLE7340, software stack 5.1.0.0.49
Operating System: Red Hat® Enterprise Linux® Server release 5.3
Kernel: 2.6.18-128.el5
File System: IFS mounted
The Intel TrueScale 12000 family of Multi-Protocol Fabric Directors
is the most highly integrated cluster computing interconnect so-
lution available. An ideal solution for HPC, database clustering, and
grid utility computing applications, the 12000 Fabric Directors
maximize cluster and grid computing interconnect performance
while simplifying and reducing the cost of operating a data center.
Subnet Manager
The subnet manager is an application responsible for config-
uring the local subnet and ensuring its continued operation.
Configuration responsibilities include managing switch setup
and reconfiguring the subnet if a link goes down or a new one
is added.
How High Performance Computing Helps Vertical Applications

Enterprises that want to do high performance computing must balance the following scalability metrics as the cluster size and the number of cores per node increase:
ʎ Latency and Message Rate Scalability: must allow near line-
ar growth in productivity as the number of compute cores is
scaled.
ʎ Power and Cooling Efficiency: as the cluster is scaled, power
and cooling requirements must not become major concerns
in today’s world of energy shortages and high-cost energy.
InfiniBand offers the promise of low latency, high bandwidth,
and unmatched scalability demanded by high performance
computing applications. IB adapters and switches that perform
well on these key metrics allow enterprises to meet their high-
performance and MPI needs with optimal efficiency. The IB so-
lution allows enterprises to quickly achieve their compute and
business goals.
Vertical Market | Application Segment | InfiniBand Value
Oil & Gas | Mix of Independent Service Provider (ISP) and home-grown codes: reservoir modeling | Low latency, high bandwidth
Computer Aided Engineering (CAE) | Mostly Independent Software Vendor (ISV) codes: crash, air flow, and fluid flow simulations | High message rate, low latency, scalability
Government | Home-grown codes: labs, defense, weather, and a wide range of apps | High message rate, low latency, scalability, high bandwidth
Education | Home-grown and open source codes: a wide range of apps | High message rate, low latency, scalability, high bandwidth
Financial | Mix of ISP and home-grown codes: market simulation and trading floor | High performance IP, scalability, high bandwidth
Life and Materials Science | Mostly ISV codes: molecular simulation, computational chemistry, and biology apps | Low latency, high message rates
Intel Fabric Suite 7
Maximizing Investments in High Performance Computing
Around the world and across all industries, high performance
computing (HPC) is used to solve today’s most demanding com-
puting problems. As today’s high performance computing chal-
lenges grow in complexity and importance, it is vital that the
software tools used to install, configure, optimize, and manage
HPC fabrics also grow more powerful. Today’s HPC workloads are
too large and complex to be managed using software tools that
are not focused on the unique needs of HPC.
Designed specifically for HPC, Intel Fabric Suite 7 is a complete
fabric management solution that maximizes the return on HPC
investments, by allowing users to achieve the highest levels
of performance, efficiency, and ease of management from In-
finiBand-connected HPC clusters of any size.
Highlights
ʎ Intel FastFabric and Fabric Viewer integration with leading
third-party HPC cluster management suites
ʎ Simple but powerful Fabric Viewer dashboard for monitoring
fabric performance
ʎ Intel Fabric Manager integration with leading HPC workload-
management suites that combine virtual fabrics and compute
ʎ Quality of Service (QoS) levels that maximize fabric efficiency
and application performance
ʎ Smart, powerful software tools that make Intel TrueScale
Fabric solutions easy to install, configure, verify, optimize,
and manage
ʎ Congestion control architecture that reduces the effects of
fabric congestion caused by low credit availability that can
result in head-of-line blocking
ʎ Powerful fabric routing methods—including adaptive and
dispersive routing—that optimize traffic flows to avoid or
eliminate congestion, maximizing fabric throughput
ʎ Intel Fabric Manager’s advanced design ensures fabrics of all sizes and topologies – from fat-tree to mesh and torus – scale to support the most demanding HPC workloads
Superior Fabric Performance and Simplified Management are
Vital for HPC
As HPC clusters scale to take advantage of multi-core and GPU-
accelerated nodes attached to ever-larger and more complex
fabrics, simple but powerful management tools are vital for
maximizing return on HPC investments.
Intel Fabric Suite 7 provides the performance and management
tools for today’s demanding HPC cluster environments. As clus-
ters grow larger, management functions, from installation and
configuration to fabric verification and optimization, are vital
in ensuring that the interconnect fabric can support growing
workloads. Besides fabric deployment and monitoring, IFS opti-
mizes the performance of message passing applications—from
advanced routing algorithms to quality of service (QoS) – features that
ensure all HPC resources are optimally utilized.
Scalable Fabric Performance
ʎ Purpose-built for HPC, IFS is designed to make HPC clusters
faster, easier, and simpler to deploy, manage, and optimize
ʎ The Intel TrueScale Fabric on-load host architecture exploits
the processing power of today’s faster multi-core processors
for superior application scaling and performance
ʎ Policy-driven vFabric virtual fabrics optimize HPC resource
utilization by prioritizing and isolating compute and storage
traffic flows
ʎ Advanced fabric routing options—including adaptive and
dispersive routing – distribute traffic across all potential links, improving overall fabric performance and lowering congestion to increase throughput and reduce latency
Scalable Fabric Intelligence
ʎ Routing intelligence scales linearly as Intel TrueScale Fabric
switches are added to the fabric
ʎ Intel Fabric Manager can initialize fabrics having several
thousand nodes within seconds
ʎ Advanced and optimized routing algorithms overcome the
limitations of typical subnet managers
ʎ Smart management tools quickly detect and respond to fab-
ric changes, including isolating and correcting problems that
can result in unstable fabrics
Standards-based Foundation
Intel TrueScale Fabric solutions are compliant with all industry
software and hardware standards. Intel Fabric Suite (IFS) rede-
fines IBTA management by coupling powerful management
tools with intuitive user interfaces. InfiniBand fabrics built using
IFS deliver the highest levels of fabric performance, efficiency,
and management simplicity, allowing users to realize the full
benefits from their HPC investments.
Major components of Intel Fabric Suite 7 include:
ʎ FastFabric Toolset
ʎ Fabric Viewer
ʎ Fabric Manager
Intel FastFabric Toolset
Ensures rapid, error-free installation and configuration of Intel
True Scale Fabric switch, host, and management software tools.
Guided by an intuitive interface, users can easily install, config-
ure, validate, and optimize HPC fabrics.
Key features include:
ʎ Automated host software installation and configuration
ʎ Powerful fabric deployment, verification, analysis, and report-
ing tools for measuring connectivity, latency, and bandwidth
ʎ Automated chassis and switch firmware installation and up-
date
ʎ Fabric route and error analysis tools
ʎ Benchmarking and tuning tools
ʎ Easy-to-use fabric health checking tools
Intel Fabric Viewer
A key component that provides an intuitive, Java-based user
interface with a “topology-down” view for fabric status and di-
agnostics, with the ability to drill down to the device layer to
identify and help correct errors. IFS 7 includes fabric dashboard,
a simple and intuitive user interface that presents vital fabric
performance statistics.
Key features include:
ʎ Bandwidth and performance monitoring
ʎ Device and device group level displays
ʎ Definition of cluster-specific node groups so that displays
can be oriented toward end-user node types
ʎ Support for user-defined hierarchy
ʎ Easy-to-use virtual fabrics interface
Intel Fabric Manager
Provides comprehensive control of administrative functions us-
ing a commercial-grade subnet manager. With advanced routing
algorithms, powerful diagnostic tools, and full subnet manager
failover, Fabric Manager simplifies subnet, fabric, and individual
component management, making even the largest fabrics easy
to deploy and optimize.
Key features include:
ʎ Designed and optimized for large fabric support
ʎ Integrated with both adaptive and dispersive routing systems
ʎ Congestion control architecture (CCA)
ʎ Robust failover of subnet management
ʎ Path/route management
ʎ Fabric/chassis management
ʎ Fabric initialization in seconds, even for very large fabrics
ʎ Performance and fabric error monitoring
Adaptive Routing
While most HPC fabrics are configured to have multiple paths
between switches, standard InfiniBand switches may not be ca-
pable of taking advantage of them to reduce congestion. Adap-
tive routing monitors the performance of each possible path,
and automatically chooses the least congested route to the
destination node. Unlike other implementations that rely on a
purely subnet manager-based approach, the intelligent path se-
lection capabilities within Fabric Manager, a key part of IFS and
Intel 12000 series switches, scales as your fabric grows larger
and more complex.
Key adaptive routing features include:
ʎ Highly scalable – adaptive routing intelligence scales as the fabric grows
ʎ Hundreds of real-time adjustments per second per switch
ʎ Topology awareness through Intel Fabric Manager
ʎ Supports all InfiniBand Trade Association* (IBTA*)-compliant
adapters
Dispersive Routing
One of the critical roles of the fabric management is the initial-
ization and configuration of routes through the fabric between
a single pair of nodes. Intel Fabric Manager supports a variety
of routing methods, including defining alternate routes that
disperse traffic flows for redundancy, performance, and load
balancing. Instead of sending all packets to a destination on a
single path, Intel dispersive routing distributes traffic across all
possible paths. Once received, packets are reassembled in their
proper order for rapid, efficient processing. By leveraging the
entire fabric to deliver maximum communications performance
for all jobs, dispersive routing ensures optimal fabric efficiency.
Key dispersive routing features include:
ʎ Fabric routing algorithms that provide the widest separation
of paths possible
ʎ Fabric “hotspot” reductions to avoid fabric congestion
ʎ Balances traffic across all potential routes
ʎ May be combined with vFabrics and adaptive routing
ʎ Very low latency for small messages and time sensitive con-
trol protocols
ʎ Messages can be spread across multiple paths—Intel Perfor-
mance Scaled Messaging (PSM) ensures messages are reas-
sembled correctly
ʎ Supports all leading MPI libraries
Mesh and Torus Topology Support
Fat-tree configurations are the most common topology used
in HPC cluster environments today, but other topologies are
gaining broader use as organizations try to create increasingly
larger, more cost-effective fabrics. Intel is leading the way with
full support of emerging mesh and torus topologies that can
help reduce networking costs as clusters scale to thousands of
nodes. IFS has been enhanced to support these emerging op-
tions—from failure handling that maximizes performance, to
even higher reliability for complex networking environments.
transtec HPC solutions excel through their easy management
and high usability, while maintaining high performance and
quality throughout the whole lifetime of the system. As clusters
scale, issues like congestion mitigation and Quality-of-Service
can make a big difference in whether the fabric performs up to
its full potential.
With the intelligent choice of Intel InfiniBand products, transtec
remains true to combining the best components together to pro-
vide the full best-of-breed solution stack to the customer.
transtec HPC engineering experts are always available to fine-
tune customers’ HPC cluster systems and InfiniBand fabrics to
get the maximum performance while at the same time provide
them with an easy-to-manage and easy-to-use HPC solution.
Parstream – Big Data Analytics

The amount of digital data resources has exploded in the past decade. The 2010 IDC Digital Universe Study shows that, compared to 2008, the digital universe grew by 62 percent to 800,000 petabytes (0.8 zettabytes) in 2009. By the year 2020, IDC predicts the amount of data to be 44 times as big as it was in 2009, reaching approximately 35 zettabytes. As a result of the exploding amount of data, the demand for applications used for searching and analyzing large datasets will grow significantly in all industries.

However, today’s applications require large and expensive infrastructures which consume vast amounts of energy and resources, creating challenges in the analysis of mass data. In the future, increasingly more terabyte-scale datasets will be used for research, analysis and diagnosis, bringing about further difficulties.
What is Big Data?

The amount of generated and stored data is growing exponen-
tially in unprecedented proportions. According to Eric Schmidt,
former CEO of Google, the amount of data that is now produced
in 2 days is as large as the total amount since the beginning of
civilization until 2003.
Large amounts of data are produced in many areas of business
and private life. For example, there are 200 million Twitter mes-
sages generated – daily. Mobile devices, sensor networks, ma-
chine-to-machine communications, RFID readers, software logs,
cameras and microphones generate a continuous stream of data
of multiple terabytes per day.
For organizations Big Data means a great opportunity and a big
challenge. Companies that know how to manage big data will
create more comprehensive predictions for the market and busi-
ness. Decision-makers can quickly and safely respond to the de-
velopments in dynamic and volatile markets.
It is important that companies know how to use Big Data to
improve their competitive positions, increase productivity and
develop new business models. According to Roger Magoulas of
O’Reilly the ability to analyze big data has developed into a core
competence in the information age and can provide an enormous competitive advantage for companies.

Figure: Big data landscape – lag time (real time, milliseconds, up to more than 10 minutes) versus data volume (gigabytes to petabytes), positioning OLTP, reporting, in-memory databases, complex event processing, interactive analytics on operational data volumes, and batch analytics (MapReduce)
Big Data analytics has a huge impact on traditional data process-
ing and causes business and IT managers to face new problems.
The latest IT technologies and processes are poorly suited to
analyze very large amounts of data, according to a recent study
by Gartner.
IT departments often try to meet the increasing flood of infor-
mation by traditional means, using the same infrastructure de-
velopment and just purchasing more hardware. As data volumes
grow faster than the processing capabilities of additional
servers or better processors, new technological approaches are
needed.
Shortcomings of current database architectures

Current databases are not engineered for mass data, but rather
for small data volumes up to 100 million records. Today’s data-
bases use outdated, 20-30 year old architectures and the data
and index structures are not constructed for efficient analysis
of such data volumes. And, because these databases employ se-
quential algorithms, they are not able to exploit the potential of
parallel hardware.
Algorithmic procedures for indexing large amounts of data
have seen relatively few innovations in the past years and
decades. Due to the ever-growing amounts of data to be pro-
cessed, there are rising challenges that traditional database
systems cannot cope with. Currently, new and innovative ap-
proaches to solve these problems are being developed and evaluat-
ed. However, some of these approaches seem to be heading in
the wrong direction.
Parstream – Big Data Analytics Platform

ParStream offers a revolutionary approach to high-performance
data analysis. It addresses the problems arising from rapidly in-
creasing data volumes in modern business, as well as scientific
application scenarios.
ParStream
ʎ has a unique index technology
ʎ comes with efficient parallel processing
ʎ is ultra-fast, even with billions of records
ʎ scales linearly up to petabytes
ʎ offers real-time analysis and continuous import
ʎ is cost and energy efficient
ParStream uses a unique indexing technology which enables
efficient multi-threaded processing on parallel architectures.
Hardware and energy costs are substantially reduced while
overall performance is optimized to index and execute queries.
Close-to-real-time analysis is obtained through simultaneous
importing, indexing and querying of data. The developers of Par-
Stream strive to continuously improve on the product by work-
ing closely with universities, research partners, and clients.
ParStream is the right choice when...
ʎ extreme amounts of data are to be searched and filtered
ʎ filters use many columns in various combinations (ad-hoc
queries)
ʎ complex queries are performed frequently
ʎ datasets are continuously growing
ʎ close-to-real-time analytical results are expected
ʎ infrastructure and operating costs need to be optimized
Figure: ParStream architecture – SQL API / JDBC / ODBC on top of a real-time analytics engine with a High Performance Compressed Index (HPCI), fast hybrid storage (columnar/row), a high-speed low-latency loader, a C++ UDF API, multi-dimensional partitioning, a shared-nothing architecture, in-memory and disk technology, and massively parallel processing (MPP)
Technology

The technical architecture of the ParStream database can be
roughly divided into the following areas: During the extract,
transform and load (ETL) process, the supplied data is read and
processed so that it can be forwarded to the actual loader.
In the next step, ParStream generates and stores necessary index
structures and data representations. Input data can be stored in
row and/or column-oriented data stores. Here configurable opti-
mizations regarding sorting, compression and partitioning are ac-
counted for.
Once data is loaded into the server, the user can pose queries
over a standard interface (SQL) to the database engine. A parser
then interprets the queries and a query executor executes them in
parallel.
The optimizer is used to generate the initial query plan starting
from the declarative request. This initial execution plan is used
to start processing. However, this initial plan can be altered dy-
namically during runtime, e.g., changing the level of parallelism.
Figure: ParStream data flow – data sources feed ETL processes and a high-speed import and index loader; the loader writes row-oriented and column-oriented record stores and an index store, with compression, partitioning, caching and parallelization across multiple servers, CPUs and GPUs; the query logic delivers results
155
The query executor also makes optimal use of the available infra-
structure.
All available parts of the query exploit bitmap information where
possible. Logical filter conditions and aggregations can thus be
calculated and forwarded to the client by highly efficient bitmap
operations.
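The principle can be sketched generically; the snippet below shows the textbook bitmap-index technique, not ParStream's actual data structures. Each predicate yields one bit per row, a word-wise AND combines filters 64 rows at a time, and a population count yields the result count:

#include <stddef.h>
#include <stdint.h>

/* Count rows matching predicate A AND predicate B, given one bit
 * per row in each bitmap. __builtin_popcountll is a GCC/Clang builtin. */
size_t count_matches(const uint64_t *bm_a, const uint64_t *bm_b,
                     size_t nwords)
{
    size_t count = 0;
    for (size_t w = 0; w < nwords; w++)
        count += (size_t)__builtin_popcountll(bm_a[w] & bm_b[w]);
    return count;
}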
Index Structure
ParStream’s innovative index structure enables efficient parallel
processing, allowing for unsurpassed levels of performance. Key
features of the underlying database engine include:
ʎ A column-oriented bitmap index using a unique data struc-
ture.
ʎ A data structure that allows processing in compressed form.
ʎ No need for index decompression as in other database and
indexing systems.
Use of ParStream and its highly efficient query engine yields sig-
nificant advantages over regular database systems. ParStream
offers:
ʎ faster index operations,
ʎ shorter response times,
ʎ substantially less CPU-load,
ʎ efficient parallel processing of search and analysis,
ʎ capability of adding new data to the index during query ex-
ecution, close-to-real-time importing and querying of data.
Optimized Use of Resources
Inside one single processing node, as well as inside single data
partitions, query processing can be parallelized to achieve mini-
mal response time, by using all available resources (CPU, I/O-
Channels).
Reliability
The reliability of a ParStream database is guaranteed through
several product features, and ParStream fully supports multiple-server en-
vironments. The ParStream database allows the usage of an arbi-
trary number of servers. Each server can be configured to repli-
cate the entire data store or only parts of it.
Several load-balancing algorithms will pass a complete query
to one of the servers or parts of one query to several, even re-
dundant servers. The query can be sent to any of the cluster
members, allowing customers to use a load-balancing and fail-
over configuration according to their own need, e.g. round robin
query distribution.
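Round-robin distribution itself is simple; a generic sketch of such a dispatcher – not ParStream's client code, and with an assumed cluster size – could look like this:

#include <stdatomic.h>

#define N_SERVERS 4          /* assumed cluster size */
static atomic_uint next_server;

/* Each query is dispatched to the next server in turn. */
unsigned pick_server(void)
{
    return atomic_fetch_add(&next_server, 1) % N_SERVERS;
}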
Interfaces
The ParStream server can be queried using any of the following
three methods:
At a high level, we provide a JDBC driver which enables the imple-
mentation of a cross platform front end. This allows ParStream
to be used with any application capable of using a standard
JDBC interface. For example, a Java applet can be built. This ap-
plet can be used to query the ParStream server from within any
web browser.
At mid-level, queries can be submitted as SQL code. Additionally,
other descriptive query implementations are available, e.g., a
JSON format. This allows the user to define queries, which can-
not easily be expressed in standard SQL. The results are then
sent back to the client as CSV text or in binary format.
At a low level, ParStream’s C++ API and base classes can be used
to write user defined query nodes that are stored in dynamic li-
braries. A developer can thus integrate his own query nodes into
a tree description and register it dynamically into the ParStream
server. Such user-defined queries can be executed via a TCP/IP
connection and are also integrated into ParStream’s parallel ex-
ecution framework. This interface layer allows the formulation
of queries that cannot be expressed using SQL.
Data Import
One of ParStream’s strengths is its ability to import CSV files at
unprecedented speeds. This is based on two factors:
First of all, the index is much faster in adding data than indexes
used in most other databases. Secondly, the importer partitions
and sorts the data in parallel, which exploits the capabilities of
today’s multi-core processors.
Additionally, the import process may run outside the query process, enabling the user to ship the finished data and index files to the servers. In this way, the import's CPU and I/O load can be separated and moved to different machines.
[Figure: ParStream real-time Big Data analytics – parallel CSV & binary import with ETL support, JDBC/ODBC (SQL 2003) and socket/C++ API front ends, SOA compatibility; application tools access ParStream alongside Map-Reduce, RDBMS and raw-data sources]
“As Big Data Analytics is becoming more and more important, new and innovative technology arises, and we are happy that with ParStream, our customers may experience a performance boost with respect to their analytics work by a factor of up to 1,000.”
Matthias Groß, HPC Sales Specialist
Another remarkable feature of ParStream is its ability to optionally operate on a CSV record store instead of column stores. This accelerates the import because only the indexes need to be written. Moreover, since one usually wants to keep the original CSV files anyway, no additional disk space is wasted.
Supported Platforms
Currently ParStream is available on a number of Linux distributions, including Red Hat Enterprise Linux, Novell Enterprise Linux and Debian Lenny, running on x86_64 CPUs. On request, ParStream will be ported to other platforms.
Besides the detailed simulation of realistic processes, data analytics is one of the classic applications of High Performance Computing. What is new is that, with the amount of data increasing to the point where we now speak of Big Data Analytics, and the required time-to-result getting shorter and shorter, specialized solutions arise that apply the scale-out principle of HPC to areas that up to now have not been the focus of High Performance Computing.
ParStream as a clustered database – optimized for highest read-only parallelized data throughput and next-to-real-time query results – is such a specialized solution.
transtec ensures that this latest innovative technology is implemented and run on the most reliable systems available, sized exactly according to the customer's specific requirements. ParStream as the software technology core, combined with transtec HPC systems and transtec services, provides for the most reliable and comprehensive Big Data Analytics solutions available.
ParStream is the solution of choice whenever:
ʎ complex queries are performed frequently
ʎ datasets are continuously growing
ʎ close-to-real-time analytical results are expected
ʎ infrastructure and operating costs need to be optimized

Key business scenarios
Big Data is a game changer in all industries and the public sector.
ParStream has identified many applications across all sectors
that yield high returns for the customer. Examples are ad-spend-
ing analytics and social media monitoring in eCommerce, fraud
detection and algorithmic trading in Finance/Capital markets,
network quality monitoring and customer targeting in Telcos,
smart metering and smart grids in Utilities, and many more.
ʎ Search and Selection – ParStream enables new levels of search and online shopping satisfaction. By providing faceted search in combination with continuous real-time analysis of customers' preferences and behavior, ParStream can guide customers very effectively to products and services of interest to them, leading to increased conversion rates and platform revenue.
ʎ Real-Time Analytics – ParStream drives interactive analytics processes to gain insights faster. Identifying valuable insights from Big Data is ideally an interactive process in which hypotheses are identified and validated. ParStream delivers results instantaneously on up-to-date Big Data, even if many power users or automated agents are querying the data at the same time. ParStream enables faster organizational learning for process and product optimization.
ʎ Online Processing – ParStream responds automatically to large data streams. For online advertising, re-targeting, trend spotting, fraud detection and many more cases, ParStream can automatically process new data, compare it to large historic data sets, and react quickly. Fast-growing data volumes and increasing algorithmic complexity create a market for faster solutions.
[Figure: ParStream – many applications across all industries: eCommerce/services (faceted search, web analytics, SEO analytics, online advertising), social networks (ad serving, profiling, targeting), telco (customer attrition prevention, network monitoring, targeting, prepaid account management), finance (trend analysis, fraud detection, automatic trading, risk analysis), energy/oil and gas (smart metering, smart grids, wind parks, mining, solar panels), and many more (production, mining, M2M, sensors, genetics, intelligence, weather)]
Offering
ParStream delivers a unique combination of real-time analytics, low-latency continuous import and high throughput for Big Data. ParStream features analysis of billions of records with thousands of columns in sub-second time, with very small computing infrastructure requirements. ParStream does NOT use cubes to store data, which enables ParStream to analyze data in its full granularity, flexibly in all dimensions, and to import new data on the fly.
ParStream delivers results typically 1,000 times faster than Post-
greSQL or MySQL. Compared to column-based databases, Par-
Stream delivers superior performance; this performance differ-
ential with traditional technology increases exponentially with
increasing data volume and query complexity.
ParStream has unique intellectual property in compression and indexing. With its High-Performance-Compressed-Index (HPCI) technology, data indices can be analyzed in compressed form, removing the need for the extra step of decompression. This technological advantage results in major performance advantages compared to other solutions, including:
ʎ substantially reduced CPU load (no decompression is needed, and index operations are more efficient than full data scans),
ʎ substantially reduced RAM requirements (no decompressed index needs to be held in memory),
ʎ much faster query responses, because HPCI processing can be parallelized, eliminating the delays that result from decompression.
Compared to state of the art data warehouse solutions, Par-
Stream delivers results much faster, with more flexibility and
granularity, on more up-to-date data, with lower infrastructure
requirements and at lower TCO. ParStream uniquely addresses
Big Data scenarios with its ability to continuously import data
and make it available to analytics in real-time. This allows for
analytics, filtering and conditioning of “live” data streams, for
example from social network or financial data feeds. ParStream
licensing is based on data volume. The product can be deployed
in four ways that also can be combined:
ʎ As a traditional on-premise software deployment managed by the customer (including cloud deployments),
ʎ As an Appliance managed by the customer,
ʎ As a Cloud Service provided by a ParStream Service partner,
and
ʎ As a Cloud Service provided by ParStream.
These options provide flexibility and ease adoption, as customers can use ParStream on their preferred infrastructure or cloud, regardless of the number of servers, cores, etc. Although private and public cloud infrastructures are seen as the future model for many applications, dedicated cluster infrastructures are frequently the preferred option for real-time Big Data scenarios.
© 2012 by Parstream GmbH
[Figure: Standard DW architecture vs. ParStream architecture – nightly batch import, frequent full table scans, long query runtimes and data that is at least one day old, versus parallel continuous import assuring timeliness of data and HPCI query execution on compressed indices, with each query using multiple processor cores]
Glossary
ACML (“AMD Core Math Library“)
A software development library released by AMD. This library
provides useful mathematical routines optimized for AMD pro-
cessors. Originally developed in 2002 for use in high-performance
computing (HPC) and scientific computing, ACML allows nearly
optimal use of AMD Opteron processors in compute-intensive
applications. ACML consists of the following main components:
ʎ A full implementation of Level 1, 2 and 3 Basic Linear Algebra
Subprograms (→ BLAS), with optimizations for AMD Opteron
processors.
ʎ A full suite of Linear Algebra (→ LAPACK) routines.
ʎ A comprehensive suite of Fast Fourier transform (FFTs) in sin-
gle-, double-, single-complex and double-complex data types.
ʎ Fast scalar, vector, and array math transcendental library
routines.
ʎ Random Number Generators in both single- and double-pre-
cision.
AMD offers pre-compiled binaries for Linux, Solaris, and Win-
dows available for download. Supported compilers include gfor-
tran, Intel Fortran Compiler, Microsoft Visual Studio, NAG, PathS-
cale, PGI compiler, and Sun Studio.
BLAS (“Basic Linear Algebra Subprograms“)
Routines that provide standard building blocks for performing
basic vector and matrix operations. The Level 1 BLAS perform
scalar, vector and vector-vector operations, the Level 2 BLAS
perform matrix-vector operations, and the Level 3 BLAS per-
form matrix-matrix operations. Because the BLAS are efficient,
portable, and widely available, they are commonly used in the
development of high quality linear algebra software, e.g. → LA-
PACK. Although a model Fortran implementation of the BLAS
is available from netlib in the BLAS library, it is not expected to
perform as well as a specially tuned implementation on most
high-performance computers – on some machines it may give
much worse performance – but it allows users to run → LAPACK
software on machines that do not offer any other implementa-
tion of the BLAS.
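As a brief illustration (a generic sketch, not taken from this document), the Level 3 routine DGEMM can be called through the standard CBLAS interface; this assumes a CBLAS implementation such as the netlib reference or a vendor-tuned library is installed (link with -lcblas or the vendor equivalent).

// C = alpha*A*B + beta*C via the Level 3 BLAS routine DGEMM (CBLAS interface).
#include <cblas.h>
#include <cstdio>

int main() {
    // 2x2 matrices in row-major order.
    double A[] = {1.0, 2.0,
                  3.0, 4.0};
    double B[] = {5.0, 6.0,
                  7.0, 8.0};
    double C[] = {0.0, 0.0,
                  0.0, 0.0};
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,       /* M, N, K       */
                1.0, A, 2,     /* alpha, A, lda */
                B, 2,          /* B, ldb        */
                0.0, C, 2);    /* beta, C, ldc  */
    std::printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);  // 19 22; 43 50
}

Tuned implementations typically expose this same interface, so the identical call can run unchanged on top of an optimized library.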
Cg (“C for Graphics”)
A high-level shading language developed by Nvidia in close col-
laboration with Microsoft for programming vertex and pixel
shaders. It is very similar to Microsoft’s → HLSL. Cg is based on
the C programming language and although they share the same
syntax, some features of C were modified and new data types
were added to make Cg more suitable for programming graphics
processing units. This language is only suitable for GPU program-
ming and is not a general programming language. The Cg com-
piler outputs DirectX or OpenGL shader programs.
CISC (“complex instruction-set computer”)
A computer instruction set architecture (ISA) in which each in-
struction can execute several low-level operations, such as a load
from memory, an arithmetic operation, and a memory store, all
in a single instruction. The term was retroactively coined in con-
trast to reduced instruction set computer (RISC). The terms RISC
and CISC have become less meaningful with the continued evo-
lution of both CISC and RISC designs and implementations, with
modern processors also decoding and splitting more complex in-
structions into a series of smaller internal micro-operations that
can thereby be executed in a pipelined fashion, thus achieving
high performance on a much larger subset of instructions.
cluster
Aggregation of several, mostly identical or similar, systems into a group working in parallel on a problem. Previously known as Beowulf clusters, HPC clusters are composed of commodity hardware and are scalable in design: the more machines are added to the cluster, the more performance can in principle be achieved.
control protocol
Part of the → parallel NFS standard
CUDA driver API
Part of → CUDA
CUDA SDK
Part of → CUDA
CUDA toolkit
Part of → CUDA
CUDA (“Compute Uniform Device Architecture”)
A parallel computing architecture developed by NVIDIA. CUDA
is the computing engine in NVIDIA graphics processing units or
GPUs that is accessible to software developers through indus-
try standard programming languages. Programmers use “C for
CUDA” (C with NVIDIA extensions), compiled through a PathScale
Open64 C compiler, to code algorithms for execution on the GPU.
CUDA architecture supports a range of computational interfaces
including → OpenCL and → DirectCompute. Third party wrappers
are also available for Python, Fortran, Java and Matlab. CUDA
works with all NVIDIA GPUs from the G8X series onwards, includ-
ing GeForce, Quadro and the Tesla line. CUDA provides both a low
level API and a higher level API. The initial CUDA SDK was made
public on 15 February 2007, for Microsoft Windows and Linux.
Mac OS X support was later added in version 2.0, which super-
sedes the beta released February 14, 2008.
CUDA is the hardware and software architecture that enables
NVIDIA GPUs to execute programs written with C, C++, Fortran,
→ OpenCL, → DirectCompute, and other languages. A CUDA pro-
gram calls parallel kernels. A kernel executes in parallel across
a set of parallel threads. The programmer or compiler organizes
these threads in thread blocks and grids of thread blocks. The
GPU instantiates a kernel program on a grid of parallel thread
blocks. Each thread within a thread block executes an instance
of the kernel, and has a thread ID within its thread block, pro-
gram counter, registers, per-thread private memory, inputs, and
output results.

[Figure: CUDA thread hierarchy – threads grouped into thread blocks, thread blocks grouped into grids, with per-thread private local memory, per-block shared memory, and per-application context global memory]
A thread block is a set of concurrently executing threads that
can cooperate among themselves through barrier synchro-
nization and shared memory. A thread block has a block ID
within its grid. A grid is an array of thread blocks that execute
the same kernel, read inputs from global memory, write re-
sults to global memory, and synchronize between dependent
kernel calls. In the CUDA parallel programming model, each
thread has a per-thread private memory space used for regis-
ter spills, function calls, and C automatic array variables. Each
thread block has a per-block shared memory space used for
inter-thread communication, data sharing, and result sharing
in parallel algorithms. Grids of thread blocks share results in
global memory space after kernel-wide global synchroniza-
tion.
CUDA’s hierarchy of threads maps to a hierarchy of proces-
sors on the GPU; a GPU executes one or more kernel grids; a
streaming multiprocessor (SM) executes one or more thread
blocks; and CUDA cores and other execution units in the SM
execute threads. The SM executes threads in groups of 32
threads called a warp. While programmers can generally ig-
nore warp execution for functional correctness and think of
programming one thread, they can greatly improve perfor-
mance by having threads in a warp execute the same code
path and access memory in nearby addresses. See the main
article “GPU Computing” for further details.
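The thread-hierarchy and indexing scheme just described can be made concrete with a minimal CUDA C++ vector addition (a generic textbook sketch, not code from this document): each thread computes its global index from its block ID and thread ID and handles one element.

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard the last partial block
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);

    const int threadsPerBlock = 256;                 // 8 warps of 32 threads
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    add<<<blocks, threadsPerBlock>>>(da, db, dc, n); // launch the kernel grid

    cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost);
    std::printf("c[0] = %g\n", hc[0]);               // expect 3
    cudaFree(da); cudaFree(db); cudaFree(dc);
}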
DirectCompute
An application programming interface (API) that supports gen-
eral-purpose computing on graphics processing units (GPUs)
on Microsoft Windows Vista or Windows 7. DirectCompute is
part of the Microsoft DirectX collection of APIs and was initial-
ly released with the DirectX 11 API but runs on both DirectX 10
and DirectX 11 GPUs. The DirectCompute architecture shares
a range of computational interfaces with → OpenCL and →
CUDA.
ETL (“Extract, Transform, Load”)
A process in database usage and especially in data warehousing
that involves:
ʎ Extracting data from outside sources
ʎ Transforming it to fit operational needs (which can include
quality levels)
ʎ Loading it into the end target (database or data warehouse)
The first part of an ETL process involves extracting the data from
the source systems. In many cases this is the most challenging
aspect of ETL, as extracting data correctly will set the stage for
how subsequent processes will go. Most data warehousing proj-
ects consolidate data from different source systems. Each sepa-
rate system may also use a different data organization/format.
Common data source formats are relational databases and flat
files, but may include non-relational database structures such
as Information Management System (IMS) or other data struc-
tures such as Virtual Storage Access Method (VSAM) or Indexed
Sequential Access Method (ISAM), or even fetching from outside
sources such as through web spidering or screen-scraping. The
streaming of the extracted data source and load on-the-fly to the
destination database is another way of performing ETL when
no intermediate data storage is required. In general, the goal of
the extraction phase is to convert the data into a single format
which is appropriate for transformation processing.
An intrinsic part of the extraction involves the parsing of extracted data, resulting in a check whether the data meets an expected pattern or structure. If not, the data may be rejected entirely or in part.
The transform stage applies a series of rules or functions to the
extracted data from the source to derive the data for loading
into the end target. Some data sources will require very little
or even no manipulation of data. In other cases, one or more of
the following transformation types may be required to meet the
business and technical needs of the target database.
The load phase loads the data into the end target, usually the
data warehouse (DW). Depending on the requirements of the
organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; updating the extracted data is frequently done on a daily, weekly or monthly basis. Other DWs (or even other parts of the same DW) may add new data in a historicized form, for example, hourly. To
understand this, consider a DW that is required to maintain sales
records of the last year. Then, the DW will overwrite any data
that is older than a year with newer data. However, the entry of
data for any one year window will be made in a historicized man-
ner. The timing and scope to replace or append are strategic de-
sign choices dependent on the time available and the business
needs. More complex systems can maintain a history and audit
trail of all changes to the data loaded in the DW. As the load
phase interacts with a database, the constraints defined in the
database schema — as well as in triggers activated upon data
load — apply (for example, uniqueness, referential integrity,
mandatory fields), which also contribute to the overall data
quality performance of the ETL process.
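The three phases can be illustrated with a deliberately minimal, self-contained C++ sketch; the file names, record layout and transformation rule are hypothetical, and a real ETL tool wraps exactly these steps in validation, scheduling and error handling.

#include <cctype>
#include <fstream>
#include <sstream>
#include <string>

int main() {
    std::ifstream source("sales_source.csv");   // extract: an outside source
    std::ofstream target("sales_dw.csv");       // load target (stand-in for a DW)
    std::string line;
    while (std::getline(source, line)) {        // one record per line: id,region,amount
        std::istringstream row(line);
        std::string id, region, amount;
        if (!std::getline(row, id, ',') ||
            !std::getline(row, region, ',') ||
            !std::getline(row, amount, ','))
            continue;                            // reject records failing the expected pattern
        for (char& c : region)                   // transform: normalize region codes
            c = std::toupper(static_cast<unsigned char>(c));
        target << id << ',' << region << ',' << amount << '\n';  // load
    }
}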
FFTW (“Fastest Fourier Transform in the West”)
A software library for computing discrete Fourier transforms
(DFTs), developed by Matteo Frigo and Steven G. Johnson at
the Massachusetts Institute of Technology. FFTW is known as
the fastest free software implementation of the Fast Fourier
transform (FFT) algorithm (upheld by regular benchmarks). It
can compute transforms of real and complex-valued arrays of
arbitrary size and dimension in O(n log n) time.
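A minimal example of FFTW's C API (callable from C++; link with -lfftw3), transforming a small complex array whose input is a pure cosine, so the energy concentrates in two bins:

#include <fftw3.h>
#include <cmath>
#include <cstdio>

int main() {
    const int n = 8;
    fftw_complex* in  = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex* out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * n);
    // FFTW_ESTIMATE plans without touching the arrays, so we may fill them after.
    fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    for (int i = 0; i < n; ++i) {
        in[i][0] = std::cos(2.0 * M_PI * i / n);  // real part
        in[i][1] = 0.0;                           // imaginary part
    }
    fftw_execute(plan);                           // the O(n log n) transform

    for (int i = 0; i < n; ++i)
        std::printf("bin %d: %6.3f %+6.3fi\n", i, out[i][0], out[i][1]);

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
}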
floating point standard (IEEE 754)
The most widely-used standard for floating-point computation,
and is followed by many hardware (CPU and FPU) and software
implementations. Many computer languages allow or require
that some or all arithmetic be carried out using IEEE 754 for-
mats and operations. The current version is IEEE 754-2008,
which was published in August 2008; the original IEEE 754-1985
was published in 1985. The standard defines arithmetic for-
mats, interchange formats, rounding algorithms, operations,
and exception handling. The standard also includes extensive
recommendations for advanced exception handling, additional
operations (such as trigonometric functions), expression evalu-
ation, and for achieving reproducible results. The standard
defines single-precision, double-precision, as well as 128-bit quadruple-precision floating point numbers. The 2008 revision (developed under the name 754r) also defines the 2-byte half-precision number format.
FraunhoferFS (FhGFS)
A high-performance parallel file system from the Fraunhofer
Competence Center for High Performance Computing. Built
on scalable multithreaded core components with native → InfiniBand support, file system nodes can serve → InfiniBand and
Ethernet (or any other TCP-enabled network) connections at
the same time and automatically switch to a redundant con-
nection path in case any of them fails. One of the most funda-
mental concepts of FhGFS is the strict avoidance of architectur-
al bottlenecks. Striping file contents across multiple storage
servers is only one part of this concept. Another important as-
pect is the distribution of file system metadata (e.g. directory
information) across multiple metadata servers. Large systems
and metadata intensive applications in general can greatly
profit from the latter feature.
FhGFS requires no dedicated file system partition on the serv-
ers – it uses existing partitions, formatted with any of the stan-
dard Linux file systems, e.g. XFS or ext4. For larger networks, it is
also possible to create several distinct FhGFS file system parti-
tions with different configurations. FhGFS provides a coherent
mode, in which it is guaranteed that changes to a file or directo-
ry by one client are always immediately visible to other clients.
Global Arrays (GA)
A library developed by scientists at Pacific Northwest National
Laboratory for parallel computing. GA provides a friendly API for
shared-memory programming on distributed-memory comput-
ers for multidimensional arrays. The GA library is a predecessor
to the GAS (global address space) languages currently being de-
veloped for high-performance computing. The GA toolkit has ad-
ditional libraries including a Memory Allocator (MA), Aggregate
Remote Memory Copy Interface (ARMCI), and functionality for
out-of-core storage of arrays (ChemIO). Although GA was initially
developed to run with TCGMSG, a message passing library that
came before the → MPI standard (Message Passing Interface), it
is now fully compatible with → MPI. GA includes simple matrix
computations (matrix-matrix multiplication, LU solve) and works
with → ScaLAPACK. Sparse matrices are available but the imple-
mentation is not optimal yet. GA was developed by Jarek Nieplo-
cha, Robert Harrison and R. J. Littlefield. The ChemIO library for
out-of-core storage was developed by Jarek Nieplocha, Robert
Harrison and Ian Foster. The GA library is incorporated into many
quantum chemistry packages, including NWChem, MOLPRO,
UTChem, MOLCAS, and TURBOMOLE. The GA toolkit is free soft-
ware, licensed under a self-made license.
Globus Toolkit
An open source toolkit for building computing grids developed
and provided by the Globus Alliance, currently at version 5.
GMP (“GNU Multiple Precision Arithmetic Library”)
A free library for arbitrary-precision arithmetic, operating on
signed integers, rational numbers, and floating point num-
bers. There are no practical limits to the precision except the
ones implied by the available memory in the machine GMP
runs on (the operand dimension limit is 2^31 bits on 32-bit machines and 2^37 bits on 64-bit machines). GMP has a rich set of
functions, and the functions have a regular interface. The ba-
sic interface is for C but wrappers exist for other languages in-
cluding C++, C#, OCaml, Perl, PHP, and Python. In the past, the
Kaffe Java virtual machine used GMP to support Java built-in
arbitrary precision arithmetic. This feature has been removed
from recent releases, causing protests from people who claim
that they used Kaffe solely for the speed benefits afforded by
GMP. As a result, GMP support has been added to GNU Class-
path. The main target applications of GMP are cryptography
applications and research, Internet security applications, and
computer algebra systems.
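A small example using GMP's C++ wrapper gmpxx (link with -lgmpxx -lgmp): computing 100! exactly, far beyond any machine-word integer type.

#include <gmpxx.h>
#include <iostream>

int main() {
    mpz_class factorial = 1;            // arbitrary-precision integer
    for (int i = 2; i <= 100; ++i)
        factorial *= i;                 // never overflows; GMP grows the operand
    std::cout << "100! = " << factorial << '\n';
    std::cout << "(" << factorial.get_str().size() << " decimal digits)\n";
}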
GotoBLAS
Kazushige Goto’s implementation of → BLAS.
grid (in CUDA architecture)
Part of the → CUDA programming model
GridFTP
An extension of the standard File Transfer Protocol (FTP) for
use with Grid computing. It is defined as part of the → Globus
toolkit, under the organisation of the Global Grid Forum (spe-
cifically, by the GridFTP working group). The aim of GridFTP is
to provide a more reliable and high performance file transfer
for Grid computing applications. This is necessary because of
the increased demands of transmitting data in Grid comput-
ing – it is frequently necessary to transmit very large files, and
this needs to be done fast and reliably. GridFTP is the answer
to the problem of incompatibility between storage and access
systems. Previously, each data provider would make their data
available in their own specific way, providing a library of ac-
cess functions. This made it difficult to obtain data from multi-
ple sources, requiring a different access method for each, and
thus dividing the total available data into partitions. GridFTP
provides a uniform way of accessing the data, encompassing
functions from all the different modes of access, building on
and extending the universally accepted FTP standard. FTP was
chosen as a basis for it because of its widespread use, and be-
cause it has a well defined architecture for extensions to the
protocol (which may be dynamically discovered).
Hierarchical Data Format (HDF)
A set of file formats and libraries designed to store and organize
large amounts of numerical data. Originally developed at the
National Center for Supercomputing Applications, it is currently
supported by the non-profit HDF Group, whose mission is to en-
sure continued development of HDF5 technologies, and the con-
tinued accessibility of data currently stored in HDF. In keeping
with this goal, the HDF format, libraries and associated tools are
available under a liberal, BSD-like license for general use. HDF is
supported by many commercial and non-commercial software
platforms, including Java, MATLAB, IDL, and Python. The freely
available HDF distribution consists of the library, command-line
utilities, test suite source, Java interface, and the Java-based HDF
Viewer (HDFView). There currently exist two major versions of
HDF, HDF4 and HDF5, which differ significantly in design and API.
HLSL (“High Level Shader Language“)
The High Level Shader Language or High Level Shading Language
(HLSL) is a proprietary shading language developed by Microsoft
for use with the Microsoft Direct3D API. It is analogous to the
GLSL shading language used with the OpenGL standard. It is very
similar to the NVIDIA Cg shading language, as it was developed
alongside it.
HLSL programs come in three forms: vertex shaders, geometry shaders, and pixel (or fragment) shaders. A vertex shader is ex-
ecuted for each vertex that is submitted by the application, and
is primarily responsible for transforming the vertex from object
space to view space, generating texture coordinates, and calcu-
lating lighting coefficients such as the vertex’s tangent, binor-
mal and normal vectors. When a group of vertices (normally 3, to
form a triangle) come through the vertex shader, their output po-
sition is interpolated to form pixels within its area; this process
is known as rasterisation. Each of these pixels comes through
the pixel shader, whereby the resultant screen colour is calculat-
ed. Optionally, an application using a Direct3D10 interface and
Direct3D10 hardware may also specify a geometry shader. This
shader takes as its input the three vertices of a triangle and uses
this data to generate (or tessellate) additional triangles, which
are each then sent to the rasterizer.
InfiniBand
Switched fabric communications link primarily used in HPC and
enterprise data centers. Its features include high throughput,
low latency, quality of service and failover, and it is designed to
be scalable. The InfiniBand architecture specification defines a
connection between processor nodes and high performance I/O
nodes such as storage devices. Like → PCI Express, and many other
modern interconnects, InfiniBand offers point-to-point bidirec-
tional serial links intended for the connection of processors with
high-speed peripherals such as disks. On top of the point to point
capabilities, InfiniBand also offers multicast operations as well.
It supports several signalling rates and, as with PCI Express, links
can be bonded together for additional throughput.
The SDR serial connection’s signalling rate is 2.5 gigabit per sec-
ond (Gbit/s) in each direction per connection. DDR is 5 Gbit/s and
QDR is 10 Gbit/s. FDR is 14.0625 Gbit/s and EDR is 25.78125 Gbit/s
per lane. For SDR, DDR and QDR, links use 8B/10B encoding – every
10 bits sent carry 8 bits of data – making the useful data transmis-
sion rate four-fifths the raw rate. Thus single, double, and quad
data rates carry 2, 4, or 8 Gbit/s useful data, respectively. For FDR
and EDR, links use 64B/66B encoding – every 66 bits sent carry 64
bits of data.
Implementers can aggregate links in units of 4 or 12, called 4X or
12X. A 12X QDR link therefore carries 120 Gbit/s raw, or 96 Gbit/s
of useful data. As of 2009 most systems use a 4X aggregate, imply-
ing a 10 Gbit/s (SDR), 20 Gbit/s (DDR) or 40 Gbit/s (QDR) connection.
Larger systems with 12X links are typically used for cluster and
supercomputer interconnects and for inter-switch connections.
Single data rate switch chips have a latency of 200 nanoseconds, DDR switch chips have a latency of 140 nanoseconds and QDR switch chips have a latency of 100 nanoseconds. End-to-end MPI latencies range from 1.07 microseconds over 1.29 microseconds to 2.6 microseconds, depending on the generation and adapter. As of 2009, various InfiniBand host channel adapters (HCAs) exist in the market, each with different latency and bandwidth characteristics. InfiniBand also provides RDMA capabilities for low CPU overhead. The latency for RDMA operations is less than 1 microsecond.
See the main article “InfiniBand” for a further description of InfiniBand features.
Intel Integrated Performance Primitives (Intel IPP)
A multi-threaded software library of functions for multimedia
and data processing applications, produced by Intel. The library
supports Intel and compatible processors and is available for
Windows, Linux, and Mac OS X operating systems. It is available
separately or as a part of Intel Parallel Studio. The library takes
advantage of processor features including MMX, SSE, SSE2, SSE3,
SSSE3, SSE4, AES-NI and multicore processors. Intel IPP is divided
into four major processing groups: Signal (with linear array or
vector data), Image (with 2D arrays for typical color spaces), Ma-
trix (with n x m arrays for matrix operations), and Cryptography.
Half the entry points are of the matrix type, a third are of the
signal type and the remainder are of the image and cryptogra-
phy types. Intel IPP functions are divided across several data types, including 8u (8-bit unsigned), 8s (8-bit signed), 16s, 32f (32-bit
floating-point), 64f, etc. Typically, an application developer works
with only one dominant data type for most processing func-
tions, converting between input to processing to output formats
at the end points. Version 5.2 was introduced June 5, 2007, add-
ing code samples for data compression, new video codec sup-
port, support for 64-bit applications on Mac OS X, support for
Windows Vista, and new functions for ray-tracing and rendering.
Version 6.1 was released with the Intel C++ Compiler on June 28,
2009 and Update 1 for version 6.1 was released on July 28, 2009.
Intel Threading Building Blocks (TBB)
A C++ template library developed by Intel Corporation for
writing software programs that take advantage of multi-core
processors. The library consists of data structures and algo-
rithms that allow a programmer to avoid some complications
arising from the use of native threading packages such as
POSIX → threads, Windows → threads, or the portable Boost
Threads in which individual → threads of execution are cre-
ated, synchronized, and terminated manually. Instead the li-
brary abstracts access to the multiple processors by allowing
the operations to be treated as “tasks”, which are allocated
to individual cores dynamically by the library’s run-time en-
gine, and by automating efficient use of the CPU cache. A TBB
program creates, synchronizes and destroys graphs of depen-
dent tasks according to algorithms, i.e. high-level parallel pro-
gramming paradigms (a.k.a. Algorithmic Skeletons). Tasks are
then executed respecting graph dependencies. This approach
groups TBB in a family of solutions for parallel programming
aiming to decouple the programming from the particulars of
the underlying machine. Intel TBB is available commercially
as a binary distribution with support and in open source in
both source and binary forms. Version 4.0 was introduced on
September 8, 2011.
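A minimal sketch of this task-based model (a generic example, not from this document): tbb::parallel_for splits an iteration range into tasks that the runtime schedules onto the available cores.

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> v(1000000, 1.0);
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, v.size()),     // range the runtime subdivides
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                v[i] = v[i] * 2.0 + 1.0;             // per-element work, run as tasks
        });
    std::cout << "v[0] = " << v[0] << '\n';          // expect 3
}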
iSER (“iSCSI Extensions for RDMA“)
A protocol that maps the iSCSI protocol over a network that pro-
vides RDMA services (like → iWARP or → InfiniBand). This permits
data to be transferred directly into SCSI I/O buffers without in-
termediate data copies. The Datamover Architecture (DA) defines
an abstract model in which the movement of data between iSCSI
end nodes is logically separated from the rest of the iSCSI proto-
col. iSER is one Datamover protocol. The interface between the
iSCSI and a Datamover protocol, iSER in this case, is called Dat-
amover Interface (DI).
Useful data rates for → InfiniBand links (SDR = single, DDR = double, QDR = quadruple, FDR = fourteen, EDR = enhanced data rate):

        SDR         DDR         QDR         FDR          EDR
 1X   2 Gbit/s    4 Gbit/s    8 Gbit/s   14 Gbit/s    25 Gbit/s
 4X   8 Gbit/s   16 Gbit/s   32 Gbit/s   56 Gbit/s   100 Gbit/s
12X  24 Gbit/s   48 Gbit/s   96 Gbit/s  168 Gbit/s   300 Gbit/s
iWARP (“Internet Wide Area RDMA Protocol”)
An Internet Engineering Task Force (IETF) update of the RDMA
Consortium’s → RDMA over TCP standard. The standard provides zero-copy transmission over legacy TCP. Because a kernel imple-
mentation of the TCP stack is a tremendous bottleneck, a few
vendors now implement TCP in hardware. This additional hard-
ware is known as the TCP offload engine (TOE). TOE itself does
not prevent copying on the receive side, and must be combined
with RDMA hardware for zero-copy results. The main component
is the Data Direct Protocol (DDP), which permits the actual zero-
copy transmission. The transmission itself is not performed by
DDP, but by TCP.
kernel (in CUDA architecture)
Part of the → CUDA programming model
LAM/MPI
A high-quality open-source implementation of the → MPI specifi-
cation, including all of MPI-1.2 and much of MPI-2. Superseded by
the → OpenMPI implementation.
LAPACK (“linear algebra package”)
Routines for solving systems of simultaneous linear equations,
least-squares solutions of linear systems of equations, eigenvalue
problems, and singular value problems. The original goal of the LA-
PACK project was to make the widely used EISPACK and → LINPACK
libraries run efficiently on shared-memory vector and parallel pro-
cessors. LAPACK routines are written so that as much as possible
of the computation is performed by calls to the → BLAS library.
While → LINPACK and EISPACK are based on the vector operation
kernels of the Level 1 BLAS, LAPACK was designed at the outset to
exploit the Level 3 BLAS. Highly efficient machine-specific imple-
mentations of the BLAS are available for many modern high-per-
formance computers. The BLAS enable LAPACK routines to achieve
high performance with portable software.
layout
Part of the → parallel NFS standard. Currently three types of lay-
out exist: file-based, block/volume-based, and object-based, the
latter making use of → object-based storage devices
LINPACK
A collection of Fortran subroutines that analyze and solve linear
equations and linear least-squares problems. LINPACK was de-
signed for supercomputers in use in the 1970s and early 1980s.
LINPACK has been largely superseded by → LAPACK, which has
been designed to run efficiently on shared-memory, vector su-
percomputers. LINPACK makes use of the → BLAS libraries for
performing basic vector and matrix operations.
The LINPACK benchmarks are a measure of a system‘s float-
ing point computing power and measure how fast a computer
solves a dense N by N system of linear equations Ax=b, which is a
common task in engineering. The solution is obtained by Gauss-
ian elimination with partial pivoting, with 2/3•N³ + 2•N² floating
point operations. The result is reported in millions of floating
point operations per second (MFLOP/s, sometimes simply called
MFLOPS).
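As a worked example of this operation count (plain arithmetic only, not a benchmark result), take $N = 10{,}000$:

\[
\tfrac{2}{3}N^3 + 2N^2 = \tfrac{2}{3}\cdot 10^{12} + 2\cdot 10^{8} \approx 6.67\cdot 10^{11}\ \text{floating point operations,}
\]

so a system that factorizes and solves this system in 10 seconds achieves roughly $6.67\cdot 10^{10}$ flop/s, i.e. about 66,700 MFLOP/s (66.7 GFLOP/s).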
LNET
Communication protocol in → Lustre
logical object volume (LOV)
A logical entity in → Lustre
Lustre
An object-based → parallel file system
management server (MGS)
A functional component in → Lustre
MapReduce
A framework for processing highly distributable problems across
huge datasets using a large number of computers (nodes), collec-
tively referred to as a cluster (if all nodes use the same hardware)
or a grid (if the nodes use different hardware). Computational
processing can occur on data stored either in a filesystem (un-
structured) or in a database (structured).
“Map” step: The master node takes the input, divides it into
smaller sub-problems, and distributes them to worker nodes. A
worker node may do this again in turn, leading to a multi-level
tree structure. The worker node processes the smaller problem,
and passes the answer back to its master node.
“Reduce” step: The master node then collects the answers to all
the sub-problems and combines them in some way to form the
output – the answer to the problem it was originally trying to
solve.
MapReduce allows for distributed processing of the map and
reduction operations. Provided each mapping operation is
independent of the others, all maps can be performed in par-
allel – though in practice it is limited by the number of inde-
pendent data sources and/or the number of CPUs near each
source. Similarly, a set of ‘reducers’ can perform the reduction
phase - provided all outputs of the map operation that share
the same key are presented to the same reducer at the same
time. While this process can often appear inefficient com-
pared to algorithms that are more sequential, MapReduce can
be applied to significantly larger datasets than “commodity”
servers can handle – a large server farm can use MapReduce
to sort a petabyte of data in only a few hours. The parallelism
also offers some possibility of recovering from partial failure
of servers or storage during the operation: if one mapper or
reducer fails, the work can be rescheduled – assuming the in-
put data is still available.
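The two steps can be sketched in a few lines of single-process C++ (a word-count toy model; a real MapReduce framework distributes the map and reduce functions across many nodes and handles failures and data movement):

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Map: one input record (a line) -> list of (word, 1) pairs.
std::vector<std::pair<std::string, int>> map_fn(const std::string& line) {
    std::vector<std::pair<std::string, int>> out;
    std::istringstream in(line);
    std::string word;
    while (in >> word) out.emplace_back(word, 1);
    return out;
}

int main() {
    std::vector<std::string> input = {"the quick brown fox",
                                      "the lazy dog",
                                      "the fox"};
    // Shuffle: group all emitted values by key.
    std::map<std::string, std::vector<int>> groups;
    for (const auto& line : input)
        for (const auto& kv : map_fn(line))
            groups[kv.first].push_back(kv.second);

    // Reduce: combine the values of each key into the final answer.
    for (const auto& g : groups) {
        int sum = 0;
        for (int v : g.second) sum += v;
        std::cout << g.first << ": " << sum << '\n';
    }
}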
metadata server (MDS)
A functional component in → Lustre
metadata target (MDT)
A logical entity in → Lustre
MKL (“Math Kernel Library”)
A library of optimized, math routines for science, engineering,
and financial applications developed by Intel. Core math func-
tions include → BLAS, → LAPACK, → ScaLAPACK, Sparse Solvers,
Fast Fourier Transforms, and Vector Math. The library supports
Intel and compatible processors and is available for Windows,
Linux and Mac OS X operating systems.
MPI, MPI-2 (“message-passing interface”)
A language-independent communications protocol used to
program parallel computers. Both point-to-point and collec-
tive communication are supported. MPI remains the domi-
nant model used in high-performance computing today. There
are two versions of the standard that are currently popular:
version 1.2 (shortly called MPI-1), which emphasizes message
passing and has a static runtime environment, and MPI-2.1
(MPI-2), which includes new features such as parallel I/O, dy-
namic process management and remote memory operations.
MPI-2 specifies over 500 functions and provides language
bindings for ANSI C, ANSI Fortran (Fortran90), and ANSI C++. In-
teroperability of objects defined in MPI was also added to allow
for easier mixed-language message passing programming. A side
effect of MPI-2 standardization (completed in 1996) was clarifi-
cation of the MPI-1 standard, creating the MPI-1.2 level. MPI-2 is
mostly a superset of MPI-1, although some functions have been
deprecated. Thus MPI-1.2 programs still work under MPI imple-
mentations compliant with the MPI-2 standard. The MPI Forum
reconvened in 2007, to clarify some MPI-2 issues and explore de-
velopments for a possible MPI-3.
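A minimal MPI program in C/C++ (compile with mpicxx, run with e.g. mpirun -np 2): each rank reports in, and rank 1 sends a value to rank 0 via point-to-point messaging.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's ID
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes
    std::printf("rank %d of %d\n", rank, size);

    if (size >= 2) {
        int payload = 42;
        if (rank == 1)
            MPI_Send(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        else if (rank == 0) {
            int received;
            MPI_Recv(&received, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            std::printf("rank 0 received %d from rank 1\n", received);
        }
    }
    MPI_Finalize();
}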
MPICH2
A freely available, portable → MPI 2.0 implementation, main-
tained by Argonne National Laboratory
MPP (“massively parallel processing”)
So-called MPP jobs are computer programs with several parts
running on several machines in parallel, often calculating simu-
lation problems. The communication between these parts can
e.g. be realized by the → MPI software interface.
MS-MPI
Microsoft → MPI 2.0 implementation shipped with Microsoft HPC
Pack 2008 SDK, based on and designed for maximum compatibil-
ity with the → MPICH2 reference implementation.
MVAPICH2
An → MPI 2.0 implementation based on → MPICH2 and developed
by the Department of Computer Science and Engineering at
Ohio State University. It is available under BSD licensing and sup-
ports MPI over InfiniBand, 10GigE/iWARP and RDMAoE.
NetCDF (“Network Common Data Form”)
A set of software libraries and self-describing, machine-indepen-
dent data formats that support the creation, access, and shar-
ing of array-oriented scientific data. The project homepage is
hosted by the Unidata program at the University Corporation
for Atmospheric Research (UCAR). They are also the chief source
of NetCDF software, standards development, updates, etc. The
format is an open standard. NetCDF Classic and 64-bit Offset For-
mat are an international standard of the Open Geospatial Con-
sortium. The project is actively supported by UCAR. The recently
released (2008) version 4.0 greatly enhances the data model by
allowing the use of the → HDF5 data file format. Version 4.1 (2010)
adds support for C and Fortran client access to specified subsets
of remote data via OPeNDAP. The format was originally based on
the conceptual model of the NASA CDF but has since diverged
and is not compatible with it. It is commonly used in climatology,
meteorology and oceanography applications (e.g., weather fore-
casting, climate change) and GIS applications. It is an input/out-
put format for many GIS applications, and for general scientific
data exchange. The NetCDF C library, and the libraries based on it
(Fortran 77 and Fortran 90, C++, and all third-party libraries) can,
starting with version 4.1.1, read some data in other data formats.
Data in the → HDF5 format can be read, with some restrictions.
Data in the → HDF4 format can be read by the NetCDF C library if
created using the → HDF4 Scientific Data (SD) API.
NetworkDirect
A remote direct memory access (RDMA)-based network in-
terface implemented in Windows Server 2008 and later. Net-
workDirect uses a more direct path from → MPI applications
to networking hardware, resulting in very fast and efficient
networking. See the main article “Windows HPC Server 2008
R2” for further details.
NFS (Network File System)
A network file system protocol originally developed by Sun
Microsystems in 1984, allowing a user on a client computer to
access files over a network in a manner similar to how local
storage is accessed. NFS, like many other protocols, builds on
the Open Network Computing Remote Procedure Call (ONC
RPC) system. The Network File System is an open standard
defined in RFCs, allowing anyone to implement the protocol.
Sun used version 1 only for in-house experimental purposes.
When the development team added substantial changes to
NFS version 1 and released it outside of Sun, they decided to
release the new version as V2, so that version interoperation
and RPC version fallback could be tested. Version 2 of the pro-
tocol (defined in RFC 1094, March 1989) originally operated
entirely over UDP. Its designers meant to keep the protocol
stateless, with locking (for example) implemented outside of
the core protocol. Version 3 (RFC 1813, June 1995) added:
ʎ support for 64-bit file sizes and offsets, to handle files larger
than 2 gigabytes (GB)
ʎ support for asynchronous writes on the server, to improve
write performance
ʎ additional file attributes in many replies, to avoid the need
to re-fetch them
ʎ a READDIRPLUS operation, to get file handles and attributes
along with file names when scanning a directory
ʎ assorted other improvements
At the time of introduction of Version 3, vendor support for TCP
as a transport-layer protocol began increasing. While several
vendors had already added support for NFS Version 2 with TCP
as a transport, Sun Microsystems added support for TCP as a
transport for NFS at the same time it added support for Ver-
sion 3. Using TCP as a transport made using NFS over a WAN
more feasible.
Version 4 (RFC 3010, December 2000; revised in RFC 3530, April
2003), influenced by AFS and CIFS, includes performance im-
provements, mandates strong security, and introduces a state-
ful protocol. Version 4 became the first version developed with
the Internet Engineering Task Force (IETF) after Sun Microsys-
tems handed over the development of the NFS protocols.
NFS version 4 minor version 1 (NFSv4.1) was approved by the IESG and received an RFC number in January 2010. The NFSv4.1 specification aims to provide protocol support to take advantage of clustered server deployments, including the ability to provide scalable parallel access to files distributed among multiple servers. NFSv4.1 adds the parallel NFS (pNFS) capability, which enables data access parallelism. The NFSv4.1 protocol defines a method
of separating the filesystem meta-data from the location of the
file data; it goes beyond the simple name/data separation by strip-
ing the data amongst a set of data servers. This is different from
the traditional NFS server which holds the names of files and their
data under the single umbrella of the server.
In addition to pNFS, NFSv4.1 provides sessions, directory delegation and notifications, multi-server namespace, access control lists (ACL/SACL/DACL), retention attributes, and SECINFO_NO_NAME. See the main article “Parallel Filesystems” for further details.
Current work is being done in preparing a draft for a future ver-
sion 4.2 of the NFS standard, including so-called federated file-
systems, which constitute the NFS counterpart of Microsoft’s
distributed filesystem (DFS).
NUMA (“non-uniform memory access”)
A computer memory design used in multiprocessors, where the
memory access time depends on the memory location relative
to a processor. Under NUMA, a processor can access its own local
memory faster than non-local memory, that is, memory local to
another processor or memory shared between processors.
object storage server (OSS)
A functional component in → Lustre
object storage target (OST)
A logical entity in → Lustre
object-based storage device (OSD)
An intelligent evolution of disk drives that can store and serve
objects rather than simply place data on tracks and sectors. This
task is accomplished by moving low-level storage functions into
the storage device and accessing the device through an object
interface. Unlike a traditional block-oriented device providing ac-
cess to data organized as an array of unrelated blocks, an object
store allows access to data by means of storage objects. A storage
object is a virtual entity that groups data together that has been
determined by the user to be logically related. Space for a storage
object is allocated internally by the OSD itself instead of by a host-
based file system. OSDs manage all necessary low-level storage,
space management, and security functions. Because there is no
host-based metadata for an object (such as inode information),
the only way for an application to retrieve an object is by using
its object identifier (OID). The SCSI interface was modified and ex-
tended by the OSD Technical Work Group of the Storage Network-
ing Industry Association (SNIA) with varied industry and academic
contributors, resulting in a draft standard submitted to T10 in 2004. This stan-
dard was ratified in September 2004 and became the ANSI T10 SCSI
OSD V1 command set, released as INCITS 400-2004. The SNIA group
continues to work on further extensions to the interface, such as
the ANSI T10 SCSI OSD V2 command set.
OLAP cube (“Online Analytical Processing”)
A set of data, organized in a way that facilitates non-predeter-
mined queries for aggregated information, or in other words,
online analytical processing. OLAP is one of the computer-based
techniques for analyzing business data that are collectively
called business intelligence. OLAP cubes can be thought of as ex-
tensions to the two-dimensional array of a spreadsheet. For ex-
ample, a company might wish to analyze some financial data by
product, by time-period, by city, by type of revenue and cost, and
by comparing actual data with a budget. These additional meth-
ods of analyzing the data are known as dimensions. Because
there can be more than three dimensions in an OLAP system the
term hypercube is sometimes used.
Per-direction bandwidth of → PCI Express links by width and generation:

      PCIe 1.x   PCIe 2.x   PCIe 3.0   PCIe 4.0
 x1   256 MB/s   512 MB/s    1 GB/s     2 GB/s
 x2   512 MB/s     1 GB/s    2 GB/s     4 GB/s
 x4     1 GB/s     2 GB/s    4 GB/s     8 GB/s
 x8     2 GB/s     4 GB/s    8 GB/s    16 GB/s
x16     4 GB/s     8 GB/s   16 GB/s    32 GB/s
x32     8 GB/s    16 GB/s   32 GB/s    64 GB/s
OpenCL (“Open Computing Language”)
A framework for writing programs that execute across heteroge-
neous platforms consisting of CPUs, GPUs, and other processors.
OpenCL includes a language (based on C99) for writing kernels
(functions that execute on OpenCL devices), plus APIs that are
used to define and then control the platforms. OpenCL provides
parallel computing using task-based and data-based parallelism.
OpenCL is analogous to the open industry standards OpenGL
and OpenAL, for 3D graphics and computer audio, respectively.
Originally developed by Apple Inc., which holds trademark rights,
OpenCL is now managed by the non-profit technology consor-
tium Khronos Group.
OpenMP (“Open Multi-Processing”)
An application programming interface (API) that supports multi-
platform shared memory multiprocessing programming in C,
C++ and Fortran on many architectures, including Unix and Mi-
crosoft Windows platforms. It consists of a set of compiler direc-
tives, library routines, and environment variables that influence
run-time behavior.
Jointly defined by a group of major computer hardware and soft-
ware vendors, OpenMP is a portable, scalable model that gives
programmers a simple and flexible interface for developing par-
allel applications for platforms ranging from the desktop to the
supercomputer.
An application built with the hybrid model of parallel program-
ming can run on a computer cluster using both OpenMP and
Message Passing Interface (MPI), or more transparently through
the use of OpenMP extensions for non-shared memory systems.
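A minimal example (compile with e.g. g++ -fopenmp): a single compiler directive parallelizes the loop across the available threads, and the reduction clause merges the per-thread partial sums.

#include <omp.h>
#include <cstdio>

int main() {
    const int n = 1000000;
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)  // split iterations over threads
    for (int i = 0; i < n; ++i)
        sum += 1.0 / (i + 1.0);                  // partial sums merged at the end

    std::printf("threads available: %d, harmonic sum: %f\n",
                omp_get_max_threads(), sum);
}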
OpenMPI
An open source → MPI-2 implementation that is developed and
maintained by a consortium of academic, research, and industry
partners.
parallel NFS (pNFS)
A → parallel file system standard, optional part of the current →
NFS standard 4.1. See the main article “Parallel Filesystems” for
further details.
PCI Express (PCIe)
A computer expansion card standard designed to replace the
older PCI, PCI-X, and AGP standards. Introduced by Intel in 2004,
PCIe (or PCI-E, as it is commonly called) is the latest standard
for expansion cards that is available on mainstream comput-
ers. PCIe, unlike previous PC expansion standards, is structured
around point-to-point serial links, a pair of which (one in each
direction) make up lanes; rather than a shared parallel bus.
These lanes are routed by a hub on the main-board acting as
a crossbar switch. This dynamic point-to-point behavior allows
more than one pair of devices to communicate with each other
at the same time. In contrast, older PC interfaces had all devices
permanently wired to the same bus; therefore, only one device
could send information at a time. This format also allows “chan-
nel grouping”, where multiple lanes are bonded to a single de-
vice pair in order to provide higher bandwidth. The number of
lanes is “negotiated” during power-up or explicitly during op-
eration. By making the lane count flexible a single standard can
provide for the needs of high-bandwidth cards (e.g. graphics
cards, 10 Gigabit Ethernet cards and multiport Gigabit Ethernet
cards) while also being economical for less demanding cards.
Unlike preceding PC expansion interface standards, PCIe is a net-
work of point-to-point connections. This removes the need for
“arbitrating” the bus or waiting for the bus to be free and allows
for full duplex communications. This means that while standard
PCI-X (133 MHz 64 bit) and PCIe x4 have roughly the same data
transfer rate, PCIe x4 will give better performance if multiple
device pairs are communicating simultaneously or if communi-
cation within a single device pair is bidirectional. Specifications
of the format are maintained and developed by a group of more
than 900 industry-leading companies called the PCI-SIG (PCI Spe-
cial Interest Group). In PCIe 1.x, each lane carries approximately
250 MB/s. PCIe 2.0, released in late 2007, adds a Gen2-signalling
mode, doubling the rate to about 500 MB/s. On November 18, 2010, the PCI Special Interest Group officially published the finalized PCI Express 3.0 specification to its members to build devices based on this new version of PCI Express, which allows for a Gen3-signalling mode at 1 GB/s.
On November 29, 2011, the PCI-SIG announced that it would proceed to PCI Express 4.0 featuring 16 GT/s, still on copper technology. Additionally, active and idle power optimizations are to be investigated. Final specifications are expected to be released in 2014/15.
PETSc (“Portable, Extensible Toolkit for Scientific Computation”)
A suite of data structures and routines for the scalable (parallel)
solution of scientific applications modeled by partial differen-
tial equations. It employs the → Message Passing Interface (MPI)
standard for all message-passing communication. The current
version of PETSc is 3.2, released September 8, 2011. PETSc is in-
tended for use in large-scale application projects, many ongoing
computational science projects are built around the PETSc li-
braries. Its careful design allows advanced users to have detailed
control over the solution process. PETSc includes a large suite
of parallel linear and nonlinear equation solvers that are eas-
ily used in application codes written in C, C++, Fortran and now
Python. PETSc provides many of the mechanisms needed within
parallel application code, such as simple parallel matrix and vec-
tor assembly routines that allow the overlap of communication
and computation. In addition, PETSc includes support for paral-
lel distributed arrays useful for finite difference methods.
process
→ thread
PTX (“parallel thread execution”)
Parallel Thread Execution (PTX) is a pseudo-assembly language
used in NVIDIA’s CUDA programming environment. The ‘nvcc’
compiler translates code written in CUDA, a C-like language, into
PTX, and the graphics driver contains a compiler which trans-
lates the PTX into something which can be run on the processing
cores.
RDMA (“remote direct memory access”)
Allows data to move directly from the memory of one computer
into that of another without involving either one‘s operating
system. This permits high-throughput, low-latency networking,
which is especially useful in massively parallel computer clus-
ters. RDMA extends the direct memory access principle across machine boundaries. RDMA
supports zero-copy networking by enabling the network adapt-
er to transfer data directly to or from application memory, elim-
inating the need to copy data between application memory and
the data buffers in the operating system. Such transfers require
no work to be done by CPUs, caches, or context switches, and
transfers continue in parallel with other system operations.
When an application performs an RDMA Read or Write request,
the application data is delivered directly to the network, reduc-
ing latency and enabling fast message transfer. Common RDMA
implementations include → InfiniBand, → iSER, and → iWARP.
RISC (“reduced instruction-set computer”)
A CPU design strategy emphasizing the insight that simplified
instructions that “do less“ may still provide for higher perfor-
mance if this simplicity can be utilized to make instructions ex-
ecute very quickly. Opposite: → CISC.
ScaLAPACK (“scalable LAPACK”)
Library including a subset of → LAPACK routines redesigned
for distributed memory MIMD (multiple instruction, multiple
data) parallel computers. It is currently written in a Single-
Program-Multiple-Data style using explicit message passing
for interprocessor communication. ScaLAPACK is designed
for heterogeneous computing and is portable on any com-
puter that supports → MPI. The fundamental building blocks
of the ScaLAPACK library are distributed memory versions
(PBLAS) of the Level 1, 2 and 3 → BLAS, and a set of Basic Linear
Algebra Communication Subprograms (BLACS) for communi-
cation tasks that arise frequently in parallel linear algebra
computations. In the ScaLAPACK routines, all interprocessor
communication occurs within the PBLAS and the BLACS. One
of the design goals of ScaLAPACK was to have the ScaLAPACK
routines resemble their → LAPACK equivalents as much as
possible.
service-oriented architecture (SOA)
An approach to building distributed, loosely coupled applica-
tions in which functions are separated into distinct services that
can be distributed over a network, combined, and reused. See the
main article “Windows HPC Server 2008 R2” for further details.
single precision/double precision
→ floating point standard
SMP (“shared memory processing”)
So-called SMP jobs are computer programs with several parts running on the same system and accessing a shared memory region. SMP jobs are usually implemented as → multi-threaded programs. The communication between the individual threads can be realized, for example, via the → OpenMP software interface standard, but also in a non-standard way by means of native UNIX interprocess communication mechanisms.
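A minimal sketch in C of such a multi-threaded SMP job, using standard OpenMP pragmas (the compile line and file name are merely examples): all threads of the team share the array a[], and the loop iterations are divided among them.

  /* Compile e.g. with: gcc -fopenmp smp.c */
  #include <omp.h>
  #include <stdio.h>

  #define N 1000000

  int main(void)
  {
      static double a[N];
      double sum = 0.0;

      /* The threads share a[]; iterations are split among them. */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          a[i] = 2.0 * i;

      /* reduction(+:sum) gives each thread a private partial sum,
         combined at the end, avoiding a data race on "sum". */
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < N; i++)
          sum += a[i];

      printf("threads available: %d, sum = %g\n",
             omp_get_max_threads(), sum);
      return 0;
  }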
SMP (“symmetric multiprocessing”)
A multiprocessor or multicore computer architecture where
two or more identical processors or cores can connect to a single shared main memory in a completely symmetric way, i.e. each part of the main memory is at the same distance from each of the cores. Opposite: → NUMA
storage access protocol
Part of the → parallel NFS standard
STREAM
A simple synthetic benchmark program that measures sustain-
able memory bandwidth (in MB/s) and the corresponding com-
putation rate for simple vector kernels.
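The core idea can be illustrated with a stripped-down version of the benchmark's "triad" kernel in C. This sketch is not the official STREAM code, which adds multiple kernels, repetitions, and result validation; the array size here is only an assumption chosen to exceed typical cache sizes.

  /* Compile e.g. with: gcc -O2 triad.c (older glibc may need -lrt) */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N 10000000L   /* three arrays of 10M doubles, ~240 MB in total */

  int main(void)
  {
      double *a = malloc(N * sizeof *a);
      double *b = malloc(N * sizeof *b);
      double *c = malloc(N * sizeof *c);
      const double q = 3.0;
      if (!a || !b || !c) return 1;

      for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (long i = 0; i < N; i++)      /* the triad kernel itself */
          a[i] = b[i] + q * c[i];
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double secs = (t1.tv_sec - t0.tv_sec)
                  + (t1.tv_nsec - t0.tv_nsec) / 1e9;
      /* triad moves 3*N*sizeof(double) bytes: two reads plus one write */
      printf("triad bandwidth: %.0f MB/s\n",
             3.0 * N * sizeof(double) / secs / 1e6);

      free(a); free(b); free(c);
      return 0;
  }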
streaming multiprocessor (SM)
Hardware component within the → Tesla GPU series
subnet manager
Application responsible for configuring the local → InfiniBand
subnet and ensuring its continued operation.
superscalar processors
A superscalar CPU architecture implements a form of parallel-
ism called instruction-level parallelism within a single proces-
sor. It thereby allows faster CPU throughput than would other-
wise be possible at the same clock rate. A superscalar processor
executes more than one instruction during a clock cycle by si-
multaneously dispatching multiple instructions to redundant
functional units on the processor. Each functional unit is not
a separate CPU core but an execution resource within a single
CPU such as an arithmetic logic unit, a bit shifter, or a multi-
plier.
Tesla
NVIDIA's third brand of GPUs, based on high-end GPUs from the G80 series onward. Tesla is NVIDIA's first brand dedicated to general-purpose GPU computing. Because of their very high computational power (measured in floating-point operations per second, or FLOPS) compared to recent microprocessors, Tesla products are intended for the HPC market. Their primary function is to aid in simulations, large-scale calculations (especially floating-point calculations), and image generation for
professional and scientific fields, with the use of → CUDA. See
the main article “NVIDIA GPU Computing” for further details.
thread
A thread of execution is a fork of a computer program into two or
more concurrently running tasks. The implementation of threads
and processes differs from one operating system to another, but
in most cases, a thread is contained inside a process. On a single
processor, multithreading generally occurs by multitasking: the
processor switches between different threads. On a multiproces-
sor or multi-core system, the threads or tasks will generally run
at the same time, with each processor or core running a particu-
lar thread or task. Threads are distinguished from processes in
that processes are typically independent, while threads exist as
subsets of a process. Whereas processes have separate address
spaces, threads share their address space, which makes inter-
thread communication much easier than classical inter-process
communication (IPC).
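A minimal sketch in C using POSIX threads illustrates the shared address space: both threads of one process update the same global counter, so a mutex serializes access (the file name in the compile line is an example).

  /* Compile e.g. with: gcc threads.c -pthread */
  #include <pthread.h>
  #include <stdio.h>

  static long counter = 0;   /* shared: both threads see the same variable */
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  static void *worker(void *arg)
  {
      for (int i = 0; i < 1000000; i++) {
          pthread_mutex_lock(&lock);    /* serialize the shared update */
          counter++;
          pthread_mutex_unlock(&lock);
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, worker, NULL);
      pthread_create(&t2, NULL, worker, NULL);
      pthread_join(t1, NULL);           /* wait for both threads */
      pthread_join(t2, NULL);
      printf("counter = %ld\n", counter);  /* prints 2000000 */
      return 0;
  }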
thread (in CUDA architecture)
Part of the → CUDA programming model
thread block (in CUDA architecture)
Part of the → CUDA programming model
thread processor array (TPA)
Hardware component within the → Tesla GPU series
10 Gigabit Ethernet
One of the fastest Ethernet standards, first published in 2002 as IEEE Std 802.3ae-2002. It defines a version of Ethernet with a nominal data rate of 10 Gbit/s, ten times as fast as Gigabit Ethernet. Over the years several 802.3 standards relating to 10GbE
have been published, which later were consolidated into the
IEEE 802.3-2005 standard. IEEE 802.3-2005 and the other amend-
ments have been consolidated into IEEE Std 802.3-2008. 10
Gigabit Ethernet supports only full-duplex links, which can be connected by switches. Half-duplex operation and CSMA/CD (carrier sense multiple access with collision detection) are not
supported in 10GbE. The 10 Gigabit Ethernet standard encom-
passes a number of different physical layer (PHY) standards.
As of 2008, 10 Gigabit Ethernet was still an emerging technology, with only 1 million ports shipped in 2007, and it remained to be seen which of the PHYs would gain widespread commercial acceptance.
warp (in CUDA architecture)
Part of the → CUDA programming model
I am interested in:
[ ] compute nodes and other hardware
[ ] cluster and workload management
[ ] HPC-related services
[ ] GPU computing
[ ] parallel filesystems and HPC storage
[ ] InfiniBand solutions
[ ] remote postprocessing and visualization
[ ] HPC databases and Big Data Analytics
[ ] YES: I would like more information about transtec's HPC solutions.
[ ] YES: I would like a transtec HPC specialist to contact me.
Company:
Department:
Name:
Address:
ZIP, City:
Email:
Please fill in the postcard on the right, or send an email to
[email protected] for further information.
transtec AG
Waldhoernlestrasse 18
72072 Tuebingen
Germany
transtec Germany, Tel +49 (0) 7071/[email protected]
transtec Switzerland, Tel +41 (0) 44/818 47 [email protected]
transtec United Kingdom, Tel +44 (0) 1295/756 [email protected]
ttec Netherlands, Tel +31 (0) 24 34 34 [email protected]
transtec France, Tel +33 (0) [email protected]
Texts and concept: Dr. Oliver Tennert, Director Technology Management & HPC Solutions | [email protected]
Layout and design: Stefanie Gauger, Graphics & Design | [email protected]
© transtec AG, June 2013. The graphics, diagrams and tables found herein are the intellectual property of transtec AG and may be reproduced or published only with its express permission. No responsibility will be assumed for inaccuracies or omissions. Other names or logos may be trademarks of their respective owners.