
Page 1

Simulation

CAD

CAE

Engineering

Price Modelling

Life Sciences

Automotive

Aerospace

Risk Analysis

Big Data Analytics

High Throughput Computing

High Performance Computing 2013/14

Technology Compass

Page 2

Technology Compass

Table of contents and Introduction

High Performance Computing .......................................... 4
Performance Turns Into Productivity ................................. 6
Flexible Deployment With xCAT ....................................... 8
Service and Customer Care From A to Z .............................. 10
Advanced Cluster Management Made Easy .............................. 12
Easy-to-use, Complete and Scalable ................................. 14
Cloud Bursting With Bright ......................................... 18
Intelligent HPC Workload Management ................................ 30
Moab HPC Suite – Basic Edition ..................................... 32
Moab HPC Suite – Enterprise Edition ................................ 34
Moab HPC Suite – Grid Option ....................................... 36
Optimizing Accelerators with Moab HPC Suite ........................ 38
Moab HPC Suite – Application Portal Edition ........................ 42
Moab HPC Suite – Remote Visualization Edition ...................... 44
Using Moab With Grid Environments .................................. 46
NICE EnginFrame .................................................... 52
A Technical Portal for Remote Visualization ........................ 54
Application Highlights ............................................. 56
Desktop Cloud Virtualization ....................................... 58
Remote Visualization ............................................... 60
Intel Cluster Ready ................................................ 64
A Quality Standard for HPC Clusters ................................ 66
Intel Cluster Ready builds HPC Momentum ............................ 70
The transtec Benchmarking Center ................................... 74
Parallel NFS ....................................................... 76
The New Standard for HPC Storage ................................... 78
What's New in NFS 4.1? ............................................. 80
Panasas HPC Storage ................................................ 82
NVIDIA GPU Computing ............................................... 94
What is GPU Computing? ............................................. 96
Kepler GK110 – The Next Generation GPU ............................. 98
Introducing NVIDIA Parallel Nsight ................................ 100
A Quick Refresher on CUDA ......................................... 112
Intel TrueScale InfiniBand and GPUs ............................... 118
Intel Xeon Phi Coprocessor ........................................ 122
The Architecture .................................................. 124
InfiniBand ........................................................ 134
High-speed Interconnects .......................................... 136
Top 10 Reasons to Use Intel TrueScale InfiniBand .................. 138
Intel MPI Library 4.0 Performance ................................. 140
Intel Fabric Suite 7 .............................................. 144
Parstream ......................................................... 150
Big Data Analytics ................................................ 152
Glossary .......................................................... 160

Page 3

More than 30 years of experience in Scientific Computing

1980 marked the beginning of a decade in which numerous start-ups were created, some of which later transformed into big players in the IT market. Technical innovations brought dramatic changes to the nascent computer market. In Tübingen, close to one of Germany's prime and oldest universities, transtec was founded.

In the early days, transtec focused on reselling DEC computers and peripherals, delivering high-performance workstations to university institutes and research facilities. In 1987, SUN/Sparc and storage solutions broadened the portfolio, enhanced by IBM/RS6000 products in 1991. These were the typical workstation and server systems for high performance computing then, used by the majority of researchers worldwide.

In the late 90s, transtec was one of the first companies to offer highly customized HPC cluster solutions based on standard Intel architecture servers, some of which entered the TOP 500 list of the world's fastest computing systems.

Thus, given this background and history, it is fair to say that transtec looks back on more than 30 years of experience in scientific computing; our track record shows more than 500 HPC installations. With this experience, we know exactly what customers' demands are and how to meet them. High performance and ease of management – this is what customers require today. HPC systems are certainly required to peak-perform, as their name indicates, but that is not enough: they must also be easy to handle. Unwieldy design and operational complexity must be avoided or at least hidden from administrators and particularly users of HPC computer systems.

This brochure focuses on where transtec HPC solutions excel. transtec HPC solutions use the latest and most innovative technology: Bright Cluster Manager as the technology leader for unified HPC cluster management, the leading-edge Moab HPC Suite for job and workload management, Intel Cluster Ready certification as an independent quality standard for our systems, and Panasas HPC storage systems for the highest performance and the ease of management required of a reliable HPC storage system. Again, with these components, usability, reliability and ease of management are central issues that are addressed, even in a highly heterogeneous environment. transtec is able to provide customers with well-designed, extremely powerful solutions for Tesla GPU computing, as well as thoroughly engineered Intel Xeon Phi systems. Intel's InfiniBand Fabric Suite makes managing a large InfiniBand fabric easier than ever before. transtec masterfully combines excellent, well-chosen, readily available components into a fine-tuned, customer-specific, and thoroughly designed HPC solution.

Your decision for a transtec HPC solution means you opt for the most intensive customer care and the best service in HPC. Our experts will be glad to bring in their expertise and support to assist you at any stage, from HPC design to daily cluster operations, to HPC Cloud Services.

Last but not least, transtec HPC Cloud Services provide customers with the possibility to have their jobs run on dynamically provided nodes in a dedicated datacenter, professionally managed and individually customizable. Numerous standard applications like ANSYS, LS-Dyna, OpenFOAM, as well as lots of codes like Gromacs, NAMD, VMD, and others are pre-installed, integrated into an enterprise-ready cloud management environment, and ready to run.

Have fun reading the transtec HPC Compass 2013/14!

Page 4

High Performance Computing – Performance Turns Into Productivity

Page 5

High Performance Computing (HPC) has been with us from the very beginning of the computer era. High-performance computers were built to solve numerous problems which the "human computers" could not handle. The term HPC just hadn't been coined yet. More important, some of the early principles have changed fundamentally.

Page 6

Performance Turns Into Productivity

HPC systems in the early days were much different from those we see today. First, we saw enormous mainframes from large computer manufacturers, including a proprietary operating system and job management system. Second, at universities and research institutes, workstations made inroads and scientists carried out calculations on their dedicated Unix or VMS workstations. In either case, if you needed more computing power, you scaled up, i.e. you bought a bigger machine.

Today the term High-Performance Computing has gained a fundamentally new meaning. HPC is now perceived as a way to tackle complex mathematical, scientific or engineering problems. The integration of industry-standard, "off-the-shelf" server hardware into HPC clusters facilitates the construction of computer networks of such power that a single system could never achieve. The new paradigm for parallelization is scaling out.

Computer-supported simulation of realistic processes (so-called Computer Aided Engineering – CAE) has established itself as a third key pillar in the field of science and research alongside theory and experimentation. It is nowadays inconceivable that an aircraft manufacturer or a Formula One racing team would operate without using simulation software. And scientific calculations, such as in the fields of astrophysics, medicine, pharmaceuticals and bio-informatics, will to a large extent be dependent on supercomputers in the future. Software manufacturers long ago recognized the benefit of high-performance computers based on powerful standard servers and ported their programs to them accordingly.

The main advantage of scale-out supercomputers is just that: they are infinitely scalable, at least in principle. Since they are based on standard hardware components, such a supercomputer can be charged with more power whenever the computational capacity of the system is not sufficient any more, simply by adding additional nodes of the same kind. A cumbersome switch to a different technology can be avoided in most cases.

High Performance Computing

Performance Turns Into Productivity

"transtec HPC solutions are meant to provide customers with unparalleled ease-of-management and ease-of-use. Apart from that, deciding for a transtec HPC solution means deciding for the most intensive customer care and the best service imaginable."

Dr. Oliver Tennert, Director Technology Management & HPC Solutions

Page 7

Variations on the theme: MPP and SMP

Parallel computations exist in two major variants today. Applications running in parallel on multiple compute nodes are frequently so-called Massively Parallel Processing (MPP) applications. MPP indicates that the individual processes can each utilize exclusive memory areas. This means that such jobs are predestined to be computed in parallel, distributed across the nodes in a cluster. The individual processes can thus utilize the separate units of the respective node – especially the RAM, the CPU power and the disk I/O.

Communication between the individual processes is implemented in a standardized way through the MPI software interface (Message Passing Interface), which abstracts the underlying network connections between the nodes from the processes. However, the MPI standard (current version 2.0) merely requires source code compatibility, not binary compatibility, so an off-the-shelf application usually needs specific versions of MPI libraries in order to run. Examples of MPI implementations are OpenMPI, MPICH2, MVAPICH2, Intel MPI or – for Windows clusters – MS-MPI.
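To illustrate the message-passing model described above, the following minimal sketch uses mpi4py, a Python binding to MPI (assumed to be installed on top of one of the MPI implementations mentioned, e.g. OpenMPI); it is not part of the transtec tooling, just a generic example of an MPP-style program.

```python
# mpi_pi.py - toy MPP example: each rank integrates part of [0,1] to estimate pi,
# then the partial sums are combined with a reduction (message passing via MPI).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # id of this process
size = comm.Get_size()      # total number of processes across all nodes

n = 10_000_000              # number of integration intervals
h = 1.0 / n

# Each rank sums every size-th interval, so the work is evenly distributed.
local_sum = sum(4.0 / (1.0 + ((i + 0.5) * h) ** 2) for i in range(rank, n, size))

# Combine the partial results on rank 0.
pi = comm.reduce(local_sum * h, op=MPI.SUM, root=0)
if rank == 0:
    print(f"pi is approximately {pi:.8f} (computed by {size} processes)")
```

Such a program would typically be started with the respective MPI launcher, for example `mpirun -np 16 python mpi_pi.py`, distributing the processes across the cluster nodes.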

If the individual processes engage in a large amount of communication, the response time of the network (latency) becomes important. Latency in a Gigabit Ethernet or a 10GE network is typically around 10 µs. High-speed interconnects such as InfiniBand reduce latency by a factor of 10, down to as low as 1 µs. Therefore, high-speed interconnects can greatly speed up total processing.

The other frequently used variant is called SMP applications. SMP, in this HPC context, stands for Shared Memory Processing. It involves the use of shared memory areas, the specific implementation of which is dependent on the choice of the underlying operating system. Consequently, SMP jobs generally only run on a single node, where they can in turn be multi-threaded and thus be parallelized across the number of CPUs per node. For many HPC applications, both the MPP and the SMP variant can be chosen.

Many applications are not inherently suitable for parallel execution. In such a case, there is no communication between the individual compute nodes, and therefore no need for a high-speed network between them; nevertheless, multiple computing jobs can be run simultaneously and sequentially on each individual node, depending on the number of CPUs. In order to ensure optimum computing performance for these applications, it must be examined how many CPUs and cores deliver the optimum performance. We find applications of this sequential type of work typically in the fields of data analysis or Monte-Carlo simulations.
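As a contrast to the MPI sketch above, the following is a minimal sketch of such an embarrassingly parallel workload: independent Monte-Carlo samples that need no communication between tasks and can simply be spread across the cores of a single node (or submitted as separate jobs to several nodes). Python's standard multiprocessing module is used here purely for illustration.

```python
# monte_carlo_pi.py - independent sampling tasks with no inter-process communication;
# each worker estimates pi from its own random samples, results are averaged at the end.
import random
from multiprocessing import Pool

def estimate_pi(samples: int) -> float:
    """Estimate pi by counting random points that fall inside the unit quarter circle."""
    rng = random.Random()
    hits = sum(1 for _ in range(samples) if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / samples

if __name__ == "__main__":
    tasks = 8                      # e.g. one task per CPU core
    samples_per_task = 1_000_000
    with Pool(processes=tasks) as pool:
        estimates = pool.map(estimate_pi, [samples_per_task] * tasks)
    print("pi is approximately", sum(estimates) / len(estimates))
```

Because the tasks never talk to each other, the same pattern scales to many nodes simply by letting the job scheduler start one such process per core, without any high-speed interconnect.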

Page 8

xCAT as a Powerful and Flexible Deployment Tool

xCAT (Extreme Cluster Administration Tool) is an open source toolkit for the deployment and low-level administration of HPC cluster environments, small as well as large ones.

xCAT provides simple commands for hardware control, node discovery, the collection of MAC addresses, and node deployment with local disks (diskful) or without (diskless). The cluster configuration is stored in a relational database. Node groups for different operating system images can be defined. Also, user-specific scripts can be executed automatically at installation time (see the short scripting sketch after the feature list below).

xCAT Provides the Following Low-Level Administrative Features

ʎ Remote console support
ʎ Parallel remote shell and remote copy commands
ʎ Plugins for various monitoring tools like Ganglia or Nagios
ʎ Hardware control commands for node discovery, collecting MAC addresses, remote power switching and resetting of nodes
ʎ Automatic configuration of syslog, remote shell, DNS, DHCP, and ntp within the cluster
ʎ Extensive documentation and man pages
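As a concrete illustration of how such low-level commands are typically driven from scripts, the short sketch below wraps two standard xCAT CLI tools, nodels (list nodes) and rpower (remote power control), from Python. The node group "compute" is a hypothetical example; the sketch assumes a working xCAT installation and simply shells out to its commands.

```python
# xcat_power_check.py - minimal sketch: list the nodes of a (hypothetical) "compute"
# group via the xCAT command 'nodels' and query their power state via 'rpower'.
import subprocess

def xcat(*args: str) -> str:
    """Run an xCAT command on the management node and return its output."""
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout

nodes = xcat("nodels", "compute").split()      # e.g. ['node001', 'node002', ...]
print(f"{len(nodes)} nodes in group 'compute'")

# 'rpower <noderange> stat' prints one "node: state" line per node.
for line in xcat("rpower", "compute", "stat").splitlines():
    print(line)
```

The same pattern applies to the other administrative commands listed above, for example collecting MAC addresses or remotely resetting a node group.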

For cluster monitoring, we install and configure the open source tool Ganglia or the even more powerful open source solution Nagios, according to the customer's preferences and requirements.

High Performance Computing

Flexible Deployment With xCAT

Page 9

Local Installation or Diskless Installation

We offer a diskful or a diskless installation of the cluster nodes. A diskless installation means the operating system is hosted partially within the main memory; larger parts may or may not be included via NFS or other means. This approach allows for deploying large numbers of nodes very efficiently, and the cluster is up and running within a very small timescale. Updating the cluster can also be done in a very efficient way: only the boot image has to be updated and the nodes rebooted, after which the nodes run either a new kernel or even a new operating system. Moreover, with this approach, partitioning the cluster can also be done very efficiently, either for testing purposes, or for allocating different cluster partitions to different users or applications.

Development Tools, Middleware, and Applications

Depending on the application, optimization strategy, or underlying architecture, different compilers lead to code of very different performance. Moreover, different, mainly commercial, applications require different MPI implementations. And even when the code is self-developed, developers often prefer one MPI implementation over another.

According to the customer's wishes, we install various compilers and MPI middleware, as well as job management systems like Parastation, Grid Engine, Torque/Maui, or the very powerful Moab HPC Suite for high-level cluster management.

Page 10

High Performance Computing

Services and Customer Care from A to Z

transtec HPC as a Service

You will get a range of applications like LS-Dyna, ANSYS, Gromacs, NAMD etc. from all kinds of areas pre-installed, integrated into an enterprise-ready cloud and workload management system, and ready to run. Do you miss your application? Ask us: [email protected]

transtec Platform as a Service

You will be provided with dynamically provided compute nodes for running your individual code. The operating system will be pre-installed according to your requirements. Common Linux distributions like RedHat, CentOS, or SLES are the standard. Do you need another distribution? Ask us: [email protected]

transtec Hosting as a Service

You will be provided with hosting space inside a professionally managed and secured datacenter where you can have your machines hosted, managed, and maintained according to your requirements. Thus, you can build up your own private cloud. What range of hosting and maintenance services do you need? Tell us: [email protected]

HPC @ transtec: Services and Customer Care from A to Z

transtec AG has over 30 years of experience in scientific computing and is one of the earliest manufacturers of HPC clusters. For nearly a decade, transtec has delivered highly customized High Performance clusters based on standard components to academic and industry customers across Europe, with all the high quality standards and the customer-centric approach that transtec is well known for.

Every transtec HPC solution is more than just a rack full of hardware – it is a comprehensive solution with everything the HPC user, owner, and operator need. In the early stages of any customer's HPC project, transtec experts provide extensive and detailed consulting to the customer – they benefit from expertise and experience.

[Diagram: the transtec HPC service workflow – individual presales consulting; application-, customer-, and site-specific sizing of the HPC solution; benchmarking of different systems; burn-in tests of systems; software & OS installation; application installation; onsite hardware assembly; integration into the customer's environment; customer training; maintenance, support & managed services; continual improvement.]

Page 11

Consulting is followed by benchmarking of different systems with either specifically crafted customer code or generally accepted benchmarking routines; this aids customers in sizing and devising the optimal and detailed HPC configuration.

Each and every piece of HPC hardware that leaves our factory undergoes a burn-in procedure of 24 hours or more if necessary. We make sure that any hardware shipped meets our and our customers' quality requirements. transtec HPC solutions are turnkey solutions. By default, a transtec HPC cluster has everything installed and configured – from hardware and operating system to important middleware components like cluster management or developer tools and the customer's production applications. Onsite delivery means onsite integration into the customer's production environment, be it establishing network connectivity to the corporate network, or setting up software and configuration parts.

transtec HPC clusters are ready-to-run systems – we deliver, you turn the key, the system delivers high performance. Every HPC project entails transfer to production: IT operation processes and policies apply to the new HPC system. Effectively, IT personnel is trained hands-on, introduced to hardware components and software, with all operational aspects of configuration management.

transtec services do not stop when the implementation project ends. Beyond transfer to production, transtec takes care. transtec offers a variety of support and service options, tailored to the customer's needs. When you are in need of a new installation, a major reconfiguration or an update of your solution, transtec is able to support your staff and, if you lack the resources for maintaining the cluster yourself, maintain the HPC solution for you. From Professional Services to Managed Services for daily operations and required service levels, transtec will be your complete HPC service and solution provider. transtec's high standards of performance, reliability and dependability assure your productivity and complete satisfaction.

transtec's HPC Managed Services offer customers the possibility of having the complete management and administration of the HPC cluster handled by transtec service specialists, in an ITIL-compliant way. Moreover, transtec's HPC on Demand services provide access to HPC resources whenever customers need them, for example because they do not have the possibility of owning and running an HPC cluster themselves, due to lacking infrastructure, know-how, or admin staff.

transtec HPC Cloud Services

Last but not least, transtec's services portfolio evolves as customers' demands change. Starting this year, transtec is able to provide HPC Cloud Services. transtec uses a dedicated datacenter to provide computing power to customers who are in need of more capacity than they own, which is why this workflow model is sometimes called computing-on-demand. With these dynamically provided resources, customers have the possibility to have their jobs run on HPC nodes in a dedicated datacenter, professionally managed and secured, and individually customizable. Numerous standard applications like ANSYS, LS-Dyna, OpenFOAM, as well as lots of codes like Gromacs, NAMD, VMD, and others are pre-installed, integrated into an enterprise-ready cloud and workload management environment, and ready to run.

Alternatively, whenever customers are in need of space for hosting their own HPC equipment because they do not have the space capacity or cooling and power infrastructure themselves, transtec is also able to provide Hosting Services to those customers who'd like to have their equipment professionally hosted, maintained, and managed. Customers can thus build up their own private cloud!

Are you interested in any of transtec's broad range of HPC related services? Write us an email to [email protected]. We'll be happy to hear from you!

Page 12

Advanced Cluster Management Made Easy

Page 13

Bright Cluster Manager removes the complexity from the installation, management and use of HPC clusters — on-premise or in the cloud.

With Bright Cluster Manager, you can easily install, manage and use multiple clusters simultaneously, including compute, Hadoop, storage, database and workstation clusters.

Page 14

The Bright Advantage

Bright Cluster Manager delivers improved productivity, increased uptime, proven scalability and security, while reducing operating cost:

Rapid Productivity Gains

ʎ Short learning curve: intuitive GUI drives it all
ʎ Quick installation: one hour from bare metal to compute-ready
ʎ Fast, flexible provisioning: incremental, live, disk-full, disk-less, over InfiniBand, to virtual machines, auto node discovery
ʎ Comprehensive monitoring: on-the-fly graphs, Rackview, multiple clusters, custom metrics
ʎ Powerful automation: thresholds, alerts, actions
ʎ Complete GPU support: NVIDIA, AMD, CUDA, OpenCL
ʎ Full support for Intel Xeon Phi
ʎ On-demand SMP: instant ScaleMP virtual SMP deployment
ʎ Fast customization and task automation: powerful cluster management shell, SOAP and JSON APIs make it easy
ʎ Seamless integration with leading workload managers: Slurm, Open Grid Scheduler, Torque, openlava, Maui, PBS Professional, Univa Grid Engine, Moab, LSF
ʎ Integrated (parallel) application development environment
ʎ Easy maintenance: automatically update your cluster from Linux and Bright Computing repositories
ʎ Easily extendable, web-based User Portal
ʎ Cloud-readiness at no extra cost, supporting the scenarios "Cluster-on-Demand" and "Cluster Extension", with data-aware scheduling
ʎ Deploys, provisions, monitors and manages Hadoop clusters
ʎ Future-proof: transparent customization minimizes disruption from staffing changes

Maximum Uptime

ʎ Automatic head node failover: prevents system downtime
ʎ Powerful cluster automation: drives pre-emptive actions based on monitoring thresholds

Advanced Cluster Management Made Easy

Easy-to-use, complete and scalable

The cluster installer takes you through the installation process and offers advanced options such as “Express” and “Remote”.

Page 15

ʎ Comprehensive cluster monitoring and health checking: automatic sidelining of unhealthy nodes to prevent job failure

Scalability from Deskside to TOP500

ʎ Off-loadable provisioning: enables maximum scalability
ʎ Proven: used on some of the world's largest clusters

Minimum Overhead / Maximum Performance

ʎ Lightweight: single daemon drives all functionality
ʎ Optimized: daemon has minimal impact on operating system and applications
ʎ Efficient: single database for all metric and configuration data

Top Security

ʎ Key-signed repositories: controlled, automated security and other updates
ʎ Encryption option: for external and internal communications
ʎ Safe: X509v3 certificate-based public-key authentication
ʎ Sensible access: role-based access control, complete audit trail
ʎ Protected: firewalls and LDAP

A Unified Approach

Bright Cluster Manager was written from the ground up as a totally integrated and unified cluster management solution. This fundamental approach provides comprehensive cluster management that is easy to use and functionality-rich, yet has minimal impact on system performance. It has a single, light-weight daemon, a central database for all monitoring and configuration data, and a single CLI and GUI for all cluster management functionality. Bright Cluster Manager is extremely easy to use, scalable, secure and reliable. You can monitor and manage all aspects of your clusters with virtually no learning curve.

Bright's approach is in sharp contrast with other cluster management offerings, all of which take a "toolkit" approach. These toolkits combine a Linux distribution with many third-party tools for provisioning, monitoring, alerting, etc. This approach has critical limitations: these separate tools were not designed to work together, were often not designed for HPC, and were not designed to scale. Furthermore, each of the tools has its own interface (mostly command-line based), and each has its own daemon(s) and database(s). Countless hours of scripting and testing by highly skilled people are required to get the tools to work for a specific cluster, and much of it goes undocumented. Time is wasted, and the cluster is at risk if staff changes occur, losing the "in-head" knowledge of the custom scripts.

By selecting a cluster node in the tree on the left and the Tasks tab on the right, you can execute a number of powerful tasks on that node with just a single mouse click.

Page 16

Ease of Installation

Bright Cluster Manager is easy to install. Installation and testing of a fully functional cluster from "bare metal" can be completed in less than an hour. Configuration choices made during the installation can be modified afterwards. Multiple installation modes are available, including unattended and remote modes. Cluster nodes can be automatically identified based on switch ports rather than MAC addresses, improving speed and reliability of installation, as well as subsequent maintenance. All major hardware brands are supported: Dell, Cray, Cisco, DDN, IBM, HP, Supermicro, Acer, Asus and more.

Ease of Use

Bright Cluster Manager is easy to use, with two interface options: the intuitive Cluster Management Graphical User Interface (CMGUI) and the powerful Cluster Management Shell (CMSH). The CMGUI is a standalone desktop application that provides a single system view for managing all hardware and software aspects of the cluster through a single point of control. Administrative functions are streamlined as all tasks are performed through one intuitive, visual interface. Multiple clusters can be managed simultaneously. The CMGUI runs on Linux, Windows and OS X, and can be extended using plugins. The CMSH provides practically the same functionality as the CMGUI, but via a command-line interface. The CMSH can be used both interactively and in batch mode via scripts. Either way, you now have unprecedented flexibility and control over your clusters.

Support for Linux and Windows

Bright Cluster Manager is based on Linux and is available with a choice of pre-integrated, pre-configured and optimized Linux distributions, including SUSE Linux Enterprise Server, Red Hat Enterprise Linux, CentOS and Scientific Linux. Dual-boot installations with Windows HPC Server are supported as well, allowing nodes to boot either from the Bright-managed Linux head node, or from the Windows-managed head node.

Advanced Cluster Management Made Easy

Easy-to-use, complete and scalable

The Overview tab provides instant, high-level insight into the status of the cluster.

Page 17

Extensive Development Environment

Bright Cluster Manager provides an extensive HPC development environment for both serial and parallel applications, including the following (some are cost options):

ʎ Compilers, including full suites from GNU, Intel, AMD and Portland Group.
ʎ Debuggers and profilers, including the GNU debugger and profiler, TAU, TotalView, Allinea DDT and Allinea MAP.
ʎ GPU libraries, including CUDA and OpenCL.
ʎ MPI libraries, including OpenMPI, MPICH, MPICH2, MPICH-MX, MPICH2-MX, MVAPICH and MVAPICH2; all cross-compiled with the compilers installed on Bright Cluster Manager, and optimized for high-speed interconnects such as InfiniBand and 10GE.
ʎ Mathematical libraries, including ACML, FFTW, Goto-BLAS, MKL and ScaLAPACK.
ʎ Other libraries, including Global Arrays, HDF5, IPP, TBB, NetCDF and PETSc.

Bright Cluster Manager also provides Environment Modules to make it easy to maintain multiple versions of compilers, libraries and applications for different users on the cluster, without creating compatibility conflicts. Each Environment Module file contains the information needed to configure the shell for an application, and automatically sets these variables correctly for the particular application when it is loaded. Bright Cluster Manager includes many pre-configured module files for many scenarios, such as combinations of compilers, mathematical and MPI libraries.

Powerful Image Management and Provisioning

Bright Cluster Manager features sophisticated software image management and provisioning capability. A virtually unlimited number of images can be created and assigned to as many different categories of nodes as required. Default or custom Linux kernels can be assigned to individual images. Incremental changes to images can be deployed to live nodes without rebooting or re-installation.

The provisioning system only propagates changes to the images, minimizing time and impact on system performance and availability. Provisioning capability can be assigned to any number of nodes on-the-fly, for maximum flexibility and scalability. Bright Cluster Manager can also provision over InfiniBand and to ramdisk or virtual machine.

Comprehensive Monitoring

With Bright Cluster Manager, you can collect, monitor, visualize and analyze a comprehensive set of metrics. Many software and hardware metrics available to the Linux kernel, and many hardware management interface metrics (IPMI, DRAC, iLO, etc.) are sampled.

Cluster metrics, such as GPU, Xeon Phi and CPU temperatures, fan speeds and network statistics can be visualized by simply dragging and dropping them into a graphing window. Multiple metrics can be combined in one graph and graphs can be zoomed into. A Graphing wizard allows creation of all graphs for a selected combination of metrics and nodes. Graph layout and color configurations can be tailored to your requirements and stored for re-use.

Page 18

High performance meets efficiency

Initially, massively parallel systems constitute a challenge to both administrators and users. They are complex beasts. Anyone building HPC clusters will need to tame the beast, master the complexity and present users and administrators with an easy-to-use, easy-to-manage system landscape.

Leading HPC solution providers such as transtec achieve this goal. They hide the complexity of HPC under the hood and match high performance with efficiency and ease-of-use for both users and administrators. The "P" in "HPC" gains a double meaning: "Performance" plus "Productivity".

Cluster and workload management software like Moab HPC Suite, Bright Cluster Manager or QLogic IFS provides the means to master and hide the inherent complexity of HPC systems. For administrators and users, HPC clusters are presented as single, large machines, with many different tuning parameters. The software also provides a unified view of existing clusters whenever the customer adds unified management as a requirement, at any point in time after the first installation. Thus, daily routine tasks such as job management, user management, queue partitioning and management can be performed easily with either graphical or web-based tools, without any advanced scripting skills or technical expertise required from the administrator or user.

Advanced Cluster Management Made Easy

Cloud Bursting With Bright

Page 19

Bright Cluster Manager unleashes and manages the unlimited power of the cloud

Create a complete cluster in Amazon EC2, or easily extend your onsite cluster into the cloud, enabling you to dynamically add capacity and manage these nodes as part of your onsite cluster. Both can be achieved in just a few mouse clicks, without the need for expert knowledge of Linux or cloud computing.

Bright Cluster Manager's unique data aware scheduling capability means that your data is automatically in place in EC2 at the start of the job, and that the output data is returned as soon as the job is completed.

With Bright Cluster Manager, every cluster is cloud-ready, at no extra cost. The same powerful cluster provisioning, monitoring, scheduling and management capabilities that Bright Cluster Manager provides to onsite clusters extend into the cloud.

The Bright advantage for cloud bursting

ʎ Ease of use: Intuitive GUI virtually eliminates the user learning curve; no need to understand Linux or EC2 to manage the system. Alternatively, the cluster management shell provides powerful scripting capabilities to automate tasks
ʎ Complete management solution: Installation/initialization, provisioning, monitoring, scheduling and management in one integrated environment
ʎ Integrated workload management: Wide selection of workload managers included and automatically configured with local, cloud and mixed queues
ʎ Single system view; complete visibility and control: Cloud compute nodes managed as elements of the on-site cluster; visible from a single console with drill-downs and historic data
ʎ Efficient data management via data aware scheduling: Automatically ensures data is in position at the start of computation; delivers results back when complete
ʎ Secure, automatic gateway: Mirrored LDAP and DNS services over an automatically created VPN connect local and cloud-based nodes for secure communication
ʎ Cost savings: More efficient use of cloud resources; support for spot instances, minimal user intervention

Two cloud bursting scenarios

Bright Cluster Manager supports two cloud bursting scenarios: "Cluster-on-Demand" and "Cluster Extension".

Scenario 1: Cluster-on-Demand

The Cluster-on-Demand scenario is ideal if you do not have a cluster onsite, or need to set up a totally separate cluster. With just a few mouse clicks you can instantly create a complete cluster in the public cloud, for any duration of time.

Scenario 2: Cluster Extension

The Cluster Extension scenario is ideal if you have a cluster onsite but you need more compute power, including GPUs. With Bright Cluster Manager, you can instantly add EC2-based resources to your onsite cluster, for any duration of time.

Page 20

Bursting into the cloud is as easy as adding nodes to an onsite cluster — there are only a few additional steps after providing the public cloud account information to Bright Cluster Manager.

The Bright approach to managing and monitoring a cluster in the cloud provides complete uniformity, as cloud nodes are managed and monitored the same way as local nodes:

ʎ Load-balanced provisioning
ʎ Software image management
ʎ Integrated workload management
ʎ Interfaces — GUI, shell, user portal
ʎ Monitoring and health checking
ʎ Compilers, debuggers, MPI libraries, mathematical libraries and environment modules

Bright Cluster Manager also provides additional features that are unique to the cloud.

Advanced Cluster Management Made Easy

Easy-to-use, complete and scalable

The Add Cloud Provider Wizard and the Node Creation Wizard make the cloud bursting process easy, also for users with no cloud experience.

Page 21

Amazon spot instance support

Bright Cluster Manager enables users to take advantage of the cost savings offered by Amazon's spot instances. Users can specify the use of spot instances, and Bright will automatically schedule them as available, reducing the cost to compute without the need to monitor spot prices and manually schedule.

Hardware Virtual Machine (HVM) virtualization

Bright Cluster Manager automatically initializes all Amazon instance types, including Cluster Compute and Cluster GPU instances that rely on HVM virtualization.

Data aware scheduling

Data aware scheduling ensures that input data is transferred to the cloud and made accessible just prior to the job starting, and that the output data is transferred back. There is no need to wait for (and monitor) the data to load prior to submitting jobs (delaying entry into the job queue), nor any risk of starting the job before the data transfer is complete (crashing the job). Users submit their jobs, and Bright's data aware scheduling does the rest.

Bright Cluster Manager provides the choice: "To Cloud, or Not to Cloud"

Not all workloads are suitable for the cloud. The ideal situation for most organizations is to have the ability to choose between onsite clusters for jobs that require low latency communication, complex I/O or sensitive data; and cloud clusters for many other types of jobs.

Bright Cluster Manager delivers the best of both worlds: a powerful management solution for local and cloud clusters, with the ability to easily extend local clusters into the cloud without compromising provisioning, monitoring or managing the cloud resources.

One or more cloud nodes can be configured under the Cloud Settings tab.

Bright Computing

Page 22

Examples include CPU, GPU and Xeon Phi temperatures, fan speeds, switches, hard disk SMART information, system load, memory utilization, network metrics, storage metrics, power system statistics, and workload management metrics. Custom metrics can also easily be defined.

Metric sampling is done very efficiently — in one process, or out-of-band where possible. You have full flexibility over how and when metrics are sampled, and historic data can be consolidated over time to save disk space.

Cluster Management Automation

Cluster management automation takes pre-emptive actions when predetermined system thresholds are exceeded, saving time and preventing hardware damage. Thresholds can be configured on any of the available metrics. The built-in configuration wizard guides you through the steps of defining a rule: selecting metrics, defining thresholds and specifying actions. For example, a temperature threshold for GPUs can be established that results in the system automatically shutting down an overheated GPU unit and sending a text message to your mobile phone.

Advanced Cluster Management Made Easy

Easy-to-use, complete and scalable

The parallel shell allows for simultaneous execution of commands or scripts across node groups or across the entire cluster.

The status of cluster nodes, switches, other hardware, as well as up to six metrics can be visualized in the Rackview. A zoomout option is available for clusters with many racks.

Page 23

Several predefined actions are available, but any built-in cluster management command, Linux command or script can be used as an action.
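Conceptually, such an automation rule boils down to "sample a metric, compare it against a threshold, run an action". The sketch below is not Bright's implementation, just a generic illustration of that pattern in Python; the metric reader, the 80 °C threshold, the node names and the shutdown command are all hypothetical placeholders.

```python
# threshold_rule.py - generic sketch of a monitoring threshold rule:
# periodically sample a metric and trigger an action when the threshold is exceeded.
import subprocess
import time

GPU_TEMP_THRESHOLD = 80.0   # hypothetical threshold in degrees Celsius

def read_gpu_temperature(node: str) -> float:
    """Placeholder metric sampler; a real system would query IPMI, NVML or similar."""
    out = subprocess.run(["ssh", node, "nvidia-smi",
                          "--query-gpu=temperature.gpu", "--format=csv,noheader"],
                         capture_output=True, text=True, check=True).stdout
    return float(out.splitlines()[0])

def shutdown_gpu_node(node: str) -> None:
    """Placeholder action; could equally send a text message or requeue jobs."""
    subprocess.run(["ssh", node, "sudo", "poweroff"], check=False)

while True:
    for node in ["node001", "node002"]:          # hypothetical node names
        if read_gpu_temperature(node) > GPU_TEMP_THRESHOLD:
            shutdown_gpu_node(node)
    time.sleep(60)                               # sampling interval
```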

Comprehensive GPU Management

Bright Cluster Manager radically reduces the time and effort of managing GPUs, and fully integrates these devices into the single view of the overall system. Bright includes powerful GPU management and monitoring capability that leverages functionality in NVIDIA Tesla and AMD GPUs. You can easily assume maximum control of the GPUs and gain instant and time-based status insight. Depending on the GPU make and model, Bright monitors a full range of GPU metrics, including:

ʎ GPU temperature, fan speed, utilization
ʎ GPU exclusivity, compute, display, persistence mode
ʎ GPU memory utilization, ECC statistics
ʎ Unit fan speed, serial number, temperature, power usage, voltages and currents, LED status, firmware
ʎ Board serial, driver version, PCI info

Beyond metrics, Bright Cluster Manager features built-in support for GPU computing with CUDA and OpenCL libraries. Switching between current and previous versions of CUDA and OpenCL has also been made easy.

Full Support for Intel Xeon Phi

Bright Cluster Manager makes it easy to set up and use the Intel Xeon Phi coprocessor. Bright includes everything that is needed to get Phi to work, including a setup wizard in the CMGUI. Bright ensures that your software environment is set up correctly, so that the Intel Xeon Phi coprocessor is available for applications that are able to take advantage of it.

Bright collects and displays a wide range of metrics for Phi, ensuring that the coprocessor is visible and manageable as a device type, as well as including Phi as a resource in the workload management system. Bright's pre-job health checking ensures that Phi is functioning properly before directing tasks to the coprocessor.

Multi-Tasking Via Parallel Shell

The parallel shell allows simultaneous execution of multiple commands and scripts across the cluster as a whole, or across easily definable groups of nodes. Output from the executed commands is displayed in a convenient way with variable levels of verbosity. Running commands and scripts can be killed easily if necessary. The parallel shell is available through both the CMGUI and the CMSH.
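The idea behind a parallel shell can be illustrated with a few lines of Python: run the same command on a group of nodes concurrently and collect the output per node. This is only a generic sketch using ssh and a thread pool from the standard library, with hypothetical node names; it is not the CMSH implementation.

```python
# parallel_shell.py - minimal sketch of parallel command execution across nodes:
# the same command is run over ssh on every node concurrently, output is collected per node.
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = ["node001", "node002", "node003"]       # hypothetical node group

def run_on_node(node: str, command: str) -> tuple[str, str]:
    """Execute a shell command on one node via ssh and return (node, output)."""
    result = subprocess.run(["ssh", node, command],
                            capture_output=True, text=True, timeout=30)
    return node, (result.stdout or result.stderr).strip()

with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
    for node, output in pool.map(lambda n: run_on_node(n, "uptime"), NODES):
        print(f"{node}: {output}")
```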

Integrated Workload Management

Bright Cluster Manager is integrated with a wide selection of free and commercial workload managers. This integration provides a number of benefits:

ʎ The selected workload manager gets automatically installed and configured

The automation configuration wizard guides you through the steps of defining a rule: selecting metrics, defining thresholds and specifying actions

Page 24

ʎ Many workload manager metrics are monitored
ʎ The CMGUI and User Portal provide a user-friendly interface to the workload manager
ʎ The CMSH and the SOAP & JSON APIs provide direct and powerful access to a number of workload manager commands and metrics
ʎ Reliable workload manager failover is properly configured
ʎ The workload manager is continuously made aware of the health state of nodes (see section on Health Checking)
ʎ The workload manager is used to save power through auto power on/off based on workload
ʎ The workload manager is used for data-aware scheduling of jobs to the cloud

The following user-selectable workload managers are tightly integrated with Bright Cluster Manager:

ʎ PBS Professional, Univa Grid Engine, Moab, LSF
ʎ Slurm, openlava, Open Grid Scheduler, Torque, Maui

Alternatively, other workload managers, such as LoadLeveler and Condor, can be installed on top of Bright Cluster Manager.

Integrated SMP Support

Bright Cluster Manager — Advanced Edition dynamically aggregates multiple cluster nodes into a single virtual SMP node, using ScaleMP's Versatile SMP™ (vSMP) architecture. Creating and dismantling a virtual SMP node can be achieved with just a few clicks within the CMGUI.

Advanced Cluster Management Made Easy

Easy-to-use, complete and scalable

Creating and dismantling a virtual SMP node can be achieved with just a few clicks within the GUI or a single command in the cluster management shell.

Example graphs that visualize metrics on a GPU cluster.

Page 25

Virtual SMP nodes can also be launched and dismantled automatically using the scripting capabilities of the CMSH. In Bright Cluster Manager a virtual SMP node behaves like any other node, enabling transparent, on-the-fly provisioning, configuration, monitoring and management of virtual SMP nodes as part of the overall system management.

Maximum Uptime with Head Node Failover

Bright Cluster Manager — Advanced Edition allows two head nodes to be configured in active-active failover mode. Both head nodes are on active duty, but if one fails, the other takes over all tasks, seamlessly.

Maximum Uptime with Health Checking

Bright Cluster Manager — Advanced Edition includes a powerful cluster health checking framework that maximizes system uptime. It continually checks multiple health indicators for all hardware and software components and proactively initiates corrective actions. It can also automatically perform a series of standard and user-defined tests just before starting a new job, to ensure a successful execution and to prevent the "black hole node syndrome". Examples of corrective actions include autonomous bypass of faulty nodes, automatic job requeuing to avoid queue flushing, and process "jailing" to allocate, track, trace and flush completed user processes. The health checking framework ensures the highest job throughput, the best overall cluster efficiency and the lowest administration overhead.

Top Cluster Security

Bright Cluster Manager offers an unprecedented level of security that can easily be tailored to local requirements. Security features include:

ʎ Automated security and other updates from key-signed Linux and Bright Computing repositories
ʎ Encrypted internal and external communications
ʎ X509v3 certificate-based public-key authentication to the cluster management infrastructure
ʎ Role-based access control and complete audit trail
ʎ Firewalls, LDAP and SSH

User and Group Management

Users can be added to the cluster through the CMGUI or the CMSH. Bright Cluster Manager comes with a pre-configured LDAP database, but an external LDAP service, or an alternative authentication system, can be used instead.

Web-Based User Portal

The web-based User Portal provides read-only access to essential cluster information, including a general overview of the cluster status, node hardware and software properties, workload manager statistics and user-customizable graphs. The User Portal can easily be customized and expanded using PHP and the SOAP or JSON APIs.

Workload management queues can be viewed and configured from the GUI, without the need for workload management expertise.

Page 26

Multi-Cluster Capability

Bright Cluster Manager — Advanced Edition is ideal for organizations that need to manage multiple clusters, either in one or in multiple locations. Capabilities include:

ʎ All cluster management and monitoring functionality is available for all clusters through one GUI
ʎ Selecting any set of configurations in one cluster and exporting them to any or all other clusters with a few mouse clicks
ʎ Metric visualizations and summaries across clusters
ʎ Making node images available to other clusters

Fundamentally API-Based

Bright Cluster Manager is fundamentally API-based, which means that any cluster management command and any piece of cluster management data — whether it is monitoring data or configuration data — is available through the API. Both a SOAP and a JSON API are available, and interfaces for various programming languages, including C++, Python and PHP, are provided.
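As a rough illustration of what "everything is available through an API" means in practice, the sketch below issues a JSON request from Python. The endpoint URL, service name, method name and field names are purely hypothetical placeholders, not Bright's actual API; the vendor documentation defines the real call conventions.

```python
# api_query.py - generic sketch of querying a JSON management API from Python.
# The URL, payload keys and values below are hypothetical, for illustration only.
import json
import urllib.request

API_URL = "https://headnode:8081/json"           # hypothetical endpoint
payload = {"service": "monitoring",               # hypothetical service name
           "call": "latestMetricData",            # hypothetical method name
           "args": ["node001", "CPUUsage"]}       # hypothetical node and metric

request = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read().decode("utf-8")))
```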

Cloud Bursting

Bright Cluster Manager supports two cloud bursting scenarios: "Cluster-on-Demand" — running stand-alone clusters in the cloud; and "Cluster Extension" — adding cloud-based resources to existing, onsite clusters and managing these cloud nodes as if they were local. In addition, Bright provides data aware scheduling to ensure that data is accessible in the cloud at the start of jobs, and results are promptly transferred back. Both scenarios can be achieved in a few simple steps. Every Bright cluster is automatically cloud-ready, at no extra cost.

Advanced Cluster Management Made Easy

Easy-to-use, complete and scalable

"The building blocks for transtec HPC solutions must be chosen according to our goals of ease-of-management and ease-of-use. With Bright Cluster Manager, we are happy to have the technology leader at hand, meeting these requirements, and our customers value that."

Armin Jäger, HPC Solution Engineer

Page 27


Hadoop Cluster Management

Bright Cluster Manager is the ideal basis for Hadoop clusters. Bright installs on bare metal, configuring a fully operational Hadoop cluster in less than one hour. In the process, Bright prepares your Hadoop cluster for use by provisioning the operating system and the general cluster management and monitoring capabilities required as on any cluster.

Bright then manages and monitors your Hadoop cluster's hardware and system software throughout its life-cycle, collecting and graphically displaying a full range of Hadoop metrics from the HDFS, RPC and JVM sub-systems. Bright significantly reduces setup time for Cloudera, Hortonworks and other Hadoop stacks, and increases both uptime and MapReduce job throughput.

This functionality is scheduled to be further enhanced in upcoming releases of Bright, including dedicated management roles and profiles for name nodes and data nodes, as well as advanced Hadoop health checking and monitoring functionality.

Standard and Advanced Editions

Bright Cluster Manager is available in two editions: Standard and Advanced. The table on this page lists the differences. You can easily upgrade from the Standard to the Advanced Edition as your cluster grows in size or complexity.

The web-based User Portal provides read-only access to essential cluster information, including a general overview of the cluster status, node hardware and software properties, workload manager statistics and user-customizable graphs.

Bright Cluster Manager can manage multiple clusters simultaneously. This overview shows clusters in Oslo, Abu Dhabi and Houston, all managed through one GUI.

Page 28

Documentation and Services

A comprehensive system administrator manual and user manual are included in PDF format. Standard and tailored services are available, including various levels of support, installation, training and consultancy.

Advanced Cluster Management Made Easy

Easy-to-use, complete and scalable

Feature Standard Advanced

Choice of Linux distributions x x

Intel Cluster Ready x x

Cluster Management GUI x x

Cluster Management Shell x x

Web-Based User Portal x x

SOAP API x x

Node Provisioning x x

Node Identification x x

Cluster Monitoring x x

Cluster Automation x x

User Management x x

Role-based Access Control x x

Parallel Shell x x

Workload Manager Integration x x

Cluster Security x x

Compilers x x

Debuggers & Profilers x x

MPI Libraries x x

Mathematical Libraries x x

Environment Modules x x

Cloud Bursting x x

Hadoop Management & Monitoring x x

NVIDIA CUDA & OpenCL - x

GPU Management & Monitoring - x

Xeon Phi Management & Monitoring - x

ScaleMP Management & Monitoring - x

Redundant Failover Head Nodes - x

Cluster Health Checking - x

Off-loadable Provisioning - x

Multi-Cluster Management - x

Suggested Number of Nodes 4–128 129–10,000+

Standard Support x x

Premium Support Optional Optional

Cluster health checks can be visualized in the Rackview. This screenshot shows that GPU unit 41 fails a health check called “AllFansRunning”.



Intelligent HPC Workload Management



While all HPC systems face challenges in workload demand, resource complexity, and scale, enterprise HPC systems face more stringent challenges and expectations.

Enterprise HPC systems must meet mission-critical and priority HPC workload demands for commercial businesses and business-oriented research and academic organizations. They have complex SLAs and priorities to balance. Their HPC workloads directly impact the revenue, product delivery, and organizational objectives of their organizations.


Enterprise HPC Challenges

Enterprise HPC systems must eliminate job delays and failures. They are also seeking to improve resource utilization and management efficiency across multiple heterogeneous systems. To maximize user productivity, they need to make it easier for users to access and use HPC resources, or even to expand to other clusters or an HPC cloud to better handle surges in workload demand.

Intelligent Workload Management

As today’s leading HPC facilities move beyond petascale towards exascale systems that incorporate increasingly sophisticated and specialized technologies, equally sophisticated and intelligent management capabilities are essential. With a proven history of managing the most advanced, diverse, and data-intensive systems in the Top500, Moab continues to be the preferred workload management solution for next-generation HPC facilities.


Moab is the “brain” of an HPC system, intelligently optimizing workload throughput while balancing service levels and priorities.


Moab HPC Suite – Basic Edition

Moab HPC Suite - Basic Edition is an intelligent workload management solution. It automates the scheduling, managing, monitoring, and reporting of HPC workloads on massive-scale, multi-technology installations. The patented Moab intelligence engine uses multi-dimensional policies to accelerate running workloads across the ideal combination of diverse resources. These policies balance high utilization and throughput goals with competing workload priorities and SLA requirements. The speed and accuracy of the automated scheduling decisions optimize workload throughput and resource utilization. This gets more work accomplished in less time, and in the right priority order. Moab HPC Suite - Basic Edition optimizes the value and satisfaction from HPC systems while reducing management cost and complexity.

Moab HPC Suite – Enterprise Edition

Moab HPC Suite - Enterprise Edition provides enterprise-ready HPC workload management that accelerates productivity, automates workload uptime, and consistently meets SLAs and business priorities for HPC systems and HPC cloud. It uses the battle-tested and patented Moab intelligence engine to automatically balance the complex, mission-critical workload priorities of enterprise HPC systems. Enterprise customers benefit from a single integrated product that brings together key enterprise HPC capabilities, implementation services, and 24x7 support. This speeds the realization of benefits from the HPC system for the business, including:

ʎ Higher job throughput

ʎ Massive scalability for faster response and system expansion

ʎ Optimum utilization of 90-99% on a consistent basis

ʎ Fast, simple job submission and management to increase productivity

ʎ Reduced cluster management complexity and support costs across heterogeneous systems

ʎ Reduced job failures and auto-recovery from failures

ʎ SLAs consistently met for improved user satisfaction

ʎ Reduced power usage and costs by 10-30%

Productivity Acceleration

Moab HPC Suite – Enterprise Edition gets more results delivered faster from HPC resources at lower costs by accelerating overall system, user and administrator productivity. Moab provides the scalability, 90-99 percent utilization, and simple job submission that is required to maximize the productivity of enterprise HPC systems. Enterprise use cases and capabilities include:

ʎ Massive multi-point scalability to accelerate job response and throughput, including high throughput computing

ʎ Workload-optimized allocation policies and provisioning to get more results out of existing heterogeneous resources and reduce costs, including topology-based allocation

ʎ Unify workload management across heterogeneous clusters to maximize resource availability and administration efficiency by managing them as one cluster

ʎ Optimized, intelligent scheduling packs workloads and backfills around priority jobs and reservations while balancing SLAs to efficiently use all available resources

ʎ Optimized scheduling and management of accelerators, both Intel MIC and GPGPUs, for jobs to maximize their utilization and effectiveness

ʎ Simplified job submission and management with advanced job arrays, self-service portal, and templates

ʎ Administrator dashboards and reporting tools reduce management complexity, time, and costs

ʎ Workload-aware auto-power management reduces energy use and costs by 10-30 percent


“With Moab HPC Suite, we can meet very demanding customers’ requirements as regards unified management of heterogeneous cluster environments, grid management, and provide them with flexible and powerful configuration and reporting options. Our customers value that highly.”

Thomas Gebert, HPC Solution Architect


Uptime Automation

Job and resource failures in enterprise HPC systems lead to delayed results, missed organizational opportunities, and missed objectives. Moab HPC Suite – Enterprise Edition intelligently automates workload and resource uptime in the HPC system to ensure that workload completes successfully and reliably, avoiding these failures. Enterprises can benefit from:

ʎ Intelligent resource placement to prevent job failures, with granular resource modeling to meet workload requirements and avoid at-risk resources

ʎ Auto-response to failures and events, with configurable actions for pre-failure conditions, amber alerts, or other metrics and monitors

ʎ Workload-aware future maintenance scheduling that helps maintain a stable HPC system without disrupting workload productivity

ʎ Real-world expertise for fast time-to-value and system uptime, with included implementation, training, and 24x7 remote support services

Auto SLA Enforcement

Moab HPC Suite – Enterprise Edition uses the powerful Moab intelligence engine to optimally schedule and dynamically adjust workload to consistently meet service level agreements (SLAs), guarantees, or business priorities. This automatically ensures that the right workloads are completed at the optimal times, taking into account the complex mix of departments, priorities and SLAs to be balanced. Moab provides:

ʎ Usage accounting and budget enforcement that schedules resources and reports on usage in line with resource-sharing agreements and precise budgets (includes usage limits, usage reports, auto budget management and dynamic fairshare policies)

ʎ SLA and priority policies that make sure the highest-priority workloads are processed first (includes Quality of Service and hierarchical priority weighting; a simplified numeric illustration follows this list)

ʎ Continuous plus future scheduling that ensures priorities and guarantees are proactively met as conditions and workloads change (i.e. future reservations, pre-emption, etc.)
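To make the interplay of these policies more concrete, the following Python sketch models, in a highly simplified way, how a QoS level, a fairshare deficit and queue wait time could be combined into a single job priority. The weights and formula are hypothetical and are not Moab's actual priority calculation, which is configured declaratively in the suite itself.

# Illustrative only: a simplified model of priority scheduling with fairshare.
# Moab's real policy engine is configured declaratively; the weights and the
# formula below are hypothetical.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    qos_priority: int      # e.g. derived from a Quality-of-Service class
    user_fs_target: float  # fairshare target for the submitting user (%)
    user_fs_usage: float   # recent actual usage for that user (%)
    queued_hours: float    # time spent waiting in the queue

def priority(job, qos_weight=1000, fs_weight=100, wait_weight=10):
    """Higher fairshare deficit and longer wait raise priority."""
    fs_deficit = job.user_fs_target - job.user_fs_usage
    return (qos_weight * job.qos_priority
            + fs_weight * fs_deficit
            + wait_weight * job.queued_hours)

jobs = [
    Job("crash_sim", qos_priority=3, user_fs_target=25, user_fs_usage=40, queued_hours=1),
    Job("cfd_run",   qos_priority=1, user_fs_target=25, user_fs_usage=5,  queued_hours=6),
]
for j in sorted(jobs, key=priority, reverse=True):
    print(j.name, round(priority(j), 1))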

Grid- and Cloud-Ready Workload Management

The benefits of a traditional HPC environment can be extended to more efficiently manage and meet workload demand with the multi-cluster grid and HPC cloud management capabilities in Moab HPC Suite - Enterprise Edition. It allows you to:

ʎ Showback or chargeback for pay-for-use, so actual resource usage is tracked with flexible chargeback rates and reporting by user, department, cost center, or cluster

ʎ Manage and share workload across multiple remote clusters to meet growing workload demand or surges by adding on the Moab HPC Suite – Grid Option.

Moab HPC Suite – Grid Option

The Moab HPC Suite - Grid Option is a powerful grid-workload management solution that includes scheduling, advanced policy management, and tools to control all the components of advanced grids. Unlike other grid solutions, Moab HPC Suite - Grid Option connects disparate clusters into a logical whole, enabling grid administrators and grid policies to have sovereignty over all systems while preserving control at the individual cluster.

Moab HPC Suite - Grid Option has powerful applications that allow organizations to consolidate reporting; information gathering; and workload, resource, and data management. Moab HPC Suite - Grid Option delivers these services in a near-transparent way: users are unaware they are using grid resources—they know only that they are getting work done faster and more easily than ever before.


Moab manages many of the largest clusters and grids in the world. Moab technologies are used broadly across Fortune 500 companies and manage more than a third of the compute cores in the top 100 systems of the Top500 supercomputers. Adaptive Computing is a globally trusted ISV (independent software vendor), and the full scalability and functionality that Moab HPC Suite with the Grid Option offers in a single integrated solution has traditionally made it a significantly more cost-effective option than other tools on the market today.

Components

The Moab HPC Suite - Grid Option is available as an add-on to Moab HPC Suite - Basic Edition and Enterprise Edition. It extends the capabilities and functionality of the Moab HPC Suite components to manage grid environments, including:

ʎ Moab Workload Manager for Grids – a policy-based workload management and scheduling multi-dimensional decision engine

ʎ Moab Cluster Manager for Grids – a powerful and unified graphical administration tool for monitoring, managing and reporting across multiple clusters

ʎ Moab Viewpoint™ for Grids – a Web-based self-service end-user job-submission and management portal and administrator dashboard

Benefits

ʎ Unified management across heterogeneous clusters provides the ability to move quickly from cluster to optimized grid

ʎ Policy-driven and predictive scheduling ensures that jobs start and run in the fastest time possible by selecting optimal resources

ʎ Flexible policy and decision engine adjusts workload processing at both grid and cluster level

ʎ Grid-wide interface and reporting tools provide a view of grid resources, status and usage charts, and trends over time for capacity planning, diagnostics, and accounting

ʎ Advanced administrative control allows various business units to access and view grid resources, regardless of physical or organizational boundaries, or alternatively restricts access to resources by specific departments or entities

ʎ Scalable architecture to support peta-scale, high-throughput computing and beyond

Grid Control with Automated Tasks, Policies, and Reporting

ʎ Guarantee that the most-critical work runs first with flexible global policies that respect local cluster policies but continue to support grid service-level agreements

ʎ Ensure availability of key resources at specific times with advanced reservations

ʎ Tune policies prior to rollout with cluster- and grid-level simulations

ʎ Use a global view of all grid operations for self-diagnosis, planning, reporting, and accounting across all resources, jobs, and clusters

Moab HPC Suite - Grid Option can be flexibly configured for centralized, centralized-and-local, or peer-to-peer grid policies, decisions and rules. It is able to manage multiple resource managers and multiple Moab instances.


Harnessing the Power of Intel Xeon Phi and GPGPUs

The mathematical acceleration delivered by many-core General Purpose Graphics Processing Units (GPGPUs) offers significant performance advantages for many classes of numerically intensive applications. Parallel computing tools such as NVIDIA’s CUDA and the OpenCL framework have made harnessing the power of these technologies much more accessible to developers, resulting in the increasing deployment of hybrid GPGPU-based systems and introducing significant challenges for administrators and workload management systems.

The new Intel Xeon Phi coprocessors offer breakthrough performance for highly parallel applications with the benefit of leveraging existing code and flexibility in programming models for faster application development. Moving an application to Intel Xeon Phi coprocessors has the promising benefit of requiring substantially less effort and lower power usage. Such a small investment can reap big performance gains for both data-intensive and numerical or scientific computing applications that can make the most of the Intel Many Integrated Cores (MIC) technology. So it is no surprise that organizations are quickly embracing these new coprocessors in their HPC systems.

The main goal is to create breakthrough discoveries, products and research that improve our world. Harnessing and optimizing the power of these new accelerator technologies is key to doing this as quickly as possible. Moab HPC Suite helps organizations easily integrate accelerator and coprocessor technologies into their HPC systems, optimizing their utilization for the right workloads and problems they are trying to solve.


Moab HPC Suite automatically schedules hybrid systems incorporating Intel Xeon Phi coprocessors and GPGPU accelerators, optimizing their utilization as just another resource type with policies. Organizations can choose the accelerators that best meet their different workload needs.

Managing Hybrid Accelerator Systems

Hybrid accelerator systems add a new dimension of management complexity when allocating workloads to available resources. In addition to the traditional needs of aligning workload placement with software stack dependencies, CPU type, memory, and interconnect requirements, intelligent workload management systems now need to consider:

ʎ The workload’s ability to exploit Xeon Phi or GPGPU technology

ʎ Additional software dependencies

ʎ Current health and usage status of available Xeon Phis or GPGPUs

ʎ Resource configuration for type and number of Xeon Phis or GPGPUs

Moab HPC Suite auto-detects which types of accelerators are where in the system, to reduce the management effort and costs as these processors are introduced and maintained together in an HPC system. This gives customers maximum choice and performance in selecting the accelerators that work best for each of their workloads, whether Xeon Phi, GPGPU or a hybrid mix of the two.

Moab HPC Suite optimizes accelerator utilization with policies that ensure the right ones get used for the right user and group jobs at the right time.


Optimizing Intel Xeon Phi and GPGPU Utilization

System and application diversity is one of the first issues workload management software must address in optimizing accelerator utilization. The ability of current codes to use GPGPUs and Xeon Phis effectively, ongoing cluster expansion, and costs usually mean only a portion of a system will be equipped with one or a mix of both types of accelerators.

Moab HPC Suite has a twofold role in reducing the complexity of their administration and optimizing their utilization. First, it must be able to automatically detect new GPGPUs or Intel Xeon Phi coprocessors in the environment and their availability, without the cumbersome burden of manual administrator configuration. Second, and most importantly, it must accurately match Xeon Phi-bound or GPGPU-enabled workloads with the appropriately equipped resources, in addition to managing contention for those limited resources according to organizational policy. Moab’s powerful allocation and prioritization policies can ensure the right jobs from users and groups get to the optimal accelerator resources at the right time. This keeps the accelerators at peak utilization for the right priority jobs.

Moab’s policies are a powerful ally to administrators in automatically determining the optimal Xeon Phi coprocessor or GPGPU to use, and which ones to avoid, when scheduling jobs. These allocation policies can be based on any of the GPGPU or Xeon Phi metrics, such as memory (to ensure job needs are met), temperature (to avoid hardware errors or failures), utilization, or other metrics.


Use the following metrics in policies to optimize allocation of Intel Xeon Phi coprocessors:

ʎ Number of Cores

ʎ Number of Hardware Threads

ʎ Physical Memory

ʎ Free Physical Memory

ʎ Swap Memory

ʎ Free Swap Memory

ʎ Max Frequency (in MHz)

Use the following metrics in policies to optimize allocation of GPGPUs and for management diagnostics (a small collection sketch follows the list):

ʎ Error counts; single/double-bit, commands to reset counts

ʎ Temperature

ʎ Fan speed

ʎ Memory; total, used, utilization

ʎ Utilization percent

ʎ Metrics time stamp
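To illustrate where such GPGPU metrics come from on a node, the Python sketch below reads them with NVIDIA's nvidia-smi utility. It is only an illustration of the raw data involved; Moab gathers these values through its resource-manager integration, not through this script, and the exact query fields assume a reasonably recent NVIDIA driver.

# Illustrative only: read per-GPU health metrics of the kind listed above
# using nvidia-smi. Not Moab's internal collection mechanism.
import subprocess

QUERY = "temperature.gpu,fan.speed,memory.total,memory.used,utilization.gpu"

def gpu_metrics():
    """Return a list of dicts, one per GPU, with basic health metrics."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}",
         "--format=csv,noheader,nounits"],
        text=True)
    metrics = []
    for line in out.strip().splitlines():
        temp, fan, mem_total, mem_used, util = [v.strip() for v in line.split(",")]
        metrics.append({"temperature_c": int(temp),
                        "fan_pct": fan,  # may be "[N/A]" on passively cooled cards
                        "memory_total_mb": int(mem_total),
                        "memory_used_mb": int(mem_used),
                        "utilization_pct": int(util)})
    return metrics

if __name__ == "__main__":
    for i, m in enumerate(gpu_metrics()):
        print(f"GPU {i}: {m}")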

Improve GPGPU Job Speed and Success

The GPGPU drivers supplied by vendors today allow multiple jobs to share a GPGPU without corrupting results, but sharing GPGPUs and other key GPGPU factors can significantly impact the speed and success of GPGPU jobs and the final level of service delivered to the system users and the organization.

Consider the example of a user estimating the time for a GPGPU job based on exclusive access to a GPGPU, while the workload manager allows the GPGPU to be shared when the job is scheduled. The job will likely exceed the estimated time and be terminated by the workload manager unsuccessfully, leaving a very unsatisfied user and the organization, group or project they represent.

Moab HPC Suite provides the ability to schedule GPGPU jobs to run exclusively and finish in the shortest time possible for certain types or classes of jobs or users, to ensure job speed and success.
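The arithmetic behind that example is simple, and the tiny Python sketch below spells it out with hypothetical numbers: a job whose walltime request assumes exclusive GPGPU access will overrun and be killed once the device is shared.

# Illustrative arithmetic for the example above: a job sized for exclusive
# GPU access that ends up sharing the GPU. All numbers are hypothetical.
exclusive_runtime_h = 2.0          # user's estimate with a dedicated GPGPU
requested_walltime_h = 2.5         # walltime limit derived from that estimate
sharing_jobs = 2                   # jobs sharing the same GPGPU

# Crude model: per-job throughput roughly divides by the number of sharers.
shared_runtime_h = exclusive_runtime_h * sharing_jobs

if shared_runtime_h > requested_walltime_h:
    print(f"Job needs ~{shared_runtime_h} h but is killed at "
          f"{requested_walltime_h} h -> failed run and wasted GPU time")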


Cluster Sovereignty and Trusted Sharing

ʎ Guarantee that shared resources are allocated fairly with global policies that fully respect local cluster configuration and needs

ʎ Establish trust between resource owners through graphical usage controls, reports, and accounting across all shared resources

ʎ Maintain cluster sovereignty with granular settings to control where jobs can originate and be processed

ʎ Establish resource ownership and enforce appropriate access levels with prioritization, preemption, and access guarantees

Increase User Collaboration and Productivity

ʎ Reduce end-user training and job management time with easy-to-use graphical interfaces

ʎ Enable end users to easily submit and manage their jobs through an optional web portal, minimizing the costs of catering to a growing base of needy users

ʎ Collaborate more effectively with multi-cluster co-allocation, allowing key resource reservations for high-priority projects

ʎ Leverage saved job templates, allowing users to submit multiple jobs quickly and with minimal changes

ʎ Speed job processing with enhanced grid placement options for job arrays; optimal or single cluster placement

The Visual Cluster view in Moab Cluster Manager for Grids makes cluster and grid management easy and efficient.

Process More Work in Less Time to Maximize ROI

ʎ Achieve higher, more consistent resource utilization with intelligent scheduling, matching job requests to the best-suited resources, including GPGPUs

ʎ Use optimized data staging to ensure that remote data transfers are synchronized with resource availability to minimize poor utilization

ʎ Allow local cluster-level optimizations of most grid workload

Unify Management Across Independent Clusters

ʎ Unify management across existing internal, external, and partner clusters – even if they have different resource managers, databases, operating systems, and hardware

ʎ Out-of-the-box support for both local area and wide area grids

ʎ Manage secure access to resources with simple credential mapping or interface with popular security tool sets

ʎ Leverage existing data-migration technologies, such as SCP or GridFTP

Moab HPC Suite – Application Portal Edition
Remove the Barriers to Harnessing HPC

Organizations of all sizes and types are looking to harness the potential of high performance computing (HPC) to enable and accelerate more competitive product design and research. To do this, you must find a way to extend the power of HPC resources to designers, researchers and engineers not familiar with or trained on using the specialized HPC technologies. Simplifying user access to and interaction with applications running on more efficient HPC clusters removes the barriers to discovery and innovation for all types of departments and organizations.

Moab® HPC Suite – Application Portal Edition provides easy single-point access to common technical applications, data, HPC resources, and job submissions to accelerate the research analysis and collaboration process for end users while reducing computing costs.

HPC Ease of Use Drives Productivity

Moab HPC Suite – Application Portal Edition improves ease of use and productivity for end users of HPC resources with single-point access to their common technical applications, data, resources, and job submissions across the design and research process. It simplifies the specification and running of jobs on HPC resources to a few simple clicks on an application-specific template, with key capabilities including:

ʎ Application-centric job submission templates for common manufacturing, energy, life-sciences, education and other industry technical applications

ʎ Support for batch and interactive applications, such as simulations or analysis sessions, to accelerate the full project or design cycle

ʎ No special HPC training required, enabling more users to start accessing HPC resources through an intuitive portal

ʎ Distributed data management that avoids unnecessary remote file transfers for users, storing data optimally on the network for easy access and protection, and optimizing file transfer to/from user desktops if needed

Accelerate Collaboration While Maintaining Control

More and more projects are done across teams and with partner and industry organizations. Moab HPC Suite – Application Portal Edition enables you to accelerate this collaboration between individuals and teams while maintaining control, as additional users inside and outside the organization can easily and securely access project applications, HPC resources and data to speed project cycles. Key capabilities and benefits you can leverage are:

ʎ Encrypted and controllable access, by class of user, service, application or resource, for remote and partner users improves collaboration while protecting infrastructure and intellectual property

ʎ Multi-site, globally available HPC services, available anytime and anywhere, accessible through many different types of devices via a standard web browser

ʎ Reduced client requirements for projects mean new project members can quickly contribute without being limited by their desktop capabilities

Reduce Costs with Optimized Utilization and Sharing

Moab HPC Suite – Application Portal Edition reduces the costs of technical computing by optimizing resource utilization and the sharing of resources, including HPC compute nodes, application licenses, and network storage. The patented Moab intelligence engine and its powerful policies deliver value in key areas of utilization and sharing:

ʎ Maximize HPC resource utilization with intelligent, optimized scheduling policies that pack and enable more work to be done on HPC resources to meet growing and changing demand

ʎ Optimize application license utilization and access by sharing application licenses in a pool, re-using and optimizing usage with allocation policies that integrate with license managers for lower costs and better service levels to a broader set of users

ʎ Usage budgeting and priorities enforcement with usage accounting and priority policies that ensure resources are shared in line with service level agreements and project priorities

ʎ Leverage and fully utilize centralized storage capacity instead of duplicative, expensive, and underutilized individual workstation storage

Key Intelligent Workload Management Capabilities:

ʎ Massive multi-point scalability

ʎ Workload-optimized allocation policies and provisioning

ʎ Unified workload management across heterogeneous clusters

ʎ Optimized, intelligent scheduling

ʎ Optimized scheduling and management of accelerators (both Intel MIC and GPGPUs)

ʎ Administrator dashboards and reporting tools

ʎ Workload-aware auto-power management

ʎ Intelligent resource placement to prevent job failures

ʎ Auto-response to failures and events

ʎ Workload-aware future maintenance scheduling

ʎ Usage accounting and budget enforcement

ʎ SLA and priority policies

ʎ Continuous plus future scheduling

Moab HPC Suite – Remote Visualization Edition
Removing Inefficiencies in Current 3D Visualization Methods

Using 3D and 2D visualization, valuable data is interpreted and new innovations are discovered. There are numerous inefficiencies in how current technical computing methods support users and organizations in achieving these discoveries in a cost-effective way. High-cost workstations and their software are often not fully utilized by the limited user(s) that have access to them. They are also costly to manage, and they don’t keep pace for long with workload technology requirements. In addition, the workstation model requires slow transfers of large data sets between workstations. This is costly and inefficient in terms of network bandwidth, user productivity, and team collaboration, while posing increased data and security risks. There is a solution that removes these inefficiencies using new technical cloud computing models. Moab HPC Suite – Remote Visualization Edition significantly reduces the costs of visualization technical computing while improving the productivity, collaboration and security of the design and research process.

Reduce the Costs of Visualization Technical Computing

Moab HPC Suite – Remote Visualization Edition significantly reduces the hardware, network and management costs of visualization technical computing. It enables you to create easy, centralized access to shared visualization compute, application and data resources. These compute, application, and data resources reside in a technical compute cloud in your data center instead of in expensive and underutilized individual workstations.

ʎ Reduce hardware costs by consolidating expensive individual technical workstations into centralized visualization servers for higher utilization by multiple users; reduce additive or upgrade technical workstation or specialty hardware purchases, such as GPUs, for individual users

ʎ Reduce management costs by moving from remote user workstations that are difficult and expensive to maintain, upgrade, and back up to centrally managed visualization servers that require less admin overhead

ʎ Decrease network access costs and congestion, as significantly lower loads of just compressed, visualized pixels are moving to users, not full data sets

ʎ Reduce energy usage, as centralized visualization servers consume less energy for the user demand met

ʎ Reduce data storage costs by consolidating data into common storage node(s) instead of under-utilized individual storage

The Application Portal Edition easily extends the power of HPC to your users and projects to drive innovation and competitive advantage.


Consolidation and Grid

Many sites have multiple clusters as a result of having multiple independent groups or locations, each with demands for HPC, and frequent additions of newer machines. Each new cluster increases the overall administrative burden and overhead. Additionally, many of these systems can sit idle while others are overloaded. Because of this systems-management challenge, sites turn to grids to maximize the efficiency of their clusters. Grids can be enabled either independently or in conjunction with one another in three areas:

ʎ Reporting Grids: Managers want to have global reporting across all HPC resources so they can see how users and projects are really utilizing hardware and so they can effectively plan capacity. Unfortunately, manually consolidating all of this information in an intelligible manner for more than just a couple of clusters is a management nightmare. To solve that problem, sites will create a reporting grid, or share information across their clusters for reporting and capacity-planning purposes.

ʎ Management Grids: Managing multiple clusters independently can be especially difficult when processes change, because policies must be manually reconfigured across all clusters. To ease that difficulty, sites often set up management grids that impose a synchronized management layer across all clusters while still allowing each cluster some level of autonomy.

ʎ Workload-Sharing Grids: Sites with multiple clusters often have the problem of some clusters sitting idle while other clusters have large backlogs. Such inequality in cluster utilization wastes expensive resources, and training users to perform different workload-submission routines across various clusters can be difficult and expensive as well. To avoid these problems, sites often set up workload-sharing grids. These grids can be as simple as centralizing user submission or as complex as having each cluster maintain its own user submission routine with an underlying grid-management tool that migrates jobs between clusters.

Inhibitors to Grid Environments

Three common inhibitors keep sites from enabling grid environments:

ʎ Politics: Because grids combine resources across users, groups, and projects that were previously independent, grid implementation can be a political nightmare. To create a grid in the real world, sites need a tool that allows clusters to retain some level of sovereignty while participating in the larger grid.

ʎ Multiple Resource Managers: Most sites have a variety of resource managers used by various groups within the organization, and each group typically has a large investment in scripts that are specific to one resource manager and that cannot be changed. To implement grid computing effectively, sites need a robust tool that has flexibility in integrating with multiple resource managers.

ʎ Credentials: Many sites have different log-in credentials for each cluster, and those credentials are generally independent of one another. For example, one user might be Joe.P on one cluster and J_Peterson on another. To enable grid environments, sites must create a combined user space that can recognize and combine these different credentials.

Using Moab in a Grid

Sites can use Moab to set up any combination of reporting, management, and workload-sharing grids. Moab is a grid metascheduler that allows sites to set up grids that work effectively in the real world. Moab’s feature-rich functionality overcomes the inhibitors of politics, multiple resource managers, and varying credentials by providing:

ʎ Grid Sovereignty: Moab has multiple features that break down political barriers by letting sites choose how each cluster shares in the grid. Sites can control what information is shared between clusters and can specify which workload is passed between clusters. In fact, sites can even choose to let each cluster be completely sovereign in making decisions about grid participation for itself.

ʎ Support for Multiple Resource Managers: Moab metaschedules across all common resource managers. It fully integrates with TORQUE and SLURM, the two most common open-source resource managers, and also has limited integration with commercial tools such as PBS Pro and SGE. Moab’s integration includes the ability to recognize when a user has a script that requires one of these tools, and Moab can intelligently ensure that the script is sent to the correct machine. Moab even has the ability to translate common scripts across multiple resource managers (a simplified sketch of this idea follows this list).

ʎ Credential Mapping: Moab can map credentials across clusters to ensure that users and projects are tracked appropriately and to provide consolidated reporting.
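The sketch below illustrates, in plain Python, the two ideas from the last bullets: choosing the right submit command for each cluster's resource manager, and mapping a grid-wide identity to the local account on each cluster, using the Joe.P / J_Peterson example from above. It is a toy site-level wrapper, not how Moab implements script translation or credential mapping; the cluster names and the mapping table are hypothetical, while qsub (TORQUE) and sbatch (SLURM) are the real submit commands of those resource managers.

# Illustrative only: a toy site-level wrapper that picks the submit command
# for a cluster's resource manager and maps the user's credential per cluster.
# Not how Moab implements translation; cluster names are hypothetical.
import subprocess

CLUSTERS = {
    # cluster name -> (resource manager, submit command)
    "cluster_a": ("torque", ["qsub"]),    # TORQUE/PBS-style submission
    "cluster_b": ("slurm",  ["sbatch"]),  # SLURM-style submission
}

# Per-cluster credential mapping, as in the Joe.P / J_Peterson example above.
CREDENTIAL_MAP = {
    ("jpeterson", "cluster_a"): "Joe.P",
    ("jpeterson", "cluster_b"): "J_Peterson",
}

def submit(grid_user, cluster, script_path):
    """Submit a job script to the chosen cluster under the mapped local user."""
    rm, cmd = CLUSTERS[cluster]
    local_user = CREDENTIAL_MAP[(grid_user, cluster)]
    print(f"Submitting {script_path} to {cluster} ({rm}) as {local_user}")
    # In a real deployment the submission would run under the mapped account;
    # here we simply invoke the resource manager's submit command.
    return subprocess.run(cmd + [script_path], check=True)

# Example: submit("jpeterson", "cluster_b", "my_job.sh")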


Improve Collaboration, Productivity and Security

With Moab HPC Suite – Remote Visualization Edition, you can improve the productivity, collaboration and security of the design and research process by transferring only pixels instead of data to users for their simulations and analysis. This enables a wider range of users to collaborate and be more productive at any time, on the same data, from anywhere, without data transfer time lags or security issues. Users also have improved immediate access to specialty applications and resources, like GPUs, they might need for a project, so they are no longer limited by personal workstation constraints or to a single working location.

ʎ Improve workforce collaboration by enabling multiple users to access and collaborate on the same interactive application data at the same time from almost any location or device, with little or no training needed

ʎ Eliminate data transfer time lags for users by keeping data close to the visualization applications and compute resources instead of transferring it back and forth to remote workstations; only smaller compressed pixels are transferred

ʎ Improve security with centralized data storage, so data no longer gets transferred to where it shouldn’t, gets lost, or gets inappropriately accessed on remote workstations; only pixels get moved

Maximize Utilization and Shared Resource Guarantees

Moab HPC Suite – Remote Visualization Edition maximizes your resource utilization and scalability while guaranteeing shared resource access to users and groups with optimized workload management policies. Moab’s patented policy intelligence engine optimizes the scheduling of visualization sessions across the shared resources to maximize standard and specialty resource utilization, helping them scale to support larger volumes of visualization workloads. These intelligent policies also guarantee shared resource access to users to ensure a high service level as they transition from individual workstations.


ʎ Guarantee shared visualization resource access for users with priority policies and usage budgets that ensure they receive the appropriate service level. Policies include session priorities and reservations, number of users per node, fairshare usage policies, and usage accounting and budgets across multiple groups or users.

ʎ Maximize resource utilization and scalability by packing visualization workload optimally on shared servers using Moab allocation and scheduling policies. These policies can include optimal resource allocation by visualization session characteristics (like CPU, memory, etc.), workload packing, number of users per node, GPU policies, etc.

ʎ Improve application license utilization to reduce software costs and improve access with a common pool of 3D visualization applications shared across multiple users, with usage optimized by intelligent Moab allocation policies, integrated with license managers.

ʎ Optimized scheduling and management of GPUs and other accelerators to maximize their utilization and effectiveness

ʎ Enable multiple Windows or Linux user sessions on a single machine, managed by Moab, to maximize resource and power utilization.

ʎ Enable dynamic OS re-provisioning on resources, managed by Moab, to optimally meet changing visualization application workload demand and maximize resource utilization and availability to users.

The Remote Visualization Edition makes visualization more cost effective and productive with an innovative technical compute cloud.


transtec HPC solutions are designed for maximum flexibility and ease of management. We not only offer our customers the most powerful and flexible cluster management solution out there, but also provide them with customized setup and site-specific configuration. Whether a customer needs a dynamic Linux-Windows dual-boot solution, unified management of different clusters at different sites, or fine-tuning of the Moab scheduler to implement fine-grained policy configuration – transtec not only gives you the framework at hand, but also helps you adapt the system according to your special needs. Needless to say, when customers are in need of special trainings, transtec will be there to provide customers, administrators, or users with specially adjusted Educational Services.

Many years of experience in High Performance Computing have enabled us to develop efficient concepts for installing and deploying HPC clusters. For this, we leverage well-proven 3rd-party tools, which we assemble into a total solution and adapt and configure according to the customer’s requirements. We manage everything from the complete installation and configuration of the operating system, through necessary middleware components like job and cluster management systems, up to the customer’s applications.


Moab HPC Suite Use Case (columns: Basic / Enterprise / App. Portal / Remote Visualization)

Productivity Acceleration
Massive multifaceted scalability: x / x / x / x
Workload optimized resource allocation: x / x / x / x
Optimized, intelligent scheduling of workload and resources: x / x / x / x
Optimized accelerator scheduling & management (Intel MIC & GPGPU): x / x / x / x
Simplified user job submission & mgmt. - Application Portal: x / x / x / x
Administrator dashboard and management reporting: x / x / x / x
Unified workload management across heterogeneous cluster(s): x (single cluster only) / x / x / x
Workload optimized node provisioning: - / x / x / x
Visualization Workload Optimization: - / - / - / x
Workload-aware auto-power management: - / x / x / x

Uptime Automation
Intelligent resource placement for failure prevention: x / x / x / x
Workload-aware maintenance scheduling: x / x / x / x
Basic deployment, training and premium (24x7) support service: Standard support only / x / x / x
Auto response to failures and events: x (limited, no re-boot/provision) / x / x / x

Auto SLA Enforcement
Resource sharing and usage policies: x / x / x / x
SLA and priority policies: x / x / x / x
Continuous and future reservations scheduling: x / x / x / x
Usage accounting and budget enforcement: - / x / x / x

Grid- and Cloud-Ready
Manage and share workload across multiple clusters in a wide area grid: x (w/ Grid Option) / x (w/ Grid Option) / x (w/ Grid Option) / x (w/ Grid Option)
User self-service: job request & management portal: x / x / x / x
Pay-for-use showback and chargeback: - / x / x / x
Workload optimized node provisioning & re-purposing: - / x / x / x
Multiple clusters in single domain


INFINIBAND – HIGH-SPEED INTERCONNECTS

NICE EnginFrame – A Technical Computing Portal


InfiniBand (IB) is an efficient I/O technology that provides high-speed data transfers and ultra-low latencies for computing and storage over a highly reliable and scalable single fabric. The InfiniBand industry-standard ecosystem creates cost-effective hardware and software solutions that easily scale from generation to generation.

InfiniBand is a high-bandwidth, low-latency network interconnect solution that has grown tremendous market share in the High Performance Computing (HPC) cluster community. InfiniBand was designed to take the place of today’s data center networking technology. In the late 1990s, a group of next-generation I/O architects formed an open, community-driven network technology to provide scalability and stability based on successes from other network designs. Today, InfiniBand is a popular and widely used I/O fabric among customers within the Top500 supercomputers: major Universities and Labs; Life Sciences; Biomedical; Oil and Gas (Seismic, Reservoir, Modeling Applications); Computer Aided Design and Engineering; Enterprise Oracle; and Financial Applications.

HPC systems and enterprise grids deliver unprecedented time-to-market and performance advantages to many research and corporate customers, struggling every day with compute- and data-intensive processes. This often generates or transforms massive amounts of jobs and data that need to be handled and archived efficiently to deliver timely information to users distributed in multiple locations, with different security concerns.

Poor usability of such complex systems often negatively impacts users’ productivity, and ad-hoc data management often increases information entropy and dissipates knowledge and intellectual property.


NICE EnginFrame

A Technical Portal for Remote Visualization

Solving distributed computing issues for our customers, it is easy to understand that a modern, user-friendly web front-end to HPC and grids can drastically improve engineering productivity, if properly designed to address the specific challenges of the Technical Computing market.

NICE EnginFrame overcomes many common issues in the areas of usability, data management, security and integration, opening the way to a broader, more effective use of the Technical Computing resources.

The key components of our solutions are:

ʎ a flexible and modular Java-based kernel, with clear separation between customizations and core services

ʎ powerful data management features, reflecting the typical needs of engineering applications

ʎ comprehensive security options and a fine-grained authorization system

ʎ a scheduler abstraction layer to adapt to different workload and resource managers

ʎ responsive and competent support services


End users can typically enjoy the following improvements:

ʎ user-friendly, intuitive access to computing resources, using a standard web browser

ʎ application-centric job submission

ʎ organized access to job information and data

ʎ increased mobility and reduced client requirements

On the other side, the Technical Computing Portal delivers significant added value for system administrators and IT:

ʎ reduced training needs to enable users to access the resources

ʎ centralized configuration and immediate deployment of services

ʎ comprehensive authorization to access services and information

ʎ reduced support calls and submission errors

Coupled with our Remote Visualization solutions, our customers quickly deploy end-to-end engineering processes on their Intranet, Extranet or Internet.

EnginFrame Highlights

ʎ Universal and flexible access to your Grid infrastructure: EnginFrame provides easy access over intranet, extranet or the Internet using standard and secure Internet languages and protocols.

ʎ Interface to the Grid in your organization: EnginFrame offers pluggable server-side XML services to enable easy integration of the leading workload schedulers in the market. Plug-in modules are available for all major grid solutions, including Platform Computing LSF and OCS, Microsoft Windows HPC Server 2008, Sun GridEngine, UnivaUD UniCluster, IBM LoadLeveler, Altair PBS/Pro and OpenPBS, Torque, Condor, and EDG/gLite Globus-based toolkits, providing easy interfaces to the existing computing infrastructure.

ʎ Security and Access Control: You can give encrypted and controllable access to remote users and improve collaboration with partners while protecting your infrastructure and Intellectual Property (IP). This enables you to speed up design cycles while working with related departments or external partners, and eases communication through a secure infrastructure.

ʎ Distributed Data Management: The comprehensive remote file management built into EnginFrame avoids unnecessary file transfers and enables server-side treatment of data, as well as file transfer from/to the user’s desktop.

ʎ Interactive application support: Through the integration with leading GUI virtualization solutions (including Desktop Cloud Visualization (DCV), VNC, VirtualGL and NX), EnginFrame enables you to address all stages of the design process, both batch and interactive.

ʎ SOA-enable your applications and resources: EnginFrame can automatically publish your computing applications via standard Web Services (tested for both Java and .NET interoperability), enabling immediate integration with other enterprise applications (a minimal client sketch follows this list).
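As a rough illustration of the web-services integration described in the last highlight, the sketch below posts a SOAP request to a published service using only the Python standard library. The endpoint URL, namespace and operation name are hypothetical placeholders; a real client would be generated from the WSDL that EnginFrame publishes for the application in question.

# Illustrative only: calling a SOAP web service of the kind EnginFrame can
# publish. Endpoint, namespace and operation name are hypothetical.
import urllib.request

ENDPOINT = "https://portal.example.com/enginframe/ws/jobs"  # hypothetical

SOAP_BODY = """<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <listJobs xmlns="urn:example:jobs"/>   <!-- hypothetical operation -->
  </soap:Body>
</soap:Envelope>"""

def call_service(endpoint=ENDPOINT):
    """POST a SOAP envelope and return the raw XML response."""
    req = urllib.request.Request(
        endpoint,
        data=SOAP_BODY.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Example (requires a reachable service): print(call_service())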

Business Benefits

Web and Cloud Portal:

ʎ Simplify user access; minimize the training requirements for HPC users; broaden the user community

ʎ Enable access from intranet, internet or extranet


Aerospace, Automotive and Manufacturing

The complex factors involved in CAE range from compute-intensive data analysis to worldwide collaboration between designers, engineers, OEMs and suppliers. To accommodate these factors, Cloud (both internal and external) and Grid Computing solutions are increasingly seen as a logical step to optimize IT infrastructure usage for CAE. It is no surprise, then, that automotive and aerospace manufacturers were the early adopters of internal Cloud and Grid portals.

Manufacturers can now develop more “virtual products” and simulate all types of designs, fluid flows, and crash simulations. Such virtualized products and more streamlined collaboration environments are revolutionizing the manufacturing process.

With NICE EnginFrame in their CAE environment, engineers can take the process even further by connecting design and simulation groups in “collaborative environments” to get even greater benefits from “virtual products”. Thanks to EnginFrame, CAE engineers can have a simple, intuitive collaborative environment that takes care of issues related to:

ʎ Access & Security - where an organization must give access to external and internal entities such as designers, engineers and suppliers.

ʎ Distributed collaboration - simple and secure connection of design and simulation groups distributed worldwide.

ʎ Time spent on IT tasks - by eliminating time and resources spent using cryptic job submission commands or acquiring knowledge of underlying compute infrastructures, engineers can spend more time concentrating on their core design tasks.


EnginFrame’s web-based interface can be used to access the compute resources required for CAE processes. This means access to job submission & monitoring tasks and the input & output data associated with industry-standard CAE/CFD applications for Fluid Dynamics, Structural Analysis, Electro Design and Design Collaboration (like Abaqus, Ansys, Fluent, MSC Nastran, PAMCrash, LS-Dyna, Radioss), without cryptic job submission commands or knowledge of underlying compute infrastructures.

EnginFrame has a long history of usage in some of the most prestigious manufacturing organizations worldwide, including Aerospace companies like AIRBUS, Alenia Space, CIRA, Galileo Avionica, Hamilton Sunstrand, Magellan Aerospace, MTU, and Automotive companies like Audi, ARRK, Brawn GP, Bridgestone, Bosch, Delphi, Elasis, Ferrari, FIAT, GDX Automotive, Jaguar-LandRover, Lear, Magneti Marelli, McLaren, P+Z, RedBull, Swagelok, Suzuki, Toyota, TRW.

Life Sciences and Healthcare

NICE solutions are deployed in the Life Sciences sector at companies like BioLab, Partners HealthCare, Pharsight and the M.D. Anderson Cancer Center, and also in leading research projects like DEISA or LitBio, in order to allow easy and transparent use of computing resources without any insight into the HPC infrastructure.

The Life Science and Healthcare sectors have some very strict requirements when choosing an IT solution like EnginFrame, for instance:

ʎ Security - To meet the strict security & privacy requirements of the biomedical and pharmaceutical industry, any solution needs to take account of multiple layers of security and authentication.

ʎ Industry-specific software - ranging from the simplest custom tool to the more general-purpose free and open middlewares.

EnginFrame’s modular architecture allows different Grid middlewares and software (including leading Life Science applications like Schroedinger Glide, EPFL RAxML, the BLAST family, Taverna, and R) to be exploited. Users can compose elementary services into complex applications and “virtual experiments”, and run, monitor and build workflows via a standard Web browser. EnginFrame also has highly tuneable resource sharing and fine-grained access control, where multiple authentication systems (like Active Directory, Krb5 or LDAP) can be exploited simultaneously.


Web and Cloud Portal (continued):

ʎ “One Stop Shop” - the portal as the starting point for knowledge and resources

ʎ Enable collaboration and sharing through the portal as a virtual meeting point (employees, partners, and affiliates)

ʎ Single Sign-on

ʎ Integrations for many technical computing ISV and open-source applications and infrastructure components

Business Continuity:

ʎ Virtualizes servers and server locations, data and storage locations, applications and licenses

ʎ Move to thin desktops for graphical workstation applications by moving graphics processing into the datacenter (this can also increase utilization and re-use of licenses)

ʎ Enable multi-site and global locations and services

ʎ Data maintained on network storage - not on laptop/desktop

ʎ Cloud and ASP - create, or take advantage of, Cloud and ASP (Application Service Provider) services to extend your business, grow new revenue, and manage costs.


“The amount of data resulting from e.g. simulations in CAE or other engineering environments can be in the Gigabyte range. It is obvious that remote post-processing is one of the most urgent topics to be tackled. NICE EnginFrame provides exactly that, and our customers are impressed that such great technology enhances their workflow so significantly.”

Robin Kienecker, HPC Sales Specialist


Compliance:

ʎ Ensure secure and auditable access and use of resources (applications, data, infrastructure, licenses)

ʎ Enables IT to provide better uptime and service levels to users

ʎ Allow collaboration with partners while protecting Intellectual Property and resources

ʎ Restrict access by class of user, service, application, and resource

Web Services:

ʎ Wrap legacy applications (command line, local API/SDK) with a web services SOAP/.NET/Java interface for access in your SOA environment

ʎ Workflow your human and automated business processes and supply chain

ʎ Integrate with corporate policy for Enterprise Portal or Business Process Management (works with SharePoint®, WebLogic®, WebSphere®, etc.).

Desktop Cloud Virtualization

NICE Desktop Cloud Visualization (DCV) is an advanced technology that enables Technical Computing users to remotely access 2D/3D interactive applications over a standard network.

Engineers and scientists are immediately empowered by taking full advantage of high-end graphics cards, fast I/O performance and large-memory nodes hosted in a “Public or Private 3D Cloud”, rather than waiting for the next upgrade of their workstations.

The DCV protocol adapts to heterogeneous networking infrastructures like LAN, WAN and VPN, to deal with bandwidth and latency constraints. All applications run natively on the remote machines, which can be virtualized and share the same physical GPU.

In a typical visualization scenario, a software application sends a stream of graphics commands to a graphics adapter through an input/output (I/O) interface. The graphics adapter renders the data into pixels and outputs them to the local display as a video signal. When using NICE DCV, the scene geometry and graphics state are rendered on a central server, and pixels are sent to one or more remote displays. This approach requires the server to be equipped with one or more GPUs, which are used for the OpenGL rendering, while the client software can run on “thin” devices.

The NICE DCV architecture consists of:

ʎ a DCV Server, equipped with one or more GPUs, used for OpenGL rendering

ʎ one or more DCV EndStations, running on “thin clients”, only used for visualization

ʎ heterogeneous networking infrastructures (like LAN, WAN and VPN), with optimized balancing of quality vs. frame rate

NICE DCV Highlights

ʎ enables high-performance remote access to interactive 2D/3D software applications over low-bandwidth/high-latency networks

ʎ supports multiple heterogeneous operating systems (Windows, Linux)

ʎ enables GPU sharing

ʎ supports 3D acceleration for OpenGL applications running on Virtual Machines


Remote Visualization

It’s human nature to want to ‘see’ the results from simulations, tests, and analyses. Until recently, this has meant ‘fat’ workstations on many user desktops. This approach provides CPU power when the user wants it – but as dataset size increases, there can be delays in downloading the results. Also, sharing the results with colleagues means gathering around the workstation - not always possible in this globalized, collaborative workplace.

Increasing dataset complexity (millions of polygons, interacting components, MRI/PET overlays) means that as the time comes to upgrade and replace the workstations, the next generation of hardware needs more memory, more graphics processing, more disk, and more CPU cores. This makes the workstation expensive, in need of cooling, and noisy.

Innovation in the field of remote 3D processing now allows companies to address these issues by moving applications away from the desktop into the data center. Instead of pushing data to the application, the application can be moved near the data.

Instead of mass workstation upgrades, Remote Visualization allows incremental provisioning, on-demand allocation, better management and efficient distribution of interactive sessions and licenses. Racked workstations or blades typically have lower maintenance, cooling and replacement costs, and they can extend workstation (or laptop) life as “thin clients”.


The solution
Leveraging their expertise in distributed computing and Web-based application portals, NICE offers an integrated solution to access, load-balance and manage applications and desktop sessions running within a visualization farm. The farm can include both Linux and Windows resources, running on heterogeneous hardware.

The core of the solution is the EnginFrame Visualization plug-in, which delivers Web-based services to access and manage applications and desktops published in the farm. This solution has been integrated with:
ʎ NICE Desktop Cloud Visualization (DCV)
ʎ HP Remote Graphics Software (RGS)
ʎ RealVNC
ʎ TurboVNC and VirtualGL
ʎ NoMachine NX

Coupled with these third-party remote visualization engines (which specialize in delivering high frame rates for 3D graphics), the NICE offering for Remote Visualization solves the issues of user authentication, dynamic session allocation, session management and data transfers.

End users can enjoy the following improvements:
ʎ Intuitive, application-centric Web interface to start, control and re-connect to a session
ʎ Single sign-on for batch and interactive applications
ʎ All data transfers from and to the remote visualization farm are handled by EnginFrame
ʎ Built-in collaboration, to share sessions with other users
ʎ The load and usage of the visualization cluster is monitored in the browser

The solution also delivers significant added value for system administrators:
ʎ No need for SSH / SCP / FTP on the client machine
ʎ Easy integration into identity services, Single Sign-On (SSO), and enterprise portals
ʎ Automated data life-cycle management
ʎ Built-in user session sharing, to facilitate support
ʎ Interactive sessions are load-balanced by a scheduler (LSF, SGE or Torque) to achieve optimal performance and resource usage
ʎ Better control and use of application licenses
ʎ Monitoring, control and management of users' idle sessions

transtec has long-term experience in engineering environments, especially in the CAD/CAE sector. This allows us to provide customers from this area with solutions that greatly enhance their workflow and minimize time-to-result. This, together with transtec's offering of all kinds of services, allows our customers to fully focus on their productive work, and have us do the environmental optimizations.


ʎ Supports multi-user collaboration via session sharing
ʎ Enables an attractive return on investment through resource sharing and consolidation in data centers (GPU, memory, CPU, ...)
ʎ Keeps the data secure in the data center, reducing data load and saving time
ʎ Enables right-sizing of system allocation based on users' dynamic needs
ʎ Facilitates application deployment: all applications, updates and patches are instantly available to everyone, without any changes to the original code

Business Benefits
The business benefits of adopting NICE DCV can be summarized in four categories:

Category – Business Benefits

Productivity
ʎ Increase business efficiency
ʎ Improve team performance by ensuring real-time collaboration with colleagues and partners, anywhere
ʎ Reduce IT management costs by consolidating workstation resources to a single point of management
ʎ Save money and time on application deployment
ʎ Let users work from anywhere there is an Internet connection


Business Continuity
ʎ Move graphics processing and data to the data center – not onto laptops and desktops
ʎ Cloud-based platform support enables you to scale the visualization solution "on demand" to extend the business, grow new revenue, and manage costs

Data Security
ʎ Guarantee secure and auditable use of remote resources (applications, data, infrastructure, licenses)
ʎ Allow real-time collaboration with partners while protecting intellectual property and resources
ʎ Restrict access by class of user, service, application, and resource

Training Effectiveness
ʎ Enable multiple users to follow application procedures alongside an instructor in real time
ʎ Enable collaboration and session sharing among remote users (employees, partners, and affiliates)

NICE DCV is perfectly integrated into EnginFrame Views, leveraging 2D/3D capabilities over the Web, including the ability to share an interactive session with other users for collaborative working.

Supported operating systems

Windows:
ʎ Microsoft Windows 7 – 32/64 bit
ʎ Microsoft Windows Vista – 32/64 bit
ʎ Microsoft Windows XP – 32/64 bit

Linux:
ʎ Red Hat Enterprise Linux 4, 5, 5.5, 6 – 32/64 bit
ʎ SUSE Linux Enterprise Server 11 – 32/64 bit

© 2012 by NICE


Intel Cluster Ready – A Quality Standard for HPC Clusters


Intel Cluster Ready is designed to create predictable expectations for users and providers of HPC clusters, primarily targeting customers in the commercial and industrial sectors. These are not the experimental "test-bed" clusters used for computer science and computer engineering research, nor the high-end "capability" clusters, closely targeting their specific computing requirements, that power high-energy physics at the national labs or other specialized research organizations.

Intel Cluster Ready seeks to advance HPC clusters used as computing resources in production environments by providing cluster owners with a high degree of confidence that the clusters they deploy will run the applications their scientific and engineering staff rely upon to do their jobs. It achieves this by providing cluster hardware, software, and system providers with a precisely defined basis for their products to meet their customers' production cluster requirements.


What are the Objectives of ICR?
The primary objective of Intel Cluster Ready is to make clusters easier to specify, easier to buy, easier to deploy, and to make it easier to develop applications that run on them. A key feature of ICR is the concept of "application mobility", which is defined as the ability of a registered Intel Cluster Ready application – more correctly, the same binary – to run correctly on any certified Intel Cluster Ready cluster. Clearly, application mobility is important for users, software providers, hardware providers, and system providers.

ʎ Users want to know the cluster they choose will reliably run the applications they rely on today, and will rely on tomorrow
ʎ Application providers want to satisfy the needs of their customers by providing applications that reliably run on their customers' cluster hardware and cluster stacks
ʎ Cluster stack providers want to satisfy the needs of their customers by providing a cluster stack that supports their customers' applications and cluster hardware
ʎ Hardware providers want to satisfy the needs of their customers by providing hardware components that support their customers' applications and cluster stacks
ʎ System providers want to satisfy the needs of their customers by providing complete cluster implementations that reliably run their customers' applications

Without application mobility, each group above must either try to support all combinations, which they have neither the time nor the resources to do, or pick the "winning combination(s)" that best supports their needs, and risk making the wrong choice. The Intel Cluster Ready definition of application portability supports all of these needs by going beyond pure portability (recompiling and linking a unique binary for each platform) to application binary mobility (running the same binary on multiple platforms), by more precisely defining the target system.

"The Intel Cluster Checker allows us to certify that our transtec HPC clusters are compliant with an independent, high quality standard. Our customers can rest assured: their applications run as they expect."
Marcus Wiedemann, HPC Solution Engineer


A further aspect of application mobility is to ensure that registered Intel Cluster Ready applications do not need special programming or alternate binaries for different message fabrics. Intel Cluster Ready accomplishes this by providing an MPI implementation supporting multiple fabrics at runtime; through this, registered Intel Cluster Ready applications obey the "message layer independence property". Stepping back, the unifying concept of Intel Cluster Ready is "one-to-many", that is:
ʎ One application will run on many clusters
ʎ One cluster will run many applications

How is one-to-many accomplished? Looking at Figure 1, you see the abstract Intel Cluster Ready "stack" components that always exist in every cluster, i.e., one or more applications, a cluster software stack, one or more fabrics, and finally the underlying cluster hardware. The remainder of that picture (to the right) shows the components in greater detail.

[Figure 1: The ICR stack – registered applications (e.g. CFD, crash, climate, QCD, bio) sit on a single solution platform made up of the Intel MPI Library and Intel MKL Cluster Edition runtimes, Intel-selected Linux cluster tools, Intel development tools (C++, Intel Trace Analyzer and Collector, MKL, etc.), and optional value-add by the individual platform integrator; underneath are the fabrics (InfiniBand/OFED, Gigabit and 10 Gbit Ethernet) and certified cluster platforms based on the Intel Xeon processor platform, delivered by Intel, OEMs and platform integrators.]

Applications, on the top of the stack, rely upon the various APIs, utilities, and file system structure presented by the underlying software stack. Registered Intel Cluster Ready applications are always able to rely upon the APIs, utilities, and file system structure specified by the Intel Cluster Ready Specification; if an application requires software outside this "required" set, then Intel Cluster Ready requires the application to provide that software as part of its installation. To ensure that this additional per-application software doesn't conflict with the cluster stack or other applications, Intel Cluster Ready also requires the additional software to be installed in application-private trees, so the application knows how to find that software while not interfering with other applications. While this may well cause duplicate software to be installed, the reliability provided by the duplication far outweighs the cost of the duplicated files. A prime example supporting this comparison is the removal of a common file (library, utility, or other) that is unknowingly needed by some other application – such errors can be insidious to repair even when they cause an outright application failure.
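As an illustration of the application-private-tree idea, the following is a minimal sketch of a launch wrapper a registered application could ship; the install prefix /opt/acme_cfd and the binary name acme_cfd are purely hypothetical and not part of the ICR specification.

```python
import os

# Hypothetical application-private install prefix; a registered application
# ships any extra runtimes it needs below its own tree instead of relying on
# (or modifying) the cluster-wide software stack.
APP_PREFIX = "/opt/acme_cfd"

def private_environment():
    """Environment in which the bundled binary resolves its own libraries first."""
    env = dict(os.environ)
    env["LD_LIBRARY_PATH"] = os.pathsep.join(
        [os.path.join(APP_PREFIX, "lib"), env.get("LD_LIBRARY_PATH", "")])
    env["PATH"] = os.pathsep.join(
        [os.path.join(APP_PREFIX, "bin"), env.get("PATH", "")])
    return env

if __name__ == "__main__":
    env = private_environment()
    # The wrapper would now exec the bundled binary, e.g. with subprocess:
    #   subprocess.call([os.path.join(APP_PREFIX, "bin", "acme_cfd"), "case1.dat"], env=env)
    print(env["LD_LIBRARY_PATH"].split(os.pathsep)[0])   # -> /opt/acme_cfd/lib
```

Because the private tree is prepended only for this application's own processes, it cannot shadow libraries for other applications on the same certified cluster.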

Cluster platforms, at the bottom of the stack, provide the APIs, utilities, and file system structure relied upon by registered applications. Certified Intel Cluster Ready platforms ensure that the APIs, utilities, and file system structure are complete per the Intel Cluster Ready Specification; certified clusters are able to provide them by various means as they deem appropriate. Because of the clearly defined responsibilities ensuring the presence of all software required by registered applications, system providers have high confidence that the certified clusters they build are able to run any certified applications their customers rely on. In addition to meeting the Intel Cluster Ready requirements, certified clusters can also provide added value, that is, other features and capabilities that increase the value of their products.

How Does Intel Cluster Ready Accomplish its Objectives?
At its heart, Intel Cluster Ready is a definition of the cluster as a parallel application platform, as well as a tool to certify an actual cluster against that definition. Let's look at each of these in more detail, to understand their motivations and benefits.

A definition of the cluster as parallel application platform
The Intel Cluster Ready Specification is very much written as the requirements for, not the implementation of, a platform upon which parallel applications, more specifically MPI applications, can be built and run. As such, the specification doesn't care whether the cluster is diskful or diskless, fully distributed or single system image (SSI), built from "Enterprise" distributions or community distributions, fully open source or not. Perhaps more importantly, with one exception, the specification doesn't have any requirements on how the cluster is built; that one exception is that compute nodes must be built with automated tools, so that new, repaired, or replaced nodes can be rebuilt identically to the existing nodes without any manual interaction, other than possibly initiating the build process.

Some items the specification does care about include:
ʎ The ability to run both 32- and 64-bit applications, including MPI applications and X clients, on any of the compute nodes
ʎ Consistency among the compute nodes' configuration, capability, and performance
ʎ The identical accessibility of libraries and tools across the cluster
ʎ The identical access by each compute node to permanent and temporary storage, as well as to users' data
ʎ The identical access to each compute node from the head node
ʎ An MPI implementation that provides fabric independence
ʎ All nodes support network booting and provide a remotely accessible console

The specification also requires that the runtimes for specific Intel software products are installed on every certified cluster:
ʎ Intel Math Kernel Library
ʎ Intel MPI Library Runtime Environment
ʎ Intel Threading Building Blocks

This requirement does two things. First and foremost, mainline Linux distributions do not necessarily provide a sufficient software stack to build an HPC cluster – such specialization is beyond their mission. Secondly, the requirement ensures that programs built with this software will always work on certified clusters and enjoy simpler installations. As these runtimes are directly available from the web, the requirement does not cause additional costs for certified clusters. It is also very important to note that this does not require certified applications to use these libraries, nor does it preclude alternate libraries, e.g., other MPI implementations, from being present on certified clusters. Quite clearly, an application that requires, e.g., an alternate MPI, must also provide the runtimes for that MPI as part of its installation.

A tool to certify an actual cluster to the definition
The Intel Cluster Checker, included with every certified Intel Cluster Ready implementation, is used in four modes during the life of a cluster:
ʎ To certify a system provider's prototype cluster as a valid implementation of the specification
ʎ To verify to the owner that the just-delivered cluster is a "true copy" of the certified prototype
ʎ To ensure the cluster remains fully functional, reducing service calls not related to the applications or the hardware
ʎ To help software and system providers diagnose and correct actual problems in their code or their hardware

While these are critical capabilities, in all fairness, this greatly understates the capabilities of the Intel Cluster Checker. The tool does far more than verify that the cluster is performing as expected. To do this, per-node and cluster-wide static and dynamic tests are made of the hardware and software.

[Figure: Intel Cluster Checker – the Cluster Checker engine reads a cluster definition and configuration XML file, drives test modules (each running parallel operations checks against the cluster nodes via a config/result API), and reports pass/fail results and diagnostics to stdout and a logfile.]


The static checks ensure the systems are configured consistently and appropriately. As one example, the tool will ensure the systems are all running the same BIOS version as well as having identical settings for key BIOS options. This type of problem – differing BIOS versions or settings – can be the root cause of subtle problems such as differing memory configurations that manifest themselves as differing memory bandwidths, only to be seen at the application level as slower-than-expected overall performance. As is well known, parallel program performance can very much be governed by the performance of the slowest components, not the fastest. In another static check, the Intel Cluster Checker will ensure that the expected tools, libraries, and files are present on each node, identically located on all nodes, as well as identically implemented on all nodes. This ensures that each node has the minimal software stack required by the specification, as well as an identical software stack among the compute nodes.
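To make the idea of such cross-node consistency tests concrete, here is a simplified sketch (not the Intel Cluster Checker itself); the node list and the probe commands are assumptions chosen for illustration.

```python
import subprocess
from collections import defaultdict

# Hypothetical compute nodes and probe commands; a real check would cover
# BIOS settings, kernel versions, installed libraries, file locations, and more.
NODES = ["node01", "node02", "node03", "node04"]
PROBES = {
    "bios_version": "dmidecode -s bios-version",
    "kernel": "uname -r",
    "libc": "ldd --version | head -n 1",
}

def probe(node, command):
    """Run a probe command on a node via ssh and return its trimmed output."""
    out = subprocess.run(["ssh", node, command], capture_output=True, text=True)
    return out.stdout.strip()

def check_consistency():
    """Flag any probe whose result differs between nodes."""
    for name, command in PROBES.items():
        values = defaultdict(list)
        for node in NODES:
            values[probe(node, command)].append(node)
        if len(values) > 1:
            print(f"INCONSISTENT {name}: {dict(values)}")
        else:
            print(f"OK {name}: {next(iter(values))}")

if __name__ == "__main__":
    check_consistency()
```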

A typical dynamic check ensures consistent system performance, e.g., via the STREAM benchmark. This particular test ensures that processor and memory performance is consistent across compute nodes, which, like the BIOS setting example above, can be the root cause of overall slower application performance. An additional check with STREAM can be made if the user configures an expectation of benchmark performance; this check will ensure that performance is not only consistent across the cluster, but also meets expectations. Going beyond processor performance, the Intel MPI Benchmarks are used to ensure the network fabric(s) are performing properly and, with a configuration that describes expected performance levels, up to the cluster provider's performance expectations. Network inconsistencies due to poorly performing Ethernet NICs, InfiniBand HCAs, faulty switches, and loose or faulty cables can be identified.
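The following sketch illustrates this style of dynamic check; the per-node STREAM numbers, the expected minimum, and the tolerance are made-up values, not figures from the text or from Intel's tool.

```python
from statistics import median

# Hypothetical per-node STREAM "Triad" results in MB/s; in practice these
# would be collected by running the benchmark on every compute node.
stream_triad = {
    "node01": 81000.0,
    "node02": 80500.0,
    "node03": 52000.0,   # e.g. a node with a mis-set memory configuration
    "node04": 80800.0,
}

EXPECTED_MIN = 75000.0   # assumed cluster provider's performance expectation
TOLERANCE = 0.10         # allow 10% deviation from the cluster median

def check_stream(results):
    """Flag nodes that deviate from the median or miss the expected minimum."""
    med = median(results.values())
    for node, bw in sorted(results.items()):
        issues = []
        if abs(bw - med) / med > TOLERANCE:
            issues.append(f"deviates from median {med:.0f}")
        if bw < EXPECTED_MIN:
            issues.append(f"below expected minimum {EXPECTED_MIN:.0f}")
        status = "OK" if not issues else "FAIL (" + "; ".join(issues) + ")"
        print(f"{node}: {bw:.0f} MB/s {status}")

if __name__ == "__main__":
    check_stream(stream_triad)
```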

Finally, the Intel Cluster Checker is extensible, enabling additional tests to be added to support additional features and capabilities. This enables the Intel Cluster Checker to support not only the minimal requirements of the Intel Cluster Ready Specification, but the full cluster as delivered to the customer.

Conforming hardware and software
The preceding was primarily related to the builders of certified clusters and the developers of registered applications. For end users who want to purchase a certified cluster to run registered applications, the ability to identify registered applications and certified clusters is most important, as it reduces their effort to evaluate, acquire, and deploy the clusters that run their applications, and then keep that computing resource operating properly, with full performance, directly increasing their productivity.

Intel, Xeon and certain other trademarks and logos appearing in this brochure are trademarks or registered trademarks of Intel Corporation.

[Figure: Intel Cluster Ready Program – a certified cluster platform (Intel processor, server platform, interconnect, software, configuration) surrounded by the program elements: specification, software tools, reference designs, ISV enabling, demand creation, and support & training.]

Intel Cluster Ready Builds HPC Momentum
With the Intel Cluster Ready (ICR) program, Intel Corporation set out to create a win-win scenario for the major constituencies in the high-performance computing (HPC) cluster market. Hardware vendors and independent software vendors (ISVs) stand to win by being able to ensure both buyers and users that their products will work well together straight out of the box. System administrators stand to win by being able to meet corporate demands to push HPC competitive advantages deeper into their organizations while satisfying end users' demands for reliable HPC cycles, all without increasing IT staff. End users stand to win by being able to get their work done faster, with less downtime, on certified cluster platforms. Last but not least, with ICR, Intel has positioned itself to win by expanding the total addressable market (TAM) and reducing time to market for the company's microprocessors, chip sets, and platforms.

The Worst of Times
For a number of years, clusters were largely confined to government and academic sites, where contingents of graduate students and midlevel employees were available to help program and maintain the unwieldy early systems. Commercial firms lacked this low-cost labor supply and mistrusted the favored cluster operating system, open-source Linux, on the grounds that no single party could be held accountable if something went wrong with it. Today, cluster penetration in the HPC market is deep and wide, extending from systems with a handful of processors to some of the world's largest supercomputers, and from under $25,000 to tens or hundreds of millions of dollars in price. Clusters increasingly pervade every HPC vertical market: biosciences, computer-aided engineering, chemical engineering, digital content creation, economic/financial services, electronic design automation, geosciences/geo-engineering, mechanical design, defense, government labs, academia, and weather/climate.

But IDC studies have consistently shown that clusters remain difficult to specify, deploy, and manage, especially for new and less experienced HPC users. This should come as no surprise, given that a cluster is a set of independent computers linked together by software and networking technologies from multiple vendors.

Clusters originated as do-it-yourself HPC systems. In the late 1990s users began employing inexpensive hardware to cobble together scientific computing systems based on the "Beowulf cluster" concept first developed by Thomas Sterling and Donald Becker at NASA. From their Beowulf origins, clusters have evolved and matured substantially, but the system management issues that plagued their early years remain in force today.

The Need for Standard Cluster Solutions
The escalating complexity of HPC clusters poses a dilemma for many large IT departments that cannot afford to scale up their HPC-knowledgeable staff to meet the fast-growing end-user demand for technical computing resources. Cluster management is even more problematic for smaller organizations and business units that often have no dedicated, HPC-knowledgeable staff to begin with.

The ICR program aims to address burgeoning cluster complexity by making available a standard solution (i.e., a reference architecture) for Intel-based systems that hardware vendors can use to certify their configurations and that ISVs and other software vendors can use to test and register their applications, system software, and HPC management software. The chief goal of this voluntary compliance program is to ensure fundamental hardware-software integration and interoperability so that system administrators and end users can confidently purchase and deploy HPC clusters, and get their work done, even in cases where no HPC-knowledgeable staff are available to help.

The ICR program wants to prevent end users from having to become, in effect, their own systems integrators. In smaller organizations, the ICR program is designed to allow overworked IT departments with limited or no HPC expertise to support HPC user requirements more readily. For larger organizations with dedicated HPC staff, ICR creates confidence that required user applications will work, eases the problem of system administration, and allows HPC cluster systems to be scaled up in size without scaling support staff. ICR can help drive HPC cluster resources deeper into larger organizations and free up IT staff to focus on mainstream enterprise applications (e.g., payroll, sales, HR, and CRM).

The program is a three-way collaboration among hardware vendors, software vendors, and Intel. In this triple alliance, Intel provides the specification for the cluster architecture implementation, and then vendors certify the hardware configurations and register software applications as compliant with the specification. The ICR program's promise to system administrators and end users is that registered applications will run out of the box on certified hardware configurations.

ICR solutions are compliant with the standard platform architecture, which starts with 64-bit Intel Xeon processors in an Intel-certified cluster hardware platform. Layered on top of this foundation are the interconnect fabric (Gigabit Ethernet, InfiniBand) and the software stack: Intel-selected Linux cluster tools, an Intel MPI runtime library, and the Intel Math Kernel Library. Intel runtime components are available and verified as part of the certification (e.g., Intel tool runtimes) but are not required to be used by applications. The inclusion of these Intel runtime components does not exclude any other components a systems vendor or ISV might want to use. At the top of the stack are Intel-registered ISV applications.

At the heart of the program is the Intel Cluster Checker, a validation tool that verifies that a cluster is specification compliant and operational before ISV applications are ever loaded. After the cluster is up and running, the Cluster Checker can function as a fault isolation tool in wellness mode. Certification needs to happen only once for each distinct hardware platform, while verification – which determines whether a valid copy of the specification is operating – can be performed by the Cluster Checker at any time.

Cluster Checker is an evolving tool that is designed to accept new test modules. It is a productized tool that ICR members ship with their systems. Cluster Checker originally was designed for homogeneous clusters but can now also be applied to clusters with specialized nodes, such as all-storage sub-clusters. Cluster Checker can isolate a wide range of problems, including network or communication problems.


transtec offers its customers a new and fascinating way to evaluate transtec's HPC solutions in real-world scenarios. With the transtec Benchmarking Center, solutions can be explored in detail with the actual applications the customers will later run on them. Intel Cluster Ready makes this feasible by simplifying the maintenance of the systems and allowing clean systems to be set up very easily, and as often as needed. As High Performance Computing (HPC) systems are utilized for numerical simulations, more and more advanced clustering technologies are being deployed. Because of their performance, price/performance and energy efficiency advantages, clusters now dominate all segments of the HPC market and continue to gain acceptance. HPC computer systems have become far more widespread and pervasive in government, industry, and academia. However, the client rarely has the possibility to test their actual application on the system they are planning to acquire.

The transtec Benchmarking Center
transtec HPC solutions are used by a wide variety of clients. Among them are most of the large users of compute power at German and other European universities and research centers, as well as governmental users like the German army's compute center and clients from the high-tech, automotive and other sectors. transtec HPC solutions have demonstrated their value in more than 500 installations. Most of transtec's cluster systems are based on SUSE Linux Enterprise Server, Red Hat Enterprise Linux, CentOS, or Scientific Linux. With xCAT for efficient cluster deployment, and the Moab HPC Suite by Adaptive Computing for high-level cluster management, transtec is able to efficiently deploy and ship easy-to-use HPC cluster solutions with enterprise-class management features. Moab has proven to provide easy-to-use workload and job management for small systems as well as the largest cluster installations worldwide. However, when selling clusters to governmental customers as well as other large enterprises, it is often required that the client can choose from a range of competing offers. Many times there is a fixed budget available and competing solutions are compared based on their performance on certain custom benchmark codes.

So, in 2007 transtec decided to add another layer to its already wide array of competence in HPC – ranging from cluster deployment and management and the latest CPU, board and network technology to HPC storage systems. In transtec's HPC Lab the systems are assembled. transtec uses Intel Cluster Ready to facilitate testing, verification, documentation, and final testing throughout the actual build process. At the Benchmarking Center, transtec can now offer a set of small clusters with the "newest and hottest technology" through Intel Cluster Ready. A standard installation infrastructure gives transtec a quick and easy way to set systems up according to their customers' choice of operating system, compilers, workload management suite, and so on. With Intel Cluster Ready there are prepared standard set-ups available with verified performance on standard benchmarks, while system stability is guaranteed by our own test suite and the Intel Cluster Checker.

The Intel Cluster Ready program is designed to provide a common standard for HPC clusters, helping organizations design and build seamless, compatible and consistent cluster configurations. Integrating the standards and tools provided by this program can significantly simplify the deployment and management of HPC clusters.


Parallel NFS – The New Standard for HPC Storage


HPC computation results in the terabyte range are not uncommon. The problem in this context is not so much storing the data at rest, but the performance of the necessary copying back and forth in the course of the computation job flow, and the dependent job turn-around time.

For interim results during a job's runtime, or for fast storage of input and results data, parallel file systems have established themselves as the standard to meet the ever-increasing performance requirements of HPC storage systems. Parallel NFS is about to become the new standard framework for a parallel file system.


Yesterday's Solution: NFS for HPC Storage
The original Network File System (NFS), developed by Sun Microsystems in the eighties and now available in version 4.1, has long been established as the de-facto standard for providing a global namespace in networked computing. A very widespread HPC cluster setup includes a central master node acting simultaneously as an NFS server, with its local file system storing input, interim and results data and exporting them to all other cluster nodes.

There is of course an immediate bottleneck in this approach: when the load on the network is high, or when there are large numbers of nodes, the NFS server can no longer keep up delivering or receiving the data. In high-performance computing especially, the nodes are interconnected at least via Gigabit Ethernet, so the total aggregate throughput is well above what an NFS server with a single Gigabit interface can achieve. Even a more powerful network connection of the NFS server to the cluster, for example with 10 Gigabit Ethernet, is only a temporary solution to this problem until the next cluster upgrade. The fundamental problem remains – this solution is not scalable. In addition, NFS is a difficult protocol to cluster in terms of load balancing: either you have to ensure that multiple NFS servers accessing the same data are constantly synchronised, the disadvantage being a noticeable drop in performance, or you manually partition the global namespace, which is also time-consuming. NFS is not suitable for dynamic load balancing, as on paper it appears to be stateless but in reality is, in fact, stateful.

[Figure: A classical NFS server is a bottleneck – many cluster nodes (NFS clients) all funnel their I/O through a single NAS head (NFS server).]
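To put a rough number on this bottleneck, here is an illustrative back-of-the-envelope sketch; the node count and link speeds are assumptions chosen for the example, not figures from the text.

```python
# Illustrative aggregate-bandwidth estimate for a single NFS server.
nodes = 128                    # assumed number of compute nodes
client_link_gbit = 1.0         # each node attached via Gigabit Ethernet
server_link_gbit = 10.0        # NFS server attached via 10 Gigabit Ethernet

aggregate_demand = nodes * client_link_gbit     # what the clients could request
share_per_node = server_link_gbit / nodes       # what each node actually gets

print(f"Potential client demand: {aggregate_demand:.0f} Gbit/s")
print(f"Server uplink:           {server_link_gbit:.0f} Gbit/s")
print(f"Per-node share under full load: {share_per_node * 1000:.0f} Mbit/s")
# Even a 10 GbE server uplink leaves each of the 128 clients with ~78 Mbit/s
# when all of them read or write at once - far below their own 1 Gbit/s links.
```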

Today's Solution: Parallel File Systems
For some time, powerful commercial products have been available to meet the high demands on an HPC storage system. The open-source solutions FraunhoferFS (FhGFS) from the Fraunhofer Competence Center for High Performance Computing and Lustre are widely used in the Linux HPC world, and several other free as well as commercial parallel file system solutions exist.

What is new is that the time-honoured NFS is being upgraded, including a parallel version, into an Internet Standard with the aim of interoperability between all operating systems. The original problem statement for parallel NFS access was written by Garth Gibson, a professor at Carnegie Mellon University and founder and CTO of Panasas. Gibson was already a renowned figure, being one of the authors contributing to the original paper on RAID architecture from 1988. The original statement from Gibson and Panasas is clearly noticeable in the design of pNFS. The powerful HPC file system developed by Gibson and Panasas, ActiveScale PanFS, with object-based storage devices functioning as central components, is basically the commercial continuation of the "Network-Attached Secure Disk (NASD)" project, also developed by Garth Gibson at Carnegie Mellon University.

Parallel NFS
Parallel NFS (pNFS) is gradually emerging as the future standard to meet the requirements of the HPC environment. From the industry's as well as the user's perspective, the benefits of utilising standard solutions are indisputable: besides protecting end-user investment, standards also ensure a defined level of interoperability without restricting the choice of products available. As a result, less user and administrator training is required, which leads to simpler deployment and, at the same time, greater acceptance.

As part of the NFS 4.1 Internet Standard, pNFS not only adopts the semantics of NFS in terms of cache consistency and security, it also represents an easy and flexible extension of the NFS 4 protocol. pNFS is optional; in other words, NFS 4.1 implementations do not have to include pNFS as a feature. The scheduled Internet Standard NFS 4.1 is today published as IETF RFC 5661.

The pNFS protocol supports a separation of metadata and data: a pNFS cluster comprises so-called storage devices, which store the data of the shared file system, and a metadata server (MDS) – called a Director Blade with Panasas – which is the actual NFS 4.1 server. The metadata server keeps track of which data is stored on which storage devices and how to access the files, the so-called layout. Besides these "striping parameters", the MDS also manages other metadata, such as access rights, which is usually stored in a file's inode.

The layout types define which Storage Access Protocol is used by the clients to access the storage devices. Up to now, three storage access protocols have been defined for pNFS: file, block and object-based layouts, the first being described directly in RFC 5661, the latter two in RFC 5663 and RFC 5664, respectively. Last but not least, a Control Protocol is used by the MDS and the storage devices to synchronise status data. This protocol is deliberately left unspecified in the standard to give manufacturers a certain flexibility. The NFS 4.1 standard does, however, specify certain conditions which a control protocol has to fulfil, for example how to deal with the change/modify time attributes of files.
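The following toy sketch illustrates this division of labour; it is a simplified model only (the layout structure and the round-robin striping rule are assumptions, not the RFC 5661 wire format): a client fetches a file's layout from the metadata server once, then reads the stripe units directly and in parallel from the storage devices.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Layout:
    """Simplified stand-in for a pNFS file layout: stripe size plus device list."""
    stripe_size: int
    devices: list          # ordered storage devices holding the file's stripes

class StorageDevice:
    """Toy storage device holding stripe units for many files."""
    def __init__(self, name):
        self.name = name
        self.stripes = {}  # (filename, stripe_index) -> bytes

    def read(self, filename, stripe_index):
        return self.stripes[(filename, stripe_index)]

class MetadataServer:
    """Toy MDS: knows the layout of every file but never touches file data."""
    def __init__(self):
        self.layouts = {}

    def get_layout(self, filename):
        return self.layouts[filename]

def read_file(mds, filename, size):
    """Client-side read: fetch the layout once, then read stripes in parallel."""
    layout = mds.get_layout(filename)
    n_stripes = (size + layout.stripe_size - 1) // layout.stripe_size
    def read_stripe(i):
        device = layout.devices[i % len(layout.devices)]   # round-robin striping
        return device.read(filename, i)
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(read_stripe, range(n_stripes)))

if __name__ == "__main__":
    devices = [StorageDevice(f"osd{i}") for i in range(3)]
    mds = MetadataServer()
    data = b"x" * 2500
    mds.layouts["result.dat"] = Layout(stripe_size=1024, devices=devices)
    for i in range(0, len(data), 1024):                    # "write" the stripes
        devices[(i // 1024) % 3].stripes[("result.dat", i // 1024)] = data[i:i + 1024]
    assert read_file(mds, "result.dat", len(data)) == data
```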


What's New in NFS 4.1?
NFS 4.1 is a minor update to NFS 4 that adds new features to it. One of the optional features is parallel NFS (pNFS), but there is other new functionality as well.

One of the technical enhancements is the use of sessions, a persistent server object dynamically created by the client. By means of sessions, the state of an NFS connection can be stored, no matter whether the connection is live or not. Sessions survive temporary downtimes of both the client and the server. Each session has a so-called fore channel, which is the connection from the client to the server for all RPC operations, and optionally a back channel for RPC callbacks from the server, which now can also be realized across firewall boundaries. Sessions can be trunked to increase the bandwidth. Besides session trunking there is also client ID trunking for grouping several sessions together under the same client ID.

By means of sessions, NFS becomes a truly stateful protocol with so-called "Exactly-Once Semantics" (EOS). Until now, a necessary but unspecified reply cache within the NFS server was used to handle identical RPC operations that had been sent several times. In practice this statefulness is not very robust, however, and sometimes leads to the well-known stale NFS handles. In NFS 4.1, the reply cache is a mandatory part of the NFS implementation, storing the server's replies to RPC requests persistently on disk.
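As an illustration of the exactly-once idea, here is a minimal in-memory sketch; real NFS 4.1 servers key the cache by session, slot and sequence ID and persist it to disk, and the operation used below is made up for the example.

```python
class ReplyCache:
    """Toy exactly-once reply cache keyed by (session_id, slot_id) with a sequence ID."""
    def __init__(self):
        self.cache = {}

    def execute(self, session_id, slot_id, sequence_id, operation, *args):
        key = (session_id, slot_id)
        cached = self.cache.get(key)
        if cached is not None and cached[0] == sequence_id:
            # Retransmitted request: return the stored reply, do not re-execute.
            return cached[1]
        result = operation(*args)          # execute the RPC exactly once
        self.cache[key] = (sequence_id, result)
        return result

if __name__ == "__main__":
    counter = {"value": 0}
    def bump():                            # made-up non-idempotent operation
        counter["value"] += 1
        return counter["value"]

    rc = ReplyCache()
    first = rc.execute("sess-1", 0, 7, bump)
    retry = rc.execute("sess-1", 0, 7, bump)   # same slot + sequence: replayed
    assert first == retry == 1 and counter["value"] == 1
```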

Another new feature of NFS 4.1 is delegation for directories: NFS clients can be given temporary exclusive access to directories. Previously, this was only possible for regular files. With the forthcoming version 4.2 of the NFS standard, federated file systems will be added as a feature, representing the NFS counterpart of Microsoft's DFS (Distributed File System).


pNFS supports backwards compatibility with non-pNFS-capable NFS 4 clients. In this case, the MDS itself gathers the data from the storage devices on behalf of the NFS client and presents it to the NFS client via NFS 4. The MDS then acts as a kind of proxy server – which is, e.g., what the Director Blades from Panasas do.

pNFS Layout Types
If the storage devices act simply as NFS 4 file servers, the file layout is used. It is the only storage access protocol specified directly in the NFS 4.1 standard. Besides the stripe sizes and stripe locations (storage devices), it also includes the NFS file handles the client needs in order to access the separate file areas. The file layout is compact and static: the striping information does not change even if changes are made to the file, enabling multiple pNFS clients to cache the layout simultaneously and avoiding synchronisation overhead between clients and the MDS, or between the MDS and the storage devices.

File system authorisation and client authentication can be implemented well with the file layout. When using NFS 4 as the storage access protocol, client authentication merely depends on the security flavor used – with the RPCSEC_GSS security flavor, for example, client access is kerberized, and the server controls access authorization using the specified ACLs and cryptographic processes.

In contrast, the block/volume layout uses volume identifiers, block offsets and extents to specify a file layout. SCSI block commands are used to access the storage devices. As the block distribution can change with each write access, the layout must be updated more frequently than with the file layout. Block-based access to storage devices does not offer any secure authentication option for the accessing SCSI initiator. Secure SAN authorisation is possible with host granularity only, based on World Wide Names (WWNs) with Fibre Channel or Initiator Node Names (IQNs) with iSCSI. The server cannot enforce access control governed by the file system. On the contrary, a pNFS client essentially abides by the access rights voluntarily; the storage device has to trust the pNFS client – a fundamental access control problem that is a recurrent issue in the history of the NFS protocol.

The object layout is syntactically similar to the file layout, but it uses the SCSI object command set for data access to so-called Object-based Storage Devices (OSDs) and is heavily based on the DirectFLOW protocol of the ActiveScale PanFS from Panasas. From the very start, object-based storage devices were designed for secure authentication and access. So-called capabilities are used for object access: the MDS issues capabilities to the pNFS clients, and ownership of such a capability represents the authoritative access right to an object.

pNFS can be extended to integrate other storage access protocols and operating systems, and storage manufacturers also have the option to ship additional layout drivers for their pNFS implementations.
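To give a feel for the capability idea behind the object layout, here is a deliberately simplified sketch; real OSD capabilities are defined by the SCSI object command set, and the HMAC-based key handling below is purely a hypothetical stand-in.

```python
import hashlib
import hmac

# Hypothetical secret shared between the metadata server and one OSD.
OSD_SECRET = b"per-osd-shared-secret"

def issue_capability(object_id, rights):
    """MDS side: sign (object_id, rights) so the OSD can verify the grant later."""
    message = f"{object_id}:{rights}".encode()
    tag = hmac.new(OSD_SECRET, message, hashlib.sha256).hexdigest()
    return {"object_id": object_id, "rights": rights, "tag": tag}

def osd_check(capability, requested_op):
    """OSD side: serve the request only if the capability is genuine and sufficient."""
    message = f"{capability['object_id']}:{capability['rights']}".encode()
    expected = hmac.new(OSD_SECRET, message, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, capability["tag"]):
        return False                       # forged or tampered capability
    return requested_op in capability["rights"]

if __name__ == "__main__":
    cap = issue_capability(object_id="0x42", rights="read")
    assert osd_check(cap, "read")          # granted
    assert not osd_check(cap, "write")     # capability does not cover writes
```

The point of the construction is that the storage device does not have to trust the client: it only has to trust the metadata server that issued the capability.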

[Figure: pNFS architecture – pNFS clients talk NFS 4.1 to the metadata server and use the Storage Access Protocol to read and write the storage devices directly, while metadata server and storage devices synchronise via the Control Protocol.]


Panasas HPC Storage

With many years' experience in deploying parallel file systems like Lustre or FraunhoferFS (FhGFS), from smaller scales up to hundreds of terabytes of capacity and throughputs of several gigabytes per second, transtec chose Panasas, the leader in HPC storage solutions, as the partner for providing highest performance and scalability on the one hand, and ease of management on the other. With Panasas as the technology leader, and transtec's overall experience and customer-oriented approach, customers can be assured of getting the best possible HPC storage solution available.


The Panasas file system uses parallel and redundant access to object storage devices (OSDs), per-file RAID, distributed metadata management, consistent client caching, file locking services, and internal cluster management to provide a scalable, fault-tolerant, high-performance distributed file system. The clustered design of the storage system and the use of client-driven RAID provide scalable performance to many concurrent file system clients through parallel access to file data that is striped across OSD storage nodes. RAID recovery is performed in parallel by the cluster of metadata managers, and declustered data placement yields scalable RAID rebuild rates as the storage system grows larger.

Introduction
Storage systems for high-performance computing environments must be designed to scale in performance so that they can be configured to match the required load. Clustering techniques are often used to provide scalability. In a storage cluster, many nodes each control some storage, and the overall distributed file system assembles the cluster elements into one large, seamless storage system. The storage cluster can be hosted on the same computers that perform data processing, or it can be a separate cluster that is devoted entirely to storage and accessible to the compute cluster via a network protocol.

The Panasas storage system is a specialized storage cluster, and this paper presents its design and a number of performance measurements to illustrate its scalability. The Panasas system is a production system that provides file service to some of the largest compute clusters in the world – in scientific labs, in seismic data processing, in digital animation studios, in computational fluid dynamics, in semiconductor manufacturing, and in general-purpose computing environments. In these environments, hundreds or thousands of file system clients share data and generate a very high aggregate I/O load on the file system. The Panasas system is designed to support several thousand clients and storage capacities in excess of a petabyte.

The unique aspects of the Panasas system are its use of per-file, client-driven RAID, its parallel RAID rebuild, its treatment of different classes of metadata (block, file, system), and commodity-parts-based blade hardware with an integrated UPS. Of course, the system has many other features (such as object storage, fault tolerance, caching and cache consistency, and a simplified management model) that are not unique, but are necessary for a scalable system implementation.

Panasas File System Background
The two overall themes of the system are object storage, which affects how the file system manages its data, and clustering of components, which allows the system to scale in performance and capacity.

Object Storage
An object is a container for data and attributes; it is analogous to the inode inside a traditional UNIX file system implementation. Specialized storage nodes called Object Storage Devices (OSDs) store objects in a local OSDFS file system. The object interface addresses objects in a two-level (partition ID / object ID) namespace. The OSD wire protocol provides byte-oriented access to the data, attribute manipulation, creation and deletion of objects, and several other specialized operations. Panasas uses an iSCSI transport to carry OSD commands that are very similar to the OSDv2 standard currently in progress within SNIA and ANSI T10.

The Panasas file system is layered over the object storage. Each file is striped over two or more objects to provide redundancy and high-bandwidth access. The file system semantics are implemented by metadata managers that mediate access to objects from clients of the file system. The clients access the object storage using the iSCSI/OSD protocol for read and write operations. The I/O operations proceed directly and in parallel to the storage nodes, bypassing the metadata managers. The clients interact with the out-of-band metadata managers via RPC to obtain access capabilities and location information for the objects that store files.


Object attributes are used to store file-level attributes, and directories are implemented with objects that store name-to-object-ID mappings. Thus the file system metadata is kept in the object store itself, rather than being kept in a separate database or some other form of storage on the metadata nodes.

System Software Components
The major software subsystems are the OSDFS object storage system, the Panasas file system metadata manager, the Panasas file system client, the NFS/CIFS gateway, and the overall cluster management system.

[Figure: Panasas system components – compute nodes run the PanFS client and talk iSCSI/OSD to the OSDFS storage nodes and RPC to the manager nodes, which host the SysMgr cluster manager and the NFS/CIFS gateways.]

ʎ The Panasas client is an installable kernel module that runs inside the Linux kernel. The kernel module implements the standard VFS interface, so that client hosts can mount the file system and use a POSIX interface to the storage system.

ʎ Each storage cluster node runs a common platform that is based on FreeBSD, with additional services to provide hardware monitoring, configuration management, and overall control.
ʎ The storage nodes use a specialized local file system (OSDFS) that implements the object storage primitives. They implement an iSCSI target and the OSD command set. The OSDFS object store and the iSCSI target/OSD command processor are kernel modules. OSDFS is concerned with traditional block-level file system issues such as efficient disk arm utilization, media management (i.e., error handling) and high throughput, as well as with the OSD interface.
ʎ The cluster manager (SysMgr) maintains the global configuration, and it controls the other services and nodes in the storage cluster. There is an associated management application that provides both a command line interface (CLI) and an HTML interface (GUI). These are all user-level applications that run on a subset of the manager nodes. The cluster manager is concerned with membership in the storage cluster, fault detection, configuration management, and overall control of operations like software upgrades and system restarts.
ʎ The Panasas metadata manager (PanFS) implements the file system semantics and manages data striping across the object storage devices. This is a user-level application that runs on every manager node. The metadata manager is concerned with distributed file system issues such as secure multi-user access, maintaining consistent file- and object-level metadata, client cache coherency, and recovery from client, storage node, and metadata server crashes. Fault tolerance is based on a local transaction log that is replicated to a backup on a different manager node.
ʎ The NFS and CIFS services provide access to the file system for hosts that cannot use the Linux installable file system client. The NFS service is a tuned version of the standard FreeBSD NFS server that runs inside the kernel. The CIFS service is based on Samba and runs at user level. In turn, these services use a local instance of the file system client, which runs inside the FreeBSD kernel. These gateway services run on every manager node to provide a clustered NFS and CIFS service.

Commodity Hardware Platform
The storage cluster nodes are implemented as blades – very compact computer systems made from commodity parts. The blades are clustered together to provide a scalable platform. The OSD StorageBlade module and the metadata manager DirectorBlade module use the same blade form factor and fit into the same chassis slots.

Storage Management
Traditional storage management tasks involve partitioning available storage space into LUNs (i.e., logical units that are one or more disks, or a subset of a RAID array), assigning LUN ownership to different hosts, configuring RAID parameters, creating file systems or databases on LUNs, and connecting clients to the correct server for their storage. This can be a labor-intensive scenario. Panasas provides a simplified model for storage management that shields the storage administrator from these kinds of details and allows a single, part-time administrator to manage systems that are hundreds of terabytes in size.

The Panasas storage system presents itself as a file system with a POSIX interface, and hides most of the complexities of storage management. Clients have a single mount point for the entire system. The /etc/fstab file references the cluster manager, and from that the client learns the location of the metadata service instances. The administrator can add storage while the system is online, and new resources are automatically discovered. To manage the available storage, Panasas introduces two basic storage concepts: a physical storage pool called a BladeSet, and a logical quota tree called a Volume.

The BladeSet is a collection of StorageBlade modules in one or more shelves that comprise a RAID fault domain. Panasas mitigates the risk of large fault domains with the scalable rebuild performance described below. The BladeSet is a hard physical boundary for the volumes it contains. A BladeSet can be grown at any time, either by adding more StorageBlade modules, or by merging two existing BladeSets together.

The Volume is a directory hierarchy that has a quota constraint and is assigned to a particular BladeSet. The quota can be changed at any time, and capacity is not allocated to the Volume until it is used, so multiple volumes compete for space within their BladeSet and grow on demand. The files in those volumes are distributed among all the StorageBlade modules in the BladeSet.

Volumes appear in the file system namespace as directories. Clients have a single mount point for the whole storage system, and volumes are simply directories below the mount point. There is no need to update client mounts when the administrator creates, deletes, or renames volumes.

Automatic Capacity Balancing
Capacity imbalance occurs when expanding a BladeSet (i.e., adding new, empty storage nodes), when merging two BladeSets, and when replacing a storage node following a failure. In the latter scenario, the imbalance is the result of the RAID rebuild, which uses spare capacity on every storage node rather than dedicating a specific "hot spare" node. This provides better throughput during rebuild, but causes the system to have a new, empty storage node after the failed storage node is replaced. The system automatically balances used capacity across the storage nodes in a BladeSet using two mechanisms: passive balancing and active balancing.

Passive balancing changes the probability that a storage node will be used for a new component of a file, based on its available capacity. This takes effect when files are created, and when their stripe size is increased to include more storage nodes. Active balancing is done by moving an existing component object from one storage node to another, and updating the storage map for the affected file. During the transfer, the file is transparently marked read-only by the storage management layer, and the capacity balancer skips files that are being actively written. Capacity balancing is thus transparent to file system clients.
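A minimal sketch of the passive-balancing idea follows; it is illustrative only – the node capacities are made up, and the selection rule is simply "probability proportional to free capacity", not Panasas' actual placement algorithm.

```python
import random

# Hypothetical free capacity per storage node, in GB.
free_capacity = {
    "sb01": 800.0,
    "sb02": 750.0,
    "sb03": 3000.0,   # e.g. a freshly added, almost empty StorageBlade
    "sb04": 820.0,
}

def pick_node_for_new_component(capacities):
    """Choose a storage node with probability proportional to its free capacity."""
    nodes = list(capacities)
    weights = [capacities[n] for n in nodes]
    return random.choices(nodes, weights=weights, k=1)[0]

if __name__ == "__main__":
    placements = [pick_node_for_new_component(free_capacity) for _ in range(10000)]
    for node in sorted(free_capacity):
        share = placements.count(node) / len(placements)
        print(f"{node}: {share:.1%} of new components")
    # The emptier node receives proportionally more new data, so used capacity
    # converges toward balance without moving any existing objects.
```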


© 2013 Panasas Incorporated. All rights reserved. Panasas, the Panasas logo, Accelerating Time to Results, ActiveScale, DirectFLOW, DirectorBlade, StorageBlade, PanFS, PanActive and MyPanasas are trademarks or registered trademarks of Panasas, Inc. in the United States and other countries. All other trademarks are the property of their respective owners. Information supplied by Panasas, Inc. is believed to be accurate and reliable at the time of publication, but Panasas, Inc. assumes no responsibility for any errors that may appear in this document. Panasas, Inc. reserves the right, without notice, to make changes in product design, specifications and prices. Information is subject to change without notice.


Object RAID and Reconstruction
Panasas protects against the loss of a data object or an entire storage node by striping files across objects stored on different storage nodes, using a fault-tolerant striping algorithm such as RAID-1 or RAID-5. Small files are mirrored on two objects, and larger files are striped more widely to provide higher bandwidth and less capacity overhead from parity information. The per-file RAID layout means that parity information for different files is not mixed together, and easily allows different files to use different RAID schemes alongside each other. This property, together with the security mechanisms of the OSD protocol, makes it possible to enforce access control over files even as clients access storage nodes directly. It also enables what is perhaps the most novel aspect of the system: client-driven RAID. That is, the clients are responsible for computing and writing parity. The OSD security mechanism also allows multiple metadata managers to manage objects on the same storage device without heavyweight coordination or interference from each other.

Client-driven, per-file RAID has four advantages for large-scale storage systems. First, by having clients compute parity for their own data, the XOR power of the system scales up as the number of clients increases. We measured XOR processing during streaming write bandwidth loads at 7% of the client’s CPU, with the rest going to the OSD/iSCSI/TCP/IP stack and other file system overhead. Moving XOR computation out of the storage system into the client requires some additional work to handle failures. Clients are responsible for generating good data and good parity for it. Because the RAID equation is per-file, an errant client can only damage its own data. However, if a client fails during a write, the metadata manager will scrub parity to ensure the parity equation is correct.
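The following sketch illustrates the parity arithmetic a client performs for a single RAID-5 stripe; it is a minimal illustration of the XOR relationship, not the Panasas DirectFLOW client implementation, and the buffer layout and function name are assumptions.

```cuda
// Minimal sketch of RAID-5 stripe parity as computed on the client:
// parity = data[0] XOR data[1] XOR ... XOR data[k-1]. Any single lost unit can
// later be recovered by XOR-ing the parity with the surviving units.
#include <cstddef>
#include <cstdint>
#include <vector>

void compute_parity(const std::vector<const uint8_t*>& data_units,
                    uint8_t* parity, size_t unit_size) {
    for (size_t i = 0; i < unit_size; ++i) {
        uint8_t p = 0;
        for (const uint8_t* unit : data_units) p ^= unit[i];
        parity[i] = p;   // written to a storage node separate from the data units
    }
}
```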

The second advantage of client-driven RAID is that clients can perform an end-to-end data integrity check. Data has to go through the disk subsystem, through the network interface on the storage nodes, through the network and routers, and through the NIC on the client, and all of these transits can introduce errors with a very low probability. Clients can choose to read parity as well as data, and verify parity as part of a read operation. If errors are detected, the operation is retried. If the error is persistent, an alert is raised and the read operation fails. By checking parity across storage nodes within the client, the system can ensure end-to-end data integrity. This is another novel property of per-file, client-driven RAID.

Third, per-file RAID protection lets the metadata managers rebuild files in parallel. Although parallel rebuild is theoretically possible in block-based RAID, it is rarely implemented, because the disks are owned by a single RAID controller, even in dual-ported configurations. Large storage systems have multiple RAID controllers that are not interconnected. Since the SCSI Block command set does not provide fine-grained synchronization operations, it is difficult for multiple RAID controllers to coordinate a complicated operation such as an online rebuild without external communication. Even if they could, without connectivity to the disks in the affected parity group, other RAID controllers would be unable to assist. Even in a high-availability configuration, each disk is typically only attached to two different RAID controllers, which limits the potential speedup to 2x.

When a StorageBlade module fails, the metadata managers that own Volumes within that BladeSet determine which files are affected, and then they farm out file reconstruction work to every other metadata manager in the system. Metadata managers rebuild their own files first, but if they finish early or do not own any Volumes in the affected BladeSet, they are free to aid other metadata managers. Declustered parity groups spread out the I/O workload among all StorageBlade modules in the BladeSet. The result is that larger storage clusters reconstruct lost data more quickly.

The fourth advantage of per-file RAID is that unrecoverable faults can be constrained to individual files. The most commonly encountered double-failure scenario with RAID-5 is an unrecoverable read error (i.e., a grown media defect) during the reconstruction of a failed storage device. The second storage device is still healthy, but it has been unable to read a sector, which prevents rebuild of the sector lost from the first drive and potentially the entire stripe or LUN, depending on the design of the RAID controller. With block-based RAID, it is difficult or impossible to directly map any lost sectors back to higher-level file system data structures, so a full file system check and media scan will be required to locate and repair the damage. A more typical response is to fail the rebuild entirely. RAID controllers monitor drives in an effort to scrub out media defects and avoid this bad scenario, and the Panasas system does media scrubbing, too. However, with high-capacity SATA drives, the chance of encountering a media defect on drive B while rebuilding drive A is still significant. With per-file RAID-5, this sort of double failure means that only a single file is lost, and the specific file can be easily identified and reported to the administrator. While block-based RAID systems have been compelled to introduce RAID-6 (i.e., fault-tolerant schemes that handle two failures), the Panasas solution is able to deploy highly reliable RAID-5 systems with large, high-performance storage pools.

RAID Rebuild Performance

RAID rebuild performance determines how quickly the system can recover data when a storage node is lost. Short rebuild times reduce the window in which a second failure can cause data loss. There are three techniques to reduce rebuild times: reducing the size of the RAID parity group, declustering the placement of parity group elements, and rebuilding files in parallel using multiple RAID engines.


Figure: Declustered parity groups – the elements of each parity group (shown as letters) are distributed across the storage devices of a BladeSet.


The rebuild bandwidth is the rate at which reconstructed data is written to the system when a storage node is being reconstructed. The system must read N times as much as it writes, depending on the width of the RAID parity group, so the overall throughput of the storage system is several times higher than the rebuild rate. A narrower RAID parity group requires fewer read and XOR operations to rebuild, so it will result in a higher rebuild bandwidth. However, it also results in higher capacity overhead for parity data, and can limit bandwidth during normal I/O. Thus, selection of the RAID parity group size is a trade-off between capacity overhead, on-line performance, and rebuild performance.

Understanding declustering is easier with a picture. In the declustered parity groups figure above, each parity group has 4 elements, which are indicated by letters placed in each storage device. They are distributed among 8 storage devices. The ratio between the parity group size and the available storage devices is the declustering ratio, which in this example is ½. In the picture, capital letters represent those parity groups that all share the second storage node. If the second storage device were to fail, the system would have to read the surviving members of its parity groups to rebuild the lost elements. You can see that the other elements of those parity groups occupy about ½ of each other storage device.

For this simple example you can assume each parity element is the same size so all the devices are filled equally. In a real system, the component objects will have various sizes depending on the overall file size, although each member of a parity group will be very close in size. There will be thousands or millions of objects on each device, and the Panasas system uses active balancing to move component objects between storage nodes to level capacity.

Declustering means that rebuild requires reading a subset of each device, with the proportion being approximately the same as the declustering ratio. The total amount of data read is the same with and without declustering, but with declustering it is spread out over more devices. When writing the reconstructed elements, two elements of the same parity group cannot be located on the same storage node. Declustering leaves many storage devices available for the reconstructed parity element, and randomizing the placement of each file’s parity group lets the system spread out the write I/O over all the storage. Thus declustering RAID parity groups has the important property of taking a fixed amount of rebuild I/O and spreading it out over more storage devices.
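A quick back-of-the-envelope sketch of the numbers in the example above (parity groups of width 4 declustered over 8 devices) shows how the rebuild reads are spread; the little program below is purely illustrative.

```cuda
// Rough arithmetic for the figure's example: with parity groups of width 4 on
// 8 devices, each surviving device is read for roughly the declustering ratio
// of its capacity when one device fails.
#include <cstdio>

int main() {
    const double group_width = 4.0;   // elements per parity group
    const double num_devices = 8.0;   // storage devices in the BladeSet
    const double declustering_ratio = group_width / num_devices;            // 0.5
    const double read_fraction = (group_width - 1.0) / (num_devices - 1.0); // ~0.43
    std::printf("declustering ratio                      ~ %.2f\n", declustering_ratio);
    std::printf("fraction of each surviving device read  ~ %.2f\n", read_fraction);
    return 0;
}
```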

Having per-file RAID allows the Panasas system to divide the work among the available DirectorBlade modules by assigning different files to different DirectorBlade modules. This division is dynamic, with a simple master/worker model in which metadata services make themselves available as workers, and each metadata service acts as the master for the volumes it implements. By doing rebuilds in parallel on all DirectorBlade modules, the system can apply more XOR throughput and utilize the additional I/O bandwidth obtained with declustering.

Metadata Management

There are several kinds of metadata in the Panasas system. These include the mapping from object IDs to sets of block addresses, mapping files to sets of objects, file system attributes such as ACLs and owners, file system namespace information (i.e., directories), and configuration/management information about the storage cluster itself.

Block-level Metadata

Block-level metadata is managed internally by OSDFS, the file system that is optimized to store objects. OSDFS uses a floating block allocation scheme where data, block pointers, and object descriptors are batched into large write operations. The write buffer is protected by the integrated UPS, and it is flushed to disk on power failure or system panics. Fragmentation was an issue in early versions of OSDFS that used a first-fit block allocator, but this has been significantly mitigated in later versions that use a modified best-fit allocator.

OSDFS stores higher-level file system data structures, such as the partition and object tables, in a modified BTree data structure. Block mapping for each object uses a traditional direct/indirect/double-indirect scheme. Free blocks are tracked by a proprietary bitmap-like data structure that is optimized for copy-on-write reference counting, part of OSDFS’s integrated support for object- and partition-level copy-on-write snapshots.

Block-level metadata management consumes most of the cycles in file system implementations. By delegating storage management to OSDFS, the Panasas metadata managers have an order of magnitude less work to do than the equivalent SAN file system metadata manager that must track all the blocks in the system.

File-level Metadata

Above the block layer is the metadata about files. This includes user-visible information such as the owner, size, and modification time, as well as internal information that identifies which objects store the file and how the data is striped across those objects (i.e., the file’s storage map). Our system stores this file metadata in object attributes on two of the N objects used to store the file’s data. The rest of the objects have basic attributes like their individual length and modify times, but the higher-level file system attributes are only stored on the two attribute-storing components.

File names are implemented in directories similar to traditional UNIX file systems. Directories are special files that store an array of directory entries. A directory entry identifies a file with a tuple of <serviceID, partitionID, objectID>, and also includes two <osdID> fields that are hints about the location of the attribute-storing components. The partitionID/objectID is the two-level object numbering scheme of the OSD interface, and Panasas uses a partition for each volume. Directories are mirrored (RAID-1) in two objects so that the small write operations associated with directory updates are efficient.
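A rough sketch of what such a directory entry carries, based on the description above; the field names and types are illustrative assumptions, not the actual on-disk format.

```cuda
// Illustrative layout of a Panasas-style directory entry as described in the
// text: the file's <serviceID, partitionID, objectID> tuple plus two osdID
// hints pointing at the attribute-storing component objects.
#include <cstdint>

struct DirEntry {
    char     name[256];      // file name within the directory
    uint32_t serviceID;      // hint: metadata manager responsible for the file
    uint64_t partitionID;    // OSD partition (one partition per volume)
    uint64_t objectID;       // object number within the partition
    uint32_t osdID_hint[2];  // hints: storage nodes holding the two
                             // attribute-storing components (may go stale)
};
```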

Clients are allowed to read, cache and parse directories, or they can use a Lookup RPC to the metadata manager to translate a name to a <serviceID, partitionID, objectID> tuple and the <osdID> location hints. The serviceID provides a hint about the metadata manager for the file, although clients may be redirected to the metadata manager that currently controls the file. The osdID hint can become out-of-date if reconstruction or active balancing moves an object. If both osdID hints fail, the metadata manager has to multicast a GetAttributes to the storage nodes in the BladeSet to locate an object. The partitionID and objectID are the same on every storage node that stores a component of the file, so this technique will always work. Once the file is located, the metadata manager automatically updates the stored hints in the directory, allowing future accesses to bypass this step.

File operations may require several object operations. The figure below shows the steps used in creating a file. The metadata manager keeps a local journal to record in-progress actions so it can recover from object failures and metadata manager crashes that occur when updating multiple objects. For example, creating a file is a fairly complex task that requires updating the parent directory as well as creating the new file. There are 2 Create OSD operations to create the first two components of the file, and 2 Write OSD operations, one to each replica of the parent directory. As a performance optimization, the metadata server also grants the client read and write access to the file and returns the appropriate capabilities to the client as part of the FileCreate results. The server makes a record of these write capabilities to support error recovery if the client crashes while writing the file. Note that the directory update (step 7) occurs after the reply, so that many directory updates can be batched together. The deferred update is protected by the op-log record that gets deleted in step 8 after the successful directory update.

The metadata manager maintains an op-log that records the object create and the directory updates that are in progress. This log entry is removed when the operation is complete. If the metadata service crashes and restarts, or a failure event moves the metadata service to a different manager node, then the op-log is processed to determine what operations were active at the time of the failure. The metadata manager rolls the operations forward or backward to ensure the object store is consistent. If no reply to the operation has been generated, then the operation is rolled back. If a reply has been generated but pending operations are outstanding (e.g., directory updates), then the operation is rolled forward.
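The sequence described above can be sketched as a journaled series of steps; the helper names below (op_log_append, osd_create, dir_write and so on) are placeholders standing in for the metadata manager's internal operations, not a real Panasas interface, and the ordering simply follows the description in the text.

```cuda
// Toy sketch of the journaled file-create sequence described above.
#include <cstdio>

void op_log_append(const char* op)   { std::printf("oplog  + %s\n", op); }
void op_log_remove(const char* op)   { std::printf("oplog  - %s\n", op); }
void cap_log_append(const char* cap) { std::printf("caplog + %s\n", cap); }
void osd_create(int osd)             { std::printf("CREATE on OSD %d\n", osd); }
void dir_write(int replica)          { std::printf("WRITE dir replica %d\n", replica); }
void reply_with_write_cap()          { std::printf("REPLY (with write capability)\n"); }

int main() {
    op_log_append("create file + dir update"); // record intent before touching objects
    osd_create(0); osd_create(1);               // the two attribute-storing components
    cap_log_append("write cap for new file");   // remember busy file for crash recovery
    reply_with_write_cap();                     // client may start writing now
    dir_write(0); dir_write(1);                 // deferred, batched directory updates
    op_log_remove("create file + dir update");  // operation complete
    return 0;
}
```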

The write capability is stored in a cap-log so that when a metadata server starts it knows which of its files are busy. In addition to the “piggybacked” write capability returned by FileCreate, the client can also execute a StartWrite RPC to obtain a separate write capability. The cap-log entry is removed when the client releases the write cap via an EndWrite RPC. If the client reports an error during its I/O, then a repair log entry is made and the file is scheduled for repair. Read and write capabilities are cached by the client over multiple system calls, further reducing metadata server traffic.

Figure: Creating a file – message flow between the client, the metadata server (with its op-log, cap-log and reply cache backed by the transaction log), and the OSDs.


System-level Metadata

The final layer of metadata is information about the overall system itself. One possibility would be to store this information in objects and bootstrap the system through a discovery protocol. The most difficult aspect of that approach is reasoning about the fault model: the system must be able to come up and be manageable while it is only partially functional. Panasas chose instead a model with a small replicated set of system managers, each of which stores a replica of the system configuration metadata. Each system manager maintains a local database, outside of the object storage system. Berkeley DB is used to store tables that represent the system model. The different system manager instances are members of a replication set that uses Lamport’s part-time parliament (PTP) protocol to make decisions and update the configuration information. Clusters are configured with one, three, or five system managers so that the voting quorum has an odd number and a network partition will cause a minority of system managers to disable themselves.

System configuration state includes both static state, such as the identity of the blades in the system, as well as dynamic state such as the online/offline state of various services and error conditions associated with different system components. Each state update decision, whether it is updating the admin password or activating a service, involves a voting round and an update round according to the PTP protocol. Database updates are performed within the PTP transactions to keep the databases synchronized. Finally, the system keeps backup copies of the system configuration databases on several other blades to guard against catastrophic loss of every system manager blade.

Blade configuration is pulled from the system managers as part of each blade’s startup sequence. The initial DHCP handshake conveys the addresses of the system managers, and thereafter the local OS on each blade pulls configuration information from the system managers via RPC.

The cluster manager implementation has two layers. The lower-level PTP layer manages the voting rounds and ensures that partitioned or newly added system managers will be brought up-to-date with the quorum. The application layer above that uses the voting and update interface to make decisions. Complex system operations may involve several steps, and the system manager has to keep track of its progress so it can tolerate a crash and roll back or roll forward as appropriate.

“Outstanding in the HPC world, the ActiveStor solutions provided by Panasas are undoubtedly the only HPC storage solutions that combine highest scalability and performance with a convincing ease of management.”
Thomas Gebert, HPC Solution Architect

For example, creating a volume (i.e., a quota-tree) involves file system operations to create a top-level directory, object operations to create an object partition within OSDFS on each StorageBlade module, service operations to activate the appropriate metadata manager, and configuration database operations to reflect the addition of the volume. Recovery is enabled by having two PTP transactions. The initial PTP transaction determines if the volume should be created, and it creates a record about the volume that is marked as incomplete. Then the system manager does all the necessary service activations, file and storage operations. When these all complete, a final PTP transaction is performed to commit the operation. If the system manager crashes before the final PTP transaction, it will detect the incomplete operation the next time it restarts, and then roll the operation forward or backward.


NVIDIA GPU Computing – The CUDA Architecture


Intelligent management of HPC application workflows: while all HPC systems face major challenges in terms of complexity, scalability, and the demands placed on application workflows and resources, enterprise HPC systems face the most demanding challenges and expectations. Enterprise HPC systems must satisfy the mission-critical, high-priority application workflow requirements of commercial businesses and of commercially oriented research and academic organizations. They must strike a balance between complex SLAs and priorities, because their HPC workflows have a direct impact on their revenues, their product delivery, and their business objectives.

The CUDA parallel computing platform is now widely deployed, with thousands of GPU-accelerated applications and thousands of published research papers, and a complete range of CUDA tools and ecosystem solutions is available to developers.


What is GPU Computing?

GPU computing is the use of a GPU (graphics processing unit) together with a CPU to accelerate general-purpose scientific and engineering applications. Pioneered more than five years ago by NVIDIA, GPU computing has quickly become an industry standard, enjoyed by millions of users worldwide and adopted by virtually all computing vendors.

GPU computing offers unprecedented application performance by offloading compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU. From a user’s perspective, applications simply run significantly faster.

CPU + GPU is a powerful combination because CPUs consist of a few cores optimized for serial processing, while GPUs consist of thousands of smaller, more efficient cores designed for parallel performance. Serial portions of the code run on the CPU while parallel portions run on the GPU.

Most customers can immediately enjoy the power of GPU computing by using any of the GPU-accelerated applications listed in our catalog, which highlights over one hundred industry-leading applications. For developers, GPU computing offers a vast ecosystem of tools and libraries from major software vendors.

“We are very proud to be one of the leading providers of Tesla systems who are able to combine the overwhelming power of NVIDIA Tesla systems with the fully engineered and thoroughly tested transtec hardware to a total Tesla-based solution.”
Norbert Zeidler, Senior HPC Solution Engineer


History of GPU Computing

Graphics chips started as fixed-function graphics processors but became increasingly programmable and computationally powerful, which led NVIDIA to introduce the first GPU. In the 1999-2000 timeframe, computer scientists and domain scientists from various fields started using GPUs to accelerate a range of scientific applications. This was the advent of the movement called GPGPU, or General-Purpose computation on GPU.

While users achieved unprecedented performance (over 100x compared to CPUs in some cases), the challenge was that GPGPU required the use of graphics programming APIs like OpenGL and Cg to program the GPU. This limited accessibility to the tremendous capability of GPUs for science.

NVIDIA recognized the potential of bringing this performance to the larger scientific community, invested in making the GPU fully programmable, and offered a seamless experience for developers with familiar languages like C, C++, and Fortran.

GPU computing momentum is growing faster than ever before. Today, some of the fastest supercomputers in the world rely on GPUs to advance scientific discoveries; 600 universities around the world teach parallel computing with NVIDIA GPUs; and hundreds of thousands of developers are actively using GPUs.

All NVIDIA GPUs – GeForce, Quadro, and Tesla – support GPU computing and the CUDA parallel programming model. Developers have access to NVIDIA GPUs in virtually any platform of their choice, including the latest Apple MacBook Pro. However, we recommend Tesla GPUs for workloads where data reliability and overall performance are critical.

Tesla GPUs are designed from the ground up to accelerate scientific and technical computing workloads. Based on innovative features in the “Kepler architecture,” the latest Tesla GPUs offer 3x more performance compared to the previous architecture and more than one teraflops of double-precision floating point, while dramatically advancing programmability and efficiency. Kepler is the world’s fastest and most efficient high performance computing (HPC) architecture.

Kepler GK110 – The Next Generation GPU Computing Architecture

As the demand for high performance parallel computing increases across many areas of science, medicine, engineering, and finance, NVIDIA continues to innovate and meet that demand with extraordinarily powerful GPU computing architectures. NVIDIA’s existing Fermi GPUs have already redefined and accelerated High Performance Computing (HPC) capabilities in areas such as seismic processing, biochemistry simulations, weather and climate modeling, signal processing, computational finance, computer aided engineering, computational fluid dynamics, and data analysis. NVIDIA’s new Kepler GK110 GPU raises the parallel computing bar considerably and will help solve the world’s most difficult computing problems.

Figure: GPU Computing Applications – the NVIDIA GPU with the CUDA parallel computing architecture serves applications written in C, C++, Fortran, Java, and Python as well as OpenCL and DirectCompute, through device-level APIs and language integration layered on the CUDA driver, the OpenCL driver, DirectCompute, and the C runtime for CUDA, all compiling to PTX (ISA) for the CUDA parallel compute engines inside NVIDIA GPUs.

difficult computing problems.

By offering much higher processing power than the prior GPU

generation and by providing new methods to optimize and in-

crease parallel workload execution on the GPU, Kepler GK110

simplifies creation of parallel programs and will further revolu-

tionize high performance computing.

Kepler GK110 – Extreme Performance, Extreme Efficiency

Comprising 7.1 billion transistors, Kepler GK110 is not only the fastest, but also the most architecturally complex microprocessor ever built. Adding many new innovative features focused on compute performance, GK110 was designed to be a parallel processing powerhouse for Tesla® and the HPC market.

Kepler GK110 provides over 1 TFlop of double precision throughput with greater than 93% DGEMM efficiency versus 60-65% on the prior Fermi architecture.

In addition to greatly improved performance, the Kepler architecture offers a huge leap forward in power efficiency, delivering up to 3x the performance per watt of Fermi.

The following new features in Kepler GK110 enable increased GPU utilization, simplify parallel program design, and aid in the deployment of GPUs across the spectrum of compute environments ranging from personal workstations to supercomputers:

ʎ Dynamic Parallelism – adds the capability for the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU. By providing the flexibility to adapt to the amount and form of parallelism through the course of a program’s execution, programmers can expose more varied kinds of parallel work and make the most efficient use of the GPU as a computation evolves. This capability allows less-structured, more complex tasks to run easily and effectively, enabling larger portions of an application to run entirely on the GPU. In addition, programs are easier to create, and the CPU is freed for other tasks.

ʎ Hyper-Q – Hyper-Q enables multiple CPU cores to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and significantly reducing CPU idle times. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting achieved GPU utilization, can see a dramatic performance increase without changing any existing code.

ʎ Grid Management Unit – Enabling Dynamic Parallelism requires an advanced, flexible grid management and dispatch control system. The new GK110 Grid Management Unit (GMU) manages and prioritizes grids to be executed on the GPU. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. The GMU ensures both CPU- and GPU-generated workloads are properly managed and dispatched.

ʎ NVIDIA GPUDirect – NVIDIA GPUDirect is a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to directly exchange data without needing to go to CPU/system memory. The RDMA feature in GPUDirect allows third-party devices such as SSDs, NICs, and IB adapters to directly access memory on multiple GPUs within the same system, significantly decreasing the latency of MPI send and receive messages to/from GPU memory. It also reduces demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. Kepler GK110 also supports other GPUDirect features including Peer-to-Peer and GPUDirect for Video.

An Overview of the GK110 Kepler Architecture

Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.


Introducing NVIDIA Parallel Nsight

NVIDIA Parallel Nsight is the first development environment designed specifically to support massively parallel CUDA C, OpenCL, and DirectCompute applications. It bridges the productivity gap between CPU and GPU code by bringing parallel-aware hardware source code debugging and performance analysis directly into Microsoft Visual Studio, the most widely used integrated application development environment under Microsoft Windows.

Parallel Nsight allows Visual Studio developers to write and debug GPU source code using exactly the same tools and interfaces that are used when writing and debugging CPU code, including source and data breakpoints, and memory inspection. Furthermore, Parallel Nsight extends Visual Studio functionality by offering tools to manage massive parallelism, such as the ability to focus and debug on a single thread out of the thousands of threads running in parallel, and the ability to simply and efficiently visualize the results computed by all parallel threads.

Parallel Nsight is the perfect environment to develop co-processing applications that take advantage of both the CPU and GPU. It captures performance events and information across both processors, and presents the information to the developer on a single correlated timeline. This allows developers to see how their application behaves and performs on the entire system, rather than through a narrow view that is focused on a particular subsystem or processor.


Parallel Nsight Debugger for GPU Computing
ʎ Debug your CUDA C/C++ and DirectCompute source code directly on the GPU hardware
ʎ As the industry’s only GPU hardware debugging solution, it drastically increases debugging speed and accuracy
ʎ Use the familiar Visual Studio Locals, Watches, Memory and Breakpoints windows

Parallel Nsight Analysis Tool for GPU Computing
ʎ Isolate performance bottlenecks by viewing system-wide CPU+GPU events
ʎ Support for all major GPU Computing APIs, including CUDA C/C++, OpenCL, and Microsoft DirectCompute

Figures: Debugger for GPU Computing; Analysis Tool for GPU Computing.


transtec has strived to develop well-engineered GPU Computing solutions from the very beginning of the Tesla era. From high-performance GPU workstations to rack-mounted Tesla server solutions, transtec has a broad range of specially designed systems available. As an NVIDIA Tesla Preferred Provider (TPP), transtec is able to provide customers with the latest NVIDIA GPU technology as well as fully engineered hybrid systems and Tesla Preconfigured Clusters. Thus, customers can be assured that transtec’s large experience in HPC cluster solutions is seamlessly brought into the GPU computing world. Performance Engineering made by transtec.


Parallel Nsight Debugger for Graphics Development
ʎ Debug HLSL shaders directly on the GPU hardware, drastically increasing debugging speed and accuracy over emulated (SW) debugging
ʎ Use the familiar Visual Studio Locals, Watches, Memory and Breakpoints windows with HLSL shaders, including DirectCompute code
ʎ The Debugger supports all HLSL shader types: Vertex, Pixel, Geometry, and Tessellation

Parallel Nsight Graphics Inspector for Graphics Development
ʎ Graphics Inspector captures Direct3D rendered frames for real-time examination
ʎ The Frame Profiler automatically identifies bottlenecks and performance information on a per-draw-call basis
ʎ Pixel History shows you all operations that affected a given pixel

Figures: Debugger for Graphics Development; Graphics Inspector for Graphics Development.


A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs.

Key features of the architecture that will be discussed below in more depth include:
ʎ The new SMX processor architecture
ʎ An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
ʎ Hardware support throughout the design to enable new programming model capabilities

Kepler GK110 supports the new CUDA Compute Capability 3.5. (For a brief overview of CUDA, see the section “A Quick Refresher on CUDA” below.) The following table compares parameters of different Compute Capabilities for the Fermi and Kepler GPU architectures:

                                            Fermi     Fermi     Kepler       Kepler
                                            GF100     GF104     GK104        GK110
Compute Capability                          2.0       2.1       3.0          3.5
Threads / Warp                              32        32        32           32
Max Warps / Multiprocessor                  48        48        64           64
Max Threads / Multiprocessor                1536      1536      2048         2048
Max Thread Blocks / Multiprocessor          8         8         16           16
32-bit Registers / Multiprocessor           32768     32768     65536        65536
Max Registers / Thread                      63        63        63           255
Max Threads / Thread Block                  1024      1024      1024         1024
Shared Memory Size Configurations (bytes)   16K/48K   16K/48K   16K/32K/48K  16K/32K/48K
Max X Grid Dimension                        2^16-1    2^16-1    2^32-1       2^32-1
Hyper-Q                                     No        No        No           Yes
Dynamic Parallelism                         No        No        No           Yes

Figure: Kepler GK110 full-chip block diagram.

Performance per Watt

A principal design goal for the Kepler architecture was improving power efficiency. When designing Kepler, NVIDIA engineers applied everything learned from Fermi to better optimize the Kepler architecture for highly efficient operation. TSMC’s 28nm manufacturing process plays an important role in lowering power consumption, but many GPU architecture modifications were required to further reduce power consumption while maintaining great performance.

Every hardware unit in Kepler was designed and scrubbed to provide outstanding performance per watt. The best example of great perf/watt is seen in the design of Kepler GK110’s new Streaming Multiprocessor (SMX), which is similar in many respects to the SMX unit recently introduced in Kepler GK104, but includes substantially more double precision units for compute algorithms.

Streaming Multiprocessor (SMX) Architecture

Kepler GK110’s new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we’ve built, but also the most programmable and power-efficient.

SMX Processing Core Architecture

Each of the Kepler GK110 SMX units features 192 single-precision CUDA cores, and each core has fully pipelined floating-point and integer arithmetic logic units. Kepler retains the full IEEE 754-2008 compliant single- and double-precision arithmetic introduced in Fermi, including the fused multiply-add (FMA) operation.

One of the design goals for the Kepler GK110 SMX was to significantly increase the GPU’s delivered double precision performance, since double precision arithmetic is at the heart of many HPC applications. Kepler GK110’s SMX also retains the special function units (SFUs) for fast approximate transcendental operations as in previous-generation GPUs, providing 8x the number of SFUs of the Fermi GF110 SM.

Similar to GK104 SMX units, the cores within the new GK110 SMX units use the primary GPU clock rather than the 2x shader clock. Recall that the 2x shader clock was introduced in the G80 Tesla-architecture GPU and used in all subsequent Tesla- and Fermi-architecture GPUs. Running execution units at a higher clock rate allows a chip to achieve a given target throughput with fewer copies of the execution units, which is essentially an area optimization, but the clocking logic for the faster cores is more power-hungry. For Kepler, our priority was performance per watt. While we made many optimizations that benefitted both area and power, we chose to optimize for power even at the expense of some added area cost, with a larger number of processing cores running at the lower, less power-hungry GPU clock.

SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).

Quad Warp Scheduler

The SMX schedules threads in groups of 32 parallel threads called warps. Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and executed concurrently. Kepler’s quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle. Unlike Fermi, which did not permit double precision instructions to be paired with other instructions, Kepler GK110 allows double precision instructions to be paired with other instructions.

NVIDIA, GeForce, Tesla, CUDA, PhysX, GigaThread, NVIDIA Parallel Data Cache and certain other trademarks and logos appearing in this brochure are trademarks or registered trademarks of NVIDIA Corporation.

Each Kepler SMX contains 4 Warp Schedulers, each with dual Instruction Dispatch Units. A single Warp Scheduler Unit is shown above.

We also looked for opportunities to optimize the power in the SMX warp scheduler logic. For example, both Kepler and Fermi schedulers contain similar hardware units to handle the scheduling function, including:
ʎ Register scoreboarding for long latency operations (texture and load)
ʎ Inter-warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates)
ʎ Thread block level scheduling (e.g., the GigaThread engine)

However, Fermi’s scheduler also contains a complex hardware stage to prevent data hazards in the math datapath itself. A multi-port register scoreboard keeps track of any registers that are not yet ready with valid data, and a dependency checker block analyzes register usage across a multitude of fully decoded warp instructions against the scoreboard, to determine which are eligible to issue.

For Kepler, we recognized that this information is deterministic (the math pipeline latencies are not variable), and therefore it is possible for the compiler to determine up front when instructions will be ready to issue, and provide this information in the instruction itself. This allowed us to replace several complex and power-expensive blocks with a simple hardware block that extracts the pre-determined latency information and uses it to mask out warps from eligibility at the inter-warp scheduler stage.

New ISA Encoding: 255 Registers per Thread

The number of registers that can be accessed by a thread has been quadrupled in GK110, allowing each thread access to up to 255 registers. Codes that exhibit high register pressure or spilling behavior in Fermi may see substantial speedups as a result of the increased available per-thread register count. A compelling example can be seen in the QUDA library for performing lattice QCD (quantum chromodynamics) calculations using CUDA. QUDA fp64-based algorithms see performance increases up to 5.3x due to the ability to use many more registers per thread and experiencing fewer spills to local memory.

Shuffle Instruction

To further improve performance, Kepler implements a new Shuffle instruction, which allows threads within a warp to share data. Previously, sharing data between threads within a warp required separate store and load operations to pass the data through shared memory. With the Shuffle instruction, threads within a warp can read values from other threads in the warp in just about any imaginable permutation. Shuffle supports arbitrary indexed references – i.e., any thread reads from any other thread. Useful shuffle subsets, including next-thread (offset up or down by a fixed amount) and XOR “butterfly” style permutations among the threads in a warp, are also available as CUDA intrinsics.

This example shows some of the variations possible using the new Shuffle instruction in Kepler.

Page 108: Hpc compass 2013-final_web

108

Shuffle offers a performance advantage over shared memory, in that a store-and-load operation is carried out in a single step. Shuffle also can reduce the amount of shared memory needed per thread block, since data exchanged at the warp level never needs to be placed in shared memory. In the case of FFT, which requires data sharing within a warp, a 6% performance gain can be seen just by using Shuffle.
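A small sketch of a warp-level sum built on the Shuffle instruction; it uses the __shfl_down_sync spelling of current CUDA toolkits, whereas the Kepler-era intrinsic was __shfl_down without the mask argument.

```cuda
// Warp-level reduction using shuffle: each step pulls a value from a thread a
// fixed offset "down" the warp, with no shared memory traffic at all.
__device__ float warpReduceSum(float val) {
    const unsigned fullMask = 0xffffffffu;            // all 32 lanes participate
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(fullMask, val, offset);
    return val;                                        // lane 0 holds the warp's sum
}
```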

Atomic Operations

Atomic memory operations are important in parallel programming, allowing concurrent threads to correctly perform read-modify-write operations on shared data structures. Atomic operations such as add, min, max, and compare-and-swap are atomic in the sense that the read, modify, and write operations are performed without interruption by other threads. Atomic memory operations are widely used for parallel sorting, reduction operations, and building data structures in parallel without locks that serialize thread execution.

Throughput of global memory atomic operations on Kepler GK110 is substantially improved compared to the Fermi generation. Atomic operation throughput to a common global memory address is improved by 9x to one operation per clock. Atomic operation throughput to independent global addresses is also significantly accelerated, and logic to handle address conflicts has been made more efficient. Atomic operations can often be processed at rates similar to global load operations. This speed increase makes atomics fast enough to use frequently within kernel inner loops, eliminating the separate reduction passes that were previously required by some algorithms to consolidate results. Kepler GK110 also expands the native support for 64-bit atomic operations in global memory. In addition to atomicAdd, atomicCAS, and atomicExch (which were also supported by Fermi and Kepler GK104), GK110 supports the following:
ʎ atomicMin
ʎ atomicMax
ʎ atomicAnd
ʎ atomicOr
ʎ atomicXor

Other atomic operations which are not supported natively (for example, 64-bit floating point atomics) may be emulated using the compare-and-swap (CAS) instruction.
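The CAS-based emulation mentioned above is typically written along the following lines (essentially the well-known pattern from the CUDA C Programming Guide); native double-precision atomicAdd only appeared in later GPU generations.

```cuda
// Emulating a 64-bit floating-point atomic add with atomicCAS on the value's
// 64-bit integer representation; retries until no other thread intervened.
__device__ double atomicAddDouble(double* address, double val) {
    unsigned long long int* addr_as_ull =
        reinterpret_cast<unsigned long long int*>(address);
    unsigned long long int old = *addr_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(addr_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // retry if another thread updated the value meanwhile
    return __longlong_as_double(old);
}
```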

Texture Improvements

The GPU’s dedicated hardware texture units are a valuable resource for compute programs with a need to sample or filter image data. The texture throughput in Kepler is significantly increased compared to Fermi – each SMX unit contains 16 texture filtering units, a 4x increase vs. the Fermi GF110 SM.

In addition, Kepler changes the way texture state is managed. In the Fermi generation, for the GPU to reference a texture, it had to be assigned a “slot” in a fixed-size binding table prior to grid launch. The number of slots in that table ultimately limits how many unique textures a program can read from at runtime. In practice, a program was limited to accessing only 128 simultaneous textures in Fermi.

With bindless textures in Kepler, the additional step of using slots isn’t necessary: texture state is now saved as an object in memory and the hardware fetches these state objects on demand, making binding tables obsolete. This effectively eliminates any limits on the number of unique textures that can be referenced by a compute program. Instead, programs can map textures at any time and pass texture handles around as they would any other pointer.

Kepler Memory Subsystem – L1, L2, ECC

Kepler’s memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below.

64 KB Configurable Shared Memory and L1 Cache

In the Kepler GK110 architecture, as in the previous-generation Fermi architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler now allows for additional flexibility in configuring the allocation of shared memory and L1 cache by permitting a 32 KB / 32 KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock.
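From the host, the desired split can be requested with the CUDA runtime API, as in the following sketch; cudaFuncCachePreferEqual corresponds to the 32 KB / 32 KB option that Kepler adds, and the kernel shown is just a placeholder.

```cuda
// Requesting a shared-memory/L1 configuration from the host side.
#include <cuda_runtime.h>

__global__ void myKernel() { /* ... placeholder workload ... */ }

int main() {
    // Per-kernel preference (the driver may still override it):
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferEqual);  // 32 KB / 32 KB
    // Or a device-wide default:
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);         // 48 KB shared / 16 KB L1
    myKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```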

48 KB Read-Only Data Cache

In addition to the L1 cache, Kepler introduces a 48 KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the texture unit. Expert programmers often found it advantageous to load data through this path explicitly by mapping their data as textures, but this approach had many limitations.

In Kepler, in addition to significantly increasing the capacity of this cache along with the texture horsepower increase, we decided to make the cache directly accessible to the SM for general load operations. Use of the read-only path is beneficial because it takes both load and working set footprint off of the Shared/L1 cache path. In addition, the Read-Only Data Cache’s higher tag bandwidth supports full speed unaligned memory access patterns among other scenarios.

Use of the read-only path can be managed automatically by the compiler or explicitly by the programmer. Access to any variable or data structure that is known to be constant through programmer use of the C99-standard “const __restrict” keyword may be tagged by the compiler to be loaded through the Read-Only Data Cache. The programmer can also explicitly use this path with the __ldg() intrinsic.
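A short sketch of both ways to reach the read-only data cache from a kernel; it assumes a device of compute capability 3.5 or higher, and the kernel itself is only illustrative.

```cuda
// Two routes into the read-only data cache: a const __restrict__ qualified
// pointer the compiler can prove read-only, and the explicit __ldg() intrinsic.
__global__ void scale(const float* __restrict__ in, float* __restrict__ out,
                      float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = in[i];         // compiler may route this load via the read-only path
        float b = __ldg(&in[i]); // explicit read-only load through the intrinsic
        out[i] = alpha * (a + b) * 0.5f;
    }
}
```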

Improved L2 Cache

The Kepler GK110 GPU features 1536 KB of dedicated L2 cache memory, double the amount of L2 available in the Fermi architecture. The L2 cache is the primary point of data unification between the SMX units, servicing all load, store, and texture requests and providing efficient, high speed data sharing across the GPU. The L2 cache on Kepler offers up to 2x the bandwidth per clock available in Fermi. Algorithms for which data addresses are not known beforehand, such as physics solvers, ray tracing, and sparse matrix multiplication, especially benefit from the cache hierarchy. Filter and convolution kernels that require multiple SMs to read the same data also benefit.

Memory Protection Support

Like Fermi, Kepler’s register files, shared memories, L1 cache, L2 cache and DRAM memory are protected by a Single-Error Correct Double-Error Detect (SECDED) ECC code. In addition, the Read-Only Data Cache supports single-error correction through a parity check; in the event of a parity error, the cache unit automatically invalidates the failed line, forcing a read of the correct data from L2.

ECC checkbit fetches from DRAM necessarily consume some amount of DRAM bandwidth, which results in a performance difference between ECC-enabled and ECC-disabled operation, especially on memory bandwidth-sensitive applications. Kepler GK110 implements several optimizations to ECC checkbit fetch handling based on Fermi experience. As a result, the ECC on-vs-off performance delta has been reduced by an average of 66%, as measured across our internal compute application test suite.

Dynamic Parallelism

In a hybrid CPU-GPU system, enabling a larger amount of parallel code in an application to run efficiently and entirely within the GPU improves scalability and performance as GPUs increase in perf/watt. To accelerate these additional parallel portions of the application, GPUs must support more varied types of parallel workloads.

Dynamic Parallelism is a new feature introduced with Kepler GK110 that allows the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU.

Fermi was very good at processing large parallel data structures when the scale and parameters of the problem were known at kernel launch time. All work was launched from the host CPU, would run to completion, and return a result back to the CPU. The result would then be used as part of the final solution, or would be analyzed by the CPU, which would then send additional requests back to the GPU for additional processing.

In Kepler GK110 any kernel can launch another kernel, and can create the necessary streams and events and manage the dependencies needed to process additional work without the need for host CPU interaction. This architectural innovation makes it easier for developers to create and optimize recursive and data-dependent execution patterns, and allows more of a program to be run directly on the GPU. The system CPU can then be freed up for additional tasks, or the system could be configured with a less powerful CPU to carry out the same workload.

Dynamic Parallelism allows more varieties of parallel algorithms to be implemented on the GPU, including nested loops with differing amounts of parallelism, parallel teams of serial control-task threads, or simple serial control code offloaded to the GPU in order to promote data-locality with the parallel portion of the application.

Because a kernel has the ability to launch additional workloads based on intermediate, on-GPU results, programmers can now intelligently load-balance work to focus the bulk of their resources on the areas of the problem that either require the most processing power or are most relevant to the solution.

One example would be dynamically setting up a grid for a numerical simulation – typically grid cells are focused in regions of greatest change, requiring an expensive pre-processing pass through the data. Alternatively, a uniformly coarse grid could be used to prevent wasted GPU resources, or a uniformly fine grid could be used to ensure all the features are captured, but these options risk missing simulation features or “over-spending” compute resources on regions of less interest. With Dynamic Parallelism, the grid resolution can be determined dynamically at runtime in a data-dependent manner.

Dynamic Parallelism allows more parallel code in an application to be launched directly by the GPU onto itself (right side of image) rather than requiring CPU intervention (left side of image).


A Quick Refresher on CUDA

CUDA is a combination hardware/software platform that enables NVIDIA GPUs to execute programs written with C, C++, Fortran, and other languages. A CUDA program invokes parallel functions called kernels that execute across many parallel threads. The programmer or compiler organizes these threads into thread blocks and grids of thread blocks, as shown in the figure below. Each thread within a thread block executes an instance of the kernel. Each thread also has thread and block IDs within its thread block and grid, a program counter, registers, per-thread private memory, inputs, and output results. A thread block is a set of concurrently executing threads that can cooperate among themselves through barrier synchronization and shared memory. A thread block has a block ID within its grid. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between dependent kernel calls. In the CUDA parallel programming model, each thread has a per-thread private memory space used for register spills, function calls, and C automatic array variables. Each thread block has a per-block shared memory space used for inter-thread communication, data sharing, and result sharing in parallel algorithms. Grids of thread blocks share results in global memory space after kernel-wide global synchronization.

CUDA Hardware Execution

CUDA’s hierarchy of threads maps to a hierarchy of processors on the GPU: a GPU executes one or more kernel grids; a streaming multiprocessor (SM on Fermi / SMX on Kepler) executes one or more thread blocks; and CUDA cores and other execution units in the SMX execute thread instructions. The SMX executes threads in groups of 32 threads called warps. While programmers can generally ignore warp execution for functional correctness and focus on programming individual scalar threads, they can greatly improve performance by having threads in a warp execute the same code path and access memory with nearby addresses.

CUDA hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared, and per-application global memory spaces.
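A minimal, generic CUDA example (not taken from the brochure) of this hierarchy: each thread uses its block and thread IDs to compute one element of a vector sum.

```cuda
// Vector addition: a grid of thread blocks, one thread per output element.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void vecAdd(const float* A, const float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) C[i] = A[i] + B[i];
}

int main() {
    const int n = 1 << 20;
    float *A, *B, *C;                                // unified memory keeps the sketch short
    cudaMallocManaged(&A, n * sizeof(float));
    cudaMallocManaged(&B, n * sizeof(float));
    cudaMallocManaged(&C, n * sizeof(float));
    for (int i = 0; i < n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    int threadsPerBlock = 256;                       // one thread block = 8 warps
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(A, B, C, n); // launch a grid of thread blocks
    cudaDeviceSynchronize();

    std::printf("C[0] = %f\n", C[0]);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```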


Returning to the simulation-grid example: starting with a coarse grid, the simulation can “zoom in” on areas of interest while avoiding unnecessary calculation in areas with little change. Though this could be accomplished using a sequence of CPU-launched kernels, it would be far simpler to allow the GPU to refine the grid itself by analyzing the data and launching additional work as part of a single simulation kernel, eliminating interruption of the CPU and data transfers between the CPU and GPU.

The above example illustrates the benefits of using a dynamically sized grid in a numerical simulation. To meet peak precision requirements, a fixed-resolution simulation must run at an excessively fine resolution across the entire simulation domain, whereas a multi-resolution grid applies the correct simulation resolution to each area based on local variation.

Figure: multi-resolution simulation grid (image: Charles Reid).
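A minimal sketch of the pattern itself: a parent kernel launches child grids directly on the GPU. It assumes a device of compute capability 3.5 or higher and compilation with relocatable device code (for example, nvcc -arch=sm_35 -rdc=true -lcudadevrt); the kernel names and the refinement criterion are purely illustrative.

```cuda
// Dynamic Parallelism sketch: the parent kernel decides, per region, whether
// to launch additional (child) work, without any round trip to the CPU.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void childKernel(int region) {
    printf("refining region %d, thread %d\n", region, threadIdx.x);
}

__global__ void parentKernel() {
    int region = threadIdx.x;
    if (region < 2) {                      // pretend only these regions need refinement
        childKernel<<<1, 4>>>(region);     // launched from the GPU itself
        // Child grids complete before their parent grid is considered finished.
    }
}

int main() {
    parentKernel<<<1, 8>>>();
    cudaDeviceSynchronize();
    return 0;
}
```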

Hyper-Q

One of the challenges in the past has been keeping the GPU supplied with an optimally scheduled load of work from multiple streams. The Fermi architecture supported 16-way concurrency of kernel launches from separate streams, but ultimately the streams were all multiplexed into the same hardware work queue. This allowed for false intra-stream dependencies, requiring dependent kernels within one stream to complete before additional kernels in a separate stream could be executed. While this could be alleviated to some extent through the use of a breadth-first launch order, as program complexity increases, this can become more and more difficult to manage efficiently.

Kepler GK110 improves on this functionality with the new Hyper-Q feature. Hyper-Q increases the total number of connections (work queues) between the host and the CUDA Work Distributor (CWD) logic in the GPU by allowing 32 simultaneous, hard-

ware-managed connections (compared to the single connection

available with Fermi). Hyper-Q is a flexible solution that allows

connections from multiple CUDA streams, from multiple Mes-

sage Passing Interface (MPI) processes, or even from multiple

threads within a process. Applications that previously encoun-

tered false serialization across tasks, thereby limiting GPU utili-

zation, can see up to a 32x performance increase without chang-

ing any existing code.

Each CUDA stream is managed within its own hardware work

queue, inter-stream dependencies are optimized, and opera-

tions in one stream will no longer block other streams, enabling

streams to execute concurrently without needing to specifically

tailor the launch order to eliminate possible false dependencies.
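On the host side, the pattern that benefits from Hyper-Q is simply a set of independent kernels issued into separate CUDA streams, as in the sketch below; the stream count, kernel body, and buffer sizes are illustrative assumptions. With Fermi's single hardware work queue such launches could serialize depending on launch order, whereas with Hyper-Q each stream maps to its own work queue and the kernels can run concurrently.

#include <cuda_runtime.h>

__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;   // arbitrary independent work
}

int main()
{
    const int nStreams = 8, n = 1 << 20;
    cudaStream_t streams[nStreams];
    float *buf[nStreams];
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        cudaMemset(buf[s], 0, n * sizeof(float));
        // Each launch is placed in its own stream; no special launch order
        // is needed on Kepler to avoid false inter-stream dependencies.
        work<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}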

Hyper-Q offers significant benefits for use in MPI-based paral-

lel computer systems. Legacy MPI-based algorithms were often

created to run on multi-core CPU systems, with the amount of

work assigned to each MPI process scaled accordingly. This can

lead to a single MPI process having insufficient work to fully oc-

cupy the GPU. While it has always been possible for multiple MPI

processes to share a GPU, these processes could become bottle-

necked by false dependencies. Hyper-Q removes those false de-

pendencies, dramatically increasing the efficiency of GPU shar-

ing across MPI processes.

Grid Management Unit - Efficiently Keeping the GPU Utilized

New features in Kepler GK110, such as the ability for CUDA ker-

nels to launch work directly on the GPU with Dynamic Parallel-

ism, required that the CPU-to-GPU workflow in Kepler offer in-

creased functionality over the Fermi design. On Fermi, a grid of

thread blocks would be launched by the CPU and would always

run to completion, creating a simple unidirectional flow of work

Hyper-Q permits more simultaneous connections between CPU and GPU

Hyper-Q working with CUDA Streams: In the Fermi model shown on the left, only (C,P) & (R,X) can run concurrently due to intra-stream dependencies caused by the single hardware work queue. The Kepler Hyper-Q model allows all streams to run concurrently using separate work queues.


from the host to the SMs via the CUDA Work Distributor (CWD)

unit. Kepler GK110 was designed to improve the CPU-to-GPU

workflow by allowing the GPU to efficiently manage both CPU-

and CUDA-created workloads.

We discussed the ability of the Kepler GK110 GPU to allow ker-

nels to launch work directly on the GPU, and it’s important to

understand the changes made in the Kepler GK110 architec-

ture to facilitate these new functions. In Kepler, a grid can be

launched from the CPU just as was the case with Fermi, how-

ever, new grids can also be created programmatically by CUDA

within the Kepler SMX unit. To manage both CUDA-created and

host-originated grids, a new Grid Management Unit (GMU) was

introduced in Kepler GK110. This control unit manages and pri-

oritizes grids that are passed into the CWD to be sent to the

SMX units for execution.

The CWD in Kepler holds grids that are ready to dispatch, and it

is able to dispatch 32 active grids, which is double the capacity of


The redesigned Kepler host-to-GPU workflow shows the new Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch, and hold pending and suspended grids.


the Fermi CWD. The Kepler CWD communicates with the GMU via

a bidirectional link that allows the GMU to pause the dispatch

of new grids and to hold pending and suspended grids until

needed. The GMU also has a direct connection to the Kepler SMX

units to permit grids that launch additional work on the GPU via

Dynamic Parallelism to send the new work back to GMU to be

prioritized and dispatched. If the kernel that dispatched the ad-

ditional workload pauses, the GMU will hold it inactive until the

dependent work has completed.

NVIDIA GPUDirect

When working with a large amount of data, increasing the data

throughput and reducing latency is vital to increasing compute

performance. Kepler GK110 supports the RDMA feature in NVIDIA

GPUDirect, which is designed to improve performance by allow-

ing direct access to GPU memory by third-party devices such as

IB adapters, NICs, and SSDs. When using CUDA 5.0, GPUDirect pro-

vides the following important features:

ʎ Direct memory access (DMA) between NIC and GPU without

the need for CPU-side data buffering.

ʎ Significantly improved MPI_Send/MPI_Recv efficiency between

GPU and other nodes in a network.

ʎ Eliminates CPU bandwidth and latency bottlenecks

ʎ Works with a variety of 3rd-party network, capture, and storage

devices

Applications like reverse time migration (used in seismic imag-

ing for oil & gas exploration) distribute the large imaging data

across several GPUs. Hundreds of GPUs must collaborate to

crunch the data, often communicating intermediate results.

GPUDirect enables much higher aggregate bandwidth for this

GPU-to-GPU communication scenario within a server and across

servers with the P2P and RDMA features.

Kepler GK110 also supports other GPUDirect features such as

Peer-to-Peer and GPUDirect for Video.
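From the application's perspective this capability is typically reached through a CUDA-aware MPI library: a device pointer is handed directly to MPI, and GPUDirect RDMA lets the InfiniBand adapter read and write GPU memory without staging the data in host memory. The exchange below is a hedged sketch rather than code from the text; it assumes exactly two MPI ranks, one GPU per rank, and an MPI build with CUDA support.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_send, *d_recv;                      // buffers in GPU memory
    cudaMalloc(&d_send, n * sizeof(float));
    cudaMalloc(&d_recv, n * sizeof(float));
    cudaMemset(d_send, 0, n * sizeof(float));

    // Device pointers passed straight to MPI; with GPUDirect RDMA the
    // adapter transfers the data without a host-side bounce buffer.
    int peer = (rank == 0) ? 1 : 0;
    MPI_Sendrecv(d_send, n, MPI_FLOAT, peer, 0,
                 d_recv, n, MPI_FLOAT, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}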

Conclusion

With the launch of Fermi in 2010, NVIDIA ushered in a new era

in the high performance computing (HPC) industry based on a

hybrid computing model where CPUs and GPUs work together

to solve computationally-intensive workloads. Now, with the

new Kepler GK110 GPU, NVIDIA again raises the bar for the HPC

industry.

Kepler GK110 was designed from the ground up to maximize

computational performance and throughput computing with

outstanding power efficiency. The architecture has many new

innovations such as SMX, Dynamic Parallelism, and Hyper-Q that

make hybrid computing dramatically faster, easier to program,

and applicable to a broader set of applications. Kepler GK110

GPUs will be used in numerous systems ranging from worksta-

tions to supercomputers to address the most daunting chal-

lenges in HPC.

GPUDirect RDMA allows direct access to GPU memory from 3rd-party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.


Executive Overview

The High Performance Computing market’s continuing need for

improved time-to-solution and the ability to explore expanding

models seems unquenchable, requiring ever-faster HPC clus-

ters. This has led many HPC users to implement graphic pro-

cessing units (GPUs) into their clusters. While GPUs have tradi-

tionally been used solely for visualization or animation, today

they serve as fully programmable, massively parallel proces-

sors, allowing computing tasks to be divided and concurrently

processed on the GPU’s many processing cores. When multiple

GPUs are integrated into an HPC cluster, the performance po-

tential of the HPC cluster is greatly enhanced. This processing

environment enables scientists and researchers to tackle some

of the world’s most challenging computational problems.

HPC applications modified to take advantage of the GPU pro-

cessing capabilities can benefit from significant performance

gains over clusters implemented with traditional processors. To

obtain these results, HPC clusters with multiple GPUs require a

high-performance interconnect to handle the GPU-to-GPU com-

munications and optimize the overall performance potential of

the GPUs. Because the GPUs place significant demands on the

interconnect, it takes a high-performance interconnect, such

as InfiniBand, to provide the low latency, high message rate,

and bandwidth that are needed to enable all resources in the

cluster to run at peak performance.

Intel worked in concert with NVIDIA to optimize Intel TrueScale

InfiniBand with NVIDIA GPU technologies. This solution sup-

ports the full performance potential of NVIDIA GPUs through

an interface that is easy to deploy and maintain.

Key Points

ʎ Up to 44 percent GPU performance improvement versus

implementation without GPUDirect – a GPU computing


product from NVIDIA that enables faster communication be-

tween the GPU and InfiniBand

ʎ Intel TrueScale InfiniBand offers as much as 10 percent better

GPU performance than other InfiniBand interconnects

ʎ Ease of installation and maintenance – Intel’s implementa-

tion offers a streamlined deployment approach that is sig-

nificantly easier than alternatives

Ease of Deployment

One of the key challenges with deploying clusters consisting of

multi-GPU nodes is to maximize application performance. With-

out GPUDirect, GPU-to-GPU communications would require the

host CPU to make multiple memory copies to avoid a memory

pinning conflict between the GPU and InfiniBand. Each addi-

tional CPU memory copy significantly reduces the performance

potential of the GPUs.

Intel’s implementation of GPUDirect takes a streamlined ap-

proach to optimizing NVIDIA GPU performance with Intel TrueS-

cale InfiniBand. With Intel’s solution, a user only needs to update

the NVIDIA driver with code provided and tested by Intel. Other

InfiniBand implementations require the user to implement

a Linux kernel patch as well as a special InfiniBand driver. The

Intel approach provides a much easier way to deploy, support,

and maintain GPUs in a cluster without having to sacrifice per-

formance. In addition, it is completely compatible with other

GPUDirect implementations; the CUDA libraries and application

code require no changes.

Optimized Performance

Intel used AMBER molecular dynamics simulation software to

test clustered GPU performance with and without GPUDirect.

Figure 15 shows that there is a significant performance gain of

up to 44 percent that results from streamlining the host memory

access to support GPU-to-GPU communications.

Clustered GPU Performance

HPC applications that have been designed to take advantage

of parallel GPU performance require a high-performance inter-

connect, such as InfiniBand, to maximize that performance. In

addition, the implementation or architecture of the InfiniBand

interconnect can impact performance. The two industry-leading

InfiniBand implementations have very different architectures

and only one was specifically designed for the HPC market –

Intel’s TrueScale InfiniBand. TrueScale InfiniBand provides un-

matched performance benefits, especially as the GPU cluster is

scaled. It offers high performance in all of the key areas that influ-

ence the performance of HPC applications, including GPU-based

applications. These factors include the following:

ʎ Scalable non-coalesced message rate performance greater

than 25M messages per second

ʎ Extremely low latency for MPI collectives, even on clusters

consisting of thousands of nodes

ʎ Consistently low latency of one to two µs, even at scale

These factors and the design of Intel TrueScale InfiniBand en-

able it to optimize the performance of NVIDIA GPUs.

Figure 15. Performance with and without GPUDirect – AMBER Cellulose test on 8 GPUs, ns/day: TrueScale with GPUDirect delivers up to 44 percent better performance than the implementation without GPUDirect.

The follow-


ing tests were performed on NVIDIA Tesla 2050s interconnected

with Intel TrueScale QDR InfiniBand at Intel’s NETtrack Develop-

er Center. The Tesla 2050 results for the industry’s other leading

InfiniBand are from the published results on the AMBER bench-

mark site.

Figure 16 shows performance results from the AMBER Myoglo-

bin benchmark (2,492 atoms) when scaling from two to eight

Tesla 2050 GPUs. The results indicate that the Intel TrueScale

InfiniBand offers up to 10 percent more performance than the

industry’s other leading InfiniBand when both used their ver-

sions of GPUDirect. As the figure shows, the performance dif-

ference increases as the application is scaled to more GPUs.

The next test (Figure 17) shows the impact of the InfiniBand

interconnect on the performance of AMBER across models

of various sizes. The following Explicit Solvent models were

tested:

ʎ DHFR: 23,558 atoms

ʎ FactorIX: 90,906 atoms

ʎ Cellulose: 408,609 atoms

It is important to point out that the performance of the models

is dependent on the model size, the size of the GPU cluster, and

the performance of the InfiniBand interconnect. The smaller

the model, the more it is dependent on the interconnect due

to the fact that the model’s components (atoms in the case of

AMBER) are divided across the available GPUs in the cluster to

be processed for each step of the simulation. For example, the

DHFR test with its 23,558 atoms means that each Tesla 2050 in an

eight-GPU cluster is processing only 2,945 atoms for each step

of the simulation. The processing time is relatively small when

compared to the communication time. In contrast, the Cellulose

model with its 408K atoms requires each GPU to process 17 times

more data per step than the DHFR test, so significantly more

time is spent in GPU processing than in communications.


The preceding tests demonstrate that the TrueScale InfiniBand

performs better under load. The DHFR model is the most sensi-

tive to the interconnect performance, and it indicates that Tru-

eScale offers six percent more performance than the alternative

InfiniBand product. Combining the results from Figure 15 and

Figure 16 illustrates that TrueScale InfiniBand provides better

results with smaller models on small clusters and better model

scalability for larger models on larger GPU clusters.

Performance/Watt Advantage

Today the focus is not just on performance, but on how efficiently

that performance can be delivered.

This is an area in which Intel TrueScale InfiniBand excels. The

National Center for Supercomputing Applications (NCSA) has a

cluster based on NVIDIA GPUs interconnected with TrueScale

InfiniBand. This cluster is number three on the November 2010

Green500 list with performance of 933 MFlops/Watt.

This on its own is a significant accomplishment, but it is even

more impressive when considering its original position on the

SuperComputing Top500 list. In fact, the cluster is ranked at #404

on the Top500 list, but the combination of NVIDIA’s GPU perfor-

mance, Intel’s TrueScale performance, and low power consump-

tion enabled the cluster to move up 401 spots from the Top500

list to reach number three on the Green500 list. This is the most

dramatic shift of any cluster in the top 50 of the Green500. In

part, the following are the reasons for such dramatic perfor-

mance/watt results:

ʎ Performance of the NVIDIA Tesla 2050 GPU

ʎ Linpack performance efficiency of this cluster is 49 percent,

which is almost 20 percent better than most other NVIDIA

GPU-based clusters on the Top500 list

ʎ The Intel TrueScale InfiniBand Adapter required 25 - 50 per-

cent less power than the alternative InfiniBand product

Conclusion

The performance of the InfiniBand interconnect has a significant

impact on the performance of GPU-based clusters. Intel’s Tru-

eScale InfiniBand is designed and architected for the HPC mar-

ketplace, and it offers an unmatched performance profile with

a GPU-based cluster. Finally, Intel’s solution provides an imple-

mentation that is easier to deploy and maintain, and allows for

optimal performance in comparison to the industry’s other lead-

ing InfiniBand.

Figure 16. GPU Scalable Performance with the Industry’s Leading InfiniBands – AMBER Myoglobin test, ns/day on 2, 4, and 8 GPUs, Intel TrueScale versus the other leading InfiniBand.

Figure 17. Explicit Solvent Benchmark Results for the Two Leading InfiniBands – AMBER Explicit Solvent tests (Cellulose, FactorIX, DHFR), ns/day on 8 GPUs, Intel TrueScale versus the other leading InfiniBand.


Intel Xeon Phi Coprocessor – The Architecture


Intel Many Integrated Core (Intel MIC) architecture combines many Intel CPU cores onto a single chip. Intel MIC architecture is targeted for highly parallel, High Performance Computing (HPC) workloads in a variety of fields such as computational physics, chemistry, biology, and financial services. Today such workloads are run as task parallel programs on large compute clusters.



Intel Xeon Phi Coprocessor

The Architecture

The Intel MIC architecture is aimed at achieving high through-

put performance in cluster environments where there are rigid

floor planning and power constraints. A key attribute of the mi-

croarchitecture is that it is built to provide a general-purpose

programming environment similar to the Intel Xeon processor

programming environment. The Intel Xeon Phi coprocessors

based on the Intel MIC architecture run a full service Linux op-

erating system, support x86 memory order model and IEEE 754

floating-point arithmetic, and are capable of running applica-

tions written in industry-standard programming languages such

as Fortran, C, and C++. The coprocessor is supported by a rich

development environment that includes compilers, numerous

libraries such as threading libraries and high performance math

libraries, performance characterizing and tuning tools, and de-

buggers.

The Intel Xeon Phi coprocessor is connected to an Intel Xeon

processor, also known as the “host”, through a PCI Express (PCIe)

bus. Since the Intel Xeon Phi coprocessor runs a Linux operating

system, a virtualized TCP/IP stack could be implemented over

the PCIe bus, allowing the user to access the coprocessor as a

network node. Thus, any user can connect to the coprocessor

The first generation Intel Xeon Phi product codenamed “Knights Corner”


through a secure shell and directly run individual jobs or submit

batch jobs to it. The coprocessor also supports heterogeneous

applications wherein a part of the application executes on the

host while a part executes on the coprocessor.
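One common way to express such a heterogeneous application with the Intel compilers of this generation is the offload pragma, sketched below; the array names, sizes, and the square-root workload are illustrative assumptions, and the in()/out() clauses describe the buffers copied over PCIe to and from the coprocessor.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    const int n = 1 << 20;
    float *a = (float *)malloc(n * sizeof(float));
    float *b = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i)
        a[i] = (float)i;

    // Run this region on the first coprocessor; the loop body itself is
    // ordinary C/OpenMP code, so the same source also builds for the host.
    #pragma offload target(mic:0) in(a : length(n)) out(b : length(n))
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            b[i] = sqrtf(a[i]);
    }

    printf("b[10] = %f\n", b[10]);
    free(a);
    free(b);
    return 0;
}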

Multiple Intel Xeon Phi coprocessors can be installed in a single

host system. Within a single system, the coprocessors can com-

municate with each other through the PCIe peer-to-peer inter-

connect without any intervention from the host. Similarly, the

coprocessors can also communicate through a network card

such as InfiniBand or Ethernet, without any intervention from

the host.

Intel’s initial development cluster named “Endeavor”, which is

composed of 140 nodes of prototype Intel Xeon Phi coproces-

sors, was ranked 150 in the TOP500 supercomputers in the world

based on its Linpack scores. Based on its power consumption,

this cluster was as good if not better than other heterogeneous

systems in the TOP500.

These results on unoptimized prototype systems demonstrate

that high levels of performance efficiency can be achieved on

compute-dense workloads without the need for a new program-

ming language or APIs.

Microarchitecture

The Intel Xeon Phi coprocessor is primarily composed of process-

ing cores, caches, memory controllers, PCIe client logic, and a

very high bandwidth, bidirectional ring interconnect. Each core

comes complete with a private L2 cache that is kept fully coher-

ent by a global-distributed tag directory. The memory controllers

and the PCIe client logic provide a direct interface to the GDDR5

memory on the coprocessor and the PCIe bus, respectively. All

these components are connected together by the ring intercon-

nect.

Each core in the Intel Xeon Phi coprocessor is designed to be

power efficient while providing a high throughput for highly

parallel workloads. A closer look reveals that the core uses a

short in-order pipeline and is capable of supporting 4 threads in hardware.

Linpack performance and power of Intel’s cluster

Intel Xeon Phi Coprocessor Core

It is estimated that the cost to support IA architecture

legacy is a mere 2% of the area costs of the core and is even less

at a full chip or product level. Thus the cost of bringing the Intel

Architecture legacy capability to the market is very marginal.

Vector Processing Unit

An important component of the Intel Xeon Phi coprocessor’s

core is its vector processing unit (VPU). The VPU features a nov-

el 512-bit SIMD instruction set, officially known as Intel Initial

Many Core Instructions (Intel IMCI). Thus, the VPU can execute 16

single-precision (SP) or 8 double-precision (DP) operations per cy-

cle. The VPU also supports Fused Multiply-Add (FMA) instructions

and hence can execute 32 SP or 16 DP floating point operations

per cycle. It also provides support for integers.

Vector units are very power efficient for HPC workloads. A single

operation can encode a great deal of work and does not incur en-

ergy costs associated with fetching, decoding, and retiring many

instructions. However, several improvements were required to

support such wide SIMD instructions. For example, a mask regis-

ter was added to the VPU to allow per lane predicated execution.

This helped in vectorizing short conditional branches, thereby

improving the overall software pipelining efficiency. The VPU

also supports gather and scatter instructions, which are simply

non-unit stride vector memory accesses, directly in hardware.


Thus for codes with sporadic or irregular access patterns, vector

scatter and gather instructions help in keeping the code vector-

ized.
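A small sketch of the kind of loop this enables is shown below; the function and array names are illustrative, and the simd pragma is simply a request to the compiler to vectorize. The indirect read x[idx[i]] maps to a vector gather, and the short conditional can be handled with per-lane mask registers instead of forcing the loop back to scalar code.

void sparse_axpy(int n, const int *idx, const float *x, float *y, float k)
{
    #pragma simd
    for (int i = 0; i < n; ++i) {
        float v = x[idx[i]];      // non-unit-stride access: vector gather
        if (v > 0.0f)             // short branch: per-lane predicated execution
            y[i] += k * v;
    }
}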

The VPU also features an Extended Math Unit (EMU) that can ex-

ecute transcendental operations such as reciprocal, square root,

and log, thereby allowing these operations to be executed in a

vector fashion with high bandwidth. The EMU operates by calcu-

lating polynomial approximations of these functions.

The Interconnect

The interconnect is implemented as a bidirectional ring. Each

direction is comprised of three independent rings. The first,

largest, and most expensive of these is the data block ring. The

data block ring is 64 bytes wide to support the high bandwidth

requirement due to the large number of cores. The address ring

is much smaller and is used to send read/write commands and

memory addresses. Finally, the smallest ring and the least expen-

sive ring is the acknowledgement ring, which sends flow control

and coherence messages.

When a core accesses its L2 cache and misses, an address re-

quest is sent on the address ring to the tag directories. The

memory addresses are uniformly distributed amongst the tag

directories on the ring to provide a smooth traffic characteristic

on the ring. If the requested data block is found in another core’s

L2 cache, a forwarding request is sent to that core’s L2 over the

address ring and the request block is subsequently forwarded

on the data block ring. If the requested data is not found in any

caches, a memory address is sent from the tag directory to the

memory controller.

The figure below shows the distribution of the memory con-

trollers on the bidirectional ring. The memory controllers are

symmetrically interleaved around the ring. There is an all-to-all

mapping from the tag directories to the memory controllers. The

addresses are evenly distributed across the memory controllers,

thereby eliminating hotspots and providing a uniform access

pattern which is essential for a good bandwidth response.

During a memory access, whenever an L2 cache miss occurs on a

core, the core generates an address request on the address ring

and queries the tag directories. If the data is not found in the

tag directories, the core generates another address request and

queries the memory for the data. Once the memory controller

fetches the data block from memory, it is returned back to the

core over the data ring. Thus during this process, one data block,

two address requests (and by protocol, two acknowledgement

messages) are transmitted over the rings. Since the data block

rings are the most expensive and are designed to support the re-


quired data bandwidth, we need to increase the number of less

expensive address and acknowledgement rings by a factor of

two to match the increased bandwidth requirement caused by

the higher number of requests on these rings.


Interleaved Memory Access

Interconnect: 2x AD/AK


Multi-Threaded Streams Triad

The figure below shows the core scaling results for the multi-

threaded streams triad workload. These results were generated

on a simulator for a prototype of the Intel Xeon Phi coprocessor

with only one address ring and one acknowledgement ring per

direction in its interconnect. The results indicate that in this case

the address and acknowledgement rings would become perfor-

mance bottlenecks and would exhibit poor scalability beyond 32

cores.

The production grade Intel Xeon Phi coprocessor uses two ad-

dress and two acknowledgement rings per direction and pro-

vides a good performance scaling up to 50 cores and beyond. It

is evident from the figure that the addition of the rings results in

an over 40% aggregate bandwidth improvement.

Streaming Stores

Streaming stores were another key innovation employed to further boost memory bandwidth. The pseudo code for the Streams Triad kernel is shown below:

Streams Triad

for (I=0; I<HUGE; I++)

A[I] = k*B[I] + C[I];

The stream triad kernel reads two arrays, B and C, from memory and writes a single array, A. Historically, a core reads a cache

line before it writes the addressed data. Hence there is an ad-

ditional read overhead associated with the write. A streaming

store instruction allows the cores to write an entire cache line

without reading it first. This reduces the number of bytes trans-

ferred per iteration from 256 bytes to 192 bytes.
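In source code, a streaming store is typically requested with a compiler hint rather than written by hand. The sketch below assumes the Intel compiler's nontemporal pragma and uses illustrative names; it tells the compiler that A may be written with streaming stores, so the destination cache lines need not be read before being overwritten.

void triad(float *A, const float *B, const float *C, float k, long n)
{
    #pragma vector nontemporal (A)
    for (long i = 0; i < n; ++i)
        A[i] = k * B[i] + C[i];
}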


The figure below shows the core scaling results of stream triads

workload with streaming stores. As is evident from the results,

streaming stores provide a 30% improvement over previous re-

sults. In total, by adding two rings per direction and imple-

menting streaming stores we are able to improve bandwidth by

more than a factor of 2 for multithreaded streams triad.

Without streaming stores: read A, B, C; write A – 256 bytes transferred to/from memory per iteration

With streaming stores: read B, C; write A – 192 bytes transferred to/from memory per iteration

Multi-threaded Triad – Saturation for 1 AD/AK Ring

Multi-threaded Triad – Benefit of Doubling AD/AK


Other Design Features

Other micro-architectural optimizations incorporated into the

Intel Xeon Phi coprocessor include a 64-entry second-level Trans-

lation Lookaside Buffer (TLB), simultaneous data cache loads and

stores, and 512KB L2 caches. Lastly, the Intel Xeon Phi coproces-

sor implements a 16 stream hardware prefetcher to improve

the cache hits and provide higher bandwidth. The figure below

shows the net performance improvements for the SPECfp 2006

benchmark suite for a single core, single thread runs. The results

indicate an average improvement of over 80% per cycle not in-

cluding frequency.


Multi-threaded Triad – with Streaming Stores

Per-Core ST Performance Improvement (per cycle)


Caches

The Intel MIC architecture invests more heavily in L1 and L2 cach-

es compared to GPU architectures. The Intel Xeon Phi coproces-

sor implements a leading-edge, very high bandwidth memory

subsystem. Each core is equipped with a 32KB L1 instruction

cache and 32KB L1 data cache and a 512KB unified L2 cache.

These caches are fully coherent and implement the x86 memory

order model. The L1 and L2 caches provide an aggregate band-

width that is approximately 15 and 7 times, respectively, faster

compared to the aggregate memory bandwidth. Hence, effective

utilization of the caches is key to achieving peak performance on

the Intel Xeon Phi coprocessor. In addition to improving band-

width, the caches are also more energy efficient for supplying

data to the cores than memory. The figure below shows the en-

ergy consumed per byte of data transfered from the memory,

and L1 and L2 caches. In the exascale compute era, caches will

play a crucial role in achieving real performance under restric-

tive power constraints.

Stencils

Stencils are common in physics simulations and are classic

examples of workloads which show a large performance gain

through efficient use of caches.

Stencils are typically employed in simulation of physical sys-

tems to study the behavior of the system over time. When these

workloads are not programmed to be cache-blocked, they will

be bound by memory bandwidth. Cache blocking promises sub-

stantial performance gains given the increased bandwidth and

energy efficiency of the caches compared to memory. Cache

blocking improves performance by blocking the physical struc-

ture or the physical system such that the blocked data fits well

into a core’s L1 and/or L2 caches. For example, during each time-

step, the same core can process the data which is already resi-

dent in the L2 cache from the last time step, and hence does not

need to be fetched from the memory, thereby improving perfor-

mance. Additionally, the cache coherence further aids the stencil

operation by automatically fetching the updated data from the

nearest neighboring blocks which are resident in the L2 caches

of other cores. Thus, stencils clearly demonstrate the benefits of

efficient cache utilization and coherence in HPC workloads.
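A minimal cache-blocked 2D stencil in C is sketched below; the tile sizes and array layout are illustrative assumptions. The grid is swept in tiles chosen so that a tile plus its halo fits in a core's L2 cache, so data loaded for one tile is reused instead of being refetched from memory.

#define BX 64
#define BY 64

void stencil_step(int nx, int ny, const float *in, float *out)
{
    for (int jb = 1; jb < ny - 1; jb += BY)
        for (int ib = 1; ib < nx - 1; ib += BX)
            // Process one BX x BY tile that fits in the L2 cache.
            for (int j = jb; j < jb + BY && j < ny - 1; ++j)
                for (int i = ib; i < ib + BX && i < nx - 1; ++i)
                    out[j * nx + i] = 0.25f * (in[j * nx + (i - 1)] + in[j * nx + (i + 1)]
                                             + in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
}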

Stencils Example

Caches – For or Against?

Power Management

Intel Xeon Phi coprocessors are not suitable for all workloads.


In some cases, it is beneficial to run the workloads only on the

host. In such situations where the coprocessor is not being used,

it is necessary to put the coprocessor in a power-saving mode.

The figure above shows all the components of the Intel Xeon Phi

coprocessor in a running state. To conserve power, as soon as all

the four threads on a core are halted, the clock to the core is gat-

ed. Once the clock has been gated for some programmable time,

the core power gates itself, thereby eliminating any leakage.

At any point, any number of the cores can be powered down or

powered up. Additionally, when all the cores are power gated

and the uncore detects no activity, the tag directories, the interconnect, L2 caches, and the memory controllers are clock gated.

Power Management: All On and Running

Clock Gate Core

At this point, the host driver can put the coprocessor into a deep-

er sleep or an idle state, wherein all the uncore is power gated,

the GDDR is put into a self-refresh mode, the PCIe logic is put in

a wait state for a wakeup and the GDDR-IO is consuming very lit-

tle power. These power management techniques help conserve

power and make Intel Xeon Phi coprocessor an excellent candi-

date for data centers.


Power Gate Core

Package Auto C3



InfiniBand – High-Speed Interconnects


InfiniBand (IB) is an efficient I/O technology that provides high-

speed data transfers and ultra-low latencies for computing and

storage over a highly reliable and scalable single fabric. The

InfiniBand industry standard ecosystem creates cost effective

hardware and software solutions that easily scale from genera-

tion to generation.

InfiniBand is a high-bandwidth, low-latency network interconnect

solution that has grown tremendous market share in the High

Performance Computing (HPC) cluster community. InfiniBand was

designed to take the place of today’s data center networking tech-

nology. In the late 1990s, a group of next generation I/O architects

formed an open, community driven network technology to provide

scalability and stability based on successes from other network

designs. Today, InfiniBand is a popular and widely used I/O fabric

among customers within the Top500 supercomputers: major Uni-

versities and Labs; Life Sciences; Biomedical; Oil and Gas (Seismic,

Reservoir, Modeling Applications); Computer Aided Design and En-

gineering; Enterprise Oracle; and Financial Applications.



InfiniBand was designed to meet the evolving needs of the

high performance computing market. Computational science

depends on InfiniBand to deliver:

ʎ High Bandwidth: Supports host connectivity of 10Gbps

with Single Data Rate (SDR), 20Gbps with Double Data

Rate (DDR), and 40Gbps with Quad Data Rate (QDR),

all while offering an 80Gbps switch for link switching

ʎ Low Latency: Accelerates the performance of HPC and enter-

prise computing applications by providing ultra-low latencies

ʎ Superior Cluster Scaling: Point-to-point latency re-

mains low as node and core counts scale – 1.2 µs. High-

est real message per adapter: each PCIe x16 adapter

drives 26 million messages per second. Excellent commu-

nications/computation overlap among nodes in a cluster

ʎ High Efficiency: InfiniBand allows reliable protocols like Re-

mote Direct Memory Access (RDMA) communication to occur

between interconnected hosts, thereby increasing efficiency

ʎ Fabric Consolidation and Energy Savings: InfiniBand can

consolidate networking, clustering, and storage data over

a single fabric, which significantly lowers overall power,

real estate, and management overhead in data centers. En-

hanced Quality of Service (QoS) capabilities support run-

ning and managing multiple workloads and traffic classes

ʎ Data Integrity and Reliability: InfiniBand provides the high-

est levels of data integrity by performing Cyclic Redundancy

Checks (CRCs) at each fabric hop and end-to-end across the

fabric to avoid data corruption. To meet the needs of mission

critical applications and high levels of availability, InfiniBand

provides fully redundant and lossless I/O fabrics with auto-

matic failover path and link layer multi-paths


Intel is a global leader and technology innovator in high performance networking, including adapters, switches and ASICs.


Components of the InfiniBand Fabric

InfiniBand is a point-to-point, switched I/O fabric architecture.

Point-to-point means that each communication link extends be-

tween only two devices. Both devices at each end of a link have full

and exclusive access to the communication path. To go beyond a

point and traverse the network, switches come into play. By adding

switches, multiple points can be interconnected to create a fabric.

As more switches are added to a network, aggregated bandwidth

of the fabric increases. By adding multiple paths between devices,

switches also provide a greater level of redundancy.

The InfiniBand fabric has four primary components, which are

explained in the following sections:

ʎ Host Channel Adapter

ʎ Target Channel Adapter

ʎ Switch

ʎ Subnet Manager

Host Channel Adapter

This adapter is an interface that resides within a server and

communicates directly with the server’s memory and proces-

sor as well as the InfiniBand Architecture (IBA) fabric. The adapt-

er guarantees delivery of data, performs advanced memory ac-

cess, and can recover from transmission errors. Host channel

adapters can communicate with a target channel adapter or a

switch. A host channel adapter can be a standalone InfiniBand

card or it can be integrated on a system motherboard. Intel Tru-

eScale InfiniBand host channel adapters outperform the com-

petition with the industry’s highest message rate. Combined

with the lowest MPI latency and highest effective bandwidth,

Intel host channel adapters enable MPI and TCP applications

to scale to thousands of nodes with unprecedented price per-

formance.

Target Channel Adapter

This adapter enables I/O devices, such as disk or tape storage, to

be located within the network independent of a host computer.

Target channel adapters include an I/O controller that is specific

to its particular device’s protocol (for example, SCSI, Fibre Chan-

nel (FC), or Ethernet). Target channel adapters can communicate

with a host channel adapter or a switch.

Switch

An InfiniBand switch allows many host channel adapters and

target channel adapters to connect to it and handles network

traffic. The switch looks at the “local route header” on each

packet of data that passes through it and forwards it to the

appropriate location. The switch is a critical component of

the InfiniBand implementation that offers higher availability,

higher aggregate bandwidth, load balancing, data mirroring,

and much more. A group of switches is referred to as a Fabric. If

a host computer is down, the switch still continues to operate.

The switch also frees up servers and other devices by handling

network traffic.

Figure 1. Typical InfiniBand High Performance Cluster: host channel adapters and target channel adapters connected through an InfiniBand switch with subnet manager.


Top 10 Reasons to Use Intel TrueScale InfiniBand

1. Predictable Low Latency Under Load – Less Than 1.0 μs.

TrueScale is designed to make the most of multi-core nodes

by providing ultra-low latency and significant message rate

scalability. As additional compute resources are added to

the Intel TrueScale InfiniBand solution, latency and message

rates scale linearly. HPC applications can be scaled without

having to worry about diminished utilization of compute re-

sources.

2. Quad Data Rate (QDR) Performance. The Intel 12000 switch

family runs at lane speeds of 10Gbps, providing a full bi-

sectional bandwidth of 40Gbps (QDR). In addition, the 12000

switch has the unique capability of riding through periods

of congestion with features such as deterministically low la-

tency. The Intel family of TrueScale 12000 products offers the

lowest latency of any IB switch and high performance trans-

fers with the industry’s most robust signal integrity.

3. Flexible QoS Maximizes Bandwidth Use. The Intel 12800

advanced design is based on an architecture that provides

comprehensive virtual fabric partitioning capabilities that

enable the IB fabric to support the evolving requirements of

an organization.

4. Unmatched Scalability – 18 to 864 Ports per Switch. Intel of-

fers the broadest portfolio (five chassis and two edge switch-

es) from 18 to 864 TrueScale InfiniBand ports, allowing cus-

tomers to buy switches that match their connectivity, space,

and power requirements.


“Effective fabric management has become the

most important factor in maximizing performance

in an HPC cluster. With IFS 7.0, Intel has addressed

all of the major fabric management issues in a prod-

uct that in many ways goes beyond what others are

offering.”

Michael Wirth, HPC Presales Specialist


5. Highly Reliable and Available. Reliability, Availability, and Serviceability (RAS) that is proven in the most demanding Top500

and Enterprise environments is designed into Intel’s 12000

series with hot swappable components, redundant com-

ponents, customer replaceable units, and non-disruptive

code load.

6. Lowest Per-Port Power and Cooling Requirements. The TrueScale

12000 offers the lowest power consumption and the highest

port density – 864 total TrueScale InfiniBand ports in a single

chassis makes it unmatched in the industry. This results in

delivering the lowest power per port for a director switch (7.8

watts per port) and the lowest power per port for an edge

switch (3.3 watts per port).

7. Easy to Install and Manage. Intel installation, configuration,

and monitoring Wizards reduce time-to-ready. The Intel In-

finiBand Fabric Suite (IFS) assists in diagnosing problems in

the fabric. Non-disruptive firmware upgrades provide maxi-

mum availability and operational simplicity.

8. Protects Existing InfiniBand Investments. Seamless vir-

tual I/O integration at the Operating System (OS) and appli-

cation levels that match standard network interface cards

and adapter semantics with no OS or app changes – they

just work. Additionally, the TrueScale family of products is

compliant with the InfiniBand Trade Association (IBTA) open

specification, so Intel products inter-operate with any IBTA-

compliant InfiniBand vendor. Being IBTA-compliant makes

the Intel 12000 family of switches ideal for network consoli-

dation; sharing and scaling I/O pools across servers; and to

pool and share I/O resources between servers.

9. Modular Configuration Flexibility. The Intel 12000 series

switches offer configuration and scalability flexibility that

meets the requirements of either a high-density or high-

performance compute grid by offering port modules that

address both needs. Units can be populated with Ultra High

Density (UHD) leafs for maximum connectivity or Ultra High

Performance (UHP) leafs for maximum performance. The

high scalability 24-port leaf modules support configurations

between 18 and 864 ports, providing the right size to start

and the capability to grow as your grid grows.

10. Option to Gateway to Ethernet and Fibre Channel Net-

works. Intel offers multiple options to enable hosts on In-

finiBand fabrics to transparently access Fibre Channel based

storage area networks (SANs) or Ethernet based local area

networks (LANs).

Intel TrueScale architecture and the resulting family of products

delivers the promise of InfiniBand to the enterprise today.



Intel MPI Library 4.0 Performance

Introduction

Intel’s latest MPI release, Intel MPI 4.0, is now optimized to work

with Intel’s TrueScale InfiniBand adapter. Intel MPI 4.0 can now

directly call Intel’s TrueScale Performance Scale Messaging (PSM)

interface. The PSM interface is designed to optimize MPI applica-

tion performance. This means that organizations will be able to

achieve a significant performance boost with a combination of

Intel MPI 4.0 and Intel TrueScale InfiniBand.

Solution

Intel tuned and optimized its latest MPI release – Intel MPI Library 4.0 – to improve performance

when used with Intel TrueScale InfiniBand. With MPI Library 4.0,

applications can make full use of High Performance Computing

(HPC) hardware, improving the overall performance of the appli-

cations on the clusters.

Intel MPI Library 4.0 Performance

Intel MPI Library 4.0 uses the high performance MPI-2 specifica-

tion on multiple fabrics, which results in better performance for

applications on Intel architecture-based clusters. This library en-

ables quick delivery of maximum end user performance, even if

there are changes or upgrades to new interconnects – without requiring major changes to the software or operating environment.

Figure 3. Intel MPI Library 3.1 versus 4.0 with Intel TrueScale InfiniBand: elapsed time at 16, 32, and 64 cores, showing roughly a 35 percent improvement with MPI 4.0 (lower is better).

This high-performance, message-passing interface library is used to develop applications that can run on multiple cluster fabric in-

terconnects chosen by the user at runtime.

Testing

Intel used the Parallel Unstructured Maritime Aerodynamics

(PUMA) Benchmark program to test the performance of Intel MPI

Library versions 4.0 and 3.1 with Intel TrueScale InfiniBand. The

program analyses internal and external non-reacting compress-

ible flows over arbitrarily complex 3D geometries. PUMA is writ-

ten in ANSI C, and uses MPI libraries for message passing.

Results

The test showed that MPI performance can improve by more

than 35 percent using Intel MPI Library 4.0 with Intel TrueScale

InfiniBand, compared to using Intel MPI Library 3.1.

Intel TrueScale InfiniBand Adapters

Intel TrueScale InfiniBand Adapters offer scalable performance,

reliability, low power, and superior application performance.

These adapters ensure superior performance of HPC applica-

tions by delivering the highest message rate for multicore

compute nodes, the lowest scalable latency, large node count

clusters, the highest overall bandwidth on PCI Express Gen1 plat-

forms, and superior power efficiency.

Purpose: Compare the performance of Intel MPI Library 4.0 and 3.1 with Intel InfiniBand

Benchmark: PUMA Flow

Cluster: Intel/IBM iDataPlex™ cluster

System Configuration: The NETtrack IBM Q-Blue Cluster/iDataPlex nodes were configured as follows:

Processor: Intel Xeon® CPU X5570 @ 2.93 GHz

Memory: 24GB (6x4GB) @ 1333MHz (DDR3)

QDR InfiniBand Switch: Intel Model 12300, firmware version 6.0.0.0.33

QDR InfiniBand Host Channel Adapter: Intel QLE7340, software stack 5.1.0.0.49

Operating System: Red Hat® Enterprise Linux® Server release 5.3

Kernel: 2.6.18-128.el5

File System: IFS Mounted



The Intel TrueScale 12000 family of Multi-Protocol Fabric Directors

is the most highly integrated cluster computing interconnect so-

lution available. An ideal solution for HPC, database clustering, and

grid utility computing applications, the 12000 Fabric Directors

maximize cluster and grid computing interconnect performance

while simplifying and reducing the cost of operating a data center.

Subnet Manager

The subnet manager is an application responsible for config-

uring the local subnet and ensuring its continued operation.

Configuration responsibilities include managing switch setup

and reconfiguring the subnet if a link goes down or a new one

is added.

How High Performance Computing Helps Vertical Applications

Enterprises that want to do high performance computing must

balance the following scalability metrics as the core size and the

number of cores per node increase:

ʎ Latency and Message Rate Scalability: must allow near line-

ar growth in productivity as the number of compute cores is

scaled.

ʎ Power and Cooling Efficiency: as the cluster is scaled, power

and cooling requirements must not become major concerns

in today’s world of energy shortages and high-cost energy.

InfiniBand offers the promise of low latency, high bandwidth,

and unmatched scalability demanded by high performance

computing applications. IB adapters and switches that perform

well on these key metrics allow enterprises to meet their high-

performance and MPI needs with optimal efficiency. The IB so-

lution allows enterprises to quickly achieve their compute and

business goals.



Vertical Market | Application Segment | InfiniBand Value

Oil & Gas | Mix of Independent Service Provider (ISP) and home grown codes: reservoir modeling | Low latency, high bandwidth

Computer Aided Engineering (CAE) | Mostly Independent Software Vendor (ISV) codes: crash, air flow, and fluid flow simulations | High message rate, low latency, scalability

Government | Home grown codes: labs, defense, weather, and a wide range of apps | High message rate, low latency, scalability, high bandwidth

Education | Home grown and open source codes: a wide range of apps | High message rate, low latency, scalability, high bandwidth

Financial | Mix of ISP and home grown codes: market simulation and trading floor | High performance IP, scalability, high bandwidth

Life and Materials Science | Mostly ISV codes: molecular simulation, computational chemistry, and biology apps | Low latency, high message rates


Intel Fabric Suite 7

Maximizing Investments in High Performance Computing

Around the world and across all industries, high performance

computing (HPC) is used to solve today’s most demanding com-

puting problems. As today’s high performance computing chal-

lenges grow in complexity and importance, it is vital that the

software tools used to install, configure, optimize, and manage

HPC fabrics also grow more powerful. Today’s HPC workloads are

too large and complex to be managed using software tools that

are not focused on the unique needs of HPC.

Designed specifically for HPC, Intel Fabric Suite 7 is a complete

fabric management solution that maximizes the return on HPC

investments, by allowing users to achieve the highest levels

of performance, efficiency, and ease of management from In-

finiBand-connected HPC clusters of any size.

Highlights

ʎ Intel FastFabric and Fabric Viewer integration with leading

third-party HPC cluster management suites

ʎ Simple but powerful Fabric Viewer dashboard for monitoring

fabric performance

ʎ Intel Fabric Manager integration with leading HPC workload-management suites that combines virtual fabrics and compute Quality of Service (QoS) levels to maximize fabric efficiency and application performance

ʎ Smart, powerful software tools that make Intel TrueScale

Fabric solutions easy to install, configure, verify, optimize,

and manage

ʎ Congestion control architecture that reduces the effects of

fabric congestion caused by low credit availability that can

result in head-of-line blocking

ʎ Powerful fabric routing methods—including adaptive and


dispersive routing—that optimize traffic flows to avoid or

eliminate congestion, maximizing fabric throughput

ʎ Intel Fabric Manager’s advanced design ensures fabrics of

all sizes and topologies—from fat-tree to mesh and torus—

scale to support the most demanding HPC workloads

Superior Fabric Performance and Simplified Management are

Vital for HPC

As HPC clusters scale to take advantage of multi-core and GPU-

accelerated nodes attached to ever-larger and more complex

fabrics, simple but powerful management tools are vital for

maximizing return on HPC investments.

Intel Fabric Suite 7 provides the performance and management

tools for today’s demanding HPC cluster environments. As clus-

ters grow larger, management functions, from installation and

configuration to fabric verification and optimization, are vital

in ensuring that the interconnect fabric can support growing

workloads. Besides fabric deployment and monitoring, IFS opti-

mizes the performance of message passing applications—from

advanced routing algorithms to quality of service (QoS)—ensuring that all HPC resources are optimally utilized.

Scalable Fabric Performance

ʎ Purpose-built for HPC, IFS is designed to make HPC clusters

faster, easier, and simpler to deploy, manage, and optimize

ʎ The Intel TrueScale Fabric on-load host architecture exploits

the processing power of today’s faster multi-core processors

for superior application scaling and performance

ʎ Policy-driven vFabric virtual fabrics optimize HPC resource

utilization by prioritizing and isolating compute and storage

traffic flows

ʎ Advanced fabric routing options—including adaptive and

dispersive routing—that distribute traffic across all poten-

tial links, improving the overall fabric performance and lower-

ing congestion to improve throughput and latency

Scalable Fabric Intelligence

ʎ Routing intelligence scales linearly as Intel TrueScale Fabric

switches are added to the fabric

ʎ Intel Fabric Manager can initialize fabrics having several

thousand nodes within seconds

ʎ Advanced and optimized routing algorithms overcome the

limitations of typical subnet managers

ʎ Smart management tools quickly detect and respond to fab-

ric changes, including isolating and correcting problems that

can result in unstable fabrics

Standards-based Foundation

Intel TrueScale Fabric solutions are compliant with all industry

software and hardware standards. Intel Fabric Suite (IFS) rede-

fines IBTA management by coupling powerful management

tools with intuitive user interfaces. InfiniBand fabrics built using

IFS deliver the highest levels of fabric performance, efficiency,

and management simplicity, allowing users to realize the full

benefits from their HPC investments.

Major components of Intel Fabric Suite 7 include:

ʎ FastFabric Toolset

ʎ Fabric Viewer

ʎ Fabric Manager

Intel FastFabric Toolset

Ensures rapid, error-free installation and configuration of Intel

TrueScale Fabric switch, host, and management software tools.

Guided by an intuitive interface, users can easily install, config-

ure, validate, and optimize HPC fabrics.


Key features include:

ʎ Automated host software installation and configuration

ʎ Powerful fabric deployment, verification, analysis, and report-

ing tools for measuring connectivity, latency, and bandwidth

ʎ Automated chassis and switch firmware installation and up-

date

ʎ Fabric route and error analysis tools

ʎ Benchmarking and tuning tools

ʎ Easy-to-use fabric health checking tools

Intel Fabric Viewer

A key component that provides an intuitive, Java-based user

interface with a “topology-down” view for fabric status and di-

agnostics, with the ability to drill down to the device layer to

identify and help correct errors. IFS 7 includes a fabric dashboard,

a simple and intuitive user interface that presents vital fabric

performance statistics.

Key features include:

ʎ Bandwidth and performance monitoring

ʎ Device and device group level displays

ʎ Definition of cluster-specific node groups so that displays

can be oriented toward end-user node types

ʎ Support for user-defined hierarchy

ʎ Easy-to-use virtual fabrics interface

Intel Fabric Manager

Provides comprehensive control of administrative functions us-

ing a commercial-grade subnet manager. With advanced routing

algorithms, powerful diagnostic tools, and full subnet manager

failover, Fabric Manager simplifies subnet, fabric, and individual

component management, making even the largest fabrics easy

to deploy and optimize.



Key features include:

ʎ Designed and optimized for large fabric support

ʎ Integrated with both adaptive and dispersive routing systems

ʎ Congestion control architecture (CCA)

ʎ Robust failover of subnet management

ʎ Path/route management

ʎ Fabric/chassis management

ʎ Fabric initialization in seconds, even for very large fabrics

ʎ Performance and fabric error monitoring

Adaptive Routing

While most HPC fabrics are configured to have multiple paths

between switches, standard InfiniBand switches may not be ca-

pable of taking advantage of them to reduce congestion. Adap-

tive routing monitors the performance of each possible path,

and automatically chooses the least congested route to the

destination node. Unlike other implementations that rely on a

purely subnet manager-based approach, the intelligent path selection capabilities within Fabric Manager, a key part of IFS and Intel 12000 series switches, scale as your fabric grows larger and more complex.

Key adaptive routing features include:

ʎ Highly scalable—adaptive routing intelligence scales as the fabric grows

ʎ Hundreds of real-time adjustments per second per switch

ʎ Topology awareness through Intel Fabric Manager

ʎ Supports all InfiniBand Trade Association* (IBTA*)-compliant

adapters

Dispersive Routing

One of the critical roles of fabric management is the initialization and configuration of routes through the fabric between each pair of nodes. Intel Fabric Manager supports a variety

of routing methods, including defining alternate routes that

disperse traffic flows for redundancy, performance, and load

balancing. Instead of sending all packets to a destination on a

single path, Intel dispersive routing distributes traffic across all

possible paths. Once received, packets are reassembled in their

proper order for rapid, efficient processing. By leveraging the

entire fabric to deliver maximum communications performance

for all jobs, dispersive routing ensures optimal fabric efficiency.

Key dispersive routing features include:

ʎ Fabric routing algorithms that provide the widest separation

of paths possible

ʎ Fabric “hotspot” reductions to avoid fabric congestion

ʎ Balances traffic across all potential routes

ʎ May be combined with vFabrics and adaptive routing

ʎ Very low latency for small messages and time sensitive con-

trol protocols

ʎ Messages can be spread across multiple paths—Intel Perfor-

mance Scaled Messaging (PSM) ensures messages are reas-

sembled correctly

ʎ Supports all leading MPI libraries

Mesh and Torus Topology Support

Fat-tree configurations are the most common topology used

in HPC cluster environments today, but other topologies are

gaining broader use as organizations try to create increasingly

larger, more cost-effective fabrics. Intel is leading the way with

full support of emerging mesh and torus topologies that can

help reduce networking costs as clusters scale to thousands of

nodes. IFS has been enhanced to support these emerging op-

tions—from failure handling that maximizes performance, to

even higher reliability for complex networking environments.



transtec HPC solutions excel through their easy management

and high usability, while maintaining high performance and

quality throughout the whole lifetime of the system. As clusters

scale, issues like congestion mitigation and Quality-of-Service

can make a big difference in whether the fabric performs up to

its full potential.

With the intelligent choice of Intel InfiniBand products, transtec

remains true to combining the best components together to pro-

vide the full best-of-breed solution stack to the customer.

transtec HPC engineering experts are always available to fine-

tune customers’ HPC cluster systems and InfiniBand fabrics to

get the maximum performance while at the same time providing them with an easy-to-manage and easy-to-use HPC solution.


Parstream – Big Data Analytics


The amount of digital data resources has exploded in the past decade. The 2010 IDC Digital Universe Study shows that, compared to 2008, the digital universe grew by 62% to reach 800,000 petabytes (0.8 zettabytes) in 2009. By the year 2020, IDC predicts the amount of data to be 44 times as big as it was in 2009, thus reaching approximately 35 zettabytes. As a result of this exploding amount of data, the demand for applications used for searching and analyzing large datasets will grow significantly in all industries.

However, today’s applications require large and expensive infrastructures which consume vast amounts of energy and resources, creating challenges in the analysis of mass data. In the future, increasingly more terabyte-scale datasets will be used for research, analysis and diagnosis, bringing about further difficulties.


What is Big Data

The amount of generated and stored data is growing exponen-

tially in unprecedented proportions. According to Eric Schmidt,

former CEO of Google, the amount of data that is now produced

in 2 days is as large as the total amount since the beginning of

civilization until 2003.

Large amounts of data are produced in many areas of business

and private life. For example, there are 200 million Twitter mes-

sages generated – daily. Mobile devices, sensor networks, ma-

chine-to-machine communications, RFID readers, software logs,

cameras and microphones generate a continuous stream of data amounting to multiple terabytes per day.

For organizations Big Data means a great opportunity and a big

challenge. Companies that know how to manage big data will

create more comprehensive predictions for the market and busi-

ness. Decision-makers can quickly and safely respond to developments in dynamic and volatile markets.

It is important that companies know how to use Big Data to

improve their competitive positions, increase productivity and

develop new business models. According to Roger Magoulas of

O’Reilly the ability to analyze big data has developed into a core

competence in the information age and can provide an enormous competitive advantage for companies.

[Figure: Big Data positioned by data volume (gigabytes to petabytes) and lag time (real time and milliseconds up to more than 10 minutes): OLTP and reporting on the operational side, in-memory databases and complex event processing, interactive analytics, and batch analytics (MapReduce) on Big Data volumes.]

Big Data analytics has a huge impact on traditional data process-

ing and causes business and IT managers to face new problems.

The latest IT technologies and processes are poorly suited to

analyze very large amounts of data, according to a recent study

by Gartner.

IT departments often try to meet the increasing flood of information by traditional means, keeping the same infrastructure approach and just purchasing more hardware. As data volumes grow faster than the processing capabilities of additional servers or better processors, new technological approaches are

needed.

Shortcomings of current database architectures

Current databases are not engineered for mass data, but rather

for small data volumes up to 100 million records. Today’s data-

bases use outdated, 20-30 year old architectures and the data

and index structures are not constructed for efficient analysis

of such data volumes. And, because these databases employ se-

quential algorithms, they are not able to exploit the potential of

parallel hardware.

Algorithmic procedures for indexing large amounts of data

have seen relatively few innovations in the past years and

decades. Due to the ever-growing amounts of data to be pro-

cessed, there are rising challenges that traditional database

systems cannot cope with. Currently, new and innovative ap-

proaches to solve these problems are developed and evaluat-

ed. However, some of these approaches seem to be heading in

the wrong direction.

ParStream – Big Data Analytics Platform

ParStream offers a revolutionary approach to high-performance

data analysis. It addresses the problems arising from rapidly in-

creasing data volumes in modern business, as well as scientific

application scenarios.

ParStream

ʎ has a unique index technology

ʎ comes with efficient parallel processing

ʎ is ultra-fast, even with billions of records

ʎ scales linearly up to petabytes

ʎ offers real-time analysis and continuous import

ʎ is cost and energy efficient

ParStream uses a unique indexing technology which enables

efficient multi-threaded processing on parallel architectures.

Hardware and energy costs are substantially reduced while

overall performance is optimized to index and execute queries.

Close-to-real-time analysis is obtained through simultaneous

importing, indexing and querying of data. The developers of Par-

Stream strive to continuously improve on the product by work-

ing closely with universities, research partners, and clients.

ParStream is the right choice when...

ʎ extreme amounts of data are to be searched and filtered

ʎ filters use many columns in various combinations (ad-hoc

queries)

[Figure: ParStream platform components – SQL API / JDBC / ODBC, real-time analytics engine, High Performance Compressed Index (HPCI), fast hybrid storage (columnar/row), high-speed loader with low latency, C++ UDF API, multi-dimensional partitioning, shared-nothing architecture, in-memory and disk technology, massively parallel processing (MPP).]


Technology

The technical architecture of the ParStream database can be

roughly divided into the following areas: During the extract,

transform and load (ETL) process, the supplied data is read and

processed so that it can be forwarded to the actual loader.

In the next step, ParStream generates and stores necessary index

structures and data representations. Input data can be stored in

row and/or column-oriented data stores. Here configurable opti-

mizations regarding sorting, compression and partitioning are ac-

counted for.

Once data is loaded into the server, the user can pose queries

over a standard interface (SQL) to the database engine. A parser

then interprets the queries and a query executor executes it in

parallel.

The optimizer is used to generate the initial query plan starting

from the declarative request. This initial execution plan is used

to start processing. However, this initial plan can be altered dy-

namically during runtime, e.g., changing the level of parallelism.


[Figure: ParStream dataflow – data sources feed ETL processes and a high-speed loader for import and indexing into row- and column-oriented record stores and an index store, with compression, partitioning, caching and parallelization across multiple servers, CPUs and GPUs; the query logic answers incoming queries and returns results.]


The query executor also makes optimal use of the available infra-

structure.

All available parts of the query exploit bitmap information where

possible. Logical filter conditions and aggregations can thus be

calculated and forwarded to the client by highly efficient bitmap

operations.

Index Structure

ParStream’s innovative index structure enables efficient parallel

processing, allowing for unsurpassed levels of performance. Key

features of the underlying database engine include:

ʎ A column-oriented bitmap index using a unique data struc-

ture.

ʎ A data structure that allows processing in compressed form.

ʎ No need for index decompression as in other database and

indexing systems.

Use of ParStream and its highly efficient query engine yields sig-

nificant advantages over regular database systems. ParStream

offers:

ʎ faster index operations,

ʎ shorter response times,

ʎ substantially less CPU-load,

ʎ efficient parallel processing of search and analysis,

ʎ capability of adding new data to the index during query ex-

ecution, close-to-real-time importing and querying of data.
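As a generic illustration only (not ParStream's actual HPCI format), the following C++ sketch shows why bitmap indexes combine and parallelize so well: each filter condition is a bit vector with one bit per row, a conjunction of filters reduces to a word-wise AND, and counting matches is a population count. The names count_matches, filter_a and filter_b are purely illustrative.

// Generic bitmap-index sketch: one bit per row and per predicate;
// combining two filters is a bitwise AND, counting hits is a popcount.
#include <bitset>
#include <cstdint>
#include <iostream>
#include <vector>

std::size_t count_matches(const std::vector<std::uint64_t>& filter_a,
                          const std::vector<std::uint64_t>& filter_b) {
    std::size_t hits = 0;
    for (std::size_t i = 0; i < filter_a.size(); ++i) {
        std::uint64_t word = filter_a[i] & filter_b[i];   // combine both predicates
        hits += std::bitset<64>(word).count();            // count matching rows
    }
    return hits;
}

int main() {
    // Two toy bitmaps covering 128 rows (2 x 64-bit words each),
    // e.g. "country = DE" and "status = active".
    std::vector<std::uint64_t> filter_a = {0xF0F0F0F0F0F0F0F0ULL, 0x00000000FFFFFFFFULL};
    std::vector<std::uint64_t> filter_b = {0xFF00FF00FF00FF00ULL, 0x0000FFFF0000FFFFULL};
    std::cout << "matching rows: " << count_matches(filter_a, filter_b) << "\n";
    return 0;
}

In ParStream the bitmaps are additionally processed in compressed form and the work is distributed over partitions and cores, but the basic combining step remains a cheap bitwise operation.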

Optimized Use of Resources

Inside one single processing node, as well as inside single data

partitions, query processing can be parallelized to achieve mini-

mal response time, by using all available resources (CPU, I/O-

Channels).

Reliability

The reliability of a ParStream database is ensured through several product features, and multiple-server environments are fully supported. The ParStream database allows the use of an arbitrary number of servers. Each server can be configured to replicate the entire data store or only parts of it.

Several load-balancing algorithms will pass a complete query

to one of the servers or parts of one query to several, even re-

dundant servers. The query can be sent to any of the cluster

members, allowing customers to use a load-balancing and fail-

over configuration according to their own needs, e.g. round robin

query distribution.

Interfaces

The ParStream server can be queried using any of the following

three methods:

At a high level, we provide a JDBC driver which enables the imple-

mentation of a cross platform front end. This allows ParStream

to be used with any application capable of using a standard

JDBC interface. For example, a Java applet can be built. This ap-

plet can be used to query the ParStream server from within any

web browser.

At mid-level, queries can be submitted as SQL code. Additionally,

other descriptive query implementations are available, e.g., a

JSON format. This allows the user to define queries, which can-

not easily be expressed in standard SQL. The results are then

sent back to the client as CSV text or in binary format.

At a low level, ParStream’s C++ API and base classes can be used

to write user defined query nodes that are stored in dynamic li-

braries. A developer can thus integrate his own query nodes into

a tree description and register it dynamically into the ParStream


server. Such user-defined queries can be executed via a TCP/IP

connection and are also integrated into ParStream’s parallel ex-

ecution framework. This interface layer allows the formulation

of queries that cannot be expressed using SQL.

Data Import

One of ParStream’s strengths is its ability to import CSV files at

unprecedented speeds. This is based on two factors:

First of all, the index is much faster in adding data than indexes

used in most other databases. Secondly, the importer partitions

and sorts the data in parallel, which exploits the capabilities of

today’s multi-core processors.

Additionally, the import process may run outside the query pro-

cess enabling the user to ship the finished data and index files

to the servers. In this way, the import’s CPU and I/O load can be

separated and moved to different machines.

Parstream

Big Data Analytics

[Figure: ParStream interfaces and integration – JDBC/ODBC with SQL 2003, parallel import, CSV and binary import, ETL support, socket and C++ API, SOA compatibility; front ends, applications and tools sit on top of real-time Big Data analytics alongside MapReduce, RDBMS and raw data.]

“As Big Data Analytics is becoming more and more

important, new and innovative technology arises,

and we are happy that with Parstream, our cus-

tomers may experience a performance boost with

respect to their analytics work by a factor of up to

1,000.”

Matthias Groß, HPC Sales Specialist


Another remarkable feature of ParStream is its ability to

optionally operate on a CSV record store instead of column

stores. This accelerates the import because only the indexes

need to be written. Plus, since one usually wants to save the

original CSV files, no additional disk space is wasted.

Supported Platforms

Currently ParStream is available on a number of Linux Distri-

butions including RedHat Enterprise Linux, Novell Enterprise

Linux and Debian Lenny running on x86_64 CPUs. On request,

ParStream will be ported to other platforms.

Besides the detailed simulation of realistic processes, data analytics is one of the classic applications of High Performance Computing. What is new is that, with the amount of data increasing such that we now speak of Big Data Analytics, and the required time-to-result getting shorter and shorter, specialized solutions arise that apply the scale-out principle of HPC to areas that up to now have not been in the focus of High Performance Computing.

Parstream as a clustered database – optimized for highest read-

only parallelized data throughput and close-to-real-time query results

– is such a specialized solution.

transtec ensures that this latest innovative technology is imple-

mented and run on the most reliable systems available, sized

exactly according to the customer’s specific requirements. Par-

stream as the software technology core together with transtec

HPC systems and transtec services combined, provide for the

most reliable and comprehensive Big Data Analytics solutions

available.


ʎ complex queries are performed frequently

ʎ datasets are continuously growing

ʎ close-to-real-time analytical results are expected

ʎ infrastructure and operating costs need to be optimized

Key business scenarios

Big Data is a game changer in all industries and the public sector.

ParStream has identified many applications across all sectors

that yield high returns for the customer. Examples are ad-spend-

ing analytics and social media monitoring in eCommerce, fraud

detection and algorithmic trading in Finance/Capital markets,

network quality monitoring and customer targeting in Telcos,

smart metering and smart grids in Utilities, and many more.

ʎ Search and Selection – ParStream enables new levels of search and online shopping satisfaction. By providing faceted search in combination with continuous, real-time analysis of customers' preferences and behavior, ParStream can guide customers very effectively to products and services of interest, leading to increased conversion rates and platform revenue.

ʎ Real-Time Analytics – ParStream drives interactive analytics

processes to gain insights faster. Identifying valuable insights

from Big Data is ideally an interactive process where hypothe-

ses are identified and validated. ParStream delivers results

instantaneously on up-to-date Big Data even if many power-

users or automated agents are querying the data at the same

time. ParStream enables faster organizational learning for

process and product optimization.

ʎ Online-Processing – ParStream responds automatically to

large data streams. For online advertising, re-targeting, trend-

spotting, fraud detection and many more cases ParStream

can automatically process new data, compare it to large historic data sets, and react quickly. Fast-growing data vo-

lumes and increasing algorithmic complexity create a market

for faster solutions.


[Figure: Many applications across all industries – eCommerce/services (faceted search, web analytics, SEO analytics, online advertising), social networks (ad serving, profiling, targeting), telco (customer attrition prevention, network monitoring, targeting, prepaid account management), finance (trend analysis, fraud detection, automatic trading, risk analysis), energy/oil and gas (smart metering, smart grids, wind parks, mining, solar panels), and many more (production, mining, M2M, sensors, genetics, intelligence, weather).]


Offering

ParStream delivers a unique combination of real-time analyt-

ics, low-latency continuous import and high throughput for Big

Data. ParStream features analysis of billions of records with

thousands of columns in sub-second time with very small com-

puting infrastructure requirements. ParStream does NOT use

cubes to store data, which enables ParStream to analyze data in

its full granularity, flexibly in all dimensions and to import new

data on the fly.

ParStream delivers results typically 1,000 times faster than Post-

greSQL or MySQL. Compared to column-based databases, Par-

Stream delivers superior performance; this performance differ-

ential with traditional technology increases exponentially with

increasing data volume and query complexity.

ParStream has unique intellectual property in compression

and indexing. With High-Performance-Compressed-Index (HPCI)

technology, data indices can be analyzed in a compressed form,

removing the need to go through the extra step of decompres-

sion. This technological advantage results in major performance

advantages compared to other solutions, including:

ʎ Substantially reducing CPU load (previously needed for de-

compression, index operations more efficient than full data

scanning)

ʎ Substantially reducing RAM volume (previously needed to

store the decompressed index)

ʎ Providing much faster query response rates because HPCI

processing can be parallelized, eliminating the delays that result from decompression

Compared to state of the art data warehouse solutions, Par-

Stream delivers results much faster, with more flexibility and

granularity, on more up-to-date data, with lower infrastructure

requirements and at lower TCO. ParStream uniquely addresses

Big Data scenarios with its ability to continuously import data

and make it available to analytics in real-time. This allows for

analytics, filtering and conditioning of “live” data streams, for

example from social network or financial data feeds. ParStream

licensing is based on data volume. The product can be deployed

in four ways that also can be combined:

ʎ As a traditional on premise Software deployment managed by

the customer (including Cloud deployments),

ʎ As an Appliance managed by the customer,

ʎ As a Cloud Service provided by a ParStream Service partner,

and

ʎ As a Cloud Service provided by ParStream.

These options provide flexibility and ease adoption as custom-

ers can use ParStream on their preferred infrastructure or cloud,

regardless of number of servers, cores, etc. Although private

and public cloud-infrastructures are seen as the future model

for many applications, dedicated cluster-infrastructures are fre-

quently the preferred option for real-time Big Data scenarios.

[Figure: Standard data warehouse architecture vs. ParStream architecture – nightly batch import, long query runtimes, frequent full table scans and data that is at least one day old, versus parallel continuous import, HPCI query execution on compressed indices, each query using multiple processor cores, and continuous import assuring timeliness of data. © 2012 by ParStream GmbH]

Glossary

ACML (“AMD Core Math Library“)

A software development library released by AMD. This library

provides useful mathematical routines optimized for AMD pro-

cessors. Originally developed in 2002 for use in high-performance

computing (HPC) and scientific computing, ACML allows nearly

optimal use of AMD Opteron processors in compute-intensive

applications. ACML consists of the following main components:

ʎ A full implementation of Level 1, 2 and 3 Basic Linear Algebra

Subprograms (→ BLAS), with optimizations for AMD Opteron

processors.

ʎ A full suite of Linear Algebra (→ LAPACK) routines.

ʎ A comprehensive suite of Fast Fourier transform (FFTs) in sin-

gle-, double-, single-complex and double-complex data types.

ʎ Fast scalar, vector, and array math transcendental library

routines.

ʎ Random Number Generators in both single- and double-pre-

cision.

AMD offers pre-compiled binaries for Linux, Solaris, and Win-

dows available for download. Supported compilers include gfor-

tran, Intel Fortran Compiler, Microsoft Visual Studio, NAG, PathS-

cale, PGI compiler, and Sun Studio.

BLAS (“Basic Linear Algebra Subprograms“)

Routines that provide standard building blocks for performing

basic vector and matrix operations. The Level 1 BLAS perform

scalar, vector and vector-vector operations, the Level 2 BLAS

perform matrix-vector operations, and the Level 3 BLAS per-

form matrix-matrix operations. Because the BLAS are efficient,

portable, and widely available, they are commonly used in the

development of high quality linear algebra software, e.g. → LA-

PACK . Although a model Fortran implementation of the BLAS

is available from netlib in the BLAS library, it is not expected to

perform as well as a specially tuned implementation on most

high-performance computers – on some machines it may give

much worse performance – but it allows users to run → LAPACK

software on machines that do not offer any other implementa-

tion of the BLAS.
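As a minimal sketch of a Level 3 BLAS call, the following C++ example multiplies two small matrices through the portable C interface (cblas_dgemm); header name and link flags depend on the BLAS implementation actually used (e.g. a tuned vendor library or the netlib reference).

// C = alpha * A * B + beta * C with dgemm (Level 3 BLAS), row-major storage.
#include <cblas.h>
#include <cstdio>

int main() {
    const int n = 2;
    double A[] = {1.0, 2.0,
                  3.0, 4.0};
    double B[] = {5.0, 6.0,
                  7.0, 8.0};
    double C[] = {0.0, 0.0,
                  0.0, 0.0};

    // C = 1.0 * A * B + 0.0 * C
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    std::printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}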

Cg (“C for Graphics”)

A high-level shading language developed by Nvidia in close col-

laboration with Microsoft for programming vertex and pixel

shaders. It is very similar to Microsoft’s → HLSL. Cg is based on

the C programming language and although they share the same

syntax, some features of C were modified and new data types

were added to make Cg more suitable for programming graphics

processing units. This language is only suitable for GPU program-

ming and is not a general programming language. The Cg com-

piler outputs DirectX or OpenGL shader programs.

CISC (“complex instruction-set computer”)

A computer instruction set architecture (ISA) in which each in-

struction can execute several low-level operations, such as a load

from memory, an arithmetic operation, and a memory store, all

in a single instruction. The term was retroactively coined in con-

trast to reduced instruction set computer (RISC). The terms RISC

and CISC have become less meaningful with the continued evo-

lution of both CISC and RISC designs and implementations, with

modern processors also decoding and splitting more complex in-

structions into a series of smaller internal micro-operations that

can thereby be executed in a pipelined fashion, thus achieving

high performance on a much larger subset of instructions.


cluster

Aggregation of several, mostly identical or similar systems to

a group, working in parallel on a problem. Previously known

as Beowulf Clusters, HPC clusters are composed of commodity

hardware, and are scalable in design. The more machines are

added to the cluster, the more performance can in principle be

achieved.

control protocol

Part of the → parallel NFS standard

CUDA driver API

Part of → CUDA

CUDA SDK

Part of → CUDA

CUDA toolkit

Part of → CUDA

CUDA (“Compute Uniform Device Architecture”)

A parallel computing architecture developed by NVIDIA. CUDA

is the computing engine in NVIDIA graphics processing units or

GPUs that is accessible to software developers through indus-

try standard programming languages. Programmers use “C for

CUDA” (C with NVIDIA extensions), compiled through a PathScale

Open64 C compiler, to code algorithms for execution on the GPU.

CUDA architecture supports a range of computational interfaces

including → OpenCL and → DirectCompute. Third party wrappers

are also available for Python, Fortran, Java and Matlab. CUDA

works with all NVIDIA GPUs from the G8X series onwards, includ-

ing GeForce, Quadro and the Tesla line. CUDA provides both a low

level API and a higher level API. The initial CUDA SDK was made

public on 15 February 2007, for Microsoft Windows and Linux.

Mac OS X support was later added in version 2.0, which super-

sedes the beta released February 14, 2008.

CUDA is the hardware and software architecture that enables

NVIDIA GPUs to execute programs written with C, C++, Fortran,

→ OpenCL, → DirectCompute, and other languages. A CUDA pro-

gram calls parallel kernels. A kernel executes in parallel across

a set of parallel threads. The programmer or compiler organizes

these threads in thread blocks and grids of thread blocks. The

GPU instantiates a kernel program on a grid of parallel thread

blocks. Each thread within a thread block executes an instance

of the kernel, and has a thread ID within its thread block, pro-

gram counter, registers, per-thread private memory, inputs, and

output results.

[Figure: CUDA thread and memory hierarchy – threads grouped into thread blocks and grids (Grid 0, Grid 1), with per-thread private local memory, per-block shared memory, and per-application context global memory.]

A thread block is a set of concurrently executing threads that

can cooperate among themselves through barrier synchro-

nization and shared memory. A thread block has a block ID

within its grid. A grid is an array of thread blocks that execute

the same kernel, read inputs from global memory, write re-

sults to global memory, and synchronize between dependent

kernel calls. In the CUDA parallel programming model, each

thread has a per-thread private memory space used for regis-

ter spills, function calls, and C automatic array variables. Each

thread block has a per-block shared memory space used for

inter-thread communication, data sharing, and result sharing

in parallel algorithms. Grids of thread blocks share results in

global memory space after kernel-wide global synchroniza-

tion.

CUDA’s hierarchy of threads maps to a hierarchy of proces-

sors on the GPU; a GPU executes one or more kernel grids; a

streaming multiprocessor (SM) executes one or more thread

blocks; and CUDA cores and other execution units in the SM

execute threads. The SM executes threads in groups of 32

threads called a warp. While programmers can generally ig-

nore warp execution for functional correctness and think of

programming one thread, they can greatly improve perfor-

mance by having threads in a warp execute the same code

path and access memory in nearby addresses. See the main

article “GPU Computing” for further details.

DirectCompute

An application programming interface (API) that supports gen-

eral-purpose computing on graphics processing units (GPUs)

on Microsoft Windows Vista or Windows 7. DirectCompute is

part of the Microsoft DirectX collection of APIs and was initial-

ly released with the DirectX 11 API but runs on both DirectX 10

and DirectX 11 GPUs. The DirectCompute architecture shares

a range of computational interfaces with → OpenCL and →

CUDA.

ETL (“Extract, Transform, Load”)

A process in database usage and especially in data warehousing

that involves:

ʎ Extracting data from outside sources

ʎ Transforming it to fit operational needs (which can include

quality levels)

ʎ Loading it into the end target (database or data warehouse)

The first part of an ETL process involves extracting the data from

the source systems. In many cases this is the most challenging

aspect of ETL, as extracting data correctly will set the stage for

how subsequent processes will go. Most data warehousing proj-

ects consolidate data from different source systems. Each sepa-

rate system may also use a different data organization/format.

Common data source formats are relational databases and flat

files, but may include non-relational database structures such

as Information Management System (IMS) or other data struc-

tures such as Virtual Storage Access Method (VSAM) or Indexed

Sequential Access Method (ISAM), or even fetching from outside

sources such as through web spidering or screen-scraping. The

streaming of the extracted data source and load on-the-fly to the

destination database is another way of performing ETL when

no intermediate data storage is required. In general, the goal of

the extraction phase is to convert the data into a single format

which is appropriate for transformation processing.

An intrinsic part of the extraction involves the parsing of ex-


tracted data, resulting in a check if the data meets an expected

pattern or structure. If not, the data may be rejected entirely or

in part.

The transform stage applies a series of rules or functions to the

extracted data from the source to derive the data for loading

into the end target. Some data sources will require very little

or even no manipulation of data. In other cases, one or more of

the following transformation types may be required to meet the

business and technical needs of the target database.

The load phase loads the data into the end target, usually the

data warehouse (DW). Depending on the requirements of the

organization, this process varies widely. Some data warehouses

may overwrite existing information with cumulative information; updating of extracted data is frequently done on a daily, weekly or monthly basis. Other DW (or even other parts of the same DW)

may add new data in a historicized form, for example, hourly. To

understand this, consider a DW that is required to maintain sales

records of the last year. Then, the DW will overwrite any data

that is older than a year with newer data. However, the entry of

data for any one year window will be made in a historicized man-

ner. The timing and scope to replace or append are strategic de-

sign choices dependent on the time available and the business

needs. More complex systems can maintain a history and audit

trail of all changes to the data loaded in the DW. As the load

phase interacts with a database, the constraints defined in the

database schema — as well as in triggers activated upon data

load — apply (for example, uniqueness, referential integrity,

mandatory fields), which also contribute to the overall data

quality performance of the ETL process.

FFTW (“Fastest Fourier Transform in the West”)

A software library for computing discrete Fourier transforms

(DFTs), developed by Matteo Frigo and Steven G. Johnson at

the Massachusetts Institute of Technology. FFTW is known as

the fastest free software implementation of the Fast Fourier

transform (FFT) algorithm (upheld by regular benchmarks). It

can compute transforms of real and complex-valued arrays of

arbitrary size and dimension in O(n log n) time.
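As a minimal sketch, the following C++ example computes a 1-D complex DFT with FFTW's standard plan/execute API (link with -lfftw3); the input values are illustrative.

// 1-D complex-to-complex forward DFT with FFTW.
#include <fftw3.h>
#include <cstdio>

int main() {
    const int n = 8;
    fftw_complex* in  = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex* out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * n);

    for (int i = 0; i < n; ++i) {   // simple real-valued input signal
        in[i][0] = i;               // real part
        in[i][1] = 0.0;             // imaginary part
    }

    // Plan once (FFTW picks a fast strategy), then execute the transform.
    fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(plan);

    for (int i = 0; i < n; ++i)
        std::printf("X[%d] = %g + %gi\n", i, out[i][0], out[i][1]);

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}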

floating point standard (IEEE 754)

The most widely-used standard for floating-point computation,

and is followed by many hardware (CPU and FPU) and software

implementations. Many computer languages allow or require

that some or all arithmetic be carried out using IEEE 754 for-

mats and operations. The current version is IEEE 754-2008,

which was published in August 2008; the original IEEE 754-1985

was published in 1985. The standard defines arithmetic for-

mats, interchange formats, rounding algorithms, operations,

and exception handling. The standard also includes extensive

recommendations for advanced exception handling, additional

operations (such as trigonometric functions), expression evalu-

ation, and for achieving reproducible results. The standard

defines single-precision, double-precision, as well as 128-bit

quadruple-precision floating point numbers. In the proposed

754r version, the standard also defines the 2-byte half-precision

number format.

FraunhoferFS (FhGFS)

A high-performance parallel file system from the Fraunhofer

Competence Center for High Performance Computing. Built

on scalable multithreaded core components with native → In-


finiBand support, file system nodes can serve → InfiniBand and

Ethernet (or any other TCP-enabled network) connections at

the same time and automatically switch to a redundant con-

nection path in case any of them fails. One of the most funda-

mental concepts of FhGFS is the strict avoidance of architectur-

al bottlenecks. Striping file contents across multiple storage

servers is only one part of this concept. Another important as-

pect is the distribution of file system metadata (e.g. directory

information) across multiple metadata servers. Large systems

and metadata intensive applications in general can greatly

profit from the latter feature.

FhGFS requires no dedicated file system partition on the serv-

ers – it uses existing partitions, formatted with any of the stan-

dard Linux file systems, e.g. XFS or ext4. For larger networks, it is

also possible to create several distinct FhGFS file system parti-

tions with different configurations. FhGFS provides a coherent

mode, in which it is guaranteed that changes to a file or directo-

ry by one client are always immediately visible to other clients.

Global Arrays (GA)

A library developed by scientists at Pacific Northwest National

Laboratory for parallel computing. GA provides a friendly API for

shared-memory programming on distributed-memory comput-

ers for multidimensional arrays. The GA library is a predecessor

to the GAS (global address space) languages currently being de-

veloped for high-performance computing. The GA toolkit has ad-

ditional libraries including a Memory Allocator (MA), Aggregate

Remote Memory Copy Interface (ARMCI), and functionality for

out-of-core storage of arrays (ChemIO). Although GA was initially

developed to run with TCGMSG, a message passing library that

came before the → MPI standard (Message Passing Interface), it

is now fully compatible with → MPI. GA includes simple matrix

computations (matrix-matrix multiplication, LU solve) and works

with → ScaLAPACK. Sparse matrices are available but the imple-

mentation is not optimal yet. GA was developed by Jarek Nieplo-

cha, Robert Harrison and R. J. Littlefield. The ChemIO library for

out-of-core storage was developed by Jarek Nieplocha, Robert

Harrison and Ian Foster. The GA library is incorporated into many

quantum chemistry packages, including NWChem, MOLPRO,

UTChem, MOLCAS, and TURBOMOLE. The GA toolkit is free soft-

ware, licensed under a self-made license.

Globus Toolkit

An open source toolkit for building computing grids developed

and provided by the Globus Alliance, currently at version 5.

GMP (“GNU Multiple Precision Arithmetic Library”)

A free library for arbitrary-precision arithmetic, operating on

signed integers, rational numbers, and floating point num-

bers. There are no practical limits to the precision except the

ones implied by the available memory in the machine GMP

runs on (operand dimension limit is 231 bits on 32-bit ma-

chines and 237 bits on 64-bit machines). GMP has a rich set of

functions, and the functions have a regular interface. The ba-

sic interface is for C but wrappers exist for other languages in-

cluding C++, C#, OCaml, Perl, PHP, and Python. In the past, the

Kaffe Java virtual machine used GMP to support Java built-in

arbitrary precision arithmetic. This feature has been removed

from recent releases, causing protests from people who claim

that they used Kaffe solely for the speed benefits afforded by

GMP. As a result, GMP support has been added to GNU Class-

path. The main target applications of GMP are cryptography


applications and research, Internet security applications, and

computer algebra systems.
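As a minimal sketch, the following example uses GMP's integer type to compute 100! exactly, far beyond any native integer type; the C API shown is also usable from C++ (compile with -lgmp).

// Arbitrary-precision factorial with GMP.
#include <gmp.h>
#include <cstdio>

int main() {
    mpz_t fact;
    mpz_init(fact);

    mpz_fac_ui(fact, 100);                 // fact = 100!
    gmp_printf("100! = %Zd\n", fact);
    std::printf("decimal digits: %zu\n", mpz_sizeinbase(fact, 10));

    mpz_clear(fact);
    return 0;
}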

GotoBLAS

Kazushige Goto’s implementation of → BLAS.

grid (in CUDA architecture)

Part of the → CUDA programming model

GridFTP

An extension of the standard File Transfer Protocol (FTP) for

use with Grid computing. It is defined as part of the → Globus

toolkit, under the organisation of the Global Grid Forum (spe-

cifically, by the GridFTP working group). The aim of GridFTP is

to provide a more reliable and high performance file transfer

for Grid computing applications. This is necessary because of

the increased demands of transmitting data in Grid comput-

ing – it is frequently necessary to transmit very large files, and

this needs to be done fast and reliably. GridFTP is the answer

to the problem of incompatibility between storage and access

systems. Previously, each data provider would make their data

available in their own specific way, providing a library of ac-

cess functions. This made it difficult to obtain data from multi-

ple sources, requiring a different access method for each, and

thus dividing the total available data into partitions. GridFTP

provides a uniform way of accessing the data, encompassing

functions from all the different modes of access, building on

and extending the universally accepted FTP standard. FTP was

chosen as a basis for it because of its widespread use, and be-

cause it has a well defined architecture for extensions to the

protocol (which may be dynamically discovered).

Hierarchical Data Format (HDF)

A set of file formats and libraries designed to store and organize

large amounts of numerical data. Originally developed at the

National Center for Supercomputing Applications, it is currently

supported by the non-profit HDF Group, whose mission is to en-

sure continued development of HDF5 technologies, and the con-

tinued accessibility of data currently stored in HDF. In keeping

with this goal, the HDF format, libraries and associated tools are

available under a liberal, BSD-like license for general use. HDF is

supported by many commercial and non-commercial software

platforms, including Java, MATLAB, IDL, and Python. The freely

available HDF distribution consists of the library, command-line

utilities, test suite source, Java interface, and the Java-based HDF

Viewer (HDFView). There currently exist two major versions of

HDF, HDF4 and HDF5, which differ significantly in design and API.

HLSL (“High Level Shader Language“)

The High Level Shader Language or High Level Shading Language

(HLSL) is a proprietary shading language developed by Microsoft

for use with the Microsoft Direct3D API. It is analogous to the

GLSL shading language used with the OpenGL standard. It is very

similar to the NVIDIA Cg shading language, as it was developed

alongside it.

HLSL programs come in three forms, vertex shaders, geometry

shaders, and pixel (or fragment) shaders. A vertex shader is ex-

ecuted for each vertex that is submitted by the application, and

is primarily responsible for transforming the vertex from object

space to view space, generating texture coordinates, and calcu-

lating lighting coefficients such as the vertex’s tangent, binor-

mal and normal vectors. When a group of vertices (normally 3, to

form a triangle) come through the vertex shader, their output po-


sition is interpolated to form pixels within its area; this process

is known as rasterisation. Each of these pixels comes through

the pixel shader, whereby the resultant screen colour is calculat-

ed. Optionally, an application using a Direct3D10 interface and

Direct3D10 hardware may also specify a geometry shader. This

shader takes as its input the three vertices of a triangle and uses

this data to generate (or tessellate) additional triangles, which

are each then sent to the rasterizer.

InfiniBand

Switched fabric communications link primarily used in HPC and

enterprise data centers. Its features include high throughput,

low latency, quality of service and failover, and it is designed to

be scalable. The InfiniBand architecture specification defines a

connection between processor nodes and high performance I/O

nodes such as storage devices. Like → PCI Express, and many other

modern interconnects, InfiniBand offers point-to-point bidirec-

tional serial links intended for the connection of processors with

high-speed peripherals such as disks. On top of the point to point

capabilities, InfiniBand also offers multicast operations as well.

It supports several signalling rates and, as with PCI Express, links

can be bonded together for additional throughput.

The SDR serial connection’s signalling rate is 2.5 gigabit per sec-

ond (Gbit/s) in each direction per connection. DDR is 5 Gbit/s and

QDR is 10 Gbit/s. FDR is 14.0625 Gbit/s and EDR is 25.78125 Gbit/s

per lane. For SDR, DDR and QDR, links use 8B/10B encoding – every

10 bits sent carry 8 bits of data – making the useful data transmis-

sion rate four-fifths the raw rate. Thus single, double, and quad

data rates carry 2, 4, or 8 Gbit/s useful data, respectively. For FDR

and EDR, links use 64B/66B encoding – every 66 bits sent carry 64

bits of data.

Implementers can aggregate links in units of 4 or 12, called 4X or

12X. A 12X QDR link therefore carries 120 Gbit/s raw, or 96 Gbit/s

of useful data. As of 2009 most systems use a 4X aggregate, imply-

ing a 10 Gbit/s (SDR), 20 Gbit/s (DDR) or 40 Gbit/s (QDR) connection.

Larger systems with 12X links are typically used for cluster and

supercomputer interconnects and for inter-switch connections.

The single data rate switch chips have a latency of 200 nano-

seconds, DDR switch chips have a latency of 140 nanoseconds

and QDR switch chips have a latency of 100 nanoseconds. The

end-to-end latency ranges from 1.07 microseconds MPI la-

tency to 1.29 microseconds MPI latency to 2.6 microseconds. As

of 2009 various InfiniBand host channel adapters (HCA) exist in

the market, each with different latency and bandwidth charac-

teristics. InfiniBand also provides RDMA capabilities for low CPU

overhead. The latency for RDMA operations is less than 1 micro-

second.

See the main article “InfiniBand” for further description of In-

finiBand features.

Intel Integrated Performance Primitives (Intel IPP)

A multi-threaded software library of functions for multimedia

and data processing applications, produced by Intel. The library

supports Intel and compatible processors and is available for

Windows, Linux, and Mac OS X operating systems. It is available

separately or as a part of Intel Parallel Studio. The library takes

advantage of processor features including MMX, SSE, SSE2, SSE3,

SSSE3, SSE4, AES-NI and multicore processors. Intel IPP is divided

into four major processing groups: Signal (with linear array or

vector data), Image (with 2D arrays for typical color spaces), Ma-

trix (with n x m arrays for matrix operations), and Cryptography.


Half the entry points are of the matrix type, a third are of the

signal type and the remainder are of the image and cryptogra-

phy types. Intel IPP functions are divided into 4 data types: Data

types include 8u (8-bit unsigned), 8s (8-bit signed), 16s, 32f (32-bit

floating-point), 64f, etc. Typically, an application developer works

with only one dominant data type for most processing func-

tions, converting between input to processing to output formats

at the end points. Version 5.2 was introduced June 5, 2007, add-

ing code samples for data compression, new video codec sup-

port, support for 64-bit applications on Mac OS X, support for

Windows Vista, and new functions for ray-tracing and rendering.

Version 6.1 was released with the Intel C++ Compiler on June 28,

2009 and Update 1 for version 6.1 was released on July 28, 2009.

Intel Threading Building Blocks (TBB)

A C++ template library developed by Intel Corporation for

writing software programs that take advantage of multi-core

processors. The library consists of data structures and algo-

rithms that allow a programmer to avoid some complications

arising from the use of native threading packages such as

POSIX → threads, Windows → threads, or the portable Boost

Threads in which individual → threads of execution are cre-

ated, synchronized, and terminated manually. Instead the li-

brary abstracts access to the multiple processors by allowing

the operations to be treated as “tasks”, which are allocated

to individual cores dynamically by the library’s run-time en-

gine, and by automating efficient use of the CPU cache. A TBB

program creates, synchronizes and destroys graphs of depen-

dent tasks according to algorithms, i.e. high-level parallel pro-

gramming paradigms (a.k.a. Algorithmic Skeletons). Tasks are

then executed respecting graph dependencies. This approach

groups TBB in a family of solutions for parallel programming

aiming to decouple the programming from the particulars of

the underlying machine. Intel TBB is available commercially

as a binary distribution with support and in open source in

both source and binary forms. Version 4.0 was introduced on

September 8, 2011.
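As a minimal sketch of the task-based loop parallelism described above, the following C++ example lets TBB split an iteration range into tasks and schedule them across the available cores (compile with -ltbb); the array contents are illustrative.

// Parallel loop over a range with tbb::parallel_for and a lambda body.
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1000000;
    std::vector<double> data(n, 1.0);

    // Scale every element in parallel; TBB chooses task granularity itself.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
                      [&](const tbb::blocked_range<std::size_t>& r) {
                          for (std::size_t i = r.begin(); i != r.end(); ++i)
                              data[i] *= 2.0;
                      });

    std::printf("data[0] = %g, data[n-1] = %g\n", data[0], data[n - 1]);
    return 0;
}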

iSER (“iSCSI Extensions for RDMA“)

A protocol that maps the iSCSI protocol over a network that pro-

vides RDMA services (like → iWARP or → InfiniBand). This permits

data to be transferred directly into SCSI I/O buffers without in-

termediate data copies. The Datamover Architecture (DA) defines

an abstract model in which the movement of data between iSCSI

end nodes is logically separated from the rest of the iSCSI proto-

col. iSER is one Datamover protocol. The interface between the

iSCSI and a Datamover protocol, iSER in this case, is called Dat-

amover Interface (DI).

InfiniBand data rates by link aggregation:

SDR (Single Data Rate):     1X 2 Gbit/s,  4X 8 Gbit/s,   12X 24 Gbit/s
DDR (Double Data Rate):     1X 4 Gbit/s,  4X 16 Gbit/s,  12X 48 Gbit/s
QDR (Quadruple Data Rate):  1X 8 Gbit/s,  4X 32 Gbit/s,  12X 96 Gbit/s
FDR (Fourteen Data Rate):   1X 14 Gbit/s, 4X 56 Gbit/s,  12X 168 Gbit/s
EDR (Enhanced Data Rate):   1X 25 Gbit/s, 4X 100 Gbit/s, 12X 300 Gbit/s


iWARP (“Internet Wide Area RDMA Protocol”)

An Internet Engineering Task Force (IETF) update of the RDMA

Consortium’s → RDMA over TCP standard. This later standard is

zero-copy transmission over legacy TCP. Because a kernel imple-

mentation of the TCP stack is a tremendous bottleneck, a few

vendors now implement TCP in hardware. This additional hard-

ware is known as the TCP offload engine (TOE). TOE itself does

not prevent copying on the receive side, and must be combined

with RDMA hardware for zero-copy results. The main component

is the Data Direct Protocol (DDP), which permits the actual zero-

copy transmission. The transmission itself is not performed by

DDP, but by TCP.

kernel (in CUDA architecture)

Part of the → CUDA programming model

LAM/MPI

A high-quality open-source implementation of the → MPI specifi-

cation, including all of MPI-1.2 and much of MPI-2. Superseded by

the → OpenMPI implementation.

LAPACK (“linear algebra package”)

Routines for solving systems of simultaneous linear equations,

least-squares solutions of linear systems of equations, eigenvalue

problems, and singular value problems. The original goal of the LA-

PACK project was to make the widely used EISPACK and → LINPACK

libraries run efficiently on shared-memory vector and parallel pro-

cessors. LAPACK routines are written so that as much as possible

of the computation is performed by calls to the → BLAS library.

While → LINPACK and EISPACK are based on the vector operation

kernels of the Level 1 BLAS, LAPACK was designed at the outset to

exploit the Level 3 BLAS. Highly efficient machine-specific imple-

mentations of the BLAS are available for many modern high-per-

formance computers. The BLAS enable LAPACK routines to achieve

high performance with portable software.
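As a minimal sketch, the following C++ example solves a small dense linear system A x = b through the LAPACKE C interface (dgesv, an LU factorization with partial pivoting that calls BLAS internally); the matrix values are illustrative and link flags depend on the LAPACK/BLAS implementation used.

// Solve A x = b with LAPACKE_dgesv; b is overwritten with the solution x.
#include <lapacke.h>
#include <cstdio>

int main() {
    // 3x3 system in row-major order.
    double A[] = { 3.0,  1.0, -1.0,
                   2.0,  4.0,  1.0,
                  -1.0,  2.0,  5.0};
    double b[] = { 4.0,  1.0,  1.0};   // right-hand side
    lapack_int ipiv[3];                // pivot indices from the LU factorization

    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1, A, 3, ipiv, b, 1);
    if (info != 0) {
        std::printf("dgesv failed, info = %d\n", (int) info);
        return 1;
    }
    std::printf("x = (%g, %g, %g)\n", b[0], b[1], b[2]);
    return 0;
}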

layout

Part of the → parallel NFS standard. Currently three types of lay-

out exist: file-based, block/volume-based, and object-based, the

latter making use of → object-based storage devices

LINPACK

A collection of Fortran subroutines that analyze and solve linear

equations and linear least-squares problems. LINPACK was de-

signed for supercomputers in use in the 1970s and early 1980s.

LINPACK has been largely superseded by → LAPACK, which has

been designed to run efficiently on shared-memory, vector su-

percomputers. LINPACK makes use of the → BLAS libraries for

performing basic vector and matrix operations.

The LINPACK benchmarks are a measure of a system‘s float-

ing point computing power and measure how fast a computer

solves a dense N by N system of linear equations Ax=b, which is a

common task in engineering. The solution is obtained by Gauss-

ian elimination with partial pivoting, with 2/3•N³ + 2•N² floating

point operations. The result is reported in millions of floating

point operations per second (MFLOP/s, sometimes simply called

MFLOPS).
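The operation count above translates directly into the reported rate; the following small C++ sketch uses illustrative (not measured) values for the problem size and runtime.

// LINPACK metric: flop count for a dense N x N solve, converted to MFLOP/s.
#include <cstdio>

int main() {
    const double N = 10000.0;        // problem size (illustrative value)
    const double seconds = 42.0;     // measured solve time (illustrative value)

    const double flops = 2.0 / 3.0 * N * N * N + 2.0 * N * N;
    std::printf("floating point operations: %.3e\n", flops);
    std::printf("rate: %.1f MFLOP/s\n", flops / seconds / 1.0e6);
    return 0;
}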

LNET

Communication protocol in → Lustre


logical object volume (LOV)

A logical entity in → Lustre

Lustre

An object-based → parallel file system

management server (MGS)

A functional component in → Lustre

MapReduce

A framework for processing highly distributable problems across

huge datasets using a large number of computers (nodes), collec-

tively referred to as a cluster (if all nodes use the same hardware)

or a grid (if the nodes use different hardware). Computational

processing can occur on data stored either in a filesystem (un-

structured) or in a database (structured).

“Map” step: The master node takes the input, divides it into

smaller sub-problems, and distributes them to worker nodes. A

worker node may do this again in turn, leading to a multi-level

tree structure. The worker node processes the smaller problem,

and passes the answer back to its master node.

“Reduce” step: The master node then collects the answers to all

the sub-problems and combines them in some way to form the

output – the answer to the problem it was originally trying to

solve.

MapReduce allows for distributed processing of the map and

reduction operations. Provided each mapping operation is

independent of the others, all maps can be performed in par-

allel – though in practice it is limited by the number of inde-

pendent data sources and/or the number of CPUs near each

source. Similarly, a set of ‘reducers’ can perform the reduction

phase - provided all outputs of the map operation that share

the same key are presented to the same reducer at the same

time. While this process can often appear inefficient com-

pared to algorithms that are more sequential, MapReduce can

be applied to significantly larger datasets than “commodity”

servers can handle – a large server farm can use MapReduce

to sort a petabyte of data in only a few hours. The parallelism

also offers some possibility of recovering from partial failure

of servers or storage during the operation: if one mapper or

reducer fails, the work can be rescheduled – assuming the in-

put data is still available.
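The following sequential C++ sketch illustrates the two steps for the classic word-count example; an actual MapReduce framework would run the map calls on many worker nodes and group the emitted key/value pairs by key before invoking the reduce calls. The function names and sample documents are purely illustrative.

// Sequential word-count sketch of the map and reduce steps.
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Map: one document in, a list of (word, 1) pairs out.
std::vector<std::pair<std::string, int>> map_step(const std::string& doc) {
    std::vector<std::pair<std::string, int>> pairs;
    std::istringstream words(doc);
    std::string w;
    while (words >> w) pairs.emplace_back(w, 1);
    return pairs;
}

// Reduce: all counts emitted for one word are summed.
int reduce_step(const std::vector<int>& counts) {
    int total = 0;
    for (int c : counts) total += c;
    return total;
}

int main() {
    std::vector<std::string> docs = {"big data big clusters", "big jobs"};
    std::map<std::string, std::vector<int>> grouped;   // shuffle: group by key
    for (const auto& d : docs)
        for (const auto& kv : map_step(d)) grouped[kv.first].push_back(kv.second);
    for (const auto& kv : grouped)
        std::cout << kv.first << ": " << reduce_step(kv.second) << "\n";
    return 0;
}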

metadata server (MDS)

A functional component in → Lustre

metadata target (MDT)

A logical entity in → Lustre

MKL (“Math Kernel Library”)

A library of optimized math routines for science, engineering,

and financial applications developed by Intel. Core math func-

tions include → BLAS, → LAPACK, → ScaLAPACK, Sparse Solvers,

Fast Fourier Transforms, and Vector Math. The library supports

Intel and compatible processors and is available for Windows,

Linux and Mac OS X operating systems.

MPI, MPI-2 (“message-passing interface”)

A language-independent communications protocol used to

program parallel computers. Both point-to-point and collec-

tive communication are supported. MPI remains the domi-

nant model used in high-performance computing today. There


are two versions of the standard that are currently popular:

version 1.2 (shortly called MPI-1), which emphasizes message

passing and has a static runtime environment, and MPI-2.1

(MPI-2), which includes new features such as parallel I/O, dy-

namic process management and remote memory operations.

MPI-2 specifies over 500 functions and provides language

bindings for ANSI C, ANSI Fortran (Fortran90), and ANSI C++. In-

teroperability of objects defined in MPI was also added to allow

for easier mixed-language message passing programming. A side

effect of MPI-2 standardization (completed in 1996) was clarifi-

cation of the MPI-1 standard, creating the MPI-1.2 level. MPI-2 is

mostly a superset of MPI-1, although some functions have been

deprecated. Thus MPI-1.2 programs still work under MPI imple-

mentations compliant with the MPI-2 standard. The MPI Forum

reconvened in 2007, to clarify some MPI-2 issues and explore de-

velopments for a possible MPI-3.
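As a minimal sketch, the following C++ program initializes MPI, reports each rank, and sums one value per process with a collective reduction; it is typically built with an MPI compiler wrapper (e.g. mpicxx) and launched with mpirun.

// Basic MPI program: environment setup, rank/size query, collective reduce.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::printf("rank %d of %d\n", rank, size);

    int local = rank, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("sum of ranks: %d\n", total);

    MPI_Finalize();
    return 0;
}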

MPICH2

A freely available, portable → MPI 2.0 implementation, main-

tained by Argonne National Laboratory

MPP (“massively parallel processing”)

So-called MPP jobs are computer programs with several parts

running on several machines in parallel, often calculating simu-

lation problems. The communication between these parts can

e.g. be realized by the → MPI software interface.

MS-MPI

Microsoft → MPI 2.0 implementation shipped with Microsoft HPC

Pack 2008 SDK, based on and designed for maximum compatibil-

ity with the → MPICH2 reference implementation.

MVAPICH2

An → MPI 2.0 implementation based on → MPICH2 and developed

by the Department of Computer Science and Engineering at

Ohio State University. It is available under BSD licensing and sup-

ports MPI over InfiniBand, 10GigE/iWARP and RDMAoE.

NetCDF (“Network Common Data Form”)

A set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. The project homepage is hosted by the Unidata program at the University Corporation for Atmospheric Research (UCAR), which is also the chief source of NetCDF software, standards development, updates, etc. The format is an open standard, and NetCDF Classic and 64-bit Offset Format are an international standard of the Open Geospatial Consortium. The project is actively supported by UCAR. Version 4.0 (released 2008) greatly enhances the data model by allowing the use of the → HDF5 data file format. Version 4.1 (2010) adds support for C and Fortran client access to specified subsets of remote data via OPeNDAP. The format was originally based on the conceptual model of the NASA CDF but has since diverged and is not compatible with it. It is commonly used in climatology, meteorology and oceanography applications (e.g., weather forecasting, climate change) and GIS applications. It is an input/output format for many GIS applications, and for general scientific data exchange. Starting with version 4.1.1, the NetCDF C library and the libraries based on it (Fortran 77 and Fortran 90, C++, and all third-party libraries) can read some data in other data formats. Data in the → HDF5 format can be read, with some restrictions. Data in the → HDF4 format can be read by the NetCDF C library if created using the → HDF4 Scientific Data (SD) API.
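
A minimal sketch of writing a one-dimensional variable with the NetCDF C library; the file name, dimension and variable names are arbitrary examples.

    #include <stdio.h>
    #include <stdlib.h>
    #include <netcdf.h>

    /* Abort with the library's error string if a NetCDF call fails. */
    static void check(int status)
    {
        if (status != NC_NOERR) {
            fprintf(stderr, "NetCDF error: %s\n", nc_strerror(status));
            exit(EXIT_FAILURE);
        }
    }

    int main(void)
    {
        int ncid, dimid, varid;
        double temperature[4] = { 19.5, 20.1, 20.7, 21.3 };

        check(nc_create("example.nc", NC_CLOBBER, &ncid));       /* create dataset    */
        check(nc_def_dim(ncid, "time", 4, &dimid));               /* define dimension  */
        check(nc_def_var(ncid, "temperature", NC_DOUBLE, 1, &dimid, &varid));
        check(nc_enddef(ncid));                                    /* leave define mode */
        check(nc_put_var_double(ncid, varid, temperature));       /* write the data    */
        check(nc_close(ncid));
        return 0;
    }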


NetworkDirect

A remote direct memory access (RDMA)-based network interface implemented in Windows Server 2008 and later. NetworkDirect uses a more direct path from → MPI applications to networking hardware, resulting in very fast and efficient networking. See the main article “Windows HPC Server 2008 R2” for further details.

NFS (Network File System)

A network file system protocol originally developed by Sun Microsystems in 1984, allowing a user on a client computer to access files over a network in a manner similar to how local storage is accessed. NFS, like many other protocols, builds on the Open Network Computing Remote Procedure Call (ONC RPC) system. The Network File System is an open standard defined in RFCs, allowing anyone to implement the protocol.

Sun used version 1 only for in-house experimental purposes. When the development team added substantial changes to NFS version 1 and released it outside of Sun, they decided to release the new version as V2, so that version interoperation and RPC version fallback could be tested. Version 2 of the protocol (defined in RFC 1094, March 1989) originally operated entirely over UDP. Its designers meant to keep the protocol stateless, with locking (for example) implemented outside of the core protocol. Version 3 (RFC 1813, June 1995) added:

• support for 64-bit file sizes and offsets, to handle files larger than 2 gigabytes (GB)
• support for asynchronous writes on the server, to improve write performance
• additional file attributes in many replies, to avoid the need to re-fetch them
• a READDIRPLUS operation, to get file handles and attributes along with file names when scanning a directory
• assorted other improvements

At the time of introduction of Version 3, vendor support for TCP as a transport-layer protocol began increasing. While several vendors had already added support for NFS Version 2 with TCP as a transport, Sun Microsystems added support for TCP as a transport for NFS at the same time it added support for Version 3. Using TCP as a transport made using NFS over a WAN more feasible.

Version 4 (RFC 3010, December 2000; revised in RFC 3530, April 2003), influenced by AFS and CIFS, includes performance improvements, mandates strong security, and introduces a stateful protocol. Version 4 became the first version developed with the Internet Engineering Task Force (IETF) after Sun Microsystems handed over the development of the NFS protocols.

NFS version 4 minor version 1 (NFSv4.1) was approved by the IESG and received an RFC number in January 2010. The NFSv4.1 specification aims to provide protocol support to take advantage of clustered server deployments, including the ability to provide scalable parallel access to files distributed among multiple servers. NFSv4.1 adds the parallel NFS (pNFS) capability, which enables data access parallelism. The NFSv4.1 protocol defines a method of separating the filesystem metadata from the location of the file data; it goes beyond the simple name/data separation by striping the data amongst a set of data servers. This is different from the traditional NFS server, which holds the names of files and their data under the single umbrella of the server.

In addition to pNFS, NFSv4.1 provides sessions, directory delegation and notifications, multi-server namespace, access control lists (ACL/SACL/DACL), retention attributes, and SECINFO_NO_NAME. See the main article “Parallel Filesystems” for further details.

Current work is being done on preparing a draft for a future version 4.2 of the NFS standard, including so-called federated filesystems, which constitute the NFS counterpart of Microsoft's distributed filesystem (DFS).

NUMA (“non-uniform memory access”)

A computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.
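
On Linux, NUMA placement can also be controlled explicitly from an application. The following C sketch assumes the libnuma development package is installed (numa.h, linked with -lnuma) and allocates a buffer on a specific NUMA node; node 0 is just an example.

    #include <stdio.h>
    #include <numa.h>   /* assumes libnuma; link with -lnuma */

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }

        size_t size = 64UL * 1024 * 1024;        /* 64 MB buffer             */
        int node = 0;                             /* place it on NUMA node 0  */
        char *buf = numa_alloc_onnode(size, node);
        if (buf == NULL) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }

        buf[0] = 1;   /* touch the buffer so pages are faulted in on that node */
        printf("allocated %zu bytes on node %d (system has %d nodes)\n",
               size, node, numa_max_node() + 1);

        numa_free(buf, size);
        return 0;
    }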

object storage server (OSS)

A functional component in → Lustre

object storage target (OST)

A logical entity in → Lustre

object-based storage device (OSD)

An intelligent evolution of disk drives that can store and serve objects rather than simply place data on tracks and sectors. This task is accomplished by moving low-level storage functions into the storage device and accessing the device through an object interface. Unlike a traditional block-oriented device providing access to data organized as an array of unrelated blocks, an object store allows access to data by means of storage objects. A storage object is a virtual entity that groups together data the user has determined to be logically related. Space for a storage object is allocated internally by the OSD itself instead of by a host-based file system. OSDs manage all necessary low-level storage, space management, and security functions. Because there is no host-based metadata for an object (such as inode information), the only way for an application to retrieve an object is by using its object identifier (OID). The SCSI interface was modified and extended by the OSD Technical Work Group of the Storage Networking Industry Association (SNIA), with varied industry and academic contributors, resulting in a draft standard submitted to T10 in 2004. This standard was ratified in September 2004 and became the ANSI T10 SCSI OSD V1 command set, released as INCITS 400-2004. The SNIA group continues to work on further extensions to the interface, such as the ANSI T10 SCSI OSD V2 command set.

OLAP cube (“Online Analytical Processing”)

A set of data, organized in a way that facilitates non-predetermined queries for aggregated information, or, in other words, online analytical processing. OLAP is one of the computer-based techniques for analyzing business data that are collectively called business intelligence. OLAP cubes can be thought of as extensions to the two-dimensional array of a spreadsheet. For example, a company might wish to analyze some financial data by product, by time period, by city, by type of revenue and cost, and by comparing actual data with a budget. These additional methods of analyzing the data are known as dimensions. Because there can be more than three dimensions in an OLAP system, the term hypercube is sometimes used.


PCI Express throughput per link width and generation (see the PCI Express entry below):

            PCIe 1.x    PCIe 2.x    PCIe 3.0    PCIe 4.0
    x1      256 MB/s    512 MB/s    1 GB/s      2 GB/s
    x2      512 MB/s    1 GB/s      2 GB/s      4 GB/s
    x4      1 GB/s      2 GB/s      4 GB/s      8 GB/s
    x8      2 GB/s      4 GB/s      8 GB/s      16 GB/s
    x16     4 GB/s      8 GB/s      16 GB/s     32 GB/s
    x32     8 GB/s      16 GB/s     32 GB/s     64 GB/s


OpenCL (“Open Computing Language”)

A framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus APIs that are used to define and then control the platforms. OpenCL provides parallel computing using task-based and data-based parallelism. OpenCL is analogous to the open industry standards OpenGL and OpenAL, for 3D graphics and computer audio, respectively. Originally developed by Apple Inc., which holds trademark rights, OpenCL is now managed by the non-profit technology consortium Khronos Group.
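
For illustration, a kernel in OpenCL C that adds two vectors element-wise could look like the sketch below; on the host side it would be compiled at run time via the standard API (e.g. clBuildProgram) and launched over a one-dimensional index space with clEnqueueNDRangeKernel. The kernel and argument names are arbitrary.

    /* OpenCL C kernel: one work-item processes one vector element.    */
    /* get_global_id(0) returns this work-item's index in dimension 0. */
    __kernel void vector_add(__global const float *a,
                             __global const float *b,
                             __global float *c,
                             const unsigned int n)
    {
        size_t i = get_global_id(0);
        if (i < n)                 /* guard against rounded-up global sizes */
            c[i] = a[i] + b[i];
    }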

OpenMP (“Open Multi-Processing”)

An application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++ and Fortran on many architectures, including Unix and Microsoft Windows platforms. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.

An application built with the hybrid model of parallel programming can run on a computer cluster using both OpenMP and the Message Passing Interface (MPI), or more transparently through the use of OpenMP extensions for non-shared-memory systems.
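
A small C example of the directive-based style follows; it is typically compiled with an OpenMP flag such as -fopenmp (GCC) or the corresponding option of other compilers.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N];
        double sum = 0.0;

        /* initialize the arrays in parallel: loop iterations are
           distributed across the threads of the team */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            a[i] = 0.5 * i;
            b[i] = 2.0;
        }

        /* parallel dot product with a reduction over private partial sums */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("dot product = %.1f (up to %d threads)\n",
               sum, omp_get_max_threads());
        return 0;
    }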

OpenMPI

An open source → MPI-2 implementation that is developed and maintained by a consortium of academic, research, and industry partners.

parallel NFS (pNFS)

A → parallel file system standard, optional part of the current → NFS standard 4.1. See the main article “Parallel Filesystems” for further details.

PCI Express (PCIe)

A computer expansion card standard designed to replace the older PCI, PCI-X, and AGP standards. Introduced by Intel in 2004, PCIe (or PCI-E, as it is commonly called) is the latest standard for expansion cards that is available on mainstream computers. PCIe, unlike previous PC expansion standards, is structured around point-to-point serial links, a pair of which (one in each direction) make up a lane, rather than a shared parallel bus. These lanes are routed by a hub on the mainboard acting as a crossbar switch. This dynamic point-to-point behavior allows more than one pair of devices to communicate with each other at the same time. In contrast, older PC interfaces had all devices permanently wired to the same bus; therefore, only one device could send information at a time. This format also allows “channel grouping”, where multiple lanes are bonded to a single device pair in order to provide higher bandwidth. The number of lanes is “negotiated” during power-up or explicitly during operation. By making the lane count flexible, a single standard can provide for the needs of high-bandwidth cards (e.g. graphics cards, 10 Gigabit Ethernet cards and multiport Gigabit Ethernet cards) while also being economical for less demanding cards.

Unlike preceding PC expansion interface standards, PCIe is a network of point-to-point connections. This removes the need for “arbitrating” the bus or waiting for the bus to be free and allows for full-duplex communication. This means that while standard PCI-X (133 MHz, 64 bit) and PCIe x4 have roughly the same data transfer rate, PCIe x4 will give better performance if multiple device pairs are communicating simultaneously or if communication within a single device pair is bidirectional. Specifications of the format are maintained and developed by the PCI-SIG (PCI Special Interest Group), a group of more than 900 industry-leading companies. In PCIe 1.x, each lane carries approximately 250 MB/s. PCIe 2.0, released in late 2007, adds a Gen2 signalling mode, doubling the rate to about 500 MB/s. On November 18, 2010, the PCI Special Interest Group officially published the finalized PCI Express 3.0 specification to its members to build devices based on this new version of PCI Express, which allows for a Gen3 signalling mode at 1 GB/s per lane.

On November 29, 2011, PCI-SIG announced that it would proceed with PCI Express 4.0, featuring 16 GT/s while remaining on copper technology. Additionally, active and idle power optimizations are to be investigated. Final specifications are expected to be released in 2014/15.

PETSc (“Portable, Extensible Toolkit for Scientific Computation”)

A suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations. It employs the → Message Passing Interface (MPI) standard for all message-passing communication. The current version of PETSc is 3.2, released September 8, 2011. PETSc is intended for use in large-scale application projects, and many ongoing computational science projects are built around the PETSc libraries. Its careful design allows advanced users to have detailed control over the solution process. PETSc includes a large suite of parallel linear and nonlinear equation solvers that are easily used in application codes written in C, C++, Fortran and now Python. PETSc provides many of the mechanisms needed within parallel application code, such as simple parallel matrix and vector assembly routines that allow the overlap of communication and computation. In addition, PETSc includes support for parallel distributed arrays useful for finite difference methods.
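
A minimal sketch of PETSc usage in C (a distributed vector and a parallel norm), assuming a PETSc installation of roughly that vintage; exact calls and error-checking conventions may vary slightly between PETSc releases.

    #include <petscvec.h>

    int main(int argc, char **argv)
    {
        Vec       x;
        PetscReal norm;
        PetscInt  n = 100;                       /* global vector length */

        PetscInitialize(&argc, &argv, NULL, NULL);   /* also initializes MPI */

        /* create a vector distributed across all MPI ranks */
        VecCreate(PETSC_COMM_WORLD, &x);
        VecSetSizes(x, PETSC_DECIDE, n);
        VecSetFromOptions(x);

        VecSet(x, 2.0);               /* set every entry to 2.0        */
        VecNorm(x, NORM_2, &norm);    /* parallel reduction: 2-norm    */

        PetscPrintf(PETSC_COMM_WORLD, "||x||_2 = %g\n", (double)norm);

        VecDestroy(&x);
        PetscFinalize();
        return 0;
    }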

process

→ thread

PTX (“parallel thread execution”)

Parallel Thread Execution (PTX) is a pseudo-assembly language used in NVIDIA's CUDA programming environment. The ‘nvcc’ compiler translates code written in CUDA, a C-like language, into PTX, and the graphics driver contains a compiler which translates the PTX into code which can be run on the processing cores.


RDMA (“remote direct memory access”)

Allows data to move directly from the memory of one computer into that of another without involving either one's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters. RDMA extends the principle of direct memory access (DMA) to transfers across the network. RDMA supports zero-copy networking by enabling the network adapter to transfer data directly to or from application memory, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel with other system operations. When an application performs an RDMA read or write request, the application data is delivered directly to the network, reducing latency and enabling fast message transfer. Common RDMA implementations include → InfiniBand, → iSER, and → iWARP.

RISC (“reduced instruction-set computer”)

A CPU design strategy emphasizing the insight that simplified instructions that “do less” may still provide for higher performance if this simplicity can be utilized to make instructions execute very quickly. Compare → CISC.

ScaLAPACK (“scalable LAPACK”)

A library including a subset of → LAPACK routines redesigned for distributed-memory MIMD (multiple instruction, multiple data) parallel computers. It is currently written in a Single-Program-Multiple-Data style using explicit message passing for interprocessor communication. ScaLAPACK is designed for heterogeneous computing and is portable to any computer that supports → MPI. The fundamental building blocks of the ScaLAPACK library are distributed-memory versions (PBLAS) of the Level 1, 2 and 3 → BLAS, and a set of Basic Linear Algebra Communication Subprograms (BLACS) for communication tasks that arise frequently in parallel linear algebra computations. In the ScaLAPACK routines, all interprocessor communication occurs within the PBLAS and the BLACS. One of the design goals of ScaLAPACK was to have the ScaLAPACK routines resemble their → LAPACK equivalents as much as possible.

service-oriented architecture (SOA)

An approach to building distributed, loosely coupled applications in which functions are separated into distinct services that can be distributed over a network, combined, and reused. See the main article “Windows HPC Server 2008 R2” for further details.

single precision/double precision

→ floating point standard

SMP (“shared memory processing”)

So-called SMP jobs are computer programs with several parts running on the same system and accessing a shared memory region. SMP jobs are usually implemented as → multi-threaded programs. The communication between the single threads can, for example, be realized by the → OpenMP software interface standard, but also in a non-standard way by means of native UNIX interprocess communication mechanisms.

SMP (“symmetric multiprocessing”)

A multiprocessor or multicore computer architecture where two or more identical processors or cores can connect to a single shared main memory in a completely symmetric way, i.e. each part of the main memory has the same distance to each of the cores. Opposite: → NUMA

storage access protocol

Part of the → parallel NFS standard

STREAM

A simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
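
The heart of the benchmark is a handful of simple loops; the “triad” kernel, for example, looks essentially like the following C sketch (the real benchmark adds timing, repetitions and bandwidth reporting, and the array length here is only an illustrative choice).

    #include <stddef.h>

    #define N (5 * 1000 * 1000)   /* array length, chosen larger than the caches */

    static double a[N], b[N], c[N];

    /* STREAM "triad" kernel: a[i] = b[i] + scalar * c[i]
     * Each iteration moves 24 bytes of double data (two loads, one store),
     * so the achieved bandwidth is 24*N bytes divided by the loop's runtime. */
    void triad(double scalar)
    {
        for (size_t i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];
    }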

streaming multiprocessor (SM)

Hardware component within the → Tesla GPU series

subnet manager

Application responsible for configuring the local → InfiniBand subnet and ensuring its continued operation.

superscalar processors

A superscalar CPU architecture implements a form of parallelism called instruction-level parallelism within a single processor. It thereby allows faster CPU throughput than would otherwise be possible at the same clock rate. A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU, such as an arithmetic logic unit, a bit shifter, or a multiplier.

Tesla

NVIDIA's third brand of GPUs, based on high-end GPUs from the G80 series onwards. Tesla is NVIDIA's first dedicated general-purpose GPU. Because of the very high computational power (measured in floating-point operations per second, or FLOPS) compared to recent microprocessors, the Tesla products are intended for the HPC market. The primary function of Tesla products is to aid in simulations, large-scale calculations (especially floating-point calculations), and image generation for professional and scientific fields, with the use of → CUDA. See the main article “NVIDIA GPU Computing” for further details.

thread

A thread of execution is a fork of a computer program into two or more concurrently running tasks. The implementation of threads and processes differs from one operating system to another, but in most cases a thread is contained inside a process. On a single processor, multithreading generally occurs by multitasking: the processor switches between different threads. On a multiprocessor or multi-core system, the threads or tasks will generally run at the same time, with each processor or core running a particular thread or task. Threads are distinguished from processes in that processes are typically independent, while threads exist as subsets of a process. Whereas processes have separate address spaces, threads share their address space, which makes inter-thread communication much easier than classical inter-process communication (IPC).

thread (in CUDA architecture)

Part of the → CUDA programming model


thread block (in CUDA architecture)

Part of the → CUDA programming model

thread processor array (TPA)

Hardware component within the → Tesla GPU series

10 Gigabit Ethernet

The fastest of the Ethernet standards, first published in 2002 as IEEE Std 802.3ae-2002. It defines a version of Ethernet with a nominal data rate of 10 Gbit/s, ten times as fast as Gigabit Ethernet. Over the years several 802.3 standards relating to 10GbE have been published, which were later consolidated into the IEEE 802.3-2005 standard. IEEE 802.3-2005 and the other amendments have been consolidated into IEEE Std 802.3-2008. 10 Gigabit Ethernet supports only full-duplex links, which can be connected by switches. Half-duplex operation and CSMA/CD (carrier sense multiple access with collision detect) are not supported in 10GbE. The 10 Gigabit Ethernet standard encompasses a number of different physical layer (PHY) standards. As of 2008, 10 Gigabit Ethernet was still an emerging technology with only 1 million ports shipped in 2007, and it remained to be seen which of the PHYs would gain widespread commercial acceptance.

warp (in CUDA architecture)

Part of the → CUDA programming model


I am interested in:

☐ compute nodes and other hardware
☐ cluster and workload management
☐ HPC-related services
☐ GPU computing
☐ parallel filesystems and HPC storage
☐ InfiniBand solutions
☐ remote postprocessing and visualization
☐ HPC databases and Big Data Analytics

☐ YES: I would like more information about transtec's HPC solutions.
☐ YES: I would like a transtec HPC specialist to contact me.

Company:
Department:
Name:
Address:
ZIP, City:
Email:

Please fill in the postcard on the right, or send an email to [email protected] for further information.

transtec AG
Waldhoernlestrasse 18
72072 Tuebingen
Germany



transtec Germany, Tel +49 (0) 7071/[email protected]
transtec Switzerland, Tel +41 (0) 44/818 47 [email protected]
transtec United Kingdom, Tel +44 (0) 1295/756 [email protected]
ttec Netherlands, Tel +31 (0) 24 34 34 [email protected]
transtec France, Tel +33 (0) [email protected]

Texts and concept: Dr. Oliver Tennert, Director Technology Management & HPC Solutions | [email protected]
Layout and design: Stefanie Gauger, Graphics & Design | [email protected]

© transtec AG, June 2013. The graphics, diagrams and tables found herein are the intellectual property of transtec AG and may be reproduced or published only with its express permission. No responsibility will be assumed for inaccuracies or omissions. Other names or logos may be trademarks of their respective owners.
