Introduction to the Intel® Xeon Phi™ Coprocessor - Intel Software Conference 2013

and/or other countries. *Other names and brands may be claimed as the property of others.

Introduction to the Intel® Xeon Phi™ Coprocessor

Leo Borges (leonardo.borges@intel.com)

Intel - Software and Services Group

iStep-Brazil, August 2013


Introduction

High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software

Intel Xeon Phi Case Studies

Intel Xeon Phi Ecosystem

Conclusions & References

Intel in High-Performance Computing

A long term commitment to the HPC market segment

• Large-Scale Clusters for Test & Optimization
• Tera-Scale Research
• Exa-Scale Labs
• Leading Performance, Energy Efficient
• Platform Building Blocks
• Many Integrated Core Architecture
• Manufacturing Process Technologies
• Dedicated, Renowned Applications Expertise
• Broad Software Tools Portfolio
• Defined HPC Application Platform

HPC Processor Solutions

Multi-Core: Intel® Xeon® processors

• General purpose architecture
• Leadership per-core performance
• FP/core via AVX
• Multi-core performance
• EN: general purpose perf/watt
• EP: max perf/watt, with higher memory BW / freq and QPI, ideal for HPC
• EP 4S: additional compute density
• Xeon EX: additional sockets & big memory

Many-Core: Intel® Xeon Phi™ Coprocessor

• Trades a “big” IA core for multiple lower-performance IA cores, resulting in higher performance for a subset of highly parallel applications

Common Intel Environment: portable code, common tools


Highly parallel and vectorized applications, or with need for higher memory bandwidth, will run even faster on Intel® Xeon Phi™ Coprocessors

Most applications will still run best on multi-core Intel® Xeon® processors

Optimizing code often delivers significant performance gains

RUNNING EXISTING SERIAL SOFTWARE

RUNNING OPTIMIZED SOFTWARE

Big Gains for Selected Applications

Medical imaging and biophysics

Computer Aided Design & Manufacturing

Climate modeling & weather prediction

Financial analyses, trading

Energy & oil exploration

Digital content creation

Evaluating Your Applications for Intel® Xeon Phi™

Can your workload benefit from more memory bandwidth?

Can your workload benefit from large vectors?

Can your workload scale to over 100 threads?

Use Intel® Xeon Phi™ coprocessors for applications that scale with:

• Threads
• Vectors
• Memory Bandwidth
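One way to reason about the "over 100 threads" question is Amdahl's law, which the slides do not name; it is used here only as an illustrative model. The sketch below estimates the ideal speedup for a given parallel fraction of the runtime. The function name and the sample fractions are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the runtime scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Sample parallel fractions are illustrative, not measurements.
    for fraction in (0.90, 0.99, 0.999):
        speedup = amdahl_speedup(fraction, 100)
        print(f"{fraction:.3f} parallel -> {speedup:.1f}x on 100 threads")
```

Under this model, even a 1% serial portion caps the gain on 100 threads at roughly half the thread count, which is why the serial parts tend to matter so much on a many-core device.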

Introduction

Intel Many Integrated Core (MIC, pronounced “Mike”)

Product Family/Architecture for Highly Parallel Applications

• Based on large number of smaller, low power, Intel Arch. Cores

• 512-bit wide vector engine

• Complements the Intel Xeon processor product line

• Provides breakthrough performance for highly parallel apps

– Familiar x86 programming model
– Same source code supports both Intel Xeon processor & Intel Xeon Phi coprocessor
– Initially a coprocessor with PCI Express form factor

First products announced at SC12: Code named Knights Corner (KNC)

• Up to 61 cores, 4 threads per core

• Up to 16GB GDDR5 memory (up to 352 GB/s)

• 225-300W (Cooling: Both passive & active SKUs)

• x16 PCIe Form-Factor (requires IA host)
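The figures above translate directly into the amount of parallelism a program has to expose. A minimal sketch of that arithmetic, using only the numbers on the slide (61 cores, 4 threads per core, 512-bit vectors); the function names are hypothetical:

```python
VECTOR_WIDTH_BITS = 512   # from the slide: 512-bit wide vector engine
MAX_CORES = 61            # from the slide: up to 61 cores
THREADS_PER_CORE = 4      # from the slide: 4 threads per core


def vector_lanes(element_bits: int) -> int:
    """Number of elements of the given width that fit in one vector register."""
    return VECTOR_WIDTH_BITS // element_bits


def hardware_threads(cores: int = MAX_CORES,
                     threads_per_core: int = THREADS_PER_CORE) -> int:
    """Total hardware threads exposed by the coprocessor."""
    return cores * threads_per_core


if __name__ == "__main__":
    print("hardware threads:", hardware_threads())
    print("double-precision lanes:", vector_lanes(64))
    print("single-precision lanes:", vector_lanes(32))
```

A code that wants to keep every lane of every thread busy therefore needs thousands of independent double-precision operations in flight, which is the practical meaning of "highly parallel" here.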

Intel® Xeon Phi™ Product Family, Based on the Intel MIC Architecture

Each Xeon Phi can be addressed as an Individual Node in the Cluster

6 to 16 GB GDDR5 memory

INTEL CONFIDENTIAL


Intel® Xeon Phi™ Coprocessors

• 3 Family: Outstanding Parallel Computing Solution, Performance/$ leadership (3120P, 3120A): 6GB GDDR5, 240 GB/s, >1 TFlops DP
• 5 Family: Optimized for High Density Environments, Performance/watt leadership: 8GB GDDR5, >300 GB/s, >1 TFlops DP
• 7 Family: Highest Level of Features, Performance leadership (7120P, 7120X): 16GB GDDR5, 352 GB/s, >1.2 TFlops DP
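The ">1 TFlops DP" and ">1.2 TFlops DP" figures can be sanity-checked with a peak-rate estimate. The sketch below is a rough calculation, not Intel's published method: the core counts and clock frequencies are assumptions that do not appear on the slides, and it assumes one 512-bit fused multiply-add per core per cycle (8 double-precision lanes, 2 flops each). Only the bandwidth numbers come from the slide.

```python
from dataclasses import dataclass

DP_LANES = 512 // 64   # 8 double-precision lanes in a 512-bit vector
FLOPS_PER_FMA = 2      # a fused multiply-add counts as two flops


@dataclass(frozen=True)
class Card:
    name: str
    cores: int            # assumption, not stated on the slide
    clock_ghz: float      # assumption, not stated on the slide
    bandwidth_gbs: float  # from the slide

    def peak_dp_gflops(self) -> float:
        """Peak double-precision rate: cores x GHz x lanes x flops per FMA."""
        return self.cores * self.clock_ghz * DP_LANES * FLOPS_PER_FMA

    def bytes_per_flop(self) -> float:
        """Memory bandwidth available per peak flop (machine balance)."""
        return self.bandwidth_gbs / self.peak_dp_gflops()


CARDS = (
    Card("3120P", cores=57, clock_ghz=1.100, bandwidth_gbs=240.0),
    Card("7120P", cores=61, clock_ghz=1.238, bandwidth_gbs=352.0),
)

if __name__ == "__main__":
    for card in CARDS:
        print(f"{card.name}: {card.peak_dp_gflops():.0f} GFlops DP, "
              f"{card.bytes_per_flop():.2f} bytes/flop")
```

Under these assumptions both cards land just above the thresholds quoted on the slide, and both offer well under one byte of memory traffic per peak flop, so kernels with little data reuse will be limited by bandwidth rather than by the vector units.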


Performance Considerations

Application Characterization

Based on memory access and flops required

• Temporal/spatial locality of data

• Bandwidth Requirement

[Chart: workloads placed on a spectrum from bandwidth limited to core limited (axis label: 6 GB/s). Workloads shown: Stream-triad, BLAS1 & BLAS, Linpack, sparse matrix-vector, SPECfp2000, reservoir simulation, Kirchhoff migration, fluid dynamics, ocean models, FFT, option pricing, molecular dynamics. Segments shown: Scientific, Oil & Gas, Mil HPC. Color key: Y = math kernel, B = application, W = segment.]
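The bandwidth-limited versus core-limited split can be made concrete with a roofline-style estimate. The slides do not name this model, so treat it as an illustrative sketch: attainable performance is the smaller of the peak compute rate and bandwidth times arithmetic intensity. The 352 GB/s figure is from the product slide; the 1200 GFlops peak is a round stand-in for ">1.2 TFlops DP"; the intensity values are idealized assumptions.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


def is_bandwidth_limited(peak_gflops: float, bandwidth_gbs: float,
                         flops_per_byte: float) -> bool:
    """True when memory traffic, not the cores, sets the bound."""
    return bandwidth_gbs * flops_per_byte < peak_gflops


def dgemm_intensity(n: int) -> float:
    """Idealized DGEMM intensity: 2n^3 flops over three n x n double matrices moved once."""
    return (2 * n ** 3) / (3 * n * n * 8)


# STREAM triad, a[i] = b[i] + s * c[i]: 2 flops per 3 doubles (24 bytes) moved.
TRIAD_INTENSITY = 2 / 24

PEAK_GFLOPS = 1200.0    # round stand-in for ">1.2 TFlops DP"
BANDWIDTH_GBS = 352.0   # from the product slide

if __name__ == "__main__":
    for label, intensity in (("STREAM triad", TRIAD_INTENSITY),
                             ("DGEMM n=10752", dgemm_intensity(10752))):
        bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
        kind = ("bandwidth" if is_bandwidth_limited(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
                else "core")
        print(f"{label}: at most {bound:.1f} GFlops ({kind} limited)")
```

In this model the triad kernel sits at the bandwidth-limited end and a large matrix multiply at the core-limited end, matching the two extremes of the chart.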

Synthetic Benchmarks: Intel® Xeon Phi™ Coprocessor and Intel® MKL

[Charts: STREAM Triad (GB/s), SMP Linpack (GF/s), DGEMM (GF/s), and SGEMM (GF/s), comparing 2S Intel® Xeon® with Intel Xeon Phi (ECC on); higher is better. Surviving callouts: up to 2.4X, up to 2.5X, up to 2.2X; 84% efficient, 83% efficient, 75% efficient.]

Notes:

1. Intel® Xeon® Processor E5-2680 used for all: SGEMM Matrix = 12800 x 12800, DGEMM Matrix 10752 x 10752, SMP Linpack Matrix 26000 x 26000

2. Intel® Xeon Phi™ coprocessor SE10P (ECC on) with “Gold” SW stack: SGEMM Matrix = 12800 x 12800, DGEMM Matrix 12800 x 12800, SMP Linpack Matrix 26872 x 28672

3. Average single-node results from measurements across a set of nodes from the TACC+ Stampede* Cluster

+ Texas Advanced Computing Center (TACC) at the University of Texas at Austin.

++ Measured on the TACC+ Stampede Cluster

Coprocessor results: Benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native)

Performance per Watt: Intel® Xeon Phi™ Coprocessor vs. 2S Intel® Xeon® processor (Intel MKL)

[Chart: performance per watt of 1 Intel® Xeon Phi™ coprocessor relative to a 2-socket Intel® Xeon® processor baseline, for SMP Linpack, DGEMM, and SGEMM. Surviving data labels: 4.63 and 4.81.]

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Source: Intel Measured results as of October 26, 2012 Configuration Details: Please reference slide speaker notes.

For more information go to http://www.intel.com/performance

Notes:

1. 2 X Intel® Xeon® Processor E5-2670 (2.6GHz, 8C, 115W)

2. Intel® Xeon Phi™ coprocessor 5110P (ECC on) with Gold RC SW stack (Coprocessor power only)



Native, Offload and Variations


Wide Spectrum of Execution Models

Range of Models to Meet Application Needs

From multicore centric (Intel® Xeon® processors) to many-core centric (Intel® Many Integrated Core coprocessors):

• Multi-core-hosted: general purpose serial and parallel computing
• Offload: codes with highly-parallel phases
• Symmetric: codes with balanced needs
• Many-core-hosted: highly-parallel codes

[Diagram: for each model, where Main( ), Foo( ), and MPI_*() run, on the multicore host, on the many-core coprocessor, or on both.]

The Intel Manycore Platform Software Stack (MPSS) provides Linux on the coprocessor

[Diagram: the host processor and the Intel® Xeon Phi™ coprocessor each run their own Linux* OS and are connected over the PCI-E bus.]

• Host processor, system-level code: Intel® Xeon Phi™ Coprocessor support libraries, tools, and drivers
• Intel® Xeon Phi™ coprocessor, system-level code: Intel® Xeon Phi™ Coprocessor communication and application-launch support
• User-level code runs on both sides

Runs either as an accelerator for offloaded host computation…

[Diagram: host and coprocessor, each with Linux* OS, drivers, and launch support, connected over the PCI-E bus.]

• Host-side offload application: user code; offload libraries, user-level driver, user-accessible APIs and libraries
• Target-side offload application: user code; offload libraries, user-accessible APIs and libraries

Advantages

• More memory available
• Better file access
• Host better on serial code
• Better uses resources

…Or runs as a native or MPI* compute node via IP or OFED

[Diagram: host and coprocessor, each with Linux* OS, drivers, and launch support, connected over the PCI-E bus and the IB fabric. A virtual terminal session reaches the coprocessor through an ssh or telnet connection to the coprocessor IP address.]

• Target-side “native” application: user code; standard OS libraries plus any 3rd-party or Intel libraries

Advantages

• Simpler model
• No directives
• Easier port
• Good kernel test

Use if

• Not serial
• Modest memory
• Complex code

Flexible: Enables Multiple Programming Models

• Coprocessor only: homogeneous network of many-core CPUs
• Symmetric: heterogeneous network of homogeneous CPUs
• Host+Offload: homogeneous network of heterogeneous nodes

[Diagram: CPU and MIC pairs illustrating each model.]


Comprehensive set of SW tools for Xeon and Xeon Phi Programming

Code Analysis: Advisor XE, VTune Amplifier XE, Inspector XE, Trace Analyzer

Programming Models: Intel Cilk Plus, Threading Building Blocks, OpenMP, OpenCL, MPI, Offload/Native/MYO

Libraries & Compilers: Math Kernel Library, Integrated Performance Primitives, Intel Compilers


Options for Thread Parallelism

Intel® Math Kernel Library

OpenMP*

Intel® Threading Building Blocks

Intel® Cilk™ Plus

OpenCL*

Pthreads* and other threading libraries

(Listed from ease of use / code maintainability at the top to programmer control at the bottom.)

Choice of unified programming to target Intel® Xeon® and Intel® Xeon Phi™ Architecture!


Parallelizing for High Performance

Example: A Two-Step Process with SAXPY

STARTING POINT (current performance): Unoptimized serial code running on multi-core Intel® Xeon® processors. 67.097 seconds.

STEP 1. OPTIMIZE CODE: Parallelize and vectorize code and continue to run on multi-core Intel Xeon processors. 0.46 seconds.

STEP 2. USE COPROCESSORS: Run all or part of the optimized code on Intel® Xeon Phi™ coprocessors. 0.197 seconds, 2.3X faster than step 1 and 340X faster than the starting point.

SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2012

The Following Performance Results are Based on Already Optimized Code
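For readers unfamiliar with the kernel, SAXPY computes y = a*x + y over two vectors. The sketch below is a plain Python reference of that operation, not the code Intel measured, followed by the speedup arithmetic implied by the three timings on the slide. The function and constant names are hypothetical.

```python
def saxpy(a: float, x: list, y: list) -> list:
    """Return a*x + y element-wise (the SAXPY kernel) as a new list."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


def speedup(before_seconds: float, after_seconds: float) -> float:
    """How many times faster the 'after' run is than the 'before' run."""
    return before_seconds / after_seconds


# Timings quoted on the slide, in seconds.
SERIAL_S = 67.097
OPTIMIZED_XEON_S = 0.46
XEON_PHI_S = 0.197

if __name__ == "__main__":
    print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
    print(f"step 1 vs. start:  {speedup(SERIAL_S, OPTIMIZED_XEON_S):.1f}x")
    print(f"step 2 vs. step 1: {speedup(OPTIMIZED_XEON_S, XEON_PHI_S):.1f}x")
    print(f"step 2 vs. start:  {speedup(SERIAL_S, XEON_PHI_S):.1f}x")
```

By these timings, most of the overall gain comes from step 1, parallelizing and vectorizing on the host, which works out to roughly 146X; the coprocessor then adds the further 2.3X.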


• Application: Hybrid Monte-Carlo program that simulates lattice QCD with dynamical Wilson fermions. It is one of the main production programs of the QCDSF collaboration (DEISA) and beyond used for quark simulation.

• Status: Many optimizations already in released version; more optimizations and alternative offload model version in development

• Demonstrated Results:

- No source code changes

- Recompiled, selected run-time parameters to get maximum performance

Performance Proof-Point: Government and Academic Research

“The performance improvement for BQCD using the Intel Xeon Phi coprocessor was reached in record time, requiring only recompilation. We are confident that larger speed-ups can be obtained with modest modifications of the code.”

Prof. Dr. Tilo Wettig

Principal Investigator of the QPACE project

[Chart: BQCD scalability in Gflops/sec (higher is better) at 1, 2, 4, and 8 on the horizontal axis, comparing:
• 2S Intel® Xeon® Processor E5-2670
• Intel® Xeon Phi™ coprocessor, native (pre-production HW/SW)
• 2S Intel Xeon E5-2670 + Intel® Xeon Phi™ coprocessor, symmetric (pre-production HW/SW)]

SOURCE: INTEL MEASURED MARCH’13


• Application: Seismic imaging technique used to obtain a subsurface depth image from input seismic data

• Status: See presentation Rice O&G HPC workshop, http://rice2013.og-hpc.org/technical-program

• Execution Model: Fully Hybrid MPI+OpenMP using symmetric mode

– Highly scalable on cluster

• Code Optimization:

– Minimal source code changes for dynamic load balancing

Performance Proof-Point: Energy Industry

CGG: WAVE EQUATION MIGRATION (WEM)

[Chart: speedup (higher is better), comparing:
• 2S Intel® Xeon® processor E5-2670, 4 MPI / 4 OMP
• Intel® Xeon Phi™ coprocessor (pre-production HW/SW), 12 MPI / 20 OMP
• 2S Intel Xeon processor E5-2670 (4/4) + Intel® Xeon Phi™ coprocessor (12/20) (pre-production HW/SW)
• 2S Intel Xeon processor E5-2670 (4/4) + 2x Intel® Xeon Phi™ coprocessor (12/20 + 12/20) (pre-production HW/SW)]

SOURCE: ARSLAN ET AL., CGG 2013, MARCH’13
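The "MPI / OMP" pairs in the legend give the number of MPI ranks and the number of OpenMP threads per rank, so their product is the total thread count on each device. A small sketch of that bookkeeping follows. It assumes 8 cores per E5-2670 socket (as stated in the deck's benchmark notes) and 4 hardware threads per coprocessor core; that the coprocessor run leaves one of 61 cores unused is an inference, not something the slide states.

```python
def total_threads(mpi_ranks: int, omp_threads_per_rank: int) -> int:
    """Total threads in a hybrid MPI+OpenMP run."""
    return mpi_ranks * omp_threads_per_rank


HOST_CORES = 2 * 8            # two sockets, 8 cores each (E5-2670)
PHI_THREADS_PER_CORE = 4      # from the introduction slide

host_threads = total_threads(4, 4)     # "4 MPI / 4 OMP"
phi_threads = total_threads(12, 20)    # "12 MPI / 20 OMP"
phi_cores_used = phi_threads // PHI_THREADS_PER_CORE

if __name__ == "__main__":
    print(f"host: {host_threads} threads on {HOST_CORES} cores")
    print(f"coprocessor: {phi_threads} threads, about {phi_cores_used} cores")
```

The host layout puts one thread on each core, while the coprocessor layout fills four threads per core, which is consistent with the symmetric model running the same hybrid code with different rank and thread counts on each side.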


• Application: Monte Carlo algorithms are used to evaluate complex instruments, portfolios, and investments. Performance depends on raw computational power and the performance of exp2()

• Status: Case Study available

• Highlights: Dramatic performance scaling for both single-precision and double-precision calculations

• Demonstrated Results:

- Intel® Xeon Phi™ coprocessor fast exp2() and FMA instructions deliver high performance, high accuracy for single precision computations

- Compiler based loop unrolling delivers high performance

- Cache blocking further optimizes cache utilization, reduces cache misses, and makes outer loop vectorization possible

• Read the Case Study: software.intel.com/en-us/articles/case-study-achieving-high-performance-on-monte-carlo-european-option-on-intel-xeon-phi

Performance Proof-Point: Financial Services

MONTE CARLO EUROPEAN OPTIONS

[Charts: single precision and double precision, comparing:
• 2S Intel® Xeon® processor E5-2670
• 2S Intel Xeon processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW)]

SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2013


• Application: Weather Research and Forecasting (WRF)

• Status: WRF V3.5 was released 4/18/13

• Code Optimization:

– Approximately two dozen files with less than 2,000 lines of code were modified (out of approximately 700,000 lines of code in about 800 files, all Fortran standard compliant)

– Most modifications improved performance for both the host and the co-processors

• Performance Measurements: Pre release of WRF 3.5 (V3.5Pre) and NCAR supported CONUS2.5KM benchmark (a high resolution weather forecast)

• Acknowledgments: There were many contributors to these results, including the National Renewable Energy Laboratory and The Weather Channel Companies

WEATHER RESEARCH AND FORECASTING (WRF)

[Chart comparing:
• 2S Intel® Xeon® processor E5-2670 with eight-node cluster configuration
• 2S Intel® Xeon® processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW) with eight-node cluster configuration]

SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2013


• Application: Sandia National Laboratories' best approximation to an unstructured implicit finite element or finite volume application in fewer than 8000 lines of code

• Status: available at http://software.sandia.gov/trac/mantevo/browser/trunk/packages

• Demonstrated Results:

- Porting was easy using OpenMP

- Substituting an Intel MKL routine for the sparse matrix-vector product accelerated performance and will simplify future optimization

- The Intel MPI Library enables rapid performance improvement when adding an Intel® Xeon Phi™ coprocessor

• Read the Case Study: software.intel.com/en-us/articles/running-minife-on-intel-xeon-phi-coprocessors

SANDIA MANTEVO miniFE

[Chart comparing:
• 2S Intel® Xeon® processor E5-2670
• 2S Intel Xeon processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW)]

SOURCE: INTEL MEASURED RESULTS AS OF MARCH, 2012

“The programming models available for the Intel MIC Architecture are open-standard and portable between traditional processors and Intel Xeon Phi coprocessors. This should allow us to leverage code development across multiple platforms.”

James A. Ang, Ph.D., Extreme-scale Computing, Sandia National Laboratories

DEMONSTRATED PERFORMANCE BENEFITS: Intel® Xeon Phi™ Coprocessor

[Chart of speed-ups for:
• Seismic: Acceleware 8th Order Isotropic Variable Velocity (note 2)
• Finite Element Analysis: Sandia National Labs MiniFE (note 1)
• China Oil & Gas Geoeast Pre-stack Time Migration (note 3)]

Notes:

1. 8 node cluster, each node with 2S Xeon* (comparison is cluster performance with and without 1 Xeon Phi* per node) (Hetero)

2. 2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW & Application running 100% on coprocessor unless otherwise noted)

3. 2S Xeon* vs. 2S Xeon* + 2 Xeon Phi* (offload)

DEMONSTRATED PERFORMANCE BENEFITS: Intel® Xeon Phi™ Coprocessor

[Chart of speed-ups for:
• Finance: Monte Carlo SP (note 3), 10.75X
• Finance: Black-Scholes SP (note 3), up to 7X
• Physics: Jefferson Lab Lattice QCD
• Embree Ray Tracing: Intel Labs Ray Tracing (note 2)]

Notes:

1. 2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW & Application running 100% on coprocessor unless otherwise noted)

2. Intel Measured Oct. 2012

3. Includes additional FLOPS from transcendental function unit


• System: TACC Stampede is a 10 petaflop supercomputer, one of the largest computing systems in the world for open science research. It became operational on January 7, 2013

• Status: In Service

• Workloads: Runs hundreds of applications for thousands of users around the world

• Performance:

– More than 7 petaflops using Intel® Xeon Phi™ coprocessors¹

– More than 2 petaflops using the Intel® Xeon® processor E5 family¹

• More Information:

– SC12 interview: insidehpc.com/2012/12/06/video-intel-xeon-phi-powers-7-tacc-stampede-super/

– TACC HPC systems overview: www.tacc.utexas.edu/resources/hpc

Implementation Proof-Point: Government and Academic Research

Texas Advanced Computing Center (TACC)

1 http://www.tacc.utexas.edu/resources/hpc/stampede


System: Located in Southwest China, it contains 16,000 nodes composing the world's largest (public) installation of Intel Ivy Bridge and Xeon Phi processors. Each cluster node is formed with

• 2 twelve-core Intel® Xeon® Ivy Bridge CPUs @ 2.2GHz
• 3 Intel® Xeon Phi™ cards, each with 57 cores @ 1.1GHz

Performance: Theoretical peak of 54.9 Pflop/s

• 6.8 Pflop/s from 32,000 Xeon Ivy Bridge sockets
• 48.1 Pflop/s from 48,000 Xeon Phi cards
• for a total of 3,120,000 cores.

30.65 Pflop/s sustained Linpack.
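The peak figures above can be reproduced from the per-node numbers. The sketch below is a back-of-the-envelope check under stated assumptions: 12 cores per Xeon socket (the value that makes the quoted 3,120,000-core total add up), 8 double-precision flops per cycle per Xeon core with AVX, and 16 per Xeon Phi core (8 lanes with fused multiply-add). The flops-per-cycle values are not on the slide.

```python
XEON_SOCKETS = 32_000
XEON_CORES_PER_SOCKET = 12    # assumption implied by the 3,120,000-core total
XEON_GHZ = 2.2
XEON_DP_FLOPS_PER_CYCLE = 8   # assumption: 256-bit AVX add plus multiply

PHI_CARDS = 48_000
PHI_CORES_PER_CARD = 57
PHI_GHZ = 1.1
PHI_DP_FLOPS_PER_CYCLE = 16   # assumption: 8 lanes x fused multiply-add

LINPACK_PFLOPS = 30.65        # sustained figure quoted on the slide


def peak_pflops(units: int, cores: int, ghz: float, flops_per_cycle: int) -> float:
    """Peak rate in Pflop/s: units x cores x GHz x flops/cycle gives Gflop/s."""
    return units * cores * ghz * flops_per_cycle / 1e6


xeon_pflops = peak_pflops(XEON_SOCKETS, XEON_CORES_PER_SOCKET,
                          XEON_GHZ, XEON_DP_FLOPS_PER_CYCLE)
phi_pflops = peak_pflops(PHI_CARDS, PHI_CORES_PER_CARD,
                         PHI_GHZ, PHI_DP_FLOPS_PER_CYCLE)
total_pflops = xeon_pflops + phi_pflops
total_cores = (XEON_SOCKETS * XEON_CORES_PER_SOCKET
               + PHI_CARDS * PHI_CORES_PER_CARD)
linpack_efficiency = LINPACK_PFLOPS / total_pflops

if __name__ == "__main__":
    print(f"Xeon: {xeon_pflops:.2f} Pflop/s, Xeon Phi: {phi_pflops:.2f} Pflop/s")
    print(f"total: {total_pflops:.2f} Pflop/s across {total_cores:,} cores")
    print(f"Linpack / peak: {linpack_efficiency:.1%}")
```

With these assumptions the coprocessors account for nearly 90% of the theoretical peak, and the quoted sustained Linpack result is a little over half of that peak.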

More Information: "Visit to the National University for Defense Technology Changsha, China." Jack Dongarra, University of Tennessee, and Oak Ridge National Laboratory. June 2013. www.netlib.org/utk/people/JackDongarra/PAPERS/tianhe-2-dongarra-report.pdf

Tianhe-2 System: #1 June 2013 Top500 List

Other brands and names are the property of their respective owners.

A Growing Software Ecosystem: Developing today on Intel® Xeon Phi™ coprocessors

Shown at SC’12, November 2012


Conclusions

Intel® Xeon Phi™ coprocessor advantages:

• Comparable performance potential to other accelerators

• Faster time to solution due to reduced development effort

• Better investment protection with a single code base for processors and coprocessors

Flexible and Wide range of programming models: from pure Native to Offloaded – and all variants between

All with the familiar Intel development environment


One Stop Shop for:

Tools & Software Downloads

Getting Started Development Guides

Video Workshops, Tutorials, & Events

Code Samples & Case Studies

Articles, Forums, & Blogs

Associated Product Links

Intel® Xeon Phi™ Coprocessor Developer Site: http://software.intel.com/mic-developer

Thank you.


INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.


Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice
