Notes on NUMA Architecture
Intel Software Conference 2014, Brazil
May 2014, Leonardo Borges
Non-Uniform Memory Access (NUMA)
• FSB architecture
- All memory in one location
• Starting with Nehalem
- Memory located in multiple places
• Latency to memory depends on location
• Local memory
- Highest bandwidth, lowest latency
• Remote memory
- Higher latency
[Diagram: Socket 0 and Socket 1, each with local memory, connected via QPI]
Ensure software is NUMA-optimized for best performance
Non-Uniform Memory Access (NUMA)
• Locality matters
- Remote memory access latency is ~1.7x that of local memory
- Local memory bandwidth can be up to 2x greater than remote
[Diagram: Node 0 (CPU0 + DRAM, local memory access) and Node 1 (CPU1 + DRAM, remote memory access), connected via Intel® QPI]
Intel® QPI = Intel® QuickPath Interconnect
• BIOS options:
- NUMA mode (NUMA Enabled): first half of the memory space on Node 0, second half on Node 1
  • Should be the default on Nehalem (!)
- Non-NUMA (NUMA Disabled): even/odd cache lines assigned to Nodes 0/1 (line interleaving)
Local Memory Access Example
• Step 1: CPU0 requests cache line X, which is not present in any CPU0 cache
- CPU0 requests the data from its DRAM
- CPU0 snoops CPU1 to check whether the data is present there
• Step 2:
- DRAM returns the data
- CPU1 returns the snoop response
• Local memory latency is the maximum of the latencies of the two responses
- Nehalem is optimized to keep these key latencies close to each other
[Diagram: CPU0 and CPU1, each with its own DRAM, connected via QPI]
Remote Memory Access Example
• CPU0 requests cache line X, which is not present in any CPU0 cache
- CPU0 requests the data from CPU1
- The request is sent over QPI to CPU1
- CPU1's IMC (integrated memory controller) issues the request to its DRAM
- CPU1 snoops its internal caches
- The data is returned to CPU0 over QPI
• Remote memory latency therefore depends on having a low-latency interconnect
[Diagram: CPU0 and CPU1, each with its own DRAM, connected via QPI]
Non-Uniform Memory Access and Parallel Execution
• Process-parallel execution:
- NUMA friendly: data belongs only to one process
- E.g., MPI
- Affinity pinning maximizes local memory access
- Standard for HPC
• Shared-memory threading:
- More problematic: the same thread may require data from multiple NUMA nodes
- E.g., OpenMP, TBB, explicit threading
- OS-scheduled thread migration can aggravate the situation
- NUMA and non-NUMA configurations should be compared
Operating System Differences
• Operating systems allocate data differently
• Linux*
- malloc reserves the memory
- The physical page is assigned when the data is first touched ("first touch")
- Note: many HPC codes initialize memory with a single 'master' thread!
- A couple of extensions are available via numactl and libnuma, e.g.:
  numactl --interleave=all /bin/program
  numactl --cpunodebind=1 --membind=1 /bin/program
  numactl --hardware
  numa_run_on_node(3) // run the calling thread on node 3
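The same policies can also be set directly from C through libnuma. A minimal sketch, not from the original talk; it assumes libnuma's numa.h is installed and the program is linked with -lnuma:

  /* Sketch (not from the slides): pin the thread and its data to one node */
  #include <stdio.h>
  #include <numa.h>

  int main(void) {
      if (numa_available() < 0) {
          fprintf(stderr, "NUMA is not available on this system\n");
          return 1;
      }
      /* Run the calling thread on node 0 ... */
      numa_run_on_node(0);
      /* ... and allocate 1 MB physically placed on node 0,
         so computation and data stay on the same node. */
      double *buf = numa_alloc_onnode(1 << 20, 0);
      if (buf != NULL) {
          buf[0] = 42.0;              /* touch the memory */
          numa_free(buf, 1 << 20);
      }
      return 0;
  }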
• Microsoft Windows*
- malloc assigns the physical page at allocation time
- This default allocation policy is not NUMA friendly
- Microsoft Windows has NUMA-friendly APIs
  • VirtualAlloc reserves memory (like malloc on Linux*); physical pages are assigned at first use
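As an illustration of those NUMA-friendly APIs, here is a minimal sketch, not from the original talk, using VirtualAllocExNuma to request pages from a preferred node (assumes Windows Vista / Server 2008 or later):

  /* Sketch (not from the slides): allocate with a preferred NUMA node */
  #include <windows.h>

  int main(void) {
      SIZE_T size = 1 << 20;   /* 1 MB */
      void *buf = VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                                     MEM_RESERVE | MEM_COMMIT,
                                     PAGE_READWRITE,
                                     0 /* preferred node */);
      if (buf != NULL) {
          ((char *)buf)[0] = 1;   /* first touch commits a physical page */
          VirtualFree(buf, 0, MEM_RELEASE);
      }
      return 0;
  }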
• For more details:
  http://kernel.org/pub/linux/kernel/people/christoph/pmig/numamemory.pdf
  http://msdn.microsoft.com/en-us/library/aa363804.aspx
Other Ways to Set Process Affinity
• taskset: sets or retrieves a process's CPU affinity
• Intel MPI: use the I_MPI_PIN and I_MPI_PIN_PROCESSOR_LIST environment variables
• KMP_AFFINITY with the Intel Compilers' OpenMP runtime
- compact: binds OpenMP thread n+1 as close as possible to OpenMP thread n
- scatter: distributes the threads as evenly as possible across the entire system; scatter is the opposite of compact
(usage examples follow below)
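Hypothetical command lines for the tools above; the CPU lists, PID, and process count are illustrative values, not from the original talk:

  taskset -c 0-7 ./program                  # bind the program to logical CPUs 0-7
  taskset -p 12345                          # retrieve the affinity of PID 12345
  export KMP_AFFINITY=compact,0,verbose     # pin Intel OpenMP threads
  mpirun -genv I_MPI_PIN 1 -genv I_MPI_PIN_PROCESSOR_LIST 0-7 -np 8 ./program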
NUMA Application-Level Tuning: Shared-Memory Threading Example: TRIAD
• Parallelized the time-consuming hotspot "TRIAD" (e.g., from the STREAM benchmark) using OpenMP:

main() {
  ...
  #pragma omp parallel
  {
    // Parallelized TRIAD loop
    #pragma omp for private(j)
    for (j = 0; j < N; j++)
      a[j] = b[j] + scalar*c[j];
  } // end omp parallel
  ...
} // end main
Parallelizing hotspots may not be sufficient for NUMA
NUMA Shared-Memory Threading Example (Linux*)
Environment variable to pin thread affinity: KMP_AFFINITY=compact,0,verbose

main() {
  ...
  #pragma omp parallel
  {
    // Each thread initializes its own data, pinning
    // the pages to local memory via first touch
    #pragma omp for private(i)
    for (i = 0; i < N; i++) {
      a[i] = 10.0; b[i] = 10.0; c[i] = 10.0;
    }
    ...
    // Parallelized TRIAD loop: the same thread that
    // initialized the data now uses it
    #pragma omp for private(j)
    for (j = 0; j < N; j++)
      a[j] = b[j] + scalar*c[j];
  } // end omp parallel
  ...
} // end main
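To confirm that first touch behaved as intended, page placement can be queried at runtime. A minimal sketch, not from the original talk, using libnuma's numa_move_pages in its query form (passing nodes == NULL migrates nothing and only reports where each page resides); assumes libnuma, built with gcc -fopenmp ... -lnuma:

  /* Sketch (not from the slides): report which node each page landed on */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <numa.h>

  int main(void) {
      if (numa_available() < 0) {
          fprintf(stderr, "NUMA is not available on this system\n");
          return 1;
      }
      long pagesz = sysconf(_SC_PAGESIZE);
      enum { NPAGES = 4 };                 /* sample a few pages */
      double *a = malloc(NPAGES * pagesz);

      /* Parallel first touch: each page lands on the node of
         the thread that touches it first. */
      #pragma omp parallel for
      for (long i = 0; i < NPAGES * pagesz / (long)sizeof(double); i++)
          a[i] = 0.0;

      void *pages[NPAGES];
      int   status[NPAGES];
      for (int i = 0; i < NPAGES; i++)
          pages[i] = (char *)a + (long)i * pagesz;

      /* nodes == NULL: no migration; status[] gets each page's node */
      if (numa_move_pages(0, NPAGES, pages, NULL, status, 0) == 0)
          for (int i = 0; i < NPAGES; i++)
              printf("page %d resides on node %d\n", i, status[i]);

      free(a);
      return 0;
  }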
NUMA Optimization Summary
• NUMA adds complexity to software parallelization and optimization
• Optimize for latency and for bandwidth
- In most cases the goal is to minimize latency: use local memory
- Keep memory near the thread that accesses it
- Keep the thread near the memory it uses
• Rely on quality middleware for CPU affinitization
- Example: Intel Compiler OpenMP or MPI environment variables
• Application-level tuning may be required to minimize the effects of the NUMA first-touch policy