Notes on NUMA Architecture
Intel Software Conference 2014, Brazil
May 2014, Leonardo Borges
Non-Uniform Memory Access (NUMA)
• FSB architecture
- All memory in one location
• Starting with Nehalem
- Memory located in multiple places
• Latency to memory depends on location
• Local memory
- Highest bandwidth, lowest latency
• Remote memory
- Higher latency
[Diagram: Socket 0 and Socket 1, each with local memory, connected via QPI]
Ensure software is NUMA-optimized for best performance
Non-Uniform Memory Access (NUMA)
• Locality matters
- Remote memory access latency is ~1.7x that of local memory
- Local memory bandwidth can be up to 2x greater than remote
[Diagram: Node 0 (CPU0 + DRAM, local memory access) and Node 1 (CPU1 + DRAM, remote memory access), connected via Intel® QPI]
Intel® QPI = Intel® QuickPath Interconnect
• BIOS options:
- NUMA mode (NUMA Enabled): first half of the memory space on Node 0, second half on Node 1
  • Should be the default on Nehalem (!)
- Non-NUMA (NUMA Disabled): even/odd cache lines assigned to Nodes 0/1 (line interleaving)
Local Memory Access Example
• Step 1: CPU0 requests cache line X, which is not present in any CPU0 cache
- CPU0 requests the data from its DRAM
- CPU0 snoops CPU1 to check whether the data is present there
• Step 2:
- DRAM returns the data
- CPU1 returns the snoop response
• Local memory latency is the maximum of the latencies of the two responses
- Nehalem is optimized to keep these key latencies close to each other
[Diagram: CPU0 and CPU1, each with its own DRAM, connected via QPI]
Remote Memory Access Example
• CPU0 requests cache line X, which is not present in any CPU0 cache
- CPU0 requests the data from CPU1
- The request is sent over QPI to CPU1
- CPU1's IMC (integrated memory controller) issues the request to its DRAM
- CPU1 snoops its internal caches
- The data is returned to CPU0 over QPI
• Remote memory latency therefore depends on having a low-latency interconnect
[Diagram: CPU0 and CPU1, each with its own DRAM, connected via QPI]
Non-Uniform Memory Access and Parallel Execution
• Process-parallel execution:
- NUMA friendly: data belongs only to one process
- E.g., MPI
- Affinity pinning maximizes local memory access
- Standard for HPC
• Shared-memory threading:
- More problematic: the same thread may require data from multiple NUMA nodes
- E.g., OpenMP, TBB, explicit threading
- OS-scheduled thread migration can aggravate the situation
- NUMA and non-NUMA configurations should be compared
Operating System Differences
• Operating systems allocate data differently
• Linux*
- malloc reserves the memory
- The physical page is assigned when the data is first touched ("first touch")
- Note: many HPC codes initialize memory with a single 'master' thread!
- A couple of extensions are available via numactl and libnuma, e.g.:
  numactl --interleave=all /bin/program
  numactl --cpunodebind=1 --membind=1 /bin/program
  numactl --hardware
  numa_run_on_node(3) // run the calling thread on node 3
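The same policies can also be set directly from C through libnuma. A minimal sketch, not from the original talk; it assumes libnuma's numa.h is installed and the program is linked with -lnuma:

  /* Sketch (not from the slides): pin the thread and its data to one node */
  #include <stdio.h>
  #include <numa.h>

  int main(void) {
      if (numa_available() < 0) {
          fprintf(stderr, "NUMA is not available on this system\n");
          return 1;
      }
      /* Run the calling thread on node 0 ... */
      numa_run_on_node(0);
      /* ... and allocate 1 MB physically placed on node 0,
         so computation and data stay on the same node. */
      double *buf = numa_alloc_onnode(1 << 20, 0);
      if (buf != NULL) {
          buf[0] = 42.0;              /* touch the memory */
          numa_free(buf, 1 << 20);
      }
      return 0;
  }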
• Microsoft Windows*
- malloc assigns the physical page at allocation time
- This default allocation policy is not NUMA friendly
- Microsoft Windows has NUMA-friendly APIs
  • VirtualAlloc reserves memory (like malloc on Linux*); physical pages are assigned at first use
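As an illustration of those NUMA-friendly APIs, here is a minimal sketch, not from the original talk, using VirtualAllocExNuma to request pages from a preferred node (assumes Windows Vista / Server 2008 or later):

  /* Sketch (not from the slides): allocate with a preferred NUMA node */
  #include <windows.h>

  int main(void) {
      SIZE_T size = 1 << 20;   /* 1 MB */
      void *buf = VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                                     MEM_RESERVE | MEM_COMMIT,
                                     PAGE_READWRITE,
                                     0 /* preferred node */);
      if (buf != NULL) {
          ((char *)buf)[0] = 1;   /* first touch commits a physical page */
          VirtualFree(buf, 0, MEM_RELEASE);
      }
      return 0;
  }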
• For more details:
  http://kernel.org/pub/linux/kernel/people/christoph/pmig/numamemory.pdf
  http://msdn.microsoft.com/en-us/library/aa363804.aspx
Other Ways to Set Process Affinity
• taskset: sets or retrieves a process's CPU affinity
• Intel MPI: use the I_MPI_PIN and I_MPI_PIN_PROCESSOR_LIST environment variables
• KMP_AFFINITY with the Intel Compilers' OpenMP runtime
- compact: binds OpenMP thread n+1 as close as possible to OpenMP thread n
- scatter: distributes the threads as evenly as possible across the entire system; scatter is the opposite of compact
(usage examples follow below)
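Hypothetical command lines for the tools above; the CPU lists, PID, and process count are illustrative values, not from the original talk:

  taskset -c 0-7 ./program                  # bind the program to logical CPUs 0-7
  taskset -p 12345                          # retrieve the affinity of PID 12345
  export KMP_AFFINITY=compact,0,verbose     # pin Intel OpenMP threads
  mpirun -genv I_MPI_PIN 1 -genv I_MPI_PIN_PROCESSOR_LIST 0-7 -np 8 ./program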
NUMA Application-Level Tuning: Shared-Memory Threading Example: TRIAD
• Parallelized the time-consuming hotspot "TRIAD" (e.g., from the STREAM benchmark) using OpenMP:

main() {
  ...
  #pragma omp parallel
  {
    // Parallelized TRIAD loop
    #pragma omp for private(j)
    for (j = 0; j < N; j++)
      a[j] = b[j] + scalar*c[j];
  } // end omp parallel
  ...
} // end main
Parallelizing hotspots may not be sufficient for NUMA
NUMA Shared-Memory Threading Example (Linux*)
Environment variable to pin thread affinity: KMP_AFFINITY=compact,0,verbose

main() {
  ...
  #pragma omp parallel
  {
    // Each thread initializes its own data, pinning
    // the pages to local memory via first touch
    #pragma omp for private(i)
    for (i = 0; i < N; i++) {
      a[i] = 10.0; b[i] = 10.0; c[i] = 10.0;
    }
    ...
    // Parallelized TRIAD loop: the same thread that
    // initialized the data now uses it
    #pragma omp for private(j)
    for (j = 0; j < N; j++)
      a[j] = b[j] + scalar*c[j];
  } // end omp parallel
  ...
} // end main
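To confirm that first touch behaved as intended, page placement can be queried at runtime. A minimal sketch, not from the original talk, using libnuma's numa_move_pages in its query form (passing nodes == NULL migrates nothing and only reports where each page resides); assumes libnuma, built with gcc -fopenmp ... -lnuma:

  /* Sketch (not from the slides): report which node each page landed on */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <numa.h>

  int main(void) {
      if (numa_available() < 0) {
          fprintf(stderr, "NUMA is not available on this system\n");
          return 1;
      }
      long pagesz = sysconf(_SC_PAGESIZE);
      enum { NPAGES = 4 };                 /* sample a few pages */
      double *a = malloc(NPAGES * pagesz);

      /* Parallel first touch: each page lands on the node of
         the thread that touches it first. */
      #pragma omp parallel for
      for (long i = 0; i < NPAGES * pagesz / (long)sizeof(double); i++)
          a[i] = 0.0;

      void *pages[NPAGES];
      int   status[NPAGES];
      for (int i = 0; i < NPAGES; i++)
          pages[i] = (char *)a + (long)i * pagesz;

      /* nodes == NULL: no migration; status[] gets each page's node */
      if (numa_move_pages(0, NPAGES, pages, NULL, status, 0) == 0)
          for (int i = 0; i < NPAGES; i++)
              printf("page %d resides on node %d\n", i, status[i]);

      free(a);
      return 0;
  }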
NUMA Optimization Summary
• NUMA adds complexity to software parallelization and optimization
• Optimize for latency and for bandwidth
- In most cases the goal is to minimize latency: use local memory
- Keep memory near the thread that accesses it
- Keep the thread near the memory it uses
• Rely on quality middleware for CPU affinitization
- Example: Intel Compiler OpenMP or MPI environment variables
• Application-level tuning may be required to minimize the effects of the NUMA first-touch policy