Performance Analysis of Multiple Threads/Cores Using the UltraSPARC T1 (Niagara)
Unique Chips and Systems (UCAS-4)
Dimitris Kaseridis & Lizy K. John
The University of Texas at Austin
Laboratory for Computer Architecture http://lca.ece.utexas.edu
Outline
Brief Description of UltraSPARC T1 Architecture
Analysis Objectives / Methodology
Analysis of Results
  Interference on Shared Resources
  Scaling of Multiprogrammed Workloads
  Scaling of Multithreaded Workloads
UltraSPARC T1 (Niagara)
A multithreaded processor that combines CMP and SMT into a CMT design
8 cores, each handling 4 hardware context threads, for 32 active hardware context threads in total
Simple in-order pipeline per core, with no branch prediction unit
Optimized for multithreaded performance (throughput): high throughput hides memory and pipeline stalls/latencies by scheduling the other available threads, with a zero-cycle thread-switch penalty
UltraSPARC T1 Core Pipeline
A thread group shares the L1 cache, TLBs, execution units, pipeline registers, and data path
In the pipeline diagram, blue areas are replicated per hardware context thread
Objectives
Purpose
  Analysis of the interference of multiple executing threads on the shared resources of Niagara
  Scaling abilities of CMT architectures for both multiprogrammed and multithreaded workloads
Methodology
Interference on Shared Resources (SPEC CPU2000)
Scaling of a Multiprogrammed Workload (SPEC CPU2000)
Scaling of a Multithreaded Workload (SPECjbb2005)
Analysis Objectives / Methodology
Methodology (1/2)
On-chip performance counters for real, accurate results
On Niagara: Solaris 10 tools cpustat and cputrack, plus psrset to bind processes to H/W threads (see the sketch below)
2 performance counters per hardware thread, one of them dedicated solely to the instruction count
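The snippet below is a minimal sketch (not from the paper) of how those Solaris 10 tools could be scripted: psrset builds a processor set out of chosen virtual CPUs and runs a command inside it, while cputrack samples that command's hardware counters. The pic0/pic1 event names and the parsing of psrset's output are assumptions; check `cpustat -h` and the psrset man page on the target machine.

```python
# Sketch only: drive psrset + cputrack from Python. The event names
# ("Instr_cnt", "DC_miss") are placeholders for whatever the T1 counter
# unit actually exposes (see `cpustat -h`), the parsing assumes psrset
# prints "created processor set N", and psrset typically needs root.
import subprocess

def run_bound_with_counters(cmd, vcpus, events="pic0=Instr_cnt,pic1=DC_miss"):
    """Run `cmd` pinned to the given virtual CPUs while recording counters."""
    # Create a processor set containing the chosen hardware threads.
    out = subprocess.run(["psrset", "-c"] + [str(v) for v in vcpus],
                         capture_output=True, text=True, check=True).stdout
    set_id = out.splitlines()[0].split()[-1]   # from "created processor set N"
    try:
        # Execute the benchmark inside the set; cputrack records the
        # per-thread counter values for the traced process.
        subprocess.run(["psrset", "-e", set_id, "cputrack", "-c", events] + cmd,
                       check=True)
    finally:
        # Tear the processor set down again.
        subprocess.run(["psrset", "-d", set_id], check=True)

# e.g. run_bound_with_counters(["./crafty"], [0, 1])  # two strands of core 0
```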
Methodology (2/2)
Niagara has only one FP unit, so only integer benchmarks were considered
The Performance Counter Unit works at the granularity of a single H/W context thread: no way to break down the effects of multiple software threads per H/W thread
Software profiling tools are too invasive
Only pairs of benchmarks were considered, to allow correlating benchmarks with events
Many iterations were run and the average behavior used (see the sketch below)
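As a rough illustration of that averaging step (not the authors' actual harness), the sketch below assumes a hypothetical measure_pair() helper that performs one co-scheduled run and returns per-thread (instructions, cycles), e.g. parsed from cputrack output, and averages the resulting IPC over several iterations.

```python
# Minimal sketch, assuming a hypothetical `measure_pair()` that runs the
# benchmark pair once and returns [(instructions, cycles), ...] with one
# tuple per hardware thread.
from statistics import mean

def average_ipc(measure_pair, iterations=10):
    """Average each thread's IPC over repeated runs of the same pair."""
    samples = []                                  # one IPC list per iteration
    for _ in range(iterations):
        counts = measure_pair()                   # e.g. [(1.2e9, 3.1e9), ...]
        samples.append([instr / cycles for instr, cycles in counts])
    # Transpose so every thread's samples are averaged together.
    return [mean(thread_samples) for thread_samples in zip(*samples)]
```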
Analysis of Results
Interference on shared resources
Scaling of a multiprogrammed workload
Scaling of a multithreaded workload
Interference on Shared Resources
Two modes were considered (see the sketch below):
“Same core” mode executes both benchmarks of a pair on the same core: sharing of pipeline, TLBs, and L1 bandwidth; more like an SMT
“Two cores” mode executes each member of the pair on a different core: sharing of L2 capacity/bandwidth and main memory; more like a CMP
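Purely as an illustration (not taken from the paper), the two placements could be expressed as virtual-CPU bindings, assuming Solaris enumerates the T1's 32 strands sequentially with vCPUs 0-3 on core 0, 4-7 on core 1, and so on (verify with `psrinfo -pv`).

```python
# Illustration only: candidate virtual-CPU bindings for the two modes,
# assuming vCPUs 4*c .. 4*c+3 belong to physical core c on the T1.
def core_vcpus(core):
    return list(range(4 * core, 4 * core + 4))

bindings = {
    # "Same core": both members of the pair on strands of core 0, so they
    # share the pipeline, L1 caches and TLBs (SMT-like interference).
    "same_core": {"bench_a": core_vcpus(0)[0], "bench_b": core_vcpus(0)[1]},
    # "Two cores": one member per core, so they share only L2 capacity and
    # bandwidth, the crossbar and main memory (CMP-like interference).
    "two_cores": {"bench_a": core_vcpus(0)[0], "bench_b": core_vcpus(1)[0]},
}
```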
Interference “same core” (1/2)
On average, a 12% drop in IPC when running in a pair
Crafty followed by twolf showed the worst performance
Eon showed the best behavior, keeping its IPC close to the single-thread case
Interference “same core” (2/2)
DC misses increased 20% on average (15% leaving crafty out); the worst DC misses are for vortex and perlbmk
The pairs with the highest L2 miss ratios are not the ones with an important decrease in IPC: the mcf and eon pairs show more than 70% L2 misses
Overall, a small performance penalty even when sharing the pipeline and L1/L2 bandwidth: the latency-hiding technique is promising
Interference “two cores”
Only the L2 and the shared communication buses are stressed
On average, the L2 misses are almost the same as in the “same core” case: the available resources are underutilized
Multiprogrammed workload with no data sharing
Scaling of Multiprogrammed Workload
Reduced benchmark pair set
Scaling 4 → 8 → 16 threads with the following configurations
Scaling of Multiprogrammed Workload
(Figures: “Same core” and “Mixed mode” thread-to-core configurations)
Scaling of Multiprogrammed “same core”
4 → 8 case: IPC and data cache misses not affected; L2 data misses increased but IPC is not: enough resources, cores running fully occupied, memory latency hiding
8 → 16 case: more cores running the same benchmark, some increase in footprint and requests to L2/main memory; the higher L2 requirements and shared interconnect traffic decreased performance
(Figures: IPC ratio, DC misses ratio, L2 misses ratio)
Scaling of Multiprogrammed “mixed mode”
Mixed mode case: significant decrease in IPC when moving both from 4 → 8 and from 8 → 16 threads
Same behavior as the “same core” case for DC and L2 misses, with an average difference of 1%–2%
Overall, for both modes, Niagara demonstrated that moving from 4 to 16 threads can be done with less than a 40% average performance drop
Both modes showed that significantly increased L1 and L2 misses can be handled while favoring throughput
(Figure: IPC ratio)
Scaling of Multithreaded Workload
Scaled from 1 up to 64 threads (see the sketch below):
1 → 8 threads: mapped 1 thread per core
8 → 16 threads: mapped at most 2 threads per core
16 → 32 threads: up to 4 threads per core
32 → 64 threads: more threads per core than hardware contexts, so swapping is necessary
Configuration used for SPECjbb2005
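The placement arithmetic behind that list can be written down directly; the sketch below is only an illustration of that mapping (not part of the SPECjbb2005 configuration itself): with 8 cores and 4 hardware threads per core, it computes how many threads land on each core and when software context switching becomes unavoidable.

```python
# Illustration of the thread-placement arithmetic: 8 cores, 4 hardware
# context threads per core, 32 hardware threads in total.
import math

CORES, HW_THREADS_PER_CORE = 8, 4

def placement(n_threads):
    per_core = math.ceil(n_threads / CORES)            # threads mapped per core
    needs_swapping = per_core > HW_THREADS_PER_CORE    # true beyond 32 threads
    return per_core, needs_swapping

for n in (1, 8, 16, 32, 64):
    print(n, placement(n))       # 64 -> (8, True): two SW threads per strand
```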
Scaling of Multithreaded Workload
(Figure: SPECjbb2005 score per warehouse, with the GC effect annotated)
Scaling of Multithreaded Workload
Ratios are over the 8-thread case (1 thread per core)
Instruction fetch and DTLB are stressed the most
L1 data and L2 caches managed to scale even for more than 32 threads
(Figure annotation: GC effect)
Scaling of Multithreaded Workload
Scaling of Performance
Linear scaling of almost 0.66 per thread up to 32 threads; 20x speedup at 32 threads (see the worked figure below)
SMT with 2 threads/core gives on average a 1.8x speedup over the CMP configuration (region 1)
SMT with up to 4 threads/core gives a 1.3x and 2.3x speedup over the 2-way SMT per core and the single-threaded CMP, respectively
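As a quick consistency check (not an additional result), the quoted per-thread scaling and the 32-thread speedup agree:

```latex
% ~0.66x marginal speedup per additional thread, extrapolated to 32 threads:
S(32) \approx 0.66 \times 32 \approx 21 \approx 20\times
```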
Conclusions
Demonstration of interference on a real CMT system
The long-latency hiding technique is effective for L1 and L2 misses and therefore could be a good/promising alternative to aggressive speculation
Promising scaling up to 20x for multithreaded workloads with an average of 0.66x per thread
The instruction fetch subsystem and DTLBs are the most contended resources, followed by L2 cache misses
Q/A
Thank you… Questions?
The Laboratory for Computer Architecture web site: http://lca.ece.utexas.edu
Email: [email protected]