Architectural Characterization of an IBM RS6000 S80 Server Running TPC-W Workloads
Post on 06-Jan-2016
Lei Yang & Shiliang Hu
Computer Sciences Department, University of Wisconsin - Madison
Outline
• TPC-W Benchmarks in Java
• IBM RS6000 S80 Enterprise Server
• Hardware Counters in S80
• Experiment Results
• Problems and Future Work
• Conclusions
TPC-W Benchmark
• TPC-W is the TPC Council's newest benchmark for transactional web environments (e-commerce). It models an online bookstore similar to www.amazon.com, with three workload mixes:
– Browsing: 95% browsing, 5% transactions
– Shopping: 80% browsing, 20% transactions
– Ordering: 50% browsing, 50% transactions
• Transactional web environments:
– Web serving of static and dynamic content
– Online transaction processing (OLTP)
– Some decision support (DSS)
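
The three mixes differ only in their browsing-to-transaction ratio. As a minimal sketch of what that ratio means for a load generator (the names `MIXES`, `expected_counts` and `sample_interactions` are hypothetical and are not taken from the Java TPC-W implementation used in this work), the split can be expressed as:

```python
import random

# Percentage of browsing interactions in each TPC-W mix, as listed above.
# The remainder of each mix is transactions.
MIXES = {"browsing": 95, "shopping": 80, "ordering": 50}


def expected_counts(mix, n):
    """Return the expected browse/transaction split for n interactions."""
    browse = n * MIXES[mix] // 100
    return {"browse": browse, "transaction": n - browse}


def sample_interactions(mix, n, seed=0):
    """Draw n interaction categories at random according to the mix ratio."""
    rng = random.Random(seed)  # seeded so that a run is repeatable
    pct = MIXES[mix]
    return ["browse" if rng.randrange(100) < pct else "transaction"
            for _ in range(n)]


if __name__ == "__main__":
    for name in MIXES:
        print(name, expected_counts(name, 1000))
```

A real emulated browser chooses among many distinct page types; this sketch collapses them into the two categories named on the slide.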
IBM RS6000 S80 Enterprise Server
• 6 RS64-III Pulsar processors (451MHz)
– 4-issue in-order superscalar microprocessor with on-chip 128KB L1 I-cache, 128KB L1 D-cache and 8MB L2 cache.
– No branch prediction; aggressive early branch resolution.
– Coarse-grain 2-context multithreading.
• SMP system with a snooping-bus inter-processor connection.
• 8GB main memory, huge disk volumes and very high bandwidth I/O systems.
System Configuration:
[Figure: system configuration diagram. Hardware: RS64-III processors, each with a Performance Monitor and a 32-bit control word, connected by a snooping bus. System software: AIX kernel with a kernel extension. Workload: an Emulated Browser on a Java Virtual Machine, linked by http to Sun Java Web Server 2.0 (Java servlets on a Java Virtual Machine), linked by JDBC to the DB2 DBMS processes.]
Hardware Counters in S80
• 3 levels of objects can be counted, each with its own counting context:
- System-level counting: whole-system context
- Process / process group: process-level context
- Individual thread: thread-level context
• 3 major components:
- 8 built-in hardware counters in each RS64-III processor
- Kernel extension to AIX 4.3
- Performance Monitor API in the next release of AIX
• Some problems with the current version of the PM API:
- Cannot count for an individual processor.
- Some listed events are not available.
Hardware Counters in S80: Countable Events
• Processor events
- Execution cycles and the number of instructions executed.
• Instruction mix events
- Pipeline M, S, B and S instructions executed.
• Branch events
- Conditional branch T/NT events, unconditional branches, zero-cycle branches.
• Address translation events
- TLB/SLB and ERAT/IERAT miss and duration events.
• Cache events
- Cache misses and latencies for each of the L1 I-cache, L1 D-cache and L2 cache.
• Bus and multi-processor bus snooping events
- Bus utilization and multi-processor bus snooping events.
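
The results that follow report ratios of these raw events, such as CPI and cache misses per instruction. A minimal sketch of that arithmetic is shown here; the dictionary keys are hypothetical labels chosen for illustration and are not event names from the AIX Performance Monitor API:

```python
def derived_metrics(counts):
    """Turn raw event totals into per-instruction ratios.

    `counts` maps hypothetical event labels to totals gathered over one
    measurement interval.
    """
    instructions = counts["instructions"]
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return {
        # Cycles per instruction: execution cycles / instructions executed.
        "cpi": counts["cycles"] / instructions,
        # Miss counts normalized by the number of instructions executed.
        "l1i_miss_per_instr": counts["l1i_misses"] / instructions,
        "l1d_miss_per_instr": counts["l1d_misses"] / instructions,
        "l2_miss_per_instr": counts["l2_misses"] / instructions,
    }


if __name__ == "__main__":
    # Made-up totals, used only to exercise the function.
    sample = {"cycles": 1500, "instructions": 1000,
              "l1i_misses": 10, "l1d_misses": 12, "l2_misses": 2}
    print(derived_metrics(sample))
```

Normalizing by instruction count lets workloads that execute very different numbers of instructions be compared on one scale.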
Results: CPI for RBE, Java Web Server and DB2
[Figure: bar chart of CPI (y-axis, 0 to 1.6) for RBE, JWS and DB2 under the Browsing, Shopping and Ordering mixes.]
Results: CPU Cycle Counts
[Figure: three panels (Browsing Mix, Shopping Mix, Ordering Mix) plotting cycle counts (y-axis, 0 to 2 x 10^9) against time in seconds (x-axis, 0 to 200) for DB2, JWS and RBE.]
Results: Instruction Dispatch
• Browsing Mix
[Figure: three panels (DB2, JWS, RBE) plotting dispatch percentage (%, 0 to 100) against time in seconds (0 to 200). Legend: 0 Instr, 1 Instr, 2 Instr, 3 Instr, 4 Instr.]
Results: Instruction Dispatch
• Shopping Mix
[Figure: three panels (DB2, JWS, RBE) plotting dispatch percentage (%, 0 to 100) against time in seconds (0 to 200). Legend: 0 Instr, 1 Instr, 2 Instr, 3 Instr, 4 Instr.]
Results: Instruction Dispatch
• Ordering Mix
[Figure: three panels (DB2, JWS, RBE) plotting dispatch percentage (%, 0 to 100) against time in seconds (0 to 200). Legend: 0 Instr, 1 Instr, 2 Instr, 3 Instr, 4 Instr.]
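
The dispatch plots break activity down by how many instructions (0 to 4) were dispatched. A minimal sketch of how such a breakdown could be summarized, assuming the input is a list of cycle counts indexed by dispatch width (this input layout and the function name `dispatch_profile` are assumptions, and the numbers in the usage example are invented, not measured):

```python
def dispatch_profile(cycles_by_width):
    """Summarize a dispatch histogram.

    cycles_by_width[w] is the number of cycles in which exactly w
    instructions were dispatched (w = 0..4 on a 4-issue machine).
    Returns (percentages, mean_width).
    """
    total = sum(cycles_by_width)
    if total == 0:
        raise ValueError("no cycles recorded")
    # Share of cycles spent at each dispatch width, in percent.
    percentages = [100.0 * c / total for c in cycles_by_width]
    # Average number of instructions dispatched per cycle.
    mean_width = sum(w * c for w, c in enumerate(cycles_by_width)) / total
    return percentages, mean_width


if __name__ == "__main__":
    # Invented histogram: half of all cycles dispatch nothing.
    pct, mean = dispatch_profile([50, 25, 15, 5, 5])
    print(pct, mean)
```

A mean width far below 4 would be one way to quantify how much of the 4-issue dispatch capacity goes unused.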
Results: Instruction Mix
• Browsing Mix
[Figure: three panels (DB2, JWS, RBE) plotting instruction type percentage (%, 0 to 100) against time in seconds (0 to 200). Legend: Logic, Arithmetic, Branch, LD/ST.]
Results: Instruction Mix
• Shopping Mix
[Figure: three panels (DB2, JWS, RBE) plotting instruction type percentage (%, 0 to 100) against time in seconds (0 to 200). Legend: Logic, Arithmetic, Branch, LD/ST.]
Results: Instruction Mix
• Ordering Mix
[Figure: three panels (DB2, JWS, RBE) plotting instruction type percentage (%, 0 to 100) against time in seconds (0 to 200). Legend: Logic, Arithmetic, Branch, LD/ST.]
Results: Branch Behavior
[Figure: two bar charts, one for the Shopping Mix and one for the Browsing Mix, showing event counts (on the order of 10^9) for DB2, JWS and RBE across the eight branch event categories below.]
1. Branches conditional taken
2. Branch to link register taken
3. Branch to counter taken
4. Absolute branches
5. Branches unconditional
6. Branches conditional not taken
7. Zero cycle branch not taken
8. Zero cycle branch taken
Results: Branch Behavior
[Figure: bar chart for the Ordering Mix showing event counts (y-axis, 0 to 12 x 10^9) for DB2, JWS and RBE across the eight branch event categories below.]
1. Branches conditional taken
2. Branch to link register taken
3. Branch to counter taken
4. Absolute branches
5. Branches unconditional
6. Branches conditional not taken
7. Zero cycle branch not taken
8. Zero cycle branch taken
Results: Cache Behavior
[Figure: two bar charts (Browsing Mix, Shopping Mix) showing latency in cycles (y-axis, 0 to 80) for DB2, JWS and RBE across the two categories below.]
1. L1 I-cache miss duration latency
2. L1 D-cache miss duration latency
Results: Cache Behavior
[Figure: two bar charts (Shopping Mix, Ordering Mix) showing latency in cycles (y-axis, 0 to 80) for DB2, JWS and RBE across the two categories below.]
1. L1 I-cache miss duration latency
2. L1 D-cache miss duration latency
Results: Cache Behavior
[Figure: three bar charts (Ordering Mix, Shopping Mix, Browsing Mix) showing miss counts per instruction (y-axis, 0 to 0.014) for DB2, JWS and RBE across the three categories below.]
1. L2 miss count per instruction
2. L1 I-cache miss count per instruction
3. L1 D-cache miss count per instruction
Problems & Future Work
• Problems:
- Large dataset
- Are the network and server-end software the bottleneck?
- Hardware counters vs. simulations
• Future work:
- Measurement of other transaction processing and web serving benchmarks for comparison.
- More architectural characterization, such as multithreaded processors, multiprocessor scaling and multiprocessor snooping-bus issues.
Conclusions
• Server-end software is critical for high-end servers:
- The network and server-end software are the bottleneck.
- This is true both for high-end commercial server systems and for other high-performance parallel computers designed for scientific or engineering computing.
• Preliminary performance characterization shows:
- CPU utilization is highly dependent on the application workload.
- The high-bandwidth dispatch mechanism of the RS64-III appears to be used inefficiently.
- Branch instructions are the second most frequent instruction type, after loads and stores.
- The L2 cache miss rate is unreasonably low, and the L1 D-cache miss latency is considerably larger than that of the L1 I-cache.
Acknowledgement
• Trey Cain for setting up Java TPC-W and discussion
• Morris Marden for helping quiet the machine and discussion
• Prof. Mikko Lipasti for guidance and support
• Everyone who helped us