macsim simulator

Upload: akiva

Post on 22-Feb-2016

Page 1: MacSim  Simulator

MacSim Simulator
HPArch Research Group

Page 2: MacSim  Simulator

MacSim Tutorial (In ISCA-39, 2012)

MacSim Tutorial
| Part 2. Overview of MacSim
  Introduction
  For black box approach users
| Part 3. Details of MacSim
  For computer architecture researchers
| Part 4.
  MacSim-SST case studies
  Ocelot-MacSim case studies
  Research using Ocelot
  Research using MacSim

Page 3: MacSim  Simulator

Introduction of MacSim
| Heterogeneous architecture simulator (x86 + PTX)
| Developed at Georgia Tech
| Trace-driven simulator
  Internal RISC-style micro-op generation module
  x86 traces: generated using Pin; PTX traces: generated using GPUOcelot
| Cycle-level simulator
  Cores, caches, and memory systems are modeled
| Supports various simulations: single/multi-threaded applications, multi-program, heterogeneous (CPU+GPU)

Page 4: MacSim  Simulator

MacSim’s Target Architectures
| Flexible design to support various platforms

| Integration with a parallel simulator (SST) to support high-performance computing systems

| From mobile to Exascale computing systems

Page 5: MacSim  Simulator

Simulator Infrastructure
[Figure: simulator infrastructure. Labels: X86 binaries; CUDA code (.cu); OpenGL code; PIN (API generator); PIN trace generator; NVCC (compiler); PTX code; GPUOcelot trace generator (Prof. Yalamanchili, Georgia Tech); Attila (OpenGL emulator); instruction and thread information; heterogeneous architecture timing & power simulator; "Ongoing Work".]

Page 6: MacSim  Simulator

Getting MacSim & Build
| Getting MacSim
  Stable version (Google Code project): http://macsim.googlecode.com/files/macsim-1.0.tar.gz
  Latest code from the SVN repository
| Directions are explained in http://code.google.com/p/macsim/wiki/GettingMacsim
| How to build
  http://code.google.com/p/macsim/wiki/BuildingMacsim
  Chapter 2 of the manual provides build instructions
  README file in the simulator directory

Page 7: MacSim  Simulator

Other Software Packages…
| MacSim package
  IRIS (NoC simulator from Prof. Yalamanchili’s group) is included
  CPU trace generator
    Download PIN separately. The trace generator tool is in the MacSim package
  GPU trace generator
    Download Ocelot separately. The trace generator is in Ocelot’s package
| MacSim-SST
  SST needs to be downloaded separately
| Energy Introspector (from Prof. Yalamanchili’s group)
  EI is a power model based on McPAT and HotSpot.
  Because of a McPAT license issue, EI cannot currently be distributed, but we will resolve this issue soon

Page 8: MacSim  Simulator

Simulation

Page 9: MacSim  Simulator

MacSim Run
| Once the build process succeeds, the binary is created in macsim-top/trunk/bin/macsim
| [Figure: screenshot of a simulation]
| Now, how do we configure simulation models?

Page 10: MacSim  Simulator

Setting up Architectures
| Knob variables need to be set (3 ways)
  Default value in the source code
  params.in
  Command line
[Figure: groups of cores labeled "Core type 1", "Core type 2", and "Core type 3", plus "Memory"]

Page 11: MacSim  Simulator

Example) 4-Core 2-way SMT
| Configuration: 4 cores, 2-way SMT
  params.in:
    num_sim_cores 4 // 4 cores
    num_sim_small_cores 0
    num_sim_medium_cores 0
    num_sim_large_cores 4
    max_threads_per_large_core 2
    large_core_type x86
    repeat_trace 1
  .def:
    param<NUM_SIM_CORES, num_sim_cores, int, 4>
  Command line:
    ./macsim --num_sim_cores=4
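The three knob sources above can be combined in a small helper script. The sketch below is hypothetical and not part of MacSim: it assumes that params.in holds one `name value` pair per line with optional `//` comments, that command-line knobs look like `--name=value`, and that later sources override earlier ones (defaults, then params.in, then command line). That precedence order is an assumption, not something the slide states.

```python
def parse_params_in(text):
    """Parse 'name value // comment' lines into a dict of strings."""
    knobs = {}
    for line in text.splitlines():
        line = line.split("//", 1)[0].strip()  # drop a trailing // comment
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            knobs[parts[0]] = parts[1].strip()
    return knobs


def parse_command_line(args):
    """Parse '--name=value' arguments into a dict of strings."""
    knobs = {}
    for arg in args:
        if arg.startswith("--") and "=" in arg:
            name, _, value = arg[2:].partition("=")
            knobs[name] = value
    return knobs


def resolve_knobs(defaults, params_text, args):
    """Overlay params.in on defaults, then the command line on both (assumed order)."""
    knobs = dict(defaults)
    knobs.update(parse_params_in(params_text))
    knobs.update(parse_command_line(args))
    return knobs


defaults = {"num_sim_cores": "1"}  # hypothetical default value
params_text = """num_sim_cores 4 // 4 cores
max_threads_per_large_core 2
large_core_type x86
"""
resolved = resolve_knobs(defaults, params_text, ["--num_sim_cores=8"])
print(resolved)
```

Keeping every value as a string avoids guessing knob types; a real tool would convert them according to the type declared in the .def entry.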

Page 12: MacSim  Simulator

Example) CPU+GPU Heterogeneous
| To configure a CPU+GPU architecture
  Set up the number of cores and the core types accordingly
    num_sim_cores 8 // 4 CPUs + 4 GPUs
    num_sim_small_cores 4 // 4 GPUs
    num_sim_medium_cores 0
    num_sim_large_cores 4 // 4 CPUs
    core_type ptx // specify small cores
    large_core_type x86
    cpu_frequency 3
    gpu_frequency 1.5
    repeat_trace 1
| Usually, we use small cores for the GPU and large cores for the CPU
| A GPU core internally has multiple processing elements (N-wide SIMD)
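In both configuration examples the total core count equals the sum of the small, medium, and large core counts (8 = 4 + 0 + 4 here). A quick sanity check along those lines could look like the sketch below. The function name is hypothetical, and treating this equality as a requirement is an assumption drawn from the examples rather than a documented MacSim rule.

```python
def check_core_counts(knobs):
    """Return True when num_sim_cores equals small + medium + large cores."""
    total = int(knobs["num_sim_cores"])
    parts = sum(
        int(knobs[name])
        for name in (
            "num_sim_small_cores",
            "num_sim_medium_cores",
            "num_sim_large_cores",
        )
    )
    return total == parts


hetero = {
    "num_sim_cores": "8",         # 4 CPUs + 4 GPUs
    "num_sim_small_cores": "4",   # GPU cores
    "num_sim_medium_cores": "0",
    "num_sim_large_cores": "4",   # CPU cores
}
print(check_core_counts(hetero))
```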

Page 13: MacSim  Simulator

Example) Multi-Program Simulation
| Multiple applications
  Set up from trace_file_list
[Figure labels: MCF, GCC, MM thread 1, MM thread 2, Blackscholes]
  trace_file_list:
    4 <-- number of applications
    /sample/mcf/trace.txt <- appl 1
    /sample/gcc/trace.txt <- appl 2
    /sample/mm/trace.txt <- appl 3
    /sample/blackscholes/trace.txt <- appl 4
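A trace_file_list like the one above is easy to generate from a script. This is a minimal sketch, assuming the file holds the application count on the first line followed by one trace path per line, as the slide shows. The helper name is hypothetical, and the `<-` annotations on the slide are treated as explanatory notes rather than file content.

```python
def format_trace_file_list(trace_paths):
    """Return the text of a trace_file_list: count first, then one path per line."""
    lines = [str(len(trace_paths))] + list(trace_paths)
    return "\n".join(lines) + "\n"


traces = [
    "/sample/mcf/trace.txt",
    "/sample/gcc/trace.txt",
    "/sample/mm/trace.txt",
    "/sample/blackscholes/trace.txt",
]
listing = format_trace_file_list(traces)
print(listing, end="")
```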

Page 14: MacSim  Simulator

Repeating Traces
| Execution time differs for each application.
| An option lets short traces repeat until the longest trace ends
| Is this the right way to simulate?
[Figure: traces over time]
  Program 1: mcf
  Program 2: gcc gcc gcc gcc
  Program 3: bfs bfs bfs bfs bfs
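The repeat behaviour in the figure can be reasoned about with simple arithmetic: a trace that restarts whenever it finishes is started ceil(longest / own length) times before the longest trace ends. The sketch below illustrates this with made-up trace lengths (the slide gives none), chosen so the counts match the figure. It ignores contention between programs, which changes real execution times.

```python
def repeat_counts(trace_lengths):
    """Map each trace to how many times it starts before the longest one ends."""
    longest = max(trace_lengths.values())
    # -(-a // b) is integer ceiling division
    return {name: -(-longest // length) for name, length in trace_lengths.items()}


# Hypothetical execution times in arbitrary units
lengths = {"mcf": 20, "gcc": 5, "bfs": 4}
repeats = repeat_counts(lengths)
print(repeats)
```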

Page 15: MacSim  Simulator

Sample Configuration Files
| Sample configuration files in macsim-top/trunk/params

File name             Contents
params_8800gt         GeForce 8800 GT (G80)
params_gtx280         GeForce GTX 280 (GT200)
params_gtx465         NVIDIA GeForce GTX 465 (Fermi)
params_x86            Intel’s Sandy Bridge (CPU part only)
params_hetero_4c_4g   Intel’s Sandy Bridge (CPU + GPU)

Page 16: MacSim  Simulator

Limited Support of Multi-thread Applications
| Thread spawn is modeled.
| Locks are not modeled.
[Figure labels: GPU kernel invocation, host thread, main thread, thread spawn, barrier, cores]

Page 17: MacSim  Simulator

Trace Generation
| It will be covered in Part-III
| The trace generator generates thread execution information automatically.

| Users do not need to worry about this.

Page 18: MacSim  Simulator

Clock Domain
| MacSim has 5 different clock domains
  CPU
  GPU
  Last-level cache
  Interconnection network
  DRAM
    # Clock
    clock_cpu 3
    clock_gpu 1.5
    clock_l3 1
    clock_noc 1
    clock_mc 1.6
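To see how the five domains relate, it helps to convert a cycle count in one domain into the others. The sketch below does that under the assumption that the knob values are frequencies in GHz; the slide does not state the unit. The helper names are hypothetical.

```python
# Knob values from the slide, assumed to be in GHz
CLOCKS_GHZ = {
    "clock_cpu": 3.0,
    "clock_gpu": 1.5,
    "clock_l3": 1.0,
    "clock_noc": 1.0,
    "clock_mc": 1.6,
}


def period_ns(freq_ghz):
    """Clock period in nanoseconds for a frequency given in GHz."""
    return 1.0 / freq_ghz


def cycles_in_domain(cpu_cycles, domain, clocks=CLOCKS_GHZ):
    """Cycles that elapse in `domain` while the CPU runs `cpu_cycles` cycles."""
    elapsed_ns = cpu_cycles / clocks["clock_cpu"]
    return elapsed_ns * clocks[domain]


for name in CLOCKS_GHZ:
    print(name, cycles_in_domain(3000, name))
```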

Page 19: MacSim  Simulator

Microarchitecture
| x86 instructions are mapped to uops
| PTX instructions are mapped to uops (almost 1-1 mapping)
| Pipeline stages
  Front-end, Decode, Rename, Schedule, Execution, Retire
[Figure labels: Pin (XED); macro instructions with decoded information from Pin’s XED; MacSim trace decoder; uops; timing/power simulator; memory]

Page 20: MacSim  Simulator

Microarchitecture Setup-I
| Front-end, DEC/Rename: just a simple FIFO queue.
    fetch_latency 5 // front-end depth
    alloc_latency 5 // decode/allocation depth
    width // pipeline width (same width for all pipeline stages)
    bp_dir_mech gshare
    bp_hist_length 14 // branch history length
| Rename: creates RAW dependencies (map structure)
    rob_size 96 // ROB size
| Scheduler // in-order scheduler, ooo scheduler
    schedule io, ooo // instruction scheduling policy
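For readers unfamiliar with the `gshare` setting, the sketch below shows the textbook scheme: the branch address is XORed with a global history register to index a table of 2-bit saturating counters. It uses a 14-bit history to mirror `bp_hist_length 14`. This is a generic illustration, not MacSim's implementation, and the class and method names are hypothetical.

```python
class Gshare:
    """Textbook gshare predictor with 2-bit saturating counters."""

    def __init__(self, hist_length=14):
        self.mask = (1 << hist_length) - 1
        self.history = 0                           # global branch history
        self.counters = [1] * (1 << hist_length)   # start weakly not-taken

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        """Predict taken when the counter is in one of the two upper states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the counter, then shift the outcome into the history."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = Gshare(hist_length=14)
pc = 0x400
mispredicts = 0
for _ in range(100):          # a branch that is always taken
    if not predictor.predict(pc):
        mispredicts += 1
    predictor.update(pc, True)
print(mispredicts)
```

The predictor mispredicts only while the history register is still filling; once the history is stable, the same counter is reused and the branch is predicted correctly.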

Page 21: MacSim  Simulator

Microarchitecture Setup-II
| Execution latency
  Fixed uop latency (macsim-top/def/uop_latency_[x86,ptx].def)
  Variable latency: cache/memory latency
| Instruction scheduling rates
    isched_rate 4 // # of integer inst. that can be executed per cycle
    msched_rate 2 // # of memory inst. that can be executed per cycle
    fsched_rate 2 // # of FP inst. that can be executed per cycle
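The scheduling rates act as per-cycle budgets for each instruction class. A rough sketch of that idea follows: ready uops are visited oldest first, and each one issues only if its class still has budget in the current cycle. The data layout and function name are hypothetical, and the sketch ignores the overall pipeline width and operand readiness, so it illustrates the rate limits only.

```python
# Per-cycle budgets taken from the slide's isched/msched/fsched rates
RATES = {"int": 4, "mem": 2, "fp": 2}


def schedule_cycle(ready_uops, rates=RATES):
    """Split ready uops (oldest first) into those issued this cycle and those left waiting."""
    budget = dict(rates)
    issued, waiting = [], []
    for uop_class in ready_uops:
        if budget.get(uop_class, 0) > 0:
            budget[uop_class] -= 1
            issued.append(uop_class)
        else:
            waiting.append(uop_class)
    return issued, waiting


ready = ["mem", "mem", "mem", "int", "fp", "fp", "fp"]
issued, waiting = schedule_cycle(ready)
print(issued, waiting)
```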

Page 22: MacSim  Simulator

Microarchitecture Setup-III
| Cache configuration
  # of sets, associativity, line size, # of banks, etc. (see manual)
| Cache size = # of sets x assoc x line_size x # of tiles
| DRAM configuration
  Frequency, bus width, column/activate/precharge latency
  # of memory controllers, # of banks, # of channels, row buffer size, DRAM scheduling policy
  Simple but fast DRAM model that models key features
| MacSim is connected with DRAM-SIM2
  Users can use DRAM-SIM2 for a detailed DRAM timing simulation

L3 only
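The cache size formula on this slide can be checked with a few lines of arithmetic. The sketch below uses made-up parameter values, since the slide gives none. It defaults the tile count to 1 on the assumption that the "L3 only" annotation refers to tiles; that reading is a guess.

```python
def cache_size_bytes(num_sets, assoc, line_size, num_tiles=1):
    """Cache size = # of sets x assoc x line_size x # of tiles."""
    return num_sets * assoc * line_size * num_tiles


# Hypothetical L3: 1024 sets, 8-way, 64-byte lines, 4 tiles
l3_bytes = cache_size_bytes(num_sets=1024, assoc=8, line_size=64, num_tiles=4)
print(l3_bytes // 1024, "KiB")
```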

Page 23: MacSim  Simulator

Simulation Outputs
| Statistics
  Simulation outputs: *.stat.out
  Files in macsim/trunk/def hold the stat definitions (more details in Part-III)
| Important stats
  IPC = INST_COUNT_TOT/CYC_COUNT_TOT
  CPI = CYC_COUNT_TOT/INST_COUNT_TOT
| Per-core stats
  IPC for core 0 = INST_COUNT_CORE_0/CYC_COUNT_CORE_0
| Multiple-application stats
  *.stat.out.<application_id>, e.g. memory.stat.out.0, bp.stat.out.1
  Each stat file contains stats only for the first run (repeated runs are ignored)
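The IPC and CPI formulas above are simple enough to compute from a script once the counters are read. The sketch below assumes each stat line starts with the stat name followed by its integer value, separated by whitespace. The actual *.stat.out layout is not shown in the slides, so that format and the sample numbers are assumptions.

```python
def parse_stats(text):
    """Read 'NAME value ...' lines into a dict, skipping lines that do not fit."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            try:
                stats[fields[0]] = int(fields[1])
            except ValueError:
                continue  # not a counter line under the assumed format
    return stats


def ipc(stats, inst_key="INST_COUNT_TOT", cyc_key="CYC_COUNT_TOT"):
    """IPC = instruction count / cycle count."""
    return stats[inst_key] / stats[cyc_key]


# Hypothetical stat file contents
sample = """INST_COUNT_TOT 2000
CYC_COUNT_TOT 1000
INST_COUNT_CORE_0 600
CYC_COUNT_CORE_0 1000
"""
stats = parse_stats(sample)
total_ipc = ipc(stats)
total_cpi = 1.0 / total_ipc
core0_ipc = ipc(stats, "INST_COUNT_CORE_0", "CYC_COUNT_CORE_0")
print(total_ipc, total_cpi, core0_ipc)
```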

Page 24: MacSim  Simulator

Other Stats
| Memory systems
  L[1-3]_HIT_CPU/L[1-3]_HIT_GPU
  L[1-3]_MISS_CPU/L[1-3]_MISS_GPU
| Front-end
  BP_ON_PATH_[CORRECT/MISPREDICT/MISFETCH]
| Instruction profiling
  Based on instruction category: inst.stat.out
| More details regarding statistics are in the documentation
| We will provide a simple script to fetch stat data

Page 25: MacSim  Simulator

GPGPU Support

Page 26: MacSim  Simulator

GPGPU Support Features
| Multi-threading support is already there.
| Different ISAs: using micro-ops
| Warp?
  One warp is treated as one thread.
  Each thread generates its own trace file.
  Active bit information is included
  Trace format will be explained in Part-III
| Thread and block scheduling
  Block-level barrier, block-level scheduling/retirement
  More details will be explained in Part-III
| Different memory structures
  Memory systems

Page 27: MacSim  Simulator

Handling Vector Memory Operations

| Include the memory access by each thread of a warp as a separate instruction in the trace
| In the trace, mark these accesses as coming from the same warp
[Figure: a SIMD load instruction with addresses Addr 0 to Addr 7. Coalesced case: one memory instruction with 128B size, stored in the trace file as a single TraceInst. Uncoalesced case: a 64B request and two 32B requests, stored in the trace file as TraceInst_begin (start of memory instruction marker), TraceMem1, TraceMem2, TraceMem3, TraceInst_end (end of memory instruction marker).]

Page 28: MacSim  Simulator

Handling Vector Memory Operations

| During simulation, form a “parent” uop that holds all the individual memory accesses as its child uops
| The parent uop flows through the pipeline; only in the memory stage are the individual child uops issued to memory
  The parent uop is ready for retirement when all children have completed
[Figure: the trace file records TraceInst_begin (start of memory instruction marker), TraceMem1, TraceMem2, TraceMem3, … TraceMemN, TraceInst_end (end of memory instruction marker) become, inside MacSim, one parent uop (Mem_type: ld, #children: 8) with child uops addr0, addr1, … addrN.]
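The parent/child relationship described above can be modeled in a few lines. The sketch below is a simplified illustration, not MacSim code: a parent uop records its child addresses, each child is marked complete as memory responds, and the parent becomes ready to retire once every child has completed. The class and attribute names are hypothetical.

```python
class ParentUop:
    """A memory uop that owns one child access per address."""

    def __init__(self, mem_type, child_addrs):
        self.mem_type = mem_type
        self.child_addrs = list(child_addrs)
        self.done = [False] * len(self.child_addrs)

    @property
    def num_children(self):
        return len(self.child_addrs)

    def complete_child(self, index):
        """Mark the child at `index` as finished by the memory system."""
        self.done[index] = True

    def ready_to_retire(self):
        """The parent may retire only when all children have completed."""
        return all(self.done)


# Eight child accesses, matching "#children: 8" in the figure; addresses are made up
parent = ParentUop("ld", [0x1000 + 128 * i for i in range(8)])
for i in range(7):
    parent.complete_child(i)
still_waiting = not parent.ready_to_retire()
parent.complete_child(7)
print(parent.num_children, still_waiting, parent.ready_to_retire())
```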

Page 29: MacSim  Simulator

Enhanced MacSim

Page 30: MacSim  Simulator

More Features with MacSim
| IRIS (from Prof. Yalamanchili’s group)
  Flit-level interconnection network simulator
  Virtual channels, credit-based flow control, deadlock avoidance, …
  Part-IV will cover more.
| MacSim-SST
  Parallel simulation
[Figure labels: nodes, routers, topology (ring, mesh, torus, ..)]