

Page 1: Morpheus: Creating Application Objects Efficiently for ...isca2016.eecs.umich.edu/wp-content/uploads/2016/07/1B-2.pdf · AMD DirectGMA or NVIDIA GPUDirect • Generate Morpheus-NVMe

Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing

Hung-Wei Tseng, Qianchen Zhao, Yuxiao Zhou, Mark Gahagan, Steven Swanson

Department of Computer Science and Engineering
University of California, San Diego

Non-volatile Systems Laboratory (NVSL)


Applications interact with files


How we process files today

[Figure: SSD, DRAM, CPU, and GPU. The text string “12345678” read from a file becomes the binary value 0xBC614E, one of the application objects (0xCAB23A, 0x002FEA, 0x1AA360, 0x8CA520, 0xBC614E, 0x7BDE07).]
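The conversion in this figure can be sketched in standard C: the eight characters "12345678" stored in a file become the single machine integer 0xBC614E. This is only an illustration of the kind of parsing work involved. The helper name `parse_u32` is hypothetical and does not come from the talk.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Convert a decimal text field into a 32-bit binary value.
 * Returns 0 on success, -1 if the text is not a plain decimal
 * number that fits in 32 bits. (Hypothetical helper.) */
int parse_u32(const char *text, uint32_t *out)
{
    char *end = NULL;
    unsigned long value;

    assert(text != NULL && out != NULL);

    /* Reject signs and leading whitespace, which strtoul would accept. */
    if (text[0] < '0' || text[0] > '9')
        return -1;

    errno = 0;
    value = strtoul(text, &end, 10);
    if (*end != '\0' || errno == ERANGE || value > UINT32_MAX)
        return -1;

    *out = (uint32_t)value;
    return 0;
}
```

Calling `parse_u32("12345678", &v)` stores 0xBC614E in `v`. Every field in a text input file needs a conversion of this kind before a compute kernel can use it.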


The conventional model

[Figure: timeline across CPU/APU, DRAM, SSD, and GPU. Steps: retrieve file; parse data and create objects; compute kernel.]

Creating objects generates traffic on the CPU-memory bus and incurs system overhead


Overhead of creating objects

[Figure: stacked bars showing the percentage of execution time (0 to 1.0) for the GPU-accelerated applications PageRank, CC, bfs, gaussian, hybridsort, kmeans, lud, nn, srad, JASPA, and their average. Series: object creation, other CPU computation, GPU, moving data to GPU. Callout: 64%.]

Creating objects is now the bottleneck in applications


High-speed storage doesn’t help

[Figure: throughput of parsing input data (MB/sec, 0 to 90) for the GPU-accelerated applications PageRank, CC, bfs, gaussian, hybridsort, kmeans, lud, nn, srad, JASPA, and their average. Series: SSD, RamDrive, HDD.]

Very little difference among different storage technologies


Preventing P2P communication between peripherals

[Figure: SSD, DRAM, CPU, and GPU with application objects (0xCAB23A, 0x002FEA, 0x1AA360, 0x8CA520, 0xBC614E, 0x7BDE07). Two paths are marked: the desired data path and the real data path in the current model.]

P2P is useless because the CPU is still needed to create application objects


We need to rethink the processing model!


Morpheus


Outline
• The Morpheus model
• The system architecture
• Experimental result
• Conclusion


Morpheus: Creating application objects in SSDs

[Figure: GPU, SSD with an SSD processor, DRAM, and CPU, with application objects (0xCAB23A, 0x002FEA, 0x1AA360, 0x8CA520, 0xBC614E, 0x7BDE07).]


The Morpheus model

[Figure: timeline across CPU/APU, DRAM, SSD, and GPU. Steps: StorageApp; retrieve objects; compute kernel.]


Benefits of Morpheus
• Bypass system overheads
• Allow applications to take advantage of P2P data communication
• Reduce traffic over system interconnects
• Lower energy consumption



Outline
• The Morpheus model
• The system architecture
• Experimental result
• Conclusion


Implementing the Morpheus model

[Figure: system stack. Labels: Application; Morpheus runtime; GPU runtime; operating system (Morpheus-NVMe driver, NVMe-P2P); hardware (PCIe interconnect, Morpheus-SSD, GPU).]


Morpheus-NVMe

• NVMe: an interface that defines how the host computer interacts with non-volatile memory devices
• Morpheus-NVMe extensions
  • MInit: installs a StorageApp and prepares it for execution
  • MRead: reads data and applies the StorageApp to the data being read
  • MWrite: writes data and applies the StorageApp to the data being written
  • MDeinit: completes and releases the StorageApp


Morpheus-SSD

[Figure: Morpheus-SSD block diagram. Components: PCIe/NVMe interface (PCI Express); eight embedded cores and four accelerators on an in-storage interconnect; DRAM controller with SSD DRAM (DDR3/DDR4); DMA engine; flash interface to flash memory. Annotations: managing Morpheus-NVMe commands; executing StorageApps.]


NVMe-P2P

• Map GPU device memory to a PCIe BAR using AMD DirectGMA or NVIDIA GPUDirect

• Generate Morpheus-NVMe commands that use GPU memory addresses as the DMA targets

• Morpheus pulls data from and pushes data to GPU addresses directly, without going through main memory

[Figure: GPU and application objects (0xCAB23A, 0x002FEA, 0x1AA360, 0x8CA520, 0xBC614E, 0x7BDE07).]


Creating a StorageApp
• Use C to compose a StorageApp
• Use the Morpheus-SSD library to access SSD resources
• The compiler generates machine code that the embedded processors can execute

StorageApp

int inputApplet(ms_stream ssd_input_stream, void *edge_array) {
    Edge ssd_edge_array[4096];   /* staging buffer inside the SSD */
    int i = 0;                   /* number of edges parsed so far */
    while (ms_scanf(ssd_input_stream, "%d %d",
                    &ssd_edge_array[i % 4096].first,
                    &ssd_edge_array[i % 4096].second) == 2) {
        i++;
        if (i % 4096 == 0) {
            /* Staging buffer is full: copy it out and advance the target. */
            ms_memcpy(edge_array, ssd_edge_array, sizeof(Edge) * 4096);
            /* Cast first: arithmetic on void * is not standard C. */
            edge_array = (char *)edge_array + sizeof(Edge) * 4096;
        }
    }
    /* Copy the final, partially filled buffer. */
    ms_memcpy(edge_array, ssd_edge_array, sizeof(Edge) * (i % 4096));
    return i;
}
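The buffering pattern in this StorageApp can be tried on an ordinary host with standard C. In the sketch below, `fscanf` and `memcpy` stand in for `ms_scanf` and `ms_memcpy`, whose exact behavior the slides do not specify. The names `parse_edges`, `CHUNK`, and `max_edges` are illustrative, and the capacity check is an addition that the slide code does not have.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CHUNK 4  /* the slide uses 4096; small here so the flush path is easy to exercise */

typedef struct { int first; int second; } Edge;

/* Parse "first second" pairs from `in`, staging them in a small local
 * buffer and copying each full chunk to `out`. Stops at end of input or
 * when `max_edges` edges have been parsed. Returns the number parsed. */
int parse_edges(FILE *in, Edge *out, int max_edges)
{
    Edge staging[CHUNK];
    int total = 0;    /* edges parsed so far */
    int flushed = 0;  /* edges already copied to out */

    assert(in != NULL && out != NULL);

    while (total < max_edges &&
           fscanf(in, "%d %d", &staging[total % CHUNK].first,
                               &staging[total % CHUNK].second) == 2) {
        total++;
        if (total % CHUNK == 0) {
            /* Staging buffer is full: copy the whole chunk out. */
            memcpy(out + flushed, staging, sizeof(Edge) * CHUNK);
            flushed += CHUNK;
        }
    }

    /* Copy the final partial chunk, if any. */
    if (total > flushed)
        memcpy(out + flushed, staging, sizeof(Edge) * (size_t)(total - flushed));
    return total;
}
```

With five edges in the input and `CHUNK` set to 4, the first four edges are copied in one block and the fifth is copied by the final partial flush.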


Invoking a StorageApp in host applications

• Like calling a function
• Prepare arguments using the Morpheus runtime library
• The runtime library interacts with the driver to utilize the SSD facilities

void test_distributed_page_rank(char* graphfilename, int num_ofVertex,
                                int num_ofEdges, int iterations) {
    FILE *fin;
    ms_stream ssd_input_stream;
    void **arg_list;
    fin = fopen(graphfilename, "r");
    /* Wrap the open file in a Morpheus stream. */
    ssd_input_stream = ms_stream_create(fin);
    Edge *edge_array = (Edge *)malloc(sizeof(Edge)*num_ofEdges);
    /* Invoke the StorageApp like an ordinary function. */
    inputApplet(ssd_input_stream, edge_array);
    ms_stream_destroy(ssd_input_stream);
    // The rest of code ...
}


Outline
• The Morpheus model
• The system architecture
• Experimental result
• Conclusion


Experimental setup
• Intel Xeon E5-2609 v2 processor
• NVIDIA K20 GPU
• Morpheus-SSD: a 512GB SSD with a PMCS (now Microsemi) NVMe controller

[Figure labels: Morpheus-SSD, K20 GPU.]


Morpheus improves performance

[Figure: speedup (0 to 1.8) for the GPU-accelerated applications PageRank, CC, bfs, gaussian, hybridsort, kmeans, lud, nn, srad, JASPA, and their average. Series: Morpheus-SSD, Morpheus+NVMe-P2P. Callouts: 1.32x, 1.39x.]


Morpheus saves power/energy

[Figure: normalized value (0 to 1.0) of power and energy for the GPU-accelerated applications PageRank, CC, bfs, gaussian, hybridsort, kmeans, lud, nn, srad, JASPA, and their average. Series: power, energy. Callouts: -7%, -42%.]


Morpheus makes wimpy servers more competitive

[Figure: speedup over 2.5G CPUs (0 to 1.60) for the GPU-accelerated applications PageRank, CC, bfs, gaussian, hybridsort, kmeans, lud, nn, srad, JASPA, and their average. Series: 1.2G CPU; Morpheus-SSD on 1.2G CPU; Morpheus-SSD on 1.2G CPU + NVMe-P2P. Callouts: 0.53x, 1.08x, 1.12x.]

Morpheus-SSD + wimpy CPUs can compete with high-end servers


Conclusion

• Object creation/deserialization/serialization has become a new bottleneck for high-performance heterogeneous computers

• The Morpheus model leverages under-utilized computing resources in storage devices to
  • bypass system overheads
  • enable efficient data communication mechanisms

• Morpheus-SSD improves application performance by 1.39x and allows wimpy servers to compete with high-end servers


Thank you!


Hung-Wei Tseng will be an assistant professor in

starting from this August