Hiding Periodic I/O Costs in Parallel Applications



Hiding Periodic I/O Costs in Parallel Applications

Xiaosong Ma

Department of Computer Science

University of Illinois at Urbana-Champaign

Spring 2003


Roadmap

• Introduction
• Active buffering: hiding recurrent output cost
• Ongoing work: hiding recurrent input cost
• Conclusions


Introduction

• Fast-growing technology propels high-performance applications
– Scientific computation
– Parallel data mining
– Web data processing
– Games, movie graphics

• Individual components’ growth is uncoordinated
– Manual performance tuning needed


We Need Adaptive Optimization

• Flexible and automatic performance optimization desired

• Efficient high-level buffering and prefetching for parallel I/O in scientific simulations


Scientific Simulations

• Important
– Detail and flexibility
– Save money and lives

• Challenging
– Multi-disciplinary
– High performance crucial


Parallel I/O in Scientific Simulations

• Write-intensive

• Collective and periodic

• “Poor stepchild”

• Bottleneck-prone

• Existing collective I/O focused on data transfer

[Figure: execution timeline alternating computation phases with periodic I/O phases]


My Contributions

• Idea: I/O optimizations in a larger scope
– Parallelism between I/O and other tasks
– Individual simulation’s I/O needs
– I/O-related self-configuration

• Approach: hide the I/O cost

• Results
– Publications, technology transfer, software


Roadmap

• Introduction
• Active buffering: hiding recurrent output cost
• Ongoing work: hiding recurrent input cost
• Conclusions


Latency Hierarchy on Parallel Platforms

• Along the path of data transfer:
– Smaller throughput
– Lower parallelism, less scalable

local memory access → inter-processor communication → disk I/O → wide-area transfer


Basic Idea of Active Buffering

• Purpose: maximize overlap between computation and I/O

• Approach: buffer data as early as possible
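The two bullets above can be sketched in a few lines. The following is a minimal, illustrative buffered writer, not the Panda implementation (class and method names are hypothetical): each output block is copied into an in-memory queue, and a background thread does the slow disk write while the caller's computation continues.

```python
import queue
import threading

class ActiveBuffer:
    """Illustrative sketch of the active-buffering idea: write() only
    pays the cost of a memory copy; a background thread drains the
    buffered blocks to disk, overlapping I/O with computation."""

    def __init__(self, path):
        self._queue = queue.Queue()
        self._file = open(path, "wb")
        self._drainer = threading.Thread(target=self._drain, daemon=True)
        self._drainer.start()

    def write(self, data: bytes):
        # Returns immediately: the memory copy is the only visible cost.
        self._queue.put(bytes(data))

    def _drain(self):
        while True:
            block = self._queue.get()
            if block is None:           # sentinel: no more output
                break
            self._file.write(block)     # slow disk write, hidden from caller

    def close(self):
        self._queue.put(None)
        self._drainer.join()
        self._file.close()
```

Because a single drainer thread consumes the queue, blocks reach the file in the order they were written.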


Challenges

• Accommodate multiple I/O architectures

• No assumption on buffer space

• Adaptive
– Buffer availability
– User request patterns


Roadmap

• Introduction
• Active buffering: hiding recurrent output cost
– With client-server I/O architecture [IPDPS ’02]
– With server-less architecture
• Ongoing work: hiding recurrent input cost
• Related work and future work
• Conclusions


Client-Server I/O Architecture

[Figure: compute processors send output to I/O servers, which write to the file system]


Client State Machine

[Figure: client state machine — on entering the collective write routine, the client prepares and buffers data while buffer space is available; when it runs out of buffer space it sends blocks to a server until all data are sent, then exits (with no overflow, it exits once all data are buffered)]
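A minimal sketch of the client-side logic in the figure above, with hypothetical names and simplified transitions (a plain `send` callback stands in for real message passing to the I/O server):

```python
from enum import Enum, auto

class ClientState(Enum):
    PREPARE = auto()
    BUFFER_DATA = auto()
    SEND_BLOCK = auto()
    EXIT = auto()

def collective_write(blocks, buffer_capacity, send):
    """Buffer blocks while space remains; on overflow, send the rest
    to the server immediately. Returns the locally buffered blocks."""
    state = ClientState.PREPARE
    buffered, i = [], 0
    while state is not ClientState.EXIT:
        if state is ClientState.PREPARE:
            state = ClientState.BUFFER_DATA
        elif state is ClientState.BUFFER_DATA:
            if i == len(blocks):                  # all data handled, no overflow
                state = ClientState.EXIT
            elif len(buffered) < buffer_capacity: # buffer space available
                buffered.append(blocks[i])
                i += 1
            else:                                 # out of buffer space
                state = ClientState.SEND_BLOCK
        elif state is ClientState.SEND_BLOCK:
            if i < len(blocks):                   # data left to send
                send(blocks[i])
                i += 1
            else:                                 # all data sent
                state = ClientState.EXIT
    return buffered
```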


Server State Machine

[Figure: server state machine — after initialization the server allocates buffers and alternates between idle-listen and busy-listen; on a write request it receives blocks while enough buffer space remains, fetches blocks from clients when it runs out of buffer space, writes buffered blocks when idle, and exits on an exit message once all writes are done]


Maximize Apparent Throughput

• Ideal apparent throughput per server:

T_ideal = D_total / (D_c-buffered / T_mem-copy + D_c-overflow / T_msg-passing + D_s-overflow / T_write)

• More expensive data transfer only becomes visible when overflow happens

• Efficiently masks the difference in write speeds
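The formula can be evaluated directly. A small helper, assuming data sizes in MB and the three transfer speeds in MB/s (the parameter names are ours, not from the Panda code):

```python
def ideal_apparent_throughput(d_total, d_buffered, d_c_overflow, d_s_overflow,
                              t_memcopy, t_msg, t_write):
    """T_ideal: total output divided by the time each portion of the
    data spends on its (increasingly slow) transfer path."""
    time = (d_buffered / t_memcopy      # copied into client buffers
            + d_c_overflow / t_msg      # client overflow -> message passing
            + d_s_overflow / t_write)   # server overflow -> disk write
    return d_total / time
```

With no overflow, the apparent throughput equals the memory-copy speed; each overflow term pulls it toward the slower path's speed.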


Write Throughput without Overflow

[Charts: throughput per server (MB/s) vs. number of clients (2, 4, 8, 16, 32), for binary write and HDF4 write, comparing local buffering, AB, and MPI]

– Panda Parallel I/O library
– SGI Origin 2000, SHMEM
– Per client: 16MB output data per snapshot, 64MB buffer
– Two servers, each with 256MB buffer


Write Throughput with Overflow

[Charts: throughput per server (MB/s) vs. number of clients (2, 4, 8, 16, 32), for binary write and HDF4 write, comparing ideal, AB, and MPI]

– Panda Parallel I/O library
– SGI Origin 2000, SHMEM, MPI
– Per client: 96MB output data per snapshot, 64MB buffer
– Two servers, each with 256MB buffer


Give Feedback to Application

• “Softer” I/O requirements

• Parallel I/O libraries have been passive

• Active buffering allows I/O libraries to take a more active role
– Find the optimal output frequency automatically
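One way such feedback might look, as a hedged sketch only (the policy and every name here are hypothetical, not the mechanism described in the talk): given the compute time available and the rate at which buffered output drains to disk, find the highest snapshot count whose output never overflows the buffer.

```python
def max_snapshot_frequency(compute_time_per_step, steps, snapshot_mb,
                           drain_mb_per_s, buffer_mb):
    """Return the largest number of evenly spaced snapshots per run
    whose output can be drained in the background without overflow."""
    total_compute = compute_time_per_step * steps
    for n in range(steps, 0, -1):       # try high frequencies first
        interval = total_compute / n    # compute time between snapshots
        backlog = 0.0
        ok = True
        for _ in range(n):
            backlog += snapshot_mb      # snapshot lands in the buffer
            if backlog > buffer_mb:     # overflow: this frequency is too high
                ok = False
                break
            # background drain during the next compute interval
            backlog = max(0.0, backlog - drain_mb_per_s * interval)
        if ok:
            return n
    return 0
```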


Server-side Active Buffering

[Figure: the server state machine repeated from the earlier slide, illustrating server-side active buffering]


Performance with Real Applications

• Application overview
– GENX: large-scale, multi-component, detailed rocket simulation
– Developed at the Center for Simulation of Advanced Rockets (CSAR), UIUC
– Multi-disciplinary, complex, and evolving

• Providing parallel I/O support for GENX
– Identification of parallel I/O requirements [PDSECA ’03]
– Motivation and test case for active buffering


Overall Performance of GEN1

– SDSC IBM SP (Blue Horizon)
– 64 clients, 2 I/O servers with AB
– 160MB output data per snapshot (in HDF4)

[Chart: total time (s), split into computation and I/O, vs. number of snapshots taken in 30 time steps]


Aggregate Write Throughput in GEN2

– LLNL IBM SP (ASCI Frost)
– 1 I/O server per 16-way SMP node
– Writes in HDF4

[Chart: apparent aggregate write throughput (MB/s), native I/O vs. AB, as compute processors scale from 2 (1 SMP node) to 480 (32 SMP nodes)]


Scientific Data Migration

• Output data need to be moved

• Online migration

• Extend active buffering to migration
– Local storage becomes another layer in the buffer hierarchy

[Figure: execution timeline alternating computation and I/O phases, with each I/O phase now followed by Internet transfer]


I/O Architecture with Data Migration

[Figure: compute processors send output to I/O servers, which write to the file system and migrate data over the Internet to a workstation running a visualization tool]


Active Buffering for Data Migration

• Avoid unnecessary local I/O
– Hybrid migration approach: memory-to-memory transfer, with disk staging as the fallback

• Combined with data compression [ICS ’02]

• Self-configuration for online visualization
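A sketch of how a hybrid migration policy might choose between the two paths (the budget-based rule and all names are assumptions for illustration, not the published algorithm): send blocks memory-to-memory while a memory budget lasts, and stage the remainder to local disk for later shipping.

```python
def migrate(blocks, net_send, stage_to_disk, memory_budget_mb, block_mb):
    """Hybrid migration: memory-to-memory transfer up to the budget,
    disk staging only for the overflow. Returns (sent, staged) counts."""
    in_memory = int(memory_budget_mb // block_mb)  # blocks that fit in memory
    for i, block in enumerate(blocks):
        if i < in_memory:
            net_send(block)         # memory-to-memory transfer, no local I/O
        else:
            stage_to_disk(block)    # overflow: stage locally, ship later
    return min(in_memory, len(blocks)), max(0, len(blocks) - in_memory)
```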


Roadmap

• Introduction
• Active buffering: hiding recurrent output cost
– With client-server I/O architecture
– With server-less architecture [IPDPS ’03]
• Ongoing work: hiding recurrent input cost
• Conclusions


Server-less I/O Architecture

[Figure: compute processors, each running an I/O thread, write directly to the file system]


Making ABT Transparent and Portable

• Unchanged interfaces
• High-level and file-system independent
• Design and evaluation [IPDPS ’03]
• Ongoing transfer to ROMIO

[Figure: ABT implemented as an ADIO module, alongside file-system modules for NFS, HFS, NTFS, PFS, PVFS, UFS, and XFS]
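The "unchanged interfaces" point can be illustrated with a toy wrapper (this is not ROMIO's ADIO interface; everything below is a hypothetical sketch): callers keep the same write() signature they already use, while an I/O thread performs the actual writes behind it.

```python
import queue
import threading

def with_abt(fileobj):
    """Wrap a writable file object so existing callers keep the
    unchanged write() interface while an I/O thread does the work."""
    q = queue.Queue()

    def drain():
        # I/O thread: perform the real writes in the background.
        while (chunk := q.get()) is not None:
            fileobj.write(chunk)

    t = threading.Thread(target=drain, daemon=True)
    t.start()

    class Proxy:
        def write(self, data):       # same signature as the wrapped object
            q.put(data)
            return len(data)

        def close(self):             # flush: wait for the drainer to finish
            q.put(None)
            t.join()

    return Proxy()
```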


Active Buffering vs. Asynchronous I/O

Active buffering                          | Async I/O
------------------------------------------|--------------------------------------------------
Application level (platform-independent)  | Supported by the file system (platform-dependent)
Transparent to the user                   | Not transparent to the user
Designed for collective I/O               | More difficult to use in collective I/O
Both local and remote I/O                 | Local I/O
Works on top of scientific data formats   | May not be supported by scientific data formats


Roadmap

• Introduction
• Active buffering: hiding recurrent output cost
• Ongoing work: hiding recurrent input cost
• Conclusions


I/O in Visualization

• Periodic reads

• Dual modes of operation
– Interactive
– Batch mode

• Harder to overlap reads with computation

[Figure: execution timeline alternating computation phases with periodic read I/O phases]


Efficient I/O Through Data Management

• In-memory database of datasets
– Manage buffers or values

• Hub for I/O optimization
– Prefetching for batch mode
– Caching for interactive mode

• User-supplied read routine
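A compact sketch of such a data-management hub (all names hypothetical): an in-memory table of datasets fed by a user-supplied read routine, with LRU caching for interactive mode and prefetching of known-in-advance datasets for batch mode.

```python
from collections import OrderedDict

class DatasetManager:
    """In-memory dataset table acting as the hub for I/O optimization:
    cache misses call the user-supplied read routine; prefetch() loads
    ahead of need when the batch-mode access order is known."""

    def __init__(self, read_routine, capacity=4):
        self._read = read_routine          # user-supplied read routine
        self._cache = OrderedDict()        # dataset name -> buffer
        self._capacity = capacity

    def get(self, name):
        if name in self._cache:
            self._cache.move_to_end(name)  # LRU: mark as recently used
            return self._cache[name]
        data = self._read(name)            # cache miss: do the real read
        self._put(name, data)
        return data

    def prefetch(self, names):
        # Batch mode: access order is known, so load ahead of need.
        for name in names:
            if name not in self._cache:
                self._put(name, self._read(name))

    def _put(self, name, data):
        self._cache[name] = data
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict least recently used
```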


Related Work

• Overlapping I/O with computation
– Replacing synchronous calls with async calls [Agrawal et al. ICS ’96]
– Threads [Dickens et al. IPPS ’99, More et al. IPPS ’97]

• Automatic performance optimization
– Optimization with performance models [Chen et al. TSE ’00]
– Graybox optimization [Arpaci-Dusseau et al. SOSP ’01]


Roadmap

• Introduction
• Active buffering: hiding recurrent output cost
• Ongoing work: hiding recurrent input cost
• Conclusions


Conclusions

• If we can’t shrink it, hide it!

• Performance optimization can be done
– more actively
– at a higher level
– in a larger scope

• Make I/O part of data management


References

• [IPDPS ’03] Xiaosong Ma, Marianne Winslett, Jonghyun Lee, and Shengke Yu. Improving MPI-IO Output Performance with Active Buffering Plus Threads. 2003 International Parallel and Distributed Processing Symposium (IPDPS).

• [PDSECA ’03] Xiaosong Ma, Xiangmin Jiao, Michael Campbell, and Marianne Winslett. Flexible and Efficient Parallel I/O for Large-Scale Multi-component Simulations. 4th Workshop on Parallel and Distributed Scientific and Engineering Computing with Applications (PDSECA).

• [ICS ’02] Jonghyun Lee, Xiaosong Ma, Marianne Winslett, and Shengke Yu. Active Buffering Plus Compressed Migration: An Integrated Solution to Parallel Simulations’ Data Transport Needs. 16th ACM International Conference on Supercomputing (ICS).

• [IPDPS ’02] Xiaosong Ma, Marianne Winslett, Jonghyun Lee, and Shengke Yu. Faster Collective Output through Active Buffering. 2002 International Parallel and Distributed Processing Symposium (IPDPS).