
The Impact of Operating System Structure on Memory System Performance

ACM SIGOPS, 1994

J. Bradley Chen (Carnegie Mellon University), Brian N. Bershad (University of Washington)

Sanghoon Han, Byeonghun Hyeon, Gyusun Lee

Previous works

• This paper is about tracing

• Old works?
  • Memory system structure
  • Multiprocessors
  • Subcomponent of the memory system

• How about different OS structures?

The two system structures

• System call vs. IPC

• Memory usage patterns will differ

• The causes of performance impact may differ depending on the system structure.

[Figure: monolithic kernel vs. microkernel]

Want to do

• Measure how time is spent within the memory structure (with as little error as possible)

• Compare the differences between the two systems

• Check whether the 7 assertions really hold

Target

• OS
  • Ultrix (monolithic)
  • Mach 3.0 (microkernel) with CMU's UNIX server
  • Both are derived from 4.2BSD UNIX

• Machine
  • DECstation 5000/200
  • Because it can run both OSes

• A workload of 13 programs

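Progress

• Modify the executable file
  • Why time is spent, and how much
  • Which addresses are accessed

• Result: a program that is 2x larger and 15x slower
  • Can the results be trusted?
  • How can the error be reduced?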

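Minimize distortion

1. Memory dilation
  • The traced program uses more memory than before
  • Execution behavior can change (more page-outs)
  • Use a very large physical memory
  • Instead, the (user) TLB is implemented through simulation

2. Time dilation
  • Relative to the program, the external environment appears 15x faster
  • Clock interrupts also arrive 15x faster (so the system clock is slowed to 1/15)
  • I/O appears faster (the idle thread runs less)
  • Multiply the time spent in the idle state by 15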

Minimize difference

3. Different page mapping strategy
  • Ultrix: deterministic, Mach: random
  • It is important
  • Used the deterministic strategy

Because,
  • It makes the experiment easier to reproduce
  • It reduces the differences between the two systems.

Result (Table)

Number of instructions, cache misses

What is MCPI?

• Memory Cycles Per Instruction

• MCPI = (CPU stall cycles due to the memory system) / (number of instructions)

• Only non-idle instructions are included

• The results can be verified with this.

Result (MCPI)

Verification

• Assumption
  • Idle loop CPI = 1

• Then total cycles = {idle instructions} + {non-idle instructions} x (1 + MCPI)

• Example (gcc)
  • Cycles = 63684000 + 29318000 x (1 + 0.434) = 105726012
  • Run time = Cycles / Clock speed (25 MHz) ≈ 4.23 seconds
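The verification arithmetic can be replayed in a few lines. This is only a sketch of the formula on the slide; the function name `estimated_runtime` and its parameter names are made up for illustration, and it assumes a base CPI of 1 for non-idle instructions, as the formula implies.

```python
def estimated_runtime(idle_instr, non_idle_instr, mcpi, clock_hz):
    """Return (total cycles, run time in seconds).

    Assumes idle-loop CPI = 1 and a base CPI of 1 for non-idle
    instructions, with MCPI added on top for memory stalls.
    """
    cycles = idle_instr + round(non_idle_instr * (1 + mcpi))
    return cycles, cycles / clock_hz


# gcc numbers from the slide, on a 25 MHz clock
cycles, seconds = estimated_runtime(63_684_000, 29_318_000, 0.434, 25_000_000)
print(cycles, round(seconds, 2))
```

If the computed run time is close to the measured wall-clock time, the traced instruction counts and the MCPI figure are consistent with each other.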

Comparing two systems

• Ultrix performed more disk I/O, because Mach 3.0 supports demand paging.

Comparing two systems

            User                                  System
Workload    Ultrix      Mach        Difference    Ultrix      Mach        Difference
sed         4335.04     4347.28     12.24         1368.96     3415.72     2046.76
egrep       41545.92    41876.97    331.05        1731.08     3152.03     1420.95
yacc        30831.06    31085.1     254.04        1967.94     3453.9      1485.96
gcc         22868.04    23000.96    132.92        6449.96     12938.04    6488.08
compress    13685.76    13748.94    63.18         3210.24     6177.06     2966.82
ab          582720.4    587104.3    4383.84       287011.6    611067.7    324056.16
espresso    132677.3    132293.8    383.54        2707.7      5512.24     2804.54
lisp        1249386     1251087     1700.43       38640.81    25532.38    13108.43
eqntott     1400225     1403689     3464.01       14143.69    14178.68    34.99
fpppp       244220.4    244588.1    367.7         21236.56    18409.86    2826.7
doduc       318111.8    318844      732.23        3213.25     6507.02     3293.77
liv         22317.76    22351.32    33.56         690.24      1426.68     736.44
tomcatv     1985646     1985534     111.87        20057.03    20055.9     1.13

The difference in non-idle system instruction counts is large.

Comparing two systems

Instructions are more expensive on Mach 3.0.

Relative system overheads

IPC is not the only reason Mach 3.0 is slower.

7 assertions

1. System and user locality

• Is system locality lower than user locality?

• Low locality → high cache miss rate

2. System dependency on caches

• More time is expected to be spent on the instruction cache.

• MCPI contribution?

This is especially severe on Mach 3.0.

Increased system activity

3. Competition between the user and system

• Do user and system compete for the cache?

• What if they used separate caches?

There is no large performance difference.

4. System self-interference

• Do instructions that must be in the cache at the same time compete with each other?

• Raising the cache associativity improves performance.
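A toy cache model makes the self-interference effect concrete. The sketch below is not the simulator used in the paper; the function `count_misses`, the cache geometry (4096 lines of 16 bytes), and the two-address trace are all assumptions chosen so that two pieces of code land on the same cache index.

```python
def count_misses(addresses, num_lines, ways, line_size=16):
    """Count misses in a set-associative cache with LRU replacement."""
    num_sets = num_lines // ways
    sets = [[] for _ in range(num_sets)]   # per set: resident tags, LRU first
    misses = 0
    for addr in addresses:
        block = addr // line_size
        index, tag = block % num_sets, block // num_sets
        lines = sets[index]
        if tag in lines:
            lines.remove(tag)              # hit: will be re-added as most recent
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)               # evict the least recently used tag
        lines.append(tag)
    return misses


# Two hypothetical routines exactly one cache size apart, called alternately
CACHE_BYTES = 4096 * 16
trace = [0x0000, CACHE_BYTES] * 100

direct_mapped = count_misses(trace, num_lines=4096, ways=1)
two_way = count_misses(trace, num_lines=4096, ways=2)
print(direct_mapped, two_way)
```

In the direct-mapped case the two addresses keep evicting each other, so every access misses; with two ways both lines stay resident after the first touch.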

5. Block operations

• Do system block memory operations take a lot of time?

• They are expected to account for a large share of total MCPI.

6. Streaming writes

• The write buffer is expected to perform poorly for system code

• Write buffer stalls are expected to be frequent when system instructions execute.

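7. Page mapping strategy

• The virtual page mapping strategy is expected to have a large effect on performance.

• Switching the earlier experiments to the random strategy should produce a performance difference.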
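Why the mapping matters can be seen by looking at which page-sized region ("bin") of a physically indexed cache each page lands in. The sketch below is illustrative only: the slides do not describe the actual Ultrix or Mach policies, so an identity mapping stands in for "deterministic" and uniform sampling from a pool of 1024 free frames stands in for "random". The cache and page sizes are also assumptions.

```python
import random

PAGE_SIZE = 4096
CACHE_SIZE = 64 * 1024                 # assumed physically indexed cache
NUM_BINS = CACHE_SIZE // PAGE_SIZE     # page-sized regions in the cache


def bin_load(frames):
    """Count how many of the given physical frames map to each cache bin."""
    load = [0] * NUM_BINS
    for frame in frames:
        load[frame % NUM_BINS] += 1
    return load


virtual_pages = range(NUM_BINS)        # 16 consecutive virtual pages

# Stand-in deterministic policy: virtual page v always gets frame v
deterministic = bin_load(v for v in virtual_pages)

# Stand-in random policy: frames drawn from a free pool (seeded to be repeatable)
rng = random.Random(0)
randomized = bin_load(rng.sample(range(1024), len(virtual_pages)))

print(deterministic)
print(randomized)
```

Pages that share a bin compete for the same cache lines, so the two policies produce different conflict patterns for the same program. The sketch shows only the mechanism and does not reproduce the measured MCPI differences.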

MCPI comparison (1)

[Figure: MCPI with deterministic vs. random page mapping]

MCPI comparison (2)

[Figure: MCPI with deterministic vs. random page mapping]

• It causes a significant effect

• In most cases, random is better

Conclusion

• Tracing makes it possible to identify where system overhead comes from.

• IPC overhead accounted for a smaller share than expected.

• 6 of the 7 assertions were confirmed to actually hold.

Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources

Jeff Bonwick, Jonathan Adams

USENIX Annual Technical Conference, General Track, '01

Contents

• Introduction
• Slab Allocator
• Magazine
• Vmem
• Conclusion
• Critique

Introduction

• The slab allocator has continued to evolve ('94~)
  • Per-CPU memory allocation
  • More general resource allocation
  • Available as a user-level library

Slab Allocator

• Implemented in Sun Microsystems' Solaris in '94

Slab: one or more pages of virtually contiguous memory

Object: prepared space for frequently used objects (e.g., a structure)

Keep released objects on a free list -> fewer instructions for allocations and frees
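The free-list idea can be sketched in a few lines. This is a toy model, not the Solaris implementation: it ignores slabs and pages entirely, and the class name `ObjectCache` and its fields are hypothetical.

```python
class ObjectCache:
    """Toy object cache: constructed objects are recycled through a free list."""

    def __init__(self, constructor):
        self.constructor = constructor
        self.freelist = []       # constructed objects waiting for reuse
        self.constructed = 0     # how many times the constructor had to run

    def alloc(self):
        if self.freelist:
            return self.freelist.pop()   # reuse: no construction cost
        self.constructed += 1
        return self.constructor()

    def free(self, obj):
        self.freelist.append(obj)        # keep the object in its constructed state


cache = ObjectCache(dict)
first = cache.alloc()
cache.free(first)
second = cache.alloc()                   # served from the free list
print(second is first, cache.constructed)
```

The second allocation returns the same object without running the constructor again, which is where the saved instructions come from.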

Multiprocessor

Magazine

• Background: per-CPU memory allocation
  • The slab allocator needs a lock to protect each cache's slab list
  • Multiprocessor scalability is needed

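Magazine

[Figure: a slab of objects feeds magazines of M objects each; every per-CPU cache holds magazines]

• Allocate and free operate on the magazine in the per-CPU cache
• Magazine full + free, or magazine empty + allocate: trade the magazine in at the depot (needs a lock)
• To prevent frequent trades, each per-CPU cache also keeps the previous magazine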
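The trade logic above can be sketched as a small single-threaded model. It is a simplified illustration rather than the Solaris code: `Depot`, `CpuCache`, `backend_alloc`, the magazine size `M = 4`, and the choice to create a fresh empty magazine when the depot has none are all assumptions of this sketch.

```python
import threading

M = 4  # magazine size; hypothetical value for illustration


class Depot:
    """Global store of full and empty magazines, protected by one lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.full, self.empty = [], []
        self.trades = 0  # successful exchanges, each of which took the lock

    def trade(self, magazine, want_full):
        """Hand in `magazine`; return a full (or empty) one, or None."""
        with self.lock:
            if want_full and not self.full:
                return None
            self.trades += 1
            if want_full:
                self.empty.append(magazine)
                return self.full.pop()
            self.full.append(magazine)
            # Sketch: make a fresh empty magazine when the depot has none.
            return self.empty.pop() if self.empty else []


class CpuCache:
    """Per-CPU layer: a loaded magazine plus the previously loaded one."""

    def __init__(self, depot, backend_alloc):
        self.depot, self.backend_alloc = depot, backend_alloc
        self.loaded, self.previous = [], []

    def alloc(self):
        while True:
            if self.loaded:                      # fast path, no lock
                return self.loaded.pop()
            if len(self.previous) == M:          # swap instead of trading
                self.loaded, self.previous = self.previous, self.loaded
                continue
            full = self.depot.trade(self.previous, want_full=True)
            if full is None:
                return self.backend_alloc()      # fall back to the slab layer
            self.previous, self.loaded = self.loaded, full

    def free(self, obj):
        while True:
            if len(self.loaded) < M:             # fast path, no lock
                self.loaded.append(obj)
                return
            if not self.previous:                # swap instead of trading
                self.loaded, self.previous = self.previous, self.loaded
                continue
            empty = self.depot.trade(self.previous, want_full=False)
            self.previous, self.loaded = self.loaded, empty


depot = Depot()
made = []                                        # objects created by the backend

def backend_alloc():
    made.append(object())
    return made[-1]

cpu = CpuCache(depot, backend_alloc)
first = [cpu.alloc() for _ in range(3 * M)]      # cold start: all from the backend
for obj in first:
    cpu.free(obj)
second = [cpu.alloc() for _ in range(3 * M)]     # served from magazines
print(len(made), depot.trades)
```

After the cold start, twelve frees followed by twelve allocations touch the depot lock only twice, because the previous magazine absorbs every other boundary crossing.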

Magazine (cont.)

• Magazine size (M)
  • The CPU layer's miss rate can be made as low as desired by increasing M (starting from an initial value)
  • Observe the contention rate on the depot lock (increment a contention count)
  • If the contention rate exceeds a fixed threshold, increase the magazine size

• Depot size
  • If the depot's full magazine list varies between 37 and 47 over a given period, then the working set is 10 magazines (the remainder are eligible for reclaiming)
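Both tuning rules reduce to simple arithmetic. The helpers below are a sketch under assumptions: the names `depot_working_set` and `next_magazine_size`, the one-step growth, and all threshold parameters are hypothetical, and reading "the remainder" as the minimum observed count is an interpretation of the slide.

```python
def depot_working_set(full_counts):
    """Return (working set, reclaimable) from samples of the full-magazine count.

    The working set is the swing between the lowest and highest counts seen;
    magazines below the lowest count were never needed in the period.
    """
    low, high = min(full_counts), max(full_counts)
    return high - low, low


def next_magazine_size(size, contentions, interval_s, threshold_per_s, max_size):
    """Grow the magazine by one step when depot lock contention is too high."""
    if contentions / interval_s > threshold_per_s and size < max_size:
        return size + 1
    return size


# Full-magazine counts sampled over a period, ranging from 37 to 47
working_set, reclaimable = depot_working_set([40, 37, 47, 45])
print(working_set, reclaimable)
```

With the slide's numbers the working set comes out as 10 magazines, leaving 37 that could be reclaimed.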

Magazine Performance

• Scalability
  • 333 MHz 16-CPU Starfire

• System-level benchmarks
  • SPECweb99
  • TPC-C
  • Kenbus

Vmem

• Background: more general resource allocation
  • Almost all versions of Unix have a resource map allocator called rmalloc()
  • Linear-time algorithm
    • Maintains a list of free segments
    • Coalesces segments to reduce fragmentation
    • Uses insertion sort to return a segment to the free segment list -> O(n)

• Objectives
  • Constant-time performance -> O(1)
  • Linear scalability
  • Low fragmentation
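The linear cost of the resource map approach is visible in a sketch of its free path. This is an illustration of the general technique, not the code of any particular Unix; the function name `rm_free` and the (address, size) tuple representation are assumptions.

```python
def rm_free(free_list, addr, size):
    """Return segment [addr, addr + size) to an address-sorted free list.

    free_list holds (addr, size) tuples sorted by address. Finding the
    insertion point is a linear scan, which is where the O(n) comes from.
    """
    i = 0
    while i < len(free_list) and free_list[i][0] < addr:
        i += 1
    free_list.insert(i, (addr, size))

    # Coalesce with the following segment if the two are adjacent
    if i + 1 < len(free_list) and addr + size == free_list[i + 1][0]:
        size += free_list.pop(i + 1)[1]
        free_list[i] = (addr, size)

    # Coalesce with the preceding segment if the two are adjacent
    if i > 0:
        prev_addr, prev_size = free_list[i - 1]
        if prev_addr + prev_size == addr:
            free_list[i - 1:i + 1] = [(prev_addr, prev_size + size)]
    return free_list


merged = rm_free([(0, 10), (30, 10)], 10, 20)   # fills the gap between neighbors
apart = rm_free([(0, 10)], 20, 5)               # no neighbor to merge with
print(merged, apart)
```

Freeing the middle segment merges all three pieces into one, but only after walking the list to find where it belongs.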

Vmem Structure

[Figure: arenas built on virtual memory: heap from vmem_create(“heap”…), kmem_va from vmem_create(“kmem_va”…), and kmem_default from vmem_create(“kmem_default”…). The kmem_default span is divided into free and allocated segments. Each segment has a boundary tag recording its size and is linked on the segment list; the figure also shows a hash list and a freelist indexed by powers of two (2^0, 2^1, 2^2, 2^3, 2^4).]

Vmem Performance

• Constant time
  • The hash and freelists make it possible to find an allocated or free segment quickly -> O(1)
  • Regardless of arena fragmentation

• System-level benchmarks
  • LADDIS
  • Web service
  • I/O bandwidth
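How power-of-two freelists and a hash give quick lookups can be sketched as follows. This is a simplified model, not the vmem implementation: the `Arena` class and helper names are hypothetical, a Python dict stands in for the hash table, coalescing through boundary tags is omitted, and the rule of taking a segment from the next larger list when the request is not a power of two is an assumption of this sketch.

```python
def freelist_index(size):
    """Index n of the freelist holding segments with 2**n <= size < 2**(n+1)."""
    return size.bit_length() - 1


def instant_fit_index(size):
    """Smallest freelist on which every segment is large enough for `size`."""
    n = freelist_index(size)
    return n if size == 1 << n else n + 1


class Arena:
    def __init__(self, base, size):
        self.freelists = {}     # n -> list of free (addr, size) segments
        self.allocated = {}     # addr -> size; stands in for the hash table
        self._add_free(base, size)

    def _add_free(self, addr, size):
        self.freelists.setdefault(freelist_index(size), []).append((addr, size))

    def alloc(self, size):
        start = instant_fit_index(size)
        # The number of freelists is bounded by the word size, not by
        # how fragmented the arena is.
        usable = [n for n, segs in self.freelists.items() if n >= start and segs]
        if not usable:
            return None
        addr, seg_size = self.freelists[min(usable)].pop()
        if seg_size > size:                       # split off the remainder
            self._add_free(addr + size, seg_size - size)
        self.allocated[addr] = size
        return addr

    def free(self, addr):
        size = self.allocated.pop(addr)           # hash lookup by address
        self._add_free(addr, size)                # coalescing omitted in this sketch


arena = Arena(0, 64)
a = arena.alloc(10)
b = arena.alloc(16)
arena.free(a)
print(a, b, arena.allocated)
```

Any segment taken from the chosen list is guaranteed to fit, so allocation never has to walk a list of candidates.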

User-Level Memory Allocation

mtmalloc: selecting a freelist was simply round-robin

mtmalloc (fixed): selects a per-CPU freelist by thread ID hashing, as in libumem

Summary

• Magazine
  • Provides efficient object caching with very low latency and linear scaling

• Vmem
  • Guarantees constant-time performance regardless of allocation size or arena fragmentation

Critique
