The Impact of Operating System Structure on Memory System Performance
ACM SIGOPS, 1994
J. Bradley Chen (Carnegie Mellon University), Brian N. Bershad (University of Washington)
Sanghoon Han, Byeonghun Hyeon, Gyusun Lee
Previous works
• This paper is about tracing
• Old works?
  • Memory system structure
  • Multiprocessors
  • Subcomponents of the memory system
• How about different OS structures?
The two system structures
• System call vs. IPC
• Memory usage patterns will differ
• The causes of performance impact may differ depending on the system structure
(Figure labels: Monolithic, Micro-kernel)
Want to do
• Measure how time is spent within the memory system (with as little error as possible)
• Compare the differences between the two systems
• Check whether the 7 assertions really hold
Target
• OS
  • Ultrix (monolithic)
  • Mach 3.0 (microkernel) with CMU's UNIX server
  • Both are derived from 4.2BSD UNIX
• Machine
  • DECstation 5000/200
  • Because it can run both OSes
• A workload of 13 programs
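Progress
• Modify the executable file to record:
  • Why and for how long time is spent
  • Which addresses are accessed
• Result: a program that is 2 times larger and 15 times slower
  • Can the results be trusted?
  • How can the error be reduced?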
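Minimize distortion
1. Memory dilation
  • The traced program uses more memory than before
  • Execution behavior may change (more pageouts)
  • Use a very large physical memory
  • Instead, implement the (user) TLB through simulation
2. Time dilation
  • The external environment effectively becomes 15 times faster
  • Clock interrupts also arrive 15 times faster (slow the system clock to 1/15)
  • I/O appears faster (the idle thread runs less)
  • Multiply the time spent idle by 15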
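The two time-dilation corrections can be sketched as simple scaling. This is a minimal illustration only, assuming a uniform slowdown factor of 15; the function names and the example clock rate are hypothetical and do not come from the paper's tooling.

```python
DILATION = 15  # slowdown of the instrumented system, as stated on the slide


def scaled_clock_hz(native_hz: float, dilation: int = DILATION) -> float:
    """Clock interrupt rate to use while tracing, so that the slowed-down
    system sees roughly as many ticks per unit of work as the native one."""
    return native_hz / dilation


def corrected_idle_cycles(traced_idle_cycles: int, dilation: int = DILATION) -> int:
    """Idle time is under-counted because I/O appears faster relative to the
    slowed CPU, so scale the measured idle cycles back up."""
    return traced_idle_cycles * dilation


# Hypothetical numbers, for illustration only.
print(scaled_clock_hz(150.0))
print(corrected_idle_cycles(1000))
```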
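Minimize difference
3. Different page mapping strategy
  • Ultrix: deterministic, Mach: random
  • The difference matters
  • A deterministic strategy was used on both systems
Because:
  • It makes the experiments easier to reproduce
  • It reduces the differences between the two systems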
Result (Table)
Number of instructions, cache misses
What is MCPI?
• Memory Cycles Per Instruction
• MCPI = (CPU stall cycles due to the memory system) / (instructions)
• Only non-idle instructions are included
• The results can be verified with this
Result (MCPI)
Verification
• Assumption
  • Idle loop CPI = 1
• Then total cycles = {idle instructions} + {non-idle instructions} x (1 + MCPI)
• Example (gcc)
  • Cycles = 63,684,000 + 29,318,000 x (1 + 0.434) = 105,726,012
  • Runtime = Cycles / Clock speed (25 MHz) ≈ 4.23 seconds
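The gcc check can be reproduced in a few lines. This is a small sketch using only the numbers on the slide; the function name `total_cycles` is made up for illustration. It gives a run time of roughly 4.2 seconds.

```python
CLOCK_HZ = 25_000_000  # 25 MHz clock, as stated on the slide


def total_cycles(idle_instructions: int, non_idle_instructions: int, mcpi: float) -> float:
    """Idle instructions cost 1 cycle each (assumed idle-loop CPI of 1);
    every non-idle instruction costs 1 cycle plus its memory stall share."""
    return idle_instructions + non_idle_instructions * (1 + mcpi)


cycles = total_cycles(63_684_000, 29_318_000, 0.434)
runtime_s = cycles / CLOCK_HZ
print(round(cycles), runtime_s)
```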
Comparing two systems
Ultrix performed more disk I/O, because Mach 3.0 supports demand paging.
Comparing two systems
Non-idle instructions (in thousands)
                       User                                  System
Workload    Ultrix      Mach        Difference   Ultrix      Mach        Difference
sed         4335.04     4347.28     12.24        1368.96     3415.72     2046.76
egrep       41545.92    41876.97    331.05       1731.08     3152.03     1420.95
yacc        30831.06    31085.1     254.04       1967.94     3453.9      1485.96
gcc         22868.04    23000.96    132.92       6449.96     12938.04    6488.08
compress    13685.76    13748.94    63.18        3210.24     6177.06     2966.82
ab          582720.4    587104.3    4383.84      287011.6    611067.7    324056.16
espresso    132677.3    132293.8    383.54       2707.7      5512.24     2804.54
lisp        1249386     1251087     1700.43      38640.81    25532.38    13108.43
eqntott     1400225     1403689     3464.01      14143.69    14178.68    34.99
fpppp       244220.4    244588.1    367.7        21236.56    18409.86    2826.7
doduc       318111.8    318844      732.23       3213.25     6507.02     3293.77
liv         22317.76    22351.32    33.56        690.24      1426.68     736.44
tomcatv     1985646     1985534     111.87       20057.03    20055.9     1.13
The difference in non-idle system instruction counts is large.
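To make the size of the gap concrete, the system-instruction columns can be turned into Mach-to-Ultrix ratios. This is a small sketch over four rows copied from the table; the helper name `mach_to_ultrix_ratio` is made up for illustration.

```python
# (Ultrix, Mach) non-idle system instruction counts, copied from the table.
system_instr = {
    "sed": (1368.96, 3415.72),
    "gcc": (6449.96, 12938.04),
    "lisp": (38640.81, 25532.38),
    "tomcatv": (20057.03, 20055.9),
}


def mach_to_ultrix_ratio(ultrix: float, mach: float) -> float:
    """How many system instructions Mach executes per Ultrix system instruction."""
    return mach / ultrix


for name, (ultrix, mach) in system_instr.items():
    print(f"{name}: {mach_to_ultrix_ratio(ultrix, mach):.2f}")
```

For several workloads Mach executes about twice as many system instructions or more, while a few (lisp, tomcatv) go the other way or are nearly equal.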
Comparing two systems
Instructions are more expensive on Mach 3.0.
Relative system overheads
IPC is not the only reason Mach 3.0 is slower.
7 assertions
1. System and user locality
• Does the system have lower locality than user code?
• Low locality → high cache miss rate
2. System dependency on caches
• The system will spend more of its time on the instruction cache.
• MCPI contribution?
This is especially severe on Mach 3.0.
Increased system activity
3. Competition between the user and system
• Do user and system code compete for the cache?
• What if they used separate caches?
There is no large performance difference.
4. System self-interference
• Do instructions that need to be in the cache at the same time compete with each other?
• Increasing cache associativity improves performance.
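Why associativity helps with self-interference can be shown with a toy cache model. This is an illustrative sketch, not the simulator used in the paper; the cache geometry and the two addresses are hypothetical, chosen so that both map to the same set.

```python
from collections import OrderedDict


def count_misses(addresses, num_lines, ways, line_size=16):
    """Count misses in a set-associative cache with LRU replacement.
    ways=1 models a direct-mapped cache."""
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // line_size
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used block
            lines[block] = True
    return misses


CACHE_LINES = 64
LINE_SIZE = 16
# Two hypothetical code addresses exactly one cache size apart.
a = 0x1000
b = a + CACHE_LINES * LINE_SIZE
trace = [a, b] * 100  # the two routines call each other repeatedly

direct_mapped = count_misses(trace, CACHE_LINES, ways=1, line_size=LINE_SIZE)
two_way = count_misses(trace, CACHE_LINES, ways=2, line_size=LINE_SIZE)
print(direct_mapped, two_way)
```

In the direct-mapped case the two blocks keep evicting each other; with two ways they coexist and only the cold misses remain.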
5. Block operations
• Is a lot of time spent in system block memory operations?
• They would account for a large share of the total MCPI.
6. Streaming writes
• The write buffer will perform poorly for system code
• Write buffer stalls will be frequent when executing system instructions.
7. Page mapping strategy
• The virtual page mapping strategy will have a large effect on performance.
• Repeating the earlier experiments with a random strategy should change the performance.
MCPI comparison (1)
(Figure: MCPI with deterministic vs. random page mapping)
MCPI comparison (2)
(Figure: MCPI with deterministic vs. random page mapping)
• The mapping strategy has a significant effect
• In most cases, random is better
Conclusion
• Tracing can be used to identify where system overhead arises.
• IPC overhead accounted for a smaller share than expected.
• 6 of the 7 assertions were confirmed to actually hold.
Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources
Jeff Bonwick, Jonathan Adams
USENIX Annual Technical Conference, General Track, 2001
Contents
• Introduction
• Slab Allocator
• Magazine
• Vmem
• Conclusion
• Critique
Introduction
• The slab allocator has continued to evolve (since '94)
  • Per-CPU memory allocation
  • More general resource allocation
  • Available as a user-level library
Slab Allocator
• Implemented in Sun Microsystems' Solaris ('94)
Slab: one or more pages of virtually contiguous memory
Object: pre-constructed space for frequently used objects (e.g., structures)
Freed objects are kept on a freelist -> fewer instructions for allocations and frees
Multiprocessor
Magazine
• Background – per-CPU memory allocation
  • The slab allocator needs a lock to protect each cache's slab list
  • Multiprocessor scalability is needed
(Figure: a slab of objects; each magazine holds M objects)
Magazine
• Per-CPU cache: allocate and free operate on the CPU's own magazine
• When the magazine is full on a free, or empty on an allocate, trade it in at the depot (needs a lock)
• To prevent frequent trades, each CPU also keeps its previous magazine
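The interplay of the loaded magazine, the previous magazine, and the depot can be sketched as follows. This is a simplified single-threaded illustration, not the Solaris implementation: the class and method names are hypothetical, a magazine is modeled as a plain list of at most M objects, and `slab_alloc` stands in for the underlying slab layer.

```python
import threading


class Depot:
    """Global store of full and empty magazines, protected by one lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.full, self.empty = [], []
        self.trades = 0  # how often a CPU had to take the depot lock


class CpuCache:
    """Per-CPU layer: a loaded magazine plus the previously loaded one."""
    def __init__(self, depot, m, slab_alloc):
        self.depot, self.m, self.slab_alloc = depot, m, slab_alloc
        self.loaded, self.previous = [], []

    def alloc(self):
        if not self.loaded:
            if len(self.previous) == self.m:       # previous is full: just swap
                self.loaded, self.previous = self.previous, self.loaded
            elif not self._trade_for_full():       # depot has nothing either
                return self.slab_alloc()
        return self.loaded.pop()

    def free(self, obj):
        if len(self.loaded) == self.m:
            if not self.previous:                  # previous is empty: just swap
                self.loaded, self.previous = self.previous, self.loaded
            else:
                self._trade_for_empty()
        self.loaded.append(obj)

    def _trade_for_full(self):
        with self.depot.lock:
            if not self.depot.full:
                return False
            self.depot.empty.append(self.previous)  # previous is empty here
            self.previous, self.loaded = self.loaded, self.depot.full.pop()
            self.depot.trades += 1
            return True

    def _trade_for_empty(self):
        with self.depot.lock:
            fresh = self.depot.empty.pop() if self.depot.empty else []
            self.depot.full.append(self.previous)   # previous is full here
            self.previous, self.loaded = self.loaded, fresh
            self.depot.trades += 1


calls = []  # records every fall-through to the (stand-in) slab layer

def slab_alloc():
    calls.append(object())
    return calls[-1]

depot = Depot()
cpu0 = CpuCache(depot, m=2, slab_alloc=slab_alloc)
first = [cpu0.alloc() for _ in range(6)]
for obj in first:
    cpu0.free(obj)
second = [cpu0.alloc() for _ in range(6)]
print(len(calls), depot.trades)
```

Keeping the previous magazine means that a CPU alternating between allocate and free at a magazine boundary swaps locally instead of taking the depot lock each time.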
Magazine (cont.)
• Magazine size (M)
  • The CPU layer's miss rate can be made as low as desired by increasing M, starting from an initial value
  • Observe the contention rate on the depot lock (increment a contention count)
  • If the contention rate exceeds a fixed threshold, increase the magazine size
• Depot size
  • If the depot's full magazine list varies between 37 and 47 magazines over a given period, then the working set is 10 magazines (the remainder are eligible for reclaiming)
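Both tuning rules reduce to a few lines of arithmetic. This is a sketch under stated assumptions: the function names, the threshold, and the upper bound on M are hypothetical, and the reclaimable count is taken to be the minimum list length seen during the period (the magazines that were never needed).

```python
def depot_working_set(full_list_sizes):
    """Return (working set, reclaimable magazines) from samples of the
    depot's full-magazine list length over one period."""
    lowest, highest = min(full_list_sizes), max(full_list_sizes)
    return highest - lowest, lowest


def next_magazine_size(m, contention_rate, threshold, m_max):
    """Grow the magazine size by one when depot lock contention is too high.
    `threshold` and `m_max` are illustrative parameters."""
    if contention_rate > threshold and m < m_max:
        return m + 1
    return m


# The slide's example: the list length varies between 37 and 47.
print(depot_working_set([37, 41, 47, 40]))
```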
Magazine Performance
• Scalability
  • 333 MHz 16-CPU Starfire
• System-level benchmarks
  • SPECweb99
  • TPC-C
  • Kenbus
Vmem
• Background – more general resource allocation
  • Almost all versions of Unix have a resource map allocator called rmalloc()
  • Linear-time algorithm
    • Maintains a list of free segments
    • Coalesces segments to reduce fragmentation
    • Uses insertion sort to return a segment to the free segment list -> O(n)
• Objectives
  • Constant-time performance -> O(1)
  • Linear scalability
  • Low fragmentation
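The linear cost of the resource-map approach comes from walking an address-sorted list on every free. The following is an illustrative sketch of that idea, not the actual rmalloc() code; the function name and the (start, size) tuple layout are assumptions.

```python
def free_segment(free_list, start, size):
    """Return [start, start + size) to an address-sorted list of
    (start, size) tuples, coalescing with adjacent free segments.
    Finding the insertion point is a linear scan, hence O(n)."""
    i = 0
    while i < len(free_list) and free_list[i][0] < start:
        i += 1
    free_list.insert(i, (start, size))

    # Merge with the following segment if it begins where this one ends.
    if i + 1 < len(free_list) and start + size == free_list[i + 1][0]:
        _, next_size = free_list.pop(i + 1)
        free_list[i] = (start, size + next_size)

    # Merge with the preceding segment if it ends where this one begins.
    if i > 0:
        prev_start, prev_size = free_list[i - 1]
        cur_start, cur_size = free_list[i]
        if prev_start + prev_size == cur_start:
            free_list[i - 1] = (prev_start, prev_size + cur_size)
            free_list.pop(i)
    return free_list


print(free_segment([(0, 4), (8, 4)], 4, 4))
```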
Vmem Structure
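(Figure: arenas layered over virtual memory: heap, created with vmem_create(“heap”…); kmem_va, created with vmem_create(“kmem_va”…); and kmem_default, created with vmem_create(“kmem_default”…). Within an arena, a segment list links free and allocated segments, each described by a boundary tag holding a pointer and a size. Free segments are also kept on power-of-two freelists (2^0, 2^1, 2^2, 2^3, 2^4), and allocated segments on a hash list.)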
Vmem Performance
• Constant time
  • The hash list and freelists make it possible to find allocated or free segments quickly -> O(1)
  • Regardless of arena fragmentation
• System-level benchmarks
  • LADDIS
  • Web service
  • I/O bandwidth
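How power-of-two freelists and a hash of allocated segments avoid list walks can be sketched as below. This is a simplified illustration, not the vmem implementation: the `Arena` class and its methods are hypothetical, coalescing of freed neighbors is omitted for brevity, and the scan over freelists is bounded by the fixed number of lists rather than by the number of segments.

```python
class Arena:
    """Toy arena: freelist n holds free segments of size [2^n, 2^(n+1))."""
    def __init__(self, base, size, max_order=32):
        self.freelists = [[] for _ in range(max_order)]
        self.allocated = {}  # hash: address -> size of the allocated segment
        self._add_free(base, size)

    def _add_free(self, addr, size):
        # size.bit_length() - 1 is floor(log2(size)).
        self.freelists[size.bit_length() - 1].append((addr, size))

    def alloc(self, size):
        # Start at the first list whose segments are all guaranteed to be
        # large enough, so no segment ever has to be inspected and rejected.
        order = (size - 1).bit_length()  # ceil(log2(size))
        for n in range(order, len(self.freelists)):
            if self.freelists[n]:
                addr, seg_size = self.freelists[n].pop()
                if seg_size > size:      # return the unused tail to a freelist
                    self._add_free(addr + size, seg_size - size)
                self.allocated[addr] = size
                return addr
        return None                      # arena exhausted

    def free(self, addr):
        size = self.allocated.pop(addr)  # hash lookup instead of a list walk
        self._add_free(addr, size)


arena = Arena(base=0, size=1024)
x = arena.alloc(100)
y = arena.alloc(100)
arena.free(x)
z = arena.alloc(50)
print(x, y, z)
```

The cost of this policy is that a free segment slightly larger than the request can be skipped when it sits on a lower list; in exchange, the work per allocation does not depend on how fragmented the arena is.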
User-Level Memory Allocation
mtmalloc: a freelist was selected simply by round-robin
mtmalloc (fixed): a per-CPU freelist is selected by hashing the thread ID, as in libumem
Summary
• Magazine
  • Provides efficient object caching with very low latency and linear scaling
• Vmem
  • Guarantees constant-time performance regardless of allocation size or arena fragmentation
Critique