TRANSCRIPT
Jack C. Wells, Director of Science, Oak Ridge Leadership Computing Facility / Oak Ridge National Laboratory
Join the Conversation #OpenPOWERSummit
Powering the Road to National HPC Leadership
ORNL is managed by UT-Battelle for the US Department of Energy
2018 OpenPOWER Summit, Las Vegas, 19 March 2018
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. Some of the work presented here is from the TOTAL and Oak Ridge National Laboratory collaboration, which is done under the CRADA agreement NFE-14-05227. Some of the experiments were supported by an allocation of advanced computing resources provided by the National Science Foundation. The computations were performed on Nautilus at the National Institute for Computational Sciences.
A Little About ORNL…
Oak Ridge, Tennessee
Oak Ridge National Laboratory is the largest US Department of Energy (DOE) open science laboratory.
What is a Leadership Computing Facility (LCF)?
• Collaborative DOE Office of Science user-facility program at ORNL and ANL
• Mission: Provide the computational and data resources required to solve the most challenging problems.
• 2 centers / 2 architectures to address the diverse and growing computational needs of the scientific community
• Highly competitive user allocation programs (INCITE, ALCC)
• Projects receive 10x to 100x more resource than at other generally available centers
• LCF centers partner with users to enable science & engineering breakthroughs (Liaisons, Catalysts)
ORNL has systematically delivered a series of leadership-class systems
On scope • On budget • Within schedule

Titan, five years old in October 2017, continues to deliver world-class science research in support of our user community. We will operate Titan through 2019, when it will be decommissioned.

[Chart: a 1,000-fold improvement in 8 years, spanning OLCF-1 through OLCF-3]
• 2004: Cray X1E Phoenix, 18.5 TF
• 2005: Cray XT3 Jaguar, 25 TF
• 2006: Cray XT3 Jaguar, 54 TF
• 2007: Cray XT4 Jaguar, 62 TF
• 2008: Cray XT4 Jaguar, 263 TF
• 2008: Cray XT5 Jaguar, 1 PF
• 2009: Cray XT5 Jaguar, 2.5 PF
• 2012: Cray XK7 Titan, 27 PF
We are building on this record of success to enable exascale in 2021

[Chart: a 500-fold improvement in 9 years, spanning OLCF-4 and OLCF-5]
• 2012: Cray XK7 Titan, 27 PF
• 2018: IBM Summit, 200 PF (OLCF-4)
• 2021: Frontier, ~1 EF (OLCF-5)
Summit, slated to be more powerful than any other existing supercomputer, is the Department of Energy's Oak Ridge National Laboratory's newest supercomputer for open science.

Coming in 2018: Summit will replace Titan as the OLCF's leadership supercomputer
Summit Overview

Components
• IBM POWER9: 22 cores, 4 threads/core, NVLink
• NVIDIA GV100: 7 TF, 16 GB @ 0.9 TB/s, NVLink

Compute Node
• 2 x POWER9, 6 x NVIDIA GV100
• NVMe-compatible PCIe 1600 GB SSD
• 25 GB/s EDR IB (2 ports)
• 512 GB DRAM (DDR4), 96 GB HBM2 (3D stacked), coherent shared memory

Compute Rack
• 18 compute servers
• 39.7 TB memory/rack, 55 kW max power/rack
• Warm water (70°F direct-cooled components), RDHX for air-cooled components

Compute System
• 200 PFLOPS, ~13 MW
• 256 compute racks, 4,608 compute nodes
• 10.2 PB total memory
• Mellanox EDR IB fabric

GPFS File System
• 250 PB storage
• 2.5 TB/s read, 2.5 TB/s write
Summit Node Overview

[Node diagram] Two POWER9 CPUs, each with 256 GB DDR4 DRAM (135 GB/s aggregate) and three attached NVIDIA GPUs (7 TF each, 16 GB HBM2 at 900 GB/s each). NVLink connects CPUs and GPUs at 50 GB/s per link; the X-Bus (SMP) joins the two CPUs at 64 GB/s; PCIe Gen4 runs at 16 GB/s; the NIC provides 2 x 12.5 GB/s EDR IB; node-local NVM delivers 6.0 GB/s read and 2.2 GB/s write.

Node totals: TF 42 (6x7 TF) • HBM 96 GB (6x16 GB) • DRAM 512 GB (2x16x16 GB) • NET 25 GB/s (2x12.5 GB/s) • MMsg/s 83

HBM and DRAM speeds are aggregate (read + write). All other speeds (X-Bus, NVLink, PCIe, IB) are bidirectional.
Coming in 2018: Summit will replace Titan as the OLCF's leadership supercomputer
• Many fewer nodes
• Much more powerful nodes
• Much more memory per node and total system memory
• Faster interconnect
• Much higher bandwidth between CPUs and GPUs
• Much larger and faster file system
Feature | Titan | Summit
Application performance | Baseline | 5-10x Titan
Number of nodes | 18,688 | 4,608
Node performance | 1.4 TF | 42 TF
Memory per node | 32 GB DDR3 + 6 GB GDDR5 | 512 GB DDR4 + 96 GB HBM2
NV memory per node | 0 | 1,600 GB
Total system memory | 710 TB | >10 PB DDR4 + HBM2 + non-volatile
System interconnect | Gemini (6.4 GB/s) | Dual-rail EDR IB (25 GB/s)
Interconnect topology | 3D torus | Non-blocking fat tree
Bisection bandwidth | 15.6 TB/s | 115.2 TB/s
Processors per node | 1 AMD Opteron™, 1 NVIDIA Kepler™ | 2 IBM POWER9™, 6 NVIDIA Volta™
File system | 32 PB, 1 TB/s, Lustre® | 250 PB, 2.5 TB/s, GPFS™
Power consumption | 9 MW | 13 MW
What is CORAL? The program through which Summit & Sierra are procured.
• Several DOE labs have strong supercomputing programs and facilities.
• To bring the next generation of leading supercomputers to these labs, DOE created CORAL (the Collaboration of Oak Ridge, Argonne, and Livermore) to jointly procure these systems and, in so doing, align strategy and resources across the DOE enterprise.
• The collaboration grouping of DOE labs was based on common acquisition timings. Collaboration is a win-win for all parties.
“Summit” System • “Sierra” System
OpenPOWER Technologies: IBM POWER CPUs, NVIDIA Tesla GPUs, Mellanox EDR 100 Gb/s InfiniBand
Paving the Road to Exascale Performance
OLCF Program to Ready Application Developers and Users
• We are preparing users through:
– Application Readiness and Early Science through the Center for Accelerated Application Readiness (CAAR)
– Training and web-based documentation
– Early access on SummitDev and the Summit Phase I system (already accepted)
– Access for the broader user base on the final, accepted Phase II system
• Goals:
– Early science achievements
– Demonstrate application readiness
– Prepare INCITE & ALCC proposals
– Harden Summit for full-user operations
Summit Early Science Program (ESP)
• We put out a Call for Proposals in December 2017
– Resulted in 62 Letters of Intent (LOI) received by year's end:
• 27 are from PIs at universities
• 32 are from PIs at national laboratories or research institutions (DOE, NASA)
• 14 are CAAR project-related LOIs
• 27 have had past INCITE allocations
• 9 have had past ALCC allocations
• 15 have connections to the US DOE Exascale Computing Project
• 9 are AI or deep learning-related
– Proposals are due at the beginning of June
– ESP users will gain full access to Summit for early science later this year
Summit will be the world's smartest supercomputer for open science
But what makes a supercomputer smart?

Summit provides unprecedented opportunities for the integration of artificial intelligence (AI) and scientific discovery. Here's why:
• GPU Brawn: Summit links more than 27,000 deep-learning-optimized NVIDIA GPUs with the potential to deliver exascale-level performance (a billion billion calculations per second) for AI applications.
• High-speed Data Movement: NVLink high-bandwidth technology built into all of Summit's processors supplies the next-generation "information superhighways" needed to train deep learning algorithms for challenging science problems quickly.
• Memory Where it Matters: Summit's sizable local memory gives AI researchers a convenient launching point for data-intensive tasks, an asset that allows for faster AI training and greater algorithmic accuracy.

One of Summit's 4,600 IBM AC922 nodes. Each node contains six NVIDIA Volta GPUs and two IBM POWER9 CPUs, giving scientists new opportunities to automate, accelerate, and drive understanding using artificial intelligence techniques.
Summit will be the world's smartest supercomputer for open science
But what can a smart supercomputer do?

Science challenges for a smart supercomputer:
• Identifying Next-generation Materials: By training AI algorithms to predict material properties from experimental data, longstanding questions about material behavior at atomic scales could be answered for better batteries, more resilient building materials, and more efficient semiconductors.
• Combating Cancer: Through the development of scalable deep neural networks, scientists at the US Department of Energy and the National Cancer Institute are making strides in improving cancer diagnosis and treatment.
• Deciphering High-energy Physics Data: With AI supercomputing, physicists can lean on machines to identify important pieces of information: data that's too massive for any single human to handle and that could change our understanding of the universe.
• Predicting Fusion Energy: Predictive AI software is already helping scientists anticipate disruptions to the volatile plasmas inside experimental reactors. Summit's arrival allows researchers to take this work to the next level and further integrate AI with fusion technology.
Summit is still under construction
• We expect to accept the machine in the summer of 2018, allow early users on this year, and allocate our first users through the INCITE program in January 2019.
• We are continuing node and file storage installation and software testing.
Questions?
Jack Wells