![Page 1: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/1.jpg)
Advanced HPC User Group
Meeting 2019-12-06
Jan Moren
Scientific Computing and Data Analysis Section
![Page 2: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/2.jpg)
Say hello to Deigo
● High-core AMD and Intel nodes
● Next-generation networking
● Ultra-high-speed storage
![Page 3: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/3.jpg)
AMD: 456 nodes, 2 x EPYC 7702 2.0GHz, 128 cores and 512GB memory per node: 58368 cores
Intel: 192 nodes, 2 x Xeon 6230 2.1GHz, 40 cores and 512GB memory per node: 7680 cores
Deigo: 66048 cores
Sango: 9600 cores
Deigo has 688% of Sango's core count (588% more)
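The totals on this slide follow directly from the node counts. A quick shell check of the arithmetic (the variable names are illustrative only); note that 66048 cores is 688% of Sango's 9600, which is 588% more:

```shell
# Node counts and cores per node, as given on the slide
amd_nodes=456;   amd_cores_per_node=128
intel_nodes=192; intel_cores_per_node=40
sango_cores=9600

amd_cores=$((amd_nodes * amd_cores_per_node))
intel_cores=$((intel_nodes * intel_cores_per_node))
deigo_cores=$((amd_cores + intel_cores))

# Deigo's core count as an integer percentage of Sango's
percent_of_sango=$((deigo_cores * 100 / sango_cores))

echo "AMD: $amd_cores, Intel: $intel_cores, Deigo: $deigo_cores"
echo "Deigo has ${percent_of_sango}% of Sango's core count"
```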
![Page 4: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/4.jpg)
AMD: 456 nodes, 58368 cores (88%)
Intel: 192 nodes, 7680 cores (12%)
Why both Intel and AMD?
● Per core:
  ● AMD is a bit faster for integer work and I/O
  ● Intel is a bit faster at floating point (esp. AVX512)
  ● Depends a lot on your code
● Per node:
  ● AMD trounces Intel
![Page 5: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/5.jpg)
Why both Intel and AMD?
● But: the Intel MKL library can perform badly on non-Intel CPUs.
  ● Intel checks the CPU maker (not capability) and selects operations based on that.
  ● Some BLAS operations, matrix multiplication in particular, are especially bad
  ● Some physics codes and Matlab are affected
● Can override the maker check with:
    export MKL_DEBUG_CPU_TYPE=5
MKL up to 600% faster on AMD...
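A minimal sketch of applying the override only on AMD nodes, so that Intel nodes keep MKL's default behaviour. The helper name `is_amd_cpu` is made up for illustration, and reading the vendor string from `/proc/cpuinfo` assumes Linux. Newer MKL releases are reported to ignore this variable, so treat the effect as version-dependent.

```shell
# Hypothetical helper: succeeds if the given cpuinfo file
# (default /proc/cpuinfo) reports an AMD processor.
is_amd_cpu() {
    grep -q 'AuthenticAMD' "${1:-/proc/cpuinfo}" 2>/dev/null
}

# Set the override from the slide only when running on an AMD CPU.
if is_amd_cpu; then
    export MKL_DEBUG_CPU_TYPE=5
fi
```

Something like this could go in a job script before the program that uses MKL is started.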
![Page 6: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/6.jpg)
[Diagram: partitions compute, test, “CFD”, largemem, other... Work in progress]
Intel: 192 nodes, 12% of cores
● Task-specific partitions
  ● not per-user or per-unit
  ● physics, MD
  ● Intel-dependent code
  ● largemem
● Overlap with low-priority test partition
AMD: 456 nodes, 88% of cores
● General-purpose compute partition
  ● lots of cores, lots of users
  ● user memory and core limits
  ● benefits from rebuilding your code
![Page 7: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/7.jpg)
[Diagram: storage systems and the nodes that access them, with arrows labelled “read and write” and “read-only”]
● New /work: 500TB SSD, 10TB per unit
● Old /work
● Bucket: +6 PB
● Login nodes
● Compute nodes
![Page 8: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/8.jpg)
New workflow
● Similar to Saion:
  ● /work is only scratch, not storage
  ● 10TB per unit is a hard limit
  ● read access to Bucket from compute nodes
1. Read your input from Bucket
  ● Read directly or copy to /work beforehand
2. Use /work for ongoing computation
3. Copy results back to Bucket
4. Clean up /work
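The four workflow steps can be sketched as a small shell function. The function name `run_job`, the file names and the `sort` command are placeholders; the two directory arguments stand in for your unit's Bucket and /work directories, whose exact paths the slides do not give.

```shell
# Sketch of the four workflow steps, under the assumptions stated above.
run_job() {
    job_bucket="$1"             # where input and results live (Bucket)
    job_scratch="$2/job.$$"     # per-job scratch directory (under /work)

    mkdir -p "$job_scratch" || return 1

    # 1. read input from Bucket (here: copy it to scratch beforehand)
    cp "$job_bucket/input.txt" "$job_scratch/" || return 1

    # 2. use scratch for the ongoing computation (sort is a placeholder)
    sort "$job_scratch/input.txt" > "$job_scratch/result.txt" || return 1

    # 3. copy results back to Bucket
    cp "$job_scratch/result.txt" "$job_bucket/" || return 1

    # 4. clean up scratch
    rm -rf "$job_scratch"
}
```

Since the slides describe Bucket as readable from the compute nodes, step 3 presumably has to run somewhere Bucket is writable; the sketch ignores that split.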
![Page 9: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/9.jpg)
Old /work
● Will disappear during 2020
  ● Soon out of warranty
  ● Full, not expandable
● Will be available read-only
  ● You must copy the data you need to Bucket
  ● The rest will be archived and effectively unavailable once shut down
![Page 10: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/10.jpg)
Provided software
● CentOS 8
● GCC 8 (or 9), AOCC
● BLIS, LibFLAME
● User software (modules)
  ● Popular open source modules will be rebuilt.
  ● Other modules will run directly or through the “sango” container, available as a module (best effort)
Your software
● For best results, rebuild with modern compilers and libraries
● Many will run OK unchanged
● Use the “sango” container to run those that won’t. Instead of:
    module load bowtie
    myprog -o xyz
  run:
    module load sango
    sango -m bowtie myprog -o xyz
![Page 11: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/11.jpg)
Timeline (preliminary)
[Chart: December, January, February, March, April, May]
● install racks
● install nodes, storage
● acceptance tests
● system setup
● user environment
● software testing
● Open
![Page 12: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/12.jpg)
Timeline (preliminary)
● Power outages
● Network disruptions
● 2-3 outages minimum
![Page 13: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/13.jpg)
Students and RUAs: storage problems
Symptoms
● They get “out of space” errors when trying to save data on work or bucket
● The unit has plenty of space left
● Their files/directories have group “allstudents” or “allruas”
$ ls -l
drwxr-xr-x 2 jan-moren allstudents 4096 14 nov 15.41 jan-moren/
$
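One way to spot the third symptom across a whole directory tree is to list everything whose group is not the unit's group. This is only a sketch: the helper name `list_wrong_group` and the group name `myunit` in the usage line are placeholders, not names from the slides.

```shell
# Hypothetical helper: list everything under a directory whose group
# is NOT the expected unit group.
list_wrong_group() {
    find "$1" ! -group "$2" -print
}

# Usage (placeholder path and group name):
#   list_wrong_group /path/to/unit/directory myunit
```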
![Page 14: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/14.jpg)
Groups
● All users have 1 primary group and multiple secondary groups.
  ● Files get the primary group by default
● Most OIST members have their unit or section as primary group:
$ groups
scicomsec oist scicom-data ...
● But students do rotations, and may work with multiple units.
Problem
● Only IT can change primary groups.
● PIs have the legal responsibility for who is a member, and need to give approval
  ● PIs have authority but not means
  ● IT has means but not authority
● Can only have one primary group, so for collaborations you need ugly, brittle hacks.
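The primary/secondary distinction described under Groups can be inspected with the standard `id` command. A small sketch (the variable names are illustrative):

```shell
# Primary group: the group new files get by default
primary=$(id -gn)

# All groups the current user belongs to, primary and secondary
all_groups=$(id -Gn)

echo "primary group: $primary"
echo "all groups:    $all_groups"
```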
![Page 17: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/17.jpg)
Solution
1. Put students in “allstudents”
2. PIs can use “grouper” to add anybody as a secondary group.
  ● can delegate to unit members
  ● can’t remove regular members
![Page 18: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/18.jpg)
Solution
1. Put students in “allstudents”
2. PIs can use “grouper” to add anybody as a secondary group.
3. Set unit directories to override the users’ primary group (setgid)
  → new files get unit group, not “allstudents”
● PIs now do all changes themselves
● Students (or anybody) can get access to multiple units
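How a setgid directory makes new files take the directory's group rather than the creator's primary group can be tried out with a sketch like the following. The function name `make_group_dir` is made up for illustration, and you need to be a member of the group you pass in.

```shell
# Sketch: create a directory whose new files inherit the directory's group.
# Arguments: directory path, group name.
make_group_dir() {
    mkdir -p "$1" &&
    chgrp "$2" "$1" &&
    chmod g+rwxs "$1"    # the "s" sets the setgid bit on the directory
}
```

On Linux, files created inside such a directory get the directory's group, and new subdirectories also get the setgid bit.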
![Page 19: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/19.jpg)
Problem
● Group ownership is sometimes overridden and the setgid bit removed:
  ● e.g. “rsync -a”
  ● “cp -a”, “cp -p”
  ● using SMB (unclear when)
● “allstudents” has no quota, so writing files fails with an out-of-space error.
$ ls -l
... jan-moren allstudents ... myfolder/
![Page 20: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/20.jpg)
Solution
● The “fixdirs” script fixes both the group and the setgid bit:
$ fixdirs myfolder myuni
$ ls -l
... jan-moren myuni ... myfolder/
● Avoid setting group and permissions when moving data:
$ rsync -a --no-group --no-perms …
$ cp -a --no-preserve=mode,ownership
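The slides do not show what “fixdirs” does internally. The following is NOT that script, only a guess at how the same effect could be achieved by hand: reset the group on a whole tree and put the setgid bit back on every directory in it. The function name `fix_group_and_setgid` is hypothetical.

```shell
# NOT the real "fixdirs" script; a sketch of an equivalent manual repair.
# Arguments: directory tree to fix, group it should belong to.
fix_group_and_setgid() {
    target="$1"
    group="$2"
    # Give the whole tree the unit's group ...
    chgrp -R "$group" "$target" &&
    # ... and set the setgid bit on directories only, not on files.
    find "$target" -type d -exec chmod g+s {} +
}
```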
![Page 21: Advanced HPC User Group - OIST Groups](https://reader031.vdocuments.mx/reader031/viewer/2022012504/617e909d0b994704de61f23a/html5/thumbnails/21.jpg)
New List, New Members
Advanced HPC User Group
● “advanced” - reasonable, willing to learn, HPC users
● No PIs, no section leaders - your boss (or mine) won’t see you asking questions.
● The mailing list is a simple, direct way to contact us and each other.
● We will move the mailing list to a new system. This might interrupt things. Tell us if there’s a problem.
● We want new members. If you know somebody you think would benefit, tell me.