National Center for Supercomputing Applications
LCI Conference 2007
SAN Persistent Binding and Multipathing in the 2.6 Kernel
Michelle Butler, Technical Program Manager
Andy Loftus, System Engineer
Storage Enabling Technologies
[email protected] or [email protected]
Slides available at http://dims.ncsa.uiuc.edu/set/san/
Who?
• NCSA
  – a unit of the University of Illinois at Urbana-Champaign
  – a federal, state, university, and industry funded center
• Academic Users
  – NSF peer review
• Large number of applications/user needs
  – 3rd party codes, user written…
  – All running on same environment
• Many research areas
NCSA’s 1st Dell Cluster
• Tungsten: 1750 server cluster
  – 3.2 GHz Xeon
    • 2,560 processors (compute only)
    • 16.4 TF; 3.8 TB RAM; 122TB disk
    • Dell OpenManage
  – Myrinet
    • Full bi-section
  – Lustre over Gig-E
    • 13 DataDirect 8500
    • 104 OSTs, 2 MDS w/separate disk
    • 11.1 GB/sec sustained
  – Power/Cooling
    • 593 KW / 193 tons
  – Production date: April 2004
  – User Environment
    • Platform Computing LSF
    • Softenv
    • Intel Compilers
    • ChaMPIon Pro, MPICH, VMI-2
The first large-scale Dell cluster!!!
NCSA’s 3rd Dell Cluster
• T2 – retired into:
• Tungsten-3 1955 blade cluster
  – 2.6 GHz Woodcrest Dual Core
    • 1,040 processors/2080 cores
    • 22 TF; 4.1 TB RAM; 20 TB disk
    • Warewulf
  – Cisco InfiniBand
    • 3 to 1 over-subscribed
    • OFED-1.1 w/ HPSM subnet manager
  – Lustre over IB
    • 4 FasT controllers direct FC
    • 1.2GB/s sustained
    • 8 OSTs and 2 MDS w/complete auto failovers
  – Power/Cooling
    • 148 KW / 42 tons
  – Production date: March 2007
  – User Environment
    • Torque/Moab
    • Softenv
    • Intel Compilers
    • VMI-2
NCSA’s 4th Dell Cluster
• Abe: 1955 blade cluster
  – 2.33 GHz Clovertown Quad-Core
    • 1,200 blades/9,600 cores
    • 89.5 TF; 9.6 TB RAM; 120 TB disk
    • Perceus management; diskless boot
  – Cisco InfiniBand
    • 2 to 1 oversubscribed
    • OFED-1.1 w/ HPSM subnet manager
  – Lustre over IB
    • 22 OSTs
    • 2 9500 DDN controllers direct FC
    • 10 FasT controllers on SAN fabric
    • 8.4GB/s sustained
    • 22 OSTs and 2 MDS w/complete auto failovers
  – Power/Cooling
    • 500 KW / 140 tons
  – Production date: May 2007 (anticipated)
  – User Environment
    • Torque/Moab
    • Softenv
    • Intel Compiler
    • MPI: evaluating Intel MPI, MPICH, MVAPICH, VMI-2, etc.
The largest Dell cluster!!!
NCSA Facility - ACB
• Advanced Computation Building
  – Three rooms, totals:
    • 16,400 sqft raised floor
    • 4.5 MW power capacity
    • 250 kW UPS
    • 1,500 tons cooling capacity
  – Room 200:
    • 7,000 sqft – no columns
    • 70” raised floor
    • 2.3 MW power capacity
    • 750 tons cooling capacity
NCSA’s Other Systems
• Distributed Memory Clusters
  – Mercury (IBM, 1.3/1.5 GHz Itanium2):
    • 1,846 processors
    • 10 TF; 4.6 TB RAM; 90 TB disk
• Shared Memory Clusters
  – Copper (IBM p690, 1.3 GHz Power4): 12 x 32 processors
    • 2 TF; 64 or 256 GB RAM each; 35 TB disk
  – Cobalt (SGI Altix, 1.5 GHz Itanium2): 2 x 512 processors
    • 6.6 TF; 1 TB or 3 TB RAM; 250 TB disk
NCSA Storage Systems
• Archival: SGI/Unitree (5 PB total capacity)
  – 72TB disk cache; 50 tape drives
  – currently 2.8PB of data in MSS
    • >1PB ingested in last 6 months
    • project ~3.2PB by end of CY2006
    • licensed to support 5PB resident data
  – ~30 data collections hosted
• Infrastructure: 394TB Fibre Channel SAN connected
  – Fibre Channel SAN connected; FC and SATA environments
  – Lustre, IBRIX, NFS filesystems
• Databases:
  – 8 processor 12GB memory SGI Altix
    • 30TB of SAN storage
    • Oracle 10G, MySQL, Postgres
  – Oracle RAC cluster
  – Single-system Oracle deployments for focused projects
Visualization Resources
• 30M-pixel Tiled Display Wall
  – 8192 x 3840 pixels composite display
  – 40 NEC VT540 projectors, arranged in a 5H x 8W matrix
  – driven by 40-node Linux cluster
    • dual-processor 2.4GHz Intel Xeons with NVIDIA FX 5800 Ultra graphics accelerator cards
    • Myrinet interconnect
    • to be upgraded by early CY2007
  – funded by State of Illinois
• SGI Prisms
  – 8 x 8 processor (1.6 GHz Itanium2)
  – 4 graphics pipes each; 1 GB RAM each
  – InfiniBand connection to Altix machines
SAN at NCSA
• 1.3PB spinning disk
  – 895TB SAN attached
• 1392 Brocade switch ports
• 7 SAN fabrics
• 2 data centers
Persistent Binding
• Device naming problems
• Udev solution
• Examples
• Interactive Demo
Device Naming Problem
(Figure: device node mapping, Before and After)
• Add hardware
• SAN zoning
• New SAN luns
• Modify config
Device node mapping can change with changes to
- hardware
- software
- SAN
Device names are assigned in discovery order (next available major/minor pair for the device type), so they are effectively arbitrary
CLUSTER
- Multiple hosts that see the same disk may assign the disk to different device nodes
- may be /dev/sda on system1 but /dev/sdc on system2
- Can change with hardware changes; what used to be /dev/sda is now /dev/sdc
Devfs helps only a little:
- Fixes device naming; on a single host, disk will always have the same device node
- But different hosts may have different device names for the same physical disk
What needs to happen
• Storage target always maps to same local device (i.e. /dev/…)
• Local device name should be meaningful
  – /dev/sda conveys no information about the storage device
udev - Persistent Device Naming
• “Udev is … a userspace solution for a dynamic /dev directory, with persistent device naming” *
  – Userspace: not required to remain in memory
  – Dynamic: /dev not filled with unused files
  – Persistent: devices always accessible using the same device node
• Provides for custom device names
* Daniel Drake (http://www.reactivated.net/writing_udev_rules.html)
Devfs provides dynamic and persistent naming, but:
- kernel based - entire device db stored in kernel memory, never swapped
- not possible to customize device names
UDEV CUSTOM
- custom names for devices
- custom scripts can be run when specific devices are attached/removed
Setting up udev device mapper
Overview
1. Uniquely identify each lun
2. Assign a meaningful name to each lun
1. Uniquely identify each lun
/sbin/scsi_id
Sample usage:
root# scsi_id -g -u -s /block/sda
SSEAGATE_ST318406LC_____3FE27FZP000073302G5W
root# scsi_id -g -u -s /block/sdb
3600a0b8000122c6d00000000453174fc
(Diagram: device name goes in to scsi_id, which issues a SCSI INQUIRY and returns a unique id)
/sbin/scsi_id
- INPUT: existing local device name
- OUTPUT: string that uniquely identifies the specific device (guaranteed unique among all scsi devices)
SAMPLE:
- sda: locally installed drive
- sdb: SAN attached disk
2. Associate a meaningful name
• BUS=scsi
  – /sys/bus/scsi
• SYSFS
  – <BUS>/devices/H:B:T:L/<filename>
• PROGRAM & RESULT
  – Program to invoke and result to look for
• NAME
  – Device name to create (relative to /dev)
New udev rules file: /etc/udev/rules.d/20-local.rules
BUS="scsi", SYSFS{vendor}="DDN", SYSFS{model}="S2A 8000", PROGRAM="/sbin/scsi_id -g -u -s /block/%k ", RESULT="360001ff020021101092fadc32a450100", NAME="disk/fc/sdd4c1l0"
Custom naming is controlled by rulesets stored in /etc/udev/rules.d
A rule is a list of keys to match against.
When all keys match, the specified action is taken (create a device name or symlink)
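A rule of this shape is needed once per lun, so generating the lines can help avoid copy/paste mistakes. The sketch below assumes the legacy key syntax shown on the slide (newer udev releases use a different match syntax). `make_udev_rule` is a hypothetical helper, not part of udev; the values in the example call are the ones from the sample rule.

```shell
#!/bin/sh
# make_udev_rule VENDOR MODEL SCSI_ID NAME
# Hypothetical helper: prints one rule line in the key="value" form used on
# the slide. It only formats text; it does not install or reload anything.
make_udev_rule() {
    printf 'BUS="scsi", SYSFS{vendor}="%s", SYSFS{model}="%s", ' "$1" "$2"
    # %%k prints a literal %k, which udev expands to the kernel device name
    printf 'PROGRAM="/sbin/scsi_id -g -u -s /block/%%k", '
    printf 'RESULT="%s", NAME="%s"\n' "$3" "$4"
}

# Example with the values from the sample rule
make_udev_rule "DDN" "S2A 8000" \
    "360001ff020021101092fadc32a450100" "disk/fc/sdd4c1l0"
```

The output could be appended to a local rules file such as 20-local.rules after review; the SCSI id for each lun would come from running scsi_id by hand first.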
Example: Customizing for multiple paths
Problem
Multiple paths to a single lun result in multiple device nodes.
Need to know which path each device uses.
Example: Customizing for multiple paths
Custom script: mpio_scsi_id
Sample udev rule:
BUS="scsi", SYSFS{vendor}="DDN", SYSFS{model}="S2A 8000", PROGRAM="/root/bin/mpio_scsi_id %k", RESULT="23000001ff03092f360001ff020021101092fadc32a450100", NAME="disk/fc/sdd4c1l0"
(Diagram: udev passes the device name to mpio_scsi_id, which calls scsi_id, looks up the disk controller WWPN, and returns WWPN + scsi_id)
Get disk controller WWPN
(Emulex) /sys/class/fc_transport/target<H>:<B>:<T>/port_name
(QLA) grep + awk to pull value from /proc/scsi/qla2xxx/<host_id>
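The slides do not reproduce mpio_scsi_id itself, so the following is only a sketch of the Emulex branch under stated assumptions: the WWPN is read from the fc_transport `port_name` file named above, a leading `0x` is stripped (the sample RESULT has none), and the SCSI id is appended. The function names and the sysfs-root parameter are hypothetical; the parameter exists only so the logic can be tried against a copy of the tree.

```shell
#!/bin/sh
# Sketch of an mpio_scsi_id-style helper (hypothetical; not the original script).

# strip_0x WWPN: drop a leading "0x" from a WWPN string.
strip_0x() { printf '%s\n' "${1#0x}"; }

# target_of H:B:T:L: name of the matching fc_transport target directory,
# e.g. 2:0:1:0 -> target2:0:1
target_of() { printf 'target%s\n' "${1%:*}"; }

# mpio_id SYSFS_ROOT H:B:T:L SCSI_ID: print controller WWPN followed by SCSI_ID.
mpio_id() {
    port_file="$1/class/fc_transport/$(target_of "$2")/port_name"
    [ -r "$port_file" ] || return 1      # attribute missing or not there yet
    printf '%s%s\n' "$(strip_0x "$(cat "$port_file")")" "$3"
}
```

On a live host the call might look like `mpio_id /sys 2:0:1:0 "$(/sbin/scsi_id -g -u -s /block/sdi)"`, with the H:B:T:L taken from the device's sysfs entry. The QLA branch would need the /proc parsing mentioned above and is not sketched here.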
Demo: udev persistent device naming
• Single HBA
• Single disk unit
  – 4 luns
  – Each lun presented through both controllers
• Host sees 8 logical luns
• Use mpio_scsi_id to identify the ctlr-lun
Demo: udev persistent device naming
Original Configuration
• udev config file
  – /etc/udev/udev.conf
• scsi_id config file
  – /etc/scsi_id.config
• Scan fc luns
  – {sysfs}/hostX/scan
  – /dev/disk/by-id
Custom device names
• Custom rules file
  – 20-local.rules
• Restart udev
  – udevstart
• Custom device names created
  – /dev/disk/fc
BEGIN
- tail -f /var/log/messages
1. Enable udev logging
2. Enable scsi_id for all devices (option -g)
3. /proc/partitions
4. Scan fc luns (echo "- - -" > /sys/class/scsi_host/hostX/scan)
5. See udev log lines in messages file ; See fc disks in /dev/disk/by-id
6. Enable 20-local rules file
7. Udevstart
8. See udev log lines in messages file ; See fc disks in /dev/disk/fc
DEFAULT CONFIGURATION
Local rules file already exists. Disable it.
Default behavior for scsi_id is to blacklist everything unknown (-b option). Switch to whitelisting everything (-g option) so that scsi_ids are returned.
Even before custom rules are in place, see default udev rule selection activity in /var/log/messages
After running delete_fc_luns, udev removes /dev/sdX devices files (/var/log/messages)
CUSTOM CONFIGURATION
Udev custom rules are selected (see /var/log/messages)
Major/Minor numbers line up between /dev/disk/fc/* and /proc/partitions
Demo: udev persistent device naming
Debugging
• Not all sysfs files are available immediately
  – HBA target WWPN
  – Add udevstart to boot scripts
• Udev tools can help
  – udevinfo
  – udevtest
Examples
• udevinfo -a -p $(udevinfo -q path -n /dev/sdb)
• udevtest /block/sdb
Example: multiple paths on Nadir
- If luns are removed (delete_fc_luns)
- Then added (scan_fc_luns)
- No matches are found in 20-local.rules
- Add syslog output to mpio_scsi_id
+ Shows params the script is called with
+ Shows what the script returns
+ target_wwpn is not getting set
- Run udevstart (luns already attached now), matches found in 20-local.rules and device files created
Probably either a driver or udev issue.
Easiest solution is to run scan_fc_luns and udevstart at system boot time (/etc/rc.d/rc.local)
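The notes settle on re-running udevstart at boot. A different approach, not taken in the talk and untested against the driver/udev timing issue described above, would be for a callout script to poll briefly for a sysfs attribute before giving up. The sketch below shows only the polling logic; `wait_for_file` and its defaults are hypothetical.

```shell
#!/bin/sh
# wait_for_file PATH [TRIES] [DELAY_SECONDS]
# Hypothetical helper: return 0 as soon as PATH exists and is non-empty,
# or 1 after TRIES checks spaced DELAY_SECONDS apart.
wait_for_file() {
    path=$1
    tries=${2:-10}
    delay=${3:-1}
    while [ "$tries" -gt 0 ]; do
        [ -s "$path" ] && return 0
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}
```

A callout could call something like `wait_for_file "$port_file" 5 1 || exit 1` before reading the target WWPN. Whether udev tolerates a callout that blocks for several seconds is an open question here, which is one reason the boot-time udevstart may remain the simpler choice.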
Custom script: ls_fc_luns
• Get HBA list: sysfs (/sys/class/fc_host)
• Get HBA type: lspci
• Get target list: sysfs for Emulex (/sys/class/scsi_host/hostX/targetX:Y:Z), /proc for QLA (/proc/scsi/qla2xxx/X)
• Get lun list: sysfs (/sys/class/scsi_host/hostX/targetX:Y:Z/X:Y:Z:L)
• Get lun info: sysfs (/sys/class/scsi_host/hostX/targetX:Y:Z/X:Y:Z:L/*)
Sample output:
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:0:0 sdb 3600a0b8000122c6d00000000453174fc
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:0:1 sdc 3600a0b80000fd6320000000045317563
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:1:0 sdi 3600a0b8000122c6d00000000453174fc
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:1:1 sdj 3600a0b80000fd6320000000045317563
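The sample output makes the multipath situation visible: the same SCSI id shows up under two different target WWPNs. A small filter can group devices by SCSI id to show this. The sketch assumes the five columns are HBA WWPN, target WWPN, H:B:T:L, device name, and SCSI id, which is inferred from the sample rather than stated in the slides; `paths_per_lun` is a hypothetical name.

```shell
#!/bin/sh
# paths_per_lun: read ls_fc_luns-style lines on stdin and print each SCSI id
# (column 5) followed by the device names (column 4) that reach it,
# in first-seen order.
paths_per_lun() {
    awk 'NF >= 5 {
             if (!($5 in devs)) order[++n] = $5
             devs[$5] = devs[$5] " " $4
         }
         END { for (i = 1; i <= n; i++) print order[i] devs[order[i]] }'
}

# Example using the four sample lines from the slide
paths_per_lun <<'EOF'
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:0:0 sdb 3600a0b8000122c6d00000000453174fc
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:0:1 sdc 3600a0b80000fd6320000000045317563
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:1:0 sdi 3600a0b8000122c6d00000000453174fc
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:1:1 sdj 3600a0b80000fd6320000000045317563
EOF
```

For the sample this prints two lines, one per lun: the first id followed by `sdb sdi`, the second followed by `sdc sdj`.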
Custom script: lip_fc_hosts
Get host list: ls_fc_luns
echo "1" > /sys/class/fc_host/hostX/lip
Custom script: scan_fc_luns
Get host list: ls_fc_luns
echo "- - -" > /sys/class/scsi_host/hostX/scan
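The original scan_fc_luns is not shown, so this is a minimal sketch of the same idea. It assumes the host list can be taken from /sys/class/fc_host (the HBA list location named for ls_fc_luns) instead of from ls_fc_luns output, and it takes the sysfs root as an optional argument so the loop can be exercised on a scratch directory. `scan_fc_hosts` is a hypothetical name.

```shell
#!/bin/sh
# scan_fc_hosts [SYSFS_ROOT]
# For every FC host under SYSFS_ROOT (default /sys), write the wildcard
# "- - -" (any channel, any target, any lun) to that host's scan file.
scan_fc_hosts() {
    sysfs=${1:-/sys}
    for h in "$sysfs"/class/fc_host/host*; do
        [ -d "$h" ] || continue           # glob matched nothing: no FC HBAs
        host=${h##*/}
        scan="$sysfs/class/scsi_host/$host/scan"
        if [ -w "$scan" ]; then
            echo "- - -" > "$scan"
            echo "scanned $host"
        fi
    done
}
```

Run as root with no argument to act on the live system. Note the plain ASCII quotes around `- - -`; typographic quotes copied from a slide would be written to the file literally.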
Custom script: delete_fc_luns
Get lun list: ls_fc_luns
echo "1" > /sys/class/scsi_host/hostX/targetX:Y:Z/X:Y:Z:L/delete
udev - Additional Resources
• man udev
• http://www.emulex.com/white/hba/wp_linux26udev.pdf
  – Excellent white paper
• http://www.reactivated.net/udevrules.php
  – How to write udev rules
• http://www.us.kernel.org/pub/linux/utils/kernel/hotplug/udev.html
  – Information and links
• http://dims.ncsa.uiuc.edu/set/san
  – FC tools: custom tools used in demo
Linux Multipath I/O
• Overview
• History
• Setup
• Demos
  – Active / Passive Controller Pair
  – Active / Active Controller Pair
Linux Multipath - History
Providers
• Storage Vendor
• HBA Vendor
• Filesystem
• OS
STORAGE VENDOR
- End to end solution (they provide disk, HBA, driver, add’l software, sometimes even FC switch)
- HBA’s (and other parts) come at a markup
- One location for support tickets, but no alternate recourse if they can’t fix the problem
- Proprietary requirements (typically require 2 HBA’s, only works with their systems)
HBA VENDOR
- QLA
> Linux support spotty
+ 2.4 kernel ok, but strict requirements (2 HBA’s, exactly 2 paths per lun, active/active controllers)
+ 2.6 kernel inconsistent behavior
> Solaris support spotty (2 months to get 1 machine working, next month stops working, machine was untouched)
> Dropped Windows support prematurely (Windows MPIO layer not complete yet, only an API for vendors)
> Proprietary solution, only works with their HBA’s and configuration software
- Emulex (unix philosophy, do one thing and do it well; MPIO doesn’t belong in the driver)
FILESYSTEM
- 3rd party - Veritas, others??
- Parallel Filesystems - Ibrix, Lustre, GPFS, CXFS (enable MPIO via failover hosts)
OS
- *NEW* Solaris 10 (XPATH, but requires Solaris branded QLA cards)
- *NEW* Linux (device mapper multipath) (RedHat4, Suse, others…)
Device Mapper Multipath
• Identify luns by scsi_id
• Create “path groups”
  – Round-robin I/O on all paths in groups
• Monitor paths for failure
  – When no paths left in current group, use next group
• Monitor failed paths for recovery
  – Upon path recovery, re-check group priorities
  – Assign new active group if necessary
Linux Device Mapper Multipath
Overview
1. Identify unique luns
2. Monitor active paths for failure
3. Monitor failed paths for recovery
Multipath handles 3 areas.
All settings are saved in /etc/multipath.conf
1. Identify unique luns
Storage Device
• vendor
• product
• getuid_callout
device {
    vendor "DDN"
    product "S2A 8000"
    getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
}
1. Identify unique luns
Multipath Device
• wwid
• alias
multipath {
    wwid 360001ff020021101092fb1152a450900
    alias sdd4l0
}
2. Monitor Healthy Paths for Failure
• Priority group
  – Collection of paths to the same physical lun
  – I/O is split across all paths in round-robin fashion
• path_grouping_policy
  – multibus
  – failover
  – group_by_prio
  – group_by_serial
  – group_by_node_name
Multipath control creates priority groups.
Paths are grouped based on path_grouping_policy
MULTIBUS - all paths in one priority group (DDN) (no penalty to access luns via alternate controllers)
FAILOVER - one path per priority group (Use only 1 path at a time) (typically only 1 usable path, such as IBM fastt with AVT disabled)
GROUP_BY_PRIO - Paths with same priority in same priority group, 1 group for each unique priority (Priorities assigned by external program)
GROUP_BY_SERIAL - Paths grouped by scsi target serial (controller node WWN)
GROUP_BY_NODE_NAME - (I have not tested or researched this, never had a need to)
2. Monitor Healthy Paths for Failure
• Path Priority
  – Integer value assigned to a path
  – Higher value == higher priority
  – Directly controls priority group selection
• prio_callout
  – 3rd party pgm to assign priority values to each path
(Diagram: multipath passes the device name to prio_callout, which returns an integer value. Path Grouping Policy = group_by_prio)
Only matters if using “group_by_prio” grouping policy
DIRECTLY CONTROLS PRIORITY GROUP SELECTION
- Priority group with highest value is active group
PREVIOUS SLIDE - When all paths in a group are failed, next group becomes active. That would be the priority group with the next highest priority value that has an active path.
PRIO_CALLOUT
- Provided by vendor or (more typically) custom script written by admin for specific setup
- If not using group_by_prio, then set this to /bin/true
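The selection rule in these notes can be made concrete with a toy model: among paths that are still active, the group with the highest priority value wins, and if every path in it fails the next highest group takes over. This is a sketch of the rule as described here, not multipathd's actual code; the three-column input (device, priority, state) and the name `active_group` are invented for the illustration.

```shell
#!/bin/sh
# active_group: read "device priority state" lines on stdin and print
# "<priority>: <devices>" for the highest-priority group that still has
# at least one path in state "active". Prints nothing if no path is active.
active_group() {
    awk '$3 == "active" {
             if (!seen || $2 + 0 > best) { best = $2 + 0; seen = 1; devs = $1 }
             else if ($2 + 0 == best)    { devs = devs " " $1 }
         }
         END { if (seen) print best ": " devs }'
}
```

Feeding it `sdb 50 active` and `sdi 5 active` selects the priority-50 group containing sdb; marking sdb as `failed` makes the priority-5 group containing sdi the active one, which mirrors the failover behaviour described above.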
2. Monitor Healthy Paths for Failure
• path_checker
  – tur
  – readsector0
  – directio
  – (Custom)
    • emc_clariion
    • hp_sw
• no_path_retry
  – queue
  – (N > 0)
  – fail
TUR
- SCSI Test Unit Ready
- Preferred if lun supports it (OK on DDN, IBM fastt)
- Does not cause AVT on IBM fastt
- Does not fill up /var/log/messages on failures
READSECTOR0
- physical lun access via /dev/sdX (IS THIS CORRECT???)
DIRECTIO
- physical lun access via /dev/sgY (IS THIS CORRECT???)
Both readsector0 and directio cause AVT on IBM fastt, resulting in lun thrashing
Both readsector0 and directio log “fail” messages in /var/log/messages (could be useful if you want to monitor logs for these events)
NO_PATH_RETRY
- # of retries before failing path
- queue: queue I/O forever
- (N > 0): queue I/O for N retries, then fail
- fail: fail immediately
3. Monitor failed paths for recovery
• Failback
  – immediate (same as n=0)
  – (n > 0)
  – manual
FAILBACK
- When a path recovers, wait # seconds before enabling the path
- Recovered path is added back into multipath enabled path list
- multipath re-evaluates priority groups, changes active priority group if needed
MANUAL RECOVERY
- User runs ‘/sbin/multipath’ to update enabled paths and priority groups
Putting it all together
multipaths {
    multipath {
        wwid 3600a0b8000122c6d00000000453174fc
        alias fastt21l0
    }
    multipath {
        wwid 3600a0b80000fd6320000000045317563
        alias fastt21l1
    }
}
devices {
    device {
        vendor "IBM"
        product "1742-900"
        getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
        path_grouping_policy group_by_prio
        prio_callout "/usr/local/sbin/path_prio.sh %n"
        path_checker tur
        no_path_retry fail
        failback immediate
    }
}
Putting it all together
/usr/local/etc/primary-paths
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:0 sdb 3600a0b8000122c6d00000000453174fc 50
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:1 sdc 3600a0b80000fd6320000000045317563 2
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:2 sdd 3600a0b8000122c6d0000000345317524 50
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:3 sde 3600a0b80000fd6320000000245317593 2
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:0 sdi 3600a0b8000122c6d00000000453174fc 5
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:1 sdj 3600a0b80000fd6320000000045317563 51
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:2 sdk 3600a0b8000122c6d0000000345317524 5
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:3 sdl 3600a0b80000fd6320000000245317593 51
(Diagram: multipath passes a device name such as sdb to path_prio.sh, which finds the matching line in primary-paths and returns its priority, here 50)
PATH_PRIO.SH
- grep device from primary-paths file
- return value from last column
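The two steps in these notes fit in a few lines. The sketch below is not the original path_prio.sh: it uses awk with an exact match on the device column rather than grep, so that a name like sdb cannot also match a longer name containing it, and it takes the primary-paths file as an argument so it can be tried on a copy. The function name `path_prio` is hypothetical.

```shell
#!/bin/sh
# path_prio FILE DEVICE
# Print the priority (last column) of the line in FILE whose device column
# (column 4) equals DEVICE. Returns non-zero if the device is not listed.
path_prio() {
    awk -v dev="$2" '
        $4 == dev { print $NF; found = 1; exit }
        END       { if (!found) exit 1 }
    ' "$1"
}
```

As a prio_callout, a wrapper script would call something like `path_prio /usr/local/etc/primary-paths "$1"`, since multipath passes only the device name (`%n`). With the file shown above, a lookup for sdb would print 50 and a lookup for sdj would print 51.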
Demo: Active/Passive Disk
• Host
  – One Emulex LP11000
• Disk
  – IBM DS4500
  – Luns presented through both controllers
  – Luns accessible via 1 controller only at a time
  – AVT enabled
AVT
- Lun will migrate to alternate controller if requested there
- Tolerance of cable/switch failure
- AVT penalty - lun inaccessible for 5-10 secs while controller ownership changes
SCREENS: /var/log/messages, multi-port-mon, command, script-host
1. No luns (ls_fc_luns)
2. /etc/multipath.conf
   1. Multipaths (fastt)
   2. Devices (fastt)
3. /usr/local/sbin/path_prio.sh
   1. Identify controller A, controller B
4. /usr/local/etc/primary-paths
5. Add luns (scan_fc_luns)
   1. See multipath bindings & path_prio.sh output in /var/log/messages
6. View current multipath configuration
   1. multipath -v2 -l
7. Failover test
   1. Script-host: disable disk port A
   2. See multipathd reconfig in /var/log/messages
   3. See I/O path change in multi-port-mon
8. Recover test
   1. Script-host: enable disk port A
![Page 41: SAN Persistent Binding and Multipathing in the 2.6 Kernel · •Perceus management; diskless boot –Cisco Infiniband •2 to 1 oversubscribed •OFED-1.1 w/ HPSM subnet manager –Lustre](https://reader035.vdocuments.mx/reader035/viewer/2022071000/5fbc73a679ad445c0d2f058a/html5/thumbnails/41.jpg)
Demo: Active/Active Disk
• Host
– One Emulex LP11000
• Disk
– DDN 8500
– Luns accessible via both controllers (no penalty)
SCREENS: multi-port-mon, /var/log/messages, command, script-host
1. /etc/multipath.conf
   1. Devices (DDN) (path_prio = /bin/true ; path_grouping_policy = multibus)
   2. Multipath (DDN)
2. Luns present? (ls_fc_luns) Add luns if needed (scan_fc_luns)
   1. See multipath bindings in /var/log/messages
3. View multipath configuration
   1. multipath -v2 -l
4. Failover test
   1. Expected changes in multi-port-mon
   2. Disable switch port for disk ctlr 1
   3. See failover in /var/log/messages and multi-port-mon
5. Restore ctlr access
   1. Expected changes in multi-port-mon
   2. Enable switch port for disk ctlr 1
   3. See failback in /var/log/messages and multi-port-mon
![Page 42: SAN Persistent Binding and Multipathing in the 2.6 Kernel · •Perceus management; diskless boot –Cisco Infiniband •2 to 1 oversubscribed •OFED-1.1 w/ HPSM subnet manager –Lustre](https://reader035.vdocuments.mx/reader035/viewer/2022071000/5fbc73a679ad445c0d2f058a/html5/thumbnails/42.jpg)
Path Grouping Policy Matrix
• Active/Active
– 1 HBA: multibus (demo1)
– 2 HBAs: multibus
• Active/Passive with AVT
– 1 HBA: path_prio (demo2)
– 2 HBAs: path_prio
• Active/Passive w/o AVT
– 1 HBA: * multiple points of failure
– 2 HBAs: failover
ACTIVE/ACTIVE 2 HBAs
- trivial, same as demo1
- Each HBA sees 1 ctlr
- Can let both HBAs see both ctlrs (4 paths to each lun)
+ Use path_prio if need to control path usage
ACTIVE/PASSIVE (AVT) 2 HBAs
- trivial, similar to demo2
ACTIVE/PASSIVE (no AVT) 1 HBA
- Tolerant of ctlr failure only.
- If anything else fails, luns will not AVT to alternate ctlr, host will lose access
ACTIVE/PASSIVE (no AVT) 2 HBAs
- Non-preferred paths will be failed
- Each HBA must have full access to both controllers
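The matrix and notes above can be read as a simple lookup. The sketch below is illustrative only: the function name `grouping_policy` and the array-type labels `aa`, `ap-avt`, and `ap-noavt` are made up for this example. It assumes the slide's "path_prio" entry corresponds to the `group_by_prio` policy (with a prio_callout script) used in the earlier multipath.conf slide, and it treats the 1-HBA, no-AVT case as an error because the slide lists only a caveat there rather than a policy.

```shell
#!/bin/sh
# Hypothetical lookup of path_grouping_policy from the slide's matrix.
# Usage: grouping_policy <aa|ap-avt|ap-noavt> <number of HBAs>
# The type labels are invented for this sketch.
grouping_policy() {
    array_type="$1"
    hbas="$2"
    case "$array_type" in
        aa)
            # Active/Active: all paths usable, no penalty.
            echo "multibus"
            ;;
        ap-avt)
            # Active/Passive with AVT: prefer the owning controller
            # via a priority callout (assumed to mean group_by_prio).
            echo "group_by_prio"
            ;;
        ap-noavt)
            if [ "$hbas" -ge 2 ]; then
                echo "failover"
            else
                # Slide caveat: tolerant of controller failure only.
                echo "1 HBA without AVT: multiple points of failure" >&2
                return 1
            fi
            ;;
        *)
            echo "unknown array type: $array_type" >&2
            return 2
            ;;
    esac
}
```

For example, `grouping_policy ap-avt 1` prints `group_by_prio`, matching the DS4500 demo configuration.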
![Page 43: SAN Persistent Binding and Multipathing in the 2.6 Kernel · •Perceus management; diskless boot –Cisco Infiniband •2 to 1 oversubscribed •OFED-1.1 w/ HPSM subnet manager –Lustre](https://reader035.vdocuments.mx/reader035/viewer/2022071000/5fbc73a679ad445c0d2f058a/html5/thumbnails/43.jpg)
Linux Multipath Errata
• Making changes to multipath.conf
– Stop multipathd service
– Clear multipath bindings
  • /sbin/multipath -F
– Create new multipath bindings
  • /sbin/multipath -v2 -l
– Start multipathd service
• Cannot multipath root or boot device
• user_friendly_names
– Not really, just random names dm-1, dm-2 …
CANNOT MULTIPATH ROOT OR BOOT DEVICE
- per ap-rhcs-dm-multipath-usagetxt.html (see references section)
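The four-step procedure for changing multipath.conf can be wrapped in a small script. This is a sketch under assumptions: the slide does not say how the service is stopped and started, so `/sbin/service multipathd stop|start` is assumed here, and the `run`, `reload_multipath`, and `DRY_RUN` names are invented for illustration. By default it only prints the commands.

```shell
#!/bin/sh
# Hypothetical wrapper for the multipath.conf change procedure.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reload_multipath() {
    # 1. Stop the daemon (service command is an assumption).
    run /sbin/service multipathd stop || return 1
    # 2. Clear the existing multipath bindings.
    run /sbin/multipath -F || return 1
    # 3. Create new bindings, using the command as listed on the slide.
    #    Note: per the multipath man page, -l displays the current topology;
    #    verify on your version whether plain "multipath -v2" is needed
    #    to actually build the maps.
    run /sbin/multipath -v2 -l || return 1
    # 4. Start the daemon again.
    run /sbin/service multipathd start || return 1
}
```

Calling `reload_multipath` with the default settings prints the four commands in order; setting `DRY_RUN=0` as root would execute them.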
![Page 44: SAN Persistent Binding and Multipathing in the 2.6 Kernel · •Perceus management; diskless boot –Cisco Infiniband •2 to 1 oversubscribed •OFED-1.1 w/ HPSM subnet manager –Lustre](https://reader035.vdocuments.mx/reader035/viewer/2022071000/5fbc73a679ad445c0d2f058a/html5/thumbnails/44.jpg)
Linux Multipath Resources
• multipath.conf.annotated
• man multipath
• http://christophe.varoqui.free.fr/wiki/wakka.php?wiki=Home
– Multipath tools official home
• http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/ap-rhcs-dm-multipath-usagetxt.html
– Description of output (multipath -v2 -l)
• http://kbase.redhat.com/faq/FAQ_85_7170.shtm
– Setup device-mapper multipathing in Red Hat Enterprise Linux 4?
• http://dims.ncsa.uiuc.edu/set/san
– Multi-port-mon
– Set switchport state : (en/dis)able switch port via SNMP
MULTIPATH.CONF.ANNOTATED (RedHat)
- /usr/share/doc/device-mapper-multipath-0.4.5/multipath.conf.annotated