
Locating hard drives within a large JBOD or Lustre enclosure

Locating disks in large enclosures containing 90+ drives can be problematic, since their status LEDs are not visible and are often not even activated by ZFS volume errors.

The following applies to tux (270 drives) and to noss1, noss2 and noss3 (90 drives each). All the necessary dependencies and the scripts themselves have been installed in each host's /root/sysadmin/drive_utils directory.

The ledmon and ledctl utilities can work, but require a block id (/dev/sdXX). The system typically reports only UUIDs, and block ids are not persistent between reboots, so they have to be regenerated every time a disk intervention is required. Only SCSI UUIDs are persistent.

A zpool status command will typically show errors on the volumes, identified by their SCSI ids. The usual ZFS replacement routines can be applied, but the physical drives first need to be positively identified in the enclosures.
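
As a rough illustration of pulling the suspect SCSI ids out of zpool status, the sketch below filters the config table on stdin. It assumes the vdevs are named by their /dev/disk/by-id links (scsi-3...), as in the examples on this page, and the function name suspect_scsi_ids is made up for this example. It is a starting point, not a replacement for reading the full zpool status report.

```shell
#!/bin/bash
# Sketch only: print the scsi-... vdev names that zpool status shows as
# faulted, or as ONLINE but with non-zero READ/WRITE/CKSUM counters.
# Assumption: vdevs are named scsi-<id>; suspect_scsi_ids is a hypothetical name.
suspect_scsi_ids() {
    # Config lines look like: NAME STATE READ WRITE CKSUM
    # Spares (state AVAIL/INUSE) are deliberately not reported.
    awk '$1 ~ /^scsi-/ && ($2 ~ /^(FAULTED|DEGRADED|UNAVAIL|OFFLINE|REMOVED)$/ || ($2 == "ONLINE" && $3 + $4 + $5 > 0)) { print $1 }'
}
```

Typical use would be: zpool status | suspect_scsi_ids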

The following script lists the SCSI UUID, block id and physical device serial number of every drive on every controller present:

Located in /root/sysadmin/drive_utils as map_drives.sh
********************************************************************************************
#!/bin/bash

# JWS - March 2019 johann.swart@up.ac.za
# Script to find and associate hard drive block ids with their UUIDs and serial numbers
# For example, ZFS and other system functions will always return a UUID in a fault report, but some utilities require a /dev/sdXX id.
# Also, the serial number of the actual drive is needed to confirm its identity.
# The output of this script is quite large on the JBOD and Lustre enclosures, but can be grep'ed with the UUID from the system.
# The drive can then be located by flashing its LED with the ledctl utility, which requires a block id - /dev/sdXX
# Since block ids are not persistent, they have to be generated from scratch each time, before the main script is run, so this temporary file is datestamped

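# Datestamp (DDMMYYYY) for the temporary file of block ids
prefix=$(date +%d%m%Y)
file_name="disks_blkid_$prefix.txt"

# List the ATA disks and keep only the device name (sdXX), without the /dev/ prefix.
# The device node is the last column of the default lsscsi output.
lsscsi | grep "/dev/sd" | grep "ATA" | awk '{ print $NF }' | cut -c6- > "$file_name"

# For each block device, print its first device link, its /dev name and its serial number
while IFS= read -r block_device
do
    udevadm info -q all -p "/sys/block/$block_device" |
    awk '
        /DEVLINKS/        { printf "%s\n", $2 }
        /DEVNAME/         { printf "%s\n", $2 }
        /ID_SCSI_SERIAL/  { printf "%s\n\n", $2 }
    '
done < "$file_name"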

********************************************************************************************
Running this script will give a lot of output like:
.
.
DEVLINKS=/dev/disk/by-id/scsi-35000cca255de35e9
DEVNAME=/dev/sdjh
ID_SCSI_SERIAL=K1J4G74D

DEVLINKS=/dev/disk/by-id/scsi-35000cca255de1ece
DEVNAME=/dev/sdji
ID_SCSI_SERIAL=K1J482AD

DEVLINKS=/dev/disk/by-id/scsi-35000cca255dbf06f
DEVNAME=/dev/sdjj
ID_SCSI_SERIAL=K1HZGA8D
.
.
which can be parsed for a particular UUID with (for example):

./map_drives.sh|grep -A 2 scsi-35000cca255dbf06f
DEVLINKS=/dev/disk/by-id/scsi-35000cca255dbf06f
DEVNAME=/dev/sdjj
ID_SCSI_SERIAL=K1HZGA8D
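
If only the block id is wanted (for example to feed it straight to another command), the lookup above can be wrapped in a small helper. This is a minimal sketch that assumes the three-line DEVLINKS / DEVNAME / ID_SCSI_SERIAL output format shown above; the function name devname_for_scsi_id is hypothetical and is not part of the installed scripts.

```shell
#!/bin/bash
# Sketch only: read map_drives.sh output on stdin and print the /dev/sdXX
# name that belongs to the given SCSI id (e.g. scsi-35000cca255dbf06f).
# devname_for_scsi_id is a hypothetical helper name.
devname_for_scsi_id() {
    awk -v id="$1" -F= '
        # Remember whether the current record is the drive we are looking for
        $1 == "DEVLINKS" { hit = ($2 == "/dev/disk/by-id/" id) }
        # The DEVNAME line follows DEVLINKS within the same record
        $1 == "DEVNAME" && hit { print $2; exit }
    '
}
```

Typical use would be: ./map_drives.sh | devname_for_scsi_id scsi-35000cca255dbf06f
Nothing is printed if the id is not found, so an empty result should be treated as "no match".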

Once the block id (/dev/sdjj) is obtained, it can be used to flash the status LED:

ledctl locate=/dev/sdjj

and to turn it off:
ledctl locate_off=/dev/sdjj
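
To avoid leaving a locate LED flashing after the intervention, the two commands can be combined in a helper that switches the LED off again after a delay. This is only a sketch: the function name blink_drive and the 60 second default are assumptions for this example, and it uses nothing beyond the two ledctl invocations shown above.

```shell
#!/bin/bash
# Sketch only: flash the locate LED of one drive for a while, then turn it off.
# blink_drive and the 60 s default are hypothetical choices for this example.
blink_drive() {
    local dev="$1" secs="${2:-60}"
    # Refuse anything that does not look like a /dev/sdXX block id
    case "$dev" in
        /dev/sd[a-z]*) ;;
        *) echo "usage: blink_drive /dev/sdXX [seconds]" >&2; return 2 ;;
    esac
    ledctl locate="$dev" || return 1   # start flashing
    sleep "$secs"                      # time to walk to the enclosure
    ledctl locate_off="$dev"           # stop flashing
}
```

Typical use would be: blink_drive /dev/sdjj 300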

Note: The map_drives.sh script has to be run every time, just before a drive is replaced, since the block ids can change over time and a stale mapping can point to the wrong drive.

Nobody tells you this, but ledctl and ledmon require OpenIPMI to be installed from the yum repos.

Now that the serial number of the drive is known, it can be located in an enclosure by this script:
(Pass the serial number as argument)
locate_serial.sh
************************************************************************************************************
#!/bin/tcsh

# JWS - March 2019 johann.swart@up.ac.za
# Locates a drive on Tux/Lustre in a specific enclosure
# Command line param gives serial number to search for. The serial can be obtained from map_drives.sh script

set serial=$1

set enclosure = `storcli64 /call/eall/sall show all|grep -B 2 $serial|grep Drive|cut -c9`
echo "Serial: "$serial
echo "Enclosure (0,1,2): "$enclosure

************************************************************************************************************
e.g.:
[root@tux drive_utils]# ./locate_serial.sh K1HZGA8D
Serial: K1HZGA8D
Enclosure (0,1,2): 2