VSAN 6.0 Part 7 – Blinking those blinking disk LEDs

Before I begin, this isn’t really a feature of VSAN so to speak. In vSphere 6.0, you can also blink LEDs on disk drives without VSAN deployed. However, because of the scale up and scale out features in VSAN 6.0, where you can have very many disk drives and very many ESXi hosts, being able to identify a drive for replacement becomes very important.

So this is obviously a useful feature. And of course I wanted to test it out, see how it works, etc. In my 4 node cluster, I started to test this feature on some disks in each of the hosts. On 2 out of the 4 hosts, this worked fine. On the other 2, it did not. Eh? These were all identically configured hosts, all running 6.0 GA with the same controller and identical disks. In the ensuing investigation, this is what was found.

First, this blinking of disk LEDs can be done from both the UI and the command line. The command at the CLI is as follows:

# esxcli storage core device set -d naa_id -l=locator -L=60

This command basically says light the LED on the disk with the above NAA id for 60 seconds. On the working hosts, this command simply returned to the prompt. On the non-working hosts, the following was returned:

Unable to set device's LED state to locator. 
Error was: No LSU plugin can manage this device.
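
As a minimal sketch, the same operation can be wrapped in two small shell functions, one to light the locator LED and one to switch it off again. The function names and the ESXCLI override variable are illustrative assumptions and not part of esxcli; the long options are the spelled-out forms of -d, -l and -L.

```shell
# Sketch only: thin wrappers around "esxcli storage core device set".
# ESXCLI is a hypothetical override, e.g. ESXCLI=echo previews the command.
ESXCLI="${ESXCLI:-esxcli}"

# Light the locator LED. $1 = device NAA id, $2 = seconds (default 60).
led_locate() {
    $ESXCLI storage core device set --device="$1" \
        --led-state=locator --led-duration="${2:-60}"
}

# Switch the LED off again before the duration expires. $1 = device NAA id.
led_off() {
    $ESXCLI storage core device set --device="$1" --led-state=off
}
```

Running `ESXCLI=echo led_locate naa_id 30` prints the command that would be issued without touching the host, which is handy for checking the device id first.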

If you try this via the UI, the behaviour is a little strange. You’ll find the icons for turning on and off LEDs when you select a disk drive in the UI. However the task will report as complete whether it succeeded or failed. You will have to go into the events view to see whether it succeeded or not. If it fails, it reports “Cannot change the host configuration”:

If you chase this back through the vpxd logs on vCenter, and then the vpxa and hostd logs on ESXi, the reason for the failure is:

Cannot set Led state on disk: 02000500 494341. 
Error: No LSU plugin can manage this device
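
Rather than reading the logs by eye, the error text can be searched for directly. This is a rough sketch: the function name is made up for illustration, and it assumes the message appears in the logs with the wording shown above.

```shell
# Sketch only: print every log line that records the LSU plugin failure.
# "$@" = one or more log files, e.g. /var/log/hostd.log /var/log/vpxa.log
find_lsu_errors() {
    grep -F 'No LSU plugin can manage this device' "$@"
}
```

On an ESXi host this could be called as `find_lsu_errors /var/log/hostd.log /var/log/vpxa.log`; when several files are given, grep prefixes each match with the file it came from.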

The first difference we noticed between the working and non-working hosts was in the way the devices were being reported. On the non-working hosts, a number of fields were returned as unknown:

# esxcli storage core device list -d naa.id
naa.id
   Display Name: HP Serial Attached SCSI Disk (naa.id)
   .
   .
   Vendor: HP 
   Model: LOGICAL VOLUME 
   .
   .
   Drive Type: unknown
   RAID Level: unknown
   Number of Physical Drives: unknown
   .

When an identical disk device was queried on a working host, these fields were populated:

# esxcli storage core device list -d naa.id
naa.id
   Display Name: HP Serial Attached SCSI Disk (naa.id)
   .
   .
   Vendor: HP 
   Model: LOGICAL VOLUME 
   .
   .
   Drive Type: logical
   RAID Level: RAID0
   Number of Physical Drives: 1
   .
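
With many disks per host, a quick way to spot affected devices is to filter the device listing for a Drive Type of unknown. The sketch below assumes the output layout shown above (an unindented device identifier followed by indented fields); the function name is hypothetical.

```shell
# Sketch only: reads "esxcli storage core device list" output on stdin and
# prints the identifier of every device whose Drive Type is "unknown".
find_unknown_drive_type() {
    awk '
        /^[^ \t]/ { dev = $1 }                      # unindented line: device id
        /^[ \t]+Drive Type: unknown/ { print dev }  # indented field line
    '
}
```

A typical call would be `esxcli storage core device list | find_unknown_drive_type`. An empty result suggests the plugin is able to query every device on that host.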

We then started to check on the VIBs that were installed on all hosts. We knew that certain LSU VIBs were automatically shipped with the hosts for the purposes of disk operations. On both the working and non-working hosts, the following LSU VIBs were present:

lsu-hp-hpsa-plugin
lsu-lsi-lsi-mr3-plugin
lsu-lsi-lsi-msgpt3-plugin
lsu-lsi-megaraid-sas-plugin
lsu-lsi-mpt2sas-plugin
lsu-lsi-mptsas-plugin

Since these were HP hosts with HP controllers, we then began to look at what additional HP VIBs were installed. This is when we spotted something interesting. The following HP VIBs were installed on the non-working hosts:

ata-pata-hpt3x2n
lsu-hp-hpsa-plugin 
scsi-hpsa

But when this was compared to the working hosts, there was a big difference:

char-hpcru  
char-hpilo  
hp-ams 
hp-build  
hp-conrep  
hp-esxi-fc-enablement  
hp-smx-provider  
hpbootcfg 
hpnmi  
hponcfg  
hpssacli  
hptestevent  
scsi-hpdsa  
scsi-hpsa  
scsi-hpvsa  
ata-pata-hpt3x2n
lsu-hp-hpsa-plugin

That was when we realized what was different between the hosts. Two of the hosts had been running a 5.5U2 OEM version of ESXi from HP, and were then upgraded to 6.0. The other two hosts were vanilla installs of 6.0. The ones upgraded from the HP OEM image were working; the vanilla installations were not.
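
A comparison like this can be scripted if the output of `esxcli software vib list` from each host is saved to a file. The following is a sketch under two assumptions: the listing starts with a header line and a separator line, and the VIB name is the first column. The function name and file names are hypothetical.

```shell
# Sketch only: print VIBs that appear in the first saved listing but not in
# the second. $1, $2 = files holding "esxcli software vib list" output.
vibs_only_in_first() {
    awk '
        FNR <= 2 || !NF { next }           # skip header, separator, blank lines
        NR == FNR { first[$1] = 1; next }  # first file: remember each VIB name
        { delete first[$1] }               # second file: drop names seen here
        END { for (v in first) print v }   # whatever is left is only in file 1
    ' "$1" "$2" | sort
}
```

For example, `vibs_only_in_first working-host.txt vanilla-host.txt` would list the VIBs that exist only on the working host.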

After some further investigation, we discovered that the VIB that we need to blink the LEDs on HP controllers is “hpssacli”, which is available from the HP web site. Once this VIB was installed, all disks reported valid drive types and RAID levels (rather than “unknown”) and the blinking LED operation also worked.

Therefore if you use HP controllers, and you wish to identify drives by blinking LEDs, the options that are available to you are:

  • Install the ESXi 6.0 OEM image from HP
  • Install the standard ESXi 6.0 image from VMware and then install the hpssacli VIB

Please note that if you install the OEM image from HP, verify that the drivers have been certified for VSAN use. I’ve reported on this previously.
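
For the second option, installing a downloaded VIB from the ESXi shell might look like the sketch below. The function name, the ESXCLI override variable and the file path in the usage note are hypothetical; use the actual file obtained from HP. The check reflects that esxcli expects an absolute path when a local VIB file is given.

```shell
# Sketch only: install a locally stored VIB file with esxcli.
# ESXCLI is a hypothetical override, e.g. ESXCLI=echo previews the command.
ESXCLI="${ESXCLI:-esxcli}"

# $1 = absolute path to the VIB file on the host
install_vib() {
    case "$1" in
        /*) $ESXCLI software vib install -v "$1" ;;
        *)  echo "error: VIB path must be absolute: $1" >&2
            return 1 ;;
    esac
}
```

After something like `install_vib /tmp/hpssacli.vib`, running `esxcli software vib list | grep hpssacli` should show whether the VIB is now present.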

Finally, if you are using controllers based on LSI (basically all other non-HP controllers), you don’t have to worry about extra VIBs as all of the necessary software needed to blink the disk LEDs is pre-installed.

[Update – Sep 2016] It appears that certain HP controllers (e.g. P440 and H240) when used in pass-thru mode do not allow the disk LEDs to be blinked. We are investigating why this is happening, and will report back as soon as we know more. This does not apply to controllers that are run in RAID-0 mode (e.g. P420i). These continue to function as expected.

3 comments
  1. I have a customer that is more concerned about disk access latency with VSAN. If data is spread across multiple nodes in the cluster, how does this affect IO latency for the VM? Traditional array-based workload latency numbers are fairly well known, but since the data is distributed in VSAN, it would be good to understand how this affects things like latency.

    • The first thing to point out is that even though the data is spread out across hosts and disks in the cluster, it is still only a single hop away. If you are using 10GbE, the latency incurred will only be in the tens of microseconds. Where latency will become a concern is when we introduce support for stretch clusters over long distances. This is something we are currently looking at. However, in current VSAN deployments where the recommendation is to keep everything on the same L2, latency is not such a concern.
