ESXi Host Slow Boot stuck on vmw_satp_alua

I came across this recently while upgrading ESXi hosts. Part of our phased migration to vSphere 6.7 was to get our ESXi hosts from version 5.5 to version 6.0. The normal process applied: evacuate the VMs using DRS, put the host in maintenance mode, attach the ESXi 6.0 upgrade baseline, remediate the host and… 3 hours later the host appears on the radar. What! 3 hours?

One clue here was that the ESXi boot process seemed to be “stuck” on vmw_satp_alua. So why was the boot being delayed by scanning my storage paths? We took a jump across to our vmkernel.log and saw this line (one of many):

2018-08-15T06:20:29.191Z cpu12:33249)NMP: nmp_ResetDeviceLogThrottling:3440: Error status H:0x0 D:0x18 P:0x0 Sense Data: 0x5 0x24 0x0 from dev "naa.6000d31000086f000000000000004e3b" occurred 2533 times(of 2536 commands)

Error status H:0x0 D:0x18 P:0x0 Sense Data: 0x5 0x24 0x0

D:0x18 indicates that the device status is a Reservation Conflict. This device is an RDM mapped to the host, but more importantly it is a shared disk in a Microsoft cluster (often referred to as MSCS, or Microsoft Cluster Service).
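If you want to see which devices are logging this status, a small shell helper can pull the naa IDs out of the log. This is only a minimal sketch: the function name list_reservation_conflicts is made up for illustration, and it assumes the log lines look like the one quoted above.

```shell
#!/bin/sh
# Print the unique naa IDs found on log lines that carry device status D:0x18
# (reservation conflict). Reads the files given as arguments, or stdin if none.
# The function name is hypothetical, not part of any VMware tooling.
list_reservation_conflicts() {
    # Keep only lines containing D:0x18 and reduce each one to its naa ID.
    sed -n 's/.*D:0x18.*\(naa\.[0-9a-fA-F]*\).*/\1/p' "$@" | sort -u
}
```

On the host you could run it as list_reservation_conflicts /var/log/vmkernel.log. Cross-check every ID it prints against your MSCS RDMs before acting on it.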

We then came across the VMware KB https://kb.vmware.com/s/article/1016106

So the ESXi host was scanning all the RDMs but getting stuck (delayed) when trying to discover any LUN taking part in an MSCS cluster. The host is unable to interrogate the LUN because of the persistent SCSI reservation placed on the device by the active node in the cluster.

Fortunately this is expected behaviour, and there is a workaround. We can set a “Perennially Reserved” flag (you will be forgiven for struggling to pronounce that before you have had coffee). This tells the host to skip scanning that LUN at boot, which speeds up your boot time.

Before we show you how to do this, let's point something out. It is important, IMO, that the “Perennially Reserved” flag only be set on the RDMs that are taking part in your MSCS cluster. Setting this flag accidentally on a LUN that is a VMFS datastore could, in VMware's words, result in data loss.

To set the flag, run:

esxcli storage core device setconfig -d naa-id --perennially-reserved=TRUE

For example:

esxcli storage core device setconfig -d naa.6000d31000086f000000000000004e3b --perennially-reserved=TRUE
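When a cluster has several RDMs, typing the command once per device gets old quickly. Here is a hedged sketch of a loop for the ESXi shell. The function name mark_perennially_reserved and the DRY_RUN variable are my own illustrations, not esxcli features; only the esxcli command inside is the real one from above.

```shell
#!/bin/sh
# Set the perennially reserved flag on every naa ID passed as an argument.
# DRY_RUN=1 is a hypothetical switch for this sketch: it prints the commands
# instead of running them, so the list can be reviewed first.
mark_perennially_reserved() {
    for dev in "$@"; do
        case "$dev" in
            naa.*) ;;  # looks like an naa ID, carry on
            *) echo "skipping '$dev': not an naa ID" >&2; continue ;;
        esac
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "esxcli storage core device setconfig -d $dev --perennially-reserved=true"
        else
            esxcli storage core device setconfig -d "$dev" --perennially-reserved=true
        fi
    done
}
```

Running it with DRY_RUN=1 first is a cheap way to confirm that only MSCS RDMs are in the list, given the data loss warning about VMFS datastores.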

You can check whether the device has the flag set by running:

esxcli storage core device list -d naa.id

esxcli storage core device list -d naa.6000d31000086f000000000000004e3b
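The field to look for in that output is Is Perennially Reserved. For scripting the check, something like the following could work. It is a sketch under assumptions: is_perennially_reserved is a hypothetical helper name, and it assumes the field is printed as "Is Perennially Reserved: true" when the flag is set.

```shell
#!/bin/sh
# Succeed (exit status 0) when the device list output for the given naa ID
# reports the perennially reserved flag as true. Helper name is illustrative.
is_perennially_reserved() {
    esxcli storage core device list -d "$1" | grep -q 'Is Perennially Reserved: true'
}
```

For example: is_perennially_reserved naa.6000d31000086f000000000000004e3b && echo "flag is set"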

Remember, these commands will need to be applied to each applicable ESXi host. Some really intelligent people have some great examples of using PowerShell in PowerCLI to make this job easier for larger environments.


Ben Whitmore

November 1, 2018 at 7:11 am

To get the SCSI canonical name for your VM that is participating in an MSCS cluster, you can run:
Get-VM vmname | Get-Harddisk -DiskType rawPhysical,rawVirtual | Select SCSICanonicalName

Brian Farrugia

August 9, 2019 at 1:31 pm

Thank you very much for this article. I found it very useful. Wasn’t aware that I had to set perennial reservation on MSCS RDMs. Did a lot of search before but apparently not enough :/
On ESXi 6.7 the parameters for esxcli storage core device setconfig -d naa-id --perennially-reserved=TRUE changed slightly, requiring an = after -d
esxcli storage core device setconfig -d=naa-id --perennially-reserved=TRUE
Brian

Ben Whitmore
August 11, 2019 at 10:52 pm

Thanks Brian, I hadn’t observed the changed command for 6.7. What doc did you see this in out of interest?

Anonymous
August 12, 2019 at 7:12 am

Hi Ben,
this morning I went again to replicate the error I was getting without the ‘=’ and it just worked. Copied and pasted the same command. Don’t know what went wrong 🙂 It appears the command line in your post still holds 🙂
Thanks again for your post
