VSAN considerations when booting from USB/SD

This is a conversation that comes up time and time again. It boils down to the following considerations when booting an ESXi host that is participating in VSAN from an SD/USB device:

  1. What should I do for persisting the vmkernel logs?
  2. What should I do about persisting the VSAN trace files?
  3. What should I do for capturing core dumps (PSOD)?

To find answers to these questions, it is first of all important to understand how booting ESXi from a USB/SD device differs from booting from a physical disk. Here is a sample partition layout from a USB/SD device:

[root@cs-ie-h01:~] partedUtil getptbl \
/vmfs/devices/disks/mpx.vmhba32:C0:T0:L0
gpt
250 255 63 4028416
1 64 8191 C12A7328F81F11D2BA4B00A0C93EC93B systemPartition 128
5 8224 520191 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0
6 520224 1032191 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0
7 1032224 1257471 9D27538040AD11DBBF97000C2911D1B8 vmkDiagnostic 0
8 1257504 1843199 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0

In this example, my ESXi host is booted from SD. There are 3 VFAT partitions (5, 6 and 8) created on the SD device, reported as linuxNative above.

[root@cs-ie-h01:~] esxcfg-scsidevs -f 
mpx.vmhba32:C0:T0:L0:8 /vmfs/devices/disks/mpx.vmhba32:C0:T0:L0:8 \
545caccd-0f395e04-b6f2-001f29595f9f
mpx.vmhba32:C0:T0:L0:5 /vmfs/devices/disks/mpx.vmhba32:C0:T0:L0:5 \
53fe0de8-1e5cebd8-68dd-fd73c38a876e
mpx.vmhba32:C0:T0:L0:6 /vmfs/devices/disks/mpx.vmhba32:C0:T0:L0:6 \
6a9f0fe0-cd3bae2f-6183-2c68615182ec
[root@cs-ie-h01:~]

So how are these used? Well, if the contents of the root (/) directory are listed on the ESXi host, the following links are visible:

altbootbank -> /vmfs/volumes/6a9f0fe0-cd3bae2f-6183-2c68615182ec 
bootbank -> /vmfs/volumes/53fe0de8-1e5cebd8-68dd-fd73c38a876e
locker -> /store
productLocker -> /locker/packages/6.0.0
store -> /vmfs/volumes/545caccd-0f395e04-b6f2-001f29595f9f
vmupgrade -> /locker/vmupgrade

There are the bootbank and altbootbank partitions, which are used, as the names imply, as the primary and backup boot partitions by ESXi.

Then there is /store. This is the interesting one, as you can see that /locker is linked to /store. We will return to /locker shortly.

The command vmkfstools -Ph -v10 (/bootbank | /altbootbank | /store) may be used to display further information about the partitions if necessary.

[root@cs-ie-h01:~] vmkfstools -Ph -v10 /store 
Could not retrieve max file size: Inappropriate ioctl for device
vfat-0.04 file system spanning 1 partitions.
File system label (if any):
Mode: private
Capacity 285.8 MB, 664 KB available, file block size 8 KB, \
max supported file size 0 bytes
UUID: 545caccd-0f395e04-b6f2-001f29595f9f
Logical device: mpx.vmhba32:C0:T0:L0:8 
Partitions spanned (on "disks"):
mpx.vmhba32:C0:T0:L0:8
Is Native Snapshot Capable: NO
OBJLIB-LIB: ObjLib cleanup done.
WORKER: asyncOps=0 maxActiveOps=0 maxPending=0 maxCompleted=0
[root@cs-ie-h01:~]

Now that we have had a look at the persisted partitions on the SD/USB device, it is time to look at the RAMdisks. A RAMdisk is basically a partition or storage space that resides in memory, which means that the contents of a RAMdisk are not persisted across a reboot. The command “vdf -h” can be used to examine the RAMdisks on an ESXi host:

[root@cs-ie-h01:~] vdf -h

-----
Ramdisk 1k-blocks Used Available Use% Mounted on
root 32768 3800 28968 11% --
etc 28672 324 28348 1% --
opt 32768 0 32768 0% --
var 49152 548 48604 1% --
tmp 262144 90156 171988 34% --
vsantraces 307200 104428 202772 33% --
epd 51200 24 51176 0% --
hostdstats 822272 2416 819856 0% --
[root@cs-ie-h01:~]

The interesting items here are that tmp and vsantraces are both RAMdisks. Let’s take another look at the root directory on the ESXi host.

 scratch -> /tmp/scratch

And then if we examine the contents of the /var/log directory, we see that the logs are all linked back to /scratch, which is linked to /tmp, which is of course a RAMdisk.

[root@cs-ie-h01:/var/log] ls -la vmk*.log
vmkdevmgr.log -> /scratch/log/vmkdevmgr.log
vmkernel.log -> /scratch/log/vmkernel.log
vmkeventd.log -> /scratch/log/vmkeventd.log
vmksummary.log -> /scratch/log/vmksummary.log
vmkwarning.log -> /scratch/log/vmkwarning.log

So if this host is rebooted, the logs will be lost. This is why it is highly recommended as a best practice to configure a syslog server so that logs do not get lost. The other alternative is to make persistent storage available to the ESXi host and redirect the logs to that persistent storage. Note that the logs cannot be stored on the VSAN datastore at this time. One reason for this relates to triage: if the VSAN datastore is impacted, then we will not be able to retrieve the logs and triage the issue.

Note that if you install ESXi on a local physical magnetic disk, this is not an issue. If a local disk is used to deploy ESXi, and there is enough space, a local VMFS partition (partition #3) is created during the ESXi installation. The scratch space is then automatically configured during installation, or on first boot of the ESXi host, to use the local VMFS volume (it does not need to be manually configured), e.g. /vmfs/volumes/vmfs-datastore/.locker/

For more information on /scratch configuration, please read VMware KB Article 1033696.
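As a rough illustration, a remote syslog target, or a persistent local log directory, can be configured with esxcli. The hostname, port and datastore path below are placeholders, not values from this environment:

```shell
# Point the ESXi syslog service at a remote collector
# (hostname and port are placeholders).
esxcli system syslog config set --loghost='tcp://syslog.example.com:514'

# Alternatively, redirect logs to a persistent datastore path
# (datastore name is a placeholder; must not be the VSAN datastore).
esxcli system syslog config set --logdir='/vmfs/volumes/local-datastore/logs'

# Reload the syslog agent to pick up the changes, and allow
# outbound syslog traffic through the ESXi firewall.
esxcli system syslog reload
esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true
```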

Let’s now turn our attention to the VSAN traces. VSAN traces help VMware support and engineering to understand what is going on internally in VSAN. It should be noted that these traces are *not* part of syslog, so if you set up a syslog server to capture the VMkernel logs, you will not capture the VSAN traces. VSAN trace files are not persisted with syslog because their bandwidth requirements are too high.

VSAN traces require ~500MB of disk space.

Since these traces are of extreme importance to VMware support, extra efforts are made to preserve them when /scratch is not on persistent storage. In these cases, when the ESXi host is booted from SD/USB and the VSAN traces are on a RAMdisk, they also get copied to /locker for persistence via /etc/init.d/vsantraced when the host reboots. Since /locker is relatively small, typically not all of the VSAN trace files will fit. To accommodate this, they are saved in value order so that the most recent/significant information is captured first.
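The value-order idea can be sketched with a small script: copy trace files newest-first into the small destination until its space budget is exhausted. This is only an illustration of the principle; the paths, file names and sizes below are made up, and the real logic lives in /etc/init.d/vsantraced:

```shell
#!/bin/sh
# Illustrative sketch only: save files newest-first into a small
# destination until its space budget is used up.
SRC=/tmp/trace_src
DST=/tmp/locker_sim
BUDGET=30                      # bytes available in the "locker" (made up)

mkdir -p "$SRC" "$DST"
printf 'older trace data......' > "$SRC/vsantraces--1.gz"
printf 'newest trace data.....' > "$SRC/vsantraces--2.gz"
touch -t 202001010000 "$SRC/vsantraces--1.gz"   # mark file 1 as older
touch -t 202001020000 "$SRC/vsantraces--2.gz"   # mark file 2 as newer

used=0
for f in $(ls -t "$SRC"); do                    # newest first
    size=$(wc -c < "$SRC/$f")
    [ $((used + size)) -gt "$BUDGET" ] && break # budget exhausted
    cp "$SRC/$f" "$DST/"
    used=$((used + size))
done
ls "$DST"                                       # only the newest file fits
```

With a 30-byte budget and two 22-byte files, only the newer trace file is preserved; the older one is dropped, mirroring how the most valuable traces are kept when /locker cannot hold everything.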

A common question is why we do not just persist the VSAN traces to the SD/USB device rather than taking this extra step. Again, it is due to the bandwidth of the VSAN trace files. The concern here is that the number of writes generated by the VSAN traces, and there are a lot of them, can burn out a USB/SD card. In this way, we preserve the lifespan of the SD/USB device. The command “esxcli storage core device stats get” can be used to examine the I/O going to a device if you are interested.
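For example, to look at the cumulative I/O statistics for the boot device used earlier in this post:

```shell
# Show cumulative I/O statistics for a single device
# (device ID taken from the earlier examples).
esxcli storage core device stats get -d mpx.vmhba32:C0:T0:L0
```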

When VSAN trace files are being written to a RAMdisk, they should also be captured in the core dump on a PSOD. This can be verified with the command “esxcli system visorfs ramdisk list”. I have truncated a lot of the columns to make the output easier to read. The third column in the output is “Include in Coredumps”, and for vsantraces it should be set to true, as shown below:

[root@localhost:~] esxcli system visorfs ramdisk list
Ramdisk Name  System  Include in Coredumps   Reserved      Maximum        
------------  ------  --------------------  ---------  -----------  
root            true                  true  32768 KiB    32768 KiB        
etc             true                  true  28672 KiB    28672 KiB         
opt             true                  true      0 KiB    32768 KiB            
var             true                  true   5120 KiB    49152 KiB       
tmp            false                 false   2048 KiB   262144 KiB    
hostdstats     false                 false      0 KiB  1078272 KiB    
vsantraces     false                  true      0 KiB   307200 KiB  
epd            false                  true      0 KiB    51200 KiB      
[root@localhost:~]

Now there is one final concern, and this has to do with capturing core dumps when a host PSODs. The following is the support statement on this for Virtual SAN, keeping in mind that the minimum USB/SD size that we support booting ESXi from is 4GB. You can probably get away with using smaller USB/SD cards for standard ESXi deployments, but for VSAN deployments we are recommending 4GB.

For hosts with up to 512GB of memory, we support booting the ESXi host from SD cards/USB with no persistent storage. In this scenario, a RAMdisk is used to store the vmkernel logs. A RAMdisk is also used to store the VSAN traces, but the contents of the vsantraces RAMdisk are saved to /locker for persistence when the host shuts down. In general, the traces will not all fit on /locker, but they are saved in value order.

Note that the VMkernel logs are not persisted either in these cases, and it is highly recommended that a syslog server is configured to save the logs.

For hosts with greater than 512GB of memory, a separate physical disk for persistent storage is required. This means /scratch will be on persistent storage, so the vmkernel logs will be persisted (/tmp will be a partition on disk), and rather than copying the vsantraces RAMdisk to /locker, the traces will instead be copied to the persistent storage.

Why is the limit set at 512GB of memory? It is basically due to PSOD sizes. Since we support a minimum SD/USB size of 4GB for a boot device, 2.2GB of the device is set aside for the core dump. Before vSphere 5.5, the vmkDiagnostic partition was only 100MB in size. However, as hosts started to ship with multi-gigabyte, and now multi-terabyte, memory configurations, this partition size was too small. The new 2.2GB vmkDiagnostic partition size was implemented to allow the capture of core files on hosts with large memory. On a PSOD, the core dump will be dumped onto the USB/SD device. If memory sizes are any bigger than 512GB, we may not be able to capture the core dump.

To see where the core dump partition resides, the following command can be used:

[root@cs-ie-h01:/var/log] esxcli system coredump partition get
Active: mpx.vmhba32:C0:T0:L0:7
Configured: mpx.vmhba32:C0:T0:L0:7

The net dump utility is also available to redirect the core dump to an external location if a host PSODs. It is a post-crash feature that sends the core dump “unreliably” over a UDP connection. Unfortunately, this tool does have some limitations: a single transmission failure will result in a failed core dump collection, and thus there will be no core dump for root cause analysis.
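If you want to try net dump despite these limitations, the collector can be configured along these lines. The VMkernel interface name and collector address below are placeholders:

```shell
# Point core dumps at a network dump collector
# (interface name and server address are placeholders).
esxcli system coredump network set --interface-name vmk0 \
    --server-ipv4 192.168.1.10 --server-port 6500
esxcli system coredump network set --enable true

# Verify the host can reach the configured collector.
esxcli system coredump network check
```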

One final note on VSAN traces. If you do have an NFS datastore available, the VSAN traces can be redirected there via the “esxcli vsan trace set” command. In this first example, the traces are still at their default location, going to the RAMdisk:

[root@cs-ie-h01:/var/log] ls -la
 vsantraces -> /vsantraces

In this next example, they are redirected to an NFS datastore:

[root@cs-ie-h01:/var/log] ls -la
 vsantraces -> /vmfs/volumes/NFS-Isilon/
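The redirection itself would look something like this; the datastore path matches the example above, while the retention values are just illustrative:

```shell
# Redirect VSAN traces to a directory on the NFS datastore.
esxcli vsan trace set --path /vmfs/volumes/NFS-Isilon/vsantraces

# Optionally adjust retention: number of trace files kept and the
# maximum size (in MB) of each (example values).
esxcli vsan trace set --numfiles 8 --size 45
```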

Hope this helps to clarify some of the reasoning behind the boot recommendations.