vSAN

A quick reference to vSAN content

vSAN is VMware’s Hyper-converged Infrastructure (HCI) platform, offering both compute and distributed storage in a single solution.

Books

Posts


118 Replies to “vSAN”

  1. Can you give us some detail on calculating disk yield? If I have 3 nodes with 1TB each, will I see 3TB of storage? Does a VM that uses 50GB of storage take up 50GB, 100GB, or 150GB?

    1. There should be a sizing guide going live shortly, but all magnetic disks across all the hosts will contribute to the size of the VSAN datastore. The SSDs (or flash devices) do not contribute to capacity. So if you had 1TB of magnetic disk in each of 3 nodes, your VSAN datastore will be 3TB.

      The amount of disk consumed by your VM is based primarily on the failures to tolerate (FTT) setting in the VM Storage Policy. An FTT of 1 implies 2 replicas/mirrors of the VMDK. Therefore a 50GB VMDK created on a VM with FTT=1 will consume 100GB. A 50GB VMDK created on a VM with FTT=2 will have 3 replicas/mirrors and will therefore consume 150GB. Hope that makes sense. Lots of documentation coming around this.
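
To make the arithmetic in the reply above concrete, here is a minimal sketch of the capacity consumption calculation (illustrative only: it assumes plain RAID-1 mirroring and ignores witness components, metadata overhead, and the fact that objects are thin provisioned):

```python
def raw_capacity_consumed_gb(vmdk_size_gb, ftt):
    """Raw vSAN datastore capacity eventually consumed by a mirrored VMDK.

    FTT=n means n+1 replicas of the data. Witness components and on-disk
    metadata are ignored here, and thin provisioning means the space is
    consumed over time rather than up front.
    """
    return vmdk_size_gb * (ftt + 1)

print(raw_capacity_consumed_gb(50, 1))  # 100 GB (two 50GB replicas)
print(raw_capacity_consumed_gb(50, 2))  # 150 GB (three 50GB replicas)
```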

  2. Hi Cormac,

    I need to understand the “Note” in the VSAN Part 9 topic:

    On the vSphere HA interop:

    ….”Note however that if VSAN hosts also have access to shared storage, either VMFS or NFS, then these datastores may still be used for vSphere HA heartbeats”

    Question:
    If, for example, all the VSAN hosts also have shared VMFS datastore(s) (say on an FC SAN), do I then have two kinds of HA protection: if a VM is located on the VSAN datastore it gets VSAN HA protection, and if a VM is located on the shared VMFS datastore it gets traditional HA protection?

    Thanks

  3. Just to clarify the whole disk consumption based on the FTT setting… going back to your example of FTT=1 for a 50GB VM:

    Are you saying that it will consume an additional 100GB of space due to the 2 replicas created? Or are you saying that the original VM (VMDK) that is created is counted as one of those replicas?

    [quote]
    “therefore a 50GB VMDK created on a VM with an FTT=1 will consume 100GB”

    To be completely clear, would it be better to say “will consume an extra 100GB in addition to the 50GB VM (VMDK)”?

    I’ve done countless days of research over the past ~6 months or so, but every time I hear that, it throws off my understanding of how FTT relates to disk consumption.

    Thank you in advance for your time, if you choose to respond.

    *I read your book BTW, you and Duncan Epping are rockstars in the world of virtualization….really good read. Couldn’t have asked for more.

    -JamesM

    1. It means that 2 x 50GB replicas are created for that VMDK, James, meaning 100GB in total is consumed on the VSAN datastore (not an additional 100GB). Note however that VMDKs are created as thin provisioned on the VSAN datastore, so it won’t consume all of that space immediately, but over time.

      Thanks for the kind words on the book – always nice to hear feedback like that.

      1. Thanks for the reply and clarification…so to make sure I get this right, there will be a single VMDK for the actual VM running in the environment BUT since VSAN is in use, if your FTT=1, then 100GB will be consumed by the 2 replicas that are created (over time with thin provisioning).

        I think my confusion is in the semantics of how everyone explains it.

        1. Yep – you got it. A single 50GB VMDK, made up of two 50GB mirrors/replicas, each replica sitting on a different disk (and host) but the same datastore, and eventually consuming 100GB in total on the VSAN datastore.

  4. I have a question for you regarding Part 13 in which you refer to “the VM swap file” and the “swap object”. How does the vmx-*.vswp file fit into all this? This file was introduced in 5.0. Does this file belong in the swap object? Is there a second swap object for it? Or does it simply belong to the VM namespace object?

    1. Yes – this is what we are referring to. This is now instantiated as its own object on the VSAN datastore, and does not consume space in the VM namespace object.

  5. Hi Cormac,
    A question about the “Virtual SAN 6.0 Design and Sizing Guide”. On page 46 it states ‘For hybrid configurations, this setting defines how much read flash capacity should be reserved for a storage object. It is specified as a percentage of the logical size of the virtual machine disk object.’ So, a percentage of the logical size (used storage). The example on page 47 takes the flash read cache reservation as a percentage of the physical space (allocated storage). Which is correct?

    Thanks.

    1. These statements are meant to reflect the same thing Stevin. When I say that it is a “percentage of the logical size”, this is not the same as “used storage”.

      All VMDKs on VSAN are thin by default. They can be pre-allocated (made thick) through the use of the Object Space Reservation capability.

      However, whether you use that or not, you request a VMDK size during provisioning, e.g. 40GB. Now you may only use a portion of this, e.g. 20GB, as it is thin provisioned.

      But Read Cache is based on a % of the requested size (the logical size/allocated storage), so 40GB. Hopefully that makes sense.

      Cormac
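
A small sketch of the distinction drawn in the reply above, for a hybrid configuration. The 10% value is only an example for the flash read cache reservation policy setting, not a recommendation:

```python
def read_cache_reservation_gb(provisioned_gb, reservation_pct):
    """The flash read cache reservation is a percentage of the *logical*
    (provisioned) size of the disk object, not of the space actually used."""
    return provisioned_gb * reservation_pct / 100.0

provisioned_gb = 40  # size requested when the VMDK was provisioned
used_gb = 20         # space actually written so far (thin provisioned)

# The reservation is calculated from the 40GB logical size, not the 20GB used.
print(read_cache_reservation_gb(provisioned_gb, 10))  # 4.0 GB
```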

  6. Hi Cormac,
    Regarding your book Essential VSAN – excellent book, BTW. The book states: “In the initial version of VSAN, there is no proportional share mechanism for this resource when multiple VMs are consuming read cache, so every VM consuming read cache will share it equally.” How should I read this? Will the total flash read cache size be divided by the number of VMs consuming VSAN storage, and is that the amount of flash read cache each VM gets? (This would be a problem for read-intensive VMs with more storage than average.)
    What about the write cache? Every write has to go through the write cache, I presume? How is the write cache shared between VMs?

    thanks again.

    1. Hi Cormac

      I would be very interested to know about read and write cache allocation to VMs when the reservation is set to 0, for vSAN 6.2.

      If I copy a large file from C: to the D: drive in my Windows VM, I see very poor transfer rates compared to the same copy on a PC (less than half the speed). The transfer rate drops to zero for up to 7 seconds at times during the transfer. It’s almost as if the cache allocation has filled up and it’s waiting for destaging to complete.

      Thanks

      1. Hi Karl,

        It is extremely difficult to figure this out without getting logs, etc., so I would recommend opening a call with support.
        However, there were some significant bug fixes in the most recent patch – VMware ESXi 6.0, Patch Release ESXi600-201611001 (VMware ESXi 6.0 Patch 04). Are you running this?

  7. Hi Cormac
    A question about the ratio of SSDs to HDDs. What’s the best number for the ratio? From a system-level view, if there is only one HDD I believe performance will not be good (your data will be gated on one HDD interface), but if there are around 10 HDDs, the SSD can’t provide enough cache for them all. I’m just wondering if there’s a perfect ratio?

    1. It is completely dependent on the VMs that you deploy. If you have very I/O intensive VMs each with large working sets (data is in a state of change), then you will need a large SSD:HDD ratio. If you have very low I/O VMs with quite small working sets, you can get away with a smaller SSD:HDD capacity. Since it is difficult to state which is the best for every customer, we have used a 10% rule-of-thumb to cover most virtualized application workloads.
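
As a rough illustration of that 10% rule of thumb (a sketch only, and not a substitute for the sizing guides referenced elsewhere on this page):

```python
def flash_cache_estimate_gb(anticipated_consumed_gb, rule_of_thumb=0.10):
    """Rule-of-thumb flash cache tier sizing: roughly 10% of the anticipated
    consumed storage capacity, spread across the cluster's disk groups."""
    return anticipated_consumed_gb * rule_of_thumb

# e.g. 100 VMs expected to consume ~100GB each on the vSAN datastore
print(flash_cache_estimate_gb(100 * 100))  # 1000.0 GB of flash across the cluster
```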

      1. Appreciated, Cormac – I understand the ratio will determine the performance, and users can configure it for their application case, which provides flexibility.
        I may not have made it clear:
        the ratio I mentioned is the number of physical devices, not the capacity.
        Or does performance have no relationship with the physical device ratio, and is it only affected by the SSD:HDD capacity ratio?

        1. This is one of those “it depends” answers, Lyne.

          If all of your writes are hitting the cache layer, and all of your reads are also satisfied by the cache layer, and destaging from flash to disk is working well, then 1:1 ratio will work just fine.

          If however you have read cache misses that need to be serviced from HDD, or there is a large amount of writes in flash that needs to be regularly destaged from flash to HDD, then you will find that a larger ratio, and the use of striping across multiple HDDs for your virtual machine objects, can give better performance.

  8. Yes, Cormac.
    That’s my concern. We’re struggling with the performance difference between 1 SSD : 4 HDDs and 1 SSD : 5 HDDs.
    I think if we have a big SSD, there should be less chance of missing the cache.
    Even if the cache is missed, the difference between 4 HDDs and 5 HDDs shouldn’t matter much, right?
    Maybe we need to set up an environment and collect some test data. 🙂

  9. Hi Cormac,
    I run VSAN with 3 hosts
    and have also configured LACP between the servers and the switch.
    The SSDs are 512GB Samsung Pro 850s.
    After setting it up I tested the copy speed between 2 VMs running on VSAN, and the speed is between 20MB/s and 60MB/s.
    What is my problem?
    Note that in Smart Storage Administrator I disabled caching on the SAS and SSD disks, then tested the speed and it was very bad.
    I then deleted the arrays and recreated them with caching enabled, and again the speed was very bad.
    Please help.

    1. Hi Morteza,

      Noticed you’re using the same Samsung consumer grade SSDs that I thought would work. Are you using them as the caching tier or as the capacity drives? In my case, I used them as the caching tier and had all sorts of issues, even down to Permanent Disk Loss errors randomly appearing, requiring a host reboot. I’ve since moved them to the capacity tier and put in some Enterprise SSDs and so far, haven’t had any further issues.

      Thanks

      Andrew

  10. I posted this on the VM/Host affinity groups, but didn’t get a reply. I’m looking at setting up a VSAN stretched cluster. Can you help answer this?

    ——–

    How do VM/Host affinity groups work with fault domains? I’m looking at setting up a VSAN, and setting the fault domain for site A to be site B. As I understand it, by doing that, when I set FTT=1 the data will be replicated to site B instead of to another node at site A. This is to cover the case where we lose the entire rack at Site A. The VMs will be able to reboot at Site B off of the replicated data at Site B.

    If I were to use VM/Host affinity groups, then wouldn’t I need to replicate to a second node at site A? Would that mean setting FTT=2, and it would replicate to a node at site A, and a node at site B? Maybe VM/Host affinity groups don’t work when using fault domains. Can you help me sort that out?

    1. First, VSAN Stretched Cluster only supports FTT=1. Fault Domains and FTT work together.

      If you have a failure on site A, the VM/Host Affinity rules will attempt to restart the VM on the same site, i.e. site A.

      If you have a complete site failure (e.g. lost power on site A), the VM/Host affinity rules will then attempt to restart the VM on the remote site, i.e. site B.

      You still need to use Fault Domains with Stretched Cluster, but simply as a way of grouping the hosts on each site together.

      This should be well documented in the stretched cluster guide. There is also a PoC guide due to be released very soon which will provide you with further detail.

      1. Thanks for your reply.

        So if a stretched cluster has FTT=1, then doesn’t that mean it will only replicate data to another node at site B? If it only replicates to another node at site B, and a node at site A goes down, how will VM/HA rules be able to restart the VM on the same site A?

      2. Hi,

        Can you say if removing this limitation of Stretched Cluster (FTT=1 only) is on the roadmap? We are looking at implementing it but would like to have 2 copies on the primary site + 1 on the secondary (or maybe a 2+2 active-active configuration).

        Thanks, Vjeran

        1. We are definitely looking at this, and the plan is to improve upon it. But there are no dates for the feature that I can share with you, I’m afraid.

          1. This didn’t make it into the 6.2 release? Sometimes those details don’t get advertised at launch, so I’m still hoping… 😉

  11. About the Health Check plugin: any thoughts on why it triggers the ‘Site Latency Health’ alarm between host and witness at as low as 15ms, when less than or equal to 100ms is the recommended figure? Is there any way to tweak this?

  12. Hi Cormac, I read your book BTW. You and Duncan Epping are really good in the world of virtualization… really good read. You have expertise in virtualization. Couldn’t have asked for more.

  13. Hi Cormac,

    I found that some posts have a reply box, but some do not.
    I have read the post “VSAN 6.2 Part 1 – Deduplication and Compression” and wanted to leave a reply there, but it seems that there is no space for one…

    How can I leave a reply to that post?

      1. OK, got it.

        So I’ll post my questions about “Deduplication and Compression” here as a last resort.

        My environment is as follows:
        1. 3 all-flash ESXi hosts with dedup and compression enabled.
        2. Only the PSC, the VCSA and 2 other VMs have been deployed, with less than 1TB used in total.
        3. The object space reservation is 0% with the default VSAN storage policy.

        But what I see is:
        1. The deduplication and compression overhead is 6.5TB.
        2. The ‘used-total’ grows to about 2TB after enabling dedup and compression.

        Is that normal after enabling the feature? BTW, is there any formula I can use to calculate the expected consumed capacity?

        Thanks.

          1. Yes, 5% of the total raw capacity is correct. I also found the same description in the VMware documentation.

            Thank you so much.
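
For reference, a minimal sketch of the overhead figure discussed above. The ~5% of total raw capacity is the figure mentioned in this thread and in the vSAN documentation; real-world overhead can vary:

```python
def dedup_compression_overhead_tb(raw_capacity_tb, overhead_fraction=0.05):
    """Approximate capacity reserved for deduplication and compression
    metadata: roughly 5% of the cluster's total raw capacity."""
    return raw_capacity_tb * overhead_fraction

# e.g. ~130TB of raw capacity would give roughly the 6.5TB overhead reported above
print(dedup_compression_overhead_tb(130))  # 6.5
```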

  14. Hi Cormac,
    vSAN is leveraging the new vsanSparse snapshot technology. Does this new snapshot technology also reduce the stun time during removal of a large snapshot compared to traditional “redo log” snapshots? I didn’t find any comments about this in the vSAN snapshot performance white paper.

    1. I think the main difference is the in-memory cache and the granularity that vsanSparse uses – otherwise the techniques are quite similar. However I am not aware of any study to measure the differences. This might have further useful info: https://storagehub.vmware.com/#!/vmware-vsan/vsansparse-tech-note

  15. Hi Cormac,

    I wanted to run a vSAN maintenance scenario by you to see if there are any potential drawbacks, aside from a node failing while performing the maintenance. This is regarding ‘Ensure availability’ and ‘No data migration’ maintenance modes.

    Scenario:

    A single node in a 4-node vSAN cluster is placed into maintenance mode using the ‘No data migration’ method. Once in maintenance mode, software/firmware updates are applied to the node and it’s unavailable for roughly 30-40 minutes. After the maintenance is completed the node is placed back into production and the administrator immediately moves on to the next node in the cluster to be patched. The admin again uses the same ‘No data migration’ maintenance mode on this node, applies updates for 30-40 minutes, and so on. These steps are repeated for the remaining nodes.

    Cluster details
    vSAN version: 6.2
    Hosts in Cluster: 4
    Storage Policy on all VMs: FTT=1
    Fault Domains: Single FD per host
    Disk Configuration: Hybrid

    Question:

    If the admin is performing maintenance this way without waiting for components to re-sync after each 30-40 minute window and is not using ‘Ensure availability’, would there be potential data issues or a chance of VMs becoming unavailable as a result? This is again without a node failing in the cluster during these maintenance windows. I understand this is not the preferred way of doing maintenance, but I was just curious what could happen and if there were any fail-safes when this occurs.

    1. You definitely need to be careful with this approach. First, you might like to increase the CMMDS repair delay timeout above the default 60-minute value (see KB 2075456). This gives you a bit more leeway in case it takes a bit longer to apply the firmware and reboot the host, and it means that a rebuild won’t start if the maintenance runs over 1 hour.

      Now there may well be some changes that need to be synced once the host has rebooted. You need to wait for this to complete before starting maintenance on the next host. I like to use RVC commands for this, such as vsan.resync_dashboard (you can also use the UI). Only commence work on the next host when you are sure that all objects are fully synced and active.

      HTH

      Cormac

  16. Hi Cormac,

    Can you please reply to my query: “What will happen when the whole environment goes down and powers back on again? Do we run some sort of integrity check?”

    Regards,
    Sai

  17. Hi Cormac,

    I have a question about vSAN; could you please explain in more detail:

    I have a vSAN cluster with 3 ESXi hosts (1 x 50GB SSD and 1 x 300GB HDD per host). The VM storage policy is: Number of Failures to Tolerate = 1, Number of Disk Stripes per Object = 1.
    If I have a VM with a virtual disk size of 400GB, what happens and how does vSAN store/distribute the VM? It cannot store 2 replicas on 2 hosts because there is only a 300GB HDD per host – is that correct?

    Thank you so much.

    1. Correct – you will not be able to provision this VM with that policy, unless you override the policy with a ForceProvision entry. With ForceProvision, it means the VM will be provisioned as an FTT=0, so there will be no protection.

      1. Thanks, Cormac, for the information. So, if I set Number of Failures to Tolerate = 1 and Number of Disk Stripes per Object = 2, is that OK for this case?

        1. No – disk stripes require unique capacity devices, and you do not have enough devices to accommodate this, as you only have one capacity device per host. At most, you can get FTT=1 with SW=1. These requirements are called out in the product documentation, and in the vSAN deep dive book.
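
A rough sketch of the constraints described in this exchange (a heuristic illustration only, not vSAN's actual placement logic, which also accounts for witness components, the 255GB component size limit and slack space):

```python
import math

def policy_fits(hosts, capacity_per_host_gb, devices_per_host,
                vmdk_gb, ftt=1, stripe_width=1):
    """Heuristic check of whether an FTT / stripe-width policy can be
    satisfied on a small cluster with one disk group per host."""
    replicas = ftt + 1
    # Each replica needs stripe_width distinct capacity devices, and the
    # replicas must sit on disjoint hosts; allow one more host for a witness.
    hosts_per_replica = math.ceil(stripe_width / devices_per_host)
    hosts_needed = replicas * hosts_per_replica + 1
    # Each replica must also fit on the capacity devices it occupies.
    replica_fits = vmdk_gb <= hosts_per_replica * capacity_per_host_gb
    return hosts_needed <= hosts and replica_fits

# The cluster in the question: 3 hosts, one 300GB HDD each, 400GB VMDK.
print(policy_fits(3, 300, 1, 400, ftt=1, stripe_width=1))  # False: 400GB replica won't fit on a 300GB HDD
print(policy_fits(3, 300, 1, 400, ftt=1, stripe_width=2))  # False: not enough hosts/capacity devices
```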

  18. Hello Cormac

    I’m reading the Stretched Cluster guide and do not follow the component breakdown in the bandwidth calculation section.

    ____
    200 virtual machines with 500GB vmdks (12 components each) using pre-vSAN 6.6 policies would require 4.8Mbps of bandwidth to the Witness host
    3 for swap, 3 for VM home space, 6 for vmdks – 12
    12 components X 200 VMs – 2,400 components
    2Mbps for every 1000 is 2.4 X 2Mbps – 4.8Mbps
    ____

    In this example PFTT=1 and SFTT=0
    Component Calculations for VMDK –
    SiteA – 500GB – Component0-255GB and Component1-245GB – 2 components
    SiteB – 500GB – Component0-255GB and Component1-245GB – 2 components
    Either SiteA or SiteB will also have 2 additional Witnesses, one for component0 and the other for component1 – 2 components
    Witness site – 1 for component0 and other for component1 – 2components
    The above gives us a total of 8 components for a VMDK of 500GB – why do I get 2 additional components in the count?

    ____
    The same 200 virtual machines with 500GB vmdks using vSAN 6.6 Policy
    Rules for Cross Site protection with local Mirroring would require
    3 for swap, 7 for VM home space, 14 for vmdks – 24
    24 components X 200 VMs – 4,800 components
    2Mbps for every 1000 is 4.8 X 2Mbps – 9.6Mbps
    ____

    In this example PFTT=1 and SFTT=1
    My calculation gives me a total of 7 swap components (is the article not taking SFTT into account?)
    Component0 at SiteA – 1C
    Component0 at SiteA – 1C – Mirror/SFTT-1
    A witness component at SiteA – 1C
    Component0 at SiteB – 1C
    Component0 at SiteB – 1C – Mirror/SFTT-1
    A witness component at SiteB – 1C
    A witness at Witness site – 1C

    Similarly I get 7 for VMHome (which is in accordance with the guide). Why is SWAP 3 in the guide?

    I get the number of components for a 500GB vDisk to be 14 –
    SiteA – 500GB – Component0-255GB and Component1-245GB – 2 components
    SiteA – 500GB – Component0-255GB and Component1-245GB – 2 components – Mirror/SFTT-1
    SiteA – 1 Witness for component0 and 1 Witness for Component1. This is because of SFTT. – 2 components
    SiteB – 500GB – Component0-255GB and Component1-245GB – 2 components
    SiteB – 500GB – Component0-255GB and Component1-245GB – 2 components – Mirror/SFTT-1
    SiteB – 1 Witness for component0 and 1 Witness for Component1. This is because of SFTT. – 2 components
    Witness site – 1 for component0 and other for component1 – 2 components
    Is my understanding correct?

    1. OK – that is a lot of information to take on board. Let me see if we can simplify it a little bit.

      Let’s take your first example: “3 for swap, 3 for VM home space, 6 for vmdks – 12”. In this, we are stating 3 for swap and home since this is Stretched Cluster, so RAID-1 mirroring with a witness for swap and home, giving us 3 components, 1 x SiteA, 1xSiteB, 1 x WitnessSite. Thus this is PFTT=1, SFTT=0.

      I don’t understand your next example which states “Either SiteA or SiteB will also have 2 additional Witnesses, one for component0 and the other for component1 – 2 components”. Where are you deriving these additional components from? If this is PFTT=1, SFTT=0, then there are no additional witnesses at either site. To the best of my knowledge, the only time we would have witnesses at SiteA or SiteB is when SFTT>0 and we are protecting VMs in the same site as well as across sites.

      Are you using https://storagehub.vmware.com/t/vmware-vsan/vsan-stretched-cluster-guide/bandwidth-calculation-5/ as your reference?
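
A quick sketch of the witness bandwidth arithmetic quoted from the guide in the question above (the per-VM component counts and the 2Mbps-per-1000-components figure come from the guide itself):

```python
def witness_bandwidth_mbps(vm_count, components_per_vm, mbps_per_1000_components=2.0):
    """Estimated bandwidth required to the witness host, based on the
    total number of components in the stretched cluster."""
    total_components = vm_count * components_per_vm
    return total_components / 1000.0 * mbps_per_1000_components

# 200 VMs at 12 components each (pre-vSAN 6.6 policy: 3 swap + 3 home + 6 vmdk)
print(witness_bandwidth_mbps(200, 12))  # 4.8 Mbps
# 200 VMs at 24 components each (PFTT=1 plus SFTT=1 local mirroring)
print(witness_bandwidth_mbps(200, 24))  # 9.6 Mbps
```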

  19. Thanks for your reply. My bad, I missed the email notification.

    I have taken these examples from vSAN Stretched Cluster Guide, Pg. no. 28.

    Example-1
    When PFTT=1 and SFTT=0, VMDK1=500GB (Component0=255GB and Component1=245GB)
    The witness components will only exist at the Witness site, as a local site policy isn’t in place, which is why the first example gives us 6 components for VMDK1.
    This I understand now.

    Example-2
    When PFTT=1 and SFTT=1 (RAID-1), VMDK1=500GB (Component0=255GB and Component1=245GB)
    The guide gives the numbers below –
    3 for swap, 7 for VM home space, 14 for vmdks = 24

    The no. of components for VMDK1 is 14 [5 at SiteA(4 data and 1 Witness component, SFTT=1), 5 at SiteB(4 data and 1 Witness component, SFTT=1) and 4 at Witness site].
    I understand the component breakdown for VMDK1. But the example has just 3 components for VM swap, whereas 7 for VM home.
    Shouldn’t VMHome and VMSwap follow the same PFTT and SFTT, therefore have the same no. of components?

    Example-3
    When PFTT=1 and SFTT=1 (erasure coding), VMDK1=500GB (Component0=255GB and Component1=245GB)
    The guide gives the numbers below –
    3 for swap, 9 for VM home space, 18 for vmdks = 30
    For the VMDK I also get 18, and below is my calculation –
    4 for Component0 in SiteA – 3 data and 1 parity
    4 for Component1 in SiteA – 3 data and 1 parity
    4 for Component0 in SiteB – 3 data and 1 parity
    4 for Component1 in SiteB – 3 data and 1 parity
    2 witness components, for Component0 and Component1, at the Witness site. That gives us a total of 18 components.

    For VMHome I also get 9 and below is my calculation –
    4 for Component0 in SiteA – 3 data and 1 parity
    4 for Component0 in SiteB – 3 data and 1 parity
    1 witness component at the Witness site. That gives us a total of 9 components.

    In this example too, I do not understand why VM swap is 3.

    1. I see – so the issue is that the number of components given for VM swap is incorrect. Swap should also use the policy assigned to the VM, so this looks like an error in the calculation. It may be that the guide is using older calculations: in the past, swap was only ever FTT=1 (3 components) and did not inherit the VM policy. However, that has changed in more recent vSAN versions, and swap now does indeed use the same policy as the rest of the objects that make up the VM. I’ll inform the document maintainers. Thanks for bringing this to our attention.

  20. Hi Cormac, what is your recommendation for the following situation:

    We have a 3-node vSAN cluster with every node in a separate rack, configured as 3 fault domains.
    Now we would like to bring a 4th node into the cluster and must place it in one of the 3 existing racks. So if the wrong rack fails, we lose 50% of our nodes and have a split-brain situation. Regards, Thomas

    1. I don’t see any way of avoiding such a situation if you are only introducing a single node, Thomas. Ideally, you would introduce a new node to each FD to maintain availability, but I guess you know this already.

      1. Thanks, Cormac, for your quick answer. I suspected there was no solution for this situation. Is there any chance of working around the problem with a witness appliance? Regards, Thomas

        1. I guess you could implement a 2+2+1 vSAN stretched cluster where the 2 data hosts for Fault Domain A are in one rack, the 2 data hosts for Fault Domain B are in another rack, and the witness appliance is deployed on another ESXi host in the third rack. There is a bit of work involved in doing something like this, and I’m not sure I would be comfortable converting a standard vSAN already in production to a stretched vSAN. I think further research would be needed here to see if it is even feasible.
