Recently I was asked this question: should we enable Storage I/O Control (SIOC) on datastores used by our production databases, considering it could prevent my VMs from consuming all the resources they need? The answer is yes. SIOC will not harm your performance; in fact, it can save you from a very bad day in IT land, and it's all about the threshold.
Before I dive deeper into that, a bit of background:
Storage I/O Control is a technology that provides I/O prioritization for VMDKs that reside on a shared datastore. The VMDKs can belong to VMs on different hosts, but those hosts have to be managed by the same vCenter. This is in contrast to adaptive queuing, which is an ESXi (per-host) technology. Anyway, back to SIOC: when a latency threshold is crossed for a shared datastore, Storage I/O Control kicks in and starts prioritizing access to that datastore based on the proportional shares mechanism. The outcome is that VMs with higher shares get more throughput (IOPS) at lower latency than VMs with lower shares. By default all VMs have the same amount of shares and fair access to the datastore. In that case SIOC protects against the "noisy neighbor" problem by making sure that no single VM monopolizes access to that datastore.
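To make the proportional shares idea concrete, here is a minimal sketch in Python. The function name, VM names and numbers are all made up for illustration, and this is only the arithmetic of a shares-based split, not how SIOC is implemented internally (my understanding is that it throttles by adjusting device queue depths on the hosts).

```python
def split_by_shares(datastore_iops, shares):
    """Divide a datastore's IOPS among VMs in proportion to their shares.

    Hypothetical helper for illustration only; names and values are assumptions.
    """
    total_shares = sum(shares.values())
    return {vm: datastore_iops * s / total_shares for vm, s in shares.items()}


# Assumed example: one VM has twice the shares of the other two.
example_shares = {"sql-prod": 2000, "sql-test": 1000, "file-srv": 1000}
for vm, iops in split_by_shares(8000, example_shares).items():
    print(f"{vm}: {iops:.0f} IOPS")
```

With these assumed numbers, the VM holding half of the total shares ends up with half of the IOPS during contention, and the other two split the rest evenly.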
The latency threshold is set in milliseconds and can be set in two ways:
- Statically, using a fixed threshold. The default is 30 ms.
- Dynamically, using the I/O injector. The I/O injector samples the datastore every 24 hours, when the datastore is idle, and sets the threshold to 90% of the datastore's measured capability (90% is the default and can be changed). For example, if the I/O injector finds that the datastore's minimum latency is 20 ms, it will set the threshold to 18 ms. This is helpful when there are multiple datastores with different capabilities (e.g. one datastore on SSD and others on SAS); in these cases the threshold is based on the capabilities of each datastore.
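The arithmetic behind the two options can be sketched as follows. This is only a simplified illustration that mirrors the 20 ms to 18 ms example above; the function names and the SSD/SAS latency figures are assumptions of mine, and the real I/O injector derives its value from its own measurements rather than from a single multiplication.

```python
DEFAULT_STATIC_THRESHOLD_MS = 30  # static default mentioned above
DEFAULT_PERCENTAGE = 90           # dynamic default mentioned above


def dynamic_threshold_ms(measured_latency_ms, percentage=DEFAULT_PERCENTAGE):
    """Simplified model: threshold = percentage of the measured latency."""
    return measured_latency_ms * percentage / 100


# Hypothetical datastores with different measured latencies (made-up values).
measured = {"ssd-datastore": 5, "sas-datastore": 20}
for name, latency in measured.items():
    print(f"{name}: dynamic threshold {dynamic_threshold_ms(latency)} ms, "
          f"static default {DEFAULT_STATIC_THRESHOLD_MS} ms")
```

The point of the dynamic option shows up in the output: each datastore gets a threshold that fits its own capabilities, while the static option applies the same number everywhere unless you change it per datastore.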
Starting with vSphere 6, VMware introduced I/O reservations with SIOC. When reservations are used, the same I/O injector we use for checking latency also samples the IOPS capabilities of the datastore. When the total IOPS reservation configured on the VMs that reside on that datastore exceeds the observed IOPS capability of the datastore, IOPS are distributed to the VMs in proportion to each VM's percentage of the total reservations.
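Here is a small sketch of that proportional distribution, assuming the simple rule described above. The function name, VM names and IOPS figures are hypothetical; see the VMware posts below for how limits and shares interact with reservations in practice.

```python
def distribute_reserved_iops(observed_iops, reservations):
    """Scale reservations down proportionally when they oversubscribe the datastore.

    Hypothetical helper: if the reservations fit within the observed capability,
    every VM keeps its full reservation.
    """
    total_reserved = sum(reservations.values())
    if total_reserved <= observed_iops:
        return {vm: float(r) for vm, r in reservations.items()}
    return {vm: observed_iops * r / total_reserved for vm, r in reservations.items()}


# Assumed example: 12,000 IOPS reserved on a datastore observed at 10,000 IOPS.
reservations = {"sql-a": 6000, "sql-b": 4000, "sql-c": 2000}
for vm, iops in distribute_reserved_iops(10000, reservations).items():
    print(f"{vm}: {iops:.0f} IOPS")
```

In this made-up case sql-a holds half of the total reservations, so it receives half of the observed IOPS, and the other two VMs are scaled down by the same ratio.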
For more information regarding SIOC reservations see these VMware blog posts:
https://blogs.vmware.com/vsphere/2015/07/sioc-io-distribution-with-reservations-limits-part1.html &
https://blogs.vmware.com/vsphere/2015/10/sioc-io-distribution-with-reservations-limits-part2.html
And my post here:
For critical systems we usually recommend not employing limits or throttling a VM's resources. However, even though SIOC falls into the throttling category, it also provides a great fail-safe against unavoidable and unpredictable contention, and it only kicks in when the specified threshold is passed. This can be very handy when multiple VMDKs share the same datastore for manageability reasons. Here is an example:
Consider a situation where we have a vSphere cluster hosting many critical SQL databases. Obviously, to achieve the highest workload isolation, we could place each VMDK on its own datastore and LUN.
For manageability reasons, and because of the ESXi maximum LUN limit, we would prefer to stack multiple VMDK files from multiple databases on the same datastore. This shared datastore needs to be backed by a physical device that can supply the aggregate IOPS requirements of all the VMDKs that reside on it; that is how we minimize the risk of contention.
But suppose something unpredictable happens and one of the VMs using that shared datastore starts to use the storage in a way we couldn't predict. Maybe someone runs a huge query on it or, worse, a bad query; maybe there is a bug in the application that is using it, or even a virus. Any of these could affect the performance of all the other databases using that datastore.
SQL DBAs know what a slow tempdb can do to the overall performance of a database. This is where SIOC can save the day. Even if we enable it and leave the shares on each VMDK at the default, as soon as the latency threshold is met SIOC will cap the VMDKs that are making all the noise and make sure the rest of the VMs get the resources they need.
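One way to picture this behavior is a weighted fair split in which no VM receives more than it asks for and the leftover goes to whoever still wants more. The sketch below is my own simplified model with made-up names and numbers, not VMware's algorithm, but it shows why equal shares are enough to contain a rogue workload.

```python
def fair_allocation(capacity, demand, shares):
    """Weighted max-min fair split of IOPS (illustrative model only).

    VMs that ask for less than their share-based offer are fully served;
    the capacity they leave unused is re-offered to the remaining VMs.
    """
    alloc = {vm: 0.0 for vm in demand}
    active = set(demand)
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_shares = sum(shares[vm] for vm in active)
        offer = {vm: remaining * shares[vm] / total_shares for vm in active}
        satisfied = {vm for vm in active if demand[vm] <= offer[vm]}
        if not satisfied:
            # Everyone left wants more than its offer: cap them at the offer.
            for vm in active:
                alloc[vm] = offer[vm]
            break
        for vm in satisfied:
            alloc[vm] = float(demand[vm])
            remaining -= demand[vm]
        active -= satisfied
    return alloc


# Assumed scenario: a 10,000 IOPS datastore, default (equal) shares,
# and one VM running a runaway query.
demand = {"sql-a": 2000, "sql-b": 2500, "rogue": 20000}
shares = {vm: 1000 for vm in demand}
for vm, iops in sorted(fair_allocation(10000, demand, shares).items()):
    print(f"{vm}: {iops:.0f} IOPS")
```

In this made-up scenario sql-a and sql-b get everything they ask for, and the rogue VM is capped at whatever capacity is left instead of starving its neighbors.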
As you can see, even though we provide enough I/O resources to the shared datastore so we have no contention, SIOC protects us from unexpected situations.
To conclude, I definitely recommend using SIOC. Follow the guidelines; the key is to set the threshold statically to the highest latency the workloads on that datastore can withstand. SIOC will then guarantee fairness, preventing downtime at the expense of slowing down those rogue workloads.
Any comments are welcome
Niran