Estimating Availability of SAP on ESXi Clusters

This is a follow-up to the blog I posted in Jan 2013, which identified a generic formula to estimate the availability, expressed as a percentage/fraction, of SAP virtual machines in an ESXi cluster. The details of the formula are in this whitepaper. This blog provides some example results based on assumed input data. I used a spreadsheet to model the equation and generate the results; the spreadsheet is shown at the end. The formula is based on mathematical probability techniques. The availability of SAP on an ESXi cluster depends on two things: the probability that multiple ESXi hosts fail at the same time, given the number of spares; and the probability that the SPOFs (database and central services) are in the middle of a failover due to a VMware HA event, which depends on the failover times and the frequency of ESXi host failures.

The example starts with a single 4-node ESXi cluster running multiple SAP database, application server and central services virtual machines (VMs) corresponding to different SAP applications (ERP, BW, CRM etc.). A sizing engagement has determined that 4 ESXi hosts are required to drive the performance of all the SAP VMs (the SAP landscape). We assume the sizing is such that the memory of all the VMs will not fit into the physical memory of three or fewer hosts. As we typically have memory reservations set (a best practice for mission critical SAP), some VMs may then fail to restart after a VMware HA event. So we conservatively treat any host failure that leaves fewer than 4 ESXi hosts as downtime for the SAP landscape. This is not strictly true at the individual VM/SAP system level, as some VMs can be de-prioritized in the degraded state in favor of others, but we take the landscape-level approach to provide a worst case estimate. For this reason we design with redundancy by adding extra ESXi hosts to the cluster, so I will compare three options with different degrees of redundancy:

Option 1: 4 node ESXi cluster with no spares i.e. “4+0” (loss of 1 or more hosts is considered downtime). With this assumption a VMware HA event is moot, so failover times are not considered. As there are no spares, the availability for this scenario = a x a x a x a, where a = availability of a single ESXi host. For the remaining options I use the formula in the spreadsheet below.

Option 2: 5 node ESXi cluster with 1 spare ESXi host i.e. “4+1” (loss of 2 or more hosts is considered downtime)

Option 3: 6 node ESXi cluster with 2 spare ESXi hosts i.e. “4+2” (loss of 3 or more hosts is considered downtime)

The following input data is required for the formula:

Mean time to failover (via VMware HA) Central Services VM in case of ESXi host failure is 1-2 minutes (source: lab tests). I will use 2 minutes.
Mean time to failover the database VM in case of ESXi host failure is 5 minutes (source: a POC from a customer who presented their case study at an SAP tradeshow). This includes the time for the database to start and perform a recovery; the latter depends on the workload at the time of failure.
mtbf – mean time between failures of a single ESXi host (this reflects how often a host fails due to h/w or VMware hypervisor failure)
mttr – mean time to repair a failed ESXi host, or to replace it with another ESXi host, in order to get the ESXi cluster back up to full strength.
Another term you will come across is mean time to failure (mttf). Note that mtbf = mttf + mttr.
From mtbf and mttr we can calculate the availability of a single ESXi host. The definition is availability = (mtbf - mttr)/mtbf, which is the same as mttf/mtbf (see the whitepaper for details). Note that this type of analysis is not new; similar content can be found here and here.
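
As a quick check of the definition above, here is a minimal Python sketch that computes the availability of a single ESXi host and then the “4+0” availability a x a x a x a. The function names are my own, chosen for illustration, and the inputs (mtbf = 180 days, mttr = 4 hrs) are assumed example values rather than measured data.

```python
def host_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability of one ESXi host: (mtbf - mttr) / mtbf."""
    if mtbf_hours <= 0 or not 0 <= mttr_hours <= mtbf_hours:
        raise ValueError("need mtbf > 0 and 0 <= mttr <= mtbf")
    return (mtbf_hours - mttr_hours) / mtbf_hours


def no_spares_availability(a: float, hosts: int = 4) -> float:
    """Cluster with no spares: every host must be up, so a is multiplied 'hosts' times."""
    return a ** hosts


# Assumed example inputs: mtbf = 180 days (converted to hours), mttr = 4 hours.
a = host_availability(mtbf_hours=180 * 24, mttr_hours=4)
a_4plus0 = no_spares_availability(a, hosts=4)
print(f"single host: {a:.6%}, 4+0 cluster: {a_4plus0:.4%}")
```

With these inputs the “4+0” cluster comes out at roughly 99.63%, which illustrates why running without spares is the weakest of the three options.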

The following diagram shows the relationship between mtbf, mttr and mttf.

Unfortunately there are no industry standard values for mttr and mtbf for an x86 server running a hypervisor. mtbf depends on the hardware and on the frequency of firmware and hypervisor related issues; the latter in turn is impacted by patch management procedures. As a result, mtbf may vary between environments. So how can we estimate these metrics? As SAP is typically virtualized after other non-SAP applications, you can gather operational statistics from existing production or non-production ESXi clusters to get estimates for mtbf and mttr. For mtbf, we need to determine how often VMware HA events have occurred in existing operational ESXi clusters. The few informal enquiries I have made show a failure frequency of around 1-2 times a year for an ESXi host, and in some cases over a year without incident. For mttr, are there any SLAs in place (for example server vendor service contracts), or can IT operations estimate how quickly they can repair or replace a faulty ESXi host in a production SAP cluster? As SAP business processes are mission critical, such an SLA or understanding may be in place or required. Because these values vary, I will show results for a range of mtbf and mttr. The results are shown in the following table for mtbf = 90, 180 and 360 days and mttr = 2, 4, 8 and 24 hrs (I have approximated 1 year as 360 days).

You can read the above table as per the following example:

Experience in the datacenter has indicated about 2 failures per ESXi host a year, so we will assume about 180 days for mtbf. So we are interested in the results in the “180 days” section of the above table.
Datacenter Operations have procedures in place to restore a failed ESXi host within 4 hrs, so mttr = 4 hrs.
Hence the availability estimate is 99.9964% for a “4+1” cluster and 99.9973% for a “4+2” cluster.
If any of the input data differs from the above, or for clusters of other sizes, recalculate using the spreadsheet/formula (see below).

Some conclusions:

Adding more redundancy (two spare ESXi hosts versus one) increases availability and makes availability less sensitive to mttr. This makes sense: with more redundancy there is less time pressure to get a failed ESXi host back online. However, this redundancy comes at an extra cost, which can be mitigated by using the redundant ESXi hosts to run less important virtual machines that have a reduced SLA and can be taken offline if a single ESXi host fails. VMware resource shares can be configured to make sure these less important VMs do not interfere with the production SAP landscape.
Reliable servers with redundant components and good patch management policies can help to increase mtbf which increases availability.
Having procedures in place to lower the mttr increases availability, for example replacing a failed ESXi host with another from a non-production cluster or some standby pool may be faster than repairing the failed ESXi host.

Note the following about this analysis:

Only considers unplanned downtime due to ESXi host failures. Storage, application software/OS and network failures are NOT considered here. These other parts of the overall architecture have their own availability so the final availability (as experienced by the end-user) is the product of the availabilities of each sub-component as they effectively operate in series.
The formula assumes a single-instance database (not active-active like RAC) and no VMware FT for Central Services. Those configurations would effectively reduce the failover times, driving availability higher.
The availability estimate is for the whole SAP landscape. If you consider the individual VM/SAP system the availability can be higher; for example, if we are down to three ESXi hosts, priority could be given to ERP over BW so that ERP continues to perform as normal. In this case the ERP VMs would have higher availability than the value calculated for the landscape.
Architecture here assumes database and app tier deployed in the same cluster. Often these layers are isolated into two separate ESXi clusters (e.g. for database licensing and/or management purposes). In this case two separate availabilities need to be calculated for the two clusters and the overall availability is the product of the two.
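
The “product of the availabilities” rule in the notes above can be sketched in a few lines of Python. This is only an illustration: the function name is my own, the ESXi cluster figure is the “4+1” estimate quoted earlier, and the storage and network values are hypothetical placeholders, not measured data.

```python
import math


def series_availability(availabilities) -> float:
    """Overall availability when every component must be up (components in series)."""
    values = list(availabilities)
    for a in values:
        if not 0.0 <= a <= 1.0:
            raise ValueError("availabilities must be fractions between 0 and 1")
    return math.prod(values)


layers = {
    "esxi_cluster": 0.999964,  # "4+1" estimate from this blog, as a fraction
    "storage": 0.9999,         # hypothetical placeholder
    "network": 0.9999,         # hypothetical placeholder
}
overall = series_availability(layers.values())
print(f"end-to-end availability: {overall:.4%}")
```

The same function covers the split database/app tier case: pass the two cluster availabilities as the list. Because every factor is below 1, the overall figure is always lower than the weakest component.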

Why Bother with All This?

While this mathematical approach (using probability) is an established method to estimate availability, it does require assumptions, and some input values may have to be estimated when data is limited. This is where empirical data from actual implementations will help with accuracy. So the goal of this availability analysis is:

A starting point to generate an estimate during a sizing engagement or during the design phase of a deployment. If actual data is not available, assume some “worst case” values for mttr and mtbf (e.g. 8 hrs and 90 days) to generate a baseline estimate.
Enable quantitative analysis of different scenarios like:
- How is availability impacted by extra spares in the ESXi cluster? If the business cost of downtime is known (currency per unit time), then we could determine if the cost of redundant ESXi hosts is justified.
- If we need 10 ESXi hosts would it be better for availability if we had one 10-node cluster or two 5-node clusters?
- How do failover times impact the final availability?
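
To make the first question above concrete, here is a rough Python sketch that converts the two availability estimates quoted earlier (99.9964% for “4+1” and 99.9973% for “4+2”) into expected downtime per year and a downtime cost. The cost per hour is a purely hypothetical figure, and the function names are my own.

```python
HOURS_PER_YEAR = 360 * 24  # this blog approximates 1 year as 360 days


def downtime_hours_per_year(availability_pct: float) -> float:
    """Expected unplanned downtime per year for a given availability percentage."""
    return (1.0 - availability_pct / 100.0) * HOURS_PER_YEAR


def yearly_downtime_cost(availability_pct: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime at a given business cost per hour."""
    return downtime_hours_per_year(availability_pct) * cost_per_hour


COST_PER_HOUR = 100_000.0  # hypothetical business cost of downtime (currency/hour)

one_spare = yearly_downtime_cost(99.9964, COST_PER_HOUR)   # "4+1"
two_spares = yearly_downtime_cost(99.9973, COST_PER_HOUR)  # "4+2"
saving = one_spare - two_spares
print(f"4+1: {one_spare:,.0f}  4+2: {two_spares:,.0f}  saving: {saving:,.0f} per year")
```

If the yearly saving is larger than the yearly cost of owning the extra ESXi host, the second spare is justified on downtime cost alone; otherwise the case for it has to rest on other factors.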

Appendix – Create Formula in Excel

The generic availability formula is in the whitepaper SAP on VMware High Availability Analysis – A Mathematical Approach. This formula can be created in Excel as shown below (in this case I am ignoring chances of any failover faults).

The heaviest part of the formula is in cell G46 (see above). This part calculates the probability that all the spares and one extra ESXi host (that is, s+1 hosts, where s = number of spares) fail simultaneously, resulting in downtime for the cluster. It is based on counting the unique combinations of s+1 nodes in the n-node cluster, as described in the whitepaper. Note that this comes from standard mathematical combination theory; for example, see this wiki page.
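
For readers who prefer code to spreadsheets, the combination count can be sketched in Python with math.comb. The second function below is NOT the whitepaper formula: it is a simplified binomial model of my own that assumes hosts fail independently and ignores failover times, so it only illustrates how the number of spares enters the calculation. The function names and the example host availability are assumptions.

```python
import math


def failing_subsets(n: int, s: int) -> int:
    """Number of unique ways to choose s + 1 failed hosts out of n."""
    return math.comb(n, s + 1)


def capacity_availability(n: int, s: int, a: float) -> float:
    """Probability that at most s of n hosts are down at the same time.

    Simplified model (assumption): independent host failures, failover
    times ignored. Not the whitepaper's full formula.
    """
    u = 1.0 - a  # unavailability of a single host
    return sum(math.comb(n, k) * u**k * a ** (n - k) for k in range(s + 1))


# Assumed host availability for mtbf = 180 days, mttr = 4 hours.
a = (180 * 24 - 4) / (180 * 24)
for n, s in [(4, 0), (5, 1), (6, 2)]:
    print(f"4+{s}: {failing_subsets(n, s)} failing subsets, "
          f"capacity availability {capacity_availability(n, s, a):.6%}")
```

With no spares the model reduces to a x a x a x a, matching Option 1. The values it prints for “4+1” and “4+2” will not reproduce the table exactly, because the spreadsheet formula also accounts for failover of the database and central services VMs.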