One of the most common concerns for Administrators looking to virtualize their Microsoft Windows workloads is the well-known issue of Time (or clock) drift in a virtual environment. Given the critical importance of accurate time keeping in modern production environments, this is a legitimate concern – one which has kept some Administrators from embracing and adopting Virtualization in their infrastructure.
Modern applications, authentication and authorization tools, and protocols rely on accurate time-keeping. While some require very strict accuracy, others can tolerate some variance. For our purposes, we will focus on the Kerberos protocol as implemented in Microsoft Active Directory Domain Services.
Clients and Servers in a Windows Active Directory Forest require synchronized, current time so that the Domain Controllers can provide proper and secure Kerberos authentication within the Forest. Kerberos version 5 in Windows has a default maximum tolerance of 5 minutes of variance between an authenticating client and a Domain Controller. Although this value is configurable, standard Windows Administration and Security practices discourage deviating from the default. One of the most common security implications of extending the tolerance is the infamous “Authentication Credentials Replay Attack”. The primary goal of time-keeping hygiene is to ensure that, where possible, all Devices, Clients, Servers, and Domain Controllers in a given Forest are synchronized, or (in a worst-case scenario) are never allowed to deviate by more than 5 minutes.
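To make the 5-minute tolerance concrete, here is a minimal Python sketch of the comparison. It is only an illustration of the arithmetic, not how Windows implements the check; the function name `within_kerberos_tolerance` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Default Kerberos v5 tolerance in Windows, as described above.
MAX_SKEW = timedelta(minutes=5)

def within_kerberos_tolerance(client_utc, dc_utc, max_skew=MAX_SKEW):
    """Return True if the two clocks differ by no more than max_skew."""
    return abs(client_utc - dc_utc) <= max_skew

# Hypothetical Domain Controller time used for illustration.
dc = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

print(within_kerberos_tolerance(dc + timedelta(minutes=4), dc))  # True
print(within_kerberos_tolerance(dc - timedelta(minutes=6), dc))  # False
```

Note that the check is symmetric: a client that is 6 minutes behind fails just as one that is 6 minutes ahead would.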
Active Directory Domain Services provides a native time synchronization hierarchy, which can be illustrated as follows:
As depicted in the diagram above, the Domain Controller holding the PDC Emulator Operations Master Role (PDCe) at the root of the Forest is responsible for ensuring accurate time-keeping in the Forest. It devolves this responsibility to the PDC emulators in all child Domains in the Forest. These child PDCes, in turn, share the responsibility with the other Domain Controllers in their respective Domains. These other Domain Controllers periodically consult their PDCe for the correct time and propagate this information down to domain-joined clients. The Root PDCe is, therefore, the only Domain Controller whose clock is required to be synchronized with an external reliable time source, and it acts as the Time-Keeping Marshal for everyone else in the Enterprise. As long as the Root PDCe is healthy and receiving time updates from its external time source, life in the AD infrastructure is peachy, at least from a timestamp-related authentication perspective.
Well, what if “something happens”? What if the Root PDCe is not receiving accurate time? What if it receives inaccurate time? What if its own clock just skews, for any number of reasons?
In a physical environment, where the Domain Controllers are running on physical Servers, the most common cause of a skewed clock is a bad CMOS battery. This can cause the time stored in the Server’s battery-backed hardware clock to become incorrect. This incorrect time is then read by the Server upon reboot or startup and stored by Windows for the W32Time Service to use. As long as the clock doesn’t skew by more than 48 hours in either direction (forward or backward), the PDCe will be able to receive the correct time from its reliable external source and adjust its clock accordingly. Life is still peachy. If the clock is off by more than 48 hours and this default has not been changed in the two registry values controlling the behavior (MaxPosPhaseCorrection and MaxNegPhaseCorrection), Windows will be unable to adjust the wrong time even when it starts receiving the correct time from its reliable source. The net effect is that severe time synchronization, authentication, replication, password expiration and other issues will occur in the environment. Recovery could be an expensive administrative exercise, depending on when the problem is detected and how quickly corrective actions are implemented.
This problematic scenario is also true in a Virtualized AD infrastructure. While modern Server Hardware doesn’t frequently encounter the “Bad CMOS” clock failure scenario described above, a virtualized PDCe will nevertheless experience the failure if, for any reason, the ESXi Host on which it resides presents the Virtual Machine with a bad clock.
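The 48-hour rule can be sketched as a simple decision. The following Python is a simplified model of the behavior described above, not W32Time's actual logic; `can_correct` and its parameters are hypothetical names, with `max_pos` and `max_neg` standing in for MaxPosPhaseCorrection and MaxNegPhaseCorrection.

```python
from datetime import datetime, timedelta

# 48 hours, the default limit described in the text.
DEFAULT_LIMIT = timedelta(hours=48)

def can_correct(local_time, source_time,
                max_pos=DEFAULT_LIMIT, max_neg=DEFAULT_LIMIT):
    """Simplified model: is the needed correction within the allowed limits?"""
    correction = source_time - local_time
    if correction >= timedelta(0):
        # Local clock is behind: it must be moved forward (positive correction).
        return correction <= max_pos
    # Local clock is ahead: it must be moved backward (negative correction).
    return -correction <= max_neg

good = datetime(2024, 6, 1, 12, 0)  # hypothetical correct time

print(can_correct(good - timedelta(hours=47), good))  # True
print(can_correct(good + timedelta(days=3), good))    # False
# Raising the negative limit (as a registry or policy change would) allows it.
print(can_correct(good + timedelta(days=3), good, max_neg=timedelta(days=7)))  # True
```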
Yes, yes, I know what you are thinking.
But, we are not syncing the VM’s clock with the Host, and we have read and implemented the recommendations provided in VMware’s KB article Disabling Time Synchronization (1189).
Let’s agree that the proper architecture has been deployed, and that you have been proactive, chosen the optimal Time Synchronization options, and made the changes recommended in KB 1189. Let’s not forget, though, that those settings are meant to address Specific VM Operation Scenarios. Disabling time sync with the ESXi Host and implementing the VM Advanced Configuration options in the referenced KB actually do work, for what they are designed for. I will get to those later, but let me complete this thought…
The settings in KB 1189 were not designed for a Host hardware clock failure or error, which affects a physical machine’s clock. As described above, if a Host has a bad clock, then when a Windows VM is rebooted or started up on that Host, the VM’s BIOS will emulate and inherit the wrong time as presented by the physical Host, and the Windows Guest Operating System in the VM will dutifully accept this as the hardware clock. If the VM were the Root PDCe, the same failures described above would occur in the Forest.
To recap, a PHYSICAL HOST’s time failure will be inherited by Windows, whether virtualized or not, and regardless of any defensive mechanism or configuration applied at the Virtual/Hypervisor layer.
So, what do those other settings do then?
I’m glad that you asked. As mentioned in KB 1189, certain VM operations cause the VMware Tools to sync the time on the Guest Operating System inside a Guest VM with its Host’s time. I will not re-list these operations here. I’d rather just simply state that, if a Host has an incorrect clock, a vSphere-based VM will inherit that wrong time every time you perform the Virtual Machine operations listed in the KB unless you have utilized the VM Advanced Configuration settings. With these settings in place, you can perform all those operations all day, every day without any concerns. I should mention here now that powering on or Restarting a VM is not a “Virtual Machine operation”. This means that those settings cannot prevent a VM from behaving like a Physical Server during a Power event in a situation where the Host has a bad CMOS time/clock. Please see the “Virtual CMOS RTC” section of Timekeeping in VMware Virtual Machines for a more detailed description of this behavior.
So, what now? Nothing, really. This write-up is intended to alert the reader and all VM and AD Administrators to the corner-case scenarios in which the recommendations in KB 1189 cannot protect against time skew or corruption, and to highlight the fact that this is not a virtualization deficiency or issue. The problem exists in the physical realm as well. Microsoft has published several guidance documents on how Windows Administrators can best ensure accurate time keeping within an Active Directory infrastructure (e.g. Configuring Systems for High Accuracy). The recommendations in these publications apply to all Windows Operating System instances, regardless of whether they are virtualized or not.
To recap:
- Correct and accurate time-keeping on a Host is very important to the stability of an Active Directory infrastructure, regardless of whether or not it is virtualized. An incorrect Host clock can induce instabilities necessitating complex recovery efforts.
- In a vSphere infrastructure, VMware highly recommends that all Hosts be synchronized with a reliable Time Source and that Administrators implement periodic time checking processes to ensure accuracy.
- A VM’s BIOS clock will be set to match the Host’s whenever it is powered on or restarted, following this pattern:
  - At fresh power-on (i.e. the first time a VM is powered on), its BIOS inherits the Host’s time.
  - Subsequently, the guest writes time information back to the BIOS clock. For mobility, what is actually saved is an offset from the Host time:
    - VM BIOS Time = ESXi host time (UTC) + Offset
  - For example, say you boot a Windows VM for the first time on an ESXi Host that has the correct UTC time. It comes up with an initial time matching the Host time. W32Time then synchronizes to an external source, and you set the time zone to PST. Subsequently, Windows writes back to the BIOS clock, such that:
    - VM BIOS Time = ESXi host time (UTC) + (-8:00), where -8:00 is the offset of PST from UTC.
  - In other words, time is always saved as an offset from UTC, and this is why it is important for ESXi to be synchronized with NTP.
- In a VM running the Windows Guest Operating System, if the inherited time is skewed beyond 48 hours in either direction, W32Time will be unable to correct the skew (unless the appropriate registry value or Group Policy has been adjusted to accommodate such correction).
- The inherited time is not reset by VMware Tools, unless one or more of the Virtual Machine operations mentioned in KB 1189 is performed on the VM.
- VMware Tools will synchronize a VM’s clock with its Host’s time when a Virtual Machine operation is performed, unless the VM Advanced Configuration Settings listed in KB 1189 have been applied to the VM.
- By default, VMware Tools will not reset a VM’s clock backwards, except under the following conditions:
  - The Advanced Configuration Settings in KB 1189 have NOT been applied to the VM, AND
  - The following Advanced Configuration Settings (not mentioned in KB 1189) have been applied to the VM:
    - synchronize.restore.backward = TRUE
    - synchronize.resume.disk.backward = TRUE
    - synchronize.tools.startup.backward = TRUE
- Because of the criticality of the Root PDC Emulator to an Active Directory Forest’s time-keeping requirements, it is important that Domain Administrators work with their vSphere Administrators to apply the Settings recommended in KB 1189 to all Domain Controllers, so that Virtual Machine operations do not interfere with the Operating System’s time-keeping mechanism.
- In the event that a virtualized Domain Controller’s clock becomes skewed as a result of incorrect time on its Host, the quickest remedy is to migrate the VM to (and restart the VM on) another Host which has a known good clock.
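The offset arithmetic from the recap (VM BIOS Time = ESXi host time (UTC) + Offset) can be sketched in a few lines of Python. This is only an illustration of why a bad Host clock carries straight through to the VM; the function name `vm_bios_time` and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta

def vm_bios_time(host_utc, offset):
    """VM BIOS Time = ESXi host time (UTC) + Offset."""
    return host_utc + offset

# Offset the guest saved after the time zone was set to PST.
pst_offset = timedelta(hours=-8)

# Naive datetimes are used here as plain wall-clock readings.
host_good = datetime(2024, 1, 1, 20, 0)      # Host with correct UTC time
host_bad = host_good + timedelta(days=3)     # Host whose clock is 3 days ahead

print(vm_bios_time(host_good, pst_offset))   # 2024-01-01 12:00:00

# The saved offset is unchanged, so the Host's error shows up in full.
skew = vm_bios_time(host_bad, pst_offset) - vm_bios_time(host_good, pst_offset)
print(skew)                                  # 3 days, 0:00:00
print(skew > timedelta(hours=48))            # True: beyond the default W32Time limit
```

Because only the offset is stored, a VM powered on against the bad Host starts three days ahead, which is outside what W32Time will correct by default.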
I hope that you find this information useful. Please feel free to ask me anything related to the content of this post. I promise to check in frequently and respond in a timely manner.
Bonus Update:
Administrators can use PowerCLI or the following VMware vRealize Orchestrator (vRO) Workflow to simplify the process of adding the various VM Advanced Configuration Options recommended in KB 1189 (Disabling Time Synchronization) across an entire vSphere infrastructure. However, many Administrators are uncomfortable with running VM-modification scripts globally across an infrastructure. This still leaves Customers with manual additions as the only option.
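For those wary of scripts that modify VMs, a read-only check may be a lower-risk starting point. The Python sketch below reports which settings are not yet disabled in the text of a VM's configuration. Several things here are assumptions: the option names are listed as I recall them from KB 1189 and should be verified against the KB itself, option names are compared case-insensitively, "0" and "FALSE" are treated as equivalent, and the function names and sample text are hypothetical.

```python
# ASSUMPTION: option names as recalled from KB 1189; verify against the KB.
KB1189_OPTIONS = [
    "tools.syncTime",
    "time.synchronize.continue",
    "time.synchronize.restore",
    "time.synchronize.resume.disk",
    "time.synchronize.shrink",
    "time.synchronize.tools.startup",
    "time.synchronize.tools.enable",
    "time.synchronize.resume.host",
]

# ASSUMPTION: either spelling means "disabled".
DISABLED_VALUES = {"0", "false"}

def parse_config(text):
    """Parse lines of the form key = "value" into a dict (keys lowercased)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings

def options_not_disabled(config_text, required=KB1189_OPTIONS):
    """Return the required options that are absent or not set to a disabled value."""
    current = parse_config(config_text)
    return [name for name in required
            if current.get(name.lower(), "").lower() not in DISABLED_VALUES]

# Hypothetical configuration excerpt for illustration.
sample = '''
displayName = "dc01"
tools.syncTime = "FALSE"
time.synchronize.continue = "0"
'''

for name in options_not_disabled(sample):
    print("not disabled:", name)
```

The sketch only reads text and changes nothing, so it can be run against exported configuration without touching the VMs themselves.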
With the release of VMware vSphere 7 Update 1, Administrators can now more easily apply these settings to a VM as a simple check-box operation.
The default behavior of adjusting a VM’s clock through VMware Tools during various virtual machine operations is now exposed in the VM’s Properties as a checkbox. This box is checked by default to reflect the default behavior. Unchecking this box effectively instructs VMware Tools to assume that the Administrator has consciously applied the settings documented in KB 1189.
VMware recommends unchecking this box for VMs running Guest Operating Systems which have a native, reliable time synchronization and correction mechanism. A Windows VM joined to an Active Directory infrastructure is one such use case where this box should be unchecked.
Thank you.