Support for high-performance VMs has been significantly improved in vSphere 6.0. In particular, the new virtual hardware (vHW) version 11 enables the configuration of VMs with up to 128 virtual CPUs, 4 TB of RAM, and 32 serial ports. The ability to configure much more powerful VMs puts CIOs under greater pressure to provide Line of Business (LoB) executives with business continuity Service Level Agreements (SLAs) that comply with rigorous Recovery Time and Recovery Point Objectives (RTO and RPO).
For LoB executives, IT backup operations are just a means to achieving business continuity. Their focus is entirely on recovery, which needs to be performed with a minimal loss of data and in as short a time as possible.
BUSINESS CONTINUITY FOR AN ACTIVE MISSION-CRITICAL VM
To test Vembu BDR 3.0, we installed both the Vembu BDR server and client components on a server running Windows Server 2012 R2, Hyper-V, and SQL Server 2014. We used a 10GbE iSCSI SAN to provision storage for VembuHIVE®, the Vembu backup repository, and for all of the vSphere 6 host datastores.
We shared all of the vSphere datastores with the BDR server. As a result, the BDR client leveraged Vembu’s UltraBlaze™ feature, which automatically detects the best transport mode to use in a backup of a client system. In particular, all of the full and incremental VM backups run during our backup and replication tests transferred data to the BDR server directly over the iSCSI SAN, minimizing any impact on VM processing.
Within our vSphere 6.0 environment, we installed Windows Server 2012 R2 and SQL Server 2012 on a vHW11 VM, which we dubbed oblSQL-1. We provisioned that test VM with 12 logical CPUs, 42GB of RAM, and two logical disks provisioned on a vSphere datastore backed by our iSCSI SAN, which included an SSD-based SAN accelerator.
To implement a mission-critical application on oblSQL-1, we used a 33GB instance of the TPC-E database benchmark, which simulates an online stock trading application for a brokerage firm. On a separate quad-processor VM, we generated customer and broker trade transactions.
While processing stock trades at 800 transactions per second (TPS), we added and changed enough table and log data in the TPC-E database to generate an incremental backup of 20-to-25GB every 30 minutes using VMware’s Change Block Tracking (CBT) mechanism. What’s more, that incremental data represented about 10-to-15% of the total storage resources consumed by the entire VM database volume. As a result, frequent incremental backups were required simply to control the growth of database log files.
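The mechanism behind those fast incrementals can be modeled simply: CBT keeps a bitmap of blocks written since the last backup, so each incremental copies only the changed blocks. The following sketch illustrates the idea; the block size, class, and method names are our own illustration, not VMware's CBT API.

```python
# Minimal model of changed-block tracking: the hypervisor records which
# blocks were written since the last backup, so an incremental backup
# transfers only those blocks instead of the whole disk.

BLOCK_SIZE = 64 * 1024  # 64 KiB tracking granularity (illustrative)

class ChangeTracker:
    def __init__(self, disk_bytes):
        self.num_blocks = disk_bytes // BLOCK_SIZE
        self.dirty = set()  # indices of blocks written since last backup

    def record_write(self, offset, length):
        """Mark every block touched by a write as dirty."""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        self.dirty.update(range(first, last + 1))

    def incremental_backup(self):
        """Return the dirty block indices and reset the tracking bitmap."""
        blocks = sorted(self.dirty)
        self.dirty.clear()
        return blocks

tracker = ChangeTracker(disk_bytes=200 * 1024**3)    # 200 GiB volume
tracker.record_write(offset=0, length=128 * 1024)    # touches blocks 0 and 1
tracker.record_write(offset=10 * BLOCK_SIZE, length=1)
changed = tracker.incremental_backup()
print(changed)                    # → [0, 1, 10]
print(len(changed) * BLOCK_SIZE)  # bytes an incremental would transfer
```

With only the dirty blocks transferred, the payload of each incremental is proportional to the change rate of the workload, not the size of the volume, which is why our 20-to-25GB incrementals stayed well within the 30-minute window.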
More importantly, our transaction load was generating writes to the TPC-E database disk at rates ranging from 800 to 1,500 IOPS. When a VM processes a high number of write IOPS, the complexity and I/O overhead of Copy on Write (CoW) snapshots compounds rapidly. This is especially problematic when a backup uses internal CoW VSS snapshots to quiesce a specific application within a global CoW VM snapshot. Releasing VM snapshots can become glacially slow and directly impact application processing.
MINIMIZING DATA LOSS WITH APPLICATION-AWARE SNAPSHOTS
To avoid the compounding of I/O overhead that internal CoW snapshots can place on an active VM, Vembu BDR initiates Redirect on Write (RoW) snapshots with VSS to quiesce applications within a VM. In BDR’s RoW-based snapshot scheme, new writes to our TPC-E database were immediately redirected to free disk blocks, without first having to copy the original data to a new disk location.
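The difference between the two snapshot schemes comes down to the write path. The toy model below contrasts them; the function names and dict-based "disk" are purely illustrative, not how either hypervisor implements snapshots.

```python
# Toy comparison of snapshot write paths. Under copy-on-write (CoW),
# the first write to a block must first copy the original block to a
# snapshot area (two I/Os); under redirect-on-write (RoW), the new data
# goes straight to a free block (one I/O) and the original is untouched.

def cow_write(disk, snapshot_area, block, data):
    """Copy-on-write: preserve the original block before overwriting it."""
    ios = 0
    if block not in snapshot_area:
        snapshot_area[block] = disk.get(block)  # I/O 1: copy original out
        ios += 1
    disk[block] = data                          # I/O 2: write new data
    return ios + 1

def row_write(disk, redirect_map, free_blocks, block, data):
    """Redirect-on-write: send new data to a free block in one I/O."""
    new_loc = free_blocks.pop()
    disk[new_loc] = data            # single write; original block untouched
    redirect_map[block] = new_loc
    return 1

disk = {b: f"orig{b}" for b in range(4)}
cow_ios = sum(cow_write(disk, {}, b, "new") for b in range(4))

disk2 = {b: f"orig{b}" for b in range(4)}
redirects = {}
row_ios = sum(row_write(disk2, redirects, list(range(100, 104)), b, "new")
              for b in range(4))
print(cow_ios, row_ios)  # → 8 4
```

For a workload like ours, sustaining 800-to-1,500 write IOPS, halving the per-write I/O cost during a snapshot window is exactly what keeps the quiesce-and-backup cycle from throttling the application.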
With Vembu BDR implementing RoW VSS snapshots and leveraging UltraBlaze for SAN-based data transfers, we consistently backed up our test VM, oblSQL-1, from our BDR server, DellR710A, in about 5 minutes. At the same time, oblSQL-1 was processing 800 TPC-E transactions per second with an average query response time of 23ms. More importantly, during incremental backups, we measured no increase in wait states for SQL Server as incremental backups were processed. As a result, we complied with a 30-minute RPO with no impact on the database application running on the VM. (https://www.bdrsuite.com/blog/vembu-bdr-vspherev6-backup/).
MEETING RECOVERY OBJECTIVES WITH VembuHIVE®
The key to BDR’s ability to provide multiple fast recovery options is VembuHIVE, a document-oriented repository that Vembu virtualizes as a file system. A document-oriented database encapsulates information in documents by encoding data as key-value pairs. In particular, the key-value construct enables a document-oriented database to store any data without following a strict schema. What’s more, the key-value construct creates a database that is highly scalable through the simple addition of storage and compute resources. This explains why a number of commercial web sites, including eBay, are underpinned by a document-oriented database.
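The schema-free, key-value document model can be sketched in a few lines; this is an illustration of the general model only, not of VembuHIVE's internal format, and the class and field names are our own.

```python
# Minimal schema-free document store: each document is a dict of
# key-value pairs, so records with different shapes coexist in one
# collection and queries match on whatever fields a document has.

class DocumentStore:
    def __init__(self):
        self.docs = {}      # doc_id -> document (key-value pairs)
        self.next_id = 0

    def insert(self, document):
        doc_id = self.next_id
        self.docs[doc_id] = dict(document)
        self.next_id += 1
        return doc_id

    def find(self, **criteria):
        """Return documents whose fields match all of the criteria."""
        return [d for d in self.docs.values()
                if all(d.get(k) == v for k, v in criteria.items())]

store = DocumentStore()
# The two documents deliberately have different fields: no fixed schema.
store.insert({"vm": "oblSQL-1", "type": "full", "size_gb": 420})
store.insert({"vm": "oblSQL-1", "type": "incremental", "parent": 0})
print(store.find(vm="oblSQL-1", type="incremental"))
```

Because nothing in the store depends on a fixed schema, scaling is a matter of adding storage and compute and partitioning documents across them.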
With respect to VembuHIVE, the VembuBDR service handles all Vembu BDR backup and recovery functions performed on the server. The VembuBDR service encodes transferred backup data with content metadata and removes metadata related to the source machine’s file system. Next, the VembuBDR service performs data compression and de-duplication to minimize the amount of storage resources consumed in storing backups. Finally, the processed data is rapidly streamed into VembuHIVE using large data blocks that can be greater than 2MB.
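The chunk-deduplicate-compress pipeline described above can be sketched as follows. This is a generic content-addressed pipeline under our own assumptions (SHA-256 hashing, zlib compression, a dict as the repository), not Vembu's on-disk format.

```python
# Sketch of a block-level backup pipeline: split the stream into large
# chunks, de-duplicate by content hash, compress only the unique chunks,
# and keep an ordered "recipe" of hashes so the stream can be rebuilt.

import hashlib
import zlib

CHUNK = 2 * 1024 * 1024  # 2 MiB blocks, echoing the large-block streaming

def store_backup(stream, repository):
    """repository: dict mapping content hash -> compressed chunk."""
    recipe = []                                # ordered hashes to rebuild stream
    for i in range(0, len(stream), CHUNK):
        chunk = stream[i:i + CHUNK]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in repository:           # de-duplicate by content
            repository[digest] = zlib.compress(chunk)
        recipe.append(digest)
    return recipe

def restore_backup(recipe, repository):
    return b"".join(zlib.decompress(repository[h]) for h in recipe)

repo = {}
data = b"A" * CHUNK * 3 + b"B" * CHUNK        # 3 identical chunks + 1 unique
recipe = store_backup(data, repo)
assert restore_backup(recipe, repo) == data   # lossless round trip
print(len(recipe), len(repo))                 # → 4 2  (4 chunks, 2 stored)
```

The highly repetitive chunks of a database volume, full of zeroed pages and repeated log patterns, are exactly where this kind of pipeline achieves the large storage reductions reported below.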
In periodic full backups of our test VM, oblSQL-1, Vembu’s UltraBlaze™ feature automatically configured the BDR server, DellR710A, to access the VM’s vSphere datastore directly as physical disk 6. As a result, end-to-end backup throughput ranged between 1.4 and 1.8 Gbps. More importantly, as the VembuBDR service streamed restructured, de-duplicated, and compressed backup data to the VembuHIVE LUN (disk G on DellR710A), the volume of data stored was reduced by a dramatic 85-to-90 percent. Notably, backing up the VM datastore over a SAN and streaming processed data to VembuHIVE created very different disk load patterns, which rapidly diverged as the process continued.
BACKUP ANYWHERE, RESTORE EVERYWHERE
With VembuHIVE uncoupled from physical devices and virtualized as a file system, Vembu BDR modules can mimic file system utilities to provide services such as data compression and de-duplication. In particular, by applying formatting utilities to VembuHIVE documents, Vembu BDR is able to present persistent VM disk images associated with any VM backup as a mount point in multiple formats, including vhd, vhdx, vmdk, and img. As a result, all Vembu recovery functions can be applied to any VM backup in any VM environment.
We began our evaluation of Vembu BDR by determining RTO baselines for a full recovery of our vSphere test VM as a new VM in both vSphere and Hyper-V environments. Recovery as a vSphere VM was facilitated with a single-click option that configured a new VM on a designated vSphere host and datastore. Data was then transferred from VembuHIVE to the host, which wrote all of the appropriate infrastructure files sequentially at about 60MB per second in just under 2 hours.
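The sequential-restore figures above imply the rough size of the payload; the 60MB/s rate and the 2-hour duration are from our test, while the derived data volume is an estimate, not a measured figure.

```python
# Back-of-the-envelope check on the full-restore RTO above: at a
# sustained sequential write rate of ~60 MB/s, a restore that takes
# just under 2 hours moves roughly 420 GB of VM data.

rate_mb_s = 60
duration_s = 2 * 3600                    # upper bound: "just under 2 hours"
data_gb = rate_mb_s * duration_s / 1024
print(round(data_gb))                    # → 422 (GB, approximate upper bound)
```

The same arithmetic explains why the two-drive Hyper-V restore below, which wrote both logical disks in parallel, finished in well under half that time.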
Recovery as a Hyper-V VM on our BDR server, DellR710A, was a two-step process. First, in less than 45 minutes, we restored the two logical drives associated with oblSQL-1 as vhd disk files in parallel. Next, we used the Hyper-V console to configure a new VM, dubbed oblSQL-1HV, with the same CPU and memory resources as the original VM, and attached the vhd logical disks restored from VembuHIVE. In this recovery scenario, we brought a new VM into production with the same TP application performance as the original, while easily meeting a 1-hour RTO.
System administrators are also able to leverage Vembu BDR’s support of on-demand read/write access to disk images to simplify alternative recovery operations. With a single click, a system administrator can create, expose, and locally mount a new persistent VM disk image from an existing backup recovery point. As a result, application utilities can be run locally to recover data items on the BDR server without running a specialized backup or booting the VM from a backup file.
Using the single-click VHD option, we exposed a recovery point of our vSphere test VM as a set of virtual disks within the Vembu BDR virtual drive (H:) and mounted the VM’s database LUN (SQLDB) as a local, shared, virtual drive dubbed SQLDB (X:). We were then able to use SQL Server 2014 Management Studio on our BDR server to attach the TPC_E database that was located on SQLDB (X:).
To recover TPC-E database objects, we created a new database, which we dubbed VembuRecover, on the local instance of SQL Server 2014. Next, we copied several table structures, including indices and triggers, to VembuRecover. We then detached the TPC-E database and dismounted the disks exposed in the BDR virtual volume. As a result, we were able to work on the TPC-E tables in VembuRecover independently of any backup or recovery activity involving VembuHIVE.
BOOTING A VM FROM VembuHIVE FOR A 10-MINUTE RTO
VembuHIVE disk images can also be used to boot a backed-up vSphere VM as either a Hyper-V or a vSphere VM. With a single click, a system administrator can invoke the Vembu Instant Boot option and avoid a number of complicated overhead tasks that typify booting a VM from a backup file.
On a Windows server with Hyper-V, the Vembu BDR Instant Boot option instantiates a new independent persistent document in VembuHIVE, which can be read, modified, and saved. Once the new logical VM volumes are exposed, Vembu BDR configures a fully functional Hyper-V VM using local defaults. There is no need to remap disk writes on a datastore containing read-only pointers to a backup file, or consolidate pointers and logs into a standard VM configuration.
For our test VM, oblSQL-1, the initial time to boot into the Hyper-V environment from a recovery point in VembuHIVE was just under 8 minutes. The extended initial boot time included the time needed for the VembuBDR service to instantiate a new persistent image document, expose the VM’s disk images on the BDR virtual volume, and configure a new Hyper-V VM. As a result, we were able to boot a functioning Hyper-V VM for an RTO of less than 10 minutes. Subsequent Hyper-V booting took about 2 minutes.
What’s more, system administrators are also able to set up and boot a vSphere VM from VembuHIVE in a relatively simple two-stage process. The key to this process is the ability to export the entire Vembu BDR virtual drive, which is used to expose the logical disks associated with a recovery point, as an NFS volume.
With a single click, a system administrator is able to create an NFS mount point, dubbed /VembuNFS, for the virtual drive. Once the Vembu BDR virtual drive is imported as an NFS datastore on a vSphere host, all of the vmdk-formatted disk images associated with a VM recovery point will be automatically available to that host, whenever a recovery point is mounted on the Vembu BDR virtual drive.
We imported the Vembu BDR virtual drive as a datastore, dubbed NFS_VembuVD, on each vSphere host. As a result, whenever we mounted a VM recovery point on the Vembu BDR virtual drive, the associated logical disks were automatically available on each host within NFS_VembuVD. In particular, we configured a new vSphere VM, dubbed oblSQL-1Bvembu, on an existing iSCSI datastore, ION_SQL, and attached the vmdk files associated with a recovery point by pointing to the location of the vmdk files in the NFS_VembuVD datastore. Once the new VM was configured, we were able to boot it into our production environment.
While booting a VM directly from VembuHIVE can be used to meet an aggressive RTO, this solution is not without overhead issues that affect performance. When the VM is booted into a Hyper-V environment running on the BDR server, the CPU overhead associated with supporting an active VM via VembuHIVE can limit VM scalability. Similarly, the network overhead generated by accessing logical disks using NFS over a 1GbE LAN can also introduce significant scalability limitations.
We ran our TP benchmark on a Hyper-V VM that was created by invoking the Instant Boot option for a backup of oblSQL-1. Initially, TP performance of the stock-trading application running on the Hyper-V VM, oblSQL-1HV, paralleled TP performance on the original VM. With oblSQL-1HV using logical disks exposed on the Vembu BDR virtual disk, we were able to scale up to 60 TPS, which is sufficient for most applications. Nonetheless, while average response time was only 20ms at 60 TPS, SQL Server was consuming 95% of the VM’s available CPU resources. As a result, the lack of CPU resources limited any further scaling of TP processing.
AGGRESSIVE RTO AND RPO WITH FULL PERFORMANCE
To recover our oblSQL-1 VM from a catastrophic failure with full TP processing scalability, without running a lengthy full-restore process, we needed to invoke Vembu BDR’s replication option. Unlike replication in vSphere, Vembu BDR includes replication as a standard feature, and Vembu’s UltraBlaze technology is able to choose the most efficient mode for transferring data from the original VM to the BDR server.
With Vembu, both replication and backup start with an incremental backup of the original VM. In the final stage, however, only a small amount of metadata is written to VembuHIVE. All of the data for the VM disk snapshots is transferred via the local LAN to the host supporting the replica VM, which is entirely responsible for writing the snapshot data. More importantly, for the greatest compatibility, Vembu snapshot data is returned to the server formatted for vHW v7, which makes the VM compatible with vSphere 4.1 and up.
We ran our replication tests of oblSQL-1 under the same conditions that we tested backups of that VM. Every 30 minutes we scheduled an incremental backup that was stored as a VM snapshot on a replica VM, dubbed oblSQL-1B_rep. The end-to-end process of creating an incremental backup, transferring the data to the Vembu BDR server, and saving that data as a VM disk snapshot took about 17 minutes. To recover the VM at any snapshot point, we only needed to update the VM to vHW v11 in the vCenter console for our advanced CPU and RAM configuration to be recognized during a boot process. As a result, we were able to use Vembu BDR replication to support both an aggressive RTO of 10 minutes and an aggressive RPO of 30 minutes.