Optimizing operational costs in CloudEndure Disaster Recovery

Drivers for moving disaster recovery (DR) to the AWS Cloud include reducing infrastructure and management costs and gaining access to greater scalability and elasticity. CloudEndure Disaster Recovery, offered by AWS, helps you shift your DR strategy to AWS from data centers, private clouds, or other public clouds. By shifting DR to the cloud, you can take advantage of opportunities to decrease operating costs, as initially introduced in the blog “Well-Architected approach to CloudEndure Disaster Recovery.”

In this blog, I outline how to further optimize the services CloudEndure Disaster Recovery uses for replication and configuration: Amazon EC2 and Amazon EBS. Amazon EBS and EBS snapshots are the largest component, in terms of cost, when deploying CloudEndure Disaster Recovery. CloudEndure Disaster Recovery can replicate to the AWS Region of your choice. As costs can vary between Regions, choose the lowest cost Region that satisfies the latency and regulatory requirements of your organization. By optimizing the services involved with CloudEndure Disaster Recovery, you can save on your DR costs while maintaining a superb DR setup.

Staging Area resources

CloudEndure Disaster Recovery replicates data to a subnet within an Amazon Virtual Private Cloud (VPC), termed the Staging Area. You configure the Staging Area in the CloudEndure Disaster Recovery user console, under Replication Settings. The replication settings, among other configuration options, define the type of resources CloudEndure Disaster Recovery uses. By default, CloudEndure Disaster Recovery provisions EC2 instances and EBS volumes for receiving replicated data from the source machines.

If you have insight into the requirements and data change rates of your workloads, configure your project settings accordingly from the outset, before deploying agents. Once agents have been installed, the settings can still be adjusted at the machine level, which provides additional granularity.

EBS volumes

Within the staging area, CloudEndure Disaster Recovery creates an EBS volume for each source disk you are protecting. The EBS volumes are the same size as the provisioned source disks. As CloudEndure Disaster Recovery performs block level replication, the volumes must be the same size. There is no option to provision an EBS volume based on used physical capacity of the source volume.

For replication, CloudEndure Disaster Recovery uses either magnetic (<500 GB) or gp2 (>500 GB) EBS volumes. Within the project Replication Settings, the console notes this default as Use fast SSD data disks.

Project Replication Settings – Staging Area Disks (default)

To reduce costs, choose Use slower, low-cost standard disks from the pull-down menu. This option uses st1 EBS volumes for sizes greater than 500 GB; volumes smaller than 500 GB remain magnetic. Keep in mind that project-level settings only affect new agent installations.

Project Replication Settings – Staging Area Disks (slower disks)
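The disk selection rules described above can be summarized in a few lines of Python. This is only a sketch of the behavior as stated in this post, not CloudEndure's implementation. The function name is hypothetical, and because the text only says "<500 GB" and ">500 GB", treating a disk of exactly 500 GB as the larger tier is an assumption.

```python
def staging_volume_type(size_gb: float, low_cost: bool = False) -> str:
    """Return the EBS volume type this post describes for a staging disk.

    Assumption: a disk of exactly 500 GB falls into the larger tier,
    because the post does not say which side of the boundary it is on.
    """
    if size_gb < 500:
        # "standard" is the EBS API name for magnetic volumes;
        # small disks stay magnetic under either console option.
        return "standard"
    # Larger disks follow the selected console option.
    return "st1" if low_cost else "gp2"


if __name__ == "__main__":
    for size in (100, 2000):
        print(size, staging_volume_type(size), staging_volume_type(size, low_cost=True))
```

Running the sketch against an inventory of source disk sizes gives a quick view of which volumes would actually change type when the low-cost option is selected.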

Additional granularity for the EBS volume types can be defined within the Replication Settings at the machine level. After agent installation, you can select Use fast SSD data disks or Use slower, low-cost standard disks for one or more machines. Additionally, at the machine level, you can explicitly choose disk types on a per-volume basis. This level of granularity lets you use the right EBS volume type for the specific requirements of each volume. Many volumes that do not change frequently can take advantage of lower-cost st1, and even sc1. A best practice is to change any volumes that use or require Use fast SSD data disks (which defaults to gp2) to gp3, with default IOPS and throughput. With baseline settings, gp3 offers performance similar to gp2, with included IOPS and throughput, at a 20% reduction in cost.

Machine Replication Settings – Staging Area Disks
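To get a rough sense of what the gp2 to gp3 change is worth, the 20% figure above can be applied to your own storage footprint. The sketch below is illustrative arithmetic only. The function names are hypothetical, and the gp2 price passed in is a placeholder you would replace with the current price for your Region from the Amazon EBS pricing page.

```python
GP3_DISCOUNT = 0.20  # the 20% reduction quoted in this post


def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Monthly storage cost for a volume at a given per-GB price."""
    return size_gb * price_per_gb_month


def gp3_monthly_saving(size_gb: float, gp2_price_per_gb_month: float) -> float:
    """Estimated monthly saving from moving a gp2 staging volume to gp3."""
    gp2_cost = monthly_storage_cost(size_gb, gp2_price_per_gb_month)
    gp3_cost = gp2_cost * (1 - GP3_DISCOUNT)
    return gp2_cost - gp3_cost


if __name__ == "__main__":
    # 0.10 is a placeholder price, not a quoted AWS price.
    print(gp3_monthly_saving(size_gb=1000, gp2_price_per_gb_month=0.10))
```

The estimate covers storage only and assumes gp3 is left at its baseline IOPS and throughput, as recommended above.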

It is important to note that using st1 (Use slower, low-cost standard disks) lengthens initial synchronization time. It can also generate replication lag for servers with a high rate of change. As a best practice, I recommend switching to st1 after the initial sync has completed. This switch can also be automated using the CloudEndure API by patching "replicationConfiguration": {"useLowCostDisks": True}.
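As a minimal sketch of the automation mentioned above, the following Python builds the JSON body for that patch. Only the replicationConfiguration and useLowCostDisks keys come from this post. The function name is hypothetical, and the endpoint URL, authentication, and the way a machine is identified are deliberately left out, so take those from the CloudEndure API documentation.

```python
import json


def low_cost_disks_patch_body(enabled: bool = True) -> str:
    """Build the JSON body that toggles low-cost staging disks.

    Only the payload is constructed here; sending it (URL, session,
    machine ID) is intentionally not shown because this post does not
    specify those details.
    """
    body = {"replicationConfiguration": {"useLowCostDisks": enabled}}
    # json.dumps converts Python's True/False to JSON's true/false.
    return json.dumps(body)


if __name__ == "__main__":
    print(low_cost_disks_patch_body())
```

Note that the Python literal True in the snippet quoted above becomes true once serialized, which is what a JSON API expects.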

Another opportunity to reduce cost is to review the source volumes to be replicated. Identify disks or volumes that are not required upon recovery, such as volumes used exclusively as a target for backup, dump, swap, or cron jobs. Any volumes that are not required for recovery should be excluded from replication. This is done by installing the CloudEndure Disaster Recovery agent and instructing it to replicate only specific volumes while excluding the unnecessary ones. You can do this interactively by omitting the --no-prompt flag during the default installation, in which case the installer prompts you for the volumes you want to replicate. Details can be found in the CloudEndure Disaster Recovery documentation. It is important to note that excluding volumes, or explicitly specifying volumes to replicate, disables automatic disk detection. Understand the impact on your workloads by reviewing the documentation of the automatic disk detection feature.

Amazon EBS snapshots

During the replication process, CloudEndure Disaster Recovery uses Amazon EBS snapshots as part of the solution. CloudEndure Disaster Recovery optimizes and manages two groups of EBS snapshots. The first group is 5–7 snapshots, which are continually created and deleted as part of the nearly continuous replication process. The second group of EBS snapshots are point-in-time snapshots, which are part of the CloudEndure recovery points. CloudEndure Disaster Recovery creates one recovery point every ten minutes for the last hour, one recovery point every hour in the last 24 hours, and one recovery point per day for the last 30 days. Note that as of April 11, 2021, the default has been changed from 30 to 7 days for new projects.
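The snapshot count per volume follows directly from the schedule above. Here is a short Python sketch of that arithmetic. The names are hypothetical, and it assumes the three recovery point tiers do not share snapshots.

```python
REPLICATION_SNAPSHOTS = (5, 7)  # rolling snapshots used by replication
TEN_MINUTE_POINTS = 6           # one every ten minutes for the last hour
HOURLY_POINTS = 24              # one every hour for the last 24 hours


def snapshots_per_volume(retention_days: int) -> tuple[int, int]:
    """Return the (low, high) snapshot count per replicated EBS volume."""
    recovery_points = TEN_MINUTE_POINTS + HOURLY_POINTS + retention_days
    low, high = REPLICATION_SNAPSHOTS
    return recovery_points + low, recovery_points + high


if __name__ == "__main__":
    print(snapshots_per_volume(7))   # (42, 44)
    print(snapshots_per_volume(30))  # (65, 67)
```

Multiplying the result by the number of replicated volumes gives the approximate snapshot count to expect in the account.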

If business requirements do not dictate a 7-day or 30-day retention, reducing the number of recovery points and Amazon EBS snapshots may reduce overall costs. This is particularly the case with machines that have a high data change rate. To reduce the number of CloudEndure recovery points, adjust the Snapshot Retention Policy under the project’s Replication Settings. Since block changes vary from day to day, the impact on cost may not be linear or direct.

CloudEndure Disaster Recovery (DR) - Snapshot retention policy

The two groups of Amazon EBS snapshots that CloudEndure Disaster Recovery manages equate to approximately 42–44 snapshots per replicated EBS volume when using the default 7-day Snapshot Retention Policy. Depending on the number of servers being replicated and the number of volumes attached to each server, the snapshot count can be in the hundreds or thousands. It is important to note that the number of EBS snapshots does not correlate directly to cost. The first Amazon EBS snapshot taken is a full snapshot, and all subsequent snapshots are stored, and billed for, incrementally. When you take a new snapshot after the initial one, only the blocks that changed are saved, and you are billed only for those changed blocks. Within the AWS Management Console, the EBS snapshot size is listed as the original EBS volume size, even though the actual data stored is incremental.
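To illustrate why the snapshot count is a poor proxy for cost, the sketch below compares the storage implied by the size shown in the console with the storage implied by incremental billing. It is a simplified model with hypothetical names: it ignores snapshot expiry under the retention policy, and the figures in the example are made up for illustration.

```python
def console_listed_gb(volume_gb: float, snapshot_count: int) -> float:
    """Total you would get by summing the volume size shown per snapshot."""
    return volume_gb * snapshot_count


def incremental_stored_gb(initial_full_gb: float, changed_gb: list[float]) -> float:
    """Data stored: one full snapshot plus the changed blocks of each later one."""
    return initial_full_gb + sum(changed_gb)


if __name__ == "__main__":
    # Illustrative numbers only: a 500 GB volume and three later
    # snapshots with 2, 5, and 1 GB of changed blocks.
    changes = [2, 5, 1]
    print(console_listed_gb(500, 1 + len(changes)))  # 2000
    print(incremental_stored_gb(500, changes))       # 508
```

The gap between the two numbers grows with the snapshot count, which is why the data change rate, rather than the number of snapshots, drives the bill.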

Replication servers

CloudEndure Disaster Recovery uses replication servers to receive the data from the source and write the block data to the corresponding EBS volume. Replication servers are a shared resource. CloudEndure Disaster Recovery uses t3.small EC2 instances as the default instance type, mounting up to 15 EBS volumes to each. While the default instance type is typically sufficient for most workloads, sometimes a larger instance type or dedicated server is required. A dedicated server may be required to maintain replication of servers with a high rate of change. The selection of a dedicated replication server can be done at the machine level Replication Settings.

Machine Replication Settings – Replication Server
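For capacity planning, the 15-volume limit gives a lower bound on the number of shared replication servers. This is a back-of-the-envelope sketch with a hypothetical function name. It assumes volumes pack evenly onto shared servers and that no machine uses a dedicated server, whereas the actual allocation is decided by CloudEndure Disaster Recovery.

```python
import math

VOLUMES_PER_SERVER = 15  # maximum EBS volumes per replication server, per this post


def min_shared_replication_servers(total_volumes: int) -> int:
    """Lower bound on shared replication servers for a number of volumes."""
    if total_volumes < 0:
        raise ValueError("total_volumes must be non-negative")
    return math.ceil(total_volumes / VOLUMES_PER_SERVER)


if __name__ == "__main__":
    for volumes in (15, 16, 100):
        print(volumes, min_shared_replication_servers(volumes))
```

Each machine moved to a dedicated replication server adds at least one more instance on top of this estimate.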

Changing certain settings at the machine level may unintentionally cause the launch of a dedicated server. For example, changing an individual machine’s replication settings, such as replication instance type, volume encryption, or staging area tags, will launch a dedicated replication server. You may consider changing the replication settings via the Machine Actions menu, which applies settings across multiple machines.

Machine Actions – Modify Replication Settings

Using a larger or dedicated replication server will increase operational costs based on the AWS pay-as-you-go approach for pricing of cloud services. Generally, changing replication server settings should be done only to ensure recovery point objectives (RPO) are met. Additional details on understanding CloudEndure Disaster Recovery RPO can be found in the blog post “Understanding Recovery Point Objectives using CloudEndure Disaster Recovery.”

Incremental cost savings can also be gained by using Amazon EC2 Reserved Instances for the replication servers. When using the default t3.small instance type for replication servers, Amazon EC2 Reserved Instances may provide a significant discount compared to On-Demand pricing. View Reserved Instance pricing on the Amazon EC2 pricing page.

Conclusion

In this blog, I reviewed the resources used by CloudEndure Disaster Recovery and recommended methods to reduce the overall operating cost. Depending on workload requirements, there is an opportunity to reduce CloudEndure Disaster Recovery operational costs by changing replication settings from their defaults. Selecting a lower cost Region, excluding unnecessary source disks, updating EBS volume types, adjusting the Snapshot Retention Policy, and monitoring replication servers can reduce the operational cost of CloudEndure Disaster Recovery. Since workloads vary, monitoring and adjusting settings as necessary is recommended. All the recommended changes can be made without the need to resynchronize data. The elasticity inherent to AWS allows you to change CloudEndure Disaster Recovery resources to fit your needs.

Visit the AWS CloudEndure Disaster Recovery page to get started and for case studies of customers that have shifted their recovery site to AWS. Additional best practices can be viewed in the CloudEndure documentation.

Thanks for reading this blog post. If you have any comments or questions, please leave them in the comment section.