Use Amazon CloudWatch math expressions and composite alarms for detailed monitoring of AWS Elastic Load Balancers

AWS Elastic Load Balancing encompasses the following load balancers in AWS: Application Load Balancers, Network Load Balancers, Gateway Load Balancers, and Classic Load Balancers. A load balancer serves as a single point of contact for clients and distributes incoming traffic across multiple targets, such as EC2 instances, so it is crucial to monitor the health of the targets behind it. In general, when an Application Load Balancer (ALB), Network Load Balancer (NLB), or Classic Load Balancer (CLB) is launched in an AWS account, only a few metrics, such as “UnHealthyHostCount” and “HealthyHostCount”, are provisioned by default as CloudWatch metrics. These default metrics do not provide details about the unhealthy targets themselves, such as the EC2 instance ID or IP address, tag values, number of failed targets, timestamp, AWS account ID, and cause of failure.

The solution in this blog post provides more granular and detailed information about unhealthy targets of an Application Load Balancer (ALB), Network Load Balancer (NLB), or Classic Load Balancer (CLB) by using AWS Lambda and Amazon Simple Notification Service (SNS). In addition, our solution monitors unhealthy targets dynamically by using Amazon CloudWatch Metric Math, and provides multi-threshold alarms by using Amazon CloudWatch Composite Alarms. The deployment process has multiple steps and is automated as Infrastructure as Code (IaC) with AWS CloudFormation. The solution can also integrate with a ServiceNow ticketing system, so that Cloud Operations teams can act on alerts and engage the appropriate teams.

Solution overview

Our solution is deployed in an AWS account using CloudFormation stacks that provision all the required resources. The solution calculates the percentage of unhealthy targets by applying the Metric Math feature in CloudWatch to the default “UnHealthyHostCount” and “HealthyHostCount” load balancer metrics. The advantage of a percentage-based metric over the default metrics is that it is dynamic and usable with any setup, regardless of how many targets are registered with the load balancer. For example, let’s say you have a load balancer in your account with ‘n’ targets today, and in the future you register or deregister targets so that the number of targets changes to ‘m’. Because the solution uses percentages, you do not have to modify the CloudWatch alarm thresholds to match the new number of targets. Furthermore, if you have multiple load balancers in your account, you do not have to adjust the threshold values for each load balancer with a different number of targets.
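As a minimal sketch of the percentage calculation described above, the following Python builds the kind of metric math query list that CloudWatch accepts for an alarm. The exact expression, metric IDs, label, period, and statistic used by the CloudFormation template are not shown in this post, so the values here are assumptions for illustration only. Note that CloudWatch spells the unhealthy metric name as UnHealthyHostCount.

```python
def unhealthy_percentage(unhealthy, healthy):
    """Return the share of registered targets that are unhealthy, in percent."""
    total = unhealthy + healthy
    if total == 0:
        return 0.0  # no registered targets, so nothing is unhealthy
    return 100.0 * unhealthy / total


def percentage_metric_queries(namespace, dimensions, period=60, stat="Average"):
    """Build a metric math query list in the shape used by CloudWatch alarms.

    The IDs (m1, m2, e1), the label, the period, and the statistic are
    assumptions for this sketch, not values taken from the template.
    """
    def metric(metric_id, metric_name):
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": namespace,
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": period,
                "Stat": stat,
            },
            "ReturnData": False,  # inputs only; the alarm evaluates e1
        }

    return [
        metric("m1", "UnHealthyHostCount"),
        metric("m2", "HealthyHostCount"),
        {
            "Id": "e1",
            "Expression": "100 * m1 / (m1 + m2)",
            "Label": "UnhealthyTargetPercentage",
            "ReturnData": True,
        },
    ]


# The same percentage results whether 1 of 4 or 5 of 20 targets are unhealthy.
small_fleet = unhealthy_percentage(1, 3)
large_fleet = unhealthy_percentage(5, 15)
```

Because both fleets evaluate to the same percentage, one threshold can serve load balancers with different numbers of targets.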

The solution creates three percentage alarms, one per severity level (Low, Medium, and High), with the following threshold values: Low > 0%, Medium > 30%, and High > 60%. You can define your own threshold values for these three levels as required. Moreover, our solution uses the Composite Alarms feature so that, instead of single-threshold alarms, the alarms are monitored on the basis of threshold ranges: Low – 0 to 30, Medium – 30 to 60, and High – 60 and above.
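One way to turn three single-threshold alarms into non-overlapping ranges is to combine them with the ALARM, AND, and NOT operators of the composite alarm rule syntax. The sketch below is an assumption about how such rules could look; the actual rules in the template are not reproduced in this post, and the alarm names are hypothetical.

```python
def composite_alarm_rules(low_alarm, medium_alarm, high_alarm):
    """Return one composite alarm rule string per severity range."""
    return {
        # Low range: above the low threshold but not yet above the medium one.
        "LOW": f'ALARM("{low_alarm}") AND NOT ALARM("{medium_alarm}")',
        # Medium range: above the medium threshold but not above the high one.
        "MEDIUM": f'ALARM("{medium_alarm}") AND NOT ALARM("{high_alarm}")',
        # High range: above the high threshold.
        "HIGH": f'ALARM("{high_alarm}")',
    }


def severity(percent, low=0, medium=30, high=60):
    """Classify an unhealthy percentage using the thresholds quoted above."""
    if percent > high:
        return "HIGH"
    if percent > medium:
        return "MEDIUM"
    if percent > low:
        return "LOW"
    return None  # no unhealthy targets, so no alarm range applies


# Hypothetical alarm names, used only to show the resulting rule strings.
rules = composite_alarm_rules("lb-low", "lb-medium", "lb-high")
```

With these rules, at most one composite alarm is in the ALARM state at a time, so a single notification is sent for the range that currently applies.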

Once a composite alarm is triggered, it sends a notification to an Amazon Simple Notification Service (SNS) topic to which an AWS Lambda function is subscribed. The purpose of the Lambda function is to capture the required information, such as the details of the unhealthy targets, the root cause of the failed health check, how many targets were marked unhealthy, and the corresponding instance IDs and instance IPs. This detailed information is then pushed to another SNS topic with an email subscription. As a result, if any of the load balancer targets becomes unhealthy, the user receives a customized notification showing which server went down along with a probable cause, which helps operations teams act and respond quickly. This lets them spend their time fixing the issue rather than gathering information.
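The first thing a Lambda function subscribed to the alarm topic has to do is unpack the SNS event. The following is a minimal sketch of that step, assuming the SNS message body is the JSON document that CloudWatch publishes for alarm state changes. The function name and the returned keys are hypothetical and are not taken from the solution's Lambda code.

```python
import json


def parse_alarm_event(event):
    """Extract the alarm details from an SNS-triggered Lambda event."""
    # SNS delivers the alarm notification as a JSON string in Sns.Message.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return {
        "alarm_name": message.get("AlarmName"),
        "state": message.get("NewStateValue"),
        "reason": message.get("NewStateReason"),
        "account_id": message.get("AWSAccountId"),
        "timestamp": message.get("StateChangeTime"),
    }


# A trimmed, made-up event that shows the nesting of the payload.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "AlarmName": "lb-medium-composite",
                        "NewStateValue": "ALARM",
                        "NewStateReason": "example reason",
                        "AWSAccountId": "111111111111",
                        "StateChangeTime": "2024-01-01T00:00:00.000+0000",
                    }
                )
            }
        }
    ]
}
details = parse_alarm_event(sample_event)
```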

Solution architecture

Figure 1 shows the overall architecture for the detailed monitoring of AWS Elastic Load Balancer unhealthy targets using CloudWatch math expressions, composite alarms, and a Lambda function.

When a load balancer target (EC2 instance) fails an ELB health check, it triggers the percentage and composite CloudWatch alarms based on the defined percentage threshold of failed targets. The composite alarm action sends an Amazon Simple Notification Service (SNS) notification that triggers the Lambda function. The Lambda function makes describe-instances, describe-load-balancers, or describe-target-groups API calls to retrieve the identity (metadata) of the failed target, as well as the cause of the failure.
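To illustrate how the failed targets and their failure causes can be pulled out of an API response, here is a small sketch that filters a response in the shape returned by the Elastic Load Balancing v2 describe_target_health call. Using that particular call is our assumption for this example, since the post does not show the Lambda code, and the helper name is hypothetical.

```python
def unhealthy_targets(target_health_response):
    """Return ID, port, and failure cause for every unhealthy target."""
    results = []
    for entry in target_health_response.get("TargetHealthDescriptions", []):
        health = entry.get("TargetHealth", {})
        if health.get("State") != "unhealthy":
            continue  # skip healthy, draining, and initial targets
        results.append(
            {
                "id": entry["Target"]["Id"],  # instance ID or IP address
                "port": entry["Target"].get("Port"),
                "reason": health.get("Reason"),
                "description": health.get("Description"),
            }
        )
    return results


# A made-up response with one healthy and one unhealthy target.
sample_response = {
    "TargetHealthDescriptions": [
        {
            "Target": {"Id": "i-0123456789abcdef0", "Port": 80},
            "TargetHealth": {"State": "healthy"},
        },
        {
            "Target": {"Id": "i-0fedcba9876543210", "Port": 80},
            "TargetHealth": {
                "State": "unhealthy",
                "Reason": "Target.Timeout",
                "Description": "Request timed out",
            },
        },
    ]
}
failed = unhealthy_targets(sample_response)
```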

The Lambda function then combines all of this information: how many instances failed, the cause of failure, and the instance IDs. If the application logs are stored in a CloudWatch log group, it also retrieves the most recent logs. Finally, it drafts an email and sends an SNS notification to the email address subscribed to another SNS topic.
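The drafting step can be pictured as a small formatting function. The sketch below is only an assumption about how the collected details might be assembled into a subject and a body; the field labels and layout are hypothetical and do not reproduce the email shown in Figure 2.

```python
def build_notification(load_balancer, account_id, timestamp, targets, recent_logs=None):
    """Return a (subject, body) pair for an email notification."""
    # SNS subjects are limited to 100 characters, so truncate defensively.
    subject = f"{len(targets)} unhealthy target(s) behind {load_balancer}"[:100]
    lines = [
        f"Load balancer: {load_balancer}",
        f"AWS account ID: {account_id}",
        f"Timestamp: {timestamp}",
        f"Number of targets failed: {len(targets)}",
        "",
    ]
    for target in targets:
        lines.append(f"- {target['id']}: {target['reason']}")
    if recent_logs:
        # Only included when application logs are available.
        lines.append("")
        lines.append("Most recent application logs:")
        lines.extend(recent_logs)
    return subject, "\n".join(lines)


subject, body = build_notification(
    "app/lbname/d665cae1604417d",
    "111111111111",
    "2024-01-01T00:00:00Z",
    [{"id": "i-0123456789abcdef0", "reason": "Target.Timeout"}],
)
```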

Architecture diagram displaying infrastructure as code components for provisioning required services for detailed monitoring of elastic load balancing unhealthy targets.

Figure 1: Architecture diagram

Solution components

The solution consists of the following AWS CloudFormation templates, which you can download and use to create CloudFormation stacks:

Step1_IAMRole_For_Lambda.yml – Sets up the AWS Identity and Access Management (IAM) role assumed by the Lambda function as well as the SNS topic used for email notifications.
Step2_ALB_unhealthy_targets_monitoring_Deployment.yml – Sets up the percentage-based metrics, percentage CloudWatch alarms and Composite CloudWatch alarms for dynamic and multi-level threshold monitoring. It also provisions the Lambda function for extracting underlying details of unhealthy targets.

Walkthrough

This section walks through the prerequisites and steps to set up and deploy this solution.

Prerequisites

Create an S3 bucket with this naming convention: elb-detailed-monitoring-code-<accountid>-<region>. Create a folder called ‘lambdacode’ in your S3 bucket. You can choose different names for the S3 bucket and folder, as long as you use the same names when launching the CloudFormation stack. Once the S3 bucket and the folder are created, download the Python Lambda code .zip file and upload it to the folder. Note the path, as you’ll need it when you deploy the CloudFormation stack. If the folder name is ‘lambdacode’, then the path to be used is lambdacode/identifying_unhealthy_targets.zip.
Create another S3 bucket with this naming convention: elb-detailed-notifications-log-<accountid>-<region>. You can choose a different name for this S3 bucket, as long as you use the same name when launching the CloudFormation stack.
This solution requires a valid email address. This email address will be subscribed to the SNS topic to receive ALARM notifications for any unhealthy targets behind the load balancer.
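The naming conventions above can be generated rather than typed by hand. This is a small sketch that only builds the names; the helper and its return keys are hypothetical. The length check reflects the general S3 rule that bucket names are between 3 and 63 characters long.

```python
def prerequisite_names(account_id, region, folder="lambdacode",
                       zip_name="identifying_unhealthy_targets.zip"):
    """Build the default bucket names and the Lambda code path."""
    names = {
        "code_bucket": f"elb-detailed-monitoring-code-{account_id}-{region}",
        "log_bucket": f"elb-detailed-notifications-log-{account_id}-{region}",
        "code_key": f"{folder}/{zip_name}",
    }
    for key in ("code_bucket", "log_bucket"):
        if not 3 <= len(names[key]) <= 63:
            raise ValueError(f"{names[key]!r} is not a valid S3 bucket name length")
    return names


# Example values only; substitute your own account ID and Region.
names = prerequisite_names("111111111111", "us-east-1")
```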

Solution setup

Step 1 – Launch the CloudFormation stack to create the IAM role and SNS topic [to be executed once per Region]

From the AWS account, create a stack in the AWS CloudFormation console to launch the Step1_IAMRole_For_Lambda.yml template. The template takes parameters for the following components:

Name of the IAM role used by the UnhealthyTargetMonitoring Lambda
Name of the SNS topic used by the Lambda function to send alert notifications
Email address for SNS subscription
S3 bucket to store logs / alerts / notifications.
True/False if application logs for EC2 servers are stored in a CloudWatch Log Group
CloudWatch Log Group Name where EC2 servers logs are stored

Running the Step1_IAMRole_For_Lambda.yml template during the initial setup provisions the following resources:

The IAM role. This IAM role will be associated with the Lambda function that is provisioned by the CloudFormation stack in Step 2.
The SNS topic. The Lambda function sends the custom notification message, with all the required information about the load balancer’s unhealthy targets, to this SNS topic.
A conditional IAM policy, created only if application logs are enabled and the corresponding parameter is set to TRUE.
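If you prefer to script the stack launch instead of using the console, the template parameters need to be expressed as a list of key and value pairs, which is the shape the CloudFormation API expects. The sketch below shows only that conversion. The parameter keys in the example are hypothetical, because this post does not list the template's actual parameter names, so check the template before reusing them.

```python
def to_cfn_parameters(values):
    """Convert a plain dict into the CloudFormation Parameters list shape."""
    return [
        # CloudFormation parameter values are passed as strings.
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in values.items()
    ]


# Hypothetical parameter keys, for illustration only.
step1_parameters = to_cfn_parameters(
    {
        "IAMRoleName": "unhealthy-target-monitoring-role",
        "SNSTopicName": "unhealthy-target-notifications",
        "EmailAddress": "ops@example.com",
    }
)
```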

Step 2 – Launch the CloudFormation stack to create the CloudWatch alarms and the Lambda function that monitors unhealthy targets [to be executed once per load balancer; to set up this solution for multiple load balancers in the account, repeat only Step 2 for each additional load balancer]

From the AWS account, create a stack in the AWS CloudFormation console to launch the Step2_ALB_unhealthy_targets_monitoring_Deployment.yml template. The template takes parameters for the following components:

The full name of the load balancer, in the format described below.
- Navigate to the Amazon EC2 console and, in the navigation pane, under Load Balancing, choose Load Balancers. Select the load balancer to be used for this setup. Choose Description and, under the Basic Configuration section, locate the ARN of the load balancer (for example, arn:aws:elasticloadbalancing:us-east-1:111111111111:loadbalancer/app/lbname/d665cae1604417d). From the ARN, you need only the last section, as shown in the examples below. A Classic Load Balancer has no ARN, so use the Name field instead.
  - If this setup is for Classic Load Balancer, simply provide the Name of the load balancer
  - If this setup is for Application Load Balancer, provide the name included in the ARN of the ALB in this format – e.g., app/lbname/d665cae1604417d.
  - If this setup is for Network Load Balancer, provide the name included in the ARN of the NLB in this format – e.g., net/lbname/d665cae1604417d.
Name tag used for the load balancer: This is the name tag of the Load Balancer.
- Navigate to the Amazon EC2 Console and in the navigation pane, under Load Balancing, choose Load Balancers. Select the load balancer to be used for this setup. Choose Description and under Basic Configuration section locate the Name of the load balancer.
ARN of the target group to be monitored for unhealthy targets.
- Navigate to the Amazon EC2 console and, in the navigation pane, under Load Balancing, choose Target Groups. Select the target group to be used for this setup and, in the section below, locate the ARN of the target group (for example, arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/Test-TG/5jsh7ysuh27y2).
  - If this setup is for the Classic Load Balancer, enter NONE.
  - If this setup is for the Application or Network Load Balancer, enter the ARN of the target group that needs to be monitored.
Target group name: Name of the Target group that will be used for this setup.
- If this setup is for the Classic Load balancer, enter NONE.
- If this setup is for the Application or Network Load balancer, enter the name of the target group that needs to be monitored.
Namespace of the CloudWatch alarm: This depends on the type of load balancer. Select the namespace from the dropdown according to the type of the ELB.
- If this setup is for Classic Load Balancer, select AWS/ELB.
- If this setup is for Application Load Balancer, select AWS/ApplicationELB.
- If this setup is for Network Load Balancer, select AWS/NetworkELB.
ARN of the IAM role for the Lambda function: This is the ARN of the IAM role provisioned by the CloudFormation stack in Step 1.
Target group type: Select one of the options from the dropdown, i.e., Instance or IP.
- You can identify the target group type by going to the Amazon EC2 console and, in the navigation pane, under Load Balancing, choosing Target Groups. Select the target group to be used for this setup, verify whether the value of the Target type field is Instance or IP, and make the selection for this parameter accordingly.
  - Select Instance if the targets are registered using the InstanceId
  - Select IP if the targets are registered using the PrivateIPs
Name of the S3 Bucket used to store the Lambda code: This S3 bucket should have already been created as a part of the prerequisites. Default value is ‘elb-detailed-monitoring-code-<accountid>-<region>’
S3 folder along with the Zip file: This is the prefix which includes the name of the file along with the folder. This folder was created as part of the prerequisites. Default value is lambdacode/identifying_unhealthy_targets.zip.
ARN of the SNS topic used for email subscription.
- You can find the SNS topic ARN in the Outputs section of the Step 1 CloudFormation stack. Navigate to the AWS CloudFormation console and choose the stack that was created in Step 1. Choose Outputs and copy the value for SNSTopicForNotifications.
S3 Bucket name where all the log Files will be uploaded as a backup copy: This S3 bucket should have already been created as a part of the prerequisites. Default value is ‘elb-detailed-notifications-log-<accountid>-<region>’
LOW severity alarm threshold value: Default value is ‘0’.
MEDIUM severity alarm threshold value: Default value is ‘30’.
HIGH severity alarm threshold value: Default value is ‘70’.
True/False if application logs for EC2 servers are stored in CloudWatch Log Group
CloudWatch Log Group name where EC2 server logs are stored: If you have selected FALSE, then leave this parameter as NONE. Otherwise, provide the name of the CloudWatch Log Group where you are storing the application logs.
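Several of the parameters above are derived from ARNs, so they can be computed instead of copied by hand. The following sketch extracts the load balancer full name and the target group name from the example ARNs in this section, and maps the load balancer type to the namespaces listed above. The helper names are hypothetical.

```python
NAMESPACES = {"app": "AWS/ApplicationELB", "net": "AWS/NetworkELB"}


def load_balancer_full_name(arn):
    """Return the part of an ALB/NLB ARN after 'loadbalancer/'."""
    resource = arn.split(":", 5)[5]  # e.g. loadbalancer/app/lbname/<id>
    prefix, _, full_name = resource.partition("/")
    if prefix != "loadbalancer" or not full_name:
        raise ValueError(f"not a load balancer ARN: {arn}")
    return full_name


def namespace_for(full_name):
    """Pick the CloudWatch namespace for a load balancer name."""
    if "/" not in full_name:
        return "AWS/ELB"  # a Classic Load Balancer is identified by name only
    lb_type = full_name.split("/", 1)[0]
    if lb_type not in NAMESPACES:
        raise ValueError(f"unsupported load balancer type: {lb_type}")
    return NAMESPACES[lb_type]


def target_group_name(arn):
    """Return the target group name from a target group ARN."""
    parts = arn.split(":", 5)[5].split("/")  # targetgroup/<name>/<id>
    if len(parts) != 3 or parts[0] != "targetgroup":
        raise ValueError(f"not a target group ARN: {arn}")
    return parts[1]


alb_name = load_balancer_full_name(
    "arn:aws:elasticloadbalancing:us-east-1:111111111111:loadbalancer/app/lbname/d665cae1604417d"
)
```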

Running the Step2_ALB_unhealthy_targets_monitoring_Deployment.yml CloudFormation template for the first time provisions the following resources:

Six CloudWatch alarms – three alarms for the unhealthy percentage math expression at the LOW, MEDIUM, and HIGH thresholds, and three composite alarms that define the ranges and achieve multi-threshold monitoring.
SNS topic to which the Lambda function is subscribed. This SNS topic receives a notification whenever any of the composite alarms enters the ALARM state.
Lambda function. The Lambda function makes API calls to describe and fetch in-depth EC2 and ELB information related to the unhealthy targets.

Validate Solution

Once both CloudFormation stacks from Step 1 and Step 2 are deployed successfully and the complete setup is in place, a notification email is sent to the email address subscribed to the SNS topic whenever one or more targets behind the load balancer used in this setup are marked as unhealthy. The email notification contains the details of the unhealthy targets behind that load balancer.

Follow these steps to validate the solution:

Step 1: For the load balancer which has been used for this setup, identify the targets.

If the load balancer is an ALB or NLB, go to the Amazon EC2 console and, in the navigation pane, under Load Balancing, choose Target Groups. Select the target group used for this setup, which is associated with the corresponding load balancer. Choose Targets and verify that your instances are healthy.
If the load balancer is a CLB, go to the Amazon EC2 console and, in the navigation pane, under Load Balancing, choose Load Balancers. Select the CLB used for this setup, choose Instances, and confirm that the status of the targets is InService.
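The same check can be scripted. As a hedged sketch, the function below accepts a response in either the Elastic Load Balancing v2 describe_target_health shape (ALB/NLB) or the Classic Load Balancer describe_instance_health shape, and reports whether every target is in the expected state. The helper name is hypothetical, and the sample responses are made up.

```python
def all_targets_healthy(response):
    """Return True only if at least one target exists and all are healthy."""
    if "TargetHealthDescriptions" in response:
        # ALB/NLB: each target reports TargetHealth.State, e.g. "healthy".
        states = [
            entry["TargetHealth"]["State"]
            for entry in response["TargetHealthDescriptions"]
        ]
        expected = "healthy"
    else:
        # CLB: each instance reports State, e.g. "InService".
        states = [entry["State"] for entry in response.get("InstanceStates", [])]
        expected = "InService"
    return bool(states) and all(state == expected for state in states)


alb_ok = all_targets_healthy(
    {"TargetHealthDescriptions": [{"TargetHealth": {"State": "healthy"}}]}
)
clb_ok = all_targets_healthy(
    {
        "InstanceStates": [
            {"InstanceId": "i-0123456789abcdef0", "State": "InService"},
            {"InstanceId": "i-0fedcba9876543210", "State": "OutOfService"},
        ]
    }
)
```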

Step 2: Make required changes to mark the targets as Unhealthy.

Make the required changes to mark the target as unhealthy under the target group for ALB/NLB or under the load balancer for CLB. You can change the status of the targets to ‘unhealthy’ with one of the following options:
1. Option 1: Make changes on the targets themselves (i.e., the application EC2 servers) that you would like to be marked as unhealthy. You can either stop the application running on the target completely or change the application-specific config file so that the application no longer listens on the port to which the load balancer forwards traffic. NOTE: Please do not stop the EC2 server itself from the console to do the testing.
2. Option 2: Make changes to the security groups associated with the targets to block traffic from the load balancer. You can either update or delete the existing rules so that the load balancer cannot communicate with the targets on the registered port, which in turn causes the load balancer to mark those targets as unhealthy.
Confirm that the targets on which changes have been made are showing as unhealthy under the target group or the load balancer.

Step 3: Confirm the notification has been received in the expected format at the subscribed email.

Once one or more targets behind the load balancer become unhealthy, an SNS notification is sent to the subscribed email address in the format shown in Figure 2. Check the email address used for this setup and confirm that you have received the notification as expected.

Sample email notification that is sent to the subscribed email address once one or more targets are marked unhealthy by the load balancer.

Figure 2 – Sample Email Notification

Cleanup

To clean up your account after trying the solution outlined in this blog post, do the following:

Revert the changes made on the targets or the corresponding security groups and make sure the targets are healthy again.
Delete the CloudFormation stacks created in Step 1 and Step 2. For instructions, refer to the AWS CloudFormation documentation on deleting a stack.

Conclusion

In this blog post, we have presented a solution that provides dynamic monitoring of unhealthy targets by using Amazon CloudWatch Metric Math, and multi-threshold alarms by using Amazon CloudWatch Composite Alarms. Our solution provisions a detailed monitoring setup for your load-balanced targets in AWS, with timely alerts whenever the targets behind the load balancers become unhealthy. This lets you focus on resolving the core issue behind these unhealthy targets as quickly as possible instead of spending time gathering information.

AWS Cloud Operations & Migrations Blog