If you had to architect a multi-account security logging strategy, where should you start?

This post, part of the “Continuous Visibility into Ephemeral Cloud Environments” series, describes a design for a state-of-the-art multi-account security-related logging platform in AWS. Later posts will also cover a similar setup for both GCP and Kubernetes.

This is a living document, which I regularly update as new services/improvements get released.
Last updated:

Problem Statement

One of the usual requirements for Security teams is to improve the visibility over (production) environments. In this regard, it is often necessary to design and roll out a strategy around security-related logging. This entails defining the scope for logging (resources, frequency, etc.), as well as providing an integration with existing monitoring and alerting systems.

The end goal is to deploy a security logging and monitoring solution with well-established metrics and integrations with a SIEM of choice (Elasticsearch in this case). In particular, the solution should be able to:

  • Collect security-related logs from all environments.
  • Ingest those logs into a SIEM (e.g., Elasticsearch).
  • Parse those logs and use them to generate dashboards in Kibana.
  • Create alerts on anomalies.

In this regard, this post is composed of two main parts. The first introduces the logging-related services made available by AWS to their customers, along with their main features. The second describes a state-of-the-art design for a security-related logging platform, and provides the high-level architecture and best practices to follow during the implementation phase.


Which Services Can We Leverage?

AWS offers multiple services around logging and monitoring. For example, you have almost certainly heard of CloudTrail and CloudWatch, but they are just the tip of the iceberg.

CloudWatch Logs is the default logging service for many AWS resources (like EC2, RDS, etc.): it captures application events and error logs, and allows you to monitor and troubleshoot application performance. CloudTrail, on the other hand, works at a lower level, monitoring API calls for various AWS services.

Although listing (and describing) all services made available by AWS is out of scope for this blog post, there are a few brilliant resources which tackle this exact problem:

In the remainder of this section I’ll provide a summary of the main services we will need to design our security logging platform. Before doing so, though, it might be helpful to have a high-level overview of how these services communicate (special thanks to Scott Piper for the original idea):

Relationships Between AWS Logging/Monitoring Services.

CloudTrail

AWS CloudTrail is defined as:

A service that enables governance, compliance, operational auditing, and risk auditing of AWS accounts. CloudTrail can be used to log, continuously monitor, and retain account activity related to actions across an AWS-based infrastructure.

CloudTrail provides event history of AWS account activity, including actions taken through the AWS Management Console, AWS SDKs, command line tools, and other AWS services. This event history can be leveraged for security analysis, resource change tracking, and troubleshooting. In addition, CloudTrail can be used to detect unusual activity in your AWS accounts.

In short, CloudTrail monitors AWS API calls across nearly every AWS service, recording information such as the user agent, IP address, IAM user or role ARN, and other details about the request. It delivers log files to a designated S3 bucket approximately every five minutes, along with the option of log file integrity validation. CloudTrail can also be configured to send a message via SNS when new logs are delivered, and integrates with CloudWatch Logs and Lambda for processing.

Data Events, which are not logged by default, can also be leveraged to provide visibility into the resource operations performed on or within a resource (also known as data plane operations). Be aware that these are often high-volume activities.
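To make this more concrete, here is a minimal boto3 sketch of how a trail with log file validation could be created and how object-level S3 Data Events could be enabled for it. The trail and bucket names are placeholders, not part of any real setup:

```python
# Minimal sketch (boto3): create a multi-region trail with log file validation,
# then opt in to S3 object-level (data plane) events. All names are hypothetical.
import boto3

cloudtrail = boto3.client("cloudtrail")

cloudtrail.create_trail(
    Name="security-trail",                   # hypothetical trail name
    S3BucketName="example-cloudtrail-logs",  # bucket must already exist with the right policy
    IsMultiRegionTrail=True,
    EnableLogFileValidation=True,            # produces digest files for integrity validation
)

# Data Events are not logged by default; enable them explicitly (high volume!).
cloudtrail.put_event_selectors(
    TrailName="security-trail",
    EventSelectors=[{
        "ReadWriteType": "All",
        "IncludeManagementEvents": True,
        "DataResources": [{
            "Type": "AWS::S3::Object",
            "Values": ["arn:aws:s3:::example-sensitive-bucket/"],  # only this bucket's objects
        }],
    }],
)

cloudtrail.start_logging(Name="security-trail")
```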

CloudWatch

CloudWatch is a monitoring and observability service that provides data and actionable insights to monitor applications and systems, optimize resource utilization, and get a unified view of operational health.

It collects monitoring and operational data in the form of logs, metrics, and events, and visualizes it using automated dashboards to provide a unified view of resources, applications, and services. CloudWatch can also be used to create alarms based on custom metric value thresholds, or to watch for anomalous metric behavior based on machine learning algorithms. Automated actions can be set up to notify members of staff if an alarm is triggered.

CloudWatch Logs can be manually exported to S3 for long-term storage, or streamed to subscriptions such as Lambda, a Kinesis Data Stream, or Kinesis Data Firehose Stream.
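As an example of the streaming option, the following boto3 sketch (with hypothetical log group, stream, and role names) wires a CloudWatch Logs group to a Kinesis Data Stream through a subscription filter:

```python
# Minimal sketch (boto3): stream a CloudWatch Logs group to a Kinesis Data Stream
# via a subscription filter. The ARNs and names below are placeholders.
import boto3

logs = boto3.client("logs")

logs.put_subscription_filter(
    logGroupName="/aws/app/example",   # hypothetical log group
    filterName="to-kinesis",
    filterPattern="",                  # empty pattern = forward every event
    destinationArn="arn:aws:kinesis:eu-west-1:111122223333:stream/security-logs",
    roleArn="arn:aws:iam::111122223333:role/cwlogs-to-kinesis",  # role CloudWatch Logs assumes
)
```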

GuardDuty

GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior to protect AWS accounts and workloads.

The service uses machine learning, anomaly detection, and integrated threat intelligence (from AWS, CrowdStrike, and Proofpoint) to identify and prioritize potential threats such as:

  • Crypto-currency mining.
  • Credential compromise behavior.
  • Communication with known command-and-control servers.
  • API calls from known malicious IPs.

In addition to detecting threats, GuardDuty can perform automated remediation actions by leveraging CloudWatch Events and Lambda.
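A minimal sketch of that remediation wiring, assuming a pre-existing remediation Lambda (the function ARN below is hypothetical), could look like this:

```python
# Minimal sketch (boto3): enable GuardDuty and route its findings to a Lambda
# function through a CloudWatch Events (EventBridge) rule. The Lambda would also
# need a resource policy allowing events.amazonaws.com to invoke it.
import boto3

guardduty = boto3.client("guardduty")
events = boto3.client("events")

guardduty.create_detector(Enable=True, FindingPublishingFrequency="FIFTEEN_MINUTES")

events.put_rule(
    Name="guardduty-findings",
    EventPattern='{"source": ["aws.guardduty"], "detail-type": ["GuardDuty Finding"]}',
    State="ENABLED",
)
events.put_targets(
    Rule="guardduty-findings",
    Targets=[{
        "Id": "remediation-lambda",
        "Arn": "arn:aws:lambda:eu-west-1:111122223333:function:example-remediation",  # hypothetical
    }],
)
```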

Config

Config creates an inventory of AWS resources, including configuration history, change notifications, and the relationships between such resources. It provides a timeline of resource configuration changes for specific services, stores snapshots in a specified S3 bucket, and can send SNS notifications when AWS resource changes are detected.

The main use cases for Config are tracking changes to resource configurations, answering questions about resource configurations, demonstrating compliance either at a specific point in time or over a period of time, troubleshooting, and performing security analysis.
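For reference, here is a minimal boto3 sketch of turning Config on in an account; the role, bucket, and topic names are placeholders:

```python
# Minimal sketch (boto3): enable AWS Config with a recorder covering all supported
# resource types, and a delivery channel pointing at an S3 bucket and SNS topic.
import boto3

config = boto3.client("config")

config.put_configuration_recorder(
    ConfigurationRecorder={
        "name": "default",
        "roleARN": "arn:aws:iam::111122223333:role/aws-config-role",  # hypothetical
        "recordingGroup": {"allSupported": True, "includeGlobalResourceTypes": True},
    }
)
config.put_delivery_channel(
    DeliveryChannel={
        "name": "default",
        "s3BucketName": "example-config-snapshots",                   # hypothetical
        "snsTopicARN": "arn:aws:sns:eu-west-1:111122223333:config-changes",
    }
)
config.start_configuration_recorder(ConfigurationRecorderName="default")
```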

Access Logs

Access Logs deserve a particular mention, as they are generated by a variety of services (a short enablement sketch follows the table below):

Service Description
S3
  • S3 access logging records individual requests made to S3 buckets and can be useful for access auditing.
  • Logging can be configured at the Bucket level (Management Events) or Object level (Data Events) via CloudTrail. In addition, Server Access Logs can be collected as well.
  • Access Logs are delivered to a designated target S3 bucket on a best effort basis.
CloudFront
  • CloudFront access logging records individual requests made to CloudFront distributions.
  • Access logs are delivered to a designated target S3 bucket on a best effort basis.
VPC Flow Logs
  • VPC Flow Logs capture information about the IP traffic going to and from a VPC's network interfaces and can be applied at the VPC, subnet, or individual Elastic Network Interface (ENI) level.
  • Flow log data can be stored in CloudWatch Logs (or delivered directly to S3) and streamed via CloudWatch Logs subscriptions for additional analytics or visualization of network traffic flows.
  • VPC Flow Logs can be useful when organizational legal or security policies require capturing network flow data.
Elastic Load Balancing (v1)
  • Elastic Load Balancing access logging records individual requests made to a load balancer.
  • Elastic Load Balancing access logs are delivered to a designated target S3 bucket at user-specified 5- or 60-minute intervals.
Application Load Balancers (ALB)
  • Logs requests sent to the load balancer on a best-effort basis, including requests that never made it to the targets (e.g., malformed requests, requests with no target response)
  • Logs the details of each request/connection made to the Load Balancer (e.g., connection type, timestamp, client/target IP/port, status code, etc.)
  • A log file for each ALB node is published every 5 minutes
Network Load Balancers (NLB)
  • Logs detailed information about the TLS requests sent to the NLB (access logs are created only if the load balancer has a TLS listener)
  • Logs the details of each TLS request/connection made to the Load Balancer (e.g., connection type, timestamp, client/target IP/port, status code, TLS cipher/protocol version, etc.)
  • A log file for each NLB node is published every 5 minutes
Databases (Redshift, RDS, DynamoDB)
  • Redshift Logs capture information about database connections, user activity, and changes to user definitions. Logs are delivered to a designated target S3 bucket.
  • RDS Logs capture information about database access, performance, errors, and operation. Log files can be queried through DB engine-specific database tables, but a custom process to export RDS Access Logs into a central access log repository (like S3 or CloudWatch) needs to be implemented.
  • DynamoDB: since April 2021, it is possible to enable data plane activity logging for fine-grained monitoring of all DynamoDB item activity within a table by using CloudTrail.
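To give a flavour of how these are switched on, the boto3 sketch below enables two of them: S3 server access logging on a bucket and access logging on an ALB. All bucket names and ARNs are placeholders:

```python
# Minimal sketch (boto3): enable S3 server access logging and ALB access logging.
import boto3

s3 = boto3.client("s3")
elbv2 = boto3.client("elbv2")

# S3 server access logs, delivered (best effort) to a separate target bucket.
s3.put_bucket_logging(
    Bucket="example-data-bucket",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "example-access-logs",
            "TargetPrefix": "s3/example-data-bucket/",
        }
    },
)

# ALB access logs are toggled through load balancer attributes.
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:eu-west-1:111122223333:loadbalancer/app/example/0123456789abcdef",
    Attributes=[
        {"Key": "access_logs.s3.enabled", "Value": "true"},
        {"Key": "access_logs.s3.bucket", "Value": "example-access-logs"},
        {"Key": "access_logs.s3.prefix", "Value": "alb/example"},
    ],
)
```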

I'm writing a book! 📖

The CloudSec Engineer will be a book on how to enter, establish yourself, and thrive in the cloud security industry as an individual contributor.
You can sign up to get updates and free samples of the book as I write it at: CloudSecBooks.com.


State of the Art Security Logging Platform in AWS

So how could we design a multi-account security-related logging platform in AWS?

Let’s start with a high-level architecture diagram of a solution with multiple “projects” (or customers), each with production and non-production environments (note how every project/customer will have the same setup). Here I will assume the workloads run predominantly in a Kubernetes cluster (managed EKS), but with some stateful services involved as well (e.g., RDS).

Architecture Diagram - Security Logging Platform in AWS

Collection

Starting from collection, logging services should be enabled in every AWS account, so as to collect logs from every environment (whether it is production or not).

In particular, the following information should be collected:

Log Type Description
API Call Logs
  • CloudTrail should be enabled in every region (more on this below) to monitor API calls for various AWS services.
Application Event Logs
  • CloudWatch Logs can be used to capture application event and error logs, as it provides a centralized service for storing and aggregating log data.
Access Logs
  • VPC Flow Logs: VPC Flow Logs can be collected to comply with regulatory policies that require capturing network flow data, as they record information about the IP traffic going to and from a VPC's network interfaces (see the sketch after this table).
  • S3: S3 Access Logging can be enabled to record individual requests made to S3 buckets (especially ones containing sensitive data).
  • ELB: ELBs Access Logging can be enabled to record individual requests made to load balancers.
Data Events
  • S3 object-level API activity: can record all API actions on S3 Objects (i.e., GetObject, DeleteObject, and PutObject API operations) and receive detailed information such as the AWS account of the caller, the IAM user or role of the caller, the time of the API call, the source IP address of the caller, etc.
  • Lambda function execution activity: can record the Invoke API.
Kubernetes Logs
  • Control plane logs: control plane API, audit, controller, authenticator, and scheduler logs are collected by EKS itself and forwarded to CloudWatch Logs.
  • Worker node logs: collection depends on whether the compute plane is self-managed via EC2 or AWS-managed via Fargate. I'll write a follow-up post specifically on this.
  • Task container logs: the application logs. I'll write a follow-up post specifically on this.
DNS Query Logs
  • Route 53 Resolver Query Logs can record information about the DNS queries originating from resources within your VPCs, such as requested domain, time of request, record type, etc.
  • Route 53 Resolver DNS Firewall allows you to define Rule Groups with multiple Block/Allow/Alert rules (within each rule, you can specify your own domain list).
  • Route 53 will send these logs to CloudWatch Logs.
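For the network-level items above, the sketch below shows how VPC Flow Logs and Route 53 Resolver query logging might be enabled with boto3 and delivered to CloudWatch Logs; every ID, ARN, and log group name is a placeholder:

```python
# Minimal sketch (boto3): enable VPC Flow Logs for a VPC and Route 53 Resolver
# query logging, both delivered to CloudWatch Logs. The referenced IAM role and
# log groups are assumed to already exist.
import boto3

ec2 = boto3.client("ec2")
resolver = boto3.client("route53resolver")

# VPC Flow Logs at the VPC level (a subnet or a single ENI can be targeted too).
ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="cloud-watch-logs",
    LogGroupName="/vpc/flow-logs",
    DeliverLogsPermissionArn="arn:aws:iam::111122223333:role/vpc-flow-logs",  # hypothetical
)

# Resolver query logging: create the config, then associate it with the VPC.
query_log_config = resolver.create_resolver_query_log_config(
    Name="vpc-dns-queries",
    DestinationArn="arn:aws:logs:eu-west-1:111122223333:log-group:/vpc/dns-queries",
    CreatorRequestId="vpc-dns-queries-001",  # idempotency token
)
resolver.associate_resolver_query_log_config(
    ResolverQueryLogConfigId=query_log_config["ResolverQueryLogConfig"]["Id"],
    ResourceId="vpc-0123456789abcdef0",
)
```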

Since CloudTrail retains logs for a limited period of time, an Organization Trail (see “Creating a Trail for an Organization”) should be configured to store logs for extended periods, both to meet compliance obligations and for historical analysis. In addition, an Organization Trail prevents Member (child) accounts from disabling CloudTrail and/or modifying the trail itself.

In order to obtain a complete record of events (whether taken by a user, role, or service), each trail should be configured to log events in all AWS Regions. This ensures that all events occurring in an AWS account are logged, regardless of the AWS Region in which they occurred. It also includes logging global service events, which are logged to an AWS Region specific to each service.

Logging in all AWS Regions has the added benefit that, if an AWS Region is added after a trail has been created, that new region is automatically included, and events in that region are logged by default.
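Putting these recommendations together, an Organization Trail covering all regions could be created from the management account roughly as follows (bucket and KMS key identifiers are placeholders; the SSE-KMS aspect is discussed further in the storage section below):

```python
# Minimal sketch (boto3): an organization-wide, all-region trail with log file
# validation and SSE-KMS encryption, created from the Organizations management
# account. All names and ARNs are hypothetical.
import boto3

cloudtrail = boto3.client("cloudtrail")

cloudtrail.create_trail(
    Name="org-security-trail",
    S3BucketName="example-org-cloudtrail-logs",
    IsOrganizationTrail=True,         # applied to every member account, which cannot modify it
    IsMultiRegionTrail=True,          # log events in all regions, including future ones
    IncludeGlobalServiceEvents=True,  # IAM, STS, CloudFront, and other global services
    EnableLogFileValidation=True,
    KmsKeyId="arn:aws:kms:eu-west-1:111122223333:key/00000000-0000-0000-0000-000000000000",
)
cloudtrail.start_logging(Name="org-security-trail")
```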

Delivery

Since the integrity, completeness, and availability of the collected logs are crucial for forensic and auditing purposes, a queueing system like Kinesis should be used to receive and buffer all the logs collected.

Not only will this improve the resiliency of the platform by queueing (rather than discarding) messages in the event of the failure of a downstream component meant to consume logs, but it also decouples log ingestion from log consumption.
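Conceptually, a downstream consumer (such as the Logstash agent introduced in the next section) simply tails the stream. A deliberately simplified boto3 sketch of that loop, with a placeholder stream name and no checkpointing or multi-shard handling, might look like this:

```python
# Minimal sketch (boto3): tail a Kinesis stream the way a downstream log consumer
# would. A production consumer (e.g., Logstash, KCL) iterates over every shard
# and checkpoints its position; this sketch reads a single shard only.
import time

import boto3

kinesis = boto3.client("kinesis")

shards = kinesis.describe_stream(StreamName="security-logs")["StreamDescription"]["Shards"]
iterator = kinesis.get_shard_iterator(
    StreamName="security-logs",
    ShardId=shards[0]["ShardId"],
    ShardIteratorType="TRIM_HORIZON",   # start from the oldest buffered record
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["Data"])           # forward to storage / SIEM here
    iterator = batch.get("NextShardIterator")
    time.sleep(1)                       # avoid hammering the shard when it is idle
```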

Long-Term Storage and Audit Trail

A dedicated and highly restricted AWS account (here named Logging Account) should also be created for each project/customer for long-term (immutable) storage of the logs.

In that account, a Logstash Agent can be used to pull logs directly from Kinesis (by consuming the stream) and to store them in an S3 bucket where they will be treated as immutable files. This can be achieved via S3 Object Lock in Compliance mode (see “Protecting data with Amazon S3 Object Lock”) to ensure that nobody, including the root user of the AWS account, is able to delete the objects during a pre-defined retention period.
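A minimal boto3 sketch of that bucket setup follows (Object Lock has to be enabled when the bucket is created; names, region, and retention period are placeholders):

```python
# Minimal sketch (boto3): create the long-term log bucket with Object Lock enabled
# and apply a default COMPLIANCE-mode retention, so delivered objects cannot be
# deleted or overwritten during the retention period, not even by the root user.
import boto3

s3 = boto3.client("s3")

s3.create_bucket(
    Bucket="example-immutable-logs",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
    ObjectLockEnabledForBucket=True,   # must be set at creation time; also enables versioning
)
s3.put_object_lock_configuration(
    Bucket="example-immutable-logs",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 1}},
    },
)
```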

In addition, a Data Loss Prevention (DLP) solution could be employed to prevent and detect cases of attempted data exfiltration. It should be noted that, to ensure the integrity of the logs stored in these accounts, IAM controls should be put in place to limit access to these S3 buckets.

As an additional measure, log files should be encrypted. Although by default CloudTrail encrypts all log files using S3 server-side encryption (SSE-S3), these files should instead be encrypted with a customer managed AWS Key Management Service key (SSE-KMS) (see “Encrypting CloudTrail Log Files with AWS KMS–Managed Keys (SSE-KMS)”).

To ensure durability of the logs collected in each Logging Account, MFA Delete should be enabled on the S3 bucket where the log files are stored (see “S3 MFA Delete”). MFA Delete ensures that any attempt to change the versioning state of the bucket or permanently delete an object version requires additional authentication. This helps prevent any operation that could compromise the integrity of the log files, even if a malicious user acquires the password of an IAM user that has permissions to permanently delete S3 objects.
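For reference, MFA Delete can only be turned on through the API/CLI with the root user's credentials and MFA device; a minimal boto3 sketch (serial number and token below are placeholders) would be:

```python
# Minimal sketch (boto3): enable MFA Delete on the log bucket. This call must be
# made with the root user's credentials; the MFA serial and code are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_versioning(
    Bucket="example-immutable-logs",
    VersioningConfiguration={"Status": "Enabled", "MFADelete": "Enabled"},
    MFA="arn:aws:iam::111122223333:mfa/root-account-mfa-device 123456",  # "<serial> <code>"
)
```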

In case of a forensic investigation, the CloudTrail Log File Integrity Validation (see “Validating CloudTrail Log File Integrity”) process could be used to validate the integrity of the log files stored in each Logging Account and detect whether the log files were unchanged, modified, or deleted since CloudTrail delivered them.

Monitoring and Alerting

Finally, a centralized AWS account (here called Centralized Monitoring Account) can then be used to aggregate logs collected from the different projects.

In this account, another Logstash Agent will have dedicated subscriptions to pull logs from each Kinesis stream defined in every account and forward them to an Elasticsearch instance used by a Security Operations (SOC) team to monitor and respond to threats in (near) real time.
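In practice this is configured directly in Logstash (Kinesis input, Elasticsearch output). Purely to illustrate the shape of the data flow, here is a Python sketch (assuming the elasticsearch-py 8.x client and a hypothetical endpoint) of indexing a record pulled from Kinesis, as in the consumer sketch above, into a per-day index that Kibana dashboards and alerts can then read:

```python
# Illustration only: in the real setup Logstash's kinesis input and elasticsearch
# output plugins do this. The sketch assumes the elasticsearch-py 8.x client and
# indexes each decoded Kinesis record into a daily index.
import json
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch("https://siem.example.internal:9200")   # hypothetical SIEM endpoint

def index_record(raw: bytes) -> None:
    event = json.loads(raw)                                # Kinesis record payload
    index = f"security-logs-{datetime.now(timezone.utc):%Y.%m.%d}"
    es.index(index=index, document=event)
```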

In conjunction, the machine learning, anomaly detection, and integrated threat intelligence provided by GuardDuty can be leveraged to obtain an out-of-the-box set of alerts with a very good signal-to-noise ratio (i.e., if a GuardDuty alert fires, you should probably take a look and investigate it).


Conclusions

In this blog post, part of the “Continuous Visibility into Ephemeral Cloud Environments” series, I described a possible approach for designing a multi-account security-related logging platform in AWS.

Later posts will also cover a similar setup for both GCP and Kubernetes.

I hope you found this post useful and interesting, and I’m keen to get feedback on it! If you found the information shared useful, if something is missing, or if you have ideas on how to improve it, please let me know on Twitter.