Containers

Introducing CloudWatch Container Insights Prometheus Support with AWS Distro for OpenTelemetry on Amazon ECS and Amazon EKS

You can use CloudWatch Container Insights to monitor, troubleshoot, and alarm on your containerized applications and microservices. Amazon CloudWatch collects, aggregates, and summarizes compute utilization information like CPU, memory, disk, and network data. It also helps you isolate issues and resolve them quickly by providing diagnostic information like container restart failures. Container Insights gives you insights from container management services such as Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Fargate, and standalone Kubernetes.

As members of the Cloud Native Computing Foundation (CNCF)’s OpenTelemetry community, we are working to define an open standard for the collection of distributed traces and metrics. Back in October 2020, we launched a preview of AWS Distro for OpenTelemetry (ADOT). ADOT is a secure and supported distribution of the APIs, libraries, agents, and collectors defined in the OpenTelemetry Specification.

The ADOT collector runs within your environment. For common uses in Amazon EKS and Amazon ECS, we recommend launching the ADOT collector as a:

Note that the ADOT collector can also be launched in other modes, such as a StatefulSet. You can configure the metrics and traces that you want to collect, as well as which AWS services to forward them to. Additionally, you can control the sampling rate (what percentage of the raw data is forwarded and ultimately stored).

With the default configuration, ADOT collects the complete set of metrics defined in our Metrics collected by Container Insights documentation. You can reduce the AWS cost of ingesting Container Insights metrics into CloudWatch by configuring the ADOT collector to stream only specific metrics.
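
If you only need a subset, a minimal sketch of narrowing the exported metrics might look like the following. The metric names and dimensions here are illustrative examples, not the full default set, so check the Container Insights documentation for the exact names your clusters emit.

# Illustrative only: keep a small subset of Container Insights metrics.
exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: "/aws/containerinsights/{ClusterName}/performance"
    metric_declarations:
      - dimensions: [ [ ClusterName, Namespace, PodName ] ]
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization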

Prometheus background

Prometheus is a popular open source monitoring tool that also graduated as a CNCF project and has a large and active community of practitioners. Container Insights automates the discovery and collection of Prometheus metrics from containerized applications. It automatically collects, filters, and creates aggregated custom CloudWatch metrics visualized in dashboards for popular workloads, such as AWS App Mesh, NGINX, Java/JMX, Memcached, and HAProxy. By default, pre-selected services are scraped, pre-aggregated, and automatically enriched with metadata, such as cluster and pod names.

Discovering Prometheus metrics is supported for Amazon ECS, Amazon EKS, and Kubernetes clusters running on Amazon EC2 instances. The Prometheus counter, gauge, and summary metric types are collected. For Amazon ECS and Amazon EKS clusters, both the EC2 and Fargate launch types are supported. Container Insights automatically collects metrics from several workloads, and you can configure it to collect metrics from any workload.

You can adopt Prometheus as an open-source, open-standard way to ingest custom metrics into CloudWatch. CloudWatch Container Insights Prometheus support sends Prometheus metrics to CloudWatch and visualizes them out of the box. Previously, only the CloudWatch Agent could collect Prometheus metrics and convert them to the CloudWatch log/metric format used in the Container Insights performance dashboards. With the latest release of the ADOT Collector (0.11), we offer the same set of features as the CloudWatch Agent, in addition to an improved EMF Exporter and a new ecsobserver extension.

Design and implementation

There are two problems to solve when integrating Prometheus with CloudWatch Container Insights. The first is collecting the metrics from Prometheus endpoints across different AWS container environments. The second is exposing them in the CloudWatch-specific format with the right set of metadata.

Prometheus is a pull-based metrics collection system, so the collector needs to discover Prometheus scrape targets using an API from the underlying container environment. By watching for changes through the container orchestration system's API, the collector can start collecting when new containers come up and stop when old containers are gone. Furthermore, these APIs provide additional metadata, such as service name and container name, which is not exposed by the application itself but is essential for monitoring a large number of services. Without this metadata, it is impossible to filter on different dimensions and run a query like "get heap usage of all Java applications that have feature flag A enabled." For Amazon EKS, Prometheus has built-in discovery support for Kubernetes, and we can use it directly. For Amazon ECS, however, we had to write our own. We had already implemented one in the CloudWatch Agent, so we ported it to the OpenTelemetry Collector as a new extension called ecsobserver.

To support CloudWatch and its auto-generated Container Insights Prometheus performance dashboards, we export metrics with specific dimensions using Prometheus's built-in relabel config and the EMF Exporter's metric_declarations. However, if you don't need all the features of Container Insights and want to reduce cost, you can decrease the number of metrics and dimensions by simply updating the configuration file. You still get all your container metrics even if you remove all the metric_declarations rules: the metrics are always exported as structured JSON logs, and you can analyze them using CloudWatch Logs Insights. If the structured logs still contain too much information, you can reduce them even further by using processors to filter metrics at the collector level. This is different from the CloudWatch Agent, where the knobs for cost reduction are limited and inflexible.
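
As an illustration, a hedged sketch of dropping unneeded series with the collector's filter processor might look like the following; the metric name patterns are hypothetical and not part of the default configuration.

# Illustrative only: exclude metrics before they are exported.
processors:
  filter/reduce-cost:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - jvm_gc_collection_seconds_.*   # drop detailed GC timing series
          - .*_bucket                      # drop histogram buckets if unused

The processor only takes effect once it is referenced in the metrics pipeline's processors list.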

Similar to other AWS-contributed OpenTelemetry components, all the implementation is done in a community-managed repository on GitHub and included in ADOT. We made improvements to the existing CloudWatch EMF Exporter so that it is aware of Prometheus metrics, and we added additional attributes, such as prom_metric_type. The ecsobserver extension is a new component that reuses Prometheus's existing file-based discovery: it exports discovered Amazon ECS targets as a YAML file so that the Prometheus receiver can consume them through file_sd without code changes.

For different environments, we use different pipelines that select different sets of components. Pipelines are defined in the configuration file, like our default config-all.yaml. A pipeline defines the data flow in the OpenTelemetry collector and covers receiving, processing, and exporting metrics and trace data. Each stage can contain multiple components, which may run in series (processors) or in parallel (receivers, exporters). Internally, all the components communicate using OpenTelemetry's unified data models, so components from different vendors can work together. Receivers collect data from source systems and translate it into the internal models. Processors can filter and modify metrics. Exporters convert the data to other schemas and send it to (remote) target systems. Extensions can work with different components and may fit into different stages. You can find more details in the OpenTelemetry Collector Architecture.
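
As a rough sketch (the component names are illustrative; the actual defaults live in config-all.yaml), the service section wires these stages into a metrics pipeline:

# Sketch of how a metrics pipeline is assembled; component names are illustrative.
service:
  extensions: [ecs_observer]        # optional helpers, e.g. ECS discovery (omitted on EKS)
  pipelines:
    metrics:
      receivers: [prometheus]       # scrape targets and translate to internal models
      processors: [resource]        # run in series to filter/modify metrics
      exporters: [awsemf]           # convert to EMF and send to CloudWatch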

In the following diagrams, we illustrate the key components that we defined in the ECS and EKS pipelines to support the CloudWatch Container Insights scenario. The only difference between the two pipelines is the source of the Prometheus metrics: we use different discovery plugins for Amazon EKS and Amazon ECS. In the Amazon ECS pipeline, the ecsobserver extension queries the ECS API and writes the result to a local file for the Prometheus receiver to consume. In the Amazon EKS pipeline, Prometheus's built-in Kubernetes discovery queries the Kubernetes API server and works out of the box. The rest of each pipeline converts the metrics and sends them to CloudWatch using the EMF Exporter.

Amazon ECS pipeline

On Amazon ECS, the collector discovers scrape targets using the ecsobserver extension. It then collects metrics using the Prometheus receiver and sends them to CloudWatch using the EMF Exporter.
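
A minimal sketch of this hand-off is shown below; the extension keys, patterns, and file path are assumptions based on the upstream ecsobserver extension, so verify them against the ADOT ECS templates.

# Assumed keys and paths for illustration; verify against the ADOT ECS templates.
extensions:
  ecs_observer:
    cluster_name: "my-ecs-cluster"            # hypothetical cluster name
    cluster_region: "us-west-2"
    refresh_interval: 60s
    result_file: "/etc/ecs_sd_targets.yaml"   # discovered targets are written here
    task_definitions:
      - job_name: "ecs-prometheus"
        metrics_path: "/metrics"
        arn_pattern: ".*:task-definition/my-app:[0-9]+"   # hypothetical pattern

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "ecs-prometheus"
          file_sd_configs:
            - files: [ "/etc/ecs_sd_targets.yaml" ]       # consume the discovered targets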

Amazon EKS pipeline

On Amazon EKS, the collector discovers scrape targets using Prometheus's built-in Kubernetes plugin, which watches the Kubernetes API. It then collects metrics using the Prometheus receiver and sends them to CloudWatch using the EMF Exporter.

Getting started

For a quick start, run the following commands to deploy the all-in-one otel-container-insights-prometheus.yaml to your existing EKS cluster. You can also view the official documentation for more workloads and step-by-step instructions.

export CLUSTER_NAME=<eks-cluster-name>
export AWS_REGION=<aws-region>
wget https://raw.githubusercontent.com/aws-observability/aws-otel-collector/main/deployment-template/eks/otel-container-insights-prometheus.yaml
sed -e "s/{{region}}/$AWS_REGION/g" -e "s/{{cluster_name}}/$CLUSTER_NAME/g" otel-container-insights-prometheus.yaml | kubectl apply -f -

The preceding commands replace the Region and cluster name placeholders in otel-container-insights-prometheus.yaml and deploy it using kubectl. The single-replica deployment runs the collector with the bundled config-all.yaml that is included in the container when the Docker image is built. To deploy in production, you can use a ConfigMap on Amazon EKS and an SSM parameter on Amazon ECS (see the sketch after the following snippet).

# https://github.com/aws-observability/aws-otel-collector/blob/main/config/eks/prometheus/config-all.yaml
    spec:
      serviceAccountName: aws-otel-collector
      containers:
        - name: aws-otel-collector
          image: amazon/aws-otel-collector:latest
          command: [ "/awscollector" ]
          args: [ "--config", "/etc/eks/prometheus/config-all.yaml" ]

Inside the config, we configure the Prometheus receiver to use the built-in Kubernetes discovery to find pods via receivers > prometheus > config > scrape_configs > kubernetes_sd_configs. There is a long list of relabel_configs for the Prometheus scraper to convert Kubernetes metadata into Prometheus labels. These labels are converted by the Prometheus receiver into OpenTelemetry labels and eventually become CloudWatch dimensions or structured log fields in the EMF Exporter. For example, __meta_kubernetes_namespace is metadata from a Kubernetes pod that is renamed to Namespace in the Prometheus receiver. In the EMF Exporter, it is extracted as a dimension using metric_declarations > dimensions: [ [ ClusterName, Namespace ] ]. In contrast, __meta_kubernetes_pod_name becomes pod_name, but it only shows up in the structured logs because the label name does not match the list of dimensions.

# https://github.com/aws-observability/aws-otel-collector/blob/main/config/eks/prometheus/config-jmx.yaml
receivers:
  prometheus:
    config:
      global:
        scrape_interval: 1m
        scrape_timeout: 10s
      scrape_configs:
        - job_name: 'kubernetes-pod-jmx'
          sample_limit: 10000
          metrics_path: /metrics
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - action: replace
              source_labels:
                - __meta_kubernetes_namespace
              target_label: Namespace 
            - source_labels: [ __meta_kubernetes_pod_name ]
              action: replace
              target_label: pod_name
....

exporters:
  awsemf:
    namespace: ContainerInsights/Prometheus
    log_group_name: "/aws/containerinsights/{ClusterName}/prometheus"
    log_stream_name: "{TaskId}"
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    metric_declarations:
      - dimensions: [ [ ClusterName, Namespace ] ]
        metric_name_selectors:
          - "^jvm_threads_(current|daemon)$"
          - "^jvm_classes_loaded$"
        label_matchers:
          - label_names:
              - service.name
            regex: ^kubernetes-pod-jmx$

After the collector is up and running, you can deploy workloads such as JMX to try out the pre-built dashboards.
After the deployment, you can find the pre-built dashboards in the CloudWatch console under Container Insights > Performance monitoring > EKS/ECS Prometheus. These dashboards are auto-generated for each cluster for the supported workloads, and you can find the full list of supported workloads in the documentation.

Pre-built dashboard in console

CloudWatch console performance dashboard showing Prometheus metrics such as App Mesh requests per second, total heap usage, etc.

Metrics are exported as EMF logs, so you can find them in the corresponding log groups. Some logs may not carry metric metadata, depending on your EMF Exporter metric_declarations configuration. (Note: the log groups for Amazon EKS and Amazon ECS are different; the following image shows an Amazon ECS example.)

For Amazon EKS, the log group is /aws/containerinsights/my-eks-cluster-name/prometheus. The JSON fields under _aws contain the metric metadata extracted from the logs, and they should match the configuration in the collector's EMF Exporter.

In the CloudWatch Logs console, a JSON log entry includes the embedded metric metadata for the metric namespace and dimensions.
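
For illustration, an EMF log entry for one of the JMX metrics configured earlier might look roughly like the following; the values and non-standard field names are hypothetical, while the _aws keys follow the embedded metric format.

{
  "_aws": {
    "Timestamp": 1613000000000,
    "CloudWatchMetrics": [
      {
        "Namespace": "ContainerInsights/Prometheus",
        "Dimensions": [ [ "ClusterName", "Namespace" ] ],
        "Metrics": [ { "Name": "jvm_threads_current" } ]
      }
    ]
  },
  "ClusterName": "my-eks-cluster-name",
  "Namespace": "default",
  "pod_name": "my-java-app-5d9c7b6f4-abcde",
  "prom_metric_type": "gauge",
  "jvm_threads_current": 42
}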

Conclusion

In this blog post, we explained how you can use ADOT to monitor Prometheus metrics from Amazon ECS and Amazon EKS in CloudWatch Container Insights. The preview of the ADOT support for Prometheus metrics is now available, and you can start using it today. To learn more about AWS observability functionality in Amazon CloudWatch and AWS X-Ray, watch our One Observability Demo workshop.

We will be tracking the upstream repository and plan to release a fresh version of the toolkit monthly. We are always working to improve the usability of the components mentioned in this blog; some examples include cluster name auto-discovery and improving the performance of the observer creator framework. This is an open source project, and we welcome your pull requests and community contributions. If you need feedback or a review for AWS-related components, feel free to tag us on GitHub PRs and issues. If you have questions, such as getting confused by (too many) examples, you can also open issues in the ADOT repository.

About the authors

  • Pinglei Guo is an SDE for Amazon CloudWatch. He contributes to the OpenTelemetry collector and the CloudWatch Agent. During his free time, he watches anime and plays video games (e.g., Monster Hunter) with his friends.
  • Mengyi Zhou is an SDE for Amazon CloudWatch. She contributes to observability and monitoring projects in the OpenTelemetry community. She is also keen on Kubernetes and building large-scale PaaS platforms for containerized applications.
  • Javier Martin is a Senior Product Manager for Amazon CloudWatch based out of Seattle. Javier loves building products in AWS that help customers monitor their systems and applications.