AWS Glue is a service that helps you discover, combine, enrich, and transform data so that it can be understood by other applications. Though it’s marketed as a single service, Glue is actually a suite of tools and features that together form an end-to-end data integration solution.

Glue can help you extract data from multiple sources, merge, reformat, normalize, and filter it, and then load the result into other systems. This process is commonly known as “Extract, Transform, Load” (ETL).

AWS Glue is serverless, so there is no need to provision long-running infrastructure. It’s (mostly) billed by usage, so you only pay for resources while your jobs are actively running. Workflows can be automated, running on a schedule or responding to events. In many cases, you can create workflows without writing any code.

What is ETL?

ETL is the process of combining, reformatting, enriching, filtering, and otherwise transforming data in order to improve its suitability for other uses. 

This usually means converting it to another format, adding or removing columns, normalizing schemas and column types, or combining it with other data. Once the data is transformed, it can be fed to other applications. 

Typically, data is sent to analytics or business intelligence (BI) platforms. But the possibilities are endless: ad-hoc querying in Athena, storage in data lakes, streaming to another application, or any other useful purpose. This end-to-end process is often called a data pipeline.

ETL has existed, as a concept, since the 1970s. But at that time, governments and large corporations were the primary users of computers and databases. As such, ETL processes were highly bespoke, time-consuming, and probably not very fun. 

But in the 1990s, three concurrent trends made ETL much more important: 

  1. The rise of data warehousing, 
  2. The proliferation of BI tools, and 
  3. The insanely swift adoption of personal computers by nearly every business and household. 

In this newly tech-centric environment, having a handle on your data could give you a serious competitive edge. So, it suddenly became very important to quickly synthesize all kinds of business data and feed it to systems that aid in decision making.

Over the course of the last 20 years, the volume and complexity of data have increased even further. Modern businesses are ingesting data from a dizzying array of sources: mobile apps, IoT devices, websites, watches, robotic dogs, what have you—and the number and scale of these sources continue to grow each day.

Several tools have appeared to help handle various parts of the data pipeline, and many of them enjoy wide adoption and relative maturity. But despite that, ETL is still a fairly specialized skill set, and many companies now employ a data engineer—someone who is impressively/creepily adept at building pipelines and understanding data flows. 

That is to say: ETL is important and complex enough to demand a lot of attention and resources, and it’s not going away any time soon.

A short history of AWS Glue

Since the earliest days of AWS, customers have been building their own ETL pipelines. 

In the mid-to-late-2000s, the core of these pipelines (i.e., the “transform” part) was often a MapReduce cluster: software that breaks data processing jobs into smaller tasks, distributes them to multiple workers, and then recombines the results—perfect for ETL transforms! 

In 2009, AWS released this technology as a service, calling it (of course) Elastic MapReduce (EMR).

EMR made many ETL tasks simpler. But clusters still had to be manually launched, configured, and integrated into the rest of the data pipeline. 

So, at the end of 2012, AWS launched Data Pipeline, a service designed to help automate parts of the most common ETL use cases: running EMR tasks, loading data, running queries, saving results, and allowing you to schedule all of these tasks.

It was a noble effort, but Data Pipeline honestly wasn’t great. It did help eliminate some repetitive or manual tasks. But the interface was terrible, troubleshooting was difficult, and it wasn’t really a self-contained service so much as an orchestrator for other services. 

Using it felt like two steps above running a long, uncommented shell script. And it had a lot of quirks. (That’s putting it charitably; check out the short but consequential list of limitations in the developer guide.)

AWS rarely advertises one service as a “successor” to another. But AWS Glue is—in practice, if not by design—a wholesale replacement for Data Pipeline. Launched in 2017, Glue is like a next-generation version of Data Pipeline: a truly self-contained set of tools that help automate the most common aspects of ETL. Out of the box, it’s ready to handle dozens of common use cases, and most can be set up with little to no coding.

While work on Data Pipeline has dried up, Glue has been consistently improved since its launch. And with the recent addition of features like Glue Studio and Elastic Views, it seems like AWS is committed to Glue as the core of its ETL offerings.

Really, Glue is an attempt to democratize ETL. What used to be a highly technical process is now available to anybody who understands the basic concepts of ETL and can use a GUI.

Pros and cons of AWS Glue

You can set up ETL pipelines in a lot of different ways in AWS. But Glue is a solid choice for a few key reasons:

  1. Glue is serverless, so you don’t have to manage resources. The tradeoff here is that you have less control over the resources your jobs are running on. But for many use cases, that’s not a primary concern. And since Glue is billed by usage, it’s often cheaper than long-running solutions like EMR.
  2. Glue is easy to use and quick to set up. Workflows are managed via a wizard-style interface, and it’s relatively straightforward to set up a lot of common transforms. The recently released Glue Studio makes this even simpler by providing a GUI for job creation.
  3. You don’t need to write code, but you can! Glue automatically generates code for many common use cases, so it’s possible to create a Glue Job without knowing how to write Spark scripts. But if for some reason you’re interested in writing transforms from scratch, you can totally do that.
  4. Glue integrates with a variety of AWS services as source or destination endpoints. And Glue catalogs can be used as the source for things like Athena tables, making it very easy to expose data for ad-hoc querying. If you’re using AWS services as both the sources and destinations for your pipelines, you’ll likely be up and running very quickly.
  5. Glue is good at inferring data schemas. And for common formats and flat data structures, you won’t have to explicitly define any schemas. Glue also detects schema changes over time and gives you some basic options for reacting to those changes in catalogs.

Having said all that, there are some drawbacks and limitations that you should consider, particularly if you’re a data engineer looking to migrate existing workloads:

  1. You can’t really control the compute resources. Glue offers only three instance types, geared toward general purpose, memory-intensive, and machine learning tasks respectively. There aren’t many knobs to turn, and if your jobs require very specific compute profiles, you may not be happy with the options.
  2. Glue runs Spark under the hood and only accepts Python or Scala scripts. So if you’ve got a bunch of existing scripts written for another platform or in another language, it may be a pain in the butt to port them to Glue.
  3. While you can include Python modules as part of your Glue scripts, you can’t extend Spark itself (as far as I know). If you’re moving over from a self-managed, customized Spark cluster, this may be a problem for you.
  4. Like all AWS services, Glue has limits on its various elements (though these can be increased by request). 
  5. While Glue is good at detecting schemas, it’s not great at it. For complex and/or nested data structures, you may find that you need to write a custom classifier, which currently isn’t the simplest experience. And testing classifiers can be a real drag on your setup time.

Components of AWS Glue

Glue is actually a suite of tools, each solving a common data integration problem. AWS is constantly adding functionality and launching Glue-related tools (see “The Future of Glue” below). But there are some core components that are important to understand.

Crawlers and classifiers

Crawlers are the “data discovery” portion of the Glue service. They scan your source locations for new data on a schedule that you set. When they discover new data, they infer its type and schema and import it into a data catalog. In many cases, crawlers will also infer data partitions, i.e., the main ways your data is classified or categorized. This is most often by date but can be based on any property of the data.
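
Crawlers can also be created and run programmatically. Here’s a minimal sketch using boto3; the crawler name, IAM role, catalog database, S3 path, and schedule are all illustrative placeholders:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that scans an S3 prefix every night and writes what it
# finds into a Glue catalog database.
glue.create_crawler(
    Name="my-source-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="my_catalog_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/source/"}]},
    Schedule="cron(0 2 * * ? *)",  # daily at 02:00 UTC
)

# Or kick off an on-demand run.
glue.start_crawler(Name="my-source-crawler")
```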

Catalogs

A Glue catalog is a database containing metadata about your source data. The catalog itself doesn’t contain any of your actual source data but instead stores the location, age, type, and schema of all the source objects (e.g., a file stored in S3). 

Some AWS services can use your Glue catalog to better understand your data and possibly even load it directly. For example, a Glue catalog can be a source for an Amazon Athena table, giving Athena all the information it needs to load your data directly from S3 at runtime. This is a common (and handy!) way to make S3 data directly queryable.
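
As a sketch of that pattern (with hypothetical database, table, and bucket names), here’s how you might query a crawler-built catalog table from Athena using boto3. Athena pulls the schema and S3 location from the Glue catalog at query time:

```python
import boto3

athena = boto3.client("athena")

# The database and table are assumed to have been created by a Glue crawler;
# the output location is where Athena writes query results.
athena.start_query_execution(
    QueryString="SELECT action, COUNT(*) AS requests "
                "FROM waf_logs GROUP BY action",
    QueryExecutionContext={"Database": "my_catalog_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```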

Jobs

Jobs transform data based on scripts that Glue generates, or that you provide—they’re the business logic of your pipeline. Defined as Spark, Spark Streaming, or vanilla Python scripts, jobs allow you to add logic for analyzing, formatting, deduplicating, enriching, or otherwise transforming your data. 
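
To make that concrete, here’s a stripped-down sketch of what a Glue PySpark job script typically looks like; the database, table, column names, and output path are illustrative placeholders, and the code Glue generates for you follows roughly this shape:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate: set up the Spark/Glue contexts and the job.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from a table that a crawler has already cataloged.
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_catalog_db", table_name="my_table")

# Rename a column and cast another to a timestamp.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("old_name", "string", "new_name", "string"),
        ("timestamp", "string", "event_time", "timestamp"),
    ])

# Write the result back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/target/"},
    format="parquet")

job.commit()
```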

Glue offers a large catalog of predefined scripts for common data transformation tasks, along with a long list of predefined transforms. In some circumstances, jobs allow you to define machine learning transforms, which can help identify related records and automatically remove duplicates. 

AWS sometimes tacks questionable ML features onto services. But this actually seems like a useful feature!

Development endpoints

Development endpoints are an easy way to test your scripts and explore your data without having to deploy anything into a live Glue workflow. If you’re familiar with the concept of “notebooks,” a development endpoint essentially provides notebook functionality. 

You can either set up a Zeppelin server to connect to or create a SageMaker notebook right in the Glue console.
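
If you’d rather script it, dev endpoints can also be provisioned through boto3. This is a rough sketch with placeholder names and sizes; keep in mind that an endpoint keeps billing until you delete it:

```python
import boto3

glue = boto3.client("glue")

# Provision a small dev endpoint to experiment against.
glue.create_dev_endpoint(
    EndpointName="my-dev-endpoint",
    RoleArn="arn:aws:iam::123456789012:role/MyGlueDevRole",
    NumberOfNodes=2,  # DPUs allocated to the endpoint
)

# Tear it down when you're done, since it bills for as long as it runs.
glue.delete_dev_endpoint(EndpointName="my-dev-endpoint")
```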

AWS Glue pricing

The various components of AWS Glue are priced independently. But generally speaking, you’re billed by usage down to the second.

Jobs and crawlers are billed by the “DPU hour,” which equates to an hour of computing using 4 vCPU and 16 GB of memory. Though DPUs are priced by the hour, they’re billed by the second, with a one-minute minimum.
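
To put rough numbers on that (assuming the us-east-1 rate of about $0.44 per DPU-hour at the time of writing; check the pricing page for your region): a job that runs for 15 minutes on 10 DPUs consumes 10 × 0.25 = 2.5 DPU-hours, or roughly $1.10.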

Glue Catalogs are priced by the number of objects cataloged and the number of requests to the catalog (the first million objects and requests are free per month). Though both metrics are billed at the seemingly-generous “per million,” this is one of many places in AWS where the tradeoff between size of files and number of files can yield a non-trivial difference in your bill—if you deal in very large volumes of data, that is.

Dev Endpoints are currently the only feature of Glue that is long-running (i.e., it runs until you turn it off) and billed by elapsed time. Much like jobs and crawlers, endpoints are billed by DPU with per-second granularity. But there is a 10-minute minimum for each provisioned endpoint.

Note that Glue pricing does not include the cost of your source or output data stores. So, if your source data is in S3, and your target data store is RDS, you’ll pay separately for each of those.

Getting started

AWS provides a high-level guide for getting started with Glue, but it’s light on the specifics of actually creating jobs.

I’m sure you’re a responsible developer, and you always Read The Freaking Manual™. But, if you’re feeling loose, queue up “Welcome to the Jungle” in your headphones, and follow along for a quick tour of the basics of creating a transform job in Glue. 

We’ll use Glue Studio, which is a slick new GUI for setting up jobs (instead of the standard “wizard” format—though both are relatively easy to follow).

I’ll go ahead and set up a very simple job that converts WAF logs from JSON to Parquet, which is a columnar format that many databases and analytics platforms prefer. This is a very basic transform example, but the overall process is similar for most structured or semi-structured data.

In the “Jobs” section of the Glue Studio console, we can see a “Create job” panel containing two dropdowns: one for source and one for target. Possible sources include RDS, Redshift, Kinesis, and Kafka, but I’ll stick with S3 for both source and destination.

After clicking “create”, I’m sent right to the GUI. I’ll click on the “data source” node, and add the S3 URL for my source data (conveniently located in a directory named “source”). WAF logs are in JSON format, but Glue is pretty good at understanding common structured formats. Clicking the “infer schema” button ensures that Glue can read and understand the data that we want to transform.

Next, I’ll click the “transform” node. You can see that Glue has inferred the schema of our WAF logs and created a data mapping. We’re doing an “Apply mapping” transform, so here we have the ability to update key names, change data types, or drop columns in the destination file. 

Glue provides many other types of transformations out of the box, but “Apply mapping” is common and straightforward. Also note that “custom transformations” are possible if you want to write your own transform scripts.

After clicking the “data target” node, we can see a few basic options—one of which is the output format. I’ll choose “parquet” and I’ll add the S3 URL where I’d like the transformed data to be sent.

One more important detail remains: I have to assign an IAM role for the job. The role should have permissions for accessing both the source and destination data locations. This setting lives in the “Job details” tab, alongside various options for Glue engine version, instance type, and timeouts, among other things.
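
As a rough sketch (bucket names and prefixes are placeholders), the S3 side of that role boils down to read access on the source prefix and write access on the destination prefix, on top of the AWSGlueServiceRole managed policy that Glue itself requires:

```python
# Illustrative inline policy for the job's IAM role. The role should also
# have the AWS managed policy
# arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole attached.
job_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:ListBucket"],
         "Resource": ["arn:aws:s3:::my-bucket"]},
        {"Effect": "Allow",
         "Action": ["s3:GetObject"],
         "Resource": ["arn:aws:s3:::my-bucket/source/*"]},
        {"Effect": "Allow",
         "Action": ["s3:PutObject"],
         "Resource": ["arn:aws:s3:::my-bucket/target/*"]},
    ],
}
```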

And that’s basically it! I can manually hit the “Run” button, and the job will run one time. Or I can create schedules in the “Schedules” tab.
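
Both of those actions are available outside the console, too. Here’s a quick boto3 sketch for a one-off run and a scheduled trigger; the job name and cron expression are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Kick off a single run of the job created in Glue Studio.
glue.start_job_run(JobName="waf-json-to-parquet")

# Or schedule it: this trigger runs the job every day at 03:00 UTC.
glue.create_trigger(
    Name="waf-json-to-parquet-nightly",
    Type="SCHEDULED",
    Schedule="cron(0 3 * * ? *)",
    Actions=[{"JobName": "waf-json-to-parquet"}],
    StartOnCreation=True,
)
```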

This is obviously a simple and contrived example, and many things can complicate the process. As always, your mileage may vary. 

But if your source and destination locations are in AWS, many common ETL flows are only slightly more complicated than the example we just walked through.

The future of Glue

One of the more frustrating aspects of the AWS experience is that AWS doesn’t do a good job of positioning its products within its own ecosystem. So, sometimes it’s not clear which product you should use for what purpose or use case. There is still a lot of overlap between Glue and various other AWS products: Data Pipeline still exists, Amazon SageMaker Data Wrangler can perform various ETL steps for preparing data, and managed EMR is still a solid core component for folks who want to build their own ETL pipelines.

That said, AWS seems pretty invested in Glue as their core suite of ETL components. It’s less a single product than an expanding suite of related products and features. Above, we looked at the newly-released Glue Studio (which, I predict, will eventually replace the existing Glue interface), and there are even more recent additions to the Glue-branded family of services.

The confusingly named Glue DataBrew is not a feature of Glue, but a standalone tool aimed at helping data analysts and data scientists prepare data for machine learning and analytics platforms. It’s possible that this will be incorporated into the core Glue experience in the future. But it’s also possible that AWS is building completely separate experiences for non-technical users and/or certain analytics use cases.

Elastic Views, currently in preview, lets you use SQL to define materialized views that are propagated to multiple data stores. This could be handy, for example, to create rollup data that will be directly accessible to your applications and visualization layers. Really, this could replace a lot of common ETL tasks that can be done by queries alone.

Again, it seems like Glue is here to stay, and that AWS is heavily invested in making it better. Add it all up, and the future of Glue looks bright.

Suffice it to say the AWS ecosystem is complex, and nobody knows all of it. If you have questions about AWS services, we can help. Drop us a line and let us know what’s on your mind.