Up and Running with AWS Glue

AWS Glue is a managed service that simplifies ETL work. In this blog I’m going to cover creating a crawler, creating an ETL job, and setting up a development endpoint. Since Glue is managed, you will likely spend the majority of your time working on your ETL script. If this blog helps you get to a point where you can focus on your script, then I have accomplished my goal.

Before I get started, I want to say that the official documentation is great; most of what I know comes from it. I also feel I should add a note about cost: AWS Glue is not free! You can find details about how pricing works here. Time to get started.

First, you need a place to store the data. In this example you are going to use S3 for both the source and the target. Make an S3 bucket with whatever name you’d like and add a source and a target folder in the bucket. You’ll end up with something like this:

s3://name-of-my-bucket/source
s3://name-of-my-bucket/target

The data being used is a pipe-delimited CSV and is shown below. Copy it into a file named test-data.txt and upload that file to the source folder in S3 (if you’d rather script the upload, see the sketch just after the sample data). For this example I chose data about dogs. Imagine an organization that fosters dogs until they can be adopted. Maybe the data about all the dogs that need to be adopted is made available via a file like this, only with more entries.

Breed|Sex|Weight|Age|DateAdded
Boston Terrier|M|20|5|2019-06-20
Beagle|M|24|3|2019-06-22
English Bulldog|F|60|7|2019-06-23
Basset Hound|F|35|4|2019-06-19
Boxer|M|70|6|2019-06-17
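
If you would rather script the upload than click through the S3 console, a minimal boto3 sketch could look like the following. The bucket name is the placeholder from the example paths above, and it assumes your AWS credentials are already configured.

import boto3

s3 = boto3.client("s3")

# "name-of-my-bucket" is a placeholder; S3 "folders" are just key prefixes.
s3.upload_file("test-data.txt", "name-of-my-bucket", "source/test-data.txt")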

Now that the data is in S3 it’s time to head to the Glue console. First, you need to create a database. Do this by selecting Databases under Data catalog. Click Add database, give it a name, then click Create. This is not a database in the usual sense of the word; if you delete this database, your data will still exist. This database, and the tables you’ll create in it later, store metadata about the data you have in S3.
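
For reference, the same database can also be created from code with boto3. This is just a sketch; the database name is a placeholder used throughout the rest of the walkthrough.

import boto3

glue = boto3.client("glue")

# Creates an empty Data Catalog database; "name-of-my-db" is a placeholder name.
glue.create_database(DatabaseInput={"Name": "name-of-my-db"})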

Crawler

Under Data catalog click Crawlers, then click Add crawler. You will arrive at the first screen of a wizard that guides you through making a crawler. Give it a name, then click Next. Under Crawler source type select Data stores, then click Next. Under Choose a data store select S3 (that should be the default), then under Include path add the path to the source folder you created earlier. Click Next. Select No when asked to add another data store, then click Next. The next screen is where you add an IAM role to give Glue access to your S3 bucket, among other things. Select Create an IAM role, give it a name, then click Next. Frequency should be set to Run on demand; click Next. Under Database select the database that you created earlier, then click Next. Review, then click Finish.
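
If you prefer to define the crawler from code instead of the wizard, a rough boto3 equivalent might look like the sketch below. The crawler name, role name, database name, and path are placeholders matching the walkthrough.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="dog-data-crawler",            # placeholder crawler name
    Role="AWSGlueServiceRole-dogs",     # placeholder for the IAM role created in the wizard
    DatabaseName="name-of-my-db",       # the Data Catalog database created earlier
    Targets={"S3Targets": [{"Path": "s3://name-of-my-bucket/source"}]},
)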

Make sure the managed policy created for the IAM role gives Glue permission to read from and write to the S3 bucket and its folders. There is more than one way to write a policy that accomplishes this, and below is one such example.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::name-of-my-bucket/source*",
                "arn:aws:s3:::name-of-my-bucket/target*"
            ]
        }
    ]
}

Time to run the crawler. Do so by selecting the checkbox next to it, then clicking Run crawler. Once the crawler has completed, click Tables under Data catalog. You should see a new table created by the crawler. Click the table name (not the checkbox) to view information about it or make edits. You now have data in S3 and a database with a table that represents that data.
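
You can also start the crawler and verify the resulting table from code. Here is a boto3 sketch using the same placeholder names as above.

import time
import boto3

glue = boto3.client("glue")

# Start the crawler and wait for it to return to the READY state.
glue.start_crawler(Name="dog-data-crawler")
while glue.get_crawler(Name="dog-data-crawler")["Crawler"]["State"] != "READY":
    time.sleep(15)

# List the tables the crawler created in the database.
for table in glue.get_tables(DatabaseName="name-of-my-db")["TableList"]:
    print(table["Name"])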

Job

To create an ETL job, click Jobs under ETL, then click Add job. Give your job a name, select the role you created earlier under IAM role, fill out a name under Script file name (it should default to the job name), leave the rest as the defaults, then click Next. Select the table that was created by the Glue crawler, then click Next. For your target, select Create tables in your data target. Under Data store select Amazon S3, under Format select JSON, under Target path add the path to the target folder, then click Next. Leave the mapping as is, then click Save job and edit script.

Using that information, Glue generates a script that will read your CSV file, transform the data to JSON, and write it to the target folder. You can run your job from this screen by clicking Run job in the upper left. Click that, then click the X in the upper right corner to exit the script editor. Back in the Glue console, select the checkbox next to your job name. Once selected, you can view the job’s current status in the History tab; it should say Running. This could take up to 10 minutes to complete. Hopefully, when the job has finished running you see the word Success under Run status. If it says Failed, it should also give a little information about what happened under the Error column.
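
If you want to start a run and check its status without the console, a small boto3 sketch could look like this. The job name is a placeholder for whatever you named yours.

import boto3

glue = boto3.client("glue")

# Start a run of the job; "dog-etl-job" is a placeholder job name.
run = glue.start_job_run(JobName="dog-etl-job")

# JobRunState moves from RUNNING to SUCCEEDED or FAILED.
status = glue.get_job_run(JobName="dog-etl-job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])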

Once your job has completed successfully, head over to the target folder in S3 and you should see the new file. This is good progress, but you will likely want to do more with your data. The Glue console is a great place to start creating your jobs, but what about when you need to manually modify your script and test it? Click the checkbox next to your Glue job, then go to the Script tab. You can click Edit script to get back to the script editor and update your script that way. This works, but there is one really big downside: you need to run your job every time you want to test your changes. You can avoid this by setting up a Dev endpoint.

Zeppelin

Before you do anything in the Glue console, you need to do a little setup on your machine. First, download Apache Zeppelin; make sure you download version 0.7.3. Once you have Zeppelin installed, start it. On my Mac, the command (run from the Zeppelin directory) looks like the one below; it will be different depending on your operating system.

./bin/zeppelin-daemon.sh start

Go to localhost:8080 and you should see the Zeppelin welcome screen. In the top right you’ll see a dropdown labeled anonymous; click it and select Interpreter. Scroll down to the spark section and click edit. First, check Connect to existing process. Second, set host to localhost and port to 9007. Third, under Properties set master to yarn-client. Finally, if spark.executor.memory or spark.driver.memory exist in the Properties section, delete them. Click Save.

You are going to need SSH keys to connect your local instance of Zeppelin to your Glue Dev endpoint. I used ssh-keygen to make mine, but there are other ways to create them. Look here for instructions on how to use ssh-keygen. However you go about creating your keys, in the end you should have a public key, a private key, and a passphrase that you’ll use shortly.

If you don’t want to install Zeppelin on your own machine you can use an EC2 instance. The instructions to do it this way are here.

Dev endpoint

Head back to the Glue console and select Dev endpoints under ETL. Click Add endpoint. Give it a name and, under IAM role, select the IAM role that you created earlier. Dev endpoints default to 5 DPUs. You can change this by expanding the Security configuration, script libraries, and job parameters (optional) section and filling out a new value under Data processing units (DPUs). For this example, use the minimum of 2 DPUs, then click Next. Choose Skip networking information, then click Next. Under Public key contents, add the public key you created earlier, then click Next. Review your choices, then click Finish.
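
For completeness, the same Dev endpoint can be created with boto3. This is only a sketch; the endpoint name, role ARN, and key file name are placeholders.

import boto3

glue = boto3.client("glue")

glue.create_dev_endpoint(
    EndpointName="dog-dev-endpoint",                                    # placeholder endpoint name
    RoleArn="arn:aws:iam::123456789012:role/AWSGlueServiceRole-dogs",   # placeholder role ARN
    PublicKey=open("glue-dev-key.pub").read(),                          # the public key created earlier
    NumberOfNodes=2,                                                    # DPU count
)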

When your Dev endpoint is ready, click its name (not the checkbox). Here you will see a lot of useful information. The piece you want is the value of SSH tunnel to remote interpreter, which should look something like this.

ssh -i <private-key.pem> -vnNT -L :9007:169.254.76.1:9007 glue@ec2-##-##-###-##.us-east-2.compute.amazonaws.com

Copy this command and run it after changing <private-key.pem> to the location of the private key you made earlier. After running the command you will be asked if you want to continue; answer yes. Then enter the passphrase you created to go with your key pair. You should now be ready to work on your Spark script in Zeppelin. Go back to Zeppelin and, in the upper left, click the Notebook dropdown, then select + Create new note. Give it a name and make sure Default Interpreter is set to spark. Go back to the Glue console and copy the Spark script from there into the new note in Zeppelin. After removing the comments, it should look similar to what is below.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
 
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
 
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
 
datasource0 = glueContext.create_dynamic_frame.from_catalog(database="name-of-my-db", table_name="source",
                                                            transformation_ctx="datasource0")
 
applymapping1 = ApplyMapping.apply(frame=datasource0,
                                   mappings=[("breed", "string", "breed", "string"), ("sex", "string", "sex", "string"),
                                             ("weight", "long", "weight", "long"), ("age", "long", "age", "long"),
                                             ("dateadded", "string", "dateadded", "string")],
                                   transformation_ctx="applymapping1")
 
datasink2 = glueContext.write_dynamic_frame.from_options(frame=applymapping1, connection_type="s3",
                                                         connection_options={"path": "s3://name-of-my-bucket/target"},
                                                         format="json", transformation_ctx="datasink2")
job.commit()

If you try to run your script by pressing the play button in the upper right corner, it will finish with an error. Unfortunately, the script won’t work exactly as it is copied from the Glue editor. You need to make a few adjustments for it to work in Zeppelin. Here is a quick summary of the changes: add %pyspark to the top of the file, remove all the code associated with a Glue Job, and create the GlueContext differently.

Once you have tested your script and are satisfied that it works, you will need to add those pieces back before uploading your changes. Remember, this script is stored in an S3 bucket; you can upload it to that bucket directly or copy and paste your changes into the editor in the Glue console and save. With the adjustments for Zeppelin, the script should look similar to what you see below.

%pyspark
 
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
 
glueContext = GlueContext(SparkContext.getOrCreate())
 
datasource0 = glueContext.create_dynamic_frame.from_catalog(database="name-of-my-db", table_name="source",
                                                            transformation_ctx="datasource0")
 
applymapping1 = ApplyMapping.apply(frame=datasource0,
                                   mappings=[("breed", "string", "breed", "string"), ("sex", "string", "sex", "string"),
                                             ("weight", "long", "weight", "long"), ("age", "long", "age", "long"),
                                             ("dateadded", "string", "dateadded", "string")],
                                   transformation_ctx="applymapping1")
 
datasink2 = glueContext.write_dynamic_frame.from_options(frame=applymapping1, connection_type="s3",
                                                         connection_options={"path": "s3://name-of-my-bucket/target"},
                                                         format="json", transformation_ctx="datasink2")

Click play and it should run without error. Check the target folder in S3 to verify. Now that everything is set up, you can finally start working on the script. Change it a little so that it keeps only dogs that weigh less than 50 pounds. Notice the ApplyMapping.apply method. It returns a DynamicFrame, but you need a PySpark DataFrame to do the filtering. Luckily, it’s easy to convert back and forth between a DynamicFrame and a DataFrame. Here is what the new script looks like with the filtering added.

%pyspark
 
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
 
glueContext = GlueContext(SparkContext.getOrCreate())
 
datasource0 = glueContext.create_dynamic_frame.from_catalog(database="name-of-my-db", table_name="source",
                                                            transformation_ctx="datasource0")
 
applymapping1 = ApplyMapping.apply(frame=datasource0,
                                   mappings=[("breed", "string", "breed", "string"), ("sex", "string", "sex", "string"),
                                             ("weight", "long", "weight", "long"), ("age", "long", "age", "long"),
                                             ("dateadded", "string", "dateadded", "string")],
                                   transformation_ctx="applymapping1")
 
dataframe = applymapping1.toDF()
dataframe = dataframe.filter(dataframe["weight"] < 50)
 
dataframe.write.mode("append").json("s3://name-of-my-bucket/target")

Notice how you’re able to write back to S3 using just the DataFrame. You could also convert it back to a DynamicFrame and then write it to S3 as in the previous version. PySpark can do so much more than I have shown here. For more information on the PySpark SQL module, check here.
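
If you do want to stay with a DynamicFrame for the write, a sketch of that variant might look like the paragraph below, reusing the glueContext and dataframe variables from the script above.

%pyspark
 
from awsglue.dynamicframe import DynamicFrame
 
# Convert the filtered DataFrame back to a DynamicFrame, then write it with the same options as before.
filtered = DynamicFrame.fromDF(dataframe, glueContext, "filtered")
glueContext.write_dynamic_frame.from_options(frame=filtered, connection_type="s3",
                                             connection_options={"path": "s3://name-of-my-bucket/target"},
                                             format="json")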

Those are some of the basics to get up and running with AWS Glue. Obviously, there is a lot more to learn. For example, what if you want to use an RDS instance as a data store? One of the most important things for me when I first started with Glue was being able to work on my Spark script like I would any other piece of code, and hopefully this blog helps you get to that point quickly. Until next time.
