Saturday, July 13, 2024

End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue

Data is a key enabler for your business. Many AWS customers have integrated their data across multiple data sources using AWS Glue, a serverless data integration service, in order to make data-driven business decisions. To grow the power of data at scale for the long term, it's highly recommended to design an end-to-end development lifecycle for your data integration pipelines. The following are common asks from our customers:

  • Is it possible to develop and test AWS Glue data integration jobs on my local laptop?
  • Are there recommended approaches to provisioning components for data integration?
  • How can we build a continuous integration and continuous delivery (CI/CD) pipeline for our data integration pipeline?
  • What's the best practice to move from a pre-production environment to production?

To tackle these asks, this post defines the development lifecycle for data integration and demonstrates how software engineers and data engineers can design an end-to-end development lifecycle using AWS Glue, including development, testing, and CI/CD, using a sample baseline template.

End-to-end development lifecycle for a data integration pipeline

Today, it's common to define not only data integration jobs but also all the data components in code. This means that you can rely on standard software best practices to build your data integration pipeline. The software development lifecycle on AWS defines the following six phases: Plan, Design, Implement, Test, Deploy, and Maintain.

In this section, we discuss each phase in the context of the data integration pipeline.


Plan

In the planning phase, developers collect requirements from stakeholders such as end-users to define a data requirement. This could be what the use cases are (for example, ad hoc queries, dashboards, or troubleshooting), how much data to process (for example, 1 TB per day), what kinds of data, how many different data sources to pull from, how much data latency to accept to make it queryable (for example, 15 minutes), and so on.
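As a quick sanity check during planning, rough arithmetic over these requirement numbers helps validate the design early. The following sketch uses the illustrative 1 TB per day and 15-minute latency figures above to estimate the per-batch volume a pipeline would need to handle:

```python
# Rough capacity math for the planning example above:
# 1 TB of new data per day, queryable within 15 minutes.
TB = 1024 ** 4  # bytes per tebibyte
GB = 1024 ** 3  # bytes per gibibyte

daily_volume_bytes = 1 * TB
batch_interval_min = 15
batches_per_day = 24 * 60 // batch_interval_min  # 96 micro-batches per day

bytes_per_batch = daily_volume_bytes / batches_per_day
print(f"{batches_per_day} batches/day, ~{bytes_per_batch / GB:.1f} GiB per batch")
# → 96 batches/day, ~10.7 GiB per batch
```

Each micro-batch therefore needs to land and process roughly 10.7 GiB within its 15-minute window, a number that feeds directly into service and sizing decisions in the design phase.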


Design

In the design phase, you analyze requirements and identify the best solution to build the data integration pipeline. In AWS, you need to choose the right services to achieve the goal and come up with the architecture by integrating those services and defining dependencies between components. For example, you may choose AWS Glue jobs as a core component for loading data from different sources, including Amazon Simple Storage Service (Amazon S3), then integrating them and preprocessing and enriching data. Then you may want to chain multiple AWS Glue jobs and orchestrate them. Finally, you may want to use Amazon Athena and Amazon QuickSight to present the enriched data to end-users.


Implement

In the implementation phase, data engineers code the data integration pipeline. They analyze the requirements to identify coding tasks to achieve the final outcome. The code includes the following:

  • AWS resource definition
  • Data integration logic

When using AWS Glue, you can define the data integration logic in a job script, which can be written in Python or Scala. You can use your preferred IDE to implement the AWS resource definition using the AWS Cloud Development Kit (AWS CDK) or AWS CloudFormation, and also the business logic of AWS Glue job scripts for data integration. To learn more about how to implement your AWS Glue job scripts locally, refer to Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container.

Test

In the testing phase, you check the implementation for bugs. Quality analysis includes testing the code for errors and checking if it meets the requirements. Because many teams immediately test the code they write, the testing phase often runs in parallel with the development phase. There are different types of testing:

  • Unit testing
  • Integration testing
  • Performance testing

For unit testing, even for data integration, you can rely on a standard testing framework such as pytest or ScalaTest. To learn more about how to achieve unit testing locally, refer to Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container.
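As a minimal illustration of the idea (this is not code from the sample repository), factoring transformation logic into pure functions lets you unit test it with pytest and no AWS dependency at all; `enrich` below is a hypothetical helper:

```python
# test_enrich.py -- a pytest-style unit test for a pure transformation.
# `enrich` is a hypothetical example function, not part of the sample repo.

def enrich(record: dict) -> dict:
    """Return a copy of the record with a derived full_name field."""
    out = dict(record)
    out["full_name"] = f"{record['first_name']} {record['last_name']}"
    return out

def test_enrich_adds_full_name():
    record = {"first_name": "Ada", "last_name": "Lovelace"}
    assert enrich(record)["full_name"] == "Ada Lovelace"

def test_enrich_does_not_mutate_input():
    record = {"first_name": "Ada", "last_name": "Lovelace"}
    enrich(record)
    assert "full_name" not in record  # input record is left untouched
```

Keeping the transformation separate from the Glue boilerplate is what makes this kind of fast, framework-free test possible.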


Deploy

When data engineers develop a data integration pipeline, you code and test on a different copy of the product than the one that the end-users have access to. The environment that end-users use is called production, while other copies are said to be in the development or the pre-production environment.

Having separate build and production environments ensures that you can continue to use the data integration pipeline even while it's being changed or upgraded. The deployment phase includes several tasks to move the latest build copy to the production environment, such as packaging, environment configuration, and installation.

The following components are deployed through the AWS CDK or AWS CloudFormation:

  • AWS resources
  • Data integration job scripts for AWS Glue

AWS CodePipeline enables you to build a mechanism to automate deployments among different environments, including development, pre-production, and production. When you commit your code to AWS CodeCommit, CodePipeline automatically provisions AWS resources based on the CloudFormation templates included in the commit and uploads the script files included in the commit to Amazon S3.


Maintain

Even after you deploy your solution to a production environment, it's not the end of your project. You need to monitor the data integration pipeline continuously and keep maintaining and improving it. More specifically, you also need to fix bugs, resolve customer issues, and manage software changes. In addition, you need to monitor the overall system performance, security, and user experience to identify new ways to improve the existing data integration pipeline.

Solution overview

Typically, you have multiple accounts to manage and provision resources for your data pipeline. In this post, we assume the following three accounts:

  • Pipeline account – This hosts the end-to-end pipeline
  • Dev account – This hosts the data integration pipeline in the development environment
  • Prod account – This hosts the data integration pipeline in the production environment

If you prefer, you can use the same account and the same Region for all three.

To start applying this end-to-end development lifecycle model to your data platform easily and quickly, we prepared the baseline template aws-glue-cdk-baseline using the AWS CDK. The template is built on top of AWS CDK v2 and CDK Pipelines. It provisions two kinds of stacks:

  • AWS Glue app stack – This provisions the data integration pipeline: one in the dev account and one in the prod account
  • Pipeline stack – This provisions the Git repository and CI/CD pipeline in the pipeline account

The AWS Glue app stack provisions the data integration pipeline, including the following resources:

  • AWS Glue jobs
  • AWS Glue job scripts

The following diagram illustrates this architecture.

At the time of publishing this post, the AWS CDK has two versions of the AWS Glue module: @aws-cdk/aws-glue and @aws-cdk/aws-glue-alpha, containing L1 constructs and L2 constructs, respectively. The sample AWS Glue app stack is defined using aws-glue-alpha, the L2 construct for AWS Glue, because it's simple to define and manage AWS Glue resources. If you want to use the L1 construct, refer to Build, Test and Deploy ETL solutions using AWS Glue and AWS CDK based CI/CD pipelines.

The pipeline stack provisions the whole CI/CD pipeline, including the following resources:

The following diagram illustrates the pipeline workflow.

Every time the business requirement changes (such as adding data sources or changing data transformation logic), you make changes on the AWS Glue app stack and re-provision the stack to reflect your changes. This is done by committing your changes in the AWS CDK template to the CodeCommit repository; CodePipeline then reflects the changes on the AWS resources using CloudFormation change sets.

In the following sections, we present the steps to set up the required environment and demonstrate the end-to-end development lifecycle.


You need the following resources:

Initialize the project

To initialize the project, complete the following steps:

  1. Clone the baseline template to your workplace:
    $ git clone
    $ cd aws-glue-cdk-baseline

  2. Create a Python virtual environment specific to the project on the client machine:
    $ python3 -m venv .venv

We use a virtual environment in order to isolate the Python environment for this project and not install software globally.

  3. Activate the virtual environment according to your OS:
    • On MacOS and Linux, use the following command:
      $ source .venv/bin/activate

    • On a Windows platform, use the following command:
      % .venv\Scripts\activate.bat

After this step, the subsequent steps run within the bounds of the virtual environment on the client machine and interact with the AWS account as needed.

  4. Install the required dependencies described in requirements.txt to the virtual environment:
    $ pip install -r requirements.txt
    $ pip install -r requirements-dev.txt

  5. Edit the configuration file default-config.yaml based on your environments (replace each account ID with your own):
    pipelineAccount:
      awsAccountId: 123456789101
      awsRegion: us-east-1
    devAccount:
      awsAccountId: 123456789102
      awsRegion: us-east-1
    prodAccount:
      awsAccountId: 123456789103
      awsRegion: us-east-1

  6. Run pytest to initialize the snapshot test files by running the following command:
    $ python3 -m pytest --snapshot-update
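Conceptually, these snapshot tests serialize the CloudFormation template synthesized from the CDK app and compare it against a stored copy, so later, unintended template changes are caught. A minimal stdlib-only sketch of that mechanism (not the repository's actual implementation) might look like this:

```python
import json

def snapshot_matches(current: dict, stored_snapshot: str) -> bool:
    """Compare a freshly synthesized template against a stored snapshot.

    Serializing with sorted keys makes the comparison deterministic.
    """
    return json.dumps(current, sort_keys=True, indent=2) == stored_snapshot

# First run (--snapshot-update): store the serialized template.
template = {"Resources": {"MyJob": {"Type": "AWS::Glue::Job"}}}
stored = json.dumps(template, sort_keys=True, indent=2)

# Later runs: any change to the template fails the comparison.
assert snapshot_matches(template, stored)
changed = {"Resources": {"MyJob": {"Type": "AWS::Glue::Trigger"}}}
assert not snapshot_matches(changed, stored)
```

Running pytest with --snapshot-update corresponds to the "store" step; plain pytest runs correspond to the comparison step.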

Bootstrap your AWS environments

Run the following commands to bootstrap your AWS environments:

  1. In the pipeline account, replace PIPELINE-ACCOUNT-NUMBER, REGION, and PIPELINE-PROFILE with your own values:
    $ cdk bootstrap aws://<PIPELINE-ACCOUNT-NUMBER>/<REGION> --profile <PIPELINE-PROFILE> \
        --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess

  2. In the dev account, replace PIPELINE-ACCOUNT-NUMBER, DEV-ACCOUNT-NUMBER, REGION, and DEV-PROFILE with your own values:
    $ cdk bootstrap aws://<DEV-ACCOUNT-NUMBER>/<REGION> --profile <DEV-PROFILE> \
        --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
        --trust <PIPELINE-ACCOUNT-NUMBER>

  3. In the prod account, replace PIPELINE-ACCOUNT-NUMBER, PROD-ACCOUNT-NUMBER, REGION, and PROD-PROFILE with your own values:
    $ cdk bootstrap aws://<PROD-ACCOUNT-NUMBER>/<REGION> --profile <PROD-PROFILE> \
        --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
        --trust <PIPELINE-ACCOUNT-NUMBER>

When you use only one account for all environments, you can simply run the cdk bootstrap command one time.

Deploy your AWS resources

Run the following command using the pipeline account to deploy the resources defined in the AWS CDK baseline template:

$ cdk deploy --profile <PIPELINE-PROFILE>

This creates the pipeline stack in the pipeline account and the AWS Glue app stack in the development account.

When the cdk deploy command is complete, let's verify the pipeline using the pipeline account.

On the CodePipeline console, navigate to GluePipeline. Then verify that GluePipeline has the following stages: Source, Build, UpdatePipeline, Assets, DeployDev, and DeployProd. Also verify that the stages Source, Build, UpdatePipeline, Assets, and DeployDev have succeeded, and DeployProd is pending. This can take about 15 minutes.

Now that the pipeline has been created successfully, you can also verify the AWS Glue app stack resources on the AWS CloudFormation console in the dev account.

At this step, the AWS Glue app stack is deployed only in the dev account. You can try running the AWS Glue job ProcessLegislators to see how it works.

Configure your Git repository with CodeCommit

In an earlier step, you cloned the Git repository from GitHub. Although it's possible to configure the AWS CDK template to work with GitHub, GitHub Enterprise, or Bitbucket, for this post, we use CodeCommit. If you prefer one of those third-party Git providers, configure the connections and edit the pipeline definition to define the variable source to use the target Git provider using CodePipelineSource.

Because you already ran the cdk deploy command, the CodeCommit repository has already been created with all the required code and related files. The first step is to set up access to CodeCommit. The next step is to clone the repository from the CodeCommit repository to your local environment. Run the following commands:

$ mkdir aws-glue-cdk-baseline-codecommit
$ cd aws-glue-cdk-baseline-codecommit
$ git clone ssh://

In the next step, we make changes to this local copy of the CodeCommit repository.

End-to-end development lifecycle

Now that the environment has been successfully created, you're ready to start developing a data integration pipeline using this baseline template. Let's walk through the end-to-end development lifecycle.

When you want to define your own data integration pipeline, you need to add more AWS Glue jobs and implement job scripts. For this post, let's assume the use case of adding a new AWS Glue job with a new job script to read multiple S3 locations and join them.

Implement and test in your local environment

First, implement and test the AWS Glue job and its job script in your local environment using Visual Studio Code.

Set up your development environment by following the steps in Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container. The following steps are required in the context of this post:

  1. Start Docker.
  2. Pull the Docker image that has the local development environment using the AWS Glue ETL library:
    $ docker pull

  3. Run the following command to define the AWS named profile name (the value is a placeholder for your own profile):
    $ PROFILE_NAME="<your-profile-name>"

  4. Run the following commands to make the baseline template available to the container:
    $ cd aws-glue-cdk-baseline/
    $ WORKSPACE_LOCATION=$(pwd)

  5. Run the Docker container:
    $ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true \
        --rm -p 4040:4040 -p 18080:18080 \
        --name glue_pyspark pyspark

  6. Start Visual Studio Code.
  7. Choose Remote Explorer in the navigation pane, then choose the arrow icon of the workspace folder in the container.

If the workspace folder is not shown, choose Open folder and select /home/glue_user/workspace.

Then you will see a view similar to the following screenshot.

Optionally, you can install the AWS Toolkit for Visual Studio Code and start Amazon CodeWhisperer to enable code recommendations powered by machine learning models. For example, in aws_glue_cdk_baseline/job_scripts/, you can enter a comment like “# Write a DataFrame in Parquet format to S3”, press Enter, and CodeWhisperer will recommend a code snippet similar to the following:

CodeWhisperer on Visual Studio Code

Now you install the required dependencies described in requirements.txt to the container environment.

  1. Run the following commands in the terminal in Visual Studio Code:
    $ pip install -r requirements.txt
    $ pip install -r requirements-dev.txt

  2. Implement the code.

Now let's make the required changes for the new AWS Glue job.

  1. Edit the app stack file under aws_glue_cdk_baseline/. Add the following new code block after the existing job definition of ProcessLegislators in order to add the new AWS Glue job JoinLegislators (the executable block below is reconstructed for the aws-glue-alpha L2 construct):
            self.new_glue_job = glue.Job(self, "JoinLegislators",
                executable=glue.JobExecutable.python_etl(
                    glue_version=glue.GlueVersion.V4_0,
                    python_version=glue.PythonVersion.THREE,
                    script=glue.Code.from_asset(
                        path.join(path.dirname(__file__), "job_scripts/join_legislators.py")
                    )
                ),
                description="a new example PySpark job",
                default_arguments={
                    "--input_path_orgs": config[stage]['jobs']['JoinLegislators']['inputLocationOrgs'],
                    "--input_path_persons": config[stage]['jobs']['JoinLegislators']['inputLocationPersons'],
                    "--input_path_memberships": config[stage]['jobs']['JoinLegislators']['inputLocationMemberships']
                },
                tags={
                    "environment": self.environment,
                    "artifact_id": self.artifact_id,
                    "stack_id": self.stack_id,
                    "stack_name": self.stack_name
                }
            )
Here, you added three job parameters for different S3 locations using the variable config. This is the dictionary generated from default-config.yaml. In this baseline template, we use this central config file for managing parameters for all the Glue jobs in the structure <stage name>/jobs/<job name>/<parameter name>. In the succeeding steps, you provide those locations through the AWS Glue job parameters.
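As a sketch of that lookup, the <stage name>/jobs/<job name>/<parameter name> structure corresponds to a nested dictionary like the following (the bucket paths here are placeholders, not the real config values):

```python
# Simplified stand-in for the dictionary parsed from default-config.yaml.
config = {
    "dev": {
        "jobs": {
            "JoinLegislators": {
                "inputLocationOrgs": "s3://example-bucket/dev/organizations.json",
            }
        }
    },
    "prod": {
        "jobs": {
            "JoinLegislators": {
                "inputLocationOrgs": "s3://example-bucket/prod/organizations.json",
            }
        }
    },
}

stage = "dev"
# The same lookup shape the app stack uses to wire job parameters per stage.
path = config[stage]["jobs"]["JoinLegislators"]["inputLocationOrgs"]
assert path == "s3://example-bucket/dev/organizations.json"
```

Because the stage name is the outermost key, the same stack code deploys with dev parameters in the dev account and prod parameters in the prod account.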

  2. Create a new job script called aws_glue_cdk_baseline/job_scripts/join_legislators.py:
    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import Join
    from awsglue.utils import getResolvedOptions


    class JoinLegislators:
        def __init__(self):
            params = []
            if '--JOB_NAME' in sys.argv:
                params.append('JOB_NAME')
                params.append('input_path_orgs')
                params.append('input_path_persons')
                params.append('input_path_memberships')
            args = getResolvedOptions(sys.argv, params)
            self.context = GlueContext(SparkContext.getOrCreate())
            self.job = Job(self.context)
            if 'JOB_NAME' in args:
                jobname = args['JOB_NAME']
                self.input_path_orgs = args['input_path_orgs']
                self.input_path_persons = args['input_path_persons']
                self.input_path_memberships = args['input_path_memberships']
            else:
                jobname = "test"
                self.input_path_orgs = "s3://awsglue-datasets/examples/us-legislators/all/organizations.json"
                self.input_path_persons = "s3://awsglue-datasets/examples/us-legislators/all/persons.json"
                self.input_path_memberships = "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"
            self.job.init(jobname, args)

        def run(self):
            dyf = join_legislators(self.context, self.input_path_orgs, self.input_path_persons, self.input_path_memberships)
            df = dyf.toDF()
            df.printSchema()
            self.job.commit()


    def read_dynamic_frame_from_json(glue_context, path):
        return glue_context.create_dynamic_frame.from_options(
            connection_type='s3',
            connection_options={
                'paths': [path],
                'recurse': True
            },
            format='json'
        )


    def join_legislators(glue_context, path_orgs, path_persons, path_memberships):
        orgs = read_dynamic_frame_from_json(glue_context, path_orgs)
        persons = read_dynamic_frame_from_json(glue_context, path_persons)
        memberships = read_dynamic_frame_from_json(glue_context, path_memberships)
        orgs = orgs.drop_fields(['other_names', 'identifiers']).rename_field('id', 'org_id').rename_field('name', 'org_name')
        dynamicframe_joined = Join.apply(orgs, Join.apply(persons, memberships, 'id', 'person_id'), 'org_id', 'organization_id').drop_fields(['person_id', 'org_id'])
        return dynamicframe_joined


    if __name__ == '__main__':
        JoinLegislators().run()
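To see the shape of the three-way join in isolation, here is a plain-Python analogue of the same organizations–persons–memberships join on toy records; it only loosely mirrors the DynamicFrame Join.apply semantics and is not part of the sample repository:

```python
# Toy records mirroring the three us-legislators datasets.
persons = [{"id": "p1", "name": "Jane Doe"}]
memberships = [{"person_id": "p1", "organization_id": "o1"}]
orgs = [{"org_id": "o1", "org_name": "Senate"}]

def inner_join(left, right, lkey, rkey):
    """Naive inner join of two lists of dicts on lkey == rkey."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if l[lkey] == r[rkey]
    ]

# persons joined to memberships on id == person_id,
# then organizations joined on org_id == organization_id.
joined = inner_join(orgs, inner_join(persons, memberships, "id", "person_id"),
                    "org_id", "organization_id")
assert joined[0]["name"] == "Jane Doe" and joined[0]["org_name"] == "Senate"
```

The nesting order matches the job script: the inner join links persons to their memberships, and the outer join attaches the organization details.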

  3. Create a new unit test script for the new AWS Glue job called aws_glue_cdk_baseline/job_scripts/tests/test_join_legislators.py:
    import pytest
    import sys
    import join_legislators
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions


    @pytest.fixture(scope="module", autouse=True)
    def glue_context():
        sys.argv.append('--JOB_NAME')
        sys.argv.append('test_count')
        args = getResolvedOptions(sys.argv, ['JOB_NAME'])
        context = GlueContext(SparkContext.getOrCreate())
        job = Job(context)
        job.init(args['JOB_NAME'], args)
        yield context
        job.commit()


    def test_counts(glue_context):
        dyf = join_legislators.join_legislators(glue_context,
            "s3://awsglue-datasets/examples/us-legislators/all/organizations.json",
            "s3://awsglue-datasets/examples/us-legislators/all/persons.json",
            "s3://awsglue-datasets/examples/us-legislators/all/memberships.json")
        assert dyf.toDF().count() == 10439

  4. In default-config.yaml, add the following under both prod and dev:
          JoinLegislators:
            inputLocationOrgs: "s3://awsglue-datasets/examples/us-legislators/all/organizations.json"
            inputLocationPersons: "s3://awsglue-datasets/examples/us-legislators/all/persons.json"
            inputLocationMemberships: "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"

  5. Add the following under "jobs" in the variable config in tests/unit/, tests/unit/, and tests/snapshot/ (no need to replace the S3 locations):
                "JoinLegislators": {
                    "inputLocationOrgs": "s3://path_to_data_orgs",
                    "inputLocationPersons": "s3://path_to_data_persons",
                    "inputLocationMemberships": "s3://path_to_data_memberships"
                }

  6. Choose Run at the top right to run the individual job scripts.

If the Run button is not shown, install Python into the container through Extensions in the navigation pane.

  7. For local unit testing, run the following commands in the terminal in Visual Studio Code:
    $ cd aws_glue_cdk_baseline/job_scripts/
    $ python3 -m pytest

Then you can verify that the newly added unit test passed successfully.

  8. Run pytest to re-initialize the snapshot test files by running the following commands:
    $ cd ../../
    $ python3 -m pytest --snapshot-update

Deploy to the development environment

Complete the following steps to deploy the AWS Glue app stack to the development environment and run integration tests there:

  1. Set up access to CodeCommit.
  2. Commit and push your changes to the CodeCommit repo:
    $ git add .
    $ git commit -m "Add the second Glue job"
    $ git push

You can see that the pipeline is successfully triggered.

Integration test

There is nothing extra required to run the integration test for the newly added AWS Glue job. The integration test script runs all the jobs with a specific tag, then verifies the state and its duration. If you want to change the condition or the threshold, you can edit the assertions at the end of the integ_test_glue_job method.
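The verification step can be pictured as a small polling loop. The sketch below is illustrative, not the repository's actual code: it takes an injected get_state callable in place of a real AWS Glue "get job run" API call:

```python
def wait_for_job(get_state, max_polls=60):
    """Poll a job-state getter until the run reaches a terminal state.

    get_state() stands in for a real 'describe job run' API call.
    Returns the terminal state, or raises if the run never finishes.
    """
    terminal = {"SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED"}
    for _ in range(max_polls):
        state = get_state()
        if state in terminal:
            return state
    raise TimeoutError("job run did not reach a terminal state")

# Simulated run: RUNNING twice, then SUCCEEDED.
states = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
assert wait_for_job(lambda: next(states)) == "SUCCEEDED"
```

An integration test built on a loop like this would then assert on the returned state (for example, that it is SUCCEEDED) and on the measured duration against a threshold.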

Deploy to the production environment

Complete the following steps to deploy the AWS Glue app stack to the production environment:

  1. On the CodePipeline console, navigate to GluePipeline.
  2. Choose Review under the DeployProd stage.
  3. Choose Approve.

Wait for the DeployProd stage to complete, then you can verify the AWS Glue app stack resources in the prod account.

Clean up

To clean up your resources, complete the following steps:

  1. Run the following command using the pipeline account:
    $ cdk destroy --profile <PIPELINE-PROFILE>

  2. Delete the AWS Glue app stack in the dev account and prod account.


Conclusion

In this post, you learned how to define the development lifecycle for data integration and how software engineers and data engineers can design an end-to-end development lifecycle using AWS Glue, including development, testing, and CI/CD, through a sample AWS CDK template. You can get started building your own end-to-end development lifecycle for your workload using AWS Glue.

About the author

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is based in Tokyo, Japan, and is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.


