How to Set Up Glue Crawler for Event-Driven Crawling with SQS

Takahiro Iwasa

1/3/2023

ETL Glue

Introduction

Glue Crawlers can crawl S3 buckets based on event notifications with SQS, avoiding a full scan and offering benefits in terms of both crawling performance and costs.

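Event notifications only tell the crawler which objects have changed. They do not start the crawler, so you still need to run it on a schedule or start it manually, as in the Testing section.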

If you need to deliver event notifications to multiple consumers, S3 can publish them to an SNS topic that fans out to the SQS queue and the other consumers.

Creating AWS Resources

Create a CloudFormation template with the following content:

AWSTemplateFormatVersion: '2010-09-09'
Description: Glue crawler test

Resources:
  SqsQueue:
    Type: AWS::SQS::Queue
    Properties:
      SqsManagedSseEnabled: true
      QueueName: glue-crawler-test-queue

  SqsQueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues:
        - !Ref SqsQueue
      PolicyDocument:
        Version: '2008-10-17'
        Id: __default_policy_ID
        Statement:
          - Effect: Allow
            Principal:
              AWS:
                - !Sub arn:aws:iam::${AWS::AccountId}:root
                - !GetAtt IAMRoleGlueCrawler.Arn
            Action: sqs:*
            Resource: !GetAtt SqsQueue.Arn
          - Effect: Allow
            Principal:
              Service: s3.amazonaws.com
            # S3 only needs permission to send messages when delivering event notifications.
            Action: sqs:SendMessage
            Resource: !GetAtt SqsQueue.Arn

  S3Bucket:
    Type: AWS::S3::Bucket
    DependsOn: SqsQueuePolicy
    Properties:
      BucketName: !Sub glue-crawler-test-${AWS::AccountId}-${AWS::Region}
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      NotificationConfiguration:
        QueueConfigurations:
          - Event: 's3:ObjectCreated:*'
            Queue: !GetAtt SqsQueue.Arn
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true

  IAMRoleGlueCrawler:
    Type: AWS::IAM::Role
    Properties:
      Path: /service-role/
      RoleName: glue-crawler-test-service-role
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: glue.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: cw-logs
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: "*"
        - PolicyName: glue
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - glue:CreateTable
                  - glue:GetDatabase
                  - glue:GetTable
                  - glue:UpdateTable
                Resource: "*"
        - PolicyName: s3
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject
                  - s3:ListBucket
                  - s3:PutObject
                Resource: '*'

  GlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: glue-crawler-test-db

  GlueTable:
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref GlueDatabase
      TableInput:
        Name: glue-crawler-test-table
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: json
        StorageDescriptor:
          Location: !Sub 's3://${S3Bucket}/'
          Compressed: false
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          SerdeInfo:
            SerializationLibrary: org.openx.data.jsonserde.JsonSerDe

  GlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: glue-crawler-test
      Role: !Sub service-role/${IAMRoleGlueCrawler}
      Targets:
        CatalogTargets:
          - DatabaseName: !Ref GlueDatabase
            Tables:
              - !Ref GlueTable
      SchemaChangePolicy:
        UpdateBehavior: UPDATE_IN_DATABASE
        DeleteBehavior: LOG

Replace <YOUR_CFN_BUCKET> with the actual value and deploy the CloudFormation stack with the following commands:

STACK_NAME=glue-crawler-test

aws cloudformation package \
  --template-file template.yaml \
  --s3-bucket <YOUR_CFN_BUCKET> \
  --s3-prefix "$STACK_NAME/$(date +%Y/%m/%d/%H/%M)" \
  --output-template-file package.template

aws cloudformation deploy \
  --stack-name $STACK_NAME \
  --template-file package.template \
  --capabilities CAPABILITY_NAMED_IAM
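
The bucket name in the template depends on the account ID and region, and the later commands need it. As a rough sketch, a small helper could derive it from the current CLI credentials. The function name glue_test_bucket_name is hypothetical and not part of the stack, and the sketch assumes the stack was deployed to the region in AWS_REGION or, failing that, the CLI's default region.

```shell
# Hypothetical helper: print the bucket name created by the template,
# glue-crawler-test-<ACCOUNT_ID>-<REGION>, for the current CLI identity.
glue_test_bucket_name() {
  local account_id region
  # Account ID of the credentials currently in use.
  account_id=$(aws sts get-caller-identity --query Account --output text) || return 1
  # Assumption: the stack lives in AWS_REGION, or else in the CLI default region.
  region=${AWS_REGION:-$(aws configure get region)} || return 1
  [ -n "$account_id" ] && [ -n "$region" ] || return 1
  printf 'glue-crawler-test-%s-%s\n' "$account_id" "$region"
}

# Example usage (not executed here): confirm the queue notification exists.
#   aws s3api get-bucket-notification-configuration --bucket "$(glue_test_bucket_name)"
```

The usage line should list the queue configuration defined in the template if the deployment succeeded.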

Updating Glue Crawler Target Table Setting

At the time of writing, CloudFormation does not support S3 event notifications for Glue Crawlers, as the following issue explains:

https://github.com/aws-cloudformation/cloudformation-coverage-roadmap/issues/947

"CloudFormation Support For S3 Event isn’t currently available. S3 Event Crawler’s integration with CloudFormation is in scope and in the works. We plan on releasing this coverage to Cloudformation some later this year. Thank you for patience."

Until then, update the crawler's target table settings manually in the Glue console.

Select the table and click Edit.

Choose Crawl based on events in the Subsequent crawler runs section.
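
If you prefer the CLI to the console, the same setting can probably be applied with update-crawler. The following is only a sketch under two assumptions that you should verify against the current Glue API reference: that catalog targets accept an EventQueueArn field, and that the recrawl policy accepts a RecrawlBehavior of CRAWL_EVENT_MODE. The helper names catalog_targets_json and enable_event_mode are hypothetical.

```shell
# Hypothetical helper: print the --targets JSON for one catalog table whose
# crawls should be driven by the given SQS queue.
# $1 = database name, $2 = table name, $3 = SQS queue ARN
catalog_targets_json() {
  printf '{"CatalogTargets":[{"DatabaseName":"%s","Tables":["%s"],"EventQueueArn":"%s"}]}\n' \
    "$1" "$2" "$3"
}

# Hypothetical helper: switch an existing crawler to event mode.
# Assumption: update-crawler accepts EventQueueArn and CRAWL_EVENT_MODE as used here.
# $1 = crawler name, $2 = database name, $3 = table name, $4 = SQS queue ARN
enable_event_mode() {
  aws glue update-crawler \
    --name "$1" \
    --targets "$(catalog_targets_json "$2" "$3" "$4")" \
    --recrawl-policy RecrawlBehavior=CRAWL_EVENT_MODE
}

# Example usage (not executed here), with the names from the template:
#   queue_url=$(aws sqs get-queue-url --queue-name glue-crawler-test-queue --query QueueUrl --output text)
#   queue_arn=$(aws sqs get-queue-attributes --queue-url "$queue_url" \
#     --attribute-names QueueArn --query Attributes.QueueArn --output text)
#   enable_event_mode glue-crawler-test glue-crawler-test-db glue-crawler-test-table "$queue_arn"
```

Building the JSON in its own function keeps the payload easy to inspect before anything is sent to AWS.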

Testing

Updating Glue Table Schema by JSON Version 1

Upload a sample JSON file to trigger an S3 event notification:

echo '{"message": "Hello World"}' > sample1.json
aws s3 cp sample1.json s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/

Start the Glue crawler:

aws glue start-crawler --name glue-crawler-test

Monitor its status until it shows STOPPING, which means the crawler is finishing its run (the state returns to READY afterwards):

$ aws glue get-crawler --name glue-crawler-test | jq -r '.Crawler.State'
STOPPING
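
Rather than re-running the status command by hand, the check can be wrapped in a polling loop. This is a minimal sketch: wait_for_crawler_ready is a hypothetical helper, the default interval and poll limit are arbitrary, and it waits for READY, the state a crawler returns to once a run has fully finished.

```shell
# Hypothetical helper: poll the crawler state until it returns to READY.
# $1 = crawler name, $2 = seconds between polls (default 10), $3 = max polls (default 60)
wait_for_crawler_ready() {
  local name interval max_polls state i
  name=$1
  interval=${2:-10}
  max_polls=${3:-60}
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    # --query extracts the state, so jq is not required here.
    state=$(aws glue get-crawler --name "$name" --query 'Crawler.State' --output text) || return 1
    echo "crawler state: $state" >&2
    if [ "$state" = "READY" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for crawler $name" >&2
  return 1
}

# Example usage (not executed here):
#   aws glue start-crawler --name glue-crawler-test
#   wait_for_crawler_ready glue-crawler-test
```

The poll limit keeps the loop from hanging forever if the crawler never returns to READY.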

Verify the updated Glue table schema:

$ aws glue get-table \
  --database-name glue-crawler-test-db \
  --name glue-crawler-test-table \
| jq '.Table.StorageDescriptor.Columns'
[
  {
    "Name": "message",
    "Type": "string"
  }
]

Updating Glue Table Schema by JSON Version 2

Repeat the process with a new JSON file version:

echo '{"message": "Hello World", "statusCode": 200}' > sample2.json
aws s3 cp sample2.json s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/

Start the Glue crawler:

aws glue start-crawler --name glue-crawler-test

Monitor its status until it shows STOPPING, which means the crawler is finishing its run (the state returns to READY afterwards):

$ aws glue get-crawler --name glue-crawler-test | jq -r '.Crawler.State'
STOPPING

Verify the updated Glue table schema:

$ aws glue get-table \
  --database-name glue-crawler-test-db \
  --name glue-crawler-test-table \
| jq '.Table.StorageDescriptor.Columns'
[
  {
    "Name": "message",
    "Type": "string"
  },
  {
    "Name": "statuscode",
    "Type": "int"
  }
]

The statuscode column has been added to the table. As the output shows, the column name is stored in lowercase even though the JSON key is statusCode.

Checking SQS Message Count

Verify that no messages are left in the SQS queue:

$ queue_url=$(aws sqs get-queue-url --queue-name glue-crawler-test-queue | jq -r '.QueueUrl')
$ aws sqs get-queue-attributes \
  --queue-url $queue_url \
  --attribute-names ApproximateNumberOfMessages
{
    "Attributes": {
        "ApproximateNumberOfMessages": "0"
    }
}

A count of 0 indicates that the crawler consumed the S3 event messages during the two runs above.
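
ApproximateNumberOfMessages only counts visible messages, so it may also be worth checking the in-flight count. The sketch below reads both attributes in one call. queue_is_drained is a hypothetical helper, and the parsing assumes that --output text prints the two values on one whitespace-separated line.

```shell
# Hypothetical helper: succeed only when the queue reports no visible and
# no in-flight messages.
# $1 = queue URL
queue_is_drained() {
  local counts visible in_flight
  counts=$(aws sqs get-queue-attributes \
    --queue-url "$1" \
    --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible \
    --query '[Attributes.ApproximateNumberOfMessages, Attributes.ApproximateNumberOfMessagesNotVisible]' \
    --output text) || return 1
  # Assumption: both values arrive on one line, separated by whitespace.
  read -r visible in_flight <<EOF
$counts
EOF
  [ "$visible" = "0" ] && [ "$in_flight" = "0" ]
}

# Example usage (not executed here):
#   queue_is_drained "$queue_url" && echo "queue is empty"
```

Both numbers are approximate, so a short retry is reasonable if the check fails right after a crawler run.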

Cleaning Up

Remove provisioned AWS resources:

aws s3 rm s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/ --recursive
aws cloudformation delete-stack --stack-name $STACK_NAME
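
delete-stack returns as soon as the deletion has been requested. To block until the stack is actually gone, the built-in waiter can be chained after it, as in this sketch. delete_stack_and_wait is a hypothetical helper name.

```shell
# Hypothetical helper: request stack deletion, then wait for it to complete.
# $1 = stack name
delete_stack_and_wait() {
  aws cloudformation delete-stack --stack-name "$1" || return 1
  # The waiter returns non-zero if the deletion fails or times out.
  aws cloudformation wait stack-delete-complete --stack-name "$1"
}

# Example usage (not executed here):
#   delete_stack_and_wait "$STACK_NAME"
```

Empty the bucket first, as in the commands above, because CloudFormation cannot delete a bucket that still contains objects.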

Conclusion

By configuring Glue Crawlers to use S3 event notifications with SQS, you can avoid full scans: each run only processes the objects reported since the previous run, which reduces both crawl time and cost. This makes the approach a good fit for event-driven data pipelines.

Happy Coding! 🚀