Having Glue Crawler Crawl Based on Events with SQS

Having Glue Crawler Crawl Based on Events with SQS

Takahiro Iwasa
Takahiro Iwasa
4 min read
ETL Glue

Glue Crawlers can crawl S3 buckets based on event notifications with SQS, avoiding a full scan and offering benefits in terms of both crawling performance and costs.

Overview

Creating AWS Resources

Create a CloudFormation template with the following content.

AWSTemplateFormatVersion: '2010-09-09'
Description: Glue crawler test

Resources:
  SqsQueue:
    Type: AWS::SQS::Queue
    Properties:
      SqsManagedSseEnabled: true
      QueueName: glue-crawler-test-queue

  SqsQueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues:
        - !Ref SqsQueue
      PolicyDocument:
        Version: '2008-10-17'
        Id: __default_policy_ID
        Statement:
          - Effect: Allow
            Principal:
              AWS:
                - !Sub arn:aws:iam::${AWS::AccountId}:root
                - !GetAtt IAMRoleGlueCrawler.Arn
            Action: sqs:*
            Resource: !GetAtt SqsQueue.Arn
          - Effect: Allow
            Principal:
              Service: s3.amazonaws.com
            Action: sqs:*
            Resource: !GetAtt SqsQueue.Arn

  S3Bucket:
    Type: AWS::S3::Bucket
    DependsOn: SqsQueuePolicy
    Properties:
      BucketName: !Sub glue-crawler-test-${AWS::AccountId}-${AWS::Region}
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      NotificationConfiguration:
        QueueConfigurations:
          - Event: 's3:ObjectCreated:*'
            Queue: !GetAtt SqsQueue.Arn
      PublicAccessBlockConfiguration:
        BlockPublicAcls: TRUE
        BlockPublicPolicy: TRUE
        IgnorePublicAcls: TRUE
        RestrictPublicBuckets: TRUE

  IAMRoleGlueCrawler:
    Type: AWS::IAM::Role
    Properties:
      Path: /service-role/
      RoleName: !Sub glue-crawler-test-service-role
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: glue.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: cw-logs
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: "*"
        - PolicyName: glue
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - glue:CreateTable
                  - glue:GetDatabase
                  - glue:GetTable
                  - glue:UpdateTable
                Resource: "*"
        - PolicyName: s3
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject
                  - s3:ListBucket
                  - s3:PutObject
                Resource: '*'

  GlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: glue-crawler-test-db

  GlueTable:
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref GlueDatabase
      TableInput:
        Name: glue-crawler-test-table
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: json
        StorageDescriptor:
          Location: !Sub 's3://${S3Bucket}/'
          Compressed: false
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          SerdeInfo:
            SerializationLibrary: org.openx.data.jsonserde.JsonSerDe

  GlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: glue-crawler-test
      Role: !Sub service-role/${IAMRoleGlueCrawler}
      Targets:
        CatalogTargets:
          - DatabaseName: !Ref GlueDatabase
            Tables:
              - !Ref GlueTable
      SchemaChangePolicy:
        UpdateBehavior: UPDATE_IN_DATABASE
        DeleteBehavior: LOG

Replace <YOUR_CFN_BUCKET> with the actual value and deploy the CloudFormation stack with the following command.

$ STACK_NAME=glue-crawler-test
$ aws cloudformation package \
  --template-file template.yaml \
  --s3-bucket <YOUR_CFN_BUCKET> \
  --s3-prefix "$STACK_NAME/$(date +%Y)/$(date +%m)/$(date +%d)/$(date +%H)/$(date +%M)" \
  --output-template-file package.template
$ aws cloudformation deploy \
  --stack-name $STACK_NAME \
  --template-file package.template \
  --capabilities CAPABILITY_NAMED_IAM

Updating Glue Crawler Target Table Setting

Since CloudFormation does not currently support S3 event notifications for Glue Crawler, update the target table setting manually.

CloudFormation Support For S3 Event isn’t currently available. S3 Event Crawler’s integration with CloudFormation is in scope and in the works. We plan on releasing this coverage to Cloudformation some later this year. Thank you for patience.

Select the table and click Edit.

Select Crawl based on events at Subsequent crawler runs.

Testing

Updating Glue Table Schema by JSON Version 1

Upload the sample JSON file with the following command to trigger an S3 event notification.

$ echo '{"message": "Hello World"}' > sample1.json
$ aws s3 cp sample1.json s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/

Start the Glue crawler.

$ aws glue start-crawler --name glue-crawler-test

Check the status of the Glue crawler until it shows STOPPING.

$ aws glue get-crawler --name glue-crawler-test | jq -r '.Crawler.State'
STOPPING

You should see the Glue table updated and have the new schema.

$ aws glue get-table \
  --database-name glue-crawler-test-db \
  --name glue-crawler-test-table \
| jq '.Table.StorageDescriptor.Columns'
[
  {
    "Name": "message",
    "Type": "string"
  }
]

Updating Glue Table Schema by JSON Version 2

Upload the sample JSON file v2 with the following command to trigger an S3 event notification.

$ echo '{"message": "Hello World", "statusCode": 200}' > sample2.json
$ aws s3 cp sample2.json s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/

Start the Glue crawler.

$ aws glue start-crawler --name glue-crawler-test

Check the status of the Glue crawler until it shows STOPPING.

$ aws glue get-crawler --name glue-crawler-test | jq -r '.Crawler.State'
STOPPING

You should see the Glue table updated and have the new v2 schema.

$ aws glue get-table \
  --database-name glue-crawler-test-db \
  --name glue-crawler-test-table \
| jq '.Table.StorageDescriptor.Columns'
[
  {
    "Name": "message",
    "Type": "string"
  },
  {
    "Name": "statuscode",
    "Type": "int"
  }
]

Checking SQS Message Count

Check the number of messages in the queue. You should see ApproximateNumberOfMessages: 0.

$ queue_url=$(aws sqs get-queue-url --queue-name glue-crawler-test-queue | jq -r '.QueueUrl')
$ aws sqs get-queue-attributes \
  --queue-url $queue_url \
  --attribute-names ApproximateNumberOfMessages
{
    "Attributes": {
        "ApproximateNumberOfMessages": "0"
    }
}

Cleaning Up

Clean up the provisioned AWS resources with the following command.

$ aws s3 rm s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/ --recursive
$ aws cloudformation delete-stack --stack-name $STACK_NAME

Conclusion

By crawling S3 buckets using event notifications, we can avoid full scans, leading to reduced crawling costs and improved performance.

I hope you will find this post useful.

Takahiro Iwasa

Takahiro Iwasa

Software Developer at KAKEHASHI Inc.
Involved in the requirements definition, design, and development of cloud-native applications using AWS. Now, building a new prescription data collection platform at KAKEHASHI Inc. Japan AWS Top Engineers 2020-2023.