Event-Driven AWS Glue Crawlers: Setting Up SQS-Based Triggers

Event-Driven AWS Glue Crawlers: Setting Up SQS-Based Triggers

Takahiro Iwasa
Takahiro Iwasa
4 min read
ETL Glue

Glue Crawlers can crawl based on S3 event notifications, avoiding a full scan and offering benefits in terms of both crawling performance and costs.

Building

Create a CloudFormation template with the following content:

AWSTemplateFormatVersion: '2010-09-09'
Description: Glue crawler test
Resources:
SqsQueue:
Type: AWS::SQS::Queue
Properties:
SqsManagedSseEnabled: true
QueueName: glue-crawler-test-queue
SqsQueuePolicy:
Type: AWS::SQS::QueuePolicy
Properties:
Queues:
- !Ref SqsQueue
PolicyDocument:
Version: '2008-10-17'
Id: __default_policy_ID
Statement:
- Effect: Allow
Principal:
AWS:
- !Sub arn:aws:iam::${AWS::AccountId}:root
- !GetAtt IAMRoleGlueCrawler.Arn
Action: sqs:*
Resource: !GetAtt SqsQueue.Arn
- Effect: Allow
Principal:
Service: s3.amazonaws.com
Action: sqs:*
Resource: !GetAtt SqsQueue.Arn
S3Bucket:
Type: AWS::S3::Bucket
DependsOn: SqsQueuePolicy
Properties:
BucketName: !Sub glue-crawler-test-${AWS::AccountId}-${AWS::Region}
BucketEncryption:
ServerSideEncryptionConfiguration:
- ServerSideEncryptionByDefault:
SSEAlgorithm: AES256
NotificationConfiguration:
QueueConfigurations:
- Event: 's3:ObjectCreated:*'
Queue: !GetAtt SqsQueue.Arn
PublicAccessBlockConfiguration:
BlockPublicAcls: TRUE
BlockPublicPolicy: TRUE
IgnorePublicAcls: TRUE
RestrictPublicBuckets: TRUE
IAMRoleGlueCrawler:
Type: AWS::IAM::Role
Properties:
Path: /service-role/
RoleName: !Sub glue-crawler-test-service-role
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
Service: glue.amazonaws.com
Action: sts:AssumeRole
Policies:
- PolicyName: cw-logs
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- logs:CreateLogGroup
- logs:CreateLogStream
- logs:PutLogEvents
Resource: "*"
- PolicyName: glue
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- glue:CreateTable
- glue:GetDatabase
- glue:GetTable
- glue:UpdateTable
Resource: "*"
- PolicyName: s3
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- s3:GetObject
- s3:ListBucket
- s3:PutObject
Resource: '*'
GlueDatabase:
Type: AWS::Glue::Database
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseInput:
Name: glue-crawler-test-db
GlueTable:
Type: AWS::Glue::Table
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: !Ref GlueDatabase
TableInput:
Name: glue-crawler-test-table
TableType: EXTERNAL_TABLE
Parameters:
classification: json
StorageDescriptor:
Location: !Sub 's3://${S3Bucket}/'
Compressed: false
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
SerdeInfo:
SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
GlueCrawler:
Type: AWS::Glue::Crawler
Properties:
Name: glue-crawler-test
Role: !Sub service-role/${IAMRoleGlueCrawler}
Targets:
CatalogTargets:
- DatabaseName: !Ref GlueDatabase
Tables:
- !Ref GlueTable
SchemaChangePolicy:
UpdateBehavior: UPDATE_IN_DATABASE
DeleteBehavior: LOG

Deploy the CloudFormation stack with the following command:

Terminal window
STACK_NAME=glue-crawler-test
aws cloudformation package \
--template-file template.yaml \
--s3-bucket <YOUR_CFN_BUCKET> \
--s3-prefix "$STACK_NAME/$(date +%Y)/$(date +%m)/$(date +%d)/$(date +%H)/$(date +%M)" \
--output-template-file package.template
aws cloudformation deploy \
--stack-name $STACK_NAME \
--template-file package.template \
--capabilities CAPABILITY_NAMED_IAM

Updating Glue Crawler Target Table Setting

CloudFormation does not currently support S3 event notifications for Glue Crawlers. Update the target table settings manually:

https://github.com/aws-cloudformation/cloudformation-coverage-roadmap/issues/947

CloudFormation Support For S3 Event isn’t currently available. S3 Event Crawler’s integration with CloudFormation is in scope and in the works. We plan on releasing this coverage to Cloudformation some later this year. Thank you for patience.

Select the table and click Edit.

Choose Crawl based on events in the Subsequent crawler runs section.

Testing

Feeding Version 1

Upload a sample JSON file to trigger an S3 event notification:

Terminal window
echo '{"message": "Hello World"}' > sample1.json
aws s3 cp sample1.json s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/

Start the Glue crawler:

Terminal window
aws glue start-crawler --name glue-crawler-test

Monitor its status until it shows STOPPING:

Terminal window
$ aws glue get-crawler --name glue-crawler-test | jq -r '.Crawler.State'
STOPPING

Verify the updated Glue table schema:

Terminal window
aws glue get-table \
--database-name glue-crawler-test-db \
--name glue-crawler-test-table \
| jq '.Table.StorageDescriptor.Columns'

Expected output:

[
{
"Name": "message",
"Type": "string"
}
]

Feeding Version 2

Repeat the process with a new JSON file version:

Terminal window
echo '{"message": "Hello World", "statusCode": 200}' > sample2.json
aws s3 cp sample2.json s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/

Start the Glue crawler:

Terminal window
aws glue start-crawler --name glue-crawler-test

Monitor its status until it shows STOPPING:

Terminal window
$ aws glue get-crawler --name glue-crawler-test | jq -r '.Crawler.State'
STOPPING

Verify the updated Glue table schema:

Terminal window
aws glue get-table \
--database-name glue-crawler-test-db \
--name glue-crawler-test-table \
| jq '.Table.StorageDescriptor.Columns'

Expected output:

[
{
"Name": "message",
"Type": "string"
},
{
"Name": "statuscode",
"Type": "int"
}
]

Checking SQS Message Count

Verify that no messages are left in the SQS queue:

Terminal window
queue_url=$(aws sqs get-queue-url --queue-name glue-crawler-test-queue | jq -r '.QueueUrl')
aws sqs get-queue-attributes \
--queue-url $queue_url \
--attribute-names ApproximateNumberOfMessages
{
"Attributes": {
"ApproximateNumberOfMessages": "0"
}
}

Cleaning Up

Clean up all the AWS resources provisioned during this example with the following command:

Terminal window
aws s3 rm s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/ --recursive
aws cloudformation delete-stack --stack-name $STACK_NAME
Takahiro Iwasa

Takahiro Iwasa

Software Developer
Involved in the requirements definition, design, and development of cloud-native applications using AWS. Japan AWS Top Engineers 2020-2023.