Having a Glue Crawler Crawl Based on Events with SQS
AWS Glue crawlers can crawl S3 buckets based on S3 event notifications delivered to SQS. Instead of scanning the entire bucket on every run, the crawler reads the queued events and visits only the objects that changed, which reduces both crawl time and cost.
Overview
The setup works as follows: the S3 bucket publishes s3:ObjectCreated:* event notifications to an SQS queue, and the Glue crawler, configured to crawl based on events, consumes those messages to determine which objects to crawl.
Creating AWS Resources
Create a CloudFormation template named template.yaml with the following content.
AWSTemplateFormatVersion: '2010-09-09'
Description: Glue crawler test
Resources:
  # Queue that receives S3 event notifications and feeds the crawler.
  SqsQueue:
    Type: AWS::SQS::Queue
    Properties:
      SqsManagedSseEnabled: true
      QueueName: glue-crawler-test-queue
  SqsQueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues:
        - !Ref SqsQueue
      PolicyDocument:
        Version: '2008-10-17'
        Id: __default_policy_ID
        Statement:
          # Allow the account root and the crawler role to use the queue.
          - Effect: Allow
            Principal:
              AWS:
                - !Sub arn:aws:iam::${AWS::AccountId}:root
                - !GetAtt IAMRoleGlueCrawler.Arn
            Action: sqs:*
            Resource: !GetAtt SqsQueue.Arn
          # Allow S3 to deliver event notifications to the queue
          # (could be narrowed to sqs:SendMessage).
          - Effect: Allow
            Principal:
              Service: s3.amazonaws.com
            Action: sqs:*
            Resource: !GetAtt SqsQueue.Arn
  S3Bucket:
    Type: AWS::S3::Bucket
    # The queue policy must exist before S3 validates the notification target.
    DependsOn: SqsQueuePolicy
    Properties:
      BucketName: !Sub glue-crawler-test-${AWS::AccountId}-${AWS::Region}
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      NotificationConfiguration:
        QueueConfigurations:
          # Notify the queue whenever an object is created.
          - Event: 's3:ObjectCreated:*'
            Queue: !GetAtt SqsQueue.Arn
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
  IAMRoleGlueCrawler:
    Type: AWS::IAM::Role
    Properties:
      Path: /service-role/
      RoleName: glue-crawler-test-service-role
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: glue.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: cw-logs
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: '*'
        - PolicyName: glue
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - glue:CreateTable
                  - glue:GetDatabase
                  - glue:GetTable
                  - glue:UpdateTable
                Resource: '*'
        - PolicyName: s3
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject
                  - s3:ListBucket
                  - s3:PutObject
                Resource: '*'
  GlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: glue-crawler-test-db
  # Pre-created catalog table that the crawler updates; classified as JSON.
  GlueTable:
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref GlueDatabase
      TableInput:
        Name: glue-crawler-test-table
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: json
        StorageDescriptor:
          Location: !Sub 's3://${S3Bucket}/'
          Compressed: false
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          SerdeInfo:
            SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
  GlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: glue-crawler-test
      # The Role property takes the role name, including its path.
      Role: !Sub service-role/${IAMRoleGlueCrawler}
      Targets:
        CatalogTargets:
          - DatabaseName: !Ref GlueDatabase
            Tables:
              - !Ref GlueTable
      SchemaChangePolicy:
        UpdateBehavior: UPDATE_IN_DATABASE
        DeleteBehavior: LOG
Replace <YOUR_CFN_BUCKET> with the actual value and deploy the CloudFormation stack with the following commands.
$ STACK_NAME=glue-crawler-test
$ aws cloudformation package \
--template-file template.yaml \
--s3-bucket <YOUR_CFN_BUCKET> \
--s3-prefix "$STACK_NAME/$(date +%Y)/$(date +%m)/$(date +%d)/$(date +%H)/$(date +%M)" \
--output-template-file package.template
$ aws cloudformation deploy \
--stack-name $STACK_NAME \
--template-file package.template \
--capabilities CAPABILITY_NAMED_IAM
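To confirm the stack was created successfully, you can optionally list its resources and their statuses.
$ aws cloudformation describe-stack-resources \
    --stack-name $STACK_NAME \
    | jq -r '.StackResources[] | "\(.ResourceType)\t\(.ResourceStatus)"'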
Updating Glue Crawler Target Table Setting
Since CloudFormation does not currently support configuring S3 event notifications for Glue crawlers, update the target table setting manually in the console. At the time of writing, AWS states:
"CloudFormation Support For S3 Event isn’t currently available. S3 Event Crawler’s integration with CloudFormation is in scope and in the works. We plan on releasing this coverage to Cloudformation some later this year. Thank you for patience."
Select the table and click Edit. Select Crawl based on events at Subsequent crawler runs.
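If you prefer the CLI, the same change can likely be made through the UpdateCrawler API. The following is a sketch rather than a verified recipe: the CRAWL_EVENT_MODE recrawl behavior and the EventQueueArn field on the catalog target are assumptions based on the current Glue API, so check them against your AWS CLI version, and replace <REGION> and <ACCOUNT_ID> with actual values.
$ # Assumption: RecrawlBehavior=CRAWL_EVENT_MODE and CatalogTargets[].EventQueueArn
$ # are supported by your AWS CLI version; <REGION>/<ACCOUNT_ID> are placeholders.
$ aws glue update-crawler \
    --name glue-crawler-test \
    --recrawl-policy RecrawlBehavior=CRAWL_EVENT_MODE \
    --targets '{"CatalogTargets":[{"DatabaseName":"glue-crawler-test-db","Tables":["glue-crawler-test-table"],"EventQueueArn":"arn:aws:sqs:<REGION>:<ACCOUNT_ID>:glue-crawler-test-queue"}]}'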
Testing
Updating the Glue Table Schema with JSON Version 1
Upload the sample JSON file with the following command to trigger an S3 event notification.
$ echo '{"message": "Hello World"}' > sample1.json
$ aws s3 cp sample1.json s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/
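Optionally, confirm that the event notification reached the queue before starting the crawler; the message count should be greater than zero (it may take a few seconds to update).
$ queue_url=$(aws sqs get-queue-url --queue-name glue-crawler-test-queue | jq -r '.QueueUrl')
$ aws sqs get-queue-attributes \
    --queue-url $queue_url \
    --attribute-names ApproximateNumberOfMessages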
Start the Glue crawler.
$ aws glue start-crawler --name glue-crawler-test
Check the status of the Glue crawler until it shows STOPPING.
$ aws glue get-crawler --name glue-crawler-test | jq -r '.Crawler.State'
STOPPING
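If you would rather poll automatically, a minimal loop such as the following waits until the crawler returns to the READY state, which indicates the run has fully finished.
$ # Poll every 10 seconds until the crawler is idle again.
$ until [ "$(aws glue get-crawler --name glue-crawler-test | jq -r '.Crawler.State')" = "READY" ]; do sleep 10; done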
You should see that the Glue table has been updated with the new schema.
$ aws glue get-table \
--database-name glue-crawler-test-db \
--name glue-crawler-test-table \
| jq '.Table.StorageDescriptor.Columns'
[
{
"Name": "message",
"Type": "string"
}
]
Updating the Glue Table Schema with JSON Version 2
Upload the second sample JSON file, which adds a statusCode field, with the following command to trigger another S3 event notification.
$ echo '{"message": "Hello World", "statusCode": 200}' > sample2.json
$ aws s3 cp sample2.json s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/
Start the Glue crawler.
$ aws glue start-crawler --name glue-crawler-test
Check the status of the Glue crawler until it shows STOPPING.
$ aws glue get-crawler --name glue-crawler-test | jq -r '.Crawler.State'
STOPPING
You should see that the Glue table has been updated with the new v2 schema.
$ aws glue get-table \
--database-name glue-crawler-test-db \
--name glue-crawler-test-table \
| jq '.Table.StorageDescriptor.Columns'
[
{
"Name": "message",
"Type": "string"
},
{
"Name": "statuscode",
"Type": "int"
}
]
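To see the effect on crawl time, you can also check the crawler metrics; LastRuntimeSeconds should reflect that only the newly uploaded objects were crawled rather than the whole bucket.
$ aws glue get-crawler-metrics \
    --crawler-name-list glue-crawler-test \
    | jq '.CrawlerMetricsList[0]'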
Checking SQS Message Count
Check the number of messages remaining in the queue. Because the crawler consumes the queued event messages when it runs, you should see ApproximateNumberOfMessages: 0.
$ queue_url=$(aws sqs get-queue-url --queue-name glue-crawler-test-queue | jq -r '.QueueUrl')
$ aws sqs get-queue-attributes \
--queue-url $queue_url \
--attribute-names ApproximateNumberOfMessages
{
"Attributes": {
"ApproximateNumberOfMessages": "0"
}
}
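If the count has not yet reached zero, the messages may simply be in flight; inspecting the not-visible and delayed counts as well helps distinguish unconsumed messages from ones currently being processed.
$ aws sqs get-queue-attributes \
    --queue-url $queue_url \
    --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible ApproximateNumberOfMessagesDelayed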
Cleaning Up
Clean up the provisioned AWS resources with the following commands. Note that the bucket must be emptied first, since CloudFormation cannot delete a non-empty bucket.
$ aws s3 rm s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/ --recursive
$ aws cloudformation delete-stack --stack-name $STACK_NAME
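Stack deletion is asynchronous; if you want to block until everything is gone, you can wait explicitly.
$ aws cloudformation wait stack-delete-complete --stack-name $STACK_NAME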
Conclusion
By crawling S3 buckets using event notifications, we can avoid full scans, leading to reduced crawling costs and improved performance.
I hope you will find this post useful.