Event-Driven AWS Glue Crawlers: Setting Up SQS-Based Triggers

Glue crawlers can crawl based on Amazon S3 event notifications instead of performing a full scan of the data store, which improves crawl performance and reduces cost. The events are delivered to an SQS queue, and the crawler consumes only the queued events on each run to decide what to re-crawl.
Building
Create a CloudFormation template with the following content:
AWSTemplateFormatVersion: '2010-09-09'
Description: Glue crawler test

Resources:
  SqsQueue:
    Type: AWS::SQS::Queue
    Properties:
      SqsManagedSseEnabled: true
      QueueName: glue-crawler-test-queue

  SqsQueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues:
        - !Ref SqsQueue
      PolicyDocument:
        Version: '2008-10-17'
        Id: __default_policy_ID
        Statement:
          - Effect: Allow
            Principal:
              AWS:
                - !Sub arn:aws:iam::${AWS::AccountId}:root
                - !GetAtt IAMRoleGlueCrawler.Arn
            Action: sqs:*
            Resource: !GetAtt SqsQueue.Arn
          - Effect: Allow
            Principal:
              Service: s3.amazonaws.com
            Action: sqs:*
            Resource: !GetAtt SqsQueue.Arn

  S3Bucket:
    Type: AWS::S3::Bucket
    DependsOn: SqsQueuePolicy
    Properties:
      BucketName: !Sub glue-crawler-test-${AWS::AccountId}-${AWS::Region}
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      NotificationConfiguration:
        QueueConfigurations:
          - Event: 's3:ObjectCreated:*'
            Queue: !GetAtt SqsQueue.Arn
      PublicAccessBlockConfiguration:
        BlockPublicAcls: TRUE
        BlockPublicPolicy: TRUE
        IgnorePublicAcls: TRUE
        RestrictPublicBuckets: TRUE

  IAMRoleGlueCrawler:
    Type: AWS::IAM::Role
    Properties:
      Path: /service-role/
      RoleName: !Sub glue-crawler-test-service-role
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: glue.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: cw-logs
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: "*"
        - PolicyName: glue
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - glue:CreateTable
                  - glue:GetDatabase
                  - glue:GetTable
                  - glue:UpdateTable
                Resource: "*"
        - PolicyName: s3
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject
                  - s3:ListBucket
                  - s3:PutObject
                Resource: '*'

  GlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: glue-crawler-test-db

  GlueTable:
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref GlueDatabase
      TableInput:
        Name: glue-crawler-test-table
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: json
        StorageDescriptor:
          Location: !Sub 's3://${S3Bucket}/'
          Compressed: false
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          SerdeInfo:
            SerializationLibrary: org.openx.data.jsonserde.JsonSerDe

  GlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: glue-crawler-test
      Role: !Sub service-role/${IAMRoleGlueCrawler}
      Targets:
        CatalogTargets:
          - DatabaseName: !Ref GlueDatabase
            Tables:
              - !Ref GlueTable
      SchemaChangePolicy:
        UpdateBehavior: UPDATE_IN_DATABASE
        DeleteBehavior: LOG
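Before packaging, you can optionally check the file for syntax errors with validate-template; note it validates the template structure only, not the resource properties:

aws cloudformation validate-template --template-body file://template.yaml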
Deploy the CloudFormation stack with the following command:
STACK_NAME=glue-crawler-test
aws cloudformation package \
  --template-file template.yaml \
  --s3-bucket <YOUR_CFN_BUCKET> \
  --s3-prefix "$STACK_NAME/$(date +%Y)/$(date +%m)/$(date +%d)/$(date +%H)/$(date +%M)" \
  --output-template-file package.template

aws cloudformation deploy \
  --stack-name $STACK_NAME \
  --template-file package.template \
  --capabilities CAPABILITY_NAMED_IAM
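Once the deployment completes, an optional sanity check is to confirm that the crawler exists and is idle; a freshly created crawler that has never run should report READY:

aws glue get-crawler --name glue-crawler-test --query 'Crawler.State' --output text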
Updating the Glue Crawler Target Table Setting
At the time of writing, CloudFormation does not support S3 event notifications for Glue crawlers, so the target table settings must be updated manually. The CloudFormation coverage roadmap issue tracks this gap:
https://github.com/aws-cloudformation/cloudformation-coverage-roadmap/issues/947
"CloudFormation support for S3 Event isn't currently available. S3 Event Crawler's integration with CloudFormation is in scope and in the works. We plan on releasing this coverage to CloudFormation later this year. Thank you for your patience."
Select the table and click Edit. In the Subsequent crawler runs section, choose Crawl based on events.
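If you prefer to stay on the command line, the same change can likely be made through the UpdateCrawler API. The sketch below assumes the Glue RecrawlPolicy value CRAWL_EVENT_MODE and the EventQueueArn field on catalog targets; verify both against your AWS CLI version before relying on it:

# Look up the ARN of the queue created by the stack.
queue_url=$(aws sqs get-queue-url --queue-name glue-crawler-test-queue | jq -r '.QueueUrl')
queue_arn=$(aws sqs get-queue-attributes \
  --queue-url $queue_url \
  --attribute-names QueueArn | jq -r '.Attributes.QueueArn')

# Switch the crawler to event mode and point its catalog target at the queue.
aws glue update-crawler \
  --name glue-crawler-test \
  --recrawl-policy RecrawlBehavior=CRAWL_EVENT_MODE \
  --targets "{\"CatalogTargets\":[{\"DatabaseName\":\"glue-crawler-test-db\",\"Tables\":[\"glue-crawler-test-table\"],\"EventQueueArn\":\"$queue_arn\"}]}"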
Testing
Feeding Version 1
Upload a sample JSON file to trigger an S3 event notification:
echo '{"message": "Hello World"}' > sample1.json
aws s3 cp sample1.json s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/
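Before starting the crawler, you can optionally confirm that the S3 event notification reached the queue; delivery is asynchronous, so the count may take a few seconds to become non-zero:

queue_url=$(aws sqs get-queue-url --queue-name glue-crawler-test-queue | jq -r '.QueueUrl')
aws sqs get-queue-attributes \
  --queue-url $queue_url \
  --attribute-names ApproximateNumberOfMessages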
Start the Glue crawler:
aws glue start-crawler --name glue-crawler-test
Monitor its status until it shows STOPPING:
$ aws glue get-crawler --name glue-crawler-test | jq -r '.Crawler.State'
STOPPING
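Instead of re-running the command by hand, a small polling loop (the 10-second interval is arbitrary) can wait until the crawler is idle again; the state returns to READY once the run has fully finished:

# Poll the crawler state every 10 seconds until it is idle again.
while [ "$(aws glue get-crawler --name glue-crawler-test | jq -r '.Crawler.State')" != "READY" ]; do
  sleep 10
done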
Verify the updated Glue table schema:
aws glue get-table \
  --database-name glue-crawler-test-db \
  --name glue-crawler-test-table \
  | jq '.Table.StorageDescriptor.Columns'
Expected output:
[
  {
    "Name": "message",
    "Type": "string"
  }
]
Feeding Version 2
Repeat the process with a new version of the JSON payload that adds a field:
echo '{"message": "Hello World", "statusCode": 200}' > sample2.json
aws s3 cp sample2.json s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/
Start the Glue crawler:
aws glue start-crawler --name glue-crawler-test
Monitor its status until it shows STOPPING:
$ aws glue get-crawler --name glue-crawler-test | jq -r '.Crawler.State'
STOPPING
Verify the updated Glue table schema:
aws glue get-table \
  --database-name glue-crawler-test-db \
  --name glue-crawler-test-table \
  | jq '.Table.StorageDescriptor.Columns'
Expected output:
[
  {
    "Name": "message",
    "Type": "string"
  },
  {
    "Name": "statuscode",
    "Type": "int"
  }
]
Checking SQS Message Count
Verify that no messages are left in the SQS queue:
queue_url=$(aws sqs get-queue-url --queue-name glue-crawler-test-queue | jq -r '.QueueUrl')
aws sqs get-queue-attributes \
  --queue-url $queue_url \
  --attribute-names ApproximateNumberOfMessages
{
  "Attributes": {
    "ApproximateNumberOfMessages": "0"
  }
}
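If the count is not 0, you can peek at a leftover message to inspect the raw S3 event; note that receive-message temporarily hides the message for the queue's visibility timeout:

aws sqs receive-message --queue-url $queue_url --max-number-of-messages 1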
Cleaning Up
Clean up all the AWS resources provisioned during this example with the following commands:
aws s3 rm s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/ --recursive
aws cloudformation delete-stack --stack-name $STACK_NAME
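If you want to block until everything is gone, the stack deletion can be awaited explicitly:

aws cloudformation wait stack-delete-complete --stack-name $STACK_NAME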