Automating of Partitioning Athena Tables using Partition Projection
Athena Partition Projection, introduced in June 2020, enables automatic partitioning of Athena tables. Using this feature, MSCK REPAIR TABLE
query is no longer necessary to add new partitions.
Overview
Quoting from the official AWS documentation:
In partition projection, partition values and locations are calculated from configuration rather than read from a repository like the AWS Glue Data Catalog. 1
partition projection can reduce the runtime of queries against highly partitioned tables. 2
Partition projection allows Athena to avoid calling
GetPartitions
because the partition projection configuration gives Athena all of the necessary information to build the partitions itself. 3
Creating AWS Resources
Create a CloudFormation template with the following content.
The key point is GlueTable.TableInput.Parameters
of GlueTable
definition (lines 27-35).
projection.year_month.range
property is treated as UTC. If you intend to use a different timezone in your S3 object path, you need to specify the time difference using a format like NOW+9HOURS
. For more details, please refer to the official documentation.
AWSTemplateFormatVersion: "2010-09-09"
Description: Stack for Athena partition projection sample
Resources:
S3:
Type: AWS::S3::Bucket
Properties:
BucketName: athena-partition-projection-logs
BucketEncryption:
ServerSideEncryptionConfiguration:
- ServerSideEncryptionByDefault:
SSEAlgorithm: AES256
GlueDatabase:
Type: AWS::Glue::Database
Properties:
DatabaseInput:
Name: sample
CatalogId: !Ref AWS::AccountId
GlueTable:
Type: AWS::Glue::Table
Properties:
DatabaseName: !Ref GlueDatabase
CatalogId: !Ref AWS::AccountId
TableInput:
TableType: EXTERNAL_TABLE
Parameters:
classification: json
"projection.enabled": true
"projection.year_month.format": yyyy/MM
"projection.year_month.interval": 1
"projection.year_month.interval.unit": MONTHS
"projection.year_month.range": 2021/09,NOW
"projection.year_month.type": date
"storage.location.template": s3://athena-partition-projection-logs/${year_month}
StorageDescriptor:
Columns:
- Name: id
Type: int
- Name: message
Type: string
Location: !Sub s3://${S3}/
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: false
NumberOfBuckets: 0
SerdeInfo:
SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
Parameters:
paths: id,message
StoredAsSubDirectories: false
PartitionKeys:
- Name: year_month
Type: string
Retention: 0
Name: sample_logs
Deploy the CloudFormation stack with the following command.
aws cloudformation deploy --template-file stack.yml --stack-name athena-partition-projection-sample
Testing
By running the following command, put objects containing the example JSON below.
$ echo '{"id": 1, "message": "hello"}' > 2021-09.json
$ echo '{"id": 2, "message": "world"}' > 2021-10.json
$ aws s3 cp 2021-09.json s3://athena-partition-projection-logs/2021/09/
$ aws s3 cp 2021-10.json s3://athena-partition-projection-logs/2021/10/
$ aws s3 ls s3://athena-partition-projection-logs/2021/
PRE 09/
PRE 10/
Let’s query the data in the 2021/09
partition.
SELECT * FROM "sample"."sample_logs"
WHERE year_month = '2021/09'
LIMIT 10;
Result:
1 hello 2021/09
Next, let’s query the data in the 2021/10
partition.
SELECT * FROM "sample"."sample_logs"
WHERE year_month = '2021/10'
LIMIT 10;
Result:
2 world 2021/10
Cleaning Up
Clean up the provisioned AWS resources with the following command.
aws s3 rm --recursive s3://athena-partition-projection-logs
aws cloudformation delete-stack --stack-name athena-partition-projection-sample
Conclusion
Athena Partition Projection can avoid AWS users from managing partitions, which enables us to focus more on important parts of our applications.
I hope you will find this post useful.