Automating Partitioning in Athena Tables with Partition Projection

Takahiro Iwasa

Oct 8, 2021

3 min read

Athena

It has been more than 3 years since this post was published.

Introduction

Amazon Athena Partition Projection, introduced in June 2020, eliminates the need to run MSCK REPAIR TABLE to add new partitions manually. This blog post explores how you can set up and use Partition Projection effectively.

What is Partition Projection

Partition Projection allows Athena to calculate partition values and locations from a configuration rather than retrieving them from repositories like AWS Glue Data Catalog. This reduces query runtime for highly partitioned tables.

Key highlights from the AWS documentation:

Efficiency: “Partition projection can reduce the runtime of queries against highly partitioned tables.”
Automation: “Partition values and locations are calculated from configuration instead of being retrieved manually.”
Reduced Overhead: “Athena avoids calling GetPartitions, leveraging the partition projection configuration instead.”

For more details, see the official AWS documentation.

Setting Up AWS Resources

Below is an example CloudFormation template for implementing Partition Projection.

AWSTemplateFormatVersion: "2010-09-09"
Description: Stack for Athena partition projection sample
Resources:
  S3:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: athena-partition-projection-logs
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256

  GlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      DatabaseInput:
        Name: sample
      CatalogId: !Ref AWS::AccountId

  GlueTable:
    Type: AWS::Glue::Table
    Properties:
      DatabaseName: !Ref GlueDatabase
      CatalogId: !Ref AWS::AccountId
      TableInput:
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: json
          "projection.enabled": true
          "projection.year_month.format": yyyy/MM
          "projection.year_month.interval": 1
          "projection.year_month.interval.unit": MONTHS
          "projection.year_month.range": 2021/09,NOW
          "projection.year_month.type": date
          "storage.location.template": s3://athena-partition-projection-logs/${year_month}
        StorageDescriptor:
          Columns:
            - Name: id
              Type: int
            - Name: message
              Type: string
          Location: !Sub s3://${S3}/
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Compressed: false
          NumberOfBuckets: 0
          SerdeInfo:
            SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
            Parameters:
              paths: id,message
          StoredAsSubDirectories: false
        PartitionKeys:
          - Name: year_month
            Type: string
        Retention: 0
        Name: sample_logs

Key Point: `GlueTable.TableInput.Parameters`

The key point in the configuration is the GlueTable.TableInput.Parameters section of the GlueTable definition (lines 27–35). This section enables Partition Projection with the following important properties:

projection.enabled: Enables partition projection.
projection.year_month.format: Specifies the date format for partitions.
projection.year_month.range: Defines the range for partitions, supporting NOW with offsets (e.g., NOW+9HOURS).

The projection.year_month.range property is treated as UTC. If you intend to use a different timezone, you need to specify the time difference using a format like NOW+9HOURS. For more details, please refer to the official documentation.

Deployment

Deploy the above CloudFormation stack with the following command:

aws cloudformation deploy --template-file stack.yml --stack-name athena-partition-projection-sample

Testing the Setup

Upload data to the S3 bucket for testing:

echo '{"id": 1, "message": "hello"}' > 2021-09.json
echo '{"id": 2, "message": "world"}' > 2021-10.json
aws s3 cp 2021-09.json s3://athena-partition-projection-logs/2021/09/
aws s3 cp 2021-10.json s3://athena-partition-projection-logs/2021/10/

Verify that the files are uploaded:

aws s3 ls s3://athena-partition-projection-logs/2021/

Querying Data

Run the following SQL query to fetch data from the 2021/09 partition:

SELECT * FROM "sample"."sample_logs"
WHERE year_month = '2021/09'
LIMIT 10;

Expected result:

1	hello	2021/09

Similarly, query the 2021/10 partition:

SELECT * FROM "sample"."sample_logs"
WHERE year_month = '2021/10'
LIMIT 10;

Expected result:

2	world	2021/10

Cleaning Up

Clean up resources to avoid unnecessary charges:

aws s3 rm --recursive s3://athena-partition-projection-logs
aws cloudformation delete-stack --stack-name athena-partition-projection-sample

Conclusion

Partition Projection simplifies the management of Athena partitions, saving time and reducing complexity. It allows developers to focus on application logic without worrying about partition maintenance. By following the steps above, you can efficiently set up Partition Projection for your projects.

Happy Coding! 🚀