Athena Log Analysis with Kinesis Data Firehose and S3
AWS users can query application logs stored in S3 using Athena, a serverless query service built on Presto. With it, we can quickly build a log analysis environment.
Overview
This post does not cover AWS Glue Crawler, which is another useful tool for partitioning.
Creating S3 Bucket
Create a bucket for log files.
Creating Kinesis Data Firehose Delivery Stream
Custom prefix support was added on February 12th, 2019. Before custom prefixes were supported, Firehose wrote objects under a fixed prefix of the form `path/to/YYYY/MM/DD`. Now you can specify Apache Hive-format S3 object key prefixes and run `MSCK REPAIR TABLE` to create the corresponding partitions in Athena.
- Press `Create delivery stream`.
- Specify a stream name.
- Select `Direct PUT or other sources`.
- Skip the configuration to process records.
- Select `S3` as the destination.
Specify the following prefixes in Apache Hive format.

| Field | Value |
|---|---|
| Prefix | `logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/` |
| Error prefix | `error_logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/!{firehose:error-output-type}` |
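With this prefix, delivered objects land under Hive-style partition directories, which is what allows `MSCK REPAIR TABLE` to discover them later. For example, a record delivered at 2019-08-30 05:00 UTC would be written under a key like the following (the object name itself is generated by Firehose; the suffix shown here is only illustrative):

```
logs/year=2019/month=08/day=30/hour=05/my-stream-1-2019-08-30-05-00-00-example
```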
- Specify appropriate values for `Buffer size` and `Buffer interval` based on your requirements.
- I strongly recommend enabling `GZIP` compression: Athena is billed by the amount of data scanned, so compressed objects reduce both storage and query costs.
- Create or select an IAM role that Firehose will assume.
PHP Example for Streaming Data to Firehose
The example below demonstrates how to stream data to Firehose using the AWS SDK for PHP and the `FirehoseClient::putRecord` API.
```php
require 'vendor/autoload.php';

use Aws\Firehose\FirehoseClient;

$client = new FirehoseClient([
    'region' => '<AWS_REGION>',
    'version' => 'latest',
]);

// Create a log record.
$data = [
    'log_id' => 12345,
    'url' => 'https://hoge/fuga/',
];

// Stream the record to Firehose.
$client->putRecord([
    'DeliveryStreamName' => '<YOUR_STREAM>',
    'Record' => [
        // A trailing newline is required so that each record is stored
        // as a single line of JSON.
        'Data' => json_encode($data) . PHP_EOL,
    ],
]);
```
Creating Athena Table
- Select `Create table from S3 bucket data`.
- Enter a database name, a table name, and the S3 location to which Firehose streams data.
- Specify `JSON` as the data format.
- Specify appropriate columns based on your S3 data.
- Configure how to partition the table. The example here uses `year/month/day/hour`.
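For reference, the wizard generates a DDL statement roughly like the following. This is a sketch assuming the columns from the PHP example above; the table name `logs` and the bucket name are placeholders.

```sql
CREATE EXTERNAL TABLE logs (
  log_id bigint,
  url string
)
PARTITIONED BY (
  year int,
  month int,
  day int,
  hour int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<YOUR_BUCKET>/logs/';
```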
Finally, select `Load partitions` from the table menu and execute `MSCK REPAIR TABLE {TABLE_NAME};` to register the existing partitions.
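`MSCK REPAIR TABLE` scans the table's S3 location and registers every Hive-style partition it finds. If you would rather register partitions one at a time (for example, from a job that runs each hour), `ALTER TABLE ... ADD PARTITION` is an alternative; a minimal sketch, assuming the table above:

```sql
ALTER TABLE logs ADD IF NOT EXISTS
PARTITION (year = 2019, month = 8, day = 30, hour = 5)
LOCATION 's3://<YOUR_BUCKET>/logs/year=2019/month=08/day=30/hour=05/';
```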
Querying Athena Table using SQL
You can query the data using SQL as usual. Keep the following points in mind when querying in Athena to avoid unexpectedly high costs.

- Include partition keys in the `WHERE` clause so that Athena scans only the matching partitions.
- Add a `LIMIT` clause to prevent unnecessary scanning.
```sql
SELECT
    *
FROM
    table_name
WHERE
    year = 2019
    AND month = 8
    AND day = 30
LIMIT 10;
```
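Partition pruning works the same way for any query shape. For example, a simple aggregation over a single day's partitions (a hypothetical example assuming the `url` column from the table above):

```sql
SELECT
    url,
    COUNT(*) AS hits
FROM
    table_name
WHERE
    year = 2019
    AND month = 8
    AND day = 30
GROUP BY
    url
ORDER BY
    hits DESC
LIMIT 10;
```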
Conclusion
Using Athena, AWS users can quickly build a log analysis environment that is scalable, cost-effective, and highly available.
I hope you find this post useful.