How to Query S3 Logs with Athena and Kinesis Data Firehose

How to Query S3 Logs with Athena and Kinesis Data Firehose

Takahiro Iwasa
Takahiro Iwasa
2 min read
Athena Firehose Kinesis

This note describes how to query objects stored in S3 using Athena.

System Architecture

Creating Kinesis Data Firehose

The custom S3 prefixes was introduced in February 2019, allowing you to specify Apache Hive-style prefixes for S3 object keys and use MSCK REPAIR TABLE to create partitions in Athena.

Press Create delivery stream and enter a name.

Choose Direct PUT or other sources.

Skip the record processing settings.

Select Amazon S3 as the destination.

Configure the prefix and error prefix with the following format:

  • Prefix: logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/
  • Error Prefix: error_logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/!{firehose:error-output-type}

Adjust Buffer size and Buffer interval based on your requirements.

Use GZIP compression to reduce storage costs.

Create or select an IAM role for the delivery stream.

Streaming Data

You can stream data to the Kinesis Data Firehose delivery stream programmatically.

Here’s an example using the AWS SDK for PHP: FirehoseClient#putRecord:

$client = new FirehoseClient([
'region' => '<AWS_REGION>',
'version' => 'latest',
]);
$data = [
'log_id' => 12345,
'url' => 'https://example.com',
];
$client->putRecord([
'DeliveryStreamName' => '<YOUR_STREAM>',
'Record' => [
'Data' => json_encode($data) . PHP_EOL,
],
]);

Creating Athena Table

Select Create table from S3 bucket data in the Athena console.

Enter a database name, table name, and the S3 path used by Firehose.

Specify JSON as the data format.

Define columns based on your data structure.

Configure partitions (e.g., year/month/day/hour) to improve query performance.

💡 Tip

Partitioning minimizes the data scanned, significantly reducing costs.

Load partitions with the following command:

MSCK REPAIR TABLE {TABLE_NAME};

Querying Athena Table

You can use SQL to query the data efficiently. Example query:

SELECT
*
FROM
table_name
WHERE
year = 2019
AND month = 8
AND day = 30
LIMIT 10;

Best Practices:

  • Always include partition keys in the WHERE clause.
  • Use the LIMIT statement to avoid unnecessary scans.
Takahiro Iwasa

Takahiro Iwasa

Software Developer
Involved in the requirements definition, design, and development of cloud-native applications using AWS. Japan AWS Top Engineers 2020-2023.