How to Query S3 Logs with Athena and Kinesis Data Firehose

This note describes how to query objects stored in S3 using Athena.
Creating Kinesis Data Firehose
The custom S3 prefixes was introduced in February 2019, allowing you to specify Apache Hive-style prefixes for S3 object keys and use MSCK REPAIR TABLE
to create partitions in Athena.
Press Create delivery stream
and enter a name.
Choose Direct PUT or other sources
.
Skip the record processing settings.
Select Amazon S3
as the destination.
Configure the prefix and error prefix with the following format:
- Prefix:
logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/
- Error Prefix:
error_logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/!{firehose:error-output-type}
Adjust Buffer size
and Buffer interval
based on your requirements.
Use GZIP
compression to reduce storage costs.
Create or select an IAM role for the delivery stream.
Streaming Data
You can stream data to the Kinesis Data Firehose delivery stream programmatically.
Here’s an example using the AWS SDK for PHP: FirehoseClient#putRecord:
$client = new FirehoseClient([ 'region' => '<AWS_REGION>', 'version' => 'latest',]);
$data = [ 'log_id' => 12345, 'url' => 'https://example.com',];
$client->putRecord([ 'DeliveryStreamName' => '<YOUR_STREAM>', 'Record' => [ 'Data' => json_encode($data) . PHP_EOL, ],]);
Creating Athena Table
Select Create table from S3 bucket data
in the Athena console.
Enter a database name, table name, and the S3 path used by Firehose.
Specify JSON
as the data format.
Define columns based on your data structure.
Configure partitions (e.g., year/month/day/hour
) to improve query performance.
Partitioning minimizes the data scanned, significantly reducing costs.
Load partitions with the following command:
MSCK REPAIR TABLE {TABLE_NAME};
Querying Athena Table
You can use SQL to query the data efficiently. Example query:
SELECT *FROM table_nameWHERE year = 2019 AND month = 8 AND day = 30LIMIT 10;
Best Practices:
- Always include partition keys in the
WHERE
clause. - Use the
LIMIT
statement to avoid unnecessary scans.