Athena Log Analysis with Kinesis Data Firehose and S3
AWS users can query application logs stored in S3 using Athena, a serverless query service built on Presto. With it, we can quickly build a log analysis environment.
Overview
This post does not cover AWS Glue Crawler, which is another useful tool for partitioning.
Creating S3 Bucket
Create a bucket for log files.
Creating Kinesis Data Firehose Delivery Stream
Custom prefix support was added on February 12th, 2019. Before custom prefixes were supported, Firehose wrote objects under a fixed prefix of the form `path/to/YYYY/MM/DD`. Now you can specify Apache Hive-format S3 object key prefixes and run `MSCK REPAIR TABLE` to create the corresponding partitions in Athena.
- Press `Create delivery stream`.
- Specify a stream name.
- Select `Direct PUT or other sources`.
- Skip the configuration to process records.
- Select `S3` as the destination.
Specify the following prefixes in Apache Hive format.

| Field | Value |
|---|---|
| Prefix | `logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/` |
| Error prefix | `error_logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/!{firehose:error-output-type}` |
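With this prefix, delivered objects land under Hive-style partition directories, which is what allows `MSCK REPAIR TABLE` to discover them later. For example, a record delivered at 2019-08-30 05:00 UTC would be written under a key like the following (the object name itself is generated by Firehose; the suffix shown here is only illustrative):

```
logs/year=2019/month=08/day=30/hour=05/my-stream-1-2019-08-30-05-00-00-example
```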
- Specify appropriate values for `Buffer size` and `Buffer interval` based on your requirements.
- I strongly recommend enabling `GZIP` compression: Athena is billed by the amount of data scanned, so compressed objects reduce both storage and query costs.
- Create or select an IAM role that Firehose will assume.
PHP Example for Streaming Data to Firehose
The example below demonstrates how to stream data to Firehose using the AWS SDK for PHP and the `FirehoseClient::putRecord` API.
```php
require 'vendor/autoload.php';

use Aws\Firehose\FirehoseClient;

$client = new FirehoseClient([
    'region' => '<AWS_REGION>',
    'version' => 'latest',
]);

// Create a log record.
$data = [
    'log_id' => 12345,
    'url' => 'https://hoge/fuga/',
];

// Stream the record to Firehose.
$client->putRecord([
    'DeliveryStreamName' => '<YOUR_STREAM>',
    'Record' => [
        // A trailing newline is required so that each record is stored
        // as a single line of JSON.
        'Data' => json_encode($data) . PHP_EOL,
    ],
]);
```
Creating Athena Table
- Select `Create table from S3 bucket data`.
- Enter a database name, a table name, and the S3 location to which Firehose streams data.
- Specify `JSON` as the data format.
- Specify appropriate columns based on your S3 data.
- Configure how to partition the table. The example here uses `year/month/day/hour`.
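For reference, the wizard generates a DDL statement roughly like the following. This is a sketch assuming the columns from the PHP example above; the table name `logs` and the bucket name are placeholders.

```sql
CREATE EXTERNAL TABLE logs (
  log_id bigint,
  url string
)
PARTITIONED BY (
  year int,
  month int,
  day int,
  hour int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<YOUR_BUCKET>/logs/';
```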
Finally, select `Load partitions` from the table menu and execute `MSCK REPAIR TABLE {TABLE_NAME};` to register the existing partitions.
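`MSCK REPAIR TABLE` scans the table's S3 location and registers every Hive-style partition it finds. If you would rather register partitions one at a time (for example, from a job that runs each hour), `ALTER TABLE ... ADD PARTITION` is an alternative; a minimal sketch, assuming the table above:

```sql
ALTER TABLE logs ADD IF NOT EXISTS
PARTITION (year = 2019, month = 8, day = 30, hour = 5)
LOCATION 's3://<YOUR_BUCKET>/logs/year=2019/month=08/day=30/hour=05/';
```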
Querying Athena Table using SQL
You can query the data using SQL as usual. Keep the following points in mind when querying in Athena to avoid unexpectedly high costs.

- Include partition keys in the `WHERE` clause so that Athena scans only the matching partitions.
- Add a `LIMIT` clause to prevent unnecessary scanning.
```sql
SELECT
    *
FROM
    table_name
WHERE
    year = 2019
    AND month = 8
    AND day = 30
LIMIT 10;
```
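Partition pruning works the same way for any query shape. For example, a simple aggregation over a single day's partitions (a hypothetical example assuming the `url` column from the table above):

```sql
SELECT
    url,
    COUNT(*) AS hits
FROM
    table_name
WHERE
    year = 2019
    AND month = 8
    AND day = 30
GROUP BY
    url
ORDER BY
    hits DESC
LIMIT 10;
```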
Conclusion
Using Athena, AWS users can quickly build a log analysis environment that is scalable, cost-effective, and highly available.
I hope you find this post useful.