Athena Log Analysis with Kinesis Data Firehose and S3

Takahiro Iwasa

Using Athena, which is built on Presto, AWS users can query application logs stored in S3 and quickly build a log analysis environment.

Overview

This post does not cover AWS Glue Crawler, which is a useful tool for partitioning.

Creating S3 Bucket

Create a bucket for log files.
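
If you prefer to script this step, here is a minimal sketch using the AWS SDK for PHP (S3Client::createBucket); the region and bucket name are placeholders.

<?php

require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client([
    'region' => '<AWS_REGION>',
    'version' => 'latest',
]);

// Create the bucket that will receive log files from Firehose.
// Outside us-east-1, also pass CreateBucketConfiguration with a LocationConstraint.
$s3->createBucket([
    'Bucket' => '<YOUR_LOG_BUCKET>',
]);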

Creating Kinesis Data Firehose Delivery Stream

Custom prefix support was added on February 12th, 2019.

Before custom prefixes were supported, Firehose created files under the prefix path/to/YYYY/MM/DD. Now you can specify S3 object keys in Apache Hive format and use MSCK REPAIR TABLE to create partitions in Athena.

Press Create delivery stream.

Specify a stream name.

Select Direct PUT or other sources.

Skip the configuration for processing records.

Select S3 as destination.

Specify the following prefixes in Apache Hive format.

  • Prefix: logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/
  • Error prefix: error_logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/!{firehose:error-output-type}

For example, an object delivered at 12:00 UTC on August 30, 2019 lands under a key starting with logs/year=2019/month=08/day=30/hour=12/, which Athena can register as a partition.

Specify appropriate values for Buffer size and Buffer interval based on your requirements.

I strongly recommend using GZIP compression; it reduces both S3 storage and the amount of data Athena scans.

Create or select an IAM role for Firehose to assume.
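
The console steps above can also be scripted. The following is a minimal sketch using the AWS SDK for PHP (FirehoseClient::createDeliveryStream); the stream name, bucket, account ID, and role name are placeholders, and the buffering hints should be adjusted to your requirements.

<?php

require 'vendor/autoload.php';

use Aws\Firehose\FirehoseClient;

$firehose = new FirehoseClient([
    'region' => '<AWS_REGION>',
    'version' => 'latest',
]);

$firehose->createDeliveryStream([
    'DeliveryStreamName' => '<YOUR_STREAM>',
    'DeliveryStreamType' => 'DirectPut',
    'ExtendedS3DestinationConfiguration' => [
        'RoleARN' => 'arn:aws:iam::<ACCOUNT_ID>:role/<FIREHOSE_ROLE>',
        'BucketARN' => 'arn:aws:s3:::<YOUR_LOG_BUCKET>',
        // Apache Hive style prefixes so Athena can load partitions with MSCK REPAIR TABLE.
        'Prefix' => "logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/",
        'ErrorOutputPrefix' => "error_logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/!{firehose:error-output-type}",
        'BufferingHints' => [
            'SizeInMBs' => 5,
            'IntervalInSeconds' => 300,
        ],
        'CompressionFormat' => 'GZIP',
    ],
]);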

PHP Example for Streaming Data to Firehose

The example below demonstrates how to stream data to Firehose using the AWS SDK for PHP and the FirehoseClient::putRecord API.

<?php

require 'vendor/autoload.php';

use Aws\Firehose\FirehoseClient;

$client = new FirehoseClient([
    'region' => '<AWS_REGION>',
    'version' => 'latest',
]);

// Create a log record.
$data = [
    'log_id' => 12345,
    'url' => 'https://hoge/fuga/',
];

// Stream the record to Firehose.
$client->putRecord([
    'DeliveryStreamName' => '<YOUR_STREAM>',
    'Record' => [
        // A trailing newline is required because each record represents a single log entry.
        'Data' => json_encode($data) . PHP_EOL,
    ],
]);

Creating Athena Table

Select Create table from S3 bucket data.

Enter a database name, table name, and the S3 location to which Firehose streams data.

Specify JSON as the data format.

Specify appropriate columns based on your S3 data.

Configure how to partition it. The example here uses year/month/day/hour.

Select Load partitions from the menu and then execute MSCK REPAIR TABLE {TABLE_NAME};.
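
For reference, the wizard generates DDL roughly equivalent to the following; the database and table names (logs.application_logs) are placeholders, and the columns assume the example log record streamed earlier.

CREATE EXTERNAL TABLE IF NOT EXISTS logs.application_logs (
  log_id bigint,
  url string
)
PARTITIONED BY (
  year int,
  month int,
  day int,
  hour int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<YOUR_LOG_BUCKET>/logs/';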

Querying Athena Table using SQL

You can query data using SQL as usual. Keep the following points in mind when querying in Athena to avoid unexpectedly high costs, because Athena charges based on the amount of data scanned.

  • Include partition keys in the WHERE clause.
  • Add a LIMIT clause to prevent unnecessary scanning.
SELECT
  *
FROM
  table_name
WHERE
  year = 2019
  AND month = 8
  AND day = 30
LIMIT 10;

Conclusion

Using Athena, AWS users can quickly build a log analysis environment which is scalable, cost-effective, and highly available.

I hope you will find this post useful.

Takahiro Iwasa

Software Developer at KAKEHASHI Inc.
Involved in the requirements definition, design, and development of cloud-native applications using AWS. Currently building a new prescription data collection platform at KAKEHASHI Inc. Japan AWS Top Engineers 2020-2023.