Explore AWS Snapshot Behavior of Agentless CNAPP Solutions using CloudTrail & Athena


Introduction

We recently evaluated several CNAPP solutions for AWS. Many were agent-based, some were hybrid, and some were completely agentless. The agentless solutions only needed AWS API access: they essentially work by periodically snapshotting your EBS volumes & scanning the snapshots for vulnerabilities. In this article, I will describe how we gained insight into the CNAPP product’s snapshotting behavior, in order to better understand & plan for integrating the product into our ecosystem.

By the way, if you’re unfamiliar with CNAPP solutions, check out my earlier article below:

Evaluating CNAPP Solutions for AWS

The Objective

From our numerous sessions with the product vendor & their extensive documentation, we had learnt exactly how they snapshot our volumes, how frequently they do so, and what performance & cost implications this might have for us. But we wanted to test it out for ourselves, not just take their word for it. 😊

In particular, we wanted to see how long the snapshots lived. This has direct cost implications for us, & since each snapshot is mounted in the CNAPP product’s SaaS environment for vulnerability scanning, we also wanted to know how long our data stayed in their environment.

The Approach

Since everything is done via AWS APIs, CloudTrail was the obvious tool of choice. All we had to do was query CloudTrail for snapshot create & delete events, and just like that, we would know the snapshots’ lifespans. But that’s easier said than done. 😊 This article describes all the challenges we faced in trying to accomplish this seemingly simple objective.
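Conceptually, the whole exercise boils down to pairing each snapshot’s CreateSnapshot event with its DeleteSnapshot event & diffing the two timestamps. A rough sketch of the idea is below; the create_events & delete_events tables are hypothetical stand-ins, and the real query against our CloudTrail logs appears later in the article:

-- Hypothetical sketch: one row per CreateSnapshot event in create_events,
-- one row per DeleteSnapshot event in delete_events; the difference between
-- the two event times is the snapshot's lifespan.
SELECT c.snapshotId,
       d.eventTime - c.eventTime AS lifespan
FROM create_events AS c
JOIN delete_events AS d ON c.snapshotId = d.snapshotId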

Acquire Queryable CloudTrail Logs

We have 90+ AWS accounts, and we integrated all of them with the CNAPP product. First, we needed a central location for all our CloudTrail logs so we could query all accounts at once. Fortunately, we already had a design in place for this: we aggregate the CloudTrail logs from all accounts into a single central S3 bucket. However, the way we did this became an issue for this activity. More on that later.

Up-to-date users of AWS might argue that there is no need to use Athena to query CloudTrail: you can simply use CloudTrail Lake, a newer AWS feature that provides an Athena-like query interface in the AWS console for querying CloudTrail events.

Unfortunately though, it cannot work with existing CloudTrail trails. You must configure new “event data stores” in which CloudTrail Lake will collect AWS API events & make them queryable, just like CloudTrail trails do. Due to this limitation, we had to use Athena to query our logs, which, I must say, is quite an overhead.
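For context, a CloudTrail Lake query looks roughly like the one below; instead of a table name, the FROM clause takes the ID of an event data store, shown here as a placeholder:

SELECT eventTime, eventName, recipientAccountId
FROM <event-data-store-id>
WHERE eventName IN ('CreateSnapshot', 'DeleteSnapshot')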

Athena + CloudTrail

The first step to querying CloudTrail with Athena is to create a table pointing to the S3 location of your CloudTrail logs. This might be trivial when dealing with small datasets, but when trying to work with years of CloudTrail logs from 90+ AWS accounts, it’s not easy to get it right.

Attempt #1

You start by letting AWS create a table for you. Just follow the official docs:

Using the CloudTrail console to create an Athena table for CloudTrail logs

When you try to query the resulting table, especially for large time spans, you’ll notice how extremely slow it is. That’s because this table is not partitioned!
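For example, even a query as narrow as the one below (assuming the auto-created table is named cloudtrail_logs) forces Athena to scan every log object in the bucket, because there are no partitions to prune:

-- Only one event type & one month of data is wanted, but without partitions
-- Athena still reads years of logs from all 90+ accounts to answer this.
SELECT eventtime, eventname, awsregion
FROM cloudtrail_logs
WHERE eventname = 'CreateSnapshot'
  AND eventtime LIKE '2022-05-%'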

Attempt #2

Now that you realize the value of data partitioning, let’s give it another go. Create a partitioned table by following the docs again:

Creating the table for CloudTrail logs in Athena using manual partitioning

This works great, but the table is partitioned by the date the CloudTrail logs were generated. Clearly, manually creating & loading a partition for every day of logs is not going to work for 2+ years of CloudTrail logs.
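To put the effort into perspective: with the documented region/year/month/day partition keys, every day of logs in every account needs its own statement along these lines (bucket, alias, account & date values are illustrative):

ALTER TABLE cloudtrail_logs ADD PARTITION
    (region = 'us-east-1', year = '2022', month = '05', day = '17')
LOCATION 's3://BUCKET/alias-1/AWSLogs/111122223333/CloudTrail/us-east-1/2022/05/17/'

Across 90+ accounts & 2+ years, that adds up to well over 60,000 partitions to create & keep up to date.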

Attempt #3

Partitioning is unavoidable for performance reasons, but what if we could automate partition management? That’s where “partition projection” comes in. Once again, follow the docs to create a table with projected partitions, so you don’t have to create or load partitions manually:

Creating the table for CloudTrail logs in Athena using partition projection

This works great but only for a single account, single region. 🙁

Attempt #4

After several hours spent trying to figure out a way to create an Athena table with partition projection that would work for our account structure, we finally gave up & reached out to AWS Enterprise Support. 😁 A day later, we had a solution. 🥳

We collect CloudTrail logs from our various accounts into a single S3 bucket by organizing them into the following directory structure:

s3://<bucket>/<account-alias>/AWSLogs/<account-id>/CloudTrail/<region>/<year>/<month>/<day>

The Athena table’s partitioning scheme has to account for this structure, so here is the CREATE TABLE query we ended up using:

CREATE EXTERNAL TABLE cloudtrail_logs (
    eventVersion STRING,
    userIdentity STRUCT<
        type: STRING,
        principalId: STRING,
        arn: STRING,
        accountId: STRING,
        invokedBy: STRING,
        accessKeyId: STRING,
        userName: STRING,
        sessionContext: STRUCT<
            attributes: STRUCT<
                mfaAuthenticated: STRING,
                creationDate: STRING>,
            sessionIssuer: STRUCT<
                type: STRING,
                principalId: STRING,
                arn: STRING,
                accountId: STRING,
                userName: STRING>>>,
    eventTime STRING,
    eventSource STRING,
    eventName STRING,
    awsRegion STRING,
    sourceIpAddress STRING,
    userAgent STRING,
    errorCode STRING,
    errorMessage STRING,
    requestParameters STRING,
    responseElements STRING,
    additionalEventData STRING,
    requestId STRING,
    eventId STRING,
    readOnly STRING,
    resources ARRAY<STRUCT<
        arn: STRING,
        accountId: STRING,
        type: STRING>>,
    eventType STRING,
    apiVersion STRING,
    recipientAccountId STRING,
    serviceEventDetails STRING,
    sharedEventID STRING,
    vpcendpointid STRING
)
PARTITIONED BY (
    `alias` string,
    `account` string,
    `region` string,
    `timestamp` string
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://BUCKET/'
TBLPROPERTIES (
    'projection.enabled' = 'true',
    'projection.alias.type' = 'injected',
    'projection.account.type' = 'injected',
    'projection.region.type' = 'enum',
    'projection.region.values' = 'us-east-1',
    'projection.timestamp.format' = 'yyyy/MM/dd',
    'projection.timestamp.interval' = '1',
    'projection.timestamp.interval.unit' = 'DAYS',
    'projection.timestamp.range' = '2022/01/01,NOW',
    'projection.timestamp.type' = 'date',
    'storage.location.template' = 's3://BUCKET/${alias}/AWSLogs/${account}/CloudTrail/${region}/${timestamp}'
)

Notice how the table uses “injected” partition projection for the alias & account keys. This was the only way to make partition projection work with our S3 directory structure. Unfortunately, it has a side effect: every query against this table must pin all injected partition keys to exact values in its WHERE clause.
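For example, even a trivial query against this table has to spell out the alias & the account ID, along these lines (the alias & account values below are placeholders):

-- The injected partition keys (alias, account) must be pinned to exact
-- values; region & "timestamp" just benefit from normal partition pruning.
SELECT eventtime, eventname
FROM cloudtrail_logs
WHERE alias = 'alias-1'
  AND account = '111122223333'
  AND region = 'us-east-1'
  AND "timestamp" >= '2022/05/01'
  AND eventname = 'CreateSnapshot'
LIMIT 10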

Query CloudTrail

Now that our Athena table was ready, we could start querying CloudTrail logs through it. But because of the mandatory WHERE clause imposed by the injected partition keys, we needed a separate query for each one of our 90+ accounts.

Since we were going to combine the query results of all accounts into one dataset anyway, we decided to build a single giant query that would UNION the results of all 90+ account-specific queries into one resultset.

Even though it’s a one-time activity, we did not want to copy-paste 90+ queries & change their WHERE clause values by hand. That was too error prone. It’s much more fun to automate the query generation itself using a Python script! Here is that script:

#!/usr/bin/env python3
# Generate one giant Athena query that UNIONs a per-account subquery
# for every AWS account.
# Account aliases & numbers taken from the central CloudTrail bucket.

ACCOUNT_ALIASES = ['alias-1', 'alias-2', 'alias-3']
ACCOUNT_NUMBERS = ['111122223333', '444455556666', '777788889999']

# Build one subquery per account, pinning the injected partition keys
# (alias & account) to that account's values. ct1 provides the
# CreateSnapshot event, ct2 the matching DeleteSnapshot event.
subqueries = []
for alias, account in zip(ACCOUNT_ALIASES, ACCOUNT_NUMBERS):
    subqueries.append(f"""\
  SELECT ct1.alias AS accountName,
    JSON_EXTRACT_SCALAR(ct1.responseelements, '$.ownerId') AS accountNo,
    JSON_EXTRACT_SCALAR(ct1.requestparameters, '$.volumeId') AS volumeId,
    JSON_EXTRACT_SCALAR(ct1.responseelements, '$.volumeSize') AS volumeSizeGb,
    JSON_EXTRACT_SCALAR(ct1.responseelements, '$.snapshotId') AS snapshotId,
    FROM_ISO8601_TIMESTAMP(ct2.eventtime) - FROM_ISO8601_TIMESTAMP(ct1.eventtime) AS snapshotDuration,
    ct1.eventtime AS snapshotCreated, ct2.eventtime AS snapshotDeleted,
    ct1.useridentity.sessioncontext.sessionissuer.username AS ct1username,
    ct2.useridentity.sessioncontext.sessionissuer.username AS ct2username,
    ct1.eventname AS ct1eventname, ct2.eventname AS ct2eventname
  FROM cloudtrail_logs AS ct1 JOIN cloudtrail_logs AS ct2
  ON JSON_EXTRACT_SCALAR(ct1.responseelements, '$.snapshotId') =
     JSON_EXTRACT_SCALAR(ct2.requestparameters, '$.snapshotId')
  WHERE ct1.alias = '{alias}' AND ct1.account = '{account}'
  AND   ct2.alias = '{alias}' AND ct2.account = '{account}'""")

SUBQUERY = '\n) UNION ALL (\n'.join(subqueries)

# The outer query filters for the CNAPP product's role & event names,
# then sorts the combined results by snapshot lifespan.
QUERY = f"""\
SELECT accountName, accountNo, volumeId, volumeSizeGb, snapshotId,
       snapshotDuration, snapshotCreated, snapshotDeleted
FROM ((
{SUBQUERY}
))
WHERE ct1username = 'CNAPPAccessRole' AND ct2username = 'CNAPPAccessRole'
AND ct1eventname = 'CreateSnapshot' AND ct2eventname = 'DeleteSnapshot'
ORDER BY snapshotDuration DESC LIMIT 10000"""

print(QUERY)

The Athena query generated by this script is as follows:

SELECT accountName, accountNo, volumeId, volumeSizeGb, snapshotId,
       snapshotDuration, snapshotCreated, snapshotDeleted
FROM ((
  SELECT ct1.alias AS accountName,
    JSON_EXTRACT_SCALAR(ct1.responseelements, '$.ownerId') AS accountNo,
    JSON_EXTRACT_SCALAR(ct1.requestparameters, '$.volumeId') AS volumeId,
    JSON_EXTRACT_SCALAR(ct1.responseelements, '$.volumeSize') AS volumeSizeGb,
    JSON_EXTRACT_SCALAR(ct1.responseelements, '$.snapshotId') AS snapshotId,
    FROM_ISO8601_TIMESTAMP(ct2.eventtime) - FROM_ISO8601_TIMESTAMP(ct1.eventtime) AS snapshotDuration,
    ct1.eventtime AS snapshotCreated, ct2.eventtime AS snapshotDeleted,
    ct1.useridentity.sessioncontext.sessionissuer.username AS ct1username,
    ct2.useridentity.sessioncontext.sessionissuer.username AS ct2username,
    ct1.eventname AS ct1eventname, ct2.eventname AS ct2eventname
  FROM cloudtrail_logs AS ct1 JOIN cloudtrail_logs AS ct2
  ON JSON_EXTRACT_SCALAR(ct1.responseelements, '$.snapshotId') =
     JSON_EXTRACT_SCALAR(ct2.requestparameters, '$.snapshotId')
  WHERE ct1.alias = 'alias-1' AND ct1.account = '111122223333'
  AND   ct2.alias = 'alias-1' AND ct2.account = '111122223333'
) UNION ALL (
  SELECT ct1.alias AS accountName,
    JSON_EXTRACT_SCALAR(ct1.responseelements, '$.ownerId') AS accountNo,
    JSON_EXTRACT_SCALAR(ct1.requestparameters, '$.volumeId') AS volumeId,
    JSON_EXTRACT_SCALAR(ct1.responseelements, '$.volumeSize') AS volumeSizeGb,
    JSON_EXTRACT_SCALAR(ct1.responseelements, '$.snapshotId') AS snapshotId,
    FROM_ISO8601_TIMESTAMP(ct2.eventtime) - FROM_ISO8601_TIMESTAMP(ct1.eventtime) AS snapshotDuration,
    ct1.eventtime AS snapshotCreated, ct2.eventtime AS snapshotDeleted,
    ct1.useridentity.sessioncontext.sessionissuer.username AS ct1username,
    ct2.useridentity.sessioncontext.sessionissuer.username AS ct2username,
    ct1.eventname AS ct1eventname, ct2.eventname AS ct2eventname
  FROM cloudtrail_logs AS ct1 JOIN cloudtrail_logs AS ct2
  ON JSON_EXTRACT_SCALAR(ct1.responseelements, '$.snapshotId') =
     JSON_EXTRACT_SCALAR(ct2.requestparameters, '$.snapshotId')
  WHERE ct1.alias = 'alias-2' AND ct1.account = '444455556666'
  AND   ct2.alias = 'alias-2' AND ct2.account = '444455556666'
) UNION ALL (
  SELECT ct1.alias AS accountName,
    JSON_EXTRACT_SCALAR(ct1.responseelements, '$.ownerId') AS accountNo,
    JSON_EXTRACT_SCALAR(ct1.requestparameters, '$.volumeId') AS volumeId,
    JSON_EXTRACT_SCALAR(ct1.responseelements, '$.volumeSize') AS volumeSizeGb,
    JSON_EXTRACT_SCALAR(ct1.responseelements, '$.snapshotId') AS snapshotId,
    FROM_ISO8601_TIMESTAMP(ct2.eventtime) - FROM_ISO8601_TIMESTAMP(ct1.eventtime) AS snapshotDuration,
    ct1.eventtime AS snapshotCreated, ct2.eventtime AS snapshotDeleted,
    ct1.useridentity.sessioncontext.sessionissuer.username AS ct1username,
    ct2.useridentity.sessioncontext.sessionissuer.username AS ct2username,
    ct1.eventname AS ct1eventname, ct2.eventname AS ct2eventname
  FROM cloudtrail_logs AS ct1 JOIN cloudtrail_logs AS ct2
  ON JSON_EXTRACT_SCALAR(ct1.responseelements, '$.snapshotId') =
     JSON_EXTRACT_SCALAR(ct2.requestparameters, '$.snapshotId')
  WHERE ct1.alias = 'alias-3' AND ct1.account = '777788889999'
  AND   ct2.alias = 'alias-3' AND ct2.account = '777788889999'
))
WHERE ct1username = 'CNAPPAccessRole' AND ct2username = 'CNAPPAccessRole'
AND ct1eventname = 'CreateSnapshot' AND ct2eventname = 'DeleteSnapshot'
ORDER BY snapshotDuration DESC LIMIT 10000

Run the query in Athena & after a few minutes, you should have your results:

| accountName | accountNo | volumeId | volumeSizeGb | snapshotId | snapshotDuration (days hh:mm:ss) | snapshotCreated | snapshotDeleted |
| alias-1 | 111122223333 | vol-11111aa1a11aaa11a | 500 | snap-11111aa1a11aaa11a | 1 08:50:55.000 | 2022-05-17T08:32:25Z | 2022-05-18T17:23:20Z |
| alias-2 | 444455556666 | vol-22222bb2b22bbb22b | 500 | snap-22222bb2b22bbb22b | 1 07:33:27.000 | 2022-05-23T11:00:41Z | 2022-05-24T18:34:08Z |
| alias-3 | 777788889999 | vol-33333cc3c33ccc33c | 500 | snap-33333cc3c33ccc33c | 1 07:21:18.000 | 2022-05-19T03:50:23Z | 2022-05-20T11:11:41Z |

Conclusion

And there you have it. If you ever find yourself running advanced Athena queries over large datasets & struggling with query performance or schema design, I hope this article taught you a few tips & tricks you can use to get stuff done. 😊

About the Author ✍🏻

Harish KM is a Principal DevOps Engineer at QloudX & a top-ranked AWS Ambassador since 2020. 👨🏻‍💻

With over a decade of industry experience as everything from a full-stack engineer to a cloud architect, Harish has built many world-class solutions for clients around the world! 👷🏻‍♂️

With over 20 certifications in cloud (AWS, Azure, GCP), containers (Kubernetes, Docker) & DevOps (Terraform, Ansible, Jenkins), Harish is an expert in a multitude of technologies. 📚

These days, his focus is on the fascinating world of DevOps & how it can transform the way we do things! 🚀
