Skip to content
>GLB_
Go back

Optimizing Amazon Athena Queries with Partitions: A Practical Example

When working with Amazon Athena, one of the most effective strategies to improve query performance and reduce costs is partitioning your data. Partitions allow Athena to scan only the relevant portions of your dataset instead of reading everything, which is critical when you deal with large files stored in Amazon S3.

In this post, we will explore how to choose a good partitioning strategy using an anonymized dataset.

The Dataset

Suppose we are tracking user activity in an online learning platform. The dataset looks like this:

Level,Section,Lesson_URL,Completions,Period
Level 1,1.1 Section A,https://platform.org/course-1/lesson-1/,140,2025-01
Level 1,1.1 Section A,https://platform.org/course-1/lesson-2/,120,2025-01
Level 1,1.2 Section B,https://platform.org/course-2/lesson-1/,55,2025-01
Level 1,1.2 Section B,https://platform.org/course-2/lesson-2/,61,2025-01
Level 2,2.1 Section C,https://platform.org/course-3/lesson-1/,33,2025-01
Level 2,2.2 Section D,https://platform.org/course-4/lesson-1/,32,2025-01
...

Each row represents a lesson, the number of completions (Completions), and the reporting Period (month).

Why Partitioning Matters

If we query this dataset directly without partitioning, Athena will scan all records every time. That becomes inefficient when historical data grows over months or years.

The key is to select a column with low cardinality, frequent filtering, and natural grouping. In this case, the column Period is an excellent candidate:

Creating the Partitioned Table in Athena

CREATE EXTERNAL TABLE IF NOT EXISTS analytics.course_activity (
  Level STRING,
  Section STRING,
  Lesson_URL STRING,
  Completions INT
)
PARTITIONED BY (
  Period STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://your-bucket/activity-data/'
TBLPROPERTIES ('has_encrypted_data'='false');

Loading Partitions

If your S3 structure is organized by Period:

s3://your-bucket/activity-data/Period=2025-01/
s3://your-bucket/activity-data/Period=2025-02/

You can register partitions automatically:

MSCK REPAIR TABLE analytics.course_activity;

Or add them manually:

ALTER TABLE analytics.course_activity
ADD PARTITION (Period='2025-01')
LOCATION 's3://your-bucket/activity-data/Period=2025-01/';

Querying the Data

SELECT Section, SUM(Completions) AS total_completions
FROM analytics.course_activity
WHERE Period = '2025-01'
GROUP BY Section;

This query will scan only the partition for January 2025, instead of the entire dataset.

Key Takeaways

By applying these practices, you will reduce query costs and improve performance in Athena.


Reference:
Amazon Athena – Partitions


Share this post:

Previous Post
Incremental Data Loads: Choosing Between resource_version and created_at/updated_at
Next Post
Running Apache Airflow Across Environments