Choosing Between saveAsTable and Iceberg’s writeTo in AWS Glue and Athena

When working with Spark on AWS Glue, there are multiple ways to persist DataFrames as tables and make them queryable in Amazon Athena. Two common approaches are:

Using Spark’s Hive-style saveAsTable
Using Apache Iceberg’s writeTo API

At first glance they may look similar, but they solve different problems and have distinct implications for scalability, schema evolution, and data management.

1. Writing with `saveAsTable`

A simple Spark DataFrame write might look like this:

df.write.mode("overwrite").saveAsTable("staging.budget")

What happens

Spark writes the dataset to storage (e.g., S3) in Parquet format by default.
The table is registered in the Glue Data Catalog, making it visible in Athena.
With mode("overwrite"), Spark drops the existing table data and replaces it with the contents of the new DataFrame.

Characteristics

Simple and widely supported.
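Schema evolution is limited: the table keeps no schema history, so changes such as renaming or dropping a column generally mean rewriting the data.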
Partitioning requires manual setup (.partitionBy() before .saveAsTable()).
No transactional guarantees. If a job fails mid-write, the table can be left in an inconsistent state.
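
The manual partitioning step listed above can be sketched as follows. This is a minimal illustration, assuming `df` is an existing Spark DataFrame; the helper name `save_partitioned` and the `year` column in the usage note are hypothetical and not part of any library.

```python
def save_partitioned(df, table_name, partition_cols, mode="overwrite"):
    """Write df as a Hive-style table partitioned by partition_cols.

    df is assumed to be a pyspark.sql.DataFrame. The table lands in whatever
    catalog the Spark session is configured with (the Glue Data Catalog in a
    typical Glue job).
    """
    if not partition_cols:
        raise ValueError("partition_cols must contain at least one column name")
    # partitionBy must be called on the writer before saveAsTable.
    writer = df.write.mode(mode).partitionBy(*partition_cols)
    writer.saveAsTable(table_name)
```

A call such as `save_partitioned(df, "staging.budget", ["year"])` would write one directory per distinct `year` value. Note that with `mode="overwrite"` the whole table is still replaced, not only the partitions present in `df`.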

This approach works best when datasets are small-to-medium sized, updates are full replacements, and advanced transactional features are not required.

2. Writing with Iceberg’s `writeTo`

A more modern way is to write Apache Iceberg tables through Spark's DataFrameWriterV2 API, df.writeTo:

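# spark.catalog.tableExists is public in PySpark 3.3+; on older versions
# spark.catalog._jcatalog.tableExists(qualified_table) is the usual workaround.
if spark.catalog.tableExists(qualified_table):
    # Replace only the partitions that appear in df.
    df.writeTo(qualified_table).overwritePartitions()
    print(f"Data written/overwritten to {qualified_table}")
else:
    # First run: create the Iceberg table at an explicit S3 location.
    (
        df.writeTo(qualified_table)
        .using("iceberg")
        .tableProperty("location", table_path)
        .partitionedBy(partition_column)
        .create()
    )
    print(f"Table created at {table_path}")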

What happens

If the table does not exist, it is created as an Iceberg table.
If it exists, only the partitions present in the DataFrame are overwritten (overwritePartitions()), avoiding a full table rewrite.
Metadata and schema are managed by Iceberg, and the table remains registered in the Glue Data Catalog.
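
For the table to be registered in the Glue Data Catalog, the Spark session needs an Iceberg catalog that points at Glue. The sketch below collects the settings commonly used for that. The helper names, the catalog name `glue_catalog`, and the bucket path are assumptions for illustration; the class names are the ones shipped with Iceberg's Spark and AWS modules.

```python
def iceberg_glue_conf(catalog_name, warehouse_path):
    """Return Spark settings for an Iceberg catalog backed by the Glue Data Catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg-specific SQL extensions.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers catalog_name as an Iceberg catalog in Spark.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Stores table metadata pointers in the Glue Data Catalog.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        # Reads and writes data files through Iceberg's S3 client.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        # Default S3 location for tables created without an explicit location.
        f"{prefix}.warehouse": warehouse_path,
    }


def apply_conf(builder, conf):
    """Apply each setting to a SparkSession builder and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark available, `apply_conf(SparkSession.builder, iceberg_glue_conf("glue_catalog", "s3://example-bucket/warehouse/")).getOrCreate()` would yield a session in which `glue_catalog.staging.budget` resolves to an Iceberg table. In a Glue job the same keys are typically supplied through job parameters instead of in code.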

Characteristics

Transactional guarantees (ACID): Safe concurrent writes, consistent snapshots.
Efficient partition handling: Iceberg tracks partitions in its own metadata, so there is no need to register or repair partitions manually.
Schema evolution: Adding, dropping, or renaming columns is supported natively.
Querying in Athena: Iceberg tables are natively supported, enabling features such as time travel queries against earlier snapshots.
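
The schema evolution and time travel features listed above are driven by plain SQL. The following sketch builds the statements as strings: the `ALTER TABLE` forms are intended for `spark.sql(...)`, and the time travel query uses Athena's `FOR TIMESTAMP AS OF` syntax. The helper names, and the table and column names in the usage note, are hypothetical. Identifiers are interpolated without validation, so this is for illustration only.

```python
def add_column_sql(table, column, data_type):
    """Spark SQL: add a new column to an Iceberg table."""
    return f"ALTER TABLE {table} ADD COLUMN {column} {data_type}"


def rename_column_sql(table, old_name, new_name):
    """Spark SQL: rename a column; existing data files are not rewritten."""
    return f"ALTER TABLE {table} RENAME COLUMN {old_name} TO {new_name}"


def drop_column_sql(table, column):
    """Spark SQL: drop a column from the table schema."""
    return f"ALTER TABLE {table} DROP COLUMN {column}"


def athena_time_travel_sql(table, timestamp_utc):
    """Athena SQL: read the table as of a UTC timestamp 'YYYY-MM-DD HH:MM:SS'."""
    return (
        f"SELECT * FROM {table} "
        f"FOR TIMESTAMP AS OF TIMESTAMP '{timestamp_utc} UTC'"
    )
```

For example, `spark.sql(add_column_sql("glue_catalog.staging.budget", "cost_center", "string"))` would add a column in Spark, while the string from `athena_time_travel_sql("staging.budget", "2024-01-01 00:00:00")` could be run in the Athena console to read an earlier snapshot.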

This approach is ideal for large datasets, incremental updates, and scenarios where data reliability and long-term governance matter.
