
Choosing Between saveAsTable and Iceberg’s writeTo in AWS Glue and Athena

When working with Spark on AWS Glue, there are multiple ways to persist DataFrames as tables and make them queryable in Amazon Athena. Two common approaches are:

  1. Using Spark’s Hive-style saveAsTable
  2. Using Apache Iceberg’s writeTo API

At first glance they may look similar, but they solve different problems and have distinct implications for scalability, schema evolution, and data management.

1. Writing with saveAsTable

A simple Spark DataFrame write might look like this:

df.write.mode("overwrite").saveAsTable("staging.budget")

What happens

Characteristics

This approach works best when datasets are small to medium-sized, each update fully replaces the table, and advanced transactional features are not required.
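As a minimal sketch of the full-replacement pattern, the helper below wraps the same saveAsTable call and optionally pins the storage format and location. The function name `overwrite_hive_table`, its parameters, and the example bucket path are hypothetical; `mode`, `format`, `option("path", ...)`, and `saveAsTable` are standard DataFrameWriter methods.

```python
def overwrite_hive_table(df, table_name, path=None, file_format="parquet"):
    """Fully replace a Hive-style table with the contents of df.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    writer = df.write.mode("overwrite").format(file_format)
    if path is not None:
        # An explicit path makes the table external (unmanaged),
        # so dropping the table later does not delete the files.
        writer = writer.option("path", path)
    writer.saveAsTable(table_name)
```

A call might look like `overwrite_hive_table(df, "staging.budget", path="s3://example-bucket/staging/budget/")`, where the bucket name is a placeholder.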

2. Writing with Iceberg’s writeTo

A more modern way is to use Apache Iceberg through the writeTo API:

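# Assumes spark, df, qualified_table, table_path, and partition_column
# are defined earlier in the job.
try:
    # _jcatalog is a private handle to the JVM catalog; on Spark 3.3+,
    # spark.catalog.tableExists(qualified_table) is the public equivalent.
    table_exists = spark.catalog._jcatalog.tableExists(qualified_table)

    if table_exists:
        # Replace only the partitions present in df; others are left untouched.
        df.writeTo(qualified_table).overwritePartitions()
        print(f"Data written/overwritten to {qualified_table}")
    else:
        # First run: create the Iceberg table at an explicit location.
        df.writeTo(qualified_table) \
          .using("iceberg") \
          .tableProperty("location", table_path) \
          .partitionedBy(partition_column) \
          .create()
        print(f"Table created at {table_path}")
except Exception as exc:
    # Log the failure, then re-raise so the Glue job is marked as failed.
    print(f"Write to {qualified_table} failed: {exc}")
    raise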

What happens

Characteristics

This approach is ideal for large datasets, incremental updates, and scenarios where data reliability and long-term governance matter.
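The writeTo snippet assumes that `qualified_table` resolves through an Iceberg-enabled catalog. A minimal sketch of the Spark settings commonly used to register such a catalog against AWS Glue follows. The catalog name `glue_catalog`, the warehouse path, and both helper names are assumptions for illustration, not values taken from the original job.

```python
def iceberg_glue_conf(catalog="glue_catalog",
                      warehouse="s3://example-bucket/warehouse/"):
    """Return Spark settings that register an Iceberg catalog backed by AWS Glue.

    The catalog name and warehouse path are placeholders (assumptions).
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg's SQL extensions (e.g. MERGE INTO, CALL procedures).
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the named catalog as an Iceberg catalog.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Stores table metadata pointers in the AWS Glue Data Catalog.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        # Reads and writes data files through Iceberg's S3 file IO.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
    }


def qualify(catalog, database, table):
    """Build the three-part name passed to df.writeTo(...)."""
    return f"{catalog}.{database}.{table}"
```

Each key/value pair can be applied with `SparkSession.builder.config(key, value)` before the session is created, and `qualify("glue_catalog", "staging", "budget")` would then yield a name suitable for `qualified_table`. On Glue these settings are often supplied as job parameters rather than in code, so treat this as one possible arrangement.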

