Posts
All the articles I've posted.
-
Choosing Between saveAsTable and Iceberg’s writeTo in AWS Glue and Athena
When working with Spark on AWS Glue, there are multiple ways to persist DataFrames as tables and make them queryable in Amazon Athena. Two common approaches are: Using Spark’s Hive-style saveAsTable
-
Debugging Spark DataFrame .show() Timeouts in PyCharm and VSCode
When working with PySpark, one of the first commands developers use to quickly inspect data is raw_df.show(). However, in certain environments (especially when running inside PyCharm or VSCode with a
-
Incremental Data Loads: Choosing Between resource_version and created_at/updated_at
Incremental data loading is a cornerstone of modern data engineering pipelines. Instead of re-ingesting entire datasets on each execution, incremental strategies focus on retrieving only records that
-
Optimizing Amazon Athena Queries with Partitions: A Practical Example
When working with Amazon Athena, one of the most effective strategies to improve query performance and reduce costs is partitioning your data. Partitions allow Athena to scan only the relevant
-
Running Apache Airflow Across Environments
Apache Airflow has become a de facto standard for orchestrating data workflows. However, depending on the environment, the way Airflow runs can change significantly. Many teams get confused when
-
Can You Perform Data Grouping Directly with the yFinance API?
When working with financial data, efficient aggregation and analysis are essential for generating meaningful insights. A common question among developers and data analysts is whether the yFinance
-
Optimizing Partition Strategies in Apache Iceberg on AWS
When working with large-scale analytical datasets, efficient partitioning is critical for achieving optimal query performance and cost savings. Apache Iceberg, a modern table format designed for big
-
How Transactions Work in Databricks Using Delta Lake
Databricks is a powerful platform for big data analytics and machine learning. One of its key features is the ability to run transactional workloads over large-scale data lakes using Delta Lake. This