Tag: Data
All the articles with the tag "Data".
-
Hive Metastore: The Glue Holding Big Data Together
When people think of Hive, they often remember the early days of Hadoop and MapReduce. But while Hive as a query engine has largely faded, one of its components remains critical to the modern data
-
Why Parquet Became the Standard for Analytics
In the early days of Big Data, data was often stored in simple formats such as CSV, JSON, or text logs. While these formats were easy to generate and understand, they quickly became inefficient at
-
Facebook and Big Data: The Open Source Projects That Changed the Industry
When people talk about the history of Big Data, a few companies come to mind: Google, Yahoo, and Facebook. Each of them faced unique challenges that forced them to build large-scale distributed
-
Managing Evolving Schemas in Apache Spark: A Strategic Approach
Schema management is one of the most overlooked yet critical aspects of building reliable data pipelines. In a fast-moving environment, schemas rarely remain static: new fields are added, data types
-
Understanding Window Functions in SQL: Beyond Simple Aggregations
When we think about SQL functions, we often start with scalar functions (UPPER(), ROUND(), NOW()) or aggregate functions (SUM(), AVG(), COUNT()). But there is a third type that is essential
-
Choosing Between saveAsTable and Iceberg’s writeTo in AWS Glue and Athena
When working with Spark on AWS Glue, there are multiple ways to persist DataFrames as tables and make them queryable in Amazon Athena. Two common approaches are: Using Spark’s Hive-style saveAsTable
-
Debugging Spark DataFrame .show() Timeouts in PyCharm and VSCode
When working with PySpark, one of the first commands developers use to quickly inspect data is raw_df.show(). However, in certain environments (especially when running inside PyCharm or VSCode with a
-
Running Apache Airflow Across Environments
Apache Airflow has become a de facto standard for orchestrating data workflows. However, depending on the environment, the way Airflow runs can change significantly. Many teams get confused when