Archives

All the articles I've archived.

2026 ⁴⁷

May ¹⁹

caveman: cutting Claude's output tokens by 75%

30 May, 2026

The caveman skill strips filler from Claude responses while preserving technical accuracy — reducing output tokens by ~75% and keeping long sessions inside context limits.
diagnose: enforcing the debug loop Claude skips

30 May, 2026

The diagnose skill enforces a structured debug loop — reproduce, minimise, hypothesise, instrument, fix, regression test — preventing Claude from jumping straight to a proposed solution.
grill-me: stress-testing a plan before touching the codebase

30 May, 2026

The grill-me skill interrogates a proposed plan against the existing domain model before any code is written — surfacing assumptions, naming drift, and design conflicts early.
improve-codebase-architecture: review that knows your domain

30 May, 2026

The improve-codebase-architecture skill runs structured architecture review informed by domain language and ADRs — not generic refactor suggestions.
Building a data quality dashboard on top of Tuva's DQI mart

28 May, 2026

Tuva's DQI mart produces fill rates and anomaly flags for every input field in a claims dataset. Here's how to surface that in a Streamlit dashboard with meaningful visualizations.
Building a multipage Streamlit app with st.navigation() — the modern way

28 May, 2026

Streamlit 1.36 added st.navigation() as the modern replacement for the pages/ directory convention. Here's how the new API works and why it's better for anything beyond a simple demo.
Connecting Streamlit to DuckDB: read-only mode and the lock problem

28 May, 2026

A Streamlit app reading from a DuckDB file while dbt might be writing to it requires read-only mode and careful connection handling. Here's the pattern that works.
dbt macro resolution order: a real-world debugging story

28 May, 2026

A Compilation Error in a Tuva Project model revealed the exact resolution order dbt uses when looking up macros — and why that order matters when you're three packages deep.
dbt + DuckDB: the good, the bad, and the workarounds

28 May, 2026

Running dbt 1.11 with DuckDB 1.10 on a 167k-row dataset is fast and free, but the combination has real rough edges. An honest assessment after a full Tuva Project run.
Deploying a Streamlit analytics app in one afternoon

28 May, 2026

Streamlit Community Cloud deploys directly from a GitHub repo. Here's the exact setup for a multipage analytics app with a DuckDB backend — including requirements.txt, secrets, and the database file.
DuckDB concurrency in 2026: why you can't run dbt and DBeaver at the same time

28 May, 2026

DuckDB's single-writer model means that opening a .duckdb file in DBeaver blocks dbt from acquiring its write connection. Here's the exact error, why it happens, and the right fix.
Feature engineering from claims data for a Random Forest classifier

28 May, 2026

Healthcare claims have dozens of potential features for patient risk models. Here's how to select and validate features from a Tuva Project dataset for a Random Forest classifier predicting HCC gaps.
From raw claims to RAF: what the data pipeline actually looks like

28 May, 2026

The path from a raw Medicare claim file to a patient's Risk Adjustment Factor score involves five distinct transformation layers. This is what each layer does and where the dbt models fit in.
HCC suspecting explained from a data engineering perspective

28 May, 2026

HCC suspecting is about identifying conditions documented in prior years that haven't appeared in claims yet this year. This is what the data pipeline looks like and what the Tuva mart actually produces.
Predicting patient risk with scikit-learn on top of HCC suspecting data

28 May, 2026

A Random Forest classifier on Tuva's hcc_suspecting__summary table, using age, sex, paid amount, and condition count to predict which patients have above-median HCC gaps.
Running the Tuva Project on DuckDB — what breaks and how to fix it

28 May, 2026

Tuva 0.17.2 on DuckDB 1.10 with dbt 1.11 produces three distinct failure modes. This is what actually broke and how each one was fixed.
The limit_zero macro bug: how dbt resolves macros across packages

28 May, 2026

A missing limit_zero macro in a Tuva Project run turned into a detailed lesson on how dbt resolves macros across package namespaces, adapter dispatch, and why the fix is a two-line local macro.
What 167k synthetic Medicare claims taught me about US healthcare data

28 May, 2026

Running the Tuva Project on 167k synthetic Medicare claims reveals the structural complexity of US healthcare data — before you've dealt with a single real data quality issue.
What is the Tuva Project and why should data engineers care

28 May, 2026

The Tuva Project is an open-source dbt package that transforms raw healthcare claims into analytics-ready mart tables covering HEDIS, HCC, CCSR, readmissions, and data quality. This is what it actually does and why it matters.

April ⁴

Batch Means Two Different Things: Why the Term Became Confusing in Data Engineering

26 Apr, 2026

In data systems, some of the most common words are also the most overloaded. Few terms illustrate this better than batch. Historically, batch processing described a very specific operating model:
Why apt upgrade Didn’t Update VS Code (and What Actually Happened)

26 Apr, 2026

Problem Statement: Running sudo apt update and sudo apt upgrade leaves Visual Studio Code outdated, but running sudo apt install --only-upgrade code updates it successfully. This behavior is not
Tracking Subdomains in PostHog Without Breaking User Journeys

24 Apr, 2026

When a website grows, it often stops being just one site. A main domain may coexist with multiple subdomains, such as a marketing site, an events portal, a documentation site, or a learning platform.
Why Terraform Does Not Deploy Your Lambda Container Image

24 Apr, 2026

When teams start packaging AWS Lambda functions as container images, a common misunderstanding appears quickly: “I created the Lambda with Terraform, so why is AWS saying the image does not exist?”

March ¹⁰

ABC in Python: What It Is, Where It Comes From, and Why It Exists

27 Mar, 2026

When people begin exploring object-oriented design in Python, they eventually encounter this import: from abc import ABC, abstractmethod. That usually leads to a natural question: what does ABC mean,
Can an AWS VPC Have Two Peering Connections? Yes. But Should It?

24 Mar, 2026

When teams begin structuring cloud networks in AWS, one of the first connectivity mechanisms they encounter is VPC Peering. It is simple, direct, and usually easy to implement for small environments.
Sending Events to Multiple PostHog Projects from the Same Website

24 Mar, 2026

In some architectures, a single website needs to send analytics events to multiple PostHog projects. This situation commonly appears in the following scenarios: Environment separation (development,
Lambda vs n8n: A Simple Explanation for Data Workflows

23 Mar, 2026

Introduction: When building data systems or integrating APIs, a common question appears: should we use AWS Lambda or n8n? Both tools can automate processes, call APIs, and move data between systems,
Should You Use AWS Lambda or AWS Glue to Update Records in HubSpot?

23 Mar, 2026

When integrating HubSpot with a data platform on AWS, a common architectural decision appears quickly: Should updates to HubSpot be executed from AWS Lambda or AWS Glue? The correct choice depends on
Understanding client_ingestion_warning in PostHog: Are You Losing Data?

3 Mar, 2026

When using PostHog with the default posthog-js configuration, you may encounter the following warning: posthog-js client rate limited. Config is set to 10 events per second and 100 events burst limit.
Daily Failure Reporting in DynamoDB Using Lambda, EventBridge Scheduler, and SES

1 Mar, 2026

Operational monitoring requires structured visibility into failures. If your processes write execution logs to DynamoDB and mark failed executions with status = FAILED, you can implement a
Hardening OAuth Token Management in Postman: Preventing Environment Cross-Contamination

1 Mar, 2026

When working with multiple third-party APIs (Zoom, HubSpot, Meta, etc.), a common operational risk in Postman is environment cross-contamination. Tokens may be overwritten unintentionally if the
Understanding ip-api Batch Limits and Effective Throughput

1 Mar, 2026

When integrating IP geolocation into a data pipeline, understanding rate limits and batching constraints is essential. This post analyzes the practical limits of the ip-api free tier and how to
Window Functions vs JOIN in Spark: A Physical Plan Perspective

1 Mar, 2026

When solving analytical queries in Spark SQL, there are often multiple correct formulations. However, they do not produce equivalent execution plans. This article compares two approaches to the same

February ⁶

Can You Know the Location of an IPv6 Address?

28 Feb, 2026

Example IPv6: 2600:100e:b0c7:7403:f88c:92d0:bc41:46ff. Short answer: only approximately, and with significant limitations. This article explains what can and cannot be inferred from an IPv6 address,
AWS Glue + Chargebee: Diagnosing CERTIFICATE_VERIFY_FAILED After TLS Chain Updates

27 Feb, 2026

Context: An AWS Glue job that consumes the Chargebee API begins failing with: SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate. The same
From OLTP to OLAP: How Data Moves from 3NF to a Dimensional Data Warehouse

26 Feb, 2026

Modern data architectures typically separate operational systems from analytical systems. This separation is not accidental—it reflects fundamentally different workloads, data models, and optimization
Why There Is No “Interpreter” Endpoint in the Zoom API

26 Feb, 2026

Many teams attempt to retrieve language interpretation usage (e.g., minutes consumed per language channel) through the Zoom REST API, only to discover that no such endpoint exists for Meetings or
Why You Can’t Get Full Social Analytics from the HubSpot API (Even with Marketing Hub Pro)

26 Feb, 2026

Many teams assume that upgrading to Marketing Hub Professional unlocks full programmatic access to social media performance metrics. It does not. This article clarifies what is technically possible,
Why Small Tables Can Explode: Understanding JOIN Cardinality in SQL

20 Feb, 2026

It is common to assume that joining two small tables will produce a small result set. In practice, this assumption frequently fails. Even tables with only a handful of rows can generate unexpectedly

January ⁸

Resolving the Node.js Error: Cannot find module jsonwebtoken

28 Jan, 2026

When developing backend services with Node.js, especially APIs that implement authentication, it is common to rely on JSON Web Tokens (JWT). One frequent runtime error encountered in this context is:
Sending Athena Query Results to Amazon SQS: Architecture, Costs, and Limitations

27 Jan, 2026

Introduction: Amazon Athena is a serverless query service designed for interactive analysis of data stored in Amazon S3. Amazon SQS (Simple Queue Service), on the other hand, is a fully managed message
Extracting and Managing Access Tokens in Postman

26 Jan, 2026

When working with APIs that use OAuth 2.0 or token-based authentication, a common requirement is to extract an access_token from a successful authentication request and reuse it in subsequent API
How PostHog Uses ClickHouse for High-Performance Product Analytics

26 Jan, 2026

Modern product analytics platforms must process billions of events while still delivering low-latency queries for dashboards, funnels, and retention analysis. PostHog addresses this requirement by
Hiding Personal Information in AWS Glue with Spark

25 Jan, 2026

Protecting personal data before analytics consumption is a core requirement in modern data platforms. In AWS-based lake architectures, this is typically achieved through data de-identification during
Rebasing vs Creating a New Branch: How to Handle Outdated Feature Branches Correctly

12 Jan, 2026

In collaborative software projects, it is common to face the following situation: a feature branch was created some time ago, work was done on it, and meanwhile the main branch continued to evolve.
Automating OAuth 2.0 in Postman: storing and refreshing access tokens without copy-paste

11 Jan, 2026

Introduction: When working with APIs protected by OAuth 2.0, Postman is commonly used for development and testing. A frequent pain point is manual token handling: requesting an access token, copying
Running Scheduled GitHub Actions Locally for Safer Debugging

10 Jan, 2026

Overview: When working with scheduled automation jobs in GitHub Actions, it is common to face a simple but critical question: Can this workflow be executed locally before pushing to production? The

2025 ¹³¹

December ¹

Designing a Scalable Course Progress Service on AWS

19 Dec, 2025

EC2, Lambda, DynamoDB, and RDS Cost and Architecture Trade-offs. Context: In a multi-platform learning environment where users can advance through courses using both Web and Mobile applications,

November ²

Handling Boolean vs IntegerType Mismatches Between MySQL and Spark (Glue JDBC)

12 Nov, 2025

When integrating MySQL data into Apache Spark (for example, through AWS Glue), you might encounter schema mismatches caused by how MySQL represents TINYINT(1) fields. This issue often surfaces when
Controlling Branch Deployments and Redirects in Vercel: A Practical Guide

10 Nov, 2025

Continuous deployment platforms simplify the release process, but they can easily become noisy when every branch triggers a build. Teams working with multiple development environments often need finer

October ⁴

AWS EventBridge Rules vs EventBridge Scheduler: Which One Should You Use?

2 Oct, 2025

In the AWS ecosystem, there are two main ways to schedule and automate tasks: EventBridge Rules (scheduled rules) and the newer EventBridge Scheduler, which introduces Schedule Groups. While both
Estimating the Cost of an AWS Glue Workflow

2 Oct, 2025

When working with AWS Glue, one of the most common questions data engineers ask is: How much will this job cost me? If you have a workflow that runs for 13 minutes, understanding the cost model of AWS
Modern Table Formats: Iceberg, Delta Lake, and Hudi

1 Oct, 2025

Data Lakes made it possible to store raw data at scale, but they lacked the reliability and governance of data warehouses. Files could be dropped into storage (S3, HDFS, MinIO), but analysts struggled
Running Production Servers on AWS: EC2 vs RDS Cost Breakdown

1 Oct, 2025

When planning to run production workloads in the cloud, cost is one of the most important considerations. In this post, we will explore the monthly expenses of running two application servers and a

September ²⁶

Trino in Modern Architectures: SQL Queries on S3 and MinIO

30 Sep, 2025

The rise of cloud object storage has transformed how organizations build data platforms. Hadoop Distributed File System (HDFS) once dominated, but today services like Amazon S3, Google Cloud Storage
Hive Metastore: The Glue Holding Big Data Together

23 Sep, 2025

When people think of Hive, they often remember the early days of Hadoop and MapReduce. But while Hive as a query engine has largely faded, one of its components remains critical to the modern data
Why Parquet Became the Standard for Analytics

23 Sep, 2025

In the early days of Big Data, data was often stored in simple formats such as CSV, JSON, or text logs. While these formats were easy to generate and understand, they quickly became inefficient at
Facebook and Big Data: The Open Source Projects That Changed the Industry

19 Sep, 2025

When people talk about the history of Big Data, a few companies come to mind: Google, Yahoo, and Facebook. Each of them faced unique challenges that forced them to build large-scale distributed
HDFS vs. Object Storage: The Battle for Distributed Storage

19 Sep, 2025

Distributed storage has always been the foundation of Big Data. In the early days, Hadoop Distributed File System (HDFS) was the de facto standard. Today, however, object storage systems like Amazon
The History of Hive and Trino: From Hadoop to Lakehouses

19 Sep, 2025

The evolution of Big Data architectures is deeply tied to the history of two projects born at Facebook: Hive and Trino. Both emerged from real engineering pain points, but at different times and for
What Is a Data Lake and What Is a Data Lakehouse?

19 Sep, 2025

Over the last decade, the world of data architecture has gone through several transformations. From traditional data warehouses to Hadoop-based data lakes and now to the emerging Lakehouse paradigm,
Google Bigtable vs. Amazon DynamoDB: Understanding the Differences

17 Sep, 2025

When choosing a NoSQL database for scalable, low-latency applications, two major options stand out: Google Cloud Bigtable and Amazon DynamoDB. While both are managed, highly available, and
How to Keep a Docker Container Running Persistently

17 Sep, 2025

When working with Docker, you may have noticed that some containers stop as soon as you exit the shell. This is because Docker considers the container's main process to have finished. In this post, we
Fixing Cursor Login Issues on Linux (AppImage)

16 Sep, 2025

When running Cursor on Linux, especially with the AppImage version, you might encounter a situation where you can’t log in. This usually happens because Cursor stores its session state locally, and
Managing Evolving Schemas in Apache Spark: A Strategic Approach

16 Sep, 2025

Schema management is one of the most overlooked yet critical aspects of building reliable data pipelines. In a fast-moving environment, schemas rarely remain static: new fields are added, data types
Orchestrating Multiple AWS Glue Workflows: A Practical Guide

16 Sep, 2025

AWS Glue provides a robust environment for building and managing ETL pipelines, but many data engineers face the challenge of chaining or coordinating multiple workflows. This article explores
Secure Ways to Share Private Data on AWS: Beyond Public Buckets

16 Sep, 2025

When building data platforms in the cloud, it is common to share data with partners, clients, or internal teams outside your own. AWS provides several mechanisms to grant secure, granular access — far
Designing a Semantic Layer for Athena + Power BI

15 Sep, 2025

Modern data architectures benefit from a clear separation of layers: Ingestion, Staging, and Semantic (Presentation). When using Amazon Athena as the query engine and Power BI as the visualization
Querying JSONB in PostgreSQL Efficiently

15 Sep, 2025

In modern applications, it is common to store semi-structured data in JSON format inside a relational database like PostgreSQL. However, to analyze this data properly, you need a way to transform it
Understanding Window Functions in SQL: Beyond Simple Aggregations

15 Sep, 2025

When we think about SQL functions, we often start with scalar functions (UPPER(), ROUND(), NOW()) or aggregate functions (SUM(), AVG(), COUNT()). But there is a third type that is essential
Automating Data Extraction with Airflow, BeautifulSoup, and MinIO

7 Sep, 2025

In the data engineering ecosystem, a common task is to automate the extraction of data from external sources, perform minimal processing, and store it in a data lake for further analysis. In this
How to Set CloudWatch Log Retention Policies with Terraform

7 Sep, 2025

AWS CloudWatch is a powerful service for monitoring applications and infrastructure. However, by default, CloudWatch Logs are configured to never expire. This can lead to excessive storage costs and
How to Disable an AWS Glue Trigger from the CLI

6 Sep, 2025

When working with AWS Glue, triggers are an important mechanism to orchestrate jobs or workflows. Sometimes, however, you may need to temporarily disable a trigger without deleting it, for example, to
Orchestrating Multiple AWS Glue Workflows with Step Functions

6 Sep, 2025

In modern data architectures, it is common to manage multiple ETL pipelines that must run in sequence or in parallel. AWS Glue provides a robust framework for building workflows, but when we need to
Understanding the Strategy Design Pattern

6 Sep, 2025

In the landscape of software design, maintaining flexibility and scalability is crucial. One of the most effective ways to achieve these qualities is by leveraging design patterns. Among the
Choosing Between saveAsTable and Iceberg’s writeTo in AWS Glue and Athena

3 Sep, 2025

When working with Spark on AWS Glue, there are multiple ways to persist DataFrames as tables and make them queryable in Amazon Athena. Two common approaches are: Using Spark’s Hive-style saveAsTable
Debugging Spark DataFrame .show() Timeouts in PyCharm and VSCode

3 Sep, 2025

When working with PySpark, one of the first commands developers use to quickly inspect data is: raw_df.show(). However, in certain environments (especially when running inside PyCharm or VSCode with a
Incremental Data Loads: Choosing Between resource_version and created_at/updated_at

3 Sep, 2025

Incremental data loading is a cornerstone of modern data engineering pipelines. Instead of re-ingesting entire datasets on each execution, incremental strategies focus on retrieving only records that
Optimizing Amazon Athena Queries with Partitions: A Practical Example

1 Sep, 2025

When working with Amazon Athena, one of the most effective strategies to improve query performance and reduce costs is partitioning your data. Partitions allow Athena to scan only the relevant
Running Apache Airflow Across Environments

1 Sep, 2025

Apache Airflow has become a de facto standard for orchestrating data workflows. However, depending on the environment, the way Airflow runs can change significantly. Many teams get confused when

July ¹⁷

Can You Perform Data Grouping Directly with the yFinance API?

26 Jul, 2025

When working with financial data, efficient aggregation and analysis are essential for generating meaningful insights. A common question among developers and data analysts is whether the yFinance
Optimizing Partition Strategies in Apache Iceberg on AWS

24 Jul, 2025

When working with large-scale analytical datasets, efficient partitioning is critical for achieving optimal query performance and cost savings. Apache Iceberg, a modern table format designed for big
How Transactions Work in Databricks Using Delta Lake

22 Jul, 2025

Databricks is a powerful platform for big data analytics and machine learning. One of its key features is the ability to run transactional workloads over large-scale data lakes using Delta Lake. This
Versioning Terraform Resources to Meet CIS Security Standards

19 Jul, 2025

Infrastructure as Code (IaC) has become a foundational practice for modern DevOps and cloud-native teams. Terraform, as one of the most widely adopted IaC tools, enables infrastructure automation,
Choosing Between DynamoDB and Cassandra for a Crypto Exchange

13 Jul, 2025

When designing the backend of a crypto exchange, selecting the right database architecture is crucial. Two common NoSQL databases often considered for this type of application are Amazon DynamoDB and
Handling Python datetime Objects in Amazon DynamoDB

13 Jul, 2025

When developing data pipelines or applications that store time-based records in Amazon DynamoDB, developers frequently encounter serialization errors when working with Python's datetime objects.
AWS Glue Workflow vs Apache Airflow: A Professional Comparison

12 Jul, 2025

While both serve the common purpose of managing and automating data workflows, they differ significantly in architecture, flexibility, integration capabilities, and operational control. This article
Reducing AWS Costs: How to Temporarily Stop an Aurora Serverless v2 Cluster

9 Jul, 2025

When managing cloud infrastructure, minimizing costs without compromising data integrity is a continuous priority. Amazon Aurora Serverless v2 offers scalability and high availability, but unlike
The Enduring Relevance of Peter Chen’s Entity-Relationship Model

8 Jul, 2025

In the landscape of data modeling, few contributions have had the long-lasting impact of Peter Chen’s Entity-Relationship (E-R) Model, introduced in 1976. More than four decades later, it remains a
EMR vs AWS Glue: Choosing the Right Data Processing Tool on AWS

6 Jul, 2025

When working with big data on AWS, two commonly used services for data processing are Amazon EMR and AWS Glue. Although both support scalable data transformation and analytics, they differ
How Hadoop Made Specialized Storage Hardware Obsolete

6 Jul, 2025

In the early 2000s, enterprise data processing was dominated by high-end hardware. Organizations relied heavily on centralized storage systems such as SAN (Storage Area Networks) and NAS (Network
When Should You Use Iceberg with Athena? Partitioning Strategies and Best Practices

5 Jul, 2025

As data lakes grow in size and complexity, tools like Amazon Athena combined with table formats like Apache Iceberg become essential for scalability, data governance, and performance. In this post,
Why You Should Use the -out Option with terraform plan

5 Jul, 2025

When working with Terraform, a common workflow involves running terraform plan followed by terraform apply. However, you may have come across the following warning: "You didn't use the -out option to
How Google Changed Big Data: The Story of GFS, MapReduce, and Bigtable

4 Jul, 2025

In the early 2000s, Google faced a unique challenge: how to store, process, and query massive amounts of data across thousands of unreliable machines. The traditional systems of the time—designed for
Secure Database Access in AWS Using SSH Tunneling

2 Jul, 2025

Accessing databases located in private subnets within AWS Virtual Private Clouds (VPCs) is a common requirement in enterprise architectures. To ensure secure connectivity without exposing the database
Did Early Personal Computers Really Have a CPU? A Look at the von Neumann Architecture

1 Jul, 2025

When we think of a personal computer (PC), we typically imagine a processor, memory, a keyboard, and a display. But a deeper question often goes unasked: Did all early personal computers actually
Mastering the Linux find Command: A Practical Introduction

1 Jul, 2025

When working with Linux, one of the most powerful tools at your disposal is the find command. Whether you're managing a personal machine or maintaining a production server, being able to locate files

June ⁸

The Origin and Evolution of the DataFrame

30 Jun, 2025

When working with data today, whether in Python, R, or distributed computing platforms like Spark, one of the most commonly used structures is the DataFrame. But where did it come from? This post
Understanding ORM: Bridging the Gap Between Objects and Relational Databases

30 Jun, 2025

In modern software development, working with databases is a fundamental requirement. Most applications need to persist, retrieve, and manipulate data stored in relational databases such as PostgreSQL,
Python Decorators: What They Are, How They Work, and Why C Doesn't Have Them

29 Jun, 2025

In Python, decorators are a powerful feature for applying common logic to multiple functions without duplicating code. They allow you to extend or modify the behavior of functions, methods, or classes
Understanding findOne and findOneAndUpdate in Mongoose: Key Differences and Practical Usage

24 Jun, 2025

When working with MongoDB through Mongoose in Node.js, developers frequently encounter two essential methods: findOne and findOneAndUpdate. Both methods perform document lookups, but they serve
Are NoSQL Databases Really Schema-less?

19 Jun, 2025

A Perspective from the MERN Stack: When we first start learning about NoSQL databases, one of the most common things we hear is that they are "schema-less." At first glance, this seems like a huge
How Network Topology Shapes Distributed Computing and Big Data Systems

19 Jun, 2025

When discussing distributed systems and Big Data, people often focus on storage, processing frameworks, and scalability, but one foundational concept underlies it all: network topology. It’s the
When Should You Use Parquet and When Should You Use Iceberg?

18 Jun, 2025

In modern data architectures, selecting the right storage and management solution is essential for building efficient, reliable, and scalable pipelines. Two popular choices that often come up are
How to Fix 'DataFrame' object has no attribute 'writeTo' When Working with Apache Iceberg in PySpark

17 Jun, 2025

If you’re working with Apache Iceberg in PySpark and encounter this error: Failed to write to Iceberg table: 'DataFrame' object has no attribute 'writeTo', you’re not alone. This is a common mistake

May ¹⁷

What Is Sharding and Why It Matters

21 May, 2025

As our world becomes increasingly digital, the amount of data we create every day is staggering. Think about all the emails, messages, orders, and photos uploaded every second. How do big companies
From Tables to Partitions: Designing NoSQL Databases with Cassandra

20 May, 2025

As data professionals transition from relational databases to NoSQL systems like Apache Cassandra, one of the most important mindset shifts is understanding that you don't model data for storage, but
Apache Cassandra vs Apache Parquet: Understanding the Differences

14 May, 2025

In modern data architectures, it's common to encounter both Apache Cassandra and Apache Parquet, particularly when dealing with large-scale, distributed systems. Both technologies are associated with
Import Live Crypto Prices into Google Sheets

11 May, 2025

Are you tired of checking crypto prices manually? Want to automate your portfolio tracking or build a custom crypto dashboard? Good news — with just a few steps, you can pull live cryptocurrency
Fixing Spark Ivy Error in Docker: "basedir must be absolute"

9 May, 2025

If you're running Apache Spark inside Docker using Bitnami's images and suddenly encounter an Ivy error that says: Exception in thread "main" java.lang.IllegalArgumentException: basedir must be
How Dynamo Reshaped the Internal Architecture of Amazon S3

9 May, 2025

Introduction: Amazon S3 launched in 2006 as a scalable, durable object storage system. It avoided hierarchical file systems and used flat key-based addressing from day one. However, early versions of
What’s Behind Amazon S3?

9 May, 2025

When you upload a file to the cloud using an app or service, there's a good chance it's being stored on Amazon S3 (Simple Storage Service). But what powers it under the hood? What is Amazon S3? Amazon
How HDFS Achieves Fault Tolerance Through Replication

6 May, 2025

One of the core strengths of the Hadoop Distributed File System (HDFS) is its fault tolerance. In a world of distributed computing, failures are not rare; they're expected. HDFS tackles this by using
Summary: Teaching HDFS Concepts to New Learners

6 May, 2025

Introducing Hadoop Distributed File System (HDFS) to newcomers can be both exciting and challenging. To make the learning experience structured and impactful, it’s helpful to break down the core
How Clients Know Where to Read or Write in HDFS

5 May, 2025

Hadoop Distributed File System (HDFS) is designed to decouple metadata management from actual data storage. But how does a client, like a Spark job or command-line tool, know where to read or write the
How HDFS Avoids Understanding File Content

5 May, 2025

One of the defining features of Hadoop Distributed File System (HDFS) is that it doesn’t understand the contents of the files it stores. This is not a limitation; it's an intentional design choice
How Spark and MapReduce Handle Partial Records in HDFS

5 May, 2025

When working with large-scale data processing frameworks like Apache Spark or Hadoop MapReduce, one common question arises: What happens when a record (e.g., a line of text or a JSON object) is split
How HDFS Tracks Block Size and File Boundaries

4 May, 2025

When dealing with massive files, Hadoop Distributed File System (HDFS) doesn't read or store them as a whole. Instead, it splits them into large, fixed-size blocks. But how does it know where each
How Metadata Works in HDFS and What It Stores

4 May, 2025

HDFS stores metadata separately from the actual file content to optimize performance and scalability. This metadata is managed entirely by the NameNode, which allows clients to quickly locate and
The Architecture of HDFS: NameNode, DataNodes, and Metadata

3 May, 2025

HDFS (Hadoop Distributed File System) was built to support the reliable storage and access of large datasets distributed across commodity hardware. To make this possible, HDFS relies on a master/slave
What Happens When HDFS Splits Files Mid-Word or Mid-Row?

2 May, 2025

HDFS is designed to store and process massive amounts of data efficiently. One of its key design decisions is to split files into large, fixed-size blocks, typically 128MB or 256MB. But what happens
How HDFS Handles File Partitioning and Block Distribution

1 May, 2025

One of the key innovations behind the Hadoop Distributed File System (HDFS) is how it breaks down large files and distributes them across multiple machines. This mechanism, called partitioning and

April ²³

What is HDFS and Why Was It Revolutionary for Big Data?

30 Apr, 2025

In the early 2000s, the world was generating data at a scale never seen before—web logs, social media, sensors, and more. Traditional storage systems simply couldn't keep up with the volume, velocity,
What Is Serialization?

30 Apr, 2025

In the world of data engineering and software systems, serialization is a fundamental concept that allows you to efficiently store, transmit, and reconstruct data structures. If you’ve worked with
From HDFS to S3: The Evolution of Data Lakes in the Cloud

29 Apr, 2025

For years, HDFS (Hadoop Distributed File System) was the default choice for building data lakes in on-premises and Hadoop-based environments. But as cloud computing gained momentum, a new player took
Is S3 the New HDFS? Comparisons and Use Cases in Big Data

29 Apr, 2025

Over the past decade, the way organizations store and manage big data has shifted dramatically. Once dominated by the Hadoop Distributed File System (HDFS), the field is now led by Amazon S3 and
The History and Evolution of Amazon S3: Was It Ever Based on HDFS?

28 Apr, 2025

When discussing cloud storage today, Amazon S3 is almost synonymous with scalable, reliable object storage. However, a common question among those familiar with big data technologies like Hadoop is:
MapReduce: A Framework for Processing Unstructured Data

27 Apr, 2025

MapReduce is both a programming model and a framework designed to process massive volumes of data across distributed systems. It gained popularity primarily due to its efficiency in handling
Understanding .master() in Apache Spark

27 Apr, 2025

In Apache Spark, the .master() method is used to specify how your application will run, either on your local machine or on a cluster. Choosing the correct option is essential depending on your
How Joins Work in PostgreSQL

13 Apr, 2025

Joins are one of the most powerful features in SQL, allowing you to combine data from multiple tables in a single query. PostgreSQL, as a relational database system, provides robust support for
How to Improve Query Performance in PostgreSQL

13 Apr, 2025

PostgreSQL is a powerful relational database, but even the most robust systems can suffer from slow queries without proper tuning. Optimizing query performance is crucial to ensure scalability,
Optimizing Joins in PostgreSQL: Practical Cases

11 Apr, 2025

Joins are essential for querying relational databases, but they can significantly impact performance if not optimized correctly. PostgreSQL provides several ways to improve join efficiency, from
Benchmarking OLTP vs. OLAP: Measuring Performance Effectively

8 Apr, 2025

Understanding the performance differences between OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) is crucial for designing efficient database systems. This post outlines a
OLTP vs. OLAP: How JOINs and Efficiency Shape Their Differences

7 Apr, 2025

Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) are two distinct database architectures, each designed for different purposes. One key factor that differentiates them is
The Origins of OLTP and OLAP: A Brief History

7 Apr, 2025

Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) are fundamental concepts in database management, each serving distinct purposes. But when did these terms first appear, and
Comparison Between Star Schema and Snowflake Schema in PostgreSQL

6 Apr, 2025

When designing a database for analytical workloads, choosing the right schema can significantly impact performance and query
Running PySpark on Google Colab: Do You Still Need findspark?

5 Apr, 2025

For a long time, using Apache Spark in Google Colab required manual setup, including installing Spark and configuring Python to recognize it. This was often done using the findspark
Testing Apache Airflow DAGs: A Modular Approach

5 Apr, 2025

Apache Airflow is a powerful workflow automation tool, but testing DAGs can be challenging due to their dependency on the Airflow scheduler and execution environment. In this post, we
Visualizing EXPLAIN ANALYZE in PostgreSQL

4 Apr, 2025

When working with PostgreSQL, understanding how queries execute can greatly improve performance tuning and optimization. PostgreSQL provides the EXPLAIN ANALYZE command to help developers analyze
Enabling Internet Access for Resources in a Public Subnet

3 Apr, 2025

When deploying resources in a public subnet within an AWS Virtual Private Cloud (VPC), you need to configure several components to allow them to communicate with the internet. Below are the essential
Network Address Translation (NAT): Overcoming IPv4 Shortages

3 Apr, 2025

Network Address Translation (NAT) is a technology designed to mitigate the shortage of IPv4 addresses by allowing multiple devices on a private network to share a limited number of public
Understanding Subnets, Gateways, and Route Tables in AWS

3 Apr, 2025

When designing applications in AWS, it's crucial to understand how networking components interact within a Virtual Private Cloud (VPC). This post will cover subnets, gateways, and route tables,
Generating a Calendar Table in Power Query (M Language)

2 Apr, 2025

When working with Power BI or other Power Query-supported tools, having a well-structured calendar table is essential for time-based analysis. In this blog post, we will walk through an M Language
How to Display an Error in Excel When More Than 5 "FALSE" Values Appear in a Row

2 Apr, 2025

When working with data in Excel, there may be instances where you need to monitor certain conditions and flag errors based on specific criteria. In this guide, we'll walk through a simple
Splitting Strings in Excel: A Simple Guide

2 Apr, 2025

When working with Excel, you may encounter situations where you need to split a string into separate parts. For example, consider the following string: orderId: 12345abc-de67-89fg-hijk-123456lmnop If

March ¹³

Handling Schema Changes in a Data Warehouse

31 Mar, 2025

When building and maintaining a Data Warehouse (DWH), handling schema changes without breaking existing processes is a crucial challenge for data engineers. As new requirements emerge, we often need
Understanding How Hive Converts SQL Queries into Hadoop Jobs

31 Mar, 2025

When you execute a SQL query in Apache Hive, the query is not directly run on a traditional database. Instead, Hive translates it into a Hadoop job, which is then executed across a distributed system.
Delta Lake vs. Traditional Data Lakes: Key Differences and Vendor Options

30 Mar, 2025

As data-driven organizations scale their analytics and machine learning workloads, the limitations of traditional data lakes become more apparent. Delta Lake is an open-source storage
Why OLTP Systems Don't Retain Historical Changes

30 Mar, 2025

Online Transaction Processing (OLTP) systems are designed for high-speed transactions and efficient data management. However, one of their characteristics is that they do not retain historical changes
Understanding Slowly Changing Dimensions (SCD) in Data Warehousing

28 Mar, 2025

When dealing with data warehouses, handling changes in dimension data over time is crucial. Unlike operational databases where updates are straightforward, data warehouses require preserving
Modes and Examples of KPIs in Data Analysis Expressions (DAX)

26 Mar, 2025

Last Year Comparison: When analyzing sales performance, it is often useful to compare the current year's sales with the same period in the previous year. To do this, we create several calculated
Understanding Surrogate Keys in Databases

26 Mar, 2025

When designing relational databases, one crucial decision is how to uniquely identify each record in a table. This is where surrogate keys come into play. Unlike natural keys, which derive from
Understanding the Relationship Between Database Replication and the CAP Theorem

26 Mar, 2025

Database replication is a fundamental strategy in distributed systems that ensures data is duplicated across multiple nodes. However, when designing a replicated database, one must
Understanding Pagination vs. Batch Processing in Data Handling

25 Mar, 2025

When working with large datasets, developers often face the challenge of efficiently extracting, processing, and managing data. Two commonly used techniques for handling such data efficiently are
Tracking Daily File Size Changes in SQL

23 Mar, 2025

When working with databases that store file metadata, it's often useful to track how file sizes change over time. If you have a table with the following structure: id | timestamp | name_file | size
Resolving 'index.lock' Issue in Git

7 Mar, 2025

When working with Git, you may encounter an error preventing you from switching branches or performing other operations. A common issue is the following: fatal: Unable to create '.git/index.lock':
Merging Data in PostgreSQL vs. MySQL: How to Handle Upserts

2 Mar, 2025

When working with databases, you often need to update existing records or insert new ones based on whether a match is found. In PostgreSQL 15 and later, this can be handled using the MERGE statement.
Understanding the Difference Between hostname -i and hostname -I in Linux

1 Mar, 2025

When working with Linux, you might come across the commands hostname -i and hostname -I, both of which return IP addresses. At first glance, they seem similar, but they serve different purposes. In

February ⁷

Understanding the CAP Theorem in NoSQL Databases

9 Feb, 2025

The CAP theorem (Consistency, Availability, and Partition Tolerance) plays a crucial role in designing and selecting NoSQL databases. This theorem states that in a distributed system, it is impossible
Understanding the Differences Between Parquet, Avro, JSON, and CSV

9 Feb, 2025

When working with data, choosing the right file format can significantly impact performance, storage efficiency, and ease of use. In this post, we will compare four widely used data formats: Parquet,
Optimizing Queries with Partitioning in Databricks

8 Feb, 2025

Partitioning is a crucial optimization technique in big data environments like Databricks. By partitioning datasets, we can significantly improve query performance and reduce computation time. This
Calculating Levenshtein Distance in Apache Spark Using a UDF

7 Feb, 2025

When working with text data in big data environments, measuring the similarity between strings can be essential. One of the most commonly used metrics for this is the Levenshtein distance, which
Creating a PySpark DataFrame for Sentiment Analysis

4 Feb, 2025

When working with sentiment analysis, having structured data in a PySpark DataFrame can be very useful for processing large datasets efficiently. In this post, we will create a PySpark DataFrame
Understanding Docker Engine Components

2 Feb, 2025

Docker Engine is an open-source platform that has revolutionized how applications are developed, deployed, and executed using container technology. By encapsulating applications and their dependencies
Automating Payment Calculation in Google Docs Using Apps Script

1 Feb, 2025

Google Apps Script is a powerful tool that allows you to automate tasks within Google Workspace applications, such as Google Docs. In this tutorial, we will create a script that prompts

January ¹³

Ranking Products Using Window Functions in PySpark

30 Jan, 2025

Window functions are powerful tools in SQL and PySpark that allow us to perform calculations across a subset of rows related to the current row. In this blog post, we'll explore how to
Handling Null Values in Data: Algorithms and Strategies

27 Jan, 2025

Null values are a common challenge in data analysis and machine learning. Dealing with them effectively is essential to ensure the reliability of your insights and models. In this post, we’ll explore
Exploring Free Resources to Learn AWS and Azure Cloud Platforms

18 Jan, 2025

Cloud computing is an essential skill in today’s tech landscape. Among the major players, AWS and Azure stand out as leading cloud platforms, offering a wealth of free resources to help individuals
What Does an Exploratory Data Analysis (EDA) Evaluate?

18 Jan, 2025

An Exploratory Data Analysis (EDA) is a critical step in the data analysis process that focuses on evaluating and examining data to uncover its main characteristics. It is performed before delving
Adding Custom Columns to Your Date Table in Power BI

14 Jan, 2025

A Date Table is an integral part of building robust and insightful Power BI reports. While a basic Date Table allows for time-based filtering and analysis, custom columns can add even
Grouping Data in PySpark with Aliases for Aggregated Columns

13 Jan, 2025

When working with large datasets in PySpark, grouping data and applying aggregations is a common task. In this post, we’ll explore how to group data by a specific column and use aliases for the
Handling Offset-Naive and Offset-Aware Datetimes in Python

12 Jan, 2025

When working with datetime objects in Python, you may encounter the error: TypeError: can't compare offset-naive and offset-aware datetimes. This error occurs when comparing two datetime objects where
Extracting Dynamic Content from an iFrame with Selenium in Python

7 Jan, 2025

Accessing content inside an iFrame can be tricky, especially when the content is loaded dynamically. In this blog post, we’ll walk through an example of how to navigate an iFrame, click on an
Automating SQL Script Execution with Cron

6 Jan, 2025

In this blog post, we’ll explore how to automate the execution of SQL scripts using cron, a powerful scheduling tool available on Unix-based systems. This approach is ideal for database
Are Indexes a Good Strategy for Analytical Databases?

3 Jan, 2025

Indexes are a well-known optimization technique in database management, often associated with improving query performance. However, whether they are a good strategy for analytical databases depends on
Counting Word Frequency in a SQL Column

3 Jan, 2025

Sometimes, you may need to analyze text data stored in a database, such as counting the frequency of words in a text column. This blog post demonstrates how to achieve this in SQL using a practical
Orchestrating SQL Files: Efficiently Managing Multiple Scripts

2 Jan, 2025

When working on database projects, you often find yourself managing and executing multiple SQL files. Whether these files are for creating schemas, seeding data, or running migrations, orchestrating
Understanding the Evolution of Data Warehousing: From Codd's Relational Model to Modern Data Warehouses

2 Jan, 2025

Data management has undergone significant transformations since the advent of the relational model by Edgar F. Codd. Today, data warehouses stand as a cornerstone of modern data analytics. This blog

2024 ¹³¹

December ²

How to Rename a Git Branch Locally and Remotely

15 Dec, 2024

Renaming Git branches can be necessary when adhering to naming conventions or correcting errors. This guide will walk you through the process of renaming a branch locally and remotely. Scenario: You
Troubleshooting Import Errors in Python: A Case Study

15 Dec, 2024

Python's modular design allows developers to break their code into smaller, reusable components. However, import errors can often disrupt the flow, especially in complex projects. In this post, we’ll

November ³

Creating Dynamic Dates in Excel: A Practical Guide

22 Nov, 2024

When working with Excel, you may encounter situations where you need to dynamically generate a date using the current year, a specific month, and a day. This post will guide you through creating such
How to Simulate Column Headers Without Selecting from a Table in SQL

6 Nov, 2024

In some cases, you may want to produce a result set with specified column names and values without querying an actual table. This is often used for testing purposes, documentation, or even when
How to Create a Date Table in Power BI Using DAX

4 Nov, 2024

In Power BI, a Date Table is essential for working with time series data effectively. A well-structured Date Table simplifies time-based analysis, allowing you to filter by specific

October ¹²

Parsing Complex Data from HTML Tables with Python

31 Oct, 2024

When working with web scraping, you often encounter scenarios where HTML content is nested or contains encoded data within JavaScript attributes. This post walks through parsing player statistics from
Comparative Investment Analysis of Invesco and Blackstone Using Python

27 Oct, 2024

In this post, we'll explore how to use Python programming to compare the performance of two investment firms, Invesco and Blackstone. Invesco is known for its focus on public asset
Comparing Risk-Adjusted Returns Using the Sharpe Ratio in Python

27 Oct, 2024

Investors frequently face the challenge of assessing whether an asset's return justifies its risk. This is where the Sharpe Ratio becomes invaluable, providing a measure that accounts for both returns
Handling the "ERR_HTTP_HEADERS_SENT" Error in Node.js Express

18 Oct, 2024

When building REST APIs with Node.js and Express, one common error that developers encounter is ERR_HTTP_HEADERS_SENT: Cannot set headers after they are sent to the client. This error can be
Understanding Stateful vs. Stateless Firewalls in AWS

16 Oct, 2024

When working with network security, it's crucial to understand the difference between stateful and stateless firewalls. In AWS, this understanding is particularly important when configuring security
Creating a Custom Column with SWITCH in Power BI

13 Oct, 2024

In Power BI, creating custom columns based on multiple conditions is a powerful way to enhance the analysis and presentation of your data. One of the most versatile functions for this purpose is
Understanding module.exports in Node.js: Exporting and Importing Modules

12 Oct, 2024

In Node.js, organizing your code into reusable, modular components is a key practice for writing maintainable applications. This is done through modules — self-contained blocks of code that can be
Filtering Data in Azure Data Factory: Keeping Only "FileWrite" Operations

7 Oct, 2024

In this post, I’ll walk through how to filter rows in Azure Data Factory (ADF) using the Filter activity to retain only the rows where a specific column (OperationName) has the value "FileWrite".
Handling Deletion of Bootcamps in a Node.js API with Mongoose

6 Oct, 2024

In this post, I’ll walk through the process of handling the deletion of bootcamps in a Node.js API using Mongoose. Recently, while working on a project, I encountered a TypeError when attempting to
Built-in Functions vs. Object-Oriented Methods

5 Oct, 2024

Python strives to be simple and clear, so some operations are implemented as built-in functions, while others are object-specific methods. This distinction arises from the way Python handles
Extracting the Last Element from a Delimited String in Azure Data Factory

5 Oct, 2024

When working with data in Azure Data Factory (ADF), it's common to deal with delimited strings. You might need to extract the last element from such strings. For instance, given a string like
How to Simplify a Mongoose Schema in Node.js

1 Oct, 2024

When working with Mongoose in Node.js, defining a schema for your models can get repetitive and verbose, especially if you're specifying data types and validation repeatedly. In this post, we’ll look

September ²²

How to Choose the Best Classification Model Based on Performance Metrics

30 Sep, 2024

When working on machine learning classification tasks, selecting the best model often involves analyzing various performance metrics like accuracy, precision, recall, and F1-score. In this post, I’ll
How to Log in Python: Console and File Logging with yfinance Example

30 Sep, 2024

Logging is a vital part of any application, offering insights into the application's flow, performance, and error handling. In many scenarios, you may want to log messages both to the console and a
Implementing Query Filtering in Express with Mongoose

30 Sep, 2024

In modern API development, providing flexible querying mechanisms is essential to allow clients to filter and retrieve data efficiently. In this post, we'll go over how to implement query filtering
Downloading Data from the SEC Website using Python

29 Sep, 2024

In this blog post, I’ll show you how to download a JSON file from the U.S. Securities and Exchange Commission (SEC) website using Python. The file contains company tickers, which can be useful for
Understanding Idempotency in Python with Simple Examples

28 Sep, 2024

Idempotency is a fundamental concept in computing that describes operations which produce the same result no matter how many times they are performed. In this blog post, we’ll explore idempotency
Best Practices: Using Direct SQL Queries in CodeIgniter

25 Sep, 2024

In this blog post, we'll discuss the pros and cons of using direct SQL queries in CodeIgniter and explore alternatives that enhance security, readability, and maintainability. What is Direct SQL?
Creating a Custom Column with a Random String in Power BI Using DAX

25 Sep, 2024

In Power BI, customizing your dataset by adding calculated columns can significantly enhance your data analysis capabilities. One common need is to generate random strings or categories
How to Implement MVC in CodeIgniter to Clean Up Your Views

25 Sep, 2024

When building web applications, it's easy to end up with PHP logic mixed directly into your HTML views, especially in smaller projects. However, this can lead to messy, hard-to-maintain code. The
Counting Covered Points on a Number Line

19 Sep, 2024

Algorithmic challenges often involve intervals and can initially seem complex. One such problem is determining how many unique points are covered by a set of intervals on a number line.
Renaming Modules in Python for Clarity and Accuracy

19 Sep, 2024

Renaming modules in Python is an essential practice to improve code clarity and maintainability, especially as projects grow in complexity. Using intuitive and descriptive names helps in quickly
Handling shutil.SameFileError When Copying Files in Python

18 Sep, 2024

When using Python’s shutil.copy() or shutil.copy2() to copy files, you might run into a shutil.SameFileError if you mistakenly attempt to copy a file onto itself. This error occurs when the source and
Preserving Directory Structure While Copying Files in Python - version 2

17 Sep, 2024

When copying files from one directory to another in Python, it's important to maintain the original directory structure, especially when dealing with nested directories. In this post, we'll explore
Avoiding Duplicate File Copies Based on Content in Python on AWS

16 Sep, 2024

When working with large file systems, copying files can often lead to unintentional duplication, especially if files with the same content are repeatedly copied into different directories. While
Handling NoneType Errors When Extending Lists in Python

16 Sep, 2024

When working with Python, especially with functions that return lists or other iterable objects, you might encounter a TypeError that says something like: TypeError: 'NoneType' object is not iterable
Tracking File Changes in S3 Using ETags

16 Sep, 2024

When working with AWS S3, tracking changes to files can be essential, especially when versioning is not enabled on the bucket. The ETag associated with each file in S3 can provide a simple way to
Working with S3 Object Metadata: Understanding ETags and Last Modified Dates

13 Sep, 2024

When working with AWS S3, managing large amounts of data effectively involves understanding key metadata like the ETag and Last Modified date. These properties help track file changes and ensure data
Implementing Retries in Python

9 Sep, 2024

In many real-world applications, simply handling an error isn't always enough. Sometimes, the failure is temporary, and retrying the operation can help resolve the issue. In this post, we’ll explore
Efficiently Listing and Filtering S3 Objects by Date

7 Sep, 2024

When working with AWS S3 buckets, it’s common to have a large number of objects stored, and you might need to filter them based on certain criteria like dates. This blog post will guide you on how to
Handling Division Errors and Implementing Basic Retry Logic in Python

7 Sep, 2024

In Python, error handling is essential for preventing crashes and ensuring smooth execution. One common error that developers encounter is the ZeroDivisionError, which occurs when trying to divide by
Handling Split Errors in Azure Data Factory: A Step-by-Step Guide

6 Sep, 2024

In Azure Data Factory (ADF), we often use expressions to manipulate strings and extract specific parts of data. One common operation is splitting strings based on a delimiter. However, this can
Customizing Legends in Seaborn Boxplots: A Guide

2 Sep, 2024

Creating clear and informative visualizations is key to effectively communicating data insights. In this post, we will explore how to customize legends in Seaborn boxplots, ensuring that the labels
How to Create Age Group Categories in Pandas and Visualize Them with Matplotlib

2 Sep, 2024

Data visualization is a key part of data analysis, helping to communicate insights clearly. In this blog post, we'll learn how to categorize age data into specified groups using Pandas and then

August ³⁷

Comparing Window Functions with Aggregate Functions in SQL

26 Aug, 2024

SQL is a powerful language for querying and manipulating data, and both window functions and aggregate functions are central to its capabilities. While they serve related purposes, they
Defining Custom Window Frames in SQL Server

26 Aug, 2024

Window functions in SQL Server are powerful tools that allow for advanced data analysis within queries. One of the key features of window functions is the ability to define custom window
Creating a Running Total in SQL Server with Window Functions

25 Aug, 2024

Calculating a running total is a common requirement in many data analysis tasks, such as tracking cumulative sales, computing cumulative scores, or keeping track of inventory levels. In
Filtering Items in Azure Data Factory: Excluding Items That Begin with an Underscore

24 Aug, 2024

Azure Data Factory (ADF) is a powerful tool for building ETL (Extract, Transform, Load) workflows in the cloud. One common requirement is to filter data or files based on certain conditions. In this
Using the OVER() Clause with Window Functions in SQL Server

24 Aug, 2024

SQL window functions have become an indispensable tool for data analysts and developers. They allow for advanced calculations that go beyond simple aggregates, enabling analysis over a
Extracting Year, Month, and Day from Dates in Azure Data Factory

23 Aug, 2024

In Azure Data Factory (ADF), working with dates is a common task, especially when dealing with data transformations and scheduling tasks. ADF allows you to handle dates in different formats, such as
Splitting Strings and Accessing Elements in Azure Data Factory

23 Aug, 2024

Azure Data Factory (ADF) is a powerful cloud-based data integration service that allows you to create data-driven workflows for orchestrating and automating data movement and data
Extracting the Header from a CSV File in Python

21 Aug, 2024

When working with CSV files in Python, it's often necessary to extract the header (the first row of the file) to understand the structure of the data or to perform specific operations on the remaining
Handling CSV Files in Python: Writing Specific Rows with Custom Headers

21 Aug, 2024

When working with CSV files in Python, you often need to filter and write specific rows to a new file, including handling headers properly. This blog post will guide you through the process using
Understanding Window Functions in SQL: A Deep Dive

21 Aug, 2024

When working with databases, you'll often need to perform calculations across a set of rows related to the current row in your query. Whether you're calculating a running total, ranking
Extracting the Last Segment of a String in SQL Server

16 Aug, 2024

When working with data, you often need to manipulate strings to extract meaningful information. A common scenario is having strings with segments separated by underscores (_), and you only need the
How to Check if Two Tables Have the Same Columns in SQL

16 Aug, 2024

When working with databases, it's sometimes necessary to compare two tables to ensure they have the same structure. Specifically, you might need to verify that two tables have the same columns before
Identifying Duplicate Records in SQL Based on Specific Fields

16 Aug, 2024

In database management, identifying and handling duplicate records is crucial to ensure data integrity. This post will guide you through a SQL query designed to find duplicates based on a specific
Creating a Pandas DataFrame from a List of Lists

14 Aug, 2024

Working with data in Python often involves using pandas, one of the most powerful libraries for data manipulation. A common task is converting a list of lists into a pandas DataFrame. This post will
Inserting Data from a CSV File into SQL Server Using PowerShell with Windows Authentication

14 Aug, 2024

In many data integration scenarios, you'll need to import data from a CSV file into a SQL Server database. PowerShell is a powerful tool that allows you to automate this process
Debugging SQL Joins: Troubleshooting OR Joins with Multiple Columns in PostgreSQL

12 Aug, 2024

SQL joins are essential for combining data from different tables in a relational database. However, they can sometimes be tricky, especially when dealing with complex joins that involve
How to Safely Retrieve and Return SQL Query Results in Python

12 Aug, 2024

When working with databases in Python, one common task is to count the number of records in a table. This might seem straightforward, but it’s easy to run into errors if the code isn’t structured
Resolving PostgreSQL Authentication Errors in Docker Compose for Redash

12 Aug, 2024

When setting up Redash with Docker Compose, one of the common errors users might encounter is related to PostgreSQL authentication. Specifically, the psycopg2.OperationalError:
Understanding the DECIMAL Data Type in SQL: What Happens When You Don't Specify Parameters?

12 Aug, 2024

When working with SQL, one of the fundamental aspects of database design and manipulation is understanding the various data types available. Among them, the DECIMAL type is often used for representing
Understanding Window Functions in SQL with Running Totals

12 Aug, 2024

Introduction Window functions in SQL are incredibly powerful, allowing you to perform calculations across a set of table rows related to the current row. They enable tasks like calculating running
Understanding the Quality of a Multiple Linear Regression Model: Analyzing SalaryUSD Predictions

10 Aug, 2024

In this blog post, we'll dive into the process of analyzing the quality of a multiple linear regression model, specifically focusing on predicting SalaryUSD based on factors like EducationLevel and
How to Handle shutil.SameFileError When Copying Files in Python

9 Aug, 2024

Introduction Have you ever encountered the shutil.SameFileError while trying to copy files in Python? This error occurs when you attempt to copy a file onto itself, resulting in a failed operation. In
Resolving "Same File" Errors in Python When Copying Files with Directory Replication

9 Aug, 2024

When working with file management in Python, you might encounter the dreaded "SameFileError" when trying to copy a file using the shutil.copy2() function. This error occurs when Python detects that
Retrieving the Name of an SQL Script File in Your Query

9 Aug, 2024

When working with SQL scripts, there are times when you might want to dynamically retrieve the name of the script that is currently executing. Unfortunately, SQL itself doesn't provide a
Adding Numbers Around a Center Element in a 2D Grid in Python

8 Aug, 2024

In this post, we'll explore how to manipulate a 2D grid in Python by adding numbers around a specific center element. This is a common problem in various applications, such as implementing a basic
How to Convert a SQL SELECT COUNT Query to SQLAlchemy in Python

8 Aug, 2024

When working with databases in Python, you might often need to translate raw SQL queries into SQLAlchemy, a powerful ORM (Object-Relational Mapper) that allows you to interact with your database in a
Effortlessly Count Lines Starting with "DE" in a Text File Using PowerShell

7 Aug, 2024

PowerShell is a powerful scripting language that can automate various tasks, including file manipulation and data processing. In this post, we'll demonstrate how to count the number of lines in a text
How to Optimize a Function for Counting Matching Products in a DataFrame

7 Aug, 2024

In this post, we'll walk through how to optimize a function that counts the number of rows in a DataFrame where two specified products are present. The original function, though functional, can be
Working with JSON Data in Python: A Comprehensive Guide

6 Aug, 2024

Introduction In today's digital age, handling data in various formats is crucial for developers and data scientists alike. JSON (JavaScript Object Notation) has become a popular choice due to its
Extracting Specific Text Between Strings Using Python

5 Aug, 2024

In this blog post, we'll learn how to extract a specific portion of text between two substrings in a given input string. This technique is useful in various scenarios, such as processing file paths,
Handling Errors in Python: Ensuring Successful String Splits

5 Aug, 2024

When working with strings in Python, you often need to split them based on a delimiter. While the split method is straightforward, there might be scenarios where the split operation doesn't yield the
How to Read a File from a Network Path in Python

4 Aug, 2024

In many business and enterprise environments, data is often stored on network drives accessible to multiple users. Python provides several ways to access and read files from these network paths. This
Avoiding Duplicate File Copies Based on Content in Python

3 Aug, 2024

Introduction When dealing with large datasets, it is common to encounter duplicate files, especially when copying files based on specific criteria. Simply comparing file names or paths isn't
Copying Files Containing a Specific Word Using Python

3 Aug, 2024

Introduction When working with large datasets or numerous text files, you might find yourself needing to search for files containing specific words or phrases. Automating this task can save a lot of
Preserving Directory Structure While Copying Files in Python

3 Aug, 2024

Introduction When working with large datasets or numerous text files, it might be necessary to copy files containing specific words to a new destination while preserving the original directory
Analyzing Salaries by Country: Using Boxplots to Visualize Median and Mean

1 Aug, 2024

Introduction: Understanding salary distributions across different countries is crucial for various economic analyses, market insights, and policy decisions. Boxplots are an effective graphical tool
Generating and Uploading Random Data to Azure Blob Storage Using Python

1 Aug, 2024

Introduction In today's data-driven world, automating data generation and storage is crucial for various applications, including testing, data analysis, and machine learning. This blog post will guide

July ²⁸

Decrypting Encrypted Data with Subqueries in SQL

30 Jul, 2024

When working with encrypted data in SQL, it's essential to ensure that the decryption process is secure and efficient. One effective approach is using subqueries. In this post, we'll demonstrate how
Extracting the Last Part of a String in SQL Server

30 Jul, 2024

Introduction When working with SQL Server, you might often encounter scenarios where you need to extract a specific part of a string. For example, you might have a string in the format
Working with Dates in Python: Extracting and Incrementing Dates

30 Jul, 2024

Dates are a fundamental part of many applications, from logging events to scheduling tasks. Python’s datetime module provides powerful tools to handle dates and times. In this post, we'll explore how
Creating a Dictionary from a Word and a List in Python

27 Jul, 2024

In Python, creating and manipulating dictionaries is a common task. In this post, we'll walk through a simple example of how to write a function that takes a word and a list, and returns a dictionary
Selenium vs. Beautiful Soup: Choosing the Right Tool for Web Scraping

27 Jul, 2024

When it comes to web scraping, two tools often stand out: Selenium and Beautiful Soup. Each has its strengths and is suited for different types of tasks. In this post, we’ll dive into what each tool
Extracting Data from Fixed-Width Text Files into Pandas DataFrame

24 Jul, 2024

Working with fixed-width text files can be challenging, especially when you need to extract specific fields and transform them into a structured format like a Pandas DataFrame. In this blog post, I'll
How to Insert a New Row in a Pandas DataFrame

24 Jul, 2024

Working with data often involves modifying it to suit your analysis needs. One common operation is inserting a new row into a DataFrame. In this post, we'll explore several methods to achieve this in
Loading JSON Data into a pandas DataFrame with Python

24 Jul, 2024

In this post, we will walk through the process of loading JSON data into a pandas DataFrame using Python. JSON (JavaScript Object Notation) is a popular data format for exchanging data between a
Adding SQL Script Filenames to Batch Script Output CSV

23 Jul, 2024

When working with batch scripts to execute multiple SQL scripts, it's often helpful to log not only the results but also the filenames of the executed scripts. This can make it easier to track which
Automating SQL Script Execution and Logging with Batch Scripts

22 Jul, 2024

Introduction Automating database tasks can significantly enhance productivity, especially when dealing with multiple SQL scripts. In this tutorial, we will create a batch script to execute SQL scripts
Leveraging SQL Window Functions with PARTITION BY

22 Jul, 2024

SQL window functions are a powerful tool for performing calculations across a set of rows related to the current row. When combined with the PARTITION BY clause, these functions can provide deep
Avoiding Overwriting and Extra Spaces When Writing to Files in Python

20 Jul, 2024

When working with files in Python, it's common to encounter situations where you need to append new lines to an existing file without overwriting its current content. Additionally, managing whitespace
How to Select Specific Rows from a DataFrame in Python

20 Jul, 2024

When working with DataFrames in Python, you may encounter situations where you need to filter and select specific rows based on certain conditions. In this blog post, we will explore how to create a
Extracting Substrings from Strings in SQL Server

19 Jul, 2024

When working with SQL Server databases, you may often encounter scenarios where you need to extract specific parts of a string based on a pattern. A common requirement is to retrieve the substring
Loading JSON Data into a Pandas DataFrame

17 Jul, 2024

When working with data, it's common to encounter various file formats. JSON (JavaScript Object Notation) is a popular format for data exchange due to its readability and ease of use. In this post,
Running SQL Queries from a Batch File: Retrieving the Server Name

16 Jul, 2024

When working with SQL servers, it's often useful to automate routine tasks using batch files. One common task is retrieving the server name where your database is running. In this post, we'll walk
How to Drop Rows Based on a Column Value in a pandas DataFrame

5 Jul, 2024

Problem Statement Let's say you have a DataFrame containing weather data, and you want to drop all rows where the quantity ( qty ) is less than 5. However, you notice that some rows are being dropped
How to Print Lines Containing Non-Zero, Non-Dot, and Non-Space Characters in Python

5 Jul, 2024

In this blog post, we'll explore a simple yet useful task: printing lines that contain at least one character that is not a zero ( 0 ), a dot ( . ), or a space. This can be particularly handy when
Removing Rows from a Pandas DataFrame that Begin with Specific Characters

5 Jul, 2024

In this post, I'll walk you through how to remove rows from a pandas DataFrame that begin with specific characters, such as "---". This is a common task when cleaning and preprocessing data in Python.
Transforming a Matrix by Adding Numbers Around a Specific Value in Python

5 Jul, 2024

When working with matrices, we often need to perform transformations that update the values based on certain conditions. In this post, we'll walk through a function that takes a matrix and updates it
Ensuring Type Safety in Python Functions

3 Jul, 2024

When writing Python functions, ensuring that the parameters are of the correct type is crucial for robust and error-free code. In this post, we'll explore how to enforce type checks in a function to
Inserting a Student into a Sorted List in Python

3 Jul, 2024

When working with sorted lists in Python, it's essential to ensure that any new elements are added in the correct order. Problem Statement You have a list of student names sorted in alphabetical
Filtering and Counting Keys in Python Dictionaries

2 Jul, 2024

In this post, we'll explore how to count the keys in a dictionary and filter a dictionary by its values in Python. These are common tasks that can be useful in a variety of situations when working
Root Cause Analysis (RCA) for Data

2 Jul, 2024

Introduction In the realm of data management and analysis, problems can range from data quality issues to processing errors and performance bottlenecks. Identifying the root cause of these issues is
Combining Multiple CSV Files into One with Python

1 Jul, 2024

If you work with data, chances are you've encountered situations where you need to merge multiple CSV files into a single file for analysis. Manually combining these files can be time-consuming and
Exploring Key Services for the AWS Solutions Architect Exam

1 Jul, 2024

Key AWS Services for the Solutions Architect Exam Amazon RDS (Relational Database Service) Amazon RDS makes it easy to set up, operate, and scale a relational database in the cloud. It provides
How to Aggregate Values by Date and Sum Them in Python

1 Jul, 2024

Problem Statement Suppose you have a list of transactions or events, each associated with a date and a numeric value (e.g., sales amount, transaction amount). Your goal is to aggregate these values by
Understanding Data Layout, Files, and Tree Indexes: An Overview

1 Jul, 2024

In this post, we'll explore several fundamental concepts related to data storage and indexing: Data Layout, Files, Tree Indexes, and B+ Trees. Understanding these concepts is crucial for anyone

June ⁵

Effective Knowledge Transfer of Data: Key Elements

29 Jun, 2024

Transferring knowledge, especially when it involves data, is a critical process, particularly between consultants. When you're transitioning to a new team, it's crucial to get this process right. Here
Understanding MAC Addresses: Hexadecimal, Binary, and Decimal Representations

29 Jun, 2024

In this post, we'll explore what a MAC address is, how it's represented in hexadecimal notation, and how to convert it to binary and decimal formats. We'll use the MAC address 88-B2-2F-54-1A-0F as an
Creating Directories in Python

25 Jun, 2024

The os Module Python’s os module provides a way to interact with the operating system. It includes functions for creating, removing, and checking the existence of directories and files. In this
Route Summarization and Subnetting

24 Jun, 2024

We will walk through the process of subnetting a network and performing route summarization using an example. Subnetting Example Let's consider the following four subnets: 192.168.0.0/22
Pandas Dataframe: apply method

22 Jun, 2024

Calculating Discounts, Taxes, and Total Amount in a DataFrame Suppose you have the following data in a DataFrame: Product Price Category 0 A 100 Electronic 1 B 200 Cloth 2 C 150 Electronic 3 D 300

May ⁹

Escape sequences in Python

22 May, 2024

Escape sequences as in C While reading Automate the Boring Stuff with Python: Practical Programming for Total Beginners, I noticed the existence of raw strings. A raw string is created in such a way
Dictionary methods: keys(), values(), and items()

19 May, 2024

In Python, there are three special methods related to dictionaries that are worth mentioning: keys() , values() , and items() . Interestingly, these methods do not return true lists. They cannot be
Minimizing Operational Overhead of EC2 Fleet OS Security Governance in AWS: Recommendations for DevOps Teams

19 May, 2024

Minimizing the operational overhead of EC2 fleet OS security governance is essential for maintaining a secure and efficient AWS environment. In this blog post, we'll explore the challenges faced by
Implementing Resilient Architectures in AWS: Strategies for Automated Recovery and Testing

17 May, 2024

Implementing resilient architectures in AWS is essential for ensuring high availability and reliability of your applications. In this blog post, we'll explore strategies for automating recovery and
Enabling Traceability and Auditing Security Events in AWS: Best Practices and Tools

16 May, 2024

Traceability and auditing of security events are crucial for maintaining the security and compliance of your AWS environment. In this blog post, we'll explore how to enable traceability and auditing
Data Protection and Security Events in AWS: Best Practices for Ensuring Data Security

13 May, 2024

Protecting data in transit and at rest is critical for maintaining the security and compliance of your AWS environment. In this blog post, we'll explore best practices for classifying and protecting
Automating Security Best Practices in AWS: A Guide to Efficient and Secure Operations

12 May, 2024

Automating security best practices in AWS is essential for ensuring the security, scalability, and efficiency of your cloud environment. In this blog post, we'll explore the benefits of automation,
Authentication and Federation in AWS: Best Practices and Implementation Strategies

10 May, 2024

Authentication and federation are critical components of any AWS environment, ensuring secure access to resources and services. In this blog post, we'll explore the different types of identity in AWS,
Applying Security at All Layers in AWS: A Comprehensive Approach

2 May, 2024

Security is paramount in any cloud environment, and AWS offers a range of tools and services to help you apply security at all layers of your infrastructure. In this blog post, we'll explore the

April ⁵

Implementing a Strong Identity Foundation in AWS: Best Practices and Implementation Patterns

30 Apr, 2024

In any cloud environment, ensuring a strong identity foundation is paramount for maintaining security and compliance. AWS offers a range of tools and services to help you implement the principle of
Elements in the JSON Policy Structure in IAM

22 Apr, 2024

Identities in AWS In AWS you manage access by creating policies and attaching them to an identity. The way that AWS thinks of the elements which interact with them is through IDENTITIES or AWS
The AWS Well Architected Framework

18 Apr, 2024

Discover how to effectively design, utilize, and manage workloads in the cloud by translating requirements into architecture and operations while adhering to best practices. The Six Pillars:
Data Encryption at AWS S3

17 Apr, 2024

What is Encryption at rest? Encryption works by using an algorithm to convert plain text into ciphertext. This new ciphertext will be unreadable if it falls into the wrong hands. There are many
Introduction to AWS Identity and Access Management (IAM)

12 Apr, 2024

Theory Users must be authenticated before they can access AWS services and resources. AWS services can be accessed via the AWS CLI, AWS SDKs, and the AWS Management Console. You can create: Individual IAM users:

March ⁵

Run Redash Locally

7 Mar, 2024

This is for educational purposes only. You shouldn't do this in production. 1 - Clone the project from the official GitHub: Redash on GitHub - I made a fork previously. Take care. git clone
Understanding Distributed Systems - Maintainability

6 Mar, 2024

Introduction It’s widely recognized that the bulk of software costs arise after its initial development in maintenance tasks like bug fixes, feature additions, and day-to-day operation. Therefore,
Understanding Distributed Systems – Resiliency

6 Mar, 2024

Introduction Chapter 24 - Common Failure Causes: Hardware Faults, Incorrect Error Handling, Configuration Changes, Single Points of Failure, Network Faults, Resource Leaks, Load Pressure, Cascading Failures
Understanding Distributed Systems – Scalability

3 Mar, 2024

Introduction Scaling an application involves maintaining performance as load increases. The long-term solution for increasing capacity is to architect for horizontal scalability. In this section,
Understanding Distributed Systems – Coordination

2 Mar, 2024

Introduction Our ultimate goal is to build a distributed application consisting of a group of processes that gives its users the illusion they are interacting with one coherent node. While achieving a

February ¹

Understanding Distributed Systems - Communication

25 Feb, 2024

Part I - Communication Introduction Interprocess communication (IPC) is fundamental to distributed systems, enabling processes to exchange data over networks. This communication relies on agreed-upon

January ²

On Understanding Programs - Dijkstra

31 Jan, 2024

In my life I have seen many programming courses that were essentially like the usual kind of driving lessons, in which one is taught how to handle a car instead of how to use a car to reach one's
Testing, Computers and society in Notes On Structured Programming

6 Jan, 2024

The computer scientist Dijkstra has some strong opinions about testing, the art of programming and the impact of the computer on society. Let's take a second to read the opinion he wrote in Notes

2023 ⁸⁹

December ²

Understanding Distributed Systems - Introduction

28 Dec, 2023

Chapter 1: Introduction In the realm of modern technology, the need for distributed systems has become increasingly apparent. But why invest time and resources in building such intricate
Distinctions Between AWS EC2 and ECS

18 Dec, 2023

Introduction Embarking on the cloud computing journey often involves deciphering the nuanced offerings of platforms like Amazon Web Services (AWS). In this exploration, we'll unravel the seemingly

November ⁴

FTP and SFTP - Running through a Container

13 Nov, 2023

Running an FTP server using Docker is really easy. In fact, you can do it by running the following image: atmoz/sftp - But in the end we want to know what FTP is and why it is worth knowing a
Building a Lucrative Business Model in the Data Economy

2 Nov, 2023

Building a Lucrative Business Model in the Data Economy Introduction: In today's data-driven world, information is akin to a gold mine. However, to fully capitalize on this valuable resource, one must
Concepts, Techniques and Models of Computer Programming

2 Nov, 2023

Introduction: In the realm of programming, there are three fundamental elements that form its backbone. Understanding these components is crucial for any aspiring programmer. Let's delve into the
William Kent - Data & Reality

2 Nov, 2023

Chapter 1 – Entities. The book The Hitchhiker’s Guide to the Galaxy should be required reading for both business and information technology professionals. Although this is a science fiction book. I

October ⁵

Human Resources and Analytics: Enhancing Personnel Selection

12 Oct, 2023

Human Resources and Analytics: Enhancing Personnel Selection Introduction In today's dynamic landscape, the convergence of Human Resources and Analytics presents an unprecedented opportunity to
Common table expressions

10 Oct, 2023

Specifies a temporary named result set, known as a common table expression (CTE). Microsoft Documentation Although CTEs have been around for some time, the first time someone asked me about them I was
Principle of Data Wrangling

10 Oct, 2023

Data Wrangling involves the process of cleaning and organizing data before any analysis takes place. It typically consumes between 50% and 80% of an analyst's time. Factors to consider include time,
4.6 Data Warehouses

9 Oct, 2023

DataWarehouses—large historical databases for decision-support that are loaded with new data on a periodic basis — have evolved to require specialized query processing support, and in the next section
Importance of a Database System

9 Oct, 2023

As should be clear from this paper, modern commercial database systems are grounded both in academic research and in the experiences of developing industrial-strength products for high-end customers.

September ¹

Selenium vs. Beautiful Soup

30 Sep, 2023

Web scraping is a widely recognized strategy for acquiring information. Before diving into this process, it's crucial to familiarize oneself with two essential tools. Personally, this topic initially

July ⁸

Management Skills for developers

21 Jul, 2023

Leadership and direction Vision The business vision ("vision speaks of the future") should be: Specific: Clear and simplified. Avoid redundancy and overly sophisticated words. Objective: It should be
Sweetviz error: .iteritems() → .items()

19 Jul, 2023

If you install Sweetviz using the command: pip install sweetviz you're going to get this error because of a change in the pandas library. So, until the next Sweetviz release, you can use the following
Learn to speak in public

17 Jul, 2023

First steps to public speaking When giving a presentation or starting to speak, it is sometimes common to inform the audience about things they are unaware of, which may cause stress. For example,
Know your Connection String using SQL

12 Jul, 2023

I was searching the internet for how to identify my server, and I found this interesting question on Stack Overflow: How to get the connection String from a database. One answer gives us an example
SODA: Connect SQL Server without Password

12 Jul, 2023

When you take your first steps with Soda, you may want to connect to a SQL Server database. In that case you can create a specific user and grant it the proper rights. I wrote about that in
SODA: SSL: CERTIFICATE_VERIFY_FAILED - Solved

12 Jul, 2023

When you try to connect to Soda, you may run into this error: SSL: CERTIFICATE_VERIFY_FAILED. The full message looks similar to this one: requests.exceptions.SSLError:
PostgreSQL on Windows - server error 500 - Ports are not available

3 Jul, 2023

I made a mistake the first time I installed PostgreSQL: I installed it locally. It wouldn't be a problem if I weren't planning to use Docker, but I want to develop a project in Apache Airflow.
What is DOM? Why is it important to understand it?

1 Jul, 2023

The DOM tree is a crucial concept that needs to be understood and managed in order to make changes to a website. It allows for the application of styles to HTML elements and the addition of

June ³

A short introduction to the art of programming

25 Jun, 2023

Edsger W. Dijkstra - A short introduction to the art of programming Link: E.W.Dijkstra Archive: A Short Introduction to the Art of Programming (EWD 316) 1. Preface For those readers who identify the
Border Radius in CSS

25 Jun, 2023

One of the simplest projects I've found on the internet is changing some CSS properties using an input on a web page. It's a simple project, but always fun. This kind of
Python Django Dev To Deployment

25 Jun, 2023

After finishing the Udemy Course Python Django Dev to Deployment I would like to list the things I've learned during the process and the things I understand that I need to continue learning, or even,

May ¹²

Career management during our professional life

29 May, 2023

I’m not a veteran in the field. I’m in my mid-thirties. But I have found that there are some qualities I really appreciated when they appeared in the leaders and mentors I had. I would like to
The elements of programming style

21 May, 2023

When the book first saw the light, programming wasn't as important as it is today. But some of its ideas about writing style are worth noticing and knowing. For that reason reading the book
The elements of programming style: Common Blunders

20 May, 2023

Chapter 6: Common Blunders A major concern of programming is making sure that a program can defend against bad data. But even with correct data, there is no guarantee that a program will work. In this
The elements of programming style: Control Structure

20 May, 2023

Chapter 3: Control Structure A computer program is shaped by its data representation and the statements that determine its flow of control. These define the structure of a program. There is no sharp
The elements of programming style: Documentation

20 May, 2023

Chapter 8: Documentation The best documentation for a computer program is a clean structure. It also helps if the code is well formatted, with good mnemonic identifiers and labels (if any are needed),
The elements of programming style: Don't Be Too clever

20 May, 2023

Preface to the Second Edition The practice of computer programming has changed since The Elements of Programming Style first appeared. Programming style has become a legitimate topic of discussion.
The elements of programming style: Efficiency and instrumentation

20 May, 2023

Chapter 7: Efficiency and instrumentation Machines have become increasingly cheap compared to people; any discussion of computer efficiency that fails to take this into account is shortsighted.
The elements of programming style: Epilogue

20 May, 2023

Epilogue There are many good books on languages, algorithms and numerical methods available to those who want to learn programming in greater depth. Our goal was not to teach languages or algorithms,
The elements of programming style: Expressions

20 May, 2023

Chapter 2: Expressions Writing a computer program eventually boils down to writing a sequence of statements in the language at hand. How each of those statements is expressed determines in large
The elements of programming style: Input and output

20 May, 2023

Chapter 5: Input and output Test input for validity and plausibility. Make sure input cannot violate the limits of the program. Terminate input by end-of-file or marker, not by count. Identify bad input,
The elements of programming style: Program Structure

20 May, 2023

Chapter 4: Program Structure Most programs are too big to be comprehended as a single chunk. They must be divided into smaller pieces that can be conquered separately. That is the only way to write
Refactor or rewrite?

19 May, 2023

While I was reading The Elements of Programming Style I found the following quote: Don't patch bad code - rewrite it. The Elements of Programming Style - Chapter 4 - Page 1. It made me think about an

April ²

Dijkstra: The Humble programmer

11 Apr, 2023

Dijkstra wrote some interesting things about the activity of programming. This time I'm going to share some quotations and notes about the article: The Humble Programmer. Rules "discovered" for
Coders at Work

10 Apr, 2023

Coders at Work is a series of interviews conducted by Peter Seibel in 2009 in which different programmers talk about their views on technology, development, how they work as programmers and the

March ⁴

Notes about: On the cruelty of really teaching computing science

4 Mar, 2023

Radical Novelty 1: "The programmer is in the unique position that his is the only discipline and profession in which such a gigantic ratio, which totally baffles our imagination, has to be bridged by a
Django Jinja Isn't a thing

3 Mar, 2023

I was reading about Jinja and an article on Wikipedia caught my attention: Jinja (template engine) At the beginning I read: Jinja is similar to the Django So, Django Jinja and Jinja projects are
Using Google Colab to work with outside data

1 Mar, 2023

On Stack Overflow there is this question I've never asked myself: How can I create a website using google colab [closed] I have my code written in colab. I want to convert this into a website where
While you learn while you build it?

1 Mar, 2023

The quotation and the necessity of understanding what you have done are really important when you try to understand some concepts. For that reason when I found this video:

February ¹⁸

FAKER: Create Unique Random

16 Feb, 2023

You have to use: unique.random_int(min=11, max=123) A full example where you can see the creation of a persona is the following: from faker import Faker import pandas as pd fake = Faker() def
Are people in tech aware of history? Donald Knuth

16 Feb, 2023

Seibel: Do you feel like programmers and computer scientists are aware enough of the history of our field? It is, after all, a pretty short history. Knuth: There aren’t too many that are scholars.
How to know where my current Python virtual environment is located

13 Feb, 2023

If you are working on your machine with different virtual environments, perhaps you have wondered "Wait a minute. What environment am I working in?" There are two ways (that I know of) to find out. Using pip: pip -V Or using
The relation between academic computer science and the industrial practice. Donald Knuth overview

13 Feb, 2023

Seibel: You’re an academic but also have worked on big systems and have done some work in industry. How do you see the relation between academic computer science and industrial practice? Knuth: It’s
Programming is harder than writing books? - Vision of Donald Knuth

10 Feb, 2023

Seibel: Do you think you were a dramatically better programmer when you finished TeX than when you started? Knuth: Well, yes, because of literate programming. Seibel: So you had better tools, but had
Freshman computer scientists shouldn't touch a computer. What does Donald Knuth think about that?

8 Feb, 2023

The most mentioned person in the book: Donald Knuth, the author of "The Art of Computer Programming". Many people, including me, have not read it. But it justifies us. Seibel: Uh-oh; you just revealed your
SODA: Check count distinct elements

8 Feb, 2023

Categories in Data Quality When you are doing a quality check, you're looking for six levels of knowledge about the data. This might change depending on your requirements and how the data is used. Because, in
What does $ mean in shell scripts?

7 Feb, 2023

In this video: https://www.youtube.com/watch?v=o9THkT5ZPi4&t=308s I saw the weird symbol: echo $? When I started looking it up on the internet, I discovered that a lot of people have asked, and
Why choose a Data Lake?

7 Feb, 2023

There are several reasons that may lead you to choose a Data Lake as the solution for your Data Operations. The most important are: Increase operational efficiency Make data available faster Lower
SODA: Connect to SQL Server

6 Feb, 2023

After deciding to install SODA to run your quality checks, you have to connect the data source to SODA. We are going to see how to connect SODA to Microsoft SQL Server. Remember that this is a tutorial
Why was Ken Thompson's son not encouraged to study computer science?

6 Feb, 2023

Ken Thompson is one of the most famous computer scientists in the book and in the field. He became famous for a series of creations, Unix being the most relevant. Seibel: In a 1999 interview you
TOX: First steps

5 Feb, 2023

Tox is a tool for Python testing. I'm taking my first steps with it because I found it in the Faker project, which describes itself as: Faker is a Python package that generates fake data for you. Faker GitHub repository If
Coder, programmer or computer scientist? Peter Deutsch gives us an answer

4 Feb, 2023

Seibel: You were the only person I contacted about this book who had a really strong reaction to the word coder in the title. How would you prefer to describe yourself? Deutsch: I have to say at this
ASTRONOMER: Install on Windows

3 Feb, 2023

Installing Astronomer is quite simple. On Windows you have to follow these steps: Install Docker. Install Linux on Windows with WSL. Add the Astro exe to a location. In general: C\ Change the name
How to be a lead by Dan Ingalls

3 Feb, 2023

Daniel (Dan) Ingalls is one of the creators of Smalltalk. Another interesting development of Ingalls's was the context menu. Seibel: Do you have any tips on how to be a good technical leader? Ingalls: The
Talent as a programmer and talent as system-level thinking. Peter Deutsch talks about that

3 Feb, 2023

Perhaps the figure in the book most forgotten by the general public. But if you use the REPL you are paying a silent tribute, because the first REPL was created by Laurence Peter Deutsch. Link to
Readability and efficiency in your code. Guy Steele analyzes this trade-off

1 Feb, 2023

Seibel: So when you’re writing English, you’re obviously writing for a human reader and you seem to contrast that to writing software, which is for a computer. But lots of people—such as Knuth—make a
Are languages easier nowadays? Guy Steele's answer

1 Feb, 2023

Seibel: Do you think languages are getting better? You keep designing them, so hopefully you think it’s a worthwhile pursuit. Is it easier to write software now because of advances that we’ve made?

January ³⁰

Being a better reviewer and a good architect by Peter Norvig

31 Jan, 2023

Seibel: So what makes the better reviewers better? Norvig: Well, that they catch more things. Some of it is the trivial stuff of you indented the wrong number of spaces or whatever but some of it is,
Choosing the correct language by Guy Steele

31 Jan, 2023

Seibel: How much does a choice of language really matter? Are there good reasons to choose one language over another or does it all just come down to taste? Steele: Why shouldn’t taste be a good
Peter Norvig: programming as craftsmanship

31 Jan, 2023

Seibel: As a programmer, do you consider yourself a scientist, an engineer, an artist, or a craftsman? Norvig: Well, I know when you compare the various titles of books and so on, I always thought the
Programming: Now Vs Then by Guy Steele

31 Jan, 2023

Guy Steele is an academic known particularly for the "Lambda Papers". Seibel: What has changed the most in the way you think about programming now, vs. then? Other than learning that bubble sort is
Basic concepts about Amazon Redshift

30 Jan, 2023

One of the first things you will learn when you take the course Getting Started with Amazon Redshift is the following: Redshift is based on PostgreSQL, and there are four key concepts to understand about
Logging in a file to avoid print statements

30 Jan, 2023

A video that was enlightening. We need to avoid misusing print statements once we master the basic tools and ideas of programming in particular and software development in general. So, here
Peter Norvig and the idea of using tests to drive design

30 Jan, 2023

According to the point of view about the way of building software, which we can summarize as trying to develop the problem-solving element. Another interesting topic we can focus on is the way of using testing
How programming has changed over the years by Peter Norvig

29 Jan, 2023

Continuing the Peter Norvig series. The first post was: What Peter Norvig Learn about ‘Industrial Programming’? Now it's time to talk about learning to work on a team over time.
An apprentice approach is necessary, according to Peter Norvig

29 Jan, 2023

Seibel: I’m surprised you think the master-programmer model is such a dumb idea. In your “Teach Yourself Programming in Ten Years” essay you make the point that programming is a skill that, like many
Peter Norvig and the Computer Science Curriculum

29 Jan, 2023

Seibel: Speaking of things that aren’t taught as much, you’ve been both an academic and in industry; do you feel like academic computer science and industrial programming meet in the right place?
Peter Norvig: everything in your head

29 Jan, 2023

Seibel: Though your job now doesn’t entail a lot of programming you still write programs for the essays on your web site. When you’re writing these little programs, how do you approach it? Norvig: I
What makes a good programmer by Joe Armstrong. Who does Joe Armstrong hire?

29 Jan, 2023

Joe Armstrong has talked a lot about the topic of being a good programmer, as we can see in the previous posts: What Joe Armstrong did to be a better programmer? , Joe Armstrong and the Print
What Peter Norvig Learn about 'Industrial Programming'?

29 Jan, 2023

Peter Norvig is known for his technical abilities, for his degree and for being the Director of Research at Google. But he also made a big contribution to the discussion about how to learn
Joe Armstrong and the importance of the writing skills

27 Jan, 2023

Besides the opinion of Joe Armstrong about What Joe Armstrong did to be a better programmer? and Joe Armstrong and the Print Statements he also has an interesting opinion about other skills not
Joe Armstrong and the Print Statements

26 Jan, 2023

Following the Joe Armstrong quotes, there is one about the print statements. Seibel: What are the techniques that you use there? Print statements? Armstrong: Print statements. The great gods of
What Joe Armstrong did to be a better programmer?

23 Jan, 2023

Joe Armstrong, the co-designer of Erlang, was asked what he did in order to improve as a programmer. Seibel: Is there anything that you have done specifically to improve your skill as a
Joshua Bloch and the religion about the computer languages

20 Jan, 2023

Continuing with the ideas that Joshua Bloch gave us, which started in this post: Joshua Bloch and his tier list of book, here we can see an interesting one: Seibel: Why do people get so religious about
Brendan Eich and the age of the programmers

19 Jan, 2023

The creator of JavaScript, Brendan Eich, was asked about programming languages and time. Seibel: Do you feel at all that programming is a young person’s game? Eich: I think young people have
Brendan Eich and the languages over time

19 Jan, 2023

The creator of JavaScript, Brendan Eich, was asked about programming languages and time. Seibel: In general do you feel like languages are getting better over time? Eich: I think so, yeah. Maybe
Joshua Bloch and his tier list of book

19 Jan, 2023

Joshua Bloch, a software engineer, contributor to and (in some way) evangelist of Java, was asked at the time Coders at Work was published about the books every programmer should read.
Scraping Whale Alerts

19 Jan, 2023

Following the questions from the Whale-Alert first post. The Whale Alert page provides interesting information about whales; in cryptocurrency jargon, it’s a transaction above a certain amount of money.
How Douglas Crockford detects talent

17 Jan, 2023

Douglas Crockford, well known as the first person to specify the JSON format, was asked how to detect talent in a programmer. Seibel: When you’re hiring programmers,
Jamie Zawinski in Coders At Work

16 Jan, 2023

Jamie Zawinski is known for some of the things he has created. But, besides that, when he was asked how he sees himself, he gave a really interesting answer: Seibel: That brings me to another of my
Three ideas to consider to develop a microservice

16 Jan, 2023

In the post about: Is microservice architecture the silver bullet? we can find the explanation of why the microservice architecture is not a good idea for the following applications: real-time
What is programming?

15 Jan, 2023

I've finished reading "Coders at Work", a series of interviews between @peterseibel and well-known programmers/coders/(etc). The first edition was in 2009. And in the preface you can read: Yet despite
Python calculate seconds and total_seconds

13 Jan, 2023

If you want to calculate the total seconds between two dates, you could be tempted to take a timedelta and read its seconds. But this approach will give you an unexpected result. You have to use
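The difference can be seen in a minimal sketch using only the standard library. The two dates are illustrative assumptions, not taken from the post: timedelta.seconds holds only the seconds component left over after whole days are removed, while total_seconds() covers the full span.

```python
from datetime import datetime

# Two illustrative dates: 2 days and 1 hour apart.
start = datetime(2023, 1, 1, 0, 0, 0)
end = datetime(2023, 1, 3, 1, 0, 0)
delta = end - start

# .seconds ignores the 2 whole days and keeps only the leftover hour.
print(delta.seconds)          # 3600

# .total_seconds() counts the whole interval: 2 * 86400 + 3600.
print(delta.total_seconds())  # 176400.0
```

For intervals shorter than a day the two values agree, which is why the problem is easy to miss in quick tests.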
SODA: A way to make data quality checks

13 Jan, 2023

With the open-source library SODA, you can run the different operations you need to know about when you are going to apply some transformation to a data set. Source: SODA on PIP
Peter Norvig Paper: Oh shiny! antidote

5 Jan, 2023

Dark Knights In the TED talk The mind behind Linux | Linus Torvalds https://www.youtube.com/watch?v=o8NPllzkFhE&ab_channel=TED one of the comments Linus made was: Edison may not have been a nice
What is an 'Ephemeral cluster'?

5 Jan, 2023

When you create a compute service, for example in HDInsight, you can create a cluster that remains active once it's created or, on the other hand, one that stops (is 'deleted') after some amount of
Logger In python - First Approach

3 Jan, 2023

Besides print statements and debugging tools, I'm seeing the logging module in code more and more frequently. According to the Python documentation: This module defines

2022 ²⁶

December ⁷

Difference between Framework and Libraries

29 Dec, 2022

Software development has tricky words: jargon that seems unreachable when we are starting. Even though it is not a game changer, understanding this difference is a nice-to-have, and in one or two
Set secrets in Databricks

29 Dec, 2022

If you store the user and password of your connections as plain text, you are making a mistake that is easy to fix. To fix it, you have to install the Databricks CLI with pip: pip install
Resource from Vanguard ETF

26 Dec, 2022

Looking for Vanguard ETF data is not an easy task, because a lot of pages require you to sign up or even purchase a subscription. So, we can't access free information. I don't know if
Load data from Snowflake to S3

15 Dec, 2022

If you want to load data from Snowflake to S3, you should try the COPY INTO command. You run something like this in the Snowflake web app: copy into @my_ext_unload_stage/d1 from mytable;
Using Presign in AWS

12 Dec, 2022

Presign is a command you can use in the AWS CLI that lets anyone who has the pre-signed URL make an HTTP GET request to retrieve the pre-signed data inside the bucket. In the CLI you
Does server-side rendering give more importance to JavaScript?

10 Dec, 2022

Lately, server-side rendering has become more important. I don't know how important. But this raises a question for me about future jobs in JavaScript vs other back-end languages like Python.
Load CSV file from S3 to NEO4J

9 Dec, 2022

If you try to load data from S3 to NEO4J you are going to need to presign the file. So you need to expose the data to somebody that has the file. So, first you need to presign the file: aws s3

November ¹

UNIX: A History and a Memoir

3 Nov, 2022

In the era of bright consultancy, where all things are opinionated, it’s difficult to find some refreshing ideas. They really do exist, but they are difficult to find. We are talking also about some

October ⁸

Environments in Virtual Env

31 Oct, 2022

The importance of using environments As was said in Setting environments in Python it’s important to use environments for your deployment, even if these are side projects or wild repositories. But at
Event Driven Architectures

29 Oct, 2022

In the Gartner Summit of 2006, Mani Chandy talked about the existence of a misconception of Event Driven Architecture (EDA). So, he proposed to talk about the understanding of EDA and its Return of
Matei Zaharia - Spark: The Definitive Guide - Architecture of a Spark Application

23 Oct, 2022

The Architecture of a Spark Application The Spark driver The driver is the process “in the driver seat” of your Spark Application. It is the controller of the execution of a Spark Application and
Matei Zaharia - Spark: The Definitive Guide - Life Cycle of a Spark Application

23 Oct, 2022

The Life Cycle of a Spark Application (Inside Spark) The SparkSession The first step of any Spark Application is creating a SparkSession. In many interactive modes, this is done for you, but in an
Matei Zaharia - Spark: The Definitive Guide. Common Operations

22 Oct, 2022

Define Schemas manually When using Spark for production Extract, Transform, and Load (ETL), it is often a good idea to define your schemas manually, especially when working with untyped data sources
Kleppmann - Designing Data Intensive Applications

17 Oct, 2022

A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to: • Store data so that they, or another
Testing in Python: Pytest Vs Unit test

16 Oct, 2022

How important are the tests? Testing is one of the most important skills we need to develop once we join the industry. In fact, knowing about testing is something that is not as evaluated as it could
Why is it important to know what Environment Variables are?

7 Oct, 2022

While learning to avoid hardcoding some keys in my projects, I found the concept of environment variables. I've found this interesting article about this topic here on Medium: An Introduction to

September ⁴

Setting environments in Python

28 Sep, 2022

When we start a project in Python we make the beginner mistake of installing each tool anywhere. However, as our knowledge advances and we look to improve what we do, we start thinking about
Empowerment for the new leaders in tech

21 Sep, 2022

Once a new hire is designated as the leader of a team, one of the first challenges is how this new person can achieve ownership of the project and the inspiration of the
Agro Analytics Datasets

13 Sep, 2022

Looking for data sets to put some knowledge about agro analytics into practice, I found some interesting challenges: There are a lot of courses about it, for example, at Wageningen University (In fact,
Assert or AssertEqual. Differences.

7 Sep, 2022

Difference between the assert statement and assertEqual in Python

August ⁶

What is a bastion host?

24 Aug, 2022

Definition of Bastion Host A bastion host is a specific computer in a network whose purpose is to keep an attack from outside the network from affecting other parts of the system. For Example,
Are SSH and Bash the same? (Spoiler: No)

17 Aug, 2022

The thing is: when you start to run some console commands you notice that all the things you write in that place are not the same. Simple to understand, difficult to order each part in your head. I
Connect Ubuntu in Virtual Box with SSH

10 Aug, 2022

After understanding the importance of a good understanding of SSH, it’s time for our first practice: connecting from Windows to an Ubuntu installation in a virtual machine in VirtualBox. Download
What is Whale Alert

5 Aug, 2022

What is Whale Alert? Whale Alert is a blockchain tracker that reports interesting transactions, especially the larger ones. What is a blockchain tracker? It is a process that follows the blockchain
Good Guidelines to improve as a Software Developer

3 Aug, 2022

After learning the basics of programming and understanding the first steps necessary to become a competent beginner software developer, I've started to think. I'm trying to understand what are the
SSH: A Brave new world

3 Aug, 2022

When you make your first steps as a developer, you realize that one of the first activities you have to do when your code is ready is to deploy it. In that case, generally, the senior dev or someone
