Skip to content
>GLB_
Go back

Window Functions vs JOIN in Spark: A Physical Plan Perspective

When solving analytical queries in Spark SQL, there are often multiple correct formulations. However, they do not produce equivalent execution plans.

This article compares two approaches to the same problem:

“Find the second highest salary per department, but only in departments with at least two employees.”

We analyze which approach is more efficient and why, from a physical plan standpoint.

Problem Statement

Given table:

Empleados(EmployeeID, Name, DepartmentID, Salary, JoinDate)

Goal:


Approach 1: CTE + GROUP BY + JOIN

WITH RankedSalaries AS (
    SELECT
        DepartmentID,
        Name,
        Salary,
        RANK() OVER (
            PARTITION BY DepartmentID
            ORDER BY Salary DESC
        ) AS ranking
    FROM Empleados
),
DepartmentsWithAtLeastTwo AS (
    SELECT DepartmentID
    FROM Empleados
    GROUP BY DepartmentID
    HAVING COUNT(*) >= 2
)
SELECT r.DepartmentID, r.Name, r.Salary
FROM RankedSalaries r
JOIN DepartmentsWithAtLeastTwo d
  ON r.DepartmentID = d.DepartmentID
WHERE r.ranking = 2;

What Happens Physically?

Spark builds two logical branches:

  1. Branch A → Window function (RANK)
  2. Branch B → HashAggregate (COUNT per department)

Then it performs a join.

Observed Physical Characteristics

Even if AQE optimizes the join strategy, the structural cost remains higher.

Approach 2: Window-Only Strategy

WITH ResumenEmpleados AS (
    SELECT
        DepartmentID,
        Salary,
        Name,
        COUNT(EmployeeID) OVER (
            PARTITION BY DepartmentID
        ) AS cant_empleado,
        RANK() OVER (
            PARTITION BY DepartmentID
            ORDER BY Salary DESC
        ) AS salario_ranking
    FROM Empleados
)
SELECT
    DepartmentID,
    Salary,
    Name
FROM ResumenEmpleados
WHERE cant_empleado > 1
  AND salario_ranking = 2;

What Happens Physically?

Spark:

There is no separate aggregate branch and no join.

Observed Physical Characteristics


Why Approach 2 Is Generally More Efficient

Performance in Spark is dominated by:

  1. Shuffles
  2. Distributed joins
  3. Sort operations

Approach 2:

In distributed systems, avoiding joins is often more important than reducing SQL verbosity.


Conceptual Lesson

This comparison highlights a deeper principle:

Window functions enrich rows.
GROUP BY reduces rows.
JOIN reconciles datasets.

If your problem is analytical and row-level context must be preserved, window functions are often the most direct and efficient tool.


Important Note About RANK

RANK() can produce multiple rows for rank = 2 if salaries are tied.

If exactly one employee per department is required, use:

ROW_NUMBER() OVER (
    PARTITION BY DepartmentID
    ORDER BY Salary DESC, EmployeeID ASC
)

This ensures deterministic selection.


How to Inspect This Yourself

Use:

df.explain("formatted")

Or open Spark UI:

http://localhost:4040

Navigate to:

Jobs → Stages → DAG Visualization

Focus on:


Key Takeaway

The shorter SQL query is not necessarily faster.
The plan with fewer distributed operations usually is.

In this case, the window-only approach wins because it avoids structural duplication of work.


Share this post:

Previous Post
Understanding ip-api Batch Limits and Effective Throughput
Next Post
Can You Know the Location of an IPv6 Address?