When solving analytical queries in Spark SQL, there are often multiple correct formulations. However, they do not produce equivalent execution plans.
This article compares two approaches to the same problem:
“Find the second highest salary per department, but only in departments with at least two employees.”
We analyze which approach is more efficient and why, from a physical plan standpoint.
Problem Statement
Given table:
Empleados(EmployeeID, Name, DepartmentID, Salary, JoinDate)
Goal:
- Second highest salary per department
- Only departments with ≥ 2 employees
Approach 1: CTE + GROUP BY + JOIN
WITH RankedSalaries AS (
SELECT
DepartmentID,
Name,
Salary,
RANK() OVER (
PARTITION BY DepartmentID
ORDER BY Salary DESC
) AS ranking
FROM Empleados
),
DepartmentsWithAtLeastTwo AS (
SELECT DepartmentID
FROM Empleados
GROUP BY DepartmentID
HAVING COUNT(*) >= 2
)
SELECT r.DepartmentID, r.Name, r.Salary
FROM RankedSalaries r
JOIN DepartmentsWithAtLeastTwo d
ON r.DepartmentID = d.DepartmentID
WHERE r.ranking = 2;
What Happens Physically?
Spark builds two logical branches:
- Branch A → Window function (RANK)
- Branch B → HashAggregate (COUNT per department)
Then it performs a join.
Observed Physical Characteristics
- Two scans of the base dataset
- Two shuffle stages (one per branch)
- A join (initially SortMergeJoin, optimized by AQE to BroadcastHashJoin)
- More stages overall
Even if AQE optimizes the join strategy, the structural cost remains higher.
Approach 2: Window-Only Strategy
WITH ResumenEmpleados AS (
SELECT
DepartmentID,
Salary,
Name,
COUNT(EmployeeID) OVER (
PARTITION BY DepartmentID
) AS cant_empleado,
RANK() OVER (
PARTITION BY DepartmentID
ORDER BY Salary DESC
) AS salario_ranking
FROM Empleados
)
SELECT
DepartmentID,
Salary,
Name
FROM ResumenEmpleados
WHERE cant_empleado > 1
AND salario_ranking = 2;
What Happens Physically?
Spark:
- Performs one scan
- Executes one shuffle by DepartmentID
- Applies window functions
- Filters results
There is no separate aggregate branch and no join.
Observed Physical Characteristics
- One scan
- One shuffle
- WindowGroupLimit optimization for Top-K per partition
- No join
Why Approach 2 Is Generally More Efficient
Performance in Spark is dominated by:
- Shuffles
- Distributed joins
- Sort operations
Approach 2:
- Eliminates one shuffle
- Eliminates the join entirely
- Keeps computation in a single pipeline
In distributed systems, avoiding joins is often more important than reducing SQL verbosity.
Conceptual Lesson
This comparison highlights a deeper principle:
Window functions enrich rows.
GROUP BY reduces rows.
JOIN reconciles datasets.
If your problem is analytical and row-level context must be preserved, window functions are often the most direct and efficient tool.
Important Note About RANK
RANK() can produce multiple rows for rank = 2 if salaries are tied.
If exactly one employee per department is required, use:
ROW_NUMBER() OVER (
PARTITION BY DepartmentID
ORDER BY Salary DESC, EmployeeID ASC
)
This ensures deterministic selection.
How to Inspect This Yourself
Use:
df.explain("formatted")
Or open Spark UI:
http://localhost:4040
Navigate to:
Jobs → Stages → DAG Visualization
Focus on:
- Number of Exchange operators
- Join strategies
- Number of stages
Key Takeaway
The shorter SQL query is not necessarily faster.
The plan with fewer distributed operations usually is.
In this case, the window-only approach wins because it avoids structural duplication of work.