Hiding Personal Information in AWS Glue with Spark

Protecting personal data before analytics consumption is a core requirement in modern data platforms. In AWS-based lake architectures, this is typically achieved through data de-identification during ingestion or transformation. This post outlines a practical and production-ready approach to hiding personal information using Spark jobs in AWS Glue.


What “Hide Personal Information” Means in Data Engineering

Hiding personal information usually refers to de-identification, an umbrella term for techniques such as redaction and hashing, both of which are covered in this post.

In regulated environments (GDPR, HIPAA), this step must occur before data is exposed to analytics, BI tools, or data science workloads.


Why Do This in AWS Glue?

AWS Glue provides a managed Spark environment that is well suited for:

By de-identifying data inside Glue jobs, you ensure that only compliant datasets reach curated layers (silver/gold).


Reference Architecture

Flow:

  1. Raw data lands in Amazon S3 (raw/).
  2. A Glue Spark job reads the data.
  3. Personal data is detected and hidden.
  4. Clean data is written to clean/ or curated/.
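
The four steps above can be sketched as a small PySpark job. This is a minimal sketch rather than a definitive implementation: the bucket name, the Parquet format, and the names `clean_path_for`, `run_job`, and `deidentify` are all assumptions made for illustration, and it uses a plain `SparkSession` instead of Glue-specific helpers.

```python
def clean_path_for(raw_path: str) -> str:
    """Map a path under /raw/ to the matching path under /clean/ (hypothetical layout)."""
    head, sep, tail = raw_path.partition("/raw/")
    if not sep:
        raise ValueError(f"expected a path containing '/raw/': {raw_path}")
    return f"{head}/clean/{tail}"


def run_job(raw_path: str, deidentify) -> None:
    """Read raw data, hide personal data, and write the result to the clean layer."""
    # Imported lazily so the path helper can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("deidentify").getOrCreate()

    # Step 2: read the raw data (Parquet is an assumption here).
    df = spark.read.parquet(raw_path)

    # Step 3: apply whichever de-identification function the pipeline uses.
    df_clean = deidentify(df)

    # Step 4: write the cleaned data under clean/.
    df_clean.write.mode("overwrite").parquet(clean_path_for(raw_path))
```

Passing `deidentify` in as a function keeps the read and write logic the same regardless of which of the patterns below is used.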

Two common implementation patterns are described below.


Pattern 1: PII Detection and Redaction with Amazon Comprehend

Amazon Comprehend provides managed PII detection and redaction.

How it works

When to use

Official documentation:
https://docs.aws.amazon.com/comprehend/latest/dg/redact-api-pii.html
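
As an illustration of the detection-then-redaction idea, the sketch below calls the synchronous `detect_pii_entities` operation through boto3 and replaces each detected span with a placeholder. The helper names `redact` and `detect_and_redact`, the `[TYPE]` placeholder format, and the `min_score` threshold of 0.8 are assumptions for this example, not part of the original post.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [TYPE] placeholder."""
    # Process spans from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[: ent["BeginOffset"]] + f"[{ent['Type']}]" + text[ent["EndOffset"]:]
    return text


def detect_and_redact(text: str, language_code: str = "en", min_score: float = 0.8) -> str:
    """Detect PII with Amazon Comprehend and redact it (requires AWS credentials)."""
    # Imported lazily so redact() can be used and tested without boto3.
    import boto3

    client = boto3.client("comprehend")
    response = client.detect_pii_entities(Text=text, LanguageCode=language_code)

    # Keep only detections above the (assumed) confidence threshold.
    entities = [e for e in response["Entities"] if e["Score"] >= min_score]
    return redact(text, entities)
```

In a Glue job, a function like this would typically be applied per record or per partition, which adds one API call per text and is subject to the service's request limits.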

Pattern 2: Column-Level Hashing in Spark

For structured datasets with known sensitive fields:

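from pyspark.sql.functions import sha2, concat_ws, lit, col

# Placeholder: load the real salt from a secret store instead of hard-coding it.
SALT = "<secure-salt>"

# df is the DataFrame read from the raw layer.
# Replace the raw email with a salted SHA-256 digest (64 hex characters).
# Note: concat_ws skips nulls, so a null email still yields a digest (of the salt alone).
df_clean = (
    df.withColumn(
        "email_hash",
        sha2(concat_ws(":", col("email"), lit(SALT)), 256)
    )
    .drop("email")
)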
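
To check what the Spark expression produces, or to compute the same value outside Spark, the hash can be reproduced in plain Python. This is a minimal sketch; the function name `email_hash` is hypothetical. It mirrors `sha2(concat_ws(":", email, salt), 256)` for non-null emails, and it deliberately differs for missing values by returning `None`, whereas `concat_ws` skips nulls and would hash the salt alone.

```python
import hashlib
from typing import Optional


def email_hash(email: Optional[str], salt: str) -> Optional[str]:
    """Salted SHA-256 of an email as a lowercase hex string."""
    if email is None:
        # Assumption: keep missing values missing instead of hashing the salt alone.
        return None
    # Same byte layout as the Spark expression: "<email>:<salt>" encoded as UTF-8.
    return hashlib.sha256(f"{email}:{salt}".encode("utf-8")).hexdigest()
```

Because the function is deterministic for a given salt, the same email always maps to the same value, so joins and distinct counts on `email_hash` still work without the raw address.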

When to use


Operational Considerations


Conclusion

Hiding personal information in AWS Glue is not a single feature, but a design choice embedded in your data pipeline. Combining Spark transformations with AWS-native services allows you to meet privacy requirements without sacrificing scalability or performance.

De-identification should be treated as infrastructure, not as an afterthought.

