Hiding Personal Information in AWS Glue with Spark

Protecting personal data before analytics consumption is a core requirement in modern data platforms. In AWS-based lake architectures, this is typically achieved through data de-identification during ingestion or transformation. This post outlines a practical and production-ready approach to hiding personal information using Spark jobs in AWS Glue.

What “Hide Personal Information” Means in Data Engineering

Hiding personal information usually refers to de-identification, an umbrella term that includes:

Masking: replacing characters with placeholders.
Redaction: removing sensitive substrings entirely.
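Hashing: a one-way transformation; the same input always produces the same output, so hashed columns can still be joined.
Tokenization: replacing a value with a surrogate token; the original can be recovered only through a secured mapping.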
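
To make the four terms concrete, here is a minimal pure-Python sketch of each technique. All names in it (mask, redact, hash_value, TokenVault) are hypothetical helpers invented for illustration, the email pattern is deliberately simple, and the in-memory vault only stands in for a real secured token store.

```python
import hashlib
import re
import secrets

# Deliberately simple e-mail pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask(value, keep_last=4, placeholder="*"):
    """Masking: replace all but the last few characters with a placeholder."""
    hidden = max(len(value) - keep_last, 0)
    return placeholder * hidden + value[hidden:]


def redact(text):
    """Redaction: remove e-mail-like substrings entirely."""
    return EMAIL_RE.sub("", text)


def hash_value(value, salt):
    """Hashing: one-way and deterministic, so equal inputs still join."""
    return hashlib.sha256(f"{value}:{salt}".encode("utf-8")).hexdigest()


class TokenVault:
    """Tokenization: reversible, but only through the stored mapping.

    An in-memory dict is an assumption for this sketch; a real vault
    would be a secured, access-controlled store.
    """

    def __init__(self):
        self._forward = {}
        self._reverse = {}

    def tokenize(self, value):
        if value not in self._forward:
            token = "tok_" + secrets.token_hex(8)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token):
        return self._reverse[token]
```

Only the vault can take a token back to its original value; the other three helpers discard information.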

In regulated environments (GDPR, HIPAA), this step must occur before data is exposed to analytics, BI tools, or data science workloads.

Why Do This in AWS Glue?

AWS Glue provides a managed Spark environment that is well suited for:

Large-scale batch processing.
Schema-aware transformations.
Integration with AWS-native security services.

By de-identifying data inside Glue jobs, you ensure that only compliant datasets reach curated layers (silver/gold).

Reference Architecture

Flow:

Raw data lands in Amazon S3 (raw/).
A Glue Spark job reads the data.
Personal data is detected and hidden.
Clean data is written to clean/ or curated/.
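
The flow above can be sketched as a small job skeleton. This is an illustration under assumptions, not a definitive Glue job: the Parquet format, the overwrite mode, and the helper names clean_path_for and run_job are all hypothetical, and transform stands for whichever de-identification pattern you choose.

```python
def clean_path_for(raw_path, raw_prefix="raw/", clean_prefix="clean/"):
    """Map a location under raw/ to the matching location under clean/."""
    marker = f"/{raw_prefix}"
    if marker not in raw_path:
        raise ValueError(f"expected a path under {raw_prefix!r}: {raw_path}")
    return raw_path.replace(marker, f"/{clean_prefix}", 1)


def run_job(spark, raw_path, transform):
    """Read raw data, hide personal data, write the cleaned result.

    `spark` is a SparkSession; `transform` takes a DataFrame and
    returns a de-identified DataFrame.
    """
    df = spark.read.parquet(raw_path)       # assumption: Parquet input
    cleaned = transform(df)                 # detect and hide personal data
    out_path = clean_path_for(raw_path)
    cleaned.write.mode("overwrite").parquet(out_path)
    return out_path
```

Keeping the transform as a parameter means the same skeleton serves both patterns described in this post.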

Two common implementation patterns are described below.

Pattern 1: PII Redaction with Amazon Comprehend (Recommended)

Amazon Comprehend provides managed PII detection and redaction.

How it works

Use Comprehend’s PII APIs to detect entities such as emails, phone numbers, or names.
Mask or replace each detected span, using the character offsets returned with every entity.
Persist the sanitized output.
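
As a rough sketch of these steps, the snippet below calls Comprehend's DetectPiiEntities operation through boto3 and replaces each detected span with its entity type. The helper names, the 0.8 score threshold, and the bracketed placeholder format are assumptions made for illustration, and the sketch assumes the returned entities do not overlap.

```python
def redact_with_entities(text, entities, min_score=0.8):
    """Replace each detected span with a placeholder such as [EMAIL].

    `entities` follows the shape of the DetectPiiEntities response:
    dicts with Type, Score, BeginOffset and EndOffset.
    """
    # Work from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if ent["Score"] >= min_score:  # threshold is an assumed default
            text = (
                text[: ent["BeginOffset"]]
                + f"[{ent['Type']}]"
                + text[ent["EndOffset"]:]
            )
    return text


def redact_text(text, client=None, language_code="en"):
    """Detect PII with Amazon Comprehend, then redact it locally."""
    if client is None:
        import boto3  # imported lazily so the pure helper has no AWS dependency

        client = boto3.client("comprehend")
    response = client.detect_pii_entities(Text=text, LanguageCode=language_code)
    return redact_with_entities(text, response["Entities"])
```

Separating the API call from the string manipulation lets you unit-test the redaction logic with canned responses, and it makes it easier to batch or throttle the calls when running across many Spark partitions.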

When to use

Text-heavy columns (comments, logs, descriptions).
When you want managed accuracy without maintaining NLP models.

Official documentation:
https://docs.aws.amazon.com/comprehend/latest/dg/redact-api-pii.html

Pattern 2: Column-Level Hashing in Spark

For structured datasets with known sensitive fields:

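from pyspark.sql.functions import col, concat_ws, lit, sha2, when

# Placeholder: load the real salt from a secret store at job start; never commit it.
SALT = "<secure-salt>"

# SHA-256 over "email:salt". concat_ws skips nulls, so guard null emails
# explicitly; otherwise every null email would share the hash of the salt alone.
email_hash = when(
    col("email").isNotNull(),
    sha2(concat_ws(":", col("email"), lit(SALT)), 256),
)

# Add the hash, then drop the raw column so it never reaches the output.
df_clean = df.withColumn("email_hash", email_hash).drop("email")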
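
If another system needs to produce the same hash outside Spark, for example to check a join key, a plain-Python equivalent can help. The sketch below assumes a non-null email and the same "email:salt" layout as the Spark expression; hash_email is a hypothetical helper name.

```python
import hashlib


def hash_email(email, salt):
    """Hex SHA-256 of "email:salt", matching sha2(concat_ws(":", ...), 256)
    for a non-null email."""
    return hashlib.sha256(f"{email}:{salt}".encode("utf-8")).hexdigest()


# The same input always maps to the same 64-character hex string,
# which is what keeps hashed columns joinable across datasets.
left = hash_email("jane@example.com", "example-salt")
right = hash_email("jane@example.com", "example-salt")
assert left == right
```

Note that the hash is sensitive to case and whitespace, so both sides of a join must format the email identically before hashing.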

When to use

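Deterministic joins on the hidden value are required.
Original values never need to be recovered from the output. Note that a leaked salt makes guessable values such as emails open to dictionary attacks, so the salt must be protected.
Processing must stay entirely inside Spark, with no calls to external services.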

Operational Considerations

Store salts and keys in AWS Secrets Manager, or encrypt them with AWS KMS; never hard-code them in job scripts.
Avoid synchronous external API calls at high Spark parallelism unless they are rate-limited.
De-identify as early in the pipeline as possible rather than patching data after it has reached analytics.
Log transformations for auditability, but never log raw PII.
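
As one way to apply the first and last points, the sketch below loads the salt from AWS Secrets Manager at job start and builds an audit record that contains no raw values. The secret layout (a JSON object with a "salt" key), the helper names, and the audit fields are assumptions for illustration.

```python
import json


def load_salt(secret_id, client=None, key="salt"):
    """Fetch the hashing salt from AWS Secrets Manager.

    Assumes the secret is stored as a JSON string such as {"salt": "..."}.
    """
    if client is None:
        import boto3  # imported lazily so the helper can be tested offline

        client = boto3.client("secretsmanager")
    secret_string = client.get_secret_value(SecretId=secret_id)["SecretString"]
    return json.loads(secret_string)[key]


def audit_record(job_name, columns_hidden, row_count):
    """Describe what was transformed without including any raw values."""
    return {
        "job": job_name,
        "columns_hidden": sorted(columns_hidden),
        "row_count": row_count,
    }
```

The audit record carries only column names and counts, so it can be written to ordinary job logs without exposing PII.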

Conclusion

Hiding personal information in AWS Glue is not a single feature, but a design choice embedded in your data pipeline. Combining Spark transformations with AWS-native services allows you to meet privacy requirements without sacrificing scalability or performance.

De-identification should be treated as infrastructure, not as an afterthought.
