In 2017, Equifax suffered one of the worst data breaches in history, exposing the personal data of over 147 million people. The breach included names, Social Security numbers, dates of birth, and addresses. The aftermath? A $700 million penalty, irreparable reputational damage, and shattered customer trust. The root causes? A missed Apache patch, but more critically, a lack of layered, proactive data protection. Source: https://archive.epic.org/privacy/data-breach/equifax/
If even basic tokenization had been applied to their sensitive data at the ingestion point, the stolen data would have been useless to attackers. That incident was the inspiration behind this PII Protection Serverless Pipeline.
The goal: build a lightweight, cost-effective, automated pipeline that tokenizes sensitive fields in uploaded CSVs, stores the tokenized results, optionally masks data for safe sharing, and allows detokenization when required.
```
┌─────────────────┐    ┌───────────────────┐    ┌─────────────────┐
│ Raw CSV Upload  │───▶│   PII Detector    │───▶│    Metadata     │
│   (S3 /raw)     │    │      Lambda       │    │ (S3 /metadata)  │
└─────────────────┘    └───────────────────┘    └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐    ┌───────────────────┐    ┌─────────────────┐
│  Detokenized    │◀───│    Detokenizer    │◀───│    Tokenizer    │
│  (S3 /detok)    │    │      Lambda       │    │     Lambda      │
└─────────────────┘    └───────────────────┘    └─────────────────┘
                                                         │
                                                         ▼
                                                ┌─────────────────┐
                                                │    Tokenized    │
                                                │ (S3 /tokenized) │
                                                └─────────────────┘
```
The pipeline uses a well-organized S3 bucket structure for different stages of data processing:
```
pii-project-harsha/
├── raw/            # Original CSV files with PII data
│   └── customer_data.csv
├── metadata/       # JSON files containing PII field information
│   └── customer_data_pii_fields.json
├── tokenized/      # CSV files with Base64 tokenized PII data
│   └── customer_data_tokenized.csv
└── detokenized/    # CSV files with decoded and optionally masked data
    └── customer_data_detokenized.csv
```
This serverless pipeline consists of three core Lambda functions:

- PII Detector Lambda (`pii_detector_lambda.py`)
  - Trigger: S3 PUT events in the `/raw` folder
  - Purpose: Identifies PII fields in uploaded CSV files
  - Output: Creates metadata JSON files in the `/metadata` folder
  - Technology: Uses pattern matching and field name analysis to detect PII (sketched after this list)
- Tokenizer Lambda (`tokenizer_lambda.py`)
  - Trigger: S3 PUT events in the `/metadata` folder
  - Purpose: Tokenizes identified PII fields using Base64 encoding
  - Input: Raw CSV + metadata JSON
  - Output: Tokenized CSV in the `/tokenized` folder
  - Logic: Replaces PII values with `TOKEN_1`, `TOKEN_2`, etc.
- Detokenizer Lambda (`detokenizer_lambda.py`)
  - Trigger: S3 PUT events in the `/tokenized` folder (optional)
  - Purpose: Decodes tokenized data and optionally applies masking
  - Features:
    - Automatic CSV delimiter detection
    - Configurable masking (enabled/disabled via the `masking_enabled` flag)
    - Field-specific masking patterns for names, emails, and phone numbers
  - Output: Detokenized/masked CSV in the `/detokenized` folder
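To picture the detection step, here is a minimal sketch of header-hint and regex-based matching. The hint set and patterns are illustrative assumptions, not the exact rules in `pii_detector_lambda.py`:

```python
import csv
import io
import re
from typing import List

# Illustrative rules only; the real pii_detector_lambda.py may use different hints/patterns
PII_NAME_HINTS = {"name", "email", "phone", "ssn"}
PII_VALUE_PATTERNS = [
    re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),   # email-like values
    re.compile(r"^\+?\d{10,15}$"),               # phone-like values
]

def detect_pii_fields(csv_text: str) -> List[str]:
    """Return column names that look like PII, based on headers and a sample of values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    sample_rows = [row for _, row in zip(range(20), reader)]   # inspect up to 20 rows
    pii_fields = []
    for column in reader.fieldnames or []:
        header_hit = any(hint in column.lower() for hint in PII_NAME_HINTS)
        value_hit = any(
            pattern.match((row.get(column) or "").strip())
            for row in sample_rows
            for pattern in PII_VALUE_PATTERNS
        )
        if header_hit or value_hit:
            pii_fields.append(column)
    return pii_fields

# For the sample CSV shown later in this README, this returns ["Name", "Email", "Phone"]
```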
The process begins when a raw CSV containing potential PII fields is uploaded to the S3 bucket's raw/ folder.
- PII detection
  - The `pii_detector_lambda` is automatically triggered by the S3 upload event
  - It analyzes the CSV headers and content to identify PII fields
  - Creates a metadata JSON file listing the detected PII columns
  - Saves this metadata to the `metadata/` folder
- Tokenization
  - The metadata upload triggers the `tokenizer_lambda` (a handler sketch follows this list)
  - Reads the original CSV and the PII fields metadata
  - Tokenizes sensitive fields using a simple token counter (`TOKEN_1`, `TOKEN_2`, etc.)
  - Stores the tokenized CSV in the `tokenized/` folder
- Detokenization (optional)
  - If detokenization is needed, the `detokenizer_lambda` is triggered
  - Decodes the Base64 tokens back to original values
  - Optionally applies field-specific masking for safe viewing
  - Saves the result in the `detokenized/` folder
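To make the event flow concrete, here is a minimal sketch of what the tokenizer step could look like as a Lambda handler. The bucket layout and token format follow the description above; everything else (error handling, how the token-to-value mapping is persisted) is an assumption, not the project's actual code:

```python
import base64
import csv
import io
import json

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered by a metadata JSON landing in metadata/; writes a tokenized CSV to tokenized/."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    metadata_key = record["object"]["key"]        # e.g. metadata/customer_data_pii_fields.json
    base_name = metadata_key.split("/")[-1].replace("_pii_fields.json", "")

    pii_fields = json.loads(s3.get_object(Bucket=bucket, Key=metadata_key)["Body"].read())
    raw_csv = s3.get_object(Bucket=bucket, Key=f"raw/{base_name}.csv")["Body"].read().decode("utf-8")

    reader = csv.DictReader(io.StringIO(raw_csv))
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=reader.fieldnames)
    writer.writeheader()

    token_counter = 0
    token_map = {}                                # TOKEN_n -> Base64 of the original value
    for row in reader:
        for field in pii_fields:
            token_counter += 1
            token = f"TOKEN_{token_counter}"
            token_map[token] = base64.b64encode(row[field].encode()).decode()
            row[field] = token
        writer.writerow(row)

    s3.put_object(Bucket=bucket, Key=f"tokenized/{base_name}_tokenized.csv", Body=output.getvalue())
    # The token_map would need to be stored somewhere the detokenizer can read it back.
    return {"statusCode": 200, "tokens_issued": token_counter}
```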
Only specific IAM roles or users are granted access to the detokenized/ folder, ensuring strict separation of duties and compliance with least-privilege principles.
The pipeline uses Base64 encoding for tokenization, a reversible scheme that is ideal for an MVP:

```python
import base64

# Encoding (tokenization)
original_value = "John Doe"
token = base64.b64encode(original_value.encode()).decode()
# Output: "Sm9obiBEb2U="

# Decoding (detokenization)
decoded_value = base64.b64decode(token).decode()
# Output: "John Doe"
```

- ✅ Reversible for trusted internal workflows
- ✅ Lightweight and fast
- ✅ Easy to implement with standard libraries
- ✅ No external dependencies
- ❌ Not cryptographically secure
- ❌ Easily decodable if token format is known
- ❌ Not suitable for production without additional encryption layers
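Since the repository already bundles the `cryptography` library as a Lambda layer, one natural hardening step (a sketch only, not what the current pipeline does) is to replace raw Base64 with an authenticated scheme such as Fernet:

```python
from cryptography.fernet import Fernet

# In production the key should come from AWS KMS or Secrets Manager, never from source code
key = Fernet.generate_key()
fernet = Fernet(key)

token = fernet.encrypt(b"John Doe")            # opaque, authenticated, non-guessable token
original = fernet.decrypt(token).decode()      # "John Doe"
```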
The detokenizer includes intelligent masking capabilities:

- `"John Doe"` → `"J*** D**"`
- `"john@example.com"` → `"j**n@example.com"`
- `"9876543210"` → `"******3210"`

Sample raw input (`raw/customer_data.csv`):

```
Name,Email,Phone,DOB,TransactionID
John Doe,john@example.com,9876543210,1990-01-01,TXN1001
Jane Smith,jane@gmail.com,9123456789,1991-03-22,TXN1002
```

Detected PII fields (`metadata/customer_data_pii_fields.json`):

```
["Name", "Email", "Phone"]
```

Tokenized output (`tokenized/customer_data_tokenized.csv`):

```
Name,Email,Phone,DOB,TransactionID
TOKEN_1,TOKEN_2,TOKEN_3,1990-01-01,TXN1001
TOKEN_4,TOKEN_5,TOKEN_6,1991-03-22,TXN1002
```

Detokenized and masked output (`detokenized/customer_data_detokenized.csv`):

```
Name,Email,Phone,DOB,TransactionID
J*** D**,j**n@example.com,******3210,1990-01-01,TXN1001
J*** S****,j**e@gmail.com,******6789,1991-03-22,TXN1002
```
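For illustration, here is a minimal sketch of masking helpers that reproduce the patterns shown above. The function names are hypothetical; the actual rules live in `detokenizer_lambda.py`:

```python
def mask_name(value: str) -> str:
    # "John Doe" -> "J*** D**": keep the first letter of each word, mask the rest
    return " ".join(word[0] + "*" * (len(word) - 1) for word in value.split())

def mask_email(value: str) -> str:
    # "john@example.com" -> "j**n@example.com": keep first/last character of the local part
    local, _, domain = value.partition("@")
    masked = local[0] + "*" * (len(local) - 2) + local[-1] if len(local) > 2 else local[0] + "*"
    return f"{masked}@{domain}"

def mask_phone(value: str) -> str:
    # "9876543210" -> "******3210": expose only the last four digits
    return "*" * (len(value) - 4) + value[-4:]
```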
```
pii-protection-serverless/
├── lambdas/
│   ├── pii_detector_lambda.py    # Identifies PII fields in CSV files
│   ├── tokenizer_lambda.py       # Tokenizes PII data using Base64
│   └── detokenizer_lambda.py     # Decodes tokens and applies masking
├── python/                       # Dependencies layer
│   ├── cryptography/             # Cryptography library
│   ├── cffi/                     # CFFI dependency
│   └── pycparser/                # Parser dependency
├── sample-data/
│   └── customer_data.csv         # Sample CSV for testing
├── cryptography-layer.zip        # Lambda layer for dependencies
└── README.md                     # This documentation
```
- AWS Account with appropriate permissions
- AWS CLI configured
- Python 3.8 or higher
- boto3 library
- Create S3 Bucket

  ```bash
  aws s3 mb s3://your-pii-project-bucket
  ```

- Create Folder Structure

  ```bash
  aws s3api put-object --bucket your-pii-project-bucket --key raw/
  aws s3api put-object --bucket your-pii-project-bucket --key metadata/
  aws s3api put-object --bucket your-pii-project-bucket --key tokenized/
  aws s3api put-object --bucket your-pii-project-bucket --key detokenized/
  ```

- Deploy Lambda Layer
  - Upload `cryptography-layer.zip` as a Lambda layer
  - Note the layer ARN for Lambda function configuration

- Deploy Lambda Functions

  For each Lambda function:

  ```bash
  # Package the function
  zip -r pii_detector_lambda.zip pii_detector_lambda.py

  # Create the function
  aws lambda create-function \
    --function-name pii-detector \
    --runtime python3.9 \
    --handler pii_detector_lambda.lambda_handler \
    --zip-file fileb://pii_detector_lambda.zip \
    --role arn:aws:iam::YOUR_ACCOUNT:role/lambda-s3-role \
    --layers arn:aws:lambda:region:account:layer:crypto-layer:1
  ```

- Configure S3 Event Triggers

  Create S3 event notifications for each folder:

  ```json
  {
    "Rules": [
      {
        "Name": "trigger-pii-detector",
        "Filter": {
          "Key": {
            "FilterRules": [{ "Name": "prefix", "Value": "raw/" }]
          }
        },
        "Status": "Enabled",
        "Targets": [{ "Arn": "arn:aws:lambda:region:account:function:pii-detector" }]
      }
    ]
  }
  ```

- Test the Pipeline

  ```bash
  # Upload a test CSV file
  aws s3 cp sample-data/customer_data.csv s3://your-bucket/raw/

  # Monitor CloudWatch logs
  aws logs describe-log-groups --log-group-name-prefix /aws/lambda/
  ```
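For the event-trigger step, the notification JSON above is illustrative; if you prefer to wire the trigger up programmatically, a boto3 sketch might look like this (bucket name and ARNs are placeholders, and S3 must also be granted permission to invoke the function):

```python
import boto3

BUCKET = "your-pii-project-bucket"
DETECTOR_ARN = "arn:aws:lambda:region:account:function:pii-detector"

lambda_client = boto3.client("lambda")
s3 = boto3.client("s3")

# Allow the S3 bucket to invoke the detector function
lambda_client.add_permission(
    FunctionName="pii-detector",
    StatementId="allow-s3-invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{BUCKET}",
)

# Fire the detector whenever an object is created under raw/
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": DETECTOR_ARN,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "raw/"}]}},
            }
        ]
    },
)
```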
Set these environment variables for your Lambda functions:

```bash
# For all functions
S3_BUCKET=your-pii-project-bucket
LOG_LEVEL=INFO

# For detokenizer function
MASKING_ENABLED=true   # Set to false to disable masking
```
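Inside the detokenizer, the flag might be consumed roughly like this (the variable name matches the configuration above; the parsing itself is an assumption):

```python
import os

# Anything other than "true" (case-insensitive) disables masking
MASKING_ENABLED = os.environ.get("MASKING_ENABLED", "true").strip().lower() == "true"
```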
Your Lambda execution role needs these permissions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::your-pii-project-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```

Useful CloudWatch metrics to watch:

- Lambda execution duration and errors
- S3 object count per folder
- Custom metrics for tokenization operations
- All functions use structured logging with emoji indicators:
  - ✅ Success operations
  - ❌ Error conditions
  - 📊 Processing statistics

Example log output:

```
✅ Tokenized file uploaded to: tokenized/customer_data_tokenized.csv
❌ Failed to decode field Name: Invalid base64 encoding
📊 Processed 1000 records with 15 PII fields
```
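A sketch of how log lines like these could be emitted from the Lambdas (standard Python logging; the helper names and variables are placeholders):

```python
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log_success(output_key: str) -> None:
    logger.info("✅ Tokenized file uploaded to: %s", output_key)

def log_stats(records: int, pii_fields: int) -> None:
    logger.info("📊 Processed %d records with %d PII fields", records, pii_fields)

def log_decode_failure(field: str, error: Exception) -> None:
    logger.error("❌ Failed to decode field %s: %s", field, error)
```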
- Encryption in Transit: All S3 operations use HTTPS
- Access Control: IAM policies restrict folder access
- Audit Trail: CloudTrail logs all S3 and Lambda activities
- Data Retention: Implement S3 lifecycle policies for automatic cleanup
- Use separate IAM roles for each Lambda function
- Enable S3 bucket versioning for data recovery
- Implement S3 bucket policies to prevent public access
- Regular security audits of IAM permissions
- The pipeline supports GDPR "right to be forgotten" through detokenization
- Audit trails provide compliance reporting capabilities
- Data lineage tracking through S3 object metadata
- Pay-per-use: Only charged when processing files
- Auto-scaling: Handles variable workloads automatically
- No idle costs: No charges when not processing data
- Lambda executions: $0.20 per 1M requests
- S3 storage: $0.023 per GB
- Data transfer: Minimal for internal processing
- Total estimated cost: <$10/month for typical workloads
- Small files (<1MB): ~2-3 seconds end-to-end
- Medium files (1-10MB): ~5-15 seconds
- Large files (>10MB): Consider chunking for optimal performance
- Concurrent Lambda executions: Up to 1000 (default limit)
- S3 throughput: Virtually unlimited
- Bottlenecks: Lambda cold starts (~1-2 seconds)
- Lambda timeout errors
  - Increase the timeout setting (default: 3 seconds)
  - Consider breaking large files into chunks
- Permission denied errors
  - Verify the IAM role has S3 access
  - Check bucket policies
- Base64 decode errors (see the decode helper sketch after this list)
  - Ensure tokens weren't corrupted during processing
  - Verify character encoding consistency
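For the Base64 decode case, a defensive helper (a sketch; the fallback strategy is an assumption) keeps one corrupted token from failing the whole file:

```python
import base64
import binascii

def safe_b64_decode(token: str, fallback: str = "DECODE_ERROR") -> str:
    """Decode a Base64 token, returning a sentinel instead of raising on corrupt input."""
    try:
        return base64.b64decode(token, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return fallback
```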
```bash
# Check S3 object metadata
aws s3api head-object --bucket your-bucket --key path/to/file.csv

# View Lambda logs
aws logs filter-log-events --log-group-name /aws/lambda/your-function

# Test Lambda function
aws lambda invoke --function-name your-function --payload '{}' response.json
```

Planned enhancements:

- Support for additional file formats (JSON, XML, Parquet)
- Enhanced PII detection using regex patterns
- Custom masking rules per field type
- Batch processing for multiple files
- Integration with AWS KMS for stronger encryption
- Machine learning-based PII detection using Amazon Comprehend
- Real-time streaming data processing with Kinesis
- REST API for programmatic access
- Integration with data catalogs (AWS Glue)
- Compliance reporting dashboard
- Multi-region deployment support
- Advanced analytics on PII patterns
Key takeaways from building this pipeline:

- Serverless architectures can be both powerful and lightweight for data processing
- Event-driven design enables seamless automation without complex orchestration
- Tokenization provides effective privacy protection even with simple techniques
- Separation of concerns through folder-based organization improves security
- Base64 encoding is sufficient for MVPs but requires enhancement for production
To contribute:

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow PEP 8 for Python code style
- Add unit tests for new features
- Update documentation for any changes
- Test with sample data before submitting
This project is licensed under the MIT License - see the LICENSE file for details.
- Inspired by the need for better data protection following major security breaches like Equifax
- Built as part of exploring Generative AI and Serverless Pipelines in real-world scenarios
- Thanks to the AWS community for serverless best practices
Author: Harsha Mathan
This project demonstrates the power of serverless architectures in building cost-effective, scalable data protection solutions. If you found this helpful, please ⭐ the repository and share your feedback!


