Whether the driver is GDPR, CCPA, or simply an organizational requirement, it is often necessary to mask PII before outputting it for analysis. Sometimes you also need to unmask that data to retrieve the original value. Finally, you may need to delete the original value in response to a request to be forgotten.
Masking a Value
The easiest way to mask a value is to create a calculated field using a hashing function, such as MD5. Here is an example masking an email address with the MD5 function.
After writing that field to Athena, the results appear as follows.
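To illustrate what the masked column contains, here is a minimal Python sketch of the same hashing step, run outside Upsolver. The helper name mask_email and the sample address are hypothetical, and the sketch assumes the masked value is the lowercase hexadecimal MD5 digest of the UTF-8 encoded email.

```python
import hashlib


def mask_email(email: str) -> str:
    """Return the MD5 hex digest of an email address (hypothetical helper)."""
    return hashlib.md5(email.encode("utf-8")).hexdigest()


# Hypothetical sample value; the same input always yields the same mask,
# which is what makes the masked column usable as a join key later.
masked = mask_email("jane.doe@example.com")
print(masked)  # a 32-character hexadecimal string
```

Because the hash is deterministic, the masked value can still be used for grouping and joining during analysis without exposing the address itself.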
Creating a Table to Unmask Values
Unmasking values requires a second table that holds both the masked and unmasked values, which the masked data can be joined to.
Here I am creating an upserting output so that the unmasked data table holds no duplicate entries.
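The effect of the upsert can be sketched in Python with a dictionary keyed by the masked value. This is only an illustration of the idea, not how Upsolver implements it; mask_email, upsert_mapping, and the sample addresses are hypothetical names and data.

```python
import hashlib


def mask_email(email: str) -> str:
    """Return the MD5 hex digest of an email address (hypothetical helper)."""
    return hashlib.md5(email.encode("utf-8")).hexdigest()


def upsert_mapping(table: dict, email: str) -> None:
    """Insert or overwrite the row keyed by the masked value."""
    table[mask_email(email)] = email


# Stand-in for the unmask table: masked value -> original value.
unmask_email = {}
for email in ["a@example.com", "b@example.com", "a@example.com"]:
    upsert_mapping(unmask_email, email)

print(len(unmask_email))  # 2: the repeated address does not create a second row
```

Keying on the masked value is what keeps the table to one row per email, no matter how many orders that buyer places.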
Once the unmask_email table is created, we can join our original table to it to show the unmasked values.
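As a rough sketch of that join, the Python below looks up each masked value in the mapping. It assumes left-join behavior, so a row without a mapping entry is kept with an empty value; the row layout, the order_id field, and the helper names are hypothetical.

```python
import hashlib


def mask_email(email: str) -> str:
    """Return the MD5 hex digest of an email address (hypothetical helper)."""
    return hashlib.md5(email.encode("utf-8")).hexdigest()


def unmask(rows: list, mapping: dict) -> list:
    """Left-join rows to the mapping on hashed_email; unmatched rows get None."""
    return [{**row, "buyeremail": mapping.get(row["hashed_email"])} for row in rows]


# Hypothetical masked table and mapping table.
orders = [
    {"order_id": 1, "hashed_email": mask_email("a@example.com")},
    {"order_id": 2, "hashed_email": mask_email("c@example.com")},
]
unmask_email = {mask_email("a@example.com"): "a@example.com"}

joined = unmask(orders, unmask_email)
for row in joined:
    print(row["order_id"], row["buyeremail"])
```

Only users with access to the mapping table can recover the original address; everyone else sees just the hash.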
Removing the Masked to Unmasked Mapping
There are situations in which PII must be removed. In these cases, you can remove the mapping of masked to unmasked data so that the unmasked value can no longer be retrieved. Often the list of items to be deleted will be a separate data source, so we will join our unmask_email table to that source to remove any users that have requested to be deleted.
This involves a couple more advanced features, including a UNION ALL so that an entry in either the orders data source or the deleted_emails data source triggers an update, and a lookup to determine whether the email address has been deleted.
SET hashed_email = MD5(data.buyerEmail);
SELECT hashed_email AS hashed_email,
       COALESCE(e2d.deleted_email, data.buyerEmail) AS buyeremail
LEFT JOIN LATEST (
    SELECT MD5(data.buyeremail) AS hashed_email,
           LAST('DELETED') AS deleted_email
    GROUP BY MD5(data.buyeremail)
) e2d ON e2d.hashed_email = MD5(data.buyerEmail)
REPLACE ON DUPLICATE hashed_email
After running this output, we can join our original table to it and see that the deleted email addresses are no longer visible.
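The deletion logic can be approximated in Python as follows. This is a sketch under stated assumptions rather than a description of Upsolver internals: it assumes events from both sources are processed in arrival order, that a deletion request permanently marks the hash, and that the stored value is the literal 'DELETED' once marked. The function apply_events and the sample events are hypothetical.

```python
import hashlib


def mask_email(email: str) -> str:
    """Return the MD5 hex digest of an email address (hypothetical helper)."""
    return hashlib.md5(email.encode("utf-8")).hexdigest()


def apply_events(events) -> dict:
    """Build the unmask mapping from (source, email) events in arrival order."""
    deleted = set()  # hashes seen in deleted_emails (the lookup)
    mapping = {}     # masked value -> email, or 'DELETED'
    for source, email in events:
        hashed = mask_email(email)
        if source == "deleted_emails":
            deleted.add(hashed)
        # Mirrors COALESCE(deleted marker, email), then replace on duplicate key.
        mapping[hashed] = "DELETED" if hashed in deleted else email
    return mapping


# Events from both sources combined, as the UNION ALL would present them.
events = [
    ("orders", "a@example.com"),
    ("orders", "b@example.com"),
    ("deleted_emails", "a@example.com"),
    ("orders", "a@example.com"),  # a later order must not restore the address
]
mapping = apply_events(events)
print(mapping[mask_email("a@example.com")])  # DELETED
print(mapping[mask_email("b@example.com")])  # b@example.com
```

Processing both sources in one stream is what lets a deletion request overwrite an existing row immediately, instead of waiting for the next order from that buyer.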
- This is not a complete GDPR or CCPA solution; it is simply one piece of a larger strategy.
- This will not remove data stored in the Upsolver Event Source. Retention policies should be used if the source data must be removed from your data lake.