Dativa Pipeline API: Anonymizing data

Previous: Validating basic data types | Next: Referential Integrity

Three different forms of data anonymization are supported:

  • Tokenization - where the data is tokenized using sequential integers and the mapping between the integers and the original data is returned in a separate file
  • Hashing - where the data is replaced with a hash using the SHA512 algorithm with a customizable salt
  • Encryption - where the data is encrypted with a public certificate

Tokenization provides the most reliable form of anonymization as it cannot be reversed without the returned file. The file can be saved for future use or discarded. Where the file is stored on the file system, its name can be rotated based on a date field, providing rotating tokenization of the data.

Hashing provides a second irreversible mechanism for anonymization, but collisions may occur, so the hashes cannot be guaranteed to be unique. The salt applied to the hash can also be keyed from a date, making the hashes persistent only for a limited period of time.

Encryption provides fully reversible anonymization for anyone holding the private certificate. Encryption uses the PKCS#1 algorithm with Optimal Asymmetric Encryption Padding (OAEP). The padding adds randomness to the encrypted output: you can use a truly random number for continuously changing encrypted values, a fixed value for static encrypted values, or a fixed value keyed from a date field, giving pseudo-rotating tokens with different encrypted values for each time period.

Tokenization

String fields can be tokenized by setting the tokenize parameter to True. Tokens are constructed from an integer series and stored in a separate file, which is added to the df_dict dictionary when it is returned. An existing dictionary can be passed in, in which case new values are added to it.

The following rule would tokenize an email address in a static token store:

{
    "rule_type": "String",
    "field": "email_address",
    "params": {
        "minimum_length": 5,
        "maximum_length": 1024,
        "regex": "[^@]+@[^\\.]..*[^\\.]",
        "fallback_mode": "remove_record",
        "tokenize": True,
        "token_store": "anonymization/email_list.csv"
    }
}
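
For illustration, the resulting token store is simply a mapping between the sequential integer tokens and the original values. Assuming a two-column CSV layout (the exact format may differ), it might look like this:

token,value
1,alice@example.com
2,bob@example.com
3,carol@example.com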

This rule would tokenize an email address in a rotating store that changes daily:

{
    "rule_type": "String",
    "field": "from",
    "params": {
        "minimum_length": 5,
        "maximum_length": 1024,
        "regex": "[^@]+@[^\\.]..*[^\\.]",
        "fallback_mode": "remove_record",
        "tokenize": True,
        "token_store": "anonymization/email_list_%Y_%m_%d.csv",
        "token_date_field": "date",
        "token_date_format": "%Y-%m-%d %H:%M:%S"
    }
}

The following parameters are used for tokenization:

  • tokenize - true or false, specifies whether the field should be tokenized after the other validation rules are applied
  • token_store - specifies the name of the token store within the df_dict dictionary. This can contain strftime date format parameters to create a rotating token store
  • token_date_field - if specified, then a date is parsed from this field using the token_date_format and applied to the token store name using strftime formatting. This allows for rotating tokens
  • token_date_format - must be specified if token_date_field is specified. This date format is used to parse the date from the token_date_field
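
The pipeline performs all of this internally, but as a rough sketch of the scheme described above (the helper names and the pandas usage are ours, not the pipeline's), sequential-integer tokenization with a date-keyed store name can be reproduced like this:

from datetime import datetime
import pandas as pd

def tokenize_column(df, field, token_store):
    # token_store maps each original value to a sequential integer token
    for value in df[field]:
        if value not in token_store:
            token_store[value] = len(token_store) + 1
    df[field] = df[field].map(token_store)
    return df, token_store

def rotating_store_name(template, date_value, date_format):
    # apply the parsed date to the strftime parameters in the store name, e.g.
    # "anonymization/email_list_%Y_%m_%d.csv" -> "anonymization/email_list_2018_06_01.csv"
    return datetime.strptime(date_value, date_format).strftime(template)

df = pd.DataFrame({"from": ["a@example.com", "b@example.com", "a@example.com"],
                   "date": ["2018-06-01 09:30:00"] * 3})
name = rotating_store_name("anonymization/email_list_%Y_%m_%d.csv",
                           df["date"].iloc[0], "%Y-%m-%d %H:%M:%S")
df, store = tokenize_column(df, "from", {})
# df["from"] is now [1, 2, 1]; store would be written out under `name`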

Hashing

String fields can be hashed by setting the hash parameter to True. Hashes are constructed using the SHA512 algorithm with a salt of up to 16 bytes added. The salt can be keyed from a date field to allow rotating hashes.

The following rule would hash an email address using a static salt:

{
    "rule_type": "String",
    "field": "email_address",
    "params": {
        "minimum_length": 5,
        "maximum_length": 1024,
        "regex": "[^@]+@[^\\.]..*[^\\.]",
        "fallback_mode": "remove_record",
        "hash": True,
        "salt": "XY@4242SSS"
    }
}

This rule would hash an email address with a salt that changes daily:

{
    "rule_type": "String",
    "field": "from",
    "params": {
        "minimum_length": 5,
        "maximum_length": 1024,
        "regex": "[^@]+@[^\\.]..*[^\\.]",
        "fallback_mode": "remove_record",
        "hash": True,
        "salt": "XY@4242SSS%y%m%d",
        "salt_date_field": "date",
        "salt_date_format": "%Y-%m-%d %H:%M:%S"
    }
}

The full parameters for hashing are:

  • hash - specifies whether the field should be hashed
  • hash_length - the length of the hash to use, defaults to 16 characters. Can be up to 128 characters. The longer the hash, the fewer hash collisions will occur.
  • salt - the salt to be used in the hash, up to 16 bytes in length. This can contain strftime date format parameters to create a rotating salt
  • salt_date_field - if specified, then a date is parsed from this field using the salt_date_format and applied to the salt using strftime formatting. This allows for rotating hashes
  • salt_date_format - must be specified if salt_date_field is specified. This date format is used to parse the date from the salt_date_field
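
The hashes themselves are easy to reproduce outside the pipeline. Here is a minimal sketch of the scheme described above, assuming the salt is appended to the value before hashing and the hex digest is truncated to hash_length (both assumptions on our part):

import hashlib
from datetime import datetime

def rotating_salt(salt_template, date_value, date_format):
    # key the salt from a date field, e.g. "XY@4242SSS%y%m%d" -> "XY@4242SSS180601"
    return datetime.strptime(date_value, date_format).strftime(salt_template)

def hash_value(value, salt, hash_length=16):
    # SHA512 of the salted value, truncated to hash_length hex characters
    return hashlib.sha512((value + salt).encode("utf-8")).hexdigest()[:hash_length]

salt = rotating_salt("XY@4242SSS%y%m%d", "2018-06-01 09:30:00", "%Y-%m-%d %H:%M:%S")
print(hash_value("a@example.com", salt))  # stable within the day, changes daily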

Encryption

Encryption can be enabled by setting the encrypt parameter on a string rule. Encryption uses OAEP padding, which adds randomness to the output, so the data cannot be reversed without the private certificate and, by default, the same source value encrypts differently each time. If you want consistent encrypted values for the same source value, set random_string to a fixed value, or key that value from a date field to create rotating encryption.

  • encrypt - specifies whether the field should be encrypted
  • public_key - the public key to be used in encryption
  • random_string - defaults to "random", in which case a random string is generated according to the standard PKCS#1 OAEP algorithm. If this is set, then this string (padded to 20 bytes) is used as the random string in the PKCS#1 OAEP algorithm. If it is a fixed value, it results in consistent encrypted values for all time.
  • random_string_date_field - if specified, then a date is parsed from this field using the random_string_date_format and applied to the random string using strftime formatting. This allows for rotating encryption
  • random_string_date_format - must be specified if random_string_date_field is specified. This date format is used to parse the date from the random_string_date_field

The following rule would encrypt an email address with a different value each time it is encrypted:

{
    "rule_type": "String",
    "field": "email_address",
    "params": {
        "minimum_length": 5,
        "maximum_length": 1024,
        "regex": "[^@]+@[^\\.]..*[^\\.]",
        "fallback_mode": "remove_record",
        "encrypt": True,
        "public_key": """-----BEGIN PUBLIC KEY-----
                         MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDDmPPm5UC8rXn4uX37m4tN/j4T
                         MAhUVyxN7V7QxMF3HDg5rkl/Ju53DPJbv59TCvlTCXw1ihp9asVyyYpCqrsKCh10
                         sZI0kIrkizlKaB/20Q4P1kYOCgv4Cwds7Iu2y0TFwDosK9a7MPR9IksL7QRWKjD0
                         DoNemKEpyCt2dZTaQwIDAQAB
                         -----END PUBLIC KEY-----"""
    }
}

The following rule would encrypt an email address with a different value for each day:

{
    "rule_type": "String",
    "field": "email_address",
    "params": {
        "minimum_length": 5,
        "maximum_length": 1024,
        "regex": "[^@]+@[^\\.]..*[^\\.]",
        "fallback_mode": "remove_record",
        "encrypt": True,
        "public_key": """-----BEGIN PUBLIC KEY-----
                         MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDDmPPm5UC8rXn4uX37m4tN/j4T
                         MAhUVyxN7V7QxMF3HDg5rkl/Ju53DPJbv59TCvlTCXw1ihp9asVyyYpCqrsKCh10
                         sZI0kIrkizlKaB/20Q4P1kYOCgv4Cwds7Iu2y0TFwDosK9a7MPR9IksL7QRWKjD0
                         DoNemKEpyCt2dZTaQwIDAQAB
                         -----END PUBLIC KEY-----""",
        "random_string": "XY@4242SSS%y%m%d",
        "random_string_date_field": "date",
        "random_string_date_format": "%Y-%m-%d %H:%M:%S"
    }
}
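
Again as an illustration rather than the pipeline's own code, the effect of random_string can be seen with PyCryptodome's PKCS#1 OAEP cipher, which accepts a randfunc hook for the padding seed. The key file name and the zero-padding of the seed to 20 bytes are assumptions here:

from Crypto.PublicKey import RSA
from Crypto.Cipher import PKCS1_OAEP

# hypothetical file holding a PEM public key like the one above
key = RSA.import_key(open("public_key.pem", "rb").read())

# default: a fresh random OAEP seed, so the ciphertext differs on every call
print(PKCS1_OAEP.new(key).encrypt(b"a@example.com").hex())

# fixed seed padded to 20 bytes: the same plaintext always encrypts identically
seed = b"XY@4242SSS".ljust(20, b"\x00")
cipher = PKCS1_OAEP.new(key, randfunc=lambda n: seed[:n])
print(cipher.encrypt(b"a@example.com").hex())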


Related documentation

  • Dativa Pipeline API on AWS - The Dativa Pipeline API is available through the AWS marketplace
  • Dativa Pipeline Python Client - Dativa Tools includes a client for the Pipeline API
  • Dativa Pipeline API: Sample Data - Sample files to demonstrate usage of the Dativa Pipeline API
  • Dativa Pipeline API: Validating basic data types - Validating incoming datasets for basic string, number, and date type formatting and range checks using the Dativa Data Pipeline API
  • Dativa Pipeline API: Referential Integrity - Using the Dativa Pipeline API to validate data against other known good datasets to ensure referential integrity
  • Dativa Pipeline API: Handling invalid data - Invalid data can be quarantined or automatically fixed by the Dativa Data Pipeline API
  • Dativa Pipeline API: Working with session data - The Dativa Pipeline API can check for gaps and overlaps in session data and automatically fix them
  • Dativa Pipeline API: Reporting and monitoring data quality - The Dativa Pipeline API logs data that does not meet the defined rules and quarantines bad data
  • Dativa Pipeline API: Full API reference - A field by field breakdown of the full functionality of the Dativa Data Pipeline API
