Dativa Pipeline API: Referential Integrity

Some fields can be validated with reference to a known dataset. This includes unique identifiers, primary keys in database ETL, and categorical datasets.

To implement this, one or more additional reference files need to be passed to the API and a rule with rule_type="Lookup" used.

The lookup rules have the following parameters:

  • original_reference - specifies the reference file
  • reference_field - the name or number of the column in the reference file that contains the values that this field should be validated against.
  • attempt_closest_match - specifies whether entries that do not validate should be replaced with the value of the closest matching record in the reference file. If a sufficiently close match, as specified by string_distance_threshold, is not found then the fallback_mode is still applied. Defaults to False
  • string_distance_threshold - specifies the distance threshold that a closest match must meet to be applied. This is a variant of the Jaro-Winkler distance and defaults to 0.7

import pandas as pd

from dativa.file_processing import FileProcessor

fp = FileProcessor()

# dirty_file is the data to validate; product_db is the reference data
df = pd.read_csv(dirty_file)
db_df = pd.read_csv(product_db)

report = fp.run(df,
                config={"rules": [
                    {
                        "rule_type": "Lookup",
                        "field": "product_guid",
                        "params": {
                            "fallback_mode": "remove_record",
                            "original_reference": "product_db_df",
                            "reference_field": "guid"
                        },
                    }]},
                df_dict={"product_db_df": db_df})

for entry in report:
    print(entry)
    print(entry.df.describe())

Note that in the above example, the key of the DataFrame in df_dict is the same as the value of the original_reference parameter.
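
The attempt_closest_match and string_distance_threshold parameters described above can be combined with the same rule to repair near-miss values instead of removing them. The following is a minimal sketch based on the example above; the threshold value is illustrative:

report = fp.run(df,
                config={"rules": [
                    {
                        "rule_type": "Lookup",
                        "field": "product_guid",
                        "params": {
                            # replace near-miss GUIDs with the closest value
                            # in the reference data; fall back to removing the
                            # record if no match scores above the threshold
                            "attempt_closest_match": True,
                            "string_distance_threshold": 0.9,
                            "fallback_mode": "remove_record",
                            "original_reference": "product_db_df",
                            "reference_field": "guid"
                        },
                    }]},
                df_dict={"product_db_df": db_df})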

Checking for duplication

Duplication can be checked on String, Number, and Date fields by setting the parameter is_unique:

  • is_unique - specifies whether this column should only contain unique values, defaults to False
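
For example, a minimal String rule that treats repeated device IDs as invalid might look like the sketch below (the field name is illustrative):

config = {
    "rules": [
        {
            "rule_type": "String",
            "field": "device_id",
            "params": {
                "fallback_mode": "remove_record",
                # reject any value that has already appeared in this column
                "is_unique": True
            },
        }
    ]
}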

In addition, Uniqueness rules can be set up to check for uniqueness across multiple sets of columns. These have rule_type="Uniqueness" and automatically quarantine any data that does not meet the criteria.

They take the following parameters:

  • unique_fields - a comma-separated list of the fields that should be checked for uniqueness, e.g. "device_id,date"
  • use_last_value - specifies whether the first or last duplicate value should be taken. Defaults to False

Note that "rule_type": "Uniqueness" does not use the "field" parameter:

config = {
    "rules": [
        {
            "rule_type": "Uniqueness",
            "params": {
                "unique_fields": "device_id,date"
            },
        }
    ]
}
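
As a minimal sketch, the same rule can keep the last occurrence of each device_id/date pair rather than the first, and be run with the FileProcessor and DataFrame from the earlier example:

config = {
    "rules": [
        {
            "rule_type": "Uniqueness",
            "params": {
                "unique_fields": "device_id,date",
                # keep the most recent duplicate rather than the first
                "use_last_value": True
            },
        }
    ]
}

report = fp.run(df, config=config)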

Handling blanks

Sometimes it's fine to have blanks in a file, and in this case you can specify the skip_blank parameter in the config:

  • skip_blank - specifies whether blank values in this field should be checked or whether they can be skipped, defaults to False

config = {
    "rules": [
        {
            "rule_type": "Date",
            "field": "Name",
            "params": {
                "fallback_mode": "remove_record",
                "date_format": "%s",
                "range_check": "rolling",
                "range_minimum": -7,
                "range_maximum": 0,
                "skip_blank": True
            },
        }
    ]
}

Where there are significant blanks in categorical data, you can use the lookalike_match parameter to fill these in automatically:

  • lookalike_match - specifies whether entries that do not validate should be replaced with the value from the record that looks most similar to the other records. This implements a nearest-neighbor algorithm based on the similarity of the other fields in the dataset. It is useful for filling in blank records and defaults to False
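
As a sketch, the Lookup rule from the first example could be extended with lookalike_match so that blank or invalid GUIDs are filled from the most similar record rather than removed; the names are carried over from that example:

report = fp.run(df,
                config={"rules": [
                    {
                        "rule_type": "Lookup",
                        "field": "product_guid",
                        "params": {
                            "fallback_mode": "remove_record",
                            "original_reference": "product_db_df",
                            "reference_field": "guid",
                            # fill failing values from the record whose other
                            # fields look most similar
                            "lookalike_match": True
                        },
                    }]},
                df_dict={"product_db_df": db_df})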

Related documentation

  • Dativa Pipeline API on AWS - The Dativa Pipeline API is available through the AWS marketplace (more)
  • Dativa Pipeline Python Client - Dativa Tools includes a client for the Pipeline API (more)
  • Dativa Pipeline API: Sample Data - Sample files to demonstrate usage of the Dativa Pipeline API (more)
  • Dativa Pipeline API: Validating basic data types - Validating incoming datasets for basic string, number, and date type formatting and range checks using the Dativa Data Pipeline API (more)
  • Dativa Pipeline API: Anonymizing data - The Dativa Pipeline API supports tokenization, hashing, and encryption of incoming datasets for anonymization and pseudonymization (more)
  • Dativa Pipeline API: Handling invalid data - Invalid data can be quarantined or automatically fixed by the Dativa Data Pipeline API (more)
  • Dativa Pipeline API: Working with session data - The Dativa Pipeline API can check for gaps and overlaps in session data and automatically fix them (more)
  • Dativa Pipeline API: Reporting and monitoring data quality - The Dativa Pipeline API logs data that does not meet the defined rules and quarantines bad data (more)
  • Dativa Pipeline API: Full API reference - A field by field breakdown of the full functionality of the Dativa Data Pipeline API (more)
