Dativa Pipeline API: Handling invalid data

Dativa Pipeline API: Handling invalid data

Previous: Referential Integrity | Next: Working with session data

If the data does not validate then a fall back option is applied depending on the specification of the following params:

  • fallback_mode - specifies what should be done if the data does not comply with the rules, and can take the following values:
    • "remove_entry" - the default value, the record is removed and quarantined in a file sent to the Logger object passed to the API
    • "use_default" - the record is removed and replaced with whatever is specified in the "default_value" field
    • "do_not_replace" - the record is left unchanged but still logged to the Logger object.
  • default_value - The value that records that do not validate should be set to if fallback_mode is set to use_default. Defaults to ''.

For String, Number, and Date fields, there are two additional methods that can be applied for handling invalid data.

The best match options search the other values in the field that meet the validation rule and replaces the invalid value with one that is sufficiently similar. This is controlled by these two parameters:

  • attempt_closest_match - Specifies whether entries that do not validate should be replaced with the value of the closest matching record in the dataset. If a sufficiently close match, as specified by the string_distance_threshold is not found then the fallback_mode is still applied. Defaults to False
  • string_distance_threshold - This specifies the default distance threshold for closest matches to be applied. This is a variant of the Jaro Winkler distance and defaults to 0.7

The best matching is not possible where invalid data is blank. In this case an additional option is available to replace a field with a value from the record that is most similar in its other fields. This is known as a lookalike model and is most useful for filling in blank records.

It is enabled with the single parameter:

  • lookalike_match - This specifies whether entries that do not validates should be replaced with value from the record that looks most similar to the other records. This implements a nearest neighbor algorithm based on the similarity of other fields in the dataset. It is useful for filling in blank records and defaults to False

Previous: Referential Integrity | Next: Working with session data

Related documentation

  • Dativa Pipeline API on AWS - The Dativa Pipeline API is available through the AWS marketplace (more)
  • Dativa Pipeline Python Client - Dativa Tools includes a client for the Pipeline API (more)
  • Dativa Pipeline API: Sample Data - Sample files to demonstrate usage of the Dativa Pipeline API (more)
  • Dativa Pipeline API: Validating basic data types - Validating incoming datasets for basic string, number, and date type formatting and range checks using the Dativa Data Pipeline API (more)
  • Dativa Pipeline API: Anonymizing data - The Dativa Pipeline API support tokenization, hashing, and encyrption of incoming datasets for anonymisation and pseudonymization (more)
  • Dativa Pipeline API: Referential Integrity - Using the Dativa Pipeline API to validate data against other known good datasets to ensure referential integrity (more)
  • Dativa Pipeline API: Working with session data - The Dativa Pipeline API can check for gaps and overlaps in session data and automatically fix them (more)
  • Dativa Pipeline API: Reporting and monitoring data quality - The Dativa Pipeline API logs data that does not meet the defined rules and quarantines bad data (more)
  • Dativa Pipeline API: Full API reference - A field by field breakdown of the full functionality of the Dativa Data Pipeline API (more)

Need help? Get in touch...

Sign up below and one of our data consultants will get right back to you


Dativa is a global consulting firm providing data consulting and engineering services to companies that want to build and implement strategies to put data to work. We work with primary data generators, businesses harvesting their own internal data, data-centric service providers, data brokers, agencies, media buyers and media sellers.

145 Marina Boulevard
San Rafael
California
94901

Registered in Delaware

Thames Tower
Station Road
Reading
RG1 1LX

Registered in England & Wales, number 10202531