Dativa Pipeline API: Full API reference


String rules

  • minimum_length - the minimum length of the string, defaults to 0
  • maximum_length - the maximum allowed length of the string, defaults to 1024
  • regex - a regular expression used to validate the string. This is processed with the standard Python regular expression engine and supports its full syntax. Defaults to ".*"
  • is_unique - specifies whether this column should only contain unique values, defaults to False
  • skip_blank - specifies whether blank values in this field should be checked or whether they can be skipped, defaults to False
  • fallback_mode - specifies what should be done if the data does not comply with the rules, and can take the following values:
    • "remove_record" - the default value, the record is removed and quarantined in a file sent to the Logger object passed to the API
    • "use_default" - the record is removed and replaced with whatever is specified in the "default_value" field
    • "do_not_replace" - the record is left unchanged but still logged to the Logger object.
  • default_value - the value that records failing validation are set to when fallback_mode is set to "use_default". Defaults to ''.
  • attempt_closest_match - specifies whether entries that do not validate should be replaced with the value of the closest matching record in the dataset. If a sufficiently close match, as determined by string_distance_threshold, is not found then fallback_mode is still applied. Defaults to False
  • string_distance_threshold - the distance threshold used when looking for the closest match. This is a variant of the Jaro-Winkler distance and defaults to 0.7
  • lookalike_match - specifies whether entries that do not validate should be replaced with the value from the record that looks most similar to the other records. This implements a nearest neighbor algorithm based on the similarity of other fields in the dataset. It is useful for filling in blank records and defaults to False
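
As an illustration of how these parameters fit together, the sketch below shows a string rule as it might be defined through the Python client. The envelope keys (rule_type, field, params) and the column name are assumptions for illustration only; the parameter names are the ones documented above.

```python
# Illustrative only: the rule envelope ("rule_type", "field", "params") is an
# assumption and may differ in your client version; "title" is a hypothetical column.
string_rule = {
    "rule_type": "String",
    "field": "title",
    "params": {
        "minimum_length": 1,
        "maximum_length": 256,
        "regex": "^[A-Za-z0-9 ,.'-]+$",    # only letters, digits and simple punctuation
        "is_unique": False,
        "skip_blank": False,
        "attempt_closest_match": True,     # try a fuzzy repair before falling back
        "string_distance_threshold": 0.7,
        "fallback_mode": "use_default",    # replace bad values rather than quarantine
        "default_value": "UNKNOWN",
    },
}
```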

Number rules

  • minimum_value - the minimum allowed value of the number, defaults to 0
  • maximum_value - the maximum allowed value of the number, defaults to 65535
  • decimal_places - the number of decimal places the number should contain, defaults to 0
  • fix_decimal_places - if set to True, the default value, the rule will automatically fix all records to the same number of decimal places
  • is_unique - specifies whether this column should only contain unique values, defaults to False
  • skip_blank - specifies whether blank values in this field should be checked or whether they can be skipped, defaults to False
  • fallback_mode - specifies what should be done if the data does not comply with the rules, and can take the following values:
    • "remove_record" - the default value, the record is removed and quarantined in a file sent to the Logger object passed to the API
    • "use_default" - the record is removed and replaced with whatever is specified in the "default_value" field
    • "do_not_replace" - the record is left unchanged but still logged to the Logger object.
  • default_value - the value that records failing validation are set to when fallback_mode is set to "use_default". Defaults to ''.
  • attempt_closest_match - specifies whether entries that do not validate should be replaced with the value of the closest matching record in the dataset. If a sufficiently close match, as determined by string_distance_threshold, is not found then fallback_mode is still applied. Defaults to False
  • string_distance_threshold - the distance threshold used when looking for the closest match. This is a variant of the Jaro-Winkler distance and defaults to 0.7
  • lookalike_match - specifies whether entries that do not validate should be replaced with the value from the record that looks most similar to the other records. This implements a nearest neighbor algorithm based on the similarity of other fields in the dataset. It is useful for filling in blank records and defaults to False
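
A minimal sketch of a number rule, under the same assumed envelope, for a hypothetical price column that should always carry two decimal places:

```python
# Hypothetical number rule for a "price" column: values normalised to two
# decimal places, anything outside the allowed range quarantined.
number_rule = {
    "rule_type": "Number",
    "field": "price",
    "params": {
        "minimum_value": 0,
        "maximum_value": 10000,
        "decimal_places": 2,
        "fix_decimal_places": True,      # pad/truncate every value to two decimal places
        "fallback_mode": "remove_record",
    },
}
```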

Date rules

  • date_format - the format of the date parameter. This is based on the standard strftime parameters, with the addition of '%s', which represents an integer epoch time in seconds. The default date format is '%Y-%m-%d %H:%M:%S'
  • range_check - this can be set to the following values:
    • "none" - the date is not checked
    • "fixed" - the field is validated to lie between two fixed dates
    • "rolling" - the field is validated against a rolling window of days.
  • range_minimum - if range_check is set to "fixed" then this represents the start of the date range. If the range check is set to "rolling" it represents the offset from the current time, in days, that should be used as the earliest point in the range. The default value is '2000-01-01 00:00:00'
  • range_maximum - if range_check is set to "fixed" then this represents the end of the date range. If the range check is set to "rolling" it represents the offset from the current time, in days, that should be used as the latest point in the range. The default value is '2020-01-01 00:00:00'
  • is_unique - specifies whether this column should only contain unique values, defaults to False
  • skip_blank - specifies whether blank values in this field should be checked or whether they can be skipped, defaults to False
  • fallback_mode - specifies what should be done if the data does not comply with the rules, and can take the following values:
    • "remove_record" - the default value, the record is removed and quarantined in a file sent to the Logger object passed to the API
    • "use_default" - the record is removed and replaced with whatever is specified in the "default_value" field
    • "do_not_replace" - the record is left unchanged but still logged to the Logger object.
  • default_value - the value that records failing validation are set to when fallback_mode is set to "use_default". Defaults to ''.
  • attempt_closest_match - specifies whether entries that do not validate should be replaced with the value of the closest matching record in the dataset. If a sufficiently close match, as determined by string_distance_threshold, is not found then fallback_mode is still applied. Defaults to False
  • string_distance_threshold - the distance threshold used when looking for the closest match. This is a variant of the Jaro-Winkler distance and defaults to 0.7
  • lookalike_match - specifies whether entries that do not validate should be replaced with the value from the record that looks most similar to the other records. This implements a nearest neighbor algorithm based on the similarity of other fields in the dataset. It is useful for filling in blank records and defaults to False
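
A hedged sketch of a date rule using a rolling range check on epoch-seconds timestamps. The envelope, the column name, and the sign convention used for the rolling offsets are assumptions for illustration:

```python
# Hypothetical date rule: "event_time" holds integer epoch seconds and must
# fall inside a rolling window ending at the current time.
date_rule = {
    "rule_type": "Date",
    "field": "event_time",
    "params": {
        "date_format": "%s",          # integer epoch time in seconds
        "range_check": "rolling",
        "range_minimum": -30,         # offset in days from now used as the earliest point (sign convention assumed)
        "range_maximum": 0,           # offset in days from now used as the latest point
        "fallback_mode": "remove_record",
    },
}
```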

Lookup rules

  • original_reference - specifies the entry in the df_dict dictionary containing the reference file
  • reference_field - the name or number of the column in the reference file that contains the values that this field should be validated against.
  • attempt_closest_match - specifies whether entries that do not validate should be replaced with the value of the closest matching record in the reference file. If a sufficiently close match, as determined by string_distance_threshold, is not found then fallback_mode is still applied. Defaults to True
  • string_distance_threshold - the distance threshold used when looking for the closest match. This is a variant of the Jaro-Winkler distance and defaults to 0.7
  • skip_blank - specifies whether blank values in this field should be checked or whether they can be skipped, defaults to False
  • fallback_mode - specifies what should be done if the data does not comply with the rules, and can take the following values:
    • "remove_record" - the default value, the record is removed and quarantined in a file sent to the Logger object passed to the API
    • "use_default" - the record is removed and replaced with whatever is specified in the "default_value" field
    • "do_not_replace" - the record is left unchanged but still logged to the Logger object.
  • default_value - the value that records failing validation are set to when fallback_mode is set to "use_default". Defaults to ''.
  • lookalike_match - specifies whether entries that do not validate should be replaced with the value from the record that looks most similar to the other records. This implements a nearest neighbor algorithm based on the similarity of other fields in the dataset. It is useful for filling in blank records and defaults to False
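
A hedged sketch of a lookup rule, assuming the same envelope, validating a hypothetical channel_name column against a reference DataFrame registered in df_dict:

```python
import pandas as pd

# Hypothetical reference data: the "channels" key in df_dict holds the known-good values.
df_dict = {
    "channels": pd.DataFrame({"name": ["BBC One", "ITV", "Channel 4"]}),
}

lookup_rule = {
    "rule_type": "Lookup",
    "field": "channel_name",
    "params": {
        "original_reference": "channels",   # key into df_dict holding the reference file
        "reference_field": "name",          # column in the reference file with the valid values
        "attempt_closest_match": True,      # fuzzy-match near misses such as "BBC 0ne"
        "string_distance_threshold": 0.7,
        "fallback_mode": "remove_record",
    },
}
```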

Uniqueness rules

  • unique_fields - a comma-separated list of the fields that should be checked for uniqueness, e.g. "device_id,date"
  • use_last_value - specifies whether the first or last duplicate value should be kept. Defaults to False, i.e. the first value is kept
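
A hedged sketch of a uniqueness rule under the same assumed envelope, deduplicating on a device ID and date pair and keeping the last occurrence; the column names are illustrative:

```python
# Hypothetical uniqueness rule: at most one row per (device_id, date) pair,
# keeping the last duplicate rather than the first.
uniqueness_rule = {
    "rule_type": "Uniqueness",
    "field": "device_id",
    "params": {
        "unique_fields": "device_id,date",
        "use_last_value": True,
    },
}
```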

Session rules

  • key_field - specifies the field that keys the session. For most IoT applications this would be the device ID. If two sessions overlap on the same value of key_field then they will be truncated. Defaults to None
  • start_field - specifies the field that controls the start of the session. Defaults to None
  • end_field - specifies the field that controls the end of the session. Defaults to None
  • date_format - defaults to '%Y-%m-%d %H:%M:%S'
  • overlaps_option - specifies how overlaps should be handled:
    • "ignore" - overlaps are not processed
    • "truncate_start" - the overlap is resolved by truncating the start of the next session
    • "truncate_end" - the overlap is resolved by truncating the end of the previous session
  • gaps_option - specifies how gaps should be handled:
    • "ignore" - gaps are ignored
    • "extend_start" - the gaps are resolved by extending the start of the next session
    • "extend_end" - the gaps are resolved by extending the end of the previous session
    • "insert_new" - the gaps are resolved by inserting a new session as specified in the "template_for_new" parameter
  • template_for_new - contains a comma-separated list of values that will be used as a template for a new row in the file to fill the gap. The key_field, start_field, and end_field will be replaced with appropriate values to fill any gaps.
  • allowed_gap_seconds - specifies how many seconds of a gap are allowed before the gap options are implemented, defaults to 1
  • allowed_overlap_seconds - specifies how many seconds of overlap are allowed before the overlap options are implemented, defaults to 1
  • remove_zero_length - specifies whether zero-length sessions should be removed. Defaults to True
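
A hedged sketch of a session rule under the same assumed envelope, for hypothetical device viewing sessions where small gaps are closed by extending the previous session and overlaps are resolved by truncating it:

```python
# Hypothetical session rule: sessions keyed by device_id with assumed column
# names "session_start" and "session_end".
session_rule = {
    "rule_type": "Session",
    "field": "device_id",
    "params": {
        "key_field": "device_id",
        "start_field": "session_start",
        "end_field": "session_end",
        "date_format": "%Y-%m-%d %H:%M:%S",
        "overlaps_option": "truncate_end",   # trim the end of the earlier session
        "gaps_option": "extend_end",         # stretch the earlier session forward to close the gap
        "allowed_gap_seconds": 5,            # tolerate gaps of up to five seconds
        "allowed_overlap_seconds": 1,
        "remove_zero_length": True,
    },
}
```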


Related documentation

  • Dativa Pipeline API on AWS - the Dativa Pipeline API is available through the AWS Marketplace
  • Dativa Pipeline Python Client - Dativa Tools includes a client for the Pipeline API
  • Dativa Pipeline API: Sample Data - sample files to demonstrate usage of the Dativa Pipeline API
  • Dativa Pipeline API: Validating basic data types - validating incoming datasets for basic string, number, and date type formatting and range checks using the Dativa Data Pipeline API
  • Dativa Pipeline API: Anonymizing data - the Dativa Pipeline API supports tokenization, hashing, and encryption of incoming datasets for anonymization and pseudonymization
  • Dativa Pipeline API: Referential Integrity - using the Dativa Pipeline API to validate data against other known good datasets to ensure referential integrity
  • Dativa Pipeline API: Handling invalid data - invalid data can be quarantined or automatically fixed by the Dativa Data Pipeline API
  • Dativa Pipeline API: Working with session data - the Dativa Pipeline API can check for gaps and overlaps in session data and automatically fix them
  • Dativa Pipeline API: Reporting and monitoring data quality - the Dativa Pipeline API logs data that does not meet the defined rules and quarantines bad data
