Dativa tools data validation

The adage of “fail fast, fail hard” is never more accurate than when trying to load and process large volumes of data. How many data engineers have spent hours - and dollars - processing gigabytes in their data pipeline only to discover that some critical data is missing, corrupt or just plain wrong?

Although more subtle issues need to be handled later in the data pipeline during data processing and cleansing, there are several simple tests data engineers can run as soon as a batch of data is received. Some of these can even run before the data files are opened. With automated data delivery it’s surprising how many simple things can go wrong: zero-length files, missing files, unexpected files, last week’s files, massively bloated files. Over the years we’ve seen them all.
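
For example, a check of this kind needs nothing more than the standard library. The sketch below is purely illustrative and not part of dativatools; the folder and file names are hypothetical. It flags missing or zero-length files before any parsing starts:

import os

# Hypothetical list of files expected in today's delivery
expected_files = ["PRG_20170628_001.xml", "PRG_20170628_002.xml"]

for name in expected_files:
    path = os.path.join("sampledata/programme", name)
    if not os.path.exists(path):
        print("Missing file: %s" % path)
    elif os.path.getsize(path) == 0:
        print("Zero length file: %s" % path)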

That’s why at Dativa our data engineers have developed tools to detect rogue data as soon as we receive it. Our data validation Python library contains methods to validate file sizes, dates, counts, names, and extensions at a specified location, and has now been made open source as part of our dativatools package, which is available on PyPI and GitHub.

The data validation library is straightforward to integrate into a data pipeline. A configuration dictionary is used to describe the data location and criteria for establishing its validity:

config_dict = {
    'path': '',                    # Folder path for data files e.g. 'sampledata/programme'
    'expected_extension': '',      # Extension of files e.g. '.xml'
    'expected_filenames': '',      # File prefix e.g. 'PRG_*'
    'file_count_threshold': [],    # Range within which the file count should fall e.g. [500, 1000]
    'max_sizes': [],               # Max limit of file size in bytes (single value for all files, or individual values for each file) e.g. [119351]
    'min_sizes': [],               # Min limit of file size in bytes (single value for all files, or individual values for each file) e.g. [586]
    'other_expected_files': [],    # Files to ignore if present
    'date_format': '',             # Format of dates used in this dictionary e.g. '%Y%m%d'
    'date_range': [],              # Range of dates for files as time deltas from 'processing_date' according to 'date_format' e.g. [-3, -1]
    'processing_date': ''          # Date of the folder to check, in 'date_format' e.g. '20170628'
}
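
Putting the example values from the comments together, a populated configuration for a daily programme feed might look like the following. The values are illustrative only and should be replaced with those for your own feed:

config_dict = {
    'path': 'sampledata/programme',
    'expected_extension': '.xml',
    'expected_filenames': 'PRG_*',
    'file_count_threshold': [500, 1000],
    'max_sizes': [119351],
    'min_sizes': [586],
    'other_expected_files': ['manifest.txt'],   # hypothetical extra file to ignore
    'date_format': '%Y%m%d',
    'date_range': [-3, -1],
    'processing_date': '20170628'
}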

The dictionary is used to initialize a data validation object, along with a path and file in which to log the detailed results:

from dativatools import DataValidation  # import path assumed; check the dativatools README for your version

obj = DataValidation(config_dict, "logs/", "data_validation.log")

It’s then just a matter of performing the required tests on the data files and responding accordingly:

# Check that files with the expected filenames are present
result = obj.check_name()
self.assertTrue(result)

# Check that no unexpected extra files are present
result = obj.check_extra_files()
self.assertTrue(result[0])

# Check that the file dates are within the expected date range
result = obj.check_date()
self.assertTrue(result[0])

# Check that the sizes of the files fall within the prescribed thresholds
result = obj.check_size()
self.assertTrue(result[0])

# Check that the count of files falls within the prescribed thresholds
result = obj.check_file_count()
self.assertTrue(result[0])

# Check that the file extensions match the expected extension
result = obj.check_file_extension()
self.assertTrue(result[0])
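
Outside of a test suite, the same checks can drive the pipeline directly. The sketch below assumes, as in the snippets above, that each check returns a truthy value (or a sequence whose first element is truthy) when the data passes; the helper function and the exception are illustrative, not part of dativatools:

def validate_delivery(obj):
    # Run every check and collect the names of any that fail
    checks = {
        'name': obj.check_name(),
        'extra_files': obj.check_extra_files()[0],
        'date': obj.check_date()[0],
        'size': obj.check_size()[0],
        'file_count': obj.check_file_count()[0],
        'extension': obj.check_file_extension()[0],
    }
    failed = [name for name, passed in checks.items() if not passed]
    if failed:
        # Fail fast: stop the pipeline before any expensive processing starts
        raise ValueError("Data validation failed: %s" % ", ".join(failed))

validate_delivery(obj)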

Running these simple tests on incoming data means our data engineers discover many issues almost the moment the data arrives, so we can resolve problems quickly and without unduly impacting live services.

Maybe it’s not so much “fail fast, fail hard” as “fail fast, fix easily”!

Related documentation

  • Querying AWS Athena and getting the results in Parquet format - (more)
  • Dativa tools pandas extensions - A set of extensions to pandas for consistent CSV processing and better date time handling (more)
  • Dativa tools Athena client - The AthenaClient wraps the AWS boto client in an easy to use wrapper to create and manage tables in S3 (more)
