Dativa Pipeline API: Reporting and monitoring data quality

Reporting

The run() method returns a list of ReportEntry objects. Each ReportEntry contains the following attributes:

  • ReportEntry.date - a Python datetime object for when the entry was created
  • ReportEntry.field - a string representing the name of the field that was processed
  • ReportEntry.number_records - the number of records affected
  • ReportEntry.category - a string categorizing the type of action applied to the rows
  • ReportEntry.description - a string with more detail on the action taken
  • ReportEntry.df - a pandas DataFrame containing any records that did not pass validation, and any values they were replaced with

The ReportEntry class serializes to a human-readable log file, but it is more common for the entries to be post-processed into a machine-readable format and for the DataFrames to be saved to disk for later review.
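
For example, a minimal post-processing sketch along these lines might look like the following. It assumes run() has returned an iterable of ReportEntry objects with the attributes listed above; the save_report function and the output directory name are illustrative, not part of the API.

import os


def save_report(report, output_dir="quarantine"):
    """Print a one-line summary of each entry and save its DataFrame to disk."""
    os.makedirs(output_dir, exist_ok=True)
    for i, entry in enumerate(report):
        print("{0} | {1} | {2} records | {3}: {4}".format(entry.date,
                                                          entry.field,
                                                          entry.number_records,
                                                          entry.category,
                                                          entry.description))
        if entry.df is not None and not entry.df.empty:
            # keep the records that failed validation for later review
            filename = "entry_{0}_{1}.csv".format(i, entry.field)
            entry.df.to_csv(os.path.join(output_dir, filename), index=False)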

Custom reporting

By default the FileProcessor class uses the DefaultReportWriter() class, which aggregates ReportEntry() objects and returns them in a list at the end of the run.

To write your own custom reporting class, you need to implement a class with two methods: log_history and get_report.

Here is an example that simply logs all information to stdout:

from datetime import datetime


class MyReportWriter():

    def log_history(self,
                    rule,
                    field,
                    df,
                    category,
                    description
                    ):
        # timestamp each entry as it is logged, since log_history
        # is not passed a date
        print("{0}, Field {1}: #{2} {3}/{4}".format(datetime.now(),
                                                    field,
                                                    df.shape[0],
                                                    category,
                                                    description))

    def get_report(self):
        return None


import pandas as pd

fp = FileProcessor(report_writer=MyReportWriter())

df = pd.read_csv(my_file)

report = fp.run(df,
                config={"rules": [
                    {
                        "rule_type": "String",
                        "field": "name",
                        "params": {
                            "fallback_mode": "remove_record",
                            "regex": "[\w\s]*"
                        }
                    }]})
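
If you want the entries returned at the end of the run rather than streamed to stdout, the same two-method interface can accumulate them, much as the default report writer does. The following is an illustrative sketch only: it assumes log_history is called with the arguments shown above and collects a machine-readable summary for get_report to return.

class AccumulatingReportWriter():

    def __init__(self):
        self.entries = []

    def log_history(self,
                    rule,
                    field,
                    df,
                    category,
                    description
                    ):
        # collect a machine-readable summary of each entry
        self.entries.append({"field": field,
                             "records": df.shape[0],
                             "category": category,
                             "description": description})

    def get_report(self):
        return self.entries


fp = FileProcessor(report_writer=AccumulatingReportWriter())

In this case the report returned by run() would be the accumulated list of dictionaries rather than None.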

Related documentation

  • Dativa Pipeline API on AWS - The Dativa Pipeline API is available through the AWS marketplace (more)
  • Dativa Pipeline Python Client - Dativa Tools includes a client for the Pipeline API (more)
  • Dativa Pipeline API: Sample Data - Sample files to demonstrate usage of the Dativa Pipeline API (more)
  • Dativa Pipeline API: Validating basic data types - Validating incoming datasets for basic string, number, and date type formatting and range checks using the Dativa Data Pipeline API (more)
  • Dativa Pipeline API: Anonymizing data - The Dativa Pipeline API supports tokenization, hashing, and encryption of incoming datasets for anonymization and pseudonymization (more)
  • Dativa Pipeline API: Referential Integrity - Using the Dativa Pipeline API to validate data against other known good datasets to ensure referential integrity (more)
  • Dativa Pipeline API: Handling invalid data - Invalid data can be quarantined or automatically fixed by the Dativa Data Pipeline API (more)
  • Dativa Pipeline API: Working with session data - The Dativa Pipeline API can check for gaps and overlaps in session data and automatically fix them (more)
  • Dativa Pipeline API: Full API reference - A field by field breakdown of the full functionality of the Dativa Data Pipeline API (more)
