Cleansing your data

Tom Weiss, Sun 16 July 2017

As media companies collect more and more data, from more and more sources, the need for a thorough data cleansing regime becomes more important than ever. Put simply, no company wants to risk making bad decisions based on bad data, or licensing bad data to third parties.

Data that is ready for ingest into a data lake is typically provided in CSV, JSON, XML, or Parquet format. Before ingest, it needs to be cleansed to ensure the accuracy of any subsequent analysis undertaken on the data and the referential integrity of the datasets.

Because most data lakes are not built on full-blown transactional databases that enforce foreign key constraints, it is particularly important to check referential integrity before loading data into the lake.
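
To make this concrete, here is a minimal sketch of a pre-load referential integrity check using pandas. The file names and the device_id key are assumptions for illustration, not part of any particular feed:

    import pandas as pd

    # Hypothetical extracts: viewing events and the device reference data
    # they point at. File names and the device_id key are illustrative.
    events = pd.read_csv("viewing_events.csv")
    devices = pd.read_csv("devices.csv")

    # Rows whose device_id has no match in the reference table would break
    # referential integrity if loaded, so quarantine them instead.
    orphans = events[~events["device_id"].isin(devices["device_id"])]
    if not orphans.empty:
        orphans.to_csv("quarantine/viewing_events_orphans.csv", index=False)
        print(f"{len(orphans)} rows failed the referential integrity check")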

In this blog post, we look at the options for cleansing your data.

Open source and AWS options

AWS does not currently provide any data cleansing tools, and there are few open source options available.

The most common approach to building your own data cleansing tools is to develop Python scripts that pre-process files, checking them against the relevant rules for each incoming data source. As a minimum, these checks should include the items below; a sketch of such a script follows the list.

  • Formatting of fields
  • Type checks
  • Range checks
  • Referential integrity
  • Some ability to quarantine or fix data that is not well formed, and to raise alerts.
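
By way of illustration, here is a minimal sketch of such a script, assuming a CSV feed with device_id, event_date, and duration_seconds fields. The field names and rules are placeholders to be adapted for each source:

    import csv
    import datetime

    def validate_row(row, known_device_ids):
        errors = []
        # Format check: device IDs are assumed to be 12-character strings.
        if len(row["device_id"]) != 12:
            errors.append("bad device_id format")
        # Type check: duration must parse as an integer.
        try:
            duration = int(row["duration_seconds"])
        except ValueError:
            errors.append("duration_seconds is not an integer")
            duration = None
        # Range check: a viewing session should be positive and under 24 hours.
        if duration is not None and not 0 < duration <= 86400:
            errors.append("duration_seconds out of range")
        # Format check on the date field.
        try:
            datetime.datetime.strptime(row["event_date"], "%Y-%m-%d")
        except ValueError:
            errors.append("event_date is not YYYY-MM-DD")
        # Referential integrity: the device must exist in the reference data.
        if row["device_id"] not in known_device_ids:
            errors.append("unknown device_id")
        return errors

    def cleanse(in_path, out_path, quarantine_path, known_device_ids):
        with open(in_path, newline="") as src, \
             open(out_path, "w", newline="") as good, \
             open(quarantine_path, "w", newline="") as bad:
            reader = csv.DictReader(src)
            good_writer = csv.DictWriter(good, fieldnames=reader.fieldnames)
            bad_writer = csv.DictWriter(bad, fieldnames=reader.fieldnames + ["errors"])
            good_writer.writeheader()
            bad_writer.writeheader()
            for row in reader:
                errors = validate_row(row, known_device_ids)
                if errors:
                    # Quarantine malformed rows and record why they failed,
                    # so they can be fixed or alerted on downstream.
                    bad_writer.writerow({**row, "errors": "; ".join(errors)})
                else:
                    good_writer.writerow(row)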

Off the shelf cleansing products

There are a limited number of off-the-shelf data cleansing products that are suited to the media sector.

There is an extensive range of “cleanse and append” products in most countries that will cleanse customer datasets to remove out-of-date addresses, both physical and electronic. For CRM data cleansing, these are by far the most effective options, but they cannot be applied to other datasets.

There are also “master data management” products that are provided by enterprise resource planning vendors as part of an overall toolset to support a manufacturing, retail, and distribution operation.

The two commercial products we are aware of that are suitable for a media company are:

  • Alteryx, a broad data preparation tool.
  • Our Pipeline API, a low-cost data cleansing API that makes it simple to set up rules-based filters to keep the data lake clean.

We like Alteryx for its tight integration with Tableau, and our Pipeline API for how easily it can be built into an existing data workflow.
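
For a sense of what a rules-based filter looks like in practice, the sketch below posts a file and a set of rules to a cleansing endpoint. The endpoint, rule schema, and field names are invented for illustration and are not the actual Pipeline API:

    import json
    import requests

    # Hypothetical rules-based cleansing call; the endpoint and rule schema
    # are illustrative only.
    rules = [
        {"field": "device_id", "rule": "format", "pattern": "^[0-9a-f]{12}$"},
        {"field": "duration_seconds", "rule": "range", "min": 0, "max": 86400},
    ]
    with open("viewing_events.csv", "rb") as f:
        response = requests.post(
            "https://api.example.com/clean",
            files={"file": f},
            data={"rules": json.dumps(rules)},
        )
    response.raise_for_status()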

Need help? Get in touch...

Sign up below and one of our data consultants will get right back to you.

Dativa is a global consulting firm providing data consulting and engineering services to companies that want to build and implement strategies to put data to work. We work with primary data generators, businesses harvesting their own internal data, data-centric service providers, data brokers, agencies, media buyers and media sellers.
