Our passion is putting data to work: the strategy about what you should do; the science behind how you can do it; and then the engineering to turn it into a reality. But it's once we've completed the build-out of a data pipeline that the hard work of operating the platform kicks in.
The difficulty of running a data pipeline isn't in managing the servers or ensuring there's enough capacity. Both of those are nowadays handled by AWS. The tricky part is making sure that the data going into the pipeline arrives promptly, meets specifications, and is fit for purpose.
Most data pipelines have somewhere between 10 and 100 different data sources arriving regularly. Some arrive in real time, some hourly, most daily, and others less frequently. Where it gets complicated is that some datasets need to be processed in a specific order. If you're getting a set of ad impressions inbound that you want to match against your latest campaign database, you need to make sure your campaign data has arrived BEFORE you start loading your ad impressions. Validating that the data comes in the right order is the first step in any data operations center.
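To make the ordering rule concrete, here is a minimal sketch in Python. The dataset names and the `DEPENDENCIES` map are assumptions made for illustration, not a description of any particular pipeline; the idea is simply that a dataset is held back until everything it depends on has arrived.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each dataset lists the datasets
# that must be loaded before it.
DEPENDENCIES = {
    "campaigns": set(),
    "ad_impressions": {"campaigns"},
}


def missing_prerequisites(dataset, arrived, dependencies=DEPENDENCIES):
    """Return the prerequisites of `dataset` that have not arrived yet."""
    return set(dependencies.get(dataset, set())) - set(arrived)


def load_order(dependencies=DEPENDENCIES):
    """Return one valid load order, with prerequisites first."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Impressions have landed but campaigns have not: hold the load.
    print(missing_prerequisites("ad_impressions", {"ad_impressions"}))
    print(load_order())
```

An empty result from `missing_prerequisites` means the dataset is safe to load; anything else names the data that operations should chase up first.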
The second step is ensuring that data meets the specifications that we've defined. Simply speaking, if we're expecting a UTF-8 encoded JSON file, are we receiving a UTF-8 encoded JSON file? All too often, non-English characters and unexpected quotation marks break both the UTF-8 encoding and the JSON, making the data unreadable to downstream ingest processes.
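A file-level check of this kind can be sketched with nothing more than Python's standard library. The function name `check_utf8_json` and the wording of the failure reasons are illustrative assumptions.

```python
import json


def check_utf8_json(raw: bytes):
    """Return (ok, reason) for the raw bytes of a delivered file."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return False, f"not valid UTF-8 at byte {exc.start}"
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON at line {exc.lineno}, column {exc.colno}"
    return True, "ok"


if __name__ == "__main__":
    print(check_utf8_json('{"name": "café"}'.encode("utf-8")))
    # The same text saved in Latin-1 is no longer valid UTF-8.
    print(check_utf8_json('{"name": "café"}'.encode("latin-1")))
    # An unescaped quotation mark inside a string breaks the JSON.
    print(check_utf8_json(b'{"quote": "he said "hi""}'))
```

Reporting the position of the failure, rather than a bare pass or fail, makes the conversation with the data provider much shorter.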
Once we've validated each file, we need the same level of sense checking for each field. If there are dates in the data, are they well formatted and are they in the expected date range? Are all strings within the expected length? Do numeric fields have the right distribution? All of these field-level checks are critical to ensuring that data providers are delivering consistent data.
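As a rough sketch of what field-level checks can look like, the Python below validates one record at a time. The field names (`event_date`, `campaign_name`), the date format, and the length limit are all assumptions for the example, and the mean-in-a-band test is only a crude stand-in for a proper distribution check.

```python
from datetime import date, datetime
from statistics import mean


def check_record(record, earliest, latest, max_name_len=50):
    """Return a list of problems found in one record (empty if clean)."""
    problems = []
    try:
        day = datetime.strptime(record["event_date"], "%Y-%m-%d").date()
        if not earliest <= day <= latest:
            problems.append("event_date out of range")
    except (KeyError, ValueError, TypeError):
        problems.append("event_date missing or badly formatted")
    name = record.get("campaign_name")
    if not isinstance(name, str) or len(name) > max_name_len:
        problems.append("campaign_name missing or too long")
    return problems


def mean_in_band(values, low, high):
    """Crude distribution check: is the mean of a numeric field within bounds?"""
    return low <= mean(values) <= high


if __name__ == "__main__":
    record = {"event_date": "05/03/2024", "campaign_name": "spring"}
    print(check_record(record, date(2024, 1, 1), date(2024, 12, 31)))
```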
Once we're confident that the data meets the specification, we still have to ensure that it's fit for purpose. If we usually receive a 3 MB file from a customer and one day we get 43 GB, then we can guess they've sent us the wrong file. If we typically take data from 45 different campaigns and one day we only receive data for 43, we know that something has gone wrong with those other two campaigns.
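Both of these fit-for-purpose checks reduce to comparing today's delivery against what we normally see. The sketch below is one way to express that in Python; the factor-of-ten tolerance and the function names are assumptions chosen for illustration, and a real threshold would be tuned per data provider.

```python
from statistics import median


def size_is_plausible(size_bytes, recent_sizes, max_ratio=10.0):
    """True if the file is within a factor of `max_ratio` of the typical size."""
    typical = median(recent_sizes)
    return typical / max_ratio <= size_bytes <= typical * max_ratio


def missing_campaigns(expected_ids, received_ids):
    """Return the campaign ids we expected but did not receive."""
    return sorted(set(expected_ids) - set(received_ids))


if __name__ == "__main__":
    MB, GB = 1024 ** 2, 1024 ** 3
    history = [3 * MB, 3 * MB, 4 * MB]
    print(size_is_plausible(43 * GB, history))          # far larger than usual
    print(missing_campaigns(range(1, 46), range(1, 44)))  # 45 expected, 43 received
```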
Once we've validated that all the data is correct, we automatically load it into the pipeline. The manually intensive work of data operations is in dealing with data that has failed validation. We typically separate this data out into quarantine, and from there we manually examine it to understand what's gone wrong, discuss it with the data provider, and work to get correct data. If we can, we'll manually correct the data to keep it flowing, but ultimately the goal is to get each data provider to take responsibility for delivering data correctly.
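The split between automatic loading and quarantine can be sketched as a simple triage step. In this Python example the `checks` dictionary and the file payloads are hypothetical; the point is that every quarantined file carries the list of checks it failed, which gives the manual investigation somewhere to start.

```python
def triage(files, checks):
    """Split files into loadable names and quarantined names with reasons.

    `files` maps a file name to its payload; `checks` maps a label to a
    function that returns True when the payload passes that check.
    """
    to_load, quarantined = [], {}
    for name, payload in files.items():
        failed = [label for label, check in checks.items() if not check(payload)]
        if failed:
            quarantined[name] = failed
        else:
            to_load.append(name)
    return to_load, quarantined


if __name__ == "__main__":
    checks = {
        "non-empty": lambda payload: len(payload) > 0,
        "starts with brace": lambda payload: payload.startswith(b"{"),
    }
    files = {"a.json": b'{"x": 1}', "b.json": b""}
    print(triage(files, checks))
```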
In the overall life cycle of the data, getting it into the data lake or data warehouse is really only the starting point, and data operations also has a significant role to play in making sure that users can make the most of the data. When it comes to querying big datasets, 90% of the performance issues come from users not knowing the best way to work with the data. Data operations needs to keep a close eye on user activity, looking for slow queries and potential bottlenecks in reading data. We run a daily process to review the most time-consuming tasks and proactively reach out to users with ways to improve performance. When this approach was rolled out with one of our clients, we saw average query time improve from 10 minutes to less than 1 minute across the board.
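A daily review like this can start from something as simple as the sketch below. It assumes a query log is already available as a list of dictionaries with `user` and `seconds` keys, which is a hypothetical shape; most warehouses expose similar information through their own query history, and the 60-second threshold is likewise only an example.

```python
from collections import defaultdict


def slow_queries_by_user(query_log, threshold_seconds=60.0):
    """Group queries slower than the threshold by user, worst offenders first."""
    slow = defaultdict(list)
    for entry in query_log:
        if entry["seconds"] > threshold_seconds:
            slow[entry["user"]].append(entry)
    # Rank users by the total time spent in their slow queries.
    return sorted(
        slow.items(),
        key=lambda item: sum(query["seconds"] for query in item[1]),
        reverse=True,
    )


if __name__ == "__main__":
    log = [
        {"user": "ana", "seconds": 600.0},
        {"user": "bo", "seconds": 5.0},
        {"user": "cy", "seconds": 90.0},
    ]
    for user, queries in slow_queries_by_user(log):
        print(user, len(queries))
```

The ranked list tells the operations team whom to contact first.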
None of these activities are particularly difficult, and many are merely the systematic application of common sense. However, when they are all applied to data operations, the results are spectacular. Getting operations right enables analysts and data scientists to spend less time on data preparation and focus more of their time adding dollars to the bottom line. That's why data science needs data operations.