We’ve written about data mining before and some of the steps that companies take when they carry out a data mining exercise. We wanted to delve into this topic a little more, partly because of what we sometimes see as a misconception about data mining and how it is used.
I've heard data mining described as "digging for buried treasure" before, and it does sometimes carry an almost romantic notion with it. The concept of searching large sets of data for hidden patterns strikes at the very heart of the promise of data science - taking something that appears random or meaningless, and finding value in it. The beauty of data mining is the ability to find that value in left-field places.
To take an example close to our industry, let's look again, as we did in our original piece, at archives. Media companies have vast archives of TV shows dating back decades, and more and more of them are starting to digitize them. These shows have often been sitting on tapes in a room for decades, waiting for someone to discover and use them. Digitization means that, for the first time, a data science team can write algorithms that analyze this content and uncover patterns in the data that we don’t know exist. Could data mining reveal a hitherto-unseen pattern in archive shows that influences content commissioning in the future? We don’t know - but data mining may help us to find an answer.
TV is awash with new consumption data sets – from OTT devices, set-top box providers and Smart TV devices – and the industry is still at the forefront of using them, so there are lots of “known unknowns” waiting to be unearthed via data mining. More specifically, lots of “unsupervised learning” projects – projects that work with data that isn’t labeled or classified in any way – rely heavily on data mining, and lots of the problems we try to solve for our media customers at Dativa require this approach.
Take segment creation or audience clustering. There are dozens of different ways to cut or look at your audience. Even if you want to create a relatively simple segment - say, people who are heavy viewers of a shorter show - there are multiple different ways to do this ([we have written about some of them here](https://www.dativa.com/building-precision-audiences-tv-data)). Finding the best method involves mining the data and looking for patterns.
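To make the point concrete, here is a minimal sketch of two equally reasonable definitions of a "heavy viewer" of one show. The viewing log, the viewer IDs and both thresholds are made up for illustration; they are assumptions, not a recommended method.

```python
from collections import defaultdict

# Hypothetical viewing log for one show: (viewer_id, minutes_watched) per session.
sessions = [
    ("a", 30), ("a", 28), ("a", 25),
    ("b", 5),
    ("c", 29), ("c", 30),
    ("d", 10), ("d", 8), ("d", 9), ("d", 7),
]

def summarise(sessions):
    """Return (total minutes per viewer, session count per viewer)."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for viewer, minutes in sessions:
        totals[viewer] += minutes
        counts[viewer] += 1
    return dict(totals), dict(counts)

def heavy_by_minutes(sessions, min_total=50):
    """Definition 1: heavy viewers watched at least `min_total` minutes overall."""
    totals, _ = summarise(sessions)
    return {viewer for viewer, total in totals.items() if total >= min_total}

def heavy_by_frequency(sessions, min_sessions=3):
    """Definition 2: heavy viewers tuned in at least `min_sessions` times."""
    _, counts = summarise(sessions)
    return {viewer for viewer, count in counts.items() if count >= min_sessions}

print(sorted(heavy_by_minutes(sessions)))    # ['a', 'c']
print(sorted(heavy_by_frequency(sessions)))  # ['a', 'd']
```

The two definitions agree on viewer "a" but disagree on "c" and "d", which is exactly why the choice of method has to be tested against the data rather than assumed.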
The prosaic side of data mining
Data mining is about spotting patterns, but patterns don’t appear by waving a magic wand. Once a team has looked at a set of data, it needs to validate and cleanse the data - failure to do this can compromise any efforts to find meaningful patterns. Typically, companies spend the bulk of their time preprocessing and conditioning the data to make sure it is clean, consistent, and appropriately combined to deliver business intelligence on which they can rely. Data mining is all about the data, and successful data mining requires data that accurately reflects the business.
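As a rough illustration of what validating and cleansing can mean in practice, the sketch below normalises a few fields, rejects implausible records and drops repeats. The field names (`device_id`, `channel`, `start`, `minutes`) and the rule that a session must last between 0 and 1,440 minutes are assumptions chosen for the example, not a description of any real feed.

```python
# Hypothetical raw viewing records; field names are assumptions for this sketch.
raw = [
    {"device_id": "tv-1", "channel": "news ", "start": "2020-01-01T20:00", "minutes": 30},
    {"device_id": "tv-1", "channel": "NEWS", "start": "2020-01-01T20:00", "minutes": 30},
    {"device_id": "tv-2", "channel": "sport", "start": "2020-01-01T21:00", "minutes": -5},
    {"device_id": "", "channel": "drama", "start": "2020-01-01T22:00", "minutes": 45},
    {"device_id": "tv-3", "channel": "Drama", "start": "2020-01-01T22:00", "minutes": 45},
]

def clean_records(records):
    """Return (clean, rejected) after basic validation and de-duplication."""
    seen = set()
    clean, rejected = [], []
    for rec in records:
        # Normalise text fields so "news " and "NEWS" compare equal.
        device = (rec.get("device_id") or "").strip()
        channel = (rec.get("channel") or "").strip().upper()
        minutes = rec.get("minutes")

        # Validate: required fields present, duration within one day (assumed rule).
        valid = (
            bool(device)
            and bool(channel)
            and isinstance(minutes, (int, float))
            and 0 < minutes <= 1440
        )
        if not valid:
            rejected.append(rec)
            continue

        # De-duplicate on the normalised device, channel and start time.
        key = (device, channel, rec.get("start"))
        if key in seen:
            rejected.append(rec)
            continue
        seen.add(key)

        clean.append({
            "device_id": device,
            "channel": channel,
            "start": rec.get("start"),
            "minutes": minutes,
        })
    return clean, rejected

clean, rejected = clean_records(raw)
print(len(clean), len(rejected))  # 2 3
```

Keeping the rejected records, rather than silently discarding them, makes it possible to check whether the rules are throwing away data that actually reflects the business.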
Data mining is also a more prosaic part of lots of other projects that are not inherently about finding patterns in data. Any supervised learning project - churn prediction, ratings forecasting, or any problem with more defined parameters than an unsupervised project – will start with data mining, so the data science consulting team can understand the parameters of the problem before they think about how best to solve it.
Let’s go back to new data sets – imagine a customer gets hold of a new set of clickstream data that it can use to understand how viewers move between different parts of a Smart TV user interface. There are dozens of applications for this set of data – the marketing team can use it to understand how viewers move between different parts of the UI, or the relative effectiveness of different ad slots. But before the marketing team can get to this point, the data engineering team needs to carry out data mining to understand what is actually in the raw data, and to turn it into something useful. This process involves understanding where the data has come from, how it is constructed and how to find meaning in it.
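One simple first pass over clickstream data like this is to count how often viewers move from one screen to another. The sketch below assumes each event is a `(session_id, timestamp, screen)` tuple and that events may arrive out of order; the event layout and the screen names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical clickstream events: (session_id, timestamp, screen).
events = [
    ("s1", 1, "home"), ("s1", 2, "guide"), ("s1", 3, "player"),
    ("s2", 1, "home"), ("s2", 2, "search"), ("s2", 3, "player"),
    ("s3", 2, "guide"), ("s3", 1, "home"),  # arrives out of order
]

def count_transitions(events):
    """Count screen-to-screen moves within each session."""
    by_session = defaultdict(list)
    for session, timestamp, screen in events:
        by_session[session].append((timestamp, screen))

    transitions = Counter()
    for visits in by_session.values():
        visits.sort()  # restore time order within the session
        screens = [screen for _, screen in visits]
        # Pair each screen with the one visited immediately after it.
        for source, target in zip(screens, screens[1:]):
            transitions[(source, target)] += 1
    return transitions

transitions = count_transitions(events)
print(transitions[("home", "guide")])  # 2
```

Even a table this small starts to answer the marketing team's question, because it shows which routes through the UI are actually used before anyone builds a model on top of them.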
All of which goes to demonstrate that data mining is not merely an exciting cherry on top of the data science cake, but something much more foundational to the discipline as a whole.