As TV networks start to embrace audience data from new sources, be that OTT player data, Smart TV data, or set-top-box data, one of the most common applications of the data is audience indexing. Frequently - and particularly for tune-in - networks are looking for viewers who resonate strongly with particular types of programming.
Finding those viewers might sound like a simple question, but in practice there are many different ways to define how strong a viewer's affinity is with particular types of programming. The method you choose will depend on what you are measuring and how you are going to use the audience. You can have groups that are broad, narrow, homogeneous, or highly differentiated, and with high or low coverage of the content.
Few people want low coverage, and most applications would ideally have a set that is both highly differentiated and has excellent coverage of both audience and content. In other words, we want to have our cake and eat it. But the algorithms we use for indexing have inherent trade-offs. As you can see in the chart below, we typically have a compromise between coverage of audience and coverage of content:
Our data science consulting team has grouped the different algorithms we can use for audience indexing into five different categories and has outlined which ones are most effective, as well as how we calculate them. These algorithms don't just apply to tune-in, either. We can use similar methodologies to understand whether an audience indexes heavily towards renting Ford vehicles, listening to Pink Floyd, or more or less any activity where we have supporting data.
This post covers each of these methods, with guidance on the advantages and disadvantages of each. To establish which is most appropriate, our data science consulting team typically creates groups using each approach and compares the outcomes. We examine how sharply the groups are differentiated from each other, their size, and how much of the overall universe of viewers they cover.
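The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not say which overlap measure is used, so Jaccard similarity stands in for "differentiation" here, and the function names and sample viewer ids are hypothetical.

```python
from itertools import combinations


def group_size(groups):
    """Number of viewers in each group."""
    return {name: len(members) for name, members in groups.items()}


def coverage(groups, universe):
    """Share of the viewer universe that appears in at least one group."""
    covered = set().union(*groups.values()) & universe
    return len(covered) / len(universe)


def pairwise_overlap(groups):
    """Jaccard overlap per pair of groups (0 = disjoint, 1 = identical).

    Assumption: Jaccard is used as a stand-in for differentiation;
    lower overlap means more sharply differentiated groups.
    """
    result = {}
    for (a, set_a), (b, set_b) in combinations(groups.items(), 2):
        union = set_a | set_b
        result[(a, b)] = len(set_a & set_b) / len(union) if union else 0.0
    return result


# Hypothetical viewer ids for two shows.
universe = {1, 2, 3, 4, 5, 6, 7, 8}
groups = {"news": {1, 2, 3, 4}, "olympics": {3, 4, 5}}

print(group_size(groups))
print(coverage(groups, universe))
print(pairwise_overlap(groups))
```

Running the same three measurements on the groups produced by each method gives a like-for-like basis for choosing between them.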
Method 1) Total reach: who watched for longer than X minutes?
Taking every viewer who watches the content for more than a certain threshold of time will produce a large group, but it is a very blunt metric that tends to over-emphasize programming that is heavily repeated, unusually long, or made up of a lot of episodes.
The code to calculate this in SQL is:
-- get all viewers who have watched the show with id show_id for more than 10 minutes
SELECT DISTINCT viewer_id
FROM viewing
WHERE fk_show_id = show_id
  AND viewing_duration > 10 * 60
As a variant, it can be adjusted to consider what percentage of a particular show was watched rather than a fixed time. This under-measures programming that is heavily repeated but does account for more extended programming.
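A minimal Python sketch of this percentage-based variant follows. It assumes viewing rows shaped like the `viewing` table (viewer, show, duration in seconds) and a hypothetical `show_length` lookup giving each show's runtime in seconds. It also sums a viewer's sessions before comparing against the threshold, which is one possible reading of "percentage watched" rather than the only one.

```python
from collections import defaultdict


def reach_by_share(viewing, show_length, show_id, min_share=0.5):
    """Viewers whose total viewing of show_id is at least min_share of its length.

    viewing: iterable of (viewer_id, show_id, viewing_duration_seconds) rows.
    show_length: dict of show_id -> runtime in seconds (hypothetical lookup).
    """
    totals = defaultdict(int)
    for viewer_id, row_show_id, duration in viewing:
        if row_show_id == show_id:
            totals[viewer_id] += duration
    threshold = min_share * show_length[show_id]
    return {viewer for viewer, total in totals.items() if total >= threshold}


# Hypothetical sample data: show "a" runs for one hour.
viewing = [(1, "a", 1200), (1, "a", 900), (2, "a", 600), (3, "b", 3000)]
show_length = {"a": 3600, "b": 3600}

print(reach_by_share(viewing, show_length, "a", min_share=0.5))
print(reach_by_share(viewing, show_length, "a", min_share=0.1))
```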
Method 2) Top viewers: who watched more than average?
To get more consistency between different types of content - for example, to compare regular programming like news with one-offs like the Olympic Games - looking at who watches more than average is a better metric.
The downside of this metric is that it tends to over-emphasize viewers who merely watch more TV. Heavier viewers are more likely than lighter viewers to consume more of all programming. As such, if you take the indexing viewers for the top 200 shows, you'll get significant overlap between the groups for each program and lower overall coverage of your viewer base.
The code to calculate this in SQL is:
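-- get all viewers who have watched the show with id show_id for more than average
WITH show_averages AS (
    -- average total viewing per viewer, for each show
    SELECT fk_show_id,
           SUM(viewing_duration) * 1.0 / COUNT(DISTINCT viewer_id) AS avg_viewing
    FROM viewing
    GROUP BY 1
)
SELECT viewing.viewer_id
FROM viewing
INNER JOIN show_averages
    ON viewing.fk_show_id = show_averages.fk_show_id
WHERE viewing.fk_show_id = show_id
GROUP BY viewing.viewer_id, show_averages.avg_viewing
-- keep only viewers whose total viewing of the show exceeds the average
HAVING SUM(viewing.viewing_duration) > show_averages.avg_viewing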
As a variant, you can modify this approach to reduce the group sizes by incorporating the standard deviation into the calculation.
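One way to read this variant is to raise the cutoff from the mean to the mean plus some multiple of the standard deviation. The Python sketch below illustrates that idea. The multiplier `k`, the use of the population standard deviation, and the sample rows are all assumptions, since the post does not specify how the standard deviation is incorporated.

```python
from collections import defaultdict
from statistics import mean, pstdev


def top_viewers(viewing, show_id, k=1.0):
    """Viewers whose total viewing of show_id exceeds mean + k * std dev.

    viewing: iterable of (viewer_id, show_id, viewing_duration_seconds) rows.
    k: hypothetical multiplier; k=0 reproduces the plain "more than average" rule.
    """
    totals = defaultdict(int)
    for viewer_id, row_show_id, duration in viewing:
        if row_show_id == show_id:
            totals[viewer_id] += duration
    if not totals:
        return set()
    values = list(totals.values())
    cutoff = mean(values) + k * pstdev(values)
    return {viewer for viewer, total in totals.items() if total > cutoff}


# Hypothetical sample data: five viewers of show "a".
viewing = [(1, "a", 100), (2, "a", 300), (3, "a", 500), (4, "a", 700), (5, "a", 900)]

print(top_viewers(viewing, "a", k=0))
print(top_viewers(viewing, "a", k=1))
```

Raising `k` shrinks the group, which trades coverage for sharper differentiation.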
Method 3) Heaviest viewers: who is in the top percentile?
To reduce the number of people in each group and further differentiate the audiences, we can look only at people who are in the highest percentile of viewers for that show.
This metric reduces the number of viewers tied to each show, which improves the analytics application of the data. However, it still skews towards much heavier viewers, which reduces the overall number of viewers included across all groups.
In SQL this would be:
-- get all viewers who are in the top 10% of viewers for the show with id show_id
-- (TOP ... PERCENT is SQL Server syntax)
SELECT TOP 10 PERCENT viewer_id
FROM viewing
WHERE fk_show_id = show_id
GROUP BY viewer_id
ORDER BY SUM(viewing_duration) DESC
Method 4) Biggest fans: who has the show in their top list?
Our final metric looks at viewers who have the show in their top viewing list. This is the only metric that compensates for viewers who watch little programming as they will still appear in groups for their most viewed shows.
This metric reduces the number of viewers tied to each show while still retaining substantial coverage of the entire universe of viewers. It is more likely to skew towards higher audience shows, and smaller shows will only have tiny groups.
The SQL is:
-- get all viewers who have the show with id show_id in their top 10% of viewing
WITH shows AS (
    -- rank each viewer's shows by total viewing and split them into ten buckets
    SELECT viewer_id,
           fk_show_id,
           NTILE(10) OVER (PARTITION BY viewer_id
                           ORDER BY SUM(viewing_duration) DESC) AS decile
    FROM viewing
    GROUP BY viewer_id, fk_show_id
)
SELECT viewer_id
FROM shows
WHERE decile = 1
  AND fk_show_id = show_id
The use of the NTILE() window function makes this more computationally expensive to calculate and less portable between database types. But as the only metric that provides more comprehensive audience coverage, it is more useful for applications where we want to include the broadest possible audience in our groups.
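Where NTILE() is not available, the same selection can be approximated outside the database. The Python sketch below is one hedged way to do it: it assumes rows shaped like the `viewing` table, takes the first bucket to hold the top `ceil(n / tiles)` shows for a viewer with `n` shows, and breaks ties by show id purely for determinism (the database makes no such promise).

```python
import math
from collections import defaultdict


def biggest_fans(viewing, show_id, tiles=10):
    """Viewers who have show_id in the first of `tiles` buckets of their own viewing.

    viewing: iterable of (viewer_id, show_id, viewing_duration_seconds) rows.
    """
    per_viewer = defaultdict(lambda: defaultdict(int))
    for viewer_id, row_show_id, duration in viewing:
        per_viewer[viewer_id][row_show_id] += duration

    fans = set()
    for viewer_id, totals in per_viewer.items():
        # Rank this viewer's shows by total viewing, heaviest first.
        ranked = sorted(totals, key=lambda s: (-totals[s], str(s)))
        top_n = math.ceil(len(ranked) / tiles)
        if show_id in ranked[:top_n]:
            fans.add(viewer_id)
    return fans


# Hypothetical sample data: viewer 3 watches very little but still has a top show.
viewing = [(1, "a", 500), (1, "b", 100), (2, "b", 50), (2, "a", 10), (3, "b", 5)]

print(biggest_fans(viewing, "a"))
print(biggest_fans(viewing, "b"))
```

Note that viewer 3, the lightest viewer in the sample, still lands in a group, which is the coverage property this method is chosen for.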
Data science ensemble methods with TV data
Ultimately, most applications need groups that include a high percentage of the audience, cover most of the content, and remain stably differentiated. As none of the methods provides this on its own, we typically end up building an ensemble model where we include viewers from each of the different approaches above.
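The post does not describe how the ensemble is built, so the following Python sketch shows only the simplest possible combination: a union of the per-method groups that records which methods selected each viewer. The `min_votes` parameter, the method names, and the sample groups are hypothetical additions for illustration.

```python
def ensemble(method_groups, min_votes=1):
    """Combine per-method viewer groups for one show.

    method_groups: dict of method name -> set of viewer ids.
    min_votes: hypothetical knob; 1 gives a plain union, higher values
               keep only viewers selected by several methods.
    Returns a dict of viewer id -> set of methods that selected the viewer.
    """
    votes = {}
    for method, viewers in method_groups.items():
        for viewer in viewers:
            votes.setdefault(viewer, set()).add(method)
    return {v: methods for v, methods in votes.items() if len(methods) >= min_votes}


# Hypothetical output of the four methods for a single show.
method_groups = {
    "total_reach": {1, 2, 3, 4},
    "top_viewers": {1, 2},
    "heaviest_viewers": {1},
    "biggest_fans": {1, 5},
}

print(sorted(ensemble(method_groups)))
print(sorted(ensemble(method_groups, min_votes=2)))
```

Keeping the list of contributing methods per viewer makes it easier to test and categorize the resulting groups afterwards.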
The key is to spend enough time testing and categorizing the groups you produce to ensure they are fit for the purpose you need. The next time someone asks for the "audience that indexes for Game of Thrones" you might want to ask them for a little more detail on their question.