The data science behind audience indexing with TV data

Tom Weiss, Thu 11 January 2018

As TV networks start to embrace audience data from new sources, be that OTT player data, Smart TV data, or set-top-box data, one of the most common applications is audience indexing. Frequently - and particularly for tune-in - networks are looking for viewers who resonate strongly with particular types of programming.

It might sound like a simple question, but in practice there are many ways to define how strongly viewers have an affinity with particular types of programming. The method you choose will depend on what you are measuring and how you are going to use the audience. You can have groups that are broad, narrow, homogeneous, or highly differentiated, with high or low coverage of the content.

Few people want low coverage, and most applications would ideally have a set of groups that is both highly differentiated and has excellent coverage of both audience and content. In other words, we want to have our cake and eat it too. But the algorithms we use for indexing have inherent trade-offs. As you can see in the chart below, we typically have a compromise between coverage of audience and coverage of content:

Chart: Methods to build target audiences from TV data

Our data science consulting team has grouped the algorithms we can use for audience indexing into five categories and outlined which ones are most effective, as well as how we calculate them. These algorithms don't just apply to tune-in, either: we can use similar methodologies to understand whether an audience indexes heavily towards renting Ford vehicles, listening to Pink Floyd, or more or less any activity where we have supporting data.

This post covers each of these methods with guidance on the advantages and disadvantages of each. To establish which is most appropriate, our data science consulting team typically creates groups using each approach and compares the outcomes: how sharply the groups are differentiated from each other, how large they are, and how much of the overall universe of viewers they cover.

Method 1) Total reach: who watched for longer than X minutes?

Taking any viewer who watches the content for more than a certain threshold of time produces a large group, but it is a very blunt metric that tends to over-emphasize programming that is heavily repeated, unusually long, or has a lot of episodes.

The code to calculate this in SQL is:

-- get all viewers who have watched the show with id show_id for more than 10 minutes
SELECT DISTINCT viewer_id
FROM viewing
WHERE fk_show_id = show_id
AND viewing_duration > 10 * 60

As a variant, the threshold can be a percentage of the show's running time rather than a fixed number of minutes. This under-measures programming that is heavily repeated but does account for longer programming.
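A minimal sketch of the percentage-of-runtime variant, runnable against SQLite. Note that the post's schema only shows a viewing table; the shows table with its runtime_seconds column is an assumption made for illustration:

```python
import sqlite3

# Toy schema: the "shows" table and its runtime_seconds column are an
# illustrative assumption; only "viewing" appears in the post's queries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shows (show_id INTEGER PRIMARY KEY, runtime_seconds INTEGER);
CREATE TABLE viewing (viewer_id INTEGER, fk_show_id INTEGER, viewing_duration INTEGER);
INSERT INTO shows VALUES (1, 3600);
INSERT INTO viewing VALUES (101, 1, 2000), (102, 1, 600), (103, 1, 1900);
""")

# viewers who watched more than 50% of the show's runtime
rows = conn.execute("""
SELECT DISTINCT v.viewer_id
FROM viewing v
JOIN shows s ON v.fk_show_id = s.show_id
WHERE v.fk_show_id = ?
  AND v.viewing_duration > 0.5 * s.runtime_seconds
""", (1,)).fetchall()

print(sorted(r[0] for r in rows))  # viewers 101 and 103 cleared the 50% bar
```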

Method 2) Top viewers: who watched more than average?

To get more consistency between different types of content - for example, to compare regular programming like news with one-offs like the Olympic Games - looking at who watches more than average is a better metric.

The downside of this metric is that it tends to over-emphasize viewers who simply watch more TV. Heavier viewers are more likely than lighter viewers to exceed the average for any given show. As such, if you take the indexing viewers for the top 200 shows, you'll get significant overlap between each program's group and lower overall coverage of your viewer base.

The code to calculate this in SQL is:

-- get all viewers who have watched the show with id show_id for more than average
WITH show_averages AS (
  SELECT
    fk_show_id,
    SUM(viewing_duration) / COUNT(DISTINCT viewer_id) AS avg_viewing
  FROM viewing
  GROUP BY 1),
viewer_totals AS (
  -- total viewing per viewer per show, so repeat viewings are summed
  SELECT viewer_id, fk_show_id, SUM(viewing_duration) AS total_viewing
  FROM viewing
  GROUP BY 1, 2)
SELECT viewer_id
FROM viewer_totals
INNER JOIN show_averages
  ON viewer_totals.fk_show_id = show_averages.fk_show_id
WHERE viewer_totals.fk_show_id = show_id
  AND viewer_totals.total_viewing > show_averages.avg_viewing

As a variant, you can modify this approach to reduce the group sizes by incorporating the standard deviation into the calculation - for example, only keeping viewers more than one standard deviation above the mean.
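A sketch of that standard-deviation variant, on toy data made up for illustration. Standard SQLite has no STDDEV aggregate, so the cutoff is computed in Python here; in a warehouse that supports it you would use STDDEV_POP() directly in SQL:

```python
import sqlite3
from statistics import mean, pstdev

# Toy viewing data: three ordinary viewers and one very heavy viewer of show 1
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE viewing (viewer_id INTEGER, fk_show_id INTEGER, viewing_duration INTEGER)")
conn.executemany(
    "INSERT INTO viewing VALUES (?, ?, ?)",
    [(101, 1, 600), (102, 1, 700), (103, 1, 650), (104, 1, 3000)],
)

# total viewing of show 1 per viewer
totals = dict(conn.execute("""
SELECT viewer_id, SUM(viewing_duration)
FROM viewing
WHERE fk_show_id = 1
GROUP BY viewer_id
"""))

# only keep viewers more than one standard deviation above the mean
cutoff = mean(totals.values()) + pstdev(totals.values())
heavy = sorted(v for v, total in totals.items() if total > cutoff)
print(heavy)  # only the outlier viewer survives the cutoff
```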

Method 3) Heaviest viewers: who is in the top percentile?

To reduce the number of people in each group and further differentiate the audiences, we can look only at the people who are in the highest percentiles of viewers for that show.

This metric reduces the number of viewers tied to each show, which improves the analytical applications of the data, but it still skews towards much heavier viewers, reducing the overall number of viewers included across all groups.

In SQL this would be:

-- get all viewers who are in the top 10% of viewers for the show with id show_id
SELECT TOP 10 PERCENT viewer_id
FROM viewing
WHERE fk_show_id = show_id
GROUP BY viewer_id
ORDER BY SUM(viewing_duration) DESC
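Note that SELECT TOP 10 PERCENT is SQL Server syntax. On engines without it, the same cut can be made with the NTILE() window function; a minimal sketch against SQLite (toy data invented for illustration, and SQLite 3.25+ is assumed for window-function support):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE viewing (viewer_id INTEGER, fk_show_id INTEGER, viewing_duration INTEGER)")
# ten viewers of show 1 with strictly decreasing viewing durations
conn.executemany(
    "INSERT INTO viewing VALUES (?, ?, ?)",
    [(100 + i, 1, 1000 - i * 50) for i in range(10)],
)

# portable equivalent of TOP 10 PERCENT: rank viewers into ten buckets by
# total viewing of the show and keep only the first bucket
rows = conn.execute("""
WITH ranked AS (
  SELECT
    viewer_id,
    NTILE(10) OVER (ORDER BY SUM(viewing_duration) DESC) AS bucket
  FROM viewing
  WHERE fk_show_id = 1
  GROUP BY viewer_id)
SELECT viewer_id FROM ranked WHERE bucket = 1
""").fetchall()
print([r[0] for r in rows])  # the single heaviest viewer of the ten
```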

Method 4) Biggest fans: who has the show in their top list?

Our final metric looks at viewers who have the show in their top viewing list. This is the only metric that compensates for viewers who watch little programming, as they will still appear in the groups for their most-viewed shows.

This metric reduces the number of viewers tied to each show while still retaining substantial coverage of the entire universe of viewers. It is, however, more likely to skew towards higher-audience shows, and smaller shows will only have tiny groups.

The SQL is:

-- get all viewers who have the show with id show_id in their top 10% of viewing
WITH shows AS (
  SELECT
    viewer_id,
    fk_show_id,
    NTILE(10) OVER (PARTITION BY viewer_id ORDER BY SUM(viewing_duration) DESC) AS decile
  FROM viewing
  GROUP BY viewer_id, fk_show_id)
SELECT viewer_id
FROM shows
WHERE decile = 1
  AND fk_show_id = show_id

The use of the NTILE() window function makes this more computationally expensive to calculate and less portable between database types, but as the only metric that provides more comprehensive audience coverage, it is the most useful for applications where we want to include the broadest possible audience in our groups.
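As a quick sanity check of the biggest-fans logic, the same query pattern can be run against SQLite (3.25+ assumed for window functions). The toy data below contrasts a light viewer whose favorite show is show 1 with a heavy viewer who barely watches it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE viewing (viewer_id INTEGER, fk_show_id INTEGER, viewing_duration INTEGER)")
# viewer 201 watches show 1 far more than their other nine shows;
# viewer 202 watches show 1 the least of their ten shows
conn.executemany(
    "INSERT INTO viewing VALUES (?, ?, ?)",
    [(201, s, 100) for s in range(2, 11)] + [(201, 1, 5000)]
    + [(202, s, 1000) for s in range(2, 11)] + [(202, 1, 10)],
)

# per-viewer deciles of viewing: keep viewers with show 1 in their top decile
rows = conn.execute("""
WITH ranked AS (
  SELECT
    viewer_id,
    fk_show_id,
    NTILE(10) OVER (PARTITION BY viewer_id ORDER BY SUM(viewing_duration) DESC) AS decile
  FROM viewing
  GROUP BY viewer_id, fk_show_id)
SELECT viewer_id FROM ranked WHERE decile = 1 AND fk_show_id = 1
""").fetchall()
print([r[0] for r in rows])  # only the fan, not the heavy viewer
```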

Data science ensemble methods with TV data

Ultimately, most applications require a high percentage of the audience to be included, coverage of most of the content, and stable differentiation between groups. As none of the methods provides this on its own, we typically end up building an ensemble model that includes viewers from each of the approaches above.

The key is to spend enough time testing and categorizing the groups you produce to ensure they are fit for purpose. The next time someone asks for the "audience that indexes for Game of Thrones", you might want to ask them for a little more detail.
