One of our goals with Mux Data is to provide meaningful data about video performance at a quick glance, while also providing the ability to drill deep when needed.
We created the Viewer Experience Score when we launched Mux Data in 2016. The Viewer Experience Score is a single number that describes how satisfied users are with video streaming performance. This overall score is composed of four underlying scores, focusing on playback failures, rebuffering, startup time, and video quality. Each of these describes how satisfied a user is with one of the four elements of video performance.
Think of the inverted pyramid used in journalism. The headline of an article summarizes the entire story in just a few words. The first paragraph summarizes what's most important. By the middle of the article, you're in the finer details, and supporting/background material waits until the end.
A Viewer Experience Score is like this. The headline is a single number that describes how happy your users are with system performance (e.g. 75). Dig deeper, and you find out why (e.g. 3.5% of users are seeing HTML5 playback failures, and these failures are concentrated on MP4 playback). At the bottom is every single view that was measured (a user in Toronto tried to watch video 19284, and saw an error after 30 seconds).
This system helps our customers focus on the actual experience of watching video (Quality of Experience), and not just the underlying service-level performance (Quality of Service). It's one thing to learn that you have a rebuffer frequency of 0.34 events per minute. That's useful data, but it doesn't really tell you how happy your users are. It's better to have a layer on top that says how happy users are with the amount of rebuffering they're seeing.
Our original methodology was inspired by the Apdex score, created by New Relic (and an open alliance) to measure application performance. Apdex divides performance into three buckets: satisfactory, tolerating, and frustrated. Satisfactory requests are given a score of 1.0; tolerating, 0.5; and frustrated, 0.0. So if an application has 60% satisfactory performance (60% * 1.0), 30% tolerating performance (30% * 0.5), and 10% frustrated performance (10% * 0.0), the total score would be 0.75.
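The Apdex arithmetic described here can be checked with a few lines of Python. This is only a minimal sketch of the weighting in the paragraph above; the function name `apdex_score` and the use of raw counts per bucket are illustrative choices, not part of any official Apdex tooling.

```python
def apdex_score(satisfactory: int, tolerating: int, frustrated: int) -> float:
    """Return an Apdex-style score in [0.0, 1.0] from per-bucket sample counts."""
    total = satisfactory + tolerating + frustrated
    if total == 0:
        raise ValueError("at least one sample is required")
    # Satisfactory samples count fully, tolerating samples count half,
    # and frustrated samples contribute nothing.
    weighted = satisfactory * 1.0 + tolerating * 0.5 + frustrated * 0.0
    return weighted / total


# The 60% / 30% / 10% split from the text, expressed as counts out of 100.
print(apdex_score(60, 30, 10))  # 0.75
```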
We originally used a similar methodology for our Viewer Experience Score. Based on research, we learned that 2 seconds of startup time was generally considered satisfactory for web video, and 8 seconds was generally frustrating. So a view with 1.75 seconds of startup time got a 100, a view with 5 seconds got a 50, and a view with 10 seconds got a 0.
This methodology generally worked well, but customer feedback and user research showed that it has some problems, and we think we can do better.
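For reference, the original bucketed approach to startup time can be sketched as below. The 2 and 8 second thresholds come from the text; the function name and the exact handling of the boundaries are assumptions (here, exactly 2 or exactly 8 seconds falls in the middle bucket, which matches the later remark that the old methodology gave both values 50 points).

```python
SATISFACTORY_SECONDS = 2.0  # from the text: about 2 seconds is satisfactory
FRUSTRATING_SECONDS = 8.0   # from the text: about 8 seconds is frustrating


def bucketed_startup_score(startup_seconds: float) -> int:
    """Score one view with Apdex-style buckets: 100, 50, or 0."""
    if startup_seconds < SATISFACTORY_SECONDS:
        return 100  # satisfactory
    if startup_seconds <= FRUSTRATING_SECONDS:
        return 50   # tolerating (boundary handling is an assumption)
    return 0        # frustrated


for seconds in (1.75, 5.0, 10.0):
    print(seconds, bucketed_startup_score(seconds))  # 100, then 50, then 0
```

The weakness is visible in the code: every view inside a bucket gets the same score, however close it sits to either threshold.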
We're excited to announce that we have released a new version of our Viewer Experience Score.
The new methodology keeps many of the features of our earlier methodology. For example, it still focuses on viewer satisfaction and frustration, and still relies on the four dimensions of performance that make up overall experience.
We've made three significant changes, however.
Change from binary scores to a function. 2 seconds of startup time isn't the same as 8 seconds, but our old methodology would have assigned exactly 50 points to each. In the new methodology, 2 seconds of startup time will be an 80, while 8 seconds will receive a lower score.
Capture tradeoffs. We did pairwise research into how users experience tradeoffs in video performance - for example, would you rather have a little rebuffering and a lot of startup time, or a little startup time and a lot of rebuffering - and used that to plot abstract utility curves for each dimension. Look for another blog post on that later.
Extend the scale. Rather than assigning 0 points to the point at which a user first becomes "frustrated," we're assigning this point 50 points. This allows us to continue to score performance beyond the first point at which a viewer is frustrated. Some scores will still reach 0 at a certain point (e.g. after enough rebuffering, the score is 0), while others will get close to zero but never reach it (e.g. 30 seconds of startup time still gets a score above 0).
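Abstractly, the new formula looks like this.
The point at which a viewer is satisfied is scored 100. (This is unchanged.)
The point at which a viewer first becomes frustrated is scored 50. (This is changed from 0.)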
Playback Success Score is fairly simple. A failure that ends playback is a 0, while a video that plays through without failure is 100. This is unchanged.
What's new is that we now give a score of 50 if a viewer exits before a video starts (EBVS).
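The three outcomes above map naturally onto a lookup table. The following is a small sketch of that mapping; the `PlaybackOutcome` enum and the function name are hypothetical names introduced for illustration, while the scores themselves (100, 50, 0) are taken from the text.

```python
from enum import Enum


class PlaybackOutcome(Enum):
    PLAYED = "played"           # video played through without failure
    EXIT_BEFORE_START = "ebvs"  # viewer exited before the video started
    FAILED = "failed"           # a failure ended playback


# Scores as described in the text.
PLAYBACK_SUCCESS_SCORES = {
    PlaybackOutcome.PLAYED: 100,
    PlaybackOutcome.EXIT_BEFORE_START: 50,
    PlaybackOutcome.FAILED: 0,
}


def playback_success_score(outcome: PlaybackOutcome) -> int:
    """Return the Playback Success Score for a single view."""
    return PLAYBACK_SUCCESS_SCORES[outcome]


print(playback_success_score(PlaybackOutcome.EXIT_BEFORE_START))  # 50
```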
Startup Time Score describes how happy or unhappy viewers are with startup time. Longer startup times mean lower scores, while shorter startup times mean higher scores. Once startup time reaches a certain point (around 8 seconds), the score decays more slowly, since each additional second matters less when startup is already long.
Startup Time score decreases quickly after 500 milliseconds.
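To make the shape of such a curve concrete, here is a purely illustrative sketch. It is not the actual Mux formula, which this post does not spell out. Only the 2 second anchor (a score of 80) comes from the text; the other anchor points, the linear interpolation, and the exponential tail with its decay constant are all assumptions chosen to mimic the described behavior: a quick drop after 500 milliseconds, slower decay past roughly 8 seconds, and a score that approaches zero without reaching it.

```python
import math

# Illustrative anchor points as (seconds, score). Only (2.0, 80.0) is taken
# from the text; (0.5, 100.0) and (8.0, 50.0) are assumptions.
ANCHORS = [(0.5, 100.0), (2.0, 80.0), (8.0, 50.0)]
TAIL_SECONDS = 20.0  # assumed decay constant for the slow tail past 8 seconds


def startup_time_score(startup_seconds: float) -> float:
    """Hypothetical continuous startup score; higher means happier viewers."""
    if startup_seconds < 0:
        raise ValueError("startup time cannot be negative")
    first_time, first_score = ANCHORS[0]
    if startup_seconds <= first_time:
        return first_score
    # Linear interpolation between neighbouring anchor points.
    for (t0, s0), (t1, s1) in zip(ANCHORS, ANCHORS[1:]):
        if startup_seconds <= t1:
            fraction = (startup_seconds - t0) / (t1 - t0)
            return s0 + fraction * (s1 - s0)
    # Past the last anchor: an exponential tail that decays more slowly
    # than the preceding segment and never reaches zero.
    last_time, last_score = ANCHORS[-1]
    return last_score * math.exp(-(startup_seconds - last_time) / TAIL_SECONDS)


for seconds in (0.3, 2.0, 5.0, 8.0, 30.0):
    print(seconds, round(startup_time_score(seconds), 1))
```

Unlike a bucketed score, every extra second changes the result, and very long startup times still receive a small positive score.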
Smoothness Score measures the amount of rebuffering a viewer sees when watching video. A higher Smoothness Score means the user sees less (or no) rebuffering, while a lower score means a user sees more rebuffering.
Video Quality Score measures the visual quality a user sees by comparing the resolution of a video stream to the resolution of the player in which it is played. If a video stream is significantly upscaled, quality generally suffers, and viewers have an unacceptable experience.
Note that video quality is notoriously difficult to quantify, especially in a reference-free way (without comparing a video to a pristine master). Bitrate doesn't work, since the same bitrate may look excellent on one video and terrible on another.
Several factors contribute to actual video quality: bitrate, codec, content type, and the quality of the original source. However, if content is encoded well and at the right bitrates, upscaling tracks reasonably well to video quality.
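As a rough illustration of the upscaling comparison, the sketch below computes how far a stream is stretched to fill a player. It assumes the video is scaled to fit inside the player while preserving its aspect ratio; the function name, its parameters, and that fitting rule are assumptions made for this example, and the mapping from upscaling to an actual Video Quality Score is not shown because the text does not specify it.

```python
def upscale_percentage(
    video_width: int, video_height: int, player_width: int, player_height: int
) -> float:
    """Return how much the stream is stretched, as a percentage (0 if none)."""
    if min(video_width, video_height, player_width, player_height) <= 0:
        raise ValueError("all dimensions must be positive")
    # Assumption: the video is fitted inside the player with its aspect ratio
    # preserved, so the applied scale factor is the smaller of the two ratios.
    scale = min(player_width / video_width, player_height / video_height)
    # Downscaled or native-size video counts as 0% upscaling.
    return max(0.0, (scale - 1.0) * 100.0)


# A 640x360 stream shown in a 1280x720 player is stretched to twice its size.
print(upscale_percentage(640, 360, 1280, 720))    # 100.0
# A 1920x1080 stream in the same player is downscaled, so no upscaling.
print(upscale_percentage(1920, 1080, 1280, 720))  # 0.0
```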
Overall Score is the combination of four underlying component scores, which track the four major categories of video performance - playback success, startup time, smoothness/rebuffering, and video quality.
The Tradeoff functions are based on research into the relative tradeoffs between increasing one metric at the expense of another. For example, you can increase Quality at the expense of Startup Time and vice versa. However, doing so would be a bad idea because Startup Time is more valuable than Quality. Generally, we found that Playback Success is the most important, followed by Smoothness then Startup Time, and finally Quality.
In a future post, data scientist Ben Dodson will talk more about the methodology we use to update our scores, including a more detailed description of the tradeoff functions. Look for that soon.
In the meantime, get in touch with any questions.