About a year ago, Christian wrote a great blog post about building an interactive video transcript with Mux Player CuePoints. Here’s what he built in that post:
Interactive video transcripts are great. They’re great for SEO because they make every word of your video eligible for search results. They’re great for accessibility because they make reading the content of a video easier. And they’re great for impatient people like me who don’t want to watch a video but are willing to scan the good parts of a blog post.
So, of course, following Christian’s blog post, I ended up building an interactive video transcript for our own blog!
In the year since Christian’s blog post, a lot has changed, fueled by a tidal wave of AI products and services. There’s good reason to be AI-skeptical, but hear me out. 10 years ago, I spent literal months at an internship trying to train a machine learning model. Now? It’s as easy as calling an API. And, as it turns out, some of those APIs are really useful for dealing with video and audio.
So let’s see what these AI APIs can do for us! Let’s enhance a video transcript with AI in three ways:

1. Generate the transcript automatically with Mux’s AI-powered auto-generated captions
2. Identify who’s speaking with AssemblyAI
3. Format the text with ChatGPT
But before we get to the AI goodness, let’s recap what Christian taught us about building an interactive video transcript:
Building an interactive video transcript with Mux Player CuePoints (the fast version)
What are CuePoints?
Great question. CuePoints are a way of attaching data to certain times throughout your video. Or, put another way, attaching cues to points throughout your video. (Hah, get it? Cue… points? I just figured that out while writing this.)
A cue might look like this:
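Something like this — the exact shape is up to you. Throughout this post, we’ll stick with a startTime, an endTime, and a value:

```js
// One cue: a bit of data pinned to a time range in the video.
// The value can be anything — here, it's a line of the transcript.
const cue = {
  startTime: 2.5, // seconds into the video
  endTime: 5.1, // when this cue stops being the active one
  value: "Hey there, welcome back to the Mux blog.",
};
```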
Two things make CuePoints magic. First, you can read playerElement.activeCuePoint to get the cue that corresponds to the current video time. Second, you can listen for cuepointchange events and react to them. With these two tools, you can sync your UI to your video. This lets you do things like pop up actor information (like Amazon X-Ray) for a certain scene, or pop up a question for an educational video, or… sync a transcript to your video!
How do I build an interactive video transcript with CuePoints?
We want an interactive transcript to have two pieces of functionality. First, when we click on a part of the transcript, the video jumps to that point. Second, when the video plays, the transcript highlights the active text.
To implement the first part – clicking the transcript to seek the video – we don’t need CuePoints. We can use the good ol’ currentTime property. When the transcript is clicked, we’ll set the video’s currentTime to make the video jump to that point.
To implement the second part – highlighting the transcript as the video plays – we’ll use the oncuepointchange event. When the video changes CuePoints, we’ll let the transcript know so it can highlight the appropriate part.
Together, that looks something like this:
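Here’s a rough sketch (not Christian’s exact code) using a plain `<mux-player>` element. addCuePoints, activeCuePoint, and the cuepointchange event come straight from Mux Player; the transcript markup and highlightCueAt are stand-ins for your own UI code:

```js
const player = document.querySelector("mux-player");
const transcript = document.querySelector("#transcript");

// Mux Player's own CuePoints are { time, value } pairs. The docs suggest
// adding them once the player's metadata has loaded.
player.addEventListener("loadedmetadata", () => {
  player.addCuePoints(
    cuePoints.map((cue) => ({ time: cue.startTime, value: cue.value }))
  );
});

// 1. Click the transcript → seek the video.
transcript.addEventListener("click", (event) => {
  const startTime = event.target.dataset.startTime;
  if (startTime) player.currentTime = Number(startTime);
});

// 2. The video crosses into a new cue → highlight the matching text.
player.addEventListener("cuepointchange", () => {
  highlightCueAt(player.activeCuePoint?.time); // your own rendering logic
});
```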
This is just a quick recap to get on the same page, so we’re not going to go into implementation. For all the details, do check out Christian’s post, check out his CodeSandbox, and read our docs on Mux Player CuePoints.
Today, we’ll be mostly focusing on how to make the text of the transcript itself better.
How can we make this better?
There’s a few things we can do to Christian’s implementation to make it a bit nicer for our blog. First, we can generate our transcript automatically. Second, we can label our speakers so readers know who’s talking without having to play the videos. And finally, we can add some formatting – like paragraph breaks – to make the reading experience even nicer. Here’s what it all looks like:
And if you’re impatient and don’t want to read anymore, you can cut to the chase and check out the code here.
Generate your transcript with Mux’s AI-powered auto-generated captions
Where does our transcript come from? Turns out, you can turn a caption track into a transcript pretty easily. And you can get a caption track from Mux using our AI-powered auto-generated captions.
Let’s talk about how to enable auto-generated captions, and how to turn those captions into a transcript.
Enable auto-generated captions
When uploading a video to Mux through the dashboard or through the API, you can auto-generate captions with the generated_subtitles field, like this:
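With mux-node (a v8-style client here; the raw POST /video/v1/assets request takes the same fields), that might look something like this:

```js
import Mux from "@mux/mux-node";

const mux = new Mux(); // reads MUX_TOKEN_ID and MUX_TOKEN_SECRET from the environment

const asset = await mux.video.assets.create({
  input: [
    {
      url: "https://example.com/my-video.mp4",
      // Ask Mux to auto-generate an English caption track
      generated_subtitles: [
        { language_code: "en", name: "English (generated)" },
      ],
    },
  ],
  playback_policy: ["public"],
  encoding_tier: "smart", // auto-generated captions require the smart tier
  mp4_support: "audio-only", // gives us an audio-only static rendition for later
});
```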
(Later on, we’ll be working with an audio-only static rendition of this video, so we added the mp4_support field here, too.)
Note that you can only enable auto-generated captions on videos in the smart encoding tier. Read more about encoding tiers here.
To learn about adding generated captions to direct uploads, or retroactively adding generated captions, be sure to check out our docs.
Turn your captions into CuePoints and a transcript
Your video now has captions! Cool! For our next trick, let’s turn those captions into CuePoints and a transcript.
Let’s begin by downloading the captions so we can work with them. To do that, we can use the following endpoint:
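That’s Mux’s text track delivery URL — the same place Mux Player gets its captions from:

```
https://stream.mux.com/{PLAYBACK_ID}/text/{TRACK_ID}.vtt
```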
But where do PLAYBACK_ID and TRACK_ID come from? If you’re in a hurry, you can copy-paste them from your Mux dashboard. Or, you can get them programmatically with mux-node, like the code sample below.
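Something along these lines — retrieve the asset with the same mux client we set up above (ASSET_ID is your asset’s ID) and pluck the IDs off of it:

```js
const asset = await mux.video.assets.retrieve(ASSET_ID);

// Any playback ID on the asset will do for fetching text tracks.
const PLAYBACK_ID = asset.playback_ids?.[0]?.id;

// Find the auto-generated subtitle track.
const TRACK_ID = asset.tracks?.find(
  (track) => track.type === "text" && track.text_type === "subtitles"
)?.id;
```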
Cool. Now that we have a TRACK_ID and a PLAYBACK_ID, let’s go ahead and grab the text of our captions.
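A plain fetch of the endpoint above does the trick:

```js
const response = await fetch(
  `https://stream.mux.com/${PLAYBACK_ID}/text/${TRACK_ID}.vtt`
);
const webvtt = await response.text();
```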
Rad. This is going really well. Let’s seal the deal by translating the captions into CuePoints. Our captions come in a format called WebVTT. It looks like this:
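(The words, of course, will be whatever your video says — this is just the shape of the format.)

```
WEBVTT

00:00:00.500 --> 00:00:03.000
Hey there, welcome back to the Mux blog.

00:00:03.000 --> 00:00:06.250
Today, we're building an interactive transcript.
```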
There’s a few different ways to parse WebVTT. For our purposes, the subtitle library will do!
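Here’s roughly what that looks like with version 4 of subtitle; its parseSync function turns our WebVTT string into an array of nodes, with times in milliseconds:

```js
import { parseSync } from "subtitle";

// Parse the WebVTT string and keep only the cue nodes
// (parseSync also emits a node for the WEBVTT header).
const captions = parseSync(webvtt).filter((node) => node.type === "cue");
```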
That’ll produce an object that looks like this:
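(Give or take — the exact fields depend on your version of subtitle. The important bits are start, end, and text, with times in milliseconds.)

```js
[
  {
    type: "cue",
    data: {
      start: 500,
      end: 3000,
      text: "Hey there, welcome back to the Mux blog.",
    },
  },
  {
    type: "cue",
    data: {
      start: 3000,
      end: 6250,
      text: "Today, we're building an interactive transcript.",
    },
  },
  // ...
]
```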
Well, that just about looks like the CuePoints we were talking about earlier! One last transformation to get us over the line.
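Namely: convert milliseconds to seconds and flatten each cue into the startTime/endTime/value shape from earlier:

```js
const cuePoints = captions.map((caption) => ({
  startTime: caption.data.start / 1000,
  endTime: caption.data.end / 1000,
  value: caption.data.text,
}));
```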
Heck yeah. We now have an object that we can pass to our player and our interactive transcript component.
Here it is, all together:
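A sketch of the whole thing as one function — asset lookup, caption download, parse, reshape:

```js
import Mux from "@mux/mux-node";
import { parseSync } from "subtitle";

const mux = new Mux();

export async function getTranscriptCuePoints(assetId) {
  // 1. Find a playback ID and the auto-generated subtitle track.
  const asset = await mux.video.assets.retrieve(assetId);
  const playbackId = asset.playback_ids?.[0]?.id;
  const track = asset.tracks?.find(
    (t) => t.type === "text" && t.text_type === "subtitles"
  );
  if (!playbackId || !track) {
    throw new Error("Couldn't find a playback ID or subtitle track");
  }

  // 2. Download the WebVTT captions.
  const response = await fetch(
    `https://stream.mux.com/${playbackId}/text/${track.id}.vtt`
  );
  const webvtt = await response.text();

  // 3. Parse the captions and reshape them into CuePoints.
  return parseSync(webvtt)
    .filter((node) => node.type === "cue")
    .map((node) => ({
      startTime: node.data.start / 1000,
      endTime: node.data.end / 1000,
      value: node.data.text,
    }));
}
```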
Identify speakers with AssemblyAI
If your video has multiple speakers, readers need to know who’s talking. And if folks aren’t watching the video, they can’t pick it up from visual context. In other words, we need to identify our speakers in our transcript.
Since Mux doesn’t support this (yet?), we’ll need to turn to a third-party service that supports speaker identification. (If you want to impress your friends at parties, you can also call it “speaker diarization” 🥳) Today, we’re using AssemblyAI. We’ll be able to send AssemblyAI the audio from our video and it’ll tell us who’s speaking.
Getting your speaker identities
For this bit, you’ll need to set up an account and get your API key from their dashboard. You’ll also need to tell your function who’s speaking. (After all, how else would it know what to name your speakers?) Finally, you’ll need to make sure that you have static audio renditions enabled for your video. (If you’ve been following along, you enabled them earlier with the mp4_support field.)
And then, I kid you not, this is the simple code to get a transcript with speaker identities:
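Something like this, using the AssemblyAI Node SDK and the audio-only static rendition of our Mux asset. The speaker names are just examples — AssemblyAI labels speakers “A”, “B”, and so on, and it’s on us to map those to real names:

```js
import { AssemblyAI } from "assemblyai";

const client = new AssemblyAI({ apiKey: process.env.ASSEMBLYAI_API_KEY });

// Map AssemblyAI's anonymous labels to real names.
// Double-check the order against your actual video!
const speakers = { A: "Darius", B: "Christian" };

// Send AssemblyAI the audio-only static rendition and ask it to label speakers.
const aaiTranscript = await client.transcripts.transcribe({
  audio: `https://stream.mux.com/${PLAYBACK_ID}/audio.m4a`,
  speaker_labels: true,
});
```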
aaiTranscript will be a huge object filled with all sorts of metadata. What we’re most interested in, though, is aaiTranscript.utterances. (An utterance, by the way, is a continuous block of speech.) This object will tell us what was said, and by whom.
CuePoints with speaker identities
If you’re okay with longer cues, you can just take this object and provide it to Mux Player. Unlike in the last section, where we stored just the text, we’ll store both the speaker and the text.
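In other words, something like:

```js
// AssemblyAI reports utterance times in milliseconds
// and speakers as "A", "B", and so on.
const cuePoints = aaiTranscript.utterances.map((utterance) => ({
  startTime: utterance.start / 1000,
  endTime: utterance.end / 1000,
  value: {
    speaker: speakers[utterance.speaker],
    text: utterance.text,
  },
}));
```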
Utterances can be kinda long, though. If you would prefer the shorter cues that we got in our previous section, I’d suggest using AssemblyAI to generate a VTT file with shorter cues:
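With the SDK, that looks something like this — the last argument caps how many characters each caption holds:

```js
// Generate WebVTT captions from the transcript AssemblyAI already made.
const webvtt = await client.transcripts.subtitles(aaiTranscript.id, "vtt", 32);
```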
To identify the speaker of each cue, we’re going to use aaiTranscript.words. Just like aaiTranscript.utterances gave us the speaker for every utterance, words will give us the speaker for every word.
It just takes a small modification to our VTT script from the previous section to identify the speaker of each cue…
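…namely, parsing AssemblyAI’s VTT instead of Mux’s, then looking up who was talking during each cue. The midpoint lookup here is just one way to do it:

```js
import { parseSync } from "subtitle";

// Who was talking at a given time (in milliseconds)?
// Find the first word that ends at or after that moment and use its speaker.
const speakerAt = (time) =>
  aaiTranscript.words.find((word) => word.end >= time)?.speaker;

const cuePoints = parseSync(webvtt)
  .filter((node) => node.type === "cue")
  .map((node) => {
    const midpoint = (node.data.start + node.data.end) / 2;
    return {
      startTime: node.data.start / 1000,
      endTime: node.data.end / 1000,
      value: {
        speaker: speakers[speakerAt(midpoint)],
        text: node.data.text,
      },
    };
  });
```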
Making it look pretty
Finally, let’s make some small changes to the TranscriptRenderer we talked about in the other post. If this is the first cue from a speaker, let’s add their name. And if it’s the last cue from a speaker, let’s follow it with a paragraph break.
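TranscriptRenderer lives over in Christian’s post, so here’s just the gist of the change in React — the prop names and classes are placeholders for whatever your component already uses:

```jsx
import { Fragment } from "react";

function TranscriptRenderer({ cuePoints, activeCuePoint, onCueClick }) {
  return (
    <p className="transcript">
      {cuePoints.map((cue, index) => {
        // Compare each cue's speaker to its neighbors'.
        const isFirstFromSpeaker =
          cue.value.speaker !== cuePoints[index - 1]?.value.speaker;
        const isLastFromSpeaker =
          cue.value.speaker !== cuePoints[index + 1]?.value.speaker;
        return (
          <Fragment key={cue.startTime}>
            {isFirstFromSpeaker && <b>{cue.value.speaker}: </b>}
            <span
              className={activeCuePoint?.time === cue.startTime ? "active" : undefined}
              onClick={() => onCueClick(cue.startTime)}
            >
              {cue.value.text}{" "}
            </span>
            {isLastFromSpeaker && (
              <>
                <br />
                <br />
              </>
            )}
          </Fragment>
        );
      })}
    </p>
  );
}
```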
Which makes it look like this!
Formatting text with ChatGPT
There’s one last bit of polish I want to apply before we head out. Sometimes, when a speaker speaks for a while, their text ends up being a bit long. It sure would be nice if there were paragraph breaks every so often to improve legibility.
Luckily, LLMs are great at this kind of task. Let’s ask GPT-4o to add some paragraph breaks. The easiest way to do this would be to just feed the model our entire CuePoints object and ask for some new lines. However, feeding an LLM a bunch of data it won’t use (like startTime and endTime) would increase latency (and cost). So let’s create a smaller object for the model to consume…
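Something small, like an id (so we can match the model’s response back up to our cues) plus the speaker and the text:

```js
// Just the bits the model needs — no startTime or endTime.
const linesForModel = cuePoints.map((cue, id) => ({
  id,
  speaker: cue.value.speaker,
  text: cue.value.text,
}));
```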
And then feed that to the model. First, you’ll need to set up the OpenAI SDK. Then, we’ll use that SDK to ask the model to add line breaks. We’ll also use ChatGPT’s new Structured Outputs API to make sure that our output object is of the correct format.
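Here’s one way to wire that up with the OpenAI Node SDK and its zod helper — the prompt and schema are just my first pass (see the disclaimer below):

```js
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Ask for every line back, unchanged except for "\n\n" wherever a new
// paragraph should start. Structured Outputs keeps the response's shape honest.
const TranscriptLines = z.object({
  lines: z.array(z.object({ id: z.number(), text: z.string() })),
});

const completion = await openai.beta.chat.completions.parse({
  model: "gpt-4o-2024-08-06",
  messages: [
    {
      role: "system",
      content:
        "You add paragraph breaks to video transcripts. Return every line you receive, " +
        'unchanged except for inserting "\\n\\n" in the text wherever a new paragraph should begin.',
    },
    { role: "user", content: JSON.stringify(linesForModel) },
  ],
  response_format: zodResponseFormat(TranscriptLines, "transcript_lines"),
});

// Merge the model's paragraph breaks back into our CuePoints.
for (const line of completion.choices[0].message.parsed.lines) {
  cuePoints[line.id].value.text = line.text;
}
```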
Disclaimer: I’m not a prompt engineer and this is my first time using the ChatGPT API. If you have any suggestions, hit me up @darius_cepulis.
This one seems to work, though!
One last change: normally, HTML will collapse those newline characters. To force them to display, you’ll have to add white-space: pre to your styles. Et voilà!
Homework
Here’s an example endpoint running on Val Town. Fork it, give it your asset ID and a list of speakers, and it’ll do the rest.
However, there’s more you can do!
First, the rendering code we wrote earlier relies on new lines (<br/> and \n) for line breaks. Ideally, you’d use <p/> paragraph tags instead.
Next, you might want this code to run automatically! Point a Mux Webhook at your endpoint and then save the data to your CMS:
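A minimal sketch of that endpoint might look like this — buildEnhancedTranscript stands in for the steps above, and saveToCms for wherever your blog keeps its data:

```js
// A fetch-style handler (Val Town, edge functions, and Next.js route
// handlers all look roughly like this).
export default async function handleMuxWebhook(request) {
  // In production, verify Mux's webhook signature before trusting the payload.
  const event = await request.json();

  // video.asset.ready fires once the asset is playable. (Auto-generated
  // captions can land a little later — video.asset.track.ready is another
  // event worth listening for.)
  if (event.type === "video.asset.ready") {
    const cuePoints = await buildEnhancedTranscript(event.data.id);
    await saveToCms(event.data.id, cuePoints);
  }

  return new Response("ok");
}
```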
Finally (and this is important!)
Writing this whole blog post, I was reminded of the true nature of our ongoing AI craze. AI is an incredible tool, but it's imperfect. Right now, it works best when it does tedious tasks for us and we supervise it. For example: I can't live without GitHub Copilot, where I'm supervising and approving suggestions that I understand. But I'd be hard-pressed to replace our docs with an AI, where a chatbot might provide incorrect suggestions to customers.
Lucky for us, this is the good kind of AI task. You and I know what speaker labels and paragraph breaks are supposed to look like so we can easily supervise our models.
I'd suggest adding some sort of preview in your CMS so you can edit any mistakes your AI might’ve made. It might look like ours:
The End
You’ve got a lot of data attached to your Mux assets. Here, we played with auto-generated captions and static audio renditions to build an interactive video transcript. But we’ve got thumbnails, storyboards, and video renditions, too. Combined with webhooks and all these fun, new AI APIs, you can build some really cool stuff.
When you do, let us know – @MuxHQ!