Summarizing and tagging videos with AI

A workflow for using AI models to summarize a video and generate tags for it

Much of the information about a video's content is carried in its audio track, and therefore in its transcript. Just as a transcript can be used to generate chapters, it can be used to summarize a video, create a title, or generate a list of tags that describe the subjects being discussed. Using AI for this task lets you generate this metadata automatically.

Workflow

Here's a high-level overview of how you might fit the different pieces together:

  • Upload a video with auto-generated captions enabled
  • Wait for the video.asset.track.ready webhook, which will tell you that the captions track has finished being created
  • Retrieve the transcript file and give it to an AI model, like OpenAI's ChatGPT or Anthropic's Claude
  • Use a prompt to ask for a summary of the transcript, with guidance on how long the summary should be. It helps to state explicitly that you want nothing returned other than the summary, and to set up a "system prompt" that primes the LLM for the task it is about to be asked to complete
  • Feed the summary back into the model in order to distill the summary down into a title
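The steps above can be sketched as a single webhook handler. This is a minimal sketch rather than a definitive implementation: the event field names (`type`, `data`), the shortened prompts, and the `fetch_transcript` and `ask_model` callables are all assumptions made for illustration. In a real application, `fetch_transcript` would download the transcript file for the track and `ask_model` would wrap whichever LLM client you use.

```python
from typing import Callable, Optional

TRACK_READY = "video.asset.track.ready"

# Hypothetical, shortened prompts used only for this sketch.
SUMMARY_PROMPT = "Your task is to summarize the transcript of a video. Return only the summary."
TITLE_PROMPT = "Your task is to write a short title for a video based on its summary. Return only the title."


def handle_event(
    event: dict,
    fetch_transcript: Callable[[dict], str],
    ask_model: Callable[[str, str], str],
) -> Optional[dict]:
    """Return {"summary", "title"} for a ready text track, else None.

    Assumes the webhook body has a top-level "type" and a "data" object
    describing the track; check the payload you actually receive.
    """
    if event.get("type") != TRACK_READY:
        return None  # not the event we are waiting for

    track = event.get("data", {})
    if track.get("type") != "text":
        return None  # ignore audio/video tracks (assumed field name)

    transcript = fetch_transcript(track)

    # First pass: summarize the transcript.
    summary = ask_model(SUMMARY_PROMPT, transcript)
    # Second pass: distill the summary into a title.
    title = ask_model(TITLE_PROMPT, summary)

    return {"summary": summary, "title": title}
```

Passing the two callables in keeps the handler independent of any particular HTTP or LLM library, which also makes it easy to test with stubs.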

A system prompt for this type of task might look something like this:

Your task is to summarize the transcript of a video. Please follow these guidelines:

  • Be brief. Condense the content into a summary that captures the key points and main ideas without losing important details.
  • Avoid jargon or overly complex language unless necessary for the context.
  • Focus on the most critical information, ignoring filler, repetitive statements, or irrelevant tangents.
  • Aim for a summary that is 3-5 sentences long and no more than 200 characters.

LLMs tend to struggle with character limits because of the way text is represented to them: as tokens that can each span multiple characters. You should still be able to specify a rough upper limit, though.
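Because the model may overshoot a character limit, one option is to enforce the limit yourself after the response comes back. The following is a minimal sketch of that idea; the function name `trim_to_limit` and the trimming strategy (prefer the last complete sentence, otherwise the last whole word) are illustrative assumptions, not part of any Mux or LLM API.

```python
def trim_to_limit(summary: str, max_chars: int = 200) -> str:
    """Return the summary shortened to at most max_chars characters."""
    text = " ".join(summary.split())  # collapse newlines and repeated spaces
    if len(text) <= max_chars:
        return text

    clipped = text[:max_chars]
    # Prefer cutting at the last sentence boundary inside the limit.
    cut = max(clipped.rfind(". "), clipped.rfind("! "), clipped.rfind("? "))
    if cut != -1:
        return clipped[: cut + 1]

    # Otherwise cut at the last whole word and mark the truncation,
    # leaving one character of room for the ellipsis.
    clipped = text[: max_chars - 1]
    space = clipped.rfind(" ")
    if space > 0:
        clipped = clipped[:space]
    return clipped.rstrip(" ,;:") + "…"
```

An alternative is to send the summary back to the model and ask it to shorten the text, which costs an extra request but usually reads more naturally than a hard cut.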

Using the same process, you can prompt for different types of metadata, like tags or a simple list of the subjects being discussed, by amending the system prompt to change what you expect to be returned.
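When you prompt for tags, the reply is still free text, so it is worth normalizing it before storing it. The sketch below assumes you asked the model for a comma-separated or one-per-line list; `parse_tags` is a hypothetical helper that strips bullets and numbering, lowercases each tag, and removes duplicates.

```python
import re


def parse_tags(reply: str, max_tags: int = 10) -> list[str]:
    """Turn a model reply into a clean, de-duplicated list of tags."""
    tags: list[str] = []
    seen: set[str] = set()
    for raw in re.split(r"[,\n]", reply):
        # Drop a leading bullet ("-", "*", "•") or list number ("1.", "2)").
        tag = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", raw)
        # Trim whitespace and quotes, lowercase, and collapse inner spaces.
        tag = " ".join(tag.strip().strip("\"'").lower().split())
        if tag and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return tags[:max_tags]
```

For example, a reply of "- Video Encoding", "- webhooks", "- video encoding" on three lines would be reduced to the two tags `video encoding` and `webhooks`.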

Use cases

Once you have your summary and tags stored in your database, you can use them as the basis for other features.

Tags, for example, can be used to improve a search experience by letting people filter for videos that cover certain subjects, with titles and summaries shown in the search results themselves.
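As a rough illustration of tag-based filtering, the sketch below keeps the metadata in memory. The `Video` record and `filter_by_tags` function are hypothetical names; in practice this would more likely be a query against your database or search index.

```python
from dataclasses import dataclass, field


@dataclass
class Video:
    # Hypothetical record holding the generated metadata for one video.
    title: str
    summary: str
    tags: set[str] = field(default_factory=set)


def filter_by_tags(videos: list[Video], required: set[str]) -> list[Video]:
    """Return the videos that carry every one of the required tags."""
    wanted = {t.lower() for t in required}
    return [v for v in videos if wanted <= {t.lower() for t in v.tags}]
```

The title and summary on each returned record can then be rendered directly in the search results.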

You could also use the tags to enrich your analytics, tracking which types of content users are most interested in and visualizing tags by popularity.
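One simple way to rank tags by popularity is to weight each tag by the view count of the videos it appears on. This is a sketch under that assumption; the input shape (pairs of tags and view counts) and the name `tag_popularity` are made up for illustration, and the view counts would come from your own analytics data.

```python
from collections import Counter
from typing import Iterable


def tag_popularity(videos: Iterable[tuple[Iterable[str], int]]) -> list[tuple[str, int]]:
    """Sum view counts per tag and return (tag, views) pairs, most viewed first."""
    totals: Counter = Counter()
    for tags, views in videos:
        for tag in set(tags):  # count each tag once per video
            totals[tag] += views
    return totals.most_common()
```

The resulting list can be fed straight into a bar chart or tag cloud.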

Considerations

It is possible to ask an LLM to return multiple pieces of information at the same time, like a summary and title, but this can have varying effects on the quality of the output compared to asking it to focus on a single task at a time. Depending on the specific model you are using, you might want to benchmark the two methods against each other before deciding on which method to use.

For example, many models, like ChatGPT, now support a strict JSON output mode that guarantees the model's output adheres to a given JSON schema. This schema could have predefined properties for the different data points you're trying to extract. Even if this works as expected, you should still test the quality of the results, because dividing the model's attention and constraining it to produce only valid JSON affects how it generates tokens for its response.
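A schema for this kind of combined request might look like the sketch below. The property names (`title`, `summary`, `tags`) are assumptions chosen for illustration, and how the schema is passed to the model depends on the provider, so that part is left out. The `parse_metadata` helper shows a defensive check on the reply that uses only the standard library, which is useful when the model or mode you use does not strictly enforce the schema.

```python
import json

# Hypothetical JSON Schema describing the metadata we want back.
METADATA_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "summary": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "summary", "tags"],
    "additionalProperties": False,
}


def parse_metadata(reply: str) -> dict:
    """Parse a JSON reply and check it against the expected shape.

    Raises ValueError if the reply is not valid JSON or does not match.
    """
    data = json.loads(reply)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != set(METADATA_SCHEMA["required"]):
        raise ValueError(f"unexpected or missing keys: {sorted(data)}")
    if not isinstance(data["title"], str) or not isinstance(data["summary"], str):
        raise ValueError("title and summary must be strings")
    tags = data["tags"]
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError("tags must be a list of strings")
    return data
```

Running the same transcripts through this combined request and through separate single-task requests gives you a direct way to compare the quality of the two approaches.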
