Generating video chapters with AI

A workflow for using AI models to segment a video into chapters

We also wrote a blog post about using ChatGPT to segment a video into chapters.

If you're using a player that supports visualising chapters during playback, like Mux Player does, then you'll need your chapters defined in a format that can be given to your player.

Splitting your video into chapters manually can be tedious though, so we're going to give a high-level overview of how you could leverage AI to help with this.

Ultimately, we need to generate a list of chapter titles, each paired with the timestamp at which that chapter starts.

Output format

Here are a couple of examples of the kind of output you will want to generate from your AI integration. You can generate your chapters either as plain text or in a structured format like JSON.

In plain text

This is similar to the YouTube chapter format and is a common way to represent chapters in a concise, readable way. You will likely parse this output before storing it in your database.

00:00:00 Instant Clipping Introduction
00:00:15 Setting Up the Live Stream
00:00:29 Adding Functionality with HTML and JavaScript
00:00:41 Identifying Favorite Scene for Clipping
00:00:52 Selecting Start and End Time for Clip
00:01:10 Generating Clip URL
00:01:16 Playing the Clipped Video
00:01:24 Encouragement to Start Clipping
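
Since this output usually needs parsing before it is stored, here is a minimal sketch of one way to do that. The `parseChapterText` name is hypothetical, and the sketch assumes every chapter line looks like `HH:MM:SS Title`; lines that don't match (such as a stray preamble from the model) are skipped rather than treated as errors.

```javascript
// Hypothetical helper: turns YouTube-style chapter text into { start, title } objects.
// Assumes each chapter line is "HH:MM:SS Title"; anything else is ignored.
function parseChapterText(text) {
  const linePattern = /^(\d{2}:\d{2}:\d{2})\s+(.+)$/;
  return text
    .split(/\r?\n/)
    .map((line) => line.trim().match(linePattern))
    .filter((match) => match !== null)
    .map(([, start, title]) => ({ start, title: title.trim() }));
}

const chapters = parseChapterText(`
00:00:00 Instant Clipping Introduction
00:00:15 Setting Up the Live Stream
Here are your chapters!
`);
console.log(chapters); // two chapter objects; the last line is skipped
```

The result has the same shape as the JSON format, so the rest of your pipeline can treat both outputs the same way.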

In JSON

JSON is more convenient to handle with JavaScript on the front-end.

[
  { "start": "00:00:00", "title": "Instant Clipping Introduction" },
  { "start": "00:00:15", "title": "Setting Up the Live Stream" },
  {
    "start": "00:00:29",
    "title": "Adding Functionality with HTML and JavaScript"
  },
  {
    "start": "00:00:41",
    "title": "Identifying Favorite Scene for Clipping"
  },
  { "start": "00:00:52", "title": "Selecting Start and End Time for Clip" },
  { "start": "00:01:10", "title": "Generating Clip URL" },
  { "start": "00:01:16", "title": "Playing the Clipped Video" },
  { "start": "00:01:24", "title": "Encouragement to Start Clipping" }
]

Many LLMs can be prompted to return JSON directly, for example by using OpenAI's strict JSON mode. The guarantees you get about whether your schema will be strictly adhered to depend on the model you are using, so you should validate the JSON response using a library like Zod.
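
As a rough illustration of what such validation checks, here is a dependency-free sketch. It is a hand-rolled stand-in for a schema library like Zod, not Zod's API. The `validateChapters` name and the exact rules (a non-empty array, `HH:MM:SS` timestamps, non-empty titles) are assumptions you would adapt to your own schema.

```javascript
// Hand-rolled stand-in for a schema validator such as Zod.
// Assumption: the expected shape is an array of { start: "HH:MM:SS", title: string }.
const TIMESTAMP = /^\d{2}:[0-5]\d:[0-5]\d$/;

function validateChapters(rawJson) {
  let parsed;
  try {
    parsed = JSON.parse(rawJson);
  } catch (err) {
    return { ok: false, error: "Response is not valid JSON" };
  }
  if (!Array.isArray(parsed) || parsed.length === 0) {
    return { ok: false, error: "Expected a non-empty array of chapters" };
  }
  for (const [index, chapter] of parsed.entries()) {
    const valid =
      chapter !== null &&
      typeof chapter === "object" &&
      typeof chapter.start === "string" &&
      TIMESTAMP.test(chapter.start) &&
      typeof chapter.title === "string" &&
      chapter.title.trim().length > 0;
    if (!valid) {
      return { ok: false, error: `Chapter ${index} has an invalid shape` };
    }
  }
  return { ok: true, chapters: parsed };
}
```

Returning a result object instead of throwing makes it straightforward to retry the model call when validation fails.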

Mux features used

Information about what subjects are being discussed in a video can usually be found in the transcript. You can therefore use Mux's auto-generated captions feature as a base to generate chapters from. This text data is much easier and faster to process than analysing the video or audio tracks directly.

Workflow

Here's a high-level overview of how you might fit the different pieces together:

  • Upload a video with auto-generated captions enabled
  • Wait for the video.asset.track.ready webhook, which will tell you that the captions track has finished being created
  • Retrieve the transcript file and give the contents of it to an AI model, like OpenAI's ChatGPT or Anthropic's Claude
  • Craft a prompt that requests the transcript be segmented into chapters with timestamps, and not to include any other information in the response
  • Give the resulting chapters to your player to visualise
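
The webhook and transcript steps above can be sketched as a few small helpers. Treat the details as assumptions to check against the Mux documentation: the payload fields read in `isCaptionTrackReady`, and the `https://stream.mux.com/{PLAYBACK_ID}/text/{TRACK_ID}.vtt` URL pattern used in `transcriptUrl`. `condenseVtt` is a hypothetical helper that reduces WebVTT cues to one `HH:MM:SS text` line each, keeping the timestamps the model needs while dropping everything else.

```javascript
// Assumption: the webhook body looks like { type, data: { type: "text", ... } }.
function isCaptionTrackReady(event) {
  return (
    event?.type === "video.asset.track.ready" && event?.data?.type === "text"
  );
}

// Assumption: transcripts are served at this URL pattern for a public playback ID.
function transcriptUrl(playbackId, trackId) {
  return `https://stream.mux.com/${playbackId}/text/${trackId}.vtt`;
}

// Reduce WebVTT to "HH:MM:SS text" lines, one per cue.
function condenseVtt(vtt) {
  // Matches cue timing lines such as "00:00:15.000 --> 00:00:17.000".
  const cueTiming = /^(?:(\d{2}):)?(\d{2}):(\d{2})\.\d{3}\s+-->/;
  const cues = [];
  let current = null;
  for (const raw of vtt.split(/\r?\n/)) {
    const line = raw.trim();
    const timing = line.match(cueTiming);
    if (timing) {
      const [, hours = "00", minutes, seconds] = timing;
      current = { start: `${hours}:${minutes}:${seconds}`, text: [] };
      cues.push(current);
    } else if (line === "") {
      current = null; // a blank line ends the current cue
    } else if (current) {
      current.text.push(line); // header and cue identifiers are skipped
    }
  }
  return cues
    .filter((cue) => cue.text.length > 0)
    .map((cue) => `${cue.start} ${cue.text.join(" ")}`)
    .join("\n");
}

// Usage sketch (Node 18+ or a browser, where fetch is global):
// const vtt = await (await fetch(transcriptUrl(playbackId, trackId))).text();
// const transcript = condenseVtt(vtt);
```

Condensing the transcript this way also reduces the number of tokens sent to the model, which matters for long videos.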

A system prompt for this task might look something like this:

Your role is to segment the following captions into chunked chapters, summarising each chapter with a title. Your response should be in the YouTube chapter format with each line starting with a timestamp in HH:MM:SS format followed by a chapter title. Do not include any preamble or explanations.
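
One way to wire this prompt up without tying the code to a particular vendor SDK is sketched below. `buildMessages`, `generateChapterText`, and the injected `callModel` function are all hypothetical: `callModel` stands for whatever client call your provider offers, and is assumed to accept an array of `{ role, content }` messages and resolve to the model's text reply.

```javascript
const SYSTEM_PROMPT =
  "Your role is to segment the following captions into chunked chapters, " +
  "summarising each chapter with a title. Your response should be in the " +
  "YouTube chapter format with each line starting with a timestamp in " +
  "HH:MM:SS format followed by a chapter title. Do not include any preamble " +
  "or explanations.";

// Build the chat messages: instructions as the system prompt,
// the transcript as the user message.
function buildMessages(transcript) {
  if (typeof transcript !== "string" || transcript.trim() === "") {
    throw new Error("Transcript is empty");
  }
  return [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: transcript },
  ];
}

// callModel is a hypothetical adapter around your provider's SDK.
async function generateChapterText(transcript, callModel) {
  const reply = await callModel(buildMessages(transcript));
  return String(reply).trim();
}
```

Keeping the model call behind a function argument makes it easy to swap providers, and to test the pipeline with a fake model that returns canned chapters.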

Visualizing

Once you have some chapters, you can display them in Mux Player like this:

// Get a reference to the player
const player = document.querySelector('mux-player');
// startTime is in seconds
player.addChapters([
  { startTime: 5, value: 'Chapter name' },
  { startTime: 15, value: 'Second chapter' },
]);

Here's an example of converting text-based HH:MM:SS timestamps into seconds and giving them to Mux Player:

import "./styles.css";
import "@mux/mux-player";

const generatedChapters = [
  { start: "00:00:00", title: "Instant Clipping Introduction" },
  { start: "00:00:15", title: "Setting Up the Live Stream" },
  {
    start: "00:00:29",
    title: "Adding Functionality with HTML and JavaScript",
  },
  {
    start: "00:00:41",
    title: "Identifying Favorite Scene for Clipping",
  },
  { start: "00:00:52", title: "Selecting Start and End Time for Clip" },
  { start: "00:01:10", title: "Generating Clip URL" },
  { start: "00:01:16", title: "Playing the Clipped Video" },
  { start: "00:01:24", title: "Encouragement to Start Clipping" },
];

const playerEl = document.querySelector("mux-player");

const parsedChapters = generatedChapters.map(({ start, title }) => {
  // Split "HH:MM:SS" into numeric hours, minutes and seconds
  const [hours, minutes, secs] = start.split(":").map((n) => parseInt(n, 10));
  // Mux Player expects the chapter start time in seconds
  const seconds = hours * 3600 + minutes * 60 + secs;
  return { startTime: seconds, value: title };
});

playerEl.addChapters(parsedChapters);
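
Model output is not always perfectly uniform, so a slightly more defensive conversion can help. The sketch below uses two hypothetical helpers, `timestampToSeconds` and `toPlayerChapters`, which accept both `MM:SS` and `HH:MM:SS`, drop entries that cannot be parsed, and sort the result by start time. Whether silently dropping bad entries is right for you is a design choice, not a requirement of Mux Player.

```javascript
// Convert "HH:MM:SS" or "MM:SS" into seconds; returns null if malformed.
function timestampToSeconds(timestamp) {
  const parts = String(timestamp).trim().split(":");
  if (parts.length < 2 || parts.length > 3) return null;
  if (!parts.every((part) => /^\d+$/.test(part))) return null;
  // Fold left to right: each step multiplies the running total by 60.
  return parts.reduce((total, part) => total * 60 + Number(part), 0);
}

// Map { start, title } objects to the { startTime, value } shape used above.
function toPlayerChapters(generatedChapters) {
  return generatedChapters
    .map(({ start, title }) => ({
      startTime: timestampToSeconds(start),
      value: title,
    }))
    .filter((chapter) => chapter.startTime !== null)
    .sort((a, b) => a.startTime - b.startTime);
}

console.log(
  toPlayerChapters([
    { start: "00:01:10", title: "Generating Clip URL" },
    { start: "00:15", title: "Setting Up the Live Stream" },
    { start: "soon", title: "Dropped because it cannot be parsed" },
  ])
);
```

The returned array can be passed straight to `addChapters` in place of `parsedChapters`.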
