You’ve created some video content! High five! …Now comes the slog. Do you have all your tweets written? Is your companion blog post ready to go? Do you have your Open Graph image designed? What about your video poster image?
The sheer number of “required” assets is overwhelming. None of these tasks is particularly difficult on its own, but together they can feel like a full-time job. Wouldn’t you rather spend that time doing something else you enjoy?
With artificial general intelligence (AGI) feeling imminent these days, I wondered if there was an opportunity to make this whole social media workflow a little less painful. Could I incorporate automated content generation into my video workflow? All the tools are there. There’s Whisper, an OpenAI service that transcribes audio into text. Then, of course, there’s ChatGPT, which allows for conversational AI and context setting. And Mux offers webhooks to notify your application when a static rendition is ready to be used.
Iiiiinteresting…
I spoke with a few coworkers about the idea and heard some buzz about Contenda. It’s a service that takes a video as an input and creates a blog post or tutorial from the video contents. I gave it a shot with a video we have hosted over on YouTube, and it showed some promising results:
With no editing, this actually got us to a shockingly good first pass at a blog post based on our video content. Huge first step, but it felt like there was still some work left on the table for The Machines.
We have some public videos in our docs that make for good test subjects. This four-minute overview of how to get started with Mux Player seems like a perfect candidate. Here’s the public playback URL for the video on this page:
The asset ID for this video is 2GbbwDon00uFYwrzwR01vwrxn9xFph9cWChUMPJLtLdjk — we’ll need that to enable static renditions, which will generate downloadable MP4 files.
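If you’d rather do this with code than clicks, here’s a minimal sketch of that call against the Mux API (assuming `MUX_TOKEN_ID` and `MUX_TOKEN_SECRET` are set in your environment):

```js
// Enable static renditions (downloadable MP4s) on an existing asset.
const assetId = "2GbbwDon00uFYwrzwR01vwrxn9xFph9cWChUMPJLtLdjk";
const auth = Buffer.from(
  `${process.env.MUX_TOKEN_ID}:${process.env.MUX_TOKEN_SECRET}`
).toString("base64");

await fetch(`https://api.mux.com/video/v1/assets/${assetId}/mp4-support`, {
  method: "PUT",
  headers: {
    Authorization: `Basic ${auth}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ mp4_support: "standard" }),
});
```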
Keep in mind that we can also enable MP4 support right when a video is uploaded, so this doesn’t always have to be a manual one-off process.
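For example, a direct upload can ask for MP4 support up front. Here’s a sketch, reusing the `auth` header from above (the `cors_origin` value is a placeholder):

```js
// Create a direct upload whose resulting asset will include MP4 renditions.
await fetch("https://api.mux.com/video/v1/uploads", {
  method: "POST",
  headers: {
    Authorization: `Basic ${auth}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    cors_origin: "https://example.com",
    new_asset_settings: {
      playback_policy: ["public"],
      mp4_support: "standard",
    },
  }),
});
```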
At this point, Mux will send out a few webhooks to any configured receiver URLs.
The first is a video.asset.static_renditions.preparing type that notifies your application about the static renditions currently being generated. The second type, video.asset.static_renditions.ready, provides us with the critical information we need to kick off our automated workflow.
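Here’s an abridged example of what that second payload looks like (IDs shortened and most fields omitted):

```json
{
  "type": "video.asset.static_renditions.ready",
  "data": {
    "id": "2GbbwDon00uFYwrzwR01vwrxn9xFph9cWChUMPJLtLdjk",
    "playback_ids": [{ "policy": "public", "id": "..." }],
    "static_renditions": {
      "status": "ready",
      "files": [
        { "name": "low.mp4", "ext": "mp4" },
        { "name": "medium.mp4", "ext": "mp4" },
        { "name": "high.mp4", "ext": "mp4" }
      ]
    }
  }
}
```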
Notice the entries in the data.static_renditions.files array in the payload above. Here, we get a list of the different MP4 qualities that were generated.
Since we're only working with the audio portion of the video in this article, the lowest-quality low.mp4 should work great for our purposes. That video will have the smallest file size, making it much quicker to send to OpenAI’s Whisper API for audio processing.
So, we need a public URL that can handle this webhook request, process the payload, and run whatever custom code we can dream up. Let’s start dreaming.
UPDATE: Mux can now create an auto-generated transcript for you as soon as you upload your video. Check out the docs and save yourself a boatload of time by skipping this section.
We’ll get started here with the transcript bit. The public URL for our new low.mp4 video file can be constructed like this:
https://stream.mux.com/${playbackId}/low.mp4
Let's create a new index.js JavaScript file that can be run locally with Node.
As a high-level overview, here's what we need to do in this script:
Download the low.mp4 file and get the response blob
Create a FormData object and attach the data required by the Whisper API
Submit the form and wait for the transcribed response
Use the transcription to build out the companion content with ChatGPT
Let's knock out the first task. We’ll use the fetch API that comes with Node 17.5 to download the lowest resolution video.
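A sketch of that first step (this assumes the script runs as an ES module on a Node version where `fetch` is available globally: Node 18+, or 17.5 with the `--experimental-fetch` flag):

```js
// index.js — download the lowest-quality rendition and read it as a blob.
const playbackId = "YOUR_PLAYBACK_ID"; // placeholder: your asset's public playback ID

const videoResponse = await fetch(`https://stream.mux.com/${playbackId}/low.mp4`);
if (!videoResponse.ok) {
  throw new Error(`Failed to download video: ${videoResponse.status}`);
}
const videoBlob = await videoResponse.blob();
```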
Next, we construct the form data and attach the blob. This data will be sent in the POST request to the Whisper API:
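Something like this, assuming a global `FormData` (also available in Node 18+); `whisper-1` is the transcription model name from OpenAI’s docs:

```js
// Build the multipart form body the Whisper endpoint expects.
const formData = new FormData();
formData.append("file", videoBlob, "low.mp4");
formData.append("model", "whisper-1");
```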
Now we're ready to submit the video file to the Whisper API. We’ll also need an OpenAI API key; you can get that by signing up for an OpenAI account and adding a credit card.
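A sketch of the request, assuming the key lives in an `OPENAI_API_KEY` environment variable:

```js
// POST the form to the transcription endpoint and pull the text out of the response.
const whisperResponse = await fetch(
  "https://api.openai.com/v1/audio/transcriptions",
  {
    method: "POST",
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
    body: formData,
  }
);
const { text: transcript } = await whisperResponse.json();
console.log(transcript);
```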
If everything’s working as expected, you should see a text transcript of your video in the console at this point. And… it’s super accurate. Nice!
We’re now at a place where we can go one of many different ways. As you’ve likely seen by now, ChatGPT is extremely flexible and can adapt to the different expectations we set for it.
Let’s assume we need a creative title and description for this video. Sure, we can submit the transcript to ChatGPT and hope for the best. But a better approach might be to give it a system context that helps guide it toward the persona most capable of providing a fitting reply.
Let’s tell it to act like it is a super successful YouTube influencer capable of writing incredibly viral video titles and descriptions:
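A sketch using the Chat Completions endpoint, wrapped in a little helper we can reuse for the rest of the assets (the model name is an assumption; use whichever chat model you have access to):

```js
// Small helper: send a system persona plus a user prompt, return the reply text.
async function ask(systemPrompt, userPrompt) {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-3.5-turbo",
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: userPrompt },
      ],
    }),
  });
  const { choices } = await response.json();
  return choices[0].message.content;
}

const persona =
  "You are a super successful YouTube influencer capable of writing incredibly viral video titles and descriptions.";

const titleAndDescription = await ask(
  persona,
  `Write a title and description for a video with this transcript:\n\n${transcript}`
);
console.log(titleAndDescription);
```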
Hey, not bad!
We’ll need a few tweets to send out as well. We couuulld hire a Twitter influencer who can write us a few companion tweets for our video. Or…
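With the helper above, that’s one more call (the prompt wording here is just an illustration):

```js
const tweets = await ask(
  persona,
  `Write three short, punchy tweets promoting a video with this transcript:\n\n${transcript}`
);
console.log(tweets);
```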
I don’t really know how it got our Twitter handle right. Phew, what should I do with all this time I’m saving? Maybe I’ll start updating my resume?
Let’s get an outline for a blog post that would go along great with the new video:
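Same pattern, different prompt:

```js
const blogOutline = await ask(
  persona,
  `Write an outline for a blog post that would pair well with a video with this transcript:\n\n${transcript}`
);
console.log(blogOutline);
```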
I can imagine a world where you could create a new draft in your CMS and use this outline as the draft body. If you’re feeling ultra lazy, maybe you’d be willing to let ChatGPT write the entire blog post on its own (possibly with the help of Contenda).
So far, this has all been a little too easy for comfort. Let’s see if we can get some images generated that would be compatible with all the content we have so far:
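First, we can lean on the same helper to have ChatGPT write a DALL-E prompt for us (again, the prompt wording is illustrative):

```js
// Ask ChatGPT to write an image prompt that fits the rest of the content.
const imagePrompt = await ask(
  persona,
  `Write a short DALL-E prompt for an image that would complement a video with this transcript:\n\n${transcript}`
);
console.log(imagePrompt);
```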
Here’s the prompt it came up with:
Now, we proceed to send that prompt to DALL-E to create some AI-generated images from it:
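A sketch of that call against the images endpoint (`n` and `size` are arbitrary choices here):

```js
// Generate a few candidate images from the ChatGPT-written prompt.
const imageResponse = await fetch("https://api.openai.com/v1/images/generations", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ prompt: imagePrompt, n: 3, size: "1024x1024" }),
});
const { data: images } = await imageResponse.json();
console.log(images.map((image) => image.url));
```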
The image generation URLs are signed and expire, so they no longer work. Here are the images those URLs originally pointed to:
Phew. Looks like my job is safe for now (at least until the next version of DALL-E comes out). It's also pretty telling that I copied the exact same ALT text for each of these images – and, as our content editor wisely called out, a reminder that a real human hand may still be needed when it comes to representation and diversity in AI-generated content.
So now that this script works, all that’s left is to wire it up into a webhook handler endpoint and handle new jobs as each webhook event is delivered.
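As a rough sketch, an Express receiver might look something like this (`generateCompanionContent` is a hypothetical wrapper around the script above, and a production handler should also verify the Mux webhook signature):

```js
import express from "express";

const app = express();
app.use(express.json());

app.post("/webhooks/mux", (req, res) => {
  const event = req.body;

  if (event.type === "video.asset.static_renditions.ready") {
    const playbackId = event.data.playback_ids?.[0]?.id;
    // Kick off the pipeline without blocking the webhook response.
    generateCompanionContent(playbackId).catch(console.error);
  }

  // Acknowledge quickly so Mux doesn't retry delivery.
  res.status(200).end();
});

app.listen(3000);
```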
Sure, we’re still at a point where the generated output content feels somewhat mechanical, markety, maybe even dry. There’s a lack of authenticity, connection, emotion, resonance – a lack of the human element. But I can’t help but wonder how far into the future it will be before I have to update this blog post to say “well… that didn’t take long.” (Or maybe the AI will update it on my behalf?)
For now, these tools can be great starting points for ideation. All it takes is a few edit passes to match your voice and get your message in shape.
I’m glad I grew up in a blue-collar household. If you need me, I’ll be working on remembering how to safely operate a table saw.