Introducing Sieve Dubbing 1.0: AI Dubbing for Developers
We discuss the launch of Sieve’s Dubbing API, the first AI dubbing solution purpose-built for developers.
by Mokshith Voodarla

From famous political speeches to Mr. Beast videos, we’ve seen AI dubbing go mainstream over the last year. New advances in transcription, translation and text-to-speech have made this possible. What used to be a manual process, economically viable only for top movie studios and streaming networks, can now be completely automated.

While this could mean better distribution and accessibility of content on every video platform, there are no great developer-first options for dubbing that make integration easy and seamless.

That’s why we’re excited to introduce Sieve’s Dubbing API — the first AI dubbing solution purpose-built for developers, powered by the Sieve platform.

You can try it here for free without signing up, or create an account to integrate it into your video platform via API — no commitment, pay-as-you-go.

What developers want

When we surveyed the market for dubbing solutions, we saw many products aimed directly at content creators or top studios, with API offerings on the side. Developers adopted these APIs because they were the only way to get the highest quality, but on most of these platforms the API felt like an afterthought. Developers weigh three things when considering third-party solutions: quality, cost, and flexibility.

With Sieve Dubbing, we offer the best of all three. Sieve Dubbing is the highest-quality purely programmatic dubbing solution on the market. It requires zero upfront payment, with the friendliest pricing on pay-as-you-go terms and for enterprises. And it offers a ton of flexibility, with a choice of voice engines, translation engines, and language styles.

How it Works

The most basic version of dubbing can be built by directly chaining transcription, translation, and text-to-speech models. The simplest approach would be the following:

  1. Transcribe the file
  2. Translate the transcribed text
  3. Clone the voice from the video
  4. Run text-to-speech on the translated text using the cloned voice
  5. Adjust the audio duration (speed up or slow down) to match the original audio / video length
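
To make the five steps concrete, here is a minimal Python sketch of that naive pipeline. It is an illustration only, not how Sieve Dubbing is implemented. Every name in it (`Segment`, `DubbedSegment`, `naive_dub`, and the four stage callables) is hypothetical, and the stages are passed in as functions so that any transcription, translation, or text-to-speech model could be plugged in. We also assume `synthesize` simply reports how long the generated audio is; a real stage would return the audio itself.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    start: float  # seconds into the source audio
    end: float
    text: str


@dataclass
class DubbedSegment:
    start: float
    end: float
    text: str            # translated text
    speed_factor: float  # > 1.0 means the synthesized audio must be sped up


def naive_dub(
    media_path: str,
    transcribe: Callable[[str], List[Segment]],
    translate: Callable[[str], str],
    clone_voice: Callable[[str], str],
    synthesize: Callable[[str, str], float],
) -> List[DubbedSegment]:
    """Chain the five steps. All stage callables are hypothetical stand-ins."""
    segments = transcribe(media_path)    # 1. transcribe the file
    voice_id = clone_voice(media_path)   # 3. clone the voice once per file
    dubbed = []
    for seg in segments:
        slot_seconds = seg.end - seg.start
        if slot_seconds <= 0:
            raise ValueError(f"segment has no duration: {seg}")
        translated = translate(seg.text)                # 2. translate the text
        tts_seconds = synthesize(translated, voice_id)  # 4. text-to-speech
        # 5. ratio by which the new audio must be sped up (or slowed down)
        #    to fit the time slot of the original speech
        speed_factor = tts_seconds / slot_seconds
        dubbed.append(DubbedSegment(seg.start, seg.end, translated, speed_factor))
    return dubbed
```

With stub stages, a two-second source segment whose translation takes three seconds to speak comes back with a speed factor of 1.5, which is exactly the kind of adjustment that starts to sound unnatural.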

Naturally, you’ll run into a ton of edge cases when building a production-ready dubbing solution. We walk through a few of the common problems below.

  • Translation Quality: The first thing a viewer may notice in dubbed content is how accurate and meaningful the translations are. Translation services often fail to capture nuance, deeper context, or cultural specifics in a given piece of content.
  • Speaking Speed Changes: The same sentence often takes a different amount of time to say in another language. This causes sync issues where the audio and video no longer match up. A possible fix is to speed up or slow down the audio, but this can sound unnatural and robotic, especially if the speed changes constantly.
  • Accurate Voice Cloning: Speech varies a lot from person to person: how a voice sounds, the tone in which someone speaks, the accent they use. All of these can be complex to capture, yet they matter if the content is to sound natural. Capturing them on short spoken segments is especially difficult, as AI models have very little to work with to replicate those exact features for every phrase.
  • Background Noise: Most content out there isn’t just people speaking. It includes background noise and other sounds that must carry over into the dubbed content to keep the viewing experience natural. The latest audio source separation models make some of this possible, but making them work in all kinds of noisy environments can be difficult.
  • Multi-speaker content: While a lot of videos involve only a single speaker, many include multiple speakers having back-and-forth conversations. Accurately cloning each phrase of each speaker distinctly is important and difficult, especially since speech diarization techniques can be rather inaccurate.
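
As a small illustration of the speaking speed problem, the sketch below clamps the speed adjustment to a range that still sounds natural and reports how much audio is left over. The function name `fit_speed` and the default bounds of 0.8x and 1.25x are assumptions made for this example, not values used by Sieve.

```python
from typing import Tuple


def fit_speed(
    tts_seconds: float,
    slot_seconds: float,
    min_speed: float = 0.8,   # assumed lower bound, chosen for illustration
    max_speed: float = 1.25,  # assumed upper bound, chosen for illustration
) -> Tuple[float, float]:
    """Return (speed, overrun_seconds) for fitting synthesized speech
    into the time slot of the original speech.

    A positive overrun means the dubbed audio still runs past the slot
    after clamping; a negative one means it ends early.
    """
    if tts_seconds <= 0 or slot_seconds <= 0:
        raise ValueError("durations must be positive")
    ideal = tts_seconds / slot_seconds              # speed needed for a perfect fit
    speed = min(max(ideal, min_speed), max_speed)   # clamp to the allowed range
    overrun = tts_seconds / speed - slot_seconds
    return speed, overrun
```

When the overrun is not zero, stretching the audio alone cannot fix the mismatch, so a production system has to resolve it some other way, for example by producing a shorter translation.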

Our API solves a lot of these problems for you from the get-go, so you have something ready for production immediately along with features that let you customize for your particular use case.

  • Speaker Style Preservation: Preserve the tone and style of the original speaker.
  • Multiple Speaker Support: Distinct voices for each speaker.
  • Broad Range of Languages: Support for 29 popular languages with voice cloning or up to 100 without.
  • Background Noise Addition: Add original background noise back to the dubbed audio for a more natural sounding dub.
  • Language styles: You can specify language styles such as "informal french", "shakespearean english", "brazilian portuguese", etc. (only available with gpt4 translation).
  • Faster than realtime: Dubs are processed faster than realtime.
  • Voice engines: Pick from a variety of voice engines depending on your cost, quality, and speed requirements.
  • Safe words: Specify safe words that you don't want to translate such as names or places.
  • Multi-language inputs: Specify multiple target languages as a comma-separated list to get multiple language dubs at once.
  • Metadata Output: Option to output transcription, translation, and diarization metadata along with the dubbed video or audio.
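
To show how a few of these options might fit together in a request, here is a hedged Python sketch. It does not use the Sieve client, and every field and key name in it (`DubbingOptions`, `target_language`, `translation_engine`, and so on) is hypothetical; check the usage guide for the real parameters. It only encodes two rules stated above: multiple target languages are passed as a comma-separated list, and language styles require gpt4 translation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DubbingOptions:
    # All field names are hypothetical, not Sieve's actual parameters.
    target_languages: List[str]
    translation_engine: str = "gpt4"
    language_style: str = ""
    safe_words: List[str] = field(default_factory=list)
    return_metadata: bool = False

    def to_payload(self) -> Dict[str, object]:
        """Validate the options and build a request payload."""
        if not self.target_languages:
            raise ValueError("at least one target language is required")
        if self.language_style and self.translation_engine != "gpt4":
            # Language styles are only available with gpt4 translation.
            raise ValueError("language styles require gpt4 translation")
        return {
            # Several dubs at once: a comma-separated list of languages.
            "target_language": ",".join(self.target_languages),
            "translation_engine": self.translation_engine,
            "language_style": self.language_style,
            # Words that should not be translated, such as names or places.
            "safe_words": list(self.safe_words),
            "return_metadata": self.return_metadata,
        }
```

For example, `DubbingOptions(["spanish", "french"], safe_words=["Sieve"]).to_payload()` yields a payload whose `target_language` is `"spanish,french"`.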

Some Examples and Use Cases

We've been amazed at all the use cases programmatic dubbing can enable across a wide spectrum of industries, including education, sports, news, entertainment, e-commerce, and more. Below are some example outputs from our API that can give you a sense of what it can do.

  • Sal Khan Teaching Algebra in Spanish: original and dubbed (Spanish)
  • MLK I Have a Dream in Mandarin: original and dubbed (Mandarin)
  • Jensen Huang Speaking "Gen Z" English: original and dubbed (Gen Z English)
  • Charles Barkley Retiring in Arabic: original and dubbed (Arabic)

Try Sieve Dubbing here via the web interface, or integrate via the API using our usage guide. We’re excited to see what you build.