Dubbing 2.0: The highest quality AI dubbing solution for developers
We discuss the latest updates to Sieve's dubbing pipeline and how it offers the best speech quality, translation controls, and pricing for developers.
by Ahmed Hanzala

A few months ago, we released the first AI dubbing solution purpose-built for developers. Since then, we’ve partnered with many enterprise video platforms that have integrated it into their products, and their feedback helped us improve the pipeline and bring it to 2.0.

This is the highest-quality, most cost-effective, fully automated dubbing pipeline available. Below, we’ll walk you through exactly how our solution stands out.

Improving dubbing speech quality

High-quality AI-powered dubbing has only become possible in the last few years, and it arrived with an uncanny valley of issues: awkward translations, monotone speech, and unnatural speaking speed. The reason is that automatically dubbing a video is an intricate process that involves transcribing audio, translating text, cloning voices, running text-to-speech, and placing the generated speech back at the right times in the original video. Each step influences the next, so small inaccuracies compound.
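
To make the compounding effect concrete, here is a small back-of-the-envelope sketch. The 95% figure and the assumption that stages fail independently are illustrative only; they are not measurements of any real pipeline.

```python
from math import prod


def end_to_end_success(stage_success_rates):
    """Probability that every stage succeeds, assuming independent stages."""
    return prod(stage_success_rates)


# Illustrative numbers only: five stages (transcription, translation,
# voice cloning, text-to-speech, placement), each right 95% of the time.
stages = [0.95] * 5
print(f"{end_to_end_success(stages):.3f}")  # prints 0.774
```

Under these assumed numbers, roughly one segment in four would carry at least one error by the time it reaches the final video.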

The biggest complaints we heard from developers trying other solutions revolved around two key problems.

  1. Hallucinations: Text-to-speech models often hallucinate, and many tools developers have tried produce random instances of garbled or gibberish speech.
  2. Speaking Speed: For an audio dub to work, the timing of the words in the target language must closely align with the original. Since the same sentence can be much longer or shorter in another language, this has been a challenging problem, often leading other solutions to make speech sound too fast or too slow.
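
The speaking-speed problem can be sketched as a simple fitting calculation: given the time slot of the original utterance and the natural duration of the translated speech, how much would the speech need to be sped up or slowed down? The `Segment` type, the function names, and the rate limits below are hypothetical and chosen for illustration; this is not a description of Sieve's algorithm.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float            # seconds, start of the original utterance
    end: float              # seconds, end of the original utterance
    dubbed_duration: float  # seconds of generated speech at natural speed


def raw_rate(seg: Segment) -> float:
    """Playback rate needed for the dubbed speech to fill the slot exactly."""
    slot = seg.end - seg.start
    if slot <= 0:
        raise ValueError("segment must have a positive duration")
    return seg.dubbed_duration / slot


def speed_factor(seg: Segment, min_rate: float = 0.85, max_rate: float = 1.25) -> float:
    """Clamp the required rate to a range that still sounds natural (assumed limits)."""
    return max(min_rate, min(max_rate, raw_rate(seg)))


def needs_shorter_translation(seg: Segment, max_rate: float = 1.25) -> bool:
    """True when even the fastest allowed rate cannot fit the speech in the slot."""
    return raw_rate(seg) > max_rate


seg = Segment(start=0.0, end=4.0, dubbed_duration=5.0)
print(speed_factor(seg))  # 1.25
```

When the required rate falls outside the allowed range, changing speed alone is not enough, and a pipeline would have to shorten or lengthen the translation itself.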

Sieve solves both of these problems better than any other automated solution. The examples below compare an alternative dubbing pipeline with Sieve’s pipeline, both using the same underlying text-to-speech model.

Example 1: Steve Jobs

Below is a comparison between Sieve's dubbing pipeline and an alternative provider dubbing the video into Spanish with the same underlying TTS model.

Sieve Dubbing

Alternative Provider

  • 0:00-0:07: Slowness

    • The alternative provider's dub starts with noticeably slow and over-emphasized speech.
  • 0:22-0:30: Unnatural speech flow

    • The word "scientifically" is spoken slowly and overlaps awkwardly in the alternative dub, while Sieve's version maintains a natural flow.
  • 0:41-0:48: Unnatural pacing

    • Words like "siempre" and "cambio" sound unnaturally slow in the alternative dub, contrasting with Sieve's more natural pacing.
  • 1:08-1:10: Speech hallucinations

    • The alternative model introduces a random, nonsensical word, which Sieve's dubbing avoids entirely.
  • 1:12-end: Inconsistent speed and poor timing

    • The alternative dub ends with unusually slow speech and odd intonation, while Sieve's version maintains consistent speed and natural timing throughout.

Example 2: Bill Gates

Below is a comparison between Sieve's dubbing pipeline and an alternative provider dubbing the video into Spanish with the same underlying TTS model.

Sieve Dubbing

Alternative Provider

  • 0:16-0:22: Inconsistent Speech Rate

    • The alternative dub exhibits an unnatural shift from slow to fast speech, disrupting the flow.
  • 0:31-0:34: Disruptive Filler Word

    • A lengthy, unnecessary filler word interrupts the natural flow of speech in the alternative dub.
  • 0:44-0:52: Unnatural Word Emphasis

    • The alternative version tends to over-emphasize certain words, making the speech sound unnatural.
  • 1:01-end: Inconsistent Pacing

    • Words in the alternative dub are over-emphasized and poorly timed, falling out of sync with the visual cues.

Example 3: Andrew Mason

Below is a comparison between Sieve's dubbing pipeline and an alternative provider dubbing the video into Spanish with the same underlying TTS model.

Sieve Dubbing

Alternative Provider

  • 0:00-0:03: Overlapping Voices

    • The alternative dub exhibits multiple voices speaking simultaneously, creating confusion.
  • 0:12-0:14: Unnatural Silence

    • An awkward pause occurs in the alternative dub, despite ongoing speech in the original video.
  • 0:22-0:32: Reduced Speech Rate

    • The alternative dub demonstrates noticeably slowed and slurred speech during this segment.
  • 0:39-end: Voice Duplication

    • Multiple overlapping voices reappear in the alternative dub, diminishing clarity and quality.

These improvements come from algorithms we implemented to detect hallucinations, influence the speaking speed of text-to-speech models, and optimize the placement of generated speech segments.
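
The post does not describe these algorithms, so the following is only a sketch of two generic techniques in the same spirit, not Sieve's implementation. The first flags a generated clip when a re-transcription of it (obtained from any speech-to-text model, passed in here as a plain string) drifts too far from the intended text. The second places clips greedily so that they never overlap. All function names and thresholds are hypothetical.

```python
from difflib import SequenceMatcher


def looks_hallucinated(intended: str, retranscribed: str, threshold: float = 0.8) -> bool:
    """Flag a TTS clip whose re-transcription drifts from the intended text."""
    a = " ".join(intended.lower().split())
    b = " ".join(retranscribed.lower().split())
    return SequenceMatcher(None, a, b).ratio() < threshold


def place_segments(segments, gap: float = 0.05):
    """Greedy placement of (start, duration) clips.

    Each clip keeps its original start when possible; otherwise it is
    pushed to just after the previous clip so voices never overlap.
    Returns a list of (begin, end) times in seconds.
    """
    placed = []
    cursor = 0.0
    for start, duration in sorted(segments):
        begin = max(start, cursor)
        end = begin + duration
        placed.append((begin, end))
        cursor = end + gap
    return placed


# The second clip would overlap the first, so it is shifted later.
print(place_segments([(0.0, 2.0), (1.5, 1.0)]))
```

A flagged clip could be regenerated before placement, which addresses the garbled-speech and overlapping-voices failures seen in the comparisons above.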

Unlike other solutions, Sieve focuses solely on delivering the best pipeline for dubbing while letting developers pick which underlying text-to-speech and translation models they want to use.

More translation controls

While automated translation is great, there are many reasons you might want to influence it, such as sticking to brand guidelines or preventing mistranslation of product names.

If you want certain words translated to specific other words, rather than letting the backend decide for itself, you can pass a JSON object with the required word mappings.

For example:

{
  "China": "Africa",
  "New York": "NYU"
}

Note: You can specify the desired replacement in either the target language or the source language. If you give it in the source language, it will be translated automatically for you.
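
If you build these mappings in code, a small helper can validate them and serialize them to JSON before you send them. This is a minimal sketch: `build_translation_dictionary` is a hypothetical helper, and how the resulting JSON is passed to the dubbing function is not shown here, so check the Dubbing README for the actual parameter.

```python
import json


def build_translation_dictionary(mappings: dict) -> str:
    """Validate word mappings and serialize them as a JSON object string."""
    for source, target in mappings.items():
        if not isinstance(source, str) or not isinstance(target, str):
            raise TypeError("keys and values must both be strings")
        if not source.strip() or not target.strip():
            raise ValueError("keys and values must not be empty")
    # ensure_ascii=False keeps accented characters readable in the output.
    return json.dumps(mappings, ensure_ascii=False)


payload = build_translation_dictionary({"China": "Africa", "New York": "NYU"})
print(payload)  # {"China": "Africa", "New York": "NYU"}
```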

You can also use the translation dictionary to edit your speech by replacing certain words or phrases with desired words or phrases.

For example:

{
  "Oh my God!": "Oh Lord!",
  "It's bad news": "It's great news"
}
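
To preview what such a dictionary would do to a transcript before running a dub, you can apply the replacements locally. The sketch below is an assumption about the behavior (exact, case-sensitive phrase matching, longest phrase first) and may differ from how the backend actually applies the dictionary; `apply_replacements` is a hypothetical helper.

```python
import re


def apply_replacements(text: str, replacements: dict) -> str:
    """Replace each phrase in `text` with its mapped phrase in a single pass."""
    if not replacements:
        return text
    # Try longer phrases first so they win over shorter overlapping ones.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: replacements[match.group(0)], text)


edited = apply_replacements(
    "Oh my God! It's bad news.",
    {"Oh my God!": "Oh Lord!", "It's bad news": "It's great news"},
)
print(edited)  # Oh Lord! It's great news.
```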

A more flexible pricing model

Previously, we charged a flat rate per minute of dubbed content and did not let you bring your own API keys for external services like ElevenLabs or OpenAI. Now, developers have two options.

  1. Get billed directly through Sieve at a flat rate
  2. Enter your own API keys to third-party TTS and LLM services and get billed at a lower flat rate

The second option is especially useful if you already have volume commitments with other providers. If you don’t, you can bill everything through Sieve and avoid the headache of managing multiple vendors. You can learn more about pricing in the Dubbing README.

Conclusion

Sieve’s dubbing pipeline offers flexible, high-quality, cost-effective dubbing for developers. You can try it on your own videos using our playground or integrate via API for your own applications. We’re excited to see developers make use of these updates. If you have any questions, feel free to email us at contact@sievedata.com or join our Discord!