AI Dubbing

Sieve’s AI Dubbing is an entirely automated dubbing API built for developers. It can take in an input video or audio file and seamlessly dub it across many languages.

Note: The API currently accepts files up to 15 minutes long. Please contact us (sales@sievedata.com) for access if you need to dub longer files.

For pricing, see the Pricing section below.


Key Features

  • Speaker Style Preservation: Preserve the tone and style of the original speaker.
  • Multiple Speaker Support: Multi-speaker support with distinct voices for each speaker.
  • Broad Range of Languages: Support for 29 popular languages.
  • Background Noise Addition: Add original background noise back to the dubbed audio for a more natural-sounding dub.
  • Language Styles: Specify language styles such as "informal French", "Shakespearean English", or "Brazilian Portuguese" (only available with the gpt4 translation engine).
  • Multi-Step Dubbing: Set output_mode to translation-only to get the translations as a dictionary, edit them, and then proceed with the dubbing process (see Output Modes below).
  • Faster Than Realtime: Faster than realtime processing of dubs.
  • Choose Voice Engines: Pick from a variety of voice engines depending on your cost, quality, and speed requirements.
  • Safe Words: Specify safe words you don't want to translate, such as names or places.
  • Translation Dictionary: Customize translations by specifying mappings for specific words or phrases to control the output dub.
  • Multi-language Inputs: Specify multiple target languages as a comma-separated list to get multiple language dubs simultaneously.
  • Metadata Output: Option to output transcription, translation, and diarization metadata along with the dubbed video or audio.
  • Lipsyncing (Experimental): The enable_lipsyncing option syncs the lips of the source video to the dubbed audio, which is useful for creating more natural-looking dubs. It is still experimental and may fail at times, especially on longer videos.
  • Preserve Background Audio: You can choose whether or not to preserve the background audio. By default, the preserve_background_audio parameter is set to True. For content such as educational videos, where the background audio is less important, this parameter can be set to False to ignore the background audio.
  • Segment Editing: Edit specific segments of videos by specifying translations of your choice (see Edit Segments below).
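Several of the features above map directly to request parameters. As a hedged illustration, here is how a request payload might be assembled in Python. Parameter names that appear on this page (output_mode, preserve_background_audio, enable_lipsyncing) are used as-is; source_file, safe_words, and translation_dictionary are illustrative guesses for the fields described under Key Features, and the exact payload shape expected by the API is an assumption.

```python
import json

def build_dubbing_request(file_url, target_languages, *, safe_words=None,
                          translation_dictionary=None, output_mode="voice-dubbing",
                          preserve_background_audio=True, enable_lipsyncing=False):
    """Assemble a hypothetical dubbing request payload from this page's parameters."""
    payload = {
        "source_file": file_url,  # illustrative field name
        # multiple target languages as a comma-separated list (per Key Features)
        "target_language": ",".join(target_languages),
        "output_mode": output_mode,
        "preserve_background_audio": preserve_background_audio,
        "enable_lipsyncing": enable_lipsyncing,
    }
    if safe_words:
        payload["safe_words"] = ",".join(safe_words)  # words excluded from translation
    if translation_dictionary:
        payload["translation_dictionary"] = json.dumps(translation_dictionary)
    return payload

request = build_dubbing_request(
    "https://example.com/talk.mp4",
    ["spanish", "french"],
    safe_words=["Sieve", "New York"],
    translation_dictionary={"New York": "NYU"},
)
print(request["target_language"])  # spanish,french
```

The comma-separated target_language value is how the Multi-language Inputs feature above is described; everything else should be checked against the function's actual parameter reference.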

Pricing

A minimum billing duration of 10 seconds applies; any audio shorter than this is billed at the 10-second rate.

Dubbing (Price per Minute)

Voice Engine | Translation Engine | Standard Price | Transcript Provided | Translation Provided
elevenlabs   | gpt4               | $0.535         | $0.487              | $0.350
elevenlabs   | seamless           | $0.518         | $0.470              | $0.333
openai       | gpt4               | $0.403         | $0.325              | $0.253
openai       | seamless           | $0.386         | $0.308              | $0.236

Notes:

  • The prices above are the cost for dubbing per minute of audio.
  • The standard price column indicates the prices when no transcript is provided in the edit_segments parameter.
  • The transcript provided column indicates the prices if the entire transcript for the audio is provided in the edit_segments parameter.
  • The translation provided column indicates the prices if you provide translations for the segments in the edit_segments parameter.
  • If you are editing a segment using the edit_segments parameter, you are only charged for the duration of the segment you are editing.
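As a sketch of how these per-minute rates combine with the 10-second minimum billing duration, the estimate below hard-codes the dubbing rates from the table above. How actual invoices round fractional seconds is an assumption; treat this as an estimate only.

```python
# Per-minute dubbing rates from the table above, keyed by
# (voice_engine, translation_engine) and pricing tier.
DUBBING_RATES = {
    ("elevenlabs", "gpt4"):     {"standard": 0.535, "transcript": 0.487, "translation": 0.350},
    ("elevenlabs", "seamless"): {"standard": 0.518, "transcript": 0.470, "translation": 0.333},
    ("openai", "gpt4"):         {"standard": 0.403, "transcript": 0.325, "translation": 0.253},
    ("openai", "seamless"):     {"standard": 0.386, "transcript": 0.308, "translation": 0.236},
}

def dubbing_cost(seconds, voice_engine, translation_engine, tier="standard"):
    """Estimated cost in dollars, applying the 10-second minimum billing duration."""
    billable = max(seconds, 10)  # anything shorter is billed as 10 seconds
    rate = DUBBING_RATES[(voice_engine, translation_engine)][tier]
    return billable / 60 * rate

# A 5-second clip is billed the same as a 10-second clip:
print(round(dubbing_cost(5, "openai", "seamless"), 4))   # 0.0643
print(round(dubbing_cost(600, "elevenlabs", "gpt4"), 2)) # 5.35
```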

Translation Only (Price per Minute)

Translation Engine | Standard Price | Transcript Provided
gpt4               | $0.117         | $0.051
seamless           | $0.100         | $0.034

Note:

  • The standard price refers to the price when no transcript is provided in the edit_segments parameter.
  • The transcript provided column indicates the prices if the entire transcript for the audio is provided in the edit_segments parameter; in that case, only the segments provided are translated.

Using Your Own Keys

By default, you are billed based on the length of content you submit. Alternatively, you can enter your own OpenAI and ElevenLabs API keys to be billed at your own rates for those services externally, while being charged for the rest of the pipeline via Sieve.

Here's how much it would cost you to use your own keys:

Dubbing Cost (Price per Minute)

Voice Engine | Translation Engine | Standard Price | Transcript Provided | Translation Provided
elevenlabs   | gpt4               | $0.378         | $0.300              | $0.238
elevenlabs   | seamless           | $0.361         | $0.280              | $0.221
openai       | gpt4               | $0.377         | $0.300              | $0.238
openai       | seamless           | $0.361         | $0.280              | $0.221

Translation Only Cost (Price per Minute)

Translation Engine | Standard Price | Transcript Provided
gpt4               | $0.109         | $0.042
seamless           | $0.100         | $0.034

To use your own keys, enter your API keys in the Secrets section:

  • OpenAI Secret: OPENAI_API_KEY
  • ElevenLabs Secret: ELEVEN_LABS_API_KEY

Translation Dictionary Usage

If you wish to translate certain words to specific words, rather than letting the backend decide for itself, you can pass a JSON with the required word mappings.

For Example:

{
 "China": "Africa",
 "New York": "NYU"
}

Note: You can specify the desired translated word in either the target language or the source language; in the latter case, it will be translated automatically for you.

You can also use the translation dictionary to edit your speech by replacing certain words or phrases with desired words or phrases.

For Example:

{
 "Oh my God!": "Oh Lord!",
 "It's bad news" : "It's great news"
}
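The backend applies these mappings during translation, so the behavior on the dubbed output may differ. As a rough local illustration of the substitution semantics, assuming simple whole-phrase replacement:

```python
def apply_translation_dictionary(text, mapping):
    """Replace each source phrase with its mapped target phrase.
    Longest keys are applied first so that e.g. "New York City"
    would win over "New York" if both were present."""
    for source, target in sorted(mapping.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(source, target)
    return text

mapping = {"Oh my God!": "Oh Lord!", "It's bad news": "It's great news"}
print(apply_translation_dictionary("Oh my God! It's bad news.", mapping))
# Oh Lord! It's great news.
```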

Output Modes

The API supports two output modes controlled by the output_mode parameter:

  • translation-only: Returns just the translated text without generating audio. Useful for previewing or validating translations before dubbing.
  • voice-dubbing: Returns the fully dubbed video/audio with translations spoken by the selected voice engine.
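The two modes combine naturally into a review workflow: run translation-only first, edit the translations, then run voice-dubbing with the edits passed back via edit_segments. The helper below sketches the middle step; the exact shape of the translation-only output (here assumed to be a list of segments with start, end, and translation fields) is an assumption.

```python
def to_edit_segments(segments, overrides):
    """Apply translation overrides by segment index and return an
    edit_segments-style list ready for a voice-dubbing run."""
    edited = []
    for i, seg in enumerate(segments):
        edited.append({
            "start": seg["start"],
            "end": seg["end"],
            "translation": overrides.get(i, seg["translation"]),
        })
    return edited

# Hypothetical translation-only output, reviewed and corrected:
segments = [
    {"start": 0, "end": 4, "translation": "Hola, ¿cómo estás?"},
    {"start": 4, "end": 9, "translation": "Bienvenido al espectáculo."},
]
edited = to_edit_segments(segments, {1: "Bienvenidos al show."})
print(edited[1]["translation"])  # Bienvenidos al show.
```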

Edit Segments

The edit_segments parameter allows you to selectively dub or edit specific portions of the media. This is useful for:

  • Fixing translations in an already dubbed video
  • Only dubbing certain segments while leaving others untouched
  • Adding custom translations for specific segments

The parameter accepts a list of segment objects with the following structure:

[
 {
  "start": 0,
  "end": 10,
  "translation": "Hello, how are you?"
 }
]

Note:

  • start: The start time of the segment in seconds. (required)
  • end: The end time of the segment in seconds. (required)
  • translation: The translation for the segment. (optional)
  • text: The text for the segment. (optional)
  • speaker: The speaker for the segment. (optional)
  • The start and end times should precisely match the segments you'd like to edit. If the translation is too short or too long relative to the duration of the segment, audio quality may be degraded.
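The rules above can be checked client-side before submitting a request. This is a minimal sketch based only on the field list given here; the server's actual validation behavior is unknown.

```python
def validate_edit_segments(segments):
    """Check edit_segments entries against the structural rules above;
    returns a list of problem descriptions (empty if all segments pass)."""
    problems = []
    allowed = {"start", "end", "translation", "text", "speaker"}
    for i, seg in enumerate(segments):
        if "start" not in seg or "end" not in seg:
            problems.append(f"segment {i}: 'start' and 'end' are required")
            continue
        if seg["end"] <= seg["start"]:
            problems.append(f"segment {i}: 'end' must be after 'start'")
        unknown = set(seg) - allowed
        if unknown:
            problems.append(f"segment {i}: unknown keys {sorted(unknown)}")
    return problems

print(validate_edit_segments([{"start": 0, "end": 10, "translation": "Hello"}]))  # []
print(validate_edit_segments([{"start": 5, "end": 5}]))
```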

Supported Languages

Sieve's AI Dubbing supports different languages depending on the voice_engine you use. With elevenlabs for TTS, you can dub into the following 29 languages. With openai for TTS, you can dub into 100+ languages, though voice cloning is not supported for openai.

  • 🇺🇸 English
  • 🇮🇳 Hindi
  • 🇵🇹 Portuguese
  • 🇨🇳 Mandarin (Chinese)
  • 🇪🇸 Spanish
  • 🇫🇷 French
  • 🇩🇪 German
  • 🇯🇵 Japanese
  • 🇦🇪 Arabic
  • 🇷🇺 Russian
  • 🇰🇷 Korean
  • 🇮🇩 Indonesian
  • 🇮🇹 Italian
  • 🇳🇱 Dutch
  • 🇹🇷 Turkish
  • 🇵🇱 Polish
  • 🇸🇪 Swedish
  • 🇵🇭 Tagalog (Filipino)
  • 🇲🇾 Malay
  • 🇷🇴 Romanian
  • 🇺🇦 Ukrainian
  • 🇬🇷 Greek
  • 🇨🇿 Czech
  • 🇩🇰 Danish
  • 🇫🇮 Finnish
  • 🇧🇬 Bulgarian
  • 🇭🇷 Croatian
  • 🇸🇰 Slovak
  • 🇮🇳 Tamil
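Since language support depends on the voice engine, a pre-flight check can catch unsupported targets before a job is submitted. The set below mirrors the 29-language list above (lowercased, with "Mandarin (Chinese)" and "Tagalog (Filipino)" simplified to one name each; accepted spellings and aliases on the API side are an assumption).

```python
ELEVENLABS_LANGUAGES = {
    "english", "hindi", "portuguese", "mandarin", "spanish", "french",
    "german", "japanese", "arabic", "russian", "korean", "indonesian",
    "italian", "dutch", "turkish", "polish", "swedish", "tagalog",
    "malay", "romanian", "ukrainian", "greek", "czech", "danish",
    "finnish", "bulgarian", "croatian", "slovak", "tamil",
}

def check_target_language(language, voice_engine):
    """True if the target language is known to be dubbable with this engine.
    openai supports 100+ languages, so this sketch only rejects targets
    for elevenlabs, whose 29-language list is given above."""
    if voice_engine == "elevenlabs":
        return language.lower() in ELEVENLABS_LANGUAGES
    return True  # openai: broad support; exact list not given on this page

print(check_target_language("Tamil", "elevenlabs"))    # True
print(check_target_language("Swahili", "elevenlabs"))  # False
```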