Translate any video or audio to several languages

AI Translation & Lip-Syncing

Take a video or audio sample of multiple people speaking and dub the audio into any language of your choosing. If the input is a video, you can also lip-sync the dub to the source footage. We clone individual speakers and dub their audio in the target language.

Note: Processing time depends on the length of the input (and, for video, also on its resolution). As a rule of thumb, it takes about 1 second per second of audio, and about 12 seconds per second of video when generating lip-synced output with default settings.

Options

  • source_file: An audio or video input file to dub.
  • target_language: The language to which the audio will be translated. Default is "spanish".
  • enable_lipsyncing: Whether to lip-sync the original video to the dubbed audio. Defaults to False. Only applicable to video inputs; when disabled, the dubbed audio is returned on its own.
  • translation_backend: The translation backend to use. Defaults to "seamless". Other options are "gpt4", "mixtral", and "deepl".
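As a sketch, the options above can be assembled into a job payload like this. The helper name and payload shape are illustrative (not the actual Sieve client API); the defaults follow the list above.

```python
# Illustrative helper (not the actual Sieve client API): assemble the
# documented options, applying the documented defaults.
def build_dubbing_options(source_file,
                          target_language="spanish",
                          enable_lipsyncing=False,
                          translation_backend="seamless"):
    valid_backends = {"seamless", "gpt4", "mixtral", "deepl"}
    if translation_backend not in valid_backends:
        raise ValueError(
            f"translation_backend must be one of {sorted(valid_backends)}"
        )
    return {
        "source_file": source_file,
        "target_language": target_language,
        "enable_lipsyncing": enable_lipsyncing,
        "translation_backend": translation_backend,
    }

# Example: dub a video into French with lip-syncing enabled.
options = build_dubbing_options("interview.mp4",
                                target_language="french",
                                enable_lipsyncing=True)
```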

Other Translation Backends

By default, we use Seamless Translation to dub. You can also use third-party translation APIs in this app, such as OpenAI, Mixtral, and DeepL. To test other backends, you must set the following environment variables in your Sieve secrets:

  • OPENAI_API_KEY (OpenAI API key) if you use the gpt4 backend
  • TOGETHER_API_KEY (Together.ai API key) if you use the mixtral backend
  • DEEPL_API_KEY (DeepL API key) if you use the deepl backend
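One way to catch a missing secret before submitting a job is to map each backend to its required variable. This is a minimal local sketch mirroring the list above; the default seamless backend needs no third-party key.

```python
import os

# Env var required by each third-party backend, per the list above.
REQUIRED_SECRET = {
    "gpt4": "OPENAI_API_KEY",
    "mixtral": "TOGETHER_API_KEY",
    "deepl": "DEEPL_API_KEY",
}

def check_backend_secret(backend, env=None):
    """Raise RuntimeError if the chosen backend's API key is not set.

    The default "seamless" backend needs no third-party key, so it
    always passes.
    """
    env = os.environ if env is None else env
    var = REQUIRED_SECRET.get(backend)
    if var is not None and var not in env:
        raise RuntimeError(
            f"Set {var} in your Sieve secrets to use the {backend!r} backend"
        )
```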

Currently, the target languages supported are:

  • Afrikaans
  • Amharic
  • Arabic
  • Armenian
  • Assamese
  • Azerbaijani
  • Belarusian
  • Bengali
  • Bosnian
  • Bulgarian
  • Burmese
  • Cantonese
  • Catalan
  • Cebuano
  • Chinese
  • Croatian
  • Czech
  • Danish
  • Dutch
  • English
  • Estonian
  • Finnish
  • French
  • Galician
  • Ganda
  • Georgian
  • German
  • Greek
  • Gujarati
  • Hebrew
  • Hindi
  • Hungarian
  • Icelandic
  • Igbo
  • Indonesian
  • Irish
  • Italian
  • Japanese
  • Javanese
  • Kannada
  • Kazakh
  • Khmer
  • Korean
  • Lao
  • Latvian
  • Lithuanian
  • Luo
  • Macedonian
  • Maithili
  • Malayalam
  • Maltese
  • Marathi
  • Meitei
  • Nepali
  • Nyanja
  • Odia
  • Polish
  • Portuguese
  • Punjabi
  • Romanian
  • Russian
  • Serbian
  • Shona
  • Sindhi
  • Slovak
  • Slovenian
  • Somali
  • Spanish
  • Swahili
  • Swedish
  • Tagalog
  • Tajik
  • Tamil
  • Telugu
  • Thai
  • Turkish
  • Ukrainian
  • Urdu
  • Vietnamese
  • Welsh
  • Yoruba
  • Zulu
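Since target_language is passed as a lowercase string (the default is "spanish"), a small validation helper can catch typos before a job is submitted. This is a sketch using an abridged subset of the list above; the case-insensitive normalization rule is an assumption.

```python
# Abridged subset of the supported target languages listed above.
SUPPORTED_LANGUAGES = {
    "english", "spanish", "french", "german", "italian", "portuguese",
    "japanese", "korean", "hindi", "arabic", "russian", "turkish",
}

def normalize_target_language(name):
    """Lowercase and strip the name; raise ValueError if unsupported."""
    normalized = name.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported target_language: {name!r}")
    return normalized
```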

Known Limitations

  • Inputs with a lot of background noise sometimes fail to pick up speech, which leads to poor translation.
  • Speaker identification sometimes leaks into the next speaker. This can lead to the wrong voice being used for the beginning of a speaker's segment.
  • The translated text is sometimes longer or shorter than the original text, which can lead to very fast / slow speech.
  • Lip-syncing quality degrades when artifacts block the face, and when the face moves around too much.
  • Processing time grows with the number of speakers and scene cuts.

© Copyright 2024. All rights reserved.