AI Translation & Lip-Syncing
Take a video or audio sample of multiple people speaking and dub the audio into any supported language. If the input is a video, you can also lip-sync the dub onto the source material. We clone individual speakers and dub their audio in the target language.
Note: Processing time depends on the length of the input and, for video, on its resolution. As a rule of thumb, dubbing takes about 1 second per second of audio, and generating a lip-synced video takes about 12 seconds per second of video with default settings.
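The rule of thumb above can be sketched as a quick back-of-the-envelope estimator. This helper is illustrative only: the function name is hypothetical and the per-second constants come from the note above, not from the app's API.

```python
def estimate_processing_seconds(duration_seconds: float,
                                enable_lipsyncing: bool = False) -> float:
    """Rough estimate from the rule of thumb above:
    ~1 second of processing per second of audio for dubbing alone,
    ~12 seconds per second of video when lip-syncing is enabled.
    Actual times vary with resolution and settings."""
    per_second = 12.0 if enable_lipsyncing else 1.0
    return duration_seconds * per_second

# A 90-second clip: ~90 s for audio-only dubbing, ~1080 s with lip-syncing.
print(estimate_processing_seconds(90))                          # 90.0
print(estimate_processing_seconds(90, enable_lipsyncing=True))  # 1080.0
```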
Options
- source_file: An audio or video input file to dub.
- target_language: The language to which the audio will be translated. Default is "spanish".
- enable_lipsyncing: Whether to lip-sync the original video to the dubbed audio. Defaults to False. Only applicable to video inputs; otherwise, audio is returned.
- translation_backend: The translation backend to use. Defaults to "seamless". Other options are "gpt4", "mixtral", and "deepl".
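A minimal sketch of assembling these options in Python before submitting a job. The helper name is hypothetical; only the option names, defaults, and allowed backend values come from the list above.

```python
# Backends documented for this app; "seamless" is the default.
VALID_BACKENDS = {"seamless", "gpt4", "mixtral", "deepl"}

def build_dub_options(source_file: str,
                      target_language: str = "spanish",
                      enable_lipsyncing: bool = False,
                      translation_backend: str = "seamless") -> dict:
    """Collect the app's options into a dict, applying the documented
    defaults and rejecting unknown translation backends."""
    if translation_backend not in VALID_BACKENDS:
        raise ValueError(f"unknown translation_backend: {translation_backend!r}")
    return {
        "source_file": source_file,
        "target_language": target_language,
        "enable_lipsyncing": enable_lipsyncing,
        "translation_backend": translation_backend,
    }

opts = build_dub_options("interview.mp4", target_language="french",
                         enable_lipsyncing=True)
print(opts["translation_backend"])  # seamless (the default)
```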
Other Translation Backends
By default, we use Seamless Translation to dub. You can also use third-party translation APIs such as OpenAI, Mixtral, and DeepL. To test these other backends, you must set the following environment variables in your Sieve secrets:
- OPENAI_API_KEY (OpenAI API key) if you use the gpt4 backend
- TOGETHER_API_KEY (Together.ai API key) if you use the mixtral backend
- DEEPL_API_KEY (DeepL API key) if you use the deepl backend
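The backend-to-secret mapping above can be checked programmatically before kicking off a run. This is an illustrative sketch, not part of the app: the mapping follows the list above, and the helper name is made up.

```python
import os
from typing import Optional

# Which environment variable each backend requires (None = no 3rd-party key).
REQUIRED_SECRET = {
    "seamless": None,  # default backend, runs without any extra key
    "gpt4": "OPENAI_API_KEY",
    "mixtral": "TOGETHER_API_KEY",
    "deepl": "DEEPL_API_KEY",
}

def missing_secret(backend: str) -> Optional[str]:
    """Return the name of the environment variable that must be set for
    this backend but currently is not, or None if nothing is missing."""
    var = REQUIRED_SECRET[backend]
    if var is not None and not os.environ.get(var):
        return var
    return None
```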
Currently, the target languages supported are:
- Afrikaans
- Amharic
- Arabic
- Armenian
- Assamese
- Azerbaijani
- Belarusian
- Bengali
- Bosnian
- Bulgarian
- Burmese
- Cantonese
- Catalan
- Cebuano
- Chinese
- Croatian
- Czech
- Danish
- Dutch
- English
- Estonian
- Finnish
- French
- Galician
- Ganda
- Georgian
- German
- Greek
- Gujarati
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Igbo
- Indonesian
- Irish
- Italian
- Japanese
- Javanese
- Kannada
- Kazakh
- Khmer
- Korean
- Lao
- Latvian
- Lithuanian
- Luo
- Macedonian
- Maithili
- Malayalam
- Maltese
- Marathi
- Meitei
- Nepali
- Nyanja
- Odia
- Polish
- Portuguese
- Punjabi
- Romanian
- Russian
- Serbian
- Shona
- Sindhi
- Slovak
- Slovenian
- Somali
- Spanish
- Swahili
- Swedish
- Tagalog
- Tajik
- Tamil
- Telugu
- Thai
- Turkish
- Ukrainian
- Urdu
- Vietnamese
- Welsh
- Yoruba
- Zulu
Known Limitations
- Inputs with a lot of background noise sometimes fail to pick up speech, which leads to poor translation.
- Speaker identification sometimes leaks into the next speaker. This can lead to the wrong voice being used for the beginning of a speaker's segment.
- The translated text is sometimes longer or shorter than the original text, which can lead to very fast / slow speech.
- Lip-syncing quality degrades when the face is obstructed or moves around too much.
- Processing time scales directly with the number of speakers and scene cuts.