Translating videos involves tasks like generating subtitles, dubbing with synchronized audio, and replacing on-screen text. In this tutorial, we will explore how to implement different components of video translation using Sieve’s APIs. Specifically, we’ll cover:
- Subtitle translation
- Voice dubbing
- Lipsync dubbing
- On-screen text translation
Sieve offers a versatile pipeline called sieve/dubbing, capable of handling all of these components. Additionally, Sieve provides dedicated functions for each specific operation, allowing you to customize the process to suit your needs. By the end of this guide, you’ll understand how to leverage these pipelines to create an efficient video translation pipeline that suits your use case.
Use Cases of Video Translation
Video translation enables effective global communication across industries. Key applications include:
- Media and Entertainment: Translate movies, shows, and documentaries to reach international audiences.
- E-Learning: Localize courses and tutorials for global learners.
- Corporate Training: Provide translated training and onboarding materials for multinational teams.
- Social Media: Help creators expand reach with subtitles and dubbing.
- Marketing: Adapt promotional content for diverse audiences.
- Gaming: Localize in-game videos and trailers for global players.
- Public Outreach: Translate videos for awareness campaigns and multilingual initiatives.
Subtitle Translation
Subtitles enable viewers to follow dialogue and essential audio elements in their native language. To show translated subtitles, we need to perform the following steps:
- Transcription: Convert original dialogue of the video into text using a transcription model.
- Translation: Translate the transcribed text from the source language to the target language using a translation model.
Implementation with Sieve
Sieve offers the sieve/dubbing pipeline in translation-only mode, which returns just the translated text with sentence-level timestamps for alignment. The primary use of this output mode is to enable human-in-the-loop dubbing, though it can also be used to simply generate translated subtitles. Below is a simple implementation.
import sieve
source_file = sieve.File(url="https://storage.googleapis.com/sieve-prod-us-central1-public-file-upload-bucket/c4d968f5-f25a-412b-9102-5b6ab6dafcb4/27953634-d321-4c67-b199-5b06cf11f5c8-markWiens_pizza_italy_52s.mp4")
dubbing = sieve.function.get("sieve/dubbing")
dubbed_files = dubbing.run(
    source_file = source_file,
    target_language = "italian",
    output_mode = "translation-only",
    return_transcript = True
)
for dubbed_file in dubbed_files:
    print("Dubbed media path:", dubbed_file.path)
Specialized Functions for Advanced Needs
If you want more granular control over options like word-level timestamps, diarization, and audio chunking during transcription, you can leverage sieve/transcribe to extract subtitles.
import sieve
file = sieve.File(url="https://storage.googleapis.com/sieve-prod-us-central1-public-file-upload-bucket/c4d968f5-f25a-412b-9102-5b6ab6dafcb4/78baa57b-fb88-42f6-ba4a-3f0c0807dc8c-markWiens_pizza_italy_52s.mp4")
backend = "whisper-zero"
word_level_timestamps = True
transcribe = sieve.function.get("sieve/transcribe")
chunks = transcribe.run(file, backend, word_level_timestamps)
for chunk in chunks:
    print(chunk)
Here is an input video we can try transcribing (Source).
Transcription Result
Today we're going to go eat the best pizza in Italy. By coincidence, just 30 minutes away from each other in the mountains outside of Naples, Italy, you will find two pizzerias. One that serves the best rated pizza in the entire world, and one that has the world's best pizza chef. I have a puddle of cheese in my mouth. Oh wow, the toppings rearranged, engineered to perfection. I've traveled from across the world to find out why this is the best pizza on earth, and today we're eating both of them.
After doing this, you can use sieve/translate for translation, which gives you granular control over backends, styles, and additional forms of prompting.
import sieve
text = "Today we\'re going to go eat the best pizza in Italy. By coincidence, just 30 minutes away from each other in the mountains outside of Naples, Italy, you will find two pizzerias. One that serves the best rated pizza in the entire world, and one that has the world\'s best pizza chef. I have a puddle of cheese in my mouth. Oh wow, the toppings rearranged, engineered to perfection. I\'ve traveled from across the world to find out why this is the best pizza on earth, and today we\'re eating both of them."
translate = sieve.function.get("sieve/translate")
output = translate.run(
    text,
    target = "italian",
    translation_style = "informal"
)
print(output)
Translation Result
Oggi andiamo a mangiare la migliore pizza d'Italia. Per coincidenza, a soli 30 minuti l'uno dall'altro tra le montagne fuori Napoli, in Italia, troverai due pizzerie. Una che serve la pizza più votata dell'intero mondo, e un'altra che ha il miglior pizzaiolo del mondo. Ho una pozza di formaggio in bocca. Oh wow, i condimenti si sono riorganizzati, ingegnerizzati alla perfezione. Ho viaggiato da tutto il mondo per scoprire perché questa è la migliore pizza sulla terra, e oggi le mangeremo entrambe.
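If your goal is a subtitle file rather than raw text, you can stitch these two functions together yourself: take the timestamps from sieve/transcribe, translate each segment with sieve/translate, and write standard SRT entries. The sketch below assumes each transcription segment exposes start, end, and text fields; that schema is an assumption, so adjust the field access to whatever sieve/transcribe actually returns.
import sieve

def seconds_to_srt_time(seconds):
    # Convert seconds into the HH:MM:SS,mmm format used by SRT files.
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def build_translated_srt(segments, target_language, srt_path="subtitles.srt"):
    # `segments` is assumed to be a list of dicts with "start", "end", and
    # "text" keys collected from the sieve/transcribe output above; the real
    # schema may differ.
    translate = sieve.function.get("sieve/translate")
    lines = []
    for index, segment in enumerate(segments, start=1):
        translated = translate.run(segment["text"], target=target_language)
        lines.append(str(index))
        lines.append(f"{seconds_to_srt_time(segment['start'])} --> {seconds_to_srt_time(segment['end'])}")
        lines.append(str(translated))
        lines.append("")
    with open(srt_path, "w", encoding="utf-8") as srt_file:
        srt_file.write("\n".join(lines))
    return srt_path
Translating segment by segment keeps the timestamps aligned, at the cost of some cross-sentence context; translating the full transcript in one call (as shown above) generally reads more naturally. The resulting .srt file can then be loaded as a subtitle track by any standard player.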
Voice Dubbing
Voice dubbing enables developers to replace a video’s existing audio track with a translated one. We can do this using the sieve/dubbing pipeline in voice-dubbing mode.
import sieve
source_file = sieve.File(url="https://storage.googleapis.com/sieve-prod-us-central1-public-file-upload-bucket/c4d968f5-f25a-412b-9102-5b6ab6dafcb4/49ce7d0c-c8f4-41f5-8673-68bb4f1d98f9-markWiens_pizza_italy_52s.mp4")
dubbing = sieve.function.get("sieve/dubbing")
output = dubbing.run(
    source_file,
    target_language = "italian",
    output_mode = "voice-dubbing"
)
for dubbed_language in output:
    print("Dubbed media:", dubbed_language.path)
Below is the video dubbed into Italian.
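The path attribute on each returned file typically resolves to a local copy of the dubbed video, so if you want to keep the result after the script finishes, copy it somewhere permanent. A minimal sketch, replacing the print loop above (the output file name is illustrative):
import shutil

# Save each dubbed output as it is produced; the file name is illustrative.
for index, dubbed_language in enumerate(output):
    destination = f"dubbed_italian_{index}.mp4"
    shutil.copy(dubbed_language.path, destination)
    print("Saved dubbed video to:", destination)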
Lipsync Dubbing
Lipsync dubbing ensures that the translated voiceovers match the lip movements of the on-screen speakers, creating a natural and synchronized appearance. To implement this form of dubbing, you can use sieve/dubbing to generate the translated audio and then use sieve/lipsync to synchronize audio and visual elements.
import sieve
file = sieve.File(url="https://storage.googleapis.com/sieve-prod-us-central1-public-file-upload-bucket/c4d968f5-f25a-412b-9102-5b6ab6dafcb4/98a2ee22-bb8c-4781-a5db-6d8a26613d8e-2_markWiens_italy_lipsync_60-100s.mp4")
audio = sieve.File(url="https://storage.googleapis.com/sieve-prod-us-central1-public-file-upload-bucket/c4d968f5-f25a-412b-9102-5b6ab6dafcb4/f2c69c02-b308-44e2-bb17-a6e38e802c24-2_italian_audio.wav")
enhance = "default"
backend = "sievesync"
downsample = False
cut_by = "shortest"
lipsync = sieve.function.get("sieve/lipsync")
output = lipsync.run(file, audio, enhance, backend, downsample, cut_by)
print(output)
Below is an example output from lipsyncing.
Note: You can also achieve this in a single step by using sieve/dubbing with the enable_lipsyncing option.
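For reference, a single-step version might look like the sketch below. It assumes enable_lipsyncing is passed as a keyword argument alongside the voice-dubbing options shown earlier; check the sieve/dubbing reference for the exact parameter name and shape.
import sieve

source_file = sieve.File(url="https://storage.googleapis.com/sieve-prod-us-central1-public-file-upload-bucket/c4d968f5-f25a-412b-9102-5b6ab6dafcb4/49ce7d0c-c8f4-41f5-8673-68bb4f1d98f9-markWiens_pizza_italy_52s.mp4")
dubbing = sieve.function.get("sieve/dubbing")

# Assumed single-step call: voice-dub and lipsync in one pass via the
# enable_lipsyncing option mentioned above (parameter shape may differ).
output = dubbing.run(
    source_file,
    target_language = "italian",
    output_mode = "voice-dubbing",
    enable_lipsyncing = True
)
for dubbed_language in output:
    print("Lipsynced dubbed media:", dubbed_language.path)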
On-Screen Text Translation
On-screen text translation involves identifying and translating text visible in the video, such as captions, signs, or graphics. While Sieve doesn’t have anything available out-of-the-box for this, a methodology similar to the following would allow for it (see the sketch after the list).
- Extract frames containing on-screen text
- Use OCR (Optical Character Recognition) to detect and extract text. You can perform this using the sieve/florence-2 function.
- Translate the text into the target language.
- Inpaint (remove) the original text and overlay the translated version.
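Below is a rough sketch of that workflow for a single extracted frame. It uses OpenCV for the inpainting and overlay steps, and it assumes sieve/florence-2 accepts an image together with an OCR task prompt and returns text regions with bounding boxes; those input and output shapes are assumptions to verify against the function’s reference, as is the simple cv2.putText overlay (which only handles basic Latin glyphs).
import cv2
import numpy as np
import sieve

def translate_frame_text(frame_path, target_language="italian"):
    # 1. OCR: detect on-screen text with sieve/florence-2. The task prompt and
    #    the shape of `detections` are assumptions, not the documented interface.
    ocr = sieve.function.get("sieve/florence-2")
    detections = ocr.run(sieve.File(path=frame_path), task_prompt="<OCR_WITH_REGION>")

    # 2. Translate each detected string with sieve/translate.
    translate = sieve.function.get("sieve/translate")

    frame = cv2.imread(frame_path)
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    overlays = []
    for region in detections:  # assumed: iterable of {"text": ..., "box": (x, y, w, h)}
        x, y, w, h = region["box"]
        mask[y:y + h, x:x + w] = 255
        translated = str(translate.run(region["text"], target=target_language))
        overlays.append((translated, (x, y + h)))

    # 3. Inpaint (remove) the original text, then draw the translated strings.
    cleaned = cv2.inpaint(frame, mask, 3, cv2.INPAINT_TELEA)
    for text, origin in overlays:
        cv2.putText(cleaned, text, origin, cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)

    output_path = "translated_frame.png"
    cv2.imwrite(output_path, cleaned)
    return output_path
Frames themselves can be sampled from the video with cv2.VideoCapture, and the edited frames stitched back into the video afterwards.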
Conclusion
Building a video translation tool can seem complex, but Sieve’s pipelines simplify the process by offering specialized functions for every aspect of video localization. From generating subtitles to creating synchronized voiceovers and lipsynced dubs, these tools empower developers with modular and scalable solutions to meet diverse project needs. If you’re interested in building something similar and have questions for our team, feel free to join our Discord or reach out to us at contact@sievedata.com.