Building a Comprehensive Video Translation Tool: Subtitles, Voices, Lipsync, and On-Screen Text
A guide to generating subtitles, voices, lipsync, and translated on-screen text for video translation using Sieve.
by Akshara Soman

Translating videos involves tasks like generating subtitles, dubbing with synchronized audio, and replacing on-screen text. In this tutorial, we will explore how to implement different components of video translation using Sieve’s APIs. Specifically, we’ll cover:

  1. Subtitle translation
  2. Voice dubbing
  3. Lipsync dubbing
  4. On-screen text translation

Sieve offers a versatile pipeline called sieve/dubbing that covers the subtitle, voice dubbing, and lipsync steps. Sieve also provides dedicated functions for individual operations, so you can customize each stage to suit your needs. By the end of this guide, you’ll understand how to combine these building blocks into a video translation workflow that fits your use case.

Use Cases of Video Translation

Video translation enables effective global communication across industries. Key applications include:

  1. Media and Entertainment: Translate movies, shows, and documentaries to reach international audiences.
  2. E-Learning: Localize courses and tutorials for global learners.
  3. Corporate Training: Provide translated training and onboarding materials for multinational teams.
  4. Social Media: Help creators expand reach with subtitles and dubbing.
  5. Marketing: Adapt promotional content for diverse audiences.
  6. Gaming: Localize in-game videos and trailers for global players.
  7. Public Outreach: Translate videos for awareness campaigns and multilingual initiatives.

Subtitle Translation

Subtitles enable viewers to follow dialogue and essential audio elements in their native language. To show translated subtitles, we need to perform the following steps.

  1. Transcription: Convert the video's original dialogue into text using a transcription model.
  2. Translation: Translate the transcribed text from the source language to the target language using a translation model.

Implementation with Sieve

The sieve/dubbing pipeline has a translation-only mode, which returns just the translated text with sentence-level timestamps for alignment. This output mode is primarily intended for human-in-the-loop dubbing, but it also works for simply generating translated subtitles. Below is a simple implementation.

import sieve

source_file = sieve.File(url="https://storage.googleapis.com/sieve-prod-us-central1-public-file-upload-bucket/c4d968f5-f25a-412b-9102-5b6ab6dafcb4/27953634-d321-4c67-b199-5b06cf11f5c8-markWiens_pizza_italy_52s.mp4")

dubbing = sieve.function.get("sieve/dubbing")
# translation-only mode returns the translated text with sentence-level timestamps
dubbed_files = dubbing.run(
    source_file = source_file,
    target_language = "italian",
    output_mode = "translation-only",
    return_transcript = True
)

for dubbed_file in dubbed_files:
    print("Dubbed media path:", dubbed_file.path)
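The translation-only output carries sentence-level timestamps, which is exactly what a subtitle file needs. The following is a minimal sketch of turning such segments into SRT text. It assumes the segments have already been parsed into dicts with hypothetical start, end, and text keys (times in seconds); the exact shape of the sieve/dubbing output is not shown in this guide, so adapt the parsing to what you actually receive.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render a list of {"start", "end", "text"} dicts as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}")
    # SRT cues are separated by a blank line
    return "\n\n".join(blocks) + "\n"


# Hypothetical segments: the timestamps are illustrative, not real pipeline output
segments = [
    {"start": 0.0, "end": 2.5, "text": "Oggi andiamo a mangiare la migliore pizza d'Italia."},
    {"start": 2.5, "end": 4.0, "text": "Ho una pozza di formaggio in bocca."},
]
srt_text = segments_to_srt(segments)
print(srt_text)
```

Writing srt_text to a .srt file gives you subtitles that most video players can load alongside the original video.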

Specialized Functions for Advanced Needs

If you want more granular control over options like word-level timestamps, diarization, and audio chunking during transcription, you can use sieve/transcribe to extract subtitles.

import sieve

file = sieve.File(url="https://storage.googleapis.com/sieve-prod-us-central1-public-file-upload-bucket/c4d968f5-f25a-412b-9102-5b6ab6dafcb4/78baa57b-fb88-42f6-ba4a-3f0c0807dc8c-markWiens_pizza_italy_52s.mp4")
backend = "whisper-zero"
word_level_timestamps = True

transcribe = sieve.function.get("sieve/transcribe")
chunks = transcribe.run(file, backend, word_level_timestamps)

for chunk in chunks:
    print(chunk)

Here is an input video we can try transcribing (Source).

Transcription Result

Today we're going to go eat the best pizza in Italy. By coincidence, just 30 minutes away from each other in the mountains outside of Naples, Italy, you will find two pizzerias. One that serves the best rated pizza in the entire world, and one that has the world's best pizza chef. I have a puddle of cheese in my mouth. Oh wow, the toppings rearranged, engineered to perfection. I've traveled from across the world to find out why this is the best pizza on earth, and today we're eating both of them.
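Word-level timestamps are useful when you want to control how subtitle lines are broken up. As a rough sketch, the helper below groups words into cues, starting a new cue when a line would get too long or stay on screen too long. The word format (dicts with word, start, and end keys) and the default limits are assumptions for illustration, not the documented output of sieve/transcribe.

```python
def _make_cue(words):
    """Collapse a run of word entries into one cue with start, end, and text."""
    return {
        "start": words[0]["start"],
        "end": words[-1]["end"],
        "text": " ".join(w["word"] for w in words),
    }


def group_words(words, max_chars=42, max_duration=5.0):
    """Group word-level timestamps into subtitle cues.

    A new cue starts when adding the next word would exceed max_chars
    or stretch the cue past max_duration seconds.
    """
    cues = []
    current = []
    for word in words:
        if current:
            text_len = len(" ".join(w["word"] for w in current)) + 1 + len(word["word"])
            duration = word["end"] - current[0]["start"]
            if text_len > max_chars or duration > max_duration:
                cues.append(_make_cue(current))
                current = []
        current.append(word)
    if current:
        cues.append(_make_cue(current))
    return cues


# Hypothetical word entries: the timings are made up for the example
words = [
    {"word": "Today", "start": 0.0, "end": 0.3},
    {"word": "we're", "start": 0.3, "end": 0.5},
    {"word": "going", "start": 0.5, "end": 0.8},
    {"word": "to", "start": 0.8, "end": 0.9},
    {"word": "eat", "start": 0.9, "end": 1.1},
    {"word": "pizza.", "start": 1.1, "end": 1.6},
]
cues = group_words(words, max_chars=16)
for cue in cues:
    print(cue)
```

A small max_chars is used here only to force line breaks in a short example; values around 40 characters per line are more typical for real subtitles.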

After transcription, you can pass the text to sieve/translate, which gives you granular control over backends, translation styles, and additional forms of prompting.

import sieve

text = "Today we're going to go eat the best pizza in Italy. By coincidence, just 30 minutes away from each other in the mountains outside of Naples, Italy, you will find two pizzerias. One that serves the best rated pizza in the entire world, and one that has the world's best pizza chef. I have a puddle of cheese in my mouth. Oh wow, the toppings rearranged, engineered to perfection. I've traveled from across the world to find out why this is the best pizza on earth, and today we're eating both of them."
translate = sieve.function.get("sieve/translate")

output = translate.run(
    text,
    target = "italian",
    translation_style = "informal"
)

print(output)

Translation Result

Oggi andiamo a mangiare la migliore pizza d'Italia. Per coincidenza, a soli 30 minuti l'uno dall'altro tra le montagne fuori Napoli, in Italia, troverai due pizzerie. Una che serve la pizza più votata dell'intero mondo, e un'altra che ha il miglior pizzaiolo del mondo. Ho una pozza di formaggio in bocca. Oh wow, i condimenti si sono riorganizzati, ingegnerizzati alla perfezione. Ho viaggiato da tutto il mondo per scoprire perché questa è la migliore pizza sulla terra, e oggi le mangeremo entrambe.
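Translating the whole transcript as one string, as above, loses the timestamps you need for subtitles. One possible approach, sketched below, is to translate segment by segment and carry each segment's timing over unchanged. The translate_fn argument is a placeholder for whatever translation call you use (for example, a thin wrapper around the sieve/translate call above); the segment format and the stand-in translator are assumptions for illustration.

```python
def translate_segments(segments, translate_fn):
    """Return copies of segments with translated text and unchanged timestamps.

    translate_fn is any callable mapping source text to translated text.
    Repeated source strings are translated only once.
    """
    cache = {}
    translated = []
    for seg in segments:
        source = seg["text"]
        if source not in cache:
            cache[source] = translate_fn(source)
        # Copy the segment so the original list is left untouched
        translated.append({**seg, "text": cache[source]})
    return translated


# Stand-in translator backed by a lookup table, used here instead of a real API call
lookup = {
    "I have a puddle of cheese in my mouth.": "Ho una pozza di formaggio in bocca.",
}


def fake_translate(text):
    return lookup[text]


# Hypothetical segment: the timestamps are illustrative
segments = [
    {"start": 10.0, "end": 12.0, "text": "I have a puddle of cheese in my mouth."},
]
result = translate_segments(segments, fake_translate)
print(result)
```

The trade-off is that sentence-by-sentence translation gives the model less context than translating the full transcript, so it is worth comparing the output quality of both approaches on your own content.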

Voice Dubbing

Voice dubbing replaces the original audio track of the input video with a translated one. We can do this using the sieve/dubbing pipeline in voice-dubbing mode.

import sieve

source_file = sieve.File(url="https://storage.googleapis.com/sieve-prod-us-central1-public-file-upload-bucket/c4d968f5-f25a-412b-9102-5b6ab6dafcb4/49ce7d0c-c8f4-41f5-8673-68bb4f1d98f9-markWiens_pizza_italy_52s.mp4")
dubbing = sieve.function.get("sieve/dubbing")

output = dubbing.run(
    source_file,
    target_language = "italian",
    output_mode = "voice-dubbing"
)

for dubbed_language in output:
    print("Dubbed media:", dubbed_language.path)

Below is the video dubbed into Italian.

Lipsync Dubbing

Lipsync dubbing ensures that the translated voiceovers match the lip movements of the on-screen speakers, creating a natural and synchronized appearance. To implement this form of dubbing, you can use sieve/dubbing to generate the translated audio and then use sieve/lipsync to synchronize audio and visual elements.

import sieve

file = sieve.File(url="https://storage.googleapis.com/sieve-prod-us-central1-public-file-upload-bucket/c4d968f5-f25a-412b-9102-5b6ab6dafcb4/98a2ee22-bb8c-4781-a5db-6d8a26613d8e-2_markWiens_italy_lipsync_60-100s.mp4")
audio = sieve.File(url="https://storage.googleapis.com/sieve-prod-us-central1-public-file-upload-bucket/c4d968f5-f25a-412b-9102-5b6ab6dafcb4/f2c69c02-b308-44e2-bb17-a6e38e802c24-2_italian_audio.wav")
enhance = "default"
backend = "sievesync"
downsample = False
cut_by = "shortest"

lipsync = sieve.function.get("sieve/lipsync")
output = lipsync.run(file, audio, enhance, backend, downsample, cut_by)

print(output)

Below is an example output from lipsyncing.

Note: You can also do this in a single step by using the enable_lipsyncing option of sieve/dubbing.

On-Screen Text Translation

On-screen text translation involves identifying and translating text visible in the video, such as captions, signs, or graphics. While Sieve doesn’t offer this out of the box, an approach along the following lines would work:

  • Extract frames containing on-screen text.
  • Use OCR (Optical Character Recognition) to detect and extract text. You can perform this using the sieve/florence-2 function.
  • Translate the text into the target language.
  • Inpaint (remove) the original text and overlay the translated version.
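Between the OCR and translation steps, it helps to merge per-frame detections so that each piece of text is translated once and you know which frame range to inpaint. The sketch below shows one way to do that. The detection format, a list of (frame_index, text) pairs, is a hypothetical simplification: real OCR output would also include bounding boxes, and the OCR and inpainting steps themselves are not shown.

```python
def group_text_spans(detections, max_gap=1):
    """Merge per-frame OCR detections into spans of consecutive frames.

    detections: iterable of (frame_index, text) pairs.
    max_gap: largest frame-index gap still treated as the same appearance.
    """
    spans = []
    open_spans = {}  # text -> index of its most recent span in `spans`
    for frame, text in sorted(detections):
        key = text.strip()
        if not key:
            continue  # skip empty OCR results
        idx = open_spans.get(key)
        if idx is not None and frame - spans[idx]["last_frame"] <= max_gap:
            spans[idx]["last_frame"] = frame
        else:
            open_spans[key] = len(spans)
            spans.append({"text": key, "first_frame": frame, "last_frame": frame})
    return spans


# Hypothetical detections: "PIZZERIA" disappears after frame 2 and returns at frame 10
detections = [
    (0, "PIZZERIA"), (1, "PIZZERIA"), (2, "PIZZERIA"),
    (2, "OPEN"), (3, "OPEN"),
    (10, "PIZZERIA"),
]
spans = group_text_spans(detections)
unique_texts = sorted({span["text"] for span in spans})
print(spans)
print(unique_texts)  # each of these only needs to be translated once
```

Each span then tells you which frames need the original text removed and the translated text overlaid, while unique_texts keeps the number of translation calls small.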

Conclusion

Building a video translation tool can seem complex, but Sieve’s pipelines simplify the process by offering specialized functions for the core parts of video localization. From generating subtitles to creating synchronized voiceovers and lipsynced dubs, these tools give developers modular building blocks that can be combined to fit different projects. If you’re interested in building something similar and have questions for our team, feel free to join our Discord or reach out to us at contact@sievedata.com.