
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild


TL;DR

VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data, including audiobooks, internet videos, and podcasts.

To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference audio.

How to run inference

TTS Mode

For simple inference with a reference voice, use TTS Mode.

  1. Pass the audio you want to imitate to the reference_audio parameter, and the text you want to synthesize to the input_text parameter.
  2. Generate similar-sounding speech from your text.
import sieve

# Reference voice to imitate and the text to speak in that voice
audio = sieve.Audio(path="path/to/audio")
text = "text to synthesize"

voicecraft = sieve.function.get("sieve/voicecraft")
output = voicecraft.run(reference_audio=audio, input_text=text)
print(output.path)

Edit Mode

Use this mode to edit existing audio.

  1. Set mode = "Edit".
  2. Edit mode supports insertions, deletions, and substitutions.
  3. For an insertion, set edit_start_time to the point where you want to insert the word or sentence, and set edit_end_time equal to edit_start_time.
  4. For a substitution, pass the edit_start_time and edit_end_time of the word you want to replace, and pass the new word in input_text.
  5. For a deletion, pass the start and end time of the word to delete, and pass ' ' in the reference string.
import sieve

# Audio to edit and the replacement text for the selected time span
audio = sieve.Audio(path="path/to/audio")
text = "text to substitute"

voicecraft = sieve.function.get("sieve/voicecraft")
output = voicecraft.run(reference_audio=audio, input_text=text, mode="Edit", edit_start_time=1, edit_end_time=2.4)
print(output.path)
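
The example above covers substitution only. As a rough sketch of how the insertion and deletion rules could be expressed, the helper below builds the keyword arguments for each kind of edit. The name build_edit_kwargs and the kind labels are hypothetical and not part of VoiceCraft or Sieve. It also assumes that the ' ' placeholder for deletions is passed as input_text, which the steps above do not state explicitly.

```python
from typing import Optional


def build_edit_kwargs(kind: str, start: float, end: Optional[float] = None,
                      new_text: str = "") -> dict:
    """Build keyword arguments for an Edit-mode call (hypothetical helper)."""
    if kind == "insert":
        # Insertion: the end time equals the start time.
        end = start
    elif kind in ("substitute", "delete"):
        if end is None or end <= start:
            raise ValueError("substitution and deletion need end > start")
        if kind == "delete":
            # Assumption: the ' ' placeholder is passed as input_text.
            new_text = " "
    else:
        raise ValueError(f"unknown edit kind: {kind!r}")
    return {
        "input_text": new_text,
        "mode": "Edit",
        "edit_start_time": start,
        "edit_end_time": end,
    }


insert_kwargs = build_edit_kwargs("insert", 1.5, new_text="brand new")
delete_kwargs = build_edit_kwargs("delete", 1.0, 2.4)
print(insert_kwargs)
print(delete_kwargs)
```

The resulting dictionary could then be unpacked into the call shown earlier, for example voicecraft.run(reference_audio=audio, **insert_kwargs).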

Long TTS

For multiple sentences or longer texts, use this mode.

  1. Set mode = "Long TTS".
  2. Use the split_text parameter to specify whether the text is separated by Newline or by Sentences.
  3. Generate the speech.
import sieve

# Reference voice and a multi-sentence text, one sentence per line
audio = sieve.Audio(path="path/to/audio")
text = " this is sentence one \n this is sentence two \n this is sentence three "

voicecraft = sieve.function.get("sieve/voicecraft")
output = voicecraft.run(reference_audio=audio, input_text=text, mode="Long TTS", split_text="Newline")
print(output.path)
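
When the sentences start out as a Python list rather than a single string, a small helper can assemble the newline-separated text used with split_text = "Newline". This is only a sketch: join_for_long_tts is a hypothetical name, and trimming whitespace and dropping empty entries is an assumption about what makes clean input, not documented behavior of the model.

```python
def join_for_long_tts(sentences):
    """Join sentences one per line for split_text="Newline" (hypothetical helper)."""
    # Trim surrounding whitespace and skip entries that are empty after trimming.
    cleaned = [s.strip() for s in sentences if s.strip()]
    return "\n".join(cleaned)


text = join_for_long_tts([
    "this is sentence one",
    " this is sentence two ",
    "",
    "this is sentence three",
])
print(repr(text))
```

The returned string can be passed as input_text in the Long TTS call above.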
