Job TreeNavigate the job tree to view your child job details
Loading job tree...
public
A token infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech.
ready
waiting for outputs
listening for logs...
README
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
TL;DR
VoiceCraft is a token infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data including audiobooks, internet videos, and podcasts.
To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference.
How to run inference
TTS Mode
For simple inference using a reference voice use TTS Mode
- pass a reference audio you want to imitate to reference_audio parameter and pass the text you want to synthesize to the input_text param
- Generate similar sounding speech with your text.
audio = sieve.Audio(path="path/to/audio")
text = "text to synthesize"
voicecraft = sieve.function.get("sieve/voicecraft")
output = voicecraft.run(reference_audio = audio, input_text = text)
print(output.path)
Edit Mode
For editing audios you can use this mode.
- For edit mode change the mode = "Edit".
- Editing mode allows for insertions, deletions and subsitutions.
- For insertion pass the edit_start_time to the place where you want to insert words or sentence, edit_end_time should also be equal to edit_start_time in insertion.
- For subsitution, pass the edit_start_time and edit_end_time of the word you want to replace, and pass the new word in the input_text
- For deletions you may pass the starting and end time of the word to be deleted and pass ' ' in the reference string
audio = sieve.Audio(path="path/to/audio")
text = "text to subsitute "
voicecraft = sieve.function.get("sieve/voicecraft")
output = voicecraft.run(reference_audio = audio, input_text = text, mode= "Edit", edit_start_time = 1, edit_end_time = 2.4)
print(output.path)
Long TTS
For many sentences or longer texts you can use this mode
- Change mode to mode = "Long TTS"
- You can specify how your sentences are seperated by a Newline or by Sentences using the split_text parameter
- Generate
audio = sieve.Audio(path="path/to/audio/")
text = " this is sentence one \n this is sentence two \n this is sentence three "
voicecraft = sieve.function.get("sieve/voicecraft")
output = voicecraft.run(reference_audio = audio, input_text = text, mode= "Long TTS", split_text = "Newline")
MORE EXAMPLES
See more examples of this app by clicking on the jobs below.