A token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech.

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

TL;DR

VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data, including audiobooks, internet videos, and podcasts.

To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference audio.

How to run inference

TTS Mode

For simple inference using a reference voice, use TTS mode.

Pass the audio you want to imitate to the reference_audio parameter and the text you want to synthesize to the input_text parameter.
VoiceCraft then generates similar-sounding speech for your text.

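import sieve

# Reference audio whose voice should be imitated
audio = sieve.Audio(path="path/to/audio")
text = "text to synthesize"

# Fetch the hosted function and run it; no mode argument is passed for TTS
voicecraft = sieve.function.get("sieve/voicecraft")
output = voicecraft.run(reference_audio=audio, input_text=text)
print(output.path)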

Edit Mode

Use this mode to edit existing audio.

Set mode = "Edit".
Edit mode supports insertions, deletions, and substitutions.
For an insertion, set edit_start_time to the point where you want to insert the words or sentence, and set edit_end_time to the same value as edit_start_time.
For a substitution, set edit_start_time and edit_end_time to the start and end of the word you want to replace, and pass the new word in input_text.
For a deletion, set edit_start_time and edit_end_time to the start and end of the word to be deleted, and pass ' ' (a single space) as the replacement text.

import sieve

# Audio to be edited
audio = sieve.Audio(path="path/to/audio")
text = "text to substitute"

# Replace the span between edit_start_time and edit_end_time with the new text
voicecraft = sieve.function.get("sieve/voicecraft")
output = voicecraft.run(reference_audio=audio, input_text=text, mode="Edit", edit_start_time=1, edit_end_time=2.4)
print(output.path)
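
The three edit operations differ only in how edit_start_time, edit_end_time, and input_text are set, so the rules can be captured in a small helper. The following is a minimal sketch, not part of sieve or VoiceCraft: the function name edit_kwargs and the operation labels "insert", "substitute", and "delete" are hypothetical, and the single-space text for deletions follows the rule described for Edit mode.

```python
def edit_kwargs(operation, start, end=None, new_text=""):
    """Build keyword arguments for an Edit-mode call (hypothetical helper)."""
    if operation == "insert":
        # Insertion: the end time equals the start time.
        end = start
    elif operation in ("substitute", "delete"):
        # Substitution and deletion both need a real time span.
        if end is None or end <= start:
            raise ValueError("end must be greater than start")
        if operation == "delete":
            # Deletion: pass a single space as the replacement text.
            new_text = " "
    else:
        raise ValueError(f"unknown operation: {operation!r}")
    return {
        "mode": "Edit",
        "input_text": new_text,
        "edit_start_time": start,
        "edit_end_time": end,
    }


print(edit_kwargs("insert", 1.5, new_text="really"))
```

Assuming the parameter names shown in the example above, the result could be unpacked into the call, for instance voicecraft.run(reference_audio=audio, **edit_kwargs("substitute", 1, 2.4, new_text="hello")).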

Long TTS

Use this mode for multiple sentences or longer texts.

Set mode = "Long TTS".
Use the split_text parameter to specify whether your sentences are separated by Newline or by Sentences.
Run the function to generate the audio.

import sieve

# Reference audio whose voice should be imitated
audio = sieve.Audio(path="path/to/audio")
text = " this is sentence one \n this is sentence two \n this is sentence three "

# Synthesize each newline-separated sentence in the reference voice
voicecraft = sieve.function.get("sieve/voicecraft")
output = voicecraft.run(reference_audio=audio, input_text=text, mode="Long TTS", split_text="Newline")
print(output.path)
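
When the sentences already live in a Python list, the newline-separated input_text can be assembled programmatically. This is a minimal sketch with a hypothetical helper name, to_newline_text; it assumes the service accepts sentences joined by a bare newline, without the surrounding spaces used in the example above.

```python
def to_newline_text(sentences):
    """Join sentences into one newline-separated string (hypothetical helper)."""
    # Trim whitespace and drop empty entries so no blank segments are sent.
    cleaned = [s.strip() for s in sentences if s.strip()]
    return "\n".join(cleaned)


text = to_newline_text([
    "this is sentence one",
    "  this is sentence two ",
    "",
    "this is sentence three",
])
print(repr(text))
```

The resulting string can then be passed as input_text together with mode="Long TTS" and split_text="Newline".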
