Transcribe
This app provides precise audio transcription with advanced features including word-level timestamps, speaker diarization, and optional auto-translation. It offers multiple backend models to balance cost, quality, and speed, along with customization options for noise reduction and vocabulary.
For pricing, see the Pricing section below.
For detailed information on backend options, parameter specifications, and performance comparisons, please refer to the Notes section below.
Key Features
- Word-level Timestamps: Provides precise timestamps for each word in the transcript (not available in `groq-whisper-large-v3`).
- Speaker Diarization: Identifies and labels different speakers in the audio.
- Custom Vocabulary: You can specify a custom vocabulary of words to find and replace in the transcript.
- Model Backend Options: Choose from various backend models to balance cost, quality, and speed.
- Auto Translation: Dynamically translates transcriptions into multiple languages.
- Denoise Audio: Removes background noise from the audio; pick from various noise reduction models.
- Customization: Pick from a variety of transcription models, denoising models, and segmentation models to best fit your needs.
- Segmentation Backends: You can pick different methods for splitting the audio for parallel processing, or manually specify the chunks to transcribe.
Pricing
This app consists of multiple transcription models, each with different pricing. Here's a breakdown of the pricing for each model:
| Model | $/hour of audio transcribed |
|---|---|
| stable-ts-whisper-large-v3 | $0.15 |
| stable-ts-whisper-large-v3-turbo | $0.13 |
| groq-whisper-large-v3 | $0.11 |
| whisper-timestamped-whisper-large-v3 | $0.69 |
| whisper-timestamped-whisper-base | $0.25 |
| whisperx-whisper-large-v3 | $0.18 |
| whisperx-whisper-base | $0.085 |
| whisper-zero | $0.80 |
Note: For `stable-ts-whisper-large-v3`, `stable-ts-whisper-large-v3-turbo`, `whisper-timestamped-whisper-large-v3`, `whisper-timestamped-whisper-base`, `whisperx-whisper-large-v3`, and `whisperx-whisper-base`, the pricing listed above is an approximation based on average processing times. Actual costs may vary depending on the specific parameters set for the chosen model and the complexity of the audio being processed.
Denoise Pricing
When the `denoise_backend` option is selected (i.e., not set to "None"), audio denoising is performed using Sieve's Audio Enhance application. For detailed pricing information regarding the denoising process, please consult the Pricing section of the Audio Enhance documentation.
Translation Pricing
When a `translation_backend` is selected (i.e., not set to "None"), translation is performed using Sieve's Translation application. Please refer to the Translation app's Pricing section for details.
Example 1: Transcribing a 2.5 Hour Podcast!
Assuming we select `groq-whisper-large-v3` as the backend, here is the cost breakdown:
cost = audio_duration_in_hours * $0.11
cost = 2.5 * $0.11
cost = $0.275
Remarkably, transcribing a 2.5-hour podcast costs a mere $0.275! What's even more impressive is that the transcription was completed in just 53 seconds. You can view the details of this lightning-fast job here.
Now let's use `stable-ts-whisper-large-v3-turbo` as the backend so we can get word-level timestamps; here is the cost breakdown:
cost = audio_duration_in_hours * $0.13
cost = 2.5 * $0.13
cost = $0.325
The job costs $0.325 to transcribe, and the processing time was 3 minutes and 13 seconds. You can view the details of the job here.
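The arithmetic in the two examples above can be sketched as a small helper (the rates are copied from the pricing table; the function name is illustrative, and remember the listed rates are themselves approximations):

```python
# Approximate $/hour rates from the pricing table above
RATES = {
    "stable-ts-whisper-large-v3": 0.15,
    "stable-ts-whisper-large-v3-turbo": 0.13,
    "groq-whisper-large-v3": 0.11,
    "whisper-timestamped-whisper-large-v3": 0.69,
    "whisper-timestamped-whisper-base": 0.25,
    "whisperx-whisper-large-v3": 0.18,
    "whisperx-whisper-base": 0.085,
    "whisper-zero": 0.80,
}

def estimated_cost(backend: str, hours: float) -> float:
    """Rough cost estimate: audio duration in hours times the backend's hourly rate."""
    return round(hours * RATES[backend], 4)

print(estimated_cost("groq-whisper-large-v3", 2.5))             # 0.275
print(estimated_cost("stable-ts-whisper-large-v3-turbo", 2.5))  # 0.325
```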
Notes
Backend Comparison
| Model | Speed | Accuracy | Notes |
|---|---|---|---|
| stable-ts-whisper-large-v3 | ⚡⚡ | ⭐⭐⭐ | Accurate timestamps |
| stable-ts-whisper-large-v3-turbo | ⚡⚡⚡ | ⭐⭐⭐ | Accurate timestamps, faster quantized model |
| groq-whisper-large-v3 | ⚡⚡⚡⚡ | ⭐⭐⭐ | Fastest model, cheapest whisper-large-v3 model |
| whisper-timestamped-whisper-large-v3 | ⚡⚡ | ⭐⭐⭐ | Accurate timestamps, relatively slower |
| whisper-timestamped-whisper-base | ⚡⚡⚡ | ⭐⭐ | Accurate timestamps, but lower transcription quality |
| whisperx-whisper-large-v3 | ⚡⚡ | ⭐⭐⭐ | High quality, accurate timestamps |
| whisperx-whisper-base | ⚡⚡⚡ | ⭐⭐ | Cheaper and faster than whisperx-whisper-large-v3; uses the whisper-base model, so less accurate |
| whisper-zero | ⚡ | ⭐⭐⭐⭐ | Most accurate transcriptions, but slowest option; best for diarization |
Picking the Right Model
- Choose `groq-whisper-large-v3` for fast, high-quality transcriptions in most scenarios.
- Choose `stable-ts-whisper-large-v3-turbo` for fast transcriptions with word-level timestamps.
- Use `stable-ts-whisper-large-v3` or `whisper-timestamped-whisper-large-v3` when precise word-level timestamps are crucial.
- Opt for `whisperx-whisper-large-v3` for a good balance of quality, speed, and timestamp accuracy.
- Consider `whisper-zero` for the highest accuracy, especially in multi-speaker scenarios, but be prepared for longer processing times.
- The `whisper-base` variants (`whisper-timestamped-whisper-base`, `whisperx-whisper-base`) are suitable for faster and cheaper processing when slight accuracy trade-offs are acceptable.
Diarization Backends
The `diarization_backend` parameter specifies the backend used for speaker diarization, which returns a speaker ID for each segment in the transcript. Currently, we support `pyannote-3.1.1` and `none`.
- pyannote-3.1.1: Assigns unique IDs (e.g., `SPEAKER_01`, `SPEAKER_02`) to different speakers. You can set `min_speakers` and `max_speakers` to help the model accurately identify the number of speakers in the audio.
- none: This backend does not perform speaker diarization.
Custom Vocabulary
The `custom_vocabulary` parameter allows you to specify a custom set of words for find-and-replace operations in the transcript. This feature is useful for:
- Correcting common transcription errors
- Ensuring accurate transcription of technical terms or proper nouns
- Standardizing specific terminology in your transcripts
Example: "Sieve" is often transcribed as "Civ" by the whisper models, so you can pass `custom_vocabulary = {"Civ": "Sieve"}` to the function to ensure that "Civ" is replaced with "Sieve" in the transcript.
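The find-and-replace semantics can be illustrated with a minimal sketch. This mimics what the parameter does to the final transcript; it is not the app's internal implementation:

```python
def apply_custom_vocabulary(text: str, custom_vocabulary: dict) -> str:
    """Replace each misrecognized word with its correction, in dict order."""
    for wrong, right in custom_vocabulary.items():
        text = text.replace(wrong, right)
    return text

transcript = "Welcome to the Civ developer podcast."
print(apply_custom_vocabulary(transcript, {"Civ": "Sieve"}))
# Welcome to the Sieve developer podcast.
```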
Segmentation Backends
Segmentation backends define how the audio is split into chunks for transcription. Each backend has its own set of parameters that you can tune to best fit your needs.
- ffmpeg-silence: This backend uses ffmpeg to split the audio at detected silences. The `min_silence_length` parameter defines the minimum length of silence that must be detected to split the audio, and `min_segment_length` defines the minimum length a segment must have.
- vad: This backend uses a voice activity detection model to split the audio based on human voice activity. The `vad_threshold` parameter defines the threshold for the voice activity detection model; it is set to 0.2 by default.
- pyannote-3.1.1: This backend uses speaker diarization to split the audio based on different speakers. The `pyannote_segmentation_threshold` parameter defines the threshold for the speaker diarization model; it is set to 0.8 by default.
- none: Choosing `none` does not perform any segmentation, and the entire audio is transcribed as one segment.
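To make the ffmpeg-silence parameters concrete, here is a hypothetical sketch of how detected silence gaps could be turned into chunks while honoring `min_silence_length` and `min_segment_length`. The logic is illustrative only; the app's actual splitter may differ:

```python
def split_on_silences(duration, silences,
                      min_silence_length=0.6, min_segment_length=5.0):
    """Return (start, end) chunks: split only at silences that are at least
    min_silence_length long, and only when the segment being closed is at
    least min_segment_length long."""
    chunks, start = [], 0.0
    for s_start, s_end in silences:
        long_enough_gap = (s_end - s_start) >= min_silence_length
        long_enough_segment = (s_start - start) >= min_segment_length
        if long_enough_gap and long_enough_segment:
            chunks.append((start, s_start))
            start = s_end  # resume after the silence
    chunks.append((start, duration))
    return chunks

# Silence at 14.0-14.2s is too short to trigger a split
print(split_on_silences(30.0, [(9.5, 10.5), (14.0, 14.2), (20.0, 21.0)]))
# [(0.0, 9.5), (10.5, 20.0), (21.0, 30.0)]
```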
Manual Segmentation
To manually define audio segments instead of using automatic segmentation, you can pass a list of comma-separated start and end time strings to the `chunks` parameter.
For example, to transcribe the audio in 10-second chunks, you may pass `chunks = ["0,10", "10,20", "20,30"]`.
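Generating such chunk strings for an arbitrary duration might look like this (the helper name is illustrative, not part of the app's API):

```python
def make_chunks(duration_seconds: float, chunk_length: float = 10.0) -> list:
    """Build "start,end" strings covering the full audio in fixed-size chunks."""
    chunks, start = [], 0.0
    while start < duration_seconds:
        end = min(start + chunk_length, duration_seconds)
        chunks.append(f"{start:g},{end:g}")  # :g drops trailing ".0"
        start = end
    return chunks

print(make_chunks(30))  # ['0,10', '10,20', '20,30']
print(make_chunks(25))  # ['0,10', '10,20', '20,25']
```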
Output Format
The output is generated and returned incrementally as each segment is processed. The structure of the output is as follows:
{
"text": "string",
"language_code": "string",
"segments": [
{
"start": "number",
"end": "number",
"text": "string",
"speaker_id": "string", # only available if diarization_backend is not none
"words": [ # only available if word-level timestamps are enabled
{
"start": "number",
"end": "number",
"confidence": "number",
"word": "string"
}
]
}
]
}
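Given that structure, a consumer might render a speaker-labeled transcript as follows. The sample payload below is made up for illustration; note that `speaker_id` and `words` may be absent depending on the backends chosen:

```python
# Made-up sample in the shape of the output above
output = {
    "text": "Hello there. Hi!",
    "language_code": "en",
    "segments": [
        {"start": 0.0, "end": 1.2, "text": "Hello there.", "speaker_id": "SPEAKER_01",
         "words": [{"start": 0.0, "end": 0.4, "confidence": 0.98, "word": "Hello"},
                   {"start": 0.5, "end": 1.2, "confidence": 0.95, "word": "there."}]},
        {"start": 1.4, "end": 1.8, "text": "Hi!", "speaker_id": "SPEAKER_02", "words": []},
    ],
}

lines = []
for seg in output["segments"]:
    # speaker_id is only present when diarization_backend is not none
    speaker = seg.get("speaker_id", "UNKNOWN")
    lines.append(f"[{seg['start']:.1f}-{seg['end']:.1f}] {speaker}: {seg['text']}")
print("\n".join(lines))
# [0.0-1.2] SPEAKER_01: Hello there.
# [1.4-1.8] SPEAKER_02: Hi!
```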
Languages
We support 99 languages in total. If you already know the language of the original audio, you may enter its language code in the `language` parameter. If you don't, you may leave the `language` parameter blank and we will detect the language automatically. The full list of supported languages is below.
`en` (English), `zh` (Chinese), `de` (German), `es` (Spanish), `ru` (Russian), `ko` (Korean), `fr` (French), `ja` (Japanese), `pt` (Portuguese), `tr` (Turkish), `pl` (Polish), `ca` (Catalan), `nl` (Dutch), `ar` (Arabic), `sv` (Swedish), `it` (Italian), `id` (Indonesian), `hi` (Hindi), `fi` (Finnish), `vi` (Vietnamese), `he` (Hebrew), `uk` (Ukrainian), `el` (Greek), `ms` (Malay), `cs` (Czech), `ro` (Romanian), `da` (Danish), `hu` (Hungarian), `ta` (Tamil), `no` (Norwegian), `th` (Thai), `ur` (Urdu), `hr` (Croatian), `bg` (Bulgarian), `lt` (Lithuanian), `la` (Latin), `mi` (Maori), `ml` (Malayalam), `cy` (Welsh), `sk` (Slovak), `te` (Telugu), `fa` (Persian), `lv` (Latvian), `bn` (Bengali), `sr` (Serbian), `az` (Azerbaijani), `sl` (Slovenian), `kn` (Kannada), `et` (Estonian), `mk` (Macedonian), `br` (Breton), `eu` (Basque), `is` (Icelandic), `hy` (Armenian), `ne` (Nepali), `mn` (Mongolian), `bs` (Bosnian), `kk` (Kazakh), `sq` (Albanian), `sw` (Swahili), `gl` (Galician), `mr` (Marathi), `pa` (Punjabi), `si` (Sinhala), `km` (Khmer), `sn` (Shona), `yo` (Yoruba), `so` (Somali), `af` (Afrikaans), `oc` (Occitan), `ka` (Georgian), `be` (Belarusian), `tg` (Tajik), `sd` (Sindhi), `gu` (Gujarati), `am` (Amharic), `yi` (Yiddish), `lo` (Lao), `uz` (Uzbek), `fo` (Faroese), `ps` (Pashto), `tk` (Turkmen), `nn` (Nynorsk), `mt` (Maltese), `sa` (Sanskrit), `lb` (Luxembourgish), `my` (Myanmar), `bo` (Tibetan), `tl` (Tagalog), `mg` (Malagasy), `as` (Assamese), `tt` (Tatar), `haw` (Hawaiian), `ln` (Lingala), `ha` (Hausa), `ba` (Bashkir), `jw` (Javanese), `su` (Sundanese), `yue` (Cantonese), `my` (Burmese), `ca` (Valencian), `nl` (Flemish), `ht` (Haitian), `lb` (Letzeburgesch), `ps` (Pushto), `pa` (Panjabi), `ro` (Moldavian), `si` (Sinhalese), `es` (Castilian), `zh` (Mandarin)