Transcribe

This app provides precise audio transcription with advanced features including word-level timestamps, speaker diarization, and optional auto-translation. It offers multiple backend models to balance cost, quality, and speed, along with customization options for noise reduction and vocabulary.

For pricing, click here.

For detailed information on backend options, parameter specifications, and performance comparisons, please refer to the Notes section below.

Key Features

  • Word-level Timestamps: Provides precise timestamps for each word in the transcript (not available in groq-whisper-large-v3).
  • Speaker Diarization: Identifies and labels different speakers in the audio.
  • Custom Vocabulary: You can specify a custom vocabulary of words to find and replace in the transcript.
  • Model Backend Options: Choose from various backend models to balance cost, quality, and speed.
  • Auto Translation: Dynamically translates transcriptions into multiple languages.
  • Denoise Audio: Removes background noise from the audio, pick from various noise reduction models.
  • Customization: Pick from a variety of transcription models, denoising models, and segmentation models to best fit your needs.
  • Segmentation Backends: You can pick different methods for splitting the audio for parallel processing, or manually specify the chunks to transcribe.

Pricing

This app consists of multiple transcription models, each with different pricing. Here's a breakdown of the pricing for each model:

Model$/hour of audio transcribed
stable-ts-whisper-large-v3$0.15
stable-ts-whisper-large-v3-turbo$0.13
groq-whisper-large-v3$0.11
whisper-timestamped-whisper-large-v3$0.69
whisper-timestamped-whisper-base$0.25
whisperx-whisper-large-v3$0.18
whisperx-whisper-base$0.085
whisper-zero$0.80

Note: For stable-ts-whisper-large-v3, stable-ts-whisper-large-v3-turbo, whisper-timestamped-whisper-large-v3, whisper-timestamped-whisper-base, whisperx-whisper-large-v3 and whisperx-whisper-base, the pricing listed above is an approximation based on average processing times. Actual costs may vary depending on the specific parameters set for the chosen model and the complexity of the audio being processed.

Denoise Pricing

When the denoise_backend option is selected (i.e., not set to "None"), audio denoising is performed using Sieve's Audio Enhance application. For detailed pricing information regarding the denoising process, please consult the Pricing section of the Audio Enhance documentation.

Translation Pricing

When the translation_backend is selected (i.e., not set to "None"), then the translation is done using Sieve's Translation application. Please refer to the translation Pricing for more pricing details.

Example 1: Transcribing a 2.5 Hour Podcast!

Assuming we select groq-whisper-large-v3 as the backend, here is the cost breakdown:

cost = audio_duration_in_hours * $0.11
cost = 2.5 * $0.11
cost = $0.275

Remarkably, transcribing a 2.5-hour podcast costs a mere $0.275! What's even more impressive is that the transcription was completed in just 53 seconds. You can view the details of this lightning-fast job here.

Let's use stable-ts-whisper-large-v3-turbo as the backend so we can get word-level timestamps, here is the cost breakdown:

cost = audio_duration_in_hours * $0.13
cost = 2.5 * $0.13
cost = $0.325

The job costs $0.325 to transcribe, and the processing time was 3 minutes and 13 seconds. You can view the details of the job here.

Notes

Backend Comparison

ModelSpeedAccuracyNotes
stable-ts-whisper-large-v3⚡⚡⭐⭐⭐Accurate timestamps
stable-ts-whisper-large-v3-turbo⚡⚡⚡⭐⭐⭐Accurate timestamps, faster quantized model
groq-whisper-large-v3⚡⚡⚡⚡⭐⭐⭐Fastest model, cheapest whisper-large-v3 model
whisper-timestamped-whisper-large-v3⚡⚡⭐⭐⭐Accurate timestamps, relatively slower
whisper-timestamped-whisper-base⚡⚡⚡⭐⭐Accurate timestamps, but lower transcription quality
whisperx-whisper-large-v3⚡⚡⭐⭐⭐High quality, accurate timestamps
whisperx-whisper-base⚡⚡⚡⭐⭐Cheaper and faster than whisperx, uses whisper-base model so less accurate
whisper-zero⭐⭐⭐⭐Most accurate transcriptions, but slowest option, best for diarization

Picking the Right Model

  • Choose groq-whisper-large-v3 for fast, high-quality transcriptions in most scenarios.
  • Choose stable-ts-whisper-large-v3-turbo for fast transcriptions with word-level timestamps.
  • Use stable-ts-whisper-large-v3 or whisper-timestamped-whisper-large-v3 when precise word-level timestamps are crucial.
  • Opt for whisperx-whisper-large-v3 for a good balance of quality, speed, and timestamp accuracy.
  • Consider whisper-zero for the highest accuracy, especially in multi-speaker scenarios, but be prepared for longer processing times.
  • The whisper-base variants (whisper-timestamped-whisper-base, whisperx-whisper-base) are suitable for faster and cheaper processing when slight accuracy trade-offs are acceptable.

Diarization Backends

diarization_backend can be used to specify the backend to use for speaker diarization, returns speaker IDs for each segment in the transcript. Currently, we support pyannote-3.1.1 and none.

  • pyannote-3.1.1: Assigns unique IDs (e.g., SPEAKER_01, SPEAKER_02) to different speakers. You can set min_speakers and max_speakers to help the model accurately identify the number of speakers in the audio.
  • none: This backend does not perform speaker diarization.

Custom Vocabulary

The custom_vocabulary parameter allows you to specify a custom set of words for find-and-replace operations in the transcript. This feature is useful for:

  • Correcting common transcription errors
  • Ensuring accurate transcription of technical terms or proper nouns
  • Standardizing specific terminology in your transcripts

Example: Sieve is often transcribed as "Civ" by the whisper models, so you can add custom_vocabulary = {"Civ": "Sieve"} to the function to ensure that "Civ" is transcribed as "Sieve" instead of "Civ."

Segmentation Backends

Segmentation backends define how the audio is split into chunks for transcription. Each backend has its own set of parameters that you can tune to best fit your needs.

  • ffmpeg-silence: This backend uses ffmpeg to split the audio at specified intervals. You can specify the min_silence_length parameter to define the minimum length of silence that must be detected to split the audio. min_segment_length defines the minimum length a segment must have.

  • vad: This backend uses a voice activity detection model to split the audio based on human voice activity. You can specify the vad_threshold parameter to define the threshold for the voice activity detection model, its set to 0.2 by default.

  • pyannote-3.1.1: This backend uses speaker diarization to split the audio based on different speakers. You can specify the pyannote_segmentation_threshold parameter to define the threshold for the speaker diarization model, its set to 0.8 by default.

  • none: Choosing none does not perform any segmentation, and the entire audio is transcribed as one segment.

Manual Segmentation

To manually define audio segments instead of using automatic segmentation, you can pass in a list of start and end time strings seperated by commas to the chunks parameter.

For example, if you want to transcribe an audio into 10 second chunks, you may pass in chunks = ["0,10","10,20","20,30"].

Output Format

The output is generated and returned incrementally as each segment is processed. The structure of the output is as follows:

{
  "text": "string",
  "language_code": "string",
  "segments": [
    {
      "start": "number",
      "end": "number",
      "text": "string",
      "speaker_id": "string", # only available if diarization_backend is not none
      "words": [ # only available if word-level timestamps are enabled
        {
          "start": "number",
          "end": "number",
          "confidence": "number",
          "word": "string"
        }
      ]
    }
  ]
}

Languages

We support 99 total languages. You may enter a language code into the language parameter if you already know the language of the original audio. If you don't know the language of the original audio, you may leave the language parameter blank and we will automatically detect the language of the original audio. If you want to see the full list of supported languages, you may refer to the table below.

  • en (English)
  • zh (Chinese)
  • de (German)
  • es (Spanish)
  • ru (Russian)
  • ko (Korean)
  • fr (French)
  • ja (Japanese)
  • pt (Portuguese)
  • tr (Turkish)
  • pl (Polish)
  • ca (Catalan)
  • nl (Dutch)
  • ar (Arabic)
  • sv (Swedish)
  • it (Italian)
  • id (Indonesian)
  • hi (Hindi)
  • fi (Finnish)
  • vi (Vietnamese)
  • he (Hebrew)
  • uk (Ukrainian)
  • el (Greek)
  • ms (Malay)
  • cs (Czech)
  • ro (Romanian)
  • da (Danish)
  • hu (Hungarian)
  • ta (Tamil)
  • no (Norwegian)
  • th (Thai)
  • ur (Urdu)
  • hr (Croatian)
  • bg (Bulgarian)
  • lt (Lithuanian)
  • la (Latin)
  • mi (Maori)
  • ml (Malayalam)
  • cy (Welsh)
  • sk (Slovak)
  • te (Telugu)
  • fa (Persian)
  • lv (Latvian)
  • bn (Bengali)
  • sr (Serbian)
  • az (Azerbaijani)
  • sl (Slovenian)
  • kn (Kannada)
  • et (Estonian)
  • mk (Macedonian)
  • br (Breton)
  • eu (Basque)
  • is (Icelandic)
  • hy (Armenian)
  • ne (Nepali)
  • mn (Mongolian)
  • bs (Bosnian)
  • kk (Kazakh)
  • sq (Albanian)
  • sw (Swahili)
  • gl (Galician)
  • mr (Marathi)
  • pa (Punjabi)
  • si (Sinhala)
  • km (Khmer)
  • sn (Shona)
  • yo (Yoruba)
  • so (Somali)
  • af (Afrikaans)
  • oc (Occitan)
  • ka (Georgian)
  • be (Belarusian)
  • tg (Tajik)
  • sd (Sindhi)
  • gu (Gujarati)
  • am (Amharic)
  • yi (Yiddish)
  • lo (Lao)
  • uz (Uzbek)
  • fo (Faroese)
  • ps (Pashto)
  • tk (Turkmen)
  • nn (Nynorsk)
  • mt (Maltese)
  • sa (Sanskrit)
  • lb (Luxembourgish)
  • my (Myanmar)
  • bo (Tibetan)
  • tl (Tagalog)
  • mg (Malagasy)
  • as (Assamese)
  • tt (Tatar)
  • haw (Hawaiian)
  • ln (Lingala)
  • ha (Hausa)
  • ba (Bashkir)
  • jw (Javanese)
  • su (Sundanese)
  • yue (Cantonese)
  • my (Burmese)
  • ca (Valencian)
  • nl (Flemish)
  • ht (Haitian)
  • lb (Letzeburgesch)
  • ps (Pushto)
  • pa (Panjabi)
  • ro (Moldavian)
  • si (Sinhalese)
  • es (Castilian)
  • zh (Mandarin)