Transcript Analysis

This app offers a comprehensive solution for extracting valuable information from video or audio files. It automatically generates titles, chapters, summaries, and tags, enhancing the accessibility and discoverability of your media content. It is particularly useful for content creators, marketers, and educators, providing insights and structured data that can improve content engagement and understanding.

For pricing, see the Pricing section below.

For detailed notes, see the Notes section below.

Key Features

  • Title Generation: Creates a concise and relevant title for your media file based on its content, with a maximum length of 10 words.
  • Chapter Generation: Automatically divides the content into chapters, making it easier to navigate through different sections of the video or audio.
  • Summary Creation: Produces a summary of the main points covered in the media file, with a customizable maximum sentence length.
  • Tagging: Generates tags related to the content, aiding in searchability and categorization.
  • Audio Denoising: Improves transcription accuracy by denoising audio tracks before processing.
  • Highlight Extraction: Identifies and extracts highlights based on predefined search phrases, with adjustable maximum duration.
  • Custom Chapters: Allows for custom chapter titles to be provided, with options for manual or LLM-generated chapters. Check out the Custom Chapters section for more details.
  • Target Language: Supports analysis in multiple languages.
  • Customization: Pick different models for different parts of the pipeline, choose between different LLMs and transcription models, and customize outputs through natural-language prompts.
  • Custom Vocabulary: Supply a custom vocabulary for audio transcription, which can be used to find and replace or correct the spelling of words. See Custom Vocabulary for more details.
  • Bring your own Keys: Use your own LLM provider keys, or use ours. See API Keys for more details.

Comparing Backends

LLM Backends

Model | Speed | Accuracy | Notes
----- | ----- | -------- | -----
🚀 gpt-4o-mini | ⚡⚡⚡ | ⭐⭐ | Fastest, most cost-effective, but potentially less detailed
🔄 gpt-4o-2024-08-06 | ⚡⚡ | ⭐⭐⭐ | Balanced performance, up-to-date knowledge
🏆 gpt-4o-2024-05-13 | ⚡⚡ | ⭐⭐⭐ | Most capable, highest accuracy, slightly older version
🎻 claude-3-5-sonnet-20241022 | ⚡⚡ | ⭐⭐⭐ | Latest Claude model, balanced performance

Transcription Backends

Model | Speed | Accuracy | Notes
----- | ----- | -------- | -----
🏎️ groq-whisper | ⚡⚡⚡ | ⭐⭐⭐ | Fastest, duration-based pricing
💰 stable-ts | ⚡⚡ | ⭐⭐⭐ | Compute-based pricing, cost-effective for longer audio
🎯 whisper-timestamped | ⚡⚡ | ⭐⭐⭐⭐ | Accurate timestamps, compute-based pricing

* Compute-based pricing: costs are calculated based on processing time taken.

Pricing

This function combines multiple models, including transcription models and various LLMs. The pricing is calculated based on the combined usage of all models involved in the process.

LLM Model | Transcription Model | Total Cost per Hour of Audio
--------- | ------------------- | ----------------------------
gpt-4o-mini | groq-whisper | $0.3272
gpt-4o-mini | whisper-timestamped | $0.2537
gpt-4o-mini | stable-ts | $0.2412
gpt-4o-2024-08-06 | groq-whisper | $0.497
gpt-4o-2024-08-06 | whisper-timestamped | $0.4235
gpt-4o-2024-08-06 | stable-ts | $0.411
gpt-4o-2024-05-13 | groq-whisper | $0.581
gpt-4o-2024-05-13 | whisper-timestamped | $0.5075
gpt-4o-2024-05-13 | stable-ts | $0.495
claude-3-5-sonnet-20241022 | groq-whisper | $0.533
claude-3-5-sonnet-20241022 | whisper-timestamped | $0.459
claude-3-5-sonnet-20241022 | stable-ts | $0.447

Notes:

  • These are average costs based on default settings; costs may vary depending on the additional parameters selected.
  • The prices in the table above include a $0.20 base charge per hour of audio. This is our fee for running analytics and processing on your video or audio.
  • This app utilizes the speech_transcriber function for transcription. As a result:
    1. The costs shown in the table above represent the combined price for both this app and the speech_transcriber function.
    2. In your Usage page, you'll see separate entries for this app and the speech_transcriber app.
    3. The cost displayed for this app in the Usage page will appear lower than the total cost, as the transcription cost is billed under the speech_transcriber app.

Example 1: groq-whisper + gpt-4o-mini

If groq-whisper is selected as the transcription backend and gpt-4o-mini is selected as the llm_backend, the pricing for a 30-minute video file (generating a summary, title, and tags) is calculated as:

total_cost = video_duration_in_mins/60 * llm_transcription_cost
total_cost = 30/60 * $0.3272 = $0.1636

Example 2: stable-ts + gpt-4o-mini

If stable-ts is selected as the transcription backend and gpt-4o-mini is selected as the backend for the rest of the pipeline, the pricing for a 30-minute video file (generating a summary, title, and tags) is calculated as:

total_cost = video_duration_in_mins/60 * llm_transcription_cost
total_cost = 30/60 * $0.2412 = $0.1206
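
The same arithmetic in code: a minimal sketch (not part of the app) using the combined per-hour rates from the pricing table above.

RATES_PER_HOUR = {
    ("gpt-4o-mini", "groq-whisper"): 0.3272,
    ("gpt-4o-mini", "stable-ts"): 0.2412,
}

def estimate_cost(duration_minutes, llm_backend, transcription_backend):
    # Combined LLM + transcription price per hour of audio, from the table above.
    rate = RATES_PER_HOUR[(llm_backend, transcription_backend)]
    return duration_minutes / 60 * rate

print(estimate_cost(30, "gpt-4o-mini", "groq-whisper"))  # 0.1636
print(estimate_cost(30, "gpt-4o-mini", "stable-ts"))     # 0.1206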

Output Formats

Check out the example outputs for full details on the output formats.

The function returns outputs in separate sections, indexed from 0 onwards. The number and order of sections depend on which parameters you've selected.

Here's the full list of possible outputs in their default order:

  1. Video Metadata (always included)
  2. Transcript (always included)
  3. Summary (if generate_summary is True)
  4. Title (if generate_title is True)
  5. Tags (if generate_tags is True)
  6. Highlights (if generate_highlights is True)
  7. Sentiment Analysis (if generate_sentiments is True)
  8. Chapters (if generate_chapters is True)

Note: If you set any of these parameters to False, the subsequent sections will shift up in the index. For example, if generate_summary is False, the Title (if generated) will be at index 2 instead of 3.
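
To illustrate, here is a minimal sketch of collecting the outputs by position, assuming the Sieve Python client and that the app is called as sieve/transcript-analysis (the file path is a placeholder):

import sieve

fn = sieve.function.get("sieve/transcript-analysis")
outputs = list(fn.run(
    sieve.File(path="talk.mp4"),   # placeholder path
    generate_summary=True,
    generate_title=True,
    generate_tags=False,           # tags disabled, so later sections shift up
    generate_chapters=True,
))

# With the flags above, the sections arrive in this order:
names = ["metadata", "transcript", "summary", "title", "chapters"]
results = dict(zip(names, outputs))
print(results["title"])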

Highlights

Highlights are returned as an array of objects under the highlights key, each containing a title, score, and timing information:

{
  "highlights": [
    {
      "title": "Final Countdown and Escape",
      "score": 95,
      "start_time": 471.81,
      "end_time": 588.23,
      "start_timecode": "0:07:51.810000",
      "end_timecode": "0:09:48.230000",
      "duration": 116.42
    },
    {
      "title": "Laser Maze Tension",
      "score": 90,
      "start_time": 304.71,
      "end_time": 325.67,
      "start_timecode": "0:05:04.709000",
      "end_timecode": "0:05:25.670000",
      "duration": 20.96
    }
  ]
}
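
For example, a short sketch (not part of the app) that filters and ranks this payload, where response is the parsed JSON shown above:

# Keep highlights under 60 seconds and rank them by score.
short_clips = [h for h in response["highlights"] if h["duration"] <= 60]
short_clips.sort(key=lambda h: h["score"], reverse=True)

for clip in short_clips:
    print(clip["title"], clip["start_timecode"], "->", clip["end_timecode"])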

Sentiment Analysis

Sentiment analysis results are returned as an array of objects under the sentiment_analysis key, one per segment:

{
  "sentiment_analysis": [
    {
      "start_time": 0,
      "end_time": 2,
      "text": "I just started this 10 minute timer.",
      "sentiment": "neutral",
      "score": 50
    },
    {
      "start_time": 2,
      "end_time": 5,
      "text": "And when it hits zero, that room all the way over there,",
      "sentiment": "neutral",
      "score": 50
    }
  ]
}
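
A quick sketch of summarizing this payload (again, response is the parsed JSON shown above):

from collections import Counter

segments = response["sentiment_analysis"]
label_counts = Counter(s["sentiment"] for s in segments)           # e.g. Counter({'neutral': 2})
average_score = sum(s["score"] for s in segments) / len(segments)  # e.g. 50.0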

Chapters

Chapters are returned as an array of objects under the chapters key, each representing a chapter:

{
  "chapters": [
    {
      "title": "Introduction & Setup",
      "start_time": 0,
      "timecode": "00:00:00"
    },
    {
      "title": "Starting the Challenge",
      "start_time": 22.76,
      "timecode": "00:00:22"
    }
  ]
}
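
As an example, this payload maps directly to a YouTube-style chapter list for a video description (response is the parsed JSON shown above):

for chapter in response["chapters"]:
    # Each line starts with an HH:MM:SS timestamp followed by the chapter title.
    print(chapter["timecode"], chapter["title"])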

API Keys

By default, the transcript-analysis function will run without requiring any API keys. If you'd like to provide your own API keys, you'll only be charged the small compute-based fee outlined here. You can read about environment variables on Sieve here.

OpenAI

Set OPENAI_API_KEY environment variable to your OpenAI API key.

Azure OpenAI

Set AZURE_OPENAI_API_KEY to your Azure OpenAI API key. Set AZURE_OPENAI_ENDPOINT to your organization's Azure OpenAI endpoint. (ex. https://my-organization.openai.azure.com/)

Additionally, you'll need to provide the Azure OpenAI model deployment name for each model you'd like to use.

Model | Environment Variable
----- | --------------------
gpt-4o-2024-08-06 | AZURE_OPENAI_GPT_4O_2024_08_06_MODEL_NAME
gpt-4o-2024-05-13 | AZURE_OPENAI_GPT_4O_2024_05_13_MODEL_NAME
gpt-4o-mini | AZURE_OPENAI_GPT_4O_MINI_MODEL_NAME

Azure OpenAI deployment information can be found here.

Anthropic

Set the ANTHROPIC_API_KEY environment variable to your Anthropic API key.
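
As a quick sanity check (illustrative only, not part of the app), you can confirm that the variables for your chosen provider are present wherever you set them:

import os

# Swap in ANTHROPIC_API_KEY, or the AZURE_OPENAI_* variables listed above,
# depending on which backend you plan to use.
required = ["OPENAI_API_KEY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {missing}")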

Notes

Custom Chapters

Custom chapters allow you to have more control over how your content is segmented, ensuring that specific topics or sections are always included in the chapter breakdown.

What are Custom Chapters?

Custom chapters are user-defined topics or sections that you want to be identified and timestamped in your content. This feature allows you to guide the AI in creating a more relevant and accurate chapter structure.

Why Use Custom Chapters?

  1. Ensure Important Topics are Covered: Guarantee that key subjects are always included in the chapter list, even if they're brief.
  2. Maintain Consistency: Use a similar chapter structure across multiple videos or podcasts in a series.
  3. Combine AI and Human Insight: Leverage your knowledge of the content while still benefiting from AI-powered timestamp identification.

How to Use Custom Chapters:

  1. Create a list of chapter titles or topics you want to include.
  2. Pass this list to the custom_chapters parameter.
  3. Choose a custom_chapter_mode:
    • "strict": Only use your provided chapters
    • "extended" (default): Use your chapters and allow the AI to add additional relevant chapters

Example:

custom_chapters = ["Introduction", "Problem Statement", "Methodology", "Results", "Conclusion"]
custom_chapter_mode = "extended"

In this example, the AI will ensure these five chapters are included, and may add other relevant chapters it identifies in the content.

By using custom chapters, you can ensure your content is organized in a way that best serves your audience and highlights the most important aspects of your material.
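
A minimal sketch of passing these values in a call, assuming the Sieve Python client and the sieve/transcript-analysis function name (the file path is a placeholder):

import sieve

fn = sieve.function.get("sieve/transcript-analysis")
output = fn.run(
    sieve.File(path="lecture.mp4"),  # placeholder path
    custom_chapters=["Introduction", "Problem Statement", "Methodology",
                     "Results", "Conclusion"],
    custom_chapter_mode="extended",  # use "strict" to limit chapters to this list
)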

Parameters:

  • file: Video or audio file (required).
  • transcription_backend: Backend to use for transcription. Options: "groq-whisper", "stable-ts", "whisper-timestamped". Default is "groq-whisper".
  • llm_backend: LLM to use for processing the transcript. Options: "gpt-4o-2024-08-06", "gpt-4o-2024-05-13", "gpt-4o-mini". Default is "gpt-4o-2024-08-06".
  • prompt: Custom prompt to guide the LLM's analysis. Influences the analysis, focus areas, and overall output of the generated content.
  • generate_summary: Whether to generate a summary. Default is True.
  • generate_title: Whether to generate a title. Default is True.
  • generate_chapters: Whether to generate chapters. Default is True.
  • generate_tags: Whether to generate tags. Default is True.
  • generate_highlights: Whether to generate highlights. Default is False.
  • generate_sentiments: Whether to perform sentiment analysis on each segment. Default is False.
  • custom_chapters: List of custom chapters. Pass in a list of strings for custom chapter generation.
  • custom_chapter_mode: Mode for custom chapters. Options: "extended" (default) or "strict".
  • max_summary_length: Maximum number of sentences in the summary. Default is 5.
  • max_title_length: Maximum number of words in the title. Default is 10.
  • min_chapter_length: Minimum length of chapters in seconds. Default is 0.
  • num_tags: Number of tags to generate. Default is 5.
  • target_language: The target language of the output. If empty, uses the input file's language.
  • custom_vocabulary: Custom vocabulary for audio transcription. A dictionary of words to add to the vocabulary, which can be used to find and replace or correct the spelling of words.
  • speaker_diarization: Whether to use speaker diarization to split audio into segments. Default is False.
  • use_vad: Whether to use voice activity detection to split audio into segments. Default is False.
  • denoise_audio: Whether to denoise audio before analysis. Improves transcription but slows processing. Default is False.
  • return_as_json_file: Whether to return each output as a JSON file. Useful for efficient fetching of large payloads. Default is False.
  • use_azure: Whether to use Azure OpenAI for OpenAI models instead of OpenAI's API. Default is False.
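
For illustration, here is a sketch of a call that sets several of these parameters explicitly, again assuming the Sieve Python client and the sieve/transcript-analysis function name (the file path and prompt are placeholders):

import sieve

fn = sieve.function.get("sieve/transcript-analysis")
output = fn.run(
    sieve.File(path="podcast.mp3"),           # placeholder path
    transcription_backend="stable-ts",
    llm_backend="gpt-4o-mini",
    prompt="Focus on the product announcements.",
    generate_highlights=True,
    generate_sentiments=True,
    max_summary_length=3,
    num_tags=8,
    target_language="en",
    denoise_audio=True,
)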

Custom Vocabulary

The custom_vocabulary parameter allows you to specify a custom set of words for find-and-replace operations in the transcript. This feature is useful for:

  • Correcting common transcription errors
  • Ensuring accurate transcription of technical terms or proper nouns
  • Standardizing specific terminology in your transcripts

Example: Sieve is often transcribed as "Civ" by the Whisper models, so you can pass custom_vocabulary = {"Civ": "Sieve"} to the function to ensure it is transcribed as "Sieve" instead.
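
For illustration, a custom vocabulary with a couple of find-and-replace entries (the second entry is a made-up example):

custom_vocabulary = {
    "Civ": "Sieve",       # replace the common mis-transcription with the correct name
    "open ai": "OpenAI",  # illustrative spelling correction
}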