Text OCR (Optical Character Recognition) is a powerful technology that extracts text from images, converting printed, handwritten, or typed content into machine-readable formats. It has numerous applications, from automating data extraction to simplifying content analysis. In this comprehensive guide, we will explore how to extend OCR capabilities to videos using various foundation models and Text OCR solutions, evaluating their performance across different metrics.
Benefits of Text OCR on videos
Text OCR in videos offers several key benefits:
- Automates metadata creation, improving indexing and organization
- Detects inappropriate text within videos, supporting content moderation
- Provides more data to refine recommendation algorithms
- Enables keyword-based search that directly links to the exact timestamp in the video
- Enables more precise ad targeting by identifying relevant keywords
Various Models capable of OCR
Gemini
Gemini is a family of multimodal AI models developed by Google DeepMind, designed to process a wide array of data types, including text, images, audio, and video. This versatility allows the models to analyze and reason across these formats, making them highly effective for tackling complex question-answering tasks. The Gemini family includes models like Gemini 1.5 Pro, which is optimized for a broad range of reasoning tasks, and Gemini 1.5 Flash, designed for handling lower-complexity tasks that require high-frequency processing.
Florence 2
Florence 2 is an open-source visual foundation model designed for image question-answering tasks. On Sieve, it is accessible via the API function sieve/florence-2, where the model has been extended to support video. This enhancement enables the model to perform Text OCR on videos, expanding its capabilities beyond static images. However, it lacks context awareness and reasoning, limiting its ability to fully understand and interpret complex scenarios in videos.
Tesseract
Tesseract is an open-source Optical Character Recognition (OCR) engine designed to detect and extract text from images. To use it with video, the video must first be converted into frames, and OCR must then be performed on each individual frame. Similar to Florence 2, it also lacks context awareness and reasoning, limiting its ability to fully understand and interpret complex scenarios in videos.
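As a rough sketch of that frame-by-frame approach, the snippet below samples frames with OpenCV and runs each one through the pytesseract wrapper; the function name ocr_video_frames and the sampling interval are illustrative choices, and a local Tesseract installation is assumed.
import cv2
import pytesseract

def ocr_video_frames(video_path, sample_every_n_frames=30):
    """Run Tesseract OCR on every Nth frame of a video (a minimal sketch)."""
    cap = cv2.VideoCapture(video_path)
    results = []
    frame_number = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_number % sample_every_n_frames == 0:
            # Tesseract generally performs better on grayscale input
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            text = pytesseract.image_to_string(gray).strip()
            if text:
                results.append({"frame_number": frame_number, "text": text})
        frame_number += 1
    cap.release()
    return results

print(ocr_video_frames("path_to_your_video"))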
GPT-4o
GPT-4o, developed by OpenAI, is a powerful multimodal model capable of analyzing entire videos to extract valuable insights. However, in contrast to Gemini, GPT-4o requires videos to be converted into individual frames before processing. As a result, performing OCR on a video with GPT-4o involves first breaking the video into frames, which the model then analyzes collectively. Unlike Florence 2 and Tesseract, which process each frame independently, GPT-4o analyzes frames in relation to one another, allowing for greater context awareness.
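As an illustration, here is a minimal sketch of that frame-sampling workflow, assuming the official openai Python SDK, OpenCV for frame extraction, and an OPENAI_API_KEY set in the environment; the sampling rate, prompt, and helper name sample_frames_as_base64 are placeholder choices rather than part of any established pipeline.
import base64
import cv2
from openai import OpenAI

def sample_frames_as_base64(video_path, every_n_frames=60):
    """Grab every Nth frame and return it as a base64-encoded JPEG string."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n_frames == 0:
            ok, buffer = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buffer).decode("utf-8"))
        i += 1
    cap.release()
    return frames

client = OpenAI()
frames = sample_frames_as_base64("path_to_your_video")
# All sampled frames are sent in a single request so the model can reason across them
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "These are frames from one video, in order. Extract all on-screen text."},
            *[{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b}", "detail": "low"}} for b in frames],
        ],
    }],
)
print(response.choices[0].message.content)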
Mitigating OCR Inconsistencies using Levenshtein distance
Unlike multimodal models such as Gemini and GPT-4o, models like Florence 2 and Tesseract perform text OCR on a frame-by-frame basis, with each OCR operation functioning independently. As a result, the same text in a video may produce different OCR outputs due to factors such as camera movements, occlusions, or lighting changes between frames, leading to slight variations in characters or words. To address this, we calculate the Levenshtein distance between consecutive OCR outputs and establish a Levenshtein threshold. When multiple OCR outputs have a Levenshtein distance below this threshold, they are considered identical, indicating that the text remains consistent across those frames.
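As a quick illustration of the idea, the sketch below compares two hypothetical OCR outputs of the same caption using the python-Levenshtein package (installed later in this guide); the example strings and the threshold of 5 are arbitrary.
import Levenshtein

# Two OCR outputs of the same caption from consecutive frames,
# differing only because of motion blur between the frames
frame_1_text = "GRAND OPENING THIS SATURDAY"
frame_2_text = "GRAND 0PENING THIS SATUROAY"

distance = Levenshtein.distance(frame_1_text, frame_2_text)
LEVENSHTEIN_THRESHOLD = 5  # tune per video; use a higher value for text-heavy content

if distance <= LEVENSHTEIN_THRESHOLD:
    print(f"Treat as the same text (distance={distance})")
else:
    print(f"Treat as new text (distance={distance})")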
Testing Methodologies
The effectiveness of various models for text OCR in videos was assessed by testing them on diverse video samples, with a focus on the following key metrics:
- Text Accuracy: The model should reliably recognize visible texts with no spelling mistakes, even from poor-quality or complex frames
- Interframe Consistency: The model should produce consistent OCR output for the same text across the consecutive frames in which it appears
- Timestamp Synchronization: The timestamps for the text must be synchronized with the appearance and disappearance of the text in the video
- Occlusion Handling: The model must detect and compensate for occlusions, such as objects partially obstructing text in the video
- Text Recognition Flexibility: The model should recognize text with editing effects, such as fading, upside-down text, or glowing effects
Text OCR with a Multimodal LLM: Gemini 1.5 Pro
In the video below, the model has missed certain words due to the fast pace of the captions, and the timing of some text is off by one second. However, the recognized words have accurate spelling.
In the video at 0:23, the model accurately recognizes the text on the paper.
In the video below, the model avoids performing OCR on text that is too blurry to read, instead of attempting OCR and producing incorrect results.
Text OCR with a Multimodal LLM: Gemini 1.5 Flash
In the video below, the model has missed certain words due to the fast pace of the captions, and the timing of some text is off by one second.
In the video at 0:23, the model accurately recognizes the text on the paper.
In the video, the model generates incorrect timing for the first OCR text, causing it to last longer than it should. Additionally, later parts of the essay visible in the video are excluded from the OCR outputs.
Text OCR with a Vision Foundation model: Florence 2
In the video below, the model fails to interpret text effects such as upside-down text, 3D shadows, and other visual effects.
At the start of the video below, the model attempts to OCR a nearly invisible street sign in the background, which should have been avoided, resulting in gibberish text. Again at 0:23, the model fails to recognize occlusions caused by the flower, treating the obstructed text as a completely new phrase or sentence.
In the video below, there are spelling mistakes and character errors.
Text OCR with a traditional OCR model: Tesseract
To use Tesseract for video, the video must first be converted into frames, and OCR must be applied to each frame. After testing on sample frames we found that the Tesseract model performs poorly on most video frames - either completely missing the text or recognizing gibberish text. Therefore, it is best to avoid using it for text OCR on videos.
Results
Metric | Gemini 1.5 Pro | Gemini 1.5 Flash | Florence 2 | Tesseract |
---|---|---|---|---|
Text Accuracy | Excels in slow-paced videos with flawless text recognition but struggles with fast-paced captions, often missing quickly flashed words. | Excels in slow-paced videos with very little text on screen; fails to fully OCR essays and lengthy text, leaving out parts of it. | Spelling mistakes and missing words may occur depending on the frame conditions, resulting in accurate OCR for some frames and errors in others. | Spelling mistakes and dropping words are common; struggles with real-world props unless the prop is camera-focused. |
Timestamp Synchronization | Generates timestamps for text with a 1-second margin of error; does not support frame-level timestamping. | Timestamps are inaccurate; doesn’t support frame-level timestamping. | Supports frame-level timestamps, providing highly precise timestamping. | Timestamping is inaccurate due to frequent word-dropping across frames. |
Interframe Consistency | Performs Text OCR without relying on a frame-by-frame approach, avoiding issues of interframe inconsistency. | Performs Text OCR without relying on a frame-by-frame approach, avoiding issues of interframe inconsistency. | Lacks interframe consistency, producing varying OCR outputs for the same text due to subtle changes between frames. | Lacks interframe consistency, producing different OCR outputs for the same text depending on slight changes between frames. |
Occlusion Handling | Accurately interprets text despite occlusions by leveraging previous or upcoming visibility. | Accurately interprets text despite occlusions by leveraging previous or upcoming visibility. | Struggles with occlusions, treating any text obscured by occlusions as entirely new text. | Cannot handle occlusions, treating occluded text as entirely new. |
Text Recognition Flexibility | Handles upside-down text and flashy effects with ease. | Handles upside-down text and flashy effects with ease. | Struggles with upside-down text, producing gibberish for flashy effects and text on real-world props. | Does not recognize text with flashy effects at all. |
Comparison of Foundational Models and Text OCR Models
The emergence of foundational models has significantly transformed the landscape of text recognition, often outperforming traditional Text-specific Optical Character Recognition (OCR) models. This comparison highlights the key differences between these two types of models, particularly in their ability to recognize text in various forms and contexts, such as within videos.
Aspect | Foundational Models | Text OCR-Specific Models |
---|---|---|
Models | Gemini 1.5 Pro, GPT-4o, Florence 2 | Tesseract, docTR |
Primary Focus | General-purpose text generation tasks across various domains | Specialized for recognizing and extracting text from images |
Training Data | Diverse datasets including texts, images, and videos | Text-centric datasets, often including synthetic and cropped text images |
Performance on OCR | Highly accurate, even in the absence of text preprocessing | Very low accuracy, especially without text preprocessing |
Contextual Understanding | Can integrate text recognition with broader reasoning and analysis | Focuses solely on extracting text and lacks any contextual interpretation |
Versatility | High; capable of handling diverse OCR tasks across a wide range of test case variations | Limited; specifically optimized for recognizing text in cropped or pre-processed images |
Efficiency | Resource-intensive; may require significant computational power | Lightweight and faster for text recognition tasks |
Why Foundational Models Excel Over Text OCR Models
The main capabilities of foundational models that make them better at Text OCR tasks in videos than Text OCR-specific models are:
- Contextual Understanding: Foundational models excel at interpreting context, enabling coherent and relevant text recognition, especially in dynamic video environments where context shifts rapidly
- Diverse Training Data: Trained on vast datasets across languages, styles, and topics, allowing foundational models to generalize well across varied scenarios, including the visual diversity in videos
- Improved Error Handling: These models reduce errors by leveraging language and context, effectively interpreting ambiguous or poorly rendered text compared to standard Text OCR models
Choosing the best OCR model for you
Selecting the ideal model for your video text OCR pipeline depends on several factors, including the type of video, output requirements, costs, and speed.
Type of Video
- For videos with text on real-world props, consistent occlusions, and camera movement, choose a multimodal model such as Gemini 1.5 Pro.
- For videos with fast-paced captions, Florence 2 is a better choice.
Output Requirements
- If you need highly accurate text outputs, use Gemini 1.5 Pro for OCR.
- If you need highly precise timestamps for text outputs, Florence 2 is the best option.
- If contextual understanding of the OCR text is required, use a multimodal model such as Gemini 1.5 Pro.
Costs
Let's compare the costs across the models for a 2-minute video.
- Gemini 1.5 Pro: A 2-minute video would cost approximately $0.0432, assuming 3,000 output characters and input under 128k tokens.
- Gemini 1.5 Flash: A 2-minute video would cost approximately $0.002625, assuming 3,000 output characters and input under 128k tokens.
- GPT-4o: Analyzing a 2-minute video (1920x1080 resolution, 30 fps) with GPT-4o at high resolution costs about $9.95, and at low resolution about $0.77. Using GPT-4o mini for the same video costs $19.89 at high resolution and $1.53 at low resolution, and using the o1 model costs $52.66 at high resolution and $4.06 at low resolution (a rough sketch of how the GPT-4o figures are derived follows after this list).
- Florence 2: Florence 2 is hosted on L4 GPUs with a pay-as-you-go rate of $1.25/hr. For a 2-minute video, processing took 38 minutes, which costs approximately $0.791.
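For reference, here is a rough sketch of how the GPT-4o figures above can be reproduced, assuming OpenAI's documented image-token accounting (85 base tokens per image, plus 170 tokens per 512x512 tile after resizing for high detail) and input pricing of $2.50 per 1M tokens at the time of writing; output tokens are ignored for simplicity, and the GPT-4o mini and o1 figures are not derived here.
import math

def gpt4o_image_tokens(width, height, detail="high"):
    """Approximate input tokens for one image, per OpenAI's documented accounting (assumption)."""
    if detail == "low":
        return 85
    # High detail: fit within 2048x2048, scale the shortest side to 768, then count 512px tiles
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    scale = 768 / min(width, height)
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

frames = 2 * 60 * 30                   # 2-minute video at 30 fps
price_per_million_input_tokens = 2.50  # GPT-4o input pricing at the time of writing (assumption)

for detail in ("high", "low"):
    tokens = frames * gpt4o_image_tokens(1920, 1080, detail)
    cost = tokens / 1_000_000 * price_per_million_input_tokens
    print(f"{detail}: ~{tokens:,} tokens, ~${cost:.2f}")
    # Expected: roughly $9.95 for high detail and $0.77 for low detail, matching the figures above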
Speed
Processing times for different models depend on both complexity and hardware. Gemini 1.5 Pro typically takes around 12 minutes to process a 2-minute video, while Gemini 1.5 Flash is significantly faster, completing the same task in just 24 seconds. Both models’ processing times may vary based on prompt complexity. Meanwhile, Florence 2 requires approximately 38 minutes on an L4 GPU, with the option to upgrade the GPU for improved processing speeds.
Using Gemini 1.5 Pro through Sieve for Text OCR
Gemini 1.5 Pro can be accessed through the Sieve function sieve/visual-qa. To use it for Text OCR on videos, you need three of Visual QA's parameters:
- backend: The model to use for processing. It has two options: "gemini-1.5-flash" for simpler tasks and "gemini-1.5-pro" for more complex tasks.
- prompt: The prompt that guides the output; in our case, it needs to generate timestamps alongside the Text OCR.
- function_json: The JSON structure that the output should follow.
Below is the code to carry out Text OCR on videos using Visual QA.
import sieve
file = sieve.File(path="path_to_your_video")
prompt = "Identify all the text displayed in the video and provide me an START and END on when the words appear in the video. Multiple texts can have overlapping start and end time."
function_json = {
    "type": "list",
    "items": {
        "type": "object",
        "properties": {
            "text": {
                "type": "string",
                "description": "The text visible in the video."
            },
            "start_time": {
                "type": "number",
                "description": "The starting time of the text segment in seconds. This is the time when the text appears on screen."
            },
            "end_time": {
                "type": "number",
                "description": "The ending time of the text segment in seconds. This is the time when the text disappears on screen."
            }
        }
    }
}
visual_qa = sieve.function.get("sieve/visual-qa")
output = visual_qa.push(file, backend = "gemini-1.5-pro", prompt = prompt, function_json = function_json)
ocr_texts = output.result()
print(ocr_texts) # Text OCR output
The code below merges texts that appear and disappear at the same timestamps.
# Merge consecutive texts that start and end at the same timestamps
def merge_texts(ocr_texts):
    merged_texts = []
    current_item = None
    for item in ocr_texts:
        if current_item and current_item['start_time'] == item['start_time'] and current_item['end_time'] == item['end_time']:
            current_item['text'] += " " + item['text']
        else:
            if current_item:
                merged_texts.append(current_item)
            current_item = item
    # Add the last item to the list
    if current_item:
        merged_texts.append(current_item)
    return merged_texts

merged_texts = merge_texts(ocr_texts)
for m in merged_texts:
    print(m)  # OCR Text output with timestamp in seconds
Using Florence 2 for Text OCR
Florence 2 is accessible through the Sieve function sieve/florence-2. To carry out Text OCR with it, we use its task_prompt parameter while keeping debug_visualization set to False.
import sieve
file = sieve.File(path="path_to_your_video")
florence_2 = sieve.function.get("sieve/florence-2")
output = florence_2.push(file, task_prompt = "<OCR>", debug_visualization = False)
ocr_texts = output.result()
print(ocr_texts)
We now use the Levenshtein distance to attach timestamps to the OCR text based on its consistency across consecutive frames.
pip install python-Levenshtein
import Levenshtein
import cv2

def get_fps(video_path):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    return fps

# Merge consecutive frames having identical texts to calculate start and end timestamps
def merge_texts(ocr_texts, fps, levenshtein_threshold=5):
    merged_texts = []
    current_ocr = None
    for ocr_text_obj in ocr_texts:
        ocr_text = ocr_text_obj["<OCR>"]
        frame_number = ocr_text_obj["frame_number"]
        current_time = frame_number / fps
        if current_ocr:
            distance = Levenshtein.distance(current_ocr.strip(), ocr_text.strip())
            if distance <= levenshtein_threshold and frame_number == previous_frame + 1:
                # Extend the end time for the current segment
                end_time = current_time
            else:
                # Add the current segment to the merged list
                merged_texts.append({
                    "text": current_ocr,
                    "start_frame": start_frame,
                    "end_frame": previous_frame,
                    "start_time": start_time,
                    "end_time": end_time
                })
                # Start a new segment
                start_frame = frame_number
                start_time = current_time
                end_time = current_time
        else:
            # Initialize the first segment
            start_frame = frame_number
            start_time = current_time
            end_time = current_time
        current_ocr = ocr_text
        previous_frame = frame_number
    # Add the last segment after the loop
    if current_ocr:
        merged_texts.append({
            "text": current_ocr,
            "start_frame": start_frame,
            "end_frame": previous_frame,
            "start_time": start_time,
            "end_time": end_time
        })
    return merged_texts

fps = get_fps(file.path)
merged_texts = merge_texts(ocr_texts, fps=fps, levenshtein_threshold=5)  # The threshold depends on the type of video content; for text-heavy content, use a higher threshold.
for m in merged_texts:
    print(m)  # OCR Text output with timestamp in seconds
Conclusion
With the rapid advancement and widespread availability of open-source foundation models, alongside easily accessible closed-source solutions, the field of Optical Character Recognition (OCR) has been transformed. These state-of-the-art models, trained on diverse and extensive datasets, significantly outperform traditional text-specific OCR tools like Tesseract. Modern foundation models exhibit superior generalization and advanced reasoning capabilities, resulting in remarkable improvements in accuracy and versatility across various video OCR applications.
If you're looking to integrate Video Text OCR into your application, join our Discord community. For professional support, email us at contact@sievedata.com.