Text OCR (Optical Character Recognition) is a powerful technology that extracts text from images, converting printed, handwritten, or typed content into machine-readable formats. It has numerous applications, from automating data extraction to simplifying content analysis. In this comprehensive guide, we will explore how to extend OCR capabilities to videos using various foundation models and Text OCR solutions, evaluating their performance across different metrics.
Benefits of Text OCR on videos
Text OCR in videos offers several key benefits:
- Automates metadata creation, improving indexing and organization
- Detects inappropriate text within videos, supporting content moderation
- Provides more data to refine recommendation algorithms
- Enables keyword-based search that directly links to the exact timestamp in the video
- Enables more precise ad targeting by identifying relevant keywords
Various Models capable of OCR
Gemini
Gemini is a family of multimodal AI models developed by Google DeepMind, designed to process a wide array of data types, including text, images, audio, and video. This versatility allows the models to analyze and reason across these formats, making them highly effective for tackling complex question-answering tasks. The Gemini family includes models like Gemini 1.5 Pro, which is optimized for a broad range of reasoning tasks, and Gemini 1.5 Flash, designed for handling lower-complexity tasks that require high-frequency processing.
Florence 2
Florence 2 is an open-source visual foundation model designed for image question-answering tasks. On Sieve, it is accessible via the API function sieve/florence-2, where the model has been extended to support video. This enhancement enables the model to perform Text OCR on videos, expanding its capabilities beyond static images. However, it lacks context awareness and reasoning, limiting its ability to fully understand and interpret complex scenarios in videos.
Tesseract
Tesseract is an open-source Optical Character Recognition (OCR) engine designed to detect and extract text from images. To use it with video, the video must first be converted into frames, and OCR must then be performed on each individual frame. Similar to Florence 2, it also lacks context awareness and reasoning, limiting its ability to fully understand and interpret complex scenarios in videos.
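As a rough sketch of that frame-by-frame approach, the snippet below samples frames with OpenCV and runs each one through the pytesseract wrapper; the function name ocr_video_frames and the sampling interval are illustrative choices, and a local Tesseract installation is assumed.
import cv2
import pytesseract

def ocr_video_frames(video_path, sample_every_n_frames=30):
    """Run Tesseract OCR on every Nth frame of a video (a minimal sketch)."""
    cap = cv2.VideoCapture(video_path)
    results = []
    frame_number = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_number % sample_every_n_frames == 0:
            # Tesseract generally performs better on grayscale input
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            text = pytesseract.image_to_string(gray).strip()
            if text:
                results.append({"frame_number": frame_number, "text": text})
        frame_number += 1
    cap.release()
    return results

print(ocr_video_frames("path_to_your_video"))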
GPT-4o
GPT-4o, developed by OpenAI, is a powerful multimodal model capable of analyzing entire videos to extract valuable insights. However, in contrast to Gemini, GPT-4o requires videos to be converted into individual frames before processing. As a result, performing OCR on a video with GPT-4o involves first breaking the video into frames, which the model then analyzes collectively. Unlike Florence 2 and Tesseract, which process each frame independently, GPT-4o analyzes frames in relation to one another, allowing for greater context awareness.
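As an illustration, here is a minimal sketch of that frame-sampling workflow, assuming the official openai Python SDK, OpenCV for frame extraction, and an OPENAI_API_KEY set in the environment; the sampling rate, prompt, and helper name sample_frames_as_base64 are placeholder choices rather than part of any established pipeline.
import base64
import cv2
from openai import OpenAI

def sample_frames_as_base64(video_path, every_n_frames=60):
    """Grab every Nth frame and return it as a base64-encoded JPEG string."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n_frames == 0:
            ok, buffer = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buffer).decode("utf-8"))
        i += 1
    cap.release()
    return frames

client = OpenAI()
frames = sample_frames_as_base64("path_to_your_video")
# All sampled frames are sent in a single request so the model can reason across them
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "These are frames from one video, in order. Extract all on-screen text."},
            *[{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b}", "detail": "low"}} for b in frames],
        ],
    }],
)
print(response.choices[0].message.content)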
Mitigating OCR Inconsistencies using Levenshtein distance
Unlike multimodal models such as Gemini and GPT-4o, models like Florence 2 and Tesseract perform text OCR on a frame-by-frame basis, with each OCR operation functioning independently. As a result, the same text in a video may produce different OCR outputs due to factors such as camera movements, occlusions, or lighting changes between frames, leading to slight variations in characters or words. To address this, we calculate the Levenshtein distance between consecutive OCR outputs and establish a Levenshtein threshold. When multiple OCR outputs have a Levenshtein distance below this threshold, they are considered identical, indicating that the text remains consistent across those frames.
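As a quick illustration of the idea, the sketch below compares two hypothetical OCR outputs of the same caption using the python-Levenshtein package (installed later in this guide); the example strings and the threshold of 5 are arbitrary.
import Levenshtein

# Two OCR outputs of the same caption from consecutive frames,
# differing only because of motion blur between the frames
frame_1_text = "GRAND OPENING THIS SATURDAY"
frame_2_text = "GRAND 0PENING THIS SATUROAY"

distance = Levenshtein.distance(frame_1_text, frame_2_text)
LEVENSHTEIN_THRESHOLD = 5  # tune per video; use a higher value for text-heavy content

if distance <= LEVENSHTEIN_THRESHOLD:
    print(f"Treat as the same text (distance={distance})")
else:
    print(f"Treat as new text (distance={distance})")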
Testing Methodologies
The effectiveness of various models for text OCR in videos was assessed by testing them on diverse video samples, with a focus on the following key metrics:
- Text Accuracy: The model should reliably recognize visible texts with no spelling mistakes, even from poor-quality or complex frames
- Interframe Consistency: The model should produce consistent OCR output for the same text across the consecutive frames in which it appears
- Timestamp Synchronization: The timestamps for the text must be synchronized with the appearance and disappearance of the text in the video
- Occlusion Handling: The model must detect and compensate for occlusions, such as objects partially obstructing text in the video
- Text Recognition Flexibility: The model should recognize text with editing effects, such as fading, upside-down text, or glowing effects
Text OCR with a Multimodal LLM: Gemini 1.5 Pro
In the video below, the model has missed certain words due to the fast pace of the captions, and the timing of some text is off by one second. However, the recognized words have accurate spelling.
In the video at 0:23, the model accurately recognizes the text on the paper.
In the video below, the model avoids performing OCR on text that is too blurry to read, instead of attempting OCR and producing incorrect results.
Text OCR with a Multimodal LLM: Gemini 1.5 Flash
In the video below, the model has missed certain words due to the fast pace of the captions, and the timing of some text is off by one second.
In the video at 0:23, the model accurately recognizes the text on the paper.
In the video, the model generates incorrect timing for the first OCR text, causing it to last longer than it should. Additionally, later parts of the essay visible in the video are excluded from the OCR outputs.
Text OCR with a Vision Foundation model: Florence 2
In the video below, the model fails to interpret text effects such as upside-down text, 3D shadows, and other visual effects.
At the start of the video below, the model attempts to OCR a nearly invisible street sign in the background, which should have been avoided, resulting in gibberish text. Again at 0:23, the model fails to recognize occlusions caused by the flower, treating the obstructed text as a completely new phrase or sentence.
In the video below, there are spelling mistakes and character errors.
Text OCR with a traditional OCR model: Tesseract
To use Tesseract for video, the video must first be converted into frames, and OCR must be applied to each frame. After testing on sample frames we found that the Tesseract model performs poorly on most video frames - either completely missing the text or recognizing gibberish text. Therefore, it is best to avoid using it for text OCR on videos.
Results
Metric | Gemini 1.5 Pro | Gemini 1.5 Flash | Florence 2 | Tesseract |
---|---|---|---|---|
Text Accuracy | Excels in slow-paced videos with flawless text recognition but struggles with fast-paced captions, often missing quickly flashed words. | Excels in slow-paced videos with very little text on screen; fails to fully OCR essays and lengthy text, leaving out parts of it. | Spelling mistakes and missing words may occur depending on the frame conditions, resulting in accurate OCR for some frames and errors in others. | Spelling mistakes and dropping words are common; struggles with real-world props unless the prop is camera-focused. |
Timestamp Synchronization | Generates timestamps for text with a 1-second margin of error; does not support frame-level timestamping. | Timestamps are inaccurate; doesn’t support frame-level timestamping. | Supports frame-level timestamps, providing highly precise timestamping. | Timestamping is inaccurate due to frequent word-dropping across frames. |
Interframe Consistency | Performs Text OCR without relying on a frame-by-frame approach, avoiding issues of interframe inconsistency. | Performs Text OCR without relying on a frame-by-frame approach, avoiding issues of interframe inconsistency. | Lacks interframe consistency, producing varying OCR outputs for the same text due to subtle changes between frames. | Lacks interframe consistency, producing different OCR outputs for the same text depending on slight changes between frames. |
Occlusion Handling | Accurately interprets text despite occlusions by leveraging previous or upcoming visibility. | Accurately interprets text despite occlusions by leveraging previous or upcoming visibility. | Struggles with occlusions, treating any text obscured by occlusions as entirely new text. | Cannot handle occlusions, treating occluded text as entirely new. |
Text Recognition Flexibility | Handles upside-down text and flashy effects with ease. | Handles upside-down text and flashy effects with ease. | Struggles with upside-down text, producing gibberish for flashy effects and text on real-world props. | Does not recognize text with flashy effects at all. |
Comparison of Foundational Models and Text OCR Models
The emergence of foundational models has significantly transformed the landscape of text recognition, often outperforming traditional Text-specific Optical Character Recognition (OCR) models. This comparison highlights the key differences between these two types of models, particularly in their ability to recognize text in various forms and contexts, such as within videos.
Aspect | Foundational Models | Text OCR-Specific Models |
---|---|---|
Models | Gemini 1.5 Pro, GPT-4o, Florence 2 | Tesseract, docTR |
Primary Focus | General-purpose text generation tasks across various domains | Specialized for recognizing and extracting text from images |
Training Data | Diverse datasets including texts, images, and videos | Text-centric datasets, often including synthetic and cropped text images |
Performance on OCR | Highly accurate, even in the absence of text preprocessing | Very low accuracy, especially without text preprocessing |
Contextual Understanding | Can integrate text recognition with broader reasoning and analysis | Focuses solely on extracting text and lacks any contextual interpretation |
Versatility | High; capable of handling diverse OCR tasks across a wide range of test case variations | Limited; specifically optimized for recognizing text in cropped or pre-processed images |
Efficiency | Resource-intensive; may require significant computational power | Lightweight and faster for text recognition tasks |
Why Foundational Models Excel Over Text OCR Models
The main capabilities of foundational models that make them better at Text OCR tasks in videos than Text OCR-specific models are:
- Contextual Understanding: Foundational models excel at interpreting context, enabling coherent and relevant text recognition, especially in dynamic video environments where context shifts rapidly
- Diverse Training Data: Trained on vast datasets across languages, styles, and topics, allowing foundational models to generalize well across varied scenarios, including the visual diversity in videos
- Improved Error Handling: These models reduce errors by leveraging language and context, effectively interpreting ambiguous or poorly rendered text compared to standard Text OCR models
Choosing the best OCR model for you
Selecting the ideal model for your video text OCR pipeline depends on several factors, including the type of video, output requirements, costs, and speed.
Type of Video
- For videos with text on real-world props, consistent occlusions, and camera movement, choose a multimodal model such as Gemini 1.5 Pro.
- For videos with fast-paced captions, Florence 2 is a better choice.
Output Requirements
- If you need highly accurate text outputs, use Gemini 1.5 Pro for OCR.
- If you need highly precise timestamps for text outputs, Florence 2 is the best option.
- If contextual understanding of the OCR text is required, use a multimodal model such as Gemini 1.5 Pro.
Costs
Let's compare the costs across the models for a 2-minute video.
- Gemini 1.5 Pro: A 2-minute video would cost approximately $0.0432, assuming 3,000 output characters and input under 128k tokens.
- Gemini 1.5 Flash: A 2-minute video would cost approximately $0.002625, assuming 3,000 output characters and input under 128k tokens.
- GPT-4o: Analyzing a 2-minute video (1920x1080 resolution, 30 fps) with GPT-4o at high resolution costs about $9.95, and at low resolution about $0.77. Using GPT-4o mini for the same video costs $19.89 at high resolution and $1.53 at low resolution, and using the o1 model costs $52.66 at high resolution and $4.06 at low resolution (a rough sketch of how the GPT-4o figures are derived follows after this list).
- Florence 2: Florence 2 is hosted on L4 GPUs with a pay-as-you-go rate of $1.25/hr. For a 2-minute video, processing took 38 minutes, which costs approximately $0.791.
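For reference, here is a rough sketch of how the GPT-4o figures above can be reproduced, assuming OpenAI's documented image-token accounting (85 base tokens per image, plus 170 tokens per 512x512 tile after resizing for high detail) and input pricing of $2.50 per 1M tokens at the time of writing; output tokens are ignored for simplicity, and the GPT-4o mini and o1 figures are not derived here.
import math

def gpt4o_image_tokens(width, height, detail="high"):
    """Approximate input tokens for one image, per OpenAI's documented accounting (assumption)."""
    if detail == "low":
        return 85
    # High detail: fit within 2048x2048, scale the shortest side to 768, then count 512px tiles
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    scale = 768 / min(width, height)
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

frames = 2 * 60 * 30                   # 2-minute video at 30 fps
price_per_million_input_tokens = 2.50  # GPT-4o input pricing at the time of writing (assumption)

for detail in ("high", "low"):
    tokens = frames * gpt4o_image_tokens(1920, 1080, detail)
    cost = tokens / 1_000_000 * price_per_million_input_tokens
    print(f"{detail}: ~{tokens:,} tokens, ~${cost:.2f}")
    # Expected: roughly $9.95 for high detail and $0.77 for low detail, matching the figures above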
Speed
Processing times for different models depend on both complexity and hardware. Gemini 1.5 Pro typically takes around 12 minutes to process a 2-minute video, while Gemini 1.5 Flash is significantly faster, completing the same task in just 24 seconds. Both models’ processing times may vary based on prompt complexity. Meanwhile, Florence 2 requires approximately 38 minutes on an L4 GPU, with the option to upgrade the GPU for improved processing speeds.
Using Gemini 1.5 Pro through Sieve for Text OCR
Gemini 1.5 Pro can be accessed through the Sieve function sieve/visual-qa. To use it for Text OCR on videos, you need three of Visual QA's parameters:
- backend: The model to use for processing. It has two options: "gemini-1.5-flash" for simpler tasks and "gemini-1.5-pro" for more complex tasks.
- prompt: The prompt that guides the output; in our case, it needs to generate timestamps alongside the Text OCR.
- function_json: The JSON structure that the output should follow.
Below is the code to carry out Text OCR on videos using Visual QA.
import sieve
file = sieve.File(path="path_to_your_video")
prompt = "Identify all the text displayed in the video and provide me an START and END on when the words appear in the video. Multiple texts can have overlapping start and end time."
function_json = {
    "type": "list",
    "items": {
        "type": "object",
        "properties": {
            "text": {
                "type": "string",
                "description": "The text visible in the video."
            },
            "start_time": {
                "type": "number",
                "description": "The starting time of the text segment in seconds. This is the time when the text appears on screen."
            },
            "end_time": {
                "type": "number",
                "description": "The ending time of the text segment in seconds. This is the time when the text disappears on screen."
            }
        }
    }
}
visual_qa = sieve.function.get("sieve/visual-qa")
output = visual_qa.push(file, backend = "gemini-1.5-pro", prompt = prompt, function_json = function_json)
ocr_texts = output.result()
print(ocr_texts) # Text OCR output
The code below merges texts that appear and disappear at the same timestamps.
# Merge consecutive texts that start and end at the same timestamps
def merge_texts(ocr_texts):
    merged_texts = []
    current_item = None
    for item in ocr_texts:
        if current_item and current_item['start_time'] == item['start_time'] and current_item['end_time'] == item['end_time']:
            current_item['text'] += " " + item['text']
        else:
            if current_item:
                merged_texts.append(current_item)
            current_item = item
    # Add the last item to the list
    if current_item:
        merged_texts.append(current_item)
    return merged_texts

merged_texts = merge_texts(ocr_texts)
for m in merged_texts:
    print(m)  # OCR Text output with timestamp in seconds
Using Florence 2 for Text OCR
Florence 2 is accessible through the Sieve function sieve/florence-2. To carry out Text OCR with it, we use its task_prompt parameter while keeping debug_visualization set to False.
import sieve
file = sieve.File(path="path_to_your_video")
florence_2 = sieve.function.get("sieve/florence-2")
output = florence_2.push(file, task_prompt = "<OCR>", debug_visualization = False)
ocr_texts = output.result()
print(ocr_texts)
We now use the Levenshtein distance to attach timestamps to the OCR text based on its consistency across consecutive frames.
pip install python-Levenshtein
import Levenshtein
import cv2

def get_fps(video_path):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    return fps

# Merge consecutive frames having identical texts to calculate start and end timestamps
def merge_texts(ocr_texts, fps, levenshtein_threshold=5):
    merged_texts = []
    current_ocr = None
    for ocr_text_obj in ocr_texts:
        ocr_text = ocr_text_obj["<OCR>"]
        frame_number = ocr_text_obj["frame_number"]
        current_time = frame_number / fps
        if current_ocr:
            distance = Levenshtein.distance(current_ocr.strip(), ocr_text.strip())
            if distance <= levenshtein_threshold and frame_number == previous_frame + 1:
                # Extend the end time for the current segment
                end_time = current_time
            else:
                # Add the current segment to the merged list
                merged_texts.append({
                    "text": current_ocr,
                    "start_frame": start_frame,
                    "end_frame": previous_frame,
                    "start_time": start_time,
                    "end_time": end_time
                })
                # Start a new segment
                start_frame = frame_number
                start_time = current_time
                end_time = current_time
        else:
            # Initialize the first segment
            start_frame = frame_number
            start_time = current_time
            end_time = current_time
        current_ocr = ocr_text
        previous_frame = frame_number
    # Add the last segment after the loop
    if current_ocr:
        merged_texts.append({
            "text": current_ocr,
            "start_frame": start_frame,
            "end_frame": previous_frame,
            "start_time": start_time,
            "end_time": end_time
        })
    return merged_texts

fps = get_fps(file.path)
merged_texts = merge_texts(ocr_texts, fps=fps, levenshtein_threshold=5)  # The threshold depends on the type of video content; for text-heavy content, use a higher threshold.
for m in merged_texts:
    print(m)  # OCR Text output with timestamp in seconds
Conclusion
With the rapid advancement and widespread availability of open-source foundation models, alongside easily accessible closed-source solutions, the field of Optical Character Recognition (OCR) has been transformed. These state-of-the-art models, trained on diverse and extensive datasets, significantly outperform traditional text-specific OCR tools like Tesseract. Modern foundation models exhibit superior generalization and advanced reasoning capabilities, resulting in remarkable improvements in accuracy and versatility across various video OCR applications.
If you're looking to integrate Video Text OCR into your application, join our Discord community. For professional support, email us at contact@sievedata.com.