Zero-shot object detection models are designed to identify objects in images or videos, even if the model has not encountered those objects during training. Unlike traditional object detection models, which rely heavily on extensive labeled data for every category, zero-shot object detection models generalize to new categories by leveraging external knowledge or shared attributes between previously seen and unseen classes. This approach significantly reduces the reliance on labeled data, making it highly effective for tasks involving rare, novel, or highly diverse object categories.
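As a simplified illustration of how this generalization works, the sketch below scores a single image region against arbitrary text labels in a shared image-text embedding space, using CLIP from Hugging Face transformers. The file name and label list are placeholder assumptions; real zero-shot detectors fold this text-image matching step into the detection architecture itself.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("candidate_region.jpg")  # placeholder: a cropped region proposal
labels = ["a photo of a car", "a photo of a chair", "a photo of a necklace"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher image-text similarity means the region more likely depicts that label,
# even if the label never appeared in any detection training set.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")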
YOLOv8 is the latest version in the YOLO series of object detection models, known for its high-speed, real-time performance and accuracy across diverse domains. It comes in several optimized variants, such as yolov8l-face and yolov8l-world, each tailored to a specific task. For example, yolov8l-face is specialized for facial detection, while yolov8l-world targets open-vocabulary object detection: given a list of class names as input, it can recognize objects well beyond the COCO dataset classes without retraining.
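For reference, running yolov8l-world directly might look like the sketch below, which assumes the Ultralytics Python package and locally available yolov8l-world weights; the class list and image path are placeholders.

from ultralytics import YOLOWorld

model = YOLOWorld("yolov8l-world.pt")  # open-vocabulary YOLOv8 variant

# Restrict detection to an arbitrary, user-defined vocabulary;
# these classes do not have to be COCO categories.
model.set_classes(["necklace", "toy", "white car"])

results = model.predict("your_image.jpg", conf=0.05)
for box in results[0].boxes:
    print(box.xyxy, box.conf, box.cls)  # coordinates, confidence, class index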
Florence-2, developed by Microsoft, is a cutting-edge visual language model designed to tackle a wide range of computer vision tasks, such as object detection, image captioning, and understanding image-text relationships. It has been trained on a diverse dataset, allowing it to apply knowledge across various tasks without the need for task-specific training. The model also includes a text input interface, enabling users to easily specify objects for recognition, ensuring seamless interaction.
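As a sketch of what this text interface looks like when calling the model directly, the snippet below follows the usage pattern published with the Florence-2 checkpoints on Hugging Face; the image path and grounding phrase are placeholders, and the <CAPTION_TO_PHRASE_GROUNDING> task token is the same one used later in this post.

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
)

image = Image.open("your_image.jpg")
prompt = "<CAPTION_TO_PHRASE_GROUNDING>" + "a necklace"  # task token + objects

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Convert the raw token sequence into labeled bounding boxes.
parsed = processor.post_process_generation(
    generated_text,
    task="<CAPTION_TO_PHRASE_GROUNDING>",
    image_size=(image.width, image.height),
)
print(parsed)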
Grounding DINO is a vision-language model that combines the DINO transformer-based detector with grounded pre-training to enhance the interaction between visual content and language. It leverages deep learning to connect words with specific objects or regions in an image, enabling the model to not only detect objects but also reason about their interactions. This capability allows for more advanced object detection with contextual understanding.
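Although this post evaluates yolov8l-world and Florence 2, a minimal sketch of text-prompted detection with Grounding DINO through its Hugging Face transformers integration looks like this; the checkpoint name and thresholds are illustrative assumptions.

from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
model = AutoModelForZeroShotObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-base"
)

image = Image.open("your_image.jpg")
text = "a white car. a chair."  # lower-cased queries, each ending with a period

inputs = processor(images=image, text=text, return_tensors="pt")
outputs = model(**inputs)

# Map model outputs back to pixel-space boxes with matched text labels.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)
print(results[0]["boxes"], results[0]["labels"])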
MediaPipe, Google's cross-platform framework, provides an object detection solution designed for efficient detection in both images and videos. It supports real-time detection and runs seamlessly on both mobile devices and desktop systems. In addition to 2D bounding boxes, MediaPipe can also produce 3D bounding boxes, broadening its range of applications.
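A minimal sketch of 2D detection with the MediaPipe Tasks Python API follows; the TFLite model file is an assumption that must be downloaded separately, and note that this detector classifies against a fixed label set rather than free-form text prompts.

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Assumes a downloaded TFLite detector such as EfficientDet-Lite0.
base_options = python.BaseOptions(model_asset_path="efficientdet_lite0.tflite")
options = vision.ObjectDetectorOptions(base_options=base_options, score_threshold=0.5)
detector = vision.ObjectDetector.create_from_options(options)

image = mp.Image.create_from_file("your_image.jpg")
result = detector.detect(image)
for detection in result.detections:
    print(detection.bounding_box, detection.categories[0])  # box and top label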
The performance of various object detection models on videos was evaluated using a set of video samples. The analysis centered on four key metrics: accuracy, interframe consistency, class categorization robustness, and contextual classification.
The examples below demonstrate object detection using two models: the YOLO variant yolov8l-world and the Florence 2 model. The output of yolov8l-world is shown on the left, with detection confidence represented in color: red indicates confidence below 0.5, while green indicates confidence of 0.5 or higher on a scale from 0 to 1. On the right, the output of Florence 2 is displayed, with detected objects consistently marked in blue. Unlike yolov8l-world, Florence 2 does not provide numerical confidence scores for its detections.
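This coloring convention is straightforward to reproduce. The sketch below draws boxes with OpenCV, assuming a hypothetical detections list of (x1, y1, x2, y2, confidence) tuples returned by a detector.

import cv2

frame = cv2.imread("frame.jpg")  # placeholder video frame
detections = [(50, 60, 200, 220, 0.82), (300, 40, 420, 180, 0.31)]  # hypothetical output

for x1, y1, x2, y2, conf in detections:
    # BGR colors: green for confidence >= 0.5, red below.
    color = (0, 255, 0) if conf >= 0.5 else (0, 0, 255)
    cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
    cv2.putText(frame, f"{conf:.2f}", (x1, y1 - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)

cv2.imwrite("frame_annotated.jpg", frame)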
The following examples demonstrate object detection on video objects that belong to the COCO dataset classes.
In the example below, the video on the left shows yolov8l-world struggling to detect cars farther from the camera lens, whereas the video on the right demonstrates Florence 2 performing significantly better at detecting distant cars.
In the example below, the video on the left highlights yolov8l-world struggling to detect chairs farther from the camera lens, while the video on the right showcases Florence 2 performing significantly better at detecting distant chairs. Additionally, Florence 2 demonstrates greater consistency in chair detection throughout the video.
The following examples demonstrate object detection on video objects that are not among the COCO dataset classes.
In the example below, the video on the left shows yolov8l-world mistakenly detecting a notebook as a necklace at 0:02, whereas Florence 2 consistently and accurately detects the necklace.
In the example below, the video on the left highlights yolov8l-world struggling to generate consistent bounding boxes for the toy objects between 0:00 and 0:01, whereas Florence 2 demonstrates significantly greater consistency.
The following example demonstrates object detection based on the contextual properties of the objects. In this instance, both models struggle to accurately detect white cars, with misclassifications occurring throughout the video.
| | yolov8l-world | Florence 2 |
|---|---|---|
| Accuracy | Struggles to detect objects, especially those farther from the camera lens, and sometimes misses objects | Offers better classification accuracy than yolov8l-world and performs better with objects farther from the camera lens |
| Interframe Consistency | Exhibits significant inconsistencies in object detection across frames | Maintains consistent object detection between frames |
| Class Categorization Robustness | Supports classification of any object class | Supports classification of any object class |
| Contextual Classification | Fails at context-based classification | Fails at context-based classification |
Selecting the ideal model to complete your object detection pipeline depends on several factors, including accuracy requirements, speed, and cost.
On Sieve, yolov8l-world costs $0.034 per minute of input video processed. For a 2-min video, it should cost approximately $0.068.
On Sieve, Florence 2 runs on an L4 GPU which is billed at a compute-based pay-as-you-go rate of $1.25/hr. For a 2-min video that took 33 min to process, it cost approximately $0.687.
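The two pricing models can be compared with a few lines of arithmetic. The helpers below encode the rates quoted above; the 33-minute processing time comes from the example and will vary with video length and GPU load.

def yolo_world_cost(video_minutes: float, rate_per_minute: float = 0.034) -> float:
    # yolov8l-world is billed per minute of input video.
    return video_minutes * rate_per_minute

def florence_2_cost(processing_minutes: float, gpu_rate_per_hour: float = 1.25) -> float:
    # Florence 2 is billed for L4 GPU compute time, not input length.
    return processing_minutes / 60 * gpu_rate_per_hour

print(yolo_world_cost(2))    # 0.068  -> ~$0.068 for a 2-minute video
print(florence_2_cost(33))   # 0.6875 -> ~$0.687 for 33 minutes of processing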
YOLOv8 models can be accessed through the Sieve function sieve/yolov8. Sieve offers several YOLOv8 variants for object detection, including yolov8l, yolov8s, yolov8l-world, and yolov8s-world.
import sieve

# Get the hosted YOLOv8 function.
yolov8 = sieve.function.get("sieve/yolov8")

file = sieve.File(path="your_video_for_object_detection")
classes = "names_of_the_objects_to_detect"  # placeholder: object classes to detect

# push() submits the job asynchronously and returns a future.
output = yolov8.push(
    file,
    classes=classes,
    confidence_threshold=0.05,
    models="yolov8l-world",
)

yolo_coordinates = output.result()  # Block until the job finishes
print(yolo_coordinates)  # All coordinates for objects detected
The Florence 2 model can be accessed through the Sieve function sieve/florence-2, where it has been extended to support video input.
import sieve

file = sieve.File(path="your_video_for_object_detection")
task_prompt = "<CAPTION_TO_PHRASE_GROUNDING>"
text_input = "names_of_the_objects_to_detect"  # placeholder: objects to ground

florence_2 = sieve.function.get("sieve/florence-2")

# push() submits the job asynchronously and returns a future.
output = florence_2.push(
    file,
    task_prompt=task_prompt,
    text_input=text_input,
    debug_visualization=True,
)

output_object = output.result()  # Block until the job finishes
visualization_path = output_object[0].path  # Rendered video with boxes drawn
florence_coordinates = output_object[1]  # Detected object coordinates

print(florence_coordinates)  # All coordinates for objects detected
print(visualization_path)  # Video path for the visualization of bounding boxes
Zero-shot object detection is revolutionizing the field of computer vision by enabling models to identify objects they have never seen during training. The choice of detection model depends heavily on specific use-case requirements, including accuracy, speed, and cost. By understanding these factors, along with the strengths and weaknesses of each model, developers can better tailor their object detection pipelines to meet diverse demands.
If you're looking to integrate zero-shot object detection into your application, consider joining our Discord community. For professional support, email us at contact@sievedata.com.