Zero-shot object detection models are designed to identify objects in images or videos, even if the model has not encountered those objects during training. Unlike traditional object detection models, which rely heavily on extensive labeled data for every category, zero-shot object detection models generalize to new categories by leveraging external knowledge or shared attributes between previously seen and unseen classes. This approach significantly reduces the reliance on labeled data, making it highly effective for tasks involving rare, novel, or highly diverse object categories.
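As a simplified illustration of how this generalization works, the sketch below scores a single image region against arbitrary text labels in a shared image-text embedding space, using CLIP from Hugging Face transformers. The file name and label list are placeholder assumptions; real zero-shot detectors fold this text-image matching step into the detection architecture itself.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("candidate_region.jpg")  # placeholder: a cropped region proposal
labels = ["a photo of a car", "a photo of a chair", "a photo of a necklace"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher image-text similarity means the region more likely depicts that label,
# even if the label never appeared in any detection training set.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")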
YOLOv8 is the latest version in the YOLO series of object detection models, known for its high-speed, real-time performance and accuracy across diverse domains. It comes in several optimized variants, such as yolov8l-face and yolov8l-world, each tailored to a specific task. For example, yolov8l-face is specialized for facial detection, while yolov8l-world targets open-vocabulary object detection: given a list of class names as input, it can recognize objects well beyond the COCO dataset classes without retraining.
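For reference, running yolov8l-world directly might look like the sketch below, which assumes the Ultralytics Python package and locally available yolov8l-world weights; the class list and image path are placeholders.

from ultralytics import YOLOWorld

model = YOLOWorld("yolov8l-world.pt")  # open-vocabulary YOLOv8 variant

# Restrict detection to an arbitrary, user-defined vocabulary;
# these classes do not have to be COCO categories.
model.set_classes(["necklace", "toy", "white car"])

results = model.predict("your_image.jpg", conf=0.05)
for box in results[0].boxes:
    print(box.xyxy, box.conf, box.cls)  # coordinates, confidence, class index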
Florence-2, developed by Microsoft, is a cutting-edge visual language model designed to tackle a wide range of computer vision tasks, such as object detection, image captioning, and understanding image-text relationships. It has been trained on a diverse dataset, allowing it to apply knowledge across various tasks without the need for task-specific training. The model also includes a text input interface, enabling users to easily specify objects for recognition, ensuring seamless interaction.
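As a sketch of what this text interface looks like when calling the model directly, the snippet below follows the usage pattern published with the Florence-2 checkpoints on Hugging Face; the image path and grounding phrase are placeholders, and the <CAPTION_TO_PHRASE_GROUNDING> task token is the same one used later in this post.

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
)

image = Image.open("your_image.jpg")
prompt = "<CAPTION_TO_PHRASE_GROUNDING>" + "a necklace"  # task token + objects

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Convert the raw token sequence into labeled bounding boxes.
parsed = processor.post_process_generation(
    generated_text,
    task="<CAPTION_TO_PHRASE_GROUNDING>",
    image_size=(image.width, image.height),
)
print(parsed)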
Grounding DINO is a vision-language model that combines the DINO transformer-based detector with grounded pre-training to enhance the interaction between visual content and language. It leverages deep learning to connect words with specific objects or regions in an image, enabling the model to not only detect objects but also reason about their interactions. This capability allows for more advanced object detection with contextual understanding.
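Although this post evaluates yolov8l-world and Florence 2, a minimal sketch of text-prompted detection with Grounding DINO through its Hugging Face transformers integration looks like this; the checkpoint name and thresholds are illustrative assumptions.

from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
model = AutoModelForZeroShotObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-base"
)

image = Image.open("your_image.jpg")
text = "a white car. a chair."  # lower-cased queries, each ending with a period

inputs = processor(images=image, text=text, return_tensors="pt")
outputs = model(**inputs)

# Map model outputs back to pixel-space boxes with matched text labels.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)
print(results[0]["boxes"], results[0]["labels"])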
MediaPipe, Google's cross-platform framework, provides an object detection solution designed for efficient detection in both images and videos. It supports real-time detection and runs seamlessly on both mobile devices and desktop systems. In addition to 2D bounding boxes, MediaPipe can also produce 3D bounding boxes, broadening its range of applications.
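A minimal sketch of 2D detection with the MediaPipe Tasks Python API follows; the TFLite model file is an assumption that must be downloaded separately, and note that this detector classifies against a fixed label set rather than free-form text prompts.

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Assumes a downloaded TFLite detector such as EfficientDet-Lite0.
base_options = python.BaseOptions(model_asset_path="efficientdet_lite0.tflite")
options = vision.ObjectDetectorOptions(base_options=base_options, score_threshold=0.5)
detector = vision.ObjectDetector.create_from_options(options)

image = mp.Image.create_from_file("your_image.jpg")
result = detector.detect(image)
for detection in result.detections:
    print(detection.bounding_box, detection.categories[0])  # box and top label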
The performance of various object detection models on videos was evaluated using a set of video samples. The analysis centered on four key metrics: accuracy, interframe consistency, class categorization robustness, and contextual classification.
The examples below demonstrate object detection using two models: the YOLO variant yolov8l-world and the Florence 2 model. The output of yolov8l-world is shown on the left, with detection confidence represented in color: red indicates confidence below 0.5, while green indicates confidence of 0.5 or higher on a scale from 0 to 1. On the right, the output of Florence 2 is displayed, with detected objects consistently marked in blue. Unlike yolov8l-world, Florence 2 does not provide numerical confidence scores for its detections.
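This coloring convention is straightforward to reproduce. The sketch below draws boxes with OpenCV, assuming a hypothetical detections list of (x1, y1, x2, y2, confidence) tuples returned by a detector.

import cv2

frame = cv2.imread("frame.jpg")  # placeholder video frame
detections = [(50, 60, 200, 220, 0.82), (300, 40, 420, 180, 0.31)]  # hypothetical output

for x1, y1, x2, y2, conf in detections:
    # BGR colors: green for confidence >= 0.5, red below.
    color = (0, 255, 0) if conf >= 0.5 else (0, 0, 255)
    cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
    cv2.putText(frame, f"{conf:.2f}", (x1, y1 - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)

cv2.imwrite("frame_annotated.jpg", frame)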
The following examples demonstrate object detection on video objects that belong to the COCO dataset classes.
In the example below, the video on the left shows yolov8l-world struggling to detect cars farther from the camera lens, whereas the video on the right demonstrates Florence 2 performing significantly better at detecting distant cars.
In the example below, the video on the left highlights yolov8l-world struggling to detect chairs farther from the camera lens, while the video on the right showcases Florence 2 performing significantly better at detecting distant chairs. Additionally, Florence 2 demonstrates greater consistency in chair detection throughout the video.
The following examples demonstrate object detection on video objects that are not among the COCO dataset classes.
In the example below, the video on the left shows yolov8l-world mistakenly detecting a notebook as a necklace at 0:02, whereas Florence 2 consistently and accurately detects the necklace.
In the example below, the video on the left highlights yolov8l-world struggling to generate consistent bounding boxes for the toy objects between 0:00 and 0:01, whereas Florence 2 demonstrates significantly greater consistency.
The following example demonstrates object detection based on the contextual properties of the objects. In this instance, both models struggle to accurately detect white cars, with misclassifications occurring throughout the video.
| | yolov8l-world | Florence 2 |
|---|---|---|
| Accuracy | Struggles to detect objects, especially those farther from the camera lens, and sometimes misses objects | Offers better classification accuracy than yolov8l-world and performs better with objects farther from the camera lens |
| Interframe Consistency | Exhibits significant inconsistencies in object detection across frames | Maintains consistent object detection between frames |
| Class Categorization Robustness | Supports classification of any object class | Supports classification of any object class |
| Contextual Classification | Fails at context-based classification | Fails at context-based classification |
Selecting the ideal model to complete your object detection pipeline depends on several factors, including accuracy requirements, speed, and cost.
On Sieve, yolov8l-world costs $0.034 per minute of input video processed. For a 2-min video, it should cost approximately $0.068.
On Sieve, Florence 2 runs on an L4 GPU which is billed at a compute-based pay-as-you-go rate of $1.25/hr. For a 2-min video that took 33 min to process, it cost approximately $0.687.
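The two pricing models can be compared with a few lines of arithmetic. The helpers below encode the rates quoted above; the 33-minute processing time comes from the example and will vary with video length and GPU load.

def yolo_world_cost(video_minutes: float, rate_per_minute: float = 0.034) -> float:
    # yolov8l-world is billed per minute of input video.
    return video_minutes * rate_per_minute

def florence_2_cost(processing_minutes: float, gpu_rate_per_hour: float = 1.25) -> float:
    # Florence 2 is billed for L4 GPU compute time, not input length.
    return processing_minutes / 60 * gpu_rate_per_hour

print(yolo_world_cost(2))    # 0.068  -> ~$0.068 for a 2-minute video
print(florence_2_cost(33))   # 0.6875 -> ~$0.687 for 33 minutes of processing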
YOLOv8 models can be accessed through the Sieve function sieve/yolov8. Sieve offers several YOLOv8 variants for object detection, including yolov8l, yolov8s, yolov8l-world, and yolov8s-world.
import sieve

# Get the hosted YOLOv8 function.
yolov8 = sieve.function.get("sieve/yolov8")

file = sieve.File(path="your_video_for_object_detection")
classes = "names_of_the_objects_to_detect"  # placeholder: object classes to detect

# push() submits the job asynchronously and returns a future.
output = yolov8.push(
    file,
    classes=classes,
    confidence_threshold=0.05,
    models="yolov8l-world",
)

yolo_coordinates = output.result()  # Block until the job finishes
print(yolo_coordinates)  # All coordinates for objects detected
The Florence 2 model can be accessed through the Sieve function sieve/florence-2, where it has been extended to support video input.
import sieve

file = sieve.File(path="your_video_for_object_detection")
task_prompt = "<CAPTION_TO_PHRASE_GROUNDING>"
text_input = "names_of_the_objects_to_detect"  # placeholder: objects to ground

florence_2 = sieve.function.get("sieve/florence-2")

# push() submits the job asynchronously and returns a future.
output = florence_2.push(
    file,
    task_prompt=task_prompt,
    text_input=text_input,
    debug_visualization=True,
)

output_object = output.result()  # Block until the job finishes
visualization_path = output_object[0].path  # Rendered video with boxes drawn
florence_coordinates = output_object[1]  # Detected object coordinates

print(florence_coordinates)  # All coordinates for objects detected
print(visualization_path)  # Video path for the visualization of bounding boxes
Zero-shot object detection is revolutionizing the field of computer vision by enabling models to identify objects they have never seen during training. The choice of detection model depends heavily on specific use-case requirements, including accuracy, speed, and cost. By understanding these factors, along with the strengths and weaknesses of each model, developers can better tailor their object detection pipelines to meet diverse demands.
If you're looking to integrate zero-shot object detection into your application, consider joining our Discord community. For professional support, email us at contact@sievedata.com.