Video object segmentation involves tracking specific objects throughout a video. Meta's Segment Anything Model 2 (SAM 2) and its variants have transformed this field with faster and more accurate object segmentation and tracking. In this post, we'll explore cutting-edge variants (SAMURAI, DAM4SAM, Video Amodal Segmentation, SAM2Long, SAMWISE, and EfficientTAM) and examine how they enhance SAM 2's capabilities.
SAM 2
Meta AI's Segment Anything Model 2 (SAM 2) is an open-source foundational model for object segmentation in images and videos. Building on its predecessor, SAM 2 introduces dynamic video capabilities for accurate object selection and tracking across frames. SAM 2 is available as an API on Sieve. Key features include:
- Unified Architecture: Combines image and video segmentation for consistency across media types
- Dynamic Memory Integration: Retains object information for improved tracking and occlusion handling
- Advanced Prompting: Supports multiple interaction methods (clicks, bounding boxes, masks); a minimal usage sketch follows this list
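For reference, here is a minimal sketch of how such prompts are supplied through the open-source `sam2` package's video predictor. The config and checkpoint paths are placeholders, and exact function names can differ slightly between SAM 2 releases.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config and checkpoint paths are placeholders for your local files.
predictor = build_sam2_video_predictor("configs/sam2.1/sam2.1_hiera_l.yaml",
                                       "checkpoints/sam2.1_hiera_large.pt")

with torch.inference_mode():
    # `video_path` can be a directory of JPEG frames extracted from the video.
    state = predictor.init_state(video_path="frames/")

    # Prompt object 1 on frame 0 with a single positive click (x, y).
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),  # 1 = foreground, 0 = background
    )

    # Propagate the prompt: SAM 2's memory carries the object across frames.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # binary masks per object
```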
SAMURAI
SAMURAI is a zero-shot visual object tracking method built on SAM 2.1 weights, requiring no additional training. It addresses SAM 2's limitations in dynamic video environments, particularly with fast-moving or occluded objects, through three key advancements:
- Kalman Filter-based Motion Modeling: This technique lets SAMURAI predict object trajectories, ensuring accurate tracking even in dynamic environments with rapid movements or visually similar objects (a simplified filter is sketched after this list).
- Motion-aware Memory Selection: SAMURAI dynamically evaluates previous frames and prioritizes those with higher-quality masks and object confidence, reducing error propagation and improving robustness.
- Dynamic Frame Prioritization: By selecting frames with higher-quality masks and confidence, SAMURAI addresses error propagation issues present in SAM 2's fixed-window memory architecture.
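To make the motion-modeling idea concrete, here is a minimal, self-contained sketch of a constant-velocity Kalman filter over a bounding box (cx, cy, w, h). It illustrates the general technique SAMURAI builds on, not SAMURAI's exact implementation, which also weighs the predicted box against SAM 2's candidate mask scores during mask selection.

```python
import numpy as np

class BoxKalmanFilter:
    """Constant-velocity Kalman filter over a bounding box (cx, cy, w, h)."""

    def __init__(self, box):
        # State: [cx, cy, w, h, vx, vy, vw, vh]
        self.x = np.zeros(8)
        self.x[:4] = box
        self.P = np.eye(8) * 10.0            # state covariance
        self.F = np.eye(8)                   # transition: position += velocity
        self.F[:4, 4:] = np.eye(4)
        self.H = np.eye(4, 8)                # we only observe the box itself
        self.Q = np.eye(8) * 1e-2            # process noise
        self.R = np.eye(4) * 1e-1            # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]                    # predicted box for the next frame

    def update(self, box):
        y = np.asarray(box) - self.H @ self.x            # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)         # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(8) - K @ self.H) @ self.P

# Usage: predict where the target should be, then update with the box
# derived from the mask actually produced for that frame.
kf = BoxKalmanFilter([100, 120, 40, 80])
predicted_box = kf.predict()
kf.update([104, 123, 41, 79])
```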
Output Comparisons
In the following example, SAM 2 fails to track the person in motion during the time intervals from 0:12 to 0:23 and again from 0:29 to 0:33.
Performance Improvements
On the LaSOT-ext dataset (a real-world benchmark designed to evaluate trackers on long-term tracking challenges), SAMURAI excelled at handling occlusions and fast-moving objects, improving tracking under camera motion by 16.5% and fast-moving target tracking by nearly 10% compared to SAM 2.
DAM4SAM
DAM4SAM, short for Distractor-Aware Memory (DAM) for Visual Object Tracking with SAM 2, enhances the SAM 2 framework by addressing the challenges posed by distractors in complex scenes, improving the model's ability to stay locked on the intended target. It achieves this through several key mechanisms:
- Dual Memory Structure (a simplified sketch follows this list):
- Recent Appearance Memory (RAM): This component retains recent target appearances, ensuring accurate segmentation by focusing on the most relevant frames.
- Distractor-Resolving Memory (DRM): This part stores anchor frames specifically designed to help differentiate the target from distractors, enhancing the model's ability to maintain focus on the intended object.
- Introspection-Based Updating: The updating mechanism for DRM utilizes output information from SAM 2, allowing the model to adaptively refine its memory based on tracking reliability and the presence of distractors.
- Temporal Encoding: RAM employs a FIFO buffer and temporal encoding to prioritize the most relevant frames, which helps in managing visual redundancy and improving localization capabilities.
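Below is a simplified sketch of this dual memory, assuming a FIFO buffer for RAM and a quality-gated anchor set for DRM. The threshold and update rule are illustrative assumptions, not DAM4SAM's exact introspection logic.

```python
from collections import deque

class DistractorAwareMemory:
    """Simplified sketch of a dual memory: a FIFO recent-appearance memory (RAM)
    and a distractor-resolving memory (DRM) of anchor frames."""

    def __init__(self, ram_size=7):
        self.ram = deque(maxlen=ram_size)   # recent frames, oldest evicted first
        self.drm = []                       # anchor frames kept for disambiguation

    def update(self, frame_features, mask_quality, distractor_present):
        # RAM always receives the newest frame; its temporal order is what the
        # temporal encoding prioritizes.
        self.ram.append(frame_features)

        # DRM is updated introspectively: only when the current output looks
        # reliable *and* a distractor was present do we keep an anchor frame.
        if distractor_present and mask_quality > 0.8:   # threshold is illustrative
            self.drm.append(frame_features)

    def memory_bank(self):
        # The tracker attends over both memories when segmenting the next frame.
        return list(self.ram) + self.drm
```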
Output Comparisons
In the following example, SAM 2.1 (a minor update to SAM 2, with improved speed and processing) loses track of the correct zebra at 0:10. Similarly, SAMURAI fails to track the correct zebra starting at 0:21. In contrast, DAM4SAM consistently tracks the correct zebra throughout the video, even in the presence of distractions.
Performance Improvements
On the DiDi dataset (a tracking dataset whose sequences contain distractors), DAM4SAM demonstrates a substantial increase in robustness and accuracy compared to its predecessor, SAM 2.1. Specifically, DAM4SAM achieves a robustness score of 0.944, a 3.3% increase, and an accuracy score of 0.727, a 1.3% improvement over SAM 2.1.
Video Amodal Segmentation
SAM 2 segments only the visible (modal) parts of objects and does not account for their amodal nature, i.e., the portions hidden behind occluders. "Using Diffusion Priors for Video Amodal Segmentation" is a research paper that tackles video amodal segmentation using a two-stage method that generates amodal (visible + invisible) masks:
- First stage (Amodal Mask Generation):
- The system starts with an object's modal masks (generated by SAM 2) and a pseudo-depth map of the scene
- Using these, it predicts amodal masks, which capture the object's full extent, including regions that are partially hidden or occluded
- Second stage (Inpainting):
- The predicted amodal masks are then passed to the second stage, along with the visible part of the object’s image
- In this stage, the system inpaints the hidden parts of the object, predicting what the full object looks like by adding the missing details (a structural sketch of the full pipeline follows this list)
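The overall pipeline can be pictured as follows. `amodal_mask_model` and `content_inpainting_model` are hypothetical placeholders standing in for the paper's two diffusion-based stages, so treat this purely as a structural sketch.

```python
def amodal_segmentation_pipeline(frames, modal_masks, pseudo_depth,
                                 amodal_mask_model, content_inpainting_model):
    """Two-stage sketch: modal masks + pseudo-depth -> amodal masks -> inpainted RGB.

    `frames`, `modal_masks`, and `pseudo_depth` are per-frame numpy arrays;
    the two model arguments are hypothetical callables standing in for the
    paper's diffusion-based stages.
    """
    # Stage 1: predict the full (visible + occluded) extent of the object
    # from its modal masks and a pseudo-depth estimate of the scene.
    amodal_masks = amodal_mask_model(modal_masks, pseudo_depth)

    # Stage 2: inpaint the hidden appearance, conditioned on the visible pixels.
    visible_rgb = [f * m[..., None] for f, m in zip(frames, modal_masks)]
    amodal_rgb = content_inpainting_model(visible_rgb, amodal_masks)

    return amodal_masks, amodal_rgb
```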
Working of Amodal Segmentation
Example outputs
In the following example, SAM 2 cannot track the object once it becomes invisible, whereas Video Amodal Segmentation effectively tracks the object even when it is obscured by an opaque obstruction.
Performance Improvements
When the tracked object in a video becomes fully occluded, SAM 2 fails to maintain tracking, whereas Video Amodal Segmentation continues to predict the object's complete mask through the occlusion.
SAM2Long
SAM2Long significantly improves upon SAM 2 by addressing the error-accumulation issue, particularly in challenging long-term video scenarios involving object occlusion and reappearance. With SAM2Long, the segmentation process becomes more resilient and accurate over time, maintaining strong performance even as objects are occluded or reappear in the video stream. It achieves this through several key mechanisms:
- Error Accumulation Mitigation: SAM2Long reduces segmentation errors by maintaining multiple pathways per frame, exploring diverse hypotheses, and avoiding reliance on a single pathway.
- Training-Free Memory Tree: Introduces a dynamic, training-free memory tree to manage segmentation hypotheses, prune less optimal paths, and minimize error propagation.
- Cumulative Scoring Mechanism: Uses a scoring system to prioritize pathways with consistent accuracy, improving long-term segmentation reliability (a simplified pathway-selection step is sketched after this list).
- Occlusion Awareness: Enhances tracking during occlusions by focusing on pathways that detect objects even when briefly obscured.
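Here is a simplified view of the pathway bookkeeping, assuming each pathway carries a cumulative score and each frame contributes a handful of candidate masks with confidences. The branching and pruning rules below are illustrative, not SAM2Long's exact heuristics.

```python
def select_pathways(pathways, candidate_masks, num_keep=3):
    """Sketch of one cumulative-scoring step over segmentation pathways.

    `pathways` is a list of (cumulative_score, mask_history) pairs, and
    `candidate_masks[i]` holds (mask, confidence) candidates produced for
    pathway i on the current frame. The branching factor and `num_keep`
    are illustrative choices.
    """
    expanded = []
    for i, (cum_score, history) in enumerate(pathways):
        for mask, confidence in candidate_masks[i]:
            # Each candidate extends its pathway; confidences accumulate so
            # consistently reliable pathways rise to the top over time.
            expanded.append((cum_score + confidence, history + [mask]))

    # Prune to the best hypotheses so errors in weak branches stop propagating.
    expanded.sort(key=lambda p: p[0], reverse=True)
    return expanded[:num_keep]
```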
Output Comparisons
In the following example, SAM 2 frequently loses track of the green car during scene changes, while SAM2Long handles these transitions significantly better.
Performance Improvements
- SA-V Benchmark: SAM2Long improves over SAM 2 by 5.3 points on SA-V, a large and diverse video dataset; doing better here highlights stronger segmentation accuracy across a wide variety of videos
- LVOS Validation Set: SAM2Long outperforms SAM 2 by 3.5 points on LVOS (Long-term Video Object Segmentation), a dataset of 720 videos averaging 1.14 minutes each, showcasing better performance on longer videos
SAMWISE
SAMWISE is an advanced approach to Referring Video Object Segmentation (RVOS), which involves segmenting objects in video sequences based on language expressions. It builds upon the strengths of SAM 2 by incorporating natural language understanding and temporal modeling. By addressing critical challenges in video segmentation—such as maintaining contextual information across frames—SAMWISE ensures more accurate object tracking and segmentation. It achieves this through several key mechanisms:
- Natural Language Understanding: SAMWISE uses a frozen text encoder to extract meaningful features from language queries, enabling precise object identification and segmentation in videos.
- Cross-Modal Temporal (CMT) Adapter: This module captures temporal object evolution and aligns visual and textual features for better multi-modal interaction.
- Visual-to-Text and Text-to-Visual Attention: Symmetric cross-attention mechanisms help focus on relevant objects by aligning visual cues with textual descriptions (a minimal version is sketched after this list)
- Conditional Memory Encoder (CME): CME dynamically refocuses on objects aligned with text, reducing tracking bias and enhancing segmentation accuracy.
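Below is a minimal PyTorch sketch of symmetric cross-attention between visual and text tokens. It illustrates the general visual-to-text / text-to-visual idea rather than SAMWISE's exact CMT adapter architecture.

```python
import torch
import torch.nn as nn

class SymmetricCrossAttention(nn.Module):
    """Visual-to-text and text-to-visual attention over token sequences.

    A generic sketch of the cross-modal idea, not SAMWISE's exact module.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.v2t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens, text_tokens):
        # Visual tokens query the text: "which words describe this region?"
        vis_out, _ = self.v2t(visual_tokens, text_tokens, text_tokens)
        # Text tokens query the visuals: "which regions match this expression?"
        txt_out, _ = self.t2v(text_tokens, visual_tokens, visual_tokens)
        return visual_tokens + vis_out, text_tokens + txt_out

# Example shapes: batch of 2, 1024 visual tokens and 12 text tokens, dim 256.
attn = SymmetricCrossAttention(dim=256)
v, t = attn(torch.randn(2, 1024, 256), torch.randn(2, 12, 256))
```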
Working of SAMWISE
Example outputs
In the example below, SAMWISE begins tracking directly from a natural language prompt, whereas SAM 2 requires box coordinates or masks to initiate tracking.
Performance Improvements
- On the MeViS dataset, which tests motion expression segmentation based on complex natural language descriptions, SAMWISE achieves a J&F score of 48.3, outperforming GroundingDINO+SAM 2 (37.7). GroundingDINO+SAM 2 is a baseline pipeline that approximates SAMWISE's behavior: GroundingDINO converts the natural language prompt into initial box coordinates, which SAM 2 then uses to begin tracking.
- On Ref-YouTube-VOS, a benchmark built around textual queries, SAMWISE scores 67.2 J&F, exceeding other SAM 2-based methods by at least 9.7 points.
The J&F score is an average of two metrics: J (Region Similarity), which evaluates the alignment between the segmented region and the ground truth (higher scores indicate better spatial accuracy), and F (Contour Accuracy), which measures the precision of segmentation boundaries (higher scores reflect sharper, more accurate contours). Together, these metrics provide a comprehensive evaluation of segmentation performance.
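For a single frame, the two components can be approximated as follows; the official DAVIS-style evaluation toolkits implement boundary matching more carefully, so treat this as an illustrative approximation over boolean masks.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def region_similarity(pred, gt):
    """J: intersection-over-union between boolean predicted and ground-truth masks."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 1.0

def boundary_f_measure(pred, gt, tolerance=2):
    """F: F-score between mask boundaries, allowing a small pixel tolerance."""
    def boundary(mask):
        # Edge pixels: mask pixels that touch the (dilated) background.
        return binary_dilation(~mask) & mask

    pred_b, gt_b = boundary(pred), boundary(gt)
    gt_zone = binary_dilation(gt_b, iterations=tolerance)
    pred_zone = binary_dilation(pred_b, iterations=tolerance)
    precision = (pred_b & gt_zone).sum() / max(pred_b.sum(), 1)
    recall = (gt_b & pred_zone).sum() / max(gt_b.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

def j_and_f(pred, gt):
    """J&F for one frame: the mean of region similarity and boundary accuracy."""
    return 0.5 * (region_similarity(pred, gt) + boundary_f_measure(pred, gt))
```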
EfficientTAM
While SAM 2 achieves impressive results in video object segmentation using a powerful, multi-stage image encoder and a memory mechanism to track objects across frames, its high computational complexity makes it unsuitable for real-world tasks, particularly on resource-constrained devices such as mobile phones. EfficientTAM (Efficient Track Anything Model) aims to overcome these challenges by offering a more efficient and lightweight solution. It achieves this through several key mechanisms:
- Vanilla Lightweight Vision Transformer: Uses a simple, non-hierarchical Vision Transformer which reduces computational overhead and processing time.
- Efficient Memory Module:
- Introduces an optimized cross-attention mechanism
- Minimizes computation and memory costs by leveraging the continuity and similarity among tokens in memory (illustrated in the sketch after this list)
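One simple way to exploit that redundancy, in the spirit of EfficientTAM's efficient memory module, is to cross-attend against a spatially pooled (coarse) version of the memory tokens. The sketch below uses plain average pooling to illustrate the idea; it is not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseMemoryCrossAttention(nn.Module):
    """Cross-attention against spatially pooled memory tokens.

    Illustrates exploiting redundancy among memory tokens; not EfficientTAM's
    exact efficient memory module.
    """

    def __init__(self, dim, num_heads=8, pool=4):
        super().__init__()
        self.pool = pool
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries, memory, h, w):
        # memory: (B, h*w, dim) spatial memory tokens from past frames.
        b, _, dim = memory.shape
        mem_2d = memory.transpose(1, 2).reshape(b, dim, h, w)
        # Average-pool neighbouring tokens: nearby memory tokens are highly
        # similar, so a coarse summary loses little information but shrinks
        # the key/value sequence by pool**2.
        coarse = F.avg_pool2d(mem_2d, self.pool).flatten(2).transpose(1, 2)
        out, _ = self.attn(queries, coarse, coarse)
        return out

# Example: 4096 memory tokens (64x64) reduced to 256 keys/values before attention.
layer = CoarseMemoryCrossAttention(dim=256)
y = layer(torch.randn(1, 100, 256), torch.randn(1, 64 * 64, 256), h=64, w=64)
```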
Output Comparisons
In the example below, SAM 2 and EfficientTAM exhibit nearly identical accuracy, even though EfficientTAM runs faster.
Performance Benefits
- On an A100 GPU, EfficientTAM runs at roughly 94.4 FPS versus 47.2 FPS for SAM 2, an approximately 2x speedup, while maintaining comparable accuracy (74.5% vs. 74.7% on the SA-V test dataset).
- Capable of running at about 10 FPS on mobile devices, making it suitable for real-time video object segmentation and tracking.
Choosing the Best SAM 2 Variant
The optimal SAM 2 variant depends on your specific use case:
- SAMURAI: Best for tracking fast-moving objects
- DAM4SAM: Excels in videos with significant distractions
- Video Amodal Segmentation: Optimal for tracking partially occluded objects
- SAM2Long: Specialized for extended video tracking
- SAMWISE: Ideal for natural language-based tracking
- EfficientTAM: Best for mobile devices and speed-critical applications
Conclusion
SAM 2 has enabled numerous video segmentation applications. While it may not be perfect for all scenarios, its variants address specific limitations and enhance reliability across diverse use cases.
For implementation support, join our Discord community or contact us at contact@sievedata.com.