Exploring ways to text prompt SAM 2
SAM 2 can't natively take in text prompts. We discuss various ways to build pipelines around SAM 2 to accomplish text-prompted segmentation.
/blog-assets/authors/lachlan.jpeg
by Lachlan Gray
Cover Image for Exploring ways to text prompt SAM 2

SAM 2 is very impressive. Just point out an object at the start of a video with a few points or a bounding box, and in most cases SAM 2 will accurately track and segment it for the rest of the video with no problem.

But it's not perfect. In this post, we'll tackle one very addressable difficulty: text-based prompting.

Bane of SAM 2: text-to-segment

In real-world use cases, especially large-scale and automated ones, it's infeasible to hand-label object locations for every piece of data. Many of SAM 2's most promising applications involve segmenting a vocabulary of objects from a swath of videos or images.

Some examples include:

  • Data labeling: segment a large dataset of videos and images to train a vision or multimodal model
  • Green screen: segment key objects from the foreground of rough footage
  • Measurements: e.g., estimate the square footage of all buildings from drone or satellite footage

Below is an example of text-prompting SAM 2 with the prompt "duckling" via our (demo) text-to-segment pipeline.

Possible Solutions: Architecture or Pipeline?

There are two broad angles to approaching problems like this: improving the architecture, or improving the pipeline.

An architecture solution involves altering or replacing the SAM 2 model. It could mean extending SAM 2's architecture, as EVF-SAM does by training additional components on top of SAM so that it can ingest text prompts. It could also mean using an entirely different model that handles text prompts natively (at the cost of SAM 2's performance), perhaps a multimodal model like SEEM.

A pipeline solution instead uses a distinct "text-to-location" model to extract a bounding box from a text prompt, then passes the bounding box into SAM 2 along with the image or video for the actual segmentation.

We prefer pipeline solutions for three main reasons. First, they let us enjoy the frequent quality and performance improvements released for individual models. Second, pipelines are flexible: we will surely see more models that can predict points or bounding boxes from a text prompt, like Florence-2, and tethering all of that functionality into one model restricts our choices.

However, the main reason we like pipelines is that they are predictable. In practice, no single model or approach works out of the box. There are always edge cases, and they manifest in different ways across applications. As a rule of thumb, we have found that it is often much harder to isolate problems and incrementally improve "all-in-one" style models.

Execution

At a high level, here's the pipeline we'll make:

  1. pass a video frame or image and text prompt into YOLOv8
  2. get the bounding box
  3. prompt SAM 2 with the bounding box for the image or video

We'll use YOLOv8 to get bounding boxes from text prompts; specifically, the open-vocabulary YOLOv8 World variant ('yolov8l-world' in the code below), which accepts class names as text. We often use YOLOv8 as ol' reliable; it can reliably identify prominent objects such as people, and it can do so quickly. If we needed more precision, we could swap it out for Florence-2, but for most use cases YOLOv8 is just fine, and the speed cost of switching isn't justified.

Before getting into it, let's fast-forward to the final form of the pipeline. We'll step through it afterward. For those who want to run it now, find the demo here and the code here.

import sieve

def segment(file: sieve.File, object_name: str):
    sam = sieve.function.get("sieve/sam2")    # get the SAM 2 endpoint

    if is_video(file):                        # video or image?
        image = get_first_frame(file)
    else:
        image = file

    print("fetching bounding box...")
    box = get_object_bbox(image, object_name)  # YOLOv8 in here

    sam_prompt = {                             # sam prompt:
        "object_id": 1,                        #    id to track the object
        "frame_index": 0,                      #    first frame (if it's a video)
        "box": box                             #    bounding box [x1, y1, x2, y2]
    }

    print("segmenting...")
    sam_out = sam.run(                         # run SAM2
        file=file,
        prompts=[sam_prompt],
        model_type="tiny",
    )

    return sam_out

First we check if the input file is a video or an image, and if it's a video we grab the first frame. We do this with two little helper functions. is_video() checks if the file extension matches a video format, and get_first_frame() uses opencv-python to read the video and dump the first frame into a sieve.File.

import cv2

def is_video(file: sieve.File):
    video_formats = ['mp4', 'avi', 'mov', 'flv', 'wmv', 'webm', 'mkv']

    # case-insensitive check of the file extension
    return file.path.split(".")[-1].lower() in video_formats

def get_first_frame(video: sieve.File):
    cap = cv2.VideoCapture(video.path)
    ret, frame = cap.read()                 # grab the first frame
    cap.release()

    if not ret:
        raise Exception("Failed to read the video; empty or does not exist")

    cv2.imwrite('first_frame.png', frame)   # write it to disk so it can be wrapped as a sieve.File
    return sieve.File(path='first_frame.png')

After that, we get the bounding box of the desired object from the first frame. This is the function we would replace or modify if we wanted to use a different model in place of YOLOv8.

def get_object_bbox(image: sieve.File, object_name: str):
    yolo = sieve.function.get('sieve/yolov8')                # yolo endpoint

    response = yolo.run(                                     # call yolo
        file=image,
        classes=object_name,
        models='yolov8l-world',
    )

    if not response['boxes']:                                      # nothing detected for this prompt
        raise Exception(f"No '{object_name}' detected in the image")

    box = response['boxes'][0]                                     # most confident bounding box
    bounding_box = [box['x1'], box['y1'], box['x2'], box['y2']]    # parse the response into [x1, y1, x2, y2]

    return bounding_box

Once we have the bounding box, we can construct the prompt, submit the job to SAM 2, and return the result. There we have it -- a basic text-to-segmentation pipeline.

Here’s how we used it to get the video from the start of this blog:

import os
import shutil

if __name__ == "__main__":
    video_path = "duckling.mp4"
    text_prompt = "duckling"

    video = sieve.File(path=video_path)
    sam_out = segment(video, text_prompt)

    # zip_to_mp4 is a helper from the linked code that turns the returned masks into an mp4
    mask = zip_to_mp4(sam_out['masks'])

    os.makedirs("outputs", exist_ok=True)
    shutil.move(mask.path, "outputs/output.mp4")

Improvements

This pipeline is very basic, and there are several ways to improve it. Among its current limitations:

  • it tracks only one object
  • it will struggle with scene cuts and cases where the object appears after the first frame
  • it might have trouble with small, numerous, or fast-moving subjects

Tracking multiple categories and multiple objects would be relatively simple. To do that, we would have YOLOv8 predict more bounding boxes, and we would assign each one a separate object_id and prompt. SAM 2 would take it from there.
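
Here's a minimal sketch of what that could look like, reusing the helpers above. The segment_multiple name is ours, and we're assuming the sieve/sam2 endpoint happily accepts several prompts in the same list, which the list-shaped prompts argument above suggests:

def segment_multiple(file: sieve.File, object_names: list):
    sam = sieve.function.get("sieve/sam2")

    image = get_first_frame(file) if is_video(file) else file

    prompts = []
    for i, name in enumerate(object_names):
        box = get_object_bbox(image, name)   # one YOLOv8 call per object name
        prompts.append({
            "object_id": i + 1,              # a distinct id per tracked object
            "frame_index": 0,
            "box": box,
        })

    return sam.run(file=file, prompts=prompts, model_type="tiny")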

Scene cuts are sometimes an issue with SAM 2. It's surprisingly good at tracking an object from different views across scenes, but it's not perfect and can confuse objects. One way to address this is to use PySceneDetect to split the video into scenes and process them separately. We could parallelize that, too!
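
As an illustration, PySceneDetect's quickstart API can find the scene boundaries and split the clips with ffmpeg. This is a sketch rather than something the demo code does:

from scenedetect import detect, ContentDetector, split_video_ffmpeg

video_path = "duckling.mp4"

# detect() returns a list of (start, end) timecodes, one pair per scene
scene_list = detect(video_path, ContentDetector())

# write one clip per scene next to the input, using PySceneDetect's default naming template
split_video_ffmpeg(video_path, scene_list)

Each scene clip could then be passed through segment() on its own, and in parallel.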

Fast-moving and small objects are trickier. One way to help is to prompt SAM 2 more precisely. Bounding boxes are one way to label an object for SAM 2, but for higher precision, we can also mark points of interest (+) and points to exclude (-).

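To make that concrete, here's what point prompts look like when calling SAM 2 directly through Meta's sam2 package rather than the sieve/sam2 endpoint. This is a sketch that assumes a GPU, the sam2 package and its tiny checkpoint installed, and the first_frame.png written by the pipeline above; the coordinates are made up:

import cv2
import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_t.yaml", "./checkpoints/sam2_hiera_tiny.pt"))

# load the frame saved earlier and convert BGR -> RGB for SAM 2
image_rgb = cv2.cvtColor(cv2.imread("first_frame.png"), cv2.COLOR_BGR2RGB)

with torch.inference_mode():
    predictor.set_image(image_rgb)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[450, 210], [300, 400]]),   # (x, y) pixel coordinates
        point_labels=np.array([1, 0]),                     # 1 = point of interest (+), 0 = point to exclude (-)
    )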

As mentioned before, we could also replace YOLOv8 with another model, such as the Florence-2 model, which offers greater zero-shot performance but at a much higher computational cost.
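
For reference, a drop-in replacement for get_object_bbox might look roughly like this using the Hugging Face release of Florence-2. The open-vocabulary detection task string and output keys follow the model card; the function name and taking the first returned box are our assumptions:

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def get_object_bbox_florence(image: sieve.File, object_name: str):
    pil_image = Image.open(image.path)
    task = "<OPEN_VOCABULARY_DETECTION>"

    inputs = processor(text=task + object_name, images=pil_image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(raw, task=task, image_size=(pil_image.width, pil_image.height))

    return parsed[task]["bboxes"][0]   # first detected [x1, y1, x2, y2] box for the prompt

Everything else in the pipeline stays the same; only the bounding-box step changes.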

Conclusion

SAM 2 is an awesome model, and we can extend its capabilities by leveraging other models in a pipeline. In this case, we added simple text prompting to SAM 2 with YOLOv8.

You can see the full code for this demo here, and you can also try running it for yourself here.

Sieve is also the fastest way to run SAM 2 (2x faster), so let us know if you end up building even better pipelines around the model. If you have questions or want to share your use case, please join our Discord community, and for specific questions about this pipeline, email lachlan [at] sievedata [dot] com.