Building a robust ball tracking system for sports with SAM 2
A comprehensive guide to implementing robust ball tracking in sports videos using SAM 2, with practical solutions for handling scene changes, false positives, and dynamic camera movements.
by Dikshant Shah

The Segment Anything Model 2 (SAM 2) is Meta AI's latest state-of-the-art model for video object segmentation and tracking. Building on its predecessor, SAM 2 delivers enhanced frame-by-frame object tracking capabilities, making it particularly effective for sports applications like soccer match visualizations and automated ball tracking highlights. This comprehensive guide explores implementing SAM 2 for robust ball tracking in sports videos, including solutions for common challenges.

Why is ball tracking essential in sports?

Ball tracking technology enables:

  • Automated Highlight Detection: Instantly identify key moments like goals and counter-attacks
  • Mobile-First Content: Generate vertical video highlights optimized for social media
  • Enhanced Visualization: Create 3D replays and trajectory animations
  • Advanced Analytics: Generate ball movement heatmaps and predictive patterns

Ball Segmentation using SAM 2

The SAM 2 model is accessible via the Sieve function sieve/sam2. Its core functionality for object segmentation is driven by prompts.

Prompts

In the context of SAM 2, a prompt is simply a labeled region of interest within an image or video frame, belonging to an object that is to be segmented. At least one prompt is required to initiate the segmentation process, but SAM 2 also accepts multiple prompts from different frames throughout the video. Multiple prompts prove especially useful in scenarios involving camera angle shifts or environmental changes, enabling more accurate and consistent object segmentation.

A prompt can follow either of the two structures shown below:

example_prompts = [
    { # Point-Label Prompt
        "frame_index": 0, # the index of the frame to apply the prompt to
        "object_id": 1, # the id of the object to track
        "points": [[300,200], [200, 300]], # 2d array of x,y points corresponding to labels
        "labels": [1, 0], # labels for each point (1 for positive, 0 for negative)
    },
    { # Bounding Box Prompt
        "frame_index": 50, # the index of the frame to apply the prompt to
        "object_id": 2, # the id of the object to track
        "box": [200, 200, 300, 400], # xmin, ymin, xmax, ymax
    }
    # ... you can add as many prompts as you want!
]

Object Segmentation

For our specific use case, the bounding box coordinates of the ball are used as the region of interest in prompts to segment the ball throughout the video.

import sieve

sam2 = sieve.function.get("sieve/sam2")

file = sieve.File(path = "your_match_video_path")
prompts = [
    {
        "frame_index": 0, # frame number
        "object_id": 1,
        "box": [882, 845, 919, 885] # Manually extracted box coordinates for the ball at frame 0 in the video
    }
]

output = sam2.push(file = file, prompts = prompts)
print(output.result()[0].path)  # Ball segmented video generated by SAM 2

Below are some examples of ball tracking using SAM 2.

Frame By Frame Tracking using SAM 2 Coordinates

The SAM 2 model also outputs tracking coordinates in JSON format, which can be used to build tracking applications. This JSON provides bounding box coordinates on a frame-by-frame basis to track an object. In our case, we use SAM 2 coordinates to track the ball, displaying a bright green tracking circle on each frame where the ball is present. These frames can then be reassembled into a video, with the audio reattached afterward.

Below is an example of ball coordinates provided by SAM 2.

{
  "0": [
    {
      "object_id": 1,
      "frame_index": 0,
      "bbox": [591, 841, 609, 858],
      "timestep": 0
    }
  ],
  "1": [
    {
      "object_id": 1,
      "frame_index": 1,
      "bbox": [589, 838, 608, 856],
      "timestep": 0.03333333333333333
    }
  ],
  ...
  "341": [
    {
      "object_id": 1,
      "frame_index": 341,
      "bbox": [1377, 655, 1401, 672],
      "timestep": 11.366666666666667
    }
  ]
}

Below we have used these coordinates to draw tracking circles.
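
For reference, here is a minimal sketch of how these coordinates can be turned into an overlay with OpenCV. This is not the exact code behind the videos shown here: the coordinates file name, circle styling, and output path are our own choices, and since OpenCV writes video without sound, the original audio must be reattached afterward (for example with ffmpeg).

import cv2
import json

# A minimal sketch: draw a bright green circle at the center of each SAM 2 bounding box.
def draw_tracking_circles(video_path, coordinates_path, output_path):
    with open(coordinates_path) as f:
        coordinates = json.load(f)  # {"<frame_index>": [{"bbox": [x1, y1, x2, y2], ...}], ...}

    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))

    frame_index = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        for detection in coordinates.get(str(frame_index), []):
            x1, y1, x2, y2 = detection["bbox"]
            center = ((x1 + x2) // 2, (y1 + y2) // 2)
            radius = max(x2 - x1, y2 - y1) // 2 + 10  # a little padding around the ball
            cv2.circle(frame, center, radius, (0, 255, 0), 3)  # bright green tracking circle
        writer.write(frame)
        frame_index += 1

    cap.release()
    writer.release()

draw_tracking_circles("your_match_video_path", "sam2_coordinates.json", "tracked_output.mp4")  # hypothetical paths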

Limitations of SAM 2 Ball Tracking

While powerful, SAM 2 has several key limitations to consider:

  • Manual Prompting: SAM 2 requires you to manually detect and supply at least one set of bounding box coordinates for the ball from a video frame as an initial prompt before tracking can begin.
  • Increased Effort for Improved Accuracy: Because a single prompt is often insufficient for reliable tracking, SAM 2 supports multiple prompts. However, the improved accuracy comes at the cost of added manual effort.
  • Tracking Errors: SAM 2 may occasionally generate excessively large or inaccurate bounding boxes for the tracked ball, as well as exhibit irregular tracking, such as missing the ball’s movement in certain frames.
  • Failure Despite Correct Prompts: SAM 2 may still fail, even with accurate initial bounding boxes, especially when dealing with blurry or indistinct objects.

Examples of Failures

The video below shows the ball being tracked irregularly.

Below is a different object being misidentified as the ball.

Addressing the Limitations of SAM 2

There are several ways to improve SAM 2's capabilities for better ball tracking in videos. Let’s explore each of these approaches.

Automated Prompt Generation

We will still use the SAM 2 model to track the match ball, but generation of the initial prompt is automated using the YOLOv8 object detection model. YOLOv8 is a family of models covering a range of vision tasks; the yolov8l-world variant is specifically tailored for zero-shot, open-vocabulary object detection, allowing the detection of any object from a simple text description. This makes it well-suited for identifying the match ball's bounding boxes throughout the video.

import sieve

yolov8 = sieve.function.get("sieve/yolov8")

file = sieve.File(path = "your_match_video_path")

output = yolov8.push(
    file,
    classes = "sports ball",
    confidence_threshold = 0.1,
    models = "yolov8l-world",
    max_num_boxes = 1
)

yolo_coordinates = output.result() # All coordinates for ball detected in the video

From the YOLOv8 coordinates, we use the bounding box from the first frame in which the ball appears as the initial prompt for SAM 2.

sam2 = sieve.function.get("sieve/sam2")

# Extracts the first bounding box coordinates of the ball
def extract_first_ball_coordinates(yolo_coordinates):
    THRESHOLD_WIDTH = 300
    THRESHOLD_HEIGHT = 300

    for coordinate in yolo_coordinates:
        if coordinate["boxes"]:
            x1, y1, x2, y2 = (
                coordinate["boxes"][0]["x1"],
                coordinate["boxes"][0]["y1"],
                coordinate["boxes"][0]["x2"],
                coordinate["boxes"][0]["y2"],
            )
            width = x2 - x1
            height = y2 - y1

            if width >= THRESHOLD_WIDTH or height >= THRESHOLD_HEIGHT:
                print(f"Box dimensions too large to be a ball: {width}x{height}")
                continue
            return coordinate["boxes"][0], coordinate["frame_number"]
    return {}, -1

# Converts YOLO box coordinates into a SAM 2 prompt
def create_sam_prompt(first_ball_coordinates, frame_index):
    return {
        "frame_index": frame_index,
        "object_id": 1,
        "box": [
            first_ball_coordinates['x1'],
            first_ball_coordinates['y1'],
            first_ball_coordinates['x2'],
            first_ball_coordinates['y2'],
        ]
    }

first_ball_coordinates, frame_index = extract_first_ball_coordinates(yolo_coordinates)

if frame_index == -1:
    print("No Ball detected")
else:
    prompts = [create_sam_prompt(first_ball_coordinates, frame_index)]

Prompt Validation

YOLOv8 struggles to reliably detect the ball, especially due to its small size or motion blur, and often produces false bounding box coordinates by misidentifying other objects as the ball. This makes it challenging to determine a correct initial prompt for SAM 2. To address this issue, we incorporate Visual QA through the Sieve function sieve/visual-qa to verify that the received YOLOv8 coordinates actually correspond to a ball before using them as an initial prompt.

import cv2

visual_qa = sieve.function.get("sieve/visual-qa")

# Validates whether the image or scene contains a ball
def detect_sports_ball(file, start_time, end_time, x1, y1, x2, y2):
    backend = "gemini-1.5-flash"
    prompt = "Is there a sports ball in the scene?"
    fps = 1
    audio_context = False
    function_json = {
        "type": "object",
        "properties": {
            "isSportsBallInScene": {
                "type": "boolean",
                "description": "Indicates whether a sports ball is in the scene"
            }
        }
    }

    crop_coordinates = f"{x1}, {y1}, {x2}, {y2}"
    output = visual_qa.push(file, backend, prompt, fps, audio_context, function_json, start_time, end_time, crop_coordinates)
    return output.result()

# Converts video into frames
def extract_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise ValueError("Unable to open video file.")

    frames = []
    frame_index = 0

    while True:
        ret, frame = cap.read()

        if not ret:
            if frame_index == 0:
                raise ValueError("Failed to read the first frame. The video might be corrupted.")
            else:
                print(f"End of video reached at frame {frame_index}.")
            break

        frames.append(frame)
        frame_index += 1

    cap.release()
    return frames

# Saves a video frame to disk as a JPEG and returns its path
def get_frame_as_jpeg(frames, frame_number):
    if frame_number < 0 or frame_number >= len(frames):
        raise ValueError(f"Frame number {frame_number} is out of range. Total frames: {len(frames)}")
    frame = frames[frame_number]

    if frame is None:
        raise ValueError(f"Frame {frame_number} is empty or corrupted.")

    output_path = f"frame_{frame_number}.jpg"
    success = cv2.imwrite(output_path, frame)
    if not success:
        raise ValueError(f"Failed to save frame {frame_number} as JPEG.")

    return output_path

# Extracts the first validated bounding box coordinates of the ball
def extract_first_ball_coordinates(yolo_coordinates, frames):
    THRESHOLD_WIDTH = 300
    THRESHOLD_HEIGHT = 300

    for coordinate in yolo_coordinates:
        if coordinate["boxes"]:
            x1, y1, x2, y2 = (
                coordinate["boxes"][0]["x1"],
                coordinate["boxes"][0]["y1"],
                coordinate["boxes"][0]["x2"],
                coordinate["boxes"][0]["y2"],
            )

            width = x2 - x1
            height = y2 - y1

            if width >= THRESHOLD_WIDTH or height >= THRESHOLD_HEIGHT:
                print(f"Box dimensions too large to be a ball: {width}x{height}")
                continue

            try:
                img = get_frame_as_jpeg(frames, coordinate["frame_number"])
            except ValueError as e:
                print(f"Error in frame: {e}")
                return {}, -1

            vqa_img_result = detect_sports_ball(
                file = sieve.File(path = img),
                start_time = 0,
                end_time = -1,
                x1 = x1,
                y1 = y1,
                x2 = x2,
                y2 = y2,
            )

            # Box coordinates validation for the ball
            if not vqa_img_result["isSportsBallInScene"]:
                print("Coordinate rejected for not having a ball", coordinate)
                continue

            return coordinate["boxes"][0], coordinate["frame_number"]

    return {}, -1

frames = extract_frames(file.path)
first_ball_coordinates, frame_index = extract_first_ball_coordinates(yolo_coordinates, frames)

if frame_index == -1:
    print("No Ball detected")
else:
    prompts = [create_sam_prompt(first_ball_coordinates, frame_index)]

The video on the left shows the ball being tracked before validation, while the video on the right shows tracking after validation. In the left video, notice that a person's fist in the crowd is being tracked instead of the ball.

Multiple Prompts

A single initial prompt coordinate may not be suitable for different scenes in a match due to factors such as lighting changes and camera angles. To address this, SAM 2 supports the use of multiple prompts to track the ball effectively. To automate this process, we use PySceneDetect to divide the video into multiple scenes. It is accessible through the Sieve function sieve/pyscenedetect. For each scene, a scene-specific prompt is detected and used as one of the multiple prompts for generating ball tracking coordinates using SAM 2.

pyscenedetect = sieve.function.get("sieve/pyscenedetect")

# Represents a full video as a scene
def convert_video_to_scene(video_path):
    cap = cv2.VideoCapture(video_path)

    if not cap.isOpened():
        raise ValueError("Error: Could not open video.")

    fps = cap.get(cv2.CAP_PROP_FPS)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    start_frame = 0
    end_frame = total_frames - 1
    start_seconds = start_frame / fps
    end_seconds = end_frame / fps

    cap.release()
    return {
        "start_seconds": start_seconds,
        "end_seconds": end_seconds,
        "scene_number": 1,
        "start_frame": start_frame,
        "end_frame": end_frame
    }

# Extracts all the scenes from a video
def get_scenes(video, adaptive_threshold, threshold):
    output = pyscenedetect.push(video=video, adaptive_threshold=adaptive_threshold, threshold=threshold)
    print('Scene detection started')

    scenes = []
    for output_object in output.result():
        scenes.append(output_object)
    if not scenes:
        print("Video too short: PySceneDetect couldn't generate multiple scenes")
        scenes = [convert_video_to_scene(video.path)]
    return scenes

adaptive_threshold = True
threshold = 27 # Threshold the average change in pixel intensity must exceed to trigger a cut. Only used if adaptive_threshold is False.

scenes = get_scenes(file, adaptive_threshold, threshold)

Now we process each scene individually to generate a scene-specific prompt. We then collect these initial prompts and pass them to SAM 2 as multiple prompts to track the ball.

# Extracts the YOLO coordinates for a specific scene
def filter_yolo_coordinates(yolo_coordinates, start_frame, end_frame):
    return [
        frame_data
        for frame_data in yolo_coordinates
        if start_frame <= frame_data["frame_number"] <= end_frame
    ]

# Generates prompts for segmenting the ball in the video
def get_initial_prompts(file):

    yolo_output = yolov8.push(
        file,
        classes = "sports ball",
        confidence_threshold = 0.1,
        models = "yolov8l-world",
        max_num_boxes = 1
    )

    print("Extracting ball coordinates using YOLO...")
    yolo_coordinates = yolo_output.result()
    print("Ball YOLOv8 coordinates extracted")


    prompts = []
    frames = extract_frames(file.path)
    scenes = get_scenes(file, adaptive_threshold = True, threshold = 27)

    for scene in scenes:
        # Checking if ball is present in the scene
        vqa_result = detect_sports_ball(
            file,
            scene['start_seconds'],
            scene['end_seconds'],
            -1, -1, -1, -1
        )

        if not vqa_result['isSportsBallInScene']:
            print(f"No ball detected in the video between {scene['start_seconds']} and {scene['end_seconds']}")
            continue

        yolo_coordinates_for_scene = filter_yolo_coordinates(yolo_coordinates, scene['start_frame'], scene['end_frame'])
        first_ball_coordinates, frame_index = extract_first_ball_coordinates(yolo_coordinates_for_scene, frames)

        if frame_index == -1:
            print(f"No ball detected in the video between {scene['start_seconds']} and {scene['end_seconds']}")
            continue

        prompt = create_sam_prompt(first_ball_coordinates, frame_index)
        prompts.append(prompt)

    return prompts

file = sieve.File(path = "your_match_video_path")
prompts = get_initial_prompts(file) # Generated Multi Prompts for SAM 2
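
With the generated prompts in hand, the final SAM 2 call is identical to the single-prompt example from earlier; the prompt list now simply carries one entry per detected scene.

output = sam2.push(file = file, prompts = prompts)
print(output.result()[0].path)  # Ball segmented video generated by SAM 2 using multiple prompts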

Let's compare the results of tracking the ball with a single prompt versus multiple prompts. The videos on the left use a single prompt, while the videos on the right use multiple prompts.

In the first video, you'll notice a delay in tracking the ball during scene changes with a single prompt, while there is no delay with multiple prompts.

In the second video, you'll notice that tracking fails during certain scene changes with a single prompt, while it doesn't fail with multiple prompts.

Using Florence 2

Florence 2 is an advanced vision-language model developed by Microsoft that excels at integrating text and visual data to perform tasks such as image captioning, object detection, and object segmentation. It is accessible through the Sieve function sieve/florence-2, where the model has been extended to support video, making it a viable alternative to YOLOv8. While both models often produce similar results and errors, Florence 2 is overall the better performer for ball detection. Below is an example where Florence 2 produces better results than YOLOv8.
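
We haven't shown the exact interface of sieve/florence-2 in this post. As a rough sketch, swapping it in for YOLOv8 might look like the following, where the task and text parameters are assumptions modeled on Florence 2's open-vocabulary detection task; consult the function's page on Sieve for the actual signature.

import sieve

florence2 = sieve.function.get("sieve/florence-2")

file = sieve.File(path = "your_match_video_path")

# Hypothetical parameters: the real sieve/florence-2 interface may differ,
# so treat this as a sketch of open-vocabulary detection rather than exact usage.
output = florence2.push(
    file,
    task_prompt = "<OPEN_VOCABULARY_DETECTION>",  # Florence 2's open-vocabulary detection task token
    text_input = "sports ball",                   # description of the object to detect
)

florence_coordinates = output.result()  # Ball coordinates, analogous to yolo_coordinates above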

Further Improvements

The current system works best when:

  • The ball is highly visible and clear
  • The match is filmed from a wide view, unobstructed by the crowd
  • The camera angles are stable and consistent
  • The pitch is free of colorful ads and logos

The current system has the most difficulty when:

  • The ball is obscured or too small for accurate tracking
  • The ball has insufficient contrast or an indistinct color
  • The pitch carries colorful brand logos
  • The video quality is poor
  • Ball-like logos are visible in the video
  • The camera angles are unstable or unfavorable

Below we list some of the ways to improve the ball tracking system.

  • Dynamic thresholding using PySceneDetect: Thresholding parameters should be tuned per video; better scene segmentation yields better prompts for each scene, and therefore better overall results (a simple sketch follows this list).
  • Improving the Visual QA validations for prompts: To minimize the risk of misidentified ball objects being utilized by SAM 2 for generating tracking coordinates, apply multiple layers of validation using Visual QA prompts. This ensures more accurate inputs, enhancing the reliability of SAM 2 tracking.
  • Fine-tuning: Fine-tune YOLOv8, Florence 2, and SAM 2 on match data to increase their accuracy.
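
Building on the get_scenes helper from earlier, below is a minimal sketch of the dynamic thresholding idea; the retry step size, threshold floor, and minimum scene count are arbitrary assumptions, not tuned values.

# Start with adaptive detection, then retry with progressively lower fixed
# thresholds if too few scenes are found for the video.
def get_scenes_dynamic(video, min_scenes = 2):
    scenes = get_scenes(video, adaptive_threshold = True, threshold = 27)

    threshold = 27
    while len(scenes) < min_scenes and threshold > 9:
        threshold -= 6  # relax the cut sensitivity and try again
        scenes = get_scenes(video, adaptive_threshold = False, threshold = threshold)

    return scenes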

Overview of the Sieve Functions Used

Task                                                     Function
Detects the ball coordinates for the initial prompt     sieve/yolov8
Detects the ball coordinates for the initial prompt     sieve/florence-2
Provides frame-by-frame coordinates for ball tracking   sieve/sam2
Validates the initial prompt coordinates                sieve/visual-qa
Segments the video based on scene changes               sieve/pyscenedetect

Conclusion

This guide tackled significant challenges such as scene changes, ineffective prompting techniques, and inaccuracies in ball tracking with the SAM 2 model. By integrating enhanced validation and scene segmentation techniques, combining tools like YOLOv8, Florence 2, Visual QA, and PySceneDetect alongside SAM 2, these challenges can be addressed to a considerable extent. However, there remains ample scope for further improvement.

Looking to refine ball tracking in your project? Join our Discord community, or email us at contact@sievedata.com for professional support.