Sapiens

Sapiens is a collection of models released by Meta focused on human-centric vision tasks.

Repo: https://github.com/facebookresearch/sapiens

Paper: https://arxiv.org/abs/2408.12569

We host the following Sapiens model types. Choose one by setting the mode parameter to one of the following:

  • normal: Surface Normal Estimation
  • segmentation: Body Part Segmentation
  • depth: Depth Estimation

Each model is available in two sizes, 0.3b and 1b. Choose one by setting the model_size parameter to one of the following:

  • 0.3b: faster, lower quality
  • 1b: slower, higher quality

We found results are best when the model is applied to an image/video with a single person in-frame, where the person's body takes up the majority of the image. It also helps for the video/image to have a simple background.

The endpoint accepts videos of any length, but be aware that the Sapiens model is expensive to run. We advise trying out shorter clips before submitting longer ones (2+ minutes).

See Examples for a walkthrough on using each of the Sapiens models.

Pricing

This function runs on an A100 40GB GPU and is billed based on our compute-based pricing rates at $4.20/hr.
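
As a rough worked example of the rate above, the sketch below converts a runtime into an estimated cost. It assumes billing scales linearly with runtime in seconds; the actual billing granularity is not stated here, and estimate_cost is a hypothetical helper, not part of the sieve package.

```python
A100_40GB_RATE_PER_HOUR = 4.20  # USD per hour, taken from the pricing note above


def estimate_cost(runtime_seconds: float,
                  rate_per_hour: float = A100_40GB_RATE_PER_HOUR) -> float:
    """Estimate the cost in USD of a job, assuming linear per-second billing."""
    if runtime_seconds < 0:
        raise ValueError("runtime_seconds must be non-negative")
    return runtime_seconds / 3600.0 * rate_per_hour


if __name__ == "__main__":
    # A job that keeps the GPU busy for 10 minutes.
    print(f"${estimate_cost(600):.2f}")
```

Under that assumption, a job that runs for 10 minutes costs about 600 / 3600 × $4.20 = $0.70.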

Output

This endpoint supports both video and image inputs.

Each model returns the following items:

  • raw output: The raw output of the model. (numpy .npy file)
  • visualization: A visualization of the output. (.png image)

If the raw_output parameter is set to true, only the raw output will be returned. This reduces costs, because creating visualizations for depth estimation and surface normal estimation also requires running the segmentation model.

Video Outputs

Frame-level video outputs are returned as a dictionary of zipfiles, one for each output type. Each zipfile contains the output for each frame in the video, labeled by a frame number (e.g. the visualization for the 3rd frame will have the file name 000002.png).
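
The frame naming scheme can be sketched as follows. This is a minimal example, assuming frame numbers are zero-based and zero-padded to six digits (as in 000002.png) and that raw frames in the zipfile use the same naming with a .npy extension, which is an assumption. frame_name and load_raw_frames are hypothetical helpers, not part of the sieve package.

```python
import io
import zipfile

import numpy as np


def frame_name(frame_index: int, extension: str) -> str:
    """File name for a zero-based frame index, e.g. (2, "png") -> "000002.png"."""
    return f"{frame_index:06d}.{extension}"


def load_raw_frames(zip_source) -> dict:
    """Load every numbered .npy file in a zip into {frame_index: array}.

    zip_source may be a file path or a binary file-like object.
    """
    frames = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.endswith(".npy"):
                continue
            # Drop any directory prefix and the extension to get the frame number.
            stem = name.rsplit("/", 1)[-1][: -len(".npy")]
            if not stem.isdigit():
                continue
            with archive.open(name) as member:
                frames[int(stem)] = np.load(io.BytesIO(member.read()))
    return frames
```

With the outputs from a video run, this could be called as load_raw_frames(output_dict["raw"].path) and the frames processed in sorted key order.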

For video inputs, setting the make_video parameter to true will return a visualization of the segmentation overlaid onto the original video in addition to the other outputs. Setting raw_output=True overrides make_video: no visualization video will be returned.

When a video is returned, the output format is video, dict[str, zipfile]. Otherwise, the output format is dict[str, file].

Image Outputs

The output format for a single image is raw_output if raw_output=True; otherwise it is visualization, raw_output.

For more information on the output formats for each model, see Models.

Models

Body Part Segmentation

This mode segments the body parts of a person in an image. The classes and their corresponding ids can be found here: classes

The following items are returned:

  • raw: The class ids for each pixel in a frame.
  • visualization: A visualization of the segmentation overlaid onto the original image.
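
Because the raw segmentation output holds one class id per pixel, a per-class mask can be derived with plain NumPy. The sketch below assumes the raw array is a 2D integer array of shape (height, width); the class ids used in the usage note are placeholders, so look up the real ids in the classes list. mask_for_classes and class_pixel_counts are hypothetical helpers.

```python
import numpy as np


def mask_for_classes(class_map: np.ndarray, class_ids) -> np.ndarray:
    """Boolean mask that is True where a pixel's class id is in class_ids."""
    return np.isin(class_map, list(class_ids))


def class_pixel_counts(class_map: np.ndarray) -> dict:
    """Number of pixels assigned to each class id present in the map."""
    ids, counts = np.unique(class_map, return_counts=True)
    return {int(i): int(c) for i, c in zip(ids, counts)}
```

For example, mask_for_classes(raw_output, [3, 5]) selects every pixel labeled with the (placeholder) ids 3 or 5, and the mask can then be used to index the original image.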

Depth Estimation

This mode estimates the depth of each pixel belonging to a person in an image.

The following items are returned:

  • raw output: The estimated depth for each pixel in a frame.
  • depth visualization: A normalized visualization of the depth map.
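
If only the raw depth is requested, a normalized grayscale image can be produced locally. The sketch below uses min-max normalization over valid pixels, which is one common choice and not necessarily the scheme the endpoint uses. It assumes the raw output is a 2D float array and that pixels outside the person, if marked at all, are non-finite (NaN or inf); both are assumptions. normalize_depth is a hypothetical helper.

```python
import numpy as np


def normalize_depth(depth) -> np.ndarray:
    """Min-max normalize finite depth values to uint8 in [0, 255].

    Non-finite pixels (assumed here to mark the background) are set to 0.
    """
    depth = np.asarray(depth, dtype=np.float32)
    valid = np.isfinite(depth)
    out = np.zeros(depth.shape, dtype=np.uint8)
    if not valid.any():
        return out
    lo = depth[valid].min()
    hi = depth[valid].max()
    if hi == lo:
        # A flat depth map has no range to normalize; leave it at 0.
        return out
    scaled = (depth[valid] - lo) / (hi - lo)
    out[valid] = np.round(scaled * 255).astype(np.uint8)
    return out
```

The result can be saved with any image library or passed through a colormap for display.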

Surface Normal Estimation

This mode estimates the surface normal of each pixel belonging to a person in an image.

The following items are returned:

  • raw output: The surface normal map for each pixel in a frame.
  • surface normal visualization: A visualization of the surface normal map.
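
A raw normal map can be turned into an image with the conventional mapping from components in [-1, 1] to RGB in [0, 255]. The sketch below assumes the raw output has shape (height, width, 3) with one 3D vector per pixel, and that background pixels are zero or non-finite; these are assumptions, and the endpoint's own visualization may differ. normals_to_rgb is a hypothetical helper.

```python
import numpy as np


def normals_to_rgb(normals) -> np.ndarray:
    """Map per-pixel normal vectors to uint8 RGB using (n + 1) / 2 * 255.

    Vectors are re-normalized to unit length first; zero-length or
    non-finite vectors (assumed background) become black.
    """
    normals = np.asarray(normals, dtype=np.float32)
    if normals.ndim != 3 or normals.shape[-1] != 3:
        raise ValueError("expected an array of shape (height, width, 3)")
    length = np.linalg.norm(normals, axis=-1, keepdims=True)
    valid = np.isfinite(length) & (length > 1e-6)
    # Divide by 1 where the vector is invalid to avoid division by zero.
    unit = np.where(valid, normals / np.where(valid, length, 1.0), 0.0)
    rgb = np.where(valid, (unit + 1.0) / 2.0 * 255.0, 0.0)
    return np.round(rgb).astype(np.uint8)
```

With this mapping, a normal pointing along the positive z axis, (0, 0, 1), appears as the light blue typical of normal maps.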

Examples

# make sure you've installed the sieve package with `pip install sievedata`
# and set your api key with `export SIEVE_API_KEY=<your_api_key>`
# or use
# os.environ["SIEVE_API_KEY"] = "<your_api_key>"

import numpy as np
import sieve

# get the sapiens sieve function
sapiens_fn = sieve.function.get("sieve/sapiens")

# choose one of the following modes:
# mode = "segmentation"
# mode = "depth"
# mode = "normal"
mode = "segmentation"

# choose one of the following model sizes:
# model_size = "0.3b"
# model_size = "1b"
model_size = "1b"

# segment a single image
visualization, raw_output = sapiens_fn.run(sieve.File(path="path/to/image.png"), mode=mode, model_size=model_size)

# load raw output as numpy array
raw_output = np.load(raw_output.path)

# segment a video, return a visualization video
visualization_video, output_dict = sapiens_fn.run(sieve.File(path="path/to/video.mp4"), mode=mode, model_size=model_size, make_video=True)

# path to output zip files containing frame-level outputs
path_to_raw_output_zip = output_dict["raw"].path
path_to_frame_visualization_zip = output_dict["visualization"].path

License

Please read the license for Sapiens here.