Autocrop

This app takes an input video and automatically crops it to a specified aspect ratio based on smart subject tracking, speaker detection, and more.

Autocrop currently works best for videos of people speaking.

  • ✅ Podcasts
  • ✅ Commentaries
  • ✅ Product Reviews
  • ✅ Educational Videos
  • ✅ Single Speaker Speeches
  • ❌ Crowds of People or Busy Background Scenes
  • ❌ Sports Games
  • ❌ Vlogs
  • ❌ Music Videos
  • ❌ Online Gaming

For pricing, see the Pricing section below.

For further notes, see the Notes section below.

Key Features

  • Subject Tracking: The app tracks the subjects in the video (currently limited to people) and crops the video to keep them in frame.
  • Speaker Detection: The app can detect who is speaking in the video, using active speaker detection via Talknet-ASD, and crop the video to focus on them.
  • Dynamic Layout: The app can dynamically choose between layouts of 1, 2, 3, or 4 subjects at a time.
  • Automatic Aspect Ratio: The app can automatically crop the video to a specified aspect ratio.
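
To make the aspect-ratio cropping idea concrete, here is a minimal sketch of how a crop box with a target aspect ratio could be placed around a subject. It is not the app's actual algorithm: the function name `crop_box_for_aspect` and the choice to center horizontally on a single `center_x` value are assumptions made for illustration.

```python
def crop_box_for_aspect(frame_w, frame_h, aspect_w, aspect_h, center_x):
    """Return (x1, y1, x2, y2) for the largest aspect_w:aspect_h box in the frame.

    Hypothetical helper: the box is centered horizontally on center_x when
    possible and shifted back inside the frame when it would overflow.
    """
    # Compare ratios with integer cross-multiplication to avoid float rounding.
    if frame_w * aspect_h > frame_h * aspect_w:
        # Frame is wider than the target: keep full height, trim the width.
        crop_h = frame_h
        crop_w = frame_h * aspect_w // aspect_h
    else:
        # Frame is taller than (or equal to) the target: keep full width.
        crop_w = frame_w
        crop_h = frame_w * aspect_h // aspect_w

    # Center on the subject, then clamp so the box stays inside the frame.
    x1 = center_x - crop_w // 2
    x1 = max(0, min(x1, frame_w - crop_w))
    y1 = (frame_h - crop_h) // 2
    return x1, y1, x1 + crop_w, y1 + crop_h


# A 9:16 crop of a 1920x1080 frame, centered on a subject at x=960.
print(crop_box_for_aspect(1920, 1080, 9, 16, 960))
```

A subject near the frame edge gets a box pushed back inside the frame rather than one that runs past it.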

Pricing

We price per minute of video. We bucket pricing into standard definition (<=720p), high definition (<=1080p), and 4k (>1080p) videos. We've listed the pay-as-you-go rates below.

Additionally, there's a small compute-based fee for running object detection with YOLOv8 or MediaPipe. It is small relative to the rest of the processing, so we've not listed it here. If you increase the FPS of object detection, the compute price of the object detection algorithms will go up.

If you render the video, we charge an additional fee depending on the resolution of the video. The costs are listed below.

The fee for Talknet-ASD is baked into the default pricing below. We highly recommend using active speaker detection for any scenario involving a person.

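| Resolution           | Price / Minute | Price without Rendering / Minute | Price without ASD / Minute |
|----------------------|----------------|----------------------------------|----------------------------|
| > 1080p (4k)         | $0.169         | $0.13                            | $0.039                     |
| > 720p (up to 1080p) | $0.1105        | $0.078                           | $0.0325                    |
| ≤ 720p               | $0.0676        | $0.052                           | $0.0156                    |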

Note: If your video is poorly encoded, we will re-encode it for you, since it would otherwise make the pipeline prohibitively slow. Re-encoding is charged at $0.01 per compute minute.
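
As a rough aid for budgeting, the listed per-minute rates can be turned into a small estimator. This is a sketch under stated assumptions: resolution buckets are decided by frame height, the "without ASD" column is taken to include rendering, and the object detection compute fee and any re-encoding fee are left out. The names `RATES`, `resolution_bucket`, and `estimate_cost` are hypothetical.

```python
# Per-minute pay-as-you-go rates (USD), copied from the pricing table.
RATES = {
    "4k":    {"default": 0.169,  "no_render": 0.13,  "no_asd": 0.039},
    "1080p": {"default": 0.1105, "no_render": 0.078, "no_asd": 0.0325},
    "720p":  {"default": 0.0676, "no_render": 0.052, "no_asd": 0.0156},
}


def resolution_bucket(frame_height: int) -> str:
    """Map a frame height to a pricing bucket (assumption: height decides)."""
    if frame_height <= 720:
        return "720p"
    if frame_height <= 1080:
        return "1080p"
    return "4k"


def estimate_cost(minutes: float, frame_height: int,
                  render: bool = True, asd: bool = True) -> float:
    """Estimate the listed per-minute charge, excluding compute-based fees."""
    if render and asd:
        column = "default"
    elif asd:
        column = "no_render"
    elif render:
        column = "no_asd"
    else:
        # The table has no column for this combination, so refuse to guess.
        raise ValueError("no listed rate for render=False with asd=False")
    return minutes * RATES[resolution_bucket(frame_height)][column]


# A 10-minute 1080p video with rendering and active speaker detection.
print(f"${estimate_cost(10, 1080):.4f}")
```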

Notes

Usage

  • file: The input video file.
  • active_speaker_detection: Whether to use active speaker detection to crop the video to the active speaker.
  • aspect_ratio: The aspect ratio to crop the video to.
  • return_video: Whether to return the resulting video or just the metadata required to crop the video.
  • include_subjects: Whether to include information about the subjects (people, faces, etc.) of the crops in the metadata. Defaults to False.
  • include_non_active_layouts: Whether to also include, in the output metadata, the layouts that would be produced with speaker detection off. This is useful if you want the option to switch between active speaker and non-active speaker layouts without having to reprocess the video. Defaults to False and is only used if active_speaker_detection is True and return_video is False.
  • prompt: Experimental feature. Currently limited to the classes listed at the bottom of this README. Soon to support any natural language prompt. Defaults to "person".
  • min_scene_length: Cropping is computed on a scene-by-scene basis. min_scene_length sets the minimum length of a scene in seconds. Short scenes will be merged with succeeding scenes until the minimum length is reached. Defaults to 0.0 (no merging or minimum scene length).
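
The merging behavior described for min_scene_length can be sketched as follows. This is an illustration of the stated rule, not the app's implementation: scenes are represented as hypothetical (start_seconds, end_seconds) tuples, and a short final scene is kept as-is because there is no succeeding scene to merge it with (an assumption).

```python
def merge_short_scenes(scenes, min_scene_length):
    """Merge each short scene into the scenes that follow it.

    scenes: list of (start_seconds, end_seconds) tuples in playback order.
    Returns a new list in which every scene, except possibly the last,
    lasts at least min_scene_length seconds.
    """
    merged = []
    current = None  # the scene being accumulated, as [start, end]
    for start, end in scenes:
        if current is None:
            current = [start, end]
        else:
            # Extend the accumulated scene to cover this one as well.
            current[1] = end
        if current[1] - current[0] >= min_scene_length:
            merged.append((current[0], current[1]))
            current = None
    if current is not None:
        # Trailing short scene: nothing follows it, so keep it unchanged.
        merged.append((current[0], current[1]))
    return merged


scenes = [(0.0, 1.0), (1.0, 1.5), (1.5, 5.0), (5.0, 5.5)]
print(merge_short_scenes(scenes, 2.0))
```

With the default of 0.0, every scene already meets the minimum, so the list is returned unchanged.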

Metadata Output Format

When return_video is set to false, the app will return the following metadata in JSON format per frame as it processes the video:

  • frame_number: The frame number of the video.
  • frame_width: The width of the video frame.
  • frame_height: The height of the video frame.
  • crops: An array of crop objects, each containing:
    • x1: The x-coordinate of the top left corner of the crop.
    • y1: The y-coordinate of the top left corner of the crop.
    • x2: The x-coordinate of the bottom right corner of the crop.
    • y2: The y-coordinate of the bottom right corner of the crop.
    • apply_letterbox: A boolean value indicating whether to apply a black border around the frame. This is used when it's decided that it's best not to crop in certain scenarios and instead show the whole frame with borders.

The crops are keyed by the aspect ratio of the crop. For example, if the aspect ratio is 9:16, the crop would be keyed by "9:16". If you turn on active speaker detection, the key would be "9:16-active-speaker" instead.
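
The sketch below shows one way to read a per-frame payload. The exact nesting is an assumption: it treats crops as a mapping from the aspect-ratio key to a list of crop objects, which is one reading of the description above, and the sample values are made up for illustration. The helper names `crop_key` and `crop_sizes` are hypothetical.

```python
import json


def crop_key(aspect_ratio: str, active_speaker_detection: bool) -> str:
    """Build the key the crops are stored under, e.g. '9:16-active-speaker'."""
    if active_speaker_detection:
        return f"{aspect_ratio}-active-speaker"
    return aspect_ratio


def crop_sizes(frame: dict, key: str) -> list:
    """Return (width, height, apply_letterbox) for each crop under key."""
    sizes = []
    # Assumption: frame["crops"] maps each key to a list of crop objects.
    for crop in frame["crops"].get(key, []):
        width = crop["x2"] - crop["x1"]
        height = crop["y2"] - crop["y1"]
        sizes.append((width, height, crop.get("apply_letterbox", False)))
    return sizes


# Made-up sample payload for a single frame.
sample = json.loads("""
{
  "frame_number": 0,
  "frame_width": 1920,
  "frame_height": 1080,
  "crops": {
    "9:16": [
      {"x1": 656, "y1": 0, "x2": 1264, "y2": 1080, "apply_letterbox": false}
    ]
  }
}
""")

print(crop_sizes(sample, crop_key("9:16", False)))
```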

If return_scene_data is set to true and return_video is set to false, the app will return an additional payload per frame with the data of the scene. The format of this data is as follows:

  • scene: A JSON object of scene information, including:
    • start_seconds: The start time of the scene in seconds.
    • end_seconds: The end time of the scene in seconds.
    • start_frame: The starting frame number of the scene.
    • end_frame: The ending frame number of the scene.
    • start_timecode: The start time of the scene in timecode format (HH:MM:SS.sss).
    • end_timecode: The end time of the scene in timecode format (HH:MM:SS.sss).
    • scene_number: The scene number.
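
For reference, the relationship between the seconds fields and the HH:MM:SS.sss timecode fields can be sketched with a small converter. The function name `seconds_to_timecode` is hypothetical, and rounding to the nearest millisecond is an assumption about how the timecodes are produced.

```python
def seconds_to_timecode(seconds: float) -> str:
    """Format a non-negative time in seconds as HH:MM:SS.sss."""
    # Work in whole milliseconds so the fractional part cannot overflow to 1000.
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


print(seconds_to_timecode(3723.5))
```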

Prompt Usage (Experimental)

Currently, the app works best on people-related content, but we are starting to support prompts through the prompt and negative_prompt parameters. prompt is a comma-separated list of classes, described in natural language, to look for in the video. For example, "a person speaking" or "a person talking". negative_prompt is a comma-separated list of classes to avoid in the video. For example, "a person not speaking" or "a person not talking". The app then uses both prompts to determine the best layout and crop for the video.

smart_edit is a pre-set use of prompt and negative_prompt with the following classes:

  • prompt: the subject of focus, the most important thing in the image, the main person speaking, the main object in the scene, the object that stands out the most
  • negative_prompt: background object, logo, small object, blurry object, backdrop graphic, background graphic, a large crowd, blurry graphic, news headline graphic, the back of a head, crowded area, table, set prop, a person whose face isn't visible, a person in a crowd, logo graphic
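
Since both parameters are comma-separated strings, it can help to keep the classes as lists and join them only when building the request. The sketch below does that with the smart_edit classes listed above. The helper names and the ", " separator are assumptions, and no API call is shown.

```python
# Classes copied from the smart_edit preset described above.
SMART_EDIT = {
    "prompt": [
        "the subject of focus",
        "the most important thing in the image",
        "the main person speaking",
        "the main object in the scene",
        "the object that stands out the most",
    ],
    "negative_prompt": [
        "background object", "logo", "small object", "blurry object",
        "backdrop graphic", "background graphic", "a large crowd",
        "blurry graphic", "news headline graphic", "the back of a head",
        "crowded area", "table", "set prop",
        "a person whose face isn't visible", "a person in a crowd",
        "logo graphic",
    ],
}


def to_prompt(classes) -> str:
    """Join class descriptions into one comma-separated prompt string."""
    return ", ".join(c.strip() for c in classes if c.strip())


def from_prompt(prompt: str) -> list:
    """Split a comma-separated prompt string back into its classes."""
    return [c.strip() for c in prompt.split(",") if c.strip()]


print(to_prompt(["a person speaking", "a person talking"]))
```

Individual class descriptions must not contain commas themselves, since the comma is the separator.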

Known Limitations

  • The app currently performs poorly when there are large crowds of people. Think scenes such as political rallies with people behind the speaker, large audience crowds, busy streets, etc.
  • The app works best when there are 3 or fewer subjects in the frame. It may still work with 4 or more subjects, but the results may be less stable.
  • Speaker detection works best when speakers are closer to the camera. Far away speakers may not always be classified as active speakers.