Introducing Autocrop 1.0: Format videos into different aspect ratios with AI editing
We discuss the launch of Autocrop 1.0, a new API that allows you to format videos into different aspect ratios with AI editing.
by Mokshith Voodarla

After months of feedback from early customers, we’re announcing the 1.0 release of Autocrop — a new way for developers to edit videos into different aspect ratios using AI. The video below does the best job of explaining what this means.

Video distribution and consumption are changing quickly, driven by popular video platforms like TikTok, Instagram Reels, and YouTube Shorts. This is great for brands and content creators because niche content is more likely to go viral. The challenge is that it requires a different editing style, which can vary from genre to genre. For podcasts, this might mean focusing on the person speaking, while for sports, it might mean following the action.

The output also needs to fit a specific aspect ratio depending on the platform (9:16, 1:1, 5:4, etc.), and the editing needs to happen at scale, across hundreds of clips. This is something many content teams are not set up for.

Products like Opus Clip have made this easier and have become very popular as a result. However, we believe many more products would benefit from offering similar capabilities, which is why we’re introducing a developer-focused API that makes adding them seamless.
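To give a feel for what calling the API looks like, here’s a minimal sketch using Sieve’s Python client. The function slug and the aspect_ratio parameter are illustrative assumptions; the usage guide linked at the end has the exact signature.

```python
# A minimal sketch of calling autocrop through the Sieve Python client.
# The "sieve/autocrop" slug and the aspect_ratio parameter are assumed
# here for illustration; the usage guide has the exact signature.
import sieve

video = sieve.File(path="podcast_clip.mp4")  # hypothetical local file
autocrop = sieve.function.get("sieve/autocrop")

# aspect_ratio is assumed to take a "W:H" string like "9:16" or "1:1".
output = autocrop.run(video, aspect_ratio="9:16")
print(output)
```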

What does it actually do?

Autocrop can edit clips into various aspect ratios by detecting speakers and moving objects to create the most engaging presentation. It runs object detection to find the right subjects, active speaker detection to focus on the subjects that are actively speaking, and then composes an output layout that may involve multiple subjects — all automatically. Video below courtesy of Opus Clip.

Why is this hard?

At first glance this looks like simple face detection → cropping, but it’s much more difficult once you consider the edge cases. The demos below show the original video on the right and the Sieve autocropped 9:16 version on the left.

Consider a scene like this with many people. Simple object detection would pick up every person in the scene. How do you know who to focus on and who to ignore? Is the biggest subject always the “right” one? How can you tell if their head is facing away from the camera?

What about videos with multiple people in them? How do you know who’s speaking, or who to focus on when several people are speaking at once?

How does it work?

Our implementation of autocrop combines robust object detection, high-quality active speaker detection (which we wrote about in a separate post), and an intricate algorithm built on top that merges these results to decide the final layout used in the edited video.
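As a rough mental model (and not our actual implementation), the per-frame decision can be sketched as: prefer subjects that are actively speaking, and fall back to the most prominent detection otherwise. All of the names below are hypothetical.

```python
# A conceptual sketch, not Sieve's implementation: combine per-frame
# object detections with active-speaker scores to pick a crop target.
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple            # (x, y, w, h) in pixels
    speaker_score: float  # 0..1 from active speaker detection

def pick_crop_target(detections):
    """Prefer an active speaker; otherwise fall back to the largest subject."""
    if not detections:
        return None
    speaking = [d for d in detections if d.speaker_score > 0.5]
    candidates = speaking or detections
    return max(candidates, key=lambda d: d.box[2] * d.box[3])

frame = [
    Detection(box=(100, 50, 200, 300), speaker_score=0.9),
    Detection(box=(500, 40, 260, 340), speaker_score=0.1),
]
print(pick_crop_target(frame).box)  # -> (100, 50, 200, 300)
```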

It’s engineered to work best on “people-focused” content like podcasts, commentaries, product reviews, educational videos, and other similarly formatted videos, though it can generalize to other use cases as well.

Let’s visualize what’s happening under the hood. The green box is the crop we ended up picking, while the other boxes are all the detections that exist in the video.

You’ll notice it’s implicitly picking the right size of box to obey the final aspect ratio, the right moments to switch crop positions, the right way to animate the crops so the editing feels smooth, and which boxes to ignore entirely (like the hand that places the plates in frame). The final edited video is below, after a short sketch of what this logic might look like.
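To make two of those decisions concrete, here’s a simplified sketch of sizing a crop to a target aspect ratio and easing the crop center between frames. It illustrates the idea, not the production algorithm.

```python
# A simplified sketch (not Sieve's code) of two details visualized above:
# sizing the crop box to the target aspect ratio, and smoothing the crop
# center over time so switches don't feel jittery.

def crop_for_ratio(frame_w, frame_h, cx, cy, ratio=9 / 16):
    """Largest crop of the given width:height ratio, centered near (cx, cy)."""
    crop_h = frame_h
    crop_w = int(crop_h * ratio)
    if crop_w > frame_w:  # frame too narrow: bound by width instead
        crop_w = frame_w
        crop_h = int(crop_w / ratio)
    # Clamp so the crop stays inside the frame.
    x = min(max(cx - crop_w // 2, 0), frame_w - crop_w)
    y = min(max(cy - crop_h // 2, 0), frame_h - crop_h)
    return x, y, crop_w, crop_h

def ease_center(prev_c, target_c, alpha=0.2):
    """Exponential smoothing of the crop center between frames."""
    return prev_c + alpha * (target_c - prev_c)

print(crop_for_ratio(1920, 1080, cx=300, cy=540))  # -> (0, 0, 607, 1080)
```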

Parallelism & Speed

Similar to our standalone active speaker detection app, autocrop takes advantage of parallelism by chunking up the video and calling the object detection and speaker detection models on chunks in parallel. You can see this in the job tree associated with any autocrop job, and in the processing speed, which can range anywhere from 15-35% of the video’s length when video rendering is turned off.

Job Tree
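The chunk-and-parallelize pattern itself can be sketched with Python’s standard library. The chunk size and worker function here are illustrative stand-ins, since the real jobs run on Sieve’s infrastructure.

```python
# An illustrative sketch of chunked parallel processing. The worker is a
# stand-in for the object detection and speaker detection calls that run
# per chunk on Sieve's infrastructure.
from concurrent.futures import ThreadPoolExecutor

def chunk_video(duration_s, chunk_s=30):
    """Split [0, duration_s) into fixed-size (start, end) ranges."""
    return [(t, min(t + chunk_s, duration_s))
            for t in range(0, duration_s, chunk_s)]

def process_chunk(chunk):
    start, end = chunk
    # The real pipeline would run detection models on this segment.
    return {"range": (start, end), "detections": []}

chunks = chunk_video(duration_s=300)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_chunk, chunks))
print(f"processed {len(results)} chunks in parallel")
```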

Advanced Prompting Features (Beta)

Autocrop can also work on non-person content through prompting. Take this video, for example, which doesn’t contain any humans and instead showcases a device called the Rabbit R1. By simply adding the prompt “machinery, tech device, big brand logo”, we’re able to get autocrop to refocus on those attributes instead of humans.
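In code, this might look like the sketch below. The prompt parameter name is an assumption based on the behavior described above; check the usage guide for the exact field.

```python
# A hedged sketch of the prompting feature. The "prompt" parameter name
# is an assumption based on the behavior described above; see the usage
# guide for the exact field.
import sieve

video = sieve.File(path="rabbit_r1_showcase.mp4")  # hypothetical file
autocrop = sieve.function.get("sieve/autocrop")

output = autocrop.run(video, prompt="machinery, tech device, big brand logo")
```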

Conclusion

Autocrop has a ton of parameters for developers to explore, plus the ability to return either just the crop metadata or the fully rendered video, depending on how your infrastructure is set up. To get started, check out the usage guide available here, or join our Discord if you want to share your experience with autocrop. We’re excited to see what you build!