Popular production-ready functions for scale
Translate any video or audio content with natural-sounding translations and voices.
sieve/dubbing
Smart, automatic cropping of a video to a given aspect ratio based on subject detection and speaker tracking.
autocrop
High-quality background removal for images and videos.
background-removal
A comprehensive solution for video lipsyncing with a suite of different model and enhancement options.
lipsync
State-of-the-art audio-visual active speaker detection based on new, efficient face and speaker detection models.
active_speaker_det...
Fast, high-quality speech transcription with many available backends, word-level timestamps, speaker diarization, and translation capabilities.
transcribe
Filters for removing background noise, enhancing speech, and more in audio files.
audio-enhance
Correct eye contact in videos by redirecting the eyes to look at the camera.
eye-contact-correc...
Generate and render video or audio highlights for long-form content based on search phrases.
highlights
Given a video or audio file, generate a title, chapters, summary, tags, and highlights.
transcript-analysi...
A set of text-to-speech models and tooling that helps generate natural-sounding speech, clone voices, control emotions, access word timestamps, and more.
tts
Moderate videos and images for harmful content.
visual-moderation
LivePortrait is a video-driven portrait animation system that can animate a portrait video using another driving video. It can also retarget facial expressions from one image to another.
liveportrait
An active speaker detection model to detect which people are speaking in a video.
talknet-asd
A highly customizable text moderation tool that combines AI and algorithmic methods to detect and manage harmful, inappropriate, or unwanted content in real time.
text-moderation
Reliable, customizable text translation supporting 200+ languages through complex tokenization and sentence splitting.
translate
YOLOv8 real-time object detection model with COCO, face, and world variants.
yolov8
Generate depth maps from images or videos.
depth-anything-v2
High-quality speech recognition using major improvements on top of Whisper.
whisper
YouTube downloader: download videos, audio, subtitles, and metadata at scale.
youtube-downloader
Generate a portrait avatar from a source image and driving audio with multiple backends and enhancement options.
portrait-avatar
A visual language foundation model that can perform a variety of image and video question-answer tasks, such as object detection, image captioning, segmentation, and OCR.
florence-2
A comprehensive visual question answering app that integrates image and video analysis with text-based queries to provide accurate, structured, and context-aware responses.
visual-qa
This is an optimized implementation of Segment Anything 2, a model that can dynamically segment objects in an image or video.
sam2
MuseTalk is a lip-sync model that generates realistic talking faces from audio input.
musetalk
A diffusion-based, audio-driven portrait animation model.
echomimic
CodeFormer is a face restoration model that can restore low-resolution faces to high-resolution faces.
codeformer
Demucs is a state-of-the-art music source separation model, currently capable of separating drums, bass, and vocals from the rest of the accompaniment.
demucs
An optimized version of VideoReTalking, an audio-based lip synchronization model for talking head video editing in the wild.
video_retalking
Detect scene changes in a video with PySceneDetect.
pyscenedetect
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
whisperx
Detect faces in images and videos with MediaPipe.
mediapipe_face_det...
Diarize audio using pyannote-audio.
pyannote-diarizati...
Resemble Enhance is an AI-powered tool that aims to improve overall speech quality through denoising and enhancement.
resemble-enhance