Popular production-ready functions for scale
Translate any video or audio content with natural sounding translations and voices.
sieve/dubbing
Smart, automatic cropping of a video to a given aspect ratio based on subject detection and speaker tracking.
autocrop
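As a rough illustration of cropping to a target aspect ratio around a detected subject, the sketch below picks the largest window of the requested ratio that fits the frame, centers it on the subject box, and clamps it to the frame edges. It is not the autocrop implementation; the `Box` type, the `crop_window` name, and the assumption that a subject box is already available from a detector are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle in pixels (hypothetical helper type)."""
    x: float
    y: float
    w: float
    h: float


def crop_window(frame_w: float, frame_h: float, subject: Box, target_ratio: float) -> Box:
    """Return the largest window with width/height == target_ratio that fits
    inside the frame, centered on the subject and clamped to the frame."""
    frame_ratio = frame_w / frame_h
    if target_ratio < frame_ratio:
        # Target is narrower than the frame: keep full height, trim width.
        crop_h = frame_h
        crop_w = frame_h * target_ratio
    else:
        # Target is wider than (or equal to) the frame: keep full width.
        crop_w = frame_w
        crop_h = frame_w / target_ratio

    # Center of the subject box.
    cx = subject.x + subject.w / 2
    cy = subject.y + subject.h / 2

    # Center the window on the subject, then clamp so it stays in the frame.
    x = min(max(cx - crop_w / 2, 0), frame_w - crop_w)
    y = min(max(cy - crop_h / 2, 0), frame_h - crop_h)
    return Box(x, y, crop_w, crop_h)
```

For a 1920x1080 frame with a subject near the right edge and a 9:16 target, the window keeps the full height and slides right until it touches the frame boundary. A real pipeline would also smooth the window position across frames to avoid jitter.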
This is an optimized implementation of Segment Anything 2, a model that can dynamically segment objects in an image or video.
sam2
A comprehensive solution for video lipsyncing with a suite of model and enhancement options.
lipsync
State-of-the-art audio-visual active speaker detection based on new, efficient face and speaker detection models.
active_speaker_det...
Fast, high quality speech transcription with many available backends, word-level timestamps, speaker diarization, and translation capabilities.
transcribe
Filters for removing background noise, enhancing speech, and more in audio files.
audio-enhance
Correct eye contact in videos by redirecting the eyes to look at the camera.
eye-contact-correc...
Generate and render video or audio highlights for long-form content based on search phrases.
highlights
Given a video or audio file, generate a title, chapters, summary, tags, and highlights.
transcript-analysi...
A set of text-to-speech models and tooling that helps generate natural-sounding speech, clone voices, control emotions, access word timestamps, and more.
tts
LivePortrait is a video-driven portrait animation system that can animate a portrait video using another driving video. It can also retarget facial expressions from one image to another.
liveportrait
An active speaker detection model to detect which people are speaking in a video.
talknet-asd
A highly customizable text moderation tool that combines AI and algorithmic methods to detect and manage harmful, inappropriate, or unwanted content in real-time.
text-moderation
Download YouTube videos in MP4 format at any resolution.
youtube_to_mp4
Reliable, customizable text translation supporting 200+ languages through complex tokenization and sentence splitting.
translate
YOLOv8 real-time object detection model with COCO, face, and world variants.
yolov8
High-quality speech recognition using major improvements on top of Whisper.
whisper
A visual language foundation model that can perform a variety of image and video question-answer tasks, such as object detection, image captioning, segmentation, and OCR.
florence-2
MuseTalk is a lip-sync model that generates realistic talking faces from audio input.
musetalk
CodeFormer is a face restoration model that can restore low-resolution faces to high-resolution faces.
codeformer
Models for human-centric segmentation, depth estimation, and surface normal estimation. Supports video and image inputs.
sapiens
Demucs is a state-of-the-art music source separation model, currently capable of separating drums, bass, and vocals from the rest of the accompaniment.
demucs
A comprehensive visual question answering app that integrates image and video analysis with text-based queries to provide accurate, structured, and context-aware responses.
visual-qa
Count OpenAI tokens with tiktoken to manage limits and costs.
tiktoken
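Once a token count is known, checking limits and estimating cost is simple arithmetic. The sketch below assumes the counts have already been obtained from a tokenizer; the function names, the context limit, and the per-1,000-token prices are hypothetical placeholders, not real pricing.

```python
def fits_context(prompt_tokens: int, max_output_tokens: int, context_limit: int) -> bool:
    """True if the prompt plus the reserved output budget fits the context window."""
    return prompt_tokens + max_output_tokens <= context_limit


def estimate_cost(prompt_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimated cost, with input and output tokens billed at separate
    (hypothetical) prices per 1,000 tokens."""
    return (prompt_tokens / 1000) * input_price_per_1k \
        + (output_tokens / 1000) * output_price_per_1k


# Placeholder numbers for illustration only.
if fits_context(prompt_tokens=2000, max_output_tokens=500, context_limit=4096):
    cost = estimate_cost(2000, 500, input_price_per_1k=0.5, output_price_per_1k=1.5)
    print(f"estimated cost: {cost:.2f}")
```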
An optimized version of VideoReTalking, an audio-based lip synchronization model for talking head video editing in the wild.
video_retalking
Detect scene changes in a video with PySceneDetect.
pyscenedetect
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
whisperx
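One common way to combine word-level timestamps with diarization is to give each word the speaker whose turn overlaps it the most in time. The sketch below illustrates that idea on plain dictionaries; the field names (`word`, `start`, `end`, `speaker`) and the `assign_speakers` function are assumptions made for this example, not the output schema of any function listed here.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: list[dict], turns: list[dict]) -> list[dict]:
    """Label each word with the speaker whose turn overlaps it most.
    Words that overlap no turn get speaker None."""
    labeled = []
    for word in words:
        best_speaker = None
        best_overlap = 0.0
        for turn in turns:
            shared = overlap(word["start"], word["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_overlap = shared
                best_speaker = turn["speaker"]
        labeled.append({**word, "speaker": best_speaker})
    return labeled


# Hypothetical inputs: word timings from a transcriber, turns from a diarizer.
words = [
    {"word": "hello", "start": 0.0, "end": 0.5},
    {"word": "there", "start": 0.9, "end": 1.4},
    {"word": "hi", "start": 3.0, "end": 3.2},
]
turns = [
    {"speaker": "A", "start": 0.0, "end": 1.0},
    {"speaker": "B", "start": 1.0, "end": 2.0},
]
labeled_words = assign_speakers(words, turns)
```

A word that straddles a turn boundary goes to whichever side holds more of it, and a word that falls outside every turn is left unlabeled rather than guessed.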
Detect faces in images and videos with MediaPipe.
mediapipe_face_det...
Diarize audio using pyannote-audio.
pyannote-diarizati...
Translate text into 96 different languages.
seamless_text2text
Resemble Enhance is an AI-powered tool that improves the overall quality of speech through denoising and enhancement.
resemble-enhance
Detect the language of a text with LangID, a lightweight language identification tool.
langid