EchoMimic

This is an implementation of EchoMimic, a lifelike audio-driven portrait animation model, using the accelerated inference option.

Usage

  • source_image and driving_audio are required.
  • source_image can be a video or image. If video, only the first frame will be used.
  • driving_audio can be an audio or video file. If a video is provided, its audio track is extracted and used as the driving audio.
  • output_width and output_height default to 512. Keeping both at 512 is recommended to preserve the quality of the animation.
  • video_length defaults to -1, which will use the length of the driving audio.
  • facemask_dilation_ratio and facecrop_dilation_ratio default to 0.1 and 0.5 respectively. Increasing them yields a larger crop of the source image, but can introduce artifacts around the face.
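The parameters above can be assembled into a request payload for the Replicate Python client. This is a minimal sketch, assuming the standard `replicate.run` API; the model identifier and file names are placeholders, not confirmed values.

```python
# Hedged sketch: build the input payload for this model. Only source_image
# and driving_audio are required; the rest are shown at their defaults.

def build_inputs(source_image: str, driving_audio: str, **overrides) -> dict:
    """Assemble model inputs, allowing optional parameters to be overridden."""
    inputs = {
        "source_image": source_image,    # image, or video (first frame used)
        "driving_audio": driving_audio,  # audio, or video (audio track used)
        "output_width": 512,             # keep at 512 for best quality
        "output_height": 512,
        "video_length": -1,              # -1 = match the driving audio's length
        "facemask_dilation_ratio": 0.1,
        "facecrop_dilation_ratio": 0.5,
    }
    inputs.update(overrides)
    return inputs

payload = build_inputs("portrait.png", "speech.wav")
# import replicate
# output = replicate.run("<owner>/echomimic", input=payload)  # placeholder id
```

The commented-out `replicate.run` call would return the generated video; the `<owner>/echomimic` identifier is illustrative only.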

Note: The following parameters can be sensitive and require more experimentation for the best results. Default values are recommended.

  • context_frames and context_overlap default to 12 and 3 respectively.
  • cfg defaults to 1.0.
  • steps defaults to 6.
  • fps defaults to 24.
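Since these parameters are sensitive, one way to experiment safely is to start from the documented defaults and override only one at a time. A small helper sketch, assuming the parameter names listed above (the `cfg` comment is an assumption about its meaning):

```python
# Documented defaults for the sensitive tuning parameters.
SENSITIVE_DEFAULTS = {
    "context_frames": 12,   # temporal context per generation chunk
    "context_overlap": 3,   # overlap between consecutive chunks
    "cfg": 1.0,             # guidance scale (presumably classifier-free guidance)
    "steps": 6,             # denoising steps
    "fps": 24,              # output frame rate
}

def with_tuning(inputs: dict, **overrides) -> dict:
    """Merge the defaults into an input dict, rejecting unknown parameter names."""
    unknown = set(overrides) - set(SENSITIVE_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown tuning parameter(s): {sorted(unknown)}")
    return {**inputs, **SENSITIVE_DEFAULTS, **overrides}
```

For example, `with_tuning(payload, steps=8)` keeps every other parameter at its default while raising the step count.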

Pricing

This model runs on a single A100 40GB GPU, priced at $4.20 per hour. Check out our pricing page for more information.

Inference time is approximately 5 seconds per second of audio.
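These two figures give a rough per-clip cost estimate. A back-of-envelope sketch, using the stated $4.20/hour rate and ~5 seconds of inference per second of audio (both approximate):

```python
# Back-of-envelope cost estimate from the figures stated above.
GPU_USD_PER_HOUR = 4.2
INFERENCE_SECONDS_PER_AUDIO_SECOND = 5  # approximate

def estimate_cost(audio_seconds: float) -> float:
    """Approximate USD cost to animate a clip of the given audio length."""
    runtime_seconds = audio_seconds * INFERENCE_SECONDS_PER_AUDIO_SECOND
    return runtime_seconds / 3600 * GPU_USD_PER_HOUR
```

By this estimate, a 60-second clip takes about 300 seconds of GPU time, or roughly $0.35.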

Examples

  • Audio Driven (Sing)
  • Audio Driven (English)
  • Audio Driven (Chinese)