MuseTalk

MuseTalk is a real-time, high-quality audio-driven lip-syncing AI model with latent space inpainting (30fps+ on an NVIDIA Tesla V100). MuseTalk can be applied to input videos, e.g., those generated by MuseV, as a complete virtual human solution.

This is an optimized implementation of MuseTalk that runs up to 40% faster than the original version. You can read more about the optimization details here.

Complete README here.

Pricing Information

Our video processing service uses the following pricing structure:

  • Base price: $0.16 per minute for videos above 720p, capped at 1080p
  • 20% discount for videos at 720p or lower
  • Any content above 1080p is downsampled to 1080p and billed at the 1080p rate

The final price is calculated from the video duration and resolution. For example:

  • A 1-minute 1080p video would cost $0.16
  • A 1-minute 720p video would cost $0.128
  • A 1-minute 480p video would cost $0.128

Prices are subject to change. Please refer to our latest documentation for the most up-to-date pricing information.
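The rules above can be sketched as a small estimator. This is an illustrative calculation based only on the rates listed here (the function name and rounding behavior are assumptions, not part of the service's API):

```python
def estimate_price(duration_minutes: float, height_px: int) -> float:
    """Estimate the cost of a video under the pricing rules above.

    Hypothetical helper for illustration only:
    - base rate is $0.16/min for videos above 720p (capped at 1080p,
      since higher resolutions are downsampled)
    - videos at 720p or lower get a 20% discount off the base rate
    """
    BASE_RATE = 0.16  # USD per minute
    rate = BASE_RATE if height_px > 720 else BASE_RATE * 0.8
    return round(duration_minutes * rate, 4)

# Reproducing the examples from the pricing list:
print(estimate_price(1, 1080))  # 0.16
print(estimate_price(1, 720))   # 0.128
print(estimate_price(1, 480))   # 0.128
```

Note that anything above 1080p is billed at the same rate as 1080p, because it is downsampled before processing.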

Overview

MuseTalk is a real-time, high-quality audio-driven lip-syncing model trained in the latent space of ft-mse-vae, which

  1. modifies an unseen face according to the input audio, with a face region size of 256 x 256.
  2. supports audio in various languages, such as Chinese, English, and Japanese.
  3. supports real-time inference with 30fps+ on an NVIDIA Tesla V100.
  4. supports modification of the center point of the proposed face region, which SIGNIFICANTLY affects the generation results.
  5. provides a checkpoint trained on the HDTF dataset.
  6. training code (coming soon).

Acknowledgement

  1. We thank the authors of open-source components such as whisper, dwpose, face-alignment, face-parsing, and S3FD.
  2. MuseTalk draws heavily on diffusers and isaacOnline/whisper.
  3. MuseTalk is built on the HDTF dataset.

Thanks for open-sourcing!

Limitations

  • Resolution: Though MuseTalk uses a face region size of 256 x 256, which makes it better than other open-source methods, it has not yet reached the theoretical resolution bound. We will continue to work on this problem.
    If you need higher resolution, you could apply super-resolution models such as GFPGAN in combination with MuseTalk.

  • Identity preservation: Some details of the original face are not well preserved, such as mustache, lip shape and color.

  • Jitter: Some jitter is present because the current pipeline adopts single-frame generation.

Citation

@article{musetalk,
  title={MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting},
  author={Zhang, Yue and Liu, Minhao and Chen, Zhaokang and Wu, Bin and He, Yingjie and Zhan, Chao and Zhou, Wenjiang},
  journal={arXiv},
  year={2024}
}

Disclaimer/License

  1. code: The code of MuseTalk is released under the MIT License. There are no restrictions on academic or commercial use.
  2. model: The trained model is available for any purpose, including commercial use.
  3. other open-source models: Other open-source models used in this project must comply with their own licenses, such as whisper, ft-mse-vae, dwpose, S3FD, etc.
  4. The test data are collected from the internet and are available for non-commercial research purposes only.
  5. AIGC: This project strives to impact the domain of AI-driven video generation positively. Users are granted the freedom to create videos using this tool, but they are expected to comply with local laws and utilize it responsibly. The developers do not assume any responsibility for potential misuse by users.