Portrait Avatar

Generate a portrait avatar from a source image and driving audio with multiple backends and enhancement options.

Available backends include:

  • Hedra: Uses Hedra's Character-2 model.

  • EchoMimic: Uses the open-source EchoMimic model.

  • Infinity AI: Uses the Infinity AI model.

For pricing, see the Pricing section below.

For examples, click here.

For model-specific parameters, see the Model-Specific Parameters section below.

For tips to ensure better performance, see the Tips for better performance section below.

Ethical Considerations

Lipsync and avatar generation technologies come with social risks, particularly the potential for misuse in creating deepfakes. To mitigate these risks, it's crucial to follow ethical guidelines and adopt responsible usage practices. Currently, the synthesized results contain visual artifacts that may help in detecting deepfakes as well as watermarks that identify the use of Sieve. Please note that we do not assume any legal responsibility for the use of the results generated by this app.

Please reach out to us at sales@sievedata.com or via Discord if you have any questions or concerns, or if you want to request watermark removal.

Model-Specific Parameters:

  • Hedra:

    • aspect_ratio: The aspect ratio of the output video. Options are 1:1, 16:9, and 9:16. If a crop is needed, it is applied at the center of the frame.

  • Infinity AI:

    • resolution: The resolution of the output video. Options are 320 (320x320 px), 512 (512x512 px), and 640 (640x640 px). Resolution directly affects price and inference time.
    • crop_head: Whether to crop only the head of the person in the source image.
    • expressiveness: The expressiveness of the avatar. Ranges from 0 to 1 (inclusive). Higher values produce more emotion and movement.
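
The allowed values listed above can be checked before submitting a job. The following is a minimal sketch in plain Python, not part of the Sieve client. The backend keys ("hedra", "infinity-ai", "echomimic") and the function name validate_params are hypothetical names chosen for illustration, and treating EchoMimic as accepting no model-specific parameters is an assumption based only on its absence from the list above.

```python
from typing import Any, Dict

# Allowed values copied from the parameter list above.
HEDRA_ASPECT_RATIOS = {"1:1", "16:9", "9:16"}
INFINITY_RESOLUTIONS = {320, 512, 640}

# Hypothetical backend keys mapped to the parameter names each one accepts.
# EchoMimic is assumed to take no model-specific parameters.
ALLOWED_PARAMS = {
    "hedra": {"aspect_ratio"},
    "infinity-ai": {"resolution", "crop_head", "expressiveness"},
    "echomimic": set(),
}


def validate_params(backend: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Return params unchanged if they are valid for the backend, else raise ValueError."""
    if backend not in ALLOWED_PARAMS:
        raise ValueError(f"unknown backend: {backend!r}")

    # Reject parameters that belong to a different backend.
    unknown = set(params) - ALLOWED_PARAMS[backend]
    if unknown:
        raise ValueError(f"{backend} does not accept: {sorted(unknown)}")

    if "aspect_ratio" in params and params["aspect_ratio"] not in HEDRA_ASPECT_RATIOS:
        raise ValueError("aspect_ratio must be one of 1:1, 16:9, 9:16")
    if "resolution" in params and params["resolution"] not in INFINITY_RESOLUTIONS:
        raise ValueError("resolution must be one of 320, 512, 640")
    if "crop_head" in params and not isinstance(params["crop_head"], bool):
        raise ValueError("crop_head must be a boolean")
    if "expressiveness" in params and not 0 <= params["expressiveness"] <= 1:
        raise ValueError("expressiveness must be between 0 and 1 (inclusive)")
    return params
```

For example, validate_params("infinity-ai", {"resolution": 512, "expressiveness": 0.7}) returns the dictionary unchanged, while passing resolution=480 or sending aspect_ratio to the Infinity AI backend raises ValueError.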

Important Notes:

  • Enhance applies restoration to the face only and does not change the resolution of the video. It is optional and costs extra.
  • The processing time depends on video resolution and video length.
  • There must be a single person in the source image for reliable results.
  • Each backend has different capabilities, performance, and costs. Please see the pricing section for more details.
  • The Infinity video model can generate expressive talking heads across any style and any angle.
  • Hedra's Character-2 model can generate talking heads across any style while maintaining high realism and quality.
  • EchoMimic can generate stable and expressive talking heads but it can only generate them in a square crop.
  • Hedra blocks the following types of inputs:
    • Celebrities/Public figures
    • Kids + sexual content
    • Anything flagged by the OpenAI Moderation API, except under the harassment category

Tips for better performance:

  • Ensure there is only a single primary speaker in the audio.
  • For all backends, use a high-quality source image with a sharp face.
  • Ensure the person is facing the camera.
  • Ensure that the frame being used is a stable image with minimal blurring.

Pricing

Note: You are also billed for the CPU compute time of each job at $0.40 per compute hour. More info here.

Backend              Enhance    Price per Minute
Hedra                False      $0.50
Hedra                True       $0.65
Infinity AI (320)    False      $0.60
Infinity AI (320)    True       $0.75
Infinity AI (512)    False      $1.50
Infinity AI (512)    True       $1.65
Infinity AI (640)    False      $2.35
Infinity AI (640)    True       $2.50
EchoMimic            False      $0.45
EchoMimic            True       $0.60
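
To see how the per-minute price and the CPU compute charge combine, here is a rough estimator in Python. It is only a sketch: the backend keys and the function name estimate_cost are hypothetical, and it assumes cost scales linearly with video length and CPU time, which this page does not confirm (actual billing may round or meter differently).

```python
# Per-minute prices copied from the pricing table above, keyed by
# (hypothetical backend key, enhance flag).
PRICE_PER_MINUTE = {
    ("hedra", False): 0.50,
    ("hedra", True): 0.65,
    ("infinity-ai-320", False): 0.60,
    ("infinity-ai-320", True): 0.75,
    ("infinity-ai-512", False): 1.50,
    ("infinity-ai-512", True): 1.65,
    ("infinity-ai-640", False): 2.35,
    ("infinity-ai-640", True): 2.50,
    ("echomimic", False): 0.45,
    ("echomimic", True): 0.60,
}

# CPU compute rate from the note above, in dollars per compute hour.
CPU_RATE_PER_HOUR = 0.40


def estimate_cost(backend: str, enhance: bool, video_minutes: float,
                  cpu_hours: float = 0.0) -> float:
    """Estimate job cost in dollars, assuming linear proration (an assumption)."""
    if (backend, enhance) not in PRICE_PER_MINUTE:
        raise ValueError(f"no price listed for {backend!r} with enhance={enhance}")
    if video_minutes < 0 or cpu_hours < 0:
        raise ValueError("durations must be non-negative")
    model_cost = PRICE_PER_MINUTE[(backend, enhance)] * video_minutes
    compute_cost = CPU_RATE_PER_HOUR * cpu_hours
    return round(model_cost + compute_cost, 4)
```

Under these assumptions, a 2-minute Hedra video with Enhance on and 0.1 hours of CPU time comes to 2 x $0.65 + 0.1 x $0.40 = $1.34.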