New AI Video Model ‘Hunyuan’ Debuts via ComfyUI

Last week, a groundbreaking AI video generation model was released: ComfyUI Hunyuan Custom. This innovative model builds upon recent advances in reference-based video generation, enabling users to generate video clips using a single image and a text prompt.

In this post, we’ll explore how Hunyuan Custom works, its unique capabilities, and how you can start testing it yourself today.


What Is ComfyUI Hunyuan Custom?

ComfyUI Hunyuan Custom is a reference-to-video AI model. It allows you to:

  • Input a reference image (of a person, object, etc.)
  • Add a simple text prompt
  • Generate coherent, stylized video clips (up to 129 frames)

Whether it’s a single-subject or multi-subject setup, the model replicates the style, posture, and look of the reference subject while placing them into new environments based on your prompt.


How It Works: LLaVA for Vision-Language Understanding

The model is powered by LLaVA (Large Language and Vision Assistant), a multimodal vision-language framework. It takes the reference image as input for image captioning and subject analysis.

Here’s how it processes the video:

  1. Image Input: You upload an image (e.g., a man walking).
  2. Text Prompt: Provide a prompt like “A senior man walking on the beach.”
  3. LLaVA Processing: The image is captioned and analyzed, identifying the subject (e.g., “senior man”).
  4. Video Generation: The subject is animated into a video scene based on your prompt.
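
If you prefer to script generations instead of clicking through the graph, ComfyUI exposes a local HTTP API. The snippet below is a minimal sketch, assuming ComfyUI is running on its default port and that you have exported your Hunyuan Custom workflow via “Save (API Format)”; the file name and the commented node ID are placeholders you would replace with values from your own export.

```python
import json
import urllib.request

COMFYUI_URL = "http://127.0.0.1:8188/prompt"

# Load a workflow exported from ComfyUI via "Save (API Format)".
with open("hunyuan_custom_api.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

# Optionally tweak the text prompt before queueing. "6" is a placeholder node ID;
# look up the real ID of your prompt node in the exported JSON.
# workflow["6"]["inputs"]["text"] = "A senior man walking on the beach."

payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request(
    COMFYUI_URL, data=payload, headers={"Content-Type": "application/json"}
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))  # the server responds with the queued prompt_id
```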

You can enhance results by removing backgrounds from reference images, helping the model isolate the subject more accurately.
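
One way to do this outside the graph is with the third-party rembg package. The sketch below assumes rembg and Pillow are installed (pip install rembg pillow; neither ships with ComfyUI), and the file names are placeholders.

```python
from PIL import Image
from rembg import remove  # third-party background-removal library

# Strip the background so the model sees only the subject.
reference = Image.open("reference_subject.jpg")
isolated = remove(reference)  # returns an RGBA image with a transparent background
isolated.save("reference_subject_clean.png")
```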


Key Features and Customizations

Single Subject Reference-to-Video

Upload one image and describe an action or scene. The model retains the style and outfit of the subject across frames. Example:

  • “A woman in a red dress walking a dog in a park.”

Multi-Subject Support (Workaround)

Although the current version supports single-subject references, you can concatenate multiple subjects into one image to simulate a multi-subject video.
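
A quick way to build such a combined reference is to paste the subject images onto one canvas. The sketch below assumes Pillow is installed and uses placeholder file names.

```python
from PIL import Image

# Paste two subject images side by side to form a single multi-subject reference.
left = Image.open("subject_a.png")
right = Image.open("subject_b.png")

canvas = Image.new("RGB", (left.width + right.width, max(left.height, right.height)), "white")
canvas.paste(left, (0, 0))
canvas.paste(right, (left.width, 0))
canvas.save("combined_reference.png")
```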

Audio-Driven Video (Coming Soon)

In development: input audio and generate a video that matches the tone or narrative of the audio clip.

Video-to-Video Customization (Coming Soon)

Edit existing videos by swapping objects (for example, replacing a teddy bear with a husky toy, or swapping backpacks for product demos).


System Requirements

Hunyuan Custom comes in two versions:

  • FP16 (Full Precision): Requires 80GB VRAM
  • FP8 (Quantized): Requires 24GB VRAM (tested locally)

For most users, FP8 is the go-to version. It generates 720p videos up to 129 frames, though generation may be slow.
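
As a quick sanity check on clip length, 129 frames works out to only a few seconds of footage; the frame rate below is an assumption, so match it to whatever you configure in your workflow.

```python
frames = 129
fps = 24  # assumed frame rate; use the fps set in your workflow
print(f"{frames / fps:.1f} seconds of video")  # ~5.4 s at 24 fps
```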


ComfyUI Integration

You can run Hunyuan Custom using ComfyUI, thanks to updates in the Hunyuan Video Wrapper Node. The node supports:

  • FP8 scaled and quantized models
  • Torch Compile / Sage Attention setups
  • CLIP Vision and VAE loaders
  • Resize and pad options for image compatibility

Important Settings:

  • Use the remove-background option to isolate subjects.
  • Stick to reference dimensions listed in the Hugging Face repo.
  • Set frame numbers (e.g., 129) for full-length video outputs.
  • Use appropriate VAE settings (FP16 or FP32, depending on precision).
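
If you would rather prepare the reference image yourself instead of relying on the wrapper’s resize-and-pad option, the sketch below uses Pillow; the 1280×720 target is only an example, so substitute the reference dimensions from the Hugging Face repo.

```python
from PIL import Image, ImageOps

TARGET = (1280, 720)  # example target; use the reference dimensions from the repo

# Resize while keeping the aspect ratio, then pad the remainder with a neutral color.
img = Image.open("reference_subject_clean.png").convert("RGB")
padded = ImageOps.pad(img, TARGET, color=(255, 255, 255))
padded.save("reference_720p.png")
```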

Where to Download

You can find the project and download links on GitHub.

Use Cases

This model is a game-changer for:

  • Product marketing: Showcase products in real-world settings without filming
  • Fashion try-ons: Apply garments to characters in different video scenarios
  • Creative storytelling: Turn a photo into a character for animated storytelling
  • User-generated content: Leverage existing assets to generate high-quality video

Final Thoughts

Hunyuan Custom is still in active development, with exciting features like audio-driven and video-driven customization coming soon. Even in its current form, it’s a powerful tool for AI video generation that rivals models like Pika or Runway, with strong reference fidelity and detailed subject animation.

As the AI video space evolves, tools like ComfyUI Hunyuan are setting the pace for what’s possible in media creation.

