HunyuanVideo on GPU Droplets


The advent of text-to-video models has been one of the many AI marvels to come out of the past year. From SORA to VEO-2, we have seen some genuinely incredible models hit the closed-source market. These models are capable of generating videos of all kinds, including photorealism, animation, masterful effects work, and much more. As seems to happen with everything else in Deep Learning, the open-source development community has followed the success of these closed-source models closely, and open-source models are constantly striving to achieve the same video quality and prompt fidelity.

Recently, we have seen the release of two notable AI text-to-video models that are making waves the way Stable Diffusion once did: the LTX and HunyuanVideo text-to-video models. LTX's low RAM requirements and HunyuanVideo's versatility and trainability have pushed the popularity of text-to-video models to new heights.

In this series of articles, we will discuss how to use these incredible models on DigitalOcean's NVIDIA-powered GPU Droplets, starting with a deeper look at HunyuanVideo. Readers can expect to come away from this first article with a firmer understanding of how HunyuanVideo and related next-generation text-to-video models work under the hood. After covering the underlying theory, we will provide a demo showing how to get started running the model.

Follow along to learn how to create your own incredible videos with HunyuanVideo and DigitalOcean.

Prerequisites

  • Python: this demo will involve intermediate-level Python code. Anyone will be able to copy and paste the code to follow along, but understanding and modifying the scripts will require familiarity with Python.
  • Deep Learning: we will cover the underlying theory behind the model in the first section of this article, and the terminology used will require familiarity with Deep Learning concepts.
  • DigitalOcean account: we are going to create a GPU Droplet on DigitalOcean, which may require you to create an account if you do not have one already.

HunyuanVideo

HunyuanVideo is, arguably, the first open-source model to rival competitive closed-source models for text-to-video generation. To achieve this, HunyuanVideo's research team made several careful decisions regarding its data collection and its pipeline architecture.


The data itself was carefully curated and refined so that only the most informative training videos, paired with highly dense text descriptions, were used. First, the video data was aggregated from several sources. Then, this data was filtered through a series of hierarchical refinements at each resolution: 256p, 360p, 540p, and 720p. These filtration steps focused on removing any data from the original pool with undesirable traits, and finished with a final stage of manual selection. After selecting the video data manually, the researchers developed a proprietary VLM to handle the task of creating descriptions for each video in each of the following categories: a short description, a dense description, and descriptions of the background, style, shot type, lighting, and atmosphere of each video. These structured captions provide the textual grounding for training and inference.
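To make that caption structure concrete, here is a minimal sketch of what a single structured caption record might look like in Python. The field names and values are purely illustrative assumptions based on the categories listed above, not the actual schema used by the HunyuanVideo team.

# Illustrative structured caption for one training video.
# Field names are assumptions drawn from the categories described above,
# not the exact schema used by the HunyuanVideo researchers.
structured_caption = {
    "short_description": "A golden retriever runs along a beach at sunset.",
    "dense_description": (
        "A golden retriever sprints across wet sand at golden hour, "
        "kicking up spray as small waves roll in behind it."
    ),
    "background": "Ocean shoreline with a low sun and scattered clouds",
    "style": "Photorealistic",
    "shot_type": "Low-angle tracking shot",
    "lighting": "Warm, low-angle sunlight",
    "atmosphere": "Energetic and serene",
}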


Let's now look at the model architecture. HunyuanVideo is a powerful video generative model with over 13 billion parameters, one of the largest available to the open-source community. The model was trained on a spatially and temporally compressed latent space, produced by a Causal 3D VAE. The text prompts are encoded using a large language model and used as the conditioning signal. To generate a video, Gaussian noise and the condition are taken as input, and the model generates an output latent, which is decoded into images or videos by the 3D VAE decoder. (Source)
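To make the latent-space idea more concrete, the short sketch below computes the shape of the latent that the diffusion transformer works on, assuming commonly cited Causal 3D VAE compression ratios of 4x in time and 8x in space with 16 latent channels. These ratios are assumptions for illustration only; check the HunyuanVideo paper or repository for the exact configuration.

import torch

# Assumed Causal 3D VAE compression ratios (illustrative, not confirmed here):
# 4x temporal, 8x spatial, 16 latent channels.
TEMPORAL_RATIO, SPATIAL_RATIO, LATENT_CHANNELS = 4, 8, 16

frames, height, width = 129, 720, 1280            # e.g., a ~5 second 720p clip
latent_t = (frames - 1) // TEMPORAL_RATIO + 1     # causal VAE keeps the first frame
latent_h, latent_w = height // SPATIAL_RATIO, width // SPATIAL_RATIO

# The diffusion transformer denoises a latent of this shape, conditioned on the
# encoded text prompt; the 3D VAE decoder then maps the final latent back to frames.
noisy_latent = torch.randn(1, LATENT_CHANNELS, latent_t, latent_h, latent_w)
print(noisy_latent.shape)  # torch.Size([1, 16, 33, 90, 160])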


Looking a bit deeper at the Transformer design in HunyuanVideo: it employs a unified Full Attention mechanism for superior performance compared to divided spatiotemporal attention, it supports unified generation of both images and videos, and it leverages existing LLM-related acceleration capabilities more effectively, enhancing both training and inference efficiency. (Source)

“To integrate textual and visual information effectively, they follow the strategy of a “Dual-stream to Single-stream” hybrid model design for video generation. In the dual-stream phase of this approach, video and text tokens are processed independently through multiple Transformer blocks, enabling each modality to learn its own appropriate modulation mechanisms without interference. In the single-stream phase, it concatenates the video and text tokens and feeds them into subsequent Transformer blocks for effective multimodal information fusion. This design captures complex interactions between visual and semantic information, enhancing overall model performance.” (Source)

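As a rough illustration of that dual-stream to single-stream layout, the sketch below uses standard PyTorch encoder layers as stand-ins for the real HunyuanVideo blocks. The dimensions, depths, and layer types are placeholder assumptions; only the overall flow (separate modality streams, then concatenation and joint attention) reflects the description above.

import torch
import torch.nn as nn

class DualToSingleStream(nn.Module):
    """Toy dual-stream -> single-stream layout with placeholder sizes."""

    def __init__(self, dim=256, heads=8, dual_depth=2, single_depth=2):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_blocks = nn.ModuleList(make() for _ in range(dual_depth))
        self.text_blocks = nn.ModuleList(make() for _ in range(dual_depth))
        self.joint_blocks = nn.ModuleList(make() for _ in range(single_depth))

    def forward(self, video_tokens, text_tokens):
        # Dual-stream phase: each modality is processed independently.
        for vblock, tblock in zip(self.video_blocks, self.text_blocks):
            video_tokens, text_tokens = vblock(video_tokens), tblock(text_tokens)
        # Single-stream phase: concatenate and fuse with full attention.
        fused = torch.cat([video_tokens, text_tokens], dim=1)
        for jblock in self.joint_blocks:
            fused = jblock(fused)
        return fused

model = DualToSingleStream()
out = model(torch.randn(1, 64, 256), torch.randn(1, 16, 256))  # video tokens, text tokens
print(out.shape)  # torch.Size([1, 80, 256])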

For the text encoder, they “utilize a pre-trained Multimodal Large Language Model (MLLM) with a Decoder-Only structure [], which has the following advantages: (i) Compared with T5, MLLM after visual instruction finetuning has better image-text alignment in the feature space, which alleviates the difficulty of instruction following in diffusion models; (ii) Compared with CLIP, MLLM has demonstrated a superior ability in image detail description and complex reasoning; (iii) MLLM can play as a zero-shot learner by following system instructions prepended to user prompts, helping text features pay more attention to key information.” (Source)

Put together, this gives us a pipeline for creating novel videos or images from nothing but text inputs.

HunyuanVideo Code demo

GPU Selection

To run HunyuanVideo, we first recommend that users have sufficient computing power available. We recommend at least 40GB of VRAM, ideally 80GB. For this, we like to use DigitalOcean's GPU Droplet offerings. For more details, check out this link to get started with a GPU Droplet.
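Once the Droplet is running, a quick check like the one below confirms that the attached GPU meets that memory recommendation. It assumes PyTorch with CUDA support is already installed in your environment.

import torch

# Sanity-check that the selected GPU meets the ~40-80 GB VRAM recommendation.
# Assumes a CUDA-capable NVIDIA GPU and a CUDA-enabled PyTorch build.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU detected - HunyuanVideo needs a large NVIDIA GPU.")

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"{props.name}: {vram_gb:.0f} GB VRAM")
if vram_gb < 40:
    print("Warning: under 40 GB VRAM; generation may fail or need CPU offloading.")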

Once you have chosen a GPU on a suitable cloud platform and spun it up, we can move on to the next step.

Python code

First, we are going to show how to run HunyuanVideo with Python code and Gradio. To get started, paste the following into the terminal:

git clone https://github.com/Tencent/HunyuanVideo
cd HunyuanVideo/
pip install -r requirements.txt
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3
python -m pip install xfuser==0.4.0
python -m pip install "huggingface_hub[cli]"
huggingface-cli login

You will then be prompted to log in to HuggingFace, which is required to access the models. To download them, once the HuggingFace login is complete, paste the following into the terminal:

huggingface-cli download tencent/HunyuanVideo --local-dir ./ckpts

Once the downloads are complete, we can launch the web application with this final command:

python3 gradio_server.py --flow-reverse --share

This will create a publicly accessible, shareable link that we can open in our local machine's browser.


From here, we can take advantage of our powerful GPU to start generating videos. First, enter a descriptive and detailed prompt into the text input. Then, we suggest starting with a lower resolution (540p) to generate the first video more quickly. Use the default settings with this one change to generate videos until you find one you like. Then, using the advanced options, set a repeatable seed so that you can recreate an upscaled version of the same video at a higher resolution. We can also increase the number of inference steps, which we found to have a greater effect on video quality than the output resolution.


The model is incredibly versatile and easy to use. In our testing, we found that it was capable of creating videos in a wide variety of styles, including realism, fantasy, moving artwork, both 2D and 3D animation, and much more. We were particularly impressed by the realism the model could produce for human figures. We even found some success doing basic effects work with realistic characters. In particular, attention should be drawn to how exceptional HunyuanVideo is at generating all aspects of the human body and face. It does seem to struggle with hands, but that is still the case for most diffusion-based image synthesis models and is to be expected. Additionally, it's worth noting that the model is highly detailed in the foreground while being somewhat lacking in detail in the background; a blur seems to cover much of the background even at higher step counts. Overall, we found the model to be very effective, and well worth the cost of the GPU time.

Here is a sample video we made by compositing five HunyuanVideo samples with a MusicGen audio track. As you can see, the possibilities are nearly endless as more developments and fine-tunes come out for this awesome model.

Conclusion

HunyuanVideo is a really impressive first effort at closing the gap between open- and closed-source video generation models. While it does not quite match the high visual fidelity touted by models like VEO-2 and SORA, HunyuanVideo does an admirable job of matching the diversity of subjects those models covered during training. In the near future, we can expect to see rapid steps forward for video models now that open-sourcing has reached this particular area of development, particularly from players like Tencent.

Look out for part 2 of this series, where we will cover Image-to-Video generation with LTX Video!
