MagicDance: Realistic Human Dance Video Generation


Computer vision is one of the most discussed fields in the AI industry, thanks to its potential applications across a wide range of real-time tasks. In recent years, computer vision frameworks have advanced rapidly, with modern models now capable of analyzing facial features, objects, and much more in real-time scenarios. Despite these capabilities, human motion transfer remains a formidable challenge for computer vision models. This task involves retargeting facial and body motions from a source image or video to a target image or video. Human motion transfer is widely used in computer vision for styling images or videos, editing multimedia content, digital human synthesis, and even generating data for perception-based frameworks.

In this article, we focus on MagicDance, a diffusion-based model designed to revolutionize human motion transfer. The MagicDance framework specifically aims to transfer 2D human facial expressions and motions onto challenging human dance videos. Its goal is to generate novel pose sequence-driven dance videos for specific target identities while keeping the identity unchanged. The MagicDance framework employs a two-stage training strategy that disentangles human motion from appearance factors such as skin tone, facial expressions, and clothing. We’ll delve into the MagicDance framework, exploring its architecture, functionality, and performance compared to other state-of-the-art human motion transfer frameworks. Let’s dive in.

As mentioned earlier, human motion transfer is one of the most complex computer vision tasks because of the sheer difficulty of transferring human motions and expressions from a source image or video to a target image or video. Traditionally, computer vision frameworks have achieved human motion transfer by training a task-specific generative model, such as a GAN (Generative Adversarial Network), on target datasets for facial expressions and body poses. Although these generative models deliver satisfactory results in some cases, they usually suffer from two major limitations.

  1. They rely heavily on an image warping component, so they often struggle to interpolate body parts that are invisible in the source image, whether because of a change in perspective or self-occlusion.
  2. They cannot generalize to externally sourced images, which limits their applications, especially in real-time, in-the-wild scenarios.

Modern diffusion models have demonstrated exceptional image generation capabilities across different conditions, and by learning from web-scale image datasets they now deliver strong results on an array of downstream tasks such as video generation and image inpainting. Owing to these capabilities, diffusion models might be an ideal choice for human motion transfer. However, existing diffusion-based approaches to human motion transfer have limitations of their own: lower quality of the generated content, weak identity preservation, or temporal inconsistencies caused by limits in model design and training strategy. Moreover, diffusion-based models demonstrate no significant advantage over GAN frameworks in terms of generalizability.

To overcome the hurdles faced by diffusion- and GAN-based frameworks on human motion transfer tasks, developers have introduced MagicDance, a novel framework that aims to exploit the potential of diffusion frameworks for human motion transfer, demonstrating an unprecedented level of identity preservation, superior visual quality, and domain generalizability. At its core, the MagicDance framework splits the problem into two stages: appearance control and motion control, the two capabilities an image diffusion framework needs to deliver accurate motion transfer outputs.


The above figure gives a brief overview of the MagicDance framework. As can be seen, the framework employs the Stable Diffusion (SD) model and deploys two additional components: the Appearance Control Model and the Pose ControlNet. The former provides appearance guidance to the SD model from a reference image via attention, while the latter provides expression and pose guidance to the diffusion model from a conditioning image or video. The framework also employs a multi-stage training strategy to learn these sub-modules effectively and to disentangle pose control from appearance.

In summary, the MagicDance framework offers the following:

  1. It is a novel and effective framework consisting of appearance-disentangled pose control and appearance control pretraining.
  2. It can generate realistic human facial expressions and human motion under the control of pose condition inputs and reference images or videos.
  3. It aims to generate appearance-consistent human content by introducing a Multi-Source Attention Module that provides accurate guidance for the Stable Diffusion UNet.
  4. It can be used as a convenient extension or plug-in for Stable Diffusion, and it remains compatible with existing model weights because it does not require additional fine-tuning of their parameters.

Furthermore, the MagicDance framework exhibits exceptional generalization capabilities in both appearance and motion.

  1. Appearance Generalization: The MagicDance framework demonstrates superior capabilities when it comes to generating diverse appearances.
  2. Motion Generalization: The MagicDance framework can also generate a wide range of motions.

MagicDance : Objectives and Architecture

Given a reference image, either of a real human or a stylized image, the primary objective of the MagicDance framework is to generate an output image or video conditioned on that reference and on the pose inputs {P, F}, where P represents the human pose skeleton and F represents the facial landmarks. The generated output should preserve the appearance and identity of the people involved, along with the background contents of the reference image, while following the pose and expressions defined by the pose inputs.


During training, the MagicDance framework is trained on a frame reconstruction task: it reconstructs the ground truth frame from a reference image and a pose input sourced from the same reference video. During testing, to achieve motion transfer, the pose input and the reference image are sourced from different sources.
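The difference between the training and testing setups can be sketched in a few lines of Python. This is only an illustration of how the inputs are paired: the `Frame` container, the helper names, and the string placeholders standing in for images and pose data are all hypothetical and are not taken from the MagicDance code.

```python
import random
from dataclasses import dataclass


@dataclass
class Frame:
    """One video frame plus the pose inputs {P, F} extracted from it."""
    video_id: str
    index: int
    image: str      # placeholder for pixel data
    skeleton: str   # P: body pose skeleton
    landmarks: str  # F: facial landmarks


def make_video(video_id, num_frames):
    """Build a toy video whose frames carry placeholder image and pose data."""
    return [
        Frame(video_id, i, f"{video_id}/img{i}", f"{video_id}/P{i}", f"{video_id}/F{i}")
        for i in range(num_frames)
    ]


def training_example(video, rng):
    """Reconstruction setup: reference and target come from the SAME video."""
    reference, target = rng.sample(video, 2)
    return {
        "reference_image": reference.image,
        "pose": (target.skeleton, target.landmarks),
        "ground_truth": target.image,  # the frame the model must reconstruct
    }


def transfer_inputs(reference_video, driving_video):
    """Motion transfer setup: identity and motion come from DIFFERENT sources."""
    reference = reference_video[0]
    return [
        {"reference_image": reference.image, "pose": (f.skeleton, f.landmarks)}
        for f in driving_video
    ]
```

In the training case a ground truth frame exists, so a reconstruction loss can be computed. In the transfer case there is no ground truth, only one reference identity and a sequence of driving poses.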

The overall architecture of the MagicDance framework can be split into four parts: Preliminary Stage, Appearance Control Pretraining, Appearance-disentangled Pose Control, and Motion Module.

Preliminary Stage

Latent Diffusion Models (LDMs) are diffusion models designed to operate within the latent space provided by an autoencoder, and Stable Diffusion is a notable instance of an LDM that employs a Vector Quantized-Variational AutoEncoder and a temporal U-Net architecture. Stable Diffusion uses a CLIP-based transformer as a text encoder that converts text inputs into embeddings. During training, the model receives a text condition and an input image; the image is encoded into a latent representation and then subjected to a predefined sequence of diffusion steps that add Gaussian noise. The resulting noisy latent approaches a standard normal distribution, and the primary learning objective of Stable Diffusion is to iteratively denoise such noisy latents back into clean latent representations.
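The noising half of this process can be illustrated with the standard DDPM-style closed form, z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps. The sketch below is a minimal pure-Python version and makes several assumptions: the linear schedule, its start and end values, the 1000-step count, and the four-element toy latent are common illustrative choices rather than settings reported for MagicDance.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, evenly spaced (values are illustrative)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(latent, t, alpha_bars, rng):
    """Sample z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [signal * z + sigma * e for z, e in zip(latent, noise)]
    return noisy, noise  # a denoiser is typically trained to predict `noise`


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
z0 = [0.5, -1.0, 2.0, 0.0]  # stand-in for an encoded image latent
z_t, eps = add_noise(z0, t=999, alpha_bars=alpha_bars, rng=rng)
```

At the last step `alpha_bars[-1]` is close to zero, so `z_t` is almost pure Gaussian noise, which matches the statement that the noisy latent approaches a standard normal distribution.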


Appearance Control Pretraining

A major issue with the original ControlNet framework is its inability to control appearance consistently across spatially varying motions. It tends to generate images whose poses closely resemble those in the input image, but the overall appearance is influenced predominantly by textual inputs. Although this approach works, it is not suited to motion transfer tasks, where it is not the textual input but the reference image that serves as the primary source of appearance information.

The Appearance Control Pretraining module in the MagicDance framework is designed as an auxiliary branch that provides guidance for appearance control in a layer-by-layer manner. Rather than relying on text inputs, the module leverages the appearance attributes of the reference image, with the aim of enhancing the framework’s ability to generate appearance characteristics accurately, particularly in scenarios involving complex motion dynamics. Furthermore, only the Appearance Control Model is trainable during appearance control pretraining.
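One way to picture attention-based appearance guidance is to let the queries of a Stable Diffusion UNet layer attend both to that layer's own tokens and to tokens produced by the reference branch. The sketch below does this by concatenating keys and values. Note that the concatenation itself is an assumption made for illustration: the article states only that guidance is provided via attention, layer by layer, and the function names here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for small lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def multi_source_attention(unet_q, unet_k, unet_v, ref_k, ref_v):
    """Let UNet queries attend to their own tokens AND reference-branch tokens."""
    # Assumption: the two sources are merged by simple concatenation.
    return attention(unet_q, unet_k + ref_k, unet_v + ref_v)
```

With empty reference lists the function reduces to ordinary self-attention, which fits the description of the appearance branch as an auxiliary add-on rather than a replacement for the base model.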

Appearance-disentangled Pose Control

A naive solution for controlling the pose in the output image is to integrate the pre-trained ControlNet model with the pre-trained Appearance Control Model directly, without fine-tuning. However, this integration might leave the framework struggling with appearance-independent pose control, which can lead to a discrepancy between the input poses and the generated poses. To tackle this discrepancy, the MagicDance framework fine-tunes the Pose ControlNet model jointly with the pre-trained Appearance Control Model.
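The two training stages can be summarized as a small schedule that records which modules receive gradient updates. The batch sizes and step counts are the ones the article reports for pre-training. The module names, the `Stage` container, and the choice to list the Stable Diffusion UNet as frozen in both stages are illustrative assumptions, and the motion module is left out because the article gives no training details for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # modules that receive gradient updates
    batch_size: int
    steps: int


# Hypothetical module names for the three components discussed in the article.
MODULES = frozenset({"sd_unet", "appearance_control", "pose_controlnet"})

STAGES = [
    Stage("appearance_pretraining", frozenset({"appearance_control"}), 64, 10_000),
    Stage(
        "joint_finetuning",
        frozenset({"appearance_control", "pose_controlnet"}),
        16,
        20_000,
    ),
]


def frozen_modules(stage):
    """Modules whose weights stay fixed (or unused) during the given stage."""
    return MODULES - stage.trainable


def samples_seen(stage):
    """Total number of training samples processed in the stage."""
    return stage.batch_size * stage.steps
```

Under these numbers the first stage processes 640,000 samples and the second 320,000, so the joint fine-tuning runs for more steps but sees fewer samples overall.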

Motion Module

When working together, the Appearance-disentangled Pose ControlNet and the Appearance Control Model can achieve accurate and effective image-to-motion transfer, although the result might suffer from temporal inconsistency. To ensure temporal consistency, the framework integrates an additional motion module into the primary Stable Diffusion UNet architecture.

MagicDance : Pre-Training and Datasets

For pre-training, the MagicDance framework uses a TikTok dataset consisting of over 350 dance videos, each 10 to 15 seconds long and capturing a single person dancing, with the majority of these videos showing the face and upper body. The MagicDance framework extracts frames from each video at 30 FPS and runs OpenPose on each frame individually to infer the pose skeleton, hand poses, and facial landmarks.

The appearance control model is pre-trained with a batch size of 64 on 8 NVIDIA A100 GPUs for 10 thousand steps at an image size of 512 x 512, followed by joint fine-tuning of the pose control and appearance control models with a batch size of 16 for 20 thousand steps. During training, the MagicDance framework randomly samples two frames as the target and reference respectively, with both images cropped at the same position along the same height. During evaluation, the model crops the image centrally instead of randomly.
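The frame sampling and cropping rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the 1080 x 1920 frame size in the usage note is made up for the example, and the square window is one plausible reading of cropping at the same position.

```python
import random


def crop_window(width, height, crop_size, rng=None):
    """Return (left, top, right, bottom); random if rng is given, else centered."""
    max_left, max_top = width - crop_size, height - crop_size
    if max_left < 0 or max_top < 0:
        raise ValueError("crop_size exceeds the frame dimensions")
    if rng is None:  # evaluation: deterministic center crop
        left, top = max_left // 2, max_top // 2
    else:            # training: random position
        left, top = rng.randint(0, max_left), rng.randint(0, max_top)
    return (left, top, left + crop_size, top + crop_size)


def sample_training_pair(num_frames, width, height, crop_size, rng):
    """Pick two distinct frame indices and one crop window shared by both."""
    target_idx, reference_idx = rng.sample(range(num_frames), 2)
    window = crop_window(width, height, crop_size, rng)
    return {"target": target_idx, "reference": reference_idx, "crop": window}
```

For example, `sample_training_pair(300, 1080, 1920, 512, random.Random(0))` draws two different frames from a 300-frame clip and one 512 x 512 window applied to both, while `crop_window(1080, 1920, 512)` gives the centered window used at evaluation time. Sharing the window keeps the reference and target spatially aligned, so the model cannot be penalized for a shift introduced by the cropping itself.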


MagicDance : Results

The experimental results for the MagicDance framework are shown in the following image. As can be seen, the MagicDance framework outperforms existing frameworks such as Disco and DreamPose for human motion transfer across all metrics. Frameworks with a “*” in front of their name use the target image directly as the input and therefore have access to more information than the other frameworks.

It is interesting to note that the MagicDance framework attains a Face-Cos score of 0.426, an improvement of 156.62% over the Disco framework and a nearly 400% increase over the DreamPose framework. The results indicate the robust capacity of the MagicDance framework to preserve identity information, and the visible boost in performance indicates the superiority of the MagicDance framework over existing state-of-the-art methods.
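To make the percentages concrete, the sketch below shows the arithmetic behind a relative improvement and back-computes the baseline score that the reported gain implies. It also includes a plain cosine similarity, on the assumption (suggested by the metric's name, not stated in the article) that Face-Cos compares face embeddings of generated and ground truth frames. All function names are hypothetical, and the baseline value is derived from the article's figures rather than quoted from a results table.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


def implied_baseline(new, improvement_pct):
    """Baseline score implied by a reported percentage improvement."""
    return new / (1.0 + improvement_pct / 100.0)


# Back-computed from the reported 0.426 and 156.62%, not a quoted figure.
disco_score = implied_baseline(0.426, 156.62)
```

Here `disco_score` comes out at roughly 0.166. By the same arithmetic, a nearly 400% increase would put the DreamPose score in the neighborhood of 0.085, since a 400% gain means the new score is five times the baseline.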

The following figures compare the quality of human video generation between the MagicDance, Disco, and TPS frameworks. As can be observed, the results generated by the GT, Disco, and TPS frameworks suffer from inconsistent human pose identity and facial expressions.

Furthermore, the following image visualizes facial expression and human pose transfer on the TikTok dataset. The MagicDance framework generates realistic and vivid expressions and motions under diverse facial landmark and pose skeleton inputs while accurately preserving identity information from the reference input image.

It is worth noting that the MagicDance framework shows remarkable generalization to out-of-domain reference images of unseen poses and styles, with impressive appearance controllability even without any additional fine-tuning on the target domain. The results are shown in the following image.

The following images visualize the capabilities of the MagicDance framework in terms of facial expression transfer and zero-shot human motion. As can be seen, the MagicDance framework generalizes well to in-the-wild human motions.

MagicDance : Limitations

OpenPose is an integral component of the MagicDance framework because it plays a crucial role in pose control, significantly affecting the quality and temporal consistency of the generated images. However, the MagicDance framework still finds it somewhat challenging to detect facial landmarks and pose skeletons accurately, especially when the subjects in the images are only partially visible or move rapidly. These issues can result in artifacts in the generated image.


In this article, we have talked about MagicDance, a diffusion-based model that aims to revolutionize human motion transfer. The MagicDance framework transfers 2D human facial expressions and motions onto challenging human dance videos, with the specific goal of generating novel pose sequence-driven human dance videos for specific target identities while keeping the identity constant. The MagicDance framework uses a two-stage training strategy that disentangles human motion from appearance factors such as skin tone, facial expressions, and clothing.

MagicDance is a novel approach to realistic human video generation that incorporates facial and motion expression transfer and enables consistent in-the-wild animation generation without any further fine-tuning, a significant advance over existing methods. Furthermore, the MagicDance framework demonstrates exceptional generalization across complex motion sequences and diverse human identities, establishing it as a front-runner in the field of AI-assisted motion transfer and video generation.
