Durian: Dual Reference-guided Portrait Animation with Attribute Transfer
Abstract
Durian uses dual reference networks and a diffusion model to generate high-fidelity portrait animations with attribute transfer from a reference image to a target portrait in a zero-shot manner.
We present Durian, the first method for generating portrait animation videos with facial attribute transfer from a given reference image to a target portrait in a zero-shot manner. To enable high-fidelity and spatially consistent attribute transfer across frames, we introduce dual reference networks that inject spatial features from both the portrait and attribute images into the denoising process of a diffusion model. We train the model with a self-reconstruction formulation: two frames are sampled from the same portrait video, one serving as the attribute reference and the other as the target portrait, and the remaining frames are reconstructed conditioned on these inputs and their corresponding masks. To support the transfer of attributes with varying spatial extent, we propose a mask expansion strategy that uses keypoint-conditioned image generation during training. We further augment the attribute and portrait images with spatial and appearance-level transformations to improve robustness to positional misalignment between them. These strategies allow the model to generalize effectively across diverse attributes and in-the-wild reference combinations, despite being trained without explicit triplet supervision. Durian achieves state-of-the-art performance on portrait animation with attribute transfer, and notably, its dual reference design enables multi-attribute composition in a single generation pass without additional training.
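For readers who want a concrete picture of the self-reconstruction formulation, below is a minimal PyTorch-style sketch of how one training example could be assembled from a single portrait video. Everything here (the function name, tensor shapes, and masking convention) is an illustrative assumption rather than the authors' implementation, whose code has not yet been released (see the links under Community).

```python
import random
import torch

def sample_self_reconstruction_batch(video_frames, attribute_masks):
    """Illustrative sketch (not the authors' code) of the paper's
    self-reconstruction sampling: two frames from the same portrait video
    act as the two references, and the remaining frames become the
    reconstruction targets for the diffusion model.

    video_frames:    (T, C, H, W) frames of one portrait video
    attribute_masks: (T, 1, H, W) masks covering the attribute region
    """
    T = video_frames.shape[0]
    i, j = random.sample(range(T), 2)       # two distinct frame indices

    portrait_ref  = video_frames[i]         # target-portrait reference
    attribute_ref = video_frames[j]         # attribute reference
    attr_mask     = attribute_masks[j]      # mask for the attribute region

    # The remaining frames are what the model must reconstruct,
    # conditioned on the two references and their masks.
    target_idx = [t for t in range(T) if t not in (i, j)]
    targets = video_frames[target_idx]

    return {
        "portrait_ref": portrait_ref,
        "attribute_ref": attribute_ref,
        "attribute_mask": attr_mask,
        "targets": targets,
    }

if __name__ == "__main__":
    frames = torch.randn(16, 3, 512, 512)          # dummy 16-frame video
    masks = (torch.rand(16, 1, 512, 512) > 0.5).float()
    batch = sample_self_reconstruction_batch(frames, masks)
    print(batch["targets"].shape)                  # torch.Size([14, 3, 512, 512])
```

Because both references come from the same video, no attribute-transfer triplets are needed at training time; the dual reference networks learn to condition on the two inputs separately, which is what later permits swapping in an unrelated attribute image at inference.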
Community
More result videos can be found at https://hyunsoocha.github.io/durian/. The code will be released at https://github.com/snuvclab/durian.
The following similar papers were recommended by the Semantic Scholar API (an automated suggestion from Librarian Bot):
- X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention (2025)
- X-UniMotion: Animating Human Images with Expressive, Unified and Identity-Agnostic Motion Latents (2025)
- FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers (2025)
- RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer (2025)
- HairShifter: Consistent and High-Fidelity Video Hair Transfer via Anchor-Guided Animation (2025)
- Stable-Hair v2: Real-World Hair Transfer via Multiple-View Diffusion Model (2025)
- X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio (2025)