SIGGRAPH 2026 · ACM TOG · Project Page

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

1MMLab, HKUST · 2BUAA · 3NJU · 4CUHK · 5BAAI · 6FNii, CUHKSZ · 7Stanford · 8AIR, THU
Omni-directional Generation · Shared Multimodal Space · Diffusion Priors · 30 Tasks · 2 Domains

Abstract

Recent progress has shown that video diffusion models (VDMs) can be repurposed to solve various multimodal graphics tasks. However, existing approaches typically train separate models for each specific problem setting. This practice not only ignores the joint correlations across modalities, but also locks models into fixed input-output mappings, severely limiting their flexibility. In this paper, we present UniVidX, a unified multimodal framework designed to enable versatile video generation. Our goal is to (i) master diverse tasks by formulating them as conditional generation problems within a shared multimodal space, (ii) adapt to modality-specific distributions without compromising the backbone's native priors, and (iii) ensure cross-modal consistency during synthesis. Concretely, we propose three key designs: 1) Stochastic Condition Masking (SCM): by randomly partitioning modalities into clean conditions and noisy targets during training, we enable the model to learn omni-directional conditional generation rather than fixed mappings. 2) Decoupled Gated LoRA (DGL): we attach per-modality LoRAs and activate them only when a modality serves as a generation target, thereby preserving the VDM's strong priors. 3) Cross-Modal Self-Attention (CMSA): we explicitly share keys/values across modalities while maintaining modality-specific queries, facilitating information exchange and inter-modal alignment. We validate our framework by instantiating it in two domains: 1) UniVid-Intrinsic for RGB videos and their intrinsic maps (albedo, irradiance, normal), and 2) UniVid-Alpha for blended RGB videos and their constituent RGBA layers. Experimental results demonstrate that both models achieve performance competitive with state-of-the-art methods. Notably, they exhibit robust generalization in in-the-wild scenarios, even when trained on fewer than 1k videos.

Method

Method Diagram
Stochastic Condition Masking (SCM)
  • Randomly partitions each training sample's modalities into clean conditions and noisy targets.
  • Enables omni-directional conditional generation instead of one fixed input-output mapping (see the sketch below).
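
A minimal PyTorch sketch of the SCM partition during a training step. The helper name stochastic_condition_mask and the modality list are our own illustrative assumptions, not the released implementation:

import torch

def stochastic_condition_mask(modalities, generator=None):
    """Randomly split modality names into clean conditions and noisy targets."""
    mask = torch.randint(0, 2, (len(modalities),), generator=generator).bool()
    if not mask.any():  # keep at least one target so the step trains generation
        mask[torch.randint(0, len(modalities), (1,), generator=generator)] = True
    targets    = [m for m, t in zip(modalities, mask) if t]
    conditions = [m for m, t in zip(modalities, mask) if not t]
    return conditions, targets

# Example for UniVid-Intrinsic: sampled targets are noised and denoised by the
# VDM, while sampled conditions are fed in clean. Each draw of the mask thus
# trains a different conditional direction.
conds, tgts = stochastic_condition_mask(["rgb", "albedo", "irradiance", "normal"])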
Decoupled Gated LoRA (DGL)
  • Decoupled, per-modality LoRA adapters on the frozen backbone.
  • Gated ON when a modality is a generation target, OFF when it serves as a condition, preserving the VDM's native priors (see the sketch below).
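
One way the gating could be realized, sketched here under the assumption that LoRA branches wrap the backbone's linear projections; the class name and the rank/alpha defaults are illustrative choices, not the paper's:

import torch.nn as nn

class DecoupledGatedLoRA(nn.Module):
    """Frozen base projection plus one LoRA branch per modality.

    A branch is gated ON only when its modality is a generation target;
    condition streams pass through the untouched backbone, preserving
    the pretrained video-diffusion priors.
    """
    def __init__(self, base: nn.Linear, modalities, rank=16, alpha=16.0):
        super().__init__()
        self.base = base.requires_grad_(False)          # freeze VDM weights
        self.scale = alpha / rank
        self.down = nn.ModuleDict({m: nn.Linear(base.in_features, rank, bias=False)
                                   for m in modalities})
        self.up = nn.ModuleDict({m: nn.Linear(rank, base.out_features, bias=False)
                                 for m in modalities})
        for m in modalities:
            nn.init.zeros_(self.up[m].weight)           # LoRA starts as an identity residual

    def forward(self, x, modality: str, is_target: bool):
        out = self.base(x)
        if is_target:                                   # gate ON for targets only
            out = out + self.scale * self.up[modality](self.down[modality](x))
        return out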
Cross-Modal Self-Attention (CMSA)
  • Shares keys/values across modalities while keeping modality-specific queries.
  • Facilitates information exchange and inter-modal alignment (see the sketch below).
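
A sketch of the shared-key/value attention, assuming each modality keeps its own token stream laid out as (batch, heads, tokens, dim); the function name and layout are illustrative, and F.scaled_dot_product_attention requires PyTorch 2.0 or newer:

import torch
import torch.nn.functional as F

def cross_modal_self_attention(q, k, v):
    """q, k, v: dicts mapping modality name -> (batch, heads, tokens, dim).

    Keys/values from every modality are concatenated into one shared
    context; each modality attends to it with its own queries.
    """
    k_shared = torch.cat(list(k.values()), dim=2)       # pool tokens across modalities
    v_shared = torch.cat(list(v.values()), dim=2)
    return {m: F.scaled_dot_product_attention(qm, k_shared, v_shared)
            for m, qm in q.items()}

Since only keys/values are concatenated, each modality's output keeps its original token count, so the streams can still be decoded separately.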

Baseline Comparisons

UniVid-Intrinsic · Normal Estimation

Input RGB
RGB↔X
StableNormal
Lotus
DiffusionRenderer
NormalCrafter
Ouroboros
Ours

UniVid-Alpha · Video Matting

Input Blended RGB
RVM
MODNet
VMFormer
Ours

Citation

@article{chen2026unividx,
  title     = {UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors},
  author    = {Chen, Houyuan and Li, Hong and Kong, Xianghao and Zhu, Tianrui and Xu, Shaocong and Xiao, Weiqing and Guo, Yuwei and Ye, Chongjie and Zhang, Lvmin and Zhao, Hao and Rao, Anyi},
  journal   = {ACM Transactions on Graphics},
  volume    = {45},
  number    = {4},
  articleno = {51},
  year      = {2026},
  month     = jul,
  doi       = {10.1145/3811304},
  url       = {https://doi.org/10.1145/3811304}
}