SIGGRAPH 2026 · ACM TOG · Project Page

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

1MMLab, HKUST · 2BUAA · 3NJU · 4CUHK · 5BAAI · 6FNii, CUHKSZ · 7Stanford · 8AIR, THU
Omni-directional Generation · Shared Multimodal Space · Diffusion Priors · 30 Tasks · 2 Domains

Abstract

Recent progress has shown that video diffusion models (VDMs) can be repurposed to solve various multimodal graphics tasks. However, existing approaches typically train separate models for each specific problem setting. This practice not only ignores the joint correlations across modalities, but also locks models into fixed input-output mappings, severely limiting their flexibility. In this paper, we present UniVidX, a unified multimodal framework designed to enable versatile video generation. Our goal is to (i) master diverse tasks by formulating them as conditional generation problems within a shared multimodal space, (ii) adapt to modality-specific distributions without compromising the backbone's native priors, and (iii) ensure cross-modal consistency during synthesis. Concretely, we propose three key designs: 1) Stochastic Condition Masking (SCM): by randomly partitioning modalities into clean conditions and noisy targets during training, we enable the model to learn omni-directional conditional generation rather than fixed mappings. 2) Decoupled Gated LoRA (DGL): we attach per-modality LoRAs and activate them only when a modality serves as a generation target, thereby preserving the VDM's strong priors. 3) Cross-Modal Self-Attention (CMSA): we explicitly share keys/values across modalities while maintaining modality-specific queries, facilitating information exchange and inter-modal alignment. We validate our framework by instantiating it in two domains: 1) UniVid-Intrinsic for RGB videos and their intrinsic maps (albedo, irradiance, normal), and 2) UniVid-Alpha for blended RGB videos and their constituent RGBA layers. Experimental results demonstrate that both models achieve performance competitive with state-of-the-art methods. Notably, they exhibit robust generalization in in-the-wild scenarios, even when trained on fewer than 1k videos.

Method

Method Diagram
Stochastic Condition Masking (SCM)
  • Randomly partitions each training sample's modalities into clean conditions and noisy targets.
  • Enables omni-directional conditional generation instead of one fixed input-output mapping (see the sketch below).
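
A minimal PyTorch sketch of the SCM partition during a training step. The helper name stochastic_condition_mask and the modality list are our own illustrative assumptions, not the released implementation:

import torch

def stochastic_condition_mask(modalities, generator=None):
    """Randomly split modality names into clean conditions and noisy targets."""
    mask = torch.randint(0, 2, (len(modalities),), generator=generator).bool()
    if not mask.any():  # keep at least one target so the step trains generation
        mask[torch.randint(0, len(modalities), (1,), generator=generator)] = True
    targets    = [m for m, t in zip(modalities, mask) if t]
    conditions = [m for m, t in zip(modalities, mask) if not t]
    return conditions, targets

# Example for UniVid-Intrinsic: sampled targets are noised and denoised by the
# VDM, while sampled conditions are fed in clean. Each draw of the mask thus
# trains a different conditional direction.
conds, tgts = stochastic_condition_mask(["rgb", "albedo", "irradiance", "normal"])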
Decoupled Gated LoRA (DGL)
  • Decoupled, per-modality LoRA adapters on the frozen backbone.
  • Gated ON when a modality is a generation target, OFF when it serves as a condition, preserving the VDM's native priors (see the sketch below).
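
One way the gating could be realized, sketched here under the assumption that LoRA branches wrap the backbone's linear projections; the class name and the rank/alpha defaults are illustrative choices, not the paper's:

import torch.nn as nn

class DecoupledGatedLoRA(nn.Module):
    """Frozen base projection plus one LoRA branch per modality.

    A branch is gated ON only when its modality is a generation target;
    condition streams pass through the untouched backbone, preserving
    the pretrained video-diffusion priors.
    """
    def __init__(self, base: nn.Linear, modalities, rank=16, alpha=16.0):
        super().__init__()
        self.base = base.requires_grad_(False)          # freeze VDM weights
        self.scale = alpha / rank
        self.down = nn.ModuleDict({m: nn.Linear(base.in_features, rank, bias=False)
                                   for m in modalities})
        self.up = nn.ModuleDict({m: nn.Linear(rank, base.out_features, bias=False)
                                 for m in modalities})
        for m in modalities:
            nn.init.zeros_(self.up[m].weight)           # LoRA starts as an identity residual

    def forward(self, x, modality: str, is_target: bool):
        out = self.base(x)
        if is_target:                                   # gate ON for targets only
            out = out + self.scale * self.up[modality](self.down[modality](x))
        return out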
Cross-Modal Self-Attention (CMSA)
  • Shares keys/values across modalities while keeping modality-specific queries.
  • Facilitates information exchange and inter-modal alignment (see the sketch below).
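
A sketch of the shared-key/value attention, assuming each modality keeps its own token stream laid out as (batch, heads, tokens, dim); the function name and layout are illustrative, and F.scaled_dot_product_attention requires PyTorch 2.0 or newer:

import torch
import torch.nn.functional as F

def cross_modal_self_attention(q, k, v):
    """q, k, v: dicts mapping modality name -> (batch, heads, tokens, dim).

    Keys/values from every modality are concatenated into one shared
    context; each modality attends to it with its own queries.
    """
    k_shared = torch.cat(list(k.values()), dim=2)       # pool tokens across modalities
    v_shared = torch.cat(list(v.values()), dim=2)
    return {m: F.scaled_dot_product_attention(qm, k_shared, v_shared)
            for m, qm in q.items()}

Since only keys/values are concatenated, each modality's output keeps its original token count, so the streams can still be decoded separately.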

Baseline Comparisons

UniVid-Intrinsic · Normal Estimation

Input RGB
RGB↔X
StableNormal
Lotus
DiffusionRenderer
NormalCrafter
Ouroboros
Ours

UniVid-Alpha · Video Matting

Input Blended RGB
RVM
MODNet
VMFormer
Ours

Citation

@article{chen2026unividx,
  title     = {UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors},
  author    = {Chen, Houyuan and Li, Hong and Kong, Xianghao and Zhu, Tianrui and Xu, Shaocong and Xiao, Weiqing and Guo, Yuwei and Ye, Chongjie and Zhang, Lvmin and Zhao, Hao and Rao, Anyi},
  journal   = {ACM Transactions on Graphics},
  volume    = {45},
  number    = {4},
  articleno = {51},
  year      = {2026},
  month     = jul,
  doi       = {10.1145/3811304},
  url       = {https://doi.org/10.1145/3811304}
}