Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

[📑 Technical Report (Coming Soon)]   [💜 Project Page (Demo & Benchmark)]   [🤗 Model ]

¹Shanghai AI Laboratory, ²Shanghai Innovation Institute, ³Shanghai Jiao Tong University

⁴Nanjing University, ⁵The University of Sydney

⁶The Chinese University of Hong Kong, ⁷Tsinghua University

📚 Introduction

We introduce Lumina-DiMOO, an omni foundation model for seamless multimodal generation and understanding. Lumina-DiMOO is distinguished by four key innovations:

  • Unified Discrete Diffusion Architecture: Lumina-DiMOO sets itself apart from prior unified models by employing fully discrete diffusion modeling to handle inputs and outputs across modalities (a minimal sampling sketch follows this list).

  • Versatile Multimodal Capabilities: Lumina-DiMOO supports a broad spectrum of multimodal tasks, including text-to-image generation (at arbitrary and high resolutions), image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), and advanced image understanding.

  • Higher Sampling Efficiency: Compared with previous AR or hybrid AR-diffusion paradigms, Lumina-DiMOO demonstrates remarkable sampling efficiency. In addition, we design a bespoke caching method that further accelerates sampling by 2x (illustrated under Sampling Speed Analysis below).

  • Superior Performance: Lumina-DiMOO achieves state-of-the-art results on multiple benchmarks, surpassing existing open-source unified multimodal models and setting a new standard in the field.
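
The fully discrete (masked) diffusion paradigm can be pictured with the minimal sketch below: generation starts from a fully masked token sequence and, over a fixed number of steps, the model keeps its highest-confidence predictions until every position is decoded. The stand-in `denoiser`, the codebook size, and the linear unmasking schedule are illustrative assumptions, not Lumina-DiMOO's actual implementation.

```python
# Minimal, self-contained sketch of masked discrete diffusion sampling.
# The random "denoiser" replaces the real network, so only the control flow matters.
import torch

VOCAB_SIZE = 8192      # assumed codebook size, for illustration only
MASK_ID = VOCAB_SIZE   # dedicated [MASK] token id
SEQ_LEN = 1024         # e.g. a 32x32 grid of image tokens
NUM_STEPS = 64         # matches the image-generation setting reported below

def denoiser(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the diffusion transformer: per-token logits over the codebook."""
    return torch.randn(tokens.shape[0], tokens.shape[1], VOCAB_SIZE)

@torch.no_grad()
def sample(batch: int = 1) -> torch.Tensor:
    # Start from a fully masked token sequence.
    tokens = torch.full((batch, SEQ_LEN), MASK_ID, dtype=torch.long)
    for step in range(NUM_STEPS):
        logits = denoiser(tokens)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Only positions that are still masked may be revealed.
        conf = conf.masked_fill(tokens != MASK_ID, -1.0)
        # Linear schedule: reveal an equal share of the remaining masks each step.
        remaining = (tokens == MASK_ID).sum(dim=-1)
        k = torch.clamp(remaining // (NUM_STEPS - step), min=1)
        for b in range(batch):
            idx = conf[b].topk(int(k[b])).indices
            tokens[b, idx] = pred[b, idx]
    return tokens

print(sample().shape)  # torch.Size([1, 1024])
```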

📽️ Qualitative Results

Here we present generation results compared against other models. For additional visualizations, please see our Project Page.

Text-to-Image Comparison
Image Editing Comparison
Controllable & Subject-Driven Generation Comparison
Image Inpainting & Extrapolation

📊 Quantitative Performance

GenEval Benchmark
DPG Benchmark
OneIG-EN Benchmark
TIIF Benchmark
Image-to-Image Benchmark
Image Understanding Benchmark

🚀 Sampling Speed Analysis

  • Unlike image generation, where all image tokens are decoded within a single global decoding process, text generation is performed block by block, so its speed depends on both the number of blocks and the number of sampling steps. As a result, the speed-up for image understanding is less pronounced than for image generation (see the sketch after this list).

  • Lumina-DiMOO Settings: For image generation, we use 64 sampling steps. For image understanding, we set the block length to 256 and the number of sampling steps to 128.
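
Under these settings, the sketch below illustrates block-wise (semi-autoregressive) masked decoding for text: the answer is split into blocks, each block is unmasked over its share of the step budget, and every step still requires a full forward pass over prompt and answer. The helper names, the stand-in denoiser, and the way the step budget is split across blocks are assumptions for illustration, not the actual Lumina-DiMOO code.

```python
# Sketch of block-wise masked text decoding; compare with the single global
# decoding process over all image tokens shown earlier.
import torch

VOCAB_SIZE = 32000     # assumed text vocabulary size, for illustration
MASK_ID = VOCAB_SIZE   # dedicated [MASK] id
GEN_LEN = 256          # generated answer length (assumed)
BLOCK_LEN = 256        # block length for understanding (reported above)
TOTAL_STEPS = 128      # sampling steps for understanding (reported above)

def denoiser(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the real model: per-token logits."""
    return torch.randn(tokens.shape[0], tokens.shape[1], VOCAB_SIZE)

@torch.no_grad()
def decode_text(prompt: torch.Tensor) -> torch.Tensor:
    answer = torch.full((1, GEN_LEN), MASK_ID, dtype=torch.long)
    num_blocks = GEN_LEN // BLOCK_LEN
    steps_per_block = TOTAL_STEPS // num_blocks
    for blk in range(num_blocks):
        lo, hi = blk * BLOCK_LEN, (blk + 1) * BLOCK_LEN
        for step in range(steps_per_block):
            # Every step is a full forward pass over prompt + answer, so the
            # total cost scales with num_blocks * steps_per_block.
            logits = denoiser(torch.cat([prompt, answer], dim=1))
            block_logits = logits[:, prompt.shape[1] + lo : prompt.shape[1] + hi]
            conf, pred = block_logits.softmax(dim=-1).max(dim=-1)
            block = answer[:, lo:hi]                      # view into `answer`
            conf = conf.masked_fill(block != MASK_ID, -1.0)
            k = max(int((block == MASK_ID).sum()) // (steps_per_block - step), 1)
            idx = conf[0].topk(k).indices
            block[0, idx] = pred[0, idx]                  # unmask top-confidence tokens
    return answer

out = decode_text(torch.randint(0, VOCAB_SIZE, (1, 32)))
print(out.shape)  # torch.Size([1, 256])
```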

Sampling Speed Comparison
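
This card does not detail the bespoke caching method mentioned in the introduction. As a rough illustration of how caching can approximately halve per-step cost in a multi-step sampler, the sketch below recomputes the expensive transformer features only every other step and reuses them in between; this is a generic assumption, not Lumina-DiMOO's actual caching algorithm.

```python
# Generic feature-caching sketch for a multi-step sampler (illustrative only).
import torch

HIDDEN = 64            # assumed feature width, for illustration
VOCAB_SIZE = 8192      # assumed codebook size, for illustration

def expensive_features(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for costly transformer activations (the part worth caching)."""
    return torch.randn(tokens.shape[0], tokens.shape[1], HIDDEN)

def cheap_head(feats: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
    """Stand-in for a light projection from features to logits."""
    return feats @ proj

@torch.no_grad()
def cached_sampler(tokens: torch.Tensor, num_steps: int = 64, refresh_every: int = 2):
    proj = torch.randn(HIDDEN, VOCAB_SIZE)
    cache = None
    full_passes = 0
    for step in range(num_steps):
        if cache is None or step % refresh_every == 0:
            cache = expensive_features(tokens)   # full forward pass
            full_passes += 1
        logits = cheap_head(cache, proj)         # otherwise reuse cached features
        # ...token unmasking/updates would go here, as in the earlier sketches.
    return logits, full_passes

_, passes = cached_sampler(torch.zeros(1, 16, dtype=torch.long))
print(passes)  # 32 full passes instead of 64 -> roughly 2x fewer heavy computations
```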

💬 Discussion

You can reach us with this WeChat QR code!


📜 Acknowledgements

This work was also supported by MindSpeed MM, an open-source training framework for large-scale multimodal models, developed and maintained by Huawei's Computing Product Line. Optimized specifically for Huawei's Ascend AI chips, MindSpeed MM offers comprehensive support for distributed training and is tailored to a wide range of multimodal tasks.

📖 BibTeX

```bibtex
@misc{lumina-dimoo,
      title={Lumina-DiMOO: A Unified Masked Diffusion Model for Multi-Modal Generation and Understanding},
      author={Alpha VLLM Team},
      year={2025},
      url={https://github.com/Alpha-VLLM/Lumina-DiMOO},
}
```