---
license: apache-2.0
language: en
library_name: transformers
tags:
- d2f
- diffusion-llm
- text-generation
- llada
- lora
base_model: GSAI-ML/LLaDA-8B-Instruct
model_name: D2F_LLaDA_8B_Instruct_Lora
---

# D2F LoRA adapter for LLaDA-8B-Instruct

This repository contains the **LoRA adapter** for the `GSAI-ML/LLaDA-8B-Instruct` model, trained using the **Discrete Diffusion Forcing (D2F)** method.

This adapter allows the `LLaDA-8B-Instruct` diffusion LLM (dLLM) to achieve inference speeds significantly faster than both its original version and leading autoregressive (AR) models such as LLaMA3, while maintaining comparable output quality.

The D2F method and its results are detailed in the paper: **[D2F: Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing](https://arxiv.org/abs/2508.09192)**.

- **Official Code:** [D2F GitHub Repository](https://github.com/zhijie-group/Discrete-Diffusion-Forcing)
- **Demo Space:** [D2F-LLaDA-Instruct-8B](https://huggingface.co/spaces/zhijie3/D2F-LLaDA-Instruct-8B)

## Method: Discrete Diffusion Forcing (D2F)

Diffusion LLMs (dLLMs) have long promised ultra-fast parallel decoding, but this potential has historically been limited by two main bottlenecks:

1. **KV Cache Incompatibility:** Their bidirectional attention mechanism prevented the use of the key-value (KV) cache, a critical optimization in AR models.
2. **Strict Inter-Block Dependency:** Previous attempts at block-based generation required each block to be fully generated before starting the next, preventing true parallelism.

**D2F** solves these issues with a novel hybrid approach:

1. **Hybrid Architecture:** D2F reframes text generation as a block-autoregressive process (see the attention-mask sketch after this list).
    * **Within a block:** Attention remains **bidirectional** to capture rich local context.
    * **Between blocks:** Attention is made **causal**, allowing the model to be fully compatible with the standard **KV cache**.

2. **Pipelined Parallel Decoding:** D2F uses an efficient training and inference strategy (a toy decoding simulation also follows this list).
    * **Training:** It uses *Asymmetric Distillation*, where a D2F student model learns to mimic a powerful bidirectional teacher model, efficiently transferring the teacher's capabilities to the fast, cache-friendly architecture.
    * **Inference:** It enables a dynamic **pipelined parallel decoder**: new text blocks are added to the pipeline as soon as their predecessors are only partially complete, creating an asynchronous workflow that maximizes GPU utilization and dramatically boosts throughput.
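
To make the hybrid attention pattern in point 1 concrete, here is a minimal, illustrative PyTorch sketch of a block-causal attention mask (bidirectional within a block, causal across blocks). It is not code from the D2F repository; the function name, block size, and sequence length are assumptions for illustration.

```python
import torch

def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean mask where True means "this query may attend to this key".

    Positions attend bidirectionally to every position inside their own
    block, but only to earlier blocks across block boundaries, which is
    what lets completed blocks be served from a standard KV cache.
    """
    block_ids = torch.arange(seq_len) // block_size          # block index of each position
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)  # query block >= key block

# 8 tokens with block size 4: two fully bidirectional 4x4 blocks on the
# diagonal, and the second block additionally attends to all of the first.
print(block_causal_mask(8, 4).int())
```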
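
The pipelined decoder in point 2 can likewise be illustrated with a small, self-contained toy simulation (hypothetical function name and numbers; the real scheduler is in the official D2F repository). Every unfinished block is refined in parallel at each step, and a new block joins the pipeline as soon as the newest block is only partially complete:

```python
def pipelined_schedule(num_blocks: int = 4, block_size: int = 8,
                       tokens_per_step: int = 2, open_at: float = 0.5) -> int:
    """Count decoding steps under a D2F-style pipelined block schedule.

    `open_at` is the completion fraction at which the next block is allowed
    to join the pipeline; 1.0 reproduces a strict block-by-block schedule.
    """
    finished = [0]  # tokens finished in each block that has been opened so far
    steps = 0
    while len(finished) < num_blocks or any(f < block_size for f in finished):
        steps += 1
        # every unfinished block is refined in the same step (parallel denoising)
        finished = [min(block_size, f + tokens_per_step) for f in finished]
        # open the next block once the newest one is partially complete
        if len(finished) < num_blocks and finished[-1] >= open_at * block_size:
            finished.append(0)
    return steps

print("pipelined steps: ", pipelined_schedule(open_at=0.5))  # overlapped blocks
print("sequential steps:", pipelined_schedule(open_at=1.0))  # block-by-block baseline
```

With these toy numbers the pipelined schedule finishes in noticeably fewer steps than the sequential baseline, which is the effect D2F exploits at scale.
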
## How to Use

⚠️ **Important:** This is a LoRA adapter and requires the official D2F codebase for inference.

For detailed instructions and code, please refer to the official GitHub repository:

➡️ **https://github.com/zhijie-group/Discrete-Diffusion-Forcing** ⬅️
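
For orientation only, the snippet below is a minimal sketch of how the LoRA weights could be attached to the base model with `peft`, assuming the adapter is stored in standard PEFT format; the adapter repo id is a placeholder, and the `AutoModel`/`trust_remote_code` loading path follows the base model's card. This only loads weights: actual D2F inference (KV caching and pipelined parallel decoding) must go through the pipeline in the official repository above, not `model.generate()`.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel  # assumes the adapter is in standard PEFT format

base_id = "GSAI-ML/LLaDA-8B-Instruct"
adapter_id = "<this-repo-id>"  # placeholder: the Hub id of this LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
base_model = AutoModel.from_pretrained(
    base_id, trust_remote_code=True, torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base_model, adapter_id)  # attach the LoRA weights

# The combined model now carries the D2F adapter, but generation should use
# the D2F pipelined parallel decoder from the official codebase linked above.
```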