---
license: apache-2.0
language: en
library_name: transformers
tags:
- d2f
- diffusion-llm
- text-generation
- llada
- lora
base_model: GSAI-ML/LLaDA-8B-Instruct
model_name: D2F_LLaDA_8B_Instruct_Lora
---

# D2F LoRA adapter for LLaDA-8B-Instruct

This repository contains the **LoRA adapter** for the `GSAI-ML/LLaDA-8B-Instruct` model, trained using the **Discrete Diffusion Forcing (D2F)** method.

This adapter allows the `LLaDA-8B-Instruct` diffusion LLM (dLLM) to achieve inference speeds significantly faster than both its original version and leading autoregressive (AR) models such as LLaMA3, while maintaining comparable output quality.

The D2F method and its results are detailed in the paper: **[D2F: Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing](https://arxiv.org/abs/2508.09192)**.

- **Official Code:** [D2F GitHub Repository](https://github.com/zhijie-group/Discrete-Diffusion-Forcing)
- **Demo Space:** [D2F-LLaDA-Instruct-8B](https://huggingface.co/spaces/zhijie3/D2F-LLaDA-Instruct-8B)

## Method: Discrete Diffusion Forcing (D2F)

Diffusion LLMs (dLLMs) have long promised ultra-fast parallel decoding, but this potential has historically been limited by two main bottlenecks:

1. **KV Cache Incompatibility:** Their bidirectional attention mechanism prevented the use of the key-value (KV) cache, a critical optimization in AR models.
2. **Strict Inter-Block Dependency:** Previous attempts at block-based generation required each block to be fully generated before starting the next, preventing true parallelism.

**D2F** solves these issues with a novel hybrid approach:

1. **Hybrid Architecture:** D2F reframes text generation as a block-autoregressive process (see the attention-mask sketch after this list).
    * **Within a block:** Attention remains **bidirectional** to capture rich local context.
    * **Between blocks:** Attention is made **causal**, allowing the model to be fully compatible with the standard **KV cache**.

2. **Pipelined Parallel Decoding:** D2F uses an efficient training and inference strategy (a toy decoding simulation also follows this list).
    * **Training:** It uses *Asymmetric Distillation*, where a D2F student model learns to mimic a powerful bidirectional teacher model, efficiently transferring the teacher's capabilities to the fast, cache-friendly architecture.
    * **Inference:** It enables a dynamic **pipelined parallel decoder**: new text blocks are added to the pipeline as soon as their predecessors are only partially complete, creating an asynchronous workflow that maximizes GPU utilization and dramatically boosts throughput.
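
To make the hybrid attention pattern in point 1 concrete, here is a minimal, illustrative PyTorch sketch of a block-causal attention mask (bidirectional within a block, causal across blocks). It is not code from the D2F repository; the function name, block size, and sequence length are assumptions for illustration.

```python
import torch

def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean mask where True means "this query may attend to this key".

    Positions attend bidirectionally to every position inside their own
    block, but only to earlier blocks across block boundaries, which is
    what lets completed blocks be served from a standard KV cache.
    """
    block_ids = torch.arange(seq_len) // block_size          # block index of each position
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)  # query block >= key block

# 8 tokens with block size 4: two fully bidirectional 4x4 blocks on the
# diagonal, and the second block additionally attends to all of the first.
print(block_causal_mask(8, 4).int())
```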
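
The pipelined decoder in point 2 can likewise be illustrated with a small, self-contained toy simulation (hypothetical function name and numbers; the real scheduler is in the official D2F repository). Every unfinished block is refined in parallel at each step, and a new block joins the pipeline as soon as the newest block is only partially complete:

```python
def pipelined_schedule(num_blocks: int = 4, block_size: int = 8,
                       tokens_per_step: int = 2, open_at: float = 0.5) -> int:
    """Count decoding steps under a D2F-style pipelined block schedule.

    `open_at` is the completion fraction at which the next block is allowed
    to join the pipeline; 1.0 reproduces a strict block-by-block schedule.
    """
    finished = [0]  # tokens finished in each block that has been opened so far
    steps = 0
    while len(finished) < num_blocks or any(f < block_size for f in finished):
        steps += 1
        # every unfinished block is refined in the same step (parallel denoising)
        finished = [min(block_size, f + tokens_per_step) for f in finished]
        # open the next block once the newest one is partially complete
        if len(finished) < num_blocks and finished[-1] >= open_at * block_size:
            finished.append(0)
    return steps

print("pipelined steps: ", pipelined_schedule(open_at=0.5))  # overlapped blocks
print("sequential steps:", pipelined_schedule(open_at=1.0))  # block-by-block baseline
```

With these toy numbers the pipelined schedule finishes in noticeably fewer steps than the sequential baseline, which is the effect D2F exploits at scale.
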
## How to Use

⚠️ **Important:** This is a LoRA adapter and requires the official D2F codebase for inference.

For detailed instructions and code, please refer to the official GitHub repository:

➡️ **https://github.com/zhijie-group/Discrete-Diffusion-Forcing** ⬅️
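
For orientation only, the snippet below is a minimal sketch of how the LoRA weights could be attached to the base model with `peft`, assuming the adapter is stored in standard PEFT format; the adapter repo id is a placeholder, and the `AutoModel`/`trust_remote_code` loading path follows the base model's card. This only loads weights: actual D2F inference (KV caching and pipelined parallel decoding) must go through the pipeline in the official repository above, not `model.generate()`.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel  # assumes the adapter is in standard PEFT format

base_id = "GSAI-ML/LLaDA-8B-Instruct"
adapter_id = "<this-repo-id>"  # placeholder: the Hub id of this LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
base_model = AutoModel.from_pretrained(
    base_id, trust_remote_code=True, torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base_model, adapter_id)  # attach the LoRA weights

# The combined model now carries the D2F adapter, but generation should use
# the D2F pipelined parallel decoder from the official codebase linked above.
```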