Overwrite with converted Qwen2.5-3B model files
- README.md +222 -71
- VLV_stage1.py +257 -0
- VLV_stage2.py +460 -0
- build.py +78 -0
- config.json +13 -9
- configuration_vlv.py +172 -0
- model-00001-of-00005.safetensors +3 -0
- model-00002-of-00005.safetensors +3 -0
- model-00003-of-00005.safetensors +3 -0
- model-00004-of-00005.safetensors +3 -0
- model-00005-of-00005.safetensors +3 -0
- model.safetensors.index.json +0 -0
- modeling_clip.py +60 -3
- vlv_utils.py +71 -0
README.md
CHANGED
@@ -1,104 +1,255 @@
The previous README is replaced wholesale. It carried the same `license: apache-2.0` / `pipeline_tag: image-to-text` front matter; badges linking to the arXiv paper (https://arxiv.org/abs/2507.07104), the GitHub repository (https://github.com/Tiezheng11/Vision-Language-Vision), the Hugging Face model (https://huggingface.co/lambertxiao/Vision-Language-Vision-Captioner-Qwen2.5-3B), and the dataset (https://huggingface.co/datasets/ccvl/LAION-High-Qualtiy-Pro-6M-VLV); an installation step (`pip install -r requirements.txt`); a Python quick-start that loaded the model with `AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True, low_cpu_mem_usage=False)`, trimmed dangling sentence fragments from generated captions, and accepted NumPy arrays (uint8 [0, 255] or float [0, 1]) as input; an example that fetched an image from a URL with `requests` and PIL; and a BibTeX stub. The new README follows.
---
license: apache-2.0
tags:
- image-captioning
- multimodal
- vision-language
- diffusion
- pytorch
- transformers
library_name: transformers
pipeline_tag: image-to-text
datasets:
- conceptual_captions
- coco
model_type: VLV_decoder
---

# VLV Captioner Model

This is a VLV (Vision-Language-Vision) model for image captioning. It combines a Stable Diffusion-based image encoder with a Qwen language model to generate descriptive captions from images.

## Model Description

The VLV Captioner is a multimodal model that:
- Uses a diffusion-based vision encoder to extract image features
- Employs the Qwen2.5-3B language model for text generation
- Generates natural language descriptions of input images

## Model Architecture

- **Vision Encoder**: Stable Diffusion-based image encoder with Florence2 components
- **Language Model**: Qwen2.5-3B transformer model
- **Image Size**: 384x384 pixels
- **Max Caption Length**: 300 tokens
- **Precision**: Mixed precision (bfloat16/float32)

## Usage

### Method 1: Load from Hugging Face Hub

```python
from transformers import AutoModel, AutoConfig
from PIL import Image
import torch
import os

# Optional: set a custom cache directory if needed
cache_dir = "/path/to/your/cache"  # use a directory with sufficient space
os.makedirs(cache_dir, exist_ok=True)

# Load the model with an authentication token (if required)
token = os.getenv('HUGGINGFACE_TOKEN')  # or your token string

print("Loading config...")
config = AutoConfig.from_pretrained(
    "your-username/vlv-captioner",
    trust_remote_code=True,
    token=token,
    cache_dir=cache_dir
)

print("Loading model...")
try:
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner",
        trust_remote_code=True,
        token=token,
        cache_dir=cache_dir,
        torch_dtype=torch.float32,  # specify dtype explicitly
        low_cpu_mem_usage=True
        # Note: avoid device_map="auto" to prevent meta tensor issues
    )
    print("Model loaded successfully!")

    # Load and process an image
    image = Image.open("path/to/your/image.jpg")

    # Move model to GPU if available
    if torch.cuda.is_available():
        model = model.to('cuda')
        print("Model moved to GPU!")

    # Generate caption
    print("Generating caption...")
    with torch.no_grad():
        captions = model([image], max_length=300)

    # Handle different possible output formats
    if hasattr(captions, 'generated_text'):
        print("Generated caption:", captions.generated_text[0])
    elif isinstance(captions, list):
        print("Generated caption:", captions[0])
    else:
        print("Generated caption:", captions)

except Exception as e:
    print(f"Error during model loading or inference: {e}")
    # If cached files are corrupted, clear the cache and redownload
    import shutil
    cache_path = f"{cache_dir}/modules/transformers_modules/your-username/vlv-captioner"
    if os.path.exists(cache_path):
        print(f"Clearing cache at {cache_path}")
        shutil.rmtree(cache_path)

    # Retry with force download
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner",
        trust_remote_code=True,
        token=token,
        cache_dir=cache_dir,
        force_download=True,
        torch_dtype=torch.float32
    )
```

### Method 2: Load from original checkpoint

```python
import torch
from PIL import Image
from VLV_stage2 import VLV_MODEL

# Load from the original .pt checkpoint file
model = VLV_MODEL.from_checkpoint("path/to/model.pt")

# Load and process an image
image = Image.open("path/to/your/image.jpg")

# Generate caption
with torch.no_grad():
    captions = model([image], max_length=300)
    print(captions.generated_text[0])  # generated caption
```

## Model Details

- **Model Type**: Vision-Language Model
- **Architecture**: VLV_decoder
- **Language Backbone**: Qwen/Qwen2.5-3B
- **Vision Backbone**: Stable Diffusion + Florence2
- **Training Data**: Various image-caption datasets
- **Framework**: PyTorch, Transformers

## Training Configuration

- **Batch Size**: 1 (inference)
- **Learnable Token Length**: 77
- **Guidance Scale**: 7.5
- **Inference Steps**: 50
- **Beam Search**: 4 beams

## Requirements

```bash
pip install torch transformers safetensors torchvision pillow diffusers
```

## Troubleshooting

### Common Issues and Solutions

#### 1. Meta Tensor Issues
If you encounter meta tensor errors, avoid `device_map="auto"` when loading the model:

```python
# ❌ Don't use this - it can cause meta tensor issues
model = AutoModel.from_pretrained("model-name", device_map="auto")

# ✅ Use this instead
model = AutoModel.from_pretrained("model-name", torch_dtype=torch.float32, low_cpu_mem_usage=True)
if torch.cuda.is_available():
    model = model.to('cuda')
```

#### 2. Cache Issues
If you run into corrupted cache files, clear the cache and redownload:

```python
import shutil
import os

cache_dir = "/your/cache/directory"
cache_path = f"{cache_dir}/modules/transformers_modules/your-username/model-name"
if os.path.exists(cache_path):
    shutil.rmtree(cache_path)

# Then reload with force_download=True
model = AutoModel.from_pretrained("model-name", force_download=True)
```

#### 3. Authentication Issues
Make sure your Hugging Face token is properly set:

```bash
# Option 1: environment variable
export HUGGINGFACE_TOKEN="your_token_here"

# Option 2: Hugging Face CLI login
huggingface-cli login
```

#### 4. Memory Issues
For large models, use a custom cache directory with sufficient space:

```python
cache_dir = "/path/to/large/storage"
os.makedirs(cache_dir, exist_ok=True)
model = AutoModel.from_pretrained("model-name", cache_dir=cache_dir, low_cpu_mem_usage=True)
```

## Advanced Usage

### Batch Processing with the Original Inference Script

For large-scale inference, you can use the inference script from the training repository:

```bash
python Caption_inference.py \
    --input_path /path/to/images \
    --output_path captions.json \
    --clip_decoder_checkpoint /path/to/model.pt \
    --qwen_model Qwen/Qwen2.5-3B \
    --stable_diffusion_model_path stabilityai/stable-diffusion-2-1-base \
    --florence2_model_path microsoft/Florence-2-large \
    --batch_size 4 \
    --max_length 300 \
    --num_beams 4 \
    --image_size 384 \
    --guidance_scale 7.5 \
    --use_text_encoder \
    --distributed  # for multi-GPU inference
```

### Configuration Parameters

- `image_size`: Input image resolution (default: 384)
- `guidance_scale`: Diffusion guidance scale (default: 7.5)
- `learnable_token_length`: Number of vision tokens (default: 77)
- `max_length`: Maximum caption length (default: 300)
- `num_beams`: Beam search width (default: 4)
- `use_text_encoder`: Enable the CLIP text encoder (recommended: True)

## Citation

```bibtex
@article{vlv_autoencoder,
  title={Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models},
  author={Zhang, Tiezheng and Li, Yitong and Chou, Yu-Cheng and Chen, Jieneng and Yuille, Alan L. and Wei, Chen and Xiao, Junfei},
  journal={arXiv preprint arXiv:2507.07104},
  year={2025}
}
```

## License

This model is released under the Apache 2.0 license.
VLV_stage1.py
ADDED
@@ -0,0 +1,257 @@
import os
import torch
import torch.nn as nn
from typing import Optional
from dataclasses import dataclass
from transformers.utils import ModelOutput
from transformers.modeling_utils import PreTrainedModel
from transformers.configuration_utils import PretrainedConfig
from .build import load_sd_model, load_Florence2_model
from .vlv_utils import initiate_time_steps, normalize


class SDConfig(PretrainedConfig):
    """Configuration class for SDModel."""
    model_type = "sd"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)


class MLP(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, output_dim),
            nn.GELU(),
            nn.Linear(output_dim, output_dim),
        )

    def forward(self, x):
        return self.layers(x)


@dataclass
class SDOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None


class SDModel(PreTrainedModel):
    config_class = SDConfig

    def __init__(
        self,
        config=None,
        training_args=None,
    ):
        if config is None:
            config = SDConfig()
        super().__init__(config)
        self.training_args = training_args
        if self.training_args.fp32:
            self._dtype = torch.float32
        else:
            self._dtype = torch.bfloat16
        self._device = torch.device(self.training_args.device if hasattr(self.training_args, 'device') else "cuda" if torch.cuda.is_available() else "cpu")

        self.vae, self.tokenizer, self.text_encoder, self.unet, self.scheduler = load_sd_model(training_args)
        torch.cuda.empty_cache()
        self.unet.eval()
        self.text_encoder.eval()
        self.model, self.processor = load_Florence2_model(training_args)

        self.unet = self.unet.to(self._dtype).to(device=self._device)
        self.text_encoder = self.text_encoder.to(self._dtype).to_empty(device=self._device)
        self.model = self.model.to(self._dtype).to_empty(device=self._device)
        self.vae = self.vae.to(torch.float32).to_empty(device=self._device)

        self.batch_size = self.training_args.batch_size

        hidden_dim = 1024
        self.language_proj = nn.Sequential(
            nn.Linear(1024, hidden_dim, dtype=self._dtype),
            nn.GELU(),
            nn.Linear(hidden_dim, 1024, dtype=self._dtype)
        ).to_empty(device=self._device)
        for param in self.language_proj.parameters():
            param.requires_grad = True

        self.num_queries = self.training_args.learnable_token_length
        self.query_embed = nn.Parameter(torch.randn(1, self.num_queries, 1024, dtype=self._dtype))
        self.query_embed.requires_grad = True

        self.unet.enable_gradient_checkpointing()

    def _unet_pred_noise(self, x_start, t, noise, context):
        t = t.to(dtype=torch.long)

        dtype = self.unet.dtype
        x_start = x_start.to(dtype)
        noise = noise.to(dtype)
        context = context.to(dtype)

        nt = t.shape[0]
        noised_latent = self.scheduler.add_noise(x_start, noise, t)

        pred_noise = self.unet(
            noised_latent,
            t,
            encoder_hidden_states=context.expand(nt, -1, -1)
        ).sample

        return pred_noise

    def generate_images(self, images):
        batch_size = self.training_args.eval_batch_size
        prompt = ["<MORE_DETAILED_CAPTION>"] * batch_size
        inputs = self.processor(text=prompt, images=images, return_tensors="pt").to(self._device).to(self._dtype)

        if inputs["input_ids"] is not None:
            inputs_embeds = self.model.language_model.get_input_embeddings()(inputs["input_ids"]).to(self._dtype)
        if inputs["pixel_values"] is not None:
            image_features = self.model._encode_image(inputs["pixel_values"]).to(self._dtype)
            inputs_embeds, attention_mask = self.model._merge_input_ids_with_image_features(image_features, inputs_embeds)
        if inputs_embeds is not None:
            attention_mask = attention_mask.to(inputs_embeds.dtype)
            encoder_outputs = self.model.language_model.model.encoder(
                inputs_embeds=inputs_embeds,
                attention_mask=attention_mask,
                output_hidden_states=True,
                return_dict=True
            )

        decoder_input_embeds = self.query_embed.expand(batch_size, -1, -1)
        decoder_attention_mask = torch.ones(
            (batch_size, self.num_queries),
            dtype=self._dtype,
            device=self._device
        )

        encoder_hidden_states = encoder_outputs.last_hidden_state.to(self._dtype)
        decoder_input_embeds = decoder_input_embeds.to(self._dtype)
        attention_mask = attention_mask.to(self._dtype)

        decoder_outputs = self.model.language_model.model.decoder(
            inputs_embeds=decoder_input_embeds,
            attention_mask=decoder_attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=attention_mask,
            output_hidden_states=True,
            return_dict=True
        )

        last_decoder_hidden_state = decoder_outputs.last_hidden_state
        conditional_context = self.language_proj(last_decoder_hidden_state)

        un_token = self.tokenizer("", padding="max_length", truncation=True, max_length=77, return_tensors="pt").input_ids.to(self._device)
        un_context_embeddings = self.text_encoder(un_token).last_hidden_state
        un_context_embeddings = un_context_embeddings.expand(batch_size, -1, -1)
        if self.training_args.use_text_encoder:
            context_embeddings = self.text_encoder(
                inputs_embeds=conditional_context.to(self._dtype)
            ).last_hidden_state

        latent_shape = (batch_size, 4, self.training_args.image_size // 8, self.training_args.image_size // 8)
        latents = torch.randn(latent_shape, device=self._device, dtype=self._dtype)

        scheduler = self.scheduler
        scheduler.set_timesteps(self.training_args.num_inference_steps)
        with torch.no_grad():
            for t in scheduler.timesteps:
                latent_model_input = torch.cat([latents, latents], dim=0)
                latent_model_input = scheduler.scale_model_input(latent_model_input, t)

                # Classifier-free guidance: run unconditional and conditional contexts in one batch
                combined_embeddings = torch.cat([un_context_embeddings, context_embeddings], dim=0).to(self._dtype)
                noise_pred = self.unet(
                    latent_model_input, t, encoder_hidden_states=combined_embeddings
                )[0]

                noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2, dim=0)
                noise_pred = noise_pred_uncond + self.training_args.guidance_scale * (noise_pred_cond - noise_pred_uncond)

                latents = scheduler.step(noise_pred, t, latents)[0]

        # Undo the Stable Diffusion latent scaling factor before decoding with the VAE
        scaled_latents = latents / 0.18215
        with torch.no_grad():
            decoded_latents = self.vae.decode(scaled_latents.to(torch.float32))[0]

        return decoded_latents

    def get_conditional_context(self, images, batch_size=None):
        if batch_size is None:
            batch_size = self.batch_size
        prompt = ["<MORE_DETAILED_CAPTION>"] * batch_size
        inputs = self.processor(text=prompt, images=images, return_tensors="pt").to(self._device).to(self._dtype)

        if inputs["input_ids"] is not None:
            inputs_embeds = self.model.language_model.get_input_embeddings()(inputs["input_ids"]).to(self._dtype)
        if inputs["pixel_values"] is not None:
            image_features = self.model._encode_image(inputs["pixel_values"]).to(self._dtype)
            inputs_embeds, attention_mask = self.model._merge_input_ids_with_image_features(image_features, inputs_embeds)
        if inputs_embeds is not None:
            attention_mask = attention_mask.to(inputs_embeds.dtype)
            encoder_outputs = self.model.language_model.model.encoder(
                inputs_embeds=inputs_embeds,
                attention_mask=attention_mask,
                output_hidden_states=True,
                return_dict=True
            )

        decoder_input_embeds = self.query_embed.expand(batch_size, -1, -1)
        decoder_attention_mask = torch.ones(
            (batch_size, self.num_queries),
            dtype=self._dtype,
            device=self._device
        )

        encoder_hidden_states = encoder_outputs.last_hidden_state.to(self._dtype)
        decoder_input_embeds = decoder_input_embeds.to(self._dtype)
        attention_mask = attention_mask.to(self._dtype)

        decoder_outputs = self.model.language_model.model.decoder(
            inputs_embeds=decoder_input_embeds,
            attention_mask=decoder_attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=attention_mask,
            output_hidden_states=True,
            return_dict=True
        )

        last_decoder_hidden_state = decoder_outputs.last_hidden_state
        return last_decoder_hidden_state

    def forward(
        self,
        image=None,
        filename=None,
        **kwargs,
    ) -> SDOutput:
        images_for_language_model = image
        normalize_images = normalize(image, rescale=True)
        x0 = self.vae.encode(normalize_images.to(torch.float32)).latent_dist.sample()
        latent = x0 * 0.18215

        total_timestep = self.scheduler.num_train_timesteps

        timesteps = initiate_time_steps(0, total_timestep, self.batch_size, self.training_args).long()
        timesteps = timesteps.to(self._device)
        c, h, w = latent.shape[1:]
        if not self.training_args.use_same_noise_among_timesteps:
            noise = torch.randn((self.batch_size, c, h, w), device=self._device, dtype=self._dtype)
        else:
            noise = torch.randn((1, c, h, w), device=self._device, dtype=self._dtype)
            noise = noise.repeat(self.batch_size, 1, 1, 1)

        conditional_context = self.get_conditional_context(images_for_language_model)
        conditional_context = self.language_proj(conditional_context)

        if self.training_args.use_text_encoder:
            text_encoder_output = self.text_encoder(input_ids=None, inputs_embeds=conditional_context.to(self._dtype))
            pred_noise = self._unet_pred_noise(x_start=latent, t=timesteps, noise=noise, context=text_encoder_output.last_hidden_state.to(self._dtype)).to(self._dtype)
        else:
            pred_noise = self._unet_pred_noise(x_start=latent, t=timesteps, noise=noise, context=conditional_context.to(self._dtype)).to(self._dtype)

        if self.training_args.loss == "l1":
            loss = torch.nn.functional.l1_loss(pred_noise, noise)
        else:
            loss = torch.nn.functional.mse_loss(pred_noise, noise)

        return SDOutput(loss=loss)
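`SDModel` takes no explicit hyperparameters of its own: everything is read off the `training_args` namespace, and constructing it downloads the frozen Stable Diffusion 2.1-base and Florence-2-large weights via `build.py`. The following is a minimal sketch, not part of the repository, of the attributes that `VLV_stage1.py` and `build.py` appear to read; the attribute values and the `vlv_repo` package name are illustrative assumptions.

```python
# Hypothetical sketch: the smallest training_args namespace read by SDModel / build.py.
import argparse
import torch

from vlv_repo.VLV_stage1 import SDConfig, SDModel  # "vlv_repo" is a placeholder package name;
                                                   # the modules use relative imports.

args = argparse.Namespace(
    fp32=True,                      # True -> self._dtype = torch.float32
    batch_size=1,
    eval_batch_size=1,              # used by generate_images()
    learnable_token_length=77,      # number of learnable query tokens
    image_size=384,                 # latents are image_size // 8 on each side
    guidance_scale=7.5,             # classifier-free guidance weight in generate_images()
    num_inference_steps=50,
    use_text_encoder=True,
    use_same_noise_among_timesteps=False,
    loss="mse",                     # anything other than "l1" selects MSE in forward()
    # flags read by build.load_Florence2_model:
    unfreeze_florence2_all=False,
    unfreeze_florence2_language_model=False,
    unfreeze_florence2_language_model_decoder=False,
)

model = SDModel(SDConfig(), args)   # downloads SD 2.1-base and Florence-2-large components
# model(image=batch) expects a float image batch; forward() normalizes it, encodes it with
# the VAE, and returns SDOutput(loss=...) for the noise-prediction distillation objective.
```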
VLV_stage2.py
ADDED
@@ -0,0 +1,460 @@
from dataclasses import dataclass
from typing import Optional, Tuple, Dict, Any, Union
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from transformers.utils import ModelOutput
from transformers.modeling_utils import PreTrainedModel
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, PretrainedConfig
from safetensors.torch import load_file
import torchvision.transforms as transforms
from .build import load_sd_model, load_Florence2_model
from .vlv_utils import initiate_time_steps, normalize, process_caption
from .VLV_stage1 import SDModel, SDConfig
from .configuration_vlv import VLV_Config
import os
import sys
import argparse

def handle_module_prefix(state_dict):
    """Handle 'module.' prefix in state dict keys."""
    if any(k.startswith('module.') for k in state_dict.keys()):
        return {k.replace('module.', ''): v for k, v in state_dict.items()}
    return state_dict

def create_model_args(args):
    """Create model arguments needed by SDModel."""
    model_args = argparse.Namespace()
    model_args.use_text_encoder = args.use_text_encoder
    model_args.batch_size = args.batch_size
    model_args.eval_batch_size = args.batch_size
    model_args.distributed_strategy = 'none'
    model_args.fp32 = args.fp32
    model_args.learnable_token_length = args.learnable_token_length
    model_args.num_inference_steps = args.num_inference_steps
    model_args.image_size = args.image_size
    model_args.guidance_scale = args.guidance_scale
    model_args.unfreeze_florence2_all = False
    model_args.unfreeze_florence2_language_model = False
    model_args.unfreeze_florence2_language_model_decoder = False
    return model_args

def load_model_checkpoint(model, model_path, device):
    """Load model checkpoint."""
    try:
        checkpoint = torch.load(model_path, map_location="cpu")

        # Handle different checkpoint formats
        if isinstance(checkpoint, dict) and 'model_state_dict' in checkpoint:
            state_dict = checkpoint['model_state_dict']
        elif isinstance(checkpoint, dict) and 'state_dict' in checkpoint:
            state_dict = checkpoint['state_dict']
        else:
            state_dict = checkpoint

        state_dict = handle_module_prefix(state_dict)
        missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)

        if missing_keys:
            print(f"Missing keys: {missing_keys[:10]}...")  # Show first 10
        if unexpected_keys:
            print(f"Unexpected keys: {unexpected_keys[:10]}...")  # Show first 10

        print(f"Successfully loaded model from {model_path}")
    except Exception as e:
        print(f"Error loading model: {e}")
        raise e

    return model

def initialize_diffusion_model(args):
    """Initialize the diffusion model."""
    config = SDConfig()
    diffusion_model_args = create_model_args(args)
    diffusion_model = SDModel(config, diffusion_model_args)
    _dtype = torch.float32 if diffusion_model_args.fp32 else torch.bfloat16

    # Delete components that aren't needed for inference
    if hasattr(diffusion_model, 'vae'):
        del diffusion_model.vae
    if hasattr(diffusion_model, 'unet'):
        del diffusion_model.unet

    # Clear CUDA cache
    torch.cuda.empty_cache()

    diffusion_model = diffusion_model.to(_dtype)

    # Freeze parameters that shouldn't be trained
    for param in diffusion_model.language_proj.parameters():
        param.requires_grad = False
    diffusion_model.query_embed.requires_grad = False

    return diffusion_model

class MLP(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(MLP, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, output_dim),
            nn.GELU(),
            nn.Linear(output_dim, output_dim),
        )

    def forward(self, x):
        return self.layers(x)


@dataclass
class CLIPDecoderOutput(ModelOutput):
    """
    Output class for the CLIP Decoder model.
    """
    last_hidden_state: Optional[torch.FloatTensor] = None
    generated_ids: Optional[torch.LongTensor] = None
    generated_text: Optional[list] = None


class CLIPDecoder(nn.Module):

    def __init__(
        self,
        language_model: str,
        VLV_model: SDModel,
        device: torch.device,
        bf16: str,
        qwen2_config: dict = None,
        args: argparse.Namespace = None
    ):
        """
        Initialize the CLIP Decoder model.

        Args:
            language_model: Path to the language model
            VLV_model: The VLV model instance
            device: The device to run the model on
            bf16: Whether to use bfloat16 precision
            qwen2_config: Optional qwen2 configuration dict
        """
        super(CLIPDecoder, self).__init__()

        self._dtype = torch.bfloat16 if bf16 == "bf16" else torch.float32
        self.qwen2_tokenizer = AutoTokenizer.from_pretrained(language_model)

        self.qwen2_config = AutoConfig.from_pretrained(language_model)
        self.qwen2_model = AutoModelForCausalLM.from_pretrained(
            language_model,
            torch_dtype=self._dtype,
            device_map=None,
            low_cpu_mem_usage=True
        )

        self.VLV_model = VLV_model  # fp32 in this case
        self.device = device
        self.mlp = MLP(input_dim=1024, output_dim=self.qwen2_model.config.hidden_size)
        self.ignore_token_id = -100

    def get_conditional_context(self, images, batch_size):
        """
        Get conditional context from images using the diffusion model.

        Args:
            images: Input images
            batch_size: Batch size

        Returns:
            Decoder hidden states from the diffusion model
        """
        prompt = ["<MORE_DETAILED_CAPTION>"] * batch_size
        inputs = self.VLV_model.processor(text=prompt, images=images, return_tensors="pt").to(self.device).to(self._dtype)

        # Ensure all components are on the correct device
        self.VLV_model = self.VLV_model.to(inputs["input_ids"].device)
        self.qwen2_model = self.qwen2_model.to(inputs["input_ids"].device)
        self.mlp = self.mlp.to(inputs["input_ids"].device)
        self.VLV_model.model.language_model.model = self.VLV_model.model.language_model.model.to(inputs["input_ids"].device)

        if inputs["input_ids"] is not None:
            inputs_embeds = self.VLV_model.model.language_model.get_input_embeddings()(inputs["input_ids"]).to(self.device)

        if inputs["pixel_values"] is not None:
            image_features = self.VLV_model.model._encode_image(inputs["pixel_values"]).to(self.device)
            inputs_embeds, attention_mask = self.VLV_model.model._merge_input_ids_with_image_features(
                image_features, inputs_embeds
            )

        if inputs_embeds is not None:
            attention_mask = attention_mask.to(inputs_embeds.dtype)

            encoder_outputs = self.VLV_model.model.language_model.model.encoder(
                inputs_embeds=inputs_embeds,
                attention_mask=attention_mask,
                output_hidden_states=True,
                return_dict=True
            )

        decoder_inputs_embeds = self.VLV_model.query_embed.expand(batch_size, -1, -1)
        decoder_attention_mask = torch.ones(
            (batch_size, self.VLV_model.num_queries),
            dtype=self._dtype,
            device=self.device
        )

        encoder_hidden_states = encoder_outputs.last_hidden_state.to(self._dtype)
        decoder_input_embeds = decoder_inputs_embeds.to(self._dtype)
        attention_mask = attention_mask.to(self._dtype)

        decoder_outputs = self.VLV_model.model.language_model.model.decoder(
            inputs_embeds=decoder_input_embeds,
            attention_mask=decoder_attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=attention_mask,
            output_hidden_states=True,
            return_dict=True
        )

        return decoder_outputs.last_hidden_state

    def process_image(self, images, batch_size):
        """
        Process images to get clip text embeddings.

        Args:
            images: Input images
            batch_size: Batch size

        Returns:
            Processed clip text embeddings and attention mask
        """
        decoder_hidden_states = self.get_conditional_context(images, batch_size)
        context_embeds = self.VLV_model.language_proj(decoder_hidden_states)
        clip_text_embeds = self.VLV_model.text_encoder(inputs_embeds=context_embeds).last_hidden_state
        clip_text_embeds = self.mlp(clip_text_embeds)
        clip_text_embeds_attention_mask = torch.ones(
            (batch_size, self.VLV_model.num_queries),
            dtype=torch.long,
            device=self.device
        )

        return clip_text_embeds, clip_text_embeds_attention_mask

    def prepare_generation_inputs(self, clip_text_embeds, clip_text_attention_mask=None):
        """
        Prepare inputs for text generation.

        Args:
            clip_text_embeds: Processed clip text embeddings
            clip_text_attention_mask: Attention mask for clip text embeddings

        Returns:
            Dictionary of generation inputs
        """
        if clip_text_attention_mask is None:
            clip_text_attention_mask = torch.ones(
                (clip_text_embeds.shape[0], clip_text_embeds.shape[1]),
                dtype=torch.long,
                device=clip_text_embeds.device
            )

        return {
            "inputs_embeds": clip_text_embeds,
            "attention_mask": clip_text_attention_mask
        }

    def generate(self, images, max_new_tokens=300, num_beams=4, early_stopping=True):
        """
        Generate text from images.

        Args:
            images: Input images
            max_new_tokens: Maximum number of tokens to generate
            num_beams: Number of beams for beam search
            early_stopping: Whether to stop early in beam search

        Returns:
            CLIPDecoderOutput with generated ids and text
        """
        batch_size = len(images)
        clip_text_embeds, clip_text_attention_mask = self.process_image(images, batch_size)
        generation_inputs = self.prepare_generation_inputs(clip_text_embeds, clip_text_attention_mask)

        generation_inputs["inputs_embeds"] = generation_inputs["inputs_embeds"].to(self._dtype)
        generation_inputs["attention_mask"] = generation_inputs["attention_mask"].to(self._dtype)

        generated_ids = self.qwen2_model.generate(
            inputs_embeds=generation_inputs["inputs_embeds"],
            attention_mask=generation_inputs["attention_mask"],
            max_new_tokens=max_new_tokens,
            num_beams=num_beams,
            early_stopping=early_stopping
        )

        generated_text = self.qwen2_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
        processed_generated_text = [process_caption(text) for text in generated_text]

        return CLIPDecoderOutput(
            generated_ids=generated_ids,
            generated_text=processed_generated_text
        )

    def forward(self, images, captions=None):
        """
        Forward pass for training.

        Args:
            images: Input images
            captions: Target captions (optional, for training)

        Returns:
            CLIPDecoderOutput with loss and logits
        """
        batch_size = images.shape[0]

        # Process images
        clip_text_embeds, clip_text_attention_mask = self.process_image(images, batch_size)

        # If no captions provided, return embeddings for generation
        if captions is None:
            return CLIPDecoderOutput(
                last_hidden_state=clip_text_embeds
            )

        assert len(captions) == batch_size
        # Process captions for training
        processed_captions = [process_caption(caption) for caption in captions]
        qwen_input_ids = self.qwen2_tokenizer(
            text=processed_captions,
            truncation=True,
            return_tensors="pt",
            padding="max_length",
            max_length=300,
            return_token_type_ids=False,
        ).input_ids

        assert len(captions) == batch_size
        qwen_attention_mask = qwen_input_ids.ne(self.qwen2_tokenizer.pad_token_id).to(torch.long).to(self.device)

        # Prepare labels for training
        labels = qwen_input_ids
        labels[labels == self.qwen2_tokenizer.pad_token_id] = self.ignore_token_id
        labels = labels.to(self.device)

        # Get embeddings for captions to create the full input sequence
        labels_for_embeddings = labels.clone()
        labels_for_embeddings[labels_for_embeddings == self.ignore_token_id] = self.qwen2_tokenizer.pad_token_id
        clip_text_embeds_qwen = self.qwen2_model.get_input_embeddings()(labels_for_embeddings)

        # Concatenate the embeddings and prepare attention mask
        inputs_embeds = torch.cat((clip_text_embeds, clip_text_embeds_qwen), dim=1)
        clip_seq_len = clip_text_embeds.shape[1]
        clip_ignore_labels = torch.full((labels.shape[0], clip_seq_len), self.ignore_token_id).to(labels)
        combined_labels = torch.cat((clip_ignore_labels, labels), dim=1)

        attention_mask = torch.cat((
            clip_text_attention_mask,
            qwen_attention_mask
        ), dim=1)

        # Forward through language model
        outputs = self.qwen2_model(
            inputs_embeds=inputs_embeds,
            labels=combined_labels,
            attention_mask=attention_mask,
            use_cache=False
        )
        return outputs


# HuggingFace Model Wrapper
class VLV_MODEL(PreTrainedModel):
    config_class = VLV_Config
    model_type = "VLV_decoder"

    def __init__(self, config):
        super().__init__(config)
        """Load the CLIPDecoder model."""
        # Initialize the diffusion model first
        device = "cuda"
        de_diffusion_model = initialize_diffusion_model(config)
        clip_decoder_model = CLIPDecoder(
            language_model=config.qwen_model,
            VLV_model=de_diffusion_model,
            device=device,
            bf16=config.mixed_precision,
            qwen2_config=config.qwen2_config
        )

        # Load the trained weights
        # clip_decoder_model = load_model_checkpoint(clip_decoder_model, config.clip_decoder_checkpoint, device)

        # Set to evaluation mode
        clip_decoder_model.eval()

        # Store components directly as attributes to match checkpoint structure
        self.VLV_model = clip_decoder_model.VLV_model
        self.qwen2_model = clip_decoder_model.qwen2_model
        self.mlp = clip_decoder_model.mlp

        # Keep the full model for methods
        self._clip_decoder_model = clip_decoder_model
        self.max_new_tokens = config.max_length
        self.num_beams = config.num_beams
        self.transform = self.get_transform(config.image_size)

    def get_transform(self, image_size):
        """Transformation pipeline for input images."""
        return transforms.Compose([
            transforms.Resize(image_size),
            transforms.CenterCrop((image_size, image_size)),
            transforms.PILToTensor(),
        ])

    @classmethod
    def from_checkpoint(cls, checkpoint_path, config=None, **kwargs):
        """
        Load model from original training checkpoint.

        Args:
            checkpoint_path: Path to the original model.pt checkpoint
            config: Optional VLV_Config, will create default if None
            **kwargs: Additional arguments for model initialization
        """
        if config is None:
            # Create default config
            config = VLV_Config(
                image_size=384,
                guidance_scale=7.5,
                learnable_token_length=77,
                max_length=300,
                num_beams=4,
                **kwargs
            )

        # Initialize model
        model = cls(config)

        # Load checkpoint weights
        device = "cuda" if torch.cuda.is_available() else "cpu"
        load_model_checkpoint(model._clip_decoder_model, checkpoint_path, device)

        return model

    def forward(self, valid_images, max_length):
        valid_images = [self.transform(img) for img in valid_images]
        if hasattr(self._clip_decoder_model, 'module'):
            outputs = self._clip_decoder_model.module.generate(
                valid_images,
                max_new_tokens=max_length,
                num_beams=self.num_beams,
                early_stopping=True
            )
        else:
            outputs = self._clip_decoder_model.generate(
                valid_images,
                max_new_tokens=max_length,
                num_beams=self.num_beams,
                early_stopping=True
            )
        return outputs
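A minimal end-to-end sketch of the wrapper above (not part of the repository): `VLV_MODEL` builds the whole stack from a `VLV_Config`, and its `forward` applies `get_transform` to PIL images before beam-search generation. The `vlv_repo` package name, the image path, and the checkpoint path are placeholders, and a CUDA device is assumed because the constructor pins the decoder to `"cuda"`.

```python
import torch
from PIL import Image

from vlv_repo.configuration_vlv import VLV_Config   # placeholder package name
from vlv_repo.VLV_stage2 import VLV_MODEL

config = VLV_Config(image_size=384, guidance_scale=7.5,
                    learnable_token_length=77, max_length=300, num_beams=4)
# Plain VLV_MODEL(config) does not load the trained VLV weights; from_checkpoint() does.
model = VLV_MODEL.from_checkpoint("model.pt", config=config)   # placeholder checkpoint path

image = Image.open("example.jpg").convert("RGB")               # placeholder image path
with torch.no_grad():
    out = model([image], max_length=300)   # forward(): get_transform -> CLIPDecoder.generate
print(out.generated_text[0])
```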
build.py
ADDED
@@ -0,0 +1,78 @@
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTokenizer, AutoProcessor
from .modeling_clip import CustomCLIPTextModel
from .modeling_florence2 import Florence2ForConditionalGeneration
from .configuration_florence2 import Florence2Config


def load_sd_model(training_args):
    """Load Stable Diffusion model"""

    repo_id = "stabilityai/stable-diffusion-2-1-base"

    text_encoder = CustomCLIPTextModel.from_pretrained(repo_id, subfolder="text_encoder")
    tokenizer = CLIPTokenizer.from_pretrained(repo_id, subfolder="tokenizer")
    vae = AutoencoderKL.from_pretrained(repo_id, subfolder="vae", revision=None)
    scheduler = DDPMScheduler.from_pretrained(repo_id, subfolder="scheduler")
    unet = UNet2DConditionModel.from_pretrained(repo_id, subfolder="unet", revision=None)

    for m in [vae, text_encoder, unet]:
        for param in m.parameters():
            param.requires_grad = False

    return (vae, tokenizer, text_encoder, unet, scheduler)


def load_Florence2_model(training_args):
    config = Florence2Config.from_pretrained("microsoft/Florence-2-large")
    config.vision_config.model_type = "davit"
    config._attn_implementation = "eager"

    # Load the model with pre-trained weights
    model = Florence2ForConditionalGeneration.from_pretrained("microsoft/Florence-2-large", config=config)
    processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

    # freeze the model
    if training_args.unfreeze_florence2_all:
        for param in model.parameters():
            param.requires_grad = True
    elif training_args.unfreeze_florence2_language_model:
        for param in model.parameters():
            param.requires_grad = False
        for param in model.language_model.parameters():
            param.requires_grad = True
        for param in model.language_model.lm_head.parameters():
            param.requires_grad = False

        model.language_model.lm_head.weight = torch.nn.Parameter(
            model.language_model.lm_head.weight.detach().clone())

        for p in model.language_model.lm_head.parameters():
            p.requires_grad = False

    elif training_args.unfreeze_florence2_language_model_decoder:
        # Create a separate embedding layer for decoder
        original_embeddings = model.language_model.model.shared
        new_decoder_embeddings = torch.nn.Embedding(
            num_embeddings=original_embeddings.num_embeddings,
            embedding_dim=original_embeddings.embedding_dim,
            padding_idx=original_embeddings.padding_idx
        )
        # Copy the weights
        new_decoder_embeddings.weight.data = original_embeddings.weight.data.clone()

        # Replace the decoder embeddings
        model.language_model.model.encoder.embed_tokens = original_embeddings
        model.language_model.model.decoder.embed_tokens = new_decoder_embeddings
        for param in model.parameters():
            param.requires_grad = False
        for param in model.language_model.model.decoder.parameters():
            param.requires_grad = True
        model.language_model.model.decoder.embed_tokens.weight.requires_grad = False
    else:
        for param in model.parameters():
            param.requires_grad = False

    return model, processor
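As a usage note (a sketch, not repository code): `load_sd_model` never reads attributes off its `training_args` argument, and `load_Florence2_model` only reads the three `unfreeze_florence2_*` flags, so a bare namespace is enough to call the loaders directly. The `vlv_repo` package name is a placeholder.

```python
# Hypothetical sketch: calling the two loaders from build.py directly.
import argparse
from vlv_repo.build import load_sd_model, load_Florence2_model

args = argparse.Namespace(
    unfreeze_florence2_all=False,
    unfreeze_florence2_language_model=False,
    unfreeze_florence2_language_model_decoder=False,
)

vae, tokenizer, text_encoder, unet, scheduler = load_sd_model(args)   # frozen SD 2.1-base parts
florence2, processor = load_Florence2_model(args)                     # frozen Florence-2-large
print(sum(p.numel() for p in unet.parameters()) / 1e6, "M UNet parameters")
```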
config.json
CHANGED
@@ -3,27 +3,31 @@
     "VLV_MODEL"
   ],
   "auto_map": {
-    "AutoConfig": "
-    "AutoModel": "
-    "AutoModelForCausalLM": "
+    "AutoConfig": "configuration_vlv.VLV_Config",
+    "AutoModel": "VLV_stage2.VLV_MODEL",
+    "AutoModelForCausalLM": "VLV_stage2.VLV_MODEL"
   },
   "model_type": "VLV_decoder",
   "batch_size": 1,
   "deepspeed": true,
   "distributed": true,
   "fp32": true,
-  "guidance_scale": 2.
+  "guidance_scale": 2.5,
   "hidden_size": 128,
-  "image_size":
+  "image_size": 384,
   "learnable_token_length": 77,
   "local_rank": 0,
-  "mixed_precision": "
+  "mixed_precision": "fp32",
   "num_inference_steps": 50,
-  "torch_dtype": "
+  "torch_dtype": "float32",
   "transformers_version": "4.51.1",
   "use_text_encoder": true,
   "verbose": true,
   "qwen_model": "Qwen/Qwen2.5-3B",
+  "stable_diffusion_model_path": "stabilityai/stable-diffusion-2-1-base",
+  "florence2_model_path": "microsoft/Florence-2-large",
+  "max_length": 300,
+  "num_beams": 4,
   "qwen2_config": {
     "architectures": [
       "Qwen2ForCausalLM"
@@ -45,11 +49,11 @@
     "rope_theta": 1000000.0,
     "sliding_window": 32768,
     "tie_word_embeddings": true,
-    "torch_dtype": "
+    "torch_dtype": "float32",
     "transformers_version": "4.40.1",
     "use_cache": true,
     "use_mrope": false,
     "use_sliding_window": false,
     "vocab_size": 151936
   }
-}
+}
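With the `auto_map` entries now pointing at the modules shipped in this repository, the stock `Auto*` classes can resolve the custom config and model when `trust_remote_code=True`. A small sketch, reusing the same `your-username/vlv-captioner` placeholder as the README, that reads back the newly added keys:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("your-username/vlv-captioner", trust_remote_code=True)
print(type(config).__name__)                        # VLV_Config (from configuration_vlv.py)
print(config.image_size, config.guidance_scale)     # 384, 2.5
print(config.stable_diffusion_model_path)           # stabilityai/stable-diffusion-2-1-base
print(config.florence2_model_path)                  # microsoft/Florence-2-large
print(config.max_length, config.num_beams)          # 300, 4
```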
configuration_vlv.py
ADDED
@@ -0,0 +1,172 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# coding=utf-8
|
2 |
+
# Copyright 2024 VLV Team and the HuggingFace Inc. team. All rights reserved.
|
3 |
+
# Licensed under the Apache License, Version 2.0 (the "License");
|
4 |
+
# you may not use this file except in compliance with the License.
|
5 |
+
# You may obtain a copy of the License at
|
6 |
+
#
|
7 |
+
# http://www.apache.org/licenses/LICENSE-2.0
|
8 |
+
#
|
9 |
+
# Unless required by applicable law or agreed to in writing, software
|
10 |
+
# distributed under the License is distributed on an "AS IS" BASIS,
|
11 |
+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
12 |
+
# See the License for the specific language governing permissions and
|
13 |
+
# limitations under the License.
|
14 |
+
|
15 |
+
"""VLV model configuration"""
|
16 |
+
|
17 |
+
from typing import Optional, Dict, Any
|
18 |
+
from transformers.configuration_utils import PretrainedConfig
|
19 |
+
from transformers.utils import logging
|
20 |
+
|
21 |
+
logger = logging.get_logger(__name__)
|
22 |
+
|
23 |
+
|
24 |
+
class VLV_Config(PretrainedConfig):
|
25 |
+
r"""
|
26 |
+
    This is the configuration class to store the configuration of a [`VLV_MODEL`]. It is used to instantiate a VLV model
    according to the specified arguments, defining the model architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        model_type (`str`, *optional*, defaults to "VLV_decoder"):
            The model type identifier.
        batch_size (`int`, *optional*, defaults to 1):
            The batch size for inference.
        deepspeed (`bool`, *optional*, defaults to True):
            Whether to use DeepSpeed.
        distributed (`bool`, *optional*, defaults to True):
            Whether to use distributed training.
        fp32 (`bool`, *optional*, defaults to True):
            Whether to use fp32 precision.
        guidance_scale (`float`, *optional*, defaults to 2.0):
            The guidance scale for generation.
        hidden_size (`int`, *optional*, defaults to 128):
            The hidden size of the model.
        image_size (`int`, *optional*, defaults to 768):
            The size of input images.
        learnable_token_length (`int`, *optional*, defaults to 77):
            The length of the learnable tokens.
        local_rank (`int`, *optional*, defaults to 0):
            The local rank for distributed training.
        mixed_precision (`str`, *optional*, defaults to "bf16"):
            The mixed-precision mode.
        num_inference_steps (`int`, *optional*, defaults to 50):
            The number of inference steps.
        torch_dtype (`str`, *optional*, defaults to "bfloat16"):
            The torch dtype.
        transformers_version (`str`, *optional*, defaults to "4.51.1"):
            The transformers version recorded in the configuration.
        use_text_encoder (`bool`, *optional*, defaults to True):
            Whether to use the text encoder.
        verbose (`bool`, *optional*, defaults to True):
            Whether to enable verbose mode.
        qwen_model (`str`, *optional*, defaults to "Qwen/Qwen2.5-3B"):
            The Qwen model to use.
        stable_diffusion_model_path (`str`, *optional*, defaults to "stabilityai/stable-diffusion-2-1-base"):
            The Stable Diffusion model path.
        florence2_model_path (`str`, *optional*, defaults to "microsoft/Florence-2-large"):
            The Florence-2 model path.
        qwen2_config (`dict`, *optional*):
            The Qwen2 configuration.
        max_length (`int`, *optional*, defaults to 300):
            The maximum length for generation.
        num_beams (`int`, *optional*, defaults to 4):
            The number of beams for beam search.
    """

    model_type = "VLV_decoder"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        model_type: str = "VLV_decoder",
        batch_size: int = 1,
        deepspeed: bool = True,
        distributed: bool = True,
        fp32: bool = True,
        guidance_scale: float = 2.0,
        hidden_size: int = 128,
        image_size: int = 768,
        learnable_token_length: int = 77,
        local_rank: int = 0,
        mixed_precision: str = "bf16",
        num_inference_steps: int = 50,
        torch_dtype: str = "bfloat16",
        transformers_version: str = "4.51.1",
        use_text_encoder: bool = True,
        verbose: bool = True,
        qwen_model: str = "Qwen/Qwen2.5-3B",
        stable_diffusion_model_path: str = "stabilityai/stable-diffusion-2-1-base",
        florence2_model_path: str = "microsoft/Florence-2-large",
        qwen2_config: Optional[Dict[str, Any]] = None,
        max_length: int = 300,
        num_beams: int = 4,
        **kwargs,
    ):
        self.model_type = model_type
        self.batch_size = batch_size
        self.deepspeed = deepspeed
        self.distributed = distributed
        self.fp32 = fp32
        self.guidance_scale = guidance_scale
        self.hidden_size = hidden_size
        self.image_size = image_size
        self.learnable_token_length = learnable_token_length
        self.local_rank = local_rank
        self.mixed_precision = mixed_precision
        self.num_inference_steps = num_inference_steps
        self.torch_dtype = torch_dtype
        self.transformers_version = transformers_version
        self.use_text_encoder = use_text_encoder
        self.verbose = verbose
        self.qwen_model = qwen_model
        self.stable_diffusion_model_path = stable_diffusion_model_path
        self.florence2_model_path = florence2_model_path
        self.qwen2_config = qwen2_config or self._get_default_qwen2_config()
        self.max_length = max_length
        self.num_beams = num_beams

        super().__init__(**kwargs)

    def _get_default_qwen2_config(self):
        """Return the default Qwen2 configuration used when none is provided."""
        return {
            "architectures": ["Qwen2ForCausalLM"],
            "attention_dropout": 0.0,
            "bos_token_id": 151643,
            "eos_token_id": 151643,
            "hidden_act": "silu",
            "hidden_size": 2048,
            "initializer_range": 0.02,
            "intermediate_size": 11008,
            "max_position_embeddings": 32768,
            "max_window_layers": 36,
            "model_type": "qwen2",
            "num_attention_heads": 16,
            "num_hidden_layers": 36,
            "num_key_value_heads": 2,
            "rms_norm_eps": 1e-06,
            "rope_theta": 1000000.0,
            "sliding_window": 32768,
            "tie_word_embeddings": True,
            "torch_dtype": "bfloat16",
            "transformers_version": "4.40.1",
            "use_cache": True,
            "use_mrope": False,
            "use_sliding_window": False,
            "vocab_size": 151936,
        }


class CLIPDecoderConfig(PretrainedConfig):
    r"""
    Configuration class for the CLIPDecoder model (legacy support).
    """

    model_type = "vlv_stage2"

    def __init__(
        self,
        input_dim: int = 1024,
        bf16: bool = False,
        **kwargs,
    ):
        self.input_dim = input_dim
        self.bf16 = bf16
        super().__init__(**kwargs)
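For reference, a minimal sketch of loading this configuration and overriding a few generation-related fields before building the model. The repo id and the use of `AutoConfig` with `trust_remote_code` are assumptions for illustration, not part of this commit:

```python
# Minimal sketch (assumed repo id; assumes the config above is exposed via trust_remote_code).
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "lambertxiao/Vision-Language-Vision-Captioner-Qwen2.5-3B",  # assumed repo id
    trust_remote_code=True,
)
config.num_inference_steps = 25  # fewer diffusion steps for faster runs
config.max_length = 150          # shorter captions
config.num_beams = 4             # beam-search width (default shown above)

print(config.qwen_model, config.image_size, config.learnable_token_length)
```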
model-00001-of-00005.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d7460963b2ea4c7cde35d0c64c8d46d4a9324c7574433f8cf9878bbaf687f61b
size 622330008

model-00002-of-00005.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:dca6a859202a8817026897383409ec85fb0a22d4b6527da6ab5f5e2ccd3745be
size 832409864

model-00003-of-00005.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:30d67c2d202ae6c4166ba0b82310f19225665305e6fb3b22c66ff5318fbf6f50
size 210079920

model-00004-of-00005.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c015745c4638633cfb7d09e9b2b96bfa15fd21511fd74642d13296afc9423a4f
size 5215310704

model-00005-of-00005.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:da8bab2f53dbd82612d2034d6e67724a171a44fc040198ca5fe9d6120cc3409e
size 5046894020

model.safetensors.index.json
CHANGED
The diff for this file is too large to render. See raw diff.
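Because the index diff cannot be rendered here, a small sketch of how to inspect it locally. It assumes `model.safetensors.index.json` follows the standard Hugging Face sharded-checkpoint layout with `metadata` and `weight_map` keys:

```python
# Minimal sketch: inspect which shard stores which tensor (assumed standard index layout).
import json

with open("model.safetensors.index.json") as f:
    index = json.load(f)

print(index["metadata"]["total_size"])   # total parameter bytes across the five shards
weight_map = index["weight_map"]         # maps each tensor name to its shard file
first_tensor = next(iter(weight_map))
print(first_tensor, "->", weight_map[first_tensor])
```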
modeling_clip.py
CHANGED
@@ -1,5 +1,5 @@
 from transformers import CLIPTokenizer, CLIPImageProcessor, CLIPTextModel, CLIPPreTrainedModel, CLIPTextConfig
-from transformers.models.clip.modeling_clip import CLIPTextEmbeddings, CLIPEncoder, CLIPAttention, CLIPMLP, CLIPEncoderLayer, _create_4d_causal_attention_mask, _prepare_4d_attention_mask, BaseModelOutputWithPooling
+from transformers.models.clip.modeling_clip import CLIPTextEmbeddings, CLIPEncoder, CLIPAttention, CLIPMLP, CLIPEncoderLayer, _create_4d_causal_attention_mask, _prepare_4d_attention_mask, BaseModelOutputWithPooling, CLIPTextModelOutput
 from typing import Optional, Union, Tuple
 import torch
 from torch import nn
@@ -53,7 +53,8 @@ class CustomCLIPTextTransformer(nn.Module):
         if inputs_embeds is not None:
-            inputs_embeds
+            # inputs_embeds are already embeddings, just add positional embeddings
+            inputs_embeds = self.embeddings.position_embedding(self.embeddings.position_ids[:, :inputs_embeds.size(1)]) + inputs_embeds
         else:
             inputs_embeds = self.embeddings(input_ids=input_ids, position_ids=position_ids)
@@ -134,9 +135,49 @@ class CustomCLIPTextModel(CLIPPreTrainedModel):
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
     ) -> Union[Tuple, BaseModelOutputWithPooling]:
+        return self.text_model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+
+class CustomCLIPTextModelWithProjection(CLIPPreTrainedModel):
+    config_class = CLIPTextConfig
+    _no_split_modules = ["CLIPTextEmbeddings", "CLIPEncoderLayer"]
+
+    def __init__(self, config: CLIPTextConfig):
+        super().__init__(config)
+        self.text_model = CustomCLIPTextTransformer(config)
+
+        # Add the projection layer for SDXL's second text encoder
+        projection_dim = getattr(config, 'projection_dim', config.hidden_size)
+        self.text_projection = nn.Linear(config.hidden_size, projection_dim, bias=False)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self) -> nn.Module:
+        return self.text_model.embeddings.token_embedding
+
+    def set_input_embeddings(self, value):
+        self.text_model.embeddings.token_embedding = value
+
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, CLIPTextModelOutput]:
+        text_outputs = self.text_model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
@@ -145,3 +186,19 @@ class CustomCLIPTextModel(CLIPPreTrainedModel):
+
+        pooled_output = text_outputs[1] if not return_dict else text_outputs.pooler_output
+
+        # Apply the projection to the pooled output
+        text_embeds = self.text_projection(pooled_output)
+
+        if not return_dict:
+            # Return last_hidden_state, pooler_output, text_embeds, and any remaining outputs
+            return (text_outputs[0], text_outputs[1], text_embeds) + text_outputs[2:]
+
+        return CLIPTextModelOutput(
+            text_embeds=text_embeds,  # projected embeddings (for similarity)
+            last_hidden_state=text_outputs.last_hidden_state,  # all token representations
+            hidden_states=text_outputs.hidden_states,
+            attentions=text_outputs.attentions,
+        )
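The `inputs_embeds` branch above is the key change: VLV's learnable query tokens arrive as continuous embeddings, so they skip the token-embedding lookup and only receive positional embeddings. A standalone sketch of that step, with illustrative sizes that are not taken from the checkpoint:

```python
# Standalone sketch of the inputs_embeds path added above. The hidden size and
# sequence length are illustrative placeholders, not the model's real values.
import torch
from torch import nn

hidden_size, max_position_embeddings = 1024, 77
position_embedding = nn.Embedding(max_position_embeddings, hidden_size)
position_ids = torch.arange(max_position_embeddings).unsqueeze(0)  # [1, 77]

inputs_embeds = torch.randn(1, 77, hidden_size)  # e.g. 77 learnable tokens
# Same operation as in CustomCLIPTextTransformer: add positional embeddings only,
# since the inputs are already continuous embeddings rather than token ids.
inputs_embeds = position_embedding(position_ids[:, :inputs_embeds.size(1)]) + inputs_embeds
print(inputs_embeds.shape)  # torch.Size([1, 77, 1024])
```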
vlv_utils.py
ADDED
@@ -0,0 +1,71 @@
"""Utility functions"""
import importlib
import random
import re
import torch
import numpy as np
from PIL import Image


def normalize(image, rescale=True):
    """Map an image tensor to the [-1, 1] range expected by the diffusion model."""
    if rescale:
        image = image.float() / 255.0  # convert to float and rescale to [0, 1]
    normalize_image = 2 * image - 1  # normalize to [-1, 1]
    return normalize_image


def process_caption(caption):
    """Process a caption to ensure proper formatting and remove duplicate sentences.

    Args:
        caption: A string containing the caption text.

    Returns:
        processed_caption: The processed caption string.
    """
    # Drop a trailing, unfinished sentence fragment.
    if not caption.endswith('.'):
        last_period_index = caption.rfind('.')
        if last_period_index != -1:
            caption = caption[:last_period_index + 1]

    # Split into sentences and keep only the first occurrence of each.
    sentences = re.split(r'(?<=[.!?])\s+', caption)
    unique_sentences = []
    for sentence in sentences:
        if sentence and sentence not in unique_sentences:
            unique_sentences.append(sentence)

    processed_caption = ' '.join(unique_sentences)
    return processed_caption


def initiate_time_steps(step, total_timestep, batch_size, config):
    """A helper function to initiate time steps for the diffusion model.

    Args:
        step: An integer giving the constant step.
        total_timestep: An integer giving the total number of timesteps of the diffusion model.
        batch_size: An integer giving the batch size.
        config: A config object.

    Returns:
        timesteps: A tensor of shape [batch_size,] containing the time steps.
    """
    if config.rand_timestep_equal_int:
        # Evenly spaced timesteps across the batch, sharing one random offset.
        interval_val = total_timestep // batch_size
        start_point = random.randint(0, interval_val - 1)
        timesteps = torch.tensor(
            list(range(start_point, total_timestep, interval_val))
        ).long()
        return timesteps
    elif config.random_timestep_per_iteration:
        # A random timestep for each image in the batch (default).
        return torch.randint(0, total_timestep, (batch_size,)).long()
    else:
        # Fall back to the same constant timestep for every image in the batch.
        return torch.tensor([step] * batch_size).long()
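A short usage sketch for these helpers, run from the repository root so that `vlv_utils.py` is importable. The `SimpleNamespace` config is a stand-in introduced here for illustration; it only needs the two flags the function reads:

```python
# Usage sketch for the helpers above (stand-in config object, illustrative shapes).
from types import SimpleNamespace
import torch
from vlv_utils import normalize, process_caption, initiate_time_steps

print(process_caption("A dog on grass. A dog on grass. It is running"))
# -> "A dog on grass."  (duplicate sentence removed, trailing fragment dropped)

cfg = SimpleNamespace(rand_timestep_equal_int=False, random_timestep_per_iteration=True)
timesteps = initiate_time_steps(step=0, total_timestep=1000, batch_size=4, config=cfg)
print(timesteps.shape)  # torch.Size([4]) -- one random timestep per image

image_uint8 = torch.randint(0, 256, (3, 768, 768), dtype=torch.uint8)
normalized = normalize(image_uint8)
print(normalized.min().item(), normalized.max().item())  # roughly -1.0 and 1.0
```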