---
license: apache-2.0
tags:
- image-captioning
- multimodal
- vision-language
- diffusion
- pytorch
- transformers
library_name: transformers
pipeline_tag: image-to-text
datasets:
- conceptual_captions
- coco
model_type: VLV_decoder
---

# VLV Captioner Model

This is a VLV (Vision-Language-Vision) model for image captioning. It combines a Stable Diffusion-based image encoder with the Qwen language model to generate descriptive captions from images.

## Model Description

The VLV Captioner is a multimodal model that:
- Uses a diffusion-based vision encoder to extract image features
- Employs the Qwen2.5-3B language model for text generation
- Generates natural language descriptions of input images

## Model Architecture

- **Vision Encoder**: Stable Diffusion-based image encoder with Florence2 components
- **Language Model**: Qwen2.5-3B transformer model
- **Image Size**: 384x384 pixels
- **Max Caption Length**: 300 tokens
- **Precision**: Mixed precision (bfloat16/float32)
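
The usage examples below pass PIL images directly and let the model handle preprocessing internally. If you want to replicate the 384x384 input pipeline yourself (for example, to inspect tensors before encoding), here is a minimal torchvision sketch; the interpolation mode and normalization statistics are assumptions, not values confirmed by the released code.

```python
from PIL import Image
from torchvision import transforms

# Illustrative preprocessing to the model's 384x384 input size.
# The normalization stats below are assumptions; the released model
# applies its own preprocessing when given PIL images.
preprocess = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),                      # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5],  # assumed stats
                         std=[0.5, 0.5, 0.5]),
])

pixel_values = preprocess(Image.open("path/to/your/image.jpg").convert("RGB"))
print(pixel_values.shape)  # torch.Size([3, 384, 384])
```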

## Usage

### Method 1: Load from Hugging Face Hub

```python
from transformers import AutoModel, AutoConfig
from PIL import Image
import torch
import os

# Optional: Set custom cache directory if needed
cache_dir = "/path/to/your/cache"  # Use a directory with sufficient space
os.makedirs(cache_dir, exist_ok=True)

# Load the model with authentication token (if required)
token = os.getenv('HUGGINGFACE_TOKEN')  # or your token string

print("Loading config...")
config = AutoConfig.from_pretrained(
    "your-username/vlv-captioner", 
    trust_remote_code=True, 
    token=token, 
    cache_dir=cache_dir
)

print("Loading model...")
try:
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner", 
        trust_remote_code=True, 
        token=token, 
        cache_dir=cache_dir,
        torch_dtype=torch.float32,  # Specify dtype explicitly
        low_cpu_mem_usage=True
        # Note: Avoid device_map="auto" to prevent meta tensor issues
    )
    print("Model loaded successfully!")
    
    # Load and process an image
    image = Image.open("path/to/your/image.jpg")
    
    # Move model to GPU if available
    if torch.cuda.is_available():
        model = model.to('cuda')
        print("Model moved to GPU!")
    
    # Generate caption
    print("Generating caption...")
    with torch.no_grad():
        captions = model([image], max_length=300)
        
        # Handle different possible output formats
        if hasattr(captions, 'generated_text'):
            print("Generated caption:", captions.generated_text[0])
        elif isinstance(captions, list):
            print("Generated caption:", captions[0])
        else:
            print("Generated caption:", captions)
            
except Exception as e:
    print(f"Error during model loading or inference: {e}")
    # If cached files are corrupted, try clearing cache and redownloading
    import shutil
    cache_path = f"{cache_dir}/modules/transformers_modules/your-username/vlv-captioner"
    if os.path.exists(cache_path):
        print(f"Clearing cache at {cache_path}")
        shutil.rmtree(cache_path)
    
    # Retry with force download
    model = AutoModel.from_pretrained(
        "your-username/vlv-captioner", 
        trust_remote_code=True, 
        token=token, 
        cache_dir=cache_dir,
        force_download=True,
        torch_dtype=torch.float32
    )
```
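
Since the card lists mixed bfloat16/float32 precision, you can also try loading in bfloat16 on GPUs that support it (Ampere or newer) to roughly halve memory use. This is an optional variation on the call above and assumes the remote model code tolerates bf16 end to end; fall back to `torch.float32` if loading fails or outputs degrade:

```python
# Optional: bfloat16 loading to reduce memory (assumption: the custom
# remote code handles bf16 throughout; float32 is the safe default).
model = AutoModel.from_pretrained(
    "your-username/vlv-captioner",
    trust_remote_code=True,
    token=token,
    cache_dir=cache_dir,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
```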

### Method 2: Load from original checkpoint

```python
from PIL import Image
import torch

from VLV_stage2 import VLV_MODEL

# Load from the original .pt checkpoint file
model = VLV_MODEL.from_checkpoint("path/to/model.pt")

# Load and process an image
image = Image.open("path/to/your/image.jpg")

# Generate caption
with torch.no_grad():
    captions = model([image], max_length=300)
    print(captions.generated_text[0])  # Generated caption
```

## Model Details

- **Model Type**: Vision-Language Model
- **Architecture**: VLV_decoder
- **Language Backbone**: Qwen/Qwen2.5-3B
- **Vision Backbone**: Stable Diffusion + Florence2
- **Training Data**: Image-caption datasets (Conceptual Captions, COCO)
- **Framework**: PyTorch, Transformers

## Training Configuration

- **Batch Size**: 1 (inference)
- **Learnable Token Length**: 77
- **Guidance Scale**: 7.5
- **Inference Steps**: 50
- **Beam Search**: 4 beams
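
These are the generation defaults used by the inference scripts. The hub usage examples only confirm `max_length` as a call-time argument; if the model's `__call__` forwards extra keywords to generation (an assumption, not documented here), an override would look like this:

```python
# Hypothetical: num_beams as a call-time override is an assumption;
# only max_length is confirmed by the usage examples above.
with torch.no_grad():
    captions = model([image], max_length=300, num_beams=4)
```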

## Requirements

```bash
pip install torch transformers safetensors torchvision pillow diffusers
```
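
A quick way to confirm the environment before loading the model (no specific version pins are documented for this card):

```python
import importlib.metadata as md

# Print installed versions of the required packages, or flag missing ones.
for pkg in ["torch", "transformers", "safetensors", "torchvision", "pillow", "diffusers"]:
    try:
        print(f"{pkg}: {md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: NOT INSTALLED")
```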

## Troubleshooting

### Common Issues and Solutions

#### 1. Meta Tensor Issues
If you encounter meta tensor errors, avoid using `device_map="auto"` when loading the model:

```python
# ❌ Don't use this - can cause meta tensor issues
model = AutoModel.from_pretrained("model-name", device_map="auto")

# ✅ Use this instead
model = AutoModel.from_pretrained("model-name", torch_dtype=torch.float32, low_cpu_mem_usage=True)
if torch.cuda.is_available():
    model = model.to('cuda')
```

#### 2. Cache Issues
If you experience corrupted cache files, clear the cache and redownload:

```python
import shutil
import os

cache_dir = "/your/cache/directory"
cache_path = f"{cache_dir}/modules/transformers_modules/your-username/model-name"
if os.path.exists(cache_path):
    shutil.rmtree(cache_path)

# Then reload with force_download=True
model = AutoModel.from_pretrained("model-name", force_download=True)
```

#### 3. Authentication Issues
Make sure your Hugging Face token is properly set:

```bash
# Option 1: Environment variable
export HUGGINGFACE_TOKEN="your_token_here"

# Option 2: Hugging Face CLI login
huggingface-cli login
```
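
You can also log in programmatically with `huggingface_hub`, which is installed as a dependency of `transformers`:

```python
from huggingface_hub import login

# Stores the token so later from_pretrained calls can authenticate.
login(token="your_token_here")  # or login() for an interactive prompt
```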

#### 4. Memory Issues
For large models, use a custom cache directory with sufficient space:

```python
cache_dir = "/path/to/large/storage"
os.makedirs(cache_dir, exist_ok=True)
model = AutoModel.from_pretrained("model-name", cache_dir=cache_dir, low_cpu_mem_usage=True)
```

## Advanced Usage

### Batch Processing with Original Inference Script

For large-scale inference, you can use the repository's original inference script:

```bash
python Caption_inference.py \
  --input_path /path/to/images \
  --output_path captions.json \
  --clip_decoder_checkpoint /path/to/model.pt \
  --qwen_model Qwen/Qwen2.5-3B \
  --stable_diffusion_model_path stabilityai/stable-diffusion-2-1-base \
  --florence2_model_path microsoft/Florence-2-large \
  --batch_size 4 \
  --max_length 300 \
  --num_beams 4 \
  --image_size 384 \
  --guidance_scale 7.5 \
  --use_text_encoder \
  --distributed  # For multi-GPU inference
```
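
If you prefer to stay in Python with the hub model from Method 1, a simple directory loop covers moderate workloads. This sketch assumes the model accepts a list of PIL images per call, as the usage examples suggest, and reuses the output-format handling shown there:

```python
import json
import os

import torch
from PIL import Image

image_dir = "/path/to/images"   # directory of .jpg/.png files
batch_size = 4
results = {}

# Assumes `model` is already loaded and moved to the GPU (see Method 1).
files = sorted(f for f in os.listdir(image_dir)
               if f.lower().endswith((".jpg", ".jpeg", ".png")))
with torch.no_grad():
    for i in range(0, len(files), batch_size):
        batch = files[i:i + batch_size]
        images = [Image.open(os.path.join(image_dir, f)).convert("RGB") for f in batch]
        captions = model(images, max_length=300)
        # Output may be a list of strings or an object with .generated_text.
        texts = captions.generated_text if hasattr(captions, "generated_text") else captions
        results.update(dict(zip(batch, texts)))

with open("captions.json", "w") as f:
    json.dump(results, f, indent=2)
```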

### Configuration Parameters

- `image_size`: Input image resolution (default: 384)
- `guidance_scale`: Diffusion guidance scale (default: 7.5)  
- `learnable_token_length`: Number of vision tokens (default: 77)
- `max_length`: Maximum caption length (default: 300)
- `num_beams`: Beam search width (default: 4)
- `use_text_encoder`: Enable CLIP text encoder (recommended: True)

## Citation

```bibtex
@article{vlv_autoencoder,
  title={Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models},
  author={Zhang, Tiezheng and Li, Yitong and Chou, Yu-Cheng and Chen, Jieneng and Yuille, Alan L. and Wei, Chen and Xiao, Junfei},
  journal={arXiv preprint},
  year={2025}
}
```

## License

This model is released under the Apache 2.0 license.