lambertxiao/Vision-Language-Vision-Captioner-Qwen2.5-3B

feature-extraction

image-captioning

vision-language

lambertxiao committed on Jul 10

Commit 2ae17e5 · verified · 1 Parent(s): b156b02

Update README.md

Files changed (1)

README.md +91 -1

@@ -6,4 +6,94 @@ base_model:
 - Qwen/Qwen2.5-3B-Instruct
 - microsoft/Florence-2-large
 pipeline_tag: image-to-text
----
+---
+# Vision-Language-Vision Auto-Encoder
+**Scalable Knowledge Distillation from Diffusion Models**
+## Official Checkpoint · VLV Captioner (Qwen 2.5 3B)
+This repository hosts the 3-billion-parameter **Vision-Language-Vision Captioner** model, distantly supervised by diffusion models and built on top of Qwen 2.5 3B.
+Checkpoint URL: **<https://huggingface.co/lambertxiao/Vision-Language-Vision-Captioner-Qwen2.5-3B>**
+---
+## 1 · Install Dependencies
+```bash
+# inside your virtualenv / conda env
+pip install -r requirements.txt
+```
+## 2 · Example Usage
+```python
+from transformers import AutoModel
+from PIL import Image
+import numpy as np
+import torch
+MODEL_NAME = "lambertxiao/Vision-Language-Vision-Captioner-Qwen2.5-3B"
+device = "cuda" if torch.cuda.is_available() else "cpu"
+# ────── load model ──────
+model = (
+    AutoModel.from_pretrained(
+        MODEL_NAME,
+        trust_remote_code=True,
+        low_cpu_mem_usage=False,
+    )
+    .to(device)
+    .eval()
+)
+# ────── helpers ──────
+def _trim_tail(text: str) -> str:
+    """Remove an incomplete trailing sentence fragment, if any."""
+    sentences = [s.strip() for s in text.split(".") if s.strip()]
+    if not text.rstrip().endswith("."):
+        sentences = sentences[:-1]            # drop dangling fragment
+    return ". ".join(sentences) + ("." if sentences else "")
+def caption_image(img: Image.Image, max_len: int = 77) -> str:
+    """Generate a caption for one PIL image."""
+    with torch.no_grad():
+        raw = model([img], max_len).generated_text[0]
+    return _trim_tail(raw)
+def caption_from_numpy(arr: np.ndarray, max_len: int = 77) -> str:
+    """
+    Wrapper for NumPy arrays.
+    Accepts uint8 [0, 255] or float [0, 1] ranges.
+    """
+    if arr.dtype != np.uint8:
+        arr = (np.clip(arr, 0, 1) * 255).astype(np.uint8)
+    return caption_image(Image.fromarray(arr, mode="RGB"), max_len)
+```
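The range handling inside `caption_from_numpy` can be checked without loading the model. The sketch below isolates that step in a hypothetical helper, `to_uint8_rgb` (not part of the repository), and adds a shape check that the original wrapper does not perform. It assumes, as the wrapper's docstring does, that any non-`uint8` input is a float array in [0, 1].

```python
import numpy as np


def to_uint8_rgb(arr: np.ndarray) -> np.ndarray:
    """Hypothetical helper: normalise an H x W x 3 array to uint8 in [0, 255]."""
    if arr.ndim != 3 or arr.shape[-1] != 3:
        raise ValueError(f"expected an H x W x 3 array, got shape {arr.shape}")
    if arr.dtype == np.uint8:
        return arr  # already in the expected format
    # Same rule as caption_from_numpy: treat anything else as float in [0, 1].
    return (np.clip(arr, 0.0, 1.0) * 255).astype(np.uint8)


# A 1 x 2 pixel float image with two out-of-range values.
float_img = np.array([[[0.0, 0.5, 1.0], [-0.2, 1.3, 0.25]]], dtype=np.float32)
uint8_img = to_uint8_rgb(float_img)
print(uint8_img.dtype, uint8_img.shape)
print(uint8_img.tolist())  # [[[0, 127, 255], [0, 255, 63]]]
```

Out-of-range values are clipped before scaling, and `astype(np.uint8)` truncates rather than rounds, so 0.5 maps to 127. Integer arrays with a dtype other than `uint8` would also go through the clipping branch and collapse to 0 or 255, so it is safer to cast them to `uint8` before calling the wrapper.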
+## 3 · Quick Test
+```python
+# caption a remote sample image (cat photo) in one cell
+import io, requests
+from PIL import Image
+from IPython.display import display  # Jupyter/Colab only
+IMG_URL = "https://huggingface.co/datasets/huggingface/cats-image/resolve/main/cats_image.jpeg"
+# download & open (fail early on HTTP errors)
+resp = requests.get(IMG_URL, timeout=10)
+resp.raise_for_status()
+img = Image.open(io.BytesIO(resp.content)).convert("RGB")
+display(img)                    # show the image
+print(caption_image(img))       # generate and print the caption
+```
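The caption printed by the Quick Test has already passed through `_trim_tail`. The sketch below copies that helper so it runs standalone and applies it to a few made-up strings (stand-ins for raw model output, not real generations) to show what the post-processing keeps and drops.

```python
def _trim_tail(text: str) -> str:
    """Copy of the helper from Section 2 so the example runs standalone."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not text.rstrip().endswith("."):
        sentences = sentences[:-1]  # drop dangling fragment
    return ". ".join(sentences) + ("." if sentences else "")


# Made-up strings standing in for raw model output.
samples = [
    "Two cats lie on a pink couch. A remote control sits betw",
    "Two cats lie on a pink couch.",
    "Two cats lie on a pink cou",
    "The couch is 1.5 m wide.",
]
results = [_trim_tail(raw) for raw in samples]
for trimmed in results:
    print(repr(trimmed))
```

Two behaviours are worth knowing. A caption that consists of a single unfinished sentence is reduced to an empty string, and because the helper splits on every period, a decimal such as `1.5` comes back as `1. 5`.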
+## 4 · Citation
+```bibtex
+@article{zhang2025vision,
+  title   = {Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models},
+  author  = {Zhang, Tiezheng and Li, Yitong and Chou, Yu-Cheng and Chen, Jieneng and Yuille, Alan and Wei, Chen and Xiao, Junfei},
+  journal = {arXiv preprint arXiv:2507.07104},
+  year    = {2025}
+}
+```