---
library_name: transformers
model_name: GUIrilla-See-0.7B
tags:
- sft
license: mit
base_model:
- microsoft/Florence-2-large
datasets:
- GUIrilla/GUIrilla-Task
---

# GUIrilla-See-0.7B

*Lightweight vision–language model for GUI element localisation*

---

## Summary

**GUIrilla-See-0.7B** is a 0.7-billion-parameter model derived from **Florence-2-large** and fine-tuned for **open-vocabulary detection** in graphical user-interface (GUI) screenshots.
Given a screenshot and a free-form textual description of an element, the model returns either

* the bounding box of the best-matching element, or
* a polygon mask, when a bounding box is unavailable.

The model is intended for research on lightweight GUI agents, automated testing, and accessibility tools, where a small footprint matters more than the capacity of a larger model.

---

## Quick-start

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# --- load pipeline -----------------------------------------------------------
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "GUIrilla/GUIrilla-See-0.7B"  # 0.7 B weights
dtype = torch.bfloat16 if device == "cuda" else torch.float32

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=dtype, trust_remote_code=True
).to(device)

processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# --- inference ---------------------------------------------------------------
image = Image.open("screenshot.png").convert("RGB")
task_prompt = "<OPEN_VOCABULARY_DETECTION>"
text_query = "button with the label “Submit”"

# The task token is followed directly by the free-form description.
prompt = task_prompt + text_query
# Move inputs to the device; only floating-point tensors are cast to `dtype`.
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device, dtype)

with torch.no_grad():
    ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=False,
        early_stopping=False,
    )

# Keep special tokens: the post-processor needs the location tokens.
decoded = processor.batch_decode(ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(
    decoded, task=task_prompt, image_size=image.size
)[task_prompt]
print(result)
```
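
A GUI agent usually needs a single click target rather than the raw detection output. The sketch below is one way to derive it, assuming `result` follows the Florence-2 post-processing layout: a `bboxes` list of `[x1, y1, x2, y2]` boxes and a `polygons` list whose entries hold flat `[x1, y1, x2, y2, ...]` coordinate lists. `click_point` is a hypothetical helper and not part of the model's API.

```python
from typing import Optional, Tuple


def click_point(result: dict) -> Optional[Tuple[float, float]]:
    """Return the centre of the first box, else of the first polygon's extent.

    Assumed layout (not confirmed by the model card):
      result["bboxes"]   -> [[x1, y1, x2, y2], ...]
      result["polygons"] -> [[[x1, y1, x2, y2, ...], ...], ...]
    """
    bboxes = result.get("bboxes") or []
    if bboxes:
        x1, y1, x2, y2 = bboxes[0]
        return ((x1 + x2) / 2, (y1 + y2) / 2)

    polygons = result.get("polygons") or []
    if polygons and polygons[0]:
        first = polygons[0]
        # Accept either a list of polygons or a single flat coordinate list.
        flat = first[0] if isinstance(first[0], (list, tuple)) else first
        xs, ys = flat[0::2], flat[1::2]
        # Centre of the polygon's bounding extent, not its true centroid.
        return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)

    return None  # nothing was detected
```

With the quick-start output, `click_point(result)` would give pixel coordinates in the original screenshot, or `None` when the model returns no region.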

---

## Training Data

Trained on [GUIrilla-Task](https://huggingface.co/datasets/GUIrilla/GUIrilla-Task).

* **Train data:** 25,606 tasks across 881 macOS applications, with 10% of the applications held out for validation
* **Test data:** 1,565 tasks across 227 macOS applications
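
Holding out whole applications, rather than individual tasks, keeps screenshots of a validation app out of the training set. The following is a minimal sketch of such an app-level split. The `app` field name, the seed, and the `split_by_app` helper are assumptions for illustration and do not describe the actual dataset schema or the authors' split.

```python
import random
from typing import Dict, List, Tuple


def split_by_app(
    tasks: List[Dict], val_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[Dict], List[Dict]]:
    """Split tasks so that every application lands in exactly one subset."""
    apps = sorted({task["app"] for task in tasks})  # "app" is a hypothetical key
    rng = random.Random(seed)
    rng.shuffle(apps)

    # Hold out a fraction of the applications, at least one.
    n_val = max(1, round(len(apps) * val_fraction))
    val_apps = set(apps[:n_val])

    train = [task for task in tasks if task["app"] not in val_apps]
    val = [task for task in tasks if task["app"] in val_apps]
    return train, val
```

Because the split is made over application names, the two subsets share no application, whatever the number of tasks per app.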

---

## Training Procedure

* 4 epochs of LoRA fine-tuning on 1 × A100 40 GB.
* Optimiser: AdamW (β₁ = 0.9, β₂ = 0.95), learning rate 5e-6 with a warm-up ratio of 0.01.
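
To make the warm-up ratio concrete, the sketch below computes the learning rate at a given optimiser step. It assumes linear warm-up over the first 1% of steps followed by a constant rate. The model card does not state the schedule after warm-up, so that part is an assumption, and `total_steps` is a hypothetical value.

```python
import math


def warmup_steps(total_steps: int, warmup_ratio: float = 0.01) -> int:
    """Number of warm-up steps implied by a warm-up ratio (at least one)."""
    return max(1, math.ceil(total_steps * warmup_ratio))


def lr_at_step(
    step: int, total_steps: int, base_lr: float = 5e-6, warmup_ratio: float = 0.01
) -> float:
    """Linear warm-up to `base_lr`, then constant (assumed, not from the source)."""
    n_warm = warmup_steps(total_steps, warmup_ratio)
    if step < n_warm:
        return base_lr * step / n_warm
    return base_lr
```

For a hypothetical run of 1,000 optimiser steps, the rate would rise linearly over the first 10 steps and stay at 5e-6 afterwards.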

---

## Evaluation

| Split | Success Rate (%) |
| ----- | ---------------- |
| Test  | **53.55**        |
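
The card does not define how success is measured. One common convention in GUI grounding is to count a prediction as successful when the predicted click point falls inside the ground-truth bounding box. The sketch below implements that convention as an assumption; `success_rate` is a hypothetical helper and may differ from the metric behind the number above.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def success_rate(points: Sequence[Optional[Point]], boxes: Sequence[Box]) -> float:
    """Percentage of predicted points lying inside their ground-truth box.

    A prediction of None (no detection) counts as a failure.
    """
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not boxes:
        return 0.0

    hits = 0
    for point, (x1, y1, x2, y2) in zip(points, boxes):
        if point is None:
            continue
        x, y = point
        if x1 <= x <= x2 and y1 <= y <= y2:
            hits += 1
    return 100.0 * hits / len(boxes)
```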

---

## Ethical & Safety Notes

* Always sandbox the model or require confirmation steps when connecting it to real GUIs.
* Screenshots may reveal sensitive data; ensure compliance with privacy regulations.

---

## License

MIT (see `LICENSE`).