	---
	library_name: transformers
	model_name: GUIrilla-See-0.7B
	tags:
	- sft
	license: mit
	base_model:
	- microsoft/Florence-2-large
	datasets:
	- GUIrilla/GUIrilla-Task
	---

	# GUIrilla-See-0.7B

	Lightweight vision–language model for GUI element localisation

	---

	## Summary

	GUIrilla-See-0.7B is a 0.7-billion-parameter model derived from Florence-2-large and fine-tuned for open-vocabulary detection in graphical user interface (GUI) screenshots.
	Given an image and a free-form textual description, the model returns either

	* the bounding box of the best-matching element, or
	* a polygon mask, when a bounding box is unavailable.

	The model is intended for research on lightweight GUI agents, automated testing, and accessibility tools, where a small footprint matters more than the capacity of a larger model.

	---

	## Quick-start

	```python
	import torch
	from PIL import Image
	from transformers import AutoModelForCausalLM, AutoProcessor

	# --- load pipeline -----------------------------------------------------------
	device = "cuda" if torch.cuda.is_available() else "cpu"
	model_name = "GUIrilla/GUIrilla-See-0.7B"  # 0.7 B weights
	# bfloat16 on GPU to save memory; float32 on CPU
	dtype = torch.bfloat16 if device == "cuda" else torch.float32

	model = AutoModelForCausalLM.from_pretrained(
	    model_name, torch_dtype=dtype, trust_remote_code=True
	).to(device)

	processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

	# --- inference ---------------------------------------------------------------
	image = Image.open("screenshot.png").convert("RGB")
	task_prompt = "<OPEN_VOCABULARY_DETECTION>"
	text_query = "button with the label “Submit”"

	# The task token is prepended to the free-form element description
	prompt = task_prompt + text_query
	inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device, dtype)

	with torch.no_grad():
	    ids = model.generate(
	        input_ids=inputs["input_ids"],
	        pixel_values=inputs["pixel_values"],
	        max_new_tokens=1024,
	        num_beams=3,
	        do_sample=False,
	        early_stopping=False,
	    )

	# Keep special tokens: the post-processor parses location tokens from them
	decoded = processor.batch_decode(ids, skip_special_tokens=False)[0]
	result = processor.post_process_generation(
	    decoded, task=task_prompt, image_size=image.size
	)["<OPEN_VOCABULARY_DETECTION>"]
	```
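To act on the prediction, an agent usually needs a single point to click. The sketch below shows one way to derive it from `result`. It assumes the dictionary follows the layout the Florence-2 processor normally produces for this task (`bboxes` as `[x1, y1, x2, y2]` lists in pixels, and `polygons` as flat `[x1, y1, x2, y2, ...]` lists, possibly nested one level); that layout is an assumption here and worth checking against your own output. `click_point` is a hypothetical helper, not part of the model's API.

```python
from typing import Optional, Tuple


def click_point(result: dict) -> Optional[Tuple[float, float]]:
    """Return an (x, y) point for the first predicted element, or None."""
    bboxes = result.get("bboxes") or []
    if bboxes:
        # Preferred case: use the centre of the first bounding box.
        x1, y1, x2, y2 = bboxes[0]
        return ((x1 + x2) / 2, (y1 + y2) / 2)

    polygons = result.get("polygons") or []
    if polygons:
        poly = polygons[0]
        # Assumed layout: a flat coordinate list, possibly wrapped in extra lists.
        while poly and isinstance(poly[0], (list, tuple)):
            poly = poly[0]
        xs, ys = poly[0::2], poly[1::2]
        if xs and ys:
            # Fall back to the centre of the polygon's bounding rectangle.
            return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)

    return None
```

With the quick-start above, `click_point(result)` would give the pixel coordinates to pass to a mouse-automation layer, or `None` when nothing was detected.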

	---

	## Training Data

	Trained on [GUIrilla-Task](https://huggingface.co/datasets/GUIrilla/GUIrilla-Task).

	* Train data: 25,606 tasks across 881 macOS applications (10% of these applications are held out for validation)
	* Test data: 1,565 tasks across 227 macOS applications
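The validation set is described as a share of applications rather than of tasks, which suggests an app-level hold-out: every task from a held-out application goes to validation. A minimal sketch of such a split follows. The `app` field name, the seed, and the `split_by_app` helper are illustrative assumptions and do not reflect the dataset's actual schema or the authors' script.

```python
import random


def split_by_app(tasks, val_fraction=0.1, seed=0):
    """Split task records so that no application appears in both subsets."""
    apps = sorted({t["app"] for t in tasks})  # hypothetical "app" field
    rng = random.Random(seed)
    rng.shuffle(apps)

    # Hold out a fraction of applications (at least one) for validation.
    n_val = max(1, round(len(apps) * val_fraction))
    val_apps = set(apps[:n_val])

    train = [t for t in tasks if t["app"] not in val_apps]
    val = [t for t in tasks if t["app"] in val_apps]
    return train, val
```

Splitting by application instead of by task keeps screenshots from the same app out of both subsets, so validation measures generalisation to unseen interfaces.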

	---

	## Training Procedure

	* LoRA fine-tuning for 4 epochs on 1 × A100 40 GB.
	* Optimiser: AdamW (β₁ = 0.9, β₂ = 0.95), learning rate 5e-6 with a warm-up ratio of 0.01.
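To make the warm-up ratio concrete, the sketch below computes the learning rate at a given step for a peak of 5e-6 and a warm-up ratio of 0.01. The linear ramp and the constant rate after warm-up are assumptions, since the schedule shape is not stated above; the total step count is likewise whatever your own run uses.

```python
import math

BASE_LR = 5e-6        # peak learning rate from the procedure above
WARMUP_RATIO = 0.01   # fraction of total steps spent warming up


def lr_at(step, total_steps, base_lr=BASE_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at a 0-indexed step: linear warm-up, then constant (assumed)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

For a run of 10,000 optimiser steps this gives 100 warm-up steps, after which the rate stays at its peak. In PyTorch the same values would be passed as `torch.optim.AdamW(params, lr=5e-6, betas=(0.9, 0.95))`.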

	---

	## Evaluation

	| Split | Success Rate (%) |
	| ----- | ---------------- |
	| Test  | 53.55            |
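This card does not define how success is measured. A criterion often used for GUI grounding counts a prediction as correct when the centre of the predicted box falls inside the ground-truth box; the sketch below implements that criterion as an assumption, not as the metric behind the number above. Both function names are hypothetical.

```python
def is_success(pred_bbox, gt_bbox):
    """True if the centre of pred_bbox lies inside gt_bbox (both [x1, y1, x2, y2])."""
    if pred_bbox is None:
        return False  # no detection counts as a failure
    cx = (pred_bbox[0] + pred_bbox[2]) / 2
    cy = (pred_bbox[1] + pred_bbox[3]) / 2
    x1, y1, x2, y2 = gt_bbox
    return x1 <= cx <= x2 and y1 <= cy <= y2


def success_rate(pred_bboxes, gt_bboxes):
    """Percentage of predictions that satisfy is_success."""
    if not gt_bboxes:
        return 0.0
    hits = sum(is_success(p, g) for p, g in zip(pred_bboxes, gt_bboxes))
    return 100.0 * hits / len(gt_bboxes)
```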

	---

	## Ethical & Safety Notes

	* Always sandbox or use confirmation steps when connecting the model to real GUIs.
	* Screenshots may reveal sensitive data – ensure compliance with privacy regulations.

	---

	## License

	MIT (see `LICENSE`).