---
base_model:
- OpenGVLab/InternVL2-26B
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---

## SpiritSight Agent: Advanced GUI Agent with One Look

<p align="center">
    <a href="https://arxiv.org/abs/2503.03196">📄 Paper</a> •
    <a href="https://huggingface.co/SenseLLM/SpiritSight-Agent-26B">🤖 Models</a> •
    <a href="https://hzhiyuan.github.io/SpiritSight-Agent">🌐 Project Page</a> •
    <a href="https://huggingface.co/datasets/SenseLLM/GUI-Lasagne-L1">📚 Datasets</a>
</p>


## Introduction

SpiritSight-Agent is a vision-based, end-to-end GUI agent that excels in GUI navigation tasks across various GUI platforms.

![SpiritSight-Agent results](results.png)
![SpiritSight-Agent results, continued](results2.png)


## Models

We recommend fine-tuning the base model on custom data.

| Model | Checkpoint | Size | License |
|:-------|:------------|:------|:--------|
| SpiritSight-Agent-2B-base  | 🤗 [HF Link](https://huggingface.co/SenseLLM/SpiritSight-Agent-2B)  | 2B  | [InternVL](https://github.com/OpenGVLab/InternVL/blob/main/LICENSE) |
| SpiritSight-Agent-8B-base  | 🤗 [HF Link](https://huggingface.co/SenseLLM/SpiritSight-Agent-8B)  | 8B  | [InternVL](https://github.com/OpenGVLab/InternVL/blob/main/LICENSE) |
| SpiritSight-Agent-26B-base | 🤗 [HF Link](https://huggingface.co/SenseLLM/SpiritSight-Agent-26B) | 26B | [InternVL](https://github.com/OpenGVLab/InternVL/blob/main/LICENSE) |
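
To fetch one of the checkpoints above for local fine-tuning, a minimal sketch along the following lines may help. It assumes the `huggingface_hub` package is installed; the `CHECKPOINTS` mapping and the `resolve_repo_id` and `download_checkpoint` helpers are hypothetical names introduced here for illustration, and the repository IDs are copied from the table links.

```python
from pathlib import Path

# Repository IDs copied from the table above; the dict name is illustrative.
CHECKPOINTS = {
    "2B": "SenseLLM/SpiritSight-Agent-2B",
    "8B": "SenseLLM/SpiritSight-Agent-8B",
    "26B": "SenseLLM/SpiritSight-Agent-26B",
}


def resolve_repo_id(size: str) -> str:
    """Map a size label such as '8b' to its Hugging Face repository ID."""
    key = size.strip().upper()
    if key not in CHECKPOINTS:
        raise ValueError(
            f"Unknown size {size!r}; expected one of {sorted(CHECKPOINTS)}"
        )
    return CHECKPOINTS[key]


def download_checkpoint(size: str, target_dir: str = "checkpoints") -> Path:
    """Download a checkpoint snapshot (hypothetical helper, not part of this repo)."""
    # Imported lazily so resolve_repo_id works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    repo_id = resolve_repo_id(size)
    local_dir = Path(target_dir) / repo_id.split("/")[-1]
    snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return local_dir

# Example (downloads several GB): download_checkpoint("2B")
```

The lookup is kept separate from the download so the size label can be validated before any network access happens.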


## Datasets

Coming soon.


## Inference

```shell
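# Create and activate an isolated environment
conda create -n spiritsight-agent python=3.9
conda activate spiritsight-agent

# Install dependencies
pip install -r requirements.txt
pip install flash-attn==2.3.6 --no-build-isolation

# Run the inference script for the 26B model
python infer_SSAgent-26B.py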
```
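
Before running the inference script, it can be useful to confirm that the environment is set up as expected. The following is a small sketch using only the Python standard library. The module list is an assumption (the contents of `requirements.txt` are not shown here), `check_environment` and `format_report` are hypothetical helper names, and `flash_attn` is the import name of the `flash-attn` package.

```python
import importlib.util
import sys

# Modules the setup steps are assumed to provide; adjust to match requirements.txt.
EXPECTED_MODULES = ("torch", "transformers", "flash_attn")


def check_environment(modules=EXPECTED_MODULES):
    """Return a dict mapping each check name to True (ok) or False (missing)."""
    # The conda environment is created with Python 3.9; treating newer
    # versions as acceptable is an assumption of this sketch.
    report = {"python>=3.9": sys.version_info >= (3, 9)}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


def format_report(report):
    """Render the report as one aligned line per check."""
    lines = []
    for name, ok in report.items():
        status = "ok" if ok else "MISSING"
        lines.append(f"{name:<14} {status}")
    return "\n".join(lines)


print(format_report(check_environment()))
```

Any entry reported as `MISSING` points to an install step that did not complete in the active environment.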


## Citation

If you find this repo useful for your research, please cite our paper:
```bibtex
@misc{huang2025spiritsightagentadvancedgui,
      title={SpiritSight Agent: Advanced GUI Agent with One Look}, 
      author={Zhiyuan Huang and Ziming Cheng and Junting Pan and Zhaohui Hou and Mingjie Zhan},
      year={2025},
      eprint={2503.03196},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.03196},
}
```


## Acknowledgments

We thank the following amazing projects that truly inspired us:

- [InternVL2](https://huggingface.co/OpenGVLab/InternVL2-8B)
- [SeeClick](https://github.com/njucckevin/SeeClick)
- [Mind2Web](https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web)
- [GUI-Odyssey](https://github.com/OpenGVLab/GUI-Odyssey)
- [AMEX](https://huggingface.co/datasets/Yuxiang007/AMEX)
- [AndroidControl](https://github.com/google-research/google-research/tree/master/android_control)
- [GUICourse](https://github.com/yiye3/GUICourse)