nielsr (HF Staff) committed · verified
Commit 46016f8 · 1 Parent(s): 3d3b95b

Improve model card: Add pipeline tag, library name, quickstart, and expanded details


This PR significantly improves the model card for `spiral-rl/Spiral-Qwen3-4B` by:

* **Adding Metadata**: Including `pipeline_tag: text-generation` for better discoverability on the Hub ([https://huggingface.co/models?pipeline_tag=text-generation](https://huggingface.co/models?pipeline_tag=text-generation)) and `library_name: transformers` to enable the "how to use" widget (see the loading sketch after this description).
* **Enriching Introduction**: Expanding the model's introduction with details from the paper abstract and GitHub README, including key concept images.
* **Adding Architecture Details**: Incorporating a dedicated section explaining the SPIRAL architecture, along with its illustrative image.
* **Providing Quickstart Example**: Adding a Python code snippet for immediate inference using the `transformers` library, making the model easier to use directly from the Hub.
* **Including Acknowledgements**: Adding the acknowledgements from the GitHub repository for transparency and credit.

These changes enhance the model's discoverability, usability, and overall informativeness on the Hugging Face Hub.
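
As a quick illustration of what the new metadata enables (this snippet is not part of the diff below): with `library_name: transformers` and `pipeline_tag: text-generation` set, the model can be loaded through the standard `transformers` pipeline API. The generation parameters here are placeholders only.

```python
# Minimal sketch, not taken from this PR: load the model through the high-level
# transformers pipeline, which is what the text-generation pipeline tag advertises.
from transformers import pipeline

generator = pipeline(
    "text-generation",                  # matches the new pipeline_tag metadata
    model="spiral-rl/Spiral-Qwen3-4B",  # this repository
    device_map="auto",
)
print(generator("What is the capital of France?", max_new_tokens=32)[0]["generated_text"])
```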

Files changed (1)
1. README.md (+67 -3)
README.md CHANGED
@@ -1,7 +1,9 @@
  ---
- license: apache-2.0
  base_model:
  - Qwen/Qwen3-4B-Base
+ license: apache-2.0
+ pipeline_tag: text-generation
+ library_name: transformers
  ---

  # Spiral-Qwen3-4B
@@ -16,11 +18,65 @@ base_model:

  This model is trained with self-play on multi-games (TicTacToe, Kuhn Poker, Simple Negotiation) using the SPIRAL framework.

- <img src="https://raw.githubusercontent.com/spiral-rl/spiral/refs/heads/main/assets/framework.png" width=100%/>
+ Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on expert-curated problem-answer pairs and domain-specific reward engineering.
+
+ We introduce SPIRAL, a self-play framework where models learn by playing **multi-turn, zero-sum games against continuously improving versions of themselves**, eliminating the need for human supervision. Through zero-sum self-play, SPIRAL generates an **_infinite curriculum_** of progressively challenging problems as models must constantly adapt to stronger opponents.
+
+ Applying SPIRAL to Qwen3 base models in two-player zero-sum text games, we observe the agents develop advanced reasoning strategies to win the competitive game. Furthermore, the trained models show substantial gains on a range of math and general reasoning benchmarks. These results suggest that self-play in zero-sum games can naturally induce transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
+
+ <p align="center"><img src="https://raw.githubusercontent.com/spiral-rl/spiral/refs/heads/main/assets/teaser-1.png" width="100%" /></p>
+ <p align="center"><img src="https://raw.githubusercontent.com/spiral-rl/spiral/refs/heads/main/assets/fig1-1.png" width="100%" /></p>
+
+ ## Architecture
+
+ SPIRAL employs an actor-learner architecture for scalable self-play training. Parallel actors sample trajectories from a diverse set of games using vectorized environments. A single policy $\pi_t$ plays both roles, generating zero-sum, sparse reward game trajectories. The centralized learner processes these trajectories using Role-conditioned Advantage Estimation (RAE) to compute separate advantages, $A_0(s,a)$ and $A_1(s,a)$, for each role. These are then used for on-policy reinforcement learning updates.
+
+ <p align="center"><img src="https://raw.githubusercontent.com/spiral-rl/spiral/refs/heads/main/assets/framework.png" width="90%" /></p>
+
+ ## Usage (Quickstart)

+ You can easily load and use this model with the `transformers` library:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
+ import torch
+
+ model_id = "spiral-rl/Spiral-Qwen3-4B"
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Example usage for text generation following the Qwen chat template
+ prompt = "What is the capital of France?"
+ messages = [
+     {"role": "user", "content": prompt}
+ ]
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+
+ # Using a simple generation config (adjust as needed)
+ generation_config = GenerationConfig(
+     max_new_tokens=50,
+     temperature=0.7,
+     do_sample=True,
+     top_p=0.9
+ )
+
+ outputs = model.generate(**inputs, generation_config=generation_config)
+ generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(generated_text)
+ # Expected output: "What is the capital of France? Paris." (or similar)
+ ```
+
+ For more advanced usage, including training and evaluation scripts, please refer to the [GitHub repository](https://github.com/spiral-rl/spiral).

  ## Citation

+ If you find our work useful for your research, please consider citing:
+
  ```latex
  @article{liu2025spiral,
    title={SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning},
@@ -29,4 +85,12 @@ This model is trained with self-play on multi-games (TicTacToe, Kuhn Poker, Simp
    journal={arXiv preprint arXiv:2506.24119},
    url={https://arxiv.org/abs/2506.24119}
  }
- ```
+ ```
+
+ ## Acknowledgements
+
+ This work is supported by [PlasticLabs](https://plasticlabs.ai/) and [Sea AI Lab](https://sail.sea.com/) for computing resources.
+ The language games are sampled from [TextArena](https://github.com/LeonGuertler/TextArena), a collection of competitive text-based games for language model evaluation and reinforcement learning.
+ The multi-agent, multi-turn RL training is implemented with 🌾 [Oat](https://github.com/sail-sg/oat), a modular and research-friendly LLM RL framework.
+ We did exploration on PEFT experiments using [UnstableBaselines](https://github.com/LeonGuertler/UnstableBaselines), a lightweight, LoRA-first library for fast prototyping of self-play algorithms on TextArena.
+ The base models are from [Qwen3](https://huggingface.co/Qwen/Qwen3-4B).
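
The new Architecture section above describes Role-conditioned Advantage Estimation (RAE) in prose only. The sketch below is a hypothetical illustration of that idea, not the SPIRAL implementation: each role keeps its own baseline, and the advantages $A_0$ and $A_1$ are computed against the corresponding role's baseline. The class name, the exponential-moving-average baseline, and the decay value are assumptions; see the [GitHub repository](https://github.com/spiral-rl/spiral) for the actual training code.

```python
# Hypothetical sketch of role-conditioned advantage estimation (RAE).
# Not the SPIRAL implementation; names and the EMA baseline are illustrative.
from collections import defaultdict


class RoleConditionedAdvantage:
    """Keeps one running baseline per role and subtracts it from that role's return."""

    def __init__(self, decay: float = 0.95):
        self.decay = decay                  # EMA decay for the per-role baselines
        self.baseline = defaultdict(float)  # role id -> running baseline

    def update_and_compute(self, role: int, episode_return: float) -> float:
        # Update the baseline for this role only (roles never share a baseline).
        self.baseline[role] = self.decay * self.baseline[role] + (1.0 - self.decay) * episode_return
        # Advantage for this role: return relative to its own baseline.
        return episode_return - self.baseline[role]


rae = RoleConditionedAdvantage()
# Zero-sum, sparse reward: player 0 wins (+1), player 1 loses (-1) in one game.
adv_player0 = rae.update_and_compute(role=0, episode_return=+1.0)
adv_player1 = rae.update_and_compute(role=1, episode_return=-1.0)
print(adv_player0, adv_player1)
```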