|
--- |
|
license: mit |
|
pipeline_tag: text-generation |
|
library_name: transformers |
|
--- |
|
|
|
<div align="center"> |
|
<img src="https://github.com/dongguanting/ARPO/blob/main/logo1.png" width="150px"> |
|
</div> |
|
|
|
<h1 align="center" style="margin-top: -50px;">✨ Agentic Reinforced Policy Optimization (ARPO)</h1> |
|
|
|
<div align="center"> |
|
|
|
[📄 Paper (arXiv)](https://arxiv.org/abs/2507.19849) | [🤗 Paper (Hugging Face)](https://huggingface.co/papers/2507.19849) | [🤗 ARPO Collection (models & data)](https://huggingface.co/collections/dongguanting/arpo-688229ff8a6143fe5b4ad8ae) | [License: MIT](https://opensource.org/licenses/MIT) | [Python 3.9](https://www.python.org/downloads/release/python-390/)
|
</div> |
|
|
|
**Agentic Reinforced Policy Optimization (ARPO)** is a novel agentic RL algorithm tailored for training multi-turn Large Language Model (LLM)-based agents. It addresses the challenge of balancing LLMs' intrinsic long-horizon reasoning capabilities with their proficiency in multi-turn tool interactions.
|
|
|
## 💡 Overview |
|
|
|
We propose **Agentic Reinforced Policy Optimization (ARPO)**, **an agentic RL algorithm tailored for training multi-turn LLM-based agents**. The core principle of ARPO is to encourage the policy model to adaptively branch sampling during high-entropy tool-call rounds, thereby efficiently aligning step-level tool-use behaviors.
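To make the branching rule concrete, here is a minimal, self-contained sketch of the entropy-triggered decision described above: measure the entropy of the next-token distribution right after tool feedback and branch additional partial rollouts only when it exceeds a threshold. The threshold, branch factor, and toy distributions are illustrative placeholders, not values from the paper.

```python
import math

def shannon_entropy(probs):
    """Entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def num_branches(next_token_probs, entropy_threshold=2.0, branch_factor=4):
    """Decide how many partial rollouts to branch at a tool-call round:
    branch when the post-tool-feedback distribution is high-entropy,
    otherwise continue a single sampling chain.
    (Threshold and branch factor are illustrative, not the paper's settings.)"""
    h = shannon_entropy(next_token_probs)
    return branch_factor if h > entropy_threshold else 1

# Toy example: a peaked (low-entropy) vs. a flat (high-entropy) distribution.
print(num_branches([0.97, 0.01, 0.01, 0.01]))  # -> 1 (keep a single chain)
print(num_branches([1 / 50] * 50))             # -> 4 (branch the sampling)
```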
|
|
|
<img width="1686" height="866" alt="intro" src="https://github.com/user-attachments/assets/8b9daf54-c4ba-4e79-bf79-f98b5a893edd" /> |
|
|
|
- In the figure (left), the initial tokens generated by the LLM after receiving **each round of tool-call feedback consistently exhibit high entropy**. This indicates that external tool calls significantly **introduce uncertainty into the LLM's reasoning process** (see the measurement sketch after this list).
|
|
|
- In the figure (right), we validate ARPO's performance **across 13 datasets**. Notably, Qwen3-14B with ARPO excelled in Pass@5, **achieving 61.2% on GAIA and 24.0% on HLE**, while requiring only about **half the tool calls** compared to GRPO during training. |
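The entropy signal in the left panel can be probed with standard `transformers` APIs. The sketch below computes the average entropy of the first few next-token distributions produced after a given context (e.g., a prompt that ends with tool feedback). It is a rough illustration of the measurement, not the authors' analysis code, and assumes a causal LM and tokenizer loaded as in the Quick Start below.

```python
import torch
import torch.nn.functional as F

def mean_initial_entropy(model, tokenizer, context: str, k: int = 10) -> float:
    """Average Shannon entropy (nats) of the first k next-token distributions
    produced right after `context` (e.g., a prompt ending with tool feedback)."""
    input_ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)
    entropies = []
    with torch.no_grad():
        for _ in range(k):
            logits = model(input_ids).logits[:, -1, :]           # next-token logits
            probs = F.softmax(logits, dim=-1)
            entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(-1)
            entropies.append(entropy.item())
            next_id = torch.argmax(probs, dim=-1, keepdim=True)  # greedy continuation
            input_ids = torch.cat([input_ids, next_id], dim=-1)
    return sum(entropies) / len(entropies)
```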
|
|
|
## 🏃 Quick Start |
|
|
|
This section provides a basic example of how to perform inference with an ARPO-trained model using the `transformers` library. For more detailed instructions on training and evaluation, please refer to the [official GitHub repository](https://github.com/dongguanting/ARPO). |
|
|
|
```python |
|
import torch |
|
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig |
|
|
|
# Load the model and tokenizer |
|
# Replace "dongguanting/Llama3.1-8B-ARPO" with the specific ARPO checkpoint you want to use. |
|
model_name = "dongguanting/Llama3.1-8B-ARPO" # Example ARPO model |
|
model = AutoModelForCausalLM.from_pretrained( |
|
model_name, |
|
torch_dtype=torch.bfloat16, # Use bfloat16 for better performance on compatible hardware |
|
device_map="auto", |
|
trust_remote_code=True # Required for custom modeling if applicable |
|
).eval() |
|
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) |
|
|
|
# Set generation configuration based on model's generation_config.json |
|
model.generation_config = GenerationConfig.from_pretrained( |
|
model_name, |
|
temperature=0.6, |
|
top_p=0.9, |
|
do_sample=True, |
|
eos_token_id=[128001, 128008, 128009], # From special_tokens_map.json and generation_config.json |
|
pad_token_id=tokenizer.eos_token_id, # Common practice for LLMs |
|
) |
|
|
|
# Prepare messages using the chat template (e.g., Llama 3.1 or similar) |
|
messages = [ |
|
{"role": "system", "content": "You are a helpful AI assistant."}, |
|
{"role": "user", "content": "What is the capital of France?"} |
|
] |
|
|
|
# Apply chat template and tokenize input |
|
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device) |
|
|
|
# Generate response |
|
with torch.no_grad(): |
|
output_ids = model.generate(input_ids, max_new_tokens=256) |
|
|
|
# Decode and print the generated text, excluding the input prompt |
|
response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True).strip() |
|
|
|
print(f"Assistant: {response}") |
|
``` |
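Because ARPO targets multi-turn interaction, you will typically continue the conversation across turns. The snippet below extends the Quick Start example by appending the assistant's reply and a follow-up user message; it uses only the standard chat-template API shown above and does not assume any particular tool-call format.

```python
# Continue the conversation for a second turn by appending the previous reply.
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "And what is its population?"})

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256)

follow_up = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True).strip()
print(f"Assistant: {follow_up}")
```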
|
|
|
## 📄 Citation |
|
|
|
If you find this work helpful, please cite our paper: |
|
```bibtex |
|
@misc{dong2025arpo, |
|
title={Agentic Reinforced Policy Optimization}, |
|
author={Guanting Dong and Hangyu Mao and Kai Ma and Licheng Bao and Yifei Chen and Zhongyuan Wang and Zhongxia Chen and Jiazhen Du and Huiyang Wang and Fuzheng Zhang and Guorui Zhou and Yutao Zhu and Ji-Rong Wen and Zhicheng Dou}, |
|
year={2025}, |
|
eprint={2507.19849}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.LG}, |
|
url={https://arxiv.org/abs/2507.19849}, |
|
} |
|
``` |
|
|
|
## 🤝 Acknowledgements
|
|
|
This training implementation builds upon [Tool-Star](https://github.com/dongguanting/Tool-Star), [Llama Factory](https://github.com/hiyouga/LLaMA-Factory), [verl](https://github.com/volcengine/verl), and [ReCall](https://github.com/Agent-RL/ReCall). For evaluation, we rely on [WebThinker](https://github.com/RUC-NLPIR/WebThinker), [HiRA](https://github.com/RUC-NLPIR/HiRA), [WebSailor](https://github.com/Alibaba-NLP/WebAgent), [Search-o1](https://github.com/sunnynexus/Search-o1), and [FlashRAG](https://github.com/RUC-NLPIR/FlashRAG). The Python interpreter design references [ToRA](https://github.com/microsoft/ToRA) and [ToRL](https://github.com/GAIR-NLP/ToRL), while our models are trained using [Qwen2.5](https://qwenlm.github.io/blog/qwen2.5/). We express our sincere gratitude to these projects for their invaluable contributions to the open-source community.
|
|
|
## 📞 Contact |
|
|
|
For any questions or feedback, please reach out to us at [email protected].