|
--- |
|
license: mit |
|
pipeline_tag: text-generation |
|
library_name: transformers |
|
--- |
|
|
|
<div align="center"> |
|
<img src="https://github.com/dongguanting/ARPO/blob/main/logo1.png" width="150px"> |
|
</div> |
|
|
|
<h1 align="center" style="margin-top: -50px;">✨ Agentic Reinforced Policy Optimization (ARPO)</h1> |
|
|
|
<div align="center"> |
|
|
|
[📄 Paper (arXiv)](https://arxiv.org/abs/2507.19849) | [🤗 Paper (Hugging Face)](https://huggingface.co/papers/2507.19849) | [🤗 ARPO Collection (models & data)](https://huggingface.co/collections/dongguanting/arpo-688229ff8a6143fe5b4ad8ae) | [License: MIT](https://opensource.org/licenses/MIT) | [Python 3.9](https://www.python.org/downloads/release/python-390/)
|
</div> |
|
|
|
**Agentic Reinforced Policy Optimization (ARPO)** is a novel agentic RL algorithm tailored for training multi-turn Large Language Model (LLM)-based agents. It addresses the challenge of balancing LLMs' intrinsic long-horizon reasoning capabilities with their proficiency in multi-turn tool interactions.
|
|
|
## 💡 Overview |
|
|
|
We propose **Agentic Reinforced Policy Optimization (ARPO)**, **an agentic RL algorithm tailored for training multi-turn LLM-based agents**. The core principle of ARPO is to encourage the policy model to adaptively branch sampling during high-entropy tool-call rounds, thereby efficiently aligning step-level tool-use behaviors.
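To make the branching rule concrete, here is a minimal, self-contained sketch of the entropy-triggered decision described above: measure the entropy of the next-token distribution right after tool feedback and branch additional partial rollouts only when it exceeds a threshold. The threshold, branch factor, and toy distributions are illustrative placeholders, not values from the paper.

```python
import math

def shannon_entropy(probs):
    """Entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def num_branches(next_token_probs, entropy_threshold=2.0, branch_factor=4):
    """Decide how many partial rollouts to branch at a tool-call round:
    branch when the post-tool-feedback distribution is high-entropy,
    otherwise continue a single sampling chain.
    (Threshold and branch factor are illustrative, not the paper's settings.)"""
    h = shannon_entropy(next_token_probs)
    return branch_factor if h > entropy_threshold else 1

# Toy example: a peaked (low-entropy) vs. a flat (high-entropy) distribution.
print(num_branches([0.97, 0.01, 0.01, 0.01]))  # -> 1 (keep a single chain)
print(num_branches([1 / 50] * 50))             # -> 4 (branch the sampling)
```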
|
|
|
<img width="1686" height="866" alt="intro" src="https://github.com/user-attachments/assets/8b9daf54-c4ba-4e79-bf79-f98b5a893edd" /> |
|
|
|
- In the figure (left), the initial tokens generated by the LLM after receiving **each round of tool-call feedback consistently exhibit high entropy**. This indicates that external tool calls significantly **introduce uncertainty into the LLM's reasoning process** (see the measurement sketch after this list).
|
|
|
- In the figure (right), we validate ARPO's performance **across 13 datasets**. Notably, Qwen3-14B with ARPO excelled in Pass@5, **achieving 61.2% on GAIA and 24.0% on HLE**, while requiring only about **half the tool calls** compared to GRPO during training. |
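The entropy signal in the left panel can be probed with standard `transformers` APIs. The sketch below computes the average entropy of the first few next-token distributions produced after a given context (e.g., a prompt that ends with tool feedback). It is a rough illustration of the measurement, not the authors' analysis code, and assumes a causal LM and tokenizer loaded as in the Quick Start below.

```python
import torch
import torch.nn.functional as F

def mean_initial_entropy(model, tokenizer, context: str, k: int = 10) -> float:
    """Average Shannon entropy (nats) of the first k next-token distributions
    produced right after `context` (e.g., a prompt ending with tool feedback)."""
    input_ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)
    entropies = []
    with torch.no_grad():
        for _ in range(k):
            logits = model(input_ids).logits[:, -1, :]           # next-token logits
            probs = F.softmax(logits, dim=-1)
            entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(-1)
            entropies.append(entropy.item())
            next_id = torch.argmax(probs, dim=-1, keepdim=True)  # greedy continuation
            input_ids = torch.cat([input_ids, next_id], dim=-1)
    return sum(entropies) / len(entropies)
```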
|
|
|
## 🏃 Quick Start |
|
|
|
This section provides a basic example of how to perform inference with an ARPO-trained model using the `transformers` library. For more detailed instructions on training and evaluation, please refer to the [official GitHub repository](https://github.com/dongguanting/ARPO). |
|
|
|
```python |
|
import torch |
|
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig |
|
|
|
# Load the model and tokenizer |
|
# Replace "dongguanting/Llama3.1-8B-ARPO" with the specific ARPO checkpoint you want to use. |
|
model_name = "dongguanting/Llama3.1-8B-ARPO" # Example ARPO model |
|
model = AutoModelForCausalLM.from_pretrained( |
|
model_name, |
|
torch_dtype=torch.bfloat16, # Use bfloat16 for better performance on compatible hardware |
|
device_map="auto", |
|
trust_remote_code=True # Required for custom modeling if applicable |
|
).eval() |
|
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) |
|
|
|
# Set generation configuration based on model's generation_config.json |
|
model.generation_config = GenerationConfig.from_pretrained( |
|
model_name, |
|
temperature=0.6, |
|
top_p=0.9, |
|
do_sample=True, |
|
eos_token_id=[128001, 128008, 128009], # From special_tokens_map.json and generation_config.json |
|
pad_token_id=tokenizer.eos_token_id, # Common practice for LLMs |
|
) |
|
|
|
# Prepare messages using the chat template (e.g., Llama 3.1 or similar) |
|
messages = [ |
|
{"role": "system", "content": "You are a helpful AI assistant."}, |
|
{"role": "user", "content": "What is the capital of France?"} |
|
] |
|
|
|
# Apply chat template and tokenize input |
|
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device) |
|
|
|
# Generate response |
|
with torch.no_grad(): |
|
output_ids = model.generate(input_ids, max_new_tokens=256) |
|
|
|
# Decode and print the generated text, excluding the input prompt |
|
response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True).strip() |
|
|
|
print(f"Assistant: {response}") |
|
``` |
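Because ARPO targets multi-turn interaction, you will typically continue the conversation across turns. The snippet below extends the Quick Start example by appending the assistant's reply and a follow-up user message; it uses only the standard chat-template API shown above and does not assume any particular tool-call format.

```python
# Continue the conversation for a second turn by appending the previous reply.
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "And what is its population?"})

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256)

follow_up = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True).strip()
print(f"Assistant: {follow_up}")
```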
|
|
|
## 📄 Citation |
|
|
|
If you find this work helpful, please cite our paper: |
|
```bibtex |
|
@misc{dong2025arpo, |
|
title={Agentic Reinforced Policy Optimization}, |
|
author={Guanting Dong and Hangyu Mao and Kai Ma and Licheng Bao and Yifei Chen and Zhongyuan Wang and Zhongxia Chen and Jiazhen Du and Huiyang Wang and Fuzheng Zhang and Guorui Zhou and Yutao Zhu and Ji-Rong Wen and Zhicheng Dou}, |
|
year={2025}, |
|
eprint={2507.19849}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.LG}, |
|
url={https://arxiv.org/abs/2507.19849}, |
|
} |
|
``` |
|
|
|
## 🤝 Acknowledgements
|
|
|
This training implementation builds upon [Tool-Star](https://github.com/dongguanting/Tool-Star), [Llama Factory](https://github.com/hiyouga/LLaMA-Factory), [verl](https://github.com/volcengine/verl), and [ReCall](https://github.com/Agent-RL/ReCall). For evaluation, we rely on [WebThinker](https://github.com/RUC-NLPIR/WebThinker), [HiRA](https://github.com/RUC-NLPIR/HiRA), [WebSailor](https://github.com/Alibaba-NLP/WebAgent), [Search-o1](https://github.com/sunnynexus/Search-o1), and [FlashRAG](https://github.com/RUC-NLPIR/FlashRAG). The Python interpreter design references [ToRA](https://github.com/microsoft/ToRA) and [ToRL](https://github.com/GAIR-NLP/ToRL), while our models are trained using [Qwen2.5](https://qwenlm.github.io/blog/qwen2.5/). We express our sincere gratitude to these projects for their invaluable contributions to the open-source community.
|
|
|
## 📞 Contact |
|
|
|
For any questions or feedback, please reach out to us at [email protected].