SATQuest: A Verifier for Logical Reasoning Evaluation and Reinforcement Fine-Tuning of LLMs
Abstract
SATQuest evaluates and enhances LLM logical reasoning by generating diverse SAT-based problems, offering insights into reasoning performance and enabling effective fine-tuning.
Recent advances in Large Language Models (LLMs) have demonstrated remarkable general reasoning capabilities. However, systematically evaluating and enhancing these reasoning capabilities is challenging due to the lack of controllable and scalable tools for fine-grained analysis. Existing benchmarks and datasets often lack the necessary variable control for multi-dimensional, systematic analysis and training, or are restricted to narrow problem types and formats. To address these limitations, we introduce SATQuest, a systematic verifier designed to evaluate and enhance logical reasoning in LLMs by generating diverse, Satisfiability-based logical reasoning problems directly from Conjunctive Normal Form (CNF) instances. SATQuest structures these problems along three orthogonal dimensions: instance scale, problem type, and question format, employing randomized, SAT-based problem generation and objective answer verification via PySAT. This design mitigates memorization issues, allows for nuanced insights into reasoning performance, and enables effective reinforcement fine-tuning. Our extensive evaluation of various LLMs using SATQuest identified significant limitations in their logical reasoning, particularly in generalizing beyond familiar mathematical formats. Furthermore, we show that reinforcement fine-tuning with SATQuest rewards substantially improves targeted task performance and generalizes to more complex instances, while highlighting remaining challenges in cross-format adaptation. Through these demonstrations, we showcase SATQuest's potential as a foundational tool and a valuable starting point for advancing LLM logical reasoning.
Community
This paper introduces SATQuest, a systematic verifier for evaluating and improving logical reasoning in LLMs. Unlike existing benchmarks that often lack multi-dimensional control, SATQuest generates SAT-based reasoning problems directly from CNF instances, structured along three orthogonal dimensions: instance scale, problem type, and question format. The framework employs randomized generation with objective answer verification via PySAT, enabling reproducible and fine-grained analysis.
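As a rough illustration of this generate-and-verify pipeline, the sketch below samples a random k-CNF instance and objectively checks a SAT/UNSAT answer with PySAT. The helper names (`random_cnf`, `verify_sat_answer`) and the sampling scheme are illustrative assumptions, not SATQuest's actual API; only the `pysat.formula.CNF` and `pysat.solvers.Solver` calls come from the real PySAT library.

```python
# Minimal sketch of SAT-based problem generation plus objective verification.
# random_cnf and verify_sat_answer are illustrative, not SATQuest's API.
import random
import sys

from pysat.formula import CNF
from pysat.solvers import Solver

def random_cnf(num_vars: int, num_clauses: int, k: int = 3) -> CNF:
    """Sample a random k-CNF instance; num_vars/num_clauses set the instance scale."""
    clauses = []
    for _ in range(num_clauses):
        chosen = random.sample(range(1, num_vars + 1), k)
        clauses.append([v if random.random() < 0.5 else -v for v in chosen])
    return CNF(from_clauses=clauses)

def verify_sat_answer(cnf: CNF, claimed: str) -> bool:
    """Objectively verify a claimed SAT/UNSAT answer against a real solver."""
    with Solver(bootstrap_with=cnf.clauses) as solver:
        truth = "SAT" if solver.solve() else "UNSAT"
    return claimed.strip().upper() == truth

cnf = random_cnf(num_vars=8, num_clauses=40)
cnf.to_fp(sys.stdout)  # render the instance in DIMACS, one of the question formats
print(verify_sat_answer(cnf, "SAT"))
```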
Experiments show that current LLMs—both vanilla and reasoning-enhanced—still face serious challenges in logical reasoning, especially in scaling to harder instances and generalizing across different formats (Math, DIMACS, Story, DualStory). Reinforcement Fine-Tuning (RFT) with SATQuest rewards substantially improves targeted performance and promotes longer reasoning chains, though cross-format generalization remains difficult.
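A verifier of this kind plugs naturally into RFT as a reward signal. The sketch below, which reuses `verify_sat_answer` from the snippet above, assumes a simple binary reward and an illustrative answer parser; the paper's actual reward shaping may differ.

```python
# Hedged sketch of a verifier-driven RFT reward; extract_answer and
# satquest_reward are illustrative helpers, not the paper's implementation.
import re

def extract_answer(completion: str):
    """Pull the last SAT/UNSAT token from a model completion (illustrative parser)."""
    hits = re.findall(r"\b(UNSAT|SAT)\b", completion.upper())
    return hits[-1] if hits else None

def satquest_reward(cnf: CNF, completion: str) -> float:
    """Binary reward: 1.0 if the final answer passes PySAT verification, else 0.0."""
    answer = extract_answer(completion)
    return 1.0 if answer is not None and verify_sat_answer(cnf, answer) else 0.0

# e.g. satquest_reward(cnf, "... therefore the formula is SAT") -> 1.0 or 0.0
```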
Overall, SATQuest stands out as a foundational tool for advancing logical reasoning in LLMs, offering both a robust benchmark and a verifier-driven RFT framework.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward (2025)
- VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains (2025)
- StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models (2025)
- InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling (2025)
- Do Cognitively Interpretable Reasoning Traces Improve LLM Performance? (2025)
- Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code (2025)
- Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning (2025)