Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT Paper • 2511.17405 • Published Nov 21, 2025 • 11
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench Paper • 2510.26865 • Published Oct 30, 2025 • 12
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? Paper • 2509.16941 • Published Sep 21, 2025 • 21
FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions Paper • 2509.17177 • Published Sep 21, 2025 • 13
Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information Paper • 2508.11252 • Published Aug 15, 2025 • 3
SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications Paper • 2506.18951 • Published Jun 23, 2025 • 22
Article: Letting Large Models Debate: The First Multilingual LLM Debate Competition • Published Nov 20, 2024 • 33
Open LLM Leaderboard 🏆 Track, rank and evaluate open LLMs and chatbots • 13.9k
Open Chinese LLM Leaderboard 🏆 Explore LLM benchmark leaderboard and submit models • 124
WARM: On the Benefits of Weight Averaged Reward Models Paper • 2401.12187 • Published Jan 22, 2024 • 19