Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT Paper • 2511.17405 • Published Nov 21, 2025 • 11
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench Paper • 2510.26865 • Published Oct 30, 2025 • 12
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? Paper • 2509.16941 • Published Sep 21, 2025 • 21
FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions Paper • 2509.17177 • Published Sep 21, 2025 • 13
Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information Paper • 2508.11252 • Published Aug 15, 2025 • 3
SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications Paper • 2506.18951 • Published Jun 23, 2025 • 22
Article: Letting Large Models Debate: The First Multilingual LLM Debate Competition • Published Nov 20, 2024 • 33
Open LLM Leaderboard 🏆 Track, rank and evaluate open LLMs and chatbots • 13.9k
Open Chinese LLM Leaderboard 🏆 Explore LLM benchmark leaderboard and submit models • 124
WARM: On the Benefits of Weight Averaged Reward Models Paper • 2401.12187 • Published Jan 22, 2024 • 19