Junteng committed on
Commit 3b94390 · verified · 1 Parent(s): 191d1ff

Upload file README.md

Files changed (1)
  1. README.md +24 -12
README.md CHANGED
@@ -35,18 +35,30 @@ Built on Qwen3-8B base model and trained through a two-phase approach:
 
  ## 📊 Performance
 
- WebExplorer-8B achieves state-of-the-art performance across multiple information-seeking benchmarks:
-
- | Benchmark | Score |
- |-----------|-------|
- | BrowseComp-en | **15.7** |
- | BrowseComp-zh | **32.0** |
- | GAIA | **50.0** |
- | WebWalkerQA | **62.7** |
- | FRAMES | **75.7** |
- | XBench-DeepSearch | **53.7** |
- | HLE | **17.3** |
-
+ WebExplorer-8B achieves state-of-the-art performance across multiple information-seeking benchmarks at its scale:
+
+ | Model | BC-en | BC-zh | GAIA | WebWalkerQA | FRAMES | Xbench-DS | HLE |
+ |-------|-------|-------|------|-------------|--------|-----------|-----|
+ | OpenAI-o3† | 50.9 | 58.1 | 70.5† | 71.7 | 84.0 | 66.7 | 20.2 |
+ | Claude-4-Sonnet† | 12.2 | 29.1 | 68.3† | 61.7 | 80.7 | 64.6 | 20.3 |
+ | GLM-4.5 | 26.4 | 37.5 | 66.0 | 65.6† | 78.9† | 70.0† | 21.2† |
+ | DeepSeek-V3.1 | 30.0 | 49.2 | 63.1† | 61.2† | 83.7 | 71.2 | 29.8 |
+ | Kimi-K2† | 14.1 | 28.8 | 57.7 | 63.0 | 72.0 | 50.0 | 18.1 |
+ |====|====|====|====|====|====|====|====|
+ | WebShaper-72B | - | - | **60.0** | 52.2 | - | - | - |
+ | WebShaper-32B (QwQ) | - | - | 53.3 | 49.7 | - | - | - |
+ | WebShaper-32B | - | - | 52.4 | 51.4 | - | - | - |
+ | WebSailor-72B | 12.0 | 30.1 | 55.4 | - | - | **55.0** | - |
+ | WebSailor-32B | 10.5 | 25.5 | 53.2 | - | - | 53.3 | - |
+ | WebSailor-7B | 6.7 | 14.2 | 33.0 | - | - | 34.3 | - |
+ | ASearcher-Web-QwQ | 5.2 | 15.6 | 52.8 | 34.3 | 70.9 | 42.1 | 12.5 |
+ | WebThinker-32B | 2.8 | - | 48.5 | 46.5 | - | - | 15.8 |
+ | MiroThinker-32B-DPO-v0.1 | 13.0 | 17.0 | 57.3 | 49.3 | 71.7 | - | 11.8 |
+ | MiroThinker-8B-DPO-v0.1 | 8.7 | 13.6 | 46.6 | 45.7 | 64.4 | - | - |
+ | WebExplorer-8B (SFT) | 7.9 | 21.3 | 43.7 | 59.8 | 72.6 | 47.5 | 16.0 |
+ | WebExplorer-8B (RL) | <u>**15.7**</u> | <u>**32.0**</u> | <u>50.0</u> | <u>**62.7**</u> | <u>**75.7**</u> | <u>53.7</u> | <u>**17.3**</u> |
+
+ Accuracy (%) of web agents on information-seeking benchmarks. BC-en and BC-zh denote BrowseComp-en and BrowseComp-zh respectively. XBench-DS refers to XBench-DeepSearch. **Bold** indicates the best performance among open-source models < 100B, while <u>underlined</u> values represent the best performance among models < 10B parameters. All scores of WebExplorer-8B are computed as Avg@4 using LLM-as-Judge. Entries marked with a dagger (†) were reproduced by us under our scaffold: on model name = entire row; on a number = that entry only.
 
 
  ## 🛠️ Tool Schema