kdizzled committed on
Commit
92e1006
·
verified ·
1 Parent(s): 46a2044

Write model card

Files changed (1): README.md +67 -1
README.md CHANGED
@@ -5,4 +5,70 @@ language:
  base_model:
  - microsoft/codebert-base
  ---
- # Model Card for Model ID

# RocqStar Ranker Embedder

A self-attentive embedding model for premise/proof selection in Rocq-based interactive theorem proving (ITP). RocqStar is fine-tuned from CodeBERT so that distances in the embedding space correlate with proof similarity rather than with surface-level statement overlap. It replaces BM25/Jaccard similarity in retrieval-augmented generation (RAG) pipelines such as **CoqPilot**, leading to higher proof success rates on the IMM-300 benchmark.

## Model Details

### Model Description

RocqStar is a 108M-parameter Transformer encoder (12 layers, 768-dimensional hidden size) with multi-head self-attention and a learned self-attentive pooling head. It is trained with an InfoNCE contrastive objective so that the cosine similarity of two statement embeddings approximates the similarity of their proofs, measured by a hybrid Levenshtein + Jaccard distance. The model takes tokenised Rocq (Gallina) theorem statements as input and outputs a 768-dimensional embedding; a pooling sketch follows the details below.

* **Model type:** Transformer encoder with self-attentive pooling
* **Language(s):** Rocq/Coq (Gallina) syntax (tokens)
* **License:** Apache-2.0
* **Fine-tuned from:** `microsoft/codebert-base`
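
To make the architecture concrete, here is a minimal sketch of a self-attentive pooling head on top of a CodeBERT-style encoder. It is an illustration, not the released implementation: the `SelfAttentivePooling` class, the `embed` helper, and the use of the base `microsoft/codebert-base` checkpoint are assumptions; the actual inference code lives in the repository's `ranker-server` component.

```python
# Illustrative sketch only: the real pooling head and checkpoint layout may differ.
import torch
from transformers import AutoModel, AutoTokenizer

class SelfAttentivePooling(torch.nn.Module):
    """Collapse per-token states into one vector with learned attention weights."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.scorer = torch.nn.Linear(hidden_size, 1)

    def forward(self, token_states, attention_mask):
        scores = self.scorer(token_states).squeeze(-1)           # (batch, seq)
        scores = scores.masked_fill(attention_mask == 0, -1e9)   # ignore padding
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)    # (batch, seq, 1)
        return (weights * token_states).sum(dim=1)               # (batch, hidden)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")
pooler = SelfAttentivePooling()  # in the released model these weights are learned

def embed(statements: list[str]) -> torch.Tensor:
    """Return one 768-dimensional embedding per Rocq statement."""
    batch = tokenizer(statements, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state              # (batch, seq, 768)
        return pooler(hidden, batch["attention_mask"])           # (batch, 768)

example = embed(["Lemma add_comm : forall n m, n + m = m + n."])
```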

### Model Sources

* **Repository:** [https://github.com/JetBrains-Research/big-rocq](https://github.com/JetBrains-Research/big-rocq)
* **Paper:** (coming soon)

## Bias, Risks, and Limitations

* **Domain limitation:** Trained only on Rocq projects; performance degrades on other languages or on heavily tactic-macro-based proofs.
* **Data leakage:** May memorise proofs present in the training data and surface them verbatim.
* **Proof similarity assumption:** While improved over BM25, cosine distance may still fail on very long or highly creative proofs.

## How to Get Started

Go to [https://github.com/JetBrains-Research/big-rocq](https://github.com/JetBrains-Research/big-rocq). The `ranker-server` subdirectory contains the server you need to run in order to use the model from the CoqPilot plugin; it also serves as a reference example of how to run inference with the model. A minimal retrieval sketch is shown below.
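
Given such an embedding function, the retrieval step in a RAG pipeline reduces to ranking candidate premises by cosine similarity to the target statement. The sketch below reuses the hypothetical `embed` helper from the previous snippet and a top-7 cut-off matching the evaluation setup; it is not the ranker server's actual API.

```python
# Illustrative retrieval step: rank candidate premises by cosine similarity
# to the target statement, reusing the hypothetical embed() helper above.
import torch
import torch.nn.functional as F

def rank_candidates(target: str, candidates: list[str], top_k: int = 7):
    target_emb = embed([target])                        # (1, 768)
    cand_embs = embed(candidates)                       # (n, 768)
    sims = F.cosine_similarity(target_emb, cand_embs)   # broadcasts to (n,)
    order = torch.argsort(sims, descending=True)[:top_k]
    return [(candidates[i], sims[i].item()) for i in order.tolist()]
```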

## Training Details

### Training Data

* **Corpus:** 76,524 mined statements with their proofs from 4 large Rocq projects.
* **Positive/negative mining:** Proof distance < τ₊ (0.35) → positive; > τ₋ (0.6) → negative; 4 hard negatives per anchor (see the sketch below).
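
The contrastive pairs are mined from proof-to-proof distances. The sketch below shows only the thresholding logic with the τ values quoted above; the exact hybrid of Levenshtein and Jaccard distance (weights, tokenisation, normalisation) is not specified in this card, so the 50/50 mix is an assumption.

```python
# Illustrative mining sketch. The proof-distance hybridisation below is an
# assumption; only the thresholds follow the card.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def proof_distance(p1: str, p2: str) -> float:
    """Hybrid of normalised Levenshtein and Jaccard distance (assumed 50/50 mix)."""
    lev = levenshtein(p1, p2) / max(len(p1), len(p2), 1)
    t1, t2 = set(p1.split()), set(p2.split())
    jac = 1.0 - len(t1 & t2) / max(len(t1 | t2), 1)
    return 0.5 * lev + 0.5 * jac

TAU_POS, TAU_NEG = 0.35, 0.6   # thresholds quoted in the card

def label_pair(proof_a: str, proof_b: str):
    d = proof_distance(proof_a, proof_b)
    if d < TAU_POS:
        return "positive"
    if d > TAU_NEG:
        return "negative"      # candidates for the 4 hard negatives per anchor
    return None                # ambiguous pairs are discarded
```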

### Training Procedure

* **Objective:** InfoNCE with temperature T = 0.07 (see the sketch below)
* **Batch size:** 512 pairs effective (gradient accumulation: 8 × 64)
* **Optimizer / LR:** AdamW, lr = 4e-6, linear warm-up over 10 % of steps, 22k steps total
* **Hardware:** 1× NVIDIA H100 GPU, 160 GB RAM, 14 h wall-clock
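
For reference, a generic InfoNCE formulation consistent with the hyperparameters above (temperature 0.07, in-batch negatives); the handling of the 4 mined hard negatives per anchor is simplified away here.

```python
# Generic InfoNCE sketch with in-batch negatives; the actual negative layout
# (4 mined hard negatives per anchor) and batching details are simplified.
import torch
import torch.nn.functional as F

def info_nce(anchor_emb, positive_emb, temperature: float = 0.07):
    """anchor_emb, positive_emb: (batch, dim); row i of each forms a positive pair."""
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.T / temperature                    # cosine similarities / T
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)            # pull positives, push the rest
```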

## Evaluation

### Testing Data, Factors & Metrics

* **Dataset:** IMM-300 (300 Rocq theorems) from the IMM project
* **Metrics:** Downstream proof success rate of CoqPilot when given the top-7 retrieved premises, averaged over 12 generations (see the sketch below)
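
One plausible reading of the metric is sketched below, with placeholder helpers `generate_proof` and `proof_checks`; this is not CoqPilot's actual evaluation harness. Each theorem is attempted with the top-7 retrieved premises, and per-round success rates are averaged over 12 generation rounds.

```python
# Hypothetical evaluation loop; generate_proof() and proof_checks() are placeholders.
def mean_success_rate(theorems, premises, k: int = 7, generations: int = 12) -> float:
    rates = []
    for _ in range(generations):
        solved = 0
        for thm in theorems:
            context = rank_candidates(thm.statement, premises, top_k=k)
            candidate_proof = generate_proof(thm.statement, context)
            solved += int(proof_checks(thm, candidate_proof))
        rates.append(solved / len(theorems))
    return sum(rates) / generations
```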

### Results

| Model (back-end) | Bucket (proof length) | Baseline Jaccard | **RocqStar** |
| ---------------- | --------------------- | ---------------- | ------------ |
| GPT-4o           | ≤ 4                   | 48 ± 5 %         | **51 ± 5 %** |
| GPT-4o           | 5–8                   | 18 ± 4 %         | **25 ± 3 %** |
| GPT-4o           | 9–20                  | 11 ± 4 %         | 11 ± 5 %     |
| Claude 3.5       | ≤ 4                   | 58 ± 5 %         | **61 ± 4 %** |
| Claude 3.5       | 5–8                   | 28 ± 5 %         | **36 ± 5 %** |
| Claude 3.5       | 9–20                  | 16 ± 5 %         | **21 ± 5 %** |

#### Summary

RocqStar delivers consistent gains over token-overlap retrieval, up to a 28 % relative improvement, with the largest margins on medium-length theorems, where proof similarity diverges most from statement similarity.