kdizzled committed on
Commit
92e1006
·
verified ·
1 Parent(s): 46a2044

Write model card

Files changed (1): README.md +67 -1
README.md CHANGED
@@ -5,4 +5,70 @@ language:
  base_model:
  - microsoft/codebert-base
  ---
- # Model Card for Model ID

# RocqStar Ranker Embedder

A self-attentive embedding model for premise/proof selection in Rocq-based interactive theorem proving (ITP). RocqStar is fine-tuned from CodeBERT so that distances in the embedding space correlate with proof similarity rather than with surface-level statement overlap. It replaces BM25/Jaccard similarity in retrieval-augmented generation (RAG) pipelines such as **CoqPilot**, leading to higher proof success rates on the IMM-300 benchmark.

## Model Details

### Model Description

RocqStar is a 108M-parameter Transformer encoder (12 layers, 768-dimensional hidden size) with multi-head self-attention and a learned self-attentive pooling head. It is trained with an InfoNCE contrastive objective so that the cosine similarity of two statement embeddings approximates the similarity of their proofs, measured by a hybrid Levenshtein + Jaccard distance. The model takes tokenised Rocq (Gallina) theorem statements as input and outputs a 768-dimensional embedding; a pooling sketch follows the details below.

* **Model type:** Transformer encoder with self-attentive pooling
* **Language(s):** Rocq/Coq (Gallina) syntax (tokens)
* **License:** Apache-2.0
* **Fine-tuned from:** `microsoft/codebert-base`
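
To make the architecture concrete, here is a minimal sketch of a self-attentive pooling head on top of a CodeBERT-style encoder. It is an illustration, not the released implementation: the `SelfAttentivePooling` class, the `embed` helper, and the use of the base `microsoft/codebert-base` checkpoint are assumptions; the actual inference code lives in the repository's `ranker-server` component.

```python
# Illustrative sketch only: the real pooling head and checkpoint layout may differ.
import torch
from transformers import AutoModel, AutoTokenizer

class SelfAttentivePooling(torch.nn.Module):
    """Collapse per-token states into one vector with learned attention weights."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.scorer = torch.nn.Linear(hidden_size, 1)

    def forward(self, token_states, attention_mask):
        scores = self.scorer(token_states).squeeze(-1)           # (batch, seq)
        scores = scores.masked_fill(attention_mask == 0, -1e9)   # ignore padding
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)    # (batch, seq, 1)
        return (weights * token_states).sum(dim=1)               # (batch, hidden)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")
pooler = SelfAttentivePooling()  # in the released model these weights are learned

def embed(statements: list[str]) -> torch.Tensor:
    """Return one 768-dimensional embedding per Rocq statement."""
    batch = tokenizer(statements, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state              # (batch, seq, 768)
        return pooler(hidden, batch["attention_mask"])           # (batch, 768)

example = embed(["Lemma add_comm : forall n m, n + m = m + n."])
```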

### Model Sources

* **Repository:** [https://github.com/JetBrains-Research/big-rocq](https://github.com/JetBrains-Research/big-rocq)
* **Paper:** (coming soon)

## Bias, Risks, and Limitations

* **Domain limitation:** Trained only on Rocq projects; performance degrades on other languages or on heavily tactic-macro-based proofs.
* **Data leakage:** May memorise proofs present in the training data and surface them verbatim.
* **Proof similarity assumption:** While improved over BM25, cosine distance may still fail on very long or highly creative proofs.

## How to Get Started

Go to [https://github.com/JetBrains-Research/big-rocq](https://github.com/JetBrains-Research/big-rocq). The `ranker-server` subdirectory contains the server you need to run in order to use the model from the CoqPilot plugin; it also serves as a reference example of how to run inference with the model. A minimal retrieval sketch is shown below.
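
Given such an embedding function, the retrieval step in a RAG pipeline reduces to ranking candidate premises by cosine similarity to the target statement. The sketch below reuses the hypothetical `embed` helper from the previous snippet and a top-7 cut-off matching the evaluation setup; it is not the ranker server's actual API.

```python
# Illustrative retrieval step: rank candidate premises by cosine similarity
# to the target statement, reusing the hypothetical embed() helper above.
import torch
import torch.nn.functional as F

def rank_candidates(target: str, candidates: list[str], top_k: int = 7):
    target_emb = embed([target])                        # (1, 768)
    cand_embs = embed(candidates)                       # (n, 768)
    sims = F.cosine_similarity(target_emb, cand_embs)   # broadcasts to (n,)
    order = torch.argsort(sims, descending=True)[:top_k]
    return [(candidates[i], sims[i].item()) for i in order.tolist()]
```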

## Training Details

### Training Data

* **Corpus:** 76,524 mined statements with their proofs from 4 large Rocq projects.
* **Positive/negative mining:** Proof distance < τ₊ (0.35) → positive; > τ₋ (0.6) → negative; 4 hard negatives per anchor (see the sketch below).
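
The contrastive pairs are mined from proof-to-proof distances. The sketch below shows only the thresholding logic with the τ values quoted above; the exact hybrid of Levenshtein and Jaccard distance (weights, tokenisation, normalisation) is not specified in this card, so the 50/50 mix is an assumption.

```python
# Illustrative mining sketch. The proof-distance hybridisation below is an
# assumption; only the thresholds follow the card.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def proof_distance(p1: str, p2: str) -> float:
    """Hybrid of normalised Levenshtein and Jaccard distance (assumed 50/50 mix)."""
    lev = levenshtein(p1, p2) / max(len(p1), len(p2), 1)
    t1, t2 = set(p1.split()), set(p2.split())
    jac = 1.0 - len(t1 & t2) / max(len(t1 | t2), 1)
    return 0.5 * lev + 0.5 * jac

TAU_POS, TAU_NEG = 0.35, 0.6   # thresholds quoted in the card

def label_pair(proof_a: str, proof_b: str):
    d = proof_distance(proof_a, proof_b)
    if d < TAU_POS:
        return "positive"
    if d > TAU_NEG:
        return "negative"      # candidates for the 4 hard negatives per anchor
    return None                # ambiguous pairs are discarded
```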

### Training Procedure

* **Objective:** InfoNCE with temperature T = 0.07 (see the sketch below)
* **Batch size:** 512 pairs effective (gradient accumulation: 8 × 64)
* **Optimizer / LR:** AdamW, lr = 4e-6, linear warm-up over 10 % of steps, 22k steps total
* **Hardware:** 1× NVIDIA H100 GPU, 160 GB RAM, 14 h wall-clock
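
For reference, a generic InfoNCE formulation consistent with the hyperparameters above (temperature 0.07, in-batch negatives); the handling of the 4 mined hard negatives per anchor is simplified away here.

```python
# Generic InfoNCE sketch with in-batch negatives; the actual negative layout
# (4 mined hard negatives per anchor) and batching details are simplified.
import torch
import torch.nn.functional as F

def info_nce(anchor_emb, positive_emb, temperature: float = 0.07):
    """anchor_emb, positive_emb: (batch, dim); row i of each forms a positive pair."""
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.T / temperature                    # cosine similarities / T
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)            # pull positives, push the rest
```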

## Evaluation

### Testing Data, Factors & Metrics

* **Dataset:** IMM-300 (300 Rocq theorems) from the IMM project
* **Metrics:** Downstream proof success rate of CoqPilot when given the top-7 retrieved premises, averaged over 12 generations (see the sketch below)
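
One plausible reading of the metric is sketched below, with placeholder helpers `generate_proof` and `proof_checks`; this is not CoqPilot's actual evaluation harness. Each theorem is attempted with the top-7 retrieved premises, and per-round success rates are averaged over 12 generation rounds.

```python
# Hypothetical evaluation loop; generate_proof() and proof_checks() are placeholders.
def mean_success_rate(theorems, premises, k: int = 7, generations: int = 12) -> float:
    rates = []
    for _ in range(generations):
        solved = 0
        for thm in theorems:
            context = rank_candidates(thm.statement, premises, top_k=k)
            candidate_proof = generate_proof(thm.statement, context)
            solved += int(proof_checks(thm, candidate_proof))
        rates.append(solved / len(theorems))
    return sum(rates) / generations
```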

### Results

| Model (back-end) | Bucket (proof length) | Baseline Jaccard | **RocqStar** |
| ---------------- | --------------------- | ---------------- | ------------ |
| GPT-4o           | ≤ 4                   | 48 ± 5 %         | **51 ± 5 %** |
| GPT-4o           | 5–8                   | 18 ± 4 %         | **25 ± 3 %** |
| GPT-4o           | 9–20                  | 11 ± 4 %         | 11 ± 5 %     |
| Claude 3.5       | ≤ 4                   | 58 ± 5 %         | **61 ± 4 %** |
| Claude 3.5       | 5–8                   | 28 ± 5 %         | **36 ± 5 %** |
| Claude 3.5       | 9–20                  | 16 ± 5 %         | **21 ± 5 %** |

#### Summary

RocqStar delivers consistent gains over token-overlap retrieval, up to a 28 % relative improvement, with the largest margins on medium-length theorems, where proof similarity diverges most from statement similarity.