---
language:
- zh
- en
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- transformers
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: apache-2.0
---

<h1 align="center">FlagEmbedding</h1>

For more details, please refer to our GitHub repository: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).

**BGE-Code-v1** is an LLM-based code embedding model that supports code retrieval, text retrieval, and multilingual retrieval. Its key capabilities are:
- Superior Code Retrieval Performance: the model delivers exceptional code retrieval, supporting natural language queries in both English and Chinese as well as 20 programming languages.
- Robust Text Retrieval Capabilities: the model maintains text retrieval performance comparable to text embedding models of similar scale.
- Extensive Multilingual Support: BGE-Code-v1 offers comprehensive multilingual retrieval, excelling in languages such as English, Chinese, Japanese, and French.

## Usage

### Using FlagEmbedding

```shell
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
```

```python
from FlagEmbedding import FlagLLMModel
queries = [
    "Delete the record with ID 4 from the 'Staff' table.", 
    'Delete all records in the "Livestock" table where age is greater than 5'
]
documents = [
    "DELETE FROM Staff WHERE StaffID = 4;",
    "DELETE FROM Livestock WHERE age > 5;"
]
model = FlagLLMModel('BAAI/bge-code-v1', 
                     query_instruction_format="<instruct>{}\n<query>{}",
                     query_instruction_for_retrieval="Given a question in text, retrieve SQL queries that are appropriate responses to the question.",
                     trust_remote_code=True,
                     use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
embeddings_1 = model.encode_queries(queries)
embeddings_2 = model.encode_corpus(documents)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```

By default, FlagLLMModel uses all available GPUs when encoding. To restrict encoding to specific GPUs, set `os.environ["CUDA_VISIBLE_DEVICES"]` before the model is created; setting it to an empty string (`""`) hides all GPUs and forces CPU-only encoding.
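
For example, a minimal sketch of selecting GPUs before loading the model (same `FlagLLMModel` setup as above; the environment variable should be set before the model is created):

```python
import os

# Expose only GPUs 0 and 1 to this process; set this before the model is created
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
# os.environ["CUDA_VISIBLE_DEVICES"] = ""  # uncomment to hide all GPUs and encode on CPU

from FlagEmbedding import FlagLLMModel

model = FlagLLMModel(
    'BAAI/bge-code-v1',
    query_instruction_format="<instruct>{}\n<query>{}",
    query_instruction_for_retrieval="Given a question in text, retrieve SQL queries that are appropriate responses to the question.",
    trust_remote_code=True,
    use_fp16=True,
)
```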

### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer
import torch

# Load the model, optionally in float16 precision for faster inference
model = SentenceTransformer("BAAI/bge-code-v1", model_kwargs={"torch_dtype": torch.float16, "trust_remote_code": True}, tokenizer_kwargs={"trust_remote_code": True})

# Prepare a prompt given an instruction
instruction = 'Given a question in text, retrieve SQL queries that are appropriate responses to the question.'
prompt = f'<instruct>{instruction}\n<query>'
# Prepare queries and documents
queries = [
    "Delete the record with ID 4 from the 'Staff' table.", 
    'Delete all records in the "Livestock" table where age is greater than 5'
]
documents = [
    "DELETE FROM Staff WHERE StaffID = 4;",
    "DELETE FROM Livestock WHERE age > 5;"
]

# Compute the query and document embeddings
query_embeddings = model.encode(queries, prompt=prompt)
document_embeddings = model.encode(documents)

# Compute the cosine similarity between the query and document embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
```

### Using HuggingFace Transformers

```python
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    """Pool each sequence by taking the hidden state of its last non-padding token."""
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        # With left padding, the final position is always a real token
        return last_hidden_states[:, -1]
    else:
        # With right padding, index each sequence at its last non-padding position
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'<instruct>{task_description}\n<query>{query}'


instruction = 'Given a question in text, retrieve SQL queries that are appropriate responses to the question.'
# Queries carry the instruction prefix; documents are embedded as-is
queries = [
    get_detailed_instruct(instruction, "Delete the record with ID 4 from the 'Staff' table."),
    get_detailed_instruct(instruction, 'Delete all records in the "Livestock" table where age is greater than 5')
]
documents = [
    "DELETE FROM Staff WHERE StaffID = 4;",
    "DELETE FROM Livestock WHERE age > 5;"
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-code-v1', trust_remote_code=True)
model = AutoModel.from_pretrained('BAAI/bge-code-v1', trust_remote_code=True)
model.eval()

max_length = 4096
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt', pad_to_multiple_of=8)

with torch.no_grad():
    outputs = model(**batch_dict)
    embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    
# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
```

## Evaluation

**BGE-Code-v1** achieves state-of-the-art performance on both the CoIR and CodeRAG benchmarks.

- CoIR

|                                 | CodeXEmbed-2B | CodeXEmbed-7B | Voyage-Code-002 | Voyage-Code-003 | BGE-Code-v1 |
|---------------------------------------|---------------|---------------|-----------------|-----------------|-----------|
| Apps                                  | 76.86         | 85.38         | 26.52           | 93.62           | 98.08     |
| CosQA                                 | 40.47         | 42.47         | 29.79           | 34.45           | 46.72     |
| Text2SQL                              | 78.42         | 78.94         | 69.26           | 62.87           | 64.35     |
| CSN                                   | 87.87         | 89.67         | 81.79           | 89.35           | 89.53     |
| CSN-CCR                               | 97.66         | 97.95         | 73.45           | 90.05           | 98.30     |
| CodeTrans-Contest                     | 90.30         | 94.45         | 72.77           | 94.96           | 94.38     |
| CodeTrans-DL                          | 38.57         | 40.46         | 27.48           | 38.57           | 46.13     |
| StackOverFlow-QA                      | 94.47         | 96.33         | 67.68           | 97.17           | 95.35     |
| CodeFeedBack-ST                       | 86.36         | 87.53         | 65.35           | 90.67           | 90.56     |
| CodeFeedBack-MT                       | 65.51         | 68.83         | 28.74           | 93.58           | 94.38     |
| AVG                                   | 75.65         | 78.20         | 56.26           | 78.53    | 81.77     |

- CodeRAG

|                 | HumanEval | MBPP | DS-1000 | ODEX | RepoEval | SWE-bench-Lite | AVG  |
| --------------- | ---------- | ---- | ------- | ---- | -------- | -------------- | ---- |
| SFR             | 100.0      | 99.0 | 19.3    | 37.1 | 83.8     | 62.7           | 67.0 |
| Jina-v2-code    | 100.0      | 97.7 | 26.2    | 19.9 | 90.5     | 58.3           | 65.4 |
| CodeXEmbed-2B   | 100.0      | 97.4 | 25.4    | 23.9 | 88.7     | 52.4           | 64.6 |
| Voyage-Code-002 | 100.0      | 99.0 | 33.1    | 26.6 | 94.3     | 29.1           | 63.7 |
| Voyage-Code-003 | 100.0      | 99.6 | 38.9    | 36.3 | 90.0     | 70.1           | 72.5 |
| BGE-Code-v1       | 100.0      | 99.2 | 40.9    | 36.1 | 93.1     | 67.4           | 72.8 |

## Citation

If you find this repository useful, please consider giving it a star :star: and a citation.

```bibtex
@article{bge-llm,
  title={Making text embedders few-shot learners},
  author={Li, Chaofan and Qin, MingHao and Xiao, Shitao and Chen, Jianlyu and Luo, Kun and Shao, Yingxia and Lian, Defu and Liu, Zheng},
  journal={arXiv preprint arXiv:2409.15700},
  year={2024}
}

@misc{bge-m3,
      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation}, 
      author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
      year={2024},
      eprint={2402.03216},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}


@misc{bge_embedding,
      title={C-Pack: Packaged Resources To Advance General Chinese Embedding}, 
      author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
      year={2023},
      eprint={2309.07597},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```