YuPeng0214 committed 18e145e (verified) · 1 parent: 75f9ff2

Upload folder using huggingface_hub
.DS_Store ADDED
Binary file (10.2 kB).
 
.gitattributes CHANGED
@@ -33,3 +33,10 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ image-1.png filter=lfs diff=lfs merge=lfs -text
37
+ image-10.png filter=lfs diff=lfs merge=lfs -text
38
+ image-11.png filter=lfs diff=lfs merge=lfs -text
39
+ image-16.png filter=lfs diff=lfs merge=lfs -text
40
+ image-18.png filter=lfs diff=lfs merge=lfs -text
41
+ image-9.png filter=lfs diff=lfs merge=lfs -text
42
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "word_embedding_dimension": 3584,
3
+ "pooling_mode_cls_token": false,
4
+ "pooling_mode_mean_tokens": true,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false,
7
+ "pooling_mode_weightedmean_tokens": false,
8
+ "pooling_mode_lasttoken": false,
9
+ "include_prompt": true
10
+ }
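This pooling config selects masked mean pooling (`pooling_mode_mean_tokens: true`). With `include_prompt: true`, instruction tokens are averaged together with the text tokens. The sketch below is a minimal NumPy version of that computation. It assumes token embeddings of shape `(batch, seq_len, dim)` (here `dim` would be 3584) and a 0/1 attention mask. `mean_pool` is a hypothetical helper written for illustration and is not the sentence-transformers implementation:

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Masked mean over the sequence axis.

    token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    Padding positions (mask == 0) are excluded. Every real token, including any
    instruction prefix, contributes equally (include_prompt = true).
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts


# Two toy sequences; the second is left-padded (padding_side="left" in the examples)
emb = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
mask = np.array([[1, 1, 1], [0, 1, 1]])
pooled = mean_pool(emb, mask)
print(pooled.shape)  # (2, 4)
```

Because padding is masked out, the result does not depend on whether the tokenizer pads on the left or on the right.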
README.md CHANGED
@@ -1,3 +1,206 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+
2
+ # QZhou-Embedding
3
+ <div align="center">
4
+ <img src="image-1.png" width="800" height="300"></img>
5
+ </div>
6
+
7
+ ## Introduction
8
+ We have released <a href="https://huggingface.co/Kingsoft-LLM/QZhou-Embedding">QZhou-Embedding</a> (Chinese name "Qingzhou Embedding"), a large-scale, general-purpose text embedding model. It excels at a wide range of embedding tasks, including retrieval, reranking, sentence-pair similarity, and classification. Its base model was pre-trained on massive amounts of text, so QZhou-Embedding inherits strong general language capabilities and produces more powerful text representations. The model is continually trained on million-scale, high-quality open-source retrieval data and on more than 5 million high-quality synthetic samples. The synthetic samples are generated with two techniques: rewriting and expansion. A first stage of retrieval training gives the model its foundation in query-document semantic matching. A second stage of multi-dimensional training, covering STS, clustering, and other tasks, then improves performance further across scenarios. QZhou-Embedding has 7B parameters and can embed texts up to 8k tokens long. It achieved the highest average score on the mteb/cmteb evaluation benchmarks. It also achieved the highest task-level average scores for clustering, pair classification, reranking, and STS.
9
+ ## Basic Features
10
+
11
+ - Powerful text embedding capabilities;
12
+ - Long context: up to 8k context length;
13
+ - 7B parameter size
14
+
15
+
16
+ ## Technical Introduction
17
+ ### Unified Task Modeling Framework
18
+ We unify text embedding objectives into three major modeling problems: retrieval, STS, and classification (CLS). For these, we propose a unified scheme for structuring training data, together with a matching training mechanism. This approach lets most open-source data be incorporated as retrieval training data. The data can be structured as follows:
19
+ - Retrieval
20
+ - title-body
21
+ - title-abstract
22
+ - Question Answering Dataset
23
+ - Reading comprehension
24
+ - ...
25
+
26
+ - STS
27
+ - text pair + label in {true, false} or {yes, no}
28
+ - text pair + score (such as 0.2, 3.1, 4.8, etc.)
29
+ - NLI dataset: text pair + label in {'entailment', 'neutral', 'contradiction'}
30
+
31
+ - CLS
32
+ - text + class label
33
+
34
+ <div align="center"><img src="image-18.png" width="1000" height="600"></img></div>
35
+ <div align="center"><img src="image-16.png" width="1000" height="550"></img></div>
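The data structuring above routes every source into one of three record shapes. The following is a minimal sketch of what those shapes could look like. It is illustrative only: the class names, fields, and the label-to-score mapping are hypothetical assumptions, because the README does not specify how binary or NLI labels are converted. The mapping shown only needs to preserve order, since CoSENT compares scores relative to one another:

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class RetrievalExample:
    # title-body, title-abstract, QA and reading-comprehension data all map here
    query: str
    positive: str
    negatives: List[str] = field(default_factory=list)


@dataclass
class STSExample:
    text1: str
    text2: str
    score: float  # only the relative order of scores matters for CoSENT


@dataclass
class CLSExample:
    text: str
    label: str


# Hypothetical mapping from categorical labels to ordered scores (an assumption)
_LABEL_SCORES = {
    "true": 1.0, "yes": 1.0,
    "false": 0.0, "no": 0.0,
    "entailment": 2.0, "neutral": 1.0, "contradiction": 0.0,
}


def to_sts_example(text1: str, text2: str, label: Union[str, float, int]) -> STSExample:
    """Normalize binary, graded, and NLI labels into a single numeric STS score."""
    if isinstance(label, str):
        key = label.strip().lower()
        if key not in _LABEL_SCORES:
            raise ValueError(f"unknown STS label: {label!r}")
        return STSExample(text1, text2, _LABEL_SCORES[key])
    return STSExample(text1, text2, float(label))


print(to_sts_example("A man is eating.", "Someone eats food.", "entailment"))
```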
36
+
37
+ ### Training Objectives
38
+
39
+ - Retrieval: We apply the InfoNCE contrastive loss. Following gte/qwen3-embedding, we also add query-query negatives to the denominator.<br>
40
+ $$
41
+ L_{ret}=-\frac{1}{n}\sum_{i} log{\frac{e^{sim(q_i,d_i^+)/\tau}}{e^{sim(q_i,d_i^+)/\tau}+\sum_{j}e^{sim(q_i,d_j^-)/\tau}+\sum_{j≠i}e^{sim(q_i,q_j)/\tau}}}
42
+ $$
43
+
44
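+ - STS: We apply the CoSENT loss. The sum runs over all pairs of text pairs $(i,j)$, $(k,l)$ for which the ground-truth similarity label of $(i,j)$ is higher than that of $(k,l)$: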
45
+ $$
46
+ L_{cosent}=log \bigg(1+\sum_{sim(i,j)>sim(k,l)}exp(\frac{sim(x_k, x_l)-sim(x_i,x_j)}{\tau})\bigg)
47
+ $$
48
+
49
+ - CLS: We apply the same InfoNCE loss as for retrieval. However, in-batch negatives frequently contain samples of the same class as the anchor. A mask therefore removes these same-class samples from the negatives shared across the batch.
50
+ $$
51
+ L_{cls}=-\frac{1}{n}\sum_{i} log{\frac{e^{sim(t_i,t_i^+)/\tau}}{e^{sim(t_i,t_i^+)/\tau}+\sum_{n}MASK(t_i,t_{i,n}^-)·e^{sim(t_i,t_{i,n}^-)/\tau}+\sum_{j≠i}MASK(t_i,t_j)·e^{sim(t_i,t_j)/\tau}+\sum_{j≠i}\sum_{n}MASK(t_i,t_{j,n}^-)·e^{sim(t_i,t_{j,n}^-)/\tau}}}
52
+ $$
53
+ $$
54
+ where\:\:C_{t_i}=C_{t_i^+}
55
+ $$
56
+ $$
57
+ MASK(t_i, t_j)=
58
+ \begin{cases}
59
+ 0 & \quad \text{if } C_{t_i}=C_{t_j}, \\
60
+ 1 & \quad \text{otherwise}
61
+ \end{cases}
62
+ $$
63
+ Here $C_{t_i}$ is the class label of sample $t_i$, and $n$ is the number of negative samples per data point.
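The training objectives above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' training code, and several choices in it are assumptions:

- `contrastive_loss` and `cosent_loss` are hypothetical names.
- The temperature value is assumed.
- The retrieval formula's $\sum_j$ over $d_j^-$ is read as running over every hard negative in the batch.

Passing class labels turns the retrieval loss into the masked CLS variant. Masked terms (MASK = 0) are given a logit of $-\infty$, so they contribute nothing to the denominator:

```python
import numpy as np


def _normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def contrastive_loss(anchors, positives, negatives, tau=0.05,
                     labels=None, neg_labels=None) -> float:
    """InfoNCE with hard negatives plus anchor-anchor (query-query) negatives.

    anchors, positives: (B, D); negatives: (B, N, D), i.e. N hard negatives per anchor.
    Without labels this is L_ret. With labels (B,) and neg_labels (B, N), candidates
    that share the anchor's class are masked out (the CLS variant).
    """
    B, N, D = negatives.shape
    a, p = _normalize(anchors), _normalize(positives)
    n = _normalize(negatives.reshape(B * N, D))

    pos = np.sum(a * p, axis=1) / tau      # sim(q_i, d_i^+) / tau, shape (B,)
    s_neg = a @ n.T / tau                  # every hard negative in the batch, (B, B*N)
    s_aa = a @ a.T / tau                   # sim(q_i, q_j) / tau, (B, B)

    keep_neg = np.ones_like(s_neg, dtype=bool)
    keep_aa = ~np.eye(B, dtype=bool)       # only j != i
    if labels is not None:
        labels = np.asarray(labels)
        keep_aa &= labels[:, None] != labels[None, :]
        if neg_labels is not None:
            flat = np.asarray(neg_labels).reshape(B * N)
            keep_neg &= labels[:, None] != flat[None, :]

    # Masked entries become -inf, i.e. exp(-inf) = 0 in the denominator
    logits = np.concatenate(
        [pos[:, None], np.where(keep_neg, s_neg, -np.inf), np.where(keep_aa, s_aa, -np.inf)],
        axis=1,
    )
    log_denominator = np.logaddexp.reduce(logits, axis=1)
    return float(np.mean(log_denominator - pos))


def cosent_loss(cos_sims, gold_scores, tau=0.05) -> float:
    """CoSENT: a soft penalty over every pair of text pairs ordered by gold score,
    which grows when the predicted cosine order contradicts the gold order."""
    s = np.asarray(cos_sims, dtype=float) / tau
    y = np.asarray(gold_scores, dtype=float)
    diff = s[None, :] - s[:, None]          # diff[a, b] = s_b - s_a
    ordered = y[:, None] > y[None, :]       # gold score of pair a is higher than pair b
    terms = np.concatenate([[0.0], diff[ordered]])  # the leading 0.0 is the "1 +" term
    return float(np.logaddexp.reduce(terms))
```

In this sketch, `np.logaddexp.reduce` evaluates the log of the denominator without overflow, which matters at small temperatures.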
64
+ ### Feature Enhancement Data Synthesis Technology
65
+ Today's LLMs have strong language and writing capabilities, and we make full use of LLM APIs to build data synthesis techniques. To address limited data and narrow topics/features in training sets, we propose two synthesis techniques: rewriting and expansion. In addition, we want training negatives to be harder. To that end, we complement existing hard-negative sampling with strong retrievers with an LLM-based hard-negative synthesis technique. These techniques are described below:
66
+ <div align="center"><img src="image-9.png" width="930" height="290"></img></div>
67
+ <div align="center"><img src="image-10.png" width="880" height="220"></img></div>
68
+ <div align="center"><img src="image-11.png" width="880" height="210"></img></div>
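The retriever-based hard-negative sampling mentioned above is commonly done in two steps. First, the corpus is ranked with an existing embedding model. Then the top-ranked documents that are not labeled positives are kept. The sketch below is a minimal version under that assumption; it is not the authors' pipeline:

- `mine_hard_negatives`, `k`, and `skip_top` are hypothetical names.
- The `skip_top` guard against unlabeled positives is an optional, common practice rather than something the README specifies.
- The LLM-based synthesis step is not shown, because its prompts are not described here.

```python
import numpy as np


def mine_hard_negatives(query_emb, corpus_emb, positive_ids, k=3, skip_top=0):
    """Return indices of the k most similar corpus documents that are not positives.

    query_emb: (D,); corpus_emb: (M, D). Both are assumed to be L2-normalized,
    so the dot product equals cosine similarity.
    skip_top optionally drops the highest-ranked non-positives, which are the
    candidates most likely to be unlabeled true positives (false negatives).
    """
    scores = corpus_emb @ query_emb
    ranked = np.argsort(-scores)                  # most similar first
    excluded = set(positive_ids)
    candidates = [int(i) for i in ranked if int(i) not in excluded]
    return candidates[skip_top:skip_top + k]


corpus = np.eye(4)                                 # four toy documents
query = np.array([0.9, 0.4, 0.1, 0.0])
query = query / np.linalg.norm(query)
print(mine_hard_negatives(query, corpus, positive_ids=[0], k=2))
```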
69
+
70
+ For more details, including how to reproduce the evaluation results and the instruction contents and formatting rules, please refer to our <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a> repo.
71
+
72
+ ## Evaluation Results
73
+ ### mteb details
74
+ <div align="center"><img src="image-7.png" width="1100" height="260"></img></div>
75
+
76
+ ### cmteb details
77
+ <div align="center"><img src="image-8.png" width="1000" height="260"></img></div>
78
+
79
+ ## Usage
80
+ ### Completely reproduce the benchmark results
81
+ We provide the exact environment dependencies and model-loading arguments, so that you can reproduce results fully consistent with the mteb leaderboard on your own machine.
82
+ #### Requirements
83
+ - Python: 3.10.12
84
+ - Sentence Transformers: 3.4.1
85
+ - Transformers: 4.51.1
86
+ - PyTorch: 2.7.1
87
+ - Accelerate: 1.3.0
88
+ - Datasets: 3.2.0
89
+ - Tokenizers: 0.21.2
90
+ #### Transformers model load arguments
91
+ torch_dtype=torch.bfloat16<br>
92
+ attn_implementation='sdpa'<br>
93
+ **NOTE:** The leaderboard results were obtained in sdpa mode. Other attention implementations ('eager', 'flash_attention_2') may give slightly different scores, but overall performance remains consistent.
94
+ #### Instruction Adding Rules
95
+ Details can be found on our <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>.
96
+ #### Evaluation code usage
97
+ Find our benchmark evaluation code on <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>. The mteb benchmark script is **run_mteb_all_v2.py**, and the cmteb benchmark script is **run_cmteb_all.py**. Run the following commands:
98
+ ```
99
+ POOLING_MODE=mean
100
+ normalize=true
101
+ use_instruction=true
102
+ export TOKENIZERS_PARALLELISM=true
103
+
104
+ model_name_or_path=<model dir>
105
+
106
+ python3 ./run_cmteb_all.py \
107
+ --model_name_or_path ${model_name_or_path} \
108
+ --pooling_mode ${POOLING_MODE} \
109
+ --normalize ${normalize} \
110
+ --use_instruction ${use_instruction} \
111
+ --output_dir <output dir>
112
+
113
+ python3 ./run_mteb_all_v2.py \
114
+ --model_name_or_path ${model_name_or_path} \
115
+ --pooling_mode ${POOLING_MODE} \
116
+ --normalize ${normalize} \
117
+ --use_instruction ${use_instruction} \
118
+ --output_dir <output dir>
119
+ ```
120
+ Replace each `<...>` placeholder with your own setting.<br>
121
+ This is a general-purpose script that can also evaluate other Hugging Face embedding models, as long as the pooling mode and other settings are configured correctly for that model.
122
+
123
+ ### Sentence Transformers
124
+
125
+ ```python
+ import torch
+ from sentence_transformers import SentenceTransformer
+
+ # Load once, with the leaderboard dtype (bfloat16)
+ model = SentenceTransformer(
+     "Kingsoft-LLM/QZhou-Embedding",
+     model_kwargs={"device_map": "auto", "trust_remote_code": True, "torch_dtype": torch.bfloat16},
+     tokenizer_kwargs={"padding_side": "left", "trust_remote_code": True},
+     trust_remote_code=True,
+ )
+
+ queries = [
+     "What is photosynthesis?",
+     "Who invented the telephone?",
+ ]
+ documents = [
+     "Photosynthesis is the process by which green plants use sunlight, carbon dioxide, and water to produce glucose and oxygen. This biochemical reaction occurs in chloroplasts.",
+     "Alexander Graham Bell is credited with inventing the first practical telephone in 1876, receiving US patent number 174,465 for his device.",
+ ]
+
+ # prompt_name="query" prepends the "query" instruction defined in config_sentence_transformers.json;
+ # documents are encoded without an instruction
+ query_embeddings = model.encode(queries, prompt_name="query", normalize_embeddings=True)
+ document_embeddings = model.encode(documents, normalize_embeddings=True)
+
+ # Cosine similarity matrix of shape (num_queries, num_documents)
+ similarity = model.similarity(query_embeddings, document_embeddings)
+ print(similarity)
+ ```
151
+
152
+ ### Hugging Face Transformers
153
+
154
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from torch import Tensor
+ from transformers import AutoTokenizer, AutoModel
+
+
+ def mean_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
+     # Masked mean over tokens, matching "pooling_mode_mean_tokens": true in 1_Pooling/config.json
+     mask = attention_mask.unsqueeze(-1).to(last_hidden_states.dtype)
+     summed = (last_hidden_states * mask).sum(dim=1)
+     counts = mask.sum(dim=1).clamp(min=1e-9)
+     return summed / counts
+
+
+ def get_detailed_instruct(task_description: str, query: str) -> str:
+     # Same format as the "query" prompt in config_sentence_transformers.json
+     return f'Instruct: {task_description}\nQuery: {query}'
+
+
+ task = 'Given a web search query, retrieve relevant passages that answer the query'
+
+ queries = [
+     get_detailed_instruct(task, 'What is photosynthesis?'),
+     get_detailed_instruct(task, 'Who invented the telephone?'),
+ ]
+
+ documents = [
+     "Photosynthesis is the process by which green plants use sunlight, carbon dioxide, and water to produce glucose and oxygen. This biochemical reaction occurs in chloroplasts.",
+     "Alexander Graham Bell is credited with inventing the first practical telephone in 1876, receiving US patent number 174,465 for his device.",
+ ]
+
+ input_texts = queries + documents
+
+ tokenizer = AutoTokenizer.from_pretrained('Kingsoft-LLM/QZhou-Embedding', padding_side='left', trust_remote_code=True)
+ model = AutoModel.from_pretrained(
+     'Kingsoft-LLM/QZhou-Embedding',
+     trust_remote_code=True,
+     device_map='auto',
+     torch_dtype=torch.bfloat16,   # load arguments used for the leaderboard results
+     attn_implementation='sdpa',
+ )
+ model.eval()
+
+ batch_dict = tokenizer(
+     input_texts,
+     padding=True,
+     truncation=True,
+     max_length=8192,
+     return_tensors="pt",
+ ).to(model.device)
+
+ with torch.no_grad():
+     outputs = model(**batch_dict)
+ embeddings = mean_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
+
+ # L2-normalize (in float32) so that dot products are cosine similarities
+ embeddings = F.normalize(embeddings.float(), p=2, dim=1)
+ scores = embeddings[:2] @ embeddings[2:].T  # shape: (num_queries, num_documents)
+ print(scores.tolist())
+ ```
README_zh.md ADDED
@@ -0,0 +1,207 @@
1
+
2
+ # QZhou-Embedding
3
+ <div align="center">
4
+ <img src="image-1.png" width="800" height="300"></img>
5
+ </div>
6
+
7
+ ## Introduction
8
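+ We are releasing <a href="https://huggingface.co/Kingsoft-LLM/QZhou-Embedding">QZhou-Embedding</a> (Qingzhou Embedding😈😈😈), a large text embedding model for the general domain. It excels at a variety of text embedding tasks: retrieval, reranking, sentence-pair similarity, and classification. Its base model acquired general language capabilities through pre-training on massive amounts of text, and this lets QZhou-Embedding produce more powerful text embeddings. QZhou-Embedding is continually trained on million-scale, high-quality open-source retrieval data and on more than 5 million high-quality synthetic samples, produced with two synthesis techniques: rewriting and expansion. A first stage of retrieval training gives the model its foundation in query-doc semantic matching. A second stage trains multiple capabilities, such as STS and clustering, and helps the model keep improving across scenarios. QZhou-Embedding has 7B parameters and can embed long texts of up to 8k tokens. It achieved the highest overall average on the mteb/cmteb benchmarks. It also achieved the highest task-level averages for clustering, pair classification, reranking, and STS.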
9
+
10
+ ## Basic Features of QZhou-Embedding
11
+
12
+ - Powerful text embedding capabilities;
13
+ - Long context: up to 8k tokens;
14
+ - 7B parameters
15
+
16
+
17
+ ## Technical Introduction
18
+ ### Unified Task Modeling Framework
19
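+ We unify text embedding objectives into three major modeling problems. For these, we propose a unified scheme for structuring training data, together with a corresponding training mechanism, which can incorporate most open-source data as retrieval training data. The data can be structured as follows: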
20
+ - Retrieval
21
+ - title-body
22
+ - title-abstract
23
+ - Question-answering data
24
+ - Reading comprehension
25
+ - ...
26
+
27
+ - STS
28
+ - text pair + label in {true, false} or {yes, no}
29
+ - text pair + score (such as 0.2, 3.1, 4.8, etc.)
30
+ - NLI data: text pair + label in {'entailment', 'neutral', 'contradiction'}
31
+
32
+ - CLS
33
+ - sentence + class label
34
+
35
+ <div align="center"><img src="image-18.png" width="1000" height="600"></img></div>
36
+ <div align="center"><img src="image-16.png" width="1000" height="550"></img></div>
37
+
38
+ ### Training Objectives
39
+
40
+ - Retrieval: We use the InfoNCE contrastive loss. Following the improvement in gte/qwen3-embedding, we add a penalty on query-query negative pairs.<br>
41
+ $$
42
+ L_{ret}=-\frac{1}{n}\sum_{i} log{\frac{e^{sim(q_i,d_i^+)/\tau}}{e^{sim(q_i,d_i^+)/\tau}+\sum_{j}e^{sim(q_i,d_j^-)/\tau}+\sum_{j≠i}e^{sim(q_i,q_j)/\tau}}}
43
+ $$
44
+
45
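+ - STS: We use the CoSENT loss. The sum runs over all pairs of text pairs $(i,j)$, $(k,l)$ for which the ground-truth similarity label of $(i,j)$ is higher than that of $(k,l)$: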
46
+ $$
47
+ L_{cosent}=log \bigg(1+\sum_{sim(i,j)>sim(k,l)}exp(\frac{sim(x_k, x_l)-sim(x_i,x_j)}{\tau})\bigg)
48
+ $$
49
+
50
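+ - CLS: As with retrieval, we use the InfoNCE loss. Because same-class conflicts are likely among in-batch negatives, a mask hides same-class samples within the negatives shared across different samples.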
51
+ $$
52
+ L_{cls}=-\frac{1}{n}\sum_{i} log{\frac{e^{sim(t_i,t_i^+)/\tau}}{e^{sim(t_i,t_i^+)/\tau}+\sum_{n}MASK(t_i,t_{i,n}^-)·e^{sim(t_i,t_{i,n}^-)/\tau}+\sum_{j≠i}MASK(t_i,t_j)·e^{sim(t_i,t_j)/\tau}+\sum_{j≠i}\sum_{n}MASK(t_i,t_{j,n}^-)·e^{sim(t_i,t_{j,n}^-)/\tau}}}
53
+ $$
54
+ $$
55
+ where\:\:C_{t_i}=C_{t_i^+}
56
+ $$
57
+ $$
58
+ MASK(t_i, t_j)=
59
+ \begin{cases}
60
+ 0 & \quad \text{if } C_{t_i}=C_{t_j}, \\
61
+ 1 & \quad \text{otherwise}
62
+ \end{cases}
63
+ $$
64
+ Here $C_{t_i}$ is the class label of sample $t_i$, and $n$ is the number of negative samples per data point.
65
+
66
+ ### Feature Enhancement Data Synthesis Techniques
67
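+ Today's large models have strong language and writing capabilities, and we make full use of LLM APIs to design data synthesis techniques. To address scarce data and narrow topics in training sets, we propose rewriting and expansion synthesis techniques. In addition, to make negatives harder during training, we add an LLM-based hard-negative synthesis technique on top of existing hard-negative sampling with strong embedding models. These techniques are described below: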
68
+ <div align="center"><img src="image-9.png" width="930" height="290"></img></div>
69
+ <div align="center"><img src="image-10.png" width="880" height="220"></img></div>
70
+ <div align="center"><img src="image-11.png" width="880" height="210"></img></div>
71
+
72
+ For more information (such as evaluation scripts and instruction formats), please visit our <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>.
73
+
74
+ ## Evaluation Results
75
+ ### mteb leaderboard details
76
+ <div align="center"><img src="image-7.png" width="1100" height="260"></img></div>
77
+
78
+ ### cmteb leaderboard details
79
+ <div align="center"><img src="image-8.png" width="1000" height="260"></img></div>
80
+
81
+ ## Usage Guide
82
+ ### Fully reproducing the leaderboard results
83
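+ We provide detailed parameters and environment configurations, including dependency versions and model-loading arguments, so that you can reproduce results fully consistent with the leaderboard on your own machine.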
84
+ #### Dependency versions
85
+ - Python: 3.10.12
86
+ - Sentence Transformers: 3.4.1
87
+ - Transformers: 4.51.1
88
+ - PyTorch: 2.7.1
89
+ - Accelerate: 1.3.0
90
+ - Datasets: 3.2.0
91
+ - Tokenizers: 0.21.2
92
+ #### Model loading arguments
93
+ torch_dtype=torch.bfloat16<br>
94
+ attn_implementation='sdpa'<br>
95
+ **Note:** The leaderboard results use sdpa mode. Other modes ('eager', 'flash_attention_2') show slight deviations but do not affect overall performance.
96
+ #### Instruction adding rules
97
+ These can be found on our <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>.
98
+ #### Using the evaluation code
99
+ Find our evaluation code on <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>. The mteb evaluation script is **run_mteb_all_v2.py**, and the cmteb evaluation script is **run_cmteb_all.py**. Run the following commands:
100
+ ```
101
+ POOLING_MODE=mean
102
+ normalize=true
103
+ use_instruction=true
104
+ export TOKENIZERS_PARALLELISM=true
105
+
106
+ model_name_or_path=<model dir>
107
+
108
+ python3 ./run_cmteb_all.py \
109
+ --model_name_or_path ${model_name_or_path} \
110
+ --pooling_mode ${POOLING_MODE} \
111
+ --normalize ${normalize} \
112
+ --use_instruction ${use_instruction} \
113
+ --output_dir <output dir>
114
+
115
+ python3 ./run_mteb_all_v2.py \
116
+ --model_name_or_path ${model_name_or_path} \
117
+ --pooling_mode ${POOLING_MODE} \
118
+ --normalize ${normalize} \
119
+ --use_instruction ${use_instruction} \
120
+ --output_dir <output dir>
121
+ ```
122
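+ This is a general-purpose script that can also evaluate other Hugging Face embedding models, as long as the pooling mode and other settings are configured correctly.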
123
+
124
+ ### Sentence Transformers
125
+
126
+ ```python
+ import torch
+ from sentence_transformers import SentenceTransformer
+
+ # Load once, with the leaderboard dtype (bfloat16)
+ model = SentenceTransformer(
+     "Kingsoft-LLM/QZhou-Embedding",
+     model_kwargs={"device_map": "auto", "trust_remote_code": True, "torch_dtype": torch.bfloat16},
+     tokenizer_kwargs={"padding_side": "left", "trust_remote_code": True},
+     trust_remote_code=True,
+ )
+
+ queries = [
+     "What is photosynthesis?",
+     "Who invented the telephone?",
+ ]
+ documents = [
+     "Photosynthesis is the process by which green plants use sunlight, carbon dioxide, and water to produce glucose and oxygen. This biochemical reaction occurs in chloroplasts.",
+     "Alexander Graham Bell is credited with inventing the first practical telephone in 1876, receiving US patent number 174,465 for his device.",
+ ]
+
+ # prompt_name="query" prepends the "query" instruction defined in config_sentence_transformers.json;
+ # documents are encoded without an instruction
+ query_embeddings = model.encode(queries, prompt_name="query", normalize_embeddings=True)
+ document_embeddings = model.encode(documents, normalize_embeddings=True)
+
+ # Cosine similarity matrix of shape (num_queries, num_documents)
+ similarity = model.similarity(query_embeddings, document_embeddings)
+ print(similarity)
+ ```
152
+
153
+ ### Hugging Face Transformers
154
+
155
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from torch import Tensor
+ from transformers import AutoTokenizer, AutoModel
+
+
+ def mean_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
+     # Masked mean over tokens, matching "pooling_mode_mean_tokens": true in 1_Pooling/config.json
+     mask = attention_mask.unsqueeze(-1).to(last_hidden_states.dtype)
+     summed = (last_hidden_states * mask).sum(dim=1)
+     counts = mask.sum(dim=1).clamp(min=1e-9)
+     return summed / counts
+
+
+ def get_detailed_instruct(task_description: str, query: str) -> str:
+     # Same format as the "query" prompt in config_sentence_transformers.json
+     return f'Instruct: {task_description}\nQuery: {query}'
+
+
+ task = 'Given a web search query, retrieve relevant passages that answer the query'
+
+ queries = [
+     get_detailed_instruct(task, 'What is photosynthesis?'),
+     get_detailed_instruct(task, 'Who invented the telephone?'),
+ ]
+
+ documents = [
+     "Photosynthesis is the process by which green plants use sunlight, carbon dioxide, and water to produce glucose and oxygen. This biochemical reaction occurs in chloroplasts.",
+     "Alexander Graham Bell is credited with inventing the first practical telephone in 1876, receiving US patent number 174,465 for his device.",
+ ]
+
+ input_texts = queries + documents
+
+ tokenizer = AutoTokenizer.from_pretrained('Kingsoft-LLM/QZhou-Embedding', padding_side='left', trust_remote_code=True)
+ model = AutoModel.from_pretrained(
+     'Kingsoft-LLM/QZhou-Embedding',
+     trust_remote_code=True,
+     device_map='auto',
+     torch_dtype=torch.bfloat16,   # load arguments used for the leaderboard results
+     attn_implementation='sdpa',
+ )
+ model.eval()
+
+ batch_dict = tokenizer(
+     input_texts,
+     padding=True,
+     truncation=True,
+     max_length=8192,
+     return_tensors="pt",
+ ).to(model.device)
+
+ with torch.no_grad():
+     outputs = model(**batch_dict)
+ embeddings = mean_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
+
+ # L2-normalize (in float32) so that dot products are cosine similarities
+ embeddings = F.normalize(embeddings.float(), p=2, dim=1)
+ scores = embeddings[:2] @ embeddings[2:].T  # shape: (num_queries, num_documents)
+ print(scores.tolist())
+ ```
added_tokens.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "</tool_call>": 151658,
3
+ "<tool_call>": 151657,
4
+ "<|box_end|>": 151649,
5
+ "<|box_start|>": 151648,
6
+ "<|endoftext|>": 151643,
7
+ "<|file_sep|>": 151664,
8
+ "<|fim_middle|>": 151660,
9
+ "<|fim_pad|>": 151662,
10
+ "<|fim_prefix|>": 151659,
11
+ "<|fim_suffix|>": 151661,
12
+ "<|im_end|>": 151645,
13
+ "<|im_start|>": 151644,
14
+ "<|image_pad|>": 151655,
15
+ "<|object_ref_end|>": 151647,
16
+ "<|object_ref_start|>": 151646,
17
+ "<|quad_end|>": 151651,
18
+ "<|quad_start|>": 151650,
19
+ "<|repo_name|>": 151663,
20
+ "<|video_pad|>": 151656,
21
+ "<|vision_end|>": 151653,
22
+ "<|vision_pad|>": 151654,
23
+ "<|vision_start|>": 151652
24
+ }
config.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "architectures": [
3
+ "QZhouModel"
4
+ ],
5
+ "attention_dropout": 0.0,
6
+ "auto_map": {
7
+ "AutoModel": "modeling_qzhou.QZhouModel"
8
+ },
9
+ "bos_token_id": 151643,
10
+ "eos_token_id": 151643,
11
+ "hidden_act": "silu",
12
+ "hidden_size": 3584,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": 18944,
15
+ "max_position_embeddings": 32768,
16
+ "max_window_layers": 28,
17
+ "model_type": "qwen2",
18
+ "num_attention_heads": 28,
19
+ "num_hidden_layers": 28,
20
+ "num_key_value_heads": 4,
21
+ "rms_norm_eps": 1e-06,
22
+ "rope_scaling": null,
23
+ "rope_theta": 1000000.0,
24
+ "sliding_window": 131072,
25
+ "tie_word_embeddings": false,
26
+ "torch_dtype": "bfloat16",
27
+ "transformers_version": "4.51.1",
28
+ "use_cache": true,
29
+ "use_sliding_window": false,
30
+ "vocab_size": 152064
31
+ }
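A few derived quantities follow directly from this config, and they help in sanity-checking the architecture. The snippet below is plain arithmetic on values copied from `config.json` and from the `total_size` field of `model.safetensors.index.json` in this commit. It assumes every stored tensor is bfloat16 (2 bytes per parameter), as `torch_dtype` indicates:

```python
# Values copied from config.json and model.safetensors.index.json in this commit
hidden_size = 3584
num_attention_heads = 28
num_key_value_heads = 4
total_size_bytes = 14_141_238_272  # "total_size" in model.safetensors.index.json

head_dim = hidden_size // num_attention_heads                      # 128 dims per attention head
queries_per_kv_head = num_attention_heads // num_key_value_heads   # 7 query heads share each KV head (GQA)
kv_dim = num_key_value_heads * head_dim                            # 512: output width of k_proj / v_proj
num_params = total_size_bytes // 2                                 # bfloat16 stores 2 bytes per parameter

print(head_dim, queries_per_kv_head, kv_dim)
print(f"{num_params / 1e9:.2f}B parameters")
```

The parameter count works out to about 7.07B, consistent with the "7B" stated in the README. The 4 key/value heads indicate grouped-query attention, which keeps `k_proj`/`v_proj` much narrower than `q_proj`.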
config_sentence_transformers.json ADDED
@@ -0,0 +1,12 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "3.4.1",
4
+ "transformers": "4.51.1",
5
+ "pytorch": "2.4.1+cu121"
6
+ },
7
+ "prompts": {
8
+ "query": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "
9
+ },
10
+ "default_prompt_name": null,
11
+ "similarity_fn_name": "cosine"
12
+ }
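`prompts.query` is the string that sentence-transformers prepends to every input when `encode(..., prompt_name="query")` is called. Because `default_prompt_name` is null, inputs encoded without a `prompt_name` (the documents in the README examples) get no prefix. The sketch below reproduces that string handling by hand. `build_inputs` is a hypothetical helper for illustration, not a library function:

```python
# Copied from config_sentence_transformers.json in this commit
PROMPTS = {
    "query": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "
}
DEFAULT_PROMPT_NAME = None  # "default_prompt_name": null


def build_inputs(texts, prompt_name=DEFAULT_PROMPT_NAME):
    """Prefix each text with the named prompt, or leave it unchanged if no prompt is used."""
    prefix = PROMPTS[prompt_name] if prompt_name is not None else ""
    return [prefix + t for t in texts]


queries = build_inputs(["What is photosynthesis?"], prompt_name="query")
documents = build_inputs(["Photosynthesis converts light into chemical energy."])
print(queries[0])
```

Scores between the resulting embeddings are then computed with cosine similarity, as set by `similarity_fn_name`.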
image-1.png ADDED

Git LFS Details

  • SHA256: 5176a7a58a1d0cf04d6cd81c58129a1417f81f6a4ab8a25481b5d8de2baa5da6
  • Pointer size: 131 Bytes
  • Size of remote file: 704 kB
image-10.png ADDED

Git LFS Details

  • SHA256: 6e73984905fa9f64e512b0bf73f2fafeaeca2314a4e0964aee00960328b78df4
  • Pointer size: 131 Bytes
  • Size of remote file: 167 kB
image-11.png ADDED

Git LFS Details

  • SHA256: 3773c291e1f56cc0fa0f0a74bec06dacd36b993a3c2085b5f9a367847f3ae90f
  • Pointer size: 131 Bytes
  • Size of remote file: 142 kB
image-16.png ADDED

Git LFS Details

  • SHA256: c2c9e9dc7dd496eb41a796b453ee77a6760778c52a70f091afecae4808e05c5c
  • Pointer size: 131 Bytes
  • Size of remote file: 190 kB
image-18.png ADDED

Git LFS Details

  • SHA256: 7dcfdd868f2c5640d91dfde973a413f5ba220e89d59528570836f844317314fc
  • Pointer size: 131 Bytes
  • Size of remote file: 159 kB
image-7.png ADDED
image-8.png ADDED
image-9.png ADDED

Git LFS Details

  • SHA256: 02be8d68c2d949b09958fb3aff46db77141422ca2ca51ac591c480d46df1cdce
  • Pointer size: 131 Bytes
  • Size of remote file: 191 kB
merges.txt ADDED
The diff for this file is too large to render.
 
model.safetensors.index.json ADDED
@@ -0,0 +1,345 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 14141238272
4
+ },
5
+ "weight_map": {
6
+ "embed_tokens.weight": "model-00001-of-00003.safetensors",
7
+ "layers.0.input_layernorm.weight": "model-00001-of-00003.safetensors",
8
+ "layers.0.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
9
+ "layers.0.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
10
+ "layers.0.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
11
+ "layers.0.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
12
+ "layers.0.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
13
+ "layers.0.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
14
+ "layers.0.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
15
+ "layers.0.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
16
+ "layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
17
+ "layers.0.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
18
+ "layers.0.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
19
+ "layers.1.input_layernorm.weight": "model-00001-of-00003.safetensors",
20
+ "layers.1.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
21
+ "layers.1.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
22
+ "layers.1.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
23
+ "layers.1.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
24
+ "layers.1.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
25
+ "layers.1.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
26
+ "layers.1.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
27
+ "layers.1.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
28
+ "layers.1.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
29
+ "layers.1.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
30
+ "layers.1.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
31
+ "layers.10.input_layernorm.weight": "model-00002-of-00003.safetensors",
32
+ "layers.10.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
33
+ "layers.10.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
34
+ "layers.10.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
35
+ "layers.10.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
36
+ "layers.10.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
37
+ "layers.10.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
38
+ "layers.10.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
39
+ "layers.10.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
40
+ "layers.10.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
41
+ "layers.10.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
42
+ "layers.10.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
43
+ "layers.11.input_layernorm.weight": "model-00002-of-00003.safetensors",
44
+ "layers.11.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
45
+ "layers.11.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
46
+ "layers.11.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
47
+ "layers.11.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
48
+ "layers.11.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
49
+ "layers.11.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
50
+ "layers.11.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
51
+ "layers.11.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
52
+ "layers.11.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
53
+ "layers.11.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
54
+ "layers.11.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
55
+ "layers.12.input_layernorm.weight": "model-00002-of-00003.safetensors",
56
+ "layers.12.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
57
+ "layers.12.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
58
+ "layers.12.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
59
+ "layers.12.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
60
+ "layers.12.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
61
+ "layers.12.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
62
+ "layers.12.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
63
+ "layers.12.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
64
+ "layers.12.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
65
+ "layers.12.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
66
+ "layers.12.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
67
+ "layers.13.input_layernorm.weight": "model-00002-of-00003.safetensors",
68
+ "layers.13.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
69
+ "layers.13.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
70
+ "layers.13.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
71
+ "layers.13.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
72
+ "layers.13.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
73
+ "layers.13.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
74
+ "layers.13.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.13.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.13.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.13.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.13.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.14.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "layers.14.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.14.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.14.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.14.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "layers.14.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.14.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.14.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.14.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.14.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.14.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.14.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.15.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "layers.15.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.15.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.15.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.15.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "layers.15.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.15.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.15.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.15.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.15.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.15.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.15.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.16.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "layers.16.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.16.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.16.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.16.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "layers.16.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.16.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.16.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.16.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.16.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.16.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.16.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.17.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "layers.17.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.17.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.17.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.17.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "layers.17.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.17.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.17.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.17.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.17.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.17.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.17.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.18.input_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.18.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.18.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.18.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.18.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.18.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.18.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.18.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.18.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.18.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.18.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.18.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.19.input_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.19.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.19.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.19.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.19.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.19.self_attn.k_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.19.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.19.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.19.self_attn.q_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.19.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.19.self_attn.v_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.19.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.2.input_layernorm.weight": "model-00001-of-00003.safetensors",
+ "layers.2.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.2.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.2.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.2.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+ "layers.2.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.2.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.2.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.2.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.2.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.2.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.2.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.20.input_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.20.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.20.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.20.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.20.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.20.self_attn.k_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.20.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.20.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.20.self_attn.q_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.20.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.20.self_attn.v_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.20.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.21.input_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.21.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.21.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.21.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.21.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.21.self_attn.k_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.21.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.21.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.21.self_attn.q_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.21.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.21.self_attn.v_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.21.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.22.input_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.22.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.22.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.22.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.22.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.22.self_attn.k_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.22.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.22.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.22.self_attn.q_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.22.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.22.self_attn.v_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.22.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.23.input_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.23.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.23.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.23.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.23.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.23.self_attn.k_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.23.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.23.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.23.self_attn.q_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.23.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.23.self_attn.v_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.23.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.24.input_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.24.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.24.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.24.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.24.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.24.self_attn.k_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.24.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.24.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.24.self_attn.q_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.24.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.24.self_attn.v_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.24.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.25.input_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.25.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.25.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.25.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.25.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.25.self_attn.k_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.25.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.25.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.25.self_attn.q_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.25.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.25.self_attn.v_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.25.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.26.input_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.26.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.26.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.26.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.26.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.26.self_attn.k_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.26.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.26.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.26.self_attn.q_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.26.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.26.self_attn.v_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.26.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.27.input_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.27.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.27.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.27.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.27.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+ "layers.27.self_attn.k_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.27.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.27.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.27.self_attn.q_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.27.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.27.self_attn.v_proj.bias": "model-00003-of-00003.safetensors",
+ "layers.27.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+ "layers.3.input_layernorm.weight": "model-00001-of-00003.safetensors",
+ "layers.3.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.3.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.3.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.3.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+ "layers.3.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.3.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.3.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.3.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.3.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.3.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.3.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.4.input_layernorm.weight": "model-00001-of-00003.safetensors",
+ "layers.4.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.4.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.4.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.4.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+ "layers.4.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.4.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.4.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.4.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.4.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.4.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.4.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.5.input_layernorm.weight": "model-00001-of-00003.safetensors",
+ "layers.5.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.5.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.5.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.5.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+ "layers.5.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.5.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.5.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.5.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.5.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.5.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.5.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.6.input_layernorm.weight": "model-00001-of-00003.safetensors",
+ "layers.6.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.6.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.6.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.6.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+ "layers.6.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.6.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.6.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.6.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.6.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.6.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.6.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.7.input_layernorm.weight": "model-00001-of-00003.safetensors",
+ "layers.7.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.7.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.7.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.7.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+ "layers.7.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.7.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.7.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.7.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.7.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.7.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.7.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.8.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "layers.8.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.8.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.8.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.8.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "layers.8.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.8.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.8.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.8.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.8.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.8.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
+ "layers.8.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+ "layers.9.input_layernorm.weight": "model-00002-of-00003.safetensors",
+ "layers.9.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.9.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.9.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.9.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+ "layers.9.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.9.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.9.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.9.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.9.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+ "layers.9.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
+ "layers.9.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+ "norm.weight": "model-00003-of-00003.safetensors"
+ }
+ }
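The `weight_map` above maps each parameter name to the shard file that stores it. As a minimal sketch, assuming the standard sharded-checkpoint index layout with a top-level `weight_map` object (the index file's name is not visible in this excerpt), a loader could resolve a tensor to its shard and group names per file like this. The helpers `shard_for` and `tensors_by_shard` are hypothetical illustrations, not part of this repository or of any library API, and `example_index` is a small hand-copied subset of the map rather than the full index:

```python
import json
from collections import defaultdict


def shard_for(index: dict, tensor_name: str) -> str:
    """Return the shard file that stores `tensor_name` (hypothetical helper)."""
    return index["weight_map"][tensor_name]


def tensors_by_shard(index: dict) -> dict:
    """Group parameter names by the shard file they live in (hypothetical helper)."""
    groups = defaultdict(list)
    for name, shard in index["weight_map"].items():
        groups[shard].append(name)
    # Sort names within each shard so the output is deterministic.
    return {shard: sorted(names) for shard, names in groups.items()}


# Hand-copied subset of the weight map above, not the full index.
example_index = json.loads("""
{
  "weight_map": {
    "layers.18.input_layernorm.weight": "model-00003-of-00003.safetensors",
    "layers.18.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
    "layers.2.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
    "norm.weight": "model-00003-of-00003.safetensors"
  }
}
""")

print(shard_for(example_index, "layers.18.mlp.gate_proj.weight"))
for shard, names in sorted(tensors_by_shard(example_index).items()):
    print(shard, len(names))
```

Resolving per tensor rather than per layer matters here: in the map above, `layers.18` keeps its attention weights and `gate_proj`/`up_proj` in shard 2 but its layernorms and `down_proj` in shard 3, and `layers.8` stores its attention weights in shard 1 but everything else in shard 2.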
modeling_qzhou.py ADDED
@@ -0,0 +1,934 @@
+
+ import math
+ from typing import List, Optional, Tuple, Union
+
+ import torch
+ import torch.utils.checkpoint
+ from torch import nn
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+
+ from transformers.activations import ACT2FN
+ from transformers.cache_utils import Cache, DynamicCache, StaticCache
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
+ from transformers.modeling_outputs import (
+     BaseModelOutputWithPast,
+     CausalLMOutputWithPast,
+     SequenceClassifierOutputWithPast,
+     TokenClassifierOutput,
+ )
+ from transformers.modeling_utils import PreTrainedModel
+ from transformers.utils import (
+     add_start_docstrings,
+     add_start_docstrings_to_model_forward,
+     is_flash_attn_2_available,
+     is_flash_attn_greater_or_equal_2_10,
+     logging,
+     replace_return_docstrings,
+ )
+ from transformers.models.qwen2.configuration_qwen2 import Qwen2Config
+
+ if is_flash_attn_2_available():
+     from transformers.modeling_flash_attention_utils import _flash_attention_forward
+
+ logger = logging.get_logger(__name__)
+
+
+ def _prepare_4d_causal_attention_mask_with_cache_position(
+     attention_mask: torch.Tensor,
+     sequence_length: int,
+     target_length: int,
+     dtype: torch.dtype,
+     device: torch.device,
+     min_dtype: float,
+     cache_position: torch.Tensor,
+     batch_size: int,
+ ):
+     """
+     Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
+     `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
+
+     Args:
+         attention_mask (`torch.Tensor`):
+             A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape `(batch_size, 1, query_length, key_value_length)`.
+         sequence_length (`int`):
+             The sequence length being processed.
+         target_length (`int`):
+             The target length: when generating with static cache, the mask should be as long as the static cache, to account for the 0 padding, the part of the cache that is not filled yet.
+         dtype (`torch.dtype`):
+             The dtype to use for the 4D attention mask.
+         device (`torch.device`):
+             The device to place the 4D attention mask on.
+         min_dtype (`float`):
+             The minimum value representable with the dtype `dtype`.
+         cache_position (`torch.Tensor`):
+             Indices depicting the position of the input sequence tokens in the sequence.
+         batch_size (`int`):
+             Batch size.
+     """
+     if attention_mask is not None and attention_mask.dim() == 4:
+         # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
+         causal_mask = attention_mask
+     else:
+         causal_mask = torch.full((sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device)
+         if sequence_length != 1:
+             causal_mask = torch.triu(causal_mask, diagonal=1)
+         causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
+         causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
+         if attention_mask is not None:
+             causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
+             mask_length = attention_mask.shape[-1]
+             padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
+             padding_mask = padding_mask == 0
+             causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
+                 padding_mask, min_dtype
+             )
+
+     return causal_mask
+
+
+ def _prepare_4d_attention_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
+     """
+     Creates a non-causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
+     `(batch_size, key_value_length)`
+
+     Args:
+         mask (`torch.Tensor`):
+             A 2D attention mask of shape `(batch_size, key_value_length)`
+         dtype (`torch.dtype`):
+             The torch dtype the created mask shall have.
+         tgt_len (`int`):
+             The target length or query length the created mask shall have.
+     """
+     return AttentionMaskConverter._expand_mask(mask=mask, dtype=dtype, tgt_len=tgt_len)
+
+
+ def _prepare_4d_attention_mask_for_sdpa(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
+     """
+     Creates a non-causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
+     `(batch_size, key_value_length)`
+
+     Args:
+         mask (`torch.Tensor`):
+             A 2D attention mask of shape `(batch_size, key_value_length)`
+         dtype (`torch.dtype`):
+             The torch dtype the created mask shall have.
+         tgt_len (`int`):
+             The target length or query length the created mask shall have.
+     """
+     _, key_value_length = mask.shape
+     tgt_len = tgt_len if tgt_len is not None else key_value_length
+
+     is_tracing = (
+         torch.jit.is_tracing()
+         or isinstance(mask, torch.fx.Proxy)
+         or (hasattr(torch, "_dynamo") and torch._dynamo.is_compiling())
+     )
+
+     # torch.jit.trace, symbolic_trace and torchdynamo with fullgraph=True are unable to capture data-dependent control flows.
+     if not is_tracing and torch.all(mask == 1):
+         return None
+     else:
+         return AttentionMaskConverter._expand_mask(mask=mask, dtype=dtype, tgt_len=tgt_len)
+
+
+ class Qwen2RMSNorm(nn.Module):
+     def __init__(self, hidden_size, eps=1e-6):
+
+         super().__init__()
+         self.weight = nn.Parameter(torch.ones(hidden_size))
+         self.variance_epsilon = eps
+
+     def forward(self, hidden_states):
+         input_dtype = hidden_states.dtype
+         hidden_states = hidden_states.to(torch.float32)
+         variance = hidden_states.pow(2).mean(-1, keepdim=True)
+         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+         return self.weight * hidden_states.to(input_dtype)
+
+     def extra_repr(self):
+         return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
+
+
+ class Qwen2RotaryEmbedding(nn.Module):
+     def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
+         super().__init__()
+
+         self.dim = dim
+         self.max_position_embeddings = max_position_embeddings
+         self.base = base
+         inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
+         self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+         # Build here to make `torch.jit.trace` work.
+         self._set_cos_sin_cache(
+             seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
+         )
+
+     def _set_cos_sin_cache(self, seq_len, device, dtype):
+         self.max_seq_len_cached = seq_len
+         t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
+
+         freqs = torch.outer(t, self.inv_freq)
+         # Different from paper, but it uses a different permutation in order to obtain the same calculation
+         emb = torch.cat((freqs, freqs), dim=-1)
+         self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+         self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+
+     def forward(self, x, seq_len=None):
+         # x: [bs, num_attention_heads, seq_len, head_size]
+         if seq_len > self.max_seq_len_cached:
+             self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
+
+         return (
+             self.cos_cached[:seq_len].to(dtype=x.dtype),
+             self.sin_cached[:seq_len].to(dtype=x.dtype),
+         )
+
+
+ def rotate_half(x):
+     """Rotates half the hidden dims of the input."""
+     x1 = x[..., : x.shape[-1] // 2]
+     x2 = x[..., x.shape[-1] // 2 :]
+     return torch.cat((-x2, x1), dim=-1)
+
+
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
+     """Applies Rotary Position Embedding to the query and key tensors.
+
+     Args:
+         q (`torch.Tensor`): The query tensor.
+         k (`torch.Tensor`): The key tensor.
+         cos (`torch.Tensor`): The cosine part of the rotary embedding.
+         sin (`torch.Tensor`): The sine part of the rotary embedding.
+         position_ids (`torch.Tensor`):
+             The position indices of the tokens corresponding to the query and key tensors. For example, this can be
+             used to pass offset position ids when working with a KV-cache.
+         unsqueeze_dim (`int`, *optional*, defaults to 1):
+             The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+             sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+             that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+             k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+             cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+             the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+     Returns:
+         `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
+     """
+     cos = cos[position_ids].unsqueeze(unsqueeze_dim)
+     sin = sin[position_ids].unsqueeze(unsqueeze_dim)
+     q_embed = (q * cos) + (rotate_half(q) * sin)
+     k_embed = (k * cos) + (rotate_half(k) * sin)
+     return q_embed, k_embed
+
+
+ class Qwen2MLP(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.hidden_size = config.hidden_size
+         self.intermediate_size = config.intermediate_size
+         self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+         self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+         self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+         self.act_fn = ACT2FN[config.hidden_act]
+
+     def forward(self, hidden_state):
+         return self.down_proj(self.act_fn(self.gate_proj(hidden_state)) * self.up_proj(hidden_state))
+
+
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+     """
+     This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+     num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+     """
+     batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+     if n_rep == 1:
+         return hidden_states
+     hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+     return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+
+
249
+class Qwen2Attention(nn.Module):
+    """
+    Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer
+    and "Generating Long Sequences with Sparse Transformers".
+    """
+
+    def __init__(self, config: Qwen2Config, layer_idx: Optional[int] = None):
+        super().__init__()
+        self.config = config
+        self.layer_idx = layer_idx
+        if layer_idx is None:
+            logger.warning_once(
+                f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
+                "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
+                "when creating this class."
+            )
+
+        self.hidden_size = config.hidden_size
+        self.num_heads = config.num_attention_heads
+        self.head_dim = self.hidden_size // self.num_heads
+        self.num_key_value_heads = config.num_key_value_heads
+        self.num_key_value_groups = self.num_heads // self.num_key_value_heads
+        self.max_position_embeddings = config.max_position_embeddings
+        self.rope_theta = config.rope_theta
+        # Not read by the forward passes in this file, which all attend bidirectionally.
+        self.is_causal = True
+        self.attention_dropout = config.attention_dropout
+
+        if (self.head_dim * self.num_heads) != self.hidden_size:
+            raise ValueError(
+                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
+                f" and `num_heads`: {self.num_heads})."
+            )
+        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True)
+        self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
+        self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
+        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
+
+        self.rotary_emb = Qwen2RotaryEmbedding(
+            self.head_dim,
+            max_position_embeddings=self.max_position_embeddings,
+            base=self.rope_theta,
+        )
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Cache] = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,
+        cache_position: Optional[torch.LongTensor] = None,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+        bsz, q_len, _ = hidden_states.size()
+
+        query_states = self.q_proj(hidden_states)
+        key_states = self.k_proj(hidden_states)
+        value_states = self.v_proj(hidden_states)
+
+        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+
+        kv_seq_len = key_states.shape[-2]
+        if past_key_value is not None:
+            if self.layer_idx is None:
+                raise ValueError(
+                    f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
+                    "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
+                    "with a layer index."
+                )
+            kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
+        cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
+
+        if past_key_value is not None:
+            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}  # Specific to RoPE models
+            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+        # repeat k/v heads if n_kv_heads < n_heads
+        key_states = repeat_kv(key_states, self.num_key_value_groups)
+        value_states = repeat_kv(value_states, self.num_key_value_groups)
+
+        attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
+
+        if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
+            raise ValueError(
+                f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
+                f" {attn_weights.size()}"
+            )
+
+        if attention_mask is not None:  # no matter the length, we just slice it
+            causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
+            attn_weights = attn_weights + causal_mask
+
+        # upcast attention to fp32
+        attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
+        attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
+        attn_output = torch.matmul(attn_weights, value_states)
+
+        if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
+            raise ValueError(
+                f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
+                f" {attn_output.size()}"
+            )
+
+        attn_output = attn_output.transpose(1, 2).contiguous()
+        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
+
+        attn_output = self.o_proj(attn_output)
+
+        if not output_attentions:
+            attn_weights = None
+
+        return attn_output, attn_weights, past_key_value
+
+
+class Qwen2FlashAttention2(Qwen2Attention):
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+
+        # TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
+        # flash_attn<2.1 generates a top-left aligned causal mask, whereas bottom-right alignment is needed here; bottom-right became the default in flash_attn>=2.1. This attribute handles the difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
+        # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
+        self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Cache] = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,
+        cache_position: Optional[torch.LongTensor] = None,
+    ):
+        bsz, q_len, _ = hidden_states.size()
+
+        query_states = self.q_proj(hidden_states)
+        key_states = self.k_proj(hidden_states)
+        value_states = self.v_proj(hidden_states)
+
+        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+
+        kv_seq_len = key_states.shape[-2]
+        if past_key_value is not None:
+            if self.layer_idx is None:
+                raise ValueError(
+                    f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
+                    "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
+                    "with a layer index."
+                )
+            kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
+
+        # Because the input can be padded, the absolute sequence length depends on the max position id.
+        rotary_seq_len = (
+            max(kv_seq_len, position_ids[:, -1].max().item() + 1) if position_ids is not None else kv_seq_len
+        )
+
+        cos, sin = self.rotary_emb(value_states, seq_len=rotary_seq_len)
+
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
+
+        if past_key_value is not None:
+            # Slice the cache only if the config sets a `sliding_window` attribute
+            cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
+            if (
+                getattr(self.config, "sliding_window", None) is not None
+                and kv_seq_len > self.config.sliding_window
+                and cache_has_contents
+            ):
+                slicing_tokens = 1 - self.config.sliding_window
+
+                past_key = past_key_value[self.layer_idx][0]
+                past_value = past_key_value[self.layer_idx][1]
+
+                past_key = past_key[:, :, slicing_tokens:, :].contiguous()
+                past_value = past_value[:, :, slicing_tokens:, :].contiguous()
+
+                if past_key.shape[-2] != self.config.sliding_window - 1:
+                    raise ValueError(
+                        f"past key must have a shape of (`batch_size, num_heads, self.config.sliding_window-1, head_dim`), got"
+                        f" {past_key.shape}"
+                    )
+
+                if attention_mask is not None:
+                    attention_mask = attention_mask[:, slicing_tokens:]
+                    attention_mask = torch.cat([attention_mask, torch.ones_like(attention_mask[:, -1:])], dim=-1)
+
+            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}  # Specific to RoPE models
+            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+        # repeat k/v heads if n_kv_heads < n_heads
+        key_states = repeat_kv(key_states, self.num_key_value_groups)
+        value_states = repeat_kv(value_states, self.num_key_value_groups)
+        dropout_rate = 0.0 if not self.training else self.attention_dropout
+
+        # In PEFT, layer norms are usually cast to float32 for training stability, so the input hidden states
+        # may be silently cast to float32 as well. Cast them back to the compute dtype (e.g. float16 or bfloat16)
+        # so that Flash Attention receives a supported dtype.
+        input_dtype = query_states.dtype
+        if input_dtype == torch.float32:
+            if torch.is_autocast_enabled():
+                target_dtype = torch.get_autocast_gpu_dtype()
+            # Handle the case where the model is quantized
+            elif hasattr(self.config, "_pre_quantization_dtype"):
+                target_dtype = self.config._pre_quantization_dtype
+            else:
+                target_dtype = self.q_proj.weight.dtype
+
+            logger.warning_once(
+                "The input hidden states seem to have been silently cast to float32; this may be because embedding"
+                " or layer norm layers were upcast to float32. The input will be cast back to"
+                f" {target_dtype}."
+            )
+
+            query_states = query_states.to(target_dtype)
+            key_states = key_states.to(target_dtype)
+            value_states = value_states.to(target_dtype)
+
+        # Reshape to the layout expected by Flash Attention: (batch, seq_len, num_heads, head_dim)
+        query_states = query_states.transpose(1, 2)
+        key_states = key_states.transpose(1, 2)
+        value_states = value_states.transpose(1, 2)
+
+        if (
+            self.config.use_sliding_window
+            and getattr(self.config, "sliding_window", None) is not None
+            and self.layer_idx >= self.config.max_window_layers
+        ):
+            sliding_window = self.config.sliding_window
+        else:
+            sliding_window = None
+
+        attn_output = _flash_attention_forward(
+            query_states,
+            key_states,
+            value_states,
+            attention_mask,
+            q_len,
+            position_ids=position_ids,
+            dropout=dropout_rate,
+            sliding_window=sliding_window,
+            is_causal=False,  # Revised from upstream Qwen2: bidirectional (non-causal) attention
+            use_top_left_mask=self._flash_attn_uses_top_left_mask,
+        )
+
+        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
+        attn_output = self.o_proj(attn_output)
+
+        # Flash Attention does not return attention weights.
+        attn_weights = None
+
+        return attn_output, attn_weights, past_key_value
+
+
+class Qwen2SdpaAttention(Qwen2Attention):
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Cache] = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,
+        cache_position: Optional[torch.LongTensor] = None,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+        if output_attentions:
+            # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
+            logger.warning_once(
+                "QZhouModel is using Qwen2SdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
+                'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
+            )
+            return super().forward(
+                hidden_states=hidden_states,
+                attention_mask=attention_mask,
+                position_ids=position_ids,
+                past_key_value=past_key_value,
+                output_attentions=output_attentions,
+                use_cache=use_cache,
+                cache_position=cache_position,
+            )
+
+        bsz, q_len, _ = hidden_states.size()
+
+        query_states = self.q_proj(hidden_states)
+        key_states = self.k_proj(hidden_states)
+        value_states = self.v_proj(hidden_states)
+
+        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+
+        kv_seq_len = key_states.shape[-2]
+        if past_key_value is not None:
+            kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
+        cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
+
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
+
+        if past_key_value is not None:
+            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}  # Specific to RoPE models
+            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+        key_states = repeat_kv(key_states, self.num_key_value_groups)
+        value_states = repeat_kv(value_states, self.num_key_value_groups)
+
+        causal_mask = attention_mask
+        if attention_mask is not None:  # no matter the length, we just slice it
+            causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
+
+        # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
+        # Reference: https://github.com/pytorch/pytorch/issues/112577.
+        if query_states.device.type == "cuda" and attention_mask is not None:
+            query_states = query_states.contiguous()
+            key_states = key_states.contiguous()
+            value_states = value_states.contiguous()
+
+        # Revised from upstream Qwen2, which used `is_causal = True if causal_mask is None and q_len > 1 else False`
+        # to dispatch SDPA to its causal kernels. This embedding model attends bidirectionally, so `is_causal` is
+        # always False and padding is handled only through `attn_mask`.
+        is_causal = False
+        attn_output = torch.nn.functional.scaled_dot_product_attention(
+            query_states,
+            key_states,
+            value_states,
+            attn_mask=causal_mask,
+            dropout_p=self.attention_dropout if self.training else 0.0,
+            is_causal=is_causal,
+        )
+
+        attn_output = attn_output.transpose(1, 2).contiguous()
+        attn_output = attn_output.view(bsz, q_len, self.hidden_size)
+
+        attn_output = self.o_proj(attn_output)
+
+        return attn_output, None, past_key_value
+
+
+QWEN2_ATTENTION_CLASSES = {
+    "eager": Qwen2Attention,
+    "flash_attention_2": Qwen2FlashAttention2,
+    "sdpa": Qwen2SdpaAttention,
+}
+
+
+class Qwen2DecoderLayer(nn.Module):
+    def __init__(self, config: Qwen2Config, layer_idx: int):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+
+        if config.sliding_window and config._attn_implementation != "flash_attention_2":
+            logger.warning_once(
+                f"Sliding Window Attention is enabled but not implemented for `{config._attn_implementation}`; "
+                "unexpected results may be encountered."
+            )
+        self.self_attn = QWEN2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)
+        self.mlp = Qwen2MLP(config)
+        self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_attention_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Tuple[torch.Tensor]] = None,
+        output_attentions: Optional[bool] = False,
+        use_cache: Optional[bool] = False,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs,
+    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+        """
+        Args:
+            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+            attention_mask (`torch.FloatTensor`, *optional*): padding mask prepared by `QZhouModel._update_bi_attn_mask`,
+                either 2D `(batch, sequence_length)` with padding indicated by 0 (Flash Attention 2) or 4D additive
+                `(batch, 1, query_length, key_value_length)` (eager and SDPA).
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                returned tensors for more detail.
+            use_cache (`bool`, *optional*):
+                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+                (see `past_key_values`).
+            past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
+            cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
+                Indices depicting the position of the input sequence tokens in the sequence.
+            kwargs (`dict`, *optional*):
+                Arbitrary kwargs to be ignored, used for FSDP and other methods that inject code
+                into the model
+        """
+
+        residual = hidden_states
+
+        hidden_states = self.input_layernorm(hidden_states)
+
+        # Self Attention
+        hidden_states, self_attn_weights, present_key_value = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_value=past_key_value,
+            output_attentions=output_attentions,
+            use_cache=use_cache,
+            cache_position=cache_position,
+        )
+        hidden_states = residual + hidden_states
+
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+
+        outputs = (hidden_states,)
+
+        if output_attentions:
+            outputs += (self_attn_weights,)
+
+        if use_cache:
+            outputs += (present_key_value,)
+
+        return outputs
+
+
+class Qwen2PreTrainedModel(PreTrainedModel):
+    config_class = Qwen2Config
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["Qwen2DecoderLayer"]
+    _skip_keys_device_placement = "past_key_values"
+    _supports_flash_attn_2 = True
+    _supports_sdpa = True
+    _supports_cache_class = True
+
+    def _init_weights(self, module):
+        std = self.config.initializer_range
+        if isinstance(module, nn.Linear):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+
+
+class QZhouModel(Qwen2PreTrainedModel):
+
+    def __init__(self, config: Qwen2Config):
+        super().__init__(config)
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+        self.layers = nn.ModuleList(
+            [Qwen2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+        )
+        self._attn_implementation = config._attn_implementation
+        self.norm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+        self.gradient_checkpointing = False
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.embed_tokens = value
+
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+    ) -> Union[Tuple, BaseModelOutputWithPast]:
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if (input_ids is None) ^ (inputs_embeds is not None):
+            raise ValueError(
+                "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
+            )
+
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
+
+        use_legacy_cache = False
+        if use_cache and not isinstance(past_key_values, Cache) and not self.training:
+            use_legacy_cache = True
+            past_key_values = DynamicCache.from_legacy_cache(past_key_values)
+            logger.warning_once(
+                "We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. "
+                "Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)"
+            )
+
+        if inputs_embeds is None:
+            inputs_embeds = self.embed_tokens(input_ids)
+
+        if cache_position is None:
+            past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+            cache_position = torch.arange(
+                past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+            )
+        if position_ids is None:
+            position_ids = cache_position.unsqueeze(0)
+
+        bi_attn_mask = self._update_bi_attn_mask(
+            attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
+        )
+
+        hidden_states = inputs_embeds
+
+        # decoder layers
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attns = () if output_attentions else None
+        next_decoder_cache = None
+
+        for decoder_layer in self.layers:
+            if output_hidden_states:
+                all_hidden_states += (hidden_states,)
+
+            if self.gradient_checkpointing and self.training:
+                layer_outputs = self._gradient_checkpointing_func(
+                    decoder_layer.__call__,
+                    hidden_states,
+                    bi_attn_mask,
+                    position_ids,
+                    past_key_values,
+                    output_attentions,
+                    use_cache,
+                    cache_position,
+                )
+            else:
+                layer_outputs = decoder_layer(
+                    hidden_states,
+                    attention_mask=bi_attn_mask,
+                    position_ids=position_ids,
+                    past_key_value=past_key_values,
+                    output_attentions=output_attentions,
+                    use_cache=use_cache,
+                    cache_position=cache_position,
+                )
+
+            hidden_states = layer_outputs[0]
+
+            if use_cache:
+                next_decoder_cache = layer_outputs[2 if output_attentions else 1]
+
+            if output_attentions:
+                all_self_attns += (layer_outputs[1],)
+
+        hidden_states = self.norm(hidden_states)
+
+        # add hidden states from the last decoder layer
+        if output_hidden_states:
+            all_hidden_states += (hidden_states,)
+
+        next_cache = None
+        if use_cache:
+            next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
+
+        if not return_dict:
+            return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=next_cache,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attns,
+        )
+
+    def _update_bi_attn_mask(
+        self,
+        attention_mask: torch.Tensor,
+        input_tensor: torch.Tensor,
+        cache_position: torch.Tensor,
+        past_key_values: Cache,
+        output_attentions: bool,
+    ):
+        # Build a bidirectional (padding-only) mask: unlike `_update_causal_mask`, no causal triangle is added.
+        # `cache_position` and `past_key_values` are accepted for signature compatibility but are not used.
+        if self.config._attn_implementation == "flash_attention_2":
+            if attention_mask is not None and 0.0 in attention_mask:
+                return attention_mask
+            return None
+
+        if attention_mask is None:
+            # No padding information: every token may attend to every other token.
+            return None
+
+        if self.config._attn_implementation == "sdpa" and not output_attentions:
+            # May return None when nothing is padded, otherwise a 4D additive mask.
+            attention_mask = _prepare_4d_attention_mask_for_sdpa(
+                attention_mask, input_tensor.dtype
+            )
+            return attention_mask
+        else:
+            attention_mask = _prepare_4d_attention_mask(
+                attention_mask, input_tensor.dtype
+            )
+            return attention_mask
+
+    # Copied from transformers.models.llama.modeling_llama.LlamaModel._update_causal_mask
+    # Not called by QZhouModel.forward, which uses `_update_bi_attn_mask`; kept from the upstream Qwen2 code.
+    def _update_causal_mask(
+        self,
+        attention_mask: torch.Tensor,
+        input_tensor: torch.Tensor,
+        cache_position: torch.Tensor,
+        past_key_values: Cache,
+        output_attentions: bool,
+    ):
+        # TODO: As of torch==2.2.0, the `attention_mask` passed to the model in `generate` is 2D and of dynamic length even when the static
+        # KV cache is used. This is an issue for torch.compile which then recaptures cudagraphs at each decode steps due to the dynamic shapes.
+        # (`recording cudagraph tree for symint key 13`, etc.), which is VERY slow. A workaround is `@torch.compiler.disable`, but this prevents using
+        # `fullgraph=True`. See more context in https://github.com/huggingface/transformers/pull/29114
+
+        if self.config._attn_implementation == "flash_attention_2":
+            if attention_mask is not None and 0.0 in attention_mask:
+                return attention_mask
+            return None
+
+        # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
+        # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
+        # to infer the attention mask.
+        past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+        using_static_cache = isinstance(past_key_values, StaticCache)
+
+        # When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
+        if self.config._attn_implementation == "sdpa" and not using_static_cache and not output_attentions:
+            if AttentionMaskConverter._ignore_causal_mask_sdpa(
+                attention_mask,
+                inputs_embeds=input_tensor,
+                past_key_values_length=past_seen_tokens,
+                is_training=self.training,
+            ):
+                return None
+
+        dtype, device = input_tensor.dtype, input_tensor.device
+        min_dtype = torch.finfo(dtype).min
+        sequence_length = input_tensor.shape[1]
+        if using_static_cache:
+            target_length = past_key_values.get_max_length()
+        else:
+            target_length = (
+                attention_mask.shape[-1]
+                if isinstance(attention_mask, torch.Tensor)
+                else past_seen_tokens + sequence_length + 1
+            )
+
+        # In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
+        causal_mask = _prepare_4d_causal_attention_mask_with_cache_position(
+            attention_mask,
+            sequence_length=sequence_length,
+            target_length=target_length,
+            dtype=dtype,
+            device=device,
+            min_dtype=min_dtype,
+            cache_position=cache_position,
+            batch_size=input_tensor.shape[0],
+        )
+
+        if (
+            self.config._attn_implementation == "sdpa"
+            and attention_mask is not None
+            and attention_mask.device.type == "cuda"
+            and not output_attentions
+        ):
+            # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
+            # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
+            # Details: https://github.com/pytorch/pytorch/issues/110213
+            causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
+
+        return causal_mask
+
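The revised attention classes above pass `is_causal=False`, and `QZhouModel.forward` builds its mask with `_update_bi_attn_mask`, which only masks padding. As a rough illustration of what that changes, the sketch below expands a 2D padding mask into the `(batch, 1, query_len, key_len)` additive layout that the eager path adds to the attention scores, with and without a causal triangle. It is pure Python for readability; `NEG` stands in for `torch.finfo(dtype).min`, and `expand_padding_mask` and `softmax` are hypothetical helpers, not functions from this repository or from `transformers`.

```python
import math
from typing import List

# Stand-in for torch.finfo(dtype).min, the large negative value added to masked scores.
NEG = -1e9


def expand_padding_mask(mask: List[List[int]], causal: bool = False) -> List[List[List[List[float]]]]:
    """Expand a (batch, seq_len) 0/1 padding mask to an additive (batch, 1, seq_len, seq_len) mask.

    With causal=False (bidirectional), query i may attend to every non-padded key j.
    With causal=True, query i is additionally blocked from keys j > i.
    """
    expanded = []
    for row in mask:
        n = len(row)
        grid = [
            [0.0 if row[j] == 1 and (not causal or j <= i) else NEG for j in range(n)]
            for i in range(n)
        ]
        expanded.append([grid])  # singleton head dimension, broadcast over all heads
    return expanded


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # One sequence of 4 tokens whose last position is padding.
    padding = [[1, 1, 1, 0]]
    raw_scores = [0.0, 0.0, 0.0, 0.0]  # equal raw scores, so the mask alone decides the weights

    bi = expand_padding_mask(padding)[0][0]
    causal = expand_padding_mask(padding, causal=True)[0][0]

    # First query token: bidirectional attention spreads over all 3 real tokens,
    # while causal attention can only see the token itself.
    print([round(w, 3) for w in softmax([s + b for s, b in zip(raw_scores, bi[0])])])
    print([round(w, 3) for w in softmax([s + c for s, c in zip(raw_scores, causal[0])])])
```

In the bidirectional case every query row carries the same key mask, which is why the first token of a sequence can already draw on context from later tokens; for SDPA, `_prepare_4d_attention_mask_for_sdpa` may skip the mask entirely when no position is padded.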
modules.json ADDED
@@ -0,0 +1,14 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
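`modules.json` chains a `sentence_transformers.models.Transformer` module to a `Pooling` module stored in `1_Pooling`, whose config in this commit enables only `pooling_mode_mean_tokens`. Mean pooling averages the token embeddings that the attention mask marks as real, so padding does not dilute the sentence vector. The pure-Python sketch below shows that step, plus cosine similarity as one common way to compare the resulting vectors; `mean_pool` and `cosine` are illustrative names, not sentence-transformers APIs, and the toy 2-dimensional vectors stand in for the 3584-dimensional model outputs.

```python
import math
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]], attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1 (padding is ignored)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for d in range(dim):
                summed[d] += vec[d]
            count += 1
    count = max(count, 1)  # avoid division by zero for an all-padding row
    return [s / count for s in summed]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Three real tokens followed by one padding position (toy 2-d embeddings).
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]
    mask = [1, 1, 1, 0]
    sentence = mean_pool(tokens, mask)
    print(sentence)  # the padded [9.0, 9.0] vector does not contribute
    print(round(cosine(sentence, [1.0, 1.0]), 4))
```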
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+{
+  "max_seq_length": 32768,
+  "do_lower_case": false
+}
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
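`special_tokens_map.json` sets both `eos_token` and `pad_token` to `<|endoftext|>` (id 151643 in `tokenizer_config.json`), and `tokenizer_config.json` sets `add_eos_token` to true. When padding and EOS share an id, an attention mask derived from `input_ids != pad_id` would also hide a real EOS token at the end of a sequence. The sketch below, a hypothetical helper rather than the tokenizer's own code, assumes an EOS id has been appended to each sequence, right-pads the batch, and builds the mask from each sequence's true length instead; the token ids other than 151643 are made up.

```python
from typing import List, Tuple

# Id of "<|endoftext|>" in tokenizer_config.json; here it serves as both EOS and padding.
ENDOFTEXT_ID = 151643


def pad_batch(sequences: List[List[int]], pad_id: int = ENDOFTEXT_ID) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token-id lists to a common length and build masks from the true lengths.

    The mask comes from each sequence's length, not from `ids != pad_id`, because the
    padding id equals the EOS id and a trailing EOS must stay visible.
    """
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


if __name__ == "__main__":
    # Hypothetical token ids; each sequence ends with an appended EOS.
    batch = [[101, 102, ENDOFTEXT_ID], [103, ENDOFTEXT_ID]]
    ids, mask = pad_batch(batch)
    print(ids)
    print(mask)
    # A mask computed as `token != pad_id` would wrongly drop the real EOS tokens:
    naive = [[int(t != ENDOFTEXT_ID) for t in row] for row in ids]
    print(naive)
```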
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ece2d31f4d1f21e42d2f46a1749bea0d4e6b6745ea8fd4f19516c338b1cb2f8c
+size 11422175
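`tokenizer.json` is committed as a Git LFS pointer (it is listed with `filter=lfs` in `.gitattributes`) rather than as the tokenizer itself: three `key value` lines give the spec version, the SHA-256 object id, and the size in bytes of the real file. A small sketch for reading such a pointer and checking that a downloaded file matches it follows; it assumes only the three fields shown here, and `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers, not part of Git LFS or `huggingface_hub`.

```python
import hashlib
from typing import Dict


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Parse the `key value` lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    if not fields.get("version", "").startswith("https://git-lfs.github.com/spec/"):
        raise ValueError("not a Git LFS pointer")
    return fields


def matches_pointer(data: bytes, fields: Dict[str, str]) -> bool:
    """Check a downloaded file's size and SHA-256 digest against the pointer's `size` and `oid`."""
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    return len(data) == int(fields["size"]) and hashlib.sha256(data).hexdigest() == digest


if __name__ == "__main__":
    pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:ece2d31f4d1f21e42d2f46a1749bea0d4e6b6745ea8fd4f19516c338b1cb2f8c
size 11422175
"""
    fields = parse_lfs_pointer(pointer)
    print(fields["oid"], int(fields["size"]))
```

Checking both size and digest is cheap and catches the common failure where a clone without LFS leaves this pointer text in place of the real `tokenizer.json`.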
tokenizer_config.json ADDED
@@ -0,0 +1,216 @@
+ {
+ "add_bos_token": false,
+ "add_eos_token": true,
+ "add_prefix_space": false,
+ "added_tokens_decoder": {
+ "151643": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151644": {
+ "content": "<|im_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151645": {
+ "content": "<|im_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151646": {
+ "content": "<|object_ref_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151647": {
+ "content": "<|object_ref_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151648": {
+ "content": "<|box_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151649": {
+ "content": "<|box_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151650": {
+ "content": "<|quad_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151651": {
+ "content": "<|quad_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151652": {
+ "content": "<|vision_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151653": {
+ "content": "<|vision_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151654": {
+ "content": "<|vision_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151655": {
+ "content": "<|image_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151656": {
+ "content": "<|video_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151657": {
+ "content": "<tool_call>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151658": {
+ "content": "</tool_call>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151659": {
+ "content": "<|fim_prefix|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151660": {
+ "content": "<|fim_middle|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151661": {
+ "content": "<|fim_suffix|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151662": {
+ "content": "<|fim_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151663": {
+ "content": "<|repo_name|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151664": {
+ "content": "<|file_sep|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ }
+ },
+ "additional_special_tokens": [
+ "<|im_start|>",
+ "<|im_end|>",
+ "<|object_ref_start|>",
+ "<|object_ref_end|>",
+ "<|box_start|>",
+ "<|box_end|>",
+ "<|quad_start|>",
+ "<|quad_end|>",
+ "<|vision_start|>",
+ "<|vision_end|>",
+ "<|vision_pad|>",
+ "<|image_pad|>",
+ "<|video_pad|>"
+ ],
+ "bos_token": null,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "clean_up_tokenization_spaces": false,
+ "eos_token": "<|endoftext|>",
+ "errors": "replace",
+ "extra_special_tokens": {},
+ "max_length": 1536,
+ "model_max_length": 131072,
+ "pad_to_multiple_of": null,
+ "pad_token": "<|endoftext|>",
+ "pad_token_type_id": 0,
+ "padding_side": "left",
+ "split_special_tokens": false,
+ "stride": 0,
+ "tokenizer_class": "Qwen2Tokenizer",
+ "truncation_side": "right",
+ "truncation_strategy": "longest_first",
+ "unk_token": null
+ }
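This tokenizer config pads on the left (`"padding_side": "left"`), truncates on the right (`"truncation_side": "right"`), and uses `<|endoftext|>` (id 151643) as the padding token. The pure-Python sketch below only illustrates what those two settings mean for a batch of already-tokenized ids; it is not the `transformers` implementation, the function name `pad_and_truncate` is hypothetical, and using the `max_length` value of 1536 as the cap is an assumption for illustration (how a loader applies this field versus `model_max_length`, 131072, is not stated here).

```python
from typing import List, Tuple

PAD_ID = 151643     # <|endoftext|>, per "added_tokens_decoder" above
MAX_LENGTH = 1536   # "max_length" field; treating it as the cap is an assumption


def pad_and_truncate(batch: List[List[int]], pad_id: int = PAD_ID,
                     max_length: int = MAX_LENGTH) -> Tuple[List[List[int]], List[List[int]]]:
    """Truncate each row on the right, then left-pad all rows to the longest one."""
    if not batch:
        return [], []
    truncated = [ids[:max_length] for ids in batch]      # truncation_side = "right"
    width = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = width - len(ids)
        input_ids.append([pad_id] * n_pad + ids)          # padding_side = "left"
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return input_ids, attention_mask


ids, mask = pad_and_truncate([[11, 12, 13], [21]], max_length=2)
print(ids)   # [[11, 12], [151643, 21]]
print(mask)  # [[1, 1], [0, 1]]
```

With left padding, the last position of every row holds a real token, so positions at the end of the sequence line up across the batch; right truncation keeps the beginning of an over-long input.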
vocab.json ADDED
The diff for this file is too large to render.