Update README.md (#2)
- Update README.md (1ecacfae83f8c133d25d3e9dab947b165e7dad4f)
- Update README.md (f3b0d1834c69cc833d7309ca5a0332ee5ea89f8c)
- Update README.md (e50debb5b571598f215e2c6df1b116a55b13b573)
- Update README.md (3468cf0477fae6e4853c29dfd7eaad21e0bbe43b)
- Update README.md (b9fed2067a3fa4b079d81676cae638cf33b65c8a)
README.md
CHANGED
|
@@ -125,29 +125,38 @@ The easiest way to starting using `jina-embeddings-v3` is to use Jina AI's [Embe
|
|
| 125 |
|
| 126 |
## Intended Usage & Model Info
|
| 127 |
|
| 128 |
-
`jina-embeddings-v3` is a multilingual **text embedding model** supporting **8192 sequence length**.
|
| 129 |
-
It is based on a XLMRoBERTa architecture (JinaXLMRoBERTa) that supports the Rotary Position Embeddings to allow longer sequence length.
|
| 130 |
-
The backbone `JinaXLMRoBERTa ` is pretrained on variable length textual data on Mask Language Modeling objective for 160k steps on 89 languages.
|
| 131 |
-
The model is further trained on Jina AI's collection of more than 500 millions of multilingual sentence pairs and hard negatives.
|
| 132 |
-
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
|
| 133 |
|
| 134 |
-
`jina-embeddings-v3`
|
| 135 |
|
| 136 |
-
|
| 137 |
|
| 138 |
-
|
| 139 |
-
2. **index**: Manages user documents submitted for indexing.
|
| 140 |
-
3. **text-matching**: Processes symmetric text similarity tasks, whether short or long, such as STS (Semantic Textual Similarity).
|
| 141 |
-
4. **classification**: Classifies user inputs into predefined categories.
|
| 142 |
-
5. **clustering**: Facilitates the clustering of embeddings for further analysis.
|
| 143 |
|
| 144 |
-
`jina-embeddings-v3`
|
| 145 |
|
| 146 |
|
| 147 |
|
| 148 |
## Data & Parameters
|
| 149 |
|
| 150 |
-
coming soon.
|
| 151 |
|
| 152 |
## Usage
|
| 153 |
|
| 125 |
|
| 126 |
## Intended Usage & Model Info
|
| 127 |
|
| 128 |
|
| 129 |
+
`jina-embeddings-v3` is a **multilingual multi-task text embedding model** designed for a variety of NLP applications.
|
| 130 |
+
Based on the [XLM-RoBERTa architecture](https://huggingface.co/jinaai/xlm-roberta-flash-implementation),
|
| 131 |
+
this model supports [Rotary Position Embeddings (RoPE)](https://arxiv.org/abs/2104.09864) to handle long sequences up to **8192 tokens**.
|
| 132 |
+
Additionally, it features [LoRA](https://arxiv.org/abs/2106.09685) adapters to generate task-specific embeddings efficiently.
|
| 133 |
|
| 134 |
+
### Key Features:
|
| 135 |
+
- **Extended Sequence Length:** Supports up to 8192 tokens with RoPE.
|
| 136 |
+
- **Task-Specific Embedding:** Customize embeddings through the `task_type` argument with the following options:
|
| 137 |
+
- `retrieval.query`: Used for query embeddings in asymmetric retrieval tasks
|
| 138 |
+
- `retrieval.passage`: Used for passage embeddings in asymmetric retrieval tasks
|
| 139 |
+
- `separation`: Used for embeddings in clustering and re-ranking applications
|
| 140 |
+
- `classification`: Used for embeddings in classification tasks
|
| 141 |
+
- `text-matching`: Used for embeddings in tasks that quantify similarity between two texts, such as STS or symmetric retrieval tasks
|
| 142 |
+
- **Matryoshka Embeddings**: Supports flexible embedding sizes (`32, 64, 128, 256, 512, 768, 1024`), allowing for truncating embeddings to fit your application.
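
As a rough illustration of the Matryoshka truncation mentioned in the list above, the sketch below keeps the first `dim` components of an embedding and rescales the result to unit length. The helper name `truncate_embedding`, the `MATRYOSHKA_DIMS` constant, and the re-normalization step are assumptions made for this example and are not part of the model's API; the input vector is a synthetic stand-in, not real model output.

```python
import math
from typing import List, Sequence

# Sizes listed in the feature description above.
MATRYOSHKA_DIMS = (32, 64, 128, 256, 512, 768, 1024)


def truncate_embedding(embedding: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale them to unit L2 norm.

    Hypothetical helper: re-normalizing after truncation is an assumption
    made here so that dot products stay comparable to cosine similarity.
    """
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    if dim > len(embedding):
        raise ValueError("dim exceeds the embedding length")
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be rescaled
    return [x / norm for x in head]


# Synthetic stand-in for a 1024-dimensional embedding (not real model data).
full = [math.sin(i + 1) for i in range(1024)]
small = truncate_embedding(full, 64)
print(len(small))
```

Smaller sizes reduce storage and speed up similarity search, usually at some cost in accuracy, so the right size depends on the application.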
|
| 143 |
|
| 144 |
+
### Model Lineage:
|
| 145 |
|
| 146 |
+
`jina-embeddings-v3` builds upon the [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model, which was originally trained on 100 languages.
|
| 147 |
+
We extended its capabilities with an extra pretraining phase on the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset,
|
| 148 |
+
then contrastively fine-tuned it on 30 languages for enhanced performance on embedding tasks in both monolingual and cross-lingual setups.
|
| 149 |
|
| 150 |
+
### Supported Languages:
|
| 151 |
+
While the base model supports 100 languages, we've focused our tuning efforts on the following 30 languages:
|
| 152 |
+
**Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek,
|
| 153 |
+
Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian,
|
| 154 |
+
Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,** and **Vietnamese.**
|
| 155 |
|
| 156 |
|
| 157 |
## Data & Parameters
|
| 158 |
|
| 159 |
+
The data and training details are described in the technical report (coming soon).
|
| 160 |
|
| 161 |
## Usage
|
| 162 |
|