language:
  - en
license: mit
tags:
  - token-classification
  - entity-recognition
  - foundation-model
  - feature-extraction
  - RoBERTa
  - generic
datasets:
  - numind/NuNER
pipeline_tag: token-classification
inference: false

SOTA Entity Recognition English Foundation Model by NuMind 🔥

This model provides the best embedding for the Entity Recognition task in English.

This is the model from our paper: NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data

Check out other models by NuMind:

  • SOTA Multilingual Entity Recognition Foundation Model: link
  • SOTA Sentiment Analysis Foundation Model: English, Multilingual

About

RoBERTa-base fine-tuned on the NuNER data.

Metrics:

Read more about the evaluation protocol and datasets in our paper.

Here is the aggregated performance of the models over several datasets.

k=X means that, for this evaluation, we trained the model on only X examples per class and then evaluated it on the full test set.

| Model | k=1 | k=4 | k=16 | k=64 |
|---|---|---|---|---|
| RoBERTa-base | 24.5 | 44.7 | 58.1 | 65.4 |
| RoBERTa-base + NER-BERT pre-training | 32.3 | 50.9 | 61.9 | 67.6 |
| NuNER v1.0 | 39.4 | 59.6 | 67.8 | 71.5 |
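The k=X setup described above can be illustrated with a small sampling sketch. This is a simplified illustration, not the paper's exact protocol: it assumes each training example carries a single class label, and the function name `sample_k_per_class` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def sample_k_per_class(examples, k, seed=0):
    """Return at most k examples per class from (text, label) pairs.

    Simplifying assumption: one class label per example. Real NER data
    has several entities per sentence, so the paper's protocol may differ.
    """
    rng = random.Random(seed)

    # Group the examples by their class label.
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    # Draw up to k examples from each class, in a deterministic label order.
    subset = []
    for label in sorted(by_label):
        pool = by_label[label]
        subset.extend(rng.sample(pool, min(k, len(pool))))
    return subset


# Hypothetical toy data, for illustration only.
toy_examples = [
    ("Paris", "LOC"), ("Berlin", "LOC"), ("Tokyo", "LOC"),
    ("NuMind", "ORG"), ("Hugging Face", "ORG"),
    ("Alice", "PER"),
]

# k=2: two examples per class where available, fewer if the class is smaller.
few_shot_train = sample_k_per_class(toy_examples, k=2)
```

The model would then be trained on `few_shot_train` and evaluated on the full, unsampled test set.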

NuNER v1.0 performs similarly to 7B LLMs (70 times bigger than NuNER v1.0) created specifically for the NER task.

| Model | k=8~16 | k=64~128 |
|---|---|---|
| UniversalNER (7B) | 57.89 ± 4.34 | 71.02 ± 1.53 |
| NuNER v1.0 (100M) | 58.75 ± 0.93 | 70.30 ± 0.35 |

Usage

Embeddings can be used out of the box or fine-tuned on specific datasets.

Get embeddings:

import torch
import transformers


model = transformers.AutoModel.from_pretrained(
    'numind/NuNER-v1.0',
    output_hidden_states=True
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    'numind/NuNER-v1.0'
)

text = [
    "NuMind is an AI company based in Paris and USA.",
    "See other models from us on https://huggingface.co/numind"
]
encoded_input = tokenizer(
    text,
    return_tensors='pt',
    padding=True,
    truncation=True
)
output = model(**encoded_input)

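# For better quality: concatenate the last hidden state with the one
# six layers earlier; shape (batch, seq_len, 2 * hidden_size)
emb = torch.cat(
    (output.hidden_states[-1], output.hidden_states[-7]),
    dim=2
)

# For better speed: use the last hidden state only
# emb = output.hidden_states[-1]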
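One way to use the embeddings out of the box is to train a small token-classification head on top of them while the encoder stays frozen. The sketch below is a minimal illustration under stated assumptions, not the procedure used in the paper: random tensors stand in for `emb` so that it runs without downloading the model, the embedding width assumes the concatenated variant (2 × 768 for RoBERTa-base), and `NUM_LABELS`, the labels, and the hyperparameters are hypothetical.

```python
import torch
from torch import nn

EMB_DIM = 2 * 768   # two concatenated RoBERTa-base hidden states
NUM_LABELS = 5      # hypothetical tag set size

torch.manual_seed(0)

# Stand-ins for the real tensors (assumption: in practice `emb` comes from
# the model and `attention_mask` from the tokenizer output).
emb = torch.randn(2, 12, EMB_DIM)                    # (batch, seq_len, dim)
attention_mask = torch.ones(2, 12, dtype=torch.long)
attention_mask[1, 8:] = 0                            # second sequence is padded

# Hypothetical per-token labels; padding positions are set to -100 so that
# the loss ignores them.
labels = torch.randint(0, NUM_LABELS, (2, 12))
labels[attention_mask == 0] = -100

# Linear head trained on frozen embeddings.
head = nn.Linear(EMB_DIM, NUM_LABELS)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

for step in range(3):
    logits = head(emb)                               # (batch, seq_len, labels)
    loss = loss_fn(logits.view(-1, NUM_LABELS), labels.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Predicted label id for every token position.
predictions = logits.argmax(dim=-1)
```

With the real model, computing `emb` inside `torch.no_grad()` keeps the encoder frozen. Full fine-tuning would instead pass the encoder parameters to the optimizer as well.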

Citation

@misc{bogdanov2024nuner,
      title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data}, 
      author={Sergei Bogdanov and Alexandre Constantin and Timothée Bernard and Benoit Crabbé and Etienne Bernard},
      year={2024},
      eprint={2402.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}