# Neurobiber: Fast and Interpretable Stylistic Feature Extraction

**Neurobiber** is a transformer-based model that quickly predicts **96 interpretable stylistic features** in text. These features are inspired by Biber’s multidimensional framework of linguistic style, capturing everything from **pronouns** and **passives** to **modal verbs** and **discourse devices**. By combining a robust, linguistically informed feature set with the speed of neural inference, Neurobiber enables large-scale stylistic analyses that were previously infeasible.

## Why Neurobiber?

Extracting Biber-style features typically involves running a full parser or specialized tagger, which can be computationally expensive for large datasets or real-time applications. Neurobiber overcomes these challenges by:
- **Operating up to 56x faster** than parsing-based approaches.
- Retaining the **interpretability** of classical Biber-like feature definitions.
- Delivering **high accuracy** on diverse text genres (e.g., social media, news, literary works).
- Allowing seamless integration with **modern deep learning** pipelines via Hugging Face.

By bridging detailed linguistic insights and industrial-scale performance, Neurobiber supports tasks in register analysis, style transfer, and more.

## Example Script

Below is an **example** showing how to load Neurobiber from Hugging Face, process single or multiple texts, and obtain a 96-dimensional binary vector for each input.

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "Blablablab/neurobiber"
CHUNK_SIZE = 512  # Neurobiber was trained with max_length=512
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# List of the 96 features that Neurobiber can predict
BIBER_FEATURES = [
    "BIN_QUAN","BIN_QUPR","BIN_AMP","BIN_PASS","BIN_XX0","BIN_JJ",
    "BIN_BEMA","BIN_CAUS","BIN_CONC","BIN_COND","BIN_CONJ","BIN_CONT",
    "BIN_DPAR","BIN_DWNT","BIN_EX","BIN_FPP1","BIN_GER","BIN_RB",
    "BIN_PIN","BIN_INPR","BIN_TO","BIN_NEMD","BIN_OSUB","BIN_PASTP",
    "BIN_VBD","BIN_PHC","BIN_PIRE","BIN_PLACE","BIN_POMD","BIN_PRMD",
    "BIN_WZPRES","BIN_VPRT","BIN_PRIV","BIN_PIT","BIN_PUBV","BIN_SPP2",
    "BIN_SMP","BIN_SERE","BIN_STPR","BIN_SUAV","BIN_SYNE","BIN_TPP3",
    "BIN_TIME","BIN_NOMZ","BIN_BYPA","BIN_PRED","BIN_TOBJ","BIN_TSUB",
    "BIN_THVC","BIN_NN","BIN_DEMP","BIN_DEMO","BIN_WHQU","BIN_EMPH",
    "BIN_HDG","BIN_WZPAST","BIN_THAC","BIN_PEAS","BIN_ANDC","BIN_PRESP",
    "BIN_PROD","BIN_SPAU","BIN_SPIN","BIN_THATD","BIN_WHOBJ","BIN_WHSUB",
    "BIN_WHCL","BIN_ART","BIN_AUXB","BIN_CAP","BIN_SCONJ","BIN_CCONJ",
    "BIN_DET","BIN_EMOJ","BIN_EMOT","BIN_EXCL","BIN_HASH","BIN_INF",
    "BIN_UH","BIN_NUM","BIN_LAUGH","BIN_PRP","BIN_PREP","BIN_NNP",
    "BIN_QUES","BIN_QUOT","BIN_AT","BIN_SBJP","BIN_URL","BIN_WH",
    "BIN_INDA","BIN_ACCU","BIN_PGAS","BIN_CMADJ","BIN_SPADJ","BIN_X"
]

def load_model_and_tokenizer():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).to(DEVICE)
    model.eval()
    return model, tokenizer

def chunk_text(text, chunk_size=CHUNK_SIZE):
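    # Chunks are built from whitespace tokens, which only approximate the
    # model's subword tokens; the tokenizer's truncation downstream guards
    # against any chunk that still exceeds the 512-token limit.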
    tokens = text.strip().split()
    if not tokens:
        return []
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]

def get_predictions_chunked_batch(model, tokenizer, texts, chunk_size=CHUNK_SIZE, subbatch_size=32):
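    # Flatten all inputs into a single list of chunks, recording which
    # slice of that list belongs to each original text.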
    chunked_texts = []
    chunk_indices = []
    for idx, text in enumerate(texts):
        start = len(chunked_texts)
        text_chunks = chunk_text(text, chunk_size)
        chunked_texts.extend(text_chunks)
        chunk_indices.append({
            'original_idx': idx,
            'chunk_range': (start, start + len(text_chunks))
        })

    # If there are no chunks (empty inputs), return zeros
    if not chunked_texts:
        return np.zeros((len(texts), model.config.num_labels))

    all_chunk_preds = []
    for i in range(0, len(chunked_texts), subbatch_size):
        batch_chunks = chunked_texts[i : i + subbatch_size]
        encodings = tokenizer(
            batch_chunks,
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=chunk_size
        ).to(DEVICE)

        with torch.no_grad(), torch.amp.autocast(DEVICE):
            outputs = model(**encodings)
            probs = torch.sigmoid(outputs.logits)
        all_chunk_preds.append(probs.cpu())

    all_chunk_preds = torch.cat(all_chunk_preds, dim=0) if all_chunk_preds else torch.empty(0)
    predictions = [None] * len(texts)

    for info in chunk_indices:
        start, end = info['chunk_range']
        if start == end:
            # No tokens => no features
            pred = torch.zeros(model.config.num_labels)
        else:
            # Take max across chunks for each feature
            chunk_preds = all_chunk_preds[start:end]
            pred, _ = torch.max(chunk_preds, dim=0)
        predictions[info['original_idx']] = (pred > 0.5).int().numpy()

    return np.array(predictions)

def predict_batch(model, tokenizer, texts, chunk_size=CHUNK_SIZE, subbatch_size=32):
    return get_predictions_chunked_batch(model, tokenizer, texts, chunk_size, subbatch_size)

def predict_text(model, tokenizer, text, chunk_size=CHUNK_SIZE, subbatch_size=32):
    batch_preds = predict_batch(model, tokenizer, [text], chunk_size, subbatch_size)
    return batch_preds[0]
```

## Single-Text Usage
```python
model, tokenizer = load_model_and_tokenizer()
sample_text = "This is a sample text demonstrating certain stylistic features."
predictions = predict_text(model, tokenizer, sample_text)
print("Binary feature vector:", predictions)
# Example output: [0, 1, 0, 1, ... 1, 0] (length 96)
```

## Batch Usage
```python
docs = [
    "First text goes here.",
    "Second text, slightly different style."
]
model, tokenizer = load_model_and_tokenizer()
preds = predict_batch(model, tokenizer, docs)
print(preds.shape)  # (2, 96)
```


## How It Works

Neurobiber is a fine-tuned RoBERTa model. Given a text:
1. The text is split into **chunks** (up to 512 tokens each).
2. Each chunk is fed through the model, which produces **96 sigmoid probabilities** (one per feature).
3. The probabilities are max-pooled across chunks and thresholded at 0.5, so a feature is marked `1` if the model detects it in at least one chunk.
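
As a toy illustration of step 3 (the probabilities below are made up, independent of the model), max-pooling followed by a 0.5 threshold reduces per-chunk probabilities to a single binary vector:

```python
import torch

# Hypothetical sigmoid outputs for 3 chunks x 4 features
chunk_probs = torch.tensor([
    [0.10, 0.80, 0.30, 0.05],
    [0.60, 0.20, 0.40, 0.05],
    [0.05, 0.10, 0.90, 0.05],
])

# Max over chunks, then threshold: a feature is 1 if any chunk crosses 0.5
doc_pred = (chunk_probs.max(dim=0).values > 0.5).int()
print(doc_pred)  # tensor([1, 1, 1, 0], dtype=torch.int32)
```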

Each row in `preds` is a 96-element array whose order matches `BIBER_FEATURES`.

## Interpreting Outputs

- Each element of the vector is a binary label (0 or 1), indicating whether the model detected the corresponding linguistic feature (e.g., `BIN_VBD` for past tense verbs).
- Long texts are chunked into segments of up to 512 tokens; if a feature appears in any chunk, it is marked `1` for the whole text (the max-then-threshold aggregation sketched above).
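
To map a prediction back to feature names, you can zip the vector with `BIBER_FEATURES`. A minimal sketch, reusing the functions defined above (the variable name `active_features` is illustrative):

```python
model, tokenizer = load_model_and_tokenizer()
predictions = predict_text(model, tokenizer, "She had quickly finished the report.")

# Collect the names of every feature the model flagged as present
active_features = [name for name, flag in zip(BIBER_FEATURES, predictions) if flag]
print(active_features)  # e.g., ['BIN_VBD', ...] depending on the model's output
```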