---
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- biology
- genomics
- long-context
---
|
|
|
# GENERator-eukaryote-3b-base model
|
|
|
## **Important Notice**

If you are using **GENERator** for sequence generation, please ensure that the length of each input sequence is a multiple of **6**. This can be achieved by either:

1. Padding the sequence on the left with `'A'` (**left padding**);
2. Truncating the sequence from the left (**left truncation**).

This requirement arises because **GENERator** uses a 6-mer tokenizer. If the input sequence length is not a multiple of **6**, the tokenizer appends an `'<oov>'` (out-of-vocabulary) token to the end of the token sequence, which can derail subsequent generation into uninformative output such as repeated `'AAAAAA'`. A minimal illustration of both options is shown below.
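For example, here is a quick sanity check of both options on a 7-nt sequence (purely illustrative; the generation example below provides reusable helper functions):

```python
seq = "ATGAGGT"                                # length 7, not a multiple of 6
padded = "A" * ((6 - len(seq) % 6) % 6) + seq  # left padding   -> "AAAAAATGAGGT" (length 12)
truncated = seq[len(seq) % 6:]                 # left truncation -> "TGAGGT" (length 6)
print(padded, truncated)
```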
|
|
|
We apologize for any inconvenience this may cause and recommend adhering to the above guidelines to ensure accurate and meaningful generation results.
|
|
|
|
|
## About

In this repository, we present GENERator, a generative genomic foundation model with 3B parameters and a context length of 98k base pairs, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. This extensive and diverse pre-training data endows GENERator with strong understanding and generation capabilities across a wide range of organisms.
|
|
|
For more technical details, please refer to our paper [GENERator: A Long-Context Generative Genomic Foundation Model](https://arxiv.org/abs/2502.07272). The code and implementation details are available on GitHub: [https://github.com/GenerTeam/GENERator](https://github.com/GenerTeam/GENERator).
|
|
|
|
|
## How to use

### Simple example 1: generation
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")
config = model.config

max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

def left_padding(sequence, padding_char='A', multiple=6):
    """Pad the sequence on the left with 'A' so its length is a multiple of 6."""
    remainder = len(sequence) % multiple
    if remainder != 0:
        padding_length = multiple - remainder
        return padding_char * padding_length + sequence
    return sequence

def left_truncation(sequence, multiple=6):
    """Drop leading bases so the sequence length is a multiple of 6."""
    remainder = len(sequence) % multiple
    if remainder != 0:
        return sequence[remainder:]
    return sequence

# Apply left_padding to all sequences
# padded_sequences = [left_padding(seq) for seq in sequences]

# Apply left_truncation to all sequences
truncated_sequences = [left_truncation(seq) for seq in sequences]

# Prepend the BOS token to each sequence.
sequences = [tokenizer.bos_token + sequence for sequence in truncated_sequences]

# Tokenize the sequences.
tokenizer.padding_side = "left"
inputs = tokenizer(
    sequences,
    add_special_tokens=False,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Generate the sequences (near-greedy decoding: with top_k=1 and a near-zero
# temperature, generation is effectively deterministic).
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=32, temperature=0.00001, top_k=1)

# Decode the generated sequences.
decoded_sequences = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Print the decoded sequences.
print(decoded_sequences)

# Expect uninformative continuations (e.g., 'AAAAAA'):
# these input sequences are too short to provide sufficient context.
```
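For more diverse generations, you can switch to stochastic sampling. A minimal sketch, reusing the `inputs` from above (the sampling hyperparameters here are illustrative defaults, not tuned values from the paper):

```python
# Stochastic sampling for more diverse outputs (illustrative settings).
with torch.inference_mode():
    sampled = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=True,   # enable sampling instead of near-greedy decoding
        temperature=1.0,  # higher temperature -> more diversity
        top_k=50,         # restrict sampling to the 50 most likely tokens
    )
print(tokenizer.batch_decode(sampled, skip_special_tokens=True))
```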
|
|
|
### Simple example 2: embedding
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")

config = model.config
max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Tokenize the sequences with add_special_tokens=True to automatically add special tokens,
# such as the BOS and EOS tokens, at the appropriate positions.
tokenizer.padding_side = "right"
inputs = tokenizer(
    sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Perform a forward pass through the model to obtain the outputs, including hidden states.
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

# Retrieve the hidden states from the last layer.
hidden_states = outputs.hidden_states[-1]  # Shape: (batch_size, sequence_length, hidden_size)

# Use the attention_mask to determine the index of the last token in each sequence.
# Since add_special_tokens=True is used, the last token is typically the EOS token.
attention_mask = inputs["attention_mask"]
last_token_indices = attention_mask.sum(dim=1) - 1  # Index of the last token for each sequence

# Extract the embedding corresponding to the EOS token for each sequence.
seq_embeddings = []
for i, token_index in enumerate(last_token_indices):
    # Fetch the embedding for the last token (EOS token).
    seq_embedding = hidden_states[i, token_index, :]
    seq_embeddings.append(seq_embedding)

# Stack the embeddings into a tensor with shape (batch_size, hidden_size).
seq_embeddings = torch.stack(seq_embeddings)

print("Sequence Embeddings:", seq_embeddings)
```
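These embeddings can be compared directly downstream. For example, a minimal sketch computing the cosine similarity between the two sequence embeddings produced above:

```python
import torch.nn.functional as F

# Cosine similarity between the two sequence embeddings (illustrative).
sim = F.cosine_similarity(seq_embeddings[0], seq_embeddings[1], dim=0)
print("Cosine similarity:", sim.item())
```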
|
|
|
## Citation

```
@misc{wu2025generator,
      title={GENERator: A Long-Context Generative Genomic Foundation Model},
      author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang},
      year={2025},
      eprint={2502.07272},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07272},
}
```