---
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- biology
- genomics
- long-context
---
|
|
|
# GENERator-eukaryote-3b-base model
|
|
|
## **Important Notice**

If you are using **GENERator** for sequence generation, please ensure that the length of each input sequence is a multiple of **6**. This can be achieved by either:

1. Padding the sequence on the left with `'A'` (**left padding**);
2. Truncating the sequence from the left (**left truncation**).

This requirement arises because **GENERator** uses a 6-mer tokenizer. If the input sequence length is not a multiple of **6**, the tokenizer appends an `'<oov>'` (out-of-vocabulary) token to the end of the token sequence, which can derail subsequent generation into uninformative output such as repeated `'AAAAAA'`. A minimal illustration of both options is shown below.
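For example, here is a quick sanity check of both options on a 7-nt sequence (purely illustrative; the generation example below provides reusable helper functions):

```python
seq = "ATGAGGT"                                # length 7, not a multiple of 6
padded = "A" * ((6 - len(seq) % 6) % 6) + seq  # left padding   -> "AAAAAATGAGGT" (length 12)
truncated = seq[len(seq) % 6:]                 # left truncation -> "TGAGGT" (length 6)
print(padded, truncated)
```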
|
|
|
We apologize for any inconvenience this may cause and recommend adhering to the above guidelines to ensure accurate and meaningful generation results.
|
|
|
|
|
## About

In this repository, we present GENERator, a generative genomic foundation model with 3B parameters and a context length of 98k base pairs, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. This extensive and diverse pre-training data endows GENERator with strong understanding and generation capabilities across a wide range of organisms.
|
|
|
For more technical details, please refer to our paper [GENERator: A Long-Context Generative Genomic Foundation Model](https://arxiv.org/abs/2502.07272). The code and implementation details are available on GitHub: [https://github.com/GenerTeam/GENERator](https://github.com/GenerTeam/GENERator).
|
|
|
|
|
## How to use

### Simple example 1: generation
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")
config = model.config

max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

def left_padding(sequence, padding_char='A', multiple=6):
    """Pad the sequence on the left with 'A' so its length is a multiple of 6."""
    remainder = len(sequence) % multiple
    if remainder != 0:
        padding_length = multiple - remainder
        return padding_char * padding_length + sequence
    return sequence

def left_truncation(sequence, multiple=6):
    """Drop leading bases so the sequence length is a multiple of 6."""
    remainder = len(sequence) % multiple
    if remainder != 0:
        return sequence[remainder:]
    return sequence

# Apply left_padding to all sequences
# padded_sequences = [left_padding(seq) for seq in sequences]

# Apply left_truncation to all sequences
truncated_sequences = [left_truncation(seq) for seq in sequences]

# Prepend the BOS token to each sequence.
sequences = [tokenizer.bos_token + sequence for sequence in truncated_sequences]

# Tokenize the sequences.
tokenizer.padding_side = "left"
inputs = tokenizer(
    sequences,
    add_special_tokens=False,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Generate the sequences (near-greedy decoding: with top_k=1 and a near-zero
# temperature, generation is effectively deterministic).
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=32, temperature=0.00001, top_k=1)

# Decode the generated sequences.
decoded_sequences = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Print the decoded sequences.
print(decoded_sequences)

# Expect uninformative continuations (e.g., 'AAAAAA'):
# these input sequences are too short to provide sufficient context.
```
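For more diverse generations, you can switch to stochastic sampling. A minimal sketch, reusing the `inputs` from above (the sampling hyperparameters here are illustrative defaults, not tuned values from the paper):

```python
# Stochastic sampling for more diverse outputs (illustrative settings).
with torch.inference_mode():
    sampled = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=True,   # enable sampling instead of near-greedy decoding
        temperature=1.0,  # higher temperature -> more diversity
        top_k=50,         # restrict sampling to the 50 most likely tokens
    )
print(tokenizer.batch_decode(sampled, skip_special_tokens=True))
```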
|
|
|
### Simple example 2: embedding
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")

config = model.config
max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Tokenize the sequences with add_special_tokens=True to automatically add special tokens,
# such as the BOS and EOS tokens, at the appropriate positions.
tokenizer.padding_side = "right"
inputs = tokenizer(
    sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Perform a forward pass through the model to obtain the outputs, including hidden states.
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

# Retrieve the hidden states from the last layer.
hidden_states = outputs.hidden_states[-1]  # Shape: (batch_size, sequence_length, hidden_size)

# Use the attention_mask to determine the index of the last token in each sequence.
# Since add_special_tokens=True is used, the last token is typically the EOS token.
attention_mask = inputs["attention_mask"]
last_token_indices = attention_mask.sum(dim=1) - 1  # Index of the last token for each sequence

# Extract the embedding corresponding to the EOS token for each sequence.
seq_embeddings = []
for i, token_index in enumerate(last_token_indices):
    # Fetch the embedding for the last token (EOS token).
    seq_embedding = hidden_states[i, token_index, :]
    seq_embeddings.append(seq_embedding)

# Stack the embeddings into a tensor with shape (batch_size, hidden_size).
seq_embeddings = torch.stack(seq_embeddings)

print("Sequence Embeddings:", seq_embeddings)
```
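These embeddings can be compared directly downstream. For example, a minimal sketch computing the cosine similarity between the two sequence embeddings produced above:

```python
import torch.nn.functional as F

# Cosine similarity between the two sequence embeddings (illustrative).
sim = F.cosine_similarity(seq_embeddings[0], seq_embeddings[1], dim=0)
print("Cosine similarity:", sim.item())
```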
|
|
|
## Citation

```
@misc{wu2025generator,
      title={GENERator: A Long-Context Generative Genomic Foundation Model},
      author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang},
      year={2025},
      eprint={2502.07272},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07272},
}
```