---
license: mit
pipeline_tag: text-generation
---

# Intro

[Activation Beacon](https://arxiv.org/abs/2401.03462) is a plug-in module for transformer-based LLMs that enables effective, efficient, and flexible compression of long contexts.

# Environment

```
pip install transformers
pip install flash-attn --no-build-isolation
```
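
Optionally, you can run a quick sanity check before loading the model, since the usage example below moves the model to GPU and relies on FlashAttention 2. This is only a minimal sketch for convenience:

```python
# Optional sanity check: confirm the packages installed above are importable
# and that a CUDA-capable GPU is visible.
import torch
import transformers

print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # needed for attn_implementation="flash_attention_2"
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn is missing; install it with the command above")
```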
					
						

# Usage

```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "namespace-Pt/beacon-qwen-2-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

model = model.cuda().eval()

with torch.no_grad():
    # short context
    messages = [{"role": "user", "content": "Tell me about yourself."}]
    inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(f"Input Length: {inputs['input_ids'].shape[1]}")
    print(f"Output:       {repr(tokenizer.decode(outputs[0], skip_special_tokens=True))}")

    # reset memory before new generation task
    model.memory.reset()

    # long context
    with open("infbench.json", encoding="utf-8") as f:
        example = json.load(f)
    messages = [{"role": "user", "content": example["context"]}]
    inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda")
    outputs = model.generate(**inputs, do_sample=False, top_p=1, temperature=1, max_new_tokens=20)[:, inputs["input_ids"].shape[1]:]
    print("*" * 20)
    print(f"Input Length: {inputs['input_ids'].shape[1]}")
    print(f"Answers:      {example['answer']}")
    print(f"Prediction:   {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
```
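
Because the compressed activations persist in `model.memory`, call `model.memory.reset()` whenever you start an unrelated generation task, as in the script above. Below is a minimal sketch of iterating over several long-context examples; the `examples` list is an assumption and is expected to hold records shaped like the `infbench.json` example (`{"context": ..., "answer": ...}`):

```python
# Sketch: run several independent long-context tasks, resetting the beacon
# memory between them. Assumes `model`, `tokenizer`, and `examples` as above.
with torch.no_grad():
    for example in examples:
        model.memory.reset()  # clear compressed activations from the previous task
        messages = [{"role": "user", "content": example["context"]}]
        inputs = tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=True,
            return_tensors="pt", return_dict=True,
        ).to("cuda")
        outputs = model.generate(**inputs, do_sample=False, max_new_tokens=20)
        prediction = tokenizer.decode(outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        print(f"Answers:    {example['answer']}")
        print(f"Prediction: {prediction}")
```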
					
						
**NOTE**: It is okay to see a warning like `This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (32768). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.` You can safely ignore it.
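
If you prefer to silence the reminder, one possible option is to lower the transformers log verbosity; this assumes the message is routed through the standard transformers logger, which may not hold for this repository's remote code:

```python
# May hide the max-length reminder, assuming it is emitted via the transformers logger.
from transformers.utils import logging

logging.set_verbosity_error()
```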