	---
	license: apache-2.0
	datasets:
	- linxy/LaTeX_OCR
	- prithivMLmods/Img2Text-Plaintext-Retrieval
	- prithivMLmods/Img2Text-Algorithm-Retrieval
	- unsloth/LaTeX_OCR
	- mychen76/invoices-and-receipts_ocr_v1
	language:
	- en
	base_model:
	- Qwen/Qwen2-VL-2B-Instruct
	pipeline_tag: image-text-to-text
	library_name: transformers
	tags:
	- OCR
	- KIE
	- Key Information Extraction
	- Messy Handwriting Recognition
	- text-generation-inference
	- VLM
	- Callisto
	- OCR#3
	- RAG
	- 2B
	---

	![xfghnbfgt.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/dblNhOatHlsLemn1Yt_zo.png)

	# Callisto-OCR3-2B-Instruct

	> [!Note]
	> The Callisto-OCR3-2B-Instruct model is a fine-tuned version of Qwen2-VL-2B-Instruct, optimized for messy handwriting recognition, Optical Character Recognition (OCR), English language understanding, and math problem solving with LaTeX-formatted output. It combines a conversational interface with visual and textual understanding to handle multimodal tasks.

	[![Open Demo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://huggingface.co/prithivMLmods/Callisto-OCR3-2B-Instruct/blob/main/Callisto-OCR3-2B-Instruct-Demo/Callisto_OCR3_2B_Instruct.ipynb)


	### Key Enhancements

	* SoTA understanding of images of various resolutions and aspect ratios: Callisto-OCR3 achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.

	* Enhanced Handwriting OCR: Optimized for recognizing and interpreting messy handwriting with high accuracy, making it ideal for digitizing handwritten documents and notes.

	* Understanding videos of 20+ minutes: Callisto-OCR3 can process long videos, enabling high-quality video-based question answering, transcription, and content generation.

	* Agent that can operate mobile phones, robots, and other devices: With advanced reasoning and decision-making, Callisto-OCR3 can be integrated with such devices to perform automated tasks based on visual and textual input.

	* Multilingual Support: Besides English and Chinese, Callisto-OCR3 supports text recognition inside images in multiple languages, including European languages, Japanese, Korean, Arabic, and Vietnamese.

	### How to Use

	```python
	import torch  # only needed for the flash_attention_2 variant below
	from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
	from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

	# Load the model on the available device(s)
	model = Qwen2VLForConditionalGeneration.from_pretrained(
	    "prithivMLmods/Callisto-OCR3-2B-Instruct", torch_dtype="auto", device_map="auto"
	)

	# Enable flash_attention_2 for better acceleration and memory optimization
	# model = Qwen2VLForConditionalGeneration.from_pretrained(
	#     "prithivMLmods/Callisto-OCR3-2B-Instruct",
	#     torch_dtype=torch.bfloat16,
	#     attn_implementation="flash_attention_2",
	#     device_map="auto",
	# )

	# Default processor
	processor = AutoProcessor.from_pretrained("prithivMLmods/Callisto-OCR3-2B-Instruct")

	# Customize visual token range for speed-memory balance
	# min_pixels = 256 * 28 * 28
	# max_pixels = 1280 * 28 * 28
	# processor = AutoProcessor.from_pretrained(
	#     "prithivMLmods/Callisto-OCR3-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
	# )

	# A single user turn containing one image and one text instruction
	messages = [
	    {
	        "role": "user",
	        "content": [
	            {
	                "type": "image",
	                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
	            },
	            {"type": "text", "text": "Recognize the handwriting in this image."},
	        ],
	    }
	]

	# Preparation for inference
	text = processor.apply_chat_template(
	    messages, tokenize=False, add_generation_prompt=True
	)
	image_inputs, video_inputs = process_vision_info(messages)
	inputs = processor(
	    text=[text],
	    images=image_inputs,
	    videos=video_inputs,
	    padding=True,
	    return_tensors="pt",
	)
	# Move inputs to wherever device_map="auto" placed the model
	inputs = inputs.to(model.device)

	# Inference: Generate the output
	generated_ids = model.generate(**inputs, max_new_tokens=128)
	# Drop the prompt tokens so only the newly generated tokens are decoded
	generated_ids_trimmed = [
	    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	]
	output_text = processor.batch_decode(
	    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
	)
	print(output_text)
	```
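
The optional `min_pixels` / `max_pixels` settings in the snippet above trade image detail for speed and memory. As a rough sketch, assuming (as in the upstream Qwen2-VL documentation) that each visual token covers a 28×28 pixel area, the pixel limits for a desired visual-token range can be computed as follows. The helper name `pixel_budget` is hypothetical and not part of any library.

```python
PATCH_SIZE = 28  # assumed side length (in pixels) of the area covered by one visual token


def pixel_budget(min_tokens: int, max_tokens: int) -> tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    pixels_per_token = PATCH_SIZE * PATCH_SIZE
    return min_tokens * pixels_per_token, max_tokens * pixels_per_token


# A 256-1280 token range, a common speed/memory compromise
min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)
```

The two values can then be passed to `AutoProcessor.from_pretrained(..., min_pixels=min_pixels, max_pixels=max_pixels)`. A smaller range should reduce memory use and latency, at the cost of fine detail that may matter for small or dense handwriting.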

	### Buffering Output
	```python
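	def stream_text(streamer):
	    """Yield the accumulated text as the streamer (e.g. a TextIteratorStreamer) emits chunks."""
	    buffer = ""
	    for new_text in streamer:
	        buffer += new_text
	        # Remove <|im_end|> or similar tokens from the output
	        buffer = buffer.replace("<|im_end|>", "")
	        yield buffer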
	```
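
The buffering loop above assumes a `streamer` object that yields decoded text chunks, and it only works inside a generator function. The following is a minimal self-contained sketch of that pattern, with a plain list standing in for the real streamer; the function name `stream_clean_text` and the sample chunks are illustrative assumptions, not part of the model's API.

```python
from typing import Iterable, Iterator

END_TOKEN = "<|im_end|>"


def stream_clean_text(streamer: Iterable[str]) -> Iterator[str]:
    """Yield the accumulated text after each chunk, with the end token removed."""
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        # Replacing on the whole buffer also catches a token split across chunks.
        buffer = buffer.replace(END_TOKEN, "")
        yield buffer


# Stand-in for a real streamer: any iterable of text chunks works.
fake_chunks = ["Dear ", "Sam,", " see you soon.<|im_", "end|>"]
snapshots = list(stream_clean_text(fake_chunks))
print(snapshots[-1])
```

With the real model, the streamer would typically be a `transformers.TextIteratorStreamer` built from `processor.tokenizer` and passed as `streamer=` to `model.generate`, with generation running in a background thread. Note that an intermediate snapshot can briefly contain a partial token (here `<|im_`) until the rest of it arrives.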

	### Key Features

	1. Advanced Handwriting OCR:
	   - Excels at recognizing and transcribing messy and cursive handwriting into digital text with high accuracy.

	2. Vision-Language Integration:
	   - Combines image understanding with natural language processing to convert images into text.

	3. Optical Character Recognition (OCR):
	   - Extracts and processes textual information from images with precision.

	4. Math and LaTeX Support:
	   - Solves math problems and outputs equations in LaTeX format.

	5. Conversational Capabilities:
	   - Designed to handle multi-turn interactions, providing context-aware responses.

	6. Image-Text-to-Text Generation:
	   - Inputs can include images, text, or a combination, and the model generates descriptive or problem-solving text.
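
Features 5 and 6 can be illustrated by how a conversation is assembled before it reaches `processor.apply_chat_template`. The sketch below builds a multi-turn `messages` list in the same structure as the example in How to Use. The helper names `user_turn` and `assistant_turn`, the image URL, and the sample assistant reply are hypothetical placeholders.

```python
from typing import Any, Optional

Message = dict[str, Any]


def user_turn(text: str, image: Optional[str] = None) -> Message:
    """Build a user message: an optional image entry followed by a text entry."""
    content: list[dict[str, str]] = []
    if image is not None:
        content.append({"type": "image", "image": image})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


def assistant_turn(text: str) -> Message:
    """Build an assistant message holding an earlier reply, kept as context."""
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}


messages = [
    user_turn(
        "Transcribe the handwritten equation in this image as LaTeX.",
        image="https://example.com/notes.jpg",  # placeholder URL
    ),
    assistant_turn(r"$x^2 - 5x + 6 = 0$"),  # illustrative reply, not real model output
    user_turn("Now solve it step by step."),  # text-only follow-up turn
]
print([m["role"] for m in messages])
```

The resulting list can go through the same `apply_chat_template`, `process_vision_info`, and `generate` steps shown in How to Use; keeping earlier turns in the list is what gives the model the context for follow-up questions.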