---
license: mit
datasets:
- Floppanacci/QWQ-LongCOT-AIMO
base_model:
- Floppanacci/DeepSeek-R1-Distill-Qwen-7B-Floppanacci
pipeline_tag: text-generation
tags:
- math
- qwen2.5
- aimo
language:
- en
---

# DeepSeek-R1-Distill-Qwen-7B-Floppanacci (4-bit AWQ Quantized)

This repository contains the 4-bit AWQ (Activation-aware Weight Quantization) version of the [`Floppanacci/DeepSeek-R1-Distill-Qwen-7B-Floppanacci`](https://huggingface.co/Floppanacci/DeepSeek-R1-Distill-Qwen-7B-Floppanacci) model.

## Model Description

This model is optimized for faster inference and a lower memory footprint than the original bf16/fp16 fine-tuned model. It is designed for mathematical reasoning tasks, especially Chain-of-Thought style problem solving relevant to the [AIMO competition](https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2). The original model was fine-tuned on the [`Floppanacci/QWQ-LongCOT-AIMO`](https://huggingface.co/datasets/Floppanacci/QWQ-LongCOT-AIMO) dataset.

## How to Use

### With `transformers` (and `autoawq`)

Install the `autoawq` library along with `transformers` and `torch`:

```bash
pip install autoawq transformers torch
```

Then load and run the model with `transformers`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Floppanacci/DeepSeek-R1-Distill-Qwen-7B-Floppanacci-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the AWQ-quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto"  # Automatically places the model on available GPU(s)
)

# Example prompt (adjust based on how the model expects input; see the chat-template note at the end of this card).
# A raw string keeps the LaTeX backslashes from being interpreted as Python escape sequences.
prompt = r"Question: Let $ABCD$ be a unit square. Let $P$ be a point inside the square such that $PA = \sqrt{5}/3$, $PB = \sqrt{2}/3$, and $PC = \sqrt{5}/3$. Find the distance $PD$. Answer:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate with greedy decoding (sampling parameters like temperature are ignored when do_sample=False)
outputs = model.generate(**inputs, max_new_tokens=300, do_sample=False)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### With `vLLM` (Optimized Inference)

For higher throughput and optimized inference, you can use vLLM. First, install it:

```bash
pip install vllm
```

Then run the following Python code:

```python
from vllm import LLM, SamplingParams

# Define prompts (raw string so the LaTeX backslashes are preserved)
prompts = [
    r"Question: Let $ABCD$ be a unit square. Let $P$ be a point inside the square such that $PA = \sqrt{5}/3$, $PB = \sqrt{2}/3$, and $PC = \sqrt{5}/3$. Find the distance $PD$. Answer:",
    "Question: What is the sum of the first 100 positive integers? Answer:",
]

# Define sampling parameters
sampling_params = SamplingParams(temperature=0.1, top_p=0.95, max_tokens=300)

# Initialize the LLM engine with the AWQ model
llm = LLM(
    model="Floppanacci/DeepSeek-R1-Distill-Qwen-7B-Floppanacci-AWQ",
    quantization="awq",
    dtype="auto",  # vLLM typically uses half-precision activations; bfloat16 is chosen on compatible hardware (e.g. L4, A100, H100)
    trust_remote_code=True,
)

# Generate responses
outputs = llm.generate(prompts, sampling_params)

# Print the outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
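
### Prompt Format via Chat Template (Optional)

If the tokenizer ships a chat template (the base DeepSeek-R1-Distill-Qwen models do), you can let `transformers` build the prompt instead of hand-writing the `Question: ... Answer:` framing. The sketch below uses the standard `apply_chat_template` API; whether this fine-tune retains the base model's template is an assumption, so check `tokenizer.chat_template` before relying on it.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Floppanacci/DeepSeek-R1-Distill-Qwen-7B-Floppanacci-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Assumption: the tokenizer carries a chat template inherited from
# DeepSeek-R1-Distill-Qwen-7B; inspect tokenizer.chat_template to confirm.
messages = [
    {"role": "user", "content": "What is the sum of the first 100 positive integers?"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant-turn marker so the model starts answering
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=300, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```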