# Local GGUF Chat (Q2_K_L) — Run on CPU (16GB RAM)
This repository shows how to:
1. Download a single quantized GGUF weight file (`*Q2_K_L.gguf`) from Hugging Face by pasting your token into a file.
2. Run a small local Flask chat UI that talks to the model using `llama-cpp-python`.
## Files
- `download_model.py` — edit & paste your HF token, then run to download only the Q2_K_L gguf file.
- `app.py` — Flask server + model loader + chat endpoints.
- `templates/index.html` — Chat UI (ChatGPT-like).
- `requirements.txt` — Python dependencies.
## Requirements
- Python 3.10.9 (recommended)
- ~16 GB RAM (CPU-only); speed depends on quantization & CPU cores.
## Quick start
1. Create & activate a virtual environment:
   ```bash
   python -m venv oss_env
   # Windows
   oss_env\Scripts\activate
   # Linux / macOS
   source oss_env/bin/activate
   ```
2. Install Python dependencies:
`pip install -r requirements.txt`
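The repo's own `requirements.txt` is authoritative; based on the components described in this README, a minimal equivalent would plausibly contain just these packages (listed unpinned here for illustration):

```text
flask
llama-cpp-python
huggingface_hub
```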
3. Edit `download_model.py`:
   - Paste your Hugging Face token into `HUGGINGFACE_TOKEN`.
   - If your model repo is different, update `REPO_ID`.
4. Download the Q2_K_L GGUF:
`python download_model.py`
The script will print the full path to the downloaded `.gguf` file.
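For reference, here is a minimal sketch of what such a download script can look like. It assumes `huggingface_hub` is installed; `REPO_ID` and `FILENAME` below are placeholders, and the repo's actual `download_model.py` may differ in detail.

```python
# Illustrative sketch of a download_model.py-style script (not the repo's exact code)
from huggingface_hub import hf_hub_download

HUGGINGFACE_TOKEN = "hf_..."            # paste your Hugging Face token here
REPO_ID = "your-org/your-model-GGUF"    # placeholder: the GGUF model repo
FILENAME = "model-Q2_K_L.gguf"          # placeholder: exact name of the Q2_K_L file

# Download only the single Q2_K_L quantized file into ./models/
path = hf_hub_download(
    repo_id=REPO_ID,
    filename=FILENAME,
    token=HUGGINGFACE_TOKEN,
    local_dir="models",
)
print(f"Downloaded GGUF to: {path}")
```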
5. (Optional) Edit `app.py`:
   - To explicitly set the exact `.gguf` path, set `MODEL_PATH` at the top of `app.py`.
   - Otherwise `app.py` will auto-detect the first `.gguf` under `models/`.
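Auto-detection boils down to globbing for the first `.gguf` under `models/` and passing it to `llama-cpp-python`. A sketch of the idea, with `n_ctx` and `n_threads` values chosen only as examples:

```python
from pathlib import Path
from llama_cpp import Llama

MODEL_PATH = ""  # set explicitly, or leave empty to auto-detect

# Auto-detect the first .gguf under models/ when MODEL_PATH is not set
if not MODEL_PATH:
    ggufs = sorted(Path("models").rglob("*.gguf"))
    if not ggufs:
        raise FileNotFoundError("No .gguf file found under models/")
    MODEL_PATH = str(ggufs[0])

# Load the quantized model on CPU; context size and thread count are illustrative
llm = Llama(model_path=MODEL_PATH, n_ctx=2048, n_threads=8)
```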
6. Run the Flask app:
`python app.py`
Open http://localhost:5000 in your browser.
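The server side essentially forwards the conversation to `llama-cpp-python` and returns the reply. A minimal sketch of such an endpoint (the route name `/chat`, the JSON shape, and the model path are assumptions; the real `app.py` may differ):

```python
from flask import Flask, request, jsonify
from llama_cpp import Llama

app = Flask(__name__)
llm = Llama(model_path="models/model-Q2_K_L.gguf", n_ctx=2048)  # example path

@app.route("/chat", methods=["POST"])
def chat():
    # Expect a JSON body like {"messages": [{"role": "user", "content": "Hi"}]}
    messages = request.get_json().get("messages", [])
    result = llm.create_chat_completion(messages=messages, max_tokens=512)
    reply = result["choices"][0]["message"]["content"]
    return jsonify({"reply": reply})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```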
7. If needed, you can run `inference.py` for a single-stage demo without the chat loop.
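A single-stage run (one prompt, one completion, no chat loop) reduces to something like the sketch below; the model path and prompt are only examples, and the repo's `inference.py` may be structured differently.

```python
from llama_cpp import Llama

# Load the downloaded Q2_K_L file (adjust the path to your model)
llm = Llama(model_path="models/model-Q2_K_L.gguf", n_ctx=2048)

# One-shot completion: no chat history, just a single user message
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF quantization in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```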