# Local GGUF Chat (Q2_K_L) — Run on CPU (16GB RAM)

This repository shows how to:
1. Download a single quantized GGUF weights file (`*Q2_K_L.gguf`) from Hugging Face by pasting your token into a script.
2. Run a small local Flask chat UI that talks to the model using `llama-cpp-python`.

## Files
- `download_model.py` — edit & paste your HF token, then run to download only the Q2_K_L gguf file.
- `app.py` — Flask server + model loader + chat endpoints.
- `templates/index.html` — Chat UI (ChatGPT-like).
- `requirements.txt` — Python dependencies.

## Requirements
- Python 3.10.9 (**recommended**)
- ~16 GB RAM (CPU-only); speed depends on quantization & CPU cores.

## Quick start

1. Create & activate a virtual environment:
   ```bash
   python -m venv oss_env

   # Windows
   oss_env\Scripts\activate

   # Linux / macOS
   source oss_env/bin/activate
   ```

2. Install Python dependencies:

`pip install -r requirements.txt`
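
The dependency list itself lives in `requirements.txt`; it is not reproduced here, but for this setup it would need at least the web server, the llama.cpp bindings, and (presumably) the Hugging Face downloader, along these lines:

```text
flask
llama-cpp-python
huggingface_hub
```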





3. Edit `download_model.py`:

Paste your Hugging Face token into `HUGGINGFACE_TOKEN`. If your model repo is different, update `REPO_ID`.





4. Download the Q2_K_L GGUF:

`python download_model.py`

The script will print the full path to the downloaded `.gguf` file.
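
For reference, a minimal sketch of what `download_model.py` might look like, assuming it uses `huggingface_hub.snapshot_download` to fetch only the matching quantization file; the actual script in this repo may differ:

```python
# Hypothetical sketch of download_model.py -- the real script may differ.
from pathlib import Path
from huggingface_hub import snapshot_download

HUGGINGFACE_TOKEN = "hf_..."          # paste your token here
REPO_ID = "your-org/your-gguf-repo"   # update if your model repo differs
LOCAL_DIR = "models"

# Fetch only the Q2_K_L quantization from the repo.
snapshot_download(
    repo_id=REPO_ID,
    allow_patterns=["*Q2_K_L.gguf"],
    local_dir=LOCAL_DIR,
    token=HUGGINGFACE_TOKEN,
)

# Print the full path of every downloaded .gguf file.
for gguf in Path(LOCAL_DIR).rglob("*.gguf"):
    print(gguf.resolve())
```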





5. (Optional) Edit `app.py`:

If you want to set the exact `.gguf` path explicitly, set `MODEL_PATH` at the top of `app.py`. Otherwise `app.py` will auto-detect the first `.gguf` under `models/`, along the lines of the sketch below.
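
The auto-detection can be as simple as a glob over `models/`; an illustrative sketch of that fallback logic (not necessarily how `app.py` implements it):

```python
# Illustrative sketch of the model-path fallback described above.
import glob

MODEL_PATH = ""  # set to an exact .gguf path to skip auto-detection

if not MODEL_PATH:
    candidates = sorted(glob.glob("models/**/*.gguf", recursive=True))
    if not candidates:
        raise FileNotFoundError("No .gguf found under models/ -- run download_model.py first")
    MODEL_PATH = candidates[0]  # first .gguf found under models/
```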





6. Run the Flask app:

`python app.py`

Then open http://localhost:5000 in your browser.
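
For orientation, here is a condensed sketch of how a Flask chat endpoint can wrap the model with `llama-cpp-python`; the real `app.py` in this repo may be structured differently, and the model filename below is just a placeholder:

```python
# Condensed sketch of a Flask chat server over llama-cpp-python.
from flask import Flask, request, jsonify, render_template
from llama_cpp import Llama

MODEL_PATH = "models/your-model.Q2_K_L.gguf"  # illustrative path

app = Flask(__name__)
llm = Llama(model_path=MODEL_PATH, n_ctx=4096, n_threads=8)  # CPU-only load

@app.route("/")
def index():
    return render_template("index.html")

@app.route("/chat", methods=["POST"])
def chat():
    # Expects: {"messages": [{"role": "user", "content": "..."}, ...]}
    messages = request.get_json()["messages"]
    result = llm.create_chat_completion(messages=messages, max_tokens=512, temperature=0.7)
    return jsonify({"reply": result["choices"][0]["message"]["content"]})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```

Loading the model once at startup means every request reuses the same `Llama` instance instead of reloading the weights, which matters on a CPU-only 16 GB machine.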



7. If needed, you can run `inference.py` for the single-stage demo without the chat loop; see the sketch below.
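
Assuming "single-stage" means one prompt in and one response out, a minimal sketch of such a run with the same `llama-cpp-python` bindings (the actual `inference.py` may differ):

```python
# Minimal one-shot inference sketch: one prompt in, one response out.
from llama_cpp import Llama

llm = Llama(model_path="models/your-model.Q2_K_L.gguf", n_ctx=2048, n_threads=8)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF quantization in one sentence."}],
    max_tokens=256,
)
print(result["choices"][0]["message"]["content"])
```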