Upload folder using huggingface_hub
README.md (changed)

</div>
<!-- header end -->

# Falcon-40B-Instruct 3bit GPTQ

This repo contains an experimental GPTQ 3bit model for [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).

It is the result of quantising to 3bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).

## EXPERIMENTAL

Please note this is an experimental GPTQ model. Support for it is currently quite limited.

It is also expected to be **VERY SLOW**. This is currently unavoidable, but is being looked at.

This is a 3bit model with the aim of being loadable on a GPU with 24GB VRAM. In my testing so far it has not exceeded 24GB VRAM, at least up to 512 tokens returned. It may exceed 24GB beyond that.
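
As a rough sanity check on the 24GB figure: at 3 bits per weight, the quantised weights of a ~40B-parameter model come to roughly 15 GB, which is what leaves room for activations and the KV cache. This is only a back-of-the-envelope estimate, ignoring quantisation metadata and runtime buffers:

```python
# Back-of-the-envelope only: ~40B parameters at 3 bits each,
# ignoring scales/zeros and runtime buffers.
params = 40e9
weight_gb = params * 3 / 8 / 1e9
print(f"~{weight_gb:.0f} GB of quantised weights")  # ~15 GB, leaving headroom under 24GB
```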

Please be aware that you should currently expect around 0.7 tokens/s on 40B Falcon GPTQ.

## AutoGPTQ

AutoGPTQ is required: `pip install auto-gptq`

So please first update text-generation-webui to the latest version.

1. Launch text-generation-webui with the following command-line arguments: `--autogptq --trust-remote-code`
2. Click the **Model tab**.
3. Under **Download custom model or LoRA**, enter `TheBloke/falcon-40B-instruct-3bit-GPTQ` (a scripted alternative using `huggingface_hub` is sketched after this list).
4. Click **Download**.
5. Wait until it says it's finished downloading.
6. Click the **Refresh** icon next to **Model** in the top left.
7. In the **Model drop-down**: choose the model you just downloaded, `falcon-40B-instruct-3bit-GPTQ`.
8. Once it says it's loaded, click the **Text Generation tab** and enter a prompt!
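
If you prefer to script the download in steps 3–5 instead of using the UI, a minimal sketch with `huggingface_hub` (the local path is just an example, pick your own):

```python
from huggingface_hub import snapshot_download

# Fetch all files from the repo into a local folder.
snapshot_download(
    repo_id="TheBloke/falcon-40B-instruct-3bit-GPTQ",
    local_dir="/path/to/falcon40b-instruct-3bit-gptq",  # example path
)
```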

## About `trust_remote_code`

```python
from auto_gptq import AutoGPTQForCausalLM

# Download the model from HF and store it locally, then reference its location here:
quantized_model_dir = "/path/to/falcon40b-instruct-3bit-gptq"

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)
```
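
The snippet above stops after loading the tokenizer. Below is a minimal sketch of how loading and generation might be completed with AutoGPTQ's `from_quantized` API; the generation settings and prompt are illustrative assumptions, not values taken from this repo:

```python
# Continues from the snippet above (quantized_model_dir and tokenizer are already defined).
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    use_safetensors=True,    # the provided file is a .safetensors checkpoint
    device="cuda:0",
    use_triton=False,        # Triton is not supported for this model yet (see "Provided files")
    trust_remote_code=True,  # Falcon needs custom modelling code
)

prompt = "Write a story about llamas"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")

# Expect this to be slow (around 0.7 tokens/s, per the note above).
output = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0]))
```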

## Provided files

**gptq_model-3bit--1g.safetensors**

This will work with AutoGPTQ as of commit `3cb1bf5` (`3cb1bf5a6d43a06dc34c6442287965d1838303d3`).

It was created without groupsize to reduce VRAM requirements, and with `desc_act` (act-order) to improve inference quality.

* `gptq_model-3bit--1g.safetensors`
  * Works only with the latest AutoGPTQ CUDA, compiled from source as of commit `3cb1bf5`
  * At this time it does not work with AutoGPTQ Triton, but support will hopefully be added in time.
  * Works with text-generation-webui using `--autogptq --trust-remote-code`
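
For reference, "no groupsize plus `desc_act`" maps onto AutoGPTQ's quantisation settings roughly as follows. This is a sketch of the settings described above, not necessarily the exact `quantize_config.json` shipped with the repo:

```python
from auto_gptq import BaseQuantizeConfig

# 3-bit, no grouping (-1), with act-order / desc_act, matching the description above.
quantize_config = BaseQuantizeConfig(
    bits=3,
    group_size=-1,
    desc_act=True,
)
```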

## Contact