robgreenberg3 committed · verified
Commit 8cf93d7 · 1 Parent(s): 1a2afbe

Validated Model changes (#1)


- Validated Model changes (08ebd0c9b9d0066a050f918a0b77c1b0e69e0eee)
- Update README.md (65e4de928bebcaffddb8ae671de81baa6dcf84af)

Files changed (1): README.md (+186 -2)
README.md CHANGED
@@ -13,8 +13,16 @@ base_model: meta-llama/Llama-3.1-70B-Instruct
  pipeline_tag: text-generation
  library_name: transformers
  ---
- # Model Overview
- **Built with Llama**
+
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ Llama-3.1-Nemotron-70B-Instruct-HF
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>
+

  ## Description:

@@ -59,6 +67,182 @@ Note: This model is a demonstration of our techniques for improving helpfulness
  Your use of this model is governed by the [NVIDIA Open Model License](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).
  Additional Information: [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/). Built with Llama.

+ ## Deployment
+
+ This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.
+
+ Deploy on <strong>vLLM</strong>
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "RedHatAI/Llama-3.1-Nemotron-70B-Instruct-HF"
+ number_gpus = 4
+
+ sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
+
+ # Build a chat-formatted prompt with the model's chat template
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
+
+ outputs = llm.generate(prompt, sampling_params)
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
+ ```
+
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
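+ As a minimal sketch (the port, tensor-parallel size, and prompt here are illustrative), the same model can be served over an OpenAI-compatible HTTP API with the `vllm serve` CLI and queried with `curl`:
+
+ ```bash
+ # Start an OpenAI-compatible server (listens on port 8000 by default)
+ vllm serve RedHatAI/Llama-3.1-Nemotron-70B-Instruct-HF --tensor-parallel-size 4
+
+ # Query the chat completions endpoint
+ curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "RedHatAI/Llama-3.1-Nemotron-70B-Instruct-HF",
+     "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]
+   }'
+ ```
+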
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+ --ipc=host \
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+ --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+ --name=vllm \
+ registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+ vllm serve \
+ --tensor-parallel-size 8 \
+ --max-model-len 32768 \
+ --enforce-eager --model RedHatAI/Llama-3.1-Nemotron-70B-Instruct-HF
+ ```
+
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+ </details>
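+
+ Once the container is up, the endpoint can be sanity-checked via the OpenAI-compatible `/v1/models` route (an illustrative check, assuming the default port mapping above):
+
+ ```bash
+ # List the models the server is exposing
+ curl http://localhost:8000/v1/models
+ ```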
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/llama-3-1-nemotron-70b-instruct-hf:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/llama-3-1-nemotron-70b-instruct-hf
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/llama-3-1-nemotron-70b-instruct-hf
+ ```
+
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
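+
+ As an illustrative follow-up (this assumes `ilab model serve` is listening on its default local address, `http://127.0.0.1:8000`; the model name in the payload is hypothetical and should match what the server reports), the served model can also be queried directly over its OpenAI-compatible API:
+
+ ```bash
+ # Ask the locally served model a question via the chat completions endpoint
+ curl http://127.0.0.1:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "llama-3-1-nemotron-70b-instruct-hf",
+     "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]
+   }'
+ ```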
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: Llama-3.1-Nemotron-70B-Instruct-HF # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: Llama-3.1-Nemotron-70B-Instruct-HF # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-3-1-nemotron-70b-instruct-hf:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # Make sure first to be in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # Apply both resources to run the model
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
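+
+ As an illustrative follow-up (resource names here match the example manifests above), deployment progress can be checked with standard `oc` queries:
+
+ ```bash
+ # Watch the InferenceService until it reports READY and a URL
+ oc get inferenceservice
+
+ # Inspect the predictor pod if the service does not become ready
+ oc get pods
+ ```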
+
+ ```bash
+ # Replace <inference-service-name>, <cluster-ingress-domain>, and <model-name> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+ # - <model-name> is the InferenceService name defined above.
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "<model-name>",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
+
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+
+
  ## Evaluation Metrics

  As of 1 Oct 2024, Llama-3.1-Nemotron-70B-Instruct performs best on Arena Hard, AlpacaEval 2 LC (verified tab) and MT Bench (GPT-4-Turbo)