robgreenberg3 and jennyyyi committed on
Commit 1f9bf6b · verified · 1 Parent(s): fe954f3

Update readme.md with logos and deployment details (#2)


- Update readme.md with logos and deployment details (bcc03c1cb8cc8173688bf20971ff991bb8df36cf)
- Update README.md (7702b2221c9da7d17ef7c8c2a27b45631bc562cd)
- Update README.md (5d5fd45cadfd7f57544c828ca1c14cf6e73a7a0a)
- Update README.md (ea69d15624d25710c5f19e41433e250a21bc761c)
- Update README.md (a8e512a31260a053a1bcde02ec3f6e3b1a37a3c9)


Co-authored-by: Jenny Y <[email protected]>

Files changed (1)
  1. README.md +169 -3
README.md CHANGED
@@ -31,8 +31,14 @@ tags:
license: other
license_name: llama4
---
-
- # Llama-4-Scout-17B-16E-Instruct-quantized.w4a16
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+   Llama-4-Scout-17B-16E-Instruct-quantized.w4a16
+   <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+   <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>

## Model Overview
- **Model Architecture:** Llama4ForConditionalGeneration
@@ -55,7 +61,9 @@ Weight quantization also reduces disk size requirements by approximately 75%. Th

## Deployment

- This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
+ This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.
+
+ Deploy on <strong>vLLM</strong>

```python
from vllm import LLM, SamplingParams
@@ -80,6 +88,164 @@ print(generated_text)

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

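A minimal sketch of querying such an OpenAI-compatible endpoint from Python, assuming the model is served with `vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16` on the default `localhost:8000`; the base URL, API key, and served model name are deployment-specific. The same request shape works against the Red Hat AI Inference Server container below, which publishes the same port.

```python
# Minimal sketch: query an OpenAI-compatible vLLM endpoint.
# Assumes the server listens on localhost:8000 with no authentication;
# adjust base_url, api_key, and the model name for your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```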
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+   --ipc=host \
+   --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+   --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+   --name=vllm \
+   registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+   vllm serve \
+   --tensor-parallel-size 8 \
+   --max-model-len 32768 \
+   --enforce-eager --model RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16
+ ```
+
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+ </details>
+
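Assuming the container above started successfully, port 8000 is reachable on localhost, and no API key is enforced, a quick check that the endpoint is up and the model is registered might look like:

```python
# Sanity check against the container started above (URL and lack of auth are assumptions).
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=30)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # expect RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16
```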
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/llama-4-scout-17b-16e-instruct-quantized-w4a16:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/llama-4-scout-17b-16e-instruct-quantized-w4a16
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/llama-4-scout-17b-16e-instruct-quantized-w4a16
+ ```
+
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-4-scout-17b-16e-instruct-quantized-w4a16:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # Make sure you are in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # Apply both resources to run the model
+
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
+
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+
+
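The same streaming request can be issued from Python; this is a rough sketch that assumes the placeholder host is replaced with your InferenceService route (see `oc get inferenceservice`) and that the route does not require a token.

```python
# Sketch of the streaming chat request from Python.
# Replace the placeholder base_url with your InferenceService URL;
# the API key is a stand-in unless your route enforces authentication.
from openai import OpenAI

client = OpenAI(
    base_url="https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1",
    api_key="EMPTY",
)

stream = client.chat.completions.create(
    model="Llama-4-Scout-17B-16E-Instruct-quantized.w4a16",
    messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
    stream=True,
)
for chunk in stream:
    # Role-only and usage-only chunks carry no text content.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```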

## Evaluation