---
library_name: transformers
license: apache-2.0
language:
  - en
  - fr
  - es
  - it
  - pt
  - zh
  - ar
  - ru
base_model:
  - HuggingFaceTB/SmolLM3-3B-Base
tags:
  - openvino
  - int4
  - quantization
  - edge-deployment
  - optimization
  - smollm3
inference: false
---

# SmolLM3 INT4 OpenVINO

## 🚀 Optimized for Edge Deployment

This is an INT4 quantized version of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) using OpenVINO, designed for efficient inference on edge devices and CPUs.

## Model Overview

- **Base Model:** SmolLM3-3B
- **Quantization:** INT4 via OpenVINO
- **Size Reduction:** INT4 weights are roughly 4x smaller than FP16 (exact on-disk figures pending)
- **Target Hardware:** CPUs, Intel GPUs, NPUs
- **Use Cases:** Local inference, edge deployment, resource-constrained environments

## 🔧 Technical Details

### Quantization Process

```python
# Quantized using OpenVINO NNCF
# INT4 symmetric quantization
# Calibration dataset: [specify if used]
```

A hedged reproduction sketch is included near the end of this card.

### Model Architecture

- Same architecture as SmolLM3-3B
- GQA and NoPE preserved
- 64k context support (128k with YaRN)
- Multilingual capabilities maintained

## 📊 Performance (Experimental)

> ⚠️ **Note:** This is an experimental quantization. Formal benchmarks are pending.

Expected benefits of INT4 quantization:

- Reduced model size
- Faster CPU inference
- Lower memory requirements
- Some quality trade-off

Actual metrics will be added after proper benchmarking.

## 🛠️ How to Use

### Installation

```bash
pip install "optimum[openvino]" transformers
```

### Basic Usage

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "dev-bjoern/smollm3-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)

# Generate text
prompt = "Explain quantum computing in simple terms"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### With Extended Thinking

This builds on the `tokenizer` and `model` from the snippet above:

```python
messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "Solve this step by step: 25 * 16"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 🎯 Intended Use

- **Edge AI applications**
- **Local LLM deployment**
- **Resource-constrained environments**
- **Privacy-focused applications**
- **Offline AI assistants**

## ⚡ Optimization Tips

1. **CPU Inference:** Use the OpenVINO runtime for best performance (a rough timing sketch is included near the end of this card)
2. **Batch Processing:** Consider batching requests when possible
3. **Memory:** INT4 significantly reduces memory requirements

## 🧪 Experimental Status

This is my first experiment with OpenVINO INT4 quantization. Feedback and contributions are welcome!

### Known Limitations

- No formal benchmarks yet
- Quantization settings not fully optimized
- Some quality degradation vs. full precision

### Future Improvements

- [ ] Comprehensive benchmarking
- [ ] Mixed-precision experiments
- [ ] Model compression analysis
- [ ] Calibration dataset optimization

## 🤝 Contributing

Found issues or have suggestions? Please open a discussion or issue!
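## 🔬 Reproducing the Quantization (Sketch)

The exact export settings for this repository have not been published, so the snippet below is only a minimal sketch of how an INT4 symmetric weight-only export can be produced with Optimum Intel and NNCF. The `group_size` value and the output directory are illustrative assumptions, not the verified settings behind this model.

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Assumed settings: 4-bit symmetric weight-only quantization.
# group_size=128 is an illustrative choice, not the confirmed
# setting used for this repository.
quant_config = OVWeightQuantizationConfig(bits=4, sym=True, group_size=128)

model = OVModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",        # original full-precision model
    export=True,                       # convert to OpenVINO IR on the fly
    quantization_config=quant_config,
)
model.save_pretrained("smollm3-int4-ov")  # hypothetical output directory
```

Optimum Intel also exposes a CLI route for the same weight compression, e.g. `optimum-cli export openvino --weight-format int4`, which applies NNCF's default INT4 settings.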
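## ⏱️ Rough Throughput Check (Sketch)

As a companion to the optimization tips above, here is a minimal sketch for timing CPU generation. `PERFORMANCE_HINT` is a standard OpenVINO runtime property ("LATENCY" favors single-request response time); the prompt and token count are arbitrary choices, and results will vary widely by hardware.

```python
import time

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "dev-bjoern/smollm3-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Ask the OpenVINO runtime to optimize for single-request latency.
model = OVModelForCausalLM.from_pretrained(
    model_id,
    ov_config={"PERFORMANCE_HINT": "LATENCY"},
)

inputs = tokenizer("Explain quantum computing in simple terms", return_tensors="pt")

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=100)
elapsed = time.perf_counter() - start

# Count only newly generated tokens, excluding the prompt.
new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tok/s)")
```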
## 📚 Resources

- [Original SmolLM3 Model](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
- [OpenVINO Documentation](https://docs.openvino.ai/)
- [Optimum Intel](https://huggingface.co/docs/optimum/intel/index)

## 🙏 Acknowledgments

- Hugging Face team for SmolLM3
- Intel OpenVINO team for quantization tools
- Community for feedback and support

## 📝 Citation

If you use this model, please cite both the original SmolLM3 release and this work:

```bibtex
@misc{smollm3-int4-ov,
  author       = {Bjoern Bethge},
  title        = {SmolLM3 INT4 OpenVINO},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dev-bjoern/smollm3-int4-ov}}
}
```

---

**Status:** 🧪 Experimental | **Feedback:** Welcome | **License:** Apache 2.0