
Llama-3.2 1B 4-bit Quantized Model

Model Overview

  • Base Model: meta-llama/Llama-3.2-1B
  • Model Name: rautaditya/llama-3.2-1b-4bit-gptq
  • Quantization: 4-bit GPTQ (Generative Pretrained Transformer Quantization)

Model Description

This is a 4-bit quantized version of the Llama-3.2 1B model, designed to reduce model size and inference latency while maintaining reasonable performance. The quantization allows for more efficient deployment in resource-constrained environments.

Key Features

  • Reduced model size
  • Faster inference times
  • Compatible with Hugging Face Transformers
  • GPTQ quantization for optimal compression

Quantization Details

  • Quantization Method: GPTQ (Generative Pretrained Transformer Quantization)
  • Bit Depth: 4-bit
  • Base Model: Llama-3.2 1B
  • Quantization Library: AutoGPTQ (see the quantization sketch below)
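
For reference, here is a minimal sketch of how a checkpoint like this can be produced with AutoGPTQ. The calibration text, group_size, and desc_act values below are illustrative assumptions, not the exact settings used for this repository.

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(base_id)

# 4-bit GPTQ settings; group_size and desc_act are assumed, not taken from this repo's config
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

# Load the full-precision base model with the quantization config attached
model = AutoGPTQForCausalLM.from_pretrained(base_id, quantize_config)

# GPTQ needs calibration samples; a single toy example stands in for a real calibration set
examples = [tokenizer("Quantization reduces model size while keeping accuracy close to the original.", return_tensors="pt")]
model.quantize(examples)

model.save_quantized("llama-3.2-1b-4bit-gptq", use_safetensors=True)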

Installation Requirements

pip install transformers accelerate auto-gptq torch
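
Prebuilt auto-gptq wheels are specific to a CUDA/PyTorch combination, and loading GPTQ checkpoints through plain transformers classes (as in the pipeline example below) typically also requires the optimum package. A quick sanity check that the stack is installed and a GPU is visible:

import torch
from importlib.metadata import version

# Print installed versions to confirm the environment is set up
print("torch:", version("torch"))
print("transformers:", version("transformers"))
print("accelerate:", version("accelerate"))
print("auto-gptq:", version("auto-gptq"))
print("CUDA available:", torch.cuda.is_available())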

Usage

Transformers Pipeline

from transformers import AutoTokenizer, pipeline

model_id = "rautaditya/llama-3.2-1b-4bit-gptq"
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    device_map="auto"
)

prompt = "What is the meaning of life?"
# The pipeline returns a list of dicts; print the generated text from the first result
outputs = pipe(prompt, max_new_tokens=100)
print(outputs[0]["generated_text"])

Direct Model Loading

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name = "rautaditya/llama-3.2-1b-4bit-gptq"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Already-quantized checkpoints are loaded with from_quantized rather than from_pretrained
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device_map="auto"
)
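
Once loaded this way, generation goes through the usual generate API. A minimal sketch, assuming the AutoGPTQ wrapper exposes generate and device as it normally does:

import torch

prompt = "What is the meaning of life?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding for brevity; sampling parameters can be passed as usual
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))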

Performance Considerations

  • Memory Efficiency: Significantly reduced memory footprint compared to the full-precision model
  • Inference Speed: Faster inference due to reduced computational requirements (a rough measurement sketch follows below)
  • Potential Accuracy Trade-off: Minor accuracy degradation compared to the full-precision model
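
These effects depend on hardware and batch shape, so they are worth measuring directly. A rough sketch that times one call through the pipeline from the Usage section and reports peak GPU memory (the prompt and token budget are arbitrary):

import time
import torch

# Time a single generation call; reuses the `pipe` object from the Usage section
start = time.perf_counter()
_ = pipe("Explain quantization in one sentence.", max_new_tokens=64)
print(f"latency: {time.perf_counter() - start:.2f} s")

# Peak GPU memory allocated so far (only meaningful when running on CUDA)
if torch.cuda.is_available():
    print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")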

Limitations

  • May show slight differences in output quality compared to the original model
  • Performance can vary based on specific use case and inference environment

Recommended Use Cases

  • Low-resource environments
  • Edge computing
  • Mobile applications
  • Embedded systems
  • Rapid prototyping

License

Please refer to the original Meta Llama 3.2 model license for usage restrictions and permissions.

Citation

If you use this model, please cite:

@misc{llama3.2_4bit_quantized,
  title={Llama-3.2 1B 4-bit Quantized Model},
  author={Raut, Aditya},
  year={2024},
  publisher={Hugging Face}
}

Contributions and Feedback

  • Open to suggestions and improvements
  • Please file issues on the GitHub repository for any bugs or performance concerns

Acknowledgments

  • Meta AI for the base Llama-3.2 model
  • Hugging Face Transformers team
  • AutoGPTQ library contributors