
feat: Add CPU support

#18

Description

This PR adds support to modeling_nemotron.py for running inference on CPU. It is a cleaned-up version of the edits I made while working on support in llama.cpp.

Changes

  • Handle failed imports of rmsnorm_fn (see the import-guard sketch below)
  • Add an unoptimized implementation of MambaRMSNormGated.forward as a CPU fallback (sketched below)
  • Fix NemotronHMamba2Mixer.torch_forward to use repeat_interleave for B and C (see discussion here, and the shape example below)
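
For context, a minimal sketch of the kind of import guard involved, assuming the fused kernel comes from mamba_ssm as in the upstream modeling file (the exact import path may differ):

```python
try:
    # CUDA-only fused Triton kernel; this import fails on CPU-only installs.
    from mamba_ssm.ops.triton.layernorm_gated import rmsnorm_fn
except ImportError:
    # Signal that the pure-PyTorch fallback must be used instead.
    rmsnorm_fn = None
```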
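
A sketch of what an unoptimized pure-PyTorch MambaRMSNormGated.forward can look like; this follows the common Mamba-2 formulation (SiLU-gate, then RMS-normalize) and omits the grouped-norm variant for brevity, so it is illustrative rather than the exact code in this PR:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaRMSNormGated(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states, gate=None):
        # Compute in float32 for numerical stability, then cast back.
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        if gate is not None:
            # Mamba-2 style gating: modulate the activations with
            # SiLU(gate) before normalizing.
            hidden_states = hidden_states * F.silu(gate.to(torch.float32))
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```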
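
To illustrate the repeat_interleave fix: B and C are projected once per group and must be expanded to one copy per head. The shapes below are hypothetical; the point is that repeat_interleave keeps all heads of a group adjacent, whereas a plain .repeat() would interleave the groups and mismatch the fused-kernel layout:

```python
import torch

# Hypothetical shapes for illustration: B/C come out of the projection as
# (batch, seq_len, n_groups, ssm_state_size).
batch, seq_len, n_groups, num_heads, state = 2, 8, 2, 8, 16
B = torch.randn(batch, seq_len, n_groups, state)
C = torch.randn(batch, seq_len, n_groups, state)

# repeat_interleave yields head order [g0, g0, ..., g1, g1, ...], so every
# head sees the B/C of its own group.
B = B.repeat_interleave(num_heads // n_groups, dim=2)
C = C.repeat_interleave(num_heads // n_groups, dim=2)
assert B.shape == (batch, seq_len, num_heads, state)
```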
