MLP Design Choice

#47

by scissorstail - opened May 18, 2025

Why does this model series, unlike other models, always compute a single projection in the MLP and then split it into two chunks to get the up_states and gate? I'm not sure if this is the right place to ask, but I didn't really have anywhere else to turn.

https://github.com/huggingface/transformers/blob/40a493c7ed4f19f08eadb0639cf26d49bfa5e180/src/transformers/models/phi4_multimodal/modeling_phi4_multimodal.py#L763

gugarosa

Microsoft org Jul 8, 2025

Hello @scissorstail !

We used a few performance tricks to increase throughput and MFU (model FLOPs utilization) during pre-training, e.g., computing the up and gate states with a single fused matrix multiplication and then splitting the result into two chunks.
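To make the idea concrete, here is a minimal NumPy sketch (not the actual transformers code) of why one fused matrix gives the same result as two separate projections. The toy sizes, the SiLU activation, the gate-first ordering of the two halves, and all names (`w_gate`, `w_up`, `w_gate_up`, `mlp_separate`, `mlp_fused`) are assumptions for illustration; the down projection is left out.

```python
import numpy as np

def silu(x):
    # SiLU activation: x * sigmoid(x). Assumed here for illustration.
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
hidden_size, intermediate_size = 8, 16  # toy sizes, not the real config

# Two separate projection matrices, as in many other gated MLPs.
w_gate = rng.standard_normal((hidden_size, intermediate_size))
w_up = rng.standard_normal((hidden_size, intermediate_size))

# Fused variant: the same weights concatenated along the output axis
# (gate half first, up half second; this ordering is an assumption).
w_gate_up = np.concatenate([w_gate, w_up], axis=1)

def mlp_separate(x):
    # Two matrix multiplications, one per projection.
    return silu(x @ w_gate) * (x @ w_up)

def mlp_fused(x):
    # One larger matrix multiplication, then split the output into two chunks.
    gate, up = np.split(x @ w_gate_up, 2, axis=-1)
    return silu(gate) * up

x = rng.standard_normal((4, hidden_size))
out_separate = mlp_separate(x)
out_fused = mlp_fused(x)

# Largest elementwise difference between the two variants.
max_abs_diff = float(np.max(np.abs(out_separate - out_fused)))
```

Both variants compute the same function up to floating-point rounding. The fused version does it with one larger matrix multiplication instead of two smaller ones, which is typically friendlier to accelerator hardware.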

gugarosa changed discussion status to closed Jul 8, 2025

gugarosa changed discussion status to open Jul 8, 2025
