MLP Design Choice

#47
by scissorstail - opened

Why does this model series, unlike most other models, compute the up_states and gate in the MLP with one projection that is then split into two chunks? I'm not sure if this is the right place to ask, but I didn't really have anywhere else to turn.

https://github.com/huggingface/transformers/blob/40a493c7ed4f19f08eadb0639cf26d49bfa5e180/src/transformers/models/phi4_multimodal/modeling_phi4_multimodal.py#L763

Microsoft org

Hello @scissorstail !

We took advantage of some performance tricks to increase throughput and model FLOPs utilization (MFU) during pre-training, e.g., using a single weight matrix to compute both the up and gate states.
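
To illustrate the idea, here is a minimal framework-free sketch, not the actual Phi implementation. It assumes a SiLU activation and that the gate half comes first in the fused output; the function names (`mlp_separate`, `mlp_fused`) and the tiny dimensions are hypothetical, and the final down projection is omitted. It shows that concatenating the gate and up weight matrices along the output dimension and splitting the result in two gives the same values as two separate projections, while needing only one matrix multiply.

```python
import math
import random


def matmul(x, w):
    """Multiply x (n x d) by w (d x m), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def silu(v):
    """SiLU activation (assumed here; the real activation comes from the model config)."""
    return v / (1.0 + math.exp(-v))


def mlp_separate(x, w_gate, w_up):
    """Conventional gated MLP: two independent projections."""
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    return [[u * silu(g) for g, u in zip(g_row, u_row)]
            for g_row, u_row in zip(gate, up)]


def mlp_fused(x, w_gate_up, intermediate_size):
    """Fused variant: one projection, then split the output into two chunks."""
    fused = matmul(x, w_gate_up)  # shape: n x (2 * intermediate_size)
    out = []
    for row in fused:
        # Assumption: first half is the gate, second half is the up states.
        gate, up = row[:intermediate_size], row[intermediate_size:]
        out.append([u * silu(g) for g, u in zip(gate, up)])
    return out


random.seed(0)
hidden_size, intermediate_size, n_tokens = 4, 6, 3


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


x = rand_matrix(n_tokens, hidden_size)
w_gate = rand_matrix(hidden_size, intermediate_size)
w_up = rand_matrix(hidden_size, intermediate_size)

# Concatenate the two weight matrices along the output (column) dimension.
w_gate_up = [g_row + u_row for g_row, u_row in zip(w_gate, w_up)]

separate = mlp_separate(x, w_gate, w_up)
fused = mlp_fused(x, w_gate_up, intermediate_size)

max_diff = max(abs(a - b)
               for row_a, row_b in zip(separate, fused)
               for a, b in zip(row_a, row_b))
print("max abs difference:", max_diff)
```

In PyTorch terms this would roughly correspond to a single `nn.Linear(hidden_size, 2 * intermediate_size)` followed by `chunk(2, dim=-1)`. The math is unchanged; the potential benefit is one larger matmul in place of two smaller ones.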

gugarosa changed discussion status to closed
gugarosa changed discussion status to open
