
graph : avoid huge warm-up graphs for MoE models #14753

Merged: 2 commits merged into master from gg/context-reduce-min-nodes on Jul 18, 2025

Conversation

ggerganov (Member)

Just hot-loading the experts for the matrix multiplications is enough to warm up the caches. There is no need to add extra GGML_OP_ADD nodes for aggregating the results.
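
To see why those aggregation nodes dominate the warm-up graph for MoE models, here is a back-of-the-envelope node count. This is a plain C++ sketch with hypothetical values (n_layer, n_expert, and core_nodes_per_layer are made-up numbers for illustration, not taken from the PR):

```cpp
#include <cstdio>

int main() {
    // Hypothetical MoE dimensions (illustrative only, not from the PR).
    const int n_layer  = 60;   // transformer layers
    const int n_expert = 256;  // experts per layer; warmup selects all of them

    // Roughly: a fixed number of "core" nodes per layer, plus (previously)
    // on the order of one GGML_OP_ADD node per expert to aggregate results.
    const int core_nodes_per_layer = 32; // rough placeholder

    const int nodes_before = n_layer * (core_nodes_per_layer + n_expert);
    const int nodes_after  = n_layer * core_nodes_per_layer;

    std::printf("warm-up graph nodes, before: ~%d\n", nodes_before); // ~17280
    std::printf("warm-up graph nodes, after:  ~%d\n", nodes_after);  // ~1920
    return 0;
}
```

With all experts selected during warmup, the per-expert ADD nodes scale with n_expert per layer, which is what blows the graph up well past the usual node budget.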

ggerganov force-pushed the gg/context-reduce-min-nodes branch from 4feb0bf to 4c1bacb on July 18, 2025 08:47
ggerganov requested a review from slaren on July 18, 2025 08:48
ggerganov force-pushed the gg/context-reduce-min-nodes branch from 4c1bacb to 033b306 on July 18, 2025 08:55
@@ -1312,7 +1312,7 @@ uint32_t llama_context::output_reserve(int32_t n_outputs) {
 //

 uint32_t llama_context::graph_max_nodes() const {
-    return std::max<uint32_t>(65536u, 5u*model.n_tensors());
+    return std::max<uint32_t>(1024u, 6u*model.n_tensors());
ggerganov (Member Author) commented on the diff:

We should probably bump this up to 8u*model.n_tensors() just to be safe.
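
For a sense of scale, here is a small worked example of the node budget before and after this change, including the suggested 8u variant. The tensor count is a hypothetical value, not taken from any particular model:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical tensor count for a mid-sized model (illustrative only).
    const uint32_t n_tensors = 1200;

    const uint32_t before   = std::max<uint32_t>(65536u, 5u*n_tensors); // old: the 65536 floor dominates
    const uint32_t after    = std::max<uint32_t>(1024u,  6u*n_tensors); // new: scales with the model
    const uint32_t proposed = std::max<uint32_t>(1024u,  8u*n_tensors); // suggested safety margin

    std::printf("before:  %u\n", before);   // 65536
    std::printf("after:   %u\n", after);    // 7200
    std::printf("with 8u: %u\n", proposed); // 9600
    return 0;
}
```

The old formula effectively reserved at least 65536 nodes regardless of model size; the new one lets the budget track the actual tensor count once the warm-up graph no longer needs the per-expert aggregation nodes.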

slaren (Member) commented Jul 18, 2025

If I understand correctly, the motivation of the code removed here was to ensure that all weights are loaded into memory when using mmap on a NUMA system. This change would effectively revert #11571.

ggerganov (Member Author) commented Jul 18, 2025

I think the experts continue to be loaded, because when we run the ggml_mul_mat_id() calls during warmup, we use the large n_expert_used == hparams.n_expert instead of the original hparams.n_expert_used. So, for example, this call would still load all the experts into memory and perform the warmup:

ggml_tensor * up = build_lora_mm_id(up_exps, cur, selected_experts); // [n_ff, n_expert_used, n_tokens]
cb(up, "ffn_moe_up", il);

The change only removes the summation nodes that aggregate the results obtained for each expert. Those do not involve reading data from the model, but they contribute a large number of graph nodes.


For reference, here is the n_expert_used initialization:

n_expert (hparams.n_expert),
n_expert_used (cparams.warmup ? hparams.n_expert : hparams.n_expert_used),
freq_base (cparams.rope_freq_base),

Edit: fixed wording at the start for clarity
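
A minimal sketch of how that warmup override behaves, using simplified stand-in types rather than the actual llama.cpp structs (hparams_t, cparams_t, and effective_n_expert_used are hypothetical names for illustration):

```cpp
#include <cstdint>
#include <cstdio>

// Simplified stand-ins for the hparams/cparams fields quoted above.
struct hparams_t { uint32_t n_expert; uint32_t n_expert_used; };
struct cparams_t { bool warmup; };

// During warmup, every expert is treated as "used", so the subsequent
// ggml_mul_mat_id() calls read the weights of all experts, which is
// what faults every expert's pages into memory under mmap.
uint32_t effective_n_expert_used(const hparams_t & hp, const cparams_t & cp) {
    return cp.warmup ? hp.n_expert : hp.n_expert_used;
}

int main() {
    const hparams_t hp = { /*n_expert=*/256, /*n_expert_used=*/8 };
    std::printf("normal: %u experts\n", effective_n_expert_used(hp, { /*warmup=*/false })); // 8
    std::printf("warmup: %u experts\n", effective_n_expert_used(hp, { /*warmup=*/true }));  // 256
    return 0;
}
```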

ggerganov merged commit d498af3 into master on Jul 18, 2025 (47 checks passed)
ggerganov deleted the gg/context-reduce-min-nodes branch on July 18, 2025 11:31