
graph : avoid huge warm-up graphs for MoE models #14753

Merged: 2 commits merged into master from gg/context-reduce-min-nodes on Jul 18, 2025

Conversation

ggerganov (Member)

Just hot-loading the experts for the matrix multiplications is enough to warm up the caches. There is no need to add extra GGML_OP_ADD nodes for aggregating the results.
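
To see why those aggregation nodes dominate the warm-up graph for MoE models, here is a back-of-the-envelope node count. This is a plain C++ sketch with hypothetical values (n_layer, n_expert, and core_nodes_per_layer are made-up numbers for illustration, not taken from the PR):

```cpp
#include <cstdio>

int main() {
    // Hypothetical MoE dimensions (illustrative only, not from the PR).
    const int n_layer  = 60;   // transformer layers
    const int n_expert = 256;  // experts per layer; warmup selects all of them

    // Roughly: a fixed number of "core" nodes per layer, plus (previously)
    // on the order of one GGML_OP_ADD node per expert to aggregate results.
    const int core_nodes_per_layer = 32; // rough placeholder

    const int nodes_before = n_layer * (core_nodes_per_layer + n_expert);
    const int nodes_after  = n_layer * core_nodes_per_layer;

    std::printf("warm-up graph nodes, before: ~%d\n", nodes_before); // ~17280
    std::printf("warm-up graph nodes, after:  ~%d\n", nodes_after);  // ~1920
    return 0;
}
```

With all experts selected during warmup, the per-expert ADD nodes scale with n_expert per layer, which is what blows the graph up well past the usual node budget.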

ggerganov force-pushed the gg/context-reduce-min-nodes branch from 4feb0bf to 4c1bacb on July 18, 2025 08:47
ggerganov requested a review from slaren on July 18, 2025 08:48
ggerganov force-pushed the gg/context-reduce-min-nodes branch from 4c1bacb to 033b306 on July 18, 2025 08:55
@@ -1312,7 +1312,7 @@ uint32_t llama_context::output_reserve(int32_t n_outputs) {
 //

 uint32_t llama_context::graph_max_nodes() const {
-    return std::max<uint32_t>(65536u, 5u*model.n_tensors());
+    return std::max<uint32_t>(1024u, 6u*model.n_tensors());
ggerganov (Member Author) commented on the diff:

We should probably bump this up to 8u*model.n_tensors() just to be safe.
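
For a sense of scale, here is a small worked example of the node budget before and after this change, including the suggested 8u variant. The tensor count is a hypothetical value, not taken from any particular model:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical tensor count for a mid-sized model (illustrative only).
    const uint32_t n_tensors = 1200;

    const uint32_t before   = std::max<uint32_t>(65536u, 5u*n_tensors); // old: the 65536 floor dominates
    const uint32_t after    = std::max<uint32_t>(1024u,  6u*n_tensors); // new: scales with the model
    const uint32_t proposed = std::max<uint32_t>(1024u,  8u*n_tensors); // suggested safety margin

    std::printf("before:  %u\n", before);   // 65536
    std::printf("after:   %u\n", after);    // 7200
    std::printf("with 8u: %u\n", proposed); // 9600
    return 0;
}
```

The old formula effectively reserved at least 65536 nodes regardless of model size; the new one lets the budget track the actual tensor count once the warm-up graph no longer needs the per-expert aggregation nodes.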

slaren (Member) commented Jul 18, 2025

If I understand correctly, the motivation of the code removed here was to ensure that all weights are loaded into memory when using mmap on a NUMA system. This change would effectively revert #11571.

ggerganov (Member Author) commented Jul 18, 2025

I think the experts continue to be loaded, because when we run the ggml_mul_mat_id() calls during warmup, we use the large n_expert_used == hparams.n_expert instead of the original hparams.n_expert_used. So, for example, this call would still load all the experts into memory and perform the warmup:

ggml_tensor * up = build_lora_mm_id(up_exps, cur, selected_experts); // [n_ff, n_expert_used, n_tokens]
cb(up, "ffn_moe_up", il);

The change only removes the summation nodes that aggregate the results obtained for each expert. Those do not involve reading data from the model, but they contribute a large number of graph nodes.


For reference, here is the n_expert_used initialization:

n_expert (hparams.n_expert),
n_expert_used (cparams.warmup ? hparams.n_expert : hparams.n_expert_used),
freq_base (cparams.rope_freq_base),

Edit: fixed wording at the start for clarity
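
A minimal sketch of how that warmup override behaves, using simplified stand-in types rather than the actual llama.cpp structs (hparams_t, cparams_t, and effective_n_expert_used are hypothetical names for illustration):

```cpp
#include <cstdint>
#include <cstdio>

// Simplified stand-ins for the hparams/cparams fields quoted above.
struct hparams_t { uint32_t n_expert; uint32_t n_expert_used; };
struct cparams_t { bool warmup; };

// During warmup, every expert is treated as "used", so the subsequent
// ggml_mul_mat_id() calls read the weights of all experts, which is
// what faults every expert's pages into memory under mmap.
uint32_t effective_n_expert_used(const hparams_t & hp, const cparams_t & cp) {
    return cp.warmup ? hp.n_expert : hp.n_expert_used;
}

int main() {
    const hparams_t hp = { /*n_expert=*/256, /*n_expert_used=*/8 };
    std::printf("normal: %u experts\n", effective_n_expert_used(hp, { /*warmup=*/false })); // 8
    std::printf("warmup: %u experts\n", effective_n_expert_used(hp, { /*warmup=*/true }));  // 256
    return 0;
}
```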

ggerganov merged commit d498af3 into master on Jul 18, 2025 (47 checks passed)
ggerganov deleted the gg/context-reduce-min-nodes branch on July 18, 2025 11:31