Model: Add support for Ernie 4.5 MoE #14658
Conversation
All right, I made a Q4_0 model and got a coherent response, so I guess this somewhat works. I'm upgrading this from draft status, maybe someone can take a look.
A sample quant for this model has been uploaded here: https://huggingface.co/ilintar/ERNIE-4.5-21B-A3B-PT-gguf
@CISC All right, I think that should be all the fixes.
I'm testing this branch; while testing speculative decoding, it seems to have caused a regression loading the dense 300M model. Logs from loading the dense model only: […]
It seems to complain about missing tensors, which doesn't happen on master.
Yep, it's broken for all dense models right now, will suggest a fix. :)
@CISC Thanks :) yeah, would love to, I'm quantizing the small one with the qwen imatrix calibration data from Bartowski, but I don't have a machine to fit the large one.
Also, it would probably be a good idea to incorporate the vision models somehow, but I'm not sure I'll be able to handle that one myself :)
My apologies for this slightly off-topic post, but with the introduction of Ernie 4.5 a new quantization algorithm was also introduced, with supposedly SOTA performance at 2-bit. Is that something that will also be incorporated into llama.cpp?
Dropping this in here for future reference (this is the only reference implementation of the VL part so far, from what I can tell): https://github.com/PaddlePaddle/FastDeploy/tree/develop/fastdeploy/model_executor/models/ernie4_5_vl
I noticed (because I wanted to try this branch) that you are trying to merge from "master" of your fork into "master" of ggml-org/llama.cpp. Is this accepted practice, or is creating a separate branch a requirement for merging into llama.cpp?
It's acceptable, but not recommended.
Yeah, it's generally not a great idea, because if there are conflicts and you have to merge upstream changes then you have no master branch locally to easily pull them to. I just realized too late that I forgot to make a branch :>
I did try https://huggingface.co/ilintar/ERNIE-4.5-21B-A3B-PT-gguf/blob/main/baidu-ERNIE-4.5-21B-A3B-PT-iq3_M.gguf on my RTX 3060 12GB with CUDA 12.9. Something is wrong with the beginning-of-sentence (BOS) or end-of-sentence (EOS) tokens, but not always. For some reason the default tokenizer_config.json holds a Jinja template that sets the cls token. I suppose Baidu has an app or downstream application that can make use of that somehow, but maybe for llama.cpp we can add a default template that works out of the box and filters those out. I would need to do some experiments to fix this and I don't have time for that in the coming days, unfortunately. Over and out for now.
@ThiloteE It looks like this is an error in the model config; they have not put the […]
There's not much we can do here though, this needs to be fixed by Baidu. The […]
What problems might this cause?
Incorrect tokenization and an incorrect BOS/CLS and/or EOS/SEP will cause the model to respond differently, quite often badly, to prompts. Ernie uses SPM tokenization, which means it will add a BOS (<s>) […]
In effect this means that you have to use […]
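For readers following along, here is a minimal sketch of the double-BOS effect being described, using a toy stand-in tokenizer rather than llama.cpp's real tokenizer API:

```python
# Toy illustration only: "spm_tokenize" is a stand-in, not llama.cpp's tokenizer.
BOS = "<s>"

def spm_tokenize(text, add_bos=True):
    # SPM-style tokenization prepends BOS itself when add_bos is enabled.
    pieces = text.split()
    return ([BOS] if add_bos else []) + pieces

# If the chat template has already rendered "<s>" into the prompt text...
prompt_from_template = "<s> User: hello"
print(spm_tokenize(prompt_from_template))
# ['<s>', '<s>', 'User:', 'hello']  <- BOS ends up in the prompt twice
```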
@pwilkin I just tested the 300B model on the latest commit. It unfortunately fails to load due to the missing tensor 'blk.3.ffn_gate_shexp.weight'. Do you have any idea how to fix this? This error occurs before llama.cpp even loads the model into memory, so it should be no problem to reproduce it on your side even if you don't have the resources necessary to actually load it. If there is anything I can do to help you debug this, please let me know. Here is the full log:

root@AI:/apool/llama.cpp/build/bin# ./llama-cli -m /bpool/ERNIE-4.5-300B-A47B-PT.gguf -ngl 0 -c 7000
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 5937 (075ffdcd) with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23686 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) - 23689 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 591 tensors from /bpool/ERNIE-4.5-300B-A47B-PT.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = ernie4_5-moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = ERNIE 4.5 300B A47B PT
llama_model_loader: - kv 3: general.finetune str = PT
llama_model_loader: - kv 4: general.basename str = ERNIE-4.5
llama_model_loader: - kv 5: general.size_label str = 300B-A47B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.tags arr[str,2] = ["ERNIE4.5", "text-generation"]
llama_model_loader: - kv 8: general.languages arr[str,2] = ["en", "zh"]
llama_model_loader: - kv 9: ernie4_5-moe.block_count u32 = 54
llama_model_loader: - kv 10: ernie4_5-moe.context_length u32 = 131072
llama_model_loader: - kv 11: ernie4_5-moe.embedding_length u32 = 8192
llama_model_loader: - kv 12: ernie4_5-moe.feed_forward_length u32 = 28672
llama_model_loader: - kv 13: ernie4_5-moe.attention.head_count u32 = 64
llama_model_loader: - kv 14: ernie4_5-moe.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: ernie4_5-moe.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: ernie4_5-moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 1
llama_model_loader: - kv 18: ernie4_5-moe.expert_count u32 = 64
llama_model_loader: - kv 19: ernie4_5-moe.expert_used_count u32 = 8
llama_model_loader: - kv 20: ernie4_5-moe.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 21: ernie4_5-moe.leading_dense_block_count u32 = 3
llama_model_loader: - kv 22: ernie4_5-moe.expert_feed_forward_length u32 = 3584
llama_model_loader: - kv 23: ernie4_5-moe.expert_shared_feed_forward_length u32 = 3584
llama_model_loader: - kv 24: general.quantization_version u32 = 2
llama_model_loader: - kv 25: tokenizer.ggml.model str = llama
llama_model_loader: - kv 26: tokenizer.ggml.pre str = default
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,103424] = ["<unk>", "<s>", "</s>", "0", "1", "2...
llama_model_loader: - kv 28: tokenizer.ggml.scores arr[f32,103424] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,103424] = [2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if not add_generation_prompt is d...
llama_model_loader: - type f32: 211 tensors
llama_model_loader: - type f16: 380 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 557.88 GiB (16.00 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1012
load: token to piece cache size = 0.5907 MB
print_info: arch = ernie4_5-moe
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 54
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 28672
print_info: n_expert = 64
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 300B.A47B
print_info: model params = 299.48 B
print_info: general.name = ERNIE 4.5 300B A47B PT
print_info: vocab type = SPM
print_info: n_vocab = 103424
print_info: n_merges = 0
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 0 '<unk>'
print_info: PAD token = 0 '<unk>'
print_info: LF token = 23 '<0x0A>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
llama_model_load: error loading model: missing tensor 'blk.3.ffn_gate_shexp.weight'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/bpool/ERNIE-4.5-300B-A47B-PT.gguf'
main: error: unable to load model

In case it helps, here is the […]
LOL, the timing! :D
@pwilkin The fix seems simple, just check […]
Oh, and […]
@pwilkin Looking closer at it, I think things are a little more broken, but we can address that when you make the follow-up PR.
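For readers following along: the exact condition suggested above was lost in extraction, but the missing-tensor error generally points at the loader requesting shared-expert tensors a checkpoint does not have. A hedged sketch of that kind of check, written as illustrative Python rather than the actual C++ loader, with the hyperparameter names taken from the GGUF metadata in the log above:

```python
# Illustrative sketch only: which FFN tensors a layer should provide, given the
# GGUF hyperparameters visible in the log (not the real llama.cpp loader code).
def expected_ffn_tensors(layer, n_leading_dense, moe_step, n_ff_shexp):
    is_moe = layer >= n_leading_dense and (layer - n_leading_dense) % moe_step == 0
    if not is_moe:
        # Dense layers use the plain gate/down/up projections.
        return [f"blk.{layer}.ffn_gate.weight",
                f"blk.{layer}.ffn_down.weight",
                f"blk.{layer}.ffn_up.weight"]
    tensors = [f"blk.{layer}.ffn_gate_inp.weight",   # router
               f"blk.{layer}.ffn_gate_exps.weight",
               f"blk.{layer}.ffn_down_exps.weight",
               f"blk.{layer}.ffn_up_exps.weight"]
    # Only expect shared-expert tensors if the model actually defines them,
    # e.g. when the shared-expert feed-forward length is non-zero.
    if n_ff_shexp > 0:
        tensors += [f"blk.{layer}.ffn_gate_shexp.weight",
                    f"blk.{layer}.ffn_down_shexp.weight",
                    f"blk.{layer}.ffn_up_shexp.weight"]
    return tensors

# With leading_dense_block_count = 3 and interleave_moe_layer_step = 1, layer 3
# is the first MoE layer; if the 300B checkpoint has no shared experts, the
# blk.3.ffn_gate_shexp.weight tensor should never be requested.
print(expected_ffn_tensors(3, n_leading_dense=3, moe_step=1, n_ff_shexp=0))
```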
There's one more difference in the "big" MoE: "moe_gate": "topk". I guess this refers to: […]
No, that would be the […]. I think […] (lines 827 to 828 in 760b448).
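As background on the terminology, here is a minimal sketch of plain top-k gating, written in illustrative numpy rather than llama.cpp code; the 64 experts and k = 8 below simply mirror the n_expert / n_expert_used values printed earlier:

```python
import numpy as np

def topk_gate(router_logits, k):
    """Plain top-k gating: softmax the router logits, keep the k best experts
    per token, and renormalize their weights. Illustrative only."""
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    topk_idx = np.argsort(-probs, axis=-1)[..., :k]
    topk_w = np.take_along_axis(probs, topk_idx, axis=-1)
    topk_w /= topk_w.sum(axis=-1, keepdims=True)  # renormalize over selected experts
    return topk_idx, topk_w

# One token routed over 64 experts, selecting 8 of them.
idx, w = topk_gate(np.random.randn(1, 64), k=8)
```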
Hi, thanks for the amazing work! Q: Are there any 300B GGUFs up on HF? :P
Well, this got relevant pretty quickly. I tried to get to work on the VL model. Actually, getting the projector converted wasn't that hard. But the normal MoE... it turns out the VL model (the 28B one) uses something they call a "top2 gate": https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-PT/blob/main/modeling_ernie_45t_vl.py

In the config, the […]

But here's where my competence ends - I could of course create a new tensor type to store the "other" experts, and even write some logic for storing the weight and weight_1 tensors together and then decoupling them on execution, but I have no clue how to implement this whole "top2 gate" algorithm. Would love some help with this, or some pointers at least (I don't even understand why there are two different feed-forward lengths for the two different tensor types).
I think they actually mean multimodel, as in that's why you have 2 values, one for each model baked into the same tensor. You can just ignore top2 gating for now; topk probably works fine. Anyway, for future reference, it is described in the GShard paper.
You mean just ignore the second layer of tensors? I guess that would just be the 21B-A3B model with a projector then 😄
No, I just meant top2 vs topk should not cause much issue. I suspect you will have to read that GShard paper for more info on what's going on with the layers, but it looks like they are combining results from both somehow. Oh, and we have trailing dense layers! :)
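For reference, a rough sketch of GShard-style top-2 gating, under the assumption that this is what the VL model's "top2 gate" refers to; illustrative numpy, not the reference implementation, and the capacity limits and auxiliary losses from the paper are omitted:

```python
import numpy as np

def top2_gate(router_logits):
    """Simplified GShard-style top-2 gating: each token goes to its two
    highest-scoring experts, whose outputs are combined with normalized
    gate weights."""
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    order = np.argsort(-probs, axis=-1)
    e1, e2 = order[..., 0], order[..., 1]
    w1 = np.take_along_axis(probs, e1[..., None], axis=-1)[..., 0]
    w2 = np.take_along_axis(probs, e2[..., None], axis=-1)[..., 0]
    norm = w1 + w2
    # Per token: y = (w1/norm) * expert[e1](x) + (w2/norm) * expert[e2](x)
    return (e1, w1 / norm), (e2, w2 / norm)
```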
[x] I have no idea what I'm doing
This is my first attempt at adding a new arch and I am very much out of my depth, so I would really appreciate it if someone took a look at it and verified whether it even makes any sense. I basically asked Gemini to make a patch based on the existing vLLM / Chatllm.cpp implementations, then tackled some of the conversion logic myself so that it actually generates a GGUF file with all the layers.
Would close #14465
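For readers curious what "the conversion logic" typically involves, here is a hedged sketch of the usual step of stacking per-expert 2D weights into the single 3D tensor that llama.cpp's *_exps tensors expect. The HF-side key pattern below is a hypothetical placeholder and may not match Ernie's actual checkpoint layout:

```python
import numpy as np

def stack_experts(state_dict, layer, n_expert, proj="up_proj"):
    """Stack the per-expert weights for one projection into a single
    [n_expert, ...] tensor, the layout used by blk.{layer}.ffn_*_exps.weight.
    The source key pattern below is an assumed placeholder, not Ernie's
    confirmed naming."""
    parts = [state_dict[f"model.layers.{layer}.mlp.experts.{e}.{proj}.weight"]
             for e in range(n_expert)]
    return np.stack(parts, axis=0)
```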