
Update Llama.cpp Submodule to #9fb13f #1007


Closed · wants to merge 2 commits

Conversation


@AuLaSW AuLaSW commented Dec 13, 2023

This pull request is small and simple: it updates the Llama.cpp submodule to #9fb13f. The submodule has advanced far enough to support MoE models (such as the new Mixtral-8x7B-v0.1 that came out yesterday). I have tested this on WSL, and it works with TheBloke's quantized version of that model.
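
For anyone who wants to reproduce the test: after rebuilding the package against the updated submodule, something along these lines should work. The model path is just a placeholder for whichever quantized Mixtral GGUF you downloaded, so treat the exact file name as an assumption.

```python
from llama_cpp import Llama

# Placeholder path: any quantized Mixtral-8x7B-v0.1 GGUF (e.g. TheBloke's Q5_K_M) should work.
llm = Llama(model_path="./mixtral-8x7b-v0.1.Q5_K_M.gguf", n_ctx=512)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```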

AuLaSW and others added 2 commits December 13, 2023 09:26
The latest llama.cpp commit (#799a1cb) adds support for MoE models. This
update points the connector at the new llama.cpp files so that MoE models
(such as Mixtral-8x7B-v0.1) can be used.
Update the Llama.cpp submodule to include commit #799a1cb, which expands Llama.cpp to support MoE models such as Mixtral-8x7B-v0.1.
@shell-skrimp

Hi @AuLaSW. I went through and tested your PR, and it seems to work fine. I used mixtral-8x7b-Q5_K_M.gguf; the output is below. I also tested llama2 and mistral, and they worked fine.

llama_model_loader: loaded meta data with 24 key-value pairs and 995 tensors from mixtral-8x7b-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:          blk.0.ffn_gate.0.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    2:          blk.0.ffn_down.0.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_up.0.weight q5_K     [  4096, 14336,     1,     1 ]

llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:                         llama.expert_count u32              = 8
llama_model_loader: - kv  10:                    llama.expert_used_count u32              = 2
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:                          general.file_type u32              = 17
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  913 tensors
llama_model_loader: - type q6_K:   17 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 8
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q5_K - Medium
llm_load_print_meta: model params     = 46.70 B
llm_load_print_meta: model size       = 29.93 GiB (5.50 BPW)
llm_load_print_meta: general.name     = .
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.39 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 21128.36 MiB
llm_load_tensors: offloading 10 repeating layers to GPU
llm_load_tensors: offloaded 10/33 layers to GPU
llm_load_tensors: VRAM used: 9518.71 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 117.85 MiB
llama_new_context_with_model: VRAM scratch buffer: 114.54 MiB
llama_new_context_with_model: total VRAM used: 9633.25 MiB (model: 9518.71 MiB, context: 114.54 MiB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

llama_print_timings:        load time =    3605.30 ms
llama_print_timings:      sample time =      31.00 ms /   139 runs   (    0.22 ms per token,  4484.45 tokens per second)
llama_print_timings: prompt eval time =    3605.24 ms /    22 tokens (  163.87 ms per token,     6.10 tokens per second)
llama_print_timings:        eval time =   26828.24 ms /   138 runs   (  194.41 ms per token,     5.14 tokens per second)
llama_print_timings:       total time =   30731.81 ms

I'm looking at getting one of these soon and I have two questions that need to be answered. First how much power does the stock turbo make on the supra, is it true it has less boost then my 90 celica gt (12psi). Secondly what are some good bolt-on mods for this car. What kind of power can I get from a full exhaust and maybe a chip? How about an intercooler?

Thanks in advance.

------------------
Glenn
90 Celica gt, 15psi, HKS cams, B&M FPR, Full Exhaust
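
For reference, the run above maps roughly onto the Python API like this. The n_ctx and n_gpu_layers values match what the log shows (512-token context, 10/33 layers offloaded), but the model path and prompt are placeholders, not the exact invocation I used:

```python
from llama_cpp import Llama

# Settings mirror the log above: 512-token context, 10 of 33 layers offloaded to CUDA.
llm = Llama(
    model_path="mixtral-8x7b-Q5_K_M.gguf",  # file name taken from the loader output
    n_ctx=512,
    n_gpu_layers=10,
)

# Placeholder prompt; any short completion exercises the MoE routing path.
result = llm("Ask a car forum about bolt-on mods for a turbocharged Supra.", max_tokens=128)
print(result["choices"][0]["text"])
```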


AuLaSW commented Dec 13, 2023

Does this also cover #1000? I read through it and it seems like it would.

@shell-skrimp

I believe it does

@pabl-o-ce
Contributor

I love it!

@mclassen

Please merge it! 🙏


abetlen commented Dec 14, 2023

@AuLaSW thank you for this! I've merged the latest llama.cpp release into main and published a new release (v0.2.23) to PyPI.
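
To pick it up: `pip install --upgrade llama-cpp-python`, then a quick sanity check that the new version is what got installed (nothing here is specific to this PR, just the standard metadata lookup):

```python
from importlib.metadata import version

# MoE support from the updated submodule ships in llama-cpp-python v0.2.23 and later.
print(version("llama-cpp-python"))
```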

@abetlen abetlen closed this Dec 14, 2023
@shell-skrimp

Tested new release, seems good.
