Description
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
I am loading a GGUF model and running inference in either of two ways.

Case 1, the high-level call:

```python
response = self.llm(
    prompt,
    max_tokens=10,
    stop=["</s>", "\n"],
    echo=False
)
generated_text = response['choices'][0]['text'].strip()
```

A simpler form of the same call, `response = self.llm(prompt, max_tokens=t, echo=True)`, fails the same way.

Case 2, manual tokenization followed by an encode:

```python
tokens = self.llm.tokenize(prompt.encode('utf-8'))
self.llm._model.encode(tokens)
```
Either path should return a generated response.
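For completeness, a self-contained version of the snippets above; the model path is a placeholder for a local copy of the quantized model linked under Steps to Reproduce:

```python
from llama_cpp import Llama

# Placeholder path; any T5-family GGUF (e.g. flan-t5-base-Q4_K_M) reproduces it.
llm = Llama(model_path="./flan-t5-base-q4_k_m.gguf", n_ctx=256)

prompt = 'Tell me "hello world"!'

# Case 1: high-level call -- expected to return a completion.
response = llm(prompt, max_tokens=10, stop=["</s>", "\n"], echo=False)
print(response['choices'][0]['text'].strip())

# Case 2: manual tokenization plus encode -- expected to run the encoder.
tokens = llm.tokenize(prompt.encode('utf-8'))
llm._model.encode(tokens)
```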
Current Behavior
llama-cpp-python raises

GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first")

in the first case, and

'LlamaModel' object has no attribute 'encode'

in the second, even though the GGUF model itself loads successfully. So no response is ever returned, only errors.
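The model metadata in the Failure Logs below shows 'general.architecture': 't5', i.e. an encoder-decoder model, while the high-level generation path apparently drives only the decoder, which is presumably why the assert demands a prior llama_encode() call. A quick way to confirm what was loaded, using the metadata dict that llama-cpp-python populates at load time (the same dict the "Model metadata:" log line prints):

```python
# 'general.architecture' comes straight from the GGUF header;
# for this model it is 't5', an encoder-decoder architecture.
print(llm.metadata.get("general.architecture"))  # -> 't5'
```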
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
- macOS on Apple M3; Metal disabled, CPU-only, so the issue reproduces on other machines (note: lscpu does not exist on macOS, so uname output is given instead)
$ uname -rvm
23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6030 arm64
- SDK versions:
Python 3.10.8
GNU Make 3.81
Apple clang version 16.0.0 (clang-1600.0.26.6)
Failure Information (for bugs)
Case 1: GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first")
Case 2: 'LlamaModel' object has no attribute 'encode'
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
- Install any version of the library, even the latest 0.3.12.
- Launch a script that initializes a GGUF model, for example https://huggingface.co/fareshzm/flan-t5-base-Q4_K_M-GGUF, and try to run inference on it (a self-contained script is sketched under Expected Behavior above).
- Get the error and become upset that you can't run inference on your model.
The issue occurs only with llama-cpp-python; with plain llama.cpp the same model works fine. A hedged sketch of what calling the encoder explicitly might look like follows below.
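For reference, upstream llama.cpp runs llama_encode() over the prompt before any llama_decode() for T5 models, which is exactly what the failing assert demands. Below is a rough sketch of doing the same through the low-level ctypes bindings. It assumes llama_cpp.llama_encode and llama_cpp.llama_batch_get_one are exposed by the installed version (they mirror llama.h, and llama_batch_get_one's signature has changed across releases), so treat it as a direction rather than a verified workaround:

```python
import llama_cpp
from llama_cpp import Llama

llm = Llama(model_path="./flan-t5-base-q4_k_m.gguf", n_ctx=256)
tokens = llm.tokenize(b'Tell me "hello world"!')

# Build a single-sequence batch over the prompt tokens. Recent llama.h
# takes (tokens, n_tokens); older versions also took pos_0 and seq_id.
tok_array = (llama_cpp.llama_token * len(tokens))(*tokens)
batch = llama_cpp.llama_batch_get_one(tok_array, len(tokens))

# Run the encoder so that n_outputs_enc > 0 before any decoding starts.
ret = llama_cpp.llama_encode(llm.ctx, batch)
assert ret == 0, "llama_encode failed"

# Decoding would then begin from 't5.decoder_start_token_id' (0 in the
# metadata above); the high-level API apparently has no hook for this today.
```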
Failure Logs
...
11:44:48 AM [ai-service] AI Service: llama_new_context_with_model: n_ctx = 256
llama_new_context_with_model: n_batch = 256
llama_new_context_with_model: n_ubatch = 256
llama_new_context_with_model: flash_attn = 0
11:44:48 AM [ai-service] AI Service: llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
11:44:48 AM [ai-service] AI Service: llama_kv_cache_init: CPU KV buffer size = 9.00 MiB
llama_new_context_with_model: KV self size = 9.00 MiB, K (f16): 4.50 MiB, V (f16): 4.50 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
11:44:48 AM [ai-service] AI Service: llama_new_context_with_model: CPU compute buffer size = 13.25 MiB
llama_new_context_with_model: graph nodes = 425
llama_new_context_with_model: graph splits = 193
11:44:48 AM [ai-service] AI Service: AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
11:44:48 AM [ai-service] AI Service: Model metadata: {'tokenizer.ggml.add_bos_token': 'false', 'tokenizer.ggml.unknown_token_id': '2', 'tokenizer.ggml.pre': 'default', 'tokenizer.ggml.model': 't5', 'tokenizer.ggml.eos_token_id': '1', 'general.architecture': 't5', 'tokenizer.ggml.add_space_prefix': 'true', 't5.attention.layer_norm_rms_epsilon': '0.000001', 'tokenizer.ggml.add_eos_token': 'true', 't5.attention.relative_buckets_count': '32', 't5.attention.layer_norm_epsilon': '0.000001', 'tokenizer.ggml.padding_token_id': '0', 't5.attention.value_length': '64', 't5.block_count': '12', 't5.attention.head_count': '12', 'general.file_type': '15', 't5.embedding_length': '768', 'general.size_label': '248M', 't5.context_length': '512', 'general.quantization_version': '2', 'general.license': 'apache-2.0', 't5.attention.key_length': '64', 't5.feed_forward_length': '2048', 't5.decoder_start_token_id': '0', 'general.type': 'model', 'tokenizer.ggml.remove_extra_whitespaces': 'true', 'general.name': 'Flan T5 Base'}
11:44:48 AM [ai-service] AI Service: Using fallback chat format: llama-2
INFO:creative_connections:✅ FLAN-T5 model loaded successfully!
11:44:48 AM [ai-service] AI Service: DEBUG:creative_connections:LLM object after loading: <llama_cpp.llama.Llama object at 0x14fae9f60>
DEBUG:creative_connections:LLM state after _ensure_llm_loaded: <llama_cpp.llama.Llama object at 0x14fae9f60>
DEBUG:creative_connections:Checking LLM state after _ensure_llm_loaded: <llama_cpp.llama.Llama object at 0x14fae9f60>
INFO:creative_connections:Attempting FLAN-T5 analysis...
DEBUG:creative_connections:→ Entered _ensure_llm_loaded()
11:44:48 AM [ai-service] AI Service: DEBUG:creative_connections:LLM already initialized: self.llm = <llama_cpp.llama.Llama object at 0x14fae9f60>
INFO:creative_connections:Calling FLAN-T5 with prompt: Tell me "hello world"!
11:44:48 AM [ai-service] AI Service: /private/var/folders/hl/zs0vv1rj0pb3jp28__qcvkn80000gq/T/pip-install-gng95xkt/llama-cpp-python_7fb0bb26e01a4212ae72ffffb70d14e9/vendor/llama.cpp/src/llama.cpp:15122: GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed