
Inferencing Flan-T5 - GGML_ASSERT error #2038

@railesDev

Description


Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I am loading a GGUF model and running inference in one of two ways:

# Case 1: plain high-level call
response = self.llm(prompt, max_tokens=t, echo=True)

# Case 2: tokenize, try to run the encoder manually, then generate
tokens = self.llm.tokenize(prompt.encode('utf-8'))
self.llm._model.encode(tokens)
response = self.llm(
    prompt,
    max_tokens=10,
    stop=["</s>", "\n"],
    echo=False,
)
generated_text = response['choices'][0]['text'].strip()

Either call should return a generated response.
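
For reference, a successful call returns an OpenAI-style completion dict. A minimal sketch of the shape I expect back (all values below are illustrative, not real output):

expected = {
    "object": "text_completion",
    "choices": [
        {"index": 0, "text": "hello world", "logprobs": None, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 8, "completion_tokens": 2, "total_tokens": 10},
}
print(expected["choices"][0]["text"].strip())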

Current Behavior

llama-cpp-python raises

GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first")

in the first case and

'LlamaModel' object has no attribute 'encode'

in the second case, even though the GGUF model itself loads successfully. No response is ever returned, only errors.
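
For context: T5 is an encoder-decoder architecture, and the assert message itself says the prompt must go through llama_encode() before any llama_decode() call. The high-level Llama wrapper apparently only issues decode calls, and the internal LlamaModel wrapper has no encode method (in llama.cpp, encoding is a context-level operation, not a model-level one), which explains both errors. Below is a hedged sketch of what I believe the missing encoder pass would look like through the low-level ctypes bindings that llama-cpp-python ships; the names mirror llama.h, but exact signatures may differ between versions, and the file path is just an example:

import llama_cpp

llama_cpp.llama_backend_init()

mparams = llama_cpp.llama_model_default_params()
model = llama_cpp.llama_load_model_from_file(b"flan-t5-base-q4_k_m.gguf", mparams)
cparams = llama_cpp.llama_context_default_params()
ctx = llama_cpp.llama_new_context_with_model(model, cparams)

# Tokenize the prompt with the model's own tokenizer.
text = b'Tell me "hello world"!'
buf = (llama_cpp.llama_token * 512)()
n_tokens = llama_cpp.llama_tokenize(model, text, len(text), buf, 512, True, False)

# Put the prompt into a batch and run the *encoder* pass first;
# this is the step the assert says is missing.
batch = llama_cpp.llama_batch_init(n_tokens, 0, 1)
batch.n_tokens = n_tokens
for i in range(n_tokens):
    batch.token[i] = buf[i]
    batch.pos[i] = i
    batch.n_seq_id[i] = 1
    batch.seq_id[i][0] = 0
    batch.logits[i] = False
llama_cpp.llama_encode(ctx, batch)  # should satisfy GGML_ASSERT(n_outputs_enc > 0)

# Generation would then feed the decoder start token through
# llama_cpp.llama_decode(ctx, ...) step by step (omitted here).
llama_cpp.llama_batch_free(batch)

Even if something like this works at the low level, the point of this report is that the high-level API cannot be used with T5 models at all.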

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • macOS, Apple M3; Metal disabled (CPU-only) so the issue reproduces on other machines

$ uname -rvm  # macOS has no lscpu

23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6030 arm64

  • SDK versions:
Python 3.10.8
GNU Make 3.81
Apple clang version 16.0.0 (clang-1600.0.26.6)

Failure Information (for bugs)

GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first")
'LlamaModel' object has no attribute 'encode'

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Install any version of the library, even 0.3.12.
  2. Launch a script that initializes a GGUF model, for example https://huggingface.co/fareshzm/flan-t5-base-Q4_K_M-GGUF, and try to run inference (a minimal script is sketched below).
  3. Get the error and become upset that you can't run inference on your model.
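
A minimal self-contained reproduction, assuming the GGUF file from the link above has been downloaded locally (the filename is an assumption):

from llama_cpp import Llama

llm = Llama(model_path="./flan-t5-base-q4_k_m.gguf", n_ctx=256)
# The process aborts here with the GGML_ASSERT instead of returning:
response = llm('Tell me "hello world"!', max_tokens=10)
print(response["choices"][0]["text"])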

The issue occurs only with llama-cpp-python; with llama.cpp itself there is no problem.
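
For comparison, the same GGUF runs through llama.cpp's own CLI (the binary and flag names below are from memory and may vary by build):

$ ./llama-cli -m flan-t5-base-q4_k_m.gguf -p 'Tell me "hello world"!'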

Failure Logs

...
11:44:48 AM [ai-service] AI Service: llama_new_context_with_model: n_ctx      = 256
llama_new_context_with_model: n_batch    = 256
llama_new_context_with_model: n_ubatch   = 256
llama_new_context_with_model: flash_attn = 0
11:44:48 AM [ai-service] AI Service: llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
11:44:48 AM [ai-service] AI Service: llama_kv_cache_init:        CPU KV buffer size =     9.00 MiB
llama_new_context_with_model: KV self size  =    9.00 MiB, K (f16):    4.50 MiB, V (f16):    4.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
11:44:48 AM [ai-service] AI Service: llama_new_context_with_model:        CPU compute buffer size =    13.25 MiB
llama_new_context_with_model: graph nodes  = 425
llama_new_context_with_model: graph splits = 193
11:44:48 AM [ai-service] AI Service: AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
11:44:48 AM [ai-service] AI Service: Model metadata: {'tokenizer.ggml.add_bos_token': 'false', 'tokenizer.ggml.unknown_token_id': '2', 'tokenizer.ggml.pre': 'default', 'tokenizer.ggml.model': 't5', 'tokenizer.ggml.eos_token_id': '1', 'general.architecture': 't5', 'tokenizer.ggml.add_space_prefix': 'true', 't5.attention.layer_norm_rms_epsilon': '0.000001', 'tokenizer.ggml.add_eos_token': 'true', 't5.attention.relative_buckets_count': '32', 't5.attention.layer_norm_epsilon': '0.000001', 'tokenizer.ggml.padding_token_id': '0', 't5.attention.value_length': '64', 't5.block_count': '12', 't5.attention.head_count': '12', 'general.file_type': '15', 't5.embedding_length': '768', 'general.size_label': '248M', 't5.context_length': '512', 'general.quantization_version': '2', 'general.license': 'apache-2.0', 't5.attention.key_length': '64', 't5.feed_forward_length': '2048', 't5.decoder_start_token_id': '0', 'general.type': 'model', 'tokenizer.ggml.remove_extra_whitespaces': 'true', 'general.name': 'Flan T5 Base'}
11:44:48 AM [ai-service] AI Service: Using fallback chat format: llama-2
INFO:creative_connections:✅ FLAN-T5 model loaded successfully!
11:44:48 AM [ai-service] AI Service: DEBUG:creative_connections:LLM object after loading: <llama_cpp.llama.Llama object at 0x14fae9f60>
DEBUG:creative_connections:LLM state after _ensure_llm_loaded: <llama_cpp.llama.Llama object at 0x14fae9f60>
DEBUG:creative_connections:Checking LLM state after _ensure_llm_loaded: <llama_cpp.llama.Llama object at 0x14fae9f60>
INFO:creative_connections:Attempting FLAN-T5 analysis...
DEBUG:creative_connections:→ Entered _ensure_llm_loaded()
11:44:48 AM [ai-service] AI Service: DEBUG:creative_connections:LLM already initialized: self.llm = <llama_cpp.llama.Llama object at 0x14fae9f60>
INFO:creative_connections:Calling FLAN-T5 with prompt: Tell me "hello world"!
11:44:48 AM [ai-service] AI Service: /private/var/folders/hl/zs0vv1rj0pb3jp28__qcvkn80000gq/T/pip-install-gng95xkt/llama-cpp-python_7fb0bb26e01a4212ae72ffffb70d14e9/vendor/llama.cpp/src/llama.cpp:15122: GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed
